Opened 13 years ago
Closed 12 years ago
#2116 closed patch (Fixed)
Application layer protocol for transferring RPC messages + utf8 decoding error
Reported by: | Bro | Owned by: |
---|---|---|---
Priority: | major | Milestone: | Future
Component: | Unknown | Version: | 1.3.5
Keywords: | | Cc: |
Description
Introduction
After upgrading my FreeBSD server I suddenly had problems getting the GTK client to show the torrent list (as well as the states and trackers info to the left). The CLI client wouldn't show the torrents with the info command. The symptoms were pretty much the same as in these bug reports: http://dev.deluge-torrent.org/ticket/1531 http://dev.deluge-torrent.org/ticket/2095 which I presume ran into the same bug.
Some of those who reported this problem have a lot of torrents (many hundreds), and I have about 1600 myself, so it looked like this would be a problem only for those with a lot of torrents.
What happens is that the client can see the daemon being online, but when connecting, it will fail to populate the torrent list. If the torrent list is successfully populated, it continues to work with no problems.
After a lot of testing, suspecting corruption of the state files, I eventually found the cause, which of course had nothing to do with the state files ;-)
The problem actually turned out to be twofold.
Problem one - Application layer protocol for transferring data
The first problem has to do with the application layer protocol, or the lack of such a protocol when it comes to transferring the data over the network. When RPC commands are sent over the network, they're usually successfully decompressed by zlib on the receiver, but not always. When transferring one command (say 20 bytes), with a big enough delay until the next command is sent, the 20 bytes are received on the other side, and the data is decompressed and decoded (by rencode).
When transferring multiple messages within a short time period, these messages might be received as one on the other side. So when transferring two messages at the same time, one of 20 bytes, and one of 40 bytes, the receiver might receive 60 bytes, and has no idea that it's actually two messages.
The current protocol trusts that the zlib library will successfully decompress compressed messages that are concatenated by TCP when transferred. Usually it will successfully decompress the first message, but as far as I've been able to test, subsequent messages aren't successfully decompressed.
Most messages are received separately, which is why this bug shows its ugly face only in rare cases.
The big problem is that in some cases, the torrent list being transferred isn't successfully decompressed, and thereby lost to the client.
The current deluge protocol will add newly received data onto the current buffer of old (but not successfully decompressed) data and then it tries to decompress the whole buffer. If this fails it will wait for more data.
When the torrent list is never populated, the client fails to decompress the data in the buffer even after the whole torrent list has been transferred. It just keeps appending incoming data onto the buffer, with the result that no messages are received at all.
I'm not sure if this is because the torrent list is received as the second message (i.e. it has been concatenated onto the data of another message on arrival), or if there is another reason the torrent list message fails to be decompressed.
As far as I can see, the only way to solve this is to use an application layer protocol that knows how to separate messages.
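To make the idea concrete, here is a minimal sketch of length-prefixed framing (my own illustration, not the actual DelugeTransferProtocol code from the patch):

```python
import struct
import zlib

# Sketch: prefix every compressed message with a 4-byte big-endian length
# header so the receiver can split concatenated TCP data back into messages.

def frame(message_bytes):
    compressed = zlib.compress(message_bytes)
    return struct.pack("!I", len(compressed)) + compressed

def unframe(buf):
    """Return (complete_messages, leftover_bytes)."""
    messages = []
    while len(buf) >= 4:
        (length,) = struct.unpack("!I", buf[:4])
        if len(buf) < 4 + length:
            break  # wait for the rest of this message to arrive
        messages.append(zlib.decompress(buf[4:4 + length]))
        buf = buf[4 + length:]
    return messages, buf
```

With a header like this the receiver never has to guess where one compressed message ends and the next begins.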
Problem two - Decoding error with rencode
After implementing a simple application layer protocol it turned out the torrent list still wouldn't load, because the rencode library fails to decode the transferred data with a
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe8 in position 54: invalid continuation byte
This problem was easily fixed by replacing rencode with 'pickle'.
Other advantages of pickle (or more precisely cPickle) over rencode are speed in particular, but also size (somewhat):
```
~/test$ ./pickle_test.py
Data size: 5125267 bytes
pickle dumps time: 0:00:00.939099
pickle dump size: 3470949
cPickle dumps time: 0:00:00.203606
cPickle dump size: 3426268
rencode dumps time: 0:00:00.448616
rencode dump size: 4033409
pickle loads time: 0:00:00.673550
cPickle loads time: 0:00:00.162311
rencode loads time: 0:00:00.690657
```
These are test results from the actual data of the torrent list transferred from my server (1600 torrents).
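pickle_test.py itself isn't attached; a comparison like this could be reproduced with something along these lines (the structure and stand-in data below are my assumptions, not the original script):

```python
#!/usr/bin/env python
# Hypothetical reconstruction of the comparison; the real run used the
# torrent-list data captured from the daemon instead of the stand-in below.
from datetime import datetime
import pickle, cPickle
import rencode  # e.g. the module bundled with Deluge

# Stand-in data; replace with the real torrent-list structure.
data = [{"name": u"torrent %d" % i, "tracker_status": u"Announce OK"}
        for i in xrange(1600)]

for name, mod in [("pickle", pickle), ("cPickle", cPickle), ("rencode", rencode)]:
    start = datetime.now()
    blob = mod.dumps(data)
    print "%s dumps time: %s" % (name, datetime.now() - start)
    print "%s dump size: %d" % (name, len(blob))
    start = datetime.now()
    mod.loads(blob)
    print "%s loads time: %s" % (name, datetime.now() - start)
```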
Patch
I've implemented a class DelugeTransferProtocol which fixes the issues I had, though I can't guarantee that it fixes the issues others have reported, or that it doesn't introduce other bugs ;-)
This patch will render the client unable to communicate with any unpatched daemons.
I've also implemented tests to verify sending and receiving of messages through the transfer protocol.
https://github.com/bendikro/deluge/commit/8e67c763ab7176bc16b871cac974eef74aaf324f
Change History (7)
comment:1 by , 13 years ago
comment:2 by , 13 years ago
Thank you for the effort in putting this together! You even included tests ;)
I do have some comments:
I am not quite sure why we would never be able to properly decompress the message after getting all of it into the buffer. But it definitely is a bit of a design flaw in terms of the computation required to constantly retry decompressing an incomplete message, especially if the messages are quite large. The message length header is reasonable and I think it should probably be committed into master. The code looks nice and clean with comments and documentation, so again, thank you for that!
The second problem with rencode is something that I would like to look into more. Moving to pickle isn't really an option due to the security implications of pickle - I'm sure you can do some googling to find out all the reasons, but it essentially boils down to pickle being able to serialize (mostly) any object type, including objects that can run malicious code when loaded. This was one of the primary reasons I chose to use rencode, since it is relatively quick and only functions on the basic types required.
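As a quick illustration of the danger (a generic example, not anything Deluge-specific):

```python
import pickle, os

# A malicious peer can craft a pickle that makes pickle.loads() call an
# arbitrary function via __reduce__ while "reconstructing" the object.
class Evil(object):
    def __reduce__(self):
        return (os.system, ("echo arbitrary command executed",))

payload = pickle.dumps(Evil())   # this is what an attacker would send
pickle.loads(payload)            # runs the shell command on the receiver
```

rencode only handles plain data types, so a crafted message can't trigger anything like this on decode.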
Have you tried my enhanced rencode module? http://code.google.com/p/rencode/
I'd be really interested if it solves your problem. It is also a lot quicker than the Python version included with Deluge. Perhaps we should look into including this module instead of the current pure-Python version.
If it doesn't solve your problem, we should figure out where it's occurring and correct it. I suspect we might be able to do something in the Protocol to check for invalid string types before doing an rencode.
An updated patch to remove the pickle stuff and just include the message length header would be nice.
comment:3 by , 13 years ago
andar, after reading your comments I tested this out more thoroughly. The problem wasn't that easy to reproduce, but I think I've got it.
As far as I can understand, the problem is caused by two things that together break the current protocol - the rencode-must-fail protocol. The problem isn't zlib failing on its own, as I thought when writing the ticket; zlib ends up failing for another reason.
The rencode-must-fail Protocol
1) If two messages of 50 bytes are received as 100 bytes, zlib will decompress the first 50 bytes, and put the next 50 bytes in the unused_data buffer. This works well, and the protocol depends on this to work.
However, if only the first 25 of the 100 bytes are received, zlib will not raise any error, but will decompress those 25 bytes. That isn't a problem with zlib itself, but it is the first problem with the protocol.
2) When zlib doesn't fail on decompressing only 25 bytes, it's up to rencode to fail (for the protocol to function), and it does: it will not successfully decode only 25 bytes (it's more after being decompressed, but let's ignore that) of 50, so it raises an error and the protocol then waits for more data. This also works well... usually.
However, if slightly fewer bytes are received, e.g. 48, rencode might not complain when decoding those 48 bytes, even though 2 bytes are missing.
The next time data arrives (52 bytes, where the first 2 belong to the previous message), decompression fails because the next message actually starts 2 bytes later, and zlib spits out:
Error -3 while decompressing: incorrect header check
When a single message of 50 bytes is sent, the receiver will of course never end up with just 48 bytes, and that is why this protocol works in most cases. Usually the messages aren't split, and when they are, they're split at a place where rencode fails (which makes the protocol save the data and wait for more).
But when multiple messages are sent and get split for various reasons, and the split lands at a place (near the end of a message) where rencode doesn't complain, the protocol will from then on fail to decompress later messages, and a reconnect to the daemon is required.
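The zlib side of this is easy to demonstrate with a toy example (my own data, not Deluge's):

```python
import zlib

# Two complete messages arriving concatenated: a decompressobj stops at the
# end of the first zlib stream and leaves the second one in unused_data.
two = zlib.compress("first message") + zlib.compress("second message")
d = zlib.decompressobj()
print d.decompress(two)    # 'first message'
print len(d.unused_data)   # the compressed second message, untouched

# A truncated stream, however, does not raise; it silently returns whatever
# could be inflated, so detecting the missing bytes is left to rencode.
compressed = zlib.compress("a fifty-ish byte rencoded message body.........")
d = zlib.decompressobj()
partial = d.decompress(compressed[:-5])
print len(partial)         # shorter than the original message, no exception
```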
I've created a test, _test_rencode_fail_protocol, which illustrates this problem. Testing with different values for the packet_size variable you'll see that rencode will accept a few bytes less than the actual message length. Using a value of 53 or greater for packet_size will work because the first packet is 53 bytes. Using 49-52 will cause the test to fail (because rencode doesn't fail as the protocol depends upon).
Also illustrated by:
```
>>> import rencode
>>> data = '(0, 1, {"key": "value"})'
>>> dump = rencode.dumps(eval(data))
>>> dump = dump[:-2]
>>> print rencode.loads(dump)
(0, 1, {u'key': u'val\x00S'})
```
I tested the same with pickle which also seems to accept a slightly truncated message.
Performance
I've timed the old code with 1600 torrents, and it takes a long time to process the torrent list. As expected, it isn't zlib that uses a lot of time, it's rencode. The torrent list message was received in about 29 packets, meaning 29 calls to zlib.decompress and rencode.loads: zlib.decompress used 0.53 seconds in total, while rencode.loads used about 12 seconds. With this patch it is reduced to a single call to zlib.decompress and rencode.loads, which takes about 0.7 seconds with my torrent list.
Regarding the comments about saving space with pickle, in practice it's actually not true. Pickle will use less space than rencode for the same data, but zlib compresses the rencode dump better than the pickle dump, so on the wire rencode+zlib uses less space ;-)
rencode
I tried the Cython version of rencode but it fails as well. I found out that the decoding issue is caused by a "bad" character in the tracker_status field, more precisely \xe5, which is the character å.
```
>>> import rencode, pickle
>>> data = eval("['String with \xe5']")
>>> dump = pickle.dumps(data)
>>> pickle.loads(dump)
['String with \xe5']
>>> dump = rencode.dumps(data)
>>> rencode.loads(dump)
decode error: 'utf8' codec can't decode byte 0xe5 in position 12: unexpected end of data
input: String with ...
...
TypeError: decoder must return a tuple (object,integer)
```
Updated version of the patch using rencode (It's also simplified a bit):
comment:4 by , 13 years ago
Great analysis work.
So it looks like we have two problems: an issue with the RPC messaging and an issue with rencode. It looks like you've solved the first one and your reasoning makes sense, so I'll work on getting this applied to master.
I've taken a look at the second problem involving rencode and I at least understand why it's happening, but I'm still not sure on how to go about fixing it. With rencode we expect all strings (byte strings) to be either utf8 or ascii encoded (ascii is a subset of utf8 which is why it works). If a unicode object is passed into rencode, it will first encode it into a utf8 bytestring. During a decode of a string, rencode will attempt to decode the string as utf8 so that it will return a unicode object.
>>> data = u"\xe5" >>> print data å >>> rencode.dumps(data) '\x82\xc3\xa5' >>> rencode.loads(rencode.dumps(data)) u'\xe5'
>>> data = "foo" >>> rencode.dumps(data) '\x83foo' >>> rencode.loads(rencode.dumps(data)) u'foo'
When the string is passed in as a unicode object, rencode behaves as expected, returning a unicode object when doing a loads().
The problem arises when you pass in a string that is neither ascii nor utf8 encoded and then try to loads() it from rencode.
>>> data = "\xe5" >>> rencode.dumps(data) '\x81\xe5' >>> rencode.loads(rencode.dumps(data)) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "rencode.pyx", line 498, in rencode._rencode.loads (rencode/rencode.c:5439) File "rencode.pyx", line 466, in rencode._rencode.decode (rencode/rencode.c:5131) File "rencode.pyx", line 386, in rencode._rencode.decode_fixed_str (rencode/rencode.c:4159) File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode return codecs.utf_8_decode(input, errors, True) UnicodeDecodeError: 'utf8' codec can't decode byte 0xe5 in position 0: unexpected end of data
The dumps() works just fine because rencode does nothing to bytestrings passed in as it expects these to be proper ascii or utf8 encoded strings, but when it comes to the loads() rencode will attempt to decode the bytestring as utf8 to produce a unicode object.
A couple options to fix this:
- Don't. Simply enforce that we should be using proper utf8 strings and fix the source of these malformed strings (a rough sketch of that approach follows after this list).
- Allow the use of different encoded bytestrings by not attempting a utf8 decode during the loads(). This means that if you pass a unicode object into rencode.dumps(), you will not get a unicode object out on the subsequent loads() but rather a bytestring in an unknown encoding.
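For the first option, the enforcement could look roughly like this on the sending side (a sketch only; the function name and the latin-1 fallback are my assumptions, not existing Deluge code):

```python
# Recursively make sure every byte string handed to rencode.dumps() is valid
# utf8. Undecodable byte strings are assumed to be latin-1 here, which is a
# guess; the real fix would be to correct the source of the bad strings.
def ensure_utf8(obj):
    if isinstance(obj, str):
        try:
            obj.decode("utf8")
            return obj                  # already valid utf8 (or ascii)
        except UnicodeDecodeError:
            return obj.decode("latin-1").encode("utf8")
    elif isinstance(obj, unicode):
        return obj                      # rencode encodes unicode to utf8 itself
    elif isinstance(obj, (list, tuple)):
        return type(obj)(ensure_utf8(i) for i in obj)
    elif isinstance(obj, dict):
        return dict((ensure_utf8(k), ensure_utf8(v)) for k, v in obj.items())
    return obj
```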
I'm really not sure if the latter option would have any effect on Deluge or not. Quite frankly our handling of string encodings across the board is pretty messed up so it's a bit scary changing something as fundamental as this. That being said, if you have tried using pickle and it has worked for you, this may turn out ok for us as well, as I assume this is how pickle approaches encoded bytestrings. On the other hand, I kind of like the idea of enforcing the use of utf8 within Deluge as we really shouldn't be using any other encoding for anything.
Anybody have any thoughts on this? I suppose we could do some tests and see if removing the string decoding has any real effects.
comment:5 by , 13 years ago (follow-up: 6)
Could you please rebase your patch against the HEAD of master? I think your patch has been based off the 1.3-stable branch.
comment:6 by , 13 years ago
Replying to andar:
Could you please rebase your patch against the HEAD of master? I think your patch has been based off the 1.3-stable branch.
https://github.com/bendikro/deluge/tree/master-dev
I added some tests on the header of the incoming data.
I've barely tested the patch on the master branch, as I had some problems loading the old torrents. When starting deluged on my server the torrents are loaded according to the log, but the message with the torrents (the 1600) is never sent to the client. Is it so that torrents from 1.3.5 won't load on current master?
As for how to fix the rencode issue: as you already wrote, Deluge should always pass utf8 strings, so I guess that should be fixed no matter what. A problem with not changing rencode is that the same issue might reappear with some other strings that Deluge (or some plugin?) doesn't handle properly?
Pickle did work with no problems when I tested, so I doubt there would be any major problems with changing rencode. Still it's best to be certain before changing rencode in Deluge ;-)
comment:7 by , 12 years ago
Resolution: | → fixed
---|---
Status: | new → closed
I've merged your branch into master. The string encoding issue we'll deal with separately in #2039.
Thanks a lot. Will test