Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

multipart.BodyPartReader appends CRLF #454

Closed
vbraun opened this issue Aug 2, 2015 · 7 comments
Closed

multipart.BodyPartReader appends CRLF #454

vbraun opened this issue Aug 2, 2015 · 7 comments
Labels

Comments

@vbraun
Copy link

vbraun commented Aug 2, 2015

I'm trying to handle binary flies uploaded with a polymer file-upload component (https://github.com/winhowes/file-upload). On the browser side (chrome) that just sticks the file in a FormData object, so I believe the sent form is correct. I'm using

        reader = aiohttp.MultipartReader.from_response(request)
        first_part = yield from reader.next()
        data = yield from first_part.read()

to parse it on the server side and always get a trailing CRLF added to data.

As an example, the browser sends

[...]
0100500  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0     003
0100520  \0  \0  \0  \0  \0  \0   +  \0  \0  \0  \0  \0  \0  \0  \0  \0
0100540  \0  \0  \0  \0  \0  \0 001  \0  \0  \0  \0  \0  \0  \0  \0  \0
0100560  \0  \0  \0  \0  \0  \0  \r  \n   -   -   -   -   -   -   W   e
0100600   b   K   i   t   F   o   r   m   B   o   u   n   d   a   r   y
0100620   9   X   r   p   F   i   8   S   X   a   A   2   9   t   P   v
0100640   -   -  \r  \n
0100644

and data (incorrectly) ends with [...]\0\0\0\0\0\0\r\n. Looking at the aiohttp tests, this is by design:

class PartReaderTestCase(TestCase):

    def setUp(self):
        super().setUp()
        self.boundary = b'--:'

    def test_next(self):
        obj = aiohttp.multipart.BodyPartReader(
            self.boundary, {}, Stream(b'Hello, world!\r\n--:'))
        result = yield from obj.next()
        self.assertEqual(b'Hello, world!\r\n', result)

But that seems wrong to me, the CRLF is supposed to be part of the boundary and should NOT be included in the body part.

The relevant RFC http://www.w3.org/Protocols/rfc1341/7_2_Multipart.html says: "NOTE: The CRLF preceding the encapsulation line is considered part of the boundary so that it is possible to have a part that does not end with a CRLF (line break). Body parts that must be considered to end with line breaks, therefore, should have two CRLFs preceding the encapsulation line, the first of which is part of the preceding body part, and the second of which is part of the encapsulation boundary."

@kxepal feature #273

@kxepal
Copy link
Member

kxepal commented Aug 2, 2015

mmm...true. And I know how I avoid that problem: in all my cases binary data falls into read_chunk branch as they always have content length defined while lenghtless parts always fit one line. Eventually.

The problem is in readline method which is used when body part length is unknown since that's the only safe way to parse it. However, it also means that to fix this issue it have to look forward on one line in order to strip accidental CRLF from the last data line. I always though that _unread buffer was a hack, but it seems it have to be a legit solution.

The test therefore is valid for the wrong behaviour (:

Will prepare the fix soon. If you have own ideas about feel free to share (:

@kxepal
Copy link
Member

kxepal commented Aug 2, 2015

@vbraun see #455 PR

@vbraun
Copy link
Author

vbraun commented Aug 2, 2015

Thank, I'll give it a try tomorrow!

I also didn't find find a way to read in chunks except by using readline; Thats not really ideal either as there is no guarantee that there are any newlines. The boundary marker has a fixed maximal length (70 bytes) so it should be easy enough to read + scan in chunks. Am I missing something?

@kxepal
Copy link
Member

kxepal commented Aug 2, 2015

@vbraun That's easy:

reader = aiohttp.MultipartReader.from_response(request)
first_part = yield from reader.next()
while True:
    chunk = yield from first_part.read_chunk()
    if not chunk:
        break
    fobj.write(chunk)

The only requirement is that body part must define Content-Length header.

Ideally, of course it's possible to read by chunks lengthless parts, but it's hard: boundary may be split across two or, even worse, 3+ chunks and I don't know what the sane algorithm there should be to correctly merge such partial reads, avoid false positive ahead reads and so on. Too many issues, pitfalls and edge cases for one missed header.

asvetlov added a commit that referenced this issue Aug 3, 2015
#454: Don't append CRLF for the last read line
@asvetlov
Copy link
Member

asvetlov commented Aug 3, 2015

Fixed by #455

@asvetlov asvetlov closed this as completed Aug 3, 2015
@vbraun
Copy link
Author

vbraun commented Aug 3, 2015

Thanks for the quick fix!

And yes, I should have said that Chrome doesn't send a Content-Length header so read_chunk doesn't work.

The boundary is guaranteed to be <= 70 bytes so all you have to do is read ahead by 70 bytes. Just wrapping a stream to stop at given boundaries is straightforward. What makes chunked reads difficult with the current implementation is that this is not abstracted into a stream wrapper but handled across MultipartReader and BodyPartReader, handing _unread pieces around.

@lock
Copy link

lock bot commented Oct 29, 2019

This thread has been automatically locked since there has not been
any recent activity after it was closed. Please open a new issue for
related bugs.

If you feel like there's important points made in this discussion,
please include those exceprts into that new issue.

@lock lock bot added the outdated label Oct 29, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Oct 29, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

3 participants