resp.content.read(chunk_size) returns HTTP headers instead of just body. #3329

Closed
bradwood opened this issue Oct 7, 2018 · 15 comments

@bradwood

bradwood commented Oct 7, 2018

Long story short

I am doing a chunked download from a Range-enabled HTTP server, and the resulting file appears to include HTTP headers rather than just the body.

Am I using the library incorrectly, or is this a bug?

How do I get only chunks of the body, excluding the HTTP headers and the --boundary delimiter?

Thanks!

Expected behaviour

I expected that only pieces of the HTTP body would be returned when calling resp.content.read(chunk_size).

Actual behaviour

The chunks come down correctly, but the headers and boundary delimiters are present in the resulting file.

Steps to reproduce

Here is the code in question:

    async def fetch(self,
                    *,
                    timeout: int = 60, # sec
                    chunk_size: int = 1048576 # = 1 Mb
                    ) -> None:
        """Fetch the Listings XML file."""
        LOGGER.debug(f'Fetch({self}) called started.')
        to_ = ClientTimeout(total=timeout)
        async with ClientSession(timeout=to_) as session:
            LOGGER.debug(f'Fetch: Inside ClientSession()')
            LOGGER.debug(f'Fetch: About to fetch url={self._url}')
            async with session.get(self._url) as resp:
                LOGGER.debug(f'Fetch: Inside session.get(url={self._url})')
                with open(self._full_path, 'wb') as file_desc:
                    while True:
                        LOGGER.debug(f'Fetch: Inside file writing loop. filename={self._full_path}')
                        chunk = await resp.content.read(chunk_size)
                        if not chunk:
                            break
                        LOGGER.debug('Fetch: Got a chunk')
                        file_desc.write(chunk)
                        LOGGER.debug('Fetch: Wrote the chunk')

        LOGGER.debug(f'Fetch() call finished on {self}')

and here is the head of the resulting file:

(pyskyq-4vSEKDfZ) ✔ [brad@bradmac:~/Code/pyskyq] [31-epg-enh|✚ 2] $ head .epg_data/ea51e77b9fdede19528d599f50182d37edcdbc082b06358146041fe446f6a855.xml
--boundary
Content-Type: application/xml
Content-Disposition: attachment; filename="6729.xml"; filename*=utf-8''6729.xml
Content-Length: 10781354

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE tv SYSTEM "xmltv.dtd">
<tv generator-info-name="xmltv.co.uk" source-info-name="xmltv.co.uk">
  <channel id="003b31fb0fd63bd8fd171c7d7a1d0249">
    <display-name>GEO News</display-name>
(pyskyq-4vSEKDfZ) ✔ [brad@bradmac:~/Code/pyskyq] [31-epg-enh|✚ 2] $

Your environment

(pyskyq-4vSEKDfZ) ✔ [brad@bradmac:~/Code/pyskyq] [31-epg-enh|✚ 2] $ python -V
Python 3.7.0
(pyskyq-4vSEKDfZ) ✔ [brad@bradmac:~/Code/pyskyq] [31-epg-enh|✚ 2] $ pip freeze | grep aio
aiohttp==3.4.4
(pyskyq-4vSEKDfZ) ✔ [brad@bradmac:~/Code/pyskyq] [31-epg-enh|✚ 2] $
@aio-libs-bot

GitMate.io thinks the contributor most likely able to help you is @asvetlov.

Possibly related issues are #2711 (No content), #2062 (Content-Length header), #2183 ('None' in HTTP headers), #813 (Why uppercase HTTP headers?), and #14 (HttpResponse doesn't parse response body without Content-Length header and Connection: close).

@asvetlov
Member

asvetlov commented Oct 8, 2018

It's not a chunked-encoded body but a multipart/form-data encoded form.
Please use MultipartReader(resp.headers, resp.content) to extract the form data.
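For illustration, a minimal sketch of reading such a response with MultipartReader and writing only the part payloads to a file (the URL, destination path and chunk size below are placeholders, not taken from the code above):

    from aiohttp import ClientSession, MultipartReader

    async def fetch_multipart(url: str, dest: str, chunk_size: int = 1048576) -> None:
        """Write the payload of every multipart part to a single file."""
        async with ClientSession() as session:
            async with session.get(url) as resp:
                reader = MultipartReader(resp.headers, resp.content)
                with open(dest, 'wb') as file_desc:
                    while True:
                        part = await reader.next()  # BodyPartReader, or None at the end
                        if part is None:
                            break
                        # read_chunk() returns only the part's body, skipping
                        # the part headers and the --boundary delimiters
                        while True:
                            chunk = await part.read_chunk(chunk_size)
                            if not chunk:
                                break
                            file_desc.write(chunk)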

@bradwood
Author

bradwood commented Oct 8, 2018

It's not form data. It's a large XML payload.

@asvetlov
Member

asvetlov commented Oct 8, 2018

Check resp.headers.
Your log looks like a multipart message with a large XML payload inside.
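For example, a multipart body announces itself in the Content-Type header; a quick diagnostic sketch (the URL is a placeholder):

    from aiohttp import ClientSession

    async def show_content_type(url: str) -> str:
        """Print and return the Content-Type header of a GET response."""
        async with ClientSession() as session:
            async with session.get(url) as resp:
                # A multipart body reports something like
                # "multipart/<subtype>; boundary=..." here.
                ctype = resp.headers.get('Content-Type', '')
                print(ctype)
                return ctype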

@bradwood
Author

bradwood commented Oct 8, 2018

Sorry, I'm confused.

I want the body, not the headers. Essentially, I want to be able to loop over body chunks to write out the data file, without headers.

@bradwood
Author

bradwood commented Oct 8, 2018

Here is the code (a test using aresponses) that mocks the server, in case that helps:

@pytest.mark.asyncio
async def test_listing_fetch(aresponses):

    # custom handler to respond with chunks
    async def my_handler(request):
        LOGGER.debug('in handler')
        my_boundary = 'boundary'
        xmlfile_path = Path(__file__).resolve().parent.joinpath('6729.xml')
        LOGGER.debug(f'xml file path = {xmlfile_path}')
        resp = aresponses.Response(status=200,
                                   reason='OK',
                                   )
        resp.enable_chunked_encoding()
        await resp.prepare(request)

        xmlfile = open(xmlfile_path, 'rb')

        LOGGER.debug('opened xml file for serving')
        with MultipartWriter('application/xml', boundary=my_boundary) as mpwriter:
            mpwriter.append(xmlfile)
            LOGGER.debug('appended chunk')
            await mpwriter.write(resp, close_boundary=False)
            LOGGER.debug('wrote chunk')

        xmlfile.close()
        return resp

    aresponses.add('foo.com', '/feed/6715', 'get', response=my_handler)

    with isolated_filesystem():
        l = Listing('http://foo.com/feed/6715')
        await l.fetch()
        assert l._path.joinpath(l._filename).is_file()

@asvetlov
Member

asvetlov commented Oct 8, 2018

Please read about multipart encoding first: https://en.wikipedia.org/wiki/MIME#Multipart_messages

Your mocked server is invalid: application/xml is the content type of the entire XML document, not a multipart subtype.

P.S.
The thing you call a chunk is actually a multipart part.
The word chunk is used for another concept, at least in the HTTP protocol.

@bradwood
Author

bradwood commented Oct 8, 2018

So how do I make a server that emulates support for Range headers?

Here is the response to a HEAD request against the server I'm trying to emulate:

HTTP/1.1 200 OK
Accept-Ranges: bytes
Connection: keep-alive
Content-Encoding: gzip
Content-Type: application/xml
Date: Sun, 07 Oct 2018 23:11:56 GMT
ETag: "f8889f-577999e0b6f7d-gzip"
Last-Modified: Sun, 07 Oct 2018 01:42:28 GMT
Server: nginx/1.11.10
Vary: Accept-Encoding

How can I make aiohttp behave like that? If it's in the docs, then maybe I missed it, or got confused between multipart and "streaming".

Thanks for your help.

@bradwood
Author

bradwood commented Oct 8, 2018

It should respond like this when a Range header is given:

(pyskyq-4vSEKDfZ) ✘-INT [brad@bradmac:~/Code/pyskyq/tests] [31-epg-enh|✚ 2] $ curl http://www.xmltv.co.uk/feed/6715 -i -H "Range: bytes=0-1023"
HTTP/1.1 206 Partial Content
Server: nginx/1.11.10
Date: Mon, 08 Oct 2018 07:08:53 GMT
Content-Type: application/xml
Content-Length: 1024
Connection: keep-alive
Last-Modified: Mon, 08 Oct 2018 01:42:20 GMT
ETag: "f9f199-577adbb5d510e"
Accept-Ranges: bytes
Vary: Accept-Encoding
Content-Range: bytes 0-1023/16380313

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE tv SYSTEM "xmltv.dtd">
<tv generator-info-name="xmltv.co.uk" source-info-name="xmltv.co.uk">
  <channel id="003b31fb0fd63bd8fd171c7d7a1d0249">
    <display-name>GEO News</display-name>
  </channel>
  <channel id="0092ad6b181b813d9e2ceed1cfbf5bf1">
    <display-name>Notts TV</display-name>
  </channel>
  <channel id="00da025711e82cf319cb488d5988c099">
    <display-name>Dunya News</display-name>
  </channel>

Is this type of server supported in aiohttp? Using Multipart* objects? Or Stream*? I have been digging through the docs for this but it's not clear.

@asvetlov
Member

asvetlov commented Oct 8, 2018

The latest response is neither streaming nor multipart.

It is just a regular response with a truncated body: web.Response(status=206, headers={<fill them in yourself>}, body=xml_bytes[:1000]).
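Spelled out, such a handler might look like the following sketch (xml_bytes is a placeholder for the full document; the fixed 1024-byte slice mirrors the curl example above):

    from aiohttp import web

    # Placeholder payload; a real handler would hold the full XML document.
    xml_bytes = b'<?xml version="1.0" encoding="UTF-8"?>\n<tv></tv>\n'

    async def ranged_handler(request: web.Request) -> web.Response:
        """Return the first 1024 bytes as a 206 Partial Content response."""
        part = xml_bytes[:1024]
        return web.Response(
            status=206,
            body=part,
            content_type='application/xml',
            headers={
                'Accept-Ranges': 'bytes',
                'Content-Range': f'bytes 0-{len(part) - 1}/{len(xml_bytes)}',
            },
        )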

I'm closing the issue because it is not about aiohttp bugs/improvements but about teaching @bradwood the HTTP protocol.

Please use another site for it. Maybe StackOverflow fits better.

@asvetlov asvetlov closed this as completed Oct 8, 2018
@bradwood
Author

bradwood commented Oct 8, 2018

I don't need to be taught about the HTTP protocol on this forum, @asvetlov. I am perfectly capable of reading Wikipedia and RFCs, just like you.

I am asking about aiohttp support for this. Does it support it, or not? Please refer me to the documentation, if so, or tell me that it doesn't.

Did you not read this?

Is this type of server supported in aiohttp? Using Multipart* objects? Or Stream*? I have been digging through the docs for this but it's not clear.

FWIW, while I may have made a mistake in interpretation earlier, I don't appreciate your comment about teaching me HTTP. There is no need for rudeness.

I've been reading your responses to many people on this forum - you are extremely rude to many of them. You like to tell them to read Wikipedia instead of actually being helpful. It's condescending and unhelpful. In many cases, these questions arise from poorly documented examples of how aiohttp implements, or doesn't implement, a particular feature, not from the protocol itself.

Look, don't get me wrong, I appreciate your contribution to the community, but it would be much better if (a) the docs were improved so that answers could be found without raising tickets, and (b) you were less dismissive and insulting towards people who have legitimate questions about the codebase, not the protocol.

@asvetlov
Member

asvetlov commented Oct 8, 2018

  1. The aiohttp request supports the request.range property to help with Range HTTP header parsing, and it supports ranged requests in static file serving. The library doesn't provide a magic helper for returning a ranged response for arbitrary data -- the user should construct this response manually.

  2. The main mission of the GitHub tracker is the development of aiohttp, not aiohttp usage. For example, CPython itself forbids questions about Python usage in its bug tracker and on the python-dev mailing list. Should we enable the same policy for aiohttp? I don't know, but this tracker is not a place for general questions. It is not a forum or a questions-and-answers resource.

  3. We have a different understanding of rudeness. Pointing to a helpful resource for further reading is a good response in my view. RTFM and so on. If it is not enough for you -- that's fine. Please use another site like stackoverflow.com for asking usage questions.

  4. The documentation is never perfect. It can always be improved. Please make Pull Request(s) for documentation improvement. I would very much appreciate it.

@bradwood
Author

bradwood commented Oct 8, 2018

  1. The aiohttp request supports the request.range property to help with Range HTTP header parsing, and it supports ranged requests in static file serving. The library doesn't provide a magic helper for returning a ranged response for arbitrary data -- the user should construct this response manually.

Ok great -- this is helpful -- I will do that. I thought there might be a higher-level API that did this, as is the case for Streams and Multipart -- so not an unreasonable question IMHO.
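As a starting point for that, here is a minimal sketch of constructing the ranged response by hand from the parsed Range header (exposed as request.http_range in the aiohttp web docs; xml_bytes is a placeholder, and suffix ranges such as bytes=-500 are not handled):

    from aiohttp import web

    # Placeholder payload; a real handler would hold the full XML document.
    xml_bytes = b'<?xml version="1.0" encoding="UTF-8"?>\n<tv></tv>\n'

    async def range_aware_handler(request: web.Request) -> web.Response:
        """Serve whatever byte range the client asked for."""
        rng = request.http_range  # slice parsed from the Range header
        if rng.start is None and rng.stop is None:
            # No Range header: return the full body with a plain 200.
            return web.Response(body=xml_bytes, content_type='application/xml')
        start = rng.start or 0
        stop = rng.stop if rng.stop is not None else len(xml_bytes)
        part = xml_bytes[start:stop]
        return web.Response(
            status=206,
            body=part,
            content_type='application/xml',
            headers={
                'Accept-Ranges': 'bytes',
                'Content-Range': f'bytes {start}-{start + len(part) - 1}/{len(xml_bytes)}',
            },
        )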

  2. The main mission of the GitHub tracker is the development of aiohttp, not aiohttp usage. For example, CPython itself forbids questions about Python usage in its bug tracker and on the python-dev mailing list. Should we enable the same policy for aiohttp? I don't know, but this tracker is not a place for general questions. It is not a forum or a questions-and-answers resource.

Ok, well, initially I thought it was a bug rather than a usage query, and I'd assert that the way in which something can be used, or not used, is part of its development agenda. If you make something that is difficult to use or understand, then surely that's a (usability) bug?

  3. We have a different understanding of rudeness. Pointing to a helpful resource for further reading is a good response in my view. RTFM and so on. If it is not enough for you -- that's fine. Please use another site like stackoverflow.com for asking usage questions.

Ok -- fair enough... I think the Robustness Principle should apply here too... I honestly thought this was a bug/weakness in the API which was a legitimate query. While I don't know every HTTP RFC by heart, I do think I know enough about it to ask relevant questions about aiohttp's implementation of bits of it. So being told that you are not going to "teach someone HTTP" is a pretty blunt response to a legitimate query.

  4. The documentation is never perfect. It can always be improved. Please make Pull Request(s) for documentation improvement. I would very much appreciate it.

When time permits, and I've got a working example for this topic, I'll try to do exactly that.

@asvetlov
Member

asvetlov commented Oct 8, 2018

Sorry for my attitude and thanks for understanding.

@lock

lock bot commented Oct 28, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs: https://github.com/aio-libs/aiohttp/issues/new
If you feel there are important points made in this discussion, please include those excerpts in the new issue.
