[BUG] 0.9.2 seems to not quite handle 206's properly #617
Comments
Hi, yes, it's hard to find a test case. That's why I quickly created a simple server that delivers gzip-encoded, ranged 206 responses. I tested it with a CNN article (cnn_article.html from the test data); the warning that gets logged should not influence the result. The only problem is that only the partial response gets parsed. If the HTML is cut too aggressively (for instance, if you set the limit to 10000 for the above-mentioned article), you will not get any useful text.

There is also the question of why a 206 would appear on non-binary content at all. As far as I could test, browsers do not request the rest of the partial content if they get a 206 on the main HTML page; maybe for streaming resources, but I did not test that.

Regarding gzip encoding, it's odd that a server would split the content after gzipping it; I did not find any references to such a practice. What I found is that each chunk is gzipped and then sent. Could what you encountered be a network error instead? You can play around with the server and maybe simulate a case similar to what you hit in the wild.
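The actual test server isn't pasted here, but a minimal sketch of the same idea (gzip the file first, then slice the compressed bytes to answer a Range request with 206) could look like this; the file path and port are just placeholders:

```python
# Sketch only, not the real test server: serves a local HTML file gzipped,
# and answers Range requests with 206 Partial Content over the gzipped bytes.
import gzip
from http.server import BaseHTTPRequestHandler, HTTPServer

HTML_PATH = "cnn_article.html"  # placeholder: the test-data article mentioned above

class RangedGzipHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        with open(HTML_PATH, "rb") as f:
            body = gzip.compress(f.read())  # compress first, then slice

        range_header = self.headers.get("Range")
        if range_header and range_header.startswith("bytes="):
            start_s, _, end_s = range_header[len("bytes="):].partition("-")
            start = int(start_s or 0)
            end = int(end_s) if end_s else len(body) - 1
            chunk = body[start:end + 1]
            self.send_response(206)
            self.send_header("Content-Range", f"bytes {start}-{end}/{len(body)}")
        else:
            chunk = body
            self.send_response(200)

        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Encoding", "gzip")
        self.send_header("Content-Length", str(len(chunk)))
        self.end_headers()
        self.wfile.write(chunk)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), RangedGzipHandler).serve_forever()
```

Slicing after compression is the "split the content after gzip" case discussed above, which is exactly the variant that leaves the client with an undecodable partial payload.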
I haven't been able to fully reproduce this one. What I do have is a VCR.py cassette (https://gist.github.com/palfrey/f8556218fe86e57c1f507b8d65a3e311) that got recorded and then caused issues. Note that it somehow has both a GET with a 206 response and partial data, and another GET for the same URL, also with partial data. I have no idea what's causing that, but deleting the 206 responses from my stored data seems to solve things, and AFAIK this is only occurring in the test scenarios, not prod, so it might be a VCR.py issue...
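In case it helps anyone hitting the same thing, a rough sketch of how the 206 interactions could be stripped from a cassette, assuming the default vcrpy YAML layout (interactions[*].response.status.code) and PyYAML available:

```python
# Sketch: drop every 206 interaction from a vcrpy cassette file, assuming the
# default YAML cassette format. Back up the cassette before rewriting it.
import sys
import yaml

def strip_206(cassette_path: str) -> None:
    with open(cassette_path) as f:
        cassette = yaml.safe_load(f)

    cassette["interactions"] = [
        interaction
        for interaction in cassette.get("interactions", [])
        if interaction["response"]["status"]["code"] != 206
    ]

    with open(cassette_path, "w") as f:
        yaml.safe_dump(cassette, f)

if __name__ == "__main__":
    strip_206(sys.argv[1])  # usage: python strip_206.py path/to/cassette.yaml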
OK, let's keep an eye on it. I will release 0.9.3 without any extra changes to address this.
Describe the bug
I hadn't seen this on 0.9.1, but I am now seeing it on 0.9.2. For some sites, especially https://www.theguardian.com/, I'm getting logs like
newspaper.network:network.py:192 get_html_status(): bad status code 206 on URL
and it looks like a 206 is being returned with a gzip-encoded response, and the library is not going back to pull the rest of the content. Sometimes I get a
newspaper.exceptions.ArticleBinaryDataException
when the page is clearly not binary; it's just a partially retrieved page whose zlib decoding fails because only half of it was downloaded (see the sketch below).
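For context, a small illustration (my assumption of what is happening, not library code) of why a half-downloaded gzip payload blows up in zlib:

```python
# Decompressing a truncated gzip stream raises zlib.error, which presumably
# is what surfaces as ArticleBinaryDataException on a half-downloaded page.
import gzip
import zlib

full = gzip.compress(b"<html>" + b"x" * 100_000 + b"</html>")
truncated = full[: len(full) // 2]  # simulate a 206 that only covered half

try:
    zlib.decompress(truncated, wbits=zlib.MAX_WBITS | 16)  # gzip container
except zlib.error as exc:
    print("zlib failed on partial content:", exc)
```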
To Reproduce
Annoyingly, I don't have an easy repro of this. I can trigger it semi-reliably, but only within a large test sequence with pytest and VCR.py, and I'm trying to get something more reliable as a single file I can provide.
Expected behavior
Downloads just work with gzip-encoded pages, even if the server returns a 206 part way through.
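To make that concrete, a hedged sketch of what "pulling the rest of the content" could look like; fetch_complete is a hypothetical helper, not newspaper's actual download path:

```python
import requests

def fetch_complete(url: str, timeout: int = 30) -> bytes:
    """Keep following up a 206 with Range requests until the Content-Range
    total is covered; returns the raw payload (still gzip-encoded if the
    server compressed it), so it still needs to be decoded afterwards."""
    resp = requests.get(url, timeout=timeout, stream=True)
    chunks = [resp.raw.read()]  # raw bytes; no transparent gunzip here
    while resp.status_code == 206:
        content_range = resp.headers.get("Content-Range", "")  # e.g. "bytes 0-999/5000"
        try:
            span, total = content_range.removeprefix("bytes ").split("/")
            end = int(span.split("-")[1])
            if end + 1 >= int(total):
                break  # everything has been fetched
        except (ValueError, IndexError):
            break  # malformed or missing Content-Range; give up
        resp = requests.get(url, timeout=timeout, stream=True,
                            headers={"Range": f"bytes={end + 1}-"})
        chunks.append(resp.raw.read())
    return b"".join(chunks)
```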
System information