[BUG] 0.9.2 seems to not quite handle 206's properly #617

Open
palfrey opened this issue Feb 26, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@palfrey
Contributor

palfrey commented Feb 26, 2024

Describe the bug

I haven't seen this on 0.9.1, but I'm now seeing it on 0.9.2. For some sites, especially https://www.theguardian.com/, I'm getting logs like newspaper.network:network.py:192 get_html_status(): bad status code 206 on URL. It looks like a gzip-encoded response is coming back with a 206 (Partial Content) status, and the library isn't going back to pull the rest of the content.

Sometimes I get a newspaper.exceptions.ArticleBinaryDataException when the page clearly isn't binary. It's just a partially retrieved page that fails zlib decompression because only half of it arrived.
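To make that failure mode concrete, here is a minimal standard-library sketch that tells a complete gzip body apart from one that was cut short. The helper name `is_complete_gzip` is hypothetical and is not part of newspaper; it only illustrates the kind of check being described, under the assumption that the body is a single gzip stream.

```python
import zlib


def is_complete_gzip(payload: bytes) -> bool:
    """Return True if payload holds a whole gzip stream, False if it was cut short."""
    # Adding 16 to wbits tells zlib to expect gzip framing (header and trailer).
    decoder = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    try:
        decoder.decompress(payload)
    except zlib.error:
        # Not gzip data at all, or corrupted in the middle.
        return False
    # eof only becomes True once the end-of-stream marker has been seen.
    return decoder.eof
```

A truncated body does not raise here; it simply never reaches end-of-stream, which is why checking `eof` is more informative than catching exceptions alone.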

To Reproduce
Annoyingly, I don't have an easy repro of this. I can trigger it semi-reliably, but only within a large test sequence with pytest and VCR.py, and I'm trying to get something more reliable as a single file I can provide.

Expected behavior
Downloads should just work with gzip-encoded pages, even if the server returns a 206 partway through.

System information

  • OS: Linux
  • Python version: 3.11.5
  • Library version: 0.9.2
@palfrey palfrey added the bug Something isn't working label Feb 26, 2024
@AndyTheFactory AndyTheFactory added this to the Release 0.9.3 milestone Feb 26, 2024
@AndyTheFactory
Owner

AndyTheFactory commented Mar 5, 2024

Hi,
I wasn't very familiar with how 206 is used and implemented, so I had to research it a bit.

Yes, it's hard to find a test case. That's why I quickly created a simple server that delivers 206 responses:

Gzip encoded and Ranged server
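For readers who cannot open the link, a rough sketch of what such a test server might look like follows. This is not the linked server: `BODY`, `LIMIT`, `ranged_gzip_response`, `RangedHandler` and `serve` are all hypothetical names, and the behaviour (cut a slice from the uncompressed page, then gzip that slice on its own) is an assumption based on the description in this thread.

```python
import gzip
import re
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-ins for the test page and the per-response byte limit.
BODY = b"<html><body>" + b"<p>Lorem ipsum dolor sit amet.</p>" * 2000 + b"</body></html>"
LIMIT = 10000


def ranged_gzip_response(body, range_header, limit):
    """Build (status, headers, payload) for one GET request.

    The slice is taken from the uncompressed body first and gzip-compressed
    afterwards, so every payload is a complete gzip stream by itself.
    """
    start = 0
    match = re.fullmatch(r"bytes=(\d+)-\d*", range_header or "")
    if match:
        start = int(match.group(1))  # only the range start is honoured here
    if start >= len(body):
        return 416, {"Content-Range": f"bytes */{len(body)}"}, b""
    end = min(start + limit, len(body))  # exclusive end of the slice
    payload = gzip.compress(body[start:end])
    status = 200 if (start == 0 and end == len(body)) else 206
    headers = {
        "Content-Type": "text/html; charset=utf-8",
        "Content-Encoding": "gzip",
        "Content-Length": str(len(payload)),
    }
    if status == 206:
        headers["Content-Range"] = f"bytes {start}-{end - 1}/{len(body)}"
    return status, headers, payload


class RangedHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, headers, payload = ranged_gzip_response(
            BODY, self.headers.get("Range"), LIMIT
        )
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8000):
    """Run the sketch server on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), RangedHandler).serve_forever()
```

Calling `serve()` and pointing the library at http://127.0.0.1:8000/ should produce a 206 on the first request, since the body is longer than `LIMIT`.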

I tested it with a CNN article (cnn_article.html from the tests data).

The logged warning should not influence the result. The only problem is that only the partial response gets parsed. If the HTML is cut too aggressively (for instance, if you set the limit to 10000 for the article mentioned above), you will not get any useful text.
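One way to see how much of the page is missing in that case is to read the Content-Range header that normally accompanies a 206. A minimal sketch, assuming the server sends the usual "bytes start-end/total" form; the function name `missing_bytes` is made up for illustration and is not part of the library:

```python
import re

_CONTENT_RANGE = re.compile(r"bytes (\d+)-(\d+)/(\d+|\*)")


def missing_bytes(status, content_range):
    """Return how many bytes of the full body this response lacks.

    Returns 0 for non-206 responses and None when the total size is
    unknown or the header is absent or malformed.
    """
    if status != 206:
        return 0
    match = _CONTENT_RANGE.fullmatch((content_range or "").strip())
    if not match or match.group(3) == "*":
        return None
    start, end, total = (int(match.group(i)) for i in (1, 2, 3))
    return total - (end - start + 1)
```

For example, `missing_bytes(206, "bytes 0-9999/48213")` reports that 38213 bytes were not delivered.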

There is the question of why a 206 would appear on non-binary content. As far as I could test, browsers do not request the rest of the partial content if they get a 206 on the main HTML page. Maybe they do for streaming resources; I did not test that.

Regarding gzip encoding, it's odd that they would split the content after gzipping it; I did not find any references to such a practice. What I found was that each chunk is gzipped and then sent. What you encountered looks more like a network error. Could that be it?
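The difference between the two orderings can be shown with the standard library alone. The sketch below uses made-up HTML and a hypothetical helper `can_decode_alone`; it illustrates the general behaviour of gzip streams, not what any particular site does.

```python
import gzip
import zlib


def can_decode_alone(payload):
    """Return True if payload decompresses on its own as gzip."""
    try:
        gzip.decompress(payload)
        return True
    except (EOFError, OSError, zlib.error):
        return False


html = b"".join(b"<p>paragraph %d</p>" % i for i in range(2000))
half = len(html) // 2

# Ordering A: slice the page first, then gzip each slice separately.
slice_then_gzip = [gzip.compress(html[:half]), gzip.compress(html[half:])]

# Ordering B: gzip the whole page, then slice the compressed bytes.
whole = gzip.compress(html)
cut = len(whole) // 2
gzip_then_slice = [whole[:cut], whole[cut:]]
```

With ordering A each part can be decompressed independently. With ordering B neither part decodes on its own, and the page can only be recovered after the parts are joined back together.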

You can play around with the server, and maybe you can simulate a case similar to what you encountered in the wild.

@palfrey
Contributor Author

palfrey commented Mar 16, 2024

I haven't been able to fully reproduce this one. What I do have is a VCR.py cassette (https://gist.github.com/palfrey/f8556218fe86e57c1f507b8d65a3e311) that got recorded and then caused issues. Note that it somehow contains both a GET with a 206 response and partial data, and another GET for the same URL, also with partial data. I have no idea what's causing that, but deleting the 206 responses from my stored data seems to solve things. AFAIK this only occurs in test scenarios, not in prod, so it might be a VCR.py issue...
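For anyone wanting to apply the same workaround, the clean-up could be sketched roughly as below. It assumes the cassette has already been parsed into a dict with the usual layout (a top-level interactions list whose entries carry response.status.code); `drop_partial_interactions` is a hypothetical helper, not something VCR.py provides.

```python
def drop_partial_interactions(cassette):
    """Return a copy of a parsed cassette without its 206 interactions."""
    kept = [
        interaction
        for interaction in cassette.get("interactions", [])
        if interaction["response"]["status"]["code"] != 206
    ]
    # Other top-level keys (such as the cassette version) are kept unchanged.
    return {**cassette, "interactions": kept}
```

Loading and saving the YAML file would need a YAML library such as PyYAML, which is left out here. VCR.py's before_record_response hook might allow the same filtering at record time, but that is an assumption to check against the VCR.py documentation.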

@AndyTheFactory
Owner

OK, let's keep an eye on it. I will release 0.9.3 without any extra changes to address this.
I'm working on the last touch-ups now.

@AndyTheFactory AndyTheFactory removed this from the Release 0.9.3 milestone Mar 17, 2024