Too-many-redirects error for BBC News site, perhaps due to revisit records #21

Closed
anjackson opened this issue Mar 2, 2018 · 3 comments
Labels: bug (Something isn't working)

anjackson commented Mar 2, 2018

On our production APIs, when I visit:

http://192.168.45.25:8081/qa-access/20180228215703/http://www.bbc.co.uk/news

I end up in a loop of requests:

GET /qa-access/20180228215703mp_/http://www.bbc.co.uk/news HTTP/1.1
Host: 192.168.45.25:8081
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-GB,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: http://192.168.45.25:8081/qa-access/20180228215703/http://www.bbc.co.uk/news
Connection: keep-alive
Upgrade-Insecure-Requests: 1

But the response is:

HTTP/1.1 301 Moved Permanently
X-Cache-Action: PASS (no-cache-control)
X-Archive-Orig-Vary: Accept-Encoding
X-Cache-Age: 0
Content-Type: text/html;charset=utf-8
Date: Fri, 26 Jan 2018 12:10:26 GMT
Location: http://192.168.45.25:8081/qa-access/20180228215703mp_/http://www.bbc.co.uk/news
X-Archive-Orig-X-XSS-Protection: 1; mode=block
X-Mozart-Location: eu-west-1
X-Content-Type-Options: nosniff
X-Archive-Orig-X-Frame-Options: SAMEORIGIN
Content-Length: 0
X-Archive-Orig-Connection: close
Memento-Datetime: Wed, 28 Feb 2018 21:57:03 GMT
Link: <https://www.bbc.co.uk/news>; rel="original", <http://192.168.45.25:8081/qa-access/mp_/https://www.bbc.co.uk/news>; rel="timegate", <http://192.168.45.25:8081/qa-access/timemap/link/https://www.bbc.co.uk/news>; rel="timemap"; type="application/link-format", <http://192.168.45.25:8081/qa-access/20180228215703mp_/https://www.bbc.co.uk/news>; rel="memento"; datetime="Wed, 28 Feb 2018 21:57:03 GMT"; collection="qa-access"
Preference-Applied: rewritten
Vary: Prefer
Content-Security-Policy: default-src 'unsafe-eval' 'unsafe-inline' 'self' data: blob: mediastream: ws: wss: ; form-action 'self'

This appears to happen only for revisit records: if I go to the timestamps of other (non-revisit) records, playback works.

In this case, we have this record in the CDX server:

<result>
	<compressedoffset>235449396</compressedoffset>
	<mimetype>warc/revisit</mimetype>
	<file>/heritrix/output/warcs/daily/20180228120038/BL-20180228180040873-00015-62~ukwa-h3-pulse-daily~8443.warc.gz</file>
	<redirecturl>-</redirecturl>
	<urlkey>uk,co,bbc)/news</urlkey>
	<digest>3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ</digest>
	<httpresponsecode>0</httpresponsecode>
	<robotflags>-</robotflags>
	<url>https://www.bbc.co.uk/news</url>
	<capturedate>20180228215703</capturedate>
</result>

and, scrolling back quite a long way, the corresponding response record, matched on digest:

<result>
	<compressedoffset>325775660</compressedoffset>
	<mimetype>text/html</mimetype>
	<file>/heritrix/output/warcs/daily/20180126120026/BL-20180126120519587-00001-62~ukwa-h3-pulse-daily~8443.warc.gz</file>
	<redirecturl>-</redirecturl>
	<urlkey>uk,co,bbc)/news</urlkey>
	<digest>3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ</digest>
	<httpresponsecode>301</httpresponsecode>
	<robotflags>-</robotflags>
	<url>http://www.bbc.co.uk/news/</url>
	<capturedate>20180126121046</capturedate>
</result>

I'm guessing that pywb is going back far enough to find the record, so maybe this is a problem with the way we populated our CDX index?
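
For reference, the digest-based lookup described above can be sketched roughly as follows. This only illustrates the idea under stated assumptions and is not pywb's or the CDX server's actual code: `resolve_revisit` is a hypothetical helper, and the two records are the CDX results above reduced to the fields the sketch needs.

```python
from typing import Optional

# The two CDX results quoted above, reduced to the fields used here.
CDX_RECORDS = [
    {
        "urlkey": "uk,co,bbc)/news",
        "capturedate": "20180126121046",
        "mimetype": "text/html",
        "httpresponsecode": "301",
        "digest": "3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ",
        "url": "http://www.bbc.co.uk/news/",
    },
    {
        "urlkey": "uk,co,bbc)/news",
        "capturedate": "20180228215703",
        "mimetype": "warc/revisit",
        "httpresponsecode": "0",
        "digest": "3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ",
        "url": "https://www.bbc.co.uk/news",
    },
]


def resolve_revisit(revisit: dict, records: list) -> Optional[dict]:
    """Hypothetical helper: return the most recent earlier non-revisit
    capture that shares the revisit's urlkey and digest, or None."""
    candidates = [
        r
        for r in records
        if r["mimetype"] != "warc/revisit"
        and r["urlkey"] == revisit["urlkey"]
        and r["digest"] == revisit["digest"]
        and r["capturedate"] <= revisit["capturedate"]
    ]
    # 14-digit timestamps sort correctly as plain strings.
    return max(candidates, key=lambda r: r["capturedate"], default=None)


original = resolve_revisit(CDX_RECORDS[1], CDX_RECORDS)
```

With these two records, the revisit at 20180228215703 resolves to the 301 response captured at 20180126121046, which is consistent with the redirect seen during playback.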

anjackson added the bug label Mar 2, 2018

anjackson commented Mar 2, 2018

Hah, OpenWayback is also having trouble, but returns a Resource Not Available error instead. I think this might be some trailing-slash-on/off + http/https redirect loop, due to the way we're canonicalising URLs? The same thing happens if you start at

http://192.168.45.25:8081/qa-access/20180228215703/https://www.bbc.co.uk/news/

Because the first request, for

http://192.168.45.25:8081/qa-access/20180228215703mp_/https://www.bbc.co.uk/news/

gets redirected to:

http://192.168.45.25:8081/qa-access/20180228215703mp_/http://www.bbc.co.uk/news

and the looping starts.

Any idea what a properly-indexed version should look like!? The redirecturl field is suspiciously empty.

anjackson commented

I found an older issue indicating that the redirecturl field should be unnecessary: iipc/openwayback#114

To be clear, if I go directly to the non-revisit record, it does work:

http://192.168.45.25:8081/qa-access/20180126120357/http://www.bbc.co.uk/news

And if I go to other instances (different HTML), it works. So we seem to have a mixture of slash-to-no-slash redirects and HTML responses for the same URL key (depending on whether the original URL was http://www.bbc.co.uk/news/ or http://www.bbc.co.uk/news). So maybe this is a problem with OutbackCDX's URL normalisation?
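
To illustrate how both forms of the URL can collapse onto one key, here is a deliberately simplified canonicaliser. It is a sketch under assumptions, not OutbackCDX's real normalisation: the name `simple_urlkey` is made up, and it only drops the scheme, a leading `www.`, and a trailing slash before reversing the host, whereas real SURT canonicalisation handles many more cases.

```python
from urllib.parse import urlsplit


def simple_urlkey(url: str) -> str:
    """Hypothetical, heavily simplified SURT-style key for illustration."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    # Reverse the host labels: bbc.co.uk -> uk,co,bbc
    reversed_host = ",".join(reversed(host.split(".")))
    # Ignore a trailing slash, but keep "/" for the site root.
    path = parts.path.rstrip("/") or "/"
    return f"{reversed_host}){path}"


variants = [
    "http://www.bbc.co.uk/news",
    "http://www.bbc.co.uk/news/",
    "https://www.bbc.co.uk/news",
    "https://www.bbc.co.uk/news/",
]
keys = {simple_urlkey(u) for u in variants}
```

All four variants map to the single key `uk,co,bbc)/news`, so a 301 captured for the trailing-slash form and HTML captured for the no-slash form end up interleaved under the same key.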

ikreymer commented Mar 2, 2018

Turns out the issue was likely caused by the pywb 'self-redirect' check not running, because the status code is set to '0' by the XmlQuery CDX. Changing the self-redirect check to run whenever the status code is not 2xx, 4xx or 5xx should catch this case.
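
The change described above can be sketched roughly as follows. This is an illustration only, not the actual pywb code: the function names, the way URLs are compared, and the "3xx only" form of the earlier check are all assumptions made for the example.

```python
from urllib.parse import urlsplit


def check_only_3xx(status: str) -> bool:
    """Assumed earlier behaviour: only 3xx captures are checked."""
    return status.startswith("3")


def check_unless_2_4_5(status: str) -> bool:
    """Proposed behaviour: check unless the status starts with 2, 4 or 5."""
    return status[:1] not in ("2", "4", "5")


def comparison_key(url: str) -> str:
    """Hypothetical comparison key ignoring scheme and trailing slash."""
    parts = urlsplit(url)
    return (parts.hostname or "").lower() + parts.path.rstrip("/")


def is_self_redirect(request_url: str, status: str, location: str,
                     should_check=check_unless_2_4_5) -> bool:
    """Return True when a capture redirects back to the requested URL."""
    if not should_check(status) or not location:
        return False
    return comparison_key(request_url) == comparison_key(location)


# The revisit record from this issue: status '0' from the CDX, with the
# Location header of the original 301 response pointing at the same URL.
request_url = "http://www.bbc.co.uk/news"
location = "http://www.bbc.co.uk/news"

missed = is_self_redirect(request_url, "0", location, check_only_3xx)
caught = is_self_redirect(request_url, "0", location)
```

Under these assumptions, a status of '0' slips past a 3xx-only check (`missed` is False) but is flagged by the broader one (`caught` is True), while ordinary 2xx captures are still left alone.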

N0taN3rd pushed a commit to webrecorder/pywb that referenced this issue Sep 3, 2019
…es not start with 2, 4, 5,

to more aggressively check invalid status codes, should fix ukwa/ukwa-pywb#21