Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CDX redirect field not populated by cdx-indexer #114

Closed
ato opened this issue May 5, 2014 · 6 comments
Closed

CDX redirect field not populated by cdx-indexer #114

ato opened this issue May 5, 2014 · 6 comments
Labels

Comments

@ato
Copy link
Member

ato commented May 5, 2014

In commit d258e24 the CDX indexer was changed to not populate the redirect URL ("r") field in the output CDX. The reasoning from the commit message is that the redirect field is not necessary for the replay of redirects as it falls back to reading the (W)ARC headers. However it is necessary for SelfRedirectFilter which prevents redirect loops introduced by URL canonicalisation.

We noticed this when we upgraded the version of Wayback we were using to generate CDX files. Many website index pages (including are our own) were inaccessible as they triggered redirect loops (nla.gov.au redirects www.nla.gov.au which is canonicalised back to nla.gov.au). Uncommenting the setRedirectUrl line in d258e24 and reindexing resolved the problem.

ato added a commit to ato/openwayback that referenced this issue May 7, 2014
This was disabled in d258e24 however
the field is required by SelfRedirectFilter in order to prevent redirect
loops introduced by URL canonicalisation.

Fixes iipc#114
@anjackson
Copy link
Member

As @ikreymer has left the project, he may not be able to verify that this can be re-enabled without causing problems elsewhere. I'm happy to accept it, as it seems reasonable and any downstream problems would show up soon enough. However, I'd ideally like someone else to look at it first, maybe @kris-sigur ?

@kris-sigur
Copy link
Member

I'm not too familiar with this issue, but it may explain certain issues I've observed. I agree, Andy, seems reasonable. Can't say more now. Just hope I wont find myself needing to rebuild all our CDXs to get the redirects back.

@ikreymer
Copy link
Member

As I recall, the redirect field was removed because there was an issue with encoding of the redirect in many cdx files where the redirect field ended up containing a carriage return or other invalid characters. At the time, it was discovered that the redirect field was only used as part of the SelfRedirectFilter. Putting it back is probably ok as long as you check the encoding of the redirect field (there may be a fix needed there also).

What was also discovered is that the SelfRedirectFilter will not catch all self-directs.. in fact, the only way to catch them is to check for a self-redirect after the warc has already been loaded.
The way the cdx is generated, a revisit record may create a self-redirect and as I recall there is no status or revisit field for those.

There is also 'adaptive retry' option (disabled by default) where wayback will try next best capture when the requested one is not available, which too can result in a self redirect.

The main check for revisits should happen here, where the self-redirect is rejected as an error and next capture is tried -- this has generally eliminated all issues of self redirects as far as I remember:

https://github.com/iipc/openwayback/blob/master/wayback-core/src/main/java/org/archive/wayback/webapp/AccessPoint.java#L807

I am curious as to why it wasn't working for you.. maybe the selfRedirectCanonicalizer
(https://github.com/iipc/openwayback/blob/master/wayback-core/src/main/java/org/archive/wayback/webapp/AccessPoint.java#L1696) is not being set?

It should just be whatever the canonicalizer used elsewhere.. maybe it should be set automatically / retrieved from other parts of the system.. I would recommend trying that and seeing if it eliminates the self-redirect issue

@ato
Copy link
Member Author

ato commented May 15, 2014

Ah, that explains it, I completely missed that isSelfRedirect check. I'm feeling a bit foolish now. Thanks for the explanation Ilya.

Okay, I think we hit this because our production Wayback webapp install predates the isSelfRedirect check existing and probably only has SelfRedirectFilter. Looks like it's based on 6026a10 which is after the CDX format change but before isSelfRedirect was added.

So it sounds like old CDXs against the latest Wayback webapp will be fine but newer CDXs against older Waybacks will have self-redirect issues.

I'll look at upgrading our webapp in the next few days. It'll a good opportunity to redo it as a Maven overlay anyway so we can maintain things a bit easier. Currently it's a bunch of checked in precompiled jars which is not wonderful.

Withdrawing this issue. I'll reopen it if we hit the same problem with the newer webapp version.

@ato ato closed this as completed May 15, 2014
@ato
Copy link
Member Author

ato commented May 15, 2014

Also updated the migration guide on the wiki to note that CDXs are only backward compatible (it previously said they were fully portable between different versions).

https://github.com/iipc/openwayback/wiki/Notes-on-migrating-from-older-version-of-Wayback-to-IIPC-OpenWayback

@anjackson
Copy link
Member

That's great, thanks @ato and @ikreymer.

@hhockx hhockx added the invalid label Sep 2, 2014
kngenie added a commit to kngenie/wayback that referenced this issue Feb 19, 2016
FIX: Reinitialize captureSelector after rerunning capture query.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants