-
Notifications
You must be signed in to change notification settings - Fork 275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CDX redirect field not populated by cdx-indexer #114
Comments
As @ikreymer has left the project, he may not be able to verify that this can be re-enabled without causing problems elsewhere. I'm happy to accept it, as it seems reasonable and any downstream problems would show up soon enough. However, I'd ideally like someone else to look at it first, maybe @kris-sigur ? |
I'm not too familiar with this issue, but it may explain certain issues I've observed. I agree, Andy, seems reasonable. Can't say more now. Just hope I wont find myself needing to rebuild all our CDXs to get the redirects back. |
As I recall, the redirect field was removed because there was an issue with encoding of the redirect in many cdx files where the redirect field ended up containing a carriage return or other invalid characters. At the time, it was discovered that the redirect field was only used as part of the SelfRedirectFilter. Putting it back is probably ok as long as you check the encoding of the redirect field (there may be a fix needed there also). What was also discovered is that the SelfRedirectFilter will not catch all self-directs.. in fact, the only way to catch them is to check for a self-redirect after the warc has already been loaded. There is also 'adaptive retry' option (disabled by default) where wayback will try next best capture when the requested one is not available, which too can result in a self redirect. The main check for revisits should happen here, where the self-redirect is rejected as an error and next capture is tried -- this has generally eliminated all issues of self redirects as far as I remember: I am curious as to why it wasn't working for you.. maybe the selfRedirectCanonicalizer It should just be whatever the canonicalizer used elsewhere.. maybe it should be set automatically / retrieved from other parts of the system.. I would recommend trying that and seeing if it eliminates the self-redirect issue |
Ah, that explains it, I completely missed that isSelfRedirect check. I'm feeling a bit foolish now. Thanks for the explanation Ilya. Okay, I think we hit this because our production Wayback webapp install predates the isSelfRedirect check existing and probably only has SelfRedirectFilter. Looks like it's based on 6026a10 which is after the CDX format change but before isSelfRedirect was added. So it sounds like old CDXs against the latest Wayback webapp will be fine but newer CDXs against older Waybacks will have self-redirect issues. I'll look at upgrading our webapp in the next few days. It'll a good opportunity to redo it as a Maven overlay anyway so we can maintain things a bit easier. Currently it's a bunch of checked in precompiled jars which is not wonderful. Withdrawing this issue. I'll reopen it if we hit the same problem with the newer webapp version. |
Also updated the migration guide on the wiki to note that CDXs are only backward compatible (it previously said they were fully portable between different versions). |
FIX: Reinitialize captureSelector after rerunning capture query.
In commit d258e24 the CDX indexer was changed to not populate the redirect URL ("r") field in the output CDX. The reasoning from the commit message is that the redirect field is not necessary for the replay of redirects as it falls back to reading the (W)ARC headers. However it is necessary for SelfRedirectFilter which prevents redirect loops introduced by URL canonicalisation.
We noticed this when we upgraded the version of Wayback we were using to generate CDX files. Many website index pages (including are our own) were inaccessible as they triggered redirect loops (nla.gov.au redirects www.nla.gov.au which is canonicalised back to nla.gov.au). Uncommenting the setRedirectUrl line in d258e24 and reindexing resolved the problem.
The text was updated successfully, but these errors were encountered: