Pywb failing to handle self-redirects from OutbackCDX #865

Open
obrienben opened this issue Sep 22, 2023 · 7 comments
@obrienben

Describe the bug

Pywb is throwing a LiveResourceException when receiving a self-redirect (3xx) from OutbackCDX. This results in Pywb displaying a blank page with the text "Not found".

Steps to reproduce the bug

An example WARC file is attached in this Slack thread:
https://iipc.slack.com/archives/C2NR32PNF/p1691445882952669
Try accessing "http://2020.org.nz/" via the redirect record; Pywb displays a "Not found" message.

Expected behavior

Pywb should process the self-redirect record from OutbackCDX and load the record that the self-redirect points to.

Screenshots

Pywb logs: [image]
OutbackCDX logs: [image]

Environment

  • Not browser specific
  • Occurs in Pywb v2.7.0 onwards
  • Does not occur before v2.7.0
  • OutbackCDX v0.11.0
@wumpus

wumpus commented Sep 22, 2023

Thank you for the excellent bug report, with the pywb version dependence.

I can't access the WARC file on Slack because my org isn't a member. I only have guest access.

tw4l self-assigned this Sep 22, 2023
tw4l moved this from Triage to Ready for Dev in Webrecorder Projects Sep 22, 2023

We experience the same issue with OutbackCDX v. 0.11.1 and PyWb v. 2.7.4. Redirects result in "Not found".

tw4l moved this from Ready for Dev to Dev In Progress in Webrecorder Projects Oct 17, 2023
@obrienben
Author

@wumpus unfortunately the WARC was too big to attach here. Happy to share it with you another way if you'd like it.

@wumpus

wumpus commented Nov 22, 2023

I see Ilya has been assigned by Tessa and I know he does have access to the IIPC Slack. So it's in good hands.

@andreas-koch

We also ran into this issue recently. We use PyWb 2.83. We checked with 2.6.9, and there it worked. Are there any plans to fix this?
Thanks

@HeliosLHC

Echoing the same error with PyWb 2.83 and OutbackCDX 1.0.0.

@ato
Contributor

ato commented Jul 25, 2024

OutbackCDX has a partial workaround for this. If you run it with the --omit-self-redirects command-line option (or pass omitSelfRedirects=true in the query string) it will try to use the CDX redirect field to detect self redirects and hide them.
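
As a rough illustration of the query-string form of this workaround, the Python sketch below builds an OutbackCDX lookup URL with omitSelfRedirects=true appended. The base URL, the collection name `myindex`, and the helper name `outbackcdx_query_url` are assumptions made up for this example; only the `omitSelfRedirects=true` parameter comes from the comment above.

```python
from urllib.parse import urlencode


def outbackcdx_query_url(base, collection, url, omit_self_redirects=True):
    """Build a CDX lookup URL for an OutbackCDX collection.

    `base` and `collection` are placeholders for your own deployment.
    """
    params = {"url": url}
    if omit_self_redirects:
        # Query-string equivalent of the --omit-self-redirects option.
        params["omitSelfRedirects"] = "true"
    return f"{base.rstrip('/')}/{collection}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical local instance and collection name.
    print(outbackcdx_query_url("http://localhost:8080", "myindex",
                               "http://2020.org.nz/"))
```

Fetching the resulting URL with and without the parameter is a quick way to check whether the self-redirect record is actually being hidden for a given index.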

Unfortunately, pywb's cdx-indexer and webrecorder/cdxj-indexer don't populate the redirect field, so if you used them to build your indexes this workaround won't help you. Without the redirect field populated, there's no way for OutbackCDX to detect self-redirects.
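
To make the dependency on the redirect field concrete, here is a minimal Python sketch of how a self-redirect check could work. It is not OutbackCDX's actual logic: the function names are hypothetical, and `crude_key` is a deliberately simplified stand-in for real URL canonicalization (it ignores the scheme and a leading `www.`). The point it illustrates is that when the redirect field is empty or `-`, nothing can be detected.

```python
from urllib.parse import urljoin, urlsplit


def crude_key(url):
    """Simplified lookup key: lowercase host without 'www.', plus path and query.

    An assumption for illustration; real indexes use fuller canonicalization.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    key = host + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key


def is_self_redirect(original_url, status, redirect):
    """Return True if a 3xx capture redirects to a URL with the same key."""
    if not str(status).isdigit() or not 300 <= int(status) <= 399:
        return False
    if not redirect or redirect == "-":
        # Redirect field not populated: a self-redirect cannot be detected.
        return False
    target = urljoin(original_url, redirect)  # resolve relative Location values
    return crude_key(target) == crude_key(original_url)


if __name__ == "__main__":
    # Hypothetical records: an http -> https redirect, and one with no redirect field.
    print(is_self_redirect("http://2020.org.nz/", "301", "https://2020.org.nz/"))
    print(is_self_redirect("http://2020.org.nz/", "301", "-"))
```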

(For reference we use jwarc for CDX indexing plus some weird extra logic to handle our legacy pre-WARC collections.)

lasztoth pushed a commit to lasztoth/pywb that referenced this issue Aug 22, 2024
Projects
Status: Implementing