CDX Indexing failing on weird data #105
Comments
Noting that |
Hmm, worryingly, also a failure from a different WARC.
Ah, well this at least seems to be a less troubling problem:
|
Another one:
Could use some improved diagnostic tools for these records, e.g. are the WARC record length/digest headers consistent with the problematic payload? Or has something else gone wrong somehow? Are the GZ blocks either side of the broken one okay? etc. |
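As a starting point for that kind of diagnostic, here is a minimal sketch using only the Python standard library. It is not part of the indexer. The function names `read_gzip_member` and `check_warc_record` are made up for illustration, and it assumes each WARC record is stored as its own gzip member and carries a `sha1:` base32 `WARC-Block-Digest` header.

```python
import base64
import hashlib
import zlib


def read_gzip_member(data: bytes, offset: int) -> tuple[bytes, int]:
    """Decompress the single gzip member starting at `offset`.

    Returns the decompressed bytes and the number of compressed bytes consumed.
    Raises zlib.error if the member is corrupt or truncated.
    """
    d = zlib.decompressobj(wbits=31)  # 31 = expect a gzip header and trailer
    out = d.decompress(data[offset:])
    if not d.eof:
        raise zlib.error("gzip member is truncated")
    consumed = len(data) - offset - len(d.unused_data)
    return out, consumed


def check_warc_record(record: bytes) -> dict:
    """Compare the declared Content-Length and WARC-Block-Digest with the actual block."""
    head, sep, rest = record.partition(b"\r\n\r\n")
    if not sep:
        return {"error": "no header/block separator found"}
    headers = {}
    for line in head.split(b"\r\n")[1:]:  # skip the "WARC/1.0" version line
        name, _, value = line.partition(b":")
        headers[name.strip().lower().decode("latin-1")] = value.strip().decode("latin-1")
    # A well-formed record ends with two CRLFs after the block.
    block = rest[:-4] if rest.endswith(b"\r\n\r\n") else rest
    try:
        declared_length = int(headers.get("content-length", ""))
    except ValueError:
        declared_length = None  # header missing or not a number
    report = {
        "declared_length": declared_length,
        "actual_length": len(block),
        "length_ok": declared_length == len(block),
    }
    declared_digest = headers.get("warc-block-digest", "")
    if declared_digest.lower().startswith("sha1:"):
        actual = base64.b32encode(hashlib.sha1(block).digest()).decode("ascii")
        report["digest_ok"] = declared_digest[5:] == actual
    return report
```

To check the GZ blocks either side, call `read_gzip_member` again at `offset + consumed` for the following record, and at the preceding record's offset (taken from the CDX) for the one before. A `zlib.error` from either call suggests the damage is not confined to one record.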
And another!
Adding to the excluded set. |
Another
|
Another
|
Another
|
A different type of error:
|
Mapper failure:
|
and
|
Patching the indexer to skip and note the bad status codes... See 2.3.3 and 2.3.4. |
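For illustration only, the skip-and-note behaviour might look roughly like the sketch below. This is not the actual patch in 2.3.3 or 2.3.4. `parse_status` and `index_records` are hypothetical names, and the sketch assumes the indexer sees each record as an identifier plus the raw HTTP status line.

```python
import re

# Three ASCII digits with a leading 1-5, as in an HTTP status line.
_STATUS_RE = re.compile(r"[1-5][0-9]{2}")


def parse_status(status_line: bytes):
    """Return the integer status code from an HTTP status line, or None if malformed."""
    parts = status_line.decode("latin-1").strip().split(" ", 2)
    if len(parts) < 2 or not parts[0].startswith("HTTP/"):
        return None
    return int(parts[1]) if _STATUS_RE.fullmatch(parts[1]) else None


def index_records(records, skipped):
    """Yield (record_id, status) for good records and note bad ones in `skipped`."""
    for record_id, status_line in records:
        status = parse_status(status_line)
        if status is None:
            # Keep the raw line so the record can be found and re-processed later.
            skipped.append((record_id, status_line))
            continue
        yield record_id, status
```

Collecting the skipped records, rather than dropping them silently, keeps a list of WARCs to revisit once the indexer handles these cases properly.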
Mapper failure:
|
Okay, this is still too many errors to handle manually. I'm creating |
As Alex pointed out on the IIPC Slack, this actually looks like a problem with the web host company rather than Heritrix, thankfully. e.g. this fragment is a log of the crawl activity, appearing after the content:
|
Spotted a small error in |
System skips errors now, but we still need to improve the CDX indexer and re-process the marked WARCs at some point. e.g. this query can be used to find the difficult cases: |
The CDX backfill is hitting problems. When submitting to OutbackCDX, we see:
For comparison, a good CDX line looks like this:
So, we can see that the content type is missing (-), and a malformed content type sits where the status code should be.
The WARC record from BL-20180614200107699-01321-63
ukwa-h3-pulse-weekly8443.warc.gz at 598530890 compressed length 2479 looks like: