Invalid WARCs are silently accepted instead of raising an error #127
Comments
Here's a list of things that I think warcio should validate before even emitting an
Then, upon reading from
If any of these points fail, an exception should be raised. As far as I can tell, only the first point is partially (no CRLF check) implemented so far.
I took a quick look at how this could be implemented. warcio uses the same code for WARC and HTTP header parsing. I'll take a shot at adding a
One interface thing to keep in mind is that looping over an iterator cannot be continued if the iterator raises. That's why warcio's digest verification has a complicated interface, with 4 options: don't check (default), record problems but carry on, record problems and print something to stderr but carry on, and finally raise on problems.
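To illustrate the point about iterators, here is a minimal sketch, not warcio's actual code. `OnProblem`, `Record`, and `iter_records` are hypothetical names made up for this example, and the SHA-1 comparison is only a stand-in for real digest verification. It shows the four modes described above and why raising from inside a generator ends the loop for good.

```python
import hashlib
import sys
from dataclasses import dataclass, field
from enum import Enum
from typing import Iterable, Iterator, List


class OnProblem(Enum):
    """Hypothetical policy names mirroring the four options described above."""
    IGNORE = "ignore"  # don't check at all
    RECORD = "record"  # note the problem on the record and carry on
    LOG = "log"        # note it, print to stderr, and carry on
    RAISE = "raise"    # abort iteration with an exception


@dataclass
class Record:
    """Toy stand-in for a WARC record: a payload plus its declared SHA-1 hex digest."""
    payload: bytes
    declared_sha1: str
    problems: List[str] = field(default_factory=list)


def iter_records(records: Iterable[Record],
                 on_problem: OnProblem = OnProblem.IGNORE) -> Iterator[Record]:
    for rec in records:
        if on_problem is not OnProblem.IGNORE:
            actual = hashlib.sha1(rec.payload).hexdigest()
            if actual != rec.declared_sha1:
                msg = f"digest mismatch: declared {rec.declared_sha1}, got {actual}"
                if on_problem is OnProblem.RAISE:
                    # Once this propagates, the generator is finished and
                    # the caller cannot ask it for the following records.
                    raise ValueError(msg)
                rec.problems.append(msg)
                if on_problem is OnProblem.LOG:
                    print(msg, file=sys.stderr)
        yield rec
```

With `OnProblem.RAISE`, calling `next()` again after the exception only produces `StopIteration`, so any records after the bad one are unreachable through that iterator. With `OnProblem.RECORD` or `OnProblem.LOG`, the caller sees every record and inspects `rec.problems` afterwards.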
Right. But should the iterator be resumable if the underlying stream is not a valid WARC file like in the examples above? For digest verification, it makes sense to log digest mismatches and check the rest of the file. Similarly, content type mismatches or invalid payload data (e.g. corrupted images) can and should be handled downstream. But that isn't really possible or sensible if the file isn't a parsable WARC in the first place.

Generic recovery from such a situation isn't possible either. I've had a case in the past where a record in the middle of a WARC was truncated for unknown reasons. Fortunately, the file used per-record compression, so some nasty processing made it possible to find the next record and then produce a new file without the offending record. But that's not possible in the general case because the file might be compressed as a whole rather than per-record or, even worse, gzip member boundaries might be offset entirely compared to the WARC record boundaries. You can't simply decompress everything and then search for

I suppose it makes sense to split the points mentioned above into two categories. First, there are hard parsing errors. These are errors that are absolutely impossible to recover from. For example, if a file doesn't start with

Second, there are softer parsing errors. Examples of this include header names that aren't valid UTF-8, missing any of the other required header fields, header lines (that aren't continuation lines) missing a colon, or LF instead of CRLF as line endings.
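The hard/soft split could look roughly like the sketch below. It is not warcio's API: `HardParseError` and `check_header_block` are hypothetical names. Treating a missing `WARC/` prefix as the hard error is an assumption based on the `WARC/1.0` version line quoted in the issue description. The check for missing required header fields is left out to keep the example short.

```python
from typing import List


class HardParseError(Exception):
    """Hypothetical exception: the data is not a parsable WARC record at all."""


def check_header_block(block: bytes) -> List[str]:
    """Raise on a hard error; return a list of soft problems otherwise.

    `block` is assumed to hold one record's version line and header lines.
    """
    # Assumption: a record must begin with a version line such as WARC/1.0.
    if not block.startswith(b"WARC/"):
        raise HardParseError("record does not start with a WARC version line")

    soft: List[str] = []
    lines = block.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # artifact of splitting a block that ends with a newline

    for number, line in enumerate(lines, start=1):
        if line.endswith(b"\r"):
            line = line[:-1]
        else:
            soft.append(f"line {number}: not terminated by CRLF")
        if number == 1 or line == b"":
            continue  # version line, or the blank line that ends the headers
        if line[:1] in (b" ", b"\t"):
            continue  # continuation line, no colon expected
        name, sep, _value = line.partition(b":")
        if not sep:
            soft.append(f"line {number}: missing colon")
            continue
        try:
            name.decode("utf-8")
        except UnicodeDecodeError:
            soft.append(f"line {number}: header name is not valid UTF-8")
    return soft
```

A caller could then decide per policy whether soft problems are ignored, logged, or turned into exceptions, while a `HardParseError` always stops processing.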
btw I have a not-quite-finished develop-warcio-test branch in the repo that is capable of complaining about soft parsing errors. There are tons of WARCs out there with problems.
warcio accepts various WARCs that are not actually valid. There is some validation on the beginning of the content, so it looks like the smallest possible content that passes is just
WARC/1.0
(or another version), without even a line termination. Truncations within headers or payload are also silently ignored. As a result, such mutilated files also pass warcio check:
Here's a more thorough example. Every single one of these samples is invalid and should raise an exception on parsing (although arguably for some of these, namely the ones where the truncation occurs at the beginning of or within the payload, it should only happen on trying to read the raw or decoded stream):