warcio check does not raise error when GZip records are truncated #138

anjackson · 2021-11-22T14:12:08Z

One of the most likely problems we see is failed transfers leading to truncated WARC.GZ files. We can spot this with gunzip -t but it would be good if warcio check also raised this as a validation error. My tests so far have indicated that the warcio and cdxj-indexer etc. tools all skip over these errors silently.


edsu · 2023-08-03T15:45:38Z

This came up recently in IIPC Slack when trying to diagnose why warchaeology was reporting a corrupted WARC file, and warcio was not. It appeared that the WARC file was truncated as a result of a browsertrix-crawler container exiting abnormally, and not closing the GZIP file properly...

In case it's helpful to have a test script (which doesn't emit a warning that I can see):

from warcio.archiveiterator import ArchiveIterator

with open('test.warc.gz', 'rb') as stream:
    for i, record in enumerate(ArchiveIterator(stream)):
        print(i, record.rec_headers.get_header('WARC-Target-URI'))
        if record.rec_type == 'response':
            content = record.content_stream().read()

And here's a test file: test.warc.gz

gunzip on the other hand does notice:

$ gunzip --test test.warc.gz
gunzip: truncated input
gunzip: test.warc.gz: uncompress failed

anjackson · 2023-08-14T12:01:37Z

Wow, I'd totally forgotten about this!

Seems like there's a hook in the underlying Python library to spot this case: https://docs.python.org/3/library/zlib.html#zlib.Decompress.eof

Decompress.eof
A boolean indicating whether the end of the compressed data stream has been reached.
This makes it possible to distinguish between a properly formed compressed stream, and an incomplete or truncated one.
New in version 3.3.

But it's not clear to me how to weave that in here...

warcio/warcio/archiveiterator.py

Lines 108 to 140 in aa702cb

while True:
    try:
        self.record = self._next_record(self.next_line)
        if raise_invalid_gzip:
            self._raise_invalid_gzip_err()

        yield self.record

    except EOFError:
        empty_record = True

    self.read_to_end()

    if self.reader.decompressor:
        # if another gzip member, continue
        if self.reader.read_next_member():
            continue

        # if empty record, then we're done
        elif empty_record:
            break

        # otherwise, probably a gzip
        # containing multiple non-chunked records
        # raise this as an error
        else:
            raise_invalid_gzip = True

    # non-gzip, so we're done
    elif empty_record:
        break

self.close()
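
As a rough illustration of how Decompress.eof behaves on its own, outside warcio, here is a minimal sketch. It assumes a single gzip member held entirely in memory, and the helper name ends_cleanly is made up for this example (it is not part of warcio or zlib).

```python
import gzip
import zlib

# Build a small gzip member in memory, then cut off its tail to simulate
# a failed transfer. The payload is arbitrary filler, not a real WARC record.
complete = gzip.compress(b"WARC/1.0\r\n" * 100)
truncated = complete[:-10]


def ends_cleanly(data):
    """Hypothetical helper: True if data holds one complete gzip member."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    # Truncated input does not raise here; zlib simply returns what it
    # could decode and leaves eof set to False.
    decompressor.decompress(data)
    return decompressor.eof


print(ends_cleanly(complete))
print(ends_cleanly(truncated))
```

This prints True for the complete buffer and False for the truncated one, since zlib never reaches the end of the deflate stream or the gzip trailer in the second case.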

wumpus · 2023-08-16T14:42:59Z

@edsu what record in test.warc.gz is the truncated one? And where can I find warchaeology? Thanks.

edsu · 2023-08-17T16:28:05Z

I believe it's the last record. If you try to gunzip the file, you should see the error right at the end?

I'm not really familiar with it but here is the warchaeology repo: https://github.com/nlnwa/warchaeology

ikreymer · 2023-08-18T05:18:28Z

@edsu thanks for adding a simple test and @anjackson for looking up the .eof property!

With that, I think detecting this case can be done as follows:

diff --git a/warcio/archiveiterator.py b/warcio/archiveiterator.py
index 484b7f0..451f182 100644
--- a/warcio/archiveiterator.py
+++ b/warcio/archiveiterator.py
@@ -113,7 +113,13 @@ class ArchiveIterator(six.Iterator):
 
                 yield self.record
 
-            except EOFError:
+            except EOFError as e:
+                if self.reader.decompressor:
+                    if not self.reader.decompressor.eof:
+                        sys.stderr.write("warning: final record appears to be truncated")
+
                 empty_record = True
 
             self.read_to_end()

But what should the desired behavior be more generally?

For warcio check, it seems like it should return an error.
The gunzip behavior is definitely not desirable, as it fails to unzip any record even if only the last one is invalid.
For indexing, it seems like the indexing should still succeed, and maybe print the warning? There are other recoverable errors that are also logged, such as Content-Length mismatches. Should it still return a 1?

It sort of depends on how the WARC is being used:

If the goal is to detect if WARC is valid after transfer, this is definitely an error and should be detected.
If the goal is to index a WARC that already exists, this is more of a warning since not much can be done at that point, and we definitely don't want to invalidate the whole WARC just because of the last record.
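
To make the two cases concrete, here is a sketch of how the same detection could feed either policy. It is only an illustration under some assumptions: it works on concatenated gzip members held in memory rather than on warcio's reader, it treats truncation as the only defect (corrupt bytes would raise zlib.error), and the names count_members, exit_code_for and strict are hypothetical, not warcio API. The warning text is borrowed from the diff above.

```python
import gzip
import sys
import zlib


def count_members(data):
    """Return (complete_members, truncated) for concatenated gzip members."""
    complete = 0
    while data:
        decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
        decompressor.decompress(data)
        if not decompressor.eof:
            # Input ran out before this member's trailer was seen.
            return complete, True
        complete += 1
        # Whatever follows a finished member is the start of the next one.
        data = decompressor.unused_data
    return complete, False


def exit_code_for(data, strict):
    """Hypothetical policy: strict=True for checking, False for indexing."""
    _, truncated = count_members(data)
    if truncated:
        sys.stderr.write("warning: final record appears to be truncated\n")
        return 1 if strict else 0
    return 0


# Two members standing in for two per-record gzip chunks.
first = gzip.compress(b"record one")
second = gzip.compress(b"record two")
damaged = first + second[:-10]

print(count_members(first + second))
print(count_members(damaged))
print(exit_code_for(damaged, strict=True), exit_code_for(damaged, strict=False))
```

In this sketch the complete members before the damaged one are still counted, so an indexer could keep them and only warn, while a check-style caller turns the same condition into a non-zero exit code.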

wumpus self-assigned this Apr 20, 2022

ikreymer added a commit that referenced this issue Mar 21, 2024

add detection of truncated final record, as per #138

162ca0a
