Block digest verification fails on some copied record #123

dlazesz · 2020-12-20T19:42:25Z

I have a WARC archive created with a previous version of warcio library about a year ago. Copying some records to another record is done without error (with the current version of warcio), but the later verification fails. See the attached code and example warc (input.warc.gz):

from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter

print('Validate input')
with open('input.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream, check_digests='raise'):
        pass

with open('input.warc.gz', 'rb') as stream, open('out.warc.gz', 'wb') as out_stream:
    writer = WARCWriter(out_stream, gzip=True, warc_version='WARC/1.1')
    for record in ArchiveIterator(stream, check_digests='raise'):
        # Negate the full condition for simplicity (Select the problematic record)
        if not (record.rec_type == 'response' and record.rec_headers.get_header('WARC-Target-URI') ==
                'https://www.origo.hu/hir-archivum/2019/20190119.html'):
            writer.write_record(record)

print('Validate output without problematic URL')
with open('out.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream, check_digests='raise'):
        pass

with open('input.warc.gz', 'rb') as stream, open('out.warc.gz', 'wb') as out_stream:
    writer = WARCWriter(out_stream, gzip=True, warc_version='WARC/1.1')
    for record in ArchiveIterator(stream, check_digests='raise'):
        # Select the problematic record
        if record.rec_type == 'response' and record.rec_headers.get_header('WARC-Target-URI') == \
         'https://www.origo.hu/hir-archivum/2019/20190119.html':
            writer.write_record(record) 

print('Validate output just the problematic URL')
with open('out.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream, check_digests='raise'):  # This will fail
        pass

The output is the following:

Validate input
Validate output without problematic URL
Validate output just the problematic URL
Traceback (most recent call last):
  File "test.py", line 30, in <module>
    for record in ArchiveIterator(stream, check_digests='raise'):
  File "/usr/local/lib/python3.6/dist-packages/warcio/archiveiterator.py", line 119, in _iterate_records
    self.read_to_end()
  File "/usr/local/lib/python3.6/dist-packages/warcio/archiveiterator.py", line 212, in read_to_end
    b = self.record.raw_stream.read(BUFF_SIZE)
  File "/usr/local/lib/python3.6/dist-packages/warcio/limitreader.py", line 27, in read
    return self._update(buff)
  File "/usr/local/lib/python3.6/dist-packages/warcio/digestverifyingreader.py", line 99, in _update
    self.digest_checker.problem('block digest failed: {}'.format(self.block_digest))
  File "/usr/local/lib/python3.6/dist-packages/warcio/digestverifyingreader.py", line 31, in problem
    raise ArchiveLoadFailed(value)
warcio.exceptions.ArchiveLoadFailed: block digest failed: sha1:HRYQXT3HWIGWZMDEAAGYGEJGK334YEGM

The expected behavior would be to raise exception earlier:

When input.warc.gz is read
When the problematic record is copied to the output.warc.gz

The current state makes the user imply that the output.warc.gz is valid until it is re-read.

BTW: The behavior of check_digests=True equals to check_digests=False which is not what one would expect.

The text was updated successfully, but these errors were encountered:

wumpus · 2020-12-22T02:53:46Z

There are multiple possible bugs here, one possibility is that the copy is writing the wrong block digest, perhaps because it changed the block and kept the same digest. If I can see the input file, it would be helpful.

The check_digests API bug is a separate one, I'm not sure how I made that mistake, but I'll open a separate bug for it (#124)

dlazesz · 2020-12-22T09:08:58Z

@wumpus

If I can see the input file, it would be helpful.

The input file is attached to the OP and the bug should be reproducible with it: https://github.com/webrecorder/warcio/files/5721385/input.warc.gz

Thank you for investigating the issue!

wumpus · 2020-12-23T00:11:41Z

Thanks, I didn't notice input.warc.gz was a link.

The difference between input.warc and output.warc is that they have the same digest, but the content length is one octet shorter for out.warc. And lo and behold, right at the top, I see

input: HTTP/1.1 200 ^M # trailing whitespace, no OK
out: HTTP/1.1 200^M # no trailing whitespace

Which is to say, there's trailing whitespace in the http headers in input and not in out.

How... interesting! I was thinking my digest-checking code was the guilty party, but instead it could be that the copy is dropping that http header trailing whitespace while repeating the same digest? Changing the output by dropping trailing whitespace is dodgy, repeating the digest is much dodgier.

wumpus · 2020-12-23T00:15:09Z

While I'm here I'll also mention that Transfer-Encoding: chunked is in both input and output headers, but it's not actually chunked. This is a common problem in warcs. warcio happily reads these warcs by falling back to non-chunked.

wumpus · 2020-12-23T19:56:18Z

@ikreymer I see two choices, you probably have an opinion:

Save the verbatim header bytes while reading, and use them on writing if the user has not modified anything, keeping the digests exactly the same if it's an exact copy
Keep the current header parsing, which might change the header on reading, and always recompute digests on write

Option 1 is a "first, do no harm" philosophy, but it will be a little ugly to notice changes to the headers between read and write. Option 2 is a small code change.

wumpus mentioned this issue Dec 22, 2020

check_digests is under-documented, confusing everyone #124

Open

wsdookadr mentioned this issue Jul 5, 2022

Patching WARCs using warcio #149

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Block digest verification fails on some copied record #123

Block digest verification fails on some copied record #123

dlazesz commented Dec 20, 2020

wumpus commented Dec 22, 2020 •

edited

Loading

dlazesz commented Dec 22, 2020

wumpus commented Dec 23, 2020

wumpus commented Dec 23, 2020 •

edited

Loading

wumpus commented Dec 23, 2020

Block digest verification fails on some copied record #123

Block digest verification fails on some copied record #123

Comments

dlazesz commented Dec 20, 2020

wumpus commented Dec 22, 2020 • edited Loading

dlazesz commented Dec 22, 2020

wumpus commented Dec 23, 2020

wumpus commented Dec 23, 2020 • edited Loading

wumpus commented Dec 23, 2020

wumpus commented Dec 22, 2020 •

edited

Loading

wumpus commented Dec 23, 2020 •

edited

Loading