NOT READY: warcio test #66

Open
wants to merge 71 commits into
base: develop

wumpus
Collaborator

@wumpus wumpus commented Jan 26, 2019

An opinionated WARC standards-conformance tool.

Ready for review - I have yet to work on test coverage.

$ warcio test test/data/*.warc.gz test/data/*.warc
test/data/example-bad-non-chunked.warc.gz
  saw exception 
    ERROR: non-chunked gzip file detected, gzip block continues
    beyond single record.

    This file is probably not a multi-member gzip but a single gzip file.

    To allow seek, a gzipped WARC must have each record compressed into
    a single gzip member and concatenated together.

    This file is likely still valid and can be fixed by running:

    warcio recompress <path/to/file> <path/to/new_file>
  skipping rest of file
test/data/example-resource.warc.gz
  WARC-Record-ID <urn:uuid:6e7f60da-2c7b-11e7-aaf7-0242ac120007>
    WARC-Type resource
    digest pass
    comment: unknown field, no validation performed Warc-Referer https://webrecorder.io/temp-GRWZVUTV/temp/test/record/http://example.com/
    comment: unknown field, no validation performed Warc-User-Agent Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36
test/data/example.warc.gz
  WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    digest not present
    error: WARC-IP-Address should be used for http and https requests
  WARC-Record-ID <urn:uuid:e6e395ca-0221-11e7-a18d-0242ac120005>
    WARC-Type revisit
    digest present but not checked
    recommendation: missing recommended header WARC-Refers-To
    comment: field was introduced after this warc version WARC-Refers-To-Target-URI http://example.com/ 1.0
    comment: field was introduced after this warc version WARC-Refers-To-Date 2017-03-06T04:02:06Z 1.0
  WARC-Record-ID <urn:uuid:e6e41fea-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    digest not present
    error: WARC-IP-Address should be used for http and https requests
test/data/example-wget-bad-target-uri.warc.gz
  WARC-Record-ID <urn:uuid:CEF11DC9-8D86-4F4B-9B8C-2235515B4537>
    WARC-Type request
    digest pass
    error: uri must not be within <> warc-target-uri <http://example.com/>
    error: invalid uri scheme, bad character warc-target-uri <http://example.com/>
  WARC-Record-ID <urn:uuid:FD8A6D04-AF8A-4A36-A889-8094487CDF2D>
    WARC-Type response
    payload digest failed sha1:B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A
    error: uri must not be within <> warc-target-uri <http://example.com/>
    error: invalid uri scheme, bad character warc-target-uri <http://example.com/>
  WARC-Record-ID <urn:uuid:E5AC383F-F107-47BC-99B7-176FD8DE6E94>
    WARC-Type metadata
    digest pass
    error: uri must not be within <> warc-target-uri <metadata://gnu.org/software/wget/warc/MANIFEST.txt>
    error: invalid uri scheme, bad character warc-target-uri <metadata://gnu.org/software/wget/warc/MANIFEST.txt>
  WARC-Record-ID <urn:uuid:543BCA4F-A305-4383-B511-0BCF23F7AD8D>
    WARC-Type resource
    digest pass
    error: uri must not be within <> warc-target-uri <metadata://gnu.org/software/wget/warc/wget_arguments.txt>
    error: invalid uri scheme, bad character warc-target-uri <metadata://gnu.org/software/wget/warc/wget_arguments.txt>
  WARC-Record-ID <urn:uuid:CCD67DB5-13FA-447B-BF05-BF1BDC8ED3A0>
    WARC-Type resource
    digest pass
    error: uri must not be within <> warc-target-uri <metadata://gnu.org/software/wget/warc/wget.log>
    error: invalid uri scheme, bad character warc-target-uri <metadata://gnu.org/software/wget/warc/wget.log>
test/data/example-wrong-chunks.warc.gz
  saw exception Invalid WARC record, first line: <!doctype html>
  skipping rest of file
test/data/post-test.warc.gz
  WARC-Record-ID <urn:uuid:59a6b068-cbc2-4767-9525-33043d2709c7>
    WARC-Type request
    digest pass
    error: WARC-IP-Address should be used for http and https requests
  WARC-Record-ID <urn:uuid:5eb8ee92-cda1-4503-a7a3-c63f1ab6515b>
    WARC-Type request
    digest pass
    error: WARC-IP-Address should be used for http and https requests
  WARC-Record-ID <urn:uuid:c79a62e3-5a4b-450d-a093-3a7fefa09664>
    WARC-Type request
    digest pass
    error: WARC-IP-Address should be used for http and https requests
test/data/example-digest-bad.warc
  WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    payload digest failed: sha1:1112H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
    error: WARC-IP-Address should be used for http and https requests
  WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    digest pass
    error: WARC-IP-Address should be used for http and https requests
  WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    digest pass
    error: WARC-IP-Address should be used for http and https requests
  WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    digest pass
    error: WARC-IP-Address should be used for http and https requests
test/data/example-iana.org-chunked.warc
  WARC-Record-ID <urn:uuid:c46fbf5f-0876-4652-a348-e9b6c322eabb>
    WARC-Type request
    digest pass
    error: WARC-IP-Address should be used for http and https requests
test/data/example-trunc.warc
  WARC-Record-ID <urn:uuid:a9c51e3e-0221-11e7-bf66-0242ac120005>
    WARC-Type response
    block digest failed: sha1:DR5MBP7OD3OPA7RFKWJUD4CTNUQUGFC5
    payload digest failed sha1:G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK
    WARNING: Record not followed by newline, perhaps Content-Length is invalid
    Offset: 2560
    Remainder: b'\x00\x00\r\n'
  WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    digest not present
    error: WARC-IP-Address should be used for http and https requests
test/data/example.warc
  WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    digest not present
    error: WARC-IP-Address should be used for http and https requests
  WARC-Record-ID <urn:uuid:e6e395ca-0221-11e7-a18d-0242ac120005>
    WARC-Type revisit
    digest present but not checked
    recommendation: missing recommended header WARC-Refers-To
    comment: field was introduced after this warc version WARC-Refers-To-Target-URI http://example.com/ 1.0
    comment: field was introduced after this warc version WARC-Refers-To-Date 2017-03-06T04:02:06Z 1.0
  WARC-Record-ID <urn:uuid:e6e41fea-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    digest not present
    error: WARC-IP-Address should be used for http and https requests
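
To illustrate the per-record gzip requirement quoted in the first error above, here is a minimal standard-library sketch that counts the gzip members in a byte string. It is not how `warcio test` detects the problem; `count_gzip_members` and the sample records are hypothetical and exist only for this example. It also only counts members, it does not verify that each member holds exactly one record.

```python
import gzip
import zlib


def count_gzip_members(data):
    """Count the gzip members concatenated in `data` (hypothetical helper)."""
    members = 0
    while data:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(31)
        decomp.decompress(data)
        if not decomp.eof:
            raise ValueError("truncated gzip member")
        members += 1
        # Bytes after the end of this member belong to the next one.
        data = decomp.unused_data
    return members


# Made-up stand-ins for WARC records, for illustration only.
records = [b"WARC/1.0\r\nrecord one\r\n\r\n", b"WARC/1.0\r\nrecord two\r\n\r\n"]

# One member per record, concatenated: a reader can seek to a record offset.
per_record = b"".join(gzip.compress(r) for r in records)
# All records in a single member: the layout the error message describes.
single = gzip.compress(b"".join(records))

print(count_gzip_members(per_record))  # 2
print(count_gzip_members(single))      # 1
```

A file that reports one member but contains several records is the case `warcio recompress` is suggested for.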

@wumpus wumpus mentioned this pull request Jan 26, 2019
@codecov

codecov bot commented Jan 26, 2019

Codecov Report

❗ No coverage uploaded for pull request base (develop@59198eb).
The diff coverage is 86.44%.

@@            Coverage Diff             @@
##             develop      #66   +/-   ##
==========================================
  Coverage           ?   96.19%           
==========================================
  Files              ?       19           
  Lines              ?     2078           
  Branches           ?      390           
==========================================
  Hits               ?     1999           
  Misses             ?       36           
  Partials           ?       43

Impacted Files               Coverage Δ
warcio/archiveiterator.py    100% <ø> (ø)
warcio/tester.py             88.96% <100%> (ø)
warcio/recordloader.py       98.69% <100%> (ø)
warcio/bufferedreaders.py    94.81% <57.89%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 59198eb...fc19c7d. Read the comment docs.

@wumpus
Collaborator Author

wumpus commented Jan 26, 2019

@N0taN3rd traditionally you've been my best reviewer :-)

test/test_tests.py (outdated)
@@ -55,7 +55,7 @@ class ArcWarcRecordLoader(object):
     NON_HTTP_SCHEMES = ('dns:', 'whois:', 'ntp:')
     HTTP_SCHEMES = ('http:', 'https:')

-    def __init__(self, verify_http=True, arc2warc=True):
+    def __init__(self, verify_http=True, arc2warc=True, fixup_bugs=True):
Contributor

In the spirit of how digest checking is handled by ArchiveIterator, fixup_bugs=True should probably default to False

check_digests=False):

Collaborator Author

The existing code did the bug fixup unconditionally; I'm preserving that default.

Digest checking defaults to off because it's expensive and @ikreymer prefers people remain in the dark as to how many invalid warcs there are :-)
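
For a sense of why the check is expensive: verifying a digest means reading and hashing every byte of the block or payload. Below is a rough sketch of a check against a `sha1:` plus base32 value like the ones in the test output. `sha1_base32` and `digest_matches` are hypothetical helpers written for this comment, not warcio functions, and the exact-string comparison is an assumption.

```python
import base64
import hashlib
import io


def sha1_base32(stream, block_size=16384):
    """Hash a stream in chunks and format it as 'sha1:<base32>' (hypothetical helper)."""
    hasher = hashlib.sha1()
    # Every byte has to be read and hashed, which is where the cost comes from.
    for chunk in iter(lambda: stream.read(block_size), b""):
        hasher.update(chunk)
    return "sha1:" + base64.b32encode(hasher.digest()).decode("ascii")


def digest_matches(stream, declared):
    """Compare the computed digest with the declared header value (exact match assumed)."""
    return sha1_base32(stream) == declared


payload = b"<!doctype html><html></html>"
declared = sha1_base32(io.BytesIO(payload))

print(digest_matches(io.BytesIO(payload), declared))           # True
print(digest_matches(io.BytesIO(payload + b"\x00"), declared))  # False
```

A real checker would also need to decide how to handle other algorithms and encodings; this sketch ignores both.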

warcio/tester.py (outdated)
@@ -43,12 +43,13 @@ class ArchiveIterator(six.Iterator):
     def __init__(self, fileobj, no_record_parse=False,
                  verify_http=False, arc2warc=False,
                  ensure_http_headers=False, block_size=BUFF_SIZE,
-                 check_digests=False):
+                 check_digests=False, fixup_bugs=True):
Contributor

fixup_bugs=True -> fixup_bugs=False in the spirit of check_digests=False
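
To make the trade-off concrete, here is a small sketch of what the default changes for callers. It assumes the fixup in question is stripping the angle brackets that appear around WARC-Target-URI in the wget test file above; `normalize_target_uri` is a hypothetical function for illustration, not warcio's implementation.

```python
def normalize_target_uri(uri, fixup_bugs=True):
    """Optionally strip surrounding <> from a target URI (hypothetical helper)."""
    if fixup_bugs and uri.startswith("<") and uri.endswith(">"):
        return uri[1:-1]
    return uri


raw = "<http://example.com/>"

# Default on: callers see the repaired value, as in the existing behaviour.
print(normalize_target_uri(raw))                    # http://example.com/
# Default off: callers see the header exactly as it was written.
print(normalize_target_uri(raw, fixup_bugs=False))  # <http://example.com/>
```

With the default on, existing users keep getting repaired values; with it off, a conformance tool sees the raw header without having to opt out.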

@wumpus
Collaborator Author

wumpus commented Jan 30, 2019

@N0taN3rd has done a preliminary review; the main addition since then is some global checks.

At this point I think the code is feature-complete, at least for the things I'm planning for the first pass, and the main work remaining is coverage.
