Indexing Errors with YouTube JSON in POST Request Payload #869

Open
mona-ul opened this issue Oct 26, 2023 · 0 comments

Describe the bug

Indexing a WARC file fails with pywb (wb-manager reindex, cdx-indexer) as well as with cdxj-indexer. All indexing methods return an error. (“Error: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte”)

WARC file: https://storage.googleapis.com/rhizome-hurl/stuttering-by-nathaniel-stern-20230821133816.warc

The WARC Record causing problems seems to be a POST Request, with a payload containing query data in JSON.
Identified WARC Records causing the error:

cdxj-indexer Error Message

cdxj-indexer -p [warc file] > [index file] 
Error parsing: {"context":{"client":{"hl":"en","gl":"US","clientName":1,"clientVersion":"2.20230815.00.00","configInfo": [...]

The error refers to the payload of the Request Record urn:uuid:4bdccb83-e6e4-43d6-887d-51b81d6d1e90. (Full payload attached above)

wb-manager reindex Error Message

wb-manager reindex [collection]
Error: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

cdx-indexer Error Message

cdx-indexer -p [WARC file]
[...]
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mona/.local/bin/cdx-indexer", line 8, in <module>
    sys.exit(main())
  File "/home/mona/.local/lib/python3.10/site-packages/pywb/indexer/cdxindexer.py", line 468, in main
    write_multi_cdx_index(cmd.output, cmd.inputs,
  File "/home/mona/.local/lib/python3.10/site-packages/pywb/indexer/cdxindexer.py", line 306, in write_multi_cdx_index
    for entry in entry_iter:
  File "/home/mona/.local/lib/python3.10/site-packages/pywb/indexer/archiveindexer.py", line 342, in __call__
    for entry in entry_iter:
  File "/home/mona/.local/lib/python3.10/site-packages/pywb/indexer/archiveindexer.py", line 215, in join_request_records
    for entry in entry_iter:
  File "/home/mona/.local/lib/python3.10/site-packages/pywb/indexer/archiveindexer.py", line 188, in create_record_iter
    post_query = MethodQueryCanonicalizer(method,
  File "/home/mona/.local/lib/python3.10/site-packages/pywb/warcserver/inputrequest.py", line 281, in __init__
    sys.stderr.write("Ignoring query, error parsing as json: " + query.decode("utf-8") + "\n")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
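
A possible explanation, not verified against this record: byte 0x8b in position 1 matches the second byte of the gzip magic number (0x1f 0x8b), so the request payload may be gzip-compressed while being decoded as UTF-8 text. The sketch below shows how such a payload could be detected and printed without raising. `describe_payload` is a hypothetical helper for illustration, not part of pywb, cdxj-indexer, or warcio.

```python
import gzip
import zlib

# First two bytes of every gzip stream.
GZIP_MAGIC = b"\x1f\x8b"


def describe_payload(payload: bytes) -> str:
    """Return a printable form of a request payload (hypothetical helper).

    Assumption: a payload starting with the gzip magic number is
    gzip-compressed. If decompression fails, the raw bytes are kept.
    """
    if payload.startswith(GZIP_MAGIC):
        try:
            payload = gzip.decompress(payload)
        except (OSError, EOFError, zlib.error):
            pass  # not valid gzip after all; fall back to the raw bytes
    # errors="replace" keeps the error message itself from raising
    return payload.decode("utf-8", errors="replace")
```

Decoding with `errors="replace"` in the warning message would at least keep the indexer from aborting on a payload it already intends to ignore.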

Steps to reproduce the bug

Environment

  • OS: Ubuntu 22.04
  • Version pywb 2.7.4
  • Version cdxj-indexer 1.4.5
  • Version warcio 1.7.4

Additional context

Identification of Error Records

Each indexing method writes a partial index.cdxj before failing. Comparing the entries of the WARC file and the index file in order, the first record in the WARC file that is missing from the index was identified as the record causing problems. To verify this, the record was removed using warcio. After that, indexing worked.
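
The comparison described above can be sketched as follows. This is a minimal illustration, assuming each WARC record and each index entry has already been reduced to a comparable key (for example, target URI plus timestamp). The function name and the choice of key are hypothetical.

```python
from typing import Iterable, Optional


def first_unindexed(warc_keys: Iterable[str],
                    indexed_keys: Iterable[str]) -> Optional[str]:
    """Return the first WARC key, in file order, missing from the index.

    Assumption: both inputs use the same key format. Returns None when
    every WARC key is present in the index.
    """
    indexed = set(indexed_keys)
    for key in warc_keys:
        if key not in indexed:
            return key
    return None
```

Because the index is only partly written, the first missing key is the candidate record at which the indexer stopped.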

WARC-Processing with warcio

The WARC file and the identified error records were processed with warcio, and no UTF-8 error occurred.

import sys

from warcio.archiveiterator import ArchiveIterator

# Path to the WARC file, passed as the first command-line argument
warc1_path = sys.argv[1]

with open(warc1_path, 'rb') as stream:
    for i, record in enumerate(ArchiveIterator(stream)):
        print(i, record.rec_headers.get_header('WARC-Target-URI'))
        print(i, record.rec_headers.get_header('WARC-Record-ID'))
        if record.rec_type == 'request':
            # content_stream() yields the payload with any HTTP
            # transfer or content encoding removed
            content = record.content_stream().read()
            print(content.decode('utf-8'))
