A WARC file cannot be indexed with pywb (wb-manager reindex, cdx-indexer) or with cdxj-indexer. All three indexing methods fail with the same error: “Error: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte”
cdxj-indexer Error Message
The error refers to the payload of the Request Record urn:uuid:4bdccb83-e6e4-43d6-887d-51b81d6d1e90. (Full payload attached above)
wb-manager reindex Error Message
wb-manager reindex [collection]
Error: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
cdx-indexer Error Message
cdx-indexer -p [WARC file]
[...]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/mona/.local/bin/cdx-indexer", line 8, in <module>
    sys.exit(main())
  File "/home/mona/.local/lib/python3.10/site-packages/pywb/indexer/cdxindexer.py", line 468, in main
    write_multi_cdx_index(cmd.output, cmd.inputs,
  File "/home/mona/.local/lib/python3.10/site-packages/pywb/indexer/cdxindexer.py", line 306, in write_multi_cdx_index
    for entry in entry_iter:
  File "/home/mona/.local/lib/python3.10/site-packages/pywb/indexer/archiveindexer.py", line 342, in __call__
    for entry in entry_iter:
  File "/home/mona/.local/lib/python3.10/site-packages/pywb/indexer/archiveindexer.py", line 215, in join_request_records
    for entry in entry_iter:
  File "/home/mona/.local/lib/python3.10/site-packages/pywb/indexer/archiveindexer.py", line 188, in create_record_iter
    post_query = MethodQueryCanonicalizer(method,
  File "/home/mona/.local/lib/python3.10/site-packages/pywb/warcserver/inputrequest.py", line 281, in __init__
    sys.stderr.write("Ignoring query, error parsing as json: " + query.decode("utf-8") + "\n")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
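The exact error text can be reproduced outside pywb, which may help narrow down the cause. Bytes 0x1f 0x8b are the gzip magic number, and since 0x1f is valid ASCII, a strict UTF-8 decode of gzip data fails at position 1 on byte 0x8b. The sketch below assumes (this is not confirmed for the record in question) that the POST body is gzip-compressed. The name describe_payload and the sample body are hypothetical and are not part of pywb; this is one possible way to log such a payload without raising, not pywb's actual code or fix.

```python
import gzip
import json
import zlib


def describe_payload(payload: bytes) -> str:
    """Return a printable form of a request body without raising (hypothetical helper)."""
    # 0x1f 0x8b is the gzip magic number; try to decompress if it is present.
    if payload[:2] == b"\x1f\x8b":
        try:
            payload = gzip.decompress(payload)
        except (OSError, EOFError, zlib.error):
            pass  # not valid gzip after all; fall through to a lossy decode
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
    return payload.decode("utf-8", errors="replace")


# Hypothetical stand-in for the POST body of the failing request record.
body = gzip.compress(json.dumps({"query": "example"}).encode("utf-8"))

try:
    body.decode("utf-8")  # strict decode, as in the last traceback frame
except UnicodeDecodeError as err:
    print(err)

print(describe_payload(body))
```

If the real payload is not gzip data, the helper still returns a string, with undecodable bytes shown as replacement characters.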
Steps to reproduce the bug
Use the tools and commands described above to index the WARC file.
Environment
OS: Ubuntu 22.04
Version pywb 2.7.4
Version cdxj-indexer 1.4.5
Version warcio 1.7.4
Additional context
Identification of Error Records
Each indexing method writes only part of index.cdxj before failing. Comparing the WARC records with the index entries in chronological order, the first WARC record missing from the index was identified as the one causing the problem. To verify this, the record was removed using warcio. After that, indexing worked.
WARC-Processing with warcio
The WARC file and the identified error records were processed with warcio, and no UTF-8 error occurred.
import sys

from warcio.archiveiterator import ArchiveIterator

# Path to the WARC file, passed as the first command-line argument
warc1_path = sys.argv[1]

with open(warc1_path, 'rb') as stream:
    for i, record in enumerate(ArchiveIterator(stream)):
        print(i, record.rec_headers.get_header('WARC-Target-URI'))
        print(i, record.rec_headers.get_header('WARC-Record-ID'))
        if record.rec_type == 'request':
            # Read the request payload and decode it strictly as UTF-8
            content = record.content_stream().read()
            print(content.decode('utf-8'))
WARC file: https://storage.googleapis.com/rhizome-hurl/stuttering-by-nathaniel-stern-20230821133816.warc
The WARC record causing the problem appears to be a POST request whose payload contains query data in JSON.
Identified WARC Records causing the error:
request: urn:uuid:4bdccb83-e6e4-43d6-887d-51b81d6d1e90
response: urn:uuid:78218a84-3c12-11ee-804f-0242c0a89008