Document new WARC fields in 1.x crawler-produced WACZ files #1588
Comments
Are the three new WARC fields above mandatory for modern browser-based replay, and are they de facto used in other tools today?
Hi @tuehlarsen, longer explanation coming but in short:
None of these additions should cause WARC validation to fail or cause any replay issues. Most software other than Browsertrix will simply ignore the fields, as suggested in section 5.1 of the WARC 1.1 specification.
Re: your comment about missing documentation for screenshot and text WARC files, that is noted and should be coming shortly! As of the latest 1.0.0 crawler beta release, these WARCs will also be prefixed if a WARC prefix is specified.
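To illustrate the "ignore unrecognized fields" behavior mentioned above, here is a minimal Python sketch of a lenient header reader. It is not taken from Browsertrix or any existing WARC library: `split_fields` is a hypothetical helper, `KNOWN_FIELDS` is a small illustrative subset of field names rather than the full list from the specification, and the `WARC-Type: response` line in the sample is an assumption added for the example. Folded (multi-line) field values are not handled.

```python
# Illustrative subset of standard WARC field names (assumption: not the full list).
KNOWN_FIELDS = {
    "warc-type",
    "warc-record-id",
    "warc-date",
    "warc-target-uri",
    "content-type",
    "content-length",
}


def split_fields(header_block: str):
    """Split a WARC header block into recognized and unrecognized fields.

    Hypothetical helper: names are compared case-insensitively, and anything
    not in KNOWN_FIELDS is set aside instead of being treated as an error.
    """
    known, ignored = {}, {}
    lines = header_block.split("\r\n")
    for line in lines[1:]:  # lines[0] is the version line, e.g. "WARC/1.1"
        if not line:
            break  # an empty line ends the header block
        name, _, value = line.partition(":")
        target = known if name.strip().lower() in KNOWN_FIELDS else ignored
        target[name.strip()] = value.strip()
    return known, ignored


sample = (
    "WARC/1.1\r\n"
    "WARC-Type: response\r\n"  # assumed for illustration
    "WARC-Page-ID: 61046c48-286b-485a-a8ed-9974f79a179d\r\n"
    "WARC-Resource-Type: document\r\n"
    "\r\n"
)
known, ignored = split_fields(sample)
print("recognized:", sorted(known))
print("ignored:", sorted(ignored))
```

A reader written this way keeps working when a crawler adds new fields, because the unrecognized names are collected and skipped rather than rejected.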
I hope the documentation will be more explicit. For large, older web archives it is very important to know what the new WARC fields are for and how they will be used in the future. This requires a syntax definition and description that can serve as input to a later ISO standardization process.
Browsertrix Cloud Version
v1.9.3-79a217b
What did you expect to happen? What happened instead?
I have found some new WARC fields and files in the newest WACZ from the beta.browsertrix release 1.9.3, Browsertrix-Crawler 1.0.0-beta.6 (with warcio.js 2.2.1).
Here is a snippet from the WARC file:
WARC/1.1^M
WARC-Page-ID: 61046c48-286b-485a-a8ed-9974f79a179d^M
WARC-Resource-Type: document^M
WARC-JSON-Metadata: {"cert":{"issuer":"GlobalSign Atlas R3 DV TLS CA 2024 Q1","ctc":"0"}}^M
...
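As a rough illustration of what the snippet above carries, the following Python sketch splits the header lines into fields and decodes the WARC-JSON-Metadata value with the standard `json` module. It assumes the value is a single-line JSON object, as in the snippet. `parse_header_lines` is a hypothetical helper, and nothing here says what the `cert` keys mean, since that is exactly what is undocumented.

```python
import json

# Header lines copied from the snippet above; the ^M markers there are the
# carriage returns of the CRLF line endings.
snippet = (
    "WARC/1.1\r\n"
    "WARC-Page-ID: 61046c48-286b-485a-a8ed-9974f79a179d\r\n"
    "WARC-Resource-Type: document\r\n"
    'WARC-JSON-Metadata: {"cert":{"issuer":"GlobalSign Atlas R3 DV TLS CA 2024 Q1","ctc":"0"}}\r\n'
)


def parse_header_lines(text: str) -> dict:
    """Hypothetical helper: map field names to values, skipping the version line."""
    fields = {}
    for line in text.split("\r\n")[1:]:
        # Split on the first colon only, so colons inside the JSON value survive.
        name, sep, value = line.partition(":")
        if sep:
            fields[name.strip()] = value.strip()
    return fields


fields = parse_header_lines(snippet)
metadata = json.loads(fields["WARC-JSON-Metadata"])
print(fields["WARC-Page-ID"])
print(metadata["cert"]["issuer"])
```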
The specifications at https://iipc.github.io/warc-specifications/ do not seem to include any proposal concerning WARC-Page-ID.
I also found text and screenshot warc.gz files, but no documentation for them.
All files validate with the newest version of jwat-tools: https://github.com/netarchivesuite/jwat-tools/releases/tag/v0.7.2-beta1
Any comments?
Step-by-step reproduction instructions
see above
Additional details
No response