Document new WARC fields in 1.x crawler-produced WACZ files #1588

Open
tuehlarsen opened this issue Mar 11, 2024 · 3 comments
Labels: documentation (Improvements or additions to documentation)

Comments

@tuehlarsen

Browsertrix Cloud Version

v1.9.3-79a217b

What did you expect to happen? What happened instead?

I have found some new WARC fields and files in the newest WACZ produced by the beta.browsertrix release 1.9.3 with Browsertrix-Crawler 1.0.0-beta.6 (with warcio.js 2.2.1):

Here is a snippet from the WARC file:
WARC/1.1
WARC-Page-ID: 61046c48-286b-485a-a8ed-9974f79a179d
WARC-Resource-Type: document
WARC-JSON-Metadata: {"cert":{"issuer":"GlobalSign Atlas R3 DV TLS CA 2024 Q1","ctc":"0"}}
...
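
For anyone who wants to inspect these fields, here is a minimal Python sketch that reads the header lines quoted above and decodes the JSON value. It uses only the standard library and assumes one field per line with CRLF line endings (no continuation lines). The function name `parse_warc_header` is made up for this example and is not part of any WARC library.

```python
import json

# Header lines as quoted in the issue; WARC header lines end in CRLF.
# The elided remainder of the record is deliberately not reconstructed.
RAW_HEADER = (
    "WARC/1.1\r\n"
    "WARC-Page-ID: 61046c48-286b-485a-a8ed-9974f79a179d\r\n"
    "WARC-Resource-Type: document\r\n"
    'WARC-JSON-Metadata: {"cert":{"issuer":"GlobalSign Atlas R3 DV TLS CA 2024 Q1","ctc":"0"}}\r\n'
)


def parse_warc_header(raw: str) -> tuple[str, dict[str, str]]:
    """Split a header block into its version line and named fields.

    Hypothetical helper: assumes one "Name: value" pair per line.
    """
    lines = [line for line in raw.split("\r\n") if line]
    version, field_lines = lines[0], lines[1:]
    fields = {}
    for line in field_lines:
        # Split at the first colon only, so colons inside the value survive.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return version, fields


version, fields = parse_warc_header(RAW_HEADER)
# The value of WARC-JSON-Metadata in the snippet is a JSON object.
metadata = json.loads(fields["WARC-JSON-Metadata"])

print(version)
print(fields["WARC-Page-ID"])
print(metadata["cert"]["issuer"])
```

The sketch only shows that the three fields are ordinary named header fields and that the metadata value parses as JSON. It says nothing about what the fields mean, which is the open documentation question.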

I can't find any proposal concerning WARC-Page-ID here: https://iipc.github.io/warc-specifications/ .
I also found text and screenshot warc.gz files, but no documentation for them.
All files validate with the newest version of jwat-tools: https://github.com/netarchivesuite/jwat-tools/releases/tag/v0.7.2-beta1

Any comments?

Step-by-step reproduction instructions

see above

Additional details

No response

@tuehlarsen tuehlarsen added the bug Something isn't working label Mar 11, 2024
@tw4l tw4l self-assigned this Mar 12, 2024
@tw4l tw4l moved this from Triage to Todo in Webrecorder Projects Mar 12, 2024
@tuehlarsen
Author

Are the three new WARC fields above mandatory for modern browser-based replay, and are they de facto used in other tools today?

@tw4l
Member

tw4l commented Mar 14, 2024

Hi @tuehlarsen, longer explanation coming but in short:

None of these additions should cause WARC validation to fail or cause any replay issues. Most software other than Browsertrix will simply ignore the fields, as is suggested in section 5.1 of the WARC 1.1 specification:

Because new fields may be defined in extensions to the core WARC format, WARC processing software shall ignore fields with unrecognized names.
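
As a rough illustration of that rule, here is a minimal Python sketch of a reader that keeps the fields it recognizes and silently skips the rest. This is not how Browsertrix or any particular tool implements it. `KNOWN_FIELDS` is a deliberately small, hypothetical subset, and the `WARC-Type` and `WARC-Target-URI` values are placeholders for the example.

```python
# Hypothetical subset of field names this reader understands. A real tool
# would list every field defined by the specification it implements.
KNOWN_FIELDS = {
    "warc-type",
    "warc-record-id",
    "warc-date",
    "warc-target-uri",
    "content-length",
}


def select_known_fields(fields: dict[str, str]) -> dict[str, str]:
    """Keep recognized fields; unrecognized ones are skipped without error."""
    # Field names are compared case-insensitively.
    return {
        name: value
        for name, value in fields.items()
        if name.lower() in KNOWN_FIELDS
    }


# Placeholder record: the first two values are illustrative only, the last
# two fields are the extension fields from the issue.
record_fields = {
    "WARC-Type": "response",
    "WARC-Target-URI": "https://example.com/",
    "WARC-Page-ID": "61046c48-286b-485a-a8ed-9974f79a179d",
    "WARC-Resource-Type": "document",
}

kept = select_known_fields(record_fields)
ignored = sorted(set(record_fields) - set(kept))

print(kept)
print(ignored)
```

A reader built this way processes the record as usual and never fails on the extension fields.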

Re: your comment about missing documentation for screenshot and text WARC files, that is noted and should be coming shortly! As of the latest 1.0.0 crawler beta release, these WARCs will also be prefixed if a WARC prefix is specified.

@tuehlarsen
Author

tuehlarsen commented Apr 9, 2024

I hope the documentation will be more explicit. It is of great importance for large, older web archives to know what the new WARC fields are for and how they will be used in the future. That requires a syntax definition and description that can serve as input to a later ISO standardization process.

@tw4l tw4l changed the title [Bug]: ? new WARC fields and WARC files in the WACZ file in the newest 1.9.3 version Document new WARC fields in 1.x crawler-produced WACZ files May 15, 2024
@Shrinks99 Shrinks99 added documentation Improvements or additions to documentation and removed bug Something isn't working labels May 16, 2024
@tw4l tw4l moved this from Todo to Ready in Webrecorder Projects May 27, 2024