.warc strict conforming output? #554
Hi @dbuenzli, thanks for these comments. In terms of the new fields, yes, perhaps we should create/propose an extension to the core WARC format with these new fields and push for them to be included in future versions of the core standard. I'm putting this issue on our sprint board for consideration after the IIPC WAC conference, so thanks for raising it.
This might be a possibility; however, some of these fields are necessary for features of Browsertrix Crawler (e.g. WARC-Page-ID relates WARC records to the pages list in the ...).
Thanks for your answers @tw4l. Note that strictly speaking I don't mind post-processing the files to remove these non-standard elements for our long-term archive. But that raises two questions:
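(For context, the kind of post-processing being considered could look like the following sketch. It assumes the warcio library; the filenames and the exact set of headers to strip are hypothetical and would need to match the crawler's actual output.)

```python
from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter

# Non-standard headers observed in Browsertrix Crawler output (per this thread)
NON_STANDARD = ('WARC-JSON-Metadata', 'WARC-Page-ID')

# 'crawl.warc.gz' and 'clean.warc.gz' are hypothetical filenames
with open('crawl.warc.gz', 'rb') as infile, open('clean.warc.gz', 'wb') as outfile:
    writer = WARCWriter(outfile, gzip=True)
    for record in ArchiveIterator(infile):
        for name in NON_STANDARD:
            # remove_header is a no-op when the header is absent; record
            # headers are not covered by WARC-Block-Digest, so the payload
            # digests remain valid after stripping
            record.rec_headers.remove_header(name)
        writer.write_record(record)
```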
I would strongly recommend against doing this. Our goal is to write up proposals for the new fields. Per the WARC 1.1 standard, unknown headers and records should be ignored by conforming tools. The reason for this is to encourage extension of the WARC spec in a way that lets users try new things. ISO standardization can take a long time (the standard is reviewed on a five-year cycle), and the community decided (if I remember correctly) that it would be best to see what is actually in use 'in the wild' before proposing it as a further extension to WARC. Only things that are in actual use should be standardized, and that just takes a long time. We do hope that the extensions we have can be standardized one day, but our focus is on making sure we can archive at-risk content today at the highest fidelity.
The answer is 'maybe', but not necessarily. The goal of these records is to capture all resources loaded by a browser at the time of capture, so that they can later be analyzed and compared with resources loaded at the time of replay. This includes resources that are duplicates, loaded from cache, etc., so they don't correspond one-to-one with a new WARC record, or even a revisit record.
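A quick way to see what these pageinfo records actually contain is to dump their JSON payloads. Below is a minimal sketch assuming the warcio library, with a hypothetical input filename; the payload schema is whatever the crawler wrote, not assumed here:

```python
import json
from warcio.archiveiterator import ArchiveIterator

# 'crawl.warc.gz' is a hypothetical filename
with open('crawl.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        uri = record.rec_headers.get_header('WARC-Target-URI') or ''
        if record.rec_type == 'resource' and uri.startswith('urn:pageinfo:'):
            info = json.loads(record.content_stream().read())
            # print the page URI and the top-level keys of the JSON payload
            print(uri, list(info))
```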
Yes, replayweb.page isn't using these at the moment, and we won't require them, but we may add features in the future that do. These records are intended to be forward-compatible. I don't think you gain anything by removing them, and WARCs can often contain additional records that are not used during replay, such as warcinfo, so I would generally advise against reprocessing WARCs to remove unused records. Other crawlers such as Heritrix also write custom metadata records for crawl logs, etc., that are not used during replay but may be useful for other types of analysis.
Thanks @ikreymer for taking the time to respond. I don't mind having additional stuff in there, but I'm a bit uneasy that: ...
We are in the process of documenting these new headers and fields, tracked in https://github.com/webrecorder/browsertrix/issues/1588
I'm evaluating browsertrix-crawler for long-term preservation efforts for a non-profit archival organisation. As such I have a few questions about the .warc files it generates:

I noticed that the .warc files have non-standard headers like WARC-JSON-Metadata or WARC-Page-ID. I understand the explanations there. But it's a bit problematic to have these fields if their semantics eventually end up being standardized differently in the future. Is there a way to convince the crawler to generate strictly conformant .warc files?

It seems the .warc files contain resource records with WARC-Target-URI of the form urn:pageinfo:URI. I gather this represents a page and the resources it needs for display. However, is this scheme standard and/or described somewhere?
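As a rough way to measure how far a generated file strays from the core standard, the sketch below (again assuming the warcio library and a hypothetical filename) tallies record types against the eight defined by WARC 1.1; extension headers such as WARC-Page-ID would need a similar per-header check:

```python
from collections import Counter
from warcio.archiveiterator import ArchiveIterator

# The eight record types defined by the WARC 1.1 standard
WARC_1_1_TYPES = {'warcinfo', 'response', 'resource', 'request',
                  'metadata', 'revisit', 'conversion', 'continuation'}

counts = Counter()
with open('crawl.warc.gz', 'rb') as stream:  # hypothetical filename
    for record in ArchiveIterator(stream):
        counts[record.rec_type] += 1

for rec_type, n in sorted(counts.items()):
    flag = '' if rec_type in WARC_1_1_TYPES else '  (extension)'
    print(f'{rec_type}: {n}{flag}')
```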