WARC-Resource-Type field possibilities (feedback wanted) #96
Comments
I should note our initial implementation just stores the Chrome CDP value, e.g.
Note that Puppeteer and Playwright use the CDP values but lowercased: https://playwright.dev/docs/api/class-request#:~:text=resourceType%E2%80%8B&text=ResourceType%20will%20be%20one%20of,%2C%20websocket%20%2C%20manifest%20%2C%20other%20
Do we have any use cases in mind for this field when reading the WARC? I guess one might be listing all the top-level crawled documents. This can't be done accurately by Content-Type alone, as XHR/Fetch requests can have text/html responses. The main_frame/sub_frame distinction also seems interesting for that use case. It's not in the CDP resource type, but if we map to one of the other vocabularies, presumably it could be determined from the frameId? I guess the hopsFromSeed metadata field could be used for listing top-level crawled documents, but it's coarse-grained and doesn't distinguish between different kinds of embedded content. It's also possible for an image to have a text/html Content-Type and still display correctly due to MIME sniffing. So, similarly, if you wanted to do something with all the images in a crawl, Content-Type alone is insufficient.
We've added this to our WARCs in response to a user-submitted issue: webrecorder/browsertrix-crawler#451, with the primary use case being differentiating between resources fetched by JavaScript (via fetch, xhr) versus resources loaded directly from the HTML.
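To make that use case concrete, here is a minimal Python sketch of separating JavaScript-initiated records from everything else. It assumes the header holds either the raw CDP value (`Fetch`, `XHR`) or its lowercased form, and that each record's WARC headers are available as a plain dict. The function names and the `(url, headers)` record shape are hypothetical, not part of any crawler's API.

```python
# Assumption: these are the values that mark a JavaScript-initiated request,
# compared case-insensitively so both "XHR" and "xhr" match.
JS_INITIATED_TYPES = {"fetch", "xhr"}


def is_js_initiated(warc_headers):
    """Return True if the record's WARC-Resource-Type is fetch or xhr."""
    value = warc_headers.get("WARC-Resource-Type", "")
    return value.strip().lower() in JS_INITIATED_TYPES


def split_by_initiator(records):
    """Split (url, headers) pairs into JS-initiated URLs and all other URLs."""
    from_js, from_markup = [], []
    for url, headers in records:
        (from_js if is_js_initiated(headers) else from_markup).append(url)
    return from_js, from_markup
```

Records without the header fall into the second list, so the split stays usable on WARCs that only partly carry the field.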
This is probably off topic for this issue, but it came up recently in the context of using mailbagit that it would be useful to know if a record is for a seed URL. Or is there another common way of doing that? The motivation here is to be able to pick out URLs from the WARC data to serve as entry points during replay.
For WARCs created by Heritrix a metadata record without the WACZ defines an accompanying pages.jsonl file for entry points.
See webrecorder/browsertrix-crawler#630 for feedback "from the trenches".
Browsers have different ways of reporting the 'resource type' for any resource that's being fetched. When using browser-based crawling, it is often easy to access this 'resource type' and store it in a custom WARC header.
It is possible to introduce a WARC-Resource-Type header to store this type. Unfortunately, there isn't a single standard of 'resource types' and various browser APIs expose different variations on this.
If a resource type is written to a WARC header, is there a way to make it future proof to support different vocabularies?
Some possibilities include:
Chrome Debug Protocol (CDP) resource type - this is easiest for Chromium-browser based crawling as these fields are directly accessible, but is not especially well standardized and could change anytime.
Fetch Request.destination - this is well standardized vocabulary but not a one-to-one mapping and may not be accessible for non-Fetch data.
Extension API webRequest.resourceType - better standardized and supported by all the major browsers with some differences for browser extensions. Not quite one-to-one with CDP types.
One approach to make this more future proof might be to prefix the resourceType with a namespace based on where the data is coming from and which vocabulary is used.
For example, if using CDP, cdp:Document or cdp:Image; if using webRequest, it might be webRequest:sub_frame, webRequest:image; if using destination, destination:image, destination:document, etc.
This allows for expanding into other vocabularies in the future, but may be harder to parse.
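As a rough illustration of the parsing cost, a namespaced value could be split as in the Python sketch below. The set of known prefixes is taken from the examples above. The `ResourceType` tuple and `parse_resource_type` function are hypothetical names, and treating an unprefixed value as "vocabulary unknown" is an assumption rather than an agreed rule.

```python
from typing import NamedTuple, Optional

# Prefixes taken from the examples in this issue; the list is illustrative.
KNOWN_VOCABULARIES = {"cdp", "webRequest", "destination"}


class ResourceType(NamedTuple):
    vocabulary: Optional[str]  # None when no recognized prefix is present
    value: str


def parse_resource_type(header_value):
    """Split 'vocabulary:value' into its parts.

    Values with no recognized prefix are returned unchanged with
    vocabulary=None (an assumption about how readers should behave).
    """
    text = header_value.strip()
    prefix, sep, rest = text.partition(":")
    if sep and prefix in KNOWN_VOCABULARIES:
        return ResourceType(prefix, rest)
    return ResourceType(None, text)
```

For instance, `parse_resource_type("webRequest:sub_frame")` yields the vocabulary `webRequest` and the value `sub_frame`, while a bare `Document` comes back with no vocabulary attached.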
Alternatively, there could be a fixed vocabulary that is allowed that is a common subset of at least 2 of the above, which might be: document, image, media, script, stylesheet, font, ping, websocket, fetch and a catch-all other.
(In this case, we should specify what the more specific values are recorded as, e.g. main_frame/sub_frame would be recorded as document.)
Other thoughts / suggestions welcome!
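For discussion, the fixed-vocabulary option could be sketched in Python as below. Only the main_frame/sub_frame to document mapping comes from this proposal. The other aliases (xhr, xmlhttprequest, style, audio, video) are assumptions added for illustration, and `normalize_resource_type` is a hypothetical name.

```python
# The fixed vocabulary proposed above.
COMMON_VOCABULARY = {
    "document", "image", "media", "script", "stylesheet",
    "font", "ping", "websocket", "fetch", "other",
}

ALIASES = {
    # From the proposal: frame types are recorded as "document".
    "main_frame": "document",
    "sub_frame": "document",
    # Assumptions, not part of the proposal:
    "xhr": "fetch",
    "xmlhttprequest": "fetch",
    "style": "stylesheet",
    "audio": "media",
    "video": "media",
}


def normalize_resource_type(value):
    """Map a raw or 'vocabulary:value' resource type onto the fixed vocabulary."""
    prefix, sep, rest = value.strip().partition(":")
    # Drop an optional namespace prefix, then compare case-insensitively.
    name = (rest if sep else prefix).lower()
    name = ALIASES.get(name, name)
    return name if name in COMMON_VOCABULARY else "other"
```

Anything outside the vocabulary, such as a CDP `Manifest`, collapses to `other`. That is the information loss the namespaced approach avoids.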