-
-
Notifications
You must be signed in to change notification settings - Fork 83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
warc: add Network.resourceType (https://chromedevtools.github.io/devt… #481
Conversation
…ools-protocol/tot/Network/#type-ResourceType) as WARC-Resource-Type header for response/request pairs fixes #451
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested, looks good
Only question for me is if: https://chromedevtools.github.io/devtools-protocol/tot/Network/#type-ResourceType is standard enough. There is the extension-focused: https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/API/webRequest/ResourceType which is slightly more standardized (still differences between Chrome and Firefox) but its not a one-to-one mapping with the CDP resourceType that we have access to |
It's a good question, but if the CDP resourceType is what we have access to and we have a reference we can point to, I think that's probably sufficient at this stage, especially since we are only crawling using Chromium-based browsers. Unless there's a very clear standard to map to cleanly, which I don't know that there is in this case. |
This is probably fine for now, but also opened an issue in WARC spec about possibly standardizing how this is approached. iipc/warc-specifications#96 |
Per discussion, it appears that Puppeteer / Playwright both use resourceType as all lowercase. To match their convention, also setting WARC-Resource-Type header to be all lowercase and updating the |
follow up to #481, check reqresp.resourceType with lowercase value
follow up to #481, check reqresp.resourceType with lowercase value just set message based on resourceType value
follow up to #481, check reqresp.resourceType with lowercase value just set message based on resourceType value
Add resourcesType value from https://chromedevtools.github.io/devtools-protocol/tot/Network/#type-ResourceType as
WARC-Resource-Type
header, fixes #451