Add request initiator to WARC? #451
I should mention that relying on HTTP
So we cannot and should not expect this HTTP header to be present. Our need is adjacent to this one in that we don't want to interfere with the HTTP negotiation (the server should not be aware of how the request is made); we'd just like to have that information in the WARC headers.
In addition to the issue mentioned above, we are now also facing difficulties with our new warc2zim version, where knowing what triggered the download of a given WARC record (the HTML page, some JS code, a CSS stylesheet, ...) would help. I know this is even broader than the original request about which kind of system triggered the download, and I also realize that this is not a single value but a list. @ikreymer @tw4l what are your views on this feature request? How complex does it seem to you? Does it make sense? As a side note, if this request is feasible and meaningful for you, we (Kiwix) are probably OK to support you in implementing this feature (how exactly would have to be defined).
Yes, with the new 1.0.0 we are switching to entirely browser-based capture (not using the pywb proxy). With this method, it should be much easier to record either the CDP ResourceType value or Request.destination in the WARC record. Do you think that will be sufficient for your use case?
Great, that's good news! Thank you for your prompt reply. I'm not able to say which of the two we would prefer. At least it looks like both might match our need, from what I've understood. The choice between them might depend more on a "reality check" against real situations. Probably the best solution would be to try it on some cases we have encountered where it would help. The only concern I see (it is more "your" problem, but it might become "ours" quite fast) is that there is no long-term certainty that it will be easy or feasible for you to keep this information available, since it is tightly tied to browser-based capture. Should you decide to abandon browser-based capture for any reason, this will be one more constraint for the new system to fulfill. Not a big concern, but something to keep in mind. @rgaudin @mgautierfr do you have any additional views on this?
…ools-protocol/tot/Network/#type-ResourceType) as WARC-Resource-Type header for response/request pairs fixes #451
@ikreymer we just noticed the commit; it looks like it should do the trick! In which release can we expect this to land? We'd like to test as soon as possible.
@rgaudin It'll be in the next beta for testing. But we should probably at least discuss standardizing this type of header; I opened an issue in the WARC spec repo: iipc/warc-specifications#96
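For anyone wanting to try this on the consumer side, here is a minimal sketch of how a replay tool might read the new header and decide whether a record was JS-initiated. Only the `WARC-Resource-Type` header name comes from the commit referenced above; the helper names, the sample record, and the assumption that values can be compared case-insensitively against `xhr` and `fetch` are illustrative and not taken from the crawler or from warc2zim.

```python
# Sketch only: helper names and the sample record below are hypothetical.
# The WARC-Resource-Type header name is taken from the commit message in this thread.

# Assumption: JS-initiated requests are reported as "xhr" or "fetch",
# compared case-insensitively.
JS_INITIATED_TYPES = {"xhr", "fetch"}


def parse_warc_headers(header_block: str) -> dict[str, str]:
    """Parse the WARC header block of one record into a lower-cased dict."""
    headers: dict[str, str] = {}
    # Skip the first line, which is the version line (e.g. "WARC/1.1").
    for line in header_block.splitlines()[1:]:
        if not line.strip():
            break  # a blank line ends the WARC header block
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


def is_js_initiated(headers: dict[str, str]) -> bool:
    """Return True if the record is marked as an xhr or fetch request."""
    return headers.get("warc-resource-type", "").lower() in JS_INITIATED_TYPES


SAMPLE_RECORD_HEADERS = (
    "WARC/1.1\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: https://example.com/api/data.json\r\n"
    "WARC-Resource-Type: fetch\r\n"
    "\r\n"
)

if __name__ == "__main__":
    parsed = parse_warc_headers(SAMPLE_RECORD_HEADERS)
    print(parsed["warc-target-uri"], is_js_initiated(parsed))
```

Records written before the header existed simply have no `WARC-Resource-Type` entry, so this sketch treats them as not JS-initiated; a real tool may prefer to treat them as unknown.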
To improve replay systems, it would be beneficial to know whether a request is JS-emitted (xhr, fetch) or not.
This information is not available in the WARC but is available at crawling time in the browser.
Would it be possible to add this information to a dedicated WARC Header?
See openzim/warc2zim#140