Add request initiator to WARC? #451

Closed
rgaudin opened this issue Dec 20, 2023 · 6 comments

Comments

@rgaudin
Contributor

rgaudin commented Dec 20, 2023

To improve replay systems, it would be beneficial to know whether a request is JS-emitted (xhr, fetch) or not.
This information is not available in the WARC but is available at crawling time in the browser.
Would it be possible to add this information to a dedicated WARC Header?

See openzim/warc2zim#140
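
As a rough illustration of the request above, here is a minimal sketch of how a replay-side tool could read such a dedicated header from a WARC record's header block. The header name `WARC-Resource-Type` and the value `xhr` are placeholders chosen for this example (the thread does not specify either), and the parser is deliberately simplified: it ignores folded continuation lines and only looks at the WARC header block, not the HTTP payload.

```python
from typing import Dict, Optional

# Hypothetical header name, used here for illustration only.
INITIATOR_HEADER = "WARC-Resource-Type"


def parse_warc_headers(header_block: str) -> Dict[str, str]:
    """Parse the named fields of one WARC record header block into a dict.

    Field names are lower-cased so lookups are case-insensitive.
    """
    headers: Dict[str, str] = {}
    lines = header_block.strip().splitlines()
    # The first line is the version line (e.g. "WARC/1.1"), not a named field.
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the header block
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


def resource_type(header_block: str) -> Optional[str]:
    """Return the hypothetical initiator header value, or None if absent."""
    return parse_warc_headers(header_block).get(INITIATOR_HEADER.lower())


# Example record header block (made-up values).
record = "\r\n".join([
    "WARC/1.1",
    "WARC-Type: response",
    "WARC-Target-URI: https://example.com/api/items",
    "WARC-Resource-Type: xhr",
    "",
])
print(resource_type(record))
```

Because the information lives in the WARC headers rather than in the HTTP request, the server never sees it, and records written without the header simply yield `None`.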

@rgaudin
Contributor Author

rgaudin commented Jan 9, 2024

I should mention that relying on HTTP X-Requested-With is not an option:

  • it is set only by some JS libraries
  • whether it should be used at all is debated
  • it triggers a CORS preflight

So we cannot and should not expect this HTTP header to be present.

Our need is adjacent to this one in that we don't want to interfere with the HTTP negotiation (the server should not be aware of how the request is made); we'd just like to have that information in the WARC headers.

@benoit74
Contributor

In addition to the issue mentioned above, we are now also facing difficulties with our new warc2zim version where knowing what triggered the download of a given WARC record would help (the HTML page, some JS code, CSS stylesheet, ...). I know this is even broader than the original request about which kind of system triggered the download, and I also realize that this is not a single value but a list.

@ikreymer @tw4l what are your views on this feature request? How complex does it seem to you? Does it make sense?

As a side note, if the request is feasible and meaningful for you, we (Kiwix) are probably OK to support you in implementing this feature (how exactly would have to be defined).

@ikreymer
Member

ikreymer commented Feb 19, 2024

Yes, with the new 1.0.0 we are switching to entirely browser-based capture (not using the pywb proxy). With this method, it should be much easier to record either the CDP ResourceType value or Request.destination in the WARC record. Do you think that will be sufficient for your use case?
Request.destination probably makes more sense as it is more standard, but it will be based on the protocol ResourceType.
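
To make the two options more concrete, here is a minimal sketch of how a consumer could use a recorded CDP ResourceType value, both to answer the original question (is the request JS-emitted, i.e. xhr or fetch?) and to derive an approximate Request.destination. The function names and the mapping table are assumptions made for this example, not the crawler's actual implementation; the table is intentionally partial and only covers types whose destination is unambiguous.

```python
from typing import Optional

# CDP resource types that correspond to script-issued requests (XHR, fetch()).
JS_EMITTED_TYPES = {"xhr", "fetch"}

# Illustrative, partial mapping from CDP ResourceType (lower-cased) to
# Request.destination. XHR and fetch() requests have an empty destination.
ILLUSTRATIVE_DESTINATIONS = {
    "document": "document",
    "stylesheet": "style",
    "image": "image",
    "font": "font",
    "script": "script",
    "xhr": "",
    "fetch": "",
}


def is_js_emitted(cdp_resource_type: str) -> bool:
    """Return True if the recorded type denotes an XHR or fetch() request."""
    return cdp_resource_type.strip().lower() in JS_EMITTED_TYPES


def approximate_destination(cdp_resource_type: str) -> Optional[str]:
    """Return an approximate Request.destination, or None if not mapped."""
    return ILLUSTRATIVE_DESTINATIONS.get(cdp_resource_type.strip().lower())


print(is_js_emitted("XHR"))
print(approximate_destination("Stylesheet"))
```

Note that XHR and fetch both collapse to an empty destination here, so for the "JS-emitted or not" question the ResourceType value carries the more direct signal.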

@benoit74
Contributor

Great, that's good news! Thank you for your prompt reply.

I can't say whether we would prefer the CDP ResourceType or the Request.destination. I'm not sure I have fully understood where the Request.destination comes from, or what it means for it to be based on the protocol ResourceType. Does it mean that it depends on what the browser is requesting rather than how the rendering engine interpreted the result?

At least, from what I've understood, it looks like both might match our need. The choice between the two might rest more on a "reality check" against real situations. Probably the best solution would be to try it on some of the cases we have encountered where it would help.

The only concern I see (it is more "your" problem, but it might become "ours" quite fast) is that there is no long-term certainty that it will be easy or feasible for you to keep this information available, since it is tightly tied to browser-based capture. Should you decide to abandon browser-based capture for any reason, this will be one more constraint for the new system to fulfill. Not a big concern, but something to keep in mind.

@rgaudin @mgautierfr do you have any additional views on this?

ikreymer added a commit that referenced this issue Feb 23, 2024
@rgaudin
Contributor Author

rgaudin commented Mar 4, 2024

@ikreymer we just noticed the commit; looks like it should do the trick!

In which release can we expect this to land? We'd like to test as soon as possible.

@ikreymer
Member

ikreymer commented Mar 4, 2024

@rgaudin It'll be in the next beta for testing. But we should probably at least discuss standardizing this type of header; I opened an issue in the WARC spec repo: iipc/warc-specifications#96
