Use extracted text in WARC resource records #2

Open
edsu opened this issue Feb 18, 2024 · 1 comment

edsu commented Feb 18, 2024

Thanks for this elegant example of how to do RAG with WARC data! I also very much appreciated how the blog post highlighted limitations with citation (which is important for web archives).

I was wondering whether it might be useful to use the text/plain WARC resource records that browsertrix-crawler creates from the rendered page, rather than only text scraped from the static HTML. This could matter for social media content, where the page is assembled dynamically.

I think it would mostly be a matter of adding some logic to ingest.py that looks for records with WARC-Type: resource and then uses the URL embedded in the WARC-Target-URI header to determine which page the text belongs to?

Here's an example for the text generated on the initial page render:

WARC/1.1
Content-Type: text/plain
WARC-Target-URI: urn:text:https://genart.social/tags/genuary
WARC-Date: 2024-02-18T16:58:12.661Z
WARC-Type: resource
WARC-Record-ID: <urn:uuid:1d657dd4-1b01-4e76-bba2-ea641d74c029>
WARC-Payload-Digest: sha256:7cd17ef9c0393fcc1f8fd1b956c0f43eab1a2851f01d06fe41692d2284a2905c
WARC-Block-Digest: sha256:7cd17ef9c0393fcc1f8fd1b956c0f43eab1a2851f01d06fe41692d2284a2905c
Content-Length: 897

Mastodon
Create account
Login
Recent searches
No recent searches
Search options
Not available on genart.social.
genart.social
is part of the decentralized social network powered by
Mastodon
.
...

The WARC-Target-URI could also look like urn:textFinal:{url}, which marks the text in the page after the behaviors have run. But maybe this would complicate the retrieval step if there are multiple records for the same resource?
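
To make the idea concrete, here is a minimal sketch of the matching logic, using only the Python standard library. It assumes the two prefixes mentioned above (urn:text: and urn:textFinal:) are the only variants, and it resolves the "multiple records for the same resource" question by preferring textFinal, which is just one possible policy. The function names and the tuple shape passed to select_text_records are hypothetical and are not taken from ingest.py; in practice the header values would come from whatever WARC reader the pipeline already uses.

```python
from typing import Dict, Iterable, Optional, Tuple

# Prefixes described in this issue; assumed here to be the only two variants.
TEXT_PREFIXES = ("urn:textFinal:", "urn:text:")


def parse_text_target_uri(target_uri: str) -> Optional[Tuple[str, str]]:
    """Return (kind, url) for urn:text: / urn:textFinal: URIs, else None."""
    for prefix in TEXT_PREFIXES:
        if target_uri.startswith(prefix):
            kind = prefix[len("urn:"):-1]  # "text" or "textFinal"
            return kind, target_uri[len(prefix):]
    return None


def select_text_records(
    records: Iterable[Tuple[str, str, str, str]]
) -> Dict[str, str]:
    """Keep one extracted text per URL, preferring textFinal over text.

    `records` yields (warc_type, content_type, target_uri, text) tuples,
    a hypothetical shape standing in for values read from WARC headers.
    """
    chosen: Dict[str, Tuple[str, str]] = {}
    for warc_type, content_type, target_uri, text in records:
        # Only text/plain resource records are of interest here.
        if warc_type != "resource" or not content_type.startswith("text/plain"):
            continue
        parsed = parse_text_target_uri(target_uri)
        if parsed is None:
            continue
        kind, url = parsed
        # Replace an earlier "text" entry when a "textFinal" one shows up.
        if url not in chosen or (kind == "textFinal" and chosen[url][0] == "text"):
            chosen[url] = (kind, text)
    return {url: text for url, (_kind, text) in chosen.items()}
```

Keeping a single text per URL avoids duplicate chunks at retrieval time. Keeping both variants and storing the kind as metadata would be an equally reasonable choice.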

@matteocargnelutti
Collaborator

Hi @edsu!

Thank you for the kind words, and for having a look at WARC-GPT.

I think this is a good suggestion.

As you noted, the pipeline currently focuses solely on HTTP 2XX responses: maybe we could add a CLI option to ingest.py to specify which types of WARC records to consider?

Maybe something like:

--record-types=responses (default)
--record-types=responses,resources
--record-types=resources
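
A rough sketch of how such an option could be parsed, for illustration only. It uses argparse from the standard library, which may not be the CLI framework ingest.py actually relies on, and the helper names (parse_record_types, build_parser) and the mapping from plural option values to WARC-Type values are assumptions rather than existing code.

```python
import argparse
from typing import Set

# Maps the plural names proposed above to WARC-Type header values (assumed mapping).
RECORD_TYPE_NAMES = {"responses": "response", "resources": "resource"}


def parse_record_types(value: str) -> Set[str]:
    """Turn 'responses,resources' into {'response', 'resource'}."""
    names = [name.strip() for name in value.split(",") if name.strip()]
    unknown = [name for name in names if name not in RECORD_TYPE_NAMES]
    if not names or unknown:
        allowed = ", ".join(sorted(RECORD_TYPE_NAMES))
        raise argparse.ArgumentTypeError(
            f"expected a comma-separated list drawn from: {allowed}"
        )
    return {RECORD_TYPE_NAMES[name] for name in names}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing only the hypothetical --record-types option."""
    parser = argparse.ArgumentParser(prog="ingest.py")
    parser.add_argument(
        "--record-types",
        type=parse_record_types,
        default={"response"},  # matches the proposed default of "responses"
        help="comma-separated WARC record types to ingest",
    )
    return parser
```

With this sketch, build_parser().parse_args(["--record-types=responses,resources"]).record_types yields a set of WARC-Type values that the ingestion loop could check each record against.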
