Use extracted text in WARC resource records #2

edsu · 2024-02-18T17:17:39Z

Thanks for this elegant example of how to do RAG with WARC data! I also very much appreciated how the blog post highlighted limitations with citation (which is important for web archives).

I was wondering if it might be useful to use the text/plain WARC resource records that browsertrix-crawler creates from the rendered page (not just scraped from the static HTML). This could be important for social media content where the page is assembled dynamically?

I think it would mostly be a matter of adding some logic to ingest.py to look for records with WARC-Type: resource and then use the URL that's in the WARC-Target-URI header to determine the URL to associate the text with?

Here's an example for the text generated on the initial page render:

WARC/1.1
Content-Type: text/plain
WARC-Target-URI: urn:text:https://genart.social/tags/genuary
WARC-Date: 2024-02-18T16:58:12.661Z
WARC-Type: resource
WARC-Record-ID: <urn:uuid:1d657dd4-1b01-4e76-bba2-ea641d74c029>
WARC-Payload-Digest: sha256:7cd17ef9c0393fcc1f8fd1b956c0f43eab1a2851f01d06fe41692d2284a2905c
WARC-Block-Digest: sha256:7cd17ef9c0393fcc1f8fd1b956c0f43eab1a2851f01d06fe41692d2284a2905c
Content-Length: 897

Mastodon
Create account
Login
Recent searches
No recent searches
Search options
Not available on genart.social.
genart.social
is part of the decentralized social network powered by
Mastodon
.
...

The WARC-Target-URI could also look like urn:textFinal:{url}, which holds the text in the page after the behaviors have run. But maybe this would complicate the retrieval step if there are multiple records for the same resource?
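To make the idea concrete, here is a minimal sketch of how the URL could be recovered from the WARC-Target-URI and how duplicates might be handled. The function names `split_text_urn` and `pick_preferred` are hypothetical and not part of ingest.py. Preferring `textFinal` over `text` is only one possible policy, and the sketch assumes these two prefixes are the only variants.

```python
from typing import Dict, Iterable, Optional, Tuple

# Prefixes taken from the example record above (assumed to be the only variants).
TEXT_PREFIXES = ("urn:textFinal:", "urn:text:")


def split_text_urn(target_uri: str) -> Optional[Tuple[str, str]]:
    """Return (kind, url) for a text resource URN, or None for any other URI."""
    for prefix in TEXT_PREFIXES:
        if target_uri.startswith(prefix):
            kind = prefix[len("urn:"):-1]  # "text" or "textFinal"
            return kind, target_uri[len(prefix):]
    return None


def pick_preferred(records: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Keep one text per URL from (target_uri, text) pairs.

    Assumed policy: a textFinal record replaces a text record for the same URL.
    """
    chosen: Dict[str, Tuple[str, str]] = {}
    for target_uri, text in records:
        parsed = split_text_urn(target_uri)
        if parsed is None:
            continue  # not an extracted-text record
        kind, url = parsed
        if url not in chosen or kind == "textFinal":
            chosen[url] = (kind, text)
    return {url: text for url, (_kind, text) in chosen.items()}
```

With a reader such as warcio, the inputs could presumably come from records where `record.rec_type == "resource"`, reading the header with `record.rec_headers.get_header("WARC-Target-URI")`. Keeping a single text per URL would sidestep the multiple-records question at retrieval time, at the cost of dropping the initial render.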

matteocargnelutti · 2024-02-27T21:31:09Z

Hi @edsu !

Thank you for the kind words, and for having a look at WARC-GPT.

I think this is a good suggestion.

As you noted, at the moment the pipeline focuses solely on HTTP 2XX responses. Maybe we could add a CLI option to ingest.py to specify which types of WARC records to consider?

Maybe something like:

--record-types=responses (default)
--record-types=responses,resources
--record-types=resources
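A rough sketch of how such an option could be parsed and validated follows. It uses `argparse` from the standard library purely for illustration, since the CLI framework that ingest.py actually uses may differ. The names `parse_record_types`, `build_parser`, and `WARC_TYPES` are hypothetical.

```python
import argparse
from typing import List

# Maps the proposed option values to WARC-Type header values.
WARC_TYPES = {"responses": "response", "resources": "resource"}


def parse_record_types(value: str) -> List[str]:
    """Turn 'responses,resources' into a validated list of option values."""
    types = [t.strip() for t in value.split(",") if t.strip()]
    unknown = sorted(set(types) - set(WARC_TYPES))
    if not types or unknown:
        raise argparse.ArgumentTypeError(
            f"expected a comma-separated list of {sorted(WARC_TYPES)}, got {value!r}"
        )
    return types


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="ingest.py")
    parser.add_argument(
        "--record-types",
        type=parse_record_types,
        default=["responses"],  # current behaviour stays the default
        help="comma-separated WARC record types to ingest",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # WARC-Type values the ingest loop would accept.
    print([WARC_TYPES[t] for t in args.record_types])
```

Keeping `responses` as the default means existing invocations would behave as they do today, and the ingest loop would only need to compare each record's WARC-Type against the selected values.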
