Thanks for this elegant example of how to do RAG with WARC data! I also very much appreciated how the blog post highlighted limitations with citation (which is important for web archives).
I was wondering if it might be useful to use the text/plain WARC resource records that browsertrix-crawler creates from the rendered page (not just scraped from the static HTML). This could be important for social media content where the page is assembled dynamically?
I think it would mostly be a matter of adding some logic to ingest.py to look for records with WARC-Type: resource and then use the URL that's in the WARC-Target-URI header to determine the URL to associate the text with?
Here's an example for the text generated on the initial page render:
WARC/1.1
Content-Type: text/plain
WARC-Target-URI: urn:text:https://genart.social/tags/genuary
WARC-Date: 2024-02-18T16:58:12.661Z
WARC-Type: resource
WARC-Record-ID: <urn:uuid:1d657dd4-1b01-4e76-bba2-ea641d74c029>
WARC-Payload-Digest: sha256:7cd17ef9c0393fcc1f8fd1b956c0f43eab1a2851f01d06fe41692d2284a2905c
WARC-Block-Digest: sha256:7cd17ef9c0393fcc1f8fd1b956c0f43eab1a2851f01d06fe41692d2284a2905c
Content-Length: 897
Mastodon
Create account
Login
Recent searches
No recent searches
Search options
Not available on genart.social.
genart.social
is part of the decentralized social network powered by
Mastodon
.
...
The WARC-Target-URI could also look like urn:textFinal:{url}, which holds the text of the page after the behaviors have run. But maybe this would complicate the retrieval step if there are multiple records for the same resource?
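To make the idea concrete, here is a rough sketch of the selection logic, not a patch against the actual ingest.py. It assumes each WARC record has already been read into a plain dict with `type`, `target_uri` and `text` keys (a hypothetical stand-in for whatever the WARC reader returns), and that `urn:text:` and `urn:textFinal:` are the only two prefixes to handle. The names `parse_text_urn`, `pick_text_records` and `TextRecord` are made up for illustration. Keeping a single record per URL is one possible way to sidestep the multiple-records concern.

```python
from typing import Dict, Iterable, NamedTuple, Optional, Tuple

# Prefixes taken from the example above; assumed to be the only two variants.
TEXT_URN_PREFIXES = ("urn:text:", "urn:textFinal:")


class TextRecord(NamedTuple):
    kind: str  # "text" or "textFinal"
    url: str   # page URL the extracted text belongs to
    text: str  # the extracted page text


def parse_text_urn(target_uri: str) -> Optional[Tuple[str, str]]:
    """Split 'urn:text:{url}' or 'urn:textFinal:{url}' into (kind, url)."""
    for prefix in TEXT_URN_PREFIXES:
        if target_uri.startswith(prefix):
            kind = prefix.split(":")[1]
            return kind, target_uri[len(prefix):]
    return None  # not a text record URN


def pick_text_records(
    records: Iterable[dict], prefer: str = "textFinal"
) -> Dict[str, TextRecord]:
    """Keep one text record per URL, preferring `prefer` when both kinds exist.

    Each item in `records` is a hypothetical dict with the keys
    'type', 'target_uri' and 'text', standing in for a parsed WARC record.
    """
    chosen: Dict[str, TextRecord] = {}
    for rec in records:
        if rec["type"] != "resource":
            continue  # only WARC-Type: resource records carry the page text
        parsed = parse_text_urn(rec["target_uri"])
        if parsed is None:
            continue  # some other kind of resource record
        kind, url = parsed
        current = chosen.get(url)
        # Take the first record seen, but let the preferred kind replace it.
        if current is None or (kind == prefer and current.kind != prefer):
            chosen[url] = TextRecord(kind, url, rec["text"])
    return chosen
```

With this approach the embedding store would only ever see one chunk source per page URL, at the cost of discarding the other rendering. Whether `textFinal` is the better default is an open question.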
Thank you for the kind words, and for having a look at WARC-GPT.
I think this is a good suggestion.
As you noted, at the moment the pipeline focuses solely on HTTP 2XX responses. Maybe we could add a CLI option to ingest.py to specify which types of WARC records to consider?
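A minimal sketch of what such an option could look like, using `argparse` purely for illustration; the real ingest.py may be built on a different CLI framework, and the `--record-types` flag name, its default, and the `should_ingest` helper are all hypothetical.

```python
import argparse
from typing import Sequence

# Record types this sketch knows how to handle (an assumption, not a spec).
VALID_RECORD_TYPES = ("response", "resource")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a hypothetical --record-types option."""
    parser = argparse.ArgumentParser(prog="ingest")
    parser.add_argument(
        "--record-types",
        nargs="+",
        choices=VALID_RECORD_TYPES,
        default=["response"],  # assumed default: keep the current behavior
        help="WARC record types to ingest (default: response)",
    )
    return parser


def should_ingest(record_type: str, selected: Sequence[str]) -> bool:
    """Return True if a record of this WARC-Type should be processed."""
    return record_type in selected


if __name__ == "__main__":
    args = build_parser().parse_args()
    print("Ingesting record types:", ", ".join(args.record_types))
```

Defaulting to `response` would leave existing behavior unchanged, while `--record-types response resource` would opt in to the text records.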