When I crawl a website with mostly static resources, I notice that resources can be missing from the resulting WARC, due to either broken links or timeouts.
I have written tools that grab every `WARC-Target-URI`, go through all the `Content-Type: text/html` WARC records, pull the URLs found in `<img src>`, `<link href>`, `<a href>`, or `<script src>`, compare what is referenced with what is available in the WARC, and fetch the missing ones (using `wget --warc-file`). That works fine.
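For illustration, a minimal sketch of the reference-vs-capture diff step, assuming the HTML payloads and the set of `WARC-Target-URI` values have already been extracted from the archive (the function and class names here are hypothetical, not the actual tools):

```python
from html.parser import HTMLParser


class RefExtractor(HTMLParser):
    # Collect URLs referenced by <img src>, <script src>, <link href>, <a href>.
    TAGS = {"img": "src", "script": "src", "link": "href", "a": "href"}

    def __init__(self):
        super().__init__()
        self.refs = set()

    def handle_starttag(self, tag, attrs):
        wanted = self.TAGS.get(tag)
        if wanted:
            for name, value in attrs:
                if name == wanted and value:
                    self.refs.add(value)


def missing_resources(html_pages, captured_uris):
    """Return URLs referenced in the given HTML payloads but absent
    from the set of WARC-Target-URI values already in the archive."""
    extractor = RefExtractor()
    for page in html_pages:
        extractor.feed(page)
    return extractor.refs - set(captured_uris)
```

The resulting set can then be passed to `wget --warc-file` to backfill the archive.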
Is it possible to do the same for dynamic pages, where the requests are made by JavaScript? Does Browsertrix record all XHR requests attempted (not necessarily completed) anywhere?
Thanks!