Python in the read me file #41

Open
jburnford opened this issue Feb 1, 2024 · 2 comments
Labels
question Further information is requested

Comments

@jburnford

The README, as far as I can see, focuses on the command line. I have a dozen WACZ files created to archive Facebook posts of political leaders during COVID, and I'm trying to understand how people could use them in the future beyond uploading them into the browser plugin (which doesn't seem to handle big files very well). Can we use py-wacz to interact with and explore the contents of a WACZ file? Is there any documentation on using archives?

@Shrinks99 Shrinks99 added the question Further information is requested label Feb 29, 2024
@Shrinks99
Member

Shrinks99 commented Feb 29, 2024

py-wacz doesn't offer any tools for exploring archived content. ReplayWeb.page is the primary tool that we make for end users exploring web archives.

You may be interested in warcio (ours) or Internet Archive's warc Python library (older). WACZ files are basically extra data wrapped around WARC files in a ZIP, so if you want a more data-driven approach you might start there. Additionally, the pages.jsonl file inside the pages directory of the WACZ contains extracted text and page metadata you may find useful.
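As a rough sketch of that data-driven approach (not an official py-wacz API): the snippet below treats the WACZ as a plain ZIP, reads `pages/pages.jsonl` with the standard library, and walks the WARC records with warcio's `ArchiveIterator`. The function names are hypothetical, and the code assumes that page entries carry a `url` key while the leading header line does not, and that WARC members end in `.warc` or `.warc.gz`.

```python
import io
import json
import zipfile


def list_members(wacz_path):
    """Return the names of all files stored in the WACZ (a plain ZIP archive)."""
    with zipfile.ZipFile(wacz_path) as zf:
        return zf.namelist()


def read_pages(wacz_path, member="pages/pages.jsonl"):
    """Return page entries (dicts) from the pages file inside the WACZ.

    Assumption: lines without a "url" key (such as a leading header
    line) are not page entries, so they are skipped.
    """
    pages = []
    with zipfile.ZipFile(wacz_path) as zf:
        with zf.open(member) as raw:
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                line = line.strip()
                if not line:
                    continue
                entry = json.loads(line)
                if "url" in entry:
                    pages.append(entry)
    return pages


def iter_response_urls(wacz_path):
    """Yield (url, content_type) for every response record in every WARC.

    Requires the third-party warcio package (pip install warcio).
    """
    # Imported lazily so the stdlib-only helpers above work without warcio.
    from warcio.archiveiterator import ArchiveIterator

    with zipfile.ZipFile(wacz_path) as zf:
        for name in zf.namelist():
            # Assumption: WARC members are identified by their extension.
            if not name.endswith((".warc", ".warc.gz")):
                continue
            with zf.open(name) as stream:
                for record in ArchiveIterator(stream):
                    if record.rec_type != "response":
                        continue
                    url = record.rec_headers.get_header("WARC-Target-URI")
                    ctype = None
                    if record.http_headers is not None:
                        ctype = record.http_headers.get_header("Content-Type")
                    yield url, ctype
```

Calling `read_pages("my-archive.wacz")` (a placeholder filename) gives a list of dicts that can be loaded into a spreadsheet or data frame, while `iter_response_urls` is a starting point for pulling individual captured responses out of the WARC data.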

Hopefully this answers your question? :)

@jburnford
Author

Thanks. I'll see what I can do with warcio. For what it's worth, we tried uploading the WACZ files into an Archive-It repository so we could use Archives Unleashed tools, but never managed to get it working. Webrecorder is an essential tool for capturing content on sites that block the Internet Archive, but we need to start developing documentation on what to do with the data once it is created. I'm working on a paper that will try to make a start, and if anyone from your project is interested in collaborating, I'd be happy for the help.
