Facebook archiving #105

Open
djhmateer opened this issue Nov 2, 2023 · 3 comments
Labels
archiver; enhancement (New feature or request); help wanted (Extra attention is needed); nice to have (Issues that are not a priority but would enrich the project)

Comments

@djhmateer
Contributor

I've got a Facebook archiver working using wacz_enricher.py:

https://github.com/djhmateer/auto-archiver/blob/v6-test/src/auto_archiver/enrichers/wacz_enricher.py#L159

I'm using a stored profile to get images which require you to be logged in.

I'm running this archiver from a residential IP, because FB blocks the requests if it is run from a cloud.

This archiver runs in addition to the main archiver (which runs on a cloud). It:

  • looks for any URL which contains facebook.com and has an archive status of "wayback:" (I have added a new config flag called fb_archiver so that gsheet_feeder.py only gets the rows we want)
  • runs the wacz archiver only
  • runs hash_enricher and screenshot_enricher
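A minimal sketch of the row selection described in the first bullet, not the actual gsheet_feeder.py code. It assumes each sheet row has already been read into a dict with hypothetical "url" and "status" keys, and that fb_archiver is a plain boolean; the real feeder reads worksheet cells and may differ.

```python
def select_fb_rows(rows, fb_archiver=True):
    """Return only the rows the Facebook archiver should process.

    Assumption: each row is a dict with "url" and "status" keys
    (hypothetical names, not the real sheet column names).
    """
    if not fb_archiver:
        # Flag off: behave like the normal feeder and pass everything through.
        return list(rows)
    selected = []
    for row in rows:
        url = row.get("url", "").lower()
        status = row.get("status", "").strip()
        # Keep Facebook URLs whose archive status starts with "wayback:".
        if "facebook.com" in url and status.startswith("wayback:"):
            selected.append(row)
    return selected
```

A plain substring check mirrors the description above; matching on the parsed hostname would be stricter if false positives turn out to matter.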

This could be much simpler if I can run everything sequentially (and not on 2 servers). I need to wait for more bandwidth on the residential network, then I can potentially do a PR.

Also, I've found I need to keep testing the profile, as it needs to be logged in again after a few weeks.

@msramalho
Contributor

Looking forward to that PR, we can indeed have an option to run a specific archiver via a residential IP proxy.

@msramalho
Contributor

Taking another look at this, can you clarify whether you're doing any extra downloads/requests or simply parsing data from inside the wacz?

@djhmateer
Contributor Author

djhmateer commented Feb 21, 2024

Hi Miguel

From:

https://github.com/djhmateer/auto-archiver/blob/v6-test/src/auto_archiver/enrichers/wacz_enricher.py#L159

It's probably best to follow along with the link above.

Apart from the /photo special case, I get the root page, then parse it for resources, getting the fb_id and set_id. Then jump down to

https://github.com/djhmateer/auto-archiver/blob/v6-test/src/auto_archiver/enrichers/wacz_enricher.py#L400

which does another request (and another wacz download), then returns the next fb_id back to the main function above.
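A rough sketch of that control flow only, not the code at the links above. Here fetch_next is a hypothetical stand-in for the function that does the extra request (and wacz download) and returns the next fb_id, or None when there is nothing further; max_items and the seen-set guard are assumptions added to keep the loop bounded.

```python
from typing import Callable, Iterator, Optional


def walk_fb_ids(
    first_fb_id: str,
    set_id: str,
    fetch_next: Callable[[str, str], Optional[str]],
    max_items: int = 50,
) -> Iterator[str]:
    """Yield fb_ids by repeatedly asking fetch_next for the following one.

    fetch_next(fb_id, set_id) is hypothetical: it stands in for the
    per-item request that returns the next fb_id, or None at the end.
    """
    seen = set()
    fb_id: Optional[str] = first_fb_id
    # Stop at the end of the chain, on a repeated id, or at the item cap.
    while fb_id is not None and fb_id not in seen and len(seen) < max_items:
        seen.add(fb_id)
        yield fb_id
        fb_id = fetch_next(fb_id, set_id)
```

The main function would iterate over walk_fb_ids(...) and archive each id in turn, so the extra requests stay in one place.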

Regards
Dave

@msramalho assigned msramalho and unassigned msramalho Apr 9, 2024
@msramalho added the labels archiver, enhancement, help wanted, nice to have Apr 16, 2024