Save a copy of the linked content in case of link rot #207

Open
ihavenogithub opened this issue Oct 19, 2014 · 9 comments

@ihavenogithub

After years of using del.icio.us, then Yahoo's Delicious, then self-hosted Scuttle and Semantic Scuttle, I really miss the ability to save a local copy of the linked content, so that a copy is still available when link rot eventually occurs.

@nodiscc

nodiscc commented Oct 20, 2014

@ihavenogithub You're right, this was proposed a long time ago (#58). We have been triaging bugs and fixing issues at https://github.com/shaarli/Shaarli/, and concluded that Shaarli should not include complex features like web scraping (or should keep them as plugins, but we don't have a plugin system yet).

I'm working on a python script that:

  • Downloads HTML exports from Shaarli
  • Saves the linked pages, with the ability to filter by tag, download audio/video media, and more.

The script can be run from a client machine (a laptop, whatever) or placed on the server itself and run periodically (if the host supports Python and cron jobs). At the moment the script works perfectly for me, but it needs some cleanup. Would this solve your problem?
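For reference, here is a minimal sketch of that approach, assuming the Shaarli export is a standard Netscape-bookmark HTML file; the file names, the `TAG_FILTER` value and the regex-based parsing are illustrative assumptions, not the actual script:

```python
# Rough sketch: parse a Shaarli HTML export (Netscape bookmark format) and
# save a raw local copy of every linked page. All names/paths are placeholders.
import os
import re
import urllib.request

EXPORT_FILE = "bookmarks.html"   # HTML export downloaded from Shaarli
ARCHIVE_DIR = "archive"
TAG_FILTER = None                # e.g. "music" to only archive links carrying that tag

# Each bookmark line looks roughly like: <DT><A HREF="url" ... TAGS="tag1,tag2">title</A>
LINK_RE = re.compile(r'<A HREF="([^"]+)"[^>]*TAGS="([^"]*)"', re.IGNORECASE)

os.makedirs(ARCHIVE_DIR, exist_ok=True)
with open(EXPORT_FILE, encoding="utf-8") as f:
    for url, tags in LINK_RE.findall(f.read()):
        if TAG_FILTER and TAG_FILTER not in tags.split(","):
            continue
        # Derive a crude file name from the URL and save the raw page body.
        name = re.sub(r"[^A-Za-z0-9._-]", "_", url)[:150] + ".html"
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                data = resp.read()
            with open(os.path.join(ARCHIVE_DIR, name), "wb") as out:
                out.write(data)
        except Exception as exc:
            print("failed:", url, exc)
```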

@ihavenogithub

Probably for a while, but I'd rather have this process happen automatically.
Would you mind giving me a link to your script? I'd like to give it a try.

@nodiscc

nodiscc commented Oct 20, 2014

It will be automatic if you add it as a scheduled task (cron job). I'm now formatting the script so that it's usable/readable for everyone and will keep this updated.
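For example (an illustrative crontab entry, not the script's actual name or options), a nightly run could look like this:

```
# run the archiver every night at 03:00; paths are placeholders
0 3 * * * /usr/bin/python /home/user/shaarli-archiver.py >> /home/user/archiver.log 2>&1
```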

@nodiscc

nodiscc commented Nov 5, 2014

Hey @ihavenogithub, I've started rewriting my script from scratch (it was too damn ugly), check https://github.com/nodiscc/shaarchiver

For now it only downloads HTML exports and audio/video media (with tag filtering), not pages. Rather limited, but it's a clean start and more is planned (see the issues). Contributions welcome.

@Epy

Epy commented May 26, 2015

Hi,
Your archiver script could use Wallabag's scraper; you'd be able to scrape many websites without having to reinvent the wheel.
Wallabag does what you need, but it would need Shaarli integration and automation, I think.

@nodiscc

nodiscc commented May 26, 2015

@Epy

  • This tool is written in Python
  • It's a command line tool
  • This tool is for local offline archiving, not on a remote server
  • This tool leverages youtube-dl for media downloads (it supports more than 500 websites; see the sketch at the end of this comment)
  • Once the page download features are in, it will download exact copies of pages, not "readable" versions (except that ads will be removed).

So I don't think Wallabag could be useful for me.

However I agree that wallabag should be able to automatically archive pages from RSS feeds. Did you report a bug for this on the Wallabag issue tracker?
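As an illustration of the youtube-dl integration mentioned above, here is a hedged sketch using youtube-dl's Python API; the option values and the function name are assumptions, not shaarchiver's actual code:

```python
# Minimal sketch of downloading media with youtube-dl's Python API.
# The output template and format options are illustrative choices.
import youtube_dl

def download_media(urls, directory="media"):
    options = {
        "outtmpl": directory + "/%(title)s.%(ext)s",  # where downloaded files go
        "format": "bestaudio/best",                    # prefer audio, fall back to best available
        "ignoreerrors": True,                          # skip links youtube-dl cannot handle
    }
    with youtube_dl.YoutubeDL(options) as ydl:
        ydl.download(urls)

# download_media(["https://www.youtube.com/watch?v=..."])
```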

@Epy

Epy commented May 27, 2015

The Wallabag server could be hosted at your home ^_^ but I was only suggesting reusing some of its components as a library, if possible.
Maybe re-use the patterns only: https://github.com/wallabag/wallabag/tree/master/inc/3rdparty/site_config

It would be a great thing to have a standard library for downloading web pages, available to all open-source and free software.

I understand that can't be done if you're developing in Python and Wallabag is PHP.

Thank you for your tool BTW :]

@nodiscc

nodiscc commented May 28, 2015

Thanks for the feedback @Epy. I guess the script could also be run automatically by your home server if set up with a cron job.

> Maybe re-use the patterns only: https://github.com/wallabag/wallabag/tree/master/inc/3rdparty/site_config

The patterns are very interesting, as they contain what to strip/extract to obtain "readable" versions of pages (example for bbc.co.uk). This feature could be added in the long run (in another script, or as a command-line option).
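For illustration, a rough sketch of how such patterns could be applied, assuming a rules file with `body:` and `strip:` XPath lines in the spirit of the site_config format, and using lxml; the file name, the simplified parsing and the function are placeholders, not shaarchiver code:

```python
# Sketch: apply site_config-style "body:" / "strip:" XPath rules to a fetched page
# to produce a "readable" version. Rule parsing is deliberately simplified.
import lxml.html

def extract_readable(html_text, rules_path="bbc.co.uk.txt"):
    body_xpaths, strip_xpaths = [], []
    with open(rules_path) as f:
        for line in f:
            if line.startswith("body:"):
                body_xpaths.append(line.split(":", 1)[1].strip())
            elif line.startswith("strip:"):
                strip_xpaths.append(line.split(":", 1)[1].strip())

    doc = lxml.html.fromstring(html_text)
    for xp in strip_xpaths:                 # drop ads, sharing widgets, comments...
        for node in doc.xpath(xp):
            parent = node.getparent()
            if parent is not None:
                parent.remove(node)
    parts = []
    for xp in body_xpaths:                  # keep only the article body
        parts.extend(lxml.html.tostring(n, encoding="unicode") for n in doc.xpath(xp))
    return "\n".join(parts)
```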

For now I want to concentrate on keeping exact copies of the pages, then on removing just the ads (I don't know where I saved it, but I have a draft for this: basically, download ad-blocking lists and fetch pages through a proxy that removes them).
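A hedged sketch of that idea, checking URLs against a hosts-format blocklist instead of running a full proxy; the blocklist URL and helper names are illustrative:

```python
# Sketch: load a hosts-format ad-blocking list and decide whether a resource URL
# should be fetched when archiving a page. The list URL is an example only.
import urllib.request
from urllib.parse import urlparse

BLOCKLIST_URL = "https://example.com/hosts.txt"  # any hosts-format ad list

def load_blocked_domains(url=BLOCKLIST_URL):
    blocked = set()
    with urllib.request.urlopen(url) as resp:
        for line in resp.read().decode("utf-8", "replace").splitlines():
            parts = line.split()
            # hosts format: "0.0.0.0 ads.example.net"; lines starting with '#' are comments
            if len(parts) >= 2 and not line.startswith("#"):
                blocked.add(parts[1].lower())
    return blocked

def is_blocked(url, blocked_domains):
    host = (urlparse(url).hostname or "").lower()
    return host in blocked_domains
```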

I'm rewriting it (again...) as the script was getting overcomplicated. The next version should be able to download video, audio (already implemented) and web pages, and generate a Markdown and HTML index of the archive. The version after that should make the HTML index filterable/searchable (text/tags), and the one after that should support ad blocking.

Feel free to report a feature request so that I won't forget your ideas.

I also think wallabag should really support auto-downloading articles from RSS feeds...

@Epy

Epy commented May 29, 2015

Okay, I just made the feature request in your github repo :)

FreshRSS is an RSS feed reader and can export to Wallabag (as it can do with Shaarli, to export the links only).
http://freshrss.org/
With three self-hostable, open-source (and KISS) apps connected, we should be able to have a nice system, no? :)
