Convert a blog to an EPUB file using the command line or a GUI.
My main goal in creating this app is to preserve the legacy of the blogosphere for future generations.
- *.blogspot.com
- *.wordpress.com
- multiple other blogs and even some webpages
- command line (CLI) and graphical user interface (GUI)
- script downloads all text content of the selected blog to an EPUB file,
- post comments are included when possible,
- images are downsized (to a maximum of 800x600 px) and converted to grayscale,
- one post = one epub chapter,
- chapters are sorted by date ascending,
- cover is generated automatically from downloaded images.
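The image and ordering rules in the list above can be sketched in plain Python. This is only an illustration of the arithmetic, not the project's code: `fit_within`, `to_gray`, `Post` and `chapters_in_order` are hypothetical names, and the sketch assumes that 800 is the maximum width, 600 the maximum height, that the aspect ratio is preserved, and that grayscale uses the common BT.601 luma weights.

```python
from dataclasses import dataclass
from datetime import date

MAX_WIDTH, MAX_HEIGHT = 800, 600  # limits quoted in the feature list (orientation assumed)


def fit_within(width: int, height: int) -> tuple[int, int]:
    """Scale a size down to fit 800x600, keeping the aspect ratio (assumed)."""
    scale = min(MAX_WIDTH / width, MAX_HEIGHT / height, 1.0)  # 1.0 means: never upscale
    return max(1, round(width * scale)), max(1, round(height * scale))


def to_gray(r: int, g: int, b: int) -> int:
    """Convert one RGB pixel to a gray level using BT.601 weights (assumed formula)."""
    return round(0.299 * r + 0.587 * g + 0.114 * b)


@dataclass
class Post:  # hypothetical model: one post becomes one chapter
    title: str
    published: date


def chapters_in_order(posts: list[Post]) -> list[Post]:
    """Return posts sorted by publication date, oldest first."""
    return sorted(posts, key=lambda post: post.published)


posts = [Post("newer", date(2020, 5, 1)), Post("older", date(2015, 1, 9))]
print(fit_within(1600, 1200), [p.title for p in chapters_in_order(posts)])
```

A real implementation would most likely delegate resizing and color conversion to an imaging library; the functions here only show which sizes and ordering the rules imply.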
Check out the latest available builds.
git clone git@github.com:bohdanbobrowski/blog2epub.git
cd blog2epub
poetry install
poetry run blog2epubgui
poetry run build_gui_windows
poetry run build_gui_macos
Then, to create a DMG image with the app:
./make_macos_dmg.sh
Before you start, install buildozer by following its installation documentation.
poetry shell
buildozer -v android
blog2epub --help
usage: Blog2epub Cli interface [-h] [-l LIMIT] [-s SKIP] [-q QUALITY] [-o OUTPUT] [-d] url

Convert blog (blogspot.com, wordpress.com or another based on Wordpress) to epub using CLI or GUI.

positional arguments:
  url                   url of blog to download

options:
  -h, --help            show this help message and exit
  -l LIMIT, --limit LIMIT
                        articles limit
  -s SKIP, --skip SKIP  number of skipped articles
  -q QUALITY, --quality QUALITY
                        images quality (0-100)
  -o OUTPUT, --output OUTPUT
                        output epub file name
  -d, --debug           turn on debug
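For readers who want to see how such an interface is declared, here is a minimal `argparse` sketch that accepts the options listed in the help text above. It is a reconstruction for illustration, not the project's actual parser: the function name `build_parser`, the `prog` value, the `int` types and the `None` defaults are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the options shown in the help output (sketch)."""
    parser = argparse.ArgumentParser(
        prog="blog2epub",  # assumed; the real help prints a different program name
        description="Convert blog (blogspot.com, wordpress.com or another "
        "based on Wordpress) to epub using CLI or GUI.",
    )
    parser.add_argument("url", help="url of blog to download")
    # Types and defaults below are assumptions; the help text does not state them.
    parser.add_argument("-l", "--limit", type=int, default=None, help="articles limit")
    parser.add_argument("-s", "--skip", type=int, default=None, help="number of skipped articles")
    parser.add_argument("-q", "--quality", type=int, default=None, help="images quality (0-100)")
    parser.add_argument("-o", "--output", default=None, help="output epub file name")
    parser.add_argument("-d", "--debug", action="store_true", help="turn on debug")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["starybezpiek.blogspot.com", "-l=2", "-o=example.epub"])
print(args.url, args.limit, args.output, args.debug)
```

Note that `argparse` accepts both `-l=2` and `-l 2` for options that take a value.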
Example:
blog2epub starybezpiek.blogspot.com -l=2 -o=example.epub
Starting blogger.com crawler
Found 54 articles to crawl.
Downloading.
1. 10 lat kremlowskiej propagandy, czyli RT ujawnia swoje sekrety
Downloading.
2. "Komunę obaliliśmy, a nadal jest źle. Dlaczego?" Czyli 1984 Orwella właściwie odczytany
Locale set as en_US.UTF-8
Generating cover (800px*600px) from 1 images.
Cover generated: .\starybezpiek.blogspot.com\example.epub.jpg
Epub created: .\example.epub
poetry run blog2epub starybezpiek.blogspot.com
poetry run blog2epub velosov.blogspot.com -l=10
poetry run blog2epub poznanskiehistorie.blogspot.com -q=100
poetry run blog2epub classicameras.blogspot.com --limit=10 --no-images
pytest ./tests
pytest --cov=blog2epub ./tests
pytest --cov=blog2epub --cov-report=html ./tests
- integration testing
- increase unit test coverage
- use sitemaps.xml for scraping
- crawlers refactor
- use data models
- more common methods in crawler class
- expand crawler abstract
- CLI refactor
- Greek alphabet support
- fixed image download and attachment bug (e.g. modernistyczny-poznan.blogspot.com)
- improved resistance to http errors
- dedicated crawler class for zeissikonveb.de
- (in the GUI) the skip value is increased by the limit value (if one is set)
- download progress is much more verbose, and in the GUI a download can be cancelled at any time
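The interplay of the skip and limit values mentioned in the list above can be sketched as follows. This reflects one reading of that entry (after a run, skip advances by limit so that the next run picks up the following batch) and is not taken from the project's code; `select_articles` and `next_skip` are hypothetical helpers.

```python
from typing import Optional, Sequence


def select_articles(
    articles: Sequence[str], skip: int = 0, limit: Optional[int] = None
) -> list[str]:
    """Drop the first `skip` articles, then keep at most `limit` of the rest."""
    end = None if limit is None else skip + limit
    return list(articles[skip:end])


def next_skip(skip: int, limit: Optional[int]) -> int:
    """Advance skip by limit when a limit is set; otherwise leave it unchanged."""
    return skip + limit if limit else skip


articles = [f"post-{i}" for i in range(1, 8)]
first_batch = select_articles(articles, skip=0, limit=3)
second_batch = select_articles(articles, skip=next_skip(0, 3), limit=3)
print(first_batch, second_batch)
```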
And finally, a list of known bugs and plans for new features and enhancements: BACKLOG.md
- 1.0 - somewhat working
- 2.0 - fully working project, 90% unit tested and available builds for Android/Windows/Linux/MacOS