An efficient command-line Python library to scrape webpages and download the relevant pages as PDFs. It is optimised for speed and is useful for people who do not have access to high-speed internet and need webpages for offline reading.
$ pip install bigheads
Note that bigheads depends on bs4, httplib2, wkhtmltopdf and pdfkit.
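As a rough sketch of how these dependencies typically fit together (the function below is illustrative only and is not bigheads' actual internals), httplib2 fetches a page, bs4 extracts its links, and pdfkit drives wkhtmltopdf to produce the PDF:

```python
import httplib2
import pdfkit
from bs4 import BeautifulSoup

def page_to_pdf(url, output_path="page.pdf"):
    # Fetch the page with httplib2.
    http = httplib2.Http()
    _response, content = http.request(url)

    # Parse outgoing links with BeautifulSoup (bs4), e.g. for recursive scraping.
    soup = BeautifulSoup(content, "html.parser")
    links = [a.get("href") for a in soup.find_all("a") if a.get("href")]

    # Render the page to a PDF via pdfkit, which shells out to wkhtmltopdf.
    pdfkit.from_url(url, output_path)
    return links
```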
If you see the following error, wkhtmltopdf is missing or not readable by the process:

IOError: No wkhtmltopdf executable found:
If this file exists please check that this process can read it. Otherwise please install wkhtmltopdf - https://github.com/JazzCore/python-pdfkit/wiki/Installing-wkhtmltopdf

Visit that URL to see the installation steps for your respective OS.
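Once wkhtmltopdf is installed, you can verify that pdfkit can find it (independently of bigheads) with a few lines like the following; the binary path shown is just an example for systems where wkhtmltopdf is not on the PATH:

```python
import pdfkit

# Point pdfkit at a specific wkhtmltopdf binary if it is not on the PATH.
# The path below is only an example; adjust it for your system.
config = pdfkit.configuration(wkhtmltopdf="/usr/local/bin/wkhtmltopdf")
pdfkit.from_url("https://example.com", "example.pdf", configuration=config)
```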
$ bigheads URL [OPTIONS]
-t, --tags [TAG1, TAG2 ....] Space-delimited list of tags for which web articles will be scraped.
-d, --directoryPath DIR_PATH Path of the directory where PDF files are to be saved.
-l, --limit LIMIT Limit on the number of articles to be scraped.
Recursively scrape all URLs from the provided seed URL and save the PDFs to a default folder.
$ bigheads https://www.geeksforgeeks.org/tag/queue
Recursively scrape URLs from the provided seed URL ensuring that the articles match the provided tags.
$ bigheads https://www.geeksforgeeks.org/tag/queue --tags amazon microsoft
Recursively scrape URLs from the provided seed URL, ensuring that the articles match the provided tags, and store the PDFs at the provided path.
$ bigheads https://www.geeksforgeeks.org/tag/queue --tags amazon microsoft -d queue
Recursively scrape URLs from the provided seed URL, ensuring that the articles match the provided tags and that the provided limit is not breached. The limit caps the number of articles whose PDFs are downloaded to the device.
$ bigheads https://www.geeksforgeeks.org/tag/queue --tags amazon microsoft -d queue -l 100
- By default, a new folder called "pdfs_" will be created in the working directory, containing all the downloaded PDFs.
- We have a limit of 300 articles that can be scraped in one run. This limit can be increased by parallelising the scraping tasks.
- If tags are provided, only articles that match all of the provided tags are considered and scraped into PDFs.
- Many noise files may be downloaded because the mechanism for computing relevance by matching tags is naive and not very consistent.
- For a large number of recursive URLs, the current routine for converting them into PDFs takes longer. There is scope to parallelise these tasks into different batches (a sketch of one possible approach follows this list).
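As a rough illustration of what such batching could look like (this is a hypothetical sketch, not bigheads' internal code; the helper name and worker count are assumptions), PDF conversions could be spread across a thread pool, since wkhtmltopdf already runs as a separate process per conversion:

```python
from concurrent.futures import ThreadPoolExecutor

import pdfkit

def convert_batch(urls, out_dir=".", max_workers=4):
    """Convert a batch of URLs to PDFs using a pool of worker threads."""
    def convert(url):
        # Derive a simple file name from the URL; real code would sanitise it.
        name = url.rstrip("/").rsplit("/", 1)[-1] or "index"
        pdfkit.from_url(url, f"{out_dir}/{name}.pdf")

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() is lazy, so force evaluation to run all conversions.
        list(pool.map(convert, urls))
```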
If you want to add features, improve them, or report issues, feel free to send a pull request or leave a comment.
bigheads is a tool meant to make it convenient for a user to read webpages in the absence of internet access. It should not be misused or used for purposes other than education/research. The user agrees to USE IT AT THEIR OWN RISK.
Answer - Well, I wrote this program in 2015 and distributed the initial version to PyPI in 2016. In 2020, when I was revisiting this code, I had no clue why I had named the package 'bigheads', because the name had no correlation with what the code was doing. It took me about 30 minutes to recall that the name is based on the famous character 'Bighead' from the TV series Silicon Valley.
I never thought I would reference a TV series character in my work, but at that time the show had a strong influence on me, instilling a love for Computer Science, Entrepreneurship and Comedy.