This is an application that crawls news websites and displays the main news on a front page.
The things to look out for are:
The Django ORM is used by both the web app and the crawlers. For the crawlers, this works by bootstrapping Django as if the main app were being launched, so that Django's full infrastructure (settings, app registry, database connection) is ready to persist scraped data.
- The bash script creates a local copy, assets included, of a single website's page using `wget`. Links are automatically converted to the filesystem's structure, so even foreign assets are served as if the mirror were the actual server making network requests.
- The local mirror can be served with `python -m http.server`, a built-in module. That lets you practice and do dry scraping runs as much as you want in a development environment.
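A hedged sketch of the kind of `wget` invocation the mirror script likely wraps (these are standard `wget` flags; the script's exact flags and paths may differ, and `example.com` is a placeholder):

```shell
# --page-requisites:  also download the CSS, JS, and images the page needs
# --convert-links:    rewrite links to point at the local copies
# --adjust-extension: save pages with .html so http.server serves them correctly
wget --page-requisites --convert-links --adjust-extension \
     --directory-prefix=fixtures/example "https://example.com/"
```

Serving the resulting folder with `python -m http.server` then reproduces the page locally, third-party assets included.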
A small app built with Django templates for filtering news by its domain.
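The filtering reduces to grouping items by their URL's domain, which a small helper can extract (a hypothetical sketch; the model and field names in the comment are assumptions, not the repository's actual code):

```python
from urllib.parse import urlparse

def domain_of(url: str) -> str:
    """Return the network location used to group news items by domain."""
    return urlparse(url).netloc

# In a Django view, the queryset could then be narrowed roughly like:
#   News.objects.filter(domain=request.GET["domain"])
# (News and its `domain` field are hypothetical names.)
```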
This repository ships with a website mirror and a pre-populated sample database, so there is no need to crawl before running the app.
pip install -r requirements.txt
cd PortalDjangoApp && ./manage.py runserver
pip install -r requirements.txt
cd PortalDjangoApp && ./manage.py migrate
It will create a db.sqlite3 file in the same folder.
- A TecMundo dummy copy is already available at the fixtures folder.
cd fixtures/tecmundo && python -m http.server
- In another terminal ❗❗ note the `-a env=DEV` switch for development:
cd ScraperScrapyApp && scrapy crawl TecMundoSpider -a env=DEV
- For a production run against the live site:

cd ScraperScrapyApp && scrapy crawl TecMundoSpider
Note: After crawling, you can shut down both the server and the crawler.
cd PortalDjangoApp && ./manage.py runserver
scripts/single_page_mirror.sh [URL] [name_of_folder]
cd fixtures/[name_of_folder] && python -m http.server