This is an application that crawls news websites and displays the main news on a front page.
The things to look out for are:
The Django ORM is used by both the web app and the crawlers. For the crawlers, this works by bootstrapping Django as if the main app were being launched, so that Django's full infrastructure (settings, app registry, database connection) is ready to persist scraped data.
- The bash script creates a local copy, assets included, of a single website's page using `wget`. Links are automatically converted to the filesystem's structure, so even foreign assets are served as if the mirror were the actual server making network requests.
- The local mirror can be served with `python -m http.server`, a built-in module. That lets you practice and do dry scraping runs as much as you want in a development environment.
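A hedged sketch of the kind of `wget` invocation the mirror script likely wraps (these are standard `wget` flags; the script's exact flags and paths may differ, and `example.com` is a placeholder):

```shell
# --page-requisites:  also download the CSS, JS, and images the page needs
# --convert-links:    rewrite links to point at the local copies
# --adjust-extension: save pages with .html so http.server serves them correctly
wget --page-requisites --convert-links --adjust-extension \
     --directory-prefix=fixtures/example "https://example.com/"
```

Serving the resulting folder with `python -m http.server` then reproduces the page locally, third-party assets included.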
A small app built with Django templates for filtering news by its domain.
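The filtering reduces to grouping items by their URL's domain, which a small helper can extract (a hypothetical sketch; the model and field names in the comment are assumptions, not the repository's actual code):

```python
from urllib.parse import urlparse

def domain_of(url: str) -> str:
    """Return the network location used to group news items by domain."""
    return urlparse(url).netloc

# In a Django view, the queryset could then be narrowed roughly like:
#   News.objects.filter(domain=request.GET["domain"])
# (News and its `domain` field are hypothetical names.)
```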
This repository ships with a website mirror and a pre-populated sample database, so there is no need to crawl before running the app.
pip install -r requirements.txt
cd PortalDjangoApp && ./manage.py runserver
pip install -r requirements.txt
cd PortalDjangoApp && ./manage.py migrate
It will create a db.sqlite3 file in the same folder.
- A TecMundo dummy copy is already available at the fixtures folder.
cd fixtures/tecmundo && python -m http.server
- In another terminal ❗❗ note the `-a env=DEV` switch for development:
cd ScraperScrapyApp && scrapy crawl TecMundoSpider -a env=DEV
- For a production run against the live site:

cd ScraperScrapyApp && scrapy crawl TecMundoSpider
Note: After crawling, you can shut down both the server and the crawler.
cd PortalDjangoApp && ./manage.py runserver
scripts/single_page_mirror.sh [URL] [name_of_folder]
cd fixtures/[name_of_folder] && python -m http.server