scraper-sk_nrsr

Scraper of the Slovak National Council for the Visegrad+ project. It scrapes MPs, their memberships, votes, and debates, and stores the data in the Visegrad+ parliament API.

Installation

Prerequisites

Requires:

  • lxml library to parse HTML documents,
  • LibreOffice core and unoconv to convert documents from RTF format,
  • the Python packages listed in requirements.txt.

On Debian-based distributions install the libraries:

$ sudo apt-get install libxml2-dev libxslt1-dev zlib1g-dev libreoffice-core unoconv

Download

Get the scraper:

$ sudo mkdir -p /home/projects/scrapers
$ cd /home/projects/scrapers
$ sudo git clone https://github.com/KohoVolit/scraper-sk_nrsr.git sk_nrsr

Get the VPAPI client and the server's SSL certificate:

$ cd sk_nrsr
$ sudo wget https://raw.githubusercontent.com/KohoVolit/api.parldata.eu/master/client/vpapi.py
$ sudo wget https://raw.githubusercontent.com/KohoVolit/api.parldata.eu/master/client/server_cert.pem

Create a virtual environment for the scraper and install the required packages into it:

$ sudo virtualenv /home/projects/.virtualenvs/scrapers/sk_nrsr --no-site-packages
$ source /home/projects/.virtualenvs/scrapers/sk_nrsr/bin/activate
(sk_nrsr)$ sudo pip install -r requirements.txt
(sk_nrsr)$ deactivate

Configuration

Check that the SERVER_NAME and SERVER_CERT variables in vpapi.py have the correct values.

Copy the file conf/private-example.json to conf/private.json and fill in your username and password for write access through the API. This sensitive data must not be committed to the repository.
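A typo or an empty field in conf/private.json usually only surfaces once the scraper tries to write through the API. The following is a minimal sketch of an early check, not part of the scraper itself. The helper name load_credentials and the key names username and password are assumptions; consult conf/private-example.json for the field names the scraper actually expects.

```python
import json
from pathlib import Path


def load_credentials(path="conf/private.json", required=("username", "password")):
    """Load the private config and fail early on missing or empty fields.

    The key names in `required` are assumptions for illustration only;
    conf/private-example.json is the authoritative list of fields.
    """
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    # Treat absent keys and empty strings the same way: both are unusable.
    missing = [key for key in required if not config.get(key)]
    if missing:
        raise ValueError(f"{path}: missing or empty field(s): {', '.join(missing)}")
    return config
```

Calling load_credentials() from the sk_nrsr directory before the first scrape raises a ValueError that names the offending fields.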

Running

Run the scraper inside the virtual environment. Its help message lists the parameters it accepts:

$ source /home/projects/.virtualenvs/scrapers/sk_nrsr/bin/activate
$ python scrape.py --help

A unoconv listener must be running to scrape transcripts of older debates (election terms 1-4):

$ unoconv --listener &
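Because the listener is started in the background, a failed start is easy to miss. The sketch below checks whether anything accepts TCP connections on the listener's address. It assumes unoconv's documented defaults of 127.0.0.1 and port 2002; the function name listener_is_up is hypothetical, and a successful connection only shows that some process is listening there, not that it is LibreOffice.

```python
import socket


def listener_is_up(host="127.0.0.1", port=2002, timeout=1.0):
    """Return True if a TCP connection to host:port is accepted.

    The defaults assume unoconv's standard listener address; adjust them
    if the listener was started on a different server or port.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, or unreachable host.
        return False
```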

Scrape people and their memberships first, then debates, and finally votes. Note that an initial scrape of debates deletes all existing sessions and sittings. The debates command uses sudo -H because unoconv creates temporary files in HOME:

$ sudo -u visegrad python scrape.py --people initial --debates none --votes none
$ sudo -H -u visegrad python scrape.py --people none --debates initial --votes none
$ sudo -u visegrad python scrape.py --people none --debates none --votes initial

Or scrape everything at once:

$ sudo -H -u visegrad python scrape.py --people initial --debates initial --votes initial
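When the phases are run separately, a later phase should not start if an earlier one failed. The following sketch wraps the per-phase commands above and stops at the first non-zero exit status. The function names are hypothetical; only the --people, --debates and --votes options and the initial and none values come from the commands above. It assumes it is started from the sk_nrsr directory, inside the virtual environment, as the user that should run the scrape.

```python
import subprocess
import sys

PHASES = ("people", "debates", "votes")  # order in which the phases must run


def phase_command(active, mode="initial", script="scrape.py"):
    """Build the argv that scrapes only `active` in `mode` and skips the rest."""
    if active not in PHASES:
        raise ValueError(f"unknown phase: {active}")
    argv = [sys.executable, script]
    for phase in PHASES:
        argv += [f"--{phase}", mode if phase == active else "none"]
    return argv


def run_initial_scrape(runner=subprocess.run):
    """Run the phases in order; check=True raises on the first failure."""
    for phase in PHASES:
        runner(phase_command(phase), check=True)
```

The runner parameter exists only so that the command sequence can be inspected or tested without launching the scraper.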

You can then stop the unoconv listener, unless it is needed for other scrapers or conversions:

$ sudo killall soffice.bin

Then schedule a periodic scrape:

$ sudo -u visegrad python scrape.py --people recent --debates recent --votes recent

or, since recent is the default value, simply:

$ sudo -u visegrad python scrape.py
