Scraper of the Slovak National Council for the Visegrad+ project. It scrapes MPs, their memberships, votes, and debates and stores the data in the Visegrad+ parliament API.
Requires:
- lxml library to parse HTML documents,
- LibreOffice core and unoconv to convert documents from RTF format,
- some Python packages.
On Debian-based distributions install the libraries:
$ sudo apt-get install libxml2-dev libxslt1-dev zlib1g-dev libreoffice-core unoconv
Get the scraper:
$ sudo mkdir -p /home/projects/scrapers
$ cd /home/projects/scrapers
$ sudo git clone https://github.com/KohoVolit/scraper-sk_nrsr.git sk_nrsr
Get the VPAPI client and the SSL certificate of the server:
$ cd sk_nrsr
$ sudo wget https://raw.githubusercontent.com/KohoVolit/api.parldata.eu/master/client/vpapi.py
$ sudo wget https://raw.githubusercontent.com/KohoVolit/api.parldata.eu/master/client/server_cert.pem
Create a virtual environment for the scraper and install the required packages into it:
$ sudo virtualenv /home/projects/.virtualenvs/scrapers/sk_nrsr --no-site-packages
$ source /home/projects/.virtualenvs/scrapers/sk_nrsr/bin/activate
(sk_nrsr)$ sudo pip install -r requirements.txt
(sk_nrsr)$ deactivate
Check that the SERVER_NAME and SERVER_CERT variables in vpapi.py have correct values.
Copy the file conf/private-example.json to conf/private.json and fill in your username and password for write access through the API. This sensitive data must not be committed to the repository.
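A quick sanity check of the copied file can catch an unfilled template before a long scrape starts. The following is only a sketch: the key names `api_user` and `password` are assumptions made for illustration, so check conf/private-example.json for the names the scraper really expects.

```python
import json
from pathlib import Path

# Hypothetical key names; the real ones are listed in conf/private-example.json.
REQUIRED_KEYS = ("api_user", "password")


def load_private_config(path="conf/private.json"):
    """Load the credentials file and fail early if a required key is missing or empty."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError("missing or empty keys in %s: %s" % (path, ", ".join(missing)))
    return config
```

Calling `load_private_config()` from the scraper directory returns the parsed dictionary, or raises `ValueError` naming the fields that still need to be filled in.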
Run the scraper in the virtual environment. See its help message for the parameters it accepts:
$ source /home/projects/.virtualenvs/scrapers/sk_nrsr/bin/activate
$ python scrape.py --help
The unoconv listener must be running to scrape transcripts of former debates (election terms 1-4):
$ unoconv --listener &
Scrape people and their memberships first, then debates, and finally votes. Note that the initial scrape of debates deletes all existing sessions and sittings. The -H option is used for the debates because unoconv creates temporary files in HOME:
$ sudo -u visegrad python scrape.py --people initial --debates none --votes none
$ sudo -H -u visegrad python scrape.py --people none --debates initial --votes none
$ sudo -u visegrad python scrape.py --people none --debates none --votes initial
Or all at once:
$ sudo -H -u visegrad python scrape.py --people initial --debates initial --votes initial
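If the step-by-step variant is preferred, the required order (people, then debates, then votes) can be driven from a small wrapper that stops at the first failure. This is a minimal sketch and not part of the scraper; the helper names `build_commands` and `run_all` are made up for illustration, while the user name and options are taken from the commands in this README.

```python
import subprocess

# Order taken from this README: people first, then debates, then votes.
STEPS = ("people", "debates", "votes")


def build_commands(mode="initial", user="visegrad"):
    """Return one sudo command per step, with the other two steps set to 'none'."""
    commands = []
    for step in STEPS:
        # -H is added only for debates, since unoconv writes temporary files to HOME.
        sudo = ["sudo", "-H", "-u", user] if step == "debates" else ["sudo", "-u", user]
        args = []
        for name in STEPS:
            args += ["--" + name, mode if name == step else "none"]
        commands.append(sudo + ["python", "scrape.py"] + args)
    return commands


def run_all(mode="initial"):
    """Run the steps in order; check=True aborts on the first non-zero exit code."""
    for command in build_commands(mode):
        subprocess.run(command, check=True)
```

Running the steps separately makes it easier to see which part failed and to restart from that part only.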
You can stop the unoconv listener unless it is needed for other scrapers or conversions:
$ sudo killall soffice.bin
Then schedule a periodic scrape:
$ sudo -u visegrad python scrape.py --people recent --debates recent --votes recent
or, knowing that recent is the default value, simply:
$ sudo -u visegrad python scrape.py
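To illustrate why the two commands above are equivalent, here is a sketch of how options with a `recent` default could be declared with argparse. It is an assumption about the general shape of such a command line, not the actual code of scrape.py; run `python scrape.py --help` for the authoritative list of parameters and values.

```python
import argparse

# Values that appear in the commands of this README; the real scraper may accept others.
MODES = ("initial", "recent", "none")


def build_parser():
    """Build an illustrative parser where every option defaults to 'recent'."""
    parser = argparse.ArgumentParser(description="Illustrative CLI, not the scraper's actual code")
    for name in ("people", "debates", "votes"):
        parser.add_argument("--" + name, choices=MODES, default="recent")
    return parser


# With no arguments, every option falls back to its default.
args = build_parser().parse_args([])
print(args.people, args.debates, args.votes)  # prints: recent recent recent
```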