Diavgeia

A crawler that leverages Diavgeia's OpenDataAPI to download decisions as well as their respective metadata.

Prerequisites

Python 2.7.x
Scrapy 1.0.3

Introduction

The cralwer parses command line arguments (as defined by Scrapy Framework), and performs the respective calls to Diavgeia's OpenData API in order to fetch decisions metadata and documents. Documents are downloaded using a seperate threaded downloader pipeline. Currently the work on this project is ongoing.

Settings

Diavgeia stores its configuration under diavgeia/settings.py file. The values that are configurable are the following:

DOWNLOAD_DIR: The directory where the downloader pipeline stores the decision documents. To avoid resulting to a directory full of documents, the downloader seggregates decisions by their organization id. Therefore when a decision is downloaded, a new subdirectory is created (if not exists) named under the organization ID. Default is /tmp/diavgeia
THREADS: Number of concurrent threads that will be used to download decision documents. Default is 15.

Run Diavgeia

Diavgeia can be run using the following command:

scrapy crawl diavgeia_spider

The above command fetches decision documents that were submitted to Diavgeia the previous day. Queries can be performed to Diavgeia using search arguments defined by the API. Such arguments can be passed to the crawler with the -a flag. e.g: If we wish to search every decision published by an organization we can use the following command:

scrapy crawl diavgeia_spider -a org=30

where 30 is the organization ID. Diavgeia allows us to request an organization using its name as well.

Allowed arguments for the spider are depicted on the following table:

Argument	Description
`ada`	Unique decision ID
`org`	ID or latin name of organization
`subject`	Subject of the decision
`protocol`	Protocol number
`term`	General search term
`unit`	Unit ID
`signer`	Signer ID
`type`	Type of decisions
`tag`	Category tag
`from_date`	From submission date (in YYYY-MM-DD format)
`to_date`	To submission date (in YYYY-MM-DD format)
`from_issue_date`	From issue date (in YYYY-MM-DD format)
`status`	To issue date (in YYYY-MM-DD format)
`size`	Results size.

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
diavgeia		diavgeia
.gitignore		.gitignore
.travis.yml		.travis.yml
README.md		README.md
scrapy.cfg		scrapy.cfg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Diavgeia

Prerequisites

Introduction

Settings

Run Diavgeia

About

Releases

Packages

Languages

bug2point0/diavgeia

Folders and files

Latest commit

History

Repository files navigation

Diavgeia

Prerequisites

Introduction

Settings

Run Diavgeia

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages