djon3s/foidberg

Alien Crawler-thing that gets NSW gov FOI disclosures and chucks 'em in a DB for use by Django.

This asynchronously spiders the FOI disclosure log pages of the NSW transport, health and communities agencies and posts the details to a Django-compatible database. If you start Django's admin interface you can see the details, ready to create a site.

Other disclosure sites can be added easily using Scrapy. For example spiders, see ./disclosures/disclosure/spiders (the directory structure names are confusing; they can be changed later). You'll need to have the XPath of each field you want to grab handy; a useful way to get these is the Firebug extension for Firefox, or the Scrapy shell, as sketched below.
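For instance, a hypothetical Scrapy shell session for trying out XPaths (the URL is made up; point it at a real disclosure log page):

scrapy shell "http://www.example.nsw.gov.au/disclosure-log"
>>> hxs.select('//table//tr/td[1]/text()').extract()

In the Scrapy 0.14 era the shell exposes the fetched page as an HtmlXPathSelector named hxs, so you can tweak an XPath interactively until it returns the field you want.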

Please report any bugs/issues/feature suggestions at https://github.com/djon3s/foidberg/issues/

The cool thing about this is how you can populate a Django DB just by writing spider modules. Evil.
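The usual Scrapy-to-Django wiring behind this is DjangoItem: an item class backed by a Django model, so saving an item writes a database row. A minimal sketch, assuming a Django model called Disclosure (the actual item and model names in this repo may differ):

# items.py -- an item backed by a Django model (names illustrative)
from scrapy.contrib.djangoitem import DjangoItem
from foiapp.models import Disclosure  # hypothetical Django app/model

class DisclosureItem(DjangoItem):
    django_model = Disclosure

# pipelines.py -- each scraped item is saved straight into the Django DB
class DjangoSavePipeline(object):
    def process_item(self, item, spider):
        item.save()  # DjangoItem.save() creates the model row
        return item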

SETUP

Requires Django & Scrapy (installing Scrapy should pull in Twisted).

Unfortunately, for some reason you cannot use virtualenv for Scrapy, so install system-wide:

sudo pip install twisted
sudo pip install scrapy
sudo pip install django
sudo apt-get install python-dateutil

Edit: for some reason it's also reliant on bootstrap (pip install bootstrap); this is a dependency that needs to be removed.

HOW TO USE

First, set up the Django database in ./disclosure/foisite:

'python manage.py syncdb'

Then edit the file foidberg/disclosure/disclosure/settings.py to reflect the right location of the database; a sketch of what that typically involves is below.
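That wiring usually means pointing the Scrapy side at the Django project, roughly like this (the path is illustrative, not the repo's actual value):

# disclosure/disclosure/settings.py -- hypothetical Django hookup
import os
import sys

# make the foisite Django project importable from the spiders
sys.path.append('/path/to/foidberg/disclosure')
os.environ['DJANGO_SETTINGS_MODULE'] = 'foisite.settings'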

Populate the database with entries from transport:

'scrapy crawl transport' (or 'health' or 'communities'), run from the 'foisite' directory. #TODO: fix this for all directories

In between spiders it might help to run 'python manage.py syncdb' again.

Now run 'python manage.py runserver'.

You can view the admin interface at 127.0.0.1:8000/admin.
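If the scraped entries don't show up there, the model needs to be registered with the admin; the standard Django pattern (model name illustrative) is:

# admin.py -- hypothetical admin registration
from django.contrib import admin
from foiapp.models import Disclosure  # hypothetical model

admin.site.register(Disclosure)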

TO ADD A SPIDER

Spider logic is located in /disclosure/disclosures; add a spider in the spiders directory. A skeleton is sketched below.

It might help to be familiar with XPaths and Scrapy: http://scrapy.readthedocs.org/en/0.14/intro/tutorial.html
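Here is what a new spider module might look like, in Scrapy 0.14 style (the agency, URL, XPaths and field names are all made up; copy an existing spider in ./disclosures/disclosure/spiders for the real conventions):

# spiders/justice.py -- hypothetical example spider
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from disclosure.items import DisclosureItem  # assumed item class, as sketched earlier

class JusticeSpider(BaseSpider):
    name = 'justice'  # run with: scrapy crawl justice
    allowed_domains = ['justice.nsw.gov.au']
    start_urls = ['http://www.justice.nsw.gov.au/disclosure-log']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # one table row per disclosure entry; the XPaths are site-specific
        for row in hxs.select('//table[@class="log"]//tr[td]'):
            item = DisclosureItem()
            item['reference'] = row.select('td[1]/text()').extract()
            item['title'] = row.select('td[2]/text()').extract()
            item['date'] = row.select('td[3]/text()').extract()
            yield item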

TODO - SPIDER AND BACKEND

* fix directory structure names to be less confusing
* make the reference number the unique identifier in the SQL DB (see the dedup sketch below)
* create a BASH script to spider periodically (say daily); coupled with the reference number being the unique identifier, only add entries to the database when new ones come up
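Once reference numbers are unique, the dedup could live in a Scrapy pipeline; a sketch, with illustrative model and field names:

# pipelines.py -- hypothetical only-insert-new-entries pipeline
from foiapp.models import Disclosure  # hypothetical Django model

class DedupPipeline(object):
    def process_item(self, item, spider):
        # get_or_create only inserts a row when this reference is new
        Disclosure.objects.get_or_create(
            reference=item['reference'][0],
            defaults={'title': item['title'][0]},
        )
        return item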


SUGGESTIONS FOR FUTURE FEATURES

* For each entry, check whether it provides an email address as the contact. If it does, automatically send an email requesting the document on behalf of the site. Future replies with attachments are held for checking in the admin interface before publishing; replies with no attachment are forwarded to the admin's email to continue the conversation manually. It sucks having to get people to email someone to view data they are entitled to. :-(
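The sending half could start out as a thin wrapper over Django's send_mail (addresses, field names and wording entirely made up):

# hypothetical auto-request email for an entry that lists a contact address
from django.core.mail import send_mail

def request_document(entry):
    send_mail(
        subject='Request for disclosure %s' % entry.reference,
        message='Please provide the document for disclosure log entry %s.' % entry.reference,
        from_email='requests@example.org',
        recipient_list=[entry.contact_email],
    )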
