djon3s/foidberg
This asynchronously spiders the FOI disclosure log pages of Transport, Health and Communities in NSW and posts the details to a Django-compatible database. If you start Django's admin interface you can see the details, ready to create a site. Other disclosure sites can be added easily using Scrapy; for example spiders, see ./disclosures/disclosure/spiders (directory structure names are confusing, can be changed later). You'll need to have the XPath of each field you want to grab handy; a useful way to get these is with the Firebug extension for Firefox.

Please report any bugs/issues/feature suggestions at https://github.com/djon3s/foidberg/issues/

The cool thing about this is how you can populate a Django DB by just writing spider modules (a sketch of how that wiring might look is at the end of this README). Evil.

SETUP

Requires Django and Scrapy (which should install Twisted). Unfortunately, for some reason you cannot use virtualenv for Scrapy.

    sudo pip install twisted
    sudo pip install scrapy
    sudo pip install django
    sudo apt-get install python-dateutil

Edit: for some reason it's reliant on bootstrap (pip install bootstrap); this is a dependency that needs to be removed.

HOW TO USE

First set up the Django database in ./disclosure/foisite:

    python manage.py syncdb

Edit the file foidberg/disclosure/disclosure/settings.py to reflect the right location.

Populate the database with entries from Transport by running, in the foisite directory (TODO: fix this for all directories):

    scrapy crawl transport

or likewise 'scrapy crawl health' or 'scrapy crawl communities'. In between spiders it might help to run 'python manage.py syncdb' again.

Now run:

    python manage.py runserver

and you can view the admin interface at 127.0.0.1:8000/admin.

TO ADD A SPIDER

Spider logic is located in /disclosure/disclosures; add a spider in the spider directory (a minimal example spider is sketched at the end of this README). It might help to be familiar with XPaths and Scrapy: http://scrapy.readthedocs.org/en/0.14/intro/tutorial.html

TODO - SPIDER AND BACKEND

* fix directory structure names to be less confusing
* make the reference number the unique identifier in the SQL DB (see the model sketch at the end of this README)
* create a BASH script to spider periodically (say daily); coupled with the reference number being the unique identifier, only add to the database when new entries come up

SUGGESTIONS FOR FUTURE FEATURES

* for each entry, check whether it provides an email address as a reference. If it does, create an email to automatically request the document on behalf of the site; future emails with attachments are directed to be checked via the admin interface before publishing. If there is no attachment, the email is forwarded to the admin's email address so the conversation can be continued manually. It sucks having to get people to email someone to view data they are entitled to. :-(
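EXAMPLE SPIDER (SKETCH)

A minimal sketch of what a new disclosure-log spider might look like against the Scrapy 0.14 API (the version the tutorial linked above targets). The item class, field names, domain, URL and XPaths here are all hypothetical placeholders, not taken from this repo; line them up with the project's real item definitions and with XPaths you pull out using Firebug.

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from scrapy.item import Item, Field

    class DisclosureItem(Item):
        # Placeholder fields -- match these to the project's real item class.
        title = Field()
        reference_number = Field()
        date_received = Field()

    class ExampleAgencySpider(BaseSpider):
        name = 'exampleagency'          # run with: scrapy crawl exampleagency
        allowed_domains = ['example.nsw.gov.au']                       # placeholder
        start_urls = ['http://example.nsw.gov.au/foi/disclosure-log']  # placeholder

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            # Placeholder XPaths -- grab the real ones with Firebug.
            for row in hxs.select('//table[@id="disclosure-log"]//tr[td]'):
                item = DisclosureItem()
                item['title'] = ''.join(row.select('td[1]//text()').extract()).strip()
                item['reference_number'] = ''.join(row.select('td[2]//text()').extract()).strip()
                item['date_received'] = ''.join(row.select('td[3]//text()').extract()).strip()
                yield item

Once the file sits in the spiders directory and its name is unique, 'scrapy crawl exampleagency' from the project directory should pick it up.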
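MODEL SKETCH (UNIQUE REFERENCE NUMBER)

For the TODO item about making the reference number the unique identifier, here is a minimal Django model sketch. The model and field names are assumptions for illustration, not the repo's actual schema; the point is unique=True on the reference number, so repeated crawls can skip rows that already exist.

    from django.db import models

    class DisclosureEntry(models.Model):
        # Hypothetical schema -- field names are assumptions for illustration.
        reference_number = models.CharField(max_length=64, unique=True)
        title = models.CharField(max_length=255)
        date_received = models.DateField(null=True, blank=True)
        agency = models.CharField(max_length=100, blank=True)

        def __unicode__(self):  # Python 2 / Django 1.x era project
            return self.reference_number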
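PIPELINE SKETCH (SPIDER MODULES TO DJANGO DB)

And this is roughly how the "populate a Django DB by just writing spider modules" trick can be wired up: a Scrapy item pipeline that saves each scraped item through the Django ORM. The settings module path and the model import are assumptions based on the directory layout above; get_or_create keyed on the reference number gives the only-add-new-entries behaviour from the TODO list, and python-dateutil (installed in SETUP) handles the date parsing.

    import os
    # Point Django at the site's settings before importing any models.
    # 'foisite.settings' is an assumption based on the layout described above.
    os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'foisite.settings')

    from dateutil.parser import parse as parse_date
    from foisite.foi.models import DisclosureEntry   # hypothetical app path

    class DjangoWriterPipeline(object):
        def process_item(self, item, spider):
            # Keyed on reference_number (unique in the model sketch above),
            # so re-running a spider skips entries already in the database.
            DisclosureEntry.objects.get_or_create(
                reference_number=item['reference_number'],
                defaults={
                    'title': item['title'],
                    'date_received': parse_date(item['date_received'])
                                     if item['date_received'] else None,
                })
            return item

The pipeline would then be switched on in the Scrapy settings.py, e.g. ITEM_PIPELINES = ['disclosure.pipelines.DjangoWriterPipeline'] (a list in Scrapy 0.14; the module path here is again an assumption).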
About
Alien Crawler-thing gets nsw gov FOI disclosures and chucks em in a DB for use by django.