Note: I never got very far with this and none of this probably works any more.
django-scraping
is a library for making it easy to scrape content from websites.
Note: it is currently extremely alpha status in that it works for the specific use case that I use it for and probably not a lot else!
Installing using pip
will bring down most dependencies, but a few things require external packages to be installed.
django-scraping relies upon PyQuery which in turn relies upon lxml. In order to
install lxml, you must have a compiler and development versions of python,
libxml2
and libxslt
available.
See also http://lxml.de/build.html
Some features require additional dependencies to be manually installed, because the features are rarely used or because the dependency is "difficult" to install (such as a dependency requiring native compilation against packages
- For dealing with images, Pillow is required
This section is not too detailed for now because the API keeps changing as I figure out new use cases or problems with the existing definitions. As said: extremely alpha!
You will need to add scraping
and djcelery
to INSTALLED_APPS
and run syncdb
or migrate
Inside your django apps, create a handlers.py
:
# import the register function to add your handlers to django-scraping
from scraping.handlers import register
# define a callable
def handle_something(doc, scraper_page):
# doc is a pyquery document of the scraped content
# scraper_page is the ScraperPage model, discussed below
… do stuff with the doc …
# map a name onto a callable
register('handle_something', handle_something)
In the django admin, create a ScraperPage
object, using 'handle_something' as the value for 'scraper'. In the ScraperPage
list view, you can use an admin action to queue up a scrape.
Run ./manage.py celeryd
to execute the tasks. The url
will be downloaded, parsed, and passed into your handler function.
More to come later..