dirbot

This is a Scrapy project to scrape websites from public web directories.

This project is only meant for educational purposes.

Items

The items scraped by this project are websites, and the item is defined in the class:

dirbot.items.Website

See the source code for more details.
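
As a rough illustration of what such an item might look like, here is a minimal sketch. It uses a plain dataclass as a stand-in rather than the project's actual class, and the three field names (name, url, description) are assumptions for illustration only; the real definition is in the source code.

```python
from dataclasses import dataclass, asdict


@dataclass
class Website:
    """Stand-in for dirbot.items.Website; the field names are assumed."""

    name: str = ""
    url: str = ""
    description: str = ""


# Build one item by hand, the way a spider callback might after parsing a page.
site = Website(
    name="Example site",
    url="http://www.example.com/",
    description="A placeholder entry used for illustration.",
)

# Convert the item to a plain dict, e.g. for export or inspection.
record = asdict(site)
```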

Spiders

This project contains two spiders: googledir and dmoz. When in doubt, you can check the available spiders with:

scrapy list

Spider: googledir

The googledir spider crawls the entire Google Directory (directory.google.com). To try it out, you may want to limit the crawl to a certain number of items.

For example, to run the googledir spider and stop after scraping 20 items, use:

scrapy crawl googledir --set CLOSESPIDER_ITEMCOUNT=20
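
The same limit could presumably be kept in the project settings instead of being passed on the command line each time. A minimal sketch, assuming the settings module lives at dirbot/settings.py (the path is an assumption; CLOSESPIDER_ITEMCOUNT is the setting used in the command above):

```python
# dirbot/settings.py (assumed location of the project settings)

# Close the spider once this many items have passed the item pipelines.
# A value of 0, Scrapy's default, disables the limit.
CLOSESPIDER_ITEMCOUNT = 20
```

A value given with --set on the command line takes precedence over the one in the settings module.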

Spider: dmoz

The dmoz spider scrapes the Open Directory Project (dmoz.org), and it's based on the dmoz spider described in the Scrapy tutorial.

Unlike the googledir spider, this spider doesn't crawl the entire dmoz.org site but only two pages by default (defined in the start_pages attribute).

So, if you run the spider the regular way (with scrapy crawl dmoz), it will scrape only those two pages. However, you can scrape any dmoz.org page by passing the URL instead of the spider name. Scrapy internally resolves which spider to use by looking at the allowed domains of each spider.

For example, to scrape a different URL, use:

scrapy crawl http://www.dmoz.org/Computers/Programming/Languages/Erlang/

You can scrape any URL from dmoz.org using this spider.
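
The idea of resolving a spider from a URL by its allowed domains can be sketched as follows. This is not Scrapy's actual implementation, only an illustration of the matching logic described above; the ALLOWED_DOMAINS table and the spiders_for_url helper are hypothetical names, and the domain values are assumed from the spider descriptions in this README.

```python
from urllib.parse import urlparse

# Hypothetical table of spider name -> allowed domains (values assumed).
ALLOWED_DOMAINS = {
    "dmoz": ["dmoz.org"],
    "googledir": ["directory.google.com"],
}


def spiders_for_url(url):
    """Return the names of the spiders whose allowed domains cover the URL."""
    host = (urlparse(url).hostname or "").lower()
    matches = []
    for name, domains in ALLOWED_DOMAINS.items():
        for domain in domains:
            # Accept the domain itself or any of its subdomains.
            if host == domain or host.endswith("." + domain):
                matches.append(name)
                break
    return matches


erlang_url = "http://www.dmoz.org/Computers/Programming/Languages/Erlang/"
resolved = spiders_for_url(erlang_url)
```

Matching on the host name, rather than searching the whole URL string, avoids false positives when a domain name merely appears somewhere in the path or query.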


Pipelines

This project uses a pipeline to filter out websites containing certain forbidden words in their description. This pipeline is defined in the class:

dirbot.pipelines.FilterWordsPipeline
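
A filtering pipeline of this kind might look roughly like the sketch below. It is not the project's actual code: the word list is made up, the item is assumed to be dict-like with a description field, and DropItem is defined locally as a stand-in for scrapy.exceptions.DropItem so the example runs without Scrapy installed.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""


class FilterWordsPipeline:
    """Drop items whose description contains a forbidden word."""

    # Hypothetical word list; the real one is defined in the project source.
    words_to_filter = ["politics", "religion"]

    def process_item(self, item, spider=None):
        # Compare case-insensitively against the item's description.
        description = str(item.get("description", "")).lower()
        for word in self.words_to_filter:
            if word in description:
                raise DropItem(f"Contains forbidden word: {word}")
        # Items that pass the filter continue to the next pipeline stage.
        return item


pipeline = FilterWordsPipeline()
kept = pipeline.process_item({"description": "A site about Erlang"})
```

In a Scrapy project, a pipeline like this only takes effect once it is listed in the ITEM_PIPELINES setting.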
