To run the scrapers you need the following dependencies:
- Scrapy
- Selenium for Python (only needed for the YouTube Forums crawler)
The simplest way to install them is with pip, the easy-to-use tool for installing Python packages. Once you have pip installed, just run the following in your shell of choice:
pip install scrapy
pip install selenium
To get the code from this git repository you first need to install git, the version control system. Once you have git you can run the following to fetch the code:
git clone [email protected]:climatewarrior/elc-scrapers.git
After running this, the scraper code will be in the directory your shell was working in when you ran the command.
To run the scrapers, change directory in your shell to the root directory of elc-scrapers. Then run the following:
scrapy crawl <crawler name>
For example, to run the YouTube crawler you would do:
scrapy crawl youtube
The following section lists all of the available crawlers and their Scrapy names, which appear in parentheses. The scraped data will be located in a directory called:
../data/<scraper name>/
As indicated, this directory will be one level higher in the hierarchy than the place where you ran scrapy. The data is separated into files, where each file represents a thread. In the case of scrapers that also scrape user profiles, there will also be a file for each user profile.
Some crawlers aren’t capable of scraping entire forums because forum searches are required to reach very old posts. A workaround is to search the respective forums for the terms you are interested in and then set the result URLs as the start_urls of the spiders. The crawlers that need such a workaround are marked with a “*”.
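As a rough sketch of the workaround, you could subclass one of the existing spiders and seed it with search-result pages instead of the forum index. The import path, class name and search URLs below are assumptions made up for this illustration:

# Hypothetical sketch: import path, class name and search URLs are
# invented for illustration and must be adapted to the real forum.
from research_scrapers.spiders.fanart_spider import FanartSpider

class FanartSearchSpider(FanartSpider):
    name = "fanart_search"
    # Seed the crawl with forum search results for the terms of interest.
    start_urls = [
        "http://www.fanart-central.net/forum/search.php?keywords=copyright",
        "http://www.fanart-central.net/forum/search.php?keywords=plagiarism",
    ]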
deviantART (deviantart)*
Scraping normal forums looks to be working fine.
Fanart Central Forums (fanart)*
Looks like it’s working just fine.
OverClocked ReMix Forums (overcloked)*
Parsing threads and normal forum posts seems to be ok.
Remix64 (remix64)
Practically DONE
YouTube Forums (youtube)
In theory it is complete, but it needs more testing.
MMORPG Forum (mmorpg)
Seems to be in working order so far.
HPFanFic Forums (hpfanfic)
Appears to be done, needs more testing.
Twisting the Hellmouth (tthfanfic)
It seems like it works alright.
NaNoWriMo (nanowrimo)
100% Ready
Etsy (etsy)
Looks pretty easy to scrape.
eBaum’s World (ebaums)
It keeps giving me 403s whenever I try to scrape it. I have tried changing the user agent, removing cookies and other techniques but I keep facing the same problem.
The search terms used for these forum searches are the following:
- copyright
- legal
- illegal
- permission
- trademark
- stealing / steal / stole
- license
- rights
- attorney
- infringement
- copy / copying
- plagiarism
Most web forums are very similar: they contain multi-page threads, are organized into sub-forums, and their posts share common attributes, e.g. date posted and author. To simplify the development of new forum scrapers I have created a Python class that abstracts all of the common parts away so you only have to worry about the differences. Also, most of the code leverages Scrapy’s CrawlSpider class, which helps implement Scrapy crawlers. In the following sections I will explain how to use these classes to make a new scraper. If you are not familiar with the basics of Scrapy, please go through the Scrapy documentation first, especially the beginner tutorial.
The best way to start is to copy the following files:
./research_scrapers/spiders/example_spider.py
./research_scrapers/spiders/spider_helpers/ExampleHelper.py
and give the copies the name of your new scraper. For example, if you were creating a scraper for foo.com you would name the copies the following way:
./research_scrapers/spiders/foo_spider.py
./research_scrapers/spiders/spider_helpers/FooHelper.py
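For example, on a Unix-like shell the copies can be made like this:
cp ./research_scrapers/spiders/example_spider.py ./research_scrapers/spiders/foo_spider.py
cp ./research_scrapers/spiders/spider_helpers/ExampleHelper.py ./research_scrapers/spiders/spider_helpers/FooHelper.py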
After you have made these copies you can start working on your brand new forum scraper.
Open the file foo_spider.py with your favorite editor and take a look. First you will notice that this class inherits from CrawlSpider and from ExampleHelper. CrawlSpider makes it very easy to crawl through entire sites; ExampleHelper will be explained in the next section. The job of our Spider class is to go through the entire site, find forum threads, and call the “parse_thread” method on them. The “parse_thread” method, as the name implies, takes care of parsing threads.
What you have to do here is pretty simple.
- Update the class name from ExampleSpider to FooSpider.
- Change the name field from “example” to “foo”.
- Update the allowed_domains list, which, as the name implies, is just a list of allowed domains. This keeps the Spider from wandering off to other websites.
- Update the start_urls list. This is usually just the link to the forums you want to scrape, but you can specify more URLs where the Spider should start crawling.
- Update the rules. This is the most important part. Here you specify regular expressions for the URLs the Spider should follow or call methods on. You can look at some of the existing spiders for examples. The key is to make sure you have expressions that follow through all sub-forums and that all thread URLs are detected. For the thread rules you should specify:
# We want to call parse_thread on our thread URLs
callback="parse_thread"
and
follow = False
We don’t want to follow thread URLs: if we follow them, page links will confuse the spider and the threads won’t be parsed correctly unless you take special precautions. “parse_thread” already knows how to go through threads with multiple pages. For the details please see the code in helper_base.py.
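Putting the pieces together, here is a minimal sketch of what foo_spider.py might look like. The URL patterns (viewforum/viewtopic) and the import paths are assumptions for illustration and must be adapted to the real forum markup; the imports follow the older Scrapy API that the rest of this project uses:

# A minimal sketch, not a finished spider. URL patterns and import
# paths are assumptions; adapt them to the actual forum.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from research_scrapers.spiders.spider_helpers.FooHelper import FooHelper

class FooSpider(CrawlSpider, FooHelper):
    name = "foo"
    allowed_domains = ["foo.com"]
    start_urls = ["http://foo.com/forums/"]
    rules = (
        # Follow links into every sub-forum so the whole site is crawled.
        Rule(SgmlLinkExtractor(allow=(r"viewforum\.php\?f=\d+",))),
        # Thread URLs: call parse_thread and do not follow, since
        # parse_thread itself handles threads with multiple pages.
        Rule(SgmlLinkExtractor(allow=(r"viewtopic\.php\?t=\d+",)),
             callback="parse_thread", follow=False),
    )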
ExampleHelper is a dedicated helper for the spider, and it in turn inherits from HelperBase. HelperBase contains the logic that is consistent across forum sites, while ExampleHelper encapsulates the logic specific to the forum we are scraping. Here all you have to do is finish implementing the methods. Docstrings describing what each method is responsible for are found in helper_base.py. Also, you can look at the available scrapers for examples. The core task is to fill the data fields with the required data using XPath selectors. For example, in the method
load_first_page(self, ft)
you can add the page title the following way:
ft['title'] = self.hxs.select("//div[@id='page-body']/h2/a/text()").extract()[0]
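Sketched out as a class, FooHelper might take the following shape. The import path for HelperBase is an assumption based on the file layout above, and the XPath is only the illustrative one from the example:

# A sketch only; the HelperBase import path is assumed from the
# project layout, and the remaining methods still need implementing.
from research_scrapers.spiders.spider_helpers.helper_base import HelperBase

class FooHelper(HelperBase):
    def load_first_page(self, ft):
        # Fill the item's fields from the first page of the thread,
        # using XPath selectors against the loaded response.
        ft['title'] = self.hxs.select(
            "//div[@id='page-body']/h2/a/text()").extract()[0]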
To create these XPath selectors I highly recommend using Scrapy interactively with the command:
scrapy shell <foo.com/forums/thread_url>
This allows for rapid XPath testing and development in an interactive, Scrapy-enabled Python shell.
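A session might look like this (the URL and the output shown are, of course, made up):

scrapy shell http://foo.com/forums/some_thread
>>> hxs.select("//div[@id='page-body']/h2/a/text()").extract()
[u'Some thread title']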
In the file pipelines.py there is a Scrapy Item Pipeline called TextFileExportPipeline. In order to modify the data output you must modify that class.
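For orientation, a Scrapy item pipeline has the following general shape. This is only a rough sketch of the pattern, not the actual contents of TextFileExportPipeline, and the item fields used here are invented:

# A generic sketch of an item pipeline; the real TextFileExportPipeline
# differs in its details, and 'thread_id'/'text' are assumed field names.
import os

class TextFileExportPipeline(object):
    def process_item(self, item, spider):
        # Scrapy calls process_item for every scraped item; whatever
        # is returned gets passed on to the next pipeline stage.
        out_dir = os.path.join("..", "data", spider.name)
        if not os.path.exists(out_dir):
            os.makedirs(out_dir)
        # One file per thread, as described above.
        path = os.path.join(out_dir, "%s.txt" % item['thread_id'])
        with open(path, "a") as f:
            f.write(item['text'].encode("utf-8") + "\n")
        return item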
This is Free and Open Source Software provided under the terms of the GNU GPL v3.0. For the full license text, read the COPYING file.