Webscraping
Scraping data from the web requires three steps:
- Loading the site
- Locating the data
- Saving the data
There are many automation tools that will handle some or all of these steps for you. Some are simple, but depending on how complex the websites you wish to scrape are, and how specific your needs for the data, you may need more advanced options.
For that reason, this is not a step-by-step tutorial. Rather, this is a launching pad, with links and resources to tools that you can use to automate different steps in the web-scraping process.
Which tools make sense for your project depends on a few questions:
- How often do you need to run your scraper? Does it need to run at regular intervals or irregularly?
- How hands-off does your scraper need to be? Do you want the entire process to run unattended, or do you want a user involved?
- How complex are the sites you are scraping? Does your scraper need to navigate interactive websites, or is it scraping simpler static HTML pages?
- Where do you want to run your scraper? Do you want to run it from within your browser, while viewing the pages you need to scrape? Do you want to run scripts locally on your machine, which don’t require you to actively use the browser but do require your computer to be on and running? Or do you want to run it in a cloud computing service at regular intervals?
When deployed incautiously, web-scraping tools can cause real problems for others, whether by degrading the service for other users of a site or by increasing data fees for site owners. To minimize the impact of your project and scrape responsibly:
- Schedule your scraping scripts to run no more often than necessary
- Limit the number of requests that you send simultaneously or in close succession
- Schedule scripts for off-peak hours
- Use caches and avoid duplicate requests
And lastly, avoid web scraping entirely if the data you need is available from official APIs.
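As a minimal sketch of these habits in Python, assuming the requests package and a hypothetical list of target URLs (the delay, cache, and header values are placeholders to adapt to your project):

```python
import time
import requests

# Hypothetical list of pages to fetch; replace with your own targets.
URLS = [
    "https://example.com/page/1",
    "https://example.com/page/2",
]

cache = {}  # simple in-memory cache to avoid duplicate requests
session = requests.Session()
session.headers["User-Agent"] = "my-research-scraper (contact: you@example.com)"

for url in URLS:
    if url in cache:  # skip pages we have already fetched
        continue
    response = session.get(url, timeout=30)
    response.raise_for_status()
    cache[url] = response.text
    time.sleep(5)  # pause between requests so we never hammer the server
```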
Tools to consider:
- Simple Tools
  - Google Sheets has built-in functions that can grab data from simple websites (see the documentation for IMPORTXML, IMPORTHTML, and IMPORTFEED).
  - Browser plugins such as the free plugin offered by webscraper.io
- Python packages
  - Selenium, a programmable browser for automated web browsing. It can be used as a Python package or as a standalone application (see the loading sketch after the comparison table below).
  - Beautiful Soup, a Python package for navigating and extracting data from HTML files.
  - Scrapy, another Python package for crawling web pages and extracting data from them.
- JavaScript-based methods
  - Bookmarklets are a simple way of running JavaScript code from your browser, and can help you quickly pull data from large pages. However, they may run differently in different browsers, and browser security features may create hurdles for saving the data to your hard drive.
  - Userscript managers, software for running JavaScript on websites:
    - Userscripts on the App Store for Mac
    - FireMonkey, a plugin for Firefox
    - Tampermonkey, a plugin for Chrome
  - Google Scripts allows you to schedule and run JavaScript code from the cloud, and has built-in methods for saving data to Google Drive. It includes basic methods for fetching website data, though it doesn’t have specialized tools for parsing or navigating websites.
- Services
  - Scrapestack (100 requests per month free)
  - Octoparse (the free plan allows you to run 10 projects on your local machine; paid plans allow cloud-based scraping)
  - Webscraper.io (free as a local browser plugin; paid plans allow cloud-based scraping)
| Tool | Platform | Scheduled Scraping? | Navigate Complex Websites? | Output |
|---|---|---|---|---|
| Google Sheets | Google Drive | No | No | Google Sheets only |
| Browser plugins | Browser | No | Yes (with limitations) | Any |
| Selenium | Python | Yes | Yes | Any |
| Beautiful Soup | Python | n/a | n/a | Any |
| Scrapy | Python | Yes | Yes | Any |
| Bookmarklets | JavaScript (browser) | No | Yes (with limitations) | Dependent on browser |
| Userscripts | JavaScript (browser) | No | Yes | Dependent on browser |
| Google Scripts | JavaScript (Google Drive) | Yes | No | Google Drive, other cloud services |
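Loading the site itself can be as simple as one HTTP request, or it can require driving a real browser for interactive pages. Here is a rough Python sketch, assuming the requests and selenium packages are installed, a browser driver is available, and a placeholder URL:

```python
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By

URL = "https://example.com"  # placeholder target page

# For a static page, a plain HTTP request is usually enough.
html = requests.get(URL, timeout=30).text

# For an interactive site, drive a real browser instead.
driver = webdriver.Firefox()  # or webdriver.Chrome()
driver.get(URL)
rendered_html = driver.page_source  # HTML after JavaScript has run
links = driver.find_elements(By.CSS_SELECTOR, "li.reference a")  # navigate via selectors
driver.quit()
```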
Once you’ve loaded the website contents, your next step is extracting just the data you need. For simpler projects, you might need nothing more than a straightforward search of the website text. For most websites, though, you’ll want to use a selector to identify where in the page’s HTML structure your data will be found.
Selectors use CSS selector syntax to find parts of the page. As an example, to find all links (`<a>` elements) inside list bullets (`<li>`) with the class `reference`, you could use the selector `li.reference a`.
JavaScript and Python both have methods for finding specific nodes within a page’s structure.
- In browser-based JavaScript, you can use the built-in `document.querySelector` and `document.querySelectorAll` functions.
- In Python, you can use the Beautiful Soup package to pull data from a website’s structure, or Selenium to navigate via selectors, as in the sketch below.
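As a minimal Beautiful Soup sketch, assuming the requests and beautifulsoup4 packages and a placeholder URL, the `li.reference a` selector from above could be used like this:

```python
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/docs"  # placeholder page to scrape

html = requests.get(URL, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

# select() takes the same CSS selector syntax described above.
for link in soup.select("li.reference a"):
    print(link.get_text(strip=True), link.get("href"))

# Scoping the selector under <main> makes it more robust to layout changes
# elsewhere on the page (e.g. links later added to a navigation bar).
main_links = soup.select("main li.reference a")
```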
Use your browser’s developer console to explore the page you want to scrape. The console inspector will let you click on specific elements of the page and see which selectors apply to them. Try to think about how the webpage might change over time, and pick the selectors least likely to be affected by small changes in the layout. For example, if the page has a single `<main>` element, you may want to include it in your selector, so that it will never grab elements later added to, say, the navigation bar.
Many of the tools have built-in methods for exporting your data to a file. But some methods, especially browser-based web scraping, may make it tricky to save your data.
There are different methods of creating and downloading files via bookmarklets, but they may be blocked by the security features of some browsers. One possible solution can be found on Stack Overflow. Alternatively, you can use JavaScript’s fetch method to send the data to another server, or to a cloud computing service like Google Scripts via its web app functions, for further processing.
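For scrapers written in Python and run locally, saving usually amounts to writing a file. A minimal sketch using the standard csv module, with placeholder rows and filename:

```python
import csv

# Placeholder data; in practice this would come from your scraping step.
rows = [
    {"title": "Example page", "url": "https://example.com"},
]

with open("scraped_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url"])
    writer.writeheader()    # column names on the first line
    writer.writerows(rows)  # one row per scraped record
```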
Some worked examples and related resources:
- Hack For LA’s webscraping system for 311 data:
- Basic website scraping with Google Sheets:
- Running Google Scripts to scrape websites from the cloud:
- Creating a fully hands-off cloud-based webscraper:
(Some of these issues may be closed or open/in progress.)