Webscraping
Scraping data from the web requires three steps:
- Loading the site
- Locating the data
- Saving the data
There are many automation tools that will handle some or all of these steps for you. Some are simple, but depending on how complex the websites you wish to scrape are, and how specific your needs for the data, you may need more advanced options.
For that reason, this is not a step-by-step tutorial. Rather, this is a launching pad, with links and resources to tools that you can use to automate different steps in the web-scraping process.
Which tools make sense for your project depends on a few questions:
- How often do you need to run your scraper? Does it need to run at regular intervals or irregularly?
- How hands-off does your scraper need to be? Do you want the entire process to run unattended, or do you want a user involved?
- How complex are the sites you are scraping? Does your scraper need to navigate interactive websites, or is it scraping simpler static HTML pages?
- Where do you want to run your scraper? Do you want to run it from within your browser, while viewing the pages you need to scrape? Do you want to run scripts locally on your machine, which don’t require you to actively use the browser but do require your computer to be on and running? Or do you want to run it in a cloud computing service at regular intervals?
When deployed incautiously, web-scraping tools can cause real problems for others, whether by degrading the service for other users of a site or by increasing data fees for site owners. To minimize the impact of your project and scrape responsibly:
- Schedule your scraping scripts to run no more often than necessary
- Limit the number of requests that you send simultaneously or in close succession
- Schedule scripts for off-peak hours
- Use caches and avoid duplicate requests
And lastly, avoid web scraping entirely if the data you need is available from official APIs.
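As a minimal sketch of these habits in Python, assuming the requests package and a hypothetical list of target URLs (the delay, cache, and header values are placeholders to adapt to your project):

```python
import time
import requests

# Hypothetical list of pages to fetch; replace with your own targets.
URLS = [
    "https://example.com/page/1",
    "https://example.com/page/2",
]

cache = {}  # simple in-memory cache to avoid duplicate requests
session = requests.Session()
session.headers["User-Agent"] = "my-research-scraper (contact: you@example.com)"

for url in URLS:
    if url in cache:  # skip pages we have already fetched
        continue
    response = session.get(url, timeout=30)
    response.raise_for_status()
    cache[url] = response.text
    time.sleep(5)  # pause between requests so we never hammer the server
```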
Tools to consider:
- Simple Tools
  - Google Sheets has built-in functions that can grab data from simple websites (see the documentation for IMPORTXML, IMPORTHTML, and IMPORTFEED).
  - Browser plugins such as the free plugin offered by webscraper.io
- Python packages
  - Selenium, a programmable browser for automated web browsing. It can be used as a Python package or as a standalone application (see the loading sketch after the comparison table below).
  - Beautiful Soup, a Python package for navigating and extracting data from HTML files.
  - Scrapy, another Python package for crawling web pages and extracting data from them.
- JavaScript-based methods
  - Bookmarklets are a simple way of running JavaScript code from your browser, and can help you quickly pull data from large pages. However, they may run differently in different browsers, and browser security features may create hurdles for saving the data to your hard drive.
  - Userscript managers, software for running JavaScript on websites:
    - Userscripts on the App Store for Mac
    - FireMonkey, a plugin for Firefox
    - Tampermonkey, a plugin for Chrome
  - Google Scripts allows you to schedule and run JavaScript code from the cloud, and has built-in methods for saving data to Google Drive. It includes basic methods for fetching website data, though it doesn’t have specialized tools for parsing or navigating websites.
- Services
  - Scrapestack (100 requests per month free)
  - Octoparse (the free plan allows you to run 10 projects on your local machine; paid plans allow cloud-based scraping)
  - Webscraper.io (free as a local browser plugin; paid plans allow cloud-based scraping)
| Tool | Platform | Scheduled Scraping? | Navigate Complex Websites? | Output |
|---|---|---|---|---|
| Google Sheets | Google Drive | No | No | Google Sheets only |
| Browser plugins | Browser | No | Yes (with limitations) | Any |
| Selenium | Python | Yes | Yes | Any |
| Beautiful Soup | Python | n/a | n/a | Any |
| Scrapy | Python | Yes | Yes | Any |
| Bookmarklets | JavaScript (browser) | No | Yes (with limitations) | Dependent on browser |
| Userscripts | JavaScript (browser) | No | Yes | Dependent on browser |
| Google Scripts | JavaScript (Google Drive) | Yes | No | Google Drive, other cloud services |
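Loading the site itself can be as simple as one HTTP request, or it can require driving a real browser for interactive pages. Here is a rough Python sketch, assuming the requests and selenium packages are installed, a browser driver is available, and a placeholder URL:

```python
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By

URL = "https://example.com"  # placeholder target page

# For a static page, a plain HTTP request is usually enough.
html = requests.get(URL, timeout=30).text

# For an interactive site, drive a real browser instead.
driver = webdriver.Firefox()  # or webdriver.Chrome()
driver.get(URL)
rendered_html = driver.page_source  # HTML after JavaScript has run
links = driver.find_elements(By.CSS_SELECTOR, "li.reference a")  # navigate via selectors
driver.quit()
```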
Once you’ve loaded the website contents, your next step is extracting just the data you need. For simpler projects, you might need nothing more than a straightforward search of the website text. For most websites, though, you’ll want to use a selector to identify where in the page’s HTML structure your data will be found.
Selectors use CSS selector syntax to find parts of the page. As an example, to find all links (`<a>` elements) inside list bullets (`<li>`) with the class `reference`, you could use the selector `li.reference a`.
JavaScript and Python both have methods for finding specific nodes within a page’s structure.
- In browser-based JavaScript, you can use the built-in `document.querySelector` and `document.querySelectorAll` functions.
- In Python, you can use the Beautiful Soup package to pull data from a website’s structure, or Selenium to navigate via selectors, as in the sketch below.
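As a minimal Beautiful Soup sketch, assuming the requests and beautifulsoup4 packages and a placeholder URL, the `li.reference a` selector from above could be used like this:

```python
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/docs"  # placeholder page to scrape

html = requests.get(URL, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

# select() takes the same CSS selector syntax described above.
for link in soup.select("li.reference a"):
    print(link.get_text(strip=True), link.get("href"))

# Scoping the selector under <main> makes it more robust to layout changes
# elsewhere on the page (e.g. links later added to a navigation bar).
main_links = soup.select("main li.reference a")
```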
Use your browser’s developer console to explore the page you want to scrape. The console inspector will let you click on specific elements of the page and see which selectors apply to them. Try to think about how the webpage might change over time, and pick the selectors least likely to be affected by small changes in the layout. For example, if the page has a single `<main>` element, you may want to include it in your selector, so that it will never grab elements later added to, say, the navigation bar.
Many of the tools have built-in methods for exporting your data to a file. But some methods, especially browser-based web scraping, may make it tricky to save your data.
There are different methods of creating and downloading files via bookmarklets, but they may be blocked by the security features of some browsers. One possible solution can be found on Stack Overflow. Alternatively, you can use JavaScript’s fetch method to send the data to another server, or to a cloud computing service like Google Scripts via its web app functions, for further processing.
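For scrapers written in Python and run locally, saving usually amounts to writing a file. A minimal sketch using the standard csv module, with placeholder rows and filename:

```python
import csv

# Placeholder data; in practice this would come from your scraping step.
rows = [
    {"title": "Example page", "url": "https://example.com"},
]

with open("scraped_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url"])
    writer.writeheader()    # column names on the first line
    writer.writerows(rows)  # one row per scraped record
```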
Some worked examples and related resources:
- Hack For LA’s webscraping system for 311 data:
- Basic website scraping with Google Sheets:
- Running Google Scripts to scrape websites from the cloud:
- Creating a fully hands-off cloud-based webscraper:
(Some of these issues may be closed or open/in progress.)