A project to extract GeoJSON from the web, focusing on websites that have 'store locator' pages, such as restaurants, gas stations, and retailers. Each chain has its own bit of software to extract useful information from its site (a "spider"). Each spider can be individually configured to throttle its request rate so that it acts as a good citizen on the Internet. The default User-Agent for the spiders can be found here, so websites wishing to prevent our spiders from accessing the data on their website can block that User-Agent, but please feel free to contact us with any requests or recommendations.
The project is built using Scrapy, a Python-based web scraping framework. Each target website gets its own spider, which does the work of extracting interesting details about locations and outputting results in a useful format.
To scrape a new website for locations, you'll want to create a new spider. You can copy from existing spiders or start from a blank, but the result is always a Python class that has a parse() function that yields GeojsonPointItems. The Scrapy framework does the work of outputting the GeoJSON based on these objects that the spider generates.
To get started, you'll want to install the dependencies for this project.
- This project uses pipenv to handle dependencies and virtual environments. To get started, make sure you have pipenv installed.
- With pipenv installed, make sure you have the all-the-places repository checked out:

      git clone [email protected]:alltheplaces/alltheplaces.git

- Then you can install the dependencies for the project:

      cd alltheplaces
      pipenv install

- After dependencies are installed, make sure you can run the scrapy command without error:

      pipenv run scrapy

- If pipenv run scrapy ran without complaining, then you have a functional scrapy setup and are ready to write a scraper.
- Create a new file in locations/spiders/ with this content:

      # -*- coding: utf-8 -*-
      import scrapy

      from locations.items import GeojsonPointItem


      class TemplateSpider(scrapy.Spider):
          name = "template"
          allowed_domains = ["www.sample.com"]
          start_urls = (
              'https://www.sample.com/locations/',
          )

          def parse(self, response):
              pass

  This blank/template spider will start at the given start_urls and only touch the domains listed in allowed_domains. All web requests will be returned to the parse() function with the response content in the response argument. Once you have the response content, you can perform various operations on it. The most useful is probably running XPath selections on the HTML of the page to extract data out of the page. Check out the "Scraper tips" section below for more information about how to use these tools to efficiently get data out of the page.
- Once you have your spider written, you can give it a test run to make sure it's finding the expected results.

      pipenv run scrapy crawl template

  The scrapy crawl template command runs a spider named template. If you changed the name of your spider, you should use the name you chose. By default, scrapy crawl does not save the output anywhere, but it does log the results of the spider operation fairly verbosely.

  To generate GeoJSON locally, you can pass an option during the crawl to use the GeoJSON exporter and specify the file to write it to:

      pipenv run scrapy crawl template -O output.geojson

- Finally, make sure your parse() function is yielding GeojsonPointItems that contain the location and property data that you extract from the page:

      def parse(self, response):
          # latitude and longitude stand for values extracted from the response
          yield GeojsonPointItem(
              lat=latitude,
              lon=longitude,
              street_address="1234 Fifth Street",
              city="San Francisco",
              state="CA",
              country="US"
          )

- Once you have a spider that logs out useful results, you can create a new branch and push it up to your fork to create a pull request. The build system will run your spider and output information about the results as a comment on your pull request.
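
As a rough illustration of what ends up in the output file, the fields passed to GeojsonPointItem in the step above could map onto a GeoJSON Feature along these lines. This is a minimal standard-library sketch, not the project's exporter: the helper name item_to_feature, the sample coordinates, and the choice of property keys are assumptions made for illustration. The one fixed rule, taken from the GeoJSON specification, is that coordinates are ordered longitude first, then latitude.

```python
import json


def item_to_feature(lat, lon, **properties):
    """Build a GeoJSON Feature dict for a single point (hypothetical helper)."""
    return {
        "type": "Feature",
        "geometry": {
            "type": "Point",
            # GeoJSON orders coordinates as [longitude, latitude].
            "coordinates": [lon, lat],
        },
        "properties": properties,
    }


# Sample values only; a real spider would extract these from the page.
feature = item_to_feature(
    lat=37.7749,
    lon=-122.4194,
    street_address="1234 Fifth Street",
    city="San Francisco",
    state="CA",
    country="US",
)

collection = {"type": "FeatureCollection", "features": [feature]}
print(json.dumps(collection, indent=2))
```

Swapping latitude and longitude is an easy mistake to make, so it is worth opening the generated file in a map viewer before submitting a spider.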
There are usually a few ways to find locations:

- An XML sitemap, often https://<domain>/sitemap.xml. The domain's robots.txt file can also be useful for finding sitemaps (https://<domain>/robots.txt). These can be crawled with a SitemapSpider.
- A "store directory" that is a hierarchical listing of all locations. These listings are sometimes hidden in the footer or on the site map page. Keep an eye out for these, because it's a lot easier if they enumerate all the locations for you rather than having to program a spider to do it for you. These can be crawled with a CrawlSpider.
- A "store finder" that lets the user search by location. Keep an eye on your browser's developer tools "network" tab to see what the request is so you can replicate it in your spider. You may be able to change the request to get the API to return all the stores. These can be made with a normal Spider and specific start_urls or start_requests().
- But if the only option is to search by latitude/longitude, these can be crawled with Searchable Points.
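
The robots.txt tip in the list above can be checked by hand or with a few lines of Python. The sketch below is only an illustration using the standard library; the sample robots.txt content and the helper name find_sitemaps are made up for the example.

```python
def find_sitemaps(robots_txt):
    """Return the URLs listed on 'Sitemap:' lines of a robots.txt body."""
    sitemaps = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        # The field name is case-insensitive. The URL contains ':' as well,
        # so only the first ':' is treated as the separator.
        if sep and key.strip().lower() == "sitemap":
            sitemaps.append(value.strip())
    return sitemaps


# Hypothetical robots.txt content for www.sample.com.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Sitemap: https://www.sample.com/sitemap.xml
sitemap: https://www.sample.com/locations/sitemap.xml
"""

print(find_sitemaps(SAMPLE_ROBOTS_TXT))
```

Scrapy's SitemapSpider also accepts a robots.txt URL in sitemap_urls and extracts the sitemap entries itself, so a helper like this is mainly useful when exploring a site by hand.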
Some websites may already be publishing their data in a standard way. We can parse these with our StructuredDataSpider: use a SitemapSpider or CrawlSpider to obtain the pages and pass them to parse_sd, which will parse any Microdata or Linked Data with a type defined in wanted_types. You can clean up the source data with pre_process_data, and clean up the item or add extra attributes with post_process_item.
validator.schema.org can be really helpful when making spiders to see what structured data is available.
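
To make the idea of structured data concrete, the following standard-library sketch pulls JSON-LD blocks out of a page and keeps the ones whose @type is wanted. It is not how StructuredDataSpider is implemented: the class name LdJsonExtractor, the sample HTML, and the local WANTED_TYPES list are assumptions for illustration only, and Microdata is not handled.

```python
import json
from html.parser import HTMLParser


class LdJsonExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_ld_json = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_ld_json = True
            self._chunks = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_ld_json:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_ld_json:
            self._in_ld_json = False
            self.documents.append(json.loads("".join(self._chunks)))


# Hypothetical page content.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "Restaurant", "name": "Sample Diner",
 "address": {"@type": "PostalAddress", "addressLocality": "San Francisco"}}
</script>
<script type="application/ld+json">
{"@type": "WebSite", "name": "Sample"}
</script>
</head><body></body></html>
"""

# Stand-in for the spider's wanted_types attribute.
WANTED_TYPES = ["Restaurant", "Store"]

extractor = LdJsonExtractor()
extractor.feed(SAMPLE_HTML)
wanted = [doc for doc in extractor.documents if doc.get("@type") in WANTED_TYPES]
print(wanted)
```

Real pages often nest the interesting object inside a list or an @graph key, which this sketch does not unwrap.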
For store locators that do allow searches by latitude/longitude, a grid of searchable latlon points is available for the US, CA, AU, and Europe here. Each point represents the centroid of a search where the radius distance is indicated in the file name. See the Dollar General scraper for an example of how you might utilize them for national searches.
For stores that do not have a national footprint (e.g. #1034), there are separate point files that include a state/territory attribute e.g. 'us_centroids_100mile_radius_state.csv'. This allows for points to be filtered down to specific states/territories when a national search is unnecessary.
Note: A search radius may overlap multiple states especially when it’s centered near a state boundary. This creates a one to many relationship between the search radius point and the states covered in that search zone. This means that for the state files, there will be records that share the same latlon associated to differing states. The same is true for the European and Canadian territory files.
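
The state-level files and the one-to-many note above suggest a simple pattern: filter rows by state, then de-duplicate coordinates so that each search centre is requested only once. A minimal sketch follows, assuming column names latitude, longitude, and state and using inline sample rows; the headers of the real point files may differ, and the helper name points_for_states is hypothetical.

```python
import csv
import io

# Inline stand-in for a state-level point file.
# Column names and values are assumptions for this example.
SAMPLE_CSV = """\
latitude,longitude,state
41.0,-74.5,NJ
41.0,-74.5,NY
42.5,-75.0,NY
40.0,-77.0,PA
"""


def points_for_states(csv_file, states):
    """Yield unique (lat, lon) pairs whose row matches one of the given states."""
    seen = set()
    for row in csv.DictReader(csv_file):
        if row["state"] not in states:
            continue
        point = (float(row["latitude"]), float(row["longitude"]))
        # The same point can appear once per state it overlaps, so skip repeats.
        if point not in seen:
            seen.add(point)
            yield point


points = list(points_for_states(io.StringIO(SAMPLE_CSV), {"NJ", "NY"}))
print(points)
```

In a spider, each remaining point would typically become one search request against the store locator.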
The simplest thing a spider can do is to load the start_urls, process the page, and yield the data as GeojsonPointItem objects from the parse() method. Usually that's not enough to get at useful data, though. The parse() method can also yield a Request object, which scrapy will use to add another URL to the request queue.
By default, the parse() method on the spider will be called with the response for the new request. In many cases it's easier to create a new function to parse the new page's content and pass that function in via the Request object's callback parameter like so:

    yield scrapy.Request(
        response.urljoin(store_url.extract()),
        callback=self.parse_store
    )

Since the next URL you want to request is usually pulled from an href in the page and relative to the page you're on, you can use the response.urljoin() method as a shortcut to build the URL for the next request.
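
To see what that URL resolution does, here is a standalone sketch using urllib.parse.urljoin from the standard library, which applies the same kind of relative-URL resolution against a base URL. The page URL and hrefs are made-up examples.

```python
from urllib.parse import urljoin

# Hypothetical page the spider is currently parsing.
page_url = "https://www.sample.com/locations/ca/"

# Hrefs as they might appear in the page's links.
hrefs = [
    "san-francisco/store-12.html",   # relative to the current directory
    "/locations/ny/",                # relative to the site root
    "https://www.sample.com/about",  # already absolute
]

resolved = [urljoin(page_url, href) for href in hrefs]
for url in resolved:
    print(url)
```

Absolute hrefs pass through unchanged, so it is safe to resolve every extracted link this way without checking its form first.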
Instead of running scrapy crawl every time you want to try your spider, you can use the Scrapy shell to load a page and experiment with XPath queries. Once you're happy with the query that extracts interesting data, you can use it in your spider. This is a whole lot easier than running the whole crawl command every time you make a change to your spider.
To enter the shell, use scrapy shell http://example.com (where you replace the URL with your own). It will drop you into a Python shell after requesting the page and parsing it. Once in the shell, you can do things with the response object as if you were in your spider. The shell also offers a shortcut function called fetch() that lets you pull up a different page.
The data generated by our spiders is provided on our website and released under Creative Commons’ CC-0 waiver.
The spider software that produces this data (this repository) is licensed under the MIT license.