This repository has been archived by the owner on Jul 16, 2018. It is now read-only.

A Ruby script for scraping collection data from the Smithsonian Institution

mdlincoln/si-scrape

si-scrape

This Ruby script scrapes information from the web portal for the collections of the Smithsonian Institution and parses it into easily readable JSON. It is a quick-and-dirty solution for downloading structured data from the Smithsonian while they finalize a Linked Open Data implementation for their collections.

This README assumes basic knowledge of how to run a Ruby script from your command line. Beginner tutorials can be found for Windows computers and OS X. You will also need to install the Ruby gems Nokogiri and ruby-progressbar, for example with gem install nokogiri ruby-progressbar (Tutorials on adding Ruby gems).

Generate your query URL

Set up your search on the SI collection portal. You can enter search keywords and narrow your results by parameters such as date, culture, or catalog record source (the last is particularly helpful for limiting your search to a single museum within the SI). Once you have entered your terms, copy the URL from your browser's address bar.

My test query looks for objects of the type Works of Art that feature the keyword space. The URL for this query looks like this: http://collections.si.edu/search/results.htm?tag.cstype=all&q=space&fq=object_type:%22Works+of+art%22
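If you would rather assemble a query URL in code than copy it from the browser, a minimal sketch using Ruby's standard URI module follows. The helper name si_query_url is hypothetical and not part of si-scrape.rb, and the sketch assumes the portal only needs the three parameters seen in the test URL above (tag.cstype, q, and fq).

```ruby
require "uri"

# Hypothetical helper (not part of si-scrape.rb): builds a search URL from
# a keyword and an optional object type filter, using only the parameters
# that appear in the test URL in this README.
def si_query_url(keyword, object_type = nil)
  params = [["tag.cstype", "all"], ["q", keyword]]
  # The type filter is written as object_type:"..." before encoding.
  params << ["fq", %(object_type:"#{object_type}")] if object_type
  "http://collections.si.edu/search/results.htm?" + URI.encode_www_form(params)
end

puts si_query_url("space", "Works of art")
```

URI.encode_www_form percent-encodes the colon as %3A, so the result matches the form of the URL in the example session rather than the literal colon shown in the browser URL.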

Run si-scrape.rb

Run ruby si-scrape.rb and paste the copied URL when prompted. The script will begin to download from collections.si.edu, displaying a rough progress bar like this:

$ ruby si-scrape.rb 
Enter query URL: http://collections.si.edu/search/results.htm?tag.cstype=all&fq=object_type%3A%22Works+of+art%22
Looking up query on collections.si.edu...
Pages of results: 704
|===>>                                               | 6% Results scraped

As each page of records is downloaded, it is parsed and written to a JSON file in the same directory as the script (default file path: output.json). The script creates a JSON file instead of a CSV for two reasons:

  1. Smithsonian object records carry diverse metadata: some fields are present in some records but not in others. A CSV file is best suited for rows of data that all share the same columns, which would not work for the output from collections.si.edu.
  2. Some fields, like those for Topic or Type, have more than one value, which JSON can handle with nested arrays. A CSV needs an additional delimiter character, and support for reading such complex CSVs is patchy at best.
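Both reasons can be illustrated with a short sketch. The two records below are made up for illustration (the IDs example_1 and example_2 do not come from the Smithsonian); they show that records with different fields and multi-valued fields survive a round trip through Ruby's JSON module unchanged.

```ruby
require "json"

# Made-up records for illustration only: the two records do not share the
# same fields, and "Topic" holds more than one value.
records = {
  "example_1" => {
    "Title" => "First object",
    "Topic" => ["Landscape", "Architecture Exterior\\ruins"]
  },
  "example_2" => {
    "Title"  => "Second object",
    "Medium" => ["oil on canvas"]
  }
}

json = JSON.pretty_generate(records)
puts json

# Parsing the text back yields the same nested structure.
restored = JSON.parse(json)
puts restored == records
```

A CSV version of the same data would need one column for every field that appears anywhere in the result set, plus a second delimiter inside the Topic cell.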

Records look like this:

{
  "saam_1978.146.1": {
    "Title": "Slaughterhouse Ruins at Aledo",
    "Image": "http://americanart.si.edu/images/1978/1978.146.1_1a.jpg",
    "Artist": [
      "Gertrude Abercrombie, born Austin, TX 1909-died Chicago, IL 1977"
    ],
    "Medium": [
      "oil on canvas"
    ],
    "Dimensions": [
      "20 x 24 in. (50.9 x 61.0 cm)"
    ],
    "Type": [
      "Painting"
    ],
    "Date": [
      "1937"
    ],
    "Topic": [
      "Landscape",
      "Landscape\\Spain\\Aledo",
      "Architecture Exterior\\ruins",
      "Architecture Exterior\\industry\\slaughterhouse"
    ],
    "Credit Line": [
      "Smithsonian American Art Museum, Gift of the Gertrude Abercrombie Trust"
    ],
    "Object number": [
      "1978.146.1"
    ],
    "See more items in": [
      "Smithsonian American Art Museum Collection"
    ],
    "Data Source": [
      "Smithsonian American Art Museum"
    ],
    "Record ID": [
      "saam_1978.146.1"
    ],
    "Visitor Tag(s)": [
      "\nNo tags yet, be the first!\n\nAdd Your Tags!\n"
    ]
  }
}

Every SI object comes with a unique ID (e.g. saam_1978.146.1), a title (e.g. Slaughterhouse Ruins at Aledo), and an image URL (n.b. this URL sometimes leads to a blank image). All other fields can have multiple values, so they are stored as arrays in the JSON output, which can easily be parsed with Ruby's JSON module or another library of your choice.
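As a minimal sketch of reading the output with Ruby's JSON module, the example below parses an inline, shortened copy of the record shown above so that it runs without a scrape. The summarize helper is hypothetical. For real output, replace the sample with File.read("output.json").

```ruby
require "json"

# Shortened copy of the sample record from this README, kept inline so the
# sketch runs on its own. With real output, use File.read("output.json").
sample = <<~'JSON'
  {
    "saam_1978.146.1": {
      "Title": "Slaughterhouse Ruins at Aledo",
      "Image": "http://americanart.si.edu/images/1978/1978.146.1_1a.jpg",
      "Type": ["Painting"],
      "Topic": ["Landscape", "Architecture Exterior\\ruins"]
    }
  }
JSON

records = JSON.parse(sample)

# Hypothetical helper: one summary line per record. Title is a plain string;
# fields such as Topic are arrays and may be missing, so fetch with a default.
def summarize(records)
  records.map do |id, fields|
    topics = fields.fetch("Topic", [])
    "#{id}: #{fields["Title"]} (#{topics.length} topics)"
  end
end

puts summarize(records)
```

Using fetch with an empty array as the default avoids errors on records that lack a given field, which is common in this data.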

To-Do

Fork/contact me with more suggestions for the project!


Matthew D. Lincoln | Ph.D. Student, Department of Art History & Archaeology, University of Maryland, College Park
