This Ruby script scrapes information from the web portal for the collections of the Smithsonian Institution and parses it into easily readable JSON. It is a quick-and-dirty solution for downloading structured data from the Smithsonian while they finalize a Linked Open Data implementation for their collections.
This README assumes basic knowledge of how to run a Ruby script from your command line. Beginner tutorials can be found for Windows computers and OS X. You will also need to install the Ruby gems Nokogiri and ruby-progressbar (tutorials on adding Ruby gems).
Establish your search parameters on the SI collection portal. You can enter search keywords, as well as narrow your results by various parameters like date, culture, or catalog record source (this parameter is particularly helpful for limiting your search to a particular museum within the SI). Once you have entered your terms, copy the URL from your browser's address bar.
My test query looks for objects of the type "Works of Art" that feature the keyword "space". The URL for this query looks like this: http://collections.si.edu/search/results.htm?tag.cstype=all&q=space&fq=object_type:%22Works+of+art%22
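If you would rather assemble the query URL in Ruby than copy it from the browser, the following is a minimal sketch. The helper name build_query_url is hypothetical and not part of si-scrape.rb, and the parameter names (tag.cstype, q, fq) are simply taken from the example URL above; other facets may use different parameters. Note that URI.encode_www_form percent-encodes the colon as %3A, which is the form shown in the sample session further down.

```ruby
require "uri"

# Hypothetical helper (not part of si-scrape.rb): builds a search URL from a
# keyword and an optional object_type facet, mirroring the example query.
def build_query_url(keyword, object_type = nil,
                    base: "http://collections.si.edu/search/results.htm")
  params = [["tag.cstype", "all"], ["q", keyword]]
  # The facet value is wrapped in double quotes, as in the browser URL.
  params << ["fq", %(object_type:"#{object_type}")] if object_type
  "#{base}?#{URI.encode_www_form(params)}"
end

puts build_query_url("space", "Works of art")
# prints http://collections.si.edu/search/results.htm?tag.cstype=all&q=space&fq=object_type%3A%22Works+of+art%22
```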
Run ruby si-scrape.rb and paste the copied URL when prompted. The script will begin to download from collections.si.edu, displaying a rough progress bar like this:
$ ruby si-scrape.rb
Enter query URL: http://collections.si.edu/search/results.htm?tag.cstype=all&fq=object_type%3A%22Works+of+art%22
Looking up query on collections.si.edu...
Pages of results: 704
|===>> | 6% Results scraped
As each page of records is downloaded, it will be parsed into a JSON file in the same directory as the script (default file path: output.json). This script creates a JSON file instead of a CSV for two reasons:
- To accommodate the diverse metadata that are present in some Smithsonian object records but not in others. A CSV file is best suited for rows of data that all share the same columns; this would not work for the output from collections.si.edu.
- Some fields, like those for Topic or Type, have more than one value, which JSON can handle with nested arrays; a CSV needs an additional delimiter character, and support for reading such complex CSVs is patchy at best.
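The two reasons above can be illustrated with a small sketch. The second record below is invented purely for illustration (its ID and values are hypothetical), and the first is abbreviated from a real record; the point is only that the records do not share the same fields and that some fields hold several values.

```ruby
require "json"

records = {
  # Abbreviated from a real record.
  "saam_1978.146.1" => {
    "Title"  => "Slaughterhouse Ruins at Aledo",
    "Medium" => ["oil on canvas"],
    "Topic"  => ["Landscape", "Architecture Exterior\\ruins"]
  },
  # Hypothetical record with a different set of fields.
  "example_0001" => {
    "Title" => "Hypothetical record",
    "Date"  => ["1900"]
  }
}

# A CSV would need one column per field name seen in any record,
# leaving most cells empty, plus a second delimiter for multi-valued cells.
columns = records.values.flat_map(&:keys).uniq
puts columns.inspect

# JSON stores each record with only the fields it has, and arrays as arrays.
puts JSON.pretty_generate(records)
```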
Records will appear as such:
{
  "saam_1978.146.1": {
    "Title": "Slaughterhouse Ruins at Aledo",
    "Image": "http://americanart.si.edu/images/1978/1978.146.1_1a.jpg",
    "Artist": [
      "Gertrude Abercrombie, born Austin, TX 1909-died Chicago, IL 1977"
    ],
    "Medium": [
      "oil on canvas"
    ],
    "Dimensions": [
      "20 x 24 in. (50.9 x 61.0 cm)"
    ],
    "Type": [
      "Painting"
    ],
    "Date": [
      "1937"
    ],
    "Topic": [
      "Landscape",
      "Landscape\\Spain\\Aledo",
      "Architecture Exterior\\ruins",
      "Architecture Exterior\\industry\\slaughterhouse"
    ],
    "Credit Line": [
      "Smithsonian American Art Museum, Gift of the Gertrude Abercrombie Trust"
    ],
    "Object number": [
      "1978.146.1"
    ],
    "See more items in": [
      "Smithsonian American Art Museum Collection"
    ],
    "Data Source": [
      "Smithsonian American Art Museum"
    ],
    "Record ID": [
      "saam_1978.146.1"
    ],
    "Visitor Tag(s)": [
      "\nNo tags yet, be the first!\n\nAdd Your Tags!\n"
    ]
  }
}
Every SI object comes with a unique ID (e.g. saam_1978.146.1), title (e.g. Slaughterhouse Ruins at Aledo), and image URL (n.b. this URL will sometimes lead to a blank image). Other elements could potentially have multiple values, so they are stored as nested arrays in the JSON output, which can easily be parsed by Ruby's JSON module or another library of your choice.
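As a rough sketch of working with the output in Ruby, the snippet below parses a record set and collects the titles of all records tagged with a given topic. The helper name titles_with_topic is hypothetical, and the inline sample stands in for the real file; to use your own results, replace the sample with JSON.parse(File.read("output.json")), assuming the default output path.

```ruby
require "json"

# Hypothetical helper: returns the titles of all records whose "Topic"
# array includes the given term. Records without a "Topic" field are skipped.
def titles_with_topic(records, topic)
  records.select { |_id, fields| Array(fields["Topic"]).include?(topic) }
         .map { |_id, fields| fields["Title"] }
end

# Abbreviated sample in the same shape as the scraper's output.
sample = <<~'JSON'
  {
    "saam_1978.146.1": {
      "Title": "Slaughterhouse Ruins at Aledo",
      "Date": ["1937"],
      "Topic": ["Landscape"]
    }
  }
JSON

records = JSON.parse(sample)
puts titles_with_topic(records, "Landscape")
```

Wrapping the lookup in Array(...) means a record that lacks the field entirely is treated as having no values rather than raising an error, which matters because not every record carries the same metadata.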
Fork/contact me with more suggestions for the project!
Matthew D. Lincoln | Ph.D Student, Department of Art History & Archaeology, University of Maryland, College Park