This #dataengineering project, commissioned by Hack for LA leadership, entails developing a reusable pipeline for attributing L.A. County cities/parcels to the Native American tribes on whose ancestral lands they sit. In doing so, I leverage public resources and semi-structured data (in PDFs) to deliver an adaptive #dataproduct that can be used for multiple purposes.
The steps in this pipeline are as follows, and the code (in #R) is below…
- Extract/parse city -> zip code lookup tables from LA County’s official PDF
- Clean up, normalize, and merge extracted lookup tables
- Use Google’s Geocoding API to get the lat/lons per Zip
- Query Native Land Digital’s API to find the Native American peoples (and languages) who originally lived there
- Deliver the data product (currently in CSV format; can easily push to some DB)
For questions, please feel free to contact: Jude Calvillo. Thanks!
## Load libraries
library(tabulizer) # for extracting tables from PDFs
library(jsonlite) # for neatly converting JSON to dataframes
library(stringr) # for string splitting
library(dplyr) # for data transformations/aggregations
library(ggplot2) # for plotting
library(httr) # for raw API calls
library(ggmap) # easy interface to Google Geocode API
Here we use the tabulizer library to extract the lookup tables from the PDF, then preview the data.
## Parse the Zip code lookup table from LA County's PDF at: http://file.lacounty.gov/SDSInter/lac/1031552_MasterZipCodes.pdf
zips <- extract_tables("http://file.lacounty.gov/SDSInter/lac/1031552_MasterZipCodes.pdf")
## Preview
head(zips[[1]])
## [,1] [,2]
## [1,] "" "County of Los Angeles"
## [2,] "" "ZIP CODE LIST"
## [3,] "" ""
## [4,] "ZIP CODE" "AREA NAME * (See note below)"
## [5,] "90001 Florence/South Central (City of LA)" ""
## [6,] "90002 Watts (City of LA)" ""
## [,3]
## [1,] ""
## [2,] ""
## [3,] "Supervisorial District"
## [4,] "1st 2nd 3rd 4th 5th"
## [5,] "X X"
## [6,] "X"
head(zips[[2]])
## [,1] [,2] [,3] [,4]
## [1,] "90047" "South Central (City of LA)" "" "X"
## [2,] "90048" "West Beverly (City of LA)" "" "X"
## [3,] "90049" "Bel Air Estates (City of LA)/Brentwood (City of LA)" "" ""
## [4,] "90050" "Los Angeles" "X" ""
## [5,] "90051" "Los Angeles" "" "X"
## [6,] "90052" "Los Angeles" "" "X"
## [,5] [,6] [,7]
## [1,] "" "" ""
## [2,] "X" "" ""
## [3,] "X" "" ""
## [4,] "" "" ""
## [5,] "" "" ""
## [6,] "" "" ""
head(zips[[7]])
## [,1] [,2] [,3]
## [1,] "91301 Agoura/Oak Park" "X" ""
## [2,] "91302 Calabasas/Hidden Hills" "X" ""
## [3,] "91303 Canoga Park (City of LA)" "X" ""
## [4,] "91304 Canoga Park (City of LA)" "X" "X"
## [5,] "91305 Canoga Park (City of LA)" "X" ""
## [6,] "91306 Winnetka (City of LA)" "X" ""
Header and column detection were a little problematic on a few pages, but the errors fall within a limited range of classes and the data is still usable. Therefore, let’s just trim to what we need. This involves…
- Grabbing only the first two columns per table (given the commonality between the errors)
- Removing all rows that don't start with a zip code
- Splitting and replacing the text for all rows where the zip and area name were merged
- Flagging areas that are officially part of the City of LA
## Across the different error types we're seeing, the data we need sits in the first two columns
## of each extracted matrix, so let's row-bind the first two columns of all the matrices
zips_df <- data.frame()
for(i in 1:length(zips)){
  zip_page <- as.data.frame(zips[i])[, c(1:2)]
  zips_df <- rbind(zips_df, zip_page)
}
## Tidy the column names and remove rows that don't start with a five-digit zip code, then preview
names(zips_df) <- c("zip","area_name")
zips_df <- zips_df[grep("^[0-9]{5}", as.character(zips_df$zip)),]
head(zips_df)
## zip area_name
## 5 90001 Florence/South Central (City of LA)
## 6 90002 Watts (City of LA)
## 7 90003 South Central (City of LA)
## 8 90004 Hancock Park (City of LA)
## 9 90005 Koreatown (City of LA)
## 10 90006 Pico Heights (City of LA)
tail(zips_df)
## zip area_name
## 486 93560 Rosamond X
## 487 93563 Valyermo X
## 488 93584 Lancaster X
## 489 93586 Lancaster X
## 490 93590 Palmdale X
## 491 93591 Palmdale/Lake Los Angeles X
## How many rows have zip and area name merged?
zips_df$zip <- as.character(zips_df$zip)
zips_df$area_name <- as.character(zips_df$area_name)
length(zips_df$zip[grep(" [A-Za-z]", zips_df$zip)])
## [1] 175
## For all those rows with zip and area name merged, split the text and replace
zips_df[grep(" [A-Za-z]", zips_df$zip),] <- str_split_fixed(zips_df$zip[grep(" [A-Za-z]", zips_df$zip)], " ", n = 2)
dim(zips_df)
## [1] 487 2
## Flag whether an area is in the "City of LA" (rather than keeping that note as in-line text), to keep the table clean and lean
zips_df$LA_city <- grepl("City of LA", zips_df$area_name, ignore.case = T)
zips_df$area_name <- gsub("\\(City of LA\\)| \\(City of LA\\)", "", zips_df$area_name, ignore.case = T)
## Preview final data
head(zips_df)
## zip area_name LA_city
## 5 90001 Florence/South Central TRUE
## 6 90002 Watts TRUE
## 7 90003 South Central TRUE
## 8 90004 Hancock Park TRUE
## 9 90005 Koreatown TRUE
## 10 90006 Pico Heights TRUE
tail(zips_df)
## zip area_name LA_city
## 486 93560 Rosamond FALSE
## 487 93563 Valyermo FALSE
## 488 93584 Lancaster FALSE
## 489 93586 Lancaster FALSE
## 490 93590 Palmdale FALSE
## 491 93591 Palmdale/Lake Los Angeles FALSE
Next, we run each zip code through Google's Geocoding API (via ggmap) to get its latitude and longitude. Unfortunately, this API is rate-limited, so we'll factor that into our design: the geocoded table is cached to a CSV, and the API is only called when that file doesn't exist yet.
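That cache is the main mitigation in the chunk below. If the calls ever need to be re-run from scratch, one additional option is a throttled wrapper around geocode(); here's a minimal sketch, where the pause length is an assumption rather than a documented quota:
## Minimal sketch of a throttled geocoder: pause briefly before each request so a
## full re-run stays comfortably under the API's rate limit (0.1s is an assumed value)
geocode_throttled <- function(locations, pause = 0.1){
  results <- lapply(locations, function(loc){
    Sys.sleep(pause)                 # wait before each request
    geocode(loc, output = "latlon")  # one-row result: lon, lat
  })
  do.call(rbind, results)
}
## Hypothetical usage (same shape as the geocode() call below):
## zips_df[, c("lon", "lat")] <- geocode_throttled(zips_df$zip)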
if(!file.exists("dat/los-angeles-county_zip-codes_lat-lon.csv")){
## Grab my Google Geocode API key and register it with GGmaps, my easy interface to Google Maps family of APIs
geo_key <- readRDS("geo_key.RDS")
register_google(geo_key)
## Run zips through Geocode API call and keep lat/lon as new column values per zip
zips_df[, c("lon", "lat")] <- geocode(zips_df$zip, output = "latlon")
## Save table for later use
write.csv(zips_df, "dat/los-angeles_county_zip-codes_lat-lon.csv", row.names = F)
} else {
## Let user know we've already got the data and therefore don't need to call the API
print("Zip to Lat/Lon lookup table already exists. Skipping API call, then previewing data...")
## Get existing data, if it's there
zips_df <- read.csv("dat/los-angeles-county_zip-codes_lat-lon.csv", stringsAsFactors = F)
}
## [1] "Zip to Lat/Lon lookup table already exists. Skipping API call, then previewing data..."
## Preview the data
head(zips_df)
## zip area_name LA_city lon lat
## 1 90001 Florence/South Central TRUE -118.2468 33.96979
## 2 90002 Watts TRUE -118.2497 33.95111
## 3 90003 South Central TRUE -118.2731 33.96580
## 4 90004 Hancock Park TRUE -118.3082 34.07489
## 5 90005 Koreatown TRUE -118.3097 34.05788
## 6 90006 Pico Heights TRUE -118.2965 34.04708
tail(zips_df)
## zip area_name LA_city lon lat
## 482 93560 Rosamond FALSE -118.3228 34.884301
## 483 93563 Valyermo FALSE -117.7491 34.405338
## 484 93584 Lancaster FALSE -118.1400 34.700000
## 485 93586 Lancaster FALSE 110.3391 1.542587
## 486 93590 Palmdale FALSE -118.0600 34.500000
## 487 93591 Palmdale/Lake Los Angeles FALSE -117.8194 34.592562
## Any NAs?
print("Possible problem zips, for inspection:")
## [1] "Possible problem zips, for inspection:"
zips_df[is.na(zips_df$lon),]
## zip area_name LA_city lon lat
## 79 90080 Airport Worldway TRUE NA NA
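Just the one NA, for zip 90080 (Airport Worldway). If we wanted to fill that gap rather than skip it downstream, one option (not run here) would be to retry the NA rows with a fuller query string so the API has more context; a quick sketch, assuming the same ggmap/register_google() setup as above:
## Sketch (not run): retry geocoding for rows that came back NA, using a fuller
## query string ("<zip>, Los Angeles County, CA") to give the API more context
na_idx <- which(is.na(zips_df$lon))
if(length(na_idx) > 0){
  retry_query <- paste0(zips_df$zip[na_idx], ", Los Angeles County, CA")
  zips_df[na_idx, c("lon", "lat")] <- geocode(retry_query, output = "latlon")
}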
Step 4. Send lat, lons to Native-land.ca API to get Native American lands and languages each area covers
For each lat/lon combo, we ping Native-land.ca's API to get the Native American lands that once covered that location, along with the languages spoken there.
## Add empty territories and languages fields for now
zips_df$native_territories <- NA
zips_df$native_languages <- NA
## For each lat, lon combo, call the API
if(!file.exists("dat/los-angeles-county_native-american-lands.csv")){
for(i in 1:length(zips_df$zip)){
if(!is.na(zips_df$lon[i])){
api_resp <- GET("https://native-land.ca/api/index.php",
query = list(maps = "languages,territories", position = paste0(zips_df$lat[i], ",", zips_df$lon[i])),
add_headers('User-Agent' = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'))
api_content <- fromJSON(rawToChar(api_resp$content))$properties
zips_df$native_territories[i] <- paste0(api_content$Name[grep("territories", api_content$description)], collapse = ",")
zips_df$native_languages[i] <- paste0(api_content$Name[grep("languages", api_content$description)], collapse = ",")
} else {
print(paste0("Not a valid set of coordinate values for area: ", zips_df$area_name[i], " (", zips_df$zip[i], ").", " Skipping."))
}
}
} else {
## Let user know we've already got the data and therefore don't need to call the API
print("Table already exists. Skipping API call, then previewing data...")
## Get existing data, if it's there
zips_df <- read.csv("dat/los-angeles-county_native-american-lands.csv", stringsAsFactors = F)
}
## [1] "Table already exists. Skipping API call, then previewing data..."
## Preview data
head(zips_df)
## zip area_name LA_city lon lat native_territories
## 1 90001 Florence/South Central TRUE -118.2468 33.96979 Chumash,Tongva,Kizh
## 2 90002 Watts TRUE -118.2497 33.95111 Chumash,Tongva,Kizh
## 3 90003 South Central TRUE -118.2731 33.96580 Chumash,Tongva,Kizh
## 4 90004 Hancock Park TRUE -118.3082 34.07489 Chumash,Tongva,Kizh
## 5 90005 Koreatown TRUE -118.3097 34.05788 Chumash,Tongva,Kizh
## 6 90006 Pico Heights TRUE -118.2965 34.04708 Chumash,Tongva,Kizh
## native_languages
## 1 Tongva
## 2 Tongva
## 3 Tongva
## 4 Tongva
## 5 Tongva
## 6 Tongva
## Save to file
write.csv(zips_df, "dat/los-angeles-county_native-american-lands.csv", row.names = F)
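As noted in the overview, the delivered CSV could just as easily be pushed to a database. Here's a minimal sketch using DBI with RSQLite; neither package is loaded above, so treat this as illustrative only, and the database file name is an assumption:
## Illustrative only: write the final table to a local SQLite database.
## Requires the DBI and RSQLite packages, which aren't part of the pipeline above.
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "dat/la-county-native-lands.sqlite")
dbWriteTable(con, "native_american_lands", zips_df, overwrite = TRUE)
dbDisconnect(con)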
## List the unique tribes across LA county
print("Unique Native American territories across LA County:")
## [1] "Unique Native American territories across LA County:"
unique(na.omit(unlist(strsplit(zips_df$native_territories, ","))))
## [1] "Chumash" "Tongva"
## [3] "Kizh" "Micqanaqa’n"
## [5] "Fernandeño Tataviam" "Acjachemen (Juaneño)"
## [7] "Payómkawichum (Luiseño)" "Yuhaviatam/Maarenga’yam (Serrano)"
## [9] "Kitanemuk"
Thanks!
-Jude C.