
Data pipeline: Ingesting data

Federal Spending Data

These data are in an S3 bucket as a PostgreSQL archive dump within a zip file. Below we'll unzip the file and restore the original PostgreSQL database.

1 - Get the archive file from S3 using root

  1. Log into the root Ubuntu account using sudo -u root -i.
  2. Install the AWS CLI:
sudo apt install awscli
  3. Configure the instance credentials to allow access to the S3 bucket (refer to the AWS CLI documentation if you run into credential or configuration issues):
aws configure
# a prompt will appear;
# enter the keys that were generated when the IAM profile was first created

This is a good time to start tmux, so the multi-hour restore steps below survive a dropped SSH connection.
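A minimal tmux workflow (the session name here is arbitrary):

tmux new -s restore      # start a named session
# ...run the long commands below inside it...
tmux attach -t restore   # reattach after a disconnect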

  4. Download the archive dump from S3 and unzip it:
aws s3 cp \
   s3://sanoke-insight-fedspend-projectdata/usaspending-db_20200110.zip \
   usaspending-db_20200110.zip
sudo apt install unzip
unzip usaspending-db_20200110.zip -d usaspending-db_20200110

2 - Restore and inspect the database

After we configured the database location in postgresql.conf, the record of the root user was lost, so we'll recreate it (same process as before):

sudo -u postgres createuser root   # recreate the root role
sudo -u postgres createdb root     # default database for the role
sudo -u postgres psql
-- inside psql: set a password and add SUPERUSER privileges
ALTER USER root WITH ENCRYPTED PASSWORD '<enter-password>';
GRANT ALL PRIVILEGES ON DATABASE root TO root;
ALTER USER root WITH SUPERUSER CREATEDB CREATEROLE;

Refer to the pg_restore documentation to understand the structure of these commands, including the flags.

# build a restore list that skips the materialized view refreshes
# (those entries are collected separately below)
pg_restore --list usaspending-db_20200110 | \
   sed '/MATERIALIZED VIEW DATA/d' > restore.list

# the command below takes 5.3 hours to complete
pg_restore --jobs 32 --dbname postgresql://root:'password123'@localhost:5432/root \
  --verbose --exit-on-error --use-list restore.list usaspending-db_20200110

psql --dbname postgresql://root:'password123'@localhost:5432/root --command \
   'ANALYZE VERBOSE;' --echo-all --set ON_ERROR_STOP=on --set \
   VERBOSITY=verbose --set SHOW_CONTEXT=always


# collect the skipped materialized view entries into a refresh list
pg_restore --list usaspending-db_20200110 | grep "MATERIALIZED VIEW DATA" > refresh.list
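This refresh list can then be fed back into pg_restore to populate the materialized views; a sketch mirroring the restore command above (same connection string and flags):

pg_restore --jobs 32 --dbname postgresql://root:'password123'@localhost:5432/root \
   --verbose --exit-on-error --use-list refresh.list usaspending-db_20200110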

We can log into psql to view a list of the loaded tables using \dt (a one-line example follows the list below), or download a GUI like pgAdmin or Postico. If using pgAdmin:

  • [General] Choose a name for the server; the choice is inconsequential.
  • [Connection] Use the same login information as in the pg_restore commands above.
  • [SSH Tunnel] Set up SSH tunneling using the same information used to SSH in via the command line.
  • Once connected, view tables in the Object Explorer on the left by selecting root > Schemas > public > Tables under the name of the server you just created (FedSpendData).
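As for the psql check mentioned above, a one-line example reusing the connection string from the restore step (psql's --command accepts a single backslash command):

psql --dbname postgresql://root:'password123'@localhost:5432/root --command '\dt'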

Historical Legislator Data

These data were read in as a JSON file and parsed into lists in PySpark; these lists were later assembled into a DataFrame for use in joins.

Unfortunately I used for loops here, although with more time I could have figured out an implementation with list comprehensions.

Additional information (congressional district population sizes, full spellings of state abbreviations) was ingested and joined in as additional columns (see the sketch after the code below).

# read in the JSON file, parse into lists
# NOTE manually placed JSON on each worker
lc = spark.read.option('multiline','true').json('legislators-current.json').collect()
firstName = []
lastName  = []
typeOfRep = []
party     = []
state     = []
district  = []
year      = []
# for every legislator...
for l in lc:
    # ...and every term...
    for t in l['terms']:
        y = 0
        term_length = int(t['end'][0:4]) - int(t['start'][0:4])
        # ...and for each year in the term
        while y <= term_length:
            firstName.append(l['name']['first'])
            lastName.append(l['name']['last'])
            typeOfRep.append(t['type'])
            party.append(t['party'])
            state.append(t['state'])
            district.append(t['district'])
            year.append(int(t['start'][0:4]) + y)
            y += 1
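
These lists were then assembled into a DataFrame, per the note above. A sketch of that step plus the join of the supplementary columns; states_df and the column names here are illustrative assumptions, not the original code:

# assemble the parsed lists into a Spark DataFrame
cols = ['firstName', 'lastName', 'typeOfRep', 'party', 'state', 'district', 'year']
legislators = spark.createDataFrame(
    list(zip(firstName, lastName, typeOfRep, party, state, district, year)),
    schema=cols)

# join in the supplementary columns (states_df is a hypothetical
# DataFrame keyed on the two-letter state abbreviation)
legislators = legislators.join(states_df, on='state', how='left')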