Data pipeline: Ingesting data
These data are in an S3 bucket as a PostgreSQL archive dump within a zip file. Below we'll unzip the file and restore the original PostgreSQL database.
- Log into the root Ubuntu account
sudo -u root -i
- Install AWS CLI
sudo apt install awscli
- Configure the instance credentials to allow access to the S3 bucket
aws configure
# a prompt will appear;
# enter the keys that were generated when the IAM profile was first created
This is a good time to start tmux, so the long-running steps below survive a dropped SSH connection.
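For example, a named session can be started, detached, and reattached like this (the session name restore is just an illustration):
tmux new -s restore
# detach with Ctrl-b then d; reattach later with
tmux attach -t restore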
- Download the archive dump from S3 and unzip it
aws s3 cp \
s3://sanoke-insight-fedspend-projectdata/usaspending-db_20200110.zip \
usaspending-db_20200110.zip
sudo apt install unzip
unzip usaspending-db_20200110.zip -d usaspending-db_20200110
After we configured the database location in postgresql.conf, the record of user root was lost, so we'll recreate it (same process as above).
sudo -u postgres createuser root
sudo -u postgres createdb root
sudo -u postgres psql
-- in psql: set a password and add SUPERUSER privileges (SQL)
ALTER USER root WITH ENCRYPTED PASSWORD '<enter-password>';
GRANT ALL PRIVILEGES ON DATABASE root TO root;
ALTER USER root WITH SUPERUSER CREATEDB CREATEROLE;
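For reference, the database relocation mentioned above is controlled by the data_directory setting in postgresql.conf; a minimal sketch (the path and version below are illustrative, not the values used in this project):
# postgresql.conf, e.g. /etc/postgresql/10/main/postgresql.conf (path varies by install)
data_directory = '/data/postgresql/10/main'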
Refer to the guidance to understand the structure of these commands, including the flags.
pg_restore --list usaspending-db_20200110 | \
sed '/MATERIALIZED VIEW DATA/D' > restore.list
# the command below takes 5.3 hours to complete
pg_restore --jobs 32 --dbname postgresql://root:'password123'@localhost:5432/root \
--verbose --exit-on-error --use-list restore.list usaspending-db_20200110
psql --dbname postgresql://root:'password123'@localhost:5432/root --command \
'ANALYZE VERBOSE;' --echo-all --set ON_ERROR_STOP=on --set \
VERBOSITY=verbose --set SHOW_CONTEXT=always
pg_restore --list usaspending-db_20200110 | grep "MATERIALIZED VIEW DATA" > refresh.list
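The refresh.list produced above can then be fed back to pg_restore to populate the materialized views once the base tables are in place; a sketch reusing the same flags and connection string as the restore command above (adjust as needed):
# refresh the materialized view data that was deferred earlier
pg_restore --jobs 32 --dbname postgresql://root:'password123'@localhost:5432/root \
    --verbose --exit-on-error --use-list refresh.list usaspending-db_20200110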
We can log into psql to view a list of the loaded tables using \dt, or download a GUI like pgAdmin or Postico. If using pgAdmin:
- [General] Choose a name for the server; the choice is inconsequential.
- [Connection] Use the same login information as in the pg_restore commands above.
- [SSH Tunnel] Set up SSH tunneling using the same information used to SSH in via the command line.
- Once connected, view tables using the Object Explorer on the left, selecting root > Schemas > Public > Tables under the name of the server you just created (FedSpendData).
These data were read in as a JSON file and parsed into lists in PySpark; these lists were later assembled into a DataFrame for use in joins. Unfortunately I used for loops here, although with more time I could have figured out an implementation with list comprehensions (a sketch of both appears after the code block below).
Additional information (congressional district population sizes, full spelling of state abbreviations) was ingested and joined as additional columns.
# read in the JSON file, parse into lists
# NOTE manually placed JSON on each worker
lc = spark.read.option('multiline','true').json('legislators-current.json').collect()
firstName = []
lastName = []
typeOfRep = []
party = []
state = []
district = []
year = []
# for every legislator...
for l in lc:
    # ...and every term...
    for t in l['terms']:
        y = 0
        term_length = int(t['end'][0:4]) - int(t['start'][0:4])
        # ...and for each year in the term
        while y <= term_length:
            firstName.append(l['name']['first'])
            lastName.append(l['name']['last'])
            typeOfRep.append(t['type'])
            party.append(t['party'])
            state.append(t['state'])
            district.append(t['district'])
            year.append(int(t['start'][0:4]) + y)
            y += 1
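The assembly of these lists into a DataFrame could look like the sketch below, which also shows the list-comprehension alternative to the loops above; the column names and the commented-out supplemental join are illustrative, not the exact ones used in the pipeline:
# list-comprehension alternative: one (legislator, term, year) record per row
rows = [
    (l['name']['first'], l['name']['last'], t['type'], t['party'],
     t['state'], t['district'], int(t['start'][0:4]) + y)
    for l in lc
    for t in l['terms']
    for y in range(int(t['end'][0:4]) - int(t['start'][0:4]) + 1)
]

# assemble the DataFrame used in the joins
legislators = spark.createDataFrame(
    rows, ['firstName', 'lastName', 'typeOfRep', 'party', 'state', 'district', 'year'])

# hypothetical supplemental join (DataFrame and column names are illustrative)
# legislators = legislators.join(district_pop, on=['state', 'district'], how='left')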
📬 Create an issue if you have one.
(📝 instructions accurate as of Feb 2020)