ingestion of all data #479
server/src/services/sqlIngest.py:

    async def populateDatabase(self,
                               years=range(2015, 2021),
                               limit=2000000,
                               querySize=50000):
We might want to move these default numbers to config, that way there's a single source of truth.
I put the defaults in the config, but I consumed them in app.py instead of here, because those default values were also appearing in app.py. So the defaults get applied immediately if the query params are missing, and then passed into this service.
Fixed that stuff. I also changed /ingest to a GET request because we're using query params for it instead of the body. Makes sense, yeah?
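As a rough sketch of that flow (the config keys and function name here are hypothetical, not the project's actual ones), the query-param defaulting in app.py could look something like this:

```python
# Hypothetical sketch: defaults live in config, and app.py fills in any
# missing /ingest query params before calling the ingestion service, so
# the service itself never needs fallback values.

CONFIG_DEFAULTS = {
    "years": list(range(2015, 2021)),
    "limit": 2000000,
    "querySize": 50000,
}

def resolve_ingest_params(query_params):
    """Merge the request's query params over the config defaults,
    ignoring params that were not supplied (None)."""
    params = dict(CONFIG_DEFAULTS)
    params.update({k: v for k, v in query_params.items() if v is not None})
    return params
```

This keeps the service's signature free of magic numbers while leaving a single place to tune them.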
The goal here was to make sure it was possible to get all of the socrata data -- from 2015 through today -- into our own database. This code did that successfully, in a mere 74 minutes :)
I used sqlalchemy instead of pandas to create the table and insert all the data. This made it possible to eliminate several of the cleaning tasks we were doing previously, such as converting columns to datetime and making sure the pandas columns were in sync with the database model. It also makes the Ingest model the single source of truth for the structure of the table.
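A minimal sketch of that idea, assuming a declarative SQLAlchemy model (the table name and column set here are illustrative; the real Ingest model has many more fields):

```python
# Sketch: an SQLAlchemy declarative model as the single source of truth
# for the ingest table. Creating the table from the model means column
# types (e.g. DateTime) can never drift out of sync with a separate
# pandas cleaning step.
from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Ingest(Base):
    __tablename__ = "ingest_staging"  # illustrative name

    # temporary surrogate key; srnumber is promoted to primary key
    # after deduplication (see below in the original description)
    id = Column(Integer, primary_key=True, autoincrement=True)
    srnumber = Column(String)
    createddate = Column(DateTime)
    closeddate = Column(DateTime)

# Base.metadata.create_all(engine) would then create the table
# directly from this model.
```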
There was one tricky issue, discussed on Slack, regarding the quirky sorting coming from the socrata api. I eliminated the sort parameter, as Russell suggested, but was still getting intermittent primary key violations due to duplicate srnumbers in separate batches from socrata. So what I did was this:
- create a temporary id column on the Ingest model to use as the primary key
- after all the data is ingested, deduplicate the table by srnumber, delete the temporary id column, and make the srnumber the primary key. Not pretty, but it works.

Other minor things:
- I put the socrata client into a separate class that handles automatic retries in the event that socrata times out (which happens pretty often)
- added a cleaning task that converts the closeddate values to null if they precede the createddate (fixing closed date is sometimes before created date #456)
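The dedup strategy described above can be demonstrated end-to-end with a tiny sqlite3 example (table and column names are illustrative; the real code runs against the project's database through sqlalchemy):

```python
# Demonstration of the dedup step: keep the first row per srnumber,
# using the temporary id column as the tiebreaker.
import sqlite3

def dedupe_by_srnumber(conn):
    """Delete every row whose id is not the minimum id for its
    srnumber, leaving exactly one row per srnumber."""
    conn.execute("""
        DELETE FROM ingest
        WHERE id NOT IN (
            SELECT MIN(id) FROM ingest GROUP BY srnumber
        )
    """)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ingest (id INTEGER PRIMARY KEY, srnumber TEXT)")
conn.executemany(
    "INSERT INTO ingest (srnumber) VALUES (?)",
    [("SR-1",), ("SR-2",), ("SR-1",)],  # duplicate srnumber across batches
)
dedupe_by_srnumber(conn)
rows = sorted(r[0] for r in conn.execute("SELECT srnumber FROM ingest"))
# rows is now ["SR-1", "SR-2"]
```

After this, srnumber is unique and can safely be promoted to the primary key.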
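The retrying client could be sketched along these lines (the wrapper class, retry counts, and the idea of passing in the fetch callable are all illustrative, not the project's actual implementation):

```python
# Sketch: wrap a socrata fetch callable in a small class that retries
# on timeouts, since socrata times out fairly often.
import time

class RetryingClient:
    def __init__(self, fetch, max_retries=3, backoff=1.0):
        self.fetch = fetch          # callable that may raise TimeoutError
        self.max_retries = max_retries
        self.backoff = backoff      # seconds to wait between attempts

    def get(self, *args, **kwargs):
        for attempt in range(self.max_retries):
            try:
                return self.fetch(*args, **kwargs)
            except TimeoutError:
                # re-raise only once we've exhausted all attempts
                if attempt == self.max_retries - 1:
                    raise
                time.sleep(self.backoff)
```

Isolating the retry logic in one class keeps the ingestion loop itself free of error handling.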
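The closeddate cleaning rule from #456 amounts to a simple comparison; a hedged sketch, assuming records are dicts with datetime fields (the actual code operates on the Ingest rows):

```python
# Sketch of the cleaning task: null out closeddate when it precedes
# createddate, since a request cannot close before it was created.
from datetime import datetime

def clean_closeddate(record):
    """Return the record with closeddate set to None if it is
    earlier than createddate; otherwise return it unchanged."""
    created = record.get("createddate")
    closed = record.get("closeddate")
    if created and closed and closed < created:
        record = {**record, "closeddate": None}
    return record
```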