ingestion of all data #479
server/src/services/sqlIngest.py:

    async def populateDatabase(self,
                               years=range(2015, 2021),
                               limit=2000000,
                               querySize=50000):
We might want to move these default numbers to config, that way there's a single source of truth.
I put the defaults in the config, but I consumed them in app.py instead of here, because those default values were also appearing in app.py. So the defaults get applied immediately if the query params are missing, and then passed into this service.
Fixed that stuff. I also changed /ingest to a GET request because we're using query params for it instead of the body. Makes sense, yeah?
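As a rough sketch of that flow (the config keys and function name here are hypothetical, not the project's actual ones), the query-param defaulting in app.py could look something like this:

```python
# Hypothetical sketch: defaults live in config, and app.py fills in any
# missing /ingest query params before calling the ingestion service, so
# the service itself never needs fallback values.

CONFIG_DEFAULTS = {
    "years": list(range(2015, 2021)),
    "limit": 2000000,
    "querySize": 50000,
}

def resolve_ingest_params(query_params):
    """Merge the request's query params over the config defaults,
    ignoring params that were not supplied (None)."""
    params = dict(CONFIG_DEFAULTS)
    params.update({k: v for k, v in query_params.items() if v is not None})
    return params
```

This keeps the service's signature free of magic numbers while leaving a single place to tune them.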
The goal here was to make sure it was possible to get all of the socrata data -- from 2015 through today -- into our own database. This code did that successfully, in a mere 74 minutes :)
I used sqlalchemy instead of pandas to create the table and insert all the data. This made it possible to eliminate several of the cleaning tasks we were doing previously, such as converting columns to datetime and making sure the pandas columns were in sync with the database model. It also makes the Ingest model the single source of truth for the structure of the table.
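A minimal sketch of that idea, assuming a declarative SQLAlchemy model (the table name and column set here are illustrative; the real Ingest model has many more fields):

```python
# Sketch: an SQLAlchemy declarative model as the single source of truth
# for the ingest table. Creating the table from the model means column
# types (e.g. DateTime) can never drift out of sync with a separate
# pandas cleaning step.
from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Ingest(Base):
    __tablename__ = "ingest_staging"  # illustrative name

    # temporary surrogate key; srnumber is promoted to primary key
    # after deduplication (see below in the original description)
    id = Column(Integer, primary_key=True, autoincrement=True)
    srnumber = Column(String)
    createddate = Column(DateTime)
    closeddate = Column(DateTime)

# Base.metadata.create_all(engine) would then create the table
# directly from this model.
```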
There was one tricky issue, discussed on Slack, regarding the quirky sorting coming from the socrata api. I eliminated the sort parameter, as Russell suggested, but was still getting intermittent primary key violations due to duplicate srnumbers in separate batches from socrata. So what I did was this:
- create a temporary id column on the Ingest model to use as the primary key
- after all the data is ingested, deduplicate the table by srnumber, delete the temporary id column, and make the srnumber the primary key. Not pretty, but it works.

Other minor things:
- I put the socrata client into a separate class that handles automatic retries in the event that socrata times out (which happens pretty often)
- added a cleaning task that converts the closeddate values to null if they precede the createddate (fixing closed date is sometimes before created date #456)
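The dedup strategy described above can be demonstrated end-to-end with a tiny sqlite3 example (table and column names are illustrative; the real code runs against the project's database through sqlalchemy):

```python
# Demonstration of the dedup step: keep the first row per srnumber,
# using the temporary id column as the tiebreaker.
import sqlite3

def dedupe_by_srnumber(conn):
    """Delete every row whose id is not the minimum id for its
    srnumber, leaving exactly one row per srnumber."""
    conn.execute("""
        DELETE FROM ingest
        WHERE id NOT IN (
            SELECT MIN(id) FROM ingest GROUP BY srnumber
        )
    """)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ingest (id INTEGER PRIMARY KEY, srnumber TEXT)")
conn.executemany(
    "INSERT INTO ingest (srnumber) VALUES (?)",
    [("SR-1",), ("SR-2",), ("SR-1",)],  # duplicate srnumber across batches
)
dedupe_by_srnumber(conn)
rows = sorted(r[0] for r in conn.execute("SELECT srnumber FROM ingest"))
# rows is now ["SR-1", "SR-2"]
```

After this, srnumber is unique and can safely be promoted to the primary key.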
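The retrying client could be sketched along these lines (the wrapper class, retry counts, and the idea of passing in the fetch callable are all illustrative, not the project's actual implementation):

```python
# Sketch: wrap a socrata fetch callable in a small class that retries
# on timeouts, since socrata times out fairly often.
import time

class RetryingClient:
    def __init__(self, fetch, max_retries=3, backoff=1.0):
        self.fetch = fetch          # callable that may raise TimeoutError
        self.max_retries = max_retries
        self.backoff = backoff      # seconds to wait between attempts

    def get(self, *args, **kwargs):
        for attempt in range(self.max_retries):
            try:
                return self.fetch(*args, **kwargs)
            except TimeoutError:
                # re-raise only once we've exhausted all attempts
                if attempt == self.max_retries - 1:
                    raise
                time.sleep(self.backoff)
```

Isolating the retry logic in one class keeps the ingestion loop itself free of error handling.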
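The closeddate cleaning rule from #456 amounts to a simple comparison; a hedged sketch, assuming records are dicts with datetime fields (the actual code operates on the Ingest rows):

```python
# Sketch of the cleaning task: null out closeddate when it precedes
# createddate, since a request cannot close before it was created.
from datetime import datetime

def clean_closeddate(record):
    """Return the record with closeddate set to None if it is
    earlier than createddate; otherwise return it unchanged."""
    created = record.get("createddate")
    closed = record.get("closeddate")
    if created and closed and closed < created:
        record = {**record, "closeddate": None}
    return record
```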