
ingestion of all data #479

Merged · 7 commits into dev from BACK-Ingestion · Mar 31, 2020
Conversation

@jmensch1 (Contributor) commented Mar 29, 2020

The goal here was to make sure it was possible to get all of the Socrata data -- from 2015 through today -- into our own database. This code did that successfully, in a mere 74 minutes :)

I used sqlalchemy instead of pandas to create the table and insert all the data. This made it possible to eliminate several of the cleaning tasks we were doing previously, such as converting columns to datetime and making sure the pandas columns were in sync with the database model. It also makes the Ingest model the single source of truth for the structure of the table.
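The arrangement described above can be sketched roughly as follows. The model and column names echo the discussion (Ingest, srnumber, createddate, closeddate); the table name, database URL, and sample row are illustrative assumptions, not the project's actual code.

```python
# Sketch: the SQLAlchemy model both creates the table and receives the
# bulk inserts, so there is no separate pandas-to-database column sync.
from datetime import datetime
from sqlalchemy import Column, DateTime, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Ingest(Base):
    __tablename__ = "ingest_staging_table"   # assumed name
    id = Column(Integer, primary_key=True)   # temporary surrogate key (see below)
    srnumber = Column(String)
    createddate = Column(DateTime)
    closeddate = Column(DateTime)

engine = create_engine("sqlite://")          # stand-in for the real DB URL
Base.metadata.create_all(engine)             # the model defines the schema

# rows mapped straight from the Socrata JSON, no pandas in between
with Session(engine) as session:
    session.bulk_insert_mappings(Ingest, [
        {"srnumber": "1-0001",
         "createddate": datetime(2015, 1, 1),
         "closeddate": None},
    ])
    session.commit()
```

Because SQLAlchemy coerces values against the column types at insert time, the old "convert columns to datetime" cleaning step falls away.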

There was one tricky issue, discussed on Slack, regarding the quirky sorting coming from the Socrata API. I eliminated the sort parameter, as Russell suggested, but was still getting intermittent primary-key violations due to duplicate srnumbers in separate batches from Socrata. So what I did was this:

  1. create a temporary id column on the Ingest model to use as the primary key

  2. after all the data is ingested, deduplicate the table by srnumber, delete the temporary id column, and make srnumber the primary key. Not pretty, but it works.
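The two-step cleanup above can be sketched in SQL. This demo uses sqlite3 so it is self-contained; the project's actual database and exact statements are not shown in the PR, so table and column names here are assumptions. The primary-key swap itself is database-specific and shown only as a comment.

```python
# Demo of step 2: dedupe by srnumber using the temporary id, then (in the
# real database) drop id and promote srnumber to primary key.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ingest (id INTEGER PRIMARY KEY, srnumber TEXT)")
conn.executemany("INSERT INTO ingest (srnumber) VALUES (?)",
                 [("1-0001",), ("1-0002",), ("1-0001",)])  # duplicate srnumber

# Step 2a: keep only the row with the lowest temporary id per srnumber
conn.execute("""
    DELETE FROM ingest
    WHERE id NOT IN (SELECT MIN(id) FROM ingest GROUP BY srnumber)
""")
conn.commit()

# Step 2b (database-specific; e.g. in PostgreSQL):
#   ALTER TABLE ingest DROP COLUMN id;
#   ALTER TABLE ingest ADD PRIMARY KEY (srnumber);

rows = sorted(r[0] for r in conn.execute("SELECT srnumber FROM ingest"))
```

After step 2a each srnumber appears exactly once, so promoting it to primary key can no longer violate uniqueness.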

Other minor things:

  • I put the Socrata client into a separate class that handles automatic retries in the event that Socrata times out (which happens pretty often)

  • added a cleaning task that converts closeddate values to null if they precede the createddate (fixing "closed date is sometimes before created date" #456)
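The retry wrapper mentioned in the first bullet might look something like this. The real class, its name, and its backoff policy are not shown in the PR, so everything here is an illustrative sketch.

```python
# Hypothetical Socrata client wrapper: retries the request a few times
# when the upstream API times out, instead of failing the whole ingest.
import time

class SocrataClient:
    def __init__(self, fetch, max_attempts=3, backoff=1.0):
        self.fetch = fetch                  # callable performing one request
        self.max_attempts = max_attempts
        self.backoff = backoff

    def get(self, **params):
        for attempt in range(1, self.max_attempts + 1):
            try:
                return self.fetch(**params)
            except TimeoutError:
                if attempt == self.max_attempts:
                    raise                   # give up after the last attempt
                time.sleep(self.backoff * attempt)  # simple linear backoff

# demo: a fetcher that times out twice, then succeeds
calls = {"n": 0}
def flaky_fetch(**params):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError
    return [{"srnumber": "1-0001"}]

client = SocrataClient(flaky_fetch, backoff=0.0)
result = client.get(limit=50000)
```

Isolating the retry logic in one class keeps the ingestion service itself free of timeout handling.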

@jmensch1 jmensch1 linked an issue Mar 29, 2020 that may be closed by this pull request
@jmensch1 jmensch1 requested review from sellnat77 and ryanmswan March 29, 2020 23:58
async def populateDatabase(self,
                           years=range(2015, 2021),
                           limit=2000000,
                           querySize=50000):
A project member commented:
We might want to move these default numbers to config; that way there's a single source of truth.

jmensch1 (author) replied:

I put the defaults in the config, but I consumed the defaults in app.py instead of here because those default values were also appearing in app.py. So the defaults get applied immediately if the query params are missing, and then passed into this service.
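The arrangement described in this reply can be sketched as follows. The config keys and handler name are hypothetical; the point is only that app.py applies the config defaults when query params are missing, and the service receives concrete values.

```python
# Illustrative sketch: defaults live in config, app.py fills in missing
# query params, and populateDatabase never needs its own defaults.
CONFIG = {
    "YEARS": list(range(2015, 2021)),
    "LIMIT": 2000000,
    "QUERY_SIZE": 50000,
}

def ingest_handler(query_params):
    # app.py-level defaulting: fall back to config when a param is absent
    years = query_params.get("years", CONFIG["YEARS"])
    limit = int(query_params.get("limit", CONFIG["LIMIT"]))
    query_size = int(query_params.get("querySize", CONFIG["QUERY_SIZE"]))
    return {"years": years, "limit": limit, "querySize": query_size}
```

With this shape, changing a default in one place changes it for both the HTTP layer and the service.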

@jmensch1 (author) commented:

Fixed that stuff. I also changed /ingest to a GET request, because we're using query params for it instead of the body. Makes sense, yeah?
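The GET-with-query-params change boils down to parsing the query string rather than a request body. A framework-agnostic stdlib sketch (the project's actual web framework and route code are not shown here):

```python
# Parse /ingest query params the way a GET handler would see them.
from urllib.parse import parse_qs, urlparse

def parse_ingest_request(url):
    qs = parse_qs(urlparse(url).query)
    # parse_qs returns lists; take the first value for each param
    return {key: values[0] for key, values in qs.items()}

params = parse_ingest_request("/ingest?limit=2000000&querySize=50000")
```

Missing params simply won't appear in the dict, which is where the config defaults from the earlier review thread come in.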

@sellnat77 sellnat77 merged commit 9841d71 into dev Mar 31, 2020
@sellnat77 sellnat77 deleted the BACK-Ingestion branch March 31, 2020 16:07
Successfully merging this pull request may close these issues.

closed date is sometimes before created date