
Enforce consistent data fields for current database #315

Open · NealHumphrey opened this issue Jun 6, 2017 · 7 comments

NealHumphrey (Collaborator) commented Jun 6, 2017

We want all data fields in our current database to have consistent names and data values whenever they represent the same thing in different tables. This especially applies to our zones (ward, zip, etc.), but any other field that means the same thing should behave the same way across all tables.

Check all tables of the current database, and edit the sql_name in meta.json and the relevant Cleaner in /ingestion/Cleaners.py as needed, so that when the database is rebuilt the column names and values come out consistent.
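For illustration, here's a minimal sketch of the kind of normalization a Cleaner might apply. This is an assumption-laden example: the class and method names are made up for this sketch, not the actual API in /ingestion/Cleaners.py.

```python
# Hypothetical sketch only -- the real base class and hooks live in
# /ingestion/Cleaners.py; the names here are illustrative assumptions.
class ExampleWardCleaner(object):
    """Normalize 'ward' values so every table stores them the same way."""

    def clean_row(self, row):
        ward = row.get('ward')
        if ward:
            # e.g. "WARD 8" or "ward 8" -> "Ward 8", per the conventions
            # in docs/resources/data-formatting.md
            row['ward'] = ward.strip().title()
        return row

print(ExampleWardCleaner().clean_row({'ward': 'WARD 8'}))  # {'ward': 'Ward 8'}
```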

This is the current list of conventions to use. If you find anything else we should start enforcing, choose a convention and add it to this list:
https://github.com/codefordc/housing-insights/blob/dev/docs/resources/data-formatting.md

In addition, please flag any columns that are named the same thing but mean something different. For example, if two tables both have a column called 'id', the name alone cannot distinguish the meaning; one should be renamed to something like 'crime_id' and the other to 'building_permit_id'.

NealHumphrey (Collaborator, Author) commented
@ilissablechwapo we talked about you taking this on last night and got you set up with rebuilding the database (!). But we should also coordinate with @emkap01, because I know he was looking into this a little at the hackathon - and I think we do still have outstanding issues. #314 is one example: some values are missing (and need to be added), while Ward values differ between the two (the newly added records from dchousing are all caps).

emkap01 (Collaborator) commented Jun 11, 2017

Here's a link to the Slack thread about the spreadsheet that explains what cleaner functions each dataset will need, and how to map the fields from each “raw” dataset to the corresponding fields in the ‘project’ table: https://codefordc.slack.com/archives/C5D0LRMFX/p1496794136294546.

emkap01 (Collaborator) commented Jul 1, 2017

I think an automated way to help us enforce consistent fields for the production db is probably long overdue. I lack the Python chops to pull it off myself in a timely fashion, but I took a shot at mapping out the underlying logic we would want to implement, along with an example of what the output table would look like.

That logic just builds on what we had previously attempted to implement manually via the Data Pipeline Status Tracker spreadsheet. It may need some more tweaks/refinement before we would start coding it up, but hopefully both the document and the example will provide a useful basis for discussion.

Theoretically, if we built a new 'pipeline status' table from a script like this, we could very easily build an 'admin view' dash for it that would provide an overall snapshot of where each dataset is in the pipeline, similar to what we've already hacked together in Tableau from the old spreadsheet. With a better, cleaner underlying table, that old dash could be completely reworked to be much more intuitive and useful.

Some of this probably relates directly to #342, especially points 2, 4, and 5.
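As a very rough sketch of that logic (the table names, expected fields, and status labels below are assumptions drawn from this thread, not an agreed schema):

```python
# Rough sketch of the 'pipeline status' logic described above. Table
# names, expected fields, and labels are assumptions from this thread.
EXPECTED_FIELDS = {
    'project': {'nlihc_id', 'ward', 'zip'},
    'subsidy': {'nlihc_id', 'ward'},
}

def pipeline_status(actual_fields_by_table):
    """Compare each table's actual fields to the expected conventions
    and emit one status row per table for an 'admin view' dashboard."""
    rows = []
    for table, expected in EXPECTED_FIELDS.items():
        actual = actual_fields_by_table.get(table, set())
        missing = expected - actual
        status = 'normalized' if not missing else 'needs_normalization'
        rows.append({'table': table, 'status': status,
                     'missing_fields': sorted(missing)})
    return rows

# Example: 'project' is missing 'zip', so it needs normalization.
print(pipeline_status({'project': {'nlihc_id', 'ward'},
                       'subsidy': {'nlihc_id', 'ward'}}))
```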

emkap01 self-assigned this Jul 9, 2017
emkap01 (Collaborator) commented Jul 9, 2017

@NealHumphrey and @jkwening - I updated the Data Pipeline spreadsheet in order to determine which datasets might still need a little cleanup/work.

You guys may already be aware of this, but FWIW I noticed that records in both the project and subsidy tables that are derived from the dchousing datasets have NLIHC values that don't appear to match the standard NLXXXXXX convention. Instead, those NLIHC values are formatted like this: 28f80dd5-df6e-4e85-a784-e35c1ef64230. Presumably we will need to figure out the cause of the discrepancy and fix it.

Separately (as you will see in the updated spreadsheet), for each table in our production db, I checked the values in each field against our established conventions and noted whether or not those values still need some work/cleanup. So anything labeled "normalized" is good to go, but anything labeled "needs_normalization" or "need_to_add_from_MAR" or "need_to_add_from_project" might require some fixing/updating/enhancing.
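To illustrate what such a per-field check could look like (the allowed-value set and labels here are assumptions modeled on the spreadsheet labels above):

```python
# Illustrative sketch of a per-field value check; the allowed-value set
# and labels are assumptions modeled on the spreadsheet labels above.
VALID_WARDS = {'Ward %d' % n for n in range(1, 9)}  # "Ward 1" .. "Ward 8"

def label_field(values, allowed=VALID_WARDS):
    """Return a spreadsheet-style label for a column's values."""
    bad = [v for v in values if v not in allowed]
    return 'normalized' if not bad else 'needs_normalization'

print(label_field(['Ward 1', 'WARD 2']))  # -> 'needs_normalization'
```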

NealHumphrey (Collaborator, Author) commented Jul 10, 2017

@emkap01 how does that spreadsheet overlap with the dashboard? Are those need_to_add_from_MAR etc. fields also available in the dashboard?

On the NLIHC_ID fields - that's intentional. NLIHC_ID is created by the Urban Institute for the project table, so we don't have control over the sequential numbering. We don't want to add on to the sequential numbers and accidentally create the same ID for two different projects (when the PresCat adds one and we add one). The long format is a universally unique identifier (UUID), which has enough randomly generated characters that you're statistically guaranteed never to generate the same string twice. That way we can use it without having control of the original source of the IDs.
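For reference, IDs in that long format can be generated with Python's standard uuid module (a version-4, i.e. randomly generated, UUID):

```python
import uuid

# Generate a version-4 (random) UUID in the same format as the
# dchousing-derived IDs, e.g. 28f80dd5-df6e-4e85-a784-e35c1ef64230.
new_id = str(uuid.uuid4())
print(new_id)
```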

emkap01 (Collaborator) commented Jul 11, 2017

@NealHumphrey the dashboard is based on the spreadsheet; however, I did not pull all of the fields from the spreadsheet into the dashboard.

I can take a crack at reworking the dash in order to pull more details into it (such as what specific actions need to be taken on each individual field in each dataset that still needs work).

NealHumphrey (Collaborator, Author) commented
@emkap01 ok. No need to add to the dashboard - I just wanted to know what was what, and I can refer back to the spreadsheet to get the info. I think the most effective approach from here is picking things off one data source / field at a time and writing up the tickets needed to get across the finish line - what you've done so far is enough to work from, from a big-picture standpoint, for now.
