
Enforce consistent data fields for current database #315

Open · NealHumphrey opened this issue Jun 6, 2017 · 7 comments

NealHumphrey (Collaborator) commented Jun 6, 2017

We want all data fields in our current database to have consistent names and data values whenever they represent the same thing in different tables. This especially applies to our zones (ward, zip, etc.), but any other field that means the same thing should behave the same way across all tables.

Check all tables of the current database, and edit the sql_name in meta.json and the relevant Cleaner in /ingestion/Cleaners.py as needed, so that when the database is rebuilt the column names and values come out consistent.
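For illustration, here's a minimal sketch of the kind of normalization a Cleaner might apply. This is an assumption-laden example: the class and method names are made up for this sketch, not the actual API in /ingestion/Cleaners.py.

```python
# Hypothetical sketch only -- the real base class and hooks live in
# /ingestion/Cleaners.py; the names here are illustrative assumptions.
class ExampleWardCleaner(object):
    """Normalize 'ward' values so every table stores them the same way."""

    def clean_row(self, row):
        ward = row.get('ward')
        if ward:
            # e.g. "WARD 8" or "ward 8" -> "Ward 8", per the conventions
            # in docs/resources/data-formatting.md
            row['ward'] = ward.strip().title()
        return row

print(ExampleWardCleaner().clean_row({'ward': 'WARD 8'}))  # {'ward': 'Ward 8'}
```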

This is the current list of conventions to use. If you find anything else we should start enforcing, choose a convention and add it to this list:
https://github.com/codefordc/housing-insights/blob/dev/docs/resources/data-formatting.md

In addition, please flag any columns that are named the same thing but mean something different. For example, if two tables both have a column called 'id', the name alone cannot distinguish the meaning; one should be renamed to something like 'crime_id' and the other to 'building_permit_id'.

NealHumphrey (Collaborator, Author) commented
@ilissablechwapo we talked about you taking this on last night and got you set up with rebuilding the database (!). But we should also coordinate with @emkap01, because I know he was looking into this a little at the hackathon - and I think we do still have outstanding issues. #314 is one example: some values are missing (and need to be added), while Ward values differ between the two (the newly added records from dchousing are all caps).

emkap01 (Collaborator) commented Jun 11, 2017

Here's a link to the Slack thread about the spreadsheet that explains what cleaner functions each dataset will need, and how to map the fields from each “raw” dataset to the corresponding fields in the ‘project’ table: https://codefordc.slack.com/archives/C5D0LRMFX/p1496794136294546.

emkap01 (Collaborator) commented Jul 1, 2017

I think an automated way to help us enforce consistent fields for the production db is probably long overdue. I lack the Python chops to pull it off myself in a timely fashion, but I took a shot at mapping out the underlying logic we would want to implement, along with an example of what the output table would look like.

That logic just builds on what we had previously attempted to implement manually via the Data Pipeline Status Tracker spreadsheet. It may need some more tweaks/refinement before we would start coding it up, but hopefully both the document and the example will provide a useful basis for discussion.

Theoretically, if we built a new 'pipeline status' table from a script like this, we could very easily build an 'admin view' dash for it that would provide an overall snapshot of where each dataset is in the pipeline, similar to what we've already hacked together in Tableau from the old spreadsheet. With a better, cleaner underlying table, that old dash could be completely reworked to be much more intuitive and useful.

Some of this probably relates directly to #342, especially points 2, 4, and 5.
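As a very rough sketch of that logic (the table names, expected fields, and status labels below are assumptions drawn from this thread, not an agreed schema):

```python
# Rough sketch of the 'pipeline status' logic described above. Table
# names, expected fields, and labels are assumptions from this thread.
EXPECTED_FIELDS = {
    'project': {'nlihc_id', 'ward', 'zip'},
    'subsidy': {'nlihc_id', 'ward'},
}

def pipeline_status(actual_fields_by_table):
    """Compare each table's actual fields to the expected conventions
    and emit one status row per table for an 'admin view' dashboard."""
    rows = []
    for table, expected in EXPECTED_FIELDS.items():
        actual = actual_fields_by_table.get(table, set())
        missing = expected - actual
        status = 'normalized' if not missing else 'needs_normalization'
        rows.append({'table': table, 'status': status,
                     'missing_fields': sorted(missing)})
    return rows

# Example: 'project' is missing 'zip', so it needs normalization.
print(pipeline_status({'project': {'nlihc_id', 'ward'},
                       'subsidy': {'nlihc_id', 'ward'}}))
```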

emkap01 self-assigned this Jul 9, 2017
emkap01 (Collaborator) commented Jul 9, 2017

@NealHumphrey and @jkwening - I updated the Data Pipeline spreadsheet in order to determine which datasets might still need a little cleanup/work.

You guys may already be aware of this, but FWIW I noticed that records in both the project and subsidy tables that are derived from the dchousing datasets have NLIHC values that don't appear to match the standard NLXXXXXX convention. Instead, those NLIHC values are formatted like this: 28f80dd5-df6e-4e85-a784-e35c1ef64230. Presumably we will need to figure out the cause of the discrepancy and fix it.

Separately (as you will see in the updated spreadsheet), for each table in our production db, I checked the values in each field against our established conventions and noted whether or not those values still need some work/cleanup. So anything labeled "normalized" is good to go, but anything labeled "needs_normalization" or "need_to_add_from_MAR" or "need_to_add_from_project" might require some fixing/updating/enhancing.
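To illustrate what such a per-field check could look like (the allowed-value set and labels here are assumptions modeled on the spreadsheet labels above):

```python
# Illustrative sketch of a per-field value check; the allowed-value set
# and labels are assumptions modeled on the spreadsheet labels above.
VALID_WARDS = {'Ward %d' % n for n in range(1, 9)}  # "Ward 1" .. "Ward 8"

def label_field(values, allowed=VALID_WARDS):
    """Return a spreadsheet-style label for a column's values."""
    bad = [v for v in values if v not in allowed]
    return 'normalized' if not bad else 'needs_normalization'

print(label_field(['Ward 1', 'WARD 2']))  # -> 'needs_normalization'
```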

NealHumphrey (Collaborator, Author) commented Jul 10, 2017

@emkap01 how does that spreadsheet overlap with the dashboard? Are those need_to_add_from_MAR etc. fields also available in the dashboard?

On the NLIHC_ID fields - that's intentional. NLIHC_ID is created by the Urban Institute for the project table, so we don't have control over the sequential numbering. We don't want to add on to the sequential numbers and accidentally create the same ID for two different projects (when the PresCat adds one and we add one). The long format is a universally unique identifier (UUID), which has enough randomly generated characters that you're statistically guaranteed never to generate the same string twice. That way we can use it without having control of the original source of the IDs.
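For reference, IDs in that long format can be generated with Python's standard uuid module (a version-4, i.e. randomly generated, UUID):

```python
import uuid

# Generate a version-4 (random) UUID in the same format as the
# dchousing-derived IDs, e.g. 28f80dd5-df6e-4e85-a784-e35c1ef64230.
new_id = str(uuid.uuid4())
print(new_id)
```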

emkap01 (Collaborator) commented Jul 11, 2017

@NealHumphrey the dashboard is based on the spreadsheet; however, I did not pull all of the fields from the spreadsheet into the dashboard.

I can take a crack at reworking the dash in order to pull more details into it (such as what specific actions need to be taken on each individual field in each dataset that still needs work).

NealHumphrey (Collaborator, Author) commented
@emkap01 ok. No need to add to the dashboard - I just wanted to know what was what, and I can refer back to the spreadsheet to get the info. I think the most effective approach from here is picking things off one data source / field at a time and writing up the tickets needed to get across the finish line - what you've done so far is enough to work from, from a big-picture standpoint, for now.
