Adds active_subsidized_units & percent_units_subsidized to zone_facts. #574
base: dev
I'm putting review of this on hold until after our Friday demo to DHCD, since there are some data integrity problems and I want to focus on making sure everything is working for the demo. I will likely pick it up afterward, on Monday.
@pfjel7 I think the biggest source of error might be double counting of projects when we can't appropriately de-duplicate the records from our various sources. Here's a good example of several project records that are likely being double counted:
To find those records you can run this query:

Two of them come from the preservation catalog ("People's Cooperative" and "Elvans Road"), so those are likely to be two separate projects. However, there are two additional records that appear to match each of those, one from the DHCD database and one from the DCHousing dataset. Only one of those has proj_units_assist_max. This is just one example, though. Our current de-dupe method relies on being able to parse the addresses, but here it looks like slightly different addresses are throwing things off.
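To illustrate how slightly different address strings can defeat an address-based match, here is a minimal sketch of normalizing addresses before comparing them. It is not the project's actual de-dupe code: the function names, the abbreviation table, and the record fields (`name`, `address`) are all assumptions made for illustration, and the sample addresses are made up.

```python
import re

# Hypothetical abbreviation table; a real one would need many more entries.
_ABBREVIATIONS = {
    "street": "st", "road": "rd", "avenue": "ave", "place": "pl",
    "northwest": "nw", "northeast": "ne", "southwest": "sw", "southeast": "se",
}


def normalize_address(raw):
    """Return a lowercase, punctuation-free, abbreviated form of an address."""
    text = raw.lower().replace(".", "")      # "S.E." -> "se"
    text = re.sub(r"[^\w\s]", " ", text)     # other punctuation becomes a space
    tokens = [_ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)


def group_possible_duplicates(records):
    """Group record names by normalized address; keep groups with 2+ records."""
    groups = {}
    for rec in records:
        groups.setdefault(normalize_address(rec["address"]), []).append(rec["name"])
    return {key: names for key, names in groups.items() if len(names) > 1}
```

With this sketch, "100 Example Road, SE" and "100 EXAMPLE RD. S.E." both normalize to the same key, so records carrying either spelling would land in the same group for review. It would not catch addresses that differ in the house number or unit, which may be part of what is happening here.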
That makes sense. I will follow up on your suggestions for how we might check the validity of the data. I will take up that task next week. Will be in touch.
Per
(Sent by email in reply to Neal Humphrey's message of Wednesday, September 20, 2017, 10:21 PM.)
@pfjel7 FYI, I made another issue that will help us troubleshoot the numerator portion of this: #577. In addition, @jkwening added a logger to our deduplication code that gives us additional data about how the deduplication of project records was performed, so that we can troubleshoot the ones that are slipping through. Finally, I'm hoping @ptatian can take a look at our data with both of these tools to help us identify duplicates, since he has the most familiarity with the preservation catalog data set. He may know offhand whether some of the records that are supposedly unique to DHCD are actually already in the preservation catalog. We can use this to improve our dedupe logic, to create a manual override column during dedupe that uses a list of known pairs, and/or to send corrections to DHCD for the addresses they have stored in the database (our main way of identifying duplicated projects).
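As a rough sketch of what the manual override idea could look like, the snippet below collapses records using a hand-maintained list of known duplicate pairs. Everything here other than proj_units_assist_max is an assumption for illustration: the function name, the `id` field, and the pair format (duplicate id, canonical id) are hypothetical and not taken from the existing dedupe code.

```python
def collapse_known_duplicates(records, known_pairs):
    """Merge records listed in known_pairs into their canonical record.

    known_pairs is a list of (duplicate_id, canonical_id) tuples maintained
    by hand. When merging, the first non-missing proj_units_assist_max wins.
    """
    canonical_of = {dup: canon for dup, canon in known_pairs}
    merged = {}
    for rec in records:
        key = canonical_of.get(rec["id"], rec["id"])
        current = merged.get(key)
        if current is None:
            # First record seen for this project; store a copy under the canonical id.
            merged[key] = dict(rec, id=key)
        elif current.get("proj_units_assist_max") is None:
            # Fill in the unit count if the record kept so far lacks one.
            current["proj_units_assist_max"] = rec.get("proj_units_assist_max")
    return list(merged.values())
```

Running the override before the address-based pass would mean a known pair is only counted once even when its addresses never match, which is the case where only one of the matching records carries proj_units_assist_max.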
This revision gets the job done. Eventually, once we call _get_residential_units from within _populate_zone_facts_table, the one-time query used here to filter and sum the proj_units_assist_max column can be generalized into its own separate method. Then we can also trim back the res_units_by_zone argument presently passed to subsequent methods. For now, however, this one-time wrangling produces the results we want.
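For reference, a generalized version of that query might look roughly like the sketch below. It is only an illustration under stated assumptions: it uses the standard-library sqlite3 module rather than the project's actual database layer, and the method name, the `status = 'Active'` filter, and the zone column names are hypothetical. Only the project table and proj_units_assist_max come from this PR.

```python
import sqlite3

# Assumed zone columns; interpolating only whitelisted names avoids SQL injection.
ALLOWED_ZONE_COLUMNS = {"ward", "neighborhood_cluster", "census_tract"}


def get_subsidized_units_by_zone(conn, zone_column):
    """Return {zone: summed proj_units_assist_max} for active projects."""
    if zone_column not in ALLOWED_ZONE_COLUMNS:
        raise ValueError("unexpected zone column: {}".format(zone_column))
    query = (
        "SELECT {zone}, SUM(proj_units_assist_max) "
        "FROM project "
        "WHERE status = 'Active' AND {zone} IS NOT NULL "
        "GROUP BY {zone}"
    ).format(zone=zone_column)
    # SUM() yields NULL when every value in a group is NULL; report that as 0.
    return {zone: total or 0 for zone, total in conn.execute(query)}
```

A method along these lines could be called once per zone type, which is what would let the res_units_by_zone argument be dropped from the later methods.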
That said, some data integrity problems are also brought to light. For example, in neighborhood cluster 38, 104% of the residential units are counted as actively "subsidized." Something clearly isn't being counted right, and we need to find out why. More dramatically, in census tract 11001006804, 106 out of, yes, 13 units are counted as subsidized. So some sleuthing is needed to see whether residential units (from the mar table) are being under-counted or subsidized units (from the project table) are being over-counted.
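One way to surface every such zone at once is a simple sanity check that flags any zone where the subsidized count exceeds the residential count. This is a sketch only: the function name, the `zone` key, and the `residential_units` key are assumptions for illustration, while active_subsidized_units is the column this PR adds.

```python
def find_implausible_zones(zone_rows):
    """Return (zone, subsidized, residential, percent) for zones over 100%."""
    flagged = []
    for row in zone_rows:
        subsidized = row["active_subsidized_units"]
        residential = row["residential_units"]
        if subsidized > residential:
            # Percent is undefined when no residential units were counted.
            percent = 100.0 * subsidized / residential if residential else None
            flagged.append((row["zone"], subsidized, residential, percent))
    return flagged
```

For the census tract above, 106 subsidized units against 13 residential units works out to roughly 815%, so it would be flagged along with cluster 38.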