Adds active_subsidized_units & percent_units_subsidized to zone_facts. #574

Open
wants to merge 2 commits into base: dev

Collaborator

@pfjel7 pfjel7 commented Sep 20, 2017

This revision gets the job done. Eventually, once we call _get_residential_units from within _populate_zone_facts_table, the one-time query used here to filter and sum the proj_units_assist_max column can be generalized into its own method. At that point we can also trim back the res_units_by_zone argument that is currently passed to subsequent methods. For now, this one-time wrangling produces the results we want.
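As a rough illustration of what that generalized method might look like, here is a minimal sketch in plain Python. Only proj_units_assist_max comes from this PR; the function name, the `ward` and `status` fields, and the sample rows are assumptions for illustration, not the project's actual schema or code.

```python
from collections import defaultdict


def sum_assisted_units_by_zone(projects, zone_field, is_active=lambda row: True):
    """Sum proj_units_assist_max per zone, skipping rows that fail the filter.

    `projects` is a list of dict rows; `zone_field` names the column that
    holds the zone (a hypothetical 'ward' column in the example below).
    """
    totals = defaultdict(int)
    for row in projects:
        units = row.get("proj_units_assist_max")
        zone = row.get(zone_field)
        # Rows with no zone, no unit count, or an inactive status are ignored.
        if zone is None or units is None or not is_active(row):
            continue
        totals[zone] += units
    return dict(totals)


# Made-up rows; the 'status' values are assumed, not taken from the project table.
projects = [
    {"proj_units_assist_max": 50, "ward": "Ward 8", "status": "Active"},
    {"proj_units_assist_max": 30, "ward": "Ward 8", "status": "Inactive"},
    {"proj_units_assist_max": None, "ward": "Ward 8", "status": "Active"},
    {"proj_units_assist_max": 20, "ward": "Ward 7", "status": "Active"},
]
active_units = sum_assisted_units_by_zone(
    projects, "ward", is_active=lambda row: row["status"] == "Active"
)
```

Passing the zone column and the filter as arguments is one way to let the same helper serve wards, clusters and census tracts without repeating the query.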

That said, this also brings some data integrity problems to light. For example, in neighborhood cluster 38, 104% of the residential units are counted as actively "subsidized." Something clearly isn't being counted correctly, and we need to find out why. More dramatically, in census tract 11001006804, 106 units are counted as subsidized out of, yes, 13 residential units. So some sleuthing is needed to see whether residential units (from the mar table) are being under-counted or subsidized units (from the project table) are being over-counted.
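A quick sanity check along these lines could flag every zone where the subsidized count exceeds the residential count. This is only a sketch: the function names and the second zone are made up, and only the 106-of-13 figures for tract 11001006804 come from the numbers above.

```python
def percent_units_subsidized(subsidized_units, residential_units):
    """Return the subsidized share as a fraction, or None if there is no denominator."""
    if not residential_units:
        return None
    return subsidized_units / residential_units


def flag_suspect_zones(zone_counts):
    """Return {zone: share} for zones where more units are subsidized than exist."""
    suspects = {}
    for zone, (subsidized, residential) in zone_counts.items():
        share = percent_units_subsidized(subsidized, residential)
        if share is not None and share > 1.0:
            suspects[zone] = share
    return suspects


zone_counts = {
    "11001006804": (106, 13),         # figures quoted in this comment
    "hypothetical_tract": (40, 200),  # made-up zone for contrast
}
suspects = flag_suspect_zones(zone_counts)
```

Any zone that shows up in `suspects` points to either an under-counted denominator or an over-counted numerator, which is the question raised above.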

@NealHumphrey
Collaborator

I'm putting review of this on hold until after our Friday demo to DHCD, since there are some data integrity problems and I want to focus on making sure everything works for the demo. I'll pick it up afterward, likely on Monday.

@NealHumphrey
Collaborator

@pfjel7 I think the biggest source of error might be double counting of projects when we can't properly de-duplicate the records from our various sources. Here's a good example of several projects that are likely being double counted:
[image]

@NealHumphrey
Collaborator

To find those records, you can run this query:
SELECT * FROM project WHERE proj_name ILIKE '%elvan%' OR proj_addre ILIKE '%elvan%';

Two of them come from the preservation catalog ("People's Cooperative" and "Elvans Road"), so those are likely to be two separate projects. However, there are two additional records that appear to match each of those, one from the DHCD database and one from the DCHousing dataset. Only one of those has proj_units_assist_max. This is just one example, though. Our current de-dupe method relies on being able to parse the addresses, but here it looks like slightly different addresses are throwing things off.
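To illustrate how small address differences can defeat matching, here is a minimal sketch that groups records by a normalized address key. It is not our actual de-dupe code: the addresses, source labels, suffix table and function names are all made up for illustration, and only the proj_addre column name comes from the project table.

```python
import re

# Hypothetical abbreviation table; a real one would need many more entries.
SUFFIXES = {"road": "rd", "street": "st", "avenue": "ave"}


def address_key(address):
    """Lowercase, strip punctuation, and abbreviate common street suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", address.lower())
    tokens = [SUFFIXES.get(token, token) for token in cleaned.split()]
    return " ".join(tokens)


def group_by_address(records):
    """Return {address_key: [sources]} for keys shared by more than one record."""
    groups = {}
    for record in records:
        key = address_key(record["proj_addre"])
        groups.setdefault(key, []).append(record["source"])
    return {key: sources for key, sources in groups.items() if len(sources) > 1}


# Made-up records; the street numbers are not the real ones.
records = [
    {"source": "preservation_catalog", "proj_addre": "2400 Elvans Road SE"},
    {"source": "dhcd", "proj_addre": "2400 Elvans Rd., SE"},
    {"source": "dchousing", "proj_addre": "2401 Elvans Road SE"},
]
likely_duplicates = group_by_address(records)
```

In this sketch the first two records collapse to the same key, while the third slips through because its street number differs, which is the kind of near miss that address parsing alone cannot resolve.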

@pfjel7
Collaborator Author

pfjel7 commented Sep 21, 2017 via email

@NealHumphrey
Collaborator

@pfjel7 FYI, I made another issue that will help us troubleshoot the numerator portion of this: #577. In addition, @jkwening added a logger to our deduplication code that gives us more detail about how the deduplication of project records was performed, so that we can troubleshoot the ones that are slipping through. Finally, I'm hoping @ptatian can take a look at our data with both of these tools to help us identify duplicates, since he has the most familiarity with the preservation catalog data set. He may know offhand whether some of the projects that are supposedly unique to DHCD are actually already in the preservation catalog. We can use this to improve our dedupe logic, to create a manual override column during dedupe that uses a list of known pairs, and/or to send corrections to DHCD for the addresses stored in the database (our main way of identifying duplicated projects).
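For the manual override idea, one possible shape is a lookup of known duplicate pairs applied before counting. The sketch below is an assumption about how that could work, not existing code: the project IDs, the `overrides` mapping and both function names are hypothetical.

```python
def resolve_canonical(project_id, overrides):
    """Follow duplicate -> canonical links until reaching an ID with no override."""
    seen = set()
    # The `seen` set guards against accidental cycles in the override list.
    while project_id in overrides and project_id not in seen:
        seen.add(project_id)
        project_id = overrides[project_id]
    return project_id


def dedupe_with_overrides(project_ids, overrides):
    """Return canonical IDs in first-seen order, with known duplicates collapsed."""
    canonical = []
    for project_id in project_ids:
        target = resolve_canonical(project_id, overrides)
        if target not in canonical:
            canonical.append(target)
    return canonical


# Hypothetical known pairs: each key is a duplicate of the record it maps to.
overrides = {"DHCD-0001": "PC-0001", "DCH-0001": "DHCD-0001"}
project_ids = ["PC-0001", "DHCD-0001", "DCH-0001", "PC-0002"]
unique_ids = dedupe_with_overrides(project_ids, overrides)
```

Keeping the known pairs in a separate list means corrections from someone familiar with the preservation catalog can be applied without touching the address-parsing logic.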
