Imports create duplicates if existing data is archived #59

Closed
omnilord opened this issue Oct 12, 2018 · 2 comments

Comments

@omnilord
Collaborator

This appears to be related to the default_scope hiding archived records from the deduplication function, but I'm not sure how, since the deduplication function unscopes the default_scope.

Ideal solution: if a record exists and is archived, unarchive it instead of inserting a duplicate.
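A minimal sketch of that unarchive-instead-of-insert behaviour, in plain Python rather than the project's own code. The `Shelter` class, the `import_shelter` function, and matching on an exact address are all assumptions made for illustration; the real import would look records up through the model with the default scope removed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Shelter:
    # Hypothetical record shape, not the project's actual schema.
    name: str
    address: str
    archived: bool = False


def import_shelter(existing: List[Shelter], incoming: Shelter) -> str:
    """Insert `incoming` unless a matching record exists, archived or not."""
    # Search every record, including archived ones, so an archived match
    # is found instead of being hidden from the duplicate check.
    for record in existing:
        if record.address == incoming.address:
            if record.archived:
                record.archived = False  # revive rather than duplicate
                return "unarchived"
            return "skipped"
    existing.append(incoming)
    return "inserted"
```

The key point is that the lookup must see archived rows; if it does not, the import falls through to the insert branch and creates the duplicate described here.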

(Also, the duplication counter in the third query (source) is not guaranteed to be unique.)

@omnilord omnilord added the bug Something isn't working label Oct 12, 2018
@omnilord
Collaborator Author

Upon closer inspection, duplicates get created because differing source data does not match exactly on multiple tests (state is abbreviated in one source and spelled out in others, shelter names are not quite identical, etc.). We would need to implement some sort of fuzzy matching, or integrate with a better deduplication tool (maybe dedupe.io, as Chris Whitaker suggests).
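To illustrate the kind of normalization and fuzzy matching meant here, a rough sketch using only Python's standard library. The abbreviation table, the function names, and the 0.85 threshold are assumptions chosen for the example, not values from the project, and a dedicated tool would handle far more cases.

```python
from difflib import SequenceMatcher

# Hypothetical, deliberately incomplete lookup table for the example.
STATE_ABBREVIATIONS = {
    "north carolina": "NC",
    "south carolina": "SC",
    "virginia": "VA",
}


def normalize_state(state: str) -> str:
    """Map a spelled-out state name to its abbreviation; uppercase anything else."""
    cleaned = state.strip()
    return STATE_ABBREVIATIONS.get(cleaned.lower(), cleaned.upper())


def names_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two shelter names as the same if their similarity ratio is high enough."""
    ratio = SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()
    return ratio >= threshold
```

Normalizing before comparing handles the "NC" versus "North Carolina" case exactly, while the similarity ratio absorbs small spelling differences in names. The threshold would need tuning against real source data.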

Would a solution be to check that coordinates are within a certain delta of each other, say a dozen yards or so (around +/- 0.0002 degrees)? What if there are two shelters side by side (might a general shelter partner with a medical or accessible location? Does this happen?), where one is present in the data and the other is not? Is the accuracy good enough? For our purposes currently, this should be close enough until we wrap up the conversation on source data.
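A sketch of the coordinate check, assuming a simple bounding-box comparison in decimal degrees; the function name and signature are hypothetical. For scale, one degree of latitude is roughly 111 km, so 0.0002 degrees is about 22 m (around 24 yards), and the same delta in longitude covers less ground the farther the location is from the equator.

```python
def within_delta(lat1: float, lon1: float,
                 lat2: float, lon2: float,
                 delta: float = 0.0002) -> bool:
    """Return True when both coordinates differ by no more than `delta` degrees."""
    # Box test, not a true distance: cheap, and easy to express as a
    # BETWEEN range in a database query.
    return abs(lat1 - lat2) <= delta and abs(lon1 - lon2) <= delta


# Two points about 0.0001 degrees apart are treated as the same shelter.
same = within_delta(35.7796, -78.6382, 35.7797, -78.6383)

# A point 0.001 degrees north falls outside the delta.
different = within_delta(35.7796, -78.6382, 35.7806, -78.6382)
```

This does not answer the side-by-side shelter question: two genuinely distinct shelters inside the same box would still be merged, so the check is only as safe as that assumption.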

@miklb
Contributor

miklb commented Oct 13, 2018

> For our purposes currently, this should be close enough until we wrap up the conversation on source data.

100% agreed.

@omnilord omnilord self-assigned this Oct 13, 2018
omnilord added a commit to omnilord/florence-api that referenced this issue Oct 13, 2018
…itude, and reorders the tests by precision, starting with exact address, then moving to coordinates, and finally using the source field

Closes hurricane-response#59
omnilord added a commit to omnilord/florence-api that referenced this issue Oct 15, 2018
…itude, and reorders the tests by precision, starting with exact address, then moving to coordinates, and finally using the source field

Closes hurricane-response#59
@miklb miklb closed this as completed in #64 Oct 15, 2018