Imports create duplicates if existing data is archived #59
Comments
Upon closer inspection, duplicates get created because the source data does not match exactly across multiple tests (the state is abbreviated in one source and spelled out in others, shelter names are not quite identical, etc.). We would need to implement some sort of fuzzy matching, or integrate a better deduplicating tool (maybe dedupe.io, as Chris Whitaker suggests). Would a solution be checking that coordinates are within a certain delta of each other, say a dozen yards or so (around +/- 0.0002 degrees of precision)? What if there are two shelters side by side (a general shelter might partner with a medical or accessible location; does this happen?), one is present in the data and the other is not? Is the accuracy good enough? For our purposes currently, this should be close enough until we wrap up the conversation on source data.
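For illustration, a coordinate-delta check might look roughly like the sketch below. This is only a sketch, not the project's import code: the `Shelter` model, `latitude`/`longitude` column names, and the `near_duplicate?` helper are assumptions, and 0.0002 degrees is just the tolerance floated above.

```ruby
# Rough sketch of a coordinate-delta duplicate check (~0.0002 degrees,
# roughly a couple dozen yards). Model and column names are assumptions.
COORD_DELTA = 0.0002

def near_duplicate?(lat, lng)
  # unscoped bypasses the default_scope so archived records are also checked
  Shelter.unscoped.exists?(
    latitude:  (lat - COORD_DELTA)..(lat + COORD_DELTA),
    longitude: (lng - COORD_DELTA)..(lng + COORD_DELTA)
  )
end
```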
100% agreed.
…itude, and reorders the tests by precision, starting with exact address, then moving to coordinates, and finally using the source field Closes hurricane-response#59
This is somehow related to the default_scope hiding the archived records from the deduplication function, but I'm not sure how, since the deduplication function unscopes the default_scope.
Ideal solution: if a record exists and is archived, unarchive it instead of inserting a duplicate.
(also, the duplication counter in the third query (source) is not guaranteed to be unique).
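A minimal sketch of the unarchive-instead-of-duplicate idea, assuming an `archived` boolean column, an `address` column to match on, and a hypothetical `import_shelter` helper (none of these names are confirmed by the issue):

```ruby
# Rough sketch, not the project's actual import code.
# Shelter.unscoped bypasses the default_scope that hides archived rows,
# so an archived match is found and revived rather than duplicated.
def import_shelter(attrs)
  existing = Shelter.unscoped.find_by(address: attrs[:address])
  if existing
    # Revive and refresh the archived record instead of inserting a duplicate.
    existing.update(attrs.merge(archived: false))
  else
    Shelter.create(attrs)
  end
end
```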