Filter duplicates #481
@@ -44,8 +45,15 @@ class MapFromLbe(private val logger: Logger) : PipelineStep<List<LbeAcceptingSto
        return if (int == ALTERNATIVE_MISCELLANEOUS_CATEGORY) MISCELLANEOUS_CATEGORY else int
    }

    private fun String.decodeSpecialCharacters(): String {
Moved from old step Encode
It might be possible to find more duplicates, at the cost of also removing false positives (e.g. by grouping by coordinates and fuzzy-matching names with Levenshtein distance). I tried it, but it does not work well in my opinion: there are edge cases where names are very similar but not identical. I therefore went with exact matching on name + postal code + street; because of #467 this works well.
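A minimal sketch of the exact-matching idea, not the code in this PR. The `Store` class, its field names, and `filterDuplicates` are hypothetical stand-ins for the real model; the sketch assumes names, postal codes, and streets are already normalized, so plain equality is enough.

```kotlin
// Hypothetical model: the field names are assumptions, not the project's actual classes.
data class Store(
    val name: String,
    val postalCode: String,
    val street: String,
    val telephone: String? = null,
)

// Groups stores by the exact (name, postal code, street) key.
// The first entry of each group is kept; the rest are returned as dropped duplicates.
fun filterDuplicates(stores: List<Store>): Pair<List<Store>, List<Store>> {
    val groups = stores.groupBy { Triple(it.name, it.postalCode, it.street) }
    val kept = groups.values.map { it.first() }
    val dropped = groups.values.flatMap { it.drop(1) }
    return kept to dropped
}

fun main() {
    val stores = listOf(
        Store("Bakery", "86150", "Main St 1", "0821 123"),
        Store("Bakery", "86150", "Main St 1", "0821123"), // same key, differs only in phone
        Store("Bakery", "86152", "Other St 2"),           // same name, different address
    )
    val (kept, dropped) = filterDuplicates(stores)
    println("kept=${kept.size}, dropped=${dropped.size}")
}
```

Keeping the first entry of each group is an arbitrary choice in this sketch; which entry should win is a separate question.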
Looks good!
It would be nice to also log specifics about each entry when we mark it as "duplicate", maybe log the complete objects.
I wonder why there are so many duplicates. Do you have an idea? Are the entries in those cases mostly identical, or do they differ slightly here and there?
Will improve the logging (and perhaps log only the differing attributes?). For one, lots of stores appear twice: once from the LK and once from the eak platform (not sure what it's called?). Also, most of the time there are slight differences, such as whitespace in phone numbers, a slightly adjusted discount, or websites without a protocol.
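A rough sketch of what logging only the differing attributes could look like. It assumes each entry is available as a map from attribute name to value; that representation and the name `differingAttributes` are hypothetical, chosen to keep the example free of reflection.

```kotlin
// Hypothetical representation: an entry is a map of attribute name to (nullable) value.
// Returns, for every attribute whose values differ, the pair (kept value, duplicate value).
fun differingAttributes(
    kept: Map<String, String?>,
    duplicate: Map<String, String?>,
): Map<String, Pair<String?, String?>> =
    (kept.keys + duplicate.keys)
        .filter { kept[it] != duplicate[it] }
        .associateWith { kept[it] to duplicate[it] }

fun main() {
    val kept = mapOf("name" to "Bakery", "telephone" to "0821 123", "website" to "https://example.org")
    val duplicate = mapOf("name" to "Bakery", "telephone" to "0821123", "website" to "example.org")

    // One log line per differing attribute; println stands in for the real logger.
    for ((attribute, values) in differingAttributes(kept, duplicate)) {
        println("Duplicate differs in '$attribute': '${values.first}' vs '${values.second}'")
    }
}
```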
Hm, there should be exactly one data source per Landkreis/Kreisfreie Stadt. Either they use an Excel list or Freinet.
At least in the test app-daten.xml it is that way (e.g. just have a look at the first duplicate that is sorted out:
Yes, that makes sense. So the duplication is caused by duplicate data sources (Freinet + something else).
See #485
Fixes #478