-
-
Notifications
You must be signed in to change notification settings - Fork 119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Is there alias feature ? #249
Comments
So far there is no way to alias characters/words. There is already a request for character dependent weights: #241. Levenshtein.distance(["street", "road"], ["st", "st"]) # result is 2
weights=...
weights["street", "st"] = 0
weights["st", "street"] = 0
weights["road", "st"] = 0
weights["st", "road"] = 0
weights["street", "road"] = 0
weights["road", "street"] = 0
Levenshtein.distance(["street", "road"], ["st", "st"], weights=weights) # result is 0 which might be enough for your use case. |
This is exactly what I need !! Also, which scorer is best for such scenarios (since tokens can be presented without order). |
So far this feature does not exist in either of them. However it will absolutely be implemented in C++. The Python implementation will only wrap it. It will extend: https://github.com/maxbachmann/rapidfuzz-cpp/blob/d937555ad76a6f1ed853ab4b7102a7b22b6f0fcf/rapidfuzz/distance/Levenshtein.hpp#L142
At least right now the feature is only planned for |
You can also preprocess the input strings such that the fuzz operation occurs in x and the result is (x,y) with y being the original string. Then you preprocess things so that the words you want to be the same score are replaced by one canonical word in x. This is heavy in string manipulation but if you want to use one of the sort scorers, like token_set_ratio or similar, you can do it like that. The more you replace words (or do similar tricks like removing combining characters accents), the more likely that there will be 2 or more 'same best scores', which can lead to inconsistent results on repeated runs with the same dataset. If it matters, get the 2 or 3 best ones (or until they're not the same score) then check if they have the same score, and if they do, either chose a consistent order for the 'winner' or if both are valid somehow, and you can, combine results. |
This can be faster than using weights for Levenshtein for this purpose, since the weighted Levenshtein distance is quite a bit slower to calculate than the uniform Levenshtein distance. So e.g. when comparing a string to a list of known strings you can preprocess ahead of time it is likely faster to preprocess the strings yourself. |
Closing this, since it is tracked as part of #241 |
Hi
Is there a way to provide a alias options ?
For example, "street", "st", "road" could be alias for some scenarios.
How can this be done ? Thank you.
The text was updated successfully, but these errors were encountered: