Is there an alias feature? #249

mkandulavm · 2022-08-16T11:47:38Z

Hi

Is there a way to provide alias options?
For example, "street", "st", and "road" could be aliases of each other in some scenarios.

How can this be done? Thank you.

maxbachmann · 2022-08-16T13:21:19Z

So far there is no way to alias characters/words. There is already a request for character-dependent weights: #241.
This would allow you to alias individual elements by setting their substitution cost to 0. This would still only work on individual symbols (here, whole tokens passed as list elements):

Levenshtein.distance(["street", "road"], ["st", "st"]) # result is 2
weights=...
weights["street", "st"] = 0
weights["st", "street"] = 0
weights["road", "st"] = 0
weights["st", "road"] = 0
weights["street", "road"] = 0
weights["road", "street"] = 0
Levenshtein.distance(["street", "road"], ["st", "st"], weights=weights) # result is 0

which might be enough for your use case.
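Since the `weights` argument above is only a proposal, here is a minimal pure-Python sketch of the same idea for experimenting in the meantime. It is not RapidFuzz's implementation; `weighted_levenshtein` and `alias_costs` are hypothetical names, and the sketch assumes insertions and deletions always cost 1.

```python
from typing import Dict, Hashable, Iterable, Sequence, Tuple

Costs = Dict[Tuple[Hashable, Hashable], int]


def weighted_levenshtein(a: Sequence[Hashable], b: Sequence[Hashable], costs: Costs) -> int:
    """Levenshtein distance over sequences with per-pair substitution costs.

    Insertions and deletions cost 1. Substituting x with y costs
    costs[(x, y)] if that pair is listed, otherwise 1.
    """
    previous = list(range(len(b) + 1))
    for i, token_a in enumerate(a, start=1):
        current = [i]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                substitution = 0
            else:
                substitution = costs.get((token_a, token_b), 1)
            current.append(min(
                previous[j] + 1,                 # delete token_a
                current[j - 1] + 1,              # insert token_b
                previous[j - 1] + substitution,  # substitute or match
            ))
        previous = current
    return previous[-1]


def alias_costs(*groups: Iterable[Hashable]) -> Costs:
    """Give every ordered pair inside each alias group a substitution cost of 0."""
    costs: Costs = {}
    for group in groups:
        members = list(group)
        for x in members:
            for y in members:
                if x != y:
                    costs[(x, y)] = 0
    return costs


costs = alias_costs(["street", "st", "road"])
print(weighted_levenshtein(["street", "road"], ["st", "st"], {}))     # 2
print(weighted_levenshtein(["street", "road"], ["st", "st"], costs))  # 0
```

Like the proposal, this compares whole tokens position by position, so it does not help when the tokens appear in a different order.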

mkandulavm · 2022-08-16T13:26:14Z

This is exactly what I need!
But is an equivalent call exposed in C++?

Also, which scorer is best for such scenarios, since the tokens can appear in any order?

maxbachmann · 2022-08-16T13:33:40Z

> But is an equivalent call exposed in C++?

So far this feature exists in neither the Python nor the C++ library. However, it will definitely be implemented in C++, and the Python implementation will only wrap it. It will extend: https://github.com/maxbachmann/rapidfuzz-cpp/blob/d937555ad76a6f1ed853ab4b7102a7b22b6f0fcf/rapidfuzz/distance/Levenshtein.hpp#L142

> Also, which scorer is best for such scenarios, since the tokens can appear in any order?

At least right now the feature is only planned for Levenshtein/OSA/DamerauLevenshtein. None of those sort the tokens before comparing them.

i30817 · 2022-09-02T18:31:11Z

You can also preprocess the input strings yourself. Keep each one as a pair (x, y), where y is the original string and x is a copy in which every word you want to score the same has been replaced by one canonical word. The fuzz operation then runs on x, and the result you report is y.

This is heavy on string manipulation, but it lets you use one of the sort scorers, like token_set_ratio or similar.
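A rough sketch of that preprocessing, using only the standard library. The alias table, the function names, and the Jaccard-based `token_set_similarity` scorer are all assumptions made for illustration; the scorer is a stand-in so the example runs without extra dependencies.

```python
import re

# Hypothetical alias table: each alias maps to one canonical word.
CANONICAL = {"st": "street", "road": "street", "rd": "street"}


def canonicalize(text: str) -> str:
    """Lowercase, keep ASCII letter/digit runs as tokens, and replace aliases."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(CANONICAL.get(token, token) for token in tokens)


def token_set_similarity(a: str, b: str) -> float:
    """Stand-in scorer: Jaccard similarity of the token sets, scaled to 0-100."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 100.0
    return 100.0 * len(set_a & set_b) / len(set_a | set_b)


def build_index(choices):
    """Preprocess once: a list of (x, y) with x canonical and y the original."""
    return [(canonicalize(choice), choice) for choice in choices]


def best_match(query, index, scorer=token_set_similarity):
    """Score canonical forms against each other, return (original, score)."""
    x_query = canonicalize(query)
    x, y = max(index, key=lambda pair: scorer(x_query, pair[0]))
    return y, scorer(x_query, x)


index = build_index(["12 Main Street", "7 Oak Road", "Main Avenue 3"])
print(best_match("main st 12", index))  # ('12 Main Street', 100.0)
```

With RapidFuzz installed, `fuzz.token_set_ratio` can be passed as `scorer`, since it also takes two strings and returns a score between 0 and 100. Note that `max` returns the first of several equally good candidates, so ties are resolved by input order here.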

The more words you replace (or the more similar tricks you apply, such as removing combining characters like accents), the more likely it is that two or more candidates share the same best score, which can lead to inconsistent results on repeated runs with the same dataset.

If that matters, fetch the 2 or 3 best candidates (or keep going until the score drops) and check whether they have the same score. If they do, either choose a consistent order to pick the 'winner' or, if all of them are valid and you are able to, combine the results.
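One way to make the tie-break repeatable is sketched below. It assumes you already have a list of `(score, original)` pairs from whichever scorer you use; `top_matches` is a hypothetical helper, and sorting tied candidates by their original string is just one possible consistent order.

```python
def top_matches(scored, tolerance=0.0):
    """Return every (score, original) entry tied for the best score.

    Entries within `tolerance` of the best score count as tied. The result
    is sorted by the original string, so the order does not depend on the
    order in which candidates were scored.
    """
    if not scored:
        return []
    best = max(score for score, _ in scored)
    tied = [item for item in scored if best - item[0] <= tolerance]
    return sorted(tied, key=lambda item: item[1])


scored = [(100.0, "Main Rd. 12"), (20.0, "7 Oak Avenue"), (100.0, "12 Main Street")]
tied = top_matches(scored)
print(tied)     # [(100.0, '12 Main Street'), (100.0, 'Main Rd. 12')]
print(tied[0])  # the deterministic 'winner'
```

If all tied candidates are acceptable, the whole `tied` list can be returned to the caller instead of only its first element.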

maxbachmann · 2022-09-02T18:48:06Z

> This is heavy in string manipulation but if you want to use one of the sort scorers, like token_set_ratio or similar, you can do it like that.

This can be faster than using weights for Levenshtein for this purpose, since the weighted Levenshtein distance is quite a bit slower to calculate than the uniform Levenshtein distance. So when, for example, you compare a string to a list of known strings that can be preprocessed ahead of time, it is likely faster to preprocess the strings yourself.

maxbachmann · 2022-12-13T14:16:53Z

Closing this, since it is tracked as part of #241.

maxbachmann closed this as completed Dec 13, 2022
