Damerau-Levenshtein with character-dependent weights #241

pulsar70 · 2022-07-20T09:47:25Z

In relation to issue #180:

It would be great to generalize the foreseen implementation of the Dameau-Levenshtein distance to weights depending not only on the nature of the operation (insert/delete/substitute/transpose), but also on the specific character pairs involved in the operation.

Such weights could be either read from ressource files directly by the C/C++ implementation code, or provided as parameters (Python dictionaries) by the calling Python function.

This would be a very useful and not-so-complicated first step in the direction of context-sensitive string distance functions.

maxbachmann · 2022-07-20T11:27:32Z

I think this should start off with a similar interface to the levenshtein distance:

distance(s1, s2, *, weights=(1,1,1))

which should be extended by support for float. In addition this could support a custom dict like structure for character weights which behaves similar to:

class PairwiseWeights:
    ...

a = PairwiseWeights()
a["a", "b"] = 0.75

The pure Python version will likely implement this using a dictionary, while for the C++ implementation an efficient storage structure is important. This could look something like this:

template<typename T>
class PairwiseWeights
{
    std::array<std::array<T, 256>, 256> m_extendedAscii;
    my_hashmap<std::tuple<int64_t, int64_t>, T> m_map;
};

This has an overhead of around 256kb for the extended ascii matrix, which would make the common case of char1 < 256 and char2 < 256 very fast. For the hashmap something similar similar to the Python dict using perturbation would probably work well.

Edit: this needs a bit of thought on how to name the containers. For now we probably need one for Substitutions/Transpositions and one for Insertions/Deletions

pulsar70 · 2022-07-20T11:38:33Z

Yes, you clearly need an efficient storage + fast retrieval (hashtable-type) structure.

As far as I understand, the existing distance implementations work on any unicode strings? Then why limit the character range to extended ASCII? One should be able to provide weights for any Unicode points pair, in my opinion.

maxbachmann · 2022-07-20T11:43:10Z

As far as I understand, the existing distance implementations work on any unicode strings? Then why limit the character range to extended ASCII? One should be able to provide weights for any Unicode points pair, in my opinion.

This continues to be true. It is a mixed structure:

template<typename U, typename K>
T get(U ch1, K ch2)
{
    if (ch1 < 256 && ch2 < 256)
        return m_extendedAscii[ch1][ch2];
   return m_map.get(ch1, ch2);
}

since in many cases strings are ascii only this can improve the access times significantly, while it still supports any hashable type. Especially since Python strings are stored as uint8_t*/uint16_t*/uint32_t* depending on the max char in the string. So this will be optimized to:

template<typename U, typename K>
T get(U ch1, K ch2)
{
    return m_extendedAscii[ch1][ch2];
}

for strings with all characters < 256

maxbachmann · 2022-07-23T15:30:10Z

Note that this will likely take a while to get implemented, since this needs to be pretty generic to cover most use cases. Some examples are:

character dependent weights: e.g. keyboard layout
static weight (both int and float)
possibly position dependent weights (weight e.g. mistakes at the start of the string higher)
...

How do those interact, since you probably want to be able to mix them to some extent. An example is a keyboard layout:

substitutions -> depend on distance on keyboard -> mapping
insertion -> probably 1, since they likely do not depend on the character
deletion -> idk
transposition -> probably 1 since mix ups do not depend on the keyboard position

Would it be possible to combine e.g. character dependent and position dependent weights? This might be useful sometimes.

In addition I am unsure how normalization would work with these weights?

pulsar70 · 2022-07-31T15:16:04Z

I was actually only suggesting the technical possibility to specify these weights, leaving the choice of their actual values to the package user (one could provide default ones to play with if really desired, of course).

I have my own custom-made C++ implementation of a Generalized Edit Distance with character-dependent weights,that are fully configurable from simple user-editable txt files. Default values are taken by all those character (pairs) no explicitly mentioned in the config. file, so it enables me to act upon only those values I want to depart from the standard weight.

Should you find it helpful, I would be glad to send you my implementation files. Seeing it properly cythonized in RapidFuzz (which I am technically unable to do myself) would be delightful.

maxbachmann · 2022-07-31T20:21:41Z

I was describing possible use cases for users of the API to decide on an API. I think for now it would be enough to add character dependent weights. If other weights need an API change this can be done in a later major release.

maxbachmann · 2022-12-13T14:16:35Z

Should you find it helpful, I would be glad to send you my implementation files. Seeing it properly cythonized in RapidFuzz (which I am technically unable to do myself) would be delightful.

Sure. Having a reference implementation would be helpful.

maxbachmann added the enhancement New feature or request label Jul 20, 2022

maxbachmann added the discussion Up to discussion label Jul 23, 2022

maxbachmann mentioned this issue Aug 3, 2022

Operation weights not configurable ztane/python-Levenshtein#86

Open

This was referenced Aug 16, 2022

Is there alias feature ? #249

Closed

Installation via PyPi does not work with Python 3.9 infoscout/weighted-levenshtein#25

Closed

maxbachmann mentioned this issue Sep 25, 2022

Get LCS string from LCSseq method #270

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Damerau-Levenshtein with character-dependent weights #241

Damerau-Levenshtein with character-dependent weights #241

pulsar70 commented Jul 20, 2022

maxbachmann commented Jul 20, 2022 •

edited

Loading

pulsar70 commented Jul 20, 2022

maxbachmann commented Jul 20, 2022 •

edited

Loading

maxbachmann commented Jul 23, 2022 •

edited

Loading

pulsar70 commented Jul 31, 2022 •

edited

Loading

maxbachmann commented Jul 31, 2022

maxbachmann commented Dec 13, 2022

Damerau-Levenshtein with character-dependent weights #241

Damerau-Levenshtein with character-dependent weights #241

Comments

pulsar70 commented Jul 20, 2022

maxbachmann commented Jul 20, 2022 • edited Loading

pulsar70 commented Jul 20, 2022

maxbachmann commented Jul 20, 2022 • edited Loading

maxbachmann commented Jul 23, 2022 • edited Loading

pulsar70 commented Jul 31, 2022 • edited Loading

maxbachmann commented Jul 31, 2022

maxbachmann commented Dec 13, 2022

maxbachmann commented Jul 20, 2022 •

edited

Loading

maxbachmann commented Jul 20, 2022 •

edited

Loading

maxbachmann commented Jul 23, 2022 •

edited

Loading

pulsar70 commented Jul 31, 2022 •

edited

Loading