Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Damerau-Levenshtein with character-dependent weights #241

Open
pulsar70 opened this issue Jul 20, 2022 · 7 comments
Open

Damerau-Levenshtein with character-dependent weights #241

pulsar70 opened this issue Jul 20, 2022 · 7 comments
Labels
discussion Up to discussion enhancement New feature or request

Comments

@pulsar70
Copy link

In relation to issue #180:

It would be great to generalize the foreseen implementation of the Dameau-Levenshtein distance to weights depending not only on the nature of the operation (insert/delete/substitute/transpose), but also on the specific character pairs involved in the operation.

Such weights could be either read from ressource files directly by the C/C++ implementation code, or provided as parameters (Python dictionaries) by the calling Python function.

This would be a very useful and not-so-complicated first step in the direction of context-sensitive string distance functions.

@maxbachmann maxbachmann added the enhancement New feature or request label Jul 20, 2022
@maxbachmann
Copy link
Member

maxbachmann commented Jul 20, 2022

I think this should start off with a similar interface to the levenshtein distance:

distance(s1, s2, *, weights=(1,1,1))

which should be extended by support for float. In addition this could support a custom dict like structure for character weights which behaves similar to:

class PairwiseWeights:
    ...

a = PairwiseWeights()
a["a", "b"] = 0.75

The pure Python version will likely implement this using a dictionary, while for the C++ implementation an efficient storage structure is important. This could look something like this:

template<typename T>
class PairwiseWeights
{
    std::array<std::array<T, 256>, 256> m_extendedAscii;
    my_hashmap<std::tuple<int64_t, int64_t>, T> m_map;
};

This has an overhead of around 256kb for the extended ascii matrix, which would make the common case of char1 < 256 and char2 < 256 very fast. For the hashmap something similar similar to the Python dict using perturbation would probably work well.

Edit: this needs a bit of thought on how to name the containers. For now we probably need one for Substitutions/Transpositions and one for Insertions/Deletions

@pulsar70
Copy link
Author

Yes, you clearly need an efficient storage + fast retrieval (hashtable-type) structure.

As far as I understand, the existing distance implementations work on any unicode strings? Then why limit the character range to extended ASCII? One should be able to provide weights for any Unicode points pair, in my opinion.

@maxbachmann
Copy link
Member

maxbachmann commented Jul 20, 2022

As far as I understand, the existing distance implementations work on any unicode strings? Then why limit the character range to extended ASCII? One should be able to provide weights for any Unicode points pair, in my opinion.

This continues to be true. It is a mixed structure:

template<typename U, typename K>
T get(U ch1, K ch2)
{
    if (ch1 < 256 && ch2 < 256)
        return m_extendedAscii[ch1][ch2];
   return m_map.get(ch1, ch2);
}

since in many cases strings are ascii only this can improve the access times significantly, while it still supports any hashable type. Especially since Python strings are stored as uint8_t*/uint16_t*/uint32_t* depending on the max char in the string. So this will be optimized to:

template<typename U, typename K>
T get(U ch1, K ch2)
{
    return m_extendedAscii[ch1][ch2];
}

for strings with all characters < 256

@maxbachmann
Copy link
Member

maxbachmann commented Jul 23, 2022

Note that this will likely take a while to get implemented, since this needs to be pretty generic to cover most use cases. Some examples are:

  • character dependent weights: e.g. keyboard layout
  • static weight (both int and float)
  • possibly position dependent weights (weight e.g. mistakes at the start of the string higher)
    ...

How do those interact, since you probably want to be able to mix them to some extent. An example is a keyboard layout:

  • substitutions -> depend on distance on keyboard -> mapping
  • insertion -> probably 1, since they likely do not depend on the character
  • deletion -> idk
  • transposition -> probably 1 since mix ups do not depend on the keyboard position

Would it be possible to combine e.g. character dependent and position dependent weights? This might be useful sometimes.

In addition I am unsure how normalization would work with these weights?

@maxbachmann maxbachmann added the discussion Up to discussion label Jul 23, 2022
@pulsar70
Copy link
Author

pulsar70 commented Jul 31, 2022

I was actually only suggesting the technical possibility to specify these weights, leaving the choice of their actual values to the package user (one could provide default ones to play with if really desired, of course).

I have my own custom-made C++ implementation of a Generalized Edit Distance with character-dependent weights,that are fully configurable from simple user-editable txt files. Default values are taken by all those character (pairs) no explicitly mentioned in the config. file, so it enables me to act upon only those values I want to depart from the standard weight.

Should you find it helpful, I would be glad to send you my implementation files. Seeing it properly cythonized in RapidFuzz (which I am technically unable to do myself) would be delightful.

@maxbachmann
Copy link
Member

I was describing possible use cases for users of the API to decide on an API. I think for now it would be enough to add character dependent weights. If other weights need an API change this can be done in a later major release.

@maxbachmann
Copy link
Member

Should you find it helpful, I would be glad to send you my implementation files. Seeing it properly cythonized in RapidFuzz (which I am technically unable to do myself) would be delightful.

Sure. Having a reference implementation would be helpful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion Up to discussion enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants