feat: add calculate edit distance feature #1656
I'm debating if we need all 3 of these, since the names are getting a little confusing. Should we just stick with score (100% is a perfect match, 0 is nothing in common), since that's the accuracy measure that's most easily understood and the one we'd use for marketing purposes? Not sure when we might want the other values - let me know what use cases you had in mind.
@shreyanid I'd say we can keep the score and a raw distance (which would be renamed "distance").
That sounds more reasonable, ty!
LGTM! Since you added a requirement, maybe re-run pip-compile.
Future PRs can revisit the default weights, as well as use actual CCTs for testing.
Executive Summary
Adds a function to calculate the edit distance (Levenshtein distance) between two strings. The function can return either of:
1. score (similarity = 1 - distance/source_len)
2. distance (raw Levenshtein distance)
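To illustrate how the two return values relate, here is a minimal sketch in plain Python. It is not the code from this PR: the names `levenshtein` and `edit_score` are hypothetical, the distance is unweighted, and clamping the score to the range [0, 1] is an assumption made so that the result matches the "100% is a perfect match, 0 is nothing in common" reading discussed in the conversation.

```python
def levenshtein(source: str, target: str) -> int:
    """Unweighted Levenshtein distance, computed with a rolling DP row."""
    # prev[j] = distance between the empty prefix of source and target[:j]
    prev = list(range(len(target) + 1))
    for i, s_ch in enumerate(source, start=1):
        curr = [i]  # distance between source[:i] and the empty string
        for j, t_ch in enumerate(target, start=1):
            cost = 0 if s_ch == t_ch else 1
            curr.append(min(
                prev[j] + 1,         # drop a character of source
                curr[j - 1] + 1,     # add a character of target
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def edit_score(source: str, target: str) -> float:
    """Similarity = 1 - distance / source_len, clamped to [0, 1] (assumption)."""
    if not source:
        # Assumption: an empty source only matches an empty target.
        return 1.0 if not target else 0.0
    distance = levenshtein(source, target)
    return max(0.0, 1.0 - distance / len(source))


print(levenshtein("kitten", "sitting"))
print(edit_score("abcd", "abxd"))
```

Without the clamp, the score formula can go negative whenever the distance exceeds the source length, which is one reason a raw distance is still useful alongside the score.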
Technical details
The weights param defaults to (2, 1, 1) for (insertion, deletion, substitution). This means insertions, i.e. the characters that must be added to the output (target) to match the source (reference), are penalized twice as heavily as deletions or substitutions. In other words, text that is missing from the extraction is penalized more.
Important Note!
The test case needs to be updated to use CCT once the function is ready. For now it only tests the basic "functionality" of edit distance, not the edit distance with CCT as it is intended to be used.
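The effect of the default weights described under Technical details can be sketched as follows. This is an illustrative pure-Python dynamic program, not the implementation in this PR: the function name `weighted_distance` is hypothetical, and the direction (editing the output until it equals the source) is an assumption taken from the description above.

```python
from typing import Tuple


def weighted_distance(
    output: str,
    source: str,
    weights: Tuple[int, int, int] = (2, 1, 1),
) -> int:
    """Cost of editing `output` into `source`.

    `weights` is (insertion, deletion, substitution). An insertion adds a
    character that `source` has and `output` lacks, i.e. missing text.
    """
    ins, dele, sub = weights
    # prev[j] = cost of turning an empty output into source[:j] (j insertions)
    prev = [j * ins for j in range(len(source) + 1)]
    for i, o_ch in enumerate(output, start=1):
        curr = [i * dele]  # turning output[:i] into "" takes i deletions
        for j, s_ch in enumerate(source, start=1):
            curr.append(min(
                curr[j - 1] + ins,                         # insert s_ch
                prev[j] + dele,                            # delete o_ch
                prev[j - 1] + (0 if o_ch == s_ch else sub),  # match/substitute
            ))
        prev = curr
    return prev[-1]


# Output is missing one "l": one insertion is needed.
print(weighted_distance("helo", "hello"))
# Output has one extra "l": one deletion is needed.
print(weighted_distance("helllo", "hello"))
```

With these weights, an output that drops a character costs more than an output that contains an extra one, which matches the intent of penalizing missing extraction more heavily.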