feat: add calculate edit distance feature #1656

Merged
merged 20 commits into main from klaijan/edit-distance-metric
Oct 7, 2023

Klaijan
Contributor

@Klaijan Klaijan commented Oct 5, 2023

Executive Summary

Adds a function to calculate the edit distance (Levenshtein distance) between two strings. The function can return either of two values: 1. score (similarity = 1 - distance/source_len) 2. distance (raw Levenshtein distance)

Technical details

  • The weights param defaults to (2, 1, 1) for (insertion, deletion, substitution), meaning that we penalize the insertions needed to bring the output (target) in line with the source (reference). In other words, missing extracted text is penalized more heavily.
  • The function takes two strings and assumes that both are already clean and concatenated text (CCT).
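
For illustration, here is a minimal pure-Python sketch of the behavior described above. It is not the code in this PR: the names `weighted_levenshtein` and `edit_distance_sketch`, the `return_as` parameter, the clamping of the score to a minimum of 0, and the handling of an empty source are all assumptions made for the example. It treats insertions as characters that must be added to the output to reach the source.

```python
def weighted_levenshtein(output, source, weights=(2, 1, 1)):
    """Cost of turning `output` into `source`.

    `weights` is (insertion, deletion, substitution). Hypothetical helper,
    written as a plain dynamic program for illustration only.
    """
    ins_w, del_w, sub_w = weights
    # prev[j] = cost of turning the empty prefix of output into source[:j]
    prev = [j * ins_w for j in range(len(source) + 1)]
    for i, out_ch in enumerate(output, start=1):
        # Turning output[:i] into an empty string takes i deletions.
        curr = [i * del_w]
        for j, src_ch in enumerate(source, start=1):
            substitute = prev[j - 1] + (0 if out_ch == src_ch else sub_w)
            delete = prev[j] + del_w      # drop out_ch from the output
            insert = curr[j - 1] + ins_w  # add src_ch, missing from the output
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


def edit_distance_sketch(output, source, weights=(2, 1, 1), return_as="score"):
    """Return the raw distance or the score 1 - distance/source_len."""
    distance = weighted_levenshtein(output, source, weights)
    if return_as == "distance":
        return distance
    if return_as == "score":
        if not source:
            # Assumption: an empty source only matches an empty output.
            return 1.0 if distance == 0 else 0.0
        # Assumption: clamp at 0, since weighted costs can exceed source_len.
        return max(0.0, 1.0 - distance / len(source))
    raise ValueError("return_as must be 'score' or 'distance'")


print(edit_distance_sketch("abc", "abcd", return_as="distance"))
print(edit_distance_sketch("abc", "abcd", return_as="score"))
```

With the default weights, one character missing from the output costs 2 while one extra character in the output costs 1, which is the asymmetry described in the first bullet.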

Important Note!
The test cases need to be updated to use CCT once the function is ready. For now they only test the "functionality" of edit distance, not edit distance on CCT as it is intended to be used.

@Klaijan Klaijan requested a review from shreyanid October 5, 2023 22:52
@Klaijan Klaijan changed the title Klaijan/feat: edit distance feat: add calculate edit distance feature Oct 5, 2023
@shreyanid
Contributor

  1. score (similarity = 1 - percentage) 2. percentage (distance/source_len) 3. distance (raw levenshtein distance)

I'm debating if we need all 3 of these, since the names are getting a little confusing. Should we just stick with score (100% is perfect match, 0 is nothing in common) since that's the accuracy measure that's most easily understood and we'd use for marketing purposes? Not sure when we might want the other values - let me know what use cases you had in mind.

@Klaijan
Contributor Author

Klaijan commented Oct 5, 2023

@shreyanid I'd say we can keep the score and the raw distance (which would be renamed "distance")

@shreyanid
Contributor

That sounds more reasonable, ty!

@Klaijan Klaijan requested a review from shreyanid October 6, 2023 02:51
@Klaijan Klaijan requested a review from shreyanid October 6, 2023 21:10
Contributor

@shreyanid shreyanid left a comment

LGTM! Since you added a requirement maybe re-pip compile.

Future PRs can revisit the default weights, as well as use actual CCTs for testing.

@cragwolfe cragwolfe enabled auto-merge October 7, 2023 00:48
@cragwolfe cragwolfe added this pull request to the merge queue Oct 7, 2023
Merged via the queue into main with commit 33edbf8 Oct 7, 2023
39 checks passed
@cragwolfe cragwolfe deleted the klaijan/edit-distance-metric branch October 7, 2023 01:52
3 participants