Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

misc std/editdistance #397

Open
1 task
timotheecour opened this issue Nov 28, 2020 · 4 comments
Open
1 task

misc std/editdistance #397

timotheecour opened this issue Nov 28, 2020 · 4 comments

Comments

@timotheecour
Copy link
Owner

timotheecour commented Nov 28, 2020

@timotheecour
Copy link
Owner Author

timotheecour commented Nov 28, 2020

  • allow custom weights for insertions/deletions/replacements/transpositions
  • make editDistance generic
  • factor editDistanceAscii and editDistance to reuse same (generic) implementation, with 0 overhead
  • check whether passing an optional string as reusable buffer can improve performance
  • allow returning an arg-max (ie a path that optimizes the dynamic program; in this case a sequence of edits)

@ringabout
Copy link
Collaborator

ringabout commented Nov 29, 2020

I could implement first task.
Simple reference implementation: https://www.guyrutenberg.com/2008/12/15/damerau-levenshtein-distance-in-python/

I have ported this simple one:
https://gist.github.com/xflywind/bef5b8626076463f4226cd9f20a3d0e7

Find a better article:
https://www.lemoda.net/text-fuzzy/damerau-levenshtein/index.html

@timotheecour
Copy link
Owner Author

timotheecour commented Nov 29, 2020

that would be great, happy to look at drafts in a branch
A few notes:

  • https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance distinguishes 2 algos for that, they may have pros and cons
  • stdlib have some optimizations we should keep, eg common prefix/suffix
  • a naive implementation (without optimizations), straight port from python code, would be nice to have (eg in teditdistance) to serve as groundtruth for tests
  • I'm worried about API explosion if we don't first re-factor editDistanceAscii + editDistance into some generic API (say proc editDistanceCustom[T](a, b: openArray[Y]): int for now). I don't see why the specificities of editDistance (dealing with unicode) can't be abstracted away in some generic API for some reason. This should have 0 overhead over editDistance otherwise not good
  • at least, editDistanceAscii could be abstracted in that generic API transparently

@ringabout
Copy link
Collaborator

see nim-works/nimskull#206

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants