-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: FST-based Curated Dictionary Spellchecking #258
Conversation
Made it more correct (better utf-8 edit-distance support using the levenshtein-automata crate mentioned in #138) and a tiny bit faster by storing the DFA builders thread-locally 💪 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Grant, this is some really great work. There are a couple of test cases and organizational things that need to be looked at, but overall it's looking very good, particularly for the WASM target.
Let me know if you have any questions about my comments.
c4645ac
to
d802b68
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've got some more requests for changes. Still looks good. Let me figure out the performance problem.
It was at this moment I recognized the height of my folly.
- Removed words_iter() - Removed words_with_len_iter() - Added fuzzy_match() - Added fuzzy_match_str()
…xcept fuzzy search
…parsing The dictionary parsing is relatively isolated, and the parsing is needed for the harper-core build script.
… numbers from dictionary
…zzy find operations
…o pub(super) - get_word - get_word_metadata These two functions are needed by `FstDictionary`, but aren't something we need to expose publicly.
1586d7b
to
69cbae1
Compare
This reverts commit 0d5c600. Using `Lrc` in this position makes it much harder to support choosing `concurrency` support.
Has the added bonus of doubling test execution speed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything here looks "correct". I'm going to do some more extensive testing before merging.
Update: For correctness reasons it is now only 80% faster. |
// Let common words bubble up, but do not prioritize them over all else. | ||
found.sort_by_key(|fmr| fmr.edit_distance + if fmr.metadata.common { 0 } else { 1 }); | ||
// Make commonality relevant | ||
found.sort_by_key(|fmr| if fmr.metadata.common { 0 } else { 1 }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you remove the edit distance portion?
Created a new dictionary
FstDictionary
that uses a finite-state transducer (FST)-based map underneath. This precomputed map is built based on the dictionary file at compile time, and allows for extremely fast edit-distance calculations used in the spellchecking logic.FstDictionary
is ideal for the curated dictionary, but is immutable, so the currentFullDictionary
implementation is still used for user and file dictionaries.Using this new dictionary results in much faster spellchecking (about 10x faster when not yet cached per benchmark below).
One downside to this PR is that it moves the hunspell parsing stuff out into a new crate
harper-dictionary-parsing
, which makes some of the imports a little weird. For instance,harper-core
now importsSpan
fromharper-dictionary-parsing
.Also, there is a noted bug in thefst
library with its handling of japanese characters, but for the time being this should not pose an issue as multi-language support is nowhere near being implemented. (#138, fst/#38)