Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: FST-based Curated Dictionary Spellchecking #258

Merged
merged 70 commits into from
Nov 23, 2024
Merged

Conversation

grantlemons
Copy link
Collaborator

@grantlemons grantlemons commented Nov 2, 2024

Created a new dictionary FstDictionary that uses a finite-state transducer (FST)-based map underneath. This precomputed map is built based on the dictionary file at compile time, and allows for extremely fast edit-distance calculations used in the spellchecking logic.

FstDictionary is ideal for the curated dictionary, but is immutable, so the current FullDictionary implementation is still used for user and file dictionaries.

Using this new dictionary results in much faster spellchecking (about 10x faster when not yet cached per benchmark below).
image

One downside to this PR is that it moves the hunspell parsing stuff out into a new crate harper-dictionary-parsing, which makes some of the imports a little weird. For instance, harper-core now imports Span from harper-dictionary-parsing.

Also, there is a noted bug in the fst library with its handling of japanese characters, but for the time being this should not pose an issue as multi-language support is nowhere near being implemented. (#138, fst/#38)

@grantlemons grantlemons added enhancement New feature or request harper-core labels Nov 2, 2024
@grantlemons grantlemons self-assigned this Nov 2, 2024
@grantlemons grantlemons linked an issue Nov 2, 2024 that may be closed by this pull request
@grantlemons
Copy link
Collaborator Author

grantlemons commented Nov 2, 2024

Made it more correct (better utf-8 edit-distance support using the levenshtein-automata crate mentioned in #138) and a tiny bit faster by storing the DFA builders thread-locally 💪

image

Copy link
Owner

@elijah-potter elijah-potter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grant, this is some really great work. There are a couple of test cases and organizational things that need to be looked at, but overall it's looking very good, particularly for the WASM target.

Let me know if you have any questions about my comments.

harper-dictionary-parsing/src/lib.rs Outdated Show resolved Hide resolved
harper-dictionary-parsing/src/lib.rs Outdated Show resolved Hide resolved
harper-core/build.rs Outdated Show resolved Hide resolved
harper-dictionary-parsing/Cargo.toml Outdated Show resolved Hide resolved
harper-dictionary-parsing/src/char_string.rs Outdated Show resolved Hide resolved
harper-core/src/spell/full_dictionary.rs Outdated Show resolved Hide resolved
harper-core/src/spell/full_dictionary.rs Outdated Show resolved Hide resolved
harper-core/src/spell/mod.rs Show resolved Hide resolved
harper-core/src/spell/mod.rs Outdated Show resolved Hide resolved
harper-core/src/spell/mod.rs Show resolved Hide resolved
Copy link
Owner

@elijah-potter elijah-potter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've got some more requests for changes. Still looks good. Let me figure out the performance problem.

harper-core/Cargo.toml Outdated Show resolved Hide resolved
harper-core/src/spell/full_dictionary.rs Show resolved Hide resolved
harper-core/src/spell/full_dictionary.rs Show resolved Hide resolved
harper-core/src/spell/full_dictionary.rs Outdated Show resolved Hide resolved
harper-core/src/spell/full_dictionary.rs Show resolved Hide resolved
harper-core/src/spell/merged_dictionary.rs Show resolved Hide resolved
harper-core/src/spell/dictionary.rs Outdated Show resolved Hide resolved
grantlemons and others added 20 commits November 14, 2024 20:06
It was at this moment I recognized the height of my folly.
- Removed words_iter()
- Removed words_with_len_iter()
- Added fuzzy_match()
- Added fuzzy_match_str()
…parsing

The dictionary parsing is relatively isolated, and the parsing is needed
for the harper-core build script.
harper-core/src/spell/mod.rs Outdated Show resolved Hide resolved
harper-core/src/spell/mod.rs Show resolved Hide resolved
Copy link
Owner

@elijah-potter elijah-potter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything here looks "correct". I'm going to do some more extensive testing before merging.

@grantlemons
Copy link
Collaborator Author

Update: For correctness reasons it is now only 80% faster.

// Let common words bubble up, but do not prioritize them over all else.
found.sort_by_key(|fmr| fmr.edit_distance + if fmr.metadata.common { 0 } else { 1 });
// Make commonality relevant
found.sort_by_key(|fmr| if fmr.metadata.common { 0 } else { 1 });
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you remove the edit distance portion?

@elijah-potter elijah-potter merged commit f3ed4c1 into master Nov 23, 2024
12 checks passed
@elijah-potter elijah-potter deleted the fst-dictionary branch November 23, 2024 19:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request harper-core
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use FST as dictionary?
2 participants