LUCENE-10626 Hunspell: add tools to aid dictionary editing #975
Reviewing commits separately might be easier (but I intend to squash them when merging)
Hi Peter! I'll take a look later today - it's end-of-school in Poland today and it's a bit hectic.
@dweiss sure, no pressure, thanks!
Hi Peter. Very interesting pieces of code! I've skimmed through all the commits and I get the gist of the functionality. I can't say I understand all the details, but overall it looks fine. I left a few comments here and there, but they're mostly suggestions or musings about how to code this and that, not mistakes. Feel free to apply or omit at will.
The CHANGES.txt entry is missing, I think? It'd be good to add it - this is interesting functionality, not just for Lucene's needs.
lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Hunspell.java
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.IntsRefFSTEnum;

/** A utility class used for generating possible word forms by adding affixes to stems */
I'd give a link to the method that actually makes use of this (expandRoot) and make all of this package-private.
This class has 3 public methods for different functionality. Hunspell features simple versions of these methods (for discoverability), while the methods here provide more control over the behavior. But the javadoc for the class could mention that indeed, thanks!
Clear, thanks.
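For readers unfamiliar with what "generating possible word forms by adding affixes to stems" means, here is a rough, self-contained sketch. It is a toy model, not the WordFormGenerator code under review: the `AffixExpansionSketch` class, the `SuffixRule` type, and the `expand` method are all hypothetical names, and only simple suffix rules (flag, strip, add, ending condition) are modeled.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of Hunspell-style suffix expansion. NOT the Lucene API; all names are hypothetical. */
class AffixExpansionSketch {

  /** Hypothetical suffix rule: if the stem ends with {@code condition}, strip {@code strip} and append {@code add}. */
  static final class SuffixRule {
    final char flag;
    final String strip;
    final String add;
    final String condition;

    SuffixRule(char flag, String strip, String add, String condition) {
      this.flag = flag;
      this.strip = strip;
      this.add = add;
      this.condition = condition;
    }

    boolean appliesTo(String stem) {
      return stem.endsWith(condition) && stem.endsWith(strip);
    }

    String apply(String stem) {
      return stem.substring(0, stem.length() - strip.length()) + add;
    }
  }

  /** Returns the stem itself, followed by every form produced by a rule whose flag occurs in {@code flags}. */
  static List<String> expand(String stem, String flags, List<SuffixRule> rules) {
    List<String> forms = new ArrayList<>();
    forms.add(stem);
    for (SuffixRule rule : rules) {
      if (flags.indexOf(rule.flag) >= 0 && rule.appliesTo(stem)) {
        forms.add(rule.apply(stem));
      }
    }
    return forms;
  }

  public static void main(String[] args) {
    // Assumed rules, loosely modeled on English: D adds "d" after "e", G replaces a final "e" with "ing".
    List<SuffixRule> rules = List.of(
        new SuffixRule('D', "", "d", "e"),
        new SuffixRule('G', "e", "ing", "e"),
        new SuffixRule('S', "", "s", ""));
    System.out.println(expand("create", "DG", rules)); // prints [create, created, creating]
  }
}
```

A real dictionary also has prefixes, cross-product rules, and flags appended by affixes themselves, none of which this sketch attempts.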
Map<String, AffixedWord> expanded =
    TestSpellChecking.checkExpansionGeneratesCorrectWords(h, "create", "base").stream()
        .collect(Collectors.toMap(w -> w.getWord(), w -> w));
assertEquals(expected, expanded.keySet().stream().sorted().toList());
You could make expected a set, then no sorting would be needed (for both), I think.
Sorting makes viewing the difference easier when a test fails
Yeah... I like assertj for this reason more.
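The trade-off in this thread (order-insensitive set comparison versus a readable failure message) can be illustrated with plain Java. The sketch below is not part of the PR; `ExpectedVsActualSketch` and `describeDifference` are hypothetical names. It compares two sets without caring about order, yet still reports missing and unexpected elements in sorted order, which is roughly the kind of message collection-aware assertion libraries produce.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: order-insensitive comparison with a readable, sorted difference report. */
class ExpectedVsActualSketch {

  /** Returns an empty string if both sets hold the same elements, otherwise a sorted description of the difference. */
  static String describeDifference(Set<String> expected, Set<String> actual) {
    Set<String> missing = new TreeSet<>(expected);
    missing.removeAll(actual);
    Set<String> unexpected = new TreeSet<>(actual);
    unexpected.removeAll(expected);
    if (missing.isEmpty() && unexpected.isEmpty()) {
      return "";
    }
    return "missing=" + missing + ", unexpected=" + unexpected;
  }

  public static void main(String[] args) {
    Set<String> expected = Set.of("create", "created", "creating");
    Set<String> actual = Set.of("creating", "create", "creates");
    System.out.println(describeDifference(expected, actual)); // prints missing=[created], unexpected=[creates]
  }
}
```

In a test, one could assert that the returned string is empty and use the string itself as the failure message.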
@@ -245,6 +240,31 @@ private LinkedHashSet<Character> appendFlags(AffixEntry affix) {
    return appendId <= 0 ? new LinkedHashSet<>() : toSet(dictionary.flagLookup.getFlags(appendId));
  }

  /**
   * Given a list of words, try to produce a smaller set of dictionary entries (with some flags)
This is pretty neat.
And quite likely NP-complete as well :) I've come up with some approximate greedy algorithm that seems to work for my cases, but isn't ideal.
If it's NP-complete and you solve it, let me know. We'll have a bunch of other problems covered then. ;)
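To make the "approximate greedy algorithm" remark concrete, here is one way such a compression could be framed: as greedy set cover. This is only a sketch under stated assumptions, not the algorithm in the PR. `GreedyCompressSketch`, `Candidate`, and `compress` are hypothetical names; candidate entries with their generated forms are assumed to be given; and a candidate is rejected outright if it would generate any word outside the input list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Greedy set-cover sketch for shrinking a word list into stem/flag entries. Hypothetical, not the PR's code. */
class GreedyCompressSketch {

  /** A hypothetical candidate entry such as "create/DG" together with all word forms it would generate. */
  static final class Candidate {
    final String entry;
    final Set<String> forms;

    Candidate(String entry, Set<String> forms) {
      this.entry = entry;
      this.forms = forms;
    }
  }

  static List<String> compress(Set<String> words, List<Candidate> candidates) {
    Set<String> uncovered = new TreeSet<>(words);
    List<String> chosen = new ArrayList<>();
    while (!uncovered.isEmpty()) {
      Candidate best = null;
      int bestGain = 0;
      for (Candidate c : candidates) {
        if (!words.containsAll(c.forms)) {
          continue; // assumption: never accept an entry that generates words we were not given
        }
        int gain = 0;
        for (String form : c.forms) {
          if (uncovered.contains(form)) {
            gain++;
          }
        }
        if (gain > bestGain) { // strict comparison: the earliest candidate wins ties
          best = c;
          bestGain = gain;
        }
      }
      if (best == null) {
        break; // no remaining candidate covers anything new
      }
      chosen.add(best.entry);
      uncovered.removeAll(best.forms);
    }
    chosen.addAll(uncovered); // leftover words become plain entries without flags
    return chosen;
  }

  public static void main(String[] args) {
    Set<String> words = Set.of("create", "created", "creating", "creates", "cat", "cats");
    List<Candidate> candidates = List.of(
        new Candidate("create/DG", Set.of("create", "created", "creating")),
        new Candidate("create/DGS", Set.of("create", "created", "creating", "creates")),
        new Candidate("cat/S", Set.of("cat", "cats")),
        new Candidate("cat/DS", Set.of("cat", "cats", "cated")));
    System.out.println(compress(words, candidates)); // prints [create/DGS, cat/S]
  }
}
```

Greedy selection gives no optimality guarantee: picking the locally largest cover can block a globally smaller solution, which fits the "isn't ideal" caveat above.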
… introspection, stem expansion and stem/flag suggestion
b1d900f to e8cbc26
… introspection, stem expansion and stem/flag suggestion (#975)
https://issues.apache.org/jira/browse/LUCENE-10626