LUCENE-10626 Hunspell: add tools to aid dictionary editing #975

donnerpeter · 2022-06-23T19:41:58Z

https://issues.apache.org/jira/browse/LUCENE-10626

donnerpeter · 2022-06-23T19:43:43Z

Reviewing the commits separately might be easier (but I intend to squash them when merging).

dweiss · 2022-06-24T06:16:02Z

Hi Peter! I'll take a look later today - it's end-of-school in Poland today and it's a bit hectic.

donnerpeter · 2022-06-24T09:00:03Z

@dweiss sure, no pressure, thanks!

dweiss

Hi Peter. Very interesting pieces of code! I've skimmed through all the commits and I get the gist of the functionality. I can't say I understand all the details, but overall it looks fine. I left a few comments here and there, but they're mostly suggestions or musings about how to code this and that, not mistakes. Feel free to apply or omit at will.

~~Changes.txt entry is missing, I think? It'd be good to add it - this is interesting functionality, not just for Lucene needs.~~

lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Hunspell.java

dweiss · 2022-06-24T19:30:52Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/WordFormGenerator.java

+import org.apache.lucene.util.fst.FST;
+import org.apache.lucene.util.fst.IntsRefFSTEnum;
+
+/** A utility class used for generating possible word forms by adding affixes to stems */


I'd give a link to the method that actually makes use of this (expandRoot) and make all of this package-private.

donnerpeter · 2022-06-25

This class has 3 public methods for different functionality. Hunspell features simple versions of these methods (for discoverability), while the methods here provide more control over the behavior. But the class javadoc could indeed mention that, thanks!

dweiss · 2022-06-25

Clear, thanks.

dweiss · 2022-06-24T19:36:50Z

lucene/analysis/common/src/test/org/apache/lucene/analysis/hunspell/TestHunspell.java

+    Map<String, AffixedWord> expanded =
+        TestSpellChecking.checkExpansionGeneratesCorrectWords(h, "create", "base").stream()
+            .collect(Collectors.toMap(w -> w.getWord(), w -> w));
+    assertEquals(expected, expanded.keySet().stream().sorted().toList());


You could make expected a set; then no sorting would be needed (for either side), I think.

donnerpeter · 2022-06-25

Sorting makes the difference easier to view when a test fails.

dweiss · 2022-06-25

Yeah... I like AssertJ more for this reason.
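To illustrate the trade-off discussed in this thread, here is a minimal, self-contained sketch. It does not use the Lucene test classes: the class name, the helper methods, and the sample words are hypothetical stand-ins, and the map values merely take the place of AffixedWord objects. It shows that comparing against a set needs no sorting, while comparing sorted lists gives a stable order that is easier to read in a failure message.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class ExpansionAssertSketch {
  /** Returns the map keys as a sorted list, giving a stable order for failure messages. */
  static List<String> sortedKeys(Map<String, ?> map) {
    return map.keySet().stream().sorted().collect(Collectors.toList());
  }

  /** Order-insensitive comparison: no sorting is needed on either side. */
  static boolean sameWords(Set<String> expected, Map<String, ?> map) {
    return expected.equals(map.keySet());
  }

  public static void main(String[] args) {
    // Hypothetical expansion result; the values stand in for AffixedWord objects.
    Map<String, String> expanded = new HashMap<>();
    expanded.put("creates", "create+s");
    expanded.put("create", "create");
    expanded.put("created", "create+d");

    List<String> expected = List.of("create", "created", "creates");

    // Variant 1: set comparison, independent of iteration order.
    System.out.println(sameWords(new HashSet<>(expected), expanded));

    // Variant 2: sorted-list comparison, whose printed form is deterministic.
    System.out.println(sortedKeys(expanded).equals(expected));
    System.out.println(sortedKeys(expanded));
  }
}
```

Both variants accept the same inputs; the difference only shows up in how a mismatch is reported, which is the point made in the replies above.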

dweiss · 2022-06-24T19:39:33Z

lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/WordFormGenerator.java

@@ -245,6 +240,31 @@ private LinkedHashSet<Character> appendFlags(AffixEntry affix) {
    return appendId <= 0 ? new LinkedHashSet<>() : toSet(dictionary.flagLookup.getFlags(appendId));
  }

+  /**
+   * Given a list of words, try to produce a smaller set of dictionary entries (with some flags)


This is pretty neat.

donnerpeter · 2022-06-25

And quite likely NP-complete as well :) I've come up with an approximate greedy algorithm that seems to work for my cases, but it isn't ideal.

dweiss · 2022-06-25

If it's NP-complete and you solve it, let me know. We'll have a bunch of other problems covered then. ;)
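As a rough illustration of what an approximate greedy approach to this kind of problem can look like, here is a self-contained sketch in the style of greedy set cover. It is not the algorithm from this pull request: the class, the method, the "stem/FLAGS" entry strings, and the sample words are all hypothetical, and the candidate entries with the word forms they generate are assumed to be given up front. Each round picks the entry that covers the most still-uncovered target words, skipping entries that would generate words outside the target list.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class GreedyCoverSketch {
  /**
   * Greedily chooses candidate entries until all target words are covered
   * or no remaining candidate covers anything new.
   *
   * @param targets words that should be produced
   * @param candidates hypothetical map from an entry such as "stem/FLAGS" to the words it generates
   */
  static List<String> greedyCover(Set<String> targets, Map<String, Set<String>> candidates) {
    Set<String> uncovered = new HashSet<>(targets);
    List<String> chosen = new ArrayList<>();
    while (!uncovered.isEmpty()) {
      String best = null;
      int bestGain = 0;
      for (Map.Entry<String, Set<String>> e : candidates.entrySet()) {
        // Skip entries that would generate words outside the target list.
        if (!targets.containsAll(e.getValue())) continue;
        int gain = 0;
        for (String word : e.getValue()) {
          if (uncovered.contains(word)) gain++;
        }
        // Ties keep the earlier candidate in iteration order.
        if (gain > bestGain) {
          bestGain = gain;
          best = e.getKey();
        }
      }
      if (best == null) break; // nothing covers the remaining words
      chosen.add(best);
      uncovered.removeAll(candidates.get(best));
    }
    return chosen;
  }

  public static void main(String[] args) {
    Set<String> targets = Set.of("walk", "walks", "walked", "talk", "talks");
    Map<String, Set<String>> candidates = new LinkedHashMap<>();
    candidates.put("walk/S", Set.of("walk", "walks"));
    candidates.put("walk/SD", Set.of("walk", "walks", "walked"));
    candidates.put("talk/SD", Set.of("talk", "talks", "talked"));
    candidates.put("talk/S", Set.of("talk", "talks"));
    System.out.println(greedyCover(targets, candidates));
  }
}
```

A greedy choice like this is fast but can miss the smallest possible set of entries, which matches the "seems to work for my cases, but isn't ideal" caveat above.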

donnerpeter requested a review from dweiss June 23, 2022 19:41

dweiss approved these changes Jun 24, 2022

LUCENE-10626: Hunspell: add tools to aid dictionary editing: analysis…

e8cbc26

… introspection, stem expansion and stem/flag suggestion

donnerpeter force-pushed the hunspellTools branch from b1d900f to e8cbc26 July 5, 2022 18:36

donnerpeter merged commit d537013 into apache:main Jul 5, 2022

donnerpeter deleted the hunspellTools branch October 18, 2022 09:53

donnerpeter added a commit that referenced this pull request Jan 13, 2023

LUCENE-10626: Hunspell: add tools to aid dictionary editing: analysis…

3b763af

… introspection, stem expansion and stem/flag suggestion (#975)

asfimport mentioned this pull request Jul 6, 2022

Hunspell: add tools to aid dictionary editing: analysis introspection, stem expansion and stem/flag suggestion [LUCENE-10626] #11662

Closed
