Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-10626 Hunspell: add tools to aid dictionary editing #975

Merged
merged 1 commit into from
Jul 5, 2022

Conversation

donnerpeter
Copy link
Contributor

@donnerpeter donnerpeter requested a review from dweiss June 23, 2022 19:41
@donnerpeter
Copy link
Contributor Author

donnerpeter commented Jun 23, 2022

Reviewing commits separately might be easier (but I intend to squash them when merging)

@dweiss
Copy link
Contributor

dweiss commented Jun 24, 2022

Hi Peter! I'll take a look later today - it's end-of-school in Poland today and it's a bit hectic.

@donnerpeter
Copy link
Contributor Author

@dweiss sure, no pressure, thanks!

Copy link
Contributor

@dweiss dweiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Peter. Very interesting pieces of code! I've skimmed through all the commits and I get the gist of the functionality. I can't say I understand all the details but overall it looks fine. I left a few comments here and there but they're mostly suggestions or musings about how to code this and that, they're not mistakes. Feel free to apply or omit at will.

Changes.txt entry is missing, I think? It'd be good to add it - this is interesting functionality, not just for Lucene needs.

import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.IntsRefFSTEnum;

/** A utility class used for generating possible word forms by adding affixes to stems */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd give a link to the method that actually makes use of this (expandRoot) and make all of this package-private.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class has 3 public methods for different functionality. Hunspell features simple versions of these methods (for discoverability), while the methods here provide more control over the behavior. But the javadoc for the class could mention that indeed, thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clear, thanks.

Map<String, AffixedWord> expanded =
TestSpellChecking.checkExpansionGeneratesCorrectWords(h, "create", "base").stream()
.collect(Collectors.toMap(w -> w.getWord(), w -> w));
assertEquals(expected, expanded.keySet().stream().sorted().toList());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could make expected a set, then no sorting would be needed (for both), I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorting makes viewing the difference easier when a test fails

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah... I like assertj for this reason more.

@@ -245,6 +240,31 @@ private LinkedHashSet<Character> appendFlags(AffixEntry affix) {
return appendId <= 0 ? new LinkedHashSet<>() : toSet(dictionary.flagLookup.getFlags(appendId));
}

/**
* Given a list of words, try to produce a smaller set of dictionary entries (with some flags)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty neat.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And quite likely NP-complete as well :) I've come up with some approximate greedy algorithm that seems to work for my cases, but isn't ideal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's NP-complete and you solve it, let me know. We'll have a bunch of other problems covered then. ;)

… introspection, stem expansion and stem/flag suggestion
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants