Add hunspell token filter #8061 #8070

AntonEliatra
Contributor

Description

Add hunspell token filter

Issues Resolved

Closes #8061

Version

all

Checklist

  • By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developers Certificate of Origin.
    For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Thank you for submitting your PR. The PR states are In progress (or Draft) -> Tech review -> Doc review -> Editorial review -> Merged.

Before you submit your PR for doc review, make sure the content is technically accurate. If you need help finding a tech reviewer, tag a maintainer.

When you're ready for doc review, tag the assignee of this PR. The doc reviewer may push edits to the PR directly or leave comments and editorial suggestions for you to address (let us know in a comment if you have a preference). The doc reviewer will arrange for an editorial review.

@AntonEliatra
Contributor Author

AntonEliatra commented Aug 22, 2024

I did not include the `dedup` parameter because it does not seem to work: duplicates are always returned.

Also, the `indices.analysis.hunspell.dictionary.ignore_case` setting does not seem to have any impact.

I was also unable to see any difference in behaviour when adding `indices.analysis.hunspell.dictionary.lazy: true`.
If there is a difference, I can add it back in.

Finally, according to these docs, you should be able to change the default directory for hunspell dictionaries, but I was not able to get this to work. If anyone can confirm whether this works and what format is expected, I can update the PR accordingly.
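
For reference, here is a minimal sketch of the kind of index settings body being discussed, built in Python so that the `dedup` flag can be toggled for comparison. The filter name `my_hunspell_filter`, the analyzer name `my_analyzer`, and the helper function are hypothetical, and `en_GB` assumes that a matching dictionary is already installed on the node. It only builds the request body; it makes no claim about how the cluster behaves.

```python
import json


def hunspell_index_settings(language: str, dedup: bool) -> dict:
    """Build an index-creation body with a hunspell filter and an analyzer that uses it.

    The filter and analyzer names are hypothetical placeholders.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "my_hunspell_filter": {
                        "type": "hunspell",
                        "language": language,  # assumes this dictionary is installed
                        "dedup": dedup,
                    }
                },
                "analyzer": {
                    "my_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "my_hunspell_filter"],
                    }
                },
            }
        }
    }


# Two bodies that differ only in the dedup flag, for side-by-side testing.
with_dedup = hunspell_index_settings("en_GB", dedup=True)
without_dedup = hunspell_index_settings("en_GB", dedup=False)
print(json.dumps(with_dedup, indent=2))
```

Creating one index from each body and running the same text through both analyzers would be one way to check whether `dedup` changes the returned tokens.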

@vagimeli
Contributor

vagimeli commented Aug 22, 2024

@udabhas Could you review the preceding comments from the technical writer and provide your feedback? Thank you.

@kolchfa-aws kolchfa-aws assigned vagimeli and unassigned kolchfa-aws Aug 23, 2024
@kolchfa-aws
Collaborator

@AntonEliatra I would enter this as a bug in the main OpenSearch repo.

@AntonEliatra
Contributor Author

AntonEliatra commented Aug 26, 2024

Bug issue added: opensearch-project/OpenSearch#15417

The `dedup` parameter has also been added to the PR.

@AntonEliatra AntonEliatra force-pushed the adding-hunspell-token-filter-docs branch from 939fcb9 to b7e09d5 Compare August 26, 2024 15:20
@vagimeli vagimeli added the 3 - Tech review PR: Tech review in progress label Aug 27, 2024
@vagimeli
Contributor

@varun-lodaya The documentation is awaiting tech review and approval, which is delaying progress. Could you please suggest alternative reviewers who can assist with this task in a timely manner? We're eager to move this forward. Thank you.

@vagimeli vagimeli added the Needs SME Waiting on input from subject matter expert label Aug 29, 2024
Signed-off-by: AntonEliatra <[email protected]>
@vagimeli
Contributor

vagimeli commented Oct 3, 2024

@varun-lodaya The documentation is awaiting tech review and approval, which is delaying progress. Could you please suggest alternative reviewers who can assist with this task in a timely manner? We're eager to move this forward. Thank you.

@varun-lodaya This is over a month old. We need tech review approval to move it forward in the documentation process. Please review this week or provide a peer who can review it. Thank you.

Signed-off-by: AntonEliatra <[email protected]>
Signed-off-by: Anton Rubin <[email protected]>
Collaborator

@kolchfa-aws kolchfa-aws left a comment

Thank you, @AntonEliatra! A couple of suggestions.

_analyzers/token-filters/hunspell.md (outdated, resolved)
_analyzers/token-filters/hunspell.md (outdated, resolved)
_analyzers/token-filters/hunspell.md (outdated, resolved)
_analyzers/token-filters/hunspell.md (outdated, resolved)
@kolchfa-aws kolchfa-aws self-assigned this Nov 11, 2024
@kolchfa-aws kolchfa-aws added 5 - Editorial review PR: Editorial review in progress and removed 3 - Tech review PR: Tech review in progress labels Nov 14, 2024
Collaborator

@natebower natebower left a comment

@kolchfa-aws Please see my comments and changes and let me know if you have any questions. Thanks!

_analyzers/token-filters/hunspell.md (outdated, resolved)
_analyzers/token-filters/hunspell.md (outdated, resolved)
_analyzers/token-filters/hunspell.md (outdated, resolved)
_analyzers/token-filters/hunspell.md (outdated, resolved)
_analyzers/token-filters/index.md (outdated, resolved)
@@ -30,7 +30,7 @@ Token filter | Underlying Lucene token filter| Description
 `elision` | [ElisionFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/util/ElisionFilter.html) | Removes the specified [elisions](https://en.wikipedia.org/wiki/Elision) from the beginning of tokens. For example, changes `l'avion` (the plane) to `avion` (plane).
 [`fingerprint`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/fingerprint/) | [FingerprintFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/FingerprintFilter.html) | Sorts and deduplicates the token list and concatenates tokens into a single token.
 `flatten_graph` | [FlattenGraphFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/core/FlattenGraphFilter.html) | Flattens a token graph produced by a graph token filter, such as `synonym_graph` or `word_delimiter_graph`, making the graph suitable for indexing.
-`hunspell` | [HunspellStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/hunspell/HunspellStemFilter.html) | Uses [Hunspell](https://en.wikipedia.org/wiki/Hunspell) rules to stem tokens. Because Hunspell supports a word having multiple stems, this filter can emit multiple tokens for each consumed token. Requires you to configure one or more language-specific Hunspell dictionaries.
+[`hunspell`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/hunspell/) | [HunspellStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/hunspell/HunspellStemFilter.html) | Uses [Hunspell](https://en.wikipedia.org/wiki/Hunspell) rules to stem tokens. Because Hunspell supports a word having multiple stems, this filter can emit multiple tokens for each consumed token. Requires you to configure one or more language-specific Hunspell dictionaries.
 `hyphenation_decompounder` | [HyphenationCompoundWordTokenFilter](https://lucene.apache.org/core/9_8_0/analysis/common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.html) | Uses XML-based hyphenation patterns to find potential subwords in compound words and checks the subwords against the specified word list. The token output contains only the subwords found in the word list.
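
To illustrate what "can emit multiple tokens for each consumed token" means in the `hunspell` row, here is a toy sketch in Python. It is not the Hunspell algorithm and does not read real dictionary files: the `TOY_STEMS` table and the `stem_tokens` helper are hypothetical stand-ins, and passing unknown words through unchanged is an assumption of this sketch. It only shows the shape of the output, where several stems share the position of the token they came from.

```python
# Toy lookup table standing in for a Hunspell dictionary (hypothetical data).
TOY_STEMS = {
    "saw": ["saw", "see"],
    "leaves": ["leaf", "leave"],
}


def stem_tokens(tokens, dedup=True):
    """Return (stem, position) pairs; one input token may yield several stems."""
    out = []
    for position, token in enumerate(tokens):
        # In this sketch, words missing from the table pass through unchanged.
        stems = TOY_STEMS.get(token, [token])
        if dedup:
            # Drop repeated stems while keeping their original order.
            stems = list(dict.fromkeys(stems))
        out.extend((stem, position) for stem in stems)
    return out


result = stem_tokens(["she", "saw", "leaves"])
print(result)
# [('she', 0), ('saw', 1), ('see', 1), ('leaf', 2), ('leave', 2)]
```

Three input tokens produce five output tokens here, which is why a query analyzed with this kind of filter can match more than one base form of a word.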
Collaborator

Line 33, second sentence: "Because Hunspell allows a word to have multiple stems"?

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
@kolchfa-aws kolchfa-aws merged commit 01c0d49 into opensearch-project:main Nov 14, 2024
5 checks passed
@kolchfa-aws kolchfa-aws added the backport 2.18 PR: Backport label for 2.18 label Nov 14, 2024
opensearch-trigger-bot bot pushed a commit that referenced this pull request Nov 14, 2024
* adding hunspell token filter #8061

Signed-off-by: Anton Rubin <[email protected]>

* adding dedup and example where to download files

Signed-off-by: Anton Rubin <[email protected]>

* Update hunspell.md

Signed-off-by: AntonEliatra <[email protected]>

* Update hunspell.md

Signed-off-by: AntonEliatra <[email protected]>

* updating parameter table

Signed-off-by: Anton Rubin <[email protected]>

* Apply suggestions from code review

Signed-off-by: kolchfa-aws <[email protected]>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>

* Update _analyzers/token-filters/hunspell.md

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>

---------

Signed-off-by: Anton Rubin <[email protected]>
Signed-off-by: AntonEliatra <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Co-authored-by: kolchfa-aws <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
(cherry picked from commit 01c0d49)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
github-actions bot pushed a commit that referenced this pull request Nov 14, 2024
epugh pushed a commit to o19s/documentation-website that referenced this pull request Nov 23, 2024
…#8070)

Signed-off-by: Eric Pugh <[email protected]>
Labels: 5 - Editorial review, analyzers, backport 2.18, Content gap, Needs SME
Development

Successfully merging this pull request may close these issues.

Token filters - hunspell [DOC]
4 participants