
max_ngram_diff problem in _analyze api #56586

Closed
aninda052 opened this issue May 12, 2020 · 3 comments
Labels
>bug · :Search Relevance/Analysis (How text is split into tokens) · Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch)

Comments


aninda052 commented May 12, 2020

Elasticsearch 7.6.2

I'm trying to test an analyzer using the _analyze API. In my filter I use 'ngram' with 'min_gram' = 3 and 'max_gram' = 8. Because "The difference between max_gram and min_gram in NGram Tokenizer must be less than or equal to 1", I can't use ngram with my desired settings, and I cannot set "max_ngram_diff" in the _analyze API. Is there any way I can test my analyzer?

My analyzer:

{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "ngram",
      "min_gram": 3,
      "max_gram": 8,
      "token_chars": [ "letter", "digit" ]
    },
    {
      "type": "stop",
      "stopwords": "_english_",
      "ignore_case": true
    }
  ],
  "text": "test text"
}


aninda052 added the >bug and needs:triage (Requires assignment of a team area label) labels on May 12, 2020
albertzaharovits (Contributor) commented May 12, 2020

@aninda052 Thank you for raising this to our attention.

I've identified two issues from your description:

  1. There is a documentation problem with the analyzer request parameter: it can only take a string value, not an object. Hence the correctly formatted request for the above must be:
curl -X GET "localhost:9200/_analyze?pretty&error_trace=true" -H 'Content-Type: application/json' -d'
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 3,
    "max_gram": 5,
    "token_chars": [
        "letter",
        "digit"
    ]
  },
  "filter": [
    "lowercase"
  ],
  "text" : "Quick Brown Foxes!"
}
'
  2. There is an obscure ngram "slack" limit (the index.max_ngram_diff setting) that cannot be changed for analyze requests without an index, which triggers exceptions such as:
"[The difference between max_gram and min_gram in NGram Tokenizer must be less than or equal to: [1] but was [5]. This limit can be set by changing the [index.max_ngram_diff] index level setting.]; nested: IllegalArgumentException[The difference between max_gram and min_gram in NGram Tokenizer must be less than or equal to: [1] but was [5]. This limit can be set by changing the [index.max_ngram_diff] index level setting.];at org.elasticsearch.ElasticsearchException.guessRootCauses(ElasticsearchException.java:644)
at org.elasticsearch.ElasticsearchException.generateFailureXContent(ElasticsearchException.java:572)
at org.elasticsearch.rest.BytesRestResponse.build(BytesRestResponse.java:138)
at org.elasticsearch.rest.BytesRestResponse.<init>(BytesRestResponse.java:96)
at org.elasticsearch.rest.BytesRestResponse.<init>(BytesRestResponse.java:91)
at org.elasticsearch.rest.action.RestActionListener.onFailure(RestActionListener.java:58)
at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:79)
at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction$1.handleException(TransportSingleShardAction.java:197)
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1130)
at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1239)
at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1213)
at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:60)
at org.elasticsearch.action.support.ChannelActionListener.onFailure(ChannelActionListener.java:56)
at org.elasticsearch.action.ActionRunnable.onFailure(ActionRunnable.java:88)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onFailure(ThreadContext.java:680)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:39)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
at java.base/java.lang.Thread.run(Thread.java:832)
Caused by: java.lang.IllegalArgumentException: The difference between max_gram and min_gram in NGram Tokenizer must be less than or equal to: [1] but was [5]. This limit can be set by changing the [index.max_ngram_diff] index level setting.
at org.elasticsearch.analysis.common.NGramTokenizerFactory.<init>(NGramTokenizerFactory.java:119)
at org.elasticsearch.index.analysis.AnalysisRegistry.getComponentFactory(AnalysisRegistry.java:131)
at org.elasticsearch.index.analysis.AnalysisRegistry.buildCustomAnalyzer(AnalysisRegistry.java:229)
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.buildCustomAnalyzer(TransportAnalyzeAction.java:200)
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.analyze(TransportAnalyzeAction.java:131)
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:121)
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:73)
at org.elasticsearch.action.support.single.shard.TransportSingleShardAction.lambda$asyncShardOperation$0(TransportSingleShardAction.java:110)
at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:58)
at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)

Can someone from @elastic/es-search investigate further, please?

albertzaharovits added the :Search Relevance/Analysis (How text is split into tokens) label and removed the needs:triage label on May 12, 2020
elasticmachine added the Team:Search (Meta label for search team) label on May 12, 2020
jtibshirani (Contributor) commented:

I opened #56650 to address the docs issue that @albertzaharovits found.

As for the ngram diff issue, here's a suggested workaround:

  • Create a temporary test index with the setting index.max_ngram_diff increased to a larger number.
  • Run the analyze API against that index, as in GET /my-test-index/_analyze ... (a sketch follows below).
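
A minimal sketch of those two steps, reusing the filter settings from the original report (the index name my-test-index and the index.max_ngram_diff value of 10 are placeholders; token_chars is omitted because it is a parameter of the ngram tokenizer, not of the ngram token filter):

curl -X PUT "localhost:9200/my-test-index?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.max_ngram_diff": 10
  }
}
'
curl -X GET "localhost:9200/my-test-index/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "ngram",
      "min_gram": 3,
      "max_gram": 8
    },
    {
      "type": "stop",
      "stopwords": "_english_",
      "ignore_case": true
    }
  ],
  "text": "test text"
}
'

The temporary index can be deleted afterwards with DELETE /my-test-index.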

To me this is a reasonable limitation of the analyze API, and doesn't seem like a bug. I don't think we would want to add the ability to pass index settings when testing out a transient custom analysis chain.

aninda052 (Author) commented:

Thanks for the suggestion, @jtibshirani :)

javanna added the Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch) label and removed the Team:Search label on Jul 12, 2024