-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Analysis] Support normalizer in request param #24767
[Analysis] Support normalizer in request param #24767
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@johtani I like this PR and had fun reviewing it and learning more about this analysis feature. I left some comments but I have to appologize in advance that I'm not an expert in this area yet, however I hope the comments might be useful
@@ -222,6 +237,9 @@ public void readFrom(StreamInput in) throws IOException { | |||
field = in.readOptionalString(); | |||
explain = in.readBoolean(); | |||
attributes = in.readStringArray(); | |||
if (in.getVersion().onOrAfter(Version.V_6_0_0_alpha1_UNRELEASED)) { | |||
normalizer = in.readOptionalString(); | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can start having a unit test for the AnalyzeRequest in which e.g. the validate method and the serialization can be checked.
analyzeRequest.normalizer(parser.text()); | ||
} else { | ||
throw new IllegalArgumentException(currentFieldName + " should be normalizer's name"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test for this parsing part to RestAnalyzeActionTests?
((request.tokenFilters() != null && request.tokenFilters().size() > 0) | ||
|| (request.charFilters() != null && request.charFilters().size() > 0))) { | ||
// normalizer + (tokenizer/analyzer) = no error, just ignore normalizer param | ||
final IndexSettings indexSettings = indexAnalyzers == null ? null : indexAnalyzers.getIndexSettings(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can maybe go inside the following else
branch.
final IndexSettings indexSettings = indexAnalyzers == null ? null : indexAnalyzers.getIndexSettings(); | ||
if (request.normalizer() != null) { | ||
// Get normalizer from indexanalyzers | ||
analyzer = indexAnalyzers.getNormalizer(request.normalizer()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A question out of curiosity: the analyzer we get here doesn't have to be closed (via closeAnalyzer) because its not a new instance? I don't know enough about the lifecycle of these objects yet I'm afraid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it already exists instance that created by IndexService or something. Only close if TransportAnalyzeAction create CustomAnalyzer
} else if (request.normalizer() != null || | ||
((request.tokenFilters() != null && request.tokenFilters().size() > 0) | ||
|| (request.charFilters() != null && request.charFilters().size() > 0))) { | ||
// normalizer + (tokenizer/analyzer) = no error, just ignore normalizer param |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be better to throw an error here? As far as I see specifying a normalizer and analyzer or tokenizer doesn't make sense? This combination can already be detected earlier on the request I think (is validate()) always called?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add check logic in request.validate() method.
Unfortunately, it is not always called. If you call shardOperation yourself directly, validate() method is not called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I think its better than nothing
@@ -189,6 +190,44 @@ public static AnalyzeResponse analyze(AnalyzeRequest request, String field, Anal | |||
|
|||
analyzer = new CustomAnalyzer(tokenizerFactory, charFilterFactories, tokenFilterFactories); | |||
closeAnalyzer = true; | |||
} else if (request.normalizer() != null || | |||
((request.tokenFilters() != null && request.tokenFilters().size() > 0) | |||
|| (request.charFilters() != null && request.charFilters().size() > 0))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be split into the two cases request.normalizer() != null
and (request.tokenFilters() != null && request.tokenFilters().size() > 0) || (request.charFilters() != null && request.charFilters().size() > 0)
in two separate else if
blocks instead of separating these cases later? I'm not entirely sure if this works, but I think it would make this part easier to read.
|
||
assertEquals(1, tokens.size()); | ||
assertEquals("abc", tokens.get(0).getTerm()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to add a test for the second code path added in this PR (the case where normalizer == null but filter or char_filter is not null and tokenizer/analyzer is null)? I don't know if it is possible with this test setup but it might be useful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I added that test case in rest api test. Now, we are moving to filter/char_filter to analysis-common module, so I think it would be better than in this test class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense
|
||
Will cause the analysis to happen based on the analyzer configured in the | ||
mapping for `obj1.field1` (and if not, the default index analyzer). | ||
|
||
A `normalizer` can be provided for keyword field with normalizer associated with the `twitter` index. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replace twitter
with the new index name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch :)
==== Support custom normalizer in Analyze API | ||
|
||
Analyze API can analyze normalizer and custom normalizer. | ||
In previous versions of Elasticsearch, Analyze API is required `tokenizer` or `analyzer` parameter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: "is requiring a"
|
||
Analyze API can analyze normalizer and custom normalizer. | ||
In previous versions of Elasticsearch, Analyze API is required `tokenizer` or `analyzer` parameter. | ||
In Elaticsearch 6.0.0, Analyze API analyze a text as a keyword field with custom normalizer if `char_filter`/`filter` without `tokenizer`/`analyzer`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: "can analyze",
nit: "... or if char_filter
/filter
is set and tokenizer
/analyzer
is not set"
@elasticmachine test this please |
a2dbf1d
to
39c3eec
Compare
@cbuescher Passed CI, please review again after the conference :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @johtani, LGTM.
I left a few minor comments, feel free to adapt or simply ignore them. The question I left is only for my own understanding.
requestAnalyzer.analyzer("analyzer"); | ||
e = requestAnalyzer.validate(); | ||
assertTrue(e.getMessage().contains("tokenizer/analyze should be null if normalizer is specified")); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++ thanks for adding these checks
@@ -122,6 +124,17 @@ public void testParseXContentForAnalyzeRequestWithInvalidStringExplainParamThrow | |||
assertThat(e.getMessage(), startsWith("explain must be either 'true' or 'false'")); | |||
} | |||
|
|||
public void testParseXContentForAnalyzeRequestWithInvalidNromalizerThrowsException() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s/Nromalizer/Normalizer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch :)
} | ||
|
||
public void testSerializationBwc() throws IOException { | ||
final byte[] data = Base64.getDecoder().decode("AAABA2ZvbwEEdGV4dAAAAAAAAAABCm5vcm1hbGl6ZXI="); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More of a question: I see how we use this in other bwc tests as well, I guess it represents the request. How did you get that String, do we have tools for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure... I made the string using Base64.getEncoder() and sysout...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment saying what request it represents and which version it has been generated with?
final byte[] data = Base64.getDecoder().decode("AAABA2ZvbwEEdGV4dAAAAAAAAAABCm5vcm1hbGl6ZXI="); | ||
final Version version = randomFrom(Version.V_5_0_0, Version.V_5_0_1, Version.V_5_0_2, | ||
Version.V_5_1_1, Version.V_5_1_2, Version.V_5_3_0, Version.V_5_3_1, Version.V_5_3_2, | ||
Version.V_5_4_0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe use VandomUtils#randomVersionBetween()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, good to know. I don't know it :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please call getMultiTermComponent
on factories, but otherwise it looks good to me!
throw new IllegalArgumentException("Custom normalizer may not use filter [" | ||
+ tokenFilter.name() + "]"); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like you are missing the call to MultiTermAwareComponent.getMultiTermComponent
?
} | ||
|
||
public void testSerializationBwc() throws IOException { | ||
final byte[] data = Base64.getDecoder().decode("AAABA2ZvbwEEdGV4dAAAAAAAAAABCm5vcm1hbGl6ZXI="); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment saying what request it represents and which version it has been generated with?
Support normalizer param Support custom normalizer with char_filter/filter param Closes elastic#23347
Add AnalyzeRequestTest Fix some comments
Fix some comments Remove non-use imports elastic#23347
6b06274
to
8d72356
Compare
Fix some comments
8d72356
to
c6dd360
Compare
@jpountz Rebased master and moved check and call logic into parseTokenFilterFactories |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
* master: [Analysis] Support normalizer in request param (elastic#24767) Remove deprecated IdsQueryBuilder constructor (elastic#25529) Adds check for negative search request size (elastic#25397) test: also inspect the upgrade api response to check whether the upgrade really ran [DOCS] restructure java clients docs pages (elastic#25517)
Support normalizer param and custom normalizer with char_filter/filter param.
In this PR, I didn't change a response.
If user send a request with keyword field name or normalizer name, analyze api display a response with tokenizer that is KeywordTokenizer.
Should we change a response format for normalizer?
Closes #23347