[Analysis] Support normalizer in request param #24767

johtani · 2017-05-18T08:54:28Z

Support normalizer param and custom normalizer with char_filter/filter param.

In this PR, I didn't change the response format.
If a user sends a request with a keyword field name or a normalizer name, the analyze API returns a response whose tokenizer is KeywordTokenizer.
Should we change the response format for normalizers?

Closes #23347

cbuescher

@johtani I like this PR and had fun reviewing it and learning more about this analysis feature. I left some comments, but I have to apologize in advance that I'm not an expert in this area yet. I hope the comments are useful anyway.

cbuescher · 2017-05-31T13:17:16Z

core/src/main/java/org/elasticsearch/action/admin/indices/analyze/AnalyzeRequest.java

@@ -222,6 +237,9 @@ public void readFrom(StreamInput in) throws IOException {
        field = in.readOptionalString();
        explain = in.readBoolean();
        attributes = in.readStringArray();
+        if (in.getVersion().onOrAfter(Version.V_6_0_0_alpha1_UNRELEASED)) {
+            normalizer = in.readOptionalString();
+        }


Maybe we can start having a unit test for the AnalyzeRequest in which e.g. the validate method and the serialization can be checked.
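
To make the suggestion concrete, here is a minimal, self-contained sketch of the kind of round trip such a test could exercise. It does not use the Elasticsearch classes: `MiniAnalyzeRequest`, `NORMALIZER_VERSION`, and the integer wire version are hypothetical stand-ins for `AnalyzeRequest`, the version constant in the diff, and `StreamInput`/`StreamOutput`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for AnalyzeRequest; not the Elasticsearch class.
class MiniAnalyzeRequest {
    // Assumed wire version in which the normalizer field was introduced.
    static final int NORMALIZER_VERSION = 6;

    String field;
    boolean explain;
    String normalizer;

    // Writes an optional string as a presence flag followed by the value.
    private static void writeOptional(DataOutputStream out, String value) throws IOException {
        out.writeBoolean(value != null);
        if (value != null) {
            out.writeUTF(value);
        }
    }

    private static String readOptional(DataInputStream in) throws IOException {
        return in.readBoolean() ? in.readUTF() : null;
    }

    byte[] serialize(int wireVersion) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            writeOptional(out, field);
            out.writeBoolean(explain);
            // Only peers on or after the gate version understand the normalizer field.
            if (wireVersion >= NORMALIZER_VERSION) {
                writeOptional(out, normalizer);
            }
        }
        return bytes.toByteArray();
    }

    static MiniAnalyzeRequest deserialize(byte[] data, int wireVersion) throws IOException {
        MiniAnalyzeRequest request = new MiniAnalyzeRequest();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            request.field = readOptional(in);
            request.explain = in.readBoolean();
            if (wireVersion >= NORMALIZER_VERSION) {
                request.normalizer = readOptional(in);
            }
        }
        return request;
    }

    // Serializes and deserializes with the same wire version, as a test would.
    static MiniAnalyzeRequest roundTrip(MiniAnalyzeRequest request, int wireVersion) {
        try {
            return deserialize(request.serialize(wireVersion), wireVersion);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A test along these lines would check that the normalizer survives a round trip at the gate version and is dropped (left `null`) when talking to an older version.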

cbuescher · 2017-05-31T13:18:40Z

core/src/main/java/org/elasticsearch/rest/action/admin/indices/RestAnalyzeAction.java

+                        analyzeRequest.normalizer(parser.text());
+                    } else {
+                        throw new IllegalArgumentException(currentFieldName + " should be normalizer's name");
+                    }


Can you add a test for this parsing part to RestAnalyzeActionTests?

cbuescher · 2017-05-31T13:31:12Z

core/src/main/java/org/elasticsearch/action/admin/indices/analyze/TransportAnalyzeAction.java

+            ((request.tokenFilters() != null && request.tokenFilters().size() > 0)
+                || (request.charFilters() != null && request.charFilters().size() > 0))) {
+            // normalizer + (tokenizer/analyzer) = no error, just ignore normalizer param
+            final IndexSettings indexSettings = indexAnalyzers == null ? null : indexAnalyzers.getIndexSettings();


This can maybe go inside the following else branch.

cbuescher · 2017-05-31T13:41:25Z

core/src/main/java/org/elasticsearch/action/admin/indices/analyze/TransportAnalyzeAction.java

+            final IndexSettings indexSettings = indexAnalyzers == null ? null : indexAnalyzers.getIndexSettings();
+            if (request.normalizer() != null) {
+                // Get normalizer from indexanalyzers
+                analyzer = indexAnalyzers.getNormalizer(request.normalizer());


A question out of curiosity: the analyzer we get here doesn't have to be closed (via closeAnalyzer) because it's not a new instance? I don't know enough about the lifecycle of these objects yet, I'm afraid.

Yes, it is an existing instance that was created by IndexService or something similar. We only close the analyzer if TransportAnalyzeAction created the CustomAnalyzer itself.
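
The ownership rule described here can be sketched as follows. This is an illustration only, under the assumption that the rule is "close only what you created"; `Resource` and `run` are hypothetical names and not part of TransportAnalyzeAction.

```java
class AnalyzerOwnershipSketch {
    // Hypothetical resource standing in for an analyzer.
    static final class Resource implements AutoCloseable {
        boolean closed = false;

        @Override
        public void close() {
            closed = true;
        }
    }

    // Uses a shared instance if one is given, otherwise builds its own.
    // Only an instance created here is closed before returning.
    static Resource run(Resource shared) {
        Resource analyzer;
        boolean closeAnalyzer;
        if (shared != null) {
            analyzer = shared;          // owned by someone else, e.g. the index
            closeAnalyzer = false;
        } else {
            analyzer = new Resource();  // created for this request only
            closeAnalyzer = true;
        }
        try {
            // ... analysis would happen here ...
        } finally {
            if (closeAnalyzer) {
                analyzer.close();
            }
        }
        return analyzer;
    }
}
```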

cbuescher · 2017-05-31T13:47:20Z

core/src/main/java/org/elasticsearch/action/admin/indices/analyze/TransportAnalyzeAction.java

+        } else if (request.normalizer() != null ||
+            ((request.tokenFilters() != null && request.tokenFilters().size() > 0)
+                || (request.charFilters() != null && request.charFilters().size() > 0))) {
+            // normalizer + (tokenizer/analyzer) = no error, just ignore normalizer param


Wouldn't it be better to throw an error here? As far as I can see, specifying a normalizer together with an analyzer or tokenizer doesn't make sense. I think this combination can already be detected earlier, on the request (is validate() always called?).

I will add the check to the request.validate() method.
Unfortunately, it is not always called. If you call shardOperation directly yourself, the validate() method is not called.

Thanks, I think it's better than nothing.
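
A minimal sketch of the check discussed in this thread, assuming validation simply collects error messages. `NormalizerValidation` and its `validate` signature are hypothetical; only the error message text is taken from the test added later in this PR.

```java
import java.util.ArrayList;
import java.util.List;

class NormalizerValidation {
    /** Returns the validation errors; an empty list means the combination is valid. */
    static List<String> validate(String normalizer, String analyzer, String tokenizer) {
        List<String> errors = new ArrayList<>();
        // A normalizer cannot be combined with an analyzer or a tokenizer.
        if (normalizer != null && (analyzer != null || tokenizer != null)) {
            errors.add("tokenizer/analyze should be null if normalizer is specified");
        }
        return errors;
    }
}
```

As noted above, a check like this only helps on code paths that actually call validate().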

cbuescher · 2017-05-31T13:57:46Z

core/src/main/java/org/elasticsearch/action/admin/indices/analyze/TransportAnalyzeAction.java

@@ -189,6 +190,44 @@ public static AnalyzeResponse analyze(AnalyzeRequest request, String field, Anal

            analyzer = new CustomAnalyzer(tokenizerFactory, charFilterFactories, tokenFilterFactories);
            closeAnalyzer = true;
+        } else if (request.normalizer() != null ||
+            ((request.tokenFilters() != null && request.tokenFilters().size() > 0)
+                || (request.charFilters() != null && request.charFilters().size() > 0))) {


Can this be split into the two cases request.normalizer() != null and (request.tokenFilters() != null && request.tokenFilters().size() > 0) || (request.charFilters() != null && request.charFilters().size() > 0) in two separate else if blocks instead of separating these cases later? I'm not entirely sure if this works, but I think it would make this part easier to read.
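
Roughly what I have in mind, as a standalone sketch rather than a patch: `AnalysisPathSketch`, `Path`, and `hasAny` are made-up names, and the sketch assumes the analyzer and tokenizer cases have already been handled by earlier branches.

```java
import java.util.List;

class AnalysisPathSketch {
    enum Path { NAMED_NORMALIZER, CUSTOM_NORMALIZER, DEFAULT }

    private static boolean hasAny(List<String> values) {
        return values != null && !values.isEmpty();
    }

    // Assumes analyzer/tokenizer were already ruled out by earlier branches.
    static Path choose(String normalizer, List<String> tokenFilters, List<String> charFilters) {
        if (normalizer != null) {
            // Case 1: look up an existing normalizer by name.
            return Path.NAMED_NORMALIZER;
        } else if (hasAny(tokenFilters) || hasAny(charFilters)) {
            // Case 2: build a custom normalizer from the given filters.
            return Path.CUSTOM_NORMALIZER;
        } else {
            return Path.DEFAULT;
        }
    }
}
```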

cbuescher · 2017-05-31T14:05:56Z

core/src/test/java/org/elasticsearch/action/admin/indices/TransportAnalyzeActionTests.java


+        assertEquals(1, tokens.size());
+        assertEquals("abc", tokens.get(0).getTerm());


Would it be possible to add a test for the second code path added in this PR (the case where normalizer == null but filter or char_filter is not null and tokenizer/analyzer is null)? I don't know if it is possible with this test setup but it might be useful

Ah, I added that test case to the REST API tests. We are now moving filter/char_filter to the analysis-common module, so I think it fits better there than in this test class.

makes sense

cbuescher · 2017-05-31T14:11:16Z

docs/reference/indices/analyze.asciidoc


 Will cause the analysis to happen based on the analyzer configured in the
 mapping for `obj1.field1` (and if not, the default index analyzer).

+A `normalizer` can be provided for keyword field with normalizer associated with the `twitter` index.


replace twitter with the new index name

good catch :)

cbuescher · 2017-05-31T14:14:45Z

docs/reference/migration/migrate_6_0/rest.asciidoc

+==== Support custom normalizer in Analyze API
+
+Analyze API can analyze normalizer and custom normalizer.
+In previous versions of Elasticsearch, Analyze API is required `tokenizer` or `analyzer` parameter.


nit: "is requiring a"

cbuescher · 2017-05-31T14:17:03Z

docs/reference/migration/migrate_6_0/rest.asciidoc

+
+Analyze API can analyze normalizer and custom normalizer.
+In previous versions of Elasticsearch, Analyze API is required `tokenizer` or `analyzer` parameter.
+In Elaticsearch 6.0.0, Analyze API analyze a text as a keyword field with custom normalizer if `char_filter`/`filter` without `tokenizer`/`analyzer`.


nit: "can analyze",
nit: "... or if char_filter/filter is set and tokenizer/analyzer is not set"

johtani · 2017-06-11T10:35:14Z

@elasticmachine test this please

johtani · 2017-06-12T10:35:11Z

@cbuescher Passed CI, please review again after the conference :)

cbuescher

Thanks @johtani, LGTM.
I left a few minor comments, feel free to adapt or simply ignore them. The question I left is only for my own understanding.

cbuescher · 2017-06-20T10:16:55Z

core/src/test/java/org/elasticsearch/action/admin/indices/analyze/AnalyzeRequestTests.java

+        requestAnalyzer.analyzer("analyzer");
+        e = requestAnalyzer.validate();
+        assertTrue(e.getMessage().contains("tokenizer/analyze should be null if normalizer is specified"));
+    }


++ thanks for adding these checks

cbuescher · 2017-06-20T10:18:04Z

core/src/test/java/org/elasticsearch/rest/action/admin/indices/RestAnalyzeActionTests.java

@@ -122,6 +124,17 @@ public void testParseXContentForAnalyzeRequestWithInvalidStringExplainParamThrow
        assertThat(e.getMessage(), startsWith("explain must be either 'true' or 'false'"));
    }

+    public void testParseXContentForAnalyzeRequestWithInvalidNromalizerThrowsException() throws Exception {


nit: s/Nromalizer/Normalizer

Good catch :)

cbuescher · 2017-06-20T10:48:21Z

core/src/test/java/org/elasticsearch/action/admin/indices/analyze/AnalyzeRequestTests.java

+    }
+
+    public void testSerializationBwc() throws IOException {
+        final byte[] data = Base64.getDecoder().decode("AAABA2ZvbwEEdGV4dAAAAAAAAAABCm5vcm1hbGl6ZXI=");


More of a question: I see how we use this in other bwc tests as well, I guess it represents the request. How did you get that String, do we have tools for that?

I'm not sure... I made the string using Base64.getEncoder() and sysout...

can you add a comment saying what request it represents and which version it has been generated with?
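
For anyone who wants to see what the string contains, the sketch below decodes it with `java.util.Base64` and prints the printable bytes. `BwcBytesSketch` and its methods are hypothetical helpers; the thread does not say how the remaining non-printable bytes map onto request fields, so the sketch makes no claim about that.

```java
import java.util.Base64;

class BwcBytesSketch {
    // The string used in testSerializationBwc.
    static final String ENCODED = "AAABA2ZvbwEEdGV4dAAAAAAAAAABCm5vcm1hbGl6ZXI=";

    static byte[] decode() {
        return Base64.getDecoder().decode(ENCODED);
    }

    // The reverse step: how such a string can be produced from serialized bytes.
    static String encode(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // Replaces non-printable bytes with '.' so that embedded strings become visible.
    static String printable(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(b >= 0x20 && b < 0x7f ? (char) b : '.');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(printable(decode()));
    }
}
```

The decoded bytes contain the ASCII strings `foo`, `text`, and `normalizer`, separated by length and flag bytes.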

cbuescher · 2017-06-20T10:49:36Z

core/src/test/java/org/elasticsearch/action/admin/indices/analyze/AnalyzeRequestTests.java

+        final byte[] data = Base64.getDecoder().decode("AAABA2ZvbwEEdGV4dAAAAAAAAAABCm5vcm1hbGl6ZXI=");
+        final Version version = randomFrom(Version.V_5_0_0, Version.V_5_0_1, Version.V_5_0_2,
+            Version.V_5_1_1, Version.V_5_1_2, Version.V_5_3_0, Version.V_5_3_1, Version.V_5_3_2,
+            Version.V_5_4_0);


nit: maybe use VersionUtils#randomVersionBetween()

Oh, good to know. I didn't know about it :)

jpountz

Please call getMultiTermComponent on factories, but otherwise it looks good to me!

jpountz · 2017-06-22T10:31:06Z

core/src/main/java/org/elasticsearch/action/admin/indices/analyze/TransportAnalyzeAction.java

+                    throw new IllegalArgumentException("Custom normalizer may not use filter ["
+                        + tokenFilter.name() + "]");
+                }
+            }


looks like you are missing the call to MultiTermAwareComponent.getMultiTermComponent?

jpountz · 2017-06-22T10:33:08Z

core/src/test/java/org/elasticsearch/action/admin/indices/analyze/AnalyzeRequestTests.java

+    }
+
+    public void testSerializationBwc() throws IOException {
+        final byte[] data = Base64.getDecoder().decode("AAABA2ZvbwEEdGV4dAAAAAAAAAABCm5vcm1hbGl6ZXI=");


can you add a comment saying what request it represents and which version it has been generated with?

johtani · 2017-06-28T08:09:06Z

@jpountz I rebased onto master and moved the check and call logic into parseTokenFilterFactories.
Could you review this again?

jpountz

LGTM

* master: [Analysis] Support normalizer in request param (elastic#24767) Remove deprecated IdsQueryBuilder constructor (elastic#25529) Adds check for negative search request size (elastic#25397) test: also inspect the upgrade api response to check whether the upgrade really ran [DOCS] restructure java clients docs pages (elastic#25517)

johtani added the :Search Relevance/Analysis, >enhancement, v6.0.0, and review labels May 18, 2017

cbuescher requested changes May 31, 2017

johtani force-pushed the support_normalizer_in_analyze_api branch 4 times, most recently from a2dbf1d to 39c3eec June 12, 2017 05:47

cbuescher self-assigned this Jun 20, 2017

cbuescher approved these changes Jun 20, 2017

jpountz requested changes Jun 22, 2017

johtani added 3 commits June 26, 2017 17:32

[Analysis] Support normalizer in request param

012749d

Support normalizer param Support custom normalizer with char_filter/filter param Closes elastic#23347

[Analysis] Support normalizer in request param

6dc62f5

Add AnalyzeRequestTest Fix some comments

[Analysis] Support normalizer in request param

37bb2f1

Fix some comments Remove non-use imports elastic#23347

johtani force-pushed the support_normalizer_in_analyze_api branch from 6b06274 to 8d72356 June 27, 2017 22:34

[Analysis] Support normalizer in request param

c6dd360

Fix some comments

johtani force-pushed the support_normalizer_in_analyze_api branch from 8d72356 to c6dd360 June 28, 2017 06:49

jpountz approved these changes Jul 4, 2017

johtani merged commit 6894ef6 into elastic:master Jul 4, 2017

clintongormley added v6.0.0-beta1 and removed v6.0.0 labels Jul 25, 2017

jimczi mentioned this pull request Sep 4, 2017

_analyze API skips char_filter when no tokenizer/filters specified #26495

Closed
