Docs: Test examples that recreate lang analyzers #29535

nik9000 · 2018-04-16T14:56:20Z

We have a pile of documentation describing how to rebuild the built in
language analyzers and, previously, our documentation testing framework
made sure that the examples successfully built *an* analyzer but they
didn't assert that the analyzer built by the documentation matches the
built in analyzer. Unsurprisingly, some of the examples aren't quite
right.

This adds a mechanism that tests the analyzers built by the docs.
The mechanism is fairly simple and brutal but it seems to be working:
build a hundred random unicode sequences and send them through the
`_analyze` API with the rebuilt analyzer and then again through the
built in analyzer. Then make sure both calls return the same results.
Each of these calls to `_analyze` takes about 20ms on my laptop which
seems fine.
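
A minimal sketch of the comparison idea, not the code in this PR. It assumes each analyzer can be modeled as a function from text to a list of tokens; in the real test those would be two `_analyze` calls. `AnalyzerComparisonSketch`, `randomTexts`, and `firstMismatch` are hypothetical names, and the sketch uses plain ASCII letters where the real test uses random realistic unicode.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Function;

final class AnalyzerComparisonSketch {
    /** Builds {@code count} random strings of 1 to 15 code points each. */
    static List<String> randomTexts(Random random, int count) {
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            int length = 1 + random.nextInt(15);
            StringBuilder b = new StringBuilder();
            for (int j = 0; j < length; j++) {
                // ASCII lowercase letters only, to keep the sketch simple.
                b.appendCodePoint('a' + random.nextInt(26));
            }
            texts.add(b.toString());
        }
        return texts;
    }

    /** Returns the first text on which the two analyzers disagree, or null if they always agree. */
    static String firstMismatch(List<String> texts,
                                Function<String, List<String>> rebuilt,
                                Function<String, List<String>> builtIn) {
        for (String text : texts) {
            if (false == rebuilt.apply(text).equals(builtIn.apply(text))) {
                return text;
            }
        }
        return null;
    }
}
```

A `null` result means the rebuilt analyzer agreed with the built in one on every sample; anything else is an input worth reproducing by hand.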

Related to #29499

elasticmachine · 2018-04-16T14:56:23Z

Pinging @elastic/es-core-infra

nik9000 · 2018-04-16T15:01:00Z

I've filed this as ":core/build" because ":core/build" includes testing infrastructure. It might be more correct to file it as ":search/analysis" but the bulk of the work is in the testing infrastructure.

I thought of three ways to do this:

1. Write a unit test that parses the docs looking for these snippets and runs the tests.
2. Create special syntax in the docs tests that kicks off these tests.
3. Allow the docs tests to write things in YAML and have that kick off the test.

I opted for option 3 because it seemed the simplest thing at the time. Option 2 would have been fairly simple as well. Option 1 might produce faster tests but would require a lot of extra complexity that reproduces some of the docs testing infrastructure. Since option 3 takes ~20 milliseconds per language I think we're ok as far as speed is concerned.

nik9000 · 2018-04-16T15:02:21Z

We also talked about doing this using some mechanism to force us to manually check the lucene analyzers when we upgrade lucene. Given how quickly these tests run I don't think that is worth it.

mayya-sharipova · 2018-04-17T11:10:42Z

docs/src/test/java/org/elasticsearch/smoketest/DocsClientYamlTestSuiteIT.java

+            for (int i = 0; i < size; i++) {
+                testText.add(randomRealisticUnicodeOfCodepointLength(between(1, 15))
+                    // Don't look up stashed values
+                    .replace("$", "\\$"));


Do strings generated from randomRealisticUnicodeOfCodepointLength also contain spaces, punctuation marks etc, so that we can test a tokenizer part of analyzers?

It doesn't look like they do. It'd be cool to insert spaces. I don't think I could easily use the same unicode page for both space-separated strings, though that might not be too bad.
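
A rough sketch of what inserting spaces could look like, assuming ASCII words stand in for the random unicode the real test generates. `SpacedTextSketch`, `randomSpacedText`, and `escapeStash` are hypothetical names; only the `replace("$", "\\$")` call mirrors the diff above.

```java
import java.util.Random;

final class SpacedTextSketch {
    /** Joins {@code words} random lowercase words with single spaces, then escapes {@code $}. */
    static String randomSpacedText(Random random, int words) {
        StringBuilder b = new StringBuilder();
        for (int w = 0; w < words; w++) {
            if (w > 0) {
                b.append(' ');
            }
            int length = 1 + random.nextInt(5);
            for (int i = 0; i < length; i++) {
                b.appendCodePoint('a' + random.nextInt(26));
            }
        }
        return escapeStash(b.toString());
    }

    /** Prefixes every {@code $} with a backslash so it is not looked up as a stashed value. */
    static String escapeStash(String text) {
        // String.replace is literal (no regex), so "\\$" is just a backslash followed by a dollar sign.
        return text.replace("$", "\\$");
    }
}
```

Separating words with spaces gives the tokenizer something to split on, which a single unbroken random string does not.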

mayya-sharipova · 2018-04-17T11:11:31Z

docs/reference/analysis/analyzers/lang-analyzer.asciidoc

            "arabic_keywords",
+            "arabic_normalization",


+1 for correcting documentation on all analyzers

nik9000 · 2018-04-19T13:41:53Z

@mayya-sharipova, I've pushed a patch to add spaces. It isn't perfect, but it does add spaces. Have a look at the comment I sent for a more thorough explanation of what I mean.

mayya-sharipova

@nik9000 Thanks for the change, Nik! For the search side (analyzers and a test to test that tokens are similar), I have left a small comment. Other than that everything looks fine.
Maybe somebody from the Core/Infra team can review it as well for the testing infrastructure part.

mayya-sharipova · 2018-04-20T15:18:26Z

docs/src/test/java/org/elasticsearch/smoketest/DocsClientYamlTestSuiteIT.java

+                if (false == secondTokens.hasNext()) {
+                    fail(second + " has fewer tokens than " + first + ". "
+                        + first + " has [" + firstTokens.next() + "] but " + second + " is out of tokens. "
+                        + first + "'s last token was [" + previousFirst + "] and "


I don't see where you assign something to previousFirst and previousSecond besides null?

Yeah. I used to have it working. I'll fix.
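
For illustration, a sketch of how the previous tokens might be tracked while walking both token streams. This is not the fix that was pushed: `TokenDiffSketch` and `diff` are hypothetical, tokens are modeled as plain strings, and the message wording past the point where the excerpt above stops is a guess.

```java
import java.util.Iterator;
import java.util.List;

final class TokenDiffSketch {
    /** Returns null if both token lists match, otherwise a message describing the first difference. */
    static String diff(String first, List<String> firstList, String second, List<String> secondList) {
        Iterator<String> firstTokens = firstList.iterator();
        Iterator<String> secondTokens = secondList.iterator();
        String previousFirst = null;
        String previousSecond = null;
        while (firstTokens.hasNext()) {
            String firstToken = firstTokens.next();
            if (false == secondTokens.hasNext()) {
                return second + " has fewer tokens than " + first + ". "
                    + first + " has [" + firstToken + "] but " + second + " is out of tokens. "
                    + first + "'s last token was [" + previousFirst + "] and "
                    + second + "'s last token was [" + previousSecond + "]";
            }
            String secondToken = secondTokens.next();
            if (false == firstToken.equals(secondToken)) {
                return first + " has [" + firstToken + "] but " + second + " has [" + secondToken + "]";
            }
            // Remember the pair we just compared so a later failure can report it.
            previousFirst = firstToken;
            previousSecond = secondToken;
        }
        if (secondTokens.hasNext()) {
            return second + " has more tokens than " + first + ". "
                + second + " has [" + secondTokens.next() + "] but " + first + " is out of tokens";
        }
        return null;
    }
}
```

The key point is the assignment at the bottom of the loop: without it both `previous` variables stay `null` and the failure message loses its context.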

mayya-sharipova · 2018-04-20T15:20:36Z

docs/src/test/java/org/elasticsearch/smoketest/DocsClientYamlTestSuiteIT.java

+                testText.add(b.toString()
+                    // Don't look up stashed values
+                    .replace("$", "\\$"));
+            }


+1 for this change

This makes the change to the regex smaller and fixes some parse errors I hadn't noticed before.

nik9000 · 2018-04-27T16:26:20Z

I found a few more bugs in the configurations by running with -Dtests.iters=50 and pushed fixes.

nik9000 · 2018-04-27T18:34:38Z

@elasticmachine recheck this please.

dakrone

LGTM, I left really minor nits, but nothing that needs another review

dakrone · 2018-05-07T22:05:33Z

docs/src/test/java/org/elasticsearch/smoketest/DocsClientYamlTestSuiteIT.java

+        private static CompareAnalyzers parse(XContentParser parser) throws IOException {
+            XContentLocation location = parser.getTokenLocation();
+            CompareAnalyzers section = PARSER.parse(parser, location);
+            assert parser.currentToken() == Token.END_OBJECT;


Can you add a message here so it'll be helpful if someone accidentally misses a closing token?
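
Something along these lines, as a sketch: Java's `assert condition : message` form attaches the message to the `AssertionError`. The `Token` enum here is a stand-in for the parser's token type, and `AssertMessageSketch`, `message`, and `checkEndOfSection` are hypothetical names.

```java
final class AssertMessageSketch {
    /** Stand-in for the parser's token type; only the values the sketch needs. */
    enum Token { START_OBJECT, END_OBJECT, FIELD_NAME }

    /** Builds the text shown when parsing stops somewhere other than the end of the object. */
    static String message(Token current) {
        return "expected [" + Token.END_OBJECT + "] but found [" + current + "]; is a closing token missing?";
    }

    static void checkEndOfSection(Token current) {
        // The expression after the colon becomes the AssertionError's message.
        // It is only evaluated when assertions are enabled (-ea) and the condition is false.
        assert current == Token.END_OBJECT : message(current);
    }
}
```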

dakrone · 2018-05-07T22:12:29Z

test/framework/src/main/java/org/elasticsearch/test/rest/yaml/section/ExecutableSection.java

     */
-    NamedXContentRegistry XCONTENT_REGISTRY = new NamedXContentRegistry(Arrays.asList(
+    List<NamedXContentRegistry.Entry> DEFAULT_EXECUTABLE_CONTEXTS = unmodifiableList(Arrays.asList(


I think this can be final

I started out declaring it `public static final` out of habit, but checkstyle failed because those modifiers are redundant: every field declared in an interface is implicitly `public static final`.
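
A small sketch that shows the language rule with reflection. `ExampleSection`, `DEFAULT_NAME`, and `InterfaceFieldSketch` are made-up names and are not part of the test framework.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

interface ExampleSection {
    // No modifiers written, yet the field is implicitly public, static and final.
    String DEFAULT_NAME = "example";
}

final class InterfaceFieldSketch {
    /** Reports whether the named public field carries all three modifiers. */
    static boolean isPublicStaticFinal(Class<?> type, String fieldName) {
        try {
            Field field = type.getField(fieldName);
            int modifiers = field.getModifiers();
            return Modifier.isPublic(modifiers) && Modifier.isStatic(modifiers) && Modifier.isFinal(modifiers);
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException("no public field named " + fieldName, e);
        }
    }
}
```

So writing `final` on such a field changes nothing, which is why a redundant-modifier check flags it.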

dakrone · 2018-05-07T22:12:42Z

test/framework/src/main/java/org/elasticsearch/test/rest/yaml/section/ExecutableSection.java

+     * {@link NamedXContentRegistry} that parses the default list of
+     * {@link ExecutableSection}s available for tests.
+     */
+    NamedXContentRegistry XCONTENT_REGISTRY = new NamedXContentRegistry(DEFAULT_EXECUTABLE_CONTEXTS);


Same here, this could be final I think?

nik9000 added 4 commits April 13, 2018 12:41

Wip

e37aae1

Merge branch 'master' into builtin_analyzer_tests

062f66b

Fix names

0a1d66d

nik9000 added >docs General docs changes >test Issues or PRs that are addressing/adding tests :Delivery/Build Build or test infrastructure v7.0.0 v6.3.0 labels Apr 16, 2018

mayya-sharipova reviewed Apr 17, 2018

nik9000 added 3 commits April 17, 2018 17:20

Merge branch 'master' into builtin_analyzer_tests

5274dcd

Spaces

ec2eef7

Document sytax enhancement

27c9dc1

mayya-sharipova requested changes Apr 20, 2018

nik9000 added 2 commits April 20, 2018 15:46

Merge branch 'master' into builtin_analyzer_tests

d3a87f4

Fixes

bf17504

mayya-sharipova approved these changes Apr 20, 2018

nik9000 added v6.4.0 and removed v6.3.0 labels Apr 25, 2018

nik9000 added 6 commits April 25, 2018 16:28

Merge branch 'master' into builtin_analyzer_tests

90bce00

Merge branch 'master' into builtin_analyzer_tests

91cac14

Move flag

3579f83

This makes the change to the regex smaller and fixes some parse errors I hadn't noticed before.

Merge branch 'master' into builtin_analyzer_tests

881a25d

Fix up irish stemmer

d309405

,

bd10c4a

nik9000 added 6 commits April 27, 2018 11:15

Fix irish better

43b2213

Fix cjk

0e6b62f

.

2c77a97

,

5ba12d1

, againt

f4c2220

Sigh

b58330b

nik9000 added 4 commits April 27, 2018 17:00

Merge branch 'master' into builtin_analyzer_tests

e93cb96

Fix precommit

dcf25b5

Merge branch 'master' into builtin_analyzer_tests

5069ac0

Remove errant class file

3c0f070

dakrone approved these changes May 7, 2018

nik9000 added 3 commits May 7, 2018 19:45

Merge branch 'master' into builtin_analyzer_tests

21b0647

Add warning

8c2150d

Merge branch 'master' into builtin_analyzer_tests

2f8af5c

nik9000 merged commit f9dc868 into elastic:master May 9, 2018

nik9000 mentioned this pull request May 9, 2018

Docs: Document how to rebuild analyzers #30498

Merged

nik9000 added a commit that referenced this pull request May 14, 2018

Docs: Document how to rebuild analyzers (#30498)

9881bfa

Adds documentation for how to rebuild all the built in analyzers and tests for that documentation using the mechanism added in #29535. Closes #29499

nik9000 added a commit that referenced this pull request May 14, 2018

Docs: Document how to rebuild analyzers (#30498)

23dc9b0

Adds documentation for how to rebuild all the built in analyzers and tests for that documentation using the mechanism added in #29535. Closes #29499

cbuescher mentioned this pull request May 18, 2018

[CI] Language analyzer docs failure #30557

Closed

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

mark-vieira added the Team:Delivery Meta label for Delivery team label Nov 11, 2020
