-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Docs: Test examples that recreate lang analyzers #29535
Conversation
We have a pile of documentation describing how to rebuild the built in language analyzers and, previously, our documentation testing framework made sure that the examples successfully built *an* analyzer but they didn't assert that the analyzer built by the documentation matches the built in anlayzer. Unsuprisingly, some of the examples aren't quite right. This adds a mechanism that tests that the analyzers built by the docs. The mechanism is fairly simple and brutal but it seems to be working: build a hundred random unicode sequences and send them through the `_analyze` API with the rebuilt analyzer and then again through the built in analyzer. Then make sure both APIs return the same results. Each of these calls to `_anlayze` takes about 20ms on my laptop which seems fine.
Pinging @elastic/es-core-infra |
I've filed this as ":core/build" because ":core/build" includes testing infrastructure. It might be more correct to file it as ":search/anlaysis" but the bulk of the work is in the testing infrastructure. I thought of three ways to do this:
I opted for option 3 because it seemed the simplest thing at the time. Option two would have been fairly simple as well. Option 1 might produce faster tests but would require a lot of extra complexity that reproduces some of the docs testing infrastructure. Since option 3 takes ~20 milliseconds per language I think we're ok as far as speed is concerned. |
We also talking about doing this using some mechanism to force us to manually check the lucene analyzers when we upgrade lucene. Given how quickly these tests run I don't think that is worth it. |
for (int i = 0; i < size; i++) { | ||
testText.add(randomRealisticUnicodeOfCodepointLength(between(1, 15)) | ||
// Don't look up stashed values | ||
.replace("$", "\\$")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do strings generated from randomRealisticUnicodeOfCodepointLength
also contain spaces, punctuation marks etc, so that we can test a tokenizer part of analyzers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't look like they do. It'd be cool to insert spaces. I don't think I could easily use the same unicode page for both space separated strings though. That might not be too bad though.
"arabic_keywords", | ||
"arabic_normalization", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for correcting documentation on all analyzers
@mayya-sharipova, I've pushed a patch to add spaces. It isn't perfect, but it does add spaces. Have a look at the comment I sent for a more thorough explanation of what I mean. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nik9000 Thanks for the change, Nik! For the search side (analyzers and a test to test that tokens are similar), I have left a small comment. Other than that everything looks fine.
May be somebody from the Core/Infra team can review it as well for the testing infrastructure part.
if (false == secondTokens.hasNext()) { | ||
fail(second + " has fewer tokens than " + first + ". " | ||
+ first + " has [" + firstTokens.next() + "] but " + second + " is out of tokens. " | ||
+ first + "'s last token was [" + previousFirst + "] and " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see where you assign something to previousFirst
and previousSecond
besides null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. I used to have it working. I'll fix.
testText.add(b.toString() | ||
// Don't look up stashed values | ||
.replace("$", "\\$")); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for this change
I found a few more bugs in the configurations by running with |
@elasticmachine recheck this please. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, I left really minor nits, but nothing that needs another review
private static CompareAnalyzers parse(XContentParser parser) throws IOException { | ||
XContentLocation location = parser.getTokenLocation(); | ||
CompareAnalyzers section = PARSER.parse(parser, location); | ||
assert parser.currentToken() == Token.END_OBJECT; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a message here so it'll be helpful if someone accidentally misses a closing token?
*/ | ||
NamedXContentRegistry XCONTENT_REGISTRY = new NamedXContentRegistry(Arrays.asList( | ||
List<NamedXContentRegistry.Entry> DEFAULT_EXECUTABLE_CONTEXTS = unmodifiableList(Arrays.asList( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be final
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started out declaring it public static final
out of habit but checkstyle failed because they are all forced on that field because it is in an interface.
* {@link NamedXContentRegistry} that parses the default list of | ||
* {@link ExecutableSection}s available for tests. | ||
*/ | ||
NamedXContentRegistry XCONTENT_REGISTRY = new NamedXContentRegistry(DEFAULT_EXECUTABLE_CONTEXTS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, this could be final I think?
…or-you * elastic/master: (22 commits) Docs: Test examples that recreate lang analyzers (elastic#29535) BulkProcessor to retry based on status code (elastic#29329) Add GET Repository High Level REST API (elastic#30362) add a comment explaining the need for RetryOnReplicaException on missing mappings Add `coordinating_only` node selector (elastic#30313) Stop forking groovyc (elastic#30471) Avoid setting connection request timeout (elastic#30384) Use date format in `date_range` mapping before fallback to default (elastic#29310) Watcher: Increase HttpClient parallel sent requests (elastic#30130) Mute ML upgrade test (elastic#30458) Stop forking javac (elastic#30462) Client: Deprecate many argument performRequest (elastic#30315) Docs: Use task_id in examples of tasks (elastic#30436) Security: Rename IndexLifecycleManager to SecurityIndexManager (elastic#30442) [Docs] Fix typo in cardinality-aggregation.asciidoc (elastic#30434) Avoid NPE in `more_like_this` when field has zero tokens (elastic#30365) Build: Switch to building javadoc with html5 (elastic#30440) Add a quick tour of the project to CONTRIBUTING (elastic#30187) Reindex: Use request flavored methods (elastic#30317) Silence SplitIndexIT.testSplitIndexPrimaryTerm test failure. (elastic#30432) ...
We have a pile of documentation describing how to rebuild the built in language analyzers and, previously, our documentation testing framework made sure that the examples successfully built *an* analyzer but they didn't assert that the analyzer built by the documentation matches the built in anlayzer. Unsuprisingly, some of the examples aren't quite right. This adds a mechanism that tests that the analyzers built by the docs. The mechanism is fairly simple and brutal but it seems to be working: build a hundred random unicode sequences and send them through the `_analyze` API with the rebuilt analyzer and then again through the built in analyzer. Then make sure both APIs return the same results. Each of these calls to `_anlayze` takes about 20ms on my laptop which seems fine.
Adds documentation for how to rebuild all the built in analyzers and tests for that documentation using the mechanism added in elastic#29535. Closes elastic#29499
* master: Upgrade to Lucene-7.4-snapshot-6705632810 (#30519) add version compatibility from 6.4.0 after backport, see #30319 (#30390) Security: Simplify security index listeners (#30466) Add proper longitude validation in geo_polygon_query (#30497) Remove Discovery.AckListener.onTimeout() (#30514) Build: move generated-resources to build (#30366) Reindex: Fold "with all deps" project into reindex (#30154) Isolate REST client single host tests (#30504) Solve Gradle deprecation warnings around shadowJar (#30483) SAML: Process only signed data (#30420) Remove BWC repository test (#30500) Build: Remove xpack specific run task (#30487) AwaitsFix IntegTestZipClientYamlTestSuiteIT#indices.split tests LLClient: Add setJsonEntity (#30447) Expose CommonStatsFlags directly in IndicesStatsRequest. (#30163) Silence IndexUpgradeIT test failures. (#30430) Bump Gradle heap to 1792m (#30484) [docs] add warning for read-write indices in force merge documentation (#28869) Avoid deadlocks in cache (#30461) Test: remove hardcoded list of unconfigured ciphers (#30367) mute SplitIndexIT due to #30416 Docs: Test examples that recreate lang analyzers (#29535) BulkProcessor to retry based on status code (#29329) Add GET Repository High Level REST API (#30362) add a comment explaining the need for RetryOnReplicaException on missing mappings Add `coordinating_only` node selector (#30313) Stop forking groovyc (#30471) Avoid setting connection request timeout (#30384) Use date format in `date_range` mapping before fallback to default (#29310) Watcher: Increase HttpClient parallel sent requests (#30130) # Conflicts: # x-pack/plugin/core/src/test/java/org/elasticsearch/xpack/core/LocalStateCompositeXPackPlugin.java
* 6.x: Upgrade to Lucene-7.4-snapshot-6705632810 (#30519) Remove Discovery.AckListener.onTimeout() (#30514) Build: move generated-resources to build (#30366) Reindex: Fold "with all deps" project into reindex (#30154) Isolate REST client single host tests (#30504) Remove BWC repository test (#30500) Build: Remove xpack specific run task (#30487) AwaitsFix IntegTestZipClientYamlTestSuiteIT#indices.split tests LLClient: Add setJsonEntity (#30447) [docs] add warning for read-write indices in force merge documentation (#28869) Avoid deadlocks in cache (#30461) BulkProcessor to retry based on status code (#29329) Avoid setting connection request timeout (#30384) Test: remove hardcoded list of unconfigured ciphers (#30367) Add GET Repository High Level REST API (#30362) mute SplitIndexIT due to #30416 Docs: Test examples that recreate lang analyzers (#29535) add a comment explaining the need for RetryOnReplicaException on missing mappings Pass the task to broadcast actions (#29672) Stop forking groovyc (#30471) Add `coordinating_only` node selector (#30313) Fix accidental error in changelog Use date format in `date_range` mapping before fallback to default (#29310) Watcher: Increase HttpClient parallel sent requests (#30130) [Security][Tests] Azeri(Turkish) locale tripps opensaml dependency
We have a pile of documentation describing how to rebuild the built in
language analyzers and, previously, our documentation testing framework
made sure that the examples successfully built an analyzer but they
didn't assert that the analyzer built by the documentation matches the
built in anlayzer. Unsuprisingly, some of the examples aren't quite
right.
This adds a mechanism that tests that the analyzers built by the docs.
The mechanism is fairly simple and brutal but it seems to be working:
build a hundred random unicode sequences and send them through the
_analyze
API with the rebuilt analyzer and then again through thebuilt in analyzer. Then make sure both APIs return the same results.
Each of these calls to
_anlayze
takes about 20ms on my laptop whichseems fine.
Related to #29499