Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] Multicluster transform test failures (yaml=multi_cluster/80_transform/Batch transform from local and remote cluster) #51629

Closed
cbuescher opened this issue Jan 29, 2020 · 9 comments · Fixed by #62086
Assignees
Labels
:ml/Transform Transform >test-failure Triaged test failures from CI v8.0.0-alpha1

Comments

@cbuescher
Copy link
Member

Log https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob+fast+part2/3366/console
Build Scans https://gradle-enterprise.elastic.co/s/fx6eksdxojray

Failure:

java.lang.AssertionError: Failure at [multi_cluster/80_transform:238]: transforms.0.checkpointing.last.checkpoint didn't match expected value:
transforms.0.checkpointing.last.checkpoint: expected Integer [1] but was Integer [0]

	at __randomizedtesting.SeedInfo.seed([6D89625592739213:E5DD5D8F3C8FFFEB]:0)
	at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:405)
	at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.test(ESClientYamlSuiteTestCase.java:382)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.AssertionError: transforms.0.checkpointing.last.checkpoint didn't match expected value:
transforms.0.checkpointing.last.checkpoint: expected Integer [1] but was Integer [0]

	at org.elasticsearch.test.rest.yaml.section.MatchAssertion.doAssert(MatchAssertion.java:93)
	at org.elasticsearch.test.rest.yaml.section.Assertion.execute(Assertion.java:76)
	at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:398)
	... 37 more

I wasn't able to reproduce locally with this:

./gradlew ':x-pack:qa:multi-cluster-tests-with-security:mixed-clusterRunner' --tests "org.elasticsearch.multi_cluster.MultiClusterYamlTestSuiteIT.test {yaml=multi_cluster/80_transform/Batch transform from local and remote cluster}" -Dtests.seed=6D89625592739213 -Dtests.security.manager=true -Dtests.locale=sr-Latn-BA -Dtests.timezone=HST -Dcompiler.java=13 -Dtests.rest.suite=multi_cluster
@cbuescher cbuescher added >test-failure Triaged test failures from CI v8.0.0 :ml/Transform Transform labels Jan 29, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (:ml/Transform)

@cbuescher
Copy link
Member Author

@hendrikmuhs hendrikmuhs self-assigned this Jan 30, 2020
@hendrikmuhs
Copy link

hendrikmuhs commented Jan 30, 2020

The root cause in all 3 cases seems to be a race condition when storing the state into the state document:

01:44:49 »  Caused by: org.elasticsearch.index.engine.VersionConflictEngineException: [data_frame_transform_state_and_stats-simple-local-remote-transform]: version conflict, document already exists (current version [1])
01:44:49 »  	at org.elasticsearch.index.engine.InternalEngine.planIndexingAsPrimary(InternalEngine.java:1062) ~[elasticsearch-7.7.0-SNAPSHOT.jar:7.7.0-SNAPSHOT]
01:44:49 »  	at org.elasticsearch.index.engine.InternalEngine.indexingStrategyForOperation(InternalEngine.java:1021) ~[elasticsearch-7.7.0-SNAPSHOT.jar:7.7.0-SNAPSHOT]
01:44:49 »  	at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:910) ~[elasticsearch-7.7.0-SNAPSHOT.jar:7.7.0-SNAPSHOT]
01:44:49 »  	at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:813) ~[elasticsearch-7.7.0-SNAPSHOT.jar:7.7.0-SNAPSHOT]
01:44:49 »  	at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:785) ~[elasticsearch-7.7.0-SNAPSHOT.jar:7.7.0-SNAPSHOT]
01:44:49 »  	at org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:742) ~[elasticsearch-7.7.0-SNAPSHOT.jar:7.7.0-SNAPSHOT]
01:44:49 »  	at org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:254) ~[elasticsearch-7.7.0-SNAPSHOT.jar:7.7.0-SNAPSHOT]
01:44:49 »  	at org.elasticsearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:157) ~[elasticsearch-7.7.0-SNAPSHOT.jar:7.7.0-SNAPSHOT]
01:44:49 »  	... 148 more
01:44:49 » WARN ][o.e.p.AllocatedPersistentTask] [mixed-cluster-0] attempt to complete task [data_frame/transforms[c]] with id [simple-local-remote-transform] in the [COMPLETED] state

We use optimistic concurrency control and keep the seq_nr/primary term in an atomic variable, so it looks like 2 save state calls run at the same time.

@mark-vieira
Copy link
Contributor

mark-vieira commented Jan 30, 2020

Ran into this on one of my PRs as well. Looks like this is happening in master and 7.x. Here are a few more examples if it helps. Also this affects the Batch transform from remote cluster test as well.

https://gradle-enterprise.elastic.co/s/o6exshw4mzfig/tests?search=test%20%7Byaml%3Dmulti_cluster/80_transform/Batch%20transform%20from%20remote%20cluster%7D

https://gradle-enterprise.elastic.co/s/v4qidlin5znus/tests?search=test%20%7Byaml%3Dmulti_cluster/80_transform/Batch%20transform%20from%20remote%20cluster%7D

@hendrikmuhs
Copy link

Thank you @mark-vieira !

After some more investigations I think my 1st assumption is not the problem. Together with the failures we also see:

 org.elasticsearch.ElasticsearchSecurityException: action [indices:admin/refresh] is unauthorized for user [joe]
21:03:55 »  	at org.elasticsearch.xpack.core.security.support.Exceptions.authorizationError(Exceptions.java:34) ~[x-pack-core-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
21:03:55 »  	at org.elasticsearch.xpack.security.authz.AuthorizationService.denialException(AuthorizationService.java:588) ~[x-pack-security-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
21:03:55 »  	at org.elasticsearch.xpack.security.authz.AuthorizationService$AuthorizationResultListener.handleFailure(AuthorizationService.java:635) [x-pack-security-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]

I assumed this is benign, but it is the other way around, the version conflict is benign and this refresh issue isn't.

TL/DR

Transform stores state in an index, this state is returned when the transform is stopped - no task, therefore no in-memory information. The refresh is triggered to ensure the latest state is flushed. Because the refresh fails this is not the case, so it returns an old state.

Note that an index re-fresh is triggered internally by lucene eventually. This problem only makes trouble for non-human interaction (CI is a great way to spot those issues) and if security is enabled and if you use a non-admin user.

@hendrikmuhs
Copy link

I am 99.9% certain that #51732 fixes this issue.

The other optimistic concurrency control hiccup still needs to be addressed.

@hendrikmuhs
Copy link

The error hasn't happened again after the fix, closing.

Follow up regarding logs: #52035

@tvernum
Copy link
Contributor

tvernum commented Feb 10, 2020

Unfortunately this just failed again in master.
https://gradle-enterprise.elastic.co/s/a6uha6ubq3shy/console-log?task=:x-pack:qa:multi-cluster-tests-with-security:mixed-clusterRunner

:x-pack:qa:multi-cluster-tests-with-security:mixed-clusterRunner FAILED
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF8

org.elasticsearch.multi_cluster.MultiClusterYamlTestSuiteIT > test {yaml=multi_cluster/80_transform/Batch transform from local and remote cluster} FAILED
    java.lang.AssertionError: Failure at [multi_cluster/80_transform:238]: transforms.0.checkpointing.last.checkpoint didn't match expected value:
    transforms.0.checkpointing.last.checkpoint: expected Integer [1] but was Integer [0]
Full Task Log

Task Output

org.elasticsearch.multi_cluster.MultiClusterYamlTestSuiteIT > test {yaml=multi_cluster/80_transform/Batch transform from local and remote cluster} FAILED
    java.lang.AssertionError: Failure at [multi_cluster/80_transform:238]: transforms.0.checkpointing.last.checkpoint didn't match expected value:
    transforms.0.checkpointing.last.checkpoint: expected Integer [1] but was Integer [0]
        at __randomizedtesting.SeedInfo.seed([58951B50C1EC2023:D0C1248A6F104DDB]:0)
        at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:405)
        at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.test(ESClientYamlSuiteTestCase.java:382)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
        at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
        at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
        at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
        at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:819)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:470)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:951)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:836)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:887)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:898)
        at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
        at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
        at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
        at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:826)
        at java.base/java.lang.Thread.run(Thread.java:834)

        Caused by:
        java.lang.AssertionError: transforms.0.checkpointing.last.checkpoint didn't match expected value:
        transforms.0.checkpointing.last.checkpoint: expected Integer [1] but was Integer [0]
            at org.elasticsearch.test.rest.yaml.section.MatchAssertion.doAssert(MatchAssertion.java:93)
            at org.elasticsearch.test.rest.yaml.section.Assertion.execute(Assertion.java:76)
            at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:398)
            ... 38 more
REPRODUCE WITH: ./gradlew ':x-pack:qa:multi-cluster-tests-with-security:mixed-clusterRunner' --tests "org.elasticsearch.multi_cluster.MultiClusterYamlTestSuiteIT.test {yaml=multi_cluster/80_transform/Batch transform from local and remote cluster}" -Dtests.seed=58951B50C1EC2023 -Dtests.security.manager=true -Dtests.locale=tr -Dtests.timezone=Pacific/Midway -Dcompiler.java=13 -Dtests.rest.suite=multi_cluster

Suite: Test class org.elasticsearch.multi_cluster.MultiClusterYamlTestSuiteIT
  1> [2020-02-09T14:25:58,196][INFO ][o.e.m.MultiClusterYamlTestSuiteIT] [test] [yaml=multi_cluster/80_transform/Search remote cluster] before test
  1> [2020-02-09T14:25:58,303][INFO ][o.e.m.MultiClusterYamlTestSuiteIT] [test] initializing REST clients against [http://[::1]:49798, http://127.0.0.1:49799]
  1> [2020-02-09T14:25:58,807][INFO ][o.e.m.MultiClusterYamlTestSuiteIT] [test] initializing client, minimum es version [8.0.0], master version, [8.0.0], hosts [http://[::1]:49798, http://127.0.0.1:49799]
  1> [2020-02-09T14:26:00,195][INFO ][o.e.m.MultiClusterYamlTestSuiteIT] [test] [yaml=multi_cluster/80_transform/Search remote cluster] after test
  1> [2020-02-09T14:26:00,220][INFO ][o.e.m.MultiClusterYamlTestSuiteIT] [test] [yaml=multi_cluster/80_transform/Batch transform from local and remote cluster] before test
  1> [2020-02-09T14:26:02,602][INFO ][o.e.m.MultiClusterYamlTestSuiteIT] [test] Stash dump on test failure [{
  1>   "stash" : {
  1>     "body" : {
  1>       "count" : 1,
  1>       "transforms" : [
  1>         {
  1>           "id" : "simple-local-remote-transform",
  1>           "state" : "stopped",
  1>           "stats" : {
  1>             "pages_processed" : 0,
  1>             "documents_processed" : 0,
  1>             "documents_indexed" : 0,
  1>             "trigger_count" : 1,
  1>             "index_time_in_ms" : 0,
  1>             "index_total" : 0,
  1>             "index_failures" : 0,
  1>             "search_time_in_ms" : 0,
  1>             "search_total" : 0,
  1>             "search_failures" : 0,
  1>             "exponential_avg_checkpoint_duration_ms" : 0.0,
  1>             "exponential_avg_documents_indexed" : 0.0,
  1>             "exponential_avg_documents_processed" : 0.0
  1>           },
  1>           "checkpointing" : {
  1>             "last" : {
  1>               "checkpoint" : 0
  1>             },
  1>             "operations_behind" : 18
  1>           }
  1>         }
  1>       ]
  1>     }
  1>   }
  1> }]
  1> [2020-02-09T14:26:02,962][INFO ][o.e.m.MultiClusterYamlTestSuiteIT] [test] [yaml=multi_cluster/80_transform/Batch transform from local and remote cluster] after test
  2> REPRODUCE WITH: ./gradlew ':x-pack:qa:multi-cluster-tests-with-security:mixed-clusterRunner' --tests "org.elasticsearch.multi_cluster.MultiClusterYamlTestSuiteIT.test {yaml=multi_cluster/80_transform/Batch transform from local and remote cluster}" -Dtests.seed=58951B50C1EC2023 -Dtests.security.manager=true -Dtests.locale=tr -Dtests.timezone=Pacific/Midway -Dcompiler.java=13 -Dtests.rest.suite=multi_cluster
  2> java.lang.AssertionError: Failure at [multi_cluster/80_transform:238]: transforms.0.checkpointing.last.checkpoint didn't match expected value:
    transforms.0.checkpointing.last.checkpoint: expected Integer [1] but was Integer [0]

@tvernum tvernum reopened this Feb 10, 2020
hendrikmuhs pushed a commit that referenced this issue Sep 28, 2020
refactor how state is persisted, call doSaveState only from the indexer thread, except there is none.

fixes #60781
fixes #52931
fixes #51629
fixes #52035
hendrikmuhs pushed a commit that referenced this issue Sep 28, 2020
refactor how state is persisted, call doSaveState only from the indexer thread, except there is none.

fixes #60781
fixes #52931
fixes #51629
fixes #52035
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:ml/Transform Transform >test-failure Triaged test failures from CI v8.0.0-alpha1
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants