Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AutoFollowIT.testAutoFollowManyIndices failure #56891

Closed
davidkyle opened this issue May 18, 2020 · 5 comments · Fixed by #56910, #60070 or #61679
Closed

AutoFollowIT.testAutoFollowManyIndices failure #56891

davidkyle opened this issue May 18, 2020 · 5 comments · Fixed by #56910, #60070 or #61679
Assignees
Labels
:Distributed Indexing/CCR Issues around the Cross Cluster State Replication features Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >test-failure Triaged test failures from CI

Comments

@davidkyle
Copy link
Member

Build scan:
https://gradle-enterprise.elastic.co/s/vgarx75evvhp4

Repro line:

./gradlew ':x-pack:plugin:ccr:internalClusterTest' --tests "org.elasticsearch.xpack.ccr.AutoFollowIT.testAutoFollowManyIndices" \
  -Dtests.seed=BBFCB0557D17ACF7 \
  -Dtests.security.manager=true \
  -Dtests.locale=sk \
  -Dtests.timezone=America/Menominee \
  -Dcompiler.java=14 \
  -Druntime.java=11

Reproduces locally?:
No

Applicable branches:
Master

Failure history:
Rare, only twice in 31 days. It is not clear whether the failure is a due to an intermittent build issue or the actual test so I leaving the test unmuted.

Another instance: https://gradle-enterprise.elastic.co/s/xqbrmt6iaglg4

Failure excerpt:

java.lang.AssertionError: 
Expected: a collection with size <13>
     but: collection size was <10>
	at __randomizedtesting.SeedInfo.seed([BBFCB0557D17ACF7:93CE9F2814860980]:0)
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
	at org.junit.Assert.assertThat(Assert.java:956)
	at org.junit.Assert.assertThat(Assert.java:923)
	at org.elasticsearch.xpack.ccr.AutoFollowIT.lambda$testAutoFollowManyIndices$7(AutoFollowIT.java:199)
	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:892)
	at org.elasticsearch.xpack.ccr.AutoFollowIT.assertLongBusy(AutoFollowIT.java:562)
	at org.elasticsearch.xpack.ccr.AutoFollowIT.testAutoFollowManyIndices(AutoFollowIT.java:191)
@davidkyle davidkyle added >test-failure Triaged test failures from CI :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features labels May 18, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (:Distributed/CCR)

@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label May 18, 2020
@ywelsch
Copy link
Contributor

ywelsch commented May 18, 2020

This test does a lot of stuff, and can sometimes run into a timeout (when it randomly uses a large number of indices). I've opened #56910 to increase the timeout

@probakowski
Copy link
Contributor

@probakowski probakowski reopened this Jul 22, 2020
@ywelsch
Copy link
Contributor

ywelsch commented Jul 22, 2020

Relevant log line:

[2020-07-22T06:30:46,336][WARN ][o.e.x.c.AutoFollowIT     ] [testAutoFollowManyIndices] AssertionError when waiting for auto-follower, auto-follow stats are: {"number_of_failed_follow_indices":1,"number_of_failed_remote_cluster_state_requests":0,"number_of_successful_follow_indices":12,"recent_auto_follow_errors":[{"leader_index":"my-pattern:logs-11","timestamp":1595399376228,"auto_follow_exception":{"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (restore_snapshot[_latest_]) within 30s"}}],"auto_followed_clusters":[{"cluster_name":"leader_cluster","time_since_last_check_millis":0,"last_seen_metadata_version":30}]}
java.lang.AssertionError: 
Expected: <13>
     but: was <12>
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18) ~[hamcrest-2.1.jar:2.1]
	at org.junit.Assert.assertThat(Assert.java:956) ~[junit-4.12.jar:4.12]
	at org.junit.Assert.assertThat(Assert.java:923) ~[junit-4.12.jar:4.12]
	at org.elasticsearch.xpack.ccr.AutoFollowIT.lambda$testAutoFollowManyIndices$7(AutoFollowIT.java:195) ~[internalClusterTest/:?]
	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:897) ~[framework-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.xpack.ccr.AutoFollowIT.assertLongBusy(AutoFollowIT.java:562) [internalClusterTest/:?]
	at org.elasticsearch.xpack.ccr.AutoFollowIT.testAutoFollowManyIndices(AutoFollowIT.java:191) [internalClusterTest/:?]
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
	at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
	at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758) [randomizedtesting-runner-2.7.7.jar:?]
	at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946) [randomizedtesting-runner-2.7.7.jar:?]
	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982) [randomizedtesting-runner-2.7.7.jar:?]
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996) [randomizedtesting-runner-2.7.7.jar:?]
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) [randomizedtesting-runner-2.7.7.jar:?]
	at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49) [lucene-test-framework-8.6.0.jar:8.6.0 a9c5fb0da2dfc8c7375622c80dbf1a0cc26f44dc - broustant - 2020-07-07 12:46:33]
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45) [lucene-test-framework-8.6.0.jar:8.6.0 a9c5fb0da2dfc8c7375622c80dbf1a0cc26f44dc - broustant - 2020-07-07 12:46:33]
	at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48) [lucene-test-framework-8.6.0.jar:8.6.0 a9c5fb0da2dfc8c7375622c80dbf1a0cc26f44dc - broustant - 2020-07-07 12:46:33]
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64) [lucene-test-framework-8.6.0.jar:8.6.0 a9c5fb0da2dfc8c7375622c80dbf1a0cc26f44dc - broustant - 2020-07-07 12:46:33]
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47) [lucene-test-framework-8.6.0.jar:8.6.0 a9c5fb0da2dfc8c7375622c80dbf1a0cc26f44dc - broustant - 2020-07-07 12:46:33]
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) [randomizedtesting-runner-2.7.7.jar:?]
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375) [randomizedtesting-runner-2.7.7.jar:?]
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:824) [randomizedtesting-runner-2.7.7.jar:?]
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:475) [randomizedtesting-runner-2.7.7.jar:?]
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955) [randomizedtesting-runner-2.7.7.jar:?]
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840) [randomizedtesting-runner-2.7.7.jar:?]
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891) [randomizedtesting-runner-2.7.7.jar:?]
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902) [randomizedtesting-runner-2.7.7.jar:?]
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45) [lucene-test-framework-8.6.0.jar:8.6.0 a9c5fb0da2dfc8c7375622c80dbf1a0cc26f44dc - broustant - 2020-07-07 12:46:33]
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) [randomizedtesting-runner-2.7.7.jar:?]
	at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41) [lucene-test-framework-8.6.0.jar:8.6.0 a9c5fb0da2dfc8c7375622c80dbf1a0cc26f44dc - broustant - 2020-07-07 12:46:33]
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40) [randomizedtesting-runner-2.7.7.jar:?]
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40) [randomizedtesting-runner-2.7.7.jar:?]
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) [randomizedtesting-runner-2.7.7.jar:?]
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) [randomizedtesting-runner-2.7.7.jar:?]
	at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53) [lucene-test-framework-8.6.0.jar:8.6.0 a9c5fb0da2dfc8c7375622c80dbf1a0cc26f44dc - broustant - 2020-07-07 12:46:33]
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47) [lucene-test-framework-8.6.0.jar:8.6.0 a9c5fb0da2dfc8c7375622c80dbf1a0cc26f44dc - broustant - 2020-07-07 12:46:33]
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64) [lucene-test-framework-8.6.0.jar:8.6.0 a9c5fb0da2dfc8c7375622c80dbf1a0cc26f44dc - broustant - 2020-07-07 12:46:33]
	at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54) [lucene-test-framework-8.6.0.jar:8.6.0 a9c5fb0da2dfc8c7375622c80dbf1a0cc26f44dc - broustant - 2020-07-07 12:46:33]
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) [randomizedtesting-runner-2.7.7.jar:?]
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375) [randomizedtesting-runner-2.7.7.jar:?]
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:831) [randomizedtesting-runner-2.7.7.jar:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]

The reason the test is failing is because the follow request submitted by AutoFollowCoordinator ran into a timeout (30 seconds by default) because of a busy master, and did not retry. This is a CCR bug that needs to be fixed. One part of it is making the timeout on the request unbounded.

dnhatn added a commit that referenced this issue Jul 27, 2020
Today, a follow task will fail if the master node of the follower 
cluster is temporarily overloaded and unable to process master node
requests (such as update mapping, setting, or alias) from a follow-task
within the default timeout. This error is transient, and follow-tasks
should not abort. We can avoid this problem by setting the timeout of
master node requests on the follower cluster to unbounded.

Closes #56891
dnhatn added a commit that referenced this issue Jul 27, 2020
Today, a follow task will fail if the master node of the follower
cluster is temporarily overloaded and unable to process master node
requests (such as update mapping, setting, or alias) from a follow-task
within the default timeout. This error is transient, and follow-tasks
should not abort. We can avoid this problem by setting the timeout of
master node requests on the follower cluster to unbounded.

Closes #56891
dnhatn added a commit that referenced this issue Jul 27, 2020
Today, a follow task will fail if the master node of the follower
cluster is temporarily overloaded and unable to process master node
requests (such as update mapping, setting, or alias) from a follow-task
within the default timeout. This error is transient, and follow-tasks
should not abort. We can avoid this problem by setting the timeout of
master node requests on the follower cluster to unbounded.

Closes #56891
dnhatn added a commit that referenced this issue Jul 28, 2020
Today, a follow task will fail if the master node of the follower
cluster is temporarily overloaded and unable to process master node
requests (such as update mapping, setting, or alias) from a follow-task
within the default timeout. This error is transient, and follow-tasks
should not abort. We can avoid this problem by setting the timeout of
master node requests on the follower cluster to unbounded.

Closes #56891
dnhatn added a commit that referenced this issue Aug 11, 2020
Today, a follow task will fail if the master node of the follower
cluster is temporarily overloaded and unable to process master node
requests (such as update mapping, setting, or alias) from a follow-task
within the default timeout. This error is transient, and follow-tasks
should not abort. We can avoid this problem by setting the timeout of
master node requests on the follower cluster to unbounded.

Closes #56891
@mayya-sharipova
Copy link
Contributor

Reopening as there was another failure on master today (not sure how significant it is):

Build scan: https://gradle-enterprise.elastic.co/s/nwd2rtk6hutna

REPRODUCE WITH: ./gradlew ':x-pack:plugin:ccr:internalClusterTest' --tests "org.elasticsearch.xpack.ccr.AutoFollowIT.testAutoFollowManyIndices"
-Dtests.seed=AC5B141E2DEEAAFC
-Dtests.security.manager=true
-Dtests.locale=sl
-Dtests.timezone=Pacific/Fakaofo
-Druntime.java=11

java.lang.AssertionError: 	
Expected: <12>	
     but: was <11>	
at __randomizedtesting.SeedInfo.seed([AC5B141E2DEEAAFC:84693B63447F0F8B]:0)	
•••
at org.elasticsearch.xpack.ccr.AutoFollowIT.lambda$testAutoFollowManyIndices$7(AutoFollowIT.java:195)	
•••
at org.elasticsearch.xpack.ccr.AutoFollowIT.assertLongBusy(AutoFollowIT.java:562)	
at org.elasticsearch.xpack.ccr.AutoFollowIT.testAutoFollowManyIndices(AutoFollowIT.java:191)

dnhatn added a commit that referenced this issue Aug 31, 2020
If the master node of the follower cluster is busy, then the 
auto-follower will fail to initialize the following process. This also
occurs when an auto-follow pattern matches multiple indices. We should
set the timeout of put-follow requests issued by the auto-follower to
unbounded to avoid this problem.

Closes #56891
dnhatn added a commit that referenced this issue Aug 31, 2020
If the master node of the follower cluster is busy, then the 
auto-follower will fail to initialize the following process. This also
occurs when an auto-follow pattern matches multiple indices. We should
set the timeout of put-follow requests issued by the auto-follower to
unbounded to avoid this problem.

Closes #56891
dnhatn added a commit that referenced this issue Aug 31, 2020
If the master node of the follower cluster is busy, then the 
auto-follower will fail to initialize the following process. This also
occurs when an auto-follow pattern matches multiple indices. We should
set the timeout of put-follow requests issued by the auto-follower to
unbounded to avoid this problem.

Closes #56891
dnhatn added a commit that referenced this issue Sep 1, 2020
If the master node of the follower cluster is busy, then the
auto-follower will fail to initialize the following process. This also
occurs when an auto-follow pattern matches multiple indices. We should
set the timeout of put-follow requests issued by the auto-follower to
unbounded to avoid this problem.

Closes #56891
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/CCR Issues around the Cross Cluster State Replication features Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >test-failure Triaged test failures from CI
Projects
None yet
6 participants