[CI] FollowerFailOverIT testFailOverOnFollower fails frequently #38633

cbuescher · 2019-02-08T15:50:10Z

This was once mentioned in #33337 but since its
a different method and its not muted yet I'll open a separate issue to track it.

Latest failures eg: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+internalClusterTest/780/console

Doesn't reproduce locally, but this is the reproduction line

./gradlew :x-pack:plugin:ccr:internalClusterTest \
  -Dtests.seed=E9FBEB281249D0E0 \
  -Dtests.class=org.elasticsearch.xpack.ccr.FollowerFailOverIT \
  -Dtests.method="testFailOverOnFollower" \
  -Dtests.security.manager=true \
  -Dtests.locale=uk-UA \
  -Dtests.timezone=Asia/Jerusalem \
  -Dcompiler.java=11 \
  -Druntime.java=8

Errors in the log:

java.lang.AssertionError: timed out waiting for green state
	at org.junit.Assert.fail(Assert.java:88)
	at org.elasticsearch.xpack.CcrIntegTestCase.ensureColor(CcrIntegTestCase.java:317)
	at org.elasticsearch.xpack.CcrIntegTestCase.ensureFollowerGreen(CcrIntegTestCase.java:291)
	at org.elasticsearch.xpack.CcrIntegTestCase.ensureFollowerGreen(CcrIntegTestCase.java:286)
	at org.elasticsearch.xpack.ccr.FollowerFailOverIT.testFailOverOnFollower(FollowerFailOverIT.java:113)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.lang.Thread.run(Thread.java:748)

java.lang.AssertionError: [follower-index][0], node[5iKNfmaJTBinORNF0IpQ6A], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=nxJxZYsLT6mCGGshzf7bMQ], unassigned_info[[reason=NODE_LEFT], at[2019-02-08T15:11:56.410Z], delayed=false, details[node_left [5iKNfmaJTBinORNF0IpQ6A]], allocation_status[no_attempt]], expected_shard_size[230] seq_no_stats mismatch
Expected: <SeqNoStats{maxSeqNo=300, localCheckpoint=71, globalCheckpoint=71}>
     but: was <SeqNoStats{maxSeqNo=300, localCheckpoint=-1, globalCheckpoint=71}>
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
	at org.junit.Assert.assertThat(Assert.java:956)
	at org.elasticsearch.test.InternalTestCluster.lambda$assertSeqNos$6(InternalTestCluster.java:1389)
	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:858)
	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:832)
	at org.elasticsearch.test.InternalTestCluster.assertSeqNos(InternalTestCluster.java:1359)
	at org.elasticsearch.xpack.CcrIntegTestCase.afterTest(CcrIntegTestCase.java:180)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.lang.Thread.run(Thread.java:748)
	Suppressed: java.lang.AssertionError: [follower-index][0], node[5iKNfmaJTBinORNF0IpQ6A], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=nxJxZYsLT6mCGGshzf7bMQ], unassigned_info[[reason=NODE_LEFT], at[2019-02-08T15:11:56.410Z], delayed=false, details[node_left [5iKNfmaJTBinORNF0IpQ6A]], allocation_status[no_attempt]], expected_shard_size[230] seq_no_stats mismatch
Expected: <SeqNoStats{maxSeqNo=300, localCheckpoint=71, globalCheckpoint=71}>
     but: was <SeqNoStats{maxSeqNo=300, localCheckpoint=-1, globalCheckpoint=71}>
		at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
		at org.junit.Assert.assertThat(Assert.java:956)
		at org.elasticsearch.test.InternalTestCluster.lambda$assertSeqNos$6(InternalTestCluster.java:1389)
		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:846)
		... 38 more

The text was updated successfully, but these errors were encountered:

elasticmachine · 2019-02-08T15:50:14Z

Pinging @elastic/es-distributed

Relates to elastic#38633

Relates to #38633

cbuescher · 2019-02-08T16:22:28Z

I muted this test method with #38634 on master, 7.x and 7.0

There were two documents (seq=2 and seq=103) missing on the follower in one of the failures of `testFailOverOnFollower`. I spent several hours on that failure but could not figure out the reason. I adjust log and unmute this test so we can collect more information. Relates #38633

jasontedor · 2019-02-16T16:13:45Z

@bleskes @dnhatn @DaveCTurner @martijnvg @ywelsch I think that I can explain what is occurring in this test, and what is occurring here is a blocker issue.

This issue is the result of adding recovery from remote.

With recovery from remote, we copy over the index files and there is no translog replay phase. If the leader does a flush before the follower initiates recovery from remote, after recovery from remote the follower will be fully caught up all primary shards of the follower will have empty translogs. When a replica shard of the follower attempts to recover from the primary shard of the follower, we want to replay translog from the local checkpoint of the commit, to bake a history of operations for it. Since the primary shards of the follower have empty translogs, this replay can not happen and recovery will fail. That explains the following:

RemoteTransportException[[follower3][127.0.0.1:58606][internal:index/shard/recovery/start_recovery]]; nested: IllegalStateException[translog replay failed to cover required sequence numbers (required range [1:174). first missing op is [1]]; ], allocation_status[no_attempt]]

Immediately, I see two options:

InternalEngine#readHistoryOperations could serve operations from the Lucene history instead of the translog; however, this interplays with the need for shard history retention leases to interplay with recovery to ensure that these operations exist
the primary shards of the follower could build their own translog from Lucene history after remotely recovering files

I am going to mute this test for now and open a new issue marked as a blocker.

jasontedor · 2019-02-16T17:32:16Z

I have marked this test as awaiting a fix in 6.7, 7.0, 7.x, and master.

dnhatn · 2019-02-16T17:54:07Z

@jasontedor Thanks for digging into this. @martijnvg and I found this issue in #38949. I am working on the fix for this - it's pretty contained. We can use the local checkpoint of the recovery target to validate the sending operations instead of tracking them locally on the recovery source.

DaveCTurner · 2019-02-16T18:16:39Z

Is this caused (or exacerbated) by #38904? Prior to #38904 if soft deletes were enabled we did read historical operations from Lucene during peer recovery, and I think the follower primary would have retained that history in Lucene for the duration of the recovery.

I ask because I note that the first report of this failure comes from a time before when #38904 was merged.

jasontedor · 2019-02-16T18:23:30Z

Yes that exacerbates it.

There were two documents (seq=2 and seq=103) missing on the follower in one of the failures of `testFailOverOnFollower`. I spent several hours on that failure but could not figure out the reason. I adjust log and unmute this test so we can collect more information. Relates #38633

Relates #38633

dnhatn · 2019-02-25T19:55:16Z

This should be fixed by #39006. I unmuted this test in f59b8cd.

Relates #38633

cbuescher added >test-failure Triaged test failures from CI :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features v8.0.0 labels Feb 8, 2019

cbuescher pushed a commit to cbuescher/elasticsearch that referenced this issue Feb 8, 2019

Mute FollowerFailOverIT testFailOverOnFollower

18e89d2

Relates to elastic#38633

cbuescher mentioned this issue Feb 8, 2019

Mute FollowerFailOverIT testFailOverOnFollower #38634

Merged

cbuescher pushed a commit that referenced this issue Feb 8, 2019

Mute FollowerFailOverIT testFailOverOnFollower (#38634)

d03b386

Relates to #38633

cbuescher pushed a commit that referenced this issue Feb 8, 2019

Mute FollowerFailOverIT testFailOverOnFollower (#38634)

ce5811e

Relates to #38633

cbuescher pushed a commit that referenced this issue Feb 8, 2019

Mute FollowerFailOverIT testFailOverOnFollower (#38634)

7aeed4d

Relates to #38633

dnhatn self-assigned this Feb 9, 2019

This was referenced Feb 11, 2019

Fix synchronization in LocalCheckpointTracker#contains #38755

Merged

Adjust log and unmute testFailOverOnFollower #38762

Merged

jasontedor mentioned this issue Feb 16, 2019

Replica recovery on follower can fail after recovery from remote #39000

Closed

dnhatn added a commit that referenced this issue Feb 25, 2019

Unmute FollowerFailOverIT#testFailOverOnFollower

f59b8cd

Relates #38633

dnhatn closed this as completed Feb 25, 2019

dnhatn added a commit that referenced this issue Feb 25, 2019

Unmute FollowerFailOverIT#testFailOverOnFollower

0f29b89

Relates #38633

dnhatn added a commit that referenced this issue Feb 25, 2019

Unmute FollowerFailOverIT#testFailOverOnFollower

d63d0fa

Relates #38633

tlrx mentioned this issue Feb 27, 2019

FollowerFailOverIT.testFailOverOnFollower fails on CI #39467

Closed

jakelandis removed the v8.0.0 label Jul 26, 2021

jakelandis added the v8.0.0-alpha1 label Jul 26, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CI] FollowerFailOverIT testFailOverOnFollower fails frequently #38633

[CI] FollowerFailOverIT testFailOverOnFollower fails frequently #38633

cbuescher commented Feb 8, 2019

elasticmachine commented Feb 8, 2019

cbuescher commented Feb 8, 2019

jasontedor commented Feb 16, 2019

jasontedor commented Feb 16, 2019

dnhatn commented Feb 16, 2019

DaveCTurner commented Feb 16, 2019 •

edited

Loading

jasontedor commented Feb 16, 2019

dnhatn commented Feb 25, 2019

[CI] FollowerFailOverIT testFailOverOnFollower fails frequently #38633

[CI] FollowerFailOverIT testFailOverOnFollower fails frequently #38633

Comments

cbuescher commented Feb 8, 2019

elasticmachine commented Feb 8, 2019

cbuescher commented Feb 8, 2019

jasontedor commented Feb 16, 2019

jasontedor commented Feb 16, 2019

dnhatn commented Feb 16, 2019

DaveCTurner commented Feb 16, 2019 • edited Loading

jasontedor commented Feb 16, 2019

dnhatn commented Feb 25, 2019

DaveCTurner commented Feb 16, 2019 •

edited

Loading