
[CI] CCR: testFollowIndexAndCloseNode fails #33337

Closed
dnhatn opened this issue Sep 2, 2018 · 28 comments
Assignees
Labels
:Distributed Indexing/CCR Issues around the Cross Cluster State Replication features >test-failure Triaged test failures from CI

Comments

@dnhatn
Member

dnhatn commented Sep 2, 2018

testFollowIndexAndCloseNode fails on 6.x:

ERROR   58.1s J1 | ShardChangesIT.testFollowIndexAndCloseNode <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: 
   > Expected: <0>
   >      but: was <3>
   > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   > 	at org.elasticsearch.xpack.ccr.ShardChangesIT.lambda$unfollowIndex$12(ShardChangesIT.java:533)
   > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:847)
   > 	at org.elasticsearch.xpack.ccr.ShardChangesIT.unfollowIndex(ShardChangesIT.java:519)
   > 	at org.elasticsearch.xpack.ccr.ShardChangesIT.testFollowIndexAndCloseNode(ShardChangesIT.java:351)
   > 	at java.lang.Thread.run(Thread.java:748)

This may be the actual cause:

  2> WARNING: Uncaught exception in thread: Thread[elasticsearch[node_td4][write][T#1],5,TGRP-ShardChangesIT]
  2> java.lang.AssertionError: seqNo [133] was processed twice in generation [2], with different data. prvOp [Index{id='CBdenGUBUjHVaMehll6c', type='doc', seqNo=133, primaryTerm=1, version=1, autoGeneratedIdTimestamp=-1}], newOp [Index{id='CBdenGUBUjHVaMehll6c', type='doc', seqNo=133, primaryTerm=2, version=1, autoGeneratedIdTimestamp=-1}]
  1> [2018-09-02T17:19:26,895][INFO ][o.e.c.m.MetaDataIndexTemplateService] [node_tm2] removing template [random_index_template]
  2> 	at __randomizedtesting.SeedInfo.seed([26198A03C82BA27]:0)
  1> [2018-09-02T17:19:26,903][INFO ][o.e.n.Node               ] [testValidateFollowingIndexSettings] stopping ...
  2> 	at org.elasticsearch.index.translog.TranslogWriter.assertNoSeqNumberConflict(TranslogWriter.java:214)
  2> 	at org.elasticsearch.index.translog.TranslogWriter.add(TranslogWriter.java:181)
  1> [2018-09-02T17:19:26,906][INFO ][o.e.c.s.MasterService    ] [node_tm2] zen-disco-node-left({node_tc4}{mIMSWWHESX-PxSM6NVVikw}{ZqmDtVfPSTaqXDkLLas5WA}{127.0.0.1}{127.0.0.1:33427}{xpack.installed=true}), reason(left), reason: removed {{node_tc4}{mIMSWWHESX-PxSM6NVVikw}{ZqmDtVfPSTaqXDkLLas5WA}{127.0.0.1}{127.0.0.1:33427}{xpack.installed=true},}

CI: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/2388/console
Log: testFollowIndexAndCloseNode.txt.zip

@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@dnhatn dnhatn added >test-failure Triaged test failures from CI :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features labels Sep 2, 2018
dnhatn added a commit that referenced this issue Sep 2, 2018
dnhatn added a commit that referenced this issue Sep 2, 2018
@dnhatn
Member Author

dnhatn commented Sep 2, 2018

I've muted this test on master and 6.x.

@dnhatn dnhatn self-assigned this Sep 7, 2018
@dnhatn
Member Author

dnhatn commented Sep 18, 2018

This test failed because we assigned different terms (1 and 2) to seq=133. This happened in these steps:

  1. The primary of the follower assigns its primary term to operations in TransportBulkShardOperationsAction.

  2. An operation (seq=133) is assigned term=1, indexed to the primary and the replica of the follower. However, this primary is restarted before it can respond to ShardFollowNodeTask.

  3. The replica is promoted, and the old primary recovers operation {seq=133, term=1} from the new primary (i.e. old replica) via peer-recovery.

  4. ShardFollowNodeTask does not receive a response, then retries to index the operation(seq=133) to a new primary. The new primary assigns its term (term=2) to operation (seq=133). TransportBulkShardOperationsAction replicates this operation {seq=133,term=2} to the old primary.

  5. The new operation {seq=133,term=2} conflicts with the old operation {seq=133,term=1} and trips the assertion.

java.lang.AssertionError: seqNo [133] was processed twice in generation [2], 
with different data. 

prvOp [Index{id='CBdenGUBUjHVaMehll6c', type='doc', seqNo=133, primaryTerm=1,
             version=1, autoGeneratedIdTimestamp=-1}], 
newOp [Index{id='CBdenGUBUjHVaMehll6c', type='doc', seqNo=133, primaryTerm=2,
             version=1, autoGeneratedIdTimestamp=-1}]
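In essence, the translog's per-generation bookkeeping asserts that a sequence number is never written twice with different data. A simplified sketch of that check (illustrative class and method names, not the actual TranslogWriter code):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the seqNo-conflict check that trips in this failure.
// Not the real TranslogWriter implementation.
class SeqNoConflictCheck {
    // Remembers the primary term first seen for each seqNo in the current generation.
    private final Map<Long, Long> seenTerms = new HashMap<>();

    /** Returns true if this (seqNo, primaryTerm) conflicts with a previous write. */
    boolean isConflict(long seqNo, long primaryTerm) {
        Long previousTerm = seenTerms.putIfAbsent(seqNo, primaryTerm);
        // The same seqNo with a different primary term means the same slot was
        // processed twice with different data -- exactly the failure above:
        // prvOp had term=1, newOp had term=2 for seqNo=133.
        return previousTerm != null && previousTerm != primaryTerm;
    }

    public static void main(String[] args) {
        SeqNoConflictCheck check = new SeqNoConflictCheck();
        System.out.println(check.isConflict(133, 1)); // first write: false
        System.out.println(check.isConflict(133, 1)); // retry with same term: false
        System.out.println(check.isConflict(133, 2)); // same seqNo, new term: true
    }
}
```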

One solution I can see is to let ShardFollowNodeTask assign the primary term once before dispatching operations. We can piggyback the current term on the primary in responses (even though we won't have the latest term on ShardFollowNodeTask all the time). @bleskes and @jasontedor WDYT?

@dnhatn
Member Author

dnhatn commented Sep 18, 2018

This is still an issue with the proposal if ShardFollowNodeTask is restarted. Alternatively, should we relax the assertNoSeqNumberConflict assertion for following engines?

@bleskes
Contributor

bleskes commented Sep 19, 2018

I think this specific situation will be avoided with rollbacks in place, correct? It did make me think of a deeper problem, though. I added this as a discussion topic for our next sync.

@dnhatn
Member Author

dnhatn commented Sep 19, 2018

I think this specific situation will be avoided with rollbacks in place, correct?

No, as we are testing with 1 replica on both sides.

@dnhatn
Member Author

dnhatn commented Oct 1, 2018

I've un-muted this test. It should be okay now, as the FollowingEngine will skip operations that were already processed (see #34099).
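The skip-on-duplicate behavior can be sketched roughly like this (illustrative names, not the actual FollowingEngine code; for simplicity this assumes the follower only needs to track the max seqNo it has processed):

```java
// Illustrative sketch of an engine that ignores operations it has already
// processed, rather than tripping an assertion on the duplicate.
// Not the actual FollowingEngine implementation.
class DedupingEngine {
    private long maxProcessedSeqNo = -1; // -1 == no ops processed yet

    /** Returns true if the op was applied, false if skipped as a duplicate. */
    boolean index(long seqNo) {
        if (seqNo <= maxProcessedSeqNo) {
            // Already processed, e.g. a ShardFollowNodeTask retry after a
            // primary failover re-sent the same operation.
            return false;
        }
        maxProcessedSeqNo = seqNo;
        return true;
    }
}
```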

I am closing this issue since @jasontedor is working on a broader fix which covers this issue.

/cc @martijnvg

@dnhatn dnhatn closed this as completed Oct 1, 2018
@dnhatn
Member Author

dnhatn commented Oct 2, 2018

We haven't resolved this issue completely. I will mute this test.
CI: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/122/console

@dnhatn dnhatn reopened this Oct 2, 2018
dnhatn added a commit that referenced this issue Oct 2, 2018
dnhatn added a commit that referenced this issue Oct 2, 2018
@dnhatn dnhatn closed this as completed in 7bc11a8 Oct 10, 2018
dnhatn added a commit that referenced this issue Oct 11, 2018
This issue was resolved by #34288.

Closes #33337
Relates #34288
kcm pushed a commit that referenced this issue Oct 30, 2018
kcm pushed a commit that referenced this issue Oct 30, 2018
@talevy
Contributor

talevy commented Oct 31, 2018

this failed again for me in a PR CI run

link

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request/1500/console

reproduce

./gradlew :x-pack:plugin:ccr:internalClusterTest -Dtests.seed=1B26A8E013912DCB -Dtests.class=org.elasticsearch.xpack.ccr.FollowerFailOverIT -Dtests.method="testFollowIndexAndCloseNode" -Dtests.security.manager=true -Dtests.locale=zh-SG -Dtests.timezone=America/Virgin -Dcompiler.java=11 -Druntime.java=8
stacktrace

15:23:32 FAILURE 81.6s J2 | FollowerFailOverIT.testFollowIndexAndCloseNode <<< FAILURES!
15:23:32    > Throwable #1: java.lang.AssertionError: 
15:23:32    > Expected: <1224L>
15:23:32    >      but: was <1223L>
15:23:32    > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
15:23:32    > 	at org.elasticsearch.xpack.CcrIntegTestCase.lambda$assertSameDocCount$2(CcrIntegTestCase.java:399)
15:23:32    > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:846)
15:23:32    > 	at org.elasticsearch.xpack.CcrIntegTestCase.assertSameDocCount(CcrIntegTestCase.java:394)
15:23:32    > 	at org.elasticsearch.xpack.ccr.FollowerFailOverIT.testFollowIndexAndCloseNode(FollowerFailOverIT.java:136)
15:23:32    > 	at java.lang.Thread.run(Thread.java:748)
15:23:32    > 	Suppressed: java.lang.AssertionError: 
15:23:32    > Expected: <1224L>
15:23:32    >      but: was <1223L>
15:23:32    > 		at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
15:23:32    > 		at org.elasticsearch.xpack.CcrIntegTestCase.lambda$assertSameDocCount$2(CcrIntegTestCase.java:399)
15:23:32    > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:834)
15:23:32    > 		... 39 more
15:23:32    > 	Suppressed: java.lang.AssertionError: 
15:23:32    > Expected: <1224L>
15:23:32    >      but: was <1223L>
15:23:32    > 		at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
15:23:32    > 		at org.elasticsearch.xpack.CcrIntegTestCase.lambda$assertSameDocCount$2(CcrIntegTestCase.java:399)

Oh, I take that back; my branch might be stale. I will merge in the latest master.

@dnhatn
Member Author

dnhatn commented Oct 31, 2018

Sorry for the noise @talevy. I am re-opening this and will look into it soon.

@dnhatn dnhatn reopened this Oct 31, 2018
@talevy
Contributor

talevy commented Oct 31, 2018

thanks @dnhatn, this branch was working off master as of yesterday; I believe it was based on this point in history: https://github.com/elastic/elasticsearch/tree/7ef65dedc36735a0e84f482bd9fbc4acab9f7a17

@davidkyle
Member

Another failure in this suite, this time on 6.5. The actual errors look different but may be related, so I'm documenting them here rather than opening a new issue.

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.5+multijob-unix-compatibility/os=ubuntu&&virtual/15/console

Obviously does not reproduce:

./gradlew :x-pack:plugin:ccr:internalClusterTest \
  -Dtests.seed=C69158B9F0EBE031 \
  -Dtests.class=org.elasticsearch.xpack.ccr.FollowerFailOverIT \
  -Dtests.method="testFailOverOnFollower" \
  -Dtests.security.manager=true \
  -Dtests.locale=da-DK \
  -Dtests.timezone=America/Halifax \
  -Dcompiler.java=11 \
  -Druntime.java=8

Some of the errors are:

java.lang.AssertionError: shard [leader-index][0] on node [leaderd3] has pending operations:
 --> BulkShardRequest [[leader-index][0]] containing [index {[leader-index][doc][101844], source[{"f":101844}]}]
	at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:234)
	at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:2327)
	at
...
com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=809, name=Thread-80, state=RUNNABLE, group=TGRP-FollowerFailOverIT]
Caused by: [leader-index/byEAeLw0QOW7N0HaS64B4A] IndexNotFoundException[no such index]
	at __randomizedtesting.SeedInfo.seed([C69158B9F0EBE031]:0)
	at org.elasticsearch.cluster.routing.RoutingTable.shardRoutingTable(RoutingTable.java:137)
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.primary(TransportReplicationAction.java:795)
...

@davidkyle
Member

Another failure of testFollowIndexAndCloseNode this time on master

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/135/console

Does not reproduce:

./gradlew :x-pack:plugin:ccr:internalClusterTest \
  -Dtests.seed=C96A71B22CD3BE8B \
  -Dtests.class=org.elasticsearch.xpack.ccr.FollowerFailOverIT \
  -Dtests.method="testFollowIndexAndCloseNode" \
  -Dtests.security.manager=true \
  -Dtests.locale=de-DE \
  -Dtests.timezone=America/New_York \
  -Dcompiler.java=11 \
  -Druntime.java=8

The failed assertion is different to the one in the opening comment

Expected: <1190L>
     but: was <1189L>
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
	at org.junit.Assert.assertThat(Assert.java:956)
	at org.junit.Assert.assertThat(Assert.java:923)
	at org.elasticsearch.xpack.CcrIntegTestCase.lambda$assertSameDocCount$2(CcrIntegTestCase.java:399)
	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:848)
	at org.elasticsearch.xpack.CcrIntegTestCase.assertSameDocCount(CcrIntegTestCase.java:394)
	at org.elasticsearch.xpack.ccr.FollowerFailOverIT.testFollowIndexAndCloseNode(FollowerFailOverIT.java:139)
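For context, assertBusy retries an assertion with a short backoff until it passes or a timeout elapses, so these failures mean the follower never caught up within the window. A simplified sketch of that retry loop (illustrative only, not the actual ESTestCase implementation):

```java
import java.util.function.BooleanSupplier;

// Simplified sketch of an assertBusy-style retry loop: keep re-checking a
// condition with a short sleep until it holds or the timeout elapses.
// Not the actual ESTestCase.assertBusy code.
class BusyAssert {
    static void assertBusy(BooleanSupplier condition, long timeoutMillis) throws Exception {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() >= deadline) {
                throw new AssertionError("condition not met within " + timeoutMillis + "ms");
            }
            Thread.sleep(10); // back off briefly before retrying
        }
    }
}
```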

dnhatn added a commit that referenced this issue Nov 3, 2018
The suite FollowerFailOverIT is failing because some documents are not
replicated to the follower. Maybe the FollowTask is not working as
expected or the background indexers eat all resources while the follower
cluster is trying to reform after a failover; then CI is not fast enough
to replicate all the indexed docs within 60 seconds (sometimes I see 80k
docs on the leader).

This commit limits the number of documents to be indexed into the leader
index by the background threads so that we can eliminate the latter
case. This change also replaces a docCount assertion with a docIds
assertion so we can have more information if these tests fail again.

Relates #33337
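The docIds assertion mentioned in the commit message is more informative than a count comparison because, on failure, it reports which documents are missing rather than just how many. A minimal sketch of the idea (illustrative helper, not the actual CcrIntegTestCase code):

```java
import java.util.Set;
import java.util.TreeSet;

// Illustrative: comparing doc-ID sets instead of doc counts tells you *which*
// docs did not replicate, not just how many. Not the real test helper.
class DocIdAssertion {
    /** Returns the IDs present on the leader but missing on the follower. */
    static Set<String> missingOnFollower(Set<String> leaderIds, Set<String> followerIds) {
        Set<String> missing = new TreeSet<>(leaderIds);
        missing.removeAll(followerIds);
        return missing;
    }
}
```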
dnhatn added a commit that referenced this issue Nov 3, 2018
dnhatn added a commit that referenced this issue Nov 3, 2018
@martijnvg
Member

This started to fail again. The latest failure: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+g1gc/113/console

The test checks whether all operations have been fully replicated to the follower shards, but for one shard it turns out there are no ops:

FAILURE 79.1s J2 | FollowerFailOverIT.testFollowIndexAndCloseNode <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: 
   > Expected: <{0=SeqNoStats{maxSeqNo=112, localCheckpoint=112, globalCheckpoint=112}, 1=SeqNoStats{maxSeqNo=118, localCheckpoint=118, globalCheckpoint=118}, 2=SeqNoStats{maxSeqNo=-1, localCheckpoint=-1, globalCheckpoint=-2}}>
   >      but: was <{0=SeqNoStats{maxSeqNo=112, localCheckpoint=112, globalCheckpoint=112}, 1=SeqNoStats{maxSeqNo=118, localCheckpoint=118, globalCheckpoint=118}, 2=SeqNoStats{maxSeqNo=104, localCheckpoint=104, globalCheckpoint=104}}>

After a random non-master node is stopped in the follower cluster, the following warnings show up in the log:

[2018-11-23T12:06:38,102][WARN ][o.e.x.c.a.b.TransportBulkShardOperationsAction] [followerd2] [[index2][2]] failed to perform indices:data/write/bulk_shard_operat
ions[s] on replica [index2][2], node[AHaZkuH5SymTX_zDKdL5Mg], [R], s[STARTED], a[id=EBAwm4tuQ5qNlww6nd75iw]
...
[2018-11-23T12:06:38,108][WARN ][o.e.c.a.s.ShardStateAction] [followerd2] node closed while execution action [internal:cluster/shard/failure] for shard entry [shard id [[index2][2]], allocation id [EBAwm4tuQ5qNlww6nd75iw], primary term [1], message [failed to perform indices:data/write/bulk_shard_operations[s] on replica [index2][2], node[AHaZkuH5SymTX_zDKdL5Mg], [R], s[STARTED], a[id=EBAwm4tuQ5qNlww6nd75iw]], failure [NodeNotConnectedException[[followerd1][127.0.0.1:34629] Node not connected]], markAsStale [true]]
...
[2018-11-23T12:06:38,159][WARN ][o.e.c.r.a.AllocationService] [followerm0] [index2][2] marking unavailable shards as stale: [eYvk9X_aSYC3GkolxpdS0g]
...
[2018-11-23T12:06:38,213][WARN ][o.e.c.r.a.AllocationService] [followerm0] [index2][1] marking unavailable shards as stale: [t8elnSUqR56mlxT6TlWGuA]

I'm trying to understand why these follower shards are marked as stale.

@Tim-Brooks
Contributor

@alpar-t
Contributor

alpar-t commented Jan 11, 2019

@martijnvg
Member

This failure is different from the earlier ones. I'm looking into it now.

@martijnvg
Member

martijnvg commented Jan 11, 2019

This looks to be related to a test issue that was fixed in another test: #35403 (comment)

I will disable delayed allocation in this test too and close this issue when I don't see any failures on monday.

Turns out this is not because of delayed allocation. No attempt was made to assign a replica shard of the follower index. I'm going to add more information in the logs of this test.

@matriv
Contributor

matriv commented Jan 29, 2019

Another one here: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.6+internalClusterTest/70/console

  1> [2019-01-29T14:05:35,830][INFO ][o.e.n.Node               ] [testAddNewReplicasOnFollower] closed
  1> [2019-01-29T14:05:35,826][WARN ][o.e.t.RemoteClusterConnection] [followerm1] fetching nodes from external cluster [leader_cluster] failed
  1> org.elasticsearch.transport.ConnectTransportException: [][127.0.0.1:41543] connect_exception
  1> 	at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:1569) ~[elasticsearch-6.6.0-SNAPSHOT.jar:6.6.0-SNAPSHOT]
  1> 	at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$2(ActionListener.java:99) ~[elasticsearch-6.6.0-SNAPSHOT.jar:6.6.0-SNAPSHOT]
  1> 	at org.elasticsearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:42) ~[elasticsearch-core-6.6.0-SNAPSHOT.jar:6.6.0-SNAPSHOT]
  1> 	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) ~[?:1.8.0_202]
  1> 	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) ~[?:1.8.0_202]
  1> 	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_202]
  1> 	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) ~[?:1.8.0_202]
  1> 	at org.elasticsearch.common.concurrent.CompletableContext.completeExceptionally(CompletableContext.java:57) ~[elasticsearch-core-6.6.0-SNAPSHOT.jar:6.6.0-SNAPSHOT]
  1> 	at org.elasticsearch.transport.MockTcpTransport.lambda$initiateChannel$0(MockTcpTransport.java:195) ~[framework-6.6.0-SNAPSHOT.jar:6.6.0-SNAPSHOT]
  1> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_202]
  1> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_202]
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_202]
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_202]
  1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]
  1> Caused by: java.net.ConnectException: Connection refused (Connection refused)
  1> 	at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_202]
  1> 	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:1.8.0_202]
  1> 	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:1.8.0_202]
  1> 	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:1.8.0_202]
  1> 	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_202]
  1> 	at java.net.Socket.connect(Socket.java:589) ~[?:1.8.0_202]
  1> 	at org.elasticsearch.mocksocket.MockSocket.access$101(MockSocket.java:32) ~[mocksocket-1.2.jar:?]
  1> 	at org.elasticsearch.mocksocket.MockSocket.lambda$connect$0(MockSocket.java:66) ~[mocksocket-1.2.jar:?]
  1> 	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_202]
  1> 	at org.elasticsearch.mocksocket.MockSocket.connect(MockSocket.java:65) ~[mocksocket-1.2.jar:?]
  1> 	at org.elasticsearch.mocksocket.MockSocket.connect(MockSocket.java:59) ~[mocksocket-1.2.jar:?]
  1> 	at org.elasticsearch.transport.MockTcpTransport.lambda$initiateChannel$0(MockTcpTransport.java:190) ~[framework-6.6.0-SNAPSHOT.jar:6.6.0-SNAPSHOT]
  1> 	... 5 more
FAILURE 77.7s J4 | FollowerFailOverIT.testFollowIndexAndCloseNode <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: 
   > Expected: <{0=SeqNoStats{maxSeqNo=33, localCheckpoint=33, globalCheckpoint=33}, 1=SeqNoStats{maxSeqNo=42, localCheckpoint=42, globalCheckpoint=42}, 2=SeqNoStats{maxSeqNo=48, localCheckpoint=-1, globalCheckpoint=-2}}>
   >      but: was <{0=SeqNoStats{maxSeqNo=33, localCheckpoint=33, globalCheckpoint=33}, 1=SeqNoStats{maxSeqNo=42, localCheckpoint=42, globalCheckpoint=42}, 2=SeqNoStats{maxSeqNo=48, localCheckpoint=48, globalCheckpoint=48}}>
   > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   > 	at org.elasticsearch.xpack.CcrIntegTestCase.lambda$assertIndexFullyReplicatedToFollower$4(CcrIntegTestCase.java:453)
   > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:848)
   > 	at org.elasticsearch.xpack.CcrIntegTestCase.assertIndexFullyReplicatedToFollower(CcrIntegTestCase.java:438)
   > 	at org.elasticsearch.xpack.ccr.FollowerFailOverIT.testFollowIndexAndCloseNode(FollowerFailOverIT.java:160)
   > 	at java.lang.Thread.run(Thread.java:748)
   > 	Suppressed: java.lang.AssertionError: 
   > Expected: <{0=SeqNoStats{maxSeqNo=31, localCheckpoint=31, globalCheckpoint=31}, 1=SeqNoStats{maxSeqNo=41, localCheckpoint=41, globalCheckpoint=41}, 2=SeqNoStats{maxSeqNo=47, localCheckpoint=-1, globalCheckpoint=-2}}>
   >      but: was <{0=SeqNoStats{maxSeqNo=33, localCheckpoint=33, globalCheckpoint=33}, 1=SeqNoStats{maxSeqNo=42, localCheckpoint=42, globalCheckpoint=42}, 2=SeqNoStats{maxSeqNo=48, localCheckpoint=48, globalCheckpoint=48}}>
   > 		at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   > 		at org.elasticsearch.xpack.CcrIntegTestCase.lambda$assertIndexFullyReplicatedToFollower$4(CcrIntegTestCase.java:453)
   > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:836)
   > 		... 39 more

dnhatn added a commit to dnhatn/elasticsearch that referenced this issue Feb 1, 2019
@cbuescher
Member

This looks like another one today:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/1351/console

@dnhatn Let me know if I should mute this test again or if you are still looking for more logs.

@cbuescher
Member

Another one today: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.6+internalClusterTest/528/console

@dnhatn Can you mute this if you got what you are looking for in the logs?

@dnhatn
Member Author

dnhatn commented Feb 4, 2019

@cbuescher Thanks for the ping. I'll mute this test now.

dnhatn added a commit to dnhatn/elasticsearch that referenced this issue Feb 4, 2019
dnhatn added a commit to dnhatn/elasticsearch that referenced this issue Feb 4, 2019
dnhatn added a commit that referenced this issue Feb 4, 2019
dnhatn added a commit that referenced this issue Feb 4, 2019
@henningandersen
Contributor

I ran into this on one of my PR builds and investigated it a bit. I can reproduce it in roughly 1 of 50 runs by starting the single test method in IntelliJ with "Repeat until failure" on.

@martijnvg
Member

@dnhatn Can this test be unmuted now?

@dnhatn
Member Author

dnhatn commented Feb 28, 2019

@martijnvg We need to wait for #39467.

@dnhatn
Member Author

dnhatn commented Mar 7, 2019

Resolved by #39584

@dnhatn dnhatn closed this as completed Mar 7, 2019
@dnhatn
Member Author

dnhatn commented Mar 7, 2019

I have unmuted this test.
