[CI] DiscoveryDisruptionIT testJoinWaitsForClusterApplier failing #86974

DaveCTurner · 2022-05-20T13:27:16Z

Build scan:
https://gradle-enterprise.elastic.co/s/22kfr6xkperam/tests/:server:internalClusterTest/org.elasticsearch.discovery.DiscoveryDisruptionIT/testJoinWaitsForClusterApplier

Reproduction line:
./gradlew ':server:internalClusterTest' --tests "org.elasticsearch.discovery.DiscoveryDisruptionIT.testJoinWaitsForClusterApplier" -Dtests.seed=9FB11F584CBC91 -Dtests.locale=zh-Hans-CN -Dtests.timezone=Egypt -Druntime.java=17

Applicable branches:
master

Reproduces locally?:
No

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.discovery.DiscoveryDisruptionIT&tests.test=testJoinWaitsForClusterApplier

Failure excerpt:

java.lang.AssertionError: failed to reach a stable cluster of [2] nodes. Tried via [node_t0]. last cluster state:
cluster uuid: tT33RxZHSC2bUIXNFsiwVw [committed: true]
version: 5
state uuid: yXI7NhQ1Q7m1V6pLSq9HfQ
from_diff: false
meta data version: 2
   coordination_metadata:
      term: 1
      last_committed_config: VotingConfiguration{zM3EXpSlSe6xt_cJp3-XXg,hdOfDcy3TDiHiIGfXgTUeg,fftPYa_xTjuct52l9LbDQQ}
      last_accepted_config: VotingConfiguration{zM3EXpSlSe6xt_cJp3-XXg,hdOfDcy3TDiHiIGfXgTUeg,fftPYa_xTjuct52l9LbDQQ}
      voting tombstones: []
metadata customs:
   index-graveyard: IndexGraveyard[[]]
nodes: 
   {node_t1}{zM3EXpSlSe6xt_cJp3-XXg}{dKRW-RxLRdyXDGhbYqfhYQ}{node_t1}{127.0.0.1}{127.0.0.1:42085}{cdfhilmrstw}
   {node_t0}{hdOfDcy3TDiHiIGfXgTUeg}{XPfPxzX8RhuEdkIGJdwK1g}{node_t0}{127.0.0.1}{127.0.0.1:40001}{cdfhilmrstw}, local, master
   {node_t2}{fftPYa_xTjuct52l9LbDQQ}{OCDxp77GR_e7Ry-YCeJwmw}{node_t2}{127.0.0.1}{127.0.0.1:44087}{cdfhilmrstw}
routing_table (version 1):
routing_nodes:
-----node_id[zM3EXpSlSe6xt_cJp3-XXg][V]
-----node_id[fftPYa_xTjuct52l9LbDQQ][V]
-----node_id[hdOfDcy3TDiHiIGfXgTUeg][V]
---- unassigned

  at org.junit.Assert.fail(Assert.java:88)
  at org.elasticsearch.test.ESIntegTestCase.ensureStableCluster(ESIntegTestCase.java:1293)
  at org.elasticsearch.test.ESIntegTestCase.ensureStableCluster(ESIntegTestCase.java:1274)
  at org.elasticsearch.discovery.DiscoveryDisruptionIT.testJoinWaitsForClusterApplier(DiscoveryDisruptionIT.java:244)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:568)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:824)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:475)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:831)
  at java.lang.Thread.run(Thread.java:833)

elasticmachine · 2022-06-16T11:26:20Z

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner · 2022-06-20T11:12:59Z

The issue is that the victim might notice that it has been removed from the cluster and start trying to rejoin before its applier service is blocked. If that join request is delayed in transmission, it can be processed between the calls to masterTransportService.clearAllRules() and ensureStableCluster(2, masterName). In that case the join succeeds and the node is back in the cluster, so the master sees three nodes (as in the cluster state above) and the check for a stable cluster of 2 nodes fails.

The timing is very delicate though. Here's a way to add the necessary delays:

diff --git a/server/src/internalClusterTest/java/org/elasticsearch/discovery/DiscoveryDisruptionIT.java b/server/src/internalClusterTest/java/org/elasticsearch/discovery/DiscoveryDisruptionIT.java
index 07c1ddfe328..1e07beb9e0d 100644
--- a/server/src/internalClusterTest/java/org/elasticsearch/discovery/DiscoveryDisruptionIT.java
+++ b/server/src/internalClusterTest/java/org/elasticsearch/discovery/DiscoveryDisruptionIT.java
@@ -215,12 +215,32 @@ public class DiscoveryDisruptionIT extends AbstractDisruptionTestCase {
         final var victimName = randomValueOtherThan(masterName, () -> randomFrom(internalCluster().getNodeNames()));
         logger.info("--> master [{}], victim [{}]", masterName, victimName);

+        final var victimTransportService = (MockTransportService) internalCluster().getInstance(TransportService.class, victimName);
+        final var rejoinBarrier = new CyclicBarrier(2);
+        victimTransportService.addSendBehavior((connection, requestId, action, request, options) -> {
+            if (action.equals(JoinHelper.JOIN_ACTION_NAME)) {
+                victimTransportService.getThreadPool().generic().execute(() -> {
+                    try {
+                        rejoinBarrier.await(10, TimeUnit.SECONDS);
+                        rejoinBarrier.await(10, TimeUnit.SECONDS);
+                        connection.sendRequest(requestId, action, request, options);
+                    } catch (Exception e) {
+                        throw new AssertionError(e);
+                    }
+                });
+            } else {
+                connection.sendRequest(requestId, action, request, options);
+            }
+        });
+
         // drop the victim from the cluster with a network disruption
         final var masterTransportService = (MockTransportService) internalCluster().getInstance(TransportService.class, masterName);
         masterTransportService.addFailToSendNoConnectRule(internalCluster().getInstance(TransportService.class, victimName));
         logger.info("--> waiting for victim's departure");
         ensureStableCluster(2, masterName);

+        rejoinBarrier.await(10, TimeUnit.SECONDS);
+
         // block the cluster applier thread on the victim
         logger.info("--> blocking victim's applier service");
         final var barrier = new CyclicBarrier(2);
@@ -236,7 +256,6 @@ public class DiscoveryDisruptionIT extends AbstractDisruptionTestCase {
         barrier.await(10, TimeUnit.SECONDS);

         // verify that the victim sends no joins while the applier is blocked
-        final var victimTransportService = (MockTransportService) internalCluster().getInstance(TransportService.class, victimName);
         victimTransportService.addSendBehavior((connection, requestId, action, request, options) -> {
             assertNotEquals(action, JoinHelper.JOIN_ACTION_NAME);
             connection.sendRequest(requestId, action, request, options);
@@ -245,6 +264,8 @@ public class DiscoveryDisruptionIT extends AbstractDisruptionTestCase {
         // fix the network disruption
         logger.info("--> removing network disruption");
         masterTransportService.clearAllRules();
+        rejoinBarrier.await(10, TimeUnit.SECONDS);
+        Thread.sleep(1000);
         ensureStableCluster(2, masterName);

         // permit joins again
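The two-step rendezvous on rejoinBarrier in the diff above can be illustrated outside Elasticsearch with plain java.util.concurrent. The following is only a minimal sketch: the class name DelayedJoinSketch, the single-thread executor standing in for the victim's generic thread pool, and the event strings are all hypothetical, and no Elasticsearch API is involved. It shows how the first await() tells the test thread that a join attempt is already in flight, and the second await() holds that join back until the test thread chooses to release it.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-alone illustration of the barrier pattern; not Elasticsearch code.
final class DelayedJoinSketch {

    static List<String> run() throws Exception {
        final List<String> events = new CopyOnWriteArrayList<>();
        final CyclicBarrier rejoinBarrier = new CyclicBarrier(2);
        // Stands in for the thread on which the victim sends its join request.
        final ExecutorService victim = Executors.newSingleThreadExecutor();
        try {
            var pendingJoin = victim.submit(() -> {
                // Rendezvous 1: the join attempt has started but is not yet sent.
                rejoinBarrier.await(10, TimeUnit.SECONDS);
                // Rendezvous 2: stay parked until the test thread releases the join.
                rejoinBarrier.await(10, TimeUnit.SECONDS);
                events.add("join delivered");
                return null;
            });

            // Test thread: wait until a join attempt is known to be in flight.
            rejoinBarrier.await(10, TimeUnit.SECONDS);
            events.add("applier blocked");
            events.add("disruption cleared");
            // Release the delayed join only after the disruption has been cleared.
            rejoinBarrier.await(10, TimeUnit.SECONDS);

            pendingJoin.get(10, TimeUnit.SECONDS);
            return List.copyOf(events);
        } finally {
            victim.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run());
    }
}
```

Because a CyclicBarrier resets after each trip, the same two-party barrier serves both rendezvous points. In this sketch the delayed "join" is always delivered after the disruption is cleared, which mirrors the ordering that the diff forces in order to make the test fail reliably.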

DaveCTurner added the :Distributed Coordination/Cluster Coordination and >test-failure labels May 20, 2022

DaveCTurner self-assigned this May 20, 2022

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue May 20, 2022

More verbose logging in testJoinWaitsForClusterApplier

8bda25a

Relates elastic#86974

This was referenced May 20, 2022

More verbose logging in testJoinWaitsForClusterApplier #86975

Merged

Capture deprecation warnings in batched master tasks #85525

Merged

DaveCTurner added a commit that referenced this issue May 23, 2022

More verbose logging in testJoinWaitsForClusterApplier (#86975)

32c90ab

Relates #86974

DaveCTurner added the Team:Distributed (Obsolete) label Jun 16, 2022

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jun 20, 2022

Fix testJoinWaitsForClusterApplier

2d0df8c

Block the cluster applier before disrupting the cluster so the victim node doesn't try and rejoin too soon. Closes elastic#86974

DaveCTurner mentioned this issue Jun 20, 2022

Fix testJoinWaitsForClusterApplier #87842

Merged

elasticsearchmachine closed this as completed in #87842 Jun 30, 2022

elasticsearchmachine pushed a commit that referenced this issue Jun 30, 2022

Fix testJoinWaitsForClusterApplier (#87842)

64b225a

Block the cluster applier before disrupting the cluster so the victim node doesn't try and rejoin too soon. Closes #86974

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jun 30, 2022

Fix testJoinWaitsForClusterApplier (elastic#87842)

16f22d0

Block the cluster applier before disrupting the cluster so the victim node doesn't try and rejoin too soon. Closes elastic#86974

elasticsearchmachine pushed a commit that referenced this issue Jun 30, 2022

Fix testJoinWaitsForClusterApplier (#87842) (#88205)

9237475

Block the cluster applier before disrupting the cluster so the victim node doesn't try and rejoin too soon. Closes #86974
