[CI] DiscoveryDisruptionIT testJoinWaitsForClusterApplier failing #86974

Closed
DaveCTurner opened this issue May 20, 2022 · 2 comments · Fixed by #87842
Labels: :Distributed Coordination/Cluster Coordination, Team:Distributed (Obsolete), >test-failure

Comments

@DaveCTurner
Contributor

Build scan:
https://gradle-enterprise.elastic.co/s/22kfr6xkperam/tests/:server:internalClusterTest/org.elasticsearch.discovery.DiscoveryDisruptionIT/testJoinWaitsForClusterApplier

Reproduction line:
./gradlew ':server:internalClusterTest' --tests "org.elasticsearch.discovery.DiscoveryDisruptionIT.testJoinWaitsForClusterApplier" -Dtests.seed=9FB11F584CBC91 -Dtests.locale=zh-Hans-CN -Dtests.timezone=Egypt -Druntime.java=17

Applicable branches:
master

Reproduces locally?:
No

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.discovery.DiscoveryDisruptionIT&tests.test=testJoinWaitsForClusterApplier

Failure excerpt:

java.lang.AssertionError: failed to reach a stable cluster of [2] nodes. Tried via [node_t0]. last cluster state:
cluster uuid: tT33RxZHSC2bUIXNFsiwVw [committed: true]
version: 5
state uuid: yXI7NhQ1Q7m1V6pLSq9HfQ
from_diff: false
meta data version: 2
   coordination_metadata:
      term: 1
      last_committed_config: VotingConfiguration{zM3EXpSlSe6xt_cJp3-XXg,hdOfDcy3TDiHiIGfXgTUeg,fftPYa_xTjuct52l9LbDQQ}
      last_accepted_config: VotingConfiguration{zM3EXpSlSe6xt_cJp3-XXg,hdOfDcy3TDiHiIGfXgTUeg,fftPYa_xTjuct52l9LbDQQ}
      voting tombstones: []
metadata customs:
   index-graveyard: IndexGraveyard[[]]
nodes: 
   {node_t1}{zM3EXpSlSe6xt_cJp3-XXg}{dKRW-RxLRdyXDGhbYqfhYQ}{node_t1}{127.0.0.1}{127.0.0.1:42085}{cdfhilmrstw}
   {node_t0}{hdOfDcy3TDiHiIGfXgTUeg}{XPfPxzX8RhuEdkIGJdwK1g}{node_t0}{127.0.0.1}{127.0.0.1:40001}{cdfhilmrstw}, local, master
   {node_t2}{fftPYa_xTjuct52l9LbDQQ}{OCDxp77GR_e7Ry-YCeJwmw}{node_t2}{127.0.0.1}{127.0.0.1:44087}{cdfhilmrstw}
routing_table (version 1):
routing_nodes:
-----node_id[zM3EXpSlSe6xt_cJp3-XXg][V]
-----node_id[fftPYa_xTjuct52l9LbDQQ][V]
-----node_id[hdOfDcy3TDiHiIGfXgTUeg][V]
---- unassigned

  at org.junit.Assert.fail(Assert.java:88)
  at org.elasticsearch.test.ESIntegTestCase.ensureStableCluster(ESIntegTestCase.java:1293)
  at org.elasticsearch.test.ESIntegTestCase.ensureStableCluster(ESIntegTestCase.java:1274)
  at org.elasticsearch.discovery.DiscoveryDisruptionIT.testJoinWaitsForClusterApplier(DiscoveryDisruptionIT.java:244)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:568)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:824)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:475)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:831)
  at java.lang.Thread.run(Thread.java:833)

@DaveCTurner DaveCTurner added :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >test-failure Triaged test failures from CI labels May 20, 2022
@DaveCTurner DaveCTurner self-assigned this May 20, 2022
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue May 20, 2022
@DaveCTurner DaveCTurner added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Jun 16, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner
Contributor Author

The issue is that the victim might notice it's been removed from the cluster and start trying to rejoin before its applier service is blocked. If that join request is then delayed in transmission, it can end up being processed between the calls to masterTransportService.clearAllRules() and ensureStableCluster(2, masterName). In that case the join succeeds and the node is back in the cluster, which is why the last cluster state in the failure lists three nodes rather than two.

The timing is very delicate though. Here's a way to add the necessary delays:

diff --git a/server/src/internalClusterTest/java/org/elasticsearch/discovery/DiscoveryDisruptionIT.java b/server/src/internalClusterTest/java/org/elasticsearch/discovery/DiscoveryDisruptionIT.java
index 07c1ddfe328..1e07beb9e0d 100644
--- a/server/src/internalClusterTest/java/org/elasticsearch/discovery/DiscoveryDisruptionIT.java
+++ b/server/src/internalClusterTest/java/org/elasticsearch/discovery/DiscoveryDisruptionIT.java
@@ -215,12 +215,32 @@ public class DiscoveryDisruptionIT extends AbstractDisruptionTestCase {
         final var victimName = randomValueOtherThan(masterName, () -> randomFrom(internalCluster().getNodeNames()));
         logger.info("--> master [{}], victim [{}]", masterName, victimName);

+        final var victimTransportService = (MockTransportService) internalCluster().getInstance(TransportService.class, victimName);
+        final var rejoinBarrier = new CyclicBarrier(2);
+        victimTransportService.addSendBehavior((connection, requestId, action, request, options) -> {
+            if (action.equals(JoinHelper.JOIN_ACTION_NAME)) {
+                victimTransportService.getThreadPool().generic().execute(() -> {
+                    try {
+                        rejoinBarrier.await(10, TimeUnit.SECONDS);
+                        rejoinBarrier.await(10, TimeUnit.SECONDS);
+                        connection.sendRequest(requestId, action, request, options);
+                    } catch (Exception e) {
+                        throw new AssertionError(e);
+                    }
+                });
+            } else {
+                connection.sendRequest(requestId, action, request, options);
+            }
+        });
+
         // drop the victim from the cluster with a network disruption
         final var masterTransportService = (MockTransportService) internalCluster().getInstance(TransportService.class, masterName);
         masterTransportService.addFailToSendNoConnectRule(internalCluster().getInstance(TransportService.class, victimName));
         logger.info("--> waiting for victim's departure");
         ensureStableCluster(2, masterName);

+        rejoinBarrier.await(10, TimeUnit.SECONDS);
+
         // block the cluster applier thread on the victim
         logger.info("--> blocking victim's applier service");
         final var barrier = new CyclicBarrier(2);
@@ -236,7 +256,6 @@ public class DiscoveryDisruptionIT extends AbstractDisruptionTestCase {
         barrier.await(10, TimeUnit.SECONDS);

         // verify that the victim sends no joins while the applier is blocked
-        final var victimTransportService = (MockTransportService) internalCluster().getInstance(TransportService.class, victimName);
         victimTransportService.addSendBehavior((connection, requestId, action, request, options) -> {
             assertNotEquals(action, JoinHelper.JOIN_ACTION_NAME);
             connection.sendRequest(requestId, action, request, options);
@@ -245,6 +264,8 @@ public class DiscoveryDisruptionIT extends AbstractDisruptionTestCase {
         // fix the network disruption
         logger.info("--> removing network disruption");
         masterTransportService.clearAllRules();
+        rejoinBarrier.await(10, TimeUnit.SECONDS);
+        Thread.sleep(1000);
         ensureStableCluster(2, masterName);

         // permit joins again
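
The rejoinBarrier in the diff above is a two-step handshake: the intercepted join parks on the barrier once to announce that it is pending, then a second time to wait for permission to go out. A minimal sketch of that handshake in plain Java follows. It uses no Elasticsearch classes; the class name, the event strings and the single-threaded executor are stand-ins chosen for illustration only.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class DelayedSendSketch {

    // Returns the events in the order they were recorded.
    static List<String> run() throws Exception {
        final List<String> events = new CopyOnWriteArrayList<>();
        final CyclicBarrier rejoinBarrier = new CyclicBarrier(2);
        final ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            // Hypothetical stand-in for the send behavior that intercepts the join.
            final var pendingJoin = executor.submit(() -> {
                events.add("join intercepted");
                rejoinBarrier.await(10, TimeUnit.SECONDS); // 1st rendezvous: a join is now pending
                rejoinBarrier.await(10, TimeUnit.SECONDS); // 2nd rendezvous: wait to be released
                events.add("join sent");
                return null;
            });

            // Test thread: wait until a join is known to be in flight.
            rejoinBarrier.await(10, TimeUnit.SECONDS);
            // Stand-in for blocking the applier and clearing the disruption.
            events.add("applier blocked, disruption cleared");
            // Only now let the delayed join go out.
            rejoinBarrier.await(10, TimeUnit.SECONDS);
            pendingJoin.get(10, TimeUnit.SECONDS);
        } finally {
            executor.shutdownNow();
        }
        return List.copyOf(events);
    }

    public static void main(String[] args) throws Exception {
        run().forEach(System.out::println);
    }
}
```

Because each await() only returns once both threads have arrived, the three events are always recorded in the same order, which is what makes the otherwise rare interleaving reproducible.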

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jun 20, 2022
Block the cluster applier before disrupting the cluster so the victim
node doesn't try and rejoin too soon.

Closes elastic#86974
elasticsearchmachine pushed a commit that referenced this issue Jun 30, 2022
Block the cluster applier before disrupting the cluster so the victim
node doesn't try and rejoin too soon.

Closes #86974
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Jun 30, 2022
Block the cluster applier before disrupting the cluster so the victim
node doesn't try and rejoin too soon.

Closes elastic#86974
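
The fix in the commit message relies on being able to hold the applier thread so that nothing else runs on it until the test releases it. A minimal sketch of that idea in plain Java follows, assuming a single-threaded executor as a stand-in for the applier. This is not the Elasticsearch applier API, and all names are hypothetical.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class BlockedApplierSketch {

    // Returns { ranWhileBlocked, ranAfterRelease }.
    static boolean[] run() throws Exception {
        final ExecutorService applier = Executors.newSingleThreadExecutor();
        final CyclicBarrier barrier = new CyclicBarrier(2);
        final AtomicBoolean applied = new AtomicBoolean(false);
        try {
            // Occupy the only applier thread until the test releases it.
            final Future<?> blocker = applier.submit(() -> {
                barrier.await(10, TimeUnit.SECONDS); // signal: the applier thread is held
                barrier.await(10, TimeUnit.SECONDS); // stay held until released
                return null;
            });
            barrier.await(10, TimeUnit.SECONDS);

            // Work queued now cannot start: the single thread is still held.
            final Future<?> update = applier.submit(() -> applied.set(true));
            final boolean ranWhileBlocked = applied.get();

            // Release the applier thread and let the queued work run.
            barrier.await(10, TimeUnit.SECONDS);
            blocker.get(10, TimeUnit.SECONDS);
            update.get(10, TimeUnit.SECONDS);
            return new boolean[] { ranWhileBlocked, applied.get() };
        } finally {
            applier.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        final boolean[] result = run();
        System.out.println("ran while blocked: " + result[0]);
        System.out.println("ran after release: " + result[1]);
    }
}
```

Holding the thread before introducing the disruption, as the commit message describes, means there is no window in which the victim can react to its removal before it is blocked.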