[CI] DiscoveryDisruptionIT testJoinWaitsForClusterApplier failing #86974
Comments
Pinging @elastic/es-distributed (Team:Distributed)
The issue is that the victim might notice it's been removed from the cluster and start to attempt to rejoin before its applier service is blocked, but the join request is delayed in transmission so that it's processed between calling
It's very delicate timing though. Here's a way to add the necessary delays:
diff --git a/server/src/internalClusterTest/java/org/elasticsearch/discovery/DiscoveryDisruptionIT.java b/server/src/internalClusterTest/java/org/elasticsearch/discovery/DiscoveryDisruptionIT.java
index 07c1ddfe328..1e07beb9e0d 100644
--- a/server/src/internalClusterTest/java/org/elasticsearch/discovery/DiscoveryDisruptionIT.java
+++ b/server/src/internalClusterTest/java/org/elasticsearch/discovery/DiscoveryDisruptionIT.java
@@ -215,12 +215,32 @@ public class DiscoveryDisruptionIT extends AbstractDisruptionTestCase {
final var victimName = randomValueOtherThan(masterName, () -> randomFrom(internalCluster().getNodeNames()));
logger.info("--> master [{}], victim [{}]", masterName, victimName);
+ final var victimTransportService = (MockTransportService) internalCluster().getInstance(TransportService.class, victimName);
+ final var rejoinBarrier = new CyclicBarrier(2);
+ victimTransportService.addSendBehavior((connection, requestId, action, request, options) -> {
+ if (action.equals(JoinHelper.JOIN_ACTION_NAME)) {
+ victimTransportService.getThreadPool().generic().execute(() -> {
+ try {
+ rejoinBarrier.await(10, TimeUnit.SECONDS);
+ rejoinBarrier.await(10, TimeUnit.SECONDS);
+ connection.sendRequest(requestId, action, request, options);
+ } catch (Exception e) {
+ throw new AssertionError(e);
+ }
+ });
+ } else {
+ connection.sendRequest(requestId, action, request, options);
+ }
+ });
+
// drop the victim from the cluster with a network disruption
final var masterTransportService = (MockTransportService) internalCluster().getInstance(TransportService.class, masterName);
masterTransportService.addFailToSendNoConnectRule(internalCluster().getInstance(TransportService.class, victimName));
logger.info("--> waiting for victim's departure");
ensureStableCluster(2, masterName);
+ rejoinBarrier.await(10, TimeUnit.SECONDS);
+
// block the cluster applier thread on the victim
logger.info("--> blocking victim's applier service");
final var barrier = new CyclicBarrier(2);
@@ -236,7 +256,6 @@ public class DiscoveryDisruptionIT extends AbstractDisruptionTestCase {
barrier.await(10, TimeUnit.SECONDS);
// verify that the victim sends no joins while the applier is blocked
- final var victimTransportService = (MockTransportService) internalCluster().getInstance(TransportService.class, victimName);
victimTransportService.addSendBehavior((connection, requestId, action, request, options) -> {
assertNotEquals(action, JoinHelper.JOIN_ACTION_NAME);
connection.sendRequest(requestId, action, request, options);
@@ -245,6 +264,8 @@ public class DiscoveryDisruptionIT extends AbstractDisruptionTestCase {
// fix the network disruption
logger.info("--> removing network disruption");
masterTransportService.clearAllRules();
+ rejoinBarrier.await(10, TimeUnit.SECONDS);
+ Thread.sleep(1000);
ensureStableCluster(2, masterName);
// permit joins again
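The two-step rendezvous that the diff uses to hold back the join can be illustrated in isolation. The following is only a minimal sketch with plain JDK classes, not Elasticsearch code: the class name `DelayedSendSketch`, the `run` method, and the event strings are hypothetical stand-ins. The background task plays the role of the delayed join request, and the calling thread plays the role of the test body.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-alone illustration of the double-await pattern in the diff above.
class DelayedSendSketch {

    static List<String> run() throws Exception {
        final List<String> events = Collections.synchronizedList(new ArrayList<>());
        final CyclicBarrier rejoinBarrier = new CyclicBarrier(2);
        final ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // Stand-in for the intercepted join request: it is attempted early
            // but only delivered once the test thread allows it.
            final Future<?> sender = pool.submit(() -> {
                rejoinBarrier.await(10, TimeUnit.SECONDS); // 1st rendezvous: the join has been attempted
                rejoinBarrier.await(10, TimeUnit.SECONDS); // 2nd rendezvous: the test permits delivery
                events.add("join delivered");
                return null;
            });

            // Stand-in for the test body.
            rejoinBarrier.await(10, TimeUnit.SECONDS); // wait until the join attempt has started
            events.add("applier blocked");             // placeholder for blocking the applier
            rejoinBarrier.await(10, TimeUnit.SECONDS); // release the delayed join
            sender.get(10, TimeUnit.SECONDS);          // surface any failure from the background task
        } finally {
            pool.shutdownNow();
        }
        return events;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run());
    }
}
```

Because a `CyclicBarrier` resets after each trip, the same two-party barrier serves both rendezvous points, so the placeholder step always happens before the delayed delivery.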
Block the cluster applier before disrupting the cluster so the victim node doesn't try and rejoin too soon. Closes elastic#86974
Build scan:
https://gradle-enterprise.elastic.co/s/22kfr6xkperam/tests/:server:internalClusterTest/org.elasticsearch.discovery.DiscoveryDisruptionIT/testJoinWaitsForClusterApplier
Reproduction line:
./gradlew ':server:internalClusterTest' --tests "org.elasticsearch.discovery.DiscoveryDisruptionIT.testJoinWaitsForClusterApplier" -Dtests.seed=9FB11F584CBC91 -Dtests.locale=zh-Hans-CN -Dtests.timezone=Egypt -Druntime.java=17
Applicable branches:
master
Reproduces locally?:
No
Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.discovery.DiscoveryDisruptionIT&tests.test=testJoinWaitsForClusterApplier
Failure excerpt: