[CI] ClusterDisruptionIT testAckedIndexing failing #91447
Pinging @elastic/es-distributed (Team:Distributed)
I could reproduce this. See https://gradle-enterprise.elastic.co/s/tmnxogpfsbwiu.
This used to reproduce (almost) reliably with [...]
Hmm, I don't think we've fixed anything that might have prevented the cluster from stabilising in a reasonable timeframe, which appears to be the problem here. But maybe we have. In the run you shared above the cluster seems to be constantly adding [...]
That seems suspicious, and unrelated to allocator changes.
Hmm, the master thinks this node is faulty and therefore never sends it the cluster state: [...]
So the Coordinator will publish the new cluster state, and it calls into the FollowersChecker to remove faulty nodes if they are no longer part of the cluster state. The only other paths that I see to clear faulty nodes are when the node becomes a follower or becomes a candidate. In this test failure scenario, a faulty node (node_t1) joins a cluster with master node_t2. So the cluster state being published should contain node_t1, which I think will leave node_t1 marked as a faulty node. I haven't been able to reproduce the failure. Next I'll see if I can write a new test in FollowersCheckerTests to cover the scenario, and see what happens.
A node is marked as faulty in the FollowersChecker when there is a network disconnection with that node. A node is not returned to not-faulty unless a master election occurs or the node is removed from the cluster, as mentioned above. I think the failed run differs in that it marks a node as faulty but never removes it from the cluster -- and it is that removal which would take the node off the faultyNodes list.
*Note: I searched the logs; these are the relevant ones from the failed run in this ticket, and then from a successful run on my local machine. So the question becomes: why didn't the node get removed properly from the cluster state?
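To make the lifecycle described above concrete, here is a minimal, self-contained Java sketch of the bookkeeping as I understand it. FaultyNodeTracker and all of its method names are illustrative, not the real FollowersChecker code:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of the faulty-node bookkeeping described above.
// Names are hypothetical; this is not the real FollowersChecker.
class FaultyNodeTracker {
    private final Set<String> faultyNodes = new HashSet<>();

    // Called when a follower check hits a network disconnection.
    void markFaulty(String nodeId) {
        faultyNodes.add(nodeId);
        // A node-left cluster state update is scheduled elsewhere; only when
        // that update removes the node is the faulty flag forgotten.
    }

    // Called when the published cluster state no longer contains the node.
    void onNodeRemoved(String nodeId) {
        faultyNodes.remove(nodeId);
    }

    // Called when the local node stops being master (becomes follower or candidate).
    void clearAll() {
        faultyNodes.clear();
    }

    // Publications are skipped for faulty nodes -- this is the gap that bites
    // if the node-left update never happens but the node later rejoins.
    boolean shouldPublishTo(String nodeId) {
        return !faultyNodes.contains(nodeId);
    }

    public static void main(String[] args) {
        FaultyNodeTracker tracker = new FaultyNodeTracker();
        tracker.markFaulty("node_t1");
        // Suppose the node-left update is interrupted, so onNodeRemoved never runs.
        // node_t1 later rejoins, but the master still refuses to publish to it:
        System.out.println(tracker.shouldPublishTo("node_t1")); // prints false
    }
}
```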
The logic to remove the node -- the node-left cluster state update -- is scheduled right after logging "marking node as faulty". That leads back to the Coordinator's removeNode. The removeNode code was actually changed significantly 9 months ago, to improve the master service batch queuing system. Maybe something was wrong with the batching, such that the cluster state wasn't updated properly. But this might be enough investigation to put the failure to rest, because the code has changed greatly. Potentially incidentally fixed by #92021
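Purely as an illustration of that suspicion (this is not the real master service or its batching API), the sketch below shows how an exception thrown while executing a queued node-left task would leave any bookkeeping that depends on a successful cluster state update unexecuted:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration, not Elasticsearch code: if a queued node-left
// task throws, the cluster state is not updated for it, so cleanup that is
// tied to a successful update (such as clearing the faulty flag) never runs.
class BatchedRemovalSketch {
    interface Task {
        void execute() throws Exception;
    }

    private final Deque<Task> queue = new ArrayDeque<>();

    void submit(Task task) {
        queue.add(task);
    }

    void runAll() {
        while (!queue.isEmpty()) {
            Task task = queue.poll();
            try {
                task.execute();
            } catch (Exception e) {
                System.out.println("node-left task failed: " + e);
            }
        }
    }

    public static void main(String[] args) {
        BatchedRemovalSketch batch = new BatchedRemovalSketch();
        batch.submit(() -> {
            throw new NullPointerException("boom"); // the failure mode found later in this thread
        });
        batch.submit(() -> System.out.println("later tasks still run, but the faulty entry remains"));
        batch.runAll();
    }
}
```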
Thanks @DiannaHohensee ! I believe @tlrx also mentioned he could not reproduce it on the current version, and was thinking of closing this (and re-opening it if it ever pops up again). So closing this for now. Feel free to re-open if you think we should not close it.
I don't think this is addressed by the master-service task queues work. The node's failure to rejoin properly would eventually be detected by the lag detector.
However the default lag-detection timeout is 90s and we do not wait that long in these tests. So I think it would be reasonable to change that:

```diff
diff --git a/server/src/test/java/org/elasticsearch/discovery/AbstractDisruptionTestCase.java b/server/src/test/java/org/elasticsearch/discovery/AbstractDisruptionTestCase.java
index 2a8621a60b6..dc08fffa49c 100644
--- a/server/src/test/java/org/elasticsearch/discovery/AbstractDisruptionTestCase.java
+++ b/server/src/test/java/org/elasticsearch/discovery/AbstractDisruptionTestCase.java
@@ -13,6 +13,7 @@ import org.elasticsearch.cluster.block.ClusterBlock;
import org.elasticsearch.cluster.block.ClusterBlockLevel;
import org.elasticsearch.cluster.coordination.Coordinator;
import org.elasticsearch.cluster.coordination.FollowersChecker;
+import org.elasticsearch.cluster.coordination.LagDetector;
import org.elasticsearch.cluster.coordination.LeaderChecker;
import org.elasticsearch.cluster.node.DiscoveryNodes;
import org.elasticsearch.common.settings.Settings;
@@ -116,6 +117,7 @@ public abstract class AbstractDisruptionTestCase extends ESIntegTestCase {
.put(FollowersChecker.FOLLOWER_CHECK_TIMEOUT_SETTING.getKey(), "5s") // for hitting simulated network failures quickly
.put(FollowersChecker.FOLLOWER_CHECK_RETRY_COUNT_SETTING.getKey(), 1) // for hitting simulated network failures quickly
.put(Coordinator.PUBLISH_TIMEOUT_SETTING.getKey(), "5s") // <-- for hitting simulated network failures quickly
+ .put(LagDetector.CLUSTER_FOLLOWER_LAG_TIMEOUT_SETTING.getKey(), "5s") // remove lagging nodes quickly so they can rejoin
.put(TransportSettings.CONNECT_TIMEOUT.getKey(), "10s") // Network delay disruption waits for the min between this
// value and the time of disruption and does not recover immediately
// when disruption is stop. We should make sure we recover faster
```

However in production it is far from ideal for a faulty node to cause constant cluster state updates (at server/src/main/java/org/elasticsearch/cluster/coordination/FollowersChecker.java, line 329 in 8e6e0e5).
We might do that just before becoming [server/src/main/java/org/elasticsearch/cluster/coordination/FollowersChecker.java, lines 150 to 151 in 8e6e0e5].
Then once we've become [server/src/main/java/org/elasticsearch/cluster/coordination/FollowersChecker.java, line 360 in 8e6e0e5].
But since we're [...]

In terms of a fix, perhaps it would be sufficient to do this in Coordinator:

```diff
diff --git a/server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java b/server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java
index 3da890b37ad..57936e27c49 100644
--- a/server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java
+++ b/server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java
@@ -920,6 +920,7 @@ public class Coordinator extends AbstractLifecycleComponent implements ClusterSt
);
assert leaderChecker.leader() == null : leaderChecker.leader();
+ followersChecker.clearCurrentNodes();
followersChecker.updateFastResponseState(leaderTerm, mode);
updateSingleNodeClusterChecker();
```

Not sure that covers all cases. Maybe a more direct solution would be better: if we get a [...]

Fixing this is pretty low-priority IMO since it requires a super-rare set of disruptions to happen with exactly the right timings, and resolves itself in time anyway, but I think we should keep this action on a list somewhere.
I strongly agree that the lifecycle for marking and unmarking a faulty node appears quite fragile. I was thinking that a successful node join should also reach into the FollowersChecker to remove the joining node from the faultyNodes list, if it is on it. Then the newly joined node could receive the cluster state publication -- the lack of which is the cause of this test failure. But I don't have a good understanding of the cluster state update contents, and the lifecycle expectations there.
Though this bit sounds like a second problem.
Lol. I wanted to see where the master service error was for the missing node-left cluster state update, and I found this NullPointerException 😁
UPDATE: [...]
That might work, but I think I'd rather refuse to let a faulty node join the cluster in the first place, retrying its removal as necessary. The trouble is that "successful node join" is not very well-defined: the master might think a join was successful even if the joining node does not, and this could lead to a situation where the node keeps on joining without fault detection properly removing it first.
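For illustration only, here is a hedged sketch of that idea -- refuse the join while the node is still marked faulty and retry its removal. JoinGate and its methods are hypothetical names and do not reflect the actual Coordinator join path:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: joins from nodes still on the faultyNodes list are
// refused, and the pending node-left removal is retried so the stale entry
// eventually gets cleared. Not the real join-validation code.
class JoinGate {
    private final Set<String> faultyNodes;

    JoinGate(Set<String> faultyNodes) {
        this.faultyNodes = faultyNodes;
    }

    // Returns true if the join may proceed; otherwise the join is refused and
    // the caller's removal retry runs so a later join attempt can succeed.
    boolean tryAcceptJoin(String joiningNodeId, Runnable retryNodeLeftRemoval) {
        if (faultyNodes.contains(joiningNodeId)) {
            retryNodeLeftRemoval.run();
            return false; // the joining node will retry later
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> faulty = new HashSet<>(Set.of("node_t1"));
        JoinGate gate = new JoinGate(faulty);
        // First attempt is refused but triggers another removal attempt, which
        // (in this toy) clears the entry so the retried join succeeds.
        System.out.println(gate.tryAcceptJoin("node_t1", () -> faulty.remove("node_t1"))); // false
        System.out.println(gate.tryAcceptJoin("node_t1", () -> {})); // true
    }
}
```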
Ahh, that'd do it I reckon. I tried to reproduce the race I guessed at in my previous comment by adding a delay that should provoke it, but have not succeeded after several hundred attempts. But this NPE would cause the same problem without needing a re-election.
Chatted more with David about this, to get a better understanding of how cluster state updates work. The general outline of the proposed fix would be to
A quick fix, but testing might be hard if we want to cover this failure case. Perhaps something contrived could be tested, where we reach into components, make artificial alterations, and then check that it gets sorted out.
NB I think we can close this test failure if we make the following change:

```diff
diff --git a/server/src/test/java/org/elasticsearch/discovery/AbstractDisruptionTestCase.java b/server/src/test/java/org/elasticsearch/discovery/AbstractDisruptionTestCase.java
index 2a8621a60b6..dc08fffa49c 100644
--- a/server/src/test/java/org/elasticsearch/discovery/AbstractDisruptionTestCase.java
+++ b/server/src/test/java/org/elasticsearch/discovery/AbstractDisruptionTestCase.java
@@ -13,6 +13,7 @@ import org.elasticsearch.cluster.block.ClusterBlock;
import org.elasticsearch.cluster.block.ClusterBlockLevel;
import org.elasticsearch.cluster.coordination.Coordinator;
import org.elasticsearch.cluster.coordination.FollowersChecker;
+import org.elasticsearch.cluster.coordination.LagDetector;
import org.elasticsearch.cluster.coordination.LeaderChecker;
import org.elasticsearch.cluster.node.DiscoveryNodes;
import org.elasticsearch.common.settings.Settings;
@@ -116,6 +117,7 @@ public abstract class AbstractDisruptionTestCase extends ESIntegTestCase {
.put(FollowersChecker.FOLLOWER_CHECK_TIMEOUT_SETTING.getKey(), "5s") // for hitting simulated network failures quickly
.put(FollowersChecker.FOLLOWER_CHECK_RETRY_COUNT_SETTING.getKey(), 1) // for hitting simulated network failures quickly
.put(Coordinator.PUBLISH_TIMEOUT_SETTING.getKey(), "5s") // <-- for hitting simulated network failures quickly
+ .put(LagDetector.CLUSTER_FOLLOWER_LAG_TIMEOUT_SETTING.getKey(), "5s") // remove lagging nodes quickly so they can rejoin
.put(TransportSettings.CONNECT_TIMEOUT.getKey(), "10s") // Network delay disruption waits for the min between this
// value and the time of disruption and does not recover immediately
// when disruption is stop. We should make sure we recover faster
```

Then we can open another (non-test-failure) issue about the Real™ fix as Dianna describes above.
It's possible for a node-left task to get interrupted prior to removing the node from the master's list of faultyNodes. Nodes on the faultyNodes list do not receive cluster state updates, and are eventually removed from the cluster. Subsequently, when the node attempts to rejoin after the test network disruptions have ceased, the node-join request can succeed, but the node will never receive the cluster state update, will consider the node-join a failure, and will resend node-join requests until the LagDetector removes the node from the faultyNodes list. #108690 will address the node-join issue. Closes #91447
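A toy replay of that sequence in plain Java; every name here, and the simplified lag-detector step, is illustrative rather than Elasticsearch code:

```java
import java.util.HashSet;
import java.util.Set;

// Toy replay of the race described above, not Elasticsearch code.
public class FaultyRejoinRace {
    public static void main(String[] args) {
        Set<String> faultyNodes = new HashSet<>();
        String node = "node_t1";

        // 1. A follower check fails during the disruption: the node is marked faulty.
        faultyNodes.add(node);

        // 2. The node-left cluster state update is interrupted, so the step that
        //    would have cleared the faulty entry never runs.

        // 3. The disruption heals and the node's join request is accepted, but the
        //    master skips publication to faulty nodes, so the joining node never
        //    sees itself in the cluster state and keeps retrying the join.
        System.out.println("publishes to rejoining node: " + !faultyNodes.contains(node)); // false

        // 4. Only once the lag detector removes the node (90s by default, 5s with
        //    the test change above) is the entry cleared and the next join able
        //    to complete.
        faultyNodes.remove(node);
        System.out.println("publishes after lag-detector removal: " + !faultyNodes.contains(node)); // true
    }
}
```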
Build scan:
https://gradle-enterprise.elastic.co/s/262tw2urcmfja/tests/:server:internalClusterTest/org.elasticsearch.discovery.ClusterDisruptionIT/testAckedIndexing
Reproduction line:
./gradlew ':server:internalClusterTest' --tests "org.elasticsearch.discovery.ClusterDisruptionIT.testAckedIndexing" -Dtests.seed=F227DC2661BD3347 -Dtests.locale=sr-Latn-RS -Dtests.timezone=America/Halifax -Druntime.java=17
Applicable branches:
main
Reproduces locally?:
Yes
Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.discovery.ClusterDisruptionIT&tests.test=testAckedIndexing
Failure excerpt: