[CI] ClusterDisruptionIT testAckedIndexing failing #91447
Pinging @elastic/es-distributed (Team:Distributed)
I could reproduce this. See https://gradle-enterprise.elastic.co/s/tmnxogpfsbwiu.
This used to reproduce (almost) reliably with [...]
Hmm, I don't think we've fixed anything that might have prevented the cluster from stabilising in a reasonable timeframe, which appears to be the problem here. But maybe we have. In the run you shared above the cluster seems to be constantly adding [...]
That seems suspicious, and unrelated to allocator changes.
Hmm, the master thinks this node is faulty and therefore never sends it the cluster state: [...]
So the Coordinator will publish the new cluster state, and it calls into the FollowersChecker to remove faulty nodes if they are no longer part of the cluster state. The only other paths that I see to clear faulty nodes are when the node becomes a follower or becomes a candidate. In this test failure scenario, a faulty node (node_t1) joins a cluster with master node_t2. So the cluster state being published should contain node_t1, which I think will leave node_t1 marked as a faulty node. I haven't been able to reproduce the failure. Next I'll see if I can write a new test in FollowersCheckerTests to cover the scenario, and see what happens.
A node is marked as faulty in the FollowersChecker when there is a network disconnection with that node. A node is not returned to not-faulty unless a master election occurs or the node is removed from the cluster, as mentioned above. I think the failed run differs in that it marks a node as faulty but never removes it from the cluster -- and it is that removal which would take the node off the faultyNodes list.
*Note: I searched the logs; these are the relevant ones from the failed run in this ticket, and then from a successful run on my local machine. So the question becomes: why didn't the node get removed properly from the cluster state?
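To make the lifecycle described above concrete, here is a minimal, self-contained Java sketch of the bookkeeping as I understand it. FaultyNodeTracker and all of its method names are illustrative, not the real FollowersChecker code:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of the faulty-node bookkeeping described above.
// Names are hypothetical; this is not the real FollowersChecker.
class FaultyNodeTracker {
    private final Set<String> faultyNodes = new HashSet<>();

    // Called when a follower check hits a network disconnection.
    void markFaulty(String nodeId) {
        faultyNodes.add(nodeId);
        // A node-left cluster state update is scheduled elsewhere; only when
        // that update removes the node is the faulty flag forgotten.
    }

    // Called when the published cluster state no longer contains the node.
    void onNodeRemoved(String nodeId) {
        faultyNodes.remove(nodeId);
    }

    // Called when the local node stops being master (becomes follower or candidate).
    void clearAll() {
        faultyNodes.clear();
    }

    // Publications are skipped for faulty nodes -- this is the gap that bites
    // if the node-left update never happens but the node later rejoins.
    boolean shouldPublishTo(String nodeId) {
        return !faultyNodes.contains(nodeId);
    }

    public static void main(String[] args) {
        FaultyNodeTracker tracker = new FaultyNodeTracker();
        tracker.markFaulty("node_t1");
        // Suppose the node-left update is interrupted, so onNodeRemoved never runs.
        // node_t1 later rejoins, but the master still refuses to publish to it:
        System.out.println(tracker.shouldPublishTo("node_t1")); // prints false
    }
}
```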
The logic to remove the node -- the node-left cluster state update -- is scheduled right after logging "marking node as faulty". That leads back to the Coordinator's removeNode. The removeNode code was actually changed significantly 9 months ago, to improve the master service batch queuing system. Maybe something was wrong with the batching, such that the cluster state wasn't updated properly. But this might be enough investigation to put the failure to rest, because the code has changed greatly. Potentially incidentally fixed by #92021
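Purely as an illustration of that suspicion (this is not the real master service or its batching API), the sketch below shows how an exception thrown while executing a queued node-left task would leave any bookkeeping that depends on a successful cluster state update unexecuted:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration, not Elasticsearch code: if a queued node-left
// task throws, the cluster state is not updated for it, so cleanup that is
// tied to a successful update (such as clearing the faulty flag) never runs.
class BatchedRemovalSketch {
    interface Task {
        void execute() throws Exception;
    }

    private final Deque<Task> queue = new ArrayDeque<>();

    void submit(Task task) {
        queue.add(task);
    }

    void runAll() {
        while (!queue.isEmpty()) {
            Task task = queue.poll();
            try {
                task.execute();
            } catch (Exception e) {
                System.out.println("node-left task failed: " + e);
            }
        }
    }

    public static void main(String[] args) {
        BatchedRemovalSketch batch = new BatchedRemovalSketch();
        batch.submit(() -> {
            throw new NullPointerException("boom"); // the failure mode found later in this thread
        });
        batch.submit(() -> System.out.println("later tasks still run, but the faulty entry remains"));
        batch.runAll();
    }
}
```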
Thanks @DiannaHohensee ! I believe @tlrx also mentioned he could not reproduce it on the current version, and was thinking of closing this (and re-opening it if it ever pops up again). So closing this for now. Feel free to re-open if you think we should not close it.
I don't think this is addressed by the master-service task queues work. The node's failure to rejoin properly would eventually be detected by the lag detector.
However the default lag-detection timeout is 90s and we do not wait that long in these tests. So I think it would be reasonable to change that:

```diff
diff --git a/server/src/test/java/org/elasticsearch/discovery/AbstractDisruptionTestCase.java b/server/src/test/java/org/elasticsearch/discovery/AbstractDisruptionTestCase.java
index 2a8621a60b6..dc08fffa49c 100644
--- a/server/src/test/java/org/elasticsearch/discovery/AbstractDisruptionTestCase.java
+++ b/server/src/test/java/org/elasticsearch/discovery/AbstractDisruptionTestCase.java
@@ -13,6 +13,7 @@ import org.elasticsearch.cluster.block.ClusterBlock;
import org.elasticsearch.cluster.block.ClusterBlockLevel;
import org.elasticsearch.cluster.coordination.Coordinator;
import org.elasticsearch.cluster.coordination.FollowersChecker;
+import org.elasticsearch.cluster.coordination.LagDetector;
import org.elasticsearch.cluster.coordination.LeaderChecker;
import org.elasticsearch.cluster.node.DiscoveryNodes;
import org.elasticsearch.common.settings.Settings;
@@ -116,6 +117,7 @@ public abstract class AbstractDisruptionTestCase extends ESIntegTestCase {
.put(FollowersChecker.FOLLOWER_CHECK_TIMEOUT_SETTING.getKey(), "5s") // for hitting simulated network failures quickly
.put(FollowersChecker.FOLLOWER_CHECK_RETRY_COUNT_SETTING.getKey(), 1) // for hitting simulated network failures quickly
.put(Coordinator.PUBLISH_TIMEOUT_SETTING.getKey(), "5s") // <-- for hitting simulated network failures quickly
+ .put(LagDetector.CLUSTER_FOLLOWER_LAG_TIMEOUT_SETTING.getKey(), "5s") // remove lagging nodes quickly so they can rejoin
.put(TransportSettings.CONNECT_TIMEOUT.getKey(), "10s") // Network delay disruption waits for the min between this
// value and the time of disruption and does not recover immediately
// when disruption is stop. We should make sure we recover faster
```

However in production it is far from ideal for a faulty node to cause constant cluster state updates (at server/src/main/java/org/elasticsearch/cluster/coordination/FollowersChecker.java, line 329 in 8e6e0e5).
We might do that just before becoming [server/src/main/java/org/elasticsearch/cluster/coordination/FollowersChecker.java, lines 150 to 151 in 8e6e0e5].
Then once we've become [server/src/main/java/org/elasticsearch/cluster/coordination/FollowersChecker.java, line 360 in 8e6e0e5].
But since we're [...]

In terms of a fix, perhaps it would be sufficient to do this in Coordinator:

```diff
diff --git a/server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java b/server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java
index 3da890b37ad..57936e27c49 100644
--- a/server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java
+++ b/server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java
@@ -920,6 +920,7 @@ public class Coordinator extends AbstractLifecycleComponent implements ClusterSt
);
assert leaderChecker.leader() == null : leaderChecker.leader();
+ followersChecker.clearCurrentNodes();
followersChecker.updateFastResponseState(leaderTerm, mode);
updateSingleNodeClusterChecker();
```

Not sure that covers all cases. Maybe a more direct solution would be better: if we get a [...]

Fixing this is pretty low-priority IMO since it requires a super-rare set of disruptions to happen with exactly the right timings, and resolves itself in time anyway, but I think we should keep this action on a list somewhere.
I strongly agree that the lifecycle for marking and unmarking a faulty node appears quite fragile. I was thinking that a successful node join should also reach into the FollowersChecker to remove the joining node from the faultyNodes list, if it is on it. Then the newly joined node could receive the cluster state publication -- the lack of which is the cause of this test failure. But I don't have a good understanding of the cluster state update contents, and the lifecycle expectations there.
Though this bit sounds like a second problem.
Lol. I wanted to see where the master service error was for the missing node-left cluster state update, and I found this NullPointerException 😁
UPDATE: [...]
That might work, but I think I'd rather refuse to let a faulty node join the cluster in the first place, retrying its removal as necessary. The trouble is that "successful node join" is not very well-defined: the master might think a join was successful even if the joining node does not, and this could lead to a situation where the node keeps on joining without fault detection properly removing it first.
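For illustration only, here is a hedged sketch of that idea -- refuse the join while the node is still marked faulty and retry its removal. JoinGate and its methods are hypothetical names and do not reflect the actual Coordinator join path:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: joins from nodes still on the faultyNodes list are
// refused, and the pending node-left removal is retried so the stale entry
// eventually gets cleared. Not the real join-validation code.
class JoinGate {
    private final Set<String> faultyNodes;

    JoinGate(Set<String> faultyNodes) {
        this.faultyNodes = faultyNodes;
    }

    // Returns true if the join may proceed; otherwise the join is refused and
    // the caller's removal retry runs so a later join attempt can succeed.
    boolean tryAcceptJoin(String joiningNodeId, Runnable retryNodeLeftRemoval) {
        if (faultyNodes.contains(joiningNodeId)) {
            retryNodeLeftRemoval.run();
            return false; // the joining node will retry later
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> faulty = new HashSet<>(Set.of("node_t1"));
        JoinGate gate = new JoinGate(faulty);
        // First attempt is refused but triggers another removal attempt, which
        // (in this toy) clears the entry so the retried join succeeds.
        System.out.println(gate.tryAcceptJoin("node_t1", () -> faulty.remove("node_t1"))); // false
        System.out.println(gate.tryAcceptJoin("node_t1", () -> {})); // true
    }
}
```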
Ahh, that'd do it I reckon. I tried to reproduce the race I guessed at in my previous comment by adding a delay that should provoke it, but have not succeeded after several hundred attempts. But this NPE would cause the same problem without needing a re-election.
Chatted more with David about this, to get a better understanding of how cluster state updates work. The general outline of the proposed fix would be to
A quick fix, but testing might be hard if we want to cover this failure case. Perhaps something contrived could be tested, where we reach into components, make artificial alterations, and then check that it gets sorted out.
NB I think we can close this test failure if we make the following change:

```diff
diff --git a/server/src/test/java/org/elasticsearch/discovery/AbstractDisruptionTestCase.java b/server/src/test/java/org/elasticsearch/discovery/AbstractDisruptionTestCase.java
index 2a8621a60b6..dc08fffa49c 100644
--- a/server/src/test/java/org/elasticsearch/discovery/AbstractDisruptionTestCase.java
+++ b/server/src/test/java/org/elasticsearch/discovery/AbstractDisruptionTestCase.java
@@ -13,6 +13,7 @@ import org.elasticsearch.cluster.block.ClusterBlock;
import org.elasticsearch.cluster.block.ClusterBlockLevel;
import org.elasticsearch.cluster.coordination.Coordinator;
import org.elasticsearch.cluster.coordination.FollowersChecker;
+import org.elasticsearch.cluster.coordination.LagDetector;
import org.elasticsearch.cluster.coordination.LeaderChecker;
import org.elasticsearch.cluster.node.DiscoveryNodes;
import org.elasticsearch.common.settings.Settings;
@@ -116,6 +117,7 @@ public abstract class AbstractDisruptionTestCase extends ESIntegTestCase {
.put(FollowersChecker.FOLLOWER_CHECK_TIMEOUT_SETTING.getKey(), "5s") // for hitting simulated network failures quickly
.put(FollowersChecker.FOLLOWER_CHECK_RETRY_COUNT_SETTING.getKey(), 1) // for hitting simulated network failures quickly
.put(Coordinator.PUBLISH_TIMEOUT_SETTING.getKey(), "5s") // <-- for hitting simulated network failures quickly
+ .put(LagDetector.CLUSTER_FOLLOWER_LAG_TIMEOUT_SETTING.getKey(), "5s") // remove lagging nodes quickly so they can rejoin
.put(TransportSettings.CONNECT_TIMEOUT.getKey(), "10s") // Network delay disruption waits for the min between this
// value and the time of disruption and does not recover immediately
// when disruption is stop. We should make sure we recover faster
```

Then we can open another (non-test-failure) issue about the Real™ fix as Dianna describes above.
It's possible for a node-left task to get interrupted prior to removing the node from the master's list of faultyNodes. Nodes on the faultyNodes list do not receive cluster state updates, and are eventually removed from the cluster. Subsequently, when the node attempts to rejoin after the test network disruptions have ceased, the node-join request can succeed, but the node will never receive the cluster state update, will consider the node-join a failure, and will resend node-join requests until the LagDetector removes the node from the faultyNodes list. #108690 will address the node-join issue. Closes #91447
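A toy replay of that sequence in plain Java; every name here, and the simplified lag-detector step, is illustrative rather than Elasticsearch code:

```java
import java.util.HashSet;
import java.util.Set;

// Toy replay of the race described above, not Elasticsearch code.
public class FaultyRejoinRace {
    public static void main(String[] args) {
        Set<String> faultyNodes = new HashSet<>();
        String node = "node_t1";

        // 1. A follower check fails during the disruption: the node is marked faulty.
        faultyNodes.add(node);

        // 2. The node-left cluster state update is interrupted, so the step that
        //    would have cleared the faulty entry never runs.

        // 3. The disruption heals and the node's join request is accepted, but the
        //    master skips publication to faulty nodes, so the joining node never
        //    sees itself in the cluster state and keeps retrying the join.
        System.out.println("publishes to rejoining node: " + !faultyNodes.contains(node)); // false

        // 4. Only once the lag detector removes the node (90s by default, 5s with
        //    the test change above) is the entry cleared and the next join able
        //    to complete.
        faultyNodes.remove(node);
        System.out.println("publishes after lag-detector removal: " + !faultyNodes.contains(node)); // true
    }
}
```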
Build scan:
https://gradle-enterprise.elastic.co/s/262tw2urcmfja/tests/:server:internalClusterTest/org.elasticsearch.discovery.ClusterDisruptionIT/testAckedIndexing
Reproduction line:
./gradlew ':server:internalClusterTest' --tests "org.elasticsearch.discovery.ClusterDisruptionIT.testAckedIndexing" -Dtests.seed=F227DC2661BD3347 -Dtests.locale=sr-Latn-RS -Dtests.timezone=America/Halifax -Druntime.java=17
Applicable branches:
main
Reproduces locally?:
Yes
Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.discovery.ClusterDisruptionIT&tests.test=testAckedIndexing
Failure excerpt: