
[CI] ClusterDisruptionIT testAckedIndexing failing #91447

Closed
pxsalehi opened this issue Nov 9, 2022 · 16 comments · Fixed by #108691
Labels: :Distributed Coordination/Cluster Coordination (cluster formation and cluster state publication, including cluster membership and fault detection), low-risk (an open issue or test failure that is a low risk to future releases), Team:Distributed (meta label for distributed team, obsolete), >test-failure (triaged test failures from CI)

@pxsalehi (Member) commented Nov 9, 2022

Build scan:
https://gradle-enterprise.elastic.co/s/262tw2urcmfja/tests/:server:internalClusterTest/org.elasticsearch.discovery.ClusterDisruptionIT/testAckedIndexing

Reproduction line:
./gradlew ':server:internalClusterTest' --tests "org.elasticsearch.discovery.ClusterDisruptionIT.testAckedIndexing" -Dtests.seed=F227DC2661BD3347 -Dtests.locale=sr-Latn-RS -Dtests.timezone=America/Halifax -Druntime.java=17

Applicable branches:
main

Reproduces locally?:
Yes

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.discovery.ClusterDisruptionIT&tests.test=testAckedIndexing

Failure excerpt:

java.lang.AssertionError: failed to reach a stable cluster of [3] nodes. Tried via [node_t0]. last cluster state:
cluster uuid: -XwQ51BpTsyQMw_Wjp7XDg [committed: true]
version: 64
state uuid: XfzY1rwHSNWCym8FAHukTg
from_diff: false
meta data version: 16
   coordination_metadata:
      term: 2
      last_committed_config: VotingConfiguration{JuSgh8LOR9WQxzHHDqqlVQ,teMmveWdQ8eJomW0l7COzg,XZTfvsT_T8G2zfpKc6vMiQ}
      last_accepted_config: VotingConfiguration{JuSgh8LOR9WQxzHHDqqlVQ,teMmveWdQ8eJomW0l7COzg,XZTfvsT_T8G2zfpKc6vMiQ}
      voting tombstones: []
   [test/0e3yQZiiQryESs-x7Qtacw]: v[13], mv[4], sv[1], av[1]
      0: p_term [1], isa_ids [GgrbxbMaTNCpILvZl4Ep7A, vByWNTvzQHKjmYEG4ZNTeg]
      1: p_term [1], isa_ids [v_Fpb5_fRW-Di9kVj35n6g, geKyf7jqSVWDQpWqMVQTmw]
metadata customs:
   index-graveyard: IndexGraveyard[[]]   persistent_tasks: {"last_allocation_id":1,"tasks":[{"id":"health-node","task":{"health-node":{"params":{}}},"allocation_id":1,"assignment":{"executor_node":"JuSgh8LOR9WQxzHHDqqlVQ","explanation":""}}]}
nodes: 
   {node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{GEOyX6yBQK6q12UWfW01AA}{node_t1}{127.0.0.1}{127.0.0.1:19091}{cdfhilmrstw}
   {node_t2}{JuSgh8LOR9WQxzHHDqqlVQ}{Di5DKi8_Q3OesY4wpcnCfA}{node_t2}{127.0.0.1}{127.0.0.1:19093}{cdfhilmrstw}, master
   {node_t0}{teMmveWdQ8eJomW0l7COzg}{UqjXsHaTQKax1c3zc-P9iA}{node_t0}{127.0.0.1}{127.0.0.1:19092}{cdfhilmrstw}
routing_table (version 14):
-- index [[test/0e3yQZiiQryESs-x7Qtacw]]
----shard_id [test][0]
--------[test][0], node[teMmveWdQ8eJomW0l7COzg], [R], s[STARTED], a[id=vByWNTvzQHKjmYEG4ZNTeg], failed_attempts[0]
--------[test][0], node[JuSgh8LOR9WQxzHHDqqlVQ], [P], s[STARTED], a[id=GgrbxbMaTNCpILvZl4Ep7A], failed_attempts[0]
----shard_id [test][1]
--------[test][1], node[teMmveWdQ8eJomW0l7COzg], relocating [XZTfvsT_T8G2zfpKc6vMiQ], [P], s[RELOCATING], a[id=v_Fpb5_fRW-Di9kVj35n6g, rId=vG7OtoSySlafxcOzoHgVMA], failed_attempts[0], expected_shard_size[225]
--------[test][1], node[JuSgh8LOR9WQxzHHDqqlVQ], [R], s[STARTED], a[id=geKyf7jqSVWDQpWqMVQTmw], failed_attempts[0]

routing_nodes:
-----node_id[teMmveWdQ8eJomW0l7COzg][V]
--------[test][0], node[teMmveWdQ8eJomW0l7COzg], [R], s[STARTED], a[id=vByWNTvzQHKjmYEG4ZNTeg], failed_attempts[0]
--------[test][1], node[teMmveWdQ8eJomW0l7COzg], relocating [XZTfvsT_T8G2zfpKc6vMiQ], [P], s[RELOCATING], a[id=v_Fpb5_fRW-Di9kVj35n6g, rId=vG7OtoSySlafxcOzoHgVMA], failed_attempts[0], expected_shard_size[225]
-----node_id[JuSgh8LOR9WQxzHHDqqlVQ][V]
--------[test][0], node[JuSgh8LOR9WQxzHHDqqlVQ], [P], s[STARTED], a[id=GgrbxbMaTNCpILvZl4Ep7A], failed_attempts[0]
--------[test][1], node[JuSgh8LOR9WQxzHHDqqlVQ], [R], s[STARTED], a[id=geKyf7jqSVWDQpWqMVQTmw], failed_attempts[0]
-----node_id[XZTfvsT_T8G2zfpKc6vMiQ][V]
--------[test][1], node[XZTfvsT_T8G2zfpKc6vMiQ], relocating [teMmveWdQ8eJomW0l7COzg], [P], recovery_source[peer recovery], s[INITIALIZING], a[id=vG7OtoSySlafxcOzoHgVMA, rId=v_Fpb5_fRW-Di9kVj35n6g], failed_attempts[0], expected_shard_size[225]
---- unassigned
customs:
   health: org.elasticsearch.health.metadata.HealthMetadata@ff2c405a

  at org.junit.Assert.fail(Assert.java:88)
  at org.elasticsearch.test.ESIntegTestCase.ensureStableCluster(ESIntegTestCase.java:1295)
  at org.elasticsearch.discovery.ClusterDisruptionIT.testAckedIndexing(ClusterDisruptionIT.java:224)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:568)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:833)

pxsalehi added the :Distributed Coordination/Cluster Coordination and >test-failure labels on Nov 9, 2022
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

elasticsearchmachine added the Team:Distributed label on Nov 9, 2022
@pxsalehi (Member, Author) commented Nov 9, 2022

I could reproduce this. See https://gradle-enterprise.elastic.co/s/tmnxogpfsbwiu.

pxsalehi self-assigned this on Nov 15, 2022
@pxsalehi (Member, Author)

This used to reproduce (almost) reliably with -Dtests.seed=F227DC2661BD3347; however, that no longer seems to be the case. I suspect these failures were due to the desired-balance allocator changes and have since been resolved, as @DaveCTurner and @idegtiarenko were addressing other failures in that area. I'd still look into this out of interest, as this seems to be a special test suite with a complicated history! If I don't see anything special, I'll just close this, since I might be chasing an already-resolved issue.

@DaveCTurner (Contributor)

Hmm I don't think we've fixed anything that might have prevented the cluster from stabilising in a reasonable timeframe, which appears to be the problem here. But maybe we have.

In the run you shared above the cluster seems to be constantly adding node_t1 for some reason:

  1> [2022-11-09T06:01:34,871][INFO ][o.e.d.ClusterDisruptionIT] [testAckedIndexing] stopping disruption
  1> [2022-11-09T06:01:34,871][INFO ][o.e.t.d.NetworkDisruption] [testAckedIndexing] stop disrupting (disruption scheme: network disconnects, disrupted links: two partitions (partition 1: [node_t2, node_t0] and partition 2: [node_t1]))
  1> [2022-11-09T06:01:35,818][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:01:36,816][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:01:37,820][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:01:38,820][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:01:39,827][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:01:40,828][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:01:41,834][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:01:42,836][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:01:43,844][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:01:44,846][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:01:45,845][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:01:46,847][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:01:47,854][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:01:48,860][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:01:49,868][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:01:50,870][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:01:51,874][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:01:52,878][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:01:53,882][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:01:54,886][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:01:55,892][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:01:56,897][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:01:57,900][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:01:58,909][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:01:59,913][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:02:00,918][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:02:01,921][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:02:02,927][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:02:03,934][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:02:04,936][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:02:05,940][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:02:06,946][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:02:07,948][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:02:08,950][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:02:09,965][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:02:10,986][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:02:11,966][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:02:12,974][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:02:13,981][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} rejoining]]
  1> [2022-11-09T06:02:14,886][DEBUG][o.e.d.ClusterDisruptionIT] [testAckedIndexing] Indexing exceptions during disruption:
  1> [2022-11-09T06:02:14,887][INFO ][o.e.d.ClusterDisruptionIT] [testAckedIndexing] [ClusterDisruptionIT#testAckedIndexing]: cleaning up after test

That seems suspicious, and unrelated to allocator changes.

@DaveCTurner (Contributor)

Hmm the master thinks this node is faulty and therefore never sends it the cluster state:

  1> [2022-11-09T06:01:35,818][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=25}
  1> [2022-11-09T06:01:36,817][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=26}
  1> [2022-11-09T06:01:37,821][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=27}
  1> [2022-11-09T06:01:38,821][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=28}
  1> [2022-11-09T06:01:39,827][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=29}
  1> [2022-11-09T06:01:40,829][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=30}
  1> [2022-11-09T06:01:41,835][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=31}
  1> [2022-11-09T06:01:42,837][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=32}
  1> [2022-11-09T06:01:43,844][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=33}
  1> [2022-11-09T06:01:44,847][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=34}
  1> [2022-11-09T06:01:45,846][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=35}
  1> [2022-11-09T06:01:46,849][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=36}
  1> [2022-11-09T06:01:47,855][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=37}
  1> [2022-11-09T06:01:48,861][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=38}
  1> [2022-11-09T06:01:49,868][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=39}
  1> [2022-11-09T06:01:50,870][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=40}
  1> [2022-11-09T06:01:51,875][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=41}
  1> [2022-11-09T06:01:52,879][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=42}
  1> [2022-11-09T06:01:53,882][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=43}
  1> [2022-11-09T06:01:54,887][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=44}
  1> [2022-11-09T06:01:55,893][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=45}
  1> [2022-11-09T06:01:56,897][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=46}
  1> [2022-11-09T06:01:57,901][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=47}
  1> [2022-11-09T06:01:58,910][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=48}
  1> [2022-11-09T06:01:59,913][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=49}
  1> [2022-11-09T06:02:00,919][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=50}
  1> [2022-11-09T06:02:01,922][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=51}
  1> [2022-11-09T06:02:02,928][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=52}
  1> [2022-11-09T06:02:03,934][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=53}
  1> [2022-11-09T06:02:04,936][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=54}
  1> [2022-11-09T06:02:05,940][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=55}
  1> [2022-11-09T06:02:06,947][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=56}
  1> [2022-11-09T06:02:07,949][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=57}
  1> [2022-11-09T06:02:08,950][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=58}
  1> [2022-11-09T06:02:09,966][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=59}
  1> [2022-11-09T06:02:10,987][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=60}
  1> [2022-11-09T06:02:11,968][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=61}
  1> [2022-11-09T06:02:12,975][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=62}
  1> [2022-11-09T06:02:13,982][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=63}
  1> [2022-11-09T06:02:14,982][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=64}
  1> [2022-11-09T06:02:15,996][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t2] onFaultyNode: [{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}] is faulty, failing target in publication Publication{term=2, version=65}

pxsalehi removed their assignment on May 31, 2023
volodk85 added the medium-risk label on Oct 23, 2023
DiannaHohensee self-assigned this on Nov 28, 2023
@DiannaHohensee (Contributor)

So the Coordinator will publish the new cluster state, and it calls into the FollowersChecker to remove faulty nodes if they are no longer part of the cluster state. The only other paths that I see to clear faulty nodes are when the node becomes a follower or becomes a candidate.

In this test failure scenario, a faulty node (node_t1) joins a cluster with master node_t2. So the cluster state being published should contain node_t1, which I think will leave node_t1 marked as a faulty node. I haven't been able to reproduce the failure. Next I'll see if I can write a new test in FollowersCheckerTests to cover the scenario, and see what happens.
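
As an illustration of that lifecycle, here is a heavily simplified, hypothetical sketch (not the real FollowersChecker/Coordinator code; all names are made up) of how a node marked faulty stays faulty, and stops receiving publications, unless it is actually removed from the cluster state or the master steps down:

import java.util.HashSet;
import java.util.Set;

// Hypothetical, heavily simplified model of the faulty-node lifecycle described above.
// This is NOT the real FollowersChecker; all names and structure are illustrative only.
public class SimplifiedFollowerTracker {

    private final Set<String> clusterNodes = new HashSet<>(); // nodes in the last published cluster state
    private final Set<String> faultyNodes = new HashSet<>();  // nodes that failed a follower check

    // A follower check failed (e.g. network disconnect): mark the node faulty.
    // In the real system a node-left cluster state update is also submitted here;
    // if that update never lands, nothing else clears the faulty entry.
    void onFollowerCheckFailure(String nodeId) {
        faultyNodes.add(nodeId);
    }

    // A new cluster state was published: faulty entries are only dropped for nodes
    // that are no longer part of the state.
    void onClusterStatePublished(Set<String> nodesInState) {
        clusterNodes.clear();
        clusterNodes.addAll(nodesInState);
        faultyNodes.retainAll(nodesInState);
    }

    // Becoming follower or candidate clears all tracking (the "other paths" above).
    void onStepDown() {
        clusterNodes.clear();
        faultyNodes.clear();
    }

    // Faulty nodes are skipped during publication, so a node that is faulty but still
    // part of the cluster state never receives new cluster states.
    boolean shouldPublishTo(String nodeId) {
        return clusterNodes.contains(nodeId) && !faultyNodes.contains(nodeId);
    }

    public static void main(String[] args) {
        SimplifiedFollowerTracker tracker = new SimplifiedFollowerTracker();
        tracker.onClusterStatePublished(Set.of("node_t0", "node_t1", "node_t2"));
        tracker.onFollowerCheckFailure("node_t1");   // disruption: node_t1 marked faulty
        // The node-left update is lost, so subsequent states still contain node_t1 ...
        tracker.onClusterStatePublished(Set.of("node_t0", "node_t1", "node_t2"));
        // ... and node_t1 never receives them, matching the failure in this issue.
        System.out.println("publish to node_t1? " + tracker.shouldPublishTo("node_t1")); // false
    }
}

Under this model, the only ways node_t1 starts receiving cluster states again are for it to actually leave the cluster state or for the master to step down, which matches the two clearing paths mentioned above.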

@DiannaHohensee (Contributor) commented Nov 29, 2023

A node is marked as faulty in the FollowersChecker when there is a network disconnection with that node. A node is not returned to not-faulty unless a master election occurs or the node is removed from the cluster, as mentioned above. I think the failed run differs in that it marks the node as faulty but never removes it from the cluster -- the removal being what would also clear the node from the faultyNodes list.

.....

// add  -- isolated

  1> [2022-11-09T06:01:33,797][INFO ][o.e.c.s.MasterService    ] [node_t2] node-join[{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw} joining, removed [5.9s/5907ms] ago with reason [disconnected]], term: 2, version: 23, delta: added {{node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}}


// DISRUPTING -----------

  1> [2022-11-09T06:01:33,866][INFO ][o.e.t.d.NetworkDisruption] [testAckedIndexing] start disrupting (disruption type: network disconnects, disrupted links: two partitions (partition 1: [node_t2, node_t0] and partition 2: [node_t1]))


// mark faulty  -- isolated

  1> [2022-11-09T06:01:34,803][DEBUG][o.e.c.c.FollowersChecker ] [node_t2] FollowerChecker{discoveryNode={node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}, failureCountSinceLastSuccess=1, timeoutCountSinceLastSuccess=0, [cluster.fault_detection.follower_check.retry_count]=1} disconnected

  1> [2022-11-09T06:01:34,804][DEBUG][o.e.c.c.FollowersChecker ] [node_t2] FollowerChecker{discoveryNode={node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}, failureCountSinceLastSuccess=1, timeoutCountSinceLastSuccess=0, [cluster.fault_detection.follower_check.retry_count]=1} marking node as faulty


// STOP DISRUPTING -------

  1> [2022-11-09T06:01:34,871][INFO ][o.e.t.d.NetworkDisruption] [testAckedIndexing] stop disrupting (disruption scheme: network disconnects, disrupted links: two partitions (partition 1: [node_t2, node_t0] and partition 2: [node_t1]))

.....

*Note: I searched for node-left with delta: removed, i.e. where the logged cluster state actually removes the node. There is a node-left task, but no delta: removed following the last "marking node as faulty" message.

These are the relevant logs from the failed run in this ticket, compared against a successful run on my local machine.

So the question becomes: why didn't the node get removed properly from the cluster state?

@DiannaHohensee (Contributor) commented Nov 29, 2023

The logic to remove the node -- the node-left cluster state update -- is scheduled right after logging "marking node as faulty". That leads back to the Coordinator's removeNode.

The removeNode code was actually changed significantly 9 months ago to improve the master service batch queuing system. Maybe something was wrong with the batching, such that the cluster state wasn't updated properly. But this might be enough investigation to put the failure to rest, given how much the code has changed.

Potentially incidentally fixed by #92021

@kingherc (Contributor)

Thanks @DiannaHohensee! I believe @tlrx also mentioned he could not reproduce it on the current version and was thinking of closing this (and re-opening it if it ever pops up again). So I'm closing this for now. Feel free to re-open if you think we should not close it yet.

@DaveCTurner (Contributor)

I don't think this is addressed by the master-service task queues work. The node's failure to rejoin properly would eventually be detected by the LagDetector:

  1> [2022-11-09T06:02:44,109][DEBUG][o.e.c.c.LagDetector      ] [node_t2] starting lag detector for version 93: [NodeAppliedStateTracker{discoveryNode={node_t1}{XZTfvsT_T8G2zfpKc6vMiQ}{l_1nLHN9SJiJaSPgStXSlg}{node_t1}{127.0.0.1}{127.0.0.1:13513}{cdfhilmrstw}, appliedVersion=24}]

However the default lag-detection timeout is 90s and we do not wait that long in these tests. So I think it would be reasonable to change that:

diff --git a/server/src/test/java/org/elasticsearch/discovery/AbstractDisruptionTestCase.java b/server/src/test/java/org/elasticsearch/discovery/AbstractDisruptionTestCase.java
index 2a8621a60b6..dc08fffa49c 100644
--- a/server/src/test/java/org/elasticsearch/discovery/AbstractDisruptionTestCase.java
+++ b/server/src/test/java/org/elasticsearch/discovery/AbstractDisruptionTestCase.java
@@ -13,6 +13,7 @@ import org.elasticsearch.cluster.block.ClusterBlock;
 import org.elasticsearch.cluster.block.ClusterBlockLevel;
 import org.elasticsearch.cluster.coordination.Coordinator;
 import org.elasticsearch.cluster.coordination.FollowersChecker;
+import org.elasticsearch.cluster.coordination.LagDetector;
 import org.elasticsearch.cluster.coordination.LeaderChecker;
 import org.elasticsearch.cluster.node.DiscoveryNodes;
 import org.elasticsearch.common.settings.Settings;
@@ -116,6 +117,7 @@ public abstract class AbstractDisruptionTestCase extends ESIntegTestCase {
         .put(FollowersChecker.FOLLOWER_CHECK_TIMEOUT_SETTING.getKey(), "5s") // for hitting simulated network failures quickly
         .put(FollowersChecker.FOLLOWER_CHECK_RETRY_COUNT_SETTING.getKey(), 1) // for hitting simulated network failures quickly
         .put(Coordinator.PUBLISH_TIMEOUT_SETTING.getKey(), "5s") // <-- for hitting simulated network failures quickly
+        .put(LagDetector.CLUSTER_FOLLOWER_LAG_TIMEOUT_SETTING.getKey(), "5s") // remove lagging nodes quickly so they can rejoin
         .put(TransportSettings.CONNECT_TIMEOUT.getKey(), "10s") // Network delay disruption waits for the min between this
         // value and the time of disruption and does not recover immediately
         // when disruption is stop. We should make sure we recover faster

However in production it is far from ideal for a faulty node to cause constant cluster state updates (at URGENT priority) for 90s while the lag detector kicks in, and I don't see anything that would prevent that from happening still today: it seems we can add a node to FollowersChecker#faultyNodes but then fail to remove it from the cluster if the node-left task fails with a NotMasterException. More specifically, when handling a follower check failure we check isRunning() first here:

We might do that just before becoming CANDIDATE which clears things here:

followerCheckers.keySet().removeIf(isUnknownNode);
faultyNodes.removeIf(isUnknownNode);

Then once we've become CANDIDATE we might go and mark the node as faulty here:

But since we're CANDIDATE, the node-left task fails, leaving us in a state where we can win the next election with a faulty node that we never removed from the cluster.
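
To make the suspected interleaving concrete, here is a hypothetical sequential replay of that ordering (made-up names, not the real coordinator code):

import java.util.HashSet;
import java.util.Set;

// Hypothetical replay of the suspected race; names are illustrative only.
public class FaultyNodeRaceReplay {

    enum Mode { LEADER, CANDIDATE }

    public static void main(String[] args) {
        Mode mode = Mode.LEADER;
        Set<String> faultyNodes = new HashSet<>();

        // 1. A follower check for node_t1 fails; the handler checks isRunning() while we are still LEADER.
        boolean isRunning = (mode == Mode.LEADER);

        // 2. Before the handler proceeds, the master steps down to CANDIDATE, clearing faultyNodes.
        mode = Mode.CANDIDATE;
        faultyNodes.clear();

        // 3. The in-flight handler (which already passed its isRunning() check) marks the node faulty
        //    and submits a node-left task, which fails because we are no longer master.
        if (isRunning) {
            faultyNodes.add("node_t1");
            System.out.println("node-left rejected: not currently master (mode=" + mode + ")");
        }

        // 4. We then win the next election with node_t1 still in the cluster state and still marked
        //    faulty, so publications to it are skipped until the lag detector (default 90s) removes it.
        mode = Mode.LEADER;
        System.out.println("mode=" + mode + ", faultyNodes=" + faultyNodes); // faultyNodes=[node_t1]
    }
}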

In terms of a fix, perhaps it would be sufficient to do this in becomeLeader()?

diff --git a/server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java b/server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java
index 3da890b37ad..57936e27c49 100644
--- a/server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java
+++ b/server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java
@@ -920,6 +920,7 @@ public class Coordinator extends AbstractLifecycleComponent implements ClusterSt
         );

         assert leaderChecker.leader() == null : leaderChecker.leader();
+        followersChecker.clearCurrentNodes();
         followersChecker.updateFastResponseState(leaderTerm, mode);

         updateSingleNodeClusterChecker();

Not sure that covers all cases. Maybe a more direct solution would be better: if we get a node-join request from a faulty node that's still in the cluster, we could submit another node-left task and wait for that to complete before processing the join.

Fixing this is pretty low-priority IMO since it requires a super-rare set of disruptions to happen with exactly the right timings, and resolves itself in time anyway, but I think we should keep this action on a list somewhere.

DaveCTurner reopened this on Nov 29, 2023
@DiannaHohensee (Contributor) commented Nov 29, 2023

I strongly agree that the lifecycle for marking and unmarking a faulty node appears quite fragile. I was thinking that a successful node join should also reach into the FollowersChecker and remove the joining node from the faultyNodes list if it is present. Then the newly joined node could receive the cluster state publication -- the lack of which is the cause of this test failure. But I don't have a full understanding of the cluster state update contents and the lifecycle expectations there.

we can win the next election with a faulty node that we never removed from the cluster

Though this bit sounds like a second problem.

@DiannaHohensee (Contributor) commented Nov 29, 2023

Lol. I wanted to see where the master service error was for the missing node-left cluster state update, and I found this NullPointerException 😁

  1> [2022-11-09T06:01:34,806][ERROR][o.e.c.c.NodeRemovalClusterStateTaskExecutor] [node_t2] unexpected failure during [node-left]
  1> java.lang.NullPointerException: Cannot invoke "org.elasticsearch.cluster.routing.RoutingNode.nodeId()" because "node" is null
  1> 	at org.elasticsearch.cluster.routing.allocation.decider.AllocationDeciders.canAllocate(AllocationDeciders.java:62) ~[main/:?]
  1> 	at org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceReconciler.decideCanAllocate(DesiredBalanceReconciler.java:432) ~[main/:?]
  1> 	at org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceReconciler.findRelocationTarget(DesiredBalanceReconciler.java:420) ~[main/:?]
  1> 	at org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceReconciler.balance(DesiredBalanceReconciler.java:385) ~[main/:?]
  1> 	at org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceReconciler.run(DesiredBalanceReconciler.java:88) ~[main/:?]
  1> 	at org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceShardsAllocator.recordTime(DesiredBalanceShardsAllocator.java:299) ~[main/:?]
  1> 	at org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceShardsAllocator.reconcile(DesiredBalanceShardsAllocator.java:213) ~[main/:?]
  1> 	at org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceShardsAllocator.allocate(DesiredBalanceShardsAllocator.java:162) ~[main/:?]
  1> 	at org.elasticsearch.cluster.routing.allocation.AllocationService.lambda$reroute$6(AllocationService.java:423) ~[main/:?]
  1> 	at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:518) ~[main/:?]
  1> 	at org.elasticsearch.cluster.routing.allocation.AllocationService.executeWithRoutingAllocation(AllocationService.java:444) ~[main/:?]
  1> 	at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:420) ~[main/:?]
  1> 	at org.elasticsearch.cluster.routing.allocation.AllocationService.disassociateDeadNodes(AllocationService.java:275) ~[main/:?]
  1> 	at org.elasticsearch.cluster.coordination.NodeRemovalClusterStateTaskExecutor.execute(NodeRemovalClusterStateTaskExecutor.java:74) ~[main/:?]
  1> 	at org.elasticsearch.cluster.service.MasterService.innerExecuteTasks(MasterService.java:1052) ~[main/:?]
  1> 	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:1017) ~[main/:?]
  1> 	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:278) ~[main/:?]
  1> 	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:170) ~[main/:?]
  1> 	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:110) ~[main/:?]
  1> 	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:148) ~[main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:850) ~[main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:257) ~[main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:223) ~[main/:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
  1> 	at java.lang.Thread.run(Thread.java:833) ~[?:?]

UPDATE:
This NullPointerException was fixed a week after this test failure first occurred, Nov 18th.

DiannaHohensee added the low-risk label and removed the medium-risk label on Nov 29, 2023
@DaveCTurner (Contributor)

I was thinking that a successful node join should also go into FollowersChecker to remove itself from potentially being on the faultyNodes list.

That might work, but I think I'd rather refuse to let a faulty node join the cluster in the first place, retrying its removal as necessary. The trouble is that "successful node join" is not very well defined: the master might think a join was successful even if the joining node does not, and this could lead to a situation where the node keeps on joining without fault detection properly removing it first.

I found this NullPointerException 😁

Ahh that'd do it I reckon. I tried to reproduce the race I guessed at in my previous comment by adding a delay that should provoke it, but have not succeeded after several hundred attempts. But this NPE would cause the same problem without needing a re-election.

@DiannaHohensee (Contributor)

Chatted more with David about this, to get a better understanding of how cluster state updates work.

The general outline of the proposed fix would be to

  • add node-left logic to the validateJoinRequest function if the joining node is found to have been partially removed, in the sense of this failure where the node is still part of the cluster state but also in the faultyNodes list. The validation function already makes network calls, so waiting for the cluster state to publish is fine.
  • and then processJoinRequest will proceed as usual to run the node-join logic.

It's a quick fix, but testing might be hard if we want to cover this failure case. Perhaps something contrived could be tested, where we reach into components, make artificial alterations, and then check that the situation gets sorted out.
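
A rough sketch of that shape of fix, using purely hypothetical names and types (not the real Coordinator/JoinHelper APIs), might look something like this:

import java.util.Set;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch of the proposed join-time check; method and type names are
// illustrative only and do not correspond to the real Coordinator/JoinHelper code.
public class JoinValidationSketch {

    private final Set<String> faultyNodes;          // as tracked by follower checking
    private final Set<String> nodesInClusterState;  // last committed cluster state

    public JoinValidationSketch(Set<String> faultyNodes, Set<String> nodesInClusterState) {
        this.faultyNodes = faultyNodes;
        this.nodesInClusterState = nodesInClusterState;
    }

    // Called while validating an incoming join request. If the joining node was only
    // "partially removed" (still in the cluster state but marked faulty), first submit a
    // node-left task and wait for it to be published before letting the join proceed.
    public CompletableFuture<Void> validateJoin(String joiningNodeId) {
        if (faultyNodes.contains(joiningNodeId) && nodesInClusterState.contains(joiningNodeId)) {
            return submitNodeLeftAndAwaitPublication(joiningNodeId);
        }
        return CompletableFuture.completedFuture(null);
    }

    // Placeholder for submitting the node-left cluster state update; in this sketch the
    // removal just completes immediately.
    private CompletableFuture<Void> submitNodeLeftAndAwaitPublication(String nodeId) {
        nodesInClusterState.remove(nodeId);
        faultyNodes.remove(nodeId);
        return CompletableFuture.completedFuture(null);
    }
}

The interesting part is only the condition: a join from a node that is still in the cluster state but also on the faultyNodes list first triggers (and waits for) its removal, so the subsequent node-join publishes a state the node will actually receive.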

@DaveCTurner (Contributor)

NB I think we can close this test failure if we make the following change:

diff --git a/server/src/test/java/org/elasticsearch/discovery/AbstractDisruptionTestCase.java b/server/src/test/java/org/elasticsearch/discovery/AbstractDisruptionTestCase.java
index 2a8621a60b6..dc08fffa49c 100644
--- a/server/src/test/java/org/elasticsearch/discovery/AbstractDisruptionTestCase.java
+++ b/server/src/test/java/org/elasticsearch/discovery/AbstractDisruptionTestCase.java
@@ -13,6 +13,7 @@ import org.elasticsearch.cluster.block.ClusterBlock;
 import org.elasticsearch.cluster.block.ClusterBlockLevel;
 import org.elasticsearch.cluster.coordination.Coordinator;
 import org.elasticsearch.cluster.coordination.FollowersChecker;
+import org.elasticsearch.cluster.coordination.LagDetector;
 import org.elasticsearch.cluster.coordination.LeaderChecker;
 import org.elasticsearch.cluster.node.DiscoveryNodes;
 import org.elasticsearch.common.settings.Settings;
@@ -116,6 +117,7 @@ public abstract class AbstractDisruptionTestCase extends ESIntegTestCase {
         .put(FollowersChecker.FOLLOWER_CHECK_TIMEOUT_SETTING.getKey(), "5s") // for hitting simulated network failures quickly
         .put(FollowersChecker.FOLLOWER_CHECK_RETRY_COUNT_SETTING.getKey(), 1) // for hitting simulated network failures quickly
         .put(Coordinator.PUBLISH_TIMEOUT_SETTING.getKey(), "5s") // <-- for hitting simulated network failures quickly
+        .put(LagDetector.CLUSTER_FOLLOWER_LAG_TIMEOUT_SETTING.getKey(), "5s") // remove lagging nodes quickly so they can rejoin
         .put(TransportSettings.CONNECT_TIMEOUT.getKey(), "10s") // Network delay disruption waits for the min between this
         // value and the time of disruption and does not recover immediately
         // when disruption is stop. We should make sure we recover faster

Then we can open another (non-test-failure) issue about the Real™ fix as Dianna describes above.

@DiannaHohensee (Contributor)

Filed #108690 for the production fix, and published a PR for the test fix #108691

DiannaHohensee added a commit that referenced this issue May 15, 2024
It's possible for a node-left task to get interrupted prior to removing
the node from the master's list of faultyNodes. Nodes on the faultyNodes
list do not receive cluster state updates, and are eventually removed.

Subsequently, when the node attempts to rejoin, after test network
disruptions have ceased, the node-join request can succeed, but the
node will never receive the cluster state update, consider the node-join
a failure, and will resend node-join requests until the LagDetector
removes the node from the faultyNodes list.
#108690 will address the
node-join issue.

Closes #91447