node_status_backend: reset backoff on peer checkin #11342

bharathv · 2023-06-12T01:10:46Z

When a peer restarts while a reconnect backoff is applied locally, the
backoff needs to be reset once the peer is available again. Otherwise, the
transport does not reconnect until the entire backoff elapses, so the peer
is marked unavailable to downstream consumers such as the partition balancer.
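
The behavior described above can be illustrated with a minimal Python sketch. It is not the node_status_backend code, which is not shown in this conversation; `ReconnectBackoff`, `PeerConnections`, `base_ms`, `max_ms`, and the doubling policy are all hypothetical, chosen only to show why a stale backoff delays reconnection and how resetting it on check-in avoids that.

```python
from dataclasses import dataclass, field


@dataclass
class ReconnectBackoff:
    """Hypothetical capped exponential backoff for one peer (illustration only)."""
    base_ms: int = 100      # assumed first delay
    max_ms: int = 15_000    # assumed cap on the delay
    current_ms: int = 0     # 0 means "may reconnect immediately"

    def on_failure(self) -> int:
        # Double the delay after each failed attempt, never exceeding the cap.
        self.current_ms = min(self.max_ms, max(self.base_ms, self.current_ms * 2))
        return self.current_ms

    def reset(self) -> None:
        # Forget the accumulated delay so the next reconnect is not postponed.
        self.current_ms = 0


@dataclass
class PeerConnections:
    """Hypothetical per-peer bookkeeping kept on the local node."""
    backoffs: dict = field(default_factory=dict)

    def on_connect_failure(self, peer_id: int) -> int:
        return self.backoffs.setdefault(peer_id, ReconnectBackoff()).on_failure()

    def on_peer_checkin(self, peer_id: int) -> None:
        # The peer reached us, so it is up: drop any backoff held against it.
        if peer_id in self.backoffs:
            self.backoffs[peer_id].reset()

    def reconnect_delay_ms(self, peer_id: int) -> int:
        backoff = self.backoffs.get(peer_id)
        return backoff.current_ms if backoff else 0
```

Without the `on_peer_checkin` reset, a peer that restarts after several failed attempts would stay unreachable from this node for the full remaining delay, even though it is already up.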

Fixes #5795
Fixes #11307
Fixes #11276

Backports Required

Release Notes

none

bharathv · 2023-06-12T04:12:41Z

/ci-repeat 4
skip-units
dt-repeat=10
tests/rptest/tests/partition_balancer_test.py
tests/rack_aware_replica_placement_test.py
tests/rptest/tests/leadership_transfer_test.py

VladLazar

The change makes sense to me.

mmaslankaprv · 2023-06-12T09:01:05Z

This change looks good. I am wondering if we shouldn't also introduce a configuration that would allow setting the max backoff time for node_status connections.

bharathv · 2023-06-12T21:29:30Z

Failures: (unrelated, known)

CI Failure (Assert failure: (log_manager.cc:449) '_logs.find(cfg.ntp()) == _logs.end()' cannot double register same ntp) #11344
CI Failure (TimeoutError('Redpanda failed to terminate in 30 seconds')) in PartitionBalancerTest.test_unavailable_nodes #11321
CI Failure (Timeout - Failed to start) in MultiTopicAutomaticLeadershipBalancingTest.test_topic_aware_rebalance #11044
Some failures installing apt packages (repo timed out)

bharathv · 2023-06-12T23:20:14Z

> I am wondering if we shouldn't also introduce a configuration that would allow setting the max backoff time for node_status connections

Done. A network partition is an interesting situation.

Adds the node_status_reconnect_max_backoff_ms cluster configuration, which defaults to 15s.

bharathv · 2023-06-12T23:32:30Z

The last force push was to fix conflicts.

mmaslankaprv · 2023-06-13T07:52:18Z

There seems to be a related failure:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 79, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/node_status_test.py", line 141, in test_all_nodes_up
    status_graph.check_cluster_status()
  File "/root/tests/rptest/tests/node_status_test.py", line 98, in check_cluster_status
    assert is_live(
AssertionError: Expected node docker-rp-7 to be alive, but since_last_status > max_delta: ms_since_last_status=211, tolerance=125

bharathv · 2023-06-13T16:45:40Z

Hmm, looks like the debug build strikes again. Taking a look.

Wrap the status check in a waiter. With debug builds, the backend can take some additional time to reach the desired state, especially after all the transports are invalidated following a backoff reset.
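
A minimal sketch of the waiter pattern, assuming a hypothetical `wait_until` helper and a hypothetical `is_live` rule (neither is taken from node_status_test.py). Rather than asserting once, the check is polled until it passes or a timeout expires. The values 211 and 125 come from the failure above; the remaining samples are simulated.

```python
import time


def wait_until(condition, timeout_sec, backoff_sec=0.1, err_msg=""):
    """Poll `condition` until it returns True or `timeout_sec` elapses.

    Hypothetical helper for illustration; not the helper used by the test.
    """
    deadline = time.monotonic() + timeout_sec
    while not condition():
        if time.monotonic() >= deadline:
            raise TimeoutError(err_msg)
        time.sleep(backoff_sec)


def is_live(ms_since_last_status, tolerance_ms):
    # Assumed liveness rule: the last status update is recent enough.
    return ms_since_last_status <= tolerance_ms


# Simulated readings: the first two exceed the tolerance, the third passes.
samples = iter([211, 180, 90])
wait_until(lambda: is_live(next(samples), tolerance_ms=125),
           timeout_sec=5, backoff_sec=0.01,
           err_msg="node did not become live in time")
```

A single assertion would have failed on the first reading; the waiter tolerates a slow backend while still failing if the node never becomes live within the timeout.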

bharathv · 2023-06-14T16:02:14Z

Failures: (all known, unrelated)

github-actions bot added the area/redpanda label Jun 12, 2023

bharathv requested review from VladLazar, ztlpn and mmaslankaprv June 12, 2023 06:09

VladLazar previously approved these changes Jun 12, 2023

bharathv self-assigned this Jun 12, 2023

bharathv dismissed VladLazar’s stale review via 5669c4a June 12, 2023 23:18

bharathv force-pushed the rack_aw branch from 07c5c82 to 5669c4a Compare June 12, 2023 23:18

bharathv requested a review from VladLazar June 12, 2023 23:19

bharathv added 3 commits June 12, 2023 16:20

node_status_backend: factor connection_source_shard into a method

32a8a2d

node_status_backend: Make reconnect backoff configurable

c111134

Adds node_status_reconnect_max_backoff_ms cluster configuration and defaults to 15s.

bharathv force-pushed the rack_aw branch from 5669c4a to c111134 Compare June 12, 2023 23:31

ducktape/node_status: deflake test

4ca6fbf

Wrap it in a waiter. With debug builds the backend can potentially take some additional time to reach the desired state, especially after invalidating all the transports after resetting a backoff.

mmaslankaprv approved these changes Jun 14, 2023

vshtokman merged commit 0ad8dd3 into redpanda-data:dev Jun 14, 2023

bharathv deleted the rack_aw branch June 14, 2023 21:23
