Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI Failure (Timeout: Leadership did not stablize) in MultiTopicAutomaticLeadershipBalancingTest.test_topic_aware_rebalance #11454

Closed
r-vasquez opened this issue Jun 15, 2023 · 12 comments · Fixed by #11547
Assignees
Labels
ci-failure kind/bug Something isn't working

Comments

@r-vasquez
Copy link
Contributor

https://buildkite.com/redpanda/redpanda/builds/31317#0188bb7e-46df-41db-a768-aebe518f818b/363-5766

Module: rptest.tests.leadership_transfer_test
Class:  MultiTopicAutomaticLeadershipBalancingTest
Method: test_topic_aware_rebalance
test_id:    rptest.tests.leadership_transfer_test.MultiTopicAutomaticLeadershipBalancingTest.test_topic_aware_rebalance
--
  | status:     FAIL
  | run time:   7 minutes 41.276 seconds
  |  
  |  
  | TimeoutError('Leadership did not stablize')
  | Traceback (most recent call last):
  | File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
  | data = self.run_test()
  | File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
  | return self.test_context.function(self.test)
  | File "/root/tests/rptest/services/cluster.py", line 79, in wrapped
  | r = f(self, *args, **kwargs)
  | File "/root/tests/rptest/tests/leadership_transfer_test.py", line 191, in test_topic_aware_rebalance
  | wait_until(lambda: topic_leadership_evenly_distributed(),
  | File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
  | raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
  | ducktape.errors.TimeoutError: Leadership did not stablize

Failure mode is different from #11044:

File "/root/tests/rptest/tests/leadership_transfer_test.py", line 191, in test_topic_aware_rebalance
--
  | wait_until(lambda: topic_leadership_evenly_distributed(),
@ztlpn
Copy link
Contributor

ztlpn commented Jun 16, 2023

@abhijat
Copy link
Contributor

abhijat commented Jun 19, 2023

@BenPope
Copy link
Member

BenPope commented Jun 19, 2023

I've looked at a couple of the log files:

The first one in the issue:

The test expects topic-zmejsczxsj to converge on >16 partitions per node. It's not obvious (to me) that it's converging.

[INFO  - 2023-06-14 20:42:54,594 - leadership_transfer_test - has_leader_count - lineno:154]: topic-zmejsczxsj has dist dict_values([44, 3, 14])
[INFO  - 2023-06-14 20:43:04,887 - leadership_transfer_test - has_leader_count - lineno:154]: topic-zmejsczxsj has dist dict_values([44, 16, 1])
[INFO  - 2023-06-14 20:43:18,908 - leadership_transfer_test - has_leader_count - lineno:154]: topic-zmejsczxsj has dist dict_values([30, 27, 4])
[INFO  - 2023-06-14 20:43:33,044 - leadership_transfer_test - has_leader_count - lineno:154]: topic-zmejsczxsj has dist dict_values([30, 16, 15])
[INFO  - 2023-06-14 20:43:43,157 - leadership_transfer_test - has_leader_count - lineno:154]: topic-zmejsczxsj has dist dict_values([30, 18, 13])
[INFO  - 2023-06-14 20:43:53,507 - leadership_transfer_test - has_leader_count - lineno:154]: topic-zmejsczxsj has dist dict_values([30, 17, 14])
[INFO  - 2023-06-14 20:44:03,798 - leadership_transfer_test - has_leader_count - lineno:154]: topic-zmejsczxsj has dist dict_values([20, 30, 11])
[INFO  - 2023-06-14 20:44:14,314 - leadership_transfer_test - has_leader_count - lineno:154]: topic-zmejsczxsj has dist dict_values([24, 25, 12])
[INFO  - 2023-06-14 20:44:29,123 - leadership_transfer_test - has_leader_count - lineno:154]: topic-zmejsczxsj has dist dict_values([20, 28, 13])
[INFO  - 2023-06-14 20:44:39,312 - leadership_transfer_test - has_leader_count - lineno:154]: topic-zmejsczxsj has dist dict_values([11, 35, 15])
[INFO  - 2023-06-14 20:44:49,704 - leadership_transfer_test - has_leader_count - lineno:154]: topic-zmejsczxsj has dist dict_values([23, 25, 13])
[INFO  - 2023-06-14 20:45:00,397 - leadership_transfer_test - has_leader_count - lineno:154]: topic-zmejsczxsj has dist dict_values([20, 31, 10])
[INFO  - 2023-06-14 20:45:14,545 - leadership_transfer_test - has_leader_count - lineno:154]: topic-zmejsczxsj has dist dict_values([18, 32, 11])
[INFO  - 2023-06-14 20:45:28,837 - leadership_transfer_test - has_leader_count - lineno:154]: topic-zmejsczxsj has dist dict_values([22, 23, 16])
[INFO  - 2023-06-14 20:45:29,684 - leadership_transfer_test - has_leader_count - lineno:154]: topic-davzozzuay has dist dict_values([29, 47, 75])
[INFO  - 2023-06-14 20:45:39,956 - leadership_transfer_test - has_leader_count - lineno:154]: topic-zmejsczxsj has dist dict_values([17, 29, 15])
[INFO  - 2023-06-14 20:45:54,098 - leadership_transfer_test - has_leader_count - lineno:154]: topic-zmejsczxsj has dist dict_values([26, 20, 15])
[INFO  - 2023-06-14 20:46:05,433 - leadership_transfer_test - has_leader_count - lineno:154]: topic-zmejsczxsj has dist dict_values([29, 18, 14])
[INFO  - 2023-06-14 20:46:19,602 - leadership_transfer_test - has_leader_count - lineno:154]: topic-zmejsczxsj has dist dict_values([26, 22, 13])
[INFO  - 2023-06-14 20:46:30,778 - leadership_transfer_test - has_leader_count - lineno:154]: topic-zmejsczxsj has dist dict_values([25, 19, 17])
[INFO  - 2023-06-14 20:46:31,537 - leadership_transfer_test - has_leader_count - lineno:154]: topic-davzozzuay has dist dict_values([42, 66, 43])
[INFO  - 2023-06-14 20:46:35,652 - leadership_transfer_test - has_leader_count - lineno:154]: topic-zcphqkygir has dist dict_values([119, 54, 90])
[INFO  - 2023-06-14 20:46:45,970 - leadership_transfer_test - has_leader_count - lineno:154]: topic-zmejsczxsj has dist dict_values([25, 22, 14])
[INFO  - 2023-06-14 20:46:56,113 - leadership_transfer_test - has_leader_count - lineno:154]: topic-zmejsczxsj has dist dict_values([29, 25, 7])
[INFO  - 2023-06-14 20:47:06,293 - leadership_transfer_test - has_leader_count - lineno:154]: topic-zmejsczxsj has dist dict_values([13, 34, 14])
[INFO  - 2023-06-14 20:47:16,702 - leadership_transfer_test - has_leader_count - lineno:154]: topic-zmejsczxsj has dist dict_values([37, 12, 12])
[INFO  - 2023-06-14 20:47:27,120 - leadership_transfer_test - has_leader_count - lineno:154]: topic-zmejsczxsj has dist dict_values([29, 19, 13])
[INFO  - 2023-06-14 20:47:37,368 - leadership_transfer_test - has_leader_count - lineno:154]: topic-zmejsczxsj has dist dict_values([30, 27, 4])

For the most recent test failure:

The test expects topic-dqzbtadrwu to converge on >16 partitions per node. It's not obvious (to me) that it's converging:

[INFO  - 2023-06-19 10:47:41,747 - leadership_transfer_test - has_leader_count - lineno:154]: topic-dqzbtadrwu has dist dict_values([18, 41, 2])
[INFO  - 2023-06-19 10:47:52,052 - leadership_transfer_test - has_leader_count - lineno:154]: topic-dqzbtadrwu has dist dict_values([13, 26, 22])
[INFO  - 2023-06-19 10:48:02,235 - leadership_transfer_test - has_leader_count - lineno:154]: topic-dqzbtadrwu has dist dict_values([43, 10, 8])
[INFO  - 2023-06-19 10:48:12,482 - leadership_transfer_test - has_leader_count - lineno:154]: topic-dqzbtadrwu has dist dict_values([16, 27, 18])
[INFO  - 2023-06-19 10:48:12,623 - leadership_transfer_test - has_leader_count - lineno:154]: topic-gtmzuibkrp has dist dict_values([27, 39, 85])
[INFO  - 2023-06-19 10:48:22,854 - leadership_transfer_test - has_leader_count - lineno:154]: topic-dqzbtadrwu has dist dict_values([12, 35, 14])
[INFO  - 2023-06-19 10:48:33,408 - leadership_transfer_test - has_leader_count - lineno:154]: topic-dqzbtadrwu has dist dict_values([17, 17, 27])
[INFO  - 2023-06-19 10:48:37,541 - leadership_transfer_test - has_leader_count - lineno:154]: topic-gtmzuibkrp has dist dict_values([11, 73, 67])
[INFO  - 2023-06-19 10:48:47,679 - leadership_transfer_test - has_leader_count - lineno:154]: topic-dqzbtadrwu has dist dict_values([39, 18, 4])
[INFO  - 2023-06-19 10:48:57,818 - leadership_transfer_test - has_leader_count - lineno:154]: topic-dqzbtadrwu has dist dict_values([36, 23, 2])
[INFO  - 2023-06-19 10:49:10,096 - leadership_transfer_test - has_leader_count - lineno:154]: topic-dqzbtadrwu has dist dict_values([15, 35, 11])
[INFO  - 2023-06-19 10:49:20,285 - leadership_transfer_test - has_leader_count - lineno:154]: topic-dqzbtadrwu has dist dict_values([23, 8, 30])
[INFO  - 2023-06-19 10:49:34,637 - leadership_transfer_test - has_leader_count - lineno:154]: topic-dqzbtadrwu has dist dict_values([26, 8, 27])
[INFO  - 2023-06-19 10:49:44,814 - leadership_transfer_test - has_leader_count - lineno:154]: topic-dqzbtadrwu has dist dict_values([29, 26, 6])
[INFO  - 2023-06-19 10:49:55,201 - leadership_transfer_test - has_leader_count - lineno:154]: topic-dqzbtadrwu has dist dict_values([35, 19, 7])
[INFO  - 2023-06-19 10:50:05,517 - leadership_transfer_test - has_leader_count - lineno:154]: topic-dqzbtadrwu has dist dict_values([15, 37, 9])
[INFO  - 2023-06-19 10:50:18,319 - leadership_transfer_test - has_leader_count - lineno:154]: topic-dqzbtadrwu has dist dict_values([9, 15, 37])
[INFO  - 2023-06-19 10:50:31,635 - leadership_transfer_test - has_leader_count - lineno:154]: topic-dqzbtadrwu has dist dict_values([18, 15, 28])
[INFO  - 2023-06-19 10:50:45,764 - leadership_transfer_test - has_leader_count - lineno:154]: topic-dqzbtadrwu has dist dict_values([31, 18, 12])
[INFO  - 2023-06-19 10:50:56,036 - leadership_transfer_test - has_leader_count - lineno:154]: topic-dqzbtadrwu has dist dict_values([29, 14, 18])
[INFO  - 2023-06-19 10:51:10,205 - leadership_transfer_test - has_leader_count - lineno:154]: topic-dqzbtadrwu has dist dict_values([30, 26, 5])
[INFO  - 2023-06-19 10:51:24,719 - leadership_transfer_test - has_leader_count - lineno:154]: topic-dqzbtadrwu has dist dict_values([28, 21, 12])
[INFO  - 2023-06-19 10:51:35,005 - leadership_transfer_test - has_leader_count - lineno:154]: topic-dqzbtadrwu has dist dict_values([25, 18, 18])
[INFO  - 2023-06-19 10:51:36,373 - leadership_transfer_test - has_leader_count - lineno:154]: topic-gtmzuibkrp has dist dict_values([32, 50, 69])
[INFO  - 2023-06-19 10:51:46,595 - leadership_transfer_test - has_leader_count - lineno:154]: topic-dqzbtadrwu has dist dict_values([31, 17, 13])
[INFO  - 2023-06-19 10:51:57,042 - leadership_transfer_test - has_leader_count - lineno:154]: topic-dqzbtadrwu has dist dict_values([24, 18, 19])
[INFO  - 2023-06-19 10:51:57,173 - leadership_transfer_test - has_leader_count - lineno:154]: topic-gtmzuibkrp has dist dict_values([62, 16, 73])
[INFO  - 2023-06-19 10:52:07,609 - leadership_transfer_test - has_leader_count - lineno:154]: topic-dqzbtadrwu has dist dict_values([28, 14, 19])
[INFO  - 2023-06-19 10:52:21,827 - leadership_transfer_test - has_leader_count - lineno:154]: topic-dqzbtadrwu has dist dict_values([32, 18, 11])

In both cases the last two lines appear to show it getting worse.

@travisdowns
Copy link
Member

travisdowns commented Jun 19, 2023

Side note: possibly we should capture the initial conditions/input to the balancer (including random seed) and encode this in a unit test, since it seems like plausibly it's getting stuck in a local minimum or there is another problem with the balancer or assumptions in the test and we don't really need a full ducktape test to iterate on that.

cc @ballard26

@travisdowns
Copy link
Member

5 new occurrences:

FAIL test: MultiTopicAutomaticLeadershipBalancingTest.test_topic_aware_rebalance (5/31 runs)
failure at 2023-06-19T17:15:26.701Z: TimeoutError('Leadership did not stablize')
on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/31609#0188d46c-6601-4687-9273-73542b283655
failure at 2023-06-19T16:27:51.446Z: TimeoutError('Leadership did not stablize')
on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/31601#0188d441-60bc-481c-8a3c-368e102f5636
failure at 2023-06-19T01:19:57.960Z: TimeoutError('Leadership did not stablize')
on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/31567#0188d102-709f-4dfa-a1bf-47647df79717
failure at 2023-06-19T07:16:07.230Z: TimeoutError('Leadership did not stablize')
on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/31575#0188d249-1d2a-4a50-a0be-d9b259166824
failure at 2023-06-19T06:58:53.854Z: TimeoutError('Leadership did not stablize')
on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/31576#0188d238-b4ee-4526-819a-b4598aeccfe5

@piyushredpanda
Copy link
Contributor

Yeah there are a lot of these occurrences. I think now there's Michal, Ben and Brandon possibly involved here -- please align so there's no duplicated effort -- really appreciate folks jumping on the most-failing test.

@mmaslankaprv mmaslankaprv assigned mmaslankaprv and unassigned BenPope Jun 20, 2023
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jun 20, 2023
Using progress tracking and unbounded timeout to track topic aware
rebalancing. Instead of using a particular timeout value we verify if
topic aware rebalancing is making progress. If so we give the test a
time to pass.

Fixes: redpanda-data#11454

Signed-off-by: Michal Maslanka <[email protected]>
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jun 20, 2023
Using progress tracking and unbounded timeout to track topic aware
rebalancing. Instead of using a particular timeout value we verify if
topic aware rebalancing is making progress. If so we give the test a
time to pass.

Fixes: redpanda-data#11454

Signed-off-by: Michal Maslanka <[email protected]>
mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jun 20, 2023
Using progress tracking and unbounded timeout to track topic aware
rebalancing. Instead of using a particular timeout value we verify if
topic aware rebalancing is making progress. If so we give the test a
time to pass.

Fixes: redpanda-data#11454

Signed-off-by: Michal Maslanka <[email protected]>
@michael-redpanda
Copy link
Contributor

@travisdowns
Copy link
Member

FAIL test: MultiTopicAutomaticLeadershipBalancingTest.test_topic_aware_rebalance (4/13 runs)
failure at 2023-06-20T17:00:46.922Z: TimeoutError('Leadership did not stablize')
on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/31659#0188d995-8cfc-4c2d-b901-61d8aad6af21
failure at 2023-06-20T16:04:57.650Z: TimeoutError('Leadership did not stablize')
on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/31650#0188d959-5be9-4760-9b5d-a9cca5aa7d22
failure at 2023-06-20T14:52:40.429Z: TimeoutError('Leadership did not stablize')
on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/31649#0188d918-2bd8-4eb3-a07c-24645f6808be
failure at 2023-06-20T13:32:07.053Z: TimeoutError('Leadership did not stablize')
on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/31647#0188d8ce-7808-4e28-a18d-4177ebde0b34

mmaslankaprv added a commit to mmaslankaprv/redpanda that referenced this issue Jun 21, 2023
Using progress tracking and unbounded timeout to track topic aware
rebalancing. Instead of using a particular timeout value we verify if
topic aware rebalancing is making progress. If so we give the test a
time to pass.

Fixes: redpanda-data#11454

Signed-off-by: Michal Maslanka <[email protected]>
@vbotbuildovich
Copy link
Collaborator

@bharathv
Copy link
Contributor

Think this should be fixed by #17145, lets reopen if the issue persists after the fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci-failure kind/bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants