
CI Failure (TimeoutError in wait_for_partitions_rebalanced) in ScalingUpTest.test_on_demand_rebalancing #10024

Closed
dlex opened this issue Apr 13, 2023 · 16 comments

dlex commented Apr 13, 2023

https://buildkite.com/redpanda/redpanda/builds/26854#018770fe-e373-4dff-8222-d485b1468767

Module: rptest.tests.scaling_up_test
Class:  ScalingUpTest
Method: test_on_demand_rebalancing
Arguments:
{
  "partition_count": 1
}

This may also happen in test_adding_nodes_to_cluster.

test_id:    rptest.tests.scaling_up_test.ScalingUpTest.test_on_demand_rebalancing.partition_count=1
status:     FAIL
run time:   3 minutes 33.110 seconds


    TimeoutError('')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/scaling_up_test.py", line 240, in test_on_demand_rebalancing
    self.wait_for_partitions_rebalanced(total_replicas=total_replicas,
  File "/root/tests/rptest/tests/scaling_up_test.py", line 113, in wait_for_partitions_rebalanced
    wait_until(partitions_rebalanced,
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError

This issue is related to #7418 and is a follow-up to its fix in #9947.

The failure manifests as follows:

[INFO  - 2023-04-11 16:20:52,412 - scaling_up_test - partitions_rebalanced - lineno:78]: replicas per domain per node: {-1: {1: 8, 2: 8, 3: 8, 4: 8, 5: 7, 6: 9}, 0: {1: 2, 2: 2, 3: 4, 4: 2, 5: 2}} 
[DEBUG - 2023-04-11 16:20:52,414 - scaling_up_test - partitions_rebalanced - lineno:94]: In domain 0, not all nodes' partition counts fall within the expected range [1, 3]. Nodes: 6                
[ERROR - 2023-04-11 16:20:53,416 - cluster - wrapped - lineno:41]: Test failed, doing failure checks...                                                                                              

The distribution of replicas across nodes here is [2, 2, 4, 2, 2]: 12 replicas across 5 nodes gives expected_per_node == 2.4, so the upper bound of expected_range is 2.4 * 1.2 = 2.88, which rounds up to 3. It appears that when the overall replica count is this low, the 20% tolerance in the test criteria is not enough.
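For context, the acceptance check boils down to something like the following (a minimal Python sketch under assumed names, not the exact wait_for_partitions_rebalanced code):

import math

def expected_range(total_replicas: int, node_count: int, tolerance: float = 0.2):
    # Sketch of the test's acceptance range (assumed shape, not the exact
    # rptest implementation): average replicas per node +/- 20% tolerance,
    # rounded outward to whole partitions.
    expected_per_node = total_replicas / node_count
    lower = math.floor(expected_per_node * (1 - tolerance))
    upper = math.ceil(expected_per_node * (1 + tolerance))
    return lower, upper

# 12 replicas over 5 nodes gives 2.4 per node, so the range is [1, 3];
# a node holding 4 replicas (or 0) is flagged as unbalanced.
print(expected_range(12, 5))  # (1, 3)

With totals this small, a single extra replica on one node is enough to fall outside the ±20% window.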

@michael-redpanda (Contributor)

@michael-redpanda (Contributor)

dlex commented Apr 19, 2023

This issue may be fixed by 4bdccb2. If it is not, we need to consider it a redpanda bug, because according to @mmaslankaprv the unevenness of [2, 2, 4, 2, 2] is too much to be considered normal.


dlex commented Apr 19, 2023

This may also be a duplicate of #7756.

The failure at https://buildkite.com/redpanda/redpanda/builds/27434#0187969e-69c6-4591-97e1-eb29f9ed90f3 (reported above by @michael-redpanda) is on the v22.3 branch, where #9622 has not been backported yet.

@rockwotj (Contributor)


dlex commented Apr 24, 2023

v22.3.x build: https://buildkite.com/redpanda/redpanda/builds/27775#0187a5a9-7cb7-4e5a-b8cb-db2e98f28217

That was v22.3.16; we need #9622 to be backported there before assessing these failures.

Still, here is a bit of analysis:

[INFO  - 2023-04-21 21:37:13,784 - scaling_up_test - partitions_rebalanced - lineno:78]: replicas per domain per node: {-1: {1: 8, 2: 8, 3: 8, 4: 9, 5: 8, 6: 7}, 0: {1: 4, 2: 2, 3: 2, 4: 1, 5: 1, 6: 2}}

In this case domain 0 ends up completely unbalanced. However, this is what the last reconciliation loop looks like before the cluster settled into that state:

INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:798 - [update: {nullopt}] reconciliation loop - pending reallocation count: 0, finished: false                                                                 
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:535 - [update: {nullopt}] there are 48 replicas in -1 domain, requested to assign 8 replicas per node                                                          
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:568 - [update: {nullopt}] there are 0 replicas to move from node 2 in domain -1, current allocations: 8                                                        
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:568 - [update: {nullopt}] there are 1 replicas to move from node 4 in domain -1, current allocations: 9                                                        
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:568 - [update: {nullopt}] there are 0 replicas to move from node 5 in domain -1, current allocations: 8                                                        
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:568 - [update: {nullopt}] there are 0 replicas to move from node 3 in domain -1, current allocations: 8                                                        
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:568 - [update: {nullopt}] there are 0 replicas to move from node 6 in domain -1, current allocations: 7                                                        
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:568 - [update: {nullopt}] there are 0 replicas to move from node 1 in domain -1, current allocations: 8                                                        
DEBUG 2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:989 - [ntp: {kafka/__consumer_offsets/1}, {{node_id: 4, shard: 1}, {node_id: 2, shard: 1}, {node_id: 3, shard: 1}} -> -]  trying to reassign partition replicas
DEBUG 2023-04-21 21:35:53,921 [shard 0] cluster - partition_allocator.cc:311 - reallocating {partition_id: 1, replication_factor: 3, constrains: {soft_constraints: {}, hard_constraints: {}}}, replicas left: {{node_id: 2, shard: 1}, {node_id: 3, shard: 1}}
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - partition_allocator.cc:80 - allocating partition with constraints: {partition_id: 1, replication_factor: 1, constrains: {soft_constraints: {}, hard_constraints: {distinct from: {{node_id: 2, shard: 1}, {node_id: 3, shard: 1}}}}}
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - allocation_strategy.cc:104 - constraint: least allocated node in domain -1, node: 1, score: 9994284                                                                               
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - allocation_strategy.cc:104 - constraint: least allocated node in domain -1, node: 4, score: 9993570                                                                               
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - allocation_strategy.cc:104 - constraint: least allocated node in domain -1, node: 5, score: 9994284                                                                               
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - allocation_strategy.cc:104 - constraint: least allocated node in domain -1, node: 6, score: 9994999                                                                               
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:506 - node 2 has 8 replicas allocated, requested replicas per node 8, difference: 0                                                                            
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:506 - node 4 has 8 replicas allocated, requested replicas per node 8, difference: 0                                                                            
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:506 - node 5 has 8 replicas allocated, requested replicas per node 8, difference: 0                                                                            
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:506 - node 3 has 8 replicas allocated, requested replicas per node 8, difference: 0                                                                            
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:506 - node 6 has 8 replicas allocated, requested replicas per node 8, difference: 0                                                                            
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:506 - node 1 has 8 replicas allocated, requested replicas per node 8, difference: 0                                                                            
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:266 - [update: {nullopt}] unevenness error: 0, previous error: 0, improvement: 0, min improvement: 0.025                                                       
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:535 - [update: {nullopt}] there are 12 replicas in 0 domain, requested to assign 2 replicas per node                                                           
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:568 - [update: {nullopt}] there are 0 replicas to move from node 2 in domain 0, current allocations: 2                                                         
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:568 - [update: {nullopt}] there are 0 replicas to move from node 4 in domain 0, current allocations: 1                                                         
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:568 - [update: {nullopt}] there are 0 replicas to move from node 5 in domain 0, current allocations: 1                                                         
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:568 - [update: {nullopt}] there are 0 replicas to move from node 3 in domain 0, current allocations: 2                                                         
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:568 - [update: {nullopt}] there are 0 replicas to move from node 6 in domain 0, current allocations: 2                                                         
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:568 - [update: {nullopt}] there are 2 replicas to move from node 1 in domain 0, current allocations: 4                                                         
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:506 - node 2 has 2 replicas allocated, requested replicas per node 2, difference: 0                                                                            
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:506 - node 4 has 1 replicas allocated, requested replicas per node 2, difference: 1                                                                            
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:506 - node 5 has 1 replicas allocated, requested replicas per node 2, difference: 1                                                                            
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:506 - node 3 has 2 replicas allocated, requested replicas per node 2, difference: 0                                                                            
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:506 - node 6 has 2 replicas allocated, requested replicas per node 2, difference: 0                                                                            
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:506 - node 1 has 4 replicas allocated, requested replicas per node 2, difference: -2                                                                           
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:266 - [update: {nullopt}] unevenness error: 0.19999999999999998, previous error: 0.19999999999999998, improvement: 0, min improvement: 0.1                     
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:813 - [update: {nullopt}] calculated reallocations: {}                                                                                                         
DEBUG 2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:834 - [update: {nullopt}] no need reallocations, finished: true                                                                                                
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:180 - reconciliation loop result: Success                                                                                                                      
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:180 - reconciliation loop result: Success                                                                                                                      

So the balancer decided that this distribution was fine and gave up.
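For reference, the stop decision visible in the log reduces to something like this (a Python sketch built from the values logged at members_backend.cc:266, not the actual C++ code):

# Values taken from the members_backend.cc:266 log line for domain 0.
unevenness_error = 0.2   # error after this reconciliation pass
previous_error = 0.2     # error after the previous pass
min_improvement = 0.1    # smallest improvement worth another pass

improvement = previous_error - unevenness_error  # 0.0
if improvement < min_improvement:
    # The pass does not improve the distribution enough, so the balancer
    # logs "no need reallocations, finished: true" and leaves domain 0
    # at {1: 4, 2: 2, 3: 2, 4: 1, 5: 1, 6: 2}.
    print("reconciliation loop finished without reallocations")

Per the log, the latest pass improved the unevenness error by 0, which is below the 0.1 minimum, so the balancer stops and leaves the [4, 2, 2, 1, 1, 2] layout in place.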


abhijat commented Apr 25, 2023

@michael-redpanda (Contributor)


dlex commented May 18, 2023


rystsov commented May 18, 2023

@VladLazar (Contributor)

@michael-redpanda (Contributor)

@NyaliaLui (Contributor)


ztlpn commented May 25, 2023

Sometimes it fails in debug mode because of a stuck partition movement problem (that @bharathv tracked to a possible RPC bug).

@michael-redpanda (Contributor)

@piyushredpanda (Contributor)

Closing old issues that have not occurred in 2 months.
