
CI Failure (TimeoutError in wait_for_partitions_rebalanced) in ScalingUpTest.test_on_demand_rebalancing #10024

Closed
dlex opened this issue Apr 13, 2023 · 16 comments

dlex commented Apr 13, 2023

https://buildkite.com/redpanda/redpanda/builds/26854#018770fe-e373-4dff-8222-d485b1468767

Module: rptest.tests.scaling_up_test
Class:  ScalingUpTest
Method: test_on_demand_rebalancing
Arguments:
{
  "partition_count": 1
}

This may also happen in test_adding_nodes_to_cluster.

test_id:    rptest.tests.scaling_up_test.ScalingUpTest.test_on_demand_rebalancing.partition_count=1
status:     FAIL
run time:   3 minutes 33.110 seconds


    TimeoutError('')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/scaling_up_test.py", line 240, in test_on_demand_rebalancing
    self.wait_for_partitions_rebalanced(total_replicas=total_replicas,
  File "/root/tests/rptest/tests/scaling_up_test.py", line 113, in wait_for_partitions_rebalanced
    wait_until(partitions_rebalanced,
  File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError

This issue is related to #7418 and is a follow-up to its fix in #9947.

The failure manifests as follows:

[INFO  - 2023-04-11 16:20:52,412 - scaling_up_test - partitions_rebalanced - lineno:78]: replicas per domain per node: {-1: {1: 8, 2: 8, 3: 8, 4: 8, 5: 7, 6: 9}, 0: {1: 2, 2: 2, 3: 4, 4: 2, 5: 2}} 
[DEBUG - 2023-04-11 16:20:52,414 - scaling_up_test - partitions_rebalanced - lineno:94]: In domain 0, not all nodes' partition counts fall within the expected range [1, 3]. Nodes: 6                
[ERROR - 2023-04-11 16:20:53,416 - cluster - wrapped - lineno:41]: Test failed, doing failure checks...                                                                                              

The distribution of replicas across nodes here is [2, 2, 4, 2, 2]: 12 replicas across 5 nodes gives expected_per_node == 2.4, so the upper bound of expected_range is 2.4 * 1.2 = 2.88, which rounds up to 3. It appears that when the overall replica count is this low, the 20% tolerance in the test criteria is not enough.
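For context, the acceptance check boils down to something like the following (a minimal Python sketch under assumed names, not the exact wait_for_partitions_rebalanced code):

import math

def expected_range(total_replicas: int, node_count: int, tolerance: float = 0.2):
    # Sketch of the test's acceptance range (assumed shape, not the exact
    # rptest implementation): average replicas per node +/- 20% tolerance,
    # rounded outward to whole partitions.
    expected_per_node = total_replicas / node_count
    lower = math.floor(expected_per_node * (1 - tolerance))
    upper = math.ceil(expected_per_node * (1 + tolerance))
    return lower, upper

# 12 replicas over 5 nodes gives 2.4 per node, so the range is [1, 3];
# a node holding 4 replicas (or 0) is flagged as unbalanced.
print(expected_range(12, 5))  # (1, 3)

With totals this small, a single extra replica on one node is enough to fall outside the ±20% window.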

@michael-redpanda (Contributor)

@michael-redpanda (Contributor)

dlex commented Apr 19, 2023

This issue may be fixed by 4bdccb2. If it is not, we need to consider it a redpanda bug, because according to @mmaslankaprv the unevenness of [2, 2, 4, 2, 2] is too much to be considered normal.


dlex commented Apr 19, 2023

This may also be a duplicate of #7756.

The failure at https://buildkite.com/redpanda/redpanda/builds/27434#0187969e-69c6-4591-97e1-eb29f9ed90f3 (reported above by @michael-redpanda) is on the v22.3 branch, where #9622 has not been backported yet.

@rockwotj (Contributor)


dlex commented Apr 24, 2023

v22.3.x build: https://buildkite.com/redpanda/redpanda/builds/27775#0187a5a9-7cb7-4e5a-b8cb-db2e98f28217

That was v22.3.16; we need #9622 to be backported there before assessing these failures.

Still, here is a bit of analysis:

[INFO  - 2023-04-21 21:37:13,784 - scaling_up_test - partitions_rebalanced - lineno:78]: replicas per domain per node: {-1: {1: 8, 2: 8, 3: 8, 4: 9, 5: 8, 6: 7}, 0: {1: 4, 2: 2, 3: 2, 4: 1, 5: 1, 6: 2}}

In this case domain 0 ends up completely unbalanced. However, this is what the last reconciliation loop looks like before the cluster settled into that state:

INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:798 - [update: {nullopt}] reconciliation loop - pending reallocation count: 0, finished: false                                                                 
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:535 - [update: {nullopt}] there are 48 replicas in -1 domain, requested to assign 8 replicas per node                                                          
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:568 - [update: {nullopt}] there are 0 replicas to move from node 2 in domain -1, current allocations: 8                                                        
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:568 - [update: {nullopt}] there are 1 replicas to move from node 4 in domain -1, current allocations: 9                                                        
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:568 - [update: {nullopt}] there are 0 replicas to move from node 5 in domain -1, current allocations: 8                                                        
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:568 - [update: {nullopt}] there are 0 replicas to move from node 3 in domain -1, current allocations: 8                                                        
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:568 - [update: {nullopt}] there are 0 replicas to move from node 6 in domain -1, current allocations: 7                                                        
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:568 - [update: {nullopt}] there are 0 replicas to move from node 1 in domain -1, current allocations: 8                                                        
DEBUG 2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:989 - [ntp: {kafka/__consumer_offsets/1}, {{node_id: 4, shard: 1}, {node_id: 2, shard: 1}, {node_id: 3, shard: 1}} -> -]  trying to reassign partition replicas
DEBUG 2023-04-21 21:35:53,921 [shard 0] cluster - partition_allocator.cc:311 - reallocating {partition_id: 1, replication_factor: 3, constrains: {soft_constraints: {}, hard_constraints: {}}}, replicas left: {{node_id: 2, shard: 1}, {node_id: 3, shard: 1}}
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - partition_allocator.cc:80 - allocating partition with constraints: {partition_id: 1, replication_factor: 1, constrains: {soft_constraints: {}, hard_constraints: {distinct from: {{node_id: 2, shard: 1}, {node_id: 3, shard: 1}}}}}
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - allocation_strategy.cc:104 - constraint: least allocated node in domain -1, node: 1, score: 9994284                                                                               
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - allocation_strategy.cc:104 - constraint: least allocated node in domain -1, node: 4, score: 9993570                                                                               
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - allocation_strategy.cc:104 - constraint: least allocated node in domain -1, node: 5, score: 9994284                                                                               
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - allocation_strategy.cc:104 - constraint: least allocated node in domain -1, node: 6, score: 9994999                                                                               
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:506 - node 2 has 8 replicas allocated, requested replicas per node 8, difference: 0                                                                            
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:506 - node 4 has 8 replicas allocated, requested replicas per node 8, difference: 0                                                                            
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:506 - node 5 has 8 replicas allocated, requested replicas per node 8, difference: 0                                                                            
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:506 - node 3 has 8 replicas allocated, requested replicas per node 8, difference: 0                                                                            
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:506 - node 6 has 8 replicas allocated, requested replicas per node 8, difference: 0                                                                            
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:506 - node 1 has 8 replicas allocated, requested replicas per node 8, difference: 0                                                                            
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:266 - [update: {nullopt}] unevenness error: 0, previous error: 0, improvement: 0, min improvement: 0.025                                                       
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:535 - [update: {nullopt}] there are 12 replicas in 0 domain, requested to assign 2 replicas per node                                                           
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:568 - [update: {nullopt}] there are 0 replicas to move from node 2 in domain 0, current allocations: 2                                                         
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:568 - [update: {nullopt}] there are 0 replicas to move from node 4 in domain 0, current allocations: 1                                                         
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:568 - [update: {nullopt}] there are 0 replicas to move from node 5 in domain 0, current allocations: 1                                                         
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:568 - [update: {nullopt}] there are 0 replicas to move from node 3 in domain 0, current allocations: 2                                                         
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:568 - [update: {nullopt}] there are 0 replicas to move from node 6 in domain 0, current allocations: 2                                                         
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:568 - [update: {nullopt}] there are 2 replicas to move from node 1 in domain 0, current allocations: 4                                                         
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:506 - node 2 has 2 replicas allocated, requested replicas per node 2, difference: 0                                                                            
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:506 - node 4 has 1 replicas allocated, requested replicas per node 2, difference: 1                                                                            
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:506 - node 5 has 1 replicas allocated, requested replicas per node 2, difference: 1                                                                            
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:506 - node 3 has 2 replicas allocated, requested replicas per node 2, difference: 0                                                                            
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:506 - node 6 has 2 replicas allocated, requested replicas per node 2, difference: 0                                                                            
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:506 - node 1 has 4 replicas allocated, requested replicas per node 2, difference: -2                                                                           
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:266 - [update: {nullopt}] unevenness error: 0.19999999999999998, previous error: 0.19999999999999998, improvement: 0, min improvement: 0.1                     
INFO  2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:813 - [update: {nullopt}] calculated reallocations: {}                                                                                                         
DEBUG 2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:834 - [update: {nullopt}] no need reallocations, finished: true                                                                                                
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:180 - reconciliation loop result: Success                                                                                                                      
TRACE 2023-04-21 21:35:53,921 [shard 0] cluster - members_backend.cc:180 - reconciliation loop result: Success                                                                                                                      

So the balancer decided that this distribution was fine and gave up.
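For reference, the stop decision visible in the log reduces to something like this (a Python sketch built from the values logged at members_backend.cc:266, not the actual C++ code):

# Values taken from the members_backend.cc:266 log line for domain 0.
unevenness_error = 0.2   # error after this reconciliation pass
previous_error = 0.2     # error after the previous pass
min_improvement = 0.1    # smallest improvement worth another pass

improvement = previous_error - unevenness_error  # 0.0
if improvement < min_improvement:
    # The pass does not improve the distribution enough, so the balancer
    # logs "no need reallocations, finished: true" and leaves domain 0
    # at {1: 4, 2: 2, 3: 2, 4: 1, 5: 1, 6: 2}.
    print("reconciliation loop finished without reallocations")

Per the log, the latest pass improved the unevenness error by 0, which is below the 0.1 minimum, so the balancer stops and leaves the [4, 2, 2, 1, 1, 2] layout in place.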


abhijat commented Apr 25, 2023

@michael-redpanda (Contributor)


dlex commented May 18, 2023


rystsov commented May 18, 2023

@VladLazar (Contributor)

@michael-redpanda (Contributor)

@NyaliaLui (Contributor)


ztlpn commented May 25, 2023

Sometimes it fails in debug mode because of a stuck partition movement problem (that @bharathv tracked to a possible RPC bug).

@michael-redpanda (Contributor)

@piyushredpanda (Contributor)

Closing old issues that have not occurred in 2 months.
