ARM: scale tests require more resources than nodes have (ManyPartitionsTest.test_many_partitions, ManyClientsTest.test_many_clients) #7405

Closed
jcsp opened this issue Nov 21, 2022 · 28 comments

@jcsp
Contributor

jcsp commented Nov 21, 2022

i3en.xlarge has 8GB per vCPU

Is4gen.4xlarge has 6GB per vCPU

Our scale tests do not pass reliably on the weaker arm nodes.

FAIL test: ManyPartitionsTest.test_many_partitions (2/3 runs)
  failure at 2022-11-20T03:48:19.569Z: RpkException('command /opt/redpanda/bin/rpk topic --brokers ip-172-31-43-14:9092,ip-172-31-32-247:9092,ip-172-31-36-138:9092,ip-172-31-46-34:9092,ip-172-31-36-63:9092,ip-172-31-42-204:9092,ip-172-31-43-72:9092,ip-172-31-40-33:9092,ip-172-31-47-0:9092 describe scale_000000 -p timed out')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4319#0184917c-c179-4d57-be92-563f4eb9f9c5
  failure at 2022-11-21T03:21:08.085Z: RpkException('command /opt/redpanda/bin/rpk topic --brokers ip-172-31-39-61:9092,ip-172-31-44-233:9092,ip-172-31-41-134:9092,ip-172-31-43-92:9092,ip-172-31-43-95:9092,ip-172-31-33-240:9092,ip-172-31-45-166:9092,ip-172-31-44-245:9092,ip-172-31-44-142:9092 describe scale_000000 -p timed out')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4324#018496a2-e835-4a1e-b9a1-af52a6a95f3b
FAIL test: ManyPartitionsTest.test_many_partitions_compacted (2/3 runs)
  failure at 2022-11-20T03:48:19.569Z: <BadLogLines nodes=ip-172-31-36-63(1) example="ERROR 2022-11-19 20:52:54,026 [shard  0] seastar - Failed to allocate 7340032 bytes">
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4319#0184917c-c179-4d57-be92-563f4eb9f9c5
  failure at 2022-11-21T03:21:08.085Z: RpkException('command /opt/redpanda/bin/rpk topic --brokers ip-172-31-44-245:9092,ip-172-31-43-92:9092,ip-172-31-39-61:9092,ip-172-31-43-95:9092,ip-172-31-45-166:9092,ip-172-31-44-233:9092,ip-172-31-33-240:9092,ip-172-31-41-134:9092,ip-172-31-44-142:9092 describe scale_000000 -p timed out')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4324#018496a2-e835-4a1e-b9a1-af52a6a95f3b
FAIL test: ManyClientsTest.test_many_clients (2/3 runs)
  failure at 2022-11-20T03:48:19.569Z: <BadLogLines nodes=ip-172-31-42-204(1) example="ERROR 2022-11-20 01:46:40,286 [shard 1] seastar - Failed to allocate 131072 bytes">
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4319#0184917c-c179-4d57-be92-563f4eb9f9c5
  failure at 2022-11-21T03:21:08.085Z: <BadLogLines nodes=ip-172-31-39-61(1) example="ERROR 2022-11-21 01:46:02,937 [shard 0] seastar - Failed to allocate 131072 bytes">
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4324#018496a2-e835-4a1e-b9a1-af52a6a95f3b
@jcsp
Contributor Author

jcsp commented Nov 22, 2022

I tried ManyPartitionsTest.test_many_partitions with PARTITIONS_PER_SHARD = 100 on is4gen.4xlarge, which is what the nightly runs use. In that instance it does get past the timeout while creating the topics, but then fails in _single_node_restart while waiting for a restarted node to regain leaderships: the leader balancer keeps trying to move leaderships but gets raft::errc::not_leader errors.

@bharathv
Contributor

bharathv commented Dec 20, 2022

Another instance..

https://buildkite.com/redpanda/vtools/builds/4730#018526d6-f604-42f0-b676-a71840fdf989

FAIL test: ManyClientsTest.test_many_clients (1/2 runs)
  failure at 2022-12-19T05:20:39.021Z: <BadLogLines nodes=ip-172-31-34-1(1) example="ERROR 2022-12-19 02:32:02,945 [shard 0] seastar - Failed to allocate 66432 bytes">
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4730#018526d6-f604-42f0-b676-a71840fdf989

@bharathv
Contributor

https://buildkite.com/redpanda/vtools/builds/4788#01853123-39d8-452b-b909-9c057db48f19

FAIL test: ManyPartitionsTest.test_many_partitions (1/3 runs)
  failure at 2022-12-21T05:00:23.156Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4788#01853123-39d8-452b-b909-9c057db48f19

FAIL test: ManyPartitionsTest.test_many_partitions_compacted (1/3 runs)
  failure at 2022-12-21T05:00:23.156Z: AssertionError('Unable to determine group within set number of attempts')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4788#01853123-39d8-452b-b909-9c057db48f19

@BenPope
Member

BenPope commented Dec 30, 2022

I'm tempted to group all of these in here:

FAIL test: ManyClientsTest.test_many_clients (1/14 runs)
failure at 2022-12-29T04:31:38.661Z: <BadLogLines nodes=ip-172-31-36-118(1) example="ERROR 2022-12-29 02:11:31,581 [shard 1] seastar - Failed to allocate 131072 bytes">
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4923#01855a56-3e54-4997-9a52-fa50cb839888
FAIL test: ManyPartitionsTest.test_many_partitions (7/14 runs)
failure at 2022-12-30T04:28:17.969Z: TimeoutError('Redpanda service ip-172-31-47-43 failed to start within 60 sec')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4928#01855f7c-4001-4ed2-90cc-e7a1398b565a
failure at 2022-12-29T04:31:38.661Z: TimeoutError('Redpanda service ip-172-31-46-161 failed to start within 60 sec')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4923#01855a56-3e54-4997-9a52-fa50cb839888
failure at 2022-12-28T04:28:12.819Z: AssertionError('Unable to determine group within set number of attempts')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4918#0185552f-c829-43a8-9d3b-b74db0421f55
failure at 2022-12-27T04:39:59.859Z: <BadLogLines nodes=ip-172-31-41-124(1),ip-172-31-45-245(1),ip-172-31-39-189(1) example="ERROR 2022-12-26 20:33:40,577 [shard 0] seastar - Failed to allocate 4870880 bytes">
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4913#0185500b-bdaa-4cb0-b640-31259ecac3be
failure at 2022-12-26T04:32:21.108Z: TimeoutError('Redpanda service ip-172-31-45-39 failed to start within 60 sec')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4908#01854ae2-7717-429d-85ac-87d26f4e421a
failure at 2022-12-24T04:13:23.294Z: <BadLogLines nodes=ip-172-31-43-39(1) example="ERROR 2022-12-23 20:21:14,100 [shard 10] rpc - server.cc:119 - Error[applying protocol] remote address: 172.31.32.149:63354 - seastar::broken_promise (broken promise)">
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4891#01854096-20d5-401c-b8b1-94e4d21e8e45
failure at 2022-12-25T04:20:18.252Z: TimeoutError('Redpanda service ip-172-31-37-74 failed to start within 60 sec')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4898#018545bd-095d-4b37-b549-0173b1d272c1
FAIL test: ManyPartitionsTest.test_many_partitions_compacted (7/14 runs)
failure at 2022-12-30T04:28:17.969Z: TimeoutError('')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4928#01855f7c-4001-4ed2-90cc-e7a1398b565a
failure at 2022-12-29T04:31:38.661Z: TimeoutError('')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4923#01855a56-3e54-4997-9a52-fa50cb839888
failure at 2022-12-28T04:28:12.819Z: <BadLogLines nodes=ip-172-31-37-6(1) example="ERROR 2022-12-27 21:01:46,318 [shard 0] seastar - Failed to allocate 4870880 bytes">
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4918#0185552f-c829-43a8-9d3b-b74db0421f55
failure at 2022-12-27T04:39:59.859Z: <BadLogLines nodes=ip-172-31-39-189(1) example="ERROR 2022-12-26 20:52:41,574 [shard 11] rpc - server.cc:119 - Error[applying protocol] remote address: 172.31.42.51:64603 - seastar::broken_promise (broken promise)">
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4913#0185500b-bdaa-4cb0-b640-31259ecac3be
failure at 2022-12-26T04:32:21.108Z: TimeoutError('')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4908#01854ae2-7717-429d-85ac-87d26f4e421a
failure at 2022-12-24T04:13:23.294Z: TimeoutError('')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4891#01854096-20d5-401c-b8b1-94e4d21e8e45
failure at 2022-12-25T04:20:18.252Z: <NodeCrash ip-172-31-46-176: Redpanda process unexpectedly stopped>
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4898#018545bd-095d-4b37-b549-0173b1d272c1

The last one, Redpanda process unexpectedly stopped, is due to Failed to allocate 4870880 bytes during shutdown. It also contains Semaphore timed out: raft/connected:

WARN  2022-12-24 20:44:09,721 [shard  1] seastar - Exceptional future ignored: seastar::named_semaphore_timed_out (Semaphore timed out: raft/connected), backtrace: 0x4b7abe7 0x48c08fb 0x1f24867 0x496c637 0x496f49f 0x49a7a73 0x491a017 /opt/redpanda/lib/libc.so.6+0x843b7 /opt/redpanda/lib/libc.so.6+0xef2db
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void> seastar::future<void>::handle_exception_type<auto ssx::spawn_with_gate_then<raft::consensus::linearizable_barrier()::$_49>(seastar::gate&, raft::consensus::linearizable_barrier()::$_49&&)::'lambda'(seastar::broken_condition_variable const&)>(raft::consensus::linearizable_barrier()::$_49&&)::'lambda'(raft::consensus::linearizable_barrier()::$_49&&), seastar::futurize<raft::consensus::linearizable_barrier()::$_49>::type seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void> seastar::future<void>::handle_exception_type<auto ssx::spawn_with_gate_then<raft::consensus::linearizable_barrier()::$_49>(seastar::gate&, raft::consensus::linearizable_barrier()::$_49&&)::'lambda'(seastar::broken_condition_variable const&)>(raft::consensus::linearizable_barrier()::$_49&&)::'lambda'(raft::consensus::linearizable_barrier()::$_49&&)>(seastar::future<void> seastar::future<void>::handle_exception_type<auto ssx::spawn_with_gate_then<raft::consensus::linearizable_barrier()::$_49>(seastar::gate&, raft::consensus::linearizable_barrier()::$_49&&)::'lambda'(seastar::broken_condition_variable const&)>(raft::consensus::linearizable_barrier()::$_49&&)::'lambda'(raft::consensus::linearizable_barrier()::$_49&&)&&)::'lambda'(seastar::internal::promise_base_with_type<void>&&, seastar::future<void> seastar::future<void>::handle_exception_type<auto ssx::spawn_with_gate_then<raft::consensus::linearizable_barrier()::$_49>(seastar::gate&, raft::consensus::linearizable_barrier()::$_49&&)::'lambda'(seastar::broken_condition_variable const&)>(raft::consensus::linearizable_barrier()::$_49&&)::'lambda'(raft::consensus::linearizable_barrier()::$_49&&)&, seastar::future_state<seastar::internal::monostate>&&), void>

@dotnwat
Member

dotnwat commented Dec 30, 2022

@BenPope do you think we should break out the ignored exceptional future into a separate item? Presumably that will exist independently of this generic ARM resource issue?

@BenPope
Member

BenPope commented Dec 30, 2022

I grouped it here as evidence of resource starvation, but yes, it should probably be addressed separately.

@dotnwat
Member

dotnwat commented Dec 30, 2022

I grouped it here as evidence of resource starvation, but yes, it should probably be addressed separately.

Got it. Just wanted to make sure we don't lose track of the ignored future, since even if we fix the root-cause resource issue behind the failures, the ignored future would still exist.

@ballard26
Contributor

Currently we are running nightly CDT tests on a 6x i3en.xlarge cluster for x86 and a 6x is4gen.4xlarge cluster for ARM. This means the ARM cluster is roughly 4-8x the size of the x86 cluster, depending on how much of a performance impact you attribute to hyperthreads counting as vCPUs on the x86 cluster.

Tests in ManyPartitionsTest scale according to the total number of shards in a cluster. This means the test uses ~32,000 partitions on the ARM cluster vs ~8,000 partitions on the x86 cluster. Combined with the lower memory per shard on the ARM cluster, it's pretty likely this is the issue. The solution is to ensure the ARM cluster has the same number of shards and memory per shard as the x86 cluster so that ManyPartitionsTest behaves similarly on both. I will be putting up a PR to do this soon.
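For illustration, a rough back-of-the-envelope of how the partition count scales with total shards (a sketch only; PARTITIONS_PER_SHARD = 100 and the ~8,000/~32,000 figures come from this thread, while the shard counts below are illustrative, not the exact nightly configuration):

# Hypothetical sketch of how the test's partition count scales with shard count.
PARTITIONS_PER_SHARD = 100  # figure mentioned earlier in this thread

def total_partitions(total_shards: int) -> int:
    # The scale test sizes its topics from the cluster's total shard count.
    return total_shards * PARTITIONS_PER_SHARD

x86_shards = 80   # illustrative: yields the ~8,000 partitions seen on x86
arm_shards = 320  # illustrative: yields the ~32,000 partitions seen on ARM

print(total_partitions(x86_shards), total_partitions(arm_shards))  # 8000 32000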

For the ManyClientsTest the reason for the bad_allocs isn't as simple. The test limits Redpanda to 2 CPU cores and 768MB of memory on each node, so the cluster size difference won't change the test the way it does in ManyPartitionsTest. One potential cause for the bad_allocs is instead how client-swarm works: the app spawns a separate thread per producer, with each thread trying to produce a fixed number of messages as fast as possible. Hence it should be producing messages a lot quicker on the ARM is4gen.4xlarge node, which is far larger than the x86 i3en.xlarge node. This could leave the ARM cluster dealing with higher throughput than the x86 cluster. I'm currently getting some metrics from both tests to see if this is the case. If it is, then the solution may be to modify client-swarm to produce at a fixed throughput, or to limit the application to a fixed number of cores on the large ARM cluster.
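As a sketch of the "produce at a fixed throughput" option (illustrative only; client-swarm's actual interface and internals may differ):

import time

# Hypothetical sketch of a producer loop pacing itself to a fixed message rate
# instead of producing as fast as the client node allows. Not client-swarm's real code.
def produce_at_fixed_rate(produce_one, total_messages: int, messages_per_sec: float):
    interval = 1.0 / messages_per_sec
    next_send = time.monotonic()
    for _ in range(total_messages):
        now = time.monotonic()
        if now < next_send:
            time.sleep(next_send - now)  # wait for the next slot in the schedule
        produce_one()                    # send a single message (callback supplied by caller)
        next_send += interval            # fixed schedule, independent of node size

# Example: 10,000 messages at 500 msg/s regardless of how large the client node is.
# produce_at_fixed_rate(producer.send_one, total_messages=10_000, messages_per_sec=500)

With pacing like this, the offered load is set by messages_per_sec rather than by the size of the node the clients happen to run on.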

@travisdowns
Member

travisdowns commented Jan 12, 2023

Nice, thanks for this Brandon!

The solution is to ensure the ARM cluster has the same number of shards and memory per shard as the x86 cluster so that ManyPartitionsTest behaves similarly on both. I will be putting up a PR to do this soon.

This will change the cluster for all scale tests, right?

@ballard26
Contributor

ballard26 commented Jan 12, 2023

The solution is to ensure the ARM cluster has the same number of shards and memory per shard as the x86 cluster so that ManyPartitionsTest behaves similarly on both. I will be putting up a PR to do this soon.

This will change the cluster for all scale tests, right?

I will just restrict the cores/memory RP can use in the ManyPartitionsTest to match what is available in the x86 version to start off with. That won't affect any of the other scale tests. We should look into using a smaller node type for the CDT runs eventually, though. I imagine that an is4gen.2xlarge should suffice for our tests.

@ballard26
Contributor

Unfortunately, even after restricting resources on the ARM cluster, the issues in the ManyPartitionsTest weren't fixed. As @jcsp noticed earlier, the leadership balancer in the ARM cluster tries to move partition leadership off nodes that aren't currently the leader far more often than the leader balancer in the x86 cluster. Easily seen here:

[brandonallard@fedora results]$ grep -rn "failed with error: raft::errc::not_leader" latest-amd/ManyPartitionsTest/test_many_partitions | wc -l
6
[brandonallard@fedora results]$ grep -rn "failed with error: raft::errc::not_leader" latest-arm/ManyPartitionsTest/test_many_partitions | wc -l
55

The leader balancer bases its knowledge of a cluster's leadership on the partition_leaders_table, so for some reason this table is stale more often on the ARM cluster. Looking into the reason for this currently.

Another interesting observation is that the node that is restarted in the test is muted on the x86 balancer as expected, but is never muted on the ARM balancer.

latest-amd/ManyPartitionsTest/test_many_partitions/1/RedpandaService-0-140243031493984/ip-172-31-4-171/redpanda.log:1464:INFO  2023-01-12 04:07:02,885 [shard 0] cluster - leader_balancer.cc:493 - Leadership rebalancer muting node 9 last heartbeat 26735 ms
[brandonallard@fedora results]$ grep -rn "muting" latest-amd/ManyPartitionsTest/test_many_partitions | wc -l
50
[brandonallard@fedora results]$ grep -rn "muting" latest-arm/ManyPartitionsTest/test_many_partitions | wc -l
0

The leader balancer mutes nodes based on heartbeat information from the raft0 follower_stats, which could imply that this information is also more stale on ARM than on x86.
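For reference, the muting behaviour amounts to something like the check below (a paraphrase suggested by the "muting node 9 last heartbeat 26735 ms" log line; the threshold and helper are hypothetical, not the actual leader_balancer.cc logic):

# Hypothetical paraphrase of the muting check implied by the log line above.
MUTE_THRESHOLD_MS = 20_000  # assumed threshold, for illustration only

def nodes_to_mute(last_heartbeat_age_ms: dict) -> set:
    # Mute nodes whose raft0 heartbeat information is too stale to trust.
    return {node_id for node_id, age_ms in last_heartbeat_age_ms.items()
            if age_ms > MUTE_THRESHOLD_MS}

# Example: node 9 was restarted and has not heartbeated for ~26.7 s, so it is muted.
print(nodes_to_mute({7: 150, 8: 210, 9: 26_735}))  # -> {9}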

@jcsp
Contributor Author

jcsp commented Jan 12, 2023

As @jcsp noticed earlier, the leadership balancer in the ARM cluster tries to move partition leadership off nodes that aren't currently the leader far more often than the leader balancer in the x86 cluster. Easily seen here:

Agree that this is the right place to look: the MaintenanceTest failure involved leader balancer strangeness too #7428

This could lead to the ARM cluster having to deal with higher throughput than the x86 cluster. I'm currently getting some metrics from both tests to see if this is the case. If it is then the solution may be to modify the client-swarm to produce at fixed throughputs or to limit the application to a fix number of cores on the large ARM cluster.

In this situation, we need to dig in and fix redpanda to handle the load: it's okay if it can't keep up, but it shouldn't crash. Since this is a producer, the Kafka memory limit semaphore should know a priori how big a message will be, and account for it: if we're bad_alloc'ing then something is going wrong with that memory management.

(the genesis of client-swarm was to reproduce crashes that a customer saw with significant client counts: this is not a hypothetical)
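A minimal sketch of the kind of a-priori memory accounting described above (assumed shape only; not Redpanda's actual Kafka memory semaphore implementation):

import threading

# Hypothetical sketch: reserve memory units for a request before allocating for it,
# so the server sheds or delays load instead of hitting bad_alloc under pressure.
class MemoryLimiter:
    def __init__(self, capacity_bytes: int):
        self._available = capacity_bytes
        self._cond = threading.Condition()

    def reserve(self, nbytes: int):
        with self._cond:
            while self._available < nbytes:
                self._cond.wait()      # block (or reject) until memory frees up
            self._available -= nbytes

    def release(self, nbytes: int):
        with self._cond:
            self._available += nbytes
            self._cond.notify_all()

def handle_produce(limiter: MemoryLimiter, request_size_bytes: int, process):
    limiter.reserve(request_size_bytes)  # size is known up front from the request header
    try:
        process()
    finally:
        limiter.release(request_size_bytes)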

@ballard26
Contributor

A quick update: @mmaslankaprv came up with an explanation as to why the controller in the ARM tests seems to have stale information. The ARM tests had a larger number of in-flight requests compared to the x86 tests, about 1,500 in the ARM tests vs ~50 in the x86 tests. This could explain why the controller is slow to update.

As to why there are so many more in-flight requests: I've noticed that the background traffic that runs during the leadership balancing is 10x higher on the ARM cluster than on the x86 cluster, even after resources are properly limited on the RP nodes. About 420MB/s on ARM and 42MB/s on x86.

[brandonallard@fedora results]$ grep -rn "approx bandwidth" .
./latest-arm/ManyPartitionsTest/test_many_partitions/1/test_log.info:153:[INFO  - 2023-01-12 03:34:25,333 - many_partitions_test - progress_check - lineno:876]: Wait complete, approx bandwidth 423.66427656465044MB/s
./latest-arm/ManyPartitionsTest/test_many_partitions/1/test_log.debug:25670:[INFO  - 2023-01-12 03:34:25,333 - many_partitions_test - progress_check - lineno:876]: Wait complete, approx bandwidth 423.66427656465044MB/s

./latest-amd/ManyPartitionsTest/test_many_partitions/1/test_log.debug:14992:[INFO  - 2023-01-12 04:06:36,354 - many_partitions_test - progress_check - lineno:876]: Wait complete, approx bandwidth 46.84763494224079MB/s
./latest-amd/ManyPartitionsTest/test_many_partitions/1/test_log.debug:187564:[INFO  - 2023-01-12 04:09:20,430 - many_partitions_test - progress_check - lineno:876]: Wait complete, approx bandwidth 42.801016203127205MB/s

This is most likely because the kgo-repeater is running on a much larger node in the ARM tests than in the x86 tests.
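To reproduce the comparison above, something like this hypothetical helper can pull the reported bandwidth out of the test logs (the log format is as shown in the grep output; the script itself is not part of the test suite):

import re
import sys

# Hypothetical helper: extract "approx bandwidth <N>MB/s" values from a test log
# so ARM and x86 runs can be compared side by side.
BANDWIDTH_RE = re.compile(r"approx bandwidth (\d+(?:\.\d+)?)MB/s")

def bandwidths(log_path: str) -> list:
    with open(log_path) as f:
        return [float(m.group(1)) for line in f if (m := BANDWIDTH_RE.search(line))]

if __name__ == "__main__":
    for path in sys.argv[1:]:
        values = bandwidths(path)
        if values:
            print(f"{path}: avg {sum(values) / len(values):.1f} MB/s over {len(values)} samples")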

@piyushredpanda
Contributor

I see multiple issues that need chasing/fixing here:

the leadership balancer in the ARM cluster tries to move partition leadership off nodes that aren't currently the leader far more often than the leader balancer in the x86 cluster.

and:

In this situation, we need to dig in and fix redpanda to handle the load: it's okay if it can't keep up, but it shouldn't crash.

and finally to fix the test itself:

As to why there are so many more in-flight requests: I've noticed that the background traffic that runs during the leadership balancing is 10x higher on the ARM cluster than on the x86 cluster, even after resources are properly limited on the RP nodes. About 420MB/s on ARM and 42MB/s on x86.

@ballard26
Contributor

ballard26 commented Jan 18, 2023

The failures for both the ManyPartitionsTest and ManyClientsTest don't appear to be arm64 specific, but rather issues with the arm64 clusters being 4-8x the size of the amd64 clusters we are running the tests on. In both cases I've allocated an amd64 cluster that is similarly sized to the arm64 one, and in both cases the tests failed the same way on the amd64 cluster as they did on the arm64 cluster.

Running the ManyPartitionsTest on an i4i.4xlarge cluster with the memory restricted to match what's available on the is4gen.4xlarge cluster results in identical bad_allocs. So this is not an ARM-specific issue; it's just occurring on the ARM cluster since it has half the memory an amd64 cluster would have. However, with ~5GB per core we don't expect this issue to occur. I will be opening a separate issue for the investigation into why these bad_allocs are occurring.

Running the ManyClientsTest on an i4i.4xlarge cluster fails as well, in the same way it does on an is4gen.4xlarge cluster, so this isn't an ARM-specific issue either. Rather, it appears that the RP cluster (3 nodes with 2 CPUs and 768MB of memory in both cases) can't handle the increased traffic client-swarm is producing as a result of being allocated on a larger node. I will be opening a separate issue for this as well.

@dotnwat
Member

dotnwat commented Jan 19, 2023

but rather issues with the arm64 clusters being 4-8x the size of the amd64

to clarify, @ballard26, you mean 4-8x smaller?

@ballard26
Contributor

but rather issues with the arm64 clusters being 4-8x the size of the amd64

to clarify, @ballard26, you mean 4-8x smaller?

The arm64 clusters are 4-8x larger than the amd64 clusters. The amd64 cluster is 4x smaller than the arm64 cluster in terms of pure core count. However, since the core count on amd64 clusters includes hyperthreads, it could be up to 8x smaller depending on how much you consider two hyperthreads on the same core to perform like two distinct cores.

@dotnwat
Member

dotnwat commented Jan 19, 2023

@ballard26 ok so is it fair to then say that the Mem/Core ratio on ARM is 4-8x smaller compared to x86?

@ballard26
Contributor

ballard26 commented Jan 19, 2023

@ballard26 ok so is it fair to then say that the Mem/Core ratio on ARM is 4-8x smaller compared to x86?

So on AWS, for storage-optimized instance types, the is4gen ARM instance types that we use always have 2GB less memory per core than the i4i and i3en instance types. E.g., i4i.xlarge has 4 CPUs and 32GiB of memory, i3en.xlarge has 4 CPUs and 32GiB of memory, and the ARM instance is4gen.xlarge has 4 CPUs and 24GiB of memory. Basically 8GB per core on x86 and 6GB per core on ARM for any instance larger than large.
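The per-core arithmetic, using the instance specs quoted above (a sketch for reference only):

# Memory-per-vCPU arithmetic from the instance specs listed in this comment.
instances = {
    "i4i.xlarge":    {"vcpus": 4, "mem_gib": 32},  # x86
    "i3en.xlarge":   {"vcpus": 4, "mem_gib": 32},  # x86
    "is4gen.xlarge": {"vcpus": 4, "mem_gib": 24},  # ARM
}

for name, spec in instances.items():
    per_core = spec["mem_gib"] / spec["vcpus"]
    print(f"{name}: {per_core:.0f} GiB per vCPU")
# i4i.xlarge and i3en.xlarge: 8 GiB per vCPU; is4gen.xlarge: 6 GiB per vCPU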

@ballard26
Contributor

ballard26 commented Jan 19, 2023

The 4-8x was in reference to the CPU count of the clusters we run the CDT nightly on: 12x i3en.xlarge for the x86 CDT nightly and 12x is4gen.4xlarge for the ARM CDT nightly. One bit of work should be to reduce the size of the ARM cluster we use. Sorry for the ambiguity.

@jcsp jcsp changed the title ARM: scale tests require more resources than nodes have ARM: scale tests require more resources than nodes have (ManyPartitionsTest.test_many_partitions, ManyClientsTest.test_many_clients) Jan 20, 2023
@andijcr
Contributor

andijcr commented Jan 23, 2023

Could this failure be in the same family?
https://buildkite.com/redpanda/vtools/builds/5396#0185db17-163b-49ec-ace4-b4238606be02

FAIL test: ManyPartitionsTest.test_many_partitions_compacted (1/3 runs)
  failure at 2023-01-23T04:13:45.139Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/5396#0185db17-163b-49ec-ace4-b4238606be02

@rystsov
Contributor

rystsov commented Jan 26, 2023

https://buildkite.com/redpanda/vtools/builds/5484#0185ea8f-6d91-4f53-94ca-348ecb773302

FAIL test: ManyClientsTest.test_many_clients (1/2 runs)
  failure at 2023-01-26T03:42:54.398Z: <BadLogLines nodes=ip-172-31-14-215(1) example="ERROR 2023-01-26 01:01:45,070 [shard 0] seastar - Failed to allocate 131072 bytes">
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/5484#0185ea8f-6d91-4f53-94ca-348ecb773302

@andijcr
Contributor

andijcr commented Jan 27, 2023

https://buildkite.com/redpanda/vtools/builds/5500#0185efb8-35af-4ec9-8e11-2d6d69c3d310

FAIL test: ManyClientsTest.test_many_clients (1/2 runs)
  failure at 2023-01-27T03:54:39.906Z: <BadLogLines nodes=ip-172-31-7-211(1) example="ERROR 2023-01-27 01:20:03,830 [shard 1] seastar - Failed to allocate 66432 bytes">
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/5500#0185efb8-35af-4ec9-8e11-2d6d69c3d310

@VladLazar
Contributor

VladLazar commented Jan 27, 2023

I've also seen this fail in my Azure CDT runs fairly reliably (same failure mode). I'm using Standard_L8s_v3 nodes for Redpanda and Standard_D4ds_v4 for the client.

@dlex
Contributor

dlex commented Mar 7, 2023

The AssertionError('Unable to determine group within set number of attempts') is happening in both amd64 and arm64 in CDT:

FAIL test: ManyPartitionsTest.test_many_partitions (6/9 runs)

FAIL test: ManyPartitionsTest.test_many_partitions_compacted (7/9 runs)

@ztlpn
Contributor

ztlpn commented Mar 8, 2023

some more

FAIL test: ManyPartitionsTest.test_many_partitions (1/2 runs)
  failure at 2023-03-07T02:19:15.819Z: AssertionError('Unable to determine group within set number of attempts')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/6564#0186b888-c751-425d-b740-66b8de3fee24
FAIL test: ManyPartitionsTest.test_many_partitions_compacted (1/2 runs)
  failure at 2023-03-07T02:19:15.819Z: AssertionError('Unable to determine group within set number of attempts')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/6564#0186b888-c751-425d-b740-66b8de3fee24

@jcsp
Contributor Author

jcsp commented Mar 8, 2023

Those most recent reports were the issue fixed by #9257

@jcsp
Contributor Author

jcsp commented Mar 16, 2023

ARM tests are okay now - green run from last night here https://buildkite.com/redpanda/vtools/builds/6732#0186e6ee-35c5-4ca7-a43b-6a2c8eb474ce

@jcsp jcsp closed this as completed Mar 16, 2023