ARM: scale tests require more resources than nodes have (ManyPartitionsTest.test_many_partitions, ManyClientsTest.test_many_clients) #7405

Closed
jcsp opened this issue Nov 21, 2022 · 28 comments

@jcsp
Contributor

jcsp commented Nov 21, 2022

i3en.xlarge has 8GB per vCPU

Is4gen.4xlarge has 6GB per vCPU

Our scale tests do not pass reliably on the weaker arm nodes.

FAIL test: ManyPartitionsTest.test_many_partitions (2/3 runs)
  failure at 2022-11-20T03:48:19.569Z: RpkException('command /opt/redpanda/bin/rpk topic --brokers ip-172-31-43-14:9092,ip-172-31-32-247:9092,ip-172-31-36-138:9092,ip-172-31-46-34:9092,ip-172-31-36-63:9092,ip-172-31-42-204:9092,ip-172-31-43-72:9092,ip-172-31-40-33:9092,ip-172-31-47-0:9092 describe scale_000000 -p timed out')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4319#0184917c-c179-4d57-be92-563f4eb9f9c5
  failure at 2022-11-21T03:21:08.085Z: RpkException('command /opt/redpanda/bin/rpk topic --brokers ip-172-31-39-61:9092,ip-172-31-44-233:9092,ip-172-31-41-134:9092,ip-172-31-43-92:9092,ip-172-31-43-95:9092,ip-172-31-33-240:9092,ip-172-31-45-166:9092,ip-172-31-44-245:9092,ip-172-31-44-142:9092 describe scale_000000 -p timed out')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4324#018496a2-e835-4a1e-b9a1-af52a6a95f3b
FAIL test: ManyPartitionsTest.test_many_partitions_compacted (2/3 runs)
  failure at 2022-11-20T03:48:19.569Z: <BadLogLines nodes=ip-172-31-36-63(1) example="ERROR 2022-11-19 20:52:54,026 [shard  0] seastar - Failed to allocate 7340032 bytes">
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4319#0184917c-c179-4d57-be92-563f4eb9f9c5
  failure at 2022-11-21T03:21:08.085Z: RpkException('command /opt/redpanda/bin/rpk topic --brokers ip-172-31-44-245:9092,ip-172-31-43-92:9092,ip-172-31-39-61:9092,ip-172-31-43-95:9092,ip-172-31-45-166:9092,ip-172-31-44-233:9092,ip-172-31-33-240:9092,ip-172-31-41-134:9092,ip-172-31-44-142:9092 describe scale_000000 -p timed out')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4324#018496a2-e835-4a1e-b9a1-af52a6a95f3b
FAIL test: ManyClientsTest.test_many_clients (2/3 runs)
  failure at 2022-11-20T03:48:19.569Z: <BadLogLines nodes=ip-172-31-42-204(1) example="ERROR 2022-11-20 01:46:40,286 [shard 1] seastar - Failed to allocate 131072 bytes">
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4319#0184917c-c179-4d57-be92-563f4eb9f9c5
  failure at 2022-11-21T03:21:08.085Z: <BadLogLines nodes=ip-172-31-39-61(1) example="ERROR 2022-11-21 01:46:02,937 [shard 0] seastar - Failed to allocate 131072 bytes">
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4324#018496a2-e835-4a1e-b9a1-af52a6a95f3b
@jcsp
Contributor Author

jcsp commented Nov 22, 2022

I tried ManyPartitionsTest.test_many_partitions with PARTITIONS_PER_SHARD = 100 on is4gen.4xlarge, which is what the nightly runs use. In that instance it does get past the timeout while creating the topics, but then fails in _single_node_restart while waiting for a restarted node to regain leaderships: the leader balancer keeps trying to move leaderships but gets raft::errc::not_leader errors.

@bharathv
Contributor

bharathv commented Dec 20, 2022

Another instance..

https://buildkite.com/redpanda/vtools/builds/4730#018526d6-f604-42f0-b676-a71840fdf989

FAIL test: ManyClientsTest.test_many_clients (1/2 runs)
  failure at 2022-12-19T05:20:39.021Z: <BadLogLines nodes=ip-172-31-34-1(1) example="ERROR 2022-12-19 02:32:02,945 [shard 0] seastar - Failed to allocate 66432 bytes">
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4730#018526d6-f604-42f0-b676-a71840fdf989

@bharathv
Contributor

https://buildkite.com/redpanda/vtools/builds/4788#01853123-39d8-452b-b909-9c057db48f19

FAIL test: ManyPartitionsTest.test_many_partitions (1/3 runs)
  failure at 2022-12-21T05:00:23.156Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4788#01853123-39d8-452b-b909-9c057db48f19

FAIL test: ManyPartitionsTest.test_many_partitions_compacted (1/3 runs)
  failure at 2022-12-21T05:00:23.156Z: AssertionError('Unable to determine group within set number of attempts')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4788#01853123-39d8-452b-b909-9c057db48f19

@BenPope
Member

BenPope commented Dec 30, 2022

I'm tempted to group all of these in here:

FAIL test: ManyClientsTest.test_many_clients (1/14 runs)
failure at 2022-12-29T04:31:38.661Z: <BadLogLines nodes=ip-172-31-36-118(1) example="ERROR 2022-12-29 02:11:31,581 [shard 1] seastar - Failed to allocate 131072 bytes">
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4923#01855a56-3e54-4997-9a52-fa50cb839888
FAIL test: ManyPartitionsTest.test_many_partitions (7/14 runs)
failure at 2022-12-30T04:28:17.969Z: TimeoutError('Redpanda service ip-172-31-47-43 failed to start within 60 sec')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4928#01855f7c-4001-4ed2-90cc-e7a1398b565a
failure at 2022-12-29T04:31:38.661Z: TimeoutError('Redpanda service ip-172-31-46-161 failed to start within 60 sec')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4923#01855a56-3e54-4997-9a52-fa50cb839888
failure at 2022-12-28T04:28:12.819Z: AssertionError('Unable to determine group within set number of attempts')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4918#0185552f-c829-43a8-9d3b-b74db0421f55
failure at 2022-12-27T04:39:59.859Z: <BadLogLines nodes=ip-172-31-41-124(1),ip-172-31-45-245(1),ip-172-31-39-189(1) example="ERROR 2022-12-26 20:33:40,577 [shard 0] seastar - Failed to allocate 4870880 bytes">
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4913#0185500b-bdaa-4cb0-b640-31259ecac3be
failure at 2022-12-26T04:32:21.108Z: TimeoutError('Redpanda service ip-172-31-45-39 failed to start within 60 sec')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4908#01854ae2-7717-429d-85ac-87d26f4e421a
failure at 2022-12-24T04:13:23.294Z: <BadLogLines nodes=ip-172-31-43-39(1) example="ERROR 2022-12-23 20:21:14,100 [shard 10] rpc - server.cc:119 - Error[applying protocol] remote address: 172.31.32.149:63354 - seastar::broken_promise (broken promise)">
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4891#01854096-20d5-401c-b8b1-94e4d21e8e45
failure at 2022-12-25T04:20:18.252Z: TimeoutError('Redpanda service ip-172-31-37-74 failed to start within 60 sec')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4898#018545bd-095d-4b37-b549-0173b1d272c1
FAIL test: ManyPartitionsTest.test_many_partitions_compacted (7/14 runs)
failure at 2022-12-30T04:28:17.969Z: TimeoutError('')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4928#01855f7c-4001-4ed2-90cc-e7a1398b565a
failure at 2022-12-29T04:31:38.661Z: TimeoutError('')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4923#01855a56-3e54-4997-9a52-fa50cb839888
failure at 2022-12-28T04:28:12.819Z: <BadLogLines nodes=ip-172-31-37-6(1) example="ERROR 2022-12-27 21:01:46,318 [shard 0] seastar - Failed to allocate 4870880 bytes">
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4918#0185552f-c829-43a8-9d3b-b74db0421f55
failure at 2022-12-27T04:39:59.859Z: <BadLogLines nodes=ip-172-31-39-189(1) example="ERROR 2022-12-26 20:52:41,574 [shard 11] rpc - server.cc:119 - Error[applying protocol] remote address: 172.31.42.51:64603 - seastar::broken_promise (broken promise)">
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4913#0185500b-bdaa-4cb0-b640-31259ecac3be
failure at 2022-12-26T04:32:21.108Z: TimeoutError('')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4908#01854ae2-7717-429d-85ac-87d26f4e421a
failure at 2022-12-24T04:13:23.294Z: TimeoutError('')
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4891#01854096-20d5-401c-b8b1-94e4d21e8e45
failure at 2022-12-25T04:20:18.252Z: <NodeCrash ip-172-31-46-176: Redpanda process unexpectedly stopped>
on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/4898#018545bd-095d-4b37-b549-0173b1d272c1

The last one, Redpanda process unexpectedly stopped, is due to Failed to allocate 4870880 bytes during shutdown. It also contains Semaphore timed out: raft/connected:

WARN  2022-12-24 20:44:09,721 [shard  1] seastar - Exceptional future ignored: seastar::named_semaphore_timed_out (Semaphore timed out: raft/connected), backtrace: 0x4b7abe7 0x48c08fb 0x1f24867 0x496c637 0x496f49f 0x49a7a73 0x491a017 /opt/redpanda/lib/libc.so.6+0x843b7 /opt/redpanda/lib/libc.so.6+0xef2db
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void> seastar::future<void>::handle_exception_type<auto ssx::spawn_with_gate_then<raft::consensus::linearizable_barrier()::$_49>(seastar::gate&, raft::consensus::linearizable_barrier()::$_49&&)::'lambda'(seastar::broken_condition_variable const&)>(raft::consensus::linearizable_barrier()::$_49&&)::'lambda'(raft::consensus::linearizable_barrier()::$_49&&), seastar::futurize<raft::consensus::linearizable_barrier()::$_49>::type seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::future<void> seastar::future<void>::handle_exception_type<auto ssx::spawn_with_gate_then<raft::consensus::linearizable_barrier()::$_49>(seastar::gate&, raft::consensus::linearizable_barrier()::$_49&&)::'lambda'(seastar::broken_condition_variable const&)>(raft::consensus::linearizable_barrier()::$_49&&)::'lambda'(raft::consensus::linearizable_barrier()::$_49&&)>(seastar::future<void> seastar::future<void>::handle_exception_type<auto ssx::spawn_with_gate_then<raft::consensus::linearizable_barrier()::$_49>(seastar::gate&, raft::consensus::linearizable_barrier()::$_49&&)::'lambda'(seastar::broken_condition_variable const&)>(raft::consensus::linearizable_barrier()::$_49&&)::'lambda'(raft::consensus::linearizable_barrier()::$_49&&)&&)::'lambda'(seastar::internal::promise_base_with_type<void>&&, seastar::future<void> seastar::future<void>::handle_exception_type<auto ssx::spawn_with_gate_then<raft::consensus::linearizable_barrier()::$_49>(seastar::gate&, raft::consensus::linearizable_barrier()::$_49&&)::'lambda'(seastar::broken_condition_variable const&)>(raft::consensus::linearizable_barrier()::$_49&&)::'lambda'(raft::consensus::linearizable_barrier()::$_49&&)&, seastar::future_state<seastar::internal::monostate>&&), void>

@dotnwat
Member

dotnwat commented Dec 30, 2022

@BenPope do you think we should break out the ignored exceptional future into a separate item? Presumably that will exist independently of this generic ARM resource issue?

@BenPope
Member

BenPope commented Dec 30, 2022

I grouped it here as evidence of resource starvation, but yes, it should probably be addressed separately.

@dotnwat
Member

dotnwat commented Dec 30, 2022

I grouped it here as evidence of resource starvation, but yes, it should probably be addressed separately.

Got it. Just wanted to make sure we don't lose track of the ignored future, since even if we fix the root-cause resource issue behind the failures, the ignored future would still exist.

@ballard26
Contributor

Currently we are running nightly CDT tests on a 6x i3en.xlarge cluster for x86 and a 6x is4gen.4xlarge cluster for ARM. This means the ARM cluster is roughly 4-8x the size of the x86 cluster, depending on how much of a performance impact you attribute to hyperthreads counting as vCPUs on the x86 cluster.

Tests in ManyPartitionsTest scale according to the total number of shards in a cluster. This means the test uses ~32,000 partitions on the ARM cluster vs ~8,000 partitions on the x86 cluster. Combined with the lower memory per shard on the ARM cluster, it's pretty likely this is the issue. The solution is to ensure the ARM cluster has the same number of shards and memory per shard as the x86 cluster so that ManyPartitionsTest behaves similarly on both. I will be putting up a PR to do this soon.
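For illustration, a rough back-of-the-envelope of how the partition count scales with total shards (a sketch only; PARTITIONS_PER_SHARD = 100 and the ~8,000/~32,000 figures come from this thread, while the shard counts below are illustrative, not the exact nightly configuration):

# Hypothetical sketch of how the test's partition count scales with shard count.
PARTITIONS_PER_SHARD = 100  # figure mentioned earlier in this thread

def total_partitions(total_shards: int) -> int:
    # The scale test sizes its topics from the cluster's total shard count.
    return total_shards * PARTITIONS_PER_SHARD

x86_shards = 80   # illustrative: yields the ~8,000 partitions seen on x86
arm_shards = 320  # illustrative: yields the ~32,000 partitions seen on ARM

print(total_partitions(x86_shards), total_partitions(arm_shards))  # 8000 32000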

For the ManyClientsTest the reason for the bad_allocs isn't as simple. The test limits Redpanda to 2 CPU cores and 768MB of memory on each node, so the cluster size difference won't change the test the way it does in ManyPartitionsTest. One potential cause for the bad_allocs is instead how client-swarm works: the app spawns a separate thread per producer, with each thread trying to produce a fixed number of messages as fast as possible. Hence it should be producing messages a lot quicker on the ARM is4gen.4xlarge node, which is far larger than the x86 i3en.xlarge node. This could leave the ARM cluster dealing with higher throughput than the x86 cluster. I'm currently getting some metrics from both tests to see if this is the case. If it is, then the solution may be to modify client-swarm to produce at a fixed throughput, or to limit the application to a fixed number of cores on the large ARM cluster.
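As a sketch of the "produce at a fixed throughput" option (illustrative only; client-swarm's actual interface and internals may differ):

import time

# Hypothetical sketch of a producer loop pacing itself to a fixed message rate
# instead of producing as fast as the client node allows. Not client-swarm's real code.
def produce_at_fixed_rate(produce_one, total_messages: int, messages_per_sec: float):
    interval = 1.0 / messages_per_sec
    next_send = time.monotonic()
    for _ in range(total_messages):
        now = time.monotonic()
        if now < next_send:
            time.sleep(next_send - now)  # wait for the next slot in the schedule
        produce_one()                    # send a single message (callback supplied by caller)
        next_send += interval            # fixed schedule, independent of node size

# Example: 10,000 messages at 500 msg/s regardless of how large the client node is.
# produce_at_fixed_rate(producer.send_one, total_messages=10_000, messages_per_sec=500)

With pacing like this, the offered load is set by messages_per_sec rather than by the size of the node the clients happen to run on.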

@travisdowns
Member

travisdowns commented Jan 12, 2023

Nice, thanks for this Brandon!

The solution is to ensure the ARM cluster has the same number of shards and memory per shard as the x86 cluster so that ManyPartitionsTest behaves similarly on both. I will be putting up a PR to do this soon.

This will change the cluster for all scale tests, right?

@ballard26
Contributor

ballard26 commented Jan 12, 2023

The solution is to ensure the ARM cluster has the same number of shards and memory per shard as the x86 cluster so that ManyPartitionsTest behaves similarly on both. I will be putting up a PR to do this soon.

This will change the cluster for all scale tests, right?

I will just restrict the cores/memory RP can use in the ManyPartitionsTest to match what is available in the x86 version to start off with. That won't affect any of the other scale tests. We should look into using a smaller node type for the CDT runs eventually, though. I imagine that an is4gen.2xlarge should suffice for our tests.

@ballard26
Contributor

Unfortunately, even after restricting resources on the ARM cluster, the issues in the ManyPartitionsTest weren't fixed. As @jcsp noticed earlier, the leadership balancer in the ARM cluster tries to move partition leadership off nodes that aren't currently the leader far more often than the leader balancer in the x86 cluster. Easily seen here:

[brandonallard@fedora results]$ grep -rn "failed with error: raft::errc::not_leader" latest-amd/ManyPartitionsTest/test_many_partitions | wc -l
6
[brandonallard@fedora results]$ grep -rn "failed with error: raft::errc::not_leader" latest-arm/ManyPartitionsTest/test_many_partitions | wc -l
55

The leader balancer bases its knowledge of a cluster's leadership on the partition_leaders_table, so for some reason this table is stale more often on the ARM cluster. Looking into the reason for this currently.

Another interesting observation is that the node that is restarted in the test is muted on the x86 balancer as expected, but is never muted on the ARM balancer.

latest-amd/ManyPartitionsTest/test_many_partitions/1/RedpandaService-0-140243031493984/ip-172-31-4-171/redpanda.log:1464:INFO  2023-01-12 04:07:02,885 [shard 0] cluster - leader_balancer.cc:493 - Leadership rebalancer muting node 9 last heartbeat 26735 ms
[brandonallard@fedora results]$ grep -rn "muting" latest-amd/ManyPartitionsTest/test_many_partitions | wc -l
50
[brandonallard@fedora results]$ grep -rn "muting" latest-arm/ManyPartitionsTest/test_many_partitions | wc -l
0

The leader balancer mutes nodes based on heartbeat information from the raft0 follower_stats, which could imply that this information is also more stale on ARM than on x86.
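For reference, the muting behaviour amounts to something like the check below (a paraphrase suggested by the "muting node 9 last heartbeat 26735 ms" log line; the threshold and helper are hypothetical, not the actual leader_balancer.cc logic):

# Hypothetical paraphrase of the muting check implied by the log line above.
MUTE_THRESHOLD_MS = 20_000  # assumed threshold, for illustration only

def nodes_to_mute(last_heartbeat_age_ms: dict) -> set:
    # Mute nodes whose raft0 heartbeat information is too stale to trust.
    return {node_id for node_id, age_ms in last_heartbeat_age_ms.items()
            if age_ms > MUTE_THRESHOLD_MS}

# Example: node 9 was restarted and has not heartbeated for ~26.7 s, so it is muted.
print(nodes_to_mute({7: 150, 8: 210, 9: 26_735}))  # -> {9}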

@jcsp
Contributor Author

jcsp commented Jan 12, 2023

As @jcsp noticed earlier, the leadership balancer in the ARM cluster tries to move partition leadership off nodes that aren't currently the leader far more often than the leader balancer in the x86 cluster. Easily seen here:

Agree that this is the right place to look: the MaintenanceTest failure involved leader balancer strangeness too #7428

This could lead to the ARM cluster having to deal with higher throughput than the x86 cluster. I'm currently getting some metrics from both tests to see if this is the case. If it is then the solution may be to modify the client-swarm to produce at fixed throughputs or to limit the application to a fix number of cores on the large ARM cluster.

In this situation, we need to dig in and fix redpanda to handle the load: it's okay if it can't keep up, but it shouldn't crash. Since this is a producer, the Kafka memory limit semaphore should know a priori how big a message will be, and account for it: if we're bad_alloc'ing then something is going wrong with that memory management.

(the genesis of client-swarm was to reproduce crashes that a customer saw with significant client counts: this is not a hypothetical)
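A minimal sketch of the kind of a-priori memory accounting described above (assumed shape only; not Redpanda's actual Kafka memory semaphore implementation):

import threading

# Hypothetical sketch: reserve memory units for a request before allocating for it,
# so the server sheds or delays load instead of hitting bad_alloc under pressure.
class MemoryLimiter:
    def __init__(self, capacity_bytes: int):
        self._available = capacity_bytes
        self._cond = threading.Condition()

    def reserve(self, nbytes: int):
        with self._cond:
            while self._available < nbytes:
                self._cond.wait()      # block (or reject) until memory frees up
            self._available -= nbytes

    def release(self, nbytes: int):
        with self._cond:
            self._available += nbytes
            self._cond.notify_all()

def handle_produce(limiter: MemoryLimiter, request_size_bytes: int, process):
    limiter.reserve(request_size_bytes)  # size is known up front from the request header
    try:
        process()
    finally:
        limiter.release(request_size_bytes)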

@ballard26
Contributor

A quick update: @mmaslankaprv came up with an explanation as to why the controller in the ARM tests seems to have stale information. The ARM tests had a larger number of in-flight requests compared to the x86 tests, about 1,500 in the ARM tests vs ~50 in the x86 tests. This could explain why the controller is slow to update.

As to why there are so many more in-flight requests: I've noticed that the background traffic that runs during the leadership balancing is 10x higher on the ARM cluster than on the x86 cluster, even after resources are properly limited on the RP nodes. About 420MB/s on ARM and 42MB/s on x86.

[brandonallard@fedora results]$ grep -rn "approx bandwidth" .
./latest-arm/ManyPartitionsTest/test_many_partitions/1/test_log.info:153:[INFO  - 2023-01-12 03:34:25,333 - many_partitions_test - progress_check - lineno:876]: Wait complete, approx bandwidth 423.66427656465044MB/s
./latest-arm/ManyPartitionsTest/test_many_partitions/1/test_log.debug:25670:[INFO  - 2023-01-12 03:34:25,333 - many_partitions_test - progress_check - lineno:876]: Wait complete, approx bandwidth 423.66427656465044MB/s

./latest-amd/ManyPartitionsTest/test_many_partitions/1/test_log.debug:14992:[INFO  - 2023-01-12 04:06:36,354 - many_partitions_test - progress_check - lineno:876]: Wait complete, approx bandwidth 46.84763494224079MB/s
./latest-amd/ManyPartitionsTest/test_many_partitions/1/test_log.debug:187564:[INFO  - 2023-01-12 04:09:20,430 - many_partitions_test - progress_check - lineno:876]: Wait complete, approx bandwidth 42.801016203127205MB/s

This is most likely because the kgo-repeater is running on a much larger node in the ARM tests than in the x86 tests.
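To reproduce the comparison above, something like this hypothetical helper can pull the reported bandwidth out of the test logs (the log format is as shown in the grep output; the script itself is not part of the test suite):

import re
import sys

# Hypothetical helper: extract "approx bandwidth <N>MB/s" values from a test log
# so ARM and x86 runs can be compared side by side.
BANDWIDTH_RE = re.compile(r"approx bandwidth (\d+(?:\.\d+)?)MB/s")

def bandwidths(log_path: str) -> list:
    with open(log_path) as f:
        return [float(m.group(1)) for line in f if (m := BANDWIDTH_RE.search(line))]

if __name__ == "__main__":
    for path in sys.argv[1:]:
        values = bandwidths(path)
        if values:
            print(f"{path}: avg {sum(values) / len(values):.1f} MB/s over {len(values)} samples")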

@piyushredpanda
Contributor

I see multiple issues that need chasing/fixing here:

the leadership balancer in the ARM cluster tries to move partition leadership off nodes that aren't currently the leader far more often than the leader balancer in the x86 cluster.

and:

In this situation, we need to dig in and fix redpanda to handle the load: it's okay if it can't keep up, but it shouldn't crash.

and finally to fix the test itself:

As to why there are so many more in-flight requests: I've noticed that the background traffic that runs during the leadership balancing is 10x higher on the ARM cluster than on the x86 cluster, even after resources are properly limited on the RP nodes. About 420MB/s on ARM and 42MB/s on x86.

@ballard26
Contributor

ballard26 commented Jan 18, 2023

The failures for both the ManyPartitionsTest and ManyClientsTest don't appear to be arm64 specific, but rather issues with the arm64 clusters being 4-8x the size of the amd64 clusters we are running the tests on. In both cases I've allocated an amd64 cluster that is similarly sized to the arm64 one, and in both cases the tests failed the same way on the amd64 cluster as they did on the arm64 cluster.

Running the ManyPartitionsTest on an i4i.4xlarge cluster with the memory restricted to match what's available on the is4gen.4xlarge cluster results in identical bad_allocs. So this is not an ARM-specific issue; it's just occurring on the ARM cluster since it has half the memory an amd64 cluster would have. However, with ~5GB per core we don't expect this issue to occur. I will be opening a separate issue for the investigation into why these bad_allocs are occurring.

Running the ManyClientsTest on an i4i.4xlarge cluster fails as well, in the same way it does on an is4gen.4xlarge cluster, so this isn't an ARM-specific issue either. Rather, it appears that the RP cluster (3 nodes with 2 CPUs and 768MB of memory in both cases) can't handle the increased traffic client-swarm is producing as a result of being allocated on a larger node. I will be opening a separate issue for this as well.

@dotnwat
Member

dotnwat commented Jan 19, 2023

but rather issues with the arm64 clusters being 4-8x the size of the amd64

to clarify, @ballard26, you mean 4-8x smaller?

@ballard26
Contributor

but rather issues with the arm64 clusters being 4-8x the size of the amd64

to clarify, @ballard26, you mean 4-8x smaller?

The arm64 clusters are 4-8x larger than the amd64 clusters. The amd64 cluster is 4x smaller than the arm64 cluster in terms of pure core count. However, since the core count on amd64 clusters includes hyperthreads, it could be up to 8x smaller depending on how much you consider two hyperthreads on the same core to perform like two distinct cores.

@dotnwat
Member

dotnwat commented Jan 19, 2023

@ballard26 ok so is it fair to then say that the Mem/Core ratio on ARM is 4-8x smaller compared to x86?

@ballard26
Contributor

ballard26 commented Jan 19, 2023

@ballard26 ok so is it fair to then say that the Mem/Core ratio on ARM is 4-8x smaller compared to x86?

So on AWS, for storage-optimized instance types, the is4gen ARM instance types that we use always have 2GB less memory per core than the i4i and i3en instance types. E.g., i4i.xlarge has 4 CPUs and 32GiB of memory, i3en.xlarge has 4 CPUs and 32GiB of memory, and the ARM instance is4gen.xlarge has 4 CPUs and 24GiB of memory. Basically 8GB per core on x86 and 6GB per core on ARM for any instance larger than large.
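The per-core arithmetic, using the instance specs quoted above (a sketch for reference only):

# Memory-per-vCPU arithmetic from the instance specs listed in this comment.
instances = {
    "i4i.xlarge":    {"vcpus": 4, "mem_gib": 32},  # x86
    "i3en.xlarge":   {"vcpus": 4, "mem_gib": 32},  # x86
    "is4gen.xlarge": {"vcpus": 4, "mem_gib": 24},  # ARM
}

for name, spec in instances.items():
    per_core = spec["mem_gib"] / spec["vcpus"]
    print(f"{name}: {per_core:.0f} GiB per vCPU")
# i4i.xlarge and i3en.xlarge: 8 GiB per vCPU; is4gen.xlarge: 6 GiB per vCPU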

@ballard26
Contributor

ballard26 commented Jan 19, 2023

The 4-8x was in reference to the CPU count of the clusters we run the CDT nightly on: 12x i3en.xlarge for the x86 CDT nightly and 12x is4gen.4xlarge for the ARM CDT nightly. One bit of work should be to reduce the size of the ARM cluster we use. Sorry for the ambiguity.

@jcsp jcsp changed the title ARM: scale tests require more resources than nodes have ARM: scale tests require more resources than nodes have (ManyPartitionsTest.test_many_partitions, ManyClientsTest.test_many_clients) Jan 20, 2023
@andijcr
Contributor

andijcr commented Jan 23, 2023

Could this failure be in the same family?
https://buildkite.com/redpanda/vtools/builds/5396#0185db17-163b-49ec-ace4-b4238606be02

FAIL test: ManyPartitionsTest.test_many_partitions_compacted (1/3 runs)
  failure at 2023-01-23T04:13:45.139Z: TimeoutError('')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/5396#0185db17-163b-49ec-ace4-b4238606be02

@rystsov
Contributor

rystsov commented Jan 26, 2023

https://buildkite.com/redpanda/vtools/builds/5484#0185ea8f-6d91-4f53-94ca-348ecb773302

FAIL test: ManyClientsTest.test_many_clients (1/2 runs)
  failure at 2023-01-26T03:42:54.398Z: <BadLogLines nodes=ip-172-31-14-215(1) example="ERROR 2023-01-26 01:01:45,070 [shard 0] seastar - Failed to allocate 131072 bytes">
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/5484#0185ea8f-6d91-4f53-94ca-348ecb773302

@andijcr
Contributor

andijcr commented Jan 27, 2023

https://buildkite.com/redpanda/vtools/builds/5500#0185efb8-35af-4ec9-8e11-2d6d69c3d310

FAIL test: ManyClientsTest.test_many_clients (1/2 runs)
  failure at 2023-01-27T03:54:39.906Z: <BadLogLines nodes=ip-172-31-7-211(1) example="ERROR 2023-01-27 01:20:03,830 [shard 1] seastar - Failed to allocate 66432 bytes">
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/5500#0185efb8-35af-4ec9-8e11-2d6d69c3d310

@VladLazar
Contributor

VladLazar commented Jan 27, 2023

I've also seen this fail in my Azure CDT runs fairly reliably (same failure mode). I'm using Standard_L8s_v3 nodes for Redpanda and Standard_D4ds_v4 for the client.

@dlex
Contributor

dlex commented Mar 7, 2023

The AssertionError('Unable to determine group within set number of attempts') is happening in both amd64 and arm64 in CDT:

FAIL test: ManyPartitionsTest.test_many_partitions (6/9 runs)

FAIL test: ManyPartitionsTest.test_many_partitions_compacted (7/9 runs)

@ztlpn
Contributor

ztlpn commented Mar 8, 2023

some more

FAIL test: ManyPartitionsTest.test_many_partitions (1/2 runs)
  failure at 2023-03-07T02:19:15.819Z: AssertionError('Unable to determine group within set number of attempts')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/6564#0186b888-c751-425d-b740-66b8de3fee24
FAIL test: ManyPartitionsTest.test_many_partitions_compacted (1/2 runs)
  failure at 2023-03-07T02:19:15.819Z: AssertionError('Unable to determine group within set number of attempts')
      on (arm64, VM) in job https://buildkite.com/redpanda/vtools/builds/6564#0186b888-c751-425d-b740-66b8de3fee24

@jcsp
Contributor Author

jcsp commented Mar 8, 2023

Those most recent reports were the issue fixed by #9257

@jcsp
Contributor Author

jcsp commented Mar 16, 2023

ARM tests are okay now - green run from last night here https://buildkite.com/redpanda/vtools/builds/6732#0186e6ee-35c5-4ca7-a43b-6a2c8eb474ce

@jcsp jcsp closed this as completed Mar 16, 2023