ARM: scale tests require more resources than nodes have (ManyPartitionsTest.test_many_partitions, ManyClientsTest.test_many_clients) #7405
Comments
I tried out ManyPartitionsTest.test_many_partitions with …
Another instance: https://buildkite.com/redpanda/vtools/builds/4730#018526d6-f604-42f0-b676-a71840fdf989
https://buildkite.com/redpanda/vtools/builds/4788#01853123-39d8-452b-b909-9c057db48f19
I'm tempted to group all of these in here: FAIL test: ManyClientsTest.test_many_clients (1/14 runs). The last one, …
@BenPope do you think we should break out the ignored exceptional future into a separate item? Presumably that will exist independently of this generic ARM resource issue?
I grouped it here as evidence of resource starvation, but yes, it should probably be addressed separately.
Got it. I just wanted to make sure we don't lose track of the ignored future, since even if we fixed some resource-issue root cause for the failures, the ignored future would still exist.
Currently we are running nightly CDT tests on a 6x … Tests in the … For the …
Nice, thanks for this, Brandon!
This will change the cluster for all scale tests, right?
I will just restrict the cores/memory RP can use in the …
Unfortunately, even after restricting resources on the ARM cluster, the issues in the …
The leader balancer bases its knowledge of a cluster's leadership on the … Another interesting observation is that the node that is restarted in the test is muted by the x86 balancer, as expected, but is never muted by the ARM balancer.
The leader balancer mutes nodes based on heartbeat information from the raft0 …
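For illustration, here is a minimal sketch of the muting idea described above (my own simplified model, not Redpanda's actual leader balancer code): a node whose last raft0 heartbeat is older than some staleness threshold gets muted, so the balancer stops acting on a node it has no fresh view of.

```cpp
// Hypothetical sketch of heartbeat-based muting; names and thresholds are
// illustrative, not taken from the Redpanda source.
#include <chrono>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

using node_id = int32_t;
using hb_clock = std::chrono::steady_clock;

struct heartbeat_view {
    // Last time a raft0 heartbeat was observed from each node.
    std::unordered_map<node_id, hb_clock::time_point> last_heartbeat;
};

std::unordered_set<node_id> muted_nodes(
  const heartbeat_view& hb,
  hb_clock::time_point now,
  std::chrono::milliseconds max_staleness) {
    std::unordered_set<node_id> muted;
    for (const auto& [node, seen] : hb.last_heartbeat) {
        // A node that has not heartbeated recently (e.g. it is restarting,
        // or the controller's view is lagging) is muted: the balancer's
        // information about its leadership cannot be trusted.
        if (now - seen > max_staleness) {
            muted.insert(node);
        }
    }
    return muted;
}
```

If the heartbeat view itself is stale (as suspected on the ARM cluster), a restarted node can look "fresh enough" and never get muted, which would match the observation above.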
Agree that this is the right place to look: the MaintenanceTest failure involved leader balancer strangeness too #7428
In this situation, we need to dig in and fix redpanda to handle the load: it's okay if it can't keep up, but it shouldn't crash. Since this is a producer, the Kafka memory limit semaphore should know a priori how big a message will be, and account for it: if we're bad_alloc'ing then something is going wrong with that memory management. (The genesis of client-swarm was to reproduce crashes that a customer saw with significant client counts: this is not a hypothetical.)
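To make the accounting idea concrete, here is a minimal blocking sketch (my own illustration, not Redpanda's actual kafka memory semaphore, which would be per-shard and asynchronous in Seastar rather than blocking): the size declared in the request header is reserved before the payload is read, so an oversized backlog turns into backpressure instead of a bad_alloc mid-parse.

```cpp
// Illustrative memory-limit semaphore: reserve `bytes` up front, release
// when the request is finished. All names here are hypothetical.
#include <condition_variable>
#include <cstddef>
#include <mutex>

class memory_limit_semaphore {
public:
    explicit memory_limit_semaphore(size_t limit_bytes)
      : _available(limit_bytes) {}

    // Block until `bytes` can be reserved. Callers should reject requests
    // larger than the total limit up front, otherwise this waits forever.
    void acquire(size_t bytes) {
        std::unique_lock lk(_m);
        _cv.wait(lk, [&] { return _available >= bytes; });
        _available -= bytes;
    }

    // Return the reservation once the request has been processed.
    void release(size_t bytes) {
        {
            std::lock_guard lk(_m);
            _available += bytes;
        }
        _cv.notify_all();
    }

private:
    std::mutex _m;
    std::condition_variable _cv;
    size_t _available;
};
```

The key property is that the reservation happens before any allocation for the payload, so under load producers see slower responses rather than the broker running out of memory.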
A quick update. @mmaslankaprv came up with an explanation as to why the controller in ARM tests seems to have stale information. The ARM tests had a larger number of in-flight requests compared to the x86 tests: about 1,500 in the ARM tests vs ~50 in the x86 tests. This could explain why the controller is slow to update. As to why there are so many more in-flight requests, I've noticed that the background traffic that runs during the leadership balancing is 10x higher on the ARM cluster than on the x86 cluster, even after resources are properly limited on the RP nodes: about 420MB/s on ARM and 42MB/s on x86.
This is most likely due to the fact that the …
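As a rough sanity check of why a deeper in-flight backlog shows up as staleness (my own back-of-the-envelope reasoning via Little's law, not a measurement from this thread): assuming a comparable service rate on both clusters, the time a controller update spends waiting behind the queue scales with the number of requests in flight,

```math
\frac{W_{\text{arm}}}{W_{\text{x86}}} \approx \frac{L_{\text{arm}}}{L_{\text{x86}}} = \frac{1500}{50} = 30
```

so controller state on the ARM cluster could lag roughly 30x further behind, which matches the "stale information" symptom.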
I see multiple issues that need chasing/fixing here:
…
and:
…
and finally, to fix the test itself:
…
The failures for both the … Running the … Running the …
To clarify, @ballard26, you mean 4-8x smaller?
The arm64 clusters are 4-8x larger than the amd64 clusters. The amd64 cluster is 4x smaller than the arm64 cluster in terms of pure core count. However, since the core count on the amd64 clusters includes hyperthreads, it could be up to 8x smaller, depending on how much you consider two hyperthreads on the same core to perform as two distinct cores.
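To make that arithmetic explicit (my own restatement, assuming 2 hyperthreads per physical core on the amd64 nodes): if the amd64 cluster exposes N vCPUs, the arm64 cluster exposes 4N vCPUs, and in physical-core terms the gap widens,

```math
\frac{\text{arm64 vCPUs}}{\text{amd64 vCPUs}} = \frac{4N}{N} = 4,
\qquad
\frac{\text{arm64 physical cores}}{\text{amd64 physical cores}} = \frac{4N}{N/2} = 8
```

hence the 4-8x range depending on how much credit you give hyperthreads.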
@ballard26 OK, so is it fair then to say that the mem/core ratio on ARM is 4-8x smaller compared to x86?
So on AWS, for storage-optimized instance types, the …
The 4-8x was in reference to the CPU count on the clusters we run the CDT nightly on. A 12x …
Could this failure be in the same family?
https://buildkite.com/redpanda/vtools/builds/5484#0185ea8f-6d91-4f53-94ca-348ecb773302
https://buildkite.com/redpanda/vtools/builds/5500#0185efb8-35af-4ec9-8e11-2d6d69c3d310
I've also seen this fail in my Azure CDT runs fairly reliably (same failure mode). I'm using Standard_L8s_v3 nodes for Redpanda and Standard_D4ds_v4 for the client.
Some more:
Those most recent reports were for the issue fixed by #9257.
ARM tests are okay now - green run from last night here: https://buildkite.com/redpanda/vtools/builds/6732#0186e6ee-35c5-4ca7-a43b-6a2c8eb474ce
i3en.xlarge has 8GB per vCPU
is4gen.4xlarge has 6GB per vCPU
Our scale tests do not pass reliably on the weaker ARM nodes.
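For a rough sense of the per-shard impact (my own arithmetic, assuming one Redpanda shard per vCPU and comparable per-core reservations on both instance families):

```math
\frac{6\ \text{GB/vCPU}}{8\ \text{GB/vCPU}} = 0.75
```

i.e. each shard on the ARM nodes has roughly 25% less memory headroom than on the x86 nodes, before any difference in per-core throughput is considered.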