CI Failure: Failed to allocate 4870880 bytes in ManyPartitionsTest
#8355
Labels
ci-failure
kind/bug: Something isn't working
performance
sev/high: loss of availability, pathological performance degradation, recoverable corruption
Child issue of #7405
After replicating the issue a few times on x86 and ARM with --dump-memory-diagnostics-on-alloc-failure-kind=all, it is clear that Redpanda isn't running out of memory. Rather, a contiguous chunk of roughly 4.6 MiB couldn't be allocated due to fragmentation.
Links to various failures:
https://buildkite.com/redpanda/vtools/builds/4913#0185500b-bdaa-4cb0-b640-31259ecac3be
https://buildkite.com/redpanda/vtools/builds/4918#0185552f-c829-43a8-9d3b-b74db0421f55
https://buildkite.com/redpanda/vtools/builds/4898#018545bd-095d-4b37-b549-0173b1d272c1
Decoded backtrace for the failure:
Analysis of the failure
get_topic_metadata in src/v/kafka/server/handlers/metadata.cc will allocate a std::vector<metadata_response_partition> for every topic, and sizeof(metadata_response_partition) == 112. In the ManyPartitionsTest test we create a topic with 43490 partitions. Hence we are trying to allocate exactly 112 * 43490 = 4870880 contiguous bytes to store metadata for every partition in the test topic.
Potential fix
The best bet is to switch these response types to use a fragmented vector if possible. That way we are avoiding large contiguous allocations. Another potential issue to look out for is when we encode the response. If the encoder tries to dynamically allocate a contiguous chunk of memory to encode the vector we could run into the same issue.
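To illustrate the idea, here is a minimal sketch of a chunked container that never requests more than a fixed number of bytes per allocation for element storage. It is not Redpanda's actual fragmented vector or its API: chunked_vector, partition_md, and the 32 KiB chunk size are all hypothetical names and values chosen for the example. partition_md is just a 112-byte stand-in for metadata_response_partition.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 112-byte stand-in for metadata_response_partition.
struct partition_md {
    std::int32_t partition_index;
    std::uint8_t padding[108];
};
static_assert(sizeof(partition_md) == 112, "stand-in must match the size in the issue");

// Hypothetical chunked container: elements live in fixed-capacity chunks,
// so no single element-storage allocation exceeds ChunkBytes.
template <typename T, std::size_t ChunkBytes = 32 * 1024>
class chunked_vector {
    static_assert(sizeof(T) <= ChunkBytes, "element must fit in one chunk");

public:
    static constexpr std::size_t elems_per_chunk = ChunkBytes / sizeof(T);

    void push_back(const T& value) {
        // Start a new chunk when there is none yet or the last one is full.
        if (_chunks.empty() || _chunks.back().size() == elems_per_chunk) {
            _chunks.emplace_back();
            _chunks.back().reserve(elems_per_chunk);
        }
        _chunks.back().push_back(value);
        ++_size;
    }

    T& operator[](std::size_t i) {
        return _chunks[i / elems_per_chunk][i % elems_per_chunk];
    }

    std::size_t size() const { return _size; }
    std::size_t chunk_count() const { return _chunks.size(); }

    // Bytes requested for each chunk's element storage.
    static constexpr std::size_t chunk_bytes() {
        return elems_per_chunk * sizeof(T);
    }

private:
    // The outer vector only stores small per-chunk handles, so it stays
    // tiny compared to the element storage.
    std::vector<std::vector<T>> _chunks;
    std::size_t _size = 0;
};
```

With these assumed parameters, the 43490 entries from the test topic are spread over 149 chunks of at most 32704 bytes each, instead of one 4870880-byte allocation. The trade-off is an extra division and indirection on every indexed access.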