CI Failure Failed to allocated 4870880 bytes in `ManyPartitionsTest` #8355

ballard26 · 2023-01-22T07:14:03Z

Child issue of #7405

After replicating the issue a few times on x86 and ARM with --dump-memory-diagnostics-on-alloc-failure-kind=all, it is clear that the problem isn't Redpanda running out of memory. Rather, a contiguous chunk of roughly 4.6 MiB (4870880 bytes) couldn't be allocated due to fragmentation. See:

Used memory:  1267M
Free memory:  3733M
Total memory: 5G
...
ERROR 2023-01-18 21:11:50,914 [shard  1] seastar - Failed to allocate 4870880 bytes

Links to various failures:

https://buildkite.com/redpanda/vtools/builds/4913#0185500b-bdaa-4cb0-b640-31259ecac3be
https://buildkite.com/redpanda/vtools/builds/4918#0185552f-c829-43a8-9d3b-b74db0421f55
https://buildkite.com/redpanda/vtools/builds/4898#018545bd-095d-4b37-b549-0173b1d272c1

Decoded backtrace for the failure:

[brandonallard@fedora temp]$ llvm-addr2line --obj b80b71a6b9978d8a9767049adf4e973b443f35.debug -f  --demangle  --pretty-print 
0x5656c75
seastar::memory::on_allocation_failure(unsigned long) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/memory.cc:1806
0x5665351
operator new(unsigned long) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/memory.cc:2064
0x2b25d22
void* std::__1::__libcpp_operator_new<unsigned long>(unsigned long) at /vectorized/llvm/bin/../include/c++/v1/new:245
0x2b3200d
kafka::handler_template<kafka::metadata_api, (short)0, (short)7, seastar::future<seastar::foreign_ptr<std::__1::unique_ptr<kafka::response, std::__1::default_delete<kafka::response>>>>, &kafka::metadata_memory_estimator(unsigned long, kafka::connection_context&)>::handle(kafka::request_context, seastar::smp_service_group) (.resume) at /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-02c556a1c5e374299-1/redpanda/redpanda/src/v/kafka/server/handlers/metadata.cc:420

Analysis of the failure

get_topic_metadata in src/v/kafka/server/handlers/metadata.cc allocates a std::vector<metadata_response_partition> for every topic, and sizeof(metadata_response_partition) == 112. In ManyPartitionsTest we create a topic with 43490 partitions. Hence we are trying to allocate exactly 112 * 43490 = 4870880 contiguous bytes to store the metadata for every partition in the test topic, which matches the size in the failed allocation above.
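The arithmetic can be checked at compile time. This is only a sketch: `contiguous_bytes` is a hypothetical helper and not a Redpanda function, and the two constants are simply the values quoted in the analysis.

```cpp
#include <cstddef>

// Hypothetical helper (not part of Redpanda): bytes a std::vector needs in
// one contiguous block to hold `count` elements of `element_size` bytes each.
constexpr std::size_t contiguous_bytes(std::size_t element_size, std::size_t count) {
    return element_size * count;
}

// Values quoted in the analysis.
constexpr std::size_t partition_entry_size = 112; // sizeof(metadata_response_partition)
constexpr std::size_t partition_count = 43490;    // partitions in the test topic

// Matches the size reported in "Failed to allocate 4870880 bytes".
static_assert(contiguous_bytes(partition_entry_size, partition_count) == 4870880,
              "expected the allocation size from the log");
```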

Potential fix

The best bet is to switch these response types to a fragmented vector if possible, which avoids large contiguous allocations. Another potential issue to look out for is response encoding: if the encoder dynamically allocates a contiguous chunk of memory to encode the vector, we could run into the same issue.
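To illustrate the idea, here is a minimal sketch of a chunked container. It is not Redpanda's fragmented vector; `chunked_vector`, `fake_partition`, and the 128 KiB chunk size are assumptions made up for this example. Elements are stored in fixed-capacity chunks, so no single element allocation should need to exceed the chunk size.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative only: a minimal chunked container, NOT Redpanda's
// fragmented vector. The 128 KiB default chunk size is an arbitrary choice.
template <typename T, std::size_t MaxChunkBytes = 128 * 1024>
class chunked_vector {
    static_assert(sizeof(T) <= MaxChunkBytes, "element larger than a chunk");

public:
    static constexpr std::size_t elems_per_chunk = MaxChunkBytes / sizeof(T);

    void push_back(T value) {
        // Start a new chunk when there is none yet or the last one is full.
        if (_chunks.empty() || _chunks.back().size() == elems_per_chunk) {
            _chunks.emplace_back();
            _chunks.back().reserve(elems_per_chunk);
        }
        _chunks.back().push_back(std::move(value));
        ++_size;
    }

    T& operator[](std::size_t i) {
        return _chunks[i / elems_per_chunk][i % elems_per_chunk];
    }
    const T& operator[](std::size_t i) const {
        return _chunks[i / elems_per_chunk][i % elems_per_chunk];
    }

    std::size_t size() const { return _size; }
    std::size_t chunk_count() const { return _chunks.size(); }

    // Bytes of element storage requested for one full chunk.
    static constexpr std::size_t max_chunk_bytes() {
        return elems_per_chunk * sizeof(T);
    }

private:
    // The outer vector is still contiguous, but it only holds one small
    // std::vector header per chunk.
    std::vector<std::vector<T>> _chunks;
    std::size_t _size = 0;
};

// Hypothetical 112-byte stand-in for metadata_response_partition.
struct fake_partition {
    unsigned char bytes[112];
};
```

With 112-byte elements and 43490 partitions, this sketch ends up with 38 chunks of at most 131040 bytes each, instead of a single 4870880-byte allocation. The trade-off is an extra division and indirection on every indexed access.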


dotnwat · 2023-01-22T20:46:32Z

thanks @ballard26 we'll take a looksie

piyushredpanda · 2023-01-25T00:30:43Z

@ballard26 and @travisdowns chatted about this one and Brandon will help move to fragmented vector here. Thanks for taking it up, @ballard26!

ballard26 added the kind/bug and ci-failure labels Jan 22, 2023

ballard26 self-assigned this Jan 22, 2023

dotnwat added the area/kafka and sev/high labels Jan 22, 2023

ballard26 changed the title ~~Failed to allocated 4870880 bytes in ManyPartitionsTest~~ CI Failure Failed to allocated 4870880 bytes in ManyPartitionsTest Jan 23, 2023

piyushredpanda assigned ballard26 and unassigned ballard26 Jan 24, 2023

piyushredpanda added performance and removed area/kafka labels Jan 25, 2023

ballard26 mentioned this issue Jan 25, 2023

NodeCrash in ManyPartitionsTest.test_many_partitions/test_many_partitions_compacted #8098

Closed

travisdowns mentioned this issue Jan 27, 2023

Use fragmented vector for metadata response #8469

Merged

1 task

dotnwat closed this as completed in #8469 Feb 1, 2023
