CI Failure Failed to allocated 4870880 bytes in ManyPartitionsTest #8355

Closed
ballard26 opened this issue Jan 22, 2023 · 2 comments · Fixed by #8469
Labels
ci-failure kind/bug Something isn't working performance sev/high loss of availability, pathological performance degradation, recoverable corruption

Comments

@ballard26
Contributor

ballard26 commented Jan 22, 2023

Child issue of #7405

After replicating the issue a few times on x86 and ARM with --dump-memory-diagnostics-on-alloc-failure-kind=all, it's clear the issue isn't Redpanda running out of memory. Rather, a contiguous chunk of ~4.9 MB (4870880 bytes) couldn't be allocated due to fragmentation. See:

Used memory:  1267M
Free memory:  3733M
Total memory: 5G
...
ERROR 2023-01-18 21:11:50,914 [shard  1] seastar - Failed to allocate 4870880 bytes

Links to various failures:

https://buildkite.com/redpanda/vtools/builds/4913#0185500b-bdaa-4cb0-b640-31259ecac3be
https://buildkite.com/redpanda/vtools/builds/4918#0185552f-c829-43a8-9d3b-b74db0421f55
https://buildkite.com/redpanda/vtools/builds/4898#018545bd-095d-4b37-b549-0173b1d272c1

Decoded backtrace for the failure:

[brandonallard@fedora temp]$ llvm-addr2line --obj b80b71a6b9978d8a9767049adf4e973b443f35.debug -f  --demangle  --pretty-print 
0x5656c75
seastar::memory::on_allocation_failure(unsigned long) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/memory.cc:1806
0x5665351
operator new(unsigned long) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/memory.cc:2064
0x2b25d22
void* std::__1::__libcpp_operator_new<unsigned long>(unsigned long) at /vectorized/llvm/bin/../include/c++/v1/new:245
0x2b3200d
kafka::handler_template<kafka::metadata_api, (short)0, (short)7, seastar::future<seastar::foreign_ptr<std::__1::unique_ptr<kafka::response, std::__1::default_delete<kafka::response>>>>, &kafka::metadata_memory_estimator(unsigned long, kafka::connection_context&)>::handle(kafka::request_context, seastar::smp_service_group) (.resume) at /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-02c556a1c5e374299-1/redpanda/redpanda/src/v/kafka/server/handlers/metadata.cc:420

Analysis of the failure

get_topic_metadata in src/v/kafka/server/handlers/metadata.cc allocates a std::vector<metadata_response_partition> for every topic, and sizeof(metadata_response_partition) == 112. In ManyPartitionsTest we create a topic with 43490 partitions. Hence we are trying to allocate exactly 112 * 43490 = 4870880 contiguous bytes to store the metadata for every partition in the test topic.
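
As a quick sanity check of that arithmetic, here is a minimal sketch that recomputes the allocation size at compile time. The names `element_size`, `partition_count`, and `contiguous_bytes` are made up for illustration and do not come from the Redpanda source; the two constants are simply the figures quoted above.

```cpp
#include <cstddef>

// Figures quoted in the analysis (assumed here, not read from the real types).
constexpr std::size_t element_size = 112;      // reported sizeof(metadata_response_partition)
constexpr std::size_t partition_count = 43490; // partitions in the ManyPartitionsTest topic

// Bytes a std::vector must obtain as ONE contiguous buffer to hold n elements.
constexpr std::size_t contiguous_bytes(std::size_t elem_size, std::size_t n) {
    return elem_size * n;
}

// Matches the size reported in the seastar "Failed to allocate" log line.
static_assert(contiguous_bytes(element_size, partition_count) == 4870880,
              "vector buffer size should equal the failed allocation size");
```

The exact match between the product and the logged size is what ties the failure to this one vector rather than to overall memory pressure.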

Potential fix

The best bet is to switch these response types to a fragmented vector if possible, so that we avoid large contiguous allocations. Another potential issue to look out for is response encoding: if the encoder dynamically allocates a contiguous chunk of memory to encode the vector, we could run into the same issue.
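
To illustrate the idea, below is a rough sketch of a container that stores elements in bounded-size fragments. It is not Redpanda's fragmented vector; `fragmented_sketch`, `partition_stub`, and the 32 KiB fragment limit are all hypothetical choices made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in with the same size as the reported element (112 bytes).
struct partition_stub {
    unsigned char bytes[112];
};

// Illustration only: elements live in fragments of at most MaxFragmentBytes,
// so no single allocation for element storage exceeds that limit.
template <typename T, std::size_t MaxFragmentBytes = 32 * 1024>
class fragmented_sketch {
    static_assert(sizeof(T) <= MaxFragmentBytes, "element must fit in one fragment");
    static constexpr std::size_t elems_per_frag = MaxFragmentBytes / sizeof(T);

public:
    void push_back(T value) {
        // Start a new fragment when there is none yet or the last one is full.
        if (_frags.empty() || _frags.back().size() == elems_per_frag) {
            _frags.emplace_back();
            _frags.back().reserve(elems_per_frag);
        }
        _frags.back().push_back(std::move(value));
        ++_size;
    }

    // Index by locating the fragment, then the slot inside it.
    T& operator[](std::size_t i) {
        assert(i < _size);
        return _frags[i / elems_per_frag][i % elems_per_frag];
    }

    std::size_t size() const { return _size; }
    std::size_t fragment_count() const { return _frags.size(); }

private:
    // The outer vector is still contiguous, but it only holds one small
    // std::vector header per fragment rather than the elements themselves.
    std::vector<std::vector<T>> _frags;
    std::size_t _size = 0;
};
```

With 112-byte elements and a 32 KiB limit, each fragment holds 292 elements (32704 bytes), so the 43490 partitions from the test would be spread over 149 fragments instead of one 4870880-byte buffer. The same reasoning applies to the encoder: writing into a chain of bounded buffers avoids reintroducing a large contiguous allocation at serialization time.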

@ballard26 ballard26 added kind/bug Something isn't working ci-failure labels Jan 22, 2023
@ballard26 ballard26 self-assigned this Jan 22, 2023
@dotnwat
Member

dotnwat commented Jan 22, 2023

thanks @ballard26 we'll take a looksie

@dotnwat dotnwat added area/kafka sev/high loss of availability, pathological performance degradation, recoverable corruption labels Jan 22, 2023
@ballard26 ballard26 changed the title Failed to allocated 4870880 bytes in ManyPartitionsTest CI Failure Failed to allocated 4870880 bytes in ManyPartitionsTest Jan 23, 2023
@piyushredpanda
Contributor

@ballard26 and @travisdowns chatted about this one and Brandon will help move to fragmented vector here. Thanks for taking it up, @ballard26!
