CI Failure (Redpanda process unexpectedly stopped) in MemoryStressTest.test_fetch_with_many_partitions #11458

Closed
michael-redpanda opened this issue Jun 15, 2023 · 5 comments · Fixed by #11533
Labels: area/kafka, ci-failure, kind/bug (Something isn't working), sev/high (loss of availability, pathological performance degradation, recoverable corruption)

Comments

@michael-redpanda
Contributor

https://buildkite.com/redpanda/redpanda/builds/31233#0188b78c-c490-4d04-8114-b49dcc1db720

Module: rptest.tests.memory_stress_test
Class:  MemoryStressTest
Method: test_fetch_with_many_partitions
Arguments:
{
  "memory_share_for_fetch": 0.8
}
test_id:    rptest.tests.memory_stress_test.MemoryStressTest.test_fetch_with_many_partitions.memory_share_for_fetch=0.8
status:     FAIL
run time:   13 minutes 16.335 seconds


    <NodeCrash (docker-rp-21,docker-rp-8) docker-rp-21: Redpanda process unexpectedly stopped>
Traceback (most recent call last):
  File "/root/tests/rptest/services/cluster.py", line 83, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/utils/mode_checks.py", line 63, in f
    return func(*args, **kwargs)
  File "/root/tests/rptest/tests/memory_stress_test.py", line 127, in test_fetch_with_many_partitions
    consumer.wait()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/services/background_thread.py", line 72, in wait
    super(BackgroundThreadService, self).wait(timeout_sec)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/services/service.py", line 267, in wait
    raise TimeoutError("Timed out waiting %s seconds for service nodes to finish. " % str(timeout_sec)
ducktape.errors.TimeoutError: Timed out waiting 600 seconds for service nodes to finish. These nodes are still alive: ['KafConsumer-0-140241686182784 node 1 on docker-rp-24']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 104, in wrapped
    redpanda.raise_on_crash()
  File "/root/tests/rptest/services/redpanda.py", line 2415, in raise_on_crash
    raise NodeCrash(crashes)
rptest.services.utils.NodeCrash: <NodeCrash (docker-rp-21,docker-rp-8) docker-rp-21: Redpanda process unexpectedly stopped>

@michael-redpanda added the kind/bug and ci-failure labels Jun 15, 2023
@piyushredpanda added the sev/high label Jun 16, 2023

@dlex
Contributor

dlex commented Jun 16, 2023

Two of the three brokers OOMed with the same stack:

seastar::memory::on_allocation_failure(unsigned long) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/memory.cc:2001
seastar::memory::finish_allocation(void*, unsigned long) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/memory.cc:1513
 (inlined by) seastar::memory::allocate_slowpath(unsigned long, bool) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/memory.cc:1549
seastar::memory::allocate(unsigned long) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/memory.cc:1565
 (inlined by) malloc at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/memory.cc:2050
temporary_buffer at /vectorized/include/seastar/core/temporary_buffer.hh:74
 (inlined by) iobuf::create_new_fragment(unsigned long) at /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-00495a0d66a0e8bb1-1/redpanda/redpanda/src/v/bytes/iobuf.h:241
 (inlined by) iobuf::append(char const*, unsigned long) at /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-00495a0d66a0e8bb1-1/redpanda/redpanda/src/v/bytes/iobuf.h:292
 (inlined by) iobuf::append(seastar::temporary_buffer<char>) at /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-00495a0d66a0e8bb1-1/redpanda/redpanda/src/v/bytes/iobuf.h:310
 (inlined by) operator() at /var/lib/buildkite-agent/builds/buildkite-amd64-builders-i-00495a0d66a0e8bb1-1/redpanda/redpanda/src/v/bytes/iobuf.cc:72
 (inlined by) decltype ((std::declval<read_iobuf_exactly(seastar::input_stream<char>&, unsigned long)::$_0::operator()(iobuf&, unsigned long&) const::{lambda()#2}::operator()() const::{lambda(seastar::temporary_buffer<char>)#1}&>())(std::declval<seastar::temporary_buffer<char> >())) std::__1::__invoke[abi:v160004]<read_iobuf_exactly(seastar::input_stream<char>&, unsigned long)::$_0::operator()(iobuf&, unsigned long&) const::{lambda()#2}::operator()() const::{lambda(seastar::temporary_buffer<char>)#1}&, seastar::temporary_buffer<char> >(read_iobuf_exactly(seastar::input_stream<char>&, unsigned long)::$_0::operator()(iobuf&, unsigned long&) const::{lambda()#2}::operator()() const::{lambda(seastar::temporary_buffer<char>)#1}&, seastar::temporary_buffer<char>&&) at /vectorized/llvm/bin/../include/c++/v1/__functional/invoke.h:394
 (inlined by) std::__1::invoke_result<read_iobuf_exactly(seastar::input_stream<char>&, unsigned long)::$_0::operator()(iobuf&, unsigned long&) const::{lambda()#2}::operator()() const::{lambda(seastar::temporary_buffer<char>)#1}&, seastar::temporary_buffer<char> >::type std::__1::invoke[abi:v160004]<read_iobuf_exactly(seastar::input_stream<char>&, unsigned long)::$_0::operator()(iobuf&, unsigned long&) const::{lambda()#2}::operator()() const::{lambda(seastar::temporary_buffer<char>)#1}&, seastar::temporary_buffer<char> >(read_iobuf_exactly(seastar::input_stream<char>&, unsigned long)::$_0::operator()(iobuf&, unsigned long&) const::{lambda()#2}::operator()() const::{lambda(seastar::temporary_buffer<char>)#1}&, seastar::temporary_buffer<char>&&) at /vectorized/llvm/bin/../include/c++/v1/__functional/invoke.h:539
 (inlined by) auto seastar::internal::future_invoke<read_iobuf_exactly(seastar::input_stream<char>&, unsigned long)::$_0::operator()(iobuf&, unsigned long&) const::{lambda()#2}::operator()() const::{lambda(seastar::temporary_buffer<char>)#1}&, seastar::temporary_buffer<char> >(read_iobuf_exactly(seastar::input_stream<char>&, unsigned long)::$_0::operator()(iobuf&, unsigned long&) const::{lambda()#2}::operator()() const::{lambda(seastar::temporary_buffer<char>)#1}&, seastar::temporary_buffer<char>&&) at /vectorized/include/seastar/core/future.hh:1155
 (inlined by) operator() at /vectorized/include/seastar/core/future.hh:1455
 (inlined by) void seastar::futurize<void>::satisfy_with_result_of<seastar::future<seastar::temporary_buffer<char> >::then_impl_nrvo<read_iobuf_exactly(seastar::input_stream<char>&, unsigned long)::$_0::operator()(iobuf&, unsigned long&) const::{lambda()#2}::operator()() const::{lambda(seastar::temporary_buffer<char>)#1}, seastar::future<void> >(read_iobuf_exactly(seastar::input_stream<char>&, unsigned long)::$_0::operator()(iobuf&, unsigned long&) const::{lambda()#2}::operator()() const::{lambda(seastar::temporary_buffer<char>)#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, read_iobuf_exactly(seastar::input_stream<char>&, unsigned long)::$_0::operator()(iobuf&, unsigned long&) const::{lambda()#2}::operator()() const::{lambda(seastar::temporary_buffer<char>)#1}&, seastar::future_state<seastar::temporary_buffer<char> >&&)#1}::operator()(seastar::internal::promise_base_with_type<void>&&, read_iobuf_exactly(seastar::input_stream<char>&, unsigned long)::$_0::operator()(iobuf&, unsigned long&) const::{lambda()#2}::operator()() const::{lambda(seastar::temporary_buffer<char>)#1}&, seastar::future_state<seastar::temporary_buffer<char> >&&) const::{lambda()#1}>(seastar::internal::promise_base_with_type<void>&&, read_iobuf_exactly(seastar::input_stream<char>&, unsigned long)::$_0::operator()(iobuf&, unsigned long&) const::{lambda()#2}::operator()() const::{lambda(seastar::temporary_buffer<char>)#1}&&) at /vectorized/include/seastar/core/future.hh:1981
 (inlined by) operator() at /vectorized/include/seastar/core/future.hh:1451
 (inlined by) seastar::continuation<seastar::internal::promise_base_with_type<void>, read_iobuf_exactly(seastar::input_stream<char>&, unsigned long)::$_0::operator()(iobuf&, unsigned long&) const::{lambda()#2}::operator()() const::{lambda(seastar::temporary_buffer<char>)#1}, seastar::future<seastar::temporary_buffer<char> >::then_impl_nrvo<read_iobuf_exactly(seastar::input_stream<char>&, unsigned long)::$_0::operator()(iobuf&, unsigned long&) const::{lambda()#2}::operator()() const::{lambda(seastar::temporary_buffer<char>)#1}, seastar::future<void> >(read_iobuf_exactly(seastar::input_stream<char>&, unsigned long)::$_0::operator()(iobuf&, unsigned long&) const::{lambda()#2}::operator()() const::{lambda(seastar::temporary_buffer<char>)#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, read_iobuf_exactly(seastar::input_stream<char>&, unsigned long)::$_0::operator()(iobuf&, unsigned long&) const::{lambda()#2}::operator()() const::{lambda(seastar::temporary_buffer<char>)#1}&, seastar::future_state<seastar::temporary_buffer<char> >&&)#1}, seastar::temporary_buffer<char> >::run_and_dispose() at /vectorized/include/seastar/core/future.hh:742
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:2557
 (inlined by) seastar::reactor::run_some_tasks() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:3020
seastar::reactor::do_run() at /v/build/v_deps_build/seastar-prefix/src/seastar/src/core/reactor.cc:3189

So effectively this shows that #10905 does not prevent #3409 in all cases. I will figure out exactly why, but this is not a regression compared to v23.1.

@travisdowns
Member

Two more OOM cases:

FAIL test: MemoryStressTest.test_fetch_with_many_partitions.memory_share_for_fetch=0.8 (3/31 runs)
failure at 2023-06-19T14:45:32.590Z:
on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/31598#0188d3e8-3678-4ff4-b39c-9fcaeeb83992
failure at 2023-06-19T07:38:18.682Z:

@michael-redpanda
Contributor Author

Closing this as it's a duplicate of #11304

@dlex
Contributor

dlex commented Jun 20, 2023

Since work has already started here, I'm doing it the other way around: the work for #11304 will be done here.

@dlex reopened this Jun 20, 2023

@travisdowns
Member

travisdowns commented Jun 20, 2023

FAIL test: MemoryStressTest.test_fetch_with_many_partitions.memory_share_for_fetch=0.8 (3/13 runs)
failure at 2023-06-20T16:42:17.394Z: <NodeCrash (docker-rp-18,docker-rp-10) docker-rp-18: Redpanda process unexpectedly stopped>
on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/31657#0188d970-f122-4b33-9e70-8a6e45091c47
failure at 2023-06-20T17:16:32.585Z:
on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/31659#0188d99c-fed9-40a6-b43c-00b508233319
failure at 2023-06-20T15:30:54.038Z:
on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/31649#0188d935-5460-4cb7-bc23-0ba8a98719e5
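
To compare the reports in this thread, the `(N/M runs)` counts can be turned into failure rates. A small sketch; `failure_rate` is a hypothetical helper, and the regular expression is an assumption based only on the summary lines quoted here:

```python
import re

# Matches the "(failed/total runs)" suffix seen in the summary lines.
RUNS_RE = re.compile(r"\((\d+)/(\d+) runs\)")


def failure_rate(summary_line):
    """Return failed/total as a float for a 'FAIL test: ... (N/M runs)' line."""
    match = RUNS_RE.search(summary_line)
    if match is None:
        raise ValueError("no '(N/M runs)' count found in: %r" % summary_line)
    failed, total = int(match.group(1)), int(match.group(2))
    if total == 0:
        raise ValueError("total run count is zero")
    return failed / total


if __name__ == "__main__":
    prefix = "FAIL test: MemoryStressTest.test_fetch_with_many_partitions.memory_share_for_fetch=0.8 "
    for counts in ("(3/31 runs)", "(3/13 runs)"):
        print(counts, "%.1f%%" % (100 * failure_rate(prefix + counts)))
```

The two reports work out to roughly 10% and 23% of runs, though the sample windows may overlap, so the difference should not be read as a trend.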
