Node OOM crash while decommissioning while under ingress pressure #9408
Comments
The full log of the failed node: redpanda.log.zip
The build used to run it: https://buildkite.com/redpanda/redpanda/builds/24753#0186c8e2-fedc-42e7-8ad7-2ebff65c27fe
This is the "big vector" in
@dlex I'm not able to access this link. Is there a way to run this test on #9484? I want to see if it helps.
@bharathv I'm sorry, the link was to the branch in my fork; I'm giving you the permissions now. However, I think the easiest way would be to merge your change and then for me to run the test off a
Apparently CDT can only run against a deb package, so it will be easiest to merge the PR first, and then I will try the CDT against the latest CI build the next day. Also, the link was indeed broken, sorry for that. The branch is PRed now in #9491
Okay, I merged the patch this morning, so it should be included in the next nightly build.
Think
Hmm right, I wonder how, because
It does; more context here: https://github.com/redpanda-data/vtools/pull/1329
Verified with https://buildkite.com/redpanda/redpanda/builds/25509#01870185-f1d7-44d3-9d48-24857061238e, no crashes out of 4 runs so far.
Version & Environment
Redpanda version:
v23.2.0-dev-278-g9ad94e101 - 9ad94e1019aef43dda64ef79c1286b72790e9972-dirty
Manual CDT environment using duck.py, Ubuntu 22.04.1 LTS on is4gen.xlarge (6 GiB/core)

What went wrong?
A redpanda node has crashed on memory allocation.
The test scenario was this:

- 4-node cluster running since 22:15 and performing another test
- Decommission of the node started at 22:35
- By 00:05 the decommission was still in progress
- the node crashed with `seastar_memory - Failed to allocate 2359296 bytes`
- Partitions were moving out very slowly, while the other test kept the cluster under ingress pressure:
  - KgoVerifierProducer had populated a topic with 315000 4K messages
  - then the producer went on emitting 128K messages at ~0.53 GiB/s
  - one of the nodes was stopped for 30 minutes and then started back up
  - the under-replicated partitions were being replicated into the node for ~1 hour
  - the node crashed with `seastar - Failed to allocate 6291456 bytes` (see the log-scan sketch after this list)
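Since the only direct evidence of the two OOMs is the pair of allocation-failure messages quoted above, a quick way to triage a log like redpanda.log.zip is to grep for those messages and tally the requested sizes. Below is a minimal sketch; the message pattern is taken from the lines quoted above, and the default filename is just an illustrative assumption.

```python
# Minimal triage sketch: scan a redpanda log for seastar allocation failures
# and report how often each requested size shows up.
import re
import sys
from collections import Counter

# Matches lines such as:
#   seastar_memory - Failed to allocate 2359296 bytes
#   seastar - Failed to allocate 6291456 bytes
ALLOC_FAIL = re.compile(r"(seastar(?:_memory)?) - Failed to allocate (\d+) bytes")

def scan(path: str) -> None:
    sizes = Counter()
    with open(path, errors="replace") as f:
        for line in f:
            m = ALLOC_FAIL.search(line)
            if m:
                sizes[int(m.group(2))] += 1
    for size, count in sizes.most_common():
        print(f"{count} allocation failure(s) of {size} bytes ({size / (1 << 20):.1f} MiB)")

if __name__ == "__main__":
    # Default path is an assumption; pass the unpacked log file explicitly.
    scan(sys.argv[1] if len(sys.argv) > 1 else "redpanda.log")
```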
What should have happened instead?
The node should not have crashed, and the decommission should have completed successfully.
How to reproduce the issue?
The test for this case is still in a feature branch: `TieredStorageWithLoadTest.test_restarts`
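Once the feature branch is merged, the test should presumably be selectable with ducktape's usual `path/to/test_file.py::TieredStorageWithLoadTest.test_restarts` selector syntax; for the CDT environment described above it is driven through vtools' duck.py instead (the exact invocation is not part of this report).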
Additional information
Some insights into decommission progress
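For anyone reproducing this and wanting to watch decommission progress themselves, a rough sketch is to poll the Redpanda admin API for the draining broker's membership status. This assumes the admin API listens on its usual port 9644 and that `GET /v1/brokers` reports a `membership_status` per broker; the node id below is hypothetical.

```python
# Rough sketch (not from the issue): poll the admin API until the
# decommissioned broker disappears from the broker list.
import json
import time
import urllib.request

ADMIN_URL = "http://localhost:9644"  # any node's admin endpoint
NODE_ID = 3                          # hypothetical id of the node being decommissioned

def broker_status(node_id: int):
    with urllib.request.urlopen(f"{ADMIN_URL}/v1/brokers") as resp:
        brokers = json.load(resp)
    for b in brokers:
        if b.get("node_id") == node_id:
            return b.get("membership_status")
    return None  # broker no longer reported -> decommission finished

while True:
    status = broker_status(NODE_ID)
    print(f"node {NODE_ID}: membership_status={status}")
    if status is None:
        break
    time.sleep(30)
```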