OOM due to 9 MB allocation in rm_stm #8507
Comments
The size of the snapshot depends on the client workload; what was the workload?
This is the same as https://redpandadata.slack.com/archives/C01ND4SVB6Z/p1675088864863619. This looks like a memory fragmentation issue; snapshot serialization/deserialization needs some fixing.
Let's discuss it during the stand-up tomorrow. I may be able to take a look this week. The things I want to check are:
See also Slack for additional context: https://redpandadata.slack.com/archives/C01ND4SVB6Z/p1675088864863619
Version: 23.1.1. We do NOT use transactions. In the environment that is having issues, we have a very large number of producers (~20k per node); this is unlikely to change anytime soon. In other environments, where we have 2-3k producers per node, we do not have any issues. If I can provide any additional information that you might find helpful, please let me know.
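For scale, a back-of-the-envelope sketch: the 72-byte seq_entry size comes from the issue body below, and the one-entry-per-producer mapping is an assumption for illustration, not confirmed behavior.

```cpp
// Rough sizing under stated assumptions: ~20k producers per node (reported
// above) and a 72-byte seq_entry (quoted in the issue body). One seq_entry
// per producer is an assumption, not confirmed rm_stm behavior.
#include <cstddef>

constexpr std::size_t producers_per_node = 20'000;
constexpr std::size_t seq_entry_bytes    = 72;
// Raw payload before any vector capacity overshoot from doubling growth:
constexpr std::size_t payload_bytes = producers_per_node * seq_entry_bytes;
static_assert(payload_bytes == 1'440'000); // ~1.4 MB per node
```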
Thanks @kargh, we are looking into this.
Per out-of-band conversation:
Good insights here.
@rystsov could you take a peek at #9408 and confirm if it's a duplicate? The stack looks identical and I'd normally dupe it out right away, but I wanted to check since the workload here was quite different: it occurs when one node is down for some time under a synthetic high-throughput load and then comes back up. I don't think there are many distinct producers, however.
@bharathv is working on switching to fragmented_vector: https://github.com/bharathv/redpanda/commits/switch_to_frag
@travisdowns dropped a comment there on how to confirm it.
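For context, here is a minimal sketch of the chunked-storage idea a fragmented vector embodies (an illustration only; the class name and fragment size are assumptions, not Redpanda's actual fragmented_vector implementation):

```cpp
// Elements live in fixed-size fragments, so the payload never requires one
// large contiguous allocation that heap fragmentation could defeat.
#include <cstddef>
#include <utility>
#include <vector>

template <typename T, std::size_t FragmentBytes = 32 * 1024>
class chunked_vector_sketch {
    static constexpr std::size_t elems_per_frag = FragmentBytes / sizeof(T);
    std::vector<std::vector<T>> _frags; // every fragment but the last is full

public:
    void push_back(T v) {
        if (_frags.empty() || _frags.back().size() == elems_per_frag) {
            _frags.emplace_back();
            _frags.back().reserve(elems_per_frag); // largest single allocation
        }
        _frags.back().push_back(std::move(v));
    }

    T& operator[](std::size_t i) {
        return _frags[i / elems_per_frag][i % elems_per_frag];
    }

    std::size_t size() const {
        return _frags.empty()
                 ? 0
                 : (_frags.size() - 1) * elems_per_frag + _frags.back().size();
    }
};
```

Under these assumptions, the 9437184-byte payload from the backtrace becomes 288 allocations of 32 KiB each instead of a single 9 MiB one, which a fragmented heap can satisfy far more easily.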
Version & Environment
Redpanda version: 22.3.10
What went wrong?
Large allocation (9 MB) in rm_stm snapshot causes OOM.
What should have happened instead?
No OOM.
Additional information
Find the backtrace below:
The OOM occurs when filling a `std::vector<seq_entry>` in `rm_stm::take_snapshot()`. `seq_entry` is a 72-byte structure, so this 9437184-byte allocation is a vector of 131072 of them. Is that a reasonable number of seq entries to be live on one shard? I'm not sure.
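To make the arithmetic explicit, a standalone check using only the sizes quoted in this issue (the 72-byte figure mirrors the stated sizeof(seq_entry), not the real rm_stm header):

```cpp
#include <cstddef>

constexpr std::size_t seq_entry_bytes  = 72;      // stated sizeof(seq_entry)
constexpr std::size_t allocation_bytes = 9437184; // 9 MiB, the failed allocation
static_assert(allocation_bytes % seq_entry_bytes == 0); // divides evenly
static_assert(allocation_bytes / seq_entry_bytes == 131072); // entries in the vector
```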