Every node OOMs during load test #5563
Here is another OOM backtrace during this same test, the failed allocation was
The copy specifically implicated here will be fixed when this commit makes it into our seastar fork. However, even without that copy, it still indicates a large vector of `metadata_response_partition` objects.
We see out-of-memory errors on the metadata path for large partition counts. One problematic place would have 3 or 4 copies of the partition list in flight at once. This change avoids this code entirely in the usual case that the metadata request isn't having the side effect of creating new topics, and it reduces copies even when it is. Issue redpanda-data#5563.
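Schematically, the change amounts to something like the sketch below. All names here are hypothetical placeholders, not the actual Redpanda handler; the point is that the side-effecting topic-creation path is only entered when needed, and the partition list is moved rather than copied into the response.

```cpp
#include <string>
#include <utility>
#include <vector>

// All names below are hypothetical placeholders sketching the described
// control flow, not the actual Redpanda handler.
struct metadata_request {
    bool allow_auto_topic_creation = false;
    std::vector<std::string> topics;
};

struct metadata_response {
    std::vector<std::string> topics; // stands in for the full topic metadata
};

bool has_unknown_topics(const metadata_request&) { return false; }
void create_missing_topics(const metadata_request&) {}

metadata_response handle_metadata(metadata_request req) {
    // Rare, side-effecting path: taken only when the request may create topics.
    if (req.allow_auto_topic_creation && has_unknown_topics(req)) {
        create_missing_topics(req);
    }
    // Usual path: move the topic list into the response instead of copying it.
    return metadata_response{std::move(req.topics)};
}
```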
Summary of OOM types: in this load test there were 32 nodes, all of which failed with an OOM. Here is the breakdown: 22 OOMs were the 528,000-byte allocation described above, creating a vector of `metadata_response_partition` objects.
The primary remaining issues were the large copies in the metadata response handler, which were fixed in:
Add max_frag_bytes to the metadata memory estimate to account for the worst-case overshoot during vector reallocation. Issue redpanda-data#5563.
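For context on why reallocation overshoots: when a `std::vector` grows past its capacity it allocates a new, larger buffer and copies the elements across before freeing the old one, so peak usage briefly exceeds the final size. A minimal sketch of an estimate with that headroom, using illustrative names and constants rather than the actual Redpanda code:

```cpp
#include <cstddef>

// Illustrative constant, not the actual Redpanda value: an assumed cap on
// the largest single contiguous allocation the allocator will be asked for.
constexpr std::size_t max_frag_bytes = 128 * 1024;

// Conservative estimate for building the metadata response: the final
// payload size plus headroom for the transient overshoot that a growing
// std::vector can cause while the response is being assembled.
std::size_t metadata_memory_estimate(std::size_t n_partitions,
                                     std::size_t bytes_per_partition) {
    return n_partitions * bytes_per_partition + max_frag_bytes;
}
```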
Version & Environment
Redpanda version: 0346aa1
What went wrong?
Out-of-memory conditions occur when trying to make moderately sized (more than 500 KB, less than 1 MB) allocations during a load test, after the cluster controller is killed. Some failures indicate significant memory fragmentation.
What should have happened instead?
No out-of-memory condition; instead, a graceful and speedy recovery.
Additional information
Typical OOM:
Relevant decoded backtrace:
The large allocation is the `std::vector` of `metadata_response_partition` objects inside the `metadata_response_topic`. The allocation of 528,000 bytes represents 6,000 metadata objects, given the struct is 88 bytes, which is as expected since the topics involved in the load test have 6,000 partitions. This specific OOM occurs during an unnecessary copy, which we can eliminate with a `std::move`.
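As a minimal sketch of why the move matters (using a hypothetical `partition_md` stand-in rather than the actual Redpanda types):

```cpp
#include <utility>
#include <vector>

// Hypothetical stand-in for metadata_response_partition; the real struct is
// 88 bytes, so 6,000 partitions need 6,000 * 88 = 528,000 bytes of
// contiguous vector storage, the exact size of the failed allocation.
struct partition_md {
    char bytes[88]; // fields elided; only the size matters here
};

struct topic_md {
    std::vector<partition_md> partitions;
};

// Copying the vector doubles peak memory while both buffers are live.
topic_md respond_with_copy(const topic_md& src) {
    topic_md resp;
    resp.partitions = src.partitions; // second 528 KB allocation
    return resp;
}

// Moving transfers ownership of the existing buffer: no new allocation.
topic_md respond_with_move(topic_md&& src) {
    topic_md resp;
    resp.partitions = std::move(src.partitions); // steals the buffer
    return resp;
}
```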