-
Notifications
You must be signed in to change notification settings - Fork 589
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent very fast learner recovery rates from causing OOM #9665
Comments
Long term, this item should be addressed by item #3 in the cluster ops doc. Let me check with the team whether it is possible to add a short-term guardrail here. |
I think this issue is based on incorrect assumption, the recovery rate should not lead to OOM as we have a separate semaphore controlling recovery memory usage. |
@mmaslankaprv How did this cause OOM before? Is this an overall memory management issue? The CS request is the same: be more responsible about memory usage even if the user sets it to infinity. |
it didn't cause an OOM before, the OOM was related with a different thing, yes it is an overall memory management issue. |
@mmaslankaprv I'll close this. Do we have a ticket tracking the root cause of OOM? |
@mattschumpert : the root issue here was most likely fixed by #9484 |
We've seen that tuning learner recovery rate very high can cause an OOM. We've separately reduced the memory consumption required for recovery, but we should still defend against OOMing at all costs as admins will not make the connection that this is an unsafe configuration.
The tradeoff should be to deny the requested learner recovery rate with a log warning and/or in general leave a bit more buffer before allocating more for recovery.
The text was updated successfully, but these errors were encountered: