Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent very fast learner recovery rates from causing OOM #9665

Closed
mattschumpert opened this issue Mar 27, 2023 · 6 comments
Closed

Prevent very fast learner recovery rates from causing OOM #9665

mattschumpert opened this issue Mar 27, 2023 · 6 comments
Labels
kind/enhance New feature or request

Comments

@mattschumpert
Copy link

mattschumpert commented Mar 27, 2023

We've seen that tuning learner recovery rate very high can cause an OOM. We've separately reduced the memory consumption required for recovery, but we should still defend against OOMing at all costs as admins will not make the connection that this is an unsafe configuration.

The tradeoff should be to deny the requested learner recovery rate with a log warning and/or in general leave a bit more buffer before allocating more for recovery.

@mattschumpert mattschumpert added the kind/enhance New feature or request label Mar 27, 2023
@vshtokman
Copy link
Contributor

vshtokman commented Mar 31, 2023

Long term, this item should be addressed by item #3 in the cluster ops doc.

Let me check with the team whether it is possible to add a short-term guardrail here.

@mmaslankaprv
Copy link
Member

I think this issue is based on incorrect assumption, the recovery rate should not lead to OOM as we have a separate semaphore controlling recovery memory usage.

@mattschumpert
Copy link
Author

@mmaslankaprv How did this cause OOM before? Is this an overall memory management issue? The CS request is the same: be more responsible about memory usage even if the user sets it to infinity.

@mmaslankaprv
Copy link
Member

it didn't cause an OOM before, the OOM was related with a different thing, yes it is an overall memory management issue.

@mattschumpert
Copy link
Author

@mmaslankaprv I'll close this. Do we have a ticket tracking the root cause of OOM?

@piyushredpanda
Copy link
Contributor

@mattschumpert : the root issue here was most likely fixed by #9484

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/enhance New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants