fix: Use mutex in the scaler cache to prevent concurrent refreshings #6309
Conversation
Signed-off-by: Jorge Turrado <[email protected]>
/run-e2e
Signed-off-by: Jorge Turrado <[email protected]>
/run-e2e
lgtm
Thank you very much for this fix; this looks like a good safety improvement!
However, I think I may have found some edge cases where these locks don't work quite as expected.
if index < 0 || index >= len(c.Scalers) {
	return ScalerBuilder{}, fmt.Errorf("scaler with id %d not found. Len = %d", index, len(c.Scalers))
}

c.mutex.RLock()
defer c.mutex.RUnlock()
return c.Scalers[index], nil
This bounds check should probably be moved to after the lock has been acquired. The c.Scalers slice could have been modified between the check and the time of use.
Delaying the bounds check until after the lock is taken should have no performance impact, because bounds-check failures are very unlikely and the read lock is unlikely to be contended.
However, the code as written is still safe in the sense that refreshes will work. It's just that fetching a ScalerBuilder can still panic, despite the check.
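For illustration, the suggested ordering might look roughly like this; the receiver, field names, and error message are copied from the snippets in this PR, so anything beyond that is an assumption, not the actual patch:

func (c *ScalersCache) getScalerBuilder(index int) (ScalerBuilder, error) {
	c.mutex.RLock()
	defer c.mutex.RUnlock()

	// The bounds check and the indexing now happen under the same read lock,
	// so they observe a consistent snapshot of c.Scalers.
	if index < 0 || index >= len(c.Scalers) {
		return ScalerBuilder{}, fmt.Errorf("scaler with id %d not found. Len = %d", index, len(c.Scalers))
	}
	return c.Scalers[index], nil
}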
-	sb := c.Scalers[id]
-	defer sb.Scaler.Close(ctx)
-	ns, sConfig, err := sb.Factory()
+	oldSb, err := c.getScalerBuilder(index)
+	if err != nil {
+		return nil, err
+	}
+
+	c.mutex.Lock()
+	defer c.mutex.Unlock()
+
+	newScaler, sConfig, err := oldSb.Factory()
There are two problems here. This use of a lock prevents concurrent refreshes, but not multiple refreshes of the same scaler, and it may unacceptably impact the availability of a Keda instance.
Acquiring the write lock means that all GetMetricSpecForScaling() and GetMetricSpecForScalingForScaler() calls will be locked out until the Factory() call has completed, because those need a read lock on the same mutex. If the Factory hangs (e.g. because it has to interact with another service over an unreliable network), Keda will be effectively suspended.
- If concurrent Factory() calls can be tolerated, the write lock could be acquired after the Factory() call has succeeded.
- If at most one Factory() call is allowed per ScalerBuilder, then each ScalerBuilder would need its own mutex (see the sketch after this list).
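A minimal, self-contained sketch of that second option, assuming simplified types (the field names, the simplified Factory signature, and the Refresh method are hypothetical illustrations, not KEDA's actual definitions):

import (
	"context"
	"sync"
)

// Sketch only: a stand-in for the real scalers.Scaler interface.
type Scaler interface {
	Close(ctx context.Context) error
}

// Each builder carries its own mutex, so refreshing one scaler does not
// block readers of the cache or refreshes of other scalers.
type ScalerBuilder struct {
	mu      sync.Mutex
	Scaler  Scaler
	Factory func() (Scaler, error) // simplified signature for illustration
}

func (sb *ScalerBuilder) Refresh(ctx context.Context) (Scaler, error) {
	sb.mu.Lock()
	defer sb.mu.Unlock()

	newScaler, err := sb.Factory()
	if err != nil {
		return nil, err
	}
	old := sb.Scaler
	sb.Scaler = newScaler
	if old != nil {
		_ = old.Close(ctx)
	}
	return newScaler, nil
}

Note that a mutex must not be copied, so the cache would then have to store *ScalerBuilder pointers (or keep the per-builder mutexes in a parallel structure) rather than ScalerBuilder values.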
The other problem is that the same lock is acquired twice in this part of the code. Inlining getScalerBuilder() and eliding error handling, we roughly have:
c.mutex.RLock()
oldSb := c.Scalers[index]
c.mutex.RUnlock()
c.mutex.Lock()
c.Scalers[index] = newSb
oldSb.Scaler.Close(ctx)
c.mutex.Unlock()
This still makes it possible for multiple concurrent tasks to close the same Scaler, and to fail to close the current Scaler. Here's an example of interleaved execution where two goroutines read an old scaler 0xa and each register their own scaler 0xb or 0xc:
| goroutine 1 | goroutine 2 |
|---|---|
| RLock(); oldSB := 0xa; RUnlock() | |
| | RLock(); oldSB := 0xa; RUnlock() |
| Lock(); c.Scalers[index] = 0xb; oldSb.Close(ctx); Unlock() | |
| | Lock(); c.Scalers[index] = 0xc; oldSb.Close(ctx); Unlock() |
Here, scaler 0xa has been closed twice, scaler 0xb has been leaked without closing it, and 0xc is the new scaler instance.
One way to solve this is to only replace the scaler if the current scaler is still the old one. Roughly:
Lock()
defer Unlock()
if c.Scalers[index] != oldSb {
	// detected concurrent modification, don't actually perform the refresh
	newScaler.Close(ctx)
	return c.Scalers[index].Scaler
}
c.Scalers[index] = ...
oldSb.Scaler.Close(ctx)
return newScaler
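Filled out into the shape of the refresh method discussed above, and assuming ScalerBuilder roughly has Scaler, ScalerConfig, and Factory fields (an assumption based on the snippets in this diff, not a verified definition), the first approach could look like:

func (c *ScalersCache) refreshScaler(ctx context.Context, index int) (scalers.Scaler, error) {
	oldSb, err := c.getScalerBuilder(index)
	if err != nil {
		return nil, err
	}

	// Build the new scaler without holding the write lock, so readers are
	// not blocked while the (possibly slow) factory runs.
	newScaler, sConfig, err := oldSb.Factory()
	if err != nil {
		return nil, err
	}

	c.mutex.Lock()
	defer c.mutex.Unlock()

	// Detect concurrent modification: only swap if the entry still holds the
	// scaler we started from. Comparing the Scaler interface values assumes
	// scalers are pointer types, which is typical but not guaranteed here.
	if index >= len(c.Scalers) || c.Scalers[index].Scaler != oldSb.Scaler {
		newScaler.Close(ctx)
		if index < len(c.Scalers) {
			return c.Scalers[index].Scaler, nil
		}
		return nil, fmt.Errorf("scaler with id %d not found. Len = %d", index, len(c.Scalers))
	}

	// Field names below are assumed for illustration.
	c.Scalers[index] = ScalerBuilder{
		Scaler:       newScaler,
		ScalerConfig: *sConfig,
		Factory:      oldSb.Factory,
	}
	oldSb.Scaler.Close(ctx)
	return newScaler, nil
}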
An alternative would be to take the scaler to be closed directly from c.Scalers, ignoring the previous oldSb, which is only used for finding the Factory:
Lock()
defer Unlock()
// load oldSb again in case it has changed
oldSb = c.Scalers[index]
c.Scalers[index] = ...
oldSb.Scaler.Close(ctx)
return newScaler
@@ -40,6 +41,7 @@ type ScalersCache struct {
 	ScalableObjectGeneration int64
 	Recorder                 record.EventRecorder
 	CompiledFormula          *vm.Program
+	mutex                    sync.RWMutex
It may be desirable to make any fields private that should only be accessed while the mutex is locked. In particular, Scalers cannot be accessed safely in other parts of the code.
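For example, a sketch of that encapsulation with a hypothetical accessor (the private field, the Snapshot method, and the copy-on-read idea are illustrative, not an existing KEDA API):

type ScalersCache struct {
	scalers []ScalerBuilder // unexported: only touched while mutex is held
	mutex   sync.RWMutex
	// ... other fields as in the struct above
}

// Snapshot returns a copy of the builders so callers can iterate without
// holding the lock; mutations still have to go through methods that take
// the write lock.
func (c *ScalersCache) Snapshot() []ScalerBuilder {
	c.mutex.RLock()
	defer c.mutex.RUnlock()
	out := make([]ScalerBuilder, len(c.scalers))
	copy(out, c.scalers)
	return out
}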
Those are fair points, but we are thinking about doing a huge internal refactor because we have found some issues related to technical debt accumulated over the years (and this issue is just an example of that). I'm still working on a formal proposal to open an issue, but I think all of these problems will be solved by a redesign of the current internals with a less coupled approach. The goal of this PR was just to mitigate the panics for the next release (we'll cut a release this week), and hopefully the next one will include this refactor.
@latk it would be great if you could draft a PR with the additional checks you proposed, regardless of any future refactor :) Thanks!
Unfortunately I have no "tuits" left for picking up Go and seeing a PR through to the end. But if someone tags me on related PRs or discussions in the future, there's a 50% chance I'll get nerd-sniped into contributing a code review within 48 hours.
To prevent unexpected accesses to the scalers inside the scalers cache, this PR adds a mutex that blocks operations on the cache while it is being modified.
Checklist
Fixes #6273