
fix: Use mutex in the scaler cache to prevent concurrent refreshings #6309

Merged: 2 commits merged into kedacore:main on Nov 4, 2024

Conversation

JorTurFer
Member

To prevent unexpected access to the scalers inside the scaler cache, this PR adds a mutex that blocks operations on the cache while it is being modified.

Checklist

  • Commits are signed with Developer Certificate of Origin (DCO - learn more)

Fixes #6273

@JorTurFer JorTurFer requested a review from a team as a code owner November 3, 2024 21:19
@JorTurFer
Member Author

JorTurFer commented Nov 4, 2024

/run-e2e
Update: You can check the progress here

Signed-off-by: Jorge Turrado <[email protected]>
@JorTurFer
Member Author

JorTurFer commented Nov 4, 2024

/run-e2e
Update: You can check the progress here

Member

@wozniakjan wozniakjan left a comment

lgtm

@JorTurFer JorTurFer merged commit 5ea1eac into kedacore:main Nov 4, 2024
19 checks passed
@JorTurFer JorTurFer deleted the add-mutex branch November 4, 2024 10:30

@latk latk left a comment

Thank you very much for this fix, this looks like a good safety improvement!

However, I think I may have found some edge cases where these locks don't work quite as expected.

Comment on lines +70 to +76
if index < 0 || index >= len(c.Scalers) {
return ScalerBuilder{}, fmt.Errorf("scaler with id %d not found. Len = %d", index, len(c.Scalers))
}

c.mutex.RLock()
defer c.mutex.RUnlock()
return c.Scalers[index], nil

This bounds check should probably be moved until after the lock has been acquired. The c.Scalers could have been modified in between the check and the time of use.

Delaying the bounds check until after the lock should have no performance impact, because bounds check failures are very unlikely and the rlock is unlikely to be contended.

However, the code as written is still safe in the sense that refreshes will work. It's just that fetching a ScalerBuilder can still panic, despite the check.
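
A minimal sketch of that reordering, keeping the names from the snippet above (the surrounding method signature is assumed, not taken from the PR):

func (c *ScalersCache) getScalerBuilder(index int) (ScalerBuilder, error) {
	// Take the read lock first so the bounds check and the slice access
	// observe the same snapshot of c.Scalers.
	c.mutex.RLock()
	defer c.mutex.RUnlock()

	if index < 0 || index >= len(c.Scalers) {
		return ScalerBuilder{}, fmt.Errorf("scaler with id %d not found. Len = %d", index, len(c.Scalers))
	}
	return c.Scalers[index], nil
}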

Comment on lines +166 to +174

+	oldSb, err := c.getScalerBuilder(index)
+	if err != nil {
+		return nil, err
+	}
+
-	sb := c.Scalers[id]
-	defer sb.Scaler.Close(ctx)
-	ns, sConfig, err := sb.Factory()
+	c.mutex.Lock()
+	defer c.mutex.Unlock()
+
+	newScaler, sConfig, err := oldSb.Factory()

There are two problems here. This use of a lock prevents concurrent refreshes, but it does not prevent multiple refreshes of the same scaler, and it may unacceptably impact the availability of a KEDA instance.

Acquiring the write lock means that all GetMetricSpecForScaling() and GetMetricSpecForScalingForScaler() calls will be locked out until the Factory() has completed, because those need a read lock on the same mutex. If the Factory hangs (e.g. because it has to interact with another service over an unreliable network), KEDA will be effectively suspended.

  • If concurrent Factory() calls can be tolerated, the write lock could be acquired only after the Factory() call has succeeded.
  • If at most one Factory() call is allowed per ScalerBuilder, then each ScalerBuilder would need its own mutex (roughly sketched below).
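
A rough sketch of the second option with a per-builder lock. The refreshMutex field is an assumption, not something in this PR, the other field types are illustrative, and builders would then have to be handled by pointer so the mutex is never copied:

type ScalerBuilder struct {
	Scaler  scalers.Scaler
	Factory func() (scalers.Scaler, *scalersconfig.ScalerConfig, error)
	// refreshMutex serializes refreshes of this single builder: at most one
	// Factory() call runs at a time, without blocking reads or refreshes of
	// the other scalers in the cache.
	refreshMutex sync.Mutex
}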

The other problem is that the same lock is acquired twice in this part of the code. Inlining getScalerBuilder() and eliding error handling, we roughly have:

c.mutex.RLock()
oldSb := c.Scalers[index]
c.mutex.RUnlock()

c.mutex.Lock()
c.Scalers[index] = newSb
oldSb.Scaler.Close(ctx)
c.mutex.Unlock()

This still makes it possible for multiple concurrent tasks to close the same Scaler, and for the current Scaler to be left unclosed. Here's an example of interleaved execution where two goroutines read an old scaler 0xa and each register their own scaler 0xb or 0xc:

goroutine 1                       goroutine 2
RLock()
oldSb := 0xa
RUnlock()
                                  RLock()
                                  oldSb := 0xa
                                  RUnlock()
Lock()
c.Scalers[index] = 0xb
oldSb.Close(ctx)
Unlock()
                                  Lock()
                                  c.Scalers[index] = 0xc
                                  oldSb.Close(ctx)
                                  Unlock()

Here, scaler 0xa has been closed twice, scaler 0xb has been leaked without closing it, and 0xc is the new scaler instance.

One way to solve this is to only replace the scaler if the current scaler is still the old one. Roughly:

Lock()
defer Unlock()
if c.Scalers[index] != oldSb {
  // detected concurrent modification, don't actually perform the refresh
  newScaler.Close(ctx)
  return c.Scalers[index].Scaler
}
c.Scalers[index] = ...
oldSb.Scaler.Close(ctx)
return newScaler

An alternative would be to take the scaler to be closed directly from c.Scalers, using the previously read oldSb only to find the Factory:

Lock()
defer Unlock()
// load oldSb again in case it has changed
oldSb = c.Scalers[index]
c.Scalers[index] = ...
oldSb.Scaler.Close(ctx)
return newScaler
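
Putting both ideas together, a rough end-to-end sketch of the first variant. The names follow the snippets above, the ScalerBuilder fields and the method signature are assumptions, and the equality check assumes Scaler values are comparable (e.g. pointers); this is not the code as merged:

func (c *ScalersCache) refreshScaler(ctx context.Context, index int) (scalers.Scaler, error) {
	oldSb, err := c.getScalerBuilder(index)
	if err != nil {
		return nil, err
	}

	// Build the replacement outside the write lock so a slow Factory()
	// cannot block readers of the cache.
	newScaler, sConfig, err := oldSb.Factory()
	if err != nil {
		return nil, err
	}

	c.mutex.Lock()
	defer c.mutex.Unlock()

	// Re-check the bounds under the write lock: the slice may have changed.
	if index < 0 || index >= len(c.Scalers) {
		newScaler.Close(ctx)
		return nil, fmt.Errorf("scaler with id %d not found. Len = %d", index, len(c.Scalers))
	}

	// Detect a concurrent refresh: only swap if the slot still holds the
	// scaler we started from.
	if c.Scalers[index].Scaler != oldSb.Scaler {
		newScaler.Close(ctx)
		return c.Scalers[index].Scaler, nil
	}

	oldSb.Scaler.Close(ctx)
	c.Scalers[index] = ScalerBuilder{Scaler: newScaler, ScalerConfig: *sConfig, Factory: oldSb.Factory}
	return newScaler, nil
}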

@@ -40,6 +41,7 @@ type ScalersCache struct {
 	ScalableObjectGeneration int64
 	Recorder                 record.EventRecorder
 	CompiledFormula          *vm.Program
+	mutex                    sync.RWMutex

It may be desirable to make any fields private that should only be accessed while the mutex is locked. In particular, Scalers cannot be accessed safely in other parts of the code.
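
For illustration, a hypothetical sketch of a private slice guarded by the mutex plus a locking accessor (the field and method names here are illustrative, not the current API):

type ScalersCache struct {
	// scalers is private: it may only be touched while mutex is held.
	scalers []ScalerBuilder
	mutex   sync.RWMutex
	// ... other fields unchanged
}

// Scalers returns a snapshot copy, so callers never observe a slice that a
// concurrent refresh is mutating.
func (c *ScalersCache) Scalers() []ScalerBuilder {
	c.mutex.RLock()
	defer c.mutex.RUnlock()
	out := make([]ScalerBuilder, len(c.scalers))
	copy(out, c.scalers)
	return out
}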

@JorTurFer
Member Author

JorTurFer commented Nov 5, 2024

> Thank you very much for this fix, this looks like a good safety improvement!
>
> However, I think I may have found some edge cases where these locks don't work quite as expected.

Those are valid points, but we are thinking about doing a big internal refactor because we have found some issues related to technical debt accumulated over the years (and this issue is just one example of that). I'm still working on a formal proposal to open an issue, but I think all these problems will be solved by a redesign of the current internals with a less coupled approach.
That said, thanks a lot for the feedback ❤️, seeing people engaged in making KEDA perform better is something we love to see 😄

The goal of this PR was just to mitigate the panics for the next release (we'll cut a release this week), and hopefully the following one will include this refactor.

@zroubalik
Member

@latk it would be great if you could draft a PR with the additional checks you proposed, regardless of any future refactor :)

Thanks!

@latk

latk commented Nov 14, 2024

> it would be great if you could draft a PR with the additional checks you proposed, regardless of any future refactor :)

Unfortunately I have no "tuits" left for picking up Go and seeing a PR through to the end. But if someone tags me on related PRs or discussions in the future, there's a 50% chance I'll get nerd-sniped into contributing a code review within 48 hours.

mpechner-akasa pushed a commit to nrichardson-akasa/keda that referenced this pull request Nov 29, 2024

Successfully merging this pull request may close these issues.

double close in refreshScalers()
4 participants