
fix: Use mutex in the scaler cache to prevent concurrent refreshings #6309

Merged: 2 commits merged into kedacore:main on Nov 4, 2024

Conversation

JorTurFer
Member

To prevent unexpected access to the scalers inside the scaler cache, this PR adds a mutex that blocks operations on the cache while it is being modified.

Checklist

  • Commits are signed with Developer Certificate of Origin (DCO - learn more)

Fixes #6273

@JorTurFer JorTurFer requested a review from a team as a code owner November 3, 2024 21:19
@JorTurFer
Member Author

JorTurFer commented Nov 4, 2024

/run-e2e
Update: You can check the progress here

Signed-off-by: Jorge Turrado <[email protected]>
@JorTurFer
Member Author

JorTurFer commented Nov 4, 2024

/run-e2e
Update: You can check the progress here

Member

@wozniakjan wozniakjan left a comment

lgtm

@JorTurFer JorTurFer merged commit 5ea1eac into kedacore:main Nov 4, 2024
19 checks passed
@JorTurFer JorTurFer deleted the add-mutex branch November 4, 2024 10:30

@latk latk left a comment

Thank you very much for this fix, this looks like a good safety improvement!

However, I think I may have found some edge cases where these locks don't work quite as expected.

Comment on lines +70 to +76
if index < 0 || index >= len(c.Scalers) {
return ScalerBuilder{}, fmt.Errorf("scaler with id %d not found. Len = %d", index, len(c.Scalers))
}

c.mutex.RLock()
defer c.mutex.RUnlock()
return c.Scalers[index], nil

This bounds check should probably be moved until after the lock has been acquired. The c.Scalers could have been modified in between the check and the time of use.

Delaying the bounds check until after the lock should have no performance impact, because bounds check failures are very unlikely and the rlock is unlikely to be contended.

However, the code as written is still safe in the sense that refreshes will work. It's just that fetching a ScalerBuilder can still panic, despite the check.
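
A minimal sketch of that reordering, keeping the names from the snippet above (the surrounding method signature is assumed, not taken from the PR):

func (c *ScalersCache) getScalerBuilder(index int) (ScalerBuilder, error) {
	// Take the read lock first so the bounds check and the slice access
	// observe the same snapshot of c.Scalers.
	c.mutex.RLock()
	defer c.mutex.RUnlock()

	if index < 0 || index >= len(c.Scalers) {
		return ScalerBuilder{}, fmt.Errorf("scaler with id %d not found. Len = %d", index, len(c.Scalers))
	}
	return c.Scalers[index], nil
}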

Comment on lines +166 to +174

+	oldSb, err := c.getScalerBuilder(index)
+	if err != nil {
+		return nil, err
+	}
+
-	sb := c.Scalers[id]
-	defer sb.Scaler.Close(ctx)
-	ns, sConfig, err := sb.Factory()
+	c.mutex.Lock()
+	defer c.mutex.Unlock()
+
+	newScaler, sConfig, err := oldSb.Factory()

There are two problems here. This use of a lock prevents concurrent refreshes, but it does not prevent multiple refreshes of the same scaler, and it may unacceptably impact the availability of a KEDA instance.

Acquiring the write lock means that all GetMetricSpecForScaling() and GetMetricSpecForScalingForScaler() calls will be locked out until the Factory() has completed, because those need a read lock on the same mutex. If the Factory hangs (e.g. because it has to interact with another service over an unreliable network), KEDA will be effectively suspended.

  • If concurrent Factory() calls can be tolerated, the write lock could be acquired only after the Factory() call has succeeded.
  • If at most one Factory() call is allowed per ScalerBuilder, then each ScalerBuilder would need its own mutex (roughly sketched below).
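
A rough sketch of the second option with a per-builder lock. The refreshMutex field is an assumption, not something in this PR, the other field types are illustrative, and builders would then have to be handled by pointer so the mutex is never copied:

type ScalerBuilder struct {
	Scaler  scalers.Scaler
	Factory func() (scalers.Scaler, *scalersconfig.ScalerConfig, error)
	// refreshMutex serializes refreshes of this single builder: at most one
	// Factory() call runs at a time, without blocking reads or refreshes of
	// the other scalers in the cache.
	refreshMutex sync.Mutex
}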

The other problem is that the same lock is acquired twice in this part of the code. Inlining getScalerBuilder() and eliding error handling, we roughly have:

c.mutex.RLock()
oldSb := c.Scalers[index]
c.mutex.RUnlock()

c.mutex.Lock()
c.Scalers[index] = newSb
oldSb.Scaler.Close(ctx)
c.mutex.Unlock()

This still makes it possible for multiple concurrent tasks to close the same Scaler, and for the current Scaler to be left unclosed. Here's an example of interleaved execution where two goroutines read an old scaler 0xa and each register their own scaler 0xb or 0xc:

goroutine 1                       goroutine 2
RLock()
oldSb := 0xa
RUnlock()
                                  RLock()
                                  oldSb := 0xa
                                  RUnlock()
Lock()
c.Scalers[index] = 0xb
oldSb.Close(ctx)
Unlock()
                                  Lock()
                                  c.Scalers[index] = 0xc
                                  oldSb.Close(ctx)
                                  Unlock()

Here, scaler 0xa has been closed twice, scaler 0xb has been leaked without closing it, and 0xc is the new scaler instance.

One way to solve this is to only replace the scaler if the current scaler is still the old one. Roughly:

Lock()
defer Unlock()
if c.Scalers[index] != oldSb {
  // detected concurrent modification, don't actually perform the refresh
  newScaler.Close(ctx)
  return c.Scalers[index].Scaler
}
c.Scalers[index] = ...
oldSb.Scaler.Close(ctx)
return newScaler

An alternative would be to take the scaler to be closed directly from c.Scalers, using the previously read oldSb only to find the Factory:

Lock()
defer Unlock()
// load oldSb again in case it has changed
oldSb = c.Scalers[index]
c.Scalers[index] = ...
oldSb.Scaler.Close(ctx)
return newScaler
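
Putting both ideas together, a rough end-to-end sketch of the first variant. The names follow the snippets above, the ScalerBuilder fields and the method signature are assumptions, and the equality check assumes Scaler values are comparable (e.g. pointers); this is not the code as merged:

func (c *ScalersCache) refreshScaler(ctx context.Context, index int) (scalers.Scaler, error) {
	oldSb, err := c.getScalerBuilder(index)
	if err != nil {
		return nil, err
	}

	// Build the replacement outside the write lock so a slow Factory()
	// cannot block readers of the cache.
	newScaler, sConfig, err := oldSb.Factory()
	if err != nil {
		return nil, err
	}

	c.mutex.Lock()
	defer c.mutex.Unlock()

	// Re-check the bounds under the write lock: the slice may have changed.
	if index < 0 || index >= len(c.Scalers) {
		newScaler.Close(ctx)
		return nil, fmt.Errorf("scaler with id %d not found. Len = %d", index, len(c.Scalers))
	}

	// Detect a concurrent refresh: only swap if the slot still holds the
	// scaler we started from.
	if c.Scalers[index].Scaler != oldSb.Scaler {
		newScaler.Close(ctx)
		return c.Scalers[index].Scaler, nil
	}

	oldSb.Scaler.Close(ctx)
	c.Scalers[index] = ScalerBuilder{Scaler: newScaler, ScalerConfig: *sConfig, Factory: oldSb.Factory}
	return newScaler, nil
}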

@@ -40,6 +41,7 @@ type ScalersCache struct {
 	ScalableObjectGeneration int64
 	Recorder                 record.EventRecorder
 	CompiledFormula          *vm.Program
+	mutex                    sync.RWMutex

It may be desirable to make any fields private that should only be accessed while the mutex is locked. In particular, Scalers cannot be accessed safely in other parts of the code.
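
For illustration, a hypothetical sketch of a private slice guarded by the mutex plus a locking accessor (the field and method names here are illustrative, not the current API):

type ScalersCache struct {
	// scalers is private: it may only be touched while mutex is held.
	scalers []ScalerBuilder
	mutex   sync.RWMutex
	// ... other fields unchanged
}

// Scalers returns a snapshot copy, so callers never observe a slice that a
// concurrent refresh is mutating.
func (c *ScalersCache) Scalers() []ScalerBuilder {
	c.mutex.RLock()
	defer c.mutex.RUnlock()
	out := make([]ScalerBuilder, len(c.scalers))
	copy(out, c.scalers)
	return out
}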

@JorTurFer
Member Author

JorTurFer commented Nov 5, 2024

> Thank you very much for this fix, this looks like a good safety improvement!
>
> However, I think I may have found some edge cases where these locks don't work quite as expected.

Those are valid points, but we are thinking about doing a big internal refactor because we have found some issues related to technical debt accumulated over the years (and this issue is just one example of that). I'm still working on a formal proposal to open an issue, but I think all these problems will be solved by a redesign of the current internals with a less coupled approach.
That said, thanks a lot for the feedback ❤️, seeing people engaged in making KEDA perform better is something we love to see 😄

The goal of this PR was just to mitigate the panics for the next release (we'll cut a release this week), and hopefully the following one will include this refactor.

@zroubalik
Member

@latk it would be great if you could draft a PR with the additional checks you proposed, regardless of any future refactor :)

Thanks!

@latk

latk commented Nov 14, 2024

> it would be great if you could draft a PR with the additional checks you proposed, regardless of any future refactor :)

Unfortunately I have no "tuits" left for picking up Go and seeing a PR through to the end. But if someone tags me on related PRs or discussions in the future, there's a 50% chance I'll get nerd-sniped into contributing a code review within 48 hours.

mpechner-akasa pushed a commit to nrichardson-akasa/keda that referenced this pull request Nov 29, 2024

Successfully merging this pull request may close these issues.

double close in refreshScalers()
4 participants