I am seeing spurious panics where Keda tries to close an already-closed scaler. This points towards slightly incorrect handling of failing scalers in refreshScaler().
Expected Behavior
Scalers have a clear lifecycle and are always closed exactly once. If scalers fail, they are restarted safely.
Actual Behavior
Scalers can be double-closed, causing a panic.
I stumbled across this problem while investigating a slightly different error: some scalers seem to occasionally stop producing new metrics. This seems to be correlated with transient scaler errors. I believe – but don't have evidence for this – that it relates to the same refreshScaler() code path, just with a single close instead of a double close.
Steps to Reproduce the Problem
Use the Kafka scaler
Be unlucky and experience transient errors with the Kafka broker, causing Keda to go into the refreshScaler() code path multiple times for the same scaler instance.
Logs from KEDA operator
Unfortunately the available logging seems to be misconfigured, and I can't identify the error messages that led up to this failure.
KEDA Version
2.11.1
Kubernetes Version
None
Platform
None
Scaler Details
Kafka
Anything else?
While the error has been observed in Keda 2.11.1, the relevant code hasn't changed since then. Here is the relevant part of the refreshScaler() function (keda/pkg/scaling/cache/scalers_cache.go, lines 149 to 163 at b2ce95d), last changed with PR #4093:
```go
	return nil, fmt.Errorf("scaler with id %d not found, len = %d, cache has been probably already invalidated", id, len(c.Scalers))
}

c.Scalers[id] = ScalerBuilder{
	Scaler:       ns,
	ScalerConfig: *sConfig,
	Factory:      sb.Factory,
}
```
One interpretation: resource handling problem.
There are two relevant resources:
the old sb.Scaler, which is always closed via the defer
the new scaler ns, which is never closed
This is fine along the happy path: the scaler-builder is replaced with the new scaler, and the old scaler is closed.
However, things break if sb.Factory() returns an error, or if the bounds check fails:
the old scaler is not replaced but is still closed via the defer, leaving a closed scaler in the cache
the new scaler ns (if it was created at all) is never closed
The next time metrics are fetched from the (now closed) scaler, there will be errors, and we will enter the refreshScaler() function again, and will close the old scaler for the second time.
I'm not experienced enough in Go to know how to fix this. I assume that the sb.Scaler.Close() should be moved to after registering the new ScalerBuilder, essentially reverting that part of #4093. There should probably also be an ns.Close() when returning an error from the bounds check. I am also confused by the existence of the bounds check, as a similar check is done a few lines before the quoted snippet. The second bounds check – and with it the diverging control flow – would probably be unnecessary if the sb object were mutated in place. Sketch:
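(This is only a sketch of the idea; I'm assuming the surrounding function signature and field names from the quoted excerpt, so treat it as an illustration rather than a tested patch.)

```go
// Sketch only: signature and field names are assumed from the quoted excerpt.
func (c *ScalersCache) refreshScaler(ctx context.Context, id int) (scalers.Scaler, error) {
	if id < 0 || id >= len(c.Scalers) {
		return nil, fmt.Errorf("scaler with id %d not found, len = %d", id, len(c.Scalers))
	}

	// Take a pointer so the cache entry can be updated in place,
	// which makes the second bounds check unnecessary.
	sb := &c.Scalers[id]

	ns, sConfig, err := sb.Factory()
	if err != nil {
		// The old scaler stays open and stays in the cache.
		return nil, err
	}

	// Close the old scaler only after the replacement has been registered,
	// so a closed scaler never remains in the cache.
	oldScaler := sb.Scaler
	sb.Scaler = ns
	sb.ScalerConfig = *sConfig
	oldScaler.Close(ctx)

	return ns, nil
}
```

With sb being a pointer into c.Scalers, both failure modes from above disappear: a factory error leaves the old, still-open scaler in the cache, and there is no second bounds check whose error path could leak ns.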
Another interpretation: concurrency problem.
Reading through the Sarama client.Close() code, it is only possible for this channel-close to panic if the Close() method is invoked multiple times concurrently. The remainder of the Close() method is protected by a lock. This in turn could happen if Keda can invoke the same Scaler.Close() concurrently.
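For reference, this is the failure mode in isolation (a standalone toy example, nothing Sarama-specific): closing an already-closed Go channel always panics.

```go
package main

func main() {
	done := make(chan struct{})
	close(done)
	close(done) // panic: close of closed channel
}
```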
The bounds check that was added in #4093 seems to try to defend against concurrent modification of the ScalersCache, but it's not entirely safe because the check + update is not atomic. The bounds check just makes it less likely that concurrent modification can crash the program.
An actual fix would likely require introducing a lock for ScalersCache modifications, in particular for the ScalersCache.refreshScaler() and ScalersCache.Close() operations. It could also help to make the ScalersCache.Scalers field private.
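Something along these lines, perhaps. This is only a sketch of the locking discipline: the mu field and the refreshScalerLocked() helper are made up for illustration, and the real ScalersCache has more fields and methods than shown here.

```go
// Sketch only: refreshScalerLocked stands in for the existing refresh logic
// and is a hypothetical helper.
type ScalersCache struct {
	mu      sync.Mutex
	scalers []ScalerBuilder // unexported, so all access goes through locked methods
}

func (c *ScalersCache) refreshScaler(ctx context.Context, id int) (scalers.Scaler, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	// Bounds check, Factory() call, swap and Close of the old scaler all run
	// under the lock, so the check + update pair is atomic and two goroutines
	// can never close the same scaler instance.
	return c.refreshScalerLocked(ctx, id)
}

func (c *ScalersCache) Close(ctx context.Context) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, sb := range c.scalers {
		sb.Scaler.Close(ctx)
	}
	// Drop the entries so a late refreshScaler() fails fast instead of
	// double-closing anything.
	c.scalers = nil
}
```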
Thanks for reporting this issue and also thanks for the research 🙇
I've checked the sarama client's code and the only scenario that I can see is the concurrent one because, as you said, the sarama client protects itself.
I think that the problem could be a concurrent retry. Nevertheless, I think that defer sb.Scaler.Close(ctx) has to be moved too, so that closed scalers are not kept in the cache.