[release-1.5] 🌱 ClusterCacheTracker: fix accessor deletion on health check failure #9031

Merged
16 changes: 10 additions & 6 deletions controllers/remote/cluster_cache_tracker.go
@@ -654,14 +654,18 @@ func (t *ClusterCacheTracker) healthCheckCluster(ctx context.Context, in *health
 	}
 
 	err := wait.PollUntilContextCancel(ctx, in.interval, true, runHealthCheckWithThreshold)
-	// An error returned implies the health check has failed a sufficient number of
-	// times for the cluster to be considered unhealthy
-	// NB. we are ignoring ErrWaitTimeout because this error happens when the channel is close, that in this case
-	// happens when the cache is explicitly stopped.
-	if err != nil && !wait.Interrupted(err) {
+	// An error returned implies the health check has failed a sufficient number of times for the cluster
+	// to be considered unhealthy or the cache was stopped and thus the cache context canceled (we pass the
+	// cache context into wait.PollUntilContextCancel).
+	// NB. Log all errors that occurred even if this error might just be from a cancel of the cache context
+	// when the cache is stopped. Logging an error in this case is not a problem and makes debugging easier.
+	if err != nil {
 		t.log.Error(err, "Error health checking cluster", "Cluster", klog.KRef(in.cluster.Namespace, in.cluster.Name))
-		t.deleteAccessor(ctx, in.cluster)
 	}
+	// Ensure in any case that the accessor is deleted (even if it is a no-op).
+	// NB. It is crucial to ensure the accessor was deleted, so it can be later recreated when the
+	// cluster is reachable again
+	t.deleteAccessor(ctx, in.cluster)
 }
 
 // newClientWithTimeout returns a new client which sets the specified timeout on all Get and List calls.