Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[release-1.5] 🌱 ClusterCacheTracker: ensure Get/List calls are not getting stuck when apiserver is unreachable #9030

Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
37 changes: 37 additions & 0 deletions controllers/remote/cluster_cache_tracker.go
Original file line number Diff line number Diff line change
Expand Up @@ -457,6 +457,13 @@ func (t *ClusterCacheTracker) createClient(ctx context.Context, config *rest.Con
return nil, nil, nil, fmt.Errorf("failed waiting for cache for remote cluster %v to sync: %w", cluster, cacheCtx.Err())
}

// Wrap the cached client with a client that sets timeouts on all Get and List calls
// If we don't set timeouts here Get and List calls can get stuck if they lazily create a new informer
// and the informer than doesn't sync because the workload cluster apiserver is not reachable.
// An alternative would be to set timeouts in the contexts we pass into all Get and List calls.
// It should be reasonable to have Get and List calls timeout within the duration configured in the restConfig.
cachedClient = newClientWithTimeout(cachedClient, config.Timeout)

// Start cluster healthcheck!!!
go t.healthCheckCluster(cacheCtx, &healthCheckInput{
cluster: cluster,
Expand Down Expand Up @@ -656,3 +663,33 @@ func (t *ClusterCacheTracker) healthCheckCluster(ctx context.Context, in *health
t.deleteAccessor(ctx, in.cluster)
}
}

// newClientWithTimeout returns a new client which sets the specified timeout on all Get and List calls.
// If we don't set timeouts here Get and List calls can get stuck if they lazily create a new informer
// and the informer than doesn't sync because the workload cluster apiserver is not reachable.
// An alternative would be to set timeouts in the contexts we pass into all Get and List calls.
func newClientWithTimeout(client client.Client, timeout time.Duration) client.Client {
return clientWithTimeout{
Client: client,
timeout: timeout,
}
}

type clientWithTimeout struct {
client.Client
timeout time.Duration
}

var _ client.Client = &clientWithTimeout{}

func (c clientWithTimeout) Get(ctx context.Context, key client.ObjectKey, obj client.Object, opts ...client.GetOption) error {
ctx, cancel := context.WithTimeout(ctx, c.timeout)
defer cancel()
return c.Client.Get(ctx, key, obj, opts...)
}

func (c clientWithTimeout) List(ctx context.Context, list client.ObjectList, opts ...client.ListOption) error {
ctx, cancel := context.WithTimeout(ctx, c.timeout)
defer cancel()
return c.Client.List(ctx, list, opts...)
}