Watches on remote cluster expires every 10s #8893

fabriziopandini · 2023-06-20T13:56:58Z

What steps did you take and what happened?

When we create watches on the remote cluster using the cluster cache tracker, watches are created using the same rest config that the cache uses, and currently this rest config has a timeout of 10s.

This leads to having watches on remote clusters expire every 10s, and even if this doesn't seem an issue thanks to watch bookmarks, this isn't ideal and we should figure out a way to have a longer timeout for the cache than the one used for the remote clients.

What did you expect to happen?

Cluster API to create watches on remote machines with longer timeouts

Cluster API version

main, old versions

Kubernetes version

No response

Anything else you would like to add?

The tricky part of this is that we want to keep the 10-second/short timeout for the remote client, and we also want to use it while creating a new remote client because this timeout is blocking a reconcile loop in a synchronous way.

However, unfortunately, when creating a new remote client we are also creating indexes on the cache, and thus a longer timeout on the cache will block unless we find a way to initiate indexes asyc

Label(s) to be applied

/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

sbueringer · 2023-06-20T14:02:31Z

I think the problem is bigger than just us adding indexes on client creation. The same can happen for every Get/List call in our reconcilers. All of them can trigger the creation of an informer.

I think we have to take a closer look how all of this works in CR. I think there could be a way to use a different http client for the synchronous calls during informer creation vs ListWatch. Code is roughly here: https://github.com/kubernetes-sigs/controller-runtime/blob/main/pkg/cache/internal/informers.go#L377-L503

This could e.g. work by passing into the cache:

a mapper with a lower timeout
an http client with a higher timeout

POC: https://github.com/sbueringer/cluster-api/pull/151/files#

But not sure if that covers everything relevant. It should be definitely cleaner to implement this in CR. I think my current POC is just exploting the current behavior which is not guaranteed to stay the same.

sbueringer · 2023-06-20T14:05:20Z

/triage accepted

fabriziopandini · 2024-04-11T19:05:10Z

/priority important-soon

fabriziopandini · 2024-05-08T12:48:51Z

/assign @sbueringer
please ping me when you do the investigation

k8s-triage-robot · 2024-08-06T13:06:43Z

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

Confirm that this issue is still relevant with /triage accepted (org members only)
Deprioritize it with /priority important-longterm or /priority backlog
Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

sbueringer · 2024-08-06T13:14:34Z

/triage accepted

k8s-triage-robot · 2024-11-04T13:37:56Z

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

Confirm that this issue is still relevant with /triage accepted (org members only)
Deprioritize it with /priority important-longterm or /priority backlog
Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

sbueringer · 2024-11-04T16:17:09Z

/triage accepted

sbueringer · 2024-12-17T10:46:06Z

Notes to myself. We can play around with the 10s timeout we add to the rest config and the 10s timeout we add in Get/List calls to check if any changes to them affect the 10s timeouts.

k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 20, 2023

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 20, 2023

sbueringer mentioned this issue Jun 20, 2023

Scale testing #8814

Open

27 tasks

killianmuldoon mentioned this issue Jun 20, 2023

[WIP] 🐛 Set defaultClientTimeout to 0 for external clients #8882

Closed

fabriziopandini mentioned this issue Jun 21, 2023

Allow configuring kube client timeouts #8864

Closed

k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Apr 11, 2024

k8s-ci-robot assigned sbueringer May 8, 2024

k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Aug 6, 2024

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 6, 2024

k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Nov 4, 2024

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Watches on remote cluster expires every 10s #8893

Watches on remote cluster expires every 10s #8893

fabriziopandini commented Jun 20, 2023 •

edited

Loading

sbueringer commented Jun 20, 2023 •

edited

Loading

sbueringer commented Jun 20, 2023

fabriziopandini commented Apr 11, 2024

fabriziopandini commented May 8, 2024

k8s-triage-robot commented Aug 6, 2024

sbueringer commented Aug 6, 2024

k8s-triage-robot commented Nov 4, 2024

sbueringer commented Nov 4, 2024

sbueringer commented Dec 17, 2024

Watches on remote cluster expires every 10s #8893

Watches on remote cluster expires every 10s #8893

Comments

fabriziopandini commented Jun 20, 2023 • edited Loading

What steps did you take and what happened?

What did you expect to happen?

Cluster API version

Kubernetes version

Anything else you would like to add?

Label(s) to be applied

sbueringer commented Jun 20, 2023 • edited Loading

sbueringer commented Jun 20, 2023

fabriziopandini commented Apr 11, 2024

fabriziopandini commented May 8, 2024

k8s-triage-robot commented Aug 6, 2024

sbueringer commented Aug 6, 2024

k8s-triage-robot commented Nov 4, 2024

sbueringer commented Nov 4, 2024

sbueringer commented Dec 17, 2024

fabriziopandini commented Jun 20, 2023 •

edited

Loading

sbueringer commented Jun 20, 2023 •

edited

Loading