Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Watches on remote cluster expires every 10s #8893

Open
fabriziopandini opened this issue Jun 20, 2023 · 9 comments
Open

Watches on remote cluster expires every 10s #8893

fabriziopandini opened this issue Jun 20, 2023 · 9 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@fabriziopandini
Copy link
Member

fabriziopandini commented Jun 20, 2023

What steps did you take and what happened?

When we create watches on the remote cluster using the cluster cache tracker, watches are created using the same rest config that the cache uses, and currently this rest config has a timeout of 10s.

This leads to having watches on remote clusters expire every 10s, and even if this doesn't seem an issue thanks to watch bookmarks, this isn't ideal and we should figure out a way to have a longer timeout for the cache than the one used for the remote clients.

What did you expect to happen?

Cluster API to create watches on remote machines with longer timeouts

Cluster API version

main, old versions

Kubernetes version

No response

Anything else you would like to add?

The tricky part of this is that we want to keep the 10-second/short timeout for the remote client, and we also want to use it while creating a new remote client because this timeout is blocking a reconcile loop in a synchronous way.

However, unfortunately, when creating a new remote client we are also creating indexes on the cache, and thus a longer timeout on the cache will block unless we find a way to initiate indexes asyc

Label(s) to be applied

/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 20, 2023
@sbueringer
Copy link
Member

sbueringer commented Jun 20, 2023

I think the problem is bigger than just us adding indexes on client creation. The same can happen for every Get/List call in our reconcilers. All of them can trigger the creation of an informer.

I think we have to take a closer look how all of this works in CR. I think there could be a way to use a different http client for the synchronous calls during informer creation vs ListWatch. Code is roughly here: https://github.com/kubernetes-sigs/controller-runtime/blob/main/pkg/cache/internal/informers.go#L377-L503

This could e.g. work by passing into the cache:

  • a mapper with a lower timeout
  • an http client with a higher timeout

POC: https://github.com/sbueringer/cluster-api/pull/151/files#

But not sure if that covers everything relevant. It should be definitely cleaner to implement this in CR. I think my current POC is just exploting the current behavior which is not guaranteed to stay the same.

@sbueringer
Copy link
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 20, 2023
@sbueringer sbueringer mentioned this issue Jun 20, 2023
27 tasks
@fabriziopandini
Copy link
Member Author

/priority important-soon

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Apr 11, 2024
@fabriziopandini
Copy link
Member Author

/assign @sbueringer
please ping me when you do the investigation

@k8s-triage-robot
Copy link

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority important-longterm or /priority backlog
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Aug 6, 2024
@sbueringer
Copy link
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 6, 2024
@k8s-triage-robot
Copy link

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority important-longterm or /priority backlog
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Nov 4, 2024
@sbueringer
Copy link
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 4, 2024
@sbueringer
Copy link
Member

Notes to myself. We can play around with the 10s timeout we add to the rest config and the 10s timeout we add in Get/List calls to check if any changes to them affect the 10s timeouts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

No branches or pull requests

4 participants