Allow configuring kube client timeouts #8864
Comments
/triage accepted
Sounds like a good idea IMO to make this client configurable.
I think we should clarify the ask here a little bit. In CAPI we have two types of clients: the clients reading/writing to the management cluster, and the clients reading/writing to the workload clusters. AFAIK, as of today we only allow setting QPS and burst for the first set of clients, while we do not provide any form of configuration for the clients reading/writing to the workload clusters, which are the ones used for drain. Also, draining a node is a very special case, where we delegate timeout management to the drainer itself, and that timeout is currently hard-coded to 20s (https://github.com/fabriziopandini/cluster-api/blob/c1ef1d779998f3489889aa34be42e33c7c336e03/internal/controllers/machine/machine_controller.go#L603-L625).
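For readers who don't have the linked code open: the hard-coded 20s value mentioned above is the overall drain timeout handed to the drainer, not the HTTP client timeout. A minimal sketch of how such a drain timeout is typically wired through kubectl's drain helper follows; this is illustrative only and not the exact CAPI source, though the field names come from k8s.io/kubectl/pkg/drain.

```go
package drainexample

import (
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/kubectl/pkg/drain"
)

// buildDrainer shows where an overall drain timeout lives: on the drain helper,
// separate from (and in addition to) the HTTP client timeout on the underlying client.
func buildDrainer(kubeClient kubernetes.Interface) *drain.Helper {
	return &drain.Helper{
		Client:              kubeClient,
		Force:               true,
		IgnoreAllDaemonSets: true,
		GracePeriodSeconds:  -1,                // use each pod's own grace period
		Timeout:             20 * time.Second,  // overall drain timeout, analogous to the hard-coded CAPI value
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}
}
```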
The ask is about configuring the HTTP client timeout for the workload cluster kubernetes client that drains nodes, as written in the description. However, we should consider making the timeout configurable for other workload cluster clients as well. The kubernetes config created for draining nodes is the same config created for accessing workload clusters, here: https://github.com/fabriziopandini/cluster-api/blob/c1ef1d779998f3489889aa34be42e33c7c336e03/internal/controllers/machine/machine_controller.go#L582 That kubernetes config always has a 10s golang HTTP client timeout configured: https://github.com/fabriziopandini/cluster-api/blob/c1ef1d779998f3489889aa34be42e33c7c336e03/controllers/remote/cluster.go#L52-L64. We see log messages like the following when our clusters take longer than 10s to respond to eviction requests:
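To make the hard-coded timeout concrete, here is a rough sketch of the pattern in controllers/remote/cluster.go (names are illustrative, not a verbatim copy of the CAPI source): the kubeconfig-derived rest.Config gets a fixed client timeout before any client is built from it, so every workload cluster client inherits the 10s limit.

```go
package remoteexample

import (
	"time"

	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

const defaultClientTimeout = 10 * time.Second // the hard-coded value discussed in this issue

// restConfigFromKubeconfig mirrors the shape of the remote-cluster config helper:
// every client built from the returned config inherits the 10s golang HTTP client timeout.
func restConfigFromKubeconfig(kubeconfig []byte) (*rest.Config, error) {
	restConfig, err := clientcmd.RESTConfigFromKubeConfig(kubeconfig)
	if err != nil {
		return nil, err
	}
	restConfig.Timeout = defaultClientTimeout
	return restConfig, nil
}
```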
The 20s pod eviction timeout isn't relevant; the timeout we are hitting in larger clusters is the 10s timeout waiting for the API server to respond with headers to the request.
So the API server is not able to list the Pods of a Node within 10 seconds. Got it. Now it makes more sense to me; based on the issue description I thought the actual eviction call for the Pod doesn't finish within 10 seconds, which would have surprised me. Mostly out of curiosity, do you know how long this call takes to respond for those clusters?

Coming back to making the timeout configurable. First, I would suggest that we start using the rest config from the ClusterCacheTracker consistently. Unfortunately we currently have a mix of using the ClusterCacheTracker and RESTConfig directly. This is not great for scalability, as RESTConfig requires a call to the apiserver while the ClusterCacheTracker already caches the config. I opened a PR for that: #8894

I think we now have different options to affect the timeout used during drain:
The problem with the first option is that if you increase the timeout used for the entire ClusterCacheTracker / all clients we use for the workload clusters, reconcilers can get stuck when workload clusters are not reachable. Let's assume you set the timeout to a minute and some of your workload clusters become unreachable. The result is that, either during client creation or in any other server-side call, reconcilers can get stuck for up to this timeout (a minute). The consequence is that some of the workers used by the reconcilers can't be used to reconcile any of the healthy clusters. We have already seen this in production; the result is basically that the entire core CAPI controller can get stuck unless you use a worker count equal to the number of your Machines.

Given this, I think option 2 is probably the better solution for your problem, as with option 1 actually configuring a high timeout comes with severe trade-offs.
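One way a drain-specific timeout could look in practice (illustrative only, not the agreed design): derive a drain-only client from a copy of the cached rest config, so the shared workload cluster clients keep the short timeout while only drain/eviction calls get the longer one. The drainClientTimeout parameter here is a hypothetical, configurable value.

```go
package drainexample

import (
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// drainClientFor builds a clientset used only for node drain. Copying the config
// keeps the existing 10s timeout on the shared workload-cluster clients while
// allowing a longer timeout for the (potentially slow) pod list/eviction calls.
func drainClientFor(cached *rest.Config, drainClientTimeout time.Duration) (kubernetes.Interface, error) {
	cfg := rest.CopyConfig(cached)
	cfg.Timeout = drainClientTimeout
	return kubernetes.NewForConfig(cfg)
}
```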
Thanks for the context, it really helped me. The only caveat here is what kind of UX we are thinking of here, given that this is something that could go at the Machine level, and how this relates to the existing nodeDrainTimeout.
10-11s. We are barely hitting the timeout, and only in our largest cluster, with 30-40k pods and 600-800 nodes (the cluster autoscales up and down daily). I can provide more metrics from the controllers if there are specific metrics that are of interest. Usually we are barely missing the timeout, and some of the time retrying succeeds, but the fact that retrying has to happen at all slows down the MachineDeployment rollout.
... I see. I was a bit confused by the way the kubernetes config was used, and this answers part of my confusion.
Good point, #8865 changes the option globally, but really the ask here is only about the machine controller. So, putting together the comments on #8865 and here, would a better approach be to start over and define some sort of
Thx for the data, I just wanted to get a feeling for the scenario / the latency.

Okay, good, so it sounds like we have consensus on only modifying the drain timeout. Just FYI, in KCP we already have a case where we hard-code a timeout to 30s, so we could consider hard-coding the drain client timeout to 20 or 30 seconds.

If we want to make it configurable, we have two options: either a command-line flag on the core controller or a field on the Machine CRD. On one side, the Machine CRD already has 3 timeout fields. On the other side, controllers already have the mgmt cluster QPS flags (the KCP controller also has etcd timeout flags). Maybe a flag is the best option if it's not strictly necessary to fine-tune the timeout per Cluster/MD. A CRD field would have to be added to ClusterClass, Cluster.spec.topology, KCP, MD, MS, Machine and MachinePool. (I prefer the flag option)

If we add it as a flag, I think we don't need a util for this flag as we would add it only to the core controller. A separate PR to introduce a flag util for the QPS flags would be nice though.
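If the flag option were chosen, it could be registered next to the existing QPS/burst flags on the core controller. A minimal sketch follows; the flag name --drain-client-timeout and its default are hypothetical, since no name was agreed in this thread.

```go
package flagsexample

import (
	"time"

	"github.com/spf13/pflag"
)

var drainClientTimeout time.Duration

// InitDrainFlags registers a hypothetical timeout flag on the core controller's
// flag set, alongside the existing QPS/burst flags.
func InitDrainFlags(fs *pflag.FlagSet) {
	fs.DurationVar(&drainClientTimeout, "drain-client-timeout", 10*time.Second,
		"HTTP client timeout used by the machine controller when draining workload cluster nodes")
}
```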
I'm ok with the flag (long version). I would defer changing the API until we have a clear use case for it, but it is probably worth opening an issue of kind
@cnmcavoy I'm currently refactoring the drain logic. Part of that is getting rid of the separate kubernetes client we use for drain. Instead I'll use the client we always use when communicating with workload clusters (the client cached in the ClusterCacheTracker). This client currently uses a hard-coded 10s timeout, so if I just change it that way without further changes we'll hit a regression regarding this issue. But I noticed that our Pod list call wasn't using pagination. Do you think a 10s timeout could be enough when using pagination (Limit/Continue) when listing Pods?
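For reference, a paginated Pod list keeps each request small so it is more likely to finish within the 10s client timeout, at the cost of extra round trips. A minimal sketch using client-go directly (the actual drain code may use a different client; the page size of 100 is an illustrative choice, not a CAPI value):

```go
package drainexample

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listPodsForNode pages through the Pods scheduled on a node in chunks,
// following the Continue token, instead of issuing one large unpaginated LIST.
func listPodsForNode(ctx context.Context, c kubernetes.Interface, nodeName string) ([]corev1.Pod, error) {
	var pods []corev1.Pod
	opts := metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
		Limit:         100, // illustrative page size
	}
	for {
		list, err := c.CoreV1().Pods(metav1.NamespaceAll).List(ctx, opts)
		if err != nil {
			return nil, err
		}
		pods = append(pods, list.Items...)
		if list.Continue == "" {
			return pods, nil
		}
		opts.Continue = list.Continue
	}
}
```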
What would you like to be added (User Story)?
As a cluster administrator, I want to be able to configure the timeout for the kubernetes client configuration instead of accepting the default hardcoded 10s timeout.
Detailed Description
The kube client configuration values used by Cluster API should be configurable via CLI flags at startup. The QPS and Burst settings are already controllable by the user, but the timeout is hardcoded to 10s.
This timeout is used in various places, notably in the CAPI drainNode logic, where in larger clusters the eviction API plus latency may sometimes result in the control plane taking more than 10s to begin responding to requests. When the API server fails to respond in time, Cluster API gives up and retries immediately in another reconcile loop, which often increases the load on the control plane. I would like to be able to configure a longer timeout for these clusters.

Anything else you would like to add?
No response
Label(s) to be applied
/kind feature