Split Cluster Autoscaler codebase #5394
Comments
We talked about this at the SIG meeting today and I think there was a good discussion. In general, I think it's a cool idea, but I also share some of the hesitancy that was discussed in the meeting: this is a really big change which could have near-term consequences in the form of bugs and other problems as we figure out how to do the separation, and also longer-term effects around maintenance and involvement.

With that said, I wanted to share some thoughts. In some respects this conversion could look very similar to what happened with the cloud controller managers and their migration from the main Kubernetes repository to external repositories. If we were to approach this topic from that perspective, then I think we would want to start with a KEP describing how the deprecation would happen and on what time frame. After defining the time frame, I think we would also want to follow SIG Cloud Provider's example and demonstrate how the gRPC provider could be used as the mechanism for migrating all the "in-tree" providers to external repositories. IMO, using the gRPC provider as a gateway to the cluster autoscaler seems like an easier path than changing the autoscaler into a more library-ish code package which is then imported by provider-specific implementations. This doesn't completely answer all the questions, especially with regard to involvement and maintenance, but we could demonstrate a very clean workflow for providers to migrate their code into external gRPC providers (sketched below). To go a step further, we could then provide more in-depth Helm charts and guidance on how best to deploy the autoscaler with a gRPC client next to it (although this exists to some level already). We might even see the growth of an operator for deploying the cluster autoscaler based on the need for variance in the underlying cloud provider.

To this point, I have been working with some of the Cluster API community members to improve our Kubemark provider, and we even have a work-in-progress PR that will run some autoscaler testing. I realize that using Kubemark does not fully test all the provider-specific bits, but it would allow us to create a lower-cost method for running pre-submit end-to-end tests, which would catch things like the bug referenced in the quote above.
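For illustration only, here is a minimal sketch (in Go) of the provider side of that gRPC gateway: a small standalone server that the stock cluster-autoscaler image talks to instead of having provider code compiled in. The generated package name and registration call below are assumptions standing in for whatever the externalgrpc protos actually produce, so treat this as the shape of the idea, not the real API.

```go
// Hypothetical sketch of a provider-side gRPC server for the external gRPC
// cloud provider path discussed above. The "providerpb" package and its
// Register function are placeholders; the real service definition lives
// under cluster-autoscaler/cloudprovider/externalgrpc in this repo.
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
	// providerpb "example.com/my-provider/gen/providerpb" // hypothetical generated stubs
)

// providerServer would embed providerpb.UnimplementedCloudProviderServer and
// answer calls such as listing node groups and resizing them, backed by the
// provider's own compute APIs.
type providerServer struct{}

func main() {
	lis, err := net.Listen("tcp", ":8086") // port chosen arbitrarily for the example
	if err != nil {
		log.Fatalf("listen: %v", err)
	}
	s := grpc.NewServer()
	// providerpb.RegisterCloudProviderServer(s, &providerServer{}) // hypothetical registration
	log.Println("external gRPC cloud provider sketch listening on :8086")
	if err := s.Serve(lis); err != nil {
		log.Fatalf("serve: %v", err)
	}
}
```

The autoscaler deployment would then select the external gRPC cloud provider and point it at this endpoint, so the upstream image never needs provider-specific code built in.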
I support a stronger separation between things like scheduling simulations and the cloud providers, but separate repositories sounds like overkill to me. That would certainly increase the release burden significantly, and typically one is only interested in the K8s updates. A mono-repo approach with each cloud provider as a Go module sounds more doable, though, and perhaps an image per cloud provider.
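As a rough illustration of that mono-repo-with-modules idea (the module path and versions below are placeholders, not a proposed layout), each provider directory could carry its own go.mod and pin its own cloud SDK while depending on the core as an ordinary module:

```
// go.mod for a hypothetical per-provider module; module path and versions
// are placeholders. The provider pins its own cloud SDK without that SDK
// ending up in every other provider's (or the core's) dependency graph.
module k8s.io/autoscaler/cluster-autoscaler/cloudprovider/aws

go 1.21

require (
	k8s.io/autoscaler/cluster-autoscaler v1.27.0 // hypothetical released core module version
	github.com/aws/aws-sdk-go v1.44.0            // provider-specific dependency
)
```

A go.work file at the repository root could stitch the core and the provider modules together for local development, and each provider module could then be built into its own image.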
In Gardener we have to maintain a fork of this repo to add our own provider integration. There is also quite an overlap between what Cluster Autoscaler does and what any cloud provider would like to do w.r.t. management of machines and machine groups. This often breaks the single-responsibility principle and causes race conditions. Therefore I also fully support the need to use CA as a standalone library with well-defined APIs, easing its consumption from any provider. This would also ensure that there is just one actor responsible for managing machines and machine groups, bringing determinism w.r.t. the expected behavior.
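To make the "standalone library with well-defined APIs" idea concrete, a minimal sketch follows. The NodeGroup methods are a trimmed paraphrase of the existing cloudprovider.NodeGroup contract (signatures simplified), while the Core interface is purely hypothetical; the point is only that the core makes scale decisions and exactly one provider-side actor mutates machines and machine groups.

```go
// Sketch only: a trimmed-down boundary a library-style autoscaler core could
// expose to providers such as Gardener's machine-controller-manager.
// NodeGroup mirrors (in simplified form) the existing cloudprovider.NodeGroup
// interface; Core is hypothetical.
package autoscaling

// NodeGroup is implemented by the single provider-side actor allowed to
// mutate machines / machine groups.
type NodeGroup interface {
	Id() string
	MinSize() int
	MaxSize() int
	TargetSize() (int, error)
	IncreaseSize(delta int) error
	DeleteNodes(names []string) error // the real API takes node objects; simplified here
}

// Core stands in for the library entry point: it reads cluster state, runs
// the scheduling simulations, and asks the provider to resize groups only
// through NodeGroup.
type Core interface {
	RunOnce(groups []NodeGroup) error
}
```

Everything provider-specific would then sit behind that one interface instead of being compiled into the core binary.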
Just out of curiosity, would it be possible to use the gRPC provider to keep your updates so that you don't need to fork the autoscaler?
Yes, we would like to avoid maintaining a fork. Is the gRPC plugin feature stable? Apologies for my ignorance, but it is still marked as a proposal. Is there a tracking issue for this that has been closed?
While I was the one to initially create this issue, I agree with the concern raised at the SIG meeting that it would cause the support for not-very-well-maintained cloud provider code to die out, and CA to cease working in these environments as a result. With that in mind, I think the "alternative considered" I originally posted is actually better, with some slight modifications. Specifically:
The proposals dir contains historical proposals as well. To my knowledge, gRPC support was added and is considered done. I don't think we had a gRPC issue specifically; the closest one would probably be #3644.
This would be the most preferred option for us (if and when it comes). The main reason: CA does not know or understand anything beyond node groups. In the meantime we can evaluate the gRPC external provider approach and see whether it would be consumable by us.
I will preface this by saying that, as of December 2022, I am no longer an AWS employee and don't have a dog in this fight.
I could not agree more. This is the fundamental problem with CA's current lack of extensibility and the reason why things like Karpenter were created.
This is, in fact, how the karpenter-core interfaces are designed. Specifically, the CloudProvider interface demarcates the CloudProvider as being responsible for creating, deleting, and fetching Machines (Nodes). The whole concept of a "group of nodes" is deliberately absent at this layer in Karpenter, as it should be, so that different cloud provider compute infrastructure APIs can be used to manage individual instances instead of relying on a Managed Node Group / Auto Scaling Group API as the only mechanism for provisioning/deprovisioning nodes. In the case of AWS, Karpenter can use the EC2 CreateFleet API instead of the Auto Scaling Group APIs in order to make better, more fine-grained node provisioning decisions. Side note: I think the design of cluster-autoscaler might be the reason why I couldn't write a Karpenter driver for GKE.
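For readers unfamiliar with that layer, here is a simplified paraphrase of what a node-group-free provider boundary looks like. This is not copied from karpenter-core; the type and method names are illustrative approximations of the idea described above.

```go
// Simplified paraphrase of a Karpenter-style cloud provider boundary.
// Not the exact karpenter-core definition; types and signatures are
// illustrative approximations only.
package cloudprovider

import "context"

// Machine stands in for a single requested/provisioned node; there is
// deliberately no notion of a node group at this layer.
type Machine struct {
	Name         string
	InstanceType string
	ProviderID   string
}

// CloudProvider is responsible only for individual machines, so an
// implementation can use fine-grained APIs (e.g. EC2 CreateFleet) rather
// than a managed node group / auto scaling group abstraction.
type CloudProvider interface {
	Create(ctx context.Context, m *Machine) (*Machine, error)
	Delete(ctx context.Context, m *Machine) error
	Get(ctx context.Context, providerID string) (*Machine, error)
	List(ctx context.Context) ([]*Machine, error)
}
```

The design choice is that grouping, if any, is a policy above the provider, which is what lets an implementation manage instances directly with APIs like CreateFleet.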
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
Which component are you using?:
Cluster Autoscaler
Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:
Cluster Autoscaler has two parts: shared ("core") functionality that is generic to any cloud provider, and cloud-provider-specific code. CA releases contain both the shared part and the code for all (~30) cloud providers, along with their dependencies. This approach leads to several problems:
Describe the solution you'd like.:
I believe cloud-provider-specific code should live in separate repositories. OSS Cluster Autoscaler should really be a library that is used in various ways, rather than a component trying to support all possible cloud providers. There may be an implementation or two that make sense in this repo (gRPC and Cluster API come to mind), but everything else probably belongs elsewhere.
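To sketch what "a library used in various ways" might look like from an out-of-tree provider's point of view: the snippet below is entirely hypothetical, since no such library entry point exists today; the Provider interface and runCore function are inventions standing in for whatever API a split would actually define.

```go
// Hypothetical sketch of an out-of-tree provider binary, assuming the
// autoscaler core were consumable as a library. The Provider interface and
// runCore function are inventions for illustration only.
package main

import (
	"context"
	"log"
)

// Provider is a stand-in for whatever interface a library-style core would
// accept; a provider-owned repository would implement it.
type Provider interface {
	Name() string
	Refresh(ctx context.Context) error
}

type fooProvider struct{}

func (fooProvider) Name() string                      { return "foo" }
func (fooProvider) Refresh(ctx context.Context) error { return nil }

// runCore stands in for the hypothetical library entry point that today is
// the cluster-autoscaler main loop with all providers compiled in.
func runCore(ctx context.Context, p Provider) error {
	log.Printf("starting autoscaling loop with provider %q", p.Name())
	return p.Refresh(ctx)
}

func main() {
	if err := runCore(context.Background(), fooProvider{}); err != nil {
		log.Fatal(err)
	}
}
```

Under this model, this repo would publish the library (and perhaps the gRPC and Cluster API implementations), while each provider repository owns its own build, release cadence, and dependencies.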
Describe any alternative solutions you've considered.:
Additional context.: