Support spreading MachineSet-owned Machines over failure domains #3358
Comments
+1, we're currently employing the n MachineDeployments pattern. I need to double check in CAPA, but there's a nested concern of cycling through subnets in a particular AZ: it currently returns only the first subnet if AvailabilityZone is specified on the AWSMachine. There'd be no way to consume this failure domain spread AND also additional subnets per AZ. I can open a separate CAPA issue and link it here.
Please do, thanks!
One potential concern that we have around automated spreading across failure domains would be for cloud providers that limit storage attachment to the same failure domain (for example AWS). There will be the general case of users needing to understand how scaling decisions would affect scheduling of those workloads and the associated failure tolerance for a given MachineDeployment. Since Cluster API currently has no insight into the workloads running on the cluster, we likely don't have any way of helping the user with issues that might arise from these concerns as we roll out changes to a MachineDeployment. There is also a more specific concern related to how the cluster-autoscaler (or other external tools trying to make scaling decisions against Cluster API resources) would be able to make appropriate scaling decisions for workloads. With the current model, the Cluster Autoscaler can relatively naively make scaling decisions against MachineDeployments: if we add capacity, we can assume it will be able to run the same types of workloads that other Machines in the MachineDeployment can. With failure domain spreading, we would need to figure out how to enrich the autoscaler with the appropriate information to make the right decision, or possibly have a way for the autoscaler to signal that a scale-up should happen for an explicit failure domain rather than any failure domain used by the MachineDeployment.
/milestone next |
@CecileRobertMichon: You must be a member of the kubernetes-sigs/cluster-api-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your Cluster API Maintainers and have them propose you as an additional delegate for this responsibility.
/milestone Next |
this sounds like a nice feature, but i am curious if it would be always on or can a user disable it?
this is highly relevant imo. the cluster autoscaler can be configured to attempt to balance the nodes it creates during scaling. when this option is enabled it will try to balance creation across MachineSets or MachineDeployments (depending on what is being used). i would want to make sure that users know when they might run afoul of having these options enabled (whether in cluster-autoscaler or cluster-api).
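(For illustration, a minimal sketch of enabling that balancing behaviour with the Cluster API provider for the cluster-autoscaler. The namespace, image tag, and service account are assumptions and the RBAC is omitted; `--cloud-provider=clusterapi` and `--balance-similar-node-groups` are the upstream flag names.)

```yaml
# Illustrative cluster-autoscaler Deployment, not a complete install:
# the autoscaler is pointed at Cluster API resources and asked to keep
# similar node groups (for example one MachineDeployment per AZ) balanced.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler   # RBAC omitted for brevity
      containers:
      - name: cluster-autoscaler
        image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.20.0   # registry/tag are assumptions
        command:
        - /cluster-autoscaler
        args:
        - --cloud-provider=clusterapi
        - --balance-similar-node-groups=true   # spread scale-ups across the per-AZ node groups
```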
+1 from me on this, I actually thought it was already the case.
What's the current way to set a failure domain per MD? Is it through the Machine template?
I would expect that, as a user, you would be able to set the failure domains explicitly to deploy machines from that deployment to only a subset (in some use cases just one) of failure domains, but that by default it would spread across all failure domains available at the Cluster level, just like KCP.
The way this is solved for the autoscaler today is by letting the user create individual node groups (MachineSet/MachineDeployment) for each Availability Zone, then enabling the --balance-similar-node-groups flag on the autoscaler.
Along these lines, and taking the autoscaling concern into consideration, another alternative would be to have a new opt-in resource, e.g. machinePool (already taken) / machineBalance?, which manages MachineSet or MachineDeployment spread across the available AZs automatically.
This would keep existing primitives and behaviour untouched while giving a new, clean building block which would provide the business value intended by this feature.
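(A purely hypothetical sketch of what such an opt-in resource could look like; the kind name `MachineBalance`, its placement in the `cluster.x-k8s.io` group, and every field below are invented here for illustration and do not exist in Cluster API.)

```yaml
# Hypothetical only: a resource that stamps out one MachineSet/MachineDeployment
# per failure domain and keeps the total replica count spread across them.
apiVersion: cluster.x-k8s.io/v1alpha4
kind: MachineBalance
metadata:
  name: workers
spec:
  clusterName: my-cluster
  replicas: 6                       # total, distributed across the selected failure domains
  failureDomains: ["1", "2", "3"]   # could default to all domains reported on the Cluster
  template:                         # template stamped out once per failure domain
    spec:
      version: v1.21.2
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1alpha4
          kind: KubeadmConfigTemplate
          name: workers
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1alpha4
        kind: AWSMachineTemplate
        name: workers
```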
I think this is an interesting idea, especially given it allows simpler mechanics around the autoscaling integration. There are several questions that come to mind when I look at this and the full example above.
The other option is to leverage MachinePools to spread worker nodes across AZs. We could have 1 autoscaler node group == 1 machine pool. MachinePool takes a list of failure domains, which would allow a user to specify a single failure domain for that machine pool and have n machine pools for n zones, or to spread the instances within a single machine pool by providing a list of zones (potentially the default behavior). So you could also have a machine pool represent a node group in terms of functionality (e.g. two different OSes) and have the autoscaler balance that as well. @rudoi @ncdc how does one use the n MachineDeployments pattern currently? I don't see a way to specify the failure domain per MachineDeployment (without using the back-compat infraMachine failure domain field).
The way we use it circumvents FailureDomain entirely: the AWSMachineTemplate gets a specific subnet ID. Each MachineDeployment gets its own AWSMachineTemplate, etc.
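(A sketch of that workaround; the names, subnet ID, and the v1alpha3 CAPA field layout are assumptions.)

```yaml
# One AWSMachineTemplate per MachineDeployment, each pinned to a subnet that
# lives in a specific AZ, bypassing FailureDomain entirely (placeholder values).
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: AWSMachineTemplate
metadata:
  name: workers-us-east-1a
spec:
  template:
    spec:
      instanceType: m5.large
      subnet:
        id: subnet-0123456789abcdef0   # a subnet in us-east-1a
```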
@CecileRobertMichon You should be able to set MachineDeployment.Spec.Template.Spec.FailureDomain on each MachineDeployment to the corresponding failure domain.
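(For example, as a sketch with placeholder names and v1alpha3 API versions: one MachineDeployment pinned to a single failure domain; repeating it per domain gives the n MachineDeployments pattern discussed above.)

```yaml
apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineDeployment
metadata:
  name: workers-fd-1
spec:
  clusterName: my-cluster
  replicas: 3
  selector:
    matchLabels: null
  template:
    spec:
      clusterName: my-cluster
      version: v1.21.2
      failureDomain: "1"   # pin this deployment's Machines to failure domain "1"
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
          kind: KubeadmConfigTemplate
          name: workers
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
        kind: AWSMachineTemplate
        name: workers
```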
This may be a bit of an aside from this topic, but @CecileRobertMichon I'm not too familiar with MachinePools. Is the idea with the list of failure domains that, for example, on AWS, this would create 1 ASG per failure domain? Or do ASGs support multiple failure domains as-is and the MachinePool just carries that over? It's my understanding that a…
@JoelSpeed it's the latter: ASGs support multiple failure domains as-is and the MachinePool just carries that over, see https://docs.aws.amazon.com/autoscaling/ec2/userguide/AutoScalingGroup.html --> "When instances are launched, if you specified multiple Availability Zones, the desired capacity is distributed across these Availability Zones. If a scaling action occurs, Amazon EC2 Auto Scaling automatically maintains balance across all of the Availability Zones that you specify." The same applies to Azure's VMSS.
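(Again just a sketch with placeholder names and values; the experimental API group shown is how MachinePool was served in the v1alpha3 era, and the group/version of the AWSMachinePool reference is a best-effort assumption.)

```yaml
# A single MachinePool spread over several zones; the infrastructure provider
# (ASG, VMSS, ...) balances instances across the listed failure domains.
apiVersion: exp.cluster.x-k8s.io/v1alpha3
kind: MachinePool
metadata:
  name: workers
spec:
  clusterName: my-cluster
  replicas: 6
  failureDomains:          # or a single entry to get one machine pool per zone
  - us-east-1a
  - us-east-1b
  - us-east-1c
  template:
    spec:
      clusterName: my-cluster
      version: v1.21.2
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
          kind: KubeadmConfigTemplate
          name: workers
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
        kind: AWSMachinePool
        name: workers
```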
from the autoscaler side, it uses its own internal planning algorithm to determine where new nodes should be created, and then each provider does something specific to make that happen. in the case of CAPI, we present MachineSets and MachineDeployments and then scale the replicas based on what the autoscaler is prescribing. in the case of AWS, it uses autoscaling groups (ASGs) internally and simply increases and decreases their sizes, for example. if we start using cloud-specific resources that can do internal balancing and scaling of nodes, we might consider decoupling that from the autoscaler's activities, but only if there are other planning algorithms (eg from the cloud provider) doing this work. if we can use something that presents to the autoscaler as a resource with a number of replicas, and a way to address individual nodes, that should be sufficient to satisfy its requirements. if this resource (a MachinePool or w/e) covers multiple zones and places nodes inside those zones, then i would imagine it will be fine to interface with the autoscaler.
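(On the point about presenting a resource with a replica count: a sketch of how a MachineDeployment is exposed to the autoscaler today. The annotation keys are the ones documented for the cluster-autoscaler clusterapi provider, to the best of my knowledge; the name and bounds are placeholders.)

```yaml
# Metadata fragment only; the rest of the MachineDeployment spec is unchanged.
apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineDeployment
metadata:
  name: workers-fd-1
  annotations:
    # bounds the autoscaler uses when scaling this node group's replicas
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "1"
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "10"
```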
Looks like other implementations already handle the multi-AZ case by just using the first AZ for their scheduling decisions: https://github.com/kubernetes/autoscaler/blob/c49beada6b12e7c0428bb63147890f678c4e6001/cluster-autoscaler/cloudprovider/aws/aws_manager.go#L307-L316. I guess we can just copy that behaviour without breaking the autoscaler. You can ignore the concerns in my earlier comment.
@ncdc @rudoi @JoelSpeed @elmiko thoughts on getting this into v1alpha4? |
@CecileRobertMichon I just reread all the comments and there were a few different ideas suggested. Which one are you asking about? 😄
Not set on a particular implementation detail yet, I just want the ability to spread worker nodes across AZs automatically. We can discuss implementations as part of the proposal, I guess? Mostly gauging interest in prioritizing this user story at all.
no objections to moving forward from me. i think my main concern is ensuring that we continue to present unified node groups to the autoscaler. i don't think any of the ideas we've talked about would necessarily break that, but if we start exercising labels or taints which would affect placement of machines within a machineset/deployment then we would need to make sure to document this for autoscaler users. there are some cases where adding additional information would affect behaviors in the autoscaler (eg the --balance-similar-node-groups option).
For context on our position: we have a lot of interest internally pushing us to deliver this feature within OpenShift by the end of Q1 next year, and we'd like to match what this community is doing as we do that. Our team will be able to help out as and when the community wants to start moving forward with it. So to answer the original question, I would like to see this in v1alpha4. Do we have anyone wanting to own this feature yet?
+1 to including in v1a4 |
Any volunteer for drafting up a proposal? :) |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale |
/assign
I can give this a try
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
/remove-lifecycle rotten |
/lifecycle frozen |
Closing this in favor of documentation. We've discussed supporting multiple failure domains in MD/MS and ultimately decided that folks should create multiple MDs/MSs if they want to spread their workloads across AZs.
/close
@vincepri: Closing this issue.
User Story
As a user, I would like a MachineSet to spread the Machines it creates across the cluster's available failure domains, to protect against a total outage caused by the failure of a single location.
Detailed Description
KubeadmControlPlane is capable of spreading Machines across failure domains automatically. This is about doing the same thing for a MachineSet.
Anything else you would like to add:
One alternative is to create n MachineDeployments for your n failure domains and scale them independently. Resiliency to failures comes through having multiple MachineDeployments, but it does require you to create e.g. 3 separate resources to cover 3 failure domains, vs. a single MachineDeployment with built-in failure domain spreading.
I'm mostly putting out some feelers to see if there is any community interest in this, and if it's worth pursuing. Happy to go whichever way (do nothing, implement this, or find something else that's better).
/kind feature