
Support spreading MachineSet-owned Machines over failure domains #3358

Closed
ncdc opened this issue Jul 16, 2020 · 35 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@ncdc
Contributor

ncdc commented Jul 16, 2020

User Story

As a user, I would like a MachineSet to spread the Machines it creates across the cluster's available failure domains, to protect against a total outage due to an outage in a single location.

Detailed Description

KubeadmControlPlane is capable of spreading Machines across failure domains automatically. This is about doing the same thing for a MachineSet.

Anything else you would like to add:

One alternative is to create n MachineDeployments for your n failure domains, scaling them independently. Resiliency to failures comes through having multiple MachineDeployments, but it does require you to create e.g. 3 separate resources to cover 3 failure domains, vs. a single MachineDeployment with built-in failure domain spreading.

I'm mostly putting out some feelers to see if there is any community interest in this, and if it's worth pursuing. Happy to go whichever way (do nothing, implement this, or find something else that's better).

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 16, 2020
@rudoi
Contributor

rudoi commented Jul 16, 2020

+1, we're currently employing the n MachineDeployments pattern.

I need to double-check in CAPA, but there's a nested concern of cycling through subnets in a particular AZ. It currently returns only the first subnet if AvailabilityZone is specified on the AWSMachine. There'd be no way to consume this failure domain spread AND also additional subnets per AZ. I can open a separate CAPA issue and link to here.

@ncdc
Contributor Author

ncdc commented Jul 16, 2020

Please do, thanks!

@rudoi
Contributor

rudoi commented Jul 16, 2020

@detiber
Member

detiber commented Jul 16, 2020

One potential concern that we have around automated spreading across failure domains would be for cloud providers that limit storage attachment to the same failure domain (for example AWS).

There will be the general case of users needing to understand how scaling decisions would affect scheduling of their workloads and the associated failure tolerance for a given MachineDeployment. Since Cluster API currently has no insight into the workloads running on the cluster, we likely don't have any way of helping the user with issues that might arise from these concerns as we roll out changes to a MachineDeployment.

There is also a more specific concern related to how the cluster-autoscaler (or other external tools trying to make scaling decisions against Cluster API resources) would be able to make appropriate scaling decisions for workloads. With the current model, the Cluster Autoscaler can relatively naively make scaling decisions against MachineDeployments: if we add capacity, we can assume it will be able to run the same types of workloads that other Machines in the MachineDeployment can. With failure domain spreading, we would need to figure out how to enrich the autoscaler with the appropriate information to make the right decision, or possibly have a way for the autoscaler to signal that a scale-up should happen for an explicit failure domain rather than any failure domain used by the MachineDeployment.

@CecileRobertMichon
Contributor

/milestone next

@k8s-ci-robot
Contributor

@CecileRobertMichon: You must be a member of the kubernetes-sigs/cluster-api-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your Cluster API Maintainers and have them propose you as an additional delegate for this responsibility.

In response to this:

/milestone next

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@detiber
Member

detiber commented Jul 22, 2020

/milestone Next

@k8s-ci-robot k8s-ci-robot added this to the Next milestone Jul 22, 2020
@elmiko
Contributor

elmiko commented Jul 22, 2020

this sounds like a nice feature, but i am curious if it would be always on or can a user disable it?

There is also a more specific concern related to how the cluster-autoscaler (or other external tools trying to make scaling decisions against Cluster API resources) would be able to make appropriate scaling decisions for workloads. With the current model, the Cluster Autoscaler can relatively naively make scaling decisions against MachineDeployments: if we add capacity, we can assume it will be able to run the same types of workloads that other Machines in the MachineDeployment can. With failure domain spreading, we would need to figure out how to enrich the autoscaler with the appropriate information to make the right decision, or possibly have a way for the autoscaler to signal that a scale-up should happen for an explicit failure domain rather than any failure domain used by the MachineDeployment.

this is highly relevant imo. the cluster autoscaler can be configured to attempt to balance the nodes it creates during scaling. when this option is enabled it will try to balance creation across MachineSets or MachineDeployments (depending on what is being used). i would want to make sure that users know when they might run afoul of having these options enabled (whether in cluster-autoscaler or cluster-api).

@CecileRobertMichon
Contributor

+1 from me on this, I actually thought it was already the case.

One alternative is to create n MachineDeployments for your n failure domains, scaling them independently. Resiliency to failures comes through having multiple MachineDeployments, but it does require you to create e.g. 3 separate resources to cover 3 failure domains, vs. a single MachineDeployment with built-in failure domain spreading.

What's the current way to set a failure domain per MD? Is it through the Machine template?

this sounds like a nice feature, but i am curious if it would be always on or can a user disable it?

I would expect as a user you would be able to set the failure domains explicitly to only deploy machines from that deployment to a subset (in some use cases only 1) of failure domains, but by default it would spread to all failure domains available at the Cluster level, just like KCP.
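For anyone unfamiliar with how KCP does it, the spreading logic is essentially "put the next Machine in the failure domain that currently has the fewest". A minimal sketch of that idea; the helper below is illustrative, not the actual KCP code:

```go
package main

import "fmt"

// pickFewest returns the failure domain with the fewest existing Machines.
// This is roughly the spreading behaviour KCP already has; this helper is a
// sketch for discussion, not the real CAPI implementation.
func pickFewest(available []string, existing []string) string {
	counts := make(map[string]int, len(available))
	for _, fd := range available {
		counts[fd] = 0
	}
	for _, fd := range existing {
		if _, ok := counts[fd]; ok {
			counts[fd]++
		}
	}
	best := ""
	for _, fd := range available {
		if best == "" || counts[fd] < counts[best] {
			best = fd
		}
	}
	return best
}

func main() {
	// Cluster exposes three failure domains; Machines already exist in "us-east-1a" and "us-east-1b".
	fmt.Println(pickFewest(
		[]string{"us-east-1a", "us-east-1b", "us-east-1c"},
		[]string{"us-east-1a", "us-east-1a", "us-east-1b"},
	)) // prints "us-east-1c"
}
```

A MachineSet-level version would do the same thing when choosing where to place each new replica, and an explicit failure domain list on the spec would simply narrow `available`.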

@enxebre
Member

enxebre commented Jul 30, 2020

One potential concern that we have around automated spreading across failure domains would be for cloud providers that limit storage attachment to the same failure domain (for example AWS)...

The way this is solved for the autoscaler is by letting the user create individual nodeGroups (machineSet/machineDeployment) for each Availability Zone and then enabling the --balance-similar-node-groups flag on the autoscaler.

One alternative is to create n MachineDeployments for your n failure domains, scaling them independently. Resiliency to failures comes through having multiple MachineDeployments, but it does require you to create e.g. 3 separate resources to cover 3 failure domains, vs. a single MachineDeployment with built-in failure domain spreading.

Along these lines, and taking the autoscaling concern into consideration, another alternative would be to have a new opt-in resource, e.g. machinePool (name already taken) / machineBalance?, which manages machineSet or machineDeployment spread across available AZs automatically.

The likes of

```go
type MachineBalanceSpec struct {
	// Replicas is the count of machines for this machine pool.
	// Replicas and autoscaling cannot be used together.
	// Default is 1, if autoscaling is not used.
	// +optional
	Replicas *int64 `json:"replicas,omitempty"`

	// Autoscaling is the details for auto-scaling the Machine pool.
	// Replicas and autoscaling cannot be used together.
	// +optional
	Autoscaling *MachineBalanceAutoscaling `json:"autoscaling,omitempty"`
}

// MachineBalanceAutoscaling details how the machine pool is to be auto-scaled.
type MachineBalanceAutoscaling struct {
	// MinReplicas is the minimum number of replicas for the machine pool.
	MinReplicas int32 `json:"minReplicas"`

	// MaxReplicas is the maximum number of replicas for the machine pool.
	MaxReplicas int32 `json:"maxReplicas"`
}
```

This would keep existing primitives and behaviour untouched while giving a new clean building block which would provide the business value intended by this feature.

@JoelSpeed
Contributor

```go
type MachineBalanceAutoscaling struct {
  // MinReplicas is the minimum number of replicas for the machine pool.
  MinReplicas int32 `json:"minReplicas"`

  // MaxReplicas is the maximum number of replicas for the machine pool.
  MaxReplicas int32 `json:"maxReplicas"`
}
```

I think this is an interesting idea, especially given it allows simpler mechanics around the autoscaling integration. There are several questions that come to mind when I look at this and the full example above:

  • I notice there's no AZ list in the spec, would it be expected that we would spread across all AZs in the cluster? Should that be configurable?
  • What happens if min/maxReplicas % num AZs != 0? Would we end up with imbalanced min/max replicas across the AZs? (See the sketch after this list.)
  • Would this need a webhook to ensure that the values are mutually exclusive? I assume we can't enforce this rule with openapi validation alone?
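On the second question, one answer is that the imbalance can be bounded at one Machine per failure domain by giving the first replicas % len(failureDomains) domains an extra replica. A quick sketch of that distribution, purely illustrative:

```go
package main

import "fmt"

// distribute splits replicas across failure domains as evenly as possible.
// When replicas % len(fds) != 0, the first few domains get one extra replica,
// so the skew is never more than one Machine per domain. Sketch only.
func distribute(replicas int32, fds []string) map[string]int32 {
	out := make(map[string]int32, len(fds))
	if len(fds) == 0 {
		return out
	}
	base := replicas / int32(len(fds))
	extra := replicas % int32(len(fds))
	for i, fd := range fds {
		out[fd] = base
		if int32(i) < extra {
			out[fd]++
		}
	}
	return out
}

func main() {
	// 5 replicas over 3 AZs -> 2/2/1
	fmt.Println(distribute(5, []string{"us-east-1a", "us-east-1b", "us-east-1c"}))
}
```

The same scheme could be applied to minReplicas and maxReplicas independently, which keeps any per-AZ min/max skew to at most one.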

@CecileRobertMichon
Contributor

CecileRobertMichon commented Jul 30, 2020

The other option is to leverage MachinePools to spread worker nodes on AZs. We could have 1 autoscaler node group == 1 machine pool. MachinePool takes a list of failure domains which would allow a user to specify a single failure domain for that machine pool and have n machine pools for n zones, or choose to spread the instances within a single machine pool by having a list of zones (potentially the default behavior). So you could also have a machine pool represent a node group in terms of functionality (eg. two different OSes) and have autoscaler balance that as well.
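For reference, the relevant part of the experimental MachinePool API, abridged and from memory, so treat the exact field set as approximate:

```go
// Abridged sketch of the experimental MachinePool spec (fields other than the
// ones relevant here are elided; illustrative, not copied verbatim from the
// API package).
type MachinePoolSpec struct {
	// Number of desired machines in the pool.
	// +optional
	Replicas *int32 `json:"replicas,omitempty"`

	// FailureDomains is the list of failure domains this MachinePool should
	// be attached to; the infra provider (ASG, VMSS, ...) spreads instances
	// across them.
	// +optional
	FailureDomains []string `json:"failureDomains,omitempty"`

	// ...other fields (ClusterName, Template, ProviderIDList, ...) elided.
}
```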

@rudoi @ncdc how does one use the n MachineDeployments pattern currently? I don't see a way to specify the failure domain per MachineDeployment (without using the back compat infraMachine failure domain field).

@rudoi
Contributor

rudoi commented Jul 30, 2020

The way we use it circumvents FailureDomain entirely - the AWSMachineTemplate gets a specific subnet ID. Each MachineDeployment gets its own AWSMachineTemplate, etc.

@detiber
Member

detiber commented Jul 30, 2020

@CecileRobertMichon You should be able to set MachineDeployment.Spec.Template.Spec.FailureDomain on each MachineDeployment to the corresponding failure domain.
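To make that concrete, a sketch of the n-MachineDeployments pattern in Go. The import paths assume v1alpha3, and the helper name, bootstrap/infrastructure template refs, and label selector handling are elided or illustrative:

```go
package sketch

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/pointer"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
)

// machineDeploymentFor builds one MachineDeployment pinned to a single
// failure domain. Bootstrap/infrastructure refs and the label selector are
// omitted for brevity; this is a sketch, not working tooling.
func machineDeploymentFor(cluster, fd string, replicas int32) *clusterv1.MachineDeployment {
	md := &clusterv1.MachineDeployment{
		ObjectMeta: metav1.ObjectMeta{
			Name: fmt.Sprintf("%s-workers-%s", cluster, fd),
		},
	}
	md.Spec.ClusterName = cluster
	md.Spec.Replicas = pointer.Int32Ptr(replicas)
	md.Spec.Template.Spec.ClusterName = cluster
	// The field mentioned above: every Machine created by this
	// MachineDeployment will be placed in the given failure domain.
	md.Spec.Template.Spec.FailureDomain = pointer.StringPtr(fd)
	return md
}
```

Creating one of these per entry in the Cluster's reported failure domains gives you the manual spread discussed earlier in the thread.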

@JoelSpeed
Contributor

We could have 1 autoscaler node group == 1 machine pool. MachinePool takes a list of failure domains which would allow a user to specify a single failure domain for that machine pool and have n machine pools for n zones, or choose to spread the instances within a single machine pool by having a list of zones (potentially the default behavior).

This may be a bit of an aside from this topic, but, @CecileRobertMichon I'm not too familiar with MachinePools, is the idea with the list of failure domains that, for example, on AWS, this would create 1 ASG per failure domain? Or do ASGs support multiple failure domains as is and the MachinePool just carries that over?

It's my understanding that a NodeGroup as it exists in the cluster autoscaler should be capable of making consistent/identical machines. When making scaling decisions, it compares the resources available to the nodes (CPU, Memory, GPU), as well as the labels/taints etc. that are on the nodes, which typically would include the failure domain, with the requirements of the pods. If MachinePool supports multiple failure domains and we mapped node groups 1:1 with MachinePools, this failure domain label couldn't be defined, could it? That would break the scaling and scheduling decisions of the autoscaler. Perhaps we would have to somehow map node groups to each failure domain of the MachinePool?

@CecileRobertMichon
Contributor

@JoelSpeed it's the latter: ASGs support multiple failure domains as-is and the MachinePool just carries that over. From https://docs.aws.amazon.com/autoscaling/ec2/userguide/AutoScalingGroup.html: "When instances are launched, if you specified multiple Availability Zones, the desired capacity is distributed across these Availability Zones. If a scaling action occurs, Amazon EC2 Auto Scaling automatically maintains balance across all of the Availability Zones that you specify."

The same applies to Azure's VMSS.

@JoelSpeed
Contributor

Thanks, that's interesting. I need to do some investigation into how the cluster autoscaler handles all this, as we have been talking about some similar problems recently. CC @elmiko @enxebre: if you aren't already following this, it's relevant to some discussion we had earlier.

@elmiko
Contributor

elmiko commented Aug 4, 2020

from the autoscaler side, it uses its own internal planning algorithm to determine where new nodes should be created and then each provider does something specific to make that happen. in the case of CAPI, we present MachineSets and MachineDeployments and then scale the replicas based on what the autoscaler is prescribing. in the case of AWS, it uses autoscaling groups (ASGs) internally and simply increases and decreases their sizes, for example.

if we start using cloud-specific resources that can do internal balancing and scaling of nodes, we might consider decoupling that from the autoscaler's activities. but only if there are other planning algorithms (eg from the cloud provider) that are doing this work.

if we can use something that presents to the autoscaler as a resource with a number of replicas, and a way to address individual nodes, that should be sufficient to satisfy its requirements. if this resource (a MachinePool or w/e) covers multiple zones and places nodes inside those zones, then i would imagine it will be fine to interface with the autoscaler.
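For anyone not familiar with the autoscaler side, the contract described above ("a number of replicas, and a way to address individual nodes") is roughly the cloudprovider NodeGroup interface. The sketch below is abridged and from memory; the authoritative version lives in the autoscaler repo:

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// Instance stands in here for the autoscaler's cloudprovider.Instance type.
type Instance struct {
	Id string
}

// NodeGroup is an abridged, from-memory sketch of the cluster-autoscaler
// cloudprovider contract: replica bounds and a desired size that can be read
// and changed, plus a way to address individual nodes. The real interface is
// under k8s.io/autoscaler/cluster-autoscaler/cloudprovider.
type NodeGroup interface {
	// Replica bounds and the current desired size.
	MinSize() int
	MaxSize() int
	TargetSize() (int, error)

	// Scale out, or remove specific nodes when scaling in.
	IncreaseSize(delta int) error
	DeleteNodes(nodes []*corev1.Node) error

	// Identity and membership.
	Id() string
	Nodes() ([]Instance, error)
}
```

Anything CAPI exposes (MachineSet, MachineDeployment, MachinePool, or a new MachineBalance-style resource) that can back these operations should be usable as a node group.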

@JoelSpeed
Contributor

Looks like other implementations already handle the multi-AZ case by just using the first AZ for the scheduling decisions: https://github.com/kubernetes/autoscaler/blob/c49beada6b12e7c0428bb63147890f678c4e6001/cluster-autoscaler/cloudprovider/aws/aws_manager.go#L307-L316. I guess we can just copy that behaviour without breaking the autoscaler.
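Something along these lines on the CAPI provider side would copy that behaviour; the label key and helper name below are placeholders, not existing code:

```go
package sketch

// topologyZoneLabel is an illustrative label key; the linked AWS code uses
// the zone label that the cloud provider sets on nodes.
const topologyZoneLabel = "topology.kubernetes.io/zone"

// templateNodeLabels mirrors the linked behaviour: if a node group spans
// several failure domains, just pick the first one for the template node the
// autoscaler uses in its scheduling simulations. A sketch of the approach,
// not actual CAPI provider code.
func templateNodeLabels(failureDomains []string) map[string]string {
	labels := map[string]string{}
	if len(failureDomains) > 0 {
		labels[topologyZoneLabel] = failureDomains[0]
	}
	return labels
}
```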

Can ignore my concerns in my earlier comment

@CecileRobertMichon
Contributor

@ncdc @rudoi @JoelSpeed @elmiko thoughts on getting this into v1alpha4?

@ncdc
Contributor Author

ncdc commented Oct 12, 2020

@CecileRobertMichon I just reread all the comments and there were a few different ideas suggested. Which one are you asking about? 😄

@CecileRobertMichon
Contributor

Not set on a particular implementation yet; I just want the ability to spread worker nodes across AZs automatically. We can discuss implementations as part of the proposal, I guess? Mostly gauging interest in prioritizing this user story at all.

@elmiko
Contributor

elmiko commented Oct 12, 2020

no objections to moving forward from me.

i think my main concern is ensuring that we continue to present unified node groups to the autoscaler. i don't think any of the ideas we've talked about would necessarily break that, but if we start exercising labels or taints which would affect placement of machines within a machineset/deployment then we would need to make sure and document this for autoscaler users. there are some cases where adding additional information would affect behaviors in the autoscaler (eg --balance-similar-node-groups option).

@JoelSpeed
Contributor

For context on our position: we have a lot of interest internally pushing us to deliver this feature within OpenShift by the end of Q1 next year, and we'd like to match what this community is doing as we do that. Our team will be able to help out as and when the community wants to start moving forward with it.

So to answer the original question, I would like to see this in v1alpha4. Do we have anyone wanting to own this feature yet?

@rudoi
Contributor

rudoi commented Oct 16, 2020

+1 to including in v1a4

@vincepri
Member

Any volunteer for drafting up a proposal? :)

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 14, 2021
@vincepri
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 14, 2021
@CecileRobertMichon
Contributor

/assign

I can give this a try

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 7, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 6, 2021
@fabriziopandini
Member

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jun 7, 2021
@vincepri
Member

vincepri commented Jun 7, 2021

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Jun 7, 2021
@vincepri
Member

Closing this in favor of documentation: we've discussed supporting multiple failure domains in MD/MS and ultimately decided that folks should create multiple MDs/MSes if they want to spread their workloads across AZs.

/close

@k8s-ci-robot
Contributor

@vincepri: Closing this issue.

In response to this:

Closing this in favor of documentation: we've discussed supporting multiple failure domains in MD/MS and ultimately decided that folks should create multiple MDs/MSes if they want to spread their workloads across AZs.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.