
Support spreading MachineSet-owned Machines over failure domains #3358

Closed
ncdc opened this issue Jul 16, 2020 · 35 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@ncdc
Contributor

ncdc commented Jul 16, 2020

User Story

As a user, I would like a MachineSet to spread the Machines it creates across the cluster's available failure domains, to protect against a total outage due to an outage in a single location.

Detailed Description

KubeadmControlPlane is capable of spreading Machines across failure domains automatically. This is about doing the same thing for a MachineSet.

Anything else you would like to add:

One alternative is to create n MachineDeployments for your n failure domains, scaling them independently. Resiliency to failures comes through having multiple MachineDeployments, but it does require you to create e.g. 3 separate resources to cover 3 failure domains, vs. a single MachineDeployment with built-in failure domain spreading.

I'm mostly putting out some feelers to see if there is any community interest in this, and if it's worth pursuing. Happy to go whichever way (do nothing, implement this, or find something else that's better).

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 16, 2020
@rudoi
Contributor

rudoi commented Jul 16, 2020

+1, we're currently employing the n MachineDeployments pattern.

I need to double-check in CAPA, but there's a nested concern of cycling through subnets in a particular AZ. It currently returns only the first subnet if AvailabilityZone is specified on the AWSMachine. There'd be no way to consume this failure domain spread AND also additional subnets per AZ. I can open a separate CAPA issue and link to here.

@ncdc
Contributor Author

ncdc commented Jul 16, 2020

Please do, thanks!

@rudoi
Contributor

rudoi commented Jul 16, 2020

@detiber
Member

detiber commented Jul 16, 2020

One potential concern that we have around automated spreading across failure domains would be for cloud providers that limit storage attachment to the same failure domain (for example AWS).

There will be the general case of users needing to understand how scaling decisions would affect scheduling of their workloads and the associated failure tolerance for a given MachineDeployment. Since Cluster API currently has no insight into the workloads running on the cluster, we likely don't have any way of helping the user with issues that might arise from these concerns as we roll out changes to a MachineDeployment.

There is also a more specific concern related to how the cluster-autoscaler (or other external tools trying to make scaling decisions against Cluster API resources) would be able to make appropriate scaling decisions for workloads. With the current model, the Cluster Autoscaler can relatively naively make scaling decisions against MachineDeployments: if we add capacity, we can assume it will be able to run the same types of workloads that other Machines in the MachineDeployment can. With failure domain spreading, we would need to figure out how to enrich the autoscaler with the appropriate information to make the right decision, or possibly have a way for the autoscaler to signal that a scale-up should happen for an explicit failure domain rather than any failure domain used by the MachineDeployment.

@CecileRobertMichon
Contributor

/milestone next

@k8s-ci-robot
Contributor

@CecileRobertMichon: You must be a member of the kubernetes-sigs/cluster-api-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your Cluster API Maintainers and have them propose you as an additional delegate for this responsibility.

In response to this:

/milestone next

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@detiber
Member

detiber commented Jul 22, 2020

/milestone Next

@k8s-ci-robot k8s-ci-robot added this to the Next milestone Jul 22, 2020
@elmiko
Contributor

elmiko commented Jul 22, 2020

this sounds like a nice feature, but i am curious if it would be always on or can a user disable it?

There is also a more specific concern related to how the cluster-autoscaler (or other external tools trying to make scaling decisions against Cluster API resources) would be able to make appropriate scaling decisions for workloads. With the current model, the Cluster Autoscaler can relatively naively make scaling decisions against MachineDeployments: if we add capacity, we can assume it will be able to run the same types of workloads that other Machines in the MachineDeployment can. With failure domain spreading, we would need to figure out how to enrich the autoscaler with the appropriate information to make the right decision, or possibly have a way for the autoscaler to signal that a scale-up should happen for an explicit failure domain rather than any failure domain used by the MachineDeployment.

this is highly relevant imo. the cluster autoscaler can be configured to attempt to balance the nodes it creates during scaling. when this option is enabled it will try to balance creation across MachineSets or MachineDeployments (depending on what is being used). i would want to make sure that users know when they might run afoul of having these options enabled (whether in cluster-autoscaler or cluster-api).

@CecileRobertMichon
Contributor

+1 from me on this, I actually thought it was already the case.

One alternative is to create n MachineDeployments for your n failure domains, scaling them independently. Resiliency to failures comes through having multiple MachineDeployments, but it does require you to create e.g. 3 separate resources to cover 3 failure domains, vs. a single MachineDeployment with built-in failure domain spreading.

What's the current way to set a failure domain per MD? Is it through the Machine template?

this sounds like a nice feature, but i am curious if it would be always on or can a user disable it?

I would expect as a user you would be able to set the failure domains explicitly to only deploy machines from that deployment to a subset (in some use cases only 1) of failure domains, but by default it would spread to all failure domains available at the Cluster level, just like KCP.
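For anyone unfamiliar with how KCP does it, the spreading logic is essentially "put the next Machine in the failure domain that currently has the fewest". A minimal sketch of that idea; the helper below is illustrative, not the actual KCP code:

```go
package main

import "fmt"

// pickFewest returns the failure domain with the fewest existing Machines.
// This is roughly the spreading behaviour KCP already has; this helper is a
// sketch for discussion, not the real CAPI implementation.
func pickFewest(available []string, existing []string) string {
	counts := make(map[string]int, len(available))
	for _, fd := range available {
		counts[fd] = 0
	}
	for _, fd := range existing {
		if _, ok := counts[fd]; ok {
			counts[fd]++
		}
	}
	best := ""
	for _, fd := range available {
		if best == "" || counts[fd] < counts[best] {
			best = fd
		}
	}
	return best
}

func main() {
	// Cluster exposes three failure domains; Machines already exist in "us-east-1a" and "us-east-1b".
	fmt.Println(pickFewest(
		[]string{"us-east-1a", "us-east-1b", "us-east-1c"},
		[]string{"us-east-1a", "us-east-1a", "us-east-1b"},
	)) // prints "us-east-1c"
}
```

A MachineSet-level version would do the same thing when choosing where to place each new replica, and an explicit failure domain list on the spec would simply narrow `available`.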

@enxebre
Member

enxebre commented Jul 30, 2020

One potential concern that we have around automated spreading across failure domains would be for cloud providers that limit storage attachment to the same failure domain (for example AWS)...

The way this is solved for the autoscaler is by letting the user create individual nodeGroups (machineSet/machineDeployment) for each Availability Zone and then enabling the --balance-similar-node-groups flag on the autoscaler.

One alternative is to create n MachineDeployments for your n failure domains, scaling them independently. Resiliency to failures comes through having multiple MachineDeployments, but it does require you to create e.g. 3 separate resources to cover 3 failure domains, vs. a single MachineDeployment with built-in failure domain spreading.

Along these lines, and taking the autoscaling concern into consideration, another alternative would be to have a new opt-in resource, e.g. machinePool (name already taken) / machineBalance?, which manages machineSet or machineDeployment spread across available AZs automatically.

The likes of

```go
type MachineBalanceSpec struct {
	// Replicas is the count of machines for this machine pool.
	// Replicas and autoscaling cannot be used together.
	// Default is 1, if autoscaling is not used.
	// +optional
	Replicas *int64 `json:"replicas,omitempty"`

	// Autoscaling is the details for auto-scaling the Machine pool.
	// Replicas and autoscaling cannot be used together.
	// +optional
	Autoscaling *MachineBalanceAutoscaling `json:"autoscaling,omitempty"`
}

// MachineBalanceAutoscaling details how the machine pool is to be auto-scaled.
type MachineBalanceAutoscaling struct {
	// MinReplicas is the minimum number of replicas for the machine pool.
	MinReplicas int32 `json:"minReplicas"`

	// MaxReplicas is the maximum number of replicas for the machine pool.
	MaxReplicas int32 `json:"maxReplicas"`
}
```

This would keep existing primitives and behaviour untouched while giving a new clean building block which would provide the business value intended by this feature.

@JoelSpeed
Contributor

```go
type MachineBalanceAutoscaling struct {
  // MinReplicas is the minimum number of replicas for the machine pool.
  MinReplicas int32 `json:"minReplicas"`

  // MaxReplicas is the maximum number of replicas for the machine pool.
  MaxReplicas int32 `json:"maxReplicas"`
}
```

I think this is an interesting idea, especially given it allows simpler mechanics around the autoscaling integration. There are several questions that come to mind when I look at this and the full example above:

  • I notice there's no AZ list in the spec, would it be expected that we would spread across all AZs in the cluster? Should that be configurable?
  • What happens if min/maxReplicas % num AZs != 0? Would we end up with imbalanced min/max replicas across the AZs? (See the sketch after this list.)
  • Would this need a webhook to ensure that the values are mutually exclusive? I assume we can't enforce this rule with openapi validation alone?
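On the second question, one answer is that the imbalance can be bounded at one Machine per failure domain by giving the first replicas % len(failureDomains) domains an extra replica. A quick sketch of that distribution, purely illustrative:

```go
package main

import "fmt"

// distribute splits replicas across failure domains as evenly as possible.
// When replicas % len(fds) != 0, the first few domains get one extra replica,
// so the skew is never more than one Machine per domain. Sketch only.
func distribute(replicas int32, fds []string) map[string]int32 {
	out := make(map[string]int32, len(fds))
	if len(fds) == 0 {
		return out
	}
	base := replicas / int32(len(fds))
	extra := replicas % int32(len(fds))
	for i, fd := range fds {
		out[fd] = base
		if int32(i) < extra {
			out[fd]++
		}
	}
	return out
}

func main() {
	// 5 replicas over 3 AZs -> 2/2/1
	fmt.Println(distribute(5, []string{"us-east-1a", "us-east-1b", "us-east-1c"}))
}
```

The same scheme could be applied to minReplicas and maxReplicas independently, which keeps any per-AZ min/max skew to at most one.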

@CecileRobertMichon
Contributor

CecileRobertMichon commented Jul 30, 2020

The other option is to leverage MachinePools to spread worker nodes on AZs. We could have 1 autoscaler node group == 1 machine pool. MachinePool takes a list of failure domains which would allow a user to specify a single failure domain for that machine pool and have n machine pools for n zones, or choose to spread the instances within a single machine pool by having a list of zones (potentially the default behavior). So you could also have a machine pool represent a node group in terms of functionality (eg. two different OSes) and have autoscaler balance that as well.
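For reference, the relevant part of the experimental MachinePool API, abridged and from memory, so treat the exact field set as approximate:

```go
// Abridged sketch of the experimental MachinePool spec (fields other than the
// ones relevant here are elided; illustrative, not copied verbatim from the
// API package).
type MachinePoolSpec struct {
	// Number of desired machines in the pool.
	// +optional
	Replicas *int32 `json:"replicas,omitempty"`

	// FailureDomains is the list of failure domains this MachinePool should
	// be attached to; the infra provider (ASG, VMSS, ...) spreads instances
	// across them.
	// +optional
	FailureDomains []string `json:"failureDomains,omitempty"`

	// ...other fields (ClusterName, Template, ProviderIDList, ...) elided.
}
```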

@rudoi @ncdc how does one use the n MachineDeployments pattern currently? I don't see a way to specify the failure domain per MachineDeployment (without using the back compat infraMachine failure domain field).

@rudoi
Contributor

rudoi commented Jul 30, 2020

The way we use it circumvents FailureDomain entirely - the AWSMachineTemplate gets a specific subnet ID. Each MachineDeployment gets its own AWSMachineTemplate, etc.

@detiber
Member

detiber commented Jul 30, 2020

@CecileRobertMichon You should be able to set MachineDeployment.Spec.Template.Spec.FailureDomain on each MachineDeployment to the corresponding failure domain.
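To make that concrete, a sketch of the n-MachineDeployments pattern in Go. The import paths assume v1alpha3, and the helper name, bootstrap/infrastructure template refs, and label selector handling are elided or illustrative:

```go
package sketch

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/pointer"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
)

// machineDeploymentFor builds one MachineDeployment pinned to a single
// failure domain. Bootstrap/infrastructure refs and the label selector are
// omitted for brevity; this is a sketch, not working tooling.
func machineDeploymentFor(cluster, fd string, replicas int32) *clusterv1.MachineDeployment {
	md := &clusterv1.MachineDeployment{
		ObjectMeta: metav1.ObjectMeta{
			Name: fmt.Sprintf("%s-workers-%s", cluster, fd),
		},
	}
	md.Spec.ClusterName = cluster
	md.Spec.Replicas = pointer.Int32Ptr(replicas)
	md.Spec.Template.Spec.ClusterName = cluster
	// The field mentioned above: every Machine created by this
	// MachineDeployment will be placed in the given failure domain.
	md.Spec.Template.Spec.FailureDomain = pointer.StringPtr(fd)
	return md
}
```

Creating one of these per entry in the Cluster's reported failure domains gives you the manual spread discussed earlier in the thread.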

@JoelSpeed
Contributor

We could have 1 autoscaler node group == 1 machine pool. MachinePool takes a list of failure domains which would allow a user to specify a single failure domain for that machine pool and have n machine pools for n zones, or choose to spread the instances within a single machine pool by having a list of zones (potentially the default behavior).

This may be a bit of an aside from this topic, but, @CecileRobertMichon I'm not too familiar with MachinePools, is the idea with the list of failure domains that, for example, on AWS, this would create 1 ASG per failure domain? Or do ASGs support multiple failure domains as is and the MachinePool just carries that over?

It's my understanding that a NodeGroup as it exists in the cluster autoscaler should be capable of making consistent/identical machines. When making scaling decisions, it compares the resources available to the nodes (CPU, Memory, GPU), as well as the labels/taints etc. that are on the nodes, which typically would include the failure domain, with the requirements of the pods. If MachinePool supports multiple failure domains and we mapped node groups 1:1 with MachinePools, this failure domain label couldn't be defined, could it? That would break the scaling and scheduling decisions of the autoscaler. Perhaps we would have to somehow map node groups to each failure domain of the MachinePool?

@CecileRobertMichon
Contributor

@JoelSpeed it's the latter: ASGs support multiple failure domains as-is and the MachinePool just carries that over. From https://docs.aws.amazon.com/autoscaling/ec2/userguide/AutoScalingGroup.html: "When instances are launched, if you specified multiple Availability Zones, the desired capacity is distributed across these Availability Zones. If a scaling action occurs, Amazon EC2 Auto Scaling automatically maintains balance across all of the Availability Zones that you specify."

The same applies to Azure's VMSS.

@JoelSpeed
Contributor

Thanks, that's interesting. I need to do some investigation into how the cluster autoscaler handles all this, as we have been talking about some similar problems recently. CC @elmiko @enxebre: if you aren't already following this, it's relevant to some discussion we had earlier.

@elmiko
Contributor

elmiko commented Aug 4, 2020

from the autoscaler side, it uses its own internal planning algorithm to determine where new nodes should be created and then each provider does something specific to make that happen. in the case of CAPI, we present MachineSets and MachineDeployments and then scale the replicas based on what the autoscaler is prescribing. in the case of AWS, it uses autoscaling groups (ASGs) internally and simply increases and decreases their sizes, for example.

if we start using cloud-specific resources that can do internal balancing and scaling of nodes, we might consider decoupling that from the autoscaler's activities. but only if there are other planning algorithms (eg from the cloud provider) that are doing this work.

if we can use something that presents to the autoscaler as a resource with a number of replicas, and a way to address individual nodes, that should be sufficient to satisfy its requirements. if this resource (a MachinePool or w/e) covers multiple zones and places nodes inside those zones, then i would imagine it will be fine to interface with the autoscaler.
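For anyone not familiar with the autoscaler side, the contract described above ("a number of replicas, and a way to address individual nodes") is roughly the cloudprovider NodeGroup interface. The sketch below is abridged and from memory; the authoritative version lives in the autoscaler repo:

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// Instance stands in here for the autoscaler's cloudprovider.Instance type.
type Instance struct {
	Id string
}

// NodeGroup is an abridged, from-memory sketch of the cluster-autoscaler
// cloudprovider contract: replica bounds and a desired size that can be read
// and changed, plus a way to address individual nodes. The real interface is
// under k8s.io/autoscaler/cluster-autoscaler/cloudprovider.
type NodeGroup interface {
	// Replica bounds and the current desired size.
	MinSize() int
	MaxSize() int
	TargetSize() (int, error)

	// Scale out, or remove specific nodes when scaling in.
	IncreaseSize(delta int) error
	DeleteNodes(nodes []*corev1.Node) error

	// Identity and membership.
	Id() string
	Nodes() ([]Instance, error)
}
```

Anything CAPI exposes (MachineSet, MachineDeployment, MachinePool, or a new MachineBalance-style resource) that can back these operations should be usable as a node group.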

@JoelSpeed
Contributor

Looks like other implementations already handle the multi-AZ case by just using the first AZ for the scheduling decisions: https://github.com/kubernetes/autoscaler/blob/c49beada6b12e7c0428bb63147890f678c4e6001/cluster-autoscaler/cloudprovider/aws/aws_manager.go#L307-L316. I guess we can just copy that behaviour without breaking the autoscaler.
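Something along these lines on the CAPI provider side would copy that behaviour; the label key and helper name below are placeholders, not existing code:

```go
package sketch

// topologyZoneLabel is an illustrative label key; the linked AWS code uses
// the zone label that the cloud provider sets on nodes.
const topologyZoneLabel = "topology.kubernetes.io/zone"

// templateNodeLabels mirrors the linked behaviour: if a node group spans
// several failure domains, just pick the first one for the template node the
// autoscaler uses in its scheduling simulations. A sketch of the approach,
// not actual CAPI provider code.
func templateNodeLabels(failureDomains []string) map[string]string {
	labels := map[string]string{}
	if len(failureDomains) > 0 {
		labels[topologyZoneLabel] = failureDomains[0]
	}
	return labels
}
```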

Can ignore my concerns in my earlier comment

@CecileRobertMichon
Contributor

@ncdc @rudoi @JoelSpeed @elmiko thoughts on getting this into v1alpha4?

@ncdc
Contributor Author

ncdc commented Oct 12, 2020

@CecileRobertMichon I just reread all the comments and there were a few different ideas suggested. Which one are you asking about? 😄

@CecileRobertMichon
Contributor

Not set on a particular implementation yet; I just want the ability to spread worker nodes across AZs automatically. We can discuss implementations as part of the proposal, I guess? Mostly gauging interest in prioritizing this user story at all.

@elmiko
Contributor

elmiko commented Oct 12, 2020

no objections to moving forward from me.

i think my main concern is ensuring that we continue to present unified node groups to the autoscaler. i don't think any of the ideas we've talked about would necessarily break that, but if we start exercising labels or taints which would affect placement of machines within a machineset/deployment then we would need to make sure and document this for autoscaler users. there are some cases where adding additional information would affect behaviors in the autoscaler (eg --balance-similar-node-groups option).

@JoelSpeed
Contributor

For context on our position: we have a lot of interest internally pushing us to deliver this feature within OpenShift by the end of Q1 next year, and we'd like to match what this community is doing as we do that. Our team will be able to help out as and when the community wants to start moving forward with it.

So to answer the original question, I would like to see this in v1alpha4. Do we have anyone wanting to own this feature yet?

@rudoi
Contributor

rudoi commented Oct 16, 2020

+1 to including in v1a4

@vincepri
Member

Any volunteer for drafting up a proposal? :)

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 14, 2021
@vincepri
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 14, 2021
@CecileRobertMichon
Contributor

/assign

I can give this a try

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 7, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 6, 2021
@fabriziopandini
Member

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jun 7, 2021
@vincepri
Member

vincepri commented Jun 7, 2021

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Jun 7, 2021
@vincepri
Member

Closing this in favor of documentation: we've discussed supporting multiple failure domains in MD/MS and ultimately decided that folks should create multiple MDs/MSes if they want to spread their workloads across AZs.

/close

@k8s-ci-robot
Contributor

@vincepri: Closing this issue.

In response to this:

Closing this in favor of documentation: we've discussed supporting multiple failure domains in MD/MS and ultimately decided that folks should create multiple MDs/MSes if they want to spread their workloads across AZs.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.