Default to spread worker nodes across failure domains #3203

richardcase · 2022-02-11T11:11:15Z

/kind feature

Describe the solution you'd like
Currently, CAPI will spread control plane machines across the reported failure domains (i.e. availability zones). It doesn't do this for worker nodes, machines in a machine deployment (or machines on their own).

Current advice is to create separate machine deployments and manually assign an az (via FailureDomain) to each of the machine deployments to ensure that you have worker machines in different azs.

It would be better when creating machines (if no failure domain is specified on the Machine) that we use the failuredomains on the Cluster and create the machine in a failure domain with the least amount of machine already. CAPI has some functions we could potentially use. Something like this:

machines := collections.FromMachineList(machinesList)
failureDomain := failuredomains.PickFewest(m.Cluster.Status.FailureDomains, machines)

Anything else you would like to add:
We need to investigate if this is feasible, or if it is something that should be upstream in machine deployments.

Environment:

Cluster-api-provider-aws version:
Kubernetes version: (use kubectl version):
OS (e.g. from /etc/os-release):

The text was updated successfully, but these errors were encountered:

k8s-ci-robot · 2022-02-11T11:11:22Z

@richardcase: This issue is currently awaiting triage.

If CAPA/CAPI contributors determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

sedefsavas · 2022-02-17T02:30:22Z

There were some discussions around this before.
This might worth discussing in cluster-api office hours given that all providers are being affected by this.

richardcase · 2022-02-18T08:07:57Z

There were some discussions around this before. This might worth discussing in cluster-api office hours given that all providers are being affected by this.

Good idea. I will add a agenda item for this.

k8s-triage-robot · 2022-05-19T09:01:04Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

richardcase · 2022-06-08T11:05:19Z

/remove-lifecycle stale

k8s-triage-robot · 2022-09-06T11:50:55Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

richardcase · 2022-09-20T08:08:48Z

/remove-lifecycle stale

dlipovetsky · 2022-12-12T17:45:16Z

From triage 12/2022: Let's add to agenda for next office hours. Core CAPI MachineDeployment does not support multiple failure domains. Please see kubernetes-sigs/cluster-api#3358. We'll hold off on applying /triage label until then.

For reference: Oracle and MicroVM Infrastructure Providers do distribute machines in one MachineDeployment across multiple failure domain. (Links to be added here)

richardcase · 2023-01-09T17:45:03Z

Discussed in the 6th Jan 2023 office hours.

cnmcavoy · 2023-01-09T18:04:26Z

As discussed in the CAPA office hours, Indeed had several CAPA workload clusters (self-managed, non-eks) spanning all az's in us-east-2 on july 28 2022 during the outage. Our clusters are configured to use machine deployments in each AZ and the cluster autoscaler is configured for autoscaling machine deployments with the clusterapi provider. We also configure the cluster autoscaler and all of the CAPI/CAPA controllers to use leader election and run 3 replicas of each.

What we observed was that when power to AZ1 was lost, 10 minutes later (I believe 10 minutes is due to the 5 minutes delay for the nodes to be marked unready due to missing kubelet heartbeat + 5 minutes for the pod-eviction-timeout of the kube-controller-manager, but I'm not 100% certain), pods were recreated by kubernetes scheduler without any outside interaction, and were in the pending state. The cluster autoscaler scaled up the machine deployments, and as soon as the machines joined the cluster, workloads scheduled and workloads continued to perform normally, despite the control plane being in a degraded state. No human intervention was required for the cluster recovery after AZ1 was restored or during the outage.

Below are two sets of graphs from one of those clusters, which shows the control plane becoming degraded (2/3 available), and then the pods scheduled / created. The pods are scheduled in 3 "waves" as machines join the cluster and then allow more pods to schedule.

I can provide more specific details on how the MD's were configured if that's useful.

So I wonder if instead of implementing this feature, documentation on how to correctly configure CAPA clusters to sustain an AZ outage would be more desirable?

k8s-triage-robot · 2023-04-09T18:46:10Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2023-05-09T19:12:43Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · 2023-06-08T19:29:53Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot · 2023-06-08T19:30:00Z

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied

After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied

After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen

Mark this issue as fresh with /remove-lifecycle rotten

Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

richardcase · 2023-06-09T04:58:34Z

/reopen
/remove-lifecycle rotten

k8s-ci-robot · 2023-06-09T04:58:38Z

@richardcase: Reopened this issue.

In response to this:

/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-triage-robot · 2024-01-22T05:27:58Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2024-02-21T05:35:31Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

richardcase · 2024-02-21T10:30:59Z

/remove-lifecycle rotten

k8s-triage-robot · 2024-05-21T11:30:23Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2024-06-20T11:47:54Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · 2024-07-20T11:51:44Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot · 2024-07-20T11:51:49Z

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied

After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied

After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen

Mark this issue as fresh with /remove-lifecycle rotten

Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

richardcase · 2024-07-23T06:54:18Z

/reopen
/remove-lifecycle rotten

k8s-ci-robot · 2024-07-23T06:54:22Z

@richardcase: Reopened this issue.

In response to this:

/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-triage-robot · 2024-10-21T07:37:55Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2024-11-20T07:47:02Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · 2024-12-20T08:23:51Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot · 2024-12-20T08:23:55Z

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied

After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied

After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen

Mark this issue as fresh with /remove-lifecycle rotten

Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 11, 2022

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 19, 2022

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 8, 2022

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 6, 2022

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 20, 2022

dlipovetsky mentioned this issue Mar 9, 2023

Enhance the AWSMachineTemplate to support multiple instance types, enable mixed instances in a MachineDeployment #4118

Closed

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 9, 2023

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 9, 2023

k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale Jun 8, 2023

k8s-ci-robot reopened this Jun 9, 2023

k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jun 9, 2023

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 22, 2024

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 21, 2024

k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Feb 21, 2024

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 21, 2024

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 20, 2024

k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale Jul 20, 2024

k8s-ci-robot reopened this Jul 23, 2024

k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 23, 2024

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 21, 2024

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 20, 2024

k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale Dec 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Default to spread worker nodes across failure domains #3203

Default to spread worker nodes across failure domains #3203

richardcase commented Feb 11, 2022

k8s-ci-robot commented Feb 11, 2022

sedefsavas commented Feb 17, 2022

richardcase commented Feb 18, 2022

k8s-triage-robot commented May 19, 2022

richardcase commented Jun 8, 2022

k8s-triage-robot commented Sep 6, 2022

richardcase commented Sep 20, 2022

dlipovetsky commented Dec 12, 2022 •

edited

Loading

richardcase commented Jan 9, 2023

cnmcavoy commented Jan 9, 2023

k8s-triage-robot commented Apr 9, 2023

k8s-triage-robot commented May 9, 2023

k8s-triage-robot commented Jun 8, 2023

k8s-ci-robot commented Jun 8, 2023

richardcase commented Jun 9, 2023

k8s-ci-robot commented Jun 9, 2023

k8s-triage-robot commented Jan 22, 2024

k8s-triage-robot commented Feb 21, 2024

richardcase commented Feb 21, 2024

k8s-triage-robot commented May 21, 2024

k8s-triage-robot commented Jun 20, 2024

k8s-triage-robot commented Jul 20, 2024

k8s-ci-robot commented Jul 20, 2024

richardcase commented Jul 23, 2024

k8s-ci-robot commented Jul 23, 2024

k8s-triage-robot commented Oct 21, 2024

k8s-triage-robot commented Nov 20, 2024

k8s-triage-robot commented Dec 20, 2024

k8s-ci-robot commented Dec 20, 2024

Default to spread worker nodes across failure domains #3203

Default to spread worker nodes across failure domains #3203

Comments

richardcase commented Feb 11, 2022

k8s-ci-robot commented Feb 11, 2022

sedefsavas commented Feb 17, 2022

richardcase commented Feb 18, 2022

k8s-triage-robot commented May 19, 2022

richardcase commented Jun 8, 2022

k8s-triage-robot commented Sep 6, 2022

richardcase commented Sep 20, 2022

dlipovetsky commented Dec 12, 2022 • edited Loading

richardcase commented Jan 9, 2023

cnmcavoy commented Jan 9, 2023

k8s-triage-robot commented Apr 9, 2023

k8s-triage-robot commented May 9, 2023

k8s-triage-robot commented Jun 8, 2023

k8s-ci-robot commented Jun 8, 2023

richardcase commented Jun 9, 2023

k8s-ci-robot commented Jun 9, 2023

k8s-triage-robot commented Jan 22, 2024

k8s-triage-robot commented Feb 21, 2024

richardcase commented Feb 21, 2024

k8s-triage-robot commented May 21, 2024

k8s-triage-robot commented Jun 20, 2024

k8s-triage-robot commented Jul 20, 2024

k8s-ci-robot commented Jul 20, 2024

richardcase commented Jul 23, 2024

k8s-ci-robot commented Jul 23, 2024

k8s-triage-robot commented Oct 21, 2024

k8s-triage-robot commented Nov 20, 2024

k8s-triage-robot commented Dec 20, 2024

k8s-ci-robot commented Dec 20, 2024

dlipovetsky commented Dec 12, 2022 •

edited

Loading