
feat(cluster): add failureDomain spec label #4145

Merged

Conversation

handsomejack-42
Contributor

@handsomejack-42 commented Oct 17, 2023

What type of PR is this?

/kind api-change

Why we need it:

We need zonal control plane nodes. We might have missed something, but as far as we can tell this can't be achieved in the current state of CAPZ for our use case.


  1. In its Status property, AzureCluster announces all discovered failure domains (FDs).
  2. When rolling out control planes, KubeadmControlPlane consults the FDs reported by the cluster.
  3. There is currently no way for us to restrict the FDs used by KubeadmControlPlane when rolling out nodes (no failureDomain in spec).
  4. The CAPZ documentation only mentions that explicit placement into an FD is achieved by setting the label on a Machine(Deployment). In our use case, however, there are no Machine(Deployment)s; KubeadmControlPlane spins up AzureMachines directly.

What this PR does:

The goal of this commit is to allow CAPZ users to specify which failure domains are eligible for control plane rollouts.

CAPI elders suggested that the implementation of this behavior lies within the individual providers (CAPZ in our case).

  • tl;dr: the goal is that the excluded FDs are not announced in the AzureCluster object's status, so KubeadmControlPlane won't take them into consideration when rolling out control plane nodes

There's a new field, .spec.failureDomains, that is used to post-filter the discovered failure domains.

The field is optional - if it's missing, all discovered failure domains are eligible for control plane rollout.
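
For illustration, here is a minimal Go sketch of the kind of entry a user would put in the new field. This is not code from the PR; it only assumes the field reuses the upstream clusterv1.FailureDomains map type and that the region exposes availability zones "1", "2" and "3".

```go
package main

import (
	"fmt"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

func main() {
	// Hypothetical spec.failureDomains content: zone "3" is marked as not
	// eligible for control plane nodes; zones not listed here are unaffected.
	specFailureDomains := clusterv1.FailureDomains{
		"3": clusterv1.FailureDomainSpec{ControlPlane: false},
	}

	for id, fd := range specFailureDomains {
		fmt.Printf("failure domain %q: controlPlane=%t\n", id, fd.ControlPlane)
	}
}
```

In the AzureCluster manifest itself this corresponds to a controlPlane: false entry under spec.failureDomains, along the lines of the documentation snippet discussed later in the review.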

Special notes for your reviewer:

  • cherry-pick candidate

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

Added an AzureCluster.spec.failureDomains field that can be used to prevent control plane node deployment to specified failure domains

@k8s-ci-robot k8s-ci-robot added kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Oct 17, 2023
@linux-foundation-easycla

linux-foundation-easycla bot commented Oct 17, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: handsomejack-42 / name: Honza (e2eb28d)

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 17, 2023
@k8s-ci-robot
Contributor

Hi @handsomejack-42. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Oct 17, 2023
@CecileRobertMichon
Contributor

@handsomejack-42 please let us know once you've signed the CLA so we can trigger the tests

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Oct 18, 2023
@handsomejack-42
Contributor Author

@handsomejack-42 please let us know once you've signed the CLA so we can trigger the tests

```
return
}

if fd, ok := s.AzureCluster.Spec.FailureDomains[id]; ok && fd.ControlPlane {
```
Contributor Author

@handsomejack-42 Oct 18, 2023

On second thought, there actually is a change to the current flow.

Previously, when an FD was discovered (and not suitable for the control plane, i.e. controlPlane: false), it would end up in the status field no matter what.

In my version, imagine FD1 is discovered with controlPlane: false, and the spec lists a different FD with an arbitrary value for controlPlane.

The change is that FD1 will no longer be announced in the status.

Thoughts?

In general: does it make sense to announce FDs with controlPlane: false?

To be on the safe side, the behavior could change to the following:

"FD is always propagated to status, only its controlPlane value can be overridden to false by spec"

Contributor

"FD is always propagated to status, only its controlPlane value can be overriden to false by spec"

This is more future-proof and safe, so +1

In practice, CAPZ sets every single Azure zone it discovers to controlPlane: true: https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/main/controllers/azurecluster_reconciler.go#L217 so this shouldn't actually happen
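
As a rough, paraphrased sketch of that discovery behavior (not the actual reconciler code; clusterv1 again stands for sigs.k8s.io/cluster-api/api/v1beta1), every zone found for the region is announced as control-plane eligible before any spec-based override is applied:

```go
// announceDiscoveredZones is an illustrative stand-in for the discovery step:
// each availability zone reported by Azure for the region becomes a failure
// domain with ControlPlane set to true.
func announceDiscoveredZones(zones []string) clusterv1.FailureDomains {
	discovered := clusterv1.FailureDomains{}
	for _, zone := range zones { // e.g. []string{"1", "2", "3"}
		discovered[zone] = clusterv1.FailureDomainSpec{ControlPlane: true}
	}
	return discovered
}
```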

Contributor Author

changed

@mboersma
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 18, 2023
@handsomejack-42 handsomejack-42 force-pushed the hsj/cluster-failure-domain branch from 4048657 to 3bec8e2 Compare October 18, 2023 12:58
@handsomejack-42
Contributor Author

/retest

@handsomejack-42
Contributor Author

@mboersma

```
make: Leaving directory '/home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure'
cp: cannot stat 'coverage.*': No such file or directory
go: downloading golang.org/x/tools v0.8.0
Couldn't load /logs/artifacts/coverage.out: open /logs/artifacts/coverage.out: no such file or directory.+ EXIT_VALUE=1
```

Can you please check this?

@mboersma
Contributor

/retest

Seems like a flake in the container or with file I/O? Let's hope so.

@mboersma
Contributor

mboersma commented Oct 18, 2023

If I run make lint on this PR I get a clearer error:

```
INFO [runner] linters took 10.064607584s with stages: goanalysis_metalinter: 10.010416459s
azure/scope/cluster_test.go:3498: unnecessary trailing newline (whitespace)

}
```

I think that added blank line at 3498 is what's causing the test failure, although I'm not sure why that wasn't reported clearly in the test job.

@handsomejack-42 handsomejack-42 force-pushed the hsj/cluster-failure-domain branch from 3bec8e2 to 99e4734 Compare October 19, 2023 07:27
@handsomejack-42
Contributor Author

/retest

@codecov

codecov bot commented Oct 19, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (6987809) 57.86% compared to head (e2eb28d) 57.90%.
Report is 16 commits behind head on main.

Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #4145      +/-   ##
==========================================
+ Coverage   57.86%   57.90%   +0.03%
==========================================
  Files         187      187
  Lines       19216    19219       +3
==========================================
+ Hits        11120    11128       +8
+ Misses       7468     7463       -5
  Partials      628      628

Files                        Coverage Δ
api/v1beta1/types_class.go   76.47% <ø> (ø)
azure/scope/cluster.go       60.88% <100.00%> (+0.80%) ⬆️
```


@CecileRobertMichon
Contributor

We need zonal control plane nodes. We might have missed something, but as far as we can tell this can't be achieved in the current state of CAPZ for our use case.

@handsomejack-42 to clarify, it's possible to spread control plane nodes across availability zones, and in fact they do get spread automatically across all available zones. Can you please expand on the use case for selecting a subset of the failure domains to spread control planes? Is the goal to have nodes in specific Azure availability zones? or am I misunderstanding what the PR is doing?

Contributor

@CecileRobertMichon left a comment

@handsomejack-42 handsomejack-42 force-pushed the hsj/cluster-failure-domain branch 2 times, most recently from 2eaf8ee to 1687af3 Compare October 25, 2023 12:04
@handsomejack-42
Contributor Author

handsomejack-42 commented Oct 25, 2023

We need zonal control plane nodes. We might have missed something, but as far as we can tell this can't be achieved in the current state of CAPZ for our use case.

@handsomejack-42 to clarify, it's possible to spread control plane nodes across availability zones, and in fact they do get spread automatically across all available zones. Can you please expand on the use case for selecting a subset of the failure domains to spread control planes? Is the goal to have nodes in specific Azure availability zones? or am I misunderstanding what the PR is doing?

Is the goal to have nodes in specific Azure availability zones

Exactly this - basically we didn't want to end up in a state where the control plane gets placed in any available AZ while the worker clusters it's supposed to manage end up in a different AZ.

@handsomejack-42
Contributor Author

/retest

@handsomejack-42 handsomejack-42 force-pushed the hsj/cluster-failure-domain branch from 1687af3 to 557ee8e Compare October 25, 2023 13:44
@mboersma
Contributor

/retest

Prow failed to start the container.

@@ -128,6 +128,21 @@ spec:

```

If you can't use `Machine` (or `MachineDeployment`) to explicitly place your VMs (for example, there is a `KubeAdmControlPlane` does not accept those as an object reference but rather uses `AzureMachineTemplate` directly), then you can opt to restrict the announcement of discovered failure domains from the cluster's status itself.
Contributor

Suggested change
If you can't use `Machine` (or `MachineDeployment`) to explicitly place your VMs (for example, there is a `KubeAdmControlPlane` does not accept those as an object reference but rather uses `AzureMachineTemplate` directly), then you can opt to restrict the announcement of discovered failure domains from the cluster's status itself.
If you can't use `Machine` (or `MachineDeployment`) to explicitly place your VMs (for example, there is a `KubeadmControlPlane` does not accept those as an object reference but rather uses `AzureMachineTemplate` directly), then you can opt to restrict the announcement of discovered failure domains from the cluster's status itself.

Contributor

for example, there is a KubeadmControlPlane does not accept those as an object reference

I think we're missing a word here; I couldn't quite parse this paragraph...

Contributor Author

fixed, sorry

spec:
  location: eastus
  failureDomains:
    fd1:
Contributor

is fd1 actually what an Azure fd id would look like?

Contributor Author

When we were testing it on a dev cluster, the values reported in the cluster status were 1, 2, etc., but I wasn't sure if that's the standard, and for some reason it felt more explicit adding the fd prefix.

changed to 1:

Contributor

@CecileRobertMichon left a comment

lgtm aside from nits in doc

/assign @mboersma

@@ -56,6 +57,15 @@ type AzureClusterClassSpec struct {
// Note: All cloud provider config values can be customized by creating the secret beforehand. CloudProviderConfigOverrides is only used when the secret is managed by the Azure Provider.
// +optional
CloudProviderConfigOverrides *CloudProviderConfigOverrides `json:"cloudProviderConfigOverrides,omitempty"`

// FailureDomains specifies the list of unique failure domains for the location/region of the cluster.
Contributor

Maybe we could shorten this comment:

// FailureDomains is a list of failure domains in the cluster's region, used to restrict
// eligibility to host the control plane. A FailureDomain maps to an availability zone,
// which is a separated group of datacenters within a region.
// See: https://learn.microsoft.com/azure/reliability/availability-zones-overview
// +optional

Contributor Author

applied, thx
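
For context, the field this comment documents would then read roughly as follows (a sketch of the final shape; the json tag and the surrounding fields are assumed, and clusterv1 is the usual alias for sigs.k8s.io/cluster-api/api/v1beta1):

```go
type AzureClusterClassSpec struct {
	// ... other fields elided ...

	// FailureDomains is a list of failure domains in the cluster's region, used to restrict
	// eligibility to host the control plane. A FailureDomain maps to an availability zone,
	// which is a separated group of datacenters within a region.
	// See: https://learn.microsoft.com/azure/reliability/availability-zones-overview
	// +optional
	FailureDomains clusterv1.FailureDomains `json:"failureDomains,omitempty"`
}
```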

@@ -875,11 +875,17 @@ func (s *ClusterScope) APIServerHost() string {
return s.APIServerPublicIP().DNSName
}

// SetFailureDomain will set the spec for a for a given key.
// SetFailureDomain to cluster's status by given id. The value of provided control plane might
Contributor

Suggested change
// SetFailureDomain to cluster's status by given id. The value of provided control plane might
// SetFailureDomain sets a failure domain in a cluster's status by its id. The provided failure domain spec may

Contributor Author

I tried to avoid the stutter here ("SetFailureDomain sets a failure domain") but I didn't feel very confident about how my version read either.

applied your suggestion
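
Putting the agreed semantics together, here is a minimal standalone sketch of what the status update ends up doing (not necessarily the PR's exact SetFailureDomain code; the helper name and signature are illustrative): a discovered failure domain is always written to status, but an entry in spec.failureDomains may force its ControlPlane flag to false.

```go
// setFailureDomain records a discovered failure domain in status, letting the
// spec override the ControlPlane flag to false, never to true.
// clusterv1 is the usual alias for sigs.k8s.io/cluster-api/api/v1beta1.
func setFailureDomain(status, fromSpec clusterv1.FailureDomains, id string, discovered clusterv1.FailureDomainSpec) {
	if specFD, ok := fromSpec[id]; ok && !specFD.ControlPlane {
		// The spec can only restrict placement: a zone Azure reports as
		// control-plane capable can be opted out, but not the other way around.
		discovered.ControlPlane = false
	}
	status[id] = discovered
}
```

A domain missing from the spec is announced exactly as discovered, which matches the "no breaking change" note in the squashed commit message below.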


@handsomejack-42 handsomejack-42 force-pushed the hsj/cluster-failure-domain branch from 557ee8e to de9739a Compare October 26, 2023 07:25
The goal of this commit is to allow CAPZ users to specify
which failure domains are eligible for control plane rollouts.

There's a new field, AzureCluster.spec.failureDomains, that can be
used to override a failure domain's controlPlane value to false,
to prevent the control plane from being deployed there.

The field is optional - if it's missing, all discovered failure domains
are announced in status as-is.

THERE IS NO BREAKING CHANGE TO CURRENT USERS.
@handsomejack-42 handsomejack-42 force-pushed the hsj/cluster-failure-domain branch from de9739a to e2eb28d Compare October 26, 2023 07:48
Contributor

@CecileRobertMichon left a comment

/lgtm

@handsomejack-42 can you please add a release note in the PR description?

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 26, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 8c9a95ab868ceed400ea6081bbe2a8d693b4f0a4

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Oct 27, 2023
@handsomejack-42
Contributor Author

/lgtm

@handsomejack-42 can you please add a release note in the PR description?

done:

Added an AzureCluster.spec.failureDomains field that can be used to prevent control plane node deployment to specified failure domains

Contributor

@mboersma left a comment

/lgtm
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mboersma

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 27, 2023
@k8s-ci-robot k8s-ci-robot merged commit b1c0d6f into kubernetes-sigs:main Oct 27, 2023
11 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.12 milestone Oct 27, 2023
Labels: approved, cncf-cla: yes, kind/api-change, lgtm, ok-to-test, release-note, size/L

4 participants