
feat(cluster): add failureDomain spec label #4145

Merged

Conversation

handsomejack-42
Contributor

@handsomejack-42 commented Oct 17, 2023

What type of PR is this?

/kind api-change

Why we need it:

We need zonal control plane nodes. We might have missed something, but as far as we can tell this can't be achieved in the current state of CAPZ for our use case.


  1. In its Status property, AzureCluster announces all discovered failure domains (FDs).
  2. When rolling out control planes, KubeadmControlPlane consults the FDs reported by the cluster.
  3. There is currently no way for us to restrict the FDs used by KubeadmControlPlane when rolling out nodes (no failureDomain in spec).
  4. The CAPZ documentation only mentions that explicit placement into an FD is achieved by setting the label on a Machine(Deployment). In our use case, however, there are no Machine(Deployment)s; KubeadmControlPlane spins up AzureMachines directly.

What this PR does:

The goal of this commit is to allow CAPZ users to specify which failure domains are eligible for control plane rollouts.

CAPI elders suggested that the implementation of this behavior lies within the individual providers (CAPZ in our case).

  • tl;dr: the goal is that the excluded FDs are not announced in the AzureCluster object's status, so KubeadmControlPlane won't take them into consideration when rolling out control plane nodes

There's a new field, .spec.failureDomains, that is used to post-filter the discovered failure domains.

The field is optional - if it's missing, all discovered failure domains are eligible for control plane rollout.
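
For illustration, here is a minimal Go sketch of the kind of entry a user would put in the new field. This is not code from the PR; it only assumes the field reuses the upstream clusterv1.FailureDomains map type and that the region exposes availability zones "1", "2" and "3".

```go
package main

import (
	"fmt"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

func main() {
	// Hypothetical spec.failureDomains content: zone "3" is marked as not
	// eligible for control plane nodes; zones not listed here are unaffected.
	specFailureDomains := clusterv1.FailureDomains{
		"3": clusterv1.FailureDomainSpec{ControlPlane: false},
	}

	for id, fd := range specFailureDomains {
		fmt.Printf("failure domain %q: controlPlane=%t\n", id, fd.ControlPlane)
	}
}
```

In the AzureCluster manifest itself this corresponds to a controlPlane: false entry under spec.failureDomains, along the lines of the documentation snippet discussed later in the review.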

Special notes for your reviewer:

  • cherry-pick candidate

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

Added an AzureCluster.spec.failureDomains field that can be used to prevent control plane node deployment to specified failure domains

@k8s-ci-robot k8s-ci-robot added kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Oct 17, 2023
@linux-foundation-easycla

linux-foundation-easycla bot commented Oct 17, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: handsomejack-42 / name: Honza (e2eb28d)

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 17, 2023
@k8s-ci-robot
Contributor

Hi @handsomejack-42. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Oct 17, 2023
@CecileRobertMichon
Contributor

@handsomejack-42 please let us know once you've signed the CLA so we can trigger the tests

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Oct 18, 2023
@handsomejack-42
Contributor Author

@handsomejack-42 please let us know once you've signed the CLA so we can trigger the tests

```
return
}

if fd, ok := s.AzureCluster.Spec.FailureDomains[id]; ok && fd.ControlPlane {
```
Contributor Author

@handsomejack-42 Oct 18, 2023

On second thought, there actually is a change to the current flow.

Previously, when an FD was discovered (and not suitable for the control plane, i.e. controlPlane: false), it would end up in the status field no matter what.

In my version, imagine FD1 is discovered with controlPlane: false, and the spec lists a different FD with an arbitrary value for controlPlane.

The change is that FD1 will no longer be announced in the status.

Thoughts?

In general: does it make sense to announce FDs with controlPlane: false?

To be on the safe side, the behavior could change to the following:

"FD is always propagated to status, only its controlPlane value can be overridden to false by spec"

Contributor

"FD is always propagated to status, only its controlPlane value can be overriden to false by spec"

This is more future-proof and safe, so +1

In practice, CAPZ sets every single Azure zone it discovers to controlPlane: true: https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/main/controllers/azurecluster_reconciler.go#L217 so this shouldn't actually happen
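
As a rough, paraphrased sketch of that discovery behavior (not the actual reconciler code; clusterv1 again stands for sigs.k8s.io/cluster-api/api/v1beta1), every zone found for the region is announced as control-plane eligible before any spec-based override is applied:

```go
// announceDiscoveredZones is an illustrative stand-in for the discovery step:
// each availability zone reported by Azure for the region becomes a failure
// domain with ControlPlane set to true.
func announceDiscoveredZones(zones []string) clusterv1.FailureDomains {
	discovered := clusterv1.FailureDomains{}
	for _, zone := range zones { // e.g. []string{"1", "2", "3"}
		discovered[zone] = clusterv1.FailureDomainSpec{ControlPlane: true}
	}
	return discovered
}
```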

Contributor Author

changed

@mboersma
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 18, 2023
@handsomejack-42 handsomejack-42 force-pushed the hsj/cluster-failure-domain branch from 4048657 to 3bec8e2 Compare October 18, 2023 12:58
@handsomejack-42
Contributor Author

/retest

@handsomejack-42
Contributor Author

@mboersma

```
make: Leaving directory '/home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure'
cp: cannot stat 'coverage.*': No such file or directory
go: downloading golang.org/x/tools v0.8.0
Couldn't load /logs/artifacts/coverage.out: open /logs/artifacts/coverage.out: no such file or directory.+ EXIT_VALUE=1
```

Can you please check this?

@mboersma
Contributor

/retest

Seems like a flake in the container or with file I/O? Let's hope so.

@mboersma
Contributor

mboersma commented Oct 18, 2023

If I run make lint on this PR I get a clearer error:

```
INFO [runner] linters took 10.064607584s with stages: goanalysis_metalinter: 10.010416459s
azure/scope/cluster_test.go:3498: unnecessary trailing newline (whitespace)

}
```

I think that added blank line at 3498 is what's causing the test failure, although I'm not sure why that wasn't reported clearly in the test job.

@handsomejack-42 handsomejack-42 force-pushed the hsj/cluster-failure-domain branch from 3bec8e2 to 99e4734 Compare October 19, 2023 07:27
@handsomejack-42
Contributor Author

/retest

@codecov

codecov bot commented Oct 19, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (6987809) 57.86% compared to head (e2eb28d) 57.90%.
Report is 16 commits behind head on main.

Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #4145      +/-   ##
==========================================
+ Coverage   57.86%   57.90%   +0.03%
==========================================
  Files         187      187
  Lines       19216    19219       +3
==========================================
+ Hits        11120    11128       +8
+ Misses       7468     7463       -5
  Partials      628      628

Files                        Coverage Δ
api/v1beta1/types_class.go   76.47% <ø> (ø)
azure/scope/cluster.go       60.88% <100.00%> (+0.80%) ⬆️
```


@CecileRobertMichon
Contributor

We need zonal control plane nodes. We might have missed something, but as far as we can tell this can't be achieved in the current state of CAPZ for our use case.

@handsomejack-42 to clarify, it's possible to spread control plane nodes across availability zones, and in fact they do get spread automatically across all available zones. Can you please expand on the use case for selecting a subset of the failure domains to spread control planes? Is the goal to have nodes in specific Azure availability zones? or am I misunderstanding what the PR is doing?

Contributor

@CecileRobertMichon left a comment

@handsomejack-42 handsomejack-42 force-pushed the hsj/cluster-failure-domain branch 2 times, most recently from 2eaf8ee to 1687af3 Compare October 25, 2023 12:04
@handsomejack-42
Contributor Author

handsomejack-42 commented Oct 25, 2023

We need zonal control plane nodes. We might have missed something, but as far as we can tell this can't be achieved in the current state of CAPZ for our use case.

@handsomejack-42 to clarify, it's possible to spread control plane nodes across availability zones, and in fact they do get spread automatically across all available zones. Can you please expand on the use case for selecting a subset of the failure domains to spread control planes? Is the goal to have nodes in specific Azure availability zones? or am I misunderstanding what the PR is doing?

Is the goal to have nodes in specific Azure availability zones

Exactly this - basically we didn't want to end up in a state where the control plane gets placed in any available AZ while the worker clusters it's supposed to manage end up in a different AZ.

@handsomejack-42
Contributor Author

/retest

@handsomejack-42 handsomejack-42 force-pushed the hsj/cluster-failure-domain branch from 1687af3 to 557ee8e Compare October 25, 2023 13:44
@mboersma
Contributor

/retest

Prow failed to start the container.

@@ -128,6 +128,21 @@ spec:

```

If you can't use `Machine` (or `MachineDeployment`) to explicitly place your VMs (for example, there is a `KubeAdmControlPlane` does not accept those as an object reference but rather uses `AzureMachineTemplate` directly), then you can opt to restrict the announcement of discovered failure domains from the cluster's status itself.
Contributor

Suggested change
If you can't use `Machine` (or `MachineDeployment`) to explicitly place your VMs (for example, there is a `KubeAdmControlPlane` does not accept those as an object reference but rather uses `AzureMachineTemplate` directly), then you can opt to restrict the announcement of discovered failure domains from the cluster's status itself.
If you can't use `Machine` (or `MachineDeployment`) to explicitly place your VMs (for example, there is a `KubeadmControlPlane` does not accept those as an object reference but rather uses `AzureMachineTemplate` directly), then you can opt to restrict the announcement of discovered failure domains from the cluster's status itself.

Contributor

for example, there is a KubeadmControlPlane does not accept those as an object reference

I think we're missing a word here; I couldn't quite parse this paragraph...

Contributor Author

fixed, sorry

spec:
  location: eastus
  failureDomains:
    fd1:
Contributor

is fd1 actually what an Azure fd id would look like?

Contributor Author

When we were testing it on a dev cluster, the values reported in the cluster status were 1, 2, etc., but I wasn't sure if that's the standard, and for some reason it felt more explicit adding the fd prefix.

changed to 1:

Contributor

@CecileRobertMichon left a comment

lgtm aside from nits in doc

/assign @mboersma

@@ -56,6 +57,15 @@ type AzureClusterClassSpec struct {
// Note: All cloud provider config values can be customized by creating the secret beforehand. CloudProviderConfigOverrides is only used when the secret is managed by the Azure Provider.
// +optional
CloudProviderConfigOverrides *CloudProviderConfigOverrides `json:"cloudProviderConfigOverrides,omitempty"`

// FailureDomains specifies the list of unique failure domains for the location/region of the cluster.
Contributor

Maybe we could shorten this comment:

// FailureDomains is a list of failure domains in the cluster's region, used to restrict
// eligibility to host the control plane. A FailureDomain maps to an availability zone,
// which is a separated group of datacenters within a region.
// See: https://learn.microsoft.com/azure/reliability/availability-zones-overview
// +optional

Contributor Author

applied, thx
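
For context, the field this comment documents would then read roughly as follows (a sketch of the final shape; the json tag and the surrounding fields are assumed, and clusterv1 is the usual alias for sigs.k8s.io/cluster-api/api/v1beta1):

```go
type AzureClusterClassSpec struct {
	// ... other fields elided ...

	// FailureDomains is a list of failure domains in the cluster's region, used to restrict
	// eligibility to host the control plane. A FailureDomain maps to an availability zone,
	// which is a separated group of datacenters within a region.
	// See: https://learn.microsoft.com/azure/reliability/availability-zones-overview
	// +optional
	FailureDomains clusterv1.FailureDomains `json:"failureDomains,omitempty"`
}
```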

@@ -875,11 +875,17 @@ func (s *ClusterScope) APIServerHost() string {
return s.APIServerPublicIP().DNSName
}

// SetFailureDomain will set the spec for a for a given key.
// SetFailureDomain to cluster's status by given id. The value of provided control plane might
Contributor

Suggested change
// SetFailureDomain to cluster's status by given id. The value of provided control plane might
// SetFailureDomain sets a failure domain in a cluster's status by its id. The provided failure domain spec may

Contributor Author

I tried to avoid the stutter here ("SetFailureDomain sets a failure domain") but I didn't feel very confident about how my version read either.

applied your suggestion
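
Putting the agreed semantics together, here is a minimal standalone sketch of what the status update ends up doing (not necessarily the PR's exact SetFailureDomain code; the helper name and signature are illustrative): a discovered failure domain is always written to status, but an entry in spec.failureDomains may force its ControlPlane flag to false.

```go
// setFailureDomain records a discovered failure domain in status, letting the
// spec override the ControlPlane flag to false, never to true.
// clusterv1 is the usual alias for sigs.k8s.io/cluster-api/api/v1beta1.
func setFailureDomain(status, fromSpec clusterv1.FailureDomains, id string, discovered clusterv1.FailureDomainSpec) {
	if specFD, ok := fromSpec[id]; ok && !specFD.ControlPlane {
		// The spec can only restrict placement: a zone Azure reports as
		// control-plane capable can be opted out, but not the other way around.
		discovered.ControlPlane = false
	}
	status[id] = discovered
}
```

A domain missing from the spec is announced exactly as discovered, which matches the "no breaking change" note in the squashed commit message below.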


@handsomejack-42 handsomejack-42 force-pushed the hsj/cluster-failure-domain branch from 557ee8e to de9739a Compare October 26, 2023 07:25
The goal of this commit is to allow CAPZ users to specify
which failure domains are eligible for control plane rollouts.

There's a new field, AzureCluster.spec.failureDomains, that can be
used to override a failure domain's controlPlane value to false,
to prevent the control plane from being deployed there.

The field is optional - if it's missing, all discovered failure domains
are announced in status as-is.

THERE IS NO BREAKING CHANGE TO CURRENT USERS.
@handsomejack-42 handsomejack-42 force-pushed the hsj/cluster-failure-domain branch from de9739a to e2eb28d Compare October 26, 2023 07:48
Contributor

@CecileRobertMichon left a comment

/lgtm

@handsomejack-42 can you please add a release note in the PR description?

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 26, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 8c9a95ab868ceed400ea6081bbe2a8d693b4f0a4

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Oct 27, 2023
@handsomejack-42
Contributor Author

/lgtm

@handsomejack-42 can you please add a release note in the PR description?

done:

Added an AzureCluster.spec.failureDomains field that can be used to prevent control plane node deployment to specified failure domains

Contributor

@mboersma left a comment

/lgtm
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mboersma

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 27, 2023
@k8s-ci-robot k8s-ci-robot merged commit b1c0d6f into kubernetes-sigs:main Oct 27, 2023
11 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.12 milestone Oct 27, 2023
Labels: approved, cncf-cla: yes, kind/api-change, lgtm, ok-to-test, release-note, size/L

4 participants