feat(cluster): add failureDomain spec label #4145
Conversation
Hi @handsomejack-42. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@handsomejack-42 please let us know once you've signed the CLA so we can trigger the tests
✅
azure/scope/cluster.go (Outdated)

        return
    }

    if fd, ok := s.AzureCluster.Spec.FailureDomains[id]; ok && fd.ControlPlane {
on second thought, there actually is a change to current flow

previously, when a FD would be discovered (and not suitable for control plane, aka `controlPlane: false`), it would get into the status field no matter what

in my version, imagine there is FD1 discovered with `controlPlane: false`, and in `spec` there is a different FD with an arbitrary value for `controlPlane`. the change is, FD1 will no longer be announced in the status

thoughts? in general: does it make sense to announce FDs with `controlPlane: false`?

to be on the safe side, the behavior could change to be as follows: "FD is always propagated to `status`, only its `controlPlane` value can be overridden to false by `spec`"
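A minimal sketch of that safer behavior, assuming the CAPI `clusterv1.FailureDomains` types; function and variable names here are illustrative, not the PR's actual code:

```go
package example

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// applyFailureDomain illustrates the proposal: every discovered failure
// domain is propagated to status, but an entry in spec can only demote its
// ControlPlane flag to false, never promote it.
func applyFailureDomain(spec, status clusterv1.FailureDomains, id string, discovered clusterv1.FailureDomainSpec) {
	if fd, ok := spec[id]; ok && !fd.ControlPlane {
		// spec marks this FD as ineligible for the control plane.
		discovered.ControlPlane = false
	}
	// The FD is always announced in status, regardless of spec.
	status[id] = discovered
}
```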
"FD is always propagated to status, only its controlPlane value can be overriden to false by spec"
This is more future-proof and safe, so +1
In practice, CAPZ sets every single Azure zone it discovers to controlPlane: true
: https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/main/controllers/azurecluster_reconciler.go#L217 so this shouldn't actually happen
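For reference, the discovery step that comment links to boils down to roughly the following; this is a paraphrase under the assumption that zones are announced via the CAPI `clusterv1` types, not a copy of the code at that line:

```go
package example

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// announceZones paraphrases the discovery behavior: every availability zone
// found for the region is announced as a failure domain that is eligible to
// host the control plane (ControlPlane: true).
func announceZones(zones []string, status clusterv1.FailureDomains) {
	for _, zone := range zones {
		status[zone] = clusterv1.FailureDomainSpec{ControlPlane: true}
	}
}
```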
changed
/ok-to-test
force-pushed from 4048657 to 3bec8e2
/retest
can you pls check this?
/retest

Seems like a flake in the container or with file I/O? Let's hope so.
If I run
I think that added blank line at 3498 is what's causing the test failure, although I'm not sure why that wasn't reported clearly in the test job.
force-pushed from 3bec8e2 to 99e4734
/retest
Codecov Report

All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

    @@            Coverage Diff             @@
    ##             main    #4145      +/-   ##
    ==========================================
    + Coverage   57.86%   57.90%   +0.03%
    ==========================================
      Files         187      187
      Lines       19216    19219       +3
    ==========================================
    + Hits        11120    11128       +8
    + Misses       7468     7463       -5
      Partials      628      628

☔ View full report in Codecov by Sentry.
@handsomejack-42 to clarify, it's possible to spread control plane nodes across availability zones, and in fact they do get spread automatically across all available zones. Can you please expand on the use case for selecting a subset of the failure domains to spread control planes? Is the goal to have nodes in specific Azure availability zones? Or am I misunderstanding what the PR is doing?
Could you please also update the docs as part of this change https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/main/docs/book/src/topics/failure-domains.md?
force-pushed from 2eaf8ee to 1687af3
exactly this - basically we didn't want to end up in a state where the control plane gets placed in any available AZ and the worker clusters that it's supposed to manage end up in a different AZ
/retest
force-pushed from 1687af3 to 557ee8e
/retest

Prow failed to start the container.
@@ -128,6 +128,21 @@ spec:

If you can't use `Machine` (or `MachineDeployment`) to explicitly place your VMs (for example, there is a `KubeAdmControlPlane` does not accept those as an object reference but rather uses `AzureMachineTemplate` directly), then you can opt to restrict the announcement of discovered failure domains from the cluster's status itself.
Suggested change:

If you can't use `Machine` (or `MachineDeployment`) to explicitly place your VMs (for example, there is a `KubeAdmControlPlane` does not accept those as an object reference but rather uses `AzureMachineTemplate` directly), then you can opt to restrict the announcement of discovered failure domains from the cluster's status itself.
If you can't use `Machine` (or `MachineDeployment`) to explicitly place your VMs (for example, there is a `KubeadmControlPlane` does not accept those as an object reference but rather uses `AzureMachineTemplate` directly), then you can opt to restrict the announcement of discovered failure domains from the cluster's status itself.
> for example, there is a `KubeadmControlPlane` does not accept those as an object reference

I think we're missing a word here; I couldn't quite parse this paragraph...
fixed, sorry
spec:
  location: eastus
  failureDomains:
    fd1:
is fd1 actually what an Azure fd id would look like?
when we were testing it on dev cluster, the values reported in cluster status were `1`, `2` etc. but i wasn't sure if that's the standard or not and for some reason it felt more explicit adding the fd

changed to 1:
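For illustration, with the numeric zone ids Azure reports, the override could look roughly like this in Go; this is a sketch using the CAPI types, and which zones to demote is just an example:

```go
package example

import clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"

// Example spec override: demote zone "1" so control plane nodes only land in "2" and "3".
var exampleFailureDomains = clusterv1.FailureDomains{
	"1": {ControlPlane: false},
	"2": {ControlPlane: true},
	"3": {ControlPlane: true},
}
```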
lgtm aside from nits in doc
/assign @mboersma
api/v1beta1/types_class.go (Outdated)

@@ -56,6 +57,15 @@ type AzureClusterClassSpec struct {
    // Note: All cloud provider config values can be customized by creating the secret beforehand. CloudProviderConfigOverrides is only used when the secret is managed by the Azure Provider.
    // +optional
    CloudProviderConfigOverrides *CloudProviderConfigOverrides `json:"cloudProviderConfigOverrides,omitempty"`

    // FailureDomains specifies the list of unique failure domains for the location/region of the cluster.
Maybe we could shorten this comment:
// FailureDomains is a list of failure domains in the cluster's region, used to restrict
// eligibility to host the control plane. A FailureDomain maps to an availability zone,
// which is a separated group of datacenters within a region.
// See: https://learn.microsoft.com/azure/reliability/availability-zones-overview
// +optional
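For context, the field that comment would sit above might look roughly like this in `AzureClusterClassSpec`; the exact field type and json tag are assumptions drawn from the discussion, not a copy of the PR diff:

```go
package v1beta1

import clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"

type AzureClusterClassSpec struct {
	// ... other existing fields ...

	// FailureDomains is a list of failure domains in the cluster's region, used to restrict
	// eligibility to host the control plane. A FailureDomain maps to an availability zone,
	// which is a separated group of datacenters within a region.
	// See: https://learn.microsoft.com/azure/reliability/availability-zones-overview
	// +optional
	FailureDomains clusterv1.FailureDomains `json:"failureDomains,omitempty"`
}
```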
applied, thx
azure/scope/cluster.go (Outdated)

@@ -875,11 +875,17 @@ func (s *ClusterScope) APIServerHost() string {
    return s.APIServerPublicIP().DNSName
}

// SetFailureDomain will set the spec for a for a given key.
// SetFailureDomain to cluster's status by given id. The value of provided control plane might
Suggested change:

// SetFailureDomain to cluster's status by given id. The value of provided control plane might
// SetFailureDomain sets a failure domain in a cluster's status by its id. The provided failure domain spec may
i tried to avoid the stutter here "SetFailureDomain sets failure domain" but i didn't feel very confident about how my version read either
applied your suggestion
force-pushed from 557ee8e to de9739a
The goal of this commit is to allow capz users to specify which failure domains are eligible for control plane rollouts. There's a new label in AzureCluster.spec.failureDomain that can be used to override the values of failureDomain.ControlPlane to false, to prevent the control plane being deployed there. The field is optional - if it's missing, all discovered failure domains are announced in status as-is. THERE IS NO BREAKING CHANGE TO CURRENT USERS.
force-pushed from de9739a to e2eb28d
/lgtm
@handsomejack-42 can you please add a release note in the PR description?
LGTM label has been added. Git tree hash: 8c9a95ab868ceed400ea6081bbe2a8d693b4f0a4
done:
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mboersma

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
What type of PR is this?
/kind api-change
Why we need it:
We need zonal control plane nodes. We might have missed something, but that probably can't be achieved in the current state of CAPZ with our use-case:
- `AzureCluster` announces all discovered failure domains (FDs)
- `KubeadmControlPlane` checks with the FDs reported by the cluster
- `KubeadmControlPlane`, when rolling out nodes (no failureDomain in spec), offers no explicit placement; that would need a `Machine(Deployment)`. In our use-case, however, there are no `Machine(Deployment)`s, the `KubeadmControlPlane` spins `AzureMachines` directly.

What this PR does:
- The goal of this commit is to allow CAPZ users to specify which failure domains are eligible for control plane rollouts.
- CAPI elders suggested that the implementation of this behavior lies within the individual providers (CAPZ in our case).
- An ineligible FD gets `controlPlane: false` in the `AzureCluster` object status, so `KubeadmControlPlane` won't take it into consideration when rolling out control plane nodes.
- There's a new label in `.spec.failureDomain` that is used to post-filter discovered failure domains.
- The field is optional - if it's missing, all discovered failure domains are eligible for control plane rollout.
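Putting the pieces together, here is a hedged end-to-end sketch of the intended effect; the names and the exact merge rule are assumptions drawn from the review discussion, not a copy of the PR code:

```go
package main

import (
	"fmt"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

func main() {
	// Failure domains discovered for the region; CAPZ reports them all as
	// eligible to host the control plane.
	discovered := clusterv1.FailureDomains{
		"1": {ControlPlane: true},
		"2": {ControlPlane: true},
		"3": {ControlPlane: true},
	}

	// User override in the AzureCluster spec: zone "3" should not host control plane nodes.
	spec := clusterv1.FailureDomains{
		"3": {ControlPlane: false},
	}

	// Merge: every FD still lands in status, but spec can demote ControlPlane to false.
	status := clusterv1.FailureDomains{}
	for id, fd := range discovered {
		if override, ok := spec[id]; ok && !override.ControlPlane {
			fd.ControlPlane = false
		}
		status[id] = fd
	}

	fmt.Println(status) // KubeadmControlPlane would only consider "1" and "2".
}
```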
Special notes for your reviewer:
TODOs:
Release note: