-
Notifications
You must be signed in to change notification settings - Fork 431
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AzureManagedCluster spec.controlPlaneEndpoint is immutable #2711
AzureManagedCluster spec.controlPlaneEndpoint is immutable #2711
Conversation
2ad3173
to
9a49820
Compare
9a49820
to
1e39a10
Compare
@@ -141,77 +141,41 @@ func (m *AzureManagedControlPlane) ValidateUpdate(oldRaw runtime.Object, client | |||
|
|||
if old.Spec.DNSServiceIP != nil { | |||
// Prevent DNSServiceIP modification if it was already set to some value | |||
if m.Spec.DNSServiceIP == nil { | |||
// unsetting the field is not allowed | |||
if !reflect.DeepEqual(m.Spec.DNSServiceIP, old.Spec.DNSServiceIP) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These reflect.DeepEqual
changes are in response to linter complaining about this func's cyclomatic complexity with the addition of more validations. Next step might be to refactor this more, but this works for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should really refactor this to not do all the validation in one big function (maybe add a ValidateImmutableFields
?), could you please open an issue so we don't forget? Could be a good first issue maybe
we're losing some error differentiation with this change so it's not fully equivalent, are we okay with that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is something like this an improvement (imagine this applied across all webhook validations that check for immutability):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes 💯
imagine if we did this with go generics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll finish #2718 to completion (audit all webhooks) after this PR lands (I think this PR is ready for final review)
In that case I would advocate that we temporarily fix the go cyclomatic linter some other way (maybe extract some stuff to a function) so we don't temporarily regress the error messaging
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we just land #2718 first? (this PR is not highest priority)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that works too 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prereq work is done, this PR is ready for another review round
@@ -71,6 +71,26 @@ func (r *AzureManagedCluster) ValidateUpdate(oldRaw runtime.Object) error { | |||
fmt.Sprintf("annotations with '%s' prefix are immutable", azure.CustomHeaderPrefix))) | |||
} | |||
|
|||
if old.Spec.ControlPlaneEndpoint.Host != "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any cases where the controller would need to change either of these? Because webhook validations also apply to patches made by controllers, this essentially prevents the CAPZ controllers from changing these values too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some investigation and it seems that AKS does not have any notion of ever being able to change the apiserver endpoint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would this change affect a scenario where the user has put in wrong/invalid/ill-formed ControlPlaneEndpoint
? Do we validate initial ControlPlaneEndpoint
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question @nawazkh!
In fact this is one of those "weird" fields where we defer to the AKS service itself to populate the value (and thus we trust that the AKS service will always populate the right value).
@CecileRobertMichon @nojnhuh is there any additional foo we should put in place here? It does feel a bit weird not having any create webhook validations for such an important field, but on the other hand this is an exceptional case where we're not actually able to implement what we want:
- Never allow the user to ever populate this field!
- Always trust the AKS service and accept whatever it provides!
Is there a better way we could be doing this, or is "don't enforce CREATE webhooks" the most appropriate given the state of AKS + capz?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The real issue here is that controlPlaneEndpoint should be a Status field, not a Spec field, since it's meant for the CAPZ controller to communicate back info to CAPI, not to be user configurable.
I had a write-up on this in kubernetes-sigs/cluster-api#3715 but unfortunately in never moved forward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. My interpretation of that is that we're doing the best we can right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the behavior currently that if the ControlPlaneEndpoint
is changed, that will trigger a reconciliation where the controller will immediately override the provided value with AKS's value? If so, then to @nawazkh's point, it seems like this removes the possibility for the controller to automatically reset the value in case it does happen to change. e.g. If a user mistakenly sets this before the controller does, it seems like the only way to resolve this is to delete and recreate the control plane. Overall I think this check is valuable, but worth being mindful of that change.
One other idea: Is there a way to make this trigger a warning instead of a hard error? Then the webhook could say something to the effect of, "You're trying to change this value, but ultimately we're going to ignore that and set it to what AKS tells us it should be."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the user is responsible for changing from ""
--> <some value>
, then that <some value>
will become the durable one, and the automated AKS API-informed update that happens during cluster create reconciliation will be webhook-rejected with this change.
I'm certainly open to any comments on how we can improve this! It seemed to my initially naive understanding that the current situation, with no enforcement upon this rather critical configuration vector, left room for improvement.
1e39a10
to
0c8bee7
Compare
0c8bee7
to
249b3ef
Compare
2c4841b
to
582a8b3
Compare
/assign @nojnhuh |
@@ -70,6 +71,24 @@ func (r *AzureManagedCluster) ValidateUpdate(oldRaw runtime.Object) error { | |||
fmt.Sprintf("annotations with '%s' prefix are immutable", azure.CustomHeaderPrefix))) | |||
} | |||
|
|||
if old.Spec.ControlPlaneEndpoint.Host != "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a case where old.Spec.ControlPlaneEndpoint
is nil? I saw that the control plane endpoint is an optional field for the azure managed cluster spec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes (I assume you mean ""
[empty string], not nil), during initial cluster creation (because this data is never populated by the end user). We can't enforce immutability for transitions like ""
--> <a valid URL>
because that's the update vector we need to expose to allow the AKS API to provide us with data to do an automated update of this value after the cluster is created (the control plane URL is one piece of output data from the AKS API after a successful cluster create).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see, I was under the impression that the ControlPlaneEndpoint
itself is not specified rather than the Host
or Port
. I was just thinking that we might be trying to access the Host
of an endpoint that isn't specified, as it is marked as optional here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ControlPlaneEndpoint
is a plain struct value, not a pointer, so there isn't a risk of dereferencing a nil pointer in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this isn't a perfect solution and also that "perfect" likely isn't attainable here, but this is definitely a positive step overall!
/lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
582a8b3
to
f85e9cc
Compare
@CecileRobertMichon incorporated discussion points from yesterday's office hours, PTAL |
@@ -51,6 +52,21 @@ func (r *AzureManagedCluster) ValidateCreate() error { | |||
"can be set only if the AKS feature flag is enabled", | |||
) | |||
} | |||
if !r.Status.Ready { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we're only checking the status on creates but not updates?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naively I was thinking to protect the "user accidentally includes a value in the AzureManagedCluster
resource spec when creating the cluster, but not including this check in update leaves open the edge case of "user updates this property after cluster create but before the AKS API does its update", and so yeah, I'll add it there as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to your thought process, would it be worth it to hide AzureManagedCluster.Spec.ControlPlaneEndpoint
in the CRD from the user ?
It would technically be part of the spec, just not visible to the user. Could add a psychological barrier on updating a non-exposed field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds like a breaking API change, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hide AzureManagedCluster.Spec.ControlPlaneEndpoint in the CRD from the user
Hmm, yeah, that would be a breaking change. We can ignore my comment #2711 (comment)
I see you made those changes for AzureManagedCluster but not AzureManagedControlPlane, any reason for that? Wouldn't the webhook also fail for the AzureManagedControlPlane if the user sets it before the controller has a chance to? |
f85e9cc
to
69f8c98
Compare
@CecileRobertMichon fair point, I've applied the change there as well |
/retest |
69f8c98
to
b9eb1c9
Compare
/hold cancel |
/lgtm |
b9eb1c9
to
0b59cf1
Compare
0b59cf1
to
230398f
Compare
@nojnhuh @CecileRobertMichon I think we should scope this PR down to the updated PR description. Integrating "cluster readiness" and/or "detecting whether an update originates from the AKS API vs a user" is proving to be difficult, and the real-world payoff for this additional webhook enforcement is (IMO) not worth the effort at this time. Ready for final review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: CecileRobertMichon The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind bug
What this PR does / why we need it:
While auditing AzureManagedCluster property CRUD operations, I observed that you're able to edit both the
azuremanagedcluster
resource and theazuremanagedcontrolplane
resource and change the values of thespec.controlPlaneEndpoint
configuration. The changes are accepted, and then it seems that the controllers are smart enough to revert the changes (or possibly they are calling the AKS API and resetting it to its authoritative value).Rather than the above, we should fail updates on that configuration via webhook.
For cluster creates, we explicity fail (via webhook) if either of the
ControlPlaneEndpoint
property values are present, which expresses the actual API contract (these properties are designed to be user-facing).For cluster updates, we explicitly fail all updates that change non-zero type values, which has the upside of preventing user modifications to this AKS-owned property surface area. Because this is an intrinsic, immutable configuration property from AKS's perspective, it is appropriate to enforce this immutability even if updates originat from AKS API (if CAPZ detected that the AKS API wanted to change these properties that would not an unexpected violation of the AKS feature contract).
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #
Special notes for your reviewer:
Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.
TODOs:
Release note: