Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AzureManagedCluster spec.controlPlaneEndpoint is immutable #2711

Merged

Conversation

jackfrancis
Copy link
Contributor

@jackfrancis jackfrancis commented Oct 7, 2022

What type of PR is this?

/kind bug

What this PR does / why we need it:

While auditing AzureManagedCluster property CRUD operations, I observed that you're able to edit both the azuremanagedcluster resource and the azuremanagedcontrolplane resource and change the values of the spec.controlPlaneEndpoint configuration. The changes are accepted, and then it seems that the controllers are smart enough to revert the changes (or possibly they are calling the AKS API and resetting it to its authoritative value).

Rather than the above, we should fail updates on that configuration via webhook.

For cluster creates, we explicity fail (via webhook) if either of the ControlPlaneEndpoint property values are present, which expresses the actual API contract (these properties are designed to be user-facing).

For cluster updates, we explicitly fail all updates that change non-zero type values, which has the upside of preventing user modifications to this AKS-owned property surface area. Because this is an intrinsic, immutable configuration property from AKS's perspective, it is appropriate to enforce this immutability even if updates originat from AKS API (if CAPZ detected that the AKS API wanted to change these properties that would not an unexpected violation of the AKS feature contract).

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

AzureManagedCluster spec.controlPlaneEndpoint is immutable

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 7, 2022
@jackfrancis jackfrancis added area/managedclusters Issues related to managed AKS clusters created through the CAPZ ManagedCluster Type size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 7, 2022
@jackfrancis jackfrancis force-pushed the azuremanagedcluster-webhook branch 2 times, most recently from 2ad3173 to 9a49820 Compare October 7, 2022 22:24
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 7, 2022
@jackfrancis jackfrancis force-pushed the azuremanagedcluster-webhook branch from 9a49820 to 1e39a10 Compare October 7, 2022 23:08
@@ -141,77 +141,41 @@ func (m *AzureManagedControlPlane) ValidateUpdate(oldRaw runtime.Object, client

if old.Spec.DNSServiceIP != nil {
// Prevent DNSServiceIP modification if it was already set to some value
if m.Spec.DNSServiceIP == nil {
// unsetting the field is not allowed
if !reflect.DeepEqual(m.Spec.DNSServiceIP, old.Spec.DNSServiceIP) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These reflect.DeepEqual changes are in response to linter complaining about this func's cyclomatic complexity with the addition of more validations. Next step might be to refactor this more, but this works for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should really refactor this to not do all the validation in one big function (maybe add a ValidateImmutableFields?), could you please open an issue so we don't forget? Could be a good first issue maybe

we're losing some error differentiation with this change so it's not fully equivalent, are we okay with that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is something like this an improvement (imagine this applied across all webhook validations that check for immutability):

#2718

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes 💯

imagine if we did this with go generics

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll finish #2718 to completion (audit all webhooks) after this PR lands (I think this PR is ready for final review).

And then after #2718 I'll file an issue that advocates this surface area is a potential good opportunity for introducing go generics into capz codebase.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll finish #2718 to completion (audit all webhooks) after this PR lands (I think this PR is ready for final review)

In that case I would advocate that we temporarily fix the go cyclomatic linter some other way (maybe extract some stuff to a function) so we don't temporarily regress the error messaging

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we just land #2718 first? (this PR is not highest priority)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that works too 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prereq work is done, this PR is ready for another review round

@@ -71,6 +71,26 @@ func (r *AzureManagedCluster) ValidateUpdate(oldRaw runtime.Object) error {
fmt.Sprintf("annotations with '%s' prefix are immutable", azure.CustomHeaderPrefix)))
}

if old.Spec.ControlPlaneEndpoint.Host != "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any cases where the controller would need to change either of these? Because webhook validations also apply to patches made by controllers, this essentially prevents the CAPZ controllers from changing these values too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some investigation and it seems that AKS does not have any notion of ever being able to change the apiserver endpoint.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would this change affect a scenario where the user has put in wrong/invalid/ill-formed ControlPlaneEndpoint ? Do we validate initial ControlPlaneEndpoint ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question @nawazkh!

In fact this is one of those "weird" fields where we defer to the AKS service itself to populate the value (and thus we trust that the AKS service will always populate the right value).

@CecileRobertMichon @nojnhuh is there any additional foo we should put in place here? It does feel a bit weird not having any create webhook validations for such an important field, but on the other hand this is an exceptional case where we're not actually able to implement what we want:

  1. Never allow the user to ever populate this field!
  2. Always trust the AKS service and accept whatever it provides!

Is there a better way we could be doing this, or is "don't enforce CREATE webhooks" the most appropriate given the state of AKS + capz?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The real issue here is that controlPlaneEndpoint should be a Status field, not a Spec field, since it's meant for the CAPZ controller to communicate back info to CAPI, not to be user configurable.

I had a write-up on this in kubernetes-sigs/cluster-api#3715 but unfortunately in never moved forward.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. My interpretation of that is that we're doing the best we can right now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the behavior currently that if the ControlPlaneEndpoint is changed, that will trigger a reconciliation where the controller will immediately override the provided value with AKS's value? If so, then to @nawazkh's point, it seems like this removes the possibility for the controller to automatically reset the value in case it does happen to change. e.g. If a user mistakenly sets this before the controller does, it seems like the only way to resolve this is to delete and recreate the control plane. Overall I think this check is valuable, but worth being mindful of that change.

One other idea: Is there a way to make this trigger a warning instead of a hard error? Then the webhook could say something to the effect of, "You're trying to change this value, but ultimately we're going to ignore that and set it to what AKS tells us it should be."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the user is responsible for changing from "" --> <some value>, then that <some value> will become the durable one, and the automated AKS API-informed update that happens during cluster create reconciliation will be webhook-rejected with this change.

I'm certainly open to any comments on how we can improve this! It seemed to my initially naive understanding that the current situation, with no enforcement upon this rather critical configuration vector, left room for improvement.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 12, 2022
@jackfrancis jackfrancis force-pushed the azuremanagedcluster-webhook branch from 1e39a10 to 0c8bee7 Compare November 14, 2022 23:02
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Nov 14, 2022
@jackfrancis jackfrancis force-pushed the azuremanagedcluster-webhook branch from 0c8bee7 to 249b3ef Compare December 5, 2022 18:59
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 5, 2022
@jackfrancis jackfrancis force-pushed the azuremanagedcluster-webhook branch 2 times, most recently from 2c4841b to 582a8b3 Compare December 5, 2022 22:41
@jackfrancis
Copy link
Contributor Author

/assign @nojnhuh

@@ -70,6 +71,24 @@ func (r *AzureManagedCluster) ValidateUpdate(oldRaw runtime.Object) error {
fmt.Sprintf("annotations with '%s' prefix are immutable", azure.CustomHeaderPrefix)))
}

if old.Spec.ControlPlaneEndpoint.Host != "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a case where old.Spec.ControlPlaneEndpoint is nil? I saw that the control plane endpoint is an optional field for the azure managed cluster spec

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes (I assume you mean "" [empty string], not nil), during initial cluster creation (because this data is never populated by the end user). We can't enforce immutability for transitions like "" --> <a valid URL> because that's the update vector we need to expose to allow the AKS API to provide us with data to do an automated update of this value after the cluster is created (the control plane URL is one piece of output data from the AKS API after a successful cluster create).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, I was under the impression that the ControlPlaneEndpoint itself is not specified rather than the Host or Port. I was just thinking that we might be trying to access the Host of an endpoint that isn't specified, as it is marked as optional here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ControlPlaneEndpoint is a plain struct value, not a pointer, so there isn't a risk of dereferencing a nil pointer in this case.

Copy link
Contributor

@nojnhuh nojnhuh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this isn't a perfect solution and also that "perfect" likely isn't attainable here, but this is definitely a positive step overall!
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 7, 2022
Copy link
Contributor

@willie-yao willie-yao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 7, 2022
@jackfrancis jackfrancis force-pushed the azuremanagedcluster-webhook branch from 582a8b3 to f85e9cc Compare December 10, 2022 00:00
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 10, 2022
@jackfrancis
Copy link
Contributor Author

@CecileRobertMichon incorporated discussion points from yesterday's office hours, PTAL

@@ -51,6 +52,21 @@ func (r *AzureManagedCluster) ValidateCreate() error {
"can be set only if the AKS feature flag is enabled",
)
}
if !r.Status.Ready {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we're only checking the status on creates but not updates?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naively I was thinking to protect the "user accidentally includes a value in the AzureManagedCluster resource spec when creating the cluster, but not including this check in update leaves open the edge case of "user updates this property after cluster create but before the AKS API does its update", and so yeah, I'll add it there as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to your thought process, would it be worth it to hide AzureManagedCluster.Spec.ControlPlaneEndpoint in the CRD from the user ?
It would technically be part of the spec, just not visible to the user. Could add a psychological barrier on updating a non-exposed field.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds like a breaking API change, no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hide AzureManagedCluster.Spec.ControlPlaneEndpoint in the CRD from the user

Hmm, yeah, that would be a breaking change. We can ignore my comment #2711 (comment)

@CecileRobertMichon
Copy link
Contributor

incorporated discussion points from yesterday's office hours, PTAL

I see you made those changes for AzureManagedCluster but not AzureManagedControlPlane, any reason for that? Wouldn't the webhook also fail for the AzureManagedControlPlane if the user sets it before the controller has a chance to?

@jackfrancis jackfrancis force-pushed the azuremanagedcluster-webhook branch from f85e9cc to 69f8c98 Compare December 13, 2022 23:38
@jackfrancis
Copy link
Contributor Author

@CecileRobertMichon fair point, I've applied the change there as well

@jackfrancis
Copy link
Contributor Author

/retest

@jackfrancis jackfrancis force-pushed the azuremanagedcluster-webhook branch from 69f8c98 to b9eb1c9 Compare December 14, 2022 17:32
@jackfrancis
Copy link
Contributor Author

/hold cancel
/assign @nojnhuh @CecileRobertMichon @nawazkh

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 14, 2022
@nawazkh
Copy link
Member

nawazkh commented Dec 14, 2022

/lgtm
🚀

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 14, 2022
@jackfrancis jackfrancis force-pushed the azuremanagedcluster-webhook branch from b9eb1c9 to 0b59cf1 Compare December 15, 2022 01:30
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 15, 2022
@jackfrancis jackfrancis force-pushed the azuremanagedcluster-webhook branch from 0b59cf1 to 230398f Compare December 15, 2022 02:00
@jackfrancis jackfrancis added this to the v1.7 milestone Dec 15, 2022
@jackfrancis
Copy link
Contributor Author

@nojnhuh @CecileRobertMichon I think we should scope this PR down to the updated PR description. Integrating "cluster readiness" and/or "detecting whether an update originates from the AKS API vs a user" is proving to be difficult, and the real-world payoff for this additional webhook enforcement is (IMO) not worth the effort at this time.

Ready for final review.

Copy link
Contributor

@nojnhuh nojnhuh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 15, 2022
Copy link
Contributor

@CecileRobertMichon CecileRobertMichon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 22, 2022
@k8s-ci-robot k8s-ci-robot merged commit 0c769f7 into kubernetes-sigs:main Dec 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/managedclusters Issues related to managed AKS clusters created through the CAPZ ManagedCluster Type cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants