Populate ControlPlaneEndpoint when ManagedCluster update is not needed. #2134

Merged

Conversation

karthikbalasub
Contributor

@karthikbalasub karthikbalasub commented Feb 28, 2022

What type of PR is this?
/kind bug

What this PR does / why we need it:
Fixes a bug in the managed clusters service that resulted in the control plane endpoint not being updated correctly.
Before this fix, the update flow did not read the FQDN from the ManagedCluster fetched from the cloud when no update was needed on the managed cluster.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

Fixed a bug in the managed clusters service that resulted in the control plane endpoint not being updated correctly.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. labels Feb 28, 2022
@linux-foundation-easycla

linux-foundation-easycla bot commented Feb 28, 2022

CLA Signed

The committers are authorized under a signed CLA.

  • ✅ Karthik Balasubramanian (1a7a4c7)

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Feb 28, 2022
@k8s-ci-robot
Contributor

Welcome @karthikbalasub!

It looks like this is your first PR to kubernetes-sigs/cluster-api-provider-azure 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api-provider-azure has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 28, 2022
@k8s-ci-robot
Contributor

Hi @karthikbalasub. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/provider/azure Issues or PRs related to azure provider sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Feb 28, 2022
@shysank
Contributor

shysank commented Feb 28, 2022

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 28, 2022
// No update required, but use the MC fetched from Azure for reading fields below.
// This is to ensure the read-only fields like Fqdn from the existing MC are used for updating the
// AzureManagedCluster.
managedCluster = existingMC
Contributor

Is this something that we only need to do some of the time in practice? If not, how did these updates ever work at all?

Contributor

Also, would it just be simpler to modify L335 below to use:

Host: *existingMC.ManagedClusterProperties.Fqdn,

Contributor Author

@karthikbalasub karthikbalasub Mar 1, 2022

Is this something that we only need to do some of the time in practice? If not, how did these updates ever work at all?

The issue happens when the create response doesn't include the FQDN (for example, because the request timed out). When that happens, subsequent update calls fail to update the control plane endpoint.

Also, would it just be simpler to modify L335 below to use:

Host: *existingMC.ManagedClusterProperties.Fqdn,

This would be incorrect when there was an update needed (diff != "") as the response from the update call is assigned to managedCluster in this line above:

managedCluster, err = s.Client.CreateOrUpdate(ctx, managedClusterSpec.ResourceGroupName, managedClusterSpec.Name, managedCluster, customHeaders)
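
The flow described in this reply can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: the ManagedCluster struct is a simplified stand-in for the Azure SDK type, and endpointHost and putFunc are hypothetical names introduced only for this example. It assumes the locally built spec never carries the read-only FQDN, so the FQDN has to come either from the CreateOrUpdate response or from the cluster fetched from Azure.

```go
package main

import (
	"errors"
	"fmt"
)

// ManagedCluster is a simplified stand-in for the Azure SDK type.
type ManagedCluster struct {
	Name string
	Fqdn *string // read-only; populated by Azure, never by the local spec
}

// putFunc is a hypothetical stand-in for the CreateOrUpdate call.
type putFunc func(ManagedCluster) (ManagedCluster, error)

// endpointHost returns the host to use for the control plane endpoint.
// desired is the locally built spec, existing is the cluster fetched from
// Azure (nil on the create flow), and diff is non-empty when an update is needed.
func endpointHost(desired ManagedCluster, existing *ManagedCluster, diff string, put putFunc) (string, error) {
	managedCluster := desired
	if existing == nil || diff != "" {
		// Create or update: the response carries the read-only fields.
		resp, err := put(managedCluster)
		if err != nil {
			return "", err
		}
		managedCluster = resp
	} else {
		// No update required: read from the cluster fetched from Azure,
		// because the local spec has no FQDN.
		managedCluster = *existing
	}
	if managedCluster.Fqdn == nil {
		return "", errors.New("FQDN not populated yet")
	}
	return *managedCluster.Fqdn, nil
}

func main() {
	fqdn := "mc.example.invalid"
	existing := &ManagedCluster{Name: "mc", Fqdn: &fqdn}
	put := func(mc ManagedCluster) (ManagedCluster, error) { return mc, nil }

	host, err := endpointHost(ManagedCluster{Name: "mc"}, existing, "", put)
	fmt.Println(host, err) // prints "mc.example.invalid <nil>"
}
```

Without the else branch, the no-diff path would keep the locally built spec in managedCluster, and the FQDN lookup would find nothing.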

Contributor

But the FQDN will never change, right? So even if there is a diff to process we can always conclude that the FQDN is not one of the updated properties.

Contributor Author

existingMC would also be nil for the create flow.
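
To illustrate why dereferencing existingMC directly is unsafe on the create flow, here is a small sketch of a nil-safe accessor. The struct definitions are simplified stand-ins that only mirror the field names quoted in this thread, and fqdnOf is a hypothetical helper, not something from the PR or the SDK.

```go
package main

import "fmt"

// Simplified stand-ins that mirror the field names quoted in the thread.
type ManagedClusterProperties struct {
	Fqdn *string
}

type ManagedCluster struct {
	ManagedClusterProperties *ManagedClusterProperties
}

// fqdnOf returns the FQDN and true when every pointer on the path is set,
// and "" and false otherwise, instead of panicking on a nil dereference.
func fqdnOf(mc *ManagedCluster) (string, bool) {
	if mc == nil || mc.ManagedClusterProperties == nil || mc.ManagedClusterProperties.Fqdn == nil {
		return "", false
	}
	return *mc.ManagedClusterProperties.Fqdn, true
}

func main() {
	// On the create flow nothing has been fetched yet, so existingMC is nil.
	var existingMC *ManagedCluster
	host, ok := fqdnOf(existingMC)
	fmt.Printf("%q %v\n", host, ok) // prints "" false
}
```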

Contributor

@CecileRobertMichon @zmalik @michalno1 does anyone else have thoughts here? I think @karthikbalasub has done a great job of describing in detail what's going on.

Contributor

@CecileRobertMichon CecileRobertMichon Mar 2, 2022

For other services we've refactored the reconcile logic to use a common CreateResource func, which returns either 1) the existing resource if no update was made or 2) the resulting resource returned from the CreateOrUpdate call. Then each service can do what it needs with the resulting resource (e.g. update Status).

Looks like we're trying to achieve the same here and that's fine, but I think the code is slightly confusing because it's not really clear what managedCluster is supposed to be: we use the same variable to send the PUT and to receive the return. What do you think of adding a new variable called result instead, or maybe reusing the existingMC variable to hold the return of the CreateOrUpdate call (then we don't even need this else branch)?
This would achieve the same thing, just read a bit differently.

I'll look into refactoring managed clusters to async soon.
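
The pattern described in this comment (return the existing resource when nothing changed, otherwise return what the PUT gave back) might look something like the sketch below. The createResource signature, the params callback, and the cluster type are all assumptions made for illustration; they are not the actual CreateResource func used by the other services.

```go
package main

import (
	"errors"
	"fmt"
)

// cluster is a toy resource type used only for this sketch.
type cluster struct {
	Version string
	Fqdn    string
}

// createResource is a hypothetical generic helper. params inspects the
// existing resource and returns nil when no update is needed, or the desired
// state to send otherwise. The returned value is always the resource as the
// cloud sees it, so callers read status fields from one place.
func createResource[T any](existing *T, params func(*T) *T, put func(T) (T, error)) (T, error) {
	desired := params(existing)
	if desired == nil {
		if existing == nil {
			var zero T
			return zero, errors.New("no existing resource and nothing to create")
		}
		// Up to date: hand back what was fetched.
		return *existing, nil
	}
	// Create or update: hand back the response of the PUT.
	return put(*desired)
}

func main() {
	existing := &cluster{Version: "1.22", Fqdn: "mc.example.invalid"}
	upToDate := func(*cluster) *cluster { return nil }
	put := func(c cluster) (cluster, error) { return c, nil }

	result, err := createResource(existing, upToDate, put)
	fmt.Println(result.Fqdn, err) // prints "mc.example.invalid <nil>"
}
```

Keeping the outcome in a separate result value, rather than reusing the variable that held the request body, is what makes it clear which object the status fields are read from.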

Contributor Author

@CecileRobertMichon I refactored the code based on your suggestion. PTAL, thanks!

Contributor

lgtm

@jackfrancis thoughts?

Contributor

much cleaner, thanks @karthikbalasub !

@CecileRobertMichon
Contributor

@karthikbalasub looks like there's a merge conflict, please rebase

@karthikbalasub karthikbalasub force-pushed the fix_controlplane_endpoint branch from 1a7a4c7 to a6843ee Compare March 1, 2022 00:19
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 1, 2022
@karthikbalasub
Contributor Author

@karthikbalasub looks like there's a merge conflict, please rebase

Done

@karthikbalasub karthikbalasub force-pushed the fix_controlplane_endpoint branch 4 times, most recently from 6c08249 to de30378 Compare March 1, 2022 01:38
@CecileRobertMichon
Contributor

/area managedclusters

@k8s-ci-robot k8s-ci-robot added the area/managedclusters Issues related to managed AKS clusters created through the CAPZ ManagedCluster Type label Mar 2, 2022
@CecileRobertMichon
Contributor

@karthikbalasub could you please squash commits?

…edControlPlane correctly.

Before this fix, the issue occurred when the create request failed or timed out and the later update flow did not read from the cluster fetched from the cloud.
@karthikbalasub karthikbalasub force-pushed the fix_controlplane_endpoint branch from 1c3e1a9 to 679375f Compare March 7, 2022 18:28
@karthikbalasub
Contributor Author

@karthikbalasub could you please squash commits?

Done

@jackfrancis
Contributor

/retest

@CecileRobertMichon
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 8, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 8, 2022
@CecileRobertMichon
Contributor

/cherry-pick release-1.2

@k8s-infra-cherrypick-robot

@CecileRobertMichon: once the present PR merges, I will cherry-pick it on top of release-1.2 in a new PR and assign it to you.

In response to this:

/cherry-pick release-1.2

Contributor

@jackfrancis jackfrancis left a comment

/lgtm

@k8s-infra-cherrypick-robot

@CecileRobertMichon: new pull request created: #2153

In response to this:

/cherry-pick release-1.2

@karthikbalasub karthikbalasub deleted the fix_controlplane_endpoint branch March 8, 2022 19:12
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/managedclusters Issues related to managed AKS clusters created through the CAPZ ManagedCluster Type area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
6 participants