Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix endless reconciliation on AKS system node labels #3975

Merged
merged 1 commit into from
Sep 12, 2023

Conversation

nojnhuh
Copy link
Contributor

@nojnhuh nojnhuh commented Sep 8, 2023

What type of PR is this?
/kind bug
/area managedclusters

What this PR does / why we need it:

When AKS labels agent pool nodes on its own (i.e. not labels specified by the user), e.g. when spot instances are being used, CAPZ has to carefully ignore those AKS labels when calculating whether or not an agent pool should be updated. The existing ignore mechanism only took effect when existing labels were specified on the node by the user. This PR fixes a bug where the combination of no user-specified labels and the presence of any AKS-specified values caused CAPZ to endless reconcile the AzureManagedMachinePool on a diff like this:

  armcontainerservice.AgentPool{
  	Properties: &armcontainerservice.ManagedClusterAgentPoolProfileProperties{
  		... // 15 identical fields
  		MinCount:             &0,
  		Mode:                 &"User",
- 		NodeLabels:           nil,
+ 		NodeLabels:           map[string]*string{"kubernetes.azure.com/scalesetpriority": &"spot"},
  		NodePublicIPPrefixID: nil,
  		NodeTaints:           {&"kubernetes.azure.com/scalesetpriority=spot:NoSchedule"},
  		... // 21 identical fields
  	},
  	ID:   nil,
  	Name: nil,
  	Type: nil,
  }

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

To repro, apply any AzureManagedMachinePool using Spot instances without any spec.nodeLabels, such as something like this:

apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: jon-aks-$NAME
  namespace: default
spec:
  clusterName: jon-aks
  replicas: 1
  template:
    metadata: {}
    spec:
      bootstrap:
        dataSecretName: ""
      clusterName: jon-aks
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: AzureManagedMachinePool
        name: jon-aks-$NAME
      version: v1.26.3
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureManagedMachinePool
metadata:
  name: jon-aks-$NAME
  namespace: default
spec:
  mode: User
  name: $NAME
  sku: Standard_B2s
  scaleSetPriority: Spot

The "found a diff" messages will appear in the logs and the MachinePool will flip between Provisioned and Scaling or Running every minute or so.

  • cherry-pick candidate

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

Fixed a bug where AzureManagedMachinePools using Spot instances would endlessly reconcile without any user-specified node labels

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. area/managedclusters Issues related to managed AKS clusters created through the CAPZ ManagedCluster Type cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 8, 2023
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 8, 2023
@nojnhuh
Copy link
Contributor Author

nojnhuh commented Sep 8, 2023

/milestone v1.11
/assign @jackfrancis

@k8s-ci-robot k8s-ci-robot added this to the v1.11 milestone Sep 8, 2023
@nojnhuh
Copy link
Contributor Author

nojnhuh commented Sep 8, 2023

I see now from the unit tests that this might break trying to delete all the node labels. Checking...

@nojnhuh
Copy link
Contributor Author

nojnhuh commented Sep 8, 2023

I see now from the unit tests that this might break trying to delete all the node labels. Checking...

Looks ok to me. Just needed one small tweak for the sake of the unit test.

@codecov
Copy link

codecov bot commented Sep 8, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.04% 🎉

Comparison is base (b082981) 56.26% compared to head (2ae2407) 56.30%.
Report is 28 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3975      +/-   ##
==========================================
+ Coverage   56.26%   56.30%   +0.04%     
==========================================
  Files         190      190              
  Lines       19447    19440       -7     
==========================================
+ Hits        10941    10946       +5     
+ Misses       7877     7865      -12     
  Partials      629      629              
Files Changed Coverage Δ
azure/services/agentpools/spec.go 60.28% <100.00%> (+0.97%) ⬆️

... and 4 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@nojnhuh
Copy link
Contributor Author

nojnhuh commented Sep 8, 2023

#3955

/retest

@nawazkh
Copy link
Member

nawazkh commented Sep 11, 2023

I had to do something similar in #3653 to deal with endless reconciliation.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 11, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: ad02528dce225020af86f9204b0764a11c07d95a

@nojnhuh
Copy link
Contributor Author

nojnhuh commented Sep 11, 2023

/cherry-pick release-1.10

@k8s-infra-cherrypick-robot

@nojnhuh: once the present PR merges, I will cherry-pick it on top of release-1.10 in a new PR and assign it to you.

In response to this:

/cherry-pick release-1.10

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link
Contributor

@mboersma mboersma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mboersma

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 11, 2023
@nojnhuh
Copy link
Contributor Author

nojnhuh commented Sep 12, 2023

/retest

@k8s-ci-robot k8s-ci-robot merged commit c96fced into kubernetes-sigs:main Sep 12, 2023
11 checks passed
@nojnhuh nojnhuh deleted the aks-nodelabel-fix branch September 12, 2023 03:12
@k8s-infra-cherrypick-robot

@nojnhuh: #3975 failed to apply on top of branch "release-1.10":

Applying: fix endless reconciliation on AKS system node labels
Using index info to reconstruct a base tree...
M	azure/services/agentpools/spec.go
M	azure/services/agentpools/spec_test.go
Falling back to patching base and 3-way merge...
Auto-merging azure/services/agentpools/spec_test.go
Auto-merging azure/services/agentpools/spec.go
CONFLICT (content): Merge conflict in azure/services/agentpools/spec.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 fix endless reconciliation on AKS system node labels
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-1.10

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/managedclusters Issues related to managed AKS clusters created through the CAPZ ManagedCluster Type cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

6 participants