added support for system node pools in managedClusters #1475
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Hi @LochanRn. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@@ -42,6 +42,13 @@ spec:
      description: AzureManagedMachinePoolSpec defines the desired state of
        AzureManagedMachinePool.
      properties:
        mode:
          description: 'Mode - represents mode of an agent pool. Possible values
            include: ''System'', ''User'''
nit: the quotes came out a bit weird here. maybe just drop the single quotes around the values themselves? or can we escape them better?
ack
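For reference, the fix the thread converges on is simply dropping the quotes in the Go doc comment, so the generated CRD description comes out without the doubled ''System'' escaping. A sketch (matching the field shown later in this diff):

```go
// Mode - represents mode of an agent pool. Possible values include: System, User.
// +kubebuilder:validation:Enum=System;User
Mode string `json:"mode"`
```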
	r.Spec.OSDiskSizeGB,
	"field is immutable, unsetting is not allowed"))
} else if *r.Spec.OSDiskSizeGB != *old.Spec.OSDiskSizeGB {
	// changing the field is not allowed
I think os disk size is mutable as long as you use managed disks, but I could be mistaken.
will check and get back on this.
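For context, a hedged sketch of the immutability check being discussed; this is a hypothetical helper extracted from the webhook snippet above, and the field path and error wiring are assumptions. If OS disk size turns out to be mutable with managed disks, this check would simply be relaxed or dropped.

```go
import "k8s.io/apimachinery/pkg/util/validation/field"

// validateOSDiskSizeGBImmutable rejects both unsetting and changing the value,
// mirroring the "field is immutable" errors in the diff above.
func validateOSDiskSizeGBImmutable(old, updated *int32, fldPath *field.Path) *field.Error {
	switch {
	case old == nil:
		// the field was never set, so there is nothing to protect
		return nil
	case updated == nil:
		return field.Invalid(fldPath, updated, "field is immutable, unsetting is not allowed")
	case *updated != *old:
		// changing the field is not allowed
		return field.Invalid(fldPath, *updated, "field is immutable")
	default:
		return nil
	}
}
```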
opt1 := client.InNamespace(r.Namespace)
opt2 := client.MatchingLabels(map[string]string{
	clusterv1.ClusterLabelName: clusterName,
	LabelAgentPoolMode:         SystemNodePool,
This is clever, but I'm curious how it works for a cluster in transition? Might need to add the label inside the reconciler for AMMPs created before this change went in, to ensure they get it applied.
I am not very clear on what the doubt is here.
The label is embedded into the AMMP by the mutating webhook, so I guess that should suffice and not require us to add labels in the reconciler.
AMMPs created before this commit and not updated afterwards will not hit the webhook, but they will still be reconciled. For the webhook to work correctly, they all need to have the label, so it likely should be added inside the reconciler.
I guess there were no system pools before this commit, so that’s not really possible 😅 it’s fine as is, you are right
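A rough sketch of the lookup being discussed (assuming a controller-runtime client and the AzureManagedMachinePoolList type; the function name and placement are illustrative): the delete webhook lists AMMPs by the agent-pool-mode label to decide whether the pool being removed is the last system pool.

```go
package v1alpha4 // illustrative placement; actual package and API versions may differ

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha4"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// remainingSystemPools is a hypothetical helper: it counts the AMMPs in the
// cluster that carry the System agent-pool-mode label, which is what the
// delete webhook needs in order to refuse removing the last system pool.
func remainingSystemPools(ctx context.Context, c client.Client, namespace, clusterName string) (int, error) {
	opts := []client.ListOption{
		client.InNamespace(namespace),
		client.MatchingLabels(map[string]string{
			clusterv1.ClusterLabelName: clusterName,
			LabelAgentPoolMode:         SystemNodePool,
		}),
	}
	var ammps AzureManagedMachinePoolList
	if err := c.List(ctx, &ammps, opts...); err != nil {
		return 0, err
	}
	return len(ammps.Items), nil
}
```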
/ok-to-test

isSystemNodePool := ammp.Spec.Mode == infrav1exp.SystemNodePool

if groupMatches && kindMatches && isSystemNodePool {
hmm...I'm not 100% sure I follow the thinking here. when this was the default pool, we reconciled it as part of the managedcluster itself (actually the managed service still tries to do this for all pools, maybe a TODO there). we'd actually double reconcile it I think, once as part of MC and once as part of AP.
if we modified the last system pool however, we don't want to reconcile the cluster, we want to reconcile just that pool. if we accidentally race and try to delete the last system pool before creating a new one, just let the retry happen on 400 (return err from reconcile)
we can probably also make the managedclusters client avoid patching agentpools after create, to cleanly separate that
MachinePoolToAzureManagedControlPlaneMapFunc was added by you in #1397.
This is the description and reasoning from that PR:
If the MachinePool owning the defaultPool of an AzureManagedControlPlane is nil when we try to reconcile, we need to re-reconcile. We can either return an error or watch machinePools and filter them down to the one owning the defaultPool (more accurate, but requires watching and filtering all machinePool events). I implemented the latter.
So this is used in AMCP and not in MC.
I found your reasoning sound, hence I thought of working along the same lines for system node pools, i.e. if the owner MachinePool of a system node pool is nil we try to re-reconcile, as it is more accurate.
But I guess I made a mistake at line 265 in azuremanagedcontrolplane_reconciler: I should return nil there.
Yeah, I see. This makes sense, but then we should definitely make sure AMCP reconciler doesn't send all the AMMPs to AKS on updates, only update the control plane config. We don't want AMCP and AMMP reconcilers to compete to reconcile node pools except that we let AMCP create the very first one as part of the cluster.
Yeah, makes sense. I think we have a check for this at line 216 of azuremanagedcontrolplane.go. In AMCP we reconcile AMMPs only during creation.
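A hedged sketch of the mapping under discussion, modeled loosely on MachinePoolToAzureManagedControlPlaneMapFunc from #1397. The import aliases, API versions, and the way the AMCP name is resolved are all assumptions, not the PR's actual code.

```go
package controllers // illustrative

import (
	"context"

	"github.com/go-logr/logr"
	"k8s.io/apimachinery/pkg/types"
	infrav1exp "sigs.k8s.io/cluster-api-provider-azure/exp/api/v1alpha4"
	expv1 "sigs.k8s.io/cluster-api/exp/api/v1alpha4"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// machinePoolToSystemPoolAMCPMapper is a hypothetical sketch of the mapper:
// it turns MachinePool events into AMCP reconcile requests, but only when the
// referenced infra pool is a system node pool, so user-pool events do not
// wake the AMCP reconciler.
func machinePoolToSystemPoolAMCPMapper(c client.Client, log logr.Logger) func(o client.Object) []reconcile.Request {
	return func(o client.Object) []reconcile.Request {
		mp, ok := o.(*expv1.MachinePool)
		if !ok {
			return nil
		}

		// Only MachinePools that reference an AzureManagedMachinePool matter here.
		ref := mp.Spec.Template.Spec.InfrastructureRef
		if ref.Kind != "AzureManagedMachinePool" {
			return nil
		}

		ammp := &infrav1exp.AzureManagedMachinePool{}
		if err := c.Get(context.Background(), client.ObjectKey{Namespace: mp.Namespace, Name: ref.Name}, ammp); err != nil {
			log.V(4).Info("failed to fetch AzureManagedMachinePool", "error", err)
			return nil
		}

		// Drop user pools; only system pool events trigger an AMCP reconcile.
		if ammp.Spec.Mode != infrav1exp.SystemNodePool {
			return nil
		}

		// Simplification: assume the AMCP shares the cluster's name; real code
		// would resolve the owner Cluster's controlPlaneRef instead.
		return []reconcile.Request{{
			NamespacedName: types.NamespacedName{Namespace: mp.Namespace, Name: mp.Spec.ClusterName},
		}}
	}
}
```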
// Add to cluster spec
managedClusterSpec.AgentPools = []managedclusters.PoolSpec{defaultPoolSpec}
managedClusterSpec.AgentPools = ammps
I know this is what we agreed to in #1376, WDYT about picking a random system pool instead and letting the AMMP reconciler parallelize creation of all the individual pools? I believe this would be faster since we need control plane + 1 pool, and then reconcile all user + system pools in parallel via AKS Agentpool API?
Correct, that was the thought process.
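A minimal sketch of that suggestion (hypothetical variable names; not the code in this PR): send a single system pool with the initial managed cluster create and leave the rest to the per-pool reconciler.

```go
// AKS only needs one system pool to create the cluster; the remaining system
// and user pools can then be created in parallel via the agent pools API.
if len(systemPools) == 0 {
	return errors.New("at least one system node pool is required to create an AKS cluster")
}
managedClusterSpec.AgentPools = []managedclusters.PoolSpec{systemPools[0]}
```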
Overall approach looks good, a few small comments. I still need to try this out locally. It looks like the generated manifests are out of date and the e2e tests need a few updates.
	ammps = append(ammps, ammp)
}
if len(ammps) == 0 {
	return errors.New("owner ref for system machine pools not ready")
should return nil instead of error.
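A sketch of the suggested change (logger and surrounding variables assumed): returning nil lets the MachinePool watch requeue the reconcile once the owner references appear, instead of cycling through error backoff.

```go
// Owner refs not resolved yet: skip quietly and let the watch re-trigger
// reconciliation, rather than returning an error.
if len(ammps) == 0 {
	log.Info("owner references for system machine pools are not ready yet, skipping")
	return nil
}
managedClusterSpec.AgentPools = ammps
```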
}

if err := webhookClient.Get(ctx, key, ownerCluster); err != nil {
	if !azure.ResourceNotFound(err) {
this should be azure.ResourceNotFound(err)
/test pull-cluster-api-provider-azure-e2e-windows
/test pull-cluster-api-provider-azure-e2e
1 similar comment
/test pull-cluster-api-provider-azure-e2e
You can follow the instructions at https://www.kubernetes.dev/docs/guide/github-workflow/#squash-commits to rebase and squash
@@ -115,7 +115,7 @@ variables:
  intervals:
    default/wait-controllers: ["3m", "10s"]
    default/wait-cluster: ["20m", "10s"]
    default/wait-control-plane: ["20m", "10s"]
    default/wait-control-plane: ["35m", "10s"]
I don't think we want to change this value for every test. Was it to address #1481? If so, we should fix it separately and/or add a specific AKS timeout interval.
I have noticed several times that it takes up to 30m for an AKS cluster to come up fully, so I kept a 5m buffer. 20m is not enough.
Were there any other changes needed to make e2e pass? I noticed deletion and the full test sometimes take more than 20 min, but the cluster initialization should really take less time. It's fine if we want to debug it elsewhere, but let's be cognizant of that.
I'd probably prefer adding a separate interval for AKS if we need to increase it so high, like Cecile mentioned.
Nvm, answered my own question in the other issue/pr
let’s revert the change here, I’ll fix the timeout
No, I did not make any other changes. Yes, the cluster gets initialised quickly, but it takes time for the first managed machine pool (the first system machine pool) to reach a ready state, roughly 25-30 minutes. I have not timed the deletion exactly, but from boot until the first system pool comes up it is approximately 30 minutes.
I debugged this extensively in #1488, there's a lag in Azure List VMSS which causes AMMP reconciliation to take way longer than expected. Both pools and the cluster should generally be done in ~10 minutes. I have a workaround and I'm also chasing it down internally with the owning team.
There's also a small issue with watches where if we reconcile AMMP with a stale AMCP in the cache, we won't requeue the 2nd machinepool for a long time (because our mapper only maps to system pools).
Thanks
@@ -19,7 +19,7 @@ package agentpools
import (
	"context"

	"github.com/Azure/azure-sdk-for-go/services/containerservice/mgmt/2020-02-01/containerservice"
	"github.com/Azure/azure-sdk-for-go/services/containerservice/mgmt/2021-03-01/containerservice"
Any chance I can persuade you to upgrade to 2021-05-01? 😃 The diff should be basically nothing. If there's any non-trivial diff, you can leave it at 2021-03-01.
ack 👍
The one minor difference while moving from 2021-03-01 to 2021-05-01 is that the ListClusterAdminCredentials API has an extra param, serverFqdn, in 2021-05-01. I have kept it as an empty string.
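For reference, a hedged sketch of what that call looks like against the 2021-05-01 SDK (the wrapper function, resource group, and error handling are illustrative): the new serverFqdn argument is passed as an empty string to preserve the previous behaviour.

```go
package managedclusters // illustrative

import (
	"context"
	"errors"

	"github.com/Azure/azure-sdk-for-go/services/containerservice/mgmt/2021-05-01/containerservice"
)

// getAdminKubeconfig is a hypothetical wrapper showing the extra serverFqdn
// parameter added in 2021-05-01; "" keeps the 2021-03-01 behaviour.
func getAdminKubeconfig(ctx context.Context, mcClient containerservice.ManagedClustersClient, resourceGroup, name string) ([]byte, error) {
	creds, err := mcClient.ListClusterAdminCredentials(ctx, resourceGroup, name, "")
	if err != nil {
		return nil, err
	}
	if creds.Kubeconfigs == nil || len(*creds.Kubeconfigs) == 0 {
		return nil, errors.New("no admin kubeconfig returned for managed cluster")
	}
	return *(*creds.Kubeconfigs)[0].Value, nil
}
```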
creating an AzureManagedControlPlane requires defining the default
machine pool, since AKS requires at least one system pool at creation
time.
creating atleast one AzureManagedMachinePool with Spec.Mode System,
creating atleast one AzureManagedMachinePool with Spec.Mode System,
creating at least one AzureManagedMachinePool with Spec.Mode System,
ack 👍
@@ -23,6 +23,10 @@ import (

// AzureManagedMachinePoolSpec defines the desired state of AzureManagedMachinePool.
type AzureManagedMachinePoolSpec struct {
	// Mode - represents mode of an agent pool. Possible values include: System, User.
	// +kubebuilder:validation:Enum=System;User
	Mode string `json:"mode"`
Should we default this, or require it? I'm fine as-is but thought I'd raise it.
I am okay with anything
Have we decided on this? Should I add a default instead of making it required?
Changing from required => optional is non-breaking, let’s leave it required for now 👍
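For completeness, a sketch of the two options discussed (kubebuilder markers only; the PR keeps the field required, and the alternative is shown commented out as a hypothetical):

```go
// Option kept by the PR: required field, validated against the enum.
// +kubebuilder:validation:Enum=System;User
Mode string `json:"mode"`

// Hypothetical alternative (not adopted): optional with a default, which the
// thread notes could still be introduced later without breaking anything.
// +kubebuilder:validation:Enum=System;User
// +kubebuilder:default=User
// +optional
// Mode string `json:"mode,omitempty"`
```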
@@ -21,8 +21,21 @@ import (
	capierrors "sigs.k8s.io/cluster-api/errors"
)

const (
	// LabelAgentPoolMode represents mode of an agent pool. Possible values include: System, User.
	LabelAgentPoolMode = "azurecluster.infrastructure.cluster.x-k8s.io/agentpoolmode"
"azurecluster" ? maybe "azuremanagedmachinepool"? or at least azuremanagedcontrolplane?
Will change it to azuremanagedmachinepool.
	LabelAgentPoolMode = "azurecluster.infrastructure.cluster.x-k8s.io/agentpoolmode"

	// SystemNodePool represents mode system for azuremachinepool.
	SystemNodePool = "System"
nit: you can probably just use the SDK values AgentPoolMode{System,User}, or else I'd prefer naming slightly closer to that (SystemNodePool is a bit too generic imo and could easily be a var name used elsewhere)
Actually, I think keeping the label definition here is better. Maybe Label{System,User}NodePool? Not super picky on this one, tbh.
ack 👍
Will this be fine?

type NodePool string

const (
	// AgentPoolModeSystem represents mode system for azuremachinepool.
	AgentPoolModeSystem NodePool = "System"
)
I’d probably do:

type NodePoolMode string

const (
	NodePoolModeSystem …
	NodePoolModeUser …
)

or something to align the names a bit more closely
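Putting the naming suggestions together, a sketch of what the constants could end up looking like (illustrative only; the label key follows the rename to azuremanagedmachinepool agreed above):

```go
// NodePoolMode is the mode of an agent pool as exposed on AzureManagedMachinePool.
type NodePoolMode string

const (
	// NodePoolModeSystem represents an AKS system node pool.
	NodePoolModeSystem NodePoolMode = "System"
	// NodePoolModeUser represents an AKS user node pool.
	NodePoolModeUser NodePoolMode = "User"

	// LabelAgentPoolMode records the agent pool mode on the AMMP object.
	LabelAgentPoolMode = "azuremanagedmachinepool.infrastructure.cluster.x-k8s.io/agentpoolmode"
)
```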
	Cluster:          cluster,
	ControlPlane:     azureControlPlane,
	MachinePool:      ownerPool,
	InfraMachinePool: defaultPool,
I think we can remove InfraMachinePool from ManagedControlPlaneScope now?
you've done it here, but I think it can be removed from the struct definitions as well
I did remove it initially, but then had to add it back as the AMMP controller also uses ManagedControlPlaneScope; the InfraMachinePool is required there.
@LochanRn: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@LochanRn: The following tests failed.
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Unable to rebase correctly, closing this PR. Will open a fresh PR.
What type of PR is this?
/kind bug
What this PR does / why we need it:
System node pools will be of mode System and not User.
Added mode in the AzureManagedMachinePool API to set the mode of Azure agent pools.
Removed defaultPoolRef from the AzureManagedControlPlane API (provides more flexibility to manage system node pools).
Validation for deletion of the last system node pool is handled in the webhook.
Bumped the containerservice SDK from 2020-02-01 to 2021-03-01.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #1376 #1416
Special notes for your reviewer:
Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.
TODOs:
Release note: