[AzureManagedCluster][Spot] AzureManagedMachinePool fluctuating between Running and Provisioned states #4112

Closed
esierra-stratio opened this issue Oct 10, 2023 · 13 comments · Fixed by #4126
Labels: kind/bug, triage/accepted

esierra-stratio commented Oct 10, 2023

/kind bug

AzureManagedMachinePools in an AzureManagedCluster with Spot enabled experience continuous fluctuation between the Running and Provisioned states.

What steps did you take and what happened:
Deployed an AzureManagedCluster with provider version 1.11.3.

Below is the configuration of one of the affected AzureManagedMachinePools:

- apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
  kind: AzureManagedMachinePool
  metadata:
    finalizers:
    - azurecluster.infrastructure.cluster.x-k8s.io
    labels:
      azuremanagedmachinepool.infrastructure.cluster.x-k8s.io/agentpoolmode: User
      cluster.x-k8s.io/cluster-name: esierra-dev
      clusterctl.cluster.x-k8s.io/move-hierarchy: "true"
    name: worker2-mp-2
    namespace: cluster-esierra-dev
    ownerReferences:
    - apiVersion: cluster.x-k8s.io/v1beta1
      blockOwnerDeletion: true
      controller: true
      kind: MachinePool
      name: worker2-mp-2
    resourceVersion: "12913"
  spec:
    availabilityZones:
    - "3"
    linuxOSConfig:
      sysctls:
        vmMaxMapCount: 262144
    mode: User
    name: worker2mp2
    nodeLabels:
      backup: "false"
    osDiskSizeGB: 50
    osDiskType: Managed
    osType: Linux
    providerIDList:
    - azure:///subscriptions/REDACTED/resourceGroups/esierra-dev-nodes/providers/Microsoft.Compute/virtualMachineScaleSets/aks-worker2mp2-21099784-vmss/virtualMachines/0
    scaleDownMode: Delete
    scaleSetPriority: Spot
    scaling:
      maxSize: 6
      minSize: 1
    sku: Standard_D8_v3
    subnetName: esierra-dev

What did you expect to happen:
The AzureManagedMachinePool should remain in the Running state once it passes its MachineHealthCheck.

Anything else you would like to add:
Some pieces of evidence:

  • Changes in the state of the MachinePool:
cluster-esierra-dev   worker2-mp-0   esierra-dev   1          Running       28m   v1.27.3
cluster-esierra-dev   worker2-mp-1   esierra-dev   1          Provisioned   28m   v1.27.3
cluster-esierra-dev   worker2-mp-2   esierra-dev   1          Provisioned   28m   v1.27.3
cluster-esierra-dev   worker2-mp-0   esierra-dev   1          Provisioned   28m   v1.27.3
cluster-esierra-dev   worker2-mp-1   esierra-dev   1          Running       28m   v1.27.3
cluster-esierra-dev   worker2-mp-2   esierra-dev   1          Provisioned   28m   v1.27.3
  • Activity log screenshot (image omitted); the information it contains:
Resource: /subscriptions/REDACTED/resourceGroups/esierra-dev/providers/Microsoft.ContainerService/managedClusters/esierra-dev/agentPools/worker2mp1
Operation name: Create or Update Agent Pool
Time stamp: Tue Oct 10 2023 12:03:39 GMT+0200 (Central European Summer Time)
Event initiated by: cloud-provisioner
  • Some logs related to this in capz-controller-manager:
I1010 10:14:55.557724       1 reconciler.go:108] controllers.reconciler.Reconcile "msg"="processing" "AzureManagedMachinePool"={"name":"worker2-mp-0","namespace":"cluster-esierra-dev"} "controller"="azuremanagedmachinepool" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureManagedMachinePool" "name"="worker2-mp-0" "namespace"="cluster-esierra-dev" "reconcileID"="cb580257-6727-4ed2-b50c-b647170b0928" "request"="cluster-esierra-dev/worker2-mp-0" "x-ms-correlation-request-id"="5e07e639-5a94-402d-bf74-c9dd663727a9"
I1010 10:14:55.612367       1 azuremanagedmachinepool_controller.go:250] controllers.AzureManagedMachinePoolReconciler.reconcileNormal "msg"="Reconciling AzureManagedMachinePool" "AzureManagedMachinePool"={"name":"worker2-mp-0","namespace":"cluster-esierra-dev"} "controller"="azuremanagedmachinepool" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureManagedMachinePool" "name"="worker2-mp-0" "namespace"="cluster-esierra-dev" "reconcileID"="cb580257-6727-4ed2-b50c-b647170b0928" "x-ms-correlation-request-id"="5e07e639-5a94-402d-bf74-c9dd663727a9"
I1010 10:14:55.612885       1 azuremanagedmachinepool_reconciler.go:114] controllers.azureManagedMachinePoolService.Reconcile "msg"="reconciling managed machine pool" "AzureManagedMachinePool"={"name":"worker2-mp-0","namespace":"cluster-esierra-dev"} "controller"="azuremanagedmachinepool" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureManagedMachinePool" "name"="worker2-mp-0" "namespace"="cluster-esierra-dev" "reconcileID"="cb580257-6727-4ed2-b50c-b647170b0928" "x-ms-correlation-request-id"="5e07e639-5a94-402d-bf74-c9dd663727a9"
I1010 10:14:55.612991       1 client.go:83] agentpools.azureClient.CreateOrUpdate "msg"="sending request" "AzureManagedMachinePool"={"name":"worker2-mp-0","namespace":"cluster-esierra-dev"} "controller"="azuremanagedmachinepool" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureManagedMachinePool" "name"="worker2-mp-0" "namespace"="cluster-esierra-dev" "reconcileID"="cb580257-6727-4ed2-b50c-b647170b0928" "resumeToken"="{\"type\":\"AgentPoolsClientCreateOrUpdateResponse\",\"token\":{\"asyncURL\":\"https://management.azure.com/subscriptions/REDACTED/providers/Microsoft.ContainerService/locations/westeurope/operations/2d596d46-5e42-48a1-a837-b132def4a818?api-version=2017-08-31\\u0026t=638325296493969976\\u0026c=REDACTED\",\"locURL\":\"\",\"origURL\":\"https://management.azure.com/subscriptions/REDACTED/resourceGroups/esierra-dev/providers/Microsoft.ContainerService/managedClusters/esierra-dev/agentPools/worker2mp0?api-version=2023-07-01\",\"method\":\"PUT\",\"finalState\":\"\",\"state\":\"InProgress\"}}" "x-ms-correlation-request-id"="5e07e639-5a94-402d-bf74-c9dd663727a9"
I1010 10:14:57.614416       1 azuremanagedmachinepool_controller.go:283] controllers.AzureManagedMachinePoolReconciler.reconcileNormal "msg"="requeuing due to transient failure" "AzureManagedMachinePool"={"name":"worker2-mp-0","namespace":"cluster-esierra-dev"} "controller"="azuremanagedmachinepool" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureManagedMachinePool" "error"="failed to reconcile machine pool worker2mp0: operation type PUT on Azure resource esierra-dev/worker2mp0 is not done. Object will be requeued after 15s" "name"="worker2-mp-0" "namespace"="cluster-esierra-dev" "reconcileID"="cb580257-6727-4ed2-b50c-b647170b0928" "x-ms-correlation-request-id"="5e07e639-5a94-402d-bf74-c9dd663727a9"
I1010 10:14:57.615424       1 reconciler.go:115] controllers.reconciler.Reconcile "msg"="successful" "AzureManagedMachinePool"={"name":"worker2-mp-0","namespace":"cluster-esierra-dev"} "controller"="azuremanagedmachinepool" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="AzureManagedMachinePool" "name"="worker2-mp-0" "namespace"="cluster-esierra-dev" "reconcileID"="cb580257-6727-4ed2-b50c-b647170b0928" "request"="cluster-esierra-dev/worker2-mp-0" "x-ms-correlation-request-id"="5e07e639-5a94-402d-bf74-c9dd663727a9"

Environment:

  • cluster-api-provider-azure version: 1.11.3
  • Kubernetes version (kubectl version): v1.27.0
Jont828 commented Oct 12, 2023

/triage accepted

nojnhuh commented Oct 12, 2023

@esierra-stratio With the log verbosity cranked up to at least 4 with the -v flag on the container in the capz-controller-manager Deployment, do you see any logs that include "found a diff"? Those should describe exactly what CAPZ is trying and failing to change each time. I deployed what I think is a sufficiently equivalent AzureManagedMachinePool and don't see this behavior:

apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureManagedMachinePool
metadata:
  name: spot
  namespace: default
spec:
  availabilityZones:
    - "3"
  mode: User
  name: spot
  sku: Standard_D2s_v3
  scaleSetPriority: Spot
  scaling:
    minSize: 1
    maxSize: 6
  scaleDownMode: Delete
  linuxOSConfig:
    sysctls:
      vmMaxMapCount: 262144
  nodeLabels:
    backup: "false"
  osDiskSizeGB: 50
  osDiskType: Managed
  osType: Linux

nojnhuh commented Oct 12, 2023

^ My mistake; I didn't notice that the machine pool wasn't getting created in the first place when I was trying to repro (looks like an AKS bug). Still working out the details, but does the fluctuation stop if you define spec.spotMaxPrice on the AzureManagedMachinePool?

nojnhuh commented Oct 13, 2023

Yep, it looks like it's because the AKS API populates spotMaxPrice, while CAPZ leaves it nil for its diff unless it's defined in the AzureManagedMachinePool:

  armcontainerservice.AgentPool{
  	Properties: &armcontainerservice.ManagedClusterAgentPoolProfileProperties{
  		... // 29 identical fields
  		ScaleSetEvictionPolicy: nil,
  		ScaleSetPriority:       nil,
- 		SpotMaxPrice:           nil,
+ 		SpotMaxPrice:           &-1,
  		Tags:                   nil,
  		Type:                   nil,
  		... // 7 identical fields
  	},
  	ID:   nil,
  	Name: nil,
  	Type: nil,
  }

@CecileRobertMichon Do you think it would be better to default spec.spotMaxPrice in the AzureManagedMachinePool webhooks, treat nil and -1 as equal when we perform the diff, or both? With just the webhook change, I don't think the fix would automatically kick in after a CAPZ upgrade, whereas the alternative approach would fix this immediately for affected resources.
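
For illustration, a minimal, self-contained sketch of the second option (treating nil and -1 as equal during the diff) using go-cmp, which appears to be what produced the diff output above. The agentPoolParams struct and the comparer are invented for this example, not CAPZ's actual code; note that a cmp.Comparer registered this way applies to every *float32 field, so a field-scoped filter may be needed in practice:

package main

import (
	"fmt"

	"github.com/google/go-cmp/cmp"
)

// agentPoolParams stands in for the armcontainerservice agent pool
// properties; only the field relevant to this bug is included.
type agentPoolParams struct {
	SpotMaxPrice *float32
}

// spotMaxPriceComparer treats a nil max price as equal to AKS's -1
// sentinel ("pay up to the on-demand price"), so the AKS-defaulted -1
// no longer registers as drift against a spec that leaves it unset.
var spotMaxPriceComparer = cmp.Comparer(func(a, b *float32) bool {
	norm := func(p *float32) float32 {
		if p == nil {
			return -1
		}
		return *p
	}
	return norm(a) == norm(b)
})

func main() {
	defaulted := float32(-1)
	existing := agentPoolParams{SpotMaxPrice: &defaulted} // as returned by AKS
	desired := agentPoolParams{SpotMaxPrice: nil}         // as specified by the user

	// Empty diff: no spurious CreateOrUpdate would be issued.
	fmt.Printf("diff: %q\n", cmp.Diff(existing, desired, spotMaxPriceComparer))
}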

CecileRobertMichon commented
@nojnhuh I don't think it makes sense to set SpotMaxPrice at all when scaleSetPriority isn't set to Spot; however, it makes sense to default to -1 when scaleSetPriority is Spot but SpotMaxPrice isn't specified. Is AKS defaulting that field no matter what?

Is this a change in the AKS API with the move to SDK v2?

nojnhuh commented Oct 13, 2023

> @nojnhuh I don't think it makes sense to set SpotMaxPrice at all when scaleSetPriority isn't set to Spot; however, it makes sense to default to -1 when scaleSetPriority is Spot but SpotMaxPrice isn't specified. Is AKS defaulting that field no matter what?

It looks like AKS only defaults that on Spot node pools.

> Is this a change in the AKS API with the move to SDK v2?

I'm guessing this also affects v1.10 since spotMaxPrice was a pointer before. I don't remember ever testing this scenario where spotMaxPrice wasn't defined on a Spot node pool.

nojnhuh commented Oct 13, 2023

Ah, spotMaxPrice was only added for v1.11: 6b96d7f

nojnhuh commented Oct 13, 2023

I can confirm this bug existed in the original PR, before SDK v2.

nojnhuh commented Oct 13, 2023

@CecileRobertMichon This bug doesn't exist in #4069 with ASO, so if we default spotMaxPrice in the webhook here, could we remove it later? Or would that be a breaking change? I wonder if it might be marginally safer not to default it in CAPZ if we can, in the mostly hypothetical scenario that AKS changes its sentinel "unlimited" value from "-1" to something else in a future API version.

jackfrancis commented
@nojnhuh Is the ASO resource defaulting under the hood in a similar way to this new default behavior in this PR? If so, then I don't think shipping this change and then removing it in CAPZ as part of the ASO resource introduction would constitute a breaking change.

Do we know why the ASO flow doesn't produce the flapping?

CecileRobertMichon commented
> in the mostly hypothetical scenario that AKS changes its sentinel "unlimited" value from "-1" to something else in a future API version.

This would be a breaking change in the AKS API; I would be surprised if that happens. If it does, though, we have ways to handle it when we bump the hypothetical AKS API version that changes this. Having explicit defaults is overall safer, especially for fields that are immutable. My vote would be to just keep defaulting.
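
A minimal sketch of the webhook defaulting being voted for here, assuming a *string priority and a *resource.Quantity max price; the type and helper names are illustrative and not necessarily the actual fix in #4126:

package v1beta1

import "k8s.io/apimachinery/pkg/api/resource"

// azureManagedMachinePoolSpecSketch stands in for the relevant spec fields;
// the field types are assumptions, not copied from CAPZ.
type azureManagedMachinePoolSpecSketch struct {
	ScaleSetPriority *string
	SpotMaxPrice     *resource.Quantity
}

// defaultSpotMaxPrice mirrors AKS's own behavior: on a Spot pool an unset
// max price defaults to -1, i.e. "pay up to the current on-demand price".
// Non-Spot pools are left alone, per the discussion above.
func defaultSpotMaxPrice(spec *azureManagedMachinePoolSpecSketch) {
	if spec.ScaleSetPriority != nil && *spec.ScaleSetPriority == "Spot" && spec.SpotMaxPrice == nil {
		spec.SpotMaxPrice = resource.NewQuantity(-1, resource.DecimalSI)
	}
}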

nojnhuh commented Oct 13, 2023

> @nojnhuh Is the ASO resource defaulting under the hood in a similar way to this new default behavior in this PR? If so, then I don't think shipping this change and then removing it in CAPZ as part of the ASO resource introduction would constitute a breaking change.
>
> Do we know why the ASO flow doesn't produce the flapping?

The value defaulted by AKS will be reflected in the status of the ASO resource and not in its spec. As long as CAPZ's Parameters() only modifies the spec of an ASO resource, the diff CAPZ does will only ever find changes in the spec (i.e. only fields CAPZ has already set).

If a defaulted value for spotMaxPrice shows up in status and CAPZ keeps it in the spec as nil, then the next diff will find that both spec.spotMaxPrice and status.spotMaxPrice are unchanged (although they have different values) and skip updating the ASO resource.
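
An illustrative sketch, with invented types rather than real ASO or CAPZ code, of the spec-versus-status point above: if drift detection compares desired state against the resource's spec only, an AKS-defaulted status.spotMaxPrice of -1 never triggers an update:

package main

import (
	"fmt"
	"reflect"
)

// Invented stand-ins for an ASO-style resource: spec holds only what the
// user (CAPZ) set; status reflects what Azure reports, including defaults.
type agentPoolSpec struct{ SpotMaxPrice *float32 }
type agentPoolStatus struct{ SpotMaxPrice *float32 }

type agentPoolResource struct {
	Spec   agentPoolSpec
	Status agentPoolStatus
}

// needsUpdate ignores Status entirely, so AKS surfacing -1 there while the
// spec keeps nil produces no diff and no spurious update.
func needsUpdate(desired agentPoolSpec, observed agentPoolResource) bool {
	return !reflect.DeepEqual(desired, observed.Spec)
}

func main() {
	defaulted := float32(-1)
	observed := agentPoolResource{
		Spec:   agentPoolSpec{SpotMaxPrice: nil},          // CAPZ never set it
		Status: agentPoolStatus{SpotMaxPrice: &defaulted}, // AKS defaulted it
	}
	fmt.Println(needsUpdate(agentPoolSpec{SpotMaxPrice: nil}, observed)) // false
}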

esierra-stratio (Author) commented
I was somewhat disconnected over the last few days of the previous week because I was on vacation. Thanks for the effort!
