
VMSS reaches limit number of models (10) and can't be scaled anymore #4958

Closed
mweibel opened this issue Jul 2, 2024 · 6 comments · Fixed by #5164
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/backlog Higher priority than priority/awaiting-more-evidence.

Comments

@mweibel
Contributor

mweibel commented Jul 2, 2024

/kind bug

What steps did you take and what happened:
When scaling a MachinePool up and down, it eventually reaches the point where Azure returns the following error:

Virtual Machine Scale Set '{name}' has reached its limit of 10 models that may be referenced by one or more VMs belonging to the Virtual Machine Scale Set. Upgrade the VMs to the latest model of the Virtual Machine Scale Set before trying again.

At this point, the VMSS can't be scaled anymore unless we manually press the update button within the portal (or do so via az CLI).
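For reference, the manual workaround can also be scripted. This is a sketch using the az CLI (resource group and VMSS name are placeholders) that upgrades all VMs to the latest model in one go:

# Upgrade every VM in the scale set to the latest VMSS model.
# "*" selects all instance IDs; pass specific IDs to upgrade selectively.
az vmss update-instances \
  --resource-group {rg} \
  --name {vmssName} \
  --instance-ids "*"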

I believe most of the model changes just come from bootstrap token TTL updates, but I'm not sure, since I haven't yet figured out how to compare/diff the model versions.
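As far as I can tell there's no direct way to diff two models, but you can at least list which instances still run an older model via the latestModelApplied field (a sketch; names are placeholders):

# Show, per instance, whether the latest VMSS model is applied.
az vmss list-instances \
  --resource-group {rg} \
  --name {vmssName} \
  --query "[].{instance:instanceId, latestModel:latestModelApplied}" \
  -o table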

What did you expect to happen:
VMSS can continue to scale without issues.

Anything else you would like to add:
This issue may be tangentially related to #2975 since we might need to reflect the image model status based on what the Azure API says, and not on our own logic.

A few questions for those who are more versed in Azure and CAPZ in general:

  1. The rolling update strategy deletes VMs that aren't running the latest model. However, it's also possible to update a VM in place, which may or may not trigger a reboot. Shouldn't we do that instead of deleting?
  2. Could VMSS Flex help with this issue? I haven't tried it yet, but I plan to.

Environment:

  • cluster-api-provider-azure version: latest master with a few PRs applied
  • Kubernetes version: (use kubectl version): 1.28.5
  • OS (e.g. from /etc/os-release): Linux/Windows
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jul 2, 2024
@mweibel
Contributor Author

mweibel commented Jul 5, 2024

FYI currently testing this change: helio@73cdc0d

@mweibel
Contributor Author

mweibel commented Aug 8, 2024

The change I made seems to somewhat work, although at high scale it reaches its limits because it only works if there are no failed/evicted VMSS VMs.

I ran a quick experiment related to the VMSS PUT API.

  1. copied the JSON data from the VMSS and added it to a file
  2. removed immutable properties (name, id, ...)
  3. executed a PUT: az rest --method put --url '/subscriptions/{id}/resourcegroups/{rg}/providers/Microsoft.Compute/virtualMachineScaleSets/{vmssName}?api-version=2024-07-01' --body @vmss.json --verbose -o json

Looking at the VMSS right afterwards, I see that the VMSS VMs now report "Latest Model: No", even though no changes have been made at all.

Is this an issue with the VMSS API or does CAPZ need to verify no changes have been made before executing a CreateOrUpdate on the VMSS?

@mweibel
Contributor Author

mweibel commented Aug 9, 2024

Looking further into this: when doing an instance scale on the VMSS using the Azure Portal, it uses the VMSS PATCH API instead of PUT. This API behaves differently: the "Latest model" property is not set to "No" on existing instances. Why this is the case is beyond my knowledge, but it means that CAPZ could generate a patch from the existing VMSS parameters to the new ones and execute a PATCH call instead.
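For illustration, a minimal capacity-only PATCH of the kind the portal issues might look like this (a sketch; same placeholders as above). In contrast to the PUT earlier, only the changed field is sent:

# Scale to 5 instances via PATCH; the body contains only the changed field.
az rest --method patch \
  --url '/subscriptions/{id}/resourcegroups/{rg}/providers/Microsoft.Compute/virtualMachineScaleSets/{vmssName}?api-version=2024-07-01' \
  --body '{"sku": {"capacity": 5}}'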

@willie-yao
Contributor

/priority backlog

@k8s-ci-robot k8s-ci-robot added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Aug 15, 2024
@willie-yao
Contributor

> it means that CAPZ could generate a patch from the existing VMSS parameters to the new ones and execute a patch call instead.

Apologies for the delay on this!! This honestly seems more like an Azure bug than anything else, as it's unclear why the behavior would be different. If you have a fix for working around this in CAPZ, I think it would be appropriate to incorporate it. @jackfrancis @nojnhuh can you shed some light on this as well in case I'm missing something?

@mweibel
Contributor Author

mweibel commented Sep 27, 2024

Investigating this again in a bit more detail. What I found out so far is that my initial conclusion was only halfway right: PATCH also creates a new model when we supply the CustomData (which was not included when I manually copied the VMSS JSON and made the request earlier).
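To illustrate the difference, here's a sketch of a PATCH that includes customData (placeholder values). Unlike the capacity-only PATCH above, this does create a new model:

# PATCH including customData: this bumps the VMSS model version.
az rest --method patch \
  --url '/subscriptions/{id}/resourcegroups/{rg}/providers/Microsoft.Compute/virtualMachineScaleSets/{vmssName}?api-version=2024-07-01' \
  --body '{"properties": {"virtualMachineProfile": {"osProfile": {"customData": "<base64-encoded-bootstrap-data>"}}}}'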

It seems the diff is applied at a slightly more granular level. In my fork I added a change to completely remove vmss.Properties:

func (s *ScaleSetSpec) existingParameters(ctx context.Context, existing interface{}) (parameters interface{}, err error) {
	// [..snip..]

	// If there are no model changes and no increase in the replica count, do not update the VMSS.
	// Decreases in the replica count are handled by deleting AzureMachinePoolMachine instances in the MachinePoolScope.
	if *vmss.SKU.Capacity <= existingInfraVMSS.Capacity && !hasModelChanges && !s.ShouldPatchCustomData {
		// up to date, nothing to do
		return nil, nil
	}

	// If there are no model changes and no change in custom data, drop all properties to avoid
	// unnecessary VMSS model updates.
	if !hasModelChanges && !s.ShouldPatchCustomData { // <-- these lines are new
		vmss.Properties = nil
	}

	return vmss, nil
}

This seems to work so far. It's not yet ready for review because hasModelChanges and ShouldPatchCustomData don't cover all possible differences, so we might need a more elaborate diff check.

This is also visible in the change history for that particular VMSS: after applying that change, the properties.VirtualMachineProfile.timeCreated property is no longer updated whenever a capacity update is done. That's probably where the root cause lies.

[Screenshot: VMSS change history showing properties.VirtualMachineProfile.timeCreated unchanged after capacity-only updates]
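One way to check this (a sketch; virtualMachineProfile.timeCreated requires api-version 2021-11-01 or newer):

# Show when the current VM profile (model) was created; this timestamp
# should stay constant across capacity-only updates after the change above.
az vmss show \
  --resource-group {rg} \
  --name {vmssName} \
  --query "virtualMachineProfile.timeCreated"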

mweibel added a commit to helio/cluster-api-provider-azure that referenced this issue Oct 7, 2024
… update

fixes kubernetes-sigs#4958

Scale up/down with MachinePool always updates the VM image model in use. This change sets the VirtualMachineProfile to nil when no change is necessary, which reduces churn on scale up/down and avoids model updates that may require manual fixing.