
VMSS reaches limit number of models (10) and can't be scaled anymore #4958

Closed
mweibel opened this issue Jul 2, 2024 · 6 comments · Fixed by #5164
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/backlog Higher priority than priority/awaiting-more-evidence.

Comments

@mweibel
Contributor

mweibel commented Jul 2, 2024

/kind bug

What steps did you take and what happened:
When scaling a MachinePool up and down, it eventually reaches the point where Azure returns the following error:

Virtual Machine Scale Set '{name}' has reached its limit of 10 models that may be referenced by one or more VMs belonging to the Virtual Machine Scale Set. Upgrade the VMs to the latest model of the Virtual Machine Scale Set before trying again.

At this point, the VMSS can't be scaled anymore unless we manually press the update button within the portal (or do so via az CLI).
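For reference, the manual workaround can also be scripted. This is a sketch using the az CLI (resource group and VMSS name are placeholders) that upgrades all VMs to the latest model in one go:

# Upgrade every VM in the scale set to the latest VMSS model.
# "*" selects all instance IDs; pass specific IDs to upgrade selectively.
az vmss update-instances \
  --resource-group {rg} \
  --name {vmssName} \
  --instance-ids "*"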

I believe most of the model changes just come from bootstrap token TTL updates, but I'm not sure, since I haven't yet figured out how to compare/diff the model versions.
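As far as I can tell there's no direct way to diff two models, but you can at least list which instances still run an older model via the latestModelApplied field (a sketch; names are placeholders):

# Show, per instance, whether the latest VMSS model is applied.
az vmss list-instances \
  --resource-group {rg} \
  --name {vmssName} \
  --query "[].{instance:instanceId, latestModel:latestModelApplied}" \
  -o table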

What did you expect to happen:
VMSS can continue to scale without issues.

Anything else you would like to add:
This issue may be tangentially related to #2975 since we might need to reflect the image model status based on what the Azure API says, and not on our own logic.

A few questions for those who are more versed in Azure and CAPZ in general:

  1. The rolling update strategy deletes VMs that aren't running the latest model. However, it's also possible to update a VM in place, which may or may not trigger a reboot. Shouldn't we do that instead of deleting?
  2. Could VMSS Flex help with this issue? I haven't tried it yet, but I plan to.

Environment:

  • cluster-api-provider-azure version: latest master with a few PRs applied
  • Kubernetes version: (use kubectl version): 1.28.5
  • OS (e.g. from /etc/os-release): Linux/Windows
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jul 2, 2024
@mweibel
Contributor Author

mweibel commented Jul 5, 2024

FYI currently testing this change: helio@73cdc0d

@mweibel
Contributor Author

mweibel commented Aug 8, 2024

The change I made seems to somewhat work, although at high scale it reaches its limits because it only works if there are no failed/evicted VMSS VMs.

I ran a quick experiment related to the VMSS PUT API.

  1. copied the JSON data from the VMSS and added it to a file
  2. removed immutable properties (name, id, ...)
  3. executed a PUT: az rest --method put --url '/subscriptions/{id}/resourcegroups/{rg}/providers/Microsoft.Compute/virtualMachineScaleSets/{vmssName}?api-version=2024-07-01' --body @vmss.json --verbose -o json

Looking at the VMSS right afterwards, I see that the VMSS VMs now report "Latest Model: No", even though no changes have been made at all.

Is this an issue with the VMSS API or does CAPZ need to verify no changes have been made before executing a CreateOrUpdate on the VMSS?

@mweibel
Contributor Author

mweibel commented Aug 9, 2024

Looking further into this: when doing an instance scale on the VMSS using the Azure Portal, it uses the VMSS PATCH API instead of PUT. This API behaves differently: the "Latest model" property is not set to "No" on existing instances. Why this is the case is beyond my knowledge, but it means that CAPZ could generate a patch from the existing VMSS parameters to the new ones and execute a PATCH call instead.
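For illustration, a minimal capacity-only PATCH of the kind the portal issues might look like this (a sketch; same placeholders as above). In contrast to the PUT earlier, only the changed field is sent:

# Scale to 5 instances via PATCH; the body contains only the changed field.
az rest --method patch \
  --url '/subscriptions/{id}/resourcegroups/{rg}/providers/Microsoft.Compute/virtualMachineScaleSets/{vmssName}?api-version=2024-07-01' \
  --body '{"sku": {"capacity": 5}}'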

@willie-yao
Contributor

/priority backlog

@k8s-ci-robot k8s-ci-robot added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Aug 15, 2024
@willie-yao
Contributor

> it means that CAPZ could generate a patch from the existing VMSS parameters to the new ones and execute a patch call instead.

Apologies for the delay on this!! This honestly seems more like an Azure bug than anything else, as it's unclear why the behavior would be different. If you have a fix for working around this in CAPZ, I think it would be appropriate to incorporate it. @jackfrancis @nojnhuh can you shed some light on this as well in case I'm missing something?

@mweibel
Contributor Author

mweibel commented Sep 27, 2024

Investigating this again in a bit more detail. What I found out so far is that my initial conclusion was only halfway right: PATCH also creates a new model when we supply the CustomData (which was not included when I manually copied the VMSS JSON and made the request earlier).
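To illustrate the difference, here's a sketch of a PATCH that includes customData (placeholder values). Unlike the capacity-only PATCH above, this does create a new model:

# PATCH including customData: this bumps the VMSS model version.
az rest --method patch \
  --url '/subscriptions/{id}/resourcegroups/{rg}/providers/Microsoft.Compute/virtualMachineScaleSets/{vmssName}?api-version=2024-07-01' \
  --body '{"properties": {"virtualMachineProfile": {"osProfile": {"customData": "<base64-encoded-bootstrap-data>"}}}}'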

It seems the diff is applied at a slightly more granular level. In my fork I added a change to completely remove vmss.Properties:

func (s *ScaleSetSpec) existingParameters(ctx context.Context, existing interface{}) (parameters interface{}, err error) {
	// [..snip..]

	// If there are no model changes and no increase in the replica count, do not update the VMSS.
	// Decreases in the replica count are handled by deleting AzureMachinePoolMachine instances in the MachinePoolScope.
	if *vmss.SKU.Capacity <= existingInfraVMSS.Capacity && !hasModelChanges && !s.ShouldPatchCustomData {
		// up to date, nothing to do
		return nil, nil
	}

	// If there are no model changes and no change in custom data, drop all properties to avoid
	// unnecessary VMSS model updates.
	if !hasModelChanges && !s.ShouldPatchCustomData { // <-- these lines are new
		vmss.Properties = nil
	}

	return vmss, nil
}

This seems to work so far. It's not yet ready for review because hasModelChanges and ShouldPatchCustomData don't cover all possible differences, so we might need a more elaborate diff check.

This is also visible in the change history for that particular VMSS: after applying that change, the properties.VirtualMachineProfile.timeCreated property is no longer updated whenever a capacity update is done. That's probably where the root cause lies.

[Screenshot: VMSS change history showing properties.VirtualMachineProfile.timeCreated unchanged after capacity-only updates]
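One way to check this (a sketch; virtualMachineProfile.timeCreated requires api-version 2021-11-01 or newer):

# Show when the current VM profile (model) was created; this timestamp
# should stay constant across capacity-only updates after the change above.
az vmss show \
  --resource-group {rg} \
  --name {vmssName} \
  --query "virtualMachineProfile.timeCreated"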

mweibel added a commit to helio/cluster-api-provider-azure that referenced this issue Oct 7, 2024
… update

fixes kubernetes-sigs#4958

Scale up/down with MachinePool always updates the VM image model in use. This change sets the VirtualMachineProfile to nil when no change is necessary, which reduces churn on scale up/down and avoids model updates that may require manual fixing.