-
Notifications
You must be signed in to change notification settings - Fork 431
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
VMSS reaches limit number of models (10) and can't be scaled anymore #4958
Comments
FYI currently testing this change: helio@73cdc0d |
The change I did seems to somewhat work although with high scale it comes at it's limits because it only works if there no failed/evicted VMSS VMs. I ran a quick experiment related to the VMSS PUT API.
Looking at the VMSS right afterwards, I see that the VMSS VMs now report "Latest Model: No". This even though no changes have been made at all. Is this an issue with the VMSS API or does CAPZ need to verify no changes have been made before executing a CreateOrUpdate on the VMSS? |
Looking further into this: when doing an instance scale in the VMSS using Azure Portal, it uses the VMSS PATCH API instead of PUT. This API behaves different in a way that the "Latest model" property is not set to "No" on existing instances. Why this is the case is beyond my knowledge, but it means that CAPZ could generate a patch from the existing VMSS parameters to the new ones and execute a patch call instead. |
/priority backlog |
Apologies for the delay on this!! This honestly seems more like an Azure bug than anything else, as it doesn't make sense why the behavior would be different. If you have a fix for working around this in CAPZ, I think it would be appropriate to incorporate it. @jackfrancis @nojnhuh can you shed some light on this as well in case I'm missing something? |
Investigating this again a bit more in detail. What I found out so far is that my initial conclusion was only halfway right. PATCH also creates a new model when we supply de CustomData (which is not included when manually copying the VMSS JSON and making a request as I did earlier). It seems the diff is applied slightly more granular. I added in my fork a change to completely remove the func (s *ScaleSetSpec) existingParameters(ctx context.Context, existing interface{}) (parameters interface{}, err error) {
// [..snip..]
// If there are no model changes and no increase in the replica count, do not update the VMSS.
// Decreases in replica count is handled by deleting AzureMachinePoolMachine instances in the MachinePoolScope
if *vmss.SKU.Capacity <= existingInfraVMSS.Capacity && !hasModelChanges && !s.ShouldPatchCustomData {
// up to date, nothing to do
return nil, nil
}
// if there are no model changes and no change in custom data, get rid of all properties to avoid unnecessary VMSS model
// updates.
if !hasModelChanges && !s.ShouldPatchCustomData { // <-- these lines are new
vmss.Properties = nil
}
return vmss, nil
} This seems to work so far. It's not yet ready to review because hasModelChanges and ShouldPatchCustomData don't really test all possible differences. Therefore we might need some more elaborate diff test. This also is visible on the change history for that particular VMSS. After applying that change, whenever a capacity update has been done, the properties.VirtualMachineProfile.timeCreated property is not updated anymore. That's probably where the root cause is. |
… update fixes kubernetes-sigs#4958 Scale up/down with MachinePool always updates the VM image model to use. This change sets the VirtualMachineProfile to nil when no change is necessary and ensures therefore less churn on scale up/downs leading to model updates which may require manual fixing
/kind bug
What steps did you take and what happened:
When scaling up and down a MachinePool, it eventually reaches the point where Azure sends the mentioned error:
At this point, the VMSS can't be scaled anymore unless we manually press the update button within the portal (or do so via az CLI).
I believe most of the changes just come from bootstrap token TTL updates but I'm not sure since I haven't yet figured out how to compare/diff the model versions.
What did you expect to happen:
VMSS can continue to scale without issues.
Anything else you would like to add:
This issue may be tangentially related to #2975 since we might need to reflect the image model status based on what Azure API says, and not our own logic.
A few questions for those who are more versed in Azure and CAPZ in general:
Environment:
kubectl version
): 1.28.5/etc/os-release
): Linux/WindowsThe text was updated successfully, but these errors were encountered: