Azure- returning in-memory size incorrect value when spot instance is deleted #7373

magnetic5355 · 2024-10-09T02:31:05Z

Which component are you using?:cluster-autoscaler

What version of the component are you using?: 1.31

What k8s version are you using (kubectl version)?: 1.30.5+k3s1

kubectl version Output

$ kubectl version

What environment is this in?: Azure

What did you expect to happen?: When a VMSS spot instance is deleted and the node is removed from the cluster, I expect the autoscaler to invalidate its cache.

What happened instead?: Schedulable pods are present, but the in-memory size stays at 9 while the actual VMSS has only 7 instances.

1 filter_out_schedulable.go:78] Schedulable pods present
I1009 02:24:15.536067 1 static_autoscaler.go:557] No unschedulable pods
I1009 02:24:15.536082 1 azure_scale_set.go:217] VMSS: k8-agent-2, returning in-memory size: 0
I1009 02:24:15.536093 1 azure_scale_set.go:217] VMSS: k8-agent-d2ds_v5, returning in-memory size: 9

--- eventually this will start logging in a loop when the cluster tries to scale down ----

I1009 02:31:59.254556 1 static_autoscaler.go:756] Decreasing size of k8-agent-d2ds_v5, expected=9 current=7 delta=-2
I1009 02:31:59.254570 1 azure_scale_set_instance_cache.go:77] invalidating instanceCache for k8-agent-d2ds_v5
I1009 02:31:59.254579 1 azure_scale_set.go:217] VMSS: k8-agent-d2ds_v5, returning in-memory size: 9
I1009 02:31:59.254594 1 static_autoscaler.go:469] Some node group target size was fixed, skipping the iteration

How to reproduce it (as minimally and precisely as possible):

1. Set up a K3s cluster (not using AKS).
2. Set the provider ID on the nodes to the proper format, i.e. aks:///
3. Set the kubernetes.azure.com/agentpool node label.
4. Add the autoscaler tags to the VMSS.
5. Increase the workload so that the autoscaler creates new nodes.
6. Delete a VMSS instance from Azure.

The in-memory size never refreshes, and new nodes are never created.

I have to restart the cluster-autoscaler pod to scale the cluster back up

Anything else we need to know?:

adrianmoisey · 2024-10-09T11:05:18Z

/kind cluster-autoscaler

k8s-ci-robot · 2024-10-09T11:05:20Z

@adrianmoisey: The label(s) kind/cluster-autoscaler cannot be applied, because the repository doesn't have them.

In response to this:

/kind cluster-autoscaler

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

adrianmoisey · 2024-10-09T11:05:42Z

/area cluster-autoscaler

tallaxes · 2024-12-08T01:13:36Z

/triage accepted

tallaxes · 2024-12-08T02:31:57Z

Until fixed, one should be able to work around the issue by setting AZURE_GET_VMSS_SIZE_REFRESH_PERIOD

d3v3l0p3r · 2024-12-20T11:26:48Z

How would we go about setting AZURE_GET_VMSS_SIZE_REFRESH_PERIOD? @tallaxes

tallaxes · 2024-12-20T17:29:24Z

@d3v3l0p3r Add it to the environment variables (with the value in seconds) defined for the container in the deployment, e.g. using extraEnv in the Helm chart; something like (untested):

extraEnv:
  AZURE_GET_VMSS_SIZE_REFRESH_PERIOD: "300"

This of course only works for a self-hosted (as opposed to AKS-managed) deployment of cluster autoscaler.

magnetic5355 added the kind/bug Categorizes issue or PR as related to a bug. label Oct 9, 2024

k8s-ci-robot added the area/cluster-autoscaler label Oct 9, 2024

k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Dec 8, 2024

tallaxes mentioned this issue Dec 8, 2024

fix: correctly set the default refresh period for VMSS size (used for Spot instances) #7579

Merged

k8s-ci-robot closed this as completed in #7579 Dec 8, 2024

YvesZelros mentioned this issue Dec 13, 2024

[BUG] preemption is not helpful for scheduling for all deployments since 1.31.2 upgrade Azure/AKS#4669

Open
