Azure- returning in-memory size incorrect value when spot instance is deleted #7373

Closed
magnetic5355 opened this issue Oct 9, 2024 · 7 comments · Fixed by #7579
Labels
area/cluster-autoscaler kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@magnetic5355

magnetic5355 commented Oct 9, 2024

Which component are you using?: cluster-autoscaler

What version of the component are you using?: 1.31

Component version: 1.31

What k8s version are you using (kubectl version)?: 1.30.5+k3s1

kubectl version Output
$ kubectl version

What environment is this in?: Azure

What did you expect to happen?: When a VMSS spot instance is deleted and the node is removed from the cluster, I expect the autoscaler to invalidate its cache.

What happened instead?: Schedulable pods are present, but the autoscaler keeps reporting an in-memory size of 9 while the VMSS actually has only 7 instances.

1 filter_out_schedulable.go:78] Schedulable pods present
I1009 02:24:15.536067 1 static_autoscaler.go:557] No unschedulable pods
I1009 02:24:15.536082 1 azure_scale_set.go:217] VMSS: k8-agent-2, returning in-memory size: 0
I1009 02:24:15.536093 1 azure_scale_set.go:217] VMSS: k8-agent-d2ds_v5, returning in-memory size: 9

--- Eventually the following is logged in a loop when the cluster tries to scale down ---

I1009 02:31:59.254556 1 static_autoscaler.go:756] Decreasing size of k8-agent-d2ds_v5, expected=9 current=7 delta=-2
I1009 02:31:59.254570 1 azure_scale_set_instance_cache.go:77] invalidating instanceCache for k8-agent-d2ds_v5
I1009 02:31:59.254579 1 azure_scale_set.go:217] VMSS: k8-agent-d2ds_v5, returning in-memory size: 9
I1009 02:31:59.254594 1 static_autoscaler.go:469] Some node group target size was fixed, skipping the iteration

How to reproduce it (as minimally and precisely as possible):

Set up a K3s cluster (not AKS).
Set the provider ID on the nodes to the proper format, i.e. aks:///
Set the kubernetes.azure.com/agentpool node label.
Add the autoscaler tags to the VMSS.
Increase the workload so that the autoscaler creates new nodes.
Delete a VMSS instance from Azure.

The in-memory size never refreshes, and new nodes are never created.

I have to restart the cluster-autoscaler pod to scale the cluster back up.

Anything else we need to know?:

magnetic5355 added the kind/bug label Oct 9, 2024
@adrianmoisey
Member

/kind cluster-autoscaler

@k8s-ci-robot
Contributor

@adrianmoisey: The label(s) kind/cluster-autoscaler cannot be applied, because the repository doesn't have them.

In response to this:

/kind cluster-autoscaler

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@adrianmoisey
Member

/area cluster-autoscaler

@tallaxes
Contributor

tallaxes commented Dec 8, 2024

/triage accepted

k8s-ci-robot added the triage/accepted label Dec 8, 2024
@tallaxes
Contributor

tallaxes commented Dec 8, 2024

Until this is fixed, one should be able to work around the issue by setting AZURE_GET_VMSS_SIZE_REFRESH_PERIOD.

@d3v3l0p3r

How would we go about setting AZURE_GET_VMSS_SIZE_REFRESH_PERIOD? @tallaxes

@tallaxes
Contributor

@d3v3l0p3r Add it to the environment variables defined for the container deployment (with the value in seconds), e.g. using extraEnv in the Helm chart; something like (untested):

extraEnv:
   AZURE_GET_VMSS_SIZE_REFRESH_PERIOD: "300"

This of course only works for self-hosted (vs. AKS-managed) deployments of cluster autoscaler.
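
For deployments that do not use the Helm chart, the same variable can presumably be set directly on the container in the Deployment manifest. The fragment below is a minimal, untested sketch: the container name is an assumption, and the value "300" is carried over from the extraEnv example above.

```yaml
# Fragment of a cluster-autoscaler Deployment (sketch only; merge into the existing manifest).
spec:
  template:
    spec:
      containers:
        - name: cluster-autoscaler   # assumed container name; use the one in your manifest
          env:
            - name: AZURE_GET_VMSS_SIZE_REFRESH_PERIOD
              value: "300"           # seconds; env values must be quoted strings
```

The pod needs to be restarted (for example by re-applying the Deployment) for the new environment variable to take effect.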
