
Cluster Autoscaler stops working when cluster has nodes with invalid server ID #6716

Closed
maksim-paskal opened this issue Apr 16, 2024 · 4 comments · Fixed by #6717
Labels
area/provider/hetzner Issues or PRs related to Hetzner provider kind/bug Categorizes issue or PR as related to a bug.

Comments

@maksim-paskal
Contributor

Which component are you using?: [email protected]

While using the autoscaler with the Hetzner Cloud provider:

./cluster-autoscaler --cloud-provider=hetzner

The autoscaler stops working (loops continuously) with the message:

E0416 12:21:58.365577       1 static_autoscaler.go:354] Failed to get node infos for groups: failed to check if server robot://1 exists error: failed to get servers for node root-minio-hcloud error: server not found

Scenarios that trigger this behaviour:

  1. join a Hetzner Cloud worker node, then delete the server via the Hetzner Cloud Console or API
  2. join a Hetzner Dedicated worker node to the cluster with ProviderID="robot://1" (anything other than the valid "hcloud://[0-9]+" format)
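The distinction between the two ProviderID formats can be sketched as below. This is a minimal illustration, not the actual fix from #6717; `parseHcloudID` and its error handling are hypothetical names for the idea of skipping foreign nodes instead of aborting.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// hcloudIDRe matches the provider ID format set on Hetzner Cloud nodes.
var hcloudIDRe = regexp.MustCompile(`^hcloud://([0-9]+)$`)

// parseHcloudID returns the numeric server ID for a valid
// "hcloud://<id>" provider ID, or an error for anything else
// (e.g. "robot://1" on a dedicated server).
func parseHcloudID(providerID string) (int64, error) {
	m := hcloudIDRe.FindStringSubmatch(providerID)
	if m == nil {
		return 0, fmt.Errorf("provider ID %q is not managed by hcloud", providerID)
	}
	return strconv.ParseInt(m[1], 10, 64)
}

func main() {
	for _, id := range []string{"hcloud://12345", "robot://1"} {
		sid, err := parseHcloudID(id)
		if err != nil {
			// A tolerant autoscaler would skip such nodes here
			// instead of failing the whole node-group scan.
			fmt.Println("skipping:", err)
			continue
		}
		fmt.Println("server ID:", sid)
	}
}
```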

What did you expect to happen?:

The autoscaler should keep functioning even when worker nodes do not have valid ProviderIDs; this would let users add Hetzner Dedicated worker nodes. Additionally, the autoscaler should continue to operate even when a server has been physically deleted.

@maksim-paskal maksim-paskal added the kind/bug Categorizes issue or PR as related to a bug. label Apr 16, 2024
@apricote
Member

/area provider/hetzner

Thanks for the issue and PR @maksim-paskal!

As far as I can tell from the code, this issue is valid; I did not reproduce it locally.

Additionally, the cluster should continue to operate even when the server has been physically deleted.

In that case I would expect hcloud-cloud-controller-manager to delete the Node object. Afterwards cluster-autoscaler should work again. If that causes regular disruptions for you or your users, I would be interested to hear about this. Perhaps it is something we should consider in cluster-autoscaler.

@k8s-ci-robot k8s-ci-robot added the area/provider/hetzner Issues or PRs related to Hetzner provider label Apr 17, 2024
@maksim-paskal
Contributor Author

@apricote Thanks for the quick response.

Yes, the hcloud-cloud-controller-manager will eventually delete these nodes, but in the meantime the cluster-autoscaler stops handling any Pending pods in the cluster.

We manage clusters with more than 50 nodes, and I've observed in the cluster-autoscaler logs that even nodes with valid ProviderID can generate that message. This might occur when the hcloud-cloud-controller-manager is too busy or unavailable.

Regardless, I believe that the cluster-autoscaler should continue processing new Pending pods, even if the cluster is experiencing issues with a single node.
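The fail-soft behaviour argued for above can be sketched as a loop that logs and skips a node whose server lookup fails, rather than aborting the whole pass. This is an illustrative sketch only; `lookupServer`, `errNotFound`, and `buildNodeInfos` are hypothetical names, not the autoscaler's actual API or the change made in #6717.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the "server not found" error seen in the logs.
var errNotFound = errors.New("server not found")

// lookupServer is a stand-in for resolving a node's provider ID to a server.
func lookupServer(providerID string) (string, error) {
	if providerID == "hcloud://12345" {
		return "worker-1", nil
	}
	return "", errNotFound
}

// buildNodeInfos skips nodes whose lookup fails instead of returning an
// error for the entire cluster, so one bad node cannot block scale-up.
func buildNodeInfos(providerIDs []string) []string {
	var infos []string
	for _, id := range providerIDs {
		name, err := lookupServer(id)
		if err != nil {
			fmt.Printf("skipping node %s: %v\n", id, err)
			continue
		}
		infos = append(infos, name)
	}
	return infos
}

func main() {
	fmt.Println(buildNodeInfos([]string{"hcloud://12345", "robot://1"}))
}
```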

@apricote
Member

Sounds good then. I am not running the cluster-autoscaler anywhere myself, so any feedback from actual users is very valuable :)
