
Cluster Autoscaler stops working when cluster has nodes with invalid server ID #6716

Closed
maksim-paskal opened this issue Apr 16, 2024 · 4 comments · Fixed by #6717
Labels
area/provider/hetzner Issues or PRs related to Hetzner provider kind/bug Categorizes issue or PR as related to a bug.

Comments

@maksim-paskal
Contributor

Which component are you using?: [email protected]

While using the autoscaler with the Hetzner Cloud provider:

./cluster-autoscaler --cloud-provider=hetzner

The autoscaler stops working (loops continuously) with the message:

E0416 12:21:58.365577       1 static_autoscaler.go:354] Failed to get node infos for groups: failed to check if server robot://1 exists error: failed to get servers for node root-minio-hcloud error: server not found

Scenarios that trigger this behaviour:

  1. join a Hetzner Cloud worker node, then delete the server via the Hetzner Cloud Console or API
  2. join a Hetzner Dedicated worker node to the cluster with ProviderID="robot://1" (anything other than the valid "hcloud://[0-9]+" format)
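The distinction between the two ProviderID formats can be sketched as below. This is a minimal illustration, not the actual fix from #6717; `parseHcloudID` and its error handling are hypothetical names for the idea of skipping foreign nodes instead of aborting.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// hcloudIDRe matches the provider ID format set on Hetzner Cloud nodes.
var hcloudIDRe = regexp.MustCompile(`^hcloud://([0-9]+)$`)

// parseHcloudID returns the numeric server ID for a valid
// "hcloud://<id>" provider ID, or an error for anything else
// (e.g. "robot://1" on a dedicated server).
func parseHcloudID(providerID string) (int64, error) {
	m := hcloudIDRe.FindStringSubmatch(providerID)
	if m == nil {
		return 0, fmt.Errorf("provider ID %q is not managed by hcloud", providerID)
	}
	return strconv.ParseInt(m[1], 10, 64)
}

func main() {
	for _, id := range []string{"hcloud://12345", "robot://1"} {
		sid, err := parseHcloudID(id)
		if err != nil {
			// A tolerant autoscaler would skip such nodes here
			// instead of failing the whole node-group scan.
			fmt.Println("skipping:", err)
			continue
		}
		fmt.Println("server ID:", sid)
	}
}
```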

What did you expect to happen?:

The autoscaler should keep functioning even when worker nodes do not have valid ProviderIDs; this would let users add Hetzner Dedicated worker nodes. Additionally, the autoscaler should continue to operate even when a server has been physically deleted.

@maksim-paskal maksim-paskal added the kind/bug Categorizes issue or PR as related to a bug. label Apr 16, 2024
@apricote
Member

/area provider/hetzner

Thanks for the issue and PR @maksim-paskal!

As far as I can tell from the code, this issue is valid; I did not reproduce it locally.

Additionally, the cluster should continue to operate even when the server has been physically deleted.

In that case I would expect hcloud-cloud-controller-manager to delete the Node object. Afterwards cluster-autoscaler should work again. If that causes regular disruptions for you or your users, I would be interested to hear about this. Perhaps it is something we should consider in cluster-autoscaler.

@k8s-ci-robot k8s-ci-robot added the area/provider/hetzner Issues or PRs related to Hetzner provider label Apr 17, 2024
@maksim-paskal
Contributor Author

@apricote Thanks for the quick response.

Yes, the hcloud-cloud-controller-manager will eventually delete these nodes, but in the meantime the cluster-autoscaler stops handling any Pending pods in the cluster.

We manage clusters with more than 50 nodes, and I've observed in the cluster-autoscaler logs that even nodes with valid ProviderID can generate that message. This might occur when the hcloud-cloud-controller-manager is too busy or unavailable.

Regardless, I believe that the cluster-autoscaler should continue processing new Pending pods, even if the cluster is experiencing issues with a single node.
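The fail-soft behaviour argued for above can be sketched as a loop that logs and skips a node whose server lookup fails, rather than aborting the whole pass. This is an illustrative sketch only; `lookupServer`, `errNotFound`, and `buildNodeInfos` are hypothetical names, not the autoscaler's actual API or the change made in #6717.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the "server not found" error seen in the logs.
var errNotFound = errors.New("server not found")

// lookupServer is a stand-in for resolving a node's provider ID to a server.
func lookupServer(providerID string) (string, error) {
	if providerID == "hcloud://12345" {
		return "worker-1", nil
	}
	return "", errNotFound
}

// buildNodeInfos skips nodes whose lookup fails instead of returning an
// error for the entire cluster, so one bad node cannot block scale-up.
func buildNodeInfos(providerIDs []string) []string {
	var infos []string
	for _, id := range providerIDs {
		name, err := lookupServer(id)
		if err != nil {
			fmt.Printf("skipping node %s: %v\n", id, err)
			continue
		}
		infos = append(infos, name)
	}
	return infos
}

func main() {
	fmt.Println(buildNodeInfos([]string{"hcloud://12345", "robot://1"}))
}
```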

@apricote
Member

Sounds good then. I am not running the cluster-autoscaler anywhere myself, so any feedback from actual users is very valuable :)
