[v1.27] fix(hetzner): insufficient nodes when boot fails #6366

apricote · 2023-12-11T10:07:30Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

Backport of #6364 to v1.27.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fixed a bug where failed servers are kept for longer than necessary

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

The Hetzner Cloud API returns "Actions" for every asynchronous operation that runs in the backend. Creating a new server returns multiple actions: `create_server`, `start_server`, and `attach_to_network` (if a network is set). Our current code waits only for the `create_server` action; if it fails, the code deletes the server so that cluster-autoscaler can immediately create a new one to provide the required capacity. If one of the follow-up actions fails, however, we do not handle it. This causes problems when the server does not start properly on the first try: the customer is left with a shut-down server that they pay for, but they do not receive the additional capacity for their Kubernetes cluster. This commit fixes the bug by awaiting all actions returned by the create server API call and deleting the server if any of them fails.

k8s-ci-robot · 2023-12-11T10:07:55Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: apricote

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~cluster-autoscaler/cloudprovider/hetzner/OWNERS~~ [apricote]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

x13n · 2023-12-11T10:14:32Z

/lgtm

Shubham82 · 2023-12-11T10:25:43Z

/lgtm

k8s-ci-robot added the kind/bug, cncf-cla: yes, and size/S labels Dec 11, 2023

k8s-ci-robot requested a review from BigDarkClown December 11, 2023 10:07

k8s-ci-robot added the area/cluster-autoscaler label Dec 11, 2023

k8s-ci-robot requested a review from x13n December 11, 2023 10:07

k8s-ci-robot added the approved label Dec 11, 2023

k8s-ci-robot assigned x13n Dec 11, 2023

k8s-ci-robot added the lgtm label Dec 11, 2023

apricote mentioned this pull request Dec 11, 2023

fix(hetzner): insufficient nodes when boot fails #6364

Merged

k8s-ci-robot assigned Shubham82 Dec 11, 2023

k8s-ci-robot merged commit e1b7582 into kubernetes:cluster-autoscaler-release-1.27 Dec 11, 2023
6 checks passed
