🐛 Improve handling of bm server reboot timeouts #1327

janiskemper · 2024-05-31T16:41:41Z

What this PR does / why we need it:
Bare metal servers can currently time out while rebooting, but the timeout handling has two problems:

- The timeout is so long that MachineHealthChecks usually trigger before it is reached. This PR reduces it.
- When the timeout was reached, the server was simply rebooted again. A reboot that actually times out probably means something is wrong with the server, so we now set a permanent error instead of continuing.
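The new timeout behavior can be sketched roughly as follows. This is a minimal illustration and not the code in `pkg/services/baremetal/host/host.go`: the `host` type, the `handleRebootTimeout` function, the error type names, and the five-minute value are all assumptions made for the example, since the PR description does not state the actual timeout.

```go
package main

import (
	"fmt"
	"time"
)

// rebootTimeout is a placeholder value; the PR does not state the real one.
const rebootTimeout = 5 * time.Minute

// errorType is a hypothetical stand-in for the error type stored on a host.
type errorType string

const (
	errorTypeNone      errorType = ""
	errorTypePermanent errorType = "permanent error"
)

// host is a hypothetical, heavily reduced model of a bare metal host.
type host struct {
	errType    errorType
	errMessage string
}

// handleRebootTimeout reports whether reconciliation should stop.
// Instead of rebooting again after a timeout, it records a permanent error.
func handleRebootTimeout(h *host, elapsed time.Duration) bool {
	if h.errType == errorTypePermanent {
		// A permanent error was already set; do not retry.
		return true
	}
	if elapsed <= rebootTimeout {
		// Still within the allowed window; keep waiting.
		return false
	}
	h.errType = errorTypePermanent
	h.errMessage = fmt.Sprintf("reboot timed out after %s (limit %s)", elapsed, rebootTimeout)
	return true
}

func main() {
	h := &host{}
	fmt.Println(handleRebootTimeout(h, 2*time.Minute), h.errType == errorTypePermanent)
	fmt.Println(handleRebootTimeout(h, 6*time.Minute), h.errMessage)
}
```

The point of the sketch is that a timeout is treated as a terminal state: once the permanent error is set, later calls return early and never trigger another reboot.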

Additionally, there are some code improvements:

- Checking whether a reboot has been triggered requires a call to the Hetzner API. This check now runs only when necessary, which saves unnecessary API calls.
- We set a permanent error if a reboot is marked as failed, so that we stop reconciling.
- The ProvisionSucceeded condition of a host is now saved when a permanent error occurs, so the user can see that the permanent error happened and why.
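The three improvements above could look roughly like the sketch below. Everything except the `ProvisionSucceeded` condition name is an assumption for illustration: the `rebootChecker` interface, the `countingClient` fake, the `host` fields, and the rule that the API check is "necessary" only while the reboot has not yet been confirmed are hypothetical and do not reflect the provider's real types or the Hetzner client API.

```go
package main

import "fmt"

// rebootChecker is a hypothetical abstraction over the remote API call.
type rebootChecker interface {
	RebootTriggered(serverID int) (bool, error)
}

// countingClient is a fake client that counts how often the API is called.
type countingClient struct {
	calls     int
	triggered bool
}

func (c *countingClient) RebootTriggered(int) (bool, error) {
	c.calls++
	return c.triggered, nil
}

// condition is a simplified stand-in for a status condition.
type condition struct {
	Type    string
	Status  bool
	Message string
}

// host is a hypothetical, reduced model of a bare metal host.
type host struct {
	serverID        int
	rebootConfirmed bool
	rebootFailed    bool
	permanentError  string
	conditions      []condition
}

// setPermanentError records the error once and saves a condition
// so that the user can see what happened and why.
func setPermanentError(h *host, msg string) {
	if h.permanentError != "" {
		return
	}
	h.permanentError = msg
	h.conditions = append(h.conditions, condition{
		Type: "ProvisionSucceeded", Status: false, Message: msg,
	})
}

// reconcileReboot reports whether reconciliation should stop.
func reconcileReboot(h *host, api rebootChecker) (bool, error) {
	if h.rebootFailed {
		setPermanentError(h, "reboot marked as failed")
		return true, nil
	}
	if h.rebootConfirmed {
		// Assumed meaning of "only when necessary": skip the API call.
		return false, nil
	}
	triggered, err := api.RebootTriggered(h.serverID)
	if err != nil {
		return false, fmt.Errorf("checking reboot state: %w", err)
	}
	h.rebootConfirmed = triggered
	return false, nil
}

func main() {
	api := &countingClient{triggered: true}
	h := &host{serverID: 1}
	for i := 0; i < 3; i++ {
		if _, err := reconcileReboot(h, api); err != nil {
			fmt.Println("error:", err)
		}
	}
	fmt.Println("API calls:", api.calls)
}
```

Keeping the API call behind a cheap local check means repeated reconcile loops do not hit the remote API again once the answer is known.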

TODOs:

- squash commits
- include documentation
- add unit tests

pkg/services/baremetal/host/host.go

janiskemper requested a review from guettli May 31, 2024 16:41

guettli reviewed Jun 5, 2024

pkg/services/baremetal/host/host.go

janiskemper requested a review from guettli June 18, 2024 08:10

guettli approved these changes Jun 19, 2024

guettli marked this pull request as ready for review June 19, 2024 13:18

syself-bot bot added the area/code (Changes made in the code directory) and area/api (Changes made in the api directory) labels Jun 19, 2024

janiskemper force-pushed the handle-reboot-timeout branch from c298628 to 37e8b0b June 19, 2024 13:30

syself-bot bot added the size/M Denotes a PR that changes 50-200 lines, ignoring generated files. label Jun 19, 2024

janiskemper merged commit 6ba258f into main Jun 19, 2024
9 checks passed

janiskemper deleted the handle-reboot-timeout branch June 19, 2024 14:44
