🐛 Improve handling of bm server reboot timeouts #1327

Merged
merged 1 commit into from
Jun 19, 2024

Conversation

janiskemper
Contributor

What this PR does / why we need it:
Bare metal servers can currently time out while rebooting. However, the existing timeout handling does not behave sensibly:

  • The timeout is too long, so MachineHealthChecks usually trigger before it is reached. This PR reduces it.
  • Previously, the server was simply rebooted again when the timeout was reached. If the timeout is actually reached, however, something is probably wrong with the server and we don't want to continue. We now set a permanent error instead.

Additionally, there are some code improvements:

  • Checking whether a reboot has been triggered requires a call to the Hetzner API. This check is now done only when necessary, which saves unnecessary API calls.
  • We set a permanent error if a reboot is marked as failed, so that we stop reconciling.
  • The ProvisionSucceeded condition of a host is now saved when a permanent error occurs, so that the user can see that the permanent error happened and why.
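
One way to defer an expensive check until it is needed is to pass it around as a function and call it only on the code path that requires the answer. The sketch below illustrates that idea under assumptions: `lazyCheck`, `nextStep`, the `sshReachable` flag, and the returned step names are all hypothetical and do not reflect the actual implementation in this PR.

```go
package main

// lazyCheck wraps an expensive check (for example a remote API call) so that
// it runs at most once, and only if a caller actually asks for the result.
// Note: this sketch is not safe for concurrent use.
func lazyCheck(check func() (bool, error)) func() (bool, error) {
	var (
		done   bool
		result bool
		err    error
	)
	return func() (bool, error) {
		if !done {
			result, err = check()
			done = true
		}
		return result, err
	}
}

// nextStep is a hypothetical decision function. If the server is already
// reachable, the reboot check is skipped entirely; otherwise the check is
// invoked to decide whether to wait or to trigger a reboot.
func nextStep(sshReachable bool, rebootTriggered func() (bool, error)) (string, error) {
	if sshReachable {
		return "continue", nil // no API call needed on this path
	}
	triggered, err := rebootTriggered()
	if err != nil {
		return "", err
	}
	if triggered {
		return "wait", nil
	}
	return "trigger-reboot", nil
}
```

With this shape, the reachable path never pays for the API call, and repeated questions within one reconcile loop reuse the first answer.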

TODOs:

  • squash commits
  • include documentation
  • add unit tests

@janiskemper janiskemper requested a review from guettli May 31, 2024 16:41
@janiskemper janiskemper requested a review from guettli June 18, 2024 08:10
@guettli guettli marked this pull request as ready for review June 19, 2024 13:18
@syself-bot syself-bot bot added area/code Changes made in the code directory area/api Changes made in the api directory labels Jun 19, 2024
@syself-bot syself-bot bot added the size/M Denotes a PR that changes 50-200 lines, ignoring generated files. label Jun 19, 2024
@janiskemper janiskemper merged commit 6ba258f into main Jun 19, 2024
9 checks passed
@janiskemper janiskemper deleted the handle-reboot-timeout branch June 19, 2024 14:44