Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set max reanimation attempts on HA watchdog #4784

Merged
merged 1 commit into from
Dec 21, 2023

Conversation

mdegat01
Copy link
Contributor

@mdegat01 mdegat01 commented Dec 21, 2023

Proposed change

In sentry we're seeing cases where supervisor tries to reanimate Home Assistant every 8 minutes indefinitely, failing every time. There is clearly something very wrong with these systems but without more info we're not sure what it is. However whatever is wrong, Supervisor cannot fix it by attempting to restart the container. So it should stop trying after 5 failed attempts in a row to avoid filling up the log with noise.

Type of change

  • Dependency upgrade
  • Bugfix (non-breaking change which fixes an issue)
  • New feature (which adds functionality to the supervisor)
  • Breaking change (fix/feature causing existing functionality to break)
  • Code quality improvements to existing code or addition of tests

Additional information

  • This PR fixes or closes issue: fixes #
  • This PR is related to issue:
  • Link to documentation pull request:
  • Link to cli pull request:

Checklist

  • The code change is tested and works locally.
  • Local tests pass. Your PR cannot be merged unless tests pass
  • There is no commented out code in this PR.
  • I have followed the development checklist
  • The code has been formatted using Black (black --fast supervisor tests)
  • Tests have been added to verify that the new code works.

If API endpoints of add-on configuration are added/changed:

@mdegat01 mdegat01 added the new-feature A new feature label Dec 21, 2023
@mdegat01 mdegat01 requested a review from agners December 21, 2023 03:36
@agners
Copy link
Member

agners commented Dec 21, 2023

While there are probably other cases where the Home Assistant Core API Watchdog fails, the main aim here is to address HomeAssistantError in supervisor.homeassistant.core in restart with the following exception:

Can't restart homeassistant: 500 Server Error for http+docker://localhost/v1.43/containers/d42a654b73855b0508e719f1bfd6db874543133c02a3f5c8c00c01bfad14c732/restart?t=260: Internal Server Error ("Cannot restart container d42a654b73855b0508e719f1bfd6db874543133c02a3f5c8c00c01bfad14c732: tried to kill container, but did not receive an exit event")

There are a total of 263 systems affected by this issue. It is not clear why Docker can't kill the container here. Usually, Linux is very throw in killing processes, killing not being possible is usually when the process hangs in kernel mode. Most likely the Home Assistant Core Python process got hung up in kernel mode somehow here.

The assumption is that in this case the Home Assistant front-end probably doesn't run/work (since the API also does not respond). But it is hard to tell for sure.

In any case, not trying indefinitely make sense. Let's hope we get the full picture through some issue report at one point.

@agners agners merged commit b7ddfba into main Dec 21, 2023
24 checks passed
@agners agners deleted the max-reanimation-attempts-homeassistant branch December 21, 2023 15:44
@github-actions github-actions bot locked and limited conversation to collaborators Dec 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants