Use longer timeouts for API checks before trigger a rollback #4658

Merged: 5 commits, Nov 1, 2023

Conversation

agners
Member

@agners agners commented Oct 31, 2023

Proposed change

Currently, we check that the Core API is accessible and that Core's state is running. If this is not the case within 5 minutes, we roll back to the previous version.

It can take quite a while for Home Assistant Core to reach the running state. In fact, after bootstrap completes, it can theoretically take indefinitely, since there is no timeout on the Core side.

So, to decide whether to trigger a rollback, check only whether the API is accessible rather than whether the state is running. This prevents spurious rollbacks.

However, with all timeouts added up plus some margin, we can almost guarantee that Core is up in all but the most obscure setups.

Use a two-step method: first check that the API responds, then check the state. This makes it easier to understand why the Supervisor chose to roll back, and it rolls the system back faster if there is a serious problem.
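
As a rough illustration of the two-step check described above, here is a minimal sketch. It is not the Supervisor's actual code: `wait_for_core`, `StartupError`, the two check callables, and all timeout values are hypothetical and chosen only for this example.

```python
import asyncio
import time
from typing import Awaitable, Callable

# Hypothetical values for illustration only; not the Supervisor's real timeouts.
API_TIMEOUT = 300.0     # seconds to wait for the API to respond at all
STATE_TIMEOUT = 1800.0  # seconds to wait for the state to become "running"
POLL_INTERVAL = 5.0     # seconds between two polls

Check = Callable[[], Awaitable[bool]]


class StartupError(Exception):
    """Raised when Core did not come up; the message names the failed step."""


async def _wait_until(check: Check, timeout: float, interval: float) -> bool:
    """Poll `check` until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if await check():
            return True
        if time.monotonic() >= deadline:
            return False
        await asyncio.sleep(interval)


async def wait_for_core(
    api_responding: Check,
    state_running: Check,
    *,
    api_timeout: float = API_TIMEOUT,
    state_timeout: float = STATE_TIMEOUT,
    interval: float = POLL_INTERVAL,
) -> None:
    """Two-step startup check: API reachability first, then the running state."""
    # Step 1: a short wait. If the API never answers, something is seriously
    # wrong and the caller can roll back quickly.
    if not await _wait_until(api_responding, api_timeout, interval):
        raise StartupError("Core API did not respond in time")
    # Step 2: a much longer wait that should cover all startup stages.
    if not await _wait_until(state_running, state_timeout, interval):
        raise StartupError("Core API responds, but state is not running")
```

A caller would catch `StartupError`, log its message, and then start the rollback, so the log shows which of the two steps failed.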

Type of change

  • Dependency upgrade
  • Bugfix (non-breaking change which fixes an issue)
  • New feature (which adds functionality to the supervisor)
  • Breaking change (fix/feature causing existing functionality to break)
  • Code quality improvements to existing code or addition of tests

Additional information

Checklist

  • The code change is tested and works locally.
  • Local tests pass. Your PR cannot be merged unless tests pass
  • There is no commented out code in this PR.
  • I have followed the development checklist
  • The code has been formatted using Black (black --fast supervisor tests)
  • Tests have been added to verify that the new code works.

Instead of only checking that the Core API responds, also check the
state. Use a timeout that is long enough to cover all stages and
other timeouts during Core startup.
@agners agners changed the title Don't check if Core is running to trigger rollback Use longer timeouts for API checks before trigger a rollback Oct 31, 2023
@bdraco
Member

bdraco commented Oct 31, 2023

Unrelated to this PR:

        # Check if port is up
        if not await self.sys_run_in_executor(
            check_port,
            self.sys_homeassistant.ip_address,
            self.sys_homeassistant.api_port,
        ):
            return False

That's a bit unexpected to run in the executor, since it could all be non-blocking.
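
For reference, a non-blocking port check could look roughly like the sketch below. The body of `check_port` is not shown in this thread, so this assumes it is a plain "can a TCP connection be opened" test; `check_port_async` and its default timeout are hypothetical names and values, not part of the Supervisor.

```python
import asyncio


async def check_port_async(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port can be opened in time."""
    try:
        # open_connection is a coroutine, so the event loop stays free while
        # the connection attempt is in progress; no executor thread is needed.
        _, writer = await asyncio.wait_for(
            asyncio.open_connection(host, port), timeout
        )
    except (OSError, asyncio.TimeoutError):
        # Connection refused, host unreachable, or the attempt timed out.
        return False
    writer.close()
    try:
        await writer.wait_closed()
    except OSError:
        # The peer may reset the connection while closing; the port was up.
        pass
    return True
```

With something like this, the call site could `await` the check directly instead of going through `sys_run_in_executor`.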

Member

@bdraco bdraco left a comment

LGTM 👍

@mdegat01 mdegat01 added the bugfix A bug fix label Nov 1, 2023
@mdegat01 mdegat01 merged commit 1e49129 into main Nov 1, 2023
22 of 23 checks passed
@mdegat01 mdegat01 deleted the dont-check-api-running-after-upgrade branch November 1, 2023 20:01
@mdegat01
Contributor

mdegat01 commented Nov 1, 2023

Thanks @agners and @bdraco ! 👍

@github-actions github-actions bot locked and limited conversation to collaborators Nov 3, 2023
Development

Successfully merging this pull request may close these issues.

Supervisor's 'Core hasn't started' API timeout check seems to be firing every time
3 participants