Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rollback of Core not working correctly #4590

Closed
agners opened this issue Sep 29, 2023 · 7 comments
Closed

Rollback of Core not working correctly #4590

agners opened this issue Sep 29, 2023 · 7 comments
Labels

Comments

@agners
Copy link
Member

agners commented Sep 29, 2023

Describe the issue you are experiencing

When Home Assistant Core is broken (e.g. due to home-assistant/core#101027), the Supervisor should rollback to the previous version. It seems that the logic is not working correctly.

With 2023.10.0b0 this can be reproduced only on some x86-64 images. But it has previously been observed with 2023.9.0.dev20230816 too, which failed on all platforms.

What type of installation are you running?

Home Assistant OS

Which operating system are you running on?

Home Assistant Operating System

Steps to reproduce the issue

  1. Install one of the affected versions ha core update --version=<x.y>
  2. Observe Supervisor does not rollback.

Anything in the Supervisor logs that might be useful for us?

23-09-29 11:22:50 INFO (MainThread) [supervisor.homeassistant.core] Updating Home Assistant to version 2023.9.0.dev20230816
23-09-29 11:22:50 INFO (MainThread) [supervisor.docker.interface] Updating image ghcr.io/home-assistant/generic-x86-64-homeassistant:2023.10.0b1 to ghcr.io/home-assistant/generic-x86-64-homeassistant:2023.9.0.dev20230816
23-09-29 11:22:50 INFO (MainThread) [supervisor.docker.interface] Downloading docker image ghcr.io/home-assistant/generic-x86-64-homeassistant with tag 2023.9.0.dev20230816.
23-09-29 11:25:52 INFO (SyncWorker_1) [supervisor.docker.manager] Stopping homeassistant application
23-09-29 11:26:04 INFO (SyncWorker_1) [supervisor.docker.manager] Cleaning homeassistant application
23-09-29 11:26:04 INFO (MainThread) [supervisor.homeassistant.module] Update pulse/client.config: /data/tmp/homeassistant_pulse
23-09-29 11:26:05 INFO (MainThread) [supervisor.docker.homeassistant] Starting Home Assistant ghcr.io/home-assistant/generic-x86-64-homeassistant with version 2023.9.0.dev20230816
23-09-29 11:26:05 INFO (MainThread) [supervisor.homeassistant.core] Wait until Home Assistant is ready
23-09-29 11:26:42 ERROR (MainThread) [supervisor.misc.tasks] Watchdog found a problem with Home Assistant API!
23-09-29 11:26:42 ERROR (MainThread) [supervisor.misc.tasks] Home Assistant watchdog reanimation failed!
23-09-29 11:28:42 WARNING (MainThread) [supervisor.misc.tasks] Watchdog miss API response from Home Assistant
23-09-29 11:30:42 ERROR (MainThread) [supervisor.misc.tasks] Watchdog found a problem with Home Assistant API!
23-09-29 11:30:42 ERROR (MainThread) [supervisor.misc.tasks] Home Assistant watchdog reanimation failed!
ha su restart
ha core options --watchdog=false
ha core update --version=<x.y>

System Health information

This is after updating to 2023.10.0b0

System Information

version core-2023.10.0b0
installation_type Home Assistant OS
dev false
hassio true
docker true
user root
virtualenv false
python_version 3.11.5
os_name Linux
os_version 6.1.55
arch x86_64
timezone Europe/Prague
config_dir /config
Home Assistant Cloud
logged_in false
can_reach_cert_server ok
can_reach_cloud_auth ok
can_reach_cloud ok
Home Assistant Supervisor
host_os Home Assistant OS 11.0.dev20230927
update_channel beta
supervisor_version supervisor-2023.09.3
agent_version 1.6.0
docker_version 24.0.6
disk_total 114.8 GB
disk_used 32.5 GB
healthy true
supported true
board generic-x86-64
supervisor_api ok
version_api ok
installed_addons Terminal & SSH (9.7.1), Bla (0.0.1), CHIP Tool (0.2.1), deCONZ (6.22.0), Matter Server (local) (3.0.5), Mosquitto broker (6.3.1), Node-RED (14.5.0), Network UPS Tools (0.12.1), EMQX (0.2.10), Silicon Labs Flasher (local) (0.1.1), Silicon Labs Flasher (0.2.0), ESPHome (2023.9.1), File editor (5.6.0), motionEye (0.19.1), Dnsmasq (1.6.0), RPC Shutdown (2.4), NGINX Home Assistant SSL proxy (3.5.0), Let's Encrypt (4.12.9), Studio Code Server (5.10.2), Glances (0.19.3), MariaDB (2.6.1), Whisper (1.0.0), Piper (1.3.2), RaspberryMatic CCU (3.71.12.20230826), MariaDB (local) (2.6.0), Custom deps deployment (1.3.4), Custom deps deployment (local) (1.3.3), CHIP Controller REPL (0.4.0), Uptime Kuma (0.9.0), SQLite Web (3.9.3), chrony (2.6.1), OpenThread Border Router (2.3.2), Matter Server (4.9.0), openWakeWord (1.5.0), Real-Time latency test (0.1.0)
Dashboards
dashboards 2
resources 0
views 1
mode storage
Recorder
oldest_recorder_run September 22, 2023 at 8:27 AM
current_recorder_run September 29, 2023 at 11:37 AM
estimated_db_size 49.08 MiB
database_engine sqlite
database_version 3.41.2

Supervisor diagnostics

No response

Additional information

No response

@agners
Copy link
Member Author

agners commented Sep 29, 2023

The easiest way to reproduce the issue with 2023.10.0b0 is using a Proxmox installation. Use the advanced installation method, and choose KVM64 as the CPU model (not Host). This creates a system without avx/avx2 support, which causes the Core to crash (as noted in home-assistant/core#101027):

[10:43:15] INFO: Home Assistant Core finish process exit code 256
[10:43:15] INFO: Home Assistant Core finish process received signal 4

The Supervisor doesn't seem to detect that crash:

23-09-29 12:24:45 INFO (MainThread) [supervisor.homeassistant.core] Updating Home Assistant to version 2023.10.0b0
23-09-29 12:24:45 INFO (MainThread) [supervisor.docker.interface] Updating image ghcr.io/home-assistant/qemux86-64-homeassistant:2023.9.3 to ghcr.io/home-assistant/qemux86-64-homeassistant:2023.10.0b0
23-09-29 12:24:45 INFO (MainThread) [supervisor.docker.interface] Downloading docker image ghcr.io/home-assistant/qemux86-64-homeassistant with tag 2023.10.0b0.
23-09-29 12:34:58 INFO (SyncWorker_2) [supervisor.docker.manager] Stopping homeassistant application
23-09-29 12:35:07 INFO (SyncWorker_2) [supervisor.docker.manager] Cleaning homeassistant application
23-09-29 12:35:07 INFO (MainThread) [supervisor.homeassistant.module] Update pulse/client.config: /data/tmp/homeassistant_pulse
23-09-29 12:35:08 INFO (MainThread) [supervisor.docker.homeassistant] Starting Home Assistant ghcr.io/home-assistant/qemux86-64-homeassistant with version 2023.10.0b0
23-09-29 12:35:08 INFO (MainThread) [supervisor.homeassistant.core] Wait until Home Assistant is ready
23-09-29 12:35:09 INFO (MainThread) [supervisor.resolution.evaluate] Starting system evaluation with state running
23-09-29 12:35:09 INFO (MainThread) [supervisor.resolution.evaluate] System evaluation complete
23-09-29 12:35:12 INFO (MainThread) [supervisor.resolution.evaluate] Starting system evaluation with state running
23-09-29 12:35:12 INFO (MainThread) [supervisor.resolution.evaluate] System evaluation complete
23-09-29 12:35:15 INFO (MainThread) [supervisor.resolution.evaluate] Starting system evaluation with state running
23-09-29 12:35:15 INFO (MainThread) [supervisor.resolution.evaluate] System evaluation complete
23-09-29 12:35:17 INFO (MainThread) [supervisor.resolution.evaluate] Starting system evaluation with state running
23-09-29 12:35:17 INFO (MainThread) [supervisor.resolution.evaluate] System evaluation complete
23-09-29 12:35:18 ERROR (MainThread) [supervisor.homeassistant.api] Error on call http://172.30.32.1:8123/api/core/state: [Errno 104] Connection reset by peer
23-09-29 12:35:20 INFO (MainThread) [supervisor.resolution.evaluate] Starting system evaluation with state running
23-09-29 12:35:20 INFO (MainThread) [supervisor.resolution.evaluate] System evaluation complete
23-09-29 12:35:23 INFO (MainThread) [supervisor.resolution.evaluate] Starting system evaluation with state running
23-09-29 12:35:23 INFO (MainThread) [supervisor.resolution.evaluate] System evaluation complete
23-09-29 12:35:26 INFO (MainThread) [supervisor.resolution.evaluate] Starting system evaluation with state running
23-09-29 12:35:26 INFO (MainThread) [supervisor.resolution.evaluate] System evaluation complete
23-09-29 12:35:29 INFO (MainThread) [supervisor.resolution.evaluate] Starting system evaluation with state running
23-09-29 12:35:29 INFO (MainThread) [supervisor.resolution.evaluate] System evaluation complete
23-09-29 12:35:29 INFO (MainThread) [supervisor.store.git] Update add-on https://github.com/esphome/home-assistant-addon repository
23-09-29 12:35:29 INFO (MainThread) [supervisor.store.git] Update add-on https://github.com/home-assistant/addons repository
23-09-29 12:35:29 INFO (MainThread) [supervisor.store.git] Update add-on https://github.com/hassio-addons/repository repository
23-09-29 12:35:29 INFO (MainThread) [supervisor.updater] Fetching update data from https://version.home-assistant.io/stable.json
23-09-29 12:35:29 ERROR (MainThread) [supervisor.homeassistant.api] Error on call http://172.30.32.1:8123/api/core/state: [Errno 104] Connection reset by peer

@agners
Copy link
Member Author

agners commented Sep 29, 2023

Further investigation show that the Home Assistant Core container does not stop. The exit code is beeing set correctly (in /run/s6-linux-init-container-results/exitcode), but the container does not shutdown.

Manually calling /run/s6/basedir/bin/halt stops the container and triggers the fallback logic correctly:

23-09-29 12:48:08 WARNING (MainThread) [supervisor.homeassistant.core] Watchdog found Home Assistant failed, restarting...
23-09-29 12:48:08 ERROR (MainThread) [supervisor.homeassistant.core] Watchdog restart of Home Assistant failed!
23-09-29 12:48:10 ERROR (MainThread) [supervisor.homeassistant.core] Home Assistant has crashed!
23-09-29 12:48:10 CRITICAL (MainThread) [supervisor.homeassistant.core] HomeAssistant update failed -> rollback!
23-09-29 12:48:10 INFO (MainThread) [supervisor.resolution.module] Create new issue update_rollback - core / None
23-09-29 12:48:10 INFO (MainThread) [supervisor.homeassistant.core] A backup of the logfile is stored in /config/home-assistant-rollback.log
23-09-29 12:48:10 INFO (MainThread) [supervisor.homeassistant.core] Updating Home Assistant to version 2023.9.3
23-09-29 12:48:10 INFO (MainThread) [supervisor.docker.interface] Updating image ghcr.io/home-assistant/qemux86-64-homeassistant:2023.10.0b0 to ghcr.io/home-assistant/qemux86-64-homeassistant:2023.9.3
23-09-29 12:48:10 INFO (MainThread) [supervisor.docker.interface] Downloading docker image ghcr.io/home-assistant/qemux86-64-homeassistant with tag 2023.9.3.
23-09-29 12:48:11 INFO (SyncWorker_4) [supervisor.docker.manager] Cleaning homeassistant application
23-09-29 12:48:11 INFO (MainThread) [supervisor.homeassistant.module] Update pulse/client.config: /data/tmp/homeassistant_pulse
23-09-29 12:48:11 INFO (MainThread) [supervisor.docker.homeassistant] Starting Home Assistant ghcr.io/home-assistant/qemux86-64-homeassistant with version 2023.9.3
23-09-29 12:48:11 INFO (MainThread) [supervisor.homeassistant.core] Wait until Home Assistant is ready

After that I was able to reach Home Assistant Core correctly.

agners added a commit to home-assistant/core that referenced this issue Sep 29, 2023
When Home Assistant Core crashes (e.g. due to SIGILL, signal number 4)
as caused by #101027) the container should exit. This makes sure that
the exit code can be read by container orchistration software such as
the Supervisor and act upon.

With this, the Supervisor properly rolls back Home Assistant Core to the
previously installed version.

See also home-assistant/supervisor#4590 (comment).
agners added a commit to home-assistant/core that referenced this issue Sep 29, 2023
When Home Assistant Core crashes (e.g. due to SIGILL, signal number 4
as caused by #101027) the container should exit. This makes sure that
the exit code can be read by container orchistration software such as
the Supervisor and act upon.

With this, the Supervisor properly rolls back Home Assistant Core to the
previously installed version.

See also home-assistant/supervisor#4590 (comment).
@agners
Copy link
Member Author

agners commented Sep 29, 2023

Turns out this is an issue in the Core finish script. I've created a PR with home-assistant/core#101105.

However, this does not address the freeze issue (which can be observed with 2023.9.0.dev20230816). The question is how/if we should handle this case. Based on lack of API response we could just rollback, but we probably should be careful to not do that prematurely.

@agners
Copy link
Member Author

agners commented Oct 4, 2023

The freeze issue is addressed with #4595. With this PR, updating to the above mentioned version leads to:

23-10-05 00:01:12 INFO (MainThread) [supervisor.docker.homeassistant] Starting Home Assistant ghcr.io/home-assistant/generic-x86-64-homeassistant with version 2023.9.0.dev20230816
23-10-05 00:01:12 INFO (MainThread) [supervisor.homeassistant.core] Wait until Home Assistant is ready
23-10-05 00:06:13 ERROR (MainThread) [supervisor.homeassistant.core] No API response in 5 minutes, assuming core has had a fatal startup error
23-10-05 00:06:13 CRITICAL (MainThread) [supervisor.homeassistant.core] HomeAssistant update failed -> rollback!                              
23-10-05 00:06:13 INFO (MainThread) [supervisor.resolution.module] Create new issue update_rollback - core / None                             

@agners agners closed this as completed Oct 4, 2023
@caraar12345
Copy link

caraar12345 commented Oct 30, 2023

👋 Hi! Looks like some people have had some trouble with upgrading because of #4595; I know my Home Assistant installation can take well over 5 minutes to start due to some slow starting integrations.

Screenshot of Integration startup time
showing MQTT takes 187 seconds, Unifi Network takes 151 seconds etc

In my case, I currently can't upgrade to 2023.11.0b2 from 2023.11.0b1 because Supervisor rolls back the upgrade too quickly!

Is there a way to manually choose to disable the 5 minute timeout?

@agners
Copy link
Member Author

agners commented Oct 30, 2023

@caraar12345 thanks for bringing this up. It seems that the rollback is misstriggering here indeed!

Is there a way to manually choose to disable the 5 minute timeout?

No, this is a hardcoded timeout currently. However, by default, the Core itself should timeout as well if stage 1 doesn't complete within 2 minutes (see https://github.com/home-assistant/core/blob/dev/homeassistant/bootstrap.py#L65). The frontend integration, which provides the API the Supervisor is checking is part of stage 1. So I am not entirly sure what is happening here.

Can you open a new issue along with the Supervisor and Core logs after an update attempt? Thanks a lot!

@caraar12345
Copy link

That's really interesting context, thanks!

Absolutely will raise another issue 👍

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

2 participants