
HAOS Matter device pings fail and devices become unavailable after 1 of 2 Apple TVs is powered off #2823

Closed
Coder84619 opened this issue Oct 14, 2023 · 22 comments

@Coder84619

The problem

When I power down one of the two Apple TVs on my network, HAOS takes a long time (15-30+ minutes) to fail over. 1-2 minutes after the ATV is powered down, the Matter container routing information is updated. The route goes from REACHABLE to DELAY to FAILED within 2 minutes of the power-down event.

However, I am unable to successfully ping a Thread device global IPv6 address from within the Matter docker container after the route to the dead ATV shows FAILED. Pinging the same Thread IPv6 address from my Mac works, even when I'm experiencing simultaneous timeouts from the Matter container.

1-2 minutes after powering the second ATV back on, pings magically start working again from within the Matter container, HA starts the re-subscription process, and devices start to come back online.

While some Matter disruption for 1-2 minutes after the power-off event would be expected, the extended period of 15+ minutes of ping failures while pings keep working from my Mac seems like a bug. Devices start becoming unavailable in the HA UI about 3-5 minutes after ATV #2 is unplugged.
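
For reference, a rough sketch of the in-container checks described above (the add-on container name and the Thread address are placeholders/assumptions for my install, and this assumes ping and iproute2 are available inside the container):

docker exec -it addon_core_matter_server sh

# inside the container: border router neighbour state (REACHABLE -> DELAY -> FAILED)
ip -6 neigh show

# ping a Thread device's global IPv6 address (placeholder address)
ping -c 4 2600:xxxx:xxxx:xxxx::1234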

Configuration:
- Two Thread-enabled Apple TV 4Ks (Ethernet connected) with tvOS 17.
- No SkyConnect, no other border routers.
- 17 Thread devices: 11 Eve, 6 Smartwings. No Nanoleaf.
- HAOS VM on Proxmox 8. No /etc/sysctl.conf changes. Using the standard Linux bridge.
- HA OS version 11.0.
- Core network switch: QNAP QSW-M2116P-2T2S with v2 firmware. IGMP snooping disabled.
- Everything on a single VLAN.
- No mDNS reflectors anywhere.

ATV #2 was unplugged at 12:00 PM local time. ATV #2 was plugged back in around 12:12 PM. Here's a series of screenshots:

[Screenshots: 2023-10-14 12:01, 12:02, 12:08 (x2), 12:09, 12:26, and 13:01]

What version of Home Assistant Core has the issue?

2023.10.3

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant OS

Integration causing the issue

Matter

Link to integration documentation on our website

https://www.home-assistant.io/integrations/matter

Diagnostics information

config_entry-thread-03852ff08174fec48a1ea63cc94bcb31.json (7).txt
config_entry-matter-753c6020af063292f3b8285bd26ce445.json.txt
core_matter_server_2023-10-14T19-10-12.950Z.log

Example YAML snippet

No response

Anything in the logs that might be useful for us?

No response

Additional information

No response

@home-assistant

Hey there @home-assistant/matter, mind taking a look at this issue as it has been labeled with an integration (matter) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of matter can trigger bot actions by commenting:

  • @home-assistant close Closes the issue.
  • @home-assistant rename Awesome new title Renames the issue.
  • @home-assistant reopen Reopen the issue.
  • @home-assistant unassign matter Removes the current integration label and assignees on the issue, add the integration domain after the command.

(message by CodeOwnersMention)


matter documentation
matter source
(message by IssueLinks)

@Coder84619
Author

Coder84619 commented Oct 14, 2023

Update: the 2600::14ba address is one of the IPv6 addresses assigned to the HAOS itself. Here's a paste from the HAOS Network page:

Detected: enp0s18 (10.13.2.203/24, 2600:xxxxxxx:14ba/64, fdac:6929:76fb:479e:a9ae:6e23:82cb:2199/64, fe80::73d7:3a86:879e:c772/64)

@marcelveldt
Member

Thanks for the extensive testing and the detailed report, this will be really helpful in the search for a cause/solution.
Just as a check, I assume you opened a docker shell into the running Matter container, right?
So the pings are from the Matter container?

@Coder84619
Author

@marcelveldt Yes, I captured all of the prompts, which show the core-matter-server description.

@frenck frenck transferred this issue from home-assistant/core Oct 15, 2023
@agners
Member

agners commented Oct 15, 2023

Seems like routes don't get updated correctly, again 😕. Can you reproduce this with HAOS 10.5? It would also be interesting to know whether this is reproducible on a bare-metal installation (without Proxmox).

@agners
Member

agners commented Oct 15, 2023

Note that we carry a patch which should address a very similar problem (see #2434 and #2333 (reply in thread)).

I'll have a closer look next week.

@Jc2k
Member

Jc2k commented Oct 15, 2023

Asked the OP on Discord to try setting the forwarding sysctl to 0, to see if it is truly a variant of the same problem.

@Coder84619
Author

This is my production HA instance, so I'm not keen on messing with the config or experimenting with sysctl IPv6 settings.

@Jc2k
Member

Jc2k commented Oct 15, 2023

If it's a VM... could you spin up a test one? It doesn't need to have the Matter devices added to it to test IPv6 connectivity.

@agners
Member

agners commented Oct 16, 2023

This is my production HA instance, so I'm not keen on messing with the config or experimenting with sysctl IPv6 settings.

sysctl settings (set using the command) are volatile, so whatever happens, the default settings are applied again after a reboot. Testing with sysctl -w net.ipv6.conf.all.forwarding=0 is therefore really safe.
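
A minimal sequence for reference (standard Linux commands; nothing here persists across a reboot):

# check the current value (1 = forwarding enabled, 0 = disabled)
cat /proc/sys/net/ipv6/conf/all/forwarding

# temporarily disable IPv6 forwarding (volatile)
sysctl -w net.ipv6.conf.all.forwarding=0

# verify
sysctl net.ipv6.conf.all.forwarding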

As for downgrading the OS: this is fairly safe too, as we use an A/B boot slot system. Worst case, you can boot the previous slot again (by selecting the other slot in the GRUB bootloader). To be super safe, you can also take a snapshot before executing the downgrade command.

@agners
Member

agners commented Oct 16, 2023

I've looked at the outputs a bit more closely. To me it seems that the router's state is correctly detected.

From the ping's response (Address unreachable), it sounds as if no route is present. Can you execute the following command before and after unplugging one of the BRs?

ip -6 route

You can run the command in the OS shell or in the Matter container context; it shouldn't matter (since the Matter container runs on the host network).
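
Something along these lines (the Thread device address below is just a placeholder):

# full IPv6 routing table, before and after unplugging the border router
ip -6 route

# which route the kernel actually picks for a given Thread device
ip -6 route get 2600:xxxx:xxxx:xxxx::1234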

@marcelveldt
Member

This morning I tried to reproduce your issue on my end.

To test the border router route switch, I've just done several tests where I unplugged border routers and plugged them back in, especially the border routers surrounding an Eve plug in my office. Each time the connection was restored within seconds or minutes. So, at least on my end, the route election process works smoothly and exactly how you'd expect it to work.

The current theory is that Proxmox somehow doesn't forward some of the IPv6 ND messages, but that is just a wild guess at this point. To confirm, we'll have to re-do the test with Proxmox involved.

@Jc2k
Member

Jc2k commented Oct 16, 2023

That's a great theory. I recall @agners and me discussing a case where VMware defaults seemed to break "normal" IPv6 - possibly promiscuous mode?

That mode would make your layer 2 act as if it were connected with old-fashioned hubs rather than switches, so it's not ideal long term, but maybe this would be a good thing to try on Proxmox?
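
If someone wants to experiment, it's a one-liner on the Proxmox host (vmbr0 is just the usual default bridge name, adjust to your setup; revert with "promisc off"):

ip link set dev vmbr0 promisc on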

@Coder84619
Author

@agners
[Screenshots: 2023-10-13 20:31, 20:52, 21:14]

@Coder84619
Author

I did another test run tonight bypassing Proxmox networking: I connected a USB dongle in passthrough mode to the HAOS VM, disabled the Proxmox NIC, and rebooted HAOS. Same results. 2600::9970 is the address of the HAOS VM. This ping is 5 minutes after ATV #2 was powered off.

[Screenshot: 2023-10-18 19:01]

@Coder84619
Author

I also stood up a fresh HAOS 11 VM on Proxmox and installed Matter (no devices), and had the same bad results.

@agners
Member

agners commented Oct 19, 2023

(Quoting the screenshots above: 2023-10-13 20:31, 20:52, 21:14)

What this shows is that the route to d53f times out after 30 minutes. This is expected behavior: IPv6 Neighbor Discovery announces the routes with a lifetime of 1800s (30min).

However, the kernel should not select this route.

Can you check whether the kernel is really selecting the wrong route here, using ip route get <target device>?

What is a bit puzzling to me is that the ICMP requests get answers from ULA and global addresses. Do you know which device is answering here?
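
For reference, the remaining lifetime of the RA-learned routes can be seen directly in the routing table ("expires ... sec"):

ip -6 route show proto ra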

@agners
Member

agners commented Oct 19, 2023

You have two routers: One ending with d249, one ending with d53f.

It seems the device you unplug is the one ending with d53f (you refer to it as ATV 2). From the first screenshot of your initial post, it seems that Linux correctly detects that (your output shows FAILED in the line for that device's route).

(First screenshot of the initial post, 2023-10-14 12:01)

So the kernel really should select the other router d249 here. But that doesn't seem to be happening. 🤔

@agners
Member

agners commented Oct 19, 2023

So, I've set up two Apple BRs (an Apple TV 4K with Ethernet and a HomePod) in my home, and it turns out I can reproduce the issue! The route didn't update/fail over properly for me either 😰

Investigating further shows that the patch we added in #2434 is actually broken: the IS_ENABLED preprocessor macro requires config symbols with the CONFIG_ prefix, so without it the check evaluates to 0 and the patched code path is never enabled!

I've created a PR to address this issue: #2845.

Thanks a lot for investigating and raising this issue! 🙏

Sidenote: in one of my tests (when I unplugged the Apple TV) the Thread network changed IP addresses. Those were properly announced via mDNS, but I didn't notice at first and kept trying to ping the old address. I am not sure exactly why Apple/Thread does this.

@Coder84619
Author

@agners Glad I'm not crazy!

@agners
Member

agners commented Oct 20, 2023

Retested with 11.1.dev20231019; this looks good now. Will be resolved in 11.1.

@agners agners closed this as completed Oct 20, 2023
@Coder84619
Author

@agners I tested the Dev build as well, and failover is nearly instant. Only a single lost ping.
