HAOS Matter device pings fail and device unavailability after 1 of 2 Apple TVs is powered off #2823
Comments
Hey there @home-assistant/matter, mind taking a look at this issue as it has been labeled with an integration (matter)?
(message by CodeOwnersMention) |
Update: the 2600::14ba address is one of the IPv6 addresses assigned to the HAOS itself. Here's a paste from the HAOS Network page: Detected: enp0s18 (10.13.2.203/24, 2600:xxxxxxx:14ba/64, fdac:6929:76fb:479e:a9ae:6e23:82cb:2199/64, fe80::73d7:3a86:879e:c772/64) |
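Since the interface carries several IPv6 addresses of different scopes, it can help to label each one before reading ping or route output. The following is a minimal sketch using Python's standard `ipaddress` module; the `classify` helper is a hypothetical name, and `2600::14ba` is used exactly as abbreviated in this thread, not as the real full address.

```python
import ipaddress

# Well-known IPv6 prefixes (RFC 4291 for link-local and global unicast, RFC 4193 for ULA).
LINK_LOCAL = ipaddress.ip_network("fe80::/10")
UNIQUE_LOCAL = ipaddress.ip_network("fc00::/7")
GLOBAL_UNICAST = ipaddress.ip_network("2000::/3")


def classify(address: str) -> str:
    """Return a coarse scope label for an address, ignoring any /prefix suffix."""
    addr = ipaddress.ip_address(address.split("/")[0])
    if addr in LINK_LOCAL:
        return "link-local"
    if addr in UNIQUE_LOCAL:
        return "unique-local (ULA)"
    if addr in GLOBAL_UNICAST:
        return "global"
    return "other"  # includes IPv4 addresses such as 10.13.2.203


for a in (
    "fdac:6929:76fb:479e:a9ae:6e23:82cb:2199/64",
    "fe80::73d7:3a86:879e:c772/64",
    "2600::14ba",  # abbreviated form as written in this thread
):
    print(a, "->", classify(a))
```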
Thanks for the extensive testing and detailed report, this will be really helpful in the quest to look for cause/solution. |
@marcelveldt Yes I captured all of the prompts which have the core-matter-server description. |
Seems like routes don't get updated correctly, again 😕. Can you reproduce this with HAOS 10.5? Also would be interesting if that is reproducible on a bare metal installation (without Proxmox). |
Note that we carry a patch which should address a very similar problem (see #2434 and #2333 (reply in thread)). I'll have a closer look next week. |
Asked OP on discord to try setting the forwarding sysctl to 0 to see if it was truly a variant of the same problem. |
This is my production HA instance, so not keen on messing with the config and trying to mess with sysctl ipv6 settings. |
If it's a VM... could you spin up a test one? Doesn't need to have the matter devices added to it to test ipv6 connectivity. |
sysctl settings (using the command) are volatile. So whatever happens, after reboot the default settings are applied again. So testing with As for downgrading the OS: This is fairly safe too, as we use an A/B boot slot system. Worst case you can boot the previous slot again (by selecting the other slot in the GRUB bootloader). To be super safe, you can also take a snapshot before executing the downgrade command. |
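For anyone hesitant to change anything on a production instance, the current values can be inspected read-only first. This is a minimal sketch, assuming a Linux host where per-interface IPv6 settings are exposed under `/proc/sys/net/ipv6/conf/<iface>/`; the helper name `read_ipv6_sysctls` is hypothetical, and `enp0s18` is the interface name from the Network page quoted earlier in this thread.

```python
from pathlib import Path


def read_ipv6_sysctls(iface: str, root: str = "/proc/sys/net/ipv6/conf") -> dict:
    """Read a few per-interface IPv6 sysctls without modifying them.

    Files that are missing or unreadable map to None.
    """
    values = {}
    for name in ("forwarding", "accept_ra"):
        path = Path(root) / iface / name
        try:
            values[name] = int(path.read_text().strip())
        except (OSError, ValueError):
            values[name] = None
    return values


if __name__ == "__main__":
    print(read_ipv6_sysctls("enp0s18"))
```

Recording the values before and after a test makes it easy to confirm that a reboot restored the defaults.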
I've looked at the outputs a bit more closely. To me it seems that the router's state is correctly detected. From the ping's response (Address unreachable) it sounds as if no route is present. Can you execute the following command before unplugging and after unplugging one of the BRs?
You can run the command on the OS shell or in the Matter container context, it shouldn't matter (since the Matter container runs on host network). |
This morning I tried to reproduce your issue on my end. To test the border router route switch, I just ran several tests where I unplugged several border routers and plugged them back in, especially the border routers surrounding an Eve plug in my office. Each time the connection was restored within seconds or minutes. So this means that, at least on my end, the route election process works smoothly and exactly how you would expect it to work. Current theory is that it could be related to Proxmox somehow not forwarding some of the IPv6 ND messages, but that is just a wild guess at this point. To confirm we'll have to redo the test with Proxmox involved. |
That's a great theory. I recall @agners and I discussing a case where VMware defaults seemed to break "normal" ipv6 - possibly promiscuous mode? That mode would make your layer 2 act like it was connected with old fashioned hubs rather than switches so is not ideal long term, but maybe this would be a good thing to try on proxmox? |
I did another test run tonight by bypassing Proxmox. I connected a USB dongle in passthrough mode to the HAOS VM, disabled the proxmox NIC and rebooted HAOS. Same results. 2600::9970 is the address of the HAOS VM. This ping is 5 minutes after ATV #2 was powered off. |
I also stood up a fresh HAOS 11 VM, installed Matter (no devices) on Proxmox and had the same bad results. |
What this shows is that the route to However, the kernel should not select this route. Can you check if the kernel is really selecting the wrong route here using What is a bit puzzling to me is that the ICMP requests get answers from ULA and global addresses. Do you know which device is answering here? |
You have two routers: One ending with It seems the device you plug out is ending with So the kernel really should select the other router |
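The expected behaviour can be illustrated with a small sketch. It is not the kernel's implementation; it only models the preference described in RFC 4861 section 6.3.6, where routers that are reachable or probably reachable are preferred over routers whose reachability is unknown or suspect. The function name and the `fe80::1` / `fe80::2` addresses are hypothetical placeholders, since the real router addresses are not reproduced here.

```python
from typing import Dict, List, Optional

# States in which a router should not be preferred. INCOMPLETE comes from
# RFC 4861; FAILED is the state shown in the Linux neighbor table.
SUSPECT_STATES = {"INCOMPLETE", "FAILED"}


def pick_next_hop(candidates: List[str], neigh_state: Dict[str, str]) -> Optional[str]:
    """Return the first candidate router with a usable neighbor entry.

    Routers that are missing from the neighbor table count as unknown and
    are only used as a last resort.
    """
    for router in candidates:
        state = neigh_state.get(router)
        if state is not None and state not in SUSPECT_STATES:
            return router
    return candidates[0] if candidates else None


# The unplugged router is listed first but is FAILED, so the other one wins.
print(pick_next_hop(["fe80::2", "fe80::1"],
                    {"fe80::2": "FAILED", "fe80::1": "REACHABLE"}))
```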
So, I've set up two Apple BRs (Apple TV 4K with Ethernet and a HomePod) in my home, and it turns out, I can reproduce the issue! The route didn't update/failover properly for me either 😰 Investigating more shows that the patch we've added in #2434 is actually broken: The I've created a PR to address this issue #2845. Thanks a lot for investigating and raising this issue! 🙏 Sidenote: In one of my tests (when I unplugged the Apple TV) the Thread network changed IP addresses. Those were properly announced via mDNS, but I didn't notice at first, and I still tried to ping the old address. I am not sure why Apple/Thread is doing this exactly. |
@agners Glad I'm not crazy! |
Retested with |
@agners I tested the Dev build as well, and failover is nearly instant. Only a single lost ping. |
The problem
When I power down one of the two Apple TVs on my network, HAOS takes a long time (15-30+ minutes) to fail over. 1-2 minutes after the ATV is powered down the Matter container routing information is updated. The route goes from REACHABLE to DELAY to FAILED within 2 minutes of the power down event.
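To track these state transitions over time, the neighbor table can be parsed rather than read from screenshots. The sketch below assumes text in the style printed by `ip -6 neigh show`, where the address is the first token and the state is the last; the `SAMPLE` lines are made up for illustration and are not output captured from this system.

```python
def parse_neighbors(text: str) -> dict:
    """Map neighbor address -> (state, is_router) from `ip -6 neigh show` style lines."""
    result = {}
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) < 2:
            continue  # skip blank or malformed lines
        result[tokens[0]] = (tokens[-1], "router" in tokens)
    return result


# Illustrative input only: one healthy router and one that has gone away.
SAMPLE = """\
fe80::1 dev enp0s18 lladdr aa:bb:cc:dd:ee:01 router REACHABLE
fe80::2 dev enp0s18 router FAILED
"""

failed_routers = [
    addr
    for addr, (state, is_router) in parse_neighbors(SAMPLE).items()
    if is_router and state == "FAILED"
]
print(failed_routers)
```

Running this periodically and logging the result with a timestamp would show how long after the power-down each router entry leaves REACHABLE.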
However, I am unable to successfully ping a Thread device global IPv6 address from within the Matter docker container after the route to the dead ATV shows FAILED. Pinging the same Thread IPv6 address from my Mac works, even when I'm experiencing simultaneous timeouts from the Matter container.
1-2 minutes after powering on the second ATV, pings magically start working from within the Matter container and HA starts the re-subscription process and devices start to come back online.
While some Matter disruption for 1-2 minutes after the power off event would be expected, the extended period of 15+ minutes of ping failures while they are working on my Mac seems like a bug. Devices start becoming unavailable in the HA UI about 3-5 minutes after ATV #2 is unplugged.
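To compare failover times between runs (for example the 15+ minutes described here against a later retest), the outage can be computed from timestamped ping results instead of estimated by eye. This is a minimal sketch; the `(timestamp_seconds, success)` sample format and the `longest_outage` name are assumptions for illustration, not something produced by `ping` directly.

```python
def longest_outage(samples):
    """Return the longest span in seconds from a first failed ping to the next success.

    `samples` is an ordered list of (timestamp_seconds, success) tuples.
    An outage still in progress at the end is measured up to the last sample.
    """
    longest = 0.0
    outage_start = None
    for ts, ok in samples:
        if not ok and outage_start is None:
            outage_start = ts  # first failure of a new outage
        elif ok and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None
    if outage_start is not None and samples:
        longest = max(longest, samples[-1][0] - outage_start)
    return longest


# One ping per second, with a single lost ping at t=1.
print(longest_outage([(0, True), (1, False), (2, True)]))
```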
Configuration:
-Two Thread enabled Apple TV 4Ks (Ethernet connected) with tvOS 17.
-No SkyConnect, no other border routers.
-17 Thread devices: 11 Eve, 6 Smartwings. No Nanoleaf.
-HAOS VM on Proxmox 8. No /etc/sysctl.conf changes. Using standard Linux bridge.
-HA OS Version 11.0
-Core network switch: QNAP QSW-M2116P-2T2S with v2 firmware. IGMP snooping disabled.
-Everything on single VLAN
-No mDNS reflectors anywhere
ATV #2 was unplugged at 12:00PM local time. ATV #2 was plugged back in around 12:12PM. Here's a series of screenshots:
What version of Home Assistant Core has the issue?
2023.10.3
What was the last working version of Home Assistant Core?
No response
What type of installation are you running?
Home Assistant OS
Integration causing the issue
Matter
Link to integration documentation on our website
https://www.home-assistant.io/integrations/matter
Diagnostics information
config_entry-thread-03852ff08174fec48a1ea63cc94bcb31.json (7).txt
config_entry-matter-753c6020af063292f3b8285bd26ce445.json.txt
core_matter_server_2023-10-14T19-10-12.950Z.log
Example YAML snippet
No response
Anything in the logs that might be useful for us?
No response
Additional information
No response