
HAOS Matter device pings fail and devices become unavailable after 1 of 2 Apple TVs is powered off #2823

Closed
Coder84619 opened this issue Oct 14, 2023 · 22 comments

@Coder84619

The problem

When I power down one of the two Apple TVs on my network, HAOS takes a long time (15-30+ minutes) to fail over. 1-2 minutes after the ATV is powered down, the Matter container routing information is updated. The route goes from REACHABLE to DELAY to FAILED within 2 minutes of the power-down event.

However, I am unable to successfully ping a Thread device global IPv6 address from within the Matter docker container after the route to the dead ATV shows FAILED. Pinging the same Thread IPv6 address from my Mac works, even when I'm experiencing simultaneous timeouts from the Matter container.

1-2 minutes after powering the second ATV back on, pings magically start working again from within the Matter container, HA starts the re-subscription process, and devices start to come back online.

While some Matter disruption for 1-2 minutes after the power-off event would be expected, the extended period of 15+ minutes of ping failures while pings keep working from my Mac seems like a bug. Devices start becoming unavailable in the HA UI about 3-5 minutes after ATV #2 is unplugged.
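
For reference, a rough sketch of the in-container checks described above (the add-on container name and the Thread address are placeholders/assumptions for my install, and this assumes ping and iproute2 are available inside the container):

docker exec -it addon_core_matter_server sh

# inside the container: border router neighbour state (REACHABLE -> DELAY -> FAILED)
ip -6 neigh show

# ping a Thread device's global IPv6 address (placeholder address)
ping -c 4 2600:xxxx:xxxx:xxxx::1234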

Configuration:
- Two Thread-enabled Apple TV 4Ks (Ethernet connected) with tvOS 17.
- No SkyConnect, no other border routers.
- 17 Thread devices: 11 Eve, 6 Smartwings. No Nanoleaf.
- HAOS VM on Proxmox 8. No /etc/sysctl.conf changes. Using the standard Linux bridge.
- HA OS version 11.0.
- Core network switch: QNAP QSW-M2116P-2T2S with v2 firmware. IGMP snooping disabled.
- Everything on a single VLAN.
- No mDNS reflectors anywhere.

ATV #2 was unplugged at 12:00 PM local time. ATV #2 was plugged back in around 12:12 PM. Here's a series of screenshots:

[Screenshots: 2023-10-14 12:01, 12:02, 12:08 (x2), 12:09, 12:26, and 13:01]

What version of Home Assistant Core has the issue?

2023.10.3

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant OS

Integration causing the issue

Matter

Link to integration documentation on our website

https://www.home-assistant.io/integrations/matter

Diagnostics information

config_entry-thread-03852ff08174fec48a1ea63cc94bcb31.json (7).txt
config_entry-matter-753c6020af063292f3b8285bd26ce445.json.txt
core_matter_server_2023-10-14T19-10-12.950Z.log

Example YAML snippet

No response

Anything in the logs that might be useful for us?

No response

Additional information

No response

@home-assistant

Hey there @home-assistant/matter, mind taking a look at this issue as it has been labeled with an integration (matter) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of matter can trigger bot actions by commenting:

  • @home-assistant close Closes the issue.
  • @home-assistant rename Awesome new title Renames the issue.
  • @home-assistant reopen Reopen the issue.
  • @home-assistant unassign matter Removes the current integration label and assignees on the issue, add the integration domain after the command.

(message by CodeOwnersMention)


matter documentation
matter source
(message by IssueLinks)

@Coder84619
Author

Coder84619 commented Oct 14, 2023

Update: the 2600::14ba address is one of the IPv6 addresses assigned to the HAOS itself. Here's a paste from the HAOS Network page:

Detected: enp0s18 (10.13.2.203/24, 2600:xxxxxxx:14ba/64, fdac:6929:76fb:479e:a9ae:6e23:82cb:2199/64, fe80::73d7:3a86:879e:c772/64)

@marcelveldt
Member

Thanks for the extensive testing and the detailed report, this will be really helpful in the search for a cause/solution.
Just as a check, I assume you opened a docker shell into the running Matter container, right?
So the pings are from the Matter container?

@Coder84619
Author

@marcelveldt Yes, I captured all of the prompts, which show the core-matter-server description.

@frenck frenck transferred this issue from home-assistant/core Oct 15, 2023
@agners
Member

agners commented Oct 15, 2023

Seems like routes don't get updated correctly, again 😕. Can you reproduce this with HAOS 10.5? It would also be interesting to know whether this is reproducible on a bare-metal installation (without Proxmox).

@agners
Member

agners commented Oct 15, 2023

Note that we carry a patch which should address a very similar problem (see #2434 and #2333 (reply in thread)).

I'll have a closer look next week.

@Jc2k
Member

Jc2k commented Oct 15, 2023

Asked the OP on Discord to try setting the forwarding sysctl to 0, to see if it is truly a variant of the same problem.

@Coder84619
Author

This is my production HA instance, so I'm not keen on messing with the config or experimenting with sysctl IPv6 settings.

@Jc2k
Member

Jc2k commented Oct 15, 2023

If it's a VM... could you spin up a test one? It doesn't need to have the Matter devices added to it to test IPv6 connectivity.

@agners
Member

agners commented Oct 16, 2023

This is my production HA instance, so I'm not keen on messing with the config or experimenting with sysctl IPv6 settings.

sysctl settings (set using the command) are volatile, so whatever happens, the default settings are applied again after a reboot. Testing with sysctl -w net.ipv6.conf.all.forwarding=0 is therefore really safe.
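
A minimal sequence for reference (standard Linux commands; nothing here persists across a reboot):

# check the current value (1 = forwarding enabled, 0 = disabled)
cat /proc/sys/net/ipv6/conf/all/forwarding

# temporarily disable IPv6 forwarding (volatile)
sysctl -w net.ipv6.conf.all.forwarding=0

# verify
sysctl net.ipv6.conf.all.forwarding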

As for downgrading the OS: this is fairly safe too, as we use an A/B boot slot system. Worst case, you can boot the previous slot again (by selecting the other slot in the GRUB bootloader). To be super safe, you can also take a snapshot before executing the downgrade command.

@agners
Member

agners commented Oct 16, 2023

I've looked at the outputs a bit more closely. To me it seems that the router's state is correctly detected.

From the ping's response (Address unreachable), it sounds as if no route is present. Can you execute the following command before and after unplugging one of the BRs?

ip -6 route

You can run the command in the OS shell or in the Matter container context; it shouldn't matter (since the Matter container runs on the host network).
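
Something along these lines (the Thread device address below is just a placeholder):

# full IPv6 routing table, before and after unplugging the border router
ip -6 route

# which route the kernel actually picks for a given Thread device
ip -6 route get 2600:xxxx:xxxx:xxxx::1234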

@marcelveldt
Member

This morning I tried to reproduce your issue on my end.

To test the border router route switch, I've just done several tests where I unplugged border routers and plugged them back in, especially the border routers surrounding an Eve plug in my office. Each time the connection was restored within seconds or minutes. So, at least on my end, the route election process works smoothly and exactly how you'd expect it to work.

The current theory is that Proxmox somehow doesn't forward some of the IPv6 ND messages, but that is just a wild guess at this point. To confirm, we'll have to re-do the test with Proxmox involved.

@Jc2k
Member

Jc2k commented Oct 16, 2023

That's a great theory. I recall @agners and me discussing a case where VMware defaults seemed to break "normal" IPv6 - possibly promiscuous mode?

That mode would make your layer 2 act as if it were connected with old-fashioned hubs rather than switches, so it's not ideal long term, but maybe this would be a good thing to try on Proxmox?
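
If someone wants to experiment, it's a one-liner on the Proxmox host (vmbr0 is just the usual default bridge name, adjust to your setup; revert with "promisc off"):

ip link set dev vmbr0 promisc on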

@Coder84619
Author

@agners
[Screenshots: 2023-10-13 20:31, 20:52, 21:14]

@Coder84619
Author

I did another test run tonight bypassing Proxmox networking: I connected a USB dongle in passthrough mode to the HAOS VM, disabled the Proxmox NIC, and rebooted HAOS. Same results. 2600::9970 is the address of the HAOS VM. This ping is 5 minutes after ATV #2 was powered off.

[Screenshot: 2023-10-18 19:01]

@Coder84619
Author

I also stood up a fresh HAOS 11 VM on Proxmox and installed Matter (no devices), and had the same bad results.

@agners
Member

agners commented Oct 19, 2023

(Quoting the screenshots above: 2023-10-13 20:31, 20:52, 21:14)

What this shows is that the route to d53f times out after 30 minutes. This is expected behavior: IPv6 Neighbor Discovery announces the routes with a lifetime of 1800s (30min).

However, the kernel should not select this route.

Can you check whether the kernel is really selecting the wrong route here, using ip route get <target device>?

What is a bit puzzling to me is that the ICMP requests get answers from ULA and global addresses. Do you know which device is answering here?
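
For reference, the remaining lifetime of the RA-learned routes can be seen directly in the routing table ("expires ... sec"):

ip -6 route show proto ra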

@agners
Member

agners commented Oct 19, 2023

You have two routers: One ending with d249, one ending with d53f.

It seems the device you unplug is the one ending with d53f (you refer to it as ATV 2). From the first screenshot of your initial post, it seems that Linux correctly detects that (your output shows FAILED in the line for that device's route).

(First screenshot of the initial post, 2023-10-14 12:01)

So the kernel really should select the other router d249 here. But that doesn't seem to be happening. 🤔

@agners
Member

agners commented Oct 19, 2023

So, I've set up two Apple BRs (an Apple TV 4K with Ethernet and a HomePod) in my home, and it turns out I can reproduce the issue! The route didn't update/fail over properly for me either 😰

Investigating further shows that the patch we added in #2434 is actually broken: the IS_ENABLED preprocessor macro requires config symbols with the CONFIG_ prefix, so without it the check evaluates to 0 and the patched code path is never enabled!

I've created a PR to address this issue: #2845.

Thanks a lot for investigating and raising this issue! 🙏

Sidenote: in one of my tests (when I unplugged the Apple TV) the Thread network changed IP addresses. Those were properly announced via mDNS, but I didn't notice at first and kept trying to ping the old address. I am not sure exactly why Apple/Thread does this.

@Coder84619
Author

@agners Glad I'm not crazy!

@agners
Member

agners commented Oct 20, 2023

Retested with 11.1.dev20231019; this looks good now. Will be resolved in 11.1.

@agners agners closed this as completed Oct 20, 2023
@Coder84619
Author

@agners I tested the Dev build as well, and failover is nearly instant. Only a single lost ping.
