Lost routing table once in a while after using mirrored networking #10588
Comments
Hi there. Can you get us a trace to see what happened? Thanks! |
How can I run it? Is there a 'readme'? |
Should I wait for it to happen again and then run the trace, or should I start the trace beforehand and wait for it to happen? |
Sorry, that file (*.wprp) is used with WPR.exe. But I think there's a better path to take, since the repro might take a long time. I tested this, and it will capture only the WSL-related traces to keep the file size to a minimum.
With no WSL instance running (wsl --shutdown), please run the following from an admin cmd shell:
logman start wsl_trace -p {b99cdb5a-039c-5046-e672-1a0de0a40211} -o wsl_trace.etl -ets
After you see the repro, please run the following to stop the trace:
logman stop wsl_trace -ets
You might have to run it a couple of times; logman can sometimes fail the first time or two. For example:
C:>logman start wsl_trace -p {b99cdb5a-039c-5046-e672-1a0de0a40211} -o wsl_trace.etl -ets
<<<<<<<< Now Repro >>>>>>>>
C:>logman stop wsl-trace -ets
Error:
C:>logman stop wsl_trace -ets
C:>dir *.etl
Directory of C:\
10/04/2023 07:53 PM 368,640 wsl_trace.etl
Please send back the generated ETL file. Thanks! |
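Since the stop command can fail on the first attempt or two, the retry could be scripted. This is only a rough sketch, not part of the instructions above: `stop_trace`, the attempt count and the delay are made-up names and values, and it assumes logman signals failure with a non-zero exit code. The `run` parameter exists only so the command runner can be swapped out for testing.

```python
import subprocess
import time


def stop_trace(session="wsl_trace", attempts=3, delay_s=2.0, run=subprocess.run):
    """Run `logman stop <session> -ets`, retrying on failure.

    Returns the number of the attempt that succeeded.
    Assumption: a non-zero exit code means the stop failed.
    """
    cmd = ["logman", "stop", session, "-ets"]
    for attempt in range(1, attempts + 1):
        result = run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            # Brief pause before the next try (arbitrary value).
            time.sleep(delay_s)
    raise RuntimeError(f"could not stop trace session {session!r} after {attempts} attempts")
```

Calling `stop_trace()` from an elevated prompt after the repro would then issue the same stop command up to three times.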
What information will it contain? Will there be any sensitive / personal information? Can I send it to your email instead of uploading it here publicly? |
It will contain IP addresses and possibly machine names / DNS names. You can send it to me directly if you would prefer ([email protected]). |
I'm seeing something similar to this and #10587. What I'm noticing is that something is triggering the deletion of the routes, which I've proven with
Re-adding the routes with:
Restores connectivity until the next time the routes are purged. (This makes me suspect the scripts linked in both issues aren't too helpful: they seem to rely on shutting down the WSL instance, and so would have to run for an extended time because the purge doesn't happen predictably. As with the issue ending in 7, there seems to be an extended initial period where it works fine.) Filtering Windows Event Logger (+/- a few mins) showed the following entries happening at the exact time of the route deletions:
SMB Entries point to:
While the Host-Network-Service shows changes to IPv6 networking (specifically Unique Local Unicast addresses on the host's main Ethernet connection).
Export of above Event Viewer logs
This makes me think that some service responsible for keeping the WSL instance's routing etc. in sync with the host in mirrored mode forgets to re-add IPv4 routes after a period of time if it detects a change to the Windows 11 host's interfaces, even a minor one. IPv6 routes remain intact, as seen here:
|
@keith-horton I've seen a repro and sent you the trace. |
@snjnz - that's great information. Yes, that shows that something brought down that virtual NIC on the host, which triggered the vmNIC in the container to go down and up again. Linux deletes addresses and routes when its adapters go down (unlike Windows) - and WSL was not aware that it went down behind us. I'm bringing up a feature to detect when this happens and reset all addresses and routes. @leoleoasd , I'm looking over your traces now. |
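To illustrate the kind of check being described (noticing that routes vanished after the adapter bounced), here is a minimal sketch. It is not how WSL implements this; it only compares two snapshots of `ip route` output. The function names are hypothetical and the addresses in the usage note are made-up samples from the documentation range.

```python
def parse_routes(ip_route_output):
    """Turn `ip route` output into a set of whitespace-normalized route lines."""
    return {
        " ".join(line.split())
        for line in ip_route_output.splitlines()
        if line.strip()
    }


def missing_routes(expected, current):
    """Return the routes present in the earlier snapshot but absent now."""
    return expected - current
```

For example, taking a snapshot while connectivity is fine (say `default via 192.0.2.1 dev eth0` plus `192.0.2.0/24 dev eth0 scope link`) and comparing it with a later snapshot that only contains the second line would report the default route as missing.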
Must be psychic because I'd just opened the issue to add an additional observation. Over the last weekend I was ruling out IPv6 (which I'd just enabled on my network) as causing internet stability issues. As part of this I disabled router advertisements on the router for a span of approx 48 hours. During this period, WSL with networking mode set to mirrored operated as expected. So now I'm wondering if it's specific to IPv6.
The IP addresses (both v4 and v6) seem to remain up; it's just the routing table that is affected. Additionally, the IPv6 table doesn't always clear (I don't have copy and pastes to show this). From the trace I added above, I can actually see that it seems to be deleting and re-adding the same routes already, so could it be a race condition of some description that leaves the routing table incomplete? For example, these two entries: Deletion:
Re-add:
Another hunch I was trying to look into, given the seemingly 24-hour duration, was whether Windows' temporary addressing might be playing a part, but I've been having trouble disabling that (or at least getting Windows to honour it). |
@leoleoasd, I can see the network status on the host is changing ... a lot. I don't know the source of what is changing it. I see that we pushed the address and route information successfully. But if you are seeing the routes deleted, then that suggests something also changed the state of that vNIC on the host. This is more evidence that we need to talk within the team about better detecting changes from within the Linux container, and resyncing when we see unexpected changes. @snjnz - yes, you're correct, only the routes are deleted when the interface bounces down and up again. Though I would need to see a WSL trace to see what changes we are directing from WSL vs. what changes are happening from within Linux (likely responding to the vmNIC state changing). Thanks! |
I decided to take a look at my hunch regarding temporary IP addresses, and set the Temporary Preferred Lifetime to 15mins. The first trigger coincided with the PreferredLifetime on the IP at boot reaching zero. I've grabbed some traces and will e-mail you the link @keith-horton in the next 10 mins or so. Once again I'm seeing the same log messages coinciding with the routes dropping in WSL as well, also seeing within a few seconds in NetworkProfile Event Viewer:
It's been forever since I've done anything with Windows networking, so I have no idea where other events are logged. One thing I've also noticed: using much shorter temporary address settings results in the routes returning eventually, which I think supports my theory that it's a race condition with the routes going up and down. |
Weird, why is the network status on the host changing? 🤔 This is a PC, and it was never moved or connected to another network. |
Oh, maybe it's because I'm running Tailscale on my PC, and its connection may be unstable? |
Coming over here: I'm the author of the other mentioned issue, #10587. I can also confirm (or at least suspect) that it's a race condition of some sort, related to IPv6's temporary IP timeouts. Now that I've read this thread's background, my observation was also that the Linux route table changed (thus losing the IPv4 default route) at the timestamp of WSL VM Start + Temporary IP Preferred Lifetime + Temporary IP Desync Time - Temporary IP Regenerate Time. For giggles, I changed my Temporary IP Preferred Lifetime to 5 minutes with a Desync Time of 10 seconds. I lost connectivity in WSL at the expected timestamp. For the "routes returning eventually", I assume you're referring to the IPv4 routes. I wonder if that happens around the time of your IPv4 DHCP lease renewal? I'm going to try running IPv6 with Temporary IPs completely disabled and see what happens for a while. |
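The timestamp formula above is simple enough to write down as a worked example. This is just the arithmetic from the comment; the function name is hypothetical, and in the usage note the VM start time and the 5-second Regenerate Time are assumed values (only the 5-minute Preferred Lifetime and 10-second Desync Time come from the thread).

```python
from datetime import datetime, timedelta


def expected_route_loss(vm_start, preferred_lifetime, desync_time, regenerate_time):
    """VM start + preferred lifetime + desync time - regenerate time."""
    return vm_start + preferred_lifetime + desync_time - regenerate_time
```

With a VM start of `datetime(2023, 10, 4, 12, 0, 0)`, `timedelta(minutes=5)`, `timedelta(seconds=10)` and an assumed `timedelta(seconds=5)`, the offset is 300 + 10 - 5 = 305 seconds, so connectivity would be expected to drop at 12:05:05.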
Fingers crossed this makes it easier to debug on Microsoft's side. I'm glad we've managed to come up with a way to quicken the process to make it easier for the devs.
What I'd noticed is that while I was writing my update here and sending Keith the traces, IPv4 pings I had running in the background started working again (it seemed to coincide with the next Temporary address renewal, but pointed towards restoration of routes), but they got dropped a little later not to return, so not as conclusive, but there is an aspect of 'they can come back'. |
Thank you all for your help debugging this. I was able to reproduce this and I have a fix which will hopefully be out with the next update. |
The preview release should have the fix for this, which will hopefully make it into the public release soon. |
How can we install the preview release? Is it a public preview? |
wsl --update --pre-release
This should update to the latest pre-release. |
Haven't had any problems with the last couple of pre-releases, thank you for your work on this! |
Is this problem fixed? |
I'm still encountering this problem after upgrading to the latest preview; I'm using Tailscale, that may be the issue. |
Windows Version
Microsoft Windows [Version 10.0.22621.2361]
WSL Version
2.0.0.0
Are you using WSL 1 or WSL 2?
Kernel Version
5.15.123.1-1
Distro Version
Archlinux
Other Software
No response
Repro Steps
use it for a while, and:
after rebooting wsl:
Expected Behavior
Actual Behavior
Diagnostic Logs
No response