Mirrored network becomes unavailable 24h after session start #10587

sbalmos · 2023-10-04T14:49:18Z

Windows Version

Microsoft Windows [Version 10.0.22621.2361]

WSL Version

2.0.1.0

Are you using WSL 1 or WSL 2?

WSL 2

Kernel Version

5.15.123.1-1

Distro Version

Debian 11

Other Software

1Password SSH agent relay tunneling using npiperelay.

Repro Steps

Within 24 hours of starting a Debian 11 WSL session running with mirrored networking, the network becomes unavailable. All existing connections are dropped, and all attempts to use non-loopback IPs return Network is unreachable. Remediation requires completely exiting WSL and performing a full shutdown of the WSL environment through wsl.exe --shutdown.

Expected Behavior

Networking remains available throughout the life of the session.

Actual Behavior

Networking becomes unavailable the next day.

sbalmos@stormfront:/mnt/c/Users/sbalmos$ ping 192.168.0.1
ping: connect: Network is unreachable
sbalmos@stormfront:/mnt/c/Users/sbalmos$ ping 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.036 ms
^C
--- 127.0.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.036/0.036/0.036/0.000 ms
sbalmos@stormfront:/mnt/c/Users/sbalmos$ ping 1.1.1.1
ping: connect: Network is unreachable
sbalmos@stormfront:/mnt/c/Users/sbalmos$ exit
logout
PS C:\Users\sbalmos> wsl
sbalmos@stormfront:/mnt/c/Users/sbalmos$ ping 1.1.1.1
ping: connect: Network is unreachable
sbalmos@stormfront:/mnt/c/Users/sbalmos$ exit
logout
PS C:\Users\sbalmos> wsl --shutdown
PS C:\Users\sbalmos> wsl
removing previous socket...
Starting SSH-Agent relay...
sbalmos@stormfront:/mnt/c/Users/sbalmos$ ping 1.1.1.1
PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_seq=1 ttl=56 time=14.1 ms
64 bytes from 1.1.1.1: icmp_seq=2 ttl=56 time=11.5 ms
64 bytes from 1.1.1.1: icmp_seq=3 ttl=56 time=11.9 ms
^C
--- 1.1.1.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 11.511/12.520/14.105/1.134 ms
sbalmos@stormfront:/mnt/c/Users/sbalmos$

Current time is 10:48am ET. Since I just restarted WSL to regain networking, I expect networking to become unavailable again around 10:45am ET tomorrow, give or take a few minutes.

Diagnostic Logs

No response

keith-horton · 2023-10-05T03:09:07Z

Hi there. Since this takes a very long time to repro, could you run the following until you get a repro? It will capture minimal traces (just the WSL traces for managing the Linux network settings).

e.g.

C:\>logman start wsl_trace -p {b99cdb5a-039c-5046-e672-1a0de0a40211} -o wsl_trace.etl -ets
The command completed successfully.

<<<<<<<< Now Repro >>>>>>>>

C:\>logman stop wsl-trace -ets

Error:
Data Collector Set was not found.

C:\>logman stop wsl_trace -ets
The command completed successfully.

C:\>dir *.etl
Volume in drive C has no label.
Volume Serial Number is C64F-A1F6

Directory of C:\

10/04/2023 07:53 PM 368,640 wsl_trace.etl
1 File(s) 368,640 bytes
0 Dir(s) 34,756,386,816 bytes free

Please send back the generated ETL file.
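
Note that the transcript above shows logman stop failing once because the session name was typed as wsl-trace instead of wsl_trace. A minimal sketch, assuming Python is available on the Windows side, that builds both commands from a single session name so the two cannot drift apart. The helper name logman_commands is hypothetical and is not part of WSL or logman; the flags are copied from the commands quoted above.

```python
# Provider GUID copied from the logman command quoted above.
WSL_PROVIDER = "{b99cdb5a-039c-5046-e672-1a0de0a40211}"


def logman_commands(session="wsl_trace", provider=WSL_PROVIDER):
    """Return (start, stop) argument lists that share one session name.

    Hypothetical helper: it only builds the command lines, it does not run them.
    """
    start = ["logman", "start", session, "-p", provider,
             "-o", f"{session}.etl", "-ets"]
    stop = ["logman", "stop", session, "-ets"]
    return start, stop


if __name__ == "__main__":
    # Print the two commands so they can be pasted into a Windows prompt.
    for argv in logman_commands():
        print(" ".join(argv))
```

The lists can be passed to subprocess.run on Windows if running them from a script is preferred over pasting.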

Once you have a repro, could you then run a very short repro attempting to make a network connection from the WSL container? The command below takes a much deeper trace to try to find where data is getting lost.

powershell .\collect-wsl-logs.ps1 .\wsl_networking.wprp

(from https://github.com/microsoft/WSL/tree/master/diagnostics)

Thanks!

sbalmos · 2023-10-05T15:03:05Z

ETS and WSL log traces attached. Started the ETS approx. 2 hours before predicted loss of networking, and ended the ETS afterwards. WSL log trace was performed immediately after loss of networking. Loss occurred at 24h+10m... and here's the juicy new tidbit - I just happened to notice that when the event occurred, the interfaces don't lose IPs or anything at the interface level. However, the in-WSL Linux routing table is completely wiped clean. No default routes, nothing. That may be the smoking gun or where to point the Eye of Sauron next.

WSLTraces.zip
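
For anyone who wants to confirm the same symptom (interfaces still configured, IPv4 routing table empty), here is a minimal sketch that reads the kernel's IPv4 routing table from /proc/net/route inside the distro. The function names are hypothetical, and treating "no up route to 0.0.0.0/0" as the failure signature is an assumption based on the description above, not something taken from the WSL diagnostics scripts.

```python
from pathlib import Path

RTF_UP = 0x0001  # "route is usable" flag from <linux/route.h>


def ipv4_routes(table_text):
    """Parse /proc/net/route text into (iface, dest_hex, mask_hex, flags) tuples."""
    routes = []
    for line in table_text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 8:
            continue  # skip blank or truncated rows
        # Columns: Iface, Destination, Gateway, Flags (hex), ..., Mask is column 8.
        routes.append((fields[0], fields[1], fields[7], int(fields[3], 16)))
    return routes


def has_ipv4_default_route(table_text):
    """True if an up route with destination 0.0.0.0 and mask 0.0.0.0 is present."""
    return any(
        dest == "00000000" and mask == "00000000" and flags & RTF_UP
        for _iface, dest, mask, flags in ipv4_routes(table_text)
    )


if __name__ == "__main__":
    proc_route = Path("/proc/net/route")
    if proc_route.exists():  # only present on Linux
        text = proc_route.read_text()
        print("IPv4 routes:", len(ipv4_routes(text)))
        print("IPv4 default route present:", has_ipv4_default_route(text))
```

In the bad state described above, the expectation is zero IPv4 routes and no default route, while the interface addresses are unchanged.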

keith-horton · 2023-10-06T01:14:38Z

Thanks. I can see from the trace that our code in WSL has been successfully pushing IP updates into the container. There weren't errors setting things up with Linux.

It doesn't look like the wsl_networking.wprp created a trace to observe traffic failing. While it's in this bad state, can you dump out the Linux state (https://github.com/microsoft/WSL/blob/master/diagnostics/networking.sh), then run

wpr.exe -start wsl_networking.wprp -filemode

(then generate network traffic from the Linux container, like trying to ping an address, or wget bing.com a few times)

wpr.exe -stop wsl_networking.etl

Please let me know what traffic you tried to send, and send that ETL file.

Could you also cat /etc/resolv.conf so we can see what the DNS configuration is?
Thanks!

sbalmos · 2023-10-06T16:49:16Z

The ETL file is approximately 300 megs, zipped to 70, too big for an attachment. I have made it available at https://1drv.ms/u/s!AtUhMGXKAUHRgqFFUgXnAVLNKoHZmA?e=uLvtda

For giggles and completeness, networking-good.txt is a run of the networking shell script while everything is okay, and networking-bad.txt is a dump from the script in the bad state. The ETL is also attached. Pings against 1.1.1.1, the local router 192.168.0.1, bing, google, etc. were all attempted. Interestingly, looking at the networking-bad dump and some other observations, the IPv4 default route and subnets are nuked, but IPv6 remains up and available. In fact, if I know the IPv6 address of a service, new traffic to it is passed. Existing traffic was dropped: at the time of the event, I had an IPv6 connection open to one of the Libera IRC network servers, and it was dropped. But I was able to successfully ping the IPv6 addresses of both Google and the Libera IRC server I had been connected to. All of those pings, successful and unsuccessful, were captured in the ETL file.

keith-horton · 2023-10-13T06:00:06Z

Thanks. The traffic over IPv6 is working (because there's a v6 route), but IPv4 doesn't have a route, so all of that traffic is failing. We have now heard of a couple of instances where something running on the Windows host is affecting the vNIC that we use, causing the vmNIC in the container to go down and up again, at which point Linux deletes the IPv4 route (that's just Linux stack behavior, for whatever reason).

It's not clear what is changing the state of the vNIC on the host though. There's nothing indicated in WSL that it changed (if HNS changed it for example, we would get a callback notification). (HNS is the component that creates the vNICs).

We are going to talk more internally about better responding to this and syncing IP state in Linux when we see changes occur unexpectedly.
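
The asymmetry described in this comment (IPv6 still routable, IPv4 not) can be checked from inside the distro. A minimal sketch, assuming an iproute2 build that supports JSON output via ip -j and that it reports default routes with "dst": "default". The function names and state labels are hypothetical, and using default routes as a proxy for "this address family is routable" is a simplification.

```python
import json
import subprocess


def default_routes(ip_json_text):
    """Extract default-route entries from `ip -j route show` JSON output."""
    text = ip_json_text.strip()
    entries = json.loads(text) if text else []
    return [entry for entry in entries if entry.get("dst") == "default"]


def classify(v4_json, v6_json):
    """Label the routing state from the IPv4 and IPv6 route dumps."""
    has_v4 = bool(default_routes(v4_json))
    has_v6 = bool(default_routes(v6_json))
    if has_v4 and has_v6:
        return "ok"
    if has_v6:
        return "ipv4-route-lost"  # the state reported in this issue
    if has_v4:
        return "ipv6-route-lost"
    return "no-default-routes"


def read_routes(family):
    """Run `ip -j <family> route show`, where family is "-4" or "-6"."""
    result = subprocess.run(
        ["ip", "-j", family, "route", "show"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    try:
        print(classify(read_routes("-4"), read_routes("-6")))
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print("could not query routes:", exc)
```

Running this periodically and logging the timestamp of the first ipv4-route-lost result would be one way to pin down when the vmNIC bounce happens.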

sbalmos · 2023-10-13T14:26:02Z

Yup, thanks Keith! I read the other thread, and that one's author is a lot more thorough than I am. I just confirmed over on that thread that the IPv6 Temporary IP behavior he suspected is also what triggers it for me.

keith-horton · 2023-10-17T21:38:15Z

Thank you all for your help debugging this. I was able to reproduce this and I have a fix which will hopefully be out with the next update.

keith-horton · 2023-11-14T07:09:16Z

The preview release should have the fix for this, and it will hopefully be going to the public release soon.
You can get the prerelease here:

wsl --update --pre-release

Thanks again!

CatalinFetoiu · 2024-07-26T23:03:39Z

Closing since the issue is fixed. If you still encounter the problem, please open a new issue. Thanks.

OneBlue added the network label Oct 4, 2023

snjnz mentioned this issue Oct 7, 2023

Lost routing table once in a while after using mirrored networking #10588 (Open)

This was referenced Nov 18, 2023

No network access with networkingMode=mirrored #10791 (Open)

Mirrored networking mode is not supported #10823 (Open)

CatalinFetoiu closed this as completed Jul 26, 2024
