Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Socket Connect hangs until timeout runs out in mirrored mode. #10855

Open
1 of 2 tasks
MichiBab opened this issue Nov 30, 2023 · 29 comments
Open
1 of 2 tasks

Socket Connect hangs until timeout runs out in mirrored mode. #10855

MichiBab opened this issue Nov 30, 2023 · 29 comments
Labels

Comments

@MichiBab
Copy link

MichiBab commented Nov 30, 2023

Windows Version

Microsoft Windows [Version 10.0.22631.2715]

WSL Version

2.0.9.0 and 2.0.14 tested

Are you using WSL 1 or WSL 2?

  • WSL 2
  • WSL 1

Kernel Version

5.15.133.1-1

Distro Version

Ubuntu 22.04

Other Software

No response

Repro Steps

Call this script in mirrored mode as normal user or sudo (try both):

import socket
import time
import os

def is_port_in_use(port):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # s.settimeout(2)  # set timeout directly

        print(f"Socket type: {s.type}")
        print(f"Socket family: {s.family}")
        print(f"Socket timeout: {s.gettimeout()} seconds")
        # Get and print socket options
        print(f"SO_REUSEADDR option: {s.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR)}")
        print(f"SO_KEEPALIVE option: {s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)}")
        print(f"TCP_NODELAY option: {s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)}")
        print("checking: ", port)
        print("Timeout: ",s.gettimeout())
        return s.connect_ex(("localhost", port)) == 0
    
    
print("Uid: ", os.getuid())
start_time = time.time()
print("Is Port in use:", is_port_in_use(6006))
print("--- Runtime: %s seconds ---" % (time.time() - start_time))

Expected Behavior

It directly checks if the socket is in use or not.

Actual Behavior

Sometimes it takes over 132 seconds to return True or False. In not mirrored mode it always directly returns. I came across this issue using ros, where the roscore takes about 2 minutes to start, or starting tensorboard takes around 2 minutes aswell, so i build this script to reproduce the error.

If it directly returns true or false in mirrored mode, i start the script as sudo and then the problem occurs again.
What is interesting, is that if i set the socket timeout to a specific second, then it will return after that. Its like its blocked for the default timeout.

Diagnostic Logs

Without networking mode mirrored:

❯ python3 test_connect.py
Uid: 1000
Socket type: 1
Socket family: 2
Socket timeout: None seconds
SO_REUSEADDR option: 0
SO_KEEPALIVE option: 0
TCP_NODELAY option: 0
checking: 6006
Timeout: None
Is Port in use: False
--- Runtime: 0.0001404285430908203 seconds ---

With networking mode mirrored:

❯ python3 test_connect.py
Uid: 1000
Socket type: 1
Socket family: 2
Socket timeout: None seconds
SO_REUSEADDR option: 0
SO_KEEPALIVE option: 0
TCP_NODELAY option: 0
checking: 6006
Timeout: None
Is Port in use: False
--- Runtime: 132.54164338111877 seconds ---

Sometimes it works as fast as without mirrored. But sometimes not, even tho it results in the port is not in use.
If it works directly, starting the script with sudo will result again in this timeout.

With networking mode mirrored and setting timeout to a second with:
s.settimeout(1) before the connect_ex call

❯ python3 test_connect.py
Uid: 1000
Socket type: 1
Socket family: 2
Socket timeout: 1.0 seconds
SO_REUSEADDR option: 0
SO_KEEPALIVE option: 0
TCP_NODELAY option: 0
checking: 6006
Timeout: None
Is Port in use: False
--- Runtime: 1.001239538192749 seconds ---

Or with settimeout to 2 seconds:

❯ python3 test_connect.py
Uid: 1000
Socket type: 1
Socket family: 2
Socket timeout: 2.0 seconds
SO_REUSEADDR option: 0
SO_KEEPALIVE option: 0
TCP_NODELAY option: 0
checking: 6006
Timeout: 2.0
Is Port in use: False
--- Runtime: 2.0022153854370117 seconds ---

Copy link

Hi I'm an AI powered bot that finds similar issues based off the issue title.

Please view the issues below to see if they solve your problem, and if the issue describes your problem please consider closing this one and thumbs upping the other issue to help us prioritize it. Thank you!

Closed similar issues:

Note: You can give me feedback by thumbs upping or thumbs downing this comment.

@OneBlue OneBlue added the network label Dec 1, 2023
@chanpreetdhanjal
Copy link

Could you please follow the steps below and attach the diagnostic logs? https://github.com/microsoft/WSL/blob/master/CONTRIBUTING.md#collect-wsl-logs-for-networking-issues

@MichiBab
Copy link
Author

MichiBab commented Dec 5, 2023

Hi, i collected them with a 10 sec forced timeout, as described in my ticket.
❯ python3 test_connect.py
Uid: 1000
Socket type: 1
Socket family: 2
Socket timeout: 10.0 seconds
SO_REUSEADDR option: 0
SO_KEEPALIVE option: 0
TCP_NODELAY option: 0
checking: 6006
Is Port in use: False
--- Runtime: 10.010206699371338 seconds ---
https://github.com/microsoft/WSL/files/13559172/WslLogs-2023-12-05_12-02-48.zip

@chanpreetdhanjal
Copy link

Please follow the networking diagnostic script. https://github.com/microsoft/WSL/blob/master/CONTRIBUTING.md#collect-wsl-logs-for-networking-issues

It should create a folder like [WslNetworkingLogs-date_ver.zip]

@MichiBab
Copy link
Author

MichiBab commented Dec 7, 2023

WslNetworkingLogs-2023-12-07_21-14-32.zip
Updated, here are the new logs @chanpreetdhanjal .

@markovaljaots
Copy link

I have the same issue with WSL 2.0.14.0

@MichiBab
Copy link
Author

MichiBab commented Jan 9, 2024

Still the same problem with 2.1.0

@derMatze82
Copy link

same problem here with 2.1.0.
It is frustrating because some applications expect a conneciton refused when performing port checks with telnet

@MichiBab
Copy link
Author

Still the same problem with 2.1.1

@linG5821
Copy link

linG5821 commented Jan 24, 2024

When I go through windows telnet localhost [port] the access is very slow, but when I go through telnet 127.0.0.1 [port] or telnet native ip [port] it is very fast, I think it might be because there is a problem with localhost resolution, so I found a blog I looked at the prioritization policy for IPv6

netsh interface ipv6 show prefixpolicies

Default policy

Priority    label  prefix
----------  -----  --------------------------------
        50      0  ::1/128
        40      1  ::/0
        35      4  ::ffff:0:0/96
        30      2  2002::/16
         5      5  2001::/32
         3     13  fc00::/7
         1     11  fec0::/10
         1     12  3ffe::/16
         1      3  ::/96

I've changed the order of their priorities
PS: required Administrator

netsh int ipv6 set prefix ::/96 50 0
netsh int ipv6 set prefix ::ffff:0:0/96 40 1
netsh int ipv6 set prefix 2002::/16 35 2
netsh int ipv6 set prefix 2001::/32 30 3
netsh int ipv6 set prefix ::1/128 10 4
netsh int ipv6 set prefix ::/0 5 5
netsh int ipv6 set prefix fc00::/7 3 13
netsh int ipv6 set prefix fec0::/10 1 11
netsh int ipv6 set prefix 3ffe::/16 1 12

Modified policy

Priority    label  prefix
----------  -----  --------------------------------
        50      0  ::/96
        40      1  ::ffff:0:0/96
        35      2  2002::/16
        30      3  2001::/32
        10      4  ::1/128
         5      5  ::/0
         3     13  fc00::/7
         1     12  3ffe::/16
         1     11  fec0::/10 

now telnet 127.0.0.1 [port] is getting fast

But I don't know what other problems such a change would cause

It can be reset by the following command
netsh interface ipv6 reset
PS: reboot your system

@zed76r
Copy link

zed76r commented Jan 25, 2024

When I go through windows telnet localhost [port] the access is very slow, but when I go through telnet 127.0.0.1 [port] or telnet native ip [port] it is very fast, I think it might be because there is a problem with localhost resolution, so I found a blog I looked at the prioritization policy for IPv6

netsh interface ipv6 show prefixpolicies

Default policy

Priority    label  prefix
----------  -----  --------------------------------
        50      0  ::1/128
        40      1  ::/0
        35      4  ::ffff:0:0/96
        30      2  2002::/16
         5      5  2001::/32
         3     13  fc00::/7
         1     11  fec0::/10
         1     12  3ffe::/16
         1      3  ::/96

I've changed the order of their priorities PS: required Administrator

netsh int ipv6 set prefix ::/96 50 0
netsh int ipv6 set prefix ::ffff:0:0/96 40 1
netsh int ipv6 set prefix 2002::/16 35 2
netsh int ipv6 set prefix 2001::/32 30 3
netsh int ipv6 set prefix ::1/128 10 4
netsh int ipv6 set prefix ::/0 5 5
netsh int ipv6 set prefix fc00::/7 3 13
netsh int ipv6 set prefix fec0::/10 1 11
netsh int ipv6 set prefix 3ffe::/16 1 12

Modified policy

Priority    label  prefix
----------  -----  --------------------------------
        50      0  ::/96
        40      1  ::ffff:0:0/96
        35      2  2002::/16
        30      3  2001::/32
        10      4  ::1/128
         5      5  ::/0
         3     13  fc00::/7
         1     12  3ffe::/16
         1     11  fec0::/10 

now telnet 127.0.0.1 [port] is getting fast

But I don't know what other problems such a change would cause

It can be reset by the following command netsh interface ipv6 reset PS: reboot your system

not work for me.

image

@linG5821
Copy link

linG5821 commented Jan 26, 2024

When I go through windows telnet localhost [port] the access is very slow, but when I go through telnet 127.0.0.1 [port] or telnet native ip [port] it is very fast, I think it might be because there is a problem with localhost resolution, so I found a blog I looked at the prioritization policy for IPv6
netsh interface ipv6 show prefixpolicies
Default policy

Priority    label  prefix
----------  -----  --------------------------------
        50      0  ::1/128
        40      1  ::/0
        35      4  ::ffff:0:0/96
        30      2  2002::/16
         5      5  2001::/32
         3     13  fc00::/7
         1     11  fec0::/10
         1     12  3ffe::/16
         1      3  ::/96

I've changed the order of their priorities PS: required Administrator

netsh int ipv6 set prefix ::/96 50 0
netsh int ipv6 set prefix ::ffff:0:0/96 40 1
netsh int ipv6 set prefix 2002::/16 35 2
netsh int ipv6 set prefix 2001::/32 30 3
netsh int ipv6 set prefix ::1/128 10 4
netsh int ipv6 set prefix ::/0 5 5
netsh int ipv6 set prefix fc00::/7 3 13
netsh int ipv6 set prefix fec0::/10 1 11
netsh int ipv6 set prefix 3ffe::/16 1 12

Modified policy

Priority    label  prefix
----------  -----  --------------------------------
        50      0  ::/96
        40      1  ::ffff:0:0/96
        35      2  2002::/16
        30      3  2001::/32
        10      4  ::1/128
         5      5  ::/0
         3     13  fc00::/7
         1     12  3ffe::/16
         1     11  fec0::/10 

now telnet 127.0.0.1 [port] is getting fast
But I don't know what other problems such a change would cause
It can be reset by the following command netsh interface ipv6 reset PS: reboot your system

not work for me.

image

Are you using powershell or cmd on windows ?

On WSL use python open a http server
image

  • CMD
    image

image

  • Powershell
    image

@zed76r
Copy link

zed76r commented Jan 29, 2024

@linG5821
We have a different issue. The problem is that the port is not listening, and it not refusing connections immediately.

@linG5821
Copy link

I did find problems with changing the default NIC policy in practice

I mapped ports when using Docker, I tried '127.0.0.1:port' and 'localhost:port' both were accessible, but I used the Intranet ip 'ipv4ip:port' and couldn't access it, when I reset the network, Intranet ip 'ipv4ip:port' and '127.0.0.1:port' are accessible, and access to 'localhost:port' comes back to the issue itself.

@MichiBab
Copy link
Author

Still the same issue with 2.1.3.0

@MichiBab
Copy link
Author

Still the same issue on 2.1.5. @chanpreetdhanjal is there any way to help? I have a workaround for ROS but it's a bit annoying on tensorboard to wait that long after fetching plots.

@chanpreetdhanjal
Copy link

Someone from our team is looking at it. We will get back to you as soon as we make further progress. Thank you for your patient.

@zed76r
Copy link

zed76r commented May 14, 2024

Can't believe MS team still not resolve this bug. It's pretty annoying problem, especially when I am going to using the vscode JS debug terminal in WSL2 mirrored network mode.

@chanpreetdhanjal
Copy link

We are actively looking at it as per other items in our priority list. Will provide an update as soon as we have something tangible to share.

@keith-horton
Copy link
Member

To update this thread: there are 2 possible issues.

  1. when attempting to connect by name "localhost", the IPv6 address ::1 is likely resolved first (v4 vs. v6 is configurable in Linux). Linux does not support the option for IPv6 that we use to allow sending localhost traffic to the host - it's only supported with IPv4. So unfortunately we cannot use the :;1 address to connect from the Linux container to the host.

  2. when a connection is attempted by IPv4 127.0.0.1 to the host, but nothing is listening on the host, a RST is returned. We see that something has attempted to port-forward that response because the Linux stack eventually sees this with a different source port (which of course is dropped). We're currently working with Linux folks as we have tracked the response through a kernel debugger to the point where we send it over the vmBus to the Linux container.

Sorry, the RST issue is taking much longer to root cause than expected :(

Hope this helps

@dhtzs
Copy link

dhtzs commented Jun 21, 2024

Has there been any progress on this issue?

@keith-horton
Copy link
Member

we have spent a lot of cycles debugging down the windows stack, through vswitch, to the vm - this has been really difficult and we don't yet have a culprit. we are fixing some Docker related issues - once that goes out, we'll return back to finding this bug.

@shigenobuokamoto
Copy link

when a connection is attempted by IPv4 127.0.0.1 to the host, but nothing is listening on the host, a RST is returned. We see that something has attempted to port-forward that response because the Linux stack eventually sees this with a different source port (which of course is dropped). We're currently working with Linux folks as we have tracked the response through a kernel debugger to the point where we send it over the vmBus to the Linux container.

indeed, when capturing packets on Windows and Linux (WSL), the port is wrong.

Windows host                                  linux(WSL)
       HV interface     FSE switch         loopback0
syn    222 <- 60844                         222 <- 60844
         00:11:22:33:44:55 <-------------- 00:15:5d:bd:df:a8

rst    222 -> 60844                         222 -> 45634
                                                   ^^^^^???
00:15:5d:61:13:45  ----------------------> 00:15:5d:bd:df:a8

destination MAC address Windows sent the packet to is loopback0 (WSL).
MAC address received by Linux (WSL) is the one sent by Windows.

if both perceptions are correct, then maybe something is going on inside the FSE switch?

hostAddressLoopback=true & eth0 address will back to destination port Windows sent it to, which is fine.
i think it is basically the same mechanism, but the behavior is different.

@drolander
Copy link

drolander commented Jul 19, 2024

I'm having the same issue. Just got upgraded to Win11 22H2 and have been waiting to try mirrored networking. It did not take long to find this issue. Please fix.

Using WSL 2.2.4.0

@lihk11
Copy link

lihk11 commented Jul 21, 2024

same issue with 2.2.4.0 and the Intranet ip 'ipv4ip:port' and couldn't access it

@atyshka
Copy link

atyshka commented Sep 23, 2024

Any updates on resolutions or workarounds for this issue?

@nessunluogo
Copy link

Same issue here using WSL 2.3.24.0 just updated

@Morrigan-Ship
Copy link

its now about WSL
its about hyper-v adaptors

I fixed it:

Run powershell as Administrator
wsl --shutdown
win + R keys to run
"C:\WINDOWS\System32\mmc.exe" "C:\WINDOWS\System32\virtmgmt.msc" in run and Enter
stop service
now in powershell Enter: Get-HNSNetwork | ? Name -Like * | Remove-HnsNetwork
start service from old mmc.exe command
run wsl

@derMatze82
Copy link

its now about WSL its about hyper-v adaptors

I fixed it:

Run powershell as Administrator wsl --shutdown win + R keys to run "C:\WINDOWS\System32\mmc.exe" "C:\WINDOWS\System32\virtmgmt.msc" in run and Enter stop service now in powershell Enter: Get-HNSNetwork | ? Name -Like * | Remove-HnsNetwork start service from old mmc.exe command run wsl

does not work for me :/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests