socket TCP_USER_TIMEOUT gives "[Errno 92] Protocol not available" #3873

Closed
jwilson8767 opened this issue Feb 25, 2019 · 5 comments

@jwilson8767

  • Your Windows build number: 10.0.17134.619

Ubuntu 18.04.2 LTS
Release: 18.04
Codename: bionic

  • What you're doing and what's happening: (Copy&paste the full set of specific command-line steps necessary to reproduce the behavior, and their output. Include screen shots if that helps demonstrate the problem.)
    Setting TCP_USER_TIMEOUT on an AF_UNIX socket fails:
jwilson@M:~$ python3
Python 3.6.7 (default, Oct 22 2018, 11:32:17)
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
>>> TCP_USER_TIMEOUT = 18 # since linux 2.6.37
>>> s.setsockopt(socket.SOL_TCP, 18, 1000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 22] Invalid argument
>>>

The same call on an AF_INET socket also fails, with a different error:

jwilson@M:~$ python3
Python 3.6.7 (default, Oct 22 2018, 11:32:17)
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>> s.setsockopt(socket.SOL_TCP, 18, 1000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 92] Protocol not available
>>>

This is particularly devastating for me as it prevents me from using Dask LocalClusters via WSL during development. Help!

  • What's wrong / what should be happening instead:
    The TCP user timeout should be applied to sockets created from Python. When it cannot be set, worker processes hang when using Dask and other multiprocessing tools from Python.
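
For anyone who wants to keep working while the option is unavailable, here is a minimal sketch of setting TCP_USER_TIMEOUT defensively. The helper name `try_set_user_timeout` is hypothetical, and the set of errno values treated as "not supported" is an assumption based on the two errors shown in this report. It uses `socket.IPPROTO_TCP`, which has the same value as `socket.SOL_TCP` on Linux.

```python
import errno
import socket

# Python 3.6+ exposes socket.TCP_USER_TIMEOUT on Linux; 18 is the Linux value
# used in the report above and serves as a fallback here.
TCP_USER_TIMEOUT = getattr(socket, "TCP_USER_TIMEOUT", 18)

# Assumption: these errno values mean "the platform does not offer the option".
UNSUPPORTED = (errno.ENOPROTOOPT, errno.EINVAL, errno.EOPNOTSUPP)


def try_set_user_timeout(sock, timeout_ms):
    """Hypothetical helper: return True if the option was applied, False if rejected."""
    try:
        sock.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, timeout_ms)
    except OSError as exc:
        if exc.errno in UNSUPPORTED:
            return False
        raise  # anything else is unexpected, so let it surface
    return True


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print("TCP_USER_TIMEOUT applied:", try_set_user_timeout(s, 1000))
```

A caller can log a warning when the helper returns False and carry on with keepalive settings alone. Whether that is enough to avoid hung workers is exactly the open question in this issue.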
@therealkenc
Collaborator

Looks like that code is here (that's a guess though). Check the Dask logs (however Dask logs warnings) and see whether the warning "Could not set timeout on TCP stream" shows up.

The weird thing (which I didn't look into much) is that both Darwin and win32 set the TCP keepalive values, but neither fires the Linux-specific TCP_USER_TIMEOUT path (which is related to but different from keepalive). That means those platforms must be doing something else, or they would be seeing "hanging worker processes" too (by design or otherwise).

A constructive thing to do (if you are feeling highly motivated) would be to install the win32 version and then see if you can mimic its behavior in WSL with upstream mods to Dask. Maybe win32 behaves the same as what you're observing with WSL, or maybe there are more sys.platform.startswith("win") checks elsewhere in the codebase that enable Dask to work around the "hanging worker processes".

@jwilson8767
Author

@therealkenc thanks for the quick response!

Good point that this may not actually be what is causing the hanging/orphaned worker processes I'm seeing. I have an issue (referenced above) on the Dask repo to determine the severity of not having TCP_USER_TIMEOUT. In the meantime, do you have an idea for how we could test the current handling of socket timeouts in WSL vs normal Ubuntu? I'm wondering whether the default is lower or higher (or never to time out at all), and what happens within a Python project when a socket has SO_KEEPALIVE turned on but no timeout.
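
One rough way to compare defaults is to read the relevant options back with `getsockopt` on a fresh socket and run the same script on WSL and on regular Ubuntu. This is only a sketch: the helper name `read_timeout_options` is made up for illustration, and the list of options is an assumption about which settings matter here.

```python
import socket

# (level, option name) pairs. Names are resolved with getattr because not every
# platform's socket module defines all of them.
OPTIONS = [
    (socket.SOL_SOCKET, "SO_KEEPALIVE"),
    (socket.IPPROTO_TCP, "TCP_KEEPIDLE"),
    (socket.IPPROTO_TCP, "TCP_KEEPINTVL"),
    (socket.IPPROTO_TCP, "TCP_KEEPCNT"),
    (socket.IPPROTO_TCP, "TCP_USER_TIMEOUT"),
]


def read_timeout_options(sock):
    """Hypothetical helper: map each option name to its value or to the error seen."""
    report = {}
    for level, name in OPTIONS:
        opt = getattr(socket, name, None)
        if opt is None:
            report[name] = "not defined on this platform"
            continue
        try:
            report[name] = sock.getsockopt(level, opt)
        except OSError as exc:
            report[name] = f"OSError errno {exc.errno}"
    return report


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        for name, value in read_timeout_options(s).items():
            print(f"{name:>18}: {value}")
```

Diffing the output from the two systems shows which options the kernel (or the WSL translation layer) accepts and what it reports as defaults. It does not show how a stalled connection actually behaves.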

I was able to dig up one other very relevant usage of TCP_USER_TIMEOUT over here: https://github.com/celery/py-amqp/blob/master/amqp/platform.py#L46 which appears to be a dependency of Celery. The related issue is here: celery/py-amqp#145 which makes me wonder if Dask is attempting to reuse sockets just as Celery had previously.

If I get a more concise way of triggering a failure due to this issue I will definitely try it on WSL vs win32, as you say!

@therealkenc
Collaborator

> In the meantime, do you have an idea for how we could test the current handling of socket timeouts in WSL vs normal Ubuntu?

Not without hand waving. Note that #2949, #3687, and #2915 are all still open, flapping in the wind. You could be hitting any one of them even if the problem isn't TCP_USER_TIMEOUT. And that's not counting the stuff that has been closed, like #2846. I only just noticed you are on 10.0.17134. Upgrade to at least 17763 (aka 1809 aka RS5) before you do anything.

But yeah, I had suggested looking at the win32 port on the basis that the problem was the lack of TCP_USER_TIMEOUT, knowing Darwin and win32 don't have that. Another way (possibly a better way) to attack this is to run narrower and narrower test Python code on Real Linux in a VM and on WSL, and stare at the strace(1) logs of both, looking for where they diverge. You can even comment out that TCP_USER_TIMEOUT on the Real Linux side yourself to see if it makes any difference.
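
A narrow test along those lines might look like the sketch below: a loopback exchange that can be run with and without the TCP_USER_TIMEOUT call, so the two strace logs can be compared. The function name `probe` and the file name `probe.py` are placeholders, and nothing here is taken from Dask.

```python
import socket

# 18 is the Linux value from the original report; newer Pythons expose the constant.
TCP_USER_TIMEOUT = getattr(socket, "TCP_USER_TIMEOUT", 18)


def probe(set_user_timeout=True, timeout_ms=1000):
    """Run a one-byte loopback exchange and record the outcome of each step."""
    steps = []
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        server.listen(1)
        with socket.create_connection(server.getsockname()) as client:
            conn, _ = server.accept()
            with conn:
                client.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
                steps.append(("SO_KEEPALIVE", "ok"))
                if set_user_timeout:
                    try:
                        client.setsockopt(
                            socket.IPPROTO_TCP, TCP_USER_TIMEOUT, timeout_ms
                        )
                        steps.append(("TCP_USER_TIMEOUT", "ok"))
                    except OSError as exc:
                        steps.append(("TCP_USER_TIMEOUT", f"errno {exc.errno}"))
                client.sendall(b"x")
                steps.append(("echo", conn.recv(1)))
    return steps


if __name__ == "__main__":
    for flag in (True, False):
        print("set_user_timeout =", flag, probe(set_user_timeout=flag))
```

Running it as `strace -f -e trace=network python3 probe.py` on both systems limits the log to socket-related syscalls, which makes the first diverging call easier to spot.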

@Saysongkham-Sayavong

Hello,
I got the error below:
File "/home/topfy/miniconda3/envs/iota2-env/lib/python3.9/site-packages/distributed/comm/tcp.py", line 113, in set_tcp_timeout
sock.setsockopt(socket.SOL_TCP, TCP_USER_TIMEOUT, timeout * 1000)
OSError: [Errno 92] Protocol not available

Please help me to solve this issue, thank you so much in advance.

Contributor

This issue has been automatically closed since it has not had any activity for the past year. If you're still experiencing this issue please re-file this as a new issue or feature request.

Thank you!

3 participants