[Bug] Unable to succeed in selecting a random port #22352

Closed

birgerbr opened this issue Feb 14, 2022 · 17 comments
@birgerbr
Contributor

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Core

What happened + What you expected to happen

After running for 6 days, the Ray server fails to accept new clients.
I found this error repeated in ray_client_server.err:

ERROR:ray.util.client.server.proxier:Unable to find channel for client: d1e32236c08c4b5995b3c10582204c27
WARNING:ray.util.client.server.proxier:Retrying Logstream connection. 1 attempts failed.
ERROR:ray.util.client.server.proxier:Unable to find channel for client: d1e32236c08c4b5995b3c10582204c27
WARNING:ray.util.client.server.proxier:Retrying Logstream connection. 2 attempts failed.
ERROR:ray.util.client.server.proxier:Unable to find channel for client: d1e32236c08c4b5995b3c10582204c27
WARNING:ray.util.client.server.proxier:Retrying Logstream connection. 3 attempts failed.
ERROR:ray.util.client.server.proxier:Unable to find channel for client: d1e32236c08c4b5995b3c10582204c27
WARNING:ray.util.client.server.proxier:Retrying Logstream connection. 4 attempts failed.
ERROR:ray.util.client.server.proxier:Unable to find channel for client: d1e32236c08c4b5995b3c10582204c27
WARNING:ray.util.client.server.proxier:Retrying Logstream connection. 5 attempts failed.
ERROR:grpc._server:Exception iterating responses: Unable to succeed in selecting a random port.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/grpc/_server.py", line 461, in _take_response_from_response_iterator
    return next(response_iterator), True
  File "/usr/local/lib/python3.8/dist-packages/ray/util/client/server/proxier.py", line 608, in Datapath
    server = self.proxy_manager.create_specific_server(client_id)
  File "/usr/local/lib/python3.8/dist-packages/ray/util/client/server/proxier.py", line 196, in create_specific_server
    port = self._get_unused_port()
  File "/usr/local/lib/python3.8/dist-packages/ray/util/client/server/proxier.py", line 152, in _get_unused_port
    raise RuntimeError("Unable to succeed in selecting a random port.")
RuntimeError: Unable to succeed in selecting a random port.

I cannot recall seeing this error before upgrading to 1.10.0.

Versions / Dependencies

Ray version 1.10.0, Python 3.8, Ubuntu 20.04.

Reproduction script

I do not have a reproduction script; the server was running for 6 days before the issue started.

Anything else

The server was running in Kubernetes.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@birgerbr birgerbr added the bug and triage labels Feb 14, 2022
@architkulkarni
Contributor

Hi @ckw017, can I assign this to you?

@ckw017
Member

ckw017 commented Feb 14, 2022

@architkulkarni Yep, that's fine

@birgerbr As a sanity check, how many ray_client_server_23***.err files are there in the logs? My guess is that if the server is accepting a lot of client connections, it has exhausted all the ports. If you have an estimate of how many concurrent connections are usually active, that would be handy. It looks like we might have an effective limit of ~1000 concurrent client servers, since the current port range is only 23001-23999.
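
In case it helps, a rough way to count those per-client log files on the head node might look like the sketch below; the path assumes the default Ray log directory, so adjust it if your logs live somewhere else.

import glob

# One ray_client_server_<port>.err file is written per specific server, so the
# file count is a rough proxy for how many ports have been handed out.
# The path assumes the default Ray log directory on the head node.
err_files = glob.glob("/tmp/ray/session_latest/logs/ray_client_server_23*.err")
print(len(err_files), "specific-server log files")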

Also, what version of Ray were you on before upgrading?

@ckw017 ckw017 self-assigned this Feb 14, 2022
@birgerbr
Contributor Author

We were using version 1.9.2.

There are 16 ray_client_server_23***.err files. This was on our staging cluster, which does not seem to have been very active during those days. The number of concurrent connections was probably below 3.

@ckw017
Member

ckw017 commented Feb 16, 2022

Got it. If possible, can you share the full ray_client_server.err file?

If you still have the cluster up, or if you run into this again, can you try running this on the head node of the cluster:

import socket
import traceback

# Try binding to every port in the client-server range; any port that raises
# OSError is already in use.
for port in range(23000, 24000):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("", port))
    except OSError:
        print("failed to bind to", port)
        traceback.print_exc()
        continue
    finally:
        s.close()

@birgerbr
Contributor Author

Here is the log file: ray_client_server.err. The cluster has been restarted, but I can run that code if I see the issue again.

FYI: after the cluster was restarted I found another issue with our cluster. I'm not sure whether these issues are connected, but the other issue was that the latest Ray operator image for Kubernetes had some changes that made it incompatible with our current configuration. We solved that by using the 1.10.0 operator image for now. I'm assuming that we will need to update our configuration as done in f51566e when we update to Ray 1.11.0.

@ckw017
Member

ckw017 commented Feb 17, 2022

Sounds good. Digging through the logs, it looks like it hit "Server startup failed" 1,000 times before it ran into "Unable to succeed in selecting a random port", which likely explains how all the ports were exhausted. If the incompatible image is what's causing the server failures, then reverting might resolve this as well.
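
If anyone wants to check their own logs for the same pattern, a quick count might look like this (a sketch; the path assumes the default Ray log directory on the head node):

# Rough count of failed server starts in the proxier log.
with open("/tmp/ray/session_latest/logs/ray_client_server.err") as f:
    print(sum(line.count("Server startup failed") for line in f))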

@birgerbr
Contributor Author

Our cluster is again getting "Server startup failed", and the operator is running rayproject/ray:1.10.0, which matches the version used on the head node.

The "Unable to succeed in selecting a random port." errors were not in the logs, but they might have appeared if we had let it continue.

Your script above ran without any issues.

@vicyap
Contributor

vicyap commented Mar 3, 2022

Hello, I think I am hitting a similar issue. My issue has nothing to do with ports, but I do see "Server startup failed". If my issue isn't related, I can file a new one.

My Ray cluster is installed through the Helm chart in this repo. The operator and head node are both using version 1.10.0.

This line fails for me:
ray.init("ray://mycluster.internal:10001", runtime_env={"pip": ["torchaudio==0.10.0", "boto3"]})

However, this does work if I am on a node in the cluster, e.g. ray.init("auto").

Here is the client traceback:

>>> ray.init("ray://mycluster.internal:10001", runtime_env={"pip": ["torchaudio==0.10.0", "boto3"]})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/venv38/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/venv38/lib/python3.8/site-packages/ray/worker.py", line 785, in init
    return builder.connect()
  File "/venv38/lib/python3.8/site-packages/ray/client_builder.py", line 151, in connect
    client_info_dict = ray.util.client_connect.connect(
  File "/venv38/lib/python3.8/site-packages/ray/util/client_connect.py", line 33, in connect
    conn = ray.connect(
  File "/venv38/lib/python3.8/site-packages/ray/util/client/__init__.py", line 228, in connect
    conn = self.get_context().connect(*args, **kw_args)
  File "/venv38/lib/python3.8/site-packages/ray/util/client/__init__.py", line 88, in connect
    self.client_worker._server_init(job_config, ray_init_kwargs)
  File "/venv38/lib/python3.8/site-packages/ray/util/client/worker.py", line 697, in _server_init
    raise ConnectionAbortedError(
ConnectionAbortedError: Initialization failure from server:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 623, in Datapath
    if not self.proxy_manager.start_specific_server(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 279, in start_specific_server
    serialized_runtime_env_context = self._create_runtime_env(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 233, in _create_runtime_env
    raise RuntimeError(
RuntimeError: Failed to create runtime_env for Ray client server: Failed to install pip requirements:
Collecting torchaudio==0.10.0
Using cached torchaudio-0.10.0-cp38-cp38-manylinux1_x86_64.whl (2.9 MB)
Collecting boto3
Using cached boto3-1.21.12-py3-none-any.whl (132 kB)
Collecting torch==1.10.0

And on the head node, I tailed some logs...

==> dashboard_agent.log <==
2022-03-03 12:48:07,199 INFO runtime_env_agent.py:169 -- Runtime env already failed. Env: {"extensions": {"_ray_commit": "5ea565317a8104c04ae7892bb9bb41c6d72f12df"}, "pipRuntimeEnv": {"config": {"packages": ["torchaudio==0
.10.0", "boto3"]}}, "uris": {"pipUri": "pip://4ceac73cac9531aae34bb906bc8ce6b1b6c04183"}}, err: Failed to install pip requirements:
Collecting torchaudio==0.10.0
Using cached torchaudio-0.10.0-cp38-cp38-manylinux1_x86_64.whl (2.9 MB)
Collecting boto3
Using cached boto3-1.21.12-py3-none-any.whl (132 kB)
Collecting torch==1.10.0

==> ray_client_server.err <==
INFO:ray.util.client.server.proxier:New data connection from client e9e585f631d5412ead897bda8238d092:

==> debug_state_gcs.txt <==

==> debug_state.txt <==

==> debug_state_gcs.txt <==

==> debug_state.txt <==

==> debug_state_gcs.txt <==

==> debug_state.txt <==

==> ray_client_server.err <==
INFO:ray.util.client.server.proxier:e9e585f631d5412ead897bda8238d092 last started stream at 1646340487.1929243. Current stream started at 1646340487.1929243.
ERROR:grpc._server:Exception iterating responses: Server startup failed.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/grpc/_server.py", line 461, in _take_response_from_response_iterator
    return next(response_iterator), True
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 712, in Logstream
    channel = self.proxy_manager.get_channel(client_id)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 346, in get_channel
    server.wait_ready()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/util/client/server/proxier.py", line 69, in wait_ready
    raise RuntimeError("Server startup failed.")
RuntimeError: Server startup failed.

I also get "Server startup failed".

Also, the first line of dashboard_agent.log is:
2022-03-03 12:48:07,199 INFO runtime_env_agent.py:169 -- Runtime env already failed.

Any help is appreciated.

@ckw017
Member

ckw017 commented Mar 3, 2022

cc @shrekris-anyscale, can you help triage vicyap's issue? It looks like an issue with runtime envs.

@shrekris-anyscale
Contributor

Sure, let me take a look.

@simon-mo
Contributor

simon-mo commented Mar 3, 2022

@vicyap Looking at the traceback, it looks like the process or container might be getting OOM-killed. This is a common problem when installing PyTorch within a container.

Please take a look at pytorch/pytorch#1022 (comment) for a workaround.

@architkulkarni
Contributor

@vicyap Is it possible to try the same thing without the torchaudio package and see if it works? We've had a couple of other users report roughly the same thing when using torch with the Helm charts.

@simon-mo
Contributor

simon-mo commented Mar 3, 2022

@architkulkarni is there a way to do something like pip --no-cache-dir install torch?

@architkulkarni
Contributor

architkulkarni commented Mar 3, 2022

@vicyap you can try setting PIP_NO_CACHE_DIR=1 on the cluster: https://pip.pypa.io/en/latest/topics/configuration/#environment-variables

Actually, you might have to set it to 0 instead of 1, as this issue seems to still be open: pypa/pip#5735

Another user reported that using the "conda" field instead seemed to work (https://discuss.ray.io/t/failed-to-lease-worker-from-node/5140/6?u=architkulkarni), though it's not clear why it would fix a memory issue. If you're okay with running in an isolated conda environment (so not inheriting the existing Python packages already installed on the cluster), this could work.
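
A minimal sketch of what that might look like, reusing the placeholder address and package list from the report above (you may also need to pin a Python version matching the cluster):

import ray

# Same packages as above, but installed into an isolated conda environment
# via the "conda" field instead of the "pip" field of runtime_env.
ray.init(
    "ray://mycluster.internal:10001",
    runtime_env={
        "conda": {
            "dependencies": [
                "pip",
                {"pip": ["torchaudio==0.10.0", "boto3"]},
            ]
        }
    },
)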

@peterhaddad3121

@architkulkarni I seem to be experiencing similar behavior, but a little differently.

I documented an issue in #23865, if it's okay for you to take a look. Could use some eyes.

@stale

stale bot commented Aug 12, 2022

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.

@stale stale bot added the stale label Aug 12, 2022
@stale

stale bot commented Sep 20, 2022

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

@stale stale bot closed this as completed Sep 20, 2022