
Orca PyTorch Ray Estimator example crashes on YARN #584

Closed
hkvision opened this issue Oct 28, 2020 · 1 comment · Fixed by intel/BigDL#3007
hkvision (Contributor) commented Oct 28, 2020

I ran https://github.com/intel-analytics/analytics-zoo/blob/master/pyzoo/zoo/examples/orca/learn/horovod/pytorch_estimator.py with the command python pytorch_estimator.py --cluster_mode yarn --num_nodes 2 --cores 44 on the Almaren cluster.
If I use the pytorch backend, the program crashes right after Ray is initialized:

{'node_ip_address': '172.16.0.102', 'raylet_ip_address': '172.16.0.102', 'redis_address': '172.16.0.177:57129', 'object_store_address': '/tmp/ray/session_2020-10-28_18-17-18_438564_110094/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2020-10-28_18-17-18_438564_110094/sockets/raylet', 'webui_url': 'localhost:8265', 'session_dir': '/tmp/ray/session_2020-10-28_18-17-18_438564_110094'}
(pid=110164, ip=172.16.0.177) Prepending /dir7/yarn/nm_0/usercache/root/appcache/application_1588741658055_1395/container_1588741658055_1395_01_000003/python_env/lib/python3.7/site-packages/bigdl/share/conf/spark-bigdl.conf to sys.path
(pid=110164, ip=172.16.0.177) Prepending /dir7/yarn/nm_0/usercache/root/appcache/application_1588741658055_1395/container_1588741658055_1395_01_000003/python_env/lib/python3.7/site-packages/zoo/share/conf/spark-analytics-zoo.conf to sys.path
(pid=110183, ip=172.16.0.177) Prepending /dir7/yarn/nm_0/usercache/root/appcache/application_1588741658055_1395/container_1588741658055_1395_01_000003/python_env/lib/python3.7/site-packages/bigdl/share/conf/spark-bigdl.conf to sys.path
(pid=110183, ip=172.16.0.177) Prepending /dir7/yarn/nm_0/usercache/root/appcache/application_1588741658055_1395/container_1588741658055_1395_01_000003/python_env/lib/python3.7/site-packages/zoo/share/conf/spark-analytics-zoo.conf to sys.path
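
For context, the example reaches this point by initializing Orca for YARN with the same settings as the command above. A minimal sketch of that setup (assuming the zoo.orca init_orca_context API of the Analytics Zoo release in use; exact argument names may differ slightly, and the placeholder comments are mine, not from the example):

```python
# Minimal sketch of the YARN setup in pytorch_estimator.py (assumed zoo.orca API;
# not copied from the example source).
from zoo.orca import init_orca_context, stop_orca_context

# Mirrors: python pytorch_estimator.py --cluster_mode yarn --num_nodes 2 --cores 44
sc = init_orca_context(cluster_mode="yarn", cores=44, num_nodes=2)

# ... build the Estimator and train (see the sketch further below) ...

stop_orca_context()
```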

If I use the horovod backend, I get the following error:

Rendezvous INFO: HTTP rendezvous server started.
E1028 18:20:40.272358  7702  8383 task_manager.cc:320] Task failed: IOError: 14: Socket closed: Type=ACTOR_TASK, Language=PYTHON, function_descriptor={type=PythonFunctionDescriptor, module_name=zoo.orca.learn.horovod.horovod_ray_runner, class_name=Worker, function_name=setup_horovod, function_hash=}, task_id=55c3b2b635949d8145b95b1c0100, job_id=0100, num_args=0, num_returns=2, actor_task_spec={actor_id=45b95b1c0100, actor_caller_id=ffffffffffffffffffffffff0100, actor_counter=3}
Traceback (most recent call last):
  File "pytorch_estimator.py", line 123, in <module>
    train_example(workers_per_node=args.workers_per_node)
  File "pytorch_estimator.py", line 90, in train_example
    }, backend="horovod")
  File "/opt/work/client/anaconda3/envs/orca-kai/lib/python3.7/site-packages/zoo/orca/learn/pytorch/estimator.py", line 74, in from_torch
    backend=backend)
  File "/opt/work/client/anaconda3/envs/orca-kai/lib/python3.7/site-packages/zoo/orca/learn/pytorch/estimator.py", line 111, in __init__
    workers_per_node=workers_per_node)
  File "/opt/work/client/anaconda3/envs/orca-kai/lib/python3.7/site-packages/zoo/orca/learn/pytorch/pytorch_ray_estimator.py", line 137, in __init__
    for i, worker in enumerate(self.remote_workers)
  File "/opt/work/client/anaconda3/envs/orca-kai/lib/python3.7/site-packages/ray/worker.py", line 1476, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

Running both backends locally works fine.
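
The failing call is the Estimator construction in the example; the two runs differ only in the backend argument (the traceback shows backend="horovod" being forwarded through Estimator.from_torch). A rough sketch for reference, with placeholder creator functions and keyword names assumed from the zoo.orca.learn.pytorch API rather than taken from the example:

```python
# Illustrative sketch only: the creator functions and most keyword names are
# assumptions; the backend argument is the one confirmed by the traceback.
import torch
import torch.nn as nn
from zoo.orca.learn.pytorch import Estimator

def model_creator(config):
    return nn.Linear(1, 1)                     # placeholder model

def optimizer_creator(model, config):
    return torch.optim.SGD(model.parameters(), lr=config.get("lr", 1e-2))

def loss_creator(config):
    return nn.MSELoss()

# backend="pytorch" crashes right after Ray initializes;
# backend="horovod" dies in setup_horovod on the Worker actors.
est = Estimator.from_torch(model=model_creator,
                           optimizer=optimizer_creator,
                           loss=loss_creator,
                           config={"lr": 1e-2},
                           backend="horovod")
```

Switching the backend string between the two runs is the only difference; everything else in the repro is identical.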

yangw1234 (Contributor) commented

Can be verified on Almaren node 002, conda env horovo-pytorch-2.

liu-shaojun transferred this issue from intel/BigDL on Mar 5, 2024