I run https://github.com/intel-analytics/analytics-zoo/blob/master/pyzoo/zoo/examples/orca/learn/horovod/pytorch_estimator.py on the Almaren cluster with the command:
python pytorch_estimator.py --cluster_mode yarn --num_nodes 2 --cores 44
If I use the pytorch backend, the program crashes after initializing Ray:
{'node_ip_address': '172.16.0.102', 'raylet_ip_address': '172.16.0.102', 'redis_address': '172.16.0.177:57129', 'object_store_address': '/tmp/ray/session_2020-10-28_18-17-18_438564_110094/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2020-10-28_18-17-18_438564_110094/sockets/raylet', 'webui_url': 'localhost:8265', 'session_dir': '/tmp/ray/session_2020-10-28_18-17-18_438564_110094'}
(pid=110164, ip=172.16.0.177) Prepending /dir7/yarn/nm_0/usercache/root/appcache/application_1588741658055_1395/container_1588741658055_1395_01_000003/python_env/lib/python3.7/site-packages/bigdl/share/conf/spark-bigdl.conf to sys.path
(pid=110164, ip=172.16.0.177) Prepending /dir7/yarn/nm_0/usercache/root/appcache/application_1588741658055_1395/container_1588741658055_1395_01_000003/python_env/lib/python3.7/site-packages/zoo/share/conf/spark-analytics-zoo.conf to sys.path
(pid=110183, ip=172.16.0.177) Prepending /dir7/yarn/nm_0/usercache/root/appcache/application_1588741658055_1395/container_1588741658055_1395_01_000003/python_env/lib/python3.7/site-packages/bigdl/share/conf/spark-bigdl.conf to sys.path
(pid=110183, ip=172.16.0.177) Prepending /dir7/yarn/nm_0/usercache/root/appcache/application_1588741658055_1395/container_1588741658055_1395_01_000003/python_env/lib/python3.7/site-packages/zoo/share/conf/spark-analytics-zoo.conf to sys.path
If I use the horovod backend, I get the following error:
Rendezvous INFO: HTTP rendezvous server started.
E1028 18:20:40.272358 7702 8383 task_manager.cc:320] Task failed: IOError: 14: Socket closed: Type=ACTOR_TASK, Language=PYTHON, function_descriptor={type=PythonFunctionDescriptor, module_name=zoo.orca.learn.horovod.horovod_ray_runner, class_name=Worker, function_name=setup_horovod, function_hash=}, task_id=55c3b2b635949d8145b95b1c0100, job_id=0100, num_args=0, num_returns=2, actor_task_spec={actor_id=45b95b1c0100, actor_caller_id=ffffffffffffffffffffffff0100, actor_counter=3}
Traceback (most recent call last):
File "pytorch_estimator.py", line 123, in <module>
train_example(workers_per_node=args.workers_per_node)
File "pytorch_estimator.py", line 90, in train_example
}, backend="horovod")
File "/opt/work/client/anaconda3/envs/orca-kai/lib/python3.7/site-packages/zoo/orca/learn/pytorch/estimator.py", line 74, in from_torch
backend=backend)
File "/opt/work/client/anaconda3/envs/orca-kai/lib/python3.7/site-packages/zoo/orca/learn/pytorch/estimator.py", line 111, in __init__
workers_per_node=workers_per_node)
File "/opt/work/client/anaconda3/envs/orca-kai/lib/python3.7/site-packages/zoo/orca/learn/pytorch/pytorch_ray_estimator.py", line 137, in __init__
for i, worker in enumerate(self.remote_workers)
File "/opt/work/client/anaconda3/envs/orca-kai/lib/python3.7/site-packages/ray/worker.py", line 1476, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
Running both backends locally works fine.
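For reference, the failing call corresponds to Estimator.from_torch(..., backend="horovod") in the example script. Below is a minimal sketch of that call pattern, written only to illustrate the setup being reported; the creator functions, config contents, and init_orca_context arguments are placeholders assumed for illustration, not copied from pytorch_estimator.py:

```python
# Minimal sketch of the call pattern implied by the traceback above.
# The model/optimizer creators and config keys are placeholders, not the
# actual definitions from the example script.
import torch
import torch.nn as nn

from zoo.orca import init_orca_context, stop_orca_context
from zoo.orca.learn.pytorch import Estimator


def model_creator(config):
    # Placeholder network; the real example defines its own model.
    return nn.Linear(1, 1)


def optimizer_creator(model, config):
    return torch.optim.SGD(model.parameters(), lr=config.get("lr", 1e-2))


# Mirrors the command-line flags used above; the exact cluster_mode string
# ("yarn" vs "yarn-client") and argument names may differ between
# analytics-zoo versions.
init_orca_context(cluster_mode="yarn", cores=44, num_nodes=2)

estimator = Estimator.from_torch(
    model=model_creator,
    optimizer=optimizer_creator,
    loss=nn.MSELoss(),
    config={"lr": 1e-2},
    backend="horovod")  # the failing case; backend="pytorch" crashes earlier

# estimator.fit(...) omitted; the crash already happens during construction.
stop_orca_context()
```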