
Issue 12 #13

Merged: 3 commits merged into IBMSpectrumComputing:main on May 14, 2024
Conversation

takaomoriyama (Contributor)

This is a fix for issue #12
Signed-off-by: Takao Moriyama [[email protected]]

asm582 (Collaborator) commented Feb 1, 2023

@takaomoriyama thanks, can you please mention how num_gpus is calculated as a comment in the same file?

asm582 (Collaborator) commented Feb 1, 2023

Another question, do we need to update the documentation for launching ray clusters?

takaomoriyama (Contributor, Author)

> @takaomoriyama thanks, can you please mention how num_gpus is calculated as a comment in the same file?

Determining the number of GPUs available on each host from LSF information is hard to do in ray_launch_cluster.sh, so I decided to let the ray start command detect the number of GPUs by itself. I added a short comment in the script file.
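
For illustration only, a minimal sketch of the idea, assuming the head node is started roughly like the line below (the variable num_cpus_per_node and the exact flags are placeholders, not the literal contents of ray_launch_cluster.sh):

# Hypothetical excerpt: no --num-gpus is passed, so `ray start`
# detects the GPUs visible on this host by itself instead of using
# a value derived from LSF.
ray start --head --port=6379 --num-cpus="$num_cpus_per_node"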

takaomoriyama (Contributor, Author)

> Another question, do we need to update the documentation for launching ray clusters?

Yes, I agree. The example currently described in the README.md is a special case, where the number of cores on each node (slot or task) is always one, and no two nodes run on the same host. I think more general cases need to be mentioned in README.md.
As a temporary measure, I added a special case (4 nodes, each node with 7 cores and 2 GPUs) at the beginning of the ray_launch_cluster.sh file as follows:

# Examples
# bsub -n 2 -R "span[ptile=1] rusage[mem=4GB]" -gpu "num=2" -o std.out%J -e std.out%J ray_launch_cluster.sh -n conda_env -c "workload args..."
#   Run a workload on two nodes. Each node has a single core and 2 GPUs. Nodes are placed on separate hosts.
# bsub -n 4 -R "affinity[core(7,same=socket)]" -gpu "num=2/task" -o std.out%J -e std.out%J ray_launch_cluster.sh -n conda_env -c "workload args..."
#   Run a workload on 4 nodes. Each node has 7 cores and 2 GPUs.
#   "/task" in the GPU option is necessary because multiple nodes may run on the same host. Otherwise the 2 GPUs on a host would be shared by all nodes (tasks) on that host.
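
As a sanity check (not part of this change), once the cluster is up the resources Ray actually detected can be inspected from the head node with the standard CLI:

# Prints the CPUs/GPUs Ray registered for the cluster.
ray status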

takaomoriyama (Contributor, Author)

One more thing included in this PR is a new setting of the RAY_TMPDIR environment variable. Without this, multiple users on the same host would suffer from conflicting access to the /tmp/ray/ directory.

# Use a per-user temporary directory to avoid clashes in /tmp/ray/
export RAY_TMPDIR="/tmp/ray-$USER"
echo "RAY_TMPDIR=$RAY_TMPDIR"
mkdir -p "$RAY_TMPDIR"
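
If the environment variable alone does not redirect the session directory everywhere, the same per-user path could also be passed explicitly when starting the head node; this is only a suggestion for consideration, not something this PR does:

# Hypothetical alternative: hand the per-user directory to Ray explicitly.
ray start --head --temp-dir="$RAY_TMPDIR"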

asm582 (Collaborator) commented Apr 14, 2023

@takaomoriyama do we intend to do more work on this PR?

asm582 merged commit 13eaba6 into IBMSpectrumComputing:main on May 14, 2024