
Issue 12 #13

Merged: 3 commits merged into IBMSpectrumComputing:main on May 14, 2024
Conversation

takaomoriyama (Contributor)

This is a fix for issue #12
Signed-off-by: Takao Moriyama [[email protected]]

asm582 (Collaborator) commented Feb 1, 2023

@takaomoriyama thanks, can you please mention how num_gpus is calculated as a comment in the same file?

asm582 (Collaborator) commented Feb 1, 2023

Another question, do we need to update the documentation for launching ray clusters?

takaomoriyama (Contributor, Author)

> @takaomoriyama thanks, can you please mention how num_gpus is calculated as a comment in the same file?

Determining the number of GPUs available on each host from LSF information is hard to do in ray_launch_cluster.sh, so I decided to let the ray start command detect the number of GPUs by itself. I added a short comment in the script file.
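
For illustration only, a minimal sketch of the idea, assuming the head node is started roughly like the line below (the variable num_cpus_per_node and the exact flags are placeholders, not the literal contents of ray_launch_cluster.sh):

# Hypothetical excerpt: no --num-gpus is passed, so `ray start`
# detects the GPUs visible on this host by itself instead of using
# a value derived from LSF.
ray start --head --port=6379 --num-cpus="$num_cpus_per_node"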

takaomoriyama (Contributor, Author)

> Another question, do we need to update the documentation for launching ray clusters?

Yes, I agree. The example currently described in the README.md is a special case, where the number of cores on each node (slot or task) is always one, and no two nodes run on the same host. I think more general cases need to be mentioned in README.md.
As a temporary measure, I added a special case (4 nodes, each node with 7 cores and 2 GPUs) at the beginning of the ray_launch_cluster.sh file as follows:

# Examples
# bsub -n 2 -R "span[ptile=1] rusage[mem=4GB]" -gpu "num=2" -o std.out%J -e std.out%J ray_launch_cluster.sh -n conda_env -c "workload args..."
#   Run a workload on two nodes. Each node has a single core and 2 GPUs. Nodes are placed on separate hosts.
# bsub -n 4 -R "affinity[core(7,same=socket)]" -gpu "num=2/task" -o std.out%J -e std.out%J ray_launch_cluster.sh -n conda_env -c "workload args..."
#   Run a workload on 4 nodes. Each node has 7 cores and 2 GPUs.
#   "/task" in the GPU option is necessary because multiple nodes may run on the same host. Otherwise the 2 GPUs on a host would be shared by all nodes (tasks) on that host.
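
As a sanity check (not part of this change), once the cluster is up the resources Ray actually detected can be inspected from the head node with the standard CLI:

# Prints the CPUs/GPUs Ray registered for the cluster.
ray status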

takaomoriyama (Contributor, Author)

One more thing included in this PR is a new setting of the RAY_TMPDIR environment variable. Without this, multiple users on the same host would suffer from conflicting access to the /tmp/ray/ directory.

# Use a per-user temporary directory to avoid clashes in /tmp/ray/
export RAY_TMPDIR="/tmp/ray-$USER"
echo "RAY_TMPDIR=$RAY_TMPDIR"
mkdir -p "$RAY_TMPDIR"
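
If the environment variable alone does not redirect the session directory everywhere, the same per-user path could also be passed explicitly when starting the head node; this is only a suggestion for consideration, not something this PR does:

# Hypothetical alternative: hand the per-user directory to Ray explicitly.
ray start --head --temp-dir="$RAY_TMPDIR"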

asm582 (Collaborator) commented Apr 14, 2023

@takaomoriyama do we intend to do more work on this PR?

asm582 merged commit 13eaba6 into IBMSpectrumComputing:main on May 14, 2024