Support setting CUDA_VISIBLE_DEVICES env variable #113
base: master
Conversation
@sjeaugey @AddyLaddy Thank you for looking at this previously; I'd like to continue the discussion on this PR if possible.
Hi Ryan. As I mentioned, the code was old and I didn't want to base your changes on that old version, as it would not work well when we want to update it.
NCCL and NCCL-tests have been updated; I also updated this PR to resolve build errors against the previous code.
Thank you for the update. Given the recent updates of NCCL and NCCL Tests, is this getting to a point where it could move forward?
Sorry, the changes I wanted were missed in the update, so I'm working on pushing another patch.
Thank you for the update!
I updated the NCCL tests. Now there is no need to replicate the computation of GPU numbers, as it's computed once at the beginning. Plus, NCCL_TESTS_DEVICE supports multiple GPUs. So I looked again at the goal and the pros/cons of the different approaches.

The problem is that we want to be able to work in an environment where CUDA_VISIBLE_DEVICES is set to restrict the set of GPUs each rank sees (each rank seeing different devices), while the NCCL tests assume each rank sees the same set of GPUs and hence compute absolute CUDA device indices for each rank. The two approaches are:
I did not find a case where the second approach would not be able to do everything approach 1 does. However, there is still the case I mentioned before where option 2 works better than option 1: when we set CUDA_VISIBLE_DEVICES to multiple GPUs but each rank does not use all of them. So I'm still not sure that the modulo is the right approach, and setting […]. Is there something I'm missing?
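For reference, here is a minimal sketch of the two strategies as I understand them. This is not the actual nccl-tests code: `localRank`, `gpusPerRank`, and both helper names are illustrative assumptions.

```c
/* Minimal sketch, not the actual nccl-tests code: localRank, gpusPerRank and
 * the helper names below are illustrative assumptions. */
#include <cuda_runtime.h>
#include <stdio.h>

/* Approach 1 (the modulo): keep computing a per-rank base index and wrap it
 * so it stays within whatever CUDA_VISIBLE_DEVICES exposes to this process. */
static int pickDeviceModulo(int localRank, int gpusPerRank, int gpuIdx) {
  int nVisible = 0;
  cudaGetDeviceCount(&nVisible);   /* counts only the devices visible to this rank */
  return (localRank * gpusPerRank + gpuIdx) % nVisible;
}

/* Approach 2: rely on CUDA_VISIBLE_DEVICES (or NCCL_TESTS_DEVICE) having
 * already restricted this rank to its own GPUs, and index into them directly. */
static int pickDeviceDirect(int gpuIdx) {
  return gpuIdx;                   /* 0 .. gpusPerRank-1 within the visible set */
}

int main(void) {
  /* e.g. the second of two ranks, with 2 GPUs per rank */
  for (int i = 0; i < 2; i++)
    printf("gpu %d: modulo picks device %d, direct picks device %d\n",
           i, pickDeviceModulo(1, 2, i), pickDeviceDirect(i));
  return 0;
}
```

With approach 2, each rank only ever asks for devices 0..gpusPerRank-1 inside its own visible set, so it does not depend on how the other ranks' environments are configured.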
I'm resubmitting this, as an update to #105
I attempted to validate the patch in light of the comments on that PR. Please let me know if I'm misunderstanding.
To evaluate this comment:
"For example, if you run with 2 ranks, and set CUDA_VISIBLE_DEVICES to "0,1,2,3" on one rank and "4,5,6,7" and the other rank, then launch with only 2 GPUs per rank, users would expect to use 0,1 and 4,5 but instead we'd use 0,1 and 6,7."
I applied the proposed patch and added logging (zerodev) to show the offset of the actual device from zero. I launched with two nodes, one process per node, and two GPUs per process, which results in 2 GPUs per rank.
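To make the quoted numbers concrete, here is the arithmetic, assuming the patched selection behaves like `(localRank * gpusPerRank + i) % visibleCount` with a node-local rank; that formula is my reading of the comment, not a quote of the patch:

```
Assumed:  dev = (localRank * gpusPerRank + i) % visibleCount

Two ranks sharing a node (local ranks 0 and 1, visibleCount = 4):
  rank 0, CUDA_VISIBLE_DEVICES=0,1,2,3:  (0*2+0)%4=0, (0*2+1)%4=1  -> physical GPUs 0,1
  rank 1, CUDA_VISIBLE_DEVICES=4,5,6,7:  (1*2+0)%4=2, (1*2+1)%4=3  -> physical GPUs 6,7

One process per node (local rank 0 on both, visibleCount = 4):
  rank 0:  devices 0,1  -> physical GPUs 0,1
  rank 1:  devices 0,1  -> physical GPUs 4,5
```

If that reading is right, the one-process-per-node launch described above always sees a local rank of 0, which would explain the zero offset reported by the zerodev logging.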
To evaluate this comment:
"Also, today if you run on a 4 GPUs system and launch 8 ranks or one rank with -g 8 it will error out as "invalid CUDA numeral" or something similar. With this patch it would try to re-use each GPU twice (and generate and error later during NCCL init)."
When I run the patch with four devices but 8 GPUs requested, I get an argument error.
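As a sketch of the difference in failure modes being discussed here (an illustration under my own assumptions, not the actual nccl-tests argument handling):

```c
/* Hypothetical sketch (not the actual nccl-tests code) of the two failure
 * modes: an up-front argument check versus a silent modulo wrap that only
 * fails later during NCCL initialization. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
  int requested = 8;                      /* e.g. one rank with -g 8 */
  int available = 0;
  cudaGetDeviceCount(&available);         /* 4 on a 4-GPU system */

  if (requested > available) {
    /* Fail fast: this is the "argument error" behaviour. */
    fprintf(stderr, "requested %d GPUs but only %d are visible\n",
            requested, available);
    return 1;
  }

  /* If the check above were absent, a modulo-based selection would map
   * requested GPUs 4..7 back onto devices 0..3, and the duplicated devices
   * would only be rejected later, during NCCL init. */
  for (int i = 0; i < requested; i++)
    printf("GPU slot %d -> CUDA device %d\n", i, i % available);
  return 0;
}
```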