Support setting CUDA_VISIBLE_DEVICES env variable #113

ryanamazon · 2022-08-19T16:48:45Z

I'm resubmitting this, as an update to #105

I attempted to validate the patch in light of the comments on that PR. Please let me know if I'm misunderstanding.

To evaluate this comment:
"For example, if you run with 2 ranks, and set CUDA_VISIBLE_DEVICES to "0,1,2,3" on one rank and "4,5,6,7" and the other rank, then launch with only 2 GPUs per rank, users would expect to use 0,1 and 4,5 but instead we'd use 0,1 and 6,7."

I applied the proposed patch and added logging (zerodev) to show the offset of the actual device from zero. I launched with two nodes, one process per node, and two GPUs per process, which results in two 2 GPUs per rank.

ubuntu@ip-172-31-30-49:~$ ./dompi
ip-172-31-21-32 CUDA_VISIBLE_DEVICES=4,5,6,7
ip-172-31-30-49 CUDA_VISIBLE_DEVICES=0,1,2,3
# nThread 1 nGpus 2 minBytes 8 maxBytes 8 step: 2(factor) warmup iters: 5 iters: 100 validation: 1
#
# Using devices
#   Rank  0 Pid  36533 on ip-172-31-30-49 device  0 [0x00] devid=0x16 zerodev=0 domid=0x0 Tesla V100-SXM2-32GB
#   Rank  1 Pid  36533 on ip-172-31-30-49 device  1 [0x00] devid=0x17 zerodev=1 domid=0x0 Tesla V100-SXM2-32GB
#   Rank  2 Pid  33645 on ip-172-31-21-32 device  0 [0x00] devid=0x1a zerodev=4 domid=0x0 Tesla V100-SXM2-32GB
#   Rank  3 Pid  33645 on ip-172-31-21-32 device  1 [0x00] devid=0x1b zerodev=5 domid=0x0 Tesla V100-SXM2-32GB
#
#                                                       out-of-place                       in-place
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     sum    163.9    0.00    0.00  1e-07    163.3    0.00    0.00  0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 7.33647e-05

To evaluate this comment:
"Also, today if you run on a 4 GPUs system and launch 8 ranks or one rank with -g 8 it will error out as "invalid CUDA numeral" or something similar. With this patch it would try to re-use each GPU twice (and generate and error later during NCCL init)."

When I run the patch with four devices but a GPUs requested, I get an argument error.

ubuntu@ip-172-31-30-49:~$ CUDA_VISIBLE_DEVICES=7,6,5,4 nccl-tests/build/all_reduce_perf -b 8 -e 8 -f 2 -g 8 -c 1 -n 100
invalid number of GPUs specified (8), only for 4 GPUs exist

ryanamazon · 2022-08-29T22:56:21Z

@sjeaugey @AddyLaddy Thank you for looking at this previously; I'd like to continue the discussion on this PR if possible.

sjeaugey · 2022-08-31T07:37:52Z

Hi Ryan. As I mentioned the code was old and I didn't want to base your changes on that old version as it would not work well when we want to update it.
That's why we didn't make progress on this PR: we were preparing an update so that we could work on current code. Sorry it took us some time, but it's done now.
I'll try to revisit this PR based on the new code soon.

ryanamazon · 2022-09-16T13:46:05Z

NCCL and NCCL-tests have been updated; I also updated this PR to resolve build errors against the previous code.

ryanamazon · 2022-09-23T14:23:59Z

Thank you for the update. Given the recent updates of NCCL and NCCL Tests, is this getting to a point where it could move forward?

sjeaugey · 2022-09-23T14:45:21Z

Sorry, the changes I wanted were missed in the update. So I'm working on pushing another patch.

ryanamazon · 2022-09-23T15:08:57Z

Thank you for the update!

sjeaugey · 2022-09-26T10:06:55Z

I updated the NCCL tests. Now there is no need to replicate the computation of GPU numbers, as it's computed once at the beginning. Plus NCCL_TESTS_DEVICE supports multiple GPUs.

So I looked again at the goal and pros/cons of the different approaches. The problem is that we want to be able to work in an environment where CUDA_VISIBLE_DEVICES is set to restrict the set of GPUs each rank is seeing (each rank seeing different devices) while the NCCL tests assume each rank sees the same set of GPUs, hence compute absolute CUDA devices for each rank.

The two approaches are:

Detect the number of visible CUDA devices and apply a modulo to the computed device number.
Set NCCL_TESTS_DEVICE=0 to tell the perf tests to start counting at 0 for all ranks

I did not find a case where the second approach would not be able to do everything 1 does. However, there is still the case I mentioned before where option 2 works better than option 1: when we set CUDA_VISIBLE_DEVICES to multiple GPUs but each rank does not use all GPUs.
For example: we partitioned the system between two ranks, so that rank 0 has CUDA_VISIBLE_DEVICES set to 0,1,2,3 while rank 1 has 4,5,6,7. In that case, if we run the tests with -g 1 or -g 2, instead of using its first 1 or 2 GPUs, rank 1 will be using GPU 5 (-g1) or GPUs 6,7 (-g2). With NCCL_TESTS_DEVICE=0, rank 1 would be using GPU 4, or GPUs 4,5, which I believe is what a user would expect.

So I'm still not sure that the modulo is the right approach, and setting NCCL_TESTS_DEVICE=0 seems like a normal way to tell the tests that each rank is running inside a different CUDA_VISIBLE_DEVICES set.

Is there something I'm missing?

ryanamazon closed this Sep 16, 2022

ryanamazon force-pushed the master branch from 1c00538 to 749573f Compare September 16, 2022 13:40

Support setting CUDA_VISIBLE_DEVICES env variable

9b9156f

ryanamazon reopened this Sep 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support setting CUDA_VISIBLE_DEVICES env variable #113

Support setting CUDA_VISIBLE_DEVICES env variable #113

ryanamazon commented Aug 19, 2022 •

edited

Loading

ryanamazon commented Aug 29, 2022

sjeaugey commented Aug 31, 2022

ryanamazon commented Sep 16, 2022 •

edited

Loading

ryanamazon commented Sep 23, 2022

sjeaugey commented Sep 23, 2022

ryanamazon commented Sep 23, 2022

sjeaugey commented Sep 26, 2022 •

edited

Loading

Support setting CUDA_VISIBLE_DEVICES env variable #113

Are you sure you want to change the base?

Support setting CUDA_VISIBLE_DEVICES env variable #113

Conversation

ryanamazon commented Aug 19, 2022 • edited Loading

ryanamazon commented Aug 29, 2022

sjeaugey commented Aug 31, 2022

ryanamazon commented Sep 16, 2022 • edited Loading

ryanamazon commented Sep 23, 2022

sjeaugey commented Sep 23, 2022

ryanamazon commented Sep 23, 2022

sjeaugey commented Sep 26, 2022 • edited Loading

ryanamazon commented Aug 19, 2022 •

edited

Loading

ryanamazon commented Sep 16, 2022 •

edited

Loading

sjeaugey commented Sep 26, 2022 •

edited

Loading