
Fix CUDA_VISIBLE_DEVICES tests #638

Merged

Conversation

pentschev
Member

After recent changes in Distributed, particularly dask/distributed#4866, worker processes will now attempt to get information from PyNVML based on the index specified in CUDA_VISIBLE_DEVICES. Some of our tests purposely use device numbers that may not exist on some systems (e.g., gpuCI, where only a single GPU is available) to ensure that each worker's CUDA_VISIBLE_DEVICES indeed respects the ordering of dask_cuda.utils.cuda_visible_devices. The changes here introduce a new MockWorker class that monkey-patches the NVML usage of distributed.Worker, which can then be used to return those tests to a working state.
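For illustration, a minimal sketch of the monkey-patching idea described above, assuming distributed exposes the device count through distributed.diagnostics.nvml.device_get_count; the actual MockWorker added by this PR may differ in its details:

```python
from distributed import Worker
from distributed.diagnostics import nvml


class MockWorker(Worker):
    """Worker that pretends NVML sees no devices (illustrative sketch).

    This lets tests set CUDA_VISIBLE_DEVICES to arbitrary indices (e.g. "2,3")
    on machines with a single GPU, without the worker's system monitor failing
    when it queries PyNVML for a device that does not exist.
    """

    def __init__(self, *args, **kwargs):
        # Save the real query so it can be restored when the worker goes away.
        self._original_device_get_count = nvml.device_get_count
        nvml.device_get_count = lambda: 0  # pretend no GPUs are visible
        super().__init__(*args, **kwargs)

    def __del__(self):
        # Restore the original NVML query.
        nvml.device_get_count = self._original_device_get_count
```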

@pentschev pentschev requested a review from a team as a code owner June 3, 2021 22:23
@github-actions github-actions bot added the python python code needed label Jun 3, 2021
@pentschev pentschev added 3 - Ready for Review Ready for review by team bug Something isn't working non-breaking Non-breaking change labels Jun 3, 2021
@pentschev
Member Author

Failures are unrelated: #637 (comment); it probably still needs rapidsai/cudf#8426 to be merged.

@galipremsagar
Contributor

rerun tests

@quasiben quasiben changed the base branch from branch-21.06 to branch-21.08 June 4, 2021 03:04
@quasiben
Member

quasiben commented Jun 4, 2021

rerun tests

@quasiben
Member

quasiben commented Jun 4, 2021

I retargeted this PR to 21.08

@pentschev
Member Author

rerun tests

@pentschev
Member Author

> I retargeted this PR to 21.08

Why? If we're pinning today's Dask release for 21.06, then we need this in.

@pentschev
Member Author

I forgot this requires dask/distributed#4873; CI won't pass until that's merged.

@jrbourbeau
Contributor

dask/distributed#4873 is in now

@pentschev
Member Author

Thanks @jrbourbeau , rerunning!

@pentschev
Member Author

rerun tests

@pentschev
Member Author

I'm still trying to reproduce this. I also attempted running the local CI scripts, but all tests pass there as well on a DGX-1. I think the issue may be due to gpuCI having only a single GPU; I'll try this on a machine with only one GPU.

This is done to ensure the Scheduler and Nanny are started on the first
device, thus avoiding the need to mock those.
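
For context on the ordering mentioned above, here is a small illustrative sketch of the rotation that dask_cuda.utils.cuda_visible_devices performs (the exact signature is assumed here and may differ between dask_cuda versions): each worker sees the device list rotated so that its own device comes first, which is why keeping the scheduler and nanny on the first device avoids mocking them.

```python
from dask_cuda.utils import cuda_visible_devices

devices = [0, 1, 2, 3]
for i in range(len(devices)):
    # Worker i's CUDA_VISIBLE_DEVICES is the device list rotated so that its
    # own device comes first, e.g. "2,3,0,1" for i == 2.
    print(f"worker {i}: CUDA_VISIBLE_DEVICES={cuda_visible_devices(i, devices)}")
```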
@codecov-commenter

codecov-commenter commented Jun 7, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@0c4c59d).
The diff coverage is n/a.

❗ Current head 34f4caa differs from pull request most recent head 5a175f0. Consider uploading reports for the commit 5a175f0 to get more accurate results

@@               Coverage Diff               @@
##             branch-21.08     #638   +/-   ##
===============================================
  Coverage                ?   90.45%           
===============================================
  Files                   ?       15           
  Lines                   ?     1645           
  Branches                ?        0           
===============================================
  Hits                    ?     1488           
  Misses                  ?      157           
  Partials                ?        0           

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@pentschev
Member Author

rerun tests

@quasiben
Member

quasiben commented Jun 8, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit c7d7795 into rapidsai:branch-21.08 Jun 8, 2021
@pentschev
Member Author

Thanks @quasiben !

@pentschev pentschev deleted the fix-cuda-visible-devices-nvml branch June 29, 2021 21:25