Fix CUDA_VISIBLE_DEVICES tests #638
The failures are unrelated (see #637 (comment)); this probably still needs rapidsai/cudf#8426 to be merged.
rerun tests
rerun tests
I retargeted this PR to 21.08.
rerun tests
Why? If we're pinning today's Dask release for 21.06, then we need this in.
I forgot that this requires dask/distributed#4873; CI won't pass until that's merged.
dask/distributed#4873 is in now.
Thanks @jrbourbeau, rerunning!
rerun tests
I'm still trying to reproduce this. I also tried running the local CI scripts, but all tests pass there as well on a DGX-1. I think the issue may be due to gpuCI having only a single GPU, so I'll try a machine with only one GPU.
This is done to ensure the Scheduler and Nanny are started on the first device, thus avoiding the need to mock those.
Codecov Report

| Coverage Diff | branch-21.08 | #638 | +/- |
|---|---|---|---|
| Coverage | ? | 90.45% | |
| Files | ? | 15 | |
| Lines | ? | 1645 | |
| Branches | ? | 0 | |
| Hits | ? | 1488 | |
| Misses | ? | 157 | |
| Partials | ? | 0 | |
rerun tests
@gpucibot merge
Thanks @quasiben!
After recent changes in Distributed, particularly dask/distributed#4866, worker processes now attempt to get information from PyNVML based on the index specified in `CUDA_VISIBLE_DEVICES`. Some of our tests purposely use device numbers that may not exist on some systems (e.g., gpuCI, where only a single GPU is available) to ensure that the `CUDA_VISIBLE_DEVICES` of each worker respects the ordering of `dask_cuda.utils.cuda_visible_devices`. The changes here introduce a new `MockWorker` class that monkey-patches the NVML usage of `distributed.Worker`, which can then be used to return those tests to a working state.
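
The monkey-patching idea described above can be illustrated with a minimal, self-contained sketch. It does not import Distributed or PyNVML: `fake_nvml`, its `device_get_count` attribute, and the `Worker` stand-in below are all hypothetical names invented for illustration, and the code in this PR may differ. The sketch only assumes that a worker looks up a GPU by the first index in `CUDA_VISIBLE_DEVICES`, and that reporting zero devices makes it skip that lookup.

```python
import types

# Hypothetical stand-in for an NVML helper module (not the real Distributed API).
# It pretends the machine has exactly one GPU, as on a single-GPU CI node.
fake_nvml = types.SimpleNamespace(device_get_count=lambda: 1)


class Worker:
    """Hypothetical worker that queries NVML for its first visible device."""

    def __init__(self, visible_devices):
        self.visible_devices = visible_devices
        first_index = int(visible_devices.split(",")[0])
        count = fake_nvml.device_get_count()
        if count == 0:
            # No devices reported: skip GPU monitoring entirely.
            self.gpu_index = None
        elif first_index >= count:
            # The index from CUDA_VISIBLE_DEVICES does not exist on this machine.
            raise IndexError(f"GPU index {first_index} not found ({count} present)")
        else:
            self.gpu_index = first_index


class MockWorker(Worker):
    """Worker whose NVML device count is patched to zero during start-up."""

    def __init__(self, *args, **kwargs):
        original = fake_nvml.device_get_count
        # Monkey-patch: report zero devices so the index lookup is skipped.
        fake_nvml.device_get_count = lambda: 0
        try:
            super().__init__(*args, **kwargs)
        finally:
            # Always restore the original function, even if start-up fails.
            fake_nvml.device_get_count = original


# A device ordering that starts with a GPU the machine does not have.
worker = MockWorker("3,0,1,2")
print(worker.visible_devices, worker.gpu_index)
```

With the plain `Worker`, the same `"3,0,1,2"` ordering raises `IndexError` in this sketch, which mirrors why tests using nonexistent device numbers would fail on a single-GPU machine. Restoring the patched function in a `finally` block is a design choice of the sketch, intended to keep the patch from leaking into other tests.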