
Fix CUDA_VISIBLE_DEVICES tests #638

Merged

Conversation

pentschev
Member

After recent changes in Distributed, particularly dask/distributed#4866, worker processes will now attempt to get information from PyNVML based on the index specified in CUDA_VISIBLE_DEVICES. Some of our tests purposely use device numbers that may not exist on some systems (e.g., gpuCI, where only a single GPU is available) to ensure that each worker's CUDA_VISIBLE_DEVICES indeed respects the ordering of dask_cuda.utils.cuda_visible_devices. The changes here introduce a new MockWorker class that monkey-patches the NVML usage of distributed.Worker, which can then be used to return those tests to a working state.
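For illustration, a minimal sketch of the monkey-patching idea described above, assuming distributed exposes the device count through distributed.diagnostics.nvml.device_get_count; the actual MockWorker added by this PR may differ in its details:

```python
from distributed import Worker
from distributed.diagnostics import nvml


class MockWorker(Worker):
    """Worker that pretends NVML sees no devices (illustrative sketch).

    This lets tests set CUDA_VISIBLE_DEVICES to arbitrary indices (e.g. "2,3")
    on machines with a single GPU, without the worker's system monitor failing
    when it queries PyNVML for a device that does not exist.
    """

    def __init__(self, *args, **kwargs):
        # Save the real query so it can be restored when the worker goes away.
        self._original_device_get_count = nvml.device_get_count
        nvml.device_get_count = lambda: 0  # pretend no GPUs are visible
        super().__init__(*args, **kwargs)

    def __del__(self):
        # Restore the original NVML query.
        nvml.device_get_count = self._original_device_get_count
```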

@pentschev pentschev requested a review from a team as a code owner June 3, 2021 22:23
@github-actions github-actions bot added the python python code needed label Jun 3, 2021
@pentschev pentschev added 3 - Ready for Review Ready for review by team bug Something isn't working non-breaking Non-breaking change labels Jun 3, 2021
@pentschev
Member Author

Failures are unrelated: #637 (comment); it probably still needs rapidsai/cudf#8426 to be merged.

@galipremsagar
Contributor

rerun tests

@quasiben quasiben changed the base branch from branch-21.06 to branch-21.08 June 4, 2021 03:04
@quasiben
Member

quasiben commented Jun 4, 2021

rerun tests

@quasiben
Member

quasiben commented Jun 4, 2021

I retargeted this PR to 21.08

@pentschev
Member Author

rerun tests

@pentschev
Member Author

> I retargeted this PR to 21.08

Why? If we're pinning today's Dask release for 21.06, then we need this in.

@pentschev
Member Author

I forgot this requires dask/distributed#4873; CI won't pass until that's merged.

@jrbourbeau
Contributor

dask/distributed#4873 is in now

@pentschev
Member Author

Thanks @jrbourbeau , rerunning!

@pentschev
Member Author

rerun tests

@pentschev
Member Author

I'm still trying to reproduce this. I also attempted running the local CI scripts, but all tests pass there as well on a DGX-1. I think the issue may be due to gpuCI having only a single GPU; I'll try this on a machine with only one GPU.

This is done to ensure the Scheduler and Nanny are started on the first
device, thus avoiding the need to mock those.
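
For context on the ordering mentioned above, here is a small illustrative sketch of the rotation that dask_cuda.utils.cuda_visible_devices performs (the exact signature is assumed here and may differ between dask_cuda versions): each worker sees the device list rotated so that its own device comes first, which is why keeping the scheduler and nanny on the first device avoids mocking them.

```python
from dask_cuda.utils import cuda_visible_devices

devices = [0, 1, 2, 3]
for i in range(len(devices)):
    # Worker i's CUDA_VISIBLE_DEVICES is the device list rotated so that its
    # own device comes first, e.g. "2,3,0,1" for i == 2.
    print(f"worker {i}: CUDA_VISIBLE_DEVICES={cuda_visible_devices(i, devices)}")
```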
@codecov-commenter

codecov-commenter commented Jun 7, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@0c4c59d).
The diff coverage is n/a.

❗ Current head 34f4caa differs from pull request most recent head 5a175f0. Consider uploading reports for the commit 5a175f0 to get more accurate results

@@               Coverage Diff               @@
##             branch-21.08     #638   +/-   ##
===============================================
  Coverage                ?   90.45%           
===============================================
  Files                   ?       15           
  Lines                   ?     1645           
  Branches                ?        0           
===============================================
  Hits                    ?     1488           
  Misses                  ?      157           
  Partials                ?        0           

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@pentschev
Member Author

rerun tests

@quasiben
Member

quasiben commented Jun 8, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit c7d7795 into rapidsai:branch-21.08 Jun 8, 2021
@pentschev
Member Author

Thanks @quasiben !

@pentschev pentschev deleted the fix-cuda-visible-devices-nvml branch June 29, 2021 21:25