Handling CUDA_VISIBLE_DEVICES #24

Open
cwpearson opened this issue Aug 18, 2020 · 0 comments

Comments

cwpearson (Owner) commented Aug 18, 2020

On some platforms (e.g. OLCF Summit), MPI ranks' visibility of GPUs is typically restricted with CUDA_VISIBLE_DEVICES.
We currently require that all ranks be able to see all GPUs, so we can detect GPU distance, for example:

// recover the cuda device ID for this component
const int di = globalCudaIds[ranks[ri] * gpusPerRank + gi];
const int dj = globalCudaIds[ranks[rj] * gpusPerRank + gj];
bandwidth[ci][cj] = gpu_topo::bandwidth(di, dj);

If every rank's visible GPU has ID 0, our GPU topology code will treat all of those GPUs as the same device, since from any particular rank's point of view GPU 0 is simply GPU 0.
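
A minimal sketch of the failure mode, assuming the global table is built with an MPI_Allgather of each rank's local device ordinals (this gathering code is illustrative, not the project's actual implementation):

// Illustrative only (not the project's code): gather each rank's local CUDA
// device ordinals into a table indexed as globalCudaIds[rank * gpusPerRank + g].
// Under a restrictive CUDA_VISIBLE_DEVICES, every rank contributes ordinal 0,
// so the table cannot distinguish physical GPUs.
#include <mpi.h>
#include <vector>

std::vector<int> gatherCudaIds(MPI_Comm comm, int gpusPerRank) {
  int worldSize = 0;
  MPI_Comm_size(comm, &worldSize);

  std::vector<int> localIds(gpusPerRank);
  for (int g = 0; g < gpusPerRank; ++g) {
    localIds[g] = g; // local ordinal; 0 on every rank if only one GPU is visible
  }

  std::vector<int> globalCudaIds(worldSize * gpusPerRank);
  MPI_Allgather(localIds.data(), gpusPerRank, MPI_INT,
                globalCudaIds.data(), gpusPerRank, MPI_INT, comm);
  return globalCudaIds;
}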

It may be possible to have the ranks report a UUID for each GPU instead of its CUDA device ID, and use that UUID throughout to distinguish GPUs.
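
For example, the CUDA runtime exposes a per-device UUID (cudaDeviceProp::uuid, available since CUDA 10) that identifies the physical GPU regardless of how CUDA_VISIBLE_DEVICES remaps ordinals. One possible way to read it (a sketch, not existing project code):

#include <cuda_runtime.h>
#include <array>

// Read the 16-byte UUID of a visible CUDA device. Ranks could exchange these
// bytes instead of ordinals to identify physical GPUs uniquely.
std::array<unsigned char, 16> deviceUuid(int dev) {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, dev);
  std::array<unsigned char, 16> uuid;
  for (int i = 0; i < 16; ++i) {
    uuid[i] = static_cast<unsigned char>(prop.uuid.bytes[i]);
  }
  return uuid;
}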

Once this is supported, we could also allow users to pin CPU execution to the CPUs with affinity for a particular GPU, which could improve performance.
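
One possible route (again just a sketch, not something the project does today) is NVML's nvmlDeviceGetCpuAffinity, which reports the CPUs local to a GPU; the resulting mask could then be applied with sched_setaffinity:

#include <nvml.h>
#include <sched.h>

// Bind the calling thread to the CPUs that NVML reports as near gpuIndex.
// Error handling omitted; illustrative only.
void bindNearGpu(unsigned int gpuIndex) {
  nvmlInit();
  nvmlDevice_t dev;
  nvmlDeviceGetHandleByIndex(gpuIndex, &dev);

  const unsigned int kWords = 16; // room for 16 * 64 = 1024 logical CPUs
  unsigned long cpuMask[kWords] = {0};
  nvmlDeviceGetCpuAffinity(dev, kWords, cpuMask);

  cpu_set_t set;
  CPU_ZERO(&set);
  for (unsigned int cpu = 0; cpu < kWords * 64; ++cpu) {
    if (cpuMask[cpu / 64] & (1UL << (cpu % 64))) {
      CPU_SET(cpu, &set);
    }
  }
  sched_setaffinity(0, sizeof(set), &set);
  nvmlShutdown();
}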
