Improving detection of CUDA enabled MPI in EasyBuild #14517
Comments
Just to throw another spanner in the works: what if you are not using UCX? There was a recent thread on the mailing list (the link will only work properly once you have acknowledged that you are not a spammer) saying that you wouldn't want to use UCX with an Omnipath interconnect. What is the implication then: no CUDA support possible? (Maybe @bartoldeman has some input here...)
@ocaisa Sorry, I missed that someone had replied to this discussion. Fortunately, I think there is still hope that we can do it with the same design as UCX-CUDA at runtime, using
As far as I know, we have full support for CUDA everywhere. Nothing else to fix here.
Summarizing a long discussion with Arkadiy Daydov on Slack:
Starting with foss/2021a, we dropped fosscuda and use UCX-CUDA to enable CUDA support for existing UCX+OpenMPI installations.
This worked well for OSUMicroBenchmark, but other applications (PyTorch, LAMMPS) are trying to be clever and detect whether CUDA support is enabled.
Unfortunately, there doesn't seem to be a definitive way to do this, so the detection fails for both of these applications.
The root cause seems to be that they both check whether the OPAL backend (which, as far as I understand, only means the BTL stuff in OpenMPI, i.e. the old "smcuda" thing) has CUDA support and, if not, report false.
Since we aren't building smcuda, these correctly report false (but that's not what you really want).
OpenMPI provides a header file with a define and a function for this.
Unfortunately, there is not much we can do about the define, and the function also only checks OPAL, ignoring UCX.
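For reference, Open MPI also exposes a build-time CUDA flag outside the header, through ompi_info: `ompi_info --parsable --all` prints a line for the `mpi_built_with_cuda_support` parameter. Below is a minimal sketch of reading that line. The helper name `built_with_cuda_support` and the sample text are made up for illustration, and I would assume (not verified here) that this flag has the same blind spot for UCX-CUDA as the header check.

```python
import re
from typing import Optional

# Illustrative sample: a single line in the style of `ompi_info --parsable --all`.
# Real output contains many more lines; this one is hand-written, not captured.
SAMPLE_OMPI_INFO = "mca:mpi:base:param:mpi_built_with_cuda_support:value:false\n"


def built_with_cuda_support(ompi_info_text: str) -> Optional[bool]:
    """Return the build-time CUDA flag, or None if the parameter is not listed.

    Hypothetical helper: it only parses text, it does not run ompi_info itself.
    """
    match = re.search(
        r"^mca:mpi:base:param:mpi_built_with_cuda_support:value:(\w+)$",
        ompi_info_text,
        flags=re.MULTILINE,
    )
    if match is None:
        return None
    return match.group(1).lower() == "true"


if __name__ == "__main__":
    print(built_with_cuda_support(SAMPLE_OMPI_INFO))
```

To try it on a real installation, the text could come from `subprocess.run(["ompi_info", "--parsable", "--all"], capture_output=True, text=True).stdout`. A result of false here says nothing about whether UCX-CUDA is loaded at runtime.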
This function predates UCX, and what to do with it is discussed here:
open-mpi/ompi#7963
and they actually write that
This issue was closed after the PR:
open-mpi/ompi#7970
but this function still returns 0 if the old smcuda support isn't there, and this comment worries me.
I'm not sure what these limits would be, but in my mind we pretty much always use UCX now, and the UCX PML excludes the possibility of using any BTL, so OPAL is dead now and ucx-cuda is the only thing that matters.
Patching LAMMPS (Kokkos) to forcibly enable the "gpu-aware" code seems to work fine, so every application I'm aware of so far (which isn't many) only needs or expects ucx-cuda.
So, either MPIX_Query_cuda_support is wrong, or it's not fine-grained enough to give the applications the information they need?
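To make the "not fine-grained enough" option concrete, here is a rough sketch of what a combined check could look like: treat MPI as GPU-aware if either OPAL reports CUDA support, or the UCX PML is in use and UCX lists a CUDA transport. Everything here is an assumption for illustration: the function names, the idea of scanning `ucx_info -d` output for "Transport:" lines, and the set of transport names (cuda_copy, cuda_ipc, gdr_copy). It is not how OpenMPI, PyTorch or LAMMPS actually decide.

```python
import re
from typing import Set

# Assumed names of the CUDA-related UCX transports.
CUDA_TRANSPORTS = {"cuda_copy", "cuda_ipc", "gdr_copy"}

# Hand-written sample in the style of `ucx_info -d` output (heavily shortened).
SAMPLE_UCX_INFO = """\
#      Transport: self
#      Transport: cuda_copy
#      Transport: cuda_ipc
"""


def ucx_transports(ucx_info_text: str) -> Set[str]:
    """Collect every name that follows a 'Transport:' label in the text."""
    return set(re.findall(r"Transport:\s*(\S+)", ucx_info_text))


def gpu_aware(opal_reports_cuda: bool, using_ucx_pml: bool, ucx_info_text: str) -> bool:
    """Hypothetical combined check: the OPAL flag OR (UCX PML with a CUDA transport)."""
    if opal_reports_cuda:
        return True
    has_cuda_transport = bool(ucx_transports(ucx_info_text) & CUDA_TRANSPORTS)
    return using_ucx_pml and has_cuda_transport


if __name__ == "__main__":
    # OPAL says "no CUDA", but UCX is in use and lists CUDA transports.
    print(gpu_aware(False, True, SAMPLE_UCX_INFO))
```

The point of splitting the inputs is that an application could then see which layer provides CUDA support, instead of getting a single 0 or 1 that only reflects the BTL side.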