Improving detection of CUDA enabled MPI in EasyBuild #14517

Closed
Micket opened this issue Dec 8, 2021 · 3 comments

Comments

@Micket
Contributor

Micket commented Dec 8, 2021

Summarizing a long discussion with Arkadiy Daydov on Slack:

Starting with foss/2021a, we dropped fosscuda and use UCX-CUDA to enable CUDA support for existing UCX+OpenMPI installations.

This worked well for OSUMicroBenchmark, but other applications (PyTorch, LAMMPS) try to be clever and detect whether CUDA support is enabled.
Unfortunately, there doesn't seem to be a definitive way to do this, and for both of these applications the detection fails.
The root cause seems to be that they both check whether the OPAL backend (which, as far as I understand, only means the BTL stuff in OpenMPI, i.e. the old "smcuda" thing) has CUDA support, and, if not, report false.
Since we aren't building smcuda, these checks correctly report false (but that's not what you really want).

OpenMPI provides a header file with

#define MPIX_CUDA_AWARE_SUPPORT 0
OMPI_DECLSPEC int MPIX_Query_cuda_support(void);

Unfortunately, there's not much we can do about the define here, but the function also only checks OPAL, ignoring UCX.
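
For reference, a minimal sketch of the detection pattern applications end up using (assuming Open MPI and its mpi-ext.h extension header; this is an illustration, not the actual PyTorch/LAMMPS code):

/* Sketch of the usual CUDA-aware MPI detection with Open MPI. */
#include <stdio.h>
#include <mpi.h>
#if defined(OPEN_MPI) && OPEN_MPI
#include <mpi-ext.h>   /* declares MPIX_CUDA_AWARE_SUPPORT and MPIX_Query_cuda_support() */
#endif

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Compile-time check: reflects only whether OPAL (smcuda) was built with CUDA,
     * so it is 0 for our UCX-CUDA based installations. */
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    printf("compile time: CUDA-aware\n");
#else
    printf("compile time: not CUDA-aware\n");
#endif

    /* Run-time check: also only consults OPAL, ignoring UCX, so it likewise returns 0
     * even when the UCX PML plus UCX-CUDA would handle GPU buffers just fine. */
#if defined(MPIX_CUDA_AWARE_SUPPORT)
    printf("run time: %d\n", MPIX_Query_cuda_support());
#endif

    MPI_Finalize();
    return 0;
}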

This function predates UCX, and what to do with it is discussed here:
open-mpi/ompi#7963
and they actually write that it should report

When UCX is used: whether UCX has CUDA support

This issue was closed after the PR:
open-mpi/ompi#7970
but this function still returns 0 if the old smcuda support isn't built, and this comment worries me:

, it will be 0. because OMPI not compiled with CUDA. you might get limited support UCX/CUDA for pt2pt.

I'm not sure what these limits would be, but in my mind we kind of always use UCX now, and the UCX PML excludes the possibility of using any BTL, so... OPAL is dead now and ucx-cuda is the only thing that matters.
Patching LAMMPS (Kokkos) to just forcibly enable the "gpu-aware" code seems to work fine, so, so far, every application I'm aware of (which isn't many) only needs or expects ucx-cuda.
So, either MPIX_Query_cuda_support is wrong, or it's not fine-grained enough to give applications the information they need?
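
To make it concrete, this is roughly what such a "forcibly enable" override amounts to (a sketch only; the GPU_AWARE_MPI environment variable is a hypothetical name standing in for whatever switch the application exposes):

/* Let a user/site override win over MPIX_Query_cuda_support(), since the latter
 * ignores UCX-provided CUDA support. GPU_AWARE_MPI is hypothetical, for illustration. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>
#if defined(OPEN_MPI) && OPEN_MPI
#include <mpi-ext.h>
#endif

static int gpu_aware_mpi(void)
{
    const char *env = getenv("GPU_AWARE_MPI");   /* hypothetical override switch */
    if (env != NULL)
        return strcmp(env, "0") != 0;            /* any value other than "0" forces it on */
#if defined(MPIX_CUDA_AWARE_SUPPORT)
    return MPIX_Query_cuda_support();            /* only reflects OPAL/smcuda, not UCX */
#else
    return 0;
#endif
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    printf("GPU-aware MPI assumed: %d\n", gpu_aware_mpi());
    MPI_Finalize();
    return 0;
}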

@ocaisa
Member

ocaisa commented Dec 9, 2021

Just to throw another spanner in the works, what if you are not using UCX? There was a recent thread on the mailing list (the link will only work properly once you have acknowledged that you are not a spammer) which said you wouldn't want to use UCX with an Omnipath interconnect. What is the implication here then, no CUDA support possible? (Maybe @bartoldeman has some input here...)

@Micket
Contributor Author

Micket commented Jan 20, 2022

@ocaisa Sorry, I missed that someone replied to this discussion.
Well, in these cases (and perhaps for us UCX users as well, since it might be that the UCX PML can't be used for everything; ugh... I really was hoping that the UCX stuff would clean these things up and make life simpler), I think we need to build the smcuda BTL plugin.

Fortunately, I think there is still hope we can do it with the same design as UCX-CUDA, at runtime, using OMPI_MCA_mca_component_path to point to an external ./lib/openmpi/mca_btl_smcuda.so (based on @bartoldeman's testing in #12484).

@Micket
Contributor Author

Micket commented Mar 31, 2024

We have full support for CUDA everywhere as far as I know. Nothing else to fix here.

@Micket Micket closed this as completed Mar 31, 2024