Why does my example run successfully on 4 compute nodes but fail on 8 compute nodes when I try to use GPU-aware MPI? #609
Unanswered
Terence-iscas asked this question in Q&A
Replies: 0 comments
Hi, I'm running my pipe example, which has nearly 650K hexahedral elements (polynomial order 7, NekRS 23.0). Each of my compute nodes has 4 CPUs and 4 AMD GPUs. On 4 compute nodes the example runs correctly with OpenMPI 4.1.5 and UCX: during the "timing gs" step, the pw+early+device, pw+device, and pw+host modes all appeared.
To test the strong scalability of this example, I then ran it on 8 compute nodes (same architecture). Unfortunately, it stopped at the first "timing gs:" step.
The error log looks like this:
I have located the function:

The variable `oogs_mode_list` appears to have the following 5 values:

If I use OpenMPI with UCX, it seems that `gsMode=OOGS_AUTO`, and the program then benchmarks the communication bandwidth. I also tried forcing `gsMode` to each of the other four options; with `OOGS_DEVICEMPI`, the same error occurred again. So my question is: is there a bug here, or are my MPI/UCX parameter settings wrong (see below)?

Launch command:
Besides, I have read the similar topics #594, #578, and #568. They gave me some ideas, but none of them worked, so I opened this topic.
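For anyone reproducing this, one quick sanity check (a sketch; it assumes `ucx_info` from the UCX install is on `PATH`, and output formats vary by UCX version) is to confirm that the UCX build on the failing nodes actually includes ROCm support:

```shell
#!/bin/sh
# Show UCX build configuration and look for ROCm-related options.
ucx_info -b | grep -i rocm

# List the transports/devices UCX can actually use on this node;
# rocm_copy / rocm_ipc should appear if GPU-aware transfers are possible.
ucx_info -d | grep -i rocm
```

If the ROCm transports are missing on some nodes but present on others, that could explain why the run succeeds at one node count and fails at another.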
Thank you in advance for your kind help! Best wishes!