MPI and System
OpenMPI from the NVIDIA HPC SDK. IIUC, built from HPC-X 2.15 sources.
uname -r: 5.4.0-84-generic
Computer hardware: Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz, single node, no HCA.
Network type: N/A.
Details of the problem
Using MPI I/O to write to a file from a device-only memory allocation (e.g., allocated with cudaMalloc) fails. Allocating that same memory in a host-accessible way, e.g., with cudaMallocManaged, works.
The reproducer file is here:
reproducer.cpp
#include <iostream>
#include <mpi.h>
#include <cuda_runtime_api.h>

int main(int argc, char* argv[]) {
  int N = 10;
  int* p;
  if (auto e = cudaMalloc(&p, sizeof(int) * N); e != cudaSuccess) std::cerr << __LINE__, abort();
  if (auto e = cudaMemset(p, (int)'7', sizeof(int) * N); e != cudaSuccess) std::cerr << __LINE__, abort();
  if (auto e = cudaDeviceSynchronize(); e != cudaSuccess) std::cerr << __LINE__, abort();

  int mt = -1;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &mt);
  if (mt != MPI_THREAD_MULTIPLE) std::cerr << __LINE__, abort();

  int nranks, rank;
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  MPI_File f;
  MPI_File_open(MPI_COMM_WORLD, "output", MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &f);
  MPI_Offset bytes = sizeof(int) * (MPI_Offset)N;
  MPI_Offset total_bytes = bytes * (MPI_Offset)nranks;
  MPI_Offset off = bytes * (MPI_Offset)rank;
  MPI_File_set_size(f, total_bytes);

  MPI_Request req;
  MPI_File_iwrite_at(f, off, p, bytes, MPI_INT, &req);
  MPI_Waitall(1, &req, MPI_STATUSES_IGNORE);
  MPI_File_close(&f);

  MPI_Finalize();
  return 0;
}
Compile it with any CUDA C++ compiler, e.g., nvcc or nvc++. Running it fails with this error:
The call to cuMemcpyAsync failed. This is a unrecoverable error and will
cause the program to abort.
cuMemcpyAsync(0x1b1b5f8, 0x7f25f4a00000, 160) returned value 1
The expected behavior is for this to work correctly.
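For reference, here is a sketch of one way to build and launch the reproducer; the exact compiler and flags are assumptions, while the binary name mpi_io_bug and the two ranks match the backtrace below:

# Assumed commands; mpicxx is the Open MPI wrapper bundled with the HPC SDK, adjust paths/flags for your install.
mpicxx -o mpi_io_bug reproducer.cpp -lcudart
mpirun -np 2 ./mpi_io_bug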
Full Error Message
--------------------------------------------------------------------------
The call to cuMemcpyAsync failed. This is a unrecoverable error and will
cause the program to abort.
cuMemcpyAsync(0x1b1b5f8, 0x7f25f4a00000, 160) returned value 1
Check the cuda.h file for what the return value means.
--------------------------------------------------------------------------
[ipp2-0153.nvidia.com:00861] CUDA: Error in cuMemcpy: res=-1, dest=0x1b1b5f8, src=0x7f25f4a00000, size=160
[ipp2-0153:00861] *** Process received signal ***
[ipp2-0153:00861] Signal: Aborted (6)
[ipp2-0153:00861] Signal code: (-6)
[ipp2-0153.nvidia.com:00860] CUDA: Error in cuMemcpy: res=-1, dest=0x30f1908, src=0x7fc2f6a00000, size=160
[ipp2-0153:00860] *** Process received signal ***
[ipp2-0153:00860] Signal: Aborted (6)
[ipp2-0153:00860] Signal code: (-6)
[ipp2-0153:00861] [ 0] [ipp2-0153:00860] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f281ae1a520]
[ipp2-0153:00861] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f281ae6ea7c]
[ipp2-0153:00861] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fc51ca1a520]
[ipp2-0153:00860] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7fc51ca6ea7c]
[ipp2-0153:00860] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f281ae1a476]
[ipp2-0153:00861] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7fc51ca1a476]
[ipp2-0153:00860] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f281ae007f3]
[ipp2-0153:00861] [ 4] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fc51ca007f3]
[ipp2-0153:00860] [ 4] /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/comm_libs/12.1/hpcx/hpcx-2.15/ompi/lib/libopen-pal.so.40(+0x55829)[0x7f281a655829]
[ipp2-0153:00861] [ 5] /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/comm_libs/12.1/hpcx/hpcx-2.15/ompi/lib/libopen-pal.so.40(opal_convertor_pack+0x18f)[0x7f281a647bcf]
[ipp2-0153:00861] [ 6] /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/comm_libs/12.1/hpcx/hpcx-2.15/ompi/lib/libmca_common_ompio.so.41(mca_common_ompio_file_iwrite+0x281)[0x7f25c340aae1]
[ipp2-0153:00861] [ 7] /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/comm_libs/12.1/hpcx/hpcx-2.15/ompi/lib/libmca_common_ompio.so.41(mca_common_ompio_file_iwrite_at+0x49)[0x7f25c340ae39]
[ipp2-0153:00861] [ 8] /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/comm_libs/12.1/hpcx/hpcx-2.15/ompi/lib/openmpi/mca_io_ompio.so(mca_io_ompio_file_iwrite_at+0x26)[0x7f25c3805b56]
[ipp2-0153:00861] [ 9] /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/comm_libs/12.1/hpcx/hpcx-2.15/ompi/lib/libopen-pal.so.40(+0x55829)[0x7fc51c255829]
[ipp2-0153:00860] [ 5] /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/comm_libs/12.1/hpcx/hpcx-2.15/ompi/lib/libopen-pal.so.40(opal_convertor_pack+0x18f)[0x7fc51c247bcf]
[ipp2-0153:00860] [ 6] /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/comm_libs/12.1/hpcx/hpcx-2.15/ompi/lib/libmca_common_ompio.so.41(mca_common_ompio_file_iwrite+0x281)[0x7fc2c140aae1]
[ipp2-0153:00860] [ 7] /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/comm_libs/12.1/hpcx/hpcx-2.15/ompi/lib/libmca_common_ompio.so.41(mca_common_ompio_file_iwrite_at+0x49)[0x7fc2c140ae39]
[ipp2-0153:00860] [ 8] /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/comm_libs/12.1/hpcx/hpcx-2.15/ompi/lib/openmpi/mca_io_ompio.so(mca_io_ompio_file_iwrite_at+0x26)[0x7fc2c9405b56]
[ipp2-0153:00860] [ 9] /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/comm_libs/12.1/hpcx/hpcx-2.15/ompi/lib/libmpi.so.40(PMPI_File_iwrite_at+0x5e)[0x7f281f2679ce]
[ipp2-0153:00861] [10] ./mpi_io_bug[0x4024c6]
[ipp2-0153:00861] [11] /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/comm_libs/12.1/hpcx/hpcx-2.15/ompi/lib/libmpi.so.40(PMPI_File_iwrite_at+0x5e)[0x7fc520e679ce]
[ipp2-0153:00860] [10] ./mpi_io_bug[0x4024c6]
[ipp2-0153:00860] [11] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f281ae01d90]
[ipp2-0153:00861] [12] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fc51ca01d90]
[ipp2-0153:00860] [12] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f281ae01e40]
[ipp2-0153:00861] [13] ./mpi_io_bug[0x402295]
[ipp2-0153:00861] *** End of error message ***
/usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fc51ca01e40]
[ipp2-0153:00860] [13] ./mpi_io_bug[0x402295]
[ipp2-0153:00860] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ipp2-0153 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
[ipp2-0153.nvidia.com:00856] 1 more process has sent help message help-mpi-common-cuda.txt / cuMemcpyAsync failed
[ipp2-0153.nvidia.com:00856] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
I wrote this code a few years back, but unfortunately I currently have no way to test it on an NVIDIA device (nor do I have access to the HPC-X build of Open MPI). I would need help from somebody at NVIDIA to debug this; I am more than happy to assist or answer any questions. I briefly went over the code in ompi 4.1.5, and it does look correct to me, i.e., it should actually work.
I did test the ompi 5.0 version of the same code (which, however, uses the accelerator framework API functions), and I can confirm that it worked as expected on our devices.
@gonzalobg
The issue is in the reproducer's write call. It passes bytes as the count while using MPI_INT as the datatype, so the write reads sizeof(int) times more data than was allocated: 40 elements of MPI_INT is 160 bytes read from a 40-byte buffer, which matches the cuMemcpyAsync(..., 160) failure in the log.
MPI_File_iwrite_at(f, off, p, bytes, MPI_INT, &req);
It should pass either bytes with MPI_BYTE, or N with MPI_INT.
It works in either case.
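For clarity, a sketch of the two corrected variants, reusing the variables from the reproducer above (either call on its own is the fix; they are alternatives, not both):

// Count is given in elements of the datatype: N elements of MPI_INT ...
MPI_File_iwrite_at(f, off, p, N, MPI_INT, &req);
// ... or, equivalently, `bytes` elements of MPI_BYTE:
// MPI_File_iwrite_at(f, off, p, bytes, MPI_BYTE, &req);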
Closing the issue. @edgargabriel, thanks for the debugging.