v4.1.x: MPI File I/O fails with device memory #11798

Closed
gonzalobg opened this issue Jul 5, 2023 · 3 comments

@gonzalobg

MPI and System

Open MPI from the NVIDIA HPC SDK (IIUC, built from HPC-X 2.15 sources).

ompi_info
Package: Open MPI qa@sky1 Distribution
                Open MPI: 4.1.5rc2
  Open MPI repo revision: v4.1.5rc1-16-g5980bac
   Open MPI release date: Unreleased developer copy
                Open RTE: 4.1.5rc2
  Open RTE repo revision: v4.1.5rc1-16-g5980bac
   Open RTE release date: Unreleased developer copy
                    OPAL: 4.1.5rc2
      OPAL repo revision: v4.1.5rc1-16-g5980bac
       OPAL release date: Unreleased developer copy
                 MPI API: 3.1.0
            Ident string: 4.1.5rc2
                  Prefix: /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/comm_libs/12.1/hpcx/hpcx-2.15/ompi
 Configured architecture: x86_64-pc-linux-gnu
          Configure host: sky1
           Configured by: qa
           Configured on: Wed May 10 16:39:18 UTC 2023
          Configure host: sky1
  Configure command line: 'CC=gcc' 'CXX=g++' 'FC=nvfortran'
                          'LDFLAGS=-Wl,-rpath-link=/proj/nv/libraries/Linux_x86_64/23.5/hpcx-12/229172-rel-1/comm_libs/12.1/hpcx/hpcx-2.15/ucx/lib
                          -Wl,-rpath-link=/proj/nv/libraries/Linux_x86_64/23.5/hpcx-12/229172-rel-1/comm_libs/12.1/hpcx/hpcx-2.15/hcoll/lib'
                          '--with-platform=../contrib/platform/nvhpc/optimized'
                          '--enable-mpi1-compatibility'
                          '--with-libevent=internal' '--without-xpmem'
                          '--with-slurm'
                          '--with-cuda=/proj/cuda/12.1/Linux_x86_64'
                          '--with-hcoll=/proj/nv/libraries/Linux_x86_64/23.5/hpcx-12/229172-rel-1/comm_libs/12.1/hpcx/hpcx-2.15/hcoll'
                          '--with-ucc=/proj/nv/libraries/Linux_x86_64/23.5/hpcx-12/229172-rel-1/comm_libs/12.1/hpcx/hpcx-2.15/ucc'
                          '--with-ucx=/proj/nv/libraries/Linux_x86_64/23.5/hpcx-12/229172-rel-1/comm_libs/12.1/hpcx/hpcx-2.15/ucx'
                          '--prefix=/proj/nv/libraries/Linux_x86_64/23.5/hpcx-12/229172-rel-1/comm_libs/12.1/hpcx/hpcx-2.15/ompi'
                Built by: qa
                Built on: Wed May 10 16:43:47 UTC 2023
              Built host: sky1
              C bindings: yes
            C++ bindings: no
             Fort mpif.h: yes (all)
            Fort use mpi: yes (full: ignore TKR)
       Fort use mpi size: deprecated-ompi-info-value
        Fort use mpi_f08: yes
 Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
                          limitations in the nvfortran compiler and/or Open
                          MPI, does not support the following: array
                          subsections, direct passthru (where possible) to
                          underlying Open MPI's C functionality
  Fort mpi_f08 subarrays: no
           Java bindings: no
  Wrapper compiler rpath: runpath
              C compiler: gcc
     C compiler absolute: /usr/bin/gcc
  C compiler family name: GNU
      C compiler version: 4.8.5
            C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
           Fort compiler: nvfortran
       Fort compiler abs: /proj/nv/Linux_x86_64/23.5/compilers/bin/nvfortran
         Fort ignore TKR: yes (!DIR$ IGNORE_TKR)
   Fort 08 assumed shape: yes
      Fort optional args: yes
          Fort INTERFACE: yes
    Fort ISO_FORTRAN_ENV: yes
       Fort STORAGE_SIZE: yes
      Fort BIND(C) (all): yes
      Fort ISO_C_BINDING: yes
 Fort SUBROUTINE BIND(C): yes
       Fort TYPE,BIND(C): yes
 Fort T,BIND(C,name="a"): yes
            Fort PRIVATE: yes
          Fort PROTECTED: yes
           Fort ABSTRACT: yes
       Fort ASYNCHRONOUS: yes
          Fort PROCEDURE: yes
         Fort USE...ONLY: yes
           Fort C_FUNLOC: yes
 Fort f08 using wrappers: yes
         Fort MPI_SIZEOF: yes
             C profiling: yes
           C++ profiling: no
   Fort mpif.h profiling: yes
  Fort use mpi profiling: yes
   Fort use mpi_f08 prof: yes
          C++ exceptions: no
          Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
                          OMPI progress: no, ORTE progress: yes, Event lib:
                          yes)
           Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: yes
     MPI parameter check: never
Memory profiling support: no
Memory debugging support: no
              dl support: yes
   Heterogeneous support: no
 mpirun default --prefix: yes
       MPI_WTIME support: native
     Symbol vis. support: yes
   Host topology support: yes
            IPv6 support: no
      MPI1 compatibility: yes
          MPI extensions: affinity, cuda, pcollreq
   FT Checkpoint support: no (checkpoint thread: no)
   C/R Enabled Debugging: no
  MPI_MAX_PROCESSOR_NAME: 256
    MPI_MAX_ERROR_STRING: 256
     MPI_MAX_OBJECT_NAME: 64
        MPI_MAX_INFO_KEY: 36
        MPI_MAX_INFO_VAL: 256
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
           MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v4.1.5)
           MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v4.1.5)
           MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.1.5)
                 MCA btl: smcuda (MCA v2.1.0, API v3.1.0, Component v4.1.5)
                 MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.1.5)
            MCA compress: gzip (MCA v2.1.0, API v2.0.0, Component v4.1.5)
            MCA compress: bzip (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA crs: none (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                  MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v4.1.5)
               MCA event: libevent2022 (MCA v2.1.0, API v2.0.0, Component
                          v4.1.5)
               MCA hwloc: hwloc201 (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                  MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component
                          v4.1.5)
                  MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component
                          v4.1.5)
         MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v4.1.5)
         MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v4.1.5)
              MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA mpool: hugepage (MCA v2.1.0, API v3.0.0, Component v4.1.5)
             MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component
                          v4.1.5)
                MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA pmix: pmix3x (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA pstat: linux (MCA v2.1.0, API v2.0.0, Component v4.1.5)
              MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v4.1.5)
              MCA rcache: rgpusm (MCA v2.1.0, API v3.3.0, Component v4.1.5)
              MCA rcache: gpusm (MCA v2.1.0, API v3.3.0, Component v4.1.5)
           MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v4.1.5)
           MCA reachable: netlink (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v4.1.5)
              MCA errmgr: default_tool (MCA v2.1.0, API v3.0.0, Component
                          v4.1.5)
              MCA errmgr: default_app (MCA v2.1.0, API v3.0.0, Component
                          v4.1.5)
              MCA errmgr: default_orted (MCA v2.1.0, API v3.0.0, Component
                          v4.1.5)
              MCA errmgr: default_hnp (MCA v2.1.0, API v3.0.0, Component
                          v4.1.5)
                 MCA ess: tool (MCA v2.1.0, API v3.0.0, Component v4.1.5)
                 MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.1.5)
                 MCA ess: hnp (MCA v2.1.0, API v3.0.0, Component v4.1.5)
                 MCA ess: singleton (MCA v2.1.0, API v3.0.0, Component
                          v4.1.5)
                 MCA ess: env (MCA v2.1.0, API v3.0.0, Component v4.1.5)
                 MCA ess: pmi (MCA v2.1.0, API v3.0.0, Component v4.1.5)
               MCA filem: raw (MCA v2.1.0, API v2.0.0, Component v4.1.5)
             MCA grpcomm: direct (MCA v2.1.0, API v3.0.0, Component v4.1.5)
                 MCA iof: orted (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA iof: hnp (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA iof: tool (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA odls: default (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA odls: pspawn (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA oob: tcp (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA plm: rsh (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA plm: isolated (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component
                          v4.1.5)
                MCA regx: reverse (MCA v2.1.0, API v1.0.0, Component v4.1.5)
                MCA regx: fwd (MCA v2.1.0, API v1.0.0, Component v4.1.5)
                MCA regx: naive (MCA v2.1.0, API v1.0.0, Component v4.1.5)
               MCA rmaps: rank_file (MCA v2.1.0, API v2.0.0, Component
                          v4.1.5)
               MCA rmaps: round_robin (MCA v2.1.0, API v2.0.0, Component
                          v4.1.5)
               MCA rmaps: ppr (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA rmaps: resilient (MCA v2.1.0, API v2.0.0, Component
                          v4.1.5)
               MCA rmaps: seq (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA rmaps: mindist (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA rml: oob (MCA v2.1.0, API v3.0.0, Component v4.1.5)
              MCA routed: direct (MCA v2.1.0, API v3.0.0, Component v4.1.5)
              MCA routed: radix (MCA v2.1.0, API v3.0.0, Component v4.1.5)
              MCA routed: binomial (MCA v2.1.0, API v3.0.0, Component v4.1.5)
                 MCA rtc: hwloc (MCA v2.1.0, API v1.0.0, Component v4.1.5)
              MCA schizo: ompi (MCA v2.1.0, API v1.0.0, Component v4.1.5)
              MCA schizo: orte (MCA v2.1.0, API v1.0.0, Component v4.1.5)
              MCA schizo: flux (MCA v2.1.0, API v1.0.0, Component v4.1.5)
              MCA schizo: jsm (MCA v2.1.0, API v1.0.0, Component v4.1.5)
              MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.1.5)
               MCA state: tool (MCA v2.1.0, API v1.0.0, Component v4.1.5)
               MCA state: app (MCA v2.1.0, API v1.0.0, Component v4.1.5)
               MCA state: orted (MCA v2.1.0, API v1.0.0, Component v4.1.5)
               MCA state: hnp (MCA v2.1.0, API v1.0.0, Component v4.1.5)
               MCA state: novm (MCA v2.1.0, API v1.0.0, Component v4.1.5)
                 MCA bml: r2 (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: sync (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: tuned (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: cuda (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: inter (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: self (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: han (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: ucc (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: monitoring (MCA v2.1.0, API v2.0.0, Component
                          v4.1.5)
                MCA coll: adapt (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: basic (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: libnbc (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: hcoll (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA coll: sm (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component
                          v4.1.5)
               MCA fcoll: two_phase (MCA v2.1.0, API v2.0.0, Component
                          v4.1.5)
               MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v4.1.5)
               MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component
                          v4.1.5)
               MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                  MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                  MCA io: romio321 (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                  MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                  MCA op: avx (MCA v2.1.0, API v1.0.0, Component v4.1.5)
                 MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component
                          v4.1.5)
                 MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v4.1.5)
                 MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v4.1.5)
                 MCA osc: pt2pt (MCA v2.1.0, API v3.0.0, Component v4.1.5)
                 MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v4.1.5)
                 MCA pml: v (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA pml: cm (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA pml: monitoring (MCA v2.1.0, API v2.0.0, Component
                          v4.1.5)
                 MCA pml: ob1 (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                 MCA rte: orte (MCA v2.1.0, API v2.0.0, Component v4.1.5)
            MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component
                          v4.1.5)
            MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component
                          v4.1.5)
            MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v4.1.5)
                MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v4.1.5)
                MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component
                          v4.1.5)
           MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component
                          v4.1.5)
  • Operating system/version: Ubuntu 22.04, uname -r: 5.4.0-84-generic
  • Computer hardware: Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz, single node, no HCA.
  • Network type: N/A.

Details of the problem

Using MPI I/O to write to a file from a device-only memory allocation (e.g., one allocated with cudaMalloc) fails. Allocating the same memory in a host-accessible way, e.g., with cudaMallocManaged, works.

The reproducer file is here:

reproducer.cpp
#include <iostream>
#include <mpi.h>
#include <cuda_runtime_api.h>

int main(int argc, char* argv[]) {
  int N = 10;
  int* p;

  if (auto e = cudaMalloc(&p, sizeof(int) * N); e != cudaSuccess) std::cerr << __LINE__, abort();
  if (auto e = cudaMemset(p, (int)'7', sizeof(int) * N); e != cudaSuccess) std::cerr << __LINE__, abort();
  if (auto e = cudaDeviceSynchronize(); e != cudaSuccess) std::cerr << __LINE__, abort();

  int mt = -1;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &mt);
  if (mt != MPI_THREAD_MULTIPLE) std::cerr << __LINE__, abort();
  int nranks, rank;
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    
  MPI_File f;
  MPI_File_open(MPI_COMM_WORLD, "output", MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &f);
  MPI_Offset bytes = sizeof(int) * (MPI_Offset)N;
  MPI_Offset total_bytes = bytes * (MPI_Offset)nranks;
  MPI_Offset off = bytes * (MPI_Offset)rank;
  MPI_File_set_size(f, total_bytes);
  MPI_Request req;
  MPI_File_iwrite_at(f, off, p, bytes, MPI_INT, &req);
  MPI_Waitall(1, &req, MPI_STATUSES_IGNORE);
  MPI_File_close(&f);
  MPI_Finalize();  

  return 0; 
}
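
As noted in the description above, the same reproducer works when the buffer is host-accessible; an illustrative one-line change (not part of the original report) is to swap the device-only allocation for managed memory:

// Replace cudaMalloc with a managed (host-visible) allocation:
if (auto e = cudaMallocManaged(&p, sizeof(int) * N); e != cudaSuccess) std::cerr << __LINE__, abort();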

Compiling the reproducer with any CUDA C++ compiler, e.g., nvcc or nvc++, and running it

OMPI_CXX=nvc++ mpicxx -std=c++20 -stdpar=gpu -o mpi_io_bug mpi_io_bug.cpp
mpirun -np 2 ./mpi_io_bug

fails with this error:

The call to cuMemcpyAsync failed. This is a unrecoverable error and will
cause the program to abort.
  cuMemcpyAsync(0x1b1b5f8, 0x7f25f4a00000, 160) returned value 1

The expected behavior is for this to work correctly.
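
For reference, the raw CUresult in that message can be decoded with the CUDA driver API; cuda.h maps the returned value 1 to CUDA_ERROR_INVALID_VALUE. A small standalone sketch (not part of the original report):

#include <cuda.h>
#include <cstdio>

// Print the symbolic name and description of a raw CUresult value,
// e.g. the "returned value 1" reported above.
void print_cu_error(CUresult res) {
  const char* name = nullptr;
  const char* desc = nullptr;
  cuGetErrorName(res, &name);     // e.g. "CUDA_ERROR_INVALID_VALUE" for res == 1
  cuGetErrorString(res, &desc);
  std::printf("CUresult %d: %s (%s)\n", (int)res, name ? name : "unknown", desc ? desc : "unknown");
}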

Full Error Message
--------------------------------------------------------------------------
The call to cuMemcpyAsync failed. This is a unrecoverable error and will
cause the program to abort.
  cuMemcpyAsync(0x1b1b5f8, 0x7f25f4a00000, 160) returned value 1
Check the cuda.h file for what the return value means.
--------------------------------------------------------------------------
[ipp2-0153.nvidia.com:00861] CUDA: Error in cuMemcpy: res=-1, dest=0x1b1b5f8, src=0x7f25f4a00000, size=160
[ipp2-0153:00861] *** Process received signal ***
[ipp2-0153:00861] Signal: Aborted (6)
[ipp2-0153:00861] Signal code:  (-6)
[ipp2-0153.nvidia.com:00860] CUDA: Error in cuMemcpy: res=-1, dest=0x30f1908, src=0x7fc2f6a00000, size=160
[ipp2-0153:00860] *** Process received signal ***
[ipp2-0153:00860] Signal: Aborted (6)
[ipp2-0153:00860] Signal code:  (-6)
[ipp2-0153:00861] [ 0] [ipp2-0153:00860] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f281ae1a520]
[ipp2-0153:00861] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f281ae6ea7c]
[ipp2-0153:00861] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fc51ca1a520]
[ipp2-0153:00860] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7fc51ca6ea7c]
[ipp2-0153:00860] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f281ae1a476]
[ipp2-0153:00861] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7fc51ca1a476]
[ipp2-0153:00860] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f281ae007f3]
[ipp2-0153:00861] [ 4] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fc51ca007f3]
[ipp2-0153:00860] [ 4] /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/comm_libs/12.1/hpcx/hpcx-2.15/ompi/lib/libopen-pal.so.40(+0x55829)[0x7f281a655829]
[ipp2-0153:00861] [ 5] /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/comm_libs/12.1/hpcx/hpcx-2.15/ompi/lib/libopen-pal.so.40(opal_convertor_pack+0x18f)[0x7f281a647bcf]
[ipp2-0153:00861] [ 6] /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/comm_libs/12.1/hpcx/hpcx-2.15/ompi/lib/libmca_common_ompio.so.41(mca_common_ompio_file_iwrite+0x281)[0x7f25c340aae1]
[ipp2-0153:00861] [ 7] /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/comm_libs/12.1/hpcx/hpcx-2.15/ompi/lib/libmca_common_ompio.so.41(mca_common_ompio_file_iwrite_at+0x49)[0x7f25c340ae39]
[ipp2-0153:00861] [ 8] /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/comm_libs/12.1/hpcx/hpcx-2.15/ompi/lib/openmpi/mca_io_ompio.so(mca_io_ompio_file_iwrite_at+0x26)[0x7f25c3805b56]
[ipp2-0153:00861] [ 9] /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/comm_libs/12.1/hpcx/hpcx-2.15/ompi/lib/libopen-pal.so.40(+0x55829)[0x7fc51c255829]
[ipp2-0153:00860] [ 5] /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/comm_libs/12.1/hpcx/hpcx-2.15/ompi/lib/libopen-pal.so.40(opal_convertor_pack+0x18f)[0x7fc51c247bcf]
[ipp2-0153:00860] [ 6] /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/comm_libs/12.1/hpcx/hpcx-2.15/ompi/lib/libmca_common_ompio.so.41(mca_common_ompio_file_iwrite+0x281)[0x7fc2c140aae1]
[ipp2-0153:00860] [ 7] /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/comm_libs/12.1/hpcx/hpcx-2.15/ompi/lib/libmca_common_ompio.so.41(mca_common_ompio_file_iwrite_at+0x49)[0x7fc2c140ae39]
[ipp2-0153:00860] [ 8] /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/comm_libs/12.1/hpcx/hpcx-2.15/ompi/lib/openmpi/mca_io_ompio.so(mca_io_ompio_file_iwrite_at+0x26)[0x7fc2c9405b56]
[ipp2-0153:00860] [ 9] /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/comm_libs/12.1/hpcx/hpcx-2.15/ompi/lib/libmpi.so.40(PMPI_File_iwrite_at+0x5e)[0x7f281f2679ce]
[ipp2-0153:00861] [10] ./mpi_io_bug[0x4024c6]
[ipp2-0153:00861] [11] /opt/nvidia/hpc_sdk/Linux_x86_64/23.5/comm_libs/12.1/hpcx/hpcx-2.15/ompi/lib/libmpi.so.40(PMPI_File_iwrite_at+0x5e)[0x7fc520e679ce]
[ipp2-0153:00860] [10] ./mpi_io_bug[0x4024c6]
[ipp2-0153:00860] [11] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f281ae01d90]
[ipp2-0153:00861] [12] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fc51ca01d90]
[ipp2-0153:00860] [12] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f281ae01e40]
[ipp2-0153:00861] [13] ./mpi_io_bug[0x402295]
[ipp2-0153:00861] *** End of error message ***
/usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fc51ca01e40]
[ipp2-0153:00860] [13] ./mpi_io_bug[0x402295]
[ipp2-0153:00860] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ipp2-0153 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
[ipp2-0153.nvidia.com:00856] 1 more process has sent help message help-mpi-common-cuda.txt / cuMemcpyAsync failed
[ipp2-0153.nvidia.com:00856] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
@edgargabriel
Member

I wrote this code a few years back, but at the moment I unfortunately have no way to test this on an Nvidia device (nor do I have access to the HPC-X build of Open MPI). I would need help from somebody at Nvidia to debug this; I am more than happy to assist or answer any questions. I did briefly go over the code in ompi 4.1.5, and it looks correct to me, i.e., it should actually work.

I did test the ompi 5.0 version of the same code (which, however, uses accelerator framework API functions), and I can confirm that it worked as expected on our devices.

@jsquyres
Member

jsquyres commented Jul 5, 2023

@janjust Ping

@jsquyres jsquyres added this to the v4.1.6 milestone Jul 11, 2023
@jsquyres jsquyres changed the title from "MPI File I/O fails with device memory" to "v4.1.x: MPI File I/O fails with device memory" on Jul 11, 2023
@janjust
Contributor

janjust commented Jul 26, 2023

@gonzalobg
The issue is in the reproducer's write call: it passes bytes as the count while using MPI_INT as the datatype, so the call describes 160 bytes (40 MPI_INTs) even though only 40 bytes were allocated, and the pack reads past the end of the buffer.
MPI_File_iwrite_at(f, off, p, bytes, MPI_INT, &req);

It should pass either bytes with MPI_BYTE, or N with MPI_INT; it works in either case.
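
For reference, either corrected form of the call (an illustrative sketch using the reproducer's variables, not code from the thread) stays within the 40-byte allocation:

// Option 1: count in bytes
MPI_File_iwrite_at(f, off, p, (int)bytes, MPI_BYTE, &req);

// Option 2: count in elements
MPI_File_iwrite_at(f, off, p, N, MPI_INT, &req);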
Closing issue. @edgargabriel thanks for the debug.

@janjust janjust closed this as completed Jul 26, 2023