MAINT: gdr_unmap segfault on master branch via NVSHMEM 2.10.1 on Cray Slingshot 11 with cuFFTMp #296
Hi @tylerjereddy, I suspect that the segfault is from somewhere in https://github.com/ofiwg/libfabric/blob/main/src/hmem_cuda_gdrcopy.c#L346-L380 or https://github.com/NVIDIA/gdrcopy/blob/master/src/gdrapi.c#L387-L411. Can you add some debug prints in those areas to narrow it down? For GDRCopy, you may want to change https://github.com/NVIDIA/gdrcopy/blob/master/src/Makefile#L29 to lower the optimization level, which makes debugging easier.
I'm working on it; the exact nature of the failure is not deterministic, even between trials with the same builds, it seems. I'll keep working on narrowing things down for at least one of the failure scenarios. I'll also paste a few more example outputs that looked a little different (they didn't actually segfault, just errored).
I see the max value of an unsigned 16-bit integer in one of the errors in there. Anyway, I'll try to dig deeper. My prints aren't showing up yet, so there's something I'm obviously not understanding.
Ah, I had completely purged out my custom install at some point. I did confirm that I can see prints from my custom build now.
Hi @tylerjereddy, I reviewed the NVSHMEM libfabric transport code. It does not use GDRCopy with Slingshot -- at least not in NVSHMEM 2.10.1. However, libfabric itself (not the NVSHMEM libfabric transport) uses GDRCopy. Based on the backtrace logs you posted, I think NVSHMEM calls into libfabric, which in turn triggers this issue. I think we can ignore NVSHMEM for now. Guessing from your first comment, you originally ran with GDRCopy v2.3 and then moved to the master branch, right? Do you have root access on your system? Have you reloaded the gdrdrv driver from the master branch? If you have root access, can you enable debugging in the gdrdrv driver? After compiling GDRCopy, you can simply modify https://github.com/NVIDIA/gdrcopy/blob/master/insmod.sh#L28 to set dbg_enabled=1 before reloading the driver.
This line does not make sense to me. In most cases, the error code should be propagated from the gdrdrv driver. However, the driver never returns -ENOMEM (12) in the mmap path. And that line with that phrase can only be printed out from one place.
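(As an aside, errno 12 is ENOMEM, and libfabric's warning path renders it via strerror(). A trivial sketch of that mapping, purely illustrative and not code from either project:)

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* errno 12 is ENOMEM; on glibc, strerror() renders it as
       "Cannot allocate memory", which is the kind of phrase the
       FI_WARN() line shown later in this thread would emit via
       strerror(err). */
    printf("errno %d -> %s\n", ENOMEM, strerror(ENOMEM));
    return 0;
}
```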
I don't have root access; it is a supercomputer at LANL. I could perhaps pass your suggestions along to HPC support to see if there's anything they can check.
I think the HPC admins are looking into your comment a bit, but I wanted to check on a few things:
libgdrapi.so and gdrdrv (driver) are forward and backward compatible. Still, there might be some bugs we have fixed in a newer version of gdrdrv. It would be good to use the latest release version. Your application talks to libgdrapi.so (not directly to gdrdrv). For this one, it is backward compatible only. For example, if you compile with GDRCopy v2.4, we cannot guarantee that your application will work with libgdrapi.so v2.3.
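To make the version question concrete, here is a small sketch (assuming the gdr_runtime_get_version()/gdr_driver_get_version() helpers from a recent gdrapi.h) that prints which libgdrapi.so and gdrdrv versions a process actually sees:

```c
#include <stdio.h>
#include "gdrapi.h"

/* Sketch: print the libgdrapi.so (runtime) and gdrdrv (driver) versions
 * this process actually sees, to spot v2.3-vs-master style mismatches.
 * Assumes gdr_runtime_get_version()/gdr_driver_get_version() from a
 * recent gdrapi.h; link with -lgdrapi. */
int main(void)
{
    int lib_major = 0, lib_minor = 0;
    gdr_runtime_get_version(&lib_major, &lib_minor);
    printf("libgdrapi runtime version: %d.%d\n", lib_major, lib_minor);

    gdr_t g = gdr_open();
    if (!g) {
        fprintf(stderr, "gdr_open failed (is gdrdrv loaded?)\n");
        return 1;
    }

    int drv_major = 0, drv_minor = 0;
    if (gdr_driver_get_version(g, &drv_major, &drv_minor) == 0)
        printf("gdrdrv driver version:     %d.%d\n", drv_major, drv_minor);

    gdr_close(g);
    return 0;
}
```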
I don't know the answer. Is this a user-space library or a driver? If it is a user-space library, you probably can swap in your own build without root access. By the way, you may want to try setting GDRCOPY_ENABLE_LOGGING=1 to get more logging out of libgdrapi.so.
So, my debug prints were not showing up because prepending my custom library location to the search path was not having the effect I expected. Anyway, now I should be able to report some better debug prints.
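One quick way to confirm which libgdrapi.so a process actually resolved is to ask the dynamic loader. A small sketch using dladdr() (my own illustration, not code from this repo; link with -ldl and the GDRCopy library):

```c
#define _GNU_SOURCE          /* for dladdr()/Dl_info on glibc */
#include <dlfcn.h>
#include <stdio.h>
#include "gdrapi.h"

/* Sketch: ask the dynamic loader which file the gdr_open symbol was
 * resolved from, to confirm the instrumented libgdrapi.so is the one
 * actually loaded rather than a system copy. */
int main(void)
{
    Dl_info info;
    if (dladdr((void *)&gdr_open, &info) && info.dli_fname)
        printf("gdr_open resolved from: %s\n", info.dli_fname);
    else
        printf("could not resolve gdr_open via dladdr()\n");
    return 0;
}
```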
More detailed debug analysis below, now that I can use my custom builds:
Details for backtrace scenario 1
Details of error scenario 2
Details of error scenario 3
Does this give you any more traction to diagnose the problem? While I wait to hear back about the debug driver stuff, is there anything else you want me to try here? It also seems to me like there's a misunderstanding somewhere with UCX as well.
Thank you @tylerjereddy. I suspect that you may be running into a race condition from multithreading. GDRCopy, especially libgdrapi.so, is not thread safe. Anyway, I added a global lock to some functions in this branch: https://github.com/NVIDIA/gdrcopy/tree/dev-issue-296-exp. Please try it and see if it helps. You just need to recompile libgdrapi.so and use that; there is no need to install a new gdrdrv driver.
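(For readers following along, the serialization being tested is conceptually like the caller-side sketch below. This is only an illustration of wrapping GDRCopy calls in one global mutex, not the actual change on the dev-issue-296-exp branch, which takes a lock inside libgdrapi.so itself.)

```c
#include <pthread.h>
#include <stddef.h>
#include "gdrapi.h"

/* Illustration only: serialize GDRCopy teardown calls behind one global
 * mutex on the caller side, since libgdrapi.so itself is not thread safe. */
static pthread_mutex_t gdr_api_lock = PTHREAD_MUTEX_INITIALIZER;

static int locked_gdr_unmap(gdr_t g, gdr_mh_t mh, void *va, size_t len)
{
    pthread_mutex_lock(&gdr_api_lock);
    int ret = gdr_unmap(g, mh, va, len);
    pthread_mutex_unlock(&gdr_api_lock);
    return ret;
}

static int locked_gdr_unpin_buffer(gdr_t g, gdr_mh_t mh)
{
    pthread_mutex_lock(&gdr_api_lock);
    int ret = gdr_unpin_buffer(g, mh);
    pthread_mutex_unlock(&gdr_api_lock);
    return ret;
}
```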
I still see errors that are not deterministic on that branch (I reduced the optimization level again as well).
Note that hanging seems more common on this branch now as well.
Sorry, there was a left-over code block; I just removed it. Please try again. Note that this is not our final solution; it is just an ad hoc implementation to see if it helps. It might not work if the caller calls a GDRCopy API with a stale mh.
Here's the backtrace for the 2-node cuFFTMp reproducer with your updated branch (with optimization level reduced):
So, the crash seems to be near here in your new branch, at line 281 in d229925:
Now, if we look at the special CXI-enabled branch of libfabric I'm using, the situation looks the same as in the main branch. All of this code is in a spin-lock-protected section, which I instrumented as follows:

```diff
--- a/src/hmem_cuda_gdrcopy.c
+++ b/src/hmem_cuda_gdrcopy.c
@@ -33,6 +33,7 @@
#if HAVE_CONFIG_H
#include <config.h>
+#include <stdio.h>
#endif
#include "ofi_hmem.h"
@@ -356,26 +357,27 @@ int cuda_gdrcopy_dev_unregister(uint64_t handle)
assert(gdrcopy);
pthread_spin_lock(&global_gdr_lock);
+ printf("cuda_gdrcopy_dev_unregister checkpoint 1\n");
err = global_gdrcopy_ops.gdr_unmap(global_gdr, gdrcopy->mh,
gdrcopy->user_ptr, gdrcopy->length);
+ printf("cuda_gdrcopy_dev_unregister checkpoint 2\n");
if (err) {
+ printf("cuda_gdrcopy_dev_unregister checkpoint 2b\n");
FI_WARN(&core_prov, FI_LOG_CORE,
"gdr_unmap failed! error: %s\n",
strerror(err));
goto exit;
}
+ printf("cuda_gdrcopy_dev_unregister checkpoint 3\n");
+ pthread_spin_unlock(&global_gdr_lock);
+ printf("cuda_gdrcopy_dev_unregister checkpoint 4\n");
- err = global_gdrcopy_ops.gdr_unpin_buffer(global_gdr, gdrcopy->mh);
- if (err) {
- FI_WARN(&core_prov, FI_LOG_MR,
- "gdr_unmap failed! error: %s\n",
- strerror(err));
- goto exit;
- }
exit:
+ printf("cuda_gdrcopy_dev_unregister checkpoint 5\n");
pthread_spin_unlock(&global_gdr_lock);
free(gdrcopy);
+ printf("cuda_gdrcopy_dev_unregister checkpoint 6\n");
return err;
 }
```

Although deleting the gdr_unpin_buffer call there is obviously just an experiment rather than a proposed fix …
@tylerjereddy Thank you for the additional info. We also call … Two follow-ups: (1) please make sure you are using the latest version, and (2) please try the GDRCopy test applications (gdrcopy_copybw and gdrcopy_sanity).
Starting with your second point, the GDRCopy test applications: I used the latest master build. The output is below; the sanity check seems to "pass" but spits out errors?
The modified interactive script for the 2-node test:

```bash
#!/bin/bash -l
#
# setup the runtime environment
#export FI_LOG_LEVEL=debug
#export NVSHMEM_DEBUG=TRACE
export FI_HMEM=cuda
export GDRCOPY_ENABLE_LOGGING=1
# we need special CXI- and CUDA-enabled version of libfabric
# per: https://github.com/ofiwg/libfabric/issues/10001#issuecomment-2078604043
export LD_LIBRARY_PATH="/lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/lib64:$LD_LIBRARY_PATH"
export PATH="/lustre/scratch5/treddy/march_april_2024_testing/libfabric_install_custom/bin:$PATH"
export PATH="$PATH:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/bin"
export PATH="$PATH:/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/bin"
export LD_LIBRARY_PATH="/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/comm_libs/12.3/nccl/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ik4v4abhawveafsjmxd7fqwvhagwh7lw/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/ucx-1.15.0-ik4v4abhawveafsjmxd7fqwvhagwh7lw/lib/ucx:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/targets/x86_64-linux/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/cuda/12.3/targets/x86_64-linux/lib:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="/usr/projects/hpcsoft/cos2/chicoma/cuda-compat/12.0/:$LD_LIBRARY_PATH"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/pmix-4.2.9-7kfa4s6dwyd5wlayw24vx7jai7d4oi4x/lib"
export NVSHMEM_DISABLE_CUDA_VMM=1
export FI_CXI_OPTIMIZED_MRS=false
export NVSHMEM_REMOTE_TRANSPORT=libfabric
export MPI_HOME=/lustre/scratch5/treddy/march_april_2024_testing/ompi5_install
export CUFFT_LIB=/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/lib
export CUFFT_INC=/lustre/scratch5/.mdt1/treddy/march_april_2024_testing/github_projects/spack/opt/spack/linux-sles15-zen2/gcc-12.2.0/nvhpc-23.11-tkgy4rjxilbay253jj65msbj2vrbq673/Linux_x86_64/23.11/math_libs/12.3/targets/x86_64-linux/include/cufftmp
export NVSHMEM_LIB=/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/lib
export NVSHMEM_INC=/lustre/scratch5/treddy/march_april_2024_testing/custom_nvshmem_install/include
export LD_LIBRARY_PATH="/lustre/scratch5/treddy/march_april_2024_testing/gdrcopy_install/lib:$LD_LIBRARY_PATH"
which fi_info
echo "fi_info -l:"
fi_info -l
echo "fi_info -p cxi:"
fi_info -p cxi
#cd /lustre/scratch5/treddy/march_april_2024_testing/github_projects/CUDALibrarySamples/cuFFTMp/samples/r2c_c2r_slabs_GROMACS
#make clean
#make build
#make run
echo "---------- Running gdrcopy_copybw ---------- "
/lustre/scratch5/treddy/march_april_2024_testing/gdrcopy_install/bin/gdrcopy_copybw
echo "---------- End of gdrcopy_copybw ---------- "
echo "---------- Running gdrcopy_sanity ---------- "
/lustre/scratch5/treddy/march_april_2024_testing/gdrcopy_install/bin/gdrcopy_sanity
echo "---------- End of gdrcopy_sanity ---------- " |
For the first point, using the latest version of …
After that, I tried to do a bit more work. First, I added another print in _gdr_unpin_buffer. Sample diff:

```diff
--- a/src/gdrapi.c
+++ b/src/gdrapi.c
@@ -302,6 +302,7 @@ static int _gdr_unpin_buffer(gdr_t g, gdr_mh_t handle)
LIST_REMOVE(mh, entries);
printf("===> [%d, %d] GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=%p\n", getpid(), gettid(), mh);
free(mh);
+ printf("===> [%d, %d] GDRCopy Checkpoint _gdr_unpin_buffer: 3\n", getpid(), gettid());
return ret;
 }
```

On top of that, per the request to separate the output by node, I made a few more changes to the source to prefix the hostname in each of the prints. These changes are available on my fork of gdrcopy. Now, when I run the 2-node cuFFTMp reproducer with that version of libgdrapi.so, I see a double free error. That would be consistent with your original instrumented code as well, with the list removal "succeeding" but the subsequent free(mh) being where things go wrong. I ran the reproducer two more times, and this was not always the case, however; sometimes we get the failure in a different place.
Of course, things are not fully deterministic, and I saw the double free error happening in what appears to be other parts of the control flow as well:
I'm guessing your team has already run the code through an address sanitizer at some point, though? This is confusing! What can I do next to help get to the bottom of it?
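For reference, the hostname-prefixing change mentioned above is along these lines; a rough sketch of the approach, not the exact code on my fork:

```c
#define _GNU_SOURCE          /* for gettid() on glibc >= 2.30 */
#include <stdio.h>
#include <unistd.h>

/* Rough sketch of a hostname-prefixed debug print, so output from the
 * two nodes can be told apart. */
#define GDR_DBG(fmt, ...)                                                \
    do {                                                                 \
        char host_[64] = "unknown";                                      \
        gethostname(host_, sizeof(host_));                               \
        printf("===> [%s, %d, %d] " fmt, host_, (int)getpid(),           \
               (int)gettid(), ##__VA_ARGS__);                            \
    } while (0)

/* e.g.: GDR_DBG("GDRCopy Checkpoint _gdr_unpin_buffer: 2: mh=%p\n", mh); */
```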
There are multiple things that went wrong here. Let's start with the raw output from my instrumented code without your patch.
As shown, they were dealing with the same mh object based on the address. The caller called into GDRCopy with that same mh again after it had already been unpinned and freed.
The mh address that the caller passed to the unmap/unpin path in the other case was not a valid, still-registered handle.
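To put the failure mode in code form, here is a deliberately broken sketch (my own illustration, not code from libfabric or NVSHMEM):

```c
#include <stddef.h>
#include "gdrapi.h"

/* Deliberately WRONG sketch of the caller bug described above:
 * unregistering with the same gdr_mh_t twice. The first
 * gdr_unpin_buffer() removes and frees the library's internal
 * bookkeeping for mh; a later call (or a concurrent one from another
 * thread) then touches freed memory, which matches the
 * double-free / invalid-pointer behavior in the logs. */
static void buggy_double_unregister(gdr_t g, gdr_mh_t mh, void *va, size_t len)
{
    gdr_unmap(g, mh, va, len);
    gdr_unpin_buffer(g, mh);     /* mh's internal state is freed here */

    /* ... elsewhere, a stale copy of mh is still held ... */
    gdr_unmap(g, mh, va, len);   /* use-after-free */
    gdr_unpin_buffer(g, mh);     /* double free */
}
```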
So, suspicion would be on the NVSHMEM or libfabric side then?
IIUC, NVSHMEM does not use GDRCopy directly in that environment. I don't know the libfabric programming model. Is it thread safe? Does it require special handling from the libfabric caller (NVSHMEM in this case)? My suggestion is to move up one step at a time. Items 2 and 3 are clearly a mistake on the part of GDRCopy's caller. Even if we make GDRCopy thread safe, you will still run into this segfault issue.
I think I've found evidence of a spin lock (global_gdr_lock) guarding these calls in libfabric's hmem_cuda_gdrcopy.c.
Looking at the log you posted in libfabric issue 10041, you have …
So, libfabric passes NULL into GDRCopy in that path.
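The kind of caller-side guard this points to would look roughly like the sketch below (a hypothetical illustration with a made-up bookkeeping struct, not a patch against libfabric):

```c
#include <errno.h>
#include <stddef.h>
#include "gdrapi.h"

/* Hypothetical bookkeeping struct for illustration only; libfabric's
 * real registration struct is different. */
struct my_gdr_entry {
    gdr_mh_t mh;
    void *user_ptr;
    size_t length;
};

/* Sketch of a caller-side guard: refuse to hand NULL (or an entry that
 * has already been torn down) to gdr_unmap()/gdr_unpin_buffer(). */
static int unregister_guarded(gdr_t g, struct my_gdr_entry *entry)
{
    if (!g || !entry || !entry->user_ptr)
        return -EINVAL;            /* nothing valid to unregister */

    int err = gdr_unmap(g, entry->mh, entry->user_ptr, entry->length);
    if (err)
        return err;

    err = gdr_unpin_buffer(g, entry->mh);
    entry->user_ptr = NULL;        /* mark as unregistered exactly once */
    return err;
}
```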
I think we agree on that, though I wasn't convinced that guarding against that was sufficient to fix all the problems, since I saw other backtraces after that was protected. I'm hoping to make another push at getting that working soon.
Working on Cray Slingshot 11, on 2 nodes with 4 x A100 each, with the test case from https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuFFTMp/samples/r2c_c2r_slabs_GROMACS, modified in this way to force multi-node NVSHMEM (2.10.1):

I'm seeing the output/backtrace below the fold:

My full interactive run script is this, which will tell you a bit more about various dependency versions/paths:

More gruesome details about libfabric, CXI, and CUDA support are described at ofiwg/libfabric#10001, but since I'm apparently segfaulting in gdrcopy now, it may be helpful to determine what my next debugging steps should be here. I've already discussed things fairly extensively with the NVSHMEM team.

I built the latest gdrcopy master branch with gcc 12.2.0 + cuda/12.0 "modules" loaded:

```bash
make -j 32 prefix=/lustre/scratch5/treddy/march_april_2024_testing/gdrcopy_install CUDA=/usr/projects/hpcsoft/cos2/chicoma/cuda/12.0 all install
```

It would be awesome if I could get this working somehow. Note that I was originally getting different backtraces with gdrcopy 2.3.