undefined sigaction() symbol #7340
Comments
Some more information about this problem, which I am also encountering. The root of the issue seems to be the failure of dlsym(RTLD_NEXT, symbol) in ucs_debug_get_orig_func(). Rather than failing quietly and returning NULL, the call prints the "symbol lookup error:" output and calls exit(). That appears to be because it is not the system dlsym() that gets called, but a wrapper from libpami_cudahook.so:

Breakpoint 1, 0x00007ffff7f6596c in dlsym ()

According to flux-framework/flux-core-v0.11#11, this is by design and part of the Spectrum MPI implementation on Summit:

"Without getting into too much detail, this is an ugly optimization technique that IBM used to allow their MPI to be able to send buffers allocated by CUDA memory allocation routines. The interception of the CUDA driver calls was achieved by wrapping dlsym in, libpami_cudahook.so, that is preloaded to each MPI process. But this has had lots, lots of issues, least of which was compatibility with both performance and debugging tools."

IMHO, the upshot is that the UCX installed in /usr/lib64 on Summit is broken and unusable in MPI programs. I've gotten around this by building a personal UCX 1.8.0, changing the dlsym() call to use RTLD_GLOBAL instead of RTLD_NEXT (which seems to work), and picking it up via LD_LIBRARY_PATH. I'll submit this as a support issue to OLCF.
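For reference, here is a minimal sketch of the lookup pattern discussed above, as it behaves with the standard glibc dlsym(): a missing symbol yields NULL and the caller decides what to do. The helper name lookup_orig_func is hypothetical and is not UCX's ucs_debug_get_orig_func(); the fallback to RTLD_DEFAULT is an assumption made for illustration, not the change described in the comment. Note that if a preloaded dlsym() wrapper calls exit() on failure, the fallback branch here would never be reached.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* glibc exposes RTLD_NEXT and RTLD_DEFAULT only with this */
#endif
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of UCX): find the "next" definition of
 * a symbol, i.e. the one the caller would have reached had it not been
 * interposed. Returns NULL when the symbol cannot be found at all.
 */
static void *lookup_orig_func(const char *symbol)
{
    void *func;

    assert(symbol != NULL);

    /* Search the objects loaded after the one containing this code. */
    func = dlsym(RTLD_NEXT, symbol);
    if (func == NULL) {
        /* Assumed fallback: search the default global symbol scope. */
        func = dlsym(RTLD_DEFAULT, symbol);
    }

    return func;
}
```

A caller would check the result before using it, for example treating a NULL return for "sigaction" as "interception unavailable" instead of aborting. On glibc older than 2.34 this needs to be linked with -ldl.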
Describe the bug
UCX 1.10.1 produces the following error at runtime on the Summit system at the OLCF following the recent upgrade to RHEL8:
Steps to Reproduce
This can be reproduced with gcc (9.1 or 11.1), spack origin/develop as of September 1, 2021, and UCX version 1.10.1 (the default version supplied by the spack package). UCX, and programs linked against it, compile and link fine. The symbol error is produced at run time.
I was able to work around it by quickly hacking the UCX code as follows:
For some reason the interception (and subsequent dlsym() lookup) of sigaction isn't working in my environment. Possibly a problem with the link order? This did not happen on Summit prior to the recent RHEL8 upgrade, though it is possible that other coincidental changes played a role as well.
The above patch is not a proper solution; I just did enough to confirm that the problem was related to interception/dlsym of the sigaction/signal function and then execute some benchmarks that don't depend on functionality provided by that interception.
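To make the interception mechanism concrete, here is a generic sketch of how a library can interpose sigaction() and forward to the original definition through dlsym(RTLD_NEXT, ...). This is not the UCX source and not the patch mentioned above; the counter intercepted_calls and the choice of ENOSYS on lookup failure are assumptions made purely for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* glibc exposes RTLD_NEXT only with this */
#endif
#include <assert.h>
#include <dlfcn.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>

typedef int (*sigaction_fn_t)(int, const struct sigaction *,
                              struct sigaction *);

/* POSIX dlsym() usage relies on object and function pointers
 * being interchangeable; check the size assumption at compile time. */
static_assert(sizeof(void *) == sizeof(sigaction_fn_t),
              "dlsym() result must fit in a function pointer");

/* Illustrative counter (assumption): number of intercepted calls. */
static int intercepted_calls = 0;

/* This definition shadows the libc one for code that resolves to it. */
int sigaction(int signum, const struct sigaction *act,
              struct sigaction *oldact)
{
    static sigaction_fn_t orig = NULL;

    if (orig == NULL) {
        /* Resolve the next definition in load order (normally libc). */
        orig = (sigaction_fn_t)dlsym(RTLD_NEXT, "sigaction");
        if (orig == NULL) {
            /* Lookup failed: report an error instead of crashing. */
            errno = ENOSYS;
            return -1;
        }
    }

    ++intercepted_calls;
    return orig(signum, act, oldact);
}
```

If the dlsym() lookup inside such a wrapper fails, or is itself routed through another interposed dlsym(), every sigaction() call in the process is affected, which is consistent with the symptom showing up only at run time.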