[spectrum mpi] undefined symbol: PAMI_CUDA_RegisterPAMIContexts #11

Open
garlick opened this issue May 14, 2019 · 3 comments

garlick commented May 14, 2019

When running an MPI hello world program under Flux on lassen, I get the following FATAL ERROR (the horror!), but my program still runs just fine. Also note the Lua complaint.
(line breaks added for readability)

$ flux wreckrun -ompi=spectrum  -n2 ./hello
2019-05-14T23:45:52.431872Z job.err[0]: job22: wrexecd says: spectrum.lua: rexecd_init:
    /g/g0/garlick/proj/flux-core-v0.11/src/modules/wreck/lua.d/spectrum.lua:17:
    attempt to concatenate local 'val' (a nil value)
FATAL ERROR: dlsym PAMI_CUDA_RegisterPAMIContexts: ./hello: undefined symbol:
    PAMI_CUDA_RegisterPAMIContexts
FATAL ERROR: dlsym PAMI_CUDA_RegisterPAMIContexts: ./hello: undefined symbol:
    PAMI_CUDA_RegisterPAMIContexts
0: completed MPI_Init in 0.150s.  There are 2 tasks
0: completed first barrier in 0.000s
0: completed MPI_Finalize in 0.030s

Flux was started locally on a login node (lassen708), I have a .notce environment, and this was run from source, which git describe reports as v0.11.1.

@garlick garlick changed the title [spectrum mpi] FATAL ERROR: dlsym PAMI_CUDA_RegisterPAMIContexts: ./hello: undefined symbol: PAMI_CUDA_RegisterPAMIContexts [spectrum mpi] undefined symbol: PAMI_CUDA_RegisterPAMIContexts May 14, 2019

dongahn commented May 15, 2019

My guess is that this happens because Spectrum MPI dlopens libpami_cudahook.so. I suspect you can avoid this error by setting LD_PRELOAD to the path to libpami_cudahook.so. With Flux's current Spectrum MPI support, this shared object won't be used, so this should be safe in theory. You should be able to find the libpami_cudahook.so path by looking at the LD_PRELOAD environment variable under jsrun with an MPI program.
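A rough sketch of how one might pull that path out of a preload list. The helper name `find_preload_entry` is made up for illustration, and the assumption that jsrun exposes the hook through LD_PRELOAD has not been verified here; the jsrun invocation in the trailing comment is likewise only a guess at how one would print the variable.

```shell
# Hypothetical helper: print the first entry of an LD_PRELOAD-style list
# whose file name starts with the given library name.
find_preload_entry() {
    name=$1
    list=$2
    # LD_PRELOAD entries may be separated by colons or spaces;
    # normalize to one entry per line before matching.
    printf '%s\n' "$list" | tr ': ' '\n\n' | while IFS= read -r entry; do
        case "${entry##*/}" in
            "$name"*) printf '%s\n' "$entry"; break ;;
        esac
    done
}

# Possible use on a system with jsrun (assumption, untested):
#   find_preload_entry libpami_cudahook.so "$(jsrun -n 1 printenv LD_PRELOAD)"
```

If nothing in the list matches, the function prints nothing, which is an easy way to tell that the hook is not being preloaded at all.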

Without getting into too much detail, this is an ugly optimization technique that IBM used so that their MPI could send buffers allocated by CUDA memory allocation routines. The interception of the CUDA driver calls was achieved by wrapping dlsym in libpami_cudahook.so, which is preloaded into each MPI process. But this has had lots and lots of issues, not the least of which was compatibility with both performance and debugging tools.

This will have to be revisited when @rountree is finishing up his PMIx work, as PAMI will require this to be set correctly and we want support for tools at that point as well. As I remember, you could get good mileage by putting libpami_cudahook.so as the last path in LD_PRELOAD.
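For the "last path in LD_PRELOAD" idea, something like the following could build the list. This is only a sketch: `preload_last` is a hypothetical helper, `PAMI_HOOK` is a placeholder for wherever the library lives on your system, and whether the value set this way actually reaches the tasks launched by wreckrun is an assumption that would need checking.

```shell
# Hypothetical helper: return a colon-separated preload list in which
# the given library appears exactly once, as the last entry.
preload_last() {
    lib=$1
    # Split the existing list on colons or spaces, drop empty entries and
    # any existing copy of the library, then rejoin with colons.
    rest=$(printf '%s\n' "$2" | tr ': ' '\n\n' \
        | grep -v -x -F -e "$lib" | grep -v '^$' | paste -s -d: -)
    if [ -n "$rest" ]; then
        printf '%s:%s\n' "$rest" "$lib"
    else
        printf '%s\n' "$lib"
    fi
}

# Possible use (PAMI_HOOK is a placeholder path, not a known location):
#   PAMI_HOOK=/path/to/libpami_cudahook.so
#   LD_PRELOAD=$(preload_last "$PAMI_HOOK" "$LD_PRELOAD") \
#       flux wreckrun -ompi=spectrum -n2 ./hello
```

Moving an existing entry to the end, instead of skipping it when already present, keeps the ordering guarantee even if something earlier in the environment already added the hook.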

@SteVwonder

Hmmm. This one boggles me. The spectrum.lua plugin does prepend /opt/ibm/spectrum_mpi/lib/libpami_cudahook.so to the LD_PRELOAD. [source code]. And that file seems to exist:

→ stat /opt/ibm/spectrum_mpi/lib/libpami_cudahook.so  
  File: ‘/opt/ibm/spectrum_mpi/lib/libpami_cudahook.so’ -> ‘libpami_cudahook.so.1’
  Size: 21        	Blocks: 0          IO Block: 65536  symbolic link
Device: 901h/2305d	Inode: 6357621     Links: 1
Access: (0777/lrwxrwxrwx)  Uid: (    1/     bin)   Gid: (    1/     bin)
Access: 2019-05-14 23:42:50.842844635 -0700
Modify: 2019-02-12 13:13:21.741949852 -0800
Change: 2019-02-12 13:13:21.741949852 -0800
 Birth: -

> I have a .notce environment

I wonder if this has something to do with it. What happens if you run module use /usr/tcetmp/modulefiles/Core, then module load StdEnv, and then start your login node flux instance and run wreckrun? That should pull Spectrum MPI, the XL compiler, and most importantly CUDA into your environment:

→ module show StdEnv
<snip>
load("xl")
load("spectrum-mpi/rolling-release")
load("cuda")


dongahn commented May 15, 2019

Hmmm. I think we need to find out what defines PAMI_CUDA_RegisterPAMIContexts. Judging from the symbol name, it looks like it should come from the PAMI library itself or one of its dependencies. Perhaps running nm on the libraries in the Spectrum MPI directory would suggest something?
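One way the nm search could look, as a sketch only. The function name `find_symbol_definer` is hypothetical, it assumes GNU nm (for `-D` and `--defined-only`), and the directory in the trailing comment is just the one mentioned earlier in this thread; the defining library may well live elsewhere.

```shell
# Hypothetical helper: list shared objects under a directory whose
# dynamic symbol table defines the given symbol.
find_symbol_definer() {
    sym=$1
    dir=$2
    # -type f skips symlinks such as libfoo.so -> libfoo.so.1,
    # so each real library is reported once.
    find "$dir" -name '*.so*' -type f 2>/dev/null | while IFS= read -r lib; do
        # The last field of each nm line is the symbol name; strip any
        # @VERSION suffix before comparing.
        if nm -D --defined-only "$lib" 2>/dev/null \
            | awk -v s="$sym" '{ n = $NF; sub(/@.*/, "", n); if (n == s) found = 1 }
                               END { exit !found }'; then
            printf '%s\n' "$lib"
        fi
    done
}

# Possible use (directory taken from the comment above; assumption):
#   find_symbol_definer PAMI_CUDA_RegisterPAMIContexts /opt/ibm/spectrum_mpi/lib
```

An empty result would suggest the symbol is defined somewhere outside that directory, or only made available through the dlsym wrapping described earlier.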
