Patcher issues reported on mailing list #1654

Closed
jsquyres opened this issue May 7, 2016 · 22 comments · Fixed by #1673 or open-mpi/ompi-release#1171

@jsquyres
Member

jsquyres commented May 7, 2016

As reported by @PHHargrove http://www.open-mpi.org/community/lists/devel/2016/05/18928.php and http://www.open-mpi.org/community/lists/devel/2016/05/18930.php, and by @bosilca http://www.open-mpi.org/community/lists/devel/2016/05/18929.php, there appear to be new segv's in master -- potentially caused by patcher...?

Here's one stack trace posted by @PHHargrove on LITTLE-ENDIAN Power8:

Program terminated with signal SIGSEGV, Segmentation fault.

(gdb) where
#0  0x0000000000000000 in ?? ()
#1  0x00003fff897adb38 in intercept_munmap (start=0x3fff89670000, length=65536)
    at /home/phargrov/OMPI/openmpi-v2.x-dev-1410-g81e0924-linux-ppc64el-xlc/openmpi-gitclone/opal/mca/memory/patcher/memory_patcher_component.c:155
#2  0x00003fff8933bc80 in __GI__IO_setb () from /lib64/libc.so.6
#3  0x00003fff89339528 in __GI__IO_file_close_it () from /lib64/libc.so.6
#4  0x00003fff89327f74 in fclose@@GLIBC_2.17 () from /lib64/libc.so.6
#5  0x0000000010000f7c in do_test ()
    at /home/phargrov/OMPI/openmpi-v2.x-dev-1410-g81e0924-linux-ppc64el-xlc/openmpi-gitclone/ompi/debuggers/dlopen_test.c:97
#6  0x00000000100010e0 in main (argc=1, argv=0x3fffff332888)
    at /home/phargrov/OMPI/openmpi-v2.x-dev-1410-g81e0924-linux-ppc64el-xlc/openmpi-gitclone/ompi/debuggers/dlopen_test.c:135

"start" is valid:
(gdb) print *(char*)0x3fff89670000
$1 = 35 '#'

Frame 1:
155         opal_mem_hooks_release_hook (start, length, true);
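For context, here is a rough, hypothetical simplification of the interception pattern involved (not the actual code in memory_patcher_component.c): the patched munmap wrapper first notifies the registration caches via opal_mem_hooks_release_hook() and then performs the real unmap. One way to read frame #0 sitting at 0x0000000000000000 (or, in the second trace below, at an unresolved PLT slot for opal_mem_hooks_release_hook) is that this hook call goes through a pointer or PLT entry that is not usable at that point.

/* Hypothetical, simplified sketch of the patcher munmap intercept; not
 * the actual Open MPI implementation. */
#include <stdbool.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* stand-in for the real hook, which lives in libopen-pal and walks the
 * registration caches */
static void opal_mem_hooks_release_hook(void *buf, size_t length, bool from_alloc)
{
    (void) buf; (void) length; (void) from_alloc;
}

static int intercept_munmap(void *start, size_t length)
{
    /* notify the caches that the region is going away ... */
    opal_mem_hooks_release_hook(start, length, true);

    /* ... then perform the real munmap, bypassing the patched symbol */
    return (int) syscall(SYS_munmap, start, length);
}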

Here's another:

BIG-endian PPC64 w/ xlc V13.1 experiences a nearly identical failure.
However, this time gdb appears to have been able to resolve frame #0 to a PLT slot (instead of "??").

#0  0x00000fff8904ef88 in 00000010.plt_call.opal_mem_hooks_release_hook+0 ()
   from /gpfs-biou/phh1/OMPI/openmpi-v2.x-dev-1410-g81e0924-linux-ppc64-xlc-13.1/INST/lib/libopen-pal.so.20
#1  0x00000fff8910b630 in intercept_munmap (start=0xfff88d20000, length=2097152)
    at /gpfs-biou/phh1/OMPI/openmpi-v2.x-dev-1410-g81e0924-linux-ppc64-xlc-13.1/openmpi-gitclone/opal/mca/memory/patcher/memory_patcher_component.c:155
#2  0x000000800cc5ca80 in ._IO_setb () from /lib64/libc.so.6
#3  0x000000800cc5b16c in ._IO_file_close_it () from /lib64/libc.so.6
#4  0x000000800cc4a758 in .fclose () from /lib64/libc.so.6
#5  0x0000000010000f88 in do_test ()
    at /gpfs-biou/phh1/OMPI/openmpi-v2.x-dev-1410-g81e0924-linux-ppc64-xlc-13.1/openmpi-gitclone/ompi/debuggers/dlopen_test.c:97
#6  0x00000000100010d8 in main (argc=1, argv=0xffff462f398)
    at /gpfs-biou/phh1/OMPI/openmpi-v2.x-dev-1410-g81e0924-linux-ppc64-xlc-13.1/openmpi-gitclone/ompi/debuggers/dlopen_test.c:135

@hjelmn Please investigate.

@jsquyres
Member Author

@hjelmn Where are we with this?

@jsquyres
Member Author

@hjelmn We're told by @bosilca that C++ new is busted with this. Any ideas?

@bosilca
Member

bosilca commented May 13, 2016

Let me try to detail my findings. I am working on master, compiled with the default options, but I call MPI_Init_thread with THREAD_MULTIPLE. However, I am not sure that how the MPI library has been initialized matters in this context, simply because the issue (at least as far as I understand it) is not related to MPI functions. In fact, I noticed it happening when the application threads are allocating/releasing memory at the same time. Anyway, I have a test case that reproduces this pretty consistently, so I will use it to explain what I understand of the issue. I will dump below the stacks of the two threads that deadlock.

The first thread (let's call it th1) is basically trying to register some memory. While parsing the list of allocations, it realizes that it has to free a VMA item. At this point it will try to acquire the libc memory allocator lock. Keep in mind that here we are deep inside the registration, so we already own the rcache_grdma->cache->vma_module->vma_lock (it was taken right above do_unregistration_gc).

#0  0x00007f705a10309e in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00007f705a087f4e in _L_lock_5793 () from /lib64/libc.so.6
#2  0x00007f705a083adb in _int_free () from /lib64/libc.so.6
#3  0x00007f7059ab924a in mca_rcache_base_vma_tree_delete (vma_module=0x336ba70, reg=0x15fad300)
    at ../../../../ompi/opal/mca/rcache/base/rcache_base_vma_tree.c:494
#4  0x00007f7059ab74d7 in mca_rcache_base_vma_delete (vma_module=0x336ba70, reg=0x15fad300)
    at ../../../../ompi/opal/mca/rcache/base/rcache_base_vma.c:144
#5  0x00007f705420b4d4 in dereg_mem (reg=0x15fad300)
    at ../../../../../ompi/opal/mca/rcache/grdma/rcache_grdma_module.c:131
#6  0x00007f705420b549 in do_unregistration_gc (rcache=0x3373ee0)
    at ../../../../../ompi/opal/mca/rcache/grdma/rcache_grdma_module.c:153
#7  0x00007f705420b6e4 in mca_rcache_grdma_register (rcache=0x3373ee0, addr=0xd4d13c0, size=33132,
    flags=0, access_flags=5, reg=0x7f704cc2b750)
    at ../../../../../ompi/opal/mca/rcache/grdma/rcache_grdma_module.c:206
#8  0x00007f704fbaf917 in mca_btl_openib_register_mem (btl=0x33742e0, endpoint=0x359c680,
    base=0xd4d13c0, size=33132, flags=5)
    at ../../../../../ompi/opal/mca/btl/openib/btl_openib.c:1940
#9  0x00007f704f15aa8a in mca_bml_base_register_mem (bml_btl=0x50c0cf0, base=0xd4d13c0, size=33132,
    flags=5, handle=0xaf4f3c8) at ../../../../../ompi/ompi/mca/bml/bml.h:354
#10 0x00007f704f15d259 in mca_pml_ob1_recv_request_progress_rget (recvreq=0xaf4f100, btl=0x33742e0,
    segments=0x3525408, num_segments=1)
    at ../../../../../ompi/ompi/mca/pml/ob1/pml_ob1_recvreq.c:708
#11 0x00007f704f159297 in mca_pml_ob1_recv_frag_match (btl=0x33742e0, hdr=0x3529144,
    segments=0x3525408, num_segments=1, type=67)
    at ../../../../../ompi/ompi/mca/pml/ob1/pml_ob1_recvfrag.c:718

The second thread (let's call it th2) is oblivious to any MPI operation. In its execution path, it tries to deallocate some memory. So it acquires the memory allocator lock (in _int_free), and then ends up in the grdma code trying to acquire the rcache_grdma->cache->vma_module->vma_lock lock. Bad luck: that lock is held by the other thread (th1), which is trying to take the two locks in the opposite order.

#0  0x00007f705d69b334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f705d6965f3 in _L_lock_892 () from /lib64/libpthread.so.0
#2  0x00007f705d6964d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f705420a4e6 in opal_mutex_lock (m=0x336bd60)
    at ../../../../../ompi/opal/threads/mutex_unix.h:139
#4  0x00007f705420bf1b in mca_rcache_grdma_invalidate_range (rcache=0x3373ee0, base=0x7f703f163000,
    size=65536) at ../../../../../ompi/opal/mca/rcache/grdma/rcache_grdma_module.c:389
#5  0x00007f7059ab96b1 in mca_rcache_base_mem_cb (base=0x7f703f163000, size=65536, cbdata=0x0,
    from_alloc=false) at ../../../../ompi/opal/mca/rcache/base/rcache_base_mem_cb.c:63
#6  0x00007f7059a00eec in opal_mem_hooks_release_hook (buf=0x7f703f163000, length=65536,
    from_alloc=false) at ../../ompi/opal/memoryhooks/memory.c:126
#7  0x00007f7059aae518 in intercept_madvise (start=0x7f703f163000, length=65536, advice=4)
    at ../../../../../ompi/opal/mca/memory/patcher/memory_patcher_component.c:234
#8  0x00007f705a083b6b in _int_free () from /lib64/libc.so.6
#9  0x00000000016b22f8 in std::_Sp_counted_deleter<double*, void (*)(void*), std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x7f703c62c2e0)
    at /opt/gcc-5.1/include/c++/5.1.0/bits/shared_ptr_base.h:466
#10 0x000000000168de9c in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (
    this=0x7f703c62c2e0) at /opt/gcc-5.1/include/c++/5.1.0/bits/shared_ptr_base.h:150
#11 0x0000000001687dc9 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (
    this=0x7f703c627b60, __in_chrg=<optimized out>)
    at /opt/gcc-5.1/include/c++/5.1.0/bits/shared_ptr_base.h:659

Thus, this seems to be a classic case of two threads acquiring the same locks in opposite orders. Unfortunately, one of the locks is outside our control. The only solution I can see is to defer the release of the VMA items until we can do it safely, without holding the rcache_grdma->cache->vma_module->vma_lock lock (this is totally ugly).
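For anyone who wants to see the shape of the problem in isolation, here is a minimal stand-alone illustration of the inversion (hypothetical stand-in locks, not the real OMPI or glibc code). Compile with -pthread; both threads block forever:

#include <pthread.h>
#include <unistd.h>

/* stand-ins for glibc's allocator lock and for
 * rcache_grdma->cache->vma_module->vma_lock */
static pthread_mutex_t allocator_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t vma_lock       = PTHREAD_MUTEX_INITIALIZER;

/* th1: registration path takes vma_lock first, then ends up in free() */
static void *th1(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&vma_lock);
    sleep(1);                             /* widen the race window */
    pthread_mutex_lock(&allocator_lock);  /* blocks: th2 holds it */
    pthread_mutex_unlock(&allocator_lock);
    pthread_mutex_unlock(&vma_lock);
    return NULL;
}

/* th2: plain free() takes the allocator lock first, then the memory hook
 * wants vma_lock */
static void *th2(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&allocator_lock);
    sleep(1);
    pthread_mutex_lock(&vma_lock);        /* blocks: th1 holds it */
    pthread_mutex_unlock(&vma_lock);
    pthread_mutex_unlock(&allocator_lock);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, th1, NULL);
    pthread_create(&b, NULL, th2, NULL);
    pthread_join(a, NULL);   /* never returns: both threads are stuck */
    pthread_join(b, NULL);
    return 0;
}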

Feel free to chime in.

@hjelmn
Member

hjelmn commented May 16, 2016

Ugh, that is really ugly. I have an idea that might work. Let me try coding it up and see where it goes. Should have this done sometime today.

@jsquyres jsquyres modified the milestones: Future, v2.0.0 May 16, 2016
@hjelmn
Member

hjelmn commented May 16, 2016

George, there is no way around it. We absolutely have to delay freeing anything until the vma lock is dropped to protect against this. I think I have a working solution that should do the job without being too horrendous. Testing now and will post a PR soon if it works.

@bosilca
Member

bosilca commented May 16, 2016

Once you file the PR I will gladly give it a try; I have the perfect reproducer.

hjelmn added a commit to hjelmn/ompi that referenced this issue May 17, 2016
This commit fixes several bugs in the registration cache code:

 - Fix a programming error in the grdma invalidation function that can
   cause an infinite loop if more than 100 registrations are
   associated with a munmapped region. This happens because the
   mca_rcache_base_vma_find_all function returns the same 100
   registrations on each call. This has been fixed by adding an
   iterate function to the vma tree interface.

 - Always obtain the vma lock when needed. This is required because
   there may be other threads in the system even if
   opal_using_threads() is false. Additionally, since it is safe to do
   so (the vma lock is recursive) the vma interface has been made
   thread safe.

 - Avoid calling free() while holding a lock. This avoids race
   conditions with locks held outside the Open MPI code.

Fixes open-mpi#1654.

Signed-off-by: Nathan Hjelm <[email protected]>
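To make the first bullet of the commit message concrete, here is an illustrative-only sketch (toy types and names, not the real vma-tree API) of why a fixed-size find_all loop can spin forever while a one-pass iterator cannot:

#include <stddef.h>

typedef struct reg { struct reg *next; int busy; } reg_t;

enum { FIND_ALL_MAX = 100 };

/* find_all style: fills `out` with the first FIND_ALL_MAX matching
 * registrations; it has no memory of previous calls. */
static int vma_find_all(reg_t *head, reg_t **out, int max)
{
    int n = 0;
    for (reg_t *r = head; r && n < max; r = r->next)
        out[n++] = r;
    return n;
}

/* Problematic caller: if none of the registrations in the batch can be
 * removed right now (all still busy), the next call returns the exact
 * same batch and this loop never makes progress. */
static void invalidate_with_find_all(reg_t *head)
{
    reg_t *batch[FIND_ALL_MAX];
    int n;
    do {
        n = vma_find_all(head, batch, FIND_ALL_MAX);
        /* try to invalidate each entry; busy ones are skipped ... */
    } while (n == FIND_ALL_MAX);   /* can loop forever */
}

/* Iterator style: every matching registration is visited exactly once,
 * so the walk terminates regardless of what the callback decides. */
static void invalidate_with_iterate(reg_t *head,
                                    void (*cb)(reg_t *, void *), void *ctx)
{
    for (reg_t *r = head; r; r = r->next)
        cb(r, ctx);
}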
hjelmn added a commit to hjelmn/ompi-release that referenced this issue May 18, 2016
Back-port of open-mpi/ompi@ab8ed17

Fixes open-mpi/ompi#1654.

Signed-off-by: Nathan Hjelm <[email protected]>
hjelmn added a commit to hjelmn/ompi-release that referenced this issue May 19, 2016
hjelmn added a commit to hjelmn/ompi that referenced this issue Sep 13, 2016
hjelmn added a commit to hjelmn/ompi that referenced this issue Sep 21, 2016
@bosilca
Member

bosilca commented Sep 26, 2016

This issue, if it was ever really solved, seems to be back in OMPI 2.0.1 (which has all of the patches related to this issue that master has).

Here is the bt of one thread:

(gdb) bt
#0  0x00007fda674a3334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fda6749e5f3 in _L_lock_892 () from /lib64/libpthread.so.0
#2  0x00007fda6749e4d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fda5bbf9858 in mca_rcache_vma_tree_iterate () from /opt/ompi/2.0.1/lib/openmpi/mca_rcache_vma.so
#4  0x00007fda6625e775 in mca_mpool_base_mem_cb () from /opt/ompi/2.0.1/lib/libopen-pal.so.20
#5  0x00007fda661def3e in opal_mem_hooks_release_hook () from /opt/ompi/2.0.1/lib/libopen-pal.so.20
#6  0x00007fda6625da4d in intercept_madvise () from /opt/ompi/2.0.1/lib/libopen-pal.so.20
#7  0x00007fda66bb4c4b in _int_free () from /lib64/libc.so.6

and the bt of the other one.

(gdb) bt
#0  0x00007fda66c3420e in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00007fda66bb90aa in _L_lock_5907 () from /lib64/libc.so.6
#2  0x00007fda66bb4bbb in _int_free () from /lib64/libc.so.6
#3  0x000000000493095d in dague_arena_release_chunk (arena=0x25665730, chunk=0x7fda31358ad0)

As originally reported, the two threads try to take the mutexes in opposite orders, and bad things happen.

@bosilca bosilca reopened this Sep 26, 2016
bosilca pushed a commit to bosilca/ompi that referenced this issue Oct 3, 2016
@hppritcha hppritcha modified the milestones: v2.0.2, v2.0.0 Oct 3, 2016
@jsquyres
Member Author

jsquyres commented Jan 3, 2017

Discussed on the 3 Jan 2017 webex: @hjelmn is going to check this and see if it needs to be a blocker for v2.0.2.

@jsquyres
Member Author

jsquyres commented Jan 3, 2017

Tentatively moving to v2.0.2 just so that we don't forget to re-evaluate this for v2.0.2.

@jsquyres jsquyres modified the milestones: v2.0.2, v2.0.3 Jan 3, 2017
@jsquyres
Member Author

jsquyres commented Jan 4, 2017

@gpaulsen @jjhursey @markalle Can you check this, too?

@gpaulsen
Member

gpaulsen commented Jan 4, 2017

Sure. @markalle has a good memory stress test that I THINK is multithreaded (or could be made to be). Did @bosilca ever make his reproducer available?

@bosilca
Member

bosilca commented Jan 4, 2017

My original reproducer was MADNESS. The more recent stack trace was using DPLASMA.

@hjelmn
Member

hjelmn commented Jan 4, 2017

@bosilca Does this occur on master or just 2.0.x?

@markalle
Contributor

markalle commented Jan 4, 2017

I'm pretty sure I tried to get my test to hang before without success. And I just now re-tried using a malloc/free oriented test running on a couple threads and it still didn't hang for me.

I agree with the concern over what those stack traces show though. I would want as little as possible happening under opal_mem_hooks_release_hook().

And if its work includes acquiring some lock X (vma_lock?), then that cascades into a requirement that nobody else who ever holds lock X can do much of consequence while holding it, in particular nothing that could call malloc/free. I don't really know vma_lock's lifecycle, but if that's a new restriction I could believe something was missed.

I'd be very tempted to have a new lock that's used for nothing but the structure holding the memory-region records that the memory hooks touch.

@bosilca
Member

bosilca commented Jan 5, 2017

My test was on 2.x, but at that moment the release branch was supposed to be in sync with master.

@hjelmn
Member

hjelmn commented Jan 11, 2017

@markalle The vma lock is recursive, so in general it is safe to do just about anything while holding the lock. The one exception is that free() cannot be called from vma_tree_delete, as we may be calling that function from a memory hook. The fix in #2703 addresses that issue by putting any deleted vma onto a list that will be cleared the next time a vma is inserted.
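A rough sketch of that idea (hypothetical names, and the interval tree elided to comments; not the actual #2703 code): the delete path only parks items on a garbage list, and the free() happens later from the insert path, which is never reached from a memory hook.

#include <stdlib.h>

typedef struct vma_item {
    struct vma_item *gc_next;  /* link used only while parked on the gc list */
    /* ... interval bounds, attached registrations, etc. ... */
} vma_item_t;

typedef struct vma_module {
    vma_item_t *gc_list;       /* deleted items waiting to be freed */
    /* ... the actual interval tree lives here ... */
} vma_module_t;

/* Called (with vma_lock held) from paths that may already sit under
 * glibc's allocator lock, e.g. via the memory hooks: must not free(). */
static void vma_tree_delete(vma_module_t *m, vma_item_t *item)
{
    /* ... unlink `item` from the interval tree ... */
    item->gc_next = m->gc_list;    /* defer the actual free */
    m->gc_list = item;
}

/* Called only from regular registration paths, never from a memory hook,
 * so giving memory back to the allocator is safe here. */
static void vma_tree_insert(vma_module_t *m, vma_item_t *item)
{
    while (m->gc_list) {           /* drain deferred frees first */
        vma_item_t *dead = m->gc_list;
        m->gc_list = dead->gc_next;
        free(dead);
    }
    /* ... then insert `item` into the interval tree ... */
    (void) item;
}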

hjelmn added a commit to hjelmn/ompi that referenced this issue Jan 12, 2017
This commit fixes a deadlock that can occur when the libc version
holds a lock when calling munmap. In this case we could end up calling
free() from vma_tree_delete which would in turn try to obtain the lock
in libc. To avoid the issue put any deleted vma's in a new list on the
vma module and release them on the next call to vma_tree_insert. This
should be safe as this function is not called from the memory hooks.

Backported from 79cabc9

Fixes open-mpi#1654

Signed-off-by: Nathan Hjelm <[email protected]>
@hppritcha
Member

closed via #2719 and #2703

@markalle
Contributor

I'm still concerned that the fix in #2719 may be too specific, based on the above description that the "vma lock is recursive so in general it is safe to do just about anything while holding the lock".

What makes #1654 a hang is the fact that someone holding vma_lock calls free() or otherwise does something non-trivial in glibc. vma_tree_delete() is one case of that, but no vma_lock holder anywhere can call free(), or else we end up with essentially the stack trace above. For example:

td1:
    free -- gets glibc lock
      interception into mem_hooks_release_hook()
          mca_rcache_anything wants vma_lock
td2:
    holder of vma_lock
        free -- wants glibc_lock

Here it doesn't matter where td2 is while it's holding that vma_lock: no vma_lock holder can call malloc/free or do anything "complex" in glibc.

@bosilca bosilca reopened this Feb 19, 2017
@bosilca
Member

bosilca commented Feb 19, 2017

This deadlock is still around. In fact it doesn't matter in what mode we initialize the MPI library: as soon as we replace the memory allocator (i.e., we use IB) and the application has threads, we're basically doomed. If I disable pinned (and pipeline pinned), OMPI runs to completion (but presumably with lower performance).

Here we have a multi-threaded C++ application (which does a lot of new/free internally), running on top of OMPI 2.0.2. A similar deadlock can be obtained on OMPI master. It deadlocks consistently after a few seconds.

Here are a few of the thread stacks:

Thread 9 (Thread 0x7fffde228700 (LWP 54889)):
#0  0x00007ffff65b220e in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00007ffff653743b in _L_lock_10288 () from /lib64/libc.so.6
#2  0x00007ffff6534c63 in malloc () from /lib64/libc.so.6
#3  0x00007fffe62a1910 in mca_rcache_vma_tree_insert () from /opt/ompi/2.0.2/lib/openmpi/mca_rcache_vma.so
#4  0x00007fffe62a0388 in mca_rcache_vma_insert () from /opt/ompi/2.0.2/lib/openmpi/mca_rcache_vma.so
#5  0x00007fffe5a95356 in mca_mpool_grdma_register () from /opt/ompi/2.0.2/lib/openmpi/mca_mpool_grdma.so
#6  0x00007fffe51c51e2 in mca_btl_openib_register_mem () from /opt/ompi/2.0.2/lib/openmpi/mca_btl_openib.so
#7  0x00007fffe4b961b0 in mca_pml_ob1_recv_request_progress_rget () from /opt/ompi/2.0.2/lib/openmpi/mca_pml_ob1.so
#8  0x00007fffe4b93315 in mca_pml_ob1_recv_frag_match () from /opt/ompi/2.0.2/lib/openmpi/mca_pml_ob1.so
#9  0x00007fffe51cdcbd in btl_openib_handle_incoming () from /opt/ompi/2.0.2/lib/openmpi/mca_btl_openib.so
#10 0x00007fffe51cfcc9 in btl_openib_component_progress () from /opt/ompi/2.0.2/lib/openmpi/mca_btl_openib.so
#11 0x00007ffff0395820 in opal_progress () from /opt/ompi/2.0.2/lib/libopen-pal.so.20
#12 0x00007ffff72db155 in ompi_request_default_test_some () from /opt/ompi/2.0.2/lib/libmpi.so.20
#13 0x00007ffff731045c in PMPI_Testsome () from /opt/ompi/2.0.2/lib/libmpi.so.20
...
Thread 8 (Thread 0x7fffdda27700 (LWP 54890)):
#0  0x00007ffff65b220e in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00007ffff65370aa in _L_lock_5907 () from /lib64/libc.so.6
#2  0x00007ffff6532bbb in _int_free () from /lib64/libc.so.6
#3  0x0000000000435674 in parsec_arena_release_chunk (arena=<optimized out>, chunk=<optimized out>) at ...
Thread 6 (Thread 0x7fffdca25700 (LWP 54892)):
#0  0x00007ffff7bcd334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007ffff7bc85f3 in _L_lock_892 () from /lib64/libpthread.so.0
#2  0x00007ffff7bc84d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fffe62a0c98 in mca_rcache_vma_tree_iterate () from /opt/ompi/2.0.2/lib/openmpi/mca_rcache_vma.so
#4  0x00007ffff0414ad5 in mca_mpool_base_mem_cb () from /opt/ompi/2.0.2/lib/libopen-pal.so.20
#5  0x00007ffff039502e in opal_mem_hooks_release_hook () from /opt/ompi/2.0.2/lib/libopen-pal.so.20
#6  0x00007ffff0413dad in intercept_madvise () from /opt/ompi/2.0.2/lib/libopen-pal.so.20
#7  0x00007ffff6532c4b in _int_free () from /lib64/libc.so.6
#8  0x0000000000435674 in parsec_arena_release_chunk (arena=<optimized out>, chunk=<optimized out>)
    at ...

@hjelmn
Member

hjelmn commented Feb 20, 2017

Bah. Ok. I have an idea how to deal with this one. Should have a patch ready this week.

hjelmn added a commit to hjelmn/ompi that referenced this issue Feb 22, 2017
This commit makes the vma tree garbage collection list a lifo. This
way we can avoid having to hold any lock when releasing vmas. In
theory this should finally fix the hold-and-wait deadlock detailed
in open-mpi#1654.

Signed-off-by: Nathan Hjelm <[email protected]>
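For reference, a sketch of what "garbage collection list as a lifo" buys (C11 atomics and hypothetical names, not the actual OMPI implementation): the release path becomes a single lock-free push, so it takes no lock at all, and draining happens later from a context where free() is safe. Draining the whole list with one atomic exchange also sidesteps the ABA problem that a lock-free pop would have.

#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct gc_item {
    struct gc_item *next;
} gc_item_t;

static _Atomic(gc_item_t *) gc_head = NULL;

/* safe to call from a memory hook: no locks, no malloc/free */
static void gc_push(gc_item_t *item)
{
    gc_item_t *old = atomic_load_explicit(&gc_head, memory_order_relaxed);
    do {
        item->next = old;
    } while (!atomic_compare_exchange_weak_explicit(
                 &gc_head, &old, item,
                 memory_order_release, memory_order_relaxed));
}

/* called later, from a context where free() is safe */
static void gc_drain(void)
{
    gc_item_t *list = atomic_exchange_explicit(&gc_head, NULL,
                                               memory_order_acquire);
    while (list) {
        gc_item_t *next = list->next;
        free(list);
        list = next;
    }
}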
@jsquyres
Member Author

jsquyres commented Mar 15, 2017

Per IM discussion with @hjelmn (per #3013 and the corresponding v2.0.x and v2.x PRs), this issue can be closed.

The only issue left is that we no longer hook madvise(2) properly, due to complications discussed in this issue and #3013. New enhancement issue filed as #3176.

thananon pushed a commit to thananon/ompi that referenced this issue Mar 31, 2017
markalle pushed a commit to markalle/ompi that referenced this issue Apr 5, 2017
hjelmn added a commit to hjelmn/ompi that referenced this issue Aug 7, 2017
(cherry picked from commit 60ad9d1)
Signed-off-by: Nathan Hjelm <[email protected]>