Patcher issues reported on mailing list #1654
@hjelmn Where are we with this?
Let me try to detail my findings. I am working on master, compiled with the default options, but I call MPI_Init_thread with THREAD_MULTIPLE. However, I am not sure how the MPI library has been initialized matters in this context, simply because the issue (at least as far as I understand it) is not related to MPI functions. In fact I noticed it happening when the application threads are allocating/releasing memory at the same time. Anyway, I have a test case with which I can replicate it pretty consistently, so I will use it to explain what I understand of the issue. I will dump below the stacks of the 2 threads that deadlock. The first thread (let's call it th1) is basically trying to register some memory. While parsing the list of allocations, it realizes that it has to free a VMA item. At this point it will try to acquire the libc memory allocator lock. Keep in mind that here we are deep inside the registration, so we already own the rcache_grdma->cache->vma_module->vma_lock (it was taken right above do_unregistration_gc).
The second thread (let's call it th2) is oblivious to any MPI operation. In its execution path, it tries to deallocate some memory. So it acquires the memory allocator lock (in _int_free), and then ends up in the grdma trying to acquire the rcache_grdma->cache->vma_module->vma_lock. Bad luck: that lock is held by the other thread (th1), which is trying to take the locks in the opposite order.
Thus, this seems to be a classic case of multiple threads interlocking on shared resources. Unfortunately, one of the locks is outside our control. The only solution I can see is to defer the release of the VMA items until we can do it safely without holding the rcache_grdma->cache->vma_module->vma_lock (this is totally ugly). Feel free to chime in.
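The two call paths described above can be single-stepped in a minimal sketch. The lock names follow the stacks (libc allocator lock, grdma vma_lock); the simulated-lock machinery is invented for illustration so the sketch terminates instead of actually hanging:

```c
#include <assert.h>

/* Simulated locks: 0 = free, 1 = held. A real deadlock needs two threads;
   here we single-step the interleaving that hangs in practice. */
static int malloc_lock, vma_lock;

static int try_acquire(int *lock) { if (*lock) return 0; *lock = 1; return 1; }
static void release(int *lock) { *lock = 0; }

/* Interleaving from the stacks above:
   th1 takes vma_lock, th2 takes malloc_lock, then each needs the other's. */
int deadlocks(void) {
    assert(try_acquire(&vma_lock));     /* th1, right above do_unregistration_gc */
    assert(try_acquire(&malloc_lock));  /* th2, inside _int_free */
    int th1_blocked = !try_acquire(&malloc_lock); /* th1 wants to free() a VMA item */
    int th2_blocked = !try_acquire(&vma_lock);    /* th2 reaches the grdma cache */
    release(&vma_lock);
    release(&malloc_lock);
    return th1_blocked && th2_blocked;  /* both would wait forever: hold-and-wait */
}
```

With blocking mutexes instead of trylock, both threads stall at the third and fourth acquisitions, which is exactly the classic opposite-order deadlock.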
Ugh, that is really ugly. I have an idea that might work. Let me try coding it up and see where it goes. Should have this done sometime today.
George, there is no way around it. We absolutely have to delay freeing anything until the vma lock is dropped to protect against this. I think I have a working solution that should do the job without being too horrendous. Testing now and will post a PR soon if it works.
Once you file the PR I will gladly give it a try; I have the perfect reproducer.
This commit fixes several bugs in the registration cache code:
- Fix a programming error in the grdma invalidation function that can cause an infinite loop if more than 100 registrations are associated with a munmapped region. This happens because the mca_rcache_base_vma_find_all function returns the same 100 registrations on each call. This has been fixed by adding an iterate function to the vma tree interface.
- Always obtain the vma lock when needed. This is required because there may be other threads in the system even if opal_using_threads() is false. Additionally, since it is safe to do so (the vma lock is recursive) the vma interface has been made thread safe.
- Avoid calling free() while holding a lock. This avoids race conditions with locks held outside the Open MPI code.
Fixes open-mpi#1654.
Signed-off-by: Nathan Hjelm <[email protected]>
This commit fixes several bugs in the registration cache code:
- Fix a programming error in the grdma invalidation function that can cause an infinite loop if more than 100 registrations are associated with a munmapped region. This happens because the mca_rcache_base_vma_find_all function returns the same 100 registrations on each call. This has been fixed by adding an iterate function to the vma tree interface.
- Always obtain the vma lock when needed. This is required because there may be other threads in the system even if opal_using_threads() is false. Additionally, since it is safe to do so (the vma lock is recursive) the vma interface has been made thread safe.
- Avoid calling free() while holding a lock. This avoids race conditions with locks held outside the Open MPI code.
Back-port of open-mpi/ompi@ab8ed17
Fixes open-mpi/ompi#1654.
Signed-off-by: Nathan Hjelm <[email protected]>
This issue, if it was ever really solved, seems to be back in OMPI 2.0.1 (which has all the patches related to this issue that master has). Here is the bt of one thread:
and the bt of the other one.
As originally reported, they try to cross-lock mutexes, and bad things happen.
Discussed on 3 Jan 2016 webex: @hjelmn is going to check this and see if it needs to be a blocker for v2.0.2.
Tentatively moving to v2.0.2 just so that we don't forget to re-evaluate this for v2.0.2.
@bosilca Does this occur on master or just 2.0.x?
I'm pretty sure I tried to get my test to hang before without success. And I just now re-tried using a malloc/free oriented test running on a couple of threads and it still didn't hang for me. I agree with the concern over what those stack traces show, though. I would want as little as possible happening under opal_mem_hooks_release_hook(). And if its work includes acquiring some lock X (vma_lock?) then that cascades out a requirement that nobody else who ever holds lock X can do much of consequence while holding it, in particular nothing that could call malloc/free. I don't really know vma_lock's lifecycle, but if that's a new restriction I could believe something was missed. I'd be very tempted to have a new lock that's used for nothing but the structure holding the memory region records triggered from the memory hooks.
My test was in 2.x. But at that moment the release branch was supposed to be in sync with master.
@markalle The vma lock is recursive, so in general it is safe to do just about anything while holding the lock. The one exception is that free cannot be called from vma_tree_delete, as we may be calling that function from a memory hook. The fix in #2703 addresses that issue by putting any deleted vma onto a list that will be cleared the next time a vma is inserted.
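The deferral scheme described here can be sketched as a stand-alone model (the vma_t layout and list handling are invented for illustration; the real #2703 code differs):

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical VMA record; only the link matters for this sketch. */
typedef struct vma { struct vma *next; } vma_t;

static vma_t *gc_list;   /* deleted vmas parked here, freed later */
static int freed_count;  /* instrumentation for this sketch only */

/* Possibly called from a memory hook: must NOT call free() here,
   because the caller may already hold the libc allocator lock. */
void vma_tree_delete(vma_t *vma) {
    vma->next = gc_list;
    gc_list = vma;
}

/* Called from normal code paths only (never from a hook), so freeing
   the parked vmas here cannot interleave with a held allocator lock. */
void vma_tree_insert(void) {
    while (gc_list) {
        vma_t *dead = gc_list;
        gc_list = dead->next;
        free(dead);
        ++freed_count;
    }
    /* ... actual insertion of the new vma would happen here ... */
}
```

The key property: free() moves from the hook-reachable delete path to the insert path, which the hooks never enter.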
This commit fixes a deadlock that can occur when the libc version holds a lock when calling munmap. In this case we could end up calling free() from vma_tree_delete, which would in turn try to obtain the lock in libc. To avoid the issue, put any deleted vmas in a new list on the vma module and release them on the next call to vma_tree_insert. This should be safe as this function is not called from the memory hooks.
Backported from 79cabc9
Fixes open-mpi#1654
Signed-off-by: Nathan Hjelm <[email protected]>
I'm still concerned that the fix in #2719 may be too specific, based on the above description that the "vma lock is recursive so in general it is safe to do just about anything while holding the lock". What makes #1654 a hang is the fact that someone holding vma_lock calls free() or otherwise does something non-trivial in glibc. vma_tree_delete() is one case of that, but no vma_lock holder anywhere can call free(), or else we end up with essentially the above stack trace. Example:
Here it doesn't matter where th2 is while it's holding that vma_lock; no vma_lock holder can call malloc/free or do anything "complex" in glibc.
This deadlock is still around. In fact it doesn't matter in what mode we initialize the MPI library: as soon as we replace the memory allocator (aka we use IB) and the application has threads, we're basically doomed. If I disable pinned (and pipeline pinned), OMPI runs to completion (but supposedly with lower performance). Here we have a multi-threaded C++ application (which does a lot of new/free internally), running on top of OMPI 2.0.2. A similar deadlock can be obtained on OMPI master. It deadlocks consistently after a few seconds. Here are a few of the thread stacks:
Bah. Ok. I have an idea how to deal with this one. Should have a patch ready this week.
This commit makes the vma tree garbage collection list a LIFO. This way we can avoid having to hold any lock when releasing vmas. In theory this should finally fix the hold-and-wait deadlock detailed in open-mpi#1654.
Signed-off-by: Nathan Hjelm <[email protected]>
This commit makes the vma tree garbage collection list a LIFO. This way we can avoid having to hold any lock when releasing vmas. In theory this should finally fix the hold-and-wait deadlock detailed in open-mpi#1654.
Signed-off-by: Nathan Hjelm <[email protected]>
(cherry picked from commit 60ad9d1)
Signed-off-by: Nathan Hjelm <[email protected]>
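The LIFO idea in the commits above can be sketched with C11 atomics (a stand-alone model, not Open MPI's opal_lifo_t): deleters push onto an atomic stack with no lock, and the drainer detaches the whole stack in one exchange, so no lock is held while the nodes are freed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

typedef struct gc_node { struct gc_node *next; } gc_node_t;

/* Head of the garbage LIFO; written lock-free by any thread. */
static _Atomic(gc_node_t *) gc_head;

/* Push a deleted vma without taking any lock (safe inside a memory hook). */
void gc_push(gc_node_t *node) {
    gc_node_t *old = atomic_load(&gc_head);
    do {
        node->next = old;
    } while (!atomic_compare_exchange_weak(&gc_head, &old, node));
}

/* Detach the entire list atomically, then free the nodes with no lock held. */
int gc_drain(void) {
    gc_node_t *list = atomic_exchange(&gc_head, (gc_node_t *)NULL);
    int n = 0;
    while (list) {
        gc_node_t *next = list->next;
        free(list);
        list = next;
        ++n;
    }
    return n;
}
```

Because the drainer takes the whole list at once rather than popping node by node, this push/exchange-all pattern avoids the ABA pitfalls of a general lock-free pop, which is what makes a simple LIFO sufficient here.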
As reported by @PHHargrove http://www.open-mpi.org/community/lists/devel/2016/05/18928.php and http://www.open-mpi.org/community/lists/devel/2016/05/18930.php, and by @bosilca http://www.open-mpi.org/community/lists/devel/2016/05/18929.php, there appear to be new segv's in master -- potentially caused by patcher...?
Here's one stack trace posted by @PHHargrove on LITTLE-ENDIAN Power8:
Here's another:
BIG-endian PPC64 w/ xlc V13.1 experiences a nearly identical failure.
However, this time gdb appears to have been able to resolve frame #0 to a PLT slot (instead of "??").
@hjelmn Please investigate.