ompi 1.4.3 hangs in IMB Gather when np >= 64 & msgsize > 4k #125
Imported from trac issue 2714. Created by bbenton on 2011-02-08T10:52:29, last modified: 2012-02-21T13:40:41 |
Trac comment by jsquyres on 2011-02-25 09:08:16: Is this ticket related to #2722? |
Trac comment by bbenton on 2011-05-31 11:59:21: Given the increased number of incidents with this, moving it back as a blocker for 1.4.4. |
Trac comment by samuel on 2011-05-31 13:55:39: I couldn't reproduce the IMB/Gather hang on the following system setup: Intel(R) Xeon(R) CPU X5550 @ 2.67GHz (per `uname -iopmvrs`); Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0); OFED 1.4.1ish - we don't run a vanilla OFED stack; Open MPI 1.4.3 with the SM BTL memory barrier patch (see #2619); IMB 3.2; gcc 4.1.2 20080704 (Red Hat 4.1.2-48) - 32 iterations @ 72 rank processes on 9 nodes; Intel 11.1.072 - 32 iterations @ 72 rank processes on 9 nodes. Some potentially relevant MCA parameters that we run with: […] Hope that helps, Samuel Gutierrez |
Trac comment by bbenton on 2011-05-31 22:49:23: Adding Chris to cc: |
Trac comment by bbenton on 2011-06-07 11:32:20: We have been able to reproduce this at IBM over eHCAs on a couple of our P6/IH servers (32-core systems). Interestingly, we have seen this on 1.5.x as well as 1.4.x, although it does not seem to happen as frequently on the 1.5.x series. Here is the flow of the gather code: […]
The call to ompi_coll_tuned_gather_intra_linear_sync() results in additional, on-the-fly connections. If you force it to stay with ompi_coll_tuned_gather_intra_basic_linear(), things work fine. This can be done via the following MCA parameters:
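For illustration, here is a hedged sketch of how the tuned component's gather algorithm is typically pinned to the basic linear variant on the mpirun command line; the algorithm number and launch details are assumptions, not necessarily the exact parameters the original comment listed:

```sh
# Enable per-collective algorithm overrides in the tuned component,
# then force gather algorithm 1 (basic linear) instead of linear-with-sync.
# np, hostfile, and the IMB binary path are placeholders.
mpirun -np 64 --hostfile hosts \
    --mca coll_tuned_use_dynamic_rules 1 \
    --mca coll_tuned_gather_algorithm 1 \
    ./IMB-MPI1 Gather
```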
Chris Yeoh has looked into the state of things during one of the system hangs. Here is his analysis: Ranks 0-25 think the root process is […] In looking at where the root processes in […] I realise now that there is no synchronisation between tests of […]
which is […]
i.e., 62 thinks it never received the send from 26, although rank 26 is […]
i.e., the blocking send to 62. So it indicates the connection between 62 […] I think I don't quite understand the semantics of the blocking send, […]
and rank 27 is stuck at […]
which is actually further along and rank 26 couldn't possibly have […] Just putting the two above things together, I'm just wondering if it's […]
So rank 27 ends up progressing further but gets stuck on the second […] Perhaps someone more familiar with this code (George?) can take a look and provide their insights into the state of things as well as the analysis above. |
Trac comment by kliteyn on 2011-06-09 10:02:07: My $0.02: I was able to reproduce a hang with Allgather, but couldn't reproduce it with Gather. […]
msglen_file contains just a single size, "4194304". All 64 ranks are waiting at the same place: […]
I also see this happening with `--mca coll basic`, not only with `tuned`. |
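For context, a hedged sketch of how a message-length file like the msglen_file mentioned above is fed to IMB; the file name, np, and hostfile are placeholders:

```sh
# A message-length file holding the single 4 MB size mentioned above.
echo "4194304" > msglen_file

# Restrict the Allgather benchmark to the sizes listed in that file.
mpirun -np 64 --hostfile hosts \
    ./IMB-MPI1 -msglen msglen_file Allgather
```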
Trac comment by kliteyn on 2011-06-09 10:40:40: BTW, I get the same problem with the 1.5 branch. |
Trac comment by rolfv on 2011-06-09 11:04:11: I am probably stating the obvious here, but this must be related to the fact that some of the first messages we are sending are large. This means that we are using the PUT protocol from the PML layer: the sending side pins the memory, then sends a PUT message to the receiving side. Presumably, that message gets queued up, then gets popped off and attempts to pin the memory. I wonder if that fails, and then we get the hang. By setting mpi_leave_pinned=0, we are making use of the RNDV protocol instead of PUT, which means we are not actually pinning any memory. (I believe I submitted a bug against that.) One other thing to note is that IMB always does a warmup with a 4 Mbyte message prior to starting a test, so I would think you may see this hang with IMB regardless of the message size that it is testing. In terms of debugging, we would need to look at each process via a debugger and see where they are at. We would have to look at the send and receive requests and try to figure out what happened. Another far-fetched idea is to take a look at the btl_openib_failover.c file. There is a debugging function in there that will dump out all the internal queues from the openib BTL. This allows us to see if there is a message stuck in the BTL, maybe because the connection never completed. |
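A hedged sketch of the mpi_leave_pinned experiment described above (launch details are placeholders):

```sh
# Disable leave-pinned behaviour so large messages use the rendezvous
# (RNDV) pipeline rather than the PUT protocol with pinned buffers.
mpirun -np 64 --hostfile hosts \
    --mca mpi_leave_pinned 0 \
    ./IMB-MPI1 Gather
```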
Trac comment by kliteyn on 2011-06-14 08:14:30: The setup on which this issue was reproduced has been upgraded to RHEL6, and the problem disappeared... I've been able to reproduce probably the same problem on another cluster - this time it's with IMB Alltoall with 4MB messages and np >= 44 - but it's on a very remote setup where I don't have many privileges, so I'm trying to build another setup in-house to debug it. |
Trac comment by bbenton on 2011-06-14 12:23:28: I think that we might be chasing two problems. In particular, I was able to re-create the gather hang with mpi_leave_pinned = 0 (this was with np=64 & it hung with a message size of 8K). |
Trac comment by kliteyn on 2011-06-16 08:04:36: OK, then you're right - what I'm seeing is irrelevant for this particular issue (it's probably more relevant for https://svn.open-mpi.org/trac/ompi/ticket/2627). |
Trac comment by bbenton on 2011-07-25 13:37:57: Update: this defect remains elusive. We have not been able to recreate it here at IBM for some weeks now. Also, Brock Palen (via off-list email) has indicated that he has been unable to reproduce this after a "power event" forced him to restart his environment from scratch. My current recommendation for a workaround on IB fabrics is to force the use of ompi_coll_tuned_gather_intra_basic_linear() via: […] |
Trac comment by samuel on 2011-08-16 10:42:59: Hi, we updated one of our test clusters and are now experiencing a hang in IMB Alltoall. Can you ask the user to send us the output of `cat /sys/module/mlx4_core/parameters/log_mtts_per_seg`? We think that this hang is related to a memory registration limitation due to this setting (at least on our machine); we'll have more data later today. We are going to raise this value to 5 - its current value on our cluster is 0, which seems to limit the amount of memory that can be registered via calls to ibv_reg_mr, for example, to just under 2GB. Sam |
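For reference, a hedged sketch of how log_mtts_per_seg is usually inspected and raised; the modprobe.d file name is a placeholder, and the driver has to be reloaded (or the node rebooted) for the change to take effect:

```sh
# Check the current value (0 on the cluster described above).
cat /sys/module/mlx4_core/parameters/log_mtts_per_seg

# Persist a larger value so more memory can be registered via ibv_reg_mr;
# the file name under /etc/modprobe.d/ is only illustrative.
echo "options mlx4_core log_mtts_per_seg=5" | sudo tee /etc/modprobe.d/mlx4_core.conf

# Reload the mlx4 stack, e.g. via OFED's openibd service, or reboot.
sudo /etc/init.d/openibd restart
```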
Trac comment by bbenton on 2011-08-16 10:53:49: I'm not sure that the Alltoall issue is the same as this gather hang. I think that this seems to be more related to some of the issues with resource exhaustion and the registration cache, such as https://svn.open-mpi.org/trac/ompi/ticket/2155, https://svn.open-mpi.org/trac/ompi/ticket/2157, and https://svn.open-mpi.org/trac/ompi/ticket/2295. |
Trac comment by bbenton on 2011-08-16 10:56:34: Brock set us (Chris Yeoh & Brad Benton) up with accounts at UMich. Hopefully, we'll be able to better pursue this problem in the UMich environment. Meanwhile, as discussed during the Aug 9 telecon, since we have a viable workaround (selecting a different gather algorithm), I'm resetting this defect to "critical" and moving it to 1.4.5 so that we can get 1.4.4 out. |
Trac comment by pasha on 2011-08-16 11:27:21: Replying to [comment:13 samuel]: […]
BTW, please see the nice documents below that describe the limits tuning: […] |
Trac comment by samuel on 2011-08-16 12:32:58: Sorry for hijacking the thread. Thanks Pasha. We updated the setting and everything seems to be working as expected. Sam |
Trac comment by bbenton on 2012-02-21 13:40:41: Milestone Open MPI 1.4.5 deleted |
This seems to have been fixed forever ago. |
As reported on ompi-devel in the following email thread:
http://www.open-mpi.org/community/lists/devel/2011/01/8852.php
ompi 1.4.3 hangs in IMB/Gather when np >= 64. This is being seen mainly on x86_64 systems with Mellanox ConnectX HCAs. Current workarounds seem to be to either use rdmacm or use mpi_preconnect_mpi to establish all possible connections at job launch, rather than on demand. It also seems to be sensitive to the selection of the collective algorithm.
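A hedged sketch of what those two workarounds look like on the mpirun command line; np, hostfile, and the benchmark binary are placeholders:

```sh
# Workaround 1: have the openib BTL set up connections with RDMA CM.
mpirun -np 64 --hostfile hosts \
    --mca btl_openib_cpc_include rdmacm \
    ./IMB-MPI1 Gather

# Workaround 2: establish all connections during MPI_Init instead of
# on demand when the first large messages are sent.
mpirun -np 64 --hostfile hosts \
    --mca mpi_preconnect_mpi 1 \
    ./IMB-MPI1 Gather
```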
This hang has not been seen in 1.5, nor with other MPIs (e.g., Intel).
This has been seen on multiple clusters: Doron's cluster and on a couple of IBM iDataplex clusters.