osc/rdma v3.1.3rc2: PR #5924 seems to introduce segfaults on simple tests #5969
Update: #5923 is the v4.0.x version of the PR that seems to have caused the issue, #5924 is the v3.1.x version, and #5925 is the v3.0.x version. I'm therefore assuming that this same issue arises on all three branches. The PR was merged on the v3.0.x and v3.1.x branches, but has not yet (as of 24 Oct 2018) been merged on the v4.0.x branch.
Hmmm... This is interesting. Maybe we need a second barrier? It won't really hurt much.
I can throw together a PR and see if it helps. If not, the problem might be more complicated.
I'm looking into it at the moment (as that PR came from me). Sorry for the issues.
No problem. I didn't think it would cause any problems when I looked at it.
I think you're right; we would need another barrier at line 660 (where it was previously) to make sure that
Ok, if you want to open a PR for that, go ahead; otherwise I will post one this evening.
I'm waiting for our system to come back so I can verify using the attached reproducer, and I will post a PR after that.
…gment" This reverts commit 4fbe078. Issue open-mpi#5969 was caused by this commit and the 4.0.x release managers decided not to take this patch, so having it in 3.0.x and 3.1.x is a bit awkward. Signed-off-by: Brian Barrett <[email protected]>
…gment" This reverts commit 4f435e8. Issue open-mpi#5969 was caused by this commit and the 4.0.x release managers decided not to take this patch, so having it in 3.0.x and 3.1.x is a bit awkward. Signed-off-by: Brian Barrett <[email protected]>
We reverted the offending patch from the 3.0.x and 3.1.x release branches (it never shipped in a release). Removing the tags for those branches.
@devreal This gives us a little more time to see if there is a better solution for ensuring locality. I still want the issue fixed, but it needs just a little more time.
I agree; maybe this was too much of a shot from the hip. I will continue working on the issue, and hopefully we can bring the fix into the next release.
@bwbarrett Since you reverted those commits, can we close this issue?
I only reverted for 3.0.x and 3.1.x. My understanding was that there was still a 4.0.x problem, but I didn't really look at it in any depth (that's why I left the 4.x label). So I think closing is up to the 4.0.x RMs.
Should be fixed in v5.0.x/master. Closing. Please reopen if you still see this issue.
Background information
What version of Open MPI are you using?
v3.1.3rc2, commit 6187b7a
Open MPI repo revision from `ompi_info --all`: v3.1.3rc1-2-gfa3d929
Describe how Open MPI was installed
Please describe the system on which you are running
Linux 3.10.0-693.21.1.el7.x86_64 #1 SMP Fri Feb 23 18:54:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
InfiniBand 100 Gb/sec (4X EDR)
More details are in the file "env.txt" inside the attached archive: v3.1.x_osc-rdma_segment-register.tar.gz
Details of the problem
Commit 4fbe078 ("RDMA OSC: initialize segment memory before registering the segment") from PR #5924 seems to introduce a regression in MPI-RMA applications that use passive-target synchronization (lock/unlock).
I attached a simple reproducer to this issue. I launched it on the platform described above, both with srun and with mpirun, and see the same failure.
The issue seems to occur when two or more processes are spawned per node and at least two nodes are used.
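For context, this is the kind of passive-target pattern the regression affects. The sketch below is not the attached reproducer, just a hypothetical minimal example of a lock/put/unlock epoch; the window size, target rank, and value written are my own choices. Run with at least two processes per node across two or more nodes to match the failing configuration.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Allocate a one-int window; osc/rdma may back this with a
     * registered (and, on-node, shared) memory segment. */
    int *base;
    MPI_Win win;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    *base = -1;

    /* Passive-target epoch: lock the next rank, put one int, unlock. */
    int target = (rank + 1) % size;
    int value  = rank;
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
    MPI_Put(&value, 1, MPI_INT, target, 0, 1, MPI_INT, win);
    MPI_Win_unlock(target, win);

    MPI_Barrier(MPI_COMM_WORLD);
    printf("rank %d: base = %d\n", rank, *base);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

Compile with `mpicc` and launch across nodes (e.g. `mpirun -N 2 ...` or `srun`); with the affected commit this pattern reportedly segfaults, while removing the MPI_Put() triggers the openib error quoted below instead.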
When looking at commit 4fbe078, I found that the peer-synchronization part (`shared_comm->c_coll->coll_barrier`) was moved before the `ompi_osc_rdma_register` call. When I moved it back to where it was before the commit, everything worked again. I do not know whether the fix I propose here is compliant with what this commit originally intended, so I prefer to open an issue instead of submitting a partial-revert pull request.
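As I understand the description above, the ordering change looks roughly like this. This is a paraphrase in pseudocode, not the actual osc/rdma source; only the two named calls come from the report, everything else is a placeholder.

```c
/* Before commit 4fbe078 (working): */
ompi_osc_rdma_register(/* shared segment */);
shared_comm->c_coll->coll_barrier(/* peers sync after registration */);

/* After commit 4fbe078 (segfaults): */
shared_comm->c_coll->coll_barrier(/* barrier now runs first... */);
ompi_osc_rdma_register(/* ...so a peer may touch the segment
                          before registration completes */);
```

With the barrier first, nothing stops a fast peer from issuing RMA to a segment its neighbor has not yet registered, which would match the intermittent, multi-node, multi-process-per-node failure pattern described above.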
PS: removing the MPI_Put() call from the simple reproducer I provided yields a different error:
```
[node1:162962] too many retries sending message to 0x0022:0x00016f67, giving up
--------------------------------------------------------------------------
The OpenFabrics driver in Open MPI tried to raise a fatal error, but failed.
Hopefully there was an error message before this one that gave some more
detailed information.

  Local host:  node1
  Source file: btl_openib_endpoint.c
  Source line: 1037

Your job is now going to abort, sorry.
--------------------------------------------------------------------------
```
This issue may have other visible effects. Looking through recently opened issues, I think it might also be related to #5946.