Hang in clock_gettime() during Bcast #3445
Comments
Do you know where the hang itself is? Does the problem happen in v1.10.6? Or v2.1.0?
Hi @jsquyres, thank you very much for your attention and quick response! The hang appears to be inside Open MPI, as I got more traces from other ranks, as follows.
So the hang is not in clock_gettime, but in the opal_progress that waits for the completion of the requests generated during the bcast. How many processes are involved in your parallel application? Can you check the stacks of all processes to make sure they all reached the same MPI_Bcast?
Hi @bosilca, there are 8 processes in total. 7 processes are stuck in clock_gettime.
Do you mean that there should be a barrier before doing MPI_Bcast? Right now, one rank does some work while the others go to MPI_Bcast directly; could this be the problem? It happens randomly.
They cannot be stuck in clock_gettime. What happens is that when you stop the process, it happens to be in clock_gettime, but that particular function does not block. In fact, if you look at the stack trace, you can notice that you are in opal_progress, which loops around polling the network and calling clock_gettime until messages are received. Thus, I assume the culprit is that an expected message does not arrive, and so your process appears blocked in opal_progress (and thus in clock_gettime).

There is no need to have a barrier before the bcast, but all processes on the communicator where the bcast is called must call the MPI_Bcast function. I just wanted to make sure this is indeed the case. What really matters is whether there is an MPI_Bcast on the stack trace, not what the last function is that the processes are blocked in.
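For reference, here is a minimal sketch (my own example, not code from this issue) of the pattern discussed above: rank 0 does some extra work before broadcasting while the other ranks are already waiting inside MPI_Bcast, and the collective still completes without any barrier, as long as every rank of the communicator calls MPI_Bcast with the same root.

```c
/* Minimal sketch: no barrier is required before MPI_Bcast, but every
 * rank of the communicator must eventually call it.
 * Build (assumed): mpicc bcast_example.c -o bcast_example */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 0;
    if (rank == 0) {
        /* Rank 0 does some work first; meanwhile the other ranks sit
         * inside MPI_Bcast, internally spinning in opal_progress
         * (which is where clock_gettime shows up on the stack). */
        sleep(2);
        value = 42;
    }

    /* Every rank calls MPI_Bcast on the same communicator with the
     * same root; this alone is enough for the collective to complete. */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d received value %d\n", rank, value);

    MPI_Finalize();
    return 0;
}
```

If all ranks show MPI_Bcast on their stacks and the broadcast still never completes, the pattern itself is not the issue; as noted above, it points to an expected message not arriving below the MPI layer.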
@bosilca, thank you for your explanation! I double-checked the stack traces of the ranks, and they all called MPI_Bcast.
I would encourage you to upgrade your version of Open MPI to at least the latest in the v1.10 series (i.e., 1.10.6) to see if this bug was already fixed. If possible, you might want to upgrade to Open MPI v2.1.0.
@junjieqian the hang could occur when
Thank you for taking the time to submit an issue!
Background information
Open MPI hangs during Bcast() in clock_gettime()
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
v1.10.3
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Built from a distribution tarball.
Please describe the system on which you are running
Details of the problem
MPI hangs in clock_gettime(). It happens from time to time, and most of the affected jobs run on the same machine. The hang can last for hours or never resolve.
The issue is similar to #99, which seems to have been solved.
The stack trace is as follows: