btl smcuda hang in v4.1.5 #12270
Per Jan 30 discussion: this may or may not be a blocker for the v4.1.7 release. Let's re-evaluate as we get closer to the end of Q1 CY2024 / expected release of v4.1.7.
I am making this investigation my top priority this week.
lrbison added a commit to lrbison/ompi that referenced this issue, Feb 14, 2024
This change fixes open-mpi#12270. Testing on c7g instance type (arm64) confirms this change eliminates hangs and crashes that were previously observed in 1 in 30 runs of IMB alltoall benchmark. Tested with over 300 runs and no failures. The write memory barrier prevents other CPUs from observing the fifo get updated before they observe the updated contents of the header itself. Without the barrier, uninitialized header contents caused the crashes and invalid data. Signed-off-by: Luke Robison <[email protected]>
lrbison added a commit to lrbison/ompi that referenced this issue, Feb 15, 2024
This change fixes open-mpi#12270. Testing on c7g instance type (arm64) confirms this change eliminates hangs and crashes that were previously observed in 1 in 30 runs of IMB alltoall benchmark. Tested with over 300 runs and no failures. The write memory barrier prevents other CPUs from observing the fifo get updated before they observe the updated contents of the header itself. Without the barrier, uninitialized header contents caused the crashes and invalid data. Signed-off-by: Luke Robison <[email protected]> (cherry picked from commit 71f378d)
jiaxiyan pushed a commit to jiaxiyan/ompi that referenced this issue, Mar 1, 2024
This change fixes open-mpi#12270. Testing on c7g instance type (arm64) confirms this change eliminates hangs and crashes that were previously observed in 1 in 30 runs of IMB alltoall benchmark. Tested with over 300 runs and no failures. The write memory barrier prevents other CPUs from observing the fifo get updated before they observe the updated contents of the header itself. Without the barrier, uninitialized header contents caused the crashes and invalid data. Signed-off-by: Luke Robison <[email protected]>
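The ordering problem described in the commit message can be illustrated with a minimal sketch. This is not the smcuda code or the actual patch: the names `frag_hdr_t`, `fifo_t`, `fifo_publish`, `fifo_poll`, and `SLOT_FREE` are hypothetical, and C11 fences stand in for whatever barrier primitive the real change uses. The point is only that the writer must make the header contents visible before it makes the fifo entry visible, and that the reader needs a matching barrier before it trusts the header.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an "empty slot" marker such as SM_FIFO_FREE. */
#define SLOT_FREE ((uintptr_t)0)

/* Hypothetical fragment header that the writer fills in before publishing. */
typedef struct {
    int    tag;
    size_t len;
} frag_hdr_t;

/* Hypothetical one-slot fifo shared between a writer and a reader. */
typedef struct {
    _Atomic uintptr_t slot;
} fifo_t;

/* Writer: fill in the header, then publish its address in the fifo slot. */
static void fifo_publish(fifo_t *f, frag_hdr_t *hdr, int tag, size_t len)
{
    assert(hdr != NULL);
    hdr->tag = tag;
    hdr->len = len;

    /* Write barrier: the header stores above must become visible before
     * the slot store below. Without it, a weakly ordered CPU (arm64, for
     * example) may expose the slot update first, and a reader can see a
     * header that has not been initialized yet. */
    atomic_thread_fence(memory_order_release);

    atomic_store_explicit(&f->slot, (uintptr_t)hdr, memory_order_relaxed);
}

/* Reader: return the published header, or NULL if the slot is free. */
static frag_hdr_t *fifo_poll(fifo_t *f)
{
    uintptr_t v = atomic_load_explicit(&f->slot, memory_order_relaxed);
    if (v == SLOT_FREE) {
        return NULL;
    }

    /* Read barrier: pairs with the writer's release fence, so header
     * loads performed after this point see the writer's stores. */
    atomic_thread_fence(memory_order_acquire);

    /* Mark the slot free again so the next fragment can be published. */
    atomic_store_explicit(&f->slot, SLOT_FREE, memory_order_relaxed);
    return (frag_hdr_t *)v;
}
```

In use, one process or thread would call `fifo_publish(&f, &hdr, tag, len)` and another would spin on `fifo_poll(&f)` until it returns non-NULL. On x86 the missing barrier would usually go unnoticed because stores are not reordered with other stores, which is consistent with the problem surfacing on arm64.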
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
v4.1.x
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
git clone:
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
Oddly, git submodule status returns 0 but prints nothing. Of course: there are no submodules in 4.1.x!
Please describe the system on which you are running
Details of the problem
Started in https://gitlab.com/eessi/support/-/issues/41#note_1738867500
Some applications crash or hang when run on c7g.4xlarge. The EasyBuild configurations include a patch to always build against CUDA; however, I can also reproduce hangs (no crashes yet) with a fresh build, without the EB patches, against the Debian-provided CUDA.
The symptom I've been able to reproduce is a hang in the smcuda btl, so I must configure with CUDA support, and we should exclude ofi at either configure time or run time. However, we are not using CUDA memory, only the smcuda btl.
I'm compiling with gcc 12.3.0.
In fully loaded examples (i.e., 64 ranks on hpc7g), IMB's allgather or alltoall test hangs relatively frequently (10%?). On a lightly loaded node (6/64), it may take as many as 300 executions to hit a hang.
A backtrace looks like this:
I find all ranks in a barrier, and every fifo read comes up with SM_FIFO_FREE, but they are all waiting on some completion. To me this means a message was overwritten/missed/dropped.
I have attempted to reproduce this on the 5.0.x branch; however, smcuda deactivates itself when it cannot initialize an accelerator stream.