Issue with atomics on arm64 #12011
Comments
@yuncliu I am attempting to reproduce. I am using AWS hpc7g.16xlarge instances.
Using 4 hosts, each running 4 tasks, with each task using 16 threads. In order to use these atomics I've compiled with an external PMIx and without C11 or GCC internal atomics:
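(The configure line itself did not survive extraction. A plausible sketch, assuming Open MPI's standard `--with-pmix` and `--disable-builtin-atomics` options; the exact invocation used here was not captured:)

```sh
./configure --with-pmix=/path/to/external/pmix --disable-builtin-atomics
```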
I'll let it run in a loop and monitor for failure. How many hosts did you use to find this issue?
@yuncliu After 500 executions I still did not observe the original crash. Any other specifics about your setup you can share?
From issue #11999:
This is one difference. I only have access to single-socket arm cores.
Maybe the problem is in opal_atomic_compare_exchange_strong_32 and opal_atomic_compare_exchange_strong_64. Compiling the C11 atomic_compare_exchange_strong produces:

```
ldaxr
cmp
b.ne
stlxr
```

And now in "opal/include/opal/sys/arm64/atomic.h" the function opal_atomic_compare_exchange_strong_32 is exactly the same as opal_atomic_compare_exchange_strong_acq_32, and the same goes for opal_atomic_compare_exchange_strong_64 and opal_atomic_compare_exchange_strong_acq_64. So, according to the assembly of atomic_compare_exchange_strong, opal_atomic_compare_exchange_strong_32/64 should have stlxr instead of stxr.
I disagree. Adding release semantics to the store would make it a release CAS rather than the plain (relaxed) one. Let's compare:
I've plugged those into godbolt and you can see none of them have STLXR, since none of them have release semantics. However, it does point out that the gcc builtins and the C11 stdatomic implementation differ in their acquire behavior. If adding release semantics does fix the issue, then we are probably missing a write barrier somewhere, and that's the real bug.
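For anyone repeating that comparison, here is a minimal standalone snippet (mine, not from the thread) that can be fed to godbolt, assuming GCC targeting armv8-a with -mno-outline-atomics so the exclusive load/store sequences are visible:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Relaxed CAS: no acquire, no release -- expect plain ldxr/stxr. */
bool cas_relaxed(_Atomic int32_t *addr, int32_t *expected, int32_t desired)
{
    return atomic_compare_exchange_strong_explicit(addr, expected, desired,
                                                   memory_order_relaxed,
                                                   memory_order_relaxed);
}

/* Acquire CAS: acquire on the load only -- expect ldaxr/stxr, the same
 * sequence the plain and _acq_ opal variants both emit today. */
bool cas_acquire(_Atomic int32_t *addr, int32_t *expected, int32_t desired)
{
    return atomic_compare_exchange_strong_explicit(addr, expected, desired,
                                                   memory_order_acquire,
                                                   memory_order_acquire);
}

/* Default (seq_cst) CAS: expect ldaxr/stlxr -- the sequence quoted above,
 * with release semantics on the store. */
bool cas_seqcst(_Atomic int32_t *addr, int32_t *expected, int32_t desired)
{
    return atomic_compare_exchange_strong(addr, expected, desired);
}
```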
ucx is OK. The problem only happens with ob1.
It's a new year! Any progress on this, perchance?
@yuncliu I am planning to try another reproducer here, but I was not able to duplicate it last time. Can you provide some detail on how you configured Open MPI? Thanks!
It looks like this issue is expecting a response, but hasn't gotten one yet. If there are no responses in the next 2 weeks, we'll assume that the issue has been abandoned and will close it.
Per the above comment, it has been a month with no reply on this issue. It looks like this issue has been abandoned. I'm going to close this issue. If I'm wrong and this issue is not abandoned, please feel free to re-open it. Thank you!
Background information
This is an issue based on the discussion in #11999.
From #11999 it looks like we are missing a release memory barrier somewhere in the code. The problem is solved by adding release semantics to the store in the CAS; however, we generally use relaxed memory ordering in atomic operations, so the proposed fix is not the right one.
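To make the distinction concrete, here is a minimal sketch (mine, not Open MPI code) of the pattern implied above: the CAS stays relaxed, and the publishing side that actually needs ordering carries an explicit release barrier, which pairs with an acquire on the reader.

```c
#include <stdatomic.h>
#include <stdint.h>

static int32_t         payload;   /* data being handed off         */
static _Atomic int32_t flag;      /* publication flag, starts at 0 */

/* Producer: write the payload, then publish. The release fence orders the
 * payload store before the flag update, so the CAS itself can stay relaxed
 * (the ordering Open MPI's atomics generally assume). */
void publish(int32_t value)
{
    int32_t expected = 0;

    payload = value;
    atomic_thread_fence(memory_order_release);   /* the "missing" write barrier */
    atomic_compare_exchange_strong_explicit(&flag, &expected, 1,
                                            memory_order_relaxed,
                                            memory_order_relaxed);
}

/* Consumer: the acquire load pairs with the release fence above, so once the
 * flag is observed as 1, the payload read below is guaranteed to see 'value'. */
int32_t consume(void)
{
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;   /* spin until published */
    return payload;
}
```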
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
Based on #12005 (port to 4.1.x) this issue seems to be present in 4.1.x and we should assume that it is present in master and 5.0.x as well.
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
If you are building/installing from a git clone, please copy-n-paste the output from `git submodule status`.

Please describe the system on which you are running
Details of the problem
Reproducer:
It either yields wrong results or crashes.