
Issue with atomics on arm64 #12011

Closed

devreal opened this issue Oct 20, 2023 · 12 comments

Comments

@devreal (Contributor) commented Oct 20, 2023

Background information

This is an issue based on the discussion in #11999.

From #11999 it looks like we are missing a release memory barrier somewhere in the code. The problem goes away if release semantics are added to the store in the CAS, but we generally use relaxed memory ordering in our atomic operations, so the proposed fix is not the right one.
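For reference, a minimal C11 sketch of the distinction in question (illustrative only, not the actual OPAL code): a relaxed CAS does not order the surrounding stores, whereas a release CAS, or a relaxed CAS preceded by an explicit write barrier, guarantees that earlier stores become visible no later than the updated value.

#include <stdatomic.h>
#include <stdbool.h>

atomic_int lock_word;
int payload;                     /* data published to other threads */

/* Relaxed CAS: on a weakly ordered arch like arm64 the store to
 * payload may become visible *after* the lock_word update. */
bool publish_relaxed(int value)
{
    int expected = 0;
    payload = value;
    return atomic_compare_exchange_strong_explicit(
        &lock_word, &expected, 1,
        memory_order_relaxed, memory_order_relaxed);
}

/* Release CAS: earlier stores (payload) are ordered before the
 * successful exchange, so a reader that acquires lock_word sees them. */
bool publish_release(int value)
{
    int expected = 0;
    payload = value;
    return atomic_compare_exchange_strong_explicit(
        &lock_word, &expected, 1,
        memory_order_release, memory_order_relaxed);
}

On x86 both variants compile to essentially the same code, which is why this class of bug tends to surface only on weakly ordered architectures such as arm64.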

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Based on #12005 (the port to 4.1.x), this issue appears to be present in 4.1.x, and we should assume it is present in master and 5.0.x as well.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

  • Operating system/version:
  • Computer hardware:
  • Network type:

Details of the problem

Reproducer:

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/types.h>
#include "mpi.h"

#define MAX_THREADS (20)

int g_rankSize = 0;
int g_rank = 0;
MPI_Comm g_comm[MAX_THREADS];

/* Each thread loops over MPI_Allreduce on its own communicator and counts
 * results that do not match the expected sum (one contribution per rank). */
void *mpi_thread(void* p)
{
    int id = *(int*)p;
    free(p);
    int i;
    int count = 0;
    for (i = 0; i < 1000000; ++i) {
        int s = 1;
        int r = 0;
        MPI_Allreduce(&s, &r, 1, MPI_INT, MPI_SUM, g_comm[id]);
        if (r != g_rankSize) {
            count++;
        }
    }
    printf("rank %d id %d error count = %d\n", g_rank, id, count);
    return NULL;
}

int main(int argc, char** argv)
{
    int mpi_threads_provided;
    int req = MPI_THREAD_MULTIPLE;
    pthread_t threads[MAX_THREADS];
    const int threadNum = 10;
    int64_t i;


    MPI_Init_thread(&argc, &argv, req, &mpi_threads_provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &g_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &g_rankSize);

    /* Create one communicator per thread (each congruent to MPI_COMM_WORLD). */
    MPI_Group worldGroup;
    MPI_Comm_group(MPI_COMM_WORLD, &worldGroup);
    for (i = 0; i < threadNum; ++i) {
        MPI_Comm_create(MPI_COMM_WORLD, worldGroup, &g_comm[i]);
    }

    for (i = 0; i < threadNum; ++i) {
        int *p = (int*)malloc(sizeof(int));
        *p = (int)i;
        pthread_create(&threads[i], NULL, mpi_thread, (void*)p);
    }

    for (i = 0; i < threadNum; ++i) {
        pthread_join(threads[i], NULL);
    }
    MPI_Finalize();
    return 0;
}

It either yields wrong results or crashes.
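A typical way to build and run the reproducer (assuming the source is saved as reproducer.c; compiler wrapper, process count, and host list will vary per system):

mpicc -O2 -pthread reproducer.c -o reproducer
mpirun -n 4 ./reproducer

On a correctly synchronized build every rank should print error count = 0 for each thread id.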

@lrbison (Contributor) commented Oct 20, 2023

@yuncliu I am attempting to reproduce. I am using AWS hpc7g.16xlarge instances.

BogoMIPS        : 2100.00
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x1
CPU part        : 0xd40
CPU revision    : 1

I'm using 4 hosts, each running 4 tasks, with each task using 16 threads. In order to exercise these atomics I've compiled with an external PMIx and without C11 or GCC built-in atomics:

./configure --disable-builtin-atomics --disable-c11-atomics

I'll let it run in a loop and monitor for failure. How many hosts did you use to find this issue?

@lrbison (Contributor) commented Oct 23, 2023

@yuncliu After 500 executions I still did not observe the original crash. Any other specifics about your setup you can share?

@lrbison (Contributor) commented Oct 27, 2023

From issue #11999:

"My hardware is a server with 192 arm64 cores and 4 NUMA nodes."

This is one difference: I only have access to single-socket Arm machines.

@yuncliu commented Nov 25, 2023

Maybe the problem is in opal_atomic_compare_exchange_strong_32 and opal_atomic_compare_exchange_strong_64.
I disassembled atomic_compare_exchange_strong from <stdatomic.h>; with both gcc and clang the assembly looks like:

ldaxr
cmp
b.ne
stlxr

Right now in opal/include/opal/sys/arm64/atomic.h the function opal_atomic_compare_exchange_strong_32 is exactly the same as opal_atomic_compare_exchange_strong_acq_32, and the same goes for opal_atomic_compare_exchange_strong_64 and opal_atomic_compare_exchange_strong_acq_64. I think opal_atomic_compare_exchange_strong_32/64 and opal_atomic_compare_exchange_strong_acq_32/64 should have different semantics.

So, judging from the assembly of atomic_compare_exchange_strong, opal_atomic_compare_exchange_strong_32/64 should use stlxr instead of stxr.
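For anyone who wants to check this locally, here is a small standalone test file (my own sketch, not part of Open MPI). Compiling it with something like gcc -O2 -S -mno-outline-atomics test.c on an arm64 machine shows which load/store-exclusive variants the compiler picks for each memory order (with LSE atomics enabled the compiler may emit CAS/CASAL instructions instead of the LDXR/STXR loop):

#include <stdatomic.h>
#include <stdbool.h>

bool cas_relaxed(atomic_int *a, int *expected, int desired)
{
    return atomic_compare_exchange_strong_explicit(
        a, expected, desired, memory_order_relaxed, memory_order_relaxed);
}

bool cas_acquire(atomic_int *a, int *expected, int desired)
{
    return atomic_compare_exchange_strong_explicit(
        a, expected, desired, memory_order_acquire, memory_order_relaxed);
}

bool cas_release(atomic_int *a, int *expected, int desired)
{
    return atomic_compare_exchange_strong_explicit(
        a, expected, desired, memory_order_release, memory_order_relaxed);
}

bool cas_seq_cst(atomic_int *a, int *expected, int desired)
{
    /* the default atomic_compare_exchange_strong() is seq_cst */
    return atomic_compare_exchange_strong(a, expected, desired);
}

The default (seq_cst) variant is the one that shows both ldaxr and stlxr, which is stronger ordering than the relaxed semantics the unsuffixed OPAL functions are meant to provide.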

@lrbison (Contributor) commented Nov 27, 2023

opal_atomic_compare_exchange_strong_32/64 should use stlxr instead of stxr.

I disagree. That would be opal_atomic_compare_exchange_strong_rel_32/64.

Let's compare the opal_atomic_compare_exchange_strong_32 implementations:

stdc implementation:

atomic_compare_exchange_strong_explicit(addr, compare, value, memory_order_relaxed, memory_order_relaxed)

gcc-builtins:

__atomic_compare_exchange_n(addr, oldval, newval, false, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);

arm64 asm:

ldaxr    %w0, [%2]
cmp     %w0, %w3
bne     2f
stxr    %w1, %w4, [%2]
cbnz    %w1, 1b

I've plugged those into godbolt and you can see none of them have STLXR, since none of them have release semantics.

However, it does point out that the gcc-builtins and the stdc atomics implementations differ in their acquire behavior.

If adding release semantics does fix the issue, then we probably are missing a write barrier somewhere, and that's the real bug.
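To make that last point concrete, here is a generic sketch of the pattern relaxed atomics rely on (not the actual Open MPI code path): whoever publishes data through a relaxed CAS must issue an explicit write barrier, opal_atomic_wmb() in OPAL terms or a release fence in C11, between writing the payload and publishing it. If that barrier is missing somewhere, strengthening the CAS itself to release would mask the bug rather than fix it.

#include <stdatomic.h>
#include <stdbool.h>

atomic_int head;
int slot;                                       /* payload handed to another thread */

bool publish(int value)
{
    int expected = 0;
    slot = value;                               /* 1. write the payload              */
    atomic_thread_fence(memory_order_release);  /* 2. write barrier: order (1)       */
                                                /*    before the CAS below           */
    return atomic_compare_exchange_strong_explicit(
        &head, &expected, 1,                    /* 3. publish with a relaxed CAS     */
        memory_order_relaxed, memory_order_relaxed);
}

int consume(void)
{
    /* The reader must pair this with an acquire load (or a read barrier). */
    while (atomic_load_explicit(&head, memory_order_acquire) == 0)
        ;
    return slot;
}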

@yuncliu commented Dec 13, 2023

UCX is fine; the problem only happens with ob1.

@jsquyres (Member) commented Jan 8, 2024

It's a new year! Any progress on this, perchance?

@lrbison (Contributor) commented Jan 16, 2024

@yuncliu I am planning to try another reproducer here, but I was not able to reproduce the failure last time.

Can you provide some detail on how you configured Open MPI? Thanks!

@lrbison (Contributor) commented Jan 24, 2024

@yuncliu Are you using the smcuda btl? If so, I have a reproducer in #12270. Can you re-run your reproducer with --mca btl ^smcuda to disable that component and see if the problem is still present?

@lrbison (Contributor) commented Feb 21, 2024

@yuncliu Can you test your workload now that #12344 is merged, to see if that fix addresses the issue you reported?

github-actions bot commented Mar 6, 2024

It looks like this issue is expecting a response, but hasn't gotten one yet. If there are no responses in the next 2 weeks, we'll assume that the issue has been abandoned and will close it.

github-actions bot added the Stale label Mar 6, 2024
github-actions bot commented Mar 20, 2024
Per the above comment, it has been a month with no reply on this issue. It looks like this issue has been abandoned.

I'm going to close this issue. If I'm wrong and this issue is not abandoned, please feel free to re-open it. Thank you!

github-actions bot closed this as not planned Mar 20, 2024