
Issue with atomics on arm64 #12011

Closed

devreal opened this issue Oct 20, 2023 · 12 comments

Comments

@devreal (Contributor) commented Oct 20, 2023

Background information

This is an issue based on the discussion in #11999.

From #11999 it looks like we are missing a release memory barrier somewhere in the code. The problem goes away if release semantics are added to the store in the CAS, but we generally use relaxed memory ordering in our atomic operations, so the proposed fix is not the right one.
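For reference, a minimal C11 sketch of the distinction in question (illustrative only, not the actual OPAL code): a relaxed CAS does not order the surrounding stores, whereas a release CAS, or a relaxed CAS preceded by an explicit write barrier, guarantees that earlier stores become visible no later than the updated value.

#include <stdatomic.h>
#include <stdbool.h>

atomic_int lock_word;
int payload;                     /* data published to other threads */

/* Relaxed CAS: on a weakly ordered arch like arm64 the store to
 * payload may become visible *after* the lock_word update. */
bool publish_relaxed(int value)
{
    int expected = 0;
    payload = value;
    return atomic_compare_exchange_strong_explicit(
        &lock_word, &expected, 1,
        memory_order_relaxed, memory_order_relaxed);
}

/* Release CAS: earlier stores (payload) are ordered before the
 * successful exchange, so a reader that acquires lock_word sees them. */
bool publish_release(int value)
{
    int expected = 0;
    payload = value;
    return atomic_compare_exchange_strong_explicit(
        &lock_word, &expected, 1,
        memory_order_release, memory_order_relaxed);
}

On x86 both variants compile to essentially the same code, which is why this class of bug tends to surface only on weakly ordered architectures such as arm64.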

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Based on #12005 (the port to 4.1.x), this issue appears to be present in 4.1.x, and we should assume it is present in master and 5.0.x as well.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

  • Operating system/version:
  • Computer hardware:
  • Network type:

Details of the problem

Reproducer:

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/types.h>
#include "mpi.h"

#define MAX_THREADS (20)

int g_rankSize = 0;
int g_rank = 0;
MPI_Comm g_comm[MAX_THREADS];

/* Each thread loops over MPI_Allreduce on its own communicator and counts
 * results that do not match the expected sum (one contribution per rank). */
void *mpi_thread(void* p)
{
    int id = *(int*)p;
    free(p);
    int i;
    int count = 0;
    for (i = 0; i < 1000000; ++i) {
        int s = 1;
        int r = 0;
        MPI_Allreduce(&s, &r, 1, MPI_INT, MPI_SUM, g_comm[id]);
        if (r != g_rankSize) {
            count++;
        }
    }
    printf("rank %d id %d error count = %d\n", g_rank, id, count);
    return NULL;
}

int main(int argc, char** argv)
{
    int mpi_threads_provided;
    int req = MPI_THREAD_MULTIPLE;
    pthread_t threads[MAX_THREADS];
    const int threadNum = 10;
    int64_t i;


    MPI_Init_thread(&argc, &argv, req, &mpi_threads_provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &g_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &g_rankSize);

    /* Create one communicator per thread (each congruent to MPI_COMM_WORLD). */
    MPI_Group worldGroup;
    MPI_Comm_group(MPI_COMM_WORLD, &worldGroup);
    for (i = 0; i < threadNum; ++i) {
        MPI_Comm_create(MPI_COMM_WORLD, worldGroup, &g_comm[i]);
    }

    for (i = 0; i < threadNum; ++i) {
        int *p = (int*)malloc(sizeof(int));
        *p = (int)i;
        pthread_create(&threads[i], NULL, mpi_thread, (void*)p);
    }

    for (i = 0; i < threadNum; ++i) {
        pthread_join(threads[i], NULL);
    }
    MPI_Finalize();
    return 0;
}

It either yields wrong results or crashes.
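A typical way to build and run the reproducer (assuming the source is saved as reproducer.c; compiler wrapper, process count, and host list will vary per system):

mpicc -O2 -pthread reproducer.c -o reproducer
mpirun -n 4 ./reproducer

On a correctly synchronized build every rank should print error count = 0 for each thread id.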

@lrbison (Contributor) commented Oct 20, 2023

@yuncliu I am attempting to reproduce. I am using AWS hpc7g.16xlarge instances.

BogoMIPS        : 2100.00
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x1
CPU part        : 0xd40
CPU revision    : 1

I'm using 4 hosts, each running 4 tasks, with each task using 16 threads. In order to exercise these atomics I've compiled with an external PMIx and without C11 or GCC built-in atomics:

./configure --disable-builtin-atomics --disable-c11-atomics

I'll let it run in a loop and monitor for failure. How many hosts did you use to find this issue?

@lrbison (Contributor) commented Oct 23, 2023

@yuncliu After 500 executions I still did not observe the original crash. Any other specifics about your setup you can share?

@lrbison (Contributor) commented Oct 27, 2023

From issue #11999:

"My hardware is a server with 192 arm64 cores and 4 NUMA nodes."

This is one difference: I only have access to single-socket Arm machines.

@yuncliu commented Nov 25, 2023

Maybe the problem is in opal_atomic_compare_exchange_strong_32 and opal_atomic_compare_exchange_strong_64.
I disassembled atomic_compare_exchange_strong from <stdatomic.h>; with both gcc and clang the assembly looks like:

ldaxr
cmp
b.ne
stlxr

Right now in opal/include/opal/sys/arm64/atomic.h the function opal_atomic_compare_exchange_strong_32 is exactly the same as opal_atomic_compare_exchange_strong_acq_32, and the same goes for opal_atomic_compare_exchange_strong_64 and opal_atomic_compare_exchange_strong_acq_64. I think opal_atomic_compare_exchange_strong_32/64 and opal_atomic_compare_exchange_strong_acq_32/64 should have different semantics.

So, judging from the assembly of atomic_compare_exchange_strong, opal_atomic_compare_exchange_strong_32/64 should use stlxr instead of stxr.
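For anyone who wants to check this locally, here is a small standalone test file (my own sketch, not part of Open MPI). Compiling it with something like gcc -O2 -S -mno-outline-atomics test.c on an arm64 machine shows which load/store-exclusive variants the compiler picks for each memory order (with LSE atomics enabled the compiler may emit CAS/CASAL instructions instead of the LDXR/STXR loop):

#include <stdatomic.h>
#include <stdbool.h>

bool cas_relaxed(atomic_int *a, int *expected, int desired)
{
    return atomic_compare_exchange_strong_explicit(
        a, expected, desired, memory_order_relaxed, memory_order_relaxed);
}

bool cas_acquire(atomic_int *a, int *expected, int desired)
{
    return atomic_compare_exchange_strong_explicit(
        a, expected, desired, memory_order_acquire, memory_order_relaxed);
}

bool cas_release(atomic_int *a, int *expected, int desired)
{
    return atomic_compare_exchange_strong_explicit(
        a, expected, desired, memory_order_release, memory_order_relaxed);
}

bool cas_seq_cst(atomic_int *a, int *expected, int desired)
{
    /* the default atomic_compare_exchange_strong() is seq_cst */
    return atomic_compare_exchange_strong(a, expected, desired);
}

The default (seq_cst) variant is the one that shows both ldaxr and stlxr, which is stronger ordering than the relaxed semantics the unsuffixed OPAL functions are meant to provide.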

@lrbison (Contributor) commented Nov 27, 2023

opal_atomic_compare_exchange_strong_32/64 should use stlxr instead of stxr.

I disagree. That would be opal_atomic_compare_exchange_strong_rel_32/64.

Let's compare the opal_atomic_compare_exchange_strong_32 implementations:

stdc implementation:

atomic_compare_exchange_strong_explicit(addr, compare, value, memory_order_relaxed, memory_order_relaxed)

gcc-builtins:

__atomic_compare_exchange_n(addr, oldval, newval, false, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);

arm64 asm:

ldaxr    %w0, [%2]
cmp     %w0, %w3
bne     2f
stxr    %w1, %w4, [%2]
cbnz    %w1, 1b

I've plugged those into godbolt and you can see none of them have STLXR, since none of them have release semantics.

However, it does point out that the gcc-builtins and the stdc atomics implementations differ in their acquire behavior.

If adding release semantics does fix the issue, then we probably are missing a write barrier somewhere, and that's the real bug.
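To make that last point concrete, here is a generic sketch of the pattern relaxed atomics rely on (not the actual Open MPI code path): whoever publishes data through a relaxed CAS must issue an explicit write barrier, opal_atomic_wmb() in OPAL terms or a release fence in C11, between writing the payload and publishing it. If that barrier is missing somewhere, strengthening the CAS itself to release would mask the bug rather than fix it.

#include <stdatomic.h>
#include <stdbool.h>

atomic_int head;
int slot;                                       /* payload handed to another thread */

bool publish(int value)
{
    int expected = 0;
    slot = value;                               /* 1. write the payload              */
    atomic_thread_fence(memory_order_release);  /* 2. write barrier: order (1)       */
                                                /*    before the CAS below           */
    return atomic_compare_exchange_strong_explicit(
        &head, &expected, 1,                    /* 3. publish with a relaxed CAS     */
        memory_order_relaxed, memory_order_relaxed);
}

int consume(void)
{
    /* The reader must pair this with an acquire load (or a read barrier). */
    while (atomic_load_explicit(&head, memory_order_acquire) == 0)
        ;
    return slot;
}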

@yuncliu commented Dec 13, 2023

UCX is fine; the problem only happens with ob1.

@jsquyres (Member) commented Jan 8, 2024

It's a new year! Any progress on this, perchance?

@lrbison (Contributor) commented Jan 16, 2024

@yuncliu I am planning to try another reproducer here, but I was not able to reproduce the failure last time.

Can you provide some detail on how you configured Open MPI? Thanks!

@lrbison (Contributor) commented Jan 24, 2024

@yuncliu Are you using the smcuda btl? If so, I have a reproducer in #12270. Can you re-run your reproducer with --mca btl ^smcuda to disable that component and see if the problem is still present?

@lrbison (Contributor) commented Feb 21, 2024

@yuncliu Can you test your workload now that #12344 is merged, to see if that fix addresses the issue you reported?

github-actions bot commented Mar 6, 2024

It looks like this issue is expecting a response, but hasn't gotten one yet. If there are no responses in the next 2 weeks, we'll assume that the issue has been abandoned and will close it.

github-actions bot added the Stale label Mar 6, 2024
github-actions bot commented Mar 20, 2024
Per the above comment, it has been a month with no reply on this issue. It looks like this issue has been abandoned.

I'm going to close this issue. If I'm wrong and this issue is not abandoned, please feel free to re-open it. Thank you!

github-actions bot closed this as not planned Mar 20, 2024