Hit assert in ABTI_mem_pool_alloc() #333

Open
NiuYawei opened this issue May 17, 2021 · 13 comments

@NiuYawei

I've seen a couple of CI failures with this assertion failure, but do not have a good reproducer. This is running on Azure, so under docker on a shared VM, and I expect there to be extreme CPU and memory pressure in these cases.

ERROR: daos_engine:0 daos_engine: ../src/include/abti_mem_pool.h:123: ABTI_mem_pool_alloc: Assertion `num_headers_in_cur_bucket >= 1' failed.
ERROR: daos_engine:0 *** Process 43149 received signal 6 ***
Associated errno: Success (0)
/lib64/libpthread.so.0(+0x12b20)[0x7f660fdc8b20]
/lib64/libc.so.6(gsignal+0x10f)[0x7f660f1767ff]
/lib64/libc.so.6(abort+0x127)[0x7f660f160c35]
/lib64/libc.so.6(+0x21b09)[0x7f660f160b09]
/lib64/libc.so.6(+0x2fde6)[0x7f660f16ede6]
/opt/daos/bin/../prereq/release/argobots/lib/libabt.so.1(+0x10bf2)[0x7f660fb9ebf2]
/opt/daos/bin/../prereq/release/argobots/lib/libabt.so.1(ABT_thread_create+0x92)[0x7f660fb9eda2]
/opt/daos/bin/daos_engine[0x44b49f]
/opt/daos/bin/daos_engine[0x44af6e]
/opt/daos/bin/daos_engine(dss_ult_create+0x45)[0x44ada5]
/opt/daos/bin/daos_engine[0x417e20]
/opt/daos/bin/daos_engine[0x417a2b]
/opt/daos/bin/daos_engine[0x4174f5]
/opt/daos/bin/daos_engine[0x417105]
/opt/daos/bin/daos_engine(drpc_progress+0x27e)[0x4165ee]
/opt/daos/bin/daos_engine[0x415622]
/opt/daos/bin/../prereq/release/argobots/lib/libabt.so.1(+0x17dba)[0x7f660fba5dba]
/opt/daos/bin/../prereq/release/argobots/lib/libabt.so.1(+0x17f51)[0x7f660fba5f51]
DEBUG 21:05:28.522056 procmon.go:246: Cleaning Pool f04361ee-06fe-4c34-8ecc-8f1dd3a55c49 failed:pool evict failed: rpc error: code = Unknown desc = failed to send 92B message: dRPC recv: EOF
instance 0, pid 43149, rank 0 exited with status: /opt/daos/bin/daos_engine exited: signal: aborted (core dumped)

@shintaro-iwasaki
Collaborator

Thank you for reporting an issue!

The pool structure looks broken; this should not happen if the algorithm works correctly. This pool uses somewhat complicated logic (#183), but our CI (which covers numerous OSs, compilers, and CPU architectures) has never encountered this issue so far (see https://www.argobots.org/tests/ for the combinations). I haven't tested the Azure/Docker/VM combination, though. Judging from the line number of the assert(), I believe you are using the latest stable Argobots 1.1.

  1. Could you tell me the configure options used to build Argobots?
  2. Could you tell me about the architecture (x86/64?) and compiler (icc?)?
  3. I have received several issue reports regarding stack overflow (see "ULT stack allocation method to address overrun scenario" #274). Does it happen even if you use a larger stack size?
    • ABT_THREAD_STACKSIZE=XXX or ABT_ENV_THREAD_STACKSIZE=XXX can change the default stack size.
    • It is not included in Argobots 1.1, but the current main branch also includes active stack smash detection (see thread: support mprotect-based stack guard #327).

I would really appreciate a reproducer to investigate this issue, even if the reproducing code is not small.

@NiuYawei
Author

Thanks for looking into this.

  1. The build options look like:
     ./autogen.sh
     ./configure --prefix=$ARGOBOTS_PREFIX CC=gcc --enable-valgrind --enable-stack-unwind
     make $JOBS_OPT
     make $JOBS_OPT install

  2. The arch is x86/64 and compiler is gcc.

  3. We used the default stack size for most ULTs and only use a larger stack size for a few particular ULTs that require large stacks (so we never changed the default stack size through the env var). For this particular issue, it happened on a ULT with the default stack size. I'm not sure whether enlarging the stack size would solve the problem, since we can't reliably reproduce it, but I'll ask engineers to try it out.

Many thanks, I'll keep you informed if there are any new findings.

@ashleypittman

This was actually hit in our test suite. We're trying to see what we can achieve in GitHub Actions, and this is one of the failures we saw there; an example run is here:

https://github.com/ashleypittman/daos/runs/2567827884

Generally, running under GitHub Actions hasn't been that stable for us; we've found a few issues that all seem to relate to resource starvation or timeouts, which is not entirely unexpected given the constraints.

We've since trimmed back the PR in question to a core set of functionality and landed it, but I can expand it again to see if I can hit upon a more reliable reproducer. Argobots is built from your v1.1 tag.

I'll create another PR to reproduce the settings I was using before to see if I can trigger this again - it was regularly occurring for a couple of days for me last week.

@shintaro-iwasaki
Collaborator

Thank you for your replies.

> The arch is x86/64 and compiler is gcc.

I thought Argobots might have hit a bug in the 128-bit atomic CAS, which is used by this memory pool algorithm, but a widely used compiler (e.g., GCC) on x86/64 should not cause an issue. This feature is checked in autogen.sh and also has a fallback implementation.

https://github.com/pmodels/argobots/blob/main/src/include/asm/abtd_asm_int128_cas.h#L20-L36
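
For reference, the following is a minimal sketch (not the Argobots source; the type and function names are made up for this example) of how a 128-bit CAS is typically expressed with GCC builtins on x86/64:

#include <stdbool.h>

/* Illustrative only: a weak 128-bit compare-and-swap using a GCC __atomic
 * builtin.  On x86-64 this maps to cmpxchg16b when compiled with -mcx16;
 * otherwise GCC may route the call through libatomic. */
typedef __int128 cas128_t;

static inline bool cas128_weak(cas128_t *ptr, cas128_t *expected,
                               cas128_t desired)
{
    return __atomic_compare_exchange_n(ptr, expected, desired,
                                       true /* weak */,
                                       __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
}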

> https://github.com/ashleypittman/daos/runs/2567827884

Thank you, that is very helpful! We will investigate this issue, but as the program is large, please do not expect me to find a bug very soon.

Regarding resource management, Argobots 1.1 fixed error handling paths, so Argobots itself should properly return resource allocation errors (e.g., memory allocation failure in this memory pool) to the user application unless the error is catastrophic. Those paths should be well tested (#309).
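
For example, an application can surface such failures by checking the return code of the creation call (a minimal sketch; the helper name is made up, and pool/thread_func/arg come from the caller):

#include <abt.h>
#include <stdio.h>

/* Sketch: create a ULT and report the error code instead of assuming
 * success.  With Argobots 1.1, a resource allocation failure should be
 * returned here (e.g., ABT_ERR_MEM) rather than crashing inside the pool. */
static int create_ult_checked(ABT_pool pool, void (*thread_func)(void *),
                              void *arg, ABT_thread *thread)
{
    int ret = ABT_thread_create(pool, thread_func, arg,
                                ABT_THREAD_ATTR_NULL, thread);
    if (ret != ABT_SUCCESS)
        fprintf(stderr, "ABT_thread_create failed: %d\n", ret);
    return ret;
}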

> I'll create another PR to reproduce the settings I was using before to see if I can trigger this again - it was regularly occurring for a couple of days for me last week.

Thanks! The v1.1 tag of Argobots has not been updated since March 31, so it would be helpful to know which of your commits directly reveal this issue (which potentially existed in Argobots all along).

@shintaro-iwasaki
Collaborator

I could not reproduce this issue in the 4-5 runs I have checked so far (shintaro-iwasaki/daos-copy#1).

I will write a heavily threaded program and check this memory pool implementation in Argobots, but at this point I would suspect either a ULT stack overflow or a bug (e.g., illegal memory access) in DAOS.

@philip-davis

I am hitting this assert on Summit. This is running with ASAN, which is reporting no errors ahead of the assert failure.

dspaces_server: ../src/include/abti_mem_pool.h:123: ABTI_mem_pool_alloc: Assertion `num_headers_in_cur_bucket >= 1' failed.
[h35n03:06804] *** Process received signal ***
[h35n03:06804] Signal: Aborted (6)
[h35n03:06804] Signal code: (-6)
[h35n03:06804] [ 0] /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-9.1.0/spectrum-mpi-10.3.1.2-20200121-jd4wr7r4th5gtr4qndday6gkbvqziasp/container/../lib/libopen-pal.so.3(+0x9222c)[0x20000194222c]
[h35n03:06804] [ 1] [0x2000000504d8]
[h35n03:06804] [ 2] /lib64/libc.so.6(abort+0x2b4)[0x2000010b2094]
[h35n03:06804] [ 3] /lib64/libc.so.6(+0x356d4)[0x2000010a56d4]
[h35n03:06804] [ 4] /lib64/libc.so.6(__assert_fail+0x64)[0x2000010a57c4]
[h35n03:06804] [ 5] /autofs/nccs-svm1_home1/pdavis/spack-new/opt/spack/linux-rhel7-power9le/gcc-9.1.0/argobots-1.1-dmjmhggingfmiycftyfakqbqdqxkh7s4/lib/libabt.so.1(+0x14b78)[0x200000cf4b78]
[h35n03:06804] [ 6] /autofs/nccs-svm1_home1/pdavis/spack-new/opt/spack/linux-rhel7-power9le/gcc-9.1.0/argobots-1.1-dmjmhggingfmiycftyfakqbqdqxkh7s4/lib/libabt.so.1(ABT_thread_create+0xa8)[0x200000cf4d18]
[h35n03:06804] [ 7] /gpfs/alpine/scratch/pdavis/csc143/dspaces/build.3/lib64/libdspaces-server.so.2(_handler_for_ss_rpc+0xf4)[0x200000b6b038]
[h35n03:06804] [ 8] /autofs/nccs-svm1_home1/pdavis/spack-new/opt/spack/linux-rhel7-power9le/gcc-9.1.0/mercury-2.0.1-jg6qcunksry7jx5uqgpyien6vw4f2tsx/lib/libmercury.so.2(+0x6350)[0x200000c16350]
[h35n03:06804] [ 9] /autofs/nccs-svm1_home1/pdavis/spack-new/opt/spack/linux-rhel7-power9le/gcc-9.1.0/mercury-2.0.1-jg6qcunksry7jx5uqgpyien6vw4f2tsx/lib/libmercury.so.2(+0x15080)[0x200000c25080]
[h35n03:06804] [10] /autofs/nccs-svm1_home1/pdavis/spack-new/opt/spack/linux-rhel7-power9le/gcc-9.1.0/mercury-2.0.1-jg6qcunksry7jx5uqgpyien6vw4f2tsx/lib/libmercury.so.2(HG_Core_trigger+0x24)[0x200000c2cfb4]
[h35n03:06804] [11] /autofs/nccs-svm1_home1/pdavis/spack-new/opt/spack/linux-rhel7-power9le/gcc-9.1.0/mercury-2.0.1-jg6qcunksry7jx5uqgpyien6vw4f2tsx/lib/libmercury.so.2(HG_Trigger+0x28)[0x200000c1a078]
[h35n03:06804] [12] /autofs/nccs-svm1_home1/pdavis/spack-new/opt/spack/linux-rhel7-power9le/gcc-9.1.0/mochi-margo-0.9.4-iq7m5zqw6jlxm32sqp3pu6dbdhmuzux2/lib/libmargo.so.0(__margo_hg_progress_fn+0x74)[0x200000baa4f4]
[h35n03:06804] [13] /autofs/nccs-svm1_home1/pdavis/spack-new/opt/spack/linux-rhel7-power9le/gcc-9.1.0/argobots-1.1-dmjmhggingfmiycftyfakqbqdqxkh7s4/lib/libabt.so.1(+0x1da18)[0x200000cfda18]
[h35n03:06804] [14] /autofs/nccs-svm1_home1/pdavis/spack-new/opt/spack/linux-rhel7-power9le/gcc-9.1.0/argobots-1.1-dmjmhggingfmiycftyfakqbqdqxkh7s4/lib/libabt.so.1(+0x1dff4)[0x200000cfdff4]
[h35n03:06804] *** End of error message ***

@shintaro-iwasaki
Collaborator

@philip-davis Thank you very much. The error seems very similar to what @NiuYawei reported. I will check this issue again.

@shintaro-iwasaki
Collaborator

shintaro-iwasaki commented Jun 25, 2021

I tested Argobots' memory-pool operations on a Summit-like POWER9 machine at Argonne, but I could not reproduce this issue.

What I did

Argobots v1.1 + POWER9 + GCC 9.3, with the Spack-default configuration. The test creates 10 million ULTs (Fibonacci(34) with no cutoff) and schedules them in a random work-stealing manner. I repeated this test with various numbers of ESs, 500 times in total (which took a few hours).

## Environment
$ gcc --version
gcc (Spack GCC) 9.3.0
$ cat /proc/cpuinfo
...
cpu             : POWER9, altivec supported
...

## Configure Argobots (the same as the default "spack install argobots")
$ git checkout v1.1
$ sh autogen.sh
$ ./configure --prefix=$(pwd)/install --enable-perf-opt

## Build and run modified fibonacci
$ gcc fib.c -labt -L install/lib -I install/include/ -Wl,-rpath=$(pwd)/install/lib -o fib.out
$ cat test.sh
for repeat in $(seq 5); do
  for es in $(seq 100); do
    date
    echo "./fib.out -n 35 -e $es"
    ./fib.out -n 35 -e $es
  done
done
$ sh test.sh
Code

#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
#include <unistd.h>
#include <stdarg.h>
#include <abt.h>

#define DEFAULT_NUM_XSTREAMS 4
#define DEFAULT_N 10

ABT_pool *pools;

typedef struct {
    int n;
    int ret;
} fibonacci_arg_t;

void fibonacci(void *arg)
{
    int n = ((fibonacci_arg_t *)arg)->n;
    int *p_ret = &((fibonacci_arg_t *)arg)->ret;

    if (n <= 1) {
        *p_ret = 1;
    } else {
        fibonacci_arg_t child1_arg = { n - 1, 0 };
        fibonacci_arg_t child2_arg = { n - 2, 0 };
        int rank;
        ABT_xstream_self_rank(&rank);
        ABT_pool target_pool = pools[rank];
        ABT_thread child1;
        /* Calculate fib(n - 1). */
        ABT_thread_create(target_pool, fibonacci, &child1_arg,
                          ABT_THREAD_ATTR_NULL, &child1);
        /* Calculate fib(n - 2).  We do not create another ULT. */
        fibonacci(&child2_arg);
        ABT_thread_free(&child1);
        *p_ret = child1_arg.ret + child2_arg.ret;
    }
}

int fibonacci_seq(int n)
{
    if (n <= 1) {
        return 1;
    } else {
        int i;
        int fib_i1 = 1; /* Value of fib(i - 1) */
        int fib_i2 = 1; /* Value of fib(i - 2) */
        for (i = 3; i <= n; i++) {
            int tmp = fib_i1;
            fib_i1 = fib_i1 + fib_i2;
            fib_i2 = tmp;
        }
        return fib_i1 + fib_i2;
    }
}

int main(int argc, char **argv)
{
    int i, j;
    /* Read arguments. */
    int num_xstreams = DEFAULT_NUM_XSTREAMS;
    int n = DEFAULT_N;
    while (1) {
        int opt = getopt(argc, argv, "he:n:");
        if (opt == -1)
            break;
        switch (opt) {
            case 'e':
                num_xstreams = atoi(optarg);
                break;
            case 'n':
                n = atoi(optarg);
                break;
            case 'h':
            default:
                printf("Usage: ./fibonacci [-e NUM_XSTREAMS] [-n N]\n");
                return -1;
        }
    }

    /* Allocate memory. */
    ABT_xstream *xstreams =
        (ABT_xstream *)malloc(sizeof(ABT_xstream) * num_xstreams);
    pools = (ABT_pool *)malloc(sizeof(ABT_pool) * num_xstreams);
    ABT_sched *scheds = (ABT_sched *)malloc(sizeof(ABT_sched) * num_xstreams);

    /* Initialize Argobots. */
    ABT_init(argc, argv);

    /* Create pools. */
    for (i = 0; i < num_xstreams; i++) {
        ABT_pool_create_basic(ABT_POOL_FIFO, ABT_POOL_ACCESS_MPMC, ABT_TRUE,
                              &pools[i]);
    }

    /* Create schedulers. */
    for (i = 0; i < num_xstreams; i++) {
        ABT_pool *tmp = (ABT_pool *)malloc(sizeof(ABT_pool) * num_xstreams);
        for (j = 0; j < num_xstreams; j++) {
            tmp[j] = pools[(i + j) % num_xstreams];
        }
        ABT_sched_create_basic(ABT_SCHED_DEFAULT, num_xstreams, tmp,
                               ABT_SCHED_CONFIG_NULL, &scheds[i]);
        free(tmp);
    }

    /* Set up a primary execution stream. */
    ABT_xstream_self(&xstreams[0]);
    ABT_xstream_set_main_sched(xstreams[0], scheds[0]);

    /* Create secondary execution streams. */
    for (i = 1; i < num_xstreams; i++) {
        ABT_xstream_create(scheds[i], &xstreams[i]);
    }

    for (int i = 2; i <= n; i++) {
        fibonacci_arg_t arg = { i, 0 };
        fibonacci(&arg);
        int ret = arg.ret;
        int ans = fibonacci_seq(i);
        /* Check the results. */
        printf("Fibonacci(%d) = %d (ans: %d)\n", i, ret, ans);
    }

    /* Join secondary execution streams. */
    for (i = 1; i < num_xstreams; i++) {
        ABT_xstream_join(xstreams[i]);
        ABT_xstream_free(&xstreams[i]);
    }

    /* Finalize Argobots. */
    ABT_finalize();

    /* Free allocated memory. */
    free(xstreams);
    free(pools);
    free(scheds);

    return 0;
}

Although I have not confirmed the cause, I would first suggest setting ABT_THREAD_STACKSIZE=XXX, where XXX is sufficiently large (the default is 4096 [EDIT] 16384 bytes).

ABT_THREAD_STACKSIZE=256000 ./your_app.out
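
If only a few ULTs need a larger stack, the stack size can also be set per ULT through the attribute API rather than the environment variable (a minimal sketch; the 256 KB value and the helper name are illustrative, and pool/thread_func/arg come from the caller):

#include <abt.h>

/* Sketch: create one ULT with a larger stack via ABT_thread_attr. */
static int create_big_stack_ult(ABT_pool pool, void (*thread_func)(void *),
                                void *arg, ABT_thread *thread)
{
    ABT_thread_attr attr;
    int ret = ABT_thread_attr_create(&attr);
    if (ret != ABT_SUCCESS)
        return ret;
    ABT_thread_attr_set_stacksize(attr, 256 * 1024); /* illustrative size */
    ret = ABT_thread_create(pool, thread_func, arg, attr, thread);
    ABT_thread_attr_free(&attr);
    return ret;
}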

Explanation

If a ULT runs out of its function stack, it can overwrite num_headers and cause this assert failure.


[EDIT] 4KB is wrong. 16KB is correct.

I'm not sure about Margo's default stack size, but if Margo does not explicitly set it, the program possibly caused a stack overflow, considering the depth of the function stack @philip-davis reported. By default it is 4KB ([EDIT] 16KB). I am not fully sure this is the reason since I cannot reproduce the issue; the scenario above assumes that num_headers is overwritten with 0, which should not always be the case.

To examine this, the latest Argobots (main) supports mprotect-based dynamic detection (see #327): this feature allows the user to detect a stack smash when it happens. Argobots 1.1 (the latest stable version) supports stack-canary-based lazy detection (see #293). @mdorier: I would welcome any suggestions regarding this issue if you have any.
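
For illustration, the general idea behind the mprotect-based approach is to make the last page of a ULT stack inaccessible so an overflow faults immediately instead of silently corrupting adjacent memory such as a pool header (a generic sketch, not the Argobots implementation):

#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Sketch: allocate a stack with a guard page at its low end (stacks grow
 * downward on common platforms).  A ULT that overruns the stack then hits
 * PROT_NONE memory and receives SIGSEGV at the point of the overflow. */
static void *alloc_guarded_stack(size_t stack_size)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    void *mem = mmap(NULL, stack_size + page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return NULL;
    if (mprotect(mem, page, PROT_NONE) != 0) {
        munmap(mem, stack_size + page);
        return NULL;
    }
    return (char *)mem + page; /* usable region of stack_size bytes */
}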

@mdorier
Contributor

mdorier commented Jun 25, 2021

Margo sets ABT_THREAD_STACKSIZE to 2097152 by default, so I doubt that's the issue, but I could be wrong.

@shintaro-iwasaki
Collaborator

@mdorier Thank you. I will check the memory pool implementation again.

@philip-davis

philip-davis commented Jun 25, 2021 via email

@ashleypittman

Thank you very much for the update. Based on your comments I've tried updating to the tip of Argobots (2202510), building with --enable-debug=most, and setting ABT_STACK_OVERFLOW_CHECK=mprotect. This has converted the general instability I was seeing into a constant, reproducible segfault, as a result of which I've managed to identify at least two areas of our code that require attention.

@ashleypittman


I can confirm that once we tried a build with the ABT_STACK_OVERFLOW_CHECK=memcheck feature and fixed two issues that were causing segfaults with that feature enabled, we have not seen this again; in fact the system has been remarkably stable since. I think we can confirm that the problems we were seeing were the result of stack overflow, and I'd be happy to close this bug report now.
