ULT stack allocation method to address overrun scenario #274

Open
bfaccini opened this issue Dec 21, 2020 · 27 comments
Comments

@bfaccini
Contributor

bfaccini commented Dec 21, 2020

We regularly trigger crashes due to ULT stack overruns, mainly caused by 3rd-party libraries that place big arrays/structs on the stack, or during lazy symbol resolution of dynamic libraries...

Since we expect to run millions of concurrent ULTs, we cannot rely on statically allocating big chunks for stacks (to limit the memory footprint), and we would therefore like to benefit from the existing kernel mechanism that lets stacks dynamically grow up to their associated limit when needed.

We first thought we could implement this outside Argobots and use our own external stacks, but as described in issue #16, the current lack of a hook in Argobots to learn that a stack is no longer in use (so it can be freed or reassigned) prevents this.

So, here are the main points that should be addressed in order to implement this new stack allocation mechanism inside Argobots:
_ allocate each new stack with mmap(), with a minimal initial size, ensuring it is positioned in a hole (by providing its address as a hint to mmap()!) far enough from the preceding mapping, so that it is allowed to grow downward automatically (as per the kernel's existing mechanism for stacks!) up to some reasonable maximum size (assuming that the preceding mapping will not itself grow upward...), which could be set to the xstream's stack limit.
_ as it is not possible to rely on the kernel to find the best place for us (i.e., by using NULL as mmap()'s address/1st parameter), we will have to parse the current content of /proc/self/maps and decide ourselves. This process will also have to take current/previous stack allocations into account, in order to avoid re-using a hole into which a previously allocated stack is already expected to grow...
_ as part of the machinery required to manage these stacks, a map of current allocations has to be maintained to avoid a possibly colliding allocation of a new stack, together with a mechanism/policy to determine which stacks are no longer used and can be re-assigned or freed/munmap()'ed.
_ also, the kernel/boot "stack_guard_gap" parameter will have to be taken into account, as the minimum hole size kept from the preceding mapping, because the kernel uses it to decide whether a "stack" region is allowed to grow or not.
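For illustration, here is a minimal sketch of the core allocation step described above (find_hole() is assumed to scan /proc/self/maps as discussed, and the size constants are placeholders):

#include <sys/mman.h>
#include <stddef.h>

/* Placeholder sizes; real values would come from configuration. */
#define STACK_GUARD_GAP (256 * 4096)  /* kernel "stack_guard_gap" default */
#define MAX_STACK_SIZE  (128 * 4096)  /* growth limit we want to allow */
#define MIN_STACK_SIZE  (4 * 4096)    /* initial mapping size */

extern char *find_hole(size_t length); /* assumed: scans /proc/self/maps */

static void *alloc_growable_stack(void)
{
    /* Reserve room for the guard gap plus the full growth, then map only
     * the minimal size at the top of the hole so the kernel can let the
     * stack grow downward into the rest of it. */
    char *hole = find_hole(STACK_GUARD_GAP + MAX_STACK_SIZE);
    if (hole == NULL || hole == MAP_FAILED)
        return NULL;
    void *hint = hole + STACK_GUARD_GAP + MAX_STACK_SIZE - MIN_STACK_SIZE;
    void *p = mmap(hint, MIN_STACK_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK | MAP_GROWSDOWN,
                   -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}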

@carns
Contributor

carns commented Dec 21, 2020

I'm not familiar enough with these mechanisms to comment on @bfaccini 's proposal, but I am interested in this topic.

FWIW we often use external libraries (communication libraries in particular, but there are others as well) that put significant demands on the stack. We explicitly set ABT_THREAD_STACKSIZE to a conservative value before starting Argobots but we don't really know if we have selected a "good" value or not in any given system configuration. I would be happy to simply have a better idea of how close we came to the stack limits, but a dynamic option might also be helpful if the overhead isn't prohibitive.

@shintaro-iwasaki
Collaborator

Thank you for your interest in this topic. Since this has been requested by many users, we will definitely investigate it further, but please let me explain the situation so that we can find a practical solution.

Stack hook

Sorry that I missed the discussion #16, but I believe we can do what you want by using the existing ABT_key mechanism.

void stack_free_callback(void *stack) {
    munmap(stack, PREDEFINED_MMAP_STACK_SIZE);
}
ABT_key g_stack_key;
/* Somewhere during initialization:
 * ABT_key_create(stack_free_callback, &g_stack_key); */

void thread(void *arg) {
    /* The created thread needs to know "stack" (here it is passed as the
     * argument).  The ULT-local value will be freed by stack_free_callback()
     * on ABT_thread_free(). */
    void *stack = arg;
    ABT_key_set(g_stack_key, stack);
}

void thread_create() {
    void *stack = mmap(...);
    ABT_thread_create(..., thread, stack, ...);
    ABT_thread_free(...);
}

mmap'ed stack as pointed out by @bfaccini

I would like to know more about your situation, but mmap might be costly if you run it on every ULT creation. Copying function stacks to other places is not practical (the addresses of variables on a stack would be corrupted), so we would need mremap() etc. for an extension.

I need to learn how mmap can smartly grow the stack, but if this growth is done manually, we also need to consider the cost of mprotect, which is not very cheap (mprotect itself is a few times heavier than the cost of ULT create/free). I am not sure if this cost per ULT is acceptable for your application.


If you know which ULTs could need larger stacks, setting larger stack sizes for only those ULTs is the easiest solution.

There are other (simpler) solutions, which might help in your use cases. None of them are implemented in the main Argobots branch, so please let me know if you find a promising one.

Programmatic stack growth

It is easy to implement an assembly trick that calls a function on a different function stack, so if you know which function uses a large stack, you can run that function through such a helper. The interface would look like this:

extern void ABT_call_on_stack(func_ptr, extra_stack, ...);
void thread(void *arg) {
    thread_args_t *args = (thread_args_t *)arg;
    small_stack_computation(args->arg1);
    if (args->arg2) {
        /* extra_stack can be reused/cached unless there's concurrent access. */
        void *extra_stack = malloc(1024 * 1024);
        ABT_call_on_stack(large_stack_computation, extra_stack, args->arg2);
        free(extra_stack);
    } else {
        /* This ULT does not need a large stack if arg2 is NULL, so the
         * default stack size is small. */
    }
}

If you know which function uses a large memory, this is a good option. Argobots can provide this feature. The overhead is very low (I believe 20 - 50 cycles or so).

Note: this idea is called "cactus stack". Some research compilers use this technique for all function calls.

if (current_stack_pointer > base_stack_pointer + stack_size * 0.75) {
    extra_stack = malloc(stack_size);
    call_on_stack(foo, extra_stack, abc);
    free(extra_stack);
} else {
    // Call a function normally.
    foo(abc);
}

Lazy stack overflow detection (stack canary)

This is to check whether a stack overflow "happened" or not. The idea is very simple: write a stack canary value at the end of the stack and check whether it has been overwritten.
https://en.wikipedia.org/wiki/Buffer_overflow_protection#Canaries
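A minimal sketch of the idea (the canary value and function names are illustrative, not an existing Argobots API):

#include <stdint.h>

#define STACK_CANARY 0xdeadbeefcafef00dULL

/* Write the canary at the very end (lowest address) of the ULT stack. */
static void canary_set(void *stack_end)
{
    *(volatile uint64_t *)stack_end = STACK_CANARY;
}

/* Check lazily (e.g., when the ULT terminates) whether the canary was
 * overwritten, which means the stack overflowed at some point. */
static int canary_was_overwritten(void *stack_end)
{
    return *(volatile uint64_t *)stack_end != STACK_CANARY;
}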

Lazy stack allocation

This is an idea to reduce the total stack consumption (so that the user can set a larger stack per ULT). Currently Argobots allocates a stack in ABT_thread_create() and frees it in ABT_thread_free(), but the stack is actually needed only from "thread run" to "thread termination". For example, if 90% of ULTs are run-to-completion (e.g., ULTs that do not yield), those ULTs get a stack in ABT_xstream_run_unit() and return it when ABT_xstream_run_unit() finishes, so Argobots effectively needs to keep stacks only for the remaining 10% of ULTs. Our previous research suggested that this run-to-completion-ULT rate can be high in some applications.

This feature will be implemented in Argobots 2.0, which is planned after Argobots 1.1.

@bfaccini
Contributor Author

bfaccini commented Jan 7, 2021

Hello,
First of all, many thanks for your detailed comments/proposals.

I agree that the ABT_key feature/mechanism seems to address the external stack management need upon ULT exit. And I think the stack address for later management could be passed as part of the "void *arg" (likely a struct) parameter of ABT_thread_create().
BTW, you should update issue #16 to indicate this, even if it is a late update.

The "cactus stack" method can not help in our case, as each of the stack overruns that we have experienced is not due to our code but comes from the external libs we use.

The "stack canary" method is only a way to detect overrun afterward but will not allow to survive from it.

My understanding of the "lazy stack allocation" method you describe is that it only helps for simple ULTs which execute their code in one go (scheduled once without yielding) and could thus use the scheduler's stack, like tasklets. But we have no ULTs of this kind in our code.

About mmap()'ed stacks and how they can grow: it is a kernel feature which is already used for the main process/thread stack. When a page fault occurs for an address in a page that lies just below the current lowest page (the top) of a stack VMA, the kernel simply extends the VMA with that page (if the stack limits are not exceeded) and returns to user mode for the faulting instruction to complete successfully.
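For illustration, a minimal Linux-specific sketch of this behavior (it assumes there is enough free space below the mapping, which mmap(NULL, ...) alone does not guarantee; that is exactly what the hole search discussed here is meant to ensure):

#include <sys/mman.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    size_t pg = (size_t)sysconf(_SC_PAGESIZE);
    /* Map a one-page region that the kernel may grow downward on demand. */
    char *top = mmap(NULL, pg, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK | MAP_GROWSDOWN,
                     -1, 0);
    if (top == MAP_FAILED)
        return 1;
    top[0] = 'A';   /* inside the mapping: no VMA change needed */
    top[-1] = 'B';  /* just below the mapping: the page fault makes the kernel
                     * extend the VMA downward instead of delivering SIGSEGV */
    printf("the stack VMA grew downward by one page\n");
    return 0;
}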
Concerning the possible cost of mmap() (and of the best-place search within the current VMAs) and how to reduce it, the idea would be to keep/re-use mmap()'ed stacks in a pool. As an optimisation, maintaining a map of the current VMAs seems to be a plus.
I did not get what you mean about mprotect() and mremap() usage... Did you intend to use them respectively for overrun detection and stack growth? If so, I don't think mremap() can handle stack content relocation...

@shintaro-iwasaki
Collaborator

@bfaccini Thank you for your comments and sorry for the late reply. I updated #16. It seems that cactus stack and stack canary are not helpful in your use case.

Regarding mmap(), I am not sure if this is what you want. My understanding can be wrong. I'd be very happy if you could correct my understanding.

MAP_GROWSDOWN (or MAP_STACK)

This would be useful to grow the memory downwards according to the specification, but as far as I have googled, it might not be practical. The number of memory objects that can be created seems limited. For example, on openSUSE 15 and Ubuntu 20, only 65K stacks can be created (see the small code below: ./a.out 1048576 1 1). At least glibc does not seem to use this feature for its non-main threads. It should depend on the Linux kernel version, but it might be better if we can avoid using it.

It seems that we can achieve similar functionality without MAP_GROWSDOWN. Physical memory is associated on demand even if the region is accessed from the bottom, though this can depend on the UNIX/Linux version.

mmap() in general

Because the OS allows overcommitting memory, a normal mmap() should work as we want, but once a stack has grown (= pages are actually associated), we cannot shrink it. This means we cannot put an mmap()'ed stack that has already grown back into a pool; otherwise the program eventually uses large/full stacks for all ULTs. With the default mmap(), it is hard to know how much real memory a region actually consumes, so a program cannot tell whether a stack has grown or not. This indicates that we need to munmap() all stacks when freeing ULTs, which can be heavy.

If we can accept the overhead of mmap()/munmap()'ing all stacks, the ./configure --disable-mem-pool option uses malloc() and free(), which should work similarly. We can improve this option, but in any case it might show how this mechanism behaves (though it depends on the underlying malloc() and free() implementation; I assume glibc's).

If we want to detect stack growth programmatically, maybe we can use mprotect(), but I am not sure it works. mremap() might be useful for shrinking/expanding a stack, but as you pointed out, we probably cannot remap the stack in a downward manner (= changing the starting pointer); otherwise, we need to munmap() that stack. There may be other promising functionality provided by the OS/glibc that might allow using a pool.

The following is a small program to see how many stacks we can allocate. A small region of each allocated stack is touched.

Test program

#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ADDRS (1024 * 1024) // 1M

#define MALLOC_KIND_NORMAL_MMAP 0
#define MALLOC_KIND_GROWSDOWN_MMAP 1
#define MALLOC_KIND_STACK_GROWSDOWN_MMAP 2
#define MALLOC_KIND_MALLOC 3

#define TOUCH_KIND_NO 0
#define TOUCH_KIND_TOP 1
#define TOUCH_KIND_BOTTOM 2

size_t g_stack_size = 0;
int g_malloc_kind = MALLOC_KIND_NORMAL_MMAP;
int g_touch_kind = TOUCH_KIND_NO;

void *alloc_memory()
{
    if (g_malloc_kind == MALLOC_KIND_NORMAL_MMAP) {
        return mmap(NULL, g_stack_size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
    } else if (g_malloc_kind == MALLOC_KIND_GROWSDOWN_MMAP) {
        return mmap(NULL, g_stack_size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_GROWSDOWN, 0, 0);
    } else if (g_malloc_kind == MALLOC_KIND_STACK_GROWSDOWN_MMAP) {
        return mmap(NULL, g_stack_size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK | MAP_GROWSDOWN, 0,
                    0);
    } else if (g_malloc_kind == MALLOC_KIND_MALLOC) {
        return malloc(g_stack_size);
    }
    return NULL;
}

void free_memory(void *mem)
{
    if (g_malloc_kind == MALLOC_KIND_MALLOC) {
        free(mem);
    } else {
        munmap(mem, g_stack_size);
    }
}

const char *get_str_malloc_kind(int malloc_kind)
{
    if (malloc_kind == MALLOC_KIND_NORMAL_MMAP) {
        return "normal mmap";
    } else if (malloc_kind == MALLOC_KIND_GROWSDOWN_MMAP) {
        return "mmap + MAP_GROWSDOWN";
    } else if (malloc_kind == MALLOC_KIND_STACK_GROWSDOWN_MMAP) {
        return "mmap + MAP_GROWSDOWN + MAP_STACK";
    } else if (malloc_kind == MALLOC_KIND_MALLOC) {
        return "malloc";
    }
    return "(invalid)";
}

const char *get_str_touch_kind(int touch_kind)
{
    if (touch_kind == TOUCH_KIND_NO) {
        return "no touch";
    } else if (touch_kind == TOUCH_KIND_TOP) {
        return "touch top";
    } else if (touch_kind == TOUCH_KIND_BOTTOM) {
        return "touch bottom";
    }
    return "(invalid)";
}

int main(int argc, char *argv[])
{
    if (argc != 4) {
        printf("Usage: ./a.out MEM_SIZE MALLOC_KIND TOUCH_KIND\n");
        printf("ex:    ./a.out 1024 1 0\n");
        printf("MEM_SIZE    = NUM : memory size [bytes]\n");
        printf("MALLOC_KIND = 0   : %s\n", get_str_malloc_kind(0));
        printf("            = 1   : %s\n", get_str_malloc_kind(1));
        printf("            = 2   : %s\n", get_str_malloc_kind(2));
        printf("            = 3   : %s\n", get_str_malloc_kind(3));
        printf("TOUCH_KIND  = 0   : %s\n", get_str_touch_kind(0));
        printf("            = 1   : %s\n", get_str_touch_kind(1));
        printf("            = 2   : %s\n", get_str_touch_kind(2));
        return -1;
    }

    g_stack_size = atoi(argv[1]);
    g_malloc_kind = atoi(argv[2]);
    g_touch_kind = atoi(argv[3]);

    printf("Create %d %d-byte memory objects (%s, %s)\n", MAX_ADDRS,
           (int)g_stack_size, get_str_malloc_kind(g_malloc_kind),
           get_str_touch_kind(g_touch_kind));
    void **addrs = (void **)calloc(MAX_ADDRS, sizeof(void *));
    for (int i = 0; i < MAX_ADDRS; i++) {
        void *ptr = alloc_memory();
        if (ptr == NULL || ptr == MAP_FAILED) {
            printf("allocation failed (i = %d).\n", i);
            return -1;
        }
        if (g_touch_kind == TOUCH_KIND_TOP) {
            ((char *)ptr)[0] = 1;
        } else if (g_touch_kind == TOUCH_KIND_BOTTOM) {
            ((char *)ptr)[g_stack_size - 8] = 1;
        }
        addrs[i] = ptr;
        printf("\rCreate %8d / %8d memory objects", i, MAX_ADDRS);
        fflush(stdout);
    }
    printf("\n");
    fflush(stdout);

    for (int i = 0; i < MAX_ADDRS; i++) {
        free_memory(addrs[i]);
    }
    printf("successfully created and freed all %d memory objects\n", MAX_ADDRS);
    fflush(stdout);

    free(addrs);

    return 0;
}

Lazy stack allocation

Lazy stack allocation is different from a tasklet. The exact mechanism is different, but in terms of stack consumption, it is the same as a mechanism that allocates a real stack only when that ULT "yields" for the first time (= lazily). The functionality of this ULT is the same as an existing ULT (e.g., this ULT can yield, unlike a tasklet, which cannot). For example, if a ULT happens to acquire a lock without contention and finishes its operation, this ULT does not consume a stack. A stack is consumed only when a ULT actually yields (e.g., ABT_thread_yield() or ABT_mutex_lock() on a locked mutex). Of course, some applications might call yield in all ULTs, but others might not. I do not know whether this feature is beneficial for your workload.

I would like to know what direction would fit your use case.

@bfaccini
Contributor Author

bfaccini commented Jan 11, 2021

Hello Shintaro,
Thanks for the clarifications. Did I correctly understand that "lazy stack allocation" is intended to behave like a one-shot "cactus stack" upon ABT_thread_yield()/ABT_mutex_lock() (and more generally upon the first context switch, under Argobots control, during each ULT's life)?
If so, how will you manage the return to the previous/first frames?
And anyway, this will not help us save stack allocations in our case, as all of our ULT code yields...

Concerning the limitation on the number of mmap()s per process, I was not aware of such a limit...
BTW, running your sample program on my side, I was able to allocate more than 500K stacks of 16K size. So I wonder if the limitation is not due to the node configuration (real memory size, per-process memory limits, ...), as in my case the failure was most likely caused by the process "virtual memory" limit of 8GB!...
So I don't think there is a kernel limitation at all.
And I don't think glibc never uses mmap() for pthread stack allocation; this may be version dependent.

As I already pointed out, some management code needs to be added in order to "pool" the current mmap()'ed stacks, and particularly to keep track of each stack's size and start/end addresses.
malloc()/free() cannot be used if we want to benefit from the automatic stack-growing mechanism provided by the kernel; only mmap() can.
Correct me if I am wrong, but we don't need to care during the ULT's life; we only have to check for possible growth upon ULT exit, and this can be done, as I said before, by monitoring /proc/self/maps or any equivalent method if an API or syscall is available.
A view of the current mappings will also help to choose (and provide as a hint to mmap()!) the best place/hole for a new stack.
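As an illustration, here is a sketch of such a check upon ULT exit, assuming the stack's initial start address and its end address are known (Linux-specific; the /proc/self/maps format is assumed):

#include <stdio.h>

/* Returns 1 if the VMA ending at 'stack_end' now starts below
 * 'initial_start', i.e., the kernel has grown the stack downward;
 * 0 if it has not grown; -1 on error. */
static int stack_has_grown(unsigned long initial_start, unsigned long stack_end)
{
    char line[512];
    int grown = 0;
    FILE *fp = fopen("/proc/self/maps", "r");
    if (fp == NULL)
        return -1;
    while (fgets(line, sizeof(line), fp) != NULL) {
        unsigned long start, end;
        if (sscanf(line, "%lx-%lx", &start, &end) == 2 && end == stack_end) {
            grown = (start < initial_start);
            break;
        }
    }
    fclose(fp);
    return grown;
}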

@shintaro-iwasaki
Collaborator

shintaro-iwasaki commented Jan 11, 2021

Thank you for your comments!

"Lazy stack allocation"

The exact idea is different. It is "assigning a stack when a ULT starts executing and releasing the stack when the ULT finishes". For example, suppose one creates 10000 ULTs and joins them:

for (i = 0; i < 10000; i++)
  ABT_thread_create(..., &threads[i]);
for (i = 0; i < 10000; i++)
  ABT_thread_free(&threads[i]);

The current mechanism allocates stacks in ABT_thread_create() and frees them in ABT_thread_free(). After creating 10000 ULTs, 10000 stacks are consumed.

Under lazy stack allocation, we allocate a stack when a ULT starts executing (= the scheduler allocates a stack for a ULT) and free the stack when the ULT finishes. If threads[i] happens not to yield and finishes successfully, threads[i+1] (or any next ULT) will use the same stack, since the stack of threads[i] was returned when it finished. If no threads[i] yields, this program uses only one stack for 10000 ULTs. If threads[i] yields, the next ULT needs to use a different stack, since the stack of threads[i] has not been freed yet. In practice, the stack consumption becomes similar to a case where a ULT's stack is allocated when it first yields. Precisely speaking, this strategy strictly limits the stack consumption to the number of active ULTs that need to keep their own stack state.
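In rough pseudo-C, the scheduling path under this scheme would look like the following (not the actual Argobots internals; unit_t, pool_pop_stack(), pool_push_stack(), and switch_to() are hypothetical):

typedef struct { void *stack; int finished; } unit_t;
typedef struct stack_pool stack_pool_t;

/* Hypothetical helpers: a stack cache owned by the execution stream. */
void *pool_pop_stack(stack_pool_t *pool);
void pool_push_stack(stack_pool_t *pool, void *stack);
void switch_to(unit_t *unit); /* runs the ULT until it finishes or yields */

/* One scheduling step under lazy stack allocation. */
void run_unit(unit_t *unit, stack_pool_t *pool)
{
    if (unit->stack == NULL)
        unit->stack = pool_pop_stack(pool); /* assigned on first execution */
    switch_to(unit);
    if (unit->finished) {
        pool_push_stack(pool, unit->stack); /* immediately reusable */
        unit->stack = NULL;
    }
    /* A ULT that yielded keeps its stack until it eventually completes. */
}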

Anyway, if it does not help your case, we need a different approach.

mmap'ed memory

As far as I tried on openSUSE 15 and Ubuntu 20, ./a.out 1048576 1 1 failed at 65K, but it seems to depend on the Linux version etc. (I believe it is a Linux configuration, e.g., ulimit). In any case, if it works in your environment, that is great.

Surely adding a pool mechanism can alleviate the allocation overheads, but so far I could not find an API that can check the growth of a stack mmap()'ed with MAP_GROWSDOWN. Maybe we can check the memory protection status or catch a signal, but those do not sound portable.

One idea is removing such a check and, no matter whether the stack has actually been extended or not, freeing and reallocating each pooled stack every N allocations so that the extension is "reset". This optimization reduces the overhead by a factor of N. The idea is inelegant and its implementation cost is substantial, so I want to examine a bit more whether it is beneficial before implementing it. Specifically, the following information would be helpful: 1. the ordinary stack size needed by most ULTs (16KB?), 2. the large stack size needed by a few ULTs (2MB?), 3. the rate of ULTs that need larger stacks (0.1%?), 4. acceptable overheads for thread creation/freeing (one microsecond? ... the current one is about 200 - 500 cycles).

I welcome any suggestion including a useful API and a smart idea for pools.

@bfaccini
Contributor Author

bfaccini commented Jan 14, 2021

@shintaro-iwasaki ,
I spent some time writing a piece of code to better describe how mmap()'ed stack allocation could be done to allow automatic growth up to a limit:

#include <sys/mman.h>
#include <sys/types.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>

#define PROC_SELF_MAPS "/proc/self/maps"

/* max line size of PROC_SELF_MAPS output */
#define LINE_LEN 512

/* minimum/original stack size */
#define MIN_STACK_PAGES 4

/* maximum stack size after growth */
#define MAX_STACK_PAGES 8

/* number of stacks to be allocated depends on real memory size, per-process
 * memory limits, ...
 */
#define NB_STACKS 100000

/* There seems to be no API to get the configured value of the kernel boot
 * parameter "stack_guard_gap"... This is the minimum gap size between 2
 * consecutive mappings/VMAs enforced by the kernel.
 */
#define STACK_GUARD_GAP_PAGES 256

/* may need improvement to find the smallest suitable hole */
char *find_hole(size_t length)
{
        FILE *fp;
        char line[LINE_LEN];
        char *addr = NULL, *startaddr, *endaddr;

        /* XXX
         * parsing /proc/self/maps for each new stack allocation will
         * have an increasing cost, so it might be helpful to keep an
         * internal map/view up to date with new allocations, and only
         * refresh it upon allocation diverging from hint (using
         * MAP_FIXED_NOREPLACE may also help here) ...
         */

        fp = fopen(PROC_SELF_MAPS, "r");
        if (fp == NULL) {
                fprintf(stderr, "can't open %s : %s\n", PROC_SELF_MAPS,
                       strerror(errno));
                exit(1);
        }

        /* all VMAs start/end addresses are already rounded to page size */
        while (fgets(line, LINE_LEN, fp) != NULL) {
                if (sscanf(line, "%p-%p", &startaddr, &endaddr) == 2) {
                        if (startaddr > addr)
                                if (startaddr - addr >= length)
                                        break;
                        if (endaddr > addr)
                                addr = endaddr;
                }
        }

        /* check if last VMA has been reached but there is still not
         * enough room
         */
        if (UINTPTR_MAX - (uintptr_t)addr < length) {
                addr = MAP_FAILED;
                fprintf(stderr, "no hole found\n");
        }

        fclose(fp);

        return addr;
}

int main(void)
{
        char *addr, *addr_hint;
        int i, j, rc;
        size_t pg_size, stack_guard_gap, min_stack_size, max_stack_size;
        pid_t pid;
        char cmd[256];

        pg_size = sysconf(_SC_PAGESIZE);
        stack_guard_gap = STACK_GUARD_GAP_PAGES * pg_size;
        min_stack_size = MIN_STACK_PAGES * pg_size;
        max_stack_size = MAX_STACK_PAGES * pg_size;

        /* get a process mappings/VMAs picture before */
        pid = getpid();
        rc = snprintf(cmd, sizeof(cmd), "cat /proc/%u/maps", pid);
        if (rc >= sizeof(cmd)) {
                fprintf(stderr, "cat command size too long\n");
                exit(1);
        }
        system(cmd);

        for (i = 0; i < NB_STACKS; i++) {
                addr_hint = find_hole(max_stack_size + stack_guard_gap);
                if (addr_hint == MAP_FAILED) {
                        fprintf(stderr, "can't find hole for stack #%d\n", i);
                        goto _exit;
                }

                /* try to map found hole, but with minimal size ... */
                addr = mmap(addr_hint + stack_guard_gap + max_stack_size - min_stack_size,
                            min_stack_size, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK | MAP_GROWSDOWN,
                            -1, 0);

                if (addr == MAP_FAILED) {
                        fprintf(stderr, "mmap() failed : %s\n", strerror(errno));
                        goto _exit;
                }

                if (addr != addr_hint + stack_guard_gap + max_stack_size - min_stack_size)
                        fprintf(stderr, "mmap() assigned address %p vs hint %p\n",
                                addr,
                                addr_hint + stack_guard_gap + max_stack_size - min_stack_size);


                /* write on top of current stack */
                *addr = 'B';

                /* try to automatically extend stack by writing on top-1 */
                for (j = 0; j < MAX_STACK_PAGES - MIN_STACK_PAGES; j++)
                        *(addr - ((j * pg_size) + 1)) = 'A';
        }

_exit:
        fprintf(stderr, "%d allocations have occured\n", i);

        /* get a process mappings/VMAs picture after */
        system(cmd);
}

As I have indicated in the code comments, and since I also have not found an existing API to help with this, parsing /proc/self/maps upon each stack allocation has a cost, so it would be a good idea to keep an internal view of the mappings up to date, to optimise the best-hole search, and to only refresh it when an allocation diverges from the expected hint.

@shintaro-iwasaki
Collaborator

@bfaccini I am sorry for my late reply. Thank you very much for your code! Please give me a few more days to examine the best direction for this issue.

@shintaro-iwasaki
Collaborator

Thank you very much for your code! I'd also like to apologize for my late reply.

1. Number of MAP_GROWSDOWN stacks

On Ubuntu 20.04, Red Hat 7.6, and openSUSE 15.2, your code fails to create more than 65509/65514/65514 MAP_GROWSDOWN stacks. I believe this limit comes from the fact that all the MAP_GROWSDOWN addresses are allocated in a specific virtual memory region. I believe it is configurable, but please be aware that, by default, we can create only up to 65K stacks on some platforms (i.e., it might not be "portable" to create more than 65K).

2. The cost of fopen and mmap:

On our platform, reading "/proc/self/maps" is heavier than mmap() + munmap(), so it might not be very efficient to check "/proc/self/maps".

For example, on Intel Skylake 8180 + openSUSE 15.2 (see the benchmark):

$ ./benchmark.out
fopen+fgets+fclose time 3.82 [us]
open+read+close time 2.52 [us]
mmap(+MAP_GROWSDOWN) without address hint. time 1.98 [us]
mmap(+MAP_GROWSDOWN) with address hint. time 1.85 [us]
mmap. time 1.98 [us]

This indicates that always allocating and deallocating a stack is faster than checking "/proc/self/maps" whenever we allocate or free a stack in order to avoid putting an extended stack into the stack pool.

Test program

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <stddef.h>
#include <assert.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

#define LINE_LEN 512
#define STACK_SIZE (256 * 4096)
#define MIN_STACK_SIZE (4096 * 2)
#define NUM_ITERS 200
#define PROC_SELF_MAPS "/proc/self/maps"

static double get_time()
{
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec + (double)tv.tv_usec * 1e-6;
}

void print_maps()
{
    char line[LINE_LEN];
    FILE *fp = fopen(PROC_SELF_MAPS, "r");
    while (fgets(line, LINE_LEN, fp) != NULL) {
        printf("%s", line);
    }
    printf("\n");
    return;
}

int main()
{
    int num_warmups = 1000, num_repeats = 1000;
    double t1, t2;

    /**************************************************************************/

    for (int i = 0; i < num_warmups + num_repeats; i++) {
        if (i == num_warmups)
            t1 = get_time();
        for (int j = 0; j < NUM_ITERS; j++) {
            FILE *fp = fopen(PROC_SELF_MAPS, "r");
            char line[LINE_LEN];
            fgets(line, LINE_LEN, fp); // Just once.
            fclose(fp);
        }
    }
    t2 = get_time();
    printf("fopen+fgets+fclose time %.2f [us]\n",
           (t2 - t1) / num_repeats * 1.0e6 / NUM_ITERS);

    /**************************************************************************/

    for (int i = 0; i < num_warmups + num_repeats; i++) {
        if (i == num_warmups)
            t1 = get_time();
        for (int j = 0; j < NUM_ITERS; j++) {
            int fd = open(PROC_SELF_MAPS, O_RDONLY);
            char line[LINE_LEN];
            read(fd, line, LINE_LEN);
            close(fd);
        }
    }
    t2 = get_time();
    printf("open+read+close time %.2f [us]\n",
           (t2 - t1) / num_repeats * 1.0e6 / NUM_ITERS);

    /**************************************************************************/

    for (int i = 0; i < num_warmups + num_repeats; i++) {
        if (i == num_warmups)
            t1 = get_time();
        char *ptrs[NUM_ITERS];
        for (int j = 0; j < NUM_ITERS; j++) {
            ptrs[j] =
                (char *)mmap(NULL, MIN_STACK_SIZE, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK |
                                 MAP_GROWSDOWN,
                             -1, 0);
            (*ptrs[j]) = 'a';
        }
        // print_maps();
        for (int j = 0; j < NUM_ITERS; j++)
            munmap(ptrs[j], MIN_STACK_SIZE);
    }
    t2 = get_time();
    printf("mmap(+MAP_GROWSDOWN) without address hint. time %.2f [us]\n",
           (t2 - t1) / num_repeats * 1.0e6 / NUM_ITERS);

    /**************************************************************************/

    for (int i = 0; i < num_warmups + num_repeats; i++) {
        if (i == num_warmups)
            t1 = get_time();
        char *ptrs[NUM_ITERS];
        for (int j = 0; j < NUM_ITERS; j++) {
            if (j == 0) {
                ptrs[j] = (char *)mmap(NULL, MIN_STACK_SIZE,
                                       PROT_READ | PROT_WRITE,
                                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK |
                                           MAP_GROWSDOWN,
                                       -1, 0);
            } else {
                ptrs[j] =
                    (char *)mmap(ptrs[j - 1] + STACK_SIZE, MIN_STACK_SIZE,
                                 PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK |
                                     MAP_GROWSDOWN,
                                 -1, 0);
                // Assuming that we can find this region quickly.  At least the
                // following works on some platforms.
                assert(ptrs[j] == ptrs[j - 1] + STACK_SIZE);
            }
            (*ptrs[j]) = 'a';
        }
        for (int j = 0; j < NUM_ITERS; j++)
            munmap(ptrs[j], MIN_STACK_SIZE);
    }
    t2 = get_time();
    printf("mmap(+MAP_GROWSDOWN) with address hint. time %.2f [us]\n",
           (t2 - t1) / num_repeats * 1.0e6 / NUM_ITERS);

    /**************************************************************************/

    for (int i = 0; i < num_warmups + num_repeats; i++) {
        if (i == num_warmups)
            t1 = get_time();
        char *ptrs[NUM_ITERS];
        for (int j = 0; j < NUM_ITERS; j++) {
            ptrs[j] =
                (char *)mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
            (*ptrs[j]) = 'a';
        }
        // print_maps();
        for (int j = 0; j < NUM_ITERS; j++)
            munmap(ptrs[j], STACK_SIZE);
    }
    t2 = get_time();
    printf("mmap. time %.2f [us]\n",
           (t2 - t1) / num_repeats * 1.0e6 / NUM_ITERS);

    /**************************************************************************/
    return 0;
}

3. Possible direction:

Assume that X% of N ULTs use a larger stack (1MB) while (100 - X)% use a smaller stack (16KB: the default Argobots stack size), and that the function overheads are the same as above ...

1. If all ULTs use 1MB stacks with the current Argobots memory pool:

Memory consumption: N * 1024 KB
Approximated Allocation overhead: 10 ns // this 10 ns (=20 cycles) is for a typical memory pool operation

2. If we mmap() + munmap() every ULT stack:

Memory consumption: N * (X / 100.0 * 1024 + (1.0 - X / 100.0) * 16) KB 
Allocation overhead: 2000 ns

3. If we mmap() + munmap() every M stack allocations/deallocations (i.e., when a counter in a stack reaches M, the stack is freed to reset its growth):

Memory consumption (when X << 1): N * (X * M / 100.0 * 1024 + (1.0 - X * M / 100.0) * 16) KB 
Allocation overhead: 2000 / M + 10 ns

If we don't use MAP_GROWSDOWN, the virtual memory consumption of cases 2 and 3 would be the same as case 1, but Argobots can create more than 65K stacks on some platforms without changing the system configuration. If vm.overcommit_memory is not 2 (https://www.kernel.org/doc/Documentation/vm/overcommit-accounting), the program should not crash without MAP_GROWSDOWN. In this case, the OOM killer will kill the application if Argobots actually touches too many stack regions, but the same thing would happen even if we used MAP_GROWSDOWN.

I think the following implementation is reasonable.

  • Using a memory pool with a per-execution-stream cache (= the same as the current pool implementation)
  • mmap() + munmap() (or malloc()/free()) every M ULT stack allocations (and/or for stacks beyond an upper limit on the number of stacks per system)
  • Do not use MAP_GROWSDOWN (= to avoid 65K limits. The implementation is easier and simpler.)
  • M, pool cache size, and overall pool capacity are configurable

I'd like to hear your thoughts, especially regarding your application's requirements. Once the direction is fixed, I will create a PR.

@bfaccini
Contributor Author

Hello Shintaro,

First of all, I had to update my code snippet above as it was wrongly doing a system("cat /proc/self/maps") at the beginning and end, which was nonsense since it would not dump the VMAs of the current process!...

I agree with you that the cost of parsing "/proc/self/maps" upon each mmap()'ed stack allocation is too heavy, and this is why I already wrote above that the best approach would be to maintain an internal view of it, use it for subsequent allocations, and only refresh it upon mmap() failure or an unexpected mmap()'ed address vs. the hint. This is very unlikely to happen frequently, as the process's VMA map, apart from this stack allocation mechanism under our control, should not change that much.

Again, if you want to benefit from the automatic stack growth mechanism (up to the process stack limit) provided by the kernel, you need to use MAP_GROWSDOWN!! See the following piece of code from my snippet:

                /* try to automatically extend stack by writing on top-1 */
                for (j = 0; j < MAX_STACK_PAGES - MIN_STACK_PAGES; j++)
                        *(addr - ((j * pg_size) + 1)) = 'A';

You will not be able to do a similar out-of-range write without a SEGV with regular mmap()'ed areas. This holds for most architectures; MAP_GROWSUP would have to be used on architectures where the stack grows upward.

Also, I don't think there is a specific limit on the number of per-process mmap()'ed regions with the MAP_GROWSDOWN property; it is rather a generic limitation which can easily be avoided, as it is a kernel tunable:

$ sysctl vm.max_map_count
vm.max_map_count = 65530
$ 

@shintaro-iwasaki
Collaborator

Thank you for your comments!

I agree with you that the cost of parsing "/proc/self/maps" upon each mmap()'ed stack allocation is too heavy, and this is why I already wrote above that the best approach would be to maintain an internal view of it.

I do not think find_hole() is necessary since mmap() with MAP_GROWSDOWN will automatically find a proper place for the memory region. If really necessary, we can have an internal data structure as you indicated in your comment.

/proc/self/maps would be useful to check whether a stack has grown or not. To avoid calling mmap() whenever you create a ULT, as you mentioned, introducing a stack pool would be promising. However, we do not want to keep grown stacks in the pool; otherwise, the stack pool will eventually be filled up by fully grown stacks. /proc/self/maps would be useful to detect such stacks. The cost issue was discussed in the previous comment; my conclusion was "not beneficial for this use case". (Note that if we do not need such a pool, the implementation is very easy.) Do you have any opinion about this?

Again, if you want to benefit from the stack automatic growth (up to the process stack limit) mechanism provided by the Kernel, you need to use MAP_GROWSDOWN!!

I understand it works, but I would like to understand the benefit of MAP_GROWSDOWN in more detail.

In my understanding, MAP_GROWSDOWN gives 1 or 2MB of memory ("the process stack limit" as you mentioned). The virtual address range grows automatically, while the physical memory consumption is the same as with a normal mmap(), since even a normal mmap() does not physically allocate all the memory when a virtual address range is assigned. Consuming virtual memory addresses does not cause SEGV if vm.overcommit_memory != 2, which is the default Linux configuration in many cases.

I don't think there is a specific limit on the number of per-process mmap()'ed regions with the MAP_GROWSDOWN property; it is rather a generic limitation which can easily be avoided, as it is a kernel tunable.

It might be a tunable parameter, but as I told you, a 2MB normal mmap() can create 100K stacks while a 2MB MAP_GROWSDOWN mmap() cannot create more than 65K stacks on Ubuntu 20.04, Red Hat 7.6, and openSUSE 15.2 for some reason, so I am not sure if MAP_GROWSDOWN is limited solely by vm.max_map_count. In any case, a 2MB normal mmap() works well on platforms that use the default configuration as far as I tried (vm.overcommit_memory != 2, and a 65K limit for MAP_GROWSDOWN mmap() for some reason). I would prefer a simpler and more portable solution (by default) since a normal mmap() does not need find_hole() etc.


No matter whether we use MAP_GROWSDOWN or not, performance will be the one I roughly estimated in my previous comment (Possible direction). Do you have any preference over those directions? Otherwise, I would really appreciate it if you could give me any suggestions regarding design and implementation or performance requirements for your use case.

@bfaccini
Contributor Author

The find_hole() code is necessary to ensure that the kernel will let the stack grow from min_stack_size to max_stack_size. And there is no special placement treatment for MAP_GROWSDOWN during mmap().

I agree with you that detecting/handling stacks that have grown is problematic if we want to munmap() them. But it may be OK to safely use the stack's lowest possible address and its maximum length as munmap() parameters, first because munmap() will not complain/fail due to unmapped parts in this range, and secondly because the kernel would not have allowed another VMA to be allocated within the stack_guard_gap interval from the top of this stack!

I think your understanding of the kernel behaviour with MAP_GROWSDOWN regions is not exact, and you may also not have fully understood what my code snippet tries to demonstrate. If find_hole() tries to find an unmapped area of stack_guard_gap + max_stack_size length and then mmap()s only a min_stack_size length in the highest address range of this found area, this is to ensure that the kernel will then allow it to grow up to max_stack_size upon need/access. So the stack area/VMA starts its life with a minimal virtual address range, which the kernel will then grow only if needed during page-fault handling.

vm.max_map_count is the tunable that limits the number of VMAs/regions which can make up the virtual address space of a process. So if you set it to one million, you will be able to create one million mmap()'ed regions, even with the MAP_GROWSDOWN property! I have checked this on CentOS 7.8. And I have no doubt it is the same on Ubuntu and SUSE platforms, as it is a kernel thing.

@shintaro-iwasaki
Collaborator

Thank you for correcting my wrong understanding. Yes, as you explained, find_hole() seems necessary to extend a stack created with MAP_GROWSDOWN: mmap() + MAP_GROWSDOWN does not reserve extra room for stack growth by default. Thank you for checking vm.max_map_count, which should affect the number of memory regions that can be mmap()'ed with MAP_GROWSDOWN.


I would like to discuss two things: 1. how to allocate each stack (whether we use MAP_GROWSDOWN or not), and 2. the design of the memory pool.

1. How to allocate each stack

I would appreciate your idea regarding why we want to use MAP_GROWSDOWN. There are a few drawbacks.

  • One obvious concern is that other runtimes/libraries may use a memory region near a stack mmap()'ed with MAP_GROWSDOWN. A concurrent mmap() (by another runtime) could be problematic. In any case, if memory gets allocated on top of a stack mmap()'ed with MAP_GROWSDOWN, a stack access past the mapping causes a silent memory error since that stack would overlap with another mmap()'ed memory region.
  • Another concern is the portability of /proc/self/maps regarding its format and location (e.g., on Mac).
  • The default value of vm.max_map_count on some machines is a non-negligible concern: some platforms do not allow users to change that value (at least I cannot change this configuration on servers at Argonne). Note that I can create (and do a 1-byte write to) 10 million 1MB memory regions on the same machine that has vm.max_map_count=65300.

I believe we can use the memory overcommit mechanism (vm.overcommit_memory != 2) because it has several merits:

  • It does not need a complicated find_hole(), which makes things faster and more robust. There is no need to think about mmap() calls by other programs and libraries.
  • It does not need to read /proc/self/maps, so the mechanism does not rely on its format etc.
  • It is enabled by default on many platforms (as far as I tried), which is good for users who do not have permission to change the system settings.
  • Normal mmap does not seem affected by vm.max_map_count (on some platforms as far as I tried).

Is there any reason to use MAP_GROWSDOWN?

2. Memory pool

No matter whether we use MAP_GROWSDOWN or not, we can cache ULT stacks to avoid calling mmap(). The following applies to both cases with and without MAP_GROWSDOWN.

  1. First, do we need a ULT stack cache (say, a stack pool)?

    mmap() + munmap() cost is around 2 us as far as I checked. If 2 us is an acceptable overhead, we can call mmap() and munmap() every time when we create a ULT. This is easy: we can do this without changing the current Argobots implementation.

  2. If 2 us is unacceptably large, we would need a ULT stack cache (= ULT stack pool).

    The problem here is that we want to release a stack that has already been extended (= physical memory is associated). A simple idea is to free ULT stacks that have been used many times, since such a stack has likely already been extended by one of the ULTs that used it. The implementation is simple: add a counter to each ULT stack. The counter is incremented when the stack is assigned to a ULT. When a ULT is freed, its associated stack is pushed to the stack pool; when the counter reaches a certain threshold, however, that ULT stack is freed instead of being pushed to the pool. Basically, if we set the threshold to M, the allocation overhead becomes 2 us / M (+ some constant overhead to manage the ULT stack pool). A larger M causes a larger memory footprint.
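A minimal sketch of that counter idea (names are hypothetical and error handling is omitted; RECYCLE_THRESHOLD plays the role of M):

#include <stdlib.h>
#include <sys/mman.h>

#define STACK_SIZE (2 * 1024 * 1024)
#define RECYCLE_THRESHOLD 64          /* "M": unmap a stack after M uses */

typedef struct stack_entry {
    void *stack;                      /* the mmap()'ed stack region */
    int use_count;                    /* how many ULTs have used this stack */
    struct stack_entry *next;
} stack_entry_t;

static stack_entry_t *g_pool = NULL;  /* single-threaded sketch; a real pool
                                       * would be cached per execution stream */

stack_entry_t *stack_acquire(void)
{
    stack_entry_t *e = g_pool;
    if (e != NULL) {
        g_pool = e->next;
    } else {
        e = malloc(sizeof(*e));
        e->stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
        e->use_count = 0;
    }
    e->use_count++;
    return e;
}

void stack_release(stack_entry_t *e)
{
    if (e->use_count >= RECYCLE_THRESHOLD) {
        /* This stack has been reused M times and may have grown, so unmap
         * it to return any accumulated physical pages to the OS. */
        munmap(e->stack, STACK_SIZE);
        free(e);
    } else {
        e->next = g_pool;             /* not yet "old": keep it in the pool */
        g_pool = e;
    }
}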

I would like to know your use case in detail so that I can understand which implementation is suitable for your workloads and start working on this issue. Any better idea is welcome. I would truly appreciate it if you could correct my understanding.

@bfaccini
Contributor Author

Concerning how to ensure that no other mmap()/VMA will be placed in the hole between the current top of the stack and the previous region at stack allocation time, this is again where the find_hole() algorithm, by taking stack_guard_gap into account, anticipates the kernel's placement rules for new areas.

I assume each platform provides an equivalent (why not an API!) of "/proc/self/maps". Linux distros should have it as they all use the same kernel.

The vm.max_map_count limitation would have to be well documented. If required (as it would be for our use case), it will need to be changed by an administrator, as happens for applications demanding huge resources...

To be honest, I really don't understand why you keep referring to the vm.overcommit_memory tunable for this issue. Well, since it impacts the whole process's virtual memory allocation, we may have to take care of it if the total memory footprint of the mmap()'ed stacks becomes too large compared to the real memory size. But it will not help with the stack allocation+placement and further-growth problem we are trying to solve here.

BTW, can you describe in more detail why you think vm.max_map_count has no effect on some platforms?

Again, MAP_GROWSDOWN helps the kernel know how a region will be allowed to grow, and also how to best place new regions/mmap()s to avoid collisions.

About the need for a stack pool, I believe we could first try without one, since it could be added later; what do you think?

@shintaro-iwasaki
Collaborator

Thank you for your comments. It seems that a pool is not necessary for now, so I will stop talking about a stack pool.

vm.overcommit_memory is not tunable.

vm.max_map_count is not tunable on our machine, either. My point is that, by default, vm.overcommit_memory is enabled and vm.max_map_count is 65K on many platforms, so we can reasonably rely on this behavior.

Since it impacts the whole process's virtual memory allocation, we may have to take care of it if the total memory footprint of the mmap()'ed stacks becomes too large compared to the real memory size.

Do we need to keep track of the footprint of virtual memory addresses? Of course, it would be great if the footprint of virtual memory addresses is almost the same as the footprint of real memory, but does this justify adding a complicated find_hole() management that assumes non-standardized Linux or Unix behaviors?

It will not help for the stack allocation+placement and further growth problem we are trying to solve here.

If we don't care about the footprint of virtual memory addresses, we can just allocate 2MB (or any size) of stack to each ULT. You can create millions of ULTs with 2MB stacks if vm.overcommit_memory is enabled.

BTW, can you describe in more detail why you think vm.max_map_count has no effect on some platforms?

I wanted to point out that vm.max_map_count does not limit the normal mmap() on some platforms. There is no better explanation for this.

MAP_GROWSDOWN helps the Kernel to know how a region will be allowed to grow,

Yes, the kernel will know that the footprint of virtual memory addresses is the same as the footprint of real memory.

[MAP_GROWSDOWN helps the Kernel to know] how to best place new regions/mmap()s to avoid collisions

Does the kernel know how to best place new regions/mmap()s to avoid collisions? For example, if MAP_GROWSDOWN memory is 0x01003000 - 0x01004000, is there any guarantee that a kernel avoids allocating a memory that contains 0x01002fff (or any memory close to that) by other mmap?


In any case, at this point I do not want to implement this find_hole() inside Argobots because its maintenance cost across many architectures and OSes and for thread safety (corner cases) is too high compared with the merits of MAP_GROWSDOWN. For now, a program can implement mmap()+MAP_GROWSDOWN stacks without changing Argobots. Since mmap(), munmap() and find_hole() are heavy, the merit of implementing this mechanism inside Argobots is very small.

To test its behavior, the following should work.

#include <abt.h>
#include <alloca.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/mman.h>

#define DEFAULT_STACK_SIZE (4096 * 2)

ABT_key g_stack_key;
/* create_thread() is called in parallel, use a spinlock to protect g_stack_attr
 * or create a copy for each thread. */
ABT_thread_attr g_stack_attr;
int g_measure_time = 0;

int init_stack_key();
void finalize_stack_key();
void *allocate_stack();
void free_stack(void *mem);

void create_thread(ABT_pool pool, void (*f)(void *), void *arg,
                   ABT_thread *newthread);
void free_thread(ABT_thread *thread);
void thread_func(void *arg);

int main(int argc, char *argv[])
{
    if (argc != 3) {
        printf("Usage: ./a.out NUM_THREADS MEASURE_TIME?\n");
        printf("Ex:    ./a.out 32 0\n");
        return -1;
    }
    int num_threads = atoi(argv[1]);
    g_measure_time = atoi(argv[2]);
    int num_repeats = g_measure_time ? (100000 / num_threads + 2) : 1;

    /* Initialization */
    ABT_init(0, NULL);
    ABT_xstream xstream;
    ABT_pool pool;
    ABT_xstream_self(&xstream);
    ABT_xstream_get_main_pools(xstream, 1, &pool);
    init_stack_key();
    ABT_thread *threads = malloc(sizeof(ABT_thread) * num_threads);

    double start_time, end_time;

    /* Fork and join threads. */
    for (int step = 0; step < num_repeats; step++) {
        if (step == num_repeats / 2)
            start_time = ABT_get_wtime();
        for (int i = 0; i < num_threads; i++) {
            create_thread(pool, thread_func, NULL, &threads[i]);
        }
        for (int i = 0; i < num_threads; i++) {
            ABT_thread_free(&threads[i]);
        }
    }

    /* Print execution time */
    end_time = ABT_get_wtime();
    if (g_measure_time) {
        double t = (end_time - start_time) / num_threads /
                   (num_repeats - num_repeats / 2) * 1.0e6;
        printf("Fork-join per thread: %.2f[us]\n", t);
    }

    /* Finalization */
    free(threads);
    finalize_stack_key();
    ABT_finalize();
    return 0;
}

void deep_func(int depth, char *root_stack)
{
    /* Use a stack region to see if the stack is extended. */
    volatile char *mem = (char *)alloca(sizeof(char) * 512);
    mem[511] = 1;
    if (depth == 0) {
        printf("Consumed %d bytes of stacks\n", (int)(root_stack - mem));
    } else {
        deep_func(depth - 1, root_stack);
    }
}

void thread_func(void *arg)
{
    if (g_measure_time == 0) {
        deep_func(128, (char *)alloca(sizeof(char) * 128));
    }
}

/* ************************************************************************** */

int init_stack_key()
{
    ABT_key_create(free_stack, &g_stack_key);
    ABT_thread_attr_create(&g_stack_attr);
    return 0;
}

void finalize_stack_key()
{
    ABT_thread_attr_free(&g_stack_attr);
    ABT_key_free(&g_stack_key);
}

void create_thread(ABT_pool pool, void (*f)(void *), void *arg,
                   ABT_thread *newthread)
{
    void *newstack = allocate_stack();
    ABT_thread_attr_set_stack(g_stack_attr, newstack, DEFAULT_STACK_SIZE);
    ABT_thread_create(pool, f, arg, g_stack_attr, newthread);
    ABT_thread_set_specific(*newthread, g_stack_key, newstack);
}

/* ************************************************************************** */

#define PROC_SELF_MAPS "/proc/self/maps"
#define LINE_LEN 1024
#define STACK_GUARD_GAP_SIZE (4096 * 256)
#define MAX_STACK_SIZE (4096 * 128)
char *find_hole(size_t length)
{
    FILE *fp;
    char line[LINE_LEN];
    char *addr = NULL, *startaddr, *endaddr;

    fp = fopen(PROC_SELF_MAPS, "r");
    if (fp == NULL) {
        return NULL;
    }

    /* all VMAs start/end addresses are already rounded to page size */
    while (fgets(line, LINE_LEN, fp) != NULL) {
        if (sscanf(line, "%p-%p", &startaddr, &endaddr) == 2) {
            if (startaddr > addr)
                if (startaddr - addr >= length)
                    break;
            if (endaddr > addr)
                addr = endaddr;
        }
    }

    /* check if last VMA has been reached but there is still not
     * enough room
     */
    if (UINTPTR_MAX - (uintptr_t)addr < length) {
        addr = MAP_FAILED;
        fprintf(stderr, "no hole found\n");
    }

    fclose(fp);

    return addr;
}

void *allocate_stack()
{
    char *addr_hint = find_hole(MAX_STACK_SIZE + STACK_GUARD_GAP_SIZE);
    if (addr_hint == NULL || addr_hint == MAP_FAILED)
        return NULL;
    /* Map only the minimal size at the top of the hole so the stack can
     * grow downward up to MAX_STACK_SIZE. */
    return mmap(addr_hint + STACK_GUARD_GAP_SIZE + MAX_STACK_SIZE -
                    DEFAULT_STACK_SIZE,
                DEFAULT_STACK_SIZE, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK | MAP_GROWSDOWN, -1, 0);
}

void free_stack(void *mem)
{
    /* Unmap from the lowest possible address with the maximum length so that
     * any pages the kernel added by downward growth are released as well
     * (munmap() does not fail on unmapped parts of the range). */
    munmap((char *)mem + DEFAULT_STACK_SIZE - MAX_STACK_SIZE, MAX_STACK_SIZE);
}

The initialization part might look complicated, but apart from that, the program can simply replace ABT_thread_create() with create_thread(). find_hole() gets slower as we allocate more stacks, since the fgets() time is proportional to the size of /proc/self/maps. The program might want to add a stack cache/pool or a VMA map if necessary.

@bfaccini
Contributor Author

bfaccini commented Jan 26, 2021

vm.overcommit_memory is not tunable

I don't think I ever wrote that "vm.overcommit_memory is not tunable", did I? To me it is also a tunable, like vm.max_map_count, even if both can only be modified by root!

but does this justify adding a complicated find_hole() management that assumes non-standardized Linux or Unix behaviors?

I agree with you that the cost of find_hole() may be too high, mainly due to the parsing of /proc/self/maps..., and that it will rise as the number of mmap()'ed stacks grows. Solving/optimizing this appears tricky: even if we want to internally manage/update a view of the current mappings, we will need to implement a fast algorithm like the one used by the kernel (red-black tree)...

If we don't care about the footprint of virtual memory addresses, we can just allocate 2MB (or any size) of stack to each ULT.

mmap()'ing stacks of a fixed size, bigger than the present one, could be acceptable for our use case. And by the way, sorry to insist, but specifying MAP_GROWSDOWN ensures that stack overrun scenarios will at least be detected immediately (because of stack_guard_gap) and possibly (if no extra VMA/mapping is allocated on top of LowerStackAddress+stack_guard_gap) also allows for some additional growth if needed.

I wanted to point out that vm.max_map_count does not limit the normal mmap() on some platforms. There is no better explanation for this.

Again, sorry to insist, but it does, since that is its only purpose. My guess is that, during your test with normal mmap(), you triggered the scenario where multiple adjacent VMAs/mappings with the same flags/properties were merged into a single one by the kernel, so you did not reach the vm.max_map_count limit.

For example, if MAP_GROWSDOWN memory is 0x01003000 - 0x01004000, is there any guarantee that a kernel avoids allocating a memory that contains 0x01002fff (or any memory close to that) by other mmap?

Yes, this is guaranteed by the kernel, which will additionally reserve stack_guard_gap pages on top (at lower addresses) of the area mmap()'ed with the MAP_GROWSDOWN property, with the primary goal of avoiding stack overruns for security purposes.

Also, thanks for your new code snippet, which is definitely a good example and starting point on how to use mmap()'ed stacks allocated externally to Argobots!!

@shintaro-iwasaki
Collaborator

@bfaccini Thank you for your comments!

I agree with you that the cost of find_hole() may be too high, mainly due to the parsing of /proc/self/maps..., and that it will rise as the number of mmap()'ed stacks grows. Solving/optimizing this appears tricky: even if we want to internally manage/update a view of the current mappings, we will need to implement a fast algorithm like the one used by the kernel (red-black tree)

Argobots is not the only component that allocates memory, so it would be challenging to keep the mapping updated. For example, the internal map could assume that 0x01003000 - 0x01004000 is promising because 0x00983000 - 0x01003000 is unused according to the internal map, while 0x01000000 - 0x01001000 might actually have been allocated by another library (e.g., glibc). Using 0x01003000 - 0x01004000 as a stack for a ULT while 0x01000000 - 0x01001000 is used by something else can cause a weird memory error even though that ULT uses only a few kilobytes of stack (it depends on how close the other memory region is).

Yes, this is guaranteed by the Kernel, which puts stack_guard_gap pages in addition on top (lower addresses) of the mmap()'ed area with the MAP_GROWSDOWN property, with the primary goal of avoiding stack overruns for security purposes.

I am not fully sure if the Kernel really guarantees it. At least, mmap() + MAP_GROWSDOWN does not consider its lower addresses when it decides its virtual address (and that's why find_hole() is necessary in my understanding). Specifically, my concerns are as follows:

  1. The kernel might not guarantee that mmap() never assigns an address that is "a little bit" lower than 0x01003000 (say 0x01000000 - 0x01001000, which would limit the growth of the 0x01003000 - 0x01004000 stack to at most a few kilobytes even though it does not contain 0x01002fff).

  2. Even if the kernel has a policy that it reserves (0x01003000 - BIG_CONSTANT_SIZE (e.g., 4MB)) - 0x01003000 once 0x01003000 - 0x01004000 is allocated with MAP_GROWSDOWN, it does not mean the address returned by find_hole() can have a large hole when mmap() + MAP_GROWSDOWN is actually called since other libraries may call mmap() in parallel. I am not sure how it should be handled.
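One possible mitigation of the race described in point 2, sketched below under the assumption of Linux >= 4.17: instead of passing the hole address as a plain hint (which the kernel may silently ignore if another thread mapped that range in the meantime), request the exact address with MAP_FIXED_NOREPLACE, which fails with EEXIST instead of relocating or overwriting, and retry with a new hole on failure. This does not protect the guard gap below the stack from being filled later; it only avoids placing the initial mapping on top of a concurrent allocation. find_hole() and the size macros are the names from the snippet earlier in this thread, with fallback definitions added here only so the sketch stands alone.

/* Sketch only: retry-based stack allocation using MAP_FIXED_NOREPLACE
 * (Linux >= 4.17) to avoid silently landing on a range another thread
 * mapped between find_hole() and mmap(). */
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_FIXED_NOREPLACE
#define MAP_FIXED_NOREPLACE 0x100000 /* generic value from the Linux uapi headers */
#endif
#ifndef DEFAULT_STACK_SIZE
#define DEFAULT_STACK_SIZE (4096 * 2)
#endif
#ifndef MAX_STACK_SIZE
#define MAX_STACK_SIZE (4096 * 128)
#endif
#ifndef STACK_GUARD_GAP_SIZE
#define STACK_GUARD_GAP_SIZE (4096 * 256)
#endif

char *find_hole(size_t length); /* from the snippet earlier in this thread */

void *allocate_stack_noreplace(void)
{
    for (int retry = 0; retry < 8; retry++) {
        char *hole = find_hole(MAX_STACK_SIZE + STACK_GUARD_GAP_SIZE);
        if (hole == NULL || hole == (char *)MAP_FAILED)
            return NULL;
        char *addr =
            hole + STACK_GUARD_GAP_SIZE + MAX_STACK_SIZE - DEFAULT_STACK_SIZE;
        void *p = mmap(addr, DEFAULT_STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK | MAP_GROWSDOWN |
                           MAP_FIXED_NOREPLACE, -1, 0);
        if (p != MAP_FAILED) {
            /* Older kernels ignore the flag and may return another address;
             * treat that as a failure, as the mmap(2) man page recommends. */
            if (p == (void *)addr)
                return p;
            munmap(p, DEFAULT_STACK_SIZE);
            return NULL;
        }
        if (errno != EEXIST)
            return NULL; /* real error, not just a lost race */
        /* The range was taken between find_hole() and mmap(); look again. */
    }
    return NULL;
}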

mmap()'ing fixed-size stacks, bigger than the present ones, could be acceptable for our use case. And by the way, sorry to insist, but specifying MAP_GROWSDOWN ensures that stack overrun scenarios will at least be detected immediately (because of stack_guard_gap) and possibly (if no extra VMA/mapping is allocated on top of LowerStackAddress+stack_guard_gap) also allows for some additional growth if needed.

I still believe mmap()'ing fixed but bigger stacks is simple and practical, although it has its downsides (1. vm.overcommit_memory must be enabled, 2. the consumption of virtual addresses does not reflect the physical memory usage, and 3. the stack size is fixed, so unlike MAP_GROWSDOWN there is no lucky case where the stack can grow significantly).

Regarding SEGV, we can use mprotect() to guard each ULT's stack (which needs at least one extra page). Argobots can implement this feature if needed. If this feature is the main reason why you want to use MAP_GROWSDOWN, we are happy to implement this stack guard feature in Argobots right away (I plan to implement a stack canary option too).
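For reference, a minimal sketch of that mprotect() guard-page idea, assuming Linux/POSIX and an externally allocated stack (the helper names are illustrative, not an existing Argobots API): one extra page below the usable stack is made inaccessible, so an overrun faults immediately instead of silently corrupting a neighboring region.

/* Sketch: allocate a ULT stack with an inaccessible guard page at its lowest
 * address (the stack grows downward, so an overrun hits the guard first). */
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

void *allocate_guarded_stack(size_t stack_size)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t total = stack_size + page; /* usable stack + one guard page */
    char *mem = mmap(NULL, total, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return NULL;
    if (mprotect(mem, page, PROT_NONE) != 0) { /* guard the lowest page */
        munmap(mem, total);
        return NULL;
    }
    /* Pass this pointer and stack_size to ABT_thread_attr_set_stack(). */
    return mem + page;
}

void free_guarded_stack(void *stack, size_t stack_size)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    munmap((char *)stack - page, stack_size + page);
}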

As you pointed out, extra stack growth larger than a specified size is not allowed if we mmap() a fixed large stack without MAP_GROWSDOWN. If this possibility of extra stack growth beyond a predefined large stack size is important, MAP_GROWSDOWN is the only option.

In any case, all the concerns above and those I mentioned previously assume a general case, so they might not apply to your workload (e.g., only Argobots uses mmap() after a certain time; a specific OS is assumed (e.g., Linux, not macOS) so the format of /proc/self/maps is the same; the developer knows which virtual addresses the kernel and the other runtimes/libraries use by default, so a certain virtual address region (e.g., 0x10000000 - 0x30000000) can be used exclusively by Argobots; ...). I just hesitate to implement this mechanism inside Argobots. I understand that using MAP_GROWSDOWN has its unique merits. If the current find_hole() + ABT_thread_set_specific() code works, I will add this as an example so that the CI can test this behavior.

@bfaccini
Contributor Author

You are right when you say that the extra gap used by find_hole(), which stands on top of the lowest stack address, is not guaranteed to be kept free by the Kernel, but this will likely hold if all stacks can be allocated within the same address range, located apart from other allocation areas, and if this gap is only a few pages.

But anyway, as this will be OK for our use case, let's stick to an mmap()'ed stack size bigger than the current default ULT stack size (1 or 2 MB); and if it is done with MAP_GROWSDOWN, you will not need to implement any stack guard mechanism since that is already done by the Kernel (using stack_guard_gap).

Regarding the stack canary method, I don't think it can detect a stack overrun as effectively as the stack guard method can prevent one.

@shintaro-iwasaki
Collaborator

Thanks. If you don't need extra stack growth (i.e., no find_hole()), the following change should work, where DEFAULT_STACK_SIZE can be a few megabytes:

void *allocate_stack()
{
    return mmap(NULL, DEFAULT_STACK_SIZE, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK | MAP_GROWSDOWN, -1, 0);
}

I believe the easiest way to use such a stack is copying the code I wrote in the previous comment with the change above. Further optimizations would need some changes in Argobots, e.g., a new memory allocation and management mechanism inside Argobots. Creating a pool should be the most straightforward optimization, but there is always a performance vs memory footprint trade-off. If you encounter any performance issues, please let me know with details about your workloads.
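For illustration, a minimal sketch of that pool idea, layered on the allocate_stack()/free_stack() helpers from this thread: stacks returned by finished ULTs are kept on a small free list and handed out again instead of being munmap()'ed. The cap and the lack of locking are illustrative simplifications; a real pool would need a lock or per-ES lists, and the cap is exactly the performance vs. memory footprint trade-off mentioned above.

/* Sketch: a tiny single-threaded cache of mmap()'ed stacks.  allocate_stack()
 * and free_stack() are the helpers from the snippets in this thread. */
#define MAX_CACHED_STACKS 128

void *allocate_stack(void);
void free_stack(void *mem);

static void *g_stack_cache[MAX_CACHED_STACKS];
static int g_num_cached = 0;

void *allocate_stack_cached(void)
{
    if (g_num_cached > 0)
        return g_stack_cache[--g_num_cached]; /* reuse a cached stack */
    return allocate_stack(); /* cache empty: fall back to mmap() */
}

void free_stack_cached(void *mem)
{
    if (g_num_cached < MAX_CACHED_STACKS) {
        g_stack_cache[g_num_cached++] = mem; /* keep it for the next ULT */
        return;
    }
    free_stack(mem); /* cache full: return the mapping to the OS */
}

To plug this into the create_thread() example, free_stack_cached() would be registered as the ABT_key destructor instead of free_stack(), and allocate_stack_cached() would replace allocate_stack().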

(Stack canary was for #294, so please never mind.)

@bfaccini
Contributor Author

Hi @shintaro-iwasaki ,
Humm, I am not sure I understand your last answer. Are you saying that we can just use mmap() as the stack allocation mechanism for ULT stacks, externally from Argobots?

@shintaro-iwasaki
Collaborator

Humm, I am not sure I understand your last answer. Are you saying that we can just use mmap() as the stack allocation mechanism for ULT stacks, externally from Argobots?

Yes, that's what I meant. That's why Argobots allows users to set a stack. mmap() is heavy (a few microseconds while the current ABT_thread_create() + ABT_THREAD_ATTR_NULL costs 0.1 microseconds or even less), so the performance benefit is little if Argobots just calls mmap() and munmap() internally every time a ULT is created and freed.

It does not mean we do not want to integrate it (or any advanced one) into Argobots. Our long discussion was basically about MAP_GROWSDOWN, so I could not figure out the overall design (e.g., functionality, performance, memory footprint...). If you encounter any performance issues, please let me know with details about your workloads so that we can work on this issue. If the strategy above works well for your workloads, we are happy to implement it too (basically create a macro branch to replace malloc and free in https://github.com/pmodels/argobots/blob/main/src/include/abti_mem.h).

@bfaccini
Contributor Author

bfaccini commented Feb 1, 2021

Well, this is a bit disappointing, but maybe at least this issue has helped you see the need for some enhancement around ULT stack allocation/management/protection/...

@shintaro-iwasaki
Collaborator

Thank you. We understand your concerns regarding ULT stacks overall, though your requirements are still unclear to me. As ULT stack usage highly depends on each use case, we need more specific input from users than "some enhancement around the ULT stacks allocation/...". There is always a trade-off between performance and memory footprint.

For example, if a few microseconds per ULT is acceptable, using mmap() and munmap() every time we fork and join ULTs is a reasonable option (this would be another --disable-mem-pool implementation, which currently uses malloc() and free()). If this overhead is too large, we might want a pool to cache allocated stacks (like --enable-mem-pool, which is enabled by default). This stack pool has numerous design options. The total number of ULTs in a cache can affect the total footprint; currently Argobots does not free ULT stacks during its execution by default, so the memory consumption is that of the peak during its execution. If we need to cap the total number of cached stacks in the system, this needs an additional mechanism (e.g., allocation of stacks in bulk should be disabled). A local cache (i.e., a per-ES cache) might improve performance in some cases, while it makes this management more complicated. If we don't want to keep "already grown stacks" for a long time (e.g., keeping them in a cache adds memory pressure if the whole stack memory has been mapped to physical memory), we might want to free stacks periodically, but that certainly adds some overhead. If you just need a dynamic stack protection mechanism over the current memory pool implementation, using mprotect() is easier. Even if we need to add a new feature, mprotect() might be faster than mmap() + MAP_GROWSDOWN in some cases, but there are of course pros and cons. Unfortunately, we do not have enough manpower to implement each option as a prototype.
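On the point about cached stacks keeping their physical pages: one possible mitigation, sketched below under the assumption of private anonymous stack mappings on Linux, is to madvise(MADV_DONTNEED) a stack when it is put back into the cache, so the kernel can drop its physical pages while the virtual range stays reserved (MADV_FREE, Linux 4.5+, is a lazier alternative). The helper name is illustrative.

/* Sketch: drop the physical pages of a cached, possibly-grown stack while
 * keeping its virtual address range mapped; the pages are zero-filled again
 * on the next touch.  Best-effort: the return value is deliberately ignored. */
#include <stddef.h>
#include <sys/mman.h>

static void shrink_cached_stack(void *stack_lowest_addr, size_t stack_size)
{
    (void)madvise(stack_lowest_addr, stack_size, MADV_DONTNEED);
}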

We are happy to discuss with users and work on it if you have any ideas about the requirements. At this point, my understanding is that 1. your program does not need "extra stack growth" of MAP_GROWSDOWN (and thus no find_hole()), 2. it can rely on virtual memory overcommit (vm.overcommit_memory), 3. the maximum stack consumption will be a few megabytes, and 4. "stack guard" is preferred (by MAP_GROWSDOWN or mprotect()). Specifically, the following information would be highly helpful to design a memory allocator:

  1. Acceptable overheads per ULT create/free (e.g., 0.3 us)
  2. The maximum number of ULTs at a certain point (e.g., 500K ULTs)
  3. Acceptable physical memory consumption (e.g., 16GB in total for ULT stacks)
  4. Average stack consumption (e.g., 16 KB)
  5. Rate of ULTs that need large stacks (e.g., 1% ULTs need 2MB stacks while the others use only 16KB or so).

We would appreciate any comments from you. Of course, the custom stack allocation mechanism I showed as an example should be a good way to tailor stack allocation to your applications. In any case, I am sorry for your disappointment.

@bfaccini
Contributor Author

Hello Shintaro,
Since ABT_thread_set_specific() requires an ABT_thread parameter, the ULT must be a named one. But then its handle must be explicitly freed using ABT_thread_free(). So, what would be the best way to automate this mechanism upon ULT exit? Is it safe to also use an ABT_key for this?

@shintaro-iwasaki
Collaborator

Thank you for trying the code!

ABT_key etc works for an unnamed ULT. The destructor will be properly called (and thus no memory leak). ABT_key_set() does not take an ABT_thread parameter, so this can be used for an unnamed ULT. A wrapper trick is needed to call ABT_key_set() in a ULT. The following is an example implementation that supports an unnamed ULT as well as a named ULT.

typedef struct {
    void *stack_to_be_freed;
    void (*thread_f)(void *);
    void *thread_arg;
    int need_free;
} unnamed_thread_wrapper_arg_t;

static void unnamed_thread_wrapper(void *arg)
{
    unnamed_thread_wrapper_arg_t *p_wrapper_arg =
        (unnamed_thread_wrapper_arg_t *)arg;
    ABT_key_set(g_stack_key, p_wrapper_arg->stack_to_be_freed);
    p_wrapper_arg->thread_f(p_wrapper_arg->thread_arg);
    if (p_wrapper_arg->need_free)
        free(p_wrapper_arg);
}

void create_thread(ABT_pool pool, void (*f)(void *), void *arg,
                   ABT_thread *newthread)
{
    if (newthread) {
        /* Named thread. */
        void *newstack = allocate_stack();
        ABT_thread_attr_set_stack(g_stack_attr, newstack, DEFAULT_STACK_SIZE);
        ABT_thread_create(pool, f, arg, g_stack_attr, newthread);
        ABT_thread_set_specific(*newthread, g_stack_key, newstack);
    } else {
        /* Unnamed thread. */
        void *newstack = allocate_stack();
        const int is_naive = 0; /* Unoptimized version? */
        if (is_naive) {
            /* wrapper_arg will be freed by unnamed_thread_wrapper(). */
            unnamed_thread_wrapper_arg_t *wrapper_arg =
                (unnamed_thread_wrapper_arg_t *)malloc(
                    sizeof(unnamed_thread_wrapper_arg_t));
            wrapper_arg->stack_to_be_freed = newstack;
            wrapper_arg->thread_f = f;
            wrapper_arg->thread_arg = arg;
            wrapper_arg->need_free = 1;
            ABT_thread_attr_set_stack(g_stack_attr, newstack,
                                      DEFAULT_STACK_SIZE);
            ABT_thread_create(pool, unnamed_thread_wrapper, wrapper_arg,
                              g_stack_attr, NULL);
        } else {
            /* Allocate wrapper_arg using a stack part (default stack size will
             * be reduced a bit).  malloc() and free() are not needed. */
            const size_t unnamed_thread_wrapper_arg_size =
                ((sizeof(unnamed_thread_wrapper_arg_t) + 64 - 1) / 64 * 64);
            const size_t new_stack_size =
                DEFAULT_STACK_SIZE - unnamed_thread_wrapper_arg_size;
            unnamed_thread_wrapper_arg_t *wrapper_arg =
                (unnamed_thread_wrapper_arg_t
                     *)((char *)newstack + DEFAULT_STACK_SIZE -
                        unnamed_thread_wrapper_arg_size);
            wrapper_arg->stack_to_be_freed = newstack;
            wrapper_arg->thread_f = f;
            wrapper_arg->thread_arg = arg;
            wrapper_arg->need_free = 0;
            ABT_thread_attr_set_stack(g_stack_attr, newstack, new_stack_size);
            ABT_thread_create(pool, unnamed_thread_wrapper, wrapper_arg,
                              g_stack_attr, NULL);
        }
    }
}

You can use this create_thread() to create an unnamed ULT just like ABT_thread_create(). The following is a complete test program that uses create_thread() to create both named and unnamed ULTs. No memory leak as far as I checked with valgrind.

Test program (collapsed)

#include <abt.h>
#include <alloca.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define DEFAULT_STACK_SIZE (4096 * 2)

ABT_key g_stack_key;
/* If create_thread() is called in parallel, use a spinlock to protect
 * g_stack_attr or create a copy for each thread/ES to avoid contention. */
ABT_thread_attr g_stack_attr;
int g_measure_time = 0;

int init_stack_key();
void finalize_stack_key();
void *allocate_stack();
void free_stack(void *mem);

void create_thread(ABT_pool pool, void (*f)(void *), void *arg,
                   ABT_thread *newthread);
void free_thread(ABT_thread *thread);
void thread_func(void *arg);

int main(int argc, char *argv[])
{
    if (argc != 3) {
        printf("Usage: ./a.out NUM_THREADS MEASURE_TIME?\n");
        printf("Ex:    ./a.out 32 0\n");
        return -1;
    }
    int num_threads = atoi(argv[1]);
    g_measure_time = atoi(argv[2]);
    int num_repeats = g_measure_time ? (100000 / num_threads + 2) : 1;

    /* Initialization */
    ABT_init(0, NULL);
    ABT_xstream xstream;
    ABT_pool pool;
    ABT_xstream_self(&xstream);
    ABT_xstream_get_main_pools(xstream, 1, &pool);
    init_stack_key();
    ABT_thread *threads = malloc(sizeof(ABT_thread) * num_threads);

    double start_time, end_time;

    /* Fork and join threads. */
    for (int step = 0; step < num_repeats; step++) {
        if (step == num_repeats / 2)
            start_time = ABT_get_wtime();
        /* Named threads */
        for (int i = 0; i < num_threads; i++) {
            create_thread(pool, thread_func, NULL, &threads[i]);
        }
        /* Unnamed threads */
        for (int i = 0; i < num_threads; i++) {
            create_thread(pool, thread_func, NULL, NULL);
        }
        for (int i = 0; i < num_threads; i++) {
            ABT_thread_free(&threads[i]);
        }
    }

    /* Print execution time */
    end_time = ABT_get_wtime();
    if (g_measure_time) {
        double t = (end_time - start_time) / num_threads /
                   (num_repeats - num_repeats / 2) * 1.0e6;
        printf("Fork-join per thread: %.2f[us]\n", t);
    }

    /* Finalization */
    free(threads);
    finalize_stack_key();
    ABT_finalize();
    return 0;
}

void deep_func(int depth, char *root_stack)
{
    /* Use a stack region to see if the stack is extended. */
    volatile char *mem = (char *)alloca(sizeof(char) * 512);
    mem[511] = 1;
    if (depth == 0) {
        printf("Consumed %d bytes of stacks\n", (int)(root_stack - mem));
    } else {
        deep_func(depth - 1, root_stack);
    }
}

void thread_func(void *arg)
{
    if (g_measure_time == 0) {
        deep_func(128, (char *)alloca(sizeof(char) * 128));
    }
}

/* ************************************************************************** */

int init_stack_key()
{
    ABT_key_create(free_stack, &g_stack_key);
    ABT_thread_attr_create(&g_stack_attr);
    return 0;
}

void finalize_stack_key()
{
    ABT_thread_attr_free(&g_stack_attr);
    ABT_key_free(&g_stack_key);
}

typedef struct {
    void *stack_to_be_freed;
    void (*thread_f)(void *);
    void *thread_arg;
    int need_free;
} unnamed_thread_wrapper_arg_t;

static void unnamed_thread_wrapper(void *arg)
{
    unnamed_thread_wrapper_arg_t *p_wrapper_arg =
        (unnamed_thread_wrapper_arg_t *)arg;
    ABT_key_set(g_stack_key, p_wrapper_arg->stack_to_be_freed);
    p_wrapper_arg->thread_f(p_wrapper_arg->thread_arg);
    if (p_wrapper_arg->need_free)
        free(p_wrapper_arg);
}

void create_thread(ABT_pool pool, void (*f)(void *), void *arg,
                   ABT_thread *newthread)
{
    if (newthread) {
        /* Named thread. */
        void *newstack = allocate_stack();
        ABT_thread_attr_set_stack(g_stack_attr, newstack, DEFAULT_STACK_SIZE);
        ABT_thread_create(pool, f, arg, g_stack_attr, newthread);
        ABT_thread_set_specific(*newthread, g_stack_key, newstack);
    } else {
        /* Unnamed thread. */
        void *newstack = allocate_stack();
        const int is_naive = 0; /* Unoptimized version? */
        if (is_naive) {
            /* wrapper_arg will be freed by unnamed_thread_wrapper(). */
            unnamed_thread_wrapper_arg_t *wrapper_arg =
                (unnamed_thread_wrapper_arg_t *)malloc(
                    sizeof(unnamed_thread_wrapper_arg_t));
            wrapper_arg->stack_to_be_freed = newstack;
            wrapper_arg->thread_f = f;
            wrapper_arg->thread_arg = arg;
            wrapper_arg->need_free = 1;
            ABT_thread_attr_set_stack(g_stack_attr, newstack,
                                      DEFAULT_STACK_SIZE);
            ABT_thread_create(pool, unnamed_thread_wrapper, wrapper_arg,
                              g_stack_attr, NULL);
        } else {
            /* Allocate wrapper_arg using a stack part (default stack size will
             * be reduced a bit).  malloc() and free() are not needed. */
            const size_t unnamed_thread_wrapper_arg_size =
                ((sizeof(unnamed_thread_wrapper_arg_t) + 64 - 1) / 64 * 64);
            const size_t new_stack_size =
                DEFAULT_STACK_SIZE - unnamed_thread_wrapper_arg_size;
            unnamed_thread_wrapper_arg_t *wrapper_arg =
                (unnamed_thread_wrapper_arg_t
                     *)((char *)newstack + DEFAULT_STACK_SIZE -
                        unnamed_thread_wrapper_arg_size);
            wrapper_arg->stack_to_be_freed = newstack;
            wrapper_arg->thread_f = f;
            wrapper_arg->thread_arg = arg;
            wrapper_arg->need_free = 0;
            ABT_thread_attr_set_stack(g_stack_attr, newstack, new_stack_size);
            ABT_thread_create(pool, unnamed_thread_wrapper, wrapper_arg,
                              g_stack_attr, NULL);
        }
    }
}

/* ************************************************************************** */

#define PROC_SELF_MAPS "/proc/self/maps"
#define LINE_LEN 1024
#define STACK_GUARD_GAP_SIZE (4096 * 256)
#define MAX_STACK_SIZE (4096 * 128)
char *find_hole(size_t length)
{
    FILE *fp;
    char line[LINE_LEN];
    char *addr = NULL, *startaddr, *endaddr;

    fp = fopen(PROC_SELF_MAPS, "r");
    if (fp == NULL) {
        return NULL;
    }

    /* all VMAs start/end addresses are already rounded to page size */
    while (fgets(line, LINE_LEN, fp) != NULL) {
        if (sscanf(line, "%p-%p", &startaddr, &endaddr) == 2) {
            if (startaddr > addr)
                if (startaddr - addr >= length)
                    break;
            if (endaddr > addr)
                addr = endaddr;
        }
    }

    /* check if last VMA has been reached but there is still not
     * enough room
     */
    if (UINTPTR_MAX - (uintptr_t)addr < length) {
        addr = MAP_FAILED;
        fprintf(stderr, "no hole found\n");
    }

    fclose(fp);

    return addr;
}

void *allocate_stack()
{
    char *addr_hint = find_hole(MAX_STACK_SIZE + STACK_GUARD_GAP_SIZE);
    /* The computed address is only a hint (no MAP_FIXED); the stack is placed
     * at the top of the hole so it can grow downward by up to MAX_STACK_SIZE. */
    return mmap(addr_hint + STACK_GUARD_GAP_SIZE + MAX_STACK_SIZE -
                    DEFAULT_STACK_SIZE,
                DEFAULT_STACK_SIZE, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK | MAP_GROWSDOWN, -1, 0);
}

void free_stack(void *mem)
{
    munmap(mem, DEFAULT_STACK_SIZE);
}

Misc:

  • ABT_self_set_specific() was added in Argobots 1.1b1 for consistent naming. It is the same as ABT_key_set() except for minor error handling details.
  • ABT_self_get_thread() (or ABT_thread_self()) can get the underlying ABT_thread even if the underlying ULT is unnamed. Freeing such an ABT_thread handle is not allowed, but the obtained ABT_thread handle itself is valid and can thus be used with ABT_thread_set_specific().
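For illustration, a minimal sketch of that second bullet: inside an unnamed ULT, the calling ULT's handle is obtained with ABT_self_get_thread() and the stack is attached to g_stack_key (from the test program above) via ABT_thread_set_specific(), instead of using ABT_key_set() in a wrapper. Passing the stack address as the ULT argument is an assumption made only for brevity; in the wrapper-based code above it travels in wrapper_arg instead.

/* Sketch: attach the stack to the key from inside an unnamed ULT. */
#include <abt.h>

extern ABT_key g_stack_key; /* created by init_stack_key() above */

static void thread_body(void *arg)
{
    void *my_stack = arg; /* illustrative: stack pointer passed at creation */
    ABT_thread self;
    ABT_self_get_thread(&self); /* valid even for an unnamed ULT */
    ABT_thread_set_specific(self, g_stack_key, my_stack);
    /* ... actual work; the handle must not be freed here, and the key
     * destructor frees the stack when the unnamed ULT completes. */
}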

@bfaccini
Contributor Author

Well, it seems to me that you did not directly answer my "Is it safe to also use an ABT_key for this (to free a named ABT_thread)?" question but instead tried to provide me with another way... So I presume the answer is no?!
OTOH, I understand that ABT_key_set() is the equivalent of ABT_thread_set_specific() but for the current/self ULT, which allows using an unnamed ULT and thus avoids the need to free a named ULT upon exit so as not to leak it.
But whether for named ULTs, with ABT_thread_set_specific(), or for unnamed ULTs, with a wrapping struct/function, neither makes it easy to implement external/custom stack management inside a real application, and both definitely introduce a new level of complexity.

@shintaro-iwasaki
Collaborator

I am sorry for missing some of your questions. I will answer all of your questions again.

Since ABT_thread_set_specific() requires an ABT_thread parameter, the ULT must be a named one.

This first assumption does not hold. ABT_self_get_thread() (or ABT_thread_self()) can get the underlying ABT_thread even if the underlying ULT is unnamed. Freeing such an ABT_thread handle is not allowed, but the obtained ABT_thread handle itself is valid and can thus be used with ABT_thread_set_specific(). Or, if you use a wrapper, you can use ABT_key_set().

But then its handle must be explicitly freed using ABT_thread_free().

A named ULT must be freed using ABT_thread_free(). This is correct.

So, what would be the best way to automate this mechanism upon ULT exit?

Using an unnamed ULT is my suggestion.

Is it safe to also use an ABT_key for this?

Sorry, I did not answer this question. ABT_key's destructor is called only when a ULT is freed, so it is not possible to free a named ULT from its own ABT_key destructor; until ABT_thread_free() is called, that destructor is never invoked.

Well, it seems to me that you did not directly answer my "Is it safe to also use an ABT_key for this (to free a named ABT_thread)?" question but instead tried to provide me with another way... So I presume the answer is no?!

So, the answer is no. The reason is explained above.

I understand that ABT_key_set() is the equivalent of ABT_thread_set_specific() but for the current/self ULT, which allows using an unnamed ULT and thus avoids the need to free a named ULT upon exit so as not to leak it.

Your understanding is correct. You don't need to use a named ULT since you can use ABT_key_set().

Whether for named ULTs, with ABT_thread_set_specific(), or for unnamed ULTs, with a wrapping struct/function, neither makes it easy to implement external/custom stack management inside a real application, and both definitely introduce a new level of complexity.

You need to call basically the following functions.

// Call it after ABT_init() once.
int init_stack_key();
// Call it before ABT_finalize().
void finalize_stack_key();
// Call this instead of ABT_thread_create().
// Unlike ABT_thread_create(), create_thread() currently does not take ABT_thread_attr.
void create_thread(ABT_pool pool, void (*f)(void *), void *arg, ABT_thread *newthread);

I believe this create_thread() solution is simple and customizable enough to try things with a real application and even deploy.

And again, we are happy to implement any stack enhancement that is wholly or partly integrated into Argobots, but as I explained, I need some information from you about your workloads to determine a stack allocation method and design a stack pool cache. It might be nice for you if I could implement and provide all the options in Argobots, but unfortunately the design space of "stack allocation" is too large for us to cover it all. Stack allocation design is not as simple as the stack dump or stack unwinding features you previously requested. We would truly appreciate information about your workload.
