ULT stack allocation method to address overrun scenario #274
I'm not familiar enough with these mechanisms to comment on @bfaccini 's proposal, but I am interested in this topic. FWIW we often use external libraries (communication libraries in particular, but there are others as well) that put significant demands on the stack. We explicitly set ABT_THREAD_STACKSIZE to a conservative value before starting Argobots, but we don't really know if we have selected a "good" value or not in any given system configuration. I would be happy to simply have a better idea of how close we came to the stack limits, but a dynamic option might also be helpful if the overhead isn't prohibitive. |
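(For what it's worth, one way to get "a better idea of how close we came to the stack limits" is a stack high-water-mark check. The following is a minimal sketch, not an Argobots feature; all names and sizes are illustrative. The idea is to pre-fill an externally provided ULT stack with a known byte pattern and scan for the first overwritten byte after the ULT has been joined.)

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define WATERMARK_STACK_SIZE (64 * 1024)
#define WATERMARK_FILL 0xAA

/* Allocate a stack pre-filled with a known pattern; it would then be passed
 * to the ULT via ABT_thread_attr_set_stack(). */
void *make_watermarked_stack(void)
{
    void *stack = malloc(WATERMARK_STACK_SIZE);
    if (stack)
        memset(stack, WATERMARK_FILL, WATERMARK_STACK_SIZE);
    return stack;
}

/* Call after the ULT has been joined. Assumes the stack grows downward, so
 * never-touched bytes remain at the low end of the buffer. */
size_t stack_bytes_used(const void *stack)
{
    const uint8_t *p = (const uint8_t *)stack;
    size_t untouched = 0;
    while (untouched < WATERMARK_STACK_SIZE && p[untouched] == WATERMARK_FILL)
        untouched++;
    return WATERMARK_STACK_SIZE - untouched;
}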
Thank you for your interest in this topic. Since this has been requested by many users, we will definitely investigate further, but please let me explain the situation so that we can reach a practical solution.

Stack hook

Sorry that I missed the discussion in #16, but I believe we can do what you want by using the existing ABT_key mechanism:

void stack_free_callback(void *stack) {
    munmap(stack, PREDEFINED_MMAP_STACK_SIZE);
}

ABT_key g_stack_key;
/* ABT_key_create(stack_free_callback, &g_stack_key); */

void thread() {
    /* The created thread needs to know "stack".
     * The ULT-local value will be freed by stack_free_callback() on ABT_thread_free(). */
    ABT_key_set(g_stack_key, stack);
}

void thread_create() {
    stack = mmap(...);
    ABT_thread_create(...);
    ABT_thread_free(...);
}
|
Hello, I agree that the ABT_key feature/mechanism seems to address the external stack management needs upon ULT exit. And I think the stack address could be passed for later management as part of the "void *arg" parameter (likely a struct) of ABT_thread_create().

The "cactus stack" method cannot help in our case, as each of the stack overruns we have experienced is not due to our code but comes from the external libs we use. The "stack canary" method only detects an overrun after the fact but will not allow us to survive it. My understanding of the "lazy stack allocation" method you describe is that it can only be used for simple ULTs which execute their code in one go (scheduled only once, without yielding) and thus could use the scheduler's stack, like tasklets. But we have no ULTs of this kind in our code.

About mmap()'ed stacks and how they can grow: this is a Kernel feature which is already used for the main process/thread stack. When a page fault occurs for an address just beyond the current boundary page of a stack VMA (just below it, for a downward-growing stack), the Kernel simply allocates the page into the VMA (if the stack limits are not exceeded) and returns to user mode so the faulting instruction can complete successfully. |
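(To make the kernel mechanism described above concrete, here is a minimal stand-alone sketch. It is illustrative only, Linux-specific, with arbitrary sizes, and not code from this thread: it runs a function on an mmap()'ed MAP_GROWSDOWN stack via ucontext and deliberately uses more stack than was initially mapped, so the kernel has to extend the VMA during page-fault handling.)

#define _GNU_SOURCE /* MAP_ANONYMOUS, MAP_STACK, MAP_GROWSDOWN */
#include <stdio.h>
#include <sys/mman.h>
#include <ucontext.h>

#define INITIAL_STACK_SIZE (16 * 4096) /* mapped up front */
#define EXTRA_USAGE (64 * 4096)        /* forces downward growth */

static ucontext_t main_ctx, work_ctx;

static void worker(void)
{
    /* This local buffer does not fit in the initially mapped pages, so
     * touching it faults below the VMA; with MAP_GROWSDOWN (and enough free
     * space below the mapping) the kernel grows the VMA instead of
     * delivering SIGSEGV. */
    volatile char buf[EXTRA_USAGE];
    for (size_t i = 0; i < sizeof(buf); i++)
        buf[i] = (char)0xAB;
    printf("used ~%zu bytes on a stack initially mapped with %d bytes\n",
           (size_t)sizeof(buf), INITIAL_STACK_SIZE);
}

int main(void)
{
    char *stack = mmap(NULL, INITIAL_STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK | MAP_GROWSDOWN,
                       -1, 0);
    if (stack == MAP_FAILED)
        return 1;
    getcontext(&work_ctx);
    work_ctx.uc_stack.ss_sp = stack;
    work_ctx.uc_stack.ss_size = INITIAL_STACK_SIZE;
    work_ctx.uc_link = &main_ctx; /* return here when worker() finishes */
    makecontext(&work_ctx, worker, 0);
    swapcontext(&main_ctx, &work_ctx);
    return 0;
}

(One detail: the sketch switches onto the new stack via ucontext rather than just poking an address below the mapping from main(), because kernels have historically refused GROWSDOWN expansion for faulting addresses far below the current stack pointer.)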
@bfaccini Thank you for your comments and sorry for the late reply. I updated #16. It seems that cactus stack and stack canary are not helpful in your use case. Regarding
|
Hello Shintaro, Concerning the limitation on the number of mmap()s per process, I was not aware of such a limit... As I already pointed out, some management code needs to be added in order to "pool" the current mmap()'ed stacks and, in particular, to keep track of each stack's size and start/end addresses. |
Thank you for your comments!

"Lazy stack allocation"

The exact idea is different. It is "assigning a stack on executing a ULT and releasing a stack on joining a ULT". For example, suppose one creates 10000 ULTs and joins them:

for (i = 0; i < 10000; i++)
    ABT_thread_create(..., &threads[i]);
for (i = 0; i < 10000; i++)
    ABT_thread_free(&threads[i]);

The current mechanism allocates stacks in ABT_thread_create(). Under lazy stack allocation, a stack would be assigned only when the ULT actually starts running. Anyway, if it does not help your case, we need a different approach.
|
@shintaro-iwasaki ,
As I have indicated in the code comments, and since I also have not found an existing API to help with this, parsing /proc/self/maps upon each stack allocation has a cost. So it would be a good idea to keep an internal view of the mappings up to date in order to optimize the hole search, and to refresh it only when an allocation diverges from the expected hint. |
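(To sketch what such a cached-mappings approach could look like: all names below are illustrative and none of this is existing Argobots code. The idea is to keep a sorted cache of (start, end) ranges, search holes in the cache, and re-parse /proc/self/maps only when mmap() fails or returns an address different from the hint. It is not thread-safe, and refresh_mapping_cache() would be called once at start-up.)

#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>

#define MAX_CACHED_RANGES 65536

typedef struct { uintptr_t start, end; } addr_range_t;

static addr_range_t g_ranges[MAX_CACHED_RANGES];
static int g_num_ranges = 0;

/* Rebuild the cache from /proc/self/maps (entries are already sorted). */
static int refresh_mapping_cache(void)
{
    FILE *fp = fopen("/proc/self/maps", "r");
    char line[1024];
    if (!fp)
        return -1;
    g_num_ranges = 0;
    while (fgets(line, sizeof(line), fp) && g_num_ranges < MAX_CACHED_RANGES) {
        unsigned long s, e;
        if (sscanf(line, "%lx-%lx", &s, &e) == 2) {
            g_ranges[g_num_ranges].start = (uintptr_t)s;
            g_ranges[g_num_ranges].end = (uintptr_t)e;
            g_num_ranges++;
        }
    }
    fclose(fp);
    return 0;
}

/* Find a hole of at least `length` bytes using only the cached view. */
static void *find_hole_cached(size_t length)
{
    uintptr_t addr = 0x10000; /* skip the very lowest addresses */
    for (int i = 0; i < g_num_ranges; i++) {
        if (g_ranges[i].start > addr && g_ranges[i].start - addr >= length)
            return (void *)addr;
        if (g_ranges[i].end > addr)
            addr = g_ranges[i].end;
    }
    return (void *)addr; /* hole after the last mapping */
}

/* mmap() at the hinted position; refresh the cache (and retry once) if the
 * kernel places the mapping somewhere other than the hint. */
static void *mmap_in_hole(size_t hole_len, size_t map_len, int extra_flags)
{
    for (int attempt = 0; attempt < 2; attempt++) {
        char *hint = (char *)find_hole_cached(hole_len) + (hole_len - map_len);
        void *ret = mmap(hint, map_len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | extra_flags, -1, 0);
        if (ret == (void *)hint)
            return ret; /* a real version would also insert it into the cache */
        if (ret != MAP_FAILED)
            munmap(ret, map_len); /* placed elsewhere: undo, refresh, retry */
        refresh_mapping_cache();
    }
    return MAP_FAILED;
}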
@bfaccini I am sorry for my late reply. Thank you very much for your code! Please give me a few more days to examine the best direction for this issue. |
Thank you very much for your code! I'd also like to apologize for my late reply. 1. Number of
|
Hello Shintaro, First of all, I had to update my code snippet above as it was wrongly doing a system("cat /proc/self/maps") at the beginning and end, which was nonsense since it would not dump the VMAs of the current process!!... I agree with you that the cost of parsing "/proc/self/maps" upon each mmap()'ed stack allocation is too heavy, and this is why I already wrote above that the best approach would be to maintain an internal view of it, use it for subsequent allocations, and only refresh it upon an mmap() failure or an mmap()'ed address that differs from the hint. This is very unlikely to happen frequently, as the process VMA map, apart from this stack allocation mechanism under our control, should not change that much. Again, if you want to benefit from the automatic stack growth (up to the process stack limit) mechanism provided by the Kernel, you need to use MAP_GROWSDOWN!! See the following piece of code in my snippet:
You will not be able to do a similar out-of-range write without a SEGV with regular mmap()'ed areas. This holds for most architectures; on architectures where the stack grows upward you would have to use MAP_GROWSUP instead. Also, I don't think there is a specific limit on the number of per-process mmap()'ed regions with the MAP_GROWSDOWN property; it is rather a generic limitation which can easily be avoided, as it is a Kernel tunable:
|
Thank you for your comments!
I do not think This
I understand it works, but I would like to understand the benefit of In my understanding,
It might be a tunable parameter, but as I told you, No matter whether we use |
The find_hole() code is necessary to ensure that the Kernel will let the stack grow from min_stack_size to max_stack_size. And there is no special treatment for MAP_GROWSDOWN during mmap().

I agree with you that the detection/handling of stacks that have grown is problematic if we want to munmap() them. But it may be OK to simply use the stack's lowest possible address and its maximum length as munmap() parameters: first because munmap() will not complain/fail due to any unmapped parts in this range, and secondly because the Kernel would not have allowed another VMA to be allocated within the stack_guard_gap interval from the top of this stack!!

I think your understanding of the Kernel behaviour with MAP_GROWSDOWN regions is not exact, and you may also not have fully understood what my code snippet tries to demonstrate. If find_hole() tries to find an unmapped area of stack_guard_gap + max_stack_size length and then mmap()s only a min_stack_size length in the highest address range of this found area, this is to ensure that the Kernel will then allow it to grow up to max_stack_size upon need/access. So the stack area/VMA starts its life with a minimal virtual address range, which the Kernel then grows only if needed during page-fault handling.

vm.max_map_count is the tunable that limits the number of VMAs/regions which can make up the virtual memory address space of a process. So if you set it to one million, you will be able to create one million mmap()'ed regions, even with the MAP_GROWSDOWN property! I have checked this on CentOS 7.8. And I have no doubt it is the same on Ubuntu and SUSE platforms, as it is a Kernel thing. |
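(A minimal sketch of the freeing approach described here; the constants mirror the earlier snippet and are only illustrative. The point is to munmap() the entire range the stack could ever occupy, so it does not matter how far it actually grew.)

#include <sys/mman.h>

#define DEFAULT_STACK_SIZE (4096 * 2)   /* initially mapped (minimum) size */
#define MAX_STACK_SIZE     (4096 * 128) /* size the stack may grow to */

/* `initial_base` is the pointer returned by mmap() for the initially mapped
 * DEFAULT_STACK_SIZE bytes; MAP_GROWSDOWN growth only extends downward from
 * it. munmap() accepts ranges that are only partially mapped, so unmapping
 * the whole possible range is safe even if the stack never grew. */
void free_grown_stack(void *initial_base)
{
    char *lowest_possible =
        (char *)initial_base + DEFAULT_STACK_SIZE - MAX_STACK_SIZE;
    munmap(lowest_possible, MAX_STACK_SIZE);
}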
Thank you for correcting my wrong understanding. Yes, as you explained, I would like to discuss two things: 1. how to allocate each stack (whether we use MAP_GROWSDOWN or not) and 2. a memory pool.

1. How to allocate each stack

I would appreciate your idea regarding why we want to use MAP_GROWSDOWN.
I believe we can use the memory overcommit mechanism (vm.overcommit_memory).
Is there any reason to use MAP_GROWSDOWN?

2. Memory pool

No matter whether we use MAP_GROWSDOWN or not, some pooling mechanism would likely be needed to reduce the per-ULT mmap()/munmap() cost.

I would like to know your use case in detail so that I can understand which implementation is suitable for your workloads and start to work on this issue. Any better idea is welcome. I would truly appreciate it if you could correct my understanding. |
Concerning how to ensure that another mmap()/VMA will not take place in the hole between the current top of the stack and the previous region at stack allocation time: this is again where the find_hole() algorithm, by taking stack_guard_gap into account, anticipates the Kernel's placement rules for new areas.

I assume that each platform will provide an equivalent of "/proc/self/maps" (why not an API!). Linux distros should have it as they all use the same Kernels. The vm.max_map_count limitation would have to be well documented. If required (as it would be for our use case) it will need to be changed by an administrator, as already happens for applications with huge resource demands...

To be honest, I really don't understand why you always refer to the vm.overcommit_memory tunable for this issue. Since it impacts the full process virtual memory allocation, we may have to take care of it if the mmap()'ed stacks' total memory footprint becomes too large versus the real memory size, but it will not help with the stack allocation+placement and further-growth problem we are trying to solve here. BTW, can you explain why you think vm.max_map_count has no effect on some platforms??

Again, MAP_GROWSDOWN helps the Kernel know how a region will be allowed to grow, and also how to best place new regions/mmap()s to avoid collisions. About the need for a stack pool, I believe we may first try without one, as it could be added later; what do you think?? |
Thank you for your comments. It seems that a pool is not necessary for now, so I will stop talking about a stack pool.
Do we need to keep track of the footprint of virtual memory addresses? Of course, it would be great if the footprint of virtual memory addresses is almost the same as the footprint of real memory, but does this justify adding a complicated
If we don't care about the footprint of virtual memory addresses, we can just allocate 2MB (or any size) of stack to each ULT. You can create millions of ULTs with 2MB stacks if
I wanted to point out that
Yes, the kernel will know that the footprint of virtual memory addresses is the same as the footprint of real memory.
Does the kernel know how to best place new regions/mmap()s to avoid collisions? For example, if
In any case, at this point, I do not want to implement this
To test its behavior, the following should work.

#define _GNU_SOURCE /* for MAP_ANONYMOUS, MAP_STACK, MAP_GROWSDOWN */
#include <abt.h>
#include <alloca.h>
#include <stdint.h> /* for uintptr_t / UINTPTR_MAX used in find_hole() */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#define DEFAULT_STACK_SIZE (4096 * 2)
ABT_key g_stack_key;
/* create_thread() is called in parallel, use a spinlock to protect g_stack_attr
* or create a copy for each thread. */
ABT_thread_attr g_stack_attr;
int g_measure_time = 0;
int init_stack_key();
void finalize_stack_key();
void *allocate_stack();
void free_stack(void *mem);
void create_thread(ABT_pool pool, void (*f)(void *), void *arg,
ABT_thread *newthread);
void free_thread(ABT_thread *thread);
void thread_func(void *arg);
int main(int argc, char *argv[])
{
if (argc != 3) {
printf("Usage: ./a.out NUM_THREADS MEASURE_TIME?\n");
printf("Ex: ./a.out 32 0\n");
return -1;
}
int num_threads = atoi(argv[1]);
g_measure_time = atoi(argv[2]);
int num_repeats = g_measure_time ? (100000 / num_threads + 2) : 1;
/* Initialization */
ABT_init(0, NULL);
ABT_xstream xstream;
ABT_pool pool;
ABT_xstream_self(&xstream);
ABT_xstream_get_main_pools(xstream, 1, &pool);
init_stack_key();
ABT_thread *threads = malloc(sizeof(ABT_thread) * num_threads);
double start_time, end_time;
/* Fork and join threads. */
for (int step = 0; step < num_repeats; step++) {
if (step == num_repeats / 2)
start_time = ABT_get_wtime();
for (int i = 0; i < num_threads; i++) {
create_thread(pool, thread_func, NULL, &threads[i]);
}
for (int i = 0; i < num_threads; i++) {
ABT_thread_free(&threads[i]);
}
}
/* Print execution time */
end_time = ABT_get_wtime();
if (g_measure_time) {
double t = (end_time - start_time) / num_threads /
(num_repeats - num_repeats / 2) * 1.0e6;
printf("Fork-join per thread: %.2f[us]\n", t);
}
/* Finalization */
free(threads);
finalize_stack_key();
ABT_finalize();
return 0;
}
void deep_func(int depth, char *root_stack)
{
/* Use a stack region to see if the stack is extended. */
volatile char *mem = (char *)alloca(sizeof(char) * 512);
mem[511] = 1;
if (depth == 0) {
printf("Consumed %d bytes of stacks\n", (int)(root_stack - mem));
} else {
deep_func(depth - 1, root_stack);
}
}
void thread_func(void *arg)
{
if (g_measure_time == 0) {
deep_func(128, (char *)alloca(sizeof(char) * 128));
}
}
/* ************************************************************************** */
int init_stack_key()
{
ABT_key_create(free_stack, &g_stack_key);
ABT_thread_attr_create(&g_stack_attr);
return 0;
}
void finalize_stack_key()
{
ABT_thread_attr_free(&g_stack_attr);
ABT_key_free(&g_stack_key);
}
void create_thread(ABT_pool pool, void (*f)(void *), void *arg,
ABT_thread *newthread)
{
void *newstack = allocate_stack();
ABT_thread_attr_set_stack(g_stack_attr, newstack, DEFAULT_STACK_SIZE);
ABT_thread_create(pool, f, arg, g_stack_attr, newthread);
ABT_thread_set_specific(*newthread, g_stack_key, newstack);
}
/* ************************************************************************** */
#define PROC_SELF_MAPS "/proc/self/maps"
#define LINE_LEN 1024
#define STACK_GUARD_GAP_SIZE (4096 * 256)
#define MAX_STACK_SIZE (4096 * 128)
char *find_hole(size_t length)
{
FILE *fp;
char line[LINE_LEN];
char *addr = NULL, *startaddr, *endaddr;
fp = fopen(PROC_SELF_MAPS, "r");
if (fp == NULL) {
return NULL;
}
/* all VMAs start/end addresses are already rounded to page size */
while (fgets(line, LINE_LEN, fp) != NULL) {
if (sscanf(line, "%p-%p", &startaddr, &endaddr) == 2) {
if (startaddr > addr)
if (startaddr - addr >= length)
break;
if (endaddr > addr)
addr = endaddr;
}
}
/* check if last VMA has been reached but there is still not
* enough room
*/
if (UINTPTR_MAX - (uintptr_t)addr < length) {
addr = MAP_FAILED;
fprintf(stderr, "no hole found\n");
}
fclose(fp);
return addr;
}
void *allocate_stack()
{
char *addr_hint = find_hole(MAX_STACK_SIZE + STACK_GUARD_GAP_SIZE);
return mmap(addr_hint + STACK_GUARD_GAP_SIZE + MAX_STACK_SIZE -
DEFAULT_STACK_SIZE,
DEFAULT_STACK_SIZE, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK | MAP_GROWSDOWN, -1, 0);
}
void free_stack(void *mem)
{
munmap(mem, DEFAULT_STACK_SIZE);
}

The initialization part might look complicated, but except for this, the program can just replace ABT_thread_create() with the create_thread() above. |
I don't think I ever wrote that "vm.overcommit_memory is not tunable", did I ? For me it is also a tunable, like vm.max_map_count, even if they can only be modified by root!
I agree with you that the cost of find_hole() may be too high, mainly due to the parsing of /proc/self/maps..., and it will rise as the number of mmap()'ed stacks grows. Solving/optimizing this appears tricky, as even if we want to internally manage/update a view of the current mappings, we will need to implement a fast algorithm like the one used by the Kernel (a red-black tree)...
mmap()'ing stacks of a fixed size, bigger than the present one, could be acceptable for our use case. And by the way, sorry to insist, but specifying MAP_GROWSDOWN ensures that stack overrun scenarios will at least be immediately detected (because of stack_guard_gap) and possibly (if no extra VMA/mapping is allocated on top of LowerStackAddress+stack_guard_gap) also allows for some additional growth if needed.
Again, sorry to insist, but hopefully it does, since that is its only purpose. My guess is that, during your test with normal mmap(), you triggered the scenario where multiple adjacent VMAs/mappings with the same flags/properties were merged into a single one by the Kernel, causing you not to reach the vm.max_map_count limit.
Yes, this is guaranteed by the Kernel, which will keep stack_guard_gap pages in addition on top (lower addresses) of the mmap()'ed area with the MAP_GROWSDOWN property, with the primary goal of avoiding stack overruns for security purposes. Also, thanks for your new code snippet, which is definitely a good example and starting point for how to use mmap()'ed stacks allocated externally from Argobots!! |
@bfaccini Thank you for your comments!
Argobots is not the only component that allocates memory, so it would be challenging to keep the mapping updated. For example, the internal map can misunderstand that
I am not fully sure if the Kernel really guarantees it. At least,
I still believe mmap()'ing fixed but bigger stacks is simple and practical although it has its downsides (1. Regarding SEGV, we can use As you pointed out, extra stack growth larger than a specified size is not allowed if we In any case, all the concerns above and I mentioned previously assume a general case, so it might not apply to your workload (e.g., Only Argobots uses |
You are right when you say that the extra gap used by find_hole(), to sit just beyond the lowest stack address, is not guaranteed to be kept free by the Kernel, but this is likely to work if all stacks can be allocated within the same address range, located apart from other allocation areas, and if this gap is only a few pages. But anyway, as this will be OK for our use case, let's stick to an mmap()'ed stack size bigger than the current default ULT stack size (1 or 2 MB), and if it is done with MAP_GROWSDOWN you will not need to implement any stack guard mechanism, as it is already provided by the Kernel (using stack_guard_gap). About the stack canary method, I don't think it will detect a stack overrun as effectively as the stack guard method will avoid it. |
Thanks. If you don't need an extra stack growth (i.e., no find_hole()), allocate_stack() can be simplified as follows:

void *allocate_stack()
{
return mmap(NULL, DEFAULT_STACK_SIZE, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK | MAP_GROWSDOWN, -1, 0);
}

I believe the easiest way to use such a stack is to copy the code I wrote in the previous comment with the change above. Further optimizations would need some changes in Argobots, e.g., a new memory allocation and management mechanism inside Argobots. Creating a pool would be the most straightforward optimization, but there is always a performance vs. memory footprint trade-off. If you encounter any performance issues, please let me know with details about your workloads. (Stack canary was for #294, so please never mind.) |
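(For reference, a minimal sketch of the pooling optimization mentioned above; names and the fixed capacity are illustrative, and this is not an Argobots API. The idea is to keep a small free list of previously mmap()'ed stacks and fall back to mmap()/munmap() only when the list is empty or full.)

#define _GNU_SOURCE /* MAP_ANONYMOUS, MAP_STACK, MAP_GROWSDOWN */
#include <pthread.h>
#include <sys/mman.h>

#define DEFAULT_STACK_SIZE (4096 * 2)
#define POOL_CAPACITY 1024

static void *g_pool[POOL_CAPACITY];
static int g_pool_size = 0;
static pthread_spinlock_t g_pool_lock; /* stacks may be freed from any ES */

void stack_pool_init(void)
{
    pthread_spin_init(&g_pool_lock, PTHREAD_PROCESS_PRIVATE);
}

void *allocate_stack_pooled(void)
{
    void *stack = NULL;
    pthread_spin_lock(&g_pool_lock);
    if (g_pool_size > 0)
        stack = g_pool[--g_pool_size];
    pthread_spin_unlock(&g_pool_lock);
    if (stack)
        return stack;
    return mmap(NULL, DEFAULT_STACK_SIZE, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK | MAP_GROWSDOWN, -1, 0);
}

void free_stack_pooled(void *stack)
{
    pthread_spin_lock(&g_pool_lock);
    if (g_pool_size < POOL_CAPACITY) {
        g_pool[g_pool_size++] = stack;
        stack = NULL;
    }
    pthread_spin_unlock(&g_pool_lock);
    if (stack) /* pool full: actually release it */
        munmap(stack, DEFAULT_STACK_SIZE);
}

(This is exactly the performance vs. memory footprint trade-off mentioned above: pooled stacks avoid repeated mmap()/munmap() calls but keep their memory reserved. As with the simple free_stack() above, any pages a MAP_GROWSDOWN stack actually grew into below its initial mapping are not tracked here.)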
Hi @shintaro-iwasaki , |
Yes, that's what I meant. That's why Argobots allows users to set a stack. It does not mean we do not want to integrate it (or any advanced one) into Argobots. Our long discussion was basically about |
Well, this is a bit disappointing, but maybe at least this issue has helped you see the need for some enhancement around ULT stack allocation/management/protection/... |
Thank you. We understood your concerns regarding ULT stacks overall, though your requirement is still unclear to me. As ULT stack usage highly depends on each use case, we need some specific input from users, more than "some enhancement around the ULT stacks allocation/...". There is always a trade-off between performance and memory footprint. For example, a few microseconds per ULT is acceptable, using
We are happy to discuss with users and work on it if you have any ideas about the requirements. At this point, my understanding is that 1. your program does not need "extra stack growth" of
We would appreciate any comments from you. Of course, the custom stack allocation mechanism as I showed as an example should be a good solution to tailor stack allocation to your applications. In any case, I am sorry for your disappointment. |
Hello Shintaro, |
Thank you for trying the code!
typedef struct {
void *stack_to_be_freed;
void (*thread_f)(void *);
void *thread_arg;
int need_free;
} unnamed_thread_wrapper_arg_t;
static void unnamed_thread_wrapper(void *arg)
{
unnamed_thread_wrapper_arg_t *p_wrapper_arg =
(unnamed_thread_wrapper_arg_t *)arg;
ABT_key_set(g_stack_key, p_wrapper_arg->stack_to_be_freed);
p_wrapper_arg->thread_f(p_wrapper_arg->thread_arg);
if (p_wrapper_arg->need_free)
free(p_wrapper_arg);
}
void create_thread(ABT_pool pool, void (*f)(void *), void *arg,
ABT_thread *newthread)
{
if (newthread) {
/* Named thread. */
void *newstack = allocate_stack();
ABT_thread_attr_set_stack(g_stack_attr, newstack, DEFAULT_STACK_SIZE);
ABT_thread_create(pool, f, arg, g_stack_attr, newthread);
ABT_thread_set_specific(*newthread, g_stack_key, newstack);
} else {
/* Unnamed thread. */
void *newstack = allocate_stack();
const int is_naive = 0; /* Unoptimized version? */
if (is_naive) {
/* wrapper_arg will be freed by unnamed_thread_wrapper(). */
unnamed_thread_wrapper_arg_t *wrapper_arg =
(unnamed_thread_wrapper_arg_t *)malloc(
sizeof(unnamed_thread_wrapper_arg_t));
wrapper_arg->stack_to_be_freed = newstack;
wrapper_arg->thread_f = f;
wrapper_arg->thread_arg = arg;
wrapper_arg->need_free = 1;
ABT_thread_attr_set_stack(g_stack_attr, newstack,
DEFAULT_STACK_SIZE);
ABT_thread_create(pool, unnamed_thread_wrapper, wrapper_arg,
g_stack_attr, NULL);
} else {
/* Allocate wrapper_arg using a stack part (default stack size will
* be reduced a bit). malloc() and free() are not needed. */
const size_t unnamed_thread_wrapper_arg_size =
((sizeof(unnamed_thread_wrapper_arg_t) + 64 - 1) / 64 * 64);
const size_t new_stack_size =
DEFAULT_STACK_SIZE - unnamed_thread_wrapper_arg_size;
unnamed_thread_wrapper_arg_t *wrapper_arg =
(unnamed_thread_wrapper_arg_t
*)((char *)newstack + DEFAULT_STACK_SIZE -
unnamed_thread_wrapper_arg_size);
wrapper_arg->stack_to_be_freed = newstack;
wrapper_arg->thread_f = f;
wrapper_arg->thread_arg = arg;
wrapper_arg->need_free = 0;
ABT_thread_attr_set_stack(g_stack_attr, newstack, new_stack_size);
ABT_thread_create(pool, unnamed_thread_wrapper, wrapper_arg,
g_stack_attr, NULL);
}
}
}

You can use this create_thread() in the full test program below.

Test program:
#define _GNU_SOURCE /* for MAP_ANONYMOUS, MAP_STACK, MAP_GROWSDOWN */
#include <abt.h>
#include <alloca.h>
#include <stdint.h> /* for uintptr_t / UINTPTR_MAX used in find_hole() */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#define DEFAULT_STACK_SIZE (4096 * 2)
ABT_key g_stack_key;
/* If create_thread() is called in parallel, use a spinlock to protect
* g_stack_attr or create a copy for each thread/ES to avoid contention. */
ABT_thread_attr g_stack_attr;
int g_measure_time = 0;
int init_stack_key();
void finalize_stack_key();
void *allocate_stack();
void free_stack(void *mem);
void create_thread(ABT_pool pool, void (*f)(void *), void *arg,
ABT_thread *newthread);
void free_thread(ABT_thread *thread);
void thread_func(void *arg);
int main(int argc, char *argv[])
{
if (argc != 3) {
printf("Usage: ./a.out NUM_THREADS MEASURE_TIME?\n");
printf("Ex: ./a.out 32 0\n");
return -1;
}
int num_threads = atoi(argv[1]);
g_measure_time = atoi(argv[2]);
int num_repeats = g_measure_time ? (100000 / num_threads + 2) : 1;
/* Initialization */
ABT_init(0, NULL);
ABT_xstream xstream;
ABT_pool pool;
ABT_xstream_self(&xstream);
ABT_xstream_get_main_pools(xstream, 1, &pool);
init_stack_key();
ABT_thread *threads = malloc(sizeof(ABT_thread) * num_threads);
double start_time, end_time;
/* Fork and join threads. */
for (int step = 0; step < num_repeats; step++) {
if (step == num_repeats / 2)
start_time = ABT_get_wtime();
/* Named threads */
for (int i = 0; i < num_threads; i++) {
create_thread(pool, thread_func, NULL, &threads[i]);
}
/* Unnamed threads */
for (int i = 0; i < num_threads; i++) {
create_thread(pool, thread_func, NULL, NULL);
}
for (int i = 0; i < num_threads; i++) {
ABT_thread_free(&threads[i]);
}
}
/* Print execution time */
end_time = ABT_get_wtime();
if (g_measure_time) {
double t = (end_time - start_time) / num_threads /
(num_repeats - num_repeats / 2) * 1.0e6;
printf("Fork-join per thread: %.2f[us]\n", t);
}
/* Finalization */
free(threads);
finalize_stack_key();
ABT_finalize();
return 0;
}
void deep_func(int depth, char *root_stack)
{
/* Use a stack region to see if the stack is extended. */
volatile char *mem = (char *)alloca(sizeof(char) * 512);
mem[511] = 1;
if (depth == 0) {
printf("Consumed %d bytes of stacks\n", (int)(root_stack - mem));
} else {
deep_func(depth - 1, root_stack);
}
}
void thread_func(void *arg)
{
if (g_measure_time == 0) {
deep_func(128, (char *)alloca(sizeof(char) * 128));
}
}
/* ************************************************************************** */
int init_stack_key()
{
ABT_key_create(free_stack, &g_stack_key);
ABT_thread_attr_create(&g_stack_attr);
return 0;
}
void finalize_stack_key()
{
ABT_thread_attr_free(&g_stack_attr);
ABT_key_free(&g_stack_key);
}
typedef struct {
void *stack_to_be_freed;
void (*thread_f)(void *);
void *thread_arg;
int need_free;
} unnamed_thread_wrapper_arg_t;
static void unnamed_thread_wrapper(void *arg)
{
unnamed_thread_wrapper_arg_t *p_wrapper_arg =
(unnamed_thread_wrapper_arg_t *)arg;
ABT_key_set(g_stack_key, p_wrapper_arg->stack_to_be_freed);
p_wrapper_arg->thread_f(p_wrapper_arg->thread_arg);
if (p_wrapper_arg->need_free)
free(p_wrapper_arg);
}
void create_thread(ABT_pool pool, void (*f)(void *), void *arg,
ABT_thread *newthread)
{
if (newthread) {
/* Named thread. */
void *newstack = allocate_stack();
ABT_thread_attr_set_stack(g_stack_attr, newstack, DEFAULT_STACK_SIZE);
ABT_thread_create(pool, f, arg, g_stack_attr, newthread);
ABT_thread_set_specific(*newthread, g_stack_key, newstack);
} else {
/* Unnamed thread. */
void *newstack = allocate_stack();
const int is_naive = 0; /* Unoptimized version? */
if (is_naive) {
/* wrapper_arg will be freed by unnamed_thread_wrapper(). */
unnamed_thread_wrapper_arg_t *wrapper_arg =
(unnamed_thread_wrapper_arg_t *)malloc(
sizeof(unnamed_thread_wrapper_arg_t));
wrapper_arg->stack_to_be_freed = newstack;
wrapper_arg->thread_f = f;
wrapper_arg->thread_arg = arg;
wrapper_arg->need_free = 1;
ABT_thread_attr_set_stack(g_stack_attr, newstack,
DEFAULT_STACK_SIZE);
ABT_thread_create(pool, unnamed_thread_wrapper, wrapper_arg,
g_stack_attr, NULL);
} else {
/* Allocate wrapper_arg using a stack part (default stack size will
* be reduced a bit). malloc() and free() are not needed. */
const size_t unnamed_thread_wrapper_arg_size =
((sizeof(unnamed_thread_wrapper_arg_t) + 64 - 1) / 64 * 64);
const size_t new_stack_size =
DEFAULT_STACK_SIZE - unnamed_thread_wrapper_arg_size;
unnamed_thread_wrapper_arg_t *wrapper_arg =
(unnamed_thread_wrapper_arg_t
*)((char *)newstack + DEFAULT_STACK_SIZE -
unnamed_thread_wrapper_arg_size);
wrapper_arg->stack_to_be_freed = newstack;
wrapper_arg->thread_f = f;
wrapper_arg->thread_arg = arg;
wrapper_arg->need_free = 0;
ABT_thread_attr_set_stack(g_stack_attr, newstack, new_stack_size);
ABT_thread_create(pool, unnamed_thread_wrapper, wrapper_arg,
g_stack_attr, NULL);
}
}
}
/* ************************************************************************** */
#define PROC_SELF_MAPS "/proc/self/maps"
#define LINE_LEN 1024
#define STACK_GUARD_GAP_SIZE (4096 * 256)
#define MAX_STACK_SIZE (4096 * 128)
char *find_hole(size_t length)
{
FILE *fp;
char line[LINE_LEN];
char *addr = NULL, *startaddr, *endaddr;
fp = fopen(PROC_SELF_MAPS, "r");
if (fp == NULL) {
return NULL;
}
/* all VMAs start/end addresses are already rounded to page size */
while (fgets(line, LINE_LEN, fp) != NULL) {
if (sscanf(line, "%p-%p", &startaddr, &endaddr) == 2) {
if (startaddr > addr)
if (startaddr - addr >= length)
break;
if (endaddr > addr)
addr = endaddr;
}
}
/* check if last VMA has been reached but there is still not
* enough room
*/
if (UINTPTR_MAX - (uintptr_t)addr < length) {
addr = MAP_FAILED;
fprintf(stderr, "no hole found\n");
}
fclose(fp);
return addr;
}
void *allocate_stack()
{
char *addr_hint = find_hole(MAX_STACK_SIZE + STACK_GUARD_GAP_SIZE);
return mmap(addr_hint + STACK_GUARD_GAP_SIZE + MAX_STACK_SIZE -
DEFAULT_STACK_SIZE,
DEFAULT_STACK_SIZE, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK | MAP_GROWSDOWN, -1, 0);
}
void free_stack(void *mem)
{
munmap(mem, DEFAULT_STACK_SIZE);
}
Misc:
|
Well, it seems to me that you did not directly answer my question "Is it safe to also use an ABT_key for this (to free a named ABT_thread)??" but rather tried to provide me with another way... So I presume the answer is no?! |
I am sorry for missing some of your questions. I will answer all of your questions again.
The answer to this first assumption is no.
A named ULT must be freed by using ABT_thread_free().
Using an unnamed ULT is my suggestion.
Sorry, I did not answer this question.
So, the answer is no. The reason is explained above.
Your understanding is correct. You don't need to use a named ULT since you can use
You need to call basically the following functions:

// Call it after ABT_init() once.
int init_stack_key();
// Call it before ABT_finalize().
void finalize_stack_key();
// Call this instead of ABT_thread_create().
// Unlike ABT_thread_create(), create_thread() currently does not take ABT_thread_attr.
void create_thread(ABT_pool pool, void (*f)(void *), void *arg, ABT_thread *newthread);

I believe this
And again, we are happy to implement any stack enhancement that is wholly or partly integrated into Argobots, but as I explained, I need some information from you about your workloads to determine a stack allocation method and design a stack pool cache. It might be nice for you if I could implement and provide all the options in Argobots, but unfortunately the design space of "stack allocation" is too huge for us to cover entirely. Stack allocation design is not as simple as the stack dump or stack unwinding features you previously requested. We would truly appreciate information about the workload. |
We regularly hit crashes due to ULT stack overruns, mainly caused by 3rd-party libraries that place big arrays/structs on the stack, or during lazy symbol resolution of dynamic libs...
Since we expect to run millions of concurrent ULTs, we cannot rely on statically allocating a big chunk for each stack if we want to limit the memory footprint, and thus we would like to benefit from the existing kernel mechanism which allows stacks to dynamically grow to their associated limit if needed.
We first thought that we could implement this externally from Argobots and use our own external stacks, but as described in issue #16, the present lack of a hook in Argobots that would let us know when a stack is no longer in use (so that it can be freed/reassigned) prevents it.
So, here are the main points that should be addressed in order to implement this new stack allocation mechanism inside Argobots (see also the sketch after this list):
_ allocate a new stack using mmap(), with a minimal size, ensuring it is positioned in a hole (by providing its address as a hint to mmap()!) far enough from the preceding mapping so that it is allowed to grow downward automatically (as per the Kernel's existing mechanism for stacks!) up to some reasonable maximum size (assuming the preceding mapping will not itself grow upward...), which could be set as the xstream's stack limit.
_ as it is not possible to rely on the Kernel to find the best place for us (i.e., by using NULL as mmap()'s address/first parameter), we will have to parse the current content of /proc/self/maps and decide. This process will also have to take into account current/previous stack allocations, in order to avoid re-using a hole into which a previously allocated stack is already expected to grow...
_ as part of the mechanisms required for managing stacks, a map of current allocations has to be maintained to avoid a new stack allocation possibly colliding with an existing one, along with a mechanism/policy to determine which no-longer-used stacks should be reused or freed/munmap()'ed.
_ also, the Kernel/boot "stack_guard_gap" parameter will have to be taken into account as the minimum hole size from the preceding mapping, because the Kernel uses it to decide whether a "stack" region is allowed to grow or not.
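(As a rough illustration of the bookkeeping the list above calls for: every name and field below is a placeholder, not Argobots code. Each mmap()'ed stack could be described by a record, so that a candidate hole is rejected whenever it overlaps the range an existing stack may still grow into.)

#include <stddef.h>
#include <stdint.h>

typedef struct stack_record {
    uintptr_t initial_base; /* address returned by mmap() (minimal size)  */
    size_t initial_size;    /* initially mapped size                      */
    size_t max_size;        /* size the stack is allowed to grow to       */
    size_t guard_gap;       /* kernel stack_guard_gap to keep free below  */
    struct stack_record *next;
} stack_record_t;

/* Lowest address that must stay unmapped for this stack: its maximum
 * downward growth plus the guard gap the kernel requires below it. */
static uintptr_t stack_reserved_low(const stack_record_t *r)
{
    return r->initial_base + r->initial_size - r->max_size - r->guard_gap;
}

/* A candidate hole [lo, lo + len) collides with this stack's reservation
 * [stack_reserved_low(r), initial_base + initial_size). */
static int hole_collides(const stack_record_t *r, uintptr_t lo, size_t len)
{
    uintptr_t hi = r->initial_base + r->initial_size;
    return lo < hi && lo + len > stack_reserved_low(r);
}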