Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Panic on stack overflow #91

Closed
jclark opened this issue Jun 20, 2021 · 29 comments
Closed

Panic on stack overflow #91

jclark opened this issue Jun 20, 2021 · 29 comments
Assignees
Milestone

Comments

@jclark
Copy link
Contributor

jclark commented Jun 20, 2021

At the moment, we seg fault on stack overflow.

We ought to detect this and call the runtime panic function.

I think LLVM has some stuff to help with this.

@KavinduZoysa
Copy link
Contributor

For our knowledge, this is how rust implemented this.

Rust detects stack overflow using SIGSEGV. When SIGSEGV was detected, we cannot say that the root cause is stack overflow because stack overflow is not the only cause to SIGSEGV(https://en.wikipedia.org/wiki/Segmentation_fault). So when SIGSEGV was detected, their signal handler checks whether the address lies within the stack guard page.

@jclark
Copy link
Contributor Author

jclark commented Jun 21, 2021

Lots of non-portable, low-level code there. It would be nice if we could do something more straightforward for now.

@KavinduZoysa
Copy link
Contributor

@jclark, Please check the below approach with a working sample and add your thoughts/improvements. Sample codes can be found here.

In this approach, we can detect the stack overflow by comparing the remaining stack size with the stack frame size we are going to allocate. I have written an equivalent C code for this approach. I have taken the address of the beginning of the stack using a dummy variable and this may not be the best way to do it. I wrote this just to verify the approach.

Then, I have written this LLVM IR by looking at above equivalent C code. To get the starting address of the stack frame I have used the llvm.frameaddress intrinsic. Here I have hardcoded STACK_FRAME_SIZE as 10 bytes, but ideally, it should be the stack frame size of the function we are going to check. This can be taken using LLVM stack map intrinsic. I worked on generating and parsing the LLVM stack map when I was exploring garbage collectors. Using this we can get the stack frame size of function and use it instead of the STACK_FRAME_SIZE constant.

@jclark
Copy link
Contributor Author

jclark commented Jun 23, 2021

Stepping back a bit, the approach is:

  1. in the runtime's main, we identify the stack limit (ie a pointer to the end of the stack) and store it in a global
  2. at the beginning of each function, we panic if there's a possibility that this function call will exceed the stack limit (except in so far as it calls other functions)

I like this approach. On details:

  1. stack limit (we can have a system dependent function in the runtime to return this)
    1. stacks can grow up or down and LLVM supports both; not sure if they grow up on any system we care about
    2. on Linux, execve uses the stack to store argv/argc/envp (both the pointers and the strings), so a local variable in main isn't the beginning of the stack; it's the beginning of the stack + the space used by execve
    3. we can use rlimit to get the stack size
  2. for the function's stack usage we need to account for:
    1. local variables we alloca
    2. space we use to store arguments passed to other Ballerina functions
    3. stack space used by C runtime functions, where we don't want to do an additional check
  3. we need to be careful not to mess up LLVM's mem2reg optimization
    1. I think I've read that mem2reg only works for alloca's in the the prologue
    2. if we do the stack check before the alloca, then we need to make sure that it doesn't stop the alloca's from being considered as being in the prologue (I guess that means the first basic block) - maybe you can do an experiment here

A general point is that we do not need to be able to use every byte of available stack space. We want things to be simple and efficient and 100% guarantee that we never use more than the limit. We should not get into trying to exactly calculate the stack usage of a function: rather we should compute an upper limit (i.e. a value X such that we sure that it the stack usage won't be more than X). For example, if the number of BIR registers is r, can we make the upper limit be X*r + Y for some X and Y?

@jclark
Copy link
Contributor Author

jclark commented Jun 23, 2021

We should also be careful we do not do anything that inhibits LLVM inlining.

@jclark
Copy link
Contributor Author

jclark commented Jun 23, 2021

We can divide between C and LL as follows:

  • C runtime exports global char *_bal_stack_limit for use by LL code
  • C runtime has system dependent function setStackLimit(argc,argv,envp), which is called by main before calling _B_main
  • LL code tests whether current stack pointer < _bal_stack_limit + something TBD, and calls _bal_panic if it is

I wonder whether llvm.frameaddress can be relied on.

@jclark
Copy link
Contributor Author

jclark commented Jun 23, 2021

Eventually we will want to be able to grow stacks, so strands can start off with small stacks. Here is what Go does

https://docs.google.com/document/d/1wAaf1rYoM4S4gtnPh0zOlGzWtrZFQ5suE8qr2sD8uWQ/pub

@jclark
Copy link
Contributor Author

jclark commented Jun 23, 2021

Have a look at the section "// go:nosplit" in this

https://dave.cheney.net/2018/01/08/gos-hidden-pragmas

where it shows the code the Go inserts in its prolog to check for stack overflow

@KavinduZoysa
Copy link
Contributor

Thank you @jclark, I will go through these details and come back.

@jclark
Copy link
Contributor Author

jclark commented Jun 23, 2021

I suggest looking at how llgo (Go on LLVM) deals with stack overflow

https://github.com/llvm/llvm-project/tree/release/10.x/llgo

I couldn't find the relevant code.

@KavinduZoysa
Copy link
Contributor

Sure, I will check that.

@jclark
Copy link
Contributor Author

jclark commented Jun 23, 2021

Also see

https://lists.llvm.org/pipermail/llvm-dev/2013-September/065333.html

It seems like a good start would be to add up all the space allocated by the alloca instructions (which we can do easily).

@jclark
Copy link
Contributor Author

jclark commented Jun 23, 2021

I wonder if this will work: after doing all the allocas, but before spilling the parameters into stack, we check whether the address of the last alloca is beyond the limit: we also make sure that there's some space between _bal_stack_limit and the real stack limit, to allow for anything that LLVM adds itself.

@manuranga
Copy link
Contributor

  1. Since LLVM alloca instruction can allocate dynamic arrays (even though we are using it to allocate statically sized things), I don't think there will be a way to get the current stack frame size statically from LLVM.
    In the initial version, we should do something simple that doesn't involve finding frame sizes. eg
if (stack_size - currently_used < 1 kb) {
   panic_stackoverflow();
}
  1. If it messes with the LLVM optimization, we might have to do this as a LLVM pass.

@KavinduZoysa
Copy link
Contributor

KavinduZoysa commented Jun 24, 2021

package main

func main() {
	foo()
}

func foo() {
	a := 5
	_ = a
	b := 6
	_ = b
	c := 7
	_ = c
}

This is the output of llvm-goc.

define void @main.main(i8* nest %nest.0) #0 !dbg !14 {
entry:
  call void @main.foo(i8* nest undef), !dbg !15
  ret void
}

define void @main.foo(i8* nest %nest.1) #0 !dbg !16 {
entry:
  %a = alloca i64, align 8
  %b = alloca i64, align 8
  %c = alloca i64, align 8
  %0 = bitcast i64* %a to i8*
  call void @llvm.lifetime.start.p0i8(i64 8, i8* %0)
  %1 = bitcast i64* %b to i8*
  call void @llvm.lifetime.start.p0i8(i64 8, i8* %1)
  %2 = bitcast i64* %c to i8*
  call void @llvm.lifetime.start.p0i8(i64 8, i8* %2)
  store i64 5, i64* %a, align 8
  call void @llvm.dbg.declare(metadata i64* %a, metadata !17, metadata !DIExpression()), !dbg !20
  %a.ld.0 = load i64, i64* %a, align 8, !dbg !21
  store i64 6, i64* %b, align 8
  call void @llvm.dbg.declare(metadata i64* %b, metadata !22, metadata !DIExpression()), !dbg !23
  %b.ld.0 = load i64, i64* %b, align 8, !dbg !24
  store i64 7, i64* %c, align 8
  call void @llvm.dbg.declare(metadata i64* %c, metadata !25, metadata !DIExpression()), !dbg !26
  %c.ld.0 = load i64, i64* %c, align 8, !dbg !27
  %3 = bitcast i64* %a to i8*
  call void @llvm.lifetime.end.p0i8(i64 8, i8* %3)
  %4 = bitcast i64* %b to i8*
  call void @llvm.lifetime.end.p0i8(i64 8, i8* %4)
  %5 = bitcast i64* %c to i8*
  call void @llvm.lifetime.end.p0i8(i64 8, i8* %5)
  ret void
}

@KavinduZoysa
Copy link
Contributor

I did a test to verify the output of llvm.frameaddress and compare its' output with the address of the first local var in function. Sample codes can be found here.

First I have written a C code to print the address of the first local var.
Then I generated the llvm IR for that and modified it. Modifications done :

  • Add llvm.frameaddress intrinsic
  • Add printf to print the about output

Then I ran the llvm IR with enabling debug(This is done to debug using gdb)

So the output is something like this.

0x7fffffffde8c <---- address of local var (alloc)
0x7fffffffde90 <---- output of llvm.frameaddress

Ideally, the frame address should be the address of the base pointer(rbp in x86) of a given frame. Because the base pointer is pointing to the beginning of the stack frame. By looking at the generated assmbly we can verify that `llvm.frameaddress' gives us the address of rbp.

Also, I have debugged the code using gdb to get the base pointer.

@jclark, what do you think about this?

@jclark
Copy link
Contributor Author

jclark commented Jun 24, 2021

@KavinduZoysa How does this work with function inlining? We want to compare using the address from alloca vs using llvm.frameaddress. I am not clear which is more appropriate for our use. If there are multiple allocas, is the address of the last at the end of the stack (i.e. bottom if growing down)?

@KavinduZoysa
Copy link
Contributor

When we consider function inlining,

  • llvm.frameaddress: This intrinsic gives us the frame address of the function after inlining happens. In the above example, let's say foo was inlined inside main. Then this intrinsic provides us the frame address of main.
  • address from alloca: If we consider the first address of the variables in foo, when inlining, those variables are allocated after the allocation of the main function's variables.

I have compared several llvm IRs with assembly codes that have multiple allocas. In all cases, the difference between the base pointer and stack pointer is greater than the sum of allocas. According to that, we cannot say the address of the last is the end of the stack.

@jclark
Copy link
Contributor Author

jclark commented Jun 24, 2021

In all cases, the difference between the base pointer and stack pointer is greater than the sum of allocas.

Yes. The frame size for a function F needs to include at least

  1. all the allocas of F,
  2. the maximum of the space needed for the arguments of each function that F calls
  3. space to save registers

What else is there? I'm not too worried about 3, because a fixed amount of space can deal with this. Question is: is there anything other than 1 and 2 that can cause unbounded increase in frame size?

@KavinduZoysa
Copy link
Contributor

The difference between llvm.frameaddress and address from alloca depends on the size of the first alloca. If the size of alloca is i64, then the difference is 8 bytes, if it is i32 then the difference is 4 bytes.

Question is: is there anything other than 1 and 2 that can cause unbounded increase in frame size?

Yes, I also did not found any other reason for unbounded increase in frame size. I will check more on that.

@jclark
Copy link
Contributor Author

jclark commented Jun 24, 2021

Given the point about llvm.frameaddress and inlining, I don't think llvm.frameaddress is going to work well for stack overflow checking, because we cannot estimate our frame size in a way that's consistent with llvm.frameaddress when inlining has been done. Of course, we could add a LLVM pass to insert these checks, but then we don't need llvm.frameaddress. So it's clear to me that at this stage we should use an alloca address as the basis for stack checking.

So we are going to insert a check at the beginning of the function that does the equivalent of:

extern char *stack_guard;
if (address_from_alloca < stack_guard) panic(STACK_OVERFLOW);

The remaining question is: where can we put this check without negatively impacting LLVM optimization specifically mem2reg and function inlining?

@KavinduZoysa
Copy link
Contributor

The remaining question is: where can we put this check without negatively impacting LLVM optimization specifically mem2reg and function inlining?

I am working on this, I will update as soon as possible.

extern char *stack_guard;
if (address_from_alloca < stack_guard) panic(STACK_OVERFLOW);

Just to verify this, address_from_alloca is the address of the last alloca for a given function and stack_guard is the stack_limit + CONSTANT memory size. Please correct me if I am wrong.

@jclark
Copy link
Contributor Author

jclark commented Jun 25, 2021

Just to verify this, address_from_alloca is the address of the last alloca for a given function and stack_guard is the stack_limit + CONSTANT memory size. Please correct me if I am wrong.

Eventually, I would expect address_from_alloca to be the first alloca of the function minus an estimate of the frame size (or the last plus an estimate of space used for arguments of called functions); stack_guard to be as you say.

For now, I would expect stack_guard to be the address of a local variable in main minus a constant for the allowed size of the stack (e.g. a megabyte) and address_from_alloca to be the first or last alloca of the function (doesn’t really matter with that stack_guard).

@KavinduZoysa
Copy link
Contributor

I have run a couple of tests to check the impact of mem2reg optimization. If we have used a particular alloca as shown below then those allocas will not be removed after mem2reg optimization.

before mem2reg:

define dso_local i32 @bar() #0 {
  %1 = alloca i32, align 4
  %2 = alloca i32, align 4
  ;; Read the 2nd address and compare
  %"%100" = bitcast i32* %2 to i8*
  %"%101" = load i8*, i8** @stak_guard 
  %"%102" = icmp slt i8* %"%100", %"%101"
  %"%103" = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([4 x i8], [4 x i8]* @.str, i64 0, i64 0), i1 %"%102")

  store i32 5, i32* %1, align 4
  store i32 2, i32* %2, align 4
  %3 = load i32, i32* %2, align 4
  %4 = add nsw i32 %3, 1
  store i32 %4, i32* %2, align 4
  %5 = load i32, i32* %1, align 4
  %6 = add nsw i32 %3, %5
  ret i32 %6
}

after mem2reg:

define dso_local i32 @bar() #0 {
  %1 = alloca i32, align 4
  ;; Remains the alloca which is needed
  %"%100" = bitcast i32* %1 to i8*
  %"%101" = load i8*, i8** @stak_guard, align 8
  %"%102" = icmp slt i8* %"%100", %"%101"
  %"%103" = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([4 x i8], [4 x i8]* @.str, i64 0, i64 0), i1 %"%102")

  store i32 2, i32* %1, align 4
  %2 = load i32, i32* %1, align 4
  %3 = add nsw i32 %2, 1
  store i32 %3, i32* %1, align 4
  %4 = add nsw i32 %2, 5
  ret i32 %4
}

When we consider the function inlining, the callee function's allocas are merged with callers allocas. But the alloca we used to get address_from_alloca will be there in callers after inlining.

@manuranga
Copy link
Contributor

Since are not planing to be precise at this stage, wouldn't it be better to do:

extern char *stack_guard;
if (llvm.frameaddress  < stack_guard) panic(STACK_OVERFLOW);
  1. It will not mess with any optimizations. From the llvm docs:

Note that calling this intrinsic does not prevent function inlining or other aggressive transformations, so the value returned may not be that of the obvious source-language caller.

  1. Check will be redundant when inlined but not incorrect.

@jclark
Copy link
Contributor Author

jclark commented Jun 25, 2021

@KavinduZoysa What does the assembly look like for that example?

@manuranga llvm.frameaddress is a step in the wrong direction: I can’t see any way to make an approach based in llvm.frameaddress robust. Using the address of an alloca, we can make it robust by putting an upper bound on stack usage based on number of allocas and number of arguments passed to called functions. Optimization cost is not yet clear. Need to see assembly and try taking the address of the last alloca.

@jclark
Copy link
Contributor Author

jclark commented Jun 25, 2021

I think the C equivalent of llvm.frameaddress __builtin_frame_address is useful in main. Taking the address of a stack variable and adding to it is undefined behaviour in C.

@KavinduZoysa
Copy link
Contributor

KavinduZoysa commented Jun 25, 2021

@KavinduZoysa What does the assembly look like for that example?

Please find the below assembly for the checking code above example.

before mem2reg:

0000000000401130 <bar>:
  401130:	55                      push   %rbp
  401131:	48 89 e5             	mov    %rsp,%rbp
  401134:	48 83 ec 10          	sub    $0x10,%rsp
  401138:	48 8d 45 f8          	lea    -0x8(%rbp),%rax // 2nd alloca (last one)
  40113c:	48 3b 04 25 40 40 40 	cmp    0x404040,%rax

after mem2reg:

0000000000401130 <bar>:
  401130:	55                   	push   %rbp
  401131:	48 89 e5                mov    %rsp,%rbp
  401134:	48 83 ec 10          	sub    $0x10,%rsp
  401138:	48 8d 45 fc          	lea    -0x4(%rbp),%rax // 1st alloca (last one)
  40113c:	48 3b 04 25 40 40 40 	cmp    0x404040,%rax

After mem2reg optimization, since only one alloca is left, that is taken to compare.

@jclark jclark closed this as completed in 7c458c3 Jun 26, 2021
@jclark
Copy link
Contributor Author

jclark commented Jun 26, 2021

What I implemented was to do use an extra alloca for the compare against stack_guard. After optimization, this seems to end up pretty much the same as frameaddress in the presence, so it's not as robust as I would like. I will create a separate issue to improve this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants