-
Notifications
You must be signed in to change notification settings - Fork 626
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Option to reduce code size of the generated AOT file when stack trace is enabled #3758
Comments
Thanks @loganek for bringing the great idea, it is really an issue that the AOT code size increases a lot when dump call stack, performance profiling or GC feature is enabled. As you mentioned, there may be many fields to commit to the stack frame, and I think there had better be an option of whether to check the stack overflow or not. And this issue introduces a new stack frame structure so there will be two stack frame structures, different option combinations may use different stack frame structure. I think you mean Another way may be to add some wamrc options for developer to select which elements to commit? For example:
And together with the current And had better add an option to enable/disable stack frame boundary check, e.g. |
I'll need to do a bit more experiments; in my proof of concept I indeed created a new frame structure (
This was just a conceptual idea to highlight the fact there'll be a few options. I think I agree that just having I'm more interested in your thoughts about option 1 and option 2 in my initial post, but given that option 2 won't give us all information, we should perhaps abandon it. |
Re-using
I also like the style of
For option 2, I think we don't need to store the instruction offset when the aot stack frame is created/pushed, in fact the ip offset is stored only when exception is to be thrown (see calling commit_ip in aot_emit_exception.c) and some places that may need to commit the frame data and switch to runtime (see callings of aot_gen_commit_sp_ip in aot_emit_function.c and aot_emit_control.c). The function aot_alloc_frame in aot_runtime.c and alloc_frame_for_aot_func in aot_emit_function.c also don't store ip offset to the frame. And the places to free the the stack frame in the AOT code also seem not an issue: the aot compiler concentrates the handlings of function return to be only in several places: (1) func_end: ; preds = %block28_end
%phi = phi i32 [ 0, %block28_end ]
ret i32 %phi
func_ret: ; preds = %block30_end, %br_if_else1021, %check_exce_succ1010, %check_exce_succ992, %br_if_else952, %br_if_else942, %loop5_begin, %loop4_begin, %loop3_begin, %loop2_begin, %check_exce_succ740, %check_exce_succ723, %check_exce_succ706, %check_exce_succ689, %block23_end, %br_if_else654, %br_if_else617, %check_exce_succ592, %check_exce_succ574, %block12_end, %br_if_else501, %br_if_else439, %br_if_else377, %block13_end
br label %got_exception
got_exception: ; preds = %func_ret, %block10_end, %check_nan_succ, %br_if_else268, %block0_end
%exception_id_phi = phi i32 [ 4, %block0_end ], [ 5, %br_if_else268 ], [ 3, %check_nan_succ ], [ 4, %block10_end ], [ 33, %func_ret ]
call void @aot_set_exception_with_id(ptr %aot_inst, i32 %exception_id_phi)
ret i32 0 I think the main issue of option 2 is: AOT runtime also allocates/frees frame before/after calling aot function, see aot_call_function(), if changing it, there may be backward compatibility issue for old AOT file - no frame is allocated for the first call for the old AOT file. If you just wants to save the AOT code size for tiny frame, maybe we can define a function to allocate the frame, call if before calling the wasm function by passing it with func idx (and ip offset if needed)? |
If to use the option 2, maybe we can use two bits in the feature flag of AOT file: a bit to indicate whether the tiny frame is used and a bit to indicate whether option 2 is used. And in the runtime: (1) when bit for option 2 is 0, keep the current behavior, (2) else, don't allocate/free aot frame before/after calling a function, and when traversing the frames, treat the frame as full frame or tiny frame according to tiny frame flag. By this way, aot runtime can keep the backward compatibility with the old AOT file in which the two bits are 0. |
I'll review the code and see what option would be more suitable; in my current PoC I indeed have two distinct data structures so I might keep it like that going forward.
Yes, I think the names can be short and explained in the doc. I also think usually those are not often handwritten but rather part of the CI/build script so I hope this won't be an issue.
I think pushing ip offset must happen on every
Yes, those are the places where the code needs to be generated. I'd need to run more experiments if putting extra instructions there will not cause the code size to increase significantly, but hopefully not.
As discussed in the meeting today, I think we can workaround it by having AOT feature flags so the runtime knows what to do.
I thought about it too, but I think calling the function is going to generate not significantly less code than having instructions for allocating frame, and I think it might affect a performance significantly so I think I'd rather generate the extra code at the beginning of each function and in its termination stage. After thinking more about it and speaking to @wenyongh offline, I think option 2 might give us better results in terms of code size optimization, so I'll build another prototype now and share some results, based on that we'll pick the final approach (although given that in our case we have 10x more calls than function definitions, I think option 2 will likely give better results). |
Yes, that should be kept, but note that here it stores the ip offset to the caller's frame which has been created, but not the callee frame: the callee frame is to be created next by the aot code generated by alloc_frame_for_aot_func (in opcode call), or by calling runtime API aot_alloc_frame (in opcode call_indirect), and both of them don't store ip offset to the frame. So it doesn't matter that we change the current full trace implementation to option 2.
Yes, in fact
Got it, thanks.
Agree, in fact fast-jit also uses option 2, there is only one frame allocation and frame freeing in a function's jitted code. |
Those parameters can be used to reduce the size of the AOT code. There's going to be more changes related to AOT code size reduction, this is just the initial step. p.s. #3758
Initially I thought that func_idx will always be needed. However, just having another thought on that, when the user is interested in both function index and instruction pointer, they could just save function pointer to the frame and skip writing the function index at all to save some generated code. We can then provide a simple script that using wasm-objdump "fixes" the stack trace by deducing function index based on the instruction pointer of that frame. I'll create a task to add |
- Implement TINY / STANDARD frame modes - tiny mode is only able to keep track on the IP and func idx, STANDARD mode provides more capabilities (parameters, stack pointer etc.). - Implement FRAME_PER_FUNCTION / FRAME_PER_CALL modes - frame per function adds code at the beginning and at the end of each function for allocating / deallocating stack frame, whereas in per-call mode the frame is allocated before each call. The exception is call to the imported function, where frame-per-function mode also allocates the stack before the `call` instruction (as it can't instrument the imported function). At the moment TINY + FRAME_PER_FUNCTION is automatically enabled in case GC and perf profiling are disabled and `values` call stack feature is not requested. In all the other cases STANDARD + FRAME_PER_CALL is used. STANDARD + FRAME_PER_FUNCTION and TINY + FRAME_PER_CALL are currently not implemented but possible, and might be enabled in the future. ps. #3758
Also add a script that converts instruction pointers to function indexes (or function names) bytecodealliance#3758
Also add a script that converts instruction pointers to function indexes (or function names) bytecodealliance#3758
Also add a script that converts instruction pointers to function indexes (or function names). #3758
Feature
We'd like to propose an option in the WAMR AOT compiler that will reduce the size of the generated code while still keeping track of some information about the existing callstack.
Currently, when compiling our file (~6 MB, over 100k call instructions and ~15k functions) to AOT with call stack tracking enabled (
--enable-dump-call-stack
option), the resulting file is 25MB (and is even more for ARM). When the call stack tracking is disabled, the size is ~18 MB, but we can't turn it off in production so we can analyze the crashes. Our experiments show that with our settings (no GC, no EH which is not supported yet but will be in the future) we could generate ~20MB bundle while keeping the basic information (just function call stacks).We're not targeting GC, JIT and EH (once it's ready) modes although the solution potentially can be extended to those features too (didn't dig deep into the implementation).
Benefit
The main benefit of the feature is a reduced AOT code size (perhaps at the cost of some of the additional features) therefore a lower memory usage and disk usage (or network usage if the bundle is downloaded).
Implementation
We considered two main approaches to implement this functionality. We're leaning towards option 1 given the concerns around the other option, but we'd like to open the discussion for the community and also hear some of the ideas.
Option 1: keep wrapping
call
instructionsCurrently stack traces are implemented by adding extra code around
call
instruction for saving the frame on the stack before the instruction, and popping it after the instruction is executed. This approach assumes that we'll re-use the existing mechanism, but the code for pushing the stack trace will be simplified and configurable.Right now the --enable-dump-call-stack generates a code for:
We'd like to introduce a new flag (the name is yet to be decided, perhaps something like
--dump-call-stack-detail-level
but it's yet to be decided) which will allows to define what's being tracked as part of each call and the level of validation. Each frame will include the following information:For now we'll completely disable validation for stack overflows, but if required, this can be added as an option too in the future.
This feature will not allow to include any other information - the
--enable-dump-call-stack
flag should be used instead if additional data is required.To avoid additional instructions, we'll do the following:
The benefit of this approach is simplicity of the implementation as we can rely on the existing mechanism; we already implemented a proof of concept and tested it on a few examples. The major disadvantage (compare to option 2) is a need for instrumenting every single
call
instruction; based on a few programs we've noticed that there's usually 5-10x morecall
instructions than function definitions, so option 2) might result in a better code. Another disadvantage is that we'll need to make changes in the runtime (as the runtime, when generating stack trace, relies on theprev
field from theAOTFrame
now - we'll use AOT feature flags to detect which option is used), so users would have to update both AOT code and the runtime itself.Option 2
In this option, instead of instrumenting every
call
instruction, we'll add a code for pushing the frame at the beginning of the function, and code for popping frame will be added on termination. Given that based on our experiments there's usually significantly more calls than function definitions, there should be a visible code size reduction too, which is what we're trying to achieve here. Another benefit of this approach is that the code compiled with a new compiler can work out of the box with the older runtimes (at least 2.x due to abi compatibility). There's a few concerns though:call
instruction just like we do today.end
in the function and before eachreturn
).Tasks
The text was updated successfully, but these errors were encountered: