Use TCO of C compiler to speed up emulation #95
Run clang-format-12 -i src/*.[ch] to indent.
Please read https://github.com/sysprog21/rv32emu/blob/master/CONTRIBUTING.md carefully.
Quote from Wikipedia:
The term "refactoring" should not be used here, because this change does alter the dispatching and instruction emulation behavior.
For clang support (the major compiler on macOS), we can explicitly define MUST_TAIL as follows:
#if defined(__has_attribute) && __has_attribute(musttail)
/* Clang requires a special tail recursion attribute to use tail recursion. */
#define MUST_TAIL __attribute__((musttail))
#else
#define MUST_TAIL
#endif
See https://github.com/protocolbuffers/protobuf/blob/main/src/google/protobuf/port_def.inc
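A minimal sketch of how such a macro might be used, not the project's actual code. The helper `sum_to` is hypothetical and exists only for illustration. This variant guards `__has_attribute` in a separate `#ifndef`, because compilers that lack `__has_attribute` may fail to preprocess `__has_attribute(musttail)` when it appears in the same `#if` expression as the `defined` check.

```c
#include <assert.h>
#include <stdint.h>

/* Fallback so that __has_attribute(...) always preprocesses cleanly. */
#ifndef __has_attribute
#define __has_attribute(x) 0
#endif

#if __has_attribute(musttail)
/* The compiler must emit a tail call here or reject the code. */
#define MUST_TAIL __attribute__((musttail))
#else
/* No guarantee: the tail call is left to the optimizer. */
#define MUST_TAIL
#endif

/* Hypothetical helper: sums 1..n with an accumulator, so the recursive
 * call is the last action of the function (tail position). */
static uint64_t sum_to(uint32_t n, uint64_t acc)
{
    assert(n <= 1000000); /* keep the fallback recursion depth bounded */
    if (n == 0)
        return acc;
    MUST_TAIL return sum_to(n - 1, acc + n);
}
```

When `MUST_TAIL` expands to nothing, whether the frame is reused depends on the optimization level, so deep recursion is only safe where the attribute is available.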
TODO:
Clarify the performance metrics:
-----------------------------------------------------------------------
Test environment 3: Ubuntu Linux 20.04 on ThunderX2
Compiler: gcc 9.4.0
CoreMark test result:
Previous: 260.543173 Iterations/Sec
Now: 286.504547 Iterations/Sec
-----------------------------------------------------------------------
Test environment 4: Ubuntu Linux 20.04 on ThunderX2
Compiler: gcc 9.4.0
CoreMark test result:
Previous: 239.773443 Iterations/Sec
Now: 285.154751 Iterations/Sec
We should compare clang against gcc. By the way, ThunderX2 is based on an older microarchitecture, so Arm64-specific experiments should be carried out on eMAG instead. Drop the ThunderX2-related items.
Performance metrics
The experiments should be amended as follows to make the comparisons more self-explanatory.
The reason for the gcc-aarch64 build's lack of TCO would then become clear. For now, change the state of this pull request to "draft", because it is not as effective as computed-goto in terms of the improvement ratio.
We need to modify the function emulate into a recursive version to meet the requirement of tail-call optimization (TCO). To achieve this, I add a variable is_tail to the struct rv_insn_t to help determine whether the basic block has terminated. With this variable, we can rewrite the function emulate into a self-recursive function. Running the CoreMark benchmark now produces faster results than before, as the test results below show.

| Microprocessor | Compiler | CoreMark w/ commit 285a988 | CoreMark w/ PR sysprog21#95 | Speedup |
|----------------+----------+----------------------------+-----------------------------+---------|
| Core i7-8700   | clang-15 | 811.6384112                | 838.7883352                 | +3.3%   |
|----------------+----------+----------------------------+-----------------------------+---------|
| Core i7-8700   | gcc-11   | 848.3487534                | 900.1869588                 | +6.1%   |
|----------------+----------+----------------------------+-----------------------------+---------|
| eMAG 8180      | clang-15 | 272.723566                 | 295.1729862                 | +8.3%   |
|----------------+----------+----------------------------+-----------------------------+---------|
| eMAG 8180      | gcc-11   | 308.3846342                | 313.7543564                 | +1.7%   |

Previously, when the function emulate terminated, it returned to the function block_emulate, because the calling route was rv_step -> block_emulate -> emulate -> block_emulate -> emulate -> ... . As a result, a new stack frame was created each time emulate was called. The current calling route is rv_step -> emulate -> emulate -> ..., so emulate can now reuse the same stack frame thanks to TCO. That is, every instruction in a basic block is executed by emulate within the same stack frame, which saves the overhead of creating stack frames.
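The self-recursive shape described above can be sketched as follows. This is an illustration under stated assumptions, not the code in this pull request: the structs are heavily simplified stand-ins, the opcode field and the single addi case are hypothetical, and the IR entries of a basic block are assumed to be stored contiguously so that `ir + 1` is the next instruction.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins; the real riscv_t and rv_insn_t carry more state. */
typedef struct {
    uint32_t X[32];
    uint32_t PC;
    uint64_t csr_cycle;
} riscv_t;

typedef struct {
    uint8_t opcode; /* hypothetical: 0 = addi, anything else unsupported */
    uint8_t rd, rs1;
    int32_t imm;
    uint8_t insn_len;
    bool is_tail; /* true for the last instruction of the basic block */
} rv_insn_t;

static bool emulate(riscv_t *rv, const rv_insn_t *ir)
{
    switch (ir->opcode) {
    case 0: /* addi */
        rv->X[ir->rd] = rv->X[ir->rs1] + (uint32_t) ir->imm;
        break;
    default:
        return false;
    }
    rv->X[0] = 0; /* x0 is hard-wired to zero */
    rv->PC += ir->insn_len;
    rv->csr_cycle++;
    if (ir->is_tail)
        return true;
    /* The recursive call is the last action, so an optimizing compiler
     * can turn it into a jump that reuses the current stack frame. */
    return emulate(rv, ir + 1);
}

/* Runs a three-instruction block and returns the final value of x2. */
static uint32_t run_demo_block(void)
{
    static const rv_insn_t block[] = {
        {.opcode = 0, .rd = 1, .rs1 = 0, .imm = 5, .insn_len = 4},
        {.opcode = 0, .rd = 2, .rs1 = 1, .imm = 7, .insn_len = 4},
        {.opcode = 0, .rd = 2, .rs1 = 2, .imm = -2, .insn_len = 4,
         .is_tail = true},
    };
    riscv_t rv = {.PC = 0};
    return emulate(&rv, block) ? rv.X[2] : UINT32_MAX;
}
```

Without a guarantee such as a musttail attribute, whether the recursive call actually reuses the frame is up to the compiler and the optimization level.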
We can eliminate the trailing
--- a/Makefile
+++ b/Makefile
@@ -5,6 +5,7 @@ OUT ?= build
BIN := $(OUT)/rv32emu
CFLAGS = -std=gnu99 -O2 -Wall -Wextra
+CFLAGS += -Wno-unused-label
CFLAGS += -include src/common.h
# Set the default stack pointer
--- a/src/decode.h
+++ b/src/decode.h
@@ -166,6 +166,13 @@ enum {
#undef _
};
+/* can-branch information for each RISC-V instruction */
+enum {
+#define _(inst, can_branch) __rv_insn_##inst##_canbranch = can_branch,
+ RISCV_INSN_LIST
+#undef _
+};
+
/* clang-format off */
/* instruction decode masks */
enum {
--- a/src/emulate.c
+++ b/src/emulate.c
@@ -259,7 +259,14 @@ static inline bool insn_is_misaligned(uint32_t pc)
static bool do_##inst(riscv_t *rv UNUSED, const rv_insn_t *ir UNUSED) \
{ \
rv->X[rv_reg_zero] = 0; \
- code rv->PC += ir->insn_len; \
+ code; \
+ if (__rv_insn_##inst##_canbranch) { \
+ /* can branch */ \
+ rv->csr_cycle++; \
+ return true; \
+ } \
+ nextop: \
+ rv->PC += ir->insn_len; \
rv->csr_cycle++; \
if (ir->tailcall) \
return true;
Then, we can rewrite
/* BEQ: Branch if Equal */
RVOP(beq, {
const uint32_t pc = rv->PC;
- if (rv->X[ir->rs1] == rv->X[ir->rs2]) {
- rv->PC += ir->imm;
- /* check instruction misaligned */
- if (unlikely(insn_is_misaligned(rv->PC))) {
- rv->compressed = false;
- rv_except_insn_misaligned(rv, pc);
- return false;
- }
- /* increment the cycles csr */
- rv->csr_cycle++;
- /* can branch */
- rv->csr_cycle++;
- return true;
+ if (rv->X[ir->rs1] != rv->X[ir->rs2])
+ goto nextop;
+
+ rv->PC += ir->imm;
+ /* check instruction misaligned */
+ if (unlikely(insn_is_misaligned(rv->PC))) {
+ rv->compressed = false;
+ rv_except_insn_misaligned(rv, pc);
+ return false;
}
})
/* BNE: Branch if Not Equal */
Code duplication should be avoided at all times. Each RISC-V instruction ought to see the statement
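The can-branch enum in the diff above relies on the X-macro pattern: one list of instructions is expanded several times with different definitions of `_`. A minimal sketch of that pattern follows, assuming a hypothetical three-entry list `DEMO_INSN_LIST` and `demo_` prefixed names in place of the real RISCV_INSN_LIST and its identifiers.

```c
#include <stdbool.h>

/* Hypothetical list: instruction name and whether it can branch. */
#define DEMO_INSN_LIST \
    _(addi, 0)         \
    _(beq, 1)          \
    _(jal, 1)

/* First expansion: one opcode constant per instruction. */
enum {
#define _(inst, can_branch) demo_insn_##inst,
    DEMO_INSN_LIST
#undef _
};

/* Second expansion: one compile-time can-branch constant per
 * instruction, mirroring the enum added in the diff. */
enum {
#define _(inst, can_branch) demo_insn_##inst##_canbranch = can_branch,
    DEMO_INSN_LIST
#undef _
};

/* Third expansion: a runtime table indexed by opcode. */
static const bool demo_can_branch[] = {
#define _(inst, can_branch) [demo_insn_##inst] = can_branch,
    DEMO_INSN_LIST
#undef _
};
```

Because the per-instruction constant is known at compile time, a test such as `if (demo_insn_beq_canbranch)` inside a generated handler can be folded away by the compiler, so non-branching handlers pay nothing for it.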
We adhere to the wasm3 implementation, which separates all instruction emulations, and organize them into a function table. After performance analysis, we discovered that the emulator spent a long time calculating the offset into the function table. We therefore alter struct rv_insn_t by adding the member opfunc, so that the instruction emulation function is assigned directly to the IR. Running the CoreMark benchmark now produces faster results than before, as the test results below show.

| Microprocessor | Compiler | CoreMark w/ commit f2da162 | CoreMark w/ PR sysprog21#95 | Speedup |
|----------------+----------+----------------------------+-----------------------------+---------|
| Core i7-8700   | clang-15 | 836.4849530                | 971.9516670                 | +13.9%  |
|----------------+----------+----------------------------+-----------------------------+---------|
| Core i7-8700   | gcc-12   | 888.3423808                | 963.3369450                 | +7.8%   |
|----------------+----------+----------------------------+-----------------------------+---------|
| eMAG 8180      | clang-15 | 286.0007652                | 335.396515                  | +20.5%  |
|----------------+----------+----------------------------+-----------------------------+---------|
| eMAG 8180      | gcc-12   | 259.6389222                | 332.561175                  | +14.0%  |

Previously, we had to calculate the jump address using a method such as switch-case, computed-goto, or a function table; this is no longer necessary.
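One way to picture the opfunc idea is to index the function table once, at decode time, and store the result in the IR. The sketch below assumes simplified structs, a hypothetical two-entry `handler_table`, and hypothetical helpers `bind_handlers` and `run_block`; only the member name opfunc comes from the description above.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t X[32];
} riscv_t; /* simplified stand-in */

typedef struct rv_insn rv_insn_t;
typedef void (*opfunc_t)(riscv_t *rv, const rv_insn_t *ir);

struct rv_insn {
    uint8_t opcode; /* hypothetical index into handler_table */
    uint8_t rd, rs1;
    int32_t imm;
    opfunc_t opfunc; /* handler bound once at decode time */
};

static void do_addi(riscv_t *rv, const rv_insn_t *ir)
{
    rv->X[ir->rd] = rv->X[ir->rs1] + (uint32_t) ir->imm;
}

static void do_slli(riscv_t *rv, const rv_insn_t *ir)
{
    rv->X[ir->rd] = rv->X[ir->rs1] << (ir->imm & 0x1f);
}

static const opfunc_t handler_table[] = {do_addi, do_slli};

/* Decode step: the table is indexed here, once per decoded instruction. */
static void bind_handlers(rv_insn_t *block, size_t n)
{
    for (size_t i = 0; i < n; i++)
        block[i].opfunc = handler_table[block[i].opcode];
}

/* Execute step: no table indexing, only a call through the stored pointer. */
static void run_block(riscv_t *rv, const rv_insn_t *block, size_t n)
{
    for (size_t i = 0; i < n; i++)
        block[i].opfunc(rv, &block[i]);
}

/* Computes x1 = 3, then x1 = x1 << 4, and returns x1. */
static uint32_t demo_bound_dispatch(void)
{
    rv_insn_t block[] = {
        {.opcode = 0, .rd = 1, .rs1 = 0, .imm = 3},
        {.opcode = 1, .rd = 1, .rs1 = 1, .imm = 4},
    };
    riscv_t rv = {{0}};
    bind_handlers(block, 2);
    run_block(&rv, block, 2);
    return rv.X[1];
}
```

The cost of the lookup moves from every execution of an instruction to its single decode, which pays off when blocks are executed many times.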
The git commit messages were outdated. Measure based on latest code changes and rework the descriptions. In particular, potential concerns on TCO should be addressed.
To meet the tail-call optimization (TCO) requirement, we must convert the function emulate into a recursive version. To accomplish this, we add a variable tailcall to the struct rv_insn_t to help determine whether the basic block has terminated. With this variable, we can rewrite the function emulate into a self-recursive function. However, after performance analysis, we discovered that the emulator spent a significant amount of time calculating the jump address. As a result, we stick with the wasm3 implementation, which separates all instruction emulations, and modify struct rv_insn_t by adding the member impl, so that the instruction emulation function is assigned directly to the IR. Running the CoreMark benchmark now produces faster results than before, as the test results below show.

| Microprocessor | Compiler | CoreMark w/ commit f2da162 | CoreMark w/ PR sysprog21#95 | Speedup |
|----------------+----------+----------------------------+-----------------------------+---------|
| Core i7-8700   | clang-15 | 836.4849530                | 971.9516670                 | +13.9%  |
|----------------+----------+----------------------------+-----------------------------+---------|
| Core i7-8700   | gcc-12   | 888.3423808                | 963.3369450                 | +7.8%   |
|----------------+----------+----------------------------+-----------------------------+---------|
| eMAG 8180      | clang-15 | 286.0007652                | 335.396515                  | +20.5%  |
|----------------+----------+----------------------------+-----------------------------+---------|
| eMAG 8180      | gcc-12   | 259.6389222                | 332.561175                  | +14.0%  |

Previously, when the function emulate terminated, it returned to the function block_emulate, because the calling sequence was rv_step -> block_emulate -> emulate -> block_emulate -> emulate -> ... . As a result, a stack frame was created each time emulate was called. In addition, the jump address had to be calculated in emulate using a method such as switch-case or computed-goto.

Now we can invoke instruction emulation directly, and the calling route is rv_step -> instruction emulation -> instruction emulation -> ..., so the instruction emulation functions can reuse the same stack frame thanks to TCO. That is, every instruction in a basic block is emulated within the same stack frame, which saves the overhead of creating stack frames.
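The calling route rv_step -> instruction emulation -> instruction emulation -> ... can be sketched as a chain of handlers that tail-call each other through the impl pointer. This is a sketch under assumptions, not the merged code: the structs are simplified, `rv_step_block` and `demo_chain` are hypothetical names, and the IR entries of a block are assumed to be contiguous. The members tailcall and impl follow the description above.

```c
#include <stdbool.h>
#include <stdint.h>

#ifndef __has_attribute
#define __has_attribute(x) 0
#endif
#if __has_attribute(musttail)
#define MUST_TAIL __attribute__((musttail))
#else
#define MUST_TAIL /* tail call left to the optimizer */
#endif

typedef struct {
    uint32_t X[32];
    uint32_t PC;
    uint64_t csr_cycle;
} riscv_t; /* simplified stand-in */

typedef struct rv_insn rv_insn_t;
struct rv_insn {
    uint8_t rd, rs1;
    int32_t imm;
    uint8_t insn_len;
    bool tailcall; /* true for the last instruction of the basic block */
    bool (*impl)(riscv_t *rv, const rv_insn_t *ir);
};

static bool do_addi(riscv_t *rv, const rv_insn_t *ir)
{
    rv->X[ir->rd] = rv->X[ir->rs1] + (uint32_t) ir->imm;
    rv->X[0] = 0;
    rv->PC += ir->insn_len;
    rv->csr_cycle++;
    if (ir->tailcall)
        return true; /* end of block: return to the caller of the chain */
    const rv_insn_t *next = ir + 1;
    /* Jump straight into the next handler; no switch, no table lookup. */
    MUST_TAIL return next->impl(rv, next);
}

/* Hypothetical entry point: starts the chain at the first IR of a block. */
static bool rv_step_block(riscv_t *rv, const rv_insn_t *first)
{
    return first->impl(rv, first);
}

/* Runs three addi x1, x1, 1 instructions and returns x1. */
static uint32_t demo_chain(void)
{
    static const rv_insn_t block[] = {
        {.rd = 1, .rs1 = 1, .imm = 1, .insn_len = 4, .impl = do_addi},
        {.rd = 1, .rs1 = 1, .imm = 1, .insn_len = 4, .impl = do_addi},
        {.rd = 1, .rs1 = 1, .imm = 1, .insn_len = 4, .tailcall = true,
         .impl = do_addi},
    };
    riscv_t rv = {.PC = 0};
    return rv_step_block(&rv, block) ? rv.X[1] : UINT32_MAX;
}
```

Every handler shares one signature, which is what allows a musttail-style attribute to be applied to the call through the function pointer.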
To meet the tail-call optimization (TCO) requirement, we must convert the function emulate into a recursive version. To accomplish this, we add a variable tailcall to the struct rv_insn_t to help determine whether the basic block has terminated. With this variable, we can rewrite the function emulate into a self-recursive function. However, after performance analysis, we discovered that the emulator spent a significant amount of time calculating the jump address. As a result, we stick with the wasm3 implementation, which separates all instruction emulations, and modify struct rv_insn_t by adding the member impl, so that the instruction emulation function is assigned directly to the IR.

CoreMark results:
| Model        | Compiler | f2da162 | PR #95  | Speedup |
|--------------+----------+---------+---------+---------|
| Core i7-8700 | clang-15 | 836.484 | 971.951 | +13.9%  |
|--------------+----------+---------+---------+---------|
| Core i7-8700 | gcc-12   | 888.342 | 963.336 | +7.8%   |
|--------------+----------+---------+---------+---------|
| eMAG 8180    | clang-15 | 286.000 | 335.396 | +20.5%  |
|--------------+----------+---------+---------+---------|
| eMAG 8180    | gcc-12   | 259.638 | 332.561 | +14.0%  |

Previously, when function "emulate" terminated, it returned to function "block_emulate", because the calling sequence was rv_step -> block_emulate -> emulate -> block_emulate -> emulate -> ... As a result, a stack frame was created each time function "emulate" was invoked. In addition, the jump address had to be calculated in function "emulate" using a method such as switch-case or computed-goto.

Now we can invoke instruction emulation directly, and the calling route is rv_step -> instruction emulation -> instruction emulation -> ..., so the instruction emulation can reuse the same stack frame thanks to TCO. That is, every instruction in a basic block is emulated within the same stack frame, which saves the overhead of creating stack frames.
In the previous implementation, fencei was treated as a branch instruction, but it was assigned a missing value in the new branch list. As a result, the emulator fails to pass the Zifencei test. See: sysprog21#95
According to sysprog21#95, computed-goto has been replaced by tail-call optimization (TCO). Therefore, the option about computed-goto is unnecessary.
We need to refactor the function emulate into a recursive version to meet the requirement of tail-call optimization (TCO). To achieve this, I add a variable is_tail to the struct rv_insn_t to help determine whether the basic block has terminated. With this variable, we can rewrite the function emulate into a self-recursive function.
Running the CoreMark and Dhrystone benchmarks now produces faster results than before.