-
Notifications
You must be signed in to change notification settings - Fork 1
riscv_dis_opts_batch_2
- Status: WITHDRAWN
(superseded by Disassembler: Core Optimization 1-2 (Mapping symbols)) - Branch:
riscv-dis-opts-batch-2
- Tracking PR: #31 (view Pull Request and Diff)
This is the second batch of the disassembler optimization.
Performance improvements achieved by batch 1 on libraries were not great. I found a cause for this.
Following dumps are excerpts from the gprof profiler output of: objdump -d vmlinux-5.18-debug1.o
(Linux kernel 5.18, vmlinux.o
with debug symbols enabled). This is the one of the worst cases I have observed.
Name | Value |
---|---|
File Size | 683 065 480 |
Text Sections | 12 |
Symbols | 9 554 485 |
Mapping Symbols | 2 225 |
Disassembler Lines (*) | 2 448 651 |
(*): A disassembler line is a line with address: representation (instruction|data)
.
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
39.90 40.47 40.47 2448651 0.00 0.00 print_insn_riscv
33.67 74.62 34.15 4301254721 0.00 0.00 riscv_get_map_state
24.38 99.35 24.73 compare_symbols
0.46 99.82 0.47 8500 0.00 0.01 disassemble_section
0.42 100.25 0.43 520710 0.00 0.00 find_symbol_for_address
0.35 100.61 0.36 2448625 0.00 0.00 riscv_disassemble_insn
0.08 100.69 0.08 9564476 0.00 0.00 bfd_elf64_swap_symbol_in
0.08 100.77 0.08 _init
0.07 100.84 0.07 96120516 0.00 0.00 match_opcode
0.06 100.90 0.06 17831262 0.00 0.00 objdump_styled_sprintf
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
42.18 45.72 45.72 2448651 0.00 0.00 print_insn_riscv
33.32 81.83 36.11 4301254721 0.00 0.00 riscv_get_map_state
22.88 106.63 24.80 compare_symbols
0.44 107.11 0.48 520710 0.00 0.00 find_symbol_for_address
0.37 107.51 0.40 8500 0.00 0.01 disassemble_section
0.26 107.79 0.28 2448625 0.00 0.00 riscv_disassemble_insn
0.11 107.91 0.12 12105719 0.00 0.00 match_opcode
0.07 107.99 0.08 _init
0.06 108.06 0.07 17831262 0.00 0.00 objdump_styled_sprintf
0.06 108.13 0.07 9564476 0.00 0.00 bfd_elf64_swap_symbol_in
They look pretty similar. The main overhead of this case stays the same after the Batch 1.
As you can see, there are three major functions that run so long:
-
print_insn_riscv
(riscv_search_mapping_symbol
) riscv_get_map_state
compare_symbols
compare_symbols
function has a good reason to be slow (even though, more optimization may be possible).
So, let's focus on the first two.
print_insn_riscv
is a dispatcher function.
Actually, most of the time we run this function, we are running inlined riscv_search_mapping_symbol
function. It searches for suitable mapping symbol.
Mapping symbols are special-purpose symbols that determine whether the specified location is code or data.
We parse $x
symbol as a start of the code region and $d
as a start of the data.
To do a mapping-aware disassembling, we have to check the symbol table and update the state (code or data) if necessary.
This is the role of the riscv_get_map_state
function.
... Yes, the cause is clear. riscv_search_mapping_symbol
and riscv_data_length
functions use the linear search algorithm to find mapping symbols but this is very inefficient. There are other factors:
-
riscv_search_mapping_symbol
usedsymtab_pos
for search hints (usually a beginning of the function disassembling) but this is not a good hint to search. This is because mapping symbols are not (or rarely) related to the functions. Most of the time, it ends up with scanning many symbols per instruction. - The symbol table contains many non-mapping symbols.
- The symbol table is sorted by address, then section (not section, then address).
We normally have to check whether the section of the mapping symbol matches to the address we are disassembling. This sort order is not good for finding mapping symbols.
In fact, not inlined riscv_get_map_state
function is called... about 4.30 billion times... for just about 2.45 million instructions/words.
This patchset implements:
- A binary search to look up suitable mapping symbol and
- Separate mapping symbol table, sorted by section and then address
(unless the section to disassemble isNULL
).
Note that, if we filtered the mapping symbols and made a smaller table, even a binary search on this table is not a small cost. Not only a binary search would traverse 10 or some elements, that would disrupt memory cache and branch prediction.
If we can remember and reuse the last mapping symbol (used), we could instead use "very short" linear search. And we can. "Very short" means, it usually checks only one table entry to determine whether the current location is code or data (worst case scenario is checking next 3 elements, unless the input ELF file is hostile to the disassembler).
Reusing the last mapping symbol is what the current disassembler does. However, it's not efficient much because the symbol table is dirty and must go backward to find suitable mapping symbol. After making the smaller table, we no longer need to scan unrelated symbols.
We can reduce the number of binary search further.
Before this patch, it did not try to reuse the last mapping symbol if a "stop offset" is changed (usually when the target function changes).
However, on usual objdump
usecases, we continue disassembling (we usually don't skip much) so that most of the last mapping symbols can be actually reused. So, the author changed so that we reuse the last mapping symbol when the last printed address plus its size matches the current address without changing the section.
As a result, 4.30 billion times of the symbol traversal turns into about 2.46 million mapping symbol traversal with only 23 times of the binary search (plus one time scanning of all symbols). Yes. In this case, the overhead is less than 1/1,000. That would eliminate most of the non-disassembling target-specific overhead.
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
91.93 24.72 24.72 compare_symbols
1.97 25.25 0.53 2465633 0.00 0.00 print_insn_riscv
1.56 25.67 0.42 520710 0.00 0.00 find_symbol_for_address
1.56 26.09 0.42 8500 0.00 0.00 disassemble_section
0.52 26.23 0.14 _init
0.48 26.36 0.13 2448625 0.00 0.00 riscv_disassemble_insn
0.26 26.43 0.07 9564476 0.00 0.00 bfd_elf64_swap_symbol_in
0.22 26.49 0.06 19108712 0.00 0.00 riscv_get_map_state_by_name
0.22 26.55 0.06 9598391 0.00 0.00 bfd_elf_string_from_elf_section
0.19 26.60 0.05 1 0.05 1.77 disassemble_data
Now, print_insn_riscv
only consumes about 2% and riscv_get_map_state
is gone. The biggest bottleneck here is compare_symbols
(we cannot optimize it easily) that consumes over 90% of the run time.
This benchmark is performed on:
- Ubuntu 22.04 LTS
- AMD Ryzen 5 PRO 5650G processor
In the parallel run, I ran 6 parallel jobs with -j6
(corresponding 6 cores; although the processor has 12 hardware threads, -j12
just slowed the benchmark).
This is relative to commit b82817674f4
plus two prerequisites above. Batch 1 required some modification to rebase against Disassembler: Fix address printer.
Program | Improvements | Aggr. 1+2 | Notes |
---|---|---|---|
Busybox 1.35.1 (RV64GC ) |
0.1-2.5% | 33.0-33.9% | |
OpenSBI 1.1 (generic fw_*.elf ) |
2.8-3.0% | 47.8-48.2% | |
Linux kernel 5.18 (vmlinux ) |
10.7-17.7% | 44.2-51.1% | |
Linux kernel 5.18 (vmlinux.o ) |
86.2-257.3% | 115.5-269.1% | Not finally linked |
glibc (libc.so.6 ) |
3.0-3.2% | 31.8-36.8% |
Regular programs won't benefit much. The Linux kernel (vmlinux
) is an exception.
Because this patchset removes most of the archive/symbol overhead, performance on vmlinux.o
is drastically improved.
Program | Improvements | Aggr. 1+2 |
---|---|---|
glibc (libc.a ) |
471.4-496.0% | 520.1-547.3% |
newlib (libc.a ) |
114.5-118.9% | 137.4-145.8% |
Just like vmlinux.o
, static libraries (*.a
) benefit well from this optimization.
Program | Improvements | Aggr. 1+2 |
---|---|---|
Linux kernel 5.18 (vmlinux ) |
1.5-1.6% | 72.6-98.4% |
Random files (/dev/urandom ) |
0.8-1.4% | 94.1-116.6% |
1M (1048576) CSR instructions | 1.6% | 386.5% |
Binary file handling is not improved and most of "changes" should be regarded as noise.
Program | Improvements | Aggr. 1+2 |
---|---|---|
Linux kernel 5.18 (vmlinux ) with debug info |
(-0.3)% | 30.4% |
Linux kernel 5.18 (vmlinux ) without debug info |
(-0.6)% | 89.0% |
OpenSBI 1.1 (generic fw_*.elf ) |
(-0.5)-(-0.3)% | 65.2-65.6% |
1M (1048576) CSR instructions (ELF) | (-0.8)% | 201.8% |
GDB performance is slightly reduced.
System | Path | N | Improvements | Aggr. 1+2 |
---|---|---|---|---|
Ubuntu 22.04 LTS (image for HiFive Unmatched) | /usr/bin |
563 | 0.8% | 28.2% |
Debian unstable (as of 2022-07-20) | /usr/bin |
269 | 0.2% | 27.8% |
Ubuntu 22.04 LTS (image for HiFive Unmatched) | /usr/lib |
6797 | 67.4% | 94.5% |
Debian unstable (as of 2022-07-20) | /usr/lib |
548 | 161.9% | 193.7% |
/usr/lib
improvements are remarkable. This affects all-system disassembling performance below.
System | N | Improvements | Aggr. 1+2 |
---|---|---|---|
Ubuntu 22.04 LTS (image for HiFive Unmatched) | 7666 | 47.4% | 67.8% |
Debian unstable (as of 2022-07-20) | 946 | 115.3% | 129.6% |
System | Path | N | Improvements | Aggr. 1+2 |
---|---|---|---|---|
Ubuntu 22.04 LTS (image for HiFive Unmatched) | /usr/bin |
563 | 0.4% | 51.2% |
Debian unstable (as of 2022-07-20) | /usr/bin |
269 | 0.8% | 44.5% |
System | N | Improvements | Aggr. 1+2 |
---|---|---|---|
Ubuntu 22.04 LTS (image for HiFive Unmatched) | 7666 | 0.8% | 59.3% |
Debian unstable (as of 2022-07-20) | 946 | 0.6% | 46.1% |