riscv_dis_opts_batch_2

Disassembler: Core improvements and optimizations (batch 2)

Status: WITHDRAWN
(superseded by Disassembler: Core Optimization 1-2 (Mapping symbols))
Branch: riscv-dis-opts-batch-2
Tracking PR: #31 (view Pull Request and Diff)

Requires

Feature Description

This is the second batch of the disassembler optimization.

Performance improvements achieved by batch 1 on libraries were not great. I found a cause for this.

Benchmarking Overhead

Following dumps are excerpts from the gprof profiler output of: objdump -d vmlinux-5.18-debug1.o (Linux kernel 5.18, vmlinux.o with debug symbols enabled). This is the one of the worst cases I have observed.

Name	Value
File Size	683 065 480
Text Sections	12
Symbols	9 554 485
Mapping Symbols	2 225
Disassembler Lines (*)	2 448 651

(*): A disassembler line is a line with address: representation (instruction|data).

Before Batch 1

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds      calls   s/call   s/call  name
 39.90     40.47    40.47    2448651     0.00     0.00  print_insn_riscv
 33.67     74.62    34.15 4301254721     0.00     0.00  riscv_get_map_state
 24.38     99.35    24.73                               compare_symbols
  0.46     99.82     0.47       8500     0.00     0.01  disassemble_section
  0.42    100.25     0.43     520710     0.00     0.00  find_symbol_for_address
  0.35    100.61     0.36    2448625     0.00     0.00  riscv_disassemble_insn
  0.08    100.69     0.08    9564476     0.00     0.00  bfd_elf64_swap_symbol_in
  0.08    100.77     0.08                               _init
  0.07    100.84     0.07   96120516     0.00     0.00  match_opcode
  0.06    100.90     0.06   17831262     0.00     0.00  objdump_styled_sprintf

After Batch 1

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self                self     total
 time   seconds   seconds      calls   s/call   s/call  name
 42.18     45.72    45.72    2448651     0.00     0.00  print_insn_riscv
 33.32     81.83    36.11 4301254721     0.00     0.00  riscv_get_map_state
 22.88    106.63    24.80                               compare_symbols
  0.44    107.11     0.48     520710     0.00     0.00  find_symbol_for_address
  0.37    107.51     0.40       8500     0.00     0.01  disassemble_section
  0.26    107.79     0.28    2448625     0.00     0.00  riscv_disassemble_insn
  0.11    107.91     0.12   12105719     0.00     0.00  match_opcode
  0.07    107.99     0.08                               _init
  0.06    108.06     0.07   17831262     0.00     0.00  objdump_styled_sprintf
  0.06    108.13     0.07    9564476     0.00     0.00  bfd_elf64_swap_symbol_in

They look pretty similar. The main overhead of this case stays the same after the Batch 1.

Analysis

As you can see, there are three major functions that run so long:

print_insn_riscv (riscv_search_mapping_symbol)
riscv_get_map_state
compare_symbols

compare_symbols function has a good reason to be slow (even though, more optimization may be possible). So, let's focus on the first two.

print_insn_riscv is a dispatcher function. Actually, most of the time we run this function, we are running inlined riscv_search_mapping_symbol function. It searches for suitable mapping symbol.

Mapping symbols are special-purpose symbols that determine whether the specified location is code or data. We parse $x symbol as a start of the code region and $d as a start of the data. To do a mapping-aware disassembling, we have to check the symbol table and update the state (code or data) if necessary. This is the role of the riscv_get_map_state function.

... Yes, the cause is clear. riscv_search_mapping_symbol and riscv_data_length functions use the linear search algorithm to find mapping symbols but this is very inefficient. There are other factors:

riscv_search_mapping_symbol used symtab_pos for search hints (usually a beginning of the function disassembling) but this is not a good hint to search. This is because mapping symbols are not (or rarely) related to the functions. Most of the time, it ends up with scanning many symbols per instruction.
The symbol table contains many non-mapping symbols.
The symbol table is sorted by address, then section (not section, then address).
We normally have to check whether the section of the mapping symbol matches to the address we are disassembling. This sort order is not good for finding mapping symbols.

In fact, not inlined riscv_get_map_state function is called... about 4.30 billion times... for just about 2.45 million instructions/words.

What this Patchset Does

This patchset implements:

A binary search to look up suitable mapping symbol and
Separate mapping symbol table, sorted by section and then address
(unless the section to disassemble is NULL).

Note that, if we filtered the mapping symbols and made a smaller table, even a binary search on this table is not a small cost. Not only a binary search would traverse 10 or some elements, that would disrupt memory cache and branch prediction.

If we can remember and reuse the last mapping symbol (used), we could instead use "very short" linear search. And we can. "Very short" means, it usually checks only one table entry to determine whether the current location is code or data (worst case scenario is checking next 3 elements, unless the input ELF file is hostile to the disassembler).

Reusing the last mapping symbol is what the current disassembler does. However, it's not efficient much because the symbol table is dirty and must go backward to find suitable mapping symbol. After making the smaller table, we no longer need to scan unrelated symbols.

We can reduce the number of binary search further. Before this patch, it did not try to reuse the last mapping symbol if a "stop offset" is changed (usually when the target function changes). However, on usual objdump usecases, we continue disassembling (we usually don't skip much) so that most of the last mapping symbols can be actually reused. So, the author changed so that we reuse the last mapping symbol when the last printed address plus its size matches the current address without changing the section.

As a result, 4.30 billion times of the symbol traversal turns into about 2.46 million mapping symbol traversal with only 23 times of the binary search (plus one time scanning of all symbols). Yes. In this case, the overhead is less than 1/1,000. That would eliminate most of the non-disassembling target-specific overhead.

Benchmarking Optimization Effects

After Batch 2

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 91.93     24.72    24.72                             compare_symbols
  1.97     25.25     0.53  2465633     0.00     0.00  print_insn_riscv
  1.56     25.67     0.42   520710     0.00     0.00  find_symbol_for_address
  1.56     26.09     0.42     8500     0.00     0.00  disassemble_section
  0.52     26.23     0.14                             _init
  0.48     26.36     0.13  2448625     0.00     0.00  riscv_disassemble_insn
  0.26     26.43     0.07  9564476     0.00     0.00  bfd_elf64_swap_symbol_in
  0.22     26.49     0.06 19108712     0.00     0.00  riscv_get_map_state_by_name
  0.22     26.55     0.06  9598391     0.00     0.00  bfd_elf_string_from_elf_section
  0.19     26.60     0.05        1     0.05     1.77  disassemble_data

Now, print_insn_riscv only consumes about 2% and riscv_get_map_state is gone. The biggest bottleneck here is compare_symbols (we cannot optimize it easily) that consumes over 90% of the run time.

Benchmarking System

This benchmark is performed on:

Ubuntu 22.04 LTS
AMD Ryzen 5 PRO 5650G processor

In the parallel run, I ran 6 parallel jobs with -j6 (corresponding 6 cores; although the processor has 12 hardware threads, -j12 just slowed the benchmark).

Aggregate Performance Improvements

This is relative to commit b82817674f4 plus two prerequisites above. Batch 1 required some modification to rebase against Disassembler: Fix address printer.

`objdump -d` (ELF)

Program	Improvements	Aggr. 1+2	Notes
Busybox 1.35.1 (`RV64GC`)	0.1-2.5%	33.0-33.9%
OpenSBI 1.1 (generic `fw_*.elf`)	2.8-3.0%	47.8-48.2%
Linux kernel 5.18 (`vmlinux`)	10.7-17.7%	44.2-51.1%
Linux kernel 5.18 (`vmlinux.o`)	86.2-257.3%	115.5-269.1%	Not finally linked
glibc (`libc.so.6`)	3.0-3.2%	31.8-36.8%

Regular programs won't benefit much. The Linux kernel (vmlinux) is an exception.

Because this patchset removes most of the archive/symbol overhead, performance on vmlinux.o is drastically improved.

`objdump -d` (ELF-based archive)

Program	Improvements	Aggr. 1+2
glibc (`libc.a`)	471.4-496.0%	520.1-547.3%
newlib (`libc.a`)	114.5-118.9%	137.4-145.8%

Just like vmlinux.o, static libraries (*.a) benefit well from this optimization.

`objdump -D` (binary)

Program	Improvements	Aggr. 1+2
Linux kernel 5.18 (`vmlinux`)	1.5-1.6%	72.6-98.4%
Random files (`/dev/urandom`)	0.8-1.4%	94.1-116.6%
1M (1048576) CSR instructions	1.6%	386.5%

Binary file handling is not improved and most of "changes" should be regarded as noise.

`gdb`: `disas` of near all code region

Program	Improvements	Aggr. 1+2
Linux kernel 5.18 (`vmlinux`) with debug info	(-0.3)%	30.4%
Linux kernel 5.18 (`vmlinux`) without debug info	(-0.6)%	89.0%
OpenSBI 1.1 (generic `fw_*.elf`)	(-0.5)-(-0.3)%	65.2-65.6%
1M (1048576) CSR instructions (ELF)	(-0.8)%	201.8%

GDB performance is slightly reduced.

Batch: `objdump -d` on Linux distribution

Serial Run: All ELF Files Under the Directory

System	Path	N	Improvements	Aggr. 1+2
Ubuntu 22.04 LTS (image for HiFive Unmatched)	`/usr/bin`	563	0.8%	28.2%
Debian unstable (as of 2022-07-20)	`/usr/bin`	269	0.2%	27.8%
Ubuntu 22.04 LTS (image for HiFive Unmatched)	`/usr/lib`	6797	67.4%	94.5%
Debian unstable (as of 2022-07-20)	`/usr/lib`	548	161.9%	193.7%

/usr/lib improvements are remarkable. This affects all-system disassembling performance below.

Parallel Run: All (including data-only ELFs)

System	N	Improvements	Aggr. 1+2
Ubuntu 22.04 LTS (image for HiFive Unmatched)	7666	47.4%	67.8%
Debian unstable (as of 2022-07-20)	946	115.3%	129.6%

Batch: `objdump -D` (as binary) on Linux distribution

Serial Run: All ELF Files Under the Directory

System	Path	N	Improvements	Aggr. 1+2
Ubuntu 22.04 LTS (image for HiFive Unmatched)	`/usr/bin`	563	0.4%	51.2%
Debian unstable (as of 2022-07-20)	`/usr/bin`	269	0.8%	44.5%

Parallel Run: All (including data-only ELFs)

System	N	Improvements	Aggr. 1+2
Ubuntu 22.04 LTS (image for HiFive Unmatched)	7666	0.8%	59.3%
Debian unstable (as of 2022-07-20)	946	0.6%	46.1%

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

riscv_dis_opts_batch_2

Disassembler: Core improvements and optimizations (batch 2)

Requires

Feature Description

Benchmarking Overhead

Before Batch 1

After Batch 1

Analysis

What this Patchset Does

Benchmarking Optimization Effects

After Batch 2

Benchmarking System

Aggregate Performance Improvements

`objdump -d` (ELF)

`objdump -d` (ELF-based archive)

`objdump -D` (binary)

`gdb`: `disas` of near all code region

Batch: `objdump -d` on Linux distribution

Serial Run: All ELF Files Under the Directory

Parallel Run: All (including data-only ELFs)

Batch: `objdump -D` (as binary) on Linux distribution

Serial Run: All ELF Files Under the Directory

Parallel Run: All (including data-only ELFs)

Clone this wiki locally

riscv_dis_opts_batch_2

Disassembler: Core improvements and optimizations (batch 2)

Requires

Feature Description

Benchmarking Overhead

Before Batch 1

After Batch 1

Analysis

What this Patchset Does

Benchmarking Optimization Effects

After Batch 2

Benchmarking System

Aggregate Performance Improvements

objdump -d (ELF)

objdump -d (ELF-based archive)

objdump -D (binary)

gdb: disas of near all code region

Batch: objdump -d on Linux distribution

Serial Run: All ELF Files Under the Directory

Parallel Run: All (including data-only ELFs)

Batch: objdump -D (as binary) on Linux distribution

Serial Run: All ELF Files Under the Directory

Parallel Run: All (including data-only ELFs)

Clone this wiki locally

`objdump -d` (ELF)

`objdump -d` (ELF-based archive)

`objdump -D` (binary)

`gdb`: `disas` of near all code region

Batch: `objdump -d` on Linux distribution

Batch: `objdump -D` (as binary) on Linux distribution