Context switching performance #39

2dav · 2023-02-22T22:18:33Z

Hey,
while playing with coroutines and libfringe, I found that a vital part of context switching performance lies in using pop+jmp instead of ret.
This comment on HN sheds some light:

jmp rax (or any register really) uses the indirect jump prediction, while ret uses a special dedicated predictor that uses a stack (the Return Stack Buffer or RSB), populated by call instructions, to predict the return address. In the coroutine case, the ret does not jump to the address of the last call, so it will be mispredicted many times.

This still applies to modern CPUs, Zen 3 at least.
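To make the quoted explanation concrete, here is a toy Rust model of the RSB. It is only a sketch: real predictors are far more involved, and the type `ReturnStackBuffer` and the addresses `MAIN_RESUME` and `CORO_RESUME` are made up for illustration. It assumes each side enters the switch routine with a call, and that the ret at the end lands at the other side's resume point because the stack pointer was swapped in between.

```rust
/// Toy model of a Return Stack Buffer (RSB). It only captures
/// "call pushes a prediction, ret pops it and compares".
struct ReturnStackBuffer {
    entries: Vec<u64>,
    mispredictions: u32,
}

impl ReturnStackBuffer {
    fn new() -> Self {
        Self { entries: Vec::new(), mispredictions: 0 }
    }

    /// A call records the address of the instruction that follows it.
    fn call(&mut self, return_address: u64) {
        self.entries.push(return_address);
    }

    /// A ret is predicted to go back to the most recent call site.
    fn ret(&mut self, actual_target: u64) {
        if self.entries.pop() != Some(actual_target) {
            self.mispredictions += 1;
        }
    }
}

// Hypothetical resume addresses for the two sides of the switch.
const MAIN_RESUME: u64 = 0x1000;
const CORO_RESUME: u64 = 0x2000;

/// Each switch is entered by a call from one side, but its ret
/// resumes the other side.
fn mispredicted_switches(switches: u32) -> u32 {
    let mut rsb = ReturnStackBuffer::new();
    for i in 0..switches {
        let (own_resume, other_resume) = if i % 2 == 0 {
            (MAIN_RESUME, CORO_RESUME)
        } else {
            (CORO_RESUME, MAIN_RESUME)
        };
        rsb.call(own_resume); // call into the switch routine
        rsb.ret(other_resume); // ret after the stacks were swapped
    }
    rsb.mispredictions
}

/// Ordinary call/ret pairs always return to their own call site.
fn mispredicted_plain_calls(calls: u32) -> u32 {
    let mut rsb = ReturnStackBuffer::new();
    for _ in 0..calls {
        rsb.call(MAIN_RESUME);
        rsb.ret(MAIN_RESUME);
    }
    rsb.mispredictions
}

fn main() {
    println!("plain call/ret: {} mispredictions", mispredicted_plain_calls(10));
    println!("switch via ret: {} mispredictions", mispredicted_switches(10));
}
```

In this model every ret-based switch is mispredicted, while plain call/ret pairs never are. With pop+jmp the ret step never happens, so the RSB is not consulted and the indirect jump predictor handles the branch instead.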
Changing two lines on ~master:

In generator-rs/src/detail/asm/asm_x86_64_sysv_elf_gas.S (line 43 in 5888dac), replace

ret

with

pop %rax
jmp *%rax

x86_64 zen3

name	before ns/iter	after ns/iter	diff ns/iter	diff %	speedup
scoped_yield_bench	22	9	-13	-59.09%	x 2.44
single_yield_bench	25	10	-15	-60.00%	x 2.50
single_yield_with_bench	22	9	-13	-59.09%	x 2.44
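For anyone reproducing the numbers, the derived columns follow directly from the two ns/iter measurements. A small Rust sketch of that arithmetic (the function name `derive` is just for illustration, it is not part of the benchmark harness):

```rust
/// Derived columns for one benchmark row:
/// (diff ns/iter, diff %, speedup factor).
fn derive(before_ns: f64, after_ns: f64) -> (f64, f64, f64) {
    let diff = after_ns - before_ns;
    let diff_pct = diff / before_ns * 100.0;
    let speedup = before_ns / after_ns;
    (diff, diff_pct, speedup)
}

fn main() {
    // before/after values copied from the x86_64 zen3 table above
    let rows = [
        ("scoped_yield_bench", 22.0, 9.0),
        ("single_yield_bench", 25.0, 10.0),
        ("single_yield_with_bench", 22.0, 9.0),
    ];
    for (name, before, after) in rows {
        let (diff, pct, speedup) = derive(before, after);
        println!("{name}\t{diff}\t{pct:.2}%\tx {speedup:.2}");
    }
}
```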

perf output for one of the benches

metric	before	after	unit
page-faults	779.883	6457	/sec
stalled-cycles-frontend	12.93%	0.08%	frontend cycles idle
stalled-cycles-backend	1.30%	48.56%	backend cycles idle
instructions	1.32	3.01	insn per cycle
branches	1.247	2.754	G/sec
branch-misses	6.61%	0.03%	of all branches
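One way to read the branch rows together: multiplying the branch rate by the miss percentage gives an absolute misprediction rate. A rough Rust sketch, assuming the G/sec and percentage figures can be combined this way (the helper `misses_per_sec` is hypothetical):

```rust
/// Branch mispredictions per second, from a branch rate in G/sec
/// and a miss rate given as a percentage of all branches.
fn misses_per_sec(branches_g_per_sec: f64, miss_pct: f64) -> f64 {
    branches_g_per_sec * 1e9 * miss_pct / 100.0
}

fn main() {
    // Values copied from the perf table above.
    let before = misses_per_sec(1.247, 6.61);
    let after = misses_per_sec(2.754, 0.03);
    println!("before: {:.1} M misses/sec", before / 1e6);
    println!("after:  {:.2} M misses/sec", after / 1e6);
    println!("ratio:  {:.0}x fewer misses per second", before / after);
}
```

By this estimate the run goes from roughly 82 million mispredictions per second to under one million, even though more than twice as many branches retire per second.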

I don't have other hardware at hand right now, but I can test this on a MacBook M1 this week.


Xudong-Huang · 2023-02-23T00:58:23Z

Great! Thanks for the findings!

Xudong-Huang · 2023-02-23T05:00:36Z

I don't have an aarch64 platform, so I may need help from other people.
Something like the following could replace the ret instruction:

ldp x2, x1, [sp], #16 
br  x1

2dav · 2023-02-23T11:40:22Z

Apple M1 2020

name	before ns/iter	after ns/iter	diff ns/iter	diff %	speedup
scoped_yield_bench	22	11	-11	-50.00%	x 2.00
single_yield_bench	23	12	-11	-47.83%	x 1.92
single_yield_with_bench	23	11	-12	-52.17%	x 2.09

I'm not familiar with ARM assembly, but from a quick search it seems that ret on ARM doesn't pop the return address off the stack. Instead it reads it from the LR (x30) register, which is already populated at the return point, so the required change is:

In generator-rs/src/detail/asm/asm_aarch64_aapcs_macho_gas.S (line 51 in 4fcd324), replace

ret

with

br x30

This passes all of the tests and the bench workload.

Xudong-Huang added a commit that referenced this issue Feb 23, 2023

📝 change ret to jump when switch context on x86 (#39)

73cd25b

Xudong-Huang added a commit that referenced this issue Feb 23, 2023

📝 fix windows asm (#39)

a757d97

Xudong-Huang added a commit that referenced this issue Apr 21, 2023

📝 impove aarch64 switching performance (#39)

b7ebfbb

Xudong-Huang closed this as completed Jun 12, 2023
