Context switching performance #39

Closed
2dav opened this issue Feb 22, 2023 · 3 comments

2dav commented Feb 22, 2023

Hey,
while playing with coroutines and libfringe, I found that a vital part of context-switching performance comes down to pop+jmp vs ret.
This comment on HN sheds some light:

jmp rax (or any register really) uses the indirect jump prediction, while ret uses a special dedicated predictor that uses a stack (the Return Stack Buffer or RSB), populated by call instructions, to predict the return address. In the coroutine case, the ret does not jump to the address of the last call so it will be mispredicted many time.
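
To make the quoted explanation concrete, here is a toy model of a return-stack predictor. It is only a sketch: the `ReturnStack` type, the address constants, and the unbounded stack are assumptions for illustration, not how any particular CPU implements its RSB.

```rust
/// Toy model of a return-stack predictor (hypothetical, unbounded;
/// real hardware has a small fixed number of entries).
struct ReturnStack {
    entries: Vec<usize>,
    mispredicts: usize,
}

impl ReturnStack {
    fn new() -> Self {
        Self { entries: Vec::new(), mispredicts: 0 }
    }

    /// A `call` pushes the address of the instruction that follows it.
    fn call(&mut self, return_addr: usize) {
        self.entries.push(return_addr);
    }

    /// A `ret` is predicted to go to the most recently pushed address.
    /// If the real target differs, count a misprediction.
    fn ret(&mut self, actual_target: usize) {
        if self.entries.pop() != Some(actual_target) {
            self.mispredicts += 1;
        }
    }
}

/// Ordinary nested calls: every `ret` goes back to its own `call`.
fn nested_calls(depth: usize) -> usize {
    let mut rsb = ReturnStack::new();
    for addr in 0..depth {
        rsb.call(addr);
    }
    for addr in (0..depth).rev() {
        rsb.ret(addr);
    }
    rsb.mispredicts
}

/// Coroutine-style switching: each side calls the switch routine, but the
/// `ret` at its end resumes the *other* side, not the caller.
fn coroutine_switches(switches: usize) -> usize {
    const MAIN_RESUME: usize = 0x1000; // made-up resume addresses
    const CORO_RESUME: usize = 0x2000;
    let mut rsb = ReturnStack::new();
    for i in 0..switches {
        let (caller_resume, other_resume) = if i % 2 == 0 {
            (MAIN_RESUME, CORO_RESUME)
        } else {
            (CORO_RESUME, MAIN_RESUME)
        };
        rsb.call(caller_resume); // pushes the caller's own resume address
        rsb.ret(other_resume); // but control lands in the other context
    }
    rsb.mispredicts
}
```

In this model `nested_calls(n)` never mispredicts, while `coroutine_switches(n)` mispredicts on every switch, which is the pattern the comment describes for a `ret`-based context switch.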

This is still true of modern CPUs, Zen 3 at least.
Changing two lines on ~master to use pop+jmp instead of ret:

pop %rax
jmp *%rax

x86_64 zen3

| name | before ns/iter | after ns/iter | diff ns/iter | diff % | speedup |
| --- | --- | --- | --- | --- | --- |
| scoped_yield_bench | 22 | 9 | -13 | -59.09% | x 2.44 |
| single_yield_bench | 25 | 10 | -15 | -60.00% | x 2.50 |
| single_yield_with_bench | 22 | 9 | -13 | -59.09% | x 2.44 |
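
For anyone re-running the benches, the derived columns can be recomputed from the two ns/iter values. This is a small sketch; the `Delta` struct and `compare` function are hypothetical names, and the formulas (relative change against the "before" value, speedup as before/after) are assumed from the numbers in the table.

```rust
/// Derived comparison columns for one benchmark row (hypothetical helper).
struct Delta {
    diff_ns: i64,
    diff_pct: f64,
    speedup: f64,
}

/// Computes diff, diff % (relative to `before_ns`) and speedup (before / after).
fn compare(before_ns: u32, after_ns: u32) -> Delta {
    let before = f64::from(before_ns);
    let after = f64::from(after_ns);
    Delta {
        diff_ns: i64::from(after_ns) - i64::from(before_ns),
        diff_pct: (after - before) / before * 100.0,
        speedup: before / after,
    }
}

/// Formats a row the way the table prints it, e.g. "-13 -59.09% x 2.44".
fn format_row(d: &Delta) -> String {
    format!("{} {:.2}% x {:.2}", d.diff_ns, d.diff_pct, d.speedup)
}
```

For example, `format_row(&compare(22, 9))` yields the same figures as the `scoped_yield_bench` row.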

perf output for one of the benches

| metric | before | after | unit |
| --- | --- | --- | --- |
| page-faults | 779.883 | 6457 | /sec |
| stalled-cycles-frontend | 12.93% | 0.08% | frontend cycles idle |
| stalled-cycles-backend | 1.30% | 48.56% | backend cycles idle |
| instructions | 1.32 | 3.01 | insn per cycle |
| branches | 1.247 | 2.754 | G/sec |
| branch-misses | 6.61% | 0.03% | of all branches |
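
The branch rows can be combined into an absolute misprediction rate. This is a rough sketch that assumes the `branches` figure is in G/sec and `branch-misses` is a percentage of those branches, as perf reports them; `mispredicts_per_sec` is a hypothetical helper name.

```rust
/// Mispredicted branches per second, given a branch rate in G/sec
/// and a miss rate in percent of all branches.
fn mispredicts_per_sec(branches_g_per_sec: f64, miss_pct: f64) -> f64 {
    branches_g_per_sec * 1e9 * (miss_pct / 100.0)
}

/// How many times fewer mispredictions per second occur after the change.
fn reduction_factor(before: f64, after: f64) -> f64 {
    before / after
}
```

With the numbers above, `mispredicts_per_sec(1.247, 6.61)` is roughly 82 million per second and `mispredicts_per_sec(2.754, 0.03)` is under 1 million per second, even though the branch rate itself more than doubled.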

I don't have other hardware at hand right now, but I can test this on a MacBook M1 this week.

@Xudong-Huang
Owner

Great! Thanks for the findings!

@Xudong-Huang
Owner

I don't have an aarch64 platform, so I may need help from other people.
It could be something like the following to replace the ret instruction:

ldp x2, x1, [sp], #16 
br  x1

2dav commented Feb 23, 2023

Apple M1 2020

| name | before ns/iter | after ns/iter | diff ns/iter | diff % | speedup |
| --- | --- | --- | --- | --- | --- |
| scoped_yield_bench | 22 | 11 | -11 | -50.00% | x 2.00 |
| single_yield_bench | 23 | 12 | -11 | -47.83% | x 1.92 |
| single_yield_with_bench | 23 | 11 | -12 | -52.17% | x 2.09 |

I'm not familiar with ARM assembly, but from some quick googling it seems that ret on ARM doesn't pop the return address off the stack; it reads it from the LR (x30) register, which is already populated at the return point. So the required change is:

br x30

This passes all of the tests and the bench workload.
