Add SpinLock "pause" instructions and performance benchmark #443
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master     #443   +/-   ##
=========================================
  Coverage   94.37%   94.37%
=========================================
  Files         189      189
  Lines        8409     8410    +1
=========================================
+ Hits         7936     7937    +1
  Misses        473      473
Force-pushed from 0cf3ca8 to d446d88
CI is failing because of the fundamental issue described here: to avoid conflicts with Windows headers with …
    // processor.
    __yield();
#else
    // TODO: Issue PAUSE/YIELD on other architectures.
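For context, here is a hedged sketch of what an architecture-aware pause helper could look like. The `__yield()` call above is MSVC's ARM intrinsic; the x86 `_mm_pause` and the GCC/Clang inline-asm branches below are illustrative assumptions, not necessarily the exact code in this PR.

```cpp
// Sketch of an architecture-aware "pause" helper for a spinlock busy-wait.
// Assumptions: MSVC provides _mm_pause (x86/x64) and __yield (ARM/ARM64) via
// <intrin.h>; GCC/Clang accept inline asm for the PAUSE/YIELD instructions.
#include <thread>

#if defined(_MSC_VER)
#  include <intrin.h>
#endif

inline void CpuRelax() noexcept
{
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
  _mm_pause();  // x86 PAUSE: hints the core that this is a spin-wait loop.
#elif defined(_MSC_VER) && (defined(_M_ARM) || defined(_M_ARM64))
  __yield();    // ARM YIELD: lets the core/SMT sibling make progress.
#elif defined(__i386__) || defined(__x86_64__)
  __asm__ __volatile__("pause");
#elif defined(__arm__) || defined(__aarch64__)
  __asm__ __volatile__("yield");
#else
  std::this_thread::yield();  // Portable fallback where no pause instruction is known.
#endif
}
```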
Could we put an #error here for unimplemented architectures?
For the slow path below at line 113, the comment says the goal is ~1000 ns, but 1 ms is 10^6 ns; perhaps a typo here? And for std::this_thread::sleep_for, I think it makes little sense to sleep for less than one CPU quantum slice, which is usually 10 ms or more.
> Could we put an #error here for unimplemented architectures?

It's not an error, though, just a CPU power efficiency optimisation.

> For the slow path below at line 113, the comment says the goal is ~1000 ns, but 1 ms is 10^6 ns; perhaps a typo here? And for std::this_thread::sleep_for, I think it makes little sense to sleep for less than one CPU quantum slice, which is usually 10 ms or more.
Yeah, that's a typo. The idea is to bump an order of magnitude; I think this_thread::yield is also a bit higher in actual ns cost. I can update the comments. As for the sleep_for quantum, we have a benchmark, so I can show you the performance difference between 1 ms and 10 ms.
Here's the output:
----------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------------------------------
BM_SpinLockThrashing/1/process_time/real_time 0.171 ms 0.155 ms 804
BM_SpinLockThrashing/2/process_time/real_time 0.658 ms 0.234 ms 200
BM_SpinLockThrashing/4/process_time/real_time 1.37 ms 0.744 ms 84
BM_SpinLockThrashing/8/process_time/real_time 2.92 ms 0.852 ms 55
BM_SpinLockThrashing/12/process_time/real_time 3.74 ms 1.46 ms 32
BM_TenMsSleepSpinLockThrashing/1/process_time/real_time 0.179 ms 0.177 ms 707
BM_TenMsSleepSpinLockThrashing/2/process_time/real_time 1.55 ms 0.943 ms 116
BM_TenMsSleepSpinLockThrashing/4/process_time/real_time 2.91 ms 1.95 ms 48
BM_TenMsSleepSpinLockThrashing/8/process_time/real_time 4.92 ms 1.49 ms 21
BM_TenMsSleepSpinLockThrashing/12/process_time/real_time 22.3 ms 2.84 ms 11
Effectively, using an entire quantum of sleep is likely to delay a bit too long. Note: this is for high contention with very short lock-hold times. So I think the current 1 ms sleep is a pretty good balance of throughput and CPU consumption, but I'm happy to toy around with other values if you'd like.
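For illustration, here is a minimal sketch of the kind of spin/yield/sleep backoff being discussed. The spin counts, the `TryLockOnce()`/`PauseHint()` helpers, and the 1 ms sleep are assumptions chosen for the example, not necessarily the PR's exact tuning.

```cpp
// Sketch of a tiered backoff for a spinlock's lock() path:
// spin with a pause hint, then yield to the scheduler, then sleep ~1 ms.
#include <atomic>
#include <chrono>
#include <thread>

inline void PauseHint() noexcept
{
#if defined(__i386__) || defined(__x86_64__)
  __asm__ __volatile__("pause");  // CPU spin-wait hint on x86.
#else
  std::this_thread::yield();      // Portable stand-in for this sketch.
#endif
}

class SpinLockSketch
{
  std::atomic<bool> flag_{false};

  // Single CAS-style attempt; returns true if the lock was acquired.
  bool TryLockOnce() noexcept { return !flag_.exchange(true, std::memory_order_acquire); }

public:
  void lock() noexcept
  {
    for (;;)
    {
      // Fast path: short spin with a pause hint (cheap, stays on-CPU).
      for (int i = 0; i < 100; ++i)
      {
        if (TryLockOnce())
          return;
        PauseHint();
      }
      // Medium path: give the rest of the timeslice to other runnable threads.
      for (int i = 0; i < 10; ++i)
      {
        if (TryLockOnce())
          return;
        std::this_thread::yield();
      }
      // Slow path: back off ~1 ms so a heavily contended lock doesn't burn CPU.
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
  }

  void unlock() noexcept { flag_.store(false, std::memory_order_release); }
};
```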
How to read the output table here?
Added the table header.
…and be single-CPU friendly(ish)
Force-pushed from 83dde01 to 6279657
…and be single-CPU friendly(ish) (open-telemetry#443)
Creates a new benchmark to compare spinlock implementations and tweak them going forward. Also adds the desired "pause/yield" instructions to the built-in spinlock implementation.
On a Windows x86_64 12-core machine, we find the following:
The benchmark in question spins up N threads (1→12), each of which attempts to grab the lock, alter the shared integer, and release the lock (i.e. very high contention and very short ownership). The benchmark measures overall throughput time for the work, not fairness. Additionally, it looks at overall CPU spend for the entire process, across all threads. This benchmark is considered a "worst-case" scenario for stress-testing mutex contention.
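As a rough idea of the shape of such a thrashing benchmark (the `SpinLockMutex` include path and the per-thread iteration count below are assumptions; the fixture in the PR may differ), a Google Benchmark loop might look like this:

```cpp
// Sketch of a high-contention throughput benchmark in the spirit of the
// BM_SpinLockThrashing rows above.
#include <benchmark/benchmark.h>
#include <cstdint>
#include <thread>
#include <vector>

// Assumed location of the spinlock under test in opentelemetry-cpp.
#include "opentelemetry/common/spin_lock_mutex.h"

static void BM_SpinLockThrashing(benchmark::State &state)
{
  opentelemetry::common::SpinLockMutex spinlock;
  uint64_t shared_counter = 0;

  for (auto _ : state)
  {
    // Spin up N threads (the benchmark argument); each grabs the lock,
    // bumps the shared integer, and releases immediately: very high
    // contention with very short ownership.
    std::vector<std::thread> threads;
    for (int64_t t = 0; t < state.range(0); ++t)
    {
      threads.emplace_back([&] {
        for (int i = 0; i < 1000; ++i)  // Iteration count is illustrative.
        {
          spinlock.lock();
          ++shared_counter;
          spinlock.unlock();
        }
      });
    }
    for (auto &th : threads)
      th.join();
  }
  benchmark::DoNotOptimize(shared_counter);
}

// 1..12 threads, reporting wall-clock time plus whole-process CPU time,
// which is what the */process_time/real_time rows in the table show.
BENCHMARK(BM_SpinLockThrashing)
    ->Arg(1)->Arg(2)->Arg(4)->Arg(8)->Arg(12)
    ->MeasureProcessCPUTime()
    ->UseRealTime()
    ->Unit(benchmark::kMillisecond);

BENCHMARK_MAIN();
```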
TL;DR for these statistics: I think this confirms that what we have today is a good balance between CPU spend and throughput, although tuning may still need to happen to ensure this holds across a wide range of hardware and environments; hence I'd like to submit the benchmark as step one in that process.