
Add SpinLock "pause" instructions and performance benchmark #443

Merged

Conversation

jsuereth
Contributor

Creates a new benchmark to compare spinlock implementations and to guide tuning going forward. Also adds the desired "pause/yield" instructions to the built-in spinlock implementation.
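For context, the "pause/yield" being added is a per-architecture CPU hint issued inside the spin loop so a waiting hardware thread backs off briefly instead of hammering the cache line. A minimal sketch of such a hint (illustrative macro name, not the exact code in this PR):

// Sketch only: per-architecture spin-wait hint; the macro name is illustrative.
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#  include <intrin.h>
#  define SPINLOCK_PAUSE() _mm_pause()                    // x86 PAUSE via MSVC intrinsic
#elif defined(__i386__) || defined(__x86_64__)
#  define SPINLOCK_PAUSE() __builtin_ia32_pause()         // x86 PAUSE via GCC/Clang builtin
#elif defined(__arm__) || defined(__aarch64__)
#  define SPINLOCK_PAUSE() __asm__ __volatile__("yield")  // ARM YIELD hint
#else
#  define SPINLOCK_PAUSE() ((void)0)                      // no hint available here
#endif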

On a Windows x86_64 12-core machine, we find the following:

2020-12-13T12:06:37-05:00
Running spinlock_benchmark.exe
Run on (12 X 2592 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
----------------------------------------------------------------------------------------------------
Benchmark                                                          Time             CPU   Iterations
----------------------------------------------------------------------------------------------------
BM_SpinLockThrashing/1/process_time/real_time                  0.167 ms        0.166 ms          846
BM_SpinLockThrashing/2/process_time/real_time                  0.686 ms        0.155 ms          202
BM_SpinLockThrashing/4/process_time/real_time                   1.36 ms        0.419 ms          112
BM_SpinLockThrashing/8/process_time/real_time                   2.44 ms         1.20 ms           52
BM_SpinLockThrashing/12/process_time/real_time                  4.47 ms         3.35 ms           28
BM_ProcYieldSpinLockThrashing/1/process_time/real_time         0.159 ms        0.142 ms          879
BM_ProcYieldSpinLockThrashing/2/process_time/real_time         0.286 ms        0.514 ms          486
BM_ProcYieldSpinLockThrashing/4/process_time/real_time         0.791 ms         1.40 ms          178
BM_ProcYieldSpinLockThrashing/8/process_time/real_time          1.79 ms         7.24 ms           82
BM_ProcYieldSpinLockThrashing/12/process_time/real_time         3.89 ms         15.6 ms           45
BM_NaiveSpinLockThrashing/1/process_time/real_time             0.159 ms        0.170 ms          920
BM_NaiveSpinLockThrashing/2/process_time/real_time             0.459 ms        0.724 ms          302
BM_NaiveSpinLockThrashing/4/process_time/real_time              1.01 ms         1.23 ms          127
BM_NaiveSpinLockThrashing/8/process_time/real_time              2.44 ms         10.6 ms           69
BM_NaiveSpinLockThrashing/12/process_time/real_time             9.83 ms         87.6 ms           28
BM_ThreadYieldSpinLockThrashing/1/process_time/real_time       0.167 ms        0.185 ms          929
BM_ThreadYieldSpinLockThrashing/2/process_time/real_time       0.248 ms        0.353 ms          575
BM_ThreadYieldSpinLockThrashing/4/process_time/real_time       0.619 ms         1.25 ms          225
BM_ThreadYieldSpinLockThrashing/8/process_time/real_time        1.63 ms         3.96 ms           71
BM_ThreadYieldSpinLockThrashing/12/process_time/real_time       2.39 ms         6.91 ms           52

The benchmark in question spins up N threads (1→12); each thread attempts to grab the lock, alter the shared integer, and release the lock (i.e. very high contention and very short lock ownership). The benchmark measures overall throughput time for the work, not fairness. Additionally, it looks at overall CPU spend for the entire process, across all threads. This benchmark is considered a "worst-case" scenario for stress-testing mutex contention. A rough sketch of this setup is shown below.
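To make the setup concrete, a thrashing benchmark of this shape could be written with Google Benchmark roughly as follows (a sketch under assumed names and counts, not the exact code added by this PR):

// Sketch: N contending threads each take the lock, bump a shared counter, and
// release it; wall time plus whole-process CPU time is reported per iteration.
#include <benchmark/benchmark.h>
#include <cstdint>
#include <thread>
#include <vector>
#include "opentelemetry/common/spin_lock_mutex.h"

static void BM_SpinLockThrashing(benchmark::State &state)
{
  opentelemetry::common::SpinLockMutex spinlock;
  for (auto _ : state)
  {
    uint64_t value = 0;
    std::vector<std::thread> threads;
    for (int i = 0; i < state.range(0); ++i)
    {
      threads.emplace_back([&] {
        for (int j = 0; j < 1000; ++j)
        {
          spinlock.lock();   // very short critical section => high contention
          ++value;
          spinlock.unlock();
        }
      });
    }
    for (auto &t : threads)
      t.join();
    benchmark::DoNotOptimize(value);
  }
}
// The "/process_time/real_time" suffixes in the tables come from these modifiers.
BENCHMARK(BM_SpinLockThrashing)
    ->RangeMultiplier(2)
    ->Range(1, 12)
    ->MeasureProcessCPUTime()
    ->UseRealTime()
    ->Unit(benchmark::kMillisecond);

BENCHMARK_MAIN();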

TL;DR for these statistics:

  • The implemented mutex (SpinLockThrashing) shows a good balance of CPU spend + overall processing time for this simple benchmark.
  • A naive spin-lock with a processor-yield instruction is fastest for the single-thread use case (as expected), but scales poorly with more threads/contention.
  • Naive thread-yielding leads to higher throughput but more overall CPU-spend in the 12-thread scenario.

I think this confirms that what we have today is a good balance between CPU spend and throughput, although tuning may still need to happen to ensure this holds across a wide range of hardware and environments. Hence I'd like to submit the benchmark as step one in that process.

@jsuereth jsuereth added the pr:do-not-merge This PR is not ready to be merged. label Dec 13, 2020
@jsuereth jsuereth requested a review from a team December 13, 2020 18:04
@codecov

codecov bot commented Dec 13, 2020

Codecov Report

Merging #443 (d74afbd) into master (c786881) will increase coverage by 0.00%.
The diff coverage is 0.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #443   +/-   ##
=======================================
  Coverage   94.37%   94.37%           
=======================================
  Files         189      189           
  Lines        8409     8410    +1     
=======================================
+ Hits         7936     7937    +1     
  Misses        473      473           
Impacted Files                                        Coverage Δ
api/include/opentelemetry/common/spin_lock_mutex.h    31.25% <0.00%> (-2.09%) ⬇️
sdk/src/logs/batch_log_processor.cc                   95.06% <0.00%> (+1.23%) ⬆️

@jsuereth jsuereth force-pushed the wip-spinlock-benchmark-and-pause branch from 0cf3ca8 to d446d88 on December 22, 2020 15:52
@maxgolov
Contributor

CI is failing because of #include <Windows.h>, which re-defines min and max. That causes the build to fail at any usage of std::min or std::max. One way out of this is to define NOMINMAX globally. The other way is to apply targeted patches that surround these in parentheses, e.g. (std::min) and (std::max), in every place where the macros would otherwise expand. Since Mishal also hit this issue elsewhere on Windows, I'd like to apply the broader fix. You can take a look, review, merge, then rebase with my fix included.

Fundamental issue described here:

> To avoid conflicts between Windows headers and std::max, you need to #define NOMINMAX before you include Windows headers.
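For illustration, the two workarounds mentioned above look roughly like this (sketch only; the helper function is hypothetical):

// Option 1: suppress the min/max macros before any Windows header is included.
#define NOMINMAX
#include <Windows.h>

// Option 2: parenthesize the call site so the function-like macro cannot
// expand; this works even without NOMINMAX.
#include <algorithm>
inline int smaller(int a, int b)
{
  return (std::min)(a, b);  // parentheses prevent the min macro from expanding
}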

@jsuereth jsuereth added area:api enhancement New feature or request and removed pr:do-not-merge This PR is not ready to be merged. labels Dec 22, 2020
@maxgolov maxgolov requested a review from a team December 22, 2020 22:09
// processor.
__yield();
#else
// TODO: Issue PAUSE/YIELD on other architectures.
Contributor


Could we put an #error here for unimplemented architecture?

For the slow path below at line 113, which says the goal is ~1000 ns: 1 ms is 10^6 ns, perhaps a typo here? And for std::this_thread::sleep_for, I think it makes little sense to sleep for less than one CPU quantum slice, which is usually 10 ms or more.

Contributor Author

@jsuereth jsuereth Dec 23, 2020


> Could we put an #error here for unimplemented architecture?

It's not an error, though, just a CPU power efficiency optimisation.

> For the slow path below at line 113, which says the goal is ~1000 ns: 1 ms is 10^6 ns, perhaps a typo here? And for std::this_thread::sleep_for, I think it makes little sense to sleep for less than one CPU quantum slice, which is usually 10 ms or more.

Yeah, that's a typo. The idea is to bump an order of magnitude; I think this_thread::yield is also a bit higher in actual ns cost. I can update the comments. As for the sleep_for quantum, we have a benchmark, so I can show you the performance difference between 1 ms and 10 ms.

Here's the output:

----------------------------------------------------------------------------------------------------
Benchmark                                                          Time             CPU   Iterations
----------------------------------------------------------------------------------------------------
BM_SpinLockThrashing/1/process_time/real_time                 0.171 ms        0.155 ms          804
BM_SpinLockThrashing/2/process_time/real_time                 0.658 ms        0.234 ms          200
BM_SpinLockThrashing/4/process_time/real_time                  1.37 ms        0.744 ms           84
BM_SpinLockThrashing/8/process_time/real_time                  2.92 ms        0.852 ms           55
BM_SpinLockThrashing/12/process_time/real_time                 3.74 ms         1.46 ms           32
BM_TenMsSleepSpinLockThrashing/1/process_time/real_time       0.179 ms        0.177 ms          707
BM_TenMsSleepSpinLockThrashing/2/process_time/real_time        1.55 ms        0.943 ms          116
BM_TenMsSleepSpinLockThrashing/4/process_time/real_time        2.91 ms         1.95 ms           48
BM_TenMsSleepSpinLockThrashing/8/process_time/real_time        4.92 ms         1.49 ms           21
BM_TenMsSleepSpinLockThrashing/12/process_time/real_time       22.3 ms         2.84 ms           11

Effectively, sleeping for an entire quantum is likely to delay a bit too long. Note: this is a high-contention, short-lock-hold scenario. So I think the current 1 ms sleep is a pretty good balance of throughput and CPU consumption, but I'm happy to toy around with other values if you'd like.
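For reference, the fast-path/slow-path structure being discussed looks roughly like the following (a sketch, not the actual SpinLockMutex implementation; the spin counts and the 1 ms back-off are exactly the tunables under discussion):

#include <atomic>
#include <chrono>
#include <thread>

// Minimal spin-wait hint; see the fuller per-architecture sketch earlier in
// this thread. Falls back to a no-op where no hint is known.
#if defined(__i386__) || defined(__x86_64__)
#  define SPINLOCK_PAUSE() __builtin_ia32_pause()
#else
#  define SPINLOCK_PAUSE() ((void)0)
#endif

class SketchSpinLock
{
  std::atomic<bool> flag_{false};

public:
  void lock() noexcept
  {
    for (;;)
    {
      // Fast path: spin briefly with the CPU pause hint (cheap, keeps the core).
      for (int i = 0; i < 100; ++i)
      {
        if (!flag_.load(std::memory_order_relaxed) &&
            !flag_.exchange(true, std::memory_order_acquire))
          return;
        SPINLOCK_PAUSE();
      }
      // Medium path: hand the core to another runnable thread.
      for (int i = 0; i < 10; ++i)
      {
        if (!flag_.load(std::memory_order_relaxed) &&
            !flag_.exchange(true, std::memory_order_acquire))
          return;
        std::this_thread::yield();
      }
      // Slow path: back off for ~1 ms rather than a full scheduler quantum;
      // the table above shows 10 ms sleeps hurting throughput under contention.
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
  }

  void unlock() noexcept { flag_.store(false, std::memory_order_release); }
};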

Contributor


How to read the output table here?

Contributor Author


added the table header.

@jsuereth jsuereth force-pushed the wip-spinlock-benchmark-and-pause branch from 83dde01 to 6279657 on December 29, 2020 15:19
@lalitb lalitb merged commit 3cb5d26 into open-telemetry:master Jan 5, 2021
kxyr pushed a commit to open-o11y/opentelemetry-cpp that referenced this pull request Jan 8, 2021