
Add SpinLock "pause" instructions and performance benchmark #443

Merged

Conversation

jsuereth
Contributor

Creates a new benchmark to compare spinlock implementations and to guide tuning going forward. Also adds the desired "pause/yield" instructions to the built-in spinlock implementation.
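For context, the "pause/yield" being added is a per-architecture CPU hint issued inside the spin loop so a waiting hardware thread backs off briefly instead of hammering the cache line. A minimal sketch of such a hint (illustrative macro name, not the exact code in this PR):

// Sketch only: per-architecture spin-wait hint; the macro name is illustrative.
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#  include <intrin.h>
#  define SPINLOCK_PAUSE() _mm_pause()                    // x86 PAUSE via MSVC intrinsic
#elif defined(__i386__) || defined(__x86_64__)
#  define SPINLOCK_PAUSE() __builtin_ia32_pause()         // x86 PAUSE via GCC/Clang builtin
#elif defined(__arm__) || defined(__aarch64__)
#  define SPINLOCK_PAUSE() __asm__ __volatile__("yield")  // ARM YIELD hint
#else
#  define SPINLOCK_PAUSE() ((void)0)                      // no hint available here
#endif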

On a Windows x86_64 12-core machine, we find the following:

2020-12-13T12:06:37-05:00
Running spinlock_benchmark.exe
Run on (12 X 2592 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
----------------------------------------------------------------------------------------------------
Benchmark                                                          Time             CPU   Iterations
----------------------------------------------------------------------------------------------------
BM_SpinLockThrashing/1/process_time/real_time                  0.167 ms        0.166 ms          846
BM_SpinLockThrashing/2/process_time/real_time                  0.686 ms        0.155 ms          202
BM_SpinLockThrashing/4/process_time/real_time                   1.36 ms        0.419 ms          112
BM_SpinLockThrashing/8/process_time/real_time                   2.44 ms         1.20 ms           52
BM_SpinLockThrashing/12/process_time/real_time                  4.47 ms         3.35 ms           28
BM_ProcYieldSpinLockThrashing/1/process_time/real_time         0.159 ms        0.142 ms          879
BM_ProcYieldSpinLockThrashing/2/process_time/real_time         0.286 ms        0.514 ms          486
BM_ProcYieldSpinLockThrashing/4/process_time/real_time         0.791 ms         1.40 ms          178
BM_ProcYieldSpinLockThrashing/8/process_time/real_time          1.79 ms         7.24 ms           82
BM_ProcYieldSpinLockThrashing/12/process_time/real_time         3.89 ms         15.6 ms           45
BM_NaiveSpinLockThrashing/1/process_time/real_time             0.159 ms        0.170 ms          920
BM_NaiveSpinLockThrashing/2/process_time/real_time             0.459 ms        0.724 ms          302
BM_NaiveSpinLockThrashing/4/process_time/real_time              1.01 ms         1.23 ms          127
BM_NaiveSpinLockThrashing/8/process_time/real_time              2.44 ms         10.6 ms           69
BM_NaiveSpinLockThrashing/12/process_time/real_time             9.83 ms         87.6 ms           28
BM_ThreadYieldSpinLockThrashing/1/process_time/real_time       0.167 ms        0.185 ms          929
BM_ThreadYieldSpinLockThrashing/2/process_time/real_time       0.248 ms        0.353 ms          575
BM_ThreadYieldSpinLockThrashing/4/process_time/real_time       0.619 ms         1.25 ms          225
BM_ThreadYieldSpinLockThrashing/8/process_time/real_time        1.63 ms         3.96 ms           71
BM_ThreadYieldSpinLockThrashing/12/process_time/real_time       2.39 ms         6.91 ms           52

The benchmark in question spins up N threads (1→12); each thread attempts to grab the lock, alter the shared integer, and release the lock (i.e. very high contention and very short lock ownership). The benchmark measures overall throughput time for the work, not fairness. Additionally, it looks at overall CPU spend for the entire process, across all threads. This benchmark is considered a "worst-case" scenario for stress-testing mutex contention. A rough sketch of this setup is shown below.
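To make the setup concrete, a thrashing benchmark of this shape could be written with Google Benchmark roughly as follows (a sketch under assumed names and counts, not the exact code added by this PR):

// Sketch: N contending threads each take the lock, bump a shared counter, and
// release it; wall time plus whole-process CPU time is reported per iteration.
#include <benchmark/benchmark.h>
#include <cstdint>
#include <thread>
#include <vector>
#include "opentelemetry/common/spin_lock_mutex.h"

static void BM_SpinLockThrashing(benchmark::State &state)
{
  opentelemetry::common::SpinLockMutex spinlock;
  for (auto _ : state)
  {
    uint64_t value = 0;
    std::vector<std::thread> threads;
    for (int i = 0; i < state.range(0); ++i)
    {
      threads.emplace_back([&] {
        for (int j = 0; j < 1000; ++j)
        {
          spinlock.lock();   // very short critical section => high contention
          ++value;
          spinlock.unlock();
        }
      });
    }
    for (auto &t : threads)
      t.join();
    benchmark::DoNotOptimize(value);
  }
}
// The "/process_time/real_time" suffixes in the tables come from these modifiers.
BENCHMARK(BM_SpinLockThrashing)
    ->RangeMultiplier(2)
    ->Range(1, 12)
    ->MeasureProcessCPUTime()
    ->UseRealTime()
    ->Unit(benchmark::kMillisecond);

BENCHMARK_MAIN();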

TL;DR for these statistics:

  • The implemented mutex (SpinLockThrashing) shows a good balance of CPU spend + overall processing time for this simple benchmark.
  • A naive spin-lock with a processor-yield instruction is fastest for the single-thread use case (as expected), but scales poorly with more threads/contention.
  • Naive thread-yielding leads to higher throughput but more overall CPU-spend in the 12-thread scenario.

I think this confirms that what we have today is a good balance between CPU spend and throughput, although tuning may still need to happen to ensure this holds across a wide range of hardware and environments. Hence I'd like to submit the benchmark as step one in that process.

@jsuereth jsuereth added the pr:do-not-merge This PR is not ready to be merged. label Dec 13, 2020
@jsuereth jsuereth requested a review from a team December 13, 2020 18:04
@codecov

codecov bot commented Dec 13, 2020

Codecov Report

Merging #443 (d74afbd) into master (c786881) will increase coverage by 0.00%.
The diff coverage is 0.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #443   +/-   ##
=======================================
  Coverage   94.37%   94.37%           
=======================================
  Files         189      189           
  Lines        8409     8410    +1     
=======================================
+ Hits         7936     7937    +1     
  Misses        473      473           
Impacted Files                                        Coverage Δ
api/include/opentelemetry/common/spin_lock_mutex.h    31.25% <0.00%> (-2.09%) ⬇️
sdk/src/logs/batch_log_processor.cc                   95.06% <0.00%> (+1.23%) ⬆️

@jsuereth jsuereth force-pushed the wip-spinlock-benchmark-and-pause branch from 0cf3ca8 to d446d88 on December 22, 2020 15:52
@maxgolov
Contributor

CI is failing because of #include <Windows.h>, which re-defines min and max. That causes the build to fail at any usage of std::min or std::max. One way out of this is to define NOMINMAX globally. The other way is to apply targeted patches that surround these in parentheses, e.g. (std::min) and (std::max), in every place where the macros would otherwise expand. Since Mishal also hit this issue elsewhere on Windows, I'd like to apply the broader fix. You can take a look, review, merge, then rebase with my fix included.

Fundamental issue described here:

> To avoid conflicts between Windows headers and std::max, you need to #define NOMINMAX before you include Windows headers.
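For illustration, the two workarounds mentioned above look roughly like this (sketch only; the helper function is hypothetical):

// Option 1: suppress the min/max macros before any Windows header is included.
#define NOMINMAX
#include <Windows.h>

// Option 2: parenthesize the call site so the function-like macro cannot
// expand; this works even without NOMINMAX.
#include <algorithm>
inline int smaller(int a, int b)
{
  return (std::min)(a, b);  // parentheses prevent the min macro from expanding
}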

@jsuereth jsuereth added area:api enhancement New feature or request and removed pr:do-not-merge This PR is not ready to be merged. labels Dec 22, 2020
@maxgolov maxgolov requested a review from a team December 22, 2020 22:09
// processor.
__yield();
#else
// TODO: Issue PAUSE/YIELD on other architectures.
Contributor


Could we put an #error here for unimplemented architecture?

For the slow path below at line 113, which says the goal is ~1000 ns: 1 ms is 10^6 ns, perhaps a typo here? And for std::this_thread::sleep_for, I think it makes little sense to sleep for less than one CPU quantum slice, which is usually 10 ms or more.

Contributor Author

@jsuereth jsuereth Dec 23, 2020


> Could we put an #error here for unimplemented architecture?

It's not an error, though, just a CPU power efficiency optimisation.

> For the slow path below at line 113, which says the goal is ~1000 ns: 1 ms is 10^6 ns, perhaps a typo here? And for std::this_thread::sleep_for, I think it makes little sense to sleep for less than one CPU quantum slice, which is usually 10 ms or more.

Yeah, that's a typo. The idea is to bump an order of magnitude; I think this_thread::yield is also a bit higher in actual ns cost. I can update the comments. As for the sleep_for quantum, we have a benchmark, so I can show you the performance difference between 1 ms and 10 ms.

Here's the output:

----------------------------------------------------------------------------------------------------
Benchmark                                                          Time             CPU   Iterations
----------------------------------------------------------------------------------------------------
BM_SpinLockThrashing/1/process_time/real_time                 0.171 ms        0.155 ms          804
BM_SpinLockThrashing/2/process_time/real_time                 0.658 ms        0.234 ms          200
BM_SpinLockThrashing/4/process_time/real_time                  1.37 ms        0.744 ms           84
BM_SpinLockThrashing/8/process_time/real_time                  2.92 ms        0.852 ms           55
BM_SpinLockThrashing/12/process_time/real_time                 3.74 ms         1.46 ms           32
BM_TenMsSleepSpinLockThrashing/1/process_time/real_time       0.179 ms        0.177 ms          707
BM_TenMsSleepSpinLockThrashing/2/process_time/real_time        1.55 ms        0.943 ms          116
BM_TenMsSleepSpinLockThrashing/4/process_time/real_time        2.91 ms         1.95 ms           48
BM_TenMsSleepSpinLockThrashing/8/process_time/real_time        4.92 ms         1.49 ms           21
BM_TenMsSleepSpinLockThrashing/12/process_time/real_time       22.3 ms         2.84 ms           11

Effectively, sleeping for an entire quantum is likely to delay a bit too long. Note: this is a high-contention, short-lock-hold scenario. So I think the current 1 ms sleep is a pretty good balance of throughput and CPU consumption, but I'm happy to toy around with other values if you'd like.
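For reference, the fast-path/slow-path structure being discussed looks roughly like the following (a sketch, not the actual SpinLockMutex implementation; the spin counts and the 1 ms back-off are exactly the tunables under discussion):

#include <atomic>
#include <chrono>
#include <thread>

// Minimal spin-wait hint; see the fuller per-architecture sketch earlier in
// this thread. Falls back to a no-op where no hint is known.
#if defined(__i386__) || defined(__x86_64__)
#  define SPINLOCK_PAUSE() __builtin_ia32_pause()
#else
#  define SPINLOCK_PAUSE() ((void)0)
#endif

class SketchSpinLock
{
  std::atomic<bool> flag_{false};

public:
  void lock() noexcept
  {
    for (;;)
    {
      // Fast path: spin briefly with the CPU pause hint (cheap, keeps the core).
      for (int i = 0; i < 100; ++i)
      {
        if (!flag_.load(std::memory_order_relaxed) &&
            !flag_.exchange(true, std::memory_order_acquire))
          return;
        SPINLOCK_PAUSE();
      }
      // Medium path: hand the core to another runnable thread.
      for (int i = 0; i < 10; ++i)
      {
        if (!flag_.load(std::memory_order_relaxed) &&
            !flag_.exchange(true, std::memory_order_acquire))
          return;
        std::this_thread::yield();
      }
      // Slow path: back off for ~1 ms rather than a full scheduler quantum;
      // the table above shows 10 ms sleeps hurting throughput under contention.
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
  }

  void unlock() noexcept { flag_.store(false, std::memory_order_release); }
};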

Contributor


How to read the output table here?

Contributor Author


added the table header.

@jsuereth jsuereth force-pushed the wip-spinlock-benchmark-and-pause branch from 83dde01 to 6279657 on December 29, 2020 15:19
@lalitb lalitb merged commit 3cb5d26 into open-telemetry:master Jan 5, 2021
kxyr pushed a commit to open-o11y/opentelemetry-cpp that referenced this pull request Jan 8, 2021