<xatomic.h>: Consider adding _mm_pause on x86/x64 #680
Comments
Marking as ...
I pinged the folks from whom I might have heard this, here: https://twitter.com/MalwareMinigun/status/1246187718645698562. We'll see what they say.
@BillyONeal, it seems that one of the people you are asking is the same Olivier who made the reference implementation of atomic wait.
If the concern is unpredictable duration then, well, let's not rely on duration and not do more than one _mm_pause per spin iteration (and this STL does not do that, as far as I can see).
I agree that a benchmark should be provided. I disagree that you can improve, or at least not worsen, every case; there's always a tradeoff.
I can provide a test program that shows the difference.
Some things are pure wins, e.g. given compiler support.
_mm_pause also has the issue of taking a different number of cycles depending on which Intel architecture you're using: roughly 10 cycles on older microarchitectures versus roughly 140 cycles on newer ones (Skylake and later). So if people using atomics rely on a very fast atomic check time, they could see a significant drop in performance: https://aloiskraus.wordpress.com/2018/06/16/why-skylakex-cpus-are-sometimes-50-slower-how-intel-has-broken-existing-code/ The people who work on .NET also ran into this. For some extremely high-thread-count, high-contention atomics, this could explode the time taken, as that article explains. So adding _mm_pause can negatively affect some workloads while positively affecting others; it would probably be best to find out which type of workload is most common and optimize for that, while informing others of the change (if the change happens).
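For illustration, here is a minimal sketch (not from this thread; it assumes MSVC on x86/x64) that estimates the cost of _mm_pause on the current CPU by timing a batch of pauses with __rdtsc. Note that rdtsc counts TSC reference ticks rather than core clock cycles, and the loop itself adds overhead, so this gives only a rough indication of the per-architecture difference described above.

#include <intrin.h>
#include <iostream>

int main()
{
    constexpr int iterations = 100000;
    // Time a batch of pause instructions and divide to get a per-pause estimate.
    const unsigned long long start = __rdtsc();
    for (int i = 0; i < iterations; ++i)
    {
        _mm_pause();
    }
    const unsigned long long stop = __rdtsc();
    std::cout << "approx. TSC ticks per _mm_pause: "
              << static_cast<double>(stop - start) / iterations << '\n';
}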
@BillyONeal, I have an idea. Within the scope of #593 (atomic waits) I can add some minimal level of Windows XP support. Then the question about ... (note that ...) What do you think?
Correct me if I'm wrong, but the problem in the linked article was that .NET used something like 50 pauses in the first iteration and then tripled that count in each subsequent iteration. Of course that leads to very long backoff times very quickly. That doesn't mean that a spin loop with a pause isn't still better than one without (compare the two variants sketched below).
It is only when you try to figure out the "optimal" number of pause instructions that this becomes an issue, but e.g. 1 is certainly still better than 0. And generally speaking, I'd certainly expect a single pause to be beneficial.
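For clarity, here is a minimal sketch of the kind of comparison being discussed (the exact snippets from the original comment were lost in the scrape): a test-and-test-and-set spin lock that issues one _mm_pause per retry versus one that spins hot.

#include <atomic>
#include <intrin.h>

std::atomic<bool> locked{false};

// Variant A: test-and-test-and-set with a single _mm_pause per retry.
void lock_with_pause()
{
    while (locked.exchange(true, std::memory_order_acquire))
    {
        while (locked.load(std::memory_order_relaxed))
        {
            _mm_pause();
        }
    }
}

// Variant B: hot spin on the exchange; hammers the cache line and wastes execution resources.
void lock_without_pause()
{
    while (locked.exchange(true, std::memory_order_acquire))
    {
        // nothing
    }
}

void unlock()
{
    locked.store(false, std::memory_order_release);
}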
@MikeGitb Well, that might have been the case, but just to make sure, I ran a modified version of @AlexGuteniev's test program with 60 worker threads and a total of 30 runs, averaging the results at the end of the program. This is on an i7-7700HQ, built for release x64 and run under Visual Studio's debug window. The main reason I brought up the article was to warn against deciding too quickly on something like this, because there have been problems with _mm_pause before due to Intel shenanigans. Of course, there should still be more testing and a more "concrete" answer found, but I want to warn against premature changes that are not fixing an explicit bug yet can affect performance. As with anything performance-related: test before making claims.
One of the other issues (#166) mentions an interesting document, the Intel Architecture Optimization Manual. Section 2.3.4, "Pause Latency in Skylake Microarchitecture", covers this subject. It mentions the growth in the number of cycles for pause.
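For reference, a minimal sketch (my reconstruction, not the manual's exact code) of a spin-wait with increasing backoff along the lines of that guidance; it mirrors the recommended_method case in the benchmark below, and the cap of 64 pauses is an arbitrary choice here.

#include <atomic>
#include <intrin.h>

void spin_lock_with_backoff(std::atomic<bool>& spin_mutex)
{
    int mask = 1;
    const int max = 64;
    while (spin_mutex.exchange(true, std::memory_order_acquire))
    {
        while (spin_mutex.load(std::memory_order_relaxed))
        {
            // Issue progressively more pauses each time around the inner loop.
            for (int i = mask; i != 0; --i)
            {
                _mm_pause();
            }
            // Double the backoff, up to the cap.
            mask = mask < max ? mask << 1 : max;
        }
    }
}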
Here I came across an explanation of a strong argument for having at least one pause.
Can a decision be made here?
I think your benchmark program demonstrates that we should add the pause. (I was reassigned to the vcpkg team between March and June and so did not see that) |
The increasing backoff example in the Skylake manual you reference is probably a good idea too. |
Resolves microsoft#370, resolves microsoft#680
Updated benchmark:
#include <chrono>
#include <iostream>
#include <thread>
#include <mutex>
#include <atomic>
#include <intrin.h>
struct MammothAtomicValue
{
unsigned v;
void increment()
{
++v;
}
bool is_equal(const MammothAtomicValue& other) const
{
return v == other.v;
}
bool is_greater_or_equal(const MammothAtomicValue& other) const
{
return v >= other.v;
}
};
enum use_pause_t
{
no_pause_no_load,
no_pause_load,
one_pause_no_load,
one_pause_load,
recommended_method,
};
template<use_pause_t use_pause>
struct MammothAtomic
{
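// lock(): acquire the spin lock; the use_pause template parameter selects which spin-wait strategy is benchmarked.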
void lock()
{
switch (use_pause)
{
case no_pause_no_load:
while (spin_mutex.exchange(true, std::memory_order_acquire))
{
// nothing
}
break;
case no_pause_load:
while (spin_mutex.exchange(true, std::memory_order_acquire))
{
while (spin_mutex.load(std::memory_order_relaxed))
{
// nothing
}
}
break;
case one_pause_no_load:
while (spin_mutex.exchange(true, std::memory_order_acquire))
{
_mm_pause();
}
break;
case one_pause_load:
while (spin_mutex.exchange(true, std::memory_order_acquire))
{
while (spin_mutex.load(std::memory_order_relaxed))
{
_mm_pause();
}
}
break;
case recommended_method:
{
int mask = 1;
int const max = 64;
while (spin_mutex.exchange(true, std::memory_order_acquire))
{
while (spin_mutex.load(std::memory_order_relaxed))
{
for (int i = mask; i; --i)
{
_mm_pause();
}
mask = mask < max ? mask << 1 : max;
}
}
break;
}
}
}
void unlock()
{
spin_mutex.store(false, std::memory_order_release);
}
MammothAtomicValue load()
{
lock();
MammothAtomicValue result = value;
unlock();
return result;
}
MammothAtomicValue cas(const MammothAtomicValue& desired,
const MammothAtomicValue& expected)
{
lock();
MammothAtomicValue result = value;
if (value.is_equal(expected))
{
value = desired;
}
unlock();
return result;
}
MammothAtomicValue value;
std::atomic<bool> spin_mutex = false;
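// increment(): classic CAS retry loop; returns the value observed before the successful update.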
MammothAtomicValue increment()
{
MammothAtomicValue expected = load();
MammothAtomicValue result = expected;
for (;;)
{
result.increment();
result = cas(result, expected);
if (result.is_equal(expected))
{
break;
}
expected = result;
}
return result;
}
};
template<use_pause_t use_pause>
std::chrono::steady_clock::duration run_benchmark()
{
MammothAtomicValue initial = { 0 };
MammothAtomicValue final = { 2000000 };
MammothAtomic<use_pause> value = { initial };
std::atomic<bool> race_start_flag = false;
auto worker = [&] {
while (!race_start_flag.load()) {}
while (!value.increment().is_greater_or_equal(final)) {}
};
// with two workers, _mm_pause is double win, with one worker no effect
// with more than two some sort of win
std::thread workers[4];
for (auto& w : workers) {
w = std::thread(worker);
}
auto start_time = std::chrono::steady_clock::now();
race_start_flag.store(true);
for (auto& w : workers) {
w.join();
}
return std::chrono::steady_clock::now() - start_time;
}
int main()
{
namespace chrono = std::chrono;
for (int i = 0; i < 4; i++)
{
std::cout << chrono::duration_cast<chrono::duration<double, std::milli>>(
run_benchmark<no_pause_no_load>()).count() << " none\t";
std::cout << chrono::duration_cast<chrono::duration<double, std::milli>>(
run_benchmark<no_pause_load>()).count() << " load\t";
std::cout << chrono::duration_cast<chrono::duration<double, std::milli>>(
run_benchmark<one_pause_no_load>()).count() << " pause\t";
std::cout << chrono::duration_cast<chrono::duration<double, std::milli>>(
run_benchmark<one_pause_load>()).count() << " both\t";
std::cout << chrono::duration_cast<chrono::duration<double, std::milli>>(
run_benchmark<recommended_method>()).count() << " as recommended" << std::endl;
}
}

Results:
On an ARM64 device.
I'm referring to this part: STL/stl/inc/xatomic.h, lines 26 to 40 (at 260cfaf).
In particular, this line: STL/stl/inc/xatomic.h, line 31 (at 260cfaf).
In "winnt.h", YieldProcessor is defined as follows:
#define YieldProcessor _mm_pause
The documentation on YieldProcessor documents the use of the pause instruction. Other synchronization libraries also make use of the pause instruction.
@BillyONeal explained in #613 (comment):
Though I understand that there are some folks who know something, it is all highly suspicious that this STL is the only library I know of that avoids pause.
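For illustration, here is a hypothetical sketch of the kind of definition being requested; the macro name is made up and this is not the STL's actual code. On x86/x64 it would expand to _mm_pause, mirroring the winnt.h definition of YieldProcessor, and on ARM it could use the __yield intrinsic.

// Hypothetical example macro, not the STL's _YIELD_PROCESSOR definition.
#if defined(_M_IX86) || defined(_M_X64)
#include <intrin.h>
#define _EXAMPLE_YIELD_PROCESSOR() _mm_pause()
#elif defined(_M_ARM) || defined(_M_ARM64)
#include <intrin.h>
#define _EXAMPLE_YIELD_PROCESSOR() __yield()
#else
#define _EXAMPLE_YIELD_PROCESSOR() // no spin-wait hint available on this architecture
#endif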