[NativeAOT] A few scalability fixes for the fat Lock #88633
Conversation
Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas

Issue Details

Recent investigations into productizing the Lock type revealed a few issues in the "Fat" lock implementation on NativeAOT that negatively impact scalability under contention. This PR addresses the following issues:

A blind CAS saves us one load of the state before the CAS, in the hope of saving on bus interactions. That does not seem to make any difference on current hardware.

The spin loop uses backoffs when interlocked operations fail, but on the most common path, when we retry acquiring, there is only a minimal pause. As a result we retry too aggressively, which is unproductive. It becomes even more expensive when other threads keep modifying the lock state: we would read a lot of modified states, and each read causes a cache miss plus unnecessary sharing of state that will soon need to be invalidated.

As the lock becomes more crowded, the cost of spinning quickly goes up because of cache misses and more threads wasting more cycles. Ideally the spin limit should be dialed down as the cost of spinning goes up. Instead, the current logic assumes that acquiring the lock while spinning means the spin limit should be increased. Such logic is not very sensitive to how crowded the lock is. What is worse, it is self-fulfilling. We may acquire the lock after long and expensive spinning with many cache misses and failed interlocked operations (and not only in the current thread), yet we conclude that we “succeeded” and everyone should spin more – thus making the situation worse, while also making spinning more likely to “succeed” in the future. A crowded lock can degenerate into an expensive spinlock and stay in that mode, while a better choice would be the opposite – reduce crowdedness by reducing the spin limit. Detecting whether the lock has changed ownership while we were spinning turns out to be a much better signal for dialing the spin limit.
```
@@ -25,13 +25,7 @@ private static void MonitorEnter(object obj, ref bool lockTaken)
        ObjectHeader.GetLockObject(obj) :
        SyncTable.GetLockObject(resultOrIndex);

    if (lck.TryAcquire(0))
```
This acquire attempt is unnecessary and also happens too soon. It was probably here to pick up "unusual" cases like recursive locking, because `TryAcquire(0)` would defer internally to the slow path for one iteration. Calling the slow path directly is simpler, though, and just as good.
```
        return;
    }

    Monitor.TryAcquireContended(lck, obj, Timeout.Infinite);
```
I've renamed `TryAcquireContended` -> `TryAcquireSlow`. The old name was misleading: this method does not just handle contention, it handles all possible scenarios, while the initial quick acquire attempts can handle only a few common cases.
A simple benchmark to check the lock's response to growing contention could be the following:

```csharp
using System.Diagnostics;
namespace ConsoleApp12
{
internal class Program
{
static Program fatLock = new Program();
static Program thinLock = new Program();
private const int iters = 1000000;
static void Main(string[] args)
{
fatLock = new Program();
fatLock.GetHashCode();
for (; ; )
{
for (int i = 0; i < 7; i++)
{
int thrCount = 1 << i;
System.Console.WriteLine("Threads:" + thrCount);
for (int j = 0; j < 4; j++)
{
System.Console.Write("Fat Lock: ");
RunMany(() => FatLock(thrCount), thrCount);
//System.Console.Write("Thin Lock :");
//thinLock = new Program();
//RunMany(() => ThinLock(thrCount), thrCount);
}
}
System.Console.WriteLine();
}
}
static void RunMany(Action f, int threadCount)
{
Thread[] threads = new Thread[threadCount];
bool start = false;
for (int i = 0; i < threads.Length; i++)
{
threads[i] = new Thread(
() =>
{
while (!start) Thread.SpinWait(1);
f();
}
);
threads[i].Start();
}
Thread.Sleep(10);
start = true;
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < threads.Length; i++)
{
threads[i].Join();
}
System.Console.WriteLine("Ops per msec: " + iters / sw.ElapsedMilliseconds);
}
static int sharedCounter = 0;
private static int Fib(int n) => n < 2 ? 1 : Fib(n - 1) + Fib(n - 2);
private static int ComputeSomething(Random random)
{
int delay = random.Next(4, 10);
return Fib(delay);
}
static void FatLock(int thrCount)
{
Random random = new Random();
for (int i = 0; i < iters / thrCount; i++)
{
// do some computation
int value = ComputeSomething(random);
lock (fatLock)
{
// update shared state
sharedCounter += value;
}
}
}
}
}
```
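The thread doesn't show the exact commands used to produce the numbers, but to get NativeAOT results a benchmark like this would presumably be published as a native binary (e.g. `dotnet publish -c Release /p:PublishAot=true`) rather than run on CoreCLR.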
Here is the output on an x64 machine with 32 logical cores:

=== Before the fix
=== After the fix
For reference, here is the same benchmark with the native CoreCLR lock:

=== CoreCLR
/azp run runtime-extra-platforms

Azure Pipelines successfully started running 1 pipeline(s).

Rebased onto recent main, to keep up with changes.

Rebased onto main, to keep up to date.

Rebased onto main, to keep up to date.
I'm taking a look, will try to get back soon |
```csharp
// now we can estimate how busy the lock is and adjust spinning accordingly
ushort spinLimit = _spinLimit;
if (ownerChanged != 0)
{
    // The lock has changed ownership while we were trying to acquire it.
    // It is a signal that we might want to spin less next time.
    // Pursuing a lock that is being "stolen" by other threads is inefficient
    // due to cache misses and unnecessary sharing of state that keeps invalidating.
    if (spinLimit > MinSpinLimit)
    {
        _spinLimit = (ushort)(spinLimit - 1);
    }
}
else if (spinLimit < MaxSpinLimit && iteration > spinLimit / 2)
{
    // we used more than 50% of allowed iterations, but the lock does not look very contested,
    // we can allow a bit more spinning.
    _spinLimit = (ushort)(spinLimit + 1);
}
```
A useful aspect of a spin-tuning heuristic would be to avoid spin-waiting when it just uses CPU time but does not help to avoid context switching. I'm not sure that this is addressed by the spin-tuning heuristic in this change. I suggest creating a test case where spin-waiting is wasted, and comparing the CPU time with profiling, with and without spin-waiting. Ideally, a self-tuning spin-wait heuristic would show roughly the same CPU time usage as disabling spin-waiting.
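One possible shape for such a test (my sketch of the suggestion, not code from this PR): hold the lock for much longer than a block/wake round trip, so that spinning can essentially never succeed, and compare the process CPU time between runs with spin-waiting enabled and disabled (or pinned at the min/max limits).

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Long-held lock: spinning can essentially never succeed, so any
// spin-waiting here is wasted CPU. Compare TotalProcessorTime across
// runs with different spin settings.
class WastedSpinTest
{
    static readonly object lockObj = new object();

    static void Main()
    {
        var threads = new Thread[8];
        for (int i = 0; i < threads.Length; i++)
        {
            threads[i] = new Thread(() =>
            {
                for (int j = 0; j < 500; j++)
                {
                    lock (lockObj)
                    {
                        Thread.Sleep(1); // hold far longer than a block/wake round trip
                    }
                }
            });
            threads[i].Start();
        }
        foreach (var t in threads) t.Join();

        // with an ideal heuristic this should be close to the no-spin baseline
        Console.WriteLine($"CPU time: {Process.GetCurrentProcess().TotalProcessorTime}");
    }
}
```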
As I understand the concern, this is about a lock that is typically long-held.
If the lock is held for much longer than it takes to block/awaken a thread, then spinning is useless.
Note that even if a thread decides to block right away, it will still pay thousands of cycles for blocking/waking. To a first approximation, if the maximum spinning that we allow does not cost much more than blocking/awaking, then the overall overhead even with the max spin should not be much worse than ideal.
Also, since such a long-held lock will typically be acquired by an awoken waiter without any spinning, it would not raise the spin limit, so we will likely stay at the initial min spin setting.
We may have an occasional new thread arrive in the precise time window around a lock release, spin, and acquire, and thus cause the spin count to slowly drift up and reach the max. But it is also possible for such "new" threads to contest the lock with an awoken waiter and reduce the spin.
Since the lock is long-held, there cannot be many spinners. If the lock is very popular, eventually all the interested threads will block.
I think the possibility of really bad behavior is not high, but it may be possible to further improve the heuristics for long-held locks. It could be interesting to construct some adversarial scenarios to see if we can get into trouble intentionally, and how likely and how bad it can get.
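As a rough back-of-envelope of that first approximation (the cycle counts are illustrative assumptions, not measurements): if a block/wake round trip costs about $c \approx 5000$ cycles and the maximum spin budget is capped at a comparable cost, then

$$\text{cost}(\text{spin fully, then block}) \;\le\; \text{cost}_{\max}(\text{spin}) + c \;\approx\; 2c,$$

so even when every cycle of spinning is wasted, the total overhead stays within roughly 2x of the unavoidable cost of blocking immediately.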
> Note that even if a thread decides to block right away, it will still pay thousands of cycles for blocking/waking. To a first approximation, if the maximum spinning that we allow does not cost much more than blocking/awaking, then the overall overhead even with the max spin should not be much worse than ideal.
It seems like the max spin count would use a not-insignificant amount of CPU time relative to context switching, though I haven't measured with this heuristic.
> Also, since such a long-held lock will typically be acquired by an awoken waiter without any spinning, it would not raise the spin limit, so we will likely stay at the initial min spin setting.
True, but if some other less-frequent lock contention on the same lock happens to increase the spin count, the spin-tuning would not tune down the spin-waiting for a more-frequent lock contention that doesn't benefit from spin-waiting. It may not be a significant issue at the moment since the spin-tuning heuristic seems to favor less spin-waiting anyway.
```diff
@@ -169,13 +191,14 @@ private static unsafe void ExponentialBackoff(uint iteration)
     uint rand = ((uint)&iteration + iteration) * 2654435769u;
     // set the highmost bit to ensure minimum number of spins is exponentially increasing
     // that is in case some stack location results in a sequence of very low spin counts
-    rand |= (1 << 32);
+    // it basically guarantees that we spin at least 1, 2, 4, 8, 16 times, and so on
+    rand |= (1u << 31);
```
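For context on why the old line was a bug: in C#, the shift count of a 32-bit shift is masked to its low 5 bits, so `1 << 32` is evaluated as `1 << 0` and sets the lowest bit rather than the highest. A quick standalone demonstration:

```csharp
using System;

internal class ShiftDemo
{
    private static void Main()
    {
        // C# masks an int shift count to 5 bits (count & 31),
        // so shifting by 32 behaves like shifting by 0
        Console.WriteLine(1 << 32);  // prints 1: only bit 0 is set
        Console.WriteLine(1u << 31); // prints 2147483648: the top bit of a uint
    }
}
```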
Not relevant to this particular change, but why is this additional randomization necessary? Given that the upper bit is set, it seems the randomization effect would be minuscule, especially at higher iterations. To me the added complication doesn't seem necessary.
> Given that the upper bit is set, it seems the randomization effect would be minuscule, especially at higher iterations.
With the upper bit set and the other bits random, the range of possible waits is still ~50% of the minimum for the given iteration.
The cost of this "randomization" is basically free next to the cost of occupying the core for even one spin-wait cycle. It is cheap insurance against threads arriving at the same time or synchronizing with each other via the effects of cache misses.
Some scenarios/environments may not need this, and those that do may need it less after this PR, since we are trying to ensure that we do not see too much contentious spinning in the first place.
It is hard to prove that there will be no benefits, though, since this code will run on a very wide range of hardware. So nearly free insurance still seems good to have.
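Pieced together from the fragments quoted in this thread, the whole routine looks roughly like the sketch below. This is a reconstruction under the assumption that the cap and shift work as discussed here (compile with AllowUnsafeBlocks), not the verbatim runtime source:

```csharp
using System;
using System.Threading;

internal static class BackoffSketch
{
    // total cap discussed later in this thread: one pause never exceeds ~2^10 spin-waits
    private const uint MaxExponentialBackoffBits = 10;

    internal static unsafe void ExponentialBackoff(uint iteration)
    {
        if (iteration > 0)
        {
            // cheap pseudo-randomness: hash the stack address of the argument with the
            // iteration count (2654435769 = 2^32 / golden ratio, Fibonacci hashing)
            uint rand = ((uint)&iteration + iteration) * 2654435769u;

            // force the top bit on so the minimum pause still doubles every iteration:
            // at least 1, 2, 4, 8, ... spin-waits, randomized upward from that minimum
            rand |= (1u << 31);

            // keep the top `iteration` bits (capped): a pause in [2^(i-1), 2^i) spin-waits
            uint spins = rand >> (int)(32 - Math.Min(iteration, MaxExponentialBackoffBits));
            Thread.SpinWait((int)spins);
        }
    }

    private static void Main()
    {
        // pauses roughly double per iteration; the low random bits vary run to run
        for (uint i = 1; i <= 12; i++)
            ExponentialBackoff(i);
    }
}
```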
I see, it seems the spin-wait backoff can yield fairly large gaps between checks at higher iterations, whereas CoreCLR's monitor's spin-wait backoff maxes out fairly quickly in favor of reducing latency. I guess a small amount of randomization may not hurt, and as you said it would probably be difficult to prove or disprove if it actually helps.
```csharp
// We will use a pause that doubles up on every iteration. It will not be more than 2x worse
// than the ideal guess, while minimizing the number of retries.
// We will allow pauses up to 64~128 spinwaits, or more if there are collisions.
ExponentialBackoff(Math.Min(iteration, 6) + collisions);
```
It seems the added collisions counter would make this exponential backoff a lot more than power-of-2 exponential, and perhaps fairly quickly depending on the situation. Is there a reason for this change?
There are two factors that inform the backoff here: iteration and collisions.
The retry period is normally driven by iterations. As we spin more, we reduce the rate of reads, as the comment explains. The goal is to acquire the lock within 2x of the optimal wait if no one wants it, while avoiding possible interference with other threads acquiring/releasing the lock. This is a power-of-2 exponential backoff with a limit.
When we see collisions (i.e. our interlocked operations start failing), it is a signal that we may still have more competition at the lock than we would like, so we back off more.
Collisions should be relatively rare - basically a "panic mode" to resolve unexpected bursts of activity.
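To make the two signals concrete, here is a schematic toy lock (my illustration, not the PR's code; the real lock blocks on a wait event instead of yielding, and its spin limit is adaptive):

```csharp
using System;
using System.Threading;

// Schematic illustration of how the two backoff signals, iteration and
// collisions, might drive a spin-acquire loop.
internal class TwoSignalSpinLock
{
    private int _state;                 // 0 = free, 1 = held
    private const uint SpinLimit = 20;  // fixed here; adaptive in the real lock

    public void Enter()
    {
        uint iteration = 0, collisions = 0;
        while (true)
        {
            if (Volatile.Read(ref _state) == 0)
            {
                if (Interlocked.CompareExchange(ref _state, 1, 0) == 0)
                    return; // acquired
                collisions++; // interlocked op lost a race: the "panic" signal
            }

            iteration++;
            // normal pacing grows with iteration, capped at 2^6..2^7 spin-waits;
            // each collision backs off further
            ExponentialBackoff(Math.Min(iteration, 6) + collisions);

            if (iteration >= SpinLimit)
            {
                Thread.Yield(); // stand-in for blocking on a wait event
                iteration = 0;
            }
        }
    }

    public void Exit() => Volatile.Write(ref _state, 0);

    private static void ExponentialBackoff(uint iteration) =>
        Thread.SpinWait(1 << (int)Math.Min(iteration, 10));

    private static int counter;

    private static void Main()
    {
        var l = new TwoSignalSpinLock();
        var threads = new Thread[4];
        for (int i = 0; i < threads.Length; i++)
        {
            threads[i] = new Thread(() =>
            {
                for (int j = 0; j < 100_000; j++)
                {
                    l.Enter();
                    counter++; // protected by the toy lock
                    l.Exit();
                }
            });
            threads[i].Start();
        }
        foreach (var t in threads) t.Join();
        Console.WriteLine(counter); // 400000 if the lock is correct
    }
}
```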
Should there be a cap on the collisions too, similarly to the iteration count, so as to avoid perhaps rare but potentially very large backoffs?
ExponentialBackoff has a total limit controlled by `MaxExponentialBackoffBits`.
It would be extremely unlikely to have so many collisions, but we have a limit just in case.
Unless I'm missing something, 10 bits seems like a lot to me as a cap for one spin iteration. If we assume `SpinWait(1)` takes on the order of 10 ns, `SpinWait((1 << 10) / 2)` would take on the order of 5 us, which is quite a lot for one spin-wait iteration, despite it being rare.
Because of the double-up increases, by the time the pause reaches 5us, we have already spent 5us spinning. Basically, horribly long pauses take a horribly long time to reach. The math kind of tells us that the limit is unnecessary, yet for many people, me included, an unbounded spin would trigger an alarm, so we need some limit. Because, who knows, we may not be considering everything...
There is not much science behind this number though. I wanted to pick something that does not stand in the way in a typical scenario, yet protects from degenerate cases (i.e. not spinning for seconds). Thus I picked `2^10`. I did not see it reached in experiments (not even close), yet it feels safer than `2^16` - another number I could have picked.
If there is a way to estimate a better number, we could change it.
It would be interesting to get access to a 1024-core machine, or however high the core count goes these days, and see how high the collision numbers could possibly get when trying for it intentionally (also to see if the randomization helps).
I think experimenting on 32 cores will mostly tell us that this number does not matter.
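To make the "math kind of tells us" point concrete: the pauses double, so they form a geometric series,

$$\sum_{k=0}^{n-1} 2^k \;=\; 2^n - 1 \;<\; 2^n,$$

meaning that by the time a single pause reaches $2^n$ spin-waits (roughly 5 us around $n = 9$, using the ~10 ns per spin-wait estimate above), the time already spent in all earlier pauses combined is about the same. Very long pauses can only be reached after a comparably long total wait.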
The above test seems to show some negative processor cache effects when lock ownership transitions very quickly between more than a few threads. It seems the best way to get high throughput there is to have only 2-3 threads active and to have all other threads wait. Spin-waiting less appears to have a similar effect. This would benefit a scenario where each thread is re-acquiring the lock a great many times in very quick succession, quickly enough that processor cache effects start coming into play for throughput measurements, which doesn't seem very realistic to me. I'm not sure that tuning for this test case is worthwhile. If it can be done without tradeoffs, then great, but otherwise for scaling on this test I think it's good enough if the throughput stabilizes somewhere reasonable with increasing threads.

The spin-tuning heuristic in this change seems to favor less spin-waiting even when more spin-waiting is beneficial. Consider the following test case, which measures how quickly a burst of lock contentions on a number of threads is resolved, in terms of throughput of bursts. I think it's a reasonably realistic test case, and can occur for instance periodically with timer ticks.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
namespace ConsoleApp12
{
internal class Program
{
static Program fatLock = new Program();
private static bool stop;
private static int iters;
static void Main(string[] args)
{
fatLock = new Program();
fatLock.GetHashCode();
for (int thrCount = 1; thrCount <= Environment.ProcessorCount; thrCount *= 2)
{
Console.WriteLine("Threads:" + thrCount);
for (int j = 0; j < 4; j++)
{
Console.Write("Fat Lock: ");
RunMany(() => FatLock(thrCount), thrCount);
}
}
}
static void RunMany(Action f, int threadCount)
{
Thread[] threads = new Thread[threadCount];
bool start = false;
stop = false;
iters = 0;
locksTaken = 0;
for (int i = 0; i < threads.Length; i++)
{
threads[i] = new Thread(
() =>
{
while (!start) Thread.SpinWait(1);
f();
}
);
threads[i].Start();
}
Thread.Sleep(10);
start = true;
Stopwatch sw = Stopwatch.StartNew();
Thread.Sleep(500);
int completedIters = iters;
sw.Stop();
stop = true;
for (int i = 0; i < threads.Length; i++)
{
threads[i].Join();
}
Console.WriteLine($"Ops per msec: {completedIters / sw.ElapsedMilliseconds,7}");
}
private static int Fib(int n) => n < 2 ? 1 : Fib(n - 1) + Fib(n - 2);
private static int ComputeSomething(Random random)
{
int delay = random.Next(4, 10);
return Fib(delay);
}
static int locksTaken;
static void FatLock(int thrCount)
{
Random random = new Random();
while (!stop)
{
int locksTakenValue;
lock (fatLock)
{
// update shared state
iters++;
locksTakenValue = ++locksTaken;
// do some computation
int value = ComputeSomething(random);
}
if (locksTakenValue == thrCount)
{
// this is the last thread to take the lock in this burst
locksTaken = -thrCount + 1;
}
else
{
// wait for other threads to take the lock
while (!stop && locksTaken > 0) Thread.SpinWait(1);
Interlocked.Increment(ref locksTaken);
}
// wait for other threads to finish waiting above
while (!stop && locksTaken < 0) Thread.SpinWait(1);
}
}
}
}
```

Below are some results on my 16-core 32-proc AMD processor on Windows.

CoreCLR:
NativeAOT after this change:
NativeAOT before this change:
In this test, allowing more spin-waiting threads avoids excessive context switching, up to a point after which it doesn't help much unless the spin-wait duration is increased, which would have other negative tradeoffs. Granted, this is also only a microbenchmark and I wouldn't suggest tuning specifically for it, but it shows how spin-waiting can reduce context switching.

After this change in NativeAOT, the context-switching overhead seems to start showing up with fewer threads involved in the contention (fairly early, around 4), indicating that the spin-tuning heuristic is favoring less spin-waiting even when more is beneficial.
Despite my other comments, this change seems to have fixed a number of other issues in the previous implementation in NativeAOT, and it seems likely that it would work better than NativeAOT's lock implementation in .NET 7, so it's probably a good improvement to take for .NET 8. For .NET 9, I hope that we can consolidate the implementations with #87672, which is mostly a port of CoreCLR's lock implementation, an implementation that we know to have been working without any significant issues for quite some time, and any further improvements can be made on top of that.
Completely agree on that!

Thanks!!!
Recent investigations into productizing the Lock type revealed a few issues in the "Fat" lock implementation on NativeAOT that negatively impact scalability under contention.

This PR addresses the following issues:

A blind CAS saves us one load of the state before the CAS, in the hope of saving on bus interactions. That does not seem to make any difference on current hardware.
The downside is that we cannot make a quick "unfair" one-shot acquire when a waiter is present (since the lock state is not a clean "Unlocked"), and thus the initial attempt is wasted. Preventing failures to acquire/release on the quick path appears to be more impactful than avoiding a load.

The spin loop uses backoffs when interlocked operations fail, but on the most common path, when we retry acquiring, there is only a minimal pause. As a result we retry too aggressively, which is unproductive. It becomes even more expensive when other threads keep modifying the lock state. Retrying threads would read a lot of modified states, and each read causes a cache miss plus unnecessary sharing of state that will soon need to be invalidated.
We need to pace the retries with backoffs.

As the lock becomes more crowded, the cost of spinning quickly goes up because of cache misses and more threads wasting more cycles. Ideally the spin limit should be dialed down as the cost of spinning goes up.
Instead, we assume that if we acquired the lock while spinning, the spin limit should be increased.
Such logic is not very sensitive to how crowded the lock is. What is worse, it is self-fulfilling. We may acquire the lock after long and expensive spinning with many cache misses and failed interlocked operations (and not only in the current thread), yet we conclude that we “succeeded” and everyone should spin even more – thus making the situation worse, while also making spinning more likely to “succeed” in the future. A crowded lock can degenerate into an expensive spinlock and stay in that mode, while a better choice would be the opposite – reduce crowdedness by reducing the spin limit.
Detecting whether the lock has changed ownership while we were spinning appears to be a much better signal for dialing the spin limit.