-
Notifications
You must be signed in to change notification settings - Fork 4.9k
Ensure ConcurrentBag's TryTake is linearizable #30947
Conversation
For .NET Core 2.0, I ported the ThreadPool's work-stealing implementation to ConcurrentBag, leading to significant performance throughput and allocation improvements. However, there's a subtle difference in the concurrency guarantees the ThreadPool's implementation provided from what ConcurrentBag needs, which ends up breaking certain usage patterns on top of ConcurrentBag. Specifically, ThreadPool's "steal" implementation need not be fully linearizable. It's possible for a thread to see the bag's count as 1, and then while the thread is doing a take/steal for its count to never drop below 1, but for the steal to still fail, even though there was always an item available. This is ok for the thread pool because it manages a known count of work items in the queues separately, and if it sees that there are still items available after a steal has failed, it'll try again. That "try again" logic provided above the work-stealing queue thus didn't make it over to ConcurrentBag, which breaks some usages of ConcurrentBag, in particular cases where a type like BlockingCollection is wrapping the bag and managing its own count. It's possible now for BlockingCollection to know that there's an item in the bag but to then fail to take it, which causes problems such as exceptions being thrown. The fix is to port back the relevant portion of ConcurrentBag from .NET Core 1.x / .NET Framework, where local push operations on a list track the number of times the list transitions from empty to non-empty. A steal operation then looks at those counts prior to doing the steal, and if the steal fails, it looks again after: if the count has increased, it retries. This unfortunately means that local pushes on small lists are now more expensive than in .NET Core 2.0/2.1, as if there are <= 2 items in the list, it takes the lock, but this seems unavoidable given the work-stealing design.
// We work around this by looking at the number of times any list transitions from == 0 to > 0, | ||
// checking that before and after the steal attempts. We don't care about > 0 to > 0 transitions, | ||
// because a steal from a list with > 0 elements would have been successful. | ||
long initialEmptyToNonEmptyCounts = Interlocked.Read(ref _emptyToNonEmptyListTransitionCount); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
initialEmptyToNonEmptyCounts [](start = 21, length = 28)
Is it possible in some worst case this operation can starve? I mean while executing here, other threads can push items (which change _emptyToNonEmptyListTransitionCount) and then some other third thread Take the pushed item and this operation continue to happen without having this current thread to get a chance to steal any item and in same time _emptyToNonEmptyListTransitionCount keep changing? I know this is hypothetical but it may be possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's possible, but the chances of that happening are small, and get smaller and smaller the more times it would need to loop. It's also true with ConcurrentBag in netfx today, as this is just adopting the same scheme it has.
@@ -62,7 +64,7 @@ public ConcurrentBag(IEnumerable<T> collection) | |||
WorkStealingQueue queue = GetCurrentThreadWorkStealingQueue(forceCreate: true); | |||
foreach (T item in collection) | |||
{ | |||
queue.LocalPush(item); | |||
queue.LocalPush(item, ref _emptyToNonEmptyListTransitionCount); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ref _emptyToNonEmptyListTransitionCount [](start = 38, length = 39)
why we are passing _emptyToNonEmptyListTransitionCount to LocalPush and not letting LocalPush to access _emptyToNonEmptyListTransitionCount as it is object field anyway? I am seeing we always access it with Interlock operation so it should be safe. is this because erf reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_emptyToNonEmptyListTransitionCount
is a field on ConcurrentBag
; LocalPush
is a method on WorkStealingQueue
, which doesn't have a back reference to the ConcurrentBag
with which it's associated.
/// <summary>The head work stealing queue in a linked list of queues.</summary> | ||
private volatile WorkStealingQueue _workStealingQueues; | ||
/// <summary>Number of times any list transitions from empty to non-empty.</summary> | ||
private long _emptyToNonEmptyListTransitionCount; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
long [](start = 16, length = 4)
does it make any difference if we have this as ulong instead? just to always have positive number? no strong feeling though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really... we're just doing an == check on the two values.
Debug.Assert(_headIndex <= _tailIndex); | ||
|
||
_currentOp = (int)Operation.Add; | ||
Interlocked.Exchange(ref _currentOp, (int)Operation.Add); // ensure subsequent reads aren't reordered before this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't the surrounding lock already guarantee that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was under the impression that releasing a lock has store release semantics rather than being a full barrier. If it is store release, that would prevent any reads/writes from inside the lock from moving out, but it wouldn't prevent a later read from moving up before this write to _currentOp, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah that's true
For .NET Core 2.0, I ported the ThreadPool's work-stealing implementation to ConcurrentBag, leading to significant performance throughput and allocation improvements. However, there's a subtle difference in the concurrency guarantees the ThreadPool's implementation provided from what ConcurrentBag needs, which ends up breaking certain usage patterns on top of ConcurrentBag. Specifically, ThreadPool's "steal" implementation need not be fully linearizable. It's possible for a thread to see the bag's count as 1, and then while the thread is doing a take/steal for its count to never drop below 1, but for the steal to still fail, even though there was always an item available. This is ok for the thread pool because it manages a known count of work items in the queues separately, and if it sees that there are still items available after a steal has failed, it'll try again. That "try again" logic provided above the work-stealing queue thus didn't make it over to ConcurrentBag, which breaks some usages of ConcurrentBag, in particular cases where a type like BlockingCollection is wrapping the bag and managing its own count. It's possible now for BlockingCollection to know that there's an item in the bag but to then fail to take it, which causes problems such as exceptions being thrown. The fix is to port back the relevant portion of ConcurrentBag from .NET Core 1.x / .NET Framework, where local push operations on a list track the number of times the list transitions from empty to non-empty. A steal operation then looks at those counts prior to doing the steal, and if the steal fails, it looks again after: if the count has increased, it retries. This unfortunately means that local pushes on small lists are now more expensive than in .NET Core 2.0/2.1, as if there are <= 2 items in the list, it takes the lock, but this seems unavoidable given the work-stealing design.
For .NET Core 2.0, I ported the ThreadPool's work-stealing implementation to ConcurrentBag, leading to significant performance throughput and allocation improvements. However, there's a subtle difference in the concurrency guarantees the ThreadPool's implementation provided from what ConcurrentBag needs, which ends up breaking certain usage patterns on top of ConcurrentBag. Specifically, ThreadPool's "steal" implementation need not be fully linearizable. It's possible for a thread to see the bag's count as 1, and then while the thread is doing a take/steal for its count to never drop below 1, but for the steal to still fail, even though there was always an item available. This is ok for the thread pool because it manages a known count of work items in the queues separately, and if it sees that there are still items available after a steal has failed, it'll try again. That "try again" logic provided above the work-stealing queue thus didn't make it over to ConcurrentBag, which breaks some usages of ConcurrentBag, in particular cases where a type like BlockingCollection is wrapping the bag and managing its own count. It's possible now for BlockingCollection to know that there's an item in the bag but to then fail to take it, which causes problems such as exceptions being thrown. The fix is to port back the relevant portion of ConcurrentBag from .NET Core 1.x / .NET Framework, where local push operations on a list track the number of times the list transitions from empty to non-empty. A steal operation then looks at those counts prior to doing the steal, and if the steal fails, it looks again after: if the count has increased, it retries. This unfortunately means that local pushes on small lists are now more expensive than in .NET Core 2.0/2.1, as if there are <= 2 items in the list, it takes the lock, but this seems unavoidable given the work-stealing design.
For .NET Core 2.0, I ported the ThreadPool's work-stealing implementation to ConcurrentBag, leading to significant performance throughput and allocation improvements. However, there's a subtle difference in the concurrency guarantees the ThreadPool's implementation provided from what ConcurrentBag needs, which ends up breaking certain usage patterns on top of ConcurrentBag. Specifically, ThreadPool's "steal" implementation need not be fully linearizable. It's possible for a thread to see the bag's count as 1, and then while the thread is doing a take/steal for its count to never drop below 1, but for the steal to still fail, even though there was always an item available. This is ok for the thread pool because it manages a known count of work items in the queues separately, and if it sees that there are still items available after a steal has failed, it'll try again. That "try again" logic provided above the work-stealing queue thus didn't make it over to ConcurrentBag, which breaks some usages of ConcurrentBag, in particular cases where a type like BlockingCollection is wrapping the bag and managing its own count. It's possible now for BlockingCollection to know that there's an item in the bag but to then fail to take it, which causes problems such as exceptions being thrown. The fix is to port back the relevant portion of ConcurrentBag from .NET Core 1.x / .NET Framework, where local push operations on a list track the number of times the list transitions from empty to non-empty. A steal operation then looks at those counts prior to doing the steal, and if the steal fails, it looks again after: if the count has increased, it retries. This unfortunately means that local pushes on small lists are now more expensive than in .NET Core 2.0/2.1, as if there are <= 2 items in the list, it takes the lock, but this seems unavoidable given the work-stealing design. Commit migrated from dotnet/corefx@864e82e
For .NET Core 2.0, I ported the ThreadPool's work-stealing implementation to ConcurrentBag, leading to significant performance throughput and allocation improvements. However, there's a subtle difference in the concurrency guarantees the ThreadPool's implementation provided from what ConcurrentBag needs, which ends up breaking certain usage patterns on top of ConcurrentBag.
Specifically, ThreadPool's "steal" implementation need not be fully linearizable. It's possible for a thread to see the bag's count as 1, and then while the thread is doing a take/steal for its count to never drop below 1, but for the steal to still fail, even though there was always an item available. This is ok for the thread pool because it manages a known count of work items in the queues separately, and if it sees that there are still items available after a steal has failed, it'll try again. That "try again" logic provided above the work-stealing queue thus didn't make it over to ConcurrentBag, which breaks some usages of ConcurrentBag, in particular cases where a type like BlockingCollection is wrapping the bag and managing its own count. It's possible now for BlockingCollection to know that there's an item in the bag but to then fail to take it, which causes problems such as exceptions being thrown.
The fix is to port back the relevant portion of ConcurrentBag from .NET Core 1.x / .NET Framework, where local push operations on a list track the number of times the list transitions from empty to non-empty. A steal operation then looks at those counts prior to doing the steal, and if the steal fails, it looks again after: if the count has increased, it retries. This unfortunately means that local pushes on small lists are now more expensive than in .NET Core 2.0/2.1, as if there are <= 2 items in the list, it takes the lock, but this seems unavoidable given the work-stealing design.
Contributes to https://github.com/dotnet/corefx/issues/30781
cc: @kouvel, @tarekgh, @ReubenBond
Please pay close attention to the details of the change. I've convinced myself that it's correct, but I've also been staring at it for a while, so it's very possible I've missed something. I tried to keep the logic as close to the original .NET Core 1.x / .NET Framework logic as possible. The main difference is that rather than having each list maintain its own empty-to-nonempty transition count, I have a single count that's incremented via interlocked increment by all of them. I did that in part to keep things simpler, even though it adds a bit more expense, but also because I couldn't convince myself that the original logic used by netfx here was actually correct; if you can, we could consider switching to that, which would make some adds slightly less expensive while making steals more expensive.