Fix: CTS.TryReset() concurrency issue #60182 #60224

sakno · 2021-10-09T22:55:09Z

PR contains a fix for #60182

Also, the fix resolves potential concurrency between Cancel and TryReset as follows:

If Cancel happens before TryReset then CTS will be canceled normally
If TryReset happens before Cancel then Cancel will have no effect

ghost · 2021-10-09T22:55:16Z

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

Issue Details

PR contains a fix for #60182

Author:	sakno
Assignees:	-
Labels:	`area-System.Threading`, `community-contribution`
Milestone:	-

kasperk81 · 2021-10-11T12:27:16Z

If Cancel happens before TryReset then CTS will be canceled normally

If TryReset happens before Cancel then Cancel will have no effect

tests for both cases which fail with current net60 will protect this fix from regressing

sakno · 2021-10-11T14:00:49Z

@kasperk81 , I agree with you. However, I see no way to write a stable tests for that cases. During such tests, we must have a guarantee that Cancel is called exactly after TryReset has reached CompareExchange barrier.

stephentoub · 2021-10-11T23:36:03Z

Thanks, @sakno.

I appreciate your putting a fix together. I'm not sure it's what we should do, though. Beyond adding complexity to the synchronization scheme, it's making IsCancellationRequested a bit more expensive, and while not a lot more, IsCancellationRequested is on a ton of hot paths; it's very likely that adding even a tad more expense there will be noticed in some benchmarks and potentially even real-world use.

In your bug report, you said there's no workaround. But, isn't the only affect of this race condition that, in rare conditions, TryReset throws an ObjectDisposedException instead of returning false? Assuming yes, this points to both a workaround (someone can catch the exception) and a fix (we can either catch or avoid the exception).

Note that CancellationTokenSource already wraps timer.Change in another place with a try/catch(ObjectDisposedException):

runtime/src/libraries/System.Private.CoreLib/src/System/Threading/CancellationTokenSource.cs

Lines 355 to 365 in 4b84451

    
           try 
        
           { 
        
               timer.Change(millisecondsDelay, Timeout.UnsignedInfinite); 
        
           } 
        
           catch (ObjectDisposedException) 
        
           { 
        
               // Just eat the exception.  There is no other way to tell that 
        
               // the timer has been disposed, and even if there were, there 
        
               // would not be a good way to deal with the observe/dispose 
        
               // race condition. 
        
           }

so while I don't love first-chance exceptions like this, I think it'd be fine for the entirety of this fix to simply be wrapping this additional location with a similar try/catch. That would also seem to be the least-risky fix that would enable us porting back to release/6.0 (you opened this PR against release/6.0-rc2, but our standard practice is to first get it fixed in main and then decide if/what/when to backport). We could subsequently look at avoiding the exception altogether; we're already depending on the internal TimerQueueTimer, and we could for example update its Change method to take an additional bool argument indicating whether to throw or return false in the face of cancellation/disposal; that, however, should be something we consider separately from fixing this.

sakno · 2021-10-12T09:56:16Z

@BrzVlad , @EgorBo , @eiriktsarpalis , @imhameed , @lambdageek , @lateralusX , @layomia , @marek-safar , @naricc , @SamMonoRT , @steveharter , @thaystg , @vargaz I'm very sorry folk, my bad, you're all added because I changed the base for this PR. The participation is not needed.

sakno · 2021-10-12T10:06:51Z

Thanks @stephentoub for your feedback. From my point of view, try-catch here looks like dirty fix. I still believe that the better way exists.

it's making IsCancellationRequested a bit more expensive

Yes, with one extra instruction: and operator. I fixed that with carefully selected value of ResettingState constant.

adding complexity to the synchronization scheme

Overhead for all existing methods remain the same except for TryReset. Even for this method, StoreLoad fence is just replaced with cmpxchg barrier which is introduces negligible overhead in comparison to execution time of entire method.

With proposed changes, we can use happens before relationship to analyze concurrency situations. I think this approach is less risky than leaving everything as-is. We have three sources of state transitions: scheduled cancellation, manual cancellation (via Cancel method), and TryReset. And all of them now protected with the barrier protecting the state transition. The possible transitions are:

NotCanceled => Resetting => Notifying => NotifyingComplete
NotCanceled => Resetting => NotCanceled
NotCanceled => Notifying => NotifyingComplete

Additionally, this schema naturally resolves the undefined behavior caused by concurrent calls of Cancel and TryReset.

stephentoub · 2021-10-12T11:04:20Z

The original intent here was to use the timer's lock as the coordination mechanism. That's still valid; the only thing that was missed was that the timer might explicitly throw if it's been closed. We're about to ship the release, and the proposed lock-free and non-trivial changes to synchronization are inherently risky, as is the potential perf impact to IsCancellationRequested I mentioned. We don't have time to fully vet such changes and any downstream impact. It's a lot less risky to simply add the catch block to deal with the one missed aspect initially. Even when we're not constrained by time, I'd prefer to follow-through with the original design and continue using the existing lock for the coordination and just avoiding the throw rather than adding any expense at all to IsCancellationRequested.

sakno · 2021-10-12T13:26:31Z

All expenses related to IsCancellationRequested removed in the last change. And there are still two mechanisms of synchronization: timer and _state field, even without my changes. try-catch adds constant overhead to TryReset method. Moreover, a big plus here is a solution for Cancel and TryReset concurrency.

However, I'm agree with you about the timeline, so I see the following consensus here:

I'll add a separated PR with dirty hack for ObjectDisposedException
This one I prefer to remain opened for further discussions

What do you think?

stephentoub · 2021-10-12T13:57:15Z

I'll add a separated PR with dirty hack for ObjectDisposedException

You keep trying to label it that way 😄, but I disagree with the categorization. TimerQueueTimer is signaling a particular state used for coordination with an exception rather than a return value, and the bug fundamentally is that the call site isn't properly handling that signal. The true fix for the bug is thus adding that catch block. Separately I dislike that this is the mechanism by which the internal TimerQueueTimer.Change is signaling that condition, and we can subsequently look at addressing that such that the state is optionally signaled via return value rather than exception, and both this catch block and the other in the file could then be changed to listen for that new signal (which this call site already is, and the other one doesn't actually need to pay attention to).

But, yes, please either change this PR to be that fix or open a new one for it. Either way is fine, though I still don't think what's in this PR is the proper long-term change, as it's adding complexity for little reason. It's also a breaking change: note that with how it's in this PR now, exceptions thrown from cancellation callbacks may be thrown out of TryReset.

sakno · 2021-10-12T14:16:45Z

exceptions thrown from cancellation callbacks may be thrown out of TryReset.

Good catch. I'll change this PR with a quick fix (let's use this term instead of dirty hack 😄 ) using try-catch and open a new PR with necessary changes to TimerQueueTimer class.

src/libraries/System.Private.CoreLib/src/System/Threading/CancellationTokenSource.cs

stephentoub

Thanks.

stephentoub · 2021-10-13T01:01:22Z

/backport to release/6.0

github-actions · 2021-10-13T01:01:36Z

Started backporting to release/6.0: https://github.com/dotnet/runtime/actions/runs/1335487553

ghost added the community-contribution Indicates that the PR has been added by a community member label Oct 9, 2021

dotnet-issue-labeler bot added the area-System.Threading label Oct 9, 2021

sakno marked this pull request as draft October 9, 2021 22:55

sakno force-pushed the issue-60182 branch from 1638e98 to 72ce261 Compare October 9, 2021 23:26

sakno changed the base branch from main to release/6.0-rc2 October 9, 2021 23:27

sakno marked this pull request as ready for review October 10, 2021 00:16

sakno force-pushed the issue-60182 branch from 8c59529 to cc0ef3e Compare October 12, 2021 09:51

sakno requested review from BrzVlad, EgorBo, eiriktsarpalis, imhameed, lambdageek, lateralusX, layomia, marek-safar, naricc, SamMonoRT, steveharter, thaystg and vargaz as code owners October 12, 2021 09:51

sakno changed the base branch from release/6.0-rc2 to main October 12, 2021 09:52

Quick fix of dotnet#60182

7913094

sakno force-pushed the issue-60182 branch from cc0ef3e to 7913094 Compare October 12, 2021 14:23

Fixed reset condition when there is no timer

c398288

This was referenced Oct 12, 2021

Alternative fix of #60182 #60298

Merged

CTS.TryReset() concurrency issue #60182

Closed

stephentoub reviewed Oct 12, 2021

View reviewed changes

src/libraries/System.Private.CoreLib/src/System/Threading/CancellationTokenSource.cs Show resolved Hide resolved

Simplify branching

096837e

stephentoub approved these changes Oct 12, 2021

View reviewed changes

stephentoub merged commit 477789e into dotnet:main Oct 13, 2021

github-actions bot mentioned this pull request Oct 13, 2021

[release/6.0] Fix: CTS.TryReset() concurrency issue #60182 #60323

Merged

ghost locked as resolved and limited conversation to collaborators Nov 12, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix: CTS.TryReset() concurrency issue #60182 #60224

Fix: CTS.TryReset() concurrency issue #60182 #60224

sakno commented Oct 9, 2021 •

edited

Loading

ghost commented Oct 9, 2021

kasperk81 commented Oct 11, 2021

sakno commented Oct 11, 2021

stephentoub commented Oct 11, 2021

sakno commented Oct 12, 2021 •

edited

Loading

sakno commented Oct 12, 2021

stephentoub commented Oct 12, 2021

sakno commented Oct 12, 2021

stephentoub commented Oct 12, 2021 •

edited

Loading

sakno commented Oct 12, 2021

stephentoub left a comment

stephentoub commented Oct 13, 2021

github-actions bot commented Oct 13, 2021

Fix: CTS.TryReset() concurrency issue #60182 #60224

Fix: CTS.TryReset() concurrency issue #60182 #60224

Conversation

sakno commented Oct 9, 2021 • edited Loading

ghost commented Oct 9, 2021

kasperk81 commented Oct 11, 2021

sakno commented Oct 11, 2021

stephentoub commented Oct 11, 2021

sakno commented Oct 12, 2021 • edited Loading

sakno commented Oct 12, 2021

stephentoub commented Oct 12, 2021

sakno commented Oct 12, 2021

stephentoub commented Oct 12, 2021 • edited Loading

sakno commented Oct 12, 2021

stephentoub left a comment

Choose a reason for hiding this comment

stephentoub commented Oct 13, 2021

github-actions bot commented Oct 13, 2021

sakno commented Oct 9, 2021 •

edited

Loading

sakno commented Oct 12, 2021 •

edited

Loading

stephentoub commented Oct 12, 2021 •

edited

Loading