Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: CTS.TryReset() concurrency issue #60182 #60224

Merged
merged 3 commits into from
Oct 13, 2021
Merged

Conversation

sakno
Copy link
Contributor

@sakno sakno commented Oct 9, 2021

PR contains a fix for #60182

Also, the fix resolves potential concurrency between Cancel and TryReset as follows:

  • If Cancel happens before TryReset then CTS will be canceled normally
  • If TryReset happens before Cancel then Cancel will have no effect

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Oct 9, 2021
@ghost
Copy link

ghost commented Oct 9, 2021

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

Issue Details

PR contains a fix for #60182

Author: sakno
Assignees: -
Labels:

area-System.Threading, community-contribution

Milestone: -

@sakno sakno marked this pull request as draft October 9, 2021 22:55
@sakno sakno changed the base branch from main to release/6.0-rc2 October 9, 2021 23:27
@sakno sakno marked this pull request as ready for review October 10, 2021 00:16
@kasperk81
Copy link
Contributor

  • If Cancel happens before TryReset then CTS will be canceled normally

  • If TryReset happens before Cancel then Cancel will have no effect

tests for both cases which fail with current net60 will protect this fix from regressing

@sakno
Copy link
Contributor Author

sakno commented Oct 11, 2021

@kasperk81 , I agree with you. However, I see no way to write a stable tests for that cases. During such tests, we must have a guarantee that Cancel is called exactly after TryReset has reached CompareExchange barrier.

@stephentoub
Copy link
Member

Thanks, @sakno.

I appreciate your putting a fix together. I'm not sure it's what we should do, though. Beyond adding complexity to the synchronization scheme, it's making IsCancellationRequested a bit more expensive, and while not a lot more, IsCancellationRequested is on a ton of hot paths; it's very likely that adding even a tad more expense there will be noticed in some benchmarks and potentially even real-world use.

In your bug report, you said there's no workaround. But, isn't the only affect of this race condition that, in rare conditions, TryReset throws an ObjectDisposedException instead of returning false? Assuming yes, this points to both a workaround (someone can catch the exception) and a fix (we can either catch or avoid the exception).

Note that CancellationTokenSource already wraps timer.Change in another place with a try/catch(ObjectDisposedException):

try
{
timer.Change(millisecondsDelay, Timeout.UnsignedInfinite);
}
catch (ObjectDisposedException)
{
// Just eat the exception. There is no other way to tell that
// the timer has been disposed, and even if there were, there
// would not be a good way to deal with the observe/dispose
// race condition.
}

so while I don't love first-chance exceptions like this, I think it'd be fine for the entirety of this fix to simply be wrapping this additional location with a similar try/catch. That would also seem to be the least-risky fix that would enable us porting back to release/6.0 (you opened this PR against release/6.0-rc2, but our standard practice is to first get it fixed in main and then decide if/what/when to backport). We could subsequently look at avoiding the exception altogether; we're already depending on the internal TimerQueueTimer, and we could for example update its Change method to take an additional bool argument indicating whether to throw or return false in the face of cancellation/disposal; that, however, should be something we consider separately from fixing this.

@sakno
Copy link
Contributor Author

sakno commented Oct 12, 2021

@BrzVlad , @EgorBo , @eiriktsarpalis , @imhameed , @lambdageek , @lateralusX , @layomia , @marek-safar , @naricc , @SamMonoRT , @steveharter , @thaystg , @vargaz I'm very sorry folk, my bad, you're all added because I changed the base for this PR. The participation is not needed.

@sakno
Copy link
Contributor Author

sakno commented Oct 12, 2021

Thanks @stephentoub for your feedback. From my point of view, try-catch here looks like dirty fix. I still believe that the better way exists.

it's making IsCancellationRequested a bit more expensive

Yes, with one extra instruction: and operator. I fixed that with carefully selected value of ResettingState constant.

adding complexity to the synchronization scheme

Overhead for all existing methods remain the same except for TryReset. Even for this method, StoreLoad fence is just replaced with cmpxchg barrier which is introduces negligible overhead in comparison to execution time of entire method.

With proposed changes, we can use happens before relationship to analyze concurrency situations. I think this approach is less risky than leaving everything as-is. We have three sources of state transitions: scheduled cancellation, manual cancellation (via Cancel method), and TryReset. And all of them now protected with the barrier protecting the state transition. The possible transitions are:

  • NotCanceled => Resetting => Notifying => NotifyingComplete
  • NotCanceled => Resetting => NotCanceled
  • NotCanceled => Notifying => NotifyingComplete

Additionally, this schema naturally resolves the undefined behavior caused by concurrent calls of Cancel and TryReset.

@stephentoub
Copy link
Member

The original intent here was to use the timer's lock as the coordination mechanism. That's still valid; the only thing that was missed was that the timer might explicitly throw if it's been closed. We're about to ship the release, and the proposed lock-free and non-trivial changes to synchronization are inherently risky, as is the potential perf impact to IsCancellationRequested I mentioned. We don't have time to fully vet such changes and any downstream impact. It's a lot less risky to simply add the catch block to deal with the one missed aspect initially. Even when we're not constrained by time, I'd prefer to follow-through with the original design and continue using the existing lock for the coordination and just avoiding the throw rather than adding any expense at all to IsCancellationRequested.

@sakno
Copy link
Contributor Author

sakno commented Oct 12, 2021

All expenses related to IsCancellationRequested removed in the last change. And there are still two mechanisms of synchronization: timer and _state field, even without my changes. try-catch adds constant overhead to TryReset method. Moreover, a big plus here is a solution for Cancel and TryReset concurrency.

However, I'm agree with you about the timeline, so I see the following consensus here:

  • I'll add a separated PR with dirty hack for ObjectDisposedException
  • This one I prefer to remain opened for further discussions

What do you think?

@stephentoub
Copy link
Member

stephentoub commented Oct 12, 2021

I'll add a separated PR with dirty hack for ObjectDisposedException

You keep trying to label it that way 😄, but I disagree with the categorization. TimerQueueTimer is signaling a particular state used for coordination with an exception rather than a return value, and the bug fundamentally is that the call site isn't properly handling that signal. The true fix for the bug is thus adding that catch block. Separately I dislike that this is the mechanism by which the internal TimerQueueTimer.Change is signaling that condition, and we can subsequently look at addressing that such that the state is optionally signaled via return value rather than exception, and both this catch block and the other in the file could then be changed to listen for that new signal (which this call site already is, and the other one doesn't actually need to pay attention to).

But, yes, please either change this PR to be that fix or open a new one for it. Either way is fine, though I still don't think what's in this PR is the proper long-term change, as it's adding complexity for little reason. It's also a breaking change: note that with how it's in this PR now, exceptions thrown from cancellation callbacks may be thrown out of TryReset.

@sakno
Copy link
Contributor Author

sakno commented Oct 12, 2021

exceptions thrown from cancellation callbacks may be thrown out of TryReset.

Good catch. I'll change this PR with a quick fix (let's use this term instead of dirty hack 😄 ) using try-catch and open a new PR with necessary changes to TimerQueueTimer class.

Copy link
Member

@stephentoub stephentoub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

@stephentoub stephentoub merged commit 477789e into dotnet:main Oct 13, 2021
@stephentoub
Copy link
Member

/backport to release/6.0

@github-actions
Copy link
Contributor

Started backporting to release/6.0: https://github.com/dotnet/runtime/actions/runs/1335487553

@ghost ghost locked as resolved and limited conversation to collaborators Nov 12, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Threading community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants