
CounterEvent's synchronous design causes thread starvation under load #755

Closed
DaRosenberg opened this issue Aug 14, 2018 · 7 comments

@DaRosenberg

Which service(blob, file, queue, table) does this issue concern?

Blob

Which version of the SDK was used?

9.3.0

Which platform are you using? (ex: .NET Core 2.1)

.NET Core 2.1

What problem was encountered?

The type CounterEvent is used in the stream implementations to wait for all pending operations to finish before returning from a flush operation.

The implementation of CounterEvent is synchronous and based on an underlying ManualResetEvent. Stream implementations queue up a thread pool operation to wait for the counter to reach zero. This thread pool operation gets scheduled on a dedicated thread, which then blocks for the duration of the wait:

await Task.Run(() => this.noPendingWritesEvent.Wait(), cancellationToken);

This has turned out to be a significant scalability bottleneck for us: high concurrency quickly leads to thread pool starvation, because a lot of these threads sit in this waiting state for a long time.
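
For context, here is a minimal sketch of the synchronous pattern as we understand it (CounterEvent is internal to the SDK, so the class and member names below are illustrative rather than the actual SDK code):

```csharp
using System.Threading;

// Illustrative sketch only: the event is signaled whenever the count is zero,
// and Wait() blocks the calling thread until it is signaled.
internal sealed class SyncCounterEventSketch
{
    private readonly ManualResetEventSlim zeroEvent = new ManualResetEventSlim(initialState: true);
    private readonly object sync = new object();
    private int count;

    public void Increment()
    {
        lock (this.sync)
        {
            this.count++;
            this.zeroEvent.Reset();   // pending work exists: un-signal the event
        }
    }

    public void Decrement()
    {
        lock (this.sync)
        {
            if (--this.count == 0)
            {
                this.zeroEvent.Set(); // no pending work left: release waiters
            }
        }
    }

    // Blocks the calling thread. This is the call that ends up wrapped in Task.Run()
    // by the stream implementations, tying up a thread pool thread for the whole wait.
    public void Wait() => this.zeroEvent.Wait();
}
```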

Have you found a mitigation/solution?

In our fork, we changed the CounterEvent implementation to use an AsyncManualResetEvent behind the scenes and to expose a WaitAsync() method, which the stream implementations can use to wait without blocking a whole thread:

await this.noPendingWritesEvent.WaitAsync(cancellationToken);

An even cleaner alternative could be to replace the CounterEvent completely with an AsyncCountdownEvent.
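
To illustrate, here is a minimal sketch of what such an async counter event can look like, built on TaskCompletionSource (the class and member names are made up for this example, and cancellation handling is omitted for brevity):

```csharp
using System.Threading;
using System.Threading.Tasks;

// Illustrative sketch: the counter is backed by a TaskCompletionSource instead of a
// ManualResetEvent, so callers can await completion without blocking a thread.
internal sealed class AsyncCounterEventSketch
{
    private readonly object sync = new object();
    private TaskCompletionSource<bool> zeroTcs = CreateCompletedTcs();
    private int count;

    public void Increment()
    {
        lock (this.sync)
        {
            this.count++;
            if (this.zeroTcs.Task.IsCompleted)
            {
                // Pending work exists again: swap in a fresh, uncompleted TCS.
                this.zeroTcs = new TaskCompletionSource<bool>(
                    TaskCreationOptions.RunContinuationsAsynchronously);
            }
        }
    }

    public void Decrement()
    {
        lock (this.sync)
        {
            if (--this.count == 0)
            {
                this.zeroTcs.TrySetResult(true); // release everyone awaiting WaitAsync()
            }
        }
    }

    // Returns a task that completes when the count reaches zero; no thread is blocked.
    // A real implementation would also register on the cancellation token.
    public Task WaitAsync(CancellationToken cancellationToken = default)
    {
        lock (this.sync)
        {
            return this.zeroTcs.Task;
        }
    }

    private static TaskCompletionSource<bool> CreateCompletedTcs()
    {
        var tcs = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
        tcs.SetResult(true);
        return tcs;
    }
}
```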

This change removed the scalability bottleneck for us and allowed our service to scale almost perfectly linearly. Throughput increased by at least 20x (we stopped measuring); CPU and the capacity of the backend blob storage account are now the only limits.

@rickle-msft
Contributor

Hi, @DaRosenberg. Thank you for opening this issue. I have added it to our team's backlog for further consideration. We will get back to you when we have more information to share.

@DaRosenberg
Author

@rickle-msft Thanks! May I ask: are you accepting community contributions at all? If so, I am happy to submit a PR for this, since we have already made this improvement successfully in our fork.

I ask because, while your readme says you welcome contributions, I have had a quite small and simple PR open since January that nobody has approved, rejected, or commented on. I went to great lengths to ensure it meets your standards, so it is not very encouraging that nobody has even looked at it...

Can you clarify whether you do in fact welcome contributions, and if not, maybe remove the section of your readme which solicits them?

@rickle-msft
Contributor

@DaRosenberg We do accept community contributions. Evidently, we have not been as responsive to all PRs as we ought to have been, and I am very sorry for the frustration this has caused. Let me discuss with my team on Monday morning why we have not reviewed your other PR, and whether we would realistically be able to review a PR you submit for this before you go through the hassle of doing so.

Thank you again for your communication, contributions, and patience.

@kfarmer-msft
Contributor

@DaRosenberg

It looks like an implementation of CounterEventAsync made it into our split library some time ago, but it hasn't been put to use in the methods you're discussing. I've made some changes and am running tests now.

@DaRosenberg
Author

@kfarmer-msft That would be sweet. Would love to move off our fork and back onto the official package.

kfarmer-msft added a commit that referenced this issue Dec 4, 2018
@kfarmer-msft
Contributor

@DaRosenberg: Please try v9.4.2

@DaRosenberg
Author

DaRosenberg commented Dec 23, 2018

@kfarmer-msft Did some perf testing on 9.4.2 today. It is definitely an improvement, but we are not seeing the same near-perfect linear scalability we are getting with our fork.

I've reviewed the commit you referenced, and as far as I can see you've covered all the places we needed to change. Probably something else has been introduced into the library since then that creates a bottleneck somewhere else. Maybe I'll find the time to do some more line-level profiling some day.
