Thread pool starvation caused by sync waiting inside PhysicalBridge #2664
Now this is very interesting (pinging @NickCraver @philon-msft) - it looks like the intent is for … Great report @krasin-ga! Will investigate further.
Yep, fully agree. We changed how this works several times back at the beginning, but the thread pool hopping would now happen at several points. This needs love, and potentially some cap on thread creation, lest we create hundreds of threads for hundreds of endpoints during a storm.
Thanks for the quick reply! Please allow me to add my two cents. I understand that this logic was added to work around issues related to synchronous operations and thread pool starvation in general. However, it seems that this is not a problem the Redis client library should be solving; it would be more rational to focus on minimizing overhead within the library itself. At first glance, it appears that abandoning synchronous locks in favor of wait-free alternatives and using …
We're tweaking how this works in #2667 to reflect the current dedicated thread behavior. With respect to the follow-up comments: I'm just going to assume you've only taken a glance, because the rest of it is pretty uninformed with respect to how things in the library behave. I'm going to write off the other comments as not having taken any deep look at the library, or dealt with thread pool starvation in hundreds of apps as we have (and continue to help others with), and leave it at that.
As I understand the current architecture: from a top-level view, PhysicalBridge is responsible for pushing messages to the backlog, from which they are read by a dedicated thread and serialized to a pipe (from which the serialized messages are pushed to the socket by PipeScheduler), or completed by timeout. Dedicated threads are only used to deal with situations where the runtime experiences thread pool starvation. There is one PhysicalBridge instance per connection. As of now there is a way to configure the PipeScheduler, but there is no way to configure backlog processing behavior. My points:
I wrote a quick benchmark that emulates backlog behavior and processes 10,000,000 items:
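The original snippet isn't included in this thread. A minimal sketch of what such a comparison might look like, assuming one variant that drains a Monitor-guarded queue on a dedicated thread and one that uses System.Threading.Channels; names and structure here are illustrative, not the original benchmark:

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

class BacklogBenchmarkSketch
{
    const int Items = 10_000_000;

    // Variant 1: producer and a dedicated drain thread coordinate via a lock,
    // roughly the shape of the lock-and-drain approach discussed above.
    static void DedicatedThreadDrain()
    {
        var queue = new ConcurrentQueue<int>();
        var done = new ManualResetEventSlim();
        var gate = new object();

        var worker = new Thread(() =>
        {
            int seen = 0;
            while (seen < Items)
            {
                bool any;
                lock (gate) { any = queue.TryDequeue(out _); }
                if (any) seen++;
                else Thread.Yield(); // nothing queued yet, give the producer a turn
            }
            done.Set();
        })
        { IsBackground = true };
        worker.Start();

        for (int i = 0; i < Items; i++)
        {
            lock (gate) { queue.Enqueue(i); }
        }
        done.Wait();
    }

    // Variant 2: an unbounded Channel with a single async reader; no blocking
    // locks, continuations stay on the thread pool.
    static async Task ChannelDrainAsync()
    {
        var channel = Channel.CreateUnbounded<int>(
            new UnboundedChannelOptions { SingleReader = true });

        var reader = Task.Run(async () =>
        {
            int seen = 0;
            while (seen < Items && await channel.Reader.WaitToReadAsync())
            {
                while (channel.Reader.TryRead(out _)) seen++;
            }
        });

        for (int i = 0; i < Items; i++)
        {
            channel.Writer.TryWrite(i);
        }
        channel.Writer.Complete();
        await reader;
    }

    static async Task Main()
    {
        var sw = Stopwatch.StartNew();
        DedicatedThreadDrain();
        Console.WriteLine($"Dedicated thread + lock: {sw.Elapsed}");

        sw.Restart();
        await ChannelDrainAsync();
        Console.WriteLine($"Channel-based drain:     {sw.Elapsed}");
    }
}
```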
Both methods can be optimized, but I hope they are still close enough for illustrative purposes. PS: I'm not a native English speaker and don't use English for communication very often, so my choice of words may not be very accurate. Please forgive any unintended offense.
@krasin-ga I think we have a mismatch in understanding here: that the backlog is part of the normal flow at all. It is not. The backlog is not normally written to, but it must be synchronously drained to help prevent spiral-of-death scenarios (because it's quite likely that writes or response processing cannot complete due to starvation). The normal path for async commands is fully async and uses Pipes. The backlog is only used as a buffer when we can't write to the connection, e.g. we're disconnected. And as per the docs, you can turn this off if you want to fail immediately, indicating no connections are available.

You're saying this can be optimized: sure, I'm all ears, but it needs to be done in a way that gives the backlog a chance of draining when thread pool starvation is hit (which is a common scenario for applications losing connections unexpectedly, often having some sync-over-async elsewhere we don't control). That's why we create a thread to do this work and listen for incoming items for 5 more seconds, instead of repeatedly spinning up a thread if we're recovering from overwhelming the outbound socket buffer. As for the thread pool growing, this is often a misconception when it gets into the details.
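For reference, the fail-fast behavior mentioned above is exposed through connection configuration; a minimal sketch, assuming the BacklogPolicy option (verify the exact name against the docs for the version you're running):

```csharp
// Sketch: opt out of backlogging so commands fail fast when no connection
// is available, instead of being queued for retry.
using StackExchange.Redis;

var options = ConfigurationOptions.Parse("localhost:6379");
options.BacklogPolicy = BacklogPolicy.FailFast; // default is BacklogPolicy.Default (queue and retry)

var muxer = await ConnectionMultiplexer.ConnectAsync(options);
```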
Yep, I certainly misunderstood that part. Thanks for the clarification! Upd: but not that much: if we can't take the mutex we still end up enqueuing to the backlog, and this will happen quite often under load. StackExchange.Redis/src/StackExchange.Redis/PhysicalBridge.cs, Lines 780 to 791 in 60e5d17
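The referenced block isn't reproduced in this thread; the shape of the concern is roughly the following. This is an illustrative sketch with made-up names, not the library's actual code:

```csharp
// Illustration: even while connected, if the single-writer mutex is
// contended, the message goes to the backlog instead of being written
// directly, so the backlog path is exercised frequently under load.
using System.Collections.Concurrent;
using System.Threading;

class WritePathSketch
{
    private readonly SemaphoreSlim _singleWriter = new(1, 1);
    private readonly ConcurrentQueue<string> _backlog = new();

    public void Write(string message)
    {
        // Try to take the writer lock without waiting.
        if (_singleWriter.Wait(0))
        {
            try
            {
                // Fast path: write straight to the pipe (stand-in).
            }
            finally
            {
                _singleWriter.Release();
            }
        }
        else
        {
            // Contended path: queue the message for the backlog processor.
            _backlog.Enqueue(message);
        }
    }
}
```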
@krasin-ga Thanks again for the report! Version 2.7.33 with this fixed up is available on NuGet now :) |
Nice, thanks for the quick fix! |
@krasin-ga @NickCraver Do you know in which version this issue started to happen?
@giuliano-macedo I'd assume this was a potential issue when consuming a lot of endpoints since 2.5.x, that's my best guess here. |
@NickCraver Great! And thanks for the fix! |
Here is the code that starts a dedicated backlog processing thread and, in its comments, provides some motivation for the decision not to use the thread pool:
StackExchange.Redis/src/StackExchange.Redis/PhysicalBridge.cs
Lines 901 to 912 in 441f89a
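The linked lines aren't quoted in this thread; in broad strokes, the pattern is a manually created thread (rather than a thread pool work item) that drains the backlog, so it can keep making progress when the pool is starved. A simplified illustration of that general idea, with made-up names, not the library's actual code:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

class BacklogProcessorSketch
{
    private readonly ConcurrentQueue<string> _backlog = new();
    private int _processorActive; // 0 = no drain thread, 1 = running

    public void Enqueue(string message)
    {
        _backlog.Enqueue(message);

        // Start the drain thread only if one isn't already running.
        if (Interlocked.CompareExchange(ref _processorActive, 1, 0) == 0)
        {
            var thread = new Thread(DrainBacklog)
            {
                IsBackground = true,
                Name = "Backlog processor (sketch)",
            };
            thread.Start();
        }
    }

    private void DrainBacklog()
    {
        try
        {
            // A dedicated thread is used (rather than Task.Run) so this loop
            // can still run when the thread pool is starved.
            while (_backlog.TryDequeue(out var message))
            {
                Console.WriteLine($"writing {message}"); // stand-in for the real write
            }
        }
        finally
        {
            // A real implementation must re-check the queue after clearing
            // the flag to avoid stranding items enqueued during shutdown.
            Interlocked.Exchange(ref _processorActive, 0);
        }
    }
}
```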
But after asynchronously awaiting ProcessBridgeBacklogAsync, the rest of the code (including further loop iterations) will not only continue on a thread pool thread but also block it on a sync lock:
StackExchange.Redis/src/StackExchange.Redis/PhysicalBridge.cs
Lines 988 to 1012 in 441f89a
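The problem being pointed at can be reduced to a small pattern: once a method awaits, its continuation (including later loop iterations) resumes on a thread pool thread, and any synchronous wait taken there blocks that pool thread. A minimal repro-style sketch, unrelated to the library's actual code:

```csharp
using System.Threading;
using System.Threading.Tasks;

class AwaitThenBlockSketch
{
    private readonly SemaphoreSlim _mutex = new(1, 1);

    public async Task ProcessLoopAsync(CancellationToken cancellation)
    {
        while (!cancellation.IsCancellationRequested)
        {
            // Asynchronous part: releases the current thread while waiting.
            await Task.Delay(10, cancellation);

            // Synchronous part: blocks whichever thread pool thread the
            // continuation resumed on. With one such loop per bridge and
            // hundreds of bridges, this can tie up most of the pool.
            _mutex.Wait(cancellation);
            try
            {
                // Drain / write work would happen here.
            }
            finally
            {
                _mutex.Release();
            }
        }
    }
}
```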
In our case, we have several relatively large Redis clusters with tens to hundreds of nodes, and we sometimes experience situations where the majority of thread pool threads are blocked inside instances of PhysicalBridge.
So we get the best of both worlds: hundreds of expensive worker-thread creations and thread pool starvation 🙂