Fix #2392: Dequeue all timed out messages from the backlog when not connected, even when no completion is needed, to be able to dequeue and complete other timed out messages. #2397
Conversation
I've simplified the changes, but seemingly unrelated tests are now failing on the build server. All tests pass when run locally, and I could not find a way to re-run the checks.
Checking in here: I wanted to repro this in another path I'm working on that makes timeout messages better for backlogged scenarios... which I've got now. But I've got a hang both with and without these changes, so something else is amiss. Still digging in to evaluate this.
@NickCraver, do you have a test case for reproducing the hang you mentioned? (Also, would you mind sharing the draft version of your changes if that is needed to reproduce the hang?)
@kornelpal Yep - I've got to change tasks here for the day, but here's a push of my changes thus far.
@NickCraver, I believe that I was able to figure out the root cause of your issue. According to my tests, when there is no active connection, all the client objects, including the connection, the backlog, and even the pending ping task, become eligible for garbage collection, leaving nothing to complete the awaited operation.

Debugging it is very difficult, because debugging results in object lifetimes being extended, either by the runtime or by the debugging aids holding on to the objects. Somewhat ironically, adding … Running other tests likely helps reproducing it because of the added memory pressure. Although I haven't tried creating a unit test for it, using a longer timeout, and a …

Rooting the connection in the test should fix the issue:

```csharp
Exception ex;
var gcRoot = GCHandle.Alloc(conn);
try
{
    ex = await Assert.ThrowsAsync<RedisConnectionException>(() => db.PingAsync());
}
finally
{
    gcRoot.Free();
}
```

I don't think that this is affecting real world applications much, because the connection is likely rooted from a static field. Although I am not entirely positive that the same issue is not possible when there is an active connection, the native/unmanaged connection likely keeps these objects alive most of the time. My recommended fix is rooting the multiplexer while operations are pending.

As a conclusion, I believe that this is a separate issue from both this PR and the changes you are working on.
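For illustration only, here is a minimal sketch of what "rooting the multiplexer while operations are pending" could look like. The type and member names are invented here, and this is not the code in the linked draft branch; it only shows the mechanism.

```csharp
using System.Runtime.InteropServices;

// Invented names; a sketch only. A normal (strong) GCHandle roots the multiplexer while
// the backlog has pending messages and is freed once the backlog drains, so the object
// graph that will eventually complete (or time out) those messages cannot be garbage
// collected in the meantime.
internal sealed class MultiplexerRootingSketch
{
    private GCHandle _keepAlive;

    internal void OnBacklogBecameNonEmpty()
    {
        if (!_keepAlive.IsAllocated)
        {
            _keepAlive = GCHandle.Alloc(this); // strong handle: 'this' is now reachable from a GC root
        }
    }

    internal void OnBacklogDrained()
    {
        if (_keepAlive.IsAllocated)
        {
            _keepAlive.Free(); // collectible again once nothing is pending
        }
    }
}
```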
@kornelpal I agree on observations, but not conclusions: connections or entire multiplexers get disposed in a myriad of scenarios that we see all the time, so messages with never-completed awaitables are a significant issue; we have to get to the bottom of this. I imagine there's a path in which we're not completing the backlog queue during teardown... but we shouldn't be allowed to tear down until end of scope, and we're awaiting that message during it, so I need to reason about this a bit more.
This self contained (.NET Framework 4.8) console application reproduces the issue reliably, in case it helps:

```csharp
using StackExchange.Redis;
using System;
using System.Reflection;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static readonly ManualResetEventSlim setupComplete = new ManualResetEventSlim();
    static WeakReference connWeakRef;
    static WeakReference pingTaskWeakRef;
    static ConnectionMultiplexer connStrongRef;

    static void Main()
    {
        Console.WriteLine($"CollectedWithoutDispose: {GetCollectedWithoutDispose()}");
        var task = Task.Run(RedisWorker);
        setupComplete.Wait();
        Console.WriteLine("Received setupComplete.");
        while (!task.IsCompleted)
        {
            GC.Collect();
            GC.WaitForPendingFinalizers();
            if (!connWeakRef.IsAlive && !pingTaskWeakRef.IsAlive)
            {
                Console.WriteLine("Connection and ping task were collected by the GC, there is nothing left to invoke the continuation for RedisWorker to complete.");
                Console.WriteLine($"CollectedWithoutDispose: {GetCollectedWithoutDispose()}");
                break;
            }
        }
        task.GetAwaiter().GetResult();
        Console.WriteLine("Main exited.");
    }

    static int GetCollectedWithoutDispose() => (int)typeof(ConnectionMultiplexer).InvokeMember("CollectedWithoutDispose", BindingFlags.Static | BindingFlags.NonPublic | BindingFlags.GetProperty, null, null, null);

    static async Task RedisWorker()
    {
        try
        {
            using (var conn = await ConnectionMultiplexer.ConnectAsync("127.0.0.1:1234,abortConnect=false"))
            {
                if (conn.IsConnected)
                {
                    Console.WriteLine("Redis server connected, exiting.");
                    Environment.Exit(1);
                }
                // Uncommenting the following line will result in completion with timeout.
                //connStrongRef = conn;
                connWeakRef = new WeakReference(conn);
                var db = conn.GetDatabase();
                Console.WriteLine("Starting PingAsync.");
                var pingTask = db.PingAsync();
                pingTaskWeakRef = new WeakReference(pingTask);
                Console.WriteLine("Setting setupComplete.");
                setupComplete.Set();
                await pingTask;
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine("RedisWorker failed: " + ex.Message);
        }
        Console.WriteLine("RedisWorker exited.");
    }
}
```
cc @mgravell ^ (we're looking at this craziness this morning). @kornelpal thanks a ton for helping iterate and sanity check here. This is currently looking like async dispose insanity: even in a case as simple as a using with an await and a disconnect in there, things that are used later in the method can be collected globally, so... yeah, something is very wrong in async land.
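As a minimal illustration of the failure mode being described (a contrived example, not StackExchange.Redis code): an awaited task whose only completer lives in an un-rooted object graph can be collected together with that graph, after which nothing can ever resume the awaiting method.

```csharp
using System.Threading.Tasks;

internal static class HangIllustration
{
    // Contrived example. The TaskCompletionSource below is the only thing that could ever
    // complete the awaited task. In the real scenario it is reachable only through
    // message -> backlog -> multiplexer; once that graph is no longer rooted, the GC
    // collects it together with this method's state machine, and the Task returned to the
    // caller never completes.
    public static async Task NeverCompletesOnceCompleterIsCollected()
    {
        var tcs = new TaskCompletionSource<bool>();
        // (Imagine 'tcs' being handed off to an object graph that nothing else references.)
        await tcs.Task;
    }
}
```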
I've created a draft fix and pushed it to main...kornelpal:StackExchange.Redis:gchandle so that you can iterate on it too.
@kornelpal I've pushed up some more stuff after digging this morning to prove it's being finalized while still in use (so the backlog is never cleared). Now, it'll be cleared with an …
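Roughly, the last-resort behaviour being described might look like the following sketch. The type here is invented for illustration; the real change lives in the linked commits.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

internal sealed class BacklogOwnerSketch
{
    private readonly ConcurrentQueue<TaskCompletionSource<object>> _backlog =
        new ConcurrentQueue<TaskCompletionSource<object>>();

    // Last-resort cleanup: if this object is finalized while messages are still queued,
    // fault them so awaiting callers observe a failure instead of hanging forever.
    // TrySetException simply returns false for messages that already completed.
    ~BacklogOwnerSketch()
    {
        while (_backlog.TryDequeue(out var pending))
        {
            pending.TrySetException(new ObjectDisposedException(nameof(BacklogOwnerSketch)));
        }
    }
}
```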
@NickCraver, as I understand, … While completing the tasks in the finalizer will avoid hung tasks, and is a good measure of last resort, it is not a proper fix, as it will result in the tasks being completed (and with an exception) before the timeout expires. Keeping the multiplexer alive while operations are pending would address that instead.
I've finished the …
I have another approach in mind (I'm not at a computer until Tuesday) that achieves the same aim, but without the need for a GCHandle.
@mgravell, I've created a …
I have a plan sketched out that involves far fewer moving parts, but with the same guarantees, plus handling of a few other important nuances; I want to try that on Tuesday before going too far down another path. Appreciate the thoughts, though. It helps that we cheated and asked Stephen Toub for input and ideas; he has some good tricks for these scenarios.
Here's what I had in mind: #2413 - the key differences here: …
@mgravell, your approach looks good to me, and using the timer to keep the multiplexer alive is a clever trick.

@mgravell and @NickCraver, should I rebase this PR on #2413? The bug this PR intends to fix is fixed in #2408 too, and incrementing …

These are the changes that are not present and not resolved in either #2408 or #2413: …
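For context on the "timer keeps the multiplexer alive" trick, here is a hedged sketch with invented names; #2413 contains the real implementation. The idea relies on the fact that, while a System.Threading.Timer remains scheduled, the runtime's timer queue keeps its callback state reachable, so passing the multiplexer as the timer state keeps it (and its backlog) alive for as long as the heartbeat runs.

```csharp
using System;
using System.Threading;

// Invented names; illustrative only. A periodic heartbeat timer whose state is the
// multiplexer itself: while the timer stays scheduled, the timer queue keeps the state
// (and therefore the backlog and pending messages) reachable. Disposing the timer
// releases that root and makes the object collectible again.
internal sealed class HeartbeatRootingSketch : IDisposable
{
    private readonly Timer _heartbeat;

    public HeartbeatRootingSketch()
    {
        _heartbeat = new Timer(
            state => ((HeartbeatRootingSketch)state).OnHeartbeat(),
            state: this,                       // 'this' is kept alive via the timer state
            dueTime: TimeSpan.FromSeconds(1),
            period: TimeSpan.FromSeconds(1));
    }

    private void OnHeartbeat()
    {
        // Check backlog timeouts, connection state, etc.
    }

    public void Dispose() => _heartbeat.Dispose();
}
```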
In principle: sure, if there are other things needed; however, that third item sounds like a very bad idea - it is very intentional that we don't keep that lock while doing that. I'd want to check the reasoning there, and see if something else might be possible.
PhysicalBridge.OnConnectionFailed seems to happen independently of the heartbeat, so I believe that it can happen in parallel to CheckBacklogForTimeouts. I mean locking for each BacklogTryDequeue call separately, as in the current version of the PR. Update: I believe that Dispose can be called while CheckBacklogForTimeouts is running, which can result in the one message dequeued by CheckBacklogForTimeouts not being completed.
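To make the locking point concrete, here is a rough sketch with invented, simplified types (the real Message, backlog, and lock in this PR differ in detail): the lock makes each peek-then-dequeue pair atomic, but it is held only for that one attempt, never for a whole timeout scan.

```csharp
using System.Collections.Concurrent;

// Invented, simplified type for illustration; the real Message carries much more state.
internal sealed class PendingMessage
{
    public PendingMessage(long createdMilliseconds, bool isFireAndForget)
    {
        CreatedMilliseconds = createdMilliseconds;
        IsFireAndForget = isFireAndForget;
    }

    public long CreatedMilliseconds { get; }
    public bool IsFireAndForget { get; }
}

internal sealed class BacklogSketch
{
    private readonly object _syncRoot = new object();
    private readonly ConcurrentQueue<PendingMessage> _backlog = new ConcurrentQueue<PendingMessage>();

    public void Enqueue(PendingMessage message) => _backlog.Enqueue(message);

    // The lock makes the peek-then-dequeue pair atomic, and it is held only for this one
    // attempt, so OnConnectionFailed or Dispose running in parallel is not blocked for the
    // duration of a whole scan. Completing the dequeued message happens outside the lock.
    public bool TryDequeueTimedOut(long nowMilliseconds, int timeoutMilliseconds, out PendingMessage message)
    {
        lock (_syncRoot)
        {
            if (_backlog.TryPeek(out message)
                && nowMilliseconds - message.CreatedMilliseconds >= timeoutMilliseconds)
            {
                return _backlog.TryDequeue(out message);
            }
        }
        message = null;
        return false;
    }
}
```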
…og when not connected, even when no completion is needed, to be able to dequeue and complete other timed out messages.
I've reverted this PR to the original fix, and created a new PR #2415 for the test.
Current is looking great - thanks a ton for all the debugging help on this and the GC issue <3 - getting this PR in first to make sure I eat the merge over in #2408 then we'll get final test additions in after.
When the client is not connected, timed-out fire-and-forget messages are currently not removed from the backlog, which also results in subsequent timed-out messages not being marked as timed out, as described in #2392.
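In sketch form (invented names, building on the BacklogSketch and PendingMessage types sketched above; the library's actual code differs), the behaviour this PR restores is: dequeue every timed-out message at the head of the backlog, including fire-and-forget ones that need no completion, so the messages queued behind them can also be timed out.

```csharp
using System;

internal static class TimeoutScanSketch
{
    // Illustrative only: every timed-out message at the head of the backlog is dequeued,
    // including fire-and-forget messages that need no completion, so the messages queued
    // behind them can also be timed out instead of sitting in the backlog forever.
    public static void CheckBacklogForTimeouts(
        BacklogSketch backlog,
        long nowMilliseconds,
        int timeoutMilliseconds,
        Action<PendingMessage> completeWithTimeout)
    {
        while (backlog.TryDequeueTimedOut(nowMilliseconds, timeoutMilliseconds, out var message))
        {
            // Fire-and-forget: nothing is awaiting the result, so dequeuing alone is enough.
            if (!message.IsFireAndForget)
            {
                // For other messages, fault the caller's pending task with a timeout.
                completeWithTimeout(message);
            }
        }
    }
}
```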