
Fix #2392: Dequeue all timed out messages from the backlog when not connected, even when no completion is needed, to be able to dequeue and complete other timed out messages. #2397

Merged
3 commits merged on Mar 29, 2023

Conversation

kornelpal
Contributor

@kornelpal commented Mar 10, 2023

When the client is not connected, timed-out fire-and-forget messages are currently not removed from the backlog, which also results in subsequent timed-out messages not being marked as timed out, as described in #2392.

@kornelpal changed the title Fix #2392: Process all timed out messages from the backlog and avoid sending messages after the sync wait timed out. Fix #2392: Dequeue all timed out messages from the backlog when not connected, even when no completion is needed, to be able to dequeue and complete other timed out messages. Mar 19, 2023
@kornelpal
Contributor Author

I've simplified the changes, but seemingly unrelated tests are now failing on the build server. All tests pass when run locally, and I could not find a way to re-run the checks.

@NickCraver
Collaborator

Checking in here: I wanted to repro this in another path I'm working on, where timeout messages are better for backlogged scenarios...which I've got now. But I've got a hang both with and without these changes here, so something else is amiss. Still digging in to evaluate this.

@kornelpal
Contributor Author

@NickCraver, do you have a test case for reproducing the hang you mentioned? (Also, would you mind sharing the draft version of your changes if that is needed to reproduce the hang?)

@NickCraver
Collaborator

@kornelpal Yep - I've got to change tasks here for the day, but here's a push of my changes thus far to craver/timeout-error-improvements: https://github.com/StackExchange/StackExchange.Redis/compare/craver/timeout-error-improvements?expand=1. I agree your thinking here makes sense (and you'll see some changes reflected in current on that test branch), but some other hang is in play and I hesitate to change anything until we understand both. I'm only trying to figure it out at the moment; we'll definitely adjust this PR (if needed) and merge it so credit and all works when getting the overall change in.

@kornelpal
Contributor Author

@NickCraver, I believe that I was able to figure out the root cause of your issue.

According to my tests, when there is no active connection, all the client objects, including the connection, the backlog, and even the Task instance, are freed by the GC. This leaves nothing to invoke the continuations that the async flow depends on.

Debugging it is very difficult, because debugging extends object lifetimes, either through the runtime or through debugging aids holding on to Task or other object instances.

Somewhat ironically, adding WithTimeout also fixes the issue, because timers are rooted.

Running other tests likely helps reproduce it because of the added memory pressure.

Although I haven't tried creating a unit test for it, using a longer timeout and running a GC.Collect + GC.WaitForPendingFinalizers loop on a separate thread should help reproduce it more reliably.

Rooting the connection in the test should fix the issue:

// GCHandle requires System.Runtime.InteropServices; conn and db come from
// the surrounding test. The handle keeps the multiplexer (and the bridge
// and backlog it owns) reachable while the ping is pending.
Exception ex;
var gcRoot = GCHandle.Alloc(conn);
try
{
    ex = await Assert.ThrowsAsync<RedisConnectionException>(() => db.PingAsync());
}
finally
{
    gcRoot.Free();
}

I don't think this affects real-world applications much, because the connection is likely rooted from a static field. I am not entirely sure the same issue cannot occur when there is an active connection, but the native/unmanaged connection likely keeps these objects alive most of the time.

My recommended fix is rooting the PhysicalBridge instance using a GCHandle to itself while the backlog is non-empty, and possibly doing the same for the sent-message queue too. When the message volume is low, creating and destroying a GCHandle should not be an issue, and when the volume is high it will simply stay allocated the whole time.
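
A minimal sketch of that self-rooting idea (hypothetical member names, not the actual PhysicalBridge code):

using System.Runtime.InteropServices;

internal sealed class SelfRootingBridgeSketch
{
    // While allocated, the GC handle table acts as a root for this instance,
    // so the bridge and the backlog it owns stay reachable.
    private GCHandle _selfRoot;

    // Call when the backlog transitions from empty to non-empty.
    // (Real code would need synchronization around these transitions.)
    private void RootSelf()
    {
        if (!_selfRoot.IsAllocated) _selfRoot = GCHandle.Alloc(this);
    }

    // Call when the backlog drains; freeing the handle avoids a leak.
    private void UnrootSelf()
    {
        if (_selfRoot.IsAllocated) _selfRoot.Free();
    }
}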

In conclusion, I believe that this is a separate issue from both this PR and the changes you are working on.

@NickCraver
Collaborator

@kornelpal I agree on the observations, but not the conclusions: connections or entire multiplexers get disposed in a myriad of scenarios that we see all the time, so messages with never-completed awaitables are a significant issue; we have to get to the bottom of this. I imagine there's a path in which we're not completing the backlog queue during teardown...but we shouldn't be allowed to tear down until end of scope, and we're awaiting that message during it, so I need to reason about this a bit more.

@kornelpal
Contributor Author

This self-contained (.NET Framework 4.8) console application reproduces the issue reliably, in case it helps:

using StackExchange.Redis;
using System;
using System.Reflection;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static readonly ManualResetEventSlim setupComplete = new ManualResetEventSlim();
    static WeakReference connWeakRef;
    static WeakReference pingTaskWeakRef;
    static ConnectionMultiplexer connStrongRef;

    static void Main()
    {
        Console.WriteLine($"CollectedWithoutDispose: {GetCollectedWithoutDispose()}");
        var task = Task.Run(RedisWorker);

        setupComplete.Wait();
        Console.WriteLine("Received setupComplete.");

        while (!task.IsCompleted)
        {
            GC.Collect();
            GC.WaitForPendingFinalizers();

            if (!connWeakRef.IsAlive && !pingTaskWeakRef.IsAlive)
            {
                Console.WriteLine("Connection and ping task were collected by the GC, there is nothing left to invoke the continuation for RedisWorker to complete.");
                Console.WriteLine($"CollectedWithoutDispose: {GetCollectedWithoutDispose()}");
                break;
            }
        }
        task.GetAwaiter().GetResult();

        Console.WriteLine("Main exited.");
    }

    static int GetCollectedWithoutDispose() => (int)typeof(ConnectionMultiplexer).InvokeMember("CollectedWithoutDispose", BindingFlags.Static | BindingFlags.NonPublic | BindingFlags.GetProperty, null, null, null);

    static async Task RedisWorker()
    {
        try
        {
            using (var conn = await ConnectionMultiplexer.ConnectAsync("127.0.0.1:1234,abortConnect=false"))
            {
                if (conn.IsConnected)
                {
                    Console.WriteLine("Redis server connected, exiting.");
                    Environment.Exit(1);
                }

                // Uncommenting the following line will result in completion with timeout.
                //connStrongRef = conn;

                connWeakRef = new WeakReference(conn);

                var db = conn.GetDatabase();
                Console.WriteLine("Starting PingAsync.");

                var pingTask = db.PingAsync();
                pingTaskWeakRef = new WeakReference(pingTask);

                Console.WriteLine("Setting setupComplete.");
                setupComplete.Set();

                await pingTask;
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine("RedisWorker failed: " + ex.Message);
        }

        Console.WriteLine("RedisWorker exited.");
    }
}

@NickCraver
Collaborator

cc @mgravell ^ (we're looking at this craziness this morning)

@kornelpal thanks a ton for helping iterate and sanity-check here. This is currently looking like async-dispose insanity: even a case as simple as a using with an await and a disconnect in there allows for global collection of things that are used later in the method, so...yeah, something is very wrong in async land.

@kornelpal
Contributor Author

I've created a draft fix and pushed it to main...kornelpal:StackExchange.Redis:gchandle so that you can iterate on it too.

@NickCraver
Collaborator

@kornelpal I've pushed up some more stuff after digging this morning to prove it's being finalized while still in use (so the backlog is never cleared). Now it'll be cleared, with an ObjectDisposedException thrown for any backlog messages, instead of hanging indefinitely. But that's a symptom solver only...the problem (we think) may be the TaskCompletionSource and the lack of a ref through ConnectionMultiplexer.ExecuteAsyncImpl. @mgravell's running some experiments now to try to build a minimal repro and compare it to what happens in normal async state machine Task flow. This one's interesting, and it's amazing we haven't hit it in more scenarios.
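
A rough sketch of what that symptom-solver looks like (illustrative only; BacklogTryDequeue appears later in this thread, while SetExceptionAndComplete is an assumed completion helper, not necessarily the exact commit):

~PhysicalBridge()
{
    // If the bridge is finalized while messages are still backlogged, fail
    // them so their awaiters observe an exception instead of hanging forever,
    // at the cost of completing them before their real timeout expires.
    while (BacklogTryDequeue(out var message))
    {
        message.SetExceptionAndComplete(
            new ObjectDisposedException(GetType().FullName), this);
    }
}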

@kornelpal
Contributor Author

@NickCraver, as I understand it, PhysicalBridge is getting finalized because it has a finalizer and the GC is collecting it. The GC is collecting it because there are no strong references to it reachable from any of the GC roots (stack variables, static fields, GC handles, etc.).

Task continuations are forward-only references, so unfortunately keeping a reference to the last task in the chain will not keep alive the previous steps needed for its completion, resulting in the task never being completed. This is a consequence of the Task continuation design combined with using a GC. Task.Delay in the framework, for example, keeps its Timer alive so that the GC will not collect it and the task will complete. In this regard PhysicalBridge is similar to Task.Delay: it is the origin of the task completion chain and is expected to complete those tasks, even when there are no references to it anymore.

Creating a GCHandle is a simple and efficient way of having the object keep itself alive, but care must be taken to free the GCHandle when there are no more pending tasks left, to avoid memory leaks.
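
A minimal, self-contained illustration of the forward-only problem (standalone demo, not library code): the Task returned below keeps no reference back to the Timer that would complete it, so collecting the Timer leaves the await hanging.

using System;
using System.Threading;
using System.Threading.Tasks;

class ForwardOnlyDemo
{
    static Task StartWork()
    {
        var tcs = new TaskCompletionSource<bool>();
        // The callback references tcs and the Timer references the callback,
        // but nothing references the Timer once StartWork returns.
        _ = new Timer(_ => tcs.TrySetResult(true), null, 2000, Timeout.Infinite);
        return tcs.Task;
    }

    static async Task Main()
    {
        var pending = StartWork();
        GC.Collect();
        GC.WaitForPendingFinalizers();
        // In a release build the Timer is typically collected above, its
        // finalizer cancels it, and this await can hang forever.
        await pending;
        Console.WriteLine("Completed: the Timer survived long enough to fire.");
    }
}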

While completing the tasks in the finalizer will avoid hung tasks, and is a good measure of last resort, it is not a proper fix, as it results in the tasks being completed (and with an exception) before the timeout expires. Keeping PhysicalBridge alive, on the other hand, results in those tasks being processed properly, just as when a strong reference to the ConnectionMultiplexer is reachable from a static field.

@kornelpal
Contributor Author

I've finished the GCHandle patch and pushed it to this PR, in case it helps. I fixed the incrementing of _backlogCurrentEnqueued, as the patch depends on it. To avoid a memory leak, AbandonPendingBacklog has to be called in Dispose, but I did not include that, as @NickCraver's branch already has it.

@mgravell
Collaborator

I have another approach in mind (I'm not at a computer until Tuesday) that achieves the same aim, but without the need for GCHandle (we still root and unroot, just in an alternative way). I'll compare and contrast both ideas when I get to look at it on Tuesday.

@kornelpal
Contributor Author

@mgravell, I've created a GCHandle-free approach too. Unlike GCHandle, a HashSet seems to remain just as fast with 1M items as with only one. On the other hand, GCHandle is faster with fewer instances, so adding a holder object to the HashSet at initialization improves run-time performance. Another option would be to use a holder object but root it using a GCHandle.
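
A sketch of that HashSet variant (hypothetical names; a static set serves as the GC root, and bridges add and remove themselves around backlog transitions):

using System.Collections.Generic;

internal static class BridgeRoots
{
    private static readonly HashSet<object> s_roots = new HashSet<object>();

    // Lock because multiple bridges can mutate the set concurrently.
    public static void Root(object bridge)
    {
        lock (s_roots) s_roots.Add(bridge);
    }

    public static void Unroot(object bridge)
    {
        lock (s_roots) s_roots.Remove(bridge);
    }
}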

@mgravell
Collaborator

I have a plan sketched out that involves far fewer moving parts but offers the same guarantees, plus handling a few other important nuances; I want to try that on Tuesday before going too far elsewhere. Appreciate the thoughts, though. It helps that we cheated and asked Stephen Toub for input and ideas; he has some good tricks for these scenarios.

@mgravell
Collaborator

Here's what I had in mind: #2413 - the key differences here:

  • we're considering both pending and live queue items
  • we're using simple strong-refs for simplicity
  • by rooting the muxer, we ensure that the heartbeat and hence timeout keep working; there's no advantage in keeping bridges alive if nobody is poking the bridges with a stick to tell things that they're dead

@kornelpal
Contributor Author

@mgravell, your approach looks good to me, and using the timer to keep the multiplexer alive is a clever trick.

@mgravell and @NickCraver, should I rebase this PR on #2413?

The bug this PR intends to fix is also fixed in #2408, and the incrementing of _backlogCurrentEnqueued is also fixed in #2413. The rooting code from this PR is not needed, so I am going to remove it.

These are the changes that are not present and not resolved in either #2408 or #2413:

  • Issue2392Tests, which covers both hung-task issues, but because of that it also depends on the fix to succeed.
  • An updated comment in CheckBacklogForTimeouts reflecting the change, and completing async messages to be consistent with non-backlogged timeouts. This also results in an optimization for fire-and-forget messages and keeps the stack trace/exception message consistent for sync messages.
  • Locking the backlog in AbandonPendingBacklog to preserve the thread-safety requirements of the timeout checker. Maybe a separate version without locking is needed for Dispose and the finalizer.

@mgravell
Collaborator

In principle, sure, if there are other things needed; however, that third item sounds like a very bad idea - it is very intentional that we don't hold that lock while doing that. I'd want to check the reasoning there and see if something else might be possible.

@kornelpal
Contributor Author

kornelpal commented Mar 28, 2023

PhysicalBridge.OnConnectionFailed seems to happen independently of the heartbeat, so I believe it can run in parallel with CheckBacklogForTimeouts. I mean locking for each BacklogTryDequeue call separately, as in the current version of the PR (see the sketch below). Update: I believe that Dispose can be called while CheckBacklogForTimeouts is running, which can result in the one message dequeued by CheckBacklogForTimeouts not being completed.
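
An illustrative shape of that per-call locking (the lock object, the Message type usage, and the completion call are assumptions for the sketch, not the exact PR code):

while (true)
{
    Message message;
    // Lock per dequeue so concurrent paths (e.g. OnConnectionFailed) see a
    // consistent queue; the dequeued message is then owned by this thread.
    lock (_backlogLock)
    {
        if (!BacklogTryDequeue(out message)) break;
    }
    // Complete outside the lock so user continuations don't run under it.
    message.SetExceptionAndComplete(timeoutException, this);
}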

@kornelpal
Contributor Author

I've reverted this PR to the original fix, and created a new PR, #2415, for the test.

@NickCraver
Collaborator

Current is looking great - thanks a ton for all the debugging help on this and the GC issue <3 - getting this PR in first to make sure I eat the merge over in #2408, then we'll get the final test additions in after.
