
Fix #2392: Dequeue all timed out messages from the backlog when not connected, even when no completion is needed, to be able to dequeue and complete other timed out messages. #2397

Merged
3 commits merged on Mar 29, 2023

Conversation

kornelpal
Contributor

@kornelpal commented Mar 10, 2023

When the client is not connected, timed-out fire-and-forget messages are currently not removed from the backlog, which also results in subsequent timed-out messages not being marked as timed out, as described in #2392.

@kornelpal changed the title Fix #2392: Process all timed out messages from the backlog and avoid sending messages after the sync wait timed out. Fix #2392: Dequeue all timed out messages from the backlog when not connected, even when no completion is needed, to be able to dequeue and complete other timed out messages. Mar 19, 2023
@kornelpal
Contributor Author

I've simplified the changes, but seemingly unrelated tests are now failing on the build server. All tests pass when run locally, and I could not find a way to re-run the checks.

@NickCraver
Collaborator

Checking in here: I wanted to repro this in another path I'm working on, where timeout messages are better for backlogged scenarios...which I've got now. But I've got a hang both with and without these changes here, so something else is amiss. Still digging in to evaluate this.

@kornelpal
Contributor Author

@NickCraver, do you have a test case for reproducing the hang you mentioned? (Also, would you mind sharing the draft version of your changes if that is needed to reproduce the hang?)

@NickCraver
Collaborator

@kornelpal Yep - I've got to change tasks here for the day, but here's a push of my changes thus far to craver/timeout-error-improvements: https://github.com/StackExchange/StackExchange.Redis/compare/craver/timeout-error-improvements?expand=1. I agree your thinking here makes sense (and you'll see some changes reflected in current on that test branch), but some other hang is in play and I hesitate to change anything until we understand both. I'm only trying to figure it out at the moment; we'll definitely adjust this PR (if needed) and merge it so credit and all works when getting the overall change in.

@kornelpal
Contributor Author

@NickCraver, I believe that I was able to figure out the root cause of your issue.

According to my tests, when there is no active connection, all the client objects, including the connection, the backlog, and even the Task instance, are freed by the GC. This leaves nothing to invoke the continuations that the async flow depends on.

Debugging it is very difficult, because debugging extends object lifetimes, either through the runtime or through debugging aids holding on to Task or other object instances.

Somewhat ironically, adding WithTimeout also fixes the issue, because timers are rooted.

Running other tests likely helps reproduce it because of the added memory pressure.

Although I haven't tried creating a unit test for it, using a longer timeout and running a GC.Collect + GC.WaitForPendingFinalizers loop on a separate thread should help reproduce it more reliably.

Rooting the connection in the test should fix the issue:

// GCHandle requires System.Runtime.InteropServices; conn and db come from
// the surrounding test. The handle keeps the multiplexer (and the bridge
// and backlog it owns) reachable while the ping is pending.
Exception ex;
var gcRoot = GCHandle.Alloc(conn);
try
{
    ex = await Assert.ThrowsAsync<RedisConnectionException>(() => db.PingAsync());
}
finally
{
    gcRoot.Free();
}

I don't think this affects real-world applications much, because the connection is likely rooted from a static field. I am not entirely sure the same issue cannot occur when there is an active connection, but the native/unmanaged connection likely keeps these objects alive most of the time.

My recommended fix is rooting the PhysicalBridge instance using a GCHandle to itself while the backlog is non-empty, and possibly doing the same for the sent-message queue too. When the message volume is low, creating and destroying a GCHandle should not be an issue, and when the volume is high it will simply stay allocated the whole time.
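
A minimal sketch of that self-rooting idea (hypothetical member names, not the actual PhysicalBridge code):

using System.Runtime.InteropServices;

internal sealed class SelfRootingBridgeSketch
{
    // While allocated, the GC handle table acts as a root for this instance,
    // so the bridge and the backlog it owns stay reachable.
    private GCHandle _selfRoot;

    // Call when the backlog transitions from empty to non-empty.
    // (Real code would need synchronization around these transitions.)
    private void RootSelf()
    {
        if (!_selfRoot.IsAllocated) _selfRoot = GCHandle.Alloc(this);
    }

    // Call when the backlog drains; freeing the handle avoids a leak.
    private void UnrootSelf()
    {
        if (_selfRoot.IsAllocated) _selfRoot.Free();
    }
}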

In conclusion, I believe that this is a separate issue from both this PR and the changes you are working on.

@NickCraver
Collaborator

@kornelpal I agree on the observations, but not the conclusions: connections or entire multiplexers get disposed in a myriad of scenarios that we see all the time, so messages with never-completed awaitables are a significant issue; we have to get to the bottom of this. I imagine there's a path in which we're not completing the backlog queue during teardown...but we shouldn't be allowed to tear down until end of scope, and we're awaiting that message during it, so I need to reason about this a bit more.

@kornelpal
Contributor Author

This self-contained (.NET Framework 4.8) console application reproduces the issue reliably, in case it helps:

using StackExchange.Redis;
using System;
using System.Reflection;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static readonly ManualResetEventSlim setupComplete = new ManualResetEventSlim();
    static WeakReference connWeakRef;
    static WeakReference pingTaskWeakRef;
    static ConnectionMultiplexer connStrongRef;

    static void Main()
    {
        Console.WriteLine($"CollectedWithoutDispose: {GetCollectedWithoutDispose()}");
        var task = Task.Run(RedisWorker);

        setupComplete.Wait();
        Console.WriteLine("Received setupComplete.");

        while (!task.IsCompleted)
        {
            GC.Collect();
            GC.WaitForPendingFinalizers();

            if (!connWeakRef.IsAlive && !pingTaskWeakRef.IsAlive)
            {
                Console.WriteLine("Connection and ping task were collected by the GC, there is nothing left to invoke the continuation for RedisWorker to complete.");
                Console.WriteLine($"CollectedWithoutDispose: {GetCollectedWithoutDispose()}");
                break;
            }
        }
        task.GetAwaiter().GetResult();

        Console.WriteLine("Main exited.");
    }

    static int GetCollectedWithoutDispose() => (int)typeof(ConnectionMultiplexer).InvokeMember("CollectedWithoutDispose", BindingFlags.Static | BindingFlags.NonPublic | BindingFlags.GetProperty, null, null, null);

    static async Task RedisWorker()
    {
        try
        {
            using (var conn = await ConnectionMultiplexer.ConnectAsync("127.0.0.1:1234,abortConnect=false"))
            {
                if (conn.IsConnected)
                {
                    Console.WriteLine("Redis server connected, exiting.");
                    Environment.Exit(1);
                }

                // Uncommenting the following line will result in completion with timeout.
                //connStrongRef = conn;

                connWeakRef = new WeakReference(conn);

                var db = conn.GetDatabase();
                Console.WriteLine("Starting PingAsync.");

                var pingTask = db.PingAsync();
                pingTaskWeakRef = new WeakReference(pingTask);

                Console.WriteLine("Setting setupComplete.");
                setupComplete.Set();

                await pingTask;
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine("RedisWorker failed: " + ex.Message);
        }

        Console.WriteLine("RedisWorker exited.");
    }
}

@NickCraver
Collaborator

cc @mgravell ^ (we're looking at this craziness this morning)

@kornelpal thanks a ton for helping iterate and sanity-check here. This is currently looking like async-dispose insanity: even a case as simple as a using with an await and a disconnect in there allows for global collection of things that are used later in the method, so...yeah, something is very wrong in async land.

@kornelpal
Contributor Author

I've created a draft fix and pushed it to main...kornelpal:StackExchange.Redis:gchandle so that you can iterate on it too.

@NickCraver
Collaborator

@kornelpal I've pushed up some more stuff after digging this morning to prove it's being finalized while still in use (so the backlog is never cleared). Now it'll be cleared, with an ObjectDisposedException thrown for any backlog messages, instead of hanging indefinitely. But that's a symptom solver only...the problem (we think) may be the TaskCompletionSource and the lack of a ref through ConnectionMultiplexer.ExecuteAsyncImpl. @mgravell's running some experiments now to try to build a minimal repro and compare it to what happens in normal async state machine Task flow. This one's interesting, and it's amazing we haven't hit it in more scenarios.
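
A rough sketch of what that symptom-solver looks like (illustrative only; BacklogTryDequeue appears later in this thread, while SetExceptionAndComplete is an assumed completion helper, not necessarily the exact commit):

~PhysicalBridge()
{
    // If the bridge is finalized while messages are still backlogged, fail
    // them so their awaiters observe an exception instead of hanging forever,
    // at the cost of completing them before their real timeout expires.
    while (BacklogTryDequeue(out var message))
    {
        message.SetExceptionAndComplete(
            new ObjectDisposedException(GetType().FullName), this);
    }
}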

@kornelpal
Contributor Author

@NickCraver, as I understand it, PhysicalBridge is getting finalized because it has a finalizer and the GC is collecting it. The GC is collecting it because there are no strong references to it reachable from any of the GC roots (stack variables, static fields, GC handles, etc.).

Task continuations are forward-only references, so unfortunately keeping a reference to the last task in the chain will not keep alive the previous steps needed for its completion, resulting in the task never being completed. This is a consequence of the Task continuation design combined with using a GC. Task.Delay in the framework, for example, keeps its Timer alive so that the GC will not collect it and the task will complete. In this regard PhysicalBridge is similar to Task.Delay: it is the origin of the task completion chain and is expected to complete those tasks, even when there are no references to it anymore.

Creating a GCHandle is a simple and efficient way of having the object keep itself alive, but care must be taken to free the GCHandle when there are no more pending tasks left, to avoid memory leaks.
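
A minimal, self-contained illustration of the forward-only problem (standalone demo, not library code): the Task returned below keeps no reference back to the Timer that would complete it, so collecting the Timer leaves the await hanging.

using System;
using System.Threading;
using System.Threading.Tasks;

class ForwardOnlyDemo
{
    static Task StartWork()
    {
        var tcs = new TaskCompletionSource<bool>();
        // The callback references tcs and the Timer references the callback,
        // but nothing references the Timer once StartWork returns.
        _ = new Timer(_ => tcs.TrySetResult(true), null, 2000, Timeout.Infinite);
        return tcs.Task;
    }

    static async Task Main()
    {
        var pending = StartWork();
        GC.Collect();
        GC.WaitForPendingFinalizers();
        // In a release build the Timer is typically collected above, its
        // finalizer cancels it, and this await can hang forever.
        await pending;
        Console.WriteLine("Completed: the Timer survived long enough to fire.");
    }
}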

While completing the tasks in the finalizer will avoid hung tasks, and is a good measure of last resort, it is not a proper fix, as it results in the tasks being completed (and with an exception) before the timeout expires. Keeping PhysicalBridge alive, on the other hand, results in those tasks being processed properly, just as when a strong reference to the ConnectionMultiplexer is reachable from a static field.

@kornelpal
Contributor Author

I've finished the GCHandle patch and pushed it to this PR, in case it helps. I fixed the incrementing of _backlogCurrentEnqueued, as the patch depends on it. To avoid a memory leak, AbandonPendingBacklog has to be called in Dispose, but I did not include that, as @NickCraver's branch already has it.

@mgravell
Collaborator

I have another approach in mind (I'm not at a computer until Tuesday) that achieves the same aim, but without the need for GCHandle (we still root and unroot, just in an alternative way). I'll compare and contrast both ideas when I get to look at it on Tuesday.

@kornelpal
Contributor Author

@mgravell, I've created a GCHandle-free approach too. Unlike GCHandle, a HashSet seems to remain just as fast with 1M items as with only one. On the other hand, GCHandle is faster with fewer instances, so adding a holder object to the HashSet at initialization improves run-time performance. Another option would be to use a holder object but root it using a GCHandle.
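
A sketch of that HashSet variant (hypothetical names; a static set serves as the GC root, and bridges add and remove themselves around backlog transitions):

using System.Collections.Generic;

internal static class BridgeRoots
{
    private static readonly HashSet<object> s_roots = new HashSet<object>();

    // Lock because multiple bridges can mutate the set concurrently.
    public static void Root(object bridge)
    {
        lock (s_roots) s_roots.Add(bridge);
    }

    public static void Unroot(object bridge)
    {
        lock (s_roots) s_roots.Remove(bridge);
    }
}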

@mgravell
Collaborator

I have a plan sketched out that involves far fewer moving parts but offers the same guarantees, plus handling a few other important nuances; I want to try that on Tuesday before going too far elsewhere. Appreciate the thoughts, though. It helps that we cheated and asked Stephen Toub for input and ideas; he has some good tricks for these scenarios.

@mgravell
Collaborator

Here's what I had in mind: #2413 - the key differences here:

  • we're considering both pending and live queue items
  • we're using simple strong-refs for simplicity
  • by rooting the muxer, we ensure that the heartbeat and hence timeout keep working; there's no advantage in keeping bridges alive if nobody is poking the bridges with a stick to tell things that they're dead

@kornelpal
Contributor Author

@mgravell, your approach looks good to me, and using the timer to keep the multiplexer alive is a clever trick.

@mgravell and @NickCraver, should I rebase this PR on #2413?

The bug this PR intends to fix is also fixed in #2408, and the incrementing of _backlogCurrentEnqueued is also fixed in #2413. The rooting code from this PR is not needed, so I am going to remove it.

These are the changes that are not present and not resolved in either #2408 or #2413:

  • Issue2392Tests, which covers both hung-task issues, but because of that it also depends on the fix to succeed.
  • An updated comment in CheckBacklogForTimeouts reflecting the change, and completing async messages to be consistent with non-backlogged timeouts. This also results in an optimization for fire-and-forget messages and keeps the stack trace/exception message consistent for sync messages.
  • Locking the backlog in AbandonPendingBacklog to preserve the thread-safety requirements of the timeout checker. Maybe a separate version without locking is needed for Dispose and the finalizer.

@mgravell
Collaborator

In principle, sure, if there are other things needed; however, that third item sounds like a very bad idea - it is very intentional that we don't hold that lock while doing that. I'd want to check the reasoning there and see if something else might be possible.

@kornelpal
Contributor Author

kornelpal commented Mar 28, 2023

PhysicalBridge.OnConnectionFailed seems to happen independently of the heartbeat, so I believe it can run in parallel with CheckBacklogForTimeouts. I mean locking for each BacklogTryDequeue call separately, as in the current version of the PR (see the sketch below). Update: I believe that Dispose can be called while CheckBacklogForTimeouts is running, which can result in the one message dequeued by CheckBacklogForTimeouts not being completed.
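
An illustrative shape of that per-call locking (the lock object, the Message type usage, and the completion call are assumptions for the sketch, not the exact PR code):

while (true)
{
    Message message;
    // Lock per dequeue so concurrent paths (e.g. OnConnectionFailed) see a
    // consistent queue; the dequeued message is then owned by this thread.
    lock (_backlogLock)
    {
        if (!BacklogTryDequeue(out message)) break;
    }
    // Complete outside the lock so user continuations don't run under it.
    message.SetExceptionAndComplete(timeoutException, this);
}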

@kornelpal
Contributor Author

I've reverted this PR to the original fix, and created a new PR, #2415, for the test.

@NickCraver
Collaborator

Current is looking great - thanks a ton for all the debugging help on this and the GC issue <3 - getting this PR in first to make sure I eat the merge over in #2408, then we'll get the final test additions in after.
