[FireAndForget, Timeouts] Redis won't timeout requests if there is connection issue and FireAnfForget request is queued #2392

plachor · 2023-03-03T11:43:01Z

Hi, recently we observed that when Redis is not accessible to our microservices than they stop responding to requests. This was quite a shock as we always relayed on Redis as on service that is not guaranteed. Therefore we strongly relay on short timeouts and in case Redis is not available we always upstream requests to source of truth.

However some time ago we started to use FireAndForget flag for requests that did not require to complete (best effort). This was ideal when invalidating cache and similar use cases.

It seems that when we queue such request in time when there is network partition to Redis instance or some fail over procedure is in progress than all subsequent request to Redis on every thread using same connection multiplexer hangs for ever (until Redis instance is back and accessible).

Check this simple test:

[Fact]
public async Task IssueReproduction()
{
    // connection to fake service
    var connection = await ConnectionMultiplexer.ConnectAsync(new ConfigurationOptions
    {
        EndPoints   = new EndPointCollection(new List<EndPoint>{new DnsEndPoint("localhost", 1111)}),
        AsyncTimeout = 500,
        SyncTimeout = 500,
        ConnectTimeout = 500,
        ConnectRetry = 0,
        AbortOnConnectFail = false
    });
    // Timeouts works when we try to get something from Redis
    var stopwatch = Stopwatch.StartNew();
    try
    {
        await connection.GetDatabase().StringGetAsync("fake-key");
    }
    catch (RedisTimeoutException) { }
    stopwatch.Elapsed.Should().BeLessThan(TimeSpan.FromSeconds(2));
    // After we try to set something it will work as well
    stopwatch.Restart();
    try
    {
        await connection.GetDatabase().StringSetAsync("fake-key", "value", TimeSpan.FromHours(1), flags: CommandFlags.None);
    }
    catch (RedisTimeoutException) { }
    stopwatch.Elapsed.Should().BeLessThan(TimeSpan.FromSeconds(2));
    
    // It still works when we get something
    stopwatch.Restart();
    try
    {
        await connection.GetDatabase().StringGetAsync("fake-key");
    }
    catch (RedisTimeoutException) { }
    stopwatch.Elapsed.Should().BeLessThan(TimeSpan.FromSeconds(2));
    
    // It works for first fire-and-forget set
    stopwatch.Restart();
    try
    {
        await connection.GetDatabase().StringSetAsync("fake-key", "value", TimeSpan.FromHours(1), flags: CommandFlags.FireAndForget);
    }
    catch (RedisTimeoutException) { }
    stopwatch.Elapsed.Should().BeLessThan(TimeSpan.FromSeconds(2));
    
    // BUT after first fire and forget set it stops respecting timeouts and hangs all future requests for ever
    await connection.GetDatabase().StringGetAsync("fake-key");
    throw new Exception("THIS WILL NOT BE THROWN!");
}

We are on latest version of StackExchange.Redis assembly 2.6.96.

Is it right approach to use FireAndForget mode to speed such invalidations/updates when we hope it will succeed but we do not want to block parent request waiting for Redis response? Like I said best effort to set something in cache in background.

Is it expected that it disrupt timeouts for every subsequent request?

The text was updated successfully, but these errors were encountered:

mgravell · 2023-03-03T11:51:08Z

This seems odd; I agree that this isn't intentional but a quick glance at the code doesn't show any obvious smoking guns - the timeout detection code doesn't check for F+F, it just checks the write time - and the write time is written regardless. Will need to look.

plachor · 2023-03-06T09:42:05Z

I've forgot to mention it hangs on .NET6 and .NET7. Any luck identifying issue behind ?

plachor · 2023-03-10T05:59:09Z

@mgravell have you looked at this issue ? It seems serious problem if someone relay on F+F mode and timeouts.
I've tried to check where it hangs and it seems that this line is issue:
https://github.com/StackExchange/StackExchange.Redis/blob/main/src/StackExchange.Redis/PhysicalBridge.cs#L873

For F+F request message.ResultBox is null which breaks CheckBacklogForTimeouts and pending in queue messages that should be timeout are not timeouting. Should there be continue?

There still is issue with this _backlog which you simply TryPeek so you want cheek next requests

plachor · 2023-03-10T06:13:49Z

Not sure if that is any option :P but removing || message.ResultBox == null does a job :P for this timeout issue.

plachor · 2023-03-10T06:44:37Z

Introduced within: 7dda23a#diff-c64610826746e4cc2aeb0edf12469d2ea64583486a9246f7493d197bc33c6af1R856

Perhaps you need better way to determine completed messages.

mgravell · 2023-03-10T06:52:11Z

have you looked at this issue

Briefly, as per my previous comment - when I looked at the other place timeout can happen, in the sent queue. Ultimately, this isn't my day job, and not everything can be done immediately. I do agree with your analysis about the pending queue and the null result box. The "simple" fix here would be to rip F+F messages from the pending queue and drop them on the floor (there's nobody to tell about a timeout). It might be possible to do something more exotic, though, that preserves them - even if we simply dequeue them, reset their timeout, and enqueue them back at the end. I'll have a think about which might be more desirable in the F+F area.

KrzysztofBranicki · 2023-03-10T07:26:13Z

I would assume that if someone is explicitly using F+F option he doesn't care about being informed whether operation succeeded or failed and is accepting the "best effort" nature of this operation like in pattern below:

try retrieving data from cache
if not present take data from source of truth
2.1 after retrieving data from source of truth populate cache using F+F

Probably the the worst thing we can do here is hang in operation which is assumed to be "best effort" anyway. In this context it make sense to remove F+F messages from the pending queue as soon as they are timed out.

It might be possible to do something more exotic, though, that preserves them - even if we simply dequeue them, reset their timeout, and enqueue them back at the end.

This solution can result in situation when operations are executed on Redis way past their timeout which kinda breaks the contract with developer who specified the timeout. This can potentially result in consistent data (set by microservice which doesn't have connectivity issues) being overridden by stale data provided by some instance which had connectivity issues. Of course we may argute that such thing can happen regardless but having mechanism that resets the messages timeout definitely increases the risk.

plachor · 2023-03-10T09:13:48Z

I also agree that when using F+F you aim for performance and not guarantees of delivering / success.

eligu · 2023-03-10T10:31:51Z

I agree that re-enqueueing them would have side effects, especially taking into accoun that the timeout set by the dev is already reached so he may have compensating actions, such as read again from the source of truth and cache it, so it will lead to some discrepancy. To not break the contract of the timeout, as soon as requests are timed out as per contract they may be dropped.

…og and avoid sending messages after the sync wait timed out.

…og when not connected, even when no completion is needed, to be able to dequeue and complete other timed out messages.

plachor changed the title ~~[FireAndForget, Timeouts] Redis won't time requests if there is connection issue and FireAnfForget request is queued~~ [FireAndForget, Timeouts] Redis won't timeout requests if there is connection issue and FireAnfForget request is queued Mar 3, 2023

kornelpal added a commit to kornelpal/StackExchange.Redis that referenced this issue Mar 10, 2023

Fix StackExchange#2392: Process all timed out messages from the backl…

9c9b5bc

…og and avoid sending messages after the sync wait timed out.

kornelpal mentioned this issue Mar 10, 2023

Fix #2392: Dequeue all timed out messages from the backlog when not connected, even when no completion is needed, to be able to dequeue and complete other timed out messages. #2397

Merged

kornelpal mentioned this issue Mar 28, 2023

Proposed fix for GC rooting + extensive notes on how/why are in TimerToken #2413

Merged

NickCraver closed this as completed in ef388bd Mar 29, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FireAndForget, Timeouts] Redis won't timeout requests if there is connection issue and FireAnfForget request is queued #2392

[FireAndForget, Timeouts] Redis won't timeout requests if there is connection issue and FireAnfForget request is queued #2392

plachor commented Mar 3, 2023

mgravell commented Mar 3, 2023

plachor commented Mar 6, 2023

plachor commented Mar 10, 2023 •

edited

Loading

plachor commented Mar 10, 2023

plachor commented Mar 10, 2023

mgravell commented Mar 10, 2023

KrzysztofBranicki commented Mar 10, 2023

plachor commented Mar 10, 2023

eligu commented Mar 10, 2023

[FireAndForget, Timeouts] Redis won't timeout requests if there is connection issue and FireAnfForget request is queued #2392

[FireAndForget, Timeouts] Redis won't timeout requests if there is connection issue and FireAnfForget request is queued #2392

Comments

plachor commented Mar 3, 2023

mgravell commented Mar 3, 2023

plachor commented Mar 6, 2023

plachor commented Mar 10, 2023 • edited Loading

plachor commented Mar 10, 2023

plachor commented Mar 10, 2023

mgravell commented Mar 10, 2023

KrzysztofBranicki commented Mar 10, 2023

plachor commented Mar 10, 2023

eligu commented Mar 10, 2023

plachor commented Mar 10, 2023 •

edited

Loading