Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FireAndForget, Timeouts] Redis won't timeout requests if there is connection issue and FireAnfForget request is queued #2392

Closed
plachor opened this issue Mar 3, 2023 · 9 comments

Comments

@plachor
Copy link

plachor commented Mar 3, 2023

Hi, recently we observed that when Redis is not accessible to our microservices than they stop responding to requests. This was quite a shock as we always relayed on Redis as on service that is not guaranteed. Therefore we strongly relay on short timeouts and in case Redis is not available we always upstream requests to source of truth.

However some time ago we started to use FireAndForget flag for requests that did not require to complete (best effort). This was ideal when invalidating cache and similar use cases.

It seems that when we queue such request in time when there is network partition to Redis instance or some fail over procedure is in progress than all subsequent request to Redis on every thread using same connection multiplexer hangs for ever (until Redis instance is back and accessible).

Check this simple test:

[Fact]
public async Task IssueReproduction()
{
    // connection to fake service
    var connection = await ConnectionMultiplexer.ConnectAsync(new ConfigurationOptions
    {
        EndPoints   = new EndPointCollection(new List<EndPoint>{new DnsEndPoint("localhost", 1111)}),
        AsyncTimeout = 500,
        SyncTimeout = 500,
        ConnectTimeout = 500,
        ConnectRetry = 0,
        AbortOnConnectFail = false
    });
    // Timeouts works when we try to get something from Redis
    var stopwatch = Stopwatch.StartNew();
    try
    {
        await connection.GetDatabase().StringGetAsync("fake-key");
    }
    catch (RedisTimeoutException) { }
    stopwatch.Elapsed.Should().BeLessThan(TimeSpan.FromSeconds(2));
    // After we try to set something it will work as well
    stopwatch.Restart();
    try
    {
        await connection.GetDatabase().StringSetAsync("fake-key", "value", TimeSpan.FromHours(1), flags: CommandFlags.None);
    }
    catch (RedisTimeoutException) { }
    stopwatch.Elapsed.Should().BeLessThan(TimeSpan.FromSeconds(2));
    
    // It still works when we get something
    stopwatch.Restart();
    try
    {
        await connection.GetDatabase().StringGetAsync("fake-key");
    }
    catch (RedisTimeoutException) { }
    stopwatch.Elapsed.Should().BeLessThan(TimeSpan.FromSeconds(2));
    
    // It works for first fire-and-forget set
    stopwatch.Restart();
    try
    {
        await connection.GetDatabase().StringSetAsync("fake-key", "value", TimeSpan.FromHours(1), flags: CommandFlags.FireAndForget);
    }
    catch (RedisTimeoutException) { }
    stopwatch.Elapsed.Should().BeLessThan(TimeSpan.FromSeconds(2));
    
    // BUT after first fire and forget set it stops respecting timeouts and hangs all future requests for ever
    await connection.GetDatabase().StringGetAsync("fake-key");
    throw new Exception("THIS WILL NOT BE THROWN!");
}

We are on latest version of StackExchange.Redis assembly 2.6.96.

Is it right approach to use FireAndForget mode to speed such invalidations/updates when we hope it will succeed but we do not want to block parent request waiting for Redis response? Like I said best effort to set something in cache in background.

Is it expected that it disrupt timeouts for every subsequent request?

@plachor plachor changed the title [FireAndForget, Timeouts] Redis won't time requests if there is connection issue and FireAnfForget request is queued [FireAndForget, Timeouts] Redis won't timeout requests if there is connection issue and FireAnfForget request is queued Mar 3, 2023
@mgravell
Copy link
Collaborator

mgravell commented Mar 3, 2023

This seems odd; I agree that this isn't intentional but a quick glance at the code doesn't show any obvious smoking guns - the timeout detection code doesn't check for F+F, it just checks the write time - and the write time is written regardless. Will need to look.

@plachor
Copy link
Author

plachor commented Mar 6, 2023

I've forgot to mention it hangs on .NET6 and .NET7. Any luck identifying issue behind ?

@plachor
Copy link
Author

plachor commented Mar 10, 2023

@mgravell have you looked at this issue ? It seems serious problem if someone relay on F+F mode and timeouts.
I've tried to check where it hangs and it seems that this line is issue:
https://github.com/StackExchange/StackExchange.Redis/blob/main/src/StackExchange.Redis/PhysicalBridge.cs#L873

For F+F request message.ResultBox is null which breaks CheckBacklogForTimeouts and pending in queue messages that should be timeout are not timeouting. Should there be continue?

There still is issue with this _backlog which you simply TryPeek so you want cheek next requests

@plachor
Copy link
Author

plachor commented Mar 10, 2023

Not sure if that is any option :P but removing || message.ResultBox == null does a job :P for this timeout issue.

@plachor
Copy link
Author

plachor commented Mar 10, 2023

Introduced within: 7dda23a#diff-c64610826746e4cc2aeb0edf12469d2ea64583486a9246f7493d197bc33c6af1R856

Perhaps you need better way to determine completed messages.

@mgravell
Copy link
Collaborator

have you looked at this issue

Briefly, as per my previous comment - when I looked at the other place timeout can happen, in the sent queue. Ultimately, this isn't my day job, and not everything can be done immediately. I do agree with your analysis about the pending queue and the null result box. The "simple" fix here would be to rip F+F messages from the pending queue and drop them on the floor (there's nobody to tell about a timeout). It might be possible to do something more exotic, though, that preserves them - even if we simply dequeue them, reset their timeout, and enqueue them back at the end. I'll have a think about which might be more desirable in the F+F area.

@KrzysztofBranicki
Copy link

I would assume that if someone is explicitly using F+F option he doesn't care about being informed whether operation succeeded or failed and is accepting the "best effort" nature of this operation like in pattern below:

  1. try retrieving data from cache
  2. if not present take data from source of truth
    2.1 after retrieving data from source of truth populate cache using F+F

Probably the the worst thing we can do here is hang in operation which is assumed to be "best effort" anyway. In this context it make sense to remove F+F messages from the pending queue as soon as they are timed out.

It might be possible to do something more exotic, though, that preserves them - even if we simply dequeue them, reset their timeout, and enqueue them back at the end.

This solution can result in situation when operations are executed on Redis way past their timeout which kinda breaks the contract with developer who specified the timeout. This can potentially result in consistent data (set by microservice which doesn't have connectivity issues) being overridden by stale data provided by some instance which had connectivity issues. Of course we may argute that such thing can happen regardless but having mechanism that resets the messages timeout definitely increases the risk.

@plachor
Copy link
Author

plachor commented Mar 10, 2023

I also agree that when using F+F you aim for performance and not guarantees of delivering / success.

@eligu
Copy link

eligu commented Mar 10, 2023

I agree that re-enqueueing them would have side effects, especially taking into accoun that the timeout set by the dev is already reached so he may have compensating actions, such as read again from the source of truth and cache it, so it will lead to some discrepancy. To not break the contract of the timeout, as soon as requests are timed out as per contract they may be dropped.

kornelpal added a commit to kornelpal/StackExchange.Redis that referenced this issue Mar 10, 2023
…og and avoid sending messages after the sync wait timed out.
kornelpal added a commit to kornelpal/StackExchange.Redis that referenced this issue Mar 19, 2023
…og when not connected, even when no completion is needed, to be able to dequeue and complete other timed out messages.
kornelpal added a commit to kornelpal/StackExchange.Redis that referenced this issue Mar 28, 2023
…og when not connected, even when no completion is needed, to be able to dequeue and complete other timed out messages.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants