Sentinel failover not detected when connection hangs #1314

Closed

mjomble opened this issue Mar 29, 2021 · 13 comments · Fixed by #1328
@mjomble
Contributor

mjomble commented Mar 29, 2021

I have stumbled onto the same issue as these:

After some investigation, I concluded that ioredis currently relies on Redis closing the connection, as described here.

However, when the failover is initiated with the Redis DEBUG SLEEP command or docker pause, the connection simply hangs but doesn't terminate.

This could be solved by subscribing to Sentinel messages on the +switch-master channel, described in the Sentinel docs as "the message most external users are interested in".

I've created a reproducible example: https://github.com/mjomble/ioredis-sentinel-issue
This example listens for the message outside ioredis.
Once the message is received, it uses internal/undocumented fields to call client.connector.stream.destroy(), because redis.disconnect(true) (which calls stream.end()) leaves the connection open in this scenario.
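
For reference, a minimal sketch of that workaround (the sentinel address and master name are placeholders, and client.connector.stream is the internal field mentioned above, not a supported API):

```js
const Redis = require("ioredis");

// Main client, resolved through Sentinel (placeholder address and name).
const client = new Redis({
  sentinels: [{ host: "sentinel-1", port: 26379 }],
  name: "mymaster",
});

// Separate plain connection to the sentinel, used only for pub/sub.
const sentinel = new Redis({ host: "sentinel-1", port: 26379 });
sentinel.subscribe("+switch-master");

sentinel.on("message", (channel, message) => {
  if (channel !== "+switch-master") return;
  console.log("Failover announced:", message);
  // disconnect(true) only calls stream.end(), which a hung connection
  // never completes, so destroy the socket directly; ioredis then
  // reconnects and re-resolves the master through the sentinels.
  client.connector.stream.destroy();
});
```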

Ideally, this could all happen inside ioredis. I could probably submit a PR if needed.

@luin
Collaborator

luin commented Mar 29, 2021

Hi Andres, thanks for raising this and for the detailed explanation! A PR would be welcome.

@mjomble
Contributor Author

mjomble commented Mar 31, 2021

@luin I'm thinking of two different approaches here:

  1. Listen to the +switch-master message on all sentinels.
  2. Listen to the message only on the first successfully reached sentinel (the same one used to resolve the master).

The first one seems more reliable, but could get complicated when some sentinels are up and some are down. It also might not work well with the existing SentinelIterator logic.
The second one should integrate more naturally with SentinelIterator, but might miss the message if that sentinel hangs as well, which can easily happen if it runs on the same machine as the master. Maybe we should add a ping loop to it?
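
For illustration, a ping loop along these lines (interval and timeout values are made up) could detect a hung sentinel connection:

```js
// Periodically PING the sentinel connection; if no reply arrives in time,
// drop the connection (or destroy the socket, as discussed above) so the
// iterator can move on to another sentinel.
function watchSentinel(sentinelClient, intervalMs = 5000, timeoutMs = 2000) {
  return setInterval(() => {
    let timer;
    const timeout = new Promise((_, reject) => {
      timer = setTimeout(() => reject(new Error("sentinel ping timed out")), timeoutMs);
    });
    Promise.race([sentinelClient.ping(), timeout])
      .catch(() => sentinelClient.disconnect())
      .finally(() => clearTimeout(timer));
  }, intervalMs);
}
```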

Any thoughts/recommendations?

@luin
Collaborator

luin commented Mar 31, 2021

Yeah, it's more complex than I thought. I'd go with subscribing to all sentinels because, in the case of a partition, as you said, it's very likely that the connected sentinel and the old master are on the same network, so we wouldn't be able to get events from that sentinel.

As for implementation details, I think SentinelConnector will get a lastActiveSentinel property, which defaults to null. Once a node is resolved, the connector will subscribe to all sentinels provided by the user (I don't think it needs to be dynamic in v1, as that seems non-trivial to implement). Not sure if it's necessary, but a reasonable connection count limit could be applied in case the user provides too many sentinels.

When a +switch-master is received, we set lastActiveSentinel to the sentinel that got the event and disconnect, so Redis#connect() will kick in. The next time, SentinelConnector will try lastActiveSentinel first (and then reset it). Wdyt?
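
Roughly, that flow could be approximated from user code like this (placeholder addresses and master name; in the real implementation the logic would live inside SentinelConnector rather than in a separate class):

```js
const Redis = require("ioredis");

// Hypothetical standalone sketch: subscribe to +switch-master on every
// configured sentinel and remember which one last reported a failover.
class FailoverWatcher {
  constructor(sentinels, onSwitchMaster) {
    this.lastActiveSentinel = null;
    this.subscribers = sentinels.map((addr) => {
      const sub = new Redis({ host: addr.host, port: addr.port });
      sub.subscribe("+switch-master");
      sub.on("message", (channel, message) => {
        this.lastActiveSentinel = addr; // try this sentinel first next time
        onSwitchMaster(addr, message);
      });
      return sub;
    });
  }

  stop() {
    for (const sub of this.subscribers) sub.disconnect();
  }
}

const sentinels = [
  { host: "sentinel-1", port: 26379 },
  { host: "sentinel-2", port: 26379 },
  { host: "sentinel-3", port: 26379 },
];
const client = new Redis({ sentinels, name: "mymaster" });

new FailoverWatcher(sentinels, (sentinel, message) => {
  console.log(`+switch-master from ${sentinel.host}:`, message);
  // Drop the (possibly hung) connection so the master is re-resolved
  // through the sentinels on reconnect.
  client.connector.stream.destroy();
});
```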

@ohadisraeli @leibale Btw, do you have any input on the correct behavior here: should clients subscribe to all sentinels or not? Or maybe it should be behind an option so users can enable/disable it?

@ioredis-robot
Collaborator

🎉 This issue has been resolved in version 4.27.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

@shifubrams

Hi, I use version 4.27.6 and the issue still exists. What is the best option to avoid this issue?

@mjomble
Contributor Author

mjomble commented Apr 7, 2022

> Hi, I use version 4.27.6 and the issue still exists. What is the best option to avoid this issue?

Can you create a reproducible example? Otherwise it's extremely difficult to pinpoint the exact cause of the issue.

@shifubrams

shifubrams commented Apr 8, 2022

> Hi, I use version 4.27.6 and the issue still exists. What is the best option to avoid this issue?

> Can you create a reproducible example? Otherwise it's extremely difficult to pinpoint the exact cause of the issue.

Thanks for the quick response.
The steps are:

  • create a Node project with ioredis, set a key with a TTL, and listen for its expiration, for example
  • connect ioredis with a config of at least 2 sentinels (1 master, 1 slave); see the sketch after this list
  • trigger a Redis failover on your Redis server (pass master responsibility to the slave)
  • you will see that the listener doesn't continue and stays stuck on the dead sentinel
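
A minimal sketch of that connection config (host names, ports, and the master name are placeholders):

```js
const Redis = require("ioredis");

// Connect through Sentinel: ioredis asks the listed sentinels for the
// current master of the named group and connects to it.
const redis = new Redis({
  sentinels: [
    { host: "sentinel-1", port: 26379 },
    { host: "sentinel-2", port: 26379 },
  ],
  name: "mymaster",
});

redis.on("error", (err) => console.error("redis error:", err.message));
```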

@mjomble
Contributor Author

mjomble commented Apr 8, 2022

> and listen for its expiration

I'm not familiar with this feature, could you provide a code example?

@shifubrams

shifubrams commented Apr 8, 2022

> and listen for its expiration

> I'm not familiar with this feature, could you provide a code example?

Yes, sure.

It's about using the keyspace notifications feature, for example:

redis.subscribe('__keyevent@0__:expired');

redis.on('message', async (channel, message) => {
  // Handle event
  console.log(channel, message);
});
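
One note on that snippet: expired-key events only fire when the server has keyspace notifications enabled, which is off by default. A sketch of enabling it from a separate connection (placeholder address), since a subscribed client can't issue regular commands:

```js
const Redis = require("ioredis");

// "Ex" = keyevent notifications (E) for expired keys (x), which the
// __keyevent@0__:expired channel relies on; equivalent to running
// CONFIG SET notify-keyspace-events Ex in redis-cli.
const admin = new Redis({ host: "127.0.0.1", port: 6379 });
admin
  .config("SET", "notify-keyspace-events", "Ex")
  .then(() => admin.quit())
  .catch((err) => console.error(err));
```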

@mjomble
Contributor Author

mjomble commented Apr 8, 2022

Are you sure it's the same issue and ioredis fails to detect the failover?

Or could it be that the failover is actually detected and the problem is more specifically about the subscription?

You could try adding extra code, for example something that increments a Redis value every second and logs the new value to the console.
If the listener fails, but the value incrementing still works, then failover detection is fine and you can create a new, more specific issue about the listener.
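
For example, something along these lines (placeholder key name; it uses its own connection, since a subscribed client can't run regular commands):

```js
const Redis = require("ioredis");

const probe = new Redis({
  sentinels: [
    { host: "sentinel-1", port: 26379 },
    { host: "sentinel-2", port: 26379 },
  ],
  name: "mymaster",
});

// If this counter keeps advancing after the failover while the expiration
// listener stays silent, failover detection is working and only the
// subscription is affected.
setInterval(async () => {
  try {
    const value = await probe.incr("failover-test:counter");
    console.log("counter =", value);
  } catch (err) {
    console.error("incr failed:", err.message);
  }
}, 1000);
```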

@shifubrams

I think the subscription should be reset to another sentinel if we have a failover.

From my point of view, I think it's related. Maybe I'm wrong, you can tell me.

@mjomble
Contributor Author

mjomble commented Apr 14, 2022

> I think the subscription should be reset to another sentinel if we have a failover.

I agree

> From my point of view, I think it's related. Maybe I'm wrong, you can tell me.

It is certainly related. And if the root cause of your problem is that failover detection fails, then it is also the exact same issue.

If, however, it turns out that the failover is detected successfully and the problem is that ioredis does not perform the necessary additional actions afterwards, then I would say it is a related but separate issue.

So my recommendation is to first try to find out whether the failover is actually detected or not.

@larsha

larsha commented Mar 22, 2023

We're seeing this issue in version 5.3.1. I have verified it: when I gracefully shut down the Redis master instance, a new election happens, Sentinel announces the new master, our application creates a new connection to the master, and everything works as it should.

I have also tried executing the sleep command on the Redis master instance, which results in the same master election, with Sentinel announcing the new master, but our application does not initiate a new connection, so commands still try to execute against the old master.

I will try to implement the workaround that listens to the Sentinel +switch-master message and see if that solves it.

EDIT: I just noticed the failoverDetector option, which seems to resolve this issue for us. I think this flag should be better documented, or perhaps I just missed it completely :) One can also argue that it should be on by default when using the lib in Sentinel mode.

EDIT 2: I also see these errors after I turned on failoverDetector: #1362 (comment)
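
For reference, enabling the option looks roughly like this (placeholder sentinel addresses and master name):

```js
const Redis = require("ioredis");

// Opt in to the failover detector mentioned above: with it enabled,
// ioredis listens for +switch-master on the sentinels and reconnects
// when a failover is announced, even if the old connection just hangs.
const redis = new Redis({
  sentinels: [
    { host: "sentinel-1", port: 26379 },
    { host: "sentinel-2", port: 26379 },
  ],
  name: "mymaster",
  failoverDetector: true,
});
```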
