Sentinel failover not detected when connection hangs #1314

Closed

mjomble opened this issue Mar 29, 2021 · 13 comments · Fixed by #1328
@mjomble
Contributor

mjomble commented Mar 29, 2021

I have stumbled onto the same issue as these:

After some investigation, I concluded that ioredis currently relies on Redis closing the connection, as described here.

However, when the failover is initiated with the Redis DEBUG SLEEP command or docker pause, the connection simply hangs but doesn't terminate.

This could be solved by subscribing to Sentinel messages on the +switch-master channel, described in the Sentinel docs as "the message most external users are interested in".

I've created a reproducible example: https://github.com/mjomble/ioredis-sentinel-issue
This example listens for the message outside ioredis.
Once the message is received, it uses internal/undocumented fields to call client.connector.stream.destroy(), because redis.disconnect(true) (which calls stream.end()) leaves the connection open in this scenario.
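
For reference, a minimal sketch of that workaround (the sentinel address and master name are placeholders, and client.connector.stream is the internal field mentioned above, not a supported API):

```js
const Redis = require("ioredis");

// Main client, resolved through Sentinel (placeholder address and name).
const client = new Redis({
  sentinels: [{ host: "sentinel-1", port: 26379 }],
  name: "mymaster",
});

// Separate plain connection to the sentinel, used only for pub/sub.
const sentinel = new Redis({ host: "sentinel-1", port: 26379 });
sentinel.subscribe("+switch-master");

sentinel.on("message", (channel, message) => {
  if (channel !== "+switch-master") return;
  console.log("Failover announced:", message);
  // disconnect(true) only calls stream.end(), which a hung connection
  // never completes, so destroy the socket directly; ioredis then
  // reconnects and re-resolves the master through the sentinels.
  client.connector.stream.destroy();
});
```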

Ideally, this could all happen inside ioredis. I could probably submit a PR if needed.

@luin
Collaborator

luin commented Mar 29, 2021

Hi Andres, thanks for raising this and for the detailed explanation! A PR would be welcome.

@mjomble
Contributor Author

mjomble commented Mar 31, 2021

@luin I'm thinking of two different approaches here:

  1. Listen to the +switch-master message on all sentinels.
  2. Listen to the message only on the first successfully reached sentinel (the same one used to resolve the master).

The first one seems more reliable, but could get complicated when some sentinels are up and some are down. It also might not work well with the existing SentinelIterator logic.
The second one should integrate more naturally with SentinelIterator, but might miss the message if that sentinel hangs as well, which can easily happen if it runs on the same machine as the master. Maybe we should add a ping loop to it?
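
For illustration, a ping loop along these lines (interval and timeout values are made up) could detect a hung sentinel connection:

```js
// Periodically PING the sentinel connection; if no reply arrives in time,
// drop the connection (or destroy the socket, as discussed above) so the
// iterator can move on to another sentinel.
function watchSentinel(sentinelClient, intervalMs = 5000, timeoutMs = 2000) {
  return setInterval(() => {
    let timer;
    const timeout = new Promise((_, reject) => {
      timer = setTimeout(() => reject(new Error("sentinel ping timed out")), timeoutMs);
    });
    Promise.race([sentinelClient.ping(), timeout])
      .catch(() => sentinelClient.disconnect())
      .finally(() => clearTimeout(timer));
  }, intervalMs);
}
```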

Any thoughts/recommendations?

@luin
Collaborator

luin commented Mar 31, 2021

Yeah, it's more complex than I thought. I'd go with subscribing to all sentinels because, in the case of a partition, as you said, it's very likely that the connected sentinel and the old master are on the same network, so we wouldn't be able to get events from that sentinel.

As for implementation details, I think SentinelConnector will get a lastActiveSentinel property, which defaults to null. Once a node is resolved, the connector will subscribe to all sentinels provided by the user (I don't think it needs to be dynamic in v1, as that seems non-trivial to implement). Not sure if it's necessary, but a reasonable connection count limit could be applied in case the user provides too many sentinels.

When a +switch-master is received, we set lastActiveSentinel to the sentinel that got the event and disconnect, so Redis#connect() will kick in. The next time, SentinelConnector will try lastActiveSentinel first (and then reset it). Wdyt?
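
Roughly, that flow could be approximated from user code like this (placeholder addresses and master name; in the real implementation the logic would live inside SentinelConnector rather than in a separate class):

```js
const Redis = require("ioredis");

// Hypothetical standalone sketch: subscribe to +switch-master on every
// configured sentinel and remember which one last reported a failover.
class FailoverWatcher {
  constructor(sentinels, onSwitchMaster) {
    this.lastActiveSentinel = null;
    this.subscribers = sentinels.map((addr) => {
      const sub = new Redis({ host: addr.host, port: addr.port });
      sub.subscribe("+switch-master");
      sub.on("message", (channel, message) => {
        this.lastActiveSentinel = addr; // try this sentinel first next time
        onSwitchMaster(addr, message);
      });
      return sub;
    });
  }

  stop() {
    for (const sub of this.subscribers) sub.disconnect();
  }
}

const sentinels = [
  { host: "sentinel-1", port: 26379 },
  { host: "sentinel-2", port: 26379 },
  { host: "sentinel-3", port: 26379 },
];
const client = new Redis({ sentinels, name: "mymaster" });

new FailoverWatcher(sentinels, (sentinel, message) => {
  console.log(`+switch-master from ${sentinel.host}:`, message);
  // Drop the (possibly hung) connection so the master is re-resolved
  // through the sentinels on reconnect.
  client.connector.stream.destroy();
});
```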

@ohadisraeli @leibale Btw, do you have any input on the correct behavior here: should clients subscribe to all sentinels or not? Or maybe it should be behind an option so users can enable/disable it?

@ioredis-robot
Collaborator

🎉 This issue has been resolved in version 4.27.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

@shifubrams

Hi, I use version 4.27.6 and the issue still exists. What is the best option to avoid this issue?

@mjomble
Contributor Author

mjomble commented Apr 7, 2022

> Hi, I use version 4.27.6 and the issue still exists. What is the best option to avoid this issue?

Can you create a reproducible example? Otherwise it's extremely difficult to pinpoint the exact cause of the issue.

@shifubrams

shifubrams commented Apr 8, 2022

> Hi, I use version 4.27.6 and the issue still exists. What is the best option to avoid this issue?

> Can you create a reproducible example? Otherwise it's extremely difficult to pinpoint the exact cause of the issue.

Thanks for the quick response.
The steps are:

  • create a Node project with ioredis, set a key with a TTL, and listen for its expiration, for example
  • connect ioredis with a config of at least 2 sentinels (1 master, 1 slave); see the sketch after this list
  • trigger a Redis failover on your Redis server (pass master responsibility to the slave)
  • you will see that the listener doesn't continue and stays stuck on the dead sentinel
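
A minimal sketch of that connection config (host names, ports, and the master name are placeholders):

```js
const Redis = require("ioredis");

// Connect through Sentinel: ioredis asks the listed sentinels for the
// current master of the named group and connects to it.
const redis = new Redis({
  sentinels: [
    { host: "sentinel-1", port: 26379 },
    { host: "sentinel-2", port: 26379 },
  ],
  name: "mymaster",
});

redis.on("error", (err) => console.error("redis error:", err.message));
```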

@mjomble
Contributor Author

mjomble commented Apr 8, 2022

> and listen for its expiration

I'm not familiar with this feature, could you provide a code example?

@shifubrams

shifubrams commented Apr 8, 2022

> and listen for its expiration

> I'm not familiar with this feature, could you provide a code example?

Yes, sure.

It's about using the keyspace notifications feature, for example:

redis.subscribe('__keyevent@0__:expired');

redis.on('message', async (channel, message) => {
  // Handle event
  console.log(channel, message);
});
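
One note on that snippet: expired-key events only fire when the server has keyspace notifications enabled, which is off by default. A sketch of enabling it from a separate connection (placeholder address), since a subscribed client can't issue regular commands:

```js
const Redis = require("ioredis");

// "Ex" = keyevent notifications (E) for expired keys (x), which the
// __keyevent@0__:expired channel relies on; equivalent to running
// CONFIG SET notify-keyspace-events Ex in redis-cli.
const admin = new Redis({ host: "127.0.0.1", port: 6379 });
admin
  .config("SET", "notify-keyspace-events", "Ex")
  .then(() => admin.quit())
  .catch((err) => console.error(err));
```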

@mjomble
Contributor Author

mjomble commented Apr 8, 2022

Are you sure it's the same issue and ioredis fails to detect the failover?

Or could it be that the failover is actually detected and the problem is more specifically about the subscription?

You could try adding extra code, for example something that increments a Redis value every second and logs the new value to the console.
If the listener fails, but the value incrementing still works, then failover detection is fine and you can create a new, more specific issue about the listener.
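
For example, something along these lines (placeholder key name; it uses its own connection, since a subscribed client can't run regular commands):

```js
const Redis = require("ioredis");

const probe = new Redis({
  sentinels: [
    { host: "sentinel-1", port: 26379 },
    { host: "sentinel-2", port: 26379 },
  ],
  name: "mymaster",
});

// If this counter keeps advancing after the failover while the expiration
// listener stays silent, failover detection is working and only the
// subscription is affected.
setInterval(async () => {
  try {
    const value = await probe.incr("failover-test:counter");
    console.log("counter =", value);
  } catch (err) {
    console.error("incr failed:", err.message);
  }
}, 1000);
```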

@shifubrams

I think the subscription should be reset to another sentinel if we have a failover.

From my point of view, I think it's related. Maybe I'm wrong, you can tell me.

@mjomble
Contributor Author

mjomble commented Apr 14, 2022

> I think the subscription should be reset to another sentinel if we have a failover.

I agree

> From my point of view, I think it's related. Maybe I'm wrong, you can tell me.

It is certainly related. And if the root cause of your problem is that failover detection fails, then it is also the exact same issue.

If, however, it turns out that the failover is detected successfully and the problem is that ioredis does not perform the necessary additional actions afterwards, then I would say it is a related but separate issue.

So my recommendation is to first try to find out whether the failover is actually detected or not.

@larsha

larsha commented Mar 22, 2023

We're seeing this issue in version 5.3.1. I have verified it: when I gracefully shut down the Redis master instance, a new election happens, Sentinel announces the new master, our application creates a new connection to the master, and everything works as it should.

I have also tried executing the sleep command on the Redis master instance, which results in the same master election, with Sentinel announcing the new master, but our application does not initiate a new connection, so commands still try to execute against the old master.

I will try to implement the workaround that listens to the Sentinel +switch-master message and see if that solves it.

EDIT: I just noticed the failoverDetector option, which seems to resolve this issue for us. I think this flag should be better documented, or perhaps I just missed it completely :) One can also argue that it should be on by default when using the lib in Sentinel mode.

EDIT 2: I also see these errors after I turned on failoverDetector: #1362 (comment)
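
For reference, enabling the option looks roughly like this (placeholder sentinel addresses and master name):

```js
const Redis = require("ioredis");

// Opt in to the failover detector mentioned above: with it enabled,
// ioredis listens for +switch-master on the sentinels and reconnects
// when a failover is announced, even if the old connection just hangs.
const redis = new Redis({
  sentinels: [
    { host: "sentinel-1", port: 26379 },
    { host: "sentinel-2", port: 26379 },
  ],
  name: "mymaster",
  failoverDetector: true,
});
```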
