Bug Report

Current Behavior & Input Code

My product uses Sentinel's master node discovery:
https://github.com/lettuce-io/lettuce-core/wiki/Redis-Sentinel#sentinel.redis-discovery-using-redis-sentinel
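For context, the Sentinel-based discovery is set up roughly like this (a minimal sketch; hosts, ports, and the master name are placeholders, not the actual configuration):

    import io.lettuce.core.RedisClient;
    import io.lettuce.core.RedisURI;
    import io.lettuce.core.api.StatefulRedisConnection;

    // Lettuce asks the listed Sentinels for the current master address
    // and then connects to that master.
    RedisURI uri = RedisURI.Builder
            .sentinel("sentinel1.example.com", 26379, "mymaster")
            .withSentinel("sentinel2.example.com", 26379)
            .withSentinel("sentinel3.example.com", 26379)
            .build();

    RedisClient client = RedisClient.create(uri);
    StatefulRedisConnection<String, String> connection = client.connect();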
The other day, a Redis Sentinel node (a VM) became unable to return responses due to a hypervisor failure, and timeout errors began to occur. I expected Lettuce to perform the failover, but it failed. Failover only succeeded once the Sentinel VM had shut down completely.
(According to the Redis and Redis Sentinel logs, the failover itself completed immediately.)
I ran various tests. When the Sentinel node is completely down and the connection is refused, failover succeeds.
However, when timeout errors occur without the Sentinel node going down completely, failover does not seem to happen.
Take application startup as an example. With a Spring Boot app like this one, Lettuce skips the first unreachable Sentinel node and starts the application properly:
2020-03-27 00:56:56.085 INFO 63174 --- [ main] com.example.demo.DemoApplication : Starting DemoApplication on haseberyousukenoMacBook-Pro.local with PID 63174 (/Users/hasebe/Desktop/demo/build/classes/java/main started by hasebe in /Users/hasebe/Desktop/demo)
2020-03-27 00:56:56.087 INFO 63174 --- [ main] com.example.demo.DemoApplication : No active profile set, falling back to default profiles: default
2020-03-27 00:56:56.635 INFO 63174 --- [ main] o.s.b.w.embedded.tomcat.TomcatWebServer : Tomcat initialized with port(s): 8080 (http)
2020-03-27 00:56:56.643 INFO 63174 --- [ main] o.apache.catalina.core.StandardService : Starting service [Tomcat]
2020-03-27 00:56:56.643 INFO 63174 --- [ main] org.apache.catalina.core.StandardEngine : Starting Servlet engine: [Apache Tomcat/9.0.33]
2020-03-27 00:56:56.704 INFO 63174 --- [ main] o.a.c.c.C.[Tomcat].[localhost].[/] : Initializing Spring embedded WebApplicationContext
2020-03-27 00:56:56.704 INFO 63174 --- [ main] o.s.web.context.ContextLoader : Root WebApplicationContext: initialization completed in 586 ms
2020-03-27 00:56:56.896 INFO 63174 --- [ main] io.lettuce.core.EpollProvider : Starting without optional epoll library
2020-03-27 00:56:56.897 INFO 63174 --- [ main] io.lettuce.core.KqueueProvider : Starting without optional kqueue library
--> THIS 2020-03-27 00:56:56.972 WARN 63174 --- [ioEventLoop-4-1] io.lettuce.core.RedisClient : Cannot connect Redis Sentinel at RedisURI [host='cannot connectable node', port=26379]: java.util.concurrent.CompletionException: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: cannot connectable node/cannot connectable node:26379
2020-03-27 00:56:57.129 INFO 63174 --- [ main] o.s.s.concurrent.ThreadPoolTaskExecutor : Initializing ExecutorService 'applicationTaskExecutor'
2020-03-27 00:56:57.225 INFO 63174 --- [ main] o.s.b.w.embedded.tomcat.TomcatWebServer : Tomcat started on port(s): 8080 (http) with context path ''
2020-03-27 00:56:57.228 INFO 63174 --- [ main] com.example.demo.DemoApplication : Started DemoApplication in 1.338 seconds (JVM running for 2.028)
However, when a timeout error occurs instead, the application fails with an error and cannot start. Why isn't the request handed off to the next Sentinel node?
(The timeout error was generated using Toxiproxy.)
I also tried this after the application had started. I used Toxiproxy so that timeout errors occur on the Sentinel node Lettuce is connected to, and then brought down the master node. In this case failover fails, and the application keeps trying to connect to the downed master node forever.
On the other hand, when the Sentinel node goes down and a connect error occurs, the master node is discovered via the next Sentinel node and failover succeeds.
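For reference, the lookup Lettuce performs against Sentinel is essentially the following (a sketch using the Sentinel API directly; "mymaster" is a placeholder). When a Sentinel accepts the connection but never answers, this query blocks until the command timeout instead of moving on to the next node:

    import java.net.SocketAddress;
    import io.lettuce.core.sentinel.api.StatefulRedisSentinelConnection;

    // Connect to a Sentinel from the URI and ask it for the master address.
    StatefulRedisSentinelConnection<String, String> sentinel = client.connectSentinel();
    SocketAddress master = sentinel.sync().getMasterAddrByName("mymaster");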
Expected behavior/code
If a timeout error occurs for a Sentinel node, try the next Sentinel node.
(There is no problem with connect errors.)
That said, the problem can be worked around by setting pingBeforeActivateConnection to true; I worked around it by enabling this setting.
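In plain Lettuce the workaround looks like this (a minimal sketch; uri is the Sentinel URI from above):

    import io.lettuce.core.ClientOptions;
    import io.lettuce.core.RedisClient;

    RedisClient client = RedisClient.create(uri);
    // PING each connection while it is being activated, so an unresponsive
    // Sentinel fails fast and the next one is tried.
    client.setOptions(ClientOptions.builder()
            .pingBeforeActivateConnection(true)
            .build());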
Environment
Lettuce version(s): 5.2.2.RELEASE and 4.5.0.Final
Redis version: 4.0.8
Possible Solution
...
Additional context
...
After looking into this issue: the problem arises from the fact that the connect was successful but Sentinel failed to respond within the timeout. The client code assumes that once the connection is established, Sentinel is functional. At the time we query Sentinel we no longer have access to the connection progress (i.e. which hosts were tried, which failed, and so on), as we operate on an existing connection.
The entire mechanism is asynchronous, which imposes a certain complexity on fixing the issue properly. For now, please enable PING on connect via ClientOptions (ClientOptions.builder().pingBeforeActivateConnection(true).build()). This issues a PING command during the connect phase to ensure that Redis responds properly, so we get the guarantee that, at least at the time the connection is created, the Sentinel is alive. Unhealthy/unresponsive nodes are skipped, which increases the chance of hitting a Sentinel node that is able to properly reply with the master address.
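For a Spring Boot application, the same option can be wired in through Spring Data Redis. A rough sketch, assuming spring-boot-starter-data-redis with Lettuce (host names, ports, and the master name are illustrative):

    import java.util.Set;

    import io.lettuce.core.ClientOptions;
    import org.springframework.context.annotation.Bean;
    import org.springframework.data.redis.connection.RedisSentinelConfiguration;
    import org.springframework.data.redis.connection.lettuce.LettuceClientConfiguration;
    import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;

    @Bean
    public LettuceConnectionFactory redisConnectionFactory() {
        RedisSentinelConfiguration sentinelConfig = new RedisSentinelConfiguration(
                "mymaster", Set.of("sentinel1:26379", "sentinel2:26379", "sentinel3:26379"));

        // Apply PING-on-connect to every connection the factory creates.
        LettuceClientConfiguration clientConfig = LettuceClientConfiguration.builder()
                .clientOptions(ClientOptions.builder()
                        .pingBeforeActivateConnection(true)
                        .build())
                .build();

        return new LettuceConnectionFactory(sentinelConfig, clientConfig);
    }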
Thank you for the detailed investigation.
I will keep pingBeforeActivateConnection enabled until this is fixed.
(It seems difficult to fix properly because the mechanism is asynchronous...)