Correct expectations for cluster connection/command failover. #1757
-
Hi. We're developing a Redis-compatible cluster backend and are using Lettuce and Spring for some integration testing. In one particular case I'm trying to simulate a server failure to understand the abilities and limitations of Lettuce when failing over in this scenario. The idea is that the setup should closely mimic a web application; instead of testing a full web application, I'm creating a smaller harness. When a backend server fails, the other servers in the cluster will automatically start hosting the slots of the departed server.
During the test, a server fails while commands are being performed against the cluster.
I appreciate that this is a broad question, but I would like to understand the guarantees that Lettuce makes around commands/servers failing, and its ability to recover and retry (or not) commands that are currently failing and even those that are currently pending. Is there a succinct way to reason about this? For example, can one assume that failed idempotent operations will always be retried but failed non-idempotent operations will not be? Taking this up a level: if I'm developing a web application (say with Spring Boot, Spring Session, and Lettuce) backed by a Redis cluster, do I have to provide retry logic at the application level for failed operations, or will one of those components automatically handle it? (Here I'm constraining 'failed operations' to mean session-specific operations.) Apologies if this is a bit vague, but any insights and pointers would be greatly appreciated. Thanks!
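To make the question concrete: by 'retry logic at the application level' I mean something like this hypothetical helper wrapped around session operations (the names and structure are just illustrative, not code we actually have):

```java
import java.util.concurrent.Callable;

public class RetryExample {

    // Hypothetical application-level retry helper: attempt an idempotent
    // operation a fixed number of times, rethrowing the last failure.
    static <T> T retry(Callable<T> operation, int attempts) throws Exception {
        Exception last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return operation.call();
            } catch (Exception e) {
                last = e;
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Simulate an operation that fails twice and succeeds on the third try.
        int[] calls = {0};
        String result = retry(() -> {
            if (++calls[0] < 3) {
                throw new RuntimeException("simulated I/O failure");
            }
            return "OK";
        }, 5);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The question is whether Lettuce (or Spring Session) already does something equivalent internally for session operations, or whether this belongs in my code.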
-
Failover is quite a broad topic, so let me add a bit of my perspective here.
Redis Cluster doesn't provide an active notification mechanism for cluster reconfiguration; instead, a client can either poll the cluster topology or react to certain events. Topology polling is available in Lettuce through periodic topology refresh. Other events are modeled as adaptive refresh triggers.
We generally assume that if a node is down, it will eventually come back again (because it has crashed or there is a network partition). We do not assume that it was removed from the cluster in the first place. Therefore, commands sent to a node (either by slot routing or because they were manually routed there) stick with the target node until it either comes back online or gets removed from the topology. If a node comes back up, buffered commands (that didn't time out yet) are retried on the same node.
If a node gets removed from the cluster, then there's a subtle difference in command handling. If a command was sent to a node using its node id and the node gets removed, the command fails. If a command was sent to the node using host/port (the default mechanism for command routing) and the node gets removed, then we retry/resend the command through the command routing to potentially hit a different server, because we assume that the slots are now served by another node.
Note that Lettuce command retries are driven by I/O problems only. If a command fails because of a Redis error response, then the command lifecycle was still completed successfully, as it has received a response.
Let me know whether that helps and whether you want to discuss further aspects.
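For reference, the periodic and adaptive refresh mechanisms mentioned above are configured through `ClusterTopologyRefreshOptions`. A minimal configuration sketch (the refresh period and the URI are placeholder values):

```java
import java.time.Duration;

import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import io.lettuce.core.cluster.RedisClusterClient;

public class TopologyRefreshConfig {

    public static void main(String[] args) {
        ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
                // Poll the cluster topology every 30 seconds (periodic refresh).
                .enablePeriodicRefresh(Duration.ofSeconds(30))
                // Also react to events such as MOVED/ASK redirects and
                // persistent reconnect attempts (adaptive refresh triggers).
                .enableAllAdaptiveRefreshTriggers()
                .build();

        RedisClusterClient clusterClient = RedisClusterClient
                .create(RedisURI.create("redis://localhost:7000"));
        clusterClient.setOptions(ClusterClientOptions.builder()
                .topologyRefreshOptions(refreshOptions)
                .build());

        // ... connect and use the client here ...

        clusterClient.shutdown();
    }
}
```

Without these options, the client discovers topology changes only when it happens to reconnect, so enabling at least the adaptive triggers is usually what you want for failover tests.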
-
I'm performing a simpler test using only Lettuce. The test crashes a node and then attempts to perform a command against it.
I was expecting the command to be retried. However, I see that Lettuce continues to attempt to contact the failed node.
At some point it gives up, performs a topology refresh, and then an exception is thrown
and the command is not retried.
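For context, the test is roughly shaped like the sketch below (the node address, timeout, and key are assumptions, not my exact code). The command timeout here bounds how long Lettuce will keep a command buffered for retry:

```java
import java.time.Duration;

import io.lettuce.core.RedisURI;
import io.lettuce.core.TimeoutOptions;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.api.StatefulRedisClusterConnection;

public class FailoverTestSketch {

    public static void main(String[] args) {
        RedisClusterClient client = RedisClusterClient
                .create(RedisURI.create("redis://localhost:7000"));

        client.setOptions(ClusterClientOptions.builder()
                .topologyRefreshOptions(ClusterTopologyRefreshOptions.builder()
                        .enableAllAdaptiveRefreshTriggers()
                        .build())
                // Commands buffered during a disconnect fail once this elapses.
                .timeoutOptions(TimeoutOptions.enabled(Duration.ofSeconds(10)))
                .build());

        StatefulRedisClusterConnection<String, String> connection = client.connect();

        // Crash the node owning the key's slot here, then issue the command.
        // Lettuce keeps trying the original node until the command times out
        // or a topology refresh removes the node.
        connection.sync().set("key", "value");

        connection.close();
        client.shutdown();
    }
}
```

My question is whether, after the topology refresh removes the node, a still-pending command like this should be rerouted to the node now owning the slot, or whether it is expected to fail.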