Make adaptive topology refresh better usable for failover/master-slave promotion changes #672
Adaptive triggers are an additional trigger method using runtime signals such as disconnects. Signals are processed immediately – without any delay – triggering a topology refresh. It looks like the topology view obtained at that point wasn't reflecting the newly elected master but an in-between state. The inherent problem with Redis Cluster is that it doesn't expose configuration changes through a Pub/Sub mechanism like Sentinel does, but through a binary Cluster bus protocol. Going forward, I'd recommend to:
I wonder whether it could make sense to delay adaptive topology refresh. Its primary use was to reflect slot migrations between nodes without the need to wait until the next periodic refresh. |
Btw, we were following the recommendation from #333 to use adaptive refresh triggers over periodic ones. And I like the suggestion to implement functionality for reloading partitions – it will save an application restart. |
#333 isn't saying to disable periodic triggers. Let's turn this ticket into an enhancement for adaptive triggers to make them more usable for failovers – basically either delaying refresh or scheduling subsequent runs to make sure the refresh grabs the appropriate state. This requires some conceptual design before we can implement something. |
Alright, thanks for the help! |
We have a similar problem to the original issue in this ticket, and were looking to get some more guidance and a better understanding of what the expected behavior is versus what we are seeing. We too are on Elasticache, running a clustered Redis 3.2 server with 15 shards. We had a failure of one of the shards a few days ago, and after the failed node was replaced by a new empty instance[1], we started to see errors like this in the logs. I'm happy to share the full exception message with the printed
The recovery of the failed shard took ~6 minutes, which is about par for what we've seen with our Elasticache setup. After the recovery completed, our Redis code did not recover; we continued to see the same error message. This was not the first time we'd experienced a failure, and after some debugging and reading last time, we had concluded that, while adaptive refresh triggers should have detected a rebalance of the cluster, our number of refresh attempts was too low. We had been using the default value of 5 reconnect attempts, which with the default 30-second wait in between meant we were only attempting to reconnect for 2 1/2 minutes. We thus made a configuration change to raise the number of reconnect attempts to 30, giving us 15 minutes of reconnect attempts instead. Our cluster configuration (in Scala) is below.

private def lettuceClusterTopologyRefreshOptions = {
ClusterTopologyRefreshOptions.builder()
.enableAllAdaptiveRefreshTriggers()
.refreshTriggersReconnectAttempts(30)
.build()
}
private def lettuceClusterClientOptions = {
ClusterClientOptions.builder()
.validateClusterNodeMembership(false)
.topologyRefreshOptions(lettuceClusterTopologyRefreshOptions)
.build()
}

Since the most recent failure took ~5-6 minutes, it was unexpected that we did not see recovery afterwards. My understanding is that our code should be configured to attempt to reconnect for 15 minutes after the failure due to the adaptive refresh triggers. After reading this ticket, however, it seems that perhaps the adaptive refresh triggers might not actually fire in situations like this? Or perhaps the adaptive refresh did fire, but caught an "in between" state and then never refreshed again? It's confusing to me why we saw errors about unknown partitions for certain keys, yet adaptive triggers didn't fire to refresh the cluster membership. I only have a limited understanding of the expected behavior here, so I may just not be understanding things correctly.

[1] We do not run replicas, as our cluster is used as an ephemeral cache. So when failed nodes are replaced, they are replaced with an empty node. |
Thanks for your detailed comment. You're tackling two things:
Adaptive topology refresh is intended to help in cases where slots are migrated between nodes. Redis outages that last for a longer time aren't covered by adaptive topology refresh. From what I understood, I'd assume that the failed node was excluded from the topology upon topology retrieval and the topology wasn't updated after that. Redis Cluster communicates its state changes over an internal, binary bus protocol that usually isn't reachable from outside. What Redis Cluster is missing is a Sentinel-like facility that actively communicates changes in the topology. The only viable option is to actively poll Redis/infrastructure details. The same failure with Redis Sentinel would refresh the topology as soon as your node is back, since Sentinel communicates state changes via Pub/Sub. I see the following options here:
|
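For illustration, a minimal sketch of combining periodic refresh (to cover longer outages) with adaptive triggers, assuming Lettuce 5.x package names (io.lettuce.core; 4.x uses com.lambdaworks.redis) and an example 60-second period:

import java.time.Duration
import io.lettuce.core.cluster.{ClusterClientOptions, ClusterTopologyRefreshOptions, RedisClusterClient}

object CombinedRefresh {
  def createClient(uri: String): RedisClusterClient = {
    // Periodic refresh eventually picks up changes that adaptive triggers miss,
    // while adaptive triggers still react quickly to disconnects and redirects.
    val refreshOptions = ClusterTopologyRefreshOptions.builder()
      .enablePeriodicRefresh(Duration.ofSeconds(60)) // example interval, tune per environment
      .enableAllAdaptiveRefreshTriggers()
      .build()

    val client = RedisClusterClient.create(uri)
    client.setOptions(ClusterClientOptions.builder()
      .topologyRefreshOptions(refreshOptions)
      .build())
    client
  }
}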
@mp911de thank you for your detailed reply, and for your patience waiting for mine. Things have been busy on my end. We will try adding the periodic refresh to see if that helps. I was wondering if you could elaborate on one thing you said... I am not super familiar with the ops side of how Elasticache works, but my understanding is that given our cluster setup - 15 shards, no backups or replicas (since all data is ephemeral) - when a node fails, it is replaced with a new empty node at the same DNS name and IP (we run in a VPC). So when you say:
My understanding is that, given the topology config I pasted a sample of above, our client should try for 15 minutes to reconnect to a failed node. I don't think the failed node (in terms of a hostname/IP) is ever excluded from the cluster (I may be wrong). And since the Elasticache node was replaced within 5 minutes, once it is replaced, the client should reconnect and hopefully recover. To ask more directly, I assumed that |
No worries, all good.
Me neither. I'm not working with Elasticache or Microsoft Azure's Redis on a regular basis.
That's not the case; however, you touch on a point that's worth improving. Lettuce currently only reacts to Redis responses such as |
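For illustration, the individual adaptive triggers are selectable on the builder; a minimal sketch enabling only the redirect- and reconnect-based triggers (Lettuce 5.x package names assumed; the constants shown are the library's RefreshTrigger values):

import io.lettuce.core.cluster.ClusterTopologyRefreshOptions
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions.RefreshTrigger

object SelectiveTriggers {
  // React to MOVED/ASK redirects and to persistent reconnect attempts only.
  val options: ClusterTopologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
    .enableAdaptiveRefreshTrigger(
      RefreshTrigger.MOVED_REDIRECT,
      RefreshTrigger.ASK_REDIRECT,
      RefreshTrigger.PERSISTENT_RECONNECTS)
    .build()
}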
I think we could reuse |
I think I have the same issue. Sorry if this is wrong and I'm hijacking this thread.
(Lettuce 4.4.6, spring-data-redis 1.8.14) Now I see a lot of test failures happening and the log file is filled with exceptions like
BTW: If I didn't set the cluster options as above, I was seeing even more errors. FTR: I had set up the same test case with Jedis and only saw a couple of test failures and far fewer exceptions in the log. So my main issues are these:
|
@kutzi regarding Spring Data and … The difference between Jedis and Lettuce from this perspective is that Spring Data Redis caches the topology for Jedis for 100ms and refreshes it on access. For Lettuce, Spring Data Redis uses … What is the average time between an adaptive refresh signal (can be found as |
I found out that my workarounds to hack the ClusterClientOptions into the RedisClient of Spring's LettuceConnectionFactory didn't work. How can I find out from the logs when the topology has been reconfigured? (FTR: I forgot to mention in my previous comment that I'm using Spring RetryTemplate to retry jedis/lettuce operations.) |
Regarding spring-data-redis: https://docs.spring.io/spring-data/redis/docs/current/api/org/springframework/data/redis/connection/lettuce/LettuceClientConfiguration.html is only available since spring-data-redis 2.x - I'm on 1.8. |
Adaptive refresh limits the number of refresh operations using a timeout (
It sounds as if it would make sense to add debug logs when a topology refresh starts/finishes. |
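A minimal sketch of the rate-limiting timeout referred to above, assuming it corresponds to the builder's adaptiveRefreshTriggersTimeout setting (Lettuce 5.x API; the 30-second value is an example):

import java.time.Duration
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions

object RateLimitedAdaptiveRefresh {
  // After one adaptive refresh, further adaptive refreshes are suppressed
  // until this timeout has elapsed.
  val options: ClusterTopologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
    .enableAllAdaptiveRefreshTriggers()
    .adaptiveRefreshTriggersTimeout(Duration.ofSeconds(30)) // example value
    .build()
}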
Hi, we experienced the same problem with a Redis Cluster made of 3 masters with 2 replicas each (9 nodes in total). To test this we have used a feature provided by AWS (master failover) and a local test.

Configuration

val topologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
.enablePeriodicRefresh(java.time.Duration.ofSeconds(10))
.enableAllAdaptiveRefreshTriggers()
.build()
val client = RedisClusterClient.create(s"redis://$host:$port")
client.setOptions(ClusterClientOptions.builder()
  .topologyRefreshOptions(topologyRefreshOptions)
  .autoReconnect(true)
  .build())

Issue Possible Cause Possible Solution Example |
In your example, @alessandrosimi-sa, how long did it take until the replica node became a master (i.e. the time between the adaptive refresh trigger running and the replica/master promotion)? |
Thanks for your answer @mp911de. I had the feeling

16:36:28.224 INFO TestRedis - stop the master with port [8079]
16:36:28.397 INFO ConnectionWatchdog - Reconnecting, last destination was /127.0.0.1:8079
...
16:36:28.406 WARN ClusterTopologyRefresh - Unable to connect to 127.0.0.1:8079
Partitions [... RedisClusterNodeSnapshot [uri=RedisURI [host='127.0.0.1', port=8079], ... connected=false, .. ]]
...
16:36:38.404 WARN ClusterTopologyRefresh - Unable to connect to 127.0.0.1:8079
Partitions [... RedisClusterNodeSnapshot [port=8082, ... flags=[MASTER]], ... , RedisClusterNodeSnapshot [port=8078, ... flags=[MASTER]], RedisClusterNodeSnapshot [port=8085, flags=[MASTER]], ... , RedisClusterNodeSnapshot [port=8079, flags=[MASTER, FAIL]]]

From the moment the test is forcing one master to fail

The scenario I am testing is a client that should not drop pending commands using the |
This ticket is about adaptive refresh triggers and a possible delay between the time we refresh the topology and the time when Redis changes its topology, not about periodic refresh.
Care to elaborate on what you mean or what behavior you experience? Commands are routed to a node connection (e.g. write commands are written to a master connection). If the connection gets disconnected, commands are buffered until the connection comes back up. One of the following scenarios is then possible:
|
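As a side note on the buffering described above, the behavior while a connection is down is configurable; a minimal sketch (the REJECT_COMMANDS choice is only an example, not a recommendation from this thread; Lettuce 5.x package names assumed):

import io.lettuce.core.ClientOptions
import io.lettuce.core.cluster.ClusterClientOptions

object DisconnectedBehaviorExample {
  // By default commands issued during a disconnect are buffered and replayed on
  // reconnect; REJECT_COMMANDS fails them immediately instead.
  val options = ClusterClientOptions.builder()
    .autoReconnect(true)
    .disconnectedBehavior(ClientOptions.DisconnectedBehavior.REJECT_COMMANDS) // example choice
    .build()
}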
Scenario Issue |
We just experienced this. We are using adaptive refresh triggers. In our case, the master host died and, as expected, a slave was promoted to master. However, the Lettuce client didn't detect this and all subsequent queries for keys in that slot range failed with the "com.lambdaworks.redis.RedisException: Cannot determine a partition to read for slot 15234" message. Our hypothesis:
|
Lettuce does not know when a failover is completed or whether a failover should take place at all. There are a couple of approaches, and none of them is ideal:
I'm not sure how to proceed with this issue. Would some callback help so the application can trigger a topology refresh upon a specific event? |
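For illustration of the callback idea, an application can already force a refresh from its own failover detection; a rough sketch (the onFailoverDetected hook is hypothetical):

import io.lettuce.core.cluster.RedisClusterClient

object FailoverHook {
  // Hypothetical hook: called by the application's own monitoring once it
  // believes a failover has completed.
  def onFailoverDetected(client: RedisClusterClient): Unit = {
    client.reloadPartitions() // re-reads the topology from the seed nodes
  }
}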
"Trigger refresh on This might be the best option, if it includes configuration options such as:
The idea here is that the time it takes for the master election to occur and slot coverage to be re-enabled might vary based on a variety of factors, so these configuration options could be configured based on a given client's environment/situation. |
We have a timeout setting to prevent recurring refreshes at a high rate to limit refreshes to e.g. once every 30 seconds. Is this what you were talking about? |
I was suggesting this in the context of executing a refresh in the presence of the "Cannot determine a partition" exception. Regardless, we are looking at tweaking the refresh logic for our situation and will let you know if we develop any refresh logic improvements. |
Over time, this issue has accumulated two problems:
We will introduce a new trigger for problem 2 to trigger a topology update when Lettuce cannot determine a partition to read from/write to. Problem 1 is harder to solve. In most cases where Redis is running as a direct service in a VM/on bare metal, increasing the disconnect attempt threshold to a higher value can be a good approach, e.g. by raising the value from today's default of 5. For orchestrated scenarios in which Redis is restarted immediately after a failure is discovered (e.g. Kubernetes), this assumption needs careful inspection as to whether it still holds true or whether a node is spun up with the same IP but a different role. If a reconnect succeeds, then we can no longer derive an update trigger from it. I created #952 so setups that might require a delay between the adaptive refresh trigger and the actual topology change performed by Redis can consume events and add their own delay on top of Lettuce. |
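A rough sketch of the "consume events and add your own delay" idea, assuming Lettuce 5.x: subscribe to the client's EventBus and schedule a delayed partition reload when a disconnect event is observed (using DisconnectedEvent as the signal and the 10-second delay are assumptions for illustration):

import java.time.Duration
import io.lettuce.core.cluster.RedisClusterClient
import io.lettuce.core.event.connection.DisconnectedEvent

object DelayedRefreshOnDisconnect {
  def install(client: RedisClusterClient): Unit = {
    client.getResources.eventBus().get()
      .filter(_.isInstanceOf[DisconnectedEvent])
      // give the cluster time to finish the master election before re-reading
      // the topology; the delay is an assumed example value
      .delayElements(Duration.ofSeconds(10))
      .subscribe(_ => client.reloadPartitions()) // blocking call; offload it in real code
  }
}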
Lettuce now listens to events raised from routing requests to uncovered slots. Command dispatch for read/write commands that terminates with PartitionSelectorException (Cannot determine a partition …) is now the trigger for uncovered slot events. Uncovered slots can be a late indicator for a topology change in which a number of commands fail before the topology is updated to recover operations.
|
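A minimal sketch of opting into this, assuming a Lettuce version that ships the uncovered-slot trigger; enabling all adaptive triggers picks it up without naming it explicitly:

import io.lettuce.core.cluster.ClusterTopologyRefreshOptions

object UncoveredSlotAwareRefresh {
  // enableAllAdaptiveRefreshTriggers() enables every RefreshTrigger the running
  // Lettuce version knows about, including the new uncovered-slot trigger.
  val options: ClusterTopologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
    .enableAllAdaptiveRefreshTriggers()
    .build()
}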
Thank you for implementing this change! We will try it out and let you know how it goes. |
Hi @mp911de, I noticed that the Lettuce 5.2 release was delayed from June to September. This enhancement is critical to us; do you have any update on the release, or a suggestion on how to handle the uncovered slot exception? I'd also like to know how many resources periodic refresh consumes for a cluster of size 48 with 144 nodes, or any best practice on adding periodic refresh. Thanks! |
Upstream dependencies (Project Reactor) have caused a delay in release dates. If the release is critical to you, then feel free to use snapshots for the time being, or release that artifact to your own Maven repository if you have one. The only kind of changes we expect until the GA release are bugfixes. |
I have a Redis cluster with 3 shards. Each shard has 2 nodes, 1 primary and 1 replica. I'm using Lettuce 4.3.2.Final, and the following is the configuration I'm using to create the Redis client.
Inside
SlaveReadingLettuceClusterConnection
So I have all adaptive refresh triggers enabled, and I'm not specifying any periodic refresh for the topology. We recently had an issue where one of the primary nodes in a shard of the cluster had a problem, which triggered a failover. The shard had two nodes, 001 (primary) and 002 (replica). 001 failed over, and 002 became primary. When 001 recovered, it became a replica. My assumption was that the adaptive refresh triggers would kick in and update the topology upon recovery. That didn't happen; we extracted the partitions/topology of the Redis client that was being printed in exceptions.
4 out of 6 nodes above are marked as master, while there were only 3. So in the troubled shard there were two nodes, and both were recognized as primary by the Redis client. Since we had configured its read policy as SLAVE, it was throwing the exception
Cannot determine a partition to read for slot ****
. Even though one node had recovered and become a replica, the topology had not refreshed.

P.S. We are on an AWS setup, so the Redis cluster was AWS Elasticache and our application was deployed on AWS Elastic Beanstalk (Java, Tomcat stack). The EB environment had 15 EC2 instances configured behind an elastic load balancer, and we faced the issue on only 2 of the EC2 instances.
The quick fix we applied was to update to Lettuce 4.4.1 and use the read policy SLAVE_PREFERRED. But we are not sure why the adaptive refresh triggers didn't work.
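For reference, a minimal sketch of applying the read policy on a cluster connection (Lettuce 5.x package names assumed; the 4.x equivalents live under com.lambdaworks.redis, and the URI is a placeholder):

import io.lettuce.core.ReadFrom
import io.lettuce.core.cluster.RedisClusterClient

object ReadPolicyExample {
  val client = RedisClusterClient.create("redis://localhost:7000") // placeholder URI
  val connection = client.connect()
  // SLAVE_PREFERRED reads from a replica when one is available and falls back to
  // the master otherwise, which avoids "Cannot determine a partition to read"
  // errors while a shard temporarily has no readable replica.
  connection.setReadFrom(ReadFrom.SLAVE_PREFERRED)
}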