
Make adaptive topology refresh better usable for failover/master-slave promotion changes #672

Closed
mudasirale opened this issue Dec 20, 2017 · 29 comments
Labels
type: feature A new feature
Milestone

Comments

@mudasirale

I have a Redis cluster with 3 shards. Each shard has 2 nodes, 1 primary and 1 replica. I'm using Lettuce 4.3.2.Final, and the following is the configuration I'm using to create the Redis client.

redisClusterClient = RedisClusterClient.create(RedisURI.builder()
            .withHost(hostName)
            .withPort(port)
            .withTimeout(timeout, TimeUnit.MILLISECONDS)
            .build());

redisClusterClient.setOptions(ClusterClientOptions.builder()
        .autoReconnect(true)
        .cancelCommandsOnReconnectFailure(true)
        .disconnectedBehavior(ClientOptions.DisconnectedBehavior.REJECT_COMMANDS)
        .topologyRefreshOptions(
                ClusterTopologyRefreshOptions.builder().enableAllAdaptiveRefreshTriggers().build())
        .build());

redisClusterConnection = new SlaveReadingLettuceClusterConnection(redisClusterClient,
        enableCompression);

Inside SlaveReadingLettuceClusterConnection

StatefulRedisClusterConnection<byte[], byte[]> connection;
if (enableCompression) {
    connection = clusterClient.connect(
            CompressionCodec.valueCompressor(new ByteArrayCodec(), CompressionCodec.CompressionType.GZIP));
} else {
    connection = (StatefulRedisClusterConnection<byte[], byte[]>) super.doGetAsyncDedicatedConnection();
}
connection.setReadFrom(ReadFrom.SLAVE);
return connection;

So I have all adaptive refresh triggers enabled and am not specifying any periodic refresh trigger for the topology. We recently had an issue where one of the primary nodes in a shard of the cluster had a problem, which triggered a failover. The shard had two nodes, 001 (primary) and 002 (replica). 001 failed over, and 002 became primary. When 001 recovered, it became a replica. My assumption was that the adaptive refresh triggers would kick in and update the topology upon recovery. That didn't happen; we extracted the partitions/topology of the Redis client from what was being printed in exceptions.

Partitions [
RedisClusterNodeSnapshot [
	uri=RedisURI [host=*.*.*.*, port=****], 
	nodeId=*****************************************, 
	connected=true, 
	slaveOf='null', 
	pingSentTimestamp=0, 
	pongReceivedTimestamp=1513721289619, 
	configEpoch=0, 
	flags=[MASTER], 
	slot count=5461], 
RedisClusterNodeSnapshot [
	uri=RedisURI [host=*.*.*.*, port=****], 
	nodeId=*****************************************, 
	connected=true,
	slaveOf='*****************************************,', 
	pingSentTimestamp=0, 
	pongReceivedTimestamp=1513721290120, 
	configEpoch=2, 
	flags=[SLAVE], 
	slot count=0], 
RedisClusterNodeSnapshot [
	uri=RedisURI [host=*.*.*.*, port=****], 
	nodeId=*****************************************, 
	connected=true, 
	slaveOf='*****************************************,', 
	pingSentTimestamp=0, 
	pongReceivedTimestamp=1513721291124, 
	configEpoch=0, 
	flags=[SLAVE], 
	slot count=0],
RedisClusterNodeSnapshot [
	uri=RedisURI [host=*.*.*.*, port=****], 
	nodeId=*****************************************, 
	connected=true, 
	slaveOf='null', 
	pingSentTimestamp=0, 
	pongReceivedTimestamp=0, 
	configEpoch=2, 
	flags=[MYSELF, MASTER], 
	slot count=5462], 
RedisClusterNodeSnapshot [
	uri=RedisURI [host=*.*.*.*, port=****], 
	nodeId=*****************************************, 
	connected=true, 
	slaveOf='null', 
	pingSentTimestamp=0, 
	pongReceivedTimestamp=1513721292129, 
	configEpoch=3, 
	flags=[MASTER], 
	slot count=5461], 
RedisClusterNodeSnapshot [
	uri=RedisURI [host=*.*.*.*, port=****], 
	nodeId=*****************************************, 
	connected=true, 
	slaveOf='null', 
	pingSentTimestamp=0, 
	pongReceivedTimestamp=1513721291051, 
	configEpoch=3, 
	flags=[MASTER], 
	slot count=0]]

4 out of 6 nodes above are marked as master, while there were only 3 actual masters. So in the troubled shard there were two nodes, and both were recognized as primary by the Redis client. Since we had configured the read policy as SLAVE, it was throwing the exception Cannot determine a partition to read for slot ****. Even though one node had recovered and become a replica, the topology had not been refreshed.

P.S. We are using an AWS setup, so the Redis cluster was AWS ElastiCache and our application was deployed on AWS Elastic Beanstalk (Java, Tomcat stack). The EB environment had 15 EC2 instances configured behind an Elastic Load Balancer, and we faced the issue on only 2 of the EC2 instances.
The quick fix we applied was to update to Lettuce 4.4.1 and use the read policy SLAVE_PREFERRED, but we are not sure why the adaptive refresh triggers didn't work.

@mp911de
Collaborator

mp911de commented Dec 20, 2017

Adaptive triggers are an additional trigger mechanism that uses runtime signals such as disconnects. Signals are processed immediately – without any delay – triggering a topology refresh by calling RedisClusterClient.reloadPartitions().

It looks like the topology view obtained at that point wasn't reflecting the newly elected master but a state in between. The inherent problem with Redis Cluster is that it doesn't expose configuration changes through a Pub/Sub mechanism like Sentinel does, but through a binary Cluster bus protocol.

Going forward, I'd recommend:

  1. Enable periodic refresh. It's not a perfect solution, but it does not require application restarts.
  2. Alternatively, implement a way to update partitions by calling RedisClusterClient.reloadPartitions() as part of your ops tooling (a sketch of both options follows below).
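
A minimal sketch of both options, assuming Lettuce 4.x class names (com.lambdaworks packages) to match the configuration above; the refresh interval and the way the ops hook is exposed are illustrative choices, not prescribed by the library:

import java.util.concurrent.TimeUnit;

import com.lambdaworks.redis.cluster.ClusterClientOptions;
import com.lambdaworks.redis.cluster.ClusterTopologyRefreshOptions;
import com.lambdaworks.redis.cluster.RedisClusterClient;

public class TopologyRefreshConfig {

    // Option 1: periodic refresh in addition to the adaptive triggers.
    static ClusterClientOptions optionsWithPeriodicRefresh() {
        ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
                .enablePeriodicRefresh(60, TimeUnit.SECONDS) // interval is an illustrative choice
                .enableAllAdaptiveRefreshTriggers()
                .build();

        return ClusterClientOptions.builder()
                .topologyRefreshOptions(refreshOptions)
                .build();
    }

    // Option 2: a hook (e.g. wired to an ops/admin endpoint) that forces a refresh on demand.
    static void forceTopologyRefresh(RedisClusterClient clusterClient) {
        clusterClient.reloadPartitions();
    }
}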

I wonder whether it could make sense to delay adaptive topology refresh. Its primary use was to reflect slot migrations between nodes without the need to wait until the next periodic refresh.

@mudasirale
Author

Btw, we were following the recommendation from #333 to use adaptive refresh triggers over the periodic one.

And I like the suggestion to implement functionality for reloading partitions; it will save an application restart.

@mp911de
Collaborator

mp911de commented Dec 20, 2017

#333 isn't saying to disable periodic triggers. Let's turn this ticket into an enhancement for adaptive triggers to make them better usable for failovers – basically either delaying the refresh or scheduling subsequent runs to make sure the refresh grabs the appropriate state. This requires some conceptual design before we can implement something.

@mp911de mp911de added the type: feature A new feature label Dec 20, 2017
@mp911de mp911de changed the title Clustered Redis client doesn't refresh topology on primary node fail-over Make adaptive topology refresh better usable for failover/master-slave promotion changes Dec 20, 2017
@mudasirale
Author

Alright, thanks for the help!

@Fluxx

Fluxx commented Feb 5, 2018

We have a problem similar to the original issue in this ticket, and I was hoping to get some more guidance and understanding of what the expected behavior is versus what we are seeing. We too are on ElastiCache, running a clustered Redis 3.2 server with 15 shards.

We had a failure of one of the shards a few days ago, and after the failed node was replaced by a new empty instance[1], we started to see errors like this in the logs. I'm happy to share the full exception message with the printed Partitions: value if desired:

com.lambdaworks.redis.RedisException: Cannot determine a partition to read for slot 12638

The recovery of the failed shard took ~6 minutes, which is about par for what we've seen with our ElastiCache setup. After the recovery completed, our Redis code did not recover; we continued to see the same error message.

This was not the first time we'd experienced a failure. After some debugging and reading last time, we had concluded that, while the adaptive refresh triggers should have detected a rebalance of the cluster, we had configured too low a number of refresh attempts. We had been using the default value of 5 reconnect attempts, which, with the default 30-second wait in between, meant we were only attempting to reconnect for 2 1/2 minutes. We thus made a configuration change to raise the number of reconnect attempts to 30, giving us 15 minutes of reconnect attempts instead.

Our cluster configuration (in Scala) is below.

  private def lettuceClusterTopologyRefreshOptions = {
    ClusterTopologyRefreshOptions.builder()
      .enableAllAdaptiveRefreshTriggers()
      .refreshTriggersReconnectAttempts(30)
      .build()
  }

  private def lettuceClusterClientOptions = {
    ClusterClientOptions.builder()
      .validateClusterNodeMembership(false)
      .topologyRefreshOptions(lettuceClusterTopologyRefreshOptions)
      .build()
  }

Since the most recent failure took only ~5-6 minutes, it was unexpected that we did not see recovery. My understanding is that our code should be configured to attempt to reconnect for 15 minutes after a failure due to the adaptive refresh triggers.

After reading this ticket, however, it seems that perhaps the adaptive refresh triggers might not actually fire in situations like this? Or perhaps the adaptive refresh did fire, but caught an "in between" state and then never refreshed again? It's confusing to me why we saw errors about unknown partitions for certain keys, yet the adaptive triggers didn't fire to refresh the cluster membership. I only have a limited understanding of the expected behavior here, so I may just not be understanding things correctly.

[1] We do not run replicas, as our cluster is used as an ephemeral cache. So when failed nodes are replaced they are replaced with an empty node.

@mp911de
Collaborator

mp911de commented Feb 6, 2018

Thanks for your detailed comment. You're tackling two things:

  1. Short-time failure recovery
  2. Long-time failure recovery

Adaptive topology refresh is intended to help in cases where slots are migrated between nodes (ASK, MOVED redirection) and in cases where a host goes down after its replacement is integrated into the cluster.

Redis outages that last for a longer time aren't covered by adaptive topology refresh. From what I understand, I'd assume that the failed node was excluded from the topology upon topology retrieval and the topology wasn't updated after that.

Redis Cluster communicates its state changes over an internal, binary bus protocol that usually isn't reachable from outside.

What Redis Cluster is missing is a Sentinel-like facility that actively communicates changes in the topology. The only viable option is to actively poll Redis/infrastructure details. The same failure with Redis Sentinel would refresh the topology as soon as your node is back, because Sentinel communicates state changes via Pub/Sub.

I see the following options here:

  1. Request a change in Redis to provide Sentinel-like topology change communication – this might be hard to get
  2. Enable periodic topology refresh – generates some load but is guaranteed to recover from an outage
  3. Add code that PINGs the known cluster nodes and, if it discovers a state change, calls RedisClusterClient.reloadPartitions() (a rough sketch follows below).
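
A rough sketch of option 3, assuming a Lettuce 4.x cluster client and connection; the scheduling, the interval and the "refresh on any failed PING" policy are illustrative choices, not library behavior:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import com.lambdaworks.redis.cluster.RedisClusterClient;
import com.lambdaworks.redis.cluster.api.StatefulRedisClusterConnection;
import com.lambdaworks.redis.cluster.models.partitions.RedisClusterNode;

class TopologyHealthCheck {

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    void start(RedisClusterClient clusterClient, StatefulRedisClusterConnection<String, String> connection) {
        scheduler.scheduleAtFixedRate(() -> {
            boolean refreshNeeded = false;
            for (RedisClusterNode node : clusterClient.getPartitions()) {
                try {
                    // PING each node we currently know about through its node-specific connection.
                    connection.getConnection(node.getNodeId()).sync().ping();
                } catch (RuntimeException e) {
                    // A node we believe to be part of the cluster is unreachable:
                    // treat this as a hint that the topology may have changed.
                    refreshNeeded = true;
                }
            }
            if (refreshNeeded) {
                clusterClient.reloadPartitions();
            }
        }, 10, 10, TimeUnit.SECONDS); // check interval is an illustrative choice
    }
}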

@Fluxx

Fluxx commented Feb 8, 2018

@mp911de thank you for your detailed reply, and patience with my reply. Things have been busy on my end.

We will try adding the periodic refresh to see if that helps.

I was wondering if you could elaborate on one thing you said...

I am not super familiar with the ops for how Elasticache works, but my understanding is that given our cluster setup - 15 shards, no backups or replicas (since all data is ephemeral) - when a node fails, it is replaced with a new empty node at the same DNS name and IP (we run in a VPC). So when you say:

Redis outages that last for a longer time aren't covered by adaptive topology refresh. From what I understood, I'd assume that the failed node was excluded from the topology upon topology retrieval and since then the topology wasn't updated anymore.

My understanding is that, given the topology config I pasted a sample of above, our client should try for 15 minutes to reconnect to a failed node. I don't think the failed node (in terms of hostname/IP) is ever excluded from the cluster (I may be wrong). And since the ElastiCache node was replaced within 5 minutes, once it is replaced, the client should reconnect and hopefully recover.

To ask more directly, I assumed that Cannot determine a partition to read for slot exceptions would trigger an adaptive refresh - but that may be mistaken?

@mp911de
Collaborator

mp911de commented Feb 9, 2018

@mp911de thank you for your detailed reply, and patience with my reply. Things have been busy on my end.

No worries, all good.

I am not super familiar with the ops for how Elasticache works

Me neither. I'm not working with Elasticache or Microsoft Azure's Redis on a regular basis.

To ask more directly, I assumed that Cannot determine a partition to read for slot exceptions would trigger an adaptive refresh - but that may be mistaken?

That's not the case; however, you raise a point that's worth improving. Lettuce currently only reacts to Redis responses such as MOVED/ASK errors (redirections). We could add another adaptive trigger to catch that case and facilitate self-healing.

@mp911de
Collaborator

mp911de commented Jul 2, 2018

I think we could reuse Delay within ClusterTopologyRefreshOptions for now. Delay isn't a perfect fit as it also accepts an attempt counter, but for now we can leave that at zero, so ClusterTopologyRefreshScheduler would call Delay.createDelay(0) and use the result to delay the actual refresh task. ClusterTopologyRefreshOptions.delay would default to a constant zero delay to retain the default behavior.

@mp911de mp911de added this to the Lettuce 5.1.0 milestone Jul 2, 2018
@kutzi

kutzi commented Aug 22, 2018

I think I have the same issue. Sorry if this is wrong and I'm hijacking this thread.
I have a Redis cluster with 3 masters + 4 slaves.
I've set up a test bed to run some integration tests against our webshop (using the Redis cluster as a session store). Then I periodically trigger a failover to a cluster slave and, when that's finished, kill the former master node. Because this is running on Kubernetes, the node is immediately replaced with a new pod and automatically reintegrated into the cluster as a slave.
I'm using this Lettuce config:

    ClusterTopologyRefreshOptions topologyRefreshOptions =
      ClusterTopologyRefreshOptions.builder()
        .enablePeriodicRefresh(60, TimeUnit.SECONDS)
        .enableAllAdaptiveRefreshTriggers()
        .closeStaleConnections(true)
        .build();

    ClusterClientOptions clusterClientOptions = ClusterClientOptions.builder()
      .topologyRefreshOptions(topologyRefreshOptions)
      .validateClusterNodeMembership(false)
      .build();

(Lettuce 4.4.6, spring-data-redis 1.8.14)

Now I see a lot of test failures happening and the log file is filled with exceptions like

Caused by: com.lambdaworks.redis.RedisConnectionException: Unable to connect to 172.66.0.30:6379
	at com.lambdaworks.redis.RedisConnectionException.create(RedisConnectionException.java:56) ~[lettuce-4.4.6.Final.jar:na]
	at com.lambdaworks.redis.cluster.PooledClusterConnectionProvider.lambda$getConnectionAsync$7(PooledClusterConnectionProvider.java:341) ~[lettuce-4.4.6.Final.jar:na]
	at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822) ~[na:1.8.0_162]

BTW: if I didn't set the cluster options as above, I saw even more errors.

FTR: I had set up the same test case with Jedis and only saw a couple of test failures and far fewer exceptions in the log.
I was trying to migrate from Jedis to Lettuce, because Lettuce is the default in upcoming spring-data-redis versions AFAIK, but this would be a deal breaker.

So my main issues are these:

  • the default configuration for clusters in Lettuce is totally unusable for the failover case
  • even with the recommended (?) options set, I get a lot of errors – far more than with Jedis
  • with spring-data-redis it's practically impossible to change the ClusterClientOptions, unless one hacks them in via reflection (probably a spring-data issue and not a Lettuce issue)

@mp911de
Collaborator

mp911de commented Aug 22, 2018

@kutzi regarding Spring Data and ClusterClientOptions: Please file a new issue at https://jira.spring.io/browse/DATAREDIS to allow setting ClusterClientOptions if this does not work through LettuceClientConfiguration.

The difference between Jedis and Lettuce from this perspective is that Spring Data Redis caches the topology for Jedis for 100ms and refreshes it on access. For Lettuce, Spring Data Redis uses Partitions which is refreshed periodically by Lettuce itself.

What is the average time between an adaptive refresh signal (can be found as DEBUG log event of io.lettuce.core.cluster.ClusterTopologyRefreshScheduler) and the finally reconfigured new topology?

@kutzi

kutzi commented Aug 23, 2018

I've set that logger to debug (or rather com.lambdaworks.redis.cluster.ClusterTopologyRefreshScheduler, as I'm on Lettuce 4.x), but cannot see any log output from it 🤔

I found out that my workarounds to hack the ClusterClientOptions into the RedisClient of Spring's LettuceConnectionFactory didn't work.
After fixing that, I do now get logs, but the number of errors didn't seem to decrease.

How can I find out from the logs when the topology has been reconfigured?
I can only find a lot of ClusterTopologyRefreshScheduler: ClusterTopologyRefreshScheduler.indicateTopologyRefreshSignal()
sometimes followed by
ClusterTopologyRefreshScheduler: ClusterTopologyRefreshTask requesting partitions from [RedisURI [host='172.66.0.32', port=6379], RedisURI [host='172.66.0.33' ...
usually in the same ms

(FTR: I forgot to mention in my previous comment that I'm using Spring's RetryTemplate to retry Jedis/Lettuce operations.)

@kutzi

kutzi commented Aug 23, 2018

Regarding spring-data-redis: https://docs.spring.io/spring-data/redis/docs/current/api/org/springframework/data/redis/connection/lettuce/LettuceClientConfiguration.html exists only since spring-data-redis 2.x – I'm on 1.8.
So I guess they have fixed that issue in 2.x.
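
For reference, a minimal sketch of wiring ClusterClientOptions through LettuceClientConfiguration on Spring Data Redis 2.x (which uses Lettuce 5, io.lettuce packages); the surrounding factory method is illustrative:

import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;

import org.springframework.data.redis.connection.RedisClusterConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceClientConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;

class LettuceFactoryConfig {

    LettuceConnectionFactory connectionFactory(RedisClusterConfiguration clusterConfiguration) {
        ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
                .enableAllAdaptiveRefreshTriggers()
                .build();

        // ClusterClientOptions extends ClientOptions, so it can be passed directly.
        LettuceClientConfiguration clientConfiguration = LettuceClientConfiguration.builder()
                .clientOptions(ClusterClientOptions.builder()
                        .topologyRefreshOptions(refreshOptions)
                        .build())
                .build();

        return new LettuceConnectionFactory(clusterConfiguration, clientConfiguration);
    }
}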

@mp911de
Collaborator

mp911de commented Aug 25, 2018

Adaptive refresh limits the number of refresh operations using a timeout (ClusterTopologyRefreshOptions.getAdaptiveRefreshTimeout()) to prevent excessive refreshes; that's why you only sometimes see requesting partitions from in your logs.

requesting partitions from signals that a topology refresh was triggered by the adaptive refresh.

It sounds as if it would make sense to add debug logs when a topology refresh starts/finishes.

@alessandrosimi-sa
Contributor

Hi, we experienced the same problem with a Redis Cluster made of 3 masters with 2 replicas each (9 nodes in total). To test this we used a feature provided by AWS (master failover) as well as a local test.

Configuration
Here is the client configuration

val topologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
  .enablePeriodicRefresh(java.time.Duration.ofSeconds(10))
  .enableAllAdaptiveRefreshTriggers()
  .build()
val client = RedisClusterClient.create(s"redis://$host:$port")
client.setOptions(ClusterClientOptions.builder()
  .topologyRefreshOptions(topologyRefreshOptions)
  .autoReconnect(true)
  .build())

Issue
The client is not able to reconnect even though the replica has taken over the master role.

Possible Cause
Looking into the reconnection logic inside ReconnectionHandler, we can see that it uses the address provided by the socketAddressSupplier (inside the reconnect() method).
The socketAddressSupplier always returns the address of the failing master node. In our test we periodically check the cluster info, and we can see that the cluster has marked the node as failed and that a new master has taken over the same slot range.
The socketAddressSupplier is created by the getSocketAddressSupplier(connectionKey) method in the AbstractClusterNodeConnectionFactory class (link). The connection key has the host and port hardcoded, so the socketAddressSupplier will always return the same value.

Possible Solution
The connection key is calculated when a command is executed. This is done by looking up the hash of the command key against the current partitions, all inside the getWriteConnection(slot) method of the PooledClusterConnectionProvider class (link). A possible solution is a socketAddressSupplier that reruns the same steps (slot -> partition -> connection key -> socket address) to allow the reconnection logic to point to the right master.

Example
For more information on the type of test we conducted, refer to the following repo: https://github.com/spaceapegames/aws-redis-cluster-test
In particular, TestRedis simulates the reported issue.

@mp911de
Collaborator

mp911de commented Sep 12, 2018

SocketAddressSupplier is not intended to return different hosts based on their role but to resolve a SocketAddress for a given host and port tuple. There's a distinction between connections (including reconnect) and command routing. A reconnect handler always tries to reconnect to the same host/node whereas the command router tries to find the appropriate host to send a particular command.

In your example, Partitions contains the topology before the replica was promoted to master.

@alessandrosimi-sa how long does it take until the replica node becomes a master (i.e. the time between the adaptive refresh trigger running and the replica/master promotion)?

@alessandrosimi-sa
Contributor

Thanks for your answer @mp911de. I had the feeling SocketAddressSupplier was not suitable for returning different hosts.
When I run the test locally with an in-memory cluster, the loadPartitions() method detects the partition change very quickly. The time depends on the refresh topology interval, 10 seconds in the test.

16:36:28.224 INFO  TestRedis - stop the master with port [8079]
16:36:28.397 INFO  ConnectionWatchdog - Reconnecting, last destination was /127.0.0.1:8079
...
16:36:28.406 WARN  ClusterTopologyRefresh - Unable to connect to 127.0.0.1:8079
Partitions [... RedisClusterNodeSnapshot [uri=RedisURI [host='127.0.0.1', port=8079], ... connected=false, .. ]]
...
16:36:38.404 WARN  ClusterTopologyRefresh - Unable to connect to 127.0.0.1:8079
Partitions [... RedisClusterNodeSnapshot [port=8082, ... flags=[MASTER]], ... , RedisClusterNodeSnapshot [port=8078, ... flags=[MASTER],], RedisClusterNodeSnapshot [port=8085, flags=[MASTER]], .... , RedisClusterNodeSnapshot [port=8079, flags=[MASTER, FAIL]]]

From the moment the test forces one master to fail (TestRedis - stop the master with port [8079]) it takes 10 seconds before the ClusterTopologyRefresh detects the new master. If I reduce the refresh interval, it will be quicker.

The scenario I am testing is a client that should not drop pending commands using the disconnectedBuffer inside the DefaultEndpoint class, and ideally, when the connection is re-established, it should flush them all.
If you have a suggestion how to do that with the current library, I will be more than happy to try it.

@mp911de
Collaborator

mp911de commented Sep 13, 2018

The time depends on the refresh topology interval, 10 seconds in the test.

This ticket is about adaptive refresh triggers and a possible delay between the time we refresh the topology and the time Redis changes its topology, not about periodic refresh.

The scenario I am testing is a client that should not drop pending commands using the disconnectedBuffer inside the DefaultEndpoint

Care to elaborate on what you mean or what behavior you are experiencing?

Commands are routed to a node connection (e.g. write commands are written to a master connection). If the connection gets disconnected, commands are buffered until the connection comes back up. One of the following scenarios is then possible:

  • Role change: If the master was demoted to a replica, commands are retried. Write commands typically fail and are completed exceptionally
  • Removal from cluster: If the node was removed from the cluster, the connection is closed and cleaned up. This re-queues commands (as if they were dispatched initially) and routes commands according to their intent (read/write) to a working connection.

@alessandrosimi-sa
Contributor

alessandrosimi-sa commented Sep 13, 2018

Scenario
A master node leaves the cluster as a consequence of a failure. The Redis node stops running, and it is up to the cluster to detect this and promote one of its replicas to master.

Issue
The Lettuce client doesn't react to a master failover in a cluster.
master failover == a master fails, the cluster detects it and promotes a slave to master.
The client keeps trying to reconnect to the node that doesn't exist anymore even when both the cluster and the internal partition data object have detected the new master.

@HarryTheHawk

We just experienced this. We are using adaptive refresh triggers.

In our case, the master host died and, as expected, a slave was promoted to master. However, the Lettuce client didn't detect this, and all subsequent queries for keys in that slot range failed with the "com.lambdaworks.redis.RedisException: Cannot determine a partition to read for slot 15234" message.

Our hypothesis:

  • When the master host became unresponsive, a refresh was triggered due to PERSISTENT_RECONNECTS.
  • At the time of that refresh, failover to the new master was not yet complete; hence the client got an incomplete topology, leading to "Cannot determine a partition to read for slot" errors for live requests.
  • The failover itself should cause another trigger (either MOVED or ASK). However, it might be ignored due to the "adaptiveRefreshTriggersTimeout", during which no refreshes are invoked until the timeout expires.

@mp911de
Collaborator

mp911de commented Dec 17, 2018

Lettuce does not know when a failover is completed or whether a failover should take place at all. There are a couple of approaches, and none of them is ideal:

  • Regularly obtain topology refreshes until we see a change: this approach might work in environments where Redis promotions take some time. Increasing the refresh rate generates additional load on the cluster, and more load might not be the right answer for everyone.
  • Add a Delay as per Make adaptive topology refresh better usable for failover/master-slave promotion changes #672 (comment). Delays are an "I don't have a better option" approach with a severe guess factor.
  • Trigger a refresh on Cannot determine a partition to read for slot. The first read attempt will still fail, as the trigger happens after the command has been sent, but it is an interesting option as an adaptive trigger because it can help self-healing.

I'm not sure how to proceed with this issue. Would some callback help so the application can trigger a topology refresh upon a specific event?

@HarryTheHawk

"Trigger refresh on Cannot determine a partition to read for slot. The first read attempt will fail as the trigger happens after the fact of sending a command but an interesting option as an adaptive trigger as this trigger can help self-healing."

This might be the best option, if it includes configuration options such as:

  • Delay between refresh attempts.
  • Max refresh attempts (which, once reached, indicates/implies a critical problem beyond just a normal failover).

The idea here is that the time it takes for the master election to occur and slot coverage to be re-enabled might vary based on a variety of factors, so these configuration options could be tuned for a given client's environment/situation.

@mp911de
Collaborator

mp911de commented Dec 18, 2018

Delay between refresh attempts.

We have a timeout setting to prevent recurring refreshes at a high rate to limit refreshes to e.g. once every 30 seconds. Is this what you were talking about?

@HarryTheHawk

We have a timeout setting to prevent recurring refreshes at a high rate to limit refreshes to e.g. once every 30 seconds. Is this what you were talking about?

I was suggesting this in the context of executing a refresh in the presence of the "Cannot determine a partition to read for slot" message.

Regardless, we are looking at tweaking the refresh logic for our situation and will let you know if we indeed develop any refresh logic improvements.

@mp911de
Collaborator

mp911de commented Jan 2, 2019

Over time, this issue has accumulated two distinct problems:

  1. Redis topology changes happen after the topology has been refreshed, so the client is left with an outdated view
  2. Topology is missing slot coverage (Cannot determine a partition)

We will introduce a new trigger for problem 2 to trigger a topology update when Lettuce cannot determine a partition to read from/write to.

Problem 1 is harder to solve. In most cases where Redis is running as a direct service in a VM or on bare metal, increasing the disconnect attempt threshold can be a good approach. Raising the value from today's default of 5 to e.g. 12 gives the cluster more time (5 translates roughly to 32 milliseconds, 12 to about 4 seconds) to perform the actual replica-to-master promotion, and Lettuce gets an update trigger at a later time.
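
A minimal sketch of raising that threshold via the existing refreshTriggersReconnectAttempts(…) option (the value 12 mirrors the example above; adjust it to your environment):

ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
        .enableAllAdaptiveRefreshTriggers()
        // Raise the PERSISTENT_RECONNECTS threshold from the default of 5 to 12 so the
        // adaptive refresh fires later, giving the cluster time to finish the promotion.
        .refreshTriggersReconnectAttempts(12)
        .build();

ClusterClientOptions clientOptions = ClusterClientOptions.builder()
        .topologyRefreshOptions(refreshOptions)
        .build();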

For orchestrated scenarios in which Redis is restarted immediately after a failure is discovered (e.g. Kubernetes), this assumption needs careful inspection as to whether it still holds, or whether a node is spun up with the same IP but a different role. If a reconnect succeeds, we can no longer derive an update trigger from it.

I created #952 so that setups which might require a delay between the adaptive refresh trigger and the actual topology change performed by Redis can consume events and add their own delay on top of Lettuce.

mp911de added a commit that referenced this issue Jan 2, 2019
Lettuce now listens to events raised from routing requests to uncovered slots. Command dispatch for read/write commands that terminates with PartitionSelectorException (Cannot determine a partition …) is now the trigger for uncovered slot events.

Uncovered slots can be a late indicator of a topology change in which a number of commands fail before the topology is updated to recover operations.
@mp911de
Collaborator

mp911de commented Jan 2, 2019

RefreshTrigger.UNCOVERED_SLOT is in place now and can be enabled through ClusterTopologyRefreshOptions.
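
A minimal sketch of enabling the new trigger explicitly (Lettuce 5.2+, io.lettuce packages); enableAllAdaptiveRefreshTriggers() should also pick it up, and the other triggers listed here are an illustrative selection:

ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
        .enableAdaptiveRefreshTrigger(
                ClusterTopologyRefreshOptions.RefreshTrigger.MOVED_REDIRECT,
                ClusterTopologyRefreshOptions.RefreshTrigger.PERSISTENT_RECONNECTS,
                // Refresh when a command hits a slot without a known partition.
                ClusterTopologyRefreshOptions.RefreshTrigger.UNCOVERED_SLOT)
        .build();

ClusterClientOptions clientOptions = ClusterClientOptions.builder()
        .topologyRefreshOptions(refreshOptions)
        .build();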

@mp911de mp911de closed this as completed Jan 2, 2019
@mp911de mp911de modified the milestones: Lettuce 5.1.0, 5.2.0 Jan 2, 2019
@HarryTheHawk

Thank you for implementing this change! We will try it out and let you know how it goes.

@lujia-zhang

Hi @mp911de, I noticed that the Lettuce 5.2 release has been delayed from June to September. This enhancement is very critical to us; do you have any update on the release, or a suggestion on how to handle the uncovered slot exception? I'd also like to know how many resources periodic refresh consumes for a cluster of size 48 with 144 nodes, or any best practice on adding periodic refresh. Thanks!

@mp911de
Collaborator

mp911de commented Jul 19, 2019

Upstream dependencies (Project Reactor) have caused a delay in the release dates. If the release is critical to you, feel free to use snapshots for the time being or release the artifact to your own Maven repository if you have one. The only kind of changes we expect until the GA release are bugfixes.
