Incorrect Cluster Failover Behavior #59
Hi @aembke
Hey @neogenie, apologies for the late response. I just published 6.0.0-beta.1 to crates.io with a new implementation for pretty much this entire interface, as well as the underlying clustered connection logic. If you have any feedback please let me know.
I was able to repro this issue with some tests prior to the changes in 6.0.0-beta.1 and 6.0.0, but now it seems to be working for me with the latest changes. I'm going to close this, but feel free to reopen it if you see this issue again.
@aembke
Unfortunately the behavior is still not fail-safe:
If the connection to the master node of the cluster is lost, the library endlessly and unsuccessfully tries to reconnect to it, and it never reconfigures the cluster topology and slot assignments on the client side.
Interesting, ok. So in this scenario (if I'm reading this correctly) the unique thing is there's one node that thinks it's a master but is in a fail state and has no cluster slots assigned. Does the cluster stay in this state for a long period of time, or does it recover quickly? I added a check for this in the recent version.
I'm a bit torn on how to handle this, because in this scenario the master node is clearly in a bad state. On the other hand, all the hash slots are covered, so it is still possible for the client to work. But 1/3 of the slots aren't replicated, and certain failure modes could lead to data loss if that one master node were to go down before it had a chance to replicate the changes that occurred while in this state. Or maybe I'm misinterpreting this. Looks like this will take some more investigation.
This is a fairly common scenario for a cluster: if the master fails, Redis promotes a replica to master, so the original master remains in the cluster with a fail status. When it comes back online after some time, there is no guarantee that it will become a master again; it may remain a replica. In the current version of the library, the client either does not update the cluster configuration at all, or tries to update it over the same connection, which is no longer available. As a result, if the connection to the node holding the requested data is lost, we can never get that data again, because the library keeps persistently but unsuccessfully knocking on a node that is no longer available.
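As a rough sketch of the recovery behavior being asked for here (not the library's actual code): when the connection to a slot owner dies, the client should rebuild its slot map from any node that still answers, rather than retrying the dead node forever. The `Node` type and `fetch_cluster_slots` helper below are hypothetical stand-ins for the real topology types and a `CLUSTER SLOTS` call.

```rust
use std::collections::BTreeMap;

// Hypothetical node descriptor; a real client tracks more state than this.
#[derive(Debug, Clone)]
struct Node {
    host: String,
    port: u16,
}

// Hypothetical slot map: start of each hash-slot range -> primary that owns it.
type SlotMap = BTreeMap<u16, Node>;

// Stand-in for a real `CLUSTER SLOTS` (or `CLUSTER SHARDS`) call against `node`.
fn fetch_cluster_slots(node: &Node) -> Result<SlotMap, String> {
    Err(format!("not implemented for {}:{}", node.host, node.port))
}

// Rebuild the slot map by asking every known node in turn, instead of only
// the (possibly dead) node that owned the failed connection.
fn refresh_slot_map(known_nodes: &[Node]) -> Result<SlotMap, String> {
    for node in known_nodes {
        match fetch_cluster_slots(node) {
            Ok(map) => return Ok(map),
            // This node is unreachable or in a FAIL state; try the next one.
            Err(_) => continue,
        }
    }
    Err("no reachable cluster nodes".into())
}

fn main() {
    let nodes = vec![
        Node { host: "127.0.0.1".into(), port: 30001 },
        Node { host: "127.0.0.1".into(), port: 30002 },
    ];
    match refresh_slot_map(&nodes) {
        Ok(map) => println!("new slot map with {} ranges", map.len()),
        Err(err) => eprintln!("cluster sync failed: {err}"),
    }
}
```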
Yeah, that makes sense. I'm just trying to figure out why this doesn't repro in my test, but I think it's just a difference in my test setup. But like you said, it doesn't really matter; I can see how this would happen with a network partition, and the client should be able to handle this. I'll take a look today and update here once it's fixed.
You are absolutely right, the situation where there is only one replica and it becomes the master when the original master fails is not very good. Yes, in this case you should alert, write angry logs, and call the on-call dev-ops or SRE :)
In other words, since Redis itself handles this situation quite well while still serving users, we expect similar behavior from the client library. My solution proposed in the PR is still relevant, although it is perhaps not very elegant in terms of library architecture and needs to be adjusted.
@neogenie would you mind trying your repro again with the latest changes?
Unfortunately, no changes so far... It should be noted that my test script covers both cases.
It seems to me that the problem now is that the library has neither active nor even passive health checks for network connection liveness. It looks like there are two ways to solve the problem.
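One of those two approaches is an active health check. Purely as an illustration (not the library's implementation; the `send_ping` stub below is hypothetical): periodically send a PING on the connection and treat a missed deadline as a dead connection that should be torn down and re-synced.

```rust
use std::time::Duration;
use tokio::time::{interval, timeout};

// Hypothetical stand-in for sending PING over a specific connection and
// awaiting the reply; a real implementation would use the connection's socket.
async fn send_ping() -> Result<(), String> {
    Ok(())
}

// Periodically ping the connection; if a reply doesn't arrive within
// `deadline`, report the connection as unresponsive so the caller can drop it
// and re-sync the cluster topology.
async fn health_check_loop(period: Duration, deadline: Duration, max_checks: u32) {
    let mut ticker = interval(period);
    for _ in 0..max_checks {
        ticker.tick().await;
        match timeout(deadline, send_ping()).await {
            Ok(Ok(())) => {} // connection is healthy
            Ok(Err(err)) => {
                eprintln!("PING failed: {err}; dropping connection");
                return;
            }
            Err(_) => {
                eprintln!("PING timed out after {deadline:?}; dropping connection");
                return;
            }
        }
    }
}

#[tokio::main] // assumes tokio with the "full" feature set
async fn main() {
    health_check_loop(Duration::from_secs(2), Duration::from_millis(500), 5).await;
}
```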
Thanks for checking that, and that all makes sense. The PR this morning addresses a third scenario where node A is paused and the client sends a command to node B, but node B responds with CLUSTERDOWN. Prior to the changes this morning that would result in an error. Appreciate the continued feedback, and I'll update this when the second scenario you outline is handled.
Quick update here: the changes in #63 are written against the older major version so I had to update them for 6.0.0, but those are merged in the latest PR (#85). This adds timeouts to all the connection initialization logic. That will help with unresponsive connections if the server stops responding during connection initialization or a cluster sync operation, but it won't detect connections that stop responding while waiting on other commands. I also added a new feature flag to help with that case. Hopefully this helps. If it doesn't fix your use case please let me know.
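For context, the connection-initialization timeouts described above boil down to bounding how long the client will wait on an unresponsive server. A minimal sketch of that idea (not the library's actual code; `connect_to_node` is just a stand-in) using tokio:

```rust
use std::time::Duration;
use tokio::net::TcpStream;
use tokio::time::timeout;

// Stand-in for the real connection setup (TCP connect, AUTH, cluster sync, ...).
async fn connect_to_node(addr: &str) -> std::io::Result<TcpStream> {
    TcpStream::connect(addr).await
}

// Bound the whole initialization step so an unresponsive node cannot hang the
// client forever; surface a timeout error instead.
async fn connect_with_timeout(addr: &str, limit: Duration) -> std::io::Result<TcpStream> {
    match timeout(limit, connect_to_node(addr)).await {
        Ok(result) => result,
        Err(_) => Err(std::io::Error::new(
            std::io::ErrorKind::TimedOut,
            format!("timed out connecting to {addr}"),
        )),
    }
}

#[tokio::main] // assumes tokio with the "full" feature set
async fn main() {
    match connect_with_timeout("127.0.0.1:30001", Duration::from_secs(5)).await {
        Ok(_) => println!("connected"),
        Err(err) => eprintln!("connect failed: {err}"),
    }
}
```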
Unfortunately not working yet...

```rust
let config = RedisConfig {
    server: ServerConfig::Clustered { hosts: nodes },
    password,
    fail_fast: true,
    ..Default::default()
};
let perf_config = PerformanceConfig {
    max_command_attempts: 1,
    default_command_timeout_ms: 10_000,
    network_timeout_ms: 5_000,
    backpressure: BackpressureConfig {
        max_in_flight_commands: 50_000_000,
        ..Default::default()
    },
    ..Default::default()
};
```
@neogenie looks like the pipeline settings were different between our repros. I was able to repro something very similar to your logs and fixed it with the linked PR. If you have a moment, would you mind trying your repro against that branch?
Didn't mean to close this when merging the PR. |
Unfortunately, everything is the same. I tried with the latest changes as well.
Hey @neogenie, some more info on this: I'm not sure what we're doing differently in our repros. I can't seem to get an infinite loop to show up regardless of how I configure the client after the changes in #94. I can repro an error with the config you provided above, though. Here are my repro steps if you want to compare: I'm using the code in the local inf_loop module with mostly default arguments.
Then in another bash session I pause one of the primary node containers. From my reading of the logs it looks like the client detects the unresponsive connection after 5 seconds, reconnects 2-3 times until the cluster promotes one of the replicas to replace the paused container, then sends the failed INCR command again to the new primary hash slot owner. Are there any different or subsequent steps in your repro that I'm missing here? FWIW I made some small changes to the scripts in my working branch.
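For reference, the kind of loop the inf_loop repro runs is roughly the following (a sketch only; the `incr` function here is a hypothetical stand-in for the real client call, and the simulated failures just mimic a paused primary): keep issuing INCR on one key and watch whether failed attempts eventually succeed once a replica is promoted.

```rust
use std::time::Duration;
use tokio::time::sleep;

// Hypothetical stand-in for the real clustered INCR call; the real repro sends
// the command through the Redis client while one primary's container is paused.
async fn incr(key: &str, attempt: u64) -> Result<u64, String> {
    // Simulate a few failed attempts while the cluster is failing over.
    if attempt >= 3 && attempt < 6 {
        Err(format!("connection to the owner of '{key}' is unresponsive"))
    } else {
        Ok(attempt)
    }
}

#[tokio::main] // assumes tokio with the "full" feature set
async fn main() {
    for attempt in 0..10u64 {
        match incr("foo", attempt).await {
            Ok(value) => println!("INCR foo -> {value}"),
            Err(err) => {
                // The question the repro answers: does the client eventually
                // route the command to the newly promoted primary, or does it
                // keep failing here forever?
                eprintln!("attempt {attempt} failed: {err}");
                sleep(Duration::from_secs(1)).await;
            }
        }
    }
}
```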
Closing this for now, but if anybody is able to help provide a repro for this please let me know and I'll reopen. |
@aembke hi! Sorry for the delay! Unfortunately, it doesn't work, or I'm doing something wrong. I put together a repro: it contains a minimal test that pauses the containers one by one and checks whether the key can be read/written. The test has all the necessary infrastructure to track the status of the cluster.
@aembke Bump, please!
Hey @neogenie, thanks for taking the time to put this together, that was really helpful. I was able to repro the issue and fix it with the changes in #129. It looks like it was some combination of what you said initially (some missing timeouts on backchannel operations) and maybe the change in #126 as well. For what it's worth, to make your test pass I also had to reconfigure a few client settings. If your app code needs to handle this failure mode you may want to do something similar.
I'll release 6.2.1 with these changes tonight or tomorrow and then most likely come back and adapt your repro into some integration tests. Thanks again. |
@aembke Thanks a lot! |
Unfortunately, it doesn't work. Or am I doing something wrong?
@aembke |
This is an interesting one. Try this: #133. I also had to set one additional config option to make the test pass.
@aembke |
Yeah I agree, this definitely highlighted the need for better docs and maybe even some "preset" functions on various config structs. Thanks for your patience on this too. |
@aembke |
Yeah, that's from an unrelated issue. The fix for that will go up with: #134 |
Thanks! It works!!! @aembke |
The library behaves incorrectly in a Redis cluster failover scenario.
Reproduction scenario:
Problems:
Bottom line: if the connection to the master node of the cluster is lost, the library endlessly and unsuccessfully tries to reconnect to it, and it never reconfigures the cluster topology and slot assignments on the client side.