Enhance SENTINEL FAILOVER to use the FAILOVER command to avoid data loss #1238
base: unstable
Signed-off-by: Binbin <[email protected]>
This PR was written very quickly (I wrote it at night and wanted to push it as soon as possible), so there may be problems with the details, or room for optimization. See redis/redis#6062 for more details on the data loss. The test case also reproduces the data loss locally.
Yes, this looks good.
I'm not familiar enough with Sentinel. @hwware, you are a Sentinel expert. Do you want to take a look?
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1238      +/-   ##
============================================
- Coverage     70.72%   70.64%   -0.08%
============================================
  Files           114      114
  Lines         63150    63195      +45
============================================
- Hits          44660    44642      -18
- Misses        18490    18553      +63
set old_port [RPort $master_id]
set addr [S 0 SENTINEL GET-PRIMARY-ADDR-BY-NAME mymaster]
assert {[lindex $addr 1] == $old_port}

# Rename the FAILOVER command so that we can fallback to REPLICAOF NO ONE.
if {$type == "legacy"} {
    S 0 SENTINEL SET mymaster rename-command FAILOVER NON-EXISTENT
In real production, is it required to call rename-command explicitly like this?
No, it is just a test step that tries to cover the fallback logic.
Let's explain in the comment
# We simulate a server that doesn't have the FAILOVER command.
Overall, LGTM, just one concern about the test case.
Currently, SENTINEL FAILOVER selects a replica and sends it REPLICAOF NO ONE,
waits for the promotion, and then sends REPLICAOF new_ip new_port to the
old primary and the other replicas to complete the failover. The problem is
that any writes the old primary accepts during the failover, before its role
changes, will be lost.
We can use the FAILOVER command to avoid this data loss. FAILOVER was added
in 0d18a1e; it coordinates the failover between the primary and the replica.
Before the original REPLICAOF NO ONE step, we first try sending
FAILOVER TO ip port to the primary to go through the FAILOVER process. If the
primary does not support FAILOVER (for example, it returns an error), we fall
back to the old logic, that is, REPLICAOF NO ONE.
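
The try-then-fall-back decision can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `promote`, `make_send`, and the addresses are hypothetical, and `send` stands in for whatever transport delivers a command and returns the raw reply. The only protocol facts assumed are that RESP error replies start with `-` and that the command form is FAILOVER TO ip port.

```python
# Hypothetical sketch of the command selection; not Sentinel's code.

def promote(send, primary, replica):
    """Return the path taken: 'failover' or 'legacy'."""
    ip, port = replica
    reply = send(primary, ["FAILOVER", "TO", ip, str(port)])
    if not reply.startswith("-"):  # RESP error replies start with '-'
        return "failover"
    # The primary rejected FAILOVER (for example, unknown command),
    # so fall back to the old logic on the replica.
    send(replica, ["REPLICAOF", "NO", "ONE"])
    return "legacy"


def make_send(supports_failover, log):
    """Build a fake transport that records every command sent."""
    def send(target, command):
        log.append((target, " ".join(command)))
        if command[0] == "FAILOVER" and not supports_failover:
            return "-ERR unknown command 'FAILOVER'"
        return "+OK"
    return send


PRIMARY = ("10.0.0.1", 6379)
REPLICA = ("10.0.0.2", 6379)

new_log, old_log = [], []
new_path = promote(make_send(True, new_log), PRIMARY, REPLICA)
old_path = promote(make_send(False, old_log), PRIMARY, REPLICA)
print(new_path, old_path)  # failover legacy
```

With a primary that accepts FAILOVER, only one command is sent, and it goes to the primary. With a primary that rejects it, the replica still receives REPLICAOF NO ONE, so the behavior matches the old logic.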