Enhance SENTINEL FAILOVER to use the FAILOVER command to avoid data loss #1238

Open
wants to merge 5 commits into
base: unstable

enjoy-binbin (Member) commented Oct 29, 2024

Currently, SENTINEL FAILOVER selects a replica and sends it REPLICAOF NO ONE,
waits for the promotion, and then sends REPLICAOF new_ip new_port to the
old primary and the other replicas to complete the failover. The problem is
that any writes the old primary accepts during the failover, before its role
changes, are lost.
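
The loss window can be illustrated with a toy model. This is a sketch, not Sentinel code: `Node`, `write`, and `legacy_failover` are hypothetical names, replication is reduced to copying a list, and the model assumes that a demoted primary resynchronizes from the new primary and discards its own dataset.

```python
class Node:
    """Toy server: a role plus an ordered log of accepted writes."""

    def __init__(self, name, role):
        self.name = name
        self.role = role
        self.log = []

    def write(self, value):
        # Only a primary accepts client writes.
        if self.role != "primary":
            raise RuntimeError(f"{self.name} is read-only")
        self.log.append(value)


def legacy_failover(old_primary, replica, late_writes):
    """Model the legacy sequence and return the writes that are lost."""
    # Step 1: REPLICAOF NO ONE detaches the replica and promotes it.
    replica.role = "primary"
    # Step 2: while Sentinel waits for the promotion, the old primary
    # is still a primary and keeps accepting client writes.
    for value in late_writes:
        old_primary.write(value)
    lost = [v for v in old_primary.log if v not in replica.log]
    # Step 3: REPLICAOF new_ip new_port demotes the old primary, which
    # (in this model) replaces its dataset with the new primary's.
    old_primary.role = "replica"
    old_primary.log = list(replica.log)
    return lost


primary = Node("old-primary", "primary")
replica = Node("replica", "replica")
primary.write("a")
replica.log = list(primary.log)  # replica is in sync before the failover
lost = legacy_failover(primary, replica, late_writes=["b", "c"])
print(lost)  # prints ['b', 'c']
```

The writes accepted in step 2 exist only on the old primary, so they vanish once it is demoted in step 3.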

We can use the FAILOVER command to avoid this data loss. FAILOVER was added
in 0d18a1e; it coordinates the failover between the primary and the replica.
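
As a rough illustration of what that coordination buys, the sketch below models the hand-over as three steps: pause writes, let the replica catch up, then swap roles. It is a toy model under stated assumptions, not the server implementation: `Server` and `coordinated_failover` are hypothetical names, and a paused primary here rejects writes, whereas a real server blocks the clients instead.

```python
class Server:
    """Toy server: a role, a write log, and a write-pause flag."""

    def __init__(self, role):
        self.role = role
        self.log = []
        self.paused = False

    def write(self, value):
        # Simplification: reject writes while paused instead of blocking.
        if self.role != "primary" or self.paused:
            raise RuntimeError("write rejected")
        self.log.append(value)


def coordinated_failover(primary, replica):
    """Hand the primary role to `replica` without dropping writes."""
    # 1. Pause client writes so the primary's log stops growing.
    primary.paused = True
    # 2. Let the replica catch up on whatever it has not received yet
    #    (assumes the replica's log is a prefix of the primary's).
    replica.log.extend(primary.log[len(replica.log):])
    # 3. Swap roles only once both logs are identical.
    if replica.log != primary.log:
        raise RuntimeError("replica did not catch up")
    primary.role, replica.role = "replica", "primary"
    primary.paused = False


primary, replica = Server("primary"), Server("replica")
primary.write("a")
replica.log = ["a"]  # replicated so far
primary.write("b")   # not yet replicated when the failover starts
coordinated_failover(primary, replica)
replica.write("c")   # the new primary now takes writes
print(replica.log)   # prints ['a', 'b', 'c']
```

Because no write is accepted between the catch-up and the role swap, nothing is left behind on the old primary.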

Before the original REPLICAOF NO ONE step, we first try sending FAILOVER TO ip port
to the primary so that it goes through the FAILOVER process. If the primary does not
support FAILOVER (for example, it returns an error), we fall back to the old
logic, that is, REPLICAOF NO ONE.
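
The try-then-fall-back decision can be sketched as follows. This is only an illustration of the control flow described above, not the code in src/sentinel.c (which is asynchronous C): `ErrorReply`, `start_promotion`, and `make_sender` are hypothetical names, and the fake connections simply record the commands they accept.

```python
class ErrorReply(Exception):
    """Hypothetical stand-in for an error reply from a server."""


def start_promotion(send_to_primary, send_to_replica, replica_ip, replica_port):
    """Try the coordinated path first; fall back to the legacy one."""
    try:
        # New path: ask the primary to hand over to the chosen replica.
        send_to_primary("FAILOVER", "TO", replica_ip, str(replica_port))
        return "failover"
    except ErrorReply:
        # Old path: promote the replica directly.
        send_to_replica("REPLICAOF", "NO", "ONE")
        return "legacy"


def make_sender(sent, supported):
    """Build a fake connection that records the commands it accepts."""
    def send(*args):
        if args[0] not in supported:
            raise ErrorReply("ERR unknown command")
        sent.append(args)
    return send


sent = []
replica = make_sender(sent, {"REPLICAOF"})
modern_primary = make_sender(sent, {"FAILOVER", "REPLICAOF"})
legacy_primary = make_sender(sent, {"REPLICAOF"})

path_a = start_promotion(modern_primary, replica, "127.0.0.1", 6380)
path_b = start_promotion(legacy_primary, replica, "127.0.0.1", 6380)
print(path_a, path_b)  # prints failover legacy
```

A primary that rejects FAILOVER never records the command, so the only command that takes effect on the second run is the REPLICAOF NO ONE sent to the replica.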

enjoy-binbin (Member Author) commented

This PR was written very quickly (I wrote it at night and wanted to push it as soon as possible). There may be problems in the details, and there may be room for optimization.

Also see redis/redis#6062 for more details on the data loss. The test case also demonstrates the data loss locally.

zuiderkwast (Contributor) left a comment

Yes, this looks good.

I'm not familiar enough with Sentinel. @hwware, you are a Sentinel expert. Do you want to take a look?

Signed-off-by: Binbin <[email protected]>

codecov bot commented Oct 31, 2024

Codecov Report

Attention: Patch coverage is 0% with 40 lines in your changes missing coverage. Please review.

Project coverage is 70.64%. Comparing base (c21f1dc) to head (21a6f4c).
Report is 7 commits behind head on unstable.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/sentinel.c | 0.00% | 40 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1238      +/-   ##
============================================
- Coverage     70.72%   70.64%   -0.08%     
============================================
  Files           114      114              
  Lines         63150    63195      +45     
============================================
- Hits          44660    44642      -18     
- Misses        18490    18553      +63     
| Files with missing lines | Coverage Δ |
| --- | --- |
| src/sentinel.c | 0.00% <0.00%> (ø) |

... and 12 files with indirect coverage changes

@enjoy-binbin enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Nov 1, 2024
Signed-off-by: Binbin <[email protected]>
set old_port [RPort $master_id]
set addr [S 0 SENTINEL GET-PRIMARY-ADDR-BY-NAME mymaster]
assert {[lindex $addr 1] == $old_port}

# Rename the FAILOVER command so that we can fallback to REPLICAOF NO ONE.
if {$type == "legacy"} {
S 0 SENTINEL SET mymaster rename-command FAILOVER NON-EXISTENT
Member

In real production, is it required to call rename-command explicitly?

Member Author

No, it is just a test step that tries to cover the fallback logic.

Contributor

Let's explain it in the comment:

# We simulate a server that doesn't have the FAILOVER command.
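
Why the rename makes the server look as if it lacks FAILOVER can be shown with a small model. This is a sketch under assumptions: `send_command` is a hypothetical helper, the rename table mirrors the `rename-command FAILOVER NON-EXISTENT` setting from the test, and the error string is simplified from what a real server replies.

```python
def send_command(known_commands, renames, name, *args):
    """Send `name` to a toy server, applying Sentinel-side renames first."""
    # Sentinel substitutes the configured name before sending.
    wire_name = renames.get(name, name)
    # The server only sees the substituted name and rejects unknown ones.
    if wire_name not in known_commands:
        return "ERR unknown command"
    return "OK"


server_commands = {"FAILOVER", "REPLICAOF"}
renames = {"FAILOVER": "NON-EXISTENT"}

with_rename = send_command(server_commands, renames, "FAILOVER", "TO", "127.0.0.1", "6380")
without_rename = send_command(server_commands, {}, "FAILOVER", "TO", "127.0.0.1", "6380")
legacy_step = send_command(server_commands, renames, "REPLICAOF", "NO", "ONE")
print(with_rename, "/", without_rename, "/", legacy_step)
```

The server itself still supports FAILOVER; only the name Sentinel sends is changed, which is enough to trigger the error reply and exercise the fallback path.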

hwware (Member) commented Nov 1, 2024

Overall LGTM, with just one concern about the test case.

Signed-off-by: Binbin <[email protected]>
3 participants