Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implemented leader lease when quorum check is on. #5468

Merged
merged 1 commit into from
Jun 1, 2016

Conversation

swingbach
Copy link
Contributor

@swingbach swingbach commented May 27, 2016

Make the check quorum mechanism behaves more like a leader lease, also solve stuck peer problem

see also pr 5451

@@ -565,15 +565,25 @@ func (r *raft) Step(m pb.Message) error {
case m.Term > r.Term:
lead := m.From
if m.Type == pb.MsgVote {
if r.state == StateFollower && r.checkQuorum && r.electionElapsed < r.electionTimeout {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is also true for leader? right? also let's put the conditional var checking first

so

if r.checkQuorum && r.state != StateCandidate && r.electionElapsed < r.electionTimeout 

@xiang90 xiang90 changed the title raft: fix stuck peer problem causing by modified behavior of quorum checking implemented leader lease when quorum check is on. May 27, 2016
b.randomizedElectionTimeout = b.electionTimeout + 1
}

for i := 0; i < b.electionTimeout; i++ {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's just do 2 * b.electionTimeout and remove line 1274-1276?

@swingbach
Copy link
Contributor Author

@xiang90 PTAL

Thanks

@swingbach swingbach force-pushed the master branch 5 times, most recently from f5c0fbb to 2e0518a Compare May 28, 2016 03:35

nt := newNetwork(a, b, c)

// Prevent campaigning from b
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i do not think we need line 1239->1242.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or at least line 1240 is not needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this case will break without line 1240

i think we need to let b's elapsedTimeout reach to electionTimeout so that it can vote for a,
and also if randomizedElectionTimeout is exactly electionTimeout(that's gonna happen if the random part of timeout is 0) b is gonna start a vote, it affects the behavior of this case

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a comment for this? i think this is tricky actually.

// the reason causes a stuck candidate is because it might be rejected by other peers if
// quorum check is on and those peers are still within leader lease, it ends up having a
// higher term than the quorum
func TestFreeStuckPeerWithCheckQuorum(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestFreeStuckCandidateWithCheckQuorum ensures that a candidate with a higher term can disrupt the leader even if the leader still "officially" holds the lease. The leader is expected to step down and adopt the candidate's term.

@xiang90
Copy link
Contributor

xiang90 commented May 28, 2016

/cc @bdarnell Can you or anyone from cockroach team give this a final review? Thanks!

@swingbach
Copy link
Contributor Author

@xiang90 thank you for all the help, PTAL

thanks

@xiang90
Copy link
Contributor

xiang90 commented May 30, 2016

@swingbach There is one corner case we need to fix in this pull request. A new raft node is not promotable until itself is added into the peer list (a node can only be considered part of the cluster once it added itself into the cluster). If the node is not promotable, then it never increases it election (see https://github.com/coreos/etcd/blob/master/raft/raft.go#L427). Then it never votes to anyone. We need to somehow fix this...

@xiang90
Copy link
Contributor

xiang90 commented May 30, 2016

Also let's cover this by a new test case, a non-promotable node can vote for others.

@@ -565,15 +565,25 @@ func (r *raft) Step(m pb.Message) error {
case m.Term > r.Term:
lead := m.From
if m.Type == pb.MsgVote {
if r.checkQuorum && r.state != StateCandidate && r.electionElapsed < r.electionTimeout {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment like: "Section 4.2.3: 'if a server receives a RequestVote request within the minimum election timeout of hearing from a current leader, it does not update its term or grant its vote.' This is tied to the checkQuorum feature because this is the only case in which leaders track the time elapsed since they last confirmed their leadership. In StateCandidate, r.electionElapsed is small, but by definition we have not recently heard from the current leader if we are in this state"

@swingbach
Copy link
Contributor Author

@bdarnell already added the comments you mentioned
@xiang90 fixed the problem of non-promotable node can not vote for others with a simple test case

PTAL, Thanks

b.delProgress(2)

if b.promotable() {
t.Errorf("promotable = %v, want false", b.promotable())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t.Fatalf instead of errorf (we should stop testing if the assumption does not hold)

@xiang90
Copy link
Contributor

xiang90 commented Jun 1, 2016

LGTM. Please squash commits.

@swingbach swingbach force-pushed the master branch 2 times, most recently from 31f4ecf to 337ef64 Compare June 1, 2016 22:03
@swingbach
Copy link
Contributor Author

already fixed the assertion and squashed all the commits

@xiang90 PTAL, Thanks

@xiang90 xiang90 merged commit 9fee773 into etcd-io:master Jun 1, 2016
@xiang90
Copy link
Contributor

xiang90 commented Jun 1, 2016

@swingbach Merged. Thanks.

tiancaiamao added a commit to tikv/tikv that referenced this pull request Jun 10, 2016
zhangjinpeng87 added a commit to tikv/tikv that referenced this pull request Aug 2, 2016
BusyJay pushed a commit to tikv/raft-rs that referenced this pull request Dec 19, 2017
siennathesane pushed a commit to shieldmaidens/pleiades that referenced this pull request Nov 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

4 participants