-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
raft: liveness problems during apply-time configuration change #12359
Comments
@tbg @xiang90 @NingLin-P @BusyJay PTAL, if there is any problem, please let me know, thanks! |
Well written! I have a suggestion, why not just use latest commit index and term a candidate knows? Tracking last conf change can be tricky. |
Interesting, thank you for the detailed analysis. It makes me nervous to introduce new ways for entries to become committed. For a slightly different issue, John Ousterhout called the idea of sending new entries (instead of the commit index) with vote responses "extremely dangerous" (link). Instead, I look at this as a limitation of learner replicas. We know that replicas must sometimes vote even when they don't know that they are voters. Without learner replicas, they never have to campaign because they'd never be the most up-to-date if they don't know about their own voter status. But as you demonstrate here, learners could be the most up-to-date even if they believe themselves to still be learners. Does this mean that learners should be able to campaign in certain conditions? (That's what would be happening if we processed config changes at append time instead of apply time) I haven't worked out the precise details of this. It's only necessary for learners to campaign if they have previously rejected a vote because their log is more up-to-date, but I'm not sure it's necessary to limit it to this case, so maybe learners could trigger (pre-)votes according to the normal timeout whenever there is no leader. If a learner campaigns in these examples, it would only count votes from |
The abovementioned solution doesn't introduce new ways for entries to become committed or carry new entries with vote request/response, both committing entries and appending entries are still decided by the leader as before. Instead, the |
As @NingLin-P said, this solution doesn't introduce correctness issues. |
Yes. I think using the latest commit index and term a peer knows is correct because we only need the voter who has the most up-to-date logs become leader, who must have all committed logs. |
Yeah, I can't find a specific problem with this. It looks like it should be safe as long as we're just propagating the commit index that was set by a leader. But I also missed the problem with including entries in VoteResponse until Ousterhout pointed it out.
No; as I said it must not vote for itself here. And this may be a problem: in some configurations, would the combination of the crashed server A and the non-voting learner prevent a quorum? In all of the examples here it works, but I'm having trouble formalizing this into a stronger guarantee.
How does this break correctness? There is one very surprising consequence. If the learner is elected leader, it must propose a new entry for its term in order to establish a new commit index. So the learner is acting very much like a full-fledged member of the cluster. And it really is, at least in these examples - it is a voting member of Cnew and should be able to campaign, become leader, propose and commit. But let's look at an example with two learners, and we want to swap one of the learners for one of the voters (same as example 1 with a second learner E)
D and E both receive the latest log entries without applying them, and then A crashes. If D campaigns, it's the same as example 1. But E could also campaign, and then you could have a leader that is not in Cold or Cnew. This feels very wrong, although I'm not sure it's incorrect, as long as E just proposes its one empty log entry, commits it, then one of the replicas in Cnew takes leadership. But it would require some new logic to ensure this happens, and I think this is enough of a problem that I withdraw the idea of letting leaders campaign. Passing the latest commit index seems like the safer solution. |
Indeed. Letting learner campaign breaks the concept of learner and brings more corner cases for both raft algorithm and applications. @xiang90 What's your suggestion? If we reach an agreement, @NingLin-P would be happy to submit a PR to fix it by passing the commit index during elections. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions. |
Hei, guys, Would you mind appending 'area/raft' label to this issue? Thanks! |
According to the Problem 2 mentioned by @gengliqi , here is my point of view (I'm not sure if it is right QwQ). The following statement is based on that a node will not be killed directly after it is removed from the Raft group, and the node shoud reply its commit index for VoteRequest/PreVoteRequest: It's clients' duty to track if the new configuration is alive. A cluster using the Raft lib probably has its own way to track/manage the state of itself (especially for clusters that implemented Multi-Raft). In my opinion, it is better not to introduce new mechanisms for the Raft lib to track new configurations if there is not one (e.g. tracing the commit index of followers). To trace the state of a new configuration, the client can use ReadIndex request after ConfChange proposal. When the client receives the ReadIndex response, the quorum of the cluster has already applied the log before ReadIndex Entry, that means the ConfChange Entry has also been applied by the quorum of the configuration. Now the client can decide if the new configuration is alive by waiting for the ReadIndex response. But it's not enough yet. This method can only guarantee killing the node which is removed from a Raft group will not cause liveness problem. But if the leader crush while configuration changing, the ReadIndex response will never be returned to the client. So just append a ReadIndex request after ConfChange is not enough. The client should attach the ReadIndex request a timeout, and retry it if failed. |
The ReadIndex request only needs leader broadcast heartbeat so it can't be used to verify that the majority of this configuration knows the newly commit index. Although the commit index can be piggybacked to heartbeat msg, the follower may respond and then crash without persisting the commit index. |
Thank you for the reply! I found I've made a mistake on it. ReadIndex only guarantees that the legal leader has advanced its apply index in its own view, but it still does not guarantee the consistent view of commit index among the cluster 😭. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions. |
Abstract
In #7625 (comment), @tbg proves the safety of apply-time configuration change.
But there are still two liveness problems left.
We discuss these problems and propose a solution.
In the end, we prove that the solution is feasible and the liveness can be guaranteed.
Problem 1
Although a majority of Cold,new is available during conf change, it's still possible that no leader can be elected forever.
Example 1
Joint Consensus
Cold => voters: {A, B, C}, learners: {D}
Cnew => voters: {A, B, D}, learners: {C}
Now the configuration is Cold. A is leader and B crashes.
A proposes Cold,new and some other entries. D receives all these entries but C receives Cold,new and some of other entries because of network issue. After C sends append response to A, A knows Cold,new is committed because the majority of Cold has this entry(i.e. A and C). C knows Cold,new is committed while D does not but D’s entry is more up-to-date than C's. (It may happen because C knows the commit index through heartbeat or A can not send all entries to C due to flow control)
A: ... Cold,new, E … En … Em
B: ...
C: ... Cold,new, E … En
D: ... Cold,new, E … En … Em
Now A crashes and B recovers.
C applies the Cold,new before campaigning. Then C sends vote requests to A, B, D because its configuration is Cold,new. Obviously, B will grant its vote to C while D won’t. D does not know Cold,new is committed so it is still a learner and does not campaign.
Thus no leader can be elected even if the majority of Cold,new is available(i.e. B, C, D).
Example 2
Single Membership Change
Cold => voters: {A, B, C}, learners: {D}
Cnew => voters: {A, B, C, D}
Now the configuration is Cold. A is leader and B crashes.
A proposes Cnew to promote learner D and some other entries. The subsequent process is the same as example 1. D’s entry is more up-to-date than C's but C knows Cnew is committed.
After A crashes and B recovers, no leader can be elected even if both the majority of Cold and Cnew is available.
Note that this kind of issue can not happen if the number of voters in Cold is even because the quorum number does not change and a leader can be elected without the new voter.
Example 3
Joint Consensus
Cold => voters: {A, B, C}, learners: {D}
Cnew => voters: {A, B, D}, learners: {C}
Now Cold,new is committed. A is leader and B crashes.
A proposes Cnew. C and D receive Cnew and responds to A. A knows Cnew is committed and then proposes some other entries. C receives these entries and knows Cnew is committed while D does not.
A: ... Cold,new, Cnew, E …
B: ...
C: ... Cold,new, Cnew, E …
D: ... Cold,new, Cnew
Now A crashes and B recovers.
C applies the Cnew and becomes learner. D sends vote requests to A, B, C because its configuration is Cold,new. Obviously, B will grant its vote to D while C won’t.
Thus no leader can be elected even if the majority of Cold,new is available(B, C, D).
Example 4
Single Membership Change
Cold => voters: {A, B, C, D}
Cnew => voters: {A, B, D}, learners: {C}
Now the configuration is Cold. A is leader and B crashes.
The subsequent process is the same as example 3.
No leader can be elected even if both the majority of Cold and Cnew is available.
Note that this kind of issue can not happen if the number of voters in Cold is odd because the quorum number does not change and a leader can be elected without the new learner.
For example 1 and 2, there is a learner D who has an up-to-date log but does not know the configuration is committed.
For example 3 and 4, there is a learner C who has an up-to-date log and also knows the configuration is committed but other voters don’t know it.
Solution
After some discussion with @NingLin-P, we find this problem can be solved by using vote requests and responses.
For example 1 and 2, D will get a vote request from C so it will advance its commit index to Cold,new. Then D applies Cold,new, becomes follower. It becomes leader after campaigning.
For example 3 and 4, C will get a vote request from D so it will send vote response with the Cnew commit index and term. Then D will advance its commit index, applies Cnew, and becomes leader after campaigning.
In #7625 (comment), @tbg said in CockroachDB, they workaround the available problem 2 in #11284 (comment) by not removing a voter directly. Instead, demoting a voter first and then removing it.
I think it should not just a workaround in client. We must forbid removing a voter directly in raft library, otherwise, the liveness guarantee will be broken.
Problem 2
There is a confusing thing that the client can not know whether this cluster is available when a majority of configuration is alive after it receives the success response of this configuration.
In other words, when can we believe the quorum is always majority of this new configuration?
The answer is after the majority of configuration knows the configuration entry is committed.
But this problem does not exist in some cases. Next, we will discuss different situations.
Add voter/Promote learner
Joint Consensus
Cold,new does this work. It’s not a problem because any quorum of Cold,new include a certain quorum of Cold, the cluster is available when the majority of Cold,new is alive.
Single Membership Change
If number of voters in Cold is odd, this problem does not exist because any quorum of Cnew include a certain quorum of Cold, the cluster is available when the majority of Cnew is alive.
But if it is even, this problem exists.
For example:
Cold => voters: {A, B}, learners: {C}
Cnew => voters: {A, B, C}
Leader A proposes Cnew(promote C). A commits and applies Cnew. B and C have this entry but don’t know it’s committed. Then A crashes. No leader can be elected because both B and C do not know Cnew is committed.
Demote voter
Joint Consensus
Cnew does this work.
Single Membership Change
Like add voter/promote learner, if number of voters in Cold is odd, this problem does not exist.
If it is even, it has the same situations as joint consensus.
Solution
Leader can trace the saved commit index of followers. (raft-rs has implemented this feature already tikv/raft-rs#366). After the majority of followers’s commit index is greater than the configuration index, we can tell the client that information in Ready.
Alternative: After a configuration is committed or applied, leader can propose another log. If this log is committed, it implies the majority of configuration knows the configuration is commited due to append invariant.
Liveness argument
Three conditions mentioned above.
Under these conditions, we can prove a leader can be elected at all steps of configuration change.
Joint Consensus
We assume that a majority of the old configuration is available (at least until the majority of Cnew knows Cnew is committed) and that a majority of the new configuration is available.
If Cold,new is not committed
(a) The available peer with the most up-to-date log can collect votes from a majority of Cold and become leader.
If Cold,new is committed and Cnew is not committed
(a) For a certain quorum of Cold,new, if no peer knows Cold,new is committed, it’s the same as situation 1.
(b) Otherwise, this peer who knows Cold,new is committed must be voter after applying Cold,new. This voter sends vote request to other voters in this quorum of Cold,new. Each voter who has Cold,new entry can know Cold,new is committed through vote request due to condition 1. One of them with the most up-to-date log can collect votes from a majority of Cold,new and become leader.
If Cnew is committed
(a) For a certain quorum of Cold,new, if no peer knows Cnew is committed, it’s the same as situation 2.b because there is at least one peer in any quorum of Cold,new knows Cold,new is committed. (append invariant)
(b) Otherwise, because of condition 2, no peer can be removed even if it knows Cnew is committed and applies Cnew. Each voter who has Cnew entry but does not know Cnew is committed sends vote request to other voters in this Cold,new’s quorum. They will know Cnew is committed through vote response due to condition 1. One of them with the most up-to-date log can collect votes from a majority of Cnew and become leader.
If majority of Cnew knows Cnew is committed
(a) For a certain quorum of Cnew, at least one of them knows the Cnew is committed. So it’s the same as situation 3.b.
Single Membership Change
We assume that a majority of the old configuration is available (at least until the majority of Cnew knows Cnew is committed) and that a majority of the new configuration is available.
If Cnew is not committed
(a) The available peer with the most up-to-date log can collect votes from a majority of Cold and become leader.
If Cnew is committed
(a) For a certain quorum of Cold, if no peer knows Cnew is committed, it’s the same as situation 1.
(b) Otherwise, there is a peer A in Cold knows Cnew is committed. Because of condition 2, no peer can be removed even if it knows Cnew is committed and applies Cnew. Next, there must be a voter B in quorum of Cnew has Cnew entry because the intersection of any quorum of Cold and any quorum of Cnew is non-empty. If peer A and voter B are different peer, voter B will know Cnew is committed through vote response from peer A due to condition 1. Then each voter in Cnew who has Cnew entry can know Cnew is committed through vote request of peer B. One of them with the most up-to-date log can collect votes from a majority of Cnew and become leader.
If majority of Cnew knows Cnew is committed
For a certain quorum of Cnew, at least one of them knows the Cnew is committed. It’s similar to situation 2.b.
The text was updated successfully, but these errors were encountered: