
raft: liveness problems during apply-time configuration change #12359

Closed
gengliqi opened this issue Sep 30, 2020 · 17 comments


gengliqi commented Sep 30, 2020

Abstract

In #7625 (comment), @tbg proves the safety of apply-time configuration change.
But two liveness problems remain.
We discuss these problems and propose a solution.
Finally, we prove that the solution is feasible and that liveness is guaranteed.

Problem 1

Although a majority of Cold,new is available during the conf change, it is still possible that no leader is ever elected.

Example 1

Joint Consensus
Cold => voters: {A, B, C}, learners: {D}
Cnew => voters: {A, B, D}, learners: {C}
Now the configuration is Cold. A is leader and B crashes.
A proposes Cold,new and some other entries. D receives all of these entries, but C receives Cold,new and only some of the other entries because of a network issue. After C sends its append response to A, A knows Cold,new is committed because a majority of Cold has this entry (i.e. A and C). C learns that Cold,new is committed while D does not, even though D's log is more up-to-date than C's. (This may happen because C learns the commit index through a heartbeat, or because A cannot send all entries to C due to flow control.)

A: ... Cold,new, E … En … Em
B: ...
C: ... Cold,new, E … En
D: ... Cold,new, E … En … Em

Now A crashes and B recovers.
C applies Cold,new before campaigning, so it sends vote requests to A, B, and D because its configuration is now Cold,new. B grants its vote to C, but D won't because D's log is more up-to-date than C's. D does not know Cold,new is committed, so it still considers itself a learner and does not campaign.
Thus no leader can be elected even though a majority of Cold,new (i.e. B, C, D) is available.

Example 2

Single Membership Change
Cold => voters: {A, B, C}, learners: {D}
Cnew => voters: {A, B, C, D}
Now the configuration is Cold. A is leader and B crashes.
A proposes Cnew to promote learner D, along with some other entries. The subsequent process is the same as in example 1: D's log is more up-to-date than C's, but only C knows Cnew is committed.
After A crashes and B recovers, no leader can be elected even though majorities of both Cold and Cnew are available.
Note that this kind of issue cannot happen if the number of voters in Cold is even, because then the quorum size does not change and a leader can be elected without the new voter.

Example 3

Joint Consensus
Cold => voters: {A, B, C}, learners: {D}
Cnew => voters: {A, B, D}, learners: {C}
Now Cold,new is committed. A is leader and B crashes.
A proposes Cnew. C and D receive Cnew and respond to A. A knows Cnew is committed and then proposes some other entries. C receives these entries and learns that Cnew is committed, while D does not.

A: ... Cold,new, Cnew, E …
B: ...
C: ... Cold,new, Cnew, E …
D: ... Cold,new, Cnew

Now A crashes and B recovers.
C applies Cnew and becomes a learner. D sends vote requests to A, B, and C because its configuration is still Cold,new. B grants its vote to D, but C won't because C's log is more up-to-date than D's, and C, believing itself a learner, does not campaign.
Thus no leader can be elected even though a majority of Cold,new (B, C, D) is available.

Example 4

Single Membership Change
Cold => voters: {A, B, C, D}
Cnew => voters: {A, B, D}, learners: {C}
Now the configuration is Cold. A is leader and B crashes.
The subsequent process is the same as example 3.
No leader can be elected even though majorities of both Cold and Cnew are available.
Note that this kind of issue cannot happen if the number of voters in Cold is odd, because then the quorum size does not change and a leader can be elected without the new learner.
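
As a quick sanity check of the parity notes in examples 2 and 4: in both cases the majority size simply does not change. The helper below is purely illustrative and not part of any raft library.

```go
package main

import "fmt"

// quorum returns the majority size for n voters.
func quorum(n int) int { return n/2 + 1 }

func main() {
	// Example 2's note: promoting a learner into an even-sized Cold keeps the
	// quorum size fixed, so the old voters alone can still elect a leader
	// without the newly promoted voter.
	fmt.Println(quorum(4), quorum(5)) // 3 3

	// Example 4's note: demoting a voter out of an odd-sized Cold keeps the
	// quorum size fixed, so a leader can be elected without the demoted
	// learner's vote.
	fmt.Println(quorum(3), quorum(2)) // 2 2
}
```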

In examples 1 and 2, there is a learner D that has an up-to-date log but does not know the configuration is committed.
In examples 3 and 4, there is a learner C that has an up-to-date log and knows the configuration is committed, but the other voters do not know it.

Solution

After some discussion with @NingLin-P, we found that this problem can be solved by using vote requests and responses.

  • Vote requests and responses should carry the index and term of the sender's last committed configuration-change entry. When a peer receives a vote request or response, it should advance its commit index if it has a local entry with the same index and term.

In examples 1 and 2, D gets a vote request from C, so it advances its commit index to cover Cold,new. D then applies Cold,new, becomes a voter, and can become leader after campaigning.
In examples 3 and 4, C gets a vote request from D and replies with the commit index and term of Cnew. D then advances its commit index, applies Cnew, and becomes leader after campaigning.
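
A minimal sketch of this rule, using hypothetical types rather than etcd/raft's actual message or storage API; the same handling applies to both vote requests and vote responses:

```go
package sketch

// entry is a simplified log entry.
type entry struct {
	Index, Term uint64
}

// voteMsg carries the extra information proposed above: the (index, term) of
// the last configuration-change entry the sender knows to be committed.
type voteMsg struct {
	ConfCommitIndex, ConfCommitTerm uint64
}

// peer holds the minimal state needed to illustrate the rule.
type peer struct {
	log       []entry
	committed uint64
}

// termAt returns the term of the local entry at index i, or 0 if it is absent.
func (p *peer) termAt(i uint64) uint64 {
	for _, e := range p.log {
		if e.Index == i {
			return e.Term
		}
	}
	return 0
}

// handleVoteCommit advances the local commit index when the local log holds
// exactly the entry the sender reports as committed. A committed entry stays
// committed, so this only propagates a decision the leader already made.
func (p *peer) handleVoteCommit(m voteMsg) {
	if m.ConfCommitIndex > p.committed && p.termAt(m.ConfCommitIndex) == m.ConfCommitTerm {
		p.committed = m.ConfCommitIndex
		// The peer would then apply the conf change; a learner that discovers
		// it has become a voter (D in examples 1 and 2) can now campaign.
	}
}
```

Per @BusyJay's suggestion below, the same handling also works if the message simply carries the sender's latest commit index and term instead of the last conf-change entry specifically.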

In #7625 (comment), @tbg said that in CockroachDB they work around availability problem 2 in #11284 (comment) by not removing a voter directly: they demote the voter to a learner first and then remove it.

I think this should not just be a workaround in the client. We must forbid removing a voter directly in the raft library; otherwise, the liveness guarantee will be broken.

Problem 2

There is a confusing aspect here: after the client receives the success response for a configuration change, it still cannot know whether the cluster is available whenever a majority of that configuration is alive.

In other words, when can we believe that the quorum is always a majority of the new configuration?
The answer is: after a majority of the configuration knows the configuration entry is committed.

But this problem does not exist in some cases. Next, we discuss the different situations.

Add voter/Promote learner

Joint Consensus

Cold,new does this work. It's not a problem because any quorum of Cold,new includes a quorum of Cold, so the cluster is available whenever a majority of Cold,new is alive.

Single Membership Change

If the number of voters in Cold is odd, this problem does not exist because any quorum of Cnew includes a quorum of Cold, so the cluster is available whenever a majority of Cnew is alive.
But if it is even, the problem exists.
For example:
Cold => voters: {A, B}, learners: {C}
Cnew => voters: {A, B, C}
Leader A proposes Cnew (promoting C). A commits and applies Cnew. B and C have this entry but do not know it is committed. Then A crashes. No leader can be elected because neither B nor C knows Cnew is committed.
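
The odd/even claim can be checked by brute force for small clusters; this sketch is purely illustrative and not part of any raft library:

```go
package main

import "fmt"

// everyNewQuorumCoversOld reports whether, after adding one voter to a group
// of oldN voters, every majority of Cnew contains a majority of Cold.
func everyNewQuorumCoversOld(oldN int) bool {
	newN := oldN + 1
	oldQ, newQ := oldN/2+1, newN/2+1
	// Enumerate all subsets of the new voter set; voter index oldN is the new one.
	for set := 0; set < 1<<newN; set++ {
		total, oldVoters := 0, 0
		for i := 0; i < newN; i++ {
			if set&(1<<i) != 0 {
				total++
				if i < oldN {
					oldVoters++
				}
			}
		}
		if total >= newQ && oldVoters < oldQ {
			return false // a quorum of Cnew that contains no quorum of Cold
		}
	}
	return true
}

func main() {
	for oldN := 1; oldN <= 7; oldN++ {
		fmt.Printf("|Cold|=%d: %v\n", oldN, everyNewQuorumCoversOld(oldN))
	}
	// Prints true exactly when |Cold| is odd, matching the claim above.
}
```

The demotion case discussed below can be checked the same way by removing a voter instead of adding one.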

Demote voter

Joint Consensus

Cnew does this work.

  • If we want to demote voters to learners and then remove them, then once Cremove (removing the learner(s)) is committed, a majority of Cnew knows Cnew is committed due to the append invariant. From that point the quorum is always a majority of Cnew, so the learner can be safely removed. Note that this method has the small disadvantage that we must keep a majority of Cold,new available until Cremove is committed.
  • If we just want to demote voters to learners, the problem exists. In example 3 of problem 1, if C is isolated, no leader can be elected after A crashes because neither B nor D knows Cnew is committed. C is a learner, but it is still needed even though a majority of Cnew is available.

Single Membership Change

As with add voter/promote learner, if the number of voters in Cold is odd, this problem does not exist.
If it is even, the same situations as in joint consensus apply.

Solution

The leader can track the persisted commit index of its followers (raft-rs has already implemented this feature: tikv/raft-rs#366). Once a majority of followers' commit indexes reach the configuration entry's index, we can report that information to the client via Ready.

Alternative: after a configuration is committed or applied, the leader can propose another log entry. If that entry is committed, it implies that a majority of the configuration knows the configuration is committed, due to the append invariant.
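
A sketch of the first option with hypothetical inputs (raft-rs's actual progress-tracking API is not shown here); the check is simply that a majority of the configuration's voters have persisted a commit index covering the conf-change entry:

```go
package sketch

// confChangeDurable reports whether a majority of the given voters have
// persisted a commit index at or beyond the conf-change entry's index, i.e.
// whether a majority of the configuration knows the entry is committed.
// persistedCommit maps a voter ID to the commit index that voter has saved.
func confChangeDurable(persistedCommit map[uint64]uint64, voters []uint64, confIndex uint64) bool {
	acked := 0
	for _, id := range voters {
		if persistedCommit[id] >= confIndex {
			acked++
		}
	}
	return acked >= len(voters)/2+1
}
```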

Liveness argument

The three conditions mentioned above:

  1. Vote request and response should carry the commit index and term of the last configuration change.
  2. Forbid removing a voter directly.
  3. Once a majority of the configuration knows the configuration entry is committed, the quorum is always a majority of the new configuration (i.e. a leader can be elected if a majority of that configuration is available).

Under these conditions, we can prove a leader can be elected at all steps of configuration change.

Joint Consensus

We assume that a majority of the old configuration is available (at least until the majority of Cnew knows Cnew is committed) and that a majority of the new configuration is available.

  1. If Cold,new is not committed
    (a) The available peer with the most up-to-date log can collect votes from a majority of Cold and become leader.

  2. If Cold,new is committed and Cnew is not committed
    (a) For a certain quorum of Cold,new, if no peer knows Cold,new is committed, it’s the same as situation 1.
    (b) Otherwise, a peer that knows Cold,new is committed must be a voter after applying Cold,new. This voter sends vote requests to the other voters in this quorum of Cold,new. By condition 1, every voter that has the Cold,new entry learns from the vote request that it is committed. The one with the most up-to-date log can then collect votes from a majority of Cold,new and become leader.

  3. If Cnew is committed
    (a) For a certain quorum of Cold,new, if no peer knows Cnew is committed, this is the same as situation 2.b, because at least one peer in any quorum of Cold,new knows Cold,new is committed (append invariant).
    (b) Otherwise, because of condition 2, no peer can be removed even if it knows Cnew is committed and has applied Cnew. Every voter that has the Cnew entry but does not know it is committed sends vote requests to the other voters in this quorum of Cold,new, and they learn that Cnew is committed through the vote responses, due to condition 1. The one with the most up-to-date log can collect votes from a majority of Cnew and become leader.

  4. If majority of Cnew knows Cnew is committed
    (a) For a certain quorum of Cnew, at least one peer knows Cnew is committed, so this is the same as situation 3.b.

Single Membership Change

We assume that a majority of the old configuration is available (at least until the majority of Cnew knows Cnew is committed) and that a majority of the new configuration is available.

  1. If Cnew is not committed
    (a) The available peer with the most up-to-date log can collect votes from a majority of Cold and become leader.

  2. If Cnew is committed
    (a) For a certain quorum of Cold, if no peer knows Cnew is committed, it’s the same as situation 1.
    (b) Otherwise, there is a peer A in Cold that knows Cnew is committed. Because of condition 2, no peer can be removed even if it knows Cnew is committed and has applied Cnew. Next, there must be a voter B in the quorum of Cnew that has the Cnew entry, because Cnew is committed on a quorum of Cold and the intersection of any quorum of Cold and any quorum of Cnew is non-empty. If peer A and voter B are different peers, voter B learns that Cnew is committed through the vote response from peer A, due to condition 1. Then every voter in Cnew that has the Cnew entry can learn that Cnew is committed through B's vote requests. The one with the most up-to-date log can collect votes from a majority of Cnew and become leader.

  3. If majority of Cnew knows Cnew is committed
    For a certain quorum of Cnew, at least one peer knows Cnew is committed. This is similar to situation 2.b.

@gengliqi

@tbg @xiang90 @NingLin-P @BusyJay PTAL, if there is any problem, please let me know, thanks!


BusyJay commented Oct 1, 2020

Well written! I have a suggestion: why not just use the latest commit index and term that a candidate knows? Tracking the last conf change can be tricky.


bdarnell commented Oct 6, 2020

Interesting, thank you for the detailed analysis.

It makes me nervous to introduce new ways for entries to become committed. For a slightly different issue, John Ousterhout called the idea of sending new entries (instead of the commit index) with vote responses "extremely dangerous" (link).

Instead, I look at this as a limitation of learner replicas. We know that replicas must sometimes vote even when they don't know that they are voters. Without learner replicas, they never have to campaign because they'd never be the most up-to-date if they don't know about their own voter status. But as you demonstrate here, learners could be the most up-to-date even if they believe themselves to still be learners. Does this mean that learners should be able to campaign in certain conditions? (That's what would be happening if we processed config changes at append time instead of apply time)

I haven't worked out the precise details of this. It's only necessary for learners to campaign if they have previously rejected a vote because their log is more up-to-date, but I'm not sure it's necessary to limit it to this case, so maybe learners could trigger (pre-)votes according to the normal timeout whenever there is no leader. If a learner campaigns in these examples, it would only count votes from Cold, so it wouldn't be able to vote for itself. That means it needs every surviving voter from Cold to vote for it. I think that always works in these examples, but are there other cases in which a campaigning learner might not be able to get a quorum?

@NingLin-P

It makes me nervous to introduce new ways for entries to become committed. For a slightly different issue, John Ousterhout called the idea of sending new entries (instead of the commit index) with vote responses "extremely dangerous".

The abovementioned solution doesn't introduce new ways for entries to become committed, nor does it carry new entries with the vote request/response; both committing and appending entries are still decided by the leader, as before.

Instead, the (index, term) of the last applied conf-change entry in the vote request/response just forwards commit-index information to the receiver. Although this information comes from a candidate/follower, a committed entry must remain committed, so I think this will not introduce correctness issues.

@gengliqi

It makes me nervous to introduce new ways for entries to become committed.

Does this mean that learners should be able to campaign in certain conditions?

As @NingLin-P said, this solution doesn't introduce correctness issues.
I think there are some problems with your idea (i.e. a learner can campaign if it has previously rejected a vote):
If a learner can campaign but is not a voter in its own configuration, does it count itself in the number of votes?
If a learner really is just a learner, then when a voter with an out-of-date log sends it a vote request, the learner will campaign. It may have an up-to-date log, so it could become leader, which breaks correctness.

@gengliqi

Well written! I have a suggestion: why not just use the latest commit index and term that a candidate knows? Tracking the last conf change can be tricky.

Yes. I think using the latest commit index and term that a peer knows is correct, because we only need the voter with the most up-to-date log to become leader, and that voter must have all committed entries.
I will update the liveness argument soon.


BusyJay commented Oct 15, 2020

@bdarnell @xiang90 any further suggestions?

@bdarnell

The abovementioned solution doesn't introduce new ways for entries to become committed, nor does it carry new entries with the vote request/response; both committing and appending entries are still decided by the leader, as before.

Yeah, I can't find a specific problem with this. It looks like it should be safe as long as we're just propagating the commit index that was set by a leader. But I also missed the problem with including entries in VoteResponse until Ousterhout pointed it out.

If a learner can campaign but is not a voter in its own configuration, does it count itself in the number of votes?

No; as I said it must not vote for itself here. And this may be a problem: in some configurations, would the combination of the crashed server A and the non-voting learner prevent a quorum? In all of the examples here it works, but I'm having trouble formalizing this into a stronger guarantee.

the learner will campaign. It may have an up-to-date log, so it could become leader, which breaks correctness.

How does this break correctness?

There is one very surprising consequence. If the learner is elected leader, it must propose a new entry for its term in order to establish a new commit index. So the learner is acting very much like a full-fledged member of the cluster. And it really is, at least in these examples - it is a voting member of Cnew and should be able to campaign, become leader, propose and commit.

But let's look at an example with two learners, and we want to swap one of the learners for one of the voters (same as example 1 with a second learner E)

  • Cold: voters A, B, C, learners D, E
  • Cnew: voters A, B, D, learners C, E

D and E both receive the latest log entries without applying them, and then A crashes. If D campaigns, it's the same as example 1. But E could also campaign, and then you could have a leader that is not in Cold or Cnew.

This feels very wrong, although I'm not sure it's incorrect, as long as E just proposes its one empty log entry, commits it, and then one of the replicas in Cnew takes leadership. But it would require some new logic to ensure this happens, and I think this is enough of a problem that I withdraw the idea of letting learners campaign. Passing the latest commit index seems like the safer solution.


BusyJay commented Oct 20, 2020

Indeed. Letting learners campaign breaks the concept of a learner and brings more corner cases for both the raft algorithm and applications.

@xiang90 What's your suggestion? If we reach an agreement, @NingLin-P would be happy to submit a PR to fix it by passing the commit index during elections.


stale bot commented Jan 18, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 18, 2021

BusyJay commented Jan 18, 2021

@bdarnell @gengliqi do you have time to review #12435?

@stale stale bot removed the stale label Jan 18, 2021
@gengliqi

@bdarnell @gengliqi do you have time to review #12435?

ok, I will review it soon.


MrCroxx commented Jan 20, 2021

Hey guys, would you mind adding the 'area/raft' label to this issue? Thanks!


MrCroxx commented Apr 15, 2021

Regarding Problem 2 mentioned by @gengliqi, here is my point of view (I'm not sure if it is right QwQ). The following assumes that a node will not be killed immediately after it is removed from the Raft group, and that a node replies with its commit index to VoteRequest/PreVoteRequest:

It's the client's duty to track whether the new configuration is alive. A cluster using the Raft lib probably has its own way to track/manage its own state (especially clusters that implement Multi-Raft). In my opinion, it is better not to introduce new mechanisms into the Raft lib to track new configurations if there isn't one already (e.g. tracking the commit index of followers).

To track the state of a new configuration, the client can issue a ReadIndex request after the ConfChange proposal. When the client receives the ReadIndex response, a quorum of the cluster has already applied the log up to the ReadIndex entry, which means the ConfChange entry has also been applied by a quorum of the configuration.

Now the client can decide whether the new configuration is alive by waiting for the ReadIndex response. But that is not enough yet. This method only guarantees that killing the node removed from the Raft group will not cause a liveness problem. If the leader crashes during the configuration change, the ReadIndex response will never be returned to the client. So just issuing a ReadIndex request after the ConfChange is not enough; the client should attach a timeout to the ReadIndex request and retry if it fails.

@gengliqi

To track the state of a new configuration, the client can issue a ReadIndex request after the ConfChange proposal. When the client receives the ReadIndex response, a quorum of the cluster has already applied the log up to the ReadIndex entry, which means the ConfChange entry has also been applied by a quorum of the configuration.

The ReadIndex request only requires the leader to broadcast heartbeats, so it can't be used to verify that a majority of this configuration knows the new commit index. Although the commit index can be piggybacked on the heartbeat message, a follower may respond and then crash without persisting the commit index.
We can use a no-op write command to do what you want. The append invariant ensures that if this command is committed, the preceding configuration entry must be known as committed by a majority of peers.
By the way, TiKV uses a workaround to solve this problem. Since a voter must be demoted first, it becomes a learner before it can be removed from the configuration. After the client knows the learner has been removed, the node can be killed, and there is no liveness problem because of the append invariant mentioned above.
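
A sketch of the no-op approach against a hypothetical proposer interface (not a real etcd/raft or raft-rs API):

```go
package sketch

// proposer is a hypothetical, minimal interface over a raft node.
type proposer interface {
	Propose(data []byte) (index uint64, err error) // replicate an entry, return its log index
	WaitCommitted(index uint64) error              // block until the entry at index is committed
}

// confirmConfChange proposes an empty (no-op) entry after a conf change and
// waits for it to commit. By the append invariant, every peer that stores the
// no-op entry also stores the preceding conf-change entry, so, as argued
// above, committing the no-op implies a majority of the configuration knows
// the conf change is committed.
func confirmConfChange(p proposer) error {
	idx, err := p.Propose(nil)
	if err != nil {
		return err
	}
	return p.WaitCommitted(idx)
}
```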


MrCroxx commented Apr 15, 2021

The ReadIndex request only requires the leader to broadcast heartbeats, so it can't be used to verify that a majority of this configuration knows the new commit index.

Thank you for the reply! I realize I made a mistake: ReadIndex only guarantees that the legitimate leader has advanced its apply index in its own view; it does not guarantee a consistent view of the commit index across the cluster 😭.


stale bot commented Jul 14, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
