raft: consensus protocol design doc #100

Open

wants to merge 32 commits into base: master

Changes from 8 commits

32 commits:
5a7a27b
Create Consensus-Protocol.md
SUMUKHA-PK Mar 29, 2020
df185c3
Update Consensus-Protocol.md
SUMUKHA-PK Mar 29, 2020
9657559
Merge branch 'master' into Consensus-protocol-Design-Doc
tsatke Mar 30, 2020
099498a
Update doc/internal/parser/scanner/Consensus-Protocol.md
SUMUKHA-PK Mar 30, 2020
43e600e
Update Consensus-Protocol.md
SUMUKHA-PK Mar 30, 2020
bd90b1f
Update Consensus-Protocol.md
SUMUKHA-PK Mar 31, 2020
7d26490
Update Consensus-Protocol.md
SUMUKHA-PK Apr 1, 2020
5923f4f
Update Consensus-Protocol.md
SUMUKHA-PK Apr 2, 2020
02239c8
Update Consensus-Protocol.md
SUMUKHA-PK Apr 3, 2020
2b900c0
Update Consensus-Protocol.md
SUMUKHA-PK Apr 4, 2020
8527df5
Merge branch 'master' of https://github.com/tomarrell/lbadd into Cons…
SUMUKHA-PK Apr 10, 2020
b18332a
Moved doc to appropriate folder
SUMUKHA-PK Apr 10, 2020
b11486e
Update Consensus-Protocol.md
SUMUKHA-PK May 1, 2020
4e559a5
Merge branch 'master' into Consensus-protocol-Design-Doc
SUMUKHA-PK May 6, 2020
7eef110
Update Consensus-Protocol.md
SUMUKHA-PK May 6, 2020
9b209f1
Merge branch 'master' into Consensus-protocol-Design-Doc
SUMUKHA-PK May 6, 2020
17a8e1a
Merge branch 'master' into Consensus-protocol-Design-Doc
SUMUKHA-PK May 7, 2020
1cb7555
Merge branch 'master' into Consensus-protocol-Design-Doc
SUMUKHA-PK May 7, 2020
1209205
Update doc/internal/consensus/Consensus-Protocol.md
SUMUKHA-PK May 7, 2020
8d25041
Apply suggestions from code review
SUMUKHA-PK May 7, 2020
01d0c32
Apply suggestions from code review
SUMUKHA-PK May 7, 2020
d354ab9
Apply suggestions from code review
SUMUKHA-PK May 7, 2020
2e71890
Merge branch 'master' into Consensus-protocol-Design-Doc
SUMUKHA-PK May 8, 2020
2a8c4d9
Update Consensus-Protocol.md
SUMUKHA-PK May 8, 2020
c32828c
Update Consensus-Protocol.md
SUMUKHA-PK May 8, 2020
6815e8c
Update Consensus-Protocol.md
SUMUKHA-PK May 8, 2020
281906e
Update doc/internal/consensus/Consensus-Protocol.md
SUMUKHA-PK May 8, 2020
b180d33
Update doc/internal/consensus/Consensus-Protocol.md
SUMUKHA-PK May 8, 2020
981d873
Merge branch 'master' into Consensus-protocol-Design-Doc
SUMUKHA-PK May 8, 2020
fd1f272
Update Consensus-Protocol.md
SUMUKHA-PK May 10, 2020
a16b909
Update Consensus-Protocol.md
SUMUKHA-PK May 17, 2020
c312812
Merge branch 'master' into Consensus-protocol-Design-Doc
tsatke Jul 19, 2020
45 changes: 27 additions & 18 deletions doc/internal/consensus/Consensus-Protocol.md
@@ -2,9 +2,9 @@

Before talking about consensus, we need to discuss some logistics based on how the systems can co-exist.

* Communication: Distributed systems need a way to communicate with each other. Remote Procedure Calls (RPCs) are the mechanism by which one standalone server talks to another. The existing `network` layer of the database will handle all communication between servers; however, the messages to be passed are decided by the raft layer.
* Security: Access control mechanisms need to be in place to decide which functions of a server may be accessed, based on its state (leader, follower, candidate).
SUMUKHA-PK marked this conversation as resolved.
* Routing to leader: One of the issues with a varying leader is that clients need to know which IP address to contact for the service. We can solve this problem by advertising any/all IPs of the cluster; a node that is not the leader answers with the IP of the current leader.
Contributor:
I don't understand. Any node knows all IPs and can answer that question. Why should follower nodes only respond with the leader IP?

Collaborator Author:
At any point in time when raft is in the "working phase" (i.e. the leader is up and no election is happening), there are only 2 kinds of nodes: the leader and the followers. So if the client doesn't hit the leader, it means it hit a follower.

* The servers will be implemented in the `internal/node` folders, which will import the raft API and perform their functions.

Maintaining consensus is one of the major parts of a distributed system. To know that we have achieved a stable system, we need the following two parts of the implementation.
@@ -13,41 +13,43 @@

The raft protocol will be implemented in `internal/raft` and will expose APIs that each node can call.

Raft is an algorithm for managing a replicated log; we maintain the "log" of the SQL statements applied on a DB and thereby have a completely replicated cluster.

#### General Implementation basics:
* All RPC calls are made in parallel to obtain the best performance.
* Request retries are done in case of network failures.
* Raft does not assume network preserves ordering of the packets.
* The `raft/raft.go` file has all the state parameters and general functions that raft uses.
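
To make the state parameters mentioned above concrete, here is a minimal sketch of what `raft/raft.go` could hold. The field names follow the raft paper and are an assumption for illustration, not the final lbadd API.

```go
// Sketch only: names follow the raft paper, not necessarily lbadd's final API.
package raft

// State is the role a node currently plays in the cluster.
type State int

const (
	Follower State = iota
	Candidate
	Leader
)

// LogEntry is a single replicated command plus the term in which the leader
// appended it.
type LogEntry struct {
	Term    int32
	Command []byte // e.g. a serialized SQL statement
}

// Node bundles the persistent and volatile raft state described in the paper.
type Node struct {
	ID    string
	State State

	// Persistent state on all servers.
	CurrentTerm int32
	VotedFor    string // candidate voted for in the current term, "" if none
	Log         []LogEntry

	// Volatile state on all servers.
	CommitIndex int32 // highest log index known to be committed
	LastApplied int32 // highest log index applied to the state machine

	// Volatile state on the leader, reinitialised after every election.
	NextIndex  map[string]int32 // per follower: next log index to send
	MatchIndex map[string]int32 // per follower: highest replicated index
}
```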

A raft server may be in any of 3 states: leader, follower, or candidate. All requests are serviced through the leader, which then decides whether and how the logs must be replicated on the follower machines. The raft protocol has 3 almost independent modules:
tsatke marked this conversation as resolved.
1. Leader Election
2. Log Replication
3. Safety

A detailed description of all the modules and their implementation follows:

### Leader Election

#### Spec
* Startup: All servers start in the follower state and begin by requesting votes to be elected as a leader.
tsatke marked this conversation as resolved.
* Pre-election: The server increments its `currentTerm` by one, changes to the `candidate` state and sends out `RequestVotes` RPCs in parallel to all the peers.
* Vote condition: first-come-first-served (FCFS) basis. If no vote request has reached the server, it votes for itself.
* Election timeout: A preset time for which the server waits to see if a peer requested a vote. It is randomly chosen between 150-300ms.
* The election is repeated after an election timeout until one of the following happens:
  1. The server wins the election.
  2. A peer establishes itself as leader.
  3. The election timer times out with no winner (e.g. a split vote), and the process is repeated.
* Election win: a majority of the votes in the term. (More details in Safety.) The state of the winner is now `Leader` and the others are `Followers`.
* The term problem: Current terms are exchanged whenever servers communicate; if one server's current term is smaller than the other's, then it updates its current term to the larger value. If a candidate or leader discovers that its term is out of date, it immediately reverts to the follower state. If a server receives a request with a stale term number, it rejects the request. Any term that is not the current term is considered _out of date_. (A small sketch of this rule follows after this list.)
* Maintaining the leader's reign: The leader sends `heartbeats` to all servers to establish its reign. Based on the responses, this also checks whether the other servers are alive, and it informs the other servers that the leader is still alive as well. If the servers do not get timely heartbeat messages, they transition from the `follower` state to the `candidate` state.
* The transition from the leader's normal working state to an election happens when the leader fails.
* Maintaining sanity: While waiting for votes, if an `AppendEntriesRPC` is received by the server and the term of the sender claiming to be leader is greater than or equal to the waiting node's term, that leader is considered legitimate and the waiting node becomes its follower. If the term of the leader is smaller, the request is rejected.
* The split vote problem: Though not that common, split votes can occur. To make sure this doesn't continue indefinitely, election timeouts are randomised, making repeated split votes less probable.
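
The term rule and the randomised election timeout above translate into two small helpers. The following is a minimal sketch that continues the `Node` sketch from earlier (and assumes `math/rand` and `time` are imported); it is illustrative, not the project's final code.

```go
// observeTerm applies the term rule: on seeing a higher term in any RPC the
// node adopts it and reverts to follower; it reports whether the incoming
// term is stale and the request should therefore be rejected.
func (n *Node) observeTerm(rpcTerm int32) (stale bool) {
	if rpcTerm > n.CurrentTerm {
		n.CurrentTerm = rpcTerm
		n.VotedFor = ""
		n.State = Follower
	}
	return rpcTerm < n.CurrentTerm
}

// electionTimeout picks a random duration in the 150-300ms window so that
// split votes are unlikely to repeat.
func electionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(150)) * time.Millisecond
}
```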


#### Implementation

* The implementation of leader election will live in `raft/leaderElection.go`, request votes in `raft/requestVotes.go` and append entries in `appendEntries.go`.
SUMUKHA-PK marked this conversation as resolved.
* The raft module will provide a `StartElection` function that enables a node to begin an election. This function just begins the election and doesn't return any result of the election; the outcome is decided by the votes and handled by each server independently (a sketch of the intended shape follows below).
* The leader node is initially the only one to know it is the leader: it realises it has obtained the majority of votes and starts behaving like the leader node. During this period, the other nodes wait for a possible leader and proceed in the candidate state by advancing to the next term unless the leader contacts them.
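
As a hedged sketch of how `StartElection` could be shaped, continuing the earlier sketch: the message types and the `send`/`onVote` callbacks are placeholders for whatever the raft layer and the existing `network` layer end up defining.

```go
// Placeholder message shapes for the RequestVotes RPC.
type RequestVotesRequest struct {
	Term         int32
	CandidateID  string
	LastLogIndex int32
	LastLogTerm  int32
}

type RequestVotesResponse struct {
	Term        int32
	VoteGranted bool
}

// StartElection begins an election for the next term. It only fires off the
// RequestVotes RPCs in parallel and does not report the outcome; onVote is
// invoked per response so that each node can tally votes independently.
func (n *Node) StartElection(peers []string,
	send func(peer string, req RequestVotesRequest) (RequestVotesResponse, error),
	onVote func(RequestVotesResponse)) {

	n.CurrentTerm++
	n.State = Candidate
	n.VotedFor = n.ID // the candidate votes for itself (FCFS)

	req := RequestVotesRequest{Term: n.CurrentTerm, CandidateID: n.ID}
	if last := len(n.Log) - 1; last >= 0 {
		req.LastLogIndex = int32(last)
		req.LastLogTerm = n.Log[last].Term
	}

	for _, peer := range peers {
		go func(peer string) {
			if resp, err := send(peer, req); err == nil {
				onVote(resp)
			}
			// Network errors fall through to the transport's retry logic.
		}(peer)
	}
}
```

Firing the RPCs from goroutines keeps the call non-blocking, which matches the rule that `StartElection` only begins the election.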

@@ -62,16 +64,18 @@
* A committed entry: When a leader decides that the log entry is safe to apply to other state machines, that entry is called committed. All committed entries are durable and _will eventually be executed_ by all state machines.
* An entry -> committed entry: A log entry is called committed once it is replicated on a majority of the servers in the cluster. Once an entry is committed, it commits all the previous entries in the leader's log, including entries created by previous leaders.
* The leader keeps track of the highest index it knows to be committed, and this index is included in all future `AppendEntriesRPC`s (including heartbeats) to inform the other servers.
* There is some ambiguity about log committing: "A log entry is committed once the leader that created the entry has replicated it on a majority of the servers" and "Once a follower learns that a log entry is committed, it applies the entry to its local state machine (in log order)" do not make clear whether replicating and applying to the state machine are the same. If they are, it is something of a contradiction; otherwise "application" can mean executing the statement in the DB in our case.
* Log matching property:
  * If two entries in different logs have the same index and term, then they store the same command.
  * If two entries in different logs have the same index and term, then the logs are identical in all preceding entries.
* When sending an `AppendEntriesRPC`, the leader includes the index and term of the entry in its log that immediately precedes the new entries. If the follower does not find an entry in its log with the same index and term, it refuses the new entries. This helps in log matching, which implies that a successful `AppendEntriesRPC` means a synced log.
* Leader crashes inducing inconsistencies in logs: In Raft, the leader handles inconsistencies by forcing the followers' logs to duplicate its own (the leader's). To do so, the leader must find the latest log entry where the two logs agree, delete any entries in the follower's log after that point, and send the follower all of the leader's entries after that point. All of these actions happen in response to the consistency check performed by AppendEntries RPCs. Meaning, the leader checks for consistency by maintaining a `nextIndex` value, dropping it down and sending `AppendEntriesRPC`s (which perform the consistency check and fail unless the logs agree) until a success is returned; a sketch of this back-off loop follows after this list. These consistency-check `AppendEntries` RPCs can be empty to save bandwidth (BW). The follower can also help here by sending the smallest agreeing index in the first RPC instead of the leader probing until it reaches the index.
* Leader append-only property: A leader never overwrites or deletes entries in its own log.
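
The `nextIndex` back-off described above could look roughly like the following on the leader's side. The `AppendEntries` message shapes are again placeholders, and the loop assumes `NextIndex`/`MatchIndex` were initialised when the leader was elected.

```go
// Placeholder message shapes for the AppendEntries RPC. An empty Entries
// slice doubles as a heartbeat / consistency probe.
type AppendEntriesRequest struct {
	Term         int32
	LeaderID     string
	PrevLogIndex int32
	PrevLogTerm  int32
	Entries      []LogEntry
	LeaderCommit int32
}

type AppendEntriesResponse struct {
	Term    int32
	Success bool
}

// replicateTo keeps lowering nextIndex for one follower until an
// AppendEntriesRPC succeeds (i.e. the consistency check passes) and then
// ships everything the follower is missing from that point on.
func (n *Node) replicateTo(peer string,
	send func(peer string, req AppendEntriesRequest) (AppendEntriesResponse, error)) error {

	for {
		next := n.NextIndex[peer]
		req := AppendEntriesRequest{
			Term:         n.CurrentTerm,
			LeaderID:     n.ID,
			PrevLogIndex: next - 1,
			Entries:      n.Log[next:],
			LeaderCommit: n.CommitIndex,
		}
		if req.PrevLogIndex >= 0 {
			req.PrevLogTerm = n.Log[req.PrevLogIndex].Term
		}

		resp, err := send(peer, req)
		if err != nil {
			return err // transport-level retries happen outside this sketch
		}
		if resp.Term > n.CurrentTerm {
			n.observeTerm(resp.Term) // a newer term exists: step down
			return nil
		}
		if resp.Success {
			n.NextIndex[peer] = int32(len(n.Log))
			n.MatchIndex[peer] = int32(len(n.Log)) - 1
			return nil
		}
		// Consistency check failed: step one entry back and probe again.
		n.NextIndex[peer]--
	}
}
```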

#### Implementation

* The log replication implementation will live in the `raft/logReplication.go` file.
SUMUKHA-PK marked this conversation as resolved.

### Safety

This module ensures that the above protocol runs as expected, eliminating the corner cases.
@@ -86,7 +90,7 @@

#### Implementation

### Client interaction
* Idempotency is maintained by having a unique client ID and a request ID. The same request by the same client cannot be executed twice; we assume here that duplicates arise because the client didn't receive the response due to network errors etc. Each server maintains a _session_ for each client. The session tracks the latest serial number processed for the client, along with the associated response (a sketch follows after this list). If a server receives a command whose serial number has already been executed, it responds immediately without re-executing the request.
* With each request, the client includes the lowest sequence number for which it has not yet received a response, and the state machine then discards all responses for lower sequence numbers. Quite similar to TCP.
* The sessions for a client are _open on all nodes_. Session expiry happens on all nodes at once and in a deterministic way. An LRU policy or an upper bound on the number of sessions can be used. Some form of periodic probing is done to remove stale sessions.
@@ -98,7 +102,12 @@
* The leader waits for its state machine to advance at least as far as the readIndex; this is current enough to satisfy linearizability.
* Finally, the leader issues the query against its state machine and replies to the client with the results.
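
A minimal sketch of the per-client session bookkeeping described above. For brevity it keeps only the most recent response per client and leaves the expiry policy (LRU or an upper bound) open; all names are illustrative, not an existing API.

```go
// session tracks, per client, the latest serial number that was applied and
// the response produced for it, so a retried request is answered without
// being re-executed.
type session struct {
	lastSerial int64
	lastResult []byte
}

type sessionTable struct {
	sessions map[string]*session // keyed by client ID
}

func newSessionTable() *sessionTable {
	return &sessionTable{sessions: make(map[string]*session)}
}

// apply runs a command at most once per (clientID, serial) pair; exec is
// whatever actually executes the statement against the database.
func (t *sessionTable) apply(clientID string, serial int64, cmd []byte,
	exec func([]byte) []byte) []byte {

	s, ok := t.sessions[clientID]
	if !ok {
		s = &session{lastSerial: -1}
		t.sessions[clientID] = s
	}
	if serial <= s.lastSerial {
		// Duplicate request: answer from the session, do not re-execute.
		return s.lastResult
	}
	s.lastResult = exec(cmd)
	s.lastSerial = serial
	return s.lastResult
}
```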

### How the modules interact

* Leader election is called by every node at init.
* All nodes send a `RequestVotesRPC` in parallel to all other nodes.
* If a leader is elected, the leader begins to send `AppendEntriesRPC` to other nodes (followers) to establish its leadership.
* Log replication occurs when the `AppendEntriesRPC` is received by the follower.
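
Pulling the pieces together, a single node's main loop could be shaped like the sketch below, reusing the types and helpers from the earlier sketches. The 50ms heartbeat interval, the callback wiring and the absence of locking are simplifying assumptions.

```go
// run sketches how the modules interact on one node: followers wait for
// heartbeats, start an election when none arrive in time, and a leader keeps
// sending (possibly empty) AppendEntriesRPCs, which double as the log
// replication path.
func (n *Node) run(peers []string,
	sendVote func(string, RequestVotesRequest) (RequestVotesResponse, error),
	onVote func(RequestVotesResponse),
	sendAppend func(string, AppendEntriesRequest) (AppendEntriesResponse, error),
	heartbeat <-chan struct{}) {

	for {
		switch n.State {
		case Leader:
			// Establish the reign and replicate the log in one go.
			for _, peer := range peers {
				go n.replicateTo(peer, sendAppend)
			}
			time.Sleep(50 * time.Millisecond) // assumed heartbeat interval
		default: // Follower or Candidate
			select {
			case <-heartbeat:
				// A timely heartbeat: the leader is alive, remain a
				// follower (term checks omitted in this sketch).
				n.State = Follower
			case <-time.After(electionTimeout()):
				// No heartbeat within the election timeout: advance the
				// term and request votes from all peers in parallel.
				n.StartElection(peers, sendVote, onVote)
			}
		}
	}
}
```
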
## A strict testing mechanism

The testing mechanism to be implemented will help us find the problems in the implementation, leading to a more resilient system.
@@ -121,7 +130,7 @@

## Appendix

* The difference between _commit_, _replicate_ and _apply_ with respect to raft: Applying means letting the log entry run through the node's state machine; this is the end of the process and happens after a commit. A commit happens once replication has occurred on a majority of the nodes, while replication is simply the appending of a log entry on _one_ node.
* Some gotchas:
* Client connects to the leader and leader crashes -> reject the connection. Let the client connect when the new leader is established.
* Some sort of idempotency must be maintained w.r.t. the client-cluster communication. Multiple requests submitted by the client should not cause problems due to network errors.