Update Raft election, lease intervals, etc. (#16789)
Fixes DOC-6375, DOC-7125

Summary of changes:

- Update 'Distribution Layer' page with the updated network timeout and
  mention the `COCKROACH_NETWORK_TIMEOUT` env var that controls it

- Update 'Replication Layer' page with:

  - A table listing all of the relevant intervals, their values, and how
    they can be controlled (if possible)

  - More links to the table wherever the values in that table are referred to

- Finally, all of the intervals and other constants are stored in
  variables so their values can be referenced and updated in one place
  going forward
rmloveland authored May 1, 2023
1 parent 44fd97a commit f1ec99a
Showing 5 changed files with 43 additions and 11 deletions.
13 changes: 13 additions & 0 deletions _data/constants.yml
@@ -0,0 +1,13 @@
cockroach_range_lease_duration: 6s
cockroach_range_lease_acquisition_timeout: 4s

cockroach_network_timeout: 2s
cockroach_network_connection_timeout: 4s

cockroach_network_client_ping_interval: 1s
cockroach_network_client_ping_timeout: 6s

cockroach_raft_election_timeout_ticks: 4
cockroach_raft_reproposal_timeout_ticks: 6

cockroach_raft_tick_interval: 500ms
6 changes: 5 additions & 1 deletion v23.1/architecture/distribution-layer.md
@@ -87,7 +87,11 @@ In relationship to other layers in CockroachDB, the distribution layer:

gRPC is the software nodes use to communicate with one another. Because the distribution layer is the first layer to communicate with other nodes, CockroachDB implements gRPC here.

gRPC requires inputs and outputs to be formatted as protocol buffers (protobufs). To leverage gRPC, CockroachDB implements a protocol-buffer-based API defined in `api.proto`.
gRPC requires inputs and outputs to be formatted as protocol buffers (protobufs). To leverage gRPC, CockroachDB implements a protocol-buffer-based API.

By default, the network timeout for RPC connections between nodes is {{site.data.constants.cockroach_network_timeout}}, with a connection timeout of {{site.data.constants.cockroach_network_connection_timeout}}, in order to reduce unavailability and tail latencies during infrastructure outages. This can be changed via the environment variable `COCKROACH_NETWORK_TIMEOUT`, which defaults to {{site.data.constants.cockroach_network_timeout}}.

Note that the gRPC network timeouts described above are defined from the server's point of view, not the client's (the client being any other node trying to make an inbound connection). The client sends an additional RPC ping every {{site.data.constants.cockroach_network_client_ping_interval}} to check that connections are alive. These pings have a fairly high timeout of {{site.data.constants.cockroach_network_client_ping_timeout}} (a multiple of `COCKROACH_NETWORK_TIMEOUT`) because they are subject to contention with other RPC traffic; e.g., during a large [`IMPORT`](../import.html) they can be delayed by several seconds.
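
To make the relationship between these values concrete, the following is a minimal Go sketch, not CockroachDB's actual implementation: it reads `COCKROACH_NETWORK_TIMEOUT` and derives the related timeouts using the default multipliers implied by the values above. The helper name `timeoutFromEnv` and the exact multipliers are illustrative assumptions.

```go
// Illustrative sketch only: shows how the RPC timeout hierarchy above fits
// together. The base network timeout comes from COCKROACH_NETWORK_TIMEOUT,
// and the connection and client ping timeouts are larger multiples of it so
// that a single slow heartbeat does not immediately drop a healthy connection.
package main

import (
	"fmt"
	"os"
	"time"
)

// timeoutFromEnv returns the duration stored in the named environment
// variable, or def if the variable is unset or unparsable.
func timeoutFromEnv(name string, def time.Duration) time.Duration {
	if v := os.Getenv(name); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return def
}

func main() {
	networkTimeout := timeoutFromEnv("COCKROACH_NETWORK_TIMEOUT", 2*time.Second)

	// Derived values discussed in the text; the multipliers are assumptions
	// chosen to match the default values listed above.
	connectionTimeout := 2 * networkTimeout // 4s by default
	clientPingInterval := 1 * time.Second   // how often the client pings
	clientPingTimeout := 3 * networkTimeout // 6s by default; tolerant of contention

	fmt.Println("network timeout:    ", networkTimeout)
	fmt.Println("connection timeout: ", connectionTimeout)
	fmt.Println("client ping every:  ", clientPingInterval, "with timeout", clientPingTimeout)
}
```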

For more information about gRPC, see the [official gRPC documentation](http://www.grpc.io/docs/guides/).

23 changes: 19 additions & 4 deletions v23.1/architecture/replication-layer.md
@@ -28,16 +28,18 @@ In relationship to other layers in CockroachDB, the replication layer:
- Receives requests from and sends responses to the distribution layer.
- Writes accepted requests to the storage layer.

## Components
## Technical details and components

### Raft

Raft is a consensus protocol––an algorithm which makes sure that your data is safely stored on multiple machines, and that those machines agree on the current state even if some of them are temporarily disconnected.

Raft organizes all nodes that contain a [replica](overview.html#architecture-replica) of a [range](overview.html#architecture-range) into a group--unsurprisingly called a Raft group. Each replica in a Raft group is either a "leader" or a "follower". The leader, which is elected by Raft and long-lived, coordinates all writes to the Raft group. It heartbeats followers periodically and keeps their logs replicated. In the absence of heartbeats, followers become candidates after randomized election timeouts and proceed to hold new leader elections.
Raft organizes all nodes that contain a [replica](overview.html#architecture-replica) of a [range](overview.html#architecture-range) into a group--unsurprisingly called a Raft group. Each replica in a Raft group is either a "leader" or a "follower". The leader, which is elected by Raft and long-lived, coordinates all writes to the Raft group. It heartbeats followers periodically and keeps their logs replicated. In the absence of heartbeats, followers become candidates after [randomized election timeouts](#important-values-and-timeouts) and proceed to hold new leader elections.

A third replica type, the "non-voting" replica, does not participate in Raft elections, but is useful for unlocking use cases that require low-latency multi-region reads. For more information, see [Non-voting replicas](#non-voting-replicas).

For the current values of the Raft election timeout, the Raft reproposal timeout, and other important intervals, see [Important values and timeouts](#important-values-and-timeouts).

Once a node receives a `BatchRequest` for a range it contains, it converts those KV operations into Raft commands. Those commands are proposed to the Raft group leader––which is what makes it ideal for the [leaseholder](#leases) and the Raft leader to be one and the same––and written to the Raft log.
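
To illustrate the follower behavior described above, here is a simplified Go sketch, not CockroachDB's implementation: a follower resets its election timer on each heartbeat and becomes a candidate if a randomized timeout (1-2x the base election timeout, per the table below) elapses first. The randomization is what makes simultaneous candidacies, and therefore election ties, unlikely.

```go
// Simplified sketch of a Raft follower's election timer (illustrative only).
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// randomizedElectionTimeout scales the base timeout by a random factor in [1, 2).
func randomizedElectionTimeout(base time.Duration) time.Duration {
	return base + time.Duration(rand.Int63n(int64(base)))
}

// runFollower waits for leader heartbeats; if none arrives before the
// randomized timeout, the follower becomes a candidate.
func runFollower(heartbeats <-chan struct{}, base time.Duration) {
	for {
		timer := time.NewTimer(randomizedElectionTimeout(base))
		select {
		case <-heartbeats:
			// Leader is alive; restart the countdown.
			timer.Stop()
		case <-timer.C:
			fmt.Println("no heartbeat before timeout: becoming a candidate")
			return
		}
	}
}

func main() {
	heartbeats := make(chan struct{})
	go func() {
		// Simulate a leader that heartbeats twice and then fails.
		for i := 0; i < 2; i++ {
			time.Sleep(500 * time.Millisecond)
			heartbeats <- struct{}{}
		}
	}()
	runFollower(heartbeats, 2*time.Second) // base timeout, e.g. 4 ticks x 500ms
}
```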

For a great overview of Raft, we recommend [The Secret Lives of Data](http://thesecretlivesofdata.com/raft/).
@@ -138,7 +140,7 @@ However, unlike table data, system ranges cannot use epoch-based leases because

When the cluster needs to access a range on a leaseholder node that is dead, that range's lease must be transferred to a healthy node. This process is as follows:

1. The dead node's liveness record, which is stored in a system range, has an expiration time of 9 seconds, and is heartbeated every 4.5 seconds. When the node dies, the amount of time the cluster has to wait for the record to expire varies, but on average is 6.75 seconds.
1. The dead node's liveness record, which is stored in a system range, has an expiration time of `{{site.data.constants.cockroach_range_lease_duration}}`, and is heartbeated at half that interval (`{{site.data.constants.cockroach_range_lease_duration}} / 2`). When the node dies, the amount of time the cluster has to wait for the record to expire varies, but should be no more than a few seconds.
1. A healthy node attempts to acquire the lease. This is rejected because lease acquisition can only happen on the Raft leader, which the healthy node is not (yet). Therefore, a Raft election must be held.
1. The rejected attempt at lease acquisition [unquiesces](../ui-replication-dashboard.html#replica-quiescence) ("wakes up") the range associated with the lease.
1. What happens next depends on whether the lease is on [table data](#epoch-based-leases-table-data) or [meta ranges or system ranges](#expiration-based-leases-meta-and-system-ranges):
@@ -147,7 +149,7 @@ When the cluster needs to access a range on a leaseholder node that is dead, tha
1. The Raft election is held and a new leader is chosen from among the healthy nodes.
1. The lease acquisition can now be processed by the newly elected Raft leader.

This process should take no more than 9 seconds for liveness expiration plus the cost of 2 network roundtrips: 1 for Raft leader election, and 1 for lease acquisition.
This process should take no more than a few seconds for liveness expiration plus the cost of 2 network roundtrips: 1 for Raft leader election, and 1 for lease acquisition.
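
As a rough illustration of why the liveness-expiration wait is only a few seconds, here is a back-of-the-envelope Go sketch. It assumes the default lease duration and heartbeat interval from the [Important values and timeouts](#important-values-and-timeouts) table and a node failure at a uniformly random point in its heartbeat cycle; the arithmetic, not the code, is the point.

```go
// Back-of-the-envelope calculation of the liveness-expiration wait
// (illustrative only), using the default values documented below.
package main

import (
	"fmt"
	"time"
)

func main() {
	leaseDuration := 6 * time.Second       // cockroach_range_lease_duration
	heartbeatInterval := leaseDuration / 2 // heartbeated at half the lease interval

	// If the node fails at a uniformly random point in its heartbeat cycle,
	// the remaining time until the liveness record expires ranges from
	// (leaseDuration - heartbeatInterval) up to leaseDuration.
	minWait := leaseDuration - heartbeatInterval
	maxWait := leaseDuration
	avgWait := (minWait + maxWait) / 2

	fmt.Printf("liveness expiration wait: min %v, max %v, average %v\n",
		minWait, maxWait, avgWait) // min 3s, max 6s, average 4.5s
}
```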

Finally, note that the process described above is lazily initiated: it only occurs when a new request comes in for the range associated with the lease.

@@ -207,6 +209,19 @@ Rebalancing is achieved by using a snapshot of a replica from the leaseholder, a

In addition to the rebalancing that occurs when nodes join or leave a cluster, replicas are also rebalanced automatically based on the relative load across the nodes within a cluster. For more information, see the `kv.allocator.load_based_rebalancing` and `kv.allocator.qps_rebalance_threshold` [cluster settings](../cluster-settings.html). Note that depending on the needs of your deployment, you can exercise additional control over the location of leases and replicas by [configuring replication zones](../configure-replication-zones.html).

### Important values and timeouts

The following table lists some important values used by CockroachDB's replication layer:

Constant | Default value | Notes
---------|---------------|------
[Raft](#raft) election timeout | {{site.data.constants.cockroach_raft_election_timeout_ticks}} * {{site.data.constants.cockroach_raft_tick_interval}} | Controlled by `COCKROACH_RAFT_ELECTION_TIMEOUT_TICKS`, which is multiplied by the Raft tick interval to determine the timeout value. This value is then multiplied by a random factor of 1-2 to avoid election ties.
[Raft](#raft) reproposal timeout | {{site.data.constants.cockroach_raft_reproposal_timeout_ticks}} * {{site.data.constants.cockroach_raft_tick_interval}} | Controlled by `COCKROACH_RAFT_REPROPOSAL_TIMEOUT_TICKS`, which is multiplied by the Raft tick interval to determine the value.
[Lease interval](#how-leases-are-transferred-from-a-dead-node) | {{site.data.constants.cockroach_range_lease_duration}} | Controlled by `COCKROACH_RANGE_LEASE_DURATION`.
[Lease acquisition timeout](#how-leases-are-transferred-from-a-dead-node) | {{site.data.constants.cockroach_range_lease_acquisition_timeout}} |
[Node heartbeat interval](#how-leases-are-transferred-from-a-dead-node) | {{site.data.constants.cockroach_range_lease_duration}} / 2 | Used to determine if you're having [node liveness issues](../cluster-setup-troubleshooting.html#node-liveness-issues). This is calculated as one half of the lease interval.
Raft tick interval | {{site.data.constants.cockroach_raft_tick_interval}} | Controlled by `COCKROACH_RAFT_TICK_INTERVAL`. Used to calculate various replication-related timeouts.
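
As an illustration only (not CockroachDB source code), the following Go sketch shows how the tick-based settings in the table combine into wall-clock timeouts using the default values above.

```go
// Illustrative calculation of the derived Raft timeouts: tick counts are
// multiplied by the Raft tick interval, and the election timeout is further
// randomized by a factor of 1-2 to avoid election ties.
package main

import (
	"fmt"
	"time"
)

func main() {
	tickInterval := 500 * time.Millisecond // cockroach_raft_tick_interval
	electionTimeoutTicks := 4              // COCKROACH_RAFT_ELECTION_TIMEOUT_TICKS
	reproposalTimeoutTicks := 6            // COCKROACH_RAFT_REPROPOSAL_TIMEOUT_TICKS

	electionTimeout := time.Duration(electionTimeoutTicks) * tickInterval
	reproposalTimeout := time.Duration(reproposalTimeoutTicks) * tickInterval

	fmt.Printf("base election timeout: %v (randomized to %v-%v per replica)\n",
		electionTimeout, electionTimeout, 2*electionTimeout) // 2s, randomized 2s-4s
	fmt.Printf("reproposal timeout: %v\n", reproposalTimeout) // 3s
}
```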

## Interactions with other layers

### Replication and distribution layers
10 changes: 5 additions & 5 deletions v23.1/cluster-setup-troubleshooting.md
@@ -296,7 +296,7 @@ The reason this happens is as follows:
- Username and password information is stored in a system range.
- Since all system ranges are located [near the beginning of the keyspace](architecture/distribution-layer.html#monolithic-sorted-map-structure), the system range containing the username/password info can sometimes be colocated with another system range that is used to determine [node liveness](#node-liveness-issues).
- If the username/password info and the node liveness record are stored together as described above, it can take extra time for the lease on this range to be transferred to another node. Normally, lease transfers take about 10 seconds, but in this case it may require multiple rounds of consensus to determine that the node in question is actually dead (the node liveness record check may be retried several times before failing).
- If the username/password info and the node liveness record are stored together as described above, it can take extra time for the lease on this range to be transferred to another node. Normally, [lease transfers take a few seconds](architecture/replication-layer.html#how-leases-are-transferred-from-a-dead-node), but in this case it may require multiple rounds of consensus to determine that the node in question is actually dead (the node liveness record check may be retried several times before failing).
For more information about how lease transfers work when a node dies, see [How leases are transferred from a dead node](architecture/replication-layer.html#how-leases-are-transferred-from-a-dead-node).
@@ -561,15 +561,15 @@ For more information about how node liveness works, see [Replication Layer](arch
#### Impact of node failure is greater than 10 seconds
When the cluster needs to access a range on a leaseholder node that is dead, that range's [lease must be transferred to a healthy node](architecture/replication-layer.html#how-leases-are-transferred-from-a-dead-node). In theory, this process should take no more than 9 seconds for liveness expiration plus the cost of several network roundtrips.
When the cluster needs to access a range on a leaseholder node that is dead, that range's [lease must be transferred to a healthy node](architecture/replication-layer.html#how-leases-are-transferred-from-a-dead-node). In theory, this process should take no more than a few seconds for liveness expiration plus the cost of several network roundtrips.
In production, lease transfer upon node failure can take longer than expected. In {{ page.version.version }}, this is observed in the following scenarios:
- **The leaseholder node for the liveness range fails.** The liveness range is a system range that [stores the liveness record](architecture/replication-layer.html#epoch-based-leases-table-data) for each node on the cluster. If a node fails and is also the leaseholder for the liveness range, operations cannot proceed until the liveness range is transferred to a new leaseholder and the liveness record is made available to other nodes. This can cause momentary cluster unavailability.
- **The leaseholder node for the liveness range fails.** The liveness range is a system range that [stores the liveness record](architecture/replication-layer.html#epoch-based-leases-table-data) for each node on the cluster. If a node fails and is also the leaseholder for the liveness range, operations cannot proceed until the liveness range is [transferred to a new leaseholder](architecture/replication-layer.html#how-leases-are-transferred-from-a-dead-node) and the liveness record is made available to other nodes. This can cause momentary cluster unavailability.
- **Network or DNS issues cause connection issues between nodes.** If there is no live server for the IP address or DNS lookup, connection attempts to a node will not return an immediate error, but will hang until timing out. This can cause unavailability and prevent a speedy movement of leases and recovery. CockroachDB avoids contacting unresponsive nodes or DNS during certain performance-critical operations, and the connection issue should generally resolve in 10-30 seconds. However, an attempt to contact an unresponsive node could still occur in other scenarios that are not yet addressed.
- **Network or DNS issues cause connection issues between nodes.** If there is no live server for the IP address or DNS lookup, connection attempts to a node will not return an immediate error, but will hang [until timing out](architecture/distribution-layer.html#grpc). This can cause unavailability and prevent a speedy movement of leases and recovery. CockroachDB avoids contacting unresponsive nodes or DNS during certain performance-critical operations, and the connection issue should generally resolve in 10-30 seconds. However, an attempt to contact an unresponsive node could still occur in other scenarios that are not yet addressed.
- **A node's disk stalls.** A [disk stall](#disk-stalls) on a node can cause write operations to stall indefinitely, also causes the node's heartbeats to fail since the storage engine cannot write to disk as part of the heartbeat, and may cause read requests to fail if they are waiting for a conflicting write to complete. Lease acquisition from this node can stall indefinitely until the node is shut down or recovered. Pebble detects most stalls and will terminate the `cockroach` process after 20 seconds, but there are gaps in its detection. In v22.1.2+ and v22.2+, each lease acquisition attempt on an unresponsive node times out after 6 seconds. However, CockroachDB can still appear to stall as these timeouts are occurring.
- **A node's disk stalls.** A [disk stall](#disk-stalls) on a node can cause write operations to stall indefinitely, also causes the node's heartbeats to fail since the storage engine cannot write to disk as part of the heartbeat, and may cause read requests to fail if they are waiting for a conflicting write to complete. Lease acquisition from this node can stall indefinitely until the node is shut down or recovered. Pebble detects most stalls and will terminate the `cockroach` process after 20 seconds, but there are gaps in its detection. In v22.1.2+ and v22.2+, each lease acquisition attempt on an unresponsive node [times out after a few seconds](architecture/replication-layer.html#how-leases-are-transferred-from-a-dead-node). However, CockroachDB can still appear to stall as these timeouts are occurring.
- **Otherwise unresponsive nodes.** Internal deadlock due to faulty code, resource exhaustion, OS/hardware issues, and other arbitrary failures can make a node unresponsive. This can cause leases to become stuck in certain cases, such as when a response from the previous leaseholder is needed in order to move the lease.
2 changes: 1 addition & 1 deletion v23.1/frequently-asked-questions.md
@@ -107,7 +107,7 @@ that, in the presence of partitions, the system will become unavailable rather t

Separately, CockroachDB is also Highly Available, although "available" here means something different than the way it is used in the CAP theorem. In the CAP theorem, availability is a binary property, but for High Availability, we talk about availability as a spectrum (using terms like "five nines" for a system that is available 99.999% of the time).

Being both CP and HA means that whenever a majority of replicas can talk to each other, they should be able to make progress. For example, if you deploy CockroachDB to three datacenters and the network link to one of them fails, the other two datacenters should be able to operate normally with only a few seconds' disruption. We do this by attempting to detect partitions and failures quickly and efficiently, transferring leadership to nodes that are able to communicate with the majority, and routing internal traffic away from nodes that are partitioned away.
Being both CP and HA means that whenever a majority of replicas can talk to each other, they should be able to make progress. For example, if you deploy CockroachDB to three datacenters and the network link to one of them fails, the other two datacenters should be able to operate normally with only a few seconds' disruption. We do this by attempting to detect partitions and failures quickly and efficiently, [transferring leadership to nodes that are able to communicate with the majority](architecture/replication-layer.html#how-leases-are-transferred-from-a-dead-node), and routing internal traffic away from nodes that are partitioned away.

### Why is CockroachDB SQL?

