VAULT-28520: Docs describing active node/leadership election timing (hashicorp#28608)

* add leadership election delay explanations

* Update website/content/docs/internals/high-availability.mdx

* Update website/content/docs/internals/integrated-storage.mdx

* small fixes

---------

Co-authored-by: Sarah Chavis <[email protected]>
miagilepner and schavis authored Oct 11, 2024
1 parent 82133e7 commit 5cbebac
Showing 2 changed files with 73 additions and 1 deletion.
43 changes: 42 additions & 1 deletion website/content/docs/internals/high-availability.mdx
@@ -47,7 +47,48 @@ the request, the request is forwarded to the active server. Read-only requests a
Like traditional HA standbys, a Performance Standby Node becomes the active instance when the active node is sealed, fails, or loses
network connectivity.

# Becoming an active node

An active node in an HA configuration holds an HA lock. When the active node becomes
unavailable or steps down because an operator invokes
[`vault operator step-down`](/vault/docs/commands/operator/step-down), all the
nodes in the cluster attempt to grab the HA lock. The first node that succeeds
in grabbing and holding the lock becomes the new active node.
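The lock handoff described above can be sketched as a race where every node attempts a non-blocking acquire and exactly one wins. A minimal Python sketch; the `HALock` class and node names are illustrative, not Vault's API:

```python
class HALock:
    """Toy stand-in for the HA lock provided by the storage backend."""

    def __init__(self):
        self.holder = None

    def try_acquire(self, node):
        # Non-blocking: succeeds only if no other node holds the lock.
        if self.holder is None:
            self.holder = node
            return True
        return False


def elect_active_node(lock, nodes):
    # Every node competes; the first successful acquire becomes active.
    for node in nodes:
        if lock.try_acquire(node):
            return node
    return None


lock = HALock()
active = elect_active_node(lock, ["node-a", "node-b", "node-c"])
# The winner holds the lock; the remaining nodes stay standbys.
```

The key property is that acquiring the lock is atomic: losers see the lock as held and remain standbys rather than both becoming active.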

The HA lock competition process depends on the HA storage backend of the cluster.
For example, with raft integrated storage, Vault always gives the HA lock to
whichever node wins the
[Raft leadership election](/vault/docs/internals/integrated-storage#leadership-elections).

After obtaining an HA lock, the node goes through a series of inauguration steps
to formally become the active node for its cluster:

1. Seals the local Vault instance.
1. Reloads the seal configuration.
1. Migrates the seal, if needed.
1. Reloads encryption keys.
1. Creates a new HA intra-cluster TLS certificate.
1. Writes an entry to storage advertising its status as the active node.
1. Unseals the local Vault instance.
1. Starts accepting connections from other nodes in the cluster.

After completing the inauguration steps, the new active node begins responding
to new client requests.
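The inauguration sequence can be sketched as an ordered pipeline that aborts if any step fails. Illustrative Python only; the step names mirror the list above, not Vault's internal function names:

```python
INAUGURATION_STEPS = [
    "seal",                # seal the local Vault instance
    "reload_seal_config",  # reload the seal configuration
    "migrate_seal",        # migrate the seal, if needed
    "reload_keys",         # reload encryption keys
    "new_cluster_cert",    # create a new HA intra-cluster TLS certificate
    "advertise_active",    # write the active-node entry to storage
    "unseal",              # unseal the local Vault instance
    "accept_connections",  # start accepting connections from cluster nodes
]


def become_active(handlers):
    """Run each inauguration step in order; stop at the first failure."""
    completed = []
    for step in INAUGURATION_STEPS:
        if not handlers[step]():
            return completed, False  # inauguration failed mid-way
        completed.append(step)
    return completed, True  # node may now serve client requests
```

Because the steps run strictly in order, a node never advertises itself or accepts connections before its seal and key state are consistent.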

HA standby nodes check for active node updates every 2.5 seconds.
When the active node changes, standby nodes update their forwarding connection
and open a connection to the new active node. The connection will fail until the
new active node starts accepting connections from other nodes in the cluster, so
standby nodes retry the connection every 5 seconds as a heartbeat, or whenever a
new client request arrives that requires forwarding.
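These intervals give a rough upper bound on how long a standby can take to reach a new active node. A back-of-the-envelope sketch in Python; the worst-case formula is an assumption that stacks the maximum wait at each stage and is not Vault code:

```python
import math

CHECK_INTERVAL = 2.5  # seconds between active-node status checks
RETRY_INTERVAL = 5.0  # heartbeat interval for retrying the forward connection


def worst_case_failover_delay(active_setup_time):
    """Rough upper bound on a standby's reconnection delay, in seconds.

    The standby may wait up to CHECK_INTERVAL to notice the new active
    node, then retries the forwarding connection every RETRY_INTERVAL
    until the active node finishes its inauguration steps
    (active_setup_time seconds).
    """
    retries = math.ceil(active_setup_time / RETRY_INTERVAL)
    return CHECK_INTERVAL + retries * RETRY_INTERVAL
```

For example, if the new active node needs 6 seconds to finish its inauguration steps, a standby could take up to 12.5 seconds (one 2.5-second check plus two 5-second retry heartbeats) before its forwarding connection succeeds; a forwarded client request can shorten this by triggering an immediate retry.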

With performance standbys enabled, standby nodes also promote themselves to
performance standbys when the active node updates. The standby requests a
certificate and key pair from the active node so it can receive replicated
data. Once the standby node seals itself, performs the necessary setup tasks,
and completes its post-unseal setup, it can serve new client requests.

# Tutorial

Refer to the following tutorials to learn more.

31 changes: 31 additions & 0 deletions website/content/docs/internals/integrated-storage.mdx
@@ -138,6 +138,37 @@ In a [high availability](/vault/docs/internals/high-availability#design-overview
configuration, the active Vault node is the leader node and all standby nodes
are followers.

## Leadership elections

Nodes become the Raft leader through Raft leadership elections.

All nodes in a Raft cluster start as **followers**. Followers monitor leader
health through a **leader heartbeat**. If a follower does not receive a heartbeat
within the configured **heartbeat timeout**, the node becomes a **candidate**.
Candidates watch for election notices from other nodes in the cluster. If the
**election timeout** period expires, the candidate starts an election for
leader. If the candidate gets responses from a quorum of other nodes in the
cluster, the candidate becomes the new leader node.
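The quorum check at the heart of the election is simple majority arithmetic. A minimal sketch in Python; the function names are illustrative, not the Raft library's API:

```python
def quorum(cluster_size):
    # Strict majority of the cluster: 3 of 5 nodes, 2 of 3, etc.
    return cluster_size // 2 + 1


def election_outcome(votes_received, cluster_size):
    """A candidate becomes leader only with responses from a quorum.

    votes_received counts the candidate's own vote plus peer responses.
    """
    return "leader" if votes_received >= quorum(cluster_size) else "candidate"
```

The strict-majority rule is why two candidates can never both win the same election: two disjoint quorums cannot exist in one cluster.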

Raft leaders may step down voluntarily if the node cannot connect to a quorum
of nodes within the **leader lease timeout** period.

The relevant timeout periods (heartbeat timeout, election timeout, leader lease
timeout) scale according to the [`performance_multiplier`](/vault/docs/configuration/storage/raft#performance-multiplier) setting in your Vault configuration. By default,
the `performance_multiplier` is 5, which translates to the following timeout
values:

Timeout | Default duration
-------------------- | ----------------
Heartbeat timeout | 5 seconds
Election timeout | 5 seconds
Leader lease timeout | 2.5 seconds
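The table values follow from scaling base intervals linearly by the multiplier. A sketch, assuming base values of 1 second for the heartbeat and election timeouts and 500 ms for the leader lease timeout (these base values are an assumption consistent with the table above; verify them against your Vault and Raft library versions):

```python
# Assumed base timeouts in milliseconds (before the multiplier is applied).
BASE_TIMEOUTS_MS = {
    "heartbeat": 1000,    # assumed base heartbeat timeout
    "election": 1000,     # assumed base election timeout
    "leader_lease": 500,  # assumed base leader lease timeout
}


def scaled_timeouts(performance_multiplier=5):
    """Scale each base timeout by the multiplier; returns seconds."""
    return {name: ms * performance_multiplier / 1000.0
            for name, ms in BASE_TIMEOUTS_MS.items()}
```

With the default multiplier of 5, this yields the 5-second heartbeat and election timeouts and the 2.5-second leader lease timeout shown in the table; lowering the multiplier makes failover detection faster at the cost of more election churn on slow networks.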

We recommend using the default multiplier unless one of the following is true:

- Platform telemetry strongly indicates the default behavior is insufficient.
- The reliability of your platform or network requires different behavior.

## BoltDB Raft logs

BoltDB is a single file database, which means BoltDB cannot shrink the file on
