VAULT-28520: Docs describing active node/leadership election timing (hashicorp#28608)

* add leadership election delay explanations

* Update website/content/docs/internals/high-availability.mdx

* Update website/content/docs/internals/integrated-storage.mdx

* small fixes

---------

Co-authored-by: Sarah Chavis <[email protected]>
miagilepner and schavis authored Oct 11, 2024
1 parent 82133e7 commit 5cbebac
Showing 2 changed files with 73 additions and 1 deletion.
43 changes: 42 additions & 1 deletion website/content/docs/internals/high-availability.mdx
@@ -47,7 +47,48 @@ the request, the request is forwarded to the active server. Read-only requests a
Like traditional HA standbys, a Performance Standby Node becomes the active instance when the active node is sealed, fails, or loses
network connectivity.

# Becoming an active node

An active node in an HA configuration holds an HA lock. When the active node becomes
unavailable or steps down because an operator invokes
[`vault operator step-down`](/vault/docs/commands/operator/step-down), all the
nodes in the cluster attempt to grab the HA lock. The first node that succeeds
in grabbing and holding the lock becomes the new active node.
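The lock handoff described above can be sketched as a race where every node attempts a non-blocking acquire and exactly one wins. A minimal Python sketch; the `HALock` class and node names are illustrative, not Vault's API:

```python
class HALock:
    """Toy stand-in for the HA lock provided by the storage backend."""

    def __init__(self):
        self.holder = None

    def try_acquire(self, node):
        # Non-blocking: succeeds only if no other node holds the lock.
        if self.holder is None:
            self.holder = node
            return True
        return False


def elect_active_node(lock, nodes):
    # Every node competes; the first successful acquire becomes active.
    for node in nodes:
        if lock.try_acquire(node):
            return node
    return None


lock = HALock()
active = elect_active_node(lock, ["node-a", "node-b", "node-c"])
# The winner holds the lock; the remaining nodes stay standbys.
```

The key property is that acquiring the lock is atomic: losers see the lock as held and remain standbys rather than both becoming active.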

The HA lock competition process depends on the HA storage backend of the cluster.
For example, with raft integrated storage, Vault always gives the HA lock to
whichever node wins the
[Raft leadership election](/vault/docs/internals/integrated-storage#leadership-elections).

After obtaining an HA lock, the node goes through a series of inauguration steps
to formally become the active node for its cluster:

1. Seals the local Vault instance.
1. Reloads the seal configuration.
1. Migrates the seal, if needed.
1. Reloads encryption keys.
1. Creates a new HA intra-cluster TLS certificate.
1. Writes an entry to storage advertising its status as the active node.
1. Unseals the local Vault instance.
1. Starts accepting connections from other nodes in the cluster.

After completing the inauguration steps, the new active node begins responding
to new client requests.
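The inauguration sequence can be sketched as an ordered pipeline that aborts if any step fails. Illustrative Python only; the step names mirror the list above, not Vault's internal function names:

```python
INAUGURATION_STEPS = [
    "seal",                # seal the local Vault instance
    "reload_seal_config",  # reload the seal configuration
    "migrate_seal",        # migrate the seal, if needed
    "reload_keys",         # reload encryption keys
    "new_cluster_cert",    # create a new HA intra-cluster TLS certificate
    "advertise_active",    # write the active-node entry to storage
    "unseal",              # unseal the local Vault instance
    "accept_connections",  # start accepting connections from cluster nodes
]


def become_active(handlers):
    """Run each inauguration step in order; stop at the first failure."""
    completed = []
    for step in INAUGURATION_STEPS:
        if not handlers[step]():
            return completed, False  # inauguration failed mid-way
        completed.append(step)
    return completed, True  # node may now serve client requests
```

Because the steps run strictly in order, a node never advertises itself or accepts connections before its seal and key state are consistent.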

HA standby nodes check for active node updates every 2.5 seconds.
When the active node changes, standby nodes update their forwarding connection
and open a connection to the new active node. The connection will fail until the
new active node starts accepting connections from other nodes in the cluster, so
standby nodes retry the connection every 5 seconds as a heartbeat, or whenever a
new client request arrives that requires forwarding.
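These intervals give a rough upper bound on how long a standby can take to reach a new active node. A back-of-the-envelope sketch in Python; the worst-case formula is an assumption that stacks the maximum wait at each stage and is not Vault code:

```python
import math

CHECK_INTERVAL = 2.5  # seconds between active-node status checks
RETRY_INTERVAL = 5.0  # heartbeat interval for retrying the forward connection


def worst_case_failover_delay(active_setup_time):
    """Rough upper bound on a standby's reconnection delay, in seconds.

    The standby may wait up to CHECK_INTERVAL to notice the new active
    node, then retries the forwarding connection every RETRY_INTERVAL
    until the active node finishes its inauguration steps
    (active_setup_time seconds).
    """
    retries = math.ceil(active_setup_time / RETRY_INTERVAL)
    return CHECK_INTERVAL + retries * RETRY_INTERVAL
```

For example, if the new active node needs 6 seconds to finish its inauguration steps, a standby could take up to 12.5 seconds (one 2.5-second check plus two 5-second retry heartbeats) before its forwarding connection succeeds; a forwarded client request can shorten this by triggering an immediate retry.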

With performance standbys enabled, standby nodes also promote themselves to
performance standbys when the active node updates. The standby requests a
certificate and key pair from the active node so it can receive replicated
data. Once the standby node seals itself, performs the necessary setup tasks,
and completes its post-unseal setup, it can serve new client requests.

# Tutorial

Refer to the following tutorials to learn more.

31 changes: 31 additions & 0 deletions website/content/docs/internals/integrated-storage.mdx
@@ -138,6 +138,37 @@ In a [high availability](/vault/docs/internals/high-availability#design-overview
configuration, the active Vault node is the leader node and all standby nodes
are followers.

## Leadership elections

Nodes become the Raft leader through Raft leadership elections.

All nodes in a Raft cluster start as **followers**. Followers monitor leader
health through a **leader heartbeat**. If a follower does not receive a heartbeat
within the configured **heartbeat timeout**, the node becomes a **candidate**.
Candidates watch for election notices from other nodes in the cluster. If the
**election timeout** period expires, the candidate starts an election for
leader. If the candidate gets responses from a quorum of other nodes in the
cluster, the candidate becomes the new leader node.
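The quorum check at the heart of the election is simple majority arithmetic. A minimal sketch in Python; the function names are illustrative, not the Raft library's API:

```python
def quorum(cluster_size):
    # Strict majority of the cluster: 3 of 5 nodes, 2 of 3, etc.
    return cluster_size // 2 + 1


def election_outcome(votes_received, cluster_size):
    """A candidate becomes leader only with responses from a quorum.

    votes_received counts the candidate's own vote plus peer responses.
    """
    return "leader" if votes_received >= quorum(cluster_size) else "candidate"
```

The strict-majority rule is why two candidates can never both win the same election: two disjoint quorums cannot exist in one cluster.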

Raft leaders may step down voluntarily if the node cannot connect to a quorum
of nodes within the **leader lease timeout** period.

The relevant timeout periods (heartbeat timeout, election timeout, leader lease
timeout) scale according to the [`performance_multiplier`](/vault/docs/configuration/storage/raft#performance-multiplier) setting in your Vault configuration. By default,
the `performance_multiplier` is 5, which translates to the following timeout
values:

Timeout | Default duration
-------------------- | ----------------
Heartbeat timeout | 5 seconds
Election timeout | 5 seconds
Leader lease timeout | 2.5 seconds
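The table values follow from scaling base intervals linearly by the multiplier. A sketch, assuming base values of 1 second for the heartbeat and election timeouts and 500 ms for the leader lease timeout (these base values are an assumption consistent with the table above; verify them against your Vault and Raft library versions):

```python
# Assumed base timeouts in milliseconds (before the multiplier is applied).
BASE_TIMEOUTS_MS = {
    "heartbeat": 1000,    # assumed base heartbeat timeout
    "election": 1000,     # assumed base election timeout
    "leader_lease": 500,  # assumed base leader lease timeout
}


def scaled_timeouts(performance_multiplier=5):
    """Scale each base timeout by the multiplier; returns seconds."""
    return {name: ms * performance_multiplier / 1000.0
            for name, ms in BASE_TIMEOUTS_MS.items()}
```

With the default multiplier of 5, this yields the 5-second heartbeat and election timeouts and the 2.5-second leader lease timeout shown in the table; lowering the multiplier makes failover detection faster at the cost of more election churn on slow networks.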

We recommend using the default multiplier unless one of the following is true:

- Platform telemetry strongly indicates the default behavior is insufficient.
- The reliability of your platform or network requires different behavior.

## BoltDB Raft logs

BoltDB is a single file database, which means BoltDB cannot shrink the file on
