diff --git a/docs/pages/setup/operations/ca-rotation.mdx b/docs/pages/setup/operations/ca-rotation.mdx index 6be371fca1769..934b20114de66 100644 --- a/docs/pages/setup/operations/ca-rotation.mdx +++ b/docs/pages/setup/operations/ca-rotation.mdx @@ -11,57 +11,82 @@ description: How to rotate Teleport's certificate authority (!docs/pages/includes/tctl.mdx!) -For cloud, login with a teleport user with editor privileges: -```code -# tsh logs you in and receives short-lived certificates -$ tsh login --proxy=myinstance.teleport.sh --user=email@example.com -# try out the connection -$ tctl get nodes -``` + For Cloud, log in with a Teleport user with editor privileges: + ```code + # tsh logs you in and receives short-lived certificates + $ tsh login --proxy=myinstance.teleport.sh --user=email@example.com + # try out the connection + $ tctl get nodes + ``` -## Certificate Authority Rotation - -Take a look at the [Certificates chapter](../../architecture/authentication.mdx#authentication-in-teleport) in the -architecture document to learn how the certificate authority rotation works. +## Certificate Authority rotation This section will show you how to implement certificate rotation in practice. -During manual and semi-automatic certificate authority rotation, Teleport generates a new certificate -authority and issues certificates for auth servers, proxies, nodes and users. + + If you are using [CA + Pinning](../admin/adding-nodes.mdx#untrusted-auth-servers) when adding new + nodes, the CA pin will change after the rotation. Make sure you use the *new* + CA pin when adding nodes after rotation. + + +### Rotation phases + +The rotation consists of several phases: + +- `standby`: All operations have completed or haven't started yet. +- `init`: All components are notified of the rotation. A new certificate + authority is issued, but not used. It is necessary for remote trusted clusters + to fetch the new certificate authority, otherwise new clients will reject it. 
+- `update_clients`: Internal client certs are updated and reloaded. Servers + will use and respond with old credentials because clients have no idea about + new certificates at first. +- `update_servers`: Servers reload and start serving TLS and SSH certificates + signed by the new certificate authority, but will still accept certificates + issued by the old certificate authority. +- `rollback`: The rotation was aborted and is rolling back to the old + certificate authority. -Rotation consists of several phases: +### Rotation types -- `standby` All operations have completed or haven't started yet. -- `init` - All components are notified of the rotation. A new certificate authority is issued, but not used. - It is necessary for remote trusted clusters to fetch the new certificate authority, otherwise the new clients - will reject it. -- `update_clients` - internal clients certs are updated and reloaded. - Servers will use and respond with old credentials because clients have no idea about new certificates at first. -- `update_servers` Servers will reload and would start serving -TLS and SSH certificates signed by the new certificate authority, but will still accept certificates -issued by old certificate authority. -- `rollback` rotation is rolling back to the old certificate authority. +There are two kinds of certificate rotations: -Both in manual and semi-automatic rotation, cluster goes through the states above in sequence: +- **Manual:** It is the cluster administrator's responsibility to transition + between each phase of the rotation while monitoring the state of the cluster. + Manual rotations provide the greatest level of control, and are performed by + providing the desired phase using the `--phase` flag with the + `tctl auth rotate` command. +- **Semi-automatic:** Teleport automatically transitions between phases of the + rotation after some amount of time (known as a *grace period*) elapses. 
+ +For both types of rotations, the cluster goes through the phases in the +following order: - `standby` -> `init` -> `update_clients` -> `update_servers` -> `standby` -Administrators can rollback all the changes before rotation is completed by entering `standby`. +Administrators can abort the rotation and revert all changes any time before +the rotation is completed by entering the `rollback` phase. + +```code +$ tctl auth rotate --phase=rollback --manual +``` -For example, if admin has detected that some nodes failed to upgrade during `update_servers`, -they can rollback to the previous certificate authority: +For example, if an admin has detected that some nodes failed to upgrade during +`update_servers`, they can roll back to the previous certificate authority, and +the phase transitions look like this: - `update_servers` -> `rollback` -> `standby`. -Try rotation/rollback in manual mode first to understand all the edge-cases -and gotchas before going with semi-automatic version. + Try rotation/rollback in manual mode first to understand all the edge cases + and gotchas before going with the semi-automatic version. ## Manual rotation -In manual mode, we would transition between phases while monitoring the state of the cluster. +In manual mode, we manually transition between phases while monitoring the state +of the cluster. **Start the rotation** @@ -72,7 +97,7 @@ $ tctl auth rotate --phase=init --manual --type=host Updated rotation phase to "init". To check status use 'tctl status' ``` -Cluster status will reflect active rotation in progress: +Use `tctl` to confirm that there is an active rotation in progress: ```code $ tctl status @@ -96,23 +121,24 @@ $ tctl get nodes --format=json | jq '.[] | {hostname: .spec.hostname, rotation: } ``` -Host `terminal` has updated it status to phase `init`. It has downloaded a new CA public key and is ready -for state transitions. +In this example, the node named `terminal` has updated its status to phase +`init`. 
This means it has downloaded a new CA public key and is ready for state +transitions. - -If some nodes are offline during rotation or have failed to update the status, -you will lose connectivity after the transition `update_servers` -> `standby`. Make sure that all -nodes are up to date with the transitions. + + If some nodes are offline during rotation or have failed to update the status, + you will lose connectivity after the transition `update_servers` -> `standby`. + Make sure that all nodes are up to date with the transitions before + proceeding. **Update clients** -Execute transition `init` -> `update_clients`: +Execute the transition from `init` to `update_clients`: ```code $ tctl auth rotate --phase=update_clients --manual -# Updated rotation phase to "init". To check status use 'tctl status' +# Updated rotation phase to "update_clients". To check status use 'tctl status' $ tctl status # Cluster acme.cluster # Version (=teleport.version=) @@ -120,7 +146,8 @@ $ tctl status ``` -Clients will temporarily lose connectivity during proxy and auth servers restarts. + Clients will temporarily lose connectivity during Proxy and Auth Server + restarts. Verify that nodes have caught up and now see the current cluster state: @@ -136,11 +163,12 @@ $ tctl get nodes --format=json | jq '.[] | {hostname: .spec.hostname, rotation: **Update servers** -All nodes have caught up. Execute the transition `update_clients` -> `update_servers`: +Now that all nodes have caught up, execute the transition from `update_clients` +to `update_servers`: ```code $ tctl auth rotate --phase=update_servers --manual -# Updated rotation phase to "init". To check status use 'tctl status' +# Updated rotation phase to "update_servers". To check status use 'tctl status' $ tctl status # Cluster acme.cluster @@ -149,8 +177,9 @@ $ tctl status ``` -Usually if things go wrong, they go wrong at this transition. If you have lost connectivity to nodes, -[rollback](#rollback) to the old certificate authority. 
+ Usually if things go wrong, they go wrong at this transition. If you have lost + connectivity to nodes, [roll back](#rollback) to the old certificate + authority. Verify that nodes have caught up: @@ -169,28 +198,23 @@ $ tctl get nodes --format=json | jq '.[] | {hostname: .spec.hostname, rotation: Before wrapping up, verify that you have not lost any nodes and can connect to them, for example: ```code -$ tsh ssh hello@terminal hostname +$ tsh ssh hello@terminal ``` -This is the last stage when you can rollback. If you have lost connectivity to nodes, -[rollback](#rollback) to the old certificate authority. + This is the last stage where you have the opportunity to roll back. If you + have lost connectivity to nodes, [roll back](#rollback) to the old certificate + authority. ```code $ tctl auth rotate --phase=standby --manual -# Updated rotation phase to "init". To check status use 'tctl status' - -$ tctl status -# Cluster acme.cluster -# Version (=teleport.version=) -# Host CA rotating servers (mode: manual, started: Sep 20 01:44:36 UTC, ending: Sep 21 07:44:36 UTC) ``` -Cluster status should indicate succesffully completed rotation. +Verify that the rotation has completed with `tctl`: ```code -tctl status +$ tctl status Cluster acme.cluster Version (=teleport.version=) Host CA rotated Sep 20 02:11:25 UTC @@ -210,31 +234,26 @@ $ tctl get nodes --format=json | jq '.[] | {hostname: .spec.hostname, rotation: } ``` - -If you are using [CA Pinning](../admin/adding-nodes.mdx#untrusted-auth-servers) when adding new nodes, the CA pin will change after the rotation. -Make sure you use the *new* CA pin when adding nodes after rotation. - - ## Semi-Automatic rotation -Semi-automatic rotation executes the same steps as the manual rotation, but with a grace period between them. -It currently does not track the states of the nodes and you can lose connectivity if things go wrong. 
+ Semi-automatic rotation executes the same steps as the manual rotation, but + with a grace period between them. It currently does not track the states of + the nodes and you can lose connectivity if things go wrong. -You can trigger semi-automatic rotation: +You can trigger semi-automatic rotation by omitting the `--manual` and `--phase` +flags. ```code $ tctl auth rotate ``` -This will trigger a rotation process for both hosts and users with a *grace period* of 48 hours. -During the grace period, certificates issued both by old and new certificate authority work. +This will trigger a rotation process for both hosts and users with a default +grace period of 48 hours. During the grace period, certificates issued by both +the old and new certificate authorities work. -You can customize grace period: +You can customize the grace period and CA type with additional flags: ```code # Rotate only user certificates with a grace period of 200 hours: @@ -248,33 +267,29 @@ The rotation takes time, especially for hosts, because each node in a cluster needs to be notified that a rotation is taking place and request a new certificate for itself before the grace period ends. - - Be careful when choosing a grace period when rotating host certificates. The grace period needs to be long enough for all nodes in a cluster to request a new certificate. If some nodes go offline during the - rotation and come back only after the grace period has ended, they will be - forced to leave the cluster, i.e. users will no longer be allowed to SSH - into them. - +During semi-automatic rotations, Teleport will attempt to divide the grace +period so that it spends an equal amount of time in each phase before +transitioning to the next phase. This means that using a shorter grace period +will result in faster state transitions. + + + Be careful when choosing a grace period when rotating host certificates. 
+ -Check the cluster status of rotation: +The grace period needs to be long enough for all nodes in a cluster to request a +new certificate. If some nodes go offline during the rotation and come back only +after the grace period has ended, they will be forced to leave the cluster, i.e. +users will no longer be allowed to SSH into them. + +Check the cluster status: ```code -tctl status +$ tctl status Cluster acme.cluster Version (=teleport.version=) Host CA initialized (mode: manual, started: Sep 20 01:44:36 UTC, ending: Sep 21 07:44:36 UTC) ``` - - If you are using [CA Pinning](../admin/adding-nodes.mdx#untrusted-auth-servers) when adding new nodes, the CA pin will change after the rotation. Make sure you use the - *new* CA pin when adding nodes after rotation. - - Check the status of individual nodes: ```code @@ -287,21 +302,22 @@ $ tctl get nodes --format=json | jq '.[] | {hostname: .spec.hostname, rotation: } ``` -Host `terminal` has updated it status to phase `init`. It has downloaded a new CA public key and is ready -for state transitions. +The node named `terminal` has updated its status to phase `init`. This means it +has downloaded a new CA public key and is ready for state transitions. ## Rollback -Rollback is only possible before rotation enters `standby` state. +Rollback must be performed before the rotation enters `standby` state. -First, override the rotation to the manual rollback: +First, enter the rollback phase with a manual phase transition: ```code $ tctl auth rotate --phase=rollback --manual # Updated rotation phase to "rollback". To check status use 'tctl status' ``` -Make sure that nodes that have updated have caught up: +Make sure that any nodes which have already updated have caught up and entered +the `rollback` phase. 
```code # Check rotation status of the nodes @@ -313,5 +329,11 @@ $ tctl get nodes --format=json | jq '.[] | {hostname: .spec.hostname, rotation: } ``` -If any of the nodes were lost and using the old cert authority, they should reconnect -once you switch the control plane to the old cert authority. +If connectivity to any of the nodes was lost during the rotation, this is likely +because they were still using the old cert authority. Connectivity to these +nodes should be restored when the rollback completes and the old certificate +authority is made active. + +## Further reading + +How the [Teleport Certificate Authority](../../architecture/authentication.mdx#authentication-in-teleport) works.