From 2aab2ea14b7eb072d75af4e22c5c33a8fca037b6 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 3 Apr 2024 18:36:44 -0400 Subject: [PATCH 001/105] Create 0169-auto-updates-linux-agents.md --- rfd/0169-auto-updates-linux-agents.md | 271 ++++++++++++++++++++++++++ 1 file changed, 271 insertions(+) create mode 100644 rfd/0169-auto-updates-linux-agents.md diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md new file mode 100644 index 0000000000000..110df9195ae70 --- /dev/null +++ b/rfd/0169-auto-updates-linux-agents.md @@ -0,0 +1,271 @@ +--- +authors: Stephen Levine (stephen.levine@goteleport.com) +state: draft +--- + +# RFD 0169 - Automatic Updates for Linux Agents + +## Required Approvers + +* Engineering: @rjones && @bernardjkim +* Security: @reed + +## What + +This RFD proposes a new mechanism for Teleport agents installed on Linux servers to automatically update to a version set by an operator via tctl. + +The following anti-goals are out-of-scope for this proposal, but will be addressed in future RFDs: +- Analogous adjustments for Teleport agents installed on Kubernetes +- Phased rollouts of new agent versions for agents connected to an existing cluster +- Signing of agent artifacts via TUF +- Teleport Cloud APIs for updating agents + +This RFD proposes a specific implementation of several sections in https://github.com/gravitational/teleport/pull/39217. + +Additionally, this RFD parallels the auto-update functionality for client tools proposed in https://github.com/gravitational/teleport/pull/39805. + +## Why + +The existing mechanism for automatic agent updates does not provide a hands-off experience for all Teleport users. + +1. The use of system package management leads to interactions with `apt upgrade`, `yum upgrade`, etc. that can result in unintentional upgrades or confusing command output. +2. The use of system package management requires complex logic for each target distribution. +3. The installation mechanism requires 4-5 commands, includes manually installing multiple packages, and varies depending on your version and edition of Teleport. +4. The use of bash to implement the updater makes changes difficult and prone to error. +5. The existing auto-updater has limited automated testing. +6. The use of GPG keys in system package managers has key management implications that we would prefer to solve with TUF in the future. +7. The desired agent version cannot be set via Teleport's operator-targeted CLI (tctl). +8. The rollout plan for the new agent version is not fully-configurable using tctl. +9. Agent installation logic is spread between the auto-updater script, install script, auto-discovery script, and documentation. +10. Teleport contains logic that is specific to Teleport Cloud upgrade workflows. +11. The existing auto-updater is not self-updating. +12. It is difficult and undocumented to automate agent upgrades with custom automation (e.g., with JamF). + +We must provide a seamless, hands-off experience for auto-updates that is easy to maintain. + +## Details + +We will ship a new auto-updater package written in Go that does not interface with the system package manager. +It will be versioned separately from Teleport, and manage the installation of the correct Teleport agent version manually. +It will read the unauthenticated `/v1/webapi/ping` endpoint from the Teleport proxy, parse new fields on that endpoint, and install the specified agent version according to the specified upgrade plan. 
+It will download the correct version of Teleport as a tarball, unpack it in `/var/lib/teleport`, and ensure it is symlinked from `/usr/local/bin`. + +### Installation + +```shell +$ apt-get install teleport-ent-updater +$ teleport-update enable --proxy example.teleport.sh + +# if not enabled already, configure teleport and: +$ systemctl enable teleport +``` + +### API + +#### Endpoints + +`/v1/webapi/ping` +```json +{ + "agent_version": "15.1.1", + "agent_auto_update": true, + "agent_update_after": "2024-04-23T18:00:00.000Z", + "agent_update_jitter": 10, +} +``` +Notes: +- Critical updates are achieved by serving `agent_update_after` with the current time. +- The Teleport proxy translates upgrade hours (below) into a specific time after which all agents should be upgraded. +- If an agent misses an upgrade window, it will always update immediately. + +#### Teleport Resources + +```yaml +kind: cluster_maintenance_config +spec: + # agent_auto_update allows turning agent updates on or off at the + # cluster level. Only turn agent automatic updates off if self-managed + # agent updates are in place. + agent_auto_update: on|off + # agent_update_hour sets the hour in UTC at which clients should update their agents. + # The value -1 will set the upgrade time to the current time, resulting in immediate upgrades. + agent_update_hour: -1-23 + # agent_update_jitter sets a duration in which the upgrade will occur after the hour. + # The agent upgrader will pick a random time within this duration in which to upgrade. + agent_update_jitter: 0-MAXINT64 + + [...] +``` +``` +$ tctl autoupdate update --set-agent-auto-update=off +Automatic updates configuration has been updated. +$ tctl autoupdate update --set-agent-update-hour=3 +Automatic updates configuration has been updated. +$ tctl autoupdate update --set-agent-update-jitterr=600 +Automatic updates configuration has been updated. +``` + +```yaml +kind: autoupdate_version +spec: + # agent_version is the version of the agent the cluster will advertise. + # Can be auto (match the version of the proxy) or an exact semver formatted + # version. + agent_version: auto|X.Y.Z + + [...] +``` +``` +$ tctl autoupdate update --set-agent-version=15.1.1 +Automatic updates configuration has been updated. +``` + +Notes: +- These two resources are separate so that Cloud customers can be restricted from updating `autoupdate_version`, while maintaining control over the rollout. + +Questions: +- Should we use a time-only format for specifying the update hour? E.g., `agent_update_time: "18:00:00.000+01` + This would allow users to set an exact time via the CLI, instead of restricting to hours. + +### Filesystem + +``` +$ tree /var/lib/teleport +/var/lib/teleport +└── versions + ├── 15.0.0 + │ ├── bin + │ │ ├── ... + │ │ ├── teleport-updater + │ │ └── teleport + │ └── etc + │ ├── ... + │ └── systemd + │ └── teleport.service + ├── 15.1.1 + │ ├── bin + │ │ ├── ... + │ │ ├── teleport-updater + │ │ └── teleport + │ └── etc + │ ├── ... 
+    │       └── systemd
+    │           └── teleport.service
+    └── updates.yaml
+$ ls -l /usr/local/bin/teleport
+/usr/local/bin/teleport -> /var/lib/teleport/versions/15.0.0/bin/teleport
+$ ls -l /usr/local/bin/teleport
+/usr/local/bin/teleport-updater -> /var/lib/teleport/versions/15.0.0/bin/teleport-updater
+$ ls -l /usr/local/lib/systemd/system/teleport.service
+/usr/local/lib/systemd/system/teleport.service -> /var/lib/teleport/versions/15.0.0/etc/systemd/teleport.service
+```
+
+updates.yaml:
+```
+version: v1
+proxy: mytenant.teleport.sh
+enabled: true
+active_version: 15.1.1
+```
+
+### Runtime
+
+The agent-updater will run as a systemd service that executes every 10 minutes.
+The systemd service will run:
+```shell
+$ teleport-updater update
+```
+
+After installation, the `update` subcommand will no-op when executed, until the updater is configured via the `enable` subcommand:
+```shell
+$ teleport-updater enable --proxy mytenant.teleport.sh
+```
+
+If the proxy address is not provided with `--proxy`, the current proxy address from `teleport.yaml` is used.
+
+On servers without Teleport installed already, the `enable` subcommand will change the behavior of `teleport-update update` to update teleport and restart the existing agent, if running.
+It will also update Teleport immediately, to ensure that subsequent executions succeed.
+
+The `enable` subcommand will:
+1. Configure `updates.yaml` with the current proxy address and set `enabled` to true.
+2. Query the `/v1/webapi/ping` endpoint.
+3. If the current updater-managed version of Teleport is the latest, and teleport package is not installed, quit.
+4. If the current updater-managed version of Teleport is the latest, but the teleport package is installed, jump to (12).
+5. Download the desired Teleport tarball specified by `agent_version`.
+6. Verify the checksum.
+7. Extract the tarball to `/var/lib/teleport/versions/VERSION`.
+8. Replace any existing binaries or symlinks with symlinks to the current version.
+9. Restart the agent if the systemd service is already enabled.
+10. Set `active_version` in `updates.yaml` if successful or not enabled.
+11. Replace the old symlinks or binaries and quit (exit 1) if unsuccessful.
+12. Remove any `teleport` package if installed.
+13. Verify the symlinks to the active version still exist.
+14. Remove all stored versions of the agent except the current version and last working version.
+
+The `disable` subcommand will:
+1. Configure `updates.yaml` to set `enabled` to false.
+
+When the `update` subcommand is otherwise executed, it will:
+1. Check `updates.yaml`, and quit (exit 0) if `enabled` is false, or quit (exit 1) if `enabled` is true and no proxy address is set.
+2. Query the `/v1/webapi/ping` endpoint.
+3. Check if the current time is after the time advertised in `agent_update_after`, and that `agent_auto_update` is true.
+4. If the current version of Teleport is the latest, quit.
+5. Wait `random(0, agent_update_jitter)` seconds.
+6. Download the desired Teleport tarball specified by `agent_version`.
+7. Verify the checksum.
+8. Extract the tarball to `/var/lib/teleport/versions/VERSION`.
+9. Update symlinks to point at the new version (as sketched below).
+10. Restart the agent if the systemd service is already enabled.
+11. Set `active_version` in `updates.yaml` if successful or not enabled.
+12. Replace the old symlink or binary and quit (exit 1) if unsuccessful.
+13. Remove all stored versions of the agent except the current version and last working version.
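+
+A minimal sketch of the symlink swap in step 9 above, written in Go (the create-then-rename approach and the helper name are assumptions, not a finalized implementation):
+
+```go
+package updater
+
+import (
+    "os"
+    "path/filepath"
+)
+
+// linkVersion points /usr/local/bin/teleport at the requested version
+// under /var/lib/teleport/versions. The new symlink is created beside
+// the live one and renamed over it, so a concurrent exec never observes
+// a missing binary.
+func linkVersion(version string) error {
+    target := filepath.Join("/var/lib/teleport/versions", version, "bin", "teleport")
+    link := "/usr/local/bin/teleport"
+    tmp := link + ".new"
+    _ = os.Remove(tmp) // best-effort cleanup of a previously failed attempt
+    if err := os.Symlink(target, tmp); err != nil {
+        return err
+    }
+    return os.Rename(tmp, link) // rename(2) is atomic within a filesystem
+}
+```
+
+The same pattern would apply to the `teleport-updater` and systemd unit symlinks, and it makes the rollback in step 12 a matter of repeating the swap with the previous version.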
+
+To enable auto-updates of the updater itself, all commands will first check for an `active_version`, and reexec using the `teleport-updater` at that version if present and different.
+The `/usr/local/bin/teleport-upgrader` symlink will take precedence to avoid reexec in most scenarios.
+
+### Manual Workflow
+
+For use cases that fall outside of the functionality provided by `teleport-updater`, such as JamF or ansible-controlled updates, we provide an alternative manual workflow using the `/v1/webapi/ping` endpoint.
+
+Cluster administrators that want to self-manage client tools updates will be
+able to get and watch for changes to agent versions which can then be
+used to trigger other integrations to update the installed version of agents.
+
+```shell
+$ tctl autoupdate watch
+{"agent_version": "1.0.0"}
+{"agent_version": "1.0.1"}
+{"agent_version": "2.0.0"}
+[...]
+```
+
+```shell
+$ tctl autoupdate get
+{"agent_version": "2.0.0"}
+```
+
+### Scripts
+
+All scripts will install the latest updater and run `teleport-updater enable` with the proxy address.
+
+Eventually, additional logic from the scripts could be added to `teleport-updater`, such that `teleport-updater` can configure teleport.
+
+This is out-of-scope for this proposal.
+
+## Security
+
+The initial version of automatic updates will rely on TLS to establish
+connection authenticity to the Teleport download server. The authenticity of
+assets served from the download server is out of scope for this RFD. Cluster
+administrators concerned with the authenticity of assets served from the
+download server can use self-managed updates with system package managers,
+whose packages are signed.
+
+The Update Framework (TUF) will be used to implement secure updates in the future.
+
+## Execution Plan
+
+1. Implement new auto-updater in Go.
+2. Prep documentation changes.
+3. Release new updater via teleport-ent-updater package.
+4. Release documentation changes.
From 2eab6f73bad61c76fbc6f5975180ddb34011d669 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 3 Apr 2024 18:40:03 -0400 Subject: [PATCH 002/105] Fix github handle --- rfd/0169-auto-updates-linux-agents.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 110df9195ae70..db5639e6eb8af 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -7,7 +7,7 @@ state: draft ## Required Approvers -* Engineering: @rjones && @bernardjkim +* Engineering: @russjones && @bernardjkim * Security: @reed ## What From 61a3db1700ea45c6a11e48f68d50941005b88117 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 3 Apr 2024 18:40:36 -0400 Subject: [PATCH 003/105] Fix Github handle --- rfd/0169-auto-updates-linux-agents.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index db5639e6eb8af..04f1d55512dd5 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -8,7 +8,7 @@ state: draft ## Required Approvers * Engineering: @russjones && @bernardjkim -* Security: @reed +* Security: @reedloden ## What From 5aad72033f9c9698114b78b4f793e42f6ba23584 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Thu, 4 Apr 2024 16:32:09 -0400 Subject: [PATCH 004/105] Clarify jitter flag --- rfd/0169-auto-updates-linux-agents.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 04f1d55512dd5..576a3d61116e9 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -70,7 +70,7 @@ $ systemctl enable teleport "agent_version": "15.1.1", "agent_auto_update": true, "agent_update_after": "2024-04-23T18:00:00.000Z", - "agent_update_jitter": 10, + "agent_update_jitter_seconds": 10, } ``` Notes: @@ -90,9 +90,9 @@ spec: # agent_update_hour sets the hour in UTC at which clients should update their agents. # The value -1 will set the upgrade time to the current time, resulting in immediate upgrades. agent_update_hour: -1-23 - # agent_update_jitter sets a duration in which the upgrade will occur after the hour. + # agent_update_jitter_seconds sets a duration in which the upgrade will occur after the hour. # The agent upgrader will pick a random time within this duration in which to upgrade. - agent_update_jitter: 0-MAXINT64 + agent_update_jitter_seconds: 0-MAXINT64 [...] ``` @@ -101,7 +101,7 @@ $ tctl autoupdate update --set-agent-auto-update=off Automatic updates configuration has been updated. $ tctl autoupdate update --set-agent-update-hour=3 Automatic updates configuration has been updated. -$ tctl autoupdate update --set-agent-update-jitterr=600 +$ tctl autoupdate update --set-agent-update-jitter-seconds=600 Automatic updates configuration has been updated. ``` @@ -210,7 +210,7 @@ When `update` subcommand is otherwise executed, it will: 2. Query the `/v1/webapi/ping` endpoint. 3. Check if the current time is after the time advertised in `agent_update_after`, and that `agent_auto_updates` is true. 4. If the current version of Teleport is the latest, quit. -5. Wait `random(0, agent_update_jitter)` seconds. +5. Wait `random(0, agent_update_jitter_seconds)` seconds. 6. Download the desired Teleport tarball specified by `agent_version`. 7. Verify the checksum. 8. 
Extract the tarball to `/var/lib/teleport/versions/VERSION`. From 176e96f2d5d35f43a5ca7086b73e632348ba8404 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Thu, 4 Apr 2024 16:33:09 -0400 Subject: [PATCH 005/105] Remove time question --- rfd/0169-auto-updates-linux-agents.md | 4 ---- 1 file changed, 4 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 576a3d61116e9..25ae473846d97 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -123,10 +123,6 @@ Automatic updates configuration has been updated. Notes: - These two resources are separate so that Cloud customers can be restricted from updating `autoupdate_version`, while maintaining control over the rollout. -Questions: -- Should we use a time-only format for specifying the update hour? E.g., `agent_update_time: "18:00:00.000+01` - This would allow users to set an exact time via the CLI, instead of restricting to hours. - ### Filesystem ``` From 1f5aee06dbd3b5c3679528f168cf27e5afb13671 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Thu, 4 Apr 2024 22:16:23 -0400 Subject: [PATCH 006/105] Update rfd/0169-auto-updates-linux-agents.md Co-authored-by: Russell Jones --- rfd/0169-auto-updates-linux-agents.md | 1 + 1 file changed, 1 insertion(+) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 25ae473846d97..2311867e11e37 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -8,6 +8,7 @@ state: draft ## Required Approvers * Engineering: @russjones && @bernardjkim +* Product: @klizhentas || @xinding33 * Security: @reedloden ## What From 5bf39d250c73314aaedc57e1201e45612ab72d7d Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Thu, 4 Apr 2024 22:19:47 -0400 Subject: [PATCH 007/105] Update rfd/0169-auto-updates-linux-agents.md Co-authored-by: Russell Jones --- rfd/0169-auto-updates-linux-agents.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 2311867e11e37..5c1d0dedc9c2c 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -224,7 +224,7 @@ The `/usr/local/bin/teleport-upgrader` symlink will take precedence to avoid ree For use cases that fall outside of the functionality provided by `teleport-updater`, such as JamF or ansible-controlled updates, we provide an alternative manual workflow using the `/v1/webapi/ping` endpoint. -Cluster administrators that want to self-manage client tools updates will be +Cluster administrators that want to self-manage agent updates will be able to get and watch for changes to agent versions which can then be used to trigger other integrations to update the installed version of agents. 
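+
+For example, an integration could read the advertised version directly from the proxy. A minimal sketch in Go, using the endpoint and field name defined in the Endpoints section above (retries and TLS configuration omitted):
+
+```go
+package updater
+
+import (
+    "encoding/json"
+    "net/http"
+)
+
+// advertisedVersion fetches the agent version currently advertised by
+// the proxy's unauthenticated /v1/webapi/ping endpoint.
+func advertisedVersion(proxy string) (string, error) {
+    resp, err := http.Get("https://" + proxy + "/v1/webapi/ping")
+    if err != nil {
+        return "", err
+    }
+    defer resp.Body.Close()
+    var payload struct {
+        AgentVersion string `json:"agent_version"`
+    }
+    if err := json.NewDecoder(resp.Body).Decode(&payload); err != nil {
+        return "", err
+    }
+    return payload.AgentVersion, nil
+}
+```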
From 99b0373674cc56167c308dae107d5e19fb4ea6cd Mon Sep 17 00:00:00 2001
From: Stephen Levine
Date: Thu, 4 Apr 2024 22:23:53 -0400
Subject: [PATCH 008/105] Update rfd/0169-auto-updates-linux-agents.md

Co-authored-by: Russell Jones
---
 rfd/0169-auto-updates-linux-agents.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index 5c1d0dedc9c2c..6d2c770a072ae 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -151,7 +151,7 @@ $ tree /var/lib/teleport
 └── updates.yaml
 $ ls -l /usr/local/bin/teleport
 /usr/local/bin/teleport -> /var/lib/teleport/versions/15.0.0/bin/teleport
-$ ls -l /usr/local/bin/teleport
+$ ls -l /usr/local/bin/teleport-updater
 /usr/local/bin/teleport-updater -> /var/lib/teleport/versions/15.0.0/bin/teleport-updater
 $ ls -l /usr/local/lib/systemd/system/teleport.service
 /usr/local/lib/systemd/system/teleport.service -> /var/lib/teleport/versions/15.0.0/etc/systemd/teleport.service

From 475a76996c78705df5ff15d84881db6616d32d6a Mon Sep 17 00:00:00 2001
From: Stephen Levine
Date: Fri, 5 Apr 2024 12:00:04 -0400
Subject: [PATCH 009/105] Update 0169-auto-updates-linux-agents.md

---
 rfd/0169-auto-updates-linux-agents.md | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index 6d2c770a072ae..bb760237234bc 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -160,9 +160,14 @@ $ ls -l /usr/local/lib/systemd/system/teleport.service
 updates.yaml:
 ```
 version: v1
-proxy: mytenant.teleport.sh
-enabled: true
-active_version: 15.1.1
+kind: agent_versions
+spec:
+  # proxy specifies the Teleport proxy address to retrieve the agent version and update configuration from.
+  proxy: mytenant.teleport.sh
+  # enabled specifies whether auto-updates are enabled, i.e., whether teleport-updater update is allowed to update the agent.
+  enabled: true
+  # active_version specifies the active (symlinked) deployment of the teleport agent.
+  active_version: 15.1.1
 ```
 
 ### Runtime

From 586287f5e7ec2c2fbf2b2331e5b21c03df878be7 Mon Sep 17 00:00:00 2001
From: Stephen Levine
Date: Fri, 5 Apr 2024 12:24:09 -0400
Subject: [PATCH 010/105] Update 0169-auto-updates-linux-agents.md

---
 rfd/0169-auto-updates-linux-agents.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index bb760237234bc..336a488d25a81 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -227,7 +227,8 @@ The `/usr/local/bin/teleport-upgrader` symlink will take precedence to avoid ree
 
 ### Manual Workflow
 
-For use cases that fall outside of the functionality provided by `teleport-updater`, such as JamF or ansible-controlled updates, we provide an alternative manual workflow using the `/v1/webapi/ping` endpoint.
+For use cases that fall outside of the functionality provided by `teleport-updater`, we provide an alternative manual workflow using the `/v1/webapi/ping` endpoint.
+This workflow supports customers that cannot use the auto-update mechanism provided by `teleport-updater` because they use their own automation for updates (e.g., JamF or ansible).
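+
+Building on the `advertisedVersion` helper sketched earlier, such automation could watch for changes and trigger an external workflow. This is a sketch under the same assumptions, not a finalized interface:
+
+```go
+package updater
+
+import "time"
+
+// watchVersion polls the proxy and invokes onChange whenever the
+// advertised agent version changes, e.g. to kick off a JamF or
+// Ansible-driven update.
+func watchVersion(proxy string, interval time.Duration, onChange func(version string)) {
+    var last string
+    for range time.Tick(interval) {
+        v, err := advertisedVersion(proxy)
+        if err != nil || v == last {
+            continue // transient errors retry on the next tick
+        }
+        last = v
+        onChange(v)
+    }
+}
+```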
Cluster administrators that want to self-manage agent updates will be able to get and watch for changes to agent versions which can then be @@ -235,15 +236,15 @@ used to trigger other integrations to update the installed version of agents. ```shell $ tctl autoupdate watch -{"agent_version": "1.0.0"} -{"agent_version": "1.0.1"} -{"agent_version": "2.0.0"} +{"agent_version": "1.0.0", ... } +{"agent_version": "1.0.1, ... } +{"agent_version": "2.0.0", ... } [...] ``` ```shell $ tctl autoupdate get -{"agent_version": "2.0.0"} +{"agent_version": "2.0.0", ... } ``` ### Scripts From 222c860a85fb3c63a7886f0f2213f4acd017ee6c Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Fri, 5 Apr 2024 12:55:53 -0400 Subject: [PATCH 011/105] Update 0169-auto-updates-linux-agents.md --- rfd/0169-auto-updates-linux-agents.md | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 336a488d25a81..83f7ecef285f2 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -225,6 +225,19 @@ When `update` subcommand is otherwise executed, it will: To enable auto-updates of the updater itself, all commands will first check for an `active_version`, and reexec using the `teleport-updater` at that version if present and different. The `/usr/local/bin/teleport-upgrader` symlink will take precedence to avoid reexec in most scenarios. +To retrieve known information about agent upgrades, the `status` subcommand will return the following: +```json +{ + "agent_version_installed": "15.1.1", + "agent_version_desired": "15.1.2", + "agent_version_previous": "15.1.0", + "update_time_next": "2020-12-09T16:09:53+00:00", + "update_time_last": "2020-12-10T16:00:00+00:00", + "update_time_jitter": 600, + "updates_enabled": true +} +``` + ### Manual Workflow For use cases that fall outside of the functionality provided by `teleport-updater`, we provide an alternative manual workflow using the `/v1/webapi/ping` endpoint. From df65f460629a71eb937cc057243f98a00779f8c9 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Fri, 5 Apr 2024 15:38:41 -0400 Subject: [PATCH 012/105] add editions --- rfd/0169-auto-updates-linux-agents.md | 25 +++++++++++++++---------- 1 file changed, 15 insertions(+), 10 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 83f7ecef285f2..476facee68a15 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -68,6 +68,7 @@ $ systemctl enable teleport `/v1/webapi/ping` ```json { + "server_edition": "enterprise", "agent_version": "15.1.1", "agent_auto_update": true, "agent_update_after": "2024-04-23T18:00:00.000Z", @@ -78,6 +79,7 @@ Notes: - Critical updates are achieved by serving `agent_update_after` with the current time. - The Teleport proxy translates upgrade hours (below) into a specific time after which all agents should be upgraded. - If an agent misses an upgrade window, it will always update immediately. +- The edition served is the cluster edition (enterprise, enterprise-fips, or oss), and cannot be configured. #### Teleport Resources @@ -193,7 +195,7 @@ The `enable` subcommand will: 2. Query the `/v1/webapi/ping` endpoint. 3. If the current updater-managed version of Teleport is the latest, and teleport package is not installed, quit. 4. If the current updater-managed version of Teleport is the latest, but the teleport package is installed, jump to (12). -5. 
Download the desired Teleport tarball specified by `agent_version`. +5. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. 6. Verify the checksum. 7. Extract the tarball to `/var/lib/teleport/versions/VERSION`. 8. Replace any existing binaries or symlinks with symlinks to the current version. @@ -213,7 +215,7 @@ When `update` subcommand is otherwise executed, it will: 3. Check if the current time is after the time advertised in `agent_update_after`, and that `agent_auto_updates` is true. 4. If the current version of Teleport is the latest, quit. 5. Wait `random(0, agent_update_jitter_seconds)` seconds. -6. Download the desired Teleport tarball specified by `agent_version`. +6. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. 7. Verify the checksum. 8. Extract the tarball to `/var/lib/teleport/versions/VERSION`. 9. Update symlinks to point at the new version. @@ -231,10 +233,13 @@ To retrieve known information about agent upgrades, the `status` subcommand will "agent_version_installed": "15.1.1", "agent_version_desired": "15.1.2", "agent_version_previous": "15.1.0", - "update_time_next": "2020-12-09T16:09:53+00:00", - "update_time_last": "2020-12-10T16:00:00+00:00", - "update_time_jitter": 600, - "updates_enabled": true + "agent_edition_installed": "enterprise", + "agent_edition_desired": "enterprise", + "agent_edition_previous": "enterprise", + "agent_update_time_next": "2020-12-09T16:09:53+00:00", + "agent_update_time_last": "2020-12-10T16:00:00+00:00", + "agent_update_time_jitter": 600, + "agent_updates_enabled": true } ``` @@ -249,15 +254,15 @@ used to trigger other integrations to update the installed version of agents. ```shell $ tctl autoupdate watch -{"agent_version": "1.0.0", ... } -{"agent_version": "1.0.1, ... } -{"agent_version": "2.0.0", ... } +{"agent_version": "1.0.0", "agent_edition": "enterprise", ... } +{"agent_version": "1.0.1, "agent_edition": "enterprise", ... } +{"agent_version": "2.0.0", "agent_edition": "enterprise", ... } [...] ``` ```shell $ tctl autoupdate get -{"agent_version": "2.0.0", ... } +{"agent_version": "2.0.0", "agent_edition": "enterprise", ... } ``` ### Scripts From 48eaa83767a7e1e6dd5c9cc1e149be2ca0cb6688 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 8 Apr 2024 14:12:52 -0400 Subject: [PATCH 013/105] Installers and docs --- rfd/0169-auto-updates-linux-agents.md | 22 +++++++++++++++++++--- 1 file changed, 19 insertions(+), 3 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 476facee68a15..0fb526e7609c0 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -265,13 +265,29 @@ $ tctl autoupdate get {"agent_version": "2.0.0", "agent_edition": "enterprise", ... } ``` -### Scripts +### Installers -All scripts will install the latest updater and run `teleport-updater enable` with the proxy address. 
+The following install scripts will install the latest updater and run `teleport-updater enable` with the proxy address: + +- https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/api/types/installers/agentless-installer.sh.tmpl +- https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/api/types/installers/installer.sh.tmpl +- https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/lib/web/scripts/oneoff/oneoff.sh +- https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/lib/web/scripts/node-join/install.sh +- https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/assets/aws/files/install-hardened.sh Eventually, additional logic from the scripts could be added to `teleport-updater`, such that `teleport-updater` can configure teleport. -This is out-of-scope for this proposal. +Moving additional logic into the upgrader is out-of-scope for this proposal. + +### Documentation + +The following documentation will need to be updated to cover the new upgrader workflow: +- https://goteleport.com/docs/choose-an-edition/teleport-cloud/downloads +- https://goteleport.com/docs/installation +- https://goteleport.com/docs/upgrading/self-hosted-linux +- https://goteleport.com/docs/upgrading/self-hosted-automatic-agent-updates + +Additionally, the Cloud dashboard tenants downloads tab will need to be updated to reference the new instructions. ## Security From c22657a5e0b8e0feea2e7f15220193c792d20b1d Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 8 Apr 2024 14:13:36 -0400 Subject: [PATCH 014/105] Update 0169-auto-updates-linux-agents.md --- rfd/0169-auto-updates-linux-agents.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 0fb526e7609c0..4bfb444a60b1e 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -267,8 +267,7 @@ $ tctl autoupdate get ### Installers -The following install scripts will install the latest updater and run `teleport-updater enable` with the proxy address: - +The following install scripts will be updated to install the latest updater and run `teleport-updater enable` with the proxy address: - https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/api/types/installers/agentless-installer.sh.tmpl - https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/api/types/installers/installer.sh.tmpl - https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/lib/web/scripts/oneoff/oneoff.sh From 46372416ba08ddd5d111cece4828a9c8e214dc23 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 8 Apr 2024 14:16:28 -0400 Subject: [PATCH 015/105] Update 0169-auto-updates-linux-agents.md --- rfd/0169-auto-updates-linux-agents.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 4bfb444a60b1e..e3eddf662b735 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -268,11 +268,11 @@ $ tctl autoupdate get ### Installers The following install scripts will be updated to install the latest updater and run `teleport-updater enable` with the proxy address: -- 
https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/api/types/installers/agentless-installer.sh.tmpl -- https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/api/types/installers/installer.sh.tmpl -- https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/lib/web/scripts/oneoff/oneoff.sh -- https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/lib/web/scripts/node-join/install.sh -- https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/assets/aws/files/install-hardened.sh +- [/api/types/installers/agentless-installer.sh.tmpl](https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/api/types/installers/agentless-installer.sh.tmpl) +- [/api/types/installers/installer.sh.tmpl](https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/api/types/installers/installer.sh.tmpl) +- [/lib/web/scripts/oneoff/oneoff.sh](https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/lib/web/scripts/oneoff/oneoff.sh) +- [/lib/web/scripts/node-join/install.sh](https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/lib/web/scripts/node-join/install.sh) +- [/assets/aws/files/install-hardened.sh](https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/assets/aws/files/install-hardened.sh) Eventually, additional logic from the scripts could be added to `teleport-updater`, such that `teleport-updater` can configure teleport. From f8e11d7f87d1fdde7037246214975eeeb6ea696c Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 8 Apr 2024 14:18:10 -0400 Subject: [PATCH 016/105] Update 0169-auto-updates-linux-agents.md --- rfd/0169-auto-updates-linux-agents.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index e3eddf662b735..69e8521ec8852 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -255,7 +255,7 @@ used to trigger other integrations to update the installed version of agents. ```shell $ tctl autoupdate watch {"agent_version": "1.0.0", "agent_edition": "enterprise", ... } -{"agent_version": "1.0.1, "agent_edition": "enterprise", ... } +{"agent_version": "1.0.1", "agent_edition": "enterprise", ... } {"agent_version": "2.0.0", "agent_edition": "enterprise", ... } [...] ``` From e1e8a9a7f526cd5bcf61d7859e4a0b0e14719536 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 8 Apr 2024 14:21:21 -0400 Subject: [PATCH 017/105] Update 0169-auto-updates-linux-agents.md --- rfd/0169-auto-updates-linux-agents.md | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 69e8521ec8852..57cf7d21bf590 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -91,8 +91,10 @@ spec: # agent updates are in place. agent_auto_update: on|off # agent_update_hour sets the hour in UTC at which clients should update their agents. - # The value -1 will set the upgrade time to the current time, resulting in immediate upgrades. - agent_update_hour: -1-23 + agent_update_hour: 0-23 + # agent_update_now overrides agent_update_hour and sets agent update time to the current time. 
+  # This is useful for rolling out critical security updates and bug fixes.
+  agent_update_now: on|off
   # agent_update_jitter_seconds sets a duration in which the upgrade will occur after the hour.
   # The agent upgrader will pick a random time within this duration in which to upgrade.
   agent_update_jitter_seconds: 0-MAXINT64
 
   [...]
 ```
 ```
 $ tctl autoupdate update --set-agent-auto-update=off
 Automatic updates configuration has been updated.
 $ tctl autoupdate update --set-agent-update-hour=3
 Automatic updates configuration has been updated.
+$ tctl autoupdate update --set-agent-update-now=true
+Automatic updates configuration has been updated.
 $ tctl autoupdate update --set-agent-update-jitter-seconds=600
 Automatic updates configuration has been updated.
 ```

From b472177b48a78ea8cf8814340c55358e6d03205d Mon Sep 17 00:00:00 2001
From: Stephen Levine
Date: Mon, 8 Apr 2024 14:39:50 -0400
Subject: [PATCH 018/105] Update 0169-auto-updates-linux-agents.md

---
 rfd/0169-auto-updates-linux-agents.md | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index 57cf7d21bf590..2eabcf3dabc4e 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -282,6 +282,21 @@ Eventually, additional logic from the scripts could be added to `teleport-update
 
 Moving additional logic into the upgrader is out-of-scope for this proposal.
 
+To create pre-baked VM or container images that reduce the complexity of the cluster joining operation, two workflows are supported:
+- Install the `teleport-updater` package and defer `teleport-updater enable`, Teleport configuration, and `systemctl enable teleport` to cloud-init scripts.
+  This allows both the proxy address and token to be injected at VM initialization. The VM image may be used with any Teleport cluster.
+  Installer scripts will continue to function, as the package install operation will no-op.
+- Install the `teleport-updater` package and run `teleport-updater enable` before the image is baked, but defer final Teleport configuration and `systemctl enable teleport` to cloud-init scripts.
+  This allows the proxy address to be pre-set in the image. `teleport.yaml` can be partially configured during image creation. At minimum, the token must be injected via cloud-init scripts.
+  Installer scripts would be skipped in favor of the `teleport configure` command.
+
+It is possible for a VM or container image to be created with a baked-in join token.
+We should recommend against this workflow for security reasons, since a long-lived token improperly stored in an image could be leaked.
+
+Alternatively, users may prefer to skip pre-baked agent configuration, and run one of the script-based installers to join VMs to the cluster after the VM is started.
+
+Documentation should be created covering the above workflows.
+ ### Documentation The following documentation will need to be updated to cover the new upgrader workflow: From 5fa53a73b719abc31ecb97fb004035880531bbb8 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 15 Apr 2024 17:32:03 -0400 Subject: [PATCH 019/105] Downgrades --- rfd/0169-auto-updates-linux-agents.md | 66 +++++++++++++++++++++------ 1 file changed, 52 insertions(+), 14 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 2eabcf3dabc4e..a017e5ff231b6 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -141,10 +141,13 @@ $ tree /var/lib/teleport │ │ ├── ... │ │ ├── teleport-updater │ │ └── teleport - │ └── etc - │ ├── ... - │ └── systemd - │ └── teleport.service + │ ├── etc + │ │ ├── ... + │ │ └── systemd + │ │ └── teleport.service + │ └── backup + │ ├── teleport + │ └── backup.yaml ├── 15.1.1 │ ├── bin │ │ ├── ... @@ -176,6 +179,19 @@ spec: active_version: 15.1.1 ``` +backup.yaml: +``` +version: v1 +kind: config_backup +spec: + # proxy address from the backup + proxy: mytenant.teleport.sh + # version from the backup + version: 15.1.0 + # time the backup was created + creation_time: 2020-12-09T16:09:53+00:00 +``` + ### Runtime The agent-updater will run as a periodically executing systemd service which runs every 10 minutes. @@ -203,12 +219,13 @@ The `enable` subcommand will: 6. Verify the checksum. 7. Extract the tarball to `/var/lib/teleport/versions/VERSION`. 8. Replace any existing binaries or symlinks with symlinks to the current version. -9. Restart the agent if the systemd service is already enabled. -10. Set `active_version` in `updates.yaml` if successful or not enabled. -11. Replace the old symlinks or binaries and quit (exit 1) if unsuccessful. -12. Remove any `teleport` package if installed. -13. Verify the symlinks to the active version still exists. -14. Remove all stored versions of the agent except the current version and last working version. +9. Backup /var/lib/teleport into `/var/lib/teleport/versions/OLD-VERSION/backup/teleport` +10. Restart the agent if the systemd service is already enabled. +11. Set `active_version` in `updates.yaml` if successful or not enabled. +12. Replace the symlink/binary and `/var/lib/teleport` and quit (exit 1) if unsuccessful. +13. Remove any `teleport` package if installed. +14. Verify the symlinks to the active version still exists. +15. Remove all stored versions of the agent except the current version and last working version. The `disable` subcommand will: 1. Configure `updates.yaml` to set `enabled` to false. @@ -223,10 +240,11 @@ When `update` subcommand is otherwise executed, it will: 7. Verify the checksum. 8. Extract the tarball to `/var/lib/teleport/versions/VERSION`. 9. Update symlinks to point at the new version. -10. Restart the agent if the systemd service is already enabled. -11. Set `active_version` in `updates.yaml` if successful or not enabled. -12. Replace the old symlink or binary and quit (exit 1) if unsuccessful. -13. Remove all stored versions of the agent except the current version and last working version. +10. Backup /var/lib/teleport into `/var/lib/teleport/versions/OLD-VERSION/backup/teleport`. +11. Restart the agent if the systemd service is already enabled. +12. Set `active_version` in `updates.yaml` if successful or not enabled. +13. Replace the old symlink/binary and `/var/lib/teleport` and quit (exit 1) if unsuccessful. +14. 
Remove all stored versions of the agent except the current version and last working version. To enable auto-updates of the updater itself, all commands will first check for an `active_version`, and reexec using the `teleport-updater` at that version if present and different. The `/usr/local/bin/teleport-upgrader` symlink will take precedence to avoid reexec in most scenarios. @@ -247,6 +265,26 @@ To retrieve known information about agent upgrades, the `status` subcommand will } ``` +### Downgrades + +Downgrades may be necessary in cases where we have rolled out a bug or security vulnerability with critical impact. +Downgrades are challenging, because `/var/lib/teleport` used by newer version of Teleport may not be valid for older versions of Teleport. + +When Teleport is downgraded to a previous version that has a backup of `/var/lib/teleport` present in `/var/lib/teleport/versions/OLD-VERSION/backup/teleport`: +1. `/var/lib/teleport/versions/OLD-VERSION/backup/backup.yaml` is validated to determine if the backup is usable (proxy and version must match, age must be less than cert lifetime, etc.) +2. If the backup is valid, Teleport is fully stopped, the backup is restored along with symlinks, and the downgraded version of Teleport is started. +3. If the backup is invalid, we refuse to downgrade. + +Downgrades are still applied with `teleport-upgrader update`. +The above steps modulate the standard workflow in the section above. + +Notes: +- Downgrades can lead to downtime, as Teleport must be fully-stopped to safely replace `/var/lib/teleport`. +- `/var/lib/teleport/versions/` is not included in backups. + +Questions: +- Should we refuse to downgrade in step (3), or risk starting the older version of Teleport with the newer `/var/lib/teleport`? + ### Manual Workflow For use cases that fall outside of the functionality provided by `teleport-updater`, we provide an alternative manual workflow using the `/v1/webapi/ping` endpoint. From af85475924a983248dca3f49b21f62b1d69d7d10 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 13 May 2024 12:53:04 -0400 Subject: [PATCH 020/105] Feedback --- rfd/0169-auto-updates-linux-agents.md | 24 ++++++++++++++++-------- 1 file changed, 16 insertions(+), 8 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index a017e5ff231b6..470acef51445e 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -138,11 +138,12 @@ $ tree /var/lib/teleport └── versions ├── 15.0.0 │ ├── bin - │ │ ├── ... + │ │ ├── tsh + │ │ ├── tbot + │ │ ├── ... # other binaries │ │ ├── teleport-updater │ │ └── teleport │ ├── etc - │ │ ├── ... │ │ └── systemd │ │ └── teleport.service │ └── backup @@ -150,14 +151,19 @@ $ tree /var/lib/teleport │ └── backup.yaml ├── 15.1.1 │ ├── bin - │ │ ├── ... + │ │ ├── tsh + │ │ ├── tbot + │ │ ├── ... # other binaries │ │ ├── teleport-updater │ │ └── teleport │ └── etc - │ ├── ... │ └── systemd │ └── teleport.service └── updates.yaml +$ ls -l /usr/local/bin/tsh +/usr/local/bin/tsh -> /var/lib/teleport/versions/15.0.0/bin/tsh +$ ls -l /usr/local/bin/tbot +/usr/local/bin/tbot -> /var/lib/teleport/versions/15.0.0/bin/tbot $ ls -l /usr/local/bin/teleport /usr/local/bin/teleport -> /var/lib/teleport/versions/15.0.0/bin/teleport $ ls -l /usr/local/bin/teleport-updater @@ -216,13 +222,13 @@ The `enable` subcommand will: 3. If the current updater-managed version of Teleport is the latest, and teleport package is not installed, quit. 4. 
If the current updater-managed version of Teleport is the latest, but the teleport package is installed, jump to (12). 5. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. -6. Verify the checksum. +6. Download and verify the checksum. 7. Extract the tarball to `/var/lib/teleport/versions/VERSION`. 8. Replace any existing binaries or symlinks with symlinks to the current version. 9. Backup /var/lib/teleport into `/var/lib/teleport/versions/OLD-VERSION/backup/teleport` 10. Restart the agent if the systemd service is already enabled. 11. Set `active_version` in `updates.yaml` if successful or not enabled. -12. Replace the symlink/binary and `/var/lib/teleport` and quit (exit 1) if unsuccessful. +12. Replace the symlinks/binaries and `/var/lib/teleport` and quit (exit 1) if unsuccessful. 13. Remove any `teleport` package if installed. 14. Verify the symlinks to the active version still exists. 15. Remove all stored versions of the agent except the current version and last working version. @@ -237,18 +243,20 @@ When `update` subcommand is otherwise executed, it will: 4. If the current version of Teleport is the latest, quit. 5. Wait `random(0, agent_update_jitter_seconds)` seconds. 6. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. -7. Verify the checksum. +7. Download and verify the checksum. 8. Extract the tarball to `/var/lib/teleport/versions/VERSION`. 9. Update symlinks to point at the new version. 10. Backup /var/lib/teleport into `/var/lib/teleport/versions/OLD-VERSION/backup/teleport`. 11. Restart the agent if the systemd service is already enabled. 12. Set `active_version` in `updates.yaml` if successful or not enabled. -13. Replace the old symlink/binary and `/var/lib/teleport` and quit (exit 1) if unsuccessful. +13. Replace the old symlinks/binaries and `/var/lib/teleport` and quit (exit 1) if unsuccessful. 14. Remove all stored versions of the agent except the current version and last working version. To enable auto-updates of the updater itself, all commands will first check for an `active_version`, and reexec using the `teleport-updater` at that version if present and different. The `/usr/local/bin/teleport-upgrader` symlink will take precedence to avoid reexec in most scenarios. +If `teleport-updater` fails with an error, and an older version of `teleport-updater` is available, the upgrade will retry with the older version. + To retrieve known information about agent upgrades, the `status` subcommand will return the following: ```json { From 0c832a4dfbe4d8a69831b85306105378fc1e3581 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 13 May 2024 14:49:21 -0400 Subject: [PATCH 021/105] Update 0169-auto-updates-linux-agents.md --- rfd/0169-auto-updates-linux-agents.md | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 470acef51445e..15dd526180439 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -255,7 +255,15 @@ When `update` subcommand is otherwise executed, it will: To enable auto-updates of the updater itself, all commands will first check for an `active_version`, and reexec using the `teleport-updater` at that version if present and different. The `/usr/local/bin/teleport-upgrader` symlink will take precedence to avoid reexec in most scenarios. 
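+
+A sketch of the reexec step in Go, assuming standard Unix exec semantics (Linux-only, matching the scope of this RFD; the helper name is illustrative):
+
+```go
+package updater
+
+import (
+    "os"
+    "syscall"
+)
+
+// reexec replaces the current process with the updater binary recorded
+// as active_version, preserving arguments and environment.
+func reexec(activeUpdater string) error {
+    argv := append([]string{activeUpdater}, os.Args[1:]...)
+    return syscall.Exec(activeUpdater, argv, os.Environ())
+}
+```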
-If `teleport-updater` fails with an error, and an older version of `teleport-updater` is available, the upgrade will retry with the older version. +#### Failure Conditions + +If the new version of Teleport fails to start, the installation of Teleport is reverted as described above. + +If `teleport-updater` itself fails with an error, and an older version of `teleport-updater` is available, the upgrade will retry with the older version. + +Known failure conditions caused by intentional configuration (e.g., upgrades disabled) will not trigger retry logic. + +#### Status To retrieve known information about agent upgrades, the `status` subcommand will return the following: ```json From 63bde20a89286729d6fd41602c5af7f27ba46c84 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 13 May 2024 14:55:05 -0400 Subject: [PATCH 022/105] Remove last working copy of teleport --- rfd/0169-auto-updates-linux-agents.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 15dd526180439..9404e0c4d44a3 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -231,7 +231,7 @@ The `enable` subcommand will: 12. Replace the symlinks/binaries and `/var/lib/teleport` and quit (exit 1) if unsuccessful. 13. Remove any `teleport` package if installed. 14. Verify the symlinks to the active version still exists. -15. Remove all stored versions of the agent except the current version and last working version. +15. Remove all stored versions of the agent except the current version. The `disable` subcommand will: 1. Configure `updates.yaml` to set `enabled` to false. @@ -250,7 +250,7 @@ When `update` subcommand is otherwise executed, it will: 11. Restart the agent if the systemd service is already enabled. 12. Set `active_version` in `updates.yaml` if successful or not enabled. 13. Replace the old symlinks/binaries and `/var/lib/teleport` and quit (exit 1) if unsuccessful. -14. Remove all stored versions of the agent except the current version and last working version. +14. Remove all stored versions of the agent except the current version. To enable auto-updates of the updater itself, all commands will first check for an `active_version`, and reexec using the `teleport-updater` at that version if present and different. The `/usr/local/bin/teleport-upgrader` symlink will take precedence to avoid reexec in most scenarios. From 80aeae4d3ecc7a2c94647d3e7786cb48656acfd6 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 13 May 2024 15:01:41 -0400 Subject: [PATCH 023/105] add step to ensure free disk space --- rfd/0169-auto-updates-linux-agents.md | 44 ++++++++++++++------------- 1 file changed, 23 insertions(+), 21 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 9404e0c4d44a3..9d9764027aed5 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -220,18 +220,19 @@ The `enable` subcommand will: 1. Configure `updates.yaml` with the current proxy address and set `enabled` to true. 2. Query the `/v1/webapi/ping` endpoint. 3. If the current updater-managed version of Teleport is the latest, and teleport package is not installed, quit. -4. If the current updater-managed version of Teleport is the latest, but the teleport package is installed, jump to (12). -5. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. -6. 
Download and verify the checksum. -7. Extract the tarball to `/var/lib/teleport/versions/VERSION`. -8. Replace any existing binaries or symlinks with symlinks to the current version. -9. Backup /var/lib/teleport into `/var/lib/teleport/versions/OLD-VERSION/backup/teleport` -10. Restart the agent if the systemd service is already enabled. -11. Set `active_version` in `updates.yaml` if successful or not enabled. -12. Replace the symlinks/binaries and `/var/lib/teleport` and quit (exit 1) if unsuccessful. -13. Remove any `teleport` package if installed. -14. Verify the symlinks to the active version still exists. -15. Remove all stored versions of the agent except the current version. +4. If the current updater-managed version of Teleport is the latest, but the teleport package is installed, jump to (14). +5. Ensure there is enough free disk space to upgrade Teleport. +6. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. +7. Download and verify the checksum (tarball URL suffixed with `.sha256`). +8. Extract the tarball to `/var/lib/teleport/versions/VERSION`. +9. Replace any existing binaries or symlinks with symlinks to the current version. +10. Backup /var/lib/teleport into `/var/lib/teleport/versions/OLD-VERSION/backup/teleport` +11. Restart the agent if the systemd service is already enabled. +12. Set `active_version` in `updates.yaml` if successful or not enabled. +13. Replace the symlinks/binaries and `/var/lib/teleport` and quit (exit 1) if unsuccessful. +14. Remove any `teleport` package if installed. +15. Verify the symlinks to the active version still exists. +16. Remove all stored versions of the agent except the current version. The `disable` subcommand will: 1. Configure `updates.yaml` to set `enabled` to false. @@ -242,15 +243,16 @@ When `update` subcommand is otherwise executed, it will: 3. Check if the current time is after the time advertised in `agent_update_after`, and that `agent_auto_updates` is true. 4. If the current version of Teleport is the latest, quit. 5. Wait `random(0, agent_update_jitter_seconds)` seconds. -6. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. -7. Download and verify the checksum. -8. Extract the tarball to `/var/lib/teleport/versions/VERSION`. -9. Update symlinks to point at the new version. -10. Backup /var/lib/teleport into `/var/lib/teleport/versions/OLD-VERSION/backup/teleport`. -11. Restart the agent if the systemd service is already enabled. -12. Set `active_version` in `updates.yaml` if successful or not enabled. -13. Replace the old symlinks/binaries and `/var/lib/teleport` and quit (exit 1) if unsuccessful. -14. Remove all stored versions of the agent except the current version. +6. Ensure there is enough free disk space to upgrade Teleport. +7. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. +8. Download and verify the checksum (tarball URL suffixed with `.sha256`). +9. Extract the tarball to `/var/lib/teleport/versions/VERSION`. +10. Update symlinks to point at the new version. +11. Backup /var/lib/teleport into `/var/lib/teleport/versions/OLD-VERSION/backup/teleport`. +12. Restart the agent if the systemd service is already enabled. +13. Set `active_version` in `updates.yaml` if successful or not enabled. +14. Replace the old symlinks/binaries and `/var/lib/teleport` and quit (exit 1) if unsuccessful. +15. Remove all stored versions of the agent except the current version. 
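+
+A sketch of the checksum verification in step 7, assuming the `.sha256` file contains a hex-encoded digest as its first field (the helper name is illustrative):
+
+```go
+package updater
+
+import (
+    "bytes"
+    "crypto/sha256"
+    "encoding/hex"
+    "fmt"
+    "io"
+    "os"
+)
+
+// verifyChecksum compares the SHA256 of the downloaded tarball against
+// the digest published at the tarball URL suffixed with .sha256.
+func verifyChecksum(tarballPath string, sumFile []byte) error {
+    fields := bytes.Fields(sumFile)
+    if len(fields) == 0 {
+        return fmt.Errorf("empty checksum file")
+    }
+    f, err := os.Open(tarballPath)
+    if err != nil {
+        return err
+    }
+    defer f.Close()
+    h := sha256.New()
+    if _, err := io.Copy(h, f); err != nil {
+        return err
+    }
+    if got := hex.EncodeToString(h.Sum(nil)); got != string(fields[0]) {
+        return fmt.Errorf("checksum mismatch: got %s", got)
+    }
+    return nil
+}
+```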
To enable auto-updates of the updater itself, all commands will first check for an `active_version`, and reexec using the `teleport-updater` at that version if present and different. The `/usr/local/bin/teleport-upgrader` symlink will take precedence to avoid reexec in most scenarios. From 1729c83b2149a6f00f0c1f82278b66e048d6a16b Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 13 May 2024 15:05:30 -0400 Subject: [PATCH 024/105] Typos --- rfd/0169-auto-updates-linux-agents.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 9d9764027aed5..afefc14255b44 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -255,7 +255,7 @@ When `update` subcommand is otherwise executed, it will: 15. Remove all stored versions of the agent except the current version. To enable auto-updates of the updater itself, all commands will first check for an `active_version`, and reexec using the `teleport-updater` at that version if present and different. -The `/usr/local/bin/teleport-upgrader` symlink will take precedence to avoid reexec in most scenarios. +The `/usr/local/bin/teleport-updater` symlink will take precedence to avoid reexec in most scenarios. #### Failure Conditions @@ -293,7 +293,7 @@ When Teleport is downgraded to a previous version that has a backup of `/var/lib 2. If the backup is valid, Teleport is fully stopped, the backup is restored along with symlinks, and the downgraded version of Teleport is started. 3. If the backup is invalid, we refuse to downgrade. -Downgrades are still applied with `teleport-upgrader update`. +Downgrades are still applied with `teleport-updater update`. The above steps modulate the standard workflow in the section above. Notes: From 54d70b01081283b8a1cfa48be73178ec17dd4a5f Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Thu, 23 May 2024 11:31:26 -0400 Subject: [PATCH 025/105] Update 0169-auto-updates-linux-agents.md --- rfd/0169-auto-updates-linux-agents.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index afefc14255b44..82847e65edd60 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -48,7 +48,7 @@ We must provide a seamless, hands-off experience for auto-updates that is easy t We will ship a new auto-updater package written in Go that does not interface with the system package manager. It will be versioned separately from Teleport, and manage the installation of the correct Teleport agent version manually. -It will read the unauthenticated `/v1/webapi/ping` endpoint from the Teleport proxy, parse new fields on that endpoint, and install the specified agent version according to the specified upgrade plan. +It will read the unauthenticated `/v1/webapi/find` endpoint from the Teleport proxy, parse new fields on that endpoint, and install the specified agent version according to the specified upgrade plan. It will download the correct version of Teleport as a tarball, unpack it in `/var/lib/teleport`, and ensure it is symlinked from `/usr/local/bin`. ### Installation @@ -65,7 +65,7 @@ $ systemctl enable teleport #### Endpoints -`/v1/webapi/ping` +`/v1/webapi/find` ```json { "server_edition": "enterprise", @@ -89,7 +89,7 @@ spec: # agent_auto_update allows turning agent updates on or off at the # cluster level. 
Only turn agent automatic updates off if self-managed # agent updates are in place. - agent_auto_update: on|off + agent_auto_update: true|false # agent_update_hour sets the hour in UTC at which clients should update their agents. agent_update_hour: 0-23 # agent_update_now overrides agent_update_hour and sets agent update time to the current time. @@ -97,7 +97,7 @@ spec: agent_update_now: on|off # agent_update_jitter_seconds sets a duration in which the upgrade will occur after the hour. # The agent upgrader will pick a random time within this duration in which to upgrade. - agent_update_jitter_seconds: 0-MAXINT64 + agent_update_jitter_seconds: 0-3600 [...] ``` @@ -218,7 +218,7 @@ It will also run update teleport immediately, to ensure that subsequent executio The `enable` subcommand will: 1. Configure `updates.yaml` with the current proxy address and set `enabled` to true. -2. Query the `/v1/webapi/ping` endpoint. +2. Query the `/v1/webapi/find` endpoint. 3. If the current updater-managed version of Teleport is the latest, and teleport package is not installed, quit. 4. If the current updater-managed version of Teleport is the latest, but the teleport package is installed, jump to (14). 5. Ensure there is enough free disk space to upgrade Teleport. @@ -239,7 +239,7 @@ The `disable` subcommand will: When `update` subcommand is otherwise executed, it will: 1. Check `updates.yaml`, and quit (exit 0) if `enabled` is false, or quit (exit 1) if `enabled` is true and no proxy address is set. -2. Query the `/v1/webapi/ping` endpoint. +2. Query the `/v1/webapi/find` endpoint. 3. Check if the current time is after the time advertised in `agent_update_after`, and that `agent_auto_updates` is true. 4. If the current version of Teleport is the latest, quit. 5. Wait `random(0, agent_update_jitter_seconds)` seconds. @@ -305,7 +305,7 @@ Questions: ### Manual Workflow -For use cases that fall outside of the functionality provided by `teleport-updater`, we provide an alternative manual workflow using the `/v1/webapi/ping` endpoint. +For use cases that fall outside of the functionality provided by `teleport-updater`, we provide an alternative manual workflow using the `/v1/webapi/find` endpoint. This workflow supports customers that cannot use the auto-update mechanism provided by `teleport-updater` because they use their own automation for updates (e.g., JamF or ansible). Cluster administrators that want to self-manage agent updates will be From a8894f08b283f65ade1554b1110be3c1ed844ca0 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Thu, 23 May 2024 12:26:49 -0400 Subject: [PATCH 026/105] Update 0169-auto-updates-linux-agents.md --- rfd/0169-auto-updates-linux-agents.md | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 82847e65edd60..f709b48171627 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -213,7 +213,7 @@ $ teleport-updater enable --proxy mytenant.teleport.sh If the proxy address is not provided with `--proxy`, the current proxy address from `teleport.yaml` is used. -On servers without Teleport installed already, the `enable` subcommand will change the behavior of `teleport-update update` to update teleport and restart the existing agent, if running. +The `enable` subcommand will change the behavior of `teleport-update update` to update teleport and restart the existing agent, if running. 
It will also run update teleport immediately, to ensure that subsequent executions succeed. The `enable` subcommand will: @@ -377,6 +377,9 @@ The Upgrade Framework (TUF) will be used to implement secure updates in the futu ## Execution Plan 1. Implement new auto-updater in Go. -2. Prep documentation changes. -3. Release new updater via teleport-ent-updater package. -4. Release documentation changes. +2. Test extensively on all supported Linux distributions. +3. Prep documentation changes. +4. Release new updater via teleport-ent-updater package. +5. Release documentation changes. +6. Communicate to select Cloud customers that they must update their updater, starting with lower ARR customers. +7. Communicate to all Cloud customers that they must update their updater. From 569e9b30d74848b942f2d5e5a8b96b0c50c83cda Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Tue, 28 May 2024 17:37:01 -0400 Subject: [PATCH 027/105] feedback --- rfd/0169-auto-updates-linux-agents.md | 55 +++++++++++++-------------- 1 file changed, 26 insertions(+), 29 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index f709b48171627..7cff749bae7dd 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -47,7 +47,7 @@ We must provide a seamless, hands-off experience for auto-updates that is easy t ## Details We will ship a new auto-updater package written in Go that does not interface with the system package manager. -It will be versioned separately from Teleport, and manage the installation of the correct Teleport agent version manually. +It will be distributed as a separate package from Teleport, and manage the installation of the correct Teleport agent version manually. It will read the unauthenticated `/v1/webapi/find` endpoint from the Teleport proxy, parse new fields on that endpoint, and install the specified agent version according to the specified upgrade plan. It will download the correct version of Teleport as a tarball, unpack it in `/var/lib/teleport`, and ensure it is symlinked from `/usr/local/bin`. @@ -71,14 +71,13 @@ $ systemctl enable teleport "server_edition": "enterprise", "agent_version": "15.1.1", "agent_auto_update": true, - "agent_update_after": "2024-04-23T18:00:00.000Z", - "agent_update_jitter_seconds": 10, + "agent_update_jitter_seconds": 10 } ``` Notes: -- Critical updates are achieved by serving `agent_update_after` with the current time. -- The Teleport proxy translates upgrade hours (below) into a specific time after which all agents should be upgraded. -- If an agent misses an upgrade window, it will always update immediately. +- The Teleport proxy translates upgrade hours (below) into a specific time after which the served `agent_version` changes, resulting in all agents being upgraded. +- Critical updates are achieved by serving the desired `agent_version` immediately. +- If an agent misses an upgrade window, it will always update immediately due to the new agent version being served. - The edition served is the cluster edition (enterprise, enterprise-fips, or oss), and cannot be configured. #### Teleport Resources @@ -92,11 +91,11 @@ spec: agent_auto_update: true|false # agent_update_hour sets the hour in UTC at which clients should update their agents. agent_update_hour: 0-23 - # agent_update_now overrides agent_update_hour and sets agent update time to the current time. + # agent_update_now overrides agent_update_hour and serves the new version immediately. 
# This is useful for rolling out critical security updates and bug fixes. agent_update_now: on|off # agent_update_jitter_seconds sets a duration in which the upgrade will occur after the hour. - # The agent upgrader will pick a random time within this duration in which to upgrade. + # The agent upgrader will pick a random time within this duration to wait to upgrade. agent_update_jitter_seconds: 0-3600 [...] @@ -116,9 +115,7 @@ Automatic updates configuration has been updated. kind: autoupdate_version spec: # agent_version is the version of the agent the cluster will advertise. - # Can be auto (match the version of the proxy) or an exact semver formatted - # version. - agent_version: auto|X.Y.Z + agent_version: X.Y.Z [...] ``` @@ -147,7 +144,7 @@ $ tree /var/lib/teleport │ │ └── systemd │ │ └── teleport.service │ └── backup - │ ├── teleport + │ ├── sqlite.db │ └── backup.yaml ├── 15.1.1 │ ├── bin @@ -188,7 +185,7 @@ spec: backup.yaml: ``` version: v1 -kind: config_backup +kind: db_backup spec: # proxy address from the backup proxy: mytenant.teleport.sh @@ -226,13 +223,13 @@ The `enable` subcommand will: 7. Download and verify the checksum (tarball URL suffixed with `.sha256`). 8. Extract the tarball to `/var/lib/teleport/versions/VERSION`. 9. Replace any existing binaries or symlinks with symlinks to the current version. -10. Backup /var/lib/teleport into `/var/lib/teleport/versions/OLD-VERSION/backup/teleport` +10. Backup `/var/lib/teleport/proc/sqlite.db` into `/var/lib/teleport/versions/OLD-VERSION/backup/sqlite.db` and create `backup.yaml`. 11. Restart the agent if the systemd service is already enabled. 12. Set `active_version` in `updates.yaml` if successful or not enabled. -13. Replace the symlinks/binaries and `/var/lib/teleport` and quit (exit 1) if unsuccessful. +13. Replace the symlinks/binaries and `/var/lib/teleport/proc/sqlite.db` and quit (exit 1) if unsuccessful. 14. Remove any `teleport` package if installed. 15. Verify the symlinks to the active version still exists. -16. Remove all stored versions of the agent except the current version. +16. Remove all stored versions of the agent except the current version and last working version. The `disable` subcommand will: 1. Configure `updates.yaml` to set `enabled` to false. @@ -240,7 +237,7 @@ The `disable` subcommand will: When `update` subcommand is otherwise executed, it will: 1. Check `updates.yaml`, and quit (exit 0) if `enabled` is false, or quit (exit 1) if `enabled` is true and no proxy address is set. 2. Query the `/v1/webapi/find` endpoint. -3. Check if the current time is after the time advertised in `agent_update_after`, and that `agent_auto_updates` is true. +3. Check that `agent_auto_updates` is true. 4. If the current version of Teleport is the latest, quit. 5. Wait `random(0, agent_update_jitter_seconds)` seconds. 6. Ensure there is enough free disk space to upgrade Teleport. @@ -248,15 +245,17 @@ When `update` subcommand is otherwise executed, it will: 8. Download and verify the checksum (tarball URL suffixed with `.sha256`). 9. Extract the tarball to `/var/lib/teleport/versions/VERSION`. 10. Update symlinks to point at the new version. -11. Backup /var/lib/teleport into `/var/lib/teleport/versions/OLD-VERSION/backup/teleport`. +11. Backup `/var/lib/teleport/proc/sqlite.db` into `/var/lib/teleport/versions/OLD-VERSION/backup/sqlite.db` and create `backup.yaml`. 12. Restart the agent if the systemd service is already enabled. 13. Set `active_version` in `updates.yaml` if successful or not enabled. -14. 
Replace the old symlinks/binaries and `/var/lib/teleport` and quit (exit 1) if unsuccessful. -15. Remove all stored versions of the agent except the current version. +14. Replace the old symlinks/binaries and `/var/lib/teleport/proc/sqlite.db` and quit (exit 1) if unsuccessful. +15. Remove all stored versions of the agent except the current version and last working version. To enable auto-updates of the updater itself, all commands will first check for an `active_version`, and reexec using the `teleport-updater` at that version if present and different. The `/usr/local/bin/teleport-updater` symlink will take precedence to avoid reexec in most scenarios. +To ensure that SELinux permissions do not prevent the `teleport-updater` binary from installing/removing Teleport versions, the updater package will configure SELinux contexts to allow changes to all required paths. + #### Failure Conditions If the new version of Teleport fails to start, the installation of Teleport is reverted as described above. @@ -276,7 +275,6 @@ To retrieve known information about agent upgrades, the `status` subcommand will "agent_edition_installed": "enterprise", "agent_edition_desired": "enterprise", "agent_edition_previous": "enterprise", - "agent_update_time_next": "2020-12-09T16:09:53+00:00", "agent_update_time_last": "2020-12-10T16:00:00+00:00", "agent_update_time_jitter": 600, "agent_updates_enabled": true @@ -286,9 +284,9 @@ To retrieve known information about agent upgrades, the `status` subcommand will ### Downgrades Downgrades may be necessary in cases where we have rolled out a bug or security vulnerability with critical impact. -Downgrades are challenging, because `/var/lib/teleport` used by newer version of Teleport may not be valid for older versions of Teleport. +Downgrades are challenging, because `sqlite.db` used by newer version of Teleport may not be valid for older versions of Teleport. -When Teleport is downgraded to a previous version that has a backup of `/var/lib/teleport` present in `/var/lib/teleport/versions/OLD-VERSION/backup/teleport`: +When Teleport is downgraded to a previous version that has a backup of `sqlite.db` present in `/var/lib/teleport/versions/OLD-VERSION/backup/`: 1. `/var/lib/teleport/versions/OLD-VERSION/backup/backup.yaml` is validated to determine if the backup is usable (proxy and version must match, age must be less than cert lifetime, etc.) 2. If the backup is valid, Teleport is fully stopped, the backup is restored along with symlinks, and the downgraded version of Teleport is started. 3. If the backup is invalid, we refuse to downgrade. @@ -296,17 +294,16 @@ When Teleport is downgraded to a previous version that has a backup of `/var/lib Downgrades are still applied with `teleport-updater update`. The above steps modulate the standard workflow in the section above. -Notes: -- Downgrades can lead to downtime, as Teleport must be fully-stopped to safely replace `/var/lib/teleport`. -- `/var/lib/teleport/versions/` is not included in backups. +Downgrades lead to downtime, as Teleport must be fully-stopped to safely replace `sqlite.db`. -Questions: -- Should we refuse to downgrade in step (3), or risk starting the older version of Teleport with the newer `/var/lib/teleport`? +Teleport CA certificate rotations will break rollbacks. +This may be addressed in the future by additional validation of the agent's client certificate issuer fingerprints. +This would prevent downgrades to backups with invalid certs. 
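As a sketch of the validation in step (1) above, the updater might check `backup.yaml` before restoring. The `creation_time` field, helper name, and signature are assumptions for illustration; the real schema and checks may differ:

```go
package updater

import (
	"errors"
	"os"
	"time"

	"gopkg.in/yaml.v3"
)

// dbBackup mirrors a hypothetical backup.yaml layout.
type dbBackup struct {
	Kind string `yaml:"kind"`
	Spec struct {
		Proxy        string    `yaml:"proxy"`
		Version      string    `yaml:"version"`
		CreationTime time.Time `yaml:"creation_time"`
	} `yaml:"spec"`
}

// validateBackup returns nil only if the backup matches the current proxy
// and the downgrade target version, and is younger than the cert lifetime.
func validateBackup(path, proxy, target string, certTTL time.Duration) error {
	raw, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	var b dbBackup
	if err := yaml.Unmarshal(raw, &b); err != nil {
		return err
	}
	switch {
	case b.Spec.Proxy != proxy:
		return errors.New("backup was created against a different proxy")
	case b.Spec.Version != target:
		return errors.New("backup does not match the downgrade target version")
	case time.Since(b.Spec.CreationTime) > certTTL:
		return errors.New("backup is older than the agent certificate lifetime")
	}
	return nil
}
```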
### Manual Workflow For use cases that fall outside of the functionality provided by `teleport-updater`, we provide an alternative manual workflow using the `/v1/webapi/find` endpoint. -This workflow supports customers that cannot use the auto-update mechanism provided by `teleport-updater` because they use their own automation for updates (e.g., JamF or ansible). +This workflow supports customers that cannot use the auto-update mechanism provided by `teleport-updater` because they use their own automation for updates (e.g., JamF or Ansible). Cluster administrators that want to self-manage agent updates will be able to get and watch for changes to agent versions which can then be From 5d9d131ba53a9f5a748a625c64be69c20aae84a9 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Tue, 28 May 2024 17:49:58 -0400 Subject: [PATCH 028/105] Update 0169-auto-updates-linux-agents.md --- rfd/0169-auto-updates-linux-agents.md | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 7cff749bae7dd..dba4c72923fd5 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -284,6 +284,8 @@ To retrieve known information about agent upgrades, the `status` subcommand will ### Downgrades Downgrades may be necessary in cases where we have rolled out a bug or security vulnerability with critical impact. +To initiate a downgrade, `agent_version` is set to an older version than it was previously set to. + Downgrades are challenging, because `sqlite.db` used by newer version of Teleport may not be valid for older versions of Teleport. When Teleport is downgraded to a previous version that has a backup of `sqlite.db` present in `/var/lib/teleport/versions/OLD-VERSION/backup/`: @@ -291,10 +293,12 @@ When Teleport is downgraded to a previous version that has a backup of `sqlite.d 2. If the backup is valid, Teleport is fully stopped, the backup is restored along with symlinks, and the downgraded version of Teleport is started. 3. If the backup is invalid, we refuse to downgrade. -Downgrades are still applied with `teleport-updater update`. +Downgrades are applied with `teleport-updater update`, just like upgrades. The above steps modulate the standard workflow in the section above. -Downgrades lead to downtime, as Teleport must be fully-stopped to safely replace `sqlite.db`. +Teleport must be fully-stopped to safely replace `sqlite.db`. +When restarting the agent during an upgrade, `SIGHUP` is used. +When restarting the agent during a downgrade, `systemd stop/start` are used before/after the downgrade. Teleport CA certificate rotations will break rollbacks. This may be addressed in the future by additional validation of the agent's client certificate issuer fingerprints. 
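A minimal sketch of the restart split described above, assuming a Linux host with systemd; the helper names are illustrative:

```go
package updater

import (
	"fmt"
	"os/exec"
	"syscall"
)

// reloadAgent sends SIGHUP so the running agent restarts into the new
// version in place; used for upgrades, which do not touch sqlite.db.
func reloadAgent(pid int) error {
	return syscall.Kill(pid, syscall.SIGHUP)
}

// replaceAgent fully stops teleport, invokes swap (e.g., restoring
// sqlite.db and symlinks), and starts teleport again; used for downgrades.
func replaceAgent(swap func() error) error {
	if err := exec.Command("systemctl", "stop", "teleport").Run(); err != nil {
		return fmt.Errorf("stopping teleport: %w", err)
	}
	if err := swap(); err != nil {
		return err
	}
	return exec.Command("systemctl", "start", "teleport").Run()
}
```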
From 32f3f010e3d948435c34e360071d1dba366483a7 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 29 May 2024 13:48:46 -0400 Subject: [PATCH 029/105] Update 0169-auto-updates-linux-agents.md --- rfd/0169-auto-updates-linux-agents.md | 19 ++++++++++++++++--- 1 file changed, 16 insertions(+), 3 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index dba4c72923fd5..bdfe33352d6ea 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -9,7 +9,7 @@ state: draft * Engineering: @russjones && @bernardjkim * Product: @klizhentas || @xinding33 -* Security: @reedloden +* Security: Vendor TBD ## What @@ -301,8 +301,21 @@ When restarting the agent during an upgrade, `SIGHUP` is used. When restarting the agent during a downgrade, `systemd stop/start` are used before/after the downgrade. Teleport CA certificate rotations will break rollbacks. -This may be addressed in the future by additional validation of the agent's client certificate issuer fingerprints. -This would prevent downgrades to backups with invalid certs. +In the future, this could be addressed with additional validation of the agent's client certificate issuer fingerprints. +For now, rolling forward will allow recovery from a broken rollback. + +Given that rollbacks may fail, we must maintain the following invariants: +1. Broken rollbacks can always be reverted by reversing the rollback exactly. +2. Broken versions can always be reverted by rolling back and then skipping the broken version. + +When rolling forward, the backup of the newer version's `sqlite.db` is only restored if that exact version is the roll-forward version. +Otherwise, the older, rollback version of `sqlite.db` is preserved (i.e., the newer version's backup is not used). +This ensures that a version upgrade which broke the database can be recovered with a rollback and a new patch. +It also ensures that a broken rollback is always recoverable by reversing the rollback. + +Example: Given v1, v2, v3 versions of Teleport, where v2 is broken: +1. v1 -> v2 -> v1 -> v3 => DB from v1 is migrated directly to v3, avoiding v2 breakage. +2. v1 -> v2 -> v1 -> v2 -> v3 => DB from v2 is recovered, in case v1 database no longer has a valid certificate. ### Manual Workflow From 8d9be34772c92e1cc01de8ac73724c5c68793f58 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 29 May 2024 14:00:41 -0400 Subject: [PATCH 030/105] apt purge --- rfd/0169-auto-updates-linux-agents.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index bdfe33352d6ea..5430679aa1fbd 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -256,6 +256,9 @@ The `/usr/local/bin/teleport-updater` symlink will take precedence to avoid reex To ensure that SELinux permissions do not prevent the `teleport-updater` binary from installing/removing Teleport versions, the updater package will configure SELinux contexts to allow changes to all required paths. +To ensure that `teleport` package removal does not interfere with `teleport-updater`, package removal will run `apt purge` (or `yum` equivalent) while ensuring that `/etc/teleport.yaml` and `/var/lib/teleport` are not purged. +Failure to do this could result in `/etc/teleport.yaml` being removed when an operator runs `apt purge` at a later date. 
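The roll-forward rule above reduces to a small, order-sensitive check; a sketch, with illustrative names:

```go
package updater

// shouldRestoreBackup implements the invariant above: the newer version's
// sqlite.db backup is restored only when rolling forward to exactly the
// version the backup was taken from; otherwise the current (older)
// database is preserved.
func shouldRestoreBackup(backupVersion, rollForwardVersion string) bool {
	return backupVersion == rollForwardVersion
}
```

In the v1 -> v2 -> v1 -> v3 example, the v2 backup fails this check when rolling forward to v3, so the v1 database is migrated directly to v3.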
+ #### Failure Conditions If the new version of Teleport fails to start, the installation of Teleport is reverted as described above. From ea310c98cde043d4a8a2c3c4a7dc9c8e288d1583 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 29 May 2024 14:08:59 -0400 Subject: [PATCH 031/105] Only enable auto-upgrades if successful --- rfd/0169-auto-updates-linux-agents.md | 32 +++++++++++++-------------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 5430679aa1fbd..dd6d9152c60f5 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -214,22 +214,22 @@ The `enable` subcommand will change the behavior of `teleport-update update` to It will also run update teleport immediately, to ensure that subsequent executions succeed. The `enable` subcommand will: -1. Configure `updates.yaml` with the current proxy address and set `enabled` to true. -2. Query the `/v1/webapi/find` endpoint. -3. If the current updater-managed version of Teleport is the latest, and teleport package is not installed, quit. -4. If the current updater-managed version of Teleport is the latest, but the teleport package is installed, jump to (14). -5. Ensure there is enough free disk space to upgrade Teleport. -6. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. -7. Download and verify the checksum (tarball URL suffixed with `.sha256`). -8. Extract the tarball to `/var/lib/teleport/versions/VERSION`. -9. Replace any existing binaries or symlinks with symlinks to the current version. -10. Backup `/var/lib/teleport/proc/sqlite.db` into `/var/lib/teleport/versions/OLD-VERSION/backup/sqlite.db` and create `backup.yaml`. -11. Restart the agent if the systemd service is already enabled. -12. Set `active_version` in `updates.yaml` if successful or not enabled. -13. Replace the symlinks/binaries and `/var/lib/teleport/proc/sqlite.db` and quit (exit 1) if unsuccessful. -14. Remove any `teleport` package if installed. -15. Verify the symlinks to the active version still exists. -16. Remove all stored versions of the agent except the current version and last working version. +1. Query the `/v1/webapi/find` endpoint. +2. If the current updater-managed version of Teleport is the latest, and teleport package is not installed, jump to (16). +3. If the current updater-managed version of Teleport is the latest, but the teleport package is installed, jump to (13). +4. Ensure there is enough free disk space to upgrade Teleport. +5. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. +6. Download and verify the checksum (tarball URL suffixed with `.sha256`). +7. Extract the tarball to `/var/lib/teleport/versions/VERSION`. +8. Replace any existing binaries or symlinks with symlinks to the current version. +9. Backup `/var/lib/teleport/proc/sqlite.db` into `/var/lib/teleport/versions/OLD-VERSION/backup/sqlite.db` and create `backup.yaml`. +10. Restart the agent if the systemd service is already enabled. +11. Set `active_version` in `updates.yaml` if successful or not enabled. +12. Replace the symlinks/binaries and `/var/lib/teleport/proc/sqlite.db` and quit (exit 1) if unsuccessful. +13. Remove and purge any `teleport` package if installed. +14. Verify the symlinks to the active version still exists. +15. Remove all stored versions of the agent except the current version and last working version. +16. 
Configure `updates.yaml` with the current proxy address and set `enabled` to true. The `disable` subcommand will: 1. Configure `updates.yaml` to set `enabled` to false. From bb46025913ffa5ca0b460bc241176d1920d36107 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 29 May 2024 14:11:37 -0400 Subject: [PATCH 032/105] reentrant lock --- rfd/0169-auto-updates-linux-agents.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index dd6d9152c60f5..765fd10fafe4a 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -213,6 +213,8 @@ If the proxy address is not provided with `--proxy`, the current proxy address f The `enable` subcommand will change the behavior of `teleport-update update` to update teleport and restart the existing agent, if running. It will also run update teleport immediately, to ensure that subsequent executions succeed. +Both `update` and `enable` will maintain a shared lock file preventing any re-entrant executions. + The `enable` subcommand will: 1. Query the `/v1/webapi/find` endpoint. 2. If the current updater-managed version of Teleport is the latest, and teleport package is not installed, jump to (16). From 5c45b53fc5a683f292ba0777b8d07be18a4c7de3 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 29 May 2024 14:25:37 -0400 Subject: [PATCH 033/105] reset --- rfd/0169-auto-updates-linux-agents.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 765fd10fafe4a..6089cd975af69 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -109,6 +109,8 @@ $ tctl autoupdate update --set-agent-update-now=true Automatic updates configuration has been updated. $ tctl autoupdate update --set-agent-update-jitter-seconds=600 Automatic updates configuration has been updated. +$ tctl autoupdate reset +Automatic updates configuration has been reset to defaults. ``` ```yaml From 4325ea50d900a2dba4ee15e89399a2bab98df327 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Fri, 31 May 2024 19:06:30 -0400 Subject: [PATCH 034/105] Update 0169-auto-updates-linux-agents.md --- rfd/0169-auto-updates-linux-agents.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 6089cd975af69..c49ab35aa2503 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -51,6 +51,8 @@ It will be distributed as a separate package from Teleport, and manage the insta It will read the unauthenticated `/v1/webapi/find` endpoint from the Teleport proxy, parse new fields on that endpoint, and install the specified agent version according to the specified upgrade plan. It will download the correct version of Teleport as a tarball, unpack it in `/var/lib/teleport`, and ensure it is symlinked from `/usr/local/bin`. +Source code for the updater will live in `integrations/updater`. 
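The shared lock mentioned above could be a simple advisory file lock; a sketch assuming Linux `flock(2)` semantics and an illustrative lock path:

```go
package updater

import (
	"fmt"
	"os"
	"syscall"
)

// lockUpdates takes an exclusive, non-blocking lock shared by the
// `update` and `enable` subcommands; the caller must invoke the returned
// function to release it.
func lockUpdates() (unlock func() error, err error) {
	f, err := os.OpenFile("/var/lib/teleport/update.lock", os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return nil, err
	}
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		f.Close()
		return nil, fmt.Errorf("another update is already in progress: %w", err)
	}
	return func() error {
		defer f.Close()
		return syscall.Flock(int(f.Fd()), syscall.LOCK_UN)
	}, nil
}
```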
+ ### Installation ```shell From 4ebf4a0cb78245ebe9eed330da9301daffddeeee Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Tue, 4 Jun 2024 16:51:29 -0400 Subject: [PATCH 035/105] add note on backups --- rfd/0169-auto-updates-linux-agents.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index c49ab35aa2503..dd026686547f0 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -265,6 +265,8 @@ To ensure that SELinux permissions do not prevent the `teleport-updater` binary To ensure that `teleport` package removal does not interfere with `teleport-updater`, package removal will run `apt purge` (or `yum` equivalent) while ensuring that `/etc/teleport.yaml` and `/var/lib/teleport` are not purged. Failure to do this could result in `/etc/teleport.yaml` being removed when an operator runs `apt purge` at a later date. +To ensure that backups are consistent, the updater will use the [SQLite backup API](https://www.sqlite.org/backup.html) to perform the backup. + #### Failure Conditions If the new version of Teleport fails to start, the installation of Teleport is reverted as described above. From 25aefe24f8aa545ba6c1048a6d6da113efca2eb0 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Thu, 6 Jun 2024 18:39:20 -0400 Subject: [PATCH 036/105] Update 0169-auto-updates-linux-agents.md --- rfd/0169-auto-updates-linux-agents.md | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index dd026686547f0..cecaae08e9906 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -93,9 +93,6 @@ spec: agent_auto_update: true|false # agent_update_hour sets the hour in UTC at which clients should update their agents. agent_update_hour: 0-23 - # agent_update_now overrides agent_update_hour and serves the new version immediately. - # This is useful for rolling out critical security updates and bug fixes. - agent_update_now: on|off # agent_update_jitter_seconds sets a duration in which the upgrade will occur after the hour. # The agent upgrader will pick a random time within this duration to wait to upgrade. agent_update_jitter_seconds: 0-3600 @@ -107,12 +104,17 @@ $ tctl autoupdate update --set-agent-auto-update=off Automatic updates configuration has been updated. $ tctl autoupdate update --set-agent-update-hour=3 Automatic updates configuration has been updated. -$ tctl autoupdate update --set-agent-update-now=true -Automatic updates configuration has been updated. $ tctl autoupdate update --set-agent-update-jitter-seconds=600 Automatic updates configuration has been updated. $ tctl autoupdate reset Automatic updates configuration has been reset to defaults. +$ tctl autoupdate status +Status: disabled +Current: v1.2.3 +Desired: v1.2.4 (critical) +Window: 3 +Jitter: 600s + ``` ```yaml @@ -120,6 +122,10 @@ kind: autoupdate_version spec: # agent_version is the version of the agent the cluster will advertise. agent_version: X.Y.Z + # agent_critical makes the version as critical. + # This overrides agent_update_hour in cluster_maintenance_config and serves the version immediately. + # This is useful for rolling out critical security updates and bug fixes. + agent_critical: true|false [...] 
```

From f4716be28372d00d6fb309f2bc5a7f9841c123ed Mon Sep 17 00:00:00 2001
From: Stephen Levine
Date: Thu, 6 Jun 2024 18:46:23 -0400
Subject: [PATCH 037/105] Update 0169-auto-updates-linux-agents.md

---
 rfd/0169-auto-updates-linux-agents.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index cecaae08e9906..60647b539926e 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -132,6 +132,8 @@ spec:
 ```
 $ tctl autoupdate update --set-agent-version=15.1.1
 Automatic updates configuration has been updated.
+$ tctl autoupdate update --set-agent-version=15.1.2 --critical
+Automatic updates configuration has been updated.
 ```
 
 Notes:

From 87dc2df8fe018ef7f412ace3adfcbbc6e29aa598 Mon Sep 17 00:00:00 2001
From: Stephen Levine
Date: Mon, 10 Jun 2024 13:48:39 -0400
Subject: [PATCH 038/105] Clarify restore/rollback process and validations

---
 rfd/0169-auto-updates-linux-agents.md | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index 60647b539926e..ea7e96cc2e43a 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -9,7 +9,7 @@ state: draft
 
 * Engineering: @russjones && @bernardjkim
 * Product: @klizhentas || @xinding33
-* Security: Vendor TBD
+* Security: Doyensec
 
 ## What
 
@@ -234,7 +234,7 @@ The `enable` subcommand will:
 4. Ensure there is enough free disk space to upgrade Teleport.
 5. Download the desired Teleport tarball specified by `agent_version` and `server_edition`.
 6. Download and verify the checksum (tarball URL suffixed with `.sha256`).
-7. Extract the tarball to `/var/lib/teleport/versions/VERSION`.
+7. Extract the tarball to `/var/lib/teleport/versions/VERSION` and write the SHA to `/var/lib/teleport/versions/VERSION/sha256`.
 8. Replace any existing binaries or symlinks with symlinks to the current version.
 9. Backup `/var/lib/teleport/proc/sqlite.db` into `/var/lib/teleport/versions/OLD-VERSION/backup/sqlite.db` and create `backup.yaml`.
 10. Restart the agent if the systemd service is already enabled.
@@ -257,7 +257,7 @@ When `update` subcommand is otherwise executed, it will:
 6. Ensure there is enough free disk space to upgrade Teleport.
 7. Download the desired Teleport tarball specified by `agent_version` and `server_edition`.
 8. Download and verify the checksum (tarball URL suffixed with `.sha256`).
-9. Extract the tarball to `/var/lib/teleport/versions/VERSION`.
+9. Extract the tarball to `/var/lib/teleport/versions/VERSION` and write the SHA to `/var/lib/teleport/versions/VERSION/sha256`.
 10. Update symlinks to point at the new version.
 11. Backup `/var/lib/teleport/proc/sqlite.db` into `/var/lib/teleport/versions/OLD-VERSION/backup/sqlite.db` and create `backup.yaml`.
 12. Restart the agent if the systemd service is already enabled.
@@ -314,6 +314,9 @@ When Teleport is downgraded to a previous version that has a backup of `sqlite.d
 2. If the backup is valid, Teleport is fully stopped, the backup is restored along with symlinks, and the downgraded version of Teleport is started.
 3. If the backup is invalid, we refuse to downgrade.
 
 Downgrades are applied with `teleport-updater update`, just like upgrades.
 The above steps modulate the standard workflow in the section above.
+If the downgraded version is already present, the uncompressed version is used to ensure fast recovery of the exact state before the failed upgrade.
+To ensure that the target version was not corrupted by incomplete extraction, the downgrade checks for the existence of `/var/lib/teleport/versions/TARGET-VERSION/sha256` before downgrading.
+To ensure that the DB backup was not corrupted by incomplete copying, the downgrade checks for the existence of `/var/lib/teleport/versions/TARGET-VERSION/backup/backup.yaml` before restoring.
 
 Teleport must be fully-stopped to safely replace `sqlite.db`.
 When restarting the agent during an upgrade, `SIGHUP` is used.
 When restarting the agent during a downgrade, `systemd stop/start` are used before/after the downgrade.
 
 Teleport CA certificate rotations will break rollbacks.

From 1806795f0d09c3ca550b25ba8a8f1c3519cf8fdf Mon Sep 17 00:00:00 2001
From: Stephen Levine
Date: Mon, 10 Jun 2024 14:11:49 -0400
Subject: [PATCH 039/105] Added section on logging

---
 rfd/0169-auto-updates-linux-agents.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index ea7e96cc2e43a..59db915416ce1 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -410,6 +410,13 @@ are signed.
 
 The Upgrade Framework (TUF) will be used to implement secure updates in the future.
 
+## Logging
+
+All installation steps will be logged locally, such that they are viewable with `journalctl`.
+Care will be taken to ensure that updater logs are sharable with Teleport Support for debugging and auditing purposes.
+
+When TUF is added, events related to supply chain security may be sent to the Teleport cluster via the Teleport Agent.
+
 ## Execution Plan
 
 1. Implement new auto-updater in Go.

From 467b64067fa994c74426cc4ec36a1537e8bda7c4 Mon Sep 17 00:00:00 2001
From: Stephen Levine
Date: Tue, 9 Jul 2024 14:54:31 -0400
Subject: [PATCH 040/105] Add schedules

---
 rfd/0169-auto-updates-linux-agents.md | 221 +++++++++++++++++---------
 1 file changed, 147 insertions(+), 74 deletions(-)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index 59db915416ce1..963c97beb830f 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -3,7 +3,7 @@ authors: Stephen Levine (stephen.levine@goteleport.com)
 state: draft
 ---
 
-# RFD 0169 - Automatic Updates for Linux Agents
+# RFD 0169 - Automatic Updates for Agents
 
 ## Required Approvers
 
@@ -13,13 +13,14 @@ state: draft
 
 ## What
 
-This RFD proposes a new mechanism for Teleport agents installed on Linux servers to automatically update to a version set by an operator via tctl.
+This RFD proposes a new mechanism for Teleport agents to automatically update to a version scheduled by an operator via tctl.
+
+All agent installations are in-scope for this proposal, including agents installed on Linux servers and Kubernetes.
 
 The following anti-goals are out-of-scope for this proposal, but will be addressed in future RFDs:
-- Analogous adjustments for Teleport agents installed on Kubernetes
-- Phased rollouts of new agent versions for agents connected to an existing cluster
 - Signing of agent artifacts via TUF
 - Teleport Cloud APIs for updating agents
+- Improvements to the local functionality of the Kubernetes agent for better compatibility with FluxCD and ArgoCD.
 
 This RFD proposes a specific implementation of several sections in https://github.com/gravitational/teleport/pull/39217.
 
@@ -29,7 +30,7 @@ Additionally, this RFD parallels the auto-update functionality for client tools
 
 The existing mechanism for automatic agent updates does not provide a hands-off experience for all Teleport users.
 
-1. The use of system package management leads to interactions with `apt upgrade`, `yum upgrade`, etc.
that can result in unintentional upgrades. 2. The use of system package management requires complex logic for each target distribution. 3. The installation mechanism requires 4-5 commands, includes manually installing multiple packages, and varies depending on your version and edition of Teleport. 4. The use of bash to implement the updater makes changes difficult and prone to error. @@ -44,30 +45,15 @@ The existing mechanism for automatic agent updates does not provide a hands-off We must provide a seamless, hands-off experience for auto-updates that is easy to maintain. -## Details - -We will ship a new auto-updater package written in Go that does not interface with the system package manager. -It will be distributed as a separate package from Teleport, and manage the installation of the correct Teleport agent version manually. -It will read the unauthenticated `/v1/webapi/find` endpoint from the Teleport proxy, parse new fields on that endpoint, and install the specified agent version according to the specified upgrade plan. -It will download the correct version of Teleport as a tarball, unpack it in `/var/lib/teleport`, and ensure it is symlinked from `/usr/local/bin`. - -Source code for the updater will live in `integrations/updater`. - -### Installation +## Details - Teleport API -```shell -$ apt-get install teleport-ent-updater -$ teleport-update enable --proxy example.teleport.sh - -# if not enabled already, configure teleport and: -$ systemctl enable teleport -``` +Teleport will be updated to serve the desired version of Teleport from `/v1/webapi/find`. -### API +The version served with be configured using the `cluster_maintenance_config` and `autoupdate_version` resources. -#### Endpoints +### Endpoints -`/v1/webapi/find` +`/v1/webapi/find?host=[host_uuid]` ```json { "server_edition": "enterprise", @@ -77,12 +63,11 @@ $ systemctl enable teleport } ``` Notes: -- The Teleport proxy translates upgrade hours (below) into a specific time after which the served `agent_version` changes, resulting in all agents being upgraded. -- Critical updates are achieved by serving the desired `agent_version` immediately. -- If an agent misses an upgrade window, it will always update immediately due to the new agent version being served. +- The Teleport proxy uses `cluster_maintenance_config` and `autoupdate_config` (below) to determine the time when the served `agent_auto_update` is `true` for the provided host UUID. +- Agents will only upgrade if `agent_auto_update` is `true`, but new installations will use `agent_version` regardless of the value in `agent_auto_update`. - The edition served is the cluster edition (enterprise, enterprise-fips, or oss), and cannot be configured. -#### Teleport Resources +### Teleport Resources ```yaml kind: cluster_maintenance_config @@ -91,30 +76,94 @@ spec: # cluster level. Only turn agent automatic updates off if self-managed # agent updates are in place. agent_auto_update: true|false - # agent_update_hour sets the hour in UTC at which clients should update their agents. - agent_update_hour: 0-23 - # agent_update_jitter_seconds sets a duration in which the upgrade will occur after the hour. - # The agent upgrader will pick a random time within this duration to wait to upgrade. - agent_update_jitter_seconds: 0-3600 - - [...] + + # agent_auto_update_groups contains both "regular" or "critical" schedules. + # The schedule used is determined by the agent_version_schedule associated + # with the version in autoupdate_version. 
+ agent_auto_update_groups: + # schedule is "regular" or "critical" + regular: + - name: staging-group + # agent_selection defines which agents are included in the group. + agent_selection: + # query selects agents by resource query. + # default: all connected agents + query: 'labels["environment"]=="staging"' + # days specifies the days of the week when the group may be upgraded. + # default: ["*"] (all days) + days: [“Sun”, “Mon”, ... | "*"] + # start_hour specifies the hour when the group may start upgrading. + # default: 0 + start_hour: 0-23 + # max_in_flight specifies the maximum number of agents that may be upgraded at the same time. + # default: 100% + max_in_flight: 0-100% + # timeout_seconds specifies the amount of time, after the specified jitter, after which + # an agent upgrade will be considered timed out if the version does not change. + # default: 60 + timeout_seconds: 30-900 + # failure_seconds specifies the amount of time after which an agent upgrade will be considered + # failed if the agent heartbeat stops before the upgrade is complete. + # default: 0 + failure_seconds: 0-900 + # jitter_seconds specifies a maximum jitter duration after the start hour. + # The agent upgrader client will pick a random time within this duration to wait to upgrade. + # default: 0 + jitter_seconds: 0-60 + # max_failed_before_halt specifies the percentage of clients that may fail before this group + # and all dependent groups are halted. + # default: 0 + max_failed_before_halt: 0-100% + # max_timeout_before_halt specifies the percentage of clients that may time out before this group + # and all dependent groups are halted. + # default: 10% + max_timeout_before_halt: 0-100% + # requires specifies groups that must pass with the current version before this group is allowed + # to run using that version. + requires: ["test-group"] + # ... ``` + +Note the MVP version of this resource will not support host UUIDs, groups, or backpressure, and will use the following simplified UX. +This field will remain indefinitely, to cover agents that do not present a known host UUID, as well as connected agents that are not matched to a group. + +```yaml +kind: cluster_maintenance_config +spec: + # ... + + # agent_auto_update contains both "regular" or "critical" schedules. + # The schedule used is determined by the agent_version_schedule associated + # with the version in autoupdate_version. + agent_auto_update: + regular: # or "critical" + # days specifies the days of the week when the group may be upgraded. + # default: ["*"] (all days) + days: [“Sun”, “Mon”, ... | "*"] + # start_hour specifies the hour when the group may start upgrading. + # default: 0 + start_hour: 0-23 + # jitter_seconds specifies a maximum jitter duration after the start hour. + # The agent upgrader client will pick a random time within this duration to wait to upgrade. + # default: 0 + jitter_seconds: 0-60 + # ... ``` -$ tctl autoupdate update --set-agent-auto-update=off + + +```shell +$ tctl autoupdate update--set-agent-auto-update=off Automatic updates configuration has been updated. -$ tctl autoupdate update --set-agent-update-hour=3 +$ tctl autoupdate update --schedule regular --group staging-group --set-start-hour=3 Automatic updates configuration has been updated. -$ tctl autoupdate update --set-agent-update-jitter-seconds=600 +$ tctl autoupdate update --schedule regular --group staging-group --set-jitter-seconds=600 Automatic updates configuration has been updated. 
$ tctl autoupdate reset Automatic updates configuration has been reset to defaults. $ tctl autoupdate status Status: disabled -Current: v1.2.3 -Desired: v1.2.4 (critical) -Window: 3 -Jitter: 600s - +Version: v1.2.4 +Schedule: regular ``` ```yaml @@ -122,14 +171,14 @@ kind: autoupdate_version spec: # agent_version is the version of the agent the cluster will advertise. agent_version: X.Y.Z - # agent_critical makes the version as critical. - # This overrides agent_update_hour in cluster_maintenance_config and serves the version immediately. - # This is useful for rolling out critical security updates and bug fixes. - agent_critical: true|false + # agent_version_schedule specifies the rollout schedule associated with the version. + # Currently, only critical and regular schedules are permitted. + agent_version_schedule: critical|regular - [...] -``` + # ... ``` + +```shell $ tctl autoupdate update --set-agent-version=15.1.1 Automatic updates configuration has been updated. $ tctl autoupdate update --set-agent-version=15.1.2 --critical @@ -139,6 +188,25 @@ Automatic updates configuration has been updated. Notes: - These two resources are separate so that Cloud customers can be restricted from updating `autoupdate_version`, while maintaining control over the rollout. +## Details - Linux Agents + +We will ship a new auto-updater package for Linux servers written in Go that does not interface with the system package manager. +It will be distributed as a separate package from Teleport, and manage the installation of the correct Teleport agent version manually. +It will read the unauthenticated `/v1/webapi/find` endpoint from the Teleport proxy, parse new fields on that endpoint, and install the specified agent version according to the specified upgrade plan. +It will download the correct version of Teleport as a tarball, unpack it in `/var/lib/teleport`, and ensure it is symlinked from `/usr/local/bin`. + +Source code for the updater will live in `integrations/updater`. + +### Installation + +```shell +$ apt-get install teleport-ent-updater +$ teleport-update enable --proxy example.teleport.sh + +# if not enabled already, configure teleport and: +$ systemctl enable teleport +``` + ### Filesystem ``` @@ -251,7 +319,7 @@ The `disable` subcommand will: When `update` subcommand is otherwise executed, it will: 1. Check `updates.yaml`, and quit (exit 0) if `enabled` is false, or quit (exit 1) if `enabled` is true and no proxy address is set. 2. Query the `/v1/webapi/find` endpoint. -3. Check that `agent_auto_updates` is true. +3. Check that `agent_auto_updates` is true, quit otherwise. 4. If the current version of Teleport is the latest, quit. 5. Wait `random(0, agent_update_jitter_seconds)` seconds. 6. Ensure there is enough free disk space to upgrade Teleport. @@ -344,22 +412,7 @@ Example: Given v1, v2, v3 versions of Teleport, where v2 is broken: For use cases that fall outside of the functionality provided by `teleport-updater`, we provide an alternative manual workflow using the `/v1/webapi/find` endpoint. This workflow supports customers that cannot use the auto-update mechanism provided by `teleport-updater` because they use their own automation for updates (e.g., JamF or Ansible). -Cluster administrators that want to self-manage agent updates will be -able to get and watch for changes to agent versions which can then be -used to trigger other integrations to update the installed version of agents. - -```shell -$ tctl autoupdate watch -{"agent_version": "1.0.0", "agent_edition": "enterprise", ... 
} -{"agent_version": "1.0.1", "agent_edition": "enterprise", ... } -{"agent_version": "2.0.0", "agent_edition": "enterprise", ... } -[...] -``` - -```shell -$ tctl autoupdate get -{"agent_version": "2.0.0", "agent_edition": "enterprise", ... } -``` +Cluster administrators that want to self-manage agent updates may manually query the `/v1/webapi/find` endpoint using the host UUID, and implement auto-updates with their own automation. ### Installers @@ -399,6 +452,19 @@ The following documentation will need to be updated to cover the new upgrader wo Additionally, the Cloud dashboard tenants downloads tab will need to be updated to reference the new instructions. + +## Details - Kubernetes Agents + +The Kubernetes agent updater will be updated for compatibility with the new scheduling system. + +This means that it will stop reading upgrade windows using the authenticated connection to the proxy, and instead upgrade when indicated by the `/v1/webapi/find` endpoint. + +Rollbacks for the Kubernetes updater, as well as packaging changes to improve UX and compatibility, will be covered in a future RFD. + +## Migration + +The existing update scheduling system will remain in-place until the old auto-updater is fully deprecated. + ## Security The initial version of automatic updates will rely on TLS to establish @@ -410,6 +476,9 @@ are signed. The Upgrade Framework (TUF) will be used to implement secure updates in the future. +Anyone who possesses a host UUID can determine when that host is scheduled to upgrade by repeatedly querying the public `/v1/webapi/find` endpoint. +It is not possible to discover the current version of that host, only the designated upgrade window. + ## Logging All installation steps will be logged locally, such that they are viewable with `journalctl`. @@ -419,10 +488,14 @@ When TUF is added, that events related to supply chain security may be sent to t ## Execution Plan -1. Implement new auto-updater in Go. -2. Test extensively on all supported Linux distributions. -3. Prep documentation changes. -4. Release new updater via teleport-ent-updater package. -5. Release documentation changes. -6. Communicate to select Cloud customers that they must update their updater, starting with lower ARR customers. -7. Communicate to all Cloud customers that they must update their updater. +1. Implement Teleport APIs for new scheduling system (without groups and backpressure) +2. Implement new auto-updater in Go. +3. Implement changes to Kubernetes auto-updater. +4. Test extensively on all supported Linux distributions. +5. Prep documentation changes. +6. Release new updater via teleport-ent-updater package. +7. Release documentation changes. +8. Communicate to select Cloud customers that they must update their updater, starting with lower ARR customers. +9. Communicate to all Cloud customers that they must update their updater. +10. Deprecate old auto-updater endpoints. +11. Add groups and backpressure features. 
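To make the manual `/v1/webapi/find` workflow described above concrete, the following self-contained example polls the endpoint once and prints the advertised version. The proxy address and host UUID are placeholders:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// findResponse mirrors the fields of /v1/webapi/find described in this RFD.
type findResponse struct {
	ServerEdition   string `json:"server_edition"`
	AgentVersion    string `json:"agent_version"`
	AgentAutoUpdate bool   `json:"agent_auto_update"`
}

func main() {
	// The host UUID would normally be read from /var/lib/teleport.
	const url = "https://example.teleport.sh/v1/webapi/find?host=00000000-0000-0000-0000-000000000000"

	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var fr findResponse
	if err := json.NewDecoder(resp.Body).Decode(&fr); err != nil {
		panic(err)
	}
	if fr.AgentAutoUpdate {
		fmt.Printf("upgrade now to teleport %s (%s edition)\n", fr.AgentVersion, fr.ServerEdition)
	}
}
```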
From 55cc5a8643f93c550c7f78c0b6a3c8d08f7521a7 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Tue, 9 Jul 2024 16:51:32 -0400 Subject: [PATCH 041/105] immediate schedule + note on cycles and chains --- rfd/0169-auto-updates-linux-agents.md | 28 ++++++++++++++++++--------- 1 file changed, 19 insertions(+), 9 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 963c97beb830f..9e0aefce9ea6f 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -77,9 +77,10 @@ spec: # agent updates are in place. agent_auto_update: true|false - # agent_auto_update_groups contains both "regular" or "critical" schedules. + # agent_auto_update_groups contains both "regular" and "critical" schedules. # The schedule used is determined by the agent_version_schedule associated # with the version in autoupdate_version. + # Groups are not configurable with the "immediate" schedule. agent_auto_update_groups: # schedule is "regular" or "critical" regular: @@ -95,6 +96,10 @@ spec: # start_hour specifies the hour when the group may start upgrading. # default: 0 start_hour: 0-23 + # jitter_seconds specifies a maximum jitter duration after the start hour. + # The agent upgrader client will pick a random time within this duration to wait to upgrade. + # default: 0 + jitter_seconds: 0-60 # max_in_flight specifies the maximum number of agents that may be upgraded at the same time. # default: 100% max_in_flight: 0-100% @@ -106,10 +111,6 @@ spec: # failed if the agent heartbeat stops before the upgrade is complete. # default: 0 failure_seconds: 0-900 - # jitter_seconds specifies a maximum jitter duration after the start hour. - # The agent upgrader client will pick a random time within this duration to wait to upgrade. - # default: 0 - jitter_seconds: 0-60 # max_failed_before_halt specifies the percentage of clients that may fail before this group # and all dependent groups are halted. # default: 0 @@ -124,7 +125,9 @@ spec: # ... ``` -Note the MVP version of this resource will not support host UUIDs, groups, or backpressure, and will use the following simplified UX. +Note that cycles and dependency chains longer than a week will be rejected. + +Note the MVP version of this resource will not support host UUIDs, groups, or backpressure, and will use the following simplified UX with `agent_auto_update` field. This field will remain indefinitely, to cover agents that do not present a known host UUID, as well as connected agents that are not matched to a group. ```yaml @@ -132,10 +135,17 @@ kind: cluster_maintenance_config spec: # ... - # agent_auto_update contains both "regular" or "critical" schedules. + # agent_auto_update contains "regular," "critical," and "immediate" schedules. # The schedule used is determined by the agent_version_schedule associated # with the version in autoupdate_version. agent_auto_update: + # The immediate schedule results in all agents updating simultaneously. + # Only client-side jitter is configurable. + immediate: + # jitter_seconds specifies a maximum jitter duration after the start hour. + # The agent upgrader client will pick a random time within this duration to wait to upgrade. + # default: 0 + jitter_seconds: 0-60 regular: # or "critical" # days specifies the days of the week when the group may be upgraded. # default: ["*"] (all days) @@ -172,8 +182,8 @@ spec: # agent_version is the version of the agent the cluster will advertise. 
agent_version: X.Y.Z # agent_version_schedule specifies the rollout schedule associated with the version. - # Currently, only critical and regular schedules are permitted. - agent_version_schedule: critical|regular + # Currently, only critical, regular, and immediate schedules are permitted. + agent_version_schedule: regular|critical|immediate # ... ``` From b28416ec2825eba2e99ca3bd9370d6ec7238ed61 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 10 Jul 2024 17:07:24 -0400 Subject: [PATCH 042/105] more details, more tctl commands --- rfd/0169-auto-updates-linux-agents.md | 44 +++++++++++++++++++++++++-- 1 file changed, 41 insertions(+), 3 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 9e0aefce9ea6f..4d24504a9c756 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -47,9 +47,21 @@ We must provide a seamless, hands-off experience for auto-updates that is easy t ## Details - Teleport API -Teleport will be updated to serve the desired version of Teleport from `/v1/webapi/find`. +Teleport will be updated to serve the desired agent version and edition from `/v1/webapi/find`. +The version and edition served from that endpoint will be configured using the `cluster_maintenance_config` and `autoupdate_version` resources. +Whether the updater querying the endpoint is instructed to upgrade (via `agent_auto_update`) is dependent on the `host=[uuid]` parameter sent to `/v1/webapi/find`. -The version served with be configured using the `cluster_maintenance_config` and `autoupdate_version` resources. +To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to the `/v1/webapi/find`. +Teleport proxies use their access to heartbeat data to drive the rollout and modulate the `/v1/webapi/find` response given the host UUID. + +Rollouts are specified as interdependent groups of hosts, selected by resource label. +A host is eligible to upgrade if the label is present on any of its connected resources. + +At the start of a group rollout, the Teleport proxy marks a desired group of hosts to update in the backend. +A fixed number of hosts (`max_in_flight`) are instructed to upgrade via `/v1/webapi/find`. +Additional hosts are instructed to update as earlier updates complete, timeout, or fail, never exceeding `max_in_flight`. +The group rollout is halted if timeouts or failures exceed their specified thresholds. +Group rollouts may be retried with `tctl autoupdate run`. ### Endpoints @@ -66,6 +78,7 @@ Notes: - The Teleport proxy uses `cluster_maintenance_config` and `autoupdate_config` (below) to determine the time when the served `agent_auto_update` is `true` for the provided host UUID. - Agents will only upgrade if `agent_auto_update` is `true`, but new installations will use `agent_version` regardless of the value in `agent_auto_update`. - The edition served is the cluster edition (enterprise, enterprise-fips, or oss), and cannot be configured. +- The host UUID is ready from `/var/lib/teleport` by the updater. ### Teleport Resources @@ -126,6 +139,10 @@ spec: ``` Note that cycles and dependency chains longer than a week will be rejected. +Otherwise, updates could take up to 7 weeks to propagate. + +Changing the version or schedule completely resets progress. +Releasing new client versions multiple times a week has the potential to starve dependent groups from updates. 
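Rejecting cycles and over-long dependency chains, as noted above, amounts to a depth-first walk of the `requires` graph. A sketch with illustrative types; the caller would reject any chain whose total span exceeds the one-week limit:

```go
package autoupdate

import "fmt"

// longestChain returns the longest dependency chain in the group graph
// (counted in groups), or an error if the graph contains a cycle.
func longestChain(requires map[string][]string) (int, error) {
	const (
		visiting = 1
		done     = 2
	)
	state := map[string]int{}
	memo := map[string]int{}

	var visit func(string) (int, error)
	visit = func(group string) (int, error) {
		switch state[group] {
		case visiting:
			return 0, fmt.Errorf("dependency cycle through group %q", group)
		case done:
			return memo[group], nil
		}
		state[group] = visiting
		depth := 1
		for _, dep := range requires[group] {
			d, err := visit(dep)
			if err != nil {
				return 0, err
			}
			if d+1 > depth {
				depth = d + 1
			}
		}
		state[group] = done
		memo[group] = depth
		return depth, nil
	}

	longest := 0
	for group := range requires {
		d, err := visit(group)
		if err != nil {
			return 0, err
		}
		if d > longest {
			longest = d
		}
	}
	return longest, nil
}
```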
Note the MVP version of this resource will not support host UUIDs, groups, or backpressure, and will use the following simplified UX with the `agent_auto_update` field.
This field will remain indefinitely, to cover agents that do not present a known host UUID, as well as connected agents that are not matched to a group.

@@ -162,6 +179,7 @@ spec:
 
 ```shell
+# configuration
 $ tctl autoupdate update --set-agent-auto-update=off
 Automatic updates configuration has been updated.
 $ tctl autoupdate update --schedule regular --group staging-group --set-start-hour=3
@@ -170,10 +188,31 @@ $ tctl autoupdate update --schedule regular --group staging-group --set-jitter-s
 Automatic updates configuration has been updated.
 $ tctl autoupdate reset
 Automatic updates configuration has been reset to defaults.
+
+# status
 $ tctl autoupdate status
 Status: disabled
 Version: v1.2.4
 Schedule: regular
+
+Groups:
+staging-group: succeeded at 2024-01-03 23:43:22 UTC
+prod-group: scheduled for 2024-01-03 23:43:22 UTC (depends on staging-group)
+other-group: failed at 2024-01-05 22:53:22 UTC
+
+$ tctl autoupdate status --group staging-group
+Status: succeeded
+Date: 2024-01-03 23:43:22 UTC
+Requires: (none)
+
+Upgraded: 230 (90%)
+Unchanged: 10 (4%)
+Failed: 15 (6%)
+Timed-out: 0
+
+# re-running failed group
+$ tctl autoupdate run --group staging-group
+Executing auto-update for group 'staging-group' immediately.
 ```
 
 ```yaml
@@ -462,7 +501,6 @@ The following documentation will need to be updated to cover the new upgrader wo
 
 Additionally, the Cloud dashboard tenants downloads tab will need to be updated to reference the new instructions.
 
-
 ## Details - Kubernetes Agents
 
 The Kubernetes agent updater will be updated for compatibility with the new scheduling system.

From ed1b5fbd0be14828d8eeb2830ed2ce5343bad0b5 Mon Sep 17 00:00:00 2001
From: Stephen Levine
Date: Thu, 11 Jul 2024 15:39:29 -0400
Subject: [PATCH 043/105] Update 0169-auto-updates-linux-agents.md

---
 rfd/0169-auto-updates-linux-agents.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index 4d24504a9c756..52d8d50961f46 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -58,7 +58,7 @@ Rollouts are specified as interdependent groups of hosts, selected by resource l
 A host is eligible to upgrade if the label is present on any of its connected resources.
 
 At the start of a group rollout, the Teleport proxy marks a desired group of hosts to update in the backend.
-A fixed number of hosts (`max_in_flight`) are instructed to upgrade via `/v1/webapi/find`.
+An arbitrarily selected fixed number of hosts (`max_in_flight x total`) are instructed to upgrade at the same time via `/v1/webapi/find`.
 Additional hosts are instructed to update as earlier updates complete, timeout, or fail, never exceeding `max_in_flight`.
 The group rollout is halted if timeouts or failures exceed their specified thresholds.
 Group rollouts may be retried with `tctl autoupdate run`.
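To make the halting thresholds concrete, a small sketch of the check an auth server might run (assumed semantics: percentages are evaluated against the total host count of the group):

```golang
package rollout

// shouldHalt reports whether a group rollout (and its dependent groups)
// must be halted, per max_failed_before_halt and max_timeout_before_halt.
// Integer math avoids floating-point comparisons: failed/total > pct/100
// is rewritten as failed*100 > pct*total.
func shouldHalt(total, failed, timedOut, maxFailedPct, maxTimeoutPct int) bool {
	if total == 0 {
		return false
	}
	return failed*100 > maxFailedPct*total || timedOut*100 > maxTimeoutPct*total
}
```

For example, with 1,000 hosts in a group and a 10% timeout threshold, the 101st timed-out host halts the group and its dependents.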
From 2b50f65bb8a458763e43ef08d5478d9293f26ec1 Mon Sep 17 00:00:00 2001
From: Stephen Levine
Date: Mon, 29 Jul 2024 14:23:46 -0400
Subject: [PATCH 044/105] scalability

---
 rfd/0169-auto-updates-linux-agents.md | 59 +++++++++++++++++++++++----
 1 file changed, 52 insertions(+), 7 deletions(-)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index 52d8d50961f46..036baf7b667bd 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -54,15 +54,57 @@ Whether the updater querying the endpoint is instructed to upgrade (via `agent_a
 To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to the `/v1/webapi/find`.
 Teleport proxies use their access to heartbeat data to drive the rollout and modulate the `/v1/webapi/find` response given the host UUID.
 
-Rollouts are specified as interdependent groups of hosts, selected by resource label.
-A host is eligible to upgrade if the label is present on any of its connected resources.
+Rollouts are specified as interdependent groups of hosts, selected by SSH resource or instance label query.
+A host is eligible to upgrade if the selection query returns true.
+Instance labels are a new feature introduced by this RFD that may be used when the SSH service is not running or when it is undesirable to reuse SSH labels:
+
+```
+teleport:
+  labels:
+    environment: staging
+  commands:
+  # this command will add a label 'arch=x86_64' to an instance
+  - name: arch
+    command: ['/bin/uname', '-p']
+    period: 1h0m0s
+```
+
+Only static and command-based labels may be used.
 
 At the start of a group rollout, the Teleport proxy marks a desired group of hosts to update in the backend.
-An arbitrarily selected fixed number of hosts (`max_in_flight x total`) are instructed to upgrade at the same time via `/v1/webapi/find`.
+An arbitrary but UUID-deterministic fixed number of hosts (`max_in_flight % x total`) are instructed to upgrade at the same time via `/v1/webapi/find`.
 Additional hosts are instructed to update as earlier updates complete, timeout, or fail, never exceeding `max_in_flight`.
 The group rollout is halted if timeouts or failures exceed their specified thresholds.
 Group rollouts may be retried with `tctl autoupdate run`.
 
+### Scalability
+
+Instance heartbeats will now be cached at both the auth server and the proxy.
+
+All rollout logic is triggered by instance heartbeat backend writes, as changes can only occur on these events.
+The following data related to the rollout are stored in each instance heartbeat:
+- `agent_upgrade_start_time`: timestamp of individual agent's upgrade time
+- `agent_upgrade_group_schedule`: schedule type of group (e.g., critical)
+- `agent_upgrade_group_name`: name of group (e.g., staging)
+- `agent_upgrade_group_start_time`: timestamp of current window start time
+- `agent_upgrade_group_end_time`: timestamp of current window end time
+
+At the start of the window, all queried instance heartbeats are marked with updated values for the `agent_upgrade_group_*` fields.
+Instance heartbeats are included in the current window if all three fields match the window defined in `cluster_maintenance_config`.
+
+On each instance heartbeat write, the auth server looks at instance heartbeats in cache and determines if additional agents should be upgrading.
+If they should, additional instance heartbeats are marked as upgrading by setting `agent_upgrade_start_time` to the current time.
+When `agent_upgrade_start_time` is in the group's window, the proxy serves `agent_auto_upgrade: true` when queried via `/v1/webapi/find`. + +To avoid synchronization issues between auth servers, the rollout order is deterministically sorted by UUID. +Two concurrent writes to different auth servers may temporarily result in fewer upgrading instances than desired, but this should be resolved on the next write. + +Upgrading all agents generates the following write load: +- One write of `agent_upgrade_group_*` fields per agent +- One write of `agent_upgrade_start_time` field per agent + +All reads are from cache. + ### Endpoints `/v1/webapi/find?host=[host_uuid]` @@ -98,11 +140,14 @@ spec: # schedule is "regular" or "critical" regular: - name: staging-group - # agent_selection defines which agents are included in the group. - agent_selection: - # query selects agents by resource query. + # agents defines which agents are included in the group. + agents: + # node_labels_expression selects agents by SSH resource query. + # default: all connected agents + node_labels_expression: 'labels["environment"]=="staging"' + # instance_labels_expression selects agents by instance query. # default: all connected agents - query: 'labels["environment"]=="staging"' + instance_labels_expression: 'labels["environment"]=="staging"' # days specifies the days of the week when the group may be upgraded. # default: ["*"] (all days) days: [“Sun”, “Mon”, ... | "*"] From 0f9aa290d1fa3f4b6797dc1005b246f1f0646912 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 29 Jul 2024 14:41:31 -0400 Subject: [PATCH 045/105] df --- rfd/0169-auto-updates-linux-agents.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 036baf7b667bd..e1e0ca7a82c8f 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -393,7 +393,7 @@ The `enable` subcommand will: 1. Query the `/v1/webapi/find` endpoint. 2. If the current updater-managed version of Teleport is the latest, and teleport package is not installed, jump to (16). 3. If the current updater-managed version of Teleport is the latest, but the teleport package is installed, jump to (13). -4. Ensure there is enough free disk space to upgrade Teleport. +4. Ensure there is enough free disk space to upgrade Teleport via `df .`. 5. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. 6. Download and verify the checksum (tarball URL suffixed with `.sha256`). 7. Extract the tarball to `/var/lib/teleport/versions/VERSION` and write the SHA to `/var/lib/teleport/versions/VERSION/sha256`. @@ -416,7 +416,7 @@ When `update` subcommand is otherwise executed, it will: 3. Check that `agent_auto_updates` is true, quit otherwise. 4. If the current version of Teleport is the latest, quit. 5. Wait `random(0, agent_update_jitter_seconds)` seconds. -6. Ensure there is enough free disk space to upgrade Teleport. +6. Ensure there is enough free disk space to upgrade Teleport via `df .`. 7. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. 8. Download and verify the checksum (tarball URL suffixed with `.sha256`). 9. Extract the tarball to `/var/lib/teleport/versions/VERSION` and write the SHA to `/var/lib/teleport/versions/VERSION/sha256`. 
From b7d44a9464039d15304636f844c238b2d8bb3da6 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 29 Jul 2024 14:47:02 -0400 Subject: [PATCH 046/105] content-length --- rfd/0169-auto-updates-linux-agents.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index e1e0ca7a82c8f..8af0b7f4b275f 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -393,7 +393,7 @@ The `enable` subcommand will: 1. Query the `/v1/webapi/find` endpoint. 2. If the current updater-managed version of Teleport is the latest, and teleport package is not installed, jump to (16). 3. If the current updater-managed version of Teleport is the latest, but the teleport package is installed, jump to (13). -4. Ensure there is enough free disk space to upgrade Teleport via `df .`. +4. Ensure there is enough free disk space to upgrade Teleport via `df .` and `content-length` header from `HEAD` request. 5. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. 6. Download and verify the checksum (tarball URL suffixed with `.sha256`). 7. Extract the tarball to `/var/lib/teleport/versions/VERSION` and write the SHA to `/var/lib/teleport/versions/VERSION/sha256`. @@ -416,7 +416,7 @@ When `update` subcommand is otherwise executed, it will: 3. Check that `agent_auto_updates` is true, quit otherwise. 4. If the current version of Teleport is the latest, quit. 5. Wait `random(0, agent_update_jitter_seconds)` seconds. -6. Ensure there is enough free disk space to upgrade Teleport via `df .`. +6. Ensure there is enough free disk space to upgrade Teleport via `df .` and `content-length` header from `HEAD` request. 7. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. 8. Download and verify the checksum (tarball URL suffixed with `.sha256`). 9. Extract the tarball to `/var/lib/teleport/versions/VERSION` and write the SHA to `/var/lib/teleport/versions/VERSION/sha256`. From 39be754840169d4c53777dacecaed42383079f48 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 29 Jul 2024 15:20:53 -0400 Subject: [PATCH 047/105] cache init --- rfd/0169-auto-updates-linux-agents.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 8af0b7f4b275f..56490acaa1e70 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -104,6 +104,9 @@ Upgrading all agents generates the following write load: - One write of `agent_upgrade_start_time` field per agent All reads are from cache. +If the cache is unhealthy, `agent_auto_update` is still served based on the last available value in cache. +This is safe because `agent_upgrade_start_time` is only written once during the upgrade. +However, this means that timeout thresholds should account for possible cache init time if initialization occurs right after `agent_upgrade_start_time` is written. 
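A sketch of the free-space check from the steps above, combining a statfs-based equivalent of `df .` with the `content-length` of a `HEAD` request; the 2x margin for extraction is an illustrative assumption, and the code is Linux-only:

```golang
package updater

import (
	"fmt"
	"net/http"
	"syscall"
)

// ensureDiskSpace verifies that dir can hold the tarball download plus
// extraction, using the content-length header from a HEAD request.
func ensureDiskSpace(dir, tarballURL string) error {
	resp, err := http.Head(tarballURL)
	if err != nil {
		return err
	}
	resp.Body.Close()
	if resp.ContentLength <= 0 {
		return fmt.Errorf("missing content-length for %s", tarballURL)
	}
	var fs syscall.Statfs_t
	if err := syscall.Statfs(dir, &fs); err != nil {
		return err
	}
	// Available bytes for unprivileged users, the figure `df .` reports.
	free := int64(fs.Bavail) * int64(fs.Bsize)
	if need := resp.ContentLength * 2; free < need {
		return fmt.Errorf("%s: need %d bytes free, have %d", dir, need, free)
	}
	return nil
}
```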
### Endpoints

From b5587c0f899ba1db85cb52e12b5a6528bc1fc238 Mon Sep 17 00:00:00 2001
From: Stephen Levine
Date: Mon, 29 Jul 2024 16:10:28 -0400
Subject: [PATCH 048/105] binary

---
 rfd/0169-auto-updates-linux-agents.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index 56490acaa1e70..2ee8e1a3813f0 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -292,7 +292,7 @@ It will be distributed as a separate package from Teleport, and manage the insta
 It will read the unauthenticated `/v1/webapi/find` endpoint from the Teleport proxy, parse new fields on that endpoint, and install the specified agent version according to the specified upgrade plan.
 It will download the correct version of Teleport as a tarball, unpack it in `/var/lib/teleport`, and ensure it is symlinked from `/usr/local/bin`.
 
-Source code for the updater will live in `integrations/updater`.
+Source code for the updater will live in the main Teleport repository, with the updater binary built from `tools/teleport-update`.

From f22873f0532f4f6bf785af0574dad798c5ab8a07 Mon Sep 17 00:00:00 2001
From: Stephen Levine
Date: Fri, 2 Aug 2024 15:19:48 -0400
Subject: [PATCH 049/105] more rollout mechanism changes

---
 rfd/0169-auto-updates-linux-agents.md | 81 ++++++++++++++++-----------
 1 file changed, 47 insertions(+), 34 deletions(-)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index 2ee8e1a3813f0..d0c2fee3ce8e0 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -54,25 +54,16 @@ Whether the updater querying the endpoint is instructed to upgrade (via `agent_a
 To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to the `/v1/webapi/find`.
 Teleport proxies use their access to heartbeat data to drive the rollout and modulate the `/v1/webapi/find` response given the host UUID.
 
-Rollouts are specified as interdependent groups of hosts, selected by SSH resource or instance label query.
-A host is eligible to upgrade if the selection query returns true.
-Instance labels are a new feature introduced by this RFD that may be used when the SSH service is not running or when it is undesirable to reuse SSH labels:
+Rollouts are specified as interdependent groups of hosts, selected by upgrade group identifier.
+A host is eligible to upgrade if its upgrade group identifier, set in teleport.yaml, matches:
 
 ```
 teleport:
-  labels:
-    environment: staging
-  commands:
-  # this command will add a label 'arch=x86_64' to an instance
-  - name: arch
-    command: ['/bin/uname', '-p']
-    period: 1h0m0s
+  upgrade_group: staging
 ```
 
-Only static and command-based labels may be used.
-
-At the start of a group rollout, the Teleport proxy marks a desired group of hosts to update in the backend.
-An arbitrary but UUID-deterministic fixed number of hosts (`max_in_flight % x total`) are instructed to upgrade at the same time via `/v1/webapi/find`.
+At the start of a group rollout, the Teleport auth server captures the desired group of hosts to update in the backend.
+A fixed number of hosts (`max_in_flight % x total`) are instructed to upgrade at the same time via `/v1/webapi/find`.
 Additional hosts are instructed to update as earlier updates complete, timeout, or fail, never exceeding `max_in_flight`.
The group rollout is halted if timeouts or failures exceed their specified thresholds.
 Group rollouts may be retried with `tctl autoupdate run`.
@@ -81,29 +72,41 @@ Group rollouts may be retried with `tctl autoupdate run`.
 
 ### Scalability
 
 Instance heartbeats will now be cached at both the auth server and the proxy.
 
-All rollout logic is triggered by instance heartbeat backend writes, as changes can only occur on these events.
+The rollout logic is progressed by instance heartbeat backend writes, as changes can only occur on these events.
+
 The following data related to the rollout are stored in each instance heartbeat:
 - `agent_upgrade_start_time`: timestamp of individual agent's upgrade time
-- `agent_upgrade_group_schedule`: schedule type of group (e.g., critical)
-- `agent_upgrade_group_name`: name of group (e.g., staging)
-- `agent_upgrade_group_start_time`: timestamp of current window start time
-- `agent_upgrade_group_end_time`: timestamp of current window end time
+- `agent_upgrade_group_name`: name of auto-update group
+
+At the start of the upgrade window, auth servers attempt to write an update rollout plan to the backend under a single key.
+This plan is protected by optimistic locking, and contains the following data:
+
+Data key: `[name of group]@[scheduled type]` (e.g., `staging@critical`)
 
-At the start of the window, all queried instance heartbeats are marked with updated values for the `agent_upgrade_group_*` fields.
-Instance heartbeats are included in the current window if all three fields match the window defined in `cluster_maintenance_config`.
+Data value JSON:
+- `group_start_time`: timestamp of current window start time
+- `group_end_time`: timestamp of current window end time
+- `host_order`: list of UUIDs in randomized order
+
+At a fixed interval, auth servers will check the plan to determine if a new plan is needed by comparing `group_start_time` to the current time and the desired window.
+If a new plan is needed, auth servers will query their cache of instance heartbeats and attempt to write the new plan.
+The first auth server to write the plan wins; others will be rejected by the optimistic lock.
+Auth servers will only write the plan if their instance heartbeat cache is initialized and recently updated.
 
 On each instance heartbeat write, the auth server looks at instance heartbeats in cache and determines if additional agents should be upgrading.
 If they should, additional instance heartbeats are marked as upgrading by setting `agent_upgrade_start_time` to the current time.
 When `agent_upgrade_start_time` is in the group's window, the proxy serves `agent_auto_upgrade: true` when queried via `/v1/webapi/find`.
-
-To avoid synchronization issues between auth servers, the rollout order is deterministically sorted by UUID.
-Two concurrent writes to different auth servers may temporarily result in fewer upgrading instances than desired, but this should be resolved on the next write.
+The predetermined ordering of hosts avoids cache synchronization issues between auth servers.
+Two concurrent heartbeat writes by auth servers may temporarily result in fewer upgrading instances than desired, but this should be resolved on the next write.
 
 Upgrading all agents generates the following write load:
-- One write of `agent_upgrade_group_*` fields per agent
-- One write of `agent_upgrade_start_time` field per agent
+- One write of the plan.
+- One write of `agent_upgrade_start_time` field per agent.
 
 All reads are from cache.
+Each instance heartbeat write will trigger an eventually consistent cache update on all auth servers and proxies, but not agents. If the cache is unhealthy, `agent_auto_update` is still served based on the last available value in cache. This is safe because `agent_upgrade_start_time` is only written once during the upgrade. However, this means that timeout thresholds should account for possible cache init time if initialization occurs right after `agent_upgrade_start_time` is written. @@ -142,15 +145,8 @@ spec: agent_auto_update_groups: # schedule is "regular" or "critical" regular: + # name of the group - name: staging-group - # agents defines which agents are included in the group. - agents: - # node_labels_expression selects agents by SSH resource query. - # default: all connected agents - node_labels_expression: 'labels["environment"]=="staging"' - # instance_labels_expression selects agents by instance query. - # default: all connected agents - instance_labels_expression: 'labels["environment"]=="staging"' # days specifies the days of the week when the group may be upgraded. # default: ["*"] (all days) days: [“Sun”, “Mon”, ... | "*"] @@ -186,7 +182,7 @@ spec: # ... ``` -Note that cycles and dependency chains longer than a week will be rejected. +Cycles and dependency chains longer than a week will be rejected. Otherwise, updates could take up to 7 weeks to propagate. Changing the version or schedule completely resets progress. @@ -285,6 +281,21 @@ Automatic updates configuration has been updated. Notes: - These two resources are separate so that Cloud customers can be restricted from updating `autoupdate_version`, while maintaining control over the rollout. +### Version Promotion + +Maintaining the version of different groups of agents is out-of-scope for this RFD. +This means that groups which employ auto-scaling or ephemeral resources will slowly converge to the latest Teleport version. +This could lead to a production outage, as the latest Teleport version may not receive any validation before it is advertised to newly provisioned resources in production. + +To solve this in the future, we can add an additional `--group` flag to `teleport-update`: +```shell +$ teleport-update enable --proxy example.teleport.sh --group staging +``` + +This group name could be provided as a parameter to `/v1/webapi/find`, so that newly added resources may install at the group's designated version. + +This will require tracking the desired version of groups in the backend, which will add additional complexity to the rollout logic. + ## Details - Linux Agents We will ship a new auto-updater package for Linux servers written in Go that does not interface with the system package manager. @@ -438,6 +449,8 @@ To ensure that SELinux permissions do not prevent the `teleport-updater` binary To ensure that `teleport` package removal does not interfere with `teleport-updater`, package removal will run `apt purge` (or `yum` equivalent) while ensuring that `/etc/teleport.yaml` and `/var/lib/teleport` are not purged. Failure to do this could result in `/etc/teleport.yaml` being removed when an operator runs `apt purge` at a later date. +To ensure that `teleport` package removal does not lead to a hard restart of Teleport, the updater will ensure that the package is removed without triggering needrestart or similar services. + To ensure that backups are consistent, the updater will use the [SQLite backup API](https://www.sqlite.org/backup.html) to perform the backup. 
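For reference, a consistent backup via the online backup API might look like the sketch below, using the mattn/go-sqlite3 bindings; the dependency choice and connection plumbing are assumptions for illustration, not the final implementation:

```golang
package updater

import (
	"database/sql"
	"fmt"

	sqlite3 "github.com/mattn/go-sqlite3"
)

// backupSQLite snapshots src into dst with SQLite's online backup API,
// yielding a consistent copy even if the source database is in use.
func backupSQLite(srcPath, dstPath string) error {
	var conns []*sqlite3.SQLiteConn
	sql.Register("sqlite3_backup", &sqlite3.SQLiteDriver{
		ConnectHook: func(c *sqlite3.SQLiteConn) error {
			conns = append(conns, c) // capture raw connections as they open
			return nil
		},
	})
	src, err := sql.Open("sqlite3_backup", srcPath)
	if err != nil {
		return err
	}
	defer src.Close()
	dst, err := sql.Open("sqlite3_backup", dstPath)
	if err != nil {
		return err
	}
	defer dst.Close()
	// Ping forces one real connection each, in a known order.
	if err := src.Ping(); err != nil {
		return err
	}
	if err := dst.Ping(); err != nil {
		return err
	}
	if len(conns) != 2 {
		return fmt.Errorf("expected 2 connections, got %d", len(conns))
	}
	bk, err := conns[1].Backup("main", conns[0], "main") // dst <- src
	if err != nil {
		return err
	}
	defer bk.Finish()
	_, err = bk.Step(-1) // -1 copies all remaining pages in one step
	return err
}
```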
#### Failure Conditions From ab73d114b5fff75b870eb8687062ea71818f639f Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 7 Aug 2024 16:07:47 -0400 Subject: [PATCH 050/105] scalability --- rfd/0169-auto-updates-linux-agents.md | 112 ++++++++++++++++++-------- 1 file changed, 80 insertions(+), 32 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index d0c2fee3ce8e0..129e29cf1af7c 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -7,7 +7,7 @@ state: draft ## Required Approvers -* Engineering: @russjones && @bernardjkim +* Engineering: @russjones * Product: @klizhentas || @xinding33 * Security: Doyensec @@ -18,9 +18,10 @@ This RFD proposes a new mechanism for Teleport agents to automatically update to All agent installations are in-scope for this proposal, including agents installed on Linux servers and Kubernetes. The following anti-goals are out-of-scope for this proposal, but will be addressed in future RFDs: -- Signing of agent artifacts via TUF +- Signing of agent artifacts (e.g., via TUF) - Teleport Cloud APIs for updating agents - Improvements to the local functionality of the Kubernetes agent for better compatibility with FluxCD and ArgoCD. +- Support for progressive rollouts to different groups of ephemeral or auto-scaling agents (see: Version Promotion). This RFD proposes a specific implementation of several sections in https://github.com/gravitational/teleport/pull/39217. @@ -31,13 +32,13 @@ Additionally, this RFD parallels the auto-update functionality for client tools The existing mechanism for automatic agent updates does not provide a hands-off experience for all Teleport users. 1. The use of system package management leads to interactions with `apt upgrade`, `yum upgrade`, etc. that can result in unintentional upgrades. -2. The use of system package management requires complex logic for each target distribution. +2. The use of system package management requires logic that varies significantly by target distribution. 3. The installation mechanism requires 4-5 commands, includes manually installing multiple packages, and varies depending on your version and edition of Teleport. 4. The use of bash to implement the updater makes changes difficult and prone to error. 5. The existing auto-updater has limited automated testing. 6. The use of GPG keys in system package managers has key management implications that we would prefer to solve with TUF in the future. 7. The desired agent version cannot be set via Teleport's operator-targeted CLI (tctl). -8. The rollout plan for the new agent version is not fully-configurable using tctl. +8. The rollout plan for new agent versions is not fully-configurable using tctl. 9. Agent installation logic is spread between the auto-updater script, install script, auto-discovery script, and documentation. 10. Teleport contains logic that is specific to Teleport Cloud upgrade workflows. 11. The existing auto-updater is not self-updating. @@ -51,7 +52,7 @@ Teleport will be updated to serve the desired agent version and edition from `/v The version and edition served from that endpoint will be configured using the `cluster_maintenance_config` and `autoupdate_version` resources. Whether the updater querying the endpoint is instructed to upgrade (via `agent_auto_update`) is dependent on the `host=[uuid]` parameter sent to `/v1/webapi/find`. 
-To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to the `/v1/webapi/find`. +To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to `/v1/webapi/find`. Teleport proxies use their access to heartbeat data to drive the rollout and modulate the `/v1/webapi/find` response given the host UUID. Rollouts are specified as interdependent groups of hosts, selected by upgrade group identifier. @@ -70,46 +71,85 @@ Group rollouts may be retried with `tctl autoupdate run`. ### Scalability -Instance heartbeats will now be cached at both the auth server and the proxy. +#### Window Capture -The rollout logic is progressed by instance heartbeat backend writes, as changes can only occur on these events. - -The following data related to the rollout are stored in each instance heartbeat: -- `agent_upgrade_start_time`: timestamp of individual agent's upgrade time -- `agent_upgrade_group_name`: name of auto-update group +Instance heartbeats will be cached by auth servers using a dedicated cache. +This cache is updated using rate-limited backend reads that occur in the background, to avoid mass-reads of instance heartbeats. +The rate is modulated by the total number of instance heartbeats. +The cache is considered healthy when all instance heartbeats present on the backend have been read in a time period that is also modulated by the total number of heartbeats. At the start of the upgrade window, auth servers attempt to write an update rollout plan to the backend under a single key. This plan is protected by optimistic locking, and contains the following data: -Data key: `[name of group]@[scheduled type]` (e.g., `staging@critical`) +Data key: `/autoupdate/[name of group]/[scheduled type](/[page-id])` (e.g., `/autoupdate/staging/critical/8745823`) Data value JSON: -- `group_start_time`: timestamp of current window start time -- `group_end_time`: timestamp of current window start time -- `host_order`: list of UUIDs in randomized order +- `start_time`: timestamp of current window start time +- `version`: version for which this rollout is valid +- `hosts`: list of UUIDs in randomized order +- `next_page`: additional UUIDs, if list is greater than 100,000 UUIDs +- `expiry`: 2 weeks -At a fixed interval, auth servers will check the plan to determine if a new plan is needed by comparing `group_start_time` to the current time and the desired window. +At a fixed interval, auth servers will check the plan to determine if a new plan is needed by comparing `start_time` to the current time and the desired window. If a new plan is needed, auth servers will query their cache of instance heartbeats and attempt to write the new plan. The first auth server to write the plan wins; others will be rejected by the optimistic lock. -Auth servers will only write the plan if their instance heartbeat cache is initialized and recently updated. +Auth servers will only write the plan if their instance heartbeat cache is healthy. -On each instance heartbeat write, the auth server looks at instance heartbeats in cache and determines if additional agents should be upgrading. -If they should, additional instance heartbeats are marked as upgrading by setting `agent_upgrade_start_time` to the current time. -When `agent_upgrade_start_time` is in the group's window, the proxy serves `agent_auto_upgrade: true` when queried via `/v1/webapi/find`. 
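A sketch of the optimistic-locking write described above, against a hypothetical create-if-absent backend primitive; the interface, key layout, and error value here are assumptions for illustration:

```golang
package rollout

import (
	"context"
	"encoding/json"
	"errors"
	"time"
)

// ErrExists is returned when the plan key has already been written.
var ErrExists = errors.New("key already exists")

// Backend is a hypothetical subset of the cluster backend API.
type Backend interface {
	CreateIfAbsent(ctx context.Context, key string, value []byte, ttl time.Duration) error
}

type rolloutPlan struct {
	StartTime time.Time `json:"start_time"`
	Version   string    `json:"version"`
	Hosts     []string  `json:"hosts"` // host UUIDs in randomized order
}

// writePlan races to publish the plan for a group and schedule type; exactly
// one auth server wins, and the rest observe ErrExists and back off.
func writePlan(ctx context.Context, b Backend, group, schedule string, p rolloutPlan) (won bool, err error) {
	v, err := json.Marshal(p)
	if err != nil {
		return false, err
	}
	err = b.CreateIfAbsent(ctx, "/autoupdate/"+group+"/"+schedule, v, 14*24*time.Hour) // 2-week expiry
	if errors.Is(err, ErrExists) {
		return false, nil
	}
	return err == nil, err
}
```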
+If the list is greater than 100,000 UUIDs, auth servers will first write pages with a randomly generated suffix, in a linked list, before the atomic non-suffixed write.
+If the non-suffixed write fails, the auth server is responsible for cleaning up the unusable pages.
+If cleanup fails, the unusable pages will expire after 2 weeks.

```
Winning auth:
  WRITE: /autoupdate/staging/critical/4324234 | next_page: null
  WRITE: /autoupdate/staging/critical/8745823 | next_page: 4324234
  WRITE: /autoupdate/staging/critical | next_page: 8745823

Losing auth:
  WRITE: /autoupdate/staging/critical/2342343 | next_page: null
  WRITE: /autoupdate/staging/critical/7678686 | next_page: 2342343
  WRITE CONFLICT: /autoupdate/staging/critical | next_page: 7678686
  DELETE: /autoupdate/staging/critical/7678686
  DELETE: /autoupdate/staging/critical/2342343
```

#### Rollout

The rollout logic is progressed by instance heartbeat backend writes, as changes can only occur on these events.

The following data related to the rollout are stored in each instance heartbeat:
- `agent_upgrade_start_time`: timestamp of individual agent's upgrade time
- `agent_upgrade_version`: current agent version
- `expiry`: expiration time of the heartbeat (extended to 24 hours at `agent_upgrade_start_time`)

Additionally, an in-memory data structure is maintained based on the cache, and kept up-to-date by a background process.
This data structure contains the number of unfinished (pending and ongoing) upgrades preceding each instance heartbeat in the rollout plan.
Instance heartbeats are considered completed when either `agent_upgrade_version` matches the plan version, or `agent_upgrade_start_time` is past the expiration time.
```
upgrading := make(map[Rollout]map[UUID]int)
```

On each instance heartbeat write, the auth server looks at the data structure to determine if the associated agent should begin upgrading.
This determination is made by comparing the stored number of unfinished upgrades to `max_in_flight % x len(hosts)`.
If the stored number is fewer, `agent_upgrade_start_time` is updated to the current time when the heartbeat is written.

The auth server writes the index of the last host that is allowed to upgrade to `/autoupdate/[name of group]/[scheduled type]/progress` (e.g., `/autoupdate/staging/critical/progress`).
Writes are rate-limited such that the progress is only updated every 10 seconds.
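The per-heartbeat decision above could look roughly like the following sketch; the precomputed unfinished-counts structure follows the text, with names assumed for illustration:

```golang
package rollout

import "math"

// planProgress tracks, for one rollout plan, how many pending or ongoing
// upgrades precede each host in the plan's randomized order.
type planProgress struct {
	unfinishedBefore []int // maintained from the instance heartbeat cache
	totalHosts       int
	maxInFlightPct   int
}

// shouldStartUpgrade reports whether the host at plan index i may begin
// upgrading when its heartbeat is written: the unfinished upgrades ahead
// of it must fit within the max_in_flight budget.
func (p *planProgress) shouldStartUpgrade(i int) bool {
	if i < 0 || i >= len(p.unfinishedBefore) {
		return false
	}
	budget := int(math.Ceil(float64(p.maxInFlightPct) / 100 * float64(p.totalHosts)))
	return p.unfinishedBefore[i] < budget
}
```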
+ +Proxies read all groups and maintain an in-memory map of host UUID to upgrading status: +``` +upgrading := make(map[UUID]bool) +``` +Proxies watch for changes to `/progress` and update the map accordingly. + +When the updater queries the proxy via `/v1/webapi/find?host=[host_uuid]`, the proxies query the map to determine the value of `agent_auto_upgrade: true`. + +The predetermined ordering of hosts avoids cache synchronization issues between auth servers. +Two concurrent heartbeat writes may temporarily result in fewer upgrading instances than desired, but this will eventually be resolved by cache propagation. + +Upgrading all agents generates the following additional backend write load: +- One write per page of the rollout plan per upgrade group. +- One write per auth server every 10 seconds, during rollouts. ### Endpoints @@ -185,6 +225,9 @@ spec: Cycles and dependency chains longer than a week will be rejected. Otherwise, updates could take up to 7 weeks to propagate. +The updater will receive `agent_auto_update: true` from the time is it designated for upgrade until the version changes in `autoupdate_version`. +After 24 hours, the upgrade is halted in-place, and the group is considered failed if unfinished. + Changing the version or schedule completely resets progress. Releasing new client versions multiple times a week has the potential to starve dependent groups from updates. @@ -228,7 +271,7 @@ $ tctl autoupdate update--set-agent-auto-update=off Automatic updates configuration has been updated. $ tctl autoupdate update --schedule regular --group staging-group --set-start-hour=3 Automatic updates configuration has been updated. -$ tctl autoupdate update --schedule regular --group staging-group --set-jitter-seconds=600 +$ tctl autoupdate update --schedule regular --group staging-group --set-jitter-seconds=60 Automatic updates configuration has been updated. $ tctl autoupdate reset Automatic updates configuration has been reset to defaults. @@ -289,7 +332,7 @@ This could lead to a production outage, as the latest Teleport version may not r To solve this in the future, we can add an additional `--group` flag to `teleport-update`: ```shell -$ teleport-update enable --proxy example.teleport.sh --group staging +$ teleport-update enable --proxy example.teleport.sh --group staging-group ``` This group name could be provided as a parameter to `/v1/webapi/find`, so that newly added resources may install at the group's designated version. 
@@ -315,6 +358,11 @@ $ teleport-update enable --proxy example.teleport.sh $ systemctl enable teleport ``` +For air-gapped Teleport installs, the agent may be configured with a custom tarball path template: +```shell +$ teleport-update enable --proxy example.teleport.sh --template 'https://example.com/teleport-{{ .Edition }}-{{ .Version }}-{{ .Arch }}.tgz' +``` + ### Filesystem ``` From 5ab98cb363aa1a5286fb647f7d7a7646ad33a2d1 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 7 Aug 2024 16:39:22 -0400 Subject: [PATCH 051/105] more scalability --- rfd/0169-auto-updates-linux-agents.md | 33 +++++++++++++-------------- 1 file changed, 16 insertions(+), 17 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 129e29cf1af7c..50c6ad8615918 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -20,8 +20,8 @@ All agent installations are in-scope for this proposal, including agents install The following anti-goals are out-of-scope for this proposal, but will be addressed in future RFDs: - Signing of agent artifacts (e.g., via TUF) - Teleport Cloud APIs for updating agents -- Improvements to the local functionality of the Kubernetes agent for better compatibility with FluxCD and ArgoCD. -- Support for progressive rollouts to different groups of ephemeral or auto-scaling agents (see: Version Promotion). +- Improvements to the local functionality of the Kubernetes agent for better compatibility with FluxCD and ArgoCD +- Support for progressive rollouts to different groups of ephemeral or auto-scaling agents (see: Version Promotion) This RFD proposes a specific implementation of several sections in https://github.com/gravitational/teleport/pull/39217. @@ -44,20 +44,18 @@ The existing mechanism for automatic agent updates does not provide a hands-off 11. The existing auto-updater is not self-updating. 12. It is difficult and undocumented to automate agent upgrades with custom automation (e.g., with JamF). -We must provide a seamless, hands-off experience for auto-updates that is easy to maintain. +We must provide a seamless, hands-off experience for auto-updates of Teleport Agents that is easy to maintain. ## Details - Teleport API -Teleport will be updated to serve the desired agent version and edition from `/v1/webapi/find`. -The version and edition served from that endpoint will be configured using the `cluster_maintenance_config` and `autoupdate_version` resources. -Whether the updater querying the endpoint is instructed to upgrade (via `agent_auto_update`) is dependent on the `host=[uuid]` parameter sent to `/v1/webapi/find`. +Teleport proxies will be updated to serve the desired agent version and edition from `/v1/webapi/find`. +The version and edition served from that endpoint will be configured using new `cluster_maintenance_config` and `autoupdate_version` resources. +Whether the Teleport updater querying the endpoint is instructed to upgrade (via `agent_auto_update`) is dependent on the `host=[uuid]` parameter sent to `/v1/webapi/find`. To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to `/v1/webapi/find`. -Teleport proxies use their access to heartbeat data to drive the rollout and modulate the `/v1/webapi/find` response given the host UUID. 
+Teleport auth servers use their access to heartbeat data to drive the rollout, while Teleport proxies modulate the `/v1/webapi/find` response given the host UUID. Rollouts are specified as interdependent groups of hosts, selected by upgrade group identifier. -A host is eligible to upgrade if the upgrade group identifier matches, set in teleport.yaml: - ``` teleport: upgrade_group: staging @@ -75,7 +73,7 @@ Group rollouts may be retried with `tctl autoupdate run`. Instance heartbeats will be cached by auth servers using a dedicated cache. This cache is updated using rate-limited backend reads that occur in the background, to avoid mass-reads of instance heartbeats. -The rate is modulated by the total number of instance heartbeats. +The rate is modulated by the total number of instance heartbeats, to avoid putting too much load on the backend on large clusters. The cache is considered healthy when all instance heartbeats present on the backend have been read in a time period that is also modulated by the total number of heartbeats. At the start of the upgrade window, auth servers attempt to write an update rollout plan to the backend under a single key. @@ -86,7 +84,7 @@ Data key: `/autoupdate/[name of group]/[scheduled type](/[page-id])` (e.g., `/au Data value JSON: - `start_time`: timestamp of current window start time - `version`: version for which this rollout is valid -- `hosts`: list of UUIDs in randomized order +- `hosts`: list of host UUIDs in randomized order - `next_page`: additional UUIDs, if list is greater than 100,000 UUIDs - `expiry`: 2 weeks @@ -125,8 +123,8 @@ The following data related to the rollout are stored in each instance heartbeat: Additionally, an in-memory data structure is maintained based on the cache, and kept up-to-date by a background process. This data structure contains the number of unfinished (pending and ongoing) upgrades preceding each instance heartbeat in the rollout plan. Instance heartbeats are considered completed when either `agent_upgrade_version` matches the plan version, or `agent_upgrade_start_time` is past the expiration time. -``` -upgrading := make(map[Rollout][UUID]int) +```golang +unfinished := make(map[Rollout][UUID]int) ``` On each instance heartbeat write, the auth server looks at the data structure to determine if the associated agent should begin upgrading. @@ -137,7 +135,7 @@ The auth server writes the index of the last host that is allowed to upgrade to Writes are rate-limited such that the progress is only updated every 10 seconds. Proxies read all groups and maintain an in-memory map of host UUID to upgrading status: -``` +```golang upgrading := make(map[UUID]bool) ``` Proxies watch for changes to `/progress` and update the map accordingly. @@ -163,7 +161,6 @@ Upgrading all agents generates the following additional backend write load: } ``` Notes: -- The Teleport proxy uses `cluster_maintenance_config` and `autoupdate_config` (below) to determine the time when the served `agent_auto_update` is `true` for the provided host UUID. - Agents will only upgrade if `agent_auto_update` is `true`, but new installations will use `agent_version` regardless of the value in `agent_auto_update`. - The edition served is the cluster edition (enterprise, enterprise-fips, or oss), and cannot be configured. - The host UUID is ready from `/var/lib/teleport` by the updater. @@ -222,7 +219,8 @@ spec: # ... ``` -Cycles and dependency chains longer than a week will be rejected. +Dependency cycles are rejected. 
+Dependency chains longer than a week will be rejected. Otherwise, updates could take up to 7 weeks to propagate. The updater will receive `agent_auto_update: true` from the time is it designated for upgrade until the version changes in `autoupdate_version`. @@ -328,7 +326,8 @@ Notes: Maintaining the version of different groups of agents is out-of-scope for this RFD. This means that groups which employ auto-scaling or ephemeral resources will slowly converge to the latest Teleport version. -This could lead to a production outage, as the latest Teleport version may not receive any validation before it is advertised to newly provisioned resources in production. + +**This could lead to a production outage, as the latest Teleport version may not receive any validation before it is advertised to newly provisioned resources in production.** To solve this in the future, we can add an additional `--group` flag to `teleport-update`: ```shell From 562f7340599a7b2a108e910fb4fce8814f658728 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Thu, 8 Aug 2024 17:29:29 -0400 Subject: [PATCH 052/105] use 100kib pages for plan --- rfd/0169-auto-updates-linux-agents.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 50c6ad8615918..291993f274676 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -93,7 +93,10 @@ If a new plan is needed, auth servers will query their cache of instance heartbe The first auth server to write the plan wins; others will be rejected by the optimistic lock. Auth servers will only write the plan if their instance heartbeat cache is healthy. -If the list is greater than 100,000 UUIDs, auth servers will first write pages with a randomly generated suffix, in a linked-link, before the atomic non-suffixed write. +If the resource size is greater than 100 KiB, auth servers will divide the resource into pages no greater than 100 KiB each. +Each page will duplicate all values besides `hosts`, which will be different for each page. +All pages besides the first page will be suffixed with a randomly generated number. +Pages will be written in reverse order, in a linked-link, before the final atomic non-suffixed write of the first page. If the non-suffixed write fails, the auth server is responsible for cleaning up the unusable pages. If cleanup fails, the unusable pages will expire after 2 weeks. From 3f0fb8f77d33c2e6999cf8b66e4c27f58e3edf42 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Tue, 13 Aug 2024 14:03:19 -0400 Subject: [PATCH 053/105] Add RPCs, tweak API design --- rfd/0169-auto-updates-linux-agents.md | 412 +++++++++++++++++++++++--- 1 file changed, 370 insertions(+), 42 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 291993f274676..d3e693d9a56a6 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -49,8 +49,8 @@ We must provide a seamless, hands-off experience for auto-updates of Teleport Ag ## Details - Teleport API Teleport proxies will be updated to serve the desired agent version and edition from `/v1/webapi/find`. -The version and edition served from that endpoint will be configured using new `cluster_maintenance_config` and `autoupdate_version` resources. 
-Whether the Teleport updater querying the endpoint is instructed to upgrade (via `agent_auto_update`) is dependent on the `host=[uuid]` parameter sent to `/v1/webapi/find`. +The version and edition served from that endpoint will be configured using new `cluster_autoupdate_config` and `autoupdate_version` resources. +Whether the Teleport updater querying the endpoint is instructed to upgrade (via `agent_autoupdate`) is dependent on the `host=[uuid]` parameter sent to `/v1/webapi/find`. To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to `/v1/webapi/find`. Teleport auth servers use their access to heartbeat data to drive the rollout, while Teleport proxies modulate the `/v1/webapi/find` response given the host UUID. @@ -79,11 +79,12 @@ The cache is considered healthy when all instance heartbeats present on the back At the start of the upgrade window, auth servers attempt to write an update rollout plan to the backend under a single key. This plan is protected by optimistic locking, and contains the following data: -Data key: `/autoupdate/[name of group]/[scheduled type](/[page-id])` (e.g., `/autoupdate/staging/critical/8745823`) +Data key: `/autoupdate/[name of group](/[page-id])` (e.g., `/autoupdate/staging/8745823`) Data value JSON: - `start_time`: timestamp of current window start time - `version`: version for which this rollout is valid +- `schedule`: type of schedule that triggered the rollout - `hosts`: list of host UUIDs in randomized order - `next_page`: additional UUIDs, if list is greater than 100,000 UUIDs - `expiry`: 2 weeks @@ -102,16 +103,16 @@ If cleanup fails, the unusable pages will expire after 2 weeks. ``` Winning auth: - WRITE: /autoupdate/staging/critical/4324234 | next_page: null - WRITE: /autoupdate/staging/critical/8745823 | next_page: 4324234 - WRITE: /autoupdate/staging/critical | next_page: 8745823 + WRITE: /autoupdate/staging/4324234 | next_page: null + WRITE: /autoupdate/staging/8745823 | next_page: 4324234 + WRITE: /autoupdate/staging | next_page: 8745823 Losing auth: - WRITE: /autoupdate/staging/critical/2342343 | next_page: null - WRITE: /autoupdate/staging/critical/7678686 | next_page: 2342343 - WRITE CONFLICT: /autoupdate/staging/critical | next_page: 7678686 - DELETE: /autoupdate/staging/critical/7678686 - DELETE: /autoupdate/staging/critical/2342343 + WRITE: /autoupdate/staging/2342343 | next_page: null + WRITE: /autoupdate/staging/7678686 | next_page: 2342343 + WRITE CONFLICT: /autoupdate/staging | next_page: 7678686 + DELETE: /autoupdate/staging/7678686 + DELETE: /autoupdate/staging/2342343 ``` #### Rollout @@ -134,55 +135,66 @@ On each instance heartbeat write, the auth server looks at the data structure to This determination is made by comparing the stored number of unfinished upgrades to `max_in_flight % x len(hosts)`. If the stored number is fewer, `agent_upgrade_start_time` is updated to the current time when the heartbeat is written. -The auth server writes the index of the last host that is allowed to upgrade to `/autoupdate/[name of group]/[scheduled type]/progress` (e.g., `/autoupdate/staging/critical/progress`). 
+The auth server writes the following keys to `/autoupdate/[name of group]/status` (e.g., `/autoupdate/staging/status`): +- `last_active_host_index`: index of the last host allowed to upgrade +- `failed_host_count`: failed host count +- `timeout_host_count`: timed-out host count + Writes are rate-limited such that the progress is only updated every 10 seconds. +If the auth server's cached progress is greater than its calculated progress, the auth server declines to update the progress. + +The predetermined ordering of hosts avoids cache synchronization issues between auth servers. +Two concurrent heartbeat writes may temporarily result in fewer upgrading instances than desired, but this will eventually be resolved by cache propagation. -Proxies read all groups and maintain an in-memory map of host UUID to upgrading status: +Each group rollout is represented by an `agent_rollout_plan` Teleport resource that includes the progress and host count, but not the list of UUIDs. +Proxies use the start time in the resource to determine when to stream the list of UUIDs via a dedicated RPC. +Proxies watch the status section of `agent_rollout_plan` for updates to progress. + +Proxies read all started rollouts and maintain an in-memory map of host UUID to upgrading status: ```golang upgrading := make(map[UUID]bool) ``` -Proxies watch for changes to `/progress` and update the map accordingly. +Proxies watch for changes to the plan and update the map accordingly. When the updater queries the proxy via `/v1/webapi/find?host=[host_uuid]`, the proxies query the map to determine the value of `agent_auto_upgrade: true`. -The predetermined ordering of hosts avoids cache synchronization issues between auth servers. -Two concurrent heartbeat writes may temporarily result in fewer upgrading instances than desired, but this will eventually be resolved by cache propagation. - Upgrading all agents generates the following additional backend write load: - One write per page of the rollout plan per upgrade group. - One write per auth server every 10 seconds, during rollouts. -### Endpoints +### REST Endpoints `/v1/webapi/find?host=[host_uuid]` ```json { "server_edition": "enterprise", "agent_version": "15.1.1", - "agent_auto_update": true, + "agent_autoupdate": true, "agent_update_jitter_seconds": 10 } ``` Notes: -- Agents will only upgrade if `agent_auto_update` is `true`, but new installations will use `agent_version` regardless of the value in `agent_auto_update`. +- Agents will only upgrade if `agent_autoupdate` is `true`, but new installations will use `agent_version` regardless of the value in `agent_autoupdate`. - The edition served is the cluster edition (enterprise, enterprise-fips, or oss), and cannot be configured. - The host UUID is ready from `/var/lib/teleport` by the updater. ### Teleport Resources +#### Scheduling + ```yaml -kind: cluster_maintenance_config +kind: cluster_autoupdate_config spec: - # agent_auto_update allows turning agent updates on or off at the + # agent_autoupdate allows turning agent updates on or off at the # cluster level. Only turn agent automatic updates off if self-managed # agent updates are in place. - agent_auto_update: true|false + agent_autoupdate: true|false - # agent_auto_update_groups contains both "regular" and "critical" schedules. + # agent_group_schedules contains both "regular" and "critical" schedules. # The schedule used is determined by the agent_version_schedule associated # with the version in autoupdate_version. 
# Groups are not configurable with the "immediate" schedule.
  agent_group_schedules:
    # schedule is "regular" or "critical"
    regular:
    # name of the group
    - name: staging-group
      # days specifies the days of the week when the group may be upgraded.
      # default: ["*"] (all days)
      days: [“Sun”, “Mon”, ... | "*"]
      # start_hour specifies the hour when the group may start upgrading.
      # default: 0
      start_hour: 0-23
      # jitter_seconds specifies a maximum jitter duration after the start hour.
      # The agent upgrader client will pick a random time within this duration to wait to upgrade.
      # default: 0
      jitter_seconds: 0-60
      # timeout_seconds specifies the amount of time, after the specified jitter, after which
      # an agent upgrade will be considered timed out if the version does not change.
      # default: 60
      timeout_seconds: 30-900
      # failure_seconds specifies the amount of time, after the specified jitter, after which
      # an agent upgrade will be considered failed if the agent heartbeat stops before the upgrade is complete.
      # default: 0
      failure_seconds: 0-900
      # max_in_flight specifies the maximum number of agents that may be upgraded at the same time.
      # default: 100%
      max_in_flight: 0-100%
      # max_timeout_before_halt specifies the percentage of clients that may time out before this group
      # and all dependent groups are halted.
      # default: 10%
      max_timeout_before_halt: 0-100%
      # max_failed_before_halt specifies the percentage of clients that may fail before this group
      # and all dependent groups are halted.
      # default: 0
      max_failed_before_halt: 0-100%
      # requires specifies groups that must pass with the current version before this group is allowed
      # to run using that version.
      requires: ["test-group"]

  # ...
```

Dependency cycles are rejected.
Dependency chains longer than a week will be rejected.
Otherwise, updates could take up to 7 weeks to propagate.

The updater will receive `agent_autoupdate: true` from the time it is designated for upgrade until the version changes in `autoupdate_version`.
After 24 hours, the upgrade is halted in-place, and the group is considered failed if unfinished.

Changing the version or schedule completely resets progress.
Releasing new client versions multiple times a week has the potential to starve dependent groups from updates.

Note the MVP version of this resource will not support host UUIDs, groups, or backpressure, and will use the following simplified UX with the `agent_default_schedules` field.
This field will remain indefinitely to cover connected agents that are not matched to a group.

```yaml
kind: cluster_autoupdate_config
spec:
  # agent_autoupdate allows turning agent updates on or off at the
  # cluster level. Only turn agent automatic updates off if self-managed
  # agent updates are in place.
  agent_autoupdate: true|false

  # agent_default_schedules contains "regular," "critical," and "immediate" schedules.
  # These schedules apply to agents not scheduled by agent_group_schedules.
# The schedule used is determined by the agent_version_schedule associated - # with the version in autoupdate_version. - agent_auto_update: + # with the agent_version in the autoupdate_version resource. + agent_default_schedules: # The immediate schedule results in all agents updating simultaneously. # Only client-side jitter is configurable. immediate: @@ -265,6 +281,7 @@ spec: # ... ``` +To allow `agent_default_schedules` and `agent_group_schedules` to co-exist, a reserved `default` `agent_rollout_plan` will be created. ```shell # configuration @@ -274,6 +291,8 @@ $ tctl autoupdate update --schedule regular --group staging-group --set-start-ho Automatic updates configuration has been updated. $ tctl autoupdate update --schedule regular --group staging-group --set-jitter-seconds=60 Automatic updates configuration has been updated. +$ tctl autoupdate update --schedule regular --default --set-jitter-seconds=60 +Automatic updates configuration has been updated. $ tctl autoupdate reset Automatic updates configuration has been reset to defaults. @@ -323,7 +342,32 @@ Automatic updates configuration has been updated. ``` Notes: -- These two resources are separate so that Cloud customers can be restricted from updating `autoupdate_version`, while maintaining control over the rollout. +- `autoupdate_version` is separate from `cluster_autoupdate_config` so that Cloud customers can be restricted from updating `autoupdate_version`, while maintaining control over the rollout. + +#### Rollout + +```yaml +kind: agent_rollout_plan +spec: + # start time of the rollout + start_time: 0001-01-01T00:00:00Z + # target version of the rollout + version: X.Y.Z + # schedule that triggered the rollout + schedule: regular + # hosts updated by the rollout + host_count: 127 +status: + # current host index in rollout progress + last_active_host_index: 23 + # failed hosts + failed_host_count: 3 + # timed-out hosts + timeout_host_count: 1 +``` + +Notes: +- This resource is stored in a paginated format with separate keys for each page and progress ### Version Promotion @@ -477,7 +521,7 @@ The `disable` subcommand will: When `update` subcommand is otherwise executed, it will: 1. Check `updates.yaml`, and quit (exit 0) if `enabled` is false, or quit (exit 1) if `enabled` is true and no proxy address is set. 2. Query the `/v1/webapi/find` endpoint. -3. Check that `agent_auto_updates` is true, quit otherwise. +3. Check that `agent_autoupdates` is true, quit otherwise. 4. If the current version of Teleport is the latest, quit. 5. Wait `random(0, agent_update_jitter_seconds)` seconds. 6. Ensure there is enough free disk space to upgrade Teleport via `df .` and `content-length` header from `HEAD` request. @@ -624,6 +668,8 @@ Rollbacks for the Kubernetes updater, as well as packaging changes to improve UX The existing update scheduling system will remain in-place until the old auto-updater is fully deprecated. +Eventually, the `cluster_maintenance_config` resource will be deprecated. + ## Security The initial version of automatic updates will rely on TLS to establish @@ -645,6 +691,288 @@ Care will be taken to ensure that updater logs are sharable with Teleport Suppor When TUF is added, that events related to supply chain security may be sent to the Teleport cluster via the Teleport Agent. +## Protobuf API Changes + +Note: all updates use revisions to prevent data loss in case of concurrent access. 
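To make the revision guarantee concrete, the following sketch shows the compare-and-swap pattern that revision-checked writes imply. The `Backend` interface, `Resource` type, and `ErrRevisionMismatch` error are illustrative assumptions for this document, not the actual Teleport backend API:

```golang
package autoupdate

import (
	"context"
	"errors"
)

// ErrRevisionMismatch is a hypothetical error returned when a concurrent
// writer changed a resource between our read and our write.
var ErrRevisionMismatch = errors.New("revision mismatch")

// Resource is a simplified stand-in for a revisioned Teleport resource.
type Resource struct {
	Revision string
	Spec     map[string]any
}

// Backend is an assumed minimal interface over the cluster state backend.
type Backend interface {
	Get(ctx context.Context, key string) (Resource, error)
	// ConditionalUpdate succeeds only if the stored revision still matches.
	ConditionalUpdate(ctx context.Context, key, expectedRevision string, r Resource) error
}

// UpdateWithRevision retries a read-modify-write loop until no concurrent
// writer interferes, so concurrent updates never silently overwrite each other.
func UpdateWithRevision(ctx context.Context, b Backend, key string, modify func(*Resource)) error {
	for {
		r, err := b.Get(ctx, key)
		if err != nil {
			return err
		}
		modify(&r)
		if err := b.ConditionalUpdate(ctx, key, r.Revision, r); !errors.Is(err, ErrRevisionMismatch) {
			return err // nil on success, or a non-retryable error
		}
		// Lost the race: re-read the new revision and retry.
	}
}
```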
+
+### clusterconfig/v1
+
+```protobuf
+syntax = "proto3";
+
+package teleport.clusterconfig.v1;
+
+option go_package = "github.com/gravitational/teleport/api/gen/proto/go/teleport/clusterconfig/v1;clusterconfigv1";
+
+// ClusterConfigService provides methods to manage cluster configuration resources.
+service ClusterConfigService {
+  // ...
+
+  // GetClusterAutoupdateConfig returns the cluster autoupdate config.
+  rpc GetClusterAutoupdateConfig(GetClusterAutoupdateConfigRequest) returns (ClusterAutoupdateConfig);
+  // CreateClusterAutoupdateConfig creates the cluster autoupdate config.
+  rpc CreateClusterAutoupdateConfig(CreateClusterAutoupdateConfigRequest) returns (ClusterAutoupdateConfig);
+  // UpdateClusterAutoupdateConfig updates the cluster autoupdate config.
+  rpc UpdateClusterAutoupdateConfig(UpdateClusterAutoupdateConfigRequest) returns (ClusterAutoupdateConfig);
+  // UpsertClusterAutoupdateConfig overwrites the cluster autoupdate config.
+  rpc UpsertClusterAutoupdateConfig(UpsertClusterAutoupdateConfigRequest) returns (ClusterAutoupdateConfig);
+  // ResetClusterAutoupdateConfig restores the cluster autoupdate config to default values.
+  rpc ResetClusterAutoupdateConfig(ResetClusterAutoupdateConfigRequest) returns (ClusterAutoupdateConfig);
+}
+
+// GetClusterAutoupdateConfigRequest requests the contents of the ClusterAutoupdateConfig.
+message GetClusterAutoupdateConfigRequest {}
+
+// CreateClusterAutoupdateConfigRequest requests creation of the ClusterAutoupdateConfig.
+message CreateClusterAutoupdateConfigRequest {
+  ClusterAutoupdateConfig cluster_autoupdate_config = 1;
+}
+
+// UpdateClusterAutoupdateConfigRequest requests an update of the ClusterAutoupdateConfig.
+message UpdateClusterAutoupdateConfigRequest {
+  ClusterAutoupdateConfig cluster_autoupdate_config = 1;
+}
+
+// UpsertClusterAutoupdateConfigRequest requests an upsert of the ClusterAutoupdateConfig.
+message UpsertClusterAutoupdateConfigRequest {
+  ClusterAutoupdateConfig cluster_autoupdate_config = 1;
+}
+
+// ResetClusterAutoupdateConfigRequest requests a reset of the ClusterAutoupdateConfig to default values.
+message ResetClusterAutoupdateConfigRequest {}
+
+// ClusterAutoupdateConfig holds dynamic configuration settings for cluster maintenance activities.
+message ClusterAutoupdateConfig {
+  // kind is the kind of the resource.
+  string kind = 1;
+  // sub_kind is the sub kind of the resource.
+  string sub_kind = 2;
+  // version is the version of the resource.
+  string version = 3;
+  // metadata is the metadata of the resource.
+  teleport.header.v1.Metadata metadata = 4;
+  // spec is the spec of the resource.
+  ClusterAutoupdateConfigSpec spec = 7;
+}
+
+// ClusterAutoupdateConfigSpec is the spec for the cluster autoupdate config.
+message ClusterAutoupdateConfigSpec {
+  // agent_autoupdate specifies whether agent autoupdates are enabled.
+  bool agent_autoupdate = 1;
+  // agent_default_schedules specifies schedules for upgrades of agents
+  // not scheduled by agent_group_schedules.
+  AgentAutoupdateDefaultSchedules agent_default_schedules = 2;
+  // agent_group_schedules specifies schedules for upgrades of grouped agents.
+  AgentAutoupdateGroupSchedules agent_group_schedules = 3;
+}
+
+// AgentAutoupdateDefaultSchedules specifies the default update schedules for non-grouped agents.
+message AgentAutoupdateDefaultSchedules {
+  // regular schedule for non-critical versions.
+  AgentAutoupdateSchedule regular = 1;
+  // critical schedule for urgently needed versions.
+  AgentAutoupdateSchedule critical = 2;
+  // immediate schedule for versions that must be deployed with no delay.
+  AgentAutoupdateImmediateSchedule immediate = 3;
+}
+
+// AgentAutoupdateSchedule specifies a default schedule for non-grouped agents.
+message AgentAutoupdateSchedule {
+  // days to run update
+  repeated Day days = 2;
+  // start_hour to initiate update
+  int32 start_hour = 3;
+  // jitter_seconds to introduce before update as rand([0, jitter_seconds]).
+  int32 jitter_seconds = 4;
+}
+
+// AgentAutoupdateImmediateSchedule specifies the default schedule for non-grouped agents on the immediate schedule.
+message AgentAutoupdateImmediateSchedule {
+  // jitter_seconds to introduce before update as rand([0, jitter_seconds]).
+  int32 jitter_seconds = 4;
+}
+
+// AgentAutoupdateGroupSchedules specifies update schedules for grouped agents.
+message AgentAutoupdateGroupSchedules {
+  // regular schedules for non-critical versions.
+  repeated AgentAutoupdateGroup regular = 1;
+  // critical schedules for urgently needed versions.
+  repeated AgentAutoupdateGroup critical = 2;
+}
+
+// AgentAutoupdateGroup specifies the update schedule for a group of agents.
+message AgentAutoupdateGroup {
+  // name of the group
+  string name = 1;
+  // days to run update
+  repeated Day days = 2;
+  // start_hour to initiate update
+  int32 start_hour = 3;
+  // jitter_seconds to introduce before update as rand([0, jitter_seconds]).
+  int32 jitter_seconds = 4;
+  // timeout_seconds before an agent is considered timed out (no version change)
+  int32 timeout_seconds = 5;
+  // failure_seconds before an agent is considered failed (loses connection)
+  int32 failure_seconds = 6;
+  // max_in_flight specifies agents that can be upgraded at the same time, by percent.
+  string max_in_flight = 7;
+  // max_timeout_before_halt specifies agents that can time out before the rollout is halted, by percent.
+  string max_timeout_before_halt = 8;
+  // max_failed_before_halt specifies agents that can fail before the rollout is halted, by percent.
+  string max_failed_before_halt = 9;
+  // requires specifies rollout groups that must succeed for the current version/schedule before this rollout can run.
+  repeated string requires = 10;
+}
+
+// Day of the week
+enum Day {
+  ALL = 0;
+  SUNDAY = 1;
+  MONDAY = 2;
+  TUESDAY = 3;
+  WEDNESDAY = 4;
+  THURSDAY = 5;
+  FRIDAY = 6;
+  SATURDAY = 7;
+}
+```
+
+### autoupdate/v1
+
+```protobuf
+syntax = "proto3";
+
+package teleport.autoupdate.v1;
+
+option go_package = "github.com/gravitational/teleport/api/gen/proto/go/teleport/autoupdate/v1;autoupdatev1";
+
+// AutoupdateService serves agent and client automatic version updates.
+service AutoupdateService {
+  // GetAutoupdateVersion returns the autoupdate version.
+  rpc GetAutoupdateVersion(GetAutoupdateVersionRequest) returns (AutoupdateVersion);
+  // CreateAutoupdateVersion creates the autoupdate version.
+  rpc CreateAutoupdateVersion(CreateAutoupdateVersionRequest) returns (AutoupdateVersion);
+  // UpdateAutoupdateVersion updates the autoupdate version.
+  rpc UpdateAutoupdateVersion(UpdateAutoupdateVersionRequest) returns (AutoupdateVersion);
+  // UpsertAutoupdateVersion overwrites the autoupdate version.
+  rpc UpsertAutoupdateVersion(UpsertAutoupdateVersionRequest) returns (AutoupdateVersion);
+
+  // GetAgentRolloutPlan returns the agent rollout plan and current progress.
+  rpc GetAgentRolloutPlan(GetAgentRolloutPlanRequest) returns (AgentRolloutPlan);
+  // GetAgentRolloutPlanHosts streams the agent rollout plan's list of all hosts.
+ rpc GetAgentRolloutPlanHosts(GetAgentRolloutPlanHostsRequest) returns (stream AgentRolloutPlanHost); +} + +// GetAutoupdateVersionRequest requests the autoupdate_version singleton resource. +message GetAutoupdateVersionRequest {} + +// GetAutoupdateVersionRequest requests creation of the autoupdate_version singleton resource. +message CreateAutoupdateVersionRequest { + // autoupdate_version resource contents + AutoupdateVersion autoupdate_version = 1; +} + +// GetAutoupdateVersionRequest requests an update of the autoupdate_version singleton resource. +message UpdateAutoupdateVersionRequest { + // autoupdate_version resource contents + AutoupdateVersion autoupdate_version = 1; +} + +// GetAutoupdateVersionRequest requests an upsert of the autoupdate_version singleton resource. +message UpsertAutoupdateVersionRequest { + // autoupdate_version resource contents + AutoupdateVersion autoupdate_version = 1; +} + +// AutoupdateVersion holds dynamic configuration settings for autoupdate versions. +message AutoupdateVersion { + // kind is the kind of the resource. + string kind = 1; + // sub_kind is the sub kind of the resource. + string sub_kind = 2; + // version is the version of the resource. + string version = 3; + // metadata is the metadata of the resource. + teleport.header.v1.Metadata metadata = 4; + // spec is the spec of the resource. + AutoupdateVersionSpec spec = 6; +} + +// AutoupdateVersionSpec is the spec for the autoupdate version. +message AutoupdateVersionSpec { + // agent_version is the desired agent version for new rollouts. + string agent_version = 1; + // agent_version schedule is the schedule to use for rolling out the agent_version. + Schedule agent_version_schedule = 2; +} + +// Schedule type for the rollout +enum Schedule { + // REGULAR update schedule + REGULAR = 0; + // CRITICAL update schedule for critical bugs and vulnerabilities + CRITICAL = 1; + // IMMEDIATE update schedule for updating all agents immediately + IMMEDIATE = 2; +} + +// GetAgentRolloutPlanRequest requests an agent_rollout_plan. +message GetAgentRolloutPlanRequest { + // name of the agent_rollout_plan + string name = 1; +} + +// GetAgentRolloutPlanHostsRequest requests the ordered host UUIDs for an agent_rollout_plan. +message GetAgentRolloutPlanHostsRequest { + // name of the agent_rollout_plan + string name = 1; +} + +// AgentRolloutPlan defines a version update rollout consisting a fixed group of agents. +message AgentRolloutPlan { + // kind is the kind of the resource. + string kind = 1; + // sub_kind is the sub kind of the resource. + string sub_kind = 2; + // version is the version of the resource. + string version = 3; + // metadata is the metadata of the resource. + teleport.header.v1.Metadata metadata = 4; + // spec is the spec of the resource. + AgentRolloutPlanSpec spec = 5; + // status is the status of the resource. + AgentRolloutPlanStatus status = 6; +} + +// AutoupdateVersionSpec is the spec for the autoupdate version. +message AgentRolloutPlanSpec { + // start_time of the rollout + google.protobuf.Timestamp start_time = 1; + // version targetted by the rollout + string version = 2; + // schedule that triggered the rollout + string schedule = 3; + // host_count of hosts to update + int64 host_count = 4; +} + +// AutoupdateVersionSpec is the spec for the autoupdate version. +message AgentRolloutPlanStatus { + // last_active_host_index specifies the index of the last host that may be updated. + int64 last_active_host_index = 1; + // failed_host_count specifies the number of failed hosts. 
+ int64 failed_host_count = 2; + // timeout_host_count specifies the number of timed-out hosts. + int64 timeout_host_count = 3; +} + +// AgentRolloutPlanHost identifies an agent by host ID +message AgentRolloutPlanHost { + // host_id of a host included in the rollout + string host_id = 1; +} +``` + ## Execution Plan 1. Implement Teleport APIs for new scheduling system (without groups and backpressure) From 697a548fc8823c3574d45aed9dd7421e1be19824 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Tue, 13 Aug 2024 14:14:40 -0400 Subject: [PATCH 054/105] clarify wording --- rfd/0169-auto-updates-linux-agents.md | 20 ++++++++++++++------ 1 file changed, 14 insertions(+), 6 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index d3e693d9a56a6..d57a24b786a55 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -13,7 +13,9 @@ state: draft ## What -This RFD proposes a new mechanism for Teleport agents to automatically update to a version scheduled by an operator via tctl. +This RFD proposes a new mechanism for scheduled, automatic updates of Teleport agents. + +Users of Teleport will be able to use the tctl CLI to specify desired versions and update schedules. All agent installations are in-scope for this proposal, including agents installed on Linux servers and Kubernetes. @@ -43,19 +45,25 @@ The existing mechanism for automatic agent updates does not provide a hands-off 10. Teleport contains logic that is specific to Teleport Cloud upgrade workflows. 11. The existing auto-updater is not self-updating. 12. It is difficult and undocumented to automate agent upgrades with custom automation (e.g., with JamF). +13. There is no phased rollout mechanism for updates. +14. There is no way to automatically detect and halt failed updates. -We must provide a seamless, hands-off experience for auto-updates of Teleport Agents that is easy to maintain. +We must provide a seamless, hands-off experience for auto-updates of Teleport Agents that is easy to maintain and safer for production use. ## Details - Teleport API Teleport proxies will be updated to serve the desired agent version and edition from `/v1/webapi/find`. -The version and edition served from that endpoint will be configured using new `cluster_autoupdate_config` and `autoupdate_version` resources. -Whether the Teleport updater querying the endpoint is instructed to upgrade (via `agent_autoupdate`) is dependent on the `host=[uuid]` parameter sent to `/v1/webapi/find`. +The version and edition served from that endpoint will be configured using new `autoupdate_version` resource. + +Whether the Teleport updater querying the endpoint is instructed to upgrade (via the `agent_autoupdate` field) is dependent on: +- The `host=[uuid]` parameter sent to `/v1/webapi/find` +- The schedule defined in the new `cluster_autoupdate_config` resource +- The status of past agent upgrades for the given version To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to `/v1/webapi/find`. Teleport auth servers use their access to heartbeat data to drive the rollout, while Teleport proxies modulate the `/v1/webapi/find` response given the host UUID. -Rollouts are specified as interdependent groups of hosts, selected by upgrade group identifier. +Rollouts are specified as interdependent groups of hosts, selected by upgrade group identifier specified in the agent's `teleport.yaml` file. 
``` teleport: upgrade_group: staging @@ -65,7 +73,7 @@ At the start of a group rollout, the Teleport auth server captures the desired g An fixed number of hosts (`max_in_flight % x total`) are instructed to upgrade at the same time via `/v1/webapi/find`. Additional hosts are instructed to update as earlier updates complete, timeout, or fail, never exceeding `max_in_flight`. The group rollout is halted if timeouts or failures exceed their specified thresholds. -Group rollouts may be retried with `tctl autoupdate run`. +Rollouts may be retried with `tctl autoupdate run`. ### Scalability From ed8e7ed7266a30bd4def82fd244823f5171e2788 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Tue, 13 Aug 2024 14:24:35 -0400 Subject: [PATCH 055/105] wording --- rfd/0169-auto-updates-linux-agents.md | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index d57a24b786a55..a79e7962a56da 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -82,7 +82,7 @@ Rollouts may be retried with `tctl autoupdate run`. Instance heartbeats will be cached by auth servers using a dedicated cache. This cache is updated using rate-limited backend reads that occur in the background, to avoid mass-reads of instance heartbeats. The rate is modulated by the total number of instance heartbeats, to avoid putting too much load on the backend on large clusters. -The cache is considered healthy when all instance heartbeats present on the backend have been read in a time period that is also modulated by the total number of heartbeats. +The cache is considered healthy when all instance heartbeats present on the backend have been read within a time period that is also modulated by the total number of heartbeats. At the start of the upgrade window, auth servers attempt to write an update rollout plan to the backend under a single key. This plan is protected by optimistic locking, and contains the following data: @@ -95,7 +95,8 @@ Data value JSON: - `schedule`: type of schedule that triggered the rollout - `hosts`: list of host UUIDs in randomized order - `next_page`: additional UUIDs, if list is greater than 100,000 UUIDs -- `expiry`: 2 weeks + +Expiration time of each key is 2 weeks. At a fixed interval, auth servers will check the plan to determine if a new plan is needed by comparing `start_time` to the current time and the desired window. If a new plan is needed, auth servers will query their cache of instance heartbeats and attempt to write the new plan. @@ -103,6 +104,8 @@ The first auth server to write the plan wins; others will be rejected by the opt Auth servers will only write the plan if their instance heartbeat cache is healthy. If the resource size is greater than 100 KiB, auth servers will divide the resource into pages no greater than 100 KiB each. +This is necessary to support backends with a value size limit. + Each page will duplicate all values besides `hosts`, which will be different for each page. All pages besides the first page will be suffixed with a randomly generated number. Pages will be written in reverse order, in a linked-link, before the final atomic non-suffixed write of the first page. 
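This write ordering matters: pages are reachable only via the head key, so a crash mid-write leaves at most unreferenced pages (reclaimed by expiry) rather than a partially visible plan. Below is a minimal sketch of the ordering, assuming a simple `putter` interface and JSON-encoded pages; both are illustrative, not the actual backend API:

```golang
package autoupdate

import (
	"context"
	"fmt"
	"math/rand"
)

// page mirrors the documented plan keys: hosts plus an optional next_page link.
type page struct {
	Hosts    []string `json:"hosts"`
	NextPage string   `json:"next_page,omitempty"`
}

// putter is an assumed single-method slice of the backend API.
type putter interface {
	Put(ctx context.Context, key string, p page) error
}

// writePlanPages writes trailing pages first, each linking to its successor,
// and finishes with one atomic write of the unsuffixed head key, which is
// what makes the whole plan visible to readers.
func writePlanPages(ctx context.Context, b putter, planKey string, pages []page) error {
	if len(pages) == 0 {
		return nil
	}
	next := "" // the last page has no successor
	for i := len(pages) - 1; i >= 1; i-- {
		key := fmt.Sprintf("%s-%d", planKey, rand.Int63()) // randomly suffixed page key
		pages[i].NextPage = next
		if err := b.Put(ctx, key, pages[i]); err != nil {
			return err
		}
		next = key
	}
	pages[0].NextPage = next
	return b.Put(ctx, planKey, pages[0]) // head write commits the plan
}
```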
@@ -130,7 +133,8 @@ The rollout logic is progressed by instance heartbeat backend writes, as changes The following data related to the rollout are stored in each instance heartbeat: - `agent_upgrade_start_time`: timestamp of individual agent's upgrade time - `agent_upgrade_version`: current agent version -- `expiry`: expiration time of the heartbeat (extended to 24 hours at `agent_upgrade_start_time`) + +Expiration time of the heartbeat is extended to 24 hours when `agent_upgrade_start_time` is written. Additionally, an in-memory data structure is maintained based on the cache, and kept up-to-date by a background process. This data structure contains the number of unfinished (pending and ongoing) upgrades preceding each instance heartbeat in the rollout plan. @@ -184,7 +188,7 @@ Upgrading all agents generates the following additional backend write load: Notes: - Agents will only upgrade if `agent_autoupdate` is `true`, but new installations will use `agent_version` regardless of the value in `agent_autoupdate`. - The edition served is the cluster edition (enterprise, enterprise-fips, or oss), and cannot be configured. -- The host UUID is ready from `/var/lib/teleport` by the updater. +- The host UUID is read from `/var/lib/teleport/host_uuid` by the updater. ### Teleport Resources @@ -289,7 +293,7 @@ spec: # ... ``` -To allow `agent_default_schedules` and `agent_group_schedules` to co-exist, a reserved `default` `agent_rollout_plan` will be created. +To allow `agent_default_schedules` and `agent_group_schedules` to co-exist, a reserved `agent_rollout_plan` named `default` will be employed. ```shell # configuration From e85904d25740d7b6b1902c41cff568a0f31d9101 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Tue, 13 Aug 2024 17:30:39 -0400 Subject: [PATCH 056/105] Update rfd/0169-auto-updates-linux-agents.md Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com> --- rfd/0169-auto-updates-linux-agents.md | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index a79e7962a56da..16072331ff563 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -838,14 +838,15 @@ message AgentAutoupdateGroup { // Day of the week enum Day { - ALL = 0; - SUNDAY = 1; - MONDAY = 2; - TUESDAY = 3; - WEDNESDAY = 4; - THURSDAY = 5; - FRIDAY = 6; - SATURDAY = 7; + DAY_UNSPECIFIED = 0; + DAY_ALL = 1; + DAY_SUNDAY = 2; + DAY_MONDAY = 3; + DAY_TUESDAY = 4; + DAY_WEDNESDAY = 5; + DAY_THURSDAY = 6; + DAY_FRIDAY = 7; + DAY_SATURDAY = 8; } ``` From 139fcbba2859f52caf93ffff78352e57231ef762 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Tue, 13 Aug 2024 17:30:52 -0400 Subject: [PATCH 057/105] Update rfd/0169-auto-updates-linux-agents.md Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com> --- rfd/0169-auto-updates-linux-agents.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 16072331ff563..6156b3e21fbd9 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -921,12 +921,13 @@ message AutoupdateVersionSpec { // Schedule type for the rollout enum Schedule { + Schedule_UNSPECIFIED = 0; // REGULAR update schedule - REGULAR = 0; + Schedule_REGULAR = 1; // CRITICAL update schedule for critical bugs and vulnerabilities - CRITICAL = 1; + Schedule_CRITICAL = 2; // IMMEDIATE update 
schedule for updating all agents immediately - IMMEDIATE = 2; + Schedule_IMMEDIATE = 3; } // GetAgentRolloutPlanRequest requests an agent_rollout_plan. From acb7b3df0ba3b922f28bd4dc63aac90ad47227d9 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Tue, 13 Aug 2024 17:37:59 -0400 Subject: [PATCH 058/105] linting --- rfd/0169-auto-updates-linux-agents.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 6156b3e21fbd9..4299ace12af7a 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -921,13 +921,14 @@ message AutoupdateVersionSpec { // Schedule type for the rollout enum Schedule { - Schedule_UNSPECIFIED = 0; + // UNSPECIFIED update schedule + SCHEDULE_UNSPECIFIED = 0; // REGULAR update schedule - Schedule_REGULAR = 1; + SCHEDULE_REGULAR = 1; // CRITICAL update schedule for critical bugs and vulnerabilities - Schedule_CRITICAL = 2; + SCHEDULE_CRITICAL = 2; // IMMEDIATE update schedule for updating all agents immediately - Schedule_IMMEDIATE = 3; + SCHEDULE_IMMEDIATE = 3; } // GetAgentRolloutPlanRequest requests an agent_rollout_plan. From 2343138a47088ca976d4b72b9a3b7f97495fda1c Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 21 Aug 2024 22:04:31 -0400 Subject: [PATCH 059/105] Move all RPCs into autoupdate/v1 --- rfd/0169-auto-updates-linux-agents.md | 121 +++++++++++--------------- 1 file changed, 53 insertions(+), 68 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 4299ace12af7a..3dfe6a35a7c1d 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -57,7 +57,7 @@ The version and edition served from that endpoint will be configured using new ` Whether the Teleport updater querying the endpoint is instructed to upgrade (via the `agent_autoupdate` field) is dependent on: - The `host=[uuid]` parameter sent to `/v1/webapi/find` -- The schedule defined in the new `cluster_autoupdate_config` resource +- The schedule defined in the new `autoupdate_config` resource - The status of past agent upgrades for the given version To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to `/v1/webapi/find`. @@ -195,7 +195,7 @@ Notes: #### Scheduling ```yaml -kind: cluster_autoupdate_config +kind: autoupdate_config spec: # agent_autoupdate allows turning agent updates on or off at the # cluster level. Only turn agent automatic updates off if self-managed @@ -260,7 +260,7 @@ Note the MVP version of this resource will not support host UUIDs, groups, or ba This field will remain indefinitely to cover connected agents that are not matched to a group. ```yaml -kind: cluster_autoupdate_config +kind: autoupdate_config spec: # agent_autoupdate allows turning agent updates on or off at the # cluster level. Only turn agent automatic updates off if self-managed @@ -354,7 +354,7 @@ Automatic updates configuration has been updated. ``` Notes: -- `autoupdate_version` is separate from `cluster_autoupdate_config` so that Cloud customers can be restricted from updating `autoupdate_version`, while maintaining control over the rollout. +- `autoupdate_version` is separate from `autoupdate_config` so that Cloud customers can be restricted from updating `autoupdate_version`, while maintaining control over the rollout. 
#### Rollout @@ -707,54 +707,66 @@ When TUF is added, that events related to supply chain security may be sent to t Note: all updates use revisions to prevent data loss in case of concurrent access. -### clusterconfig/v1 +### autoupdate/v1 ```protobuf syntax = "proto3"; -package teleport.clusterconfig.v1; +package teleport.autoupdate.v1; + +option go_package = "github.com/gravitational/teleport/api/gen/proto/go/teleport/autoupdate/v1;autoupdatev1"; -option go_package = "github.com/gravitational/teleport/api/gen/proto/go/teleport/clusterconfig/v1;clusterconfigv1"; +// AutoupdateService serves agent and client automatic version updates. +service AutoupdateService { + // GetAutoupdateConfig updates the autoupdate config. + rpc GetAutoupdateConfig(GetAutoupdateConfigRequest) returns (AutoupdateConfig); + // CreateAutoupdateConfig creates the autoupdate config. + rpc CreateAutoupdateConfig(CreateAutoupdateConfigRequest) returns (AutoupdateConfig); + // UpdateAutoupdateConfig updates the autoupdate config. + rpc UpdateAutoupdateConfig(UpdateAutoupdateConfigRequest) returns (AutoupdateConfig); + // UpsertAutoupdateConfig overwrites the autoupdate config. + rpc UpsertAutoupdateConfig(UpsertAutoupdateConfigRequest) returns (AutoupdateConfig); + // ResetAutoupdateConfig restores the autoupdate config to default values. + rpc ResetAutoupdateConfig(ResetAutoupdateConfigRequest) returns (AutoupdateConfig); -// ClusterConfigService provides methods to manage cluster configuration resources. -service ClusterConfigService { - // ... + // GetAutoupdateVersion returns the autoupdate version. + rpc GetAutoupdateVersion(GetAutoupdateVersionRequest) returns (AutoupdateVersion); + // CreateAutoupdateVersion creates the autoupdate version. + rpc CreateAutoupdateVersion(CreateAutoupdateVersionRequest) returns (AutoupdateVersion); + // UpdateAutoupdateVersion updates the autoupdate version. + rpc UpdateAutoupdateVersion(UpdateAutoupdateVersionRequest) returns (AutoupdateVersion); + // UpsertAutoupdateVersion overwrites the autoupdate version. + rpc UpsertAutoupdateVersion(UpsertAutoupdateVersionRequest) returns (AutoupdateVersion); - // GetClusterAutoupdateConfig updates the cluster autoupdate config. - rpc GetClusterAutoupdateConfig(GetClusterAutoupdateConfigRequest) returns (ClusterAutoupdateConfig); - // CreateClusterAutoupdateConfig creates the cluster autoupdate config. - rpc CreateClusterAutoupdateConfig(CreateClusterAutoupdateConfigRequest) returns (ClusterAutoupdateConfig); - // UpdateClusterAutoupdateConfig updates the cluster autoupdate config. - rpc UpdateClusterAutoupdateConfig(UpdateClusterAutoupdateConfigRequest) returns (ClusterAutoupdateConfig); - // UpsertClusterAutoupdateConfig overwrites the cluster autoupdate config. - rpc UpsertClusterAutoupdateConfig(UpsertClusterAutoupdateConfigRequest) returns (ClusterAutoupdateConfig); - // ResetClusterAutoupdateConfig restores the cluster autoupdate config to default values. - rpc ResetClusterAutoupdateConfig(ResetClusterAutoupdateConfigRequest) returns (ClusterAutoupdateConfig); + // GetAgentRolloutPlan returns the agent rollout plan and current progress. + rpc GetAgentRolloutPlan(GetAgentRolloutPlanRequest) returns (AgentRolloutPlan); + // GetAutoupdateVersion streams the agent rollout plan's list of all hosts. + rpc GetAgentRolloutPlanHosts(GetAgentRolloutPlanHostsRequest) returns (stream AgentRolloutPlanHost); } -// GetClusterAutoupdateConfigRequest requests the contents of the ClusterAutoupdateConfig. 
-message GetClusterAutoupdateConfigRequest {} +// GetAutoupdateConfigRequest requests the contents of the AutoupdateConfig. +message GetAutoupdateConfigRequest {} -// CreateClusterAutoupdateConfigRequest requests creation of the the ClusterAutoupdateConfig. -message CreateClusterAutoupdateConfigRequest { - ClusterAutoupdateConfig cluster_autoupdate_config = 1; +// CreateAutoupdateConfigRequest requests creation of the the AutoupdateConfig. +message CreateAutoupdateConfigRequest { + AutoupdateConfig autoupdate_config = 1; } -// UpdateClusterAutoupdateConfigRequest requests an update of the the ClusterAutoupdateConfig. -message UpdateClusterAutoupdateConfigRequest { - ClusterAutoupdateConfig cluster_autoupdate_config = 1; +// UpdateAutoupdateConfigRequest requests an update of the the AutoupdateConfig. +message UpdateAutoupdateConfigRequest { + AutoupdateConfig autoupdate_config = 1; } -// UpsertClusterAutoupdateConfigRequest requests an upsert of the the ClusterAutoupdateConfig. -message UpsertClusterAutoupdateConfigRequest { - ClusterAutoupdateConfig cluster_autoupdate_config = 1; +// UpsertAutoupdateConfigRequest requests an upsert of the the AutoupdateConfig. +message UpsertAutoupdateConfigRequest { + AutoupdateConfig autoupdate_config = 1; } -// ResetClusterAutoupdateConfigRequest requests a reset of the the ClusterAutoupdateConfig to default values. -message ResetClusterAutoupdateConfigRequest {} +// ResetAutoupdateConfigRequest requests a reset of the the AutoupdateConfig to default values. +message ResetAutoupdateConfigRequest {} -// ClusterAutoupdateConfig holds dynamic configuration settings for cluster maintenance activities. -message ClusterAutoupdateConfig { +// AutoupdateConfig holds dynamic configuration settings for automatic updates. +message AutoupdateConfig { // kind is the kind of the resource. string kind = 1; // sub_kind is the sub kind of the resource. @@ -764,11 +776,11 @@ message ClusterAutoupdateConfig { // metadata is the metadata of the resource. teleport.header.v1.Metadata metadata = 4; // spec is the spec of the resource. - ClusterAutoupdateConfigSpec spec = 7; + AutoupdateConfigSpec spec = 7; } -// ClusterAutoupdateConfigSpec is the spec for the cluster autoupdate config. -message ClusterAutoupdateConfigSpec { +// AutoupdateConfigSpec is the spec for the autoupdate config. +message AutoupdateConfigSpec { // agent_autoupdate specifies whether agent autoupdates are enabled. bool agent_autoupdate = 1; // agent_default_schedules specifies schedules for upgrades of agents. @@ -848,33 +860,6 @@ enum Day { DAY_FRIDAY = 7; DAY_SATURDAY = 8; } -``` - -### autoupdate/v1 - -```protobuf -syntax = "proto3"; - -package teleport.autoupdate.v1; - -option go_package = "github.com/gravitational/teleport/api/gen/proto/go/teleport/autoupdate/v1;autoupdatev1"; - -// AutoupdateService serves agent and client automatic version updates. -service AutoupdateService { - // GetAutoupdateVersion returns the autoupdate version. - rpc GetAutoupdateVersion(GetAutoupdateVersionRequest) returns (AutoupdateVersion); - // CreateAutoupdateVersion creates the autoupdate version. - rpc CreateAutoupdateVersion(CreateAutoupdateVersionRequest) returns (AutoupdateVersion); - // UpdateAutoupdateVersion updates the autoupdate version. - rpc UpdateAutoupdateVersion(UpdateAutoupdateVersionRequest) returns (AutoupdateVersion); - // UpsertAutoupdateVersion overwrites the autoupdate version. 
- rpc UpsertAutoupdateVersion(UpsertAutoupdateVersionRequest) returns (AutoupdateVersion); - - // GetAgentRolloutPlan returns the agent rollout plan and current progress. - rpc GetAgentRolloutPlan(GetAgentRolloutPlanRequest) returns (AgentRolloutPlan); - // GetAutoupdateVersion streams the agent rollout plan's list of all hosts. - rpc GetAgentRolloutPlanHosts(GetAgentRolloutPlanHostsRequest) returns (stream AgentRolloutPlanHost); -} // GetAutoupdateVersionRequest requests the autoupdate_version singleton resource. message GetAutoupdateVersionRequest {} @@ -908,7 +893,7 @@ message AutoupdateVersion { // metadata is the metadata of the resource. teleport.header.v1.Metadata metadata = 4; // spec is the spec of the resource. - AutoupdateVersionSpec spec = 6; + AutoupdateVersionSpec spec = 5; } // AutoupdateVersionSpec is the spec for the autoupdate version. @@ -959,7 +944,7 @@ message AgentRolloutPlan { AgentRolloutPlanStatus status = 6; } -// AutoupdateVersionSpec is the spec for the autoupdate version. +// AutoupdateVersionSpec is the spec for the AgentRolloutPlan. message AgentRolloutPlanSpec { // start_time of the rollout google.protobuf.Timestamp start_time = 1; @@ -971,7 +956,7 @@ message AgentRolloutPlanSpec { int64 host_count = 4; } -// AutoupdateVersionSpec is the spec for the autoupdate version. +// AutoupdateVersionStatus is the status for the AgentRolloutPlan. message AgentRolloutPlanStatus { // last_active_host_index specifies the index of the last host that may be updated. int64 last_active_host_index = 1; From 8e6bc8e0e8154dab442a4f820764eeeea1a08421 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 26 Aug 2024 15:33:49 -0400 Subject: [PATCH 060/105] Move groups to MVP --- rfd/0169-auto-updates-linux-agents.md | 125 +++++++------------------- 1 file changed, 31 insertions(+), 94 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 3dfe6a35a7c1d..70a761bbd0cb2 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -57,16 +57,16 @@ The version and edition served from that endpoint will be configured using new ` Whether the Teleport updater querying the endpoint is instructed to upgrade (via the `agent_autoupdate` field) is dependent on: - The `host=[uuid]` parameter sent to `/v1/webapi/find` +- The `group=[name]` parameter sent to `/v1/webapi/find` - The schedule defined in the new `autoupdate_config` resource - The status of past agent upgrades for the given version To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to `/v1/webapi/find`. -Teleport auth servers use their access to heartbeat data to drive the rollout, while Teleport proxies modulate the `/v1/webapi/find` response given the host UUID. +Teleport auth servers use their access to heartbeat data to drive the rollout, while Teleport proxies modulate the `/v1/webapi/find` response given the host UUID and group name. -Rollouts are specified as interdependent groups of hosts, selected by upgrade group identifier specified in the agent's `teleport.yaml` file. 
-``` -teleport: - upgrade_group: staging +Rollouts are specified as interdependent groups of hosts, selected by upgrade group identifier specified in the agent's `/var/lib/teleport/versions/updates.yaml` file, which is written via `teleport-updater enable`: +```shell +$ teleport-updater enable --proxy teleport.example.com --group staging ``` At the start of a group rollout, the Teleport auth server captures the desired group of hosts to update in the backend. @@ -176,7 +176,7 @@ Upgrading all agents generates the following additional backend write load: ### REST Endpoints -`/v1/webapi/find?host=[host_uuid]` +`/v1/webapi/find?host=[host_uuid]&group=[name]` ```json { "server_edition": "enterprise", @@ -189,6 +189,7 @@ Notes: - Agents will only upgrade if `agent_autoupdate` is `true`, but new installations will use `agent_version` regardless of the value in `agent_autoupdate`. - The edition served is the cluster edition (enterprise, enterprise-fips, or oss), and cannot be configured. - The host UUID is read from `/var/lib/teleport/host_uuid` by the updater. +- The group name is read from `/var/lib/teleport/versions/updates.yaml` by the updater. ### Teleport Resources @@ -202,11 +203,11 @@ spec: # agent updates are in place. agent_autoupdate: true|false - # agent_group_schedules contains both "regular" and "critical" schedules. + # agent_schedules contains both "regular" and "critical" schedules. # The schedule used is determined by the agent_version_schedule associated # with the version in autoupdate_version. # Groups are not configurable with the "immediate" schedule. - agent_group_schedules: + agent_schedules: # schedule is "regular" or "critical" regular: # name of the group @@ -256,44 +257,7 @@ After 24 hours, the upgrade is halted in-place, and the group is considered fail Changing the version or schedule completely resets progress. Releasing new client versions multiple times a week has the potential to starve dependent groups from updates. -Note the MVP version of this resource will not support host UUIDs, groups, or backpressure, and will use the following simplified UX with `agent_default_schedules` field. -This field will remain indefinitely to cover connected agents that are not matched to a group. - -```yaml -kind: autoupdate_config -spec: - # agent_autoupdate allows turning agent updates on or off at the - # cluster level. Only turn agent automatic updates off if self-managed - # agent updates are in place. - agent_autoupdate: true|false - - # agent_default_schedules contains "regular," "critical," and "immediate" schedules. - # These schedules apply to agents not scheduled by agent_group_schedules. - # The schedule used is determined by the agent_version_schedule associated - # with the agent_version in the autoupdate_version resource. - agent_default_schedules: - # The immediate schedule results in all agents updating simultaneously. - # Only client-side jitter is configurable. - immediate: - # jitter_seconds specifies a maximum jitter duration after the start hour. - # The agent upgrader client will pick a random time within this duration to wait to upgrade. - # default: 0 - jitter_seconds: 0-60 - regular: # or "critical" - # days specifies the days of the week when the group may be upgraded. - # default: ["*"] (all days) - days: [“Sun”, “Mon”, ... | "*"] - # start_hour specifies the hour when the group may start upgrading. - # default: 0 - start_hour: 0-23 - # jitter_seconds specifies a maximum jitter duration after the start hour. 
- # The agent upgrader client will pick a random time within this duration to wait to upgrade. - # default: 0 - jitter_seconds: 0-60 - # ... -``` - -To allow `agent_default_schedules` and `agent_group_schedules` to co-exist, a reserved `agent_rollout_plan` named `default` will be employed. +Note that the `default` schedule applies to agents that do not specify a group name. ```shell # configuration @@ -383,17 +347,13 @@ Notes: ### Version Promotion -Maintaining the version of different groups of agents is out-of-scope for this RFD. +This RFD only proposed a mechanism to signal when agent auto-updates should occur. +Advertising different target Teleport versions for different groups of agents is out-of-scope for this RFD. This means that groups which employ auto-scaling or ephemeral resources will slowly converge to the latest Teleport version. **This could lead to a production outage, as the latest Teleport version may not receive any validation before it is advertised to newly provisioned resources in production.** -To solve this in the future, we can add an additional `--group` flag to `teleport-update`: -```shell -$ teleport-update enable --proxy example.teleport.sh --group staging-group -``` - -This group name could be provided as a parameter to `/v1/webapi/find`, so that newly added resources may install at the group's designated version. +To solve this in the future, we can use the group name (provided to `/v1/webapi/find` and specified via `teleport-updater enable`) to determine which version should be served. This will require tracking the desired version of groups in the backend, which will add additional complexity to the rollout logic. @@ -416,6 +376,11 @@ $ teleport-update enable --proxy example.teleport.sh $ systemctl enable teleport ``` +For grouped upgrades, a group identifier may be configured: +```shell +$ teleport-update enable --proxy example.teleport.sh --group staging +``` + For air-gapped Teleport installs, the agent may be configured with a custom tarball path template: ```shell $ teleport-update enable --proxy example.teleport.sh --template 'https://example.com/teleport-{{ .Edition }}-{{ .Version }}-{{ .Arch }}.tgz' @@ -470,6 +435,8 @@ kind: agent_versions spec: # proxy specifies the Teleport proxy address to retrieve the agent version and update configuration from. proxy: mytenant.teleport.sh + # group specifies the update group + group: staging # enabled specifies whether auto-updates are enabled, i.e., whether teleport-updater update is allowed to update the agent. enabled: true # active_version specifies the active (symlinked) deployment of the telepport agent. @@ -499,7 +466,7 @@ $ teleport-updater update After it is installed, the `update` subcommand will no-op when executed until configured with the `teleport-updater` command: ```shell -$ teleport-updater enable --proxy mytenant.teleport.sh +$ teleport-updater enable --proxy mytenant.teleport.sh --group staging ``` If the proxy address is not provided with `--proxy`, the current proxy address from `teleport.yaml` is used. @@ -525,7 +492,7 @@ The `enable` subcommand will: 13. Remove and purge any `teleport` package if installed. 14. Verify the symlinks to the active version still exists. 15. Remove all stored versions of the agent except the current version and last working version. -16. Configure `updates.yaml` with the current proxy address and set `enabled` to true. +16. Configure `updates.yaml` with the current proxy address and group, and set `enabled` to true. The `disable` subcommand will: 1. 
Configure `updates.yaml` to set `enabled` to false. @@ -783,41 +750,12 @@ message AutoupdateConfig { message AutoupdateConfigSpec { // agent_autoupdate specifies whether agent autoupdates are enabled. bool agent_autoupdate = 1; - // agent_default_schedules specifies schedules for upgrades of agents. - // not scheduled by agent_group_schedules. - AgentAutoupdateDefaultSchedules agent_default_schedules = 2; - // agent_group_schedules specifies schedules for upgrades of grouped agents. - AgentAutoupdateGroupSchedules agent_group_schedules = 3; -} - -// AgentAutoupdateDefaultSchedules specifies the default update schedules for non-grouped agent. -message AgentAutoupdateDefaultSchedules { - // regular schedule for non-critical versions. - AgentAutoupdateSchedule regular = 1; - // critical schedule for urgently needed versions. - AgentAutoupdateSchedule critical = 2; - // immediate schedule for versions that must be deployed with no delay. - AgentAutoupdateImmediateSchedule immediate = 3; -} - -// AgentAutoupdateSchedule specifies a default schedule for non-grouped agents. -message AgentAutoupdateSchedule { - // days to run update - repeated Day days = 2; - // start_hour to initiate update - int32 start_hour = 3; - // jitter_seconds to introduce before update as rand([0, jitter_seconds]). - int32 jitter_seconds = 4; -} - -// AgentAutoupdateSchedule specifies a default schedule for non-grouped agents on the immediate scehdule. -message AgentAutoupdateImmediateSchedule { - // jitter to introduce before update as rand([0, jitter_seconds]). - int32 jitter_seconds = 4; + // agent_schedules specifies schedules for upgrades of grouped agents. + AgentAutoupdateSchedules agent_schedules = 3; } -// AgentAutoupdateGroupSchedules specifies update scheduled for grouped agents. -message AgentAutoupdateGroupSchedules { +// AgentAutoupdateSchedules specifies update scheduled for grouped agents. +message AgentAutoupdateSchedules { // regular schedules for non-critical versions. repeated AgentAutoupdateGroup regular = 1; // critical schedules for urgently needed versions. @@ -975,14 +913,13 @@ message AgentRolloutPlanHost { ## Execution Plan -1. Implement Teleport APIs for new scheduling system (without groups and backpressure) -2. Implement new auto-updater in Go. +1. Implement Teleport APIs for new scheduling system (without backpressure) +2. Implement new Linux server auto-updater in Go. 3. Implement changes to Kubernetes auto-updater. 4. Test extensively on all supported Linux distributions. 5. Prep documentation changes. 6. Release new updater via teleport-ent-updater package. 7. Release documentation changes. -8. Communicate to select Cloud customers that they must update their updater, starting with lower ARR customers. -9. Communicate to all Cloud customers that they must update their updater. -10. Deprecate old auto-updater endpoints. -11. Add groups and backpressure features. +8. Communicate to users that they should update their updater. +9. Deprecate old auto-updater endpoints. +10. Add groups and backpressure features. 
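To illustrate the updater-side half of this contract, the sketch below queries `/v1/webapi/find` with the host UUID and group name and acts on `agent_autoupdate`. The JSON field names follow this RFD; the URL construction, types, and error handling are assumptions for illustration only:

```golang
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// findResponse mirrors the subset of /v1/webapi/find fields used here.
type findResponse struct {
	ServerEdition            string `json:"server_edition"`
	AgentVersion             string `json:"agent_version"`
	AgentAutoupdate          bool   `json:"agent_autoupdate"`
	AgentUpdateJitterSeconds int    `json:"agent_update_jitter_seconds"`
}

// checkForUpdate asks the proxy whether this host, in this group, should update.
func checkForUpdate(proxyAddr, hostUUID, group string) (*findResponse, error) {
	q := url.Values{"host": {hostUUID}, "group": {group}}
	resp, err := http.Get(fmt.Sprintf("https://%s/v1/webapi/find?%s", proxyAddr, q.Encode()))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var fr findResponse
	if err := json.NewDecoder(resp.Body).Decode(&fr); err != nil {
		return nil, err
	}
	return &fr, nil
}

func main() {
	fr, err := checkForUpdate("example.teleport.sh", "0000-example-host-uuid", "staging")
	if err != nil {
		panic(err)
	}
	if fr.AgentAutoupdate {
		fmt.Printf("update to %s (%s) permitted\n", fr.AgentVersion, fr.ServerEdition)
	}
}
```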
From bbdfc253596cfa7bc7284eeca1bfdf6037aacf39 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 26 Aug 2024 17:51:24 -0400 Subject: [PATCH 061/105] note about checksum --- rfd/0169-auto-updates-linux-agents.md | 1 + 1 file changed, 1 insertion(+) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 70a761bbd0cb2..3e3142b1755b9 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -385,6 +385,7 @@ For air-gapped Teleport installs, the agent may be configured with a custom tarb ```shell $ teleport-update enable --proxy example.teleport.sh --template 'https://example.com/teleport-{{ .Edition }}-{{ .Version }}-{{ .Arch }}.tgz' ``` +(Checksum will use template path + `.sha256`) ### Filesystem From 5974b0397ea73c1a4502717b8818c88c51435a4e Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 26 Aug 2024 20:01:19 -0400 Subject: [PATCH 062/105] typos, consistency --- rfd/0169-auto-updates-linux-agents.md | 74 +++++++++++++-------------- 1 file changed, 37 insertions(+), 37 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 3e3142b1755b9..b26efad93c702 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -131,24 +131,24 @@ Losing auth: The rollout logic is progressed by instance heartbeat backend writes, as changes can only occur on these events. The following data related to the rollout are stored in each instance heartbeat: -- `agent_upgrade_start_time`: timestamp of individual agent's upgrade time -- `agent_upgrade_version`: current agent version +- `agent_update_start_time`: timestamp of individual agent's upgrade time +- `agent_update_version`: current agent version -Expiration time of the heartbeat is extended to 24 hours when `agent_upgrade_start_time` is written. +Expiration time of the heartbeat is extended to 24 hours when `agent_update_start_time` is written. Additionally, an in-memory data structure is maintained based on the cache, and kept up-to-date by a background process. This data structure contains the number of unfinished (pending and ongoing) upgrades preceding each instance heartbeat in the rollout plan. -Instance heartbeats are considered completed when either `agent_upgrade_version` matches the plan version, or `agent_upgrade_start_time` is past the expiration time. +Instance heartbeats are considered completed when either `agent_update_version` matches the plan version, or `agent_update_start_time` is past the expiration time. ```golang unfinished := make(map[Rollout][UUID]int) ``` On each instance heartbeat write, the auth server looks at the data structure to determine if the associated agent should begin upgrading. This determination is made by comparing the stored number of unfinished upgrades to `max_in_flight % x len(hosts)`. -If the stored number is fewer, `agent_upgrade_start_time` is updated to the current time when the heartbeat is written. +If the stored number is fewer, `agent_update_start_time` is updated to the current time when the heartbeat is written. 
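A sketch of this admission check follows. The types and the precomputed `unfinishedAhead` count are illustrative assumptions; the real implementation would derive them from the cached instance heartbeats and the rollout plan:

```golang
package autoupdate

import "time"

// rolloutPlan carries the fields needed for the in-flight ceiling.
type rolloutPlan struct {
	HostCount   int     // total hosts captured in the plan
	MaxInFlight float64 // e.g. 0.2 for a max_in_flight of 20%
}

// heartbeat carries the per-instance rollout fields described above.
type heartbeat struct {
	AgentUpdateStartTime time.Time
	AgentUpdateVersion   string
}

// maybeStartUpdate marks an agent for update only if the number of unfinished
// updates ahead of it in the plan is below the max_in_flight ceiling.
func maybeStartUpdate(p rolloutPlan, unfinishedAhead int, hb *heartbeat, now time.Time) bool {
	ceiling := int(p.MaxInFlight * float64(p.HostCount))
	if unfinishedAhead >= ceiling {
		return false // too many updates pending or in progress
	}
	hb.AgentUpdateStartTime = now // persisted with the heartbeat write
	return true
}
```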
The auth server writes the following keys to `/autoupdate/[name of group]/status` (e.g., `/autoupdate/staging/status`): -- `last_active_host_index`: index of the last host allowed to upgrade +- `last_active_host_index`: index of the last host allowed to update - `failed_host_count`: failed host count - `timeout_host_count`: timed-out host count @@ -168,10 +168,10 @@ upgrading := make(map[UUID]bool) ``` Proxies watch for changes to the plan and update the map accordingly. -When the updater queries the proxy via `/v1/webapi/find?host=[host_uuid]`, the proxies query the map to determine the value of `agent_auto_upgrade: true`. +When the updater queries the proxy via `/v1/webapi/find?host=[host_uuid]`, the proxies query the map to determine the value of `agent_autoupdate: true`. -Upgrading all agents generates the following additional backend write load: -- One write per page of the rollout plan per upgrade group. +Updating all agents generates the following additional backend write load: +- One write per page of the rollout plan per update group. - One write per auth server every 10 seconds, during rollouts. ### REST Endpoints @@ -186,7 +186,7 @@ Upgrading all agents generates the following additional backend write load: } ``` Notes: -- Agents will only upgrade if `agent_autoupdate` is `true`, but new installations will use `agent_version` regardless of the value in `agent_autoupdate`. +- Agents will only update if `agent_autoupdate` is `true`, but new installations will use `agent_version` regardless of the value in `agent_autoupdate`. - The edition served is the cluster edition (enterprise, enterprise-fips, or oss), and cannot be configured. - The host UUID is read from `/var/lib/teleport/host_uuid` by the updater. - The group name is read from `/var/lib/teleport/versions/updates.yaml` by the updater. @@ -212,25 +212,25 @@ spec: regular: # name of the group - name: staging-group - # days specifies the days of the week when the group may be upgraded. + # days specifies the days of the week when the group may be updated. # default: ["*"] (all days) days: [“Sun”, “Mon”, ... | "*"] # start_hour specifies the hour when the group may start upgrading. # default: 0 start_hour: 0-23 # jitter_seconds specifies a maximum jitter duration after the start hour. - # The agent upgrader client will pick a random time within this duration to wait to upgrade. + # The agent updater client will pick a random time within this duration to wait to update. # default: 0 jitter_seconds: 0-60 # timeout_seconds specifies the amount of time, after the specified jitter, after which - # an agent upgrade will be considered timed out if the version does not change. + # an agent update will be considered timed out if the version does not change. # default: 60 timeout_seconds: 30-900 - # failure_seconds specifies the amount of time after which an agent upgrade will be considered - # failed if the agent heartbeat stops before the upgrade is complete. + # failure_seconds specifies the amount of time after which an agent update will be considered + # failed if the agent heartbeat stops before the update is complete. # default: 0 failure_seconds: 0-900 - # max_in_flight specifies the maximum number of agents that may be upgraded at the same time. + # max_in_flight specifies the maximum number of agents that may be updated at the same time. # default: 100% max_in_flight: 0-100% # max_timeout_before_halt specifies the percentage of clients that may time out before this group @@ -251,8 +251,8 @@ Dependency cycles are rejected. 
Dependency chains longer than a week will be rejected. Otherwise, updates could take up to 7 weeks to propagate. -The updater will receive `agent_autoupdate: true` from the time is it designated for upgrade until the version changes in `autoupdate_version`. -After 24 hours, the upgrade is halted in-place, and the group is considered failed if unfinished. +The updater will receive `agent_autoupdate: true` from the time is it designated for update until the version changes in `autoupdate_version`. +After 24 hours, the update is halted in-place, and the group is considered failed if unfinished. Changing the version or schedule completely resets progress. Releasing new client versions multiple times a week has the potential to starve dependent groups from updates. @@ -288,7 +288,7 @@ Status: succeeded Date: 2024-01-03 23:43:22 UTC Requires: (none) -Upgraded: 230 (95%) +Updated: 230 (95%) Unchanged: 10 (2%) Failed: 15 (3%) Timed-out: 0 @@ -361,7 +361,7 @@ This will require tracking the desired version of groups in the backend, which w We will ship a new auto-updater package for Linux servers written in Go that does not interface with the system package manager. It will be distributed as a separate package from Teleport, and manage the installation of the correct Teleport agent version manually. -It will read the unauthenticated `/v1/webapi/find` endpoint from the Teleport proxy, parse new fields on that endpoint, and install the specified agent version according to the specified upgrade plan. +It will read the unauthenticated `/v1/webapi/find` endpoint from the Teleport proxy, parse new fields on that endpoint, and install the specified agent version according to the specified update plan. It will download the correct version of Teleport as a tarball, unpack it in `/var/lib/teleport`, and ensure it is symlinked from `/usr/local/bin`. Source code for the updater will live in the main Teleport repository, with the updater binary built from `tools/teleport-update`. @@ -376,7 +376,7 @@ $ teleport-update enable --proxy example.teleport.sh $ systemctl enable teleport ``` -For grouped upgrades, a group identifier may be configured: +For grouped updates, a group identifier may be configured: ```shell $ teleport-update enable --proxy example.teleport.sh --group staging ``` @@ -481,7 +481,7 @@ The `enable` subcommand will: 1. Query the `/v1/webapi/find` endpoint. 2. If the current updater-managed version of Teleport is the latest, and teleport package is not installed, jump to (16). 3. If the current updater-managed version of Teleport is the latest, but the teleport package is installed, jump to (13). -4. Ensure there is enough free disk space to upgrade Teleport via `df .` and `content-length` header from `HEAD` request. +4. Ensure there is enough free disk space to update Teleport via `df .` and `content-length` header from `HEAD` request. 5. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. 6. Download and verify the checksum (tarball URL suffixed with `.sha256`). 7. Extract the tarball to `/var/lib/teleport/versions/VERSION` and write the SHA to `/var/lib/teleport/versions/VERSION/sha256`. @@ -504,7 +504,7 @@ When `update` subcommand is otherwise executed, it will: 3. Check that `agent_autoupdates` is true, quit otherwise. 4. If the current version of Teleport is the latest, quit. 5. Wait `random(0, agent_update_jitter_seconds)` seconds. -6. Ensure there is enough free disk space to upgrade Teleport via `df .` and `content-length` header from `HEAD` request. 
+6. Ensure there is enough free disk space to update Teleport via `df .` and `content-length` header from `HEAD` request.
7. Download the desired Teleport tarball specified by `agent_version` and `server_edition`.
8. Download and verify the checksum (tarball URL suffixed with `.sha256`).
9. Extract the tarball to `/var/lib/teleport/versions/VERSION` and write the SHA to `/var/lib/teleport/versions/VERSION/sha256`.
@@ -531,13 +531,13 @@ To ensure that backups are consistent, the updater will use the [SQLite backup A

If the new version of Teleport fails to start, the installation of Teleport is reverted as described above.

-If `teleport-updater` itself fails with an error, and an older version of `teleport-updater` is available, the upgrade will retry with the older version.
+If `teleport-updater` itself fails with an error, and an older version of `teleport-updater` is available, the update will retry with the older version.

-Known failure conditions caused by intentional configuration (e.g., upgrades disabled) will not trigger retry logic.
+Known failure conditions caused by intentional configuration (e.g., updates disabled) will not trigger retry logic.

#### Status

-To retrieve known information about agent upgrades, the `status` subcommand will return the following:
+To retrieve known information about agent updates, the `status` subcommand will return the following:
```json
{
  "agent_version_installed": "15.1.1",
@@ -567,8 +567,8 @@ When Teleport is downgraded to a previous version that has a backup of `sqlite.d

Downgrades are applied with `teleport-updater update`, just like upgrades.
The above steps modulate the standard workflow in the section above. If the downgraded version is already present, the uncompressed version is used to ensure fast recovery of the exact state before the failed upgrade.
-To ensure that the target version is was not corrupted by incomplete extraction, the downgrade checks for the existance of `/var/lib/teleport/versions/TARGET-VERSION/sha256` before downgrading.
-To ensure that the DB backup was not corrupted by incomplete copying, the downgrade checks for the existance of `/var/lib/teleport/versions/TARGET-VERSION/backup/backup.yaml` before restoring.
+To ensure that the target version was not corrupted by incomplete extraction, the downgrade checks for the existence of `/var/lib/teleport/versions/TARGET-VERSION/sha256` before downgrading.
+To ensure that the DB backup was not corrupted by incomplete copying, the downgrade checks for the existence of `/var/lib/teleport/versions/TARGET-VERSION/backup/backup.yaml` before restoring.

Teleport must be fully-stopped to safely replace `sqlite.db`.
When restarting the agent during an upgrade, `SIGHUP` is used.
@@ -584,7 +584,7 @@ Given that rollbacks may fail, we must maintain the following invariants:

When rolling forward, the backup of the newer version's `sqlite.db` is only restored if that exact version is the roll-forward version.
Otherwise, the older, rollback version of `sqlite.db` is preserved (i.e., the newer version's backup is not used).
-This ensures that a version upgrade which broke the database can be recovered with a rollback and a new patch.
+This ensures that a version update which broke the database can be recovered with a rollback and a new patch.
It also ensures that a broken rollback is always recoverable by reversing the rollback.
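A minimal sketch of this roll-forward decision follows; the `dbBackup` struct and helper are illustrative assumptions, not the updater's actual API. The worked example below walks through the same scenario in prose.

```go
package main

import "fmt"

// dbBackup records which agent version produced a backup of sqlite.db.
// This struct is a hypothetical stand-in for the metadata in backup.yaml.
type dbBackup struct {
	version string
}

// restoreOnRollForward reports whether a newer version's sqlite.db backup
// may be restored when rolling forward to target. Restoring is only safe
// when the backup was produced by the exact roll-forward version; in every
// other case the older, rollback-era database must be preserved.
func restoreOnRollForward(b dbBackup, target string) bool {
	return b.version == target
}

func main() {
	b := dbBackup{version: "16.0.2"}
	fmt.Println(restoreOnRollForward(b, "16.0.2")) // true: same version as the backup
	fmt.Println(restoreOnRollForward(b, "16.0.3")) // false: keep the rollback database
}
```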
Example: Given v1, v2, v3 versions of Teleport, where v2 is broken: @@ -609,7 +609,7 @@ The following install scripts will be updated to install the latest updater and Eventually, additional logic from the scripts could be added to `teleport-updater`, such that `teleport-updater` can configure teleport. -Moving additional logic into the upgrader is out-of-scope for this proposal. +Moving additional logic into the updater is out-of-scope for this proposal. To create pre-baked VM or container images that reduce the complexity of the cluster joining operation, two workflows are permitted: - Install the `teleport-updater` package and defer `teleport-updater enable`, Teleport configuration, and `systemctl enable teleport` to cloud-init scripts. @@ -628,7 +628,7 @@ Documentation should be created covering the above workflows. ### Documentation -The following documentation will need to be updated to cover the new upgrader workflow: +The following documentation will need to be updated to cover the new updater workflow: - https://goteleport.com/docs/choose-an-edition/teleport-cloud/downloads - https://goteleport.com/docs/installation - https://goteleport.com/docs/upgrading/self-hosted-linux @@ -640,7 +640,7 @@ Additionally, the Cloud dashboard tenants downloads tab will need to be updated The Kubernetes agent updater will be updated for compatibility with the new scheduling system. -This means that it will stop reading upgrade windows using the authenticated connection to the proxy, and instead upgrade when indicated by the `/v1/webapi/find` endpoint. +This means that it will stop reading update windows using the authenticated connection to the proxy, and instead update when indicated by the `/v1/webapi/find` endpoint. Rollbacks for the Kubernetes updater, as well as packaging changes to improve UX and compatibility, will be covered in a future RFD. @@ -659,10 +659,10 @@ administrators concerned with the authenticity of assets served from the download server can use self-managed updates with system package managers which are signed. -The Upgrade Framework (TUF) will be used to implement secure updates in the future. +The Update Framework (TUF) will be used to implement secure updates in the future. -Anyone who possesses a host UUID can determine when that host is scheduled to upgrade by repeatedly querying the public `/v1/webapi/find` endpoint. -It is not possible to discover the current version of that host, only the designated upgrade window. +Anyone who possesses a host UUID can determine when that host is scheduled to update by repeatedly querying the public `/v1/webapi/find` endpoint. +It is not possible to discover the current version of that host, only the designated update window. ## Logging @@ -751,7 +751,7 @@ message AutoupdateConfig { message AutoupdateConfigSpec { // agent_autoupdate specifies whether agent autoupdates are enabled. bool agent_autoupdate = 1; - // agent_schedules specifies schedules for upgrades of grouped agents. + // agent_schedules specifies schedules for updates of grouped agents. AgentAutoupdateSchedules agent_schedules = 3; } @@ -777,7 +777,7 @@ message AgentAutoupdateGroup { int32 timeout_seconds = 5; // failure_seconds before an agent is considered failed (loses connection) int32 failure_seconds = 6; - // max_in_flight specifies agents that can be upgraded at the same time, by percent. + // max_in_flight specifies agents that can be updated at the same time, by percent. 
string max_in_flight = 7; // max_timeout_before_halt specifies agents that can timeout before the rollout is halted, by percent. string max_timeout_before_halt = 8; From 803260f5b5ef7497d3accbbba03f138876d5f9f5 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 26 Aug 2024 20:54:13 -0400 Subject: [PATCH 063/105] clarify binary is teleport-update, package is teleport-ent-updater --- rfd/0169-auto-updates-linux-agents.md | 46 +++++++++++++-------------- 1 file changed, 23 insertions(+), 23 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index b26efad93c702..6803de84873aa 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -64,9 +64,9 @@ Whether the Teleport updater querying the endpoint is instructed to upgrade (via To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to `/v1/webapi/find`. Teleport auth servers use their access to heartbeat data to drive the rollout, while Teleport proxies modulate the `/v1/webapi/find` response given the host UUID and group name. -Rollouts are specified as interdependent groups of hosts, selected by upgrade group identifier specified in the agent's `/var/lib/teleport/versions/updates.yaml` file, which is written via `teleport-updater enable`: +Rollouts are specified as interdependent groups of hosts, selected by upgrade group identifier specified in the agent's `/var/lib/teleport/versions/updates.yaml` file, which is written via `teleport-update enable`: ```shell -$ teleport-updater enable --proxy teleport.example.com --group staging +$ teleport-update enable --proxy teleport.example.com --group staging ``` At the start of a group rollout, the Teleport auth server captures the desired group of hosts to update in the backend. @@ -353,7 +353,7 @@ This means that groups which employ auto-scaling or ephemeral resources will slo **This could lead to a production outage, as the latest Teleport version may not receive any validation before it is advertised to newly provisioned resources in production.** -To solve this in the future, we can use the group name (provided to `/v1/webapi/find` and specified via `teleport-updater enable`) to determine which version should be served. +To solve this in the future, we can use the group name (provided to `/v1/webapi/find` and specified via `teleport-update enable`) to determine which version should be served. This will require tracking the desired version of groups in the backend, which will add additional complexity to the rollout logic. @@ -398,7 +398,7 @@ $ tree /var/lib/teleport │ │ ├── tsh │ │ ├── tbot │ │ ├── ... # other binaries - │ │ ├── teleport-updater + │ │ ├── teleport-update │ │ └── teleport │ ├── etc │ │ └── systemd @@ -411,7 +411,7 @@ $ tree /var/lib/teleport │ │ ├── tsh │ │ ├── tbot │ │ ├── ... 
# other binaries
-  │   │   ├── teleport-updater
+  │   │   ├── teleport-update
  │   │   └── teleport
  │   └── etc
  │       └── systemd
@@ -423,8 +423,8 @@ $ ls -l /usr/local/bin/tbot
/usr/local/bin/tbot -> /var/lib/teleport/versions/15.0.0/bin/tbot
$ ls -l /usr/local/bin/teleport
/usr/local/bin/teleport -> /var/lib/teleport/versions/15.0.0/bin/teleport
-$ ls -l /usr/local/bin/teleport-updater
-/usr/local/bin/teleport-updater -> /var/lib/teleport/versions/15.0.0/bin/teleport-updater
+$ ls -l /usr/local/bin/teleport-update
+/usr/local/bin/teleport-update -> /var/lib/teleport/versions/15.0.0/bin/teleport-update
$ ls -l /usr/local/lib/systemd/system/teleport.service
/usr/local/lib/systemd/system/teleport.service -> /var/lib/teleport/versions/15.0.0/etc/systemd/teleport.service
```
@@ -438,7 +438,7 @@ spec:
  proxy: mytenant.teleport.sh
  # group specifies the update group
  group: staging
-  # enabled specifies whether auto-updates are enabled, i.e., whether teleport-updater update is allowed to update the agent.
+  # enabled specifies whether auto-updates are enabled, i.e., whether teleport-update update is allowed to update the agent.
  enabled: true
  # active_version specifies the active (symlinked) deployment of the teleport agent.
  active_version: 15.1.1
@@ -462,12 +462,12 @@ spec:
The agent-updater will run as a periodically executing systemd service which runs every 10 minutes.
The systemd service will run:
```shell
-$ teleport-updater update
+$ teleport-update update
```

-After it is installed, the `update` subcommand will no-op when executed until configured with the `teleport-updater` command:
+After it is installed, the `update` subcommand will no-op when executed until configured with the `teleport-update` command:
```shell
-$ teleport-updater enable --proxy mytenant.teleport.sh --group staging
+$ teleport-update enable --proxy mytenant.teleport.sh --group staging
```

If the proxy address is not provided with `--proxy`, the current proxy address from `teleport.yaml` is used.
@@ -515,12 +515,12 @@ When `update` subcommand is otherwise executed, it will:
14. Replace the old symlinks/binaries and `/var/lib/teleport/proc/sqlite.db` and quit (exit 1) if unsuccessful.
15. Remove all stored versions of the agent except the current version and last working version.

-To enable auto-updates of the updater itself, all commands will first check for an `active_version`, and reexec using the `teleport-updater` at that version if present and different.
-The `/usr/local/bin/teleport-updater` symlink will take precedence to avoid reexec in most scenarios.
+To enable auto-updates of the updater itself, all commands will first check for an `active_version`, and reexec using the `teleport-update` at that version if present and different.
+The `/usr/local/bin/teleport-update` symlink will take precedence to avoid reexec in most scenarios.

-To ensure that SELinux permissions do not prevent the `teleport-updater` binary from installing/removing Teleport versions, the updater package will configure SELinux contexts to allow changes to all required paths.
+To ensure that SELinux permissions do not prevent the `teleport-update` binary from installing/removing Teleport versions, the updater package will configure SELinux contexts to allow changes to all required paths.

-To ensure that `teleport` package removal does not interfere with `teleport-updater`, package removal will run `apt purge` (or `yum` equivalent) while ensuring that `/etc/teleport.yaml` and `/var/lib/teleport` are not purged.
+To ensure that `teleport` package removal does not interfere with `teleport-update`, package removal will run `apt purge` (or `yum` equivalent) while ensuring that `/etc/teleport.yaml` and `/var/lib/teleport` are not purged.
Failure to do this could result in `/etc/teleport.yaml` being removed when an operator runs `apt purge` at a later date.

To ensure that `teleport` package removal does not lead to a hard restart of Teleport, the updater will ensure that the package is removed without triggering needrestart or similar services.
@@ -531,7 +531,7 @@ To ensure that backups are consistent, the updater will use the [SQLite backup A

If the new version of Teleport fails to start, the installation of Teleport is reverted as described above.

-If `teleport-updater` itself fails with an error, and an older version of `teleport-updater` is available, the update will retry with the older version.
+If `teleport-update` itself fails with an error, and an older version of `teleport-update` is available, the update will retry with the older version.

Known failure conditions caused by intentional configuration (e.g., updates disabled) will not trigger retry logic.

@@ -564,7 +564,7 @@ When Teleport is downgraded to a previous version that has a backup of `sqlite.d
2. If the backup is valid, Teleport is fully stopped, the backup is restored along with symlinks, and the downgraded version of Teleport is started.
3. If the backup is invalid, we refuse to downgrade.

-Downgrades are applied with `teleport-updater update`, just like upgrades.
+Downgrades are applied with `teleport-update update`, just like upgrades.
The above steps modulate the standard workflow in the section above. If the downgraded version is already present, the uncompressed version is used to ensure fast recovery of the exact state before the failed upgrade.
To ensure that the target version was not corrupted by incomplete extraction, the downgrade checks for the existence of `/var/lib/teleport/versions/TARGET-VERSION/sha256` before downgrading.
@@ -593,29 +593,29 @@ Example: Given v1, v2, v3 versions of Teleport, where v2 is broken:

### Manual Workflow

-For use cases that fall outside of the functionality provided by `teleport-updater`, we provide an alternative manual workflow using the `/v1/webapi/find` endpoint.
-This workflow supports customers that cannot use the auto-update mechanism provided by `teleport-updater` because they use their own automation for updates (e.g., JamF or Ansible).
+For use cases that fall outside of the functionality provided by `teleport-update`, we provide an alternative manual workflow using the `/v1/webapi/find` endpoint.
+This workflow supports customers that cannot use the auto-update mechanism provided by `teleport-update` because they use their own automation for updates (e.g., JamF or Ansible).

Cluster administrators that want to self-manage agent updates may manually query the `/v1/webapi/find` endpoint using the host UUID, and implement auto-updates with their own automation.
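For illustration, such automation might poll the endpoint as sketched below. The query parameters and JSON fields follow the `/v1/webapi/find` description in this RFD; the helper itself, and the placeholder host UUID, are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// findResponse mirrors the update-related fields of /v1/webapi/find that
// are documented in this RFD; other fields are omitted.
type findResponse struct {
	ServerEdition            string `json:"server_edition"`
	AgentVersion             string `json:"agent_version"`
	AgentAutoUpdate          bool   `json:"agent_autoupdate"`
	AgentUpdateJitterSeconds int    `json:"agent_update_jitter_seconds"`
}

// check queries the proxy with the host UUID and group name, as a
// self-managed updater might do on a timer.
func check(proxyAddr, hostUUID, group string) (*findResponse, error) {
	u := url.URL{
		Scheme:   "https",
		Host:     proxyAddr,
		Path:     "/v1/webapi/find",
		RawQuery: url.Values{"host": {hostUUID}, "group": {group}}.Encode(),
	}
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get(u.String())
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("find returned %v", resp.Status)
	}
	var fr findResponse
	if err := json.NewDecoder(resp.Body).Decode(&fr); err != nil {
		return nil, err
	}
	return &fr, nil
}

func main() {
	fr, err := check("teleport.example.com", "00000000-0000-0000-0000-000000000000", "staging")
	if err != nil {
		fmt.Println("query failed:", err)
		return
	}
	if fr.AgentAutoUpdate {
		fmt.Printf("install %s (%s edition)\n", fr.AgentVersion, fr.ServerEdition)
	}
}
```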
### Installers

-The following install scripts will be updated to install the latest updater and run `teleport-updater enable` with the proxy address:
+The following install scripts will be updated to install the latest updater and run `teleport-update enable` with the proxy address:
- [/api/types/installers/agentless-installer.sh.tmpl](https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/api/types/installers/agentless-installer.sh.tmpl)
- [/api/types/installers/installer.sh.tmpl](https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/api/types/installers/installer.sh.tmpl)
- [/lib/web/scripts/oneoff/oneoff.sh](https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/lib/web/scripts/oneoff/oneoff.sh)
- [/lib/web/scripts/node-join/install.sh](https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/lib/web/scripts/node-join/install.sh)
- [/assets/aws/files/install-hardened.sh](https://github.com/gravitational/teleport/blob/d0a68fd82412b48cb54f664ae8500f625fb91e48/assets/aws/files/install-hardened.sh)

-Eventually, additional logic from the scripts could be added to `teleport-updater`, such that `teleport-updater` can configure teleport.
+Eventually, additional logic from the scripts could be added to `teleport-update`, such that `teleport-update` can configure teleport.

Moving additional logic into the updater is out-of-scope for this proposal.

To create pre-baked VM or container images that reduce the complexity of the cluster joining operation, two workflows are permitted:
-- Install the `teleport-updater` package and defer `teleport-updater enable`, Teleport configuration, and `systemctl enable teleport` to cloud-init scripts.
+- Install the `teleport-ent-updater` package and defer `teleport-update enable`, Teleport configuration, and `systemctl enable teleport` to cloud-init scripts.
  This allows both the proxy address and token to be injected at VM initialization.
  The VM image may be used with any Teleport cluster.
  Installer scripts will continue to function, as the package install operation will no-op.
-- Install the `teleport-updater` package and run `teleport-updater enable` before the image is baked, but defer final Teleport configuration and `systemctl enable teleport` to cloud-init scripts.
+- Install the `teleport-ent-updater` package and run `teleport-update enable` before the image is baked, but defer final Teleport configuration and `systemctl enable teleport` to cloud-init scripts.
  This allows the proxy address to be pre-set in the image.
  `teleport.yaml` can be partially configured during image creation.
  At minimum, the token must be injected via cloud-init scripts.
  Installer scripts would be skipped in favor of the `teleport configure` command.

From 89b285dad3214ca888c6b3a9542c1c68f46338ea Mon Sep 17 00:00:00 2001
From: Stephen Levine
Date: Tue, 27 Aug 2024 20:53:21 -0400
Subject: [PATCH 064/105] switch from df to unix.Statfs

---
 rfd/0169-auto-updates-linux-agents.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index 6803de84873aa..40f727d4f3686 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -481,7 +481,7 @@ The `enable` subcommand will:
1. Query the `/v1/webapi/find` endpoint.
2.
If the current updater-managed version of Teleport is the latest, and teleport package is not installed, jump to (16). 3. If the current updater-managed version of Teleport is the latest, but the teleport package is installed, jump to (13). -4. Ensure there is enough free disk space to update Teleport via `df .` and `content-length` header from `HEAD` request. +4. Ensure there is enough free disk space to update Teleport via `unix.Statfs()` and `content-length` header from `HEAD` request. 5. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. 6. Download and verify the checksum (tarball URL suffixed with `.sha256`). 7. Extract the tarball to `/var/lib/teleport/versions/VERSION` and write the SHA to `/var/lib/teleport/versions/VERSION/sha256`. @@ -504,7 +504,7 @@ When `update` subcommand is otherwise executed, it will: 3. Check that `agent_autoupdates` is true, quit otherwise. 4. If the current version of Teleport is the latest, quit. 5. Wait `random(0, agent_update_jitter_seconds)` seconds. -6. Ensure there is enough free disk space to update Teleport via `df .` and `content-length` header from `HEAD` request. +6. Ensure there is enough free disk space to update Teleport via `unix.Statfs()` and `content-length` header from `HEAD` request. 7. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. 8. Download and verify the checksum (tarball URL suffixed with `.sha256`). 9. Extract the tarball to `/var/lib/teleport/versions/VERSION` and write the SHA to `/var/lib/teleport/versions/VERSION/sha256`. From df32f3d259ae3e1b7ce0b898d12fb64e8279920b Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 4 Sep 2024 11:25:42 -0400 Subject: [PATCH 065/105] security feedback + naming adjustments --- rfd/0169-auto-updates-linux-agents.md | 157 +++++++++++++------------- 1 file changed, 79 insertions(+), 78 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 40f727d4f3686..d672615b4e72c 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -87,7 +87,7 @@ The cache is considered healthy when all instance heartbeats present on the back At the start of the upgrade window, auth servers attempt to write an update rollout plan to the backend under a single key. This plan is protected by optimistic locking, and contains the following data: -Data key: `/autoupdate/[name of group](/[page-id])` (e.g., `/autoupdate/staging/8745823`) +Data key: `/autoupdate/[name of group](/[page uuid])` (e.g., `/autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56`) Data value JSON: - `start_time`: timestamp of current window start time @@ -95,6 +95,7 @@ Data value JSON: - `schedule`: type of schedule that triggered the rollout - `hosts`: list of host UUIDs in randomized order - `next_page`: additional UUIDs, if list is greater than 100,000 UUIDs +- `auth_server`: ID of auth server writing the plan Expiration time of each key is 2 weeks. @@ -110,20 +111,20 @@ Each page will duplicate all values besides `hosts`, which will be different for All pages besides the first page will be suffixed with a randomly generated number. Pages will be written in reverse order, in a linked-link, before the final atomic non-suffixed write of the first page. If the non-suffixed write fails, the auth server is responsible for cleaning up the unusable pages. -If cleanup fails, the unusable pages will expire after 2 weeks. 
+If cleanup fails, the unusable pages will expire from the backend after 2 weeks. ``` Winning auth: - WRITE: /autoupdate/staging/4324234 | next_page: null - WRITE: /autoupdate/staging/8745823 | next_page: 4324234 - WRITE: /autoupdate/staging | next_page: 8745823 + WRITE: /autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56 | next_page: null + WRITE: /autoupdate/staging/9ae65c11-35f2-483c-987e-73ef36989d3b | next_page: 58526ba2-c12d-4a49-b5a4-1b694b82bf56 + WRITE: /autoupdate/staging | next_page: 9ae65c11-35f2-483c-987e-73ef36989d3b Losing auth: - WRITE: /autoupdate/staging/2342343 | next_page: null - WRITE: /autoupdate/staging/7678686 | next_page: 2342343 - WRITE CONFLICT: /autoupdate/staging | next_page: 7678686 - DELETE: /autoupdate/staging/7678686 - DELETE: /autoupdate/staging/2342343 + WRITE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530 | next_page: null + WRITE: /autoupdate/staging/dc27497b-ce25-4d85-b537-d0639996110d | next_page: dd850e65-d2b2-4557-8ffb-def893c52530 + WRITE CONFLICT: /autoupdate/staging | next_page: dc27497b-ce25-4d85-b537-d0639996110d + DELETE: /autoupdate/staging/dc27497b-ce25-4d85-b537-d0639996110d + DELETE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530 ``` #### Rollout @@ -210,7 +211,7 @@ spec: agent_schedules: # schedule is "regular" or "critical" regular: - # name of the group + # name of the group. Must only contain valid backend / resource name characters. - name: staging-group # days specifies the days of the week when the group may be updated. # default: ["*"] (all days) @@ -684,57 +685,57 @@ package teleport.autoupdate.v1; option go_package = "github.com/gravitational/teleport/api/gen/proto/go/teleport/autoupdate/v1;autoupdatev1"; -// AutoupdateService serves agent and client automatic version updates. -service AutoupdateService { - // GetAutoupdateConfig updates the autoupdate config. - rpc GetAutoupdateConfig(GetAutoupdateConfigRequest) returns (AutoupdateConfig); - // CreateAutoupdateConfig creates the autoupdate config. - rpc CreateAutoupdateConfig(CreateAutoupdateConfigRequest) returns (AutoupdateConfig); - // UpdateAutoupdateConfig updates the autoupdate config. - rpc UpdateAutoupdateConfig(UpdateAutoupdateConfigRequest) returns (AutoupdateConfig); - // UpsertAutoupdateConfig overwrites the autoupdate config. - rpc UpsertAutoupdateConfig(UpsertAutoupdateConfigRequest) returns (AutoupdateConfig); - // ResetAutoupdateConfig restores the autoupdate config to default values. - rpc ResetAutoupdateConfig(ResetAutoupdateConfigRequest) returns (AutoupdateConfig); - - // GetAutoupdateVersion returns the autoupdate version. - rpc GetAutoupdateVersion(GetAutoupdateVersionRequest) returns (AutoupdateVersion); - // CreateAutoupdateVersion creates the autoupdate version. - rpc CreateAutoupdateVersion(CreateAutoupdateVersionRequest) returns (AutoupdateVersion); - // UpdateAutoupdateVersion updates the autoupdate version. - rpc UpdateAutoupdateVersion(UpdateAutoupdateVersionRequest) returns (AutoupdateVersion); - // UpsertAutoupdateVersion overwrites the autoupdate version. - rpc UpsertAutoupdateVersion(UpsertAutoupdateVersionRequest) returns (AutoupdateVersion); +// AutoUpdateService serves agent and client automatic version updates. +service AutoUpdateService { + // GetAutoUpdateConfig updates the autoupdate config. + rpc GetAutoUpdateConfig(GetAutoUpdateConfigRequest) returns (AutoUpdateConfig); + // CreateAutoUpdateConfig creates the autoupdate config. 
+  rpc CreateAutoUpdateConfig(CreateAutoUpdateConfigRequest) returns (AutoUpdateConfig);
+  // UpdateAutoUpdateConfig updates the autoupdate config.
+  rpc UpdateAutoUpdateConfig(UpdateAutoUpdateConfigRequest) returns (AutoUpdateConfig);
+  // UpsertAutoUpdateConfig overwrites the autoupdate config.
+  rpc UpsertAutoUpdateConfig(UpsertAutoUpdateConfigRequest) returns (AutoUpdateConfig);
+  // ResetAutoUpdateConfig restores the autoupdate config to default values.
+  rpc ResetAutoUpdateConfig(ResetAutoUpdateConfigRequest) returns (AutoUpdateConfig);
+
+  // GetAutoUpdateVersion returns the autoupdate version.
+  rpc GetAutoUpdateVersion(GetAutoUpdateVersionRequest) returns (AutoUpdateVersion);
+  // CreateAutoUpdateVersion creates the autoupdate version.
+  rpc CreateAutoUpdateVersion(CreateAutoUpdateVersionRequest) returns (AutoUpdateVersion);
+  // UpdateAutoUpdateVersion updates the autoupdate version.
+  rpc UpdateAutoUpdateVersion(UpdateAutoUpdateVersionRequest) returns (AutoUpdateVersion);
+  // UpsertAutoUpdateVersion overwrites the autoupdate version.
+  rpc UpsertAutoUpdateVersion(UpsertAutoUpdateVersionRequest) returns (AutoUpdateVersion);

  // GetAgentRolloutPlan returns the agent rollout plan and current progress.
  rpc GetAgentRolloutPlan(GetAgentRolloutPlanRequest) returns (AgentRolloutPlan);
-  // GetAutoupdateVersion streams the agent rollout plan's list of all hosts.
+  // GetAgentRolloutPlanHosts streams the agent rollout plan's list of all hosts.
  rpc GetAgentRolloutPlanHosts(GetAgentRolloutPlanHostsRequest) returns (stream AgentRolloutPlanHost);
}

-// GetAutoupdateConfigRequest requests the contents of the AutoupdateConfig.
-message GetAutoupdateConfigRequest {}
+// GetAutoUpdateConfigRequest requests the contents of the AutoUpdateConfig.
+message GetAutoUpdateConfigRequest {}

-// CreateAutoupdateConfigRequest requests creation of the the AutoupdateConfig.
-message CreateAutoupdateConfigRequest {
-  AutoupdateConfig autoupdate_config = 1;
+// CreateAutoUpdateConfigRequest requests creation of the AutoUpdateConfig.
+message CreateAutoUpdateConfigRequest {
+  AutoUpdateConfig autoupdate_config = 1;
}

-// UpdateAutoupdateConfigRequest requests an update of the the AutoupdateConfig.
-message UpdateAutoupdateConfigRequest {
-  AutoupdateConfig autoupdate_config = 1;
+// UpdateAutoUpdateConfigRequest requests an update of the AutoUpdateConfig.
+message UpdateAutoUpdateConfigRequest {
+  AutoUpdateConfig autoupdate_config = 1;
}

-// UpsertAutoupdateConfigRequest requests an upsert of the the AutoupdateConfig.
-message UpsertAutoupdateConfigRequest {
-  AutoupdateConfig autoupdate_config = 1;
+// UpsertAutoUpdateConfigRequest requests an upsert of the AutoUpdateConfig.
+message UpsertAutoUpdateConfigRequest {
+  AutoUpdateConfig autoupdate_config = 1;
}

-// ResetAutoupdateConfigRequest requests a reset of the the AutoupdateConfig to default values.
-message ResetAutoupdateConfigRequest {}
+// ResetAutoUpdateConfigRequest requests a reset of the AutoUpdateConfig to default values.
+message ResetAutoUpdateConfigRequest {}

-// AutoupdateConfig holds dynamic configuration settings for automatic updates.
-message AutoupdateConfig {
+// AutoUpdateConfig holds dynamic configuration settings for automatic updates.
+message AutoUpdateConfig {
  // kind is the kind of the resource.
  string kind = 1;
  // sub_kind is the sub kind of the resource.
@@ -744,27 +745,27 @@ message AutoupdateConfig {
  // metadata is the metadata of the resource.
teleport.header.v1.Metadata metadata = 4;
  // spec is the spec of the resource.
-  AutoupdateConfigSpec spec = 7;
+  AutoUpdateConfigSpec spec = 7;
}

-// AutoupdateConfigSpec is the spec for the autoupdate config.
-message AutoupdateConfigSpec {
+// AutoUpdateConfigSpec is the spec for the autoupdate config.
+message AutoUpdateConfigSpec {
  // agent_autoupdate specifies whether agent autoupdates are enabled.
  bool agent_autoupdate = 1;
  // agent_schedules specifies schedules for updates of grouped agents.
-  AgentAutoupdateSchedules agent_schedules = 3;
+  AgentAutoUpdateSchedules agent_schedules = 3;
}

-// AgentAutoupdateSchedules specifies update scheduled for grouped agents.
-message AgentAutoupdateSchedules {
+// AgentAutoUpdateSchedules specifies update schedules for grouped agents.
+message AgentAutoUpdateSchedules {
  // regular schedules for non-critical versions.
-  repeated AgentAutoupdateGroup regular = 1;
+  repeated AgentAutoUpdateGroup regular = 1;
  // critical schedules for urgently needed versions.
-  repeated AgentAutoupdateGroup critical = 2;
+  repeated AgentAutoUpdateGroup critical = 2;
}

-// AgentAutoupdateGroup specifies the update schedule for a group of agents.
-message AgentAutoupdateGroup {
+// AgentAutoUpdateGroup specifies the update schedule for a group of agents.
+message AgentAutoUpdateGroup {
  // name of the group
  string name = 1;
  // days to run update
@@ -800,29 +801,29 @@ enum Day {
  DAY_SATURDAY = 8;
}

-// GetAutoupdateVersionRequest requests the autoupdate_version singleton resource.
-message GetAutoupdateVersionRequest {}
+// GetAutoUpdateVersionRequest requests the autoupdate_version singleton resource.
+message GetAutoUpdateVersionRequest {}

-// GetAutoupdateVersionRequest requests creation of the autoupdate_version singleton resource.
-message CreateAutoupdateVersionRequest {
+// CreateAutoUpdateVersionRequest requests creation of the autoupdate_version singleton resource.
+message CreateAutoUpdateVersionRequest {
  // autoupdate_version resource contents
-  AutoupdateVersion autoupdate_version = 1;
+  AutoUpdateVersion autoupdate_version = 1;
}

-// GetAutoupdateVersionRequest requests an update of the autoupdate_version singleton resource.
-message UpdateAutoupdateVersionRequest {
+// UpdateAutoUpdateVersionRequest requests an update of the autoupdate_version singleton resource.
+message UpdateAutoUpdateVersionRequest {
  // autoupdate_version resource contents
-  AutoupdateVersion autoupdate_version = 1;
+  AutoUpdateVersion autoupdate_version = 1;
}

-// GetAutoupdateVersionRequest requests an upsert of the autoupdate_version singleton resource.
-message UpsertAutoupdateVersionRequest {
+// UpsertAutoUpdateVersionRequest requests an upsert of the autoupdate_version singleton resource.
+message UpsertAutoUpdateVersionRequest {
  // autoupdate_version resource contents
-  AutoupdateVersion autoupdate_version = 1;
+  AutoUpdateVersion autoupdate_version = 1;
}

-// AutoupdateVersion holds dynamic configuration settings for autoupdate versions.
-message AutoupdateVersion {
+// AutoUpdateVersion holds dynamic configuration settings for autoupdate versions.
+message AutoUpdateVersion {
  // kind is the kind of the resource.
  string kind = 1;
  // sub_kind is the sub kind of the resource.
@@ -832,11 +833,11 @@ message AutoupdateVersion {
  // metadata is the metadata of the resource.
  teleport.header.v1.Metadata metadata = 4;
  // spec is the spec of the resource.
-  AutoupdateVersionSpec spec = 5;
+  AutoUpdateVersionSpec spec = 5;
}

-// AutoupdateVersionSpec is the spec for the autoupdate version.
-message AutoupdateVersionSpec {
+// AutoUpdateVersionSpec is the spec for the autoupdate version.
+message AutoUpdateVersionSpec {
  // agent_version is the desired agent version for new rollouts.
  string agent_version = 1;
  // agent_version schedule is the schedule to use for rolling out the agent_version.
@@ -883,7 +884,7 @@ message AgentRolloutPlan {
  AgentRolloutPlanStatus status = 6;
}

-// AutoupdateVersionSpec is the spec for the AgentRolloutPlan.
+// AgentRolloutPlanSpec is the spec for the AgentRolloutPlan.
message AgentRolloutPlanSpec {
  // start_time of the rollout
  google.protobuf.Timestamp start_time = 1;
@@ -895,7 +896,7 @@ message AgentRolloutPlanSpec {
  int64 host_count = 4;
}

-// AutoupdateVersionStatus is the status for the AgentRolloutPlan.
+// AgentRolloutPlanStatus is the status for the AgentRolloutPlan.
message AgentRolloutPlanStatus {
  // last_active_host_index specifies the index of the last host that may be updated.
  int64 last_active_host_index = 1;
@@ -914,7 +915,7 @@ message AgentRolloutPlanHost {
}

## Execution Plan

-1. Implement Teleport APIs for new scheduling system (without backpressure)
+1. Implement Teleport APIs for new scheduling system (without backpressure or group interdependence)
2. Implement new Linux server auto-updater in Go.
3. Implement changes to Kubernetes auto-updater.
4. Test extensively on all supported Linux distributions.
@@ -923,4 +924,4 @@ message AgentRolloutPlanHost {
7. Release documentation changes.
8. Communicate to users that they should update their updater.
9. Deprecate old auto-updater endpoints.
-10. Add groups and backpressure features.
+10. Add group interdependence and backpressure features.

From c4f813abef0cb73a11f764179cb706751df7c013 Mon Sep 17 00:00:00 2001
From: Stephen Levine
Date: Fri, 6 Sep 2024 13:45:30 -0400
Subject: [PATCH 066/105] tweak rollout paging

---
 rfd/0169-auto-updates-linux-agents.md | 43 ++++++++++++++++-----------
 1 file changed, 25 insertions(+), 18 deletions(-)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index d672615b4e72c..50a7c72be7ecb 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -80,22 +80,27 @@ Rollouts may be retried with `tctl autoupdate run`.
#### Window Capture

Instance heartbeats will be cached by auth servers using a dedicated cache.
-This cache is updated using rate-limited backend reads that occur in the background, to avoid mass-reads of instance heartbeats.
+This cache is initialized from the backend when the auth server starts, and kept up-to-date when the heartbeats are broadcast to all auth servers.
+
+When the auth server is started, the cache is initialized using rate-limited backend reads that occur in the background, to avoid mass-reads of instance heartbeats.
The rate is modulated by the total number of instance heartbeats, to avoid putting too much load on the backend on large clusters.
-The cache is considered healthy when all instance heartbeats present on the backend have been read within a time period that is also modulated by the total number of heartbeats.
+The cache is considered healthy when all instance heartbeats present on the backend have been read at least once.
+
+Instance heartbeats are currently broadcast to all auth servers.
+The cache will be kept up-to-date when the auth server receives updates.

-At the start of the upgrade window, auth servers attempt to write an update rollout plan to the backend under a single key.
+At the start of the upgrade window, auth servers attempt to write an update rollout plan to the backend. This plan is protected by optimistic locking, and contains the following data: -Data key: `/autoupdate/[name of group](/[page uuid])` (e.g., `/autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56`) +Data key: `/autoupdate/[name of group](/[auth ID]/page[number])` (e.g., `/autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56/page1`) Data value JSON: - `start_time`: timestamp of current window start time - `version`: version for which this rollout is valid - `schedule`: type of schedule that triggered the rollout - `hosts`: list of host UUIDs in randomized order -- `next_page`: additional UUIDs, if list is greater than 100,000 UUIDs -- `auth_server`: ID of auth server writing the plan +- `auth_id`: ID of the auth server writing the plan + Expiration time of each key is 2 weeks. @@ -108,25 +113,27 @@ If the resource size is greater than 100 KiB, auth servers will divide the resou This is necessary to support backends with a value size limit. Each page will duplicate all values besides `hosts`, which will be different for each page. -All pages besides the first page will be suffixed with a randomly generated number. -Pages will be written in reverse order, in a linked-link, before the final atomic non-suffixed write of the first page. -If the non-suffixed write fails, the auth server is responsible for cleaning up the unusable pages. +All pages besides the first page will be prefixed with the auth server's ID. +Pages will be written in reverse order before the final atomic non-prefixed write of the first page. +If the non-prefixed write fails, the auth server is responsible for cleaning up the unusable pages. If cleanup fails, the unusable pages will expire from the backend after 2 weeks. ``` Winning auth: - WRITE: /autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56 | next_page: null - WRITE: /autoupdate/staging/9ae65c11-35f2-483c-987e-73ef36989d3b | next_page: 58526ba2-c12d-4a49-b5a4-1b694b82bf56 - WRITE: /autoupdate/staging | next_page: 9ae65c11-35f2-483c-987e-73ef36989d3b + WRITE: /autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56/page2 + WRITE: /autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56/page1 + WRITE: /autoupdate/staging | auth_id: 58526ba2-c12d-4a49-b5a4-1b694b82bf56 Losing auth: - WRITE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530 | next_page: null - WRITE: /autoupdate/staging/dc27497b-ce25-4d85-b537-d0639996110d | next_page: dd850e65-d2b2-4557-8ffb-def893c52530 - WRITE CONFLICT: /autoupdate/staging | next_page: dc27497b-ce25-4d85-b537-d0639996110d - DELETE: /autoupdate/staging/dc27497b-ce25-4d85-b537-d0639996110d - DELETE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530 + WRITE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530/page2 + WRITE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530/page1 + WRITE CONFLICT: /autoupdate/staging | auth_id: dd850e65-d2b2-4557-8ffb-def893c52530 + DELETE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530/page1 + DELETE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530/page2 ``` +To read all pages, auth servers read the first page, get the auth server ID from the `auth_id` field, and then range-read the remaining pages. + #### Rollout The rollout logic is progressed by instance heartbeat backend writes, as changes can only occur on these events. 
@@ -169,7 +176,7 @@ upgrading := make(map[UUID]bool) ``` Proxies watch for changes to the plan and update the map accordingly. -When the updater queries the proxy via `/v1/webapi/find?host=[host_uuid]`, the proxies query the map to determine the value of `agent_autoupdate: true`. +When the updater queries the proxy via `/v1/webapi/find?host=[host_uuid]&group=[name]`, the proxies query the map to determine the value of `agent_autoupdate: true`. Updating all agents generates the following additional backend write load: - One write per page of the rollout plan per update group. From f82fd62ebe7e745ac3d922aae16ee504e8bf708c Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Fri, 6 Sep 2024 13:47:03 -0400 Subject: [PATCH 067/105] tweak rollout paging again --- rfd/0169-auto-updates-linux-agents.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 50a7c72be7ecb..25f58f89994ec 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -92,7 +92,7 @@ The cache will be kept up-to-date when the auth server receives updates. At the start of the upgrade window, auth servers attempt to write an update rollout plan to the backend. This plan is protected by optimistic locking, and contains the following data: -Data key: `/autoupdate/[name of group](/[auth ID]/page[number])` (e.g., `/autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56/page1`) +Data key: `/autoupdate/[name of group](/[auth ID]/[number])` (e.g., `/autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56/1`) Data value JSON: - `start_time`: timestamp of current window start time @@ -120,16 +120,16 @@ If cleanup fails, the unusable pages will expire from the backend after 2 weeks. ``` Winning auth: - WRITE: /autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56/page2 - WRITE: /autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56/page1 + WRITE: /autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56/2 + WRITE: /autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56/1 WRITE: /autoupdate/staging | auth_id: 58526ba2-c12d-4a49-b5a4-1b694b82bf56 Losing auth: - WRITE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530/page2 - WRITE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530/page1 + WRITE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530/2 + WRITE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530/1 WRITE CONFLICT: /autoupdate/staging | auth_id: dd850e65-d2b2-4557-8ffb-def893c52530 - DELETE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530/page1 - DELETE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530/page2 + DELETE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530/1 + DELETE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530/2 ``` To read all pages, auth servers read the first page, get the auth server ID from the `auth_id` field, and then range-read the remaining pages. 
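A sketch of that page-read sequence under the keying scheme above; the in-memory `store` and field names stand in for the real backend interface and are assumptions for illustration.

```go
package main

import "fmt"

// page is a simplified rollout-plan page; only the fields needed to follow
// the paging scheme are shown.
type page struct {
	authID string   // auth_id: ID of the auth server that wrote the plan
	hosts  []string // hosts: host UUIDs in this page
}

// store is a stand-in for the cluster state backend.
type store map[string]page

// readPlan reads the unsuffixed first page, then uses its auth server ID to
// range-read the numbered pages written by the same auth server.
func readPlan(s store, group string) ([]string, error) {
	first, ok := s["/autoupdate/"+group]
	if !ok {
		return nil, fmt.Errorf("no rollout plan for group %q", group)
	}
	hosts := first.hosts
	for n := 1; ; n++ {
		p, ok := s[fmt.Sprintf("/autoupdate/%s/%s/%d", group, first.authID, n)]
		if !ok {
			break // no more pages
		}
		hosts = append(hosts, p.hosts...)
	}
	return hosts, nil
}

func main() {
	s := store{
		"/autoupdate/staging":          {authID: "auth-1", hosts: []string{"h1"}},
		"/autoupdate/staging/auth-1/1": {hosts: []string{"h2", "h3"}},
		"/autoupdate/staging/auth-1/2": {hosts: []string{"h4"}},
	}
	hosts, _ := readPlan(s, "staging")
	fmt.Println(hosts) // [h1 h2 h3 h4]
}
```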
From a8afbed2d1ef005e251c926c33766dca928642dc Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Mon, 9 Sep 2024 22:19:06 -0400 Subject: [PATCH 068/105] feedback --- rfd/0169-auto-updates-linux-agents.md | 43 +++++++++++++++++++++++---- 1 file changed, 38 insertions(+), 5 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 25f58f89994ec..9f4d5a430b868 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -15,7 +15,9 @@ state: draft This RFD proposes a new mechanism for scheduled, automatic updates of Teleport agents. -Users of Teleport will be able to use the tctl CLI to specify desired versions and update schedules. +Users of Teleport will be able to use the tctl CLI to specify desired versions, update schedules, and rollout speed. + +Agents will be updated by a new `teleport-update` binary, built from `tools/teleport-update` in the Teleport repository. All agent installations are in-scope for this proposal, including agents installed on Linux servers and Kubernetes. @@ -24,6 +26,7 @@ The following anti-goals are out-of-scope for this proposal, but will be address - Teleport Cloud APIs for updating agents - Improvements to the local functionality of the Kubernetes agent for better compatibility with FluxCD and ArgoCD - Support for progressive rollouts to different groups of ephemeral or auto-scaling agents (see: Version Promotion) +- Support for progressive rollouts of tbot, when not installed on the same system as a Teleport agent This RFD proposes a specific implementation of several sections in https://github.com/gravitational/teleport/pull/39217. @@ -36,7 +39,7 @@ The existing mechanism for automatic agent updates does not provide a hands-off 1. The use of system package management leads to interactions with `apt upgrade`, `yum upgrade`, etc. that can result in unintentional upgrades. 2. The use of system package management requires logic that varies significantly by target distribution. 3. The installation mechanism requires 4-5 commands, includes manually installing multiple packages, and varies depending on your version and edition of Teleport. -4. The use of bash to implement the updater makes changes difficult and prone to error. +4. The use of bash to implement the updater makes long-term maintenance difficult. 5. The existing auto-updater has limited automated testing. 6. The use of GPG keys in system package managers has key management implications that we would prefer to solve with TUF in the future. 7. The desired agent version cannot be set via Teleport's operator-targeted CLI (tctl). @@ -437,10 +440,13 @@ $ ls -l /usr/local/lib/systemd/system/teleport.service /usr/local/lib/systemd/system/teleport.service -> /var/lib/teleport/versions/15.0.0/etc/systemd/teleport.service ``` -updates.yaml: +#### updates.yaml + +This file stores configuration for `teleport-update`. + ``` version: v1 -kind: agent_versions +kind: updates spec: # proxy specifies the Teleport proxy address to retrieve the agent version and update configuration from. proxy: mytenant.teleport.sh @@ -452,7 +458,10 @@ spec: active_version: 15.1.1 ``` -backup.yaml: +#### backup.yaml + +This file stores metadata about an individual backup of the Teleport agent's sqlite DB. 
+ ``` version: v1 kind: db_backup @@ -920,6 +929,30 @@ message AgentRolloutPlanHost { } ``` +## Alternatives + +### `teleport update` Subcommand + +`teleport-update` is intended to be a minimal binary, with few dependencies, that is used to bootstrap initial Teleport agent installations. +It may be baked into AMIs or containers. + +If the entirely `teleport` binary were used instead, security scanners would match vulnerabilities all Teleport dependencies, so customers would have to handle rebuilding artifacts (e.g., AMIs) more often. +Deploying these updates is often more disruptive than a soft restart of the agent triggered by the auto-updater. + +`teleport-update` will also handle `tbot` updates in the future, and it would be undesirable to distribute `teleport` with `tbot` just to enable automated updates. + +Finally, `teleport-update`'s API contract with the cluster must remain stable to ensure that outdated agent installations can always be recovered. +The first version of `teleport-update` will need to work with Teleport v14 and all future versions of Teleport. +This contract may be easier to manage with a separate artifact. + +### Mutually-Authenticated RPC for Update Boolean + +Agents will not always have a mutually-authenticated connection to auth to receive update instructions. +For example, the agent may be in a failed state due to a botched upgrade, may be temporarily stopped, or may be newly installed. +In the future, `tbot`-only installations may have expired certificates. + +Making the update boolean instruction available via the `/webapi/find` TLS endpoint reduces complexity as well as the risk of unrecoverable outages. + ## Execution Plan 1. Implement Teleport APIs for new scheduling system (without backpressure or group interdependence) From b669bf4fb09d7648d0d48d1873fc905c0f6ac724 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Tue, 10 Sep 2024 17:07:44 -0400 Subject: [PATCH 069/105] adjust update.yaml to match implementation feedback --- rfd/0169-auto-updates-linux-agents.md | 21 ++++++++++++--------- 1 file changed, 12 insertions(+), 9 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 9f4d5a430b868..48d9d6a53240c 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -67,7 +67,7 @@ Whether the Teleport updater querying the endpoint is instructed to upgrade (via To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to `/v1/webapi/find`. Teleport auth servers use their access to heartbeat data to drive the rollout, while Teleport proxies modulate the `/v1/webapi/find` response given the host UUID and group name. -Rollouts are specified as interdependent groups of hosts, selected by upgrade group identifier specified in the agent's `/var/lib/teleport/versions/updates.yaml` file, which is written via `teleport-update enable`: +Rollouts are specified as interdependent groups of hosts, selected by upgrade group identifier specified in the agent's `/var/lib/teleport/versions/update.yaml` file, which is written via `teleport-update enable`: ```shell $ teleport-update enable --proxy teleport.example.com --group staging ``` @@ -200,7 +200,7 @@ Notes: - Agents will only update if `agent_autoupdate` is `true`, but new installations will use `agent_version` regardless of the value in `agent_autoupdate`. 
- The edition served is the cluster edition (enterprise, enterprise-fips, or oss), and cannot be configured. - The host UUID is read from `/var/lib/teleport/host_uuid` by the updater. -- The group name is read from `/var/lib/teleport/versions/updates.yaml` by the updater. +- The group name is read from `/var/lib/teleport/versions/update.yaml` by the updater. ### Teleport Resources @@ -427,7 +427,7 @@ $ tree /var/lib/teleport │ └── etc │ └── systemd │ └── teleport.service - └── updates.yaml + └── update.yaml $ ls -l /usr/local/bin/tsh /usr/local/bin/tsh -> /var/lib/teleport/versions/15.0.0/bin/tsh $ ls -l /usr/local/bin/tbot @@ -440,7 +440,7 @@ $ ls -l /usr/local/lib/systemd/system/teleport.service /usr/local/lib/systemd/system/teleport.service -> /var/lib/teleport/versions/15.0.0/etc/systemd/teleport.service ``` -#### updates.yaml +#### update.yaml This file stores configuration for `teleport-update`. @@ -452,8 +452,11 @@ spec: proxy: mytenant.teleport.sh # group specifies the update group group: staging + # url_template specifies a custom URL template for downloading Teleport. + # url_template: "" # enabled specifies whether auto-updates are enabled, i.e., whether teleport-update update is allowed to update the agent. enabled: true +status: # active_version specifies the active (symlinked) deployment of the telepport agent. active_version: 15.1.1 ``` @@ -505,18 +508,18 @@ The `enable` subcommand will: 8. Replace any existing binaries or symlinks with symlinks to the current version. 9. Backup `/var/lib/teleport/proc/sqlite.db` into `/var/lib/teleport/versions/OLD-VERSION/backup/sqlite.db` and create `backup.yaml`. 10. Restart the agent if the systemd service is already enabled. -11. Set `active_version` in `updates.yaml` if successful or not enabled. +11. Set `active_version` in `update.yaml` if successful or not enabled. 12. Replace the symlinks/binaries and `/var/lib/teleport/proc/sqlite.db` and quit (exit 1) if unsuccessful. 13. Remove and purge any `teleport` package if installed. 14. Verify the symlinks to the active version still exists. 15. Remove all stored versions of the agent except the current version and last working version. -16. Configure `updates.yaml` with the current proxy address and group, and set `enabled` to true. +16. Configure `update.yaml` with the current proxy address and group, and set `enabled` to true. The `disable` subcommand will: -1. Configure `updates.yaml` to set `enabled` to false. +1. Configure `update.yaml` to set `enabled` to false. When `update` subcommand is otherwise executed, it will: -1. Check `updates.yaml`, and quit (exit 0) if `enabled` is false, or quit (exit 1) if `enabled` is true and no proxy address is set. +1. Check `update.yaml`, and quit (exit 0) if `enabled` is false, or quit (exit 1) if `enabled` is true and no proxy address is set. 2. Query the `/v1/webapi/find` endpoint. 3. Check that `agent_autoupdates` is true, quit otherwise. 4. If the current version of Teleport is the latest, quit. @@ -528,7 +531,7 @@ When `update` subcommand is otherwise executed, it will: 10. Update symlinks to point at the new version. 11. Backup `/var/lib/teleport/proc/sqlite.db` into `/var/lib/teleport/versions/OLD-VERSION/backup/sqlite.db` and create `backup.yaml`. 12. Restart the agent if the systemd service is already enabled. -13. Set `active_version` in `updates.yaml` if successful or not enabled. +13. Set `active_version` in `update.yaml` if successful or not enabled. 14. 
Replace the old symlinks/binaries and `/var/lib/teleport/proc/sqlite.db` and quit (exit 1) if unsuccessful. 15. Remove all stored versions of the agent except the current version and last working version. From f8736b9609f0dfbb8b33451736f9421310ce5627 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Tue, 24 Sep 2024 15:49:37 -0700 Subject: [PATCH 070/105] wip - new model --- rfd/0169-auto-updates-linux-agents.md | 618 +++++++++++--------------- 1 file changed, 266 insertions(+), 352 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 48d9d6a53240c..2b8f264ed2899 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -25,7 +25,6 @@ The following anti-goals are out-of-scope for this proposal, but will be address - Signing of agent artifacts (e.g., via TUF) - Teleport Cloud APIs for updating agents - Improvements to the local functionality of the Kubernetes agent for better compatibility with FluxCD and ArgoCD -- Support for progressive rollouts to different groups of ephemeral or auto-scaling agents (see: Version Promotion) - Support for progressive rollouts of tbot, when not installed on the same system as a Teleport agent This RFD proposes a specific implementation of several sections in https://github.com/gravitational/teleport/pull/39217. @@ -53,154 +52,9 @@ The existing mechanism for automatic agent updates does not provide a hands-off We must provide a seamless, hands-off experience for auto-updates of Teleport Agents that is easy to maintain and safer for production use. -## Details - Teleport API - -Teleport proxies will be updated to serve the desired agent version and edition from `/v1/webapi/find`. -The version and edition served from that endpoint will be configured using new `autoupdate_version` resource. - -Whether the Teleport updater querying the endpoint is instructed to upgrade (via the `agent_autoupdate` field) is dependent on: -- The `host=[uuid]` parameter sent to `/v1/webapi/find` -- The `group=[name]` parameter sent to `/v1/webapi/find` -- The schedule defined in the new `autoupdate_config` resource -- The status of past agent upgrades for the given version - -To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to `/v1/webapi/find`. -Teleport auth servers use their access to heartbeat data to drive the rollout, while Teleport proxies modulate the `/v1/webapi/find` response given the host UUID and group name. - -Rollouts are specified as interdependent groups of hosts, selected by upgrade group identifier specified in the agent's `/var/lib/teleport/versions/update.yaml` file, which is written via `teleport-update enable`: -```shell -$ teleport-update enable --proxy teleport.example.com --group staging -``` - -At the start of a group rollout, the Teleport auth server captures the desired group of hosts to update in the backend. -An fixed number of hosts (`max_in_flight % x total`) are instructed to upgrade at the same time via `/v1/webapi/find`. -Additional hosts are instructed to update as earlier updates complete, timeout, or fail, never exceeding `max_in_flight`. -The group rollout is halted if timeouts or failures exceed their specified thresholds. -Rollouts may be retried with `tctl autoupdate run`. - -### Scalability - -#### Window Capture - -Instance heartbeats will be cached by auth servers using a dedicated cache. 
-This cache is initialized from the backend when the auth server starts, and kept up-to-date when the heartbeats are broadcast to all auth servers. - -When the auth server is started, the cache is initialized using rate-limited backend reads that occur in the background, to avoid mass-reads of instance heartbeats. -The rate is modulated by the total number of instance heartbeats, to avoid putting too much load on the backend on large clusters. -The cache is considered healthy when all instance heartbeats present on the backend have been read at least once. - -Instance heartbeats are currently broadcast to all auth servers. -The cache will be kept up-to-date when the auth server receives updates. - -At the start of the upgrade window, auth servers attempt to write an update rollout plan to the backend. -This plan is protected by optimistic locking, and contains the following data: - -Data key: `/autoupdate/[name of group](/[auth ID]/[number])` (e.g., `/autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56/1`) - -Data value JSON: -- `start_time`: timestamp of current window start time -- `version`: version for which this rollout is valid -- `schedule`: type of schedule that triggered the rollout -- `hosts`: list of host UUIDs in randomized order -- `auth_id`: ID of the auth server writing the plan - - -Expiration time of each key is 2 weeks. - -At a fixed interval, auth servers will check the plan to determine if a new plan is needed by comparing `start_time` to the current time and the desired window. -If a new plan is needed, auth servers will query their cache of instance heartbeats and attempt to write the new plan. -The first auth server to write the plan wins; others will be rejected by the optimistic lock. -Auth servers will only write the plan if their instance heartbeat cache is healthy. - -If the resource size is greater than 100 KiB, auth servers will divide the resource into pages no greater than 100 KiB each. -This is necessary to support backends with a value size limit. - -Each page will duplicate all values besides `hosts`, which will be different for each page. -All pages besides the first page will be prefixed with the auth server's ID. -Pages will be written in reverse order before the final atomic non-prefixed write of the first page. -If the non-prefixed write fails, the auth server is responsible for cleaning up the unusable pages. -If cleanup fails, the unusable pages will expire from the backend after 2 weeks. - -``` -Winning auth: - WRITE: /autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56/2 - WRITE: /autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56/1 - WRITE: /autoupdate/staging | auth_id: 58526ba2-c12d-4a49-b5a4-1b694b82bf56 - -Losing auth: - WRITE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530/2 - WRITE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530/1 - WRITE CONFLICT: /autoupdate/staging | auth_id: dd850e65-d2b2-4557-8ffb-def893c52530 - DELETE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530/1 - DELETE: /autoupdate/staging/dd850e65-d2b2-4557-8ffb-def893c52530/2 -``` - -To read all pages, auth servers read the first page, get the auth server ID from the `auth_id` field, and then range-read the remaining pages. - -#### Rollout - -The rollout logic is progressed by instance heartbeat backend writes, as changes can only occur on these events. 
- -The following data related to the rollout are stored in each instance heartbeat: -- `agent_update_start_time`: timestamp of individual agent's upgrade time -- `agent_update_version`: current agent version - -Expiration time of the heartbeat is extended to 24 hours when `agent_update_start_time` is written. - -Additionally, an in-memory data structure is maintained based on the cache, and kept up-to-date by a background process. -This data structure contains the number of unfinished (pending and ongoing) upgrades preceding each instance heartbeat in the rollout plan. -Instance heartbeats are considered completed when either `agent_update_version` matches the plan version, or `agent_update_start_time` is past the expiration time. -```golang -unfinished := make(map[Rollout][UUID]int) -``` - -On each instance heartbeat write, the auth server looks at the data structure to determine if the associated agent should begin upgrading. -This determination is made by comparing the stored number of unfinished upgrades to `max_in_flight % x len(hosts)`. -If the stored number is fewer, `agent_update_start_time` is updated to the current time when the heartbeat is written. +## UX -The auth server writes the following keys to `/autoupdate/[name of group]/status` (e.g., `/autoupdate/staging/status`): -- `last_active_host_index`: index of the last host allowed to update -- `failed_host_count`: failed host count -- `timeout_host_count`: timed-out host count - -Writes are rate-limited such that the progress is only updated every 10 seconds. -If the auth server's cached progress is greater than its calculated progress, the auth server declines to update the progress. - -The predetermined ordering of hosts avoids cache synchronization issues between auth servers. -Two concurrent heartbeat writes may temporarily result in fewer upgrading instances than desired, but this will eventually be resolved by cache propagation. - -Each group rollout is represented by an `agent_rollout_plan` Teleport resource that includes the progress and host count, but not the list of UUIDs. -Proxies use the start time in the resource to determine when to stream the list of UUIDs via a dedicated RPC. -Proxies watch the status section of `agent_rollout_plan` for updates to progress. - -Proxies read all started rollouts and maintain an in-memory map of host UUID to upgrading status: -```golang -upgrading := make(map[UUID]bool) -``` -Proxies watch for changes to the plan and update the map accordingly. - -When the updater queries the proxy via `/v1/webapi/find?host=[host_uuid]&group=[name]`, the proxies query the map to determine the value of `agent_autoupdate: true`. - -Updating all agents generates the following additional backend write load: -- One write per page of the rollout plan per update group. -- One write per auth server every 10 seconds, during rollouts. - -### REST Endpoints - -`/v1/webapi/find?host=[host_uuid]&group=[name]` -```json -{ - "server_edition": "enterprise", - "agent_version": "15.1.1", - "agent_autoupdate": true, - "agent_update_jitter_seconds": 10 -} -``` -Notes: -- Agents will only update if `agent_autoupdate` is `true`, but new installations will use `agent_version` regardless of the value in `agent_autoupdate`. -- The edition served is the cluster edition (enterprise, enterprise-fips, or oss), and cannot be configured. -- The host UUID is read from `/var/lib/teleport/host_uuid` by the updater. -- The group name is read from `/var/lib/teleport/versions/update.yaml` by the updater. 
+[Hugo to add] ### Teleport Resources @@ -211,74 +65,85 @@ kind: autoupdate_config spec: # agent_autoupdate allows turning agent updates on or off at the # cluster level. Only turn agent automatic updates off if self-managed - # agent updates are in place. - agent_autoupdate: true|false + # agent updates are in place. Setting this to pause will halt the rollout. + agent_autoupdate: disable|enable|pause - # agent_schedules contains both "regular" and "critical" schedules. - # The schedule used is determined by the agent_version_schedule associated - # with the version in autoupdate_version. - # Groups are not configurable with the "immediate" schedule. + # agent_schedules specifies version rollout schedules for agents. + # The schedule used is determined by the schedule associated + # with the version in the rollout_plan resource. + # For now, only the "regular" strategy is configurable. agent_schedules: - # schedule is "regular" or "critical" + # rollout strategy must be "regular" for now regular: - # name of the group. Must only contain valid backend / resource name characters. - - name: staging-group - # days specifies the days of the week when the group may be updated. - # default: ["*"] (all days) - days: [“Sun”, “Mon”, ... | "*"] - # start_hour specifies the hour when the group may start upgrading. - # default: 0 - start_hour: 0-23 - # jitter_seconds specifies a maximum jitter duration after the start hour. - # The agent updater client will pick a random time within this duration to wait to update. - # default: 0 - jitter_seconds: 0-60 - # timeout_seconds specifies the amount of time, after the specified jitter, after which - # an agent update will be considered timed out if the version does not change. - # default: 60 - timeout_seconds: 30-900 - # failure_seconds specifies the amount of time after which an agent update will be considered - # failed if the agent heartbeat stops before the update is complete. - # default: 0 - failure_seconds: 0-900 - # max_in_flight specifies the maximum number of agents that may be updated at the same time. - # default: 100% - max_in_flight: 0-100% - # max_timeout_before_halt specifies the percentage of clients that may time out before this group - # and all dependent groups are halted. - # default: 10% - max_timeout_before_halt: 0-100% - # max_failed_before_halt specifies the percentage of clients that may fail before this group - # and all dependent groups are halted. - # default: 0 - max_failed_before_halt: 0-100% - # requires specifies groups that must pass with the current version before this group is allowed - # to run using that version. - requires: ["test-group"] + # name of the group. Must only contain valid backend / resource name characters. + - name: staging + # days specifies the days of the week when the group may be updated. + # default: ["*"] (all days) + days: [ “Sun”, “Mon”, ... | "*" ] + # start_hour specifies the hour when the group may start upgrading. + # default: 0 + start_hour: 0-23 + # wait_days specifies how many days to wait after the previous group finished before starting. + # default: 0 + wait_days: 0-1 + # jitter_seconds specifies a maximum jitter duration after the start hour. + # The agent updater client will pick a random time within this duration to wait to update. + # default: 5 + jitter_seconds: 0-60 + # max_in_flight specifies the maximum number of agents that may be updated at the same time. + # Only valid for the backpressure strategy. 
+ # default: 20% + max_in_flight: 10-100% + # alert_after specifies the duration after which a cluster alert will be set if the rollout has + # not completed. + # default: 4h + alert_after: 1h + # ... ``` +Default resource: +```yaml +kind: autoupdate_config +spec: + agent_autoupdate: enable + agent_schedules: + regular: + - name: default + days: ["*"] + start_hour: 0 + jitter_seconds: 5 + max_in_flight: 20% + alert_after: 4h +``` + Dependency cycles are rejected. Dependency chains longer than a week will be rejected. Otherwise, updates could take up to 7 weeks to propagate. -The updater will receive `agent_autoupdate: true` from the time is it designated for update until the version changes in `autoupdate_version`. -After 24 hours, the update is halted in-place, and the group is considered failed if unfinished. +The update proceeds from the first group to the last group, ensuring that each group successfully updates before allowing the next group to proceed. + +The updater will receive `agent_autoupdate: true` from the time is it designated for update until the `target_version` in `autoupdate_agent_plan` (below) changes. +Changing the `target_version` resets the schedule immediately, clearing all progress. + +Changing the `current_version` in `autoupdate_agent_plan` changes the advertised `current_version` for all unfinished groups. -Changing the version or schedule completely resets progress. -Releasing new client versions multiple times a week has the potential to starve dependent groups from updates. +Changing `agent_schedules` will preserve the `state` of groups that have the same name before and after the change. +However, any changes to `agent_schedules` that occur while a group is active will be rejected. + +Releasing new agent versions multiple times a week has the potential to starve dependent groups from updates. Note that the `default` schedule applies to agents that do not specify a group name. ```shell # configuration -$ tctl autoupdate update--set-agent-auto-update=off +$ tctl autoupdate update --set-agent-auto-update=off Automatic updates configuration has been updated. -$ tctl autoupdate update --schedule regular --group staging-group --set-start-hour=3 +$ tctl autoupdate update --group staging-group --set-start-hour=3 Automatic updates configuration has been updated. -$ tctl autoupdate update --schedule regular --group staging-group --set-jitter-seconds=60 +$ tctl autoupdate update --group staging-group --set-jitter-seconds=60 Automatic updates configuration has been updated. -$ tctl autoupdate update --schedule regular --default --set-jitter-seconds=60 +$ tctl autoupdate update --group default --set-jitter-seconds=60 Automatic updates configuration has been updated. $ tctl autoupdate reset Automatic updates configuration has been reset to defaults. @@ -309,16 +174,46 @@ $ tctl autoupdate run --group staging-group Executing auto-update for group 'staging-group' immediately. ``` +Notes: +- `autoupdate_agent_plan` is separate from `autoupdate_config` so that Cloud customers can be restricted from updating `autoupdate_agent_plan`, while maintaining control over the rollout. + +#### Rollout + ```yaml -kind: autoupdate_version +kind: autoupdate_agent_plan spec: - # agent_version is the version of the agent the cluster will advertise. - agent_version: X.Y.Z - # agent_version_schedule specifies the rollout schedule associated with the version. - # Currently, only critical, regular, and immediate schedules are permitted. 
- agent_version_schedule: regular|critical|immediate - - # ... + # current_version is the desired version for agents before their window. + current_version: A.B.C + # target_version is the desired version for agents after their window. + target_version: X.Y.Z + # schedule to use for the rollout + schedule: regular|immediate + # strategy to use for the rollout + # default: backpressure + strategy: backpressure|grouped + # paused specifies whether the rollout is paused + # default: enabled + autoupdate: enabled|disabled|paused +status: + groups: + # name of group + - name: staging + # start_time is the time the upgrade will start + start_time: 2020-12-09T16:09:53+00:00 + # initial_count is the number of connected agents at the start of the window + initial_count: 432 + # missing_count is the number of agents disconnected since the start of the rollout + present_count: 53 + # failed_count is the number of agents rolled-back since the start of the rollout + failed_count: 23 + # progress is the current progress through the rollout + progress: 0.532 + # state is the current state of the rollout (unstarted, active, done, rollback) + state: active + # last_update_time is the time of the previous update for the group + last_update_time: 2020-12-09T16:09:53+00:00 + # last_update_reason is the trigger for the last update + last_update_reason: rollback ``` ```shell @@ -328,45 +223,96 @@ $ tctl autoupdate update --set-agent-version=15.1.2 --critical Automatic updates configuration has been updated. ``` -Notes: -- `autoupdate_version` is separate from `autoupdate_config` so that Cloud customers can be restricted from updating `autoupdate_version`, while maintaining control over the rollout. +## Details - Teleport API -#### Rollout +Teleport proxies will be updated to serve the desired agent version and edition from `/v1/webapi/find`. +The version and edition served from that endpoint will be configured using new `autoupdate_agent_plan` resource. -```yaml -kind: agent_rollout_plan -spec: - # start time of the rollout - start_time: 0001-01-01T00:00:00Z - # target version of the rollout - version: X.Y.Z - # schedule that triggered the rollout - schedule: regular - # hosts updated by the rollout - host_count: 127 -status: - # current host index in rollout progress - last_active_host_index: 23 - # failed hosts - failed_host_count: 3 - # timed-out hosts - timeout_host_count: 1 +Whether the Teleport updater querying the endpoint is instructed to upgrade (via the `agent_autoupdate` field) is dependent on: +- The `host=[uuid]` parameter sent to `/v1/webapi/find` +- The `group=[name]` parameter sent to `/v1/webapi/find` +- The schedule defined in the new `autoupdate_config` resource +- The status of past agent upgrades for the given version + +To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to `/v1/webapi/find`. +Teleport auth servers use their access to agent heartbeat data to drive the rollout, while Teleport proxies modulate the `/v1/webapi/find` response given the host UUID and group name. 
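+
+For illustration, a minimal sketch of this gating in Go, assuming the UUID-threshold rule spelled out under "Proxies" below (the names and helper here are illustrative, not an existing Teleport API):
+
+```go
+package rollout
+
+import (
+	"encoding/binary"
+	"math"
+
+	"github.com/google/uuid"
+)
+
+// shouldAutoUpdate sketches the proxy-side decision for an active group:
+// an agent is told to update once its UUID, normalized into [0, 1), falls
+// below the group's published progress. For brevity this uses only the
+// first 8 bytes of the UUID rather than full 128-bit arithmetic.
+func shouldAutoUpdate(hostUUID uuid.UUID, progress float64) bool {
+	n := binary.BigEndian.Uint64(hostUUID[:8])
+	return float64(n)/float64(math.MaxUint64) < progress
+}
+```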
+
+Rollouts are specified as interdependent groups of hosts, selected by the upgrade group identifier specified in the agent's `/var/lib/teleport/versions/update.yaml` file, which is written via `teleport-update enable`:
+```shell
+$ teleport-update enable --proxy teleport.example.com --group staging
+```
+
+At the start of a group rollout, the Teleport auth servers record the initial number of connected agents.
+A fixed number of connected agents (`max_in_flight % x total`) are instructed to upgrade at the same time via `/v1/webapi/find`.
+Additional agents are instructed to update as earlier updates complete, never exceeding `max_in_flight`.
+Rollouts may be paused with `tctl autoupdate pause` or manually triggered with `tctl autoupdate run`.
+
+### Rollout
+
+Instance heartbeats will be extended to incorporate and send data that is written to `/var/lib/teleport/versions/update.yaml` by the `teleport-update` binary.
+
+The following data related to the rollout are stored in each instance heartbeat:
+- `agent_update_start_time`: timestamp of individual agent's upgrade time
+- `agent_update_current_version`: current agent version
+- `agent_update_rollback`: whether the agent was rolled-back automatically
+- `agent_update_uuid`: Auto-update UUID
+- `agent_update_group`: Auto-update group name
+
+Auth servers use their local instance inventory to calculate rollout statistics and write them to `/autoupdate/[group]/[auth ID]` (e.g., `/autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56`).
+
+Every minute, auth servers persist the version counts:
+- `version_counts[group][version]`
+  - `count`: number of currently connected agents at `version` in `group`
+  - `failed_count`: number of currently connected agents at `version` in `group` that experienced a rollback or inability to upgrade
+  - `lowest_uuid`: lowest UUID of all currently connected agents at `version` in `group`
+
+At the start of each group's window, auth servers persist an initial count:
+- `initial_counts[group]`
+  - `count`: number of connected agents in `group` at start of window
+
+Expiration time of the persisted key is 1 hour.
+
+To progress the rollout, auth servers will range-read keys from `/autoupdate/[group]/*`, sum the counts, and write back to the `autoupdate_agent_plan` status on a one-minute interval.
+- To calculate the initial number of agents connected at the start of the window, each auth server will write the summed count of agents to `autoupdate_agent_plan` status, if not already written.
+- To determine the progress through the rollout, auth servers will write the calculated progress to the `autoupdate_agent_plan` status using the formulas, declining to write if the current written progress is further ahead.
+
+If `/autoupdate/[group]/[auth ID]` is older than 1 minute, we do not consider its contents.
+This prevents double-counting agents when auth servers are killed.
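+
+A sketch of that aggregation step, assuming one persisted entry per auth server per group and the one-minute staleness cutoff described above (types and names are illustrative, not the real backend API):
+
+```go
+package rollout
+
+import "time"
+
+// versionCounts is one auth server's entry under /autoupdate/[group]/[auth ID],
+// stamped with the time it was written.
+type versionCounts struct {
+	WriteTime time.Time
+	Count     map[string]int // connected agents in the group, by version
+	Failed    map[string]int // rolled-back or stuck agents, by version
+}
+
+// sumGroupCounts folds per-auth-server entries into cluster-wide totals,
+// skipping entries older than one minute so that counts written by auth
+// servers that have since died are not double-counted.
+func sumGroupCounts(entries []versionCounts, now time.Time) (count, failed map[string]int) {
+	count, failed = map[string]int{}, map[string]int{}
+	for _, e := range entries {
+		if now.Sub(e.WriteTime) > time.Minute {
+			continue // stale entry: the writing auth server is presumed gone
+		}
+		for v, n := range e.Count {
+			count[v] += n
+		}
+		for v, n := range e.Failed {
+			failed[v] += n
+		}
+	}
+	return count, failed
+}
+```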
-To solve this in the future, we can use the group name (provided to `/v1/webapi/find` and specified via `teleport-update enable`) to determine which version should be served. +#### Progress Formulas -This will require tracking the desired version of groups in the backend, which will add additional complexity to the rollout logic. +Each auth server will calculate the progress as `( max_in_flight * initial_counts[group].count + version_counts[group][target_version].count ) / initial_counts[group].count` and write the progress to `autoupdate_agent_plan` status. +This formula determines the progress percentage by adding a `max_in_flight` percentage-window above the number of currently updated agents in the group. + +However, if `as_numeral(version_counts[group][not(target_version)].lowest_uuid) / as_numeral(max_uuid)` is above the calculated progress, that progress value will be used instead. +This protects against a statistical deadlock, where no UUIDs fall within the next `max_in_flight` window of UUID space, by always permitting the next non-updated agent to update. + +To ensure that the rollout is halted if more than `max_in_flight` un-updated agents drop off, an addition restriction must be imposed for the rollout to proceed: +`version_counts[group][*].count > initial_counts[group].count - max_in_flight * initial_counts[group].count` + +To prevent double-counting of agents when considering all counts across all auth servers, only agents connected for one minute will be considered in these formulas. + +#### Proxies + +When the updater queries the proxy via `/v1/webapi/find?host=[uuid]&group=[name]`, the proxies query the `autoupdate_agent_plan` status to determine the value of `agent_autoupdate: true`. +The boolean is returned as `true` in the case that the provided `host` contains a UUID that is under the progress percentage for the `group`: +`as_numeral(host_uuid) / as_numeral(max_uuid) < progress` + +### REST Endpoints + +`/v1/webapi/find?host=[uuid]&group=[name]` +```json +{ + "server_edition": "enterprise", + "agent_version": "15.1.1", + "agent_autoupdate": true, + "agent_update_jitter_seconds": 10 +} +``` +Notes: +- Agents will only update if `agent_autoupdate` is `true`, but new installations will use `agent_version` regardless of the value in `agent_autoupdate`. +- The edition served is the cluster edition (enterprise, enterprise-fips, or oss), and cannot be configured. +- The UUID and group name are read from `/var/lib/teleport/versions/update.yaml` by the updater. ## Details - Linux Agents @@ -380,7 +326,7 @@ Source code for the updater will live in the main Teleport repository, with the ### Installation ```shell -$ apt-get install teleport-ent-updater +$ apt-get install teleport $ teleport-update enable --proxy example.teleport.sh # if not enabled already, configure teleport and: @@ -427,6 +373,16 @@ $ tree /var/lib/teleport │ └── etc │ └── systemd │ └── teleport.service + ├── system # if installed via OS package + │ ├── bin + │ │ ├── tsh + │ │ ├── tbot + │ │ ├── ... # other binaries + │ │ ├── teleport-update + │ │ └── teleport + │ └── etc + │ └── systemd + │ └── teleport.service └── update.yaml $ ls -l /usr/local/bin/tsh /usr/local/bin/tsh -> /var/lib/teleport/versions/15.0.0/bin/tsh @@ -444,9 +400,11 @@ $ ls -l /usr/local/lib/systemd/system/teleport.service This file stores configuration for `teleport-update`. +All updates are applied atomically using renameio. 
+ ``` version: v1 -kind: updates +kind: update_config spec: # proxy specifies the Teleport proxy address to retrieve the agent version and update configuration from. proxy: mytenant.teleport.sh @@ -457,8 +415,16 @@ spec: # enabled specifies whether auto-updates are enabled, i.e., whether teleport-update update is allowed to update the agent. enabled: true status: - # active_version specifies the active (symlinked) deployment of the telepport agent. + # start_time specifies the start time of the most recent update. + start_time: 2020-12-09T16:09:53+00:00 + # active_version specifies the active (symlinked) deployment of the teleport agent. active_version: 15.1.1 + # version_history specifies the previous deployed versions, in order by recency. + version_history: ["15.1.3", "15.0.4"] + # rollback specifies whether the most recent version was deployed by an automated rollback. + rollback: true + # error specifies the last error encounted + error: "" ``` #### backup.yaml @@ -479,7 +445,7 @@ spec: ### Runtime -The agent-updater will run as a periodically executing systemd service which runs every 10 minutes. +The `teleport-update` binary will run as a periodically executing systemd service which runs every 10 minutes. The systemd service will run: ```shell $ teleport-update update @@ -490,7 +456,7 @@ After it is installed, the `update` subcommand will no-op when executed until co $ teleport-update enable --proxy mytenant.teleport.sh --group staging ``` -If the proxy address is not provided with `--proxy`, the current proxy address from `teleport.yaml` is used. +If the proxy address is not provided with `--proxy`, the current proxy address from `teleport.yaml` is used, if present. The `enable` subcommand will change the behavior of `teleport-update update` to update teleport and restart the existing agent, if running. It will also run update teleport immediately, to ensure that subsequent executions succeed. @@ -498,9 +464,9 @@ It will also run update teleport immediately, to ensure that subsequent executio Both `update` and `enable` will maintain a shared lock file preventing any re-entrant executions. The `enable` subcommand will: -1. Query the `/v1/webapi/find` endpoint. -2. If the current updater-managed version of Teleport is the latest, and teleport package is not installed, jump to (16). -3. If the current updater-managed version of Teleport is the latest, but the teleport package is installed, jump to (13). +1. If an updater-incompatible version of the Teleport package is installed, fail immediately. +2. Query the `/v1/webapi/find` endpoint. +3. If the current updater-managed version of Teleport is the latest, jump to (14). 4. Ensure there is enough free disk space to update Teleport via `unix.Statfs()` and `content-length` header from `HEAD` request. 5. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. 6. Download and verify the checksum (tarball URL suffixed with `.sha256`). @@ -510,10 +476,8 @@ The `enable` subcommand will: 10. Restart the agent if the systemd service is already enabled. 11. Set `active_version` in `update.yaml` if successful or not enabled. 12. Replace the symlinks/binaries and `/var/lib/teleport/proc/sqlite.db` and quit (exit 1) if unsuccessful. -13. Remove and purge any `teleport` package if installed. -14. Verify the symlinks to the active version still exists. -15. Remove all stored versions of the agent except the current version and last working version. -16. 
Configure `update.yaml` with the current proxy address and group, and set `enabled` to true. +13. Remove all stored versions of the agent except the current version and last working version. +14. Configure `update.yaml` with the current proxy address and group, and set `enabled` to true. The `disable` subcommand will: 1. Configure `update.yaml` to set `enabled` to false. @@ -535,18 +499,21 @@ When `update` subcommand is otherwise executed, it will: 14. Replace the old symlinks/binaries and `/var/lib/teleport/proc/sqlite.db` and quit (exit 1) if unsuccessful. 15. Remove all stored versions of the agent except the current version and last working version. -To enable auto-updates of the updater itself, all commands will first check for an `active_version`, and reexec using the `teleport-update` at that version if present and different. +To guarantee auto-updates of the updater itself, all commands will first check for an `active_version`, and reexec using the `teleport-update` at that version if present and different. The `/usr/local/bin/teleport-update` symlink will take precedence to avoid reexec in most scenarios. To ensure that SELinux permissions do not prevent the `teleport-update` binary from installing/removing Teleport versions, the updater package will configure SELinux contexts to allow changes to all required paths. -To ensure that `teleport` package removal does not interfere with `teleport-update`, package removal will run `apt purge` (or `yum` equivalent) while ensuring that `/etc/teleport.yaml` and `/var/lib/teleport` are not purged. -Failure to do this could result in `/etc/teleport.yaml` being removed when an operator runs `apt purge` at a later date. - -To ensure that `teleport` package removal does not lead to a hard restart of Teleport, the updater will ensure that the package is removed without triggering needrestart or similar services. - To ensure that backups are consistent, the updater will use the [SQLite backup API](https://www.sqlite.org/backup.html) to perform the backup. +The `teleport` apt and yum packages contain a system installation of Teleport in `/var/lib/teleport/versions/system`. +Post package installation, the `link` subcommand is executed automatically to link the system installation when no auto-updater-managed version of Teleport is linked: +``` +/usr/local/bin/teleport -> /var/lib/teleport/versions/system/bin/teleport +/usr/local/bin/teleport-updater -> /var/lib/teleport/versions/system/bin/teleport-updater +... +``` + #### Failure Conditions If the new version of Teleport fails to start, the installation of Teleport is reverted as described above. @@ -563,9 +530,6 @@ To retrieve known information about agent updates, the `status` subcommand will "agent_version_installed": "15.1.1", "agent_version_desired": "15.1.2", "agent_version_previous": "15.1.0", - "agent_edition_installed": "enterprise", - "agent_edition_desired": "enterprise", - "agent_edition_previous": "enterprise", "agent_update_time_last": "2020-12-10T16:00:00+00:00", "agent_update_time_jitter": 600, "agent_updates_enabled": true @@ -618,6 +582,8 @@ This workflow supports customers that cannot use the auto-update mechanism provi Cluster administrators that want to self-manage agent updates may manually query the `/v1/webapi/find` endpoint using the host UUID, and implement auto-updates with their own automation. +Cluster administrators that choose this path may use the `teleport` package without auto-updates enabled locally. 
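+
+As an illustration, such automation might poll the endpoint as sketched below; the response fields follow the REST Endpoints section above, while the tenant address and placeholder UUID are examples, and a real implementation would add retries, jitter, and its own install step:
+
+```go
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+	"log"
+	"net/http"
+	"net/url"
+)
+
+// findResponse mirrors the documented /v1/webapi/find fields used for updates.
+type findResponse struct {
+	AgentVersion    string `json:"agent_version"`
+	AgentAutoUpdate bool   `json:"agent_autoupdate"`
+	JitterSeconds   int    `json:"agent_update_jitter_seconds"`
+}
+
+func main() {
+	q := url.Values{"host": {"<host-uuid>"}, "group": {"staging"}}
+	resp, err := http.Get("https://mytenant.teleport.sh/v1/webapi/find?" + q.Encode())
+	if err != nil {
+		log.Fatal(err)
+	}
+	defer resp.Body.Close()
+	var f findResponse
+	if err := json.NewDecoder(resp.Body).Decode(&f); err != nil {
+		log.Fatal(err)
+	}
+	if f.AgentAutoUpdate {
+		fmt.Printf("update to %s after up to %ds of jitter\n", f.AgentVersion, f.JitterSeconds)
+		// invoke organization-specific install tooling here
+	}
+}
+```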
+ ### Installers The following install scripts will be updated to install the latest updater and run `teleport-update enable` with the proxy address: @@ -632,10 +598,10 @@ Eventually, additional logic from the scripts could be added to `teleport-update Moving additional logic into the updater is out-of-scope for this proposal. To create pre-baked VM or container images that reduce the complexity of the cluster joining operation, two workflows are permitted: -- Install the `teleport-ent-updater` package and defer `teleport-update enable`, Teleport configuration, and `systemctl enable teleport` to cloud-init scripts. +- Install the `teleport` package and defer `teleport-update enable`, Teleport configuration, and `systemctl enable teleport` to cloud-init scripts. This allows both the proxy address and token to be injected at VM initialization. The VM image may be used with any Teleport cluster. Installers scripts will continue to function, as the package install operation will no-op. -- Install the `teleport-ent-updater` package and run `teleport-update enable` before the image is baked, but defer final Teleport configuration and `systemctl enable teleport` to cloud-init scripts. +- Install the `teleport` package and run `teleport-update enable` before the image is baked, but defer final Teleport configuration and `systemctl enable teleport` to cloud-init scripts. This allows the proxy address to be pre-set in the image. `teleport.yaml` can be partially configured during image creation. At minimum, the token must be injected via cloud-init scripts. Installers scripts would be skipped in favor of the `teleport configure` command. @@ -666,9 +632,12 @@ Rollbacks for the Kubernetes updater, as well as packaging changes to improve UX ## Migration -The existing update scheduling system will remain in-place until the old auto-updater is fully deprecated. +The existing update system will remain in-place until the old auto-updater is fully deprecated. + +Both update systems can co-exist on the same machine. +The old auto-updater will update the system package, which will not affect the `teleport-update`-managed installation. -Eventually, the `cluster_maintenance_config` resource will be deprecated. +Eventually, the `cluster_maintenance_config` resource and `teleport-ent-upgrader` package will be deprecated. ## Security @@ -717,19 +686,14 @@ service AutoUpdateService { // ResetAutoUpdateConfig restores the autoupdate config to default values. rpc ResetAutoUpdateConfig(ResetAutoUpdateConfigRequest) returns (AutoUpdateConfig); - // GetAutoUpdateVersion returns the autoupdate version. - rpc GetAutoUpdateVersion(GetAutoUpdateVersionRequest) returns (AutoUpdateVersion); - // CreateAutoUpdateVersion creates the autoupdate version. - rpc CreateAutoUpdateVersion(CreateAutoUpdateVersionRequest) returns (AutoUpdateVersion); - // UpdateAutoUpdateVersion updates the autoupdate version. - rpc UpdateAutoUpdateVersion(UpdateAutoUpdateVersionRequest) returns (AutoUpdateVersion); - // UpsertAutoUpdateVersion overwrites the autoupdate version. - rpc UpsertAutoUpdateVersion(UpsertAutoUpdateVersionRequest) returns (AutoUpdateVersion); - - // GetAgentRolloutPlan returns the agent rollout plan and current progress. - rpc GetAgentRolloutPlan(GetAgentRolloutPlanRequest) returns (AgentRolloutPlan); - // GetAutoUpdateVersion streams the agent rollout plan's list of all hosts. 
-  rpc GetAgentRolloutPlanHosts(GetAgentRolloutPlanHostsRequest) returns (stream AgentRolloutPlanHost);
+  // GetAutoUpdateAgentPlan returns the autoupdate plan for agents.
+  rpc GetAutoUpdateAgentPlan(GetAutoUpdateAgentPlanRequest) returns (AutoUpdateAgentPlan);
+  // CreateAutoUpdateAgentPlan creates the autoupdate plan for agents.
+  rpc CreateAutoUpdateAgentPlan(CreateAutoUpdateAgentPlanRequest) returns (AutoUpdateAgentPlan);
+  // UpdateAutoUpdateAgentPlan updates the autoupdate plan for agents.
+  rpc UpdateAutoUpdateAgentPlan(UpdateAutoUpdateAgentPlanRequest) returns (AutoUpdateAgentPlan);
+  // UpsertAutoUpdateAgentPlan overwrites the autoupdate plan for agents.
+  rpc UpsertAutoUpdateAgentPlan(UpsertAutoUpdateAgentPlanRequest) returns (AutoUpdateAgentPlan);
 }

 // GetAutoUpdateConfigRequest requests the contents of the AutoUpdateConfig.
@@ -779,8 +743,6 @@ message AutoUpdateConfigSpec {
 message AgentAutoUpdateSchedules {
   // regular schedules for non-critical versions.
   repeated AgentAutoUpdateGroup regular = 1;
-  // critical schedules for urgently needed versions.
-  repeated AgentAutoUpdateGroup critical = 2;
 }

 // AgentAutoUpdateGroup specifies the update schedule for a group of agents.
@@ -799,12 +761,6 @@ message AgentAutoUpdateGroup {
   int32 failure_seconds = 6;
   // max_in_flight specifies agents that can be updated at the same time, by percent.
   string max_in_flight = 7;
-  // max_timeout_before_halt specifies agents that can timeout before the rollout is halted, by percent.
-  string max_timeout_before_halt = 8;
-  // max_failed_before_halt specifies agents that can fail before the rollout is halted, by percent.
-  string max_failed_before_halt = 9;
-  // requires specifies rollout groups that must succeed for the current version/schedule before this rollout can run.
-  repeated string requires = 10;
 }

 // Day of the week
@@ -820,29 +776,29 @@ enum Day {
   DAY_SATURDAY = 8;
 }

-// GetAutoUpdateVersionRequest requests the autoupdate_version singleton resource.
-message GetAutoUpdateVersionRequest {}
+// GetAutoUpdateAgentPlanRequest requests the autoupdate_agent_plan singleton resource.
+message GetAutoUpdateAgentPlanRequest {}

-// GetAutoUpdateVersionRequest requests creation of the autoupdate_version singleton resource.
-message CreateAutoUpdateVersionRequest {
-  // autoupdate_version resource contents
-  AutoUpdateVersion autoupdate_version = 1;
+// CreateAutoUpdateAgentPlanRequest requests creation of the autoupdate_agent_plan singleton resource.
+message CreateAutoUpdateAgentPlanRequest {
+  // autoupdate_agent_plan resource contents
+  AutoUpdateAgentPlan autoupdate_agent_plan = 1;
 }

-// GetAutoUpdateVersionRequest requests an update of the autoupdate_version singleton resource.
-message UpdateAutoUpdateVersionRequest {
-  // autoupdate_version resource contents
-  AutoUpdateVersion autoupdate_version = 1;
+// UpdateAutoUpdateAgentPlanRequest requests an update of the autoupdate_agent_plan singleton resource.
+message UpdateAutoUpdateAgentPlanRequest {
+  // autoupdate_agent_plan resource contents
+  AutoUpdateAgentPlan autoupdate_agent_plan = 1;
 }

-// GetAutoUpdateVersionRequest requests an upsert of the autoupdate_version singleton resource.
+// UpsertAutoUpdateAgentPlanRequest requests an upsert of the autoupdate_agent_plan singleton resource.
+message UpsertAutoUpdateAgentPlanRequest { + // autoupdate_agent_plan resource contents + AutoUpdateAgentPlan autoupdate_agent_plan = 1; } -// AutoUpdateVersion holds dynamic configuration settings for autoupdate versions. -message AutoUpdateVersion { +// AutoUpdateAgentPlan holds dynamic configuration settings for agent autoupdates. +message AutoUpdateAgentPlan { // kind is the kind of the resource. string kind = 1; // sub_kind is the sub kind of the resource. @@ -852,11 +808,13 @@ message AutoUpdateVersion { // metadata is the metadata of the resource. teleport.header.v1.Metadata metadata = 4; // spec is the spec of the resource. - AutoUpdateVersionSpec spec = 5; + AutoUpdateAgentPlanSpec spec = 5; + // status is the status of the resource. + AutoUpdateAgentPlanStatus status = 6; } -// AutoUpdateVersionSpec is the spec for the autoupdate version. -message AutoUpdateVersionSpec { +// AutoUpdateAgentPlanSpec is the spec for the autoupdate version. +message AutoUpdateAgentPlanSpec { // agent_version is the desired agent version for new rollouts. string agent_version = 1; // agent_version schedule is the schedule to use for rolling out the agent_version. @@ -869,54 +827,16 @@ enum Schedule { SCHEDULE_UNSPECIFIED = 0; // REGULAR update schedule SCHEDULE_REGULAR = 1; - // CRITICAL update schedule for critical bugs and vulnerabilities - SCHEDULE_CRITICAL = 2; // IMMEDIATE update schedule for updating all agents immediately - SCHEDULE_IMMEDIATE = 3; + SCHEDULE_IMMEDIATE = 2; } -// GetAgentRolloutPlanRequest requests an agent_rollout_plan. -message GetAgentRolloutPlanRequest { - // name of the agent_rollout_plan - string name = 1; -} - -// GetAgentRolloutPlanHostsRequest requests the ordered host UUIDs for an agent_rollout_plan. -message GetAgentRolloutPlanHostsRequest { - // name of the agent_rollout_plan - string name = 1; -} - -// AgentRolloutPlan defines a version update rollout consisting a fixed group of agents. -message AgentRolloutPlan { - // kind is the kind of the resource. - string kind = 1; - // sub_kind is the sub kind of the resource. - string sub_kind = 2; - // version is the version of the resource. - string version = 3; - // metadata is the metadata of the resource. - teleport.header.v1.Metadata metadata = 4; - // spec is the spec of the resource. - AgentRolloutPlanSpec spec = 5; - // status is the status of the resource. - AgentRolloutPlanStatus status = 6; -} - -// AutoUpdateVersionSpec is the spec for the AgentRolloutPlan. -message AgentRolloutPlanSpec { - // start_time of the rollout - google.protobuf.Timestamp start_time = 1; +// AutoUpdateAgentPlanStatus is the status for the AutoUpdateAgentPlan. +message AutoUpdateAgentPlanStatus { // version targetted by the rollout string version = 2; - // schedule that triggered the rollout - string schedule = 3; - // host_count of hosts to update - int64 host_count = 4; -} - -// AutoUpdateVersionStatus is the status for the AgentRolloutPlan. -message AgentRolloutPlanStatus { + // start_time of the rollout + google.protobuf.Timestamp start_time = 1; // last_active_host_index specifies the index of the last host that may be updated. int64 last_active_host_index = 1; // failed_host_count specifies the number of failed hosts. @@ -924,12 +844,6 @@ message AgentRolloutPlanStatus { // timeout_host_count specifies the number of timed-out hosts. 
int64 timeout_host_count = 3; } - -// AgentRolloutPlanHost identifies an agent by host ID -message AgentRolloutPlanHost { - // host_id of a host included in the rollout - string host_id = 1; -} ``` ## Alternatives @@ -958,13 +872,13 @@ Making the update boolean instruction available via the `/webapi/find` TLS endpo ## Execution Plan -1. Implement Teleport APIs for new scheduling system (without backpressure or group interdependence) +1. Implement Teleport APIs for new scheduling system (without backpressure strategy) 2. Implement new Linux server auto-updater in Go. 3. Implement changes to Kubernetes auto-updater. 4. Test extensively on all supported Linux distributions. 5. Prep documentation changes. -6. Release new updater via teleport-ent-updater package. +6. Release via `teleport` package. 7. Release documentation changes. 8. Communicate to users that they should update their updater. -9. Deprecate old auto-updater endpoints. +9. Begin deprecation of old auto-updater resources, packages, and endpoints. 10. Add group interdependence and backpressure features. From 1f6918e19b719a1d3820ad0e92c18ca28a7e8722 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 25 Sep 2024 14:00:48 -0700 Subject: [PATCH 071/105] canaries --- rfd/0169-auto-updates-linux-agents.md | 39 ++++++++++++++++++++------- 1 file changed, 30 insertions(+), 9 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 2b8f264ed2899..89527542fe8df 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -90,14 +90,18 @@ spec: # The agent updater client will pick a random time within this duration to wait to update. # default: 5 jitter_seconds: 0-60 + # canary_count specifies the desired number of canaries to update before any other agents + # are updated. + # default: 5 + canaries: 0-10 # max_in_flight specifies the maximum number of agents that may be updated at the same time. # Only valid for the backpressure strategy. # default: 20% max_in_flight: 10-100% # alert_after specifies the duration after which a cluster alert will be set if the rollout has # not completed. - # default: 4h - alert_after: 1h + # default: 4 + alert_after_hours: 1-8 # ... ``` @@ -206,6 +210,8 @@ status: present_count: 53 # failed_count is the number of agents rolled-back since the start of the rollout failed_count: 23 + # canaries is a list of updater UUIDs used for canary deployments + canaries: ["abc123-..."] # progress is the current progress through the rollout progress: 0.532 # state is the current state of the rollout (unstarted, active, done, rollback) @@ -243,8 +249,14 @@ $ teleport-update enable --proxy teleport.example.com --group staging ``` At the start of a group rollout, the Teleport auth servers record the initial number connected agents. -A fixed number of connected agents (`max_in_flight % x total`) are instructed to upgrade at the same time via `/v1/webapi/find`. +The number of updated and non-updated agents is tracked by the auth servers. + +If backpressure is enabled, a fixed number of connected agents (`max_in_flight % x total`) are instructed to upgrade at the same time via `/v1/webapi/find`. Additional agents are instructed to update as earlier updates complete, never exceeding `max_in_flight`. + +If canaries are enabled, a user-specified number of agents are updated first. +These agents must all update successfully for the rollout to proceed to the remaining agents. 
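+
+The gate implied by these rules might be computed as in the following sketch, where `maxInFlight` is the `max_in_flight` fraction (names are illustrative):
+
+```go
+package rollout
+
+import "math"
+
+// nextBatchSize returns how many additional agents may be instructed to
+// update, given the group's size at the start of the window, the number
+// currently mid-update, and any canaries still pending.
+func nextBatchSize(initial, inFlight, canariesPending int, maxInFlight float64) int {
+	if canariesPending > 0 {
+		return 0 // every canary must succeed before the group proceeds
+	}
+	window := int(math.Ceil(maxInFlight * float64(initial)))
+	if inFlight >= window {
+		return 0 // the in-flight window is full
+	}
+	return window - inFlight
+}
+```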
+ Rollouts may be paused with `tctl autoupdate pause` or manually triggered with `tctl autoupdate run`. ### Rollout @@ -269,11 +281,13 @@ Every minute, auth servers persist the version counts: At the start of each group's window, auth servers persist an initial count: - `initial_counts[group]` - `count`: number of connected agents in `group` at start of window + - `canaries`: list of updater UUIDs to use for canary deployments Expiration time of the persisted key is 1 hour. To progress the rollout, auth servers will range-read keys from `/autoupdate/[group]/*`, sum the counts, and write back to the `autoupdate_agent_plan` status on a one-minute interval. - To calculate the initial number of agents connected at the start of the window, each auth server will write the summed count of agents to `autoupdate_agent_plan` status, if not already written. +- To calculate the canaries, each auth server will write a random selection of all canaries to `autoupdate_agent_plan` status, if not already written. - To determine the progress through the rollout, auth servers will write the calculated progress to the `autoupdate_agent_plan` status using the formulas, declining to write if the current written progress is further ahead. If `/autoupdate/[group]/[auth ID]` is older than 1 minute, we do not consider its contents. @@ -312,7 +326,8 @@ The boolean is returned as `true` in the case that the provided `host` contains Notes: - Agents will only update if `agent_autoupdate` is `true`, but new installations will use `agent_version` regardless of the value in `agent_autoupdate`. - The edition served is the cluster edition (enterprise, enterprise-fips, or oss), and cannot be configured. -- The UUID and group name are read from `/var/lib/teleport/versions/update.yaml` by the updater. +- The group name is read from `/var/lib/teleport/versions/update.yaml` by the updater. +- The UUID is read from `/tmp/teleport_update_uuid`, which `teleport-update` regenerates when missing. ## Details - Linux Agents @@ -520,6 +535,8 @@ If the new version of Teleport fails to start, the installation of Teleport is r If `teleport-update` itself fails with an error, and an older version of `teleport-update` is available, the update will retry with the older version. +If the agent losses its connection to the proxy, `teleport-update` updates the agent to the group's current desired version immediately. + Known failure conditions caused by intentional configuration (e.g., updates disabled) will not trigger retry logic. #### Status @@ -872,13 +889,17 @@ Making the update boolean instruction available via the `/webapi/find` TLS endpo ## Execution Plan -1. Implement Teleport APIs for new scheduling system (without backpressure strategy) -2. Implement new Linux server auto-updater in Go. +1. Implement Teleport APIs for new scheduling system (without backpressure strategy, canaries, or completion tracking) +2. Implement new Linux server auto-updater in Go, including systemd-based rollbacks. 3. Implement changes to Kubernetes auto-updater. 4. Test extensively on all supported Linux distributions. 5. Prep documentation changes. -6. Release via `teleport` package. +6. Release via `teleport` package and script for packageless install. 7. Release documentation changes. -8. Communicate to users that they should update their updater. +8. Communicate to users that they should update to the new system. 9. Begin deprecation of old auto-updater resources, packages, and endpoints. -10. Add group interdependence and backpressure features. +10. 
Add healthcheck endpoint to Teleport agents and incorporate into rollback logic. +10. Add progress and completion checking. +10. Add canary functionality. +10. Add backpressure functionality if necessary. +11. Add DB backups if necessary. From 2e5ee9838b3f52b405479eedddc79b45268a21b9 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 25 Sep 2024 14:23:55 -0700 Subject: [PATCH 072/105] canary 2 --- rfd/0169-auto-updates-linux-agents.md | 27 ++++++++++++++------------- 1 file changed, 14 insertions(+), 13 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 89527542fe8df..999c39053cb8c 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -65,15 +65,15 @@ kind: autoupdate_config spec: # agent_autoupdate allows turning agent updates on or off at the # cluster level. Only turn agent automatic updates off if self-managed - # agent updates are in place. Setting this to pause will halt the rollout. + # agent updates are in place. Setting this to pause will temporarily halt the rollout. agent_autoupdate: disable|enable|pause # agent_schedules specifies version rollout schedules for agents. # The schedule used is determined by the schedule associated - # with the version in the rollout_plan resource. - # For now, only the "regular" strategy is configurable. + # with the version in the autoupdate_agent_plan resource. + # For now, only the "regular" schedule is configurable. agent_schedules: - # rollout strategy must be "regular" for now + # rollout schedule must be "regular" for now regular: # name of the group. Must only contain valid backend / resource name characters. - name: staging @@ -93,7 +93,7 @@ spec: # canary_count specifies the desired number of canaries to update before any other agents # are updated. # default: 5 - canaries: 0-10 + canary_count: 0-10 # max_in_flight specifies the maximum number of agents that may be updated at the same time. # Only valid for the backpressure strategy. # default: 20% @@ -117,6 +117,7 @@ spec: days: ["*"] start_hour: 0 jitter_seconds: 5 + canary_count: 5 max_in_flight: 20% alert_after: 4h ``` @@ -143,9 +144,9 @@ Note that the `default` schedule applies to agents that do not specify a group n # configuration $ tctl autoupdate update --set-agent-auto-update=off Automatic updates configuration has been updated. -$ tctl autoupdate update --group staging-group --set-start-hour=3 +$ tctl autoupdate update --group staging --set-start-hour=3 Automatic updates configuration has been updated. -$ tctl autoupdate update --group staging-group --set-jitter-seconds=60 +$ tctl autoupdate update --group staging --set-jitter-seconds=60 Automatic updates configuration has been updated. $ tctl autoupdate update --group default --set-jitter-seconds=60 Automatic updates configuration has been updated. 
@@ -159,11 +160,11 @@ Version: v1.2.4 Schedule: regular Groups: -staging-group: succeeded at 2024-01-03 23:43:22 UTC -prod-group: scheduled for 2024-01-03 23:43:22 UTC (depends on prod-group) -other-group: failed at 2024-01-05 22:53:22 UTC +staging: succeeded at 2024-01-03 23:43:22 UTC +prod: scheduled for 2024-01-03 23:43:22 UTC (depends on prod) +other: failed at 2024-01-05 22:53:22 UTC -$ tctl autoupdate status --group staging-group +$ tctl autoupdate status --group staging Status: succeeded Date: 2024-01-03 23:43:22 UTC Requires: (none) @@ -174,8 +175,8 @@ Failed: 15 (3%) Timed-out: 0 # re-running failed group -$ tctl autoupdate run --group staging-group -Executing auto-update for group 'staging-group' immediately. +$ tctl autoupdate run --group staging +Executing auto-update for group 'staging' immediately. ``` Notes: From 07adca5e1509f288a220299b55ca75114184e120 Mon Sep 17 00:00:00 2001 From: hugoShaka Date: Wed, 25 Sep 2024 18:19:44 -0400 Subject: [PATCH 073/105] describe state, transitions, and proxy response --- rfd/0169-auto-updates-linux-agents.md | 91 +++++++++++++++++++++++++-- 1 file changed, 87 insertions(+), 4 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 999c39053cb8c..80fb19bd6a50a 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -66,7 +66,7 @@ spec: # agent_autoupdate allows turning agent updates on or off at the # cluster level. Only turn agent automatic updates off if self-managed # agent updates are in place. Setting this to pause will temporarily halt the rollout. - agent_autoupdate: disable|enable|pause + agent_autoupdate_mode: disable|enable|pause # agent_schedules specifies version rollout schedules for agents. # The schedule used is determined by the schedule associated @@ -110,7 +110,7 @@ Default resource: ```yaml kind: autoupdate_config spec: - agent_autoupdate: enable + agent_autoupdate_mode: enable agent_schedules: regular: - name: default @@ -198,7 +198,7 @@ spec: strategy: backpressure|grouped # paused specifies whether the rollout is paused # default: enabled - autoupdate: enabled|disabled|paused + autoupdate_mode: enabled|disabled|paused status: groups: # name of group @@ -242,7 +242,7 @@ Whether the Teleport updater querying the endpoint is instructed to upgrade (via - The status of past agent upgrades for the given version To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to `/v1/webapi/find`. -Teleport auth servers use their access to agent heartbeat data to drive the rollout, while Teleport proxies modulate the `/v1/webapi/find` response given the host UUID and group name. +Teleport auth servers use their access to the instance inventory data to drive the rollout, while Teleport proxies modulate the `/v1/webapi/find` response given the host UUID and group name. Rollouts are specified as interdependent groups of hosts, selected by upgrade group identifier specified in the agent's `/var/lib/teleport/versions/update.yaml` file, which is written via `teleport-update enable`: ```shell @@ -260,6 +260,89 @@ These agents must all update successfully for the rollout to proceed to the rema Rollouts may be paused with `tctl autoupdate pause` or manually triggered with `tctl autoupdate run`. +### Group states + +Let `v1` be the current version and `v2` the target version. + +A group can be in 5 state: +- unstarted: the group update has not been started yet. 
+- canary: a few canaries are getting updated. New agents should run `v1`. Existing agents should not attempt to update and keep their existing version.
+- active: the group is actively getting updated. New agents should run `v2`, existing agents are instructed to update to `v2`.
+- done: the group has been updated. New agents should run `v2`.
+- rolledback: the group has been rolled back. New agents should run `v1`, existing agents should update to `v1`.
+
+The finite state machine is the following:
+```mermaid
+flowchart TD
+    unstarted((unstarted))
+    canary((canary))
+    active((active))
+    done((done))
+    rolledback((rolledback))
+
+    unstarted -->|StartGroup
MaintenanceTriggerOK| canary + canary -->|canary came back alive| active + canary -->|ForceGroup| done + canary -->|RollbackGroup| rolledback + active -->|ForceGroup
Success criteria met| done + done -->|RollbackGroup| rolledback + active -->|RollbackGroup| rolledback + + canary -->|ResetGroup| canary + active -->|ResetGroup| active +``` + +### Agent auto update modes + +The agent auto update mode is specified by both Cloud (via `autoupdate_agent_plan`) +and by the customer (via `autoupdate_config`). + +The agent update mode can take 3 values: + +1. disabled: teleport should not manage agent updates +2. paused: the updates are temporarily suspended, we honour the existing rollout state +3. enabled: teleport can update agents + +The cluster agent rollout mode is computed by taking the lowest value. +For example: + +- cloud says `enabled` and the customer says `enabled` -> the updates are `enabled` +- cloud says `enabled` and the customer says `suspended` -> the updates are `suspended` +- cloud says `disabled` and the customer says `suspended` -> the updates are `disabled` +- cloud says `disabled` and the customer says `enabled` -> the updates are `disabled` + +### Proxy answer + +The proxy response contains two parts related to automatic updates: +- the target version of the requested group +- if the agent should be updated + +#### Rollout status: disabled + +| Group state | Version | Should update | +|-------------|---------|---------------| +| * | v2 | false | + +#### Rollout status: paused + +| Group state | Version | Should update | +|-------------|---------|---------------| +| unstarted | v1 | false | +| canary | v1 | false | +| active | v2 | false | +| done | v2 | false | +| rolledback | v1 | false | + +#### Rollout status: enabled + +| Group state | Version | Should update | +|-------------|---------|----------------------------| +| unstarted | v1 | false | +| canary | v1 | false, except for canaries | +| active | v2 | true if UUID <= progress | +| done | v2 | true | +| rolledback | v1 | true | + ### Rollout Instance heartbeats will be extended to incorporate and send data that is written to `/var/lib/teleport/versions/update.yaml` by the `teleport-update` binary. From c903f1eef8aa8404c0543e8ea8c5939bda3696f4 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 25 Sep 2024 15:29:28 -0700 Subject: [PATCH 074/105] rpcs --- rfd/0169-auto-updates-linux-agents.md | 66 +++++++++++++++++++-------- 1 file changed, 48 insertions(+), 18 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 80fb19bd6a50a..e83670dbea421 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -131,7 +131,7 @@ The update proceeds from the first group to the last group, ensuring that each g The updater will receive `agent_autoupdate: true` from the time is it designated for update until the `target_version` in `autoupdate_agent_plan` (below) changes. Changing the `target_version` resets the schedule immediately, clearing all progress. -Changing the `current_version` in `autoupdate_agent_plan` changes the advertised `current_version` for all unfinished groups. +Changing the `start_version` in `autoupdate_agent_plan` changes the advertised `start_version` for all unfinished groups. Changing `agent_schedules` will preserve the `state` of groups that have the same name before and after the change. However, any changes to `agent_schedules` that occur while a group is active will be rejected. @@ -187,8 +187,8 @@ Notes: ```yaml kind: autoupdate_agent_plan spec: - # current_version is the desired version for agents before their window. 
-  current_version: A.B.C
+  # start_version is the desired version for agents before their window.
+  start_version: A.B.C
   # target_version is the desired version for agents after their window.
   target_version: X.Y.Z
   # schedule to use for the rollout
@@ -349,7 +349,7 @@ Instance heartbeats will be extended to incorporate and send data that is writte
 
 The following data related to the rollout are stored in each instance heartbeat:
 - `agent_update_start_time`: timestamp of individual agent's upgrade time
-- `agent_update_current_version`: current agent version
+- `agent_update_start_version`: version the agent is updating from
 - `agent_update_rollback`: whether the agent was rolled-back automatically
 - `agent_update_uuid`: Auto-update UUID
 - `agent_update_group`: Auto-update group name
@@ -751,7 +751,7 @@ are signed.
 
 The Update Framework (TUF) will be used to implement secure updates in the future.
 
-Anyone who possesses a host UUID can determine when that host is scheduled to update by repeatedly querying the public `/v1/webapi/find` endpoint.
+Anyone who possesses an updater UUID can determine when that host is scheduled to update by repeatedly querying the public `/v1/webapi/find` endpoint.
 It is not possible to discover the current version of that host, only the designated update window.
 
 ## Logging
@@ -834,8 +834,8 @@ message AutoUpdateConfig {
 
 // AutoUpdateConfigSpec is the spec for the autoupdate config.
 message AutoUpdateConfigSpec {
-  // agent_autoupdate specifies whether agent autoupdates are enabled.
-  bool agent_autoupdate = 1;
+  // agent_autoupdate_mode specifies whether agent autoupdates are enabled, disabled, or paused.
+  Mode agent_autoupdate_mode = 1;
   // agent_schedules specifies schedules for updates of grouped agents.
   AgentAutoUpdateSchedules agent_schedules = 3;
 }
@@ -854,14 +854,16 @@ message AgentAutoUpdateGroup {
   repeated Day days = 2;
   // start_hour to initiate update
   int32 start_hour = 3;
-  // jitter_seconds to introduce before update as rand([0, jitter_seconds]).
-  int32 jitter_seconds = 4;
-  // timeout_seconds before an agent is considered time-out (no version change)
-  int32 timeout_seconds = 5;
-  // failure_seconds before an agent is considered failed (loses connection)
-  int32 failure_seconds = 6;
+  // wait_days after last group succeeds before this group can run
+  int32 wait_days = 4;
+  // jitter_seconds to introduce before update as rand([0, jitter_seconds])
+  int32 jitter_seconds = 5;
+  // canary_count of agents to use in the canary deployment.
+  int32 canary_count = 6;
+  // alert_after_hours specifies the number of hours to wait before alerting that the rollout is not complete.
+  int32 alert_after_hours = 7;
   // max_in_flight specifies agents that can be updated at the same time, by percent.
-  string max_in_flight = 7;
+  string max_in_flight = 8;
 }
 
 // Day of the week
@@ -877,6 +879,18 @@ enum Day {
   DAY_SATURDAY = 8;
 }
 
+// Mode of operation
+enum Mode {
+  // UNSPECIFIED update mode
+  MODE_UNSPECIFIED = 0;
+  // DISABLE updates
+  MODE_DISABLE = 1;
+  // ENABLE updates
+  MODE_ENABLE = 2;
+  // PAUSE updates
+  MODE_PAUSE = 3;
+}
+
 // GetAutoUpdateAgentPlanRequest requests the autoupdate_agent_plan singleton resource.
 message GetAutoUpdateAgentPlanRequest {}
 
@@ -916,10 +930,16 @@ message AutoUpdateAgentPlan {
 
 // AutoUpdateAgentPlanSpec is the spec for the autoupdate version.
 message AutoUpdateAgentPlanSpec {
-  // agent_version is the desired agent version for new rollouts.
-  string agent_version = 1;
-  // agent_version schedule is the schedule to use for rolling out the agent_version.
- Schedule agent_version_schedule = 2; + // start_version is the version to update from. + string start_version = 1; + // target_version is the version to update to. + string target_version = 2; + // schedule to use for the rollout + Schedule schedule = 3; + // strategy to use for the rollout + Strategy strategy = 4; + // autoupdate_mode to use for the rollout + Mode autoupdate_mode = 5; } // Schedule type for the rollout @@ -932,6 +952,16 @@ enum Schedule { SCHEDULE_IMMEDIATE = 2; } +// Strategy type for the rollout +enum Strategy { + // UNSPECIFIED update strategy + STRATEGY_UNSPECIFIED = 0; + // GROUPED update schedule, with no backpressure + STRATEGY_GROUPED = 1; + // BACKPRESSURE update schedule + STRATEGY_BACKPRESSURE = 2; +} + // AutoUpdateAgentPlanStatus is the status for the AutoUpdateAgentPlan. message AutoUpdateAgentPlanStatus { // version targetted by the rollout From 4a63316bd7052ca1dbf497c9a306d4306d75056d Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 25 Sep 2024 15:50:26 -0700 Subject: [PATCH 075/105] finish rpcs --- rfd/0169-auto-updates-linux-agents.md | 63 +++++++++++++++++++++------ 1 file changed, 50 insertions(+), 13 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index e83670dbea421..cf3abf8b56a13 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -855,13 +855,13 @@ message AgentAutoUpdateGroup { // start_hour to initiate update int32 start_hour = 3; // wait_days after last group succeeds before this group can run - int32 wait_days = 4; + int64 wait_days = 4; + // alert_after_hours specifies the number of hours to wait before alerting that the rollout is not complete. + int64 alert_after_hours = 5; // jitter_seconds to introduce before update as rand([0, jitter_seconds]) - int32 jitter_seconds = 5; + int64 jitter_seconds = 6; // canary_count of agents to use in the canary deployment. - int32 canary_count = 6; - // alert_after_hours specifies the number of hours to wait before alerting that the rollout is not complete. - int32 alert_after_hours = 7; + int64 canary_count = 7; // max_in_flight specifies agents that can be updated at the same time, by percent. string max_in_flight = 8; } @@ -964,17 +964,54 @@ enum Strategy { // AutoUpdateAgentPlanStatus is the status for the AutoUpdateAgentPlan. message AutoUpdateAgentPlanStatus { - // version targetted by the rollout - string version = 2; + // name of the group + string name = 0; // start_time of the rollout google.protobuf.Timestamp start_time = 1; - // last_active_host_index specifies the index of the last host that may be updated. - int64 last_active_host_index = 1; - // failed_host_count specifies the number of failed hosts. - int64 failed_host_count = 2; - // timeout_host_count specifies the number of timed-out hosts. - int64 timeout_host_count = 3; + // initial_count is the number of connected agents at the start of the window. + int64 initial_count = 2; + // present_count is the current number of connected agents. + int64 present_count = 3; + // failed_count specifies the number of failed agents. + int64 failed_count = 4; + // canaries is a list of canary agents. + repeated Canary canaries = 5; + // progress is the current progress through the rollout. + float64 progress = 6; + // state is the current state of the rollout. + State state = 7; + // last_update_time is the time of the previous update for this group. 
+  google.protobuf.Timestamp last_update_time = 8;
+  // last_update_reason is the trigger for the last update
+  string last_update_reason = 9;
+}
+
+// Canary agent
+message Canary {
+  // update_uuid of the canary agent
+  string update_uuid = 1;
+  // host_uuid of the canary agent
+  string host_uuid = 2;
+  // hostname of the canary agent
+  string hostname = 3;
+}
+
+// State of the rollout
+enum State {
+  // UNSPECIFIED state
+  STATE_UNSPECIFIED = 0;
+  // UNSTARTED state
+  STATE_UNSTARTED = 1;
+  // CANARY state
+  STATE_CANARY = 2;
+  // ACTIVE state
+  STATE_ACTIVE = 3;
+  // DONE state
+  STATE_DONE = 4;
+  // ROLLEDBACK state
+  STATE_ROLLEDBACK = 5;
 }
+
 ```
 
 ## Alternatives

From 34c2cb78bdbbed8b38c1ef85bd2d313f444128c3 Mon Sep 17 00:00:00 2001
From: Stephen Levine <stephen.levine@goteleport.com>
Date: Fri, 27 Sep 2024 07:42:16 -0700
Subject: [PATCH 076/105] minor tweaks

---
 rfd/0169-auto-updates-linux-agents.md | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index cf3abf8b56a13..f372b7759f9c6 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -357,14 +357,12 @@ The following data related to the rollout are stored in each instance heartbeat:
 Auth servers use their local instance inventory to calculate rollout statistics and write them to `/autoupdate/[group]/[auth ID]` (e.g., `/autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56`).
 
 Every minute, auth servers persist the version counts:
-- `version_counts[group][version]`
+- `agent_data[group].stats[version]`
   - `count`: number of currently connected agents at `version` in `group`
   - `failed_count`: number of currently connected agents at `version` in `group` that experienced a rollback or inability to upgrade
   - `lowest_uuid`: lowest UUID of all currently connected agents at `version` in `group`
-
-At the start of each group's window, auth servers persist an initial count:
-- `initial_counts[group]`
-  - `count`: number of connected agents in `group` at start of window
+  - `initial_count`: number of connected agents at `version` in `group` at start of window
+- `agent_data[group]`
   - `canaries`: list of updater UUIDs to use for canary deployments
 
 Expiration time of the persisted key is 1 hour.
 
 This prevents double-counting agents when auth servers are killed.
 
 #### Progress Formulas
 
-Each auth server will calculate the progress as `( max_in_flight * initial_counts[group].count + version_counts[group][target_version].count ) / initial_counts[group].count` and write the progress to `autoupdate_agent_plan` status.
+Given:
+```
+initial_count[group] = sum(agent_data[group].stats[*].initial_count)
+```
+
+Each auth server will calculate the progress as `( max_in_flight * initial_count[group] + agent_data[group].stats[target_version].count ) / initial_count[group]` and write the progress to `autoupdate_agent_plan` status.
 This formula determines the progress percentage by adding a `max_in_flight` percentage-window above the number of currently updated agents in the group.
 
-However, if `as_numeral(version_counts[group][not(target_version)].lowest_uuid) / as_numeral(max_uuid)` is above the calculated progress, that progress value will be used instead.
+However, if `as_numeral(agent_data[group].stats[not(target_version)].lowest_uuid) / as_numeral(max_uuid)` is above the calculated progress, that progress value will be used instead.
This protects against a statistical deadlock, where no UUIDs fall within the next `max_in_flight` window of UUID space, by always permitting the next non-updated agent to update.
 
 To ensure that the rollout is halted if more than `max_in_flight` un-updated agents drop off, an additional restriction must be imposed for the rollout to proceed:
-`version_counts[group][*].count > initial_counts[group].count - max_in_flight * initial_counts[group].count`
+`sum(agent_data[group].stats[*].count) > initial_count[group] - max_in_flight * initial_count[group]`
 
 To prevent double-counting of agents when considering all counts across all auth servers, only agents connected for one minute will be considered in these formulas.
@@ -977,7 +980,7 @@ message AutoUpdateAgentPlanStatus {
   // canaries is a list of canary agents.
   repeated Canary canaries = 5;
   // progress is the current progress through the rollout.
-  float64 progress = 6;
+  float progress = 6;
   // state is the current state of the rollout.
   State state = 7;
@@ -994,6 +997,8 @@ message Canary {
   string host_uuid = 2;
   // hostname of the canary agent
   string hostname = 3;
+  // success state of the canary agent
+  bool success = 4;
 }
 
 // State of the rollout

From 768c283d0e0e11ca88d2a11ad60c4514e337ca17 Mon Sep 17 00:00:00 2001
From: hugoShaka <hugo.hervieux@goteleport.com>
Date: Mon, 30 Sep 2024 09:28:18 -0400
Subject: [PATCH 077/105] Add user stories

---
 rfd/0169-auto-updates-linux-agents.md | 342 +++++++++++++++++++++++++-
 1 file changed, 341 insertions(+), 1 deletion(-)

diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md
index f372b7759f9c6..c22abdf66fa35 100644
--- a/rfd/0169-auto-updates-linux-agents.md
+++ b/rfd/0169-auto-updates-linux-agents.md
@@ -54,7 +54,347 @@ We must provide a seamless, hands-off experience for auto-updates of Teleport Ag
 
 ## UX
 
-[Hugo to add]
+### As Teleport Cloud I want to be able to update customers agents to a newer Teleport version
+
+<details>
+Before + +```yaml +kind: autoupdate_agent_plan +spec: + current_version: v1 + target_version: v2 + schedule: regular + strategy: grouped + autoupdate_mode: enabled +status: + groups: + - name: dev + start_time: 2020-12-09T16:09:53+00:00 + initial_count: 100 + present_count: 103 + failed_count: 2 + progress: 1 + state: active + last_update_time: 2020-12-09T16:09:53+00:00 + last_update_reason: success + - name: staging + start_time: 0000-00-00 + initial_count: 0 + present_count: 0 + failed_count: 0 + progress: 0 + state: unstarted + last_update_time: 2020-12-09T16:09:53+00:00 + last_update_reason: newAgentPlan +``` +
+ +I run +```bash +tctl autoupdate agent new-rollout v3 +# created new rollout from v2 to v3 +``` + +
+After + +```yaml +kind: autoupdate_agent_plan +spec: + current_version: v2 + target_version: v3 + schedule: regular + strategy: grouped + autoupdate_mode: enabled +status: + groups: + - name: dev + start_time: 0000-00-00 + initial_count: 0 + present_count: 0 + failed_count: 0 + progress: 0 + state: unstarted + last_update_time: 2020-12-10T16:09:53+00:00 + last_update_reason: newAgentPlan + - name: staging + start_time: 0000-00-00 + initial_count: 0 + present_count: 0 + failed_count: 0 + progress: 0 + state: unstarted + last_update_time: 2020-12-10T16:09:53+00:00 + last_update_reason: newAgentPlan +``` +
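+
+Which of the two versions a connecting agent is served follows mechanically from its group's state. A minimal Go
+sketch of that decision (type and function names here are illustrative assumptions, not the actual Teleport code):
+
+```go
+package main
+
+import "fmt"
+
+type GroupState int
+
+const (
+	Unstarted GroupState = iota
+	Canary
+	Active
+	Done
+	RolledBack
+)
+
+// advertisedVersion mirrors the rollout tables: groups that reached "active"
+// or "done" are served the target version, while "unstarted", "canary" and
+// "rolledback" groups stay on the start version.
+func advertisedVersion(startVersion, targetVersion string, state GroupState) string {
+	switch state {
+	case Active, Done:
+		return targetVersion
+	default:
+		return startVersion
+	}
+}
+
+func main() {
+	fmt.Println(advertisedVersion("v2", "v3", Unstarted)) // prints v2
+	fmt.Println(advertisedVersion("v2", "v3", Done))      // prints v3
+}
+```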
+ +Now, new agents will install v2 by default, and v3 after the maintenance. + +> [!NOTE] +> If the previous maintenance was not finished, I will install v2 on new prod agents while the rest of prod is still running v1. +> This is expected as we don't want to keep track of an infinite number of versions. +> +> If this is an issue I can create a v1 -> v3 rollout instead. +> +> ```bash +> tctl autoupdate agent new-rollout v3 --current-version v1 +> # created new update plan from v2 to v3 +> ``` + +### As Teleport Cloud I want to minimize the damage of a broken version to improve Teleport's availability to 99.99% + +#### Failure mode 1: the new version crashes + +I create a new deployment, with a broken version. The version is deployed to the canaries. +The canaries crash, the updater reverts the update, the agents connect back online and +advertise they rolled-back. The maintenance is stuck until the canaries are running the target version. + +
+Autoupdate agent plan + +```yaml +kind: autoupdate_agent_plan +spec: + current_version: v1 + target_version: v2 + schedule: regular + strategy: grouped + autoupdate_mode: enabled +status: + groups: + - name: dev + start_time: 2020-12-09T16:09:53+00:00 + initial_count: 100 + present_count: 100 + failed_count: 0 + progress: 0 + state: canaries + canaries: + - updater_id: abc + host_id: def + hostname: foo.example.com + success: false + last_update_time: 2020-12-10T16:09:53+00:00 + last_update_reason: canaryTesting + - name: staging + start_time: 0000-00-00 + initial_count: 0 + present_count: 0 + failed_count: 0 + progress: 0 + state: unstarted + last_update_time: 2020-12-10T16:09:53+00:00 + last_update_reason: newAgentPlan +``` +
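+
+The self-healing step this story relies on could look roughly like the hedged Go sketch below. Paths, the systemd
+unit name, and the health probe are assumptions for illustration, not the real `teleport-update` implementation:
+
+```go
+package main
+
+import (
+	"fmt"
+	"os"
+	"os/exec"
+	"time"
+)
+
+// updateWithRevert switches the version symlink, restarts the agent, and
+// reverts to the previous version if the new one does not stay healthy.
+func updateWithRevert(prevVersion, newVersion string) error {
+	if err := linkVersion(newVersion); err != nil {
+		return err
+	}
+	if restartAgent() == nil && stillHealthyAfter(30*time.Second) {
+		return nil // the new version survived its grace period
+	}
+	// Self-heal: put the previous version back and restart it.
+	if err := linkVersion(prevVersion); err != nil {
+		return fmt.Errorf("revert failed: %w", err)
+	}
+	return restartAgent()
+}
+
+func linkVersion(version string) error {
+	target := "/var/lib/teleport/versions/" + version + "/bin/teleport"
+	_ = os.Remove("/usr/local/bin/teleport")
+	return os.Symlink(target, "/usr/local/bin/teleport")
+}
+
+func restartAgent() error {
+	return exec.Command("systemctl", "restart", "teleport").Run()
+}
+
+func stillHealthyAfter(grace time.Duration) bool {
+	time.Sleep(grace) // stand-in for a real readiness/health probe
+	return exec.Command("systemctl", "is-active", "--quiet", "teleport").Run() == nil
+}
+
+func main() {
+	if err := updateWithRevert("15.0.0", "15.1.1"); err != nil {
+		fmt.Println("update failed:", err)
+	}
+}
+```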
+
+I and the customer get an alert if the canary testing has not succeeded after an hour.
+Teleport cloud operators and the user can access the canary hostname and hostid
+to troubleshoot the failed agents.
+
+The rollout resumes once the canaries run the target version.
+
+#### Failure mode 1 bis: the new version crashes, but not on the canaries
+
+This scenario is the same as the previous one but the Teleport agent bug only manifests on select agents.
+For example: [the agent fails to read cloud-provider specific metadata and crashes](TODO add link).
+
+The canaries might not select one of the affected agents and allow the update to proceed.
+All agents are updated, and all agents hosted on the cloud provider affected by the bug crash.
+The updaters of the affected agents will attempt to self-heal by reverting to the previous version.
+
+Once the previous Teleport version is running, the agent will advertise that its update failed and that it had to roll back.
+If too many agents failed, this will block the group from transitioning from `active` to `done`, protecting the future
+groups from the faulty updates.
+
+#### Failure mode 2: the new version crashes, and the old version cannot start
+
+I create a new deployment, with a broken version. The version is deployed to the canaries.
+The canaries attempt the update, and the new Teleport instance crashes.
+The updater fails to self-heal as the old version does not start anymore.
+
+This is typically caused by external sources like full disk, faulty networking, or resource exhaustion.
+This can also be caused by the Teleport control plane not being available.
+
+The group update is stuck until the canary comes back online and runs the latest version.
+
+The customer and Teleport cloud receive an alert. The customer and Teleport cloud can retrieve the
+hostid and hostname of the faulty canaries. With this information they can go troubleshoot the failed agents.
+
+#### Failure mode 2 bis: the new version crashes, and the old version cannot start, but not on the canaries
+
+This scenario is the same as the previous one but the Teleport agent bug only manifests on select agents.
+For example: a clock drift blocks agents from re-connecting to Teleport.
+
+The canaries might not select one of the affected agents and allow the update to proceed.
+All agents are updated, and all agents hosted on the cloud provider affected by the bug crash.
+The updater fails to self-heal as the old version does not start anymore.
+
+If too many agents failed, this will block the group from transitioning from `active` to `done`, protecting the future
+groups from the faulty updates.
+
+In this case, it's hard to identify which agent dropped.
+
+#### Failure mode 3: shadow failure
+
+Teleport cloud deploys a new version. Agents from the first group get updated.
+The agents are seemingly running properly, but some functions are impaired.
+For example, host user creation is failing.
+
+Some user tries to access a resource served by the agent, it fails and the user
+ +The customer can observe the agent update status and see that a recent update +might have caused this: + +```shell +tctl auto-update agent status +# Rollout plan created the YYYY-MM-DD +# Previous version: v2 +# New version: v3 +# Status: enabled +# +# Group Name Status Update Start Time Connected Agents Up-to-date agents failed updates +# ---------- ----------------- ----------------- ---------------- ----------------- -------------- +# dev complete YYYY-MM-DD HHh 120 115 2 +# staging in progress (53%) YYYY-MM-D2 HHh 20 10 0 +# prod not started 234 0 0 +``` + +Then, the customer or Teleport Cloud team can suspend the rollout: + +```shell +tctl auto-update agent suspend +# Automatic updates suspended +# No existing agent will get updated. New agents might install the new version +# depending on their group. +``` + +At this point, no new agent is updated to reduce the service disruption. +The customer can investigate, and get help from Teleport's support via a support ticket. +If the update is really the cause of the issue, the customer or Teleport cloud can perform a rollback: + +```shell +tctl auto-update agent rollback +# Rolledback groups: [dev, staging] +# Warning: the automatic agent updates are suspended. +# Agents will not rollback until you run: +# $> tctl auto-update agent resume +``` + +> [!NOTE] +> By default, all groups not in the "unstarted" state are rolledback. +> It is also possible to rollback only specific groups. + +The new state looks like +```shell +tctl auto-update agent status +# Rollout plan created the YYYY-MM-DD +# Previous version: v2 +# New version: v3 +# Status: suspended +# +# Group Name Status Update Start Time Connected Agents Up-to-date agents failed updates +# ---------- ----------------- ----------------- ---------------- ----------------- -------------- +# dev rolledback YYYY-MM-DD HHh 120 115 2 +# staging rolledback YYYY-MM-D2 HHh 20 10 0 +# prod not started 234 0 0 +``` + +Finally, when the user is happy with the new plan, they can resume the updates. +This will trigger the rollback. + +```shell +tctl auto-update agent resume +``` + +### As a Teleport user and a Teleport on-call responder, I want to be able to pin a specific Teleport version of an agent to understand if a specific behaviour is caused by a specific Teleport version + +I connect to the node and lookup its status: +```shell +teleport-updater status +# Running version v16.2.5 +# Automatic updates enabled. +# Proxy: example.teleport.sh +# Group: staging +``` + +I try to set a specific version: +```shell +teleport-udpater use-version v16.2.3 +# Error: the instance is enrolled into automatic updates. +# You must specify --disable-automatic-updates to opt this agent out of automatic updates and manually control the version. +``` + +I acknowledge that I am leaving automatic updates: +```shell +teleport-udpater use-version v16.2.3 --disable-automatic-updates +# Disabling automatic updates for the node. You can enable them back by running `teleport-updater enable` +# Downloading version 16.2.3 +# Restarting teleport +# Cleaning up old binaries +``` + +When the issue is fixed, I can enroll back into automatic updates: + +```shell +teleport-updater enable +# Enabling automatic updates +# Proxy: example.teleport.sh +# Group: staging +``` + +### As a Teleport user I want to fast-track a group update + +I have a new rollout, completely unstarted, and my current maintenance schedule updates over seevral days. +However, the new version contains something that I need as soon s possible (e.g. 
a fix for a bug that affects me). + +
+Before: + +```shell +tctl auto-update agent status +# Rollout plan created the YYYY-MM-DD +# Previous version: v2 +# New version: v3 +# Status: enabled +# +# Group Name Status Update Start Time Connected Agents Up-to-date agents failed updates +# ---------- ----------------- ----------------- ---------------- ----------------- -------------- +# dev not started 120 0 0 +# staging not started 20 0 0 +# prod not started 234 0 0 +``` +
+ +I can trigger the dev group immediately using the command: + +```shell +tctl auto-update agent trigger-group dev +# Dev group update triggered +``` + +[TODO: how to deal with the canary vs active vs done states?] + +
+After:
+
+```shell
+tctl auto-update agent status
+# Rollout plan created the YYYY-MM-DD
+# Previous version: v2
+# New version: v3
+# Status: enabled
+#
+# Group Name Status            Update Start Time Connected Agents Up-to-date agents failed updates
+# ---------- ----------------- ----------------- ---------------- ----------------- --------------
+# dev        in progress (0%)  YYYY-MM-DD HHh    120              0                 0
+# staging    not started                         20               0                 0
+# prod       not started                         234              0                 0
+```
+</details>
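+
+The trigger command is constrained by the group state machine described earlier in this document. One compact,
+purely illustrative way to encode which transitions are permitted (names are assumptions, not Teleport types):
+
+```go
+package main
+
+import (
+	"errors"
+	"fmt"
+)
+
+type State string
+
+const (
+	Unstarted  State = "unstarted"
+	Canary     State = "canary"
+	Active     State = "active"
+	Done       State = "done"
+	RolledBack State = "rolledback"
+)
+
+// allowed encodes the FSM edges: which states each group state may move into.
+var allowed = map[State][]State{
+	Unstarted:  {Canary},                           // start group / maintenance trigger
+	Canary:     {Canary, Active, Done, RolledBack}, // reset, canaries alive, force, rollback
+	Active:     {Active, Done, RolledBack},         // reset, success criteria / force, rollback
+	Done:       {RolledBack},
+	RolledBack: {},
+}
+
+func transition(from, to State) (State, error) {
+	for _, next := range allowed[from] {
+		if next == to {
+			return to, nil
+		}
+	}
+	return from, errors.New("transition not permitted: " + string(from) + " -> " + string(to))
+}
+
+func main() {
+	state := Unstarted
+	state, _ = transition(state, Canary) // group triggered
+	state, _ = transition(state, Active) // canaries came back alive
+	fmt.Println(state)                   // active
+}
+```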
### Teleport Resources From cf8170429edac0410d0e0015459d3a543c18acce Mon Sep 17 00:00:00 2001 From: hugoShaka Date: Mon, 30 Sep 2024 11:15:26 -0400 Subject: [PATCH 078/105] Put new requirements at the top + edit UX + add TODOs --- rfd/0169-auto-updates-linux-agents.md | 237 +++++++++++++++++--------- 1 file changed, 152 insertions(+), 85 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index c22abdf66fa35..e83eece92b5d3 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -33,26 +33,72 @@ Additionally, this RFD parallels the auto-update functionality for client tools ## Why -The existing mechanism for automatic agent updates does not provide a hands-off experience for all Teleport users. - -1. The use of system package management leads to interactions with `apt upgrade`, `yum upgrade`, etc. that can result in unintentional upgrades. -2. The use of system package management requires logic that varies significantly by target distribution. -3. The installation mechanism requires 4-5 commands, includes manually installing multiple packages, and varies depending on your version and edition of Teleport. -4. The use of bash to implement the updater makes long-term maintenance difficult. -5. The existing auto-updater has limited automated testing. -6. The use of GPG keys in system package managers has key management implications that we would prefer to solve with TUF in the future. -7. The desired agent version cannot be set via Teleport's operator-targeted CLI (tctl). -8. The rollout plan for new agent versions is not fully-configurable using tctl. -9. Agent installation logic is spread between the auto-updater script, install script, auto-discovery script, and documentation. -10. Teleport contains logic that is specific to Teleport Cloud upgrade workflows. -11. The existing auto-updater is not self-updating. -12. It is difficult and undocumented to automate agent upgrades with custom automation (e.g., with JamF). -13. There is no phased rollout mechanism for updates. -14. There is no way to automatically detect and halt failed updates. - -We must provide a seamless, hands-off experience for auto-updates of Teleport Agents that is easy to maintain and safer for production use. - -## UX +1. We want customers always running the latest release of Teleport to always be secure, have access to the latest + features, and not deal with the pain of updating the agents. +2. Reduce Teleport Cloud operational costs of contacting customers with old agents. + Make updating easier for self-hosted customers so we don't have to provide support for older Teleport versions. +3. Increase reliability to 99.99%. + +The current systemd updater does not meet those requirements: +- its use of package managers leads users to accidentally upgrade Teleport +- the installation process is complex and users end up installing the wrong version of Teleport +- the current update process does not provide safeties to protect against broken updates +- many customers are not adopting the existing updater because they want to control when updates happen +- we don't offer a ni + +## Product requirements + +1. Phased rollout for our tenants. We should be able to control the agent version per-tenant. + +2. Bucketed rollout that tenants have control over. + - Control the bucket update day + - Control the bucket update hour + - Ability to pause a rollout + +3. Customers should be able to run "apt-get update" without updating Teleport. 
+ + Installation from a package manager should be possible, but the version should be controlled by Teleport. + +4. Self-managed updates should be a first class citizen. Teleport must advertise the desired agent and client version. + +5. Self-hosted customers should be supported, for example, customers whose their own internal customer is running a Teleport agent. + +6. Upgrading a leaf cluster is out of scope. + +7. Rolling back after a broken update should be supported. Roll forward get's you 99.9%, we need rollback for 99.99%. + +8. We should have high quality metrics that report the version they are running and if they are running automatic + updates. For users and us. + +9. Best effort should be made so automatic updates should be applied in a way that sessions are not terminated. (Currently only supported for SSH) + +10. All backends should be supported. + +11. Teleport Discover installation (curl one-liner) should be supported. + +12. We need to support repo mirrors. + +13. I should be able to install Teleport via whatever mechanism I want to. + +14. If new nodes join a bucket outside the upgrade window and you are within your compat. window, wait until your next group update start. + If you are not within your compat. window attempt to upgrade right away. + +15. If an agent comes back online after some period of time and is still compat. with + control plane, it wait until the next upgrade window when it will be upgraded. + +16. Regular cloud tenant update schedule should run in les than a week. + Select tenants might support longer schedules. + +17. A cloud customer should be able to pause, resume, and rollback and existing rollout schedule. + A cloud customer should not be able to create new rollout schedules. + + Teleport can create as many rollout schedules as it wants. + +18. A user on the host, should be able to turn autoupdate off or select a version for that particular host. + +19. Operating system packages should be supported. + +## User Stories ### As Teleport Cloud I want to be able to update customers agents to a newer Teleport version @@ -288,7 +334,9 @@ tctl auto-update agent rollback > By default, all groups not in the "unstarted" state are rolledback. > It is also possible to rollback only specific groups. -The new state looks like +
+After: + ```shell tctl auto-update agent status # Rollout plan created the YYYY-MM-DD @@ -302,6 +350,7 @@ tctl auto-update agent status # staging rolledback YYYY-MM-D2 HHh 20 10 0 # prod not started 234 0 0 ``` +
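+
+Suspending composes with the Cloud-side setting through the "lowest value wins" rule from the agent update modes
+section. A hedged Go sketch of that combination (constant names are illustrative):
+
+```go
+package main
+
+import "fmt"
+
+// Modes are ordered so that the most restrictive wins: the effective mode
+// is the minimum of the Cloud-set and customer-set modes.
+const (
+	Disabled = iota
+	Paused
+	Enabled
+)
+
+func effectiveMode(cloud, customer int) int {
+	if customer < cloud {
+		return customer
+	}
+	return cloud
+}
+
+func main() {
+	fmt.Println(effectiveMode(Enabled, Paused) == Paused)     // true: customer suspension holds
+	fmt.Println(effectiveMode(Disabled, Enabled) == Disabled) // true: cloud can always disable
+}
+```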
Finally, when the user is happy with the new plan, they can resume the updates. This will trigger the rollback. @@ -396,9 +445,16 @@ tctl auto-update agent status ``` -### Teleport Resources -#### Scheduling +## Teleport Resources + +### Scheduling + +This resource is owned by the Teleport cluster user. +This is how Teleport customers can specify their automatic update preferences such as: +- if automatic updates are enabled, disabled, or temporarily suspended +- in which order their agents should be updated (`dev` before `staging` before `prod`) +- when should the updates start ```yaml kind: autoupdate_config @@ -419,6 +475,7 @@ spec: - name: staging # days specifies the days of the week when the group may be updated. # default: ["*"] (all days) + # TODO: explicit the supported values based on the customer QoS days: [ “Sun”, “Mon”, ... | "*" ] # start_hour specifies the hour when the group may start upgrading. # default: 0 @@ -426,6 +483,7 @@ spec: # wait_days specifies how many days to wait after the previous group finished before starting. # default: 0 wait_days: 0-1 + # TODO: is this needed? In which case a customer would need to set a custom jitter? # jitter_seconds specifies a maximum jitter duration after the start hour. # The agent updater client will pick a random time within this duration to wait to update. # default: 5 @@ -438,7 +496,7 @@ spec: # Only valid for the backpressure strategy. # default: 20% max_in_flight: 10-100% - # alert_after specifies the duration after which a cluster alert will be set if the rollout has + # alert_after specifies the duration after which a cluster alert will be set if the group update has # not completed. # default: 4 alert_after_hours: 1-8 @@ -454,7 +512,7 @@ spec: agent_schedules: regular: - name: default - days: ["*"] + days: ["*"] # TODO: restrict to work week? Minus Friday? start_hour: 0 jitter_seconds: 5 canary_count: 5 @@ -462,15 +520,14 @@ spec: alert_after: 4h ``` -Dependency cycles are rejected. -Dependency chains longer than a week will be rejected. -Otherwise, updates could take up to 7 weeks to propagate. The update proceeds from the first group to the last group, ensuring that each group successfully updates before allowing the next group to proceed. +By default, only 5 agent groups are allowed, this mitigates very long rollout plans. The updater will receive `agent_autoupdate: true` from the time is it designated for update until the `target_version` in `autoupdate_agent_plan` (below) changes. Changing the `target_version` resets the schedule immediately, clearing all progress. +[TODO: What is the use-case for this? can we do like with target_version and reset all instead of trying to merge the state] Changing the `start_version` in `autoupdate_agent_plan` changes the advertised `start_version` for all unfinished groups. Changing `agent_schedules` will preserve the `state` of groups that have the same name before and after the change. @@ -479,9 +536,12 @@ However, any changes to `agent_schedules` that occur while a group is active wil Releasing new agent versions multiple times a week has the potential to starve dependent groups from updates. Note that the `default` schedule applies to agents that do not specify a group name. +[TODO: It seems we removed the default bool, So we have a mandatory default group? Can we pick the last one instead?] ```shell # configuration +# TODO: "tctl autoudpate update" is bad UX, especially as this doen't even trigger agent update but updates the AU resource. 
+# We should chose a user-friendly signature $ tctl autoupdate update --set-agent-auto-update=off Automatic updates configuration has been updated. $ tctl autoupdate update --group staging --set-start-hour=3 @@ -520,9 +580,22 @@ Executing auto-update for group 'staging' immediately. ``` Notes: -- `autoupdate_agent_plan` is separate from `autoupdate_config` so that Cloud customers can be restricted from updating `autoupdate_agent_plan`, while maintaining control over the rollout. +- `autoupdate_agent_plan` is separate from `autoupdate_config` so that Cloud customers can be restricted from updating + `autoupdate_agent_plan`, while maintaining control over the rollout. + +### Rollout -#### Rollout +The `autoupdate_agent_plan` spec is owned by the Teleport cluster administrator. +In Teleport Cloud, this is the cloud operations team. For self-hosted setups this is the user with access to the local +admin socket (tctl on local machine). + +> [!NOTE] +> This is currently an anti-pattern as we are trying to remove the use of the local administrator in Teleport. +> However, Teleport does not provide any role/permission that we can use for Teleport Cloud operations and cannot be +> granted to users. To part with local admin rights, we need a way to have cloud or admi-only operations. +> This would also improve Cloud team operations by interacting with Teleport API rather than executing local tctl. +> +> Solving this problem is out of the scope of this RFD. ```yaml kind: autoupdate_agent_plan @@ -570,42 +643,12 @@ $ tctl autoupdate update --set-agent-version=15.1.2 --critical Automatic updates configuration has been updated. ``` -## Details - Teleport API - -Teleport proxies will be updated to serve the desired agent version and edition from `/v1/webapi/find`. -The version and edition served from that endpoint will be configured using new `autoupdate_agent_plan` resource. - -Whether the Teleport updater querying the endpoint is instructed to upgrade (via the `agent_autoupdate` field) is dependent on: -- The `host=[uuid]` parameter sent to `/v1/webapi/find` -- The `group=[name]` parameter sent to `/v1/webapi/find` -- The schedule defined in the new `autoupdate_config` resource -- The status of past agent upgrades for the given version - -To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to `/v1/webapi/find`. -Teleport auth servers use their access to the instance inventory data to drive the rollout, while Teleport proxies modulate the `/v1/webapi/find` response given the host UUID and group name. - -Rollouts are specified as interdependent groups of hosts, selected by upgrade group identifier specified in the agent's `/var/lib/teleport/versions/update.yaml` file, which is written via `teleport-update enable`: -```shell -$ teleport-update enable --proxy teleport.example.com --group staging -``` - -At the start of a group rollout, the Teleport auth servers record the initial number connected agents. -The number of updated and non-updated agents is tracked by the auth servers. - -If backpressure is enabled, a fixed number of connected agents (`max_in_flight % x total`) are instructed to upgrade at the same time via `/v1/webapi/find`. -Additional agents are instructed to update as earlier updates complete, never exceeding `max_in_flight`. - -If canaries are enabled, a user-specified number of agents are updated first. -These agents must all update successfully for the rollout to proceed to the remaining agents. 
- -Rollouts may be paused with `tctl autoupdate pause` or manually triggered with `tctl autoupdate run`. - ### Group states Let `v1` be the current version and `v2` the target version. A group can be in 5 state: -- unstarted: the group update has not been started yet. +- unstarted: the group update has not been started yet. - canary: a few canaries are getting updated. New agents should run `v1`. Existing agents should not attempt to update and keep their existing version. - active: the group is actively getting updated. New agents should run `v2`, existing agents are instructed to update to `v2`. - done: the group has been updated. New agents should run `v2`. @@ -651,37 +694,35 @@ For example: - cloud says `disabled` and the customer says `suspended` -> the updates are `disabled` - cloud says `disabled` and the customer says `enabled` -> the updates are `disabled` -### Proxy answer +## Details - Teleport API -The proxy response contains two parts related to automatic updates: -- the target version of the requested group -- if the agent should be updated +Teleport proxies will be updated to serve the desired agent version and edition from `/v1/webapi/find`. +The version and edition served from that endpoint will be configured using new `autoupdate_agent_plan` resource. -#### Rollout status: disabled +Whether the Teleport updater querying the endpoint is instructed to upgrade (via the `agent_autoupdate` field) is dependent on: +- The `host=[uuid]` parameter sent to `/v1/webapi/find` +- The `group=[name]` parameter sent to `/v1/webapi/find` +- The schedule defined in the new `autoupdate_config` resource +- The status of past agent upgrades for the given version -| Group state | Version | Should update | -|-------------|---------|---------------| -| * | v2 | false | +To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to `/v1/webapi/find`. +Teleport auth servers use their access to the instance inventory data to drive the rollout, while Teleport proxies modulate the `/v1/webapi/find` response given the host UUID and group name. -#### Rollout status: paused +Rollouts are specified as interdependent groups of hosts, selected by upgrade group identifier specified in the agent's `/var/lib/teleport/versions/update.yaml` file, which is written via `teleport-update enable`: +```shell +$ teleport-update enable --proxy teleport.example.com --group staging +``` -| Group state | Version | Should update | -|-------------|---------|---------------| -| unstarted | v1 | false | -| canary | v1 | false | -| active | v2 | false | -| done | v2 | false | -| rolledback | v1 | false | +At the start of a group rollout, the Teleport auth servers record the initial number connected agents. +The number of updated and non-updated agents is tracked by the auth servers. -#### Rollout status: enabled +If backpressure is enabled, a fixed number of connected agents (`max_in_flight % x total`) are instructed to upgrade at the same time via `/v1/webapi/find`. +Additional agents are instructed to update as earlier updates complete, never exceeding `max_in_flight`. -| Group state | Version | Should update | -|-------------|---------|----------------------------| -| unstarted | v1 | false | -| canary | v1 | false, except for canaries | -| active | v2 | true if UUID <= progress | -| done | v2 | true | -| rolledback | v1 | true | +If canaries are enabled, a user-specified number of agents are updated first. 
+These agents must all update successfully for the rollout to proceed to the remaining agents. + +Rollouts may be paused with `tctl autoupdate pause` or manually triggered with `tctl autoupdate run`. ### Rollout @@ -739,6 +780,32 @@ When the updater queries the proxy via `/v1/webapi/find?host=[uuid]&group=[name] The boolean is returned as `true` in the case that the provided `host` contains a UUID that is under the progress percentage for the `group`: `as_numeral(host_uuid) / as_numeral(max_uuid) < progress` +##### Rollout status: disabled + +| Group state | Version | Should update | +|-------------|---------|---------------| +| * | v2 | false | + +##### Rollout status: paused + +| Group state | Version | Should update | +|-------------|---------|---------------| +| unstarted | v1 | false | +| canary | v1 | false | +| active | v2 | false | +| done | v2 | false | +| rolledback | v1 | false | + +##### Rollout status: enabled + +| Group state | Version | Should update | +|-------------|---------|----------------------------| +| unstarted | v1 | false | +| canary | v1 | false, except for canaries | +| active | v2 | true if UUID <= progress | +| done | v2 | true | +| rolledback | v1 | true | + ### REST Endpoints `/v1/webapi/find?host=[uuid]&group=[name]` From 234aecdbbcb5700f9af4227f2c5840873543665d Mon Sep 17 00:00:00 2001 From: hugoShaka Date: Wed, 2 Oct 2024 12:34:20 -0400 Subject: [PATCH 079/105] Edition work --- rfd/0169-auto-updates-linux-agents.md | 659 +++++++++++++++----------- 1 file changed, 385 insertions(+), 274 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index e83eece92b5d3..0d5ce9d964b2a 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -15,7 +15,7 @@ state: draft This RFD proposes a new mechanism for scheduled, automatic updates of Teleport agents. -Users of Teleport will be able to use the tctl CLI to specify desired versions, update schedules, and rollout speed. +Users of Teleport will be able to use the tctl CLI to specify desired versions, update schedules, and rollout strategy. Agents will be updated by a new `teleport-update` binary, built from `tools/teleport-update` in the Teleport repository. @@ -44,9 +44,103 @@ The current systemd updater does not meet those requirements: - the installation process is complex and users end up installing the wrong version of Teleport - the current update process does not provide safeties to protect against broken updates - many customers are not adopting the existing updater because they want to control when updates happen -- we don't offer a ni - -## Product requirements +- we don't offer a nice user experience for self-hosted users, this ends up in a marginal automatic updates + adoption and does not reduce the cost of upgrading self-hosted clusters. + +## How + +The new agent automatic updates will rely on a separate updating binary controlling which Teleport version is +installed. 
The automatic updates will be implemented via incremental improvements over the existing mechanism: + +- Phase 1: introduce a new updater binary not relying on package managers +- Phase 2: introduce the concept of agent update groups and make the users chose in which order groups are updated +- Phase 3: add the ability for the agent updater to immediately revert a faulty update +- Phase 4: add a feedback mechanism for the Teleport inventory to track the agents of each group and their update status +- Phase 5: add the canary deployment strategy: a few agents are updated first, if they don't die, the whole group is updated +- Phase 6: add the ability to perform slow and incremental version rollouts within an agent update group + +The updater will be usable after phase 1, and will gain new capabilities after each phase. +Future phases might change as we are working on the implementation and collecting real-world feedback and experience. + +### Resources + +We will introduce 2 user-facing resources: + +1. The `autoupdate_config` resource, owned by the Teleport user. This resource allows Teleport users to configure: + - if automatic updates are enabled, disabled, or temporarily suspended + - in which order their agents should be updated (`dev` before `staging` before `prod`) + - when should the updates start + + The resource will look like: + ```yaml + kind: autoupdate_config + spec: + agent_autoupdate_mode: enable + agent_schedules: + regular: + - name: dev + days: ["Mon", "Tue", "Wed", "Thu"] + start_hour: 0 + alert_after: 4h + canary_count: 5 # added in phase 5 + max_in_flight: 20% # added in phase 6 + - name: prod + days: ["Mon", "Tue", "Wed", "Thu"] + start_hour: 0 + wait_days: 1 # update this group at least 1 day after the previous one + alert_after: 4h + canary_count: 5 # added in phase 5 + max_in_flight: 20% # added in phase 6 + ``` + +2. The `autoupdate_agent_plan` resource, its spec is owned by the Teleport cluster administrator (e.g. Teleport Cloud team). + Its status is owned by Teleport and contains the current rollout status. Some parts of the status can be changed via + select RPCs, for example fast-tracking a group update. + ```yaml + kind: autoupdate_agent_plan + spec: + current_version: v1 + target_version: v2 + schedule: regular + strategy: grouped + autoupdate_mode: enabled + status: + groups: + - name: dev + start_time: 2020-12-09T16:09:53+00:00 + initial_count: 100 # part of phase 4 + present_count: 100 # part of phase 4 + failed_count: 0 # part of phase 4 + progress: 0 + state: canaries + canaries: # part of phase 5 + - updater_id: abc + host_id: def + hostname: foo.example.com + success: false + last_update_time: 2020-12-10T16:09:53+00:00 + last_update_reason: canaryTesting + - name: prod + start_time: 0000-00-00 + initial_count: 0 + present_count: 0 + failed_count: 0 + progress: 0 + state: unstarted + last_update_time: 2020-12-10T16:09:53+00:00 + last_update_reason: newAgentPlan + ``` + +You can find more details about each resource field [in the dedicated resource section](#teleport-resources). + +## Details + +This section contains the proposed implementation details and is mainly relevant for Teleport developers and curious +users who want to know the motivations behind this specific design. + +### Product requirements + +Those are the requirements coming from engineering, product, and cloud teams: 1. Phased rollout for our tenants. We should be able to control the agent version per-tenant. 
@@ -84,7 +178,7 @@ The current systemd updater does not meet those requirements: If you are not within your compat. window attempt to upgrade right away. 15. If an agent comes back online after some period of time and is still compat. with - control plane, it wait until the next upgrade window when it will be upgraded. + control lane, it wait until the next upgrade window when it will be upgraded. 16. Regular cloud tenant update schedule should run in les than a week. Select tenants might support longer schedules. @@ -98,41 +192,25 @@ The current systemd updater does not meet those requirements: 19. Operating system packages should be supported. -## User Stories +### User Stories -### As Teleport Cloud I want to be able to update customers agents to a newer Teleport version +#### As Teleport Cloud I want to be able to update customers agents to a newer Teleport version
Before -```yaml -kind: autoupdate_agent_plan -spec: - current_version: v1 - target_version: v2 - schedule: regular - strategy: grouped - autoupdate_mode: enabled -status: - groups: - - name: dev - start_time: 2020-12-09T16:09:53+00:00 - initial_count: 100 - present_count: 103 - failed_count: 2 - progress: 1 - state: active - last_update_time: 2020-12-09T16:09:53+00:00 - last_update_reason: success - - name: staging - start_time: 0000-00-00 - initial_count: 0 - present_count: 0 - failed_count: 0 - progress: 0 - state: unstarted - last_update_time: 2020-12-09T16:09:53+00:00 - last_update_reason: newAgentPlan +```shell +tctl auto-update agent status +# Rollout plan created the YYYY-MM-DD +# Previous version: v1 +# New version: v2 +# Status: enabled +# +# Group Name Status Update Start Time Connected Agents Up-to-date agents failed updates +# ---------- ----------------- ----------------- ---------------- ----------------- -------------- +# dev complete YYYY-MM-DD HHh 120 115 2 +# staging complete YYYY-MM-D2 HHh 20 20 0 +# prod not started 234 0 0 ```
@@ -145,35 +223,20 @@ tctl autoupdate agent new-rollout v3
After -```yaml -kind: autoupdate_agent_plan -spec: - current_version: v2 - target_version: v3 - schedule: regular - strategy: grouped - autoupdate_mode: enabled -status: - groups: - - name: dev - start_time: 0000-00-00 - initial_count: 0 - present_count: 0 - failed_count: 0 - progress: 0 - state: unstarted - last_update_time: 2020-12-10T16:09:53+00:00 - last_update_reason: newAgentPlan - - name: staging - start_time: 0000-00-00 - initial_count: 0 - present_count: 0 - failed_count: 0 - progress: 0 - state: unstarted - last_update_time: 2020-12-10T16:09:53+00:00 - last_update_reason: newAgentPlan +```shell +tctl auto-update agent status +# Rollout plan created the YYYY-MM-DD +# Previous version: v2 +# New version: v3 +# Status: enabled +# +# Group Name Status Update Start Time Connected Agents Up-to-date agents failed updates +# ---------- ----------------- ----------------- ---------------- ----------------- -------------- +# dev not started 120 115 2 +# staging not started 20 20 0 +# prod not started 234 0 0 ``` +
Now, new agents will install v2 by default, and v3 after the maintenance. @@ -186,12 +249,12 @@ Now, new agents will install v2 by default, and v3 after the maintenance. > > ```bash > tctl autoupdate agent new-rollout v3 --current-version v1 -> # created new update plan from v2 to v3 +> # created new update plan from v1 to v3 > ``` -### As Teleport Cloud I want to minimize the damage of a broken version to improve Teleport's availability to 99.99% +#### As Teleport Cloud I want to minimize the damage of a broken version to improve Teleport's availability to 99.99% -#### Failure mode 1: the new version crashes +##### Failure mode 1: the new version crashes I create a new deployment, with a broken version. The version is deployed to the canaries. The canaries crash, the updater reverts the update, the agents connect back online and @@ -242,7 +305,7 @@ to The rollout resumes. -#### Failure mode 1 bis: the new version crashes, but not on the canaries +##### Failure mode 1 bis: the new version crashes, but not on the canaries This scenario is the same as the previous one but the Teleport agent bug only manifests on select agents. For example: [the agent fails to read cloud-provider specific metadata and crashes](TODO add link). @@ -255,7 +318,7 @@ Once the previous Teleport version is running, the agent will advertise its upda If too many agents failed, this will block the group from transitioning from `active` to `done`, protecting the future groups from the faulty updates. -#### Failure mode 2: the new version crashes, and the old version cannot start +##### Failure mode 2: the new version crashes, and the old version cannot start I create a new deployment, with a broken version. The version is deployed to the canaries. The canaries attempt the update, and the new Teleport instance crashes. @@ -269,7 +332,7 @@ The group update is stuck until the canary comes back online and runs the latest The customer and Teleport cloud receive an alert. The customer and Teleport cloud can retrieve the hostid and hostname of the faulty canaries. With this information they can go troubleshoot the failed agents. -#### Failure mode 2 bis: the new version crashes, and the old version cannot start, but not on the canaries +##### Failure mode 2 bis: the new version crashes, and the old version cannot start, but not on the canaries This scenario is the same as the previous one but the Teleport agent bug only manifests on select agents. For example: a clock drift blocks agents from re-connecting to Teleport. @@ -283,7 +346,7 @@ groups from the faulty updates. In this case, it's hard to identify which agent dropped. -#### Failure mode 3: shadow failure +##### Failure mode 3: shadow failure Teleport cloud deploys a new version. Agents from the first group get updated. The agents are seemingly running properly, but some functions are impaired. @@ -359,7 +422,7 @@ This will trigger the rollback. 
tctl auto-update agent resume ``` -### As a Teleport user and a Teleport on-call responder, I want to be able to pin a specific Teleport version of an agent to understand if a specific behaviour is caused by a specific Teleport version +#### As a Teleport user and a Teleport on-call responder, I want to be able to pin a specific Teleport version of an agent to understand if a specific behaviour is caused by a specific Teleport version I connect to the node and lookup its status: ```shell @@ -395,7 +458,7 @@ teleport-updater enable # Group: staging ``` -### As a Teleport user I want to fast-track a group update +#### As a Teleport user I want to fast-track a group update I have a new rollout, completely unstarted, and my current maintenance schedule updates over seevral days. However, the new version contains something that I need as soon s possible (e.g. a fix for a bug that affects me). @@ -404,7 +467,7 @@ However, the new version contains something that I need as soon s possible (e.g. Before: ```shell -tctl auto-update agent status +tctl auto-updates agent status # Rollout plan created the YYYY-MM-DD # Previous version: v2 # New version: v3 @@ -421,11 +484,14 @@ tctl auto-update agent status I can trigger the dev group immediately using the command: ```shell -tctl auto-update agent trigger-group dev -# Dev group update triggered +tctl auto-updates agent start-update dev --no-canary +# Dev group update triggered (canary or active) ``` -[TODO: how to deal with the canary vs active vs done states?] +Alternatively +```shell +tctl auto-update agent force-done dev +```
After: @@ -445,16 +511,12 @@ tctl auto-update agent status ```
+### Teleport Resources -## Teleport Resources - -### Scheduling +#### Autoupdate Config This resource is owned by the Teleport cluster user. -This is how Teleport customers can specify their automatic update preferences such as: -- if automatic updates are enabled, disabled, or temporarily suspended -- in which order their agents should be updated (`dev` before `staging` before `prod`) -- when should the updates start +This is how Teleport customers can specify their automatic update preferences. ```yaml kind: autoupdate_config @@ -483,11 +545,6 @@ spec: # wait_days specifies how many days to wait after the previous group finished before starting. # default: 0 wait_days: 0-1 - # TODO: is this needed? In which case a customer would need to set a custom jitter? - # jitter_seconds specifies a maximum jitter duration after the start hour. - # The agent updater client will pick a random time within this duration to wait to update. - # default: 5 - jitter_seconds: 0-60 # canary_count specifies the desired number of canaries to update before any other agents # are updated. # default: 5 @@ -512,7 +569,7 @@ spec: agent_schedules: regular: - name: default - days: ["*"] # TODO: restrict to work week? Minus Friday? + days: ["Mon", "Tue", "Wed", "Thu"] start_hour: 0 jitter_seconds: 5 canary_count: 5 @@ -520,70 +577,7 @@ spec: alert_after: 4h ``` - -The update proceeds from the first group to the last group, ensuring that each group successfully updates before allowing the next group to proceed. -By default, only 5 agent groups are allowed, this mitigates very long rollout plans. - -The updater will receive `agent_autoupdate: true` from the time is it designated for update until the `target_version` in `autoupdate_agent_plan` (below) changes. -Changing the `target_version` resets the schedule immediately, clearing all progress. - -[TODO: What is the use-case for this? can we do like with target_version and reset all instead of trying to merge the state] -Changing the `start_version` in `autoupdate_agent_plan` changes the advertised `start_version` for all unfinished groups. - -Changing `agent_schedules` will preserve the `state` of groups that have the same name before and after the change. -However, any changes to `agent_schedules` that occur while a group is active will be rejected. - -Releasing new agent versions multiple times a week has the potential to starve dependent groups from updates. - -Note that the `default` schedule applies to agents that do not specify a group name. -[TODO: It seems we removed the default bool, So we have a mandatory default group? Can we pick the last one instead?] - -```shell -# configuration -# TODO: "tctl autoudpate update" is bad UX, especially as this doen't even trigger agent update but updates the AU resource. -# We should chose a user-friendly signature -$ tctl autoupdate update --set-agent-auto-update=off -Automatic updates configuration has been updated. -$ tctl autoupdate update --group staging --set-start-hour=3 -Automatic updates configuration has been updated. -$ tctl autoupdate update --group staging --set-jitter-seconds=60 -Automatic updates configuration has been updated. -$ tctl autoupdate update --group default --set-jitter-seconds=60 -Automatic updates configuration has been updated. -$ tctl autoupdate reset -Automatic updates configuration has been reset to defaults. 
-
-# status
-$ tctl autoupdate status
-Status: disabled
-Version: v1.2.4
-Schedule: regular
-
-Groups:
-staging: succeeded at 2024-01-03 23:43:22 UTC
-prod: scheduled for 2024-01-03 23:43:22 UTC (depends on prod)
-other: failed at 2024-01-05 22:53:22 UTC
-
-$ tctl autoupdate status --group staging
-Status: succeeded
-Date: 2024-01-03 23:43:22 UTC
-Requires: (none)
-
-Updated: 230 (95%)
-Unchanged: 10 (2%)
-Failed: 15 (3%)
-Timed-out: 0
-
-# re-running failed group
-$ tctl autoupdate run --group staging
-Executing auto-update for group 'staging' immediately.
-```
-
-Notes:
-- `autoupdate_agent_plan` is separate from `autoupdate_config` so that Cloud customers can be restricted from updating
-  `autoupdate_agent_plan`, while maintaining control over the rollout.
-
-### Rollout
+#### Autoupdate Agent Plan

The `autoupdate_agent_plan` spec is owned by the Teleport cluster administrator. In Teleport Cloud, this is the cloud
operations team. For self-hosted setups, this is the user with access to the local
@@ -594,7 +588,7 @@ admin socket (tctl on local machine).

> However, Teleport does not provide any role/permission that we can use for Teleport Cloud operations and cannot be
> granted to users. To part with local admin rights, we need a way to have cloud-only or admin-only operations.
> This would also improve Cloud team operations by interacting with the Teleport API rather than executing local tctl.
->
+>
> Solving this problem is out of the scope of this RFD.

```yaml
@@ -636,49 +630,16 @@ status:
  last_update_reason: rollback
```

-```shell
-$ tctl autoupdate update --set-agent-version=15.1.1
-Automatic updates configuration has been updated.
-$ tctl autoupdate update --set-agent-version=15.1.2 --critical
-Automatic updates configuration has been updated.
-```
-
-### Group states
+### Backend logic to progress the rollout

-Let `v1` be the current version and `v2` the target version.
+The update proceeds from the first group to the last group, ensuring that each group successfully updates before
+allowing the next group to proceed. By default, only 5 agent groups are allowed; this mitigates very long rollout plans.

-A group can be in 5 state:
-- unstarted: the group update has not been started yet.
-- canary: a few canaries are getting updated. New agents should run `v1`. Existing agents should not attempt to update and keep their existing version.
-- active: the group is actively getting updated. New agents should run `v2`, existing agents are instructed to update to `v2`.
-- done: the group has been updated. New agents should run `v2`.
-- rolledback: the group has been rolledback. New agents should run `v1`, existing agents should update to `v1`.
-
-The finite state machine is the following:
-```mermaid
-flowchart TD
-    unstarted((unstarted))
-    canary((canary))
-    active((active))
-    done((done))
-    rolledback((rolledback))
-
-    unstarted -->|StartGroup<br/>MaintenanceTriggerOK| canary
-    canary -->|canary came back alive| active
-    canary -->|ForceGroup| done
-    canary -->|RollbackGroup| rolledback
-    active -->|ForceGroup<br/>Success criteria met| done
-    done -->|RollbackGroup| rolledback
-    active -->|RollbackGroup| rolledback
-
-    canary -->|ResetGroup| canary
-    active -->|ResetGroup| active
-```
-
-### Agent auto update modes
+#### Agent update mode

The agent auto update mode is specified by both Cloud (via `autoupdate_agent_plan`)
-and by the customer (via `autoupdate_config`).
+and by the customer (via `autoupdate_config`). The agent update mode controls whether
+the cluster is enrolled in automatic agent updates.

The agent update mode can take 3 values:

@@ -694,92 +655,225 @@ For example:
- cloud says `disabled` and the customer says `suspended` -> the updates are `disabled`
- cloud says `disabled` and the customer says `enabled` -> the updates are `disabled`

-## Details - Teleport API
+The Teleport cluster only progresses the rollout if the mode is `enabled`.

-Teleport proxies will be updated to serve the desired agent version and edition from `/v1/webapi/find`.
-The version and edition served from that endpoint will be configured using new `autoupdate_agent_plan` resource.
+#### Group States

-Whether the Teleport updater querying the endpoint is instructed to upgrade (via the `agent_autoupdate` field) is dependent on:
-- The `host=[uuid]` parameter sent to `/v1/webapi/find`
-- The `group=[name]` parameter sent to `/v1/webapi/find`
-- The schedule defined in the new `autoupdate_config` resource
-- The status of past agent upgrades for the given version
+Let `v1` be the previous version and `v2` the target version.

-To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to `/v1/webapi/find`.
-Teleport auth servers use their access to the instance inventory data to drive the rollout, while Teleport proxies modulate the `/v1/webapi/find` response given the host UUID and group name.
+A group can be in 5 states:
+- `unstarted`: the group update has not been started yet.
+- `canary`: a few canaries are getting updated. New agents should run `v1`. Existing agents should not attempt to update
+  and keep their existing version.
+- `active`: the group is actively getting updated. New agents should run `v2`, existing agents are instructed to update
+  to `v2`.
+- `done`: the group has been updated. New agents should run `v2`.
+- `rolledback`: the group has been rolled back. New agents should run `v1`, existing agents should update to `v1`.

-Rollouts are specified as interdependent groups of hosts, selected by upgrade group identifier specified in the agent's `/var/lib/teleport/versions/update.yaml` file, which is written via `teleport-update enable`:
-```shell
-$ teleport-update enable --proxy teleport.example.com --group staging
-```
+The finite state machine is the following:

-At the start of a group rollout, the Teleport auth servers record the initial number connected agents.
-The number of updated and non-updated agents is tracked by the auth servers.
+```mermaid
+flowchart TD
+    unstarted((unstarted))
+    canary((canary))
+    active((active))
+    done((done))
+    rolledback((rolledback))

-If backpressure is enabled, a fixed number of connected agents (`max_in_flight % x total`) are instructed to upgrade at the same time via `/v1/webapi/find`.
-Additional agents are instructed to update as earlier updates complete, never exceeding `max_in_flight`.
+    unstarted -->|TriggerGroupRPC<br/>Start conditions are met| canary
+    canary -->|Canary came back alive| active
+    canary -->|ForceGroupRPC| done
+    canary -->|RollbackGroupRPC| rolledback
+    active -->|ForceGroupRPC<br/>Success criteria met| done
+    done -->|RollbackGroupRPC| rolledback
+    active -->|RollbackGroupRPC| rolledback

-If canaries are enabled, a user-specified number of agents are updated first.
-These agents must all update successfully for the rollout to proceed to the remaining agents.
+    canary -->|ResetGroupRPC| canary
+    active -->|ResetGroupRPC| active
+```

-Rollouts may be paused with `tctl autoupdate pause` or manually triggered with `tctl autoupdate run`.
-
-### Rollout
-
-Instance heartbeats will be extended to incorporate and send data that is written to `/var/lib/teleport/versions/update.yaml` by the `teleport-update` binary.
-
-The following data related to the rollout are stored in each instance heartbeat:
-- `agent_update_start_time`: timestamp of individual agent's upgrade time
-- `agent_update_start_version`: current agent version
-- `agent_update_rollback`: whether the agent was rolled-back automatically
-- `agent_update_uuid`: Auto-update UUID
-- `agent_update_group`: Auto-update group name
-
-Auth servers use their local instance inventory to calculate rollout statistics and write them to `/autoupdate/[group]/[auth ID]` (e.g., `/autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56`).
-
-Every minute, auth servers persist the version counts:
-- `agent_data[group].stats[version]`
-  - `count`: number of currently connected agents at `version` in `group`
-  - `failed_count`: number of currently connected agents at `version` in `group` that experienced a rollback or inability to upgrade
-  - `lowest_uuid`: lowest UUID of all currently connected agents at `version` in `group`
-  - `count`: number of connected agents at `version` in `group` at start of window
-- `agent_data[group]`
-  - `canaries`: list of updater UUIDs to use for canary deployments
-
-Expiration time of the persisted key is 1 hour.
-
-To progress the rollout, auth servers will range-read keys from `/autoupdate/[group]/*`, sum the counts, and write back to the `autoupdate_agent_plan` status on a one-minute interval.
-- To calculate the initial number of agents connected at the start of the window, each auth server will write the summed count of agents to `autoupdate_agent_plan` status, if not already written.
-- To calculate the canaries, each auth server will write a random selection of all canaries to `autoupdate_agent_plan` status, if not already written.
-- To determine the progress through the rollout, auth servers will write the calculated progress to the `autoupdate_agent_plan` status using the formulas, declining to write if the current written progress is further ahead.
-
-If `/autoupdate/[group]/[auth ID]` is older than 1 minute, we do not consider its contents.
-This prevents double-counting agents when auth servers are killed.
+#### Starting a group

+A group can be started if the following criteria are met:
+- all of its previous groups are in the `done` state
+- it has been at least `wait_days` since the previous group update started
+- the current week day is in the `days` list
+- the current hour equals the `start_hour` field

+When all those criteria are met, the auth server will transition the group into a new state.
+If `canary_count` is non-zero, the group transitions to the `canary` state.
+Else it transitions to the `active` state.

+In phase 4, at the start of a group rollout, the Teleport auth servers record the initial number of connected agents.
+The number of updated and non-updated agents is tracked by the auth servers. This will be used later to evaluate the
+update success criteria.

+#### Canary testing (phase 5)

+A group in the `canary` state will get assigned canaries.
+The proxies will instruct those canaries to update now.
+During each reconciliation loop, the auth server will look up the canaries' instance healthchecks in the backend.

+Once all canaries have a healthcheck containing the new version (the healthcheck must not be older than 20 minutes),
+they have successfully come back online and the group can transition to the `active` state.

+If canaries never update, report a rollback, or disappear, the group will stay stuck in the `canary` state.
+An alert will eventually fire, warning the user about the stuck update.

+#### Updating a group

+A group in the `active` state is currently being updated. The conditions to leave the `active` state and transition to
+the `done` state vary based on the phase and rollout strategy.

+- Phase 2: we don't have any information about agents. The group transitions to `done` 60 minutes after its start.
+- Phase 4: we know the connected agent count and the connected agent versions. The group transitions to `done` if:
+  - at least `(100 - max_in_flight)%` of the agents are still connected
+  - at least `(100 - max_in_flight)%` of the agents are running the new version
+- Phase 6: we incrementally update the progress, which adds a new criterion: the group progress is at 100%

-#### Progress Formulas
+The phase 6 backpressure update is the following:

Given:
```
initial_count[group] = sum(agent_data[group].stats[*]).count
```

-Each auth server will calculate the progress as `( max_in_flight * initial_count[group] + agent_data[group].stats[target_version].count ) / initial_count[group]` and write the progress to `autoupdate_agent_plan` status.
-This formula determines the progress percentage by adding a `max_in_flight` percentage-window above the number of currently updated agents in the group.
+Each auth server will calculate the progress as
+`( max_in_flight * initial_count[group] + agent_data[group].stats[target_version].count ) / initial_count[group]` and
+write the progress to `autoupdate_agent_plan` status. This formula determines the progress percentage by adding a
+`max_in_flight` percentage-window above the number of currently updated agents in the group.

-However, if `as_numeral(agent_data[group].stats[not(target_version)].lowest_uuid) / as_numeral(max_uuid)` is above the calculated progress, that progress value will be used instead.
-This protects against a statistical deadlock, where no UUIDs fall within the next `max_in_flight` window of UUID space, by always permitting the next non-updated agent to update.
+However, if `as_numeral(agent_data[group].stats[not(target_version)].lowest_uuid) / as_numeral(max_uuid)` is above the
+calculated progress, that progress value will be used instead. This protects against a statistical deadlock, where no
+UUIDs fall within the next `max_in_flight` window of UUID space, by always permitting the next non-updated agent to
+update.
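To make the backpressure arithmetic concrete, here is a minimal Go sketch of the phase 6 progress computation. It is illustrative only: the `GroupStats` type and `Progress` function are hypothetical names, `maxInFlight` is assumed to already be parsed from its percent form into a fraction, and the halt restriction described in the next paragraph is folded into the same function.

```go
package rollout

// GroupStats is a hypothetical aggregate of the per-group counts persisted by
// the auth servers.
type GroupStats struct {
	InitialCount   int     // agents connected at the start of the window
	PresentCount   int     // agents currently connected, at any version
	UpdatedCount   int     // agents currently connected at target_version
	LowestUUIDFrac float64 // as_numeral(lowest non-updated UUID) / as_numeral(max_uuid)
}

// Progress returns the fraction of UUID space allowed to update, or ok=false
// when the rollout must halt because too many un-updated agents dropped off.
func Progress(s GroupStats, maxInFlight float64) (progress float64, ok bool) {
	if s.InitialCount == 0 {
		return 0, false
	}
	initial := float64(s.InitialCount)
	// Halt unless stats[*].count > initial_count - max_in_flight * initial_count.
	if float64(s.PresentCount) <= initial-maxInFlight*initial {
		return 0, false
	}
	// A max_in_flight window above the number of already-updated agents.
	progress = (maxInFlight*initial + float64(s.UpdatedCount)) / initial
	// Deadlock escape: always admit at least the next non-updated agent.
	if s.LowestUUIDFrac > progress {
		progress = s.LowestUUIDFrac
	}
	if progress > 1 {
		progress = 1
	}
	return progress, true
}
```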
To ensure that the rollout is halted if more than `max_in_flight` un-updated agents drop off, an additional restriction
must be imposed for the rollout to proceed:
`agent_data[group].stats[*].count > initial_count[group] - max_in_flight * initial_count[group]`

To prevent double-counting of agents when considering all counts across all auth servers, only agents connected for one
minute will be considered in these formulas.

### Manually interacting with the rollout

#### RPCs

Users and administrators can interact with the rollout plan using the following RPCs:

```protobuf
```

#### CLI

[TODO add cli commands]

### Editing the plan

The updater will receive `agent_autoupdate: true` from the time it is designated for update until the `target_version` in `autoupdate_agent_plan` (below) changes.
Changing the `target_version` resets the schedule immediately, clearing all progress.

[TODO: What is the use-case for this? Can we do as with `target_version` and reset everything instead of trying to merge the state?]
Changing the `start_version` in `autoupdate_agent_plan` changes the advertised `start_version` for all unfinished groups.

Changing `agent_schedules` will preserve the `state` of groups that have the same name before and after the change.
However, any changes to `agent_schedules` that occur while a group is active will be rejected.

Releasing new agent versions multiple times a week has the potential to starve dependent groups from updates.

Note that the `default` schedule applies to agents that do not specify a group name.
[TODO: It seems we removed the default bool, so we have a mandatory default group? Can we pick the last one instead?]

```shell
# configuration
# TODO: "tctl autoupdate update" is bad UX, especially as this doesn't even trigger an agent update but updates the AU resource.
# We should choose a user-friendly signature
$ tctl autoupdate update --set-agent-auto-update=off
Automatic updates configuration has been updated.
$ tctl autoupdate update --group staging --set-start-hour=3
Automatic updates configuration has been updated.
$ tctl autoupdate update --group staging --set-jitter-seconds=60
Automatic updates configuration has been updated.
$ tctl autoupdate update --group default --set-jitter-seconds=60
Automatic updates configuration has been updated.
$ tctl autoupdate reset
Automatic updates configuration has been reset to defaults.

# status
$ tctl autoupdate status
Status: disabled
Version: v1.2.4
Schedule: regular

Groups:
staging: succeeded at 2024-01-03 23:43:22 UTC
prod: scheduled for 2024-01-03 23:43:22 UTC (depends on prod)
other: failed at 2024-01-05 22:53:22 UTC

$ tctl autoupdate status --group staging
Status: succeeded
Date: 2024-01-03 23:43:22 UTC
Requires: (none)

Updated: 230 (95%)
Unchanged: 10 (2%)
Failed: 15 (3%)
Timed-out: 0

# re-running failed group
$ tctl autoupdate run --group staging
Executing auto-update for group 'staging' immediately.
```

Notes:
- `autoupdate_agent_plan` is separate from `autoupdate_config` so that Cloud customers can be restricted from updating
  `autoupdate_agent_plan`, while maintaining control over the rollout.
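The schedule-editing rules above can be made concrete with a short sketch. This is not the actual implementation: the `GroupStatus` type and `MergeSchedules` function are hypothetical, and only the `active` state blocks edits, per the rule above.

```go
package rollout

import "fmt"

// GroupStatus is a hypothetical per-group rollout status.
type GroupStatus struct {
	Name  string
	State string // unstarted, canary, active, done, rolledback
}

// MergeSchedules carries existing group state over to an edited schedule:
// groups that keep their name keep their state, new groups start unstarted,
// and the edit is rejected outright while any group is active.
func MergeSchedules(current []GroupStatus, newNames []string) ([]GroupStatus, error) {
	byName := make(map[string]GroupStatus, len(current))
	for _, g := range current {
		if g.State == "active" {
			return nil, fmt.Errorf("cannot edit agent_schedules: group %q is currently active", g.Name)
		}
		byName[g.Name] = g
	}
	merged := make([]GroupStatus, 0, len(newNames))
	for _, name := range newNames {
		if g, ok := byName[name]; ok {
			merged = append(merged, g) // same name before and after: preserve state
			continue
		}
		merged = append(merged, GroupStatus{Name: name, State: "unstarted"})
	}
	return merged, nil
}
```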
+ +### Updater APIs + +#### Update requests + +Teleport proxies will be updated to serve the desired agent version and edition from `/v1/webapi/find`. +The version and edition served from that endpoint will be configured using new `autoupdate_agent_plan` resource. + +Whether the Teleport updater querying the endpoint is instructed to upgrade (via the `agent_autoupdate` field) is +dependent on: +- The `host=[uuid]` parameter sent to `/v1/webapi/find` +- The `group=[name]` parameter sent to `/v1/webapi/find` +- The group state from the `autoupdate_agent_plan` status -#### Proxies +To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via +unauthenticated requests to `/v1/webapi/find`. Teleport proxies modulate the `/v1/webapi/find` response given the host +UUID and group name. -When the updater queries the proxy via `/v1/webapi/find?host=[uuid]&group=[name]`, the proxies query the `autoupdate_agent_plan` status to determine the value of `agent_autoupdate: true`. -The boolean is returned as `true` in the case that the provided `host` contains a UUID that is under the progress percentage for the `group`: +When the updater queries the proxy via `/v1/webapi/find?host=[uuid]&group=[name]`, the proxies query the +`autoupdate_agent_plan` status to determine the value of `agent_autoupdate: true`. +The boolean is returned as `true` in the case that the provided `host` contains a UUID that is under the progress +percentage for the `group`: `as_numeral(host_uuid) / as_numeral(max_uuid) < progress` +The returned JSON looks like: + +`/v1/webapi/find?host=[uuid]&group=[name]` +```json +{ + "server_edition": "enterprise", + "agent_version": "15.1.1", + "agent_autoupdate": true, + "agent_update_jitter_seconds": 10 +} +``` + +Notes: + +- Agents will only update if `agent_autoupdate` is `true`, but new installations will use `agent_version` regardless of + the value in `agent_autoupdate`. +- The edition served is the cluster edition (enterprise, enterprise-fips, or oss), and cannot be configured. +- The group name is read from `/var/lib/teleport/versions/update.yaml` by the updater. +- The UUID is read from `/tmp/teleport_update_uuid`, which `teleport-update` regenerates when missing. +- the jitter is served by the teleport cluster and depends on the rollout strategy (60 sec by default, 10sec when using + the backpressure strategy). + +Let `v1` be the previous version and `v2` the target version, the response matrix is the following: + ##### Rollout status: disabled | Group state | Version | Should update | @@ -806,24 +900,41 @@ The boolean is returned as `true` in the case that the provided `host` contains | done | v2 | true | | rolledback | v1 | true | -### REST Endpoints +#### Updater status reporting -`/v1/webapi/find?host=[uuid]&group=[name]` -```json -{ - "server_edition": "enterprise", - "agent_version": "15.1.1", - "agent_autoupdate": true, - "agent_update_jitter_seconds": 10 -} -``` -Notes: -- Agents will only update if `agent_autoupdate` is `true`, but new installations will use `agent_version` regardless of the value in `agent_autoupdate`. -- The edition served is the cluster edition (enterprise, enterprise-fips, or oss), and cannot be configured. -- The group name is read from `/var/lib/teleport/versions/update.yaml` by the updater. -- The UUID is read from `/tmp/teleport_update_uuid`, which `teleport-update` regenerates when missing. 
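The UUID-threshold check above lends itself to a small sketch of the proxy-side decision. This is illustrative, not the actual proxy code: the function names are hypothetical, it covers only the `enabled` rollout matrix under the backpressure strategy, and it ignores the special-casing of designated canary agents during the `canary` state.

```go
package proxy

import (
	"math/big"

	"github.com/google/uuid"
)

// uuidFraction maps a UUID onto [0, 1) by treating its 128 bits as an
// integer, i.e. as_numeral(host_uuid) / as_numeral(max_uuid).
func uuidFraction(id uuid.UUID) float64 {
	n := new(big.Float).SetInt(new(big.Int).SetBytes(id[:]))
	max := new(big.Float).SetInt(new(big.Int).Lsh(big.NewInt(1), 128))
	f, _ := new(big.Float).Quo(n, max).Float64()
	return f
}

// ShouldUpdate mirrors the "rollout status: enabled" response matrix:
// finished and rolled-back groups always update, active groups update only
// the slice of UUID space below the current progress, and unstarted or
// canary groups hold back.
func ShouldUpdate(groupState string, hostID uuid.UUID, progress float64) bool {
	switch groupState {
	case "done", "rolledback":
		return true
	case "active":
		return uuidFraction(hostID) < progress
	default: // unstarted, canary
		return false
	}
}
```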
+Instance heartbeats will be extended to incorporate and send data that is written to `/var/lib/teleport/versions/update.yaml` by the `teleport-update` binary.
+
+The following data related to the rollout are stored in each instance heartbeat:
+- `agent_update_start_time`: timestamp of individual agent's upgrade time
+- `agent_update_start_version`: current agent version
+- `agent_update_rollback`: whether the agent was rolled-back automatically
+- `agent_update_uuid`: Auto-update UUID
+- `agent_update_group`: Auto-update group name
+
+[TODO: mention that we'll also send this info in the hello and store it in the auth inventory]
+
+Auth servers use their local instance inventory to calculate rollout statistics and write them to `/autoupdate/[group]/[auth ID]` (e.g., `/autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56`).
+
+Every minute, auth servers persist the version counts:
+- `agent_data[group].stats[version]`
+  - `count`: number of currently connected agents at `version` in `group`
+  - `failed_count`: number of currently connected agents at `version` in `group` that experienced a rollback or inability to upgrade
+  - `lowest_uuid`: lowest UUID of all currently connected agents at `version` in `group`
+  - `count`: number of connected agents at `version` in `group` at start of window
+- `agent_data[group]`
+  - `canaries`: list of updater UUIDs to use for canary deployments
+
+Expiration time of the persisted key is 1 hour.
+
+To progress the rollout, auth servers will range-read keys from `/autoupdate/[group]/*`, sum the counts, and write back to the `autoupdate_agent_plan` status on a one-minute interval.
+- To calculate the initial number of agents connected at the start of the window, each auth server will write the summed count of agents to `autoupdate_agent_plan` status, if not already written.
+- To calculate the canaries, each auth server will write a random selection of all canaries to `autoupdate_agent_plan` status, if not already written.
+- To determine the progress through the rollout, auth servers will write the calculated progress to the `autoupdate_agent_plan` status using the formulas, declining to write if the current written progress is further ahead.
+
+If `/autoupdate/[group]/[auth ID]` is older than 1 minute, we do not consider its contents.
+This prevents double-counting agents when auth servers are killed.

-## Details - Linux Agents
+### Linux Agents

We will ship a new auto-updater package for Linux servers written in Go that does not interface with the system package manager.
It will be distributed as a separate package from Teleport, and manage the installation of the correct Teleport agent version manually.
@@ -832,7 +943,7 @@ It will download the correct version of Teleport as a tarball, unpack it in `/va
Source code for the updater will live in the main Teleport repository, with the updater binary built from `tools/teleport-update`.

-### Installation
+#### Installation

```shell
$ apt-get install teleport
@@ -853,7 +964,7 @@ $ teleport-update enable --proxy example.teleport.sh --template 'https://example
```
(Checksum will use template path + `.sha256`)

-### Filesystem
+#### Filesystem

```
$ tree /var/lib/teleport
@@ -905,7 +1016,7 @@ $ ls -l /usr/local/lib/systemd/system/teleport.service
/usr/local/lib/systemd/system/teleport.service -> /var/lib/teleport/versions/15.0.0/etc/systemd/teleport.service
```

-#### update.yaml
+##### update.yaml

This file stores configuration for `teleport-update`.
@@ -936,7 +1047,7 @@ status: error: "" ``` -#### backup.yaml +##### backup.yaml This file stores metadata about an individual backup of the Teleport agent's sqlite DB. @@ -952,7 +1063,7 @@ spec: creation_time: 2020-12-09T16:09:53+00:00 ``` -### Runtime +#### Runtime The `teleport-update` binary will run as a periodically executing systemd service which runs every 10 minutes. The systemd service will run: @@ -1133,7 +1244,7 @@ The following documentation will need to be updated to cover the new updater wor Additionally, the Cloud dashboard tenants downloads tab will need to be updated to reference the new instructions. -## Details - Kubernetes Agents +### Details - Kubernetes Agents The Kubernetes agent updater will be updated for compatibility with the new scheduling system. @@ -1462,7 +1573,7 @@ Making the update boolean instruction available via the `/webapi/find` TLS endpo 8. Communicate to users that they should update to the new system. 9. Begin deprecation of old auto-updater resources, packages, and endpoints. 10. Add healthcheck endpoint to Teleport agents and incorporate into rollback logic. -10. Add progress and completion checking. -10. Add canary functionality. -10. Add backpressure functionality if necessary. -11. Add DB backups if necessary. +11. Add progress and completion checking. +12. Add canary functionality. +13. Add backpressure functionality if necessary. +14. Add DB backups if necessary. From 246afe416ffd18b5ea7552e79f3ee2f221e58d41 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 2 Oct 2024 13:19:28 -0400 Subject: [PATCH 080/105] cleanup + swap phases 1 and 2 --- rfd/0169-auto-updates-linux-agents.md | 80 ++++++++++++++++----------- 1 file changed, 49 insertions(+), 31 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 0d5ce9d964b2a..c791815db0055 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -15,7 +15,7 @@ state: draft This RFD proposes a new mechanism for scheduled, automatic updates of Teleport agents. -Users of Teleport will be able to use the tctl CLI to specify desired versions, update schedules, and rollout strategy. +Users of Teleport will be able to use the tctl CLI to specify desired versions, update schedules, and a rollout strategy. Agents will be updated by a new `teleport-update` binary, built from `tools/teleport-update` in the Teleport repository. @@ -33,43 +33,49 @@ Additionally, this RFD parallels the auto-update functionality for client tools ## Why -1. We want customers always running the latest release of Teleport to always be secure, have access to the latest - features, and not deal with the pain of updating the agents. -2. Reduce Teleport Cloud operational costs of contacting customers with old agents. - Make updating easier for self-hosted customers so we don't have to provide support for older Teleport versions. -3. Increase reliability to 99.99%. +1. We want customers to run the latest release of Teleport so that they are secure and have access to the latest + features. +2. We do not want customers to deal with the pain of updating agents installed on their own infrastructure. +3. We want to reduce the operational cost of customers running old agents. + For Cloud customers, this will allow us to support fewer simultaneous cluster versions and reduce support load. + For self-hosted customers, this will reduce support load associated with debugging old versions of Teleport. +4. 
Providing 99.99% availability for customers requires us to maintain that level of availability at the agent-level + as well as the cluster-level. The current systemd updater does not meet those requirements: -- its use of package managers leads users to accidentally upgrade Teleport -- the installation process is complex and users end up installing the wrong version of Teleport -- the current update process does not provide safeties to protect against broken updates -- many customers are not adopting the existing updater because they want to control when updates happen -- we don't offer a nice user experience for self-hosted users, this ends up in a marginal automatic updates +- Its use of package managers leads users to accidentally upgrade Teleport. +- Its installation process is complex and users end up installing the wrong version of Teleport. +- Its update process does not provide safeties to protect against broken updates. +- Customers are not adopting the existing updater because they want to control when updates happen. +- We do not offer a nice user experience for self-hosted users. This results in a marginal automatic updates adoption and does not reduce the cost of upgrading self-hosted clusters. ## How -The new agent automatic updates will rely on a separate updating binary controlling which Teleport version is -installed. The automatic updates will be implemented via incremental improvements over the existing mechanism: +The new agent automatic updates will rely on a separate `teleport-update` binary controlling which Teleport version is +installed. Automatic updates will be implemented via incrementally: -- Phase 1: introduce a new updater binary not relying on package managers -- Phase 2: introduce the concept of agent update groups and make the users chose in which order groups are updated -- Phase 3: add the ability for the agent updater to immediately revert a faulty update -- Phase 4: add a feedback mechanism for the Teleport inventory to track the agents of each group and their update status -- Phase 5: add the canary deployment strategy: a few agents are updated first, if they don't die, the whole group is updated -- Phase 6: add the ability to perform slow and incremental version rollouts within an agent update group +- Phase 1: Introduce a new updater binary which does not rely on package managers. Allow tctl to roll out updates to all agents. +- Phase 2: Add the ability for the agent updater to immediately revert a faulty update. +- Phase 3: Introduce the concept of agent update groups and make users chose in which order groups are updated. +- Phase 4: Add a feedback mechanism for the Teleport inventory to track the agents of each group and their update status. +- Phase 5: Add the canary deployment strategy: a few agents are updated first, if they don't die, the whole group is updated. +- Phase 6: Add the ability to perform slow and incremental version rollouts within an agent update group. The updater will be usable after phase 1, and will gain new capabilities after each phase. +After phase 2, the new updater will have feature-parity with the old updater. +The existing auto-updates mechanism will remain unchanged throughout the process, and deprecated in the future. + Future phases might change as we are working on the implementation and collecting real-world feedback and experience. ### Resources -We will introduce 2 user-facing resources: +We will introduce two user-facing resources: 1. The `autoupdate_config` resource, owned by the Teleport user. 
This resource allows Teleport users to configure:
   - Whether automatic updates are enabled, disabled, or temporarily suspended
   - The order in which their agents should be updated (`dev` before `staging` before `prod`)
   - When updates should start

   The resource will look like:
   ```yaml
   kind: autoupdate_config
   spec:
     agent_autoupdate_mode: enable
     agent_schedules:
       regular:
       - name: dev
         days: ["Mon", "Tue", "Wed", "Thu"]
         start_hour: 0
         wait_days: 0
         alert_after: 4h
         canary_count: 5 # added in phase 5
         max_in_flight: 20% # added in phase 6
   ```

2. The `autoupdate_agent_plan` resource, its spec is owned by the Teleport cluster administrator (e.g. Teleport Cloud team).
   Its status is owned by Teleport and contains the current rollout status. Some parts of the status can be changed via
   select RPCs, for example fast-tracking a group update.

@@ -157,7 +163,7 @@ Those are the requirements coming from engineering, product, and cloud teams:
5. Self-hosted customers should be supported, for example, customers whose own internal customers run
   Teleport agents.
-6. Upgrading a leaf cluster is out of scope.
+6. Upgrading a leaf cluster is out-of-scope.
7. Rolling back after a broken update should be supported.
   Roll forward gets you 99.9%; we need rollback for 99.99%.
@@ -174,17 +180,17 @@
13. I should be able to install Teleport via whatever mechanism I want to.
-14. If new nodes join a bucket outside the upgrade window and you are within your compat. window, wait until your next group update start.
+14. If new nodes join a bucket outside the upgrade window, and you are within your compatibility window, wait until your next group update starts.
   If you are not within your compatibility window, attempt to upgrade right away.
-15. If an agent comes back online after some period of time and is still compat. with
+15. If an agent comes back online after some period of time, and it is still compatible with
   the control plane, it waits until the next upgrade window, when it will be upgraded.
-16. Regular cloud tenant update schedule should run in les than a week.
+16. Regular cloud tenant update schedule should run in less than a week.
   Select tenants might support longer schedules.
-17. A cloud customer should be able to pause, resume, and rollback and existing rollout schedule.
-   A cloud customer should not be able to create new rollout schedules.
+17. A Cloud customer should be able to pause, resume, and roll back an existing rollout schedule.
+   A Cloud customer should not be able to create new rollout schedules.
   Teleport can create as many rollout schedules as it wants.
@@ -618,8 +624,20 @@ status:
  present_count: 53
  # failed_count is the number of agents rolled-back since the start of the rollout
  failed_count: 23
- # canaries is a list of updater UUIDs used for canary deployments
- canaries: ["abc123-..."]
+ # canaries is a list of agents used for canary deployments
+ canaries: # part of phase 5
+   # updater_id is the updater UUID
+   - updater_id: abc123-...
+     # host_id is the agent host UUID
+     host_id: def534-...
+     # hostname of the agent
+     hostname: foo.example.com
+     # success status
+     success: false
+     # last_update_time is [TODO: what does this represent?]
+     last_update_time: 2020-12-10T16:09:53+00:00
+     # last_update_reason is [TODO: what does this represent?]
+ last_update_reason: canaryTesting # progress is the current progress through the rollout progress: 0.532 # state is the current state of the rollout (unstarted, active, done, rollback) From b815c942c91c0adcea6708fccd044e5882cd36dc Mon Sep 17 00:00:00 2001 From: hugoShaka Date: Wed, 2 Oct 2024 16:21:03 -0400 Subject: [PATCH 081/105] Move protobuf --- rfd/0169-auto-updates-linux-agents.md | 542 ++++++++++++++------------ 1 file changed, 285 insertions(+), 257 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index c791815db0055..2a9048d628ad7 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -68,8 +68,6 @@ The existing auto-updates mechanism will remain unchanged throughout the process Future phases might change as we are working on the implementation and collecting real-world feedback and experience. -### Resources - We will introduce two user-facing resources: 1. The `autoupdate_config` resource, owned by the Teleport user. This resource allows Teleport users to configure: @@ -648,6 +646,291 @@ status: last_update_reason: rollback ``` +#### Protobuf + +```protobuf +syntax = "proto3"; + +package teleport.autoupdate.v1; + +option go_package = "github.com/gravitational/teleport/api/gen/proto/go/teleport/autoupdate/v1;autoupdatev1"; + +// AutoUpdateService serves agent and client automatic version updates. +service AutoUpdateService { + // GetAutoUpdateConfig updates the autoupdate config. + rpc GetAutoUpdateConfig(GetAutoUpdateConfigRequest) returns (AutoUpdateConfig); + // CreateAutoUpdateConfig creates the autoupdate config. + rpc CreateAutoUpdateConfig(CreateAutoUpdateConfigRequest) returns (AutoUpdateConfig); + // UpdateAutoUpdateConfig updates the autoupdate config. + rpc UpdateAutoUpdateConfig(UpdateAutoUpdateConfigRequest) returns (AutoUpdateConfig); + // UpsertAutoUpdateConfig overwrites the autoupdate config. + rpc UpsertAutoUpdateConfig(UpsertAutoUpdateConfigRequest) returns (AutoUpdateConfig); + // ResetAutoUpdateConfig restores the autoupdate config to default values. + rpc ResetAutoUpdateConfig(ResetAutoUpdateConfigRequest) returns (AutoUpdateConfig); + + // GetAutoUpdateAgentPlan returns the autoupdate plan for agents. + rpc GetAutoUpdateAgentPlan(GetAutoUpdateAgentPlanRequest) returns (AutoUpdateAgentPlan); + // CreateAutoUpdateAgentPlan creates the autoupdate plan for agents. + rpc CreateAutoUpdateAgentPlan(CreateAutoUpdateAgentPlanRequest) returns (AutoUpdateAgentPlan); + // UpdateAutoUpdateAgentPlan updates the autoupdate plan for agents. + rpc UpdateAutoUpdateAgentPlan(UpdateAutoUpdateAgentPlanRequest) returns (AutoUpdateAgentPlan); + // UpsertAutoUpdateAgentPlan overwrites the autoupdate plan for agents. + rpc UpsertAutoUpdateAgentPlan(UpsertAutoUpdateAgentPlanRequest) returns (AutoUpdateAgentPlan); + + // TriggerAgentGroup changes the state of an agent group from `unstarted` to `active` or `canary`. + rpc TriggerAgentGroup(TriggerAgentGroupRequest) returns (AutoUpdateAgentPlan); + // ForceAgentGroup changes the state of an agent group from `unstarted`, `canary`, or `active` to the `done` state. + rpc ForceAgentGroup(ForceAgentGroupRequest) returns (AutoUpdateAgentPlan); + // ResetAgentGroup resets the state of an agent group. + // For `canary`, this means new canaries are picked + // For `active`, this means the initial node count is computed again. 
+ rpc ResetAgentGroup(ResetAgentGroupRequest) returns (AutoUpdateAgentPlan); + // RollbackAgentGroup changes the state of an agent group to `rolledback`. + rpc RollbackAgentGroup(RollbackAgentGroupRequest) returns (AutoUpdateAgentPlan); +} + +// GetAutoUpdateConfigRequest requests the contents of the AutoUpdateConfig. +message GetAutoUpdateConfigRequest {} + +// CreateAutoUpdateConfigRequest requests creation of the the AutoUpdateConfig. +message CreateAutoUpdateConfigRequest { + AutoUpdateConfig autoupdate_config = 1; +} + +// UpdateAutoUpdateConfigRequest requests an update of the the AutoUpdateConfig. +message UpdateAutoUpdateConfigRequest { + AutoUpdateConfig autoupdate_config = 1; +} + +// UpsertAutoUpdateConfigRequest requests an upsert of the the AutoUpdateConfig. +message UpsertAutoUpdateConfigRequest { + AutoUpdateConfig autoupdate_config = 1; +} + +// ResetAutoUpdateConfigRequest requests a reset of the the AutoUpdateConfig to default values. +message ResetAutoUpdateConfigRequest {} + +// AutoUpdateConfig holds dynamic configuration settings for automatic updates. +message AutoUpdateConfig { + // kind is the kind of the resource. + string kind = 1; + // sub_kind is the sub kind of the resource. + string sub_kind = 2; + // version is the version of the resource. + string version = 3; + // metadata is the metadata of the resource. + teleport.header.v1.Metadata metadata = 4; + // spec is the spec of the resource. + AutoUpdateConfigSpec spec = 7; +} + +// AutoUpdateConfigSpec is the spec for the autoupdate config. +message AutoUpdateConfigSpec { + // agent_autoupdate_mode specifies whether agent autoupdates are enabled, disabled, or paused. + Mode agent_autoupdate_mode = 1; + // agent_schedules specifies schedules for updates of grouped agents. + AgentAutoUpdateSchedules agent_schedules = 3; +} + +// AgentAutoUpdateSchedules specifies update scheduled for grouped agents. +message AgentAutoUpdateSchedules { + // regular schedules for non-critical versions. + repeated AgentAutoUpdateGroup regular = 1; +} + +// AgentAutoUpdateGroup specifies the update schedule for a group of agents. +message AgentAutoUpdateGroup { + // name of the group + string name = 1; + // days to run update + repeated Day days = 2; + // start_hour to initiate update + int32 start_hour = 3; + // wait_days after last group succeeds before this group can run + int64 wait_days = 4; + // alert_after_hours specifies the number of hours to wait before alerting that the rollout is not complete. + int64 alert_after_hours = 5; + // jitter_seconds to introduce before update as rand([0, jitter_seconds]) + int64 jitter_seconds = 6; + // canary_count of agents to use in the canary deployment. + int64 canary_count = 7; + // max_in_flight specifies agents that can be updated at the same time, by percent. + string max_in_flight = 8; +} + +// Day of the week +enum Day { + DAY_UNSPECIFIED = 0; + DAY_ALL = 1; + DAY_SUNDAY = 2; + DAY_MONDAY = 3; + DAY_TUESDAY = 4; + DAY_WEDNESDAY = 5; + DAY_THURSDAY = 6; + DAY_FRIDAY = 7; + DAY_SATURDAY = 8; +} + +// Mode of operation +enum Mode { + // UNSPECIFIED update mode + MODE_UNSPECIFIED = 0; + // DISABLE updates + MODE_DISABLE = 1; + // ENABLE updates + MODE_ENABLE = 2; + // PAUSE updates + MODE_PAUSE = 3; +} + +// GetAutoUpdateAgentPlanRequest requests the autoupdate_agent_plan singleton resource. +message GetAutoUpdateAgentPlanRequest {} + +// GetAutoUpdateAgentPlanRequest requests creation of the autoupdate_agent_plan singleton resource. 
+message CreateAutoUpdateAgentPlanRequest { + // autoupdate_agent_plan resource contents + AutoUpdateAgentPlan autoupdate_agent_plan = 1; +} + +// GetAutoUpdateAgentPlanRequest requests an update of the autoupdate_agent_plan singleton resource. +message UpdateAutoUpdateAgentPlanRequest { + // autoupdate_agent_plan resource contents + AutoUpdateAgentPlan autoupdate_agent_plan = 1; +} + +// GetAutoUpdateAgentPlanRequest requests an upsert of the autoupdate_agent_plan singleton resource. +message UpsertAutoUpdateAgentPlanRequest { + // autoupdate_agent_plan resource contents + AutoUpdateAgentPlan autoupdate_agent_plan = 1; +} + +// AutoUpdateAgentPlan holds dynamic configuration settings for agent autoupdates. +message AutoUpdateAgentPlan { + // kind is the kind of the resource. + string kind = 1; + // sub_kind is the sub kind of the resource. + string sub_kind = 2; + // version is the version of the resource. + string version = 3; + // metadata is the metadata of the resource. + teleport.header.v1.Metadata metadata = 4; + // spec is the spec of the resource. + AutoUpdateAgentPlanSpec spec = 5; + // status is the status of the resource. + AutoUpdateAgentPlanStatus status = 6; +} + +// AutoUpdateAgentPlanSpec is the spec for the autoupdate version. +message AutoUpdateAgentPlanSpec { + // start_version is the version to update from. + string start_version = 1; + // target_version is the version to update to. + string target_version = 2; + // schedule to use for the rollout + Schedule schedule = 3; + // strategy to use for the rollout + Strategy strategy = 4; + // autoupdate_mode to use for the rollout + Mode autoupdate_mode = 5; +} + +// Schedule type for the rollout +enum Schedule { + // UNSPECIFIED update schedule + SCHEDULE_UNSPECIFIED = 0; + // REGULAR update schedule + SCHEDULE_REGULAR = 1; + // IMMEDIATE update schedule for updating all agents immediately + SCHEDULE_IMMEDIATE = 2; +} + +// Strategy type for the rollout +enum Strategy { + // UNSPECIFIED update strategy + STRATEGY_UNSPECIFIED = 0; + // GROUPED update schedule, with no backpressure + STRATEGY_GROUPED = 1; + // BACKPRESSURE update schedule + STRATEGY_BACKPRESSURE = 2; +} + +// AutoUpdateAgentPlanStatus is the status for the AutoUpdateAgentPlan. +message AutoUpdateAgentPlanStatus { + // name of the group + string name = 0; + // start_time of the rollout + google.protobuf.Timestamp start_time = 1; + // initial_count is the number of connected agents at the start of the window. + int64 initial_count = 2; + // present_count is the current number of connected agents. + int64 present_count = 3; + // failed_count specifies the number of failed agents. + int64 failed_count = 4; + // canaries is a list of canary agents. + repeated Canary canaries = 5; + // progress is the current progress through the rollout. + float progress = 6; + // state is the current state of the rollout. + State state = 7; + // last_update_time is the time of the previous update for this group. 
+ google.protobuf.Timestamp last_update_time = 8; + // last_update_reason is the trigger for the last update + string last_update_reason = 9; +} + +// Canary agent +message Canary { + // update_uuid of the canary agent + string update_uuid = 0; + // host_uuid of the canary agent + string host_uuid = 1; + // hostname of the canary agent + string hostname = 2; + // success state of the canary agent + bool success = 3; +} + +// State of the rollout +enum State { + // UNSPECIFIED state + STATE_UNSPECIFIED = 0; + // UNSTARTED state + STATE_UNSTARTED = 1; + // CANARY state + STATE_CANARY = 2; + // ACTIVE state + STATE_ACTIVE = 3; + // DONE state + STATE_DONE = 4; + // ROLLEDBACK state + STATE_ROLLEDBACK = 5; +} + +message TriggerAgentGroupRequest { + // group is the agent update group name whose maintenance should be triggered. + string group = 1; + // desired_state describes the desired start state. + // Supported values are STATE_UNSPECIFIED, STATE_CANARY, and STATE_ACTIVE. + // When left empty, defaults to canary if they are supported. + State desired_state = 2; +} + +message ForceAgentGroupRequest { + // group is the agent update group name whose state should be forced to `done`. + string group = 1; +} + +message ResetAgentGroupRequest { + // group is the agent update group name whose state should be reset. + string group = 1; +} + +message RollbackAgentGroupRequest { + // group is the agent update group name whose state should change to `rolledback`. + string group = 1; +} +``` + ### Backend logic to progress the rollout The update proceeds from the first group to the last group, ensuring that each group successfully updates before @@ -1300,261 +1583,6 @@ Care will be taken to ensure that updater logs are sharable with Teleport Suppor When TUF is added, that events related to supply chain security may be sent to the Teleport cluster via the Teleport Agent. -## Protobuf API Changes - -Note: all updates use revisions to prevent data loss in case of concurrent access. - -### autoupdate/v1 - -```protobuf -syntax = "proto3"; - -package teleport.autoupdate.v1; - -option go_package = "github.com/gravitational/teleport/api/gen/proto/go/teleport/autoupdate/v1;autoupdatev1"; - -// AutoUpdateService serves agent and client automatic version updates. -service AutoUpdateService { - // GetAutoUpdateConfig updates the autoupdate config. - rpc GetAutoUpdateConfig(GetAutoUpdateConfigRequest) returns (AutoUpdateConfig); - // CreateAutoUpdateConfig creates the autoupdate config. - rpc CreateAutoUpdateConfig(CreateAutoUpdateConfigRequest) returns (AutoUpdateConfig); - // UpdateAutoUpdateConfig updates the autoupdate config. - rpc UpdateAutoUpdateConfig(UpdateAutoUpdateConfigRequest) returns (AutoUpdateConfig); - // UpsertAutoUpdateConfig overwrites the autoupdate config. - rpc UpsertAutoUpdateConfig(UpsertAutoUpdateConfigRequest) returns (AutoUpdateConfig); - // ResetAutoUpdateConfig restores the autoupdate config to default values. - rpc ResetAutoUpdateConfig(ResetAutoUpdateConfigRequest) returns (AutoUpdateConfig); - - // GetAutoUpdateAgentPlan returns the autoupdate plan for agents. - rpc GetAutoUpdateAgentPlan(GetAutoUpdateAgentPlanRequest) returns (AutoUpdateAgentPlan); - // CreateAutoUpdateAgentPlan creates the autoupdate plan for agents. - rpc CreateAutoUpdateAgentPlan(CreateAutoUpdateAgentPlanRequest) returns (AutoUpdateAgentPlan); - // UpdateAutoUpdateAgentPlan updates the autoupdate plan for agents. 
- rpc UpdateAutoUpdateAgentPlan(UpdateAutoUpdateAgentPlanRequest) returns (AutoUpdateAgentPlan); - // UpsertAutoUpdateAgentPlan overwrites the autoupdate plan for agents. - rpc UpsertAutoUpdateAgentPlan(UpsertAutoUpdateAgentPlanRequest) returns (AutoUpdateAgentPlan); -} - -// GetAutoUpdateConfigRequest requests the contents of the AutoUpdateConfig. -message GetAutoUpdateConfigRequest {} - -// CreateAutoUpdateConfigRequest requests creation of the the AutoUpdateConfig. -message CreateAutoUpdateConfigRequest { - AutoUpdateConfig autoupdate_config = 1; -} - -// UpdateAutoUpdateConfigRequest requests an update of the the AutoUpdateConfig. -message UpdateAutoUpdateConfigRequest { - AutoUpdateConfig autoupdate_config = 1; -} - -// UpsertAutoUpdateConfigRequest requests an upsert of the the AutoUpdateConfig. -message UpsertAutoUpdateConfigRequest { - AutoUpdateConfig autoupdate_config = 1; -} - -// ResetAutoUpdateConfigRequest requests a reset of the the AutoUpdateConfig to default values. -message ResetAutoUpdateConfigRequest {} - -// AutoUpdateConfig holds dynamic configuration settings for automatic updates. -message AutoUpdateConfig { - // kind is the kind of the resource. - string kind = 1; - // sub_kind is the sub kind of the resource. - string sub_kind = 2; - // version is the version of the resource. - string version = 3; - // metadata is the metadata of the resource. - teleport.header.v1.Metadata metadata = 4; - // spec is the spec of the resource. - AutoUpdateConfigSpec spec = 7; -} - -// AutoUpdateConfigSpec is the spec for the autoupdate config. -message AutoUpdateConfigSpec { - // agent_autoupdate_mode specifies whether agent autoupdates are enabled, disabled, or paused. - Mode agent_autoupdate_mode = 1; - // agent_schedules specifies schedules for updates of grouped agents. - AgentAutoUpdateSchedules agent_schedules = 3; -} - -// AgentAutoUpdateSchedules specifies update scheduled for grouped agents. -message AgentAutoUpdateSchedules { - // regular schedules for non-critical versions. - repeated AgentAutoUpdateGroup regular = 1; -} - -// AgentAutoUpdateGroup specifies the update schedule for a group of agents. -message AgentAutoUpdateGroup { - // name of the group - string name = 1; - // days to run update - repeated Day days = 2; - // start_hour to initiate update - int32 start_hour = 3; - // wait_days after last group succeeds before this group can run - int64 wait_days = 4; - // alert_after_hours specifies the number of hours to wait before alerting that the rollout is not complete. - int64 alert_after_hours = 5; - // jitter_seconds to introduce before update as rand([0, jitter_seconds]) - int64 jitter_seconds = 6; - // canary_count of agents to use in the canary deployment. - int64 canary_count = 7; - // max_in_flight specifies agents that can be updated at the same time, by percent. - string max_in_flight = 8; -} - -// Day of the week -enum Day { - DAY_UNSPECIFIED = 0; - DAY_ALL = 1; - DAY_SUNDAY = 2; - DAY_MONDAY = 3; - DAY_TUESDAY = 4; - DAY_WEDNESDAY = 5; - DAY_THURSDAY = 6; - DAY_FRIDAY = 7; - DAY_SATURDAY = 8; -} - -// Mode of operation -enum Mode { - // UNSPECIFIED update mode - MODE_UNSPECIFIED = 0; - // DISABLE updates - MODE_DISABLE = 1; - // ENABLE updates - MODE_ENABLE = 2; - // PAUSE updates - MODE_PAUSE = 3; -} - -// GetAutoUpdateAgentPlanRequest requests the autoupdate_agent_plan singleton resource. -message GetAutoUpdateAgentPlanRequest {} - -// GetAutoUpdateAgentPlanRequest requests creation of the autoupdate_agent_plan singleton resource. 
-message CreateAutoUpdateAgentPlanRequest { - // autoupdate_agent_plan resource contents - AutoUpdateAgentPlan autoupdate_agent_plan = 1; -} - -// GetAutoUpdateAgentPlanRequest requests an update of the autoupdate_agent_plan singleton resource. -message UpdateAutoUpdateAgentPlanRequest { - // autoupdate_agent_plan resource contents - AutoUpdateAgentPlan autoupdate_agent_plan = 1; -} - -// GetAutoUpdateAgentPlanRequest requests an upsert of the autoupdate_agent_plan singleton resource. -message UpsertAutoUpdateAgentPlanRequest { - // autoupdate_agent_plan resource contents - AutoUpdateAgentPlan autoupdate_agent_plan = 1; -} - -// AutoUpdateAgentPlan holds dynamic configuration settings for agent autoupdates. -message AutoUpdateAgentPlan { - // kind is the kind of the resource. - string kind = 1; - // sub_kind is the sub kind of the resource. - string sub_kind = 2; - // version is the version of the resource. - string version = 3; - // metadata is the metadata of the resource. - teleport.header.v1.Metadata metadata = 4; - // spec is the spec of the resource. - AutoUpdateAgentPlanSpec spec = 5; - // status is the status of the resource. - AutoUpdateAgentPlanStatus status = 6; -} - -// AutoUpdateAgentPlanSpec is the spec for the autoupdate version. -message AutoUpdateAgentPlanSpec { - // start_version is the version to update from. - string start_version = 1; - // target_version is the version to update to. - string target_version = 2; - // schedule to use for the rollout - Schedule schedule = 3; - // strategy to use for the rollout - Strategy strategy = 4; - // autoupdate_mode to use for the rollout - Mode autoupdate_mode = 5; -} - -// Schedule type for the rollout -enum Schedule { - // UNSPECIFIED update schedule - SCHEDULE_UNSPECIFIED = 0; - // REGULAR update schedule - SCHEDULE_REGULAR = 1; - // IMMEDIATE update schedule for updating all agents immediately - SCHEDULE_IMMEDIATE = 2; -} - -// Strategy type for the rollout -enum Strategy { - // UNSPECIFIED update strategy - STRATEGY_UNSPECIFIED = 0; - // GROUPED update schedule, with no backpressure - STRATEGY_GROUPED = 1; - // BACKPRESSURE update schedule - STRATEGY_BACKPRESSURE = 2; -} - -// AutoUpdateAgentPlanStatus is the status for the AutoUpdateAgentPlan. -message AutoUpdateAgentPlanStatus { - // name of the group - string name = 0; - // start_time of the rollout - google.protobuf.Timestamp start_time = 1; - // initial_count is the number of connected agents at the start of the window. - int64 initial_count = 2; - // present_count is the current number of connected agents. - int64 present_count = 3; - // failed_count specifies the number of failed agents. - int64 failed_count = 4; - // canaries is a list of canary agents. - repeated Canary canaries = 5; - // progress is the current progress through the rollout. - float progress = 6; - // state is the current state of the rollout. - State state = 7; - // last_update_time is the time of the previous update for this group. 
- google.protobuf.Timestamp last_update_time = 8; - // last_update_reason is the trigger for the last update - string last_update_reason = 9; -} - -// Canary agent -message Canary { - // update_uuid of the canary agent - string update_uuid = 0; - // host_uuid of the canary agent - string host_uuid = 1; - // hostname of the canary agent - string hostname = 2; - // success state of the canary agent - bool success = 3; -} - -// State of the rollout -enum State { - // UNSPECIFIED state - STATE_UNSPECIFIED = 0; - // UNSTARTED state - STATE_UNSTARTED = 1; - // CANARY state - STATE_CANARY = 2; - // ACTIVE state - STATE_ACTIVE = 3; - // DONE state - STATE_DONE = 4; - // ROLLEDBACK state - STATE_ROLLEDBACK = 5; -} - -``` - ## Alternatives ### `teleport update` Subcommand From 400fc3e01eaf314f8d26aa267a6111e2b3d6ad81 Mon Sep 17 00:00:00 2001 From: hugoShaka Date: Wed, 2 Oct 2024 18:28:27 -0400 Subject: [PATCH 082/105] Add installation scenarios --- rfd/0169-auto-updates-linux-agents.md | 119 ++++++++++++++------------ 1 file changed, 66 insertions(+), 53 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 2a9048d628ad7..a0f73b35f5caf 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -515,6 +515,71 @@ tctl auto-update agent status ``` +#### As a Teleport user, I want to install a new agent automatically updated + +The manual way: + +```bash +wget https://cdn.teleport.dev/teleport-updater-- +chmod +x teleport-updater +./teleport-updater enable example.teleport.sh --group production +# Detecting the Teleport version and edition used by cluster "example.teleport.sh" +# Installing the following teleport version: +# Version: 16.2.1 +# Edition: Enterprise +# OS: Linux +# Architecture: x86 +# Teleport installed +# Enabling automatic updates, the agent is part of the "production" update group. +# You can now configure the teleport agent with `teleport configure` or by writing your own `teleport.yaml`. +# When the configuration is done, enable and start teleport by running: +# `systemctl start teleport && systemctl enable teleport` +``` + +The one-liner: + +``` +curl https://cdn.teleport.dev/auto-install | bash -s example.teleport.sh +# Downloading the teleport updater +# Detecting the Teleport version and edition used by cluster "example.teleport.sh" +# Installing the following teleport version: +# Version: 16.2.1 +# Edition: Enterprise +# OS: Linux +# Architecture: x86 +# Teleport installed +# Enabling automatic updates, the agent is part of the "default" update group. +# You can now configure the teleport agent with `teleport configure` or by writing your own `teleport.yaml`. +# When the configuration is finished, enable and start teleport by running: +# `systemctl start teleport && systemctl enable teleport` +``` + +I can also install teleport using the package manager, then enroll the agent into AUs. See the section below: + +#### As a Teleport user I want to enroll my existing agent into AUs + +I have an agent, installed from a package manager or by manually unpacking the tarball. +I have the teleport updater installed and available in my path. +I run: + +```shell +teleport-updater enable --group production +# Detecting the Teleport version and edition used by cluster "example.teleport.sh" +# Installing the following teleport version: +# Version: 16.2.1 +# Edition: Enterprise +# OS: Linux +# Architecture: x86 +# Teleport installed, reloading the service. 
+# Enabling automatic updates, the agent is part of the "production" update group. +``` + +> [!NOTE] +> The updater saw the teleport unit running and the existing teleport configuration. +> It used the configuration to pick the right proxy address. As teleport is already running, the teleport service is +> reloaded to use the new binary. + + ### Teleport Resources #### Autoupdate Config @@ -1058,14 +1123,7 @@ minute will be considered in these formulas. ### Manually interacting with the rollout - -#### RPCs -Users and administrators can interact with the rollout plan using the following RPCs: - -```protobuf -``` - -#### CLI +[TODO add cli commands] ### Editing the plan @@ -1083,51 +1141,6 @@ Releasing new agent versions multiple times a week has the potential to starve d Note that the `default` schedule applies to agents that do not specify a group name. [TODO: It seems we removed the default bool, So we have a mandatory default group? Can we pick the last one instead?] -```shell -# configuration -# TODO: "tctl autoudpate update" is bad UX, especially as this doen't even trigger agent update but updates the AU resource. -# We should chose a user-friendly signature -$ tctl autoupdate update --set-agent-auto-update=off -Automatic updates configuration has been updated. -$ tctl autoupdate update --group staging --set-start-hour=3 -Automatic updates configuration has been updated. -$ tctl autoupdate update --group staging --set-jitter-seconds=60 -Automatic updates configuration has been updated. -$ tctl autoupdate update --group default --set-jitter-seconds=60 -Automatic updates configuration has been updated. -$ tctl autoupdate reset -Automatic updates configuration has been reset to defaults. - -# status -$ tctl autoupdate status -Status: disabled -Version: v1.2.4 -Schedule: regular - -Groups: -staging: succeeded at 2024-01-03 23:43:22 UTC -prod: scheduled for 2024-01-03 23:43:22 UTC (depends on prod) -other: failed at 2024-01-05 22:53:22 UTC - -$ tctl autoupdate status --group staging -Status: succeeded -Date: 2024-01-03 23:43:22 UTC -Requires: (none) - -Updated: 230 (95%) -Unchanged: 10 (2%) -Failed: 15 (3%) -Timed-out: 0 - -# re-running failed group -$ tctl autoupdate run --group staging -Executing auto-update for group 'staging' immediately. -``` - -Notes: -- `autoupdate_agent_plan` is separate from `autoupdate_config` so that Cloud customers can be restricted from updating - `autoupdate_agent_plan`, while maintaining control over the rollout. - ### Updater APIs #### Update requests From 2cc46a671875a81f660324e09714fd7570796f82 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 2 Oct 2024 21:09:53 -0400 Subject: [PATCH 083/105] cleanup + move backpressure formulas --- rfd/0169-auto-updates-linux-agents.md | 228 +++++++++++++------------- 1 file changed, 114 insertions(+), 114 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index a0f73b35f5caf..0c73af930fa50 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -1,5 +1,5 @@ --- -authors: Stephen Levine (stephen.levine@goteleport.com) +authors: Stephen Levine (stephen.levine@goteleport.com) & Hugo Hervieux (hugo.hervieux@goteleport.com) state: draft --- @@ -45,24 +45,24 @@ Additionally, this RFD parallels the auto-update functionality for client tools The current systemd updater does not meet those requirements: - Its use of package managers leads users to accidentally upgrade Teleport. 
- Its installation process is complex and users end up installing the wrong version of Teleport.
-- Its update process does not provide safeties to protect against broken updates.
+- Its update process does not provide sufficient safeties to protect against broken updates.
- Customers are not adopting the existing updater because they want to control when updates happen.
- We do not offer a nice user experience for self-hosted users. This results in a marginal automatic updates
  adoption and does not reduce the cost of upgrading self-hosted clusters.

## How

-The new agent automatic updates will rely on a separate `teleport-update` binary controlling which Teleport version is
-installed. Automatic updates will be implemented via incrementally:
+The new agent automatic updates system will rely on a separate `teleport-update` binary controlling which Teleport version is
+installed. Automatic updates will be implemented incrementally:

-- Phase 1: Introduce a new updater binary which does not rely on package managers. Allow tctl to roll out updates to all agents.
+- Phase 1: Introduce a new, self-updating updater binary which does not rely on package managers. Allow tctl to roll out updates to all agents.
- Phase 2: Add the ability for the agent updater to immediately revert a faulty update.
-- Phase 3: Introduce the concept of agent update groups and make users chose in which order groups are updated.
+- Phase 3: Introduce the concept of agent update groups and make users choose the order in which groups are updated.
- Phase 4: Add a feedback mechanism for the Teleport inventory to track the agents of each group and their update status.
- Phase 5: Add the canary deployment strategy: a few agents are updated first; if they don't die, the whole group is updated.
- Phase 6: Add the ability to perform slow and incremental version rollouts within an agent update group.

-The updater will be usable after phase 1, and will gain new capabilities after each phase.
+The updater will be usable after phase 1 and will gain new capabilities after each phase.
After phase 2, the new updater will have feature-parity with the old updater.
The existing auto-updates mechanism will remain unchanged throughout the process, and deprecated in the future.

@@ -72,8 +72,9 @@ We will introduce two user-facing resources:

1. The `autoupdate_config` resource, owned by the Teleport user. This resource allows Teleport users to configure:
   - Whether automatic updates are enabled, disabled, or temporarily suspended
-  - The order in which their agents should be updated (`dev` before `staging` before `prod`)
-  - When updates should start
+  - The order in which agents should be updated (`dev` before `staging` before `prod`)
+  - Times when agent updates should start
+  - Configuration for client auto-updates (e.g., `tsh` and `tctl`), which are out-of-scope for this RFD

   The resource will look like:
   ```yaml
@@ -97,9 +98,9 @@ We will introduce two user-facing resources:
      max_in_flight: 20% # added in phase 6
 ```

-2. The `autoupdate_agent_plan` resource, its spec is owned by the Teleport cluster administrator (e.g. Teleport Cloud team).
-   Its status is owned by Teleport and contains the current rollout status. Some parts of the status can be changed via
-   select RPCs, for example fast-tracking a group update.
+2. The `autoupdate_agent_plan` resource, with `spec` owned by the Teleport cluster administrator (e.g. Teleport Cloud team).
+   Its `status` is owned by Teleport and contains the current rollout status.
Some parts of the status can be changed via + select RPCs (for example, an RPC to fast-track a group update). ```yaml kind: autoupdate_agent_plan spec: @@ -118,12 +119,12 @@ We will introduce two user-facing resources: progress: 0 state: canaries canaries: # part of phase 5 - - updater_id: abc - host_id: def + - updater_uuid: abc + host_uuid: def hostname: foo.example.com success: false - last_update_time: 2020-12-10T16:09:53+00:00 - last_update_reason: canaryTesting + last_update_time: 2020-12-10T16:09:53+00:00 + last_update_reason: canaryTesting - name: prod start_time: 0000-00-00 initial_count: 0 @@ -144,29 +145,29 @@ users who want to know the motivations behind this specific design. ### Product requirements -Those are the requirements coming from engineering, product, and cloud teams: +Those are the requirements coming from engineering, product, and Cloud teams: -1. Phased rollout for our tenants. We should be able to control the agent version per-tenant. +1. Phased rollout for Cloud tenants. We should be able to control the agent version per-tenant. -2. Bucketed rollout that tenants have control over. +2. Bucketed rollout that customers have control over. - Control the bucket update day - Control the bucket update hour - Ability to pause a rollout -3. Customers should be able to run "apt-get update" without updating Teleport. +3. Customers should be able to run "apt-get upgrade" without updating Teleport. - Installation from a package manager should be possible, but the version should be controlled by Teleport. + Installation from a package manager should be possible, but the version should still be controlled by Teleport. 4. Self-managed updates should be a first class citizen. Teleport must advertise the desired agent and client version. -5. Self-hosted customers should be supported, for example, customers whose their own internal customer is running a Teleport agent. +5. Self-hosted customers should be supported, for example, customers whose own internal customer is running a Teleport agent. -6. Upgrading a leaf cluster is out-of-scope. +6. Upgrading leaf clusters is out-of-scope. -7. Rolling back after a broken update should be supported. Roll forward get's you 99.9%, we need rollback for 99.99%. +7. Rolling back after a broken update should be supported. Roll forward gets you 99.9%, we need rollback for 99.99%. 8. We should have high quality metrics that report the version they are running and if they are running automatic - updates. For users and us. + updates. For both users and us. 9. Best effort should be made so automatic updates should be applied in a way that sessions are not terminated. (Currently only supported for SSH) @@ -174,27 +175,25 @@ Those are the requirements coming from engineering, product, and cloud teams: 11. Teleport Discover installation (curl one-liner) should be supported. -12. We need to support repo mirrors. +12. We need to support Docker image repository mirrors and Teleport artifact mirrors. -13. I should be able to install Teleport via whatever mechanism I want to. +13. I should be able to install an auto-updating deployment of Teleport via whatever mechanism I want to, including OS packages such as apt and yum. 14. If new nodes join a bucket outside the upgrade window, and you are within your compatibility window, wait until your next group update start. - If you are not within your compat. window attempt to upgrade right away. + If you are not within your compatibility window, attempt to upgrade right away. 15. 
If an agent comes back online after some period of time, and it is still compatible with - control lane, it wait until the next upgrade window when it will be upgraded. + control plane, it should wait until the next upgrade window to be upgraded. -16. Regular cloud tenant update schedule should run in less than a week. - Select tenants might support longer schedules. +16. Regular agent updates for Cloud tenants should complete in less than a week. + (Select tenants may support longer schedules, at the Cloud team's discretion.) 17. A Cloud customer should be able to pause, resume, and rollback and existing rollout schedule. A Cloud customer should not be able to create new rollout schedules. Teleport can create as many rollout schedules as it wants. -18. A user on the host, should be able to turn autoupdate off or select a version for that particular host. - -19. Operating system packages should be supported. +18. A user logged-in to the agent host should be able to disable agent auto-updates and pin a version for that particular host. ### User Stories @@ -204,7 +203,7 @@ Those are the requirements coming from engineering, product, and cloud teams: Before ```shell -tctl auto-update agent status +tctl autoupdate agent status # Rollout plan created the YYYY-MM-DD # Previous version: v1 # New version: v2 @@ -224,11 +223,14 @@ tctl autoupdate agent new-rollout v3 # created new rollout from v2 to v3 ``` +TODO(sclevine): What about `update` or `target` instead of `new-rollout`? + `new-rollout` seems like we're creating a new resource, not changing target version. +
After ```shell -tctl auto-update agent status +tctl autoupdate agent status # Rollout plan created the YYYY-MM-DD # Previous version: v2 # New version: v3 @@ -256,13 +258,13 @@ Now, new agents will install v2 by default, and v3 after the maintenance. > # created new update plan from v1 to v3 > ``` -#### As Teleport Cloud I want to minimize the damage of a broken version to improve Teleport's availability to 99.99% +#### As Teleport Cloud I want to minimize damage caused by broken versions to ensure we maintain 99.99% availability -##### Failure mode 1: the new version crashes +##### Failure mode 1(a): the new version crashes -I create a new deployment, with a broken version. The version is deployed to the canaries. +I create a new deployment with a broken version. The version is deployed to the canaries. The canaries crash, the updater reverts the update, the agents connect back online and -advertise they rolled-back. The maintenance is stuck until the canaries are running the target version. +advertise they have rolled-back. The maintenance is stuck until the canaries are running the target version.
Autoupdate agent plan @@ -285,10 +287,10 @@ status: progress: 0 state: canaries canaries: - - updater_id: abc - host_id: def - hostname: foo.example.com - success: false + - updater_uuid: abc + host_uuid: def + hostname: foo.example.com + success: false last_update_time: 2020-12-10T16:09:53+00:00 last_update_reason: canaryTesting - name: staging @@ -304,25 +306,25 @@ status:
I and the customer get an alert if the canary testing has not succeeded after an hour.
-Teleport cloud operators and the user can access the canary hostname and hostid
-to
+Teleport cloud operators and the customer can access the canary hostname and host_uuid
+to identify the broken agent.

The rollout resumes.

-##### Failure mode 1 bis: the new version crashes, but not on the canaries
+##### Failure mode 1(b): the new version crashes, but not on the canaries

This scenario is the same as the previous one but the Teleport agent bug only manifests on select agents.
For example: [the agent fails to read cloud-provider specific metadata and crashes](TODO add link).

-The canaries might not select one of the affected agent and allow the update to proceed.
+The canaries might not select one of the affected agents and allow the update to proceed.
All agents are updated, and all agents hosted on the cloud provider affected by the bug crash.

The updaters of the affected agents will attempt to self-heal by reverting to the previous version.
-Once the previous Teleport version is running, the agent will advertise its update failed and it had to rollback.
-If too many agents failed, this will block the group from transitioning from `active` to `done`, protecting the future
+Once the previous Teleport version is running, the agent will advertise the update failed and that it had to rollback.
+If too many agents fail, this will block the group from transitioning from `active` to `done`, protecting the future
groups from the faulty updates.

-##### Failure mode 2: the new version crashes, and the old version cannot start
+##### Failure mode 2(a): the new version crashes, and the old version cannot start

I create a new deployment with a broken version. The version is deployed to the canaries.
The canaries attempt the update, and the new Teleport instance crashes.
The group update is stuck until the canary comes back online and runs the latest version.

The customer and Teleport Cloud receive an alert. The customer and Teleport Cloud can retrieve the host UUID and hostname
of the faulty canaries. With this information, they can troubleshoot the failed agents.

-##### Failure mode 2 bis: the new version crashes, and the old version cannot start, but not on the canaries
+##### Failure mode 2(b): the new version crashes, and the old version cannot start, but not on the canaries

This scenario is the same as the previous one but the Teleport agent bug only manifests on select agents.
For example: a clock drift blocks agents from re-connecting to Teleport.

The canaries might not select one of the affected agents and allow the update to proceed.
All agents are updated, and all agents hosted on the cloud provider affected by the bug crash.
The updater fails to self-heal as the old version does not start anymore.

-If too many agents failed, this will block the group from transitioning from `active` to `done`, protecting the future
+If too many agents fail, this will block the group from transitioning from `active` to `done`, protecting the future
groups from the faulty updates.

In this case, it's hard to identify which agent dropped.

@@ -605,8 +607,8 @@ spec:
   # name of the group. Must only contain valid backend / resource name characters.
   - name: staging
     # days specifies the days of the week when the group may be updated.
+ # mandatory value for most Cloud customers: ["Mon", "Tue", "Wed", "Thu"] # default: ["*"] (all days) - # TODO: explicit the supported values based on the customer QoS days: [ “Sun”, “Mon”, ... | "*" ] # start_hour specifies the hour when the group may start upgrading. # default: 0 @@ -649,13 +651,13 @@ spec: #### Autoupdate agent plan The `autoupdate_agent_plan` spec is owned by the Teleport cluster administrator. -In Teleport Cloud, this is the cloud operations team. For self-hosted setups this is the user with access to the local +In Teleport Cloud, this is the Cloud operations team. For self-hosted setups this is the user with access to the local admin socket (tctl on local machine). > [!NOTE] > This is currently an anti-pattern as we are trying to remove the use of the local administrator in Teleport. > However, Teleport does not provide any role/permission that we can use for Teleport Cloud operations and cannot be -> granted to users. To part with local admin rights, we need a way to have cloud or admi-only operations. +> granted to users. To part with local admin rights, we need a way to have Cloud or admin-only operations. > This would also improve Cloud team operations by interacting with Teleport API rather than executing local tctl. > > Solving this problem is out of the scope of this RFD. @@ -689,18 +691,14 @@ status: failed_count: 23 # canaries is a list of agents used for canary deployments canaries: # part of phase 5 - # updater_id is the updater UUID - - updater_id: abc123-... - # host_id is the agent host UUID - host_id: def534-... + # updater_uuid is the updater UUID + - updater_uuid: abc123-... + # host_uuid is the agent host UUID + host_uuid: def534-... # hostname of the agent hostname: foo.example.com # success status success: false - # last_update_time is [TODO: what does this represent?] - last_update_time: 2020-12-10T16:09:53+00:00 - # last_update_reason is [TODO: what does this represent?] - last_update_reason: canaryTesting # progress is the current progress through the rollout progress: 0.532 # state is the current state of the rollout (unstarted, active, done, rollback) @@ -922,37 +920,37 @@ enum Strategy { // AutoUpdateAgentPlanStatus is the status for the AutoUpdateAgentPlan. message AutoUpdateAgentPlanStatus { // name of the group - string name = 0; + string name = 1; // start_time of the rollout - google.protobuf.Timestamp start_time = 1; + google.protobuf.Timestamp start_time = 2; // initial_count is the number of connected agents at the start of the window. - int64 initial_count = 2; + int64 initial_count = 3; // present_count is the current number of connected agents. - int64 present_count = 3; + int64 present_count = 4; // failed_count specifies the number of failed agents. - int64 failed_count = 4; + int64 failed_count = 5; // canaries is a list of canary agents. - repeated Canary canaries = 5; + repeated Canary canaries = 6; // progress is the current progress through the rollout. - float progress = 6; + float progress = 7; // state is the current state of the rollout. - State state = 7; + State state = 8; // last_update_time is the time of the previous update for this group. 
- google.protobuf.Timestamp last_update_time = 8;
+ google.protobuf.Timestamp last_update_time = 9;
  // last_update_reason is the trigger for the last update
- string last_update_reason = 9;
+ string last_update_reason = 10;
}

// Canary agent
message Canary {
  // update_uuid of the canary agent
- string update_uuid = 0;
+ string update_uuid = 1;
  // host_uuid of the canary agent
- string host_uuid = 1;
+ string host_uuid = 2;
  // hostname of the canary agent
- string hostname = 2;
+ string hostname = 3;
  // success state of the canary agent
- bool success = 3;
+ bool success = 4;
}

// State of the rollout
@@ -999,12 +997,12 @@ message RollbackAgentGroupRequest {

### Backend logic to progress the rollout

The update proceeds from the first group to the last group, ensuring that each group successfully updates before
-allowing the next group to proceed. By default, only 5 agent groups are allowed, this mitigates very long rollout plans.
+allowing the next group to proceed. By default, only 5 agent groups are allowed. This mitigates very long rollout plans.

#### Agent update mode

The agent auto update mode is specified by both Cloud (via `autoupdate_agent_plan`)
-and by the customer (via `autoupdate_config`). The agent update mode control whether
+and by the customer (via `autoupdate_config`). The agent update mode controls whether
the cluster is enrolled into automatic agent updates.

The agent update mode can take 3 values:

@@ -1016,10 +1014,10 @@ The agent update mode can take 3 values:

The cluster agent rollout mode is computed by taking the lowest value.
For example:

-- cloud says `enabled` and the customer says `enabled` -> the updates are `enabled`
-- cloud says `enabled` and the customer says `suspended` -> the updates are `suspended`
-- cloud says `disabled` and the customer says `suspended` -> the updates are `disabled`
-- cloud says `disabled` and the customer says `enabled` -> the updates are `disabled`
+- Cloud says `enabled` and the customer says `enabled` -> the updates are `enabled`
+- Cloud says `enabled` and the customer says `suspended` -> the updates are `suspended`
+- Cloud says `disabled` and the customer says `suspended` -> the updates are `disabled`
+- Cloud says `disabled` and the customer says `enabled` -> the updates are `disabled`

The Teleport cluster only progresses the rollout if the mode is `enabled`.

@@ -1062,9 +1060,9 @@ flowchart TD

A group can be started if the following criteria are met
- all of its previous groups are in the `done` state
-- it has been at least `wait_days` until the previous group update started
+- it has been at least `wait_days` since the previous group update started
- the current week day is in the `days` list
-- the current hours equals the `hour` field
+- the current hour equals the `hour` field

When all those criteria are met, the auth will transition the group into a new state.
If `canary_count` is not null, the group transitions to the `canary` state.
Else it transitions to the `active` state.

@@ -1078,9 +1076,9 @@ update success criteria.

A group in `canary` state will get assigned canaries.
The proxies will instruct those canaries to update now.

-During each reconciliation loop, the auth will lookup the instance healthcheck in the backend of the canaries.
+During each reconciliation loop, the auth will look up the instance heartbeat of each canary in the backend.
-Once all canaries have a healthcheck containing the new version (the healthcheck must not be older than 20 minutes),
+Once all canaries have a heartbeat containing the new version (the heartbeat must not be older than 20 minutes),
they successfully came back online and the group can transition to the `active` state.

If canaries never update, report rollback, or disappear, the group will stay stuck in `canary` state.
An alert will eventually fire, warning the user about the stuck update.

#### Updating a group

-A group in `active` mode is currently being updated. The conditions to leave te `active` mode and transition to the
+A group in `active` mode is currently being updated. The conditions to leave `active` mode and transition to the
`done` mode will vary based on the phase and rollout strategy.

-- Phase 2: we don't have any information about agents. The group transitions to `done` 60 minutes after its start.
+- Phase 3: we don't have any information about agents. The group transitions to `done` 60 minutes after its start.
- Phase 4: we know about the connected agent count and the connected agent versions. The group transitions to `done` if:
  - at least `(100 - max_in_flight)%` of the agents are still connected
  - at least `(100 - max_in_flight)%` of the agents are running the new version
- Phase 6: we incrementally update the progress; this adds a new criterion: the group progress is at 100%

-The phase 6 backpressure update is the following:
-
-Given:
-```
-initial_count[group] = sum(agent_data[group].stats[*]).count
-```
-
-Each auth server will calculate the progress as
-`( max_in_flight * initial_count[group] + agent_data[group].stats[target_version].count ) / initial_count[group]` and
-write the progress to `autoupdate_agent_plan` status. This formula determines the progress percentage by adding a
-`max_in_flight` percentage-window above the number of currently updated agents in the group.
-
-However, if `as_numeral(agent_data[group].stats[not(target_version)].lowest_uuid) / as_numeral(max_uuid)` is above the
-calculated progress, that progress value will be used instead. This protects against a statistical deadlock, where no
-UUIDs fall within the next `max_in_flight` window of UUID space, by always permitting the next non-updated agent to
-update.
-
-To ensure that the rollout is halted if more than `max_in_flight` un-updated agents drop off, an addition restriction
-must be imposed for the rollout to proceed:
-`agent_data[group].stats[*].count > initial_count[group] - max_in_flight * initial_count[group]`
-
-To prevent double-counting of agents when considering all counts across all auth servers, only agents connected for one
-minute will be considered in these formulas.
+The phase 6 backpressure calculations are covered in the Backpressure Calculations section below.

### Manually interacting with the rollout

@@ -1225,7 +1201,7 @@ The following data related to the rollout are stored in each instance heartbeat:
- `agent_update_uuid`: Auto-update UUID
- `agent_update_group`: Auto-update group name

-[TODO: mention that we'll also send this info in the hello and store it in the auth invenotry]
+[TODO: mention that we'll also send this info in the hello and store it in the auth inventory]

Auth servers use their local instance inventory to calculate rollout statistics and write them to
`/autoupdate/[group]/[auth ID]` (e.g., `/autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56`).
@@ -1248,6 +1224,30 @@ To progress the rollout, auth servers will range-read keys from `/autoupdate/[gr If `/autoupdate/[group]/[auth ID]` is older than 1 minute, we do not consider its contents. This prevents double-counting agents when auth servers are killed. +#### Backpressure Calculations + +Given: +``` +initial_count[group] = sum(agent_data[group].stats[*]).count +``` + +Each auth server will calculate the progress as +`( max_in_flight * initial_count[group] + agent_data[group].stats[target_version].count ) / initial_count[group]` and +write the progress to `autoupdate_agent_plan` status. This formula determines the progress percentage by adding a +`max_in_flight` percentage-window above the number of currently updated agents in the group. + +However, if `as_numeral(agent_data[group].stats[not(target_version)].lowest_uuid) / as_numeral(max_uuid)` is above the +calculated progress, that progress value will be used instead. This protects against a statistical deadlock, where no +UUIDs fall within the next `max_in_flight` window of UUID space, by always permitting the next non-updated agent to +update. + +To ensure that the rollout is halted if more than `max_in_flight` un-updated agents drop off, an addition restriction +must be imposed for the rollout to proceed: +`agent_data[group].stats[*].count > initial_count[group] - max_in_flight * initial_count[group]` + +To prevent double-counting of agents when considering all counts across all auth servers, only agents connected for one +minute will be considered in these formulas. + ### Linux Agents We will ship a new auto-updater package for Linux servers written in Go that does not interface with the system package manager. From 7b82d0b17caf6efaac396ab68132d4164f1a4845 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 2 Oct 2024 21:17:57 -0400 Subject: [PATCH 084/105] more cleanup --- rfd/0169-auto-updates-linux-agents.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0169-auto-updates-linux-agents.md index 0c73af930fa50..e46d776989380 100644 --- a/rfd/0169-auto-updates-linux-agents.md +++ b/rfd/0169-auto-updates-linux-agents.md @@ -223,8 +223,8 @@ tctl autoupdate agent new-rollout v3 # created new rollout from v2 to v3 ``` -TODO(sclevine): What about `update` or `target` instead of `new-rollout`? - `new-rollout` seems like we're creating a new resource, not changing target version. +[TODO(sclevine): What about `update` or `target` instead of `new-rollout`? + `new-rollout` seems like we're creating a new resource, not changing target version.]
After @@ -1114,8 +1114,8 @@ However, any changes to `agent_schedules` that occur while a group is active wil Releasing new agent versions multiple times a week has the potential to starve dependent groups from updates. -Note that the `default` schedule applies to agents that do not specify a group name. -[TODO: It seems we removed the default bool, So we have a mandatory default group? Can we pick the last one instead?] +Note that the `default` group applies to agents that do not specify a group name. +If a `default` group is not present, the last group is treated as the default. ### Updater APIs @@ -1156,10 +1156,10 @@ Notes: - Agents will only update if `agent_autoupdate` is `true`, but new installations will use `agent_version` regardless of the value in `agent_autoupdate`. -- The edition served is the cluster edition (enterprise, enterprise-fips, or oss), and cannot be configured. +- The edition served is the cluster edition (enterprise, enterprise-fips, or oss) and cannot be configured. - The group name is read from `/var/lib/teleport/versions/update.yaml` by the updater. - The UUID is read from `/tmp/teleport_update_uuid`, which `teleport-update` regenerates when missing. -- the jitter is served by the teleport cluster and depends on the rollout strategy (60 sec by default, 10sec when using +- the jitter is served by the teleport cluster and depends on the rollout strategy (60s by default, 10s when using the backpressure strategy). Let `v1` be the previous version and `v2` the target version, the response matrix is the following: @@ -1192,7 +1192,7 @@ Let `v1` be the previous version and `v2` the target version, the response matri #### Updater status reporting -Instance heartbeats will be extended to incorporate and send data that is written to `/var/lib/teleport/versions/update.yaml` by the `teleport-update` binary. +Instance heartbeats will be extended to incorporate and send data that is written to `/var/lib/teleport/versions/update.yaml` and `/tmp/teleport_update_uuid` by the `teleport-update` binary. The following data related to the rollout are stored in each instance heartbeat: - `agent_update_start_time`: timestamp of individual agent's upgrade time @@ -1627,7 +1627,7 @@ Making the update boolean instruction available via the `/webapi/find` TLS endpo 3. Implement changes to Kubernetes auto-updater. 4. Test extensively on all supported Linux distributions. 5. Prep documentation changes. -6. Release via `teleport` package and script for packageless install. +6. Release via `teleport` package and script for package-less installation. 7. Release documentation changes. 8. Communicate to users that they should update to the new system. 9. Begin deprecation of old auto-updater resources, packages, and endpoints. 
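The backpressure formulas defined in the Backpressure Calculations section above lend themselves to a direct implementation. The following Go sketch illustrates how an auth server could compute a group's rollout progress, including the anti-deadlock rule; every name, signature, and the UUID-space mapping here is an illustrative assumption for discussion, not the final implementation:

```go
// Package rollout sketches the per-group progress calculation from the
// "Backpressure Calculations" section. Names are illustrative only.
package rollout

import "math/big"

// maxUUID is the upper bound of the 128-bit UUID space (2^128).
var maxUUID = new(big.Int).Lsh(big.NewInt(1), 128)

// Progress returns the fraction of the UUID space permitted to update.
// updated is agent_data[group].stats[target_version].count, initial is
// initial_count[group], and maxInFlight is the max_in_flight value as a
// fraction (e.g., 0.2 for "20%"). lowestUUID is the lowest UUID among
// agents not yet on the target version, hex-encoded without dashes.
func Progress(updated, initial int64, maxInFlight float64, lowestUUID string) float64 {
	if initial == 0 {
		return 1
	}
	// max_in_flight percentage-window above the already-updated count.
	p := (maxInFlight*float64(initial) + float64(updated)) / float64(initial)

	// Anti-deadlock rule: always permit the next non-updated agent to
	// update, even when no UUID falls inside the current window.
	if f := uuidFraction(lowestUUID); f > p {
		p = f
	}
	if p > 1 {
		p = 1
	}
	return p
}

// uuidFraction maps a UUID onto [0, 1) by computing
// as_numeral(uuid) / as_numeral(max_uuid).
func uuidFraction(hexUUID string) float64 {
	n, ok := new(big.Int).SetString(hexUUID, 16)
	if !ok {
		return 0
	}
	f, _ := new(big.Float).Quo(new(big.Float).SetInt(n), new(big.Float).SetInt(maxUUID)).Float64()
	return f
}
```

Under this sketch, the halting condition from the same section would be enforced separately: if the connected-agent count drops below `initial - max_in_flight * initial`, the auth server would simply stop advancing the stored progress value.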
From 005859b716c95df1545d32b8ccb4ce1b9c35b15e Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 2 Oct 2024 21:24:01 -0400 Subject: [PATCH 085/105] rename to unused number --- ...69-auto-updates-linux-agents.md => 0184-agent-auto-updates.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename rfd/{0169-auto-updates-linux-agents.md => 0184-agent-auto-updates.md} (100%) diff --git a/rfd/0169-auto-updates-linux-agents.md b/rfd/0184-agent-auto-updates.md similarity index 100% rename from rfd/0169-auto-updates-linux-agents.md rename to rfd/0184-agent-auto-updates.md From c0ebba8d9b9ea2b9ba6a4c3a5b743398fff9b263 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 2 Oct 2024 21:24:47 -0400 Subject: [PATCH 086/105] fix title --- rfd/0184-agent-auto-updates.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfd/0184-agent-auto-updates.md b/rfd/0184-agent-auto-updates.md index e46d776989380..b5c06c8799d09 100644 --- a/rfd/0184-agent-auto-updates.md +++ b/rfd/0184-agent-auto-updates.md @@ -3,7 +3,7 @@ authors: Stephen Levine (stephen.levine@goteleport.com) & Hugo Hervieux (hugo.he state: draft --- -# RFD 0169 - Automatic Updates for Agents +# RFD 0184 - Agent Automatic Updates ## Required Approvers From 700216b528c10882685eac5314cc271144ec10aa Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 2 Oct 2024 21:35:21 -0400 Subject: [PATCH 087/105] more cleanup --- rfd/0184-agent-auto-updates.md | 22 ++++++++++++---------- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/rfd/0184-agent-auto-updates.md b/rfd/0184-agent-auto-updates.md index b5c06c8799d09..582c2c2540768 100644 --- a/rfd/0184-agent-auto-updates.md +++ b/rfd/0184-agent-auto-updates.md @@ -39,16 +39,16 @@ Additionally, this RFD parallels the auto-update functionality for client tools 3. We want to reduce the operational cost of customers running old agents. For Cloud customers, this will allow us to support fewer simultaneous cluster versions and reduce support load. For self-hosted customers, this will reduce support load associated with debugging old versions of Teleport. -4. Providing 99.99% availability for customers requires us to maintain that level of availability at the agent-level +4. Providing 99.99% availability for customers requires us to maintain high availability at the agent-level as well as the cluster-level. The current systemd updater does not meet those requirements: -- Its use of package managers leads users to accidentally upgrade Teleport. -- Its installation process is complex and users end up installing the wrong version of Teleport. -- Its update process does not provide sufficient safeties to protect against broken updates. -- Customers are not adopting the existing updater because they want to control when updates happen. +- The use of package managers (apt and yum) to apply updates leads users to accidentally upgrade Teleport. +- The installation process is complex, and users often end up installing the wrong version of Teleport. +- The update process does not provide sufficient safeties to protect against broken agent updates. +- Customers decline to adopt the existing updater because they want more control over when updates occur. - We do not offer a nice user experience for self-hosted users. This results in a marginal automatic updates - adoption and does not reduce the cost of upgrading self-hosted clusters. + adoption and does not reduce the support cost associated with upgrading self-hosted clusters. ## How @@ -61,10 +61,12 @@ installed. 
Automatic updates will be implemented incrementally:

- Phase 1: Introduce a new, self-updating updater binary which does not rely on package managers. Allow tctl to roll out updates to all agents.
- Phase 2: Add the ability for the agent updater to immediately revert a faulty update.
- Phase 3: Introduce the concept of agent update groups and make users choose the order in which groups are updated.
- Phase 4: Add a feedback mechanism for the Teleport inventory to track the agents of each group and their update status.
- Phase 5: Add the canary deployment strategy: a few agents are updated first; if they don't die, the whole group is updated.
- Phase 6: Add the ability to perform slow and incremental version rollouts within an agent update group.
+- Phase 7: If needed, back up the local agent DB and restore it during agent rollbacks.

The updater will be usable after phase 1 and will gain new capabilities after each phase.
-After phase 2, the new updater will have feature-parity with the old updater.
-The existing auto-updates mechanism will remain unchanged throughout the process, and deprecated in the future.
+After phase 2, the new updater will have feature-parity with the existing updater script.
+The existing auto-updates mechanism will remain unchanged and fully-functional throughout the process.
+It will be deprecated in the future.

Future phases might change as we are working on the implementation and collecting real-world feedback and experience.

@@ -73,7 +75,7 @@ We will introduce two user-facing resources:
1. The `autoupdate_config` resource, owned by the Teleport user. This resource allows Teleport users to configure:
   - Whether automatic updates are enabled, disabled, or temporarily suspended
   - The order in which agents should be updated (`dev` before `staging` before `prod`)
-   - Times when agent updates should start
+   - Days and hours when agent updates should start
   - Configuration for client auto-updates (e.g., `tsh` and `tctl`), which are out-of-scope for this RFD

   The resource will look like:
   ```yaml
@@ -98,7 +100,7 @@ We will introduce two user-facing resources:
      max_in_flight: 20% # added in phase 6
 ```

-2. The `autoupdate_agent_plan` resource, with `spec` owned by the Teleport cluster administrator (e.g. Teleport Cloud team).
+2. The `autoupdate_agent_plan` resource, with `spec` owned by the Teleport cluster administrator (e.g. Teleport Cloud operators).
   Its `status` is owned by Teleport and contains the current rollout status. Some parts of the status can be changed via
   select RPCs (for example, an RPC to fast-track a group update).

From f08b80cab0e9e8f4ac04cfddb73fbedbcdfafb7b Mon Sep 17 00:00:00 2001
From: Stephen Levine
Date: Thu, 3 Oct 2024 15:38:48 -0400
Subject: [PATCH 088/105] correct inconsistencies

---
 rfd/0184-agent-auto-updates.md | 34 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/rfd/0184-agent-auto-updates.md b/rfd/0184-agent-auto-updates.md
index 582c2c2540768..33980102085b1 100644
--- a/rfd/0184-agent-auto-updates.md
+++ b/rfd/0184-agent-auto-updates.md
@@ -147,7 +147,7 @@ users who want to know the motivations behind this specific design.

### Product requirements

-Those are the requirements coming from engineering, product, and Cloud teams:
+The following product requirements were defined by our leadership team:

1. Phased rollout for Cloud tenants. We should be able to control the agent version per-tenant.

@@ -199,7 +199,7 @@
### User Stories

-#### As Teleport Cloud I want to be able to update customers agents to a newer Teleport version
+#### As a Teleport Cloud operator I want to be able to update customers' agents to a newer Teleport version
Before @@ -260,7 +260,7 @@ Now, new agents will install v2 by default, and v3 after the maintenance. > # created new update plan from v1 to v3 > ``` -#### As Teleport Cloud I want to minimize damage caused by broken versions to ensure we maintain 99.99% availability +#### As a Teleport Cloud operator I want to minimize damage caused by broken versions to ensure we maintain 99.99% availability ##### Failure mode 1(a): the new version crashes @@ -308,8 +308,8 @@ status:
I and the customer get an alert if the canary testing has not succeeded after an hour. -Teleport cloud operators and the customer can access the canary hostname and host_uuid -to identify the broken agent. +Teleport cloud operators and the customer can access the canary `hostname` and `host_uuid` +to identify broken canary agents. The rollout resumes. @@ -434,7 +434,7 @@ tctl auto-update agent resume I connect to the node and lookup its status: ```shell -teleport-updater status +teleport-update status # Running version v16.2.5 # Automatic updates enabled. # Proxy: example.teleport.sh @@ -443,15 +443,15 @@ teleport-updater status I try to set a specific version: ```shell -teleport-udpater use-version v16.2.3 +teleport-update use-version v16.2.3 # Error: the instance is enrolled into automatic updates. # You must specify --disable-automatic-updates to opt this agent out of automatic updates and manually control the version. ``` I acknowledge that I am leaving automatic updates: ```shell -teleport-udpater use-version v16.2.3 --disable-automatic-updates -# Disabling automatic updates for the node. You can enable them back by running `teleport-updater enable` +teleport-update use-version v16.2.3 --disable-automatic-updates +# Disabling automatic updates for the node. You can enable them back by running `teleport-update enable` # Downloading version 16.2.3 # Restarting teleport # Cleaning up old binaries @@ -460,7 +460,7 @@ teleport-udpater use-version v16.2.3 --disable-automatic-updates When the issue is fixed, I can enroll back into automatic updates: ```shell -teleport-updater enable +teleport-update enable # Enabling automatic updates # Proxy: example.teleport.sh # Group: staging @@ -468,8 +468,8 @@ teleport-updater enable #### As a Teleport user I want to fast-track a group update -I have a new rollout, completely unstarted, and my current maintenance schedule updates over seevral days. -However, the new version contains something that I need as soon s possible (e.g. a fix for a bug that affects me). +I have a new rollout, completely unstarted, and my current maintenance schedule updates over several days. +However, the new version contains something that I need as soon as possible (e.g., a fix for a bug that affects me).
Before: @@ -524,9 +524,9 @@ tctl auto-update agent status The manual way: ```bash -wget https://cdn.teleport.dev/teleport-updater-- -chmod +x teleport-updater -./teleport-updater enable example.teleport.sh --group production +wget https://cdn.teleport.dev/teleport-update-- +chmod +x teleport-update +./teleport-update enable example.teleport.sh --group production # Detecting the Teleport version and edition used by cluster "example.teleport.sh" # Installing the following teleport version: # Version: 16.2.1 @@ -567,7 +567,7 @@ I have the teleport updater installed and available in my path. I run: ```shell -teleport-updater enable --group production +teleport-update enable --group production # Detecting the Teleport version and edition used by cluster "example.teleport.sh" # Installing the following teleport version: # Version: 16.2.1 @@ -1446,7 +1446,7 @@ The `teleport` apt and yum packages contain a system installation of Teleport in Post package installation, the `link` subcommand is executed automatically to link the system installation when no auto-updater-managed version of Teleport is linked: ``` /usr/local/bin/teleport -> /var/lib/teleport/versions/system/bin/teleport -/usr/local/bin/teleport-updater -> /var/lib/teleport/versions/system/bin/teleport-updater +/usr/local/bin/teleport-update -> /var/lib/teleport/versions/system/bin/teleport-update ... ``` From 012779ef15204d24cf98d50c0bd495390db503be Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 2 Oct 2024 21:45:24 -0400 Subject: [PATCH 089/105] fix more inconsistencies --- rfd/0184-agent-auto-updates.md | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/rfd/0184-agent-auto-updates.md b/rfd/0184-agent-auto-updates.md index 33980102085b1..de42696c3c71f 100644 --- a/rfd/0184-agent-auto-updates.md +++ b/rfd/0184-agent-auto-updates.md @@ -367,7 +367,7 @@ The customer can observe the agent update status and see that a recent update might have caused this: ```shell -tctl auto-update agent status +tctl autoupdate agent status # Rollout plan created the YYYY-MM-DD # Previous version: v2 # New version: v3 @@ -383,7 +383,7 @@ tctl auto-update agent status Then, the customer or Teleport Cloud team can suspend the rollout: ```shell -tctl auto-update agent suspend +tctl autoupdate agent suspend # Automatic updates suspended # No existing agent will get updated. New agents might install the new version # depending on their group. @@ -394,11 +394,11 @@ The customer can investigate, and get help from Teleport's support via a support If the update is really the cause of the issue, the customer or Teleport cloud can perform a rollback: ```shell -tctl auto-update agent rollback +tctl autoupdate agent rollback # Rolledback groups: [dev, staging] # Warning: the automatic agent updates are suspended. # Agents will not rollback until you run: -# $> tctl auto-update agent resume +# $> tctl autoupdate agent resume ``` > [!NOTE] @@ -409,7 +409,7 @@ tctl auto-update agent rollback After: ```shell -tctl auto-update agent status +tctl autoupdate agent status # Rollout plan created the YYYY-MM-DD # Previous version: v2 # New version: v3 @@ -427,7 +427,7 @@ Finally, when the user is happy with the new plan, they can resume the updates. This will trigger the rollback. 
```shell -tctl auto-update agent resume +tctl autoupdate agent resume ``` #### As a Teleport user and a Teleport on-call responder, I want to be able to pin a specific Teleport version of an agent to understand if a specific behaviour is caused by a specific Teleport version @@ -475,7 +475,7 @@ However, the new version contains something that I need as soon as possible (e.g Before: ```shell -tctl auto-updates agent status +tctl autoupdate agent status # Rollout plan created the YYYY-MM-DD # Previous version: v2 # New version: v3 @@ -492,20 +492,20 @@ tctl auto-updates agent status I can trigger the dev group immediately using the command: ```shell -tctl auto-updates agent start-update dev --no-canary +tctl autoupdate agent start-update dev --no-canary # Dev group update triggered (canary or active) ``` Alternatively ```shell -tctl auto-update agent force-done dev +tctl autoupdate agent force-done dev ```
After: ```shell -tctl auto-update agent status +tctl autoupdate agent status # Rollout plan created the YYYY-MM-DD # Previous version: v2 # New version: v3 From 9dee7e9d77cc98495e5c4e658b6cbd2094549b92 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 2 Oct 2024 21:46:18 -0400 Subject: [PATCH 090/105] missing proxy flag --- rfd/0184-agent-auto-updates.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfd/0184-agent-auto-updates.md b/rfd/0184-agent-auto-updates.md index de42696c3c71f..cf2a2ebcd104b 100644 --- a/rfd/0184-agent-auto-updates.md +++ b/rfd/0184-agent-auto-updates.md @@ -526,7 +526,7 @@ The manual way: ```bash wget https://cdn.teleport.dev/teleport-update-- chmod +x teleport-update -./teleport-update enable example.teleport.sh --group production +./teleport-update enable --proxy example.teleport.sh --group production # Detecting the Teleport version and edition used by cluster "example.teleport.sh" # Installing the following teleport version: # Version: 16.2.1 From 36bdaf0cbd90c77f5cb07dc1af5cfcee36899bc1 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Wed, 2 Oct 2024 21:50:40 -0400 Subject: [PATCH 091/105] typo --- rfd/0184-agent-auto-updates.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfd/0184-agent-auto-updates.md b/rfd/0184-agent-auto-updates.md index cf2a2ebcd104b..1803679f709de 100644 --- a/rfd/0184-agent-auto-updates.md +++ b/rfd/0184-agent-auto-updates.md @@ -1588,7 +1588,7 @@ are signed. The Update Framework (TUF) will be used to implement secure updates in the future. -Anyone who possesses a updater UUID can determine when that host is scheduled to update by repeatedly querying the public `/v1/webapi/find` endpoint. +Anyone who possesses an updater UUID can determine when that host is scheduled to update by repeatedly querying the public `/v1/webapi/find` endpoint. It is not possible to discover the current version of that host, only the designated update window. ## Logging From 55cb87a882c7751f3f0accb73f75ff9d86c54158 Mon Sep 17 00:00:00 2001 From: hugoShaka Date: Thu, 3 Oct 2024 10:55:22 -0400 Subject: [PATCH 092/105] Add CLI reference --- rfd/0184-agent-auto-updates.md | 52 ++++++++++++++++++++++++++-------- 1 file changed, 40 insertions(+), 12 deletions(-) diff --git a/rfd/0184-agent-auto-updates.md b/rfd/0184-agent-auto-updates.md index 1803679f709de..5c21b761c6c06 100644 --- a/rfd/0184-agent-auto-updates.md +++ b/rfd/0184-agent-auto-updates.md @@ -221,13 +221,10 @@ tctl autoupdate agent status I run ```bash -tctl autoupdate agent new-rollout v3 +tctl autoupdate agent-plan new-target v3 # created new rollout from v2 to v3 ``` -[TODO(sclevine): What about `update` or `target` instead of `new-rollout`? - `new-rollout` seems like we're creating a new resource, not changing target version.] -
After

@@ -240,8 +237,8 @@ tctl autoupdate agent status
 #
 # Group Name Status Update Start Time Connected Agents Up-to-date agents failed updates
 # ---------- ----------------- ----------------- ---------------- ----------------- --------------
-# dev not started 120 115 2
-# staging not started 20 20 0
+# dev not started 120 0 0
+# staging not started 20 0 0
 # prod not started 234 0 0
 ```

@@ -256,7 +253,7 @@ Now, new agents will install v2 by default, and v3 after the maintenance.
 > If this is an issue I can create a v1 -> v3 rollout instead.
 >
 > ```bash
-> tctl autoupdate agent new-rollout v3 --current-version v1
+> tctl autoupdate agent-plan new-target v3 --previous-version v1
 > # created new update plan from v1 to v3
 > ```

@@ -1101,7 +1098,27 @@ The phase 6 backpressure calculations are covered in the Backpressure Calculatio

 ### Manually interacting with the rollout

-[TODO add cli commands]
+For users:
+```shell
+tctl autoupdate agent suspend/resume
+tctl autoupdate agent enable/disable
+
+tctl autoupdate agent status
+tctl autoupdate agent status <group>
+
+tctl autoupdate agent start <group> [--no-canary]
+tctl autoupdate agent force <group>
+tctl autoupdate agent reset <group>
+
+tctl autoupdate agent rollback [<group>|--all]
+```
+
+For admins:
+```shell
+tctl autoupdate agent-plan target <version> [--previous-version <version>]
+tctl autoupdate agent-plan enable/disable
+tctl autoupdate agent-plan suspend/resume
+```

 ### Editing the plan

 #### Updater status reporting

-Instance heartbeats will be extended to incorporate and send data that is written to `/var/lib/teleport/versions/update.yaml` and `/tmp/teleport_update_uuid` by the `teleport-update` binary.
+The updater reports status through the agent. The agent has two ways of reporting the update information:
+- via instance heartbeats
+- via the hello message, when registering against an auth server
+
+Instance heartbeats happen infrequently; depending on the cluster size, they can take up to 17 minutes to arrive.
+However, they are exposed to the user via the existing `tctl inventory` method and will allow users to query which instance
+is running which version and belongs to which group.
+
+Hello messages are sent on connection and are used to build the server's local inventory.
+This information is available almost instantaneously after the connection and can be cheaply queried by the auth server
+(everything is in memory). The inventory is then used to count the local nodes and drive the rollout.
+
+Both instance heartbeats and Hello messages will be extended to incorporate and send data that is written to
+`/var/lib/teleport/versions/update.yaml` and `/tmp/teleport_update_uuid` by the `teleport-update` binary.
+
+The following data related to the update is sent by the agent:
 - `agent_update_start_time`: timestamp of individual agent's upgrade time
 - `agent_update_start_version`: current agent version
 - `agent_update_rollback`: whether the agent was rolled-back automatically
 - `agent_update_uuid`: Auto-update UUID
 - `agent_update_group`: Auto-update group name

-[TODO: mention that we'll also send this info in the hello and store it in the auth inventory]
-
 Auth servers use their local instance inventory to calculate rollout statistics and write them to
 `/autoupdate/[group]/[auth ID]` (e.g., `/autoupdate/staging/58526ba2-c12d-4a49-b5a4-1b694b82bf56`).
Every minute, auth servers persist the version counts: From f5ab4dcebb870c23821b2bc8307e0cd6afe91121 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Thu, 3 Oct 2024 15:38:48 -0400 Subject: [PATCH 093/105] feedback --- rfd/0184-agent-auto-updates.md | 77 ++++++++++++++++++---------------- 1 file changed, 42 insertions(+), 35 deletions(-) diff --git a/rfd/0184-agent-auto-updates.md b/rfd/0184-agent-auto-updates.md index 5c21b761c6c06..45cd4048e7742 100644 --- a/rfd/0184-agent-auto-updates.md +++ b/rfd/0184-agent-auto-updates.md @@ -82,7 +82,7 @@ We will introduce two user-facing resources: ```yaml kind: autoupdate_config spec: - agent_autoupdate_mode: enable + agent_auto_update_mode: enable agent_schedules: regular: - name: dev @@ -591,10 +591,10 @@ This is how Teleport customers can specify their automatic update preferences. ```yaml kind: autoupdate_config spec: - # agent_autoupdate allows turning agent updates on or off at the + # agent_auto_update allows turning agent updates on or off at the # cluster level. Only turn agent automatic updates off if self-managed # agent updates are in place. Setting this to pause will temporarily halt the rollout. - agent_autoupdate_mode: disable|enable|pause + agent_auto_update_mode: disable|enable|pause # agent_schedules specifies version rollout schedules for agents. # The schedule used is determined by the schedule associated @@ -635,13 +635,12 @@ Default resource: ```yaml kind: autoupdate_config spec: - agent_autoupdate_mode: enable + agent_auto_update_mode: enable agent_schedules: regular: - name: default days: ["Mon", "Tue", "Wed", "Thu"] start_hour: 0 - jitter_seconds: 5 canary_count: 5 max_in_flight: 20% alert_after: 4h @@ -788,8 +787,8 @@ message AutoUpdateConfig { // AutoUpdateConfigSpec is the spec for the autoupdate config. message AutoUpdateConfigSpec { - // agent_autoupdate_mode specifies whether agent autoupdates are enabled, disabled, or paused. - Mode agent_autoupdate_mode = 1; + // agent_auto_update_mode specifies whether agent autoupdates are enabled, disabled, or paused. + Mode agent_auto_update_mode = 1; // agent_schedules specifies schedules for updates of grouped agents. AgentAutoUpdateSchedules agent_schedules = 3; } @@ -812,12 +811,10 @@ message AgentAutoUpdateGroup { int64 wait_days = 4; // alert_after_hours specifies the number of hours to wait before alerting that the rollout is not complete. int64 alert_after_hours = 5; - // jitter_seconds to introduce before update as rand([0, jitter_seconds]) - int64 jitter_seconds = 6; // canary_count of agents to use in the canary deployment. - int64 canary_count = 7; + int64 canary_count = 6; // max_in_flight specifies agents that can be updated at the same time, by percent. - string max_in_flight = 8; + string max_in_flight = 7; } // Day of the week @@ -1063,7 +1060,7 @@ A group can be started if the following criteria are met - the current week day is in the `days` list - the current hour equals the `hour` field -When all hose criteria are met, the auth will transition the group into a new state. +When all those criteria are met, the auth will transition the group into a new state. If `canary_count` is not null, the group transitions to the `canary` state. Else it transitions to the `active` state. @@ -1122,7 +1119,7 @@ tctl autoupdate agent-plan suspend/resume ### Editing the plan -The updater will receive `agent_autoupdate: true` from the time is it designated for update until the `target_version` in `autoupdate_agent_plan` (below) changes. 
+The updater will receive `agent_auto_update: true` from the time it is designated for update until the `target_version` in `autoupdate_agent_plan` (below) changes.

 Changing the `target_version` resets the schedule immediately, clearing all progress.

 Teleport proxies will be updated to serve the desired agent version and edition from `/v1/webapi/find`.
 The version and edition served from that endpoint will be configured using new `autoupdate_agent_plan` resource.

-Whether the Teleport updater querying the endpoint is instructed to upgrade (via the `agent_autoupdate` field) is
+Whether the Teleport updater querying the endpoint is instructed to upgrade (via the `agent_auto_update` field) is
 dependent on:
 - The `host=[uuid]` parameter sent to `/v1/webapi/find`
 - The `group=[name]` parameter sent to `/v1/webapi/find`

 When the updater queries the proxy via `/v1/webapi/find?host=[uuid]&group=[name]`, the proxies query the
-`autoupdate_agent_plan` status to determine the value of `agent_autoupdate: true`.
+`autoupdate_agent_plan` status to determine the value of `agent_auto_update: true`.

 The returned JSON looks like:

```json
{
  "server_edition": "enterprise",
-  "agent_version": "15.1.1",
-  "agent_autoupdate": true,
-  "agent_update_jitter_seconds": 10
+  "auto_update": {
+    "agent_version": "15.1.1",
+    "agent_auto_update": true,
+    "agent_update_jitter_seconds": 10
+  },
+  // ...
}
```

 Notes:

-- Agents will only update if `agent_autoupdate` is `true`, but new installations will use `agent_version` regardless of
-  the value in `agent_autoupdate`.
+- Agents will only update if `agent_auto_update` is `true`, but new installations will use `agent_version` regardless of
+  the value in `agent_auto_update`.

$ teleport-update enable --proxy example.teleport.sh --template 'https://example.com/teleport-{{ .Edition }}-{{ .Version }}-{{ .Arch }}.tgz'
```
(Checksum will use template path + `.sha256`)

+For Teleport installs with custom data directories, the data directory must be specified on each invocation:
+```shell
+$ teleport-update enable --proxy example.teleport.sh --data-dir /var/lib/teleport
+```
+
 #### Filesystem

 ```

Both `update` and `enable` will maintain a shared lock file preventing any re-entrant executions.

 The `enable` subcommand will:
 1. If an updater-incompatible version of the Teleport package is installed, fail immediately.
 2. Query the `/v1/webapi/find` endpoint.
-3. If the current updater-managed version of Teleport is the latest, jump to (14).
+3. If the current updater-managed version of Teleport is the latest, jump to (15).
 4. Ensure there is enough free disk space to update Teleport via `unix.Statfs()` and `content-length` header from `HEAD` request.
 5.
Download the desired Teleport tarball specified by `agent_version` and `server_edition`. 6. Download and verify the checksum (tarball URL suffixed with `.sha256`). 7. Extract the tarball to `/var/lib/teleport/versions/VERSION` and write the SHA to `/var/lib/teleport/versions/VERSION/sha256`. -8. Replace any existing binaries or symlinks with symlinks to the current version. -9. Backup `/var/lib/teleport/proc/sqlite.db` into `/var/lib/teleport/versions/OLD-VERSION/backup/sqlite.db` and create `backup.yaml`. -10. Restart the agent if the systemd service is already enabled. -11. Set `active_version` in `update.yaml` if successful or not enabled. -12. Replace the symlinks/binaries and `/var/lib/teleport/proc/sqlite.db` and quit (exit 1) if unsuccessful. -13. Remove all stored versions of the agent except the current version and last working version. -14. Configure `update.yaml` with the current proxy address and group, and set `enabled` to true. +8. Verify that the downloaded binaries are valid executables on the host. +9. Replace any existing binaries or symlinks with symlinks to the current version. +10. Backup `/var/lib/teleport/proc/sqlite.db` into `/var/lib/teleport/versions/OLD-VERSION/backup/sqlite.db` and create `backup.yaml`. +11. Restart the agent if the systemd service is already enabled. +12. Set `active_version` in `update.yaml` if successful or not enabled. +13. Replace the symlinks/binaries and `/var/lib/teleport/proc/sqlite.db` and quit (exit 1) if unsuccessful. +14. Remove all stored versions of the agent except the current version and last working version. +15. Configure `update.yaml` with the current proxy address and group, and set `enabled` to true. The `disable` subcommand will: 1. Configure `update.yaml` to set `enabled` to false. @@ -1449,19 +1455,20 @@ The `disable` subcommand will: When `update` subcommand is otherwise executed, it will: 1. Check `update.yaml`, and quit (exit 0) if `enabled` is false, or quit (exit 1) if `enabled` is true and no proxy address is set. 2. Query the `/v1/webapi/find` endpoint. -3. Check that `agent_autoupdates` is true, quit otherwise. +3. Check that `agent_auto_updates` is true, quit otherwise. 4. If the current version of Teleport is the latest, quit. 5. Wait `random(0, agent_update_jitter_seconds)` seconds. 6. Ensure there is enough free disk space to update Teleport via `unix.Statfs()` and `content-length` header from `HEAD` request. 7. Download the desired Teleport tarball specified by `agent_version` and `server_edition`. 8. Download and verify the checksum (tarball URL suffixed with `.sha256`). 9. Extract the tarball to `/var/lib/teleport/versions/VERSION` and write the SHA to `/var/lib/teleport/versions/VERSION/sha256`. -10. Update symlinks to point at the new version. -11. Backup `/var/lib/teleport/proc/sqlite.db` into `/var/lib/teleport/versions/OLD-VERSION/backup/sqlite.db` and create `backup.yaml`. -12. Restart the agent if the systemd service is already enabled. -13. Set `active_version` in `update.yaml` if successful or not enabled. -14. Replace the old symlinks/binaries and `/var/lib/teleport/proc/sqlite.db` and quit (exit 1) if unsuccessful. -15. Remove all stored versions of the agent except the current version and last working version. +10. Verify that the downloaded binaries are valid executables on the host. +11. Update symlinks to point at the new version. +12. Backup `/var/lib/teleport/proc/sqlite.db` into `/var/lib/teleport/versions/OLD-VERSION/backup/sqlite.db` and create `backup.yaml`. +13. 
Restart the agent if the systemd service is already enabled.
14. Set `active_version` in `update.yaml` if successful or not enabled.
15. Replace the old symlinks/binaries and `/var/lib/teleport/proc/sqlite.db` and quit (exit 1) if unsuccessful.
16. Remove all stored versions of the agent except the current version and last working version.

To guarantee auto-updates of the updater itself, all commands will first check for an `active_version`, and reexec using the `teleport-update` at that version if present and different.
The `/usr/local/bin/teleport-update` symlink will take precedence to avoid reexec in most scenarios.

From 842dde8eef9f8df3f17bfdd9d193d4661db92706 Mon Sep 17 00:00:00 2001
From: Stephen Levine
Date: Thu, 3 Oct 2024 15:59:37 -0400
Subject: [PATCH 094/105] alerts note

---
 rfd/0184-agent-auto-updates.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/rfd/0184-agent-auto-updates.md b/rfd/0184-agent-auto-updates.md
index 45cd4048e7742..5fc3ccec41ba2 100644
--- a/rfd/0184-agent-auto-updates.md
+++ b/rfd/0184-agent-auto-updates.md
@@ -308,6 +308,8 @@ I and the customer get an alert if the canary testing has not succeeded after an
 Teleport cloud operators and the customer can access the canary `hostname` and `host_uuid`
 to identify broken canary agents.
 
+Customers receive cluster alerts, while Cloud receive alerts driven by Teleport metrics.
+
 The rollout resumes.
 
 ##### Failure mode 1(b): the new version crashes, but not on the canaries

From 7492b248c0f220804e171fae3681a026ad61fd5c Mon Sep 17 00:00:00 2001
From: Stephen Levine
Date: Thu, 3 Oct 2024 16:02:16 -0400
Subject: [PATCH 095/105] typos

---
 rfd/0184-agent-auto-updates.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/rfd/0184-agent-auto-updates.md b/rfd/0184-agent-auto-updates.md
index 5fc3ccec41ba2..f302b0f239bf0 100644
--- a/rfd/0184-agent-auto-updates.md
+++ b/rfd/0184-agent-auto-updates.md
@@ -308,7 +308,7 @@ I and the customer get an alert if the canary testing has not succeeded after an
-Customers receive cluster alerts, while Cloud receive alerts driven by Teleport metrics.
+Customers receive cluster alerts, while Cloud receives alerts driven by Teleport metrics.
 
 The rollout resumes.
 
@@ -1642,7 +1642,7 @@ When TUF is added, that events related to supply chain security may be sent to t
 `teleport-update` is intended to be a minimal binary, with few dependencies, that is used to bootstrap initial Teleport agent installations.
 It may be baked into AMIs or containers.
-If the entirely `teleport` binary were used instead, security scanners would match vulnerabilities all Teleport dependencies, so customers would have to handle rebuilding artifacts (e.g., AMIs) more often.
+If the entire `teleport` binary were used instead, security scanners would match vulnerabilities in all Teleport dependencies, so customers would have to handle rebuilding artifacts (e.g., AMIs) more often.
 Deploying these updates is often more disruptive than a soft restart of the agent triggered by the auto-updater.
 `teleport-update` will also handle `tbot` updates in the future, and it would be undesirable to distribute `teleport` with `tbot` just to enable automated updates.
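To make the `enable` and `update` sequences above concrete, here is a minimal Go sketch of the free-space check (via `unix.Statfs()` and a `HEAD` request's `content-length`) and the tarball checksum verification (tarball URL suffixed with `.sha256`). This is illustrative, not the actual updater code: the function names, the Linux-specific `Statfs` field arithmetic, and the error handling are all assumptions.

```go
package updater

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"

	"golang.org/x/sys/unix"
)

// enoughFreeSpace verifies via unix.Statfs that dir has at least need bytes
// available; need would come from the Content-Length of a HEAD request.
func enoughFreeSpace(dir string, need uint64) (bool, error) {
	var fs unix.Statfs_t
	if err := unix.Statfs(dir, &fs); err != nil {
		return false, err
	}
	return fs.Bavail*uint64(fs.Bsize) >= need, nil
}

// downloadAndVerify streams the tarball at url into dst while hashing it,
// then compares the digest to the hex checksum served at url + ".sha256".
func downloadAndVerify(url, dst string) error {
	want, err := fetchChecksum(url + ".sha256")
	if err != nil {
		return err
	}
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	f, err := os.Create(dst)
	if err != nil {
		return err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(io.MultiWriter(f, h), resp.Body); err != nil {
		return err
	}
	if got := hex.EncodeToString(h.Sum(nil)); got != want {
		return fmt.Errorf("checksum mismatch: got %s, want %s", got, want)
	}
	return nil
}

// fetchChecksum reads the first whitespace-delimited field of a .sha256 file.
func fetchChecksum(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(io.LimitReader(resp.Body, 4096))
	if err != nil {
		return "", err
	}
	fields := strings.Fields(string(b))
	if len(fields) == 0 {
		return "", fmt.Errorf("empty checksum file at %s", url)
	}
	return fields[0], nil
}
```

Hashing while streaming the download avoids re-reading the tarball from disk before the extract step.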
From edb0f196a2fd1fe3b31df984c42c21dc8b787ff0 Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Thu, 3 Oct 2024 16:03:13 -0400 Subject: [PATCH 096/105] Update rfd/0184-agent-auto-updates.md Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com> --- rfd/0184-agent-auto-updates.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfd/0184-agent-auto-updates.md b/rfd/0184-agent-auto-updates.md index f302b0f239bf0..a886c4e2321b7 100644 --- a/rfd/0184-agent-auto-updates.md +++ b/rfd/0184-agent-auto-updates.md @@ -450,7 +450,7 @@ teleport-update use-version v16.2.3 I acknowledge that I am leaving automatic updates: ```shell teleport-update use-version v16.2.3 --disable-automatic-updates -# Disabling automatic updates for the node. You can enable them back by running `teleport-update enable` +# Disabling automatic updates. You can re-enable them by running `teleport-update enable` # Downloading version 16.2.3 # Restarting teleport # Cleaning up old binaries From cec0f69de22c240d72b4df49c0f382830e834c8b Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Thu, 3 Oct 2024 16:07:40 -0400 Subject: [PATCH 097/105] clarify canary logic --- rfd/0184-agent-auto-updates.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/rfd/0184-agent-auto-updates.md b/rfd/0184-agent-auto-updates.md index a886c4e2321b7..a2f5d319225c0 100644 --- a/rfd/0184-agent-auto-updates.md +++ b/rfd/0184-agent-auto-updates.md @@ -1072,9 +1072,10 @@ update success criteria. #### Canary testing (phase 5) -A group in `canary` state will get assigned canaries. -The proxies will instruct those canaries to update now. -During each reconciliation loop, the auth will lookup the instance heartbeat of each canary in the backend. +A group in `canary` state will be randomly assigned `canary_count` canary agents. +Auth servers will select those canaries by reading them from instance heartbeats and writing them to the `canaries` list in `agent_rollout_plan` status. +The proxies will instruct those canaries to update immediately. +During each reconciliation loop, the auth will lookup the instance heartbeat of each canary in the backend and update `agent_rollout_plan` status if needed. Once all canaries have a heartbeat containing the new version (the heartbeat must not be older than 20 minutes), they successfully came back online and the group can transition to the `active` state. From 27f4741a40b1e30e5aeb06a0f8da0a1c32cdfa1c Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Thu, 3 Oct 2024 16:28:34 -0400 Subject: [PATCH 098/105] Update rfd/0184-agent-auto-updates.md Co-authored-by: Hugo Shaka --- rfd/0184-agent-auto-updates.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfd/0184-agent-auto-updates.md b/rfd/0184-agent-auto-updates.md index a2f5d319225c0..c674f5a9e5ea3 100644 --- a/rfd/0184-agent-auto-updates.md +++ b/rfd/0184-agent-auto-updates.md @@ -1073,7 +1073,7 @@ update success criteria. #### Canary testing (phase 5) A group in `canary` state will be randomly assigned `canary_count` canary agents. -Auth servers will select those canaries by reading them from instance heartbeats and writing them to the `canaries` list in `agent_rollout_plan` status. +Auth servers will select those canaries by reading them from the auth instance inventory and writing them to the `canaries` list in `agent_rollout_plan` status. The proxies will instruct those canaries to update immediately. 
During each reconciliation loop, the auth will look up the instance heartbeat of each canary in the backend and update `agent_rollout_plan` status if needed.

From 56a27905495f332f0364bb859202e738de62c4c8 Mon Sep 17 00:00:00 2001
From: Stephen Levine
Date: Tue, 8 Oct 2024 14:01:04 -0400
Subject: [PATCH 099/105] Support for multiple installations / tarball

---
 rfd/0184-agent-auto-updates.md | 88 +++++++++++++++++++++++++++++++---
 1 file changed, 82 insertions(+), 6 deletions(-)

diff --git a/rfd/0184-agent-auto-updates.md b/rfd/0184-agent-auto-updates.md
index c674f5a9e5ea3..e4643cd307a7d 100644
--- a/rfd/0184-agent-auto-updates.md
+++ b/rfd/0184-agent-auto-updates.md
@@ -106,7 +106,7 @@ We will introduce two user-facing resources:
 ```yaml
 kind: autoupdate_agent_plan
 spec:
-  current_version: v1
+  start_version: v1
   target_version: v2
   schedule: regular
   strategy: grouped
@@ -271,7 +271,7 @@ advertise they have rolled-back. The maintenance is stuck until the canaries are
 ```yaml
 kind: autoupdate_agent_plan
 spec:
-  current_version: v1
+  start_version: v1
   target_version: v2
   schedule: regular
   strategy: grouped
@@ -1283,8 +1283,10 @@ minute will be considered in these formulas.
 
 ### Linux Agents
 
-We will ship a new auto-updater package for Linux servers written in Go that does not interface with the system package manager.
-It will be distributed as a separate package from Teleport, and manage the installation of the correct Teleport agent version manually.
+We will ship a new auto-updater binary for Linux servers written in Go that does not interface with the system package manager.
+It will be distributed within the existing `teleport` packages and, additionally, in a dedicated `teleport-update-vX.Y.Z.tgz` tarball.
+It will manage the installation of the correct Teleport agent version manually.
+
 It will read the unauthenticated `/v1/webapi/find` endpoint from the Teleport proxy, parse new fields on that endpoint, and install the specified agent version according to the specified update plan.
 It will download the correct version of Teleport as a tarball, unpack it in `/var/lib/teleport`, and ensure it is symlinked from `/usr/local/bin`.
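As a sketch of how the updater could consume that endpoint, the snippet below queries `/v1/webapi/find` with the host UUID and group parameters and decodes the `auto_update` fields shown earlier in this RFD. The struct and function names are illustrative, and the wire format is still subject to change.

```go
package updater

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// findResponse mirrors the /v1/webapi/find fields proposed in this RFD.
type findResponse struct {
	ServerEdition string `json:"server_edition"`
	AutoUpdate    struct {
		AgentVersion             string `json:"agent_version"`
		AgentAutoUpdate          bool   `json:"agent_auto_update"`
		AgentUpdateJitterSeconds int    `json:"agent_update_jitter_seconds"`
	} `json:"auto_update"`
}

// queryFind asks the proxy which version this host/group pair should run.
func queryFind(proxyAddr, hostUUID, group string) (*findResponse, error) {
	u := url.URL{
		Scheme:   "https",
		Host:     proxyAddr,
		Path:     "/v1/webapi/find",
		RawQuery: url.Values{"host": {hostUUID}, "group": {group}}.Encode(),
	}
	resp, err := http.Get(u.String())
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("find endpoint returned %s", resp.Status)
	}
	var fr findResponse
	if err := json.NewDecoder(resp.Body).Decode(&fr); err != nil {
		return nil, err
	}
	return &fr, nil
}
```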
@@ -1292,6 +1294,7 @@ Source code for the updater will live in the main Teleport repository, with the #### Installation +Package-initiated install: ```shell $ apt-get install teleport $ teleport-update enable --proxy example.teleport.sh @@ -1300,6 +1303,15 @@ $ teleport-update enable --proxy example.teleport.sh $ systemctl enable teleport ``` +Packageless install: +```shell +$ curl https://cdn.teleport.dev/teleport-update.tgz | tar xzf +$ ./teleport-update enable --proxy example.teleport.sh + +# if not enabled already, configure teleport and: +$ systemctl enable teleport +``` + For grouped updates, a group identifier may be configured: ```shell $ teleport-update enable --proxy example.teleport.sh --group staging @@ -1311,13 +1323,21 @@ $ teleport-update enable --proxy example.teleport.sh --template 'https://example ``` (Checksum will use template path + `.sha256`) -For Teleport installs with custom data directories, the data directory must be specified on each invocation: +For Teleport installs with custom data directories, the data directory must be specified on each binary invocation: ```shell -$ teleport-update enable --proxy example.teleport.sh --data-dir /var/lib/teleport +$ teleport-update enable --proxy example.teleport.sh --data-dir /var/lib/teleport ``` +For managing multiple Teleport installs, the install suffix must be specified on each binary invocation: +```shell +$ teleport-update enable --proxy example.teleport.sh --install-suffix clusterA +``` +This will create suffixed directories for binaries (`/usr/local/teleport/clusterA/bin`) and systemd units (`teleport-clusterA`). + + #### Filesystem +For a default install, without --install-suffix: ``` $ tree /var/lib/teleport /var/lib/teleport @@ -1356,6 +1376,7 @@ $ tree /var/lib/teleport │ └── systemd │ └── teleport.service └── update.yaml + $ ls -l /usr/local/bin/tsh /usr/local/bin/tsh -> /var/lib/teleport/versions/15.0.0/bin/tsh $ ls -l /usr/local/bin/tbot @@ -1368,6 +1389,61 @@ $ ls -l /usr/local/lib/systemd/system/teleport.service /usr/local/lib/systemd/system/teleport.service -> /var/lib/teleport/versions/15.0.0/etc/systemd/teleport.service ``` +With --install-suffix clusterA: +``` +$ tree /var/lib/teleport/install/clusterA +/var/lib/teleport/install/clusterA +└── versions + ├── 15.0.0 + │ ├── bin + │ │ ├── tsh + │ │ ├── tbot + │ │ ├── ... # other binaries + │ │ ├── teleport-update + │ │ └── teleport + │ ├── etc + │ │ └── systemd + │ │ └── teleport.service + │ └── backup + │ ├── sqlite.db + │ └── backup.yaml + ├── 15.1.1 + │ ├── bin + │ │ ├── tsh + │ │ ├── tbot + │ │ ├── ... # other binaries + │ │ ├── teleport-update + │ │ └── teleport + │ └── etc + │ └── systemd + │ └── teleport.service + └── update.yaml + +/var/lib/teleport +└── versions + ├── system # if installed via OS package + ├── bin + │ ├── tsh + │ ├── tbot + │ ├── ... 
# other binaries + │ ├── teleport-update + │ └── teleport + └── etc + └── systemd + └── teleport.service + +$ ls -l /usr/local/bin/tsh +/usr/local/teleport/clusterA/bin/tsh -> /var/lib/teleport/install/clusterA/versions/15.0.0/bin/tsh +$ ls -l /usr/local/bin/tbot +/usr/local/teleport/clusterA/bin/tbot -> /var/lib/teleport/install/clusterA/versions/15.0.0/bin/tbot +$ ls -l /usr/local/bin/teleport +/usr/local/teleport/clusterA/bin/teleport -> /var/lib/teleport/install/clusterA/versions/15.0.0/bin/teleport +$ ls -l /usr/local/bin/teleport-update +/usr/local/teleport/clusterA/bin/teleport-update -> /var/lib/teleport/install/clusterA/versions/15.0.0/bin/teleport-update +$ ls -l /usr/local/lib/systemd/system/teleport-clusterA.service +/usr/local/lib/systemd/system/teleport-clutserA.service -> /var/lib/teleport/install/clusterA/versions/15.0.0/etc/systemd/teleport.service +``` + ##### update.yaml This file stores configuration for `teleport-update`. From 6b62769b444bfcd03658c8b5a82190dd9395fd24 Mon Sep 17 00:00:00 2001 From: hugoShaka Date: Thu, 10 Oct 2024 10:44:43 -0400 Subject: [PATCH 100/105] Address reviewer's feedback - Rephrase the UX section to not assume prior canary knowledge - Explicit how the canaries are picked, the limitations, and potential improvements - replace node with instance to avoid confusion between ssh nodes and generic teleport agent instances - Explicit how the previous updater interacts with the new one - More explicit names for command line args --- rfd/0184-agent-auto-updates.md | 70 ++++++++++++++++++++++------------ 1 file changed, 45 insertions(+), 25 deletions(-) diff --git a/rfd/0184-agent-auto-updates.md b/rfd/0184-agent-auto-updates.md index e4643cd307a7d..c473cc6fb02ea 100644 --- a/rfd/0184-agent-auto-updates.md +++ b/rfd/0184-agent-auto-updates.md @@ -181,7 +181,7 @@ The following product requirements were defined by our leadership team: 13. I should be able to install an auto-updating deployment of Teleport via whatever mechanism I want to, including OS packages such as apt and yum. -14. If new nodes join a bucket outside the upgrade window, and you are within your compatibility window, wait until your next group update start. +14. If new instances join a bucket outside the upgrade window, and you are within your compatibility window, wait until your next group update start. If you are not within your compatibility window, attempt to upgrade right away. 15. If an agent comes back online after some period of time, and it is still compatible with @@ -261,9 +261,11 @@ Now, new agents will install v2 by default, and v3 after the maintenance. ##### Failure mode 1(a): the new version crashes -I create a new deployment with a broken version. The version is deployed to the canaries. -The canaries crash, the updater reverts the update, the agents connect back online and -advertise they have rolled-back. The maintenance is stuck until the canaries are running the target version. +I create a new deployment with a broken version. The version is deployed to a few instances picked randomly. +Those instances are called the canaries. As the new version has an issue, one or many of those canary instances can't run the +new version and their updater has to revert to the previous one. The agents connect back online and +advertise they have failed to update. The maintenance is stuck until every instance that got selected to test the new version +is back online, and running the new version.
Autoupdate agent plan @@ -304,40 +306,47 @@ status: ```
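For illustration, the canary selection recorded in the status above can start as a uniform random draw from the group's connected instances. The sketch below is not the auth-server implementation; the `Instance` type simply mirrors the `updater_uuid`, `host_uuid`, and `hostname` fields shown in the status.

```go
package rollout

import (
	"math/rand"
	"slices"
)

// Instance carries the identifiers recorded in the rollout status above.
type Instance struct {
	UpdaterUUID string
	HostUUID    string
	Hostname    string
}

// pickCanaries draws up to canaryCount instances uniformly at random from a
// group's connected instances, without mutating the input slice.
func pickCanaries(instances []Instance, canaryCount int) []Instance {
	shuffled := slices.Clone(instances)
	rand.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})
	return shuffled[:min(canaryCount, len(shuffled))]
}
```

A possible refinement, discussed later in this RFD, is to bias the selection toward instances running less common services to improve coverage.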
-I and the customer get an alert if the canary testing has not succeeded after an hour. -Teleport cloud operators and the customer can access the canary `hostname` and `host_uuid` -to identify broken canary agents. +I and the customer get an alert if the test instances are not running the expected version after an hour. +Teleport cloud operators and the customer can look up the hostname and host UUID of the test instances +to identify which one(s) failed to update and go troubleshoot. Customers receive cluster alerts, while Cloud receives alerts driven by Teleport metrics. The rollout resumes. +If the issue is related to a specific instance and not the new Teleport version (e.g. VM out of disk space), +the user can instruct teleport to pick 5 new canary instances. + ##### Failure mode 1(b): the new version crashes, but not on the canaries This scenario is the same as the previous one but the Teleport agent bug only manifests on select agents. -For example: [the agent fails to read cloud-provider specific metadata and crashes](TODO add link). +For example: [the agent fails to read cloud-provider specific metadata and crashes](https://github.com/gravitational/teleport/issues/42312). +This can also be caused by a specific Teleport service crashing. For example, the discovery service is crashing but +all other services are OK. As most instances are running ssh_service, the discovery_service instances are less likely +to get picked. + +The version is deployed to a few instances picked randomly but none of them runs on the affected cloud provider. +The canary instances can update properly and the update is sent to every instance of the group. -The canaries might not select one of the affected agents and allow the update to proceed. All agents are updated, and all agents hosted on the cloud provider affected by the bug crash. The updaters of the affected agents will attempt to self-heal by reverting to the previous version. -Once the previous Teleport version is running, the agent will advertise the update failed and that it had to rollback. -If too many agents fail, this will block the group from transitioning from `active` to `done`, protecting the future +Once the previous Teleport version is running, the agents from the affected cloud platform will advertise the update +failed, and they had to rollback. + +If too many agents failed, this will block the group from transitioning from `active` to `done`, protecting the future groups from the faulty updates. ##### Failure mode 2(a): the new version crashes, and the old version cannot start -I create a new deployment, with a broken version. The version is deployed to the canaries. -The canaries attempt the update, and the new Teleport instance crashes. -The updater fails to self-heal as the old version does not start anymore. - -This is typically caused by external sources like full disk, faulty networking, resource exhaustion. -This can also be caused by the Teleport control plan not being available. +I create a new deployment with a broken version. The version is deployed to a few instances picked randomly. +Those instances are called the canaries. As the new version has an issue, one or many of those canary instances can't +run the new version. Their updater also fails to revert to the previous version. The group update is stuck until the canary comes back online and runs the latest version. -The customer and Teleport cloud receive an alert. The customer and Teleport cloud can retrieve the -hostid and hostname of the faulty canaries. 
With this information they can go troubleshoot the failed agents.
+The customer and Teleport cloud receive an alert. Both customer and Teleport cloud can retrieve the
+host ID and hostname of the faulty canary instances. With this information they can go troubleshoot the failed agents.

##### Failure mode 2(b): the new version crashes, and the old version cannot start, but not on the canaries

@@ -431,7 +440,7 @@ tctl autoupdate agent resume

#### As a Teleport user and a Teleport on-call responder, I want to be able to pin a specific Teleport version of an agent to understand if a specific behaviour is caused by a specific Teleport version

-I connect to the node and lookup its status:
+I connect to the server and look up its status:
```shell
teleport-update status
# Running version v16.2.5
@@ -491,13 +500,15 @@ tctl autoupdate agent status

I can trigger the dev group immediately using the command:

```shell
-tctl autoupdate agent start-update dev --no-canary
-# Dev group update triggered (canary or active)
+tctl autoupdate agent start-update dev [--force]
+# Dev group update triggered.
```

+The `--force` flag allows the user to skip progressive deployment mechanisms such as canaries or backpressure.
+
Alternatively

```shell
-tctl autoupdate agent force-done dev
+tctl autoupdate agent mark-done dev
```
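These `tctl` verbs would presumably be thin wrappers over the `AutoUpdateService` RPCs defined later in this document. Below is a hedged sketch of that wiring: the client type follows the standard name generated from the proto's `go_package`, and the assumption that `ForceAgentGroupRequest` carries a `group` field mirrors `TriggerAgentGroupRequest` (the only request fully shown in this RFD).

```go
package cli

import (
	"context"

	autoupdatev1 "github.com/gravitational/teleport/api/gen/proto/go/teleport/autoupdate/v1"
)

// triggerGroup fast-tracks a group's maintenance, roughly what
// `tctl autoupdate agent start-update dev` performs under the hood.
func triggerGroup(ctx context.Context, clt autoupdatev1.AutoUpdateServiceClient, group string) error {
	_, err := clt.TriggerAgentGroup(ctx, &autoupdatev1.TriggerAgentGroupRequest{Group: group})
	return err
}

// markGroupDone forces a group into the done state, as
// `tctl autoupdate agent mark-done dev` would.
func markGroupDone(ctx context.Context, clt autoupdatev1.AutoUpdateServiceClient, group string) error {
	_, err := clt.ForceAgentGroup(ctx, &autoupdatev1.ForceAgentGroupRequest{Group: group})
	return err
}
```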
@@ -562,6 +573,7 @@ I can also install teleport using the package manager, then enroll the agent int #### As a Teleport user I want to enroll my existing agent into AUs I have an agent, installed from a package manager or by manually unpacking the tarball. +This agent might or might not be enrolled in the previous automatic update mechanism (apt/yum-based). I have the teleport updater installed and available in my path. I run: @@ -582,6 +594,8 @@ teleport-update enable --group production > It used the configuration to pick the right proxy address. As teleport is already running, the teleport service is > reloaded to use the new binary. +If the agent was previously enrolled into AUs with the old teleport updater package, the `enable` command will also +remove the old package. ### Teleport Resources @@ -746,7 +760,7 @@ service AutoUpdateService { rpc ForceAgentGroup(ForceAgentGroupRequest) returns (AutoUpdateAgentPlan); // ResetAgentGroup resets the state of an agent group. // For `canary`, this means new canaries are picked - // For `active`, this means the initial node count is computed again. + // For `active`, this means the initial instance count is computed again. rpc ResetAgentGroup(ResetAgentGroupRequest) returns (AutoUpdateAgentPlan); // RollbackAgentGroup changes the state of an agent group to `rolledback`. rpc RollbackAgentGroup(RollbackAgentGroupRequest) returns (AutoUpdateAgentPlan); @@ -1083,6 +1097,12 @@ they successfully came back online and the group can transition to the `active` If canaries never update, report rollback, or disappear, the group will stay stuck in `canary` state. An alert will eventually fire, warning the user about the stuck update. +> [!NOTE] +> In the first version, canary selection will happen randomly. As most instances are running the ssh_service and not +> the other ones, we are less likely to catch an issue in a less common service. +> An optimisation would be to try to pick canaries maximizing the service coverage. +> This would make the test more robust and provide better availability guarantees. + #### Updating a group A group in `active` mode is currently being updated. The conditions to leave `active` mode and transition to the @@ -1224,7 +1244,7 @@ is running which version and belongs to which group. Hello messages are sent on connection and are used to build the serve's local inventory. This information is available almost instantaneously after the connection and can be cheaply queried by the auth ( -everything is in memory). The inventory is then used to count the local nodes and drive the rollout. +everything is in memory). The inventory is then used to count the local instances and drive the rollout. Both instance heartbeats and Hello merssages will be extended to incorporate and send data that is written to `/var/lib/teleport/versions/update.yaml` and `/tmp/teleport_update_uuid` by the `teleport-update` binary. 
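For completeness, here is a minimal sketch of how `teleport-update` could maintain the updater UUID file referenced above, regenerating it when missing or unreadable. The `github.com/google/uuid` dependency and the file mode are assumptions, not the actual implementation.

```go
package updater

import (
	"os"
	"strings"

	"github.com/google/uuid"
)

const updateUUIDPath = "/tmp/teleport_update_uuid"

// loadOrCreateUpdateUUID returns the persisted updater UUID, regenerating
// the file when it is missing or does not contain a valid UUID.
func loadOrCreateUpdateUUID() (string, error) {
	if b, err := os.ReadFile(updateUUIDPath); err == nil {
		if id, err := uuid.Parse(strings.TrimSpace(string(b))); err == nil {
			return id.String(), nil
		}
	}
	id := uuid.NewString()
	if err := os.WriteFile(updateUUIDPath, []byte(id+"\n"), 0o644); err != nil {
		return "", err
	}
	return id, nil
}
```

This UUID is what the updater sends as the `host` parameter of `/v1/webapi/find`, and what instance heartbeats and Hello messages would carry back to the auth servers.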
From 480483465b2d15df0041f55f43077cfffa05ca36 Mon Sep 17 00:00:00 2001 From: hugoShaka Date: Thu, 10 Oct 2024 17:22:38 -0400 Subject: [PATCH 101/105] agent_plan -> agent_rollout + reuse autoupdate_config --- rfd/0184-agent-auto-updates.md | 619 ++++++++++++++++++++------------- 1 file changed, 375 insertions(+), 244 deletions(-) diff --git a/rfd/0184-agent-auto-updates.md b/rfd/0184-agent-auto-updates.md index c473cc6fb02ea..ddb41bffa4d12 100644 --- a/rfd/0184-agent-auto-updates.md +++ b/rfd/0184-agent-auto-updates.md @@ -82,62 +82,86 @@ We will introduce two user-facing resources: ```yaml kind: autoupdate_config spec: - agent_auto_update_mode: enable - agent_schedules: - regular: - - name: dev - days: ["Mon", "Tue", "Wed", "Thu"] - start_hour: 0 - alert_after: 4h - canary_count: 5 # added in phase 5 - max_in_flight: 20% # added in phase 6 - - name: prod - days: ["Mon", "Tue", "Wed", "Thu"] - start_hour: 0 - wait_days: 1 # update this group at least 1 day after the previous one - alert_after: 4h - canary_count: 5 # added in phase 5 - max_in_flight: 20% # added in phase 6 + # existing field, deprecated + tools_autoupdate: true/false + # new fields + tools: + mode: enabled/disabled/suspended + agents: + mode: enabled/disabled/suspended + schedules: + regular: + - name: dev + days: ["Mon", "Tue", "Wed", "Thu"] + start_hour: 0 + alert_after: 4h + canary_count: 5 # added in phase 5 + max_in_flight: 20% # added in phase 6 + - name: prod + days: ["Mon", "Tue", "Wed", "Thu"] + start_hour: 0 + wait_days: 1 # update this group at least 1 day after the previous one + alert_after: 4h + canary_count: 5 # added in phase 5 + max_in_flight: 20% # added in phase 6 ``` -2. The `autoupdate_agent_plan` resource, with `spec` owned by the Teleport cluster administrator (e.g. Teleport Cloud operators). - Its `status` is owned by Teleport and contains the current rollout status. Some parts of the status can be changed via - select RPCs (for example, an RPC to fast-track a group update). +2. The `autoupdate_version` resource, with `spec` owned by the Teleport cluster administrator (e.g. Teleport Cloud operators). ```yaml - kind: autoupdate_agent_plan + kind: autoupdate_version spec: - start_version: v1 - target_version: v2 - schedule: regular - strategy: grouped - autoupdate_mode: enabled - status: - groups: - - name: dev - start_time: 2020-12-09T16:09:53+00:00 - initial_count: 100 # part of phase 4 - present_count: 100 # part of phase 4 - failed_count: 0 # part of phase 4 - progress: 0 - state: canaries - canaries: # part of phase 5 - - updater_uuid: abc - host_uuid: def - hostname: foo.example.com - success: false - last_update_time: 2020-12-10T16:09:53+00:00 - last_update_reason: canaryTesting - - name: prod - start_time: 0000-00-00 - initial_count: 0 - present_count: 0 - failed_count: 0 - progress: 0 - state: unstarted - last_update_time: 2020-12-10T16:09:53+00:00 - last_update_reason: newAgentPlan + # existing fields + tools_version: vX + # new fields + agents: + start_version: v1 + target_version: v2 + schedule: regular + strategy: grouped + mode: enabled ``` +We will also introduce an internal resource, tracking the agent rollout status. This resource is +owned by Teleport. Users and cluster operators can read its content but cannot create/update/upsert/delete it. +This resource is editable via select RPCs (e.g. start or rollback a group). 
+ +```yaml +kind: autoupdate_agent_rollout +spec: + # content copied from the `autoupdate_version.spec.agents` + version_config: + start_version: v1 + target_version: v2 + schedule: regular + strategy: grouped + mode: enabled +status: + groups: + - name: dev + start_time: 2020-12-09T16:09:53+00:00 + initial_count: 100 # part of phase 4 + present_count: 100 # part of phase 4 + failed_count: 0 # part of phase 4 + progress: 0 + state: canaries + canaries: # part of phase 5 + - updater_uuid: abc + host_uuid: def + hostname: foo.example.com + success: false + last_update_time: 2020-12-10T16:09:53+00:00 + last_update_reason: canaryTesting + - name: prod + start_time: 0000-00-00 + initial_count: 0 + present_count: 0 + failed_count: 0 + progress: 0 + state: unstarted + last_update_time: 2020-12-10T16:09:53+00:00 + last_update_reason: newAgentPlan +``` + You can find more details about each resource field [in the dedicated resource section](#teleport-resources). ## Details @@ -268,16 +292,17 @@ advertise they have failed to update. The maintenance is stuck until every insta is back online, and running the new version.
-Autoupdate agent plan +Autoupdate agent rollout ```yaml -kind: autoupdate_agent_plan +kind: autoupdate_agent_rollout spec: - start_version: v1 - target_version: v2 - schedule: regular - strategy: grouped - autoupdate_mode: enabled + version_config: + start_version: v1 + target_version: v2 + schedule: regular + strategy: grouped + mode: enabled status: groups: - name: dev @@ -607,18 +632,21 @@ This is how Teleport customers can specify their automatic update preferences. ```yaml kind: autoupdate_config spec: - # agent_auto_update allows turning agent updates on or off at the - # cluster level. Only turn agent automatic updates off if self-managed - # agent updates are in place. Setting this to pause will temporarily halt the rollout. - agent_auto_update_mode: disable|enable|pause - - # agent_schedules specifies version rollout schedules for agents. - # The schedule used is determined by the schedule associated - # with the version in the autoupdate_agent_plan resource. - # For now, only the "regular" schedule is configurable. - agent_schedules: - # rollout schedule must be "regular" for now - regular: + # existing field + tools_autoupdate: true + tools: + mode: enabled/disabled/suspended + agents: + # agent_auto_update allows turning agent updates on or off at the + # cluster level. Only turn agent automatic updates off if self-managed + # agent updates are in place. Setting this to pause will temporarily halt the rollout. + mode: enabled/disabled/suspended + # agent_schedules specifies version rollout schedules for agents. + # The schedule used is determined by the schedule associated + # with the version in the autoupdate_version resource. + # For now, only the "regular" schedule is configurable. + schedules: + regular: # name of the group. Must only contain valid backend / resource name characters. - name: staging # days specifies the days of the week when the group may be updated. @@ -643,7 +671,6 @@ spec: # not completed. # default: 4 alert_after_hours: 1-8 - # ... ``` @@ -651,20 +678,23 @@ Default resource: ```yaml kind: autoupdate_config spec: - agent_auto_update_mode: enable - agent_schedules: - regular: - - name: default - days: ["Mon", "Tue", "Wed", "Thu"] - start_hour: 0 - canary_count: 5 - max_in_flight: 20% - alert_after: 4h + tools: + mode: enabled + agents: + mode: enabled + schedules: + regular: + - name: default + days: ["Mon", "Tue", "Wed", "Thu"] + start_hour: 0 + canary_count: 5 + max_in_flight: 20% + alert_after: 4h ``` -#### Autoupdate agent plan +#### Autoupdate version -The `autoupdate_agent_plan` spec is owned by the Teleport cluster administrator. +The `autoupdate_version` spec is owned by the Teleport cluster administrator. In Teleport Cloud, this is the Cloud operations team. For self-hosted setups this is the user with access to the local admin socket (tctl on local machine). @@ -677,50 +707,72 @@ admin socket (tctl on local machine). > Solving this problem is out of the scope of this RFD. ```yaml -kind: autoupdate_agent_plan +kind: autoupdate_version +spec: + # existing fields + tools_version: vX + # new fields + agents: + # start_version is the desired version for agents before their window. + start_version: v1 + # target_version is the desired version for agents after their window. 
+ target_version: v2 + # schedule to use for the rollout + schedule: regular + # strategy to use for the rollout + # default: backpressure + strategy: grouped + # paused specifies whether the rollout is paused + # default: enabled + mode: enabled|disabled|suspended +``` + +#### Autoupdate agent rollout + +The `autoupdate_agent_rollout` resource is owned by Teleport. This resource can be read by users but not directly applied. +To create and reconcile this resource, the Auth service looks up bot `autoupdate_config` and `autoupdate_version` to know the desired mode, versions, and schedule. +Once the agent rollout is created, the auth uses its status to track the progress of the rollout through the different groups. + +```yaml +kind: autoupdate_agent_rollout spec: - # start_version is the desired version for agents before their window. - start_version: A.B.C - # target_version is the desired version for agents after their window. - target_version: X.Y.Z - # schedule to use for the rollout - schedule: regular|immediate - # strategy to use for the rollout - # default: backpressure - strategy: backpressure|grouped - # paused specifies whether the rollout is paused - # default: enabled - autoupdate_mode: enabled|disabled|paused + # content copied from the `autoupdate_version.spec.agents` + version_config: + start_version: v1 + target_version: v2 + schedule: regular + strategy: grouped + mode: enabled status: groups: # name of group - - name: staging - # start_time is the time the upgrade will start - start_time: 2020-12-09T16:09:53+00:00 - # initial_count is the number of connected agents at the start of the window - initial_count: 432 - # missing_count is the number of agents disconnected since the start of the rollout - present_count: 53 - # failed_count is the number of agents rolled-back since the start of the rollout - failed_count: 23 - # canaries is a list of agents used for canary deployments - canaries: # part of phase 5 - # updater_uuid is the updater UUID - - updater_uuid: abc123-... - # host_uuid is the agent host UUID - host_uuid: def534-... - # hostname of the agent - hostname: foo.example.com - # success status - success: false - # progress is the current progress through the rollout - progress: 0.532 - # state is the current state of the rollout (unstarted, active, done, rollback) - state: active - # last_update_time is the time of the previous update for the group - last_update_time: 2020-12-09T16:09:53+00:00 - # last_update_reason is the trigger for the last update - last_update_reason: rollback + - name: staging + # start_time is the time the upgrade will start + start_time: 2020-12-09T16:09:53+00:00 + # initial_count is the number of connected agents at the start of the window + initial_count: 432 + # missing_count is the number of agents disconnected since the start of the rollout + present_count: 53 + # failed_count is the number of agents rolled-back since the start of the rollout + failed_count: 23 + # canaries is a list of agents used for canary deployments + canaries: # part of phase 5 + # updater_uuid is the updater UUID + - updater_uuid: abc123-... + # host_uuid is the agent host UUID + host_uuid: def534-... 
+ # hostname of the agent + hostname: foo.example.com + # success status + success: false + # progress is the current progress through the rollout + progress: 0.532 + # state is the current state of the rollout (unstarted, active, done, rollback) + state: active + # last_update_time is the time of the previous update for the group + last_update_time: 2020-12-09T16:09:53+00:00 + # last_update_reason is the trigger for the last update + last_update_reason: rollback ``` #### Protobuf @@ -730,83 +782,42 @@ syntax = "proto3"; package teleport.autoupdate.v1; -option go_package = "github.com/gravitational/teleport/api/gen/proto/go/teleport/autoupdate/v1;autoupdatev1"; - -// AutoUpdateService serves agent and client automatic version updates. -service AutoUpdateService { - // GetAutoUpdateConfig updates the autoupdate config. - rpc GetAutoUpdateConfig(GetAutoUpdateConfigRequest) returns (AutoUpdateConfig); - // CreateAutoUpdateConfig creates the autoupdate config. - rpc CreateAutoUpdateConfig(CreateAutoUpdateConfigRequest) returns (AutoUpdateConfig); - // UpdateAutoUpdateConfig updates the autoupdate config. - rpc UpdateAutoUpdateConfig(UpdateAutoUpdateConfigRequest) returns (AutoUpdateConfig); - // UpsertAutoUpdateConfig overwrites the autoupdate config. - rpc UpsertAutoUpdateConfig(UpsertAutoUpdateConfigRequest) returns (AutoUpdateConfig); - // ResetAutoUpdateConfig restores the autoupdate config to default values. - rpc ResetAutoUpdateConfig(ResetAutoUpdateConfigRequest) returns (AutoUpdateConfig); - - // GetAutoUpdateAgentPlan returns the autoupdate plan for agents. - rpc GetAutoUpdateAgentPlan(GetAutoUpdateAgentPlanRequest) returns (AutoUpdateAgentPlan); - // CreateAutoUpdateAgentPlan creates the autoupdate plan for agents. - rpc CreateAutoUpdateAgentPlan(CreateAutoUpdateAgentPlanRequest) returns (AutoUpdateAgentPlan); - // UpdateAutoUpdateAgentPlan updates the autoupdate plan for agents. - rpc UpdateAutoUpdateAgentPlan(UpdateAutoUpdateAgentPlanRequest) returns (AutoUpdateAgentPlan); - // UpsertAutoUpdateAgentPlan overwrites the autoupdate plan for agents. - rpc UpsertAutoUpdateAgentPlan(UpsertAutoUpdateAgentPlanRequest) returns (AutoUpdateAgentPlan); - - // TriggerAgentGroup changes the state of an agent group from `unstarted` to `active` or `canary`. - rpc TriggerAgentGroup(TriggerAgentGroupRequest) returns (AutoUpdateAgentPlan); - // ForceAgentGroup changes the state of an agent group from `unstarted`, `canary`, or `active` to the `done` state. - rpc ForceAgentGroup(ForceAgentGroupRequest) returns (AutoUpdateAgentPlan); - // ResetAgentGroup resets the state of an agent group. - // For `canary`, this means new canaries are picked - // For `active`, this means the initial instance count is computed again. - rpc ResetAgentGroup(ResetAgentGroupRequest) returns (AutoUpdateAgentPlan); - // RollbackAgentGroup changes the state of an agent group to `rolledback`. - rpc RollbackAgentGroup(RollbackAgentGroupRequest) returns (AutoUpdateAgentPlan); -} - -// GetAutoUpdateConfigRequest requests the contents of the AutoUpdateConfig. -message GetAutoUpdateConfigRequest {} +import "teleport/header/v1/metadata.proto"; -// CreateAutoUpdateConfigRequest requests creation of the the AutoUpdateConfig. -message CreateAutoUpdateConfigRequest { - AutoUpdateConfig autoupdate_config = 1; -} +option go_package = "github.com/gravitational/teleport/api/gen/proto/go/teleport/autoupdate/v1;autoupdate"; -// UpdateAutoUpdateConfigRequest requests an update of the the AutoUpdateConfig. 
-message UpdateAutoUpdateConfigRequest { - AutoUpdateConfig autoupdate_config = 1; -} +// CONFIG -// UpsertAutoUpdateConfigRequest requests an upsert of the the AutoUpdateConfig. -message UpsertAutoUpdateConfigRequest { - AutoUpdateConfig autoupdate_config = 1; -} - -// ResetAutoUpdateConfigRequest requests a reset of the the AutoUpdateConfig to default values. -message ResetAutoUpdateConfigRequest {} - -// AutoUpdateConfig holds dynamic configuration settings for automatic updates. +// AutoUpdateConfig is a config singleton used to configure cluster +// autoupdate settings. message AutoUpdateConfig { - // kind is the kind of the resource. string kind = 1; - // sub_kind is the sub kind of the resource. string sub_kind = 2; - // version is the version of the resource. string version = 3; - // metadata is the metadata of the resource. teleport.header.v1.Metadata metadata = 4; - // spec is the spec of the resource. - AutoUpdateConfigSpec spec = 7; + + AutoUpdateConfigSpec spec = 5; } -// AutoUpdateConfigSpec is the spec for the autoupdate config. +// AutoUpdateConfigSpec encodes the parameters of the autoupdate config object. message AutoUpdateConfigSpec { - // agent_auto_update_mode specifies whether agent autoupdates are enabled, disabled, or paused. + reserved 1; + AutoUpdateConfigSpecTools tools = 2; + AutoUpdateConfigSpecAgents agents = 3; +} + +// AutoUpdateConfigSpecTools encodes the parameters of automatic tools update. +message AutoUpdateConfigSpecTools { + // Mode encodes the feature flag to enable/disable tools autoupdates. + Mode mode = 1; +} + +// AutoUpdateConfigSpecTools encodes the parameters of automatic tools update. +message AutoUpdateConfigSpecAgents { + // mode specifies whether agent autoupdates are enabled, disabled, or paused. Mode agent_auto_update_mode = 1; // agent_schedules specifies schedules for updates of grouped agents. - AgentAutoUpdateSchedules agent_schedules = 3; + AgentAutoUpdateSchedules agent_schedules = 2; } // AgentAutoUpdateSchedules specifies update scheduled for grouped agents. @@ -858,45 +869,45 @@ enum Mode { MODE_PAUSE = 3; } -// GetAutoUpdateAgentPlanRequest requests the autoupdate_agent_plan singleton resource. -message GetAutoUpdateAgentPlanRequest {} - -// GetAutoUpdateAgentPlanRequest requests creation of the autoupdate_agent_plan singleton resource. -message CreateAutoUpdateAgentPlanRequest { - // autoupdate_agent_plan resource contents - AutoUpdateAgentPlan autoupdate_agent_plan = 1; -} - -// GetAutoUpdateAgentPlanRequest requests an update of the autoupdate_agent_plan singleton resource. -message UpdateAutoUpdateAgentPlanRequest { - // autoupdate_agent_plan resource contents - AutoUpdateAgentPlan autoupdate_agent_plan = 1; +// Schedule type for the rollout +enum Schedule { + // UNSPECIFIED update schedule + SCHEDULE_UNSPECIFIED = 0; + // REGULAR update schedule + SCHEDULE_REGULAR = 1; + // IMMEDIATE update schedule for updating all agents immediately + SCHEDULE_IMMEDIATE = 2; } -// GetAutoUpdateAgentPlanRequest requests an upsert of the autoupdate_agent_plan singleton resource. -message UpsertAutoUpdateAgentPlanRequest { - // autoupdate_agent_plan resource contents - AutoUpdateAgentPlan autoupdate_agent_plan = 1; -} +// VERSION -// AutoUpdateAgentPlan holds dynamic configuration settings for agent autoupdates. -message AutoUpdateAgentPlan { - // kind is the kind of the resource. +// AutoUpdateVersion is a resource singleton with version required for +// tools autoupdate. 
+message AutoUpdateVersion { string kind = 1; - // sub_kind is the sub kind of the resource. string sub_kind = 2; - // version is the version of the resource. string version = 3; - // metadata is the metadata of the resource. teleport.header.v1.Metadata metadata = 4; - // spec is the spec of the resource. - AutoUpdateAgentPlanSpec spec = 5; - // status is the status of the resource. - AutoUpdateAgentPlanStatus status = 6; + + AutoUpdateVersionSpec spec = 5; +} + +// AutoUpdateVersionSpec encodes the parameters of the autoupdate versions. +message AutoUpdateVersionSpec { + // ToolsVersion is the semantic version required for tools autoupdates. + reserved 1; + AutoUpdateVersionSpecTools tools = 2; + AutoUpdateVersionSpecAgents agents = 3; } -// AutoUpdateAgentPlanSpec is the spec for the autoupdate version. -message AutoUpdateAgentPlanSpec { +// AutoUpdateVersionSpecTools is the spec for the autoupdate version. +message AutoUpdateVersionSpecTools { + // target_version is the target tools version. + string target_version = 1; +} + +// AutoUpdateVersionSpecAgents is the spec for the autoupdate version. +message AutoUpdateVersionSpecAgents { // start_version is the version to update from. string start_version = 1; // target_version is the version to update to. @@ -909,28 +920,26 @@ message AutoUpdateAgentPlanSpec { Mode autoupdate_mode = 5; } -// Schedule type for the rollout -enum Schedule { - // UNSPECIFIED update schedule - SCHEDULE_UNSPECIFIED = 0; - // REGULAR update schedule - SCHEDULE_REGULAR = 1; - // IMMEDIATE update schedule for updating all agents immediately - SCHEDULE_IMMEDIATE = 2; +// AGENT ROLLOUT + +message AutoUpdateAgentRollout { + string kind = 1; + string sub_kind = 2; + string version = 3; + teleport.header.v1.Metadata metadata = 4; + AutoUpdateAgentRolloutSpec spec = 5; + AutoUpdateAgentRolloutStatus status = 6; } -// Strategy type for the rollout -enum Strategy { - // UNSPECIFIED update strategy - STRATEGY_UNSPECIFIED = 0; - // GROUPED update schedule, with no backpressure - STRATEGY_GROUPED = 1; - // BACKPRESSURE update schedule - STRATEGY_BACKPRESSURE = 2; +message AutoUpdateAgentRolloutSpec { + AutoUpdateVersionSpecAgents version = 1; } -// AutoUpdateAgentPlanStatus is the status for the AutoUpdateAgentPlan. -message AutoUpdateAgentPlanStatus { +message AutoUpdateAgentRolloutStatus { + repeated AutoUpdateAgentRolloutStatusGroup groups = 1; +} + +message AutoUpdateAgentRolloutStatusGroup { // name of the group string name = 1; // start_time of the rollout @@ -981,6 +990,128 @@ enum State { STATE_ROLLEDBACK = 5; } +// AutoUpdateService provides an API to manage autoupdates. +service AutoUpdateService { + // GetAutoUpdateConfig gets the current autoupdate config singleton. + rpc GetAutoUpdateConfig(GetAutoUpdateConfigRequest) returns (AutoUpdateConfig); + + // CreateAutoUpdateConfig creates a new AutoUpdateConfig. + rpc CreateAutoUpdateConfig(CreateAutoUpdateConfigRequest) returns (AutoUpdateConfig); + + // CreateAutoUpdateConfig updates AutoUpdateConfig singleton. + rpc UpdateAutoUpdateConfig(UpdateAutoUpdateConfigRequest) returns (AutoUpdateConfig); + + // UpsertAutoUpdateConfig creates a new AutoUpdateConfig or replaces an existing AutoUpdateConfig. + rpc UpsertAutoUpdateConfig(UpsertAutoUpdateConfigRequest) returns (AutoUpdateConfig); + + // DeleteAutoUpdateConfig hard deletes the specified AutoUpdateConfig. 
+ rpc DeleteAutoUpdateConfig(DeleteAutoUpdateConfigRequest) returns (google.protobuf.Empty); + + // GetAutoUpdateVersion gets the current autoupdate version singleton. + rpc GetAutoUpdateVersion(GetAutoUpdateVersionRequest) returns (AutoUpdateVersion); + + // CreateAutoUpdateVersion creates a new AutoUpdateVersion. + rpc CreateAutoUpdateVersion(CreateAutoUpdateVersionRequest) returns (AutoUpdateVersion); + + // UpdateAutoUpdateVersion updates AutoUpdateVersion singleton. + rpc UpdateAutoUpdateVersion(UpdateAutoUpdateVersionRequest) returns (AutoUpdateVersion); + + // UpsertAutoUpdateVersion creates a new AutoUpdateVersion or replaces an existing AutoUpdateVersion. + rpc UpsertAutoUpdateVersion(UpsertAutoUpdateVersionRequest) returns (AutoUpdateVersion); + + // DeleteAutoUpdateVersion hard deletes the specified AutoUpdateVersionRequest. + rpc DeleteAutoUpdateVersion(DeleteAutoUpdateVersionRequest) returns (google.protobuf.Empty); + + // GetAutoUpdateAgentRollout gets the current autoupdate version singleton. + rpc GetAutoUpdateAgentRollout(GetAutoUpdateAgentRolloutRequest) returns (AutoUpdateAgentRollout); + + // CreateAutoUpdateAgentRollout creates a new AutoUpdateAgentRollout. + rpc CreateAutoUpdateAgentRollout(CreateAutoUpdateAgentRolloutRequest) returns (AutoUpdateAgentRollout); + + // UpdateAutoUpdateAgentRollout updates AutoUpdateAgentRollout singleton. + rpc UpdateAutoUpdateAgentRollout(UpdateAutoUpdateAgentRolloutRequest) returns (AutoUpdateAgentRollout); + + // UpsertAutoUpdateAgentRollout creates a new AutoUpdateAgentRollout or replaces an existing AutoUpdateAgentRollout. + rpc UpsertAutoUpdateAgentRollout(UpsertAutoUpdateAgentRolloutRequest) returns (AutoUpdateAgentRollout); + + // DeleteAutoUpdateAgentRollout hard deletes the specified AutoUpdateAgentRolloutRequest. + rpc DeleteAutoUpdateAgentRollout(DeleteAutoUpdateAgentRolloutRequest) returns (google.protobuf.Empty); + + // TriggerAgentGroup changes the state of an agent group from `unstarted` to `active` or `canary`. + rpc TriggerAgentGroup(TriggerAgentGroupRequest) returns (AutoUpdateAgentRollout); + // ForceAgentGroup changes the state of an agent group from `unstarted`, `canary`, or `active` to the `done` state. + rpc ForceAgentGroup(ForceAgentGroupRequest) returns (AutoUpdateAgentRollout); + // ResetAgentGroup resets the state of an agent group. + // For `canary`, this means new canaries are picked + // For `active`, this means the initial instance count is computed again. + rpc ResetAgentGroup(ResetAgentGroupRequest) returns (AutoUpdateAgentRollout); + // RollbackAgentGroup changes the state of an agent group to `rolledback`. + rpc RollbackAgentGroup(RollbackAgentGroupRequest) returns (AutoUpdateAgentRollout); +} + +// Request for GetAutoUpdateConfig. +message GetAutoUpdateConfigRequest {} + +// Request for CreateAutoUpdateConfig. +message CreateAutoUpdateConfigRequest { + AutoUpdateConfig config = 1; +} + +// Request for UpdateAutoUpdateConfig. +message UpdateAutoUpdateConfigRequest { + AutoUpdateConfig config = 1; +} + +// Request for UpsertAutoUpdateConfig. +message UpsertAutoUpdateConfigRequest { + AutoUpdateConfig config = 1; +} + +// Request for DeleteAutoUpdateConfig. +message DeleteAutoUpdateConfigRequest {} + +// Request for GetAutoUpdateVersion. +message GetAutoUpdateVersionRequest {} + +// Request for CreateAutoUpdateVersion. +message CreateAutoUpdateVersionRequest { + AutoUpdateVersion version = 1; +} + +// Request for UpdateAutoUpdateConfig. 
+message UpdateAutoUpdateVersionRequest { + AutoUpdateVersion version = 1; +} + +// Request for UpsertAutoUpdateVersion. +message UpsertAutoUpdateVersionRequest { + AutoUpdateVersion version = 1; +} + +// Request for DeleteAutoUpdateVersion. +message DeleteAutoUpdateVersionRequest {} + +// Request for GetAutoUpdateAgentRollout. +message GetAutoUpdateAgentRolloutRequest {} + +// Request for CreateAutoUpdateAgentRollout. +message CreateAutoUpdateAgentRolloutRequest { + AutoUpdateAgentRollout plan = 1; +} + +// Request for UpdateAutoUpdateConfig. +message UpdateAutoUpdateAgentRolloutRequest { + AutoUpdateAgentRollout plan = 1; +} + +// Request for UpsertAutoUpdateAgentRollout. +message UpsertAutoUpdateAgentRolloutRequest { + AutoUpdateAgentRollout plan = 1; +} + +// Request for DeleteAutoUpdateAgentRollout. +message DeleteAutoUpdateAgentRolloutRequest {} + message TriggerAgentGroupRequest { // group is the agent update group name whose maintenance should be triggered. string group = 1; @@ -1013,7 +1144,7 @@ allowing the next group to proceed. By default, only 5 agent groups are allowed. #### Agent update mode -The agent auto update mode is specified by both Cloud (via `autoupdate_agent_plan`) +The agent auto update mode is specified by both Cloud (via `autoupdate_version`) and by the customer (via `autoupdate_config`). The agent update mode controls whether the cluster in enrolled into automatic agent updates. @@ -1142,11 +1273,11 @@ tctl autoupdate agent-plan suspend/resume ### Editing the plan -The updater will receive `agent_auto_update: true` from the time is it designated for update until the `target_version` in `autoupdate_agent_plan` (below) changes. +The updater will receive `agent_auto_update: true` from the time is it designated for update until the `target_version` in `autoupdate_version` (below) changes. Changing the `target_version` resets the schedule immediately, clearing all progress. [TODO: What is the use-case for this? can we do like with target_version and reset all instead of trying to merge the state] -Changing the `start_version` in `autoupdate_agent_plan` changes the advertised `start_version` for all unfinished groups. +Changing the `start_version` in `autoupdate_version` changes the advertised `start_version` for all unfinished groups. Changing `agent_schedules` will preserve the `state` of groups that have the same name before and after the change. However, any changes to `agent_schedules` that occur while a group is active will be rejected. @@ -1161,20 +1292,20 @@ If a `default` group is not present, the last group is treated as the default. #### Update requests Teleport proxies will be updated to serve the desired agent version and edition from `/v1/webapi/find`. -The version and edition served from that endpoint will be configured using new `autoupdate_agent_plan` resource. +The version served from that endpoint will be configured using new `autoupdate_version` resource. Whether the Teleport updater querying the endpoint is instructed to upgrade (via the `agent_auto_update` field) is dependent on: - The `host=[uuid]` parameter sent to `/v1/webapi/find` - The `group=[name]` parameter sent to `/v1/webapi/find` -- The group state from the `autoupdate_agent_plan` status +- The group state from the `autoupdate_agent_rollout` status (this also contains the version from `autoupdate_version`) To ensure that the updater is always able to retrieve the desired version, instructions to the updater are delivered via unauthenticated requests to `/v1/webapi/find`. 
Teleport proxies modulate the `/v1/webapi/find` response given the host UUID and group name. When the updater queries the proxy via `/v1/webapi/find?host=[uuid]&group=[name]`, the proxies query the -`autoupdate_agent_plan` status to determine the value of `agent_auto_update: true`. +`autoupdate_agent_rollout` to determine the value of `agent_auto_update: true`. The boolean is returned as `true` in the case that the provided `host` contains a UUID that is under the progress percentage for the `group`: `as_numeral(host_uuid) / as_numeral(max_uuid) < progress` @@ -1269,10 +1400,10 @@ Every minute, auth servers persist the version counts: Expiration time of the persisted key is 1 hour. -To progress the rollout, auth servers will range-read keys from `/autoupdate/[group]/*`, sum the counts, and write back to the `autoupdate_agent_plan` status on a one-minute interval. -- To calculate the initial number of agents connected at the start of the window, each auth server will write the summed count of agents to `autoupdate_agent_plan` status, if not already written. -- To calculate the canaries, each auth server will write a random selection of all canaries to `autoupdate_agent_plan` status, if not already written. -- To determine the progress through the rollout, auth servers will write the calculated progress to the `autoupdate_agent_plan` status using the formulas, declining to write if the current written progress is further ahead. +To progress the rollout, auth servers will range-read keys from `/autoupdate/[group]/*`, sum the counts, and write back to the `autoupdate_agent_rollout` status on a one-minute interval. +- To calculate the initial number of agents connected at the start of the window, each auth server will write the summed count of agents to `autoupdate_agent_rollout` status, if not already written. +- To calculate the canaries, each auth server will write a random selection of all canaries to `autoupdate_agent_rollout` status, if not already written. +- To determine the progress through the rollout, auth servers will write the calculated progress to the `autoupdate_agent_rollout` status using the formulas, declining to write if the current written progress is further ahead. If `/autoupdate/[group]/[auth ID]` is older than 1 minute, we do not consider its contents. This prevents double-counting agents when auth servers are killed. @@ -1286,7 +1417,7 @@ initial_count[group] = sum(agent_data[group].stats[*]).count Each auth server will calculate the progress as `( max_in_flight * initial_count[group] + agent_data[group].stats[target_version].count ) / initial_count[group]` and -write the progress to `autoupdate_agent_plan` status. This formula determines the progress percentage by adding a +write the progress to `autoupdate_agent_rollout` status. This formula determines the progress percentage by adding a `max_in_flight` percentage-window above the number of currently updated agents in the group. 
However, if `as_numeral(agent_data[group].stats[not(target_version)].lowest_uuid) / as_numeral(max_uuid)` is above the From 408b33d21c29a56193321f28fd4cf3002dd86fc0 Mon Sep 17 00:00:00 2001 From: hugoShaka Date: Thu, 10 Oct 2024 17:26:08 -0400 Subject: [PATCH 102/105] align tool version --- rfd/0184-agent-auto-updates.md | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/rfd/0184-agent-auto-updates.md b/rfd/0184-agent-auto-updates.md index ddb41bffa4d12..ffa01c084503c 100644 --- a/rfd/0184-agent-auto-updates.md +++ b/rfd/0184-agent-auto-updates.md @@ -110,9 +110,8 @@ We will introduce two user-facing resources: ```yaml kind: autoupdate_version spec: - # existing fields - tools_version: vX - # new fields + tools: + target_version: vX agents: start_version: v1 target_version: v2 @@ -632,7 +631,7 @@ This is how Teleport customers can specify their automatic update preferences. ```yaml kind: autoupdate_config spec: - # existing field + # existing field, deprecated tools_autoupdate: true tools: mode: enabled/disabled/suspended @@ -709,9 +708,8 @@ admin socket (tctl on local machine). ```yaml kind: autoupdate_version spec: - # existing fields - tools_version: vX - # new fields + tools: + target_version: vX agents: # start_version is the desired version for agents before their window. start_version: v1 From ed130a6ddff74c37aff25942286915923d232aca Mon Sep 17 00:00:00 2001 From: Stephen Levine Date: Tue, 15 Oct 2024 13:23:11 -0400 Subject: [PATCH 103/105] Move package system dir --- rfd/0184-agent-auto-updates.md | 16 +++------------- 1 file changed, 3 insertions(+), 13 deletions(-) diff --git a/rfd/0184-agent-auto-updates.md b/rfd/0184-agent-auto-updates.md index ffa01c084503c..ad38c9b658908 100644 --- a/rfd/0184-agent-auto-updates.md +++ b/rfd/0184-agent-auto-updates.md @@ -1514,16 +1514,6 @@ $ tree /var/lib/teleport │ └── etc │ └── systemd │ └── teleport.service - ├── system # if installed via OS package - │ ├── bin - │ │ ├── tsh - │ │ ├── tbot - │ │ ├── ... # other binaries - │ │ ├── teleport-update - │ │ └── teleport - │ └── etc - │ └── systemd - │ └── teleport.service └── update.yaml $ ls -l /usr/local/bin/tsh @@ -1705,11 +1695,11 @@ To ensure that SELinux permissions do not prevent the `teleport-update` binary f To ensure that backups are consistent, the updater will use the [SQLite backup API](https://www.sqlite.org/backup.html) to perform the backup. -The `teleport` apt and yum packages contain a system installation of Teleport in `/var/lib/teleport/versions/system`. +The `teleport` apt and yum packages will contain a system installation of Teleport in `/usr/local/teleport-system/`. Post package installation, the `link` subcommand is executed automatically to link the system installation when no auto-updater-managed version of Teleport is linked: ``` -/usr/local/bin/teleport -> /var/lib/teleport/versions/system/bin/teleport -/usr/local/bin/teleport-update -> /var/lib/teleport/versions/system/bin/teleport-update +/usr/local/bin/teleport -> /usr/local/teleport-system/bin/teleport +/usr/local/bin/teleport-update -> /usr/local/teleport-system/bin/teleport-update ... 
```

From f3cb9010fd50d870b4bd02a60a8a718a5fd3a82b Mon Sep 17 00:00:00 2001
From: hugoShaka
Date: Tue, 15 Oct 2024 14:22:16 -0400
Subject: [PATCH 104/105] add time-based strategy

---
 rfd/0184-agent-auto-updates.md | 221 ++++++++++++++++++++++-----------
 1 file changed, 147 insertions(+), 74 deletions(-)

diff --git a/rfd/0184-agent-auto-updates.md b/rfd/0184-agent-auto-updates.md
index ad38c9b658908..252b701f74516 100644
--- a/rfd/0184-agent-auto-updates.md
+++ b/rfd/0184-agent-auto-updates.md
@@ -86,7 +86,7 @@ We will introduce two user-facing resources:
   tools_autoupdate: true/false
   # new fields
   tools:
-    mode: enabled/disabled/suspended
+    mode: enabled/disabled
   agents:
     mode: enabled/disabled/suspended
     schedules:
@@ -116,7 +116,7 @@ We will introduce two user-facing resources:
     start_version: v1
     target_version: v2
     schedule: regular
-    strategy: grouped
+    strategy: previous-must-succeed
     mode: enabled
 ```
 
@@ -124,41 +124,32 @@ We will also introduce an internal resource, tracking the agent rollout status.
 owned by Teleport. Users and cluster operators can read its content but cannot create/update/upsert/delete it.
 This resource is editable via select RPCs (e.g. start or rollback a group).
 
-```yaml
-kind: autoupdate_agent_rollout
-spec:
-  # content copied from the `autoupdate_version.spec.agents`
-  version_config:
-    start_version: v1
-    target_version: v2
-    schedule: regular
-    strategy: grouped
-    mode: enabled
-status:
-  groups:
-    - name: dev
-      start_time: 2020-12-09T16:09:53+00:00
-      initial_count: 100 # part of phase 4
-      present_count: 100 # part of phase 4
-      failed_count: 0 # part of phase 4
-      progress: 0
-      state: canaries
-      canaries: # part of phase 5
-        - updater_uuid: abc
-          host_uuid: def
-          hostname: foo.example.com
-          success: false
-      last_update_time: 2020-12-10T16:09:53+00:00
-      last_update_reason: canaryTesting
-    - name: prod
-      start_time: 0000-00-00
-      initial_count: 0
-      present_count: 0
-      failed_count: 0
-      progress: 0
-      state: unstarted
-      last_update_time: 2020-12-10T16:09:53+00:00
-      last_update_reason: newAgentPlan
+The system will look like:
+
+```mermaid
+flowchart TD
+    user(fa:fa-user User)
+    operator(fa:fa-user Operator)
+    auth[Auth Service]
+    proxy[Proxy Service]
+    updater[teleport-updater]
+    agent[Teleport Agent]
+
+    autoupdate_config@{shape: notch-rect}
+    autoupdate_version@{shape: notch-rect}
+    autoupdate_rollout@{shape: notch-rect}
+    updater_status@{shape: notch-rect, label: "updater.yaml"}
+
+    user -->|defines update schedule| autoupdate_config
+    operator -->|chooses target version| autoupdate_version
+    autoupdate_config --> auth
+    autoupdate_version --> auth
+    auth -->|Describes desired state for each agent group| autoupdate_rollout
+    autoupdate_rollout --> proxy
+    proxy -->|Serves update instructions<br/>via /find| updater
+    updater -->|Writes status| updater_status
+    updater_status --> agent
+    agent -->|Reports version and status via HelloMessage and InstanceHeartbeat| auth
 ```
 
 You can find more details about each resource field [in the dedicated resource section](#teleport-resources).
@@ -300,7 +291,7 @@ spec:
     start_version: v1
     target_version: v2
     schedule: regular
-    strategy: grouped
+    strategy: previous-must-succeed
     mode: enabled
 status:
   groups:
@@ -634,12 +625,19 @@ spec:
   # existing field, deprecated
   tools_autoupdate: true
   tools:
-    mode: enabled/disabled/suspended
+    mode: enabled/disabled
   agents:
     # agent_auto_update allows turning agent updates on or off at the
     # cluster level. Only turn agent automatic updates off if self-managed
     # agent updates are in place. Setting this to `suspended` will temporarily halt the rollout.
     mode: enabled/disabled/suspended
+    # strategy to use for the rollout
+    # Supported values are:
+    # - time-based
+    # - previous-must-succeed
+    # - previous-must-succeed-with-backpressure
+    # defaults to previous-must-succeed, might default to previous-must-succeed-with-backpressure after phase 6.
+    strategy: previous-must-succeed
     # agent_schedules specifies version rollout schedules for agents.
     # The schedule used is determined by the schedule associated
     # with the version in the autoupdate_version resource.
@@ -656,6 +654,7 @@ spec:
         # default: 0
         start_hour: 0-23
         # wait_days specifies how many days to wait after the previous group finished before starting.
+        # This must be 0 when using the `time-based` strategy.
         # default: 0
         wait_days: 0-1
         # canary_count specifies the desired number of canaries to update before any other agents
@@ -681,6 +680,8 @@ spec:
     mode: enabled
   agents:
     mode: enabled
+    strategy: previous-must-succeed
+    alert_after: 4h
     schedules:
      regular:
      - name: default
@@ -688,7 +689,6 @@ spec:
         start_hour: 0
         canary_count: 5
         max_in_flight: 20%
-        alert_after: 4h
 ```
 
 #### Autoupdate version
@@ -717,9 +717,6 @@ spec:
     target_version: v2
     # schedule to use for the rollout
     schedule: regular
-    # strategy to use for the rollout
-    # default: backpressure
-    strategy: grouped
     # paused specifies whether the rollout is paused
     # default: enabled
     mode: enabled|disabled|suspended
@@ -739,7 +736,7 @@ spec:
     start_version: v1
     target_version: v2
     schedule: regular
-    strategy: grouped
+    strategy: previous-must-succeed
     mode: enabled
 status:
   groups:
@@ -781,6 +778,8 @@ syntax = "proto3";
 package teleport.autoupdate.v1;
 
 import "teleport/header/v1/metadata.proto";
+import "google/protobuf/empty.proto";
+import "google/protobuf/timestamp.proto";
 
 option go_package = "github.com/gravitational/teleport/api/gen/proto/go/teleport/autoupdate/v1;autoupdate";
 
@@ -814,8 +813,25 @@ message AutoUpdateConfigSpecTools {
 message AutoUpdateConfigSpecAgents {
   // mode specifies whether agent autoupdates are enabled, disabled, or paused.
   Mode agent_auto_update_mode = 1;
+  // strategy to use for updating the agents.
+  Strategy strategy = 2;
+  // maintenance_window_minutes is the maintenance window duration in minutes. This can only be set if `strategy` is "time-based".
+  int64 maintenance_window_minutes = 3;
+  // alert_after_hours specifies the number of hours to wait before alerting that the rollout is not complete.
+  // This can only be set if `strategy` is "previous-must-succeed".
+  int64 alert_after_hours = 5;
   // agent_schedules specifies schedules for updates of grouped agents.
-  AgentAutoUpdateSchedules agent_schedules = 2;
+  AgentAutoUpdateSchedules agent_schedules = 6;
+}
+
+// Strategy type for the rollout
+enum Strategy {
+  // UNSPECIFIED update strategy
+  STRATEGY_UNSPECIFIED = 0;
+  // PREVIOUS_MUST_SUCCEED update strategy with no backpressure
+  STRATEGY_PREVIOUS_MUST_SUCCEED = 1;
+  // TIME_BASED update strategy.
+  STRATEGY_TIME_BASED = 2;
 }
 
 // AgentAutoUpdateSchedules specifies update schedules for grouped agents.
@@ -832,14 +848,12 @@ message AgentAutoUpdateGroup {
   repeated Day days = 2;
   // start_hour to initiate update
   int32 start_hour = 3;
-  // wait_days after last group succeeds before this group can run
+  // wait_days after last group succeeds before this group can run. This can only be used when the strategy is "previous-must-finish".
   int64 wait_days = 4;
-  // alert_after_hours specifies the number of hours to wait before alerting that the rollout is not complete.
-  int64 alert_after_hours = 5;
   // canary_count of agents to use in the canary deployment.
-  int64 canary_count = 6;
+  int64 canary_count = 5;
   // max_in_flight specifies agents that can be updated at the same time, by percent.
-  string max_in_flight = 7;
+  string max_in_flight = 6;
 }
 
 // Day of the week
@@ -912,10 +926,8 @@ message AutoUpdateVersionSpecAgents {
   string target_version = 2;
   // schedule to use for the rollout
   Schedule schedule = 3;
-  // strategy to use for the rollout
-  Strategy strategy = 4;
   // autoupdate_mode to use for the rollout
-  Mode autoupdate_mode = 5;
+  Mode autoupdate_mode = 4;
 }
 
 // AGENT ROLLOUT
 
@@ -930,7 +942,16 @@ message AutoUpdateAgentRollout {
 }
 
 message AutoUpdateAgentRolloutSpec {
-  AutoUpdateVersionSpecAgents version = 1;
+  // start_version is the version to update from.
+  string start_version = 1;
+  // target_version is the version to update to.
+  string target_version = 2;
+  // schedule to use for the rollout
+  Schedule schedule = 3;
+  // autoupdate_mode to use for the rollout
+  Mode autoupdate_mode = 4;
+  // strategy to use for updating the agents.
+  Strategy strategy = 5;
 }
 
 message AutoUpdateAgentRolloutStatus {
@@ -1137,8 +1158,24 @@ message RollbackAgentGroupRequest {
 
 ### Backend logic to progress the rollout
 
-The update proceeds from the first group to the last group, ensuring that each group successfully updates before
-allowing the next group to proceed. By default, only 5 agent groups are allowed. This mitigates very long rollout plans.
+#### Rollout strategies
+
+We support two rollout strategies for two distinct use cases:
+
+- `previous-must-succeed` to limit the damage of a faulty update
+- `time-based` for time-constrained maintenances
+
+In `previous-must-succeed`, the update proceeds from the first group to the last group, ensuring that each group
+successfully updates before allowing the next group to proceed. By default, only 5 agent groups are allowed. This
+mitigates very long rollout plans. This is the strategy that offers the best availability. A group finishes its update
+once most of its agents are running the correct version. Agents that missed the group update will try to catch
+up as soon as possible.
+
+In `time-based` maintenances, agents update as soon as their maintenance window starts. There is no dependency
+between groups. This strategy allows Teleport users to set up reliable follow-the-sun updates and enforce the
+maintenance window more strictly. A group finishes its update at the end of the maintenance window, regardless
+of the new version adoption rate. Agents that missed the maintenance window will not attempt to
+update until the next maintenance window.
 
 #### Agent update mode
 
 The agent auto update mode is specified by both Cloud (via `autoupdate_version`)
@@ -1175,7 +1212,7 @@ A group can be in 5 states:
 - `done`: the group has been updated. New agents should run `v2`.
 - `rolledback`: the group has been rolled back. New agents should run `v1`, existing agents should update to `v1`.
 
-The finite state machine is the following:
+The finite state machine for the `previous-must-succeed` strategy is the following:
 
 ```mermaid
 flowchart TD
@@ -1186,7 +1223,7 @@ flowchart TD
     rolledback((rolledback))
 
     unstarted -->|TriggerGroupRPC<br/>Start conditions are met| canary
-    canary -->|Canary came back alive| active
+    canary -->|Canaries came back alive| active
     canary -->|ForceGroupRPC| done
     canary -->|RollbackGroupRPC| rolledback
     active -->|ForceGroupRPC<br/>Success criteria met| done
     done -->|RollbackGroupRPC| rolledback
     active -->|RollbackGroupRPC| rolledback
 
     active -->|ResetGroupRPC| active
 ```
 
+The finite state machine for the `time-based` strategy is the following:
+```mermaid
+flowchart TD
+    unstarted((unstarted))
+    canary((canary))
+    active((active))
+    done((done))
+    rolledback((rolledback))
+
+    unstarted -->|TriggerGroupRPC<br/>Start conditions are met| canary
+    canary -->|Canaries came back alive and window is still active| active
+    canary -->|ForceGroupRPC<br/>Canaries came back alive and window is over| done
+    canary -->|RollbackGroupRPC| rolledback
+    active -->|ForceGroupRPC<br/>End of window| done
+    done -->|Beginning of window| active
+    done -->|RollbackGroupRPC| rolledback
+    active -->|RollbackGroupRPC| rolledback
+
+    canary -->|ResetGroupRPC| canary
+```
+
+
+> [!NOTE]
+> Once we have a proper feedback mechanism (phase 5) we might introduce a new `unfinished` state, similar to `done`, but
+> which indicates that not all agents were updated when using the `time-based` strategy. This does not change the update
+> logic but might be clearer for the end user.
+
 #### Starting a group
 
 A group can be started if the following criteria are met:
-- all of its previous group are in the `done` state
-- it has been at least `wait_days` since the previous group update started
-- the current week day is in the `days` list
-- the current hour equals the `hour` field
+- for the `previous-must-succeed` strategy:
+  - all of its previous groups are in the `done` state
+  - it has been at least `wait_days` since the previous group update started
+  - the current week day is in the `days` list
+  - the current hour equals the `hour` field
+- for the `time-based` strategy:
+  - the current week day is in the `days` list
+  - the current hour equals the `hour` field
 
 When all those criteria are met, the auth will transition the group into a new state.
 If `canary_count` is not null, the group transitions to the `canary` state.
@@ -1237,11 +1305,16 @@ An alert will eventually fire, warning the user about the stuck update.
 A group in `active` mode is currently being updated. The conditions to leave `active` mode and transition to the `done`
 mode will vary based on the phase and rollout strategy.
 
-- Phase 3: we don't have any information about agents. The group transitions to `done` 60 minutes after its start.
-- Phase 4: we know about the connected agent count and the connected agent versions. The group transitions to `done` if:
-  - at least `(100 - max_in_flight)%` of the agents are still connected
-  - at least `(100 - max_in_flight)%` of the agents are running the new version
-- Phase 6: we incrementally update the progress, this adds a new criteria: the group progress is at 100%
+- for the `previous-must-succeed` strategy:
+  - Phase 3: we don't have any information about agents. The group transitions to `done` 60 minutes after its start.
+  - Phase 4: we know about the connected agent count and the connected agent versions. The group transitions to `done` if:
+    - at least `(100 - max_in_flight)%` of the agents are still connected
+    - at least `(100 - max_in_flight)%` of the agents are running the new version
+  - Phase 6: we incrementally update the progress, this adds a new criterion: the group progress is at 100%
+- for the `time-based` strategy:
+  - the group transitions to the `done` state `maintenance_window_minutes` minutes after the `active` transition.
+    The rollout's `start_time` must be used to do this transition, not the schedule's `start_hour`.
+    This will allow the user to trigger manual out-of-maintenance updates if needed.
 
 The phase 6 backpressure calculations are covered in the Backpressure Calculations section below.
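+
+As an illustration of the `time-based` done-transition described above, here is a minimal Go sketch (function and
+parameter names are assumptions for exposition, not the final implementation):
+
+```go
+package main
+
+import (
+	"fmt"
+	"time"
+)
+
+// windowElapsed reports whether a time-based group that became active at
+// startTime should transition to done: the group stays active for exactly
+// maintenanceWindowMinutes from the recorded rollout start_time, not from
+// the schedule's start_hour, so a manually triggered out-of-maintenance
+// update still gets a full window.
+func windowElapsed(startTime time.Time, maintenanceWindowMinutes int64, now time.Time) bool {
+	deadline := startTime.Add(time.Duration(maintenanceWindowMinutes) * time.Minute)
+	return !now.Before(deadline)
+}
+
+func main() {
+	start := time.Date(2024, 10, 15, 14, 0, 0, 0, time.UTC)
+	fmt.Println(windowElapsed(start, 60, start.Add(30*time.Minute))) // false: still active
+	fmt.Println(windowElapsed(start, 60, start.Add(90*time.Minute))) // true: transition to done
+}
+```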
@@ -1353,13 +1426,13 @@ Let `v1` be the previous version and `v2` the target version, the response matri
 
 ##### Rollout status: enabled
 
-| Group state | Version | Should update              |
-|-------------|---------|----------------------------|
-| unstarted   | v1      | false                      |
-| canary      | v1      | false, except for canaries |
-| active      | v2      | true if UUID <= progress   |
-| done        | v2      | true                       |
-| rolledback  | v1      | true                       |
+| Group state | Version | Should update                                           |
+|-------------|---------|---------------------------------------------------------|
+| unstarted   | v1      | false                                                   |
+| canary      | v1      | false, except for canaries                              |
+| active      | v2      | true if UUID <= progress                                |
+| done        | v2      | true if `previous-must-succeed`, false if `time-based`  |
+| rolledback  | v1      | true                                                    |
 
 #### Updater status reporting

From 066b45c0da8a45dce5192c41d8870c526be8a90a Mon Sep 17 00:00:00 2001
From: hugoShaka
Date: Tue, 15 Oct 2024 17:10:03 -0400
Subject: [PATCH 105/105] rename previous-must-succeed -> halt-on-failure

---
 rfd/0184-agent-auto-updates.md | 49 ++++++++++++++++++----------------
 1 file changed, 26 insertions(+), 23 deletions(-)

diff --git a/rfd/0184-agent-auto-updates.md b/rfd/0184-agent-auto-updates.md
index 252b701f74516..ae843482bb225 100644
--- a/rfd/0184-agent-auto-updates.md
+++ b/rfd/0184-agent-auto-updates.md
@@ -116,7 +116,7 @@ We will introduce two user-facing resources:
     start_version: v1
     target_version: v2
     schedule: regular
-    strategy: previous-must-succeed
+    strategy: halt-on-failure
     mode: enabled
 ```
 
@@ -291,7 +291,7 @@ spec:
     start_version: v1
     target_version: v2
     schedule: regular
-    strategy: previous-must-succeed
+    strategy: halt-on-failure
     mode: enabled
 status:
   groups:
@@ -634,10 +634,10 @@ spec:
     # strategy to use for the rollout
     # Supported values are:
     # - time-based
-    # - previous-must-succeed
-    # - previous-must-succeed-with-backpressure
-    # defaults to previous-must-succeed, might default to previous-must-succeed-with-backpressure after phase 6.
-    strategy: previous-must-succeed
+    # - halt-on-failure
+    # - halt-on-failure-with-backpressure
+    # defaults to halt-on-failure, might default to halt-on-failure-with-backpressure after phase 6.
+    strategy: halt-on-failure
     # agent_schedules specifies version rollout schedules for agents.
     # The schedule used is determined by the schedule associated
     # with the version in the autoupdate_version resource.
@@ -680,7 +680,7 @@ spec:
     mode: enabled
   agents:
     mode: enabled
-    strategy: previous-must-succeed
+    strategy: halt-on-failure
     alert_after: 4h
     schedules:
       regular:
@@ -736,7 +736,7 @@ spec:
     start_version: v1
     target_version: v2
     schedule: regular
-    strategy: previous-must-succeed
+    strategy: halt-on-failure
     mode: enabled
 status:
   groups:
@@ -818,7 +818,7 @@ message AutoUpdateConfigSpecAgents {
   // maintenance_window_minutes is the maintenance window duration in minutes. This can only be set if `strategy` is "time-based".
   int64 maintenance_window_minutes = 3;
   // alert_after_hours specifies the number of hours to wait before alerting that the rollout is not complete.
-  // This can only be set if `strategy` is "previous-must-succeed".
+  // This can only be set if `strategy` is "halt-on-failure".
   int64 alert_after_hours = 5;
   // agent_schedules specifies schedules for updates of grouped agents.
   AgentAutoUpdateSchedules agent_schedules = 6;
@@ -829,7 +829,7 @@ enum Strategy {
   // UNSPECIFIED update strategy
   STRATEGY_UNSPECIFIED = 0;
-  // PREVIOUS_MUST_SUCCEED update strategy with no backpressure
-  STRATEGY_PREVIOUS_MUST_SUCCEED = 1;
+  // HALT_ON_FAILURE update strategy with no backpressure
+  STRATEGY_HALT_ON_FAILURE = 1;
   // TIME_BASED update strategy.
   STRATEGY_TIME_BASED = 2;
 }
 
@@ -848,7 +848,7 @@ message AgentAutoUpdateGroup {
   repeated Day days = 2;
   // start_hour to initiate update
   int32 start_hour = 3;
-  // wait_days after last group succeeds before this group can run. This can only be used when the strategy is "previous-must-finish".
+  // wait_days after last group succeeds before this group can run. This can only be used when the strategy is "halt-on-failure".
   int64 wait_days = 4;
   // canary_count of agents to use in the canary deployment.
   int64 canary_count = 5;
@@ -1162,10 +1162,10 @@ message RollbackAgentGroupRequest {
 
 We support two rollout strategies for two distinct use cases:
 
-- `previous-must-succeed` to limit the damage of a faulty update
+- `halt-on-failure` to limit the damage of a faulty update
 - `time-based` for time-constrained maintenances
 
-In `previous-must-succeed`, the update proceeds from the first group to the last group, ensuring that each group
+In `halt-on-failure`, the update proceeds from the first group to the last group, ensuring that each group
 successfully updates before allowing the next group to proceed. By default, only 5 agent groups are allowed. This
 mitigates very long rollout plans. This is the strategy that offers the best availability. A group finishes its update
 once most of its agents are running the correct version. Agents that missed the group update will try to catch
 up as soon as possible.
@@ -1177,6 +1177,9 @@ maintenance window more strictly. A group finishes its update at the end of the
 of the new version adoption rate. Agents that missed the maintenance window will not attempt to
 update until the next maintenance window.
 
+After phase 6, a third strategy, `backpressure`, will be added. This strategy will behave the same way `halt-on-failure`
+does, except the agents will be progressively rolled out within a group.
+
 #### Agent update mode
 
 The agent auto update mode is specified by both Cloud (via `autoupdate_version`)
@@ -1212,7 +1215,7 @@ A group can be in 5 states:
 - `done`: the group has been updated. New agents should run `v2`.
 - `rolledback`: the group has been rolled back. New agents should run `v1`, existing agents should update to `v1`.
 
-The finite state machine for the `previous-must-succeed` strategy is the following:
+The finite state machine for the `halt-on-failure` strategy is the following:
 
 ```mermaid
 flowchart TD
@@ -1264,7 +1267,7 @@ flowchart TD
 #### Starting a group
 
 A group can be started if the following criteria are met:
-- for the `previous-must-succeed` strategy:
+- for the `halt-on-failure` strategy:
   - all of its previous groups are in the `done` state
   - it has been at least `wait_days` since the previous group update started
   - the current week day is in the `days` list
@@ -1305,7 +1308,7 @@ An alert will eventually fire, warning the user about the stuck update.
 A group in `active` mode is currently being updated. The conditions to leave `active` mode and transition to the `done`
 mode will vary based on the phase and rollout strategy.
 
-- for the `previous-must-succeed` strategy:
+- for the `halt-on-failure` strategy:
   - Phase 3: we don't have any information about agents. The group transitions to `done` 60 minutes after its start.
   - Phase 4: we know about the connected agent count and the connected agent versions. The group transitions to `done` if:
     - at least `(100 - max_in_flight)%` of the agents are still connected
     - at least `(100 - max_in_flight)%` of the agents are running the new version
@@ -1426,13 +1429,13 @@ Let `v1` be the previous version and `v2` the target version, the response matri
 
 ##### Rollout status: enabled
 
-| Group state | Version | Should update                                           |
-|-------------|---------|---------------------------------------------------------|
-| unstarted   | v1      | false                                                   |
-| canary      | v1      | false, except for canaries                              |
-| active      | v2      | true if UUID <= progress                                |
-| done        | v2      | true if `previous-must-succeed`, false if `time-based`  |
-| rolledback  | v1      | true                                                    |
+| Group state | Version | Should update                                     |
+|-------------|---------|---------------------------------------------------|
+| unstarted   | v1      | false                                             |
+| canary      | v1      | false, except for canaries                        |
+| active      | v2      | true if UUID <= progress                          |
+| done        | v2      | true if `halt-on-failure`, false if `time-based`  |
+| rolledback  | v1      | true                                              |
 
 #### Updater status reporting