The helios rollout command rolls out a new version of a job to a set of hosts. This involves:
Determining the list and sequence of hosts to deploy to.
Undeploying any prior versions of the same job from a host before deploying the new version.
Deploying the new version of the job to each host (sequentially at first, with other possible strategies in the future), and waiting for the job to reach a RUNNING state before continuing.
Rolling back to the prior version(s) of the job in case an error occurs in the deployment. We won't implement this in the first go, but this proposal discusses a possible way to implement rollback anyway.
Why?
Rolling out a new version of a job is what users generally do after they have built a new image. We already have two different Python scripts internally at Spotify that do this kind of thing (helios-helper and spheliosdeploy). This is definitely something that users want.
Legitimizing "rollout" as a first-class Helios operation gives us benefits over existing scripts:
Reduced confusion over what is part of Helios and what is part of the existing deployment scripts. This has proven to be a problem at Spotify, and caused issues for teams trying to troubleshoot their deployment pipelines.
Ability to test the rollout mechanism as part of Helios's test suite. This is in stark contrast to the existing scripts, which have no integration test coverage.
More robust rollouts, which survive failure of the client and can guarantee completion or rollback as long as ZooKeeper is up. With existing scripts, the client or build machine failing generally means the deployment and rollback are aborted.
A central point for managing rollout strategies and behavior, and collecting metrics on rollouts. We can iterate on rollouts in this one place, with any improvements becoming available to all Helios users. For example, we could add new rollout strategies or options over time.
How?
First, the user creates a new job version with the Helios CLI. For example, myservice:0.2. Next, they create a rollout that specifies the new job and a host filter regex. Here's what this might look like:
helios rollout myservice:0.2 '.*-myservice-.*'
In the example above, the goal of the rollout would be to deploy myservice:0.2 to all Helios agents whose hostnames match the regex: .*-myservice-.*
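To make the host selection concrete, here is a minimal sketch, in Java, of how a master might filter agents with the host filter regex. The listing of agent hostnames is assumed to come from elsewhere; selectRolloutHosts and its sorting behavior are illustrative, not the actual Helios implementation.

import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class HostSelectionSketch {
  // Sketch only: pick the rollout hosts by matching registered agent hostnames against
  // the host filter regex. Sorting gives a deterministic rollout sequence.
  static List<String> selectRolloutHosts(final List<String> agentHostnames,
                                         final String hostFilter) {
    final Pattern pattern = Pattern.compile(hostFilter);
    return agentHostnames.stream()
        .filter(host -> pattern.matcher(host).matches())
        .sorted()
        .collect(Collectors.toList());
  }
}

With the command above, selectRolloutHosts(agents, ".*-myservice-.*") would return every registered agent whose hostname contains -myservice-.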
In the future, this could be augmented with additional host selectors (agent tags, Puppet, etc.) or rollout strategies.
Creating the rollout
When a master receives a rollout request, it determines the list of applicable hosts and creates a rollout configuration in ZooKeeper. This rollout configuration has these initial fields:
Rollout operation ID
A UUID that is used to identify all tasks that are created by this rollout.
Rollback operation ID
A UUID that is used to identify all tasks that are created by a rollback of this rollout.
New job ID
The ID of the new job to roll out.
Rollout status
Initially set to CREATED. Possible values include ROLLING_OUT, ROLLING_BACK, DONE, and FAILED.
Rollout hosts
The exact sequence of hosts that the new job will be deployed to.
Rollout index
An atomic long used to track which host is currently being rolled out to.
Of the above, the rollout hosts and rollout index are specific to the implementation of the simple rollout strategy. In the future, we might have other strategies with other fields.
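As a rough illustration only, the rollout configuration could be modeled along these lines in Java. The field names, the Status enum, and the use of an in-memory class are assumptions for the sketch; the real configuration would be serialized into the ZooKeeper node.

import java.util.List;
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the rollout configuration described above. Names and types are illustrative.
class RolloutConfig {
  enum Status { CREATED, ROLLING_OUT, ROLLING_BACK, DONE, FAILED }

  final UUID rolloutOperationId;   // identifies all tasks created by this rollout
  final UUID rollbackOperationId;  // identifies all tasks created by a rollback of this rollout
  final String newJobId;           // e.g. "myservice:0.2"
  volatile Status status;          // starts as CREATED
  final List<String> rolloutHosts; // exact deployment sequence
  final AtomicLong rolloutIndex;   // which host is currently being rolled out to

  RolloutConfig(final UUID rolloutOperationId, final UUID rollbackOperationId,
                final String newJobId, final List<String> rolloutHosts) {
    this.rolloutOperationId = rolloutOperationId;
    this.rollbackOperationId = rollbackOperationId;
    this.newJobId = newJobId;
    this.status = Status.CREATED;
    this.rolloutHosts = rolloutHosts;
    this.rolloutIndex = new AtomicLong(0);
  }
}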
Rollout controller
We have a new rollout controller daemon, which can run side-by-side with the Helios master daemon (or anywhere it can reach ZooKeeper). By running the rollout controller on each master, we also get robustness — a rollout will continue as long as ZooKeeper is up and a single rollout controller is alive.
Each rollout controller watches the /config/rollouts node in ZooKeeper for rollouts. For any rollouts that are CREATED, ROLLING_OUT, or ROLLING_BACK, each rollout controller attempts to create the appropriate deploy or undeploy task for the current host as determined by the current rollout index. In the same transaction, the rollout controller increments the rollout index (or decrements it in the case of a rollback).
When creating deploy and undeploy tasks for the rollout, every rollout controller uses the same predetermined rollout or rollback operation ID. This ensures that only one rollout controller will "win" and actually create the task and increment/decrement the rollout index.
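One way this "only one controller wins" behavior could be achieved (a sketch under assumed znode paths, not the actual Helios ZooKeeper layout) is to derive the task's znode path from the shared operation ID and the current rollout index, and to create it together with a versioned update of the index in a single ZooKeeper multi-op:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

class RolloutTaskTransactionSketch {
  // Sketch: atomically create the next task and advance the rollout index. The task path
  // is derived from the predetermined operation ID and the current index, so a second
  // controller attempting the same step fails with NodeExists (or BadVersion on the index)
  // and its whole multi-op is discarded.
  static boolean tryCreateTask(final ZooKeeper zk, final String rolloutId,
                               final String operationId, final long index,
                               final byte[] taskData, final int indexVersion)
      throws InterruptedException, KeeperException {
    final String taskPath =
        "/config/rollouts/" + rolloutId + "/tasks/" + operationId + "-" + index;
    final String indexPath = "/config/rollouts/" + rolloutId + "/index";
    final byte[] newIndex = Long.toString(index + 1).getBytes(StandardCharsets.UTF_8);
    try {
      zk.multi(Arrays.asList(
          Op.create(taskPath, taskData, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT),
          Op.setData(indexPath, newIndex, indexVersion)));
      return true;   // this controller "won" and created the task
    } catch (KeeperException.NodeExistsException | KeeperException.BadVersionException e) {
      return false;  // another controller already performed this step; nothing to do
    }
  }
}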
If the rollout status is currently CREATED, the rollout controller updates it to ROLLING_OUT as part of the same transaction.
If the rollout status is currently ROLLING_OUT, the rollout controller waits for the job on the previous host to reach RUNNING before creating any tasks. If this doesn't happen within a reasonable timeout, the rollout controller updates the rollout status to ROLLING_BACK and decrements the rollout index.
Once the last host is deployed or rolled back to, the rollout controller sets the rollout status to DONE or FAILED, respectively.
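Putting those transitions together, the per-rollout logic in each controller might look roughly like the sketch below. It reuses the RolloutConfig sketch above; the abstract helpers are hypothetical placeholders for the ZooKeeper reads and writes described in this section, and createNextTask is assumed to create the task and advance the index in one transaction, as outlined earlier.

import java.util.UUID;

abstract class RolloutStepSketch {
  // Hypothetical helpers standing in for reads and writes of ZooKeeper state.
  abstract boolean previousJobIsRunning(RolloutConfig rollout);
  abstract boolean timedOut(RolloutConfig rollout);
  abstract void createNextTask(RolloutConfig rollout, UUID operationId);

  void step(final RolloutConfig rollout) {
    final long index = rollout.rolloutIndex.get();
    switch (rollout.status) {
      case CREATED:
        // Create the first task and move to ROLLING_OUT (in practice, in one transaction).
        createNextTask(rollout, rollout.rolloutOperationId);
        rollout.status = RolloutConfig.Status.ROLLING_OUT;
        break;
      case ROLLING_OUT:
        if (!previousJobIsRunning(rollout)) {
          if (timedOut(rollout)) {
            // The job never reached RUNNING in time: start rolling back.
            rollout.status = RolloutConfig.Status.ROLLING_BACK;
            rollout.rolloutIndex.decrementAndGet();
          }
          break; // otherwise keep waiting
        }
        if (index >= rollout.rolloutHosts.size()) {
          rollout.status = RolloutConfig.Status.DONE;   // last host deployed successfully
        } else {
          createNextTask(rollout, rollout.rolloutOperationId);
        }
        break;
      case ROLLING_BACK:
        if (index < 0) {
          rollout.status = RolloutConfig.Status.FAILED; // rolled back past the first host
        } else {
          createNextTask(rollout, rollout.rollbackOperationId);
        }
        break;
      default:
        break; // DONE and FAILED need no further work
    }
  }
}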
Deploy and undeploy tasks
Rolling out to each host consists of creating an undeploy task for any previously-deployed version of the job, followed by a deploy task for the new version. When doing this, the rollout controller must also record the job ID of the undeployed version, if any, in ZooKeeper.
When rolling back, the rollout controller creates an undeploy task for the new job and a deploy task for the previously-deployed version that was recorded earlier.
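For illustration, the per-host rollout and rollback steps could be sketched as follows. getDeployedJobId, recordPreviousJobId, and the task-creation helpers are hypothetical placeholders for the corresponding ZooKeeper operations, not existing Helios APIs.

import java.util.UUID;

abstract class HostRolloutSketch {
  // Hypothetical placeholders for the corresponding ZooKeeper reads and writes.
  abstract String getDeployedJobId(String host, String jobName);    // null if nothing is deployed
  abstract void recordPreviousJobId(String rolloutId, String host, String jobId);
  abstract void createUndeployTask(UUID operationId, String host, String jobId);
  abstract void createDeployTask(UUID operationId, String host, String jobId);

  // Roll the new version out to one host: undeploy any prior version of the job
  // (recording it for a possible rollback), then deploy the new version.
  void rollOutToHost(final String rolloutId, final UUID rolloutOperationId,
                     final String host, final String jobName, final String newJobId) {
    final String previousJobId = getDeployedJobId(host, jobName);
    if (previousJobId != null) {
      recordPreviousJobId(rolloutId, host, previousJobId);
      createUndeployTask(rolloutOperationId, host, previousJobId);
    }
    createDeployTask(rolloutOperationId, host, newJobId);
  }

  // Roll one host back: undeploy the new version and restore the recorded previous version.
  void rollBackHost(final UUID rollbackOperationId, final String host,
                    final String newJobId, final String previousJobId) {
    createUndeployTask(rollbackOperationId, host, newJobId);
    if (previousJobId != null) {
      createDeployTask(rollbackOperationId, host, previousJobId);
    }
  }
}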
Roadmap
In the first iteration, we will not implement rollbacks. In case of an error or timeout, the rollout controller will update the rollout status to FAILED and give up immediately. This also means we don't have to track undeployed job IDs.
In future versions, we might want to have, among other things, a delay between when the job reaches RUNNING on one host and when we try to deploy to the next host.