Proposal: helios rollout #502

rohansingh · 2015-06-25T18:53:50Z

rollout command

The helios rollout command rolls out a new version of a job to a set of hosts. This involves:

Determining the list and sequence of hosts to deploy to.
Undeploying any prior versions of the same job from a host before deploying the new version.
Deploying the new version of the job to each host (sequentially at first, with other possible strategies in the future), and waiting for the job to reach a RUNNING state before continuing.
Rolling back to the prior version(s) of the job in case an error occurs in the deployment. We won't implement this in the first go, but this proposal discusses a possible way to implement rollback anyway.

Why?

This type of rollout of a new job is what users generally do after they have a new image built. We already have two different Python scripts internally at Spotify that do this kind of thing (helios-helper and spheliosdeploy). This is definitely something that users want.

Legitimizing "rollout" as a first-class Helios operation gives us benefits over existing scripts:

Reduced confusion over what is part of Helios and what is part of the existing deployment scripts. This has proven to be a problem at Spotify, and caused issues for teams trying to troubleshoot their deployment pipelines.
Ability to test the rollout mechanism as part of Helios's test suite. This is in stark contrast to the existing scripts, which have no integration test coverage.
More robust rollouts, which survive failure of the client and can guarantee completion or rollback as long as ZooKeeper is up. With existing scripts, the client or build machine failing generally means the deployment and rollback are aborted.
A central point for managing rollout strategies and behavior, and collecting metrics on rollouts. We can iterate on rollouts in this one place, with any improvements becoming available to all Helios users. For example, we could add new rollout strategies or options over time.

How?

First, the user creates a new job version with the Helios CLI. For example, myservice:0.2. Next, they create a rollout that specifies the new job and a host filter regex. Here's what this might look like:

helios rollout myservice:0.2 '.*-myservice-.*'

In the example above, the goal of the rollout would be to deploy myservice:0.2 to all Helios agents whose hostnames match the regex: .*-myservice-.*
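
As a rough illustration of the host selection step, here is a minimal Python sketch that filters a list of hostnames with the regex from the example. The `select_hosts` helper and the sample hostnames are hypothetical. Matching the whole hostname (rather than any substring) and sorting to get a deterministic sequence are both assumptions, not something this proposal specifies.

```python
import re


def select_hosts(hostnames, pattern):
    """Return the hostnames that fully match `pattern`, in sorted order.

    Hypothetical helper: full-match semantics and sorted ordering are
    assumptions made for this sketch.
    """
    regex = re.compile(pattern)
    return sorted(h for h in hostnames if regex.fullmatch(h))


# Made-up agent hostnames, for illustration only.
agents = ["lon1-myservice-a1", "lon1-otherservice-a1", "sto3-myservice-b2"]
selected = select_hosts(agents, r".*-myservice-.*")
print(selected)
```

The resulting list would then be stored as the rollout's host sequence, so every controller works from the same ordering.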

In the future, this could be augmented with additional host selectors (agent tags, Puppet, etc.) or rollout strategies.

Creating the rollout

When a master receives a rollout request, it determines the list of applicable hosts and creates a rollout configuration in ZooKeeper. This rollout configuration has these initial fields:

Rollout operation ID
A UUID that is used to identify all tasks that are created by this rollout.
Rollback operation ID
A UUID that is used to identify all tasks that are created by a rollback of this rollout.
New job ID
The ID of the new job to roll out.
Rollout status
Initially set to CREATED. Other possible values are ROLLING_OUT, ROLLING_BACK, DONE, and FAILED.
Rollout hosts
The exact sequence of hosts that the new job will be deployed to.
Rollout index
An atomic long used to track which host is currently being rolled out to.

Out of the above, the rollout hosts and rollout index are actually specific to the implementation of the simple rollout strategy. In the future, we might have other strategies that have other fields.
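
To make the shape of this configuration concrete, here is a minimal sketch of the fields above as a Python dataclass. The class and attribute names are hypothetical and the serialization to ZooKeeper is left out. A plain integer stands in for the atomic long.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class RolloutStatus(Enum):
    CREATED = "CREATED"
    ROLLING_OUT = "ROLLING_OUT"
    ROLLING_BACK = "ROLLING_BACK"
    DONE = "DONE"
    FAILED = "FAILED"


def _new_id() -> str:
    return str(uuid.uuid4())


@dataclass
class RolloutConfig:
    """Hypothetical in-memory view of one rollout configuration."""
    new_job_id: str
    hosts: List[str]  # exact deployment sequence (simple strategy only)
    rollout_op_id: str = field(default_factory=_new_id)
    rollback_op_id: str = field(default_factory=_new_id)
    status: RolloutStatus = RolloutStatus.CREATED
    index: int = 0  # position in `hosts` (simple strategy only)


config = RolloutConfig(
    new_job_id="myservice:0.2",
    hosts=["lon1-myservice-a1", "sto3-myservice-b2"],  # made-up hostnames
)
print(config.status.value, config.index)
```

Since `hosts` and `index` belong to the simple strategy, a later design could move them into a strategy-specific sub-structure.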

Rollout controller

We have a new rollout controller daemon, which can run side-by-side with the Helios master daemon (or anywhere it can reach ZooKeeper). By running the rollout controller on each master, we also get robustness — a rollout will continue as long as ZooKeeper is up and a single rollout controller is alive.

Each rollout controller watches the /config/rollouts node in ZooKeeper for rollouts. For any rollouts that are CREATED, ROLLING_OUT, or ROLLING_BACK, each rollout controller attempts to create the appropriate deploy or undeploy task for the current host as determined by the current rollout index. In the same transaction, the rollout controller increments the rollout index (or decrements it in case of a rollback).

When creating deploy and undeploy tasks for the rollout, every rollout controller uses the same predetermined rollout or rollback operation ID. This ensures that only one rollout controller will "win" and actually create the task and increment/decrement the rollout index.
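
The "only one controller wins" behavior can be sketched with an in-memory stand-in for ZooKeeper. Everything here is hypothetical: `FakeStore`, `try_advance`, and the choice to key a task by (operation ID, index) are assumptions for illustration. In a real implementation the create-and-increment step would be a single ZooKeeper transaction rather than a Python lock.

```python
import threading


class TaskExistsError(Exception):
    """Raised when another controller already created this task."""


class FakeStore:
    """In-memory stand-in for ZooKeeper (illustration only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.tasks = {}   # (op_id, index) -> host
        self.index = 0    # the rollout index

    def create_task_and_advance(self, op_id, expected_index, host):
        # Atomic step: create the task and bump the index, or do nothing.
        with self._lock:
            key = (op_id, expected_index)
            if key in self.tasks or self.index != expected_index:
                raise TaskExistsError(key)
            self.tasks[key] = host
            self.index += 1


def try_advance(store, op_id, hosts, seen_index):
    """Return True if this controller won the race for `seen_index`."""
    try:
        store.create_task_and_advance(op_id, seen_index, hosts[seen_index])
        return True
    except TaskExistsError:
        return False


store = FakeStore()
hosts = ["host-a", "host-b"]
# Three controllers all observed index 0 and race to create the same task.
results = [try_advance(store, "op-1234", hosts, 0) for _ in range(3)]
print(results, store.index)
```

Because every controller derives the same key from the shared operation ID, the losers fail cleanly and simply wait for the next change.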

If the rollout status is currently CREATED, the rollout controller updates it to ROLLING_OUT as part of the same transaction.

If the rollout status is currently ROLLING_OUT, the rollout controller waits for the job on the previous host to reach RUNNING before creating any tasks. If this doesn't happen within a reasonable timeout, the rollout controller updates the rollout status to ROLLING_BACK and decrements the rollout index.

Once the last host is deployed or rolled back to, the rollout controller sets the rollout status to DONE or FAILED, respectively.
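
The status transitions described in this section can be summarized as a small pure function. This is only a sketch of one possible reading: the `step` function, its parameters, and the convention that the index points at the next host to deploy (and at the next host to roll back while ROLLING_BACK) are assumptions.

```python
def step(status, index, num_hosts, prev_running=True, timed_out=False):
    """One controller pass. Returns (new_status, new_index, action).

    `action` is ("deploy", i), ("rollback", i), or None. Hypothetical
    sketch; not the actual Helios implementation.
    """
    if status in ("DONE", "FAILED"):
        return status, index, None
    if status == "ROLLING_BACK":
        if index < 0:  # every deployed host has been rolled back
            return "FAILED", index, None
        return "ROLLING_BACK", index - 1, ("rollback", index)
    if status == "ROLLING_OUT":
        if timed_out:  # previous host never reached RUNNING
            return "ROLLING_BACK", index - 1, None
        if not prev_running:  # keep waiting for the previous host
            return status, index, None
    if index == num_hosts:  # last host deployed and RUNNING
        return "DONE", index, None
    # CREATED, or ROLLING_OUT with the previous host RUNNING
    return "ROLLING_OUT", index + 1, ("deploy", index)


# Happy path for two hosts.
state = ("CREATED", 0)
actions = []
while state[0] not in ("DONE", "FAILED"):
    status, index, action = step(state[0], state[1], num_hosts=2)
    if action:
        actions.append(action)
    state = (status, index)
print(state, actions)
```

Keeping the transition logic free of I/O like this would also make it straightforward to cover in the Helios test suite.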

Deploy and undeploy tasks

Rolling out to each host actually consists of creating an undeploy task for any previous versions of the job to be rolled out, and creating a deploy task for the new version. When doing this, the rollout controller must also record the job ID for the undeployed version in ZooKeeper, if any.

When rolling back, the rollout controller creates an undeploy task for the new job and a deploy task for the previously-deployed version that was recorded earlier.
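
A minimal sketch of how the task lists for one host might be built, including the bookkeeping needed for rollback. The helper names and the in-memory dictionaries are hypothetical. Treating two job IDs as "the same job" when the part before the colon matches (as in myservice:0.2) is an assumption.

```python
def job_name(job_id):
    # Assumes IDs look like "name:version", as in the example myservice:0.2.
    return job_id.split(":", 1)[0]


def rollout_tasks(deployed, previous, host, new_job):
    """Tasks to roll `new_job` out to `host`; records what was undeployed."""
    old = sorted(
        j for j in deployed.get(host, ())
        if job_name(j) == job_name(new_job) and j != new_job
    )
    previous[host] = old  # remembered so a rollback can restore it
    return [("undeploy", host, j) for j in old] + [("deploy", host, new_job)]


def rollback_tasks(previous, host, new_job):
    """Tasks to restore whatever ran on `host` before the rollout."""
    restore = previous.get(host, [])
    return [("undeploy", host, new_job)] + [("deploy", host, j) for j in restore]


deployed = {"host-a": {"myservice:0.1", "other:1.0"}}  # made-up state
previous = {}
out = rollout_tasks(deployed, previous, "host-a", "myservice:0.2")
back = rollback_tasks(previous, "host-a", "myservice:0.2")
print(out)
print(back)
```

In the proposal, the `previous` mapping would live in ZooKeeper alongside the rollout configuration rather than in memory.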

Roadmap

In the first iteration, we will not implement rollbacks. In case of an error or timeout, the rollout controller will update the rollout status to FAILED and give up immediately. This also means we don't have to track undeployed job IDs.

In future versions, we might want to have:

Additional rollout strategies (exponential, red/black, etc.).
Standing rollout commands, which automatically deploy to new agents as soon as they come up.
Notifications on rollout success or failure.
Some sort of UI for monitoring rollouts.
Rollbacks, and some way to have cross-site rollbacks across multiple Helios clusters.
A configurable interval for how long to wait between when the job is RUNNING on one host and when we try to deploy to the next host.

rohansingh · 2015-06-25T19:02:59Z

We should definitely have helios status --rollout <something> so you can monitor a rollout, and possibly helios rollout --abort as well.

davidxia · 2015-06-26T08:48:32Z

👍

rohansingh · 2015-07-13T23:07:41Z

Fixed by #521.

rohansingh added the enhancement label Jun 25, 2015

rohansingh mentioned this issue Jun 26, 2015

Declarative language for specifying stacks using helios #450

Closed

rohansingh closed this as completed Jul 13, 2015
