
Document Atlantis with Geodesic #355

Closed
osterman opened this issue Jan 4, 2019 · 1 comment

osterman commented Jan 4, 2019

what

  • Explain this mind warping concept

Introduction

When you run your infrastructure with geodesic as your base image, you have the benefit of being able to run it anywhere you have Docker support.

For example, you can run it in multiple ways:

  1. On your local workstation
  2. On a remote ECS cluster (e.g. as a service task)
  3. On a remote Kubernetes cluster (e.g. as a Deployment)

Since we have a container, we should be able to apply all the standard release engineering "best practices" to build and deploy Infrastructure as Code.
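
For instance, here's roughly what running it on a local workstation looks like. This is a minimal sketch, assuming the public cloudposse/geodesic image; the tag, container name, and volume mount are illustrative, and the exact invocation depends on the geodesic version:

# Start an interactive geodesic shell locally (illustrative flags)
docker run -it --rm \
  --name geodesic \
  --volume "$HOME:/localhost" \
  cloudposse/geodesic:latest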

Release Engineering

Before we continue, it's important to point out the (2) most common CI/CD pipelines:

  1. Monorepo CI/CD is where you have one repository with multiple apps, each with different SDLCs
  2. Polyrepo CI/CD is where you have one repository with a single app, and a single SDLC

Usually, the master branch (or some branch or tag like it) represents the state of production. That is, some commit SHA should match what has been deployed. If you follow, we're on the same page.

Both these strategies share one common pattern:

  1. Build: Any time a Pull Request is opened or synchronized, check out the code, build the application, and run the tests
  2. Deploy: Any time a Pull Request is merged to master, deploy

Now, this is oversimplified. There's perhaps a lot more going on than just this, but the gist of it should be something like that.
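
In shell terms, a rough sketch of those two steps might look like the following (the registry, image name, and test target are hypothetical):

# Build: runs on every Pull Request open/synchronize (hypothetical names)
docker build -t registry.example.com/infrastructure:"$GIT_SHA" .
docker run --rm registry.example.com/infrastructure:"$GIT_SHA" make test

# Deploy: runs on merge to master
docker push registry.example.com/infrastructure:"$GIT_SHA"
# ...then trigger the rollout (rolling update, blue/green, etc.)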

Thought Experiment

Consider this thought experiment:

  1. We open up a Pull Request. All tests pass.
  2. We merge to master (our "production" state), which triggers a deployment.
  3. Deployment fails.

What do we do? The master branch now contains code that was not successfully deployed. Now our production environment has diverged from what is in git. That's no good.

From here we would typically expect a few things to happen.

  1. Our deployment process is so robust that the failed deployment didn't affect production. It was caught early during the "rolling update" (or "blue/green") rollout. We continue running the previous version in production.
  2. Our engineers revert the Pull Request, restoring the pristine nature of the master branch so that it represents production.

Now, we totally agree that the above process is how things should look. But what happens if the technology or software we are using doesn't support that workflow? Do we try to fix the technology? Or do we find "compensating controls" so we can achieve the same outcome?

The problem

When we're deploying infrastructure as code ("IaC"), we're often deploying the backplane itself: the foundation upon which everything else runs. One of the most common tools for deploying IaC is terraform.

Anyone who uses terraform on a regular basis has probably seen the following error:

Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.

Okay, great. Now what? How do we apply the "CI/CD Best Practices" to terraform when the tool itself doesn't support the key capability we've relied on for decades to achieve CD?
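
To make that concrete, a naive pipeline step might look like this sketch; there is no terraform rollback to fall back on when the apply fails:

# Sketch: a failed apply leaves the state partially updated
terraform init -input=false
terraform plan -input=false -out=planfile
terraform apply -input=false planfile || {
  echo "apply failed: state is partially updated; a human must reconcile it"
  exit 1
}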

Side rant: This is not an easy problem to solve.

This is not the fault of terraform. It's extremely difficult to generalize what the rollback process should look like as it relates to IaC. It's better that a human operator identify the best course of action, rather than the tool making a best guess (e.g. "Oops, let's just destroy the production database and restore it to the previous version" because a security group didn't exist).

So now we have a couple of problems:

  1. We cannot reliably do rollbacks
  2. We may have master inconsistent with what's in production
  3. If we merge to master, then others in the company are going to start developing against a "desired" state that doesn't yet exist and might even be impossible to achieve.
  4. We have to pull the "emergency brake" and stop everything

The Compromise

Now we're going to lay out our solution to this problem. It borrows from the fine work of atlantis and bends it to our needs. The fundamental innovation of atlantis is a new kind of CI/CD pipeline.

Let's call this option (3):

"CI/CD Operations by Interactive Pull Requests".

But what does that mean?

The new workflow looks something like this:

  1. Create a new branch
  2. Make your changes in that branch.
  3. Open up a "Pull Request" when you want to see what should happen (e.g. terraform plan)
  4. Test those changes with a "dry run" automatically (if enabled & user is authorized)
  5. Use GitHub comments to interact with that pull request (e.g. atlantis plan or atlantis apply; see the example after this list)
  6. To apply changes, get the PR approved. Then run atlantis apply.
  7. If successful, then merge to master. Else, go back to step 2. Repeat until successful.
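
Steps 5 and 6 are literally just comments on the Pull Request. For example (the -d directory is illustrative, and the available flags depend on the Atlantis version and repo config):

# Typed as GitHub comments on the PR, not in a terminal
atlantis plan -d conf/eks
atlantis apply -d conf/eks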

The new assumptions as they relate to a geodesic-based infrastructure repo (e.g. testing.cloudposse.co):

  1. Treat the repo as a monorepo that contains multiple projects (e.g. in /conf), each with their own SDLC.
  2. Treat atlantis as one of the apps in this monorepo. It has its own SDLC.

Here's what this then looks like:

  1. We deploy our geodesic container to some AWS account with an IAM role that allows it to perform operations at our behest. This becomes one of our operating contexts that we can use to deploy infrastructure. Depending on where this container runs and the permissions it has, we have the capability to affect infrastructure.
  2. This container receives webhook callbacks from the infrastructure repo. When it receives an authorized request, it carries out the action: it checks out the code at the commit SHA and runs the command. Each one of these callbacks is a different SDLC workflow. This is the monorepo CI/CD process.
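
For reference, the webhook listener inside that container is just the atlantis server process. A minimal sketch, assuming GitHub and hypothetical values (flag names vary across Atlantis versions):

atlantis server \
  --atlantis-url="https://atlantis.example.com" \
  --gh-user="ci-bot" \
  --gh-token="$GITHUB_TOKEN" \
  --gh-webhook-secret="$WEBHOOK_SECRET" \
  --repo-allowlist="github.com/example/testing.cloudposse.co"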

Note, there can be multiple PRs open against the same /conf/$project, so it doesn't make sense to operate in /conf/$project directly. As such, atlantis checks out the work in a temporary folder and executes from there. IMPORTANT: atlantis does not operate in the /conf folder the way a human operator would. It's more like atlantis is operating in something like the /localhost folder.

Thought Experiment #2

"I don't agree this is necessary!"

Okay, we hear you. We don't want to do this either. But let's consider the alternative: we build the docker image containing all the infrastructure as code, treating this as a polyrepo CI/CD pipeline.

Now we need to go apply the changes. How do we know what changed? We cannot use git techniques to identify the changes. The only way is to iterate over every project, run a terraform plan, and possibly a terraform apply if there were changes.
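
Concretely, that degenerates into a loop over every project. A sketch, assuming one terraform project per folder under /conf:

# Walk every project and apply whatever happens to differ
for project in conf/*/; do
  (
    cd "$project" || exit 1
    terraform init -input=false
    # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes pending
    terraform plan -input=false -detailed-exitcode -out=planfile
    if [ $? -eq 2 ]; then
      terraform apply -input=false planfile
    fi
  )
done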

If we do this, then:

  1. deployments will take forever for large infrastructures, because we have to iterate over all projects;
  2. we dramatically expand the blast radius, since we possibly apply changes that were not clearly expressed by our PR (yes, this avoids drift, but the tradeoff is wicked)

Proposed Changes

  • When we run geodesic with atlantis we should move away from multi-stage and instead use terraform init -from-module=...
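
A hypothetical example of what that could look like inside the container (the module source URL is illustrative, not a real repo path):

# Materialize the project from the repo into a working directory, then operate on it
terraform init -from-module="git::https://github.com/example/infrastructure.git//conf/eks?ref=master"
terraform plan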

sarkis commented Jan 5, 2019

The one issue I thought of with the compromise section, #6: what happens if there is new IaC approved and applied in a different PR that may affect the state of the resources/infra in the current PR? This was one of the reasons to only terraform apply on a single branch (master?). Perhaps there is a way for Atlantis to check if the current branch is out of date with the master/main branch?
