Introduction
When you run your infrastructure with geodesic as your base image, you have the benefit of being able to run it anywhere you have Docker support. For example, you can run it in multiple ways:
On your local workstation
On a remote ECS cluster (e.g. as a service task)
On a remote Kubernetes cluster (e.g. as a Deployment)
Since we have a container, we should be able to apply all the standard release engineering "best practices" to build and deploy Infrastructure as Code.
Release Engineering
Before we continue, it's important to point out the (2) most common CI/CD pipelines:
Monorepo CI/CD is where you have one repository with multiple apps, each with different SDLCs
Polyrepo CI/CD is where each repository contains a single app with a single SDLC
Usually, the master branch (or some branch or tag like that) represents the state of production. That is, some commit sha should equal what has been deployed. If you follow, we're on the same page.
Both these strategies share one common pattern:
Build: Any time a Pull Request is opened or synchronized, check out the code, build the application, and run the tests.
Deploy: Any time a Pull Request is merged to master, deploy.
Now, this is oversimplified. There's perhaps a lot more going on than just this, but the gist of it should be something like that.
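To make that concrete, here's a rough sketch of those two triggers as a CI script. This assumes a generic CI runner that exposes the event type and branch as environment variables; the variable names and the make targets are hypothetical placeholders, not tied to any particular CI system.

```sh
#!/usr/bin/env bash
# Minimal sketch of the shared pattern. CI_EVENT, CI_BRANCH, and the
# make targets are hypothetical placeholders.
set -euo pipefail

case "${CI_EVENT}" in
  pull_request)
    # Build: PR opened or synchronized -> check out, build, test
    make build
    make test
    ;;
  push)
    if [ "${CI_BRANCH}" = "master" ]; then
      # Deploy: merge to master -> deploy
      make deploy
    fi
    ;;
esac
```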
Thought Experiment
Consider this thought experiment:
We open up a Pull Request. All tests pass.
We merge to master (our "production" state), which triggers a deployment.
Deployment fails.
What do we do? The master branch now contains code that was not successfully deployed. Now our production environment has diverged from what is in git. That's no good.
From here we would typically expect a few things to happen.
Our deployment process is so robust that the failed deployment didn't affect production. It was caught early during the "rolling update" (or "blue/green") rollout, and we continue running the previous version in production.
Our engineers revert the Pull Request, restoring the pristine nature of the master branch so that it represents production.
Now, we totally agree that the above process is how things should look. But what happens if the technology or software we are using doesn't support that workflow? Do we try to fix the technology? Or do we find "compensating controls" so we can achieve the same outcome?
The problem
When we're deploying infrastructure as code ("IaC"), we're often deploying the backplane itself: the foundation upon which everything else runs. One of the most common tools for deploying IaC is terraform.
Anyone who uses terraform on a regular basis has probably seen the following error:
Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.
Okay, great. Now what? How do we apply the "CI/CD Best Practices" to terraform when the tool itself doesn't support the key capability we've come to rely on to achieve CD for decades?
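For reference, the "recovery" the error message prescribes is a manual, incremental loop, something like this (a sketch; conf/example-project is a hypothetical project directory):

```sh
# Sketch of the manual recovery loop implied by the error message.
# conf/example-project is a hypothetical project directory.
cd conf/example-project
terraform plan            # inspect what is left to converge
terraform apply           # fails partway; state is partially updated
# ...fix the underlying issue (code, IAM, quota, etc.)...
terraform apply           # re-run to incrementally converge the remainder
```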
Side rant: This is not an easy problem to solve.
This is not the fault of terraform. It's extremely difficult to generalize what the rollback process should look like as it relates to IaC. It's better that a human operator identify the best course of action, rather than the tool making a best guess (e.g. "Oops, let's just destroy the production database and restore it to the previous version" because the security group didn't exist).
So now we have a couple of problems:
We cannot reliably do rollbacks
We may have master inconsistent with what's in production
If we merge to master, then others in the company are going to start developing against a "desired" state that doesn't yet exist and might even be impossible to achieve.
We have to pull the "emergency brake" and stop everything
The Compromise
Now we're going to lay out our solution to this problem. It borrows from the fine work of atlantis and bends it to our needs. The fundamental innovation of atlantis is a new kind of CI/CD pipeline.
Let's call this option (3):
"CI/CD Operations by Interactive Pull Requests".
But what does that mean?
The new workflow looks something like this:
Create a new branch
Make your changes in that branch.
Open up a "Pull Request" when you want to see what should happen (e.g. terraform plan)
Test those changes with a "dry run" automatically (if enabled & user is authorized)
Use GitHub comments to interact with that pull request (e.g. atlantis plan or atlantis apply)
To apply changes, get the PR approved. Then run atlantis apply.
If successful, then merge to master. Else, go back to step 2. Repeat until successful. (Example comments are sketched after this list.)
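For illustration, the interaction on the Pull Request might look like the following comments (a sketch: conf/example-project is a hypothetical directory, and the flags that are available depend on your atlantis version and repo configuration):

```sh
# Comments posted on the Pull Request (not run in a terminal).
# conf/example-project is a hypothetical project directory.
atlantis plan -d conf/example-project     # dry run for one project
atlantis apply -d conf/example-project    # apply once the PR is approved
```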
The new assumptions, as they relate to a geodesic-based infrastructure repo (e.g. testing.cloudposse.co):
Treat the repo as a monorepo that contains multiple projects (e.g. in /conf), each with their own SDLC.
Treat atlantis as one of the apps in this monorepo. It has its own SDLC.
Here's what this then looks like:
We deploy our geodesic container to some AWS account with an IAM role that allows it to perform operations at our behest. This becomes one of our operating contexts that we can use to deploy infrastructure. Depending on where this container runs and the permissions it has, we have the capability to affect infrastructure.
This container receives webhook callbacks from the infrastructure repo. When it receives an authorized request, it carries out the action: it checks out the code at the commit SHA and runs the command. Each one of these callbacks is a different SDLC workflow. This is the monorepo CI/CD process.
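A minimal sketch of what running that webhook receiver could look like inside the container. The flag names below exist in recent atlantis releases but may differ by version, and every value is a placeholder:

```sh
# Sketch only: start atlantis as the webhook receiver inside the container.
# Flag names vary by atlantis version; all values here are placeholders.
atlantis server \
  --atlantis-url="https://atlantis.example.com" \
  --gh-user="example-bot" \
  --gh-token="${GITHUB_TOKEN}" \
  --gh-webhook-secret="${WEBHOOK_SECRET}" \
  --repo-allowlist="github.com/example-org/testing.cloudposse.co"
```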
Note, there can be multiple PRs open against the same /conf/$project, so it doesn't make sense to operate in /conf/$project. As such, atlantis checks out the work in a temporary folder and executes from there. IMPORTANT: atlantis does not operate in the /conf folder the way a human operator would. It's more like atlantis is operating in something like the /localhost folder.
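To illustrate the difference in behavior (this is not atlantis internals, just the shape of what happens; REPO_URL, PR_BRANCH, and the project path are placeholders):

```sh
# A human operator works in the long-lived checkout:
#   cd /conf/example-project && terraform plan
# atlantis instead clones the commit under review into a throwaway workspace
# and runs from there.
tmpdir="$(mktemp -d)"
git clone --branch "${PR_BRANCH}" "${REPO_URL}" "${tmpdir}"
cd "${tmpdir}/conf/example-project"
terraform init -input=false
terraform plan
```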
Thought Experiment #2
Okay, we hear you. We don't want to do this either. But let's consider the alternative: we build the docker image containing all the infrastructure as code, treating this as a polyrepo CI/CD pipeline.
Now we need to go apply the changes. How do we know what changed? We cannot use git techniques to identify the changes. The only way is to iterate over every project, do a terraform plan, and possibly a terraform apply if there were changes (sketched after the list below).
If we do this, then:
deployments will take forever for large infrastructures because we have to iterate over all projects;
we dramatically expand the blast radius, since we may apply changes that were not clearly expressed by our PR (yes, this avoids drift, but the tradeoff is wicked)
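Here's roughly what that iteration looks like (a sketch, assuming every project is a directory directly under /conf):

```sh
#!/usr/bin/env bash
# Sketch of the polyrepo-style alternative: walk every project and converge it,
# whether or not this PR touched it. Assumes each project is a directory
# directly under /conf.
for project in /conf/*/; do
  echo "==> ${project}"
  (
    cd "${project}"
    terraform init -input=false
    # plan -detailed-exitcode returns 0 (no changes), 1 (error), 2 (changes)
    terraform plan -detailed-exitcode -out=tfplan
    case "$?" in
      2) terraform apply tfplan ;;
      1) echo "plan failed in ${project}" >&2 ;;
    esac
  )
done
```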
Proposed Changes
When we run geodesic with atlantis, we should move away from multi-stage Docker builds and instead use terraform init -from-module=...
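For example, something like this could run per-project instead of baking the code into the image (the directory, git URL, and ref are placeholders):

```sh
# Sketch: pull the project's code at init time rather than building it into
# the image. The directory, module source, and ref below are placeholders.
mkdir -p /conf/example-project && cd /conf/example-project
terraform init -from-module="git::https://github.com/example-org/example-infrastructure.git//conf/example-project?ref=master"
terraform plan
```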
The one issue I thought of with the compromise section (#6): what happens if there is new IaC approved and applied in a different PR that may affect the state of the resources/infra in the current PR? This was one of the reasons to only terraform apply on a single branch (master?). Perhaps there is a way for Atlantis to check if the current branch is out of date with the master/main branch?
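One way to detect that staleness with plain git (a sketch, independent of whatever atlantis itself supports):

```sh
# Sketch: fail if the PR branch does not already contain the tip of master.
git fetch origin master
if git merge-base --is-ancestor origin/master HEAD; then
  echo "branch is up to date with master"
else
  echo "branch is out of date with master; rebase or merge before applying" >&2
  exit 1
fi
```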