Skip to content
This repository has been archived by the owner on Sep 22, 2020. It is now read-only.

Latest commit

 

History

History
330 lines (237 loc) · 15.5 KB

README.md

File metadata and controls

330 lines (237 loc) · 15.5 KB

Odin Auto-Scaling Group Deployer

Odin

Deploy your 12-factor-applications to AWS easily and securely with the Odin. Odin is a AWS Step Function base on the step framework that deploys services as Auto-Scaling Groups (ASG's) to AWS.

Odin's goals/requirements/features are:

  1. Ephemeral Blue/Green: create new instances, wait for them to become healthy, delete old instances; treating compute instances as disposable and ephemeral.
  2. Declarative: describe what a successful release looks like, not how to deploy it.
  3. Scalable: can scale both vertically (larger instances) and horizontally (more instances).
  4. Secure: resources are verified to ensure that they cannot be used accidentally or maliciously.
  5. Gracefully Fail: handle failures to recover and roll back with no/minimal impact to users.
  6. Configuration Parity: minimize divergence between production, staging and development environments by keeping releases as similar as possible.
  7. No Configuration: once Odin is deployed it requires no further configuration.
  8. Multi Account: one deployer for all AWS accounts.

Getting Started

Odin is implemented as an AWS Lambda Function and AWS Step Function that deploys by assuming a role into an AWS account. You can bootstrap these into AWS with:

git pull # pull down new code
./scripts/bootstrap

Testing with deploy-test

Odin includes a test project deploy-test that has one service web that starts an nginx server to be mounted behind a Elastic Load Balancer (ELB) and Application Load Balancer target group. The service instances have a security group and instance profile.

To create the AWS resources for deploy-test:

./scripts/geo apply resources/deploy-test-resources.rb

Note: you will also have to tag the latest Ubuntu release with Name: ubuntu and DeployWith: odin

A deploy-test release file deployer-test-release.json looks like:

{
  "project_name": "coinbase/deploy-test",
  "config_name": "development",
  "subnets": ["test_private_subnet_a", "test_private_subnet_b"],
  "ami": "ubuntu",
  "services": {
    "web": {
      "instance_type": "t2.nano",
      "security_groups": ["ec2::coinbase/deploy-test::development"],
      "elbs": ["coinbase-deploy-test-web-elb"],
      "profile": "coinbase-deploy-test",
      "target_groups": ["coinbase-deploy-test-web-tg"]
    }
  }
}

The user data for the release is from the file deployer-test-release.json.userdata:

#cloud-config
repo_update: true
repo_upgrade: all

packages:
 - docker.io

runcmd:
 - docker run -d -p 8000:80 nginx

To build a release for deploy-test and send it to Odin we use the odin executable:

odin deploy deploy-test-release.json

Odin deploy

The odin executable takes the release file, merges in the user data, attaches some meta-data like created_at and `release_id, then send the release to the Odin step function that:

  1. validates the sent release and any referenced resources.
  2. creates a new auto-scaling group for web that starts an nginx server.
  3. waits for the EC2 instances in the web ASG to become healthy w.r.t. the ASG, the ELB and the target group. This may take a few minutes.
  4. Once healthy the ASGs from the previous release and terminate their instances.

This is the ephemeral blue/green where old instances are deleted and new servers created.

Odin Release

An Odin release is a request to deploy a Project-Configuration where:

  • A Project is a code-base typically named with org/name.
  • A Configuration is the environment the project is being deployed into, e.g. development, production.

Each release can define 1-to-many Services; each service is a logical group of servers, e.g. web or worker, that maps to a single auto-scaling group (ASG).

When Odin is sent a release, it moves it through a state machine:

odin state diagram

  1. Validate: validate the release is correct.
  2. Lock: grabs a lock on project-configuration.
  3. ValidateResources: validate resources w.r.t. the project, configuration and service using them.
  4. Deploy: creates an ASG and other resource for each service.
  5. CheckHealthy: check to see if the new instances created are healthy w.r.t. their ASGs ELBs and target groups. If instances are seen to be terminating immediately halt release.
  6. CleanUpSuccess: if the release was a success, then delete the old ASGs.
  7. CleanUpFailure: if the release failed, delete the new ASGs.
  8. ReleaseLockFailure: try to release the lock and fail.

At each of these states it is possible to fail and then move towards a failure state. The typical failures are:

  • BadReleaseError: The release sent was invalid because either its structure was incorrect, its values were invalid, or its resources were invalid.
  • LockExistsError: Could not grab the lock because either another deploy for the project-configuration is currently going out, or a previous deploy left a lock in place.
  • DeployError: Unable to create a new ASG or resource.
  • HaltError: Halt was detected or instances were found terminating.
  • TimeoutError: The deploy took too long and failed.

The end states are:

  1. Success: the release went went as planned.
  2. FailureClean: release was unsuccessful, but cleanup was successful, so AWS was left in good state.
  3. FailureDirty: release was unsuccessful, but cleanup failed so AWS was left in a bad state. This should never happen and should alert if this happens, and file a bug.
  4. It is possible to not end in one of these states if the state machine is incorrect. This is very bad, alert if this happens and file a bug.

Resources

A release uses resources that must exist and be configured correctly to be used for the project-configuration-service being deployed.

A release must have:

  1. an AMI defined with the ami key that can be either a Name tag or AMI ID e.g. ami-1234567
  2. Subnets defined with subnets key that is a list of either Name tags or Subnet IDs e.g. subnet-1234567

Both the above resources MUST have a tag DeployWith that equals odin.

Services can have:

  1. Security Groups defined with security_groups key is a list of security groups Name tags
  2. Elastic Load Balancers defined with elbs key is a list of ELB names
  3. Application Load Balancer Target Groups defined with target_groups is a list of target group's Name tags

All the above resources MUST be tagged with the ProjectName, ConfigName and ServiceName of the release to ensure that resources are assigned correctly.

Services can also have an Instance Profile defined by the profile key that is and instance profile Name tag. The roles path MUST be equal to /<project_name>/<config_name>/<service_name>/.

Scale

Odin makes it easy to scale both vertically and horizontally. To scale deploy-test we add to the release:

{ ...
  "services": {
    "web": { ...
      "instance_type": "c4.xlarge",
      "ebs_volume_size": 20,
      "ebs_volume_type": "gp2",
      "ebs_device_name": "/dev/sda1",
      "autoscaling": {
        "min_size": 3,
        "max_size": 5,
        "spread": 0.2,
        "max_terms": 1,
        "policies": [
          {
            "type": "cpu_scale_up",
            "threshold" : 25,
            "scaling_adjustment": 2
          },
          {
            "type": "cpu_scale_down",
            "threshold" : 15,
            "scaling_adjustment": -1
          }
        ]
      }
    }
  }
}
  • instance_type is the EC2 instance type for the service
  • ebs_volume_size, ebs_volume_type, ebs_device_name define the attached EBS volume in GB.

The autoscaling key defines the horizontal scaling of a service:

  • all calculations are bounded by min_size and max_size.
  • the desired_capacity is equal to the min_size or capacity of the previously launched service
  • the actual number of instances launched is the desired_capacity * (1 + spread)
  • to be deemed the healthy the service must have desired_capacity * (1 - spread)
  • if the number of terminating is greater than or equal to max_terms (default 0), the release is immediately halts.
  • policies are defined above to increase the desired_capacity by 2 instances if the CPU goes above 25% and reduce by 1 instance if it drops below 15%.

Both spread and max_terms are useful when launching many instances because as scale increases the number of cloud errors increase.

User Data

Do not put sensitive data into user data. User data is easily accessible from the AWS console, difficult to secure with IAM, and very limited in size. Odin requires user data passed to it to be KMS encrypted, uploaded to S3, and a SHA256 be passed in the release to be checked. The userdata will still be accessible in plain text on a launch configuration and EC2 instances, so these precautions are more to protect tampering than secrets.

For any secret an instance needs access to, we recommend using Vault, AWS Parameter store, or KMS encrypted S3 authenticated by a service's instance profile.

The user data is KMS encrypted and uploaded to S3. Odin will replace some strings with information about the release, project, config and service, e.g.:

...
write_files:
  - path: /
    content: |
      {{RELEASE_ID}}
      {{PROJECT_NAME}}
      {{CONFIG_NAME}}
      {{SERVICE_NAME}}

Odin will replace {{PROJECT_NAME}} with the name of the project and {{SERVICE_NAME}} with the name of the service. This can be useful for getting service specific configuration and logging.

The odin client will upload the user data for the services from the <release_file>.userdata file, e.g. deployer-test-release.json.userdata.

Timeout

A release can have a timeout which is how long in seconds a release will wait for its services to become healthy. By default the timeout is 10 minutes, the max value would be around a year (31556926 seconds) since that is how long a step function can run.

Lifecycle

AWS provides Auto Scaling Group Lifecycle Hooks to detect and react to auto-scaling events. You can add the lifecycle hooks to the ASGs with:

{ ...
  "lifecycle": {
    "termhook" : {
      "transition": "autoscaling:EC2_INSTANCE_TERMINATING",
      "role": "asg_lifecycle_hooks",
      "sns": "asg_lifecycle_hooks",
      "heartbeat_timeout": 300
    }
  }
}

These can be used to gracefully shutdown instances, which is necessary if a service has long running jobs e.g. a worker service.

Halt

Odin supports manually stopping a release while is it being deployed. Just execute:

odin halt deploy-test-release.json

This will:

  1. Find the running deploy for the project configuration
  2. Write a halt file to S3
  3. Wait for Odin to detect the halt file and fail the deploy

Halt does not guarantee that the release will not be deployed, if executed too late the release may still result in success.

DO NOT use Stop execution of the Odin step function as it will not clean up resources and leave AWS in a bad state.

Security

Deployers are critical pieces of infrastructure as they may be used to compromise software they deploy. As such, we take security very seriously around the odin and try to answer the following questions:

  1. Authentication: Who can deploy?
  2. Authorization: What can be deployed?
  3. Replay and Man-in-the-middle (MITM): Can some unauthorized person edit or reuse a release to change what is deployed?
  4. Audit: Who has done what, and when?

Authentication

The central authentication mechanisms are the AWS IAM permissions for step functions and S3.

By limiting the ec2:CreateAutoscalingGroup, permissions the Odin function becomes the only way to deploy ASG's. Then limiting permissions to who can call states:StartExecution for Odin limits who can deploy.

Ensuring that Odin's lambda can only access a single S3 bucket, further limits who can deploy with:

{
  "Effect": "Allow",
  "Action": [
    "s3:GetObject*", "s3:PutObject*",
    "s3:List*", "s3:DeleteObject*"
  ],
  "Resource": [
    "arn:aws:s3:::#{s3_bucket_name}/*",
    "arn:aws:s3:::#{s3_bucket_name}"
  ]
},
{
  "Effect": "Deny",
  "Action": ["s3:*"],
  "NotResource": [
    "arn:aws:s3:::#{s3_bucket_name}/*",
    "arn:aws:s3:::#{s3_bucket_name}"
  ]
},

The Odin step function also needs to decrypt the KMS encrypted user-data that is uploaded to S3. By default it is encrypted with the alias/aws/s3 key, but a custom KMS key can be used and wither an alias or arn can be added to user_data_kms_key. A custom key will give a better audit trail, and can lock down who can release even more.

Who can execute the step function, and who can upload to S3 are the two permissions that guard who can deploy.

Authorization

All resources that can be used in a Odin deploy must opt-in using tags or paths. Additionally, service resources require specific tags or paths denoting which project/config/service can use them.

Assets uploaded to S3 are in the path /<ProjectName>/<ConfigName> so limiting who can s3:PutObject to a path can be used to limit what project-configs they can deploy or halt.

Replay and MITM

Each release the client generates a release release_id, a created_at date, and together also uploads the release to S3.

The odin will reject any request where the created_at date is not recent, or the release sent to the step function and S3 don't match. This means that if a user can invoke the step function, but not upload to S3 (or vice-versa) it is not possible to deploy old or malicious code.

Audit

Working out what happened and when is very useful for debugging and security response. Step functions make it easy to see the history of all executions in the AWS console and via API. S3 can log all access to cloud-trail, so collecting from these two sources will show all information about a deploy.

Continuing Deployment

There is always more to do:

  1. Allow LifeCycle Hooks to send to Cloudwatch.
  2. Subnet, AMI, life cycle and userdata overrides per service.
  3. Check EC2 instance limits and capacity before deploying.
  4. Slowly scale (Canary) instances up rather than all at once, e.g. deploy 1 instance check it is healthy then deploy the rest.
  5. Add ELB and Target Group error rates when checking healthy.
  6. Custom auto-scaling policy types.