late 2022 - adding pre-apply run task and other #29

Open: wants to merge 11 commits into `master`
10 changes: 10 additions & 0 deletions tfc-agent-ecs/consumer/main.tf
@@ -29,7 +29,17 @@ data "aws_ami" "amazon-linux" {
  }
}

resource "aws_vpc" "example" {
  cidr_block = var.cidr_block
}

resource "aws_subnet" "example" {
  vpc_id     = aws_vpc.example.id
  cidr_block = var.cidr_block
}

resource "aws_instance" "example" {
  ami           = data.aws_ami.amazon-linux.id
  instance_type = "t3.micro"
  subnet_id     = aws_subnet.example.id
}
5 changes: 5 additions & 0 deletions tfc-agent-ecs/consumer/variables.tf
@@ -2,6 +2,11 @@ variable "aws_role_arn" {
  description = "Amazon Resource Name of the role to be assumed (this was created in the producer workspace)"
}

variable "cidr_block" {
  description = "VPC CIDR block"
  default     = "10.0.0.0/16"
}

variable "region" {
  description = "The region where the resources are created."
  default     = "us-west-2"
12 changes: 6 additions & 6 deletions tfc-agent-ecs/producer/README.md
@@ -24,18 +24,18 @@ Additionally, these may now be created and managed with Terraform due to the add
Prior to the addition of these resources to the tfe provider, I had written helper scripts to create and revoke agent tokens using the Terraform Cloud API. Those scripts remain available [here](files/README.md).

## Autoscaling tfc-agent with a Lambda Function
I've included a Lambda function that, when combined with [Terraform Cloud notifications](https://www.terraform.io/docs/cloud/workspaces/notifications.html), enables autoscaling the number of Terraform Cloud Agents running.
I've included a sample Lambda function that, when combined with [Run Tasks](https://developer.hashicorp.com/terraform/cloud-docs/workspaces/settings/run-tasks) and [Workspace Notifications](https://www.terraform.io/docs/cloud/workspaces/notifications.html), enables autoscaling the number of Terraform Cloud Agents running.

![notification_config](./files/notification_config.png)
Before a plan is started, a pre-plan task will boot an agent for the plan. Once the plan completes, a post-plan task will remove it. When the plan is confirmed, a pre-apply task will boot an agent for the apply. When that completes, a workspace event notification will remove it.

To use it, you'll need to:
1. Configure the `desired_count` and `max_count` Terraform variables as desired. `desired_count` sets the baseline number of agents to always be running. `max_count` sets the maximum number of agents allowed to be running at one time.

2. Configure a [generic notification](https://www.terraform.io/docs/cloud/workspaces/notifications.html#creating-a-notification-configuration) on each Terraform Cloud workspace that will be using an agent (workspace [execution mode](https://www.terraform.io/docs/cloud/workspaces/settings.html#execution-mode) set to `Agent`). I've included a helper script that will create them for you; however, you can always create and manage these in the Terraform Cloud workspace settings. You could also use the [Terraform Enterprise provider](https://registry.terraform.io/providers/hashicorp/tfe/latest/docs).

That's it! When a run is queued, Terraform Cloud will send a notification to the Lambda function, increasing the number of running agents. When the run is completed, Terraform Cloud will send another notification to the Lambda function, decreasing the number of running agents.
3. In your organization settings, create three run tasks, one for each stage. Provide the webhook Lambda URL as the `Endpoint URL`. The `HMAC key` must match the value provided as the `notification_token` Terraform variable.

Note: [Speculative Plans](https://www.terraform.io/docs/cloud/run/index.html#speculative-plans) do not trigger this autoscaling.
4. In the consumer workspace settings, add each run task: one for pre-plan, one for post-plan, and one for pre-apply.
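The scaling rules described above reduce to one decision per incoming request. A minimal sketch of that decision in Python, assuming the payload fields the handler in this PR reads (`task_result_callback_url`, `task_result_enforcement_level`, `stage`, and `notifications[0].run_status`); the helper `scaling_operation` and the `TERMINAL_STATUSES` set are hypothetical and only illustrate the flow:

```python
from typing import Optional

# Run task stages that should start an agent before the phase needs it.
ADD_STAGES = {"pre_plan", "pre_apply"}
# Run task stages that fire after a phase finishes, so its agent can be released.
SUB_STAGES = {"post_plan", "post_apply"}
# Hypothetical set of terminal run statuses seen on the notification path.
TERMINAL_STATUSES = {"applied", "canceled", "errored"}


def scaling_operation(payload: dict) -> Optional[str]:
    """Return 'add', 'sub', or None for a run task or notification payload.

    Hypothetical helper: it only decides; it does not touch ECS or SSM.
    """
    if "task_result_callback_url" in payload:  # run task request
        # Test requests (sent when a run task is set up) change nothing.
        if payload.get("task_result_enforcement_level") == "test":
            return None
        stage = payload.get("stage")
        if stage in ADD_STAGES:
            return "add"
        if stage in SUB_STAGES:
            return "sub"
        return None

    # Anything else is treated as a workspace notification.
    notifications = payload.get("notifications") or [{}]
    if notifications[0].get("run_status") in TERMINAL_STATUSES:
        return "sub"
    return None
```

Keeping the decision separate from the ECS and SSM calls makes each stage transition easy to check in isolation.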

### Add Notification to Workspaces script

@@ -47,7 +47,7 @@ Note: [Speculative Plans](https://www.terraform.io/docs/cloud/run/index.html#spe

Example usage:
```
→ ./files/add_notification_to_workspaces.sh hashidemos andys-lab https://h8alki27g6.execute-api.us-west-2.amazonaws.com/test
→ ./files/add_notification_to_workspaces.sh hashidemos andys-lab https://z27xbc52zarorvekotaweysk3y0xexqm.lambda-url.us-west-2.on.aws/
```

Here's an example usage with the [TFE provider](https://registry.terraform.io/providers/hashicorp/tfe/latest/docs):
@@ -56,7 +56,7 @@ resource "tfe_notification_configuration" "agent_lambda_webhook" {
  name                  = "tfc-agent"
  enabled               = true
  destination_type      = "generic"
  triggers              = ["run:created", "run:completed", "run:errored"]
  triggers              = ["run:completed", "run:errored"]
  url                   = data.terraform_remote_state.tfc-agent-ecs-producer.outputs.webhook_url
  workspace_external_id = tfe_workspace.test.id
}
36 changes: 36 additions & 0 deletions tfc-agent-ecs/producer/doormat.tf
@@ -0,0 +1,36 @@
variable "TFC_WORKSPACE_NAME" {
  type    = string
  default = "" # Empty default avoids an error when running on a backend other than Terraform Cloud
}

data "tfe_outputs" "doormat_role" {
  organization = "hashidemos"
  workspace    = "doormat-aws-infra"
}

provider "doormat" {}

data "doormat_aws_credentials" "creds" {
  provider = doormat

  role_arn = "${data.tfe_outputs.doormat_role.values.role_arn_base}${var.TFC_WORKSPACE_NAME}"
}

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 4.31.0"
    }
    doormat = {
      source  = "doormat.hashicorp.services/hashicorp-security/doormat"
      version = "~> 0.0.2"
    }
    tfe = {
      source  = "hashicorp/tfe"
      version = ">= 0.26.0"
    }
  }

  required_version = ">= 1.1"
}
@@ -33,7 +33,6 @@ read -r -d '' NOTIFICATION_CONFIGURATION_PAYLOAD << EOM
"token": "$HMAC_SALT",
"triggers": [
"run:completed",
"run:created",
"run:errored"
]
}
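The payload above is what the helper script sends for each workspace; this PR drops the `run:created` trigger because the pre-plan run task now starts the agent. A sketch of the same request body built in Python, assuming the JSON:API attribute names of the Terraform Cloud notification configurations API (`destination-type`, `enabled`, `name`, `token`, `url`, `triggers`); the function name is hypothetical, and the script's heredoc may set fields not visible in this hunk:

```python
import json


def notification_configuration_payload(name: str, url: str, token: str) -> str:
    """Build a JSON:API body for a generic workspace notification (hypothetical helper)."""
    body = {
        "data": {
            "type": "notification-configurations",
            "attributes": {
                "destination-type": "generic",
                "enabled": True,
                "name": name,
                # Shared HMAC key; the Lambda verifies signatures with the same value.
                "token": token,
                "url": url,
                # run:created is no longer needed; the pre-plan run task starts the agent.
                "triggers": ["run:completed", "run:errored"],
            },
        }
    }
    return json.dumps(body)
```

The resulting string would be sent as a `POST` with `Content-Type: application/vnd.api+json` to the workspace's notification configurations endpoint.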
138 changes: 102 additions & 36 deletions tfc-agent-ecs/producer/files/main.py
@@ -1,8 +1,11 @@
import boto3
"""A webhook receiver for starting/stopping tfc-agents"""

import hashlib
import hmac
import json
import os
import boto3
import requests


CLUSTER = os.getenv("CLUSTER", None)
@@ -13,69 +16,127 @@
SSM_PARAM_NAME = os.getenv("SSM_PARAM_NAME", None)


ADD_SERVICE_STATES = {'pending'}
SUB_SERVICE_STATES = {
    'errored',
    'canceled',
    'discarded',
    'planned_and_finished',
    'applied',
    'completed'
    'canceled',
    'errored'
}


# Initialize boto3 client at global scope for connection reuse
session = boto3.Session(region_name=REGION)
ssm = session.client('ssm')
ecs = session.client('ecs')


def lambda_handler(event, context):
def lambda_handler(event, _context):
    """Primary handler for incoming requests"""
    print(event)
    message = bytes(event['body'], 'utf-8')
    secret = bytes(ssm.get_parameter(Name=SALT_PATH, WithDecryption=True)['Parameter']['Value'], 'utf-8')
    hash = hmac.new(secret, message, hashlib.sha512)
    if hash.hexdigest() == event['headers']['X-Tfe-Notification-Signature']:
        # HMAC verified
        if event['httpMethod'] == "POST":
            return post(event)
        return get()
    return 'Invalid HMAC'
    secret = bytes(ssm.get_parameter(Name=SALT_PATH, WithDecryption=True)[
        'Parameter']['Value'], 'utf-8')
    calculated_hash = hmac.new(secret, message, hashlib.sha512)
    headers = {k.lower(): v for k, v in event['headers'].items()}

    if 'x-tfe-notification-signature' in headers:  # notification
        if calculated_hash.hexdigest() == headers['x-tfe-notification-signature']:
            # Notification HMAC verified
            if 'requestContext' in event:
                if 'http' in event['requestContext']:
                    if event['requestContext']['http']['method'] == "POST":
                        return post(event)
                if 'httpMethod' in event['requestContext']:
                    if event['requestContext']['httpMethod'] == "POST":
                        return post(event)
            return get()
        return 'Invalid HMAC'

    if 'x-tfc-task-signature' in headers:  # run task
        if calculated_hash.hexdigest() == headers['x-tfc-task-signature']:
            # Run Task HMAC verified
            if 'requestContext' in event:
                if 'http' in event['requestContext']:
                    if event['requestContext']['http']['method'] == "POST":
                        return post(event)
                if 'httpMethod' in event['requestContext']:
                    if event['requestContext']['httpMethod'] == "POST":
                        return post(event)
            return get()
        return 'Invalid HMAC'

    return None
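The new handler repeats the same verify-then-route logic for the two signature headers. A more compact sketch, assuming both headers carry a hex HMAC-SHA512 of the raw body (as the code above computes); `verify_signature` and `request_method` are hypothetical names, and `hmac.compare_digest` is swapped in for `==` to get a comparison designed to resist timing attacks:

```python
import hashlib
import hmac
from typing import Optional

# Signature headers the handler checks, lower-cased as in the PR.
SIGNATURE_HEADERS = ("x-tfe-notification-signature", "x-tfc-task-signature")


def verify_signature(event: dict, secret: bytes) -> bool:
    """Return True if a known signature header matches HMAC-SHA512(secret, body)."""
    # Normalize header names; casing can differ between event sources.
    headers = {k.lower(): v for k, v in (event.get("headers") or {}).items()}
    expected = hmac.new(secret, event["body"].encode("utf-8"), hashlib.sha512).hexdigest()
    for name in SIGNATURE_HEADERS:
        if name in headers:
            # compare_digest is meant to avoid timing leaks that == can have.
            return hmac.compare_digest(expected.encode("ascii"), headers[name].encode("utf-8"))
    return False


def request_method(event: dict) -> Optional[str]:
    """Return the HTTP method from a payload format 2.0 or 1.0 Lambda event."""
    ctx = event.get("requestContext") or {}
    if "http" in ctx:  # function URL / HTTP API (format 2.0)
        return ctx["http"].get("method")
    return ctx.get("httpMethod")  # REST API (format 1.0)
```

With these, the handler body could become a single check: verify, then call `post(event)` when `request_method(event) == "POST"`.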


def get():
    """Handler for GET requests"""
    return {
        "statusCode": 200,
        "headers": {
            "Content-Type": "application/json",
            "Access-Control-Allow-Origin": "*"
        },
        "body": "I'm here!"
    }
    }


def post(event):
    """Handler for POST requests"""
    payload = json.loads(event['body'])
    post_response = "I'm here!"

    response = ecs.describe_services(
    ecs_response = ecs.describe_services(
        cluster=CLUSTER,
        services=[
            SERVICE,
        ]
    )

    service_count = response['services'][0]['desiredCount']
    print("Current service count:", int(service_count))
    service_count = ecs_response['services'][0]['desiredCount']
    print(f"Current service count: {int(service_count)}")

    if payload and 'run_status' in payload['notifications'][0]:
        body = payload['notifications'][0]
        if body['run_status'] in ADD_SERVICE_STATES:
    if 'task_result_callback_url' in payload:  # it's a run task
        if payload['task_result_enforcement_level'] == 'test':
            return {
                "statusCode": 200,
                "body": json.dumps(post_response)
            }

        if payload['stage'] == 'pre_apply' or payload['stage'] == 'pre_plan':
            post_response = update_service_count(ecs, 'add')
            print("Run status indicates add an agent.")
        elif body['run_status'] in SUB_SERVICE_STATES:
            print(f"Run task indicates add an agent for {payload['run_id']}.")
            print(f"{payload['run_app_url']}")

            tfc_headers = {'Authorization': 'Bearer ' + payload['access_token'],
                           'Content-Type': 'application/vnd.api+json'}
            tfc_body = {"data": {"type": "task-results",
                                 "attributes": {"status": "passed",
                                                "message": "tfc-agent autosleeper"}}}
            callback_response = requests.patch(
                payload['task_result_callback_url'], headers=tfc_headers, json=tfc_body)
            print('Callback response from TFC:', callback_response.status_code,
                  callback_response.text)

        if payload['stage'] == 'post_apply' or payload['stage'] == 'post_plan':
            post_response = update_service_count(ecs, 'sub')
            print("Run status indicates subtract an agent.")
            print(f"Run task indicates subtract an agent for {payload['run_id']}.")
            print(f"{payload['run_app_url']}")

            tfc_headers = {'Authorization': 'Bearer ' + payload['access_token'],
                           'Content-Type': 'application/vnd.api+json'}
            tfc_body = {"data": {"type": "task-results",
                                 "attributes": {"status": "passed",
                                                "message": "tfc-agent autosleeper"}}}
            callback_response = requests.patch(
                payload['task_result_callback_url'], headers=tfc_headers, json=tfc_body)
            print('Callback response:', callback_response.status_code,
                  callback_response.text)
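Both run task branches above build the same `task-results` body and PATCH it to `task_result_callback_url`. A sketch that factors this into one helper, using the standard library's `urllib.request` in place of `requests`; `task_result_request` is a hypothetical name, and the body and headers are copied from the code in this hunk:

```python
import json
import urllib.request


def task_result_request(payload: dict, message: str = "tfc-agent autosleeper") -> urllib.request.Request:
    """Build (but do not send) the PATCH that reports a passed run task result.

    Hypothetical helper mirroring the requests.patch calls in post().
    """
    body = {
        "data": {
            "type": "task-results",
            "attributes": {"status": "passed", "message": message},
        }
    }
    return urllib.request.Request(
        payload["task_result_callback_url"],
        data=json.dumps(body).encode("utf-8"),
        headers={
            # Per-run token that arrives in the run task request payload.
            "Authorization": "Bearer " + payload["access_token"],
            "Content-Type": "application/vnd.api+json",
        },
        method="PATCH",
    )
```

Sending it would then be `urllib.request.urlopen(req, timeout=10)`; building the request separately keeps it easy to unit-test without network access.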

    else:  # it's a workspace notification
        if payload and 'run_status' in payload['notifications'][0]:
            body = payload['notifications'][0]
            if body['run_status'] in SUB_SERVICE_STATES:
                post_response = update_service_count(ecs, 'sub')
                print(f"Run status indicates subtract an agent for {payload['run_id']}.")
                print(f"{payload['run_url']}")

    return {
        "statusCode": 200,
@@ -84,21 +145,26 @@ def post(event):


def update_service_count(client, operation):
    num_runs_queued = int(ssm.get_parameter(Name=SSM_PARAM_NAME)['Parameter']['Value'])
    if operation is 'add':
    """Increase or decrease number of agents"""
    num_runs_queued = int(ssm.get_parameter(
        Name=SSM_PARAM_NAME)['Parameter']['Value'])
    if operation == 'add':
        num_runs_queued = num_runs_queued + 1
    elif operation is 'sub':
        num_runs_queued=num_runs_queued - 1 if num_runs_queued > 0 else 0
    elif operation == 'sub':
        num_runs_queued = num_runs_queued - 1 if num_runs_queued > 0 else 0
    else:
        return
    response = ssm.put_parameter(Name=SSM_PARAM_NAME, Value=str(num_runs_queued), Type='String', Overwrite=True)
        return None

    ssm.put_parameter(Name=SSM_PARAM_NAME, Value=str(
        num_runs_queued), Type='String', Overwrite=True)

    desired_count=int(MAX_AGENTS) if num_runs_queued > int(MAX_AGENTS) else num_runs_queued
    desired_count = int(MAX_AGENTS) if num_runs_queued > int(
        MAX_AGENTS) else num_runs_queued
    client.update_service(
        cluster=CLUSTER,
        service=SERVICE,
        desiredCount=desired_count
    )

    print("Updated service count:", desired_count)
    return("Updated service count:", desired_count)
    print(f"Updated service count: {desired_count}")
    return ("Updated service count:", desired_count)
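The counting logic in `update_service_count()` can be separated from the SSM and ECS calls. A minimal sketch, assuming `MAX_AGENTS` is passed in as an integer; `next_counts` is a hypothetical name, and the caller would still read and write the SSM parameter and call `update_service`:

```python
from typing import Optional, Tuple


def next_counts(queued: int, operation: str, max_agents: int) -> Optional[Tuple[int, int]]:
    """Return (new_queued, desired_count), or None for an unknown operation.

    Hypothetical pure version of update_service_count(); no AWS calls.
    """
    if operation == "add":  # == rather than `is`, matching the fix in this PR
        queued += 1
    elif operation == "sub":
        queued = max(queued - 1, 0)  # never drop below zero
    else:
        return None
    # The queued counter is tracked as-is, but the ECS desiredCount is capped.
    return queued, min(queued, max_agents)
```

Keeping the arithmetic pure makes the `==` versus `is` fix easy to cover with tests. Note that reading the SSM parameter and later calling `put_parameter` is not atomic, so two overlapping invocations could read the same value; this sketch does not address that.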
Binary file modified tfc-agent-ecs/producer/files/webhook.zip
Binary file not shown.