- How it Works
- The Demo Policies
- Prerequisites
- Installation
- Concepts and Terms
- Working with Policies
- More About Modes
Cloud Custodian is a rules engine for managing AWS resources at scale. You define the rules that your resources should follow, and Cloud Custodian automatically provisions event sources and AWS Lambda functions to enforce those rules. Instead of writing custom serverless workflows, you can manage resources across all of your accounts via simple YAML files.
Cloud Custodian documentation: here
The policies are split out into four different files, to showcase the different uses of Cloud Custodian. Each Cloud Custodian policy file has a corresponding IAM policy file; this IAM policy contains the permissions required if you choose to execute the Cloud Custodian policy via a Lambda function.
- Cloud Custodian policy file:
cost-control.yml
- IAM policy file:
cost-control-permissions.json
Policies:
Together, these policies turn on instances during business hours (8 am to 8pm), and turn them off in the evening. Perfect for dev/test environments; if implemented, this results in a ~50% decrease in instance costs.
This watches for large instances that are running at less than 30% CPU utilization for a given period of time, and resizes them to the next-smaller instance type.
The cost savings for this policy can be significant: a single m4.10xlarge instance, resized to a m4.4xlarge, will save $878.40 a month.
This policy terminates instances that are older than 30 days. Not something you'd want to run in production, but ideal for dev accounts where resources tend to get created...and forgotten. Cleaning up just 5 abandoned m4.xlarge instances (forgotten auto-scaling group, anyone?)results in a $732/month cost reduction.
What to do with orphaned EBS volumes you're no longer using? Delete them! This deletes any EBS volume that's not attached to an instance. Again, not something you'd want to run in prod, but a handy tool for dev accounts.
- Cloud Custodian policy file:
tagging.yml
- IAM policy file:
tagging-permissions.json
Policies:
This tags instances with Custodian: true
as they enter the running state, to signify that the resource is being managed by Cloud Custodian. All other policies are applied to resources with this tag. If a resource is not intended to be managed by Custodian policies, the tag can be removed.
Let's say that you want to have all instances tagged with a CostCenter
tag. However, people are people, which means they'll forget to add tags. Every day, this will find instances that don't have these tags, and send emails to the owners of these resources. (This assumes that instances have an OwnerContact tag that contains an email address for the person who created it.)
- Cloud Custodian policy file:
security.yml
- IAM policy file:
security-permissions.json
Policies:
Need something to take action immediately if a bucket is created with (or given) a public ACL? Even though ACLs are being deprecated in favor of bucket policies, it still happens. This policy watches for the relevant CloudTrail events, and then removes public grants if they are found.
Creating a security group that opens up SSH to the world is a bad idea. This policy stops it. When a CloudTrail event associated with security group rule creation comes in, it detects whether it allows SSH from anywhere - and promptly removes the rule if it does.
Like the "global SSH is a bad thing" policy (above), this one watches for the creation of security group rules allowing access to the world on all port ranges...and then deletes those rules.
- Cloud Custodian policy file:
iam.yml
- IAM policy file:
iam-permissions.json
These policies are in a separate file because all IAM-related policies must be run in us-east-1
(the home of IAM). Policies:
Want to know when some attaches a policy with admin (* on *) privileges to a user, group, or role? This is your policy.
If you want to take this admin-permissions thing one step further, this policy will alert you when a policy is created with * on * permissions.
Install python, pip, and Pipenv
To install Cloud Custodian, run:
$ git clone this repo
$ pipenv install
$ pipenv shell
$ custodian -h
- Policy Policies first specify a resource type, then filter those resources, and finally apply actions to those selected resources. Policies are written in YML format.
- Resource Within your policy, you write filters and actions to apply to different resource types (e.g. EC2, S3, RDS, etc.). Resources are retrieved via the AWS API; each resource type has different filters and actions that can be applied to it.
- Filter Filters are used to target the specific subset of resources that you're interested in. Some examples: EC2 instances more than 90 days old; S3 buckets that violate tagging conventions.
- Action Once you've filtered a given list of resources to your liking, you apply actions to those resources. Actions are verbs: e.g. stop, start, encrypt.
- Mode
Mode
specifies how you would like the policy to be deployed. If no mode is given, the policy will be executed once, from the CLI, and no lambda will be created. (This is often calledpull mode
in the documentation.) If your policy contains amode
, then a lambda will be created, plus any other resources required to trigger that lambda (e.g. CloudWatch event, Config rule, etc.). Check out the More About Modes section for more info.
The EC2 instance policies are laid out in the policy.yml
file.
The filters and actions available to you vary depending on the targeted resource (e.g. EC2, S3, etc.). Cloud Custodian has good CLI documentation to help you find the right filter or action for your needs. To get started, run
custodian schema -h
to see all the different resources that Cloud Custodian supports. Then, to see the resource-specific filters/actions, run
custodian schema [resourceName]
for example:
custodian schema EC2
will list the filters and actions available for EC2. For details on a specific action or filter, the format is:
custodian schema [resourceName].[actions or filters].[action or filter name]
for example:
custodian schema EC2.filters.instance-age
Before running the policy, you'll need to give the resulting Lambda function the permissions required. Use the IAM policy provided for each Cloud Custodian policy file as a starting place: create the policy, attach it to a new role, and update the Cloud Custodian policy with the ARN of that role.
Once you've updated your policies, you'll want to run the Cloud Custodian validator to check the file for errors:
custodian validate policy.yml
It's a good idea to "dry-run" policies before actually deploying them. A "dry-run" will query the AWS API for resources that match the given filters, then save those resources in a resources.json
file for each policy.
custodian run --dryrun -s output policy.yml
The --dryrun
option ensures that no actions are taken, while the -s
option specifies the path for output files (in this case, the output directory). Each policy will have its own subdirectory containing the output files. In this example, the resources selected by a policy named ec2-terminate-old-instances
would be contained in output/ec2-terminate-old-instances/resources.json
.
Once you're satisfied with the results of your filters, deploy the policies with:
custodian run -s . policy.yml
Cloud Custodian will take care of creating the needed lambda functions, CloudWatch events, etc. Sit back and watch it work!
Some of the policies send notifications via SNS, email, or Slack. To send notifications, you'll need to implement the Mailer tool in your account. Instructions on how to do this are in usingTheMailer.md. An IAM policy with permissions required by the mailer is in mailer-permissions.json.
Modes can be confusing. Here are the different mode types, what they do, and what their yml
block should look like:
asg-instance-state
triggers the lambda in response to Auto Scaling group state events (e.g. Auto Scaling launched an instance). The four events supported are:
yml option | AWS ASG event detail-type |
---|---|
launch-success |
"EC2 Instance Launch Successful" |
launch-failure |
"EC2 Instance Launch Unsuccessful" |
terminate-success |
"EC2 Instance Terminate Successful" |
terminate-failure |
"EC2 Instance Terminate Unsuccessful" |
Example:
mode:
role: #ARN of the IAM role you want the lambda to use
type: asg-instance-state
events:
- launch-success
cloudtrail
triggers the lambda in response to CloudTrail events. The cloudtrail
type comes in a couple different flavors: events for which there are shortcuts, and all other events.
For very common API calls, Cloud Custodian has defined some shortcuts to target commonly-used CloudTrail events.
As of this writing, the available shortcuts are:
- ConsoleLogin
- CreateAutoScalingGroup
- UpdateAutoScalingGroup
- CreateBucket
- CreateCluster
- CreateLoadBalancer
- CreateLoadBalancerPolicy
- CreateDBInstance
- CreateVolume
- SetLoadBalancerPoliciesOfListener
- CreateElasticsearchDomain
- CreateTable
- RunInstances
For those shortcuts, you simply need to specify:
mode:
role: #ARN of the IAM role you want the lambda to use
type: cloudtrail
events:
- RunInstances
You can also trigger your lambda via any other CloudTrail event; you'll just have to add two more pieces of information. First, you need the source API call - e.g. ec2.amazonaws.com
. Secondly, you need a JMESPath query to extract the resource IDs from the event. For example, if RunInstances
wasn't already a shortcut, you would specify it like so:
mode:
role: #ARN of the IAM role you want the lambda to use
type: cloudtrail
events:
- source: ec2.amazonaws.com
event: RunInstances
ids: "responseElements.instancesSet.items[].instanceId"
config-rule
creates a custom Config Rule to trigger the lambda. Config rules themselves can only be triggered by configuration changes; triggering rules periodically is not supported. Use the periodic
mode type instead.
Example:
mode:
role: #ARN of the IAM role you want the lambda to use
type: config-rule
ec2-instance-state
triggers the lambda in response to EC2 instance state events (e.g. an instance being created and entering pending
state).
Available states:
- pending
- running
- stopping
- stopped
- shutting-down
- terminated
- rebooting
Example:
mode:
role: #ARN of the IAM role you want the lambda to use
type: ec2-instance-state
events:
- pending
With guard-duty
, your lambda will be triggered by GuardDuty findings. The associated filters would then look at the guard duty event detail - e.g. severity
or type
.
mode:
role: #ARN of the IAM role you want the lambda to use
type: periodic
filters:
- type: event
key: detail.severity
op: gte
value: 4.5
periodic
creates a CloudWatch event to trigger the lambda on a given schedule. The schedule
is specified using scheduler syntax.
Example:
mode:
role: #ARN of the IAM role you want the lambda to use
type: periodic
schedule: "rate(1 day)"