Self-Hosted Runners, EKS, and Leapp Design Decisions (#684)
Co-authored-by: Erik Osterman (CEO @ Cloud Posse) <[email protected]>
Showing 8 changed files with 291 additions and 2 deletions.
docs/layers/eks/design-decisions/decide-on-secrets-management-for-eks.md
---
title: "Decide on Secrets Management for EKS"
sidebar_label: "Secrets Management for EKS"
description: Decide on the secrets management strategy for EKS.
---
import Intro from '@site/src/components/Intro';
import KeyPoints from '@site/src/components/KeyPoints';

<Intro>
We need to decide on a secrets management strategy for EKS. We prefer storing secrets externally, like in AWS SSM Parameter Store, to keep clusters more disposable. If we decide on this, we'll need a way to pull these secrets into Kubernetes.
</Intro>

## Problem

We aim to design our Kubernetes clusters to be disposable and ephemeral, treating them like cattle rather than pets. This influences how we manage secrets. Ideally, Kubernetes should not be the sole source of truth for secrets, though we still want to leverage Kubernetes’ native `Secret` resource. If the cluster experiences a failure, storing secrets exclusively within Kubernetes risks losing access to them. Additionally, keeping secrets only in Kubernetes limits integration with other services.

To address this, several solutions allow secrets to be stored externally (as the source of truth) while still utilizing Kubernetes' `Secret` resources. These solutions, including some open-source tools and recent offerings from Amazon, enhance resilience and interoperability. Any approach must respect IAM permissions and ensure secure secret management for applications running on EKS. We have several options to consider that balance external secret storage with Kubernetes-native functionality.

### Option 1: External Secrets Operator

Use [External Secrets Operator](https://external-secrets.io/latest/) with AWS SSM Parameter Store.

External Secrets Operator is a Kubernetes operator that manages and stores sensitive information in external secret management systems like AWS Secrets Manager, GCP Secret Manager, Azure Key Vault, HashiCorp Vault, and more. It allows you to use these external secret management systems to securely add secrets in your Kubernetes cluster.

Cloud Posse historically recommends using External Secrets Operator with AWS SSM Parameter Store and has existing Terraform modules to support this solution. See the [eks/external-secrets-operator](/components/library/aws/eks/external-secrets-operator/) component.
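
As a rough sketch of how this fits together (the store name, parameter path, and secret names below are illustrative, not values from our components), the operator is configured with a `SecretStore` pointing at SSM Parameter Store and an `ExternalSecret` per secret to sync:

```yaml
# Sketch only: illustrative names and paths, assuming IRSA is already configured for the operator
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-parameter-store
spec:
  provider:
    aws:
      service: ParameterStore        # read from SSM Parameter Store
      region: us-east-1
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-secrets
spec:
  refreshInterval: 1h                # periodically re-sync from SSM
  secretStoreRef:
    name: aws-parameter-store
    kind: SecretStore
  target:
    name: app-secrets                # native Kubernetes Secret created by the operator
  data:
    - secretKey: DATABASE_PASSWORD
      remoteRef:
        key: /platform/example/database_password   # hypothetical SSM parameter path
```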

### Option 2: AWS Secrets Manager secrets with Kubernetes Secrets Store CSI Driver

Use [AWS Secrets and Configuration Provider (ASCP) for the Kubernetes Secrets Store CSI Driver](https://docs.aws.amazon.com/secretsmanager/latest/userguide/integrating_csi_driver.html). This option allows you to use AWS Secrets Manager secrets as Kubernetes secrets that can be accessed by Pods as environment variables or files mounted in the pods. The ASCP also works with [Parameter Store parameters](https://docs.aws.amazon.com/systems-manager/latest/userguide/integrating_csi_driver.html).
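
As a rough sketch (resource and secret names are hypothetical), the ASCP is driven by a `SecretProviderClass` that maps AWS secrets to files mounted via the CSI driver and can optionally sync them into a native Kubernetes `Secret`:

```yaml
# Sketch only: pods reference this class via a CSI volume using the secrets-store.csi.k8s.io driver
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: app-aws-secrets
spec:
  provider: aws
  parameters:
    objects: |
      - objectName: "myapp/prod/database-password"   # hypothetical Secrets Manager secret
        objectType: "secretsmanager"
  secretObjects:                      # optionally sync into a native Kubernetes Secret
    - secretName: app-secrets
      type: Opaque
      data:
        - objectName: "myapp/prod/database-password"
          key: DATABASE_PASSWORD
```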

However, Cloud Posse does not have existing Terraform modules for this solution. We would need to build this support.

### Option 3: SOPS Operator

Use [SOPS Operator](https://github.com/isindir/sops-secrets-operator) to manage secrets in Kubernetes. SOPS Operator is a Kubernetes operator that builds on the `sops` project by Mozilla to encrypt the sensitive portions of a `Secrets` manifest into a `SopsSecret` resource, and then decrypt and provision `Secrets` in the Kubernetes cluster.

1. **Mozilla SOPS Encryption**: Mozilla SOPS (Secrets OPerationS) is a tool that encrypts Kubernetes secret manifests, allowing them to be stored securely in Git repositories. SOPS supports encryption using a variety of key management services. Most importantly, it supports AWS KMS, which enables IAM capabilities for native integration with AWS.

2. **GitOps-Compatible Secret Management**: In a GitOps setup, storing plain-text secrets in Git poses security risks. Using SOPS, we can encrypt sensitive data in Kubernetes secret manifests while keeping the rest of the manifest in clear text. This allows us to store encrypted secrets in Git, track changes with diffs, and maintain security while benefiting from GitOps practices like version control, auditability, and CI/CD pipelines.

3. **AWS KMS Integration**: SOPS uses AWS KMS to encrypt secrets with customer-managed keys (CMKs), ensuring only authorized users—based on IAM policies—can decrypt them. The encrypted secret manifests can be safely committed to Git, with AWS securely managing the keys. Since it's IAM-based, it integrates seamlessly with STS tokens, allowing secrets to be decrypted inside the cluster without hardcoded credentials.

4. **Kubernetes Operator**: The [SOPS Secrets Operator](https://github.com/isindir/sops-secrets-operator) automates the decryption and management of Kubernetes secrets. It monitors a `SopsSecret` resource containing encrypted secrets. When a change is detected, the operator decrypts the secrets using AWS KMS and generates a native Kubernetes `Secret`, making them available to applications in the cluster. AWS KMS uses envelope encryption to manage the encryption keys, ensuring that secrets remain securely encrypted at rest.

5. **Improved Disaster Recovery and Security**: By storing the source of truth for secrets outside of Kubernetes (e.g., in Git), this setup enhances disaster recovery, ensuring secrets remain accessible even if the cluster is compromised or destroyed. While secrets are duplicated across multiple locations, security is maintained by using IAM for encryption and decryption outside Kubernetes, and Kubernetes' native Role-Based Access Control (RBAC) model for managing access within the cluster. This ensures that only authorized entities, both external and internal to Kubernetes, can access the secrets.

The SOPS Operator combines the strengths of Mozilla SOPS and AWS KMS, allowing you to:
- Encrypt secrets using KMS keys.
- Store encrypted secrets in Git repositories.
- Automatically decrypt and manage secrets in Kubernetes using the SOPS Operator.
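
To make the workflow concrete, here is a rough sketch of what a `SopsSecret` manifest looks like before encryption, based on the operator's `v1alpha3` API at the time of writing (names and values are hypothetical). The file is encrypted in place, for example with `sops --encrypt --in-place --kms <key-arn> secret.yaml`, before being committed to Git:

```yaml
# Sketch only: encrypt this file with SOPS/KMS before committing it to Git
apiVersion: isindir.github.com/v1alpha3
kind: SopsSecret
metadata:
  name: example-sopssecret
spec:
  secretTemplates:
    - name: app-secrets               # the Kubernetes Secret the operator will create
      stringData:
        DATABASE_PASSWORD: changeme   # hypothetical value; stored encrypted in Git
```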

This solution is ideal for teams following GitOps principles, offering secure, external management of sensitive information while utilizing Kubernetes' secret management capabilities. However, the redeployment required for secret rotation can be heavy-handed, potentially leading to a period where services are still using outdated or invalid secrets. This could cause services to fail until the new secrets are fully rolled out.

## Recommendation

We recommend using the External Secrets Operator with AWS SSM Parameter Store. This is a well-tested solution that we have used in the past, and we have existing Terraform modules to support it.

However, we are in the process of evaluating the AWS Secrets Manager secrets with Kubernetes Secrets Store CSI Driver solution. This is the AWS-supported option and may be a better long-term solution. We will build the required Terraform component to support it.
## Consequences

We will develop the `eks/secrets-store-csi-driver` component using the [Secrets Store CSI Driver](https://secrets-store-csi-driver.sigs.k8s.io/getting-started/installation).
...s/github-actions/design-decisions/decide-on-self-hosted-runner-architecture.mdx
---
title: "Decide on Self-Hosted Runner Architecture"
sidebar_label: Runner Architecture
description: Decide how to create self-hosted runners
---

import Intro from "@site/src/components/Intro";
import Note from '@site/src/components/Note';

<Intro>
Decide on how to operate self-hosted runners that are used to run GitHub Actions workflows. These runners can be set up in various ways and allow us to avoid platform fees while running CI jobs in private infrastructure, enabling access to VPC resources. This approach is ideal for private repositories, providing control over instance size and architecture, and helping control costs by leveraging spot instances. The right choice depends on your platform, whether you’re using predominantly EKS, ECS, or Lambda.
</Intro>

## Problem

When using GitHub Actions, you can opt for both GitHub Cloud-hosted and self-hosted runners, and they can complement each other. In some cases, self-hosted runners are essential—particularly for accessing resources within a VPC, such as databases, Kubernetes API endpoints, or Kafka servers, which is common in GitOps workflows.

However, while self-hosted runners are ideal for private infrastructure, they pose risks in public or open-source repositories due to potential exposure of sensitive resources. If your organization maintains open-source projects, this should be a critical consideration, and we recommend using cloud-hosted runners for those tasks.

The hosting approach for self-hosted runners should align with your infrastructure. If you use Kubernetes, it's generally best to run your runners on Kubernetes. Conversely, if your infrastructure relies on ECS or Lambdas, you may want to avoid unnecessary Kubernetes dependencies and opt for alternative hosting methods.

In Kubernetes-based setups, configuring node pools with Karpenter is key to maintaining stability and ensuring effective auto-scaling with a mix of spot and on-demand instances. However, tuning this setup can be challenging, especially with recent changes to the Actions Runner Controller (ARC), where the [newer version does not support multiple labels for runner groups](https://github.com/actions/actions-runner-controller/issues/2445), leading to community disagreement over trade-offs. We provide multiple deployment options for self-hosted runners, including EKS, Philips Labs' solution, and Auto Scaling Groups (ASG), tailored to your specific runner management needs.
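
For example, a minimal Karpenter `NodePool` for runner nodes that mixes spot and on-demand capacity might look roughly like the following (illustrative names and limits, assuming the Karpenter `v1` API and an existing `EC2NodeClass` named `default`):

```yaml
# Sketch only: a dedicated node pool for CI runners, preferring spot with on-demand fallback
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: github-runners
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                  # assumed to exist already
  limits:
    cpu: "256"                         # cap total runner capacity
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
```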

## Considered Options

### Option 1: EC2 Instances in an Auto Scaling Group (`github-runners`)

The first option is to deploy EC2 instances in an Auto Scaling Group. This is the simplest option. We can use the `github-runners` component to deploy the runners. However, this option is not as scalable as the other options.

### Option 2: Actions Runner Controller on EKS (`eks/actions-runner-controller`)

The second option is to deploy the Actions Runner Controller on EKS. Since many implementations already have EKS, this option is a good choice to reuse existing infrastructure.

We can use the `eks/actions-runner-controller` component to deploy the runners, which is built with the [Actions Runner Controller helm chart](https://github.com/actions/actions-runner-controller).

### Option 3: GitHub Actions Runner on EKS (`eks/github-actions-runner`)

Alternatively, we can deploy the GitHub Actions Runner on EKS. This option is similar to the previous one, but it uses the GitHub Actions Runner instead of the Actions Runner Controller.

This component deploys self-hosted GitHub Actions Runners and a [Controller](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/quickstart-for-actions-runner-controller#introduction) on an EKS cluster, using "[runner scale sets](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/deploying-runner-scale-sets-with-actions-runner-controller#runner-scale-set)".

This solution is supported by GitHub and supersedes the [actions-runner-controller](https://github.com/actions/actions-runner-controller/blob/master/docs/about-arc.md) developed by Summerwind and deployed by Cloud Posse's [actions-runner-controller](https://docs.cloudposse.com/components/library/aws/eks/actions-runner-controller/) component.

However, there are some limitations to the official Runner Sets implementation:

- #### Limited set of packages

The runner image used by Runner Sets contains [no more packages than are necessary](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/about-actions-runner-controller#about-the-runner-container-image) to run the runner. This is in contrast to the Summerwind implementation, which contains some commonly needed packages like `build-essential`, `curl`, `wget`, `git`, and `jq`, and the GitHub hosted images which contain a robust set of tools. (This is a limitation of the official Runner Sets implementation, not this component per se.) You will need to install any tools you need in your workflows, either as part of your workflow (recommended), by maintaining a [custom runner image](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/about-actions-runner-controller#creating-your-own-runner-image), or by running such steps in a [separate container](https://docs.github.com/en/actions/using-jobs/running-jobs-in-a-container) that has the tools pre-installed. Many tools have publicly available actions to install them, such as `actions/setup-node` to install NodeJS or `dcarbone/install-jq-action` to install `jq`. You can also install packages using `awalsh128/cache-apt-pkgs-action`, which has the advantage of being able to skip the installation if the package is already installed, so you can more efficiently run the same workflow on GitHub hosted as well as self-hosted runners.

<Note title="Feature Requests">There are (as of this writing) open feature requests to add some commonly needed packages to the official Runner Sets runner image. You can upvote these requests [here](https://github.com/actions/actions-runner-controller/discussions/3168) and [here](https://github.com/orgs/community/discussions/80868) to help get them implemented.</Note>
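
As a rough illustration of installing tools at the start of the workflow itself (the workflow, runner label, and package list below are hypothetical):

```yaml
# Sketch only: install needed tools at job start so the job runs on a lean self-hosted runner
name: build
on: pull_request
jobs:
  build:
    runs-on: self-hosted-runner-group   # hypothetical runner scale set name
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4     # install NodeJS via a public action
        with:
          node-version: 20
      - uses: awalsh128/cache-apt-pkgs-action@latest
        with:
          packages: jq build-essential  # cached apt packages; skipped if already installed
      - run: jq --version
```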

- #### Docker in Docker (dind) mode only

In the current version of this component, only "dind" (Docker in Docker) mode has been tested. Support for "kubernetes" mode is provided, but has not been validated.

- #### Limited configuration options

Many elements in the Controller chart are not directly configurable by named inputs. To configure them, you can use the `controller.chart_values` input or create a `resources/values-controller.yaml` file in the component to supply values.

Almost all the features of the Runner Scale Set chart are configurable by named inputs. The exceptions are:

- There is no specific input for specifying an outbound HTTP proxy.
- There is no specific input for supplying a [custom certificate authority (CA) certificate](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/deploying-runner-scale-sets-with-actions-runner-controller#custom-tls-certificates) to use when connecting to GitHub Enterprise Server.

You can specify these values by creating a `resources/values-runner.yaml` file in the component and setting values as shown by the default Helm [values.yaml](https://github.com/actions/actions-runner-controller/blob/master/charts/gha-runner-scale-set/values.yaml), and they will be applied to all runners.
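
For example, a `resources/values-runner.yaml` covering both exceptions might look roughly like this (the proxy URL and ConfigMap name are assumptions; consult the chart's values.yaml for the authoritative schema):

```yaml
# Sketch only: outbound proxy and custom CA for GitHub Enterprise Server
proxy:
  https:
    url: http://proxy.example.com:3128   # hypothetical outbound proxy
  noProxy:
    - .cluster.local
githubServerTLS:
  certificateFrom:
    configMapKeyRef:
      name: ghes-ca-bundle               # ConfigMap holding the CA certificate (assumed name)
      key: ca.crt
  runnerMountPath: /usr/local/share/ca-certificates/
```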

- #### Component limitations

Furthermore, the Cloud Posse component has some additional limitations. In particular:

- The controller and all runners and listeners share the Image Pull Secrets. You cannot use different ones for different runners.
- All the runners use the same GitHub secret (app or PAT). Using a GitHub app is preferred anyway, and the single GitHub app serves the entire organization.
- Only one controller is supported per cluster, though it can have multiple replicas.

These limitations could be addressed if there is demand. Contact [Cloud Posse Professional Services](https://cloudposse.com/professional-services/) if you would be interested in sponsoring the development of any of these features.
### Option 4: Philips Labs Runners (`philips-labs-github-runners`) | ||
|
||
If we are not deploying EKS, it's not worth the additional effort to set up Self-Hosted runners on EKS. Instead, we deploy Self-Hosted runners on EC2 instances. These are managed by an API Gateway and Lambda function that will automatically scale the number of runners based on the number of pending jobs in the queue. The queue is written to by the API Gateway from GitHub Events. | ||
|
||
For more on this option, see the [Philips Labs GitHub Runner](https://philips-labs.github.io/terraform-aws-github-runner/) documentation. | ||
|
||
### Option 5: Managed Runners | ||
|
||
There are a number of services that offer managed runners. These still have the advantage over GitHub Cloud hosted runners as the can be managed within you private VPCs. | ||
|
||
One option to consider is [runs-on.com](https://runs-on.com/) which provides a very inexpensive option. | ||
|

## Recommendation

At this time, Cloud Posse recommends the Actions Runner Controller on EKS (`eks/actions-runner-controller`) if you are using EKS, and the Philips Labs Runners (`philips-labs-github-runners`) if you are not using EKS.
...yers/github-actions/design-decisions/decide-on-self-hosted-runner-placement.mdx
---
title: "Decide on Self-Hosted Runner Placement"
sidebar_label: Runner Placement
description: Decide where to place self-hosted runners in your AWS organization
---
import Intro from '@site/src/components/Intro';

<Intro>
Self-hosted runners are custom runners that we use to run GitHub Actions workflows. We can use these runners to access resources in our private networks and reduce costs by using our own infrastructure. We need to decide where to place these runners in your AWS organization.
</Intro>

## Problem

We need to decide where to place self-hosted runners in your AWS organization.

We support multiple options for deploying self-hosted runners: on EKS, with the Philips Labs solution, or with an ASG. For this decision, we will focus on the placement of the runners themselves.
## Considered Options | ||
|
||
### Option 1: Deploy the runners in an `auto` account | ||
|
||
The first option is to deploy the controller in the `auto` (Automation) account. This account would be dedicated to automation tasks and would have access to all other accounts. We can use this account to deploy the controller and manage the runners in a centralized location. | ||
|
||
However, compliance is complicated because the `auto` cluster would have access to all environments. | ||
|
||
### Option 2: Deploy the runners in each account | ||
|
||
The second option is to deploy the controller in each account. This option sounds great from a compliance standpoint. Jobs running in each account are scoped to that account, each account has its own controller, and we can manage the runners independently. | ||
|
||
This might seem like a simplification from a compliance standpoint, but it creates complexity from an implementation standpoint. We would need to carefully consider the following: | ||
|
||
1. Scaling runners can inadvertently impact IP space available to production workloads | ||
2. Many accounts do not have a VPC or EKS Cluster (for EKS/ARC solutions). So, we would need to decide how to manage those accounts. | ||
3. We would need to manage the complexity of dynamically selecting the right runner pool when a workflow starts. While this might seem straightforward, it can get tricky in cases like promoting an ECR image from staging to production, where it’s not always clear-cut which runners should be used. | ||
|

## Recommendation

_Option 1: Deploy the runners in an `auto` account_

We will deploy the runners in an `auto` account. This account will be connected to the private network and will have access to all other accounts where necessary. This will simplify the management of the runners and ensure that they are available when needed.

## Consequences

We will create an `auto` account and deploy the runners there.