From c037f2b52d9199194fbe829773cfefa21134a3d6 Mon Sep 17 00:00:00 2001
From: Poornima Krishnasamy
Date: Wed, 27 Sep 2023 17:35:46 +0100
Subject: [PATCH] Update runbooks and bump review date

---
 ...add-new-receiver-alert-manager.html.md.erb |   4 +-
 .../add-nodes-to-the-eks-cluster.html.md.erb  |  87 ++++--------
 runbooks/source/auth0-rotation.html.md.erb    |   4 +-
 runbooks/source/aws-create-user.html.md.erb   |   4 +-
 runbooks/source/bastion-node.html.md.erb      |   4 +-
 .../delete-prometheus-metrics.html.md.erb     |   4 +-
 runbooks/source/delete-state-lock.html.md.erb |   4 +-
 .../export-elasticsearch-to-csv.html.md.erb   |   4 +-
 .../rotate-user-aws-credentials.html.md.erb   | 126 ++++--------------
 .../upgrade-cluster-components.html.md.erb    |   4 +-
 .../upgrade-terraform-version.html.md.erb     |   8 +-
 .../upgrade-user-components.html.md.erb       |   8 +-
 12 files changed, 72 insertions(+), 189 deletions(-)

diff --git a/runbooks/source/add-new-receiver-alert-manager.html.md.erb b/runbooks/source/add-new-receiver-alert-manager.html.md.erb
index d8acf1df..2ba1bdd0 100644
--- a/runbooks/source/add-new-receiver-alert-manager.html.md.erb
+++ b/runbooks/source/add-new-receiver-alert-manager.html.md.erb
@@ -1,8 +1,8 @@
 ---
 title: Add a new Alertmanager receiver and a slack webhook
 weight: 85
-last_reviewed_on: 2023-06-12
-review_in: 3 months
+last_reviewed_on: 2023-09-27
+review_in: 6 months
 ---

 # Add a new Alertmanager receiver and a slack webhook
diff --git a/runbooks/source/add-nodes-to-the-eks-cluster.html.md.erb b/runbooks/source/add-nodes-to-the-eks-cluster.html.md.erb
index 3f86be41..b18e6cfc 100644
--- a/runbooks/source/add-nodes-to-the-eks-cluster.html.md.erb
+++ b/runbooks/source/add-nodes-to-the-eks-cluster.html.md.erb
@@ -1,18 +1,16 @@
 ---
-title: Add nodes/change the instance type of the AWS EKS cluster
+title: Add nodes to the AWS EKS cluster
 weight: 65
-last_reviewed_on: 2023-05-22
-review_in: 3 months
+last_reviewed_on: 2023-09-27
+review_in: 6 months
 ---

-# Add nodes/change the instance type of
the AWS EKS cluster
+# Add nodes to the AWS EKS cluster

-This runbook covers how to increase the number of nodes in an eks cluster and/or change the instance type (worker_node_machine_type)
+This runbook covers how to increase the number of nodes in an eks cluster
 This can address the problem of CPU high usage/load

-## Add nodes to the eks cluster
-
 ### Requirements

 #### 1. Ensure you have access to the Cloud Platform AWS account

@@ -30,79 +28,44 @@ Use `git crypt unlock` to see the following code:

 ```
-  node_groups = {
-    default_ng = {
-      desired_capacity = var.cluster_node_count
-      max_capacity = 30
-      min_capacity = 1
-      subnets = data.aws_subnet_ids.private.ids
-
-      instance_type = var.worker_node_machine_type
-      k8s_labels = {
-        Terraform = "true"
-        Cluster = local.cluster_name
-        Domain = local.cluster_base_domain_name
-      }
-      additional_tags = {
-        default_ng = "true"
-      }
-    }
-```
-
-#### [Variable.tf](https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/main/terraform/cloud-platform-eks/variables.tf)
-
-```
-variable "vpc_name" {
-  description = "The VPC name where the cluster(s) are going to be provisioned. VPCs are created in cloud-platform-network"
-  default = ""
+node_groups_count = {
+  live = "64"
+  live-2 = "7"
+  manager = "4"
+  default = "3"
 }
-
-variable "cluster_node_count" {
-  description = "The number of worker node in the cluster"
-  default = "4"
-}
-
-variable "worker_node_machine_type" {
-  description = "The AWS EC2 instance types to use for worker nodes"
-  default = "m4.large"
+# Default node group minimum capacity
+default_ng_min_count = {
+  live = "45"
+  live-2 = "2"
+  manager = "4"
+  default = "2"
 }
 ```

-### Issue
-
-There is an issue that you cannot update the default "cluster_node_count" (in isolation) with terraform - unless you increase the default "worker_node_machine_type" too.
-The issue is to do with auto-scaling complexities utilising Terrafom - please see [here](https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/docs/autoscaling.md#notes)
-
-Therefore you either have to update default "worker_node_machine_type" to - in above example "m4.xlarge" and also the default "cluster_node_count" to - in above example "5" or "6"
-Or you have to edit the "Desired size" in the "AWS EKS dashboard Edit Node Group" (once you have carried out the AWS dashboard change - update the terraform config, `terraform apply` accordingly - so that it is in sync with the AWS dashboard):
-
-#### [AWS dashboard EKS - Edit Node Group:](https://eu-west-2.console.aws.amazon.com/eks/home?region=eu-west-2#/clusters/manager/nodegroups/manager-default_ng-composed-sculpin/edit-nodegroup)
+#### AWS dashboard EKS - Edit Node Group

 ```
 Group size
 Minimum size
 Set the minimum number of nodes that the group can scale in to.
-1
+2 nodes
 Maximum size
 Set the maximum number of nodes that the group can scale out to.
-30
+85 nodes
 Desired size
 Set the desired number of nodes that the group should launch with initially.
-4
+3 nodes
 ```

-## Change the AWS EKS instance type (worker_node_machine_type)
-
-* update default "worker_node_machine_type" to - in above example "m4.xlarge"
-
-* A 'terraform plan' will show that that it will replace the existing nodes
+Modifying the node_groups_count in terraform will not update the desired size of the EKS cluster nor increase the actual node count. It's a design decision the
+module has taken. Refer to issue [#835](https://github.com/terraform-aws-modules/terraform-aws-eks/issues/835).

-* `terraform apply' the changes in the usual way
+To increase/decrease the desired node group count, we need to use the AWS dashboard. Log in to the AWS dashboard and navigate to EKS -> Select Cluster -> Select Compute tab.
+Choose the Node Group you want to edit and click Edit. Change the desired size and click Save Changes.
-* monitor how the update is going in the [AWS Autoscaling dashboard:](https://eu-west-2.console.aws.amazon.com/ec2/autoscaling/home?region=eu-west-2#AutoScalingGroups:view=tags;filter=eks)
+Watch the number of nodes using `kubectl get nodes`. You should see the new nodes getting created to match the desired size.

-Note that it will create the instances/nodes before it deletes the existing - so there should be no down time
diff --git a/runbooks/source/auth0-rotation.html.md.erb b/runbooks/source/auth0-rotation.html.md.erb
index 7d89d86c..62420980 100644
--- a/runbooks/source/auth0-rotation.html.md.erb
+++ b/runbooks/source/auth0-rotation.html.md.erb
@@ -1,8 +1,8 @@
 ---
 title: Credentials rotation for auth0 apps
 weight: 68
-last_reviewed_on: 2023-06-12
-review_in: 3 months
+last_reviewed_on: 2023-09-27
+review_in: 6 months
 ---

 # <%= current_page.data.title %>
diff --git a/runbooks/source/aws-create-user.html.md.erb b/runbooks/source/aws-create-user.html.md.erb
index 688c991c..b6f00986 100644
--- a/runbooks/source/aws-create-user.html.md.erb
+++ b/runbooks/source/aws-create-user.html.md.erb
@@ -1,8 +1,8 @@
 ---
 title: AWS Console Access
 weight: 115
-last_reviewed_on: 2023-06-12
-review_in: 3 months
+last_reviewed_on: 2023-09-27
+review_in: 6 months
 ---

 # AWS Console Access
diff --git a/runbooks/source/bastion-node.html.md.erb b/runbooks/source/bastion-node.html.md.erb
index 154e95f9..63bbf1ec 100644
--- a/runbooks/source/bastion-node.html.md.erb
+++ b/runbooks/source/bastion-node.html.md.erb
@@ -1,8 +1,8 @@
 ---
 title: Create and access bastion node
 weight: 97
-last_reviewed_on: 2023-06-12
-review_in: 3 months
+last_reviewed_on: 2023-09-27
+review_in: 6 months
 ---

 # Create and access bastion node.
diff --git a/runbooks/source/delete-prometheus-metrics.html.md.erb b/runbooks/source/delete-prometheus-metrics.html.md.erb
index 044fe606..815c14c2 100644
--- a/runbooks/source/delete-prometheus-metrics.html.md.erb
+++ b/runbooks/source/delete-prometheus-metrics.html.md.erb
@@ -1,8 +1,8 @@
 ---
 title: Delete Prometheus Metrics
 weight: 170
-last_reviewed_on: 2023-05-15
-review_in: 3 months
+last_reviewed_on: 2023-09-27
+review_in: 6 months
 ---

 # <%= current_page.data.title %>
diff --git a/runbooks/source/delete-state-lock.html.md.erb b/runbooks/source/delete-state-lock.html.md.erb
index 2389830f..86d4ed4f 100644
--- a/runbooks/source/delete-state-lock.html.md.erb
+++ b/runbooks/source/delete-state-lock.html.md.erb
@@ -1,8 +1,8 @@
 ---
 title: Delete terraform state lock
 weight: 199
-last_reviewed_on: 2023-06-05
-review_in: 3 months
+last_reviewed_on: 2023-09-27
+review_in: 6 months
 ---

 # <%= current_page.data.title %>
diff --git a/runbooks/source/export-elasticsearch-to-csv.html.md.erb b/runbooks/source/export-elasticsearch-to-csv.html.md.erb
index c3a716bd..abd133fb 100644
--- a/runbooks/source/export-elasticsearch-to-csv.html.md.erb
+++ b/runbooks/source/export-elasticsearch-to-csv.html.md.erb
@@ -1,8 +1,8 @@
 ---
 title: Export data from AWS Elasticsearch into a CSV file
 weight: 190
-last_reviewed_on: 2023-06-12
-review_in: 3 months
+last_reviewed_on: 2023-09-27
+review_in: 6 months
 ---

 # Export data from Elasticsearch into a CSV file
diff --git a/runbooks/source/rotate-user-aws-credentials.html.md.erb b/runbooks/source/rotate-user-aws-credentials.html.md.erb
index 9247232b..947a6853 100644
--- a/runbooks/source/rotate-user-aws-credentials.html.md.erb
+++ b/runbooks/source/rotate-user-aws-credentials.html.md.erb
@@ -1,8 +1,8 @@
 ---
 title: Rotate User Credentials
 weight: 100
-last_reviewed_on: 2023-05-15
-review_in: 3 months
+last_reviewed_on: 2023-09-27
+review_in: 6 months
 ---

 # Rotate User AWS Credentials
@@ -34,11 +34,6 @@ make tools-shell

 If the changes involve
applying "pingdom_check", set the environment variables for pingdom. The values are stored as secrets in `manager` cluster - `concourse-main` namespace.

-```
-export PINGDOM_USER="XXXXXXXXXXX"
-export PINGDOM_PASSWORD="XXXXXXXXXXXX"
-export PINGDOM_API_KEY="XXXXXXXXXXXXX"
-```

 ## Target the live cluster

@@ -49,12 +44,15 @@ aws eks --region eu-west-2 update-kubeconfig --name live
 ## Set cluster related environment variables

 ```bash
-# TF_VAR_cluster_name is referencing VPC name
-export TF_VAR_cluster_name="live"
-export TF_VAR_cluster_state_bucket=cloud-platform-terraform-state
-export TF_VAR_cluster_state_key="cloud-platform/live/terraform.tfstate"
+export TF_VAR_vpc_name="live-1"
+export TF_VAR_eks_cluster_name="live"
+export TF_VAR_github_owner="ministryofjustice"
+export TF_VAR_github_token="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
+export TF_VAR_kubernetes_cluster="DF366E49809688A3B16EEC29707D8C09.gr7.eu-west-2.eks.amazonaws.com"
+export PINGDOM_API_TOKEN='XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
 #needed by tf k8s provider
 export KUBE_CONFIG_PATH=${HOME}/.kube/config
+export KUBECONFIG=${HOME}/.kube/config
 ```

 ## Set the namespace name
@@ -72,12 +70,12 @@ cd namespaces/live.cloud-platform.service.justice.gov.uk/${NAMESPACE}/resources
 ```
 terraform init \
   -backend-config="bucket=cloud-platform-terraform-state" \
-  -backend-config="key=cloud-platform-environments/live.cloud-platform.service.justice.gov.uk/${NAMESPACE}/terraform.tfstate" \
+  -backend-config="key=cloud-platform-environments/live-1.cloud-platform.service.justice.gov.uk/${NAMESPACE}/terraform.tfstate" \
   -backend-config="region=eu-west-1" \
   -backend-config="dynamodb_table=cloud-platform-environments-terraform-lock"
 ```

-Note: Bucket key above is referencing to "live", as state is stored in "live.cloud-platform.service.justice.gov.uk" for namespaces in "live" cluster.
+Note: Bucket key above is referencing to "live-1", as state is stored in "live-1.cloud-platform.service.justice.gov.uk" for namespaces in "live" cluster.

 ## Terraform Plan/Apply

@@ -96,28 +94,25 @@ Look for the compromised access key. In this case, the access key was AKIAXXXXXX

 ```
 ...
-kubernetes_secret.ecr-repo-my-repo:
-  id = my-namespace/ecr-repo-my-repo
-  data.% = 3
+kubernetes_secret.iam-credentials:
+  id = my-namespace/iam-credentials
   data.access_key_id = AKIAXXXXXXXXXXXXXXXX
-  data.repo_url = ...
   data.secret_access_key = ...
 ...
-module.ecr-repo-my-repo.aws_iam_access_key.key:
+aws_iam_access_key.user:
   id = AKIAXXXXXXXXXXXXXXXX
   secret = ...
-  ses_smtp_password = ...
 ...
 ```

 The first occurence is when terraform stores the credentials in a kubernetes secret.

-The second occurence is the one we want, where terraform creates the credentials (the `aws_iam_access_key.key` resource, for the ECR repo).
+The second occurrence is the one we want, where terraform creates the credentials (the `aws_iam_access_key.key` resource, for the IAM User).

 ### 3. Destroy the compromised key

 ```bash
-$ terraform destroy --target=module.ecr-repo-my-repo.aws_iam_access_key.key
+$ terraform destroy --target=aws_iam_access_key.key
 ```

 If this looks like it's going to do the right thing, enter 'yes' to confirm.
@@ -129,7 +124,7 @@ Resource actions are indicated with the following symbols:

 Terraform will perform the following actions:

-  - module.ecr-repo-my-repo.aws_iam_access_key.key
+  - aws_iam_access_key.key

 Plan: 0 to add, 0 to change, 1 to destroy.
 ```
@@ -150,20 +145,18 @@ Resource actions are indicated with the following symbols:

 Terraform will perform the following actions:

-  ~ kubernetes_secret.ecr-repo-my-repo
-      data.%: "" =>
+  ~ kubernetes_secret.iam-credentials
       data.access_key_id: "AKIAXXXXXXXXXXXXXXXX" => ""
-      data.repo_url: "..." => ""
       data.secret_access_key: "..."
=> ""

-  + module.ecr-repo-my-repo.aws_iam_access_key.key
+  + aws_iam_access_key.key
       id:
       encrypted_secret:
       key_fingerprint:
       secret:
       ses_smtp_password:
       status:
-      user: "ecr-user-0000000000000000"
+      user: "iam-user-0000000000000000"

 Plan: 1 to add, 1 to change, 0 to destroy.
 ```
@@ -178,85 +171,12 @@ If this looks like it's going to do the right thing, enter 'yes' to confirm.

 At this point, a new set of AWS credentials should have been created for the existing IAM user, and the kubernetes secret should contain the new access key and secret.

-# Rotate RDS Credentials
-
-If a user's RDS credentials(database_username and database_password) have been exposed. Follow "Rotate User AWS Credentials" guidance above until step 2,
-in step 3 instead of "Destroy the compromised key", we should [taint][tf-taint] the password.
-
-```bash
-$ terraform taint module.rds.random_password.password
-```
-
-Running taint will show the below message.
-
-```
-Resource instance module.rds.random_password.password has been marked as tainted.
-```
-
-### Let terraform create a new password
-
-```bash
-$ terraform plan
-```
-
-This should report that it will **create** a new `rds.random_password.password` resource and **modify** the corresponding `kubernetes_secret` resource.
-
-```
-Terraform will perform the following actions:
-
-  # kubernetes_secret.rds will be updated in-place
-  ~ resource "kubernetes_secret" "rds" {
-      ~ data = (sensitive value)
-        id = "rds/postgres"
-    }
-
-  # module.rds.aws_db_instance.rds will be updated in-place
-  ~ resource "aws_db_instance" "rds" {
-        id = "cloud-platform-xxxxxx"
-        name = "dbxxxxxxx"
-      ~ password = (sensitive value)
-        tags = {
-            "application" = "app"
-            "business-unit" = "business-unit"
-            "environment-name" = "production"
-            "infrastructure-support" = "team@digital.justice.gov.uk"
-            "is-production" = "true"
-            "namespace" = "namespace-name"
-            "owner" = "namespace-woner"
-        }
-
-      + timeouts {
-          + create = "2h"
-          + delete = "2h"
-          + update = "2h"
-        }
-    }
-
-  # module.rds.random_password.password is tainted, so must be replaced
--/+ resource "random_password" "password" {
-      ~ id = "none" -> (known after apply)
-      ~ result = (sensitive value)
-        # (9 unchanged attributes hidden)
-    }
-
-Plan: 1 to add, 2 to change, 1 to destroy.
-```
-
-If all is well:
-
-```bash
-$ terraform apply
-```
-
-If this looks like it's going to do the right thing, enter 'yes' to confirm.
-
-At this point, a new RDS password should have been created, and the kubernetes secret should contain the db password.
-
-Note: It is possible that applications might experience downtime if, for example, a pod which was launched with the old password drops a DB connection and tries to open a new one (which will fail, because the password is no longer valid). To make pods pick up the new password, perform a _manual_ rollout on every relevant deployment:
+Note: It is possible that applications might experience downtime if, for example, a pod which was launched with the old credentials drops the connection to AWS and tries to open a new one (which will fail, because the old credentials are no longer valid).
+To make pods pick up the new credentials, perform a _manual_ rollout on every relevant deployment:

 ```bash
 kubectl rollout restart "deployment/{deployment}" -namespace="{namespace}"
 ```

-This will rotate all pods according to the rollout strategy used in deployments in the namespace, which will pick up the new DB password from the kubernetes secret.
+This will rotate all pods according to the rollout strategy used in deployments in the namespace, which will pick up the new IAM keys from the kubernetes secret.

 [env-repo]: https://github.com/ministryofjustice/cloud-platform-environments
 [tf-taint]: https://www.terraform.io/cli/commands/taint
diff --git a/runbooks/source/upgrade-cluster-components.html.md.erb b/runbooks/source/upgrade-cluster-components.html.md.erb
index 4e39ccf3..a7a89431 100644
--- a/runbooks/source/upgrade-cluster-components.html.md.erb
+++ b/runbooks/source/upgrade-cluster-components.html.md.erb
@@ -1,8 +1,8 @@
 ---
 title: Upgrade cluster components
 weight: 54
-last_reviewed_on: 2023-05-15
-review_in: 3 months
+last_reviewed_on: 2023-09-27
+review_in: 6 months
 ---

 # Upgrade cluster components
diff --git a/runbooks/source/upgrade-terraform-version.html.md.erb b/runbooks/source/upgrade-terraform-version.html.md.erb
index 85a70e85..c820c566 100644
--- a/runbooks/source/upgrade-terraform-version.html.md.erb
+++ b/runbooks/source/upgrade-terraform-version.html.md.erb
@@ -1,17 +1,17 @@
 ---
 title: Upgrade Terraform Version
 weight: 54
-last_reviewed_on: 2023-05-22
-review_in: 3 months
+last_reviewed_on: 2023-09-27
+review_in: 6 months
 ---

 # Upgrade Terraform Version

 ## Introduction
-The intention of this document is to provide you with a method to upgrade the Terraform version used in state across the MoJ Cloud Platform. This document won't go into minutiae detail on how to perform each task, as each upgrade will require different levels of attention.
+The intention of this document is to provide you with a method to upgrade the Terraform version used in state across the MoJ Cloud Platform. This document won't go into minute detail on how to perform each task, as each upgrade will require different levels of attention.

 ## Recommendations
-- Install [TF Switch](https://github.com/warrensbox/terraform-switcher) to allow you to switch between Terraform versions.
+- Install [tfenv](https://github.com/tfutils/tfenv) to allow you to switch between Terraform versions.

 ## Caveats
 This document was originally written following the Terraform 0.13 to 0.14 upgrade, it's worth noting this was the best course of action for that particular upgrade. Over time this document will evolve and the process of upgrading will improve.
diff --git a/runbooks/source/upgrade-user-components.html.md.erb b/runbooks/source/upgrade-user-components.html.md.erb
index 28dcfbd6..50188dd4 100644
--- a/runbooks/source/upgrade-user-components.html.md.erb
+++ b/runbooks/source/upgrade-user-components.html.md.erb
@@ -1,8 +1,8 @@
 ---
 title: Upgrade user components
 weight: 55
-last_reviewed_on: 2023-06-12
-review_in: 3 months
+last_reviewed_on: 2023-09-26
+review_in: 6 months
 ---

 # Upgrade user components
@@ -37,13 +37,13 @@ The [cloud-platform-environments-repo] repository contains the namespaces for al
 Clone the repository and branch off main. Run the following command:

 ```bash
-cloud-platform environments bump-module --module --module-version
+cloud-platform environment bump-module --module --module-version
 ```

 The `module-name` flag must contain a word in the module source. For example, if you were to upgrade the serviceaccount module to 0.5.0, you would run the following command:

 ```bash
-cloud-platform bump-module --module serviceaccount --module-version 0.5.0
+cloud-platform environment bump-module --module serviceaccount --module-version 0.5.0
 ```

 The CLI will make changes to the local copy of the repository.
It's your responsibility to commit these changes to your branch.