
fix: avoid cluster recreation when migrating from v17.x #1933

Closed

Conversation

@renato0307 commented Mar 10, 2022

Description

Solves two issues:

  1. Avoids recreation of the cluster during migration from v17.24 to v18.x.
  2. Adds support for additional cluster security group rules without the need to create a new security group (which would cause cluster recreation); see the sketch after this list.
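
For illustration, a minimal usage sketch of what point 2 enables, reusing the inputs shown later in this PR (the security group ID and rule key are placeholders):

module "cluster" {
  source = "terraform-aws-modules/eks/aws"
  # ... cluster_name, cluster_version, vpc_id, subnet_ids, etc.

  # Reuse the security group of the existing v17.x cluster instead of creating a new one
  create_cluster_security_group = false
  cluster_security_group_id     = "sg-xxxxxxxxxx"

  # Additional rules can still be attached to that existing security group
  cluster_security_group_additional_rules = {
    ingress_nodes_all = {
      description                = "Access from node_security_group"
      protocol                   = "-1"
      from_port                  = 0
      to_port                    = 0
      type                       = "ingress"
      source_node_security_group = true
    }
  }
}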

Motivation and Context

We are migrating the module from v17.24.0 to v18.x and the changes force a cluster replacement, which is not acceptable in our context.

The v17.24 cluster is being created like this:

module "cluster" {
  source           = "terraform-aws-modules/eks/aws"
  version          = "17.24.0"
  cluster_name     = local.stamp_name
  cluster_version  = local.kubernetes_version
  subnets          = module.vpc.private_subnets
  vpc_id           = module.vpc.vpc_id
  enable_irsa      = true
  write_kubeconfig = false
  manage_aws_auth  = false

  manage_worker_iam_resources  = false
  manage_cluster_iam_resources = false
  cluster_iam_role_name        = aws_iam_role.cluster_controlplane.name

  cluster_enabled_log_types     = ["api", "audit", "authenticator", "controllerManager", "scheduler"]
  cluster_log_retention_in_days = 365

  worker_create_security_group  = false
  cluster_create_security_group = false

  tags = merge(local.common_tags, {
    Name = local.stamp_name
  })

  node_groups_defaults = {
    iam_role_arn = aws_iam_role.cluster_nodes.arn

    tags = [
      {
        "key"                 = "k8s.io/cluster-autoscaler/enabled"
        "propagate_at_launch" = "false"
        "value"               = "true"
      },
      {
        "key"                 = "k8s.io/cluster-autoscaler/${local.stamp_name}"
        "propagate_at_launch" = "false"
        "value"               = "true"
      }
    ]

    kubelet_extra_args     = var.kubelet_extra_args
    create_launch_template = true
  }

  node_groups = {
    pool-main-1 = {
      name_prefix      = "pool-main-1-"
      version          = local.kubernetes_nodes_version
      desired_capacity = var.initial_capacity
      max_capacity     = var.maximum_capacity
      min_capacity     = var.minimum_capacity
      instance_types   = [var.instance_type]
    }

    pool-jobs-1 = {
      name_prefix      = "pool-jobs-1-"
      version          = local.kubernetes_nodes_version
      desired_capacity = var.initial_capacity
      max_capacity     = var.maximum_capacity
      min_capacity     = var.minimum_capacity
      instance_types   = [var.instance_type_jobs]

      create_launch_template = true
    }
  }

  depends_on = [
    ...
  ]
}

With v18.8.1 we converted it to:

module "cluster" {
  source = "github.com/renato0307/terraform-aws-eks?ref=v18.8.1-os"
  #source          = "terraform-aws-modules/eks/aws"
  #version         = "18.8.1"
  cluster_name    = local.stamp_name
  cluster_version = local.kubernetes_version
  subnet_ids      = module.vpc.private_subnets # RENAME
  vpc_id          = module.vpc.vpc_id
  enable_irsa     = true

  create_iam_role          = false                                 # NEW
  iam_role_arn             = aws_iam_role.cluster_controlplane.arn # RENAME
  iam_role_use_name_prefix = false                                 # NEW

  cluster_enabled_log_types              = ["api", "audit", "authenticator", "controllerManager", "scheduler"]
  cloudwatch_log_group_retention_in_days = 365

  create_node_security_group = true        # NEW
  node_security_group_additional_rules = { # NEW
    cluster_rule = {
      source_cluster_security_group = true
      description                   = "Access from the cluster"
      protocol                      = "-1"
      from_port                     = 0
      to_port                       = 0
      type                          = "ingress"
    }
  }

  create_cluster_security_group = false                             # NEW
  cluster_security_group_id     = var.tmp_cluster_security_group_id # NEW
  cluster_security_group_additional_rules = {                       # NEW
    cluster_rule = {
      source_node_security_group = true
      description                = "Access from node_security_group"
      protocol                   = "-1"
      from_port                  = 0
      to_port                    = 0
      type                       = "ingress"
    }
  }

  # This is not needed since we are using managed nodes
  # NOT_SUPPORTED:worker_create_security_group  = false
  # NOT_SUPPORTED:cluster_create_security_group = false
  # NOT_SUPPORTED:write_kubeconfig = false
  # NOT_SUPPORTED:manage_aws_auth  = false
  # NOT_SUPPORTED:manage_worker_iam_resources  = false
  # NOT_SUPPORTED:manage_cluster_iam_resources = false

  tags = merge(local.common_tags, {
    Name = local.stamp_name
  })

  eks_managed_node_group_defaults = { # RENAME
    iam_role_arn = aws_iam_role.cluster_nodes.arn

    tags = { # NOT A LIST ANYMORE
      "k8s.io/cluster-autoscaler/enabled" : "true",
      "k8s.io/cluster-autoscaler/${local.stamp_name}" : "true"
    }

    kubelet_extra_args     = var.kubelet_extra_args
    create_launch_template = true
  }

  eks_managed_node_groups = {
    pool-main-1 = {
      name_prefix      = "pool-main-1-"
      version          = local.kubernetes_nodes_version
      desired_capacity = var.initial_capacity
      max_capacity     = var.maximum_capacity
      min_capacity     = var.minimum_capacity
      instance_types   = [var.instance_type]
    }

    pool-jobs-1 = {
      name_prefix      = "pool-jobs-1-"
      version          = local.kubernetes_nodes_version
      desired_capacity = var.initial_capacity
      max_capacity     = var.maximum_capacity
      min_capacity     = var.minimum_capacity
      instance_types   = [var.instance_type_jobs]

      create_launch_template = true
    }
  }

  # Ensure the creation order and, more importantly, the delete order
  depends_on = [
...
  ]
}

An example of the Terraform plan output (note the "forces replacement"):

  ~ vpc_config {
      ~ cluster_security_group_id = "sg-xxxxxxxxxx" -> (known after apply)
      ~ security_group_ids        = [ # forces replacement
          + "sg-xxxxxxxxxx",
        ]
      ~ vpc_id                    = "vpc-xxxxx" -> (known after apply)
        # (4 unchanged attributes hidden)
    }

Breaking Changes

How Has This Been Tested?

  • I have tested and validated these changes using one or more of the provided examples/* projects

@bryantbiggs (Member)

Unfortunately we won't be accepting PRs like this; instead I will point you to what others have done to migrate - #1744

@renato0307 (Author) commented Mar 10, 2022

Hi @bryantbiggs, thanks for the feedback.

I've been reading the full thread but I've reached a dead end: apparently the 17.x version allowed creating clusters with securityGroupIds empty and a value only in clusterSecurityGroupId.

In v18.x, whether using an existing cluster security group or creating a new one, the module always updates securityGroupIds, which, due to an AWS limitation, forces cluster recreation.

If I execute

aws eks update-cluster-config --name clustername --resources-vpc-config securityGroupIds=sg-xxxx 

I'll get

An error occurred (InvalidParameterException) when calling the UpdateClusterConfig operation: subnetIds and securityGroupIds cannot be updated.

Example:

       "resourcesVpcConfig": {
            "subnetIds": [
                "subnet-xx",
                "subnet-yy",
                "subnet-zz"
            ],
            "securityGroupIds": [],
            "clusterSecurityGroupId": "sg-xxxx",
            "vpcId": "vpc-qqq",
            "endpointPublicAccess": true,
            "endpointPrivateAccess": false,
            "publicAccessCidrs": [
                "0.0.0.0/0"
            ]
        },

My clusters are all created like this and I cannot destroy them. The only alternative I currently have is to fork the module, which I'd prefer not to do.

Any ideas?

I can work on doing a PR to support this edge case.

Thank you for your time and effort.

@bryantbiggs (Member)

Hey @renato0307, I think your issue might be related to #1911.

Let me spin up a cluster and check out that PR

@renato0307 (Author)

@bryantbiggs thanks.

Let me know if I can help.

@renato0307 (Author)

@bryantbiggs I've tried #1911 in my scenario and the cluster still gets deleted.

That PR only affects node_groups, while my clusters are being destroyed because of changes to the cluster's vpc_config.

We would need something like this to make it work:
renato0307@d80ae75
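
For context, the idea behind that commit (paraphrased here, not the actual diff) is to only inject the cluster security group into vpc_config.security_group_ids when the module itself creates it, so clusters originally created with an empty securityGroupIds list are not forced into replacement. A minimal sketch, assuming the module's create_cluster_security_group variable and a local that resolves the security group ID:

resource "aws_eks_cluster" "this" {
  name     = var.cluster_name
  role_arn = var.iam_role_arn # simplified for the sketch

  vpc_config {
    # Only attach the cluster security group when this module created it;
    # otherwise keep securityGroupIds empty so AWS sees no change
    security_group_ids = var.create_cluster_security_group ? [local.cluster_security_group_id] : []
    subnet_ids         = var.subnet_ids
  }
}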

@renato0307 (Author)

Hi @bryantbiggs - with this fix I could migrate my clusters by using:

create_cluster_security_group = false
create_node_security_group    = false
node_security_group_id        = var.tmp_node_group_security_group_id

The process is (a rough sketch of the commands follows the list):

  1. Fetch the existing security group for the node groups and pass it to Terraform as a variable
  2. Remove the existing node groups from the Terraform state to prevent downtime
  3. Run terraform apply
  4. Cordon/drain old nodes
  5. Delete old node groups
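
A rough sketch of steps 1-3, assuming the old node groups use the EKS-managed cluster security group (the cluster name, state address, and variable name are placeholders and depend on your configuration):

# 1. Fetch the security group used by the existing node groups
SG_ID=$(aws eks describe-cluster --name my-cluster \
  --query 'cluster.resourcesVpcConfig.clusterSecurityGroupId' --output text)

# 2. Remove the old node groups from the Terraform state so they keep running
terraform state rm 'module.cluster.module.node_groups'

# 3. Apply with the existing security group passed in as a variable
terraform apply -var "tmp_node_group_security_group_id=${SG_ID}"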

I've applied your change in a fork and will use this until the fix gets released.

Thank you once again.

@xueshanf

@renato0307 thanks for sharing! I'd like to know: did you have any issues cleaning up the old node pools from the EKS console, now that they are no longer managed by Terraform? Were there any dependencies preventing the old pools from being cleaned up?

@github-actions bot commented Nov 9, 2022

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked this pull request as resolved and limited conversation to collaborators on Nov 9, 2022