
fix: avoid cluster recreation when migrating from v17.x #1933

Closed

Conversation

@renato0307 commented Mar 10, 2022

Description

Solves two issues:

  1. Avoids recreation of the cluster during migration from v17.24 to v18.x.
  2. Adds support for additional cluster security group rules without the need to create a new security group (which would cause cluster recreation); see the sketch after this list.
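
For illustration, a minimal usage sketch of what point 2 enables, reusing the inputs shown later in this PR (the security group ID and rule key are placeholders):

module "cluster" {
  source = "terraform-aws-modules/eks/aws"
  # ... cluster_name, cluster_version, vpc_id, subnet_ids, etc.

  # Reuse the security group of the existing v17.x cluster instead of creating a new one
  create_cluster_security_group = false
  cluster_security_group_id     = "sg-xxxxxxxxxx"

  # Additional rules can still be attached to that existing security group
  cluster_security_group_additional_rules = {
    ingress_nodes_all = {
      description                = "Access from node_security_group"
      protocol                   = "-1"
      from_port                  = 0
      to_port                    = 0
      type                       = "ingress"
      source_node_security_group = true
    }
  }
}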

Motivation and Context

We are migrating the module from v17.24.0 to v18.x and the changes force a cluster replacement, which is not acceptable in our context.

The v17.24 cluster is being created like this:

module "cluster" {
  source           = "terraform-aws-modules/eks/aws"
  version          = "17.24.0"
  cluster_name     = local.stamp_name
  cluster_version  = local.kubernetes_version
  subnets          = module.vpc.private_subnets
  vpc_id           = module.vpc.vpc_id
  enable_irsa      = true
  write_kubeconfig = false
  manage_aws_auth  = false

  manage_worker_iam_resources  = false
  manage_cluster_iam_resources = false
  cluster_iam_role_name        = aws_iam_role.cluster_controlplane.name

  cluster_enabled_log_types     = ["api", "audit", "authenticator", "controllerManager", "scheduler"]
  cluster_log_retention_in_days = 365

  worker_create_security_group  = false
  cluster_create_security_group = false

  tags = merge(local.common_tags, {
    Name = local.stamp_name
  })

  node_groups_defaults = {
    iam_role_arn = aws_iam_role.cluster_nodes.arn

    tags = [
      {
        "key"                 = "k8s.io/cluster-autoscaler/enabled"
        "propagate_at_launch" = "false"
        "value"               = "true"
      },
      {
        "key"                 = "k8s.io/cluster-autoscaler/${local.stamp_name}"
        "propagate_at_launch" = "false"
        "value"               = "true"
      }
    ]

    kubelet_extra_args     = var.kubelet_extra_args
    create_launch_template = true
  }

  node_groups = {
    pool-main-1 = {
      name_prefix      = "pool-main-1-"
      version          = local.kubernetes_nodes_version
      desired_capacity = var.initial_capacity
      max_capacity     = var.maximum_capacity
      min_capacity     = var.minimum_capacity
      instance_types   = [var.instance_type]
    }

    pool-jobs-1 = {
      name_prefix      = "pool-jobs-1-"
      version          = local.kubernetes_nodes_version
      desired_capacity = var.initial_capacity
      max_capacity     = var.maximum_capacity
      min_capacity     = var.minimum_capacity
      instance_types   = [var.instance_type_jobs]

      create_launch_template = true
    }
  }

  depends_on = [
    ...
  ]
}

With v18.8.1 we converted it to:

module "cluster" {
  source = "github.com/renato0307/terraform-aws-eks?ref=v18.8.1-os"
  #source          = "terraform-aws-modules/eks/aws"
  #version         = "18.8.1"
  cluster_name    = local.stamp_name
  cluster_version = local.kubernetes_version
  subnet_ids      = module.vpc.private_subnets # RENAME
  vpc_id          = module.vpc.vpc_id
  enable_irsa     = true

  create_iam_role          = false                                 # NEW
  iam_role_arn             = aws_iam_role.cluster_controlplane.arn # RENAME
  iam_role_use_name_prefix = false                                 # NEW

  cluster_enabled_log_types              = ["api", "audit", "authenticator", "controllerManager", "scheduler"]
  cloudwatch_log_group_retention_in_days = 365

  create_node_security_group = true        # NEW
  node_security_group_additional_rules = { # NEW
    cluster_rule = {
      source_cluster_security_group = true
      description                   = "Access from the cluster"
      protocol                      = "-1"
      from_port                     = 0
      to_port                       = 0
      type                          = "ingress"
    }
  }

  create_cluster_security_group = false                             # NEW
  cluster_security_group_id     = var.tmp_cluster_security_group_id # NEW
  cluster_security_group_additional_rules = {                       # NEW
    cluster_rule = {
      source_node_security_group = true
      description                = "Access from node_security_group"
      protocol                   = "-1"
      from_port                  = 0
      to_port                    = 0
      type                       = "ingress"
    }
  }

  # This is not needed since we are using managed nodes
  # NOT_SUPPORTED:worker_create_security_group  = false
  # NOT_SUPPORTED:cluster_create_security_group = false
  # NOT_SUPPORTED:write_kubeconfig = false
  # NOT_SUPPORTED:manage_aws_auth  = false
  # NOT_SUPPORTED:manage_worker_iam_resources  = false
  # NOT_SUPPORTED:manage_cluster_iam_resources = false

  tags = merge(local.common_tags, {
    Name = local.stamp_name
  })

  eks_managed_node_group_defaults = { # RENAME
    iam_role_arn = aws_iam_role.cluster_nodes.arn

    tags = { # NOT A LIST ANYMORE
      "k8s.io/cluster-autoscaler/enabled" : "true",
      "k8s.io/cluster-autoscaler/${local.stamp_name}" : "true"
    }

    kubelet_extra_args     = var.kubelet_extra_args
    create_launch_template = true
  }

  eks_managed_node_groups = {
    pool-main-1 = {
      name_prefix      = "pool-main-1-"
      version          = local.kubernetes_nodes_version
      desired_capacity = var.initial_capacity
      max_capacity     = var.maximum_capacity
      min_capacity     = var.minimum_capacity
      instance_types   = [var.instance_type]
    }

    pool-jobs-1 = {
      name_prefix      = "pool-jobs-1-"
      version          = local.kubernetes_nodes_version
      desired_capacity = var.initial_capacity
      max_capacity     = var.maximum_capacity
      min_capacity     = var.minimum_capacity
      instance_types   = [var.instance_type_jobs]

      create_launch_template = true
    }
  }

  # Ensure the creation order and, more importantly, the delete order
  depends_on = [
...
  ]
}

An example of the Terraform plan output (note the "forces replacement"):

  ~ vpc_config {
      ~ cluster_security_group_id = "sg-xxxxxxxxxx" -> (known after apply)
      ~ security_group_ids        = [ # forces replacement
          + "sg-xxxxxxxxxx",
        ]
      ~ vpc_id                    = "vpc-xxxxx" -> (known after apply)
        # (4 unchanged attributes hidden)
    }

Breaking Changes

How Has This Been Tested?

  • I have tested and validated these changes using one or more of the provided examples/* projects

@bryantbiggs (Member)

Unfortunately we won't be accepting PRs like this; instead I will point you to what others have done to migrate - #1744

@renato0307 (Author) commented Mar 10, 2022

Hi @bryantbiggs, thanks for the feedback.

I've been reading the full thread but I've reached a dead end: apparently the 17.x version allowed creating clusters with securityGroupIds empty and a value only in clusterSecurityGroupId.

In v18.x, whether using an existing cluster security group or creating a new one, the module always updates securityGroupIds, which, due to an AWS limitation, forces cluster recreation.

If I execute

aws eks update-cluster-config --name clustername --resources-vpc-config securityGroupIds=sg-xxxx 

I'll get

An error occurred (InvalidParameterException) when calling the UpdateClusterConfig operation: subnetIds and securityGroupIds cannot be updated.

Example:

       "resourcesVpcConfig": {
            "subnetIds": [
                "subnet-xx",
                "subnet-yy",
                "subnet-zz"
            ],
            "securityGroupIds": [],
            "clusterSecurityGroupId": "sg-xxxx",
            "vpcId": "vpc-qqq",
            "endpointPublicAccess": true,
            "endpointPrivateAccess": false,
            "publicAccessCidrs": [
                "0.0.0.0/0"
            ]
        },

My clusters are all created like this and I cannot destroy them. The only alternative I currently have is to fork the module, which I'd prefer not to do.

Any ideas?

I can work on doing a PR to support this edge case.

Thank you for your time and effort.

@bryantbiggs (Member)

Hey @renato0307, I think your issue might be related to #1911.

Let me spin up a cluster and check out that PR

@renato0307 (Author)

@bryantbiggs thanks.

Let me know if I can help.

@renato0307 (Author)

@bryantbiggs I've tried #1911 in my scenario and the cluster still gets deleted.

That PR only affects node_groups, while my clusters are being destroyed because of changes to the cluster's vpc_config.

We would need something like this to make it work:
renato0307@d80ae75
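
For context, the idea behind that commit (paraphrased here, not the actual diff) is to only inject the cluster security group into vpc_config.security_group_ids when the module itself creates it, so clusters originally created with an empty securityGroupIds list are not forced into replacement. A minimal sketch, assuming the module's create_cluster_security_group variable and a local that resolves the security group ID:

resource "aws_eks_cluster" "this" {
  name     = var.cluster_name
  role_arn = var.iam_role_arn # simplified for the sketch

  vpc_config {
    # Only attach the cluster security group when this module created it;
    # otherwise keep securityGroupIds empty so AWS sees no change
    security_group_ids = var.create_cluster_security_group ? [local.cluster_security_group_id] : []
    subnet_ids         = var.subnet_ids
  }
}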

@renato0307 (Author)

Hi @bryantbiggs - with this fix I could migrate my clusters by using:

create_cluster_security_group = false
create_node_security_group    = false
node_security_group_id        = var.tmp_node_group_security_group_id

The process is (a rough sketch of the commands follows the list):

  1. Fetch the existing security group for the node groups and pass it to Terraform as a variable
  2. Remove the existing node groups from the Terraform state to prevent downtime
  3. Run terraform apply
  4. Cordon/drain old nodes
  5. Delete old node groups
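
A rough sketch of steps 1-3, assuming the old node groups use the EKS-managed cluster security group (the cluster name, state address, and variable name are placeholders and depend on your configuration):

# 1. Fetch the security group used by the existing node groups
SG_ID=$(aws eks describe-cluster --name my-cluster \
  --query 'cluster.resourcesVpcConfig.clusterSecurityGroupId' --output text)

# 2. Remove the old node groups from the Terraform state so they keep running
terraform state rm 'module.cluster.module.node_groups'

# 3. Apply with the existing security group passed in as a variable
terraform apply -var "tmp_node_group_security_group_id=${SG_ID}"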

I've applied your change in a fork and will use this until the fix gets released.

Thank you once again.

@xueshanf

@renato0307 thanks for sharing! I'd like to know: did you have any issues cleaning up the old node pools from the EKS console, now that they are no longer managed by Terraform? Were there any dependencies preventing the old pools from being cleaned up?

@github-actions bot commented Nov 9, 2022

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked this pull request as resolved and limited conversation to collaborators on Nov 9, 2022