
Launch Template with Managed Node Groups Example Incorrect Security Group #1662

Closed
geoff-contino opened this issue Oct 27, 2021 · 19 comments · Fixed by #1680

@geoff-contino

geoff-contino commented Oct 27, 2021

Description

As per issue kubernetes-sigs/aws-efs-csi-driver#574, it appears that the example in

https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/examples/launch_templates_with_managed_node_groups/launchtemplate.tf

attaches

module.eks_cluster.worker_security_group_id

to the launch template for the nodes. This appears to be incorrect; it should be:

module.eks_cluster.cluster_primary_security_group_id

Otherwise none of the EFS ENIs can connect to the nodes, because they use the cluster primary security group ID as their reference.

Having checked the standard node group deployment, this appears to be the case...
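
For illustration, a minimal sketch of the proposed change (the launch template resource name and the omitted arguments are assumed, not copied from the example):

resource "aws_launch_template" "workers" {
  name_prefix = "eks-workers-"

  network_interfaces {
    # Use the cluster primary SG so the nodes carry the same group that EKS
    # attaches to fully managed node groups (and that EFS ENIs reference).
    security_groups = [module.eks_cluster.cluster_primary_security_group_id]
  }

  # ... AMI, instance type, user data, etc. as in the example
}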

@daroga0002
Contributor

The provided snippet is just an example; you can attach multiple security groups to node groups, as we don't limit this in the module. We leave this up to the user's needs and decision.

An example is just an example.

@geoff-contino
Author

@daroga0002 I do understand it is just an example, but should the example not be representative of a standard default node group configuration?

Perhaps I should have realised sooner that it should have been this group instead - but that example should, at a minimum, include the primary cluster security group, as this is what is needed for inter-node and control-plane communication.

@daroga0002
Contributor

Perhaps I should have realised sooner that it should have been this group instead - but that example should, at a minimum, include the primary cluster security group, as this is what is needed for inter-node and control-plane communication.

So you mean control plane communication is blocked in that example?

I believe those examples were tested and all nodes joined the cluster properly.

@geoff-contino
Author

So they join; however, on a private cluster there appeared to be issues with the aws-node and/or coredns pods... the Kubernetes logs (outside this scope) were less than helpful. We were also running Cilium, which did not like this configuration... (again outside this scope)

The biggest issue was PVCs - these are linked to the primary cluster SG, which is the default group applied to the nodes in a non-launch-template configuration. When we switched to a launch template using this example as a base (we wanted a custom AMI and encryption), they failed to attach.

As noted here:

https://docs.aws.amazon.com/eks/latest/userguide/sec-group-reqs.html

This cluster SG should be applied to the control plane and the nodes to allow everything to communicate 'freely'.

I fully appreciate that the current approach gives more freedom to configure the security groups and allows fine-grained control - but to better replicate what EKS does with node groups today, I feel it would be better to attach the cluster primary SG to the nodes.
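
One way to do this while keeping the example's dedicated worker SG (resource and argument names assumed, not taken from the example) is to attach both groups to the launch template:

resource "aws_launch_template" "workers" {
  name_prefix = "eks-workers-"

  network_interfaces {
    # Keep the example's worker SG for its fine-grained rules, and add the
    # cluster primary SG to restore the default managed node group behaviour.
    security_groups = [
      module.eks_cluster.worker_security_group_id,
      module.eks_cluster.cluster_primary_security_group_id,
    ]
  }
}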

@daroga0002
Contributor

The example creates the following security groups:

(screenshot: security groups created by the example)

If we look at the docs you provided, all the requirements are met:

(screenshot: security group requirements from the EKS documentation)

So they join; however, on a private cluster there appeared to be issues with the aws-node and/or coredns pods... the Kubernetes logs (outside this scope) were less than helpful. We were also running Cilium, which did not like this configuration... (again outside this scope)

Sorry, but Kubernetes is a big thing and this module cannot support every configuration that users run in their organisations. One of the main goals of this module is to allow customisation so users can cover their own use cases, but that doesn't mean we can support all of those customisations out of the box inside this module.

@geoff-contino
Author

I am not saying the example should support another configuration; it should support the same configuration that AWS currently uses.

If you were to deploy a node group without a launch template, it would use the primary SG for the control plane and the nodes. This is what EFS (the only example I have) references in its ENI. If you change this to the launch template configuration following the example in the module, it uses the worker security group, which is what you have shown above. This will not work with EFS, and I am sure it would affect other areas that use the primary SG as a reference for the nodes (see the sketch below).

I get that you shouldn't support all configurations, but this is the 'standard' configuration for EKS.
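
To make the failure mode concrete, here is a hedged sketch (file system, subnet and variable names are hypothetical) of how an EFS mount target typically ends up scoped to the cluster primary security group; nodes whose launch template only carries the worker SG then cannot reach it over NFS:

resource "aws_efs_file_system" "shared" {}

resource "aws_efs_mount_target" "shared" {
  file_system_id = aws_efs_file_system.shared.id
  subnet_id      = var.private_subnet_id # hypothetical variable

  # The mount target ENI only accepts traffic allowed by this SG, so nodes
  # must carry the cluster primary SG (or an explicit NFS/2049 rule) to
  # attach PVCs backed by this file system.
  security_groups = [module.eks_cluster.cluster_primary_security_group_id]
}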

@daroga0002
Contributor

So, if I understand correctly, the issue occurs when you use dynamic provisioning with the EFS CSI driver?

@geoff-contino
Author

We do, yes - however, we noticed the issue when we went from standard node groups to ones using a launch template - essentially re-rolling the node groups.

I am not sure how the EFS CSI driver tells the ENI which security group to use, but when the ENIs are created initially they use the primary cluster SG as the source. Therefore, with the above configuration, it would not work.

If you would like to confirm, please try launching a standard EKS cluster without using a launch template for the nodes and see which SG the nodes are assigned - from what I can see, it is the primary SG, not a separate worker one.
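
One way to check this from Terraform (the cluster name is a placeholder) is to read the primary security group that EKS creates for the cluster and attaches to fully managed node groups by default:

data "aws_eks_cluster" "this" {
  name = "my-cluster" # placeholder
}

output "cluster_primary_security_group_id" {
  # The SG that EKS creates for the cluster and attaches to managed nodes.
  value = data.aws_eks_cluster.this.vpc_config[0].cluster_security_group_id
}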

@daroga0002
Contributor

Could you paste screenshots of the security groups assigned to your EKS cluster and of those created on EFS through dynamic provisioning?

@geoff-contino
Author

So this is from the node:

(screenshot: security groups attached to the node)

and this is from the cluster

(screenshot: security group on the cluster)

@daroga0002
Contributor

Thanks - I also need the SG from the EFS share.

@geoff-contino
Author

sorry - here you go

(screenshot: security group on the EFS share)

@mijndert

mijndert commented Nov 2, 2021

Same issue here using the following code. The managed node groups get created with the default SG regardless of what the documentation tells me. I believe worker_security_group_id should be mapped to https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/eks_node_group#remote_access_security_group_id

$ tf version
Terraform v1.0.10

module "eks" { # module label assumed; the opening line was omitted from the pasted snippet
  source          = "terraform-aws-modules/eks/aws"

  cluster_name    = var.environment
  cluster_version = var.eks_version

  vpc_id          = data.aws_vpc.selected.id
  subnets         = data.aws_subnet_ids.private.ids

  worker_create_security_group = false

  # cluster_service_ipv4_cidr =

  node_groups_defaults = {
    ami_type  = "AL2_x86_64" # Default is x86_64, can be overwritten per node group
    ebs_optimized = true
    disk_size = var.disk_size
    disk_encrypted = true
  }

  node_groups = {
    general = {
      name             = "eks-general-${random_string.suffix.result}"

      worker_security_group_id = data.aws_security_group.eks-nodes.id

      desired_capacity = var.desired_capacity_general
      max_capacity     = var.max_capacity_general
      min_capacity     = var.min_capacity_general

      instance_types = [var.instance_type_general]
      capacity_type  = "ON_DEMAND"

      update_config = {
        max_unavailable_percentage = 50 # or set `max_unavailable`
      }

      metadata_http_endpoint               = true
      metadata_http_tokens                 = true
      metadata_http_put_response_hop_limit = 2

      pre_userdata = data.cloudinit_config.eks.rendered

      tags = {
        Name = "${var.environment}-eks-general"
      }
    },
    local_storage = {
      name             = "eks-local-storage-${random_string.suffix.result}"

      worker_security_group_id = data.aws_security_group.eks-nodes.id

      ami_type         = "AL2_ARM_64"

      desired_capacity = var.desired_capacity_local_storage
      max_capacity     = var.max_capacity_local_storage
      min_capacity     = var.min_capacity_local_storage

      instance_types = [var.instance_type_local_storage]
      capacity_type  = "ON_DEMAND"

      update_config = {
        max_unavailable_percentage = 50 # or set `max_unavailable`
      }

      metadata_http_endpoint               = true
      metadata_http_tokens                 = true
      metadata_http_put_response_hop_limit = 2

      pre_userdata = data.cloudinit_config.eks-local-storage.rendered

      tags = {
        Name = "${var.environment}-eks-local-storage"
      }
    }
  }
}

@daroga0002
Contributor

I believe worker_security_group_id should be mapped to https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/eks_node_group#remote_access_security_group_id

Nope - the way this feature works is that the AWS API creates an additional security group with port 22 (SSH) open and sets the provided SG as the source security group.
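
For reference, a hedged sketch (placeholder values throughout) of the underlying aws_eks_node_group remote_access block this maps to - the provided group becomes the SSH source, it is not attached to the nodes:

resource "aws_eks_node_group" "example" {
  cluster_name    = "my-cluster"           # placeholder
  node_group_name = "example"
  node_role_arn   = var.node_role_arn      # placeholder
  subnet_ids      = var.private_subnet_ids # placeholder

  scaling_config {
    desired_size = 1
    max_size     = 1
    min_size     = 1
  }

  remote_access {
    ec2_ssh_key = "my-key" # placeholder
    # EKS creates an extra SG that allows SSH (port 22) only from these
    # groups; they are not attached to the worker nodes themselves.
    source_security_group_ids = [data.aws_security_group.eks-nodes.id]
  }
}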

@daroga0002 daroga0002 added the bug label Nov 2, 2021
@daroga0002
Contributor

This is a bug which we must fix to get consistent behaviour between fully managed and "semi" managed node groups in EKS.

@geoff-contino
Author

geoff-contino commented Nov 2, 2021

@daroga0002 - it turns out we were setting the EFS share to use that SG in a module within another module - it was not the default action. My fault - sorry!

That said, the cluster SG should be applied to both the cluster and the nodes by default (as it is when node groups are launched fully managed). This also allows the SG to be referenced while node groups are destroyed and rebuilt, and still allows resources access to the cluster. Otherwise, when the nodes are rebuilt, issues like the above can happen.

@daroga0002
Contributor

daroga0002 commented Nov 5, 2021

similar case #1616

@antonbabenko
Member

This issue has been resolved in version 18.0.0 🎉

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 16, 2022