
Workloads deployed on node_groups nodes are unable to make calls to the internet #1089

Closed
marcosborges opened this issue Nov 6, 2020 · 18 comments · Fixed by #1094

Comments

@marcosborges

Hello guys, I just stood up an EKS cluster using the terraform-aws-eks module. At first it was a pleasant experience; using the module was very smooth.

Then I ran into a problem: I configured the module to create a node group so that certain types of applications could be deployed on it.

When I deploy applications to this node group via a node selector, they are unable to make calls outside the cluster, e.g. curl google.com.

When I remove the node selector and redo the deployment, the application lands on the standard EKS worker nodes. On those nodes the application can make calls outside the cluster.

  • bug report
  • feature request
  • [ x ] support request - read the FAQ first!
  • kudos, thank you, warm fuzzy

What is the current behavior?

Workloads deployed on the node_groups nodes are unable to make calls outside the cluster.

module "eks" {
  source = "git::https://github.com/terraform-aws-modules/terraform-aws-eks.git"
  cluster_name = local.env_prefix
  cluster_version = "1.18"
  vpc_id = module.vpc.vpc_id
  subnets = module.vpc.private_subnets
  worker_groups = [
    {
      instance_type = "m4.large"
      desired_capacity = 2
      asg_max_size  = 5
    }
  ]
  node_groups = {
    vault = {
      desired_capacity = 2
      max_capacity     = 10
      min_capacity     = 2
      instance_type = "m4.large"
      k8s_labels = {
        Vault       = "true"
      }
      additional_tags = {
        Vault       = "true"
      }
    }
  }
}

I started by checking the subnets where the EC2 instances for the node group were being launched; they turned out to be the same subnets used by the worker group nodes.

To create the VPC I used the terraform-aws-modules/terraform-aws-vpc module.

I then checked whether it was something in the security groups; the rules for the node group are the same as those for the worker nodes.

I also validated the IAM roles, and again they were the same.

I need a hint, tip, direction, or smoke signal to continue building my environment.

I will be extremely grateful for the help.

@barryib
Member

barryib commented Nov 6, 2020

Hmm. Can you please check whether you have a NAT gateway attached to your private subnets? Can you please share your vpc module configuration?

You can also have a look at examples/managed_node_groups for a working example. That will probably help you figure out what's wrong in your deployment.
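
For reference, a minimal sketch of a terraform-aws-modules/terraform-aws-vpc configuration with NAT for the private subnets (the name, AZs and CIDRs here are placeholders, not taken from this issue):

module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name = "eks-vpc"      # placeholder
  cidr = "10.0.0.0/16"  # placeholder

  azs             = ["us-east-1a", "us-east-1b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24"]

  # Nodes in the private subnets can only reach the internet through a NAT gateway.
  enable_nat_gateway = true
  single_nat_gateway = true

  enable_dns_hostnames = true
  enable_dns_support   = true
}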

@ScubaDrew

I am having the same problem. I found that the node_groups security group does not have the correct inbound rules.

@barryib
Copy link
Member

barryib commented Nov 8, 2020

Can you please elaborate? Which SG rules are missing?

@ScubaDrew

node_groups security group
[screenshot: node_groups security group inbound rules]

worker_groups security group:
[screenshot: worker_groups security group inbound rules]

The internet is not accessible with the first SG attached, but if I attach the worker_groups SG, things work correctly.

@barryib
Member

barryib commented Nov 9, 2020

I think it's because you have an egress rule that allows internet traffic in the worker SG. That rule is created for the worker SG by this module. @marcosborges @ScubaDrew, can you confirm please?

I don't use MNG at all, and when I go through the code, I don't understand why this issue is only surfacing now.

@ScubaDrew

They both have the same egress rule:
[screenshot: identical egress rule on both security groups]

@barryib
Member

barryib commented Nov 10, 2020

I just tested internet access from within a managed node group and everything works as expected.

I was wondering what you mean by "unable to make calls to the internet"? Is it a DNS issue, or is your DNS resolution working correctly and you're just having trouble reaching the internet?

If you have a DNS issue, I suspect that your CoreDNS pods are running in your worker groups and the pods in your managed node groups can't reach them. This is because there are no rules to allow communication between worker groups and managed node groups by default. To allow that, you can set var.worker_create_cluster_primary_security_group_rules = true, as in the sketch below.
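
A minimal sketch of where that variable goes, reusing the module call from the issue description (other arguments trimmed):

module "eks" {
  source          = "git::https://github.com/terraform-aws-modules/terraform-aws-eks.git"
  cluster_name    = local.env_prefix
  cluster_version = "1.18"
  vpc_id          = module.vpc.vpc_id
  subnets         = module.vpc.private_subnets

  # Create SG rules that let pods on self-managed workers and pods using the
  # cluster primary SG (managed node groups) talk to each other, e.g. so
  # CoreDNS running on one side is reachable from the other.
  worker_create_cluster_primary_security_group_rules = true

  worker_groups = [
    # ... as above ...
  ]
  node_groups = {
    # ... as above ...
  }
}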

@ScubaDrew

@barryib I think you are right - the issue is DNS. CoreDNS is not running on the node_groups nodes.

It seems the node_groups nodes do not get permission to talk to the other nodes in the cluster.

As I showed above, worker_groups get:
[screenshot: worker_groups security group rules]

worker_create_cluster_primary_security_group_rules does not really sound like what we need/want. We want node_groups to be able to talk to the rest of the cluster for DNS, I guess... or to have CoreDNS running on them? I'm not sure what is best.

@barryib
Member

barryib commented Nov 10, 2020

Here is the description of worker_create_cluster_primary_security_group_rules:

"Whether to create security group rules to allow communication between pods on workers and pods using the primary cluster security group."

It means that it allows communication between pods in the worker groups SG and pods in the managed node groups SG. MNG uses the cluster primary SG (this was introduced in EKS 1.14).
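
To illustrate, here is a rough sketch of the kind of rule that flag enables (illustrative only, not the module's exact resources; it assumes the module exposes the worker_security_group_id and cluster_primary_security_group_id outputs):

# Illustrative only: let pods behind the cluster primary SG (managed node groups)
# reach pods on the self-managed workers. The module manages rules like this in
# both directions when the flag is set.
resource "aws_security_group_rule" "workers_ingress_from_cluster_primary" {
  description              = "Allow pods using the cluster primary SG to reach pods on workers"
  type                     = "ingress"
  protocol                 = "-1"
  from_port                = 0
  to_port                  = 0
  security_group_id        = module.eks.worker_security_group_id
  source_security_group_id = module.eks.cluster_primary_security_group_id
}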

@ScubaDrew

Got it. I'll add that then! Thank you.

The example does not have it - https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/examples/managed_node_groups/main.tf - so DNS wouldn't work there, right?

Thanks again

@barryib
Member

barryib commented Nov 10, 2020

Oh, good catch. Can you please test it and see if it solves your issue, or even better, confirm that the example you linked is not working as expected and open a PR to update the example/FAQ?

@barryib
Member

barryib commented Nov 10, 2020

Oh sorry. That example works. It's quite late here ^^

That example works because you don't have both worker groups and managed node groups => your CoreDNS pods run in your MNG, which already shares the same primary SG.

This issue comes up when you have worker groups and MNG, plus your CoreDNS scheduled on one side of your cluster (in your case, on your self-managed worker groups).

@ScubaDrew

Confirmed: worker_create_cluster_primary_security_group_rules fixes things when you have both worker and MNG. Thanks @barryib

@barryib
Member

barryib commented Nov 10, 2020

Great. Can you please review #1094?

@adilsonmenechini

Use the public network.

subnets = module.vpc.public_subnets

@barryib
Member

barryib commented Nov 10, 2020

Use the public network.

subnets = module.vpc.public_subnets

Can you please elaborate? How would using public subnets open communication between pods scheduled on managed node groups and those on self-managed worker groups?

@barryib
Member

barryib commented Nov 11, 2020

Great. Can you please review #1094?

cc @ScubaDrew

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 23, 2022