hotfix(eks public) Correct LB system #3110

Merged: 2 commits, Oct 25, 2022


@dduportal dduportal commented Oct 25, 2022

After jenkins-infra/helpdesk#3053, the EKS-managed SG was updated with the tag kubernetes.io/cluster/<cluster name>. This broke the load balancer (no more backends) because subnet discovery expected exactly one tagged SG but found two: the EKS-managed SG and the node SG managed by the Terraform EKS module (node_security_group_id).

This was hard to diagnose because eks-public was using the legacy cloud load balancer controller (which produces no logs) even though the new controller was installed (it has logs, but they never mentioned the load balancer because it was delegated to the legacy controller).
The error message could be found in the service description:

$ kubectl --namespace=public-nginx-ingress describe svc <svc name>
# ...
Events
----------
# ...
Error syncing load balancer: failed to ensure load balancer: Multiple tagged security groups found for instance i-0850526a4ec2a17af; ensure only the k8s security group is tagged; the tagged groups were sg-03029dd273150c206(eks-cluster-sg-public-mint-ape-1630374159) sg-03735a81840336f76(public-mint-ape-node-20221013174103536200000004)  

After switching to the AWS LB Controller, the same error appeared, with more detail, in the controller logs:

$ kubectl -n aws-load-balancer logs -l 'app.kubernetes.io/name'=aws-load-balancer-controller -f
# ...
expect exactly one securityGroup tagged with kubernetes.io/cluster/public-mint-ape for eni eni-0b66cbd6118848186, got: [sg-03029dd273150c206 sg-03735a81840336f76] (clusterName: public-mint-ape)"}

=> Automatic subnet discovery selects a network interface (eni-xxxx), then transitively its security groups, and finally filters those security groups by tag.
This is how we identified the tag added by EKS itself, outside our Terraform management (issue jenkins-infra/helpdesk#3053 has been updated for future Kubernetes upgrades).
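
To illustrate the "exactly one tagged security group" rule behind both error messages, here is a minimal Python sketch. It is a simplified model, not the controller's actual code: the function name `select_tagged_security_group` and the dict-of-tag-keys data shape are hypothetical. It assumes the node SG already carried the cluster tag before EKS tagged its own SG.

```python
CLUSTER_TAG_PREFIX = "kubernetes.io/cluster/"


def select_tagged_security_group(security_groups, cluster_name):
    """Return the ID of the single security group carrying the cluster tag.

    security_groups maps a group ID to the set of its tag keys
    (hypothetical shape, chosen only for this illustration).
    """
    tag_key = CLUSTER_TAG_PREFIX + cluster_name
    tagged = sorted(
        sg_id for sg_id, tag_keys in security_groups.items() if tag_key in tag_keys
    )
    if len(tagged) != 1:
        # Mirrors the failure mode above: zero or several tagged groups is an error.
        raise ValueError(
            f"expected exactly one security group tagged with {tag_key}, got: {tagged}"
        )
    return tagged[0]


if __name__ == "__main__":
    # Security groups attached to the network interface (IDs taken from the logs).
    groups = {
        "sg-03029dd273150c206": {"Name"},  # EKS-managed SG, assumed untagged at first
        "sg-03735a81840336f76": {"Name", "kubernetes.io/cluster/public-mint-ape"},
    }
    print(select_tagged_security_group(groups, "public-mint-ape"))

    # EKS adds the cluster tag to its own SG: the selection becomes ambiguous.
    groups["sg-03029dd273150c206"].add("kubernetes.io/cluster/public-mint-ape")
    try:
        select_tagged_security_group(groups, "public-mint-ape")
    except ValueError as err:
        print(err)
```

With a single tagged group the lookup succeeds. Once the second group gains the same tag, the lookup fails instead of picking one arbitrarily, which is why the load balancer lost its backends.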

Finally, the IRSA ARN for the AWS LB Controller was changed by jenkins-infra/aws#276, but the change was not propagated to the Helm chart values.

This PR fixes the LB management in eks-public with the following changes:

@dduportal dduportal requested a review from a team October 25, 2022 10:13
@dduportal dduportal changed the title from "Fix/eks public/lb" to "hotfix(eks public) Correct LB system" Oct 25, 2022
@dduportal dduportal merged commit 2c59bf2 into jenkins-infra:main Oct 25, 2022
@dduportal dduportal deleted the fix/eks-public/lb branch October 25, 2022 10:29