Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid detaching ENIs on nodes being drained #1223

Closed
mogren opened this issue Sep 18, 2020 · 25 comments
Closed

Avoid detaching ENIs on nodes being drained #1223

mogren opened this issue Sep 18, 2020 · 25 comments

Comments

@mogren
Copy link
Contributor

mogren commented Sep 18, 2020

What would you like to be added:
We should prevent ipamd from trying to free ENIs when a node is about to terminated.

For spot instances we could do something similar to the aws-node-termination-handler and check some metadata endpoints.

For the case where a node is cordoned off before being terminated, meaning it is marked as "unschedulable", we should be able to check this node taints before trying to attach or detach any ENIs.

Why is this needed:
Since there is no EC2 API call to directly "delete" an ENI that is attached, instead they first have to be detached, which takes a few seconds, then deleted. If the instance gets terminated after the ENI has been detached, but before it has been deleted, it will be leaked. This leaked ENI can prevent Security Groups and VPCs from being deleted and require manual clean up.

Related issues: #608 #69, #690

@jayanthvn
Copy link
Contributor

There are 2 parts to this - 1. There is an internal tracking ticket with Ec2 team to see if this can be handled. 2. IPAMD can check if the node is marked as unschedulable then prevent ENI deletion. But the cleaner approach is #1.

@jayanthvn
Copy link
Contributor

Pending POC on the IPAMD changes. Also following up with ec2 team if this can be handled internally.

@hurriyetgiray-ping
Copy link

Do we have an ETA on this fix please? This is causing us a lot of issues, and substantial coding to workaround it.

@hurriyetgiray-ping
Copy link

Can this be tagged as a bug please, as this is more than a feature request?

@jayanthvn jayanthvn added the bug label Jun 28, 2021
@jayanthvn
Copy link
Contributor

Hi @hurriyetgiray-ping,

What is the CNI version you are using and are you having short lived clusters in the account? currently we have a background thread - https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/awsutils/awsutils.go#L382 which cleans up leaked ENIs in the account and I agree this would need atleast one aws-node pod running in the account.

@hurriyetgiray-ping
Copy link

Hello @jayanthvn. Thank you for your response and also for tagging this issue as a bug. CNI version is V1.7.5 as per the below kubectl describe.

bash-4.2$ kubectl describe daemonset aws-node --namespace kube-system | grep Image | cut -d "/" -f 2
amazon-k8s-cni-init:v1.7.5
amazon-k8s-cni:v1.7.5 

Would the background thread you mention help us? How frequently does it run?

Not sure what qualifies a cluster's life span as 'short', but this particular use case could possibly qualify as one. We have an EKS cluster but we manage the nodes ourselves. As part of our end-to-end test suite, we create nodes, performs tests and then delete them via Cloudformation. Here are the related CloudFormation stack events and timelines showing the delete error.


2021-06-23 16:51:27 UTC+0100 eks-stateful-node-xxx DELETE_FAILED The following resource(s) failed to delete: [WorkerNodeSecurityGroup].
2021-06-23 16:51:26 UTC+0100 WorkerNodeSecurityGroup DELETE_FAILED resource sg-xxx has a dependent object (Service: AmazonEC2; Status Code: 400; Error Code: DependencyViolation; Request ID: xxx; Proxy: null)

2021-06-23 16:07:10 UTC+0100 WorkerNodeSecurityGroup CREATE_COMPLETE  (creates sg-xxx)

2021-06-23 16:06:59 UTC+0100 eks-stateful-node-xxx CREATE_IN_PROGRESS	Transformation succeeded
2021-06-23 16:06:52 UTC+0100 eks-stateful-node-xxx CREATE_IN_PROGRESS

@johngmyers
Copy link

kOps is also having problems with ENIs being leaked by amazon-vpc upon cluster deletion.

I suspect the window between creation of an ENI and attaching it with DeleteOnTermination set is also a factor.

@github-actions
Copy link

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Apr 13, 2022
@Nuru
Copy link

Nuru commented Apr 13, 2022

People have been complaining about this since 2019. Can we get this taken care of now please?

@jayanthvn
Copy link
Contributor

@Nuru - #1927 should mitigate this issue up to some extent but we are actively working with EC2 team on the implementation of detach and delete calls.

@bryantbiggs
Copy link
Member

@jayanthvn anything preventing #1927 from being merged?

@jayanthvn
Copy link
Contributor

@bryantbiggs - It is pending code review. We are tracking it for 1.11.1 release. I will provide an ETA soon.

@github-actions
Copy link

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Jun 13, 2022
@jayanthvn
Copy link
Contributor

/not stale

@github-actions github-actions bot removed the stale Issue or PR is stale label Jun 14, 2022
@EladGabay
Copy link

Hi, any update here?
Thanks

@jayanthvn
Copy link
Contributor

#1927 mitigates the issue to certain extent. But we are actively working with EC2/VPC team on fixing the backend calls.

@github-actions
Copy link

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Sep 21, 2022
@Nuru
Copy link

Nuru commented Sep 23, 2022

/not stale

@Nuru
Copy link

Nuru commented Sep 23, 2022

@jayanthvn how do we get this marked as "not stale"?

@github-actions
Copy link

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Nov 24, 2022
@jayanthvn
Copy link
Contributor

/not stale

@github-actions github-actions bot removed the stale Issue or PR is stale label Nov 25, 2022
@kgyovai
Copy link

kgyovai commented Jan 18, 2023

@Nuru - I had a similar use case to the one you described in #608. I also encountered the same error. When provisioning a managed node group with Terraform (aws_eks_node_group resource type) and then performing a subsequent destroy operation, the security group(s) assigned to the node group instance(s) could not be deleted because one or more ENIs provisioned by AWS through the managed node group were associated with the security group(s) as illustrated below.

image

image

Versions

EKS: 1.24
Terraform: 1.3.6
Terraform AWS Provider: 4.48.0
aws-vpc-cni: v1.12.0-eksbuild.2

My Solution

I ended up creating a null_resource with a local-exec destroy-time provisioner to delete any ENIs that were created by AWS through the managed node group and had already been detached. This is not a perfect solution if you are provisioning and destroying more than one set of managed node groups in an account at the same time. It works for my use case of standing up a single cluster with node groups and then tearing it all down.

Here is an example of the solution in action during a destroy operation using Terraform.

image

Code

/* This resource is used to ensure that any "dangling" ENIs are deleted before the destroy operation is performed on
 * the security group assigned to the managed node group. This is often, but not always, necessary because managed node
 * groups result in the creation of AWS-managed ENIs that cannot be evaluated through the use of Terraform state. The
 * AWS-managed ENIs often detach during the destroy operation but are not always deleted. The root cause of this
 * behavior is because the EC2 API requires the "detach" and "delete" invocations to be performed separately.
 * See https://github.com/aws/amazon-vpc-cni-k8s/issues/608
 */
resource "null_resource" "dangling_eni_cleanup" {
  provisioner "local-exec" {
    command     = "./delete-detached-enis.sh"
    when        = destroy
    working_dir = path.module
  }
  depends_on = [
    aws_security_group.this
  ]
}
#!/bin/sh

# This script was created to provide a solution to "dangling" ENIs that occur during the deletion of a managed EKS node
# group. https://github.com/aws/amazon-vpc-cni-k8s/issues/608#issuecomment-694654359

set -e

query_dangling_enis() {
  aws ec2 describe-network-interfaces \
    --query 'NetworkInterfaces[?Status==`available`&&contains(Description, `aws-K8S`) == `true`].NetworkInterfaceId'
}

delete_eni() {
  echo "Deleting detached ENI: $1"
  aws ec2 delete-network-interface --network-interface-id $1
  echo "The interface was deleted."
}

echo "Initiating dangling ENI cleanup."
query_dangling_enis | jq -rc '.[]' | while read eni; do
  delete_eni $eni
done
echo "All dangling ENIs have been removed."

@github-actions
Copy link

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Mar 20, 2023
@kuzaxak
Copy link

kuzaxak commented Mar 21, 2023

/not stale

@github-actions
Copy link

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests