Dangling ENIs without any association with Instances #1447

Closed
Buffer0x7cd opened this issue Apr 29, 2021 · 23 comments
Labels
bug stale Issue or PR is stale


@Buffer0x7cd

What happened:
During an incident in which pods were failing due to IP address exhaustion, we noticed a large number of ENIs that were allocated but not attached to any instance. Our first assumption was that these might be ENIs created to maintain the warm pool on the nodes, but after checking them we discovered that the node.k8s.amazonaws.com/instance_id tag was missing from those ENIs, which does not seem like expected behaviour.

func (cache *EC2InstanceMetadataCache) AllocENI(useCustomCfg bool, sg []*string, subnet string) (string, error) {

As far as I can see, allocation and attachment of ENIs happen together in AllocENI, so there shouldn't be a case where ENIs are allocated but not attached and are missing tags, except here (
awsUtilsErrInc("AllocENIDeleteErr", err)

where both the ENI attach and the subsequent delete failed). To verify this, I checked the Prometheus metrics for the AttachNetworkInterface API for errors, but there is no significant increase that would explain the rise in allocated ENIs.
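For anyone hitting the same symptom, the ENIs described above can be picked out of DescribeNetworkInterfaces results by their tags. The struct and helper below are a hypothetical sketch for illustration (not part of the plugin): an ENI is "dangling" in the sense of this report if it is unattached, carries the CNI's createdAt tag, but has no instance_id tag.

```go
package main

import "fmt"

// eni is a simplified record of the fields returned by EC2
// DescribeNetworkInterfaces; this type and the helper below are
// hypothetical illustrations, not part of the CNI plugin.
type eni struct {
	ID     string
	Status string // "available" means not attached to any instance
	Tags   map[string]string
}

// danglingENIs returns the IDs of unattached ENIs that carry the
// CNI's creation tag but no instance_id tag, matching the pattern
// described in this issue.
func danglingENIs(enis []eni) []string {
	var out []string
	for _, e := range enis {
		_, created := e.Tags["node.k8s.amazonaws.com/createdAt"]
		_, owned := e.Tags["node.k8s.amazonaws.com/instance_id"]
		if e.Status == "available" && created && !owned {
			out = append(out, e.ID)
		}
	}
	return out
}

func main() {
	enis := []eni{
		{ID: "eni-aaa", Status: "available", Tags: map[string]string{"node.k8s.amazonaws.com/createdAt": "2021-04-29"}},
		{ID: "eni-bbb", Status: "in-use", Tags: map[string]string{"node.k8s.amazonaws.com/instance_id": "i-123"}},
	}
	fmt.Println(danglingENIs(enis)) // only eni-aaa matches
}
```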

@jayanthvn
Contributor

Hi @Buffer0x7cd

Do you have short-lived instances/clusters? Also, do you have any node termination policy? There is one known issue (#1223): after an ENI is detached, it takes a few seconds for the ENI to be deleted, and if the node is terminated in the meantime the ENI will be left dangling in the account.

@Buffer0x7cd
Author

Hi @jayanthvn

It doesn't seem like that is the issue.

func (cache *EC2InstanceMetadataCache) freeENI(eniName string, sleepDelayAfterDetach time.Duration, maxBackoffDelay time.Duration) error {

From my understanding, in this case the ENI is first detached and then deleted. Since the ENI was attached at some point, it should have the node.k8s.amazonaws.com/instance_id tag even after being detached (there is no step that deletes tags in the freeENI method).

In our observed case, the dangling ENIs have no node.k8s.amazonaws.com/instance_id tag, which should be present if they were caused by #1223.

@jayanthvn
Contributor

Yeah, that makes sense. I quickly ran a test: I detached an ENI and I still see the instance_id tag even though the ENI is detached. Can you please open a support case?

@jayanthvn
Contributor

Hi @Buffer0x7cd

For the ENI, do you see the "node.k8s.amazonaws.com/createdAt" tag present?

@Buffer0x7cd
Author

@jayanthvn yes, I can see the node.k8s.amazonaws.com/createdAt tag present

@jayanthvn
Contributor

Thanks for checking @Buffer0x7cd. So it looks like createENI is fine, but if attachENI fails we would have deleted the ENI:

attachmentID, err := cache.attachENI(eniID)
if err != nil {
	derr := cache.deleteENI(eniID, maxENIBackoffDelay)

If you can open a support case, we can check EC2 logs to confirm why attachENI failed.
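The quoted path can be sketched as follows; the interface, mock, and error strings below are hypothetical stand-ins for illustration, not the plugin's actual code. It shows why a second failure matters: if the cleanup delete also fails, the result is an ENI that was created (so it has the createdAt tag) but was never attached (so it never received the instance_id tag), which matches the dangling ENIs reported in this issue.

```go
package main

import (
	"errors"
	"fmt"
)

// ec2API is a hypothetical stand-in for the EC2 calls used during
// ENI allocation; it is not the plugin's real client interface.
type ec2API interface {
	AttachENI(eniID string) (string, error)
	DeleteENI(eniID string) error
}

// attachOrCleanup mirrors the quoted logic: on attach failure the
// ENI is deleted, and if that delete also fails the ENI is leaked
// with only its createdAt tag (no instance_id tag was ever added).
func attachOrCleanup(api ec2API, eniID string) error {
	if _, err := api.AttachENI(eniID); err != nil {
		if derr := api.DeleteENI(eniID); derr != nil {
			return fmt.Errorf("attach failed (%v) and cleanup delete failed (%v): ENI %s is left dangling", err, derr, eniID)
		}
		return fmt.Errorf("attach failed, ENI %s deleted: %v", eniID, err)
	}
	return nil
}

// flaky simulates both calls being throttled or otherwise failing.
type flaky struct{}

func (flaky) AttachENI(string) (string, error) { return "", errors.New("AttachNetworkInterface failed") }
func (flaky) DeleteENI(string) error           { return errors.New("DeleteNetworkInterface failed") }

func main() {
	fmt.Println(attachOrCleanup(flaky{}, "eni-0example"))
}
```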

@aclevername

aclevername commented Sep 22, 2021

We've noticed this while working on https://github.com/weaveworks/eksctl/ too. We recently managed to reproduce this issue: eksctl-io/eksctl#4214 (comment)

@hiattp

hiattp commented Dec 10, 2021

We're seeing a similar/related issue, but we have cases where none of the active pods use the ENIs that are attached to the instance (the node has 2 ENIs with 10 and 1 private IP addresses respectively, and there are 13 pods on the node, none of which use those ENIs). Not sure if this is actually the same issue, but we've raised a support ticket (9328577341 9331293811). The original reason we raised the ticket was that pods were getting stuck in Pending with events like:

Warning FailedScheduling 21s (x12 over 13m) default-scheduler 0/10 nodes are available: 4 node(s) didn't match node selector, 6 Insufficient vpc.amazonaws.com/pod-eni.

And further investigation led us to this issue, but it's unclear whether the issues are related.

@GaruGaru

GaruGaru commented Feb 9, 2022

Same issue running v1.7.5-eksbuild.1 on v1.21.5-eks-9017834.
We have many unused ENI interfaces with just the node.k8s.amazonaws.com/createdAt tag set.
This is pretty important since it can lead to available interface exhaustion causing service disruption.

@github-actions

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Apr 14, 2022
@bryantbiggs
Member

Not stale

@jayanthvn jayanthvn removed the stale Issue or PR is stale label Apr 14, 2022
@jayanthvn
Contributor

@aclevername - in the issue you mentioned, we do see the node.k8s.amazonaws.com/instance_id tag. Typically this happens when the node is terminated between the detach and delete ENI calls.

@bryantbiggs or @GaruGaru - Can one of you please share IPAMD logs? You can email the log bundle to - [email protected]

@github-actions

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Jun 18, 2022
@bryantbiggs
Member

Not stale

@timblaktu

Tagging teammate @vidhyadharm about this "dangling ENI" issue, which @bryantbiggs suggested as the root cause of our VPC deletion issue in eks blueprints and the corresponding VPC deletion issue in the aws vpc module.

@github-actions

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Nov 19, 2022
@jayanthvn
Contributor

/not stale

@github-actions github-actions bot removed the stale Issue or PR is stale label Nov 20, 2022
@github-actions

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Jan 20, 2023
@github-actions

github-actions bot commented Feb 4, 2023

Issue closed due to inactivity.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Feb 4, 2023
@yukccy

yukccy commented Nov 21, 2023

Is there any fix for this issue? Coming from terraform-aws-modules/terraform-aws-vpc#283, where the VPC cannot be deleted due to a DependencyViolation.

@demisx

demisx commented Apr 9, 2024

In my case, there were nginx- and EKS-related security groups left behind after EKS deletion. Once I removed those manually via the AWS console, the VPC was destroyed within a couple of seconds.

@NathanDotTo

This still appears to be an issue. It seems that the only workaround is to manually delete the VPC.

@flaso-giron

This is still an issue.
