
ipamd assigns IPs from detached ENIs #732

Closed
tatatodd opened this issue Nov 23, 2019 · 8 comments · Fixed by #754

@tatatodd (Contributor)

I've been debugging an issue where ipamd assigns IPs from ENIs that are no longer attached to the EC2 instance. This is support case #6601452261 (I don't know whether you have access to that).

I've tracked it down to the ENI reconciliation logic. ipamd starts out with 3 ENIs and assigns IPs from them, and all is good. Then one of the ENIs gets detached, outside of ipamd's control. Here's where the badness starts: the datastore continually refuses to remove the ENI, since some of its IPs are still assigned, and ipamd continues to assign IPs from the detached ENI, meaning that our pods end up with IPs that simply don't work.

In our specific case, we still don't know why the ENI gets detached; it seems to happen shortly after our VM reboots. We also don't know why our VM is rebooting. The working theory is that our EC2 instances end up with too many combined ENI+EBS attachments, i.e. more than the limit for our instance type. Perhaps that results in a VM reboot and the subsequent ENI detachment.

Regardless of the reason, it seems clear that ipamd shouldn't continue to assign IPs from detached ENIs. I have a PR for a fix that I'll send shortly. Below are logs from v1.5.5 that demonstrate the problem, annotated while debugging.

***** 0) ipamd startup after reboot
2019-11-20T16:47:07.325Z [INFO] Starting L-IPAMD v1.5.5  ...

***** 1) ipamd discovers 3 ENIs, including eni-02797faf7a67b1590
2019-11-20T16:47:37.354Z [DEBUG]        Total number of interfaces found: 3
2019-11-20T16:47:37.356Z [DEBUG]        Found ENI: eni-02797faf7a67b1590, MAC 06:48:a2:a8:91:f0, device 3
2019-11-20T16:47:37.358Z [DEBUG]        Found ENI: eni-0ea92c15e054d06d3, MAC 06:91:7d:d1:10:9c, device 2
2019-11-20T16:47:37.360Z [DEBUG]        Found ENI: eni-0578f3df9b86e409c, MAC 06:f1:59:05:99:ac, device 0

***** 2) ipamd reconciles 3 ENIs from AWS local metadata, including eni-02797faf7a67b1590 with IP 192.168.165.244
2019-11-20T16:47:57.872Z [DEBUG]  Reconciling ENI/IP pool info...
2019-11-20T16:47:57.873Z [DEBUG]  Total number of interfaces found: 3
2019-11-20T16:47:57.874Z [DEBUG]  Found ENI: eni-02797faf7a67b1590, MAC 06:48:a2:a8:91:f0, device 3
2019-11-20T16:47:57.877Z [DEBUG]  Found ENI: eni-0ea92c15e054d06d3, MAC 06:91:7d:d1:10:9c, device 2
2019-11-20T16:47:57.880Z [DEBUG]  Found ENI: eni-0578f3df9b86e409c, MAC 06:f1:59:05:99:ac, device 0
2019-11-20T16:47:57.882Z [DEBUG]  Adding ENI(eni-02797faf7a67b1590)'s IPv4 address 192.168.165.244 to datastore
2019-11-20T16:47:57.882Z [DEBUG]  Reconciled IP 192.168.165.244 on ENI eni-02797faf7a67b1590

***** 3) k8s asks ipamd to add a new network for pod fluentd-cloudwatch-chh96
***** ipamd assigns IP 192.168.165.244 from eni-02797faf7a67b1590
2019-11-20T16:48:00.711Z [INFO] Received AddNetwork for NS /proc/11349/ns/net, Pod fluentd-cloudwatch-chh96, NameSpace amazon-cloudwatch, Container e415008a45cc9f34eed550c837c69229df80dc3c2a6ad45dab0c9309d1d3f960, ifname eth0
2019-11-20T16:48:00.711Z [DEBUG]  AssignIPv4Address: IP address pool stats: total: 42, assigned 7
2019-11-20T16:48:00.711Z [INFO] AssignPodIPv4Address: Assign IP 192.168.165.244 to pod (name fluentd-cloudwatch-chh96, namespace amazon-cloudwatch container e415008a45cc9f34eed550c837c69229df80dc3c2a6ad45dab0c9309d1d3f960)
2019-11-20T16:48:00.711Z [INFO] Send AddNetworkReply: IPv4Addr 192.168.165.244, DeviceNumber: 3, err: <nil>

***** 4) ipamd updates pod fluentd-cloudwatch-chh96 from k8s, with the IP assigned above in (3)
2019-11-20T16:48:27.106Z [INFO] Add/Update for Pod fluentd-cloudwatch-chh96 on my node, namespace = amazon-cloudwatch, IP = 192.168.165.244

***** 5) ipamd reconciles 2 ENIs from AWS local metadata, not including eni-02797faf7a67b1590
2019-11-20T16:48:57.946Z [DEBUG]  Reconciling ENI/IP pool info...
2019-11-20T16:48:57.958Z [DEBUG]  Total number of interfaces found: 2
2019-11-20T16:48:57.965Z [DEBUG]  Found ENI: eni-0ea92c15e054d06d3, MAC 06:91:7d:d1:10:9c, device 2
2019-11-20T16:48:57.977Z [DEBUG]  Found ENI: eni-0578f3df9b86e409c, MAC 06:f1:59:05:99:ac, device 0

***** THINGS GO BAD FROM HERE

***** 6) hereafter ipamd won't delete detached eni-02797faf7a67b1590 because it has pods assigned
2019-11-20T16:48:57.981Z [INFO] Reconcile and delete detached ENI eni-02797faf7a67b1590
2019-11-20T16:48:57.981Z [ERROR]  IP pool reconcile: Failed to delete ENI during reconcile: datastore: ENI is used and can not be deleted
2019-11-20T16:49:58.004Z [ERROR]  IP pool reconcile: Failed to delete ENI during reconcile: datastore: ENI is used and can not be deleted
2019-11-20T16:50:58.017Z [ERROR]  IP pool reconcile: Failed to delete ENI during reconcile: datastore: ENI is used and can not be deleted
2019-11-20T16:51:58.032Z [ERROR]  IP pool reconcile: Failed to delete ENI during reconcile: datastore: ENI is used and can not be deleted

***** 7) in addition ipamd continues to assign IPs from eni-02797faf7a67b1590 (device 3)
2019-11-20T19:01:37.043Z [INFO] Send AddNetworkReply: IPv4Addr 192.168.167.47, DeviceNumber: 3, err: <nil>
@tatatodd (Contributor, Author)

BTW, if you have suggestions for how to debug why our ENIs are getting detached in the first place, I'm all ears. We're using EKS in what we thought was a plain-vanilla configuration.

@mogren (Contributor)

mogren commented Nov 27, 2019

@tatatodd If you can find the time when the ENI was detached, it should be possible to search in CloudTrail for the DetachNetworkInterface event.
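For anyone else chasing this, a lookup along these lines should surface the event. The time window below is just an example based on the logs in this thread, and it requires AWS CLI credentials with `cloudtrail:LookupEvents`:

```shell
# Search CloudTrail for DetachNetworkInterface events in a suspect window.
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=DetachNetworkInterface \
  --start-time 2019-11-20T16:40:00Z \
  --end-time 2019-11-20T17:00:00Z \
  --query 'Events[].{time:EventTime,name:EventName,resources:Resources[].ResourceName}'
```

Note that CloudTrail event delivery lags by several minutes, so a just-detached ENI may not show up immediately.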

@tatatodd (Contributor, Author)

Thanks @mogren!

I looked in CloudTrail, and may have a lead. Here are the full events from CloudTrail for eni-02797faf7a67b1590 from the logs above. Based on timestamps, everything here happened before the events in the logs above (i.e. before the reboot). The user corresponding to these events is the EC2 instance itself.

2019-11-20T16:45:13Z i-0daa24c7655912dc9 CreateTags                      eni-02797faf7a67b1590
2019-11-20T16:45:13Z i-0daa24c7655912dc9 CreateNetworkInterface          eni-02797faf7a67b1590
2019-11-20T16:45:14Z i-0daa24c7655912dc9 ModifyNetworkInterfaceAttribute eni-02797faf7a67b1590
2019-11-20T16:45:14Z i-0daa24c7655912dc9 AssignPrivateIpAddresses        eni-02797faf7a67b1590
2019-11-20T16:45:14Z i-0daa24c7655912dc9 AttachNetworkInterface          eni-attach-091ebf0a039db8ad9 eni-02797faf7a67b1590 i-0daa24c7655912dc9
2019-11-20T16:46:25Z i-0daa24c7655912dc9 DetachNetworkInterface          eni-attach-091ebf0a039db8ad9
2019-11-20T16:46:25Z i-0daa24c7655912dc9 DeleteNetworkInterface          eni-02797faf7a67b1590 Client.InvalidParameterValue "Network interface 'eni-02797faf7a67b1590' is currently in use."

Note that the final DeleteNetworkInterface fails because the ENI is still in use, which is common (there's a retry loop to deal with this). Here's the corresponding logs.

2019-11-20T16:45:08.317Z [DEBUG]        IP pool is NOT too low: available (14) >= ENI target (1) * addrsPerENI (14)
2019-11-20T16:45:08.317Z [DEBUG]        IP pool stats: total = 28, used = 14, c.maxIPsPerENI = 14
2019-11-20T16:45:08.317Z [DEBUG]        Its NOT possible to remove extra ENIs because available (14) <= ENI target (1) * addrsPerENI (14):
2019-11-20T16:45:10.817Z [DEBUG]        nodeIPPoolReconcile: skipping because time since last 35.014702283s <= 1m0s
2019-11-20T16:45:12.893Z [INFO] Received AddNetwork for NS /proc/2677/ns/net, Pod cleanup-circleci-kbvcs, NameSpace circleci-regression, Container 45172ff43e95a20da8abc290d5f473072c01a52ec7d906e9a87944c45172e869, ifname eth0
2019-11-20T16:45:12.893Z [DEBUG]        AssignIPv4Address: IP address pool stats: total: 28, assigned 14
2019-11-20T16:45:12.893Z [INFO] AssignPodIPv4Address: Assign IP 192.168.176.47 to pod (name cleanup-circleci-kbvcs, namespace circleci-regression container 45172ff43e95a20da8abc290d5f473072c01a52ec7d906e9a87944c45172e869)
2019-11-20T16:45:12.893Z [DEBUG]        VPC CIDR 192.168.0.0/16
2019-11-20T16:45:12.893Z [INFO] Send AddNetworkReply: IPv4Addr 192.168.176.47, DeviceNumber: 0, err: <nil>
2019-11-20T16:45:13.317Z [DEBUG]        IP pool stats: total = 28, used = 15, c.maxIPsPerENI = 14
2019-11-20T16:45:13.317Z [DEBUG]        IP pool is too low: available (13) < ENI target (1) * addrsPerENI (14)
2019-11-20T16:45:13.317Z [DEBUG]        Starting to increase IP pool size
2019-11-20T16:45:13.317Z [INFO] createENI: use primary interface's config, [0xc0005ec0d0], subnet-012b699d36e009bb0
2019-11-20T16:45:13.444Z [INFO] Created a new ENI: eni-02797faf7a67b1590
2019-11-20T16:45:14.591Z [INFO] Successfully created and attached a new ENI eni-02797faf7a67b1590 to instance
...
2019-11-20T16:45:31.433Z [INFO] UnassignPodIPv4Address: pod (Name: cleanup-circleci-kbvcs, NameSpace circleci-regression Container 45172ff43e95a20da8abc290d5f473072c01a52ec7d906e9a87944c45172e869)'s ipAddr 192.168.176.47, DeviceNumber0
2019-11-20T16:45:31.433Z [INFO] Send DelNetworkReply: IPv4Addr 192.168.176.47, DeviceNumber: 0, err: <nil>
2019-11-20T16:45:35.011Z [DEBUG]        IP pool stats: total = 42, used = 14, c.maxIPsPerENI = 14
2019-11-20T16:45:35.011Z [DEBUG]        IP pool is NOT too low: available (28) >= ENI target (1) * addrsPerENI (14)
2019-11-20T16:45:35.011Z [DEBUG]        IP pool stats: total = 42, used = 14, c.maxIPsPerENI = 14
2019-11-20T16:45:35.011Z [DEBUG]        It might be possible to remove extra ENIs because available (28) > ENI target (1) * addrsPerENI (14):
2019-11-20T16:45:35.011Z [DEBUG]        ENI eni-0578f3df9b86e409c cannot be deleted because it is primary
2019-11-20T16:45:35.011Z [DEBUG]        ENI eni-0ea92c15e054d06d3 cannot be deleted because it has pods assigned
2019-11-20T16:45:35.011Z [DEBUG]        ENI eni-02797faf7a67b1590 cannot be deleted because it is too young
...
2019-11-20T16:45:40.012Z [DEBUG]        ENI eni-02797faf7a67b1590 cannot be deleted because it is too young
2019-11-20T16:45:45.012Z [DEBUG]        ENI eni-02797faf7a67b1590 cannot be deleted because it is too young
2019-11-20T16:45:50.013Z [DEBUG]        ENI eni-02797faf7a67b1590 cannot be deleted because it is too young
2019-11-20T16:45:55.013Z [DEBUG]        ENI eni-02797faf7a67b1590 cannot be deleted because it is too young
2019-11-20T16:46:00.014Z [DEBUG]        ENI eni-02797faf7a67b1590 cannot be deleted because it is too young
2019-11-20T16:46:05.014Z [DEBUG]        ENI eni-02797faf7a67b1590 cannot be deleted because it is too young
2019-11-20T16:46:10.014Z [DEBUG]        ENI eni-02797faf7a67b1590 cannot be deleted because it is too young
...
2019-11-20T16:46:15.015Z [DEBUG]        IP pool stats: total = 42, used = 13, c.maxIPsPerENI = 14
2019-11-20T16:46:15.015Z [DEBUG]        IP pool is NOT too low: available (29) >= ENI target (1) * addrsPerENI (14)
2019-11-20T16:46:15.015Z [DEBUG]        IP pool stats: total = 42, used = 13, c.maxIPsPerENI = 14
2019-11-20T16:46:15.015Z [DEBUG]        It might be possible to remove extra ENIs because available (29) > ENI target (1) * addrsPerENI (14):
2019-11-20T16:46:15.015Z [DEBUG]        ENI eni-0578f3df9b86e409c cannot be deleted because it is primary
2019-11-20T16:46:15.015Z [DEBUG]        ENI eni-0ea92c15e054d06d3 cannot be deleted because it has pods assigned
2019-11-20T16:46:15.015Z [DEBUG]        ENI eni-02797faf7a67b1590 cannot be deleted because it is too young
2019-11-20T16:46:15.015Z [DEBUG]        No ENI can be deleted at this time
2019-11-20T16:46:15.015Z [INFO] No ENI to remove, all ENIs have IPs in use
... ### VM reboots here, with a gap in the logs 
2019-11-20T16:47:07.325Z [INFO] Starting L-IPAMD v1.5.5  ...

Here's what the logs tell us:
16:45:12 A new pod cleanup-circleci-kbvcs is assigned.
16:45:13 ipamd creates a new ENI to maintain the warm pool target.
16:45:31 Pod cleanup-circleci-kbvcs is unassigned.
16:46:15 ipamd can't delete ENI because it's still too young (< 1min).
... reboot occurs sometime in here
16:47:07 New ipamd starts up.

Here's the theory:

  • The seelog package that ipamd uses does buffering, so we lost logs when the reboot happened without a clean log flush.
  • The detachment of the ENI at 16:46:25 was working as intended; it happened once the ENI had lived long enough (> 1 min).

And here's what the post-reboot logs (in my first post) tell us:

  • After the VM rebooted and ipamd started, it thought the ENI was still attached and assigned some pods to it. Thereafter it refused to remove the ENI from its datastore, even though the local metadata eventually reported that the ENI was no longer attached.

But here are some open questions:

  1. Do we expect this kind of inconsistency in the EC2 local metadata with respect to network interfaces? I.e. is it possible that the pre-reboot ipamd managed to detach the ENI, but EC2 local metadata still told the post-reboot ipamd that the ENI was attached?
  2. Why did the VM reboot in the first place? Our best guess so far is that we're overrunning the ENI+EBS attachment limit, and Linux is getting unhappy and spontaneously rebooting.

@tatatodd (Contributor, Author)

Answering my own question 1 from the previous post: given the retry loops around DeleteNetworkInterface and the CloudTrail events, I guess it is expected that the EC2 local metadata might still think the ENI is attached, even after a successful DetachNetworkInterface call.

So I guess the core problem here is that an inopportune VM reboot right after an ENI was asked to be detached (but before it was actually detached) caused the ENI to be assigned some pods, and thereafter ipamd would continue to assign pods to that (eventually truly detached) ENI.

I think my PR #733 fixes part of the problem, ensuring that once ENI reconciliation has run at least once, ipamd won't assign new pod IPs from the detached ENI.
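Conceptually, the reconcile step needs to do something like the sketch below: any datastore ENI that the instance metadata no longer reports is at least taken out of the assignable pool, even if pods still hold IPs on it. This is my reading of the intent, not the actual code in the PR:

```go
package main

import "fmt"

// Simplified ENI state for illustration; not the real datastore types.
type ENIState struct {
	AssignedIPs int
	Assignable  bool
}

// reconcile compares the datastore against the set of ENIs the instance
// metadata currently reports, and quarantines anything that has vanished.
func reconcile(datastore map[string]*ENIState, metadataENIs map[string]bool) {
	for id, eni := range datastore {
		if metadataENIs[id] {
			continue // still attached, nothing to do
		}
		if eni.AssignedIPs == 0 {
			delete(datastore, id) // safe to drop entirely
			continue
		}
		// Pods still reference IPs on a detached ENI: we can't delete it,
		// but we must stop handing out new IPs from it.
		eni.Assignable = false
	}
}

func main() {
	ds := map[string]*ENIState{
		"eni-attached": {AssignedIPs: 3, Assignable: true},
		"eni-detached": {AssignedIPs: 1, Assignable: true},
	}
	reconcile(ds, map[string]bool{"eni-attached": true})
	fmt.Println(ds["eni-detached"].Assignable) // false: no new IPs from it
}
```

This stops new assignments, but note it deliberately does not touch the pods already holding IPs on the detached ENI, which is exactly the remaining gap described below.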

But there's still the problem that there doesn't seem to be a mechanism to fix existing pods that have already been assigned IPs from the detached ENI. I.e. we've been running v1.5.5 patched with #733 on our clusters, and we've seen this prometheus counter and warning in the logs:

ipamdErrInc("nodeInitAssignPodIPv4AddressFailed")

Our only recourse so far is to identify the broken pods when this happens, and manually delete them. Any thoughts on when that will be fixed?

@mogren (Contributor)

mogren commented Nov 28, 2019

@tatatodd Yes, great debugging on your part here; I'm not sure when and how we will get this completely solved.

To detach an ENI and return those IPs to the subnet, the EC2 API requires you to first call DetachNetworkInterface, then wait at least 2 seconds (usually a bit more), then make the DeleteNetworkInterface call. If ipamd restarts or the node restarts at that time, we will lose the ENI. We have made changes in the v1.6.0 branch to try to reduce the leaking, and also to find and delete already-leaked ENIs.

The on-instance metadata service can be very out of date; 30 seconds is common, but we have seen up to 9 minutes on older instance types. What happens is that after a detach, the EC2 control plane knows the ENI is detached, but the instance metadata will still show it as attached, and it might get added back to the data store when ipamd restarts.
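One way to guard against that stale-metadata window on startup would be to confirm each metadata-reported ENI against the control plane before trusting it. The interface and names below are hypothetical (the real daemon would go through the EC2 SDK, e.g. DescribeNetworkInterfaces); this only sketches the idea:

```go
package main

import "fmt"

// ec2Checker is a hypothetical stand-in for a control-plane query that
// reports whether an ENI is currently attached to the given instance.
type ec2Checker interface {
	IsAttached(eniID, instanceID string) (bool, error)
}

// trustedENIs filters the metadata-reported ENI list down to those the
// control plane confirms as attached, skipping stale entries.
func trustedENIs(metadataENIs []string, instanceID string, ec2 ec2Checker) []string {
	var out []string
	for _, id := range metadataENIs {
		attached, err := ec2.IsAttached(id, instanceID)
		if err != nil || !attached {
			continue // stale metadata entry: skip rather than poison the pool
		}
		out = append(out, id)
	}
	return out
}

// fakeEC2 stands in for the control plane in this sketch.
type fakeEC2 map[string]bool

func (f fakeEC2) IsAttached(eniID, _ string) (bool, error) { return f[eniID], nil }

func main() {
	meta := []string{"eni-ok", "eni-stale"} // metadata still lists both
	cp := fakeEC2{"eni-ok": true}           // control plane knows one is detached
	fmt.Println(trustedENIs(meta, "i-0daa24c7655912dc9", cp))
}
```

The trade-off is an extra API call per ENI at startup; skipping on error is conservative (better to under-count attached ENIs than to hand out dead IPs).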

@tatatodd (Contributor, Author)

tatatodd commented Dec 2, 2019

@mogren Thanks for the info.

What do you think about PR #733 to at least prevent ipamd from continuing to assign new IPs from the detached ENI?

@mogren (Contributor)

mogren commented Dec 3, 2019

@tatatodd Because of all the issues we have seen with the instance metadata being out of sync, I'm a bit wary of using it as the source of truth. Even if it's on the instance, it's not always correct. That said, we should definitely not use IPs from a detached ENI. We will need to investigate this more.

@tatatodd (Contributor, Author)

tatatodd commented Dec 3, 2019

@mogren Gotcha, that's reasonable. Let me know if you'd like me to try anything out; we've been running into issues related to this on roughly a weekly basis in our clusters, so we're happy to help. :)
