
ENIs are released when API is down, causing potential outage #431

Closed

adammw opened this issue May 1, 2019 · 7 comments
adammw (Contributor) commented May 1, 2019

We recently saw an issue with our custom Kubernetes clusters in production running the amazon-vpc-cni-k8s plugin version 1.3.0 where existing pods running on a node lost network connectivity and packets were being routed out the wrong interface. Newly scheduled pods would be unaffected and work correctly.

Upon further inspection, we found that the IP address of pods previously scheduled to the node were not present in the ENIs attached to the node. With the help of AWS CloudTrail and AWS Config, we determined that the instance had released the ENI which contained the IP addresses of the still-running pods.

Looking at the logs, it appears that the plugin's failure to contact the API server to learn which pods are running on the node leads it to conclude that none of the assigned IP addresses are in use, so it frees the ENI.

We would like this behaviour changed to fail safe when the API server is unavailable: when the running pods cannot be verified via the API, the plugin should not free the ENI, and should instead keep retrying, or fall back on some other approach to verify whether the addresses are still in use.

2019-04-25T09:16:07Z [INFO] Starting L-IPAMD v1.3.0  ...
2019-04-25T09:16:07Z [INFO] Testing communication with server
2019-04-25T09:16:07Z [INFO] Starting Pod controller
2019-04-25T09:16:07Z [INFO] Running with Kubernetes cluster version: v1.11. git version: v1.11.9. git tree state: clean. commit: 16236ce91790d4c75b79f6ce96841db1c843e7d2. platform: linux/amd64
2019-04-25T09:16:07Z [INFO] Communication with server successful
2019-04-25T09:16:07Z [INFO] Go OS/Arch: linux/amd64
2019-04-25T09:16:07Z [INFO] operator-sdk Version: 0.0.5+git
2019-04-25T09:16:07Z [INFO] Go Version: go1.10.5
2019-04-25T09:16:07Z [INFO] Watching crd.k8s.amazonaws.com/v1alpha1, ENIConfig, default, 5000000000
2019-04-25T09:16:07Z [DEBUG] Discovered region: us-west-2
2019-04-25T09:16:07Z [DEBUG] Found avalability zone: us-west-2a 
2019-04-25T09:16:07Z [DEBUG] Discovered the instance primary ip address: 10.210.192.160
2019-04-25T09:16:07Z [DEBUG] Found instance-id: i-0eaf24b557d9d0d5b 
2019-04-25T09:16:07Z [DEBUG] Found instance-type: m4.4xlarge 
2019-04-25T09:16:07Z [DEBUG] Found primary interface's mac address: 06:2e:67:da:85:16
2019-04-25T09:16:07Z [DEBUG] Found device-number: 0 
2019-04-25T09:16:07Z [DEBUG] Discovered 3 interfaces.
[snipped]
2019-04-25T09:16:17Z [INFO] GetLocalPods: informer not synced yet
2019-04-25T09:16:17Z [INFO] Not able to get local pods yet (attempt 1/12): discovery: informer not synced
2019-04-25T09:16:22Z [INFO] GetLocalPods: informer not synced yet
2019-04-25T09:16:22Z [INFO] Not able to get local pods yet (attempt 2/12): discovery: informer not synced
2019-04-25T09:16:28Z [INFO] GetLocalPods: informer not synced yet
2019-04-25T09:16:28Z [INFO] Not able to get local pods yet (attempt 3/12): discovery: informer not synced
2019-04-25T09:16:33Z [INFO] GetLocalPods: informer not synced yet
2019-04-25T09:16:33Z [INFO] Not able to get local pods yet (attempt 4/12): discovery: informer not synced
2019-04-25T09:16:38Z [INFO] GetLocalPods: informer not synced yet
2019-04-25T09:16:38Z [INFO] Not able to get local pods yet (attempt 5/12): discovery: informer not synced
2019-04-25T09:16:43Z [INFO] GetLocalPods: informer not synced yet
2019-04-25T09:16:43Z [INFO] Not able to get local pods yet (attempt 6/12): discovery: informer not synced
2019-04-25T09:16:48Z [INFO] GetLocalPods: informer not synced yet
2019-04-25T09:16:48Z [INFO] Not able to get local pods yet (attempt 7/12): discovery: informer not synced
2019-04-25T09:16:53Z [INFO] GetLocalPods: informer not synced yet
2019-04-25T09:16:54Z [INFO] Not able to get local pods yet (attempt 8/12): discovery: informer not synced
2019-04-25T09:16:59Z [INFO] GetLocalPods: informer not synced yet
2019-04-25T09:16:59Z [INFO] Not able to get local pods yet (attempt 9/12): discovery: informer not synced
2019-04-25T09:17:04Z [INFO] GetLocalPods: informer not synced yet
2019-04-25T09:17:04Z [INFO] Not able to get local pods yet (attempt 10/12): discovery: informer not synced
2019-04-25T09:17:09Z [INFO] GetLocalPods: informer not synced yet
2019-04-25T09:17:09Z [INFO] Not able to get local pods yet (attempt 11/12): discovery: informer not synced
2019-04-25T09:17:14Z [INFO] GetLocalPods: informer not synced yet
2019-04-25T09:17:14Z [INFO] Not able to get local pods yet (attempt 12/12): discovery: informer not synced
2019-04-25T09:17:19Z [WARN] During ipamd init, failed to get Pod information from Kubelet unable to get local pods, giving up
2019-04-25T09:17:24Z [DEBUG] Skip the primary ENI for need IP check
2019-04-25T09:17:25Z [DEBUG] IP pool stats: total = 58, used = 0, c.currentMaxAddrsPerENI = 29, c.maxAddrsPerENI = 29
2019-04-25T09:17:25Z [DEBUG] Start freeing eni eni-0721cbed4aa67ca3a
2019-04-25T09:17:25Z [INFO] FreeENI eni-0721cbed4aa67ca3a: IP address pool stats: free 29 addresses, total: 29, assigned: 0
2019-04-25T09:17:25Z [DEBUG] FreeENI: found a deletable ENI eni-0721cbed4aa67ca3a
2019-04-25T09:17:25Z [INFO] Trying to free eni: eni-0721cbed4aa67ca3a

/cc @zenvdeluca @grosser @hgokavarapuz @yizhang-zen

mogren (Contributor) commented May 2, 2019

Hi @adammw, thanks a lot for reporting this issue!

We are working on a few changes related to this, for example #123, #359, #377 and #401. We will take this issue into account as well when working on those.

adammw (Contributor, Author) commented Jul 10, 2019

This is still a problem with the latest version. We have experienced three cluster degradations due to this in the span of a week.

uthark (Contributor) commented Jul 12, 2019

@mogren As a workaround, if we set WARM_ENI_TARGET to the maximum available for the instance type, will that help?

mogren (Contributor) commented Jul 13, 2019

@uthark Yes, that would at least prevent the CNI plugin from detaching the ENIs, but this is still a bug: an ENI with IPs that are still in use should never be freed. We will have to look more at this case.
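As a sketch of applying that workaround (assuming the CNI runs as the standard aws-node DaemonSet in kube-system; the value to use depends on your instance type's ENI limit):

```shell
# Pin WARM_ENI_TARGET so ipamd keeps all ENIs warm instead of freeing them.
# 8 is the ENI limit for m4.4xlarge; substitute your instance type's limit.
kubectl -n kube-system set env daemonset/aws-node WARM_ENI_TARGET=8
```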

uthark (Contributor) commented Jul 26, 2019

After upgrading to 1.5, this issue no longer occurs during node startup; the CNI plugin correctly exits and restarts.
Fixed in 9b5268f

Still investigating why it starts to release ENIs if the API server becomes unavailable after the CNI plugin has initialized and is working.

jaypipes (Contributor) commented Oct 9, 2019

@uthark have you been able to reproduce the CNI plugin releasing ENIs after it has started successfully and an API server interruption has occurred? @mogren and I have been looking through the current codebase and cannot see any location where an ENI is released if the k8s API server cannot be contacted...

jaypipes (Contributor) commented

@uthark Closing this out, as we are not able to determine whether this is still an issue. Feel free to re-open if you see it occur again.
