
ENIs are released when API is down, causing potential outage #431

Closed

adammw opened this issue May 1, 2019 · 7 comments
adammw (Contributor) commented May 1, 2019

We recently saw an issue with our custom Kubernetes clusters in production running the amazon-vpc-cni-k8s plugin version 1.3.0 where existing pods running on a node lost network connectivity and packets were being routed out the wrong interface. Newly scheduled pods would be unaffected and work correctly.

Upon further inspection, we found that the IP address of pods previously scheduled to the node were not present in the ENIs attached to the node. With the help of AWS CloudTrail and AWS Config, we determined that the instance had released the ENI which contained the IP addresses of the still-running pods.

Looking at the logs, it appears that the plugin's failure to contact the API server to learn which pods are running on the node leads it to conclude that none of the assigned IP addresses are in use, so it frees the ENI.

We would like this behaviour changed to fail safe when the API server is unavailable: when the running pods cannot be verified via the API, the plugin should not free the ENI, and should instead keep retrying, or fall back on some other approach to verify whether the addresses are still in use.

2019-04-25T09:16:07Z [INFO] Starting L-IPAMD v1.3.0  ...
2019-04-25T09:16:07Z [INFO] Testing communication with server
2019-04-25T09:16:07Z [INFO] Starting Pod controller
2019-04-25T09:16:07Z [INFO] Running with Kubernetes cluster version: v1.11. git version: v1.11.9. git tree state: clean. commit: 16236ce91790d4c75b79f6ce96841db1c843e7d2. platform: linux/amd64
2019-04-25T09:16:07Z [INFO] Communication with server successful
2019-04-25T09:16:07Z [INFO] Go OS/Arch: linux/amd64
2019-04-25T09:16:07Z [INFO] operator-sdk Version: 0.0.5+git
2019-04-25T09:16:07Z [INFO] Go Version: go1.10.5
2019-04-25T09:16:07Z [INFO] Watching crd.k8s.amazonaws.com/v1alpha1, ENIConfig, default, 5000000000
2019-04-25T09:16:07Z [DEBUG] Discovered region: us-west-2
2019-04-25T09:16:07Z [DEBUG] Found avalability zone: us-west-2a 
2019-04-25T09:16:07Z [DEBUG] Discovered the instance primary ip address: 10.210.192.160
2019-04-25T09:16:07Z [DEBUG] Found instance-id: i-0eaf24b557d9d0d5b 
2019-04-25T09:16:07Z [DEBUG] Found instance-type: m4.4xlarge 
2019-04-25T09:16:07Z [DEBUG] Found primary interface's mac address: 06:2e:67:da:85:16
2019-04-25T09:16:07Z [DEBUG] Found device-number: 0 
2019-04-25T09:16:07Z [DEBUG] Discovered 3 interfaces.
[snipped]
2019-04-25T09:16:17Z [INFO] GetLocalPods: informer not synced yet
2019-04-25T09:16:17Z [INFO] Not able to get local pods yet (attempt 1/12): discovery: informer not synced
2019-04-25T09:16:22Z [INFO] GetLocalPods: informer not synced yet
2019-04-25T09:16:22Z [INFO] Not able to get local pods yet (attempt 2/12): discovery: informer not synced
2019-04-25T09:16:28Z [INFO] GetLocalPods: informer not synced yet
2019-04-25T09:16:28Z [INFO] Not able to get local pods yet (attempt 3/12): discovery: informer not synced
2019-04-25T09:16:33Z [INFO] GetLocalPods: informer not synced yet
2019-04-25T09:16:33Z [INFO] Not able to get local pods yet (attempt 4/12): discovery: informer not synced
2019-04-25T09:16:38Z [INFO] GetLocalPods: informer not synced yet
2019-04-25T09:16:38Z [INFO] Not able to get local pods yet (attempt 5/12): discovery: informer not synced
2019-04-25T09:16:43Z [INFO] GetLocalPods: informer not synced yet
2019-04-25T09:16:43Z [INFO] Not able to get local pods yet (attempt 6/12): discovery: informer not synced
2019-04-25T09:16:48Z [INFO] GetLocalPods: informer not synced yet
2019-04-25T09:16:48Z [INFO] Not able to get local pods yet (attempt 7/12): discovery: informer not synced
2019-04-25T09:16:53Z [INFO] GetLocalPods: informer not synced yet
2019-04-25T09:16:54Z [INFO] Not able to get local pods yet (attempt 8/12): discovery: informer not synced
2019-04-25T09:16:59Z [INFO] GetLocalPods: informer not synced yet
2019-04-25T09:16:59Z [INFO] Not able to get local pods yet (attempt 9/12): discovery: informer not synced
2019-04-25T09:17:04Z [INFO] GetLocalPods: informer not synced yet
2019-04-25T09:17:04Z [INFO] Not able to get local pods yet (attempt 10/12): discovery: informer not synced
2019-04-25T09:17:09Z [INFO] GetLocalPods: informer not synced yet
2019-04-25T09:17:09Z [INFO] Not able to get local pods yet (attempt 11/12): discovery: informer not synced
2019-04-25T09:17:14Z [INFO] GetLocalPods: informer not synced yet
2019-04-25T09:17:14Z [INFO] Not able to get local pods yet (attempt 12/12): discovery: informer not synced
2019-04-25T09:17:19Z [WARN] During ipamd init, failed to get Pod information from Kubelet unable to get local pods, giving up
2019-04-25T09:17:24Z [DEBUG] Skip the primary ENI for need IP check
2019-04-25T09:17:25Z [DEBUG] IP pool stats: total = 58, used = 0, c.currentMaxAddrsPerENI = 29, c.maxAddrsPerENI = 29
2019-04-25T09:17:25Z [DEBUG] Start freeing eni eni-0721cbed4aa67ca3a
2019-04-25T09:17:25Z [INFO] FreeENI eni-0721cbed4aa67ca3a: IP address pool stats: free 29 addresses, total: 29, assigned: 0
2019-04-25T09:17:25Z [DEBUG] FreeENI: found a deletable ENI eni-0721cbed4aa67ca3a
2019-04-25T09:17:25Z [INFO] Trying to free eni: eni-0721cbed4aa67ca3a

/cc @zenvdeluca @grosser @hgokavarapuz @yizhang-zen

mogren (Contributor) commented May 2, 2019

Hi @adammw, thanks a lot for reporting this issue!

We are working on a few changes related to this, for example #123, #359, #377 and #401. We will take this issue into account as well when working on those.

adammw (Contributor, Author) commented Jul 10, 2019

This is still a problem with the latest version. We have experienced three cluster degradations due to this in the span of a week.

uthark (Contributor) commented Jul 12, 2019

@mogren As a workaround, if we set WARM_ENI_TARGET to the maximum available for the instance type, will that help?

mogren (Contributor) commented Jul 13, 2019

@uthark Yes, that would at least prevent the CNI plugin from detaching the ENIs, but this is still a bug: an ENI with IPs that are still in use should never be freed. We will have to look more at this case.
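As a sketch of applying that workaround (assuming the CNI runs as the standard aws-node DaemonSet in kube-system; the value to use depends on your instance type's ENI limit):

```shell
# Pin WARM_ENI_TARGET so ipamd keeps all ENIs warm instead of freeing them.
# 8 is the ENI limit for m4.4xlarge; substitute your instance type's limit.
kubectl -n kube-system set env daemonset/aws-node WARM_ENI_TARGET=8
```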

uthark (Contributor) commented Jul 26, 2019

After upgrading to 1.5, this issue no longer occurs during node startup; the CNI plugin correctly exits and restarts.
Fixed in 9b5268f

Still investigating why it starts to release ENIs if the API server becomes unavailable after the CNI plugin has initialized and is working.

jaypipes (Contributor) commented Oct 9, 2019

@uthark have you been able to reproduce the CNI plugin releasing ENIs after it has started successfully and an API server interruption has occurred? @mogren and I have been looking through the current codebase and cannot see any location where an ENI is released if the k8s API server cannot be contacted...

jaypipes (Contributor) commented

@uthark Closing this out, as we are not able to determine whether this is still an issue. Feel free to re-open if you see it occur again.
