ENIs are released when API is down, causing potential outage #431
Comments
This is still a problem with the latest version. We have experienced three cluster degradations due to this so far in the span of a week.
@mogren As a workaround, if we set
@uthark Yes, that would at least prevent the CNI plugin from detaching the ENIs, but this is still a bug. An ENI with IPs that are still in use should never be freed. Will have to look more at this case.
After upgrading to 1.5, this issue no longer occurs during node startup: the CNI plugin correctly exits and restarts. Still investigating why it starts to release ENIs if the API server becomes unavailable after the CNI plugin has initialized and is working.
@uthark have you been able to reproduce the CNI plugin releasing ENIs after a successful startup of the CNI plugin, once an API server interruption has occurred? @mogren and I have been looking through the current codebase and cannot see any location where an ENI is released if the k8s API server cannot be contacted...
@uthark closing this out as we are not able to determine whether this is still an issue. Feel free to re-open if you see this occur again.
We recently saw an issue with our custom Kubernetes clusters in production running the amazon-vpc-cni-k8s plugin version 1.3.0, where existing pods running on a node lost network connectivity and packets were being routed out the wrong interface. Newly scheduled pods were unaffected and worked correctly.
Upon further inspection, we found that the IP addresses of pods previously scheduled to the node were not present on the ENIs attached to the node. With the help of AWS CloudTrail and AWS Config, we determined that the instance had released the ENI which contained the IP addresses of the still-running pods.
Looking at the logs, it appears that the plugin's failure to contact the API server to learn about pods running on the node causes the plugin to believe that none of the assigned IP addresses are in use, and it therefore frees the ENI.
We would like this behaviour changed to fail safe when the API server is unavailable. That is, when the pods running on the node cannot be verified through the API, the plugin should not free the ENI, and should instead keep retrying or fall back on some other approach to verify whether the addresses are still in use.
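To illustrate the fail-safe behaviour we are asking for, here is a minimal sketch in Python. It is not the plugin's code (the plugin is written in Go), and every name in it (`ENI`, `APIServerUnavailable`, `enis_safe_to_release`, `list_pod_ips`) is hypothetical. The point is only that "could not list pods" is treated as "assume every IP is in use" rather than "no IPs are in use".

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


class APIServerUnavailable(Exception):
    """Hypothetical error raised when the pod list cannot be fetched."""


@dataclass(frozen=True)
class ENI:
    """Hypothetical view of an attached ENI and the IPs assigned to it."""
    eni_id: str
    ips: frozenset


def enis_safe_to_release(
    enis: List[ENI],
    list_pod_ips: Callable[[], Iterable[str]],
) -> List[ENI]:
    """Return the ENIs that carry no pod IP, or nothing if pod state is unknown."""
    try:
        in_use = set(list_pod_ips())
    except APIServerUnavailable:
        # Fail safe: without a verified pod list, release nothing.
        # The caller is expected to retry on its next reconcile pass.
        return []
    # An ENI is a release candidate only if none of its IPs belong to a pod.
    return [eni for eni in enis if not (eni.ips & in_use)]
```

With this shape, an API server outage yields an empty release list, so ENIs holding addresses of still-running pods stay attached until the pod list can be verified again.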
/cc @zenvdeluca @grosser @hgokavarapuz @yizhang-zen