ENI warming is delayed for at least 1 minute, probably caused by #480 #525
Comments
Also, related to #330 |
Another issue caused by the long startup is that all pods fail with FailedCreatePodSandbox errors during that window. |
Thanks @uthark, this is a good find and something we have to investigate. |
Setting |
We set restartPolicy to Always, so we see a lot of FailedCreatePodSandbox errors from kubelet during the first minute. |
@mogren Why does the CNI plugin wait for 1 minute?
As of 1.9, Kubernetes uses quorum reads from etcd (kubernetes/kubernetes#53717). So why wait? I'd like to submit a PR to make the number of retries configurable; would you accept such a PR? |
@uthark The issue we saw with v1.4.1 back in May was that if a pod… |
I guess the root cause of this issue is that we make the node available for scheduling pods before the CNI is actually ready. |
Yes, we ended up using lifecycle hooks to taint/untaint the node. |
Also, a question: what if the CNI plugin stored the in-use IP addresses in some kind of CRD, or even as tags on the ENI? |
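For illustration only, a rough sketch of what such a CRD could look like. The `ENIAssignment` kind, group name, and fields below are hypothetical and not part of the CNI.

```yaml
# Hypothetical CRD sketching the idea of persisting pod-to-IP assignments
# outside of ipamd's in-memory state. Names and fields are made up.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: eniassignments.example.aws.dev
spec:
  group: example.aws.dev
  scope: Cluster
  names:
    plural: eniassignments
    singular: eniassignment
    kind: ENIAssignment
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                eniID:
                  type: string            # e.g. "eni-0abc..."
                nodeName:
                  type: string
                assignments:              # one entry per secondary IP handed to a pod
                  type: array
                  items:
                    type: object
                    properties:
                      namespace: { type: string }
                      podName: { type: string }
                      ip: { type: string }
```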
This is kind of affecting our production now. The problem can be traced back to #282. |
What @left4taco said. This needs to be fixed! |
Same @sc250024 and @left4taco. |
Lifecycle hook: https://gist.github.com/uthark/cd475f1dca21e2804eeda1564a1e6dc7 and example of usage: https://gist.github.com/uthark/caf919ee2d37a7d3e9536974de326136 |
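The gists aren't reproduced here, but a minimal sketch of the same idea, assuming an image that bundles `kubectl` and a made-up taint key, could look roughly like this in the aws-node DaemonSet spec:

```yaml
# Sketch only. Assumes kubectl is baked into the image and the pod's service
# account is allowed to patch nodes; the image name and taint key are placeholders.
# Note: postStart fires as soon as the container starts, so a real hook would
# typically also wait for ipamd to report healthy before removing the taint.
containers:
  - name: aws-node
    image: my-registry/amazon-k8s-cni-with-kubectl:latest   # hypothetical image
    env:
      - name: MY_NODE_NAME
        valueFrom:
          fieldRef:
            fieldPath: spec.nodeName
    lifecycle:
      postStart:
        exec:
          command:
            - /bin/sh
            - -c
            # remove the "CNI not ready" taint once the container is up
            - kubectl taint nodes "$MY_NODE_NAME" example.com/cni-not-ready:NoSchedule- || true
      preStop:
        exec:
          command:
            - /bin/sh
            - -c
            # re-taint the node when the CNI pod is going away
            - kubectl taint nodes "$MY_NODE_NAME" example.com/cni-not-ready=true:NoSchedule --overwrite
```

For the postStart removal to matter, the nodes would also need to start out carrying that taint (for example via kubelet's `--register-with-taints` flag).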
Thanks @uthark! Are we supposed to edit the VPC CNI DaemonSet with that additional script? The code of a multi-billion dollar cloud provider, saved in the end by a shell script 🙄 |
We build our own image that includes kubectl and the hook, and update the DaemonSet manifest to use it. |
@uthark I tried what you described, but a node doesn't have permission to taint itself in a default AWS EKS setup. I'm guessing you also granted the nodes extra permissions in the… |
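For reference, the sort of RBAC the taint/untaint hook needs if it runs under the `aws-node` service account (names below are illustrative; a stock EKS node setup does not grant this):

```yaml
# Sketch: let the service account running the taint/untaint hook patch nodes.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cni-node-tainter
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cni-node-tainter
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cni-node-tainter
subjects:
  - kind: ServiceAccount
    name: aws-node
    namespace: kube-system
```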
@uthark I actually got your script to work, but it doesn't work 100% of the time. A bit of a bummer, really. |
When does it not work for you?
Regards,
Oleg.
|
EDIT 2019-07-29: @uthark I actually saw why this happens. In the case of our GitLab runners, the images we use for CI are very small (usually Alpine Linux). Sometimes Kubernetes will download the image for a particular GitLab job before the CNI image has been downloaded, and will then try to run that CI job (i.e. the GitLab CI job is in …).
@uthark As far as I can tell, it's unpredictable, but the error message is always the same:
Same error as you encountered above. Sometimes it works, and sometimes it doesn't. We use this primarily to scale up / scale down GitLab runner nodes. |
We are experiencing this exact issue in our production EKS environment. What seems to happen is that ipamd gets stuck in a loop by logic introduced in PR #480. @mogren What's supposed to be happening here? In our scenario the pods never receive their pod IP, so ipamd never starts up, resulting in TCP connection errors for our CNI. |
Hi @robertsheehy-wf, I explained some of the background for this change in an earlier comment. That said, it should only retry for at most 60 seconds, then continue and assume that the pods don't have an IP. We still have to investigate why this doesn't happen in all cases. |
Thanks, sorry, I should have read more closely. Granted my knowledge is limited, but is there maybe a better strategy than waiting 60 seconds? Can we look at the pods more closely to understand why a pod doesn't have an IP? Or maybe omit the wait loop if we know the CNI has never been run on the node (since this is a problem we see on node startup)? |
This is currently affecting us in all our environments and completely broke our autoscaling in production. We schedule dynamic amounts of batch jobs in Kubernetes and autoscale our nodes based on taints/labels/requests. When the nodes autoscale, 99% of the pods fail and get stuck with sandbox-creation-failed / CNI errors. |
We’ve found that this is a problem when starting new nodes in a cluster. What happens is that a pod gets scheduled at the same time as the CNI on a new node. This is where the CNI seems to falter: if the CNI is scheduled concurrently with another pod, it sometimes gets into a state where it is unsure how to handle that pod, and the pod never receives an IP. The problem is then exacerbated by the fact that the CNI will wait 60 seconds for the eventual consistency of Kubernetes. In this situation the CNI is blocked for 60 seconds while nothing happens, because in reality the pod was never assigned an IP by anyone. This effectively makes the node unusable for 60 seconds, which is a huge problem if you are cycling nodes in large batches.

What we’ve concluded is that the CNI should always come up first on a host, before any other pods; we had assumed the CNI could gracefully handle this scenario. To resolve this we looked at managing taints/tolerations like @uthark had done. The one hangup we had with his approach was that we didn’t want to manage a fork of this repo. This is where we realized that several of our daemonsets had tolerations which tolerated all taints.

We had done this for our own reasons, but it meant that for every node we brought up there was a high probability we would hit this issue, since our daemonset pods were being stood up concurrently with the CNI. Upon further investigation we found EKS already had a taint we could use on new nodes, specifically the … taint. After fixing our daemonset tolerations we hit this issue a lot less often, since the CNI now has time to start before pods are scheduled on the node (see the sketch below). |
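A sketch of the toleration change being described, with placeholder taint keys since the exact EKS taint isn't quoted above:

```yaml
# Before (problematic): tolerate every taint, so the daemonset pod lands on a
# brand-new node immediately and can race the CNI.
tolerations:
  - operator: Exists
---
# After (scoped): tolerate only the taints this workload actually needs
# (placeholder below), so the pod stays unschedulable until the node's startup
# taint is removed, i.e. after the CNI is ready.
tolerations:
  - key: example.com/dedicated        # placeholder for your workload taint
    operator: Equal
    value: batch
    effect: NoSchedule
```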
@robertsheehy-wf Yeah, we did exactly the same. |
The linked commit appears to have alleviated the problem, at least for us. We don't see our pods getting stuck. |
The latest release, 1.5.1 RC1, solved this issue for us: when autoscaling nodes, pods no longer get stuck with sandbox errors. The linked commit plus #548 were the two patches we primarily needed. |
Reduced the wait time and added a fix to not retry too long for force-detached ENIs. |
We also found that the #480 changes cause the following interesting behavior:
Relevant log entries from IPAMD:
Also, during this 1 minute all gRPC requests fail (we see errors like "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused" during startup).