ENI warming is delayed for at least 1 minute, probably caused by #480 #525
Comments
Also, related to #330 |
Another issue caused by the long startup is that all pods fail with FailedCreatePodSandbox errors during that window. |
Thanks @uthark, this is a good find and something we have to investigate. |
Setting |
We set restartPolicy to Always, so we see a lot of FailedCreatePodSandbox errors from kubelet during the first minute. |
@mogren Why does the CNI plugin wait for 1 minute?
As of 1.9, Kubernetes uses quorum reads from etcd (kubernetes/kubernetes#53717). So why wait? I'd like to submit a PR to make the number of retries configurable; would you accept such a PR? |
@uthark The issue we saw with v1.4.1 back in May was that if a pod… |
I guess the root cause of this issue is that we make the node available for scheduling pods before the CNI is actually ready. |
Yes, we ended up using lifecycle hooks to taint/untaint the node. |
Also, a question: what if the CNI plugin stored the in-use IP addresses in some kind of CRD, or even as tags on the ENI? |
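For illustration only, a rough sketch of what such a CRD could look like. The `ENIAssignment` kind, group name, and fields below are hypothetical and not part of the CNI.

```yaml
# Hypothetical CRD sketching the idea of persisting pod-to-IP assignments
# outside of ipamd's in-memory state. Names and fields are made up.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: eniassignments.example.aws.dev
spec:
  group: example.aws.dev
  scope: Cluster
  names:
    plural: eniassignments
    singular: eniassignment
    kind: ENIAssignment
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                eniID:
                  type: string            # e.g. "eni-0abc..."
                nodeName:
                  type: string
                assignments:              # one entry per secondary IP handed to a pod
                  type: array
                  items:
                    type: object
                    properties:
                      namespace: { type: string }
                      podName: { type: string }
                      ip: { type: string }
```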
This is kind of affecting our production now. The problem can be traced back to #282. |
What @left4taco said. This needs to be fixed! |
Same @sc250024 and @left4taco. |
Lifecycle hook: https://gist.github.com/uthark/cd475f1dca21e2804eeda1564a1e6dc7 and example of usage: https://gist.github.com/uthark/caf919ee2d37a7d3e9536974de326136 |
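The gists aren't reproduced here, but a minimal sketch of the same idea, assuming an image that bundles `kubectl` and a made-up taint key, could look roughly like this in the aws-node DaemonSet spec:

```yaml
# Sketch only. Assumes kubectl is baked into the image and the pod's service
# account is allowed to patch nodes; the image name and taint key are placeholders.
# Note: postStart fires as soon as the container starts, so a real hook would
# typically also wait for ipamd to report healthy before removing the taint.
containers:
  - name: aws-node
    image: my-registry/amazon-k8s-cni-with-kubectl:latest   # hypothetical image
    env:
      - name: MY_NODE_NAME
        valueFrom:
          fieldRef:
            fieldPath: spec.nodeName
    lifecycle:
      postStart:
        exec:
          command:
            - /bin/sh
            - -c
            # remove the "CNI not ready" taint once the container is up
            - kubectl taint nodes "$MY_NODE_NAME" example.com/cni-not-ready:NoSchedule- || true
      preStop:
        exec:
          command:
            - /bin/sh
            - -c
            # re-taint the node when the CNI pod is going away
            - kubectl taint nodes "$MY_NODE_NAME" example.com/cni-not-ready=true:NoSchedule --overwrite
```

For the postStart removal to matter, the nodes would also need to start out carrying that taint (for example via kubelet's `--register-with-taints` flag).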
Thanks @uthark! Are we supposed to edit the VPC CNI DaemonSet with that additional script? The code of a multi-billion dollar cloud provider, saved in the end by a shell script 🙄 |
We build our own image that includes kubectl and the hook, and update the DaemonSet manifest to use it. |
@uthark I tried what you described, but a node doesn't have permission to taint itself in a default AWS EKS setup. I'm guessing you also granted the nodes extra permissions in the… |
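For reference, the sort of RBAC the taint/untaint hook needs if it runs under the `aws-node` service account (names below are illustrative; a stock EKS node setup does not grant this):

```yaml
# Sketch: let the service account running the taint/untaint hook patch nodes.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cni-node-tainter
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cni-node-tainter
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cni-node-tainter
subjects:
  - kind: ServiceAccount
    name: aws-node
    namespace: kube-system
```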
@uthark I actually got your script to work, but it doesn't work 100% of the time. A bit of a bummer, really. |
When does it not work for you?
Regards,
Oleg.
|
EDIT 2019-07-29: @uthark I actually saw why this happens. In the case of our GitLab runners, the images we use for CI are very small (usually Alpine Linux). Sometimes Kubernetes will download the image for a particular GitLab job before the CNI image has been downloaded, and will then try to run that CI job (i.e. the GitLab CI job is in …).
@uthark As far as I can tell, it's unpredictable, but the error message is always the same:
Same error as you encountered above. Sometimes it works, and sometimes it doesn't. We use this primarily to scale up / scale down GitLab runner nodes. |
We are experiencing this exact issue in our production EKS environment. What seems to happen is that ipamd gets stuck in a loop by logic introduced in PR #480. @mogren What's supposed to be happening here? In our scenario the pods never receive their pod IP, so ipamd never starts up, resulting in TCP connection errors for our CNI. |
Hi @robertsheehy-wf, I explained some of the background for this change in an earlier comment. That said, it should only retry for at most 60 seconds, then continue and assume that the pods don't have an IP. We still have to investigate why this doesn't happen in all cases. |
Thanks, sorry, I should have read more closely. Granted my knowledge is limited, but is there maybe a better strategy than waiting 60 seconds? Can we look at the pods more closely to understand why a pod doesn't have an IP? Or maybe omit the wait loop if we know the CNI has never been run on the node (since this is a problem we see on node startup)? |
This is currently affecting us in all our environments and completely broke our autoscaling in production. We schedule dynamic amounts of batch jobs in Kubernetes and autoscale our nodes based on taints/labels/requests. When the nodes autoscale, 99% of the pods fail and get stuck with sandbox-creation-failed / CNI errors. |
We’ve found that this is a problem when starting new nodes in a cluster. What happens is that a pod gets scheduled at the same time as the CNI on a new node. This is where the CNI seems to falter: if the CNI is scheduled concurrently with another pod, it sometimes gets into a state where it is unsure how to handle that pod, and the pod never receives an IP. The problem is then exacerbated by the fact that the CNI will wait 60 seconds for the eventual consistency of Kubernetes. In this situation the CNI is blocked for 60 seconds while nothing happens, because in reality the pod was never assigned an IP by anyone. This effectively makes the node unusable for 60 seconds, which is a huge problem if you are cycling nodes in large batches.

What we’ve concluded is that the CNI should always come up first on a host, before any other pods; we had assumed the CNI could gracefully handle this scenario. To resolve this we looked at managing taints/tolerations like @uthark had done. The one hangup we had with his approach was that we didn’t want to manage a fork of this repo. This is where we realized that several of our daemonsets had tolerations which tolerated all taints.

We had done this for our own reasons, but it meant that for every node we brought up there was a high probability we would hit this issue, since our daemonset pods were being stood up concurrently with the CNI. Upon further investigation we found EKS already had a taint we could use on new nodes, specifically the … taint. After fixing our daemonset tolerations we hit this issue a lot less often, since the CNI now has time to start before pods are scheduled on the node (see the sketch below). |
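A sketch of the toleration change being described, with placeholder taint keys since the exact EKS taint isn't quoted above:

```yaml
# Before (problematic): tolerate every taint, so the daemonset pod lands on a
# brand-new node immediately and can race the CNI.
tolerations:
  - operator: Exists
---
# After (scoped): tolerate only the taints this workload actually needs
# (placeholder below), so the pod stays unschedulable until the node's startup
# taint is removed, i.e. after the CNI is ready.
tolerations:
  - key: example.com/dedicated        # placeholder for your workload taint
    operator: Equal
    value: batch
    effect: NoSchedule
```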
@robertsheehy-wf Yeah, we did exactly the same. |
The linked commit appears to have alleviated the problem, at least for us. We don't see our pods getting stuck. |
The latest release, 1.5.1 RC1, solved this issue for us: when autoscaling nodes, pods no longer get stuck with sandbox errors. The linked commit plus #548 were the two patches we primarily needed. |
Reduced the wait time and added a fix to not retry too long for force-detached ENIs. |
We also found that the #480 changes cause the following interesting behavior:
Relevant log entries from IPAMD:
Also, during this 1 minute all gRPC requests fail (we see errors like "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused" during startup).