Unresponsive aws-node CNI 1.7.5 #1372

Closed
nmckeown opened this issue Feb 3, 2021 · 9 comments
@nmckeown

nmckeown commented Feb 3, 2021

What happened:
We have a service that talks to hundreds of AWS accounts from certain pods. Every day or so, when registering new accounts, we lose network connectivity to these pods. We have noticed that aws-node on the affected nodes is crashing: it is unresponsive, and we cannot connect to it or retrieve its logs.

Readiness and Liveness probes for aws-node on these nodes are constantly failing:
rpc error: code = DeadlineExceeded desc = context deadline exceeded

A broken pipe error is observed in the node syslogs for aws-node pods:
dockerd: time="2021-01-31T18:30:11.456914780Z" level=error msg="Handler for GET /containers/4bea941465f4455916f49313de3109c3b4e56dca7c033f8ebb1bfb3c536bbad9/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"

What you expected to happen:
Is there any known issue with the CNI and pods that are very busy on the network? Multiple GBs may be transferring to these pods in any given second.

Environment:
CNI: 1.7.5
Kube-Proxy: v1.15.11
CoreDNS: v1.6.6
EKS: 1.15.12-eks-31566f
OS: 5.4.80-40.140.amzn2.x86_64
Docker Engine: 19.3.6

@nmckeown nmckeown added the bug label Feb 3, 2021
@jayanthvn
Contributor

@nmckeown - Can you please share CNI logs by running the log collector script?

sudo bash /opt/cni/bin/aws-cni-support.sh

@nmckeown
Author

nmckeown commented Feb 4, 2021

Thanks for responding, @jayanthvn. We don't have access to these nodes per our security policy. Do you know of any other way to collect those logs, e.g. via CloudWatch, kubectl proxy, etc.?
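For reference, the only thing we can run from outside the nodes is something like the following (just a sketch; the label selector assumes the default aws-node DaemonSet manifest, and this only captures the container's stdout, not the full IPAMD bundle):

# Pull aws-node container logs without node access (stdout only)
kubectl logs -n kube-system -l k8s-app=aws-node -c aws-node --tail=500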

@jayanthvn
Contributor

Hi @nmckeown

Sorry for the delayed response. These logs come from IPAMD on the node, so the script has to be run on the node itself to collect them. If you can open a support ticket and share the cluster ARN with us, we can help retrieve the aws-node logs.

Thanks.

@nmckeown
Author

nmckeown commented Feb 9, 2021

Hi @jayanthvn, no worries. We actually found a fix for this, and it turned out to be EBS performance: we were exhausting all of our I/O burst credits, so write operations were failing. Increasing our root volume to 1 TB resolved this. Thanks for responding.
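In case it helps anyone else hitting this, a rough sketch of how the exhaustion can be confirmed from CloudWatch (the volume ID and time window below are placeholders):

# BurstBalance is the EBS CloudWatch metric for remaining gp2 burst credits;
# a sustained 0 means the volume is throttled down to its baseline IOPS.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS \
  --metric-name BurstBalance \
  --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
  --start-time 2021-02-08T00:00:00Z \
  --end-time 2021-02-09T00:00:00Z \
  --period 300 \
  --statistics Minimum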

@jayanthvn
Contributor

Thanks for letting me know :) Glad it's fixed.

@DmitriyStoyanov

DmitriyStoyanov commented Feb 16, 2021

Looks like we are facing the same problem with CNI 1.7.5 and EKS 1.18.
In the logs I found something like:
Feb 16 16:14:12 ip-10-13-9-254.ec2.internal kubelet[4826]: E0216 16:14:12.816601 4826 remote_runtime.go:351] ExecSync 8d6eacbf23... '/app/grpc-health-probe -addr=:50051' from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded

In Prometheus I found that nodes where the exec_sync metric reaches 10 seconds have this problem, and pods cannot be started there.
For the moment, to help the cluster, I terminate such node instances and the cluster works fine again. But after roughly two days I see the same problem again and pods cannot start.
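A rough way to spot affected nodes without Prometheus is something like the following (the node name is just the one from the log above, and the metric name is an assumption about the kubelet's runtime-operation histogram on 1.18):

# Scrape a node's kubelet metrics through the API server proxy and look
# at exec_sync runtime-operation latencies (metric name assumed)
kubectl get --raw "/api/v1/nodes/ip-10-13-9-254.ec2.internal/proxy/metrics" \
  | grep exec_sync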

@DmitriyStoyanov

And the same here: the burst I/O balance goes to 0.
(screenshot: EBS BurstBalance metric dropping to zero)

@nmckeown
Author

Hi @DmitriyStoyanov, yeah, if your balance goes to zero it takes time to recover, and in the meantime operations will fail. The easiest/cheapest option for us was to just increase the volume size to 1 TB so we don't rely on I/O credits at all. Other options are moving from gp2 to gp3 or looking at Provisioned IOPS.
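For reference, a minimal sketch of the gp3 route (the volume ID is a placeholder); gp3 gives a 3,000 IOPS / 125 MB/s baseline with no burst credits, and an existing gp2 volume can be changed in place:

# Convert an existing gp2 volume to gp3 without detaching it
aws ec2 modify-volume \
  --volume-id vol-0123456789abcdef0 \
  --volume-type gp3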

@DmitriyStoyanov

DmitriyStoyanov commented Feb 17, 2021

For the moment we just increased the disk size per instance type (previously we had only 50 GB for all instances), now:
c5.xlarge - 50 GB (unchanged) (150 IOPS)
c5.2xlarge - 100 GB (x2) (300 IOPS)
c5.4xlarge - 200 GB (x4) (600 IOPS) (where we hit the issue several times)
We will watch the metrics and may look into gp3, but not today :)
because of terraform-aws-modules/terraform-aws-eks#1134 (comment) and terraform-aws-modules/terraform-aws-eks#1205
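If it is useful, a quick sketch for checking the resulting size/type/IOPS per node (the instance ID is a placeholder):

# List size, type, and baseline IOPS of the volumes attached to a worker node
aws ec2 describe-volumes \
  --filters Name=attachment.instance-id,Values=i-0123456789abcdef0 \
  --query 'Volumes[].{Id:VolumeId,Type:VolumeType,SizeGiB:Size,Iops:Iops}' \
  --output table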
