Pods stuck in ContainerCreating due to pull error unauthorized #2030
Comments
@axelczk - Can you please open a support ticket for this? The team should be able to check whether it is a permission issue pulling from ECR. It looks like you are getting a 401. This issue doesn't belong to the CNI.
I know I'm getting a 401. The real question is why it works when the node has just started and I can pull this image, but after some days or hours it stops working. I don't know which service is responsible for this.
Hi, I'm having this exact issue too after upgrading EKS. Is there any solution?
Just had the same issue and found this ticket. I'm using Bottlerocket OS and it was not that trivial. Here's how to do it.

Step 1: get an ECR token on the node:

aws ecr get-login-password --region <your-region>

Step 2: download crictl and pull the pause image through the Bottlerocket runtime socket:

cd /tmp
yum install tar -y
curl -fsL -o crictl.tar.gz https://github.com/kubernetes-sigs/cri-tools/releases/download/v1.26.0/crictl-v1.26.0-linux-amd64.tar.gz
tar zxf crictl.tar.gz
chmod u+x crictl
./crictl --runtime-endpoint=unix:///.bottlerocket/rootfs/run/dockershim.sock pull --creds "AWS:TOKEN_FROM_STEP_1" XXXXX.dkr.ecr.XXXXXXX.amazonaws.com/eks/pause:3.1-eksbuild.1

Now you have the pause image in place, so pods should be able to start normally.
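For convenience, a condensed sketch of the same two steps as a single command, assuming the AWS CLI is available on the node and reusing the placeholders above (region, registry account, and pause image tag); treat it as a sketch of the procedure, not a verified one-liner:

# Feed the ECR token from step 1 directly to crictl instead of copying it by hand.
# <your-region>, XXXXX, and the image reference are the same placeholders as above.
./crictl --runtime-endpoint=unix:///.bottlerocket/rootfs/run/dockershim.sock \
  pull --creds "AWS:$(aws ecr get-login-password --region <your-region>)" \
  XXXXX.dkr.ecr.XXXXXXX.amazonaws.com/eks/pause:3.1-eksbuild.1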
There is an error on their side on the EKS node. You need to add a bootstrap extra arg; with it, the garbage collector will not remove the pause container and you will not need to pull the image again.
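For illustration, a minimal sketch of what such a bootstrap extra arg could look like on the standard EKS AMI, assuming it refers to the kubelet --pod-infra-container-image flag mentioned in a later comment; the cluster name, region, and account ID below are placeholders:

# Hypothetical user-data snippet for the standard (non-Bottlerocket) EKS AMI.
# Assumes the missing arg is the kubelet --pod-infra-container-image flag, which
# exempts the named pause image from kubelet image garbage collection.
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--pod-infra-container-image=602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5'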
As far as I know, the garbage collector takes only disk space into account. In my case, the server was running out of inodes, so I had to prune images manually.
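A quick sketch of how one might check for inode exhaustion and prune unused images by hand, assuming crictl is installed, the containerd socket is at its usual path, and the crictl version is recent enough to support rmi --prune:

# Check inode (not just byte) usage on the root volume.
df -i /
# Remove images that no running container references.
crictl --runtime-endpoint=unix:///run/containerd/containerd.sock rmi --prune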
We contacted AWS support on our side, and after days of exchanges and debugging, this was the explanation we found: the garbage collector was pruning images on the node and removing the pause image along with other images. I still have the ticket somewhere and can check for the full explanation if necessary.
Hi, I'm having the same issue after upgrading EKS to 1.25. Is this solution still valid? I think this feature flag is deprecated.
Having the same issue after an EKS upgrade to 1.24:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5": failed to pull image "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5": failed to resolve reference "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5": pulling from host 602401143452.dkr.ecr.us-east-1.amazonaws.com failed with status code [manifests 3.5]: 401 Unauthorized
Having the same issue in EKS after upgrading to 1.27. Can anyone help me, please?
Getting a token works on the node, and fetching the image manually works too.
Hi @interair. We are also having the same issue in our environment. The kubelet is able to pull all system images (amazon-k8s-cni-init, amazon-k8s-cni) except the pause image, as shown below:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "900889452093.dkr.ecr.ap-south-2.amazonaws.com/eks/pause:3.5": failed to pull image "900889452093.dkr.ecr.ap-south-2.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "900889452093.dkr.ecr.ap-south-2.amazonaws.com/eks/pause:3.5": failed to resolve reference "900889452093.dkr.ecr.ap-south-2.amazonaws.com/eks/pause:3.5": pulling from host 900889452093.dkr.ecr.ap-south-2.amazonaws.com failed with status code [manifests 3.5]: 401 Unauthorized

Fetching the image via ./crictl is not a feasible solution in a production environment. Can anyone help me, please?
Any updates here? Same issue!
@VikramPunnam @hamdallahjodah @interair @ddl-slevine I am not familiar with this issue, and it is not an issue with the VPC CNI, so I suggest opening an AWS support case to get help. That will be the fastest way to a resolution, and you can share your findings here.
I am having the same issue after upgrading to 1.29. Some nodes can download the pause image, but some cannot, so all pods on those nodes just hang in the creating state. I don't understand why the pause image gets a 401 only sometimes.
We also have this issue after upgrading to 1.29. Are there any good hints so I can start digging?
I have the same issue with EKS 1.29 :(
I've observed the same after the v1.29 upgrade today too. I tried replacing an affected compute node with a fresh one, and that seems to have helped (at least for a while). So far so good...
I think the problem happens after 12 hours, when the session token expires. Curiously, the instance where I tested this didn't have any inode or disk space problems.
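A small sketch to check that theory, assuming the node's instance role is allowed to call ecr:GetAuthorizationToken; ECR authorization tokens are valid for 12 hours from the time they are issued:

# Print when the current ECR authorization token expires.
aws ecr get-authorization-token \
  --query 'authorizationData[0].expiresAt' \
  --output text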
If this is happening on the official EKS AMI, can you open an issue in our repo so we can look into it? https://github.com/awslabs/amazon-eks-ami
The --pod-infra-container-image flag is set on the kubelet. I found that the disk on my node really does fill up after some time, and the kubelet image garbage collector then deletes the pause image. So, instead of deleting other images, it deletes the pause image, and once the pause image is gone, the node doesn't work.
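A quick sketch for confirming this on an affected node, assuming a standard EKS AMI layout where the kubelet config lives at /etc/kubernetes/kubelet/kubelet-config.json and containerd is the runtime; paths may differ on other AMIs:

# Check disk and inode pressure, and whether the pause image is still present.
df -h / && df -i /
crictl --runtime-endpoint=unix:///run/containerd/containerd.sock images | grep pause
# Inspect the kubelet image GC thresholds; defaults are 85% (high) and 80% (low),
# and the fields may be absent from the file when the defaults are in use.
grep -i imagegc /etc/kubernetes/kubelet/kubelet-config.json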
@ohrab-hacken This issue is being discussed at awslabs/amazon-eks-ami#1597
Is there any new progress solving this matter? |
Did you follow the issue I linked to? This issue is in the EKS AMI, not the VPC CNI, so short- and long-term resolutions are being discussed there.
I will take another look at that. |
Just started running into this today. 2/3 replicas of my pod deployed; all were scheduled on different nodes, but all nodes are self-managed and running the same AMI. I thought maybe it was only affecting one AZ, tried to redeploy again, and now only 1/3 worked. Not sure yet if it's only affecting specific nodes or what...

Edit: I don't see any pattern with regard to node type, node group, AZ, or specific resources. It seems to have started a few days ago, so it's not really AMI-related. I'm not sure if it's specifically VPC CNI-related either, though it did prevent me from updating that plugin. Doing an instance refresh and/or terminating and re-creating the failing nodes/instances seems to have resolved the issue (for now?); they were all redeployed with the same AMI and everything. No idea what's going on.
We recently switched our cluster to EKS 1.22 with a managed node group, and since then we sometimes get this error when containers are created. We don't have a fix other than replacing the node where the pod is trying to be scheduled.
I don't know if this is the right place to ask. If it's not, please tell me where I can post this issue.