
Pods stuck in ContainerCreating due to pause image pull error 401 unauthorized #1425

Closed
VikramPunnam opened this issue Sep 12, 2023 · 7 comments

Comments

@VikramPunnam

We generally build a custom EKS AMI in the ap-south-1 region, using the EKS optimized AMI as the base image, and copy it to the other regions for our EKS cluster setup.

We're hitting the issue below on EKS after upgrading to 1.27, whenever the pause image gets deleted on a node.

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "900889452093.dkr.ecr.ap-south-2.amazonaws.com/eks/pause:3.5": failed to pull image "900889452093.dkr.ecr.ap-south-2.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "900889452093.dkr.ecr.ap-south-2.amazonaws.com/eks/pause:3.5": failed to resolve reference "900889452093.dkr.ecr.ap-south-2.amazonaws.com/eks/pause:3.5": pulling from host 900889452093.dkr.ecr.ap-south-2.amazonaws.com failed with status code [manifests 3.5]: 401 Unauthorized

Can anyone help me, please?

@cartermckinnon
Member

How did the pause image get deleted from the node?

@cartermckinnon
Member

I've seen this failure mode a few times in the past, because containerd doesn't have a way to obtain ECR credentials to pull the sandbox container image. That's why we pull it with a systemd unit at launch time (if it's not already cached in the AMI): https://github.com/awslabs/amazon-eks-ami/blob/master/files/sandbox-image.service

You could run systemctl restart sandbox-image to trigger a pull, and we could feasibly run this periodically so this isn't a terminal node failure; but I'd still look into why the image was deleted to begin with.
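
If you want to automate that as a stopgap, a rough sketch of a systemd timer that re-runs the existing sandbox-image.service periodically is below. The timer unit name and the intervals are assumptions, not something the AMI ships with:

# Immediate fix on an affected node:
sudo systemctl restart sandbox-image

# Sketch of a periodic re-pull via a systemd timer (unit name and intervals are assumptions):
sudo tee /etc/systemd/system/sandbox-image-refresh.timer <<'EOF'
[Unit]
Description=Periodically re-run sandbox-image.service to keep the pause image cached

[Timer]
OnBootSec=5min
OnUnitActiveSec=6h
Unit=sandbox-image.service

[Install]
WantedBy=timers.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now sandbox-image-refresh.timer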

@VikramPunnam
Author

Hi @cartermckinnon,

Thanks for your reply.

We used a custom script that runs on every node to clean up unused images and exited containers. It was removing the pause image as well, which is what caused the trouble in our environment.

We've now modified the script to exclude certain images on the node; a sketch of that kind of exclusion is below.
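
A minimal sketch of such a cleanup, assuming crictl and jq are available on the node; the eks/pause pattern is taken from the error message above and may need adjusting for other registries:

#!/usr/bin/env bash
# Sketch: clean up exited containers and unused images, but never the sandbox (pause) image.
set -euo pipefail

# Remove exited containers
sudo crictl ps -a --state Exited -q | xargs -r sudo crictl rm

# Remove images, except anything tagged like the EKS pause image
for ref in $(sudo crictl images -o json | jq -r '.images[].repoTags[]?'); do
  case "$ref" in
    */eks/pause:*) continue ;;       # keep the sandbox image
  esac
  sudo crictl rmi "$ref" || true     # removing an in-use image fails, which is fine
done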

@cartermckinnon closed this as not planned on Sep 13, 2023
@ForbiddenEra

ForbiddenEra commented Mar 31, 2024

I've been running into this issue on nodes randomly since the 1.29 upgrade, on both AWS EKS managed nodes running AL2 and AL2023 and on Ubuntu's EKS image...

It's getting really frustrating to have to keep refreshing nodes, as that's the only fix I can figure out..

I can't find much info or many threads about it; this was one of the few. Nothing is modified on the nodes themselves; we run the provided AMIs and never access the nodes directly.

@ForbiddenEra

So, I was just about to log into the nodes that are currently affected and was checking the docs, because I don't know offhand where or how kubelet/k8s caches images. As of 1.29:

Garbage collection for unused container images
FEATURE STATE: Kubernetes v1.29 [alpha]

As an alpha feature, you can specify the maximum time a local image can be unused for, regardless of disk usage. This is a kubelet setting that you configure for each node.

To configure the setting, enable the ImageMaximumGCAge [feature gate](https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/) for the kubelet, and also set a value for the ImageMaximumGCAge field in the kubelet configuration file.

The value is specified as a Kubernetes duration; for example, you can set the configuration field to 3d12h, which means 3 days and 12 hours

That sounds incredibly fishy, and potentially the issue.

Note that this is also happening on EKS-managed nodes running the Amazon Linux AMIs.
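
For anyone checking whether that alpha feature is even active: it only takes effect if both the feature gate and the field are set explicitly in the kubelet config. A sketch of what that would look like; the file path is an assumption based on the AL2 EKS AMI layout, and the 3d12h value is just the example from the docs quoted above:

# /etc/kubernetes/kubelet/kubelet-config.json (path assumed from the AL2 EKS AMI)
{
    "kind": "KubeletConfiguration",
    "apiVersion": "kubelet.config.k8s.io/v1beta1",
    "featureGates": {
        "ImageMaximumGCAge": true
    },
    "imageMaximumGCAge": "3d12h"
}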

@elirenato

elirenato commented May 30, 2024

I've seen this failure because I tried to prune the images myself using crictl rmi --prune. Running systemctl restart sandbox-image as @cartermckinnon suggested fixed the problem, but I was wondering: do we really need to prune the images ourselves?

I saw this article https://repost.aws/knowledge-center/eks-worker-nodes-image-cache which suggests images are already cleaned up according to the image-gc-high-threshold attribute (defaults to 85%).

Default values for one node that I have:

curl -sSL "http://localhost:8001/api/v1/nodes/<MY_NODE_NAME>/proxy/configz" | python3 -m json.tool | grep image
        "imageMinimumGCAge": "2m0s",
        "imageMaximumGCAge": "0s",
        "imageGCHighThresholdPercent": 85,
        "imageGCLowThresholdPercent": 80,

@cartermckinnon
Member

do we really need to prune the images by ourselves?

Nope! The kubelet will take care of this. Deleting images out of band almost always hurts more than it helps.
