
Pods stuck in ContainerCreating due to pause image pull error 401 unauthorized #1425

Closed
VikramPunnam opened this issue Sep 12, 2023 · 7 comments

Comments

@VikramPunnam

We generally build a custom EKS AMI in the ap-south-1 region, using the EKS optimized AMI as the base image, and copy it to the other regions for our EKS cluster setup.

We're hitting the issue below on EKS after upgrading to 1.27, whenever the pause image gets deleted on a node.

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "900889452093.dkr.ecr.ap-south-2.amazonaws.com/eks/pause:3.5": failed to pull image "900889452093.dkr.ecr.ap-south-2.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "900889452093.dkr.ecr.ap-south-2.amazonaws.com/eks/pause:3.5": failed to resolve reference "900889452093.dkr.ecr.ap-south-2.amazonaws.com/eks/pause:3.5": pulling from host 900889452093.dkr.ecr.ap-south-2.amazonaws.com failed with status code [manifests 3.5]: 401 Unauthorized

Can anyone help me, please?

@cartermckinnon
Member

How did the pause image get deleted from the node?

@cartermckinnon
Member

I've seen this failure mode a few times in the past, because containerd doesn't have a way to obtain ECR credentials to pull the sandbox container image. That's why we pull it with a systemd unit at launch time (if it's not already cached in the AMI): https://github.com/awslabs/amazon-eks-ami/blob/master/files/sandbox-image.service

You could run systemctl restart sandbox-image to trigger a pull, and we could feasibly run this periodically so this isn't a terminal node failure; but I'd still look into why the image was deleted to begin with.
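
If you want to automate that as a stopgap, a rough sketch of a systemd timer that re-runs the existing sandbox-image.service periodically is below. The timer unit name and the intervals are assumptions, not something the AMI ships with:

# Immediate fix on an affected node:
sudo systemctl restart sandbox-image

# Sketch of a periodic re-pull via a systemd timer (unit name and intervals are assumptions):
sudo tee /etc/systemd/system/sandbox-image-refresh.timer <<'EOF'
[Unit]
Description=Periodically re-run sandbox-image.service to keep the pause image cached

[Timer]
OnBootSec=5min
OnUnitActiveSec=6h
Unit=sandbox-image.service

[Install]
WantedBy=timers.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now sandbox-image-refresh.timer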

@VikramPunnam
Author

Hi @cartermckinnon,

Thanks for your reply.

We used a custom script that runs on every node to clean up unused images and exited containers. It was removing the pause image as well, which is what caused the trouble in our environment.

We've now modified the script to exclude certain images on the node; a sketch of that kind of exclusion is below.
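
A minimal sketch of such a cleanup, assuming crictl and jq are available on the node; the eks/pause pattern is taken from the error message above and may need adjusting for other registries:

#!/usr/bin/env bash
# Sketch: clean up exited containers and unused images, but never the sandbox (pause) image.
set -euo pipefail

# Remove exited containers
sudo crictl ps -a --state Exited -q | xargs -r sudo crictl rm

# Remove images, except anything tagged like the EKS pause image
for ref in $(sudo crictl images -o json | jq -r '.images[].repoTags[]?'); do
  case "$ref" in
    */eks/pause:*) continue ;;       # keep the sandbox image
  esac
  sudo crictl rmi "$ref" || true     # removing an in-use image fails, which is fine
done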

@cartermckinnon closed this as not planned on Sep 13, 2023
@ForbiddenEra

ForbiddenEra commented Mar 31, 2024

I've been running into this issue on nodes randomly since the 1.29 upgrade, on both AWS EKS managed nodes running AL2 and AL2023 and on Ubuntu's EKS image...

It's getting really frustrating to have to keep refreshing nodes, as that's the only fix I can figure out..

I can't find much info or many threads about it; this was one of the few. Nothing is modified on the nodes themselves; we run the provided AMIs and never access the nodes directly.

@ForbiddenEra

So, I was just about to log into the nodes that are currently affected and was checking the docs, because I don't know offhand where or how kubelet/k8s caches images. As of 1.29:

Garbage collection for unused container images
FEATURE STATE: Kubernetes v1.29 [alpha]

As an alpha feature, you can specify the maximum time a local image can be unused for, regardless of disk usage. This is a kubelet setting that you configure for each node.

To configure the setting, enable the ImageMaximumGCAge [feature gate](https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/) for the kubelet, and also set a value for the ImageMaximumGCAge field in the kubelet configuration file.

The value is specified as a Kubernetes duration; for example, you can set the configuration field to 3d12h, which means 3 days and 12 hours

That sounds incredibly fishy, and potentially the issue.

Note that this is also happening on EKS-managed nodes running the Amazon Linux AMIs.
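
For anyone checking whether that alpha feature is even active: it only takes effect if both the feature gate and the field are set explicitly in the kubelet config. A sketch of what that would look like; the file path is an assumption based on the AL2 EKS AMI layout, and the 3d12h value is just the example from the docs quoted above:

# /etc/kubernetes/kubelet/kubelet-config.json (path assumed from the AL2 EKS AMI)
{
    "kind": "KubeletConfiguration",
    "apiVersion": "kubelet.config.k8s.io/v1beta1",
    "featureGates": {
        "ImageMaximumGCAge": true
    },
    "imageMaximumGCAge": "3d12h"
}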

@elirenato

elirenato commented May 30, 2024

I've seen this failure because I tried to prune the images myself using crictl rmi --prune. Running systemctl restart sandbox-image as @cartermckinnon suggested fixed the problem, but I was wondering: do we really need to prune the images ourselves?

I saw this article https://repost.aws/knowledge-center/eks-worker-nodes-image-cache which suggests images are already cleaned up according to the image-gc-high-threshold attribute (defaults to 85%).

Default values for one node that I have:

curl -sSL "http://localhost:8001/api/v1/nodes/<MY_NODE_NAME>/proxy/configz" | python3 -m json.tool | grep image
        "imageMinimumGCAge": "2m0s",
        "imageMaximumGCAge": "0s",
        "imageGCHighThresholdPercent": 85,
        "imageGCLowThresholdPercent": 80,

@cartermckinnon
Member

do we really need to prune the images by ourselves?

Nope! The kubelet will take care of this. Deleting images out of band almost always hurts more than it helps.
