bug(al2): sandbox container image failed to pull #2061

Open
javilaadevinta opened this issue Nov 19, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@javilaadevinta

What happened:

One node in one of our clusters has an error related to the sandbox container image, which cannot be pulled.
AMI: amazon-eks-node-1.30-v20241109

Warning  FailedCreatePodSandBox  3m31s (x210 over 48m)  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": failed to pull image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": failed to resolve reference "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": pull access denied, repository does not exist or may require authorization: authorization failed: no basic auth credentials

Disk usage was around 10%.
We no longer have access to the node, as it was deleted.

This is our config on the containerd side:

[plugins."io.containerd.grpc.v1.cri"]
  sandbox_image = "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5"

And this is our kubelet flag:

--pod-infra-container-image=602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5
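
To confirm which sandbox image the runtime is actually using on a node, one option (assuming crictl is installed on the node, as it is on EKS AMIs) is to dump containerd's effective CRI config:

# Prints the CRI runtime config as JSON; "sandboxImage" shows the image in effect.
sudo crictl info | grep -i sandboxImage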

Could this somehow be a regression of this issue?
#1597

What you expected to happen:
Kubelet shouldn't GC the image, or at least nodes should be able to pull it again if it was somehow deleted.

How to reproduce it (as minimally and precisely as possible):
I have no idea how to reproduce it, as this is the first time we have observed this happening in one of our clusters.

Environment:

  • AWS Region:
  • Instance Type(s):
  • Cluster Kubernetes version:
  • Node Kubernetes version:
  • AMI Version:
@javilaadevinta added the bug label on Nov 19, 2024
@cartermckinnon
Member

cartermckinnon commented Nov 19, 2024

pull access denied, repository does not exist or may require authorization: authorization failed: no basic auth credentials

That means containerd is attempting to pull the sandbox image itself (without ECR credentials), which won’t work. Do you see ImageDelete events in the logs or was the sandbox image never present on this node?

There’s a systemd unit on AL2 that pulls the image (using ECR credentials) that may have failed; you can check:

journalctl -u sandbox-image
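
If that unit failed and the image was never pulled, re-pulling it by hand with ECR credentials should unblock the node. A minimal sketch (assuming the AWS CLI and ctr are available on the node, as on EKS AMIs; region and image taken from this report):

# Pull the sandbox image into the k8s.io namespace using an ECR auth token,
# so containerd/kubelet can find it locally.
REGION=eu-central-1
IMAGE=602401143452.dkr.ecr.${REGION}.amazonaws.com/eks/pause:3.5
sudo ctr --namespace k8s.io images pull \
  --user "AWS:$(aws ecr get-login-password --region ${REGION})" "${IMAGE}"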

@hany-mhajna-payu-gpo
Copy link

We faced the same issue yesterday when fetching the pause image 602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5.

In the pod events we got this error:

kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": failed to pull image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": failed to resolve reference "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": pull access denied, repository does not exist or may require authorization: authorization failed: no basic auth credentials

We run EKS 1.29 for both the control plane and nodes, but it sounds like something unrelated to the AMI itself happened.
Our workaround for now, on the nodes that had this issue, is to edit the pause image in /etc/containerd/config.toml to public.ecr.aws/eks-distro/kubernetes/pause:v1.29.0-eks-1-29-latest and then restart containerd with sudo systemctl restart containerd, as shown below.
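
Concretely, that workaround looks like this (a sketch of the steps described above; the EKS Distro mirror on public.ecr.aws needs no registry credentials):

# /etc/containerd/config.toml — stopgap: point the sandbox image at the
# public EKS Distro mirror instead of the private regional ECR repo.
[plugins."io.containerd.grpc.v1.cri"]
  sandbox_image = "public.ecr.aws/eks-distro/kubernetes/pause:v1.29.0-eks-1-29-latest"

# Then restart containerd so the new sandbox image takes effect:
sudo systemctl restart containerd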

Do you know about a global issue in eu-central-1?

@javilaadevinta
Author

> pull access denied, repository does not exist or may require authorization: authorization failed: no basic auth credentials
>
> That means containerd is attempting to pull the sandbox image itself (without ECR credentials), which won’t work. Do you see ImageDelete events in the logs or was the sandbox image never present on this node?
>
> There’s a systemd unit on AL2 that pulls the image (using ECR credentials) that may have failed; you can check:
>
> journalctl -u sandbox-image

That's a good point. Sadly, we no longer have the system logs for that specific node, but judging by the other metrics this explanation could fit, as we didn't meet any of the image GC conditions.
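
For context, kubelet's image garbage collection only prunes images once disk usage crosses imageGCHighThresholdPercent (85% by default, with a low threshold of 80%), so at ~10% usage no GC should have triggered. One way to verify a node's effective thresholds (the node name below is a placeholder):

# Dump the running kubelet's effective config and extract the image GC thresholds.
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz" | \
  grep -Eo '"imageGC(High|Low)ThresholdPercent":[0-9]+'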

@cartermckinnon cartermckinnon changed the title Sandbox container image being GC'd in 1.30 bug(al2): sandbox container image failed to pull Nov 21, 2024
@cartermckinnon
Member

@hany-mhajna-payu-gpo if you have the logs from this node, can you see if the sandbox image failed to pull? (with the journalctl command above)

I'm not aware of an incident in eu-central-1, but I can do some digging if you have a timeframe.

@cartermckinnon
Member

FWIW, we're in the process of removing this runtime dependency on ECR; we've done that on AL2023 so far, see #2000.
