bug(al2): sandbox container image failed to pull #2061

Open
javilaadevinta opened this issue Nov 19, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@javilaadevinta

What happened:

One node in one of our clusters has an error related to the sandbox container image, which cannot be pulled.
AMI: amazon-eks-node-1.30-v20241109

Warning  FailedCreatePodSandBox  3m31s (x210 over 48m)  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": failed to pull image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": failed to resolve reference "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": pull access denied, repository does not exist or may require authorization: authorization failed: no basic auth credentials

Disk usage was around 10%.
We no longer have access to the node, as it was deleted.

This is our config on the containerd side:

[plugins."io.containerd.grpc.v1.cri"]
  sandbox_image = "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5"

And this is our kubelet flag:

--pod-infra-container-image=602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5
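
To confirm which sandbox image the runtime is actually using on a node, one option (assuming crictl is installed on the node, as it is on EKS AMIs) is to dump containerd's effective CRI config:

# Prints the CRI runtime config as JSON; "sandboxImage" shows the image in effect.
sudo crictl info | grep -i sandboxImage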

Could this somehow be a regression of this issue?
#1597

What you expected to happen:
Kubelet shouldn't GC the image, or at least nodes should be able to pull it again if it was somehow deleted.

How to reproduce it (as minimally and precisely as possible):
I have no idea how to reproduce it, as this is the first time we have observed this happening in one of our clusters.

Environment:

  • AWS Region:
  • Instance Type(s):
  • Cluster Kubernetes version:
  • Node Kubernetes version:
  • AMI Version:
@javilaadevinta added the bug label on Nov 19, 2024
@cartermckinnon
Member

cartermckinnon commented Nov 19, 2024

pull access denied, repository does not exist or may require authorization: authorization failed: no basic auth credentials

That means containerd is attempting to pull the sandbox image itself (without ECR credentials), which won’t work. Do you see ImageDelete events in the logs or was the sandbox image never present on this node?

There’s a systemd unit on AL2 that pulls the image (using ECR credentials) that may have failed; you can check:

journalctl -u sandbox-image
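
If that unit failed and the image was never pulled, re-pulling it by hand with ECR credentials should unblock the node. A minimal sketch (assuming the AWS CLI and ctr are available on the node, as on EKS AMIs; region and image taken from this report):

# Pull the sandbox image into the k8s.io namespace using an ECR auth token,
# so containerd/kubelet can find it locally.
REGION=eu-central-1
IMAGE=602401143452.dkr.ecr.${REGION}.amazonaws.com/eks/pause:3.5
sudo ctr --namespace k8s.io images pull \
  --user "AWS:$(aws ecr get-login-password --region ${REGION})" "${IMAGE}"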

@hany-mhajna-payu-gpo
Copy link

We faced the same issue yesterday when fetching the pause image 602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5.

In the pod events we got this error:

kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": failed to pull image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": failed to resolve reference "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": pull access denied, repository does not exist or may require authorization: authorization failed: no basic auth credentials

We run EKS 1.29 for both the control plane and nodes, but it sounds like something unrelated to the AMI itself happened.
Our workaround for now, on the nodes that had this issue, is to edit the pause image in /etc/containerd/config.toml to public.ecr.aws/eks-distro/kubernetes/pause:v1.29.0-eks-1-29-latest and then restart containerd with sudo systemctl restart containerd, as shown below.
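
Concretely, that workaround looks like this (a sketch of the steps described above; the EKS Distro mirror on public.ecr.aws needs no registry credentials):

# /etc/containerd/config.toml — stopgap: point the sandbox image at the
# public EKS Distro mirror instead of the private regional ECR repo.
[plugins."io.containerd.grpc.v1.cri"]
  sandbox_image = "public.ecr.aws/eks-distro/kubernetes/pause:v1.29.0-eks-1-29-latest"

# Then restart containerd so the new sandbox image takes effect:
sudo systemctl restart containerd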

Do you know about a global issue in eu-central-1?

@javilaadevinta
Author

> pull access denied, repository does not exist or may require authorization: authorization failed: no basic auth credentials
>
> That means containerd is attempting to pull the sandbox image itself (without ECR credentials), which won’t work. Do you see ImageDelete events in the logs or was the sandbox image never present on this node?
>
> There’s a systemd unit on AL2 that pulls the image (using ECR credentials) that may have failed; you can check:
>
> journalctl -u sandbox-image

That's a good point. Sadly, we no longer have the system logs for that specific node, but judging by the other metrics this explanation could fit, as we didn't meet any of the image GC conditions.
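
For context, kubelet's image garbage collection only prunes images once disk usage crosses imageGCHighThresholdPercent (85% by default, with a low threshold of 80%), so at ~10% usage no GC should have triggered. One way to verify a node's effective thresholds (the node name below is a placeholder):

# Dump the running kubelet's effective config and extract the image GC thresholds.
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz" | \
  grep -Eo '"imageGC(High|Low)ThresholdPercent":[0-9]+'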

@cartermckinnon cartermckinnon changed the title Sandbox container image being GC'd in 1.30 bug(al2): sandbox container image failed to pull Nov 21, 2024
@cartermckinnon
Member

@hany-mhajna-payu-gpo if you have the logs from this node, can you see if the sandbox image failed to pull? (with the journalctl command above)

I'm not aware of an incident in eu-central-1, but I can do some digging if you have a timeframe.

@cartermckinnon
Member

FWIW, we're in the process of removing this runtime dependency on ECR; we've done that on AL2023 so far, see #2000.
