bug(al2): sandbox container image failed to pull #2061
That means containerd is attempting to pull the sandbox image itself (without ECR credentials), which won’t work. Do you see ImageDelete events in the logs, or was the sandbox image never present on this node? There’s a systemd unit on AL2 that pulls the image (using ECR credentials) that may have failed; you can check:
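For example, something along these lines on the node (the unit name `sandbox-image.service` is my assumption based on the AL2 AMI templates, not confirmed from the node in question):

```bash
# Assumed unit name from the AL2 EKS AMI; it pre-pulls the sandbox image using ECR credentials.
systemctl status sandbox-image.service
journalctl -u sandbox-image.service --no-pager
```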
We faced the same issue yesterday when fetching the pause image; in the pod events we got this error:
We run EKS 1.29 on both the control plane and the nodes, but it sounds like something unrelated to the AMI itself happened. Do you know about a global issue in eu-central-1?
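For anyone else checking whether they are affected, the relevant pod events can be pulled like this (generic kubectl usage; the pod and namespace names are placeholders):

```bash
# Placeholder names; substitute the affected pod/namespace.
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>
# or cluster-wide, filtering on the sandbox failure reason:
kubectl get events -A --field-selector reason=FailedCreatePodSandBox
```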
That's a good point. Sadly, we no longer have any system logs for this specific node, but judging by usage in other metrics it could fit, as we didn't meet any GC condition.
@hany-mhajna-payu-gpo if you have the logs from this node, can you see if the sandbox image failed to pull (with the systemd unit mentioned above)? I'm not aware of an incident in eu-central-1, but I can do some digging if you have a timeframe.
FWIW, we're in the process of removing this runtime dependency on ECR; we've done that on AL2023 so far, see #2000.
What happened:
One node in one of our clusters has an error related to the sandbox container image, which it is unable to pull.
AMI: amazon-eks-node-1.30-v20241109
Warning FailedCreatePodSandBox 3m31s (x210 over 48m) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": failed to pull image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": failed to resolve reference "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": pull access denied, repository does not exist or may require authorization: authorization failed: no basic auth credentials
The disk usage was around 10%.
We no longer have access to the node, as it was deleted.
This is our config on the containerd side:
And this is our kubelet flag:
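For context, a sketch of where these settings typically live on an AL2 node (paths and values are illustrative assumptions, not the actual config captured from the affected node):

```bash
# Illustrative only; the actual values from the affected node are not reproduced here.
# Sandbox image as configured for containerd (default AL2 EKS AMI path assumed):
grep sandbox_image /etc/containerd/config.toml
# expected output along the lines of:
#   sandbox_image = "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5"

# Kubelet flags relevant to the sandbox image and image garbage collection:
ps aux | grep [k]ubelet | tr ' ' '\n' | grep -E 'pod-infra-container-image|image-gc'
```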
Could this somehow be a regression of #1597?
What you expected to happen:
Kubelet shouldn't GC the image, or at least nodes should be able to pull it again if it was somehow deleted.
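As a workaround sketch (not verified on the affected node; the region, account, and tag are taken from the error above, and the node's instance role is assumed to be allowed to call ecr:GetAuthorizationToken), the sandbox image can usually be re-pulled manually with ECR credentials:

```bash
REGION=eu-central-1
IMAGE=602401143452.dkr.ecr.${REGION}.amazonaws.com/eks/pause:3.5
TOKEN=$(aws ecr get-login-password --region "${REGION}")
# crictl pulls through CRI, so the image lands in containerd's k8s.io namespace.
crictl pull --creds "AWS:${TOKEN}" "${IMAGE}"
```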
How to reproduce it (as minimally and precisely as possible):
I have no idea how to reproduce it, as this is the first time we have observed this in one of our clusters.
Environment: