Pods can't run due to failures pulling pause image; pause image is being incorrectly garbage collected #1740
Just checked; indeed, the nodes that are experiencing this no longer have the pause image.
Welp, whether it's the aforementioned GC or another, it's GC:
I've never noticed any disk space issues previously. Also, I haven't fully gotten strict about defining which pods get deployed on which node in this cluster; one of the nodes currently having the issue has 32 GB, while others that are fine right now have 20 GB. I don't think the issue is related to needing to increase space; of course, GC should never have touched that image given that it isn't re-pullable. I've otherwise never run into anything that hints I should increase the space, and this node currently has nearly 10 GB free, so otherwise GC is doing its job fine.
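For reference, a minimal sketch of the kind of disk check being described here (assuming containerd's default data path on the node):

```bash
# Minimal sketch: confirm free space where container images are stored, and
# list what is cached. /var/lib/containerd is containerd's default root and
# may differ on some AMIs.
df -h /var/lib/containerd
sudo crictl images   # the pause image should appear here if it hasn't been GC'd
```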
This was an issue with 1.29 that should be addressed on AL2: #1597. We haven't had any reports of this on AL2023. Can you verify that the sandbox image is pinned on your nodes?
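A minimal sketch of one way to run that check on a node (assuming crictl is installed there; the image URI is filled in from the error later in this issue and varies by region and account, so treat it as a placeholder):

```bash
# Check whether the sandbox ("pause") image is marked as pinned in the CRI
# image store. The URI below is an assumption taken from this issue's error.
SANDBOX_IMAGE="602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5"
sudo crictl inspecti "$SANDBOX_IMAGE" | grep '"pinned"'
```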
$SANDBOX_IMAGE isn't set, not sure if it's meant to be; on my AL2023 nodes:

```json
{
  "status": {
    "pinned": true
  },
  // ...
}
```

Seems like it is pinned. I haven't encountered it on the AL2023 nodes since the last image update I did a few days ago; perhaps that fix got pushed both ways? I'm pretty sure I also saw it happen once on AL2 before I switched, which makes sense with the issue you linked. I wish I had seen that issue much earlier; it didn't come up in my searching!

Definitely not pinned on the latest Ubuntu AMI, though. Any idea where I can report that? Weirdly enough, the containerd config on the Ubuntu image does have:

```toml
[plugins."io.containerd.grpc.v1.cri"]
  sandbox_image = "602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5"
```

I tried restarting
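As an aside, a hedged sketch of confirming which sandbox image containerd actually resolves, independent of which config file sets it (assuming the containerd binary is on the node's PATH):

```bash
# Dump containerd's merged, effective configuration and pull out the
# sandbox_image setting under [plugins."io.containerd.grpc.v1.cri"].
sudo containerd config dump | grep sandbox_image
```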
I'm gonna close this due to the existence of #1597; I wish I had found that issue in my initial searches! If I see it again on AL2023 I'll comment or open a new issue. Otherwise, if someone knows where to post an issue for the Ubuntu AMI, I'd appreciate being pointed in that direction.
```
Normal   Scheduled   18s   default-scheduler   Successfully assigned iwork-ui/iwork-ui-deployment-748c87cc58-5l57j to ip-172-31-47-25.ap-south-1.compute.internal
```
Not sure what you're saying without any comment and only an error, but you should review #1597 for some workarounds if needed. If you're experiencing it on AL2023, then definitely report back; otherwise, #1597 is for AL2, and I was experiencing it on Ubuntu's EKS AMI (Jammy), which this isn't the place to report.
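For anyone who lands here before reading #1597: one workaround commonly suggested for this class of problem is to mark the pause image as pinned in containerd yourself. A hedged sketch, assuming ctr is available on the node and a containerd/kubelet combination that honors pinned images; the image URI is this issue's and varies by region:

```bash
# Label the pause image as pinned so containerd's CRI plugin reports it as
# pinned and kubelet's image GC skips it. This is a sketch of a workaround,
# not the AMI's official fix; check #1597 for what applies to your setup.
PAUSE="602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5"
sudo ctr -n k8s.io images label "$PAUSE" io.cri-containerd.pinned=pinned
sudo crictl inspecti "$PAUSE" | grep '"pinned"'   # should now report true
```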
What happened:
Nodes stop being able to create new pods; the error is:

```
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5" failed to pull and unpack image "602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5": failed to resolve reference "602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5": unexpected status from HEAD request to https://602401143452.dkr.ecr.ca-central-1.amazonaws.com/v2/eks/pause/manifests/3.5: 401 Unauthorized
```
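For context, the 401 is presumably containerd pulling the sandbox image itself, without the kubelet's ECR credential provider, so once the cached copy is gone the unauthenticated pull against the private registry fails. A rough sketch of re-pulling it by hand on an affected node (assuming the AWS CLI and ctr are present and the instance role can pull from ECR):

```bash
# Re-pull the missing pause image with an ECR auth token so containerd has a
# local copy again. Region and image URI are taken from the error above.
REGION=ca-central-1
PAUSE="602401143452.dkr.ecr.${REGION}.amazonaws.com/eks/pause:3.5"
sudo ctr -n k8s.io images pull \
  --user "AWS:$(aws ecr get-login-password --region "$REGION")" "$PAUSE"
```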
What you expected to happen:
Pods should work.
How to reproduce it (as minimally and precisely as possible):
Try to deploy a pod on EKS. Up until right now I had no idea of an exact reproduction, but I have at least a slight idea now. After diving into this again because nodes started failing to deploy pods, I started Googling and, again, wasn't able to find much, but I saw #1425 again and re-read it.
That issue oddly makes sense, but only after diving deeper to see whether what it said about the images getting removed could possibly be happening here. It shouldn't be, since I'm using off-the-shelf AMIs and never access the nodes directly, but I wanted to check.
While trying to find a reference on where kubelet/Kubernetes caches images (I didn't know offhand), I found this in the Kubernetes docs:
I don't know for sure whether this is related, but I feel like it's a possibility. The last time the issue popped up was after the weekend, a time when our nodes could easily go 3 days and 12 hours with no new pods.
Anything else we need to know?:
This has only been happening since upgrading to 1.29 on EKS, and it has happened on the AL nodes (IIRC both AL2 and AL2023) as well as the Ubuntu EKS nodes, on both self-managed and EKS-managed node groups. It seems to affect nodes randomly; it's never an entire node group or anything, and there's nothing that correlates the affected nodes.
AL2023_x86_64_STANDARD-1.29.0-20240307
Environment:
- Kernel (e.g. `uname -a`): 6.1.79-99.164.amzn2023.x86_64 for the AL2023 nodes, 6.5.0-1015-aws for Ubuntu
- Release information (run `cat /etc/eks/release` on a node): For AL2023 nodes:

Ubuntu nodes don't have a /etc/eks/release file.

I don't see the ImageMaximumGCAge feature gate being passed to the kubelet when reviewing its parameters via `ps aux`, but I also said I'm not 100% sure that is the issue; the issue exists either way. That was just my only remote idea of why/how it could be happening, and the 3.5-day length seems to line up fishily well. Any ideas!?
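One note on the `ps aux` check: feature gates and image GC settings often come from the kubelet config file rather than command-line flags, so a sketch like the following (reading the live kubelet's effective config through the API server's nodes/proxy configz endpoint, which needs the corresponding RBAC permission) may be more telling. The node name is just a placeholder:

```bash
# Dump the effective kubelet configuration for an affected node and look for
# feature gates and image GC settings.
NODE=ip-172-31-47-25.ap-south-1.compute.internal   # placeholder node name
kubectl get --raw "/api/v1/nodes/${NODE}/proxy/configz" \
  | grep -oE '"(featureGates|imageMinimumGCAge|imageMaximumGCAge|imageGCHighThresholdPercent|imageGCLowThresholdPercent)"[^,}]*'
```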