Images pulled from ECR with IAM creds failing with invalidated session #2638
Comments
I'm sorry to hear that you're experiencing this issue with your pulls from ECR. Can you talk about how you have tracked the behavior to the 1.44.3 release? Were you seeing the behavior when you updated a task or service on 1.44.2 or before? I assume from your description that you're consuming the latest SSM parameters to launch new container instances with the latest ECS-optimized AMI. Can you go into more detail about how you are updating your instances, so I can better narrow down the interval of potential commits/changes that could be having an effect here?
Hey @fierlion, I should have mentioned that I am building on top of the SSM param AMI by way of Packer, but with only very minor additions to the base AMI from the SSM path, such as installing the SSM agent for SSH access and a Datadog agent. I had initially tracked the behavior to the release by mapping back to an old AMI I had stored from a while back, which is running 1.43.0. However, I believe I may have been premature about pinning this issue to the ecs-agent version, as after a few hours of smooth operation I started seeing the behavior on my old 1.43.0 AMI as well, so I am no longer convinced it's a 1.44.3 issue. I do, however, have a new lead on the cause: after looking at logs from each EC2 in my ASG, the machines that successfully pull the image are in AZs us-west-2a and us-west-2b, and the machines that fail to pull from the ECR registry are in us-west-2c. I will follow up with a support ticket on that, which is probably out of scope for this GitHub project, but might be useful info for others coming here.
I asked because the last few releases have had very little change related to ECR, except for the addition of a configurable pull timeout (#2565).
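(For reference, a hedged sketch of how that configurable pull timeout can be set on a container instance; I believe the option is exposed as `ECS_IMAGE_PULL_TIMEOUT` in the agent config, but verify the exact key name in the agent README before relying on it:)

```sh
# Assumption: ECS_IMAGE_PULL_TIMEOUT is the agent config key added by #2565;
# check the amazon-ecs-agent README for the exact name and default value.
echo "ECS_IMAGE_PULL_TIMEOUT=30m" | sudo tee -a /etc/ecs/ecs.config

# Restart the agent so the new setting takes effect (Amazon Linux 2).
sudo systemctl restart ecs
```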
Summary
Since 1.44.3, I am noticing on EC2s belonging to an ECS cluster that after an initial pull of a Docker image from ECR, if I rerun a job to create a service/task fresh, that EC2 will fail to pull the image from ECR with cryptic errors.
Description
I am running a multi-account setup with a centralized account housing an ECR registry, where I publish images once in a dev environment and allow higher environments like prod to use them with a different env var set. Repos in this ECR are granted org-level permissions, so all accounts within the org had been able to pull images without fail until 1.44.3, based on the ECS-optimized AMI available at SSM path `aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id`. My ECS clusters authenticate with ECR via IAM roles that use the standard Amazon-managed policies.
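For context, a minimal sketch of what that setup looks like — the role name, repository name, and org ID below are illustrative placeholders, not my actual configuration:

```sh
# Container instances use the standard AWS-managed instance policy, which
# includes the ECR pull permissions (GetAuthorizationToken, BatchGetImage, etc.).
aws iam attach-role-policy \
  --role-name ecsInstanceRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role

# Repositories in the central account grant org-wide pull access via a
# repository policy keyed on aws:PrincipalOrgID (values are placeholders).
aws ecr set-repository-policy \
  --repository-name my-app \
  --policy-text '{
    "Version": "2012-10-17",
    "Statement": [{
      "Sid": "OrgPull",
      "Effect": "Allow",
      "Principal": "*",
      "Action": [
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
        "ecr:BatchCheckLayerAvailability"
      ],
      "Condition": {"StringEquals": {"aws:PrincipalOrgID": "o-xxxxxxxxxx"}}
    }]
  }'
```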
Now, on EC2s running ecs-agent v1.44.3 (from `aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id`), I am seeing strange errors when pulling an image from the org-level ECR, even though the image absolutely exists and no networking or IAM changes have been made. I'm hoping for some insight into what the cause might be; with no real changes besides a new EC2 image running the latest ecs-agent 1.44.3, I'm curious whether others are seeing this behavior.
Expected Behavior
Images are consistently pulled without error from ECR using IAM role-based credentials.
Observed Behavior
Images are inconsistently pulled from ECR, with occasional errors suggesting either that the image doesn't exist or that the security token is invalid.
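To try to distinguish the two failure modes, this is roughly the kind of manual check I can run from an affected instance (account ID, region, and repository name below are placeholders):

```sh
# Confirm which credentials the instance role is actually resolving to.
aws sts get-caller-identity

# Authenticate to the central account's registry using the instance role,
# then attempt the same pull the agent is failing on.
aws ecr get-login-password --region us-west-2 \
  | docker login --username AWS --password-stdin 111122223333.dkr.ecr.us-west-2.amazonaws.com
docker pull 111122223333.dkr.ecr.us-west-2.amazonaws.com/my-app:latest
```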
Environment Details
AMI: latest pulled from SSM path `aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id`, running ecs-agent 1.44.3
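(For completeness, this is roughly how the AMI ID is resolved from that parameter; the region is a placeholder:)

```sh
# Resolve the current ECS-optimized Amazon Linux 2 AMI ID from the public SSM parameter.
aws ssm get-parameters \
  --names /aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id \
  --region us-west-2 \
  --query 'Parameters[0].Value' \
  --output text
```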
docker info:
Supporting Log Snippets
These are some logs from /var/log/ecs/ecs-init.log around the time of failure, scrubbed of sensitive info.