Images pulled from ECR with IAM creds failing with invalidated session #2638

Closed

kyleian opened this issue Sep 17, 2020 · 3 comments

kyleian commented Sep 17, 2020

Summary

Since 1.44.3, I am noticing that on EC2 instances belonging to an ECS cluster, after an initial pull of a Docker image from ECR, re-running a job to create a fresh service/task will cause that EC2 instance to fail to pull the image from ECR with cryptic errors.

Description

I am running a multi-account setup with a centralized account housing an ECR registry, where I publish images once in a dev environment and allow higher environments like prod to use them with a different set of environment variables. Repos in this registry are granted org-level permissions, so all accounts within the org have been able to pull images without fail until 1.44.3, based on the ECS AMI available at SSM path aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id.
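
For context, the org-level pull permission on each repo is the kind of thing set via an ECR repository policy; a minimal sketch of applying one with the AWS CLI follows (the repo name and org ID are placeholders, and the exact policy we use may differ):

# Grant pull-only access to every principal in the AWS Organization (placeholder values)
aws ecr set-repository-policy \
  --region us-west-2 \
  --repository-name my-repo \
  --policy-text '{
    "Version": "2012-10-17",
    "Statement": [{
      "Sid": "AllowOrgPull",
      "Effect": "Allow",
      "Principal": "*",
      "Action": [
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
        "ecr:BatchCheckLayerAvailability"
      ],
      "Condition": {
        "StringEquals": { "aws:PrincipalOrgID": "o-EXAMPLE" }
      }
    }]
  }'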

My ECS clusters authenticate with ECR via IAM instance roles that attach the standard Amazon managed policies as follows:

ManagedPolicyArns:
      - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
      - arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role
      - arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
      - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

Now, on EC2 instances running ecs-agent v1.44.3 (aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id), I am experiencing strange errors when pulling an image from the org-level ECR, where the image absolutely exists and where no networking or IAM changes have been made.

Hoping for some insight on what the cause may be; with no real changes besides a new EC2 image with the latest ecs-agent 1.44.3, I'm curious whether others are seeing this behavior.

Expected Behavior

Images are consistently pulled without error from ECR using IAM role based credentials.

Observed Behavior

Images are inconsistently being pulled from ECR, with occasional errors suggesting the image doesn't exist or the security token is invalid.

Environment Details

AMI: latest pulled from SSM aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id with ecs-agent 1.44.3
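
(For reference, the AMI ID behind that SSM path can be resolved with something along these lines; us-west-2 is assumed from the logs below:)

# Resolve the recommended ECS-optimized AMI ID from the public SSM parameter
aws ssm get-parameters \
  --region us-west-2 \
  --names /aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id \
  --query 'Parameters[0].Value' \
  --output text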

docker info:

sh-4.2$ sudo docker info
Client:
 Debug Mode: false

Server:
 Containers: 2
  Running: 2
  Paused: 0
  Stopped: 0
 Images: 3
 Server Version: 19.03.6-ce
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: amazon-ecs-volume-plugin local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: ff48f57fc83a8c44cf4ad5d672424a98ba37ded6
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 4.14.193-149.317.amzn2.x86_64
 Operating System: Amazon Linux 2
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 3.851GiB
 Name: ${EC2_NAME_WITH_IP}
 ID: 2LVX:FPAR:4KHL:JKES:CC5J:BSM3:UASG:LMZD:MS44:7W2I:EVIB:UHJI
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Supporting Log Snippets

These are some logs from /var/log/ecs/ecs-init.log around the time of the failure, scrubbed of sensitive info.

level=error time=2020-09-17T15:21:25Z msg="Error inspecting image ${ECR_SOURCE_ACCOUNT_ID}.dkr.ecr.us-west-2.amazonaws.com/${ECR_REPO}:${IMAGE_TAG_ID}: Error: No such image: ${ECR_SOURCE_ACCOUNT_ID}.dkr.ecr.us-west-2.amazonaws.com/${ECR_REPO}:${IMAGE_TAG_ID}" module=docker_image_manager.go
level=error time=2020-09-17T15:21:25Z msg="Task engine [arn:aws:ecs:us-west-2:${ACCOUNT_ID}:task/${CLUSTER_NAME}/${CLUSTER_ID}]: unable to add container reference to image state: Error: No such image: ${ECR_SOURCE_ACCOUNT_ID}.dkr.ecr.us-west-2.amazonaws.com/${ECR_REPO}:${IMAGE_TAG_ID}" module=docker_task_engine.go
level=error time=2020-09-17T15:21:25Z msg="Task engine [arn:aws:ecs:us-west-2:${ACCOUNT_ID}:task/${CLUSTER_NAME}/${CLUSTER_ID}]: failed to pull image ${ECR_SOURCE_ACCOUNT_ID}.dkr.ecr.us-west-2.amazonaws.com/${ECR_REPO}:${IMAGE_TAG_ID} for container ${CONTAINER_NAME}: Error response from daemon: pull access denied for ${ECR_SOURCE_ACCOUNT_ID}.dkr.ecr.us-west-2.amazonaws.com/${ECR_REPO}, repository does not exist or may require 'docker login': denied: The security token included in the request is invalid." module=docker_task_engine.go
level=info time=2020-09-17T15:21:25Z msg="Task engine [arn:aws:ecs:us-west-2:${ACCOUNT_ID}:task/${CLUSTER_NAME}/${CLUSTER_ID}]: error transitioning container [${CONTAINER_NAME} (Runtime ID: )] to [PULLED]: Error response from daemon: pull access denied for ${ECR_SOURCE_ACCOUNT_ID}.dkr.ecr.us-west-2.amazonaws.com/${ECR_REPO}, repository does not exist or may require 'docker login': denied: The security token included in the request is invalid." module=docker_task_engine.go
level=info time=2020-09-17T15:21:25Z msg="Managed task [arn:aws:ecs:us-west-2:${ACCOUNT_ID}:task/${CLUSTER_NAME}/${CLUSTER_ID}]: Container [name=${CONTAINER_NAME} runtimeID=]: handling container change event [PULLED]" module=task_manager.go
level=error time=2020-09-17T15:21:25Z msg="Managed task [arn:aws:ecs:us-west-2:${ACCOUNT_ID}:task/${CLUSTER_NAME}/${CLUSTER_ID}]: error while pulling container ${CONTAINER_NAME} and image ${ECR_SOURCE_ACCOUNT_ID}.dkr.ecr.us-west-2.amazonaws.com/${ECR_REPO}:${IMAGE_TAG_ID}, will try to run anyway: Error response from daemon: pull access denied for ${ECR_SOURCE_ACCOUNT_ID}.dkr.ecr.us-west-2.amazonaws.com/${ECR_REPO}, repository does not exist or may require 'docker login': denied: The security token included in the request is invalid." module=task_manager.go
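
As a sanity check while debugging, something like the following can be run on an affected instance to authenticate with the instance role's credentials and attempt the same pull outside of the agent (a sketch assuming an AWS CLI version that supports get-login-password; the placeholders mirror the scrubbed values above):

# Authenticate docker against the cross-account registry using the instance role credentials
aws ecr get-login-password --region us-west-2 \
  | docker login --username AWS --password-stdin ${ECR_SOURCE_ACCOUNT_ID}.dkr.ecr.us-west-2.amazonaws.com

# Then attempt the same pull the agent is failing on
docker pull ${ECR_SOURCE_ACCOUNT_ID}.dkr.ecr.us-west-2.amazonaws.com/${ECR_REPO}:${IMAGE_TAG_ID}
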
fierlion (Member) commented

I'm sorry to hear that you're experiencing this issue with your pulls from ECR.

Can you talk about how you have tracked the behavior to the 1.44.3 release? Were you seeing the behavior when you updated a task or service to 1.44.2 or before? I assume from your description that you're consuming the latest SSM parameters to launch new Container Instances with the latest ECS-Optimized AMI.

Can you go into more detail about how you are updating your instances, so I can narrow down the range of potential commits/changes that could be having an effect here?


kyleian commented Sep 17, 2020

Hey @fierlion

I should have mentioned that I am building on top of the SSM-param AMI by way of Packer, but with only very minor additions to the base AMI from the SSM path, such as installing the SSM agent for SSH access and a Datadog agent.

I had initially tracked the behavior to the release by mapping back to an old AMI I had stored, which runs 1.43.0. However, I believe I may have been premature in pinning this issue to the ecs-agent version, as after a few hours of smooth operation I started seeing the behavior on my old 1.43.0 AMI as well.

So I am no longer convinced it's a version 1.44.3 issue.
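
(One way to confirm which agent version a given instance is actually running, independent of the AMI it was built from, is the agent's local introspection endpoint; a sketch, assuming the default introspection port:)

# The ECS agent introspection API reports the running agent version, cluster, and instance ARN
curl -s http://localhost:51678/v1/metadata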

I do, however, have a new lead on the cause: after looking at logs from each EC2 in my ASG, the machines that successfully pull the image are in AZs us-west-2a and us-west-2b, while the machines that fail to pull from the ECR registry are in us-west-2c. I will follow up with a support ticket on that, which is probably out of scope for this GitHub project, but it might be useful info for others coming here.
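
(For anyone trying to correlate the same symptom with AZs, the placement of a given instance can be checked from instance metadata; IMDSv1 shown for brevity, adjust for IMDSv2 token requirements:)

# Which availability zone is this instance in?
curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone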

fierlion (Member) commented

I asked because the last few releases have had very little change by way of ECR except to add a configurable pull timeout (#2565).
I'll close this issue now, but please re-open if you trace it back to the 1.44.3 (or before) update.
