Images pulled from ECR with IAM creds failing with invalidated session #2638

Closed

kyleian opened this issue Sep 17, 2020 · 3 comments

kyleian commented Sep 17, 2020

Summary

Since 1.44.3, I am noticing that on EC2 instances belonging to an ECS cluster, after an initial pull of a Docker image from ECR, re-running a job to create a fresh service/task will cause that EC2 instance to fail to pull the image from ECR with cryptic errors.

Description

I am running a multi-account setup with a centralized account housing an ECR registry, where I publish images once in a dev environment and allow higher environments like prod to use them with a different set of environment variables. Repos in this registry are granted org-level permissions, so all accounts within the org have been able to pull images without fail until 1.44.3, based on the ECS AMI available at SSM path aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id.
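
For context, the org-level pull permission on each repo is the kind of thing set via an ECR repository policy; a minimal sketch of applying one with the AWS CLI follows (the repo name and org ID are placeholders, and the exact policy we use may differ):

# Grant pull-only access to every principal in the AWS Organization (placeholder values)
aws ecr set-repository-policy \
  --region us-west-2 \
  --repository-name my-repo \
  --policy-text '{
    "Version": "2012-10-17",
    "Statement": [{
      "Sid": "AllowOrgPull",
      "Effect": "Allow",
      "Principal": "*",
      "Action": [
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
        "ecr:BatchCheckLayerAvailability"
      ],
      "Condition": {
        "StringEquals": { "aws:PrincipalOrgID": "o-EXAMPLE" }
      }
    }]
  }'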

My ECS clusters authenticate with ECR via IAM instance roles that attach the standard Amazon managed policies as follows:

ManagedPolicyArns:
      - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
      - arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role
      - arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
      - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

Now, on EC2 instances running ecs-agent v1.44.3 (aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id), I am experiencing strange errors when pulling an image from the org-level ECR, where the image absolutely exists and where no networking or IAM changes have been made.

Hoping for some insight on what the cause may be; with no real changes besides a new EC2 image with the latest ecs-agent 1.44.3, I'm curious whether others are seeing this behavior.

Expected Behavior

Images are consistently pulled without error from ECR using IAM role based credentials.

Observed Behavior

Images are inconsistently being pulled from ECR, with occasional errors suggesting the image doesn't exist or the security token is invalid.

Environment Details

AMI: latest pulled from SSM aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id with ecs-agent 1.44.3
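
(For reference, the AMI ID behind that SSM path can be resolved with something along these lines; us-west-2 is assumed from the logs below:)

# Resolve the recommended ECS-optimized AMI ID from the public SSM parameter
aws ssm get-parameters \
  --region us-west-2 \
  --names /aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id \
  --query 'Parameters[0].Value' \
  --output text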

docker info:

sh-4.2$ sudo docker info
Client:
 Debug Mode: false

Server:
 Containers: 2
  Running: 2
  Paused: 0
  Stopped: 0
 Images: 3
 Server Version: 19.03.6-ce
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: amazon-ecs-volume-plugin local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: ff48f57fc83a8c44cf4ad5d672424a98ba37ded6
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 4.14.193-149.317.amzn2.x86_64
 Operating System: Amazon Linux 2
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 3.851GiB
 Name: ${EC2_NAME_WITH_IP}
 ID: 2LVX:FPAR:4KHL:JKES:CC5J:BSM3:UASG:LMZD:MS44:7W2I:EVIB:UHJI
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Supporting Log Snippets

These are some logs from /var/log/ecs/ecs-init.log around the time of the failure, scrubbed of sensitive info.

level=error time=2020-09-17T15:21:25Z msg="Error inspecting image ${ECR_SOURCE_ACCOUNT_ID}.dkr.ecr.us-west-2.amazonaws.com/${ECR_REPO}:${IMAGE_TAG_ID}: Error: No such image: ${ECR_SOURCE_ACCOUNT_ID}.dkr.ecr.us-west-2.amazonaws.com/${ECR_REPO}:${IMAGE_TAG_ID}" module=docker_image_manager.go
level=error time=2020-09-17T15:21:25Z msg="Task engine [arn:aws:ecs:us-west-2:${ACCOUNT_ID}:task/${CLUSTER_NAME}/${CLUSTER_ID}]: unable to add container reference to image state: Error: No such image: ${ECR_SOURCE_ACCOUNT_ID}.dkr.ecr.us-west-2.amazonaws.com/${ECR_REPO}:${IMAGE_TAG_ID}" module=docker_task_engine.go
level=error time=2020-09-17T15:21:25Z msg="Task engine [arn:aws:ecs:us-west-2:${ACCOUNT_ID}:task/${CLUSTER_NAME}/${CLUSTER_ID}]: failed to pull image ${ECR_SOURCE_ACCOUNT_ID}.dkr.ecr.us-west-2.amazonaws.com/${ECR_REPO}:${IMAGE_TAG_ID} for container ${CONTAINER_NAME}: Error response from daemon: pull access denied for ${ECR_SOURCE_ACCOUNT_ID}.dkr.ecr.us-west-2.amazonaws.com/${ECR_REPO}, repository does not exist or may require 'docker login': denied: The security token included in the request is invalid." module=docker_task_engine.go
level=info time=2020-09-17T15:21:25Z msg="Task engine [arn:aws:ecs:us-west-2:${ACCOUNT_ID}:task/${CLUSTER_NAME}/${CLUSTER_ID}]: error transitioning container [${CONTAINER_NAME} (Runtime ID: )] to [PULLED]: Error response from daemon: pull access denied for ${ECR_SOURCE_ACCOUNT_ID}.dkr.ecr.us-west-2.amazonaws.com/${ECR_REPO}, repository does not exist or may require 'docker login': denied: The security token included in the request is invalid." module=docker_task_engine.go
level=info time=2020-09-17T15:21:25Z msg="Managed task [arn:aws:ecs:us-west-2:${ACCOUNT_ID}:task/${CLUSTER_NAME}/${CLUSTER_ID}]: Container [name=${CONTAINER_NAME} runtimeID=]: handling container change event [PULLED]" module=task_manager.go
level=error time=2020-09-17T15:21:25Z msg="Managed task [arn:aws:ecs:us-west-2:${ACCOUNT_ID}:task/${CLUSTER_NAME}/${CLUSTER_ID}]: error while pulling container ${CONTAINER_NAME} and image ${ECR_SOURCE_ACCOUNT_ID}.dkr.ecr.us-west-2.amazonaws.com/${ECR_REPO}:${IMAGE_TAG_ID}, will try to run anyway: Error response from daemon: pull access denied for ${ECR_SOURCE_ACCOUNT_ID}.dkr.ecr.us-west-2.amazonaws.com/${ECR_REPO}, repository does not exist or may require 'docker login': denied: The security token included in the request is invalid." module=task_manager.go
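
As a sanity check while debugging, something like the following can be run on an affected instance to authenticate with the instance role's credentials and attempt the same pull outside of the agent (a sketch assuming an AWS CLI version that supports get-login-password; the placeholders mirror the scrubbed values above):

# Authenticate docker against the cross-account registry using the instance role credentials
aws ecr get-login-password --region us-west-2 \
  | docker login --username AWS --password-stdin ${ECR_SOURCE_ACCOUNT_ID}.dkr.ecr.us-west-2.amazonaws.com

# Then attempt the same pull the agent is failing on
docker pull ${ECR_SOURCE_ACCOUNT_ID}.dkr.ecr.us-west-2.amazonaws.com/${ECR_REPO}:${IMAGE_TAG_ID}
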
fierlion (Member) commented

I'm sorry to hear that you're experiencing this issue with your pulls from ECR.

Can you talk about how you have tracked the behavior to the 1.44.3 release? Were you seeing the behavior when you updated a task or service to 1.44.2 or before? I assume from your description that you're consuming the latest SSM parameters to launch new Container Instances with the latest ECS-Optimized AMI.

Can you go into more detail about how you are updating your instances, so I can narrow down the range of potential commits/changes that could be having an effect here?


kyleian commented Sep 17, 2020

Hey @fierlion

I should have mentioned that I am building on top of the SSM-param AMI by way of Packer, but with only very minor additions to the base AMI from the SSM path, such as installing the SSM agent for SSH access and a Datadog agent.

I had initially tracked the behavior to the release by mapping back to an old AMI I had stored, which runs 1.43.0. However, I believe I may have been premature in pinning this issue to the ecs-agent version, as after a few hours of smooth operation I started seeing the behavior on my old 1.43.0 AMI as well.

So I am no longer convinced it's a version 1.44.3 issue.
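
(One way to confirm which agent version a given instance is actually running, independent of the AMI it was built from, is the agent's local introspection endpoint; a sketch, assuming the default introspection port:)

# The ECS agent introspection API reports the running agent version, cluster, and instance ARN
curl -s http://localhost:51678/v1/metadata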

I do, however, have a new lead on the cause: after looking at logs from each EC2 in my ASG, the machines that successfully pull the image are in AZs us-west-2a and us-west-2b, while the machines that fail to pull from the ECR registry are in us-west-2c. I will follow up with a support ticket on that, which is probably out of scope for this GitHub project, but it might be useful info for others coming here.
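
(For anyone trying to correlate the same symptom with AZs, the placement of a given instance can be checked from instance metadata; IMDSv1 shown for brevity, adjust for IMDSv2 token requirements:)

# Which availability zone is this instance in?
curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone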

fierlion (Member) commented

I asked because the last few releases have had very little change by way of ECR except to add a configurable pull timeout (#2565).
I'll close this issue now, but please re-open if you trace it back to the 1.44.3 (or before) update.
