runtime error: index out of range #1237

Closed

cgouteix opened this issue Feb 16, 2018 · 8 comments
@cgouteix (Author)

Summary

After upgrading to 1.17.0, the agent container is unable to start on some instances, seemingly at random.

Description

I'm using ECS with an EC2 Auto Scaling group. All my containers have a Docker HEALTHCHECK enabled.

I'm using ami-63cbae0c (amzn-ami-2017.09.h-amazon-ecs-optimized) in eu-central-1. I was on ami-eacf5d85 (amzn-ami-2017.09.f-amazon-ecs-optimized) with no such trouble.

After upgrading to 1.17.0 I get this behavior randomly (on some EC2 instances, not all):

  • Tasks could not be started, with this event message: service xxx was unable to place a task because no container instance met all of its requirements. The closest matching container-instance xxx encountered error "AGENT". For more information, see the Troubleshooting section.
  • In the ECS Instances tab, the "Agent Connected" column says false for the previously mentioned instance.
  • After logging in to the instance, I can see that the ecs-agent container is perpetually restarting. Its last log message is panic: runtime error: index out of range (see the trace below).

Environment Details

$ docker info
Containers: 35
 Running: 29
 Paused: 0
 Stopped: 6
Images: 6
Server Version: 17.09.1-ce
Storage Driver: devicemapper
 Pool Name: docker-docker--pool
 Pool Blocksize: 524.3kB
 Base Device Size: 10.74GB
 Backing Filesystem: ext4
 Data file: 
 Metadata file: 
 Data Space Used: 2.904GB
 Data Space Total: 23.33GB
 Data Space Available: 20.42GB
 Metadata Space Used: 2.707MB
 Metadata Space Total: 25.17MB
 Metadata Space Available: 22.46MB
 Thin Pool Minimum Free Space: 2.333GB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
 Library Version: 1.02.135-RHEL7 (2016-11-16)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.77-31.58.amzn1.x86_64
Operating System: Amazon Linux AMI 2017.09
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.67GiB
Name: ip-10-93-4-161
ID: YFQZ:3NTN:N2RV:LFFW:UTLS:SWUY:5UES:324C:ZB5I:67LA:TI76:NMAA
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
$ curl http://localhost:51678/v1/metadata
{"Cluster":"xxx","ContainerInstanceArn":"arn:aws:ecs:eu-central-1:xxx","Version":"Amazon ECS Agent - v1.17.0 (761937f7)"}
$ df -h
Filesystem         Size   Used  Avail Use% Mounted on
/dev/xvda1         7.8G   769M  7.0G  10% /
devtmpfs           7.9G   196K  7.9G   1% /dev
tmpfs              7.9G      0  7.9G   0% /dev/shm

Supporting Log Snippets

It is very hard to catch, because the agent container is recreated every few seconds:

$ docker logs -f ecs-agent
panic: runtime error: index out of range

goroutine 1 [running]:
github.com/aws/amazon-ecs-agent/agent/engine.metadataFromContainer(0xc4204f2000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_client.go:792 +0x7b2
github.com/aws/amazon-ecs-agent/agent/engine.(*dockerGoClient).DescribeContainer(0xc420281b80, 0xc4201cb8c0, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_client.go:590 +0x101
github.com/aws/amazon-ecs-agent/agent/engine.(*DockerTaskEngine).synchronizeContainerStatus(0xc4202a8000, 0xc42029bb90, 0xc4202a8140)
	/go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_task_engine.go:349 +0x5c7
github.com/aws/amazon-ecs-agent/agent/engine.(*DockerTaskEngine).synchronizeState(0xc4202a8000)
	/go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_task_engine.go:276 +0x295
github.com/aws/amazon-ecs-agent/agent/engine.(*DockerTaskEngine).Init(0xc4202a8000, 0x1002040, 0xc42001a028, 0x2, 0x410bf7)
	/go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_task_engine.go:199 +0xf6
github.com/aws/amazon-ecs-agent/agent/engine.(*DockerTaskEngine).MustInit.func1(0x0, 0x0)
	/go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_task_engine.go:227 +0x56
github.com/aws/amazon-ecs-agent/agent/utils.RetryWithBackoffCtx(0x1002040, 0xc42001a028, 0xffda00, 0xc420313500, 0xc4200ebc68, 0xc4200ebc01, 0xc420313500)
	/go/src/github.com/aws/amazon-ecs-agent/agent/utils/utils.go:130 +0x87
github.com/aws/amazon-ecs-agent/agent/utils.RetryWithBackoff(0xffda00, 0xc420313500, 0xc4200ebc68, 0xf, 0xc4203a6200)
	/go/src/github.com/aws/amazon-ecs-agent/agent/utils/utils.go:113 +0x57
github.com/aws/amazon-ecs-agent/agent/engine.(*DockerTaskEngine).MustInit(0xc4202a8000, 0x1002040, 0xc42001a028)
	/go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_task_engine.go:223 +0x151
github.com/aws/amazon-ecs-agent/agent/app.(*ecsAgent).doStart(0xc420203e00, 0xc4201fc770, 0x1001180, 0xc420292980, 0x1006780, 0xc4201da0f0, 0x1003c40, 0xc4201f2fc0, 0x1003280, 0xc4201da1e0, ...)
	/go/src/github.com/aws/amazon-ecs-agent/agent/app/agent.go:285 +0x42c
github.com/aws/amazon-ecs-agent/agent/app.(*ecsAgent).start(0xc420203e00, 0xc42001a028)
	/go/src/github.com/aws/amazon-ecs-agent/agent/app/agent.go:199 +0x339
github.com/aws/amazon-ecs-agent/agent/app.Run(0xc42000e180, 0x0, 0x0, 0x0)
	/go/src/github.com/aws/amazon-ecs-agent/agent/app/run.go:64 +0x15b
main.main()
	/go/src/github.com/aws/amazon-ecs-agent/agent/agent.go:29 +0x63
@cgouteix (Author)

It is happening again, in my pre-production environment:

$ docker logs -f ecs-agent
2018-02-16T17:05:16Z [INFO] Loading configuration
2018-02-16T17:05:16Z [INFO] Loading state! module="statemanager"
2018-02-16T17:05:16Z [INFO] Event stream ContainerChange start listening...
2018-02-16T17:05:16Z [INFO] Restored cluster 'cmdpre'
2018-02-16T17:05:16Z [INFO] Restored from checkpoint file. I am running as 'arn:aws:ecs:eu-central-1:559122992834:container-instance/73a4c4aa-a1de-4030-b709-6e2020fc137f' in cluster 'cmdpre'
2018-02-16T17:05:16Z [INFO] Registered container instance with cluster!
panic: runtime error: index out of range

goroutine 1 [running]:
github.com/aws/amazon-ecs-agent/agent/engine.metadataFromContainer(0xc4205de600, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_client.go:792 +0x7b2
github.com/aws/amazon-ecs-agent/agent/engine.(*DockerTaskEngine).synchronizeContainerStatus(0xc4202b2000, 0xc4202856e0, 0xc4202b2640)
	/go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_task_engine.go:337 +0x336
github.com/aws/amazon-ecs-agent/agent/engine.(*DockerTaskEngine).synchronizeState(0xc4202b2000)
	/go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_task_engine.go:276 +0x295
github.com/aws/amazon-ecs-agent/agent/engine.(*DockerTaskEngine).Init(0xc4202b2000, 0x1002040, 0xc42001a028, 0x2, 0x410bf7)
	/go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_task_engine.go:199 +0xf6
github.com/aws/amazon-ecs-agent/agent/engine.(*DockerTaskEngine).MustInit.func1(0x0, 0x0)
	/go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_task_engine.go:227 +0x56
github.com/aws/amazon-ecs-agent/agent/utils.RetryWithBackoffCtx(0x1002040, 0xc42001a028, 0xffda00, 0xc4204ab410, 0xc420347c68, 0xc420347c01, 0xc4204ab410)
	/go/src/github.com/aws/amazon-ecs-agent/agent/utils/utils.go:130 +0x87
github.com/aws/amazon-ecs-agent/agent/utils.RetryWithBackoff(0xffda00, 0xc4204ab410, 0xc420347c68, 0xf, 0xc42027dc40)
	/go/src/github.com/aws/amazon-ecs-agent/agent/utils/utils.go:113 +0x57
github.com/aws/amazon-ecs-agent/agent/engine.(*DockerTaskEngine).MustInit(0xc4202b2000, 0x1002040, 0xc42001a028)
	/go/src/github.com/aws/amazon-ecs-agent/agent/engine/docker_task_engine.go:223 +0x151
github.com/aws/amazon-ecs-agent/agent/app.(*ecsAgent).doStart(0xc420276e00, 0xc42019a930, 0x1001180, 0xc42025f720, 0x1006780, 0xc420066c80, 0x1003c40, 0xc4202a81b0, 0x1003280, 0xc420066d70, ...)
	/go/src/github.com/aws/amazon-ecs-agent/agent/app/agent.go:285 +0x42c
github.com/aws/amazon-ecs-agent/agent/app.(*ecsAgent).start(0xc420276e00, 0xc42001a028)
	/go/src/github.com/aws/amazon-ecs-agent/agent/app/agent.go:199 +0x339
github.com/aws/amazon-ecs-agent/agent/app.Run(0xc42000e180, 0x0, 0x0, 0x0)
	/go/src/github.com/aws/amazon-ecs-agent/agent/app/run.go:64 +0x15b
main.main()
	/go/src/github.com/aws/amazon-ecs-agent/agent/agent.go:29 +0x63
$ docker info
Containers: 31
 Running: 20
 Paused: 0
 Stopped: 11
Images: 5
Server Version: 17.09.1-ce
Storage Driver: devicemapper
 Pool Name: docker-docker--pool
 Pool Blocksize: 524.3kB
 Base Device Size: 10.74GB
 Backing Filesystem: ext4
 Data file: 
 Metadata file: 
 Data Space Used: 2.4GB
 Data Space Total: 23.33GB
 Data Space Available: 20.93GB
 Metadata Space Used: 2.392MB
 Metadata Space Total: 25.17MB
 Metadata Space Available: 22.77MB
 Thin Pool Minimum Free Space: 2.333GB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
 Library Version: 1.02.135-RHEL7 (2016-11-16)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.77-31.58.amzn1.x86_64
Operating System: Amazon Linux AMI 2017.09
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.798GiB
Name: ip-10-93-5-190
ID: YX3L:UVEW:F2GM:PW73:ZPH7:C4BC:FB3U:3P22:CUUM:UJU2:YPM2:VXTX
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
$ curl http://localhost:51678/v1/metadata
curl: (7) Failed to connect to localhost port 51678: Connection refused

adnxn added the kind/bug label on Feb 16, 2018
@adnxn (Contributor) commented Feb 16, 2018

@cgouteix good catch, thanks for bringing this to our attention.

@adnxn (Contributor) commented Feb 16, 2018

also -

I'm not able to reproduce this error on my end, and some more logging data would be very helpful for debugging this issue.

Do you mind sharing your container image metadata as well? Specifically, I'm curious about the healthcheck command and how its output interacts with the agent. Would you be able to send me logs from the problematic instance? If this is easily reproducible for you, I'd also suggest running the agent with ECS_LOGLEVEL=debug.

You can capture logs using our log collection tool and email them to me directly at adnkha at amazon dot com. If you end up sending the logs to my email, please update this issue so I don't miss them.

Thanks 😄
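
For readers following along, here is a minimal, purely hypothetical Go sketch of the failure mode being discussed: reading the most recent entry of a container's health-check log without first checking that the log is non-empty. The checkResult type and lastCheckOutput function are invented for illustration; they are not the agent's actual code at docker_client.go:792.

package main

import "fmt"

// checkResult is a hypothetical stand-in for one recorded health-check run
// of a container; it is NOT the agent's or Docker's actual struct.
type checkResult struct {
	ExitCode int
	Output   string
}

// lastCheckOutput reads the most recent health-check result by indexing the
// last element of the log slice. If the slice is empty (for example, for a
// stopped container that defines a HEALTHCHECK but never recorded a result),
// the expression log[len(log)-1] panics with "runtime error: index out of
// range", the same class of panic shown in the traces above.
func lastCheckOutput(log []checkResult) string {
	return log[len(log)-1].Output
}

func main() {
	populated := []checkResult{{ExitCode: 0, Output: "healthy"}}
	fmt.Println(lastCheckOutput(populated)) // prints "healthy"

	// var empty []checkResult
	// lastCheckOutput(empty) // would panic: runtime error: index out of range
}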

adnxn self-assigned this on Feb 16, 2018
adnxn added this to the 1.17.1 milestone on Feb 16, 2018
@cgouteix (Author)

I'll try to reproduce it and give you more info. But for now I had to roll back to the previous AMI.

I think maybe it has something to do with Scheduled Tasks. I use them a lot.

@cgouteix (Author)

So it seems that I was right.
It happens with scheduled tasks.
I can't say whether the ecs-agent crashed after the task started or after it ended, but it did crash.
And I was able to get the ecs-agent to restart by deleting the stopped container that had been created for the scheduled task.
I'll send you my ecs logs by email, but I don't think they will help much.
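
For anyone who needs to apply the same workaround across several instances, the sketch below lists exited containers that carry an ECS task label so the offending stopped container can be found and removed. It is an untested convenience sketch that assumes the github.com/fsouza/go-dockerclient package and the com.amazonaws.ecs.task-arn label; the usual docker ps -a plus docker rm on the instance does the same job.

package main

import (
	"fmt"
	"log"

	docker "github.com/fsouza/go-dockerclient"
)

func main() {
	// Talks to the local Docker daemon using the standard DOCKER_* env vars
	// (or the default unix socket).
	client, err := docker.NewClientFromEnv()
	if err != nil {
		log.Fatalf("cannot create Docker client: %v", err)
	}

	// List all containers, including stopped ones, restricted to exited state.
	containers, err := client.ListContainers(docker.ListContainersOptions{
		All:     true,
		Filters: map[string][]string{"status": {"exited"}},
	})
	if err != nil {
		log.Fatalf("cannot list containers: %v", err)
	}

	for _, c := range containers {
		// Containers started by the ECS agent carry task metadata labels
		// (assumed label name; verify with `docker inspect` on your host).
		if taskArn, ok := c.Labels["com.amazonaws.ecs.task-arn"]; ok {
			fmt.Printf("exited ECS container %s %v from task %s\n", c.ID, c.Names, taskArn)
			// Once identified, remove it by hand with `docker rm <id>`, or
			// uncomment the call below to remove it programmatically:
			// _ = client.RemoveContainer(docker.RemoveContainerOptions{ID: c.ID})
		}
	}
}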

@dynaum commented Feb 19, 2018

We are facing the same problem here, but in our case the error occurs when the agent tries to stop a task with two containers.

@garrickstapleszed
+1

I'm seeing a lot of these after upgrading to 1.17.0. I'm anxiously awaiting 1.17.1.

@adnxn (Contributor) commented Feb 23, 2018

The latest release 1.17.1 includes a fix for this issue.
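
For illustration only, and not the actual patch shipped in 1.17.1: a guarded variant of the hypothetical lookup sketched earlier avoids the panic by checking the slice length before indexing.

package main

import "fmt"

// checkResult is the same hypothetical health-check record used in the
// earlier sketch, redefined here so this snippet stands alone.
type checkResult struct {
	ExitCode int
	Output   string
}

// lastCheckOutputSafe returns the most recent health-check output, or
// ok == false when no results have been recorded, instead of panicking
// on an empty log.
func lastCheckOutputSafe(log []checkResult) (output string, ok bool) {
	if len(log) == 0 {
		return "", false
	}
	return log[len(log)-1].Output, true
}

func main() {
	var empty []checkResult
	if out, ok := lastCheckOutputSafe(empty); ok {
		fmt.Println("last health-check output:", out)
	} else {
		fmt.Println("no health-check results recorded; skipping")
	}
}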
