ECS agent apparently stuck in infinite restart loop #1257
Comments
@alexhall, you're right. I think digging into the logs will help. I'm not able to see much from the snippet here. Please send the .tgz file to adnkha at amazon dot com. Thanks.
@adnxn I emailed you the log files. Unfortunately I wound up having to tear down the container instance and start a new one, so I won't be able to provide further info. But this wasn't the first time I encountered the agent error while placing tasks, so if there are further debugging steps, let me know in case it happens again.
If you can repro this with
Hello, I'd like to confirm that exactly the same situation happens on multiple container instances in our infrastructure -- the ECS agent restarts in an infinite loop (see the logs below). We noticed that it happens on container instances that have been running containers for a while, and that it might be triggered by a deployment to an ECS service. To recover capacity for ECS tasks in the cluster there are two options: remove /var/lib/ecs and restart the ECS agent (or just wait for its next start, since it restarts constantly), or replace the container instance. Unfortunately we don't have We faced this issue with ECS agents
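For reference, a minimal sketch of the first recovery option described above, assuming an Amazon Linux host where the agent is managed as the ecs service (exact commands and paths may differ on your setup):

```sh
# Sketch of the "clear agent state and restart" recovery path described above.
# Assumes Amazon Linux with the upstart-managed "ecs" service; on systemd hosts
# use systemctl instead. Removing the agent's local state causes the instance
# to re-register as a new container instance in the cluster.
sudo stop ecs                  # or: sudo systemctl stop ecs
sudo rm -rf /var/lib/ecs/data  # the comment above removes /var/lib/ecs entirely
sudo start ecs                 # or: sudo systemctl start ecs
```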
@alexhall @anosulchik Thanks for providing the logs. Based on them, I suspect you are running into issue #1237, where the agent could panic if the container has a healthcheck configured. This has been fixed in ECS Agent v1.17.1; can you upgrade to the latest agent version and see if you still experience this issue? If you are still experiencing it, could you share the output
Thanks,
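For anyone checking which agent version an instance is actually running, a quick sketch; it assumes the default ecs-agent container name and the agent's introspection endpoint on its default port 51678:

```sh
# Report the running agent version via the local introspection endpoint.
curl -s http://localhost:51678/v1/metadata
# Alternatively, inspect the image the agent container was started from.
docker inspect --format '{{.Config.Image}}' ecs-agent
```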
@richardpen Thanks for your input. We don't have Docker health checks enabled for the containers running in our ECS tasks. We use ECS services attached to an ALB target group to validate that containers are up and running, so I believe #1237 is not our case. At this moment we don't have a container instance with a crashed/restarting ECS agent, so we can only investigate retrospectively. I can download all of the ECS agent's logs since container instance start, as well as the Docker logs, if you need them. Just let me know. Thank you.
@anosulchik The log file may not contain useful information for this issue, because the panic stack trace wasn't saved to the log file. If you see this issue again, I wish you could run
Thanks,
Thanks @richardpen. The problem here is that the systemd unit for the ECS agent that comes with Amazon Linux restarts the ecs-agent container and deletes the one that failed or crashed: https://github.com/segmentio/stack/blob/master/packer/ecs/root/etc/systemd/system/ecs-agent.service Since we run ECS agents in prod, this setup is reasonable for us. But yeah, when the problem happens again, I'm going to collect the container's logs for you before recovering the container instance. Thanks.
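A rough sketch of capturing the crashed container's output before a restart policy like the one above cleans it up (assumes the default ecs-agent container name; adjust as needed):

```sh
# Find the exited agent container and save its output, which should include
# any panic stack trace printed to stderr.
docker ps -a --filter name=ecs-agent
docker logs ecs-agent > /tmp/ecs-agent-crash.log 2>&1
```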
@richardpen I'm not sure whether or not we had health checks enabled, but the behavior does match that described in #1237 and I can confirm that we have not seen these errors recur since upgrading to 1.17.1 a week ago. |
Hi, not sure whether it helps, but we hit this issue this morning with 1.17.0 but no panic was seen in the ecs-agent docker logs. It seemed to have the "Unable to retrieve stats for container" issue for about an hour before everything went dark. SSH stopped working too. However, in /var/log/messages the following started appearing around the same time, which makes me think that it was caused not by the health check but by an ambient memory issue. Anyway, thought someone might find it useful.
The server has been replaced and is now running 1.17.1.
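For anyone seeing the same symptoms, a couple of quick checks for OOM-killer activity around the time the agent goes dark (paths assume Amazon Linux; adjust for your distribution):

```sh
# Look for out-of-memory events in the kernel ring buffer and syslog.
dmesg | grep -iE 'out of memory|oom'
sudo grep -iE 'out of memory|oom' /var/log/messages
```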
Hi all, please let us know if you run into this again with the 1.17.2 version of the ECS agent. Closing this issue for now. Thanks,
Hi all! We have run into this issue with the 1.29.0 version of the ECS agent.
Summary
An ECS container instance launched from an EC2 autoscaling group using the ECS-optimized AMI became disconnected from the cluster. The apparent reason was that the ECS agent on the instance was restarting every few seconds.
Description
I created an ECS cluster attached to an EC2 autoscaling group, along with an ECS service and task, using a CloudFormation template. After a couple of updates to the CFN template, the service failed to update with the message:
service ronaldo-docker-staging-service-1MBJ5GTPZJ6WM was unable to place a task because no container instance met all of its requirements. The closest matching container-instance 99024b43-6851-4f3e-8867-cdae0bf0c8ec encountered error "AGENT".
(This was a staging environment, so the autoscaling group had a desired count of 1.) I ssh'ed into the ECS container instance and followed the troubleshooting instructions at https://aws.amazon.com/premiumsupport/knowledge-center/ecs-agent-disconnected/.
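For reference, a rough loop for the repeated check described just below (assumes the agent runs as a container named ecs-agent):

```sh
# Print the agent container's ID and uptime every few seconds; a constantly
# changing ID with an "Up X seconds" status indicates a restart loop.
while true; do
  docker ps --filter name=ecs-agent --format '{{.ID}} {{.Status}}'
  sleep 5
done
```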
Running docker ps multiple times in succession showed new container IDs on each invocation, with an uptime of less than 10 seconds, which seems to indicate that the agent is being constantly restarted and cannot establish a connection with the ECS cluster.
Environment Details
ECS container instance launched via an EC2 autoscaling group using ami-5e414e24.
Supporting Log Snippets
From the ECS agent logfile:
The last group of lines, from "Loading configuration" to "Task [...]: recording execution stopped time", is repeated indefinitely from here on out.
I ran the log-collector, but GitHub won't let me upload a .tgz file, so let me know if there's an alternate way to send the file.
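(For context, the collector was run roughly as sketched below; this is from memory, so check the aws/amazon-ecs-logs-collector project for the exact download URL and invocation.)

```sh
# Approximate invocation of the ECS logs collector script after downloading it
# from the aws/amazon-ecs-logs-collector repository; it writes a .tgz bundle.
sudo bash ecs-logs-collector.sh
```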