Can't see task level cpu/memory utilization in cloudwatch #106
Comments
+1
Is this a limitation of the agent or of CloudWatch?
+1 for task-level monitoring
+1 It would be nice to have an official response, at least telling us whether CloudWatch will eventually offer this.
@billyshambrook More likely a limitation of ECS itself and the metrics it emits to CloudWatch. The current dimensions are ClusterName and ServiceName.
+1 for adding a dimension for TaskName to these metrics. Without it we can't really get a full picture of what's running on the cluster, only what's running in a service. Tasks scheduled based on things like CloudWatch Events are invisible.
Hello, One of the reasons for not having utilization metrics more granular than service name is the ephemeral nature of task IDs. Publishing utilization metrics by task ID can lead to metric spam, as most of these tasks are short-lived by nature. It's also very hard to alarm on something as ephemeral as a task ID; having something that's more human-readable makes it easier to do these things. Emitting utilization metrics aggregated by task definition family and version strings is sort of a middle ground, which we have considered as an alternative. Is that something that you think would prove helpful here? Thanks,
@aaithal - I think that would give us what we need for our use case.
Yes, @aaithal, that would be helpful. An alternative could also be a completely new metric such as "maxMemoryUtilization", which would track the maximum memory across all tasks in an ECS service.
+1 A task's container name does not change that much, right? Could that be used as an alternative dimension for task metrics?
+1
4 similar comments
+1 |
+1 |
+1 |
+1 |
Another reason for wanting insight into task-level metrics is to help debug issues. I have Service A and it runs 30 tasks. By other means (e.g. alerting on CloudWatch events from the ECS agent) I get notified that 1 or 2 tasks were stopped due to an OutOfMemoryError. When I view service-level metrics and look at max memory utilization during the timeframe in which those tasks were stopped, the max utilization is < 80%, even though the documentation describes the metric as covering the total memory in use by the tasks that belong to the service.
Out of my 30 tasks, only 2 of them were stopped due to memory pressure. What about the other tasks? Are they only utilizing a small percentage compared to the 2 that fell over? Or were they high in utilization as well, and only 2 tasks hit that breaking point? Knowing that makes a difference: either you don't have enough capacity overall, or you have some code that in certain data scenarios uses a ton of memory. If you already know the "total memory in use by the tasks that belong to the service" in order to show us the overall utilization, I'm hoping that, based on the conversations/feedback above, you'll find a way to expose it that makes sense to those looking for it. Thanks for listening! :)
+1 It's hard for me to recommend ECS as a container solution without being able to monitor basic container-level metrics. The burden is being pushed onto your consumers to develop our own means of container resource monitoring. I wrote PowerShell and Python scripts to ship these metrics to CloudWatch, but depending on the number of containers you're running across your environments, the cost can be quite ridiculous. I recommend shipping these metrics to another monitoring solution if you have lots of containers.
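For anyone who wants to roll their own in the meantime, here is a minimal sketch of the kind of script described above (not the commenter's actual code), assuming the ECS task metadata endpoint v4, the requests library, and boto3; the Custom/ECS namespace and the dimension names are purely illustrative, not an official schema.

```python
# Hypothetical sidecar script: read per-container stats from the ECS task metadata
# endpoint v4 and publish a memory utilization metric to CloudWatch with boto3.
import os

import boto3
import requests

METADATA_URI = os.environ["ECS_CONTAINER_METADATA_URI_V4"]  # injected by the ECS agent
cloudwatch = boto3.client("cloudwatch")

task = requests.get(f"{METADATA_URI}/task").json()
stats = requests.get(f"{METADATA_URI}/task/stats").json()  # keyed by Docker container ID

metric_data = []
for container in task["Containers"]:
    s = stats.get(container["DockerId"]) or {}
    mem = s.get("memory_stats") or {}
    usage, limit = mem.get("usage"), mem.get("limit")
    if not usage or not limit:
        continue
    metric_data.append({
        "MetricName": "MemoryUtilization",
        "Dimensions": [
            {"Name": "TaskDefinitionFamily", "Value": task["Family"]},
            {"Name": "ContainerName", "Value": container["Name"]},
        ],
        "Unit": "Percent",
        "Value": 100.0 * usage / limit,
    })

if metric_data:
    cloudwatch.put_metric_data(Namespace="Custom/ECS", MetricData=metric_data)
```

Run as a cron job or sidecar per task; as noted above, PutMetricData costs add up quickly with many containers.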
+1
1 similar comment
+1 |
Thanks for the feedback. I wanted to let you know that the ECS team is aware of this issue and that it is under active consideration. We always appreciate +1's and additional details on use cases.
+1
+1 This would be really helpful. We have a service running that hits 100% max CPU but with an average CPU of about 40%. Some tasks are doing more work than others; without task-level stats it is very hard to debug which tasks are running at capacity and why.
+1
2 similar comments
+1 |
+1 |
Moving this over to the containers roadmap since this is a feature request and not an ecs-agent issue.
+1
My use case is that I often see some of my services with max CPU utilization near 100% and min utilization near 10%. I can only assume that some tasks are working hard while others are being lazy, but I don't know which. I'd like to know, so I could either find out why or at least kill them and get better ones.
+1
1 similar comment
+1 |
A big +1 👍 for task-level resource tracking. Since the inside of a running container is normally so opaque, any additional information on run-time state is extremely valuable when things do not go according to plan.
Hi everyone, this feature is now in preview: https://aws.amazon.com/about-aws/whats-new/2019/07/introducing-container-insights-for-ecs-and-aws-fargate-in-preview/ We look forward to feedback!
It seems like for ECS, task-level metrics were not added to Container Insights. I only see "TaskDefinitionFamily" in the supported ECS dimensions: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-metrics-ECS.html
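If I'm reading the linked docs right, the finest ECS granularity is the ClusterName + TaskDefinitionFamily dimension set, so the best one can do today looks roughly like the boto3 sketch below (cluster and family names are placeholders).

```python
# Sketch: pull a Container Insights metric aggregated by task definition family.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

resp = cloudwatch.get_metric_statistics(
    Namespace="ECS/ContainerInsights",
    MetricName="MemoryUtilized",
    Dimensions=[
        {"Name": "ClusterName", "Value": "my-cluster"},               # placeholder
        {"Name": "TaskDefinitionFamily", "Value": "my-task-family"},  # placeholder
    ],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Average", "Maximum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])
```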
The docs explicitly state that this is not available for AWS Batch: "Currently, Container Insights isn't supported in AWS Batch." When will this be supported for Batch?
Is there any timeline for it to be supported for Batch too?
Technically, is it possible to turn Container Insights on in a Batch compute environment by running: aws ecs update-cluster-settings --cluster BatchComputeEnviromentClusterEC2 --settings "name=containerInsights,value=enabled" ? The compute environment for Fargate seems to be just a normal EC2-backed ECS cluster.
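For what it's worth, the boto3 equivalent of that CLI call is below (same cluster name as in the question); whether the setting actually takes effect on a Batch-managed cluster is exactly the open question here.

```python
# boto3 equivalent of the aws ecs update-cluster-settings command above.
import boto3

ecs = boto3.client("ecs")
ecs.update_cluster_settings(
    cluster="BatchComputeEnviromentClusterEC2",  # cluster name taken from the question
    settings=[{"name": "containerInsights", "value": "enabled"}],
)
```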
Hi!
In my team we're using ECS to run services. The ECS agent collects a bunch of metrics that we can then view in CloudWatch. Especially useful for us is the memory/CPU utilization metric, which we use to tune how much memory and CPU we allocate to services. This is really nice in that it allows us to catch services that are running at dangerously high memory before they go to 100% and get killed by the agent.
In our cluster we're also running a bunch of scheduled tasks, like ETLs, daily cleanups, sending emails at specific times etc. These are run by simply starting ECS tasks.
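To make the setup concrete, this is roughly what "simply starting ECS tasks" looks like via the RunTask API (a hedged illustration; the cluster and task definition names are placeholders, and the actual scheduling mechanism may differ).

```python
# Hypothetical example of launching a standalone (non-service) scheduled task.
import boto3

ecs = boto3.client("ecs")
ecs.run_task(
    cluster="my-cluster",                # placeholder
    taskDefinition="nightly-cleanup:3",  # placeholder family:revision
    launchType="EC2",
    count=1,
)
```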
In CloudWatch, we can only see ECS cluster- and service-level metrics, not task-level ones. This means that we can't see how much memory/CPU these tasks use (as they are not running under an ECS service).
It would be really nice to get these metrics for these kinds of short-lived tasks as well. Are there any plans to support this in CloudWatch?
Thanks for any replies!