Change default docker metric gathering behavior #2452
Conversation
force-pushed from f02e7c6 to 73bc2f0
Left comments about readme changes, LGTM 🚀
force-pushed from 800ca4f to 5704544
@@ -1369,56 +1369,56 @@ func (dg *dockerGoClient) Stats(ctx context.Context, id string, inactivityTimeout
}()
The changes here are because functional tests caught that polling metrics took longer to populate the task stats API than streaming stats did. That is because polling jitters the initial stats poll to avoid hammering the docker stats API on startup.
So this changes polling metrics to poll for docker stats immediately on startup. That is not ideal from a load perspective, but it is necessary to avoid an unintentional behavior change.
1. Change the default docker metric gathering behavior from streaming metrics to polling.
2. Change the default polling interval to half of the TACS publishing interval (currently 20s), so that every publish interval we have two docker metrics.
3. Change the minimum polling interval to 5s to prevent customers from configuring polling to be just as resource-intensive as streaming metrics.

These changes are being made because we have found that docker streaming stats consumes considerable resources from the agent, dockerd daemon, and containerd daemon.
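For context on item 1, the difference between streaming and polling comes down to the `stream` flag on docker's stats endpoint. Here is a minimal sketch using the upstream Docker Go SDK directly (the agent actually goes through its own `dockerGoClient` wrapper; the container name and client setup below are illustrative, not the agent's code):

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

// pollOnce asks dockerd for a single stats sample (stream=false). Streaming mode
// would pass stream=true and keep decoding samples from the same response body,
// which keeps an open stats stream per container on both the client and dockerd.
func pollOnce(ctx context.Context, cli *client.Client, containerID string) (*types.StatsJSON, error) {
	resp, err := cli.ContainerStats(ctx, containerID, false)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var s types.StatsJSON
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		return nil, err
	}
	return &s, nil
}

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv)
	if err != nil {
		panic(err)
	}
	// "some-container" is a placeholder name for illustration.
	s, err := pollOnce(context.Background(), cli, "some-container")
	if err != nil {
		panic(err)
	}
	fmt.Println("total CPU usage (ns):", s.CPUStats.CPUUsage.TotalUsage)
}
```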
To avoid a behavior change relative to streaming stats, we need to populate the stats endpoint immediately when the stats engine starts. So instead of jittering the first stats gather, we just do it immediately.
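A rough sketch of the loop shape described here, with the first collection done up front rather than after a random jitter (function and parameter names are placeholders, not the agent's actual identifiers):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// pollStats collects once immediately so the task stats API is populated as fast
// as it was with streaming stats, then keeps collecting on the configured interval.
func pollStats(ctx context.Context, interval time.Duration, collect func()) {
	// Previously the first collection was delayed by a random jitter to avoid
	// hammering the docker stats API on startup; doing it up front avoids the
	// slow-to-populate behavior that the functional tests caught.
	collect()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			collect()
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	pollStats(ctx, time.Second, func() { fmt.Println("gather docker stats") })
}
```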
force-pushed from 5704544 to c4871aa
force-pushed from 544d296 to 8bb1a6b
force-pushed from 8bb1a6b to 1d59927
The 5/12 release broke our tests.
force-pushed from 1d59927 to 28ba4aa
Summary
1. Change the default docker metric gathering behavior from streaming metrics to polling.
2. Change the default polling interval to half of the TACS publishing interval (currently a publish interval of 20s), so that every publish interval we have two docker metrics (previously 15s).
3. Change the minimum polling interval to 5s to prevent customers from configuring resource-intensive polling (previously 1s).
4. If the configured polling interval is below the minimum, log a warning and set the interval to the minimum (previously it was set to the default).
5. If the configured polling interval is above the maximum, log a warning and set the interval to the maximum (previously it was set to the default); a sketch of this validation appears after the next paragraph.
These changes are being made because we have found that docker streaming
stats consumes considerable resources from the agent, dockerd daemon, and
containerd daemon.
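Items 4 and 5 amount to clamping with a warning instead of silently resetting to the default. A minimal sketch of that validation, assuming placeholder values for everything except the 5s minimum stated above (the real default and maximum live in the agent's config code):

```go
package main

import (
	"log"
	"time"
)

const (
	// The 5s minimum comes from the summary above; the maximum here is only a
	// placeholder to show the clamping shape, not the agent's real value.
	minPollingInterval = 5 * time.Second
	maxPollingInterval = 20 * time.Second
)

// validatePollingInterval warns and clamps an out-of-range configured value
// instead of silently resetting it to the default, matching items 4 and 5.
func validatePollingInterval(configured time.Duration) time.Duration {
	switch {
	case configured < minPollingInterval:
		log.Printf("warning: polling interval %v is below the minimum, using %v", configured, minPollingInterval)
		return minPollingInterval
	case configured > maxPollingInterval:
		log.Printf("warning: polling interval %v is above the maximum, using %v", configured, maxPollingInterval)
		return maxPollingInterval
	default:
		return configured
	}
}

func main() {
	log.Println(validatePollingInterval(1 * time.Second)) // clamped up to 5s
}
```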
The graph below shows the improvement in CPU utilization on a single-instance cluster (an m5.large running 120 containers). On agent 1.39.0 with default settings, cluster utilization maxes out at 86% because agent/dockerd/containerd are using ~14% of the instance's CPU. With this change (1.40.0) we max out at 94.5%. This means we see a 60% reduction in resources consumed by ECS daemons and a 10% improvement in overall cluster utilization.
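Working the stated numbers through: daemon overhead falls from roughly 14% of the instance to roughly 100 - 94.5 = 5.5%, and (14 - 5.5) / 14 ≈ 0.6, i.e. about a 60% reduction; the climb from 86% to 94.5% is the roughly 10% improvement in cluster utilization.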
NOTE: higher CPU utilization in these graphs is a good thing; it means more of the cluster's CPU is going to the customer's containers/tasks and less to the daemons required for ECS.
The improvement gets more dramatic with more containers, and is not as dramatic but still substantial with fewer.
Testing
New tests cover the changes: no
Description for the changelog
The agent's default stats gathering is changing from docker streaming stats to polling. This should not affect the metrics that customers ultimately see in CloudWatch, but it does affect how the agent gathers the underlying metrics from docker. This change was made for considerable performance gains. Customers with high CPU loads may see their cluster utilization increase; this is a good thing because it means the containers are utilizing more of the cluster, and agent/dockerd/containerd are utilizing less.
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.