Update readme for unstable reserved memory value reported when ECS_POLL_METRICS is enabled and ECS_POLLING_METRICS_WAIT_DURATION is set to a high value #3863
Conversation
Force-pushed from 834fd27 to 206f3dd
I am a little unclear here: does this relate to ECS_RESERVED_MEMORY?
No, it's a different config in the task definition: the memory limit of the task/container. This data is collected in metrics to show the theoretical limit of utilization.
I see, is it the task-level memory limit? And the total memory reserved using it, or something similar to that? In that case, I wonder if this could be rephrased to make it distinct from the 'other' Reserved Memory? (non-blocking)
In hindsight this does make it clearer.
Total memory does not use it.
I will change all "memory reserved" to "memory reservation value in metrics" to avoid confusion. That SGTY?
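For reference, the memory limit discussed above is the per-container (or task-level) memory setting in the task definition, distinct from the agent-level ECS_RESERVED_MEMORY setting; the metrics-side "memory reservation value" is aggregated from these limits. A minimal illustrative task definition snippet, with placeholder names and values:

```json
{
  "family": "example-task",
  "containerDefinitions": [
    {
      "name": "app",
      "image": "example-image:latest",
      "memory": 256,
      "memoryReservation": 128
    }
  ]
}
```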
0561aa0
Not able to trigger GPU integ tests. Force pushed.
Summary
Update readme for unstable reserved memory value reported when ECS_POLL_METRICS is enabled and ECS_POLLING_METRICS_WAIT_DURATION is set to a high value
Background: In an experiment (instance type m5.4xlarge, latest ECS-optimized AMI), AL2023 showed no issue with ECS_POLLING_METRICS_WAIT_DURATION = 20s, while on AL2 the reported reserved memory value fluctuated when ECS_POLLING_METRICS_WAIT_DURATION was set to 19s or 20s. The issue only appears when a high number of tasks is running on the instance (500 tasks and 950 containers in the experiment); with lower task/container counts, the reserved memory value is not affected. This is likely due to Docker metrics response latency when the daemon is overloaded, which causes metrics to miss the 20s reporting window.
Considering the differing behavior across compute types and workloads, and the fact that we have a solid default value (10s), I decided not to change the agent's handling code, and instead updated the README to caution users about this issue under these specific circumstances.
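For reference, the settings involved live in the ECS agent configuration (typically /etc/ecs/ecs.config on EC2 container instances); a minimal illustrative example matching the setup described above:

```
# Enable task-level metrics polling.
ECS_POLL_METRICS=true
# Polling interval for metrics; the default is 10s. Values close to the
# 20s reporting window can cause the reported reserved memory value to
# fluctuate on heavily loaded instances, as described above.
ECS_POLLING_METRICS_WAIT_DURATION=20s
```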
Implementation details
Testing
New tests cover the changes: N/A
Description for the changelog
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.