Update readme for unstable reserved memory value reported when ECS_POLL_METRICS is enabled and ECS_POLLING_METRICS_WAIT_DURATION is set to a high value #3863
Conversation
Force-pushed from 834fd27 to 206f3dd
I am a little unclear here: does this relate to ECS_RESERVED_MEMORY?
No, it's a different config in the task definition: the memory limit of the task/container. This data is collected in metrics to show the theoretical limit of utilization.
I see, is it the task-level memory limit? And the total memory reserved using it, or something similar to that? In that case, I wonder if this could be rephrased to make it distinct from the 'other' Reserved Memory? (non-blocking)
In hindsight this does make it clearer.
Total memory does not use it.
I will change all "memory reserved" to "memory reservation value in metrics" to avoid confusion. That SGTY?
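For reference, the memory limit discussed above is the per-container (or task-level) memory setting in the task definition, distinct from the agent-level ECS_RESERVED_MEMORY setting; the metrics-side "memory reservation value" is aggregated from these limits. A minimal illustrative task definition snippet, with placeholder names and values:

```json
{
  "family": "example-task",
  "containerDefinitions": [
    {
      "name": "app",
      "image": "example-image:latest",
      "memory": 256,
      "memoryReservation": 128
    }
  ]
}
```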
0561aa0
Not able to trigger GPU integ tests. Force pushed.
Summary
Update readme for unstable reserved memory value reported when ECS_POLL_METRICS is enabled and ECS_POLLING_METRICS_WAIT_DURATION is set to a high value
Background: In an experiment (instance type m5.4xlarge, latest ECS-optimized AMI), AL2023 showed no issue with ECS_POLLING_METRICS_WAIT_DURATION = 20s, while on AL2 the reported reserved memory value fluctuated when ECS_POLLING_METRICS_WAIT_DURATION was set to 19s or 20s. The issue only appears when a high number of tasks is running on the instance (500 tasks and 950 containers in the experiment); with lower task/container counts, the reserved memory value is not affected. This is likely due to Docker metrics response latency when the daemon is overloaded, which causes metrics to miss the 20s reporting window.
Considering the differing behavior across compute types and workloads, and the fact that we have a solid default value (10s), I decided not to change the agent's handling code, and instead updated the README to caution users about this issue under these specific circumstances.
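For reference, the settings involved live in the ECS agent configuration (typically /etc/ecs/ecs.config on EC2 container instances); a minimal illustrative example matching the setup described above:

```
# Enable task-level metrics polling.
ECS_POLL_METRICS=true
# Polling interval for metrics; the default is 10s. Values close to the
# 20s reporting window can cause the reported reserved memory value to
# fluctuate on heavily loaded instances, as described above.
ECS_POLLING_METRICS_WAIT_DURATION=20s
```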
Implementation details
Testing
New tests cover the changes: N/A
Description for the changelog
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.