Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

V1.40.0 stage #2469

Closed
wants to merge 23 commits into from
Closed

V1.40.0 stage #2469

wants to merge 23 commits into from

Conversation

mssrivas
Copy link
Contributor

@mssrivas mssrivas commented Jun 2, 2020

Summary

Implementation details

Testing

New tests cover the changes:

Description for the changelog

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

amazon-ecs-bot and others added 21 commits April 5, 2020 17:21
* Timeout and unit tests for agent healthcheck

* integ test cleanup wait duration and expand terminal reason log msg

* add 'type' to DriverOpts
* backoff container stats collection when fails

* make gogenerate

* fix unit tests and tweak timeout/backoff

Extend inactivity timeout

fix missing return statement in metric polling

* address PR review comments
1. normalize the container name/id logging format when handling
container state change
2. log the full container event as a debug message

logs will go from this:

level=info time=2020-05-02T15:39:55Z msg="Managed task [arn:aws:ecs:us-east-1:039193556298:task/3371e240-6285-4d09-b4f5-500587e9d0cf]: waiting for event for task" module=task_manager.go
level=info time=2020-05-02T15:39:55Z msg="Managed task [arn:aws:ecs:us-east-1:039193556298:task/3371e240-6285-4d09-b4f5-500587e9d0cf]: got container [dd (Runtime ID: 35a25c7e69d3e6a4b1833d70157ba666032b3294045abe2c993dd3744c72149d)] event: [RUNNING]" module=task_manager.go
level=info time=2020-05-02T15:39:55Z msg="Managed task [arn:aws:ecs:us-east-1:039193556298:task/3371e240-6285-4d09-b4f5-500587e9d0cf]: handling container change [{RUNNING {35a25c7e69d3e6a4b1833d70157ba666032b3294045abe2c993dd3744c72149d <nil> [] <nil> [] map[com.amazonaws.ecs.cluster:auto-qjkn com.amazonaws.ecs.container-name:dd com.amazonaws.ecs.task-arn:arn:aws:ecs:us-east-1:039193556298:task/3371e240-6285-4d09-b4f5-500587e9d0cf com.amazonaws.ecs.task-definition-family:dd com.amazonaws.ecs.task-definition-version:9] 2020-05-02 14:59:33.688910738 +0000 UTC 2020-05-02 14:59:55.013436083 +0000 UTC 0001-01-01 00:00:00 +0000 UTC {UNKNOWN <nil> 0 } default 0xc000aca300} ContainerStatusChangeEvent}] for container [dd (Runtime ID: 35a25c7e69d3e6a4b1833d70157ba666032b3294045abe2c993dd3744c72149d)]" module=task_manager.go
level=info time=2020-05-02T15:39:55Z msg="Managed task [arn:aws:ecs:us-east-1:039193556298:task/3371e240-6285-4d09-b4f5-500587e9d0cf]: redundant container state change. dd to RUNNING, but already RUNNING" module=task_manager.go
level=info time=2020-05-02T15:39:55Z msg="Managed task [arn:aws:ecs:us-east-1:039193556298:task/3371e240-6285-4d09-b4f5-500587e9d0cf]: task at steady state: RUNNING" module=task_manager.go
level=info time=2020-05-02T15:39:55Z msg="Managed task [arn:aws:ecs:us-east-1:039193556298:task/3371e240-6285-4d09-b4f5-500587e9d0cf]: waiting for event for task" module=task_manager.go

to this:

level=info time=2020-05-02T16:07:58Z msg="Managed task [arn:aws:ecs:us-east-1:039193556298:task/d46739f3-4b08-4660-ab50-97ee9d0ccdc6]: waiting for event for task" module=task_manager.go
level=info time=2020-05-02T16:07:59Z msg="Managed task [arn:aws:ecs:us-east-1:039193556298:task/d46739f3-4b08-4660-ab50-97ee9d0ccdc6]: Container [name=dd runtimeID=0e955bcf4c289034005f4b4032045030b80a0e570152158ccd29c7eb095556ea]: handling container change event [RUNNING]" module=task_manager.go
level=info time=2020-05-02T16:07:59Z msg="Managed task [arn:aws:ecs:us-east-1:039193556298:task/d46739f3-4b08-4660-ab50-97ee9d0ccdc6]: Container [name=dd runtimeID=0e955bcf4c289034005f4b4032045030b80a0e570152158ccd29c7eb095556ea]: container change RUNNING->RUNNING is redundant" module=task_manager.go
level=info time=2020-05-02T16:07:59Z msg="Managed task [arn:aws:ecs:us-east-1:039193556298:task/d46739f3-4b08-4660-ab50-97ee9d0ccdc6]: task at steady state: RUNNING" module=task_manager.go
level=info time=2020-05-02T16:07:59Z msg="Managed task [arn:aws:ecs:us-east-1:039193556298:task/d46739f3-4b08-4660-ab50-97ee9d0ccdc6]: waiting for event for task" module=task_manager.go
Implications of this change:
1. During one TACS publishing interval, we previously would only send
metrics to TACS if we received two separate stats from docker. Going
forward, if we only received a single stat we will compare it to a stat
from a previous interval in order to get the usage.
2. This also allows the user to configure the stats collection interval
to be once per TACS publish interval, further improving CPU usage.

Fix unit tests
always verify that the task is stopped at the end of every test, in case
unrelated state change events are received that may trigger a test to
exit before it has been cleaned up.

also get a new port for each test to avoid potential collisions.
* Change default docker metric gathering behavior

1. Change default docker metric gathering behavior from streaming
metrics to polling.
2. Change the default polling interval to half of the TACS publishing
interval (currently 20s), so that every publish interval we have two
docker metrics.
3. Change the minimum polling interval to 5s to prevent customers from
configuring polling to be just as resource-intensive as streaming
metrics.

These changes are being made because we have found that docker streaming
stats consumes considerable resources from the agent, dockerd daemon, and
containerd daemon.

* Task stats endpoint needs to be populated immediately

to avoid changing behavior from streaming stats, we need to populate the
stats endpoint immediately when the stats engine starts. So instead of
jittering the first stats gather we need to just do it immediately.

* Pinning windows images (ltsc201*) to known working version

5/12 release broke our tests
These response bodies need to be closed in order to reuse the underlying
tcp connections in Go.

I have also changed the healthcheck GET request to a HEAD request
because it doesn't have any use for the data contained in the GET, so
might as well not get it in the first place.

wrote a small benchmark test to show the improvement in time and
allocations with this change:

dev branch:
$ go test ./agent/app/. -bench=Healthcheck -benchmem
goos: darwin
goarch: amd64
pkg: github.com/aws/amazon-ecs-agent/agent/app
BenchmarkHealthcheck-4              3280            782405 ns/op           16996 B/op        119 allocs/op
PASS
ok      github.com/aws/amazon-ecs-agent/agent/app       5.393s

this branch:
$ go test ./agent/app/. -bench=Healthcheck -benchmem
goos: darwin
goarch: amd64
pkg: github.com/aws/amazon-ecs-agent/agent/app
BenchmarkHealthcheck-4             15004             75400 ns/op            4303 B/op         55 allocs/op
PASS
ok      github.com/aws/amazon-ecs-agent/agent/app       3.980s
* Change asserts to require in integ tests

* Copy all test tasks in integ tests to avoid data races

* image manager integ test logging fix

* skipping windows image cleanup test because of hcsshim bug
* Remove unused code found via staticcheck tool

* Add staticcheck unused code check to CI
@mssrivas mssrivas added this to the 1.40.0 milestone Jun 2, 2020
@mssrivas mssrivas requested a review from a team June 2, 2020 18:54
CHANGELOG.md Outdated
@@ -1,5 +1,9 @@
# Changelog

## 1.40.0
* Bug - Register custom logger before it gets used to ensure that the formatter is initiated before it is loaded [#2438](https://github.com/aws/amazon-ecs-agent/pull/2438)
* Enhancement - adds a jitter to this so that we don't query docker for every container's state all at the same time [#2444] (https://github.com/aws/amazon-ecs-agent/pull/2444)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra space after [#2444] causing link not rendered;

nit: put enhancement item above bug

@@ -1,5 +1,9 @@
# Changelog

## 1.40.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should have a changelog for #2452

@mssrivas mssrivas closed this Jun 2, 2020
@mssrivas mssrivas deleted the v1.40.0-stage branch June 2, 2020 21:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants