Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Always report service connect metrics when both health and task metrics are disabled #3786

Merged
merged 1 commit into from
Jul 13, 2023

Conversation

Realmonia
Copy link
Contributor

@Realmonia Realmonia commented Jul 6, 2023

Summary

Currently, when both ECS_DISABLE_METRICS and ECS_DISABLE_DOCKER_HEALTH_CHECK are set to true on an EC2 instance, we skip creating the metrics session to connect with TACS, and thus no service connect metrics will be published.

As agent does not know if there will be a SC task will be placed on this instance (besides instance capability; however, it is not useful in this case), and TACS session is created (or not created) at agent start, we have no way to only create connection as needed. Therefore, this PR removes the logic of skipping TACS session creation when both metrics and health check are disabled, i.e. will always create an connection even if there is no metrics reported.

Implementation details

This should be reviewed after #3743 is merged and rebased.

  • Remove the logic of skip creating metric session when both ECS_DISABLE_METRICS and ECS_DISABLE_DOCKER_HEALTH_CHECK are enabled
  • Modify some of the GetInstanceMetrics to assure no empty metrics message will reach the channel of TCSClient

Testing

New tests cover the changes: no.

Existing github testing workflow passed.

Manual testings:

  • Set both ECS_DISABLE_METRICS and ECS_DISABLE_DOCKER_HEALTH_CHECK to true on an EC2 instance, and enable debug log
  • Start a SC task on instance
  • See following logs in /var/log/ecs/ecs-agent.log
level=debug time=2023-07-06T00:34:03Z msg="publishMetricsTicker triggered. Sending telemetry messages to tcsClient through channel" module=engine.go
level=debug time=2023-07-06T00:34:03Z msg="sent health message" module=engine.go
level=debug time=2023-07-06T00:34:03Z msg="received health message in healthChannel"
level=debug time=2023-07-06T00:34:03Z msg="making publish health metrics request"
level=debug time=2023-07-06T00:34:03Z msg="Error getting container metrics for task: arn:aws:ecs:us-west-2:922673882694:task/default/bbb773ca3b5941ba9c461c4a5ad97cbb, err: task not found" module=engine.go
level=debug time=2023-07-06T00:34:03Z msg="Return empty metrics error" module=engine.go
level=warn time=2023-07-06T00:34:03Z msg="Error collecting task metrics: stats engine: no task metrics to report" module=engine.go
level=debug time=2023-07-06T00:34:03Z msg="Received message of type: AckPublishHealth"
level=debug time=2023-07-06T00:34:03Z msg="Received ACKPublishHealth from tcs"


level=debug time=2023-07-06T00:34:23Z msg="publishMetricsTicker triggered. Sending telemetry messages to tcsClient through channel" module=engine.go
level=debug time=2023-07-06T00:34:23Z msg="sent health message" module=engine.go
level=debug time=2023-07-06T00:34:23Z msg="received health message in healthChannel"
level=debug time=2023-07-06T00:34:23Z msg="making publish health metrics request"
level=debug time=2023-07-06T00:34:23Z msg="Error getting container metrics for task: arn:aws:ecs:us-west-2:922673882694:task/default/bbb773ca3b5941ba9c461c4a5ad97cbb, err: task not found" module=engine.go
level=debug time=2023-07-06T00:34:23Z msg="Return empty metrics error" module=engine.go
level=warn time=2023-07-06T00:34:23Z msg="Error collecting task metrics: stats engine: no task metrics to report" module=engine.go
level=debug time=2023-07-06T00:34:23Z msg="Received message of type: AckPublishHealth"
level=debug time=2023-07-06T00:34:23Z msg="Received ACKPublishHealth from tcs"


level=debug time=2023-07-06T00:34:43Z msg="publishMetricsTicker triggered. Sending telemetry messages to tcsClient through channel" module=engine.go
level=debug time=2023-07-06T00:34:43Z msg="service connect metrics included" module=engine.go
level=debug time=2023-07-06T00:34:43Z msg="sent health message" module=engine.go
level=debug time=2023-07-06T00:34:43Z msg="received health message in healthChannel"
level=debug time=2023-07-06T00:34:43Z msg="making publish health metrics request"
level=debug time=2023-07-06T00:34:43Z msg="Error getting container metrics for task: arn:aws:ecs:us-west-2:922673882694:task/default/bbb773ca3b5941ba9c461c4a5ad97cbb, err: task not found" module=engine.go
level=debug time=2023-07-06T00:34:43Z msg="Empty containerMetrics for task, ignoring, task: arn:aws:ecs:us-west-2:922673882694:task/default/bbb773ca3b5941ba9c461c4a5ad97cbb" module=engine.go
level=debug time=2023-07-06T00:34:43Z msg="Adding service connect stats for task : arn:aws:ecs:us-west-2:922673882694:task/default/bbb773ca3b5941ba9c461c4a5ad97cbb" module=engine.go
level=debug time=2023-07-06T00:34:43Z msg="Metrics: [{\n  ServiceConnectMetricsWrapper: [],\n  TaskArn: \"arn:aws:ecs:us-west-2:922673882694:task/default/bbb773ca3b5941ba9c461c4a5ad97cbb\",\n  TaskDefinitionFamily: \"sleeper10Container\",\n  TaskDefinitionVersion: \"6\"\n}]" module=engine.go
level=debug time=2023-07-06T00:34:43Z msg="sent telemetry message" module=engine.go
level=debug time=2023-07-06T00:34:43Z msg="received telemetry message in metricsChannel"
level=debug time=2023-07-06T00:34:43Z msg="making publish metrics request"
level=debug time=2023-07-06T00:34:43Z msg="Received message of type: AckPublishHealth"
level=debug time=2023-07-06T00:34:43Z msg="Received ACKPublishHealth from tcs"
level=debug time=2023-07-06T00:34:43Z msg="Received message of type: AckPublishMetric"
level=debug time=2023-07-06T00:34:43Z msg="Received AckPublishMetric from tcs"

SC metrics are published every 3 metrics publish interval (metrics publish interval 20s). We can see empty task metrics in first 2 publish metrics process, publishMetrics is skipped; and in 3rd one we see publish metrics request with only SC metrics.

Health metrics are always sent, and that is for container ecs-service-connect-HGeyfR, which is also expected even when we disabled the health checks.


Description for the changelog

Always report service connect metrics when both health and task metrics are disabled.

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@Realmonia Realmonia force-pushed the alwaysReportSCMetrics branch 3 times, most recently from dce1c9a to 922aea7 Compare July 6, 2023 18:48
@Realmonia Realmonia marked this pull request as ready for review July 10, 2023 18:45
@Realmonia Realmonia requested a review from a team as a code owner July 10, 2023 18:45
@Realmonia Realmonia merged commit 9999b1d into aws:dev Jul 13, 2023
@Realmonia Realmonia deleted the alwaysReportSCMetrics branch July 27, 2023 07:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants