Adding Statuscode Handler to Agent Health Extension #1423

Open
wants to merge 43 commits into base: main

Contributor

@Paramadon Paramadon commented Nov 12, 2024


Description of the Issue

The CloudWatch Agent lacked an agenthealth configuration to monitor API health by tracking HTTP status codes for specific responses. Monitoring the status codes (200, 400, 408, 419, and 429) across all APIs is critical for diagnosing issues and ensuring comprehensive observability of API behaviors. Without this configuration, users were unable to quickly identify trends in API health metrics or correlate specific status codes with performance issues.


Changes Made

  • Added agenthealth Configuration:

    • Configured the CloudWatch Agent to track the following status codes across all APIs:
      • 200: Success
      • 400: Bad Request
      • 408: Request Timeout
      • 419: Authentication Timeout
      • 429: Too Many Requests
    • This ensures that health metrics for these common response codes are captured and reported.
  • Updated CloudWatch Agent Configuration JSON:

    • Introduced a new section under the metrics collection configuration to specify agenthealth metrics.
    • Ensured compatibility with existing metrics and logs collection by integrating the new configuration seamlessly.
  • Validated the Configuration:

    • Tested the updated configuration with the CloudWatch Agent to ensure the new metrics are collected and reported correctly.
    • Confirmed that the metrics appear as expected in CloudWatch for all specified APIs.
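
The counting described in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `trackedCodes`, `bucketIndex`, and `record` are hypothetical names, and the five-slot array layout is an assumption based on the `[5]int` that appears in the review comments further down.

```go
package main

import "fmt"

// trackedCodes lists the status codes named in the description, in the
// order assumed here for the per-operation counter array (hypothetical).
var trackedCodes = [5]int{200, 400, 408, 419, 429}

// bucketIndex returns the counter slot for a status code, or false when
// the code is not one of the tracked values.
func bucketIndex(statusCode int) (int, bool) {
	for i, code := range trackedCodes {
		if code == statusCode {
			return i, true
		}
	}
	return 0, false
}

// record increments the slot for statusCode and ignores untracked codes.
func record(counts *[5]int, statusCode int) {
	if i, ok := bucketIndex(statusCode); ok {
		counts[i]++
	}
}

func main() {
	var counts [5]int
	for _, code := range []int{200, 200, 429, 500} {
		record(&counts, code)
	}
	fmt.Println(counts) // prints [2 0 0 0 1]; the 500 is ignored
}
```

Untracked codes are dropped silently in this sketch; whether the real handler should count them in a catch-all slot is a separate design question.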

Impact

  • Provides real-time visibility into API status codes, helping identify anomalies or patterns that indicate issues.
  • Enhances observability for critical services by adding detailed health metrics.

Testing

  • Deployed the updated CloudWatch Agent health configuration.
  • Verified that metrics for status codes 200, 400, 408, 419, and 429 are captured in the header, as shown in the screenshot below:
[Image: Screenshot 2024-11-15 at 4 36 21 PM]

@Paramadon Paramadon requested a review from a team as a code owner November 12, 2024 20:47
@Paramadon Paramadon changed the title fixing issue Adding Agent Health Extension Nov 15, 2024
@Paramadon Paramadon changed the title Adding Agent Health Extension Adding Agent Health Extension Statuscode handler Nov 19, 2024
@Paramadon Paramadon changed the title Adding Agent Health Extension Statuscode handler Adding Agent Health Extension Statuscode Handler Nov 19, 2024
@Paramadon Paramadon changed the title Adding Agent Health Extension Statuscode Handler Adding Statuscode Handler to Agent Health Extension Nov 19, 2024
Contributor

@mitali-salvi mitali-salvi left a comment

Could you fix the linting issues?

@@ -12,6 +12,7 @@ import (
type Config struct {
	IsUsageDataEnabled bool `mapstructure:"is_usage_data_enabled"`
	Stats agent.StatsConfig `mapstructure:"stats"`
	StatusCodeOnly *bool `mapstructure:"is_status_code_only,omitempty"`
Contributor

Nitpick: this can be renamed to is_only_status_code_enabled.
Also, to stay consistent with IsUsageDataEnabled, name the field IsOnlyStatusCodeEnabled.

Contributor

Also, any reason to make it a pointer?

Contributor

A bool with omitempty in the tag is enough.

Contributor

Agreed with mitali. Instead of a pointer, can we treat the default value as false?

Additionally, can you explain a bit more what this field is used for? It seems like a way to set the agent health extension such that it turns the other things off. I think that may be confusing because you could set IsUsageDataEnabled to true and also set StatusCodeOnly to true, so it's not clear what it would actually do (without looking at the implementation).

Contributor Author

@Paramadon Paramadon Nov 21, 2024

That makes sense. I will rename it. The pointer is used so we can emit it even if it's empty (nil). As for its functionality, I can add a comment, but the main purpose is to ensure that we only use the statuscodehandler when calling the agenthealthextension for a processor, exporter, or receiver, without including other handlers like processstats, clientstats, etc.
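
To make the pointer question concrete, here is a small sketch comparing the two shapes. The type and method names are hypothetical, and it deliberately leaves out mapstructure decoding. It only shows that when nil and false are both treated as "off", a plain bool with a false default behaves the same as the pointer.

```go
package main

import "fmt"

// pointerConfig mirrors the field as first proposed: nil means "not set".
type pointerConfig struct {
	StatusCodeOnly *bool
}

// plainConfig mirrors the reviewers' suggestion: the zero value (false)
// acts as the default.
type plainConfig struct {
	IsOnlyStatusCodeEnabled bool
}

// enabled treats nil and false alike, so the extra "unset" state carried
// by the pointer is never observed by callers of this method.
func (c pointerConfig) enabled() bool {
	return c.StatusCodeOnly != nil && *c.StatusCodeOnly
}

func main() {
	on := true
	fmt.Println(pointerConfig{}.enabled(), pointerConfig{StatusCodeOnly: &on}.enabled())
	fmt.Println(plainConfig{}.IsOnlyStatusCodeEnabled, plainConfig{IsOnlyStatusCodeEnabled: true}.IsOnlyStatusCodeEnabled)
	// both lines print: false true
}
```

A pointer would only be worth keeping if some code path needs to tell "explicitly false" apart from "absent", for example when serializing the config back out.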

extension/agenthealth/handler/stats/agent/agent.go (outdated, resolved)
extension/agenthealth/handler/stats/agent/agent.go (outdated, resolved)
extension/agenthealth/handler/stats/agent/agent.go (outdated, resolved)
extension/agenthealth/handler/stats/provider/interval.go (outdated, resolved)
extension/agenthealth/handler/stats/agent/agent_test.go (outdated, resolved)
extension/agenthealth/handler/stats/agent/agent.go (outdated, resolved)
extension/agenthealth/handler/stats/handler.go (outdated, resolved)
extension/agenthealth/handler/stats/provider/statuscode.go (outdated, resolved)
extension/agenthealth/handler/stats/provider/statuscode.go (outdated, resolved)
internal/ecsservicediscovery/containerinstanceprocessor.go (outdated, resolved)
extension/agenthealth/handler/stats/provider/statuscode.go (outdated, resolved)
cmd/amazon-cloudwatch-agent/amazon-cloudwatch-agent.go (outdated, resolved)
extension/agenthealth/handler/stats/provider/interval.go (outdated, resolved)
extension/agenthealth/handler/stats/provider/interval.go (outdated, resolved)
extension/agenthealth/handler/stats/agent/agent.go (outdated, resolved)
@@ -12,6 +12,7 @@ import (
type Config struct {
Contributor

IMO status code stats should have to be enabled. agenthealth/metrics and agenthealth/traces won't have any allowlisted operations, but are still going to have the handlers created and attached.

Contributor Author

I agree with this, and I have changed it so that status codes have to be enabled.
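
A rough sketch of what opt-in could look like, so that extension instances without allowlisted operations never get a status-code handler attached. Everything here is hypothetical: the `config` fields, the `enabledHandlers` function, and the handler labels are illustrative and are not taken from the PR.

```go
package main

import "fmt"

// config is a hypothetical stand-in for the extension's Config struct.
type config struct {
	IsUsageDataEnabled  bool
	IsStatusCodeEnabled bool // assumed opt-in flag, false by default
}

// enabledHandlers returns labels for the handlers that would be built.
// The status-code handler is created only when explicitly enabled.
func enabledHandlers(cfg config) []string {
	var names []string
	if cfg.IsUsageDataEnabled {
		names = append(names, "usage-data")
	}
	if cfg.IsStatusCodeEnabled {
		names = append(names, "status-code")
	}
	return names
}

func main() {
	// An instance that never opted in gets no status-code handler.
	fmt.Println(enabledHandlers(config{IsUsageDataEnabled: true}))
	// An instance that opted in gets both.
	fmt.Println(enabledHandlers(config{IsUsageDataEnabled: true, IsStatusCodeEnabled: true}))
}
```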

extension/agenthealth/handler/stats/agent/agent.go (outdated, resolved)
extension/agenthealth/handler/stats/provider/statuscode.go (outdated, resolved)
extension/agenthealth/handler/stats/provider/statuscode.go (outdated, resolved)
extension/agenthealth/handler/stats/provider/statuscode.go (outdated, resolved)
Comment on lines +78 to +86
value, loaded := h.statsByOperation.LoadOrStore(operation, &[5]int{})
if !loaded {
	log.Printf("Initializing stats for operation: %s", operation)
}
stats := value.(*[5]int)

h.updateStatusCodeCount(stats, statusCode, operation)

h.statsByOperation.Store(operation, stats)
Contributor

Hmm, these lines consist of a read-modify-write operation. Two threads running this at the same time could clobber each other's updates. Only the latter thread will have its updates applied. As an example:

  1. Thread A and B are both handling an HTTP response for the same operation. HTTP response is 200 for both threads.
  2. Thread A loads current values, let's say stats[0] = 10
  3. Thread B loads current values, also sees stats[0] = 10
  4. Both threads now have a copy of the stats array
  5. B adds 1 to its copy, stats[0] = 11
  6. B stores its result
  7. A adds 1 to its copy, stats[0] = 11
  8. A stores its result

We end up with stats[0] = 11 for the operation instead of stats[0] = 12. Perhaps it's not actually an issue, though. Is it possible that we could be handling more than one HTTP response for the same operation at once? If not, I wouldn't worry about it. If we don't know, or if it doesn't happen now but could in the future, we should handle it.

5 participants