-
Notifications
You must be signed in to change notification settings - Fork 204
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding Statuscode Handler to Agent Health Extension #1423
base: main
Are you sure you want to change the base?
Conversation
…s from status codes
9465711
to
5dc40f4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you fix the linting issues ?
extension/agenthealth/config.go
Outdated
@@ -12,6 +12,7 @@ import ( | |||
type Config struct { | |||
IsUsageDataEnabled bool `mapstructure:"is_usage_data_enabled"` | |||
Stats agent.StatsConfig `mapstructure:"stats"` | |||
StatusCodeOnly *bool `mapstructure:"is_status_code_only,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick - can be renamed to is_only_status_code_enabled
also keep it consistent with IsUsageDataEnabled with IsOnlyStatusCodeEnabled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also any reason to make it a pointer ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bool with omitempty
in the tag is enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed with mitali. Instead of a pointer, can we treat the default value as false?
Additionally, can you explain a bit more what this field is used for? It seems like a way to set the agent health extension such that it turns the other things off. I think that may be confusing because you could set IsUsageDataEnabled
to true and also set StatusCodeOnly
to true, so it's not clear what it would actually do (without looking at the implementation).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. I will rename it. The pointer is used so we can emit it even if it's empty (nil). As for its functionality, I can add a comment, but the main purpose is to ensure that we only use the statuscodehandler when calling the agenthealthextension for a processor, exporter, or receiver, without including other handlers like processstats, clientstats, etc.
extension/agenthealth/config.go
Outdated
@@ -12,6 +12,7 @@ import ( | |||
type Config struct { | |||
IsUsageDataEnabled bool `mapstructure:"is_usage_data_enabled"` | |||
Stats agent.StatsConfig `mapstructure:"stats"` | |||
StatusCodeOnly *bool `mapstructure:"is_status_code_only,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bool with omitempty
in the tag is enough.
@@ -12,6 +12,7 @@ import ( | |||
type Config struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO status code stats should have to be enabled. agenthealth/metrics
and agenthealth/traces
won't have any allowlisted operations, but are still going to have the handlers created and attached.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with this and I changed it so that we would have to enable statuscodes
value, loaded := h.statsByOperation.LoadOrStore(operation, &[5]int{}) | ||
if !loaded { | ||
log.Printf("Initializing stats for operation: %s", operation) | ||
} | ||
stats := value.(*[5]int) | ||
|
||
h.updateStatusCodeCount(stats, statusCode, operation) | ||
|
||
h.statsByOperation.Store(operation, stats) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, these lines consist of a read-modify-write operation. Two threads running this at the same time could clobber each others updates. Only the latter thread will have it's updates applied. As an example:
- Thread A and B are both handling an HTTP response for the same operation. HTTP response is 200 for both threads.
- Thread A loads current values, let's say stats[0] = 10
- Thread B loads current values, also sees stats[0] = 10
- Both threads now have a copy stats array
- B adds 1 to its copy, stats[0] = 11
- B stores its result
- A adds 1 to it's copy, stats[0] = 11
- A stores its result
We end up with stats[0]=11 for the operation instead of stats[0]=12. Perhaps its not actually an issue though. Is it possible we could be handling more than one HTTP response for the same operation at once? If not, then I wouldn't worry about it. If we don't know, or if it doesn't happen now but could in the future, we should handle it.
Description of the Issue
The CloudWatch Agent lacked an
agenthealth
configuration to monitor API health by tracking HTTP status codes for specific responses. Monitoring the status codes (200
,400
,408
,419
, and429
) across all APIs is critical for diagnosing issues and ensuring comprehensive observability of API behaviors. Without this configuration, users were unable to quickly identify trends in API health metrics or correlate specific status codes with performance issues.Changes Made
Added
agenthealth
Configuration:200
: Success400
: Bad Request408
: Request Timeout419
: Authentication Timeout429
: Too Many RequestsUpdated CloudWatch Agent Configuration JSON:
agenthealth
metrics.Validated the Configuration:
Impact
Testing
200
,400
,408
,419
, and429
are captured in the header as you can see from the image below: