-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ROX-22557: Count expired centrals #1677
Conversation
Skipping CI for Draft Pull Request. |
pkg/metrics/metrics.go
Outdated
@@ -326,6 +330,20 @@ func UpdateCentralPerClusterCountMetric(clusterID string, clusterExternalID stri | |||
centralPerClusterCountMetric.With(labels).Set(float64(count)) | |||
} | |||
|
|||
// create a new counterVec for when expiration timestamp is set to a central | |||
var centralExpirationSetCountMetric = prometheus.NewCounterVec( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you use a counter instead of a gauge? A counter will not decrease if instances are "un-expired" again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because this metric does not count the number of expired instances.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that's a different metric.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you trying to define a metric for how many instances are expired in general? Or a metric that tracks the expired status of instances individually (presumably to detect flapping)?
@stehessel, the plan is to define an alert on high rate of such expiration events, to spot potential mass extinction. The value of the metric doesn't tell much. Implementing a metric, that counts the actual number of expired instances, would require a special process to periodically query DB specifically for that. Would you suggest that instead? |
/retest |
@stehessel, I changed the code to count the total number of expired centrals. This might make more sense indeed. |
Yes from the metrics perspective this makes more sense imo. The issue with the previous approach is that the metrics will have a high cardinality. One time series per instance + probe instances will generate a lot of metric series. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: 0x656b694d, stehessel The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
expired-at
non-null changes
Description
A new Fleet-manager Prometheus metric, showing the total number of expired centrals.
The idea is to create an alert on a spike of such events.
Checklist (Definition of Done)
Test manual
ROX-12345: ...
Test manual
Initial run:
Created locally an eval central, managed by quota-management-list;
Patched quota-management-list config to allow 0 instances;
Restarted fleet-manager to trigger expiration worker;
Checked the fleet-manager metrics: