
ROX-22557: Count expired centrals #1677

Merged
merged 4 commits into main on Feb 27, 2024

Conversation

@parametalol (Contributor) commented Feb 22, 2024

Description

This PR adds a new fleet-manager Prometheus metric showing the total number of expired centrals.
The idea is to create an alert on a spike of such expiration events.
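
For reference, a minimal sketch of how such a gauge could be defined and updated with the standard client_golang API; this is not necessarily the exact code merged here. The metric name, type, and help text are taken from the /metrics output in the Test manual below, while the variable and function names are illustrative.

    // Sketch only: defines the expired-centrals gauge and a helper to update it.
    package metrics

    import "github.com/prometheus/client_golang/prometheus"

    var expiredCentralsMetric = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "acs_fleet_manager_expired_centrals_count",
        Help: "total number of expired centrals",
    })

    func init() {
        prometheus.MustRegister(expiredCentralsMetric)
    }

    // UpdateExpiredCentralsMetric sets the gauge to the current number of expired centrals.
    func UpdateExpiredCentralsMetric(count int) {
        expiredCentralsMetric.Set(float64(count))
    }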

Checklist (Definition of Done)

  • Unit and integration tests added
  • Added test description under Test manual
  • Documentation added if necessary (i.e. changes to dev setup, test execution, ...)
  • CI and all relevant tests are passing
  • Add the ticket number to the PR title if available, i.e. ROX-12345: ...
  • Discussed security and business related topics privately. Will move any security and business related topics that arise to a private communication channel.
  • Add secret to app-interface Vault or Secrets Manager if necessary
  • RDS changes were e2e tested manually
  • Check AWS limits are reasonable for changes provisioning new resources
  • (If applicable) Changes to the dp-terraform Helm values have been reflected in the addon on integration environment

Test manual

Initial run:

$ curl -s http://localhost:8080/metrics | grep expired
# HELP acs_fleet_manager_expired_centrals_count total number of expired centrals
# TYPE acs_fleet_manager_expired_centrals_count gauge
acs_fleet_manager_expired_centrals_count 0
  1. Created an eval central locally, managed by quota-management-list;

  2. Patched quota-management-list config to allow 0 instances;

  3. Restarted fleet-manager to trigger the expiration worker;

  4. Checked the fleet-manager metrics:

    curl -s http://localhost:8080/metrics | grep expired
    # HELP acs_fleet_manager_expired_centrals_count total number of expired centrals
    # TYPE acs_fleet_manager_expired_centrals_count gauge
    acs_fleet_manager_expired_centrals_count 1

openshift-ci bot commented Feb 22, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@@ -326,6 +330,20 @@ func UpdateCentralPerClusterCountMetric(clusterID string, clusterExternalID stri
centralPerClusterCountMetric.With(labels).Set(float64(count))
}

// create a new counterVec for when expiration timestamp is set to a central
var centralExpirationSetCountMetric = prometheus.NewCounterVec(
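
The review context above is truncated after the opening line. As a hedged sketch, the full declaration might have looked roughly like the following; the metric name, help text, and label are assumptions rather than code from the PR, with the per-central label inferred from the cardinality discussion further down.

    // Assumed shape of the counter vector; the name and label below are illustrative only.
    var centralExpirationSetCountMetric = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "acs_fleet_manager_central_expiration_set_count",
            Help: "number of times an expiration timestamp was set on a central",
        },
        []string{"central_id"}, // one time series per central instance
    )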
Contributor

Why do you use a counter instead of a gauge? A counter will not decrease if instances are "un-expired" again.

@parametalol (Contributor Author)

Because this metric does not count the number of expired instances.

@parametalol (Contributor Author) Feb 27, 2024

Now that's a different metric.

@stehessel (Contributor) left a comment

Are you trying to define a metric for how many instances are expired in general? Or a metric that tracks the expired status of instances individually (presumably to detect flapping)?

@parametalol (Contributor Author)

Are you trying to define a metric for how many instances are expired in general? Or a metric that tracks the expired status of instances individually (presumably to detect flapping)?

@stehessel, the plan is to define an alert on high rate of such expiration events, to spot potential mass extinction. The value of the metric doesn't tell much.

Implementing a metric that counts the actual number of expired instances would require a special process to periodically query the DB specifically for that. Would you suggest that instead?
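
For illustration, a sketch of what such a periodic query could look like; the table and column names are assumptions, only the idea of a non-null expired-at timestamp is taken from the PR's original title, and UpdateExpiredCentralsMetric refers to the helper sketched in the description above.

    // Hypothetical periodic job: count centrals whose expired_at is set and publish the gauge.
    // Assumes "context" and "database/sql" are imported; table/column names are illustrative.
    func reportExpiredCentrals(ctx context.Context, db *sql.DB) error {
        var count int
        err := db.QueryRowContext(ctx,
            "SELECT COUNT(*) FROM central_requests WHERE expired_at IS NOT NULL",
        ).Scan(&count)
        if err != nil {
            return err
        }
        metrics.UpdateExpiredCentralsMetric(count)
        return nil
    }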

@parametalol (Contributor Author)

/retest

@parametalol (Contributor Author)

@stehessel, I changed the code to count the total number of expired centrals. This might make more sense indeed.

@stehessel (Contributor)

Implementing a metric that counts the actual number of expired instances would require a special process to periodically query the DB specifically for that. Would you suggest that instead?

Yes, from the metrics perspective this makes more sense imo. The issue with the previous approach is that the metric would have high cardinality: one time series per instance, plus probe instances, will generate a lot of metric series.

openshift-ci bot commented Feb 27, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: 0x656b694d, stehessel
Once this PR has been reviewed and has the lgtm label, please assign kovayur for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@parametalol merged commit 106155f into main on Feb 27, 2024
8 checks passed
@parametalol deleted the michael/ROX-22557-alert-expiration-spike branch on February 27, 2024 14:02
@parametalol changed the title from "ROX-22557: Count expired-at non-null changes" to "ROX-22557: Count expired centrals" on Feb 27, 2024