
ROX-22557: Count expired centrals #1677

Merged
merged 4 commits into main on Feb 27, 2024

Conversation

@parametalol (Contributor) commented Feb 22, 2024

Description

This PR adds a new fleet-manager Prometheus metric showing the total number of expired centrals.
The idea is to create an alert on a spike of such expiration events.
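
For reference, a minimal sketch of how such a gauge could be defined and updated with the standard client_golang API; this is not necessarily the exact code merged here. The metric name, type, and help text are taken from the /metrics output in the Test manual below, while the variable and function names are illustrative.

    // Sketch only: defines the expired-centrals gauge and a helper to update it.
    package metrics

    import "github.com/prometheus/client_golang/prometheus"

    var expiredCentralsMetric = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "acs_fleet_manager_expired_centrals_count",
        Help: "total number of expired centrals",
    })

    func init() {
        prometheus.MustRegister(expiredCentralsMetric)
    }

    // UpdateExpiredCentralsMetric sets the gauge to the current number of expired centrals.
    func UpdateExpiredCentralsMetric(count int) {
        expiredCentralsMetric.Set(float64(count))
    }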

Checklist (Definition of Done)

  • Unit and integration tests added
  • Added test description under Test manual
  • Documentation added if necessary (i.e. changes to dev setup, test execution, ...)
  • CI and all relevant tests are passing
  • Add the ticket number to the PR title if available, i.e. ROX-12345: ...
  • Discussed security and business related topics privately. Will move any security and business related topics that arise to a private communication channel.
  • Add secret to app-interface Vault or Secrets Manager if necessary
  • RDS changes were e2e tested manually
  • Check AWS limits are reasonable for changes provisioning new resources
  • (If applicable) Changes to the dp-terraform Helm values have been reflected in the addon on integration environment

Test manual

Initial run:

$ curl -s http://localhost:8080/metrics | grep expired
# HELP acs_fleet_manager_expired_centrals_count total number of expired centrals
# TYPE acs_fleet_manager_expired_centrals_count gauge
acs_fleet_manager_expired_centrals_count 0
  1. Created an eval central locally, managed by quota-management-list;

  2. Patched quota-management-list config to allow 0 instances;

  3. Restarted fleet-manager to trigger the expiration worker;

  4. Checked the fleet-manager metrics:

    curl -s http://localhost:8080/metrics | grep expired
    # HELP acs_fleet_manager_expired_centrals_count total number of expired centrals
    # TYPE acs_fleet_manager_expired_centrals_count gauge
    acs_fleet_manager_expired_centrals_count 1

openshift-ci bot commented Feb 22, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@@ -326,6 +330,20 @@ func UpdateCentralPerClusterCountMetric(clusterID string, clusterExternalID stri
centralPerClusterCountMetric.With(labels).Set(float64(count))
}

// create a new counterVec for when expiration timestamp is set to a central
var centralExpirationSetCountMetric = prometheus.NewCounterVec(
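
The review context above is truncated after the opening line. As a hedged sketch, the full declaration might have looked roughly like the following; the metric name, help text, and label are assumptions rather than code from the PR, with the per-central label inferred from the cardinality discussion further down.

    // Assumed shape of the counter vector; the name and label below are illustrative only.
    var centralExpirationSetCountMetric = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "acs_fleet_manager_central_expiration_set_count",
            Help: "number of times an expiration timestamp was set on a central",
        },
        []string{"central_id"}, // one time series per central instance
    )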
Contributor

Why do you use a counter instead of a gauge? A counter will not decrease if instances are "un-expired" again.

@parametalol (Contributor Author)

Because this metric does not count the number of expired instances.

@parametalol (Contributor Author) Feb 27, 2024

Now that's a different metric.

@stehessel (Contributor) left a comment

Are you trying to define a metric for how many instances are expired in general? Or a metric that tracks the expired status of instances individually (presumably to detect flapping)?

@parametalol (Contributor Author)

Are you trying to define a metric for how many instances are expired in general? Or a metric that tracks the expired status of instances individually (presumably to detect flapping)?

@stehessel, the plan is to define an alert on high rate of such expiration events, to spot potential mass extinction. The value of the metric doesn't tell much.

Implementing a metric that counts the actual number of expired instances would require a special process to periodically query the DB specifically for that. Would you suggest that instead?
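
For illustration, a sketch of what such a periodic query could look like; the table and column names are assumptions, only the idea of a non-null expired-at timestamp is taken from the PR's original title, and UpdateExpiredCentralsMetric refers to the helper sketched in the description above.

    // Hypothetical periodic job: count centrals whose expired_at is set and publish the gauge.
    // Assumes "context" and "database/sql" are imported; table/column names are illustrative.
    func reportExpiredCentrals(ctx context.Context, db *sql.DB) error {
        var count int
        err := db.QueryRowContext(ctx,
            "SELECT COUNT(*) FROM central_requests WHERE expired_at IS NOT NULL",
        ).Scan(&count)
        if err != nil {
            return err
        }
        metrics.UpdateExpiredCentralsMetric(count)
        return nil
    }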

@parametalol (Contributor Author)

/retest

@parametalol (Contributor Author)

@stehessel, I changed the code to count the total number of expired centrals. This might make more sense indeed.

@stehessel (Contributor)

Implementing a metric that counts the actual number of expired instances would require a special process to periodically query the DB specifically for that. Would you suggest that instead?

Yes, from the metrics perspective this makes more sense imo. The issue with the previous approach is that the metric would have high cardinality: one time series per instance, plus probe instances, will generate a lot of metric series.

openshift-ci bot commented Feb 27, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: 0x656b694d, stehessel
Once this PR has been reviewed and has the lgtm label, please assign kovayur for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@parametalol merged commit 106155f into main on Feb 27, 2024
8 checks passed
@parametalol deleted the michael/ROX-22557-alert-expiration-spike branch on February 27, 2024 14:02
@parametalol changed the title from "ROX-22557: Count expired-at non-null changes" to "ROX-22557: Count expired centrals" on Feb 27, 2024