
feat: add cluster_id to metricProvisionFailedTerminal #1953

Conversation

boranx
Member

@boranx boranx commented Feb 3, 2023

Fix: https://issues.redhat.com/browse/OSD-14586

This adds cluster_id to hive_cluster_deployments_provision_failed_terminal_total to establish a relation between this metric and network_verifier_runs for deeper investigations and sanity checks.
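Roughly, the change amounts to something like the following client_golang sketch; the variable name, help text, and companion labels here are illustrative assumptions, not hive's actual metric definition:

package hivemetrics

import "github.com/prometheus/client_golang/prometheus"

// Illustrative only: hive's real metric lives in its metrics package; the
// variable name and the set of pre-existing labels are assumptions.
var metricProvisionFailedTerminal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "hive_cluster_deployments_provision_failed_terminal_total",
		Help: "Clusters that failed to provision and reached the terminal failure state.",
	},
	// "cluster_id" is the label this PR proposes to add; the other labels
	// mirror the ones discussed later in the thread.
	[]string{"clusterpool_namespacedname", "cluster_type", "failure_reason", "cluster_id"},
)

func recordProvisionFailedTerminal(poolNSName, clusterType, reason, clusterID string) {
	// Every distinct combination of label values becomes its own time series.
	metricProvisionFailedTerminal.WithLabelValues(poolNSName, clusterType, reason, clusterID).Inc()
}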

@openshift-ci openshift-ci bot requested review from 2uasimojo and suhanime February 3, 2023 12:41
@openshift-ci
Contributor

openshift-ci bot commented Feb 3, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: boranx
Once this PR has been reviewed and has the lgtm label, please assign dlom for approval by writing /assign @dlom in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Contributor

openshift-ci bot commented Feb 3, 2023

@boranx: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name         Commit   Required  Rerun command
ci/prow/unit      1e65290  true      /test unit
ci/prow/images    1e65290  true      /test images
ci/prow/coverage  1e65290  true      /test coverage
ci/prow/e2e-pool  1e65290  true      /test e2e-pool
ci/prow/e2e       1e65290  true      /test e2e
ci/prow/verify    1e65290  true      /test verify

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@suhanime
Contributor

suhanime commented Feb 3, 2023

@boranx We have a feature underway that will let you do what you're aiming for with this PR. However, we do not recommend adding labels for cluster ID, name, or namespace: hive manages too many resources, so such labels are bound to explode cardinality and overwhelm the Prometheus servers.

@boranx
Member Author

boranx commented Feb 6, 2023

Nice! So after #1925 is merged, will we be able to query those at cluster level (using cluster-id)?
My two cents from the cardinality perspective: I'd agree that this shouldn't happen here, as it's not hive's responsibility... On the other hand, we need hive install-byovpc results to cross-check verifier metrics against actual byovpc results.
Although this kind of one-off addition looks trivial next to designing a generic solution, I'd respectfully disagree that adding cluster-id could overwhelm the p8s servers, since a cluster-id value is roughly no larger than a reason value in terms of string size.
Plus it's only emitted per failed installation for a cluster, so as long as there aren't millions of failing installation requests at once, adding just cluster-id shouldn't be a performance issue beyond a couple of additional bytes.

@suhanime
Contributor

suhanime commented Feb 6, 2023

So after #1925 is merged, will we be able to query those at cluster level (using cluster-id)?

Yes - after the changes merge, you should be able to query by any cluster deployment labels you configure HiveConfig to look for.

I'd respectfully disagree that adding cluster-id could overwhelm the p8s servers, since a cluster-id value is roughly no larger than a reason value in terms of string size.
Plus it's only emitted per failed installation for a cluster, so as long as there aren't millions of failing installation requests at once, adding just cluster-id shouldn't be a performance issue beyond a couple of additional bytes.

The reasoning goes more like this: prometheus client libraries maintain a map for each metric, keyed by the combination of values of that metric's labels. So while cluster_id itself might need little storage, each new value adds that many more entries; the number of series grows multiplicatively with the cardinality of every label.
So as a generic solution we suggest other ways of identifying a cluster, but we understand it might be necessary in certain cases for some users, hence we have some opt-in metric config features - 1925 being the latest one.
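To make the map-of-label-values point concrete, here is a minimal client_golang sketch; the metric name and label values are made up for illustration:

package example

import "github.com/prometheus/client_golang/prometheus"

// Illustrative counter vector; metric and label values are made up.
var provisionFailures = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "example_provision_failed_total",
		Help: "Example counter labelled by failure reason and cluster id.",
	},
	[]string{"failure_reason", "cluster_id"},
)

func observeSomeFailures() {
	// Each unique (failure_reason, cluster_id) pair gets its own child in the
	// vector's internal map, and every child is exported on every scrape from
	// the moment it is created.
	provisionFailures.WithLabelValues("AWSVPCLimitExceeded", "cluster-aaa").Inc()
	provisionFailures.WithLabelValues("AWSVPCLimitExceeded", "cluster-bbb").Inc() // second series
	provisionFailures.WithLabelValues("DNSNotReady", "cluster-aaa").Inc()         // third series
}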

@2uasimojo
Member

So after #1925 is merged, will we be able to query those at cluster level (using cluster-id)?

You would theoretically be able to configure hive to report the metrics that way, yes. But as @suhanime said, you should not create metric labels with per-cluster values. More below:

I'd respectfully disagree that adding cluster-id could overwhelm the p8s servers, since a cluster-id value is roughly no larger than a reason value in terms of string size.

It's not about string size, it's about the number of unique values the label can take. As it has been explained to me: you can think of each unique value as adding a whole column to the prom db table that stores the metric. If we have, say, two dozen reason codes, that's two dozen columns -- and that's it. But if we label by cluster ID, now every time we observe for a new cluster, we're adding a column. Those columns do eventually get purged if no metrics are observed with those label values for some amount of time (retention period). But in the meantime they're sitting around taking up (lots of) space. More...
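On the client side the picture looks roughly like this (illustrative names, not hive's actual code): once a child series exists for a given cluster_id, the process keeps exporting it on every scrape until something explicitly deletes it; only then does the corresponding server-side series go stale and eventually age out of retention.

package example

import "github.com/prometheus/client_golang/prometheus"

// Illustrative only: an example counter keyed by cluster id.
var failedTerminal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "example_provision_failed_terminal_total",
		Help: "Example counter labelled by cluster id.",
	},
	[]string{"cluster_id"},
)

// forgetCluster shows the only way the client stops exporting a per-cluster
// series: an explicit delete. Until then, the child created for that
// cluster_id is sent on every scrape even if the cluster is long gone.
func forgetCluster(clusterID string) {
	failedTerminal.DeleteLabelValues(clusterID)
}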

Plus it's only emitted per failed installation for a cluster, so as long as there aren't millions of failing installation requests at once, adding just cluster-id shouldn't be a performance issue beyond a couple of additional bytes.

Ultimately this is going to be up to you. If you're confident that the metric in question is going to generate and purge label values at a reasonable rate, we're giving you the ability to label by whatever you like. But I would strongly recommend scrutinizing every case very carefully, with oversight from someone who understands this cardinality issue deeply.

we need hive install-byovpc results to cross-check verifier metrics against actual byovpc results.

I don't understand this statement (lack of context). If you'd like to elaborate or meet to discuss, perhaps we can help find a solution that doesn't require per-cluster label values.

@boranx
Member Author

boranx commented Feb 7, 2023

It's not about string size, it's about the number of unique values the label can take. As it has been explained to me: you can think of each unique value as adding a whole column to the prom db table that stores the metric. If we have, say, two dozen reason codes, that's two dozen columns -- and that's it. But if we label by cluster ID, now every time we observe for a new cluster, we're adding a column. Those columns do eventually get purged if no metrics are observed with those label values for some amount of time (retention period). But in the meantime, they're sitting around taking up (lots of) space. More...

I agree that it'd take more space, but can't we also weigh the overhead against the benefit (ignoring the cardinality concerns)?
Prometheus doesn't add metrics by columns, iirc; it's a NoSQL time-series DB, and there can be unique values for each field too.
I thought this would only affect our query performance, not inserts.
(The overhead caused by adding a unique field can be handled by tweaking retention times, if that's the main concern.)

I don't understand this statement (lack of context). If you'd like to elaborate or meet to discuss, perhaps we can help find a solution that doesn't require per-cluster label values.

Yeah, I'd be happy to talk! I can schedule a meeting between you and our team to discuss alternative approaches, but it's summed up in https://issues.redhat.com/browse/OSD-14585.
tl;dr: uhc-cluster-service emits the metric network_verifier_runs for each cluster installation and stores the verifier result alongside a cluster ID (it can be found in app-sre-prod-04). Essentially, we were wondering whether it'd be possible to cross-check these metrics with hive results so we could filter for cases where osd-network-verifier failed but the hive installation succeeded, or onv succeeded but the hive installation failed.

@2uasimojo
Member

Prometheus doesn't add metrics by columns, iirc

Agree; I was using that as a means to visualize what happens when you add a unique value.

I'm not an expert on this, but here are a couple of articles that try to explain it:

https://grafana.com/blog/2022/02/15/what-are-cardinality-spikes-and-why-do-they-matter/
https://www.robustperception.io/cardinality-is-key/

My interpretation: the true number of "time series" for a given metric is the product of the number of unique values of all its labels. So in this case we have:

  • clusterpool_namespacedname -- always unspecified for you in OSD, since you don't use clusterpools, so we get a 1 here.
  • cluster_type -- looks like you have managed and managed-byoc, so 2 here.
  • failure_reason -- I'm not going to go count, but let's say there's like 10 of these.

So today you have 1 x 2 x 10 = 20 time series for this metric.

Now add the cluster-id as a label.

For every unique cluster where a failure is observed, you're adding one to that multiplier.

So if five clusters fail, you have 20 x 5 = 100 time series.
If 20 fail, you have 20 x 20 = 400 time series.

You can see that these numbers blow up very quickly. Multiply by the dozens of metrics we're tracking, and you'll quickly overwhelm the database.
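That arithmetic can be sanity-checked with a small, self-contained sketch (made-up label names and values, exercising every combination as the worst case) using client_golang's testutil helper:

package example

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

// demoCardinality exercises every label combination (the worst case) to show
// that the series count is the product of the label cardinalities. All names
// and values here are invented for the demonstration.
func demoCardinality() {
	failures := prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "demo_provision_failed_total", Help: "demo"},
		[]string{"cluster_type", "failure_reason", "cluster_id"},
	)
	clusterTypes := []string{"managed", "managed-byoc"}
	reasons := make([]string, 10)
	for i := range reasons {
		reasons[i] = fmt.Sprintf("reason-%02d", i)
	}
	for c := 0; c < 5; c++ { // five failing clusters
		id := fmt.Sprintf("cluster-%d", c)
		for _, t := range clusterTypes {
			for _, r := range reasons {
				failures.WithLabelValues(t, r, id).Inc()
			}
		}
	}
	// 2 cluster types x 10 reasons x 5 cluster ids = 100 distinct series.
	fmt.Println(testutil.CollectAndCount(failures)) // prints 100
}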


At this point I'll stop trying to convince you this is a bad idea, because ultimately it's in your hands once #1925 merges.


As to your specific problem, it's easy enough to run a report on a given shard to figure out which clusters have succeeded or failed. E.g. this gives you the cluster name in the first column and the ProvisionStopped condition status in the second:

oc get cd -A -o json | jq -r '.items[] | [.metadata.name, (.status.conditions[] | select(.type == "ProvisionStopped") | .status)] | @tsv'

It should be possible to merge this with a similar list from your network verifier thingy to get the information you're looking for.

@boranx
Member Author

boranx commented Feb 8, 2023

Yeah, I agree that hive_cluster_deployments_provision_failed_terminal_total is not the right place to add cluster_id; from the cardinality perspective, this PR can be closed.

As to your specific problem, it's easy enough to run a report on a given shard to figure out which clusters have succeeded or failed. E.g. this gives you the cluster name in the first column and the ProvisionStopped condition status in the second:

Since this is not recorded into a time series, there's no way to check past data, whereas a metric containing the cluster-id that is emitted per install result would persist until it's removed. As a user of hive, that'd be great to have, though.
I've read #1925 again; it seems having Managed VPC would help a lot too.

@boranx boranx closed this Feb 8, 2023