feat: add cluster_id to metricProvisionFailedTerminal #1953
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: boranx. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@boranx: The following tests failed; say /retest to rerun all failed tests.
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
nice! so after #1925 is merged, are we able to query those at cluster-level (using cluster-id)?
Yes - after the changes merge, you should be able to query for any cluster deployment labels you define to look for in HiveConfig.
The reasoning goes more like this: Prometheus client libraries maintain a map for each metric, with dimensions up to the number of labels defined for that metric.
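A minimal pure-Python sketch of what the comment above describes: the client library keeps a map per metric, keyed by the tuple of label values, so every new unique combination of label values allocates (and retains) a new child time series. The metric name, label names, and values below are all illustrative, not Hive's real ones.

```python
# Sketch of how a Prometheus client library tracks labeled metrics:
# one map entry (child time series) per unique label-value combination.
from collections import defaultdict

class Counter:
    def __init__(self, name, label_names):
        self.name = name
        self.label_names = label_names
        self._children = defaultdict(float)  # label-value tuple -> running count

    def labels(self, **label_values):
        key = tuple(label_values[n] for n in self.label_names)
        return _Child(self._children, key)

class _Child:
    def __init__(self, children, key):
        self._children, self._key = children, key

    def inc(self, amount=1.0):
        self._children[self._key] += amount

failures = Counter("provision_failed_terminal_total", ["cluster_type", "reason"])
failures.labels(cluster_type="osd", reason="DNSNotReady").inc()
failures.labels(cluster_type="osd", reason="DNSNotReady").inc()        # same series
failures.labels(cluster_type="rosa", reason="AWSQuotaExceeded").inc()  # new series

print(len(failures._children))  # prints 2: two distinct child series
```

Note that incrementing an existing label combination reuses its series, while any new combination grows the map; with a label like cluster_id, that map grows with every new cluster observed.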
You would theoretically be able to configure hive to report the metrics that way, yes. But as @suhanime said, you should not create metric labels with per-cluster values. More below:
It's not about string size, it's about the number of unique values the label can take. As it has been explained to me: you can think of each unique value as adding a whole column to the prom db table that stores the metric. If we have, say, two dozen reason codes, that's two dozen columns -- and that's it. But if we label by cluster ID, now every time we observe for a new cluster, we're adding a column. Those columns do eventually get purged if no metrics are observed with those label values for some amount of time (retention period). But in the meantime they're sitting around taking up (lots of) space. More...
Ultimately this is going to be up to you. If you're confident that the metric in question is going to generate and purge label values at a reasonable rate, we're giving you the ability to label by whatever you like. But I would strongly recommend scrutinizing every case very carefully, with oversight from someone who understands this cardinality issue deeply.
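A back-of-the-envelope sketch of the arithmetic behind this concern, using the "two dozen reason codes" figure from the comment above; the counts are illustrative:

```python
# With only bounded labels, the series count is fixed.
reasons = 24  # e.g. "two dozen reason codes" -- bounded, fine
series_without_cluster_id = reasons
print(series_without_cluster_id)  # 24

# Label by cluster ID, and every newly observed cluster multiplies the total.
for observed_clusters in (10, 100, 1000):
    print(observed_clusters, reasons * observed_clusters)
# 10 -> 240, 100 -> 2400, 1000 -> 24000
```

The point is that a bounded label fixes the cost up front, while a per-cluster label makes the series count grow without limit as new clusters are observed.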
I don't understand this statement (lack of context). If you'd like to elaborate or meet to discuss, perhaps we can help find a solution that doesn't require per-cluster label values.
I agree that it'd take more space, but can't we also weigh the overhead against the benefit? (ignoring the cardinality concerns)
yeah, I'd be happy to talk! I can schedule a meeting between you and our team to discuss alternative approaches, but it's summed up in: https://issues.redhat.com/browse/OSD-14585
Agree; I was using that as a means to visualize what happens when you add a unique value. I'm not an expert on this, but here's an article that tries to explain it: https://grafana.com/blog/2022/02/15/what-are-cardinality-spikes-and-why-do-they-matter/ My interpretation: the true number of "time series" for a given metric is the product of the number of unique values of all its labels. So in this case we have three labels with 1, 2, and 10 unique values, respectively.
So today you have 1 x 2 x 10 = 20 time series for this metric. Now add the cluster-id as a label. For every unique cluster where a failure is observed, you're adding one to that multiplier. So if five clusters fail, you have 20 x 5 = 100 time series. You can see that these numbers blow up very quickly. Multiply by the dozens of metrics we're tracking, and you'll quickly overwhelm the database. At this point I'll stop trying to convince you this is a bad idea, because ultimately it's in your hands once #1925 merges. As to your specific problem, it's easy enough to run a report on a given shard to figure out which clusters have succeeded or failed, e.g. by listing each cluster's name in the first column and the status of its ProvisionStopped condition in the second.
It should be possible to merge this with a similar list from your network verifier thingy to get the information you're looking for.
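A hedged sketch of the "merge" suggested above: join a per-cluster ProvisionStopped report with per-cluster network-verifier results, keyed on cluster name. Both input mappings, the cluster names, and the verifier outcome strings are hypothetical stand-ins for whatever the real report and verifier produce.

```python
# Hypothetical report: cluster name -> ProvisionStopped condition status
hive_report = {
    "cluster-a": "True",
    "cluster-b": "False",
    "cluster-c": "True",
}
# Hypothetical network-verifier results: cluster name -> outcome
verifier_results = {
    "cluster-a": "egress-blocked",
    "cluster-c": "passed",
}

# Outer-join on cluster name; mark clusters the verifier never ran against.
merged = {
    name: (status, verifier_results.get(name, "not-run"))
    for name, status in hive_report.items()
}
for name, (stopped, verifier) in sorted(merged.items()):
    print(name, stopped, verifier)
```

This gives a per-cluster view correlating terminal provision failures with verifier outcomes, without putting cluster_id into a metric label.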
yeah, I agree that hive_cluster_deployments_provision_failed_terminal_total is not the right place to add cluster_id, and from the cardinality perspective, this PR can be closed.
Since this isn't recorded as a time series, there's no way to check past data; however, a metric containing the cluster-id, emitted per install result, could persist until it's removed. As a user of Hive, that'd be great to have, though.
Fix: https://issues.redhat.com/browse/OSD-14586
This adds cluster_id to hive_cluster_deployments_provision_failed_terminal_total to establish a relation between this metric and network_verifier_runs, for deeper investigation and sanity checks.