Skip to content

Commit

Permalink
Addressing PR comments
Browse files Browse the repository at this point in the history
Most of the changes are as per the PR comments. Additional changes:
- Rename files and package names as per Golang naming conventions
- Add Hive Operator metrics in documentation
  • Loading branch information
suhanime committed Jan 25, 2023
1 parent 8e67f66 commit 756f500
Show file tree
Hide file tree
Showing 18 changed files with 203 additions and 116 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,8 @@ type MetricsConfig struct {
// label -- e.g. "4".
// NOTE: Avoid ClusterDeployment labels whose values are unbounded, such as those representing cluster names or IDs,
// as these will cause your prometheus database to grow indefinitely.
// Check docs\hive-metrics.md for a list of affected metrics
// Affected metrics are those whose type implements the metricsWithDyanamicLabels interface found in
// pkg/controller/metrics/metrics_with_dynamic_labels.go
// +optional
AdditionalClusterDeploymentLabels map[string]string `json:"additionalClusterDeploymentLabels"`
}
32 changes: 32 additions & 0 deletions apis/hive/v1/zz_generated.deepcopy.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

3 changes: 2 additions & 1 deletion config/crds/hive.openshift.io_hiveconfigs.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -627,7 +627,8 @@ spec:
-- e.g. "4". NOTE: Avoid ClusterDeployment labels whose values
are unbounded, such as those representing cluster names or IDs,
as these will cause your prometheus database to grow indefinitely.
Check docs\hive-metrics.md for a list of affected metrics'
Affected metrics are those whose type implements the metricsWithDyanamicLabels
interface found in pkg/controller/metrics/metrics_with_dynamic_labels.go'
type: object
metricsWithDuration:
description: Optional metrics and their configurations
Expand Down
34 changes: 22 additions & 12 deletions docs/hive-metrics.md → docs/hive_metrics.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,8 +4,10 @@

- [Hive Metrics](#hive-metrics)
- [Optional Metrics](#optional-metrics)
- [Metrics with Optional Cluster Deployment labels](#metrics-with-optional-cluster-deployment-labels)
- [Duration-based Metrics](#duration-based-metrics)
- [Metrics with Optional Cluster Deployment labels](#metrics-with-optional-cluster-deployment-labels)
- [List of all Hive metrics](#list-of-all-hive-metrics)
- [Hive Operator metrics](#hive-operator-metrics)
- [ClusterDeployment controller metrics](#clusterdeployment-controller-metrics)
- [ClusterProvision controller metrics](#clusterprovision-controller-metrics)
- [ClusterPool controller metrics](#clusterpool-controller-metrics)
Expand All @@ -15,19 +17,19 @@

# Hive Metrics

Hive publishes metrics, which can help admins to monitor the Hive operations. Most of these metrics are always published, few are [optional](#metrics-with-optional-cluster-deployment-labels).

### Exporting Hive Metrics
`HiveConfig.Spec.ExportMetrics` specifies whether the operator should enable metrics for hive controllers to be extracted for the prometheus instance running on the cluster. When set to true, `openshift.io/cluster-monitoring` label is applied and the operator deploys ServiceMonitors so that the prometheus instances can extract metrics. The operator also sets up RBAC in the TargetNamespace so that openshift prometheus in the cluster can list/access objects required to pull metrics.
Hive publishes metrics, which can help admins to monitor the Hive operations. Most of these metrics are always published; a few are [optional](#optional-metrics).

### Optional Metrics

Some metrics pertaining to Hibernation (or PowerState) controller, are not logged by default, since that feature is not supported across all clouds.
Admins can opt for logging such metrics via `HiveConfig.Spec.MetricsConfig`
#### Duration-based Metrics

Some metrics that log duration are not logged by default, since that feature is not supported across all clouds.
Admins can opt for logging such metrics via `HiveConfig.Spec.MetricsConfig.MetricsWithDuration`
Check the metrics labelled as `Optional` in the [list below](#optional-metrics) to see affected metrics.

### Metrics with Optional Cluster Deployment labels
#### Metrics with Optional Cluster Deployment labels

Not all metrics are logged per cluster deployment name or namespace. Many metrics are logged per cluster-type, and hive allows for admins to define the labels to look for when reporting these metrics.
Most metrics are not observed per cluster deployment name or namespace, so Hive allows for admins to define the labels to look for when reporting these metrics.

Opt into them via `HiveConfig.Spec.MetricsConfig.AdditionalClusterDeploymentLabels`, which accepts a map of the label you would want on your metrics, to the label found on ClusterDeployment.
For example, including `{"ocp_major_version": "hive.openshift.io/version-major"}` will cause affected metrics to include a label key ocp_major_version with the value from the `hive.openshift.io/version-major` ClusterDeployment label -- e.g. "4".
Expand All @@ -36,8 +38,16 @@ Note: It is upto the cluster admins to be mindful of cardinality and ensure thes

### List of all Hive metrics

#### Hive Operator metrics
These metrics are observed by the Hive Operator. None of these are optional

| Metric Name | Optional Label Support |
|:-------------------------------:|:----------------------:|
| hive_hiveconfig_conditions | N |
| hive_operator_reconcile_seconds | N |

#### ClusterDeployment controller metrics
These metrics are logged per ClusterDeployment instance. None of these are optional.
These metrics are observed while processing ClusterDeployments. None of these are optional.

| Metric Name | Optional Label Support |
|:--------------------------------------------------------:|:----------------------:|
Expand All @@ -52,7 +62,7 @@ These metrics are logged per ClusterDeployment instance. None of these are optio
| hive_cluster_deployments_provision_failed_terminal_total | Y |

#### ClusterProvision controller metrics
These metrics are logged per ClusterProvision instance. If they have optional label support, ClusterDeployment will apply those labels to ClusterProvision. None of these are optional.
These metrics are observed while processing ClusterProvisions. None of these are optional.

| Metric Name | Optional Label Support |
|:---------------------------------------------:|:----------------------:|
Expand All @@ -62,7 +72,7 @@ These metrics are logged per ClusterProvision instance. If they have optional la
| hive_cluster_deployment_install_success_total | Y |

#### ClusterPool controller metrics
These metrics are logged per ClusterPool instance. None of these are optional.
These metrics are observed while processing ClusterPools. None of these are optional.

| Metric Name | Optional Label Support |
|:-------------------------------------------------:|:----------------------:|
Expand Down
4 changes: 2 additions & 2 deletions hack/app-sre/saas-template.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -4464,8 +4464,8 @@ objects:
-- e.g. "4". NOTE: Avoid ClusterDeployment labels whose values
are unbounded, such as those representing cluster names or
IDs, as these will cause your prometheus database to grow
indefinitely. Check docs\hive-metrics.md for a list of affected
metrics'
indefinitely. Affected metrics are those whose type implements
the metricsWithDyanamicLabels interface found in pkg/controller/metrics/metrics_with_dynamic_labels.go'
type: object
metricsWithDuration:
description: Optional metrics and their configurations
Expand Down
13 changes: 3 additions & 10 deletions pkg/controller/clusterdeployment/clusterdeployment_controller.go
Original file line number Diff line number Diff line change
Expand Up @@ -77,11 +77,6 @@ var (
// ImageSet condition initialized by cluster deployment
hivev1.InstallerImageResolutionFailedCondition,
}

// mapClusterTypeLabelToValue is a map that corresponds to MetricsWithClusterTypeLabels provided in metrics config.
// The key in the map is the name of the label used in defining the metric, and the value would be the cluster
// deployment label to look for while reporting the metric
mapClusterTypeLabelToValue map[string]string
)

const (
Expand Down Expand Up @@ -128,11 +123,9 @@ func Add(mgr manager.Manager) error {
log.WithError(err).Error("error reading metrics config")
return err
}
mapClusterTypeLabelToValue = make(map[string]string)
mapClusterTypeLabelToValue = hivemetrics.GetOptionalClusterTypeLabels(mConfig)
// Register the metrics. This is done here to ensure we define the metrics with optional label support after we have
// read the hiveconfig, and we register them only once.
registerMetrics(mapClusterTypeLabelToValue)
registerMetrics(mConfig)
return AddToManager(mgr, NewReconciler(mgr, logger, clientRateLimiter), concurrentReconciles, queueRateLimiter)
}

Expand Down Expand Up @@ -588,7 +581,7 @@ func (r *ReconcileClusterDeployment) reconcile(request reconcile.Request, cd *hi
cdLog.WithError(err).Log(controllerutils.LogLevel(err), "error adding finalizer")
return reconcile.Result{}, err
}
metricClustersCreated.LogMetricWithDynamicLabels(cd, cdLog, 1)
metricClustersCreated.ObserveMetricWithDynamicLabels(cd, cdLog, 1)
return reconcile.Result{}, nil
}

Expand Down Expand Up @@ -1444,7 +1437,7 @@ func (r *ReconcileClusterDeployment) removeClusterDeploymentFinalizer(cd *hivev1
hivemetrics.ClearClusterSyncFailingSecondsMetric(cd.Namespace, cd.Name, cdLog)

// Increment the clusters deleted counter:
metricClustersDeleted.LogMetricWithDynamicLabels(cd, cdLog, 1)
metricClustersDeleted.ObserveMetricWithDynamicLabels(cd, cdLog, 1)
return nil
}

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,7 @@ import (
hivev1aws "github.com/openshift/hive/apis/hive/v1/aws"
"github.com/openshift/hive/apis/hive/v1/azure"
"github.com/openshift/hive/apis/hive/v1/baremetal"
"github.com/openshift/hive/apis/hive/v1/metricsconfig"
hiveintv1alpha1 "github.com/openshift/hive/apis/hiveinternal/v1alpha1"
"github.com/openshift/hive/pkg/constants"
controllerutils "github.com/openshift/hive/pkg/controller/utils"
Expand Down Expand Up @@ -94,8 +95,9 @@ current-context: admin

func init() {
log.SetLevel(log.DebugLevel)
// Register the metrics to avoid panics during testing
registerMetrics(mapClusterTypeLabelToValue)
// While the metrics need not be registered for this test suite, they still need to be defined to avoid panics
// during the tests
registerMetrics(&metricsconfig.MetricsConfig{})
}

func fakeReadFile(content string) func(string) ([]byte, error) {
Expand Down
4 changes: 2 additions & 2 deletions pkg/controller/clusterdeployment/clusterinstalls.go
Original file line number Diff line number Diff line change
Expand Up @@ -211,9 +211,9 @@ func (r *ReconcileClusterDeployment) reconcileExistingInstallingClusterInstall(c
logger.WithField("duration", installDuration.Seconds()).Debug("install job completed")
metricInstallJobDuration.Observe(float64(installDuration.Seconds()))

metricCompletedInstallJobRestarts.LogMetricWithDynamicLabels(cd, logger, float64(cd.Status.InstallRestarts))
metricCompletedInstallJobRestarts.ObserveMetricWithDynamicLabels(cd, logger, float64(cd.Status.InstallRestarts))

metricClustersInstalled.LogMetricWithDynamicLabels(cd, logger, 1)
metricClustersInstalled.ObserveMetricWithDynamicLabels(cd, logger, 1)

if r.protectedDelete {
// Set protected delete on for the ClusterDeployment.
Expand Down
4 changes: 2 additions & 2 deletions pkg/controller/clusterdeployment/clusterprovisions.go
Original file line number Diff line number Diff line change
Expand Up @@ -512,9 +512,9 @@ func (r *ReconcileClusterDeployment) reconcileCompletedProvision(cd *hivev1.Clus
metricInstallJobDuration.Observe(float64(jobDuration.Seconds()))

// Report a metric for the total number of install restarts:
metricCompletedInstallJobRestarts.LogMetricWithDynamicLabels(cd, cdLog, float64(cd.Status.InstallRestarts))
metricCompletedInstallJobRestarts.ObserveMetricWithDynamicLabels(cd, cdLog, float64(cd.Status.InstallRestarts))

metricClustersInstalled.LogMetricWithDynamicLabels(cd, cdLog, 1)
metricClustersInstalled.ObserveMetricWithDynamicLabels(cd, cdLog, 1)

return reconcile.Result{}, nil
}
Expand Down
47 changes: 25 additions & 22 deletions pkg/controller/clusterdeployment/metrics.go
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
package clusterdeployment

import (
"github.com/openshift/hive/apis/hive/v1/metricsconfig"
"github.com/prometheus/client_golang/prometheus"
log "github.com/sirupsen/logrus"

Expand Down Expand Up @@ -44,12 +45,12 @@ var (

// Declare the metrics which allow optional labels to be added.
// They are defined later once the hive config has been read.
metricCompletedInstallJobRestarts hivemetrics.HistogramWithDynamicLabels
metricCompletedInstallJobRestarts hivemetrics.HistogramVecWithDynamicLabels

metricClustersCreated hivemetrics.CounterWithDynamicLabels
metricClustersInstalled hivemetrics.CounterWithDynamicLabels
metricClustersDeleted hivemetrics.CounterWithDynamicLabels
metricProvisionFailedTerminal hivemetrics.CounterWithDynamicLabels
metricClustersCreated hivemetrics.CounterVecWithDynamicLabels
metricClustersInstalled hivemetrics.CounterVecWithDynamicLabels
metricClustersDeleted hivemetrics.CounterVecWithDynamicLabels
metricProvisionFailedTerminal hivemetrics.CounterVecWithDynamicLabels
)

func incProvisionFailedTerminal(cd *hivev1.ClusterDeployment, log log.FieldLogger) {
Expand All @@ -64,63 +65,65 @@ func incProvisionFailedTerminal(cd *hivev1.ClusterDeployment, log log.FieldLogge
}
metricProvisionFailedTerminal.FixedLabels["clusterpool_namespacedname"] = poolNSName
metricProvisionFailedTerminal.FixedLabels["failure_reason"] = stoppedReason
metricProvisionFailedTerminal.LogMetricWithDynamicLabels(cd, log, 1)
metricProvisionFailedTerminal.ObserveMetricWithDynamicLabels(cd, log, 1)
}

func registerMetrics(mapClusterTypeLabelToValue map[string]string) {
func registerMetrics(mConfig *metricsconfig.MetricsConfig) {
mapClusterTypeLabelToValue := hivemetrics.GetOptionalClusterTypeLabels(mConfig)
optionalClusterTypeLabels := hivemetrics.GetSortedKeys(mapClusterTypeLabelToValue)
defaultLabels := &hivemetrics.DynamicLabels{
FixedLabels: map[string]string{},
OptionalLabels: mapClusterTypeLabelToValue,
}

metricCompletedInstallJobRestarts = hivemetrics.HistogramWithDynamicLabels{
metricCompletedInstallJobRestarts = hivemetrics.HistogramVecWithDynamicLabels{
HistogramVec: prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "hive_cluster_deployment_completed_install_restart",
Help: "Distribution of the number of restarts for all completed cluster installations.",
Buckets: []float64{0, 2, 10, 20, 50},
},
optionalClusterTypeLabels),
FixedLabels: map[string]string{},
OptionalLabels: mapClusterTypeLabelToValue,
DynamicLabels: defaultLabels,
}

metricClustersCreated = hivemetrics.CounterWithDynamicLabels{
metricClustersCreated = hivemetrics.CounterVecWithDynamicLabels{
CounterVec: prometheus.NewCounterVec(prometheus.CounterOpts{
Name: "hive_cluster_deployments_created_total",
Help: "Counter incremented every time we observe a new cluster.",
},
optionalClusterTypeLabels),
FixedLabels: map[string]string{},
OptionalLabels: mapClusterTypeLabelToValue,
DynamicLabels: defaultLabels,
}
metricClustersInstalled = hivemetrics.CounterWithDynamicLabels{
metricClustersInstalled = hivemetrics.CounterVecWithDynamicLabels{
CounterVec: prometheus.NewCounterVec(prometheus.CounterOpts{
Name: "hive_cluster_deployments_installed_total",
Help: "Counter incremented every time we observe a successful installation.",
},
optionalClusterTypeLabels),
FixedLabels: map[string]string{},
OptionalLabels: mapClusterTypeLabelToValue,
DynamicLabels: defaultLabels,
}
metricClustersDeleted = hivemetrics.CounterWithDynamicLabels{
metricClustersDeleted = hivemetrics.CounterVecWithDynamicLabels{
CounterVec: prometheus.NewCounterVec(prometheus.CounterOpts{
Name: "hive_cluster_deployments_deleted_total",
Help: "Counter incremented every time we observe a deleted cluster.",
},
optionalClusterTypeLabels),
FixedLabels: map[string]string{},
OptionalLabels: mapClusterTypeLabelToValue,
DynamicLabels: defaultLabels,
}
metricProvisionFailedTerminal = hivemetrics.CounterWithDynamicLabels{
metricProvisionFailedTerminal = hivemetrics.CounterVecWithDynamicLabels{
CounterVec: prometheus.NewCounterVec(prometheus.CounterOpts{
Name: "hive_cluster_deployments_provision_failed_terminal_total",
Help: "Counter incremented when a cluster provision has failed and won't be retried.",
},
append([]string{"clusterpool_namespacedname", "failure_reason"}, optionalClusterTypeLabels...)),
FixedLabels: map[string]string{
DynamicLabels: &hivemetrics.DynamicLabels{FixedLabels: map[string]string{
// These values must be filled before logging the metric
"clusterpool_namespacedname": constants.MetricLabelDefaultValue,
"failure_reason": constants.MetricLabelDefaultValue,
},
OptionalLabels: mapClusterTypeLabelToValue,
OptionalLabels: mapClusterTypeLabelToValue,
},
}

metrics.Registry.MustRegister(metricInstallJobDuration)
Expand Down
Loading

0 comments on commit 756f500

Please sign in to comment.