add metrics #452
Conversation
@@ -134,6 +134,7 @@ func (o *Operator) sync(loop *QueueInformer, key string) error {
		return err
	}

	// TODO: handle resource deletion metrics here?
I'm wondering if we could pass in a custom event handler from the operator that has the resource deletion metrics embedded in a custom cache.ResourceEventHandlerFuncs.DeleteFunc.
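For concreteness, a rough sketch of that idea; this is not code from the PR, and the gauge name and handler wiring are illustrative:

// Sketch only: deletion-aware event handlers the operator could pass in when it
// constructs a QueueInformer. csvCount stands in for the PR's metrics.CSVCount
// and is assumed to be registered with the default Prometheus registry elsewhere.
package metricsketch

import (
	"github.com/prometheus/client_golang/prometheus"
	"k8s.io/client-go/tools/cache"
)

var csvCount = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "csv_count",
	Help: "Number of ClusterServiceVersions the operator knows about",
})

// NewDeletionAwareHandlers keeps the gauge in sync as objects are added and deleted.
func NewDeletionAwareHandlers() cache.ResourceEventHandlerFuncs {
	return cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { csvCount.Inc() },
		DeleteFunc: func(obj interface{}) { csvCount.Dec() },
	}
}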
I thought about that. It's worth considering, but I didn't immediately gravitate towards it since I think I'd have to have that delete function be aware of all the types. Then again, the TODO location would have the same problem.
Good point. I also just remembered that everything in the lib directory is stuff that we wish was in an external package.
Force-pushed from 5e7aa7d to 5d3b1e3.
@ecordell Are OLM and the catalog ever deployed separately? I'm wondering if metrics should be exposed from both containers. Also, see if this is closer to what you had in mind (while I debug CI in the meantime).
We deploy them in separate containers, but so far haven't had anyone deploy one without the other. There may be use cases for deploying just OLM, though (if you know better than we do how to resolve something, etc.).
Force-pushed from 5d3b1e3 to 6ad54cd.
If you click the Details link, you can see the pipeline passed (it didn't the first time). But I don't know how to update the status on the PR. I'll push a new commit soon and everything should refresh anyway.
@@ -181,6 +182,10 @@ func (o *Operator) syncCatalogSources(obj interface{}) (syncError error) {
		return fmt.Errorf("failed to create catalog source from ConfigMap %s: %s", out.Spec.ConfigMap, err)
	}

	if o.sourcesLastUpdate.IsZero() {
		metrics.CatalogSourceCount.Inc()
sourcesLastUpdate is a property on the running operator (when was the last time the operator checked for catalog sources), so I don't think this is tracking the right thing
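An alternative, purely illustrative sketch in the spirit of the list-and-set approach suggested further down this thread: recompute the gauge on each catalog sync rather than incrementing it based on sourcesLastUpdate. It assumes metrics.CatalogSourceCount is a prometheus.Gauge and reuses the ListCustomResource call shape quoted later; the group/version/kind strings are placeholders, not the PR's actual constants:

// Hypothetical sketch: set the catalog source gauge from a fresh list each sync,
// instead of incrementing it when o.sourcesLastUpdate is zero.
func (o *Operator) syncCatalogSourceMetrics() error {
	// Group, version, and kind strings here are assumptions for illustration.
	list, err := o.OpClient.ListCustomResource(v1alpha1.GroupName, v1alpha1.GroupVersion, "", "CatalogSource-v1")
	if err != nil {
		return err
	}
	metrics.CatalogSourceCount.Set(float64(len(list.Items)))
	return nil
}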
@@ -133,7 +136,9 @@ func (o *Operator) sync(loop *QueueInformer, key string) error {
	if err != nil {
		return err
	}

	if err = o.handleMetrics(loop); err != nil {
		return err
I don't think we want to completely kill the control loop just because we couldn't emit metrics. Maybe just log an error
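Something like the following, as a sketch (assuming the logrus-style log package already used elsewhere in the operator is in scope):

// Sketch: record the metrics failure but keep the control loop alive.
if err := o.handleMetrics(loop); err != nil {
	log.Warnf("failed to update metrics for %q informer: %v", loop.name, err)
}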
@@ -133,7 +136,9 @@ func (o *Operator) sync(loop *QueueInformer, key string) error {
	if err != nil {
		return err
	}

	if err = o.handleMetrics(loop); err != nil {
If sync gets retried because syncHandler returns false (see processNextWorkItem), this will get called more often than we really want. Would it make sense to trigger it only after a successful syncHandler? e.g.

success := loop.syncHandler()
if success {
	go o.handleMetrics(loop) // are prom metrics safe to call concurrently?
}
return success
func (o *Operator) handleMetrics(informer *QueueInformer) error {
	switch informer.name {
	case "csv":
Instead of making the QueueInformer know about all of the informers we might construct it with, what if we specialize handleMetrics per operator/control loop (e.g. the CSV control loop triggers CSV metrics, the InstallPlan control loop triggers InstallPlan metrics, etc.), similar to how we have the operators specify a syncHandler?

Then when you make a new CSV QueueInformer (as a strawman) set the metricHandler to

o.metricHandler = func() error {
	cList, err := o.OpClient.ListCustomResource(v1alpha1.GroupName, v1alpha1.GroupVersion, "", v1alpha1.ClusterServiceVersionKind)
	if err != nil {
		return err
	}
	metrics.CSVCount.Set(float64(len(cList.Items)))
	return nil
}

and handleMetrics just calls o.metricHandler().

(I think a more common Go pattern would be to define a MetricProvider interface with a MetricHandler() method, but if you want to go that route you should update the syncHandler to follow the same pattern 😄)
cmd/olm/main.go
//healthz.InstallHandler(mux) //(less code)
//mux.Handle("/metrics", promhttp.Handler()) //other form is deprecated
mux.Handle("/metrics", prometheus.Handler())
go http.ListenAndServe(":8080", mux)
Do you know if there's a convention for this? Is it common to serve metrics on the healthz port, or do people define a separate port for each?
Yeah, I think it's common to use the same port. Here's what the prometheus operator does (though they aren't using healthz here): https://github.com/coreos/prometheus-operator/blob/ec94db4149766312fb60a2d5fdc5aca3153386dd/cmd/operator/main.go#L201
looks like you can do both without a new muxer? https://github.com/coreos/etcd-operator/blob/master/cmd/operator/main.go#L105-L106
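A minimal sketch of that approach on the default mux (assuming promhttp from client_golang; the handler wiring is illustrative, not the PR's final code):

// Sketch: serve both /healthz and /metrics on one port without a separate muxer.
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Both endpoints register on http.DefaultServeMux, so no extra muxer is needed.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	http.Handle("/metrics", promhttp.Handler())
	go http.ListenAndServe(":8080", nil)

	// ... the rest of operator startup would continue here ...
	select {}
}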
@@ -345,6 +346,7 @@ func (a *Operator) checkReplacementsAndUpdateStatus(csv *v1alpha1.ClusterService
	log.Infof("newer ClusterServiceVersion replacing %s, no-op", csv.SelfLink)
	msg := fmt.Sprintf("being replaced by csv: %s", replacement.SelfLink)
	csv.SetPhase(v1alpha1.CSVPhaseReplacing, v1alpha1.CSVReasonBeingReplaced, msg)
	metrics.CSVUpgradeCount.Inc()
I wonder if we even need this if we have the other metrics. count(changes(csv_count{phase=Replacing} / 2)) might get us there? I'm thinking the graph for csv_count{phase=Replacing} will look like:

____/\_____/\____/\___

Though with a counter we'll eventually get an accurate count even if we can't scrape some of the events...
I left it in for now, but I'd be happy to remove it since it doesn't really fit in with the theme of the rest of them.
Force-pushed from e1b1671 to a6f3ac7.
This looks good! One small nit:
cmd/olm/main.go
//healthz.InstallHandler(mux) //(less code)
//mux.Handle("/metrics", promhttp.Handler()) //other form is deprecated
mux.Handle("/metrics", prometheus.Handler())
go http.ListenAndServe(":8080", mux)
looks like you can do both without a new muxer? https://github.com/coreos/etcd-operator/blob/master/cmd/operator/main.go#L105-L106
This adds a /metrics endpoint on the OLM container that exposes counts for various OLM-specific resources, which currently are: CSVs, InstallPlans, Subscriptions, and CatalogSources, plus a count for CSV upgrades. Some of the e2e code (getMetricsFromPod) was copied from the Kubernetes e2e metrics framework.
Force-pushed from a6f3ac7 to 2c4f3bf.
LGTM
Expose counts for various OLM-specific resources, which currently are:
CSVs
InstallPlans
Subscriptions
CatalogSources
It also adds a count for CSV upgrades.