
add metrics #452

Merged
jpeeler merged 1 commit into operator-framework:master from the add-metrics branch on Sep 18, 2018

Conversation


@jpeeler jpeeler commented Sep 12, 2018

Expose counts for various OLM-specific resources, which currently are:
CSVs
InstallPlans
Subscriptions
CatalogSources

And also a count for CSV upgrades.
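
For context, a rough sketch of how such metrics might be declared with the Prometheus Go client. The variable names mirror the ones referenced later in this thread (metrics.CSVCount, metrics.CSVUpgradeCount, and so on), but the exact definitions in the PR's metrics package may differ.

// Sketch only: resource counts as gauges (they can go down), upgrades as a
// monotonically increasing counter. Assumes github.com/prometheus/client_golang.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	CSVCount           = prometheus.NewGauge(prometheus.GaugeOpts{Name: "csv_count", Help: "Number of CSVs"})
	InstallPlanCount   = prometheus.NewGauge(prometheus.GaugeOpts{Name: "install_plan_count", Help: "Number of InstallPlans"})
	SubscriptionCount  = prometheus.NewGauge(prometheus.GaugeOpts{Name: "subscription_count", Help: "Number of Subscriptions"})
	CatalogSourceCount = prometheus.NewGauge(prometheus.GaugeOpts{Name: "catalog_source_count", Help: "Number of CatalogSources"})
	CSVUpgradeCount    = prometheus.NewCounter(prometheus.CounterOpts{Name: "csv_upgrade_count", Help: "Number of CSV upgrades"})
)

func init() {
	// Register everything with the default Prometheus registry.
	prometheus.MustRegister(CSVCount, InstallPlanCount, SubscriptionCount, CatalogSourceCount, CSVUpgradeCount)
}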

@@ -134,6 +134,7 @@ func (o *Operator) sync(loop *QueueInformer, key string) error {
return err
}

// TODO: handle resource deletion metrics here?
Member

I'm wondering if we could pass in a custom event handler from the operator that has the resource deletion metrics embedded in a custom cache.ResourceEventHandlerFuncs.DeleteFunc.
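
For illustration only, a minimal sketch of that idea using client-go's cache.ResourceEventHandlerFuncs; this is hypothetical wiring, not code from the PR:

// Hypothetical: a per-resource handler passed into the QueueInformer whose
// DeleteFunc keeps the gauge in sync when a CSV is deleted.
// Assumes k8s.io/client-go/tools/cache and this repo's metrics package.
csvHandlers := cache.ResourceEventHandlerFuncs{
	DeleteFunc: func(obj interface{}) {
		metrics.CSVCount.Dec() // only valid if CSVCount is a Gauge
	},
}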

Author

I thought about that. It's worth considering, but I didn't immediately gravitate towards it since I think I'd have to have that delete function be aware of all the types. Then again, the TODO location would have the same problem.

Member

Good point. I also just remembered that everything in the lib directory is stuff that we wish was in an external package.

@jpeeler jpeeler force-pushed the add-metrics branch 2 times, most recently from 5e7aa7d to 5d3b1e3 on September 13, 2018 at 17:25
@jpeeler jpeeler commented Sep 13, 2018

@ecordell Are OLM and the catalog ever deployed separately? I'm wondering if metrics should be exposed from both containers. Also, see if this is closer to what you had in mind (while I debug CI in the meantime).

@ecordell ecordell (Member) commented

We deploy them in separate containers, but so far haven't had anyone deploy one without the other. There may be use cases for just deploying OLM though (if you know better than we do how to resolve something, etc.).

@jpeeler jpeeler commented Sep 13, 2018

If you click the Details link, you can see that the pipeline passed (it didn't the first time), but I don't know how to update the status on the PR. I'll push a new commit soon and everything should refresh anyway.

@@ -181,6 +182,10 @@ func (o *Operator) syncCatalogSources(obj interface{}) (syncError error) {
return fmt.Errorf("failed to create catalog source from ConfigMap %s: %s", out.Spec.ConfigMap, err)
}

if o.sourcesLastUpdate.IsZero() {
metrics.CatalogSourceCount.Inc()
Member

sourcesLastUpdate is a property on the running operator (when the operator last checked for catalog sources), so I don't think this is tracking the right thing.

@@ -133,7 +136,9 @@ func (o *Operator) sync(loop *QueueInformer, key string) error {
if err != nil {
return err
}

if err = o.handleMetrics(loop); err != nil {
return err
Member

I don't think we want to completely kill the control loop just because we couldn't emit metrics. Maybe just log an error instead.
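
A sketch of that suggestion, assuming the logrus logger already used in this file:

// Don't fail the whole sync just because metrics emission failed; log it instead.
if err := o.handleMetrics(loop); err != nil {
	log.Warnf("failed to update metrics: %v", err)
}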

@@ -133,7 +136,9 @@ func (o *Operator) sync(loop *QueueInformer, key string) error {
if err != nil {
return err
}

if err = o.handleMetrics(loop); err != nil {
Member

If sync gets retried because syncHandler returns false (see processNextWorkItem), this will get called more often than we really want.

Would it make sense to trigger only after successful syncHandler? e.g.

success := loop.syncHandler()
if success {
  go o.handleMetrics(loop) // are prom metrics safe to call concurrently?
}
return success


func (o *Operator) handleMetrics(informer *QueueInformer) error {
switch informer.name {
case "csv":
Member

Instead of making the QueueInformer know about all of the informers we might construct it with, what if we specialize handleMetrics per operator/control loop (e.g. the CSV control loop triggers CSV metrics, the InstallPlan control loop triggers InstallPlan metrics, etc.), similar to how we have the operators specify a syncHandler?

Then when you make a new CSV QueueInformer (as a strawman), set the metricHandler to

o.metricHandler = func() error {
	cList, err := o.OpClient.ListCustomResource(v1alpha1.GroupName, v1alpha1.GroupVersion, "", v1alpha1.ClusterServiceVersionKind)
	if err != nil {
		return err
	}
	metrics.CSVCount.Set(float64(len(cList.Items)))
	return nil
}

and handleMetrics just calls o.metricHandler()

(I think a more common Go pattern would be to define a MetricProvider interface with a MetricHandler() method, but if you want to go that route you should update the syncHandler to follow the same pattern 😄)
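
A minimal sketch of that interface idea, using the names from the comment above (hypothetical, not what the PR ended up doing):

// MetricProvider is a hypothetical interface each operator/control loop could
// implement, mirroring how each operator already supplies a syncHandler.
type MetricProvider interface {
	MetricHandler() error
}

// Illustrative implementation for the CSV operator, reusing the call shape
// from the snippet above.
func (o *Operator) MetricHandler() error {
	cList, err := o.OpClient.ListCustomResource(v1alpha1.GroupName, v1alpha1.GroupVersion, "", v1alpha1.ClusterServiceVersionKind)
	if err != nil {
		return err
	}
	metrics.CSVCount.Set(float64(len(cList.Items)))
	return nil
}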

cmd/olm/main.go Outdated
//healthz.InstallHandler(mux) //(less code)
//mux.Handle("/metrics", promhttp.Handler()) //other form is deprecated
mux.Handle("/metrics", prometheus.Handler())
go http.ListenAndServe(":8080", mux)
Member

Do you know if there's a convention for this? Is it common to serve metrics on the healthz port, or do people define a separate port for each?

Author

Yeah, I think it's common to use the same port. Here's what the Prometheus Operator does (though they aren't using healthz here): https://github.com/coreos/prometheus-operator/blob/ec94db4149766312fb60a2d5fdc5aca3153386dd/cmd/operator/main.go#L201
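
For reference, a rough sketch of serving healthz and metrics from the same port with the non-deprecated promhttp handler (illustrative only, not the PR's final code; assumes net/http and github.com/prometheus/client_golang/prometheus/promhttp):

mux := http.NewServeMux()
mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
	w.WriteHeader(http.StatusOK)
})
mux.Handle("/metrics", promhttp.Handler())
go func() {
	// Serve both endpoints from the single :8080 listener.
	if err := http.ListenAndServe(":8080", mux); err != nil {
		log.Fatalf("metrics/healthz server exited: %v", err)
	}
}()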

@@ -345,6 +346,7 @@ func (a *Operator) checkReplacementsAndUpdateStatus(csv *v1alpha1.ClusterService
log.Infof("newer ClusterServiceVersion replacing %s, no-op", csv.SelfLink)
msg := fmt.Sprintf("being replaced by csv: %s", replacement.SelfLink)
csv.SetPhase(v1alpha1.CSVPhaseReplacing, v1alpha1.CSVReasonBeingReplaced, msg)
metrics.CSVUpgradeCount.Inc()
@ecordell ecordell (Member) Sep 13, 2018

I wonder if we even need this if we have the other metrics. count(changes(csv_count{phase=Replacing} / 2)) might get us there? I'm thinking the graph for csv_count{phase=Replacing} will look like:

____/\_____/\____/\___

Though with a counter we'll eventually get an accurate count even if we can't scrape some of the events...

Author

I left it for now, but I'd be happy to remove it since it doesn't really fit in with the theme of the rest of them.

@jpeeler jpeeler force-pushed the add-metrics branch 2 times, most recently from e1b1671 to a6f3ac7 on September 14, 2018 at 20:36
@ecordell ecordell (Member) left a comment

This looks good! One small nit:

cmd/olm/main.go Outdated
//healthz.InstallHandler(mux) //(less code)
//mux.Handle("/metrics", promhttp.Handler()) //other form is deprecated
mux.Handle("/metrics", prometheus.Handler())
go http.ListenAndServe(":8080", mux)
Member

This adds a /metrics endpoint on the OLM container that exposes counts for various OLM-specific resources, which currently are:
CSVs
InstallPlans
Subscriptions
CatalogSources

And also a count for CSV upgrades.

Some of the e2e code (getMetricsFromPod) was copied from the Kubernetes e2e metrics framework.
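
For anyone curious, a rough sketch of what a helper like that can look like with a recent client-go, reading the raw /metrics text through the API server's pod proxy (function shape and names are illustrative; the PR's actual helper may differ):

// Sketch: fetch the Prometheus text exposition from a pod's /metrics endpoint
// via the pods/proxy subresource. Assumes k8s.io/client-go/kubernetes.
func getMetricsFromPod(ctx context.Context, c kubernetes.Interface, namespace, pod, port string) (string, error) {
	raw, err := c.CoreV1().RESTClient().Get().
		Namespace(namespace).
		Resource("pods").
		SubResource("proxy").
		Name(fmt.Sprintf("%s:%s", pod, port)).
		Suffix("metrics").
		Do(ctx).
		Raw()
	if err != nil {
		return "", err
	}
	return string(raw), nil
}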
@ecordell ecordell (Member) left a comment

LGTM

@jpeeler jpeeler changed the title from "WIP: add metrics" to "add metrics" on Sep 18, 2018
@jpeeler jpeeler merged commit b096ff8 into operator-framework:master Sep 18, 2018
ecordell pushed a commit to ecordell/operator-lifecycle-manager that referenced this pull request Mar 8, 2019