add metrics #452
Conversation
@@ -134,6 +134,7 @@ func (o *Operator) sync(loop *QueueInformer, key string) error {
		return err
	}

	// TODO: handle resource deletion metrics here?
I'm wondering if we could pass in a custom event handler from the operator that has the resource deletion metrics embedded in a custom cache.ResourceEventHandlerFuncs.DeleteFunc.
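For concreteness, a rough sketch of that idea; this is not code from the PR, and the gauge name and handler wiring are illustrative:

// Sketch only: deletion-aware event handlers the operator could pass in when it
// constructs a QueueInformer. csvCount stands in for the PR's metrics.CSVCount
// and is assumed to be registered with the default Prometheus registry elsewhere.
package metricsketch

import (
	"github.com/prometheus/client_golang/prometheus"
	"k8s.io/client-go/tools/cache"
)

var csvCount = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "csv_count",
	Help: "Number of ClusterServiceVersions the operator knows about",
})

// NewDeletionAwareHandlers keeps the gauge in sync as objects are added and deleted.
func NewDeletionAwareHandlers() cache.ResourceEventHandlerFuncs {
	return cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { csvCount.Inc() },
		DeleteFunc: func(obj interface{}) { csvCount.Dec() },
	}
}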
I thought about that. It's worth considering, but I didn't immediately gravitate towards it since I think I'd have to have that delete function be aware of all the types. Then again, the TODO location would have the same problem.
Good point. I also just remembered that everything in the lib directory is stuff that we wish was in an external package.
Force-pushed from 5e7aa7d to 5d3b1e3.
@ecordell Are OLM and the catalog ever deployed separately? I'm wondering if metrics should be exposed from both containers. Also, see if this is closer to what you had in mind (while I debug CI in the meantime).
We deploy them in separate containers, but so far haven't had anyone deploy one without the other. There may be use cases for deploying just OLM, though (if you know better than we do how to resolve something, etc.).
Force-pushed from 5d3b1e3 to 6ad54cd.
If you click the Details link, you can see the pipeline passed (it didn't the first time). But I don't know how to update the status on the PR. I'll push a new commit soon and everything should refresh anyway.
@@ -181,6 +182,10 @@ func (o *Operator) syncCatalogSources(obj interface{}) (syncError error) {
		return fmt.Errorf("failed to create catalog source from ConfigMap %s: %s", out.Spec.ConfigMap, err)
	}

	if o.sourcesLastUpdate.IsZero() {
		metrics.CatalogSourceCount.Inc()
sourcesLastUpdate is a property on the running operator (when was the last time the operator checked for catalog sources), so I don't think this is tracking the right thing
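An alternative, purely illustrative sketch in the spirit of the list-and-set approach suggested further down this thread: recompute the gauge on each catalog sync rather than incrementing it based on sourcesLastUpdate. It assumes metrics.CatalogSourceCount is a prometheus.Gauge and reuses the ListCustomResource call shape quoted later; the group/version/kind strings are placeholders, not the PR's actual constants:

// Hypothetical sketch: set the catalog source gauge from a fresh list each sync,
// instead of incrementing it when o.sourcesLastUpdate is zero.
func (o *Operator) syncCatalogSourceMetrics() error {
	// Group, version, and kind strings here are assumptions for illustration.
	list, err := o.OpClient.ListCustomResource(v1alpha1.GroupName, v1alpha1.GroupVersion, "", "CatalogSource-v1")
	if err != nil {
		return err
	}
	metrics.CatalogSourceCount.Set(float64(len(list.Items)))
	return nil
}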
@@ -133,7 +136,9 @@ func (o *Operator) sync(loop *QueueInformer, key string) error {
	if err != nil {
		return err
	}

	if err = o.handleMetrics(loop); err != nil {
		return err
I don't think we want to completely kill the control loop just because we couldn't emit metrics. Maybe just log an error
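Something like the following, as a sketch (assuming the logrus-style log package already used elsewhere in the operator is in scope):

// Sketch: record the metrics failure but keep the control loop alive.
if err := o.handleMetrics(loop); err != nil {
	log.Warnf("failed to update metrics for %q informer: %v", loop.name, err)
}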
@@ -133,7 +136,9 @@ func (o *Operator) sync(loop *QueueInformer, key string) error {
	if err != nil {
		return err
	}

	if err = o.handleMetrics(loop); err != nil {
If sync gets retried because syncHandler returns false (see processNextWorkItem), this will get called more often than we really want. Would it make sense to trigger it only after a successful syncHandler? e.g.

success := loop.syncHandler()
if success {
	go o.handleMetrics(loop) // are prom metrics safe to call concurrently?
}
return success
func (o *Operator) handleMetrics(informer *QueueInformer) error {
	switch informer.name {
	case "csv":
Instead of making the QueueInformer know about all of the informers we might construct it with, what if we specialize handleMetrics per operator/control loop (e.g. the CSV control loop triggers CSV metrics, the InstallPlan control loop triggers InstallPlan metrics, etc.), similar to how we have the operators specify a syncHandler?

Then when you make a new CSV QueueInformer (as a strawman) set the metricHandler to

o.metricHandler = func() error {
	cList, err := o.OpClient.ListCustomResource(v1alpha1.GroupName, v1alpha1.GroupVersion, "", v1alpha1.ClusterServiceVersionKind)
	if err != nil {
		return err
	}
	metrics.CSVCount.Set(float64(len(cList.Items)))
	return nil
}

and handleMetrics just calls o.metricHandler().

(I think a more common Go pattern would be to define a MetricProvider interface with a MetricHandler() method, but if you want to go that route you should update the syncHandler to follow the same pattern 😄)
cmd/olm/main.go
//healthz.InstallHandler(mux) //(less code)
//mux.Handle("/metrics", promhttp.Handler()) //other form is deprecated
mux.Handle("/metrics", prometheus.Handler())
go http.ListenAndServe(":8080", mux)
Do you know if there's a convention for this? Is it common to serve metrics on the healthz port, or do people define a separate port for each?
Yeah, I think it's common to use the same port. Here's what the prometheus operator does (though they aren't using healthz here): https://github.com/coreos/prometheus-operator/blob/ec94db4149766312fb60a2d5fdc5aca3153386dd/cmd/operator/main.go#L201
looks like you can do both without a new muxer? https://github.com/coreos/etcd-operator/blob/master/cmd/operator/main.go#L105-L106
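A minimal sketch of that approach on the default mux (assuming promhttp from client_golang; the handler wiring is illustrative, not the PR's final code):

// Sketch: serve both /healthz and /metrics on one port without a separate muxer.
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Both endpoints register on http.DefaultServeMux, so no extra muxer is needed.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	http.Handle("/metrics", promhttp.Handler())
	go http.ListenAndServe(":8080", nil)

	// ... the rest of operator startup would continue here ...
	select {}
}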
@@ -345,6 +346,7 @@ func (a *Operator) checkReplacementsAndUpdateStatus(csv *v1alpha1.ClusterService
	log.Infof("newer ClusterServiceVersion replacing %s, no-op", csv.SelfLink)
	msg := fmt.Sprintf("being replaced by csv: %s", replacement.SelfLink)
	csv.SetPhase(v1alpha1.CSVPhaseReplacing, v1alpha1.CSVReasonBeingReplaced, msg)
	metrics.CSVUpgradeCount.Inc()
I wonder if we even need this if we have the other metrics. count(changes(csv_count{phase=Replacing} / 2)) might get us there? I'm thinking the graph for csv_count{phase=Replacing} will look like:

____/\_____/\____/\___

Though with a counter we'll eventually get an accurate count even if we can't scrape some of the events...
I left it in for now, but I'd be happy to remove it since it doesn't really fit in with the theme of the rest of them.
Force-pushed from e1b1671 to a6f3ac7.
This looks good! One small nit:
cmd/olm/main.go
//healthz.InstallHandler(mux) //(less code)
//mux.Handle("/metrics", promhttp.Handler()) //other form is deprecated
mux.Handle("/metrics", prometheus.Handler())
go http.ListenAndServe(":8080", mux)
looks like you can do both without a new muxer? https://github.com/coreos/etcd-operator/blob/master/cmd/operator/main.go#L105-L106
This adds a /metrics endpoint on the OLM container that exposes counts for various OLM-specific resources, which currently are: CSVs, InstallPlans, Subscriptions, and CatalogSources, plus a count for CSV upgrades. Some of the e2e code (getMetricsFromPod) was copied from the Kubernetes e2e metrics framework.
Force-pushed from a6f3ac7 to 2c4f3bf.
LGTM
Expose counts for various OLM-specific resources, which currently are:
CSVs
InstallPlans
Subscriptions
CatalogSources
It also adds a count for CSV upgrades.