
Metricsregistry #120

Closed · wants to merge 4 commits

Conversation

@jacksontj (Contributor) commented Jan 27, 2018

Rebase of #74, fixes #63

TL;DR: the statsd exporter currently doesn't allow for duplicate/conflicting tag-sets, which is a problem for this exporter's use case. The exporter interface has no such restriction; it's simply a limitation of the Go Prometheus client. This can be fixed by implementing our own registry, which is what this PR does.
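To make the restriction concrete, here is a minimal, self-contained sketch (not code from this PR; the metric and label names are invented) of what the stock client_golang registry refuses:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Two collectors that reuse a metric name with different label sets --
	// exactly the situation a statsd bridge has to cope with.
	narrow := prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "requests_total", Help: "Requests."},
		[]string{"job"},
	)
	wide := prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "requests_total", Help: "Requests."},
		[]string{"job", "region"},
	)

	reg := prometheus.NewRegistry()
	reg.MustRegister(narrow) // fine
	if err := reg.Register(wide); err != nil {
		// The stock registry rejects the second collector because the label
		// dimensions for requests_total are inconsistent.
		fmt.Println("registration refused:", err)
	}
}
```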

Apologies for the long delay, other things came up as higher priority. So here is a rebased version against current master to start the conversation from.

As far as the outstanding comments from the previous PR:

Please update the help string for both flags to make the difference very very clear. -- Updated with the most recent commit on the branch

What happens when these are exceeded? What happens for someone with very many metrics? -- this is a hard-coded value copied from prometheus client_golang; when the channel hits capacity, it blocks collection until the buffered metrics have been consumed, effectively acting as a memory limiter (see the sketch below).

that PR is closed. what's the future here? -- We'll have to maintain this registry long-term. If we'd prefer, I can pull the registry out into a separate package (a directory in the same repo).
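For illustration, a small sketch of that backpressure behaviour (the constant name and value mirror client_golang's, but treat them as assumptions here rather than anything from this PR):

```go
package sketch

import "github.com/prometheus/client_golang/prometheus"

// capMetricChan mirrors the hard-coded channel buffer size in client_golang's
// registry (assumed here for illustration).
const capMetricChan = 1000

// collectAll fans metrics from all collectors into one bounded channel. Once
// the buffer is full, each Collect call blocks until the consumer drains some
// metrics, so memory use is capped instead of growing without bound.
func collectAll(collectors []prometheus.Collector) <-chan prometheus.Metric {
	metricChan := make(chan prometheus.Metric, capMetricChan)
	go func() {
		defer close(metricChan)
		for _, c := range collectors {
			c.Collect(metricChan)
		}
	}()
	return metricChan
}
```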

cc: @matthiasr

For the generic statsd case we have some issues with the restrictions
placed on us by the default golang client registry (discussion:
https://groups.google.com/forum/#!topic/prometheus-developers/Q2pRR-UlHI0).

This commit implements a simplistic metrics registry that puts no restrictions on the registration or deregistration of metrics (since they are all done programmatically).

Before, all the exporter metrics were mixed in with the statsd metrics. This changes that so that the exporter's own metrics are served on /metrics (by default) and the statsd metrics on /statsd (by default).

Fixes prometheus#80
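A rough sketch of that split (the paths come from the commit message; the port is the exporter's usual default, and plain registries stand in for the PR's custom statsd registry so the example is self-contained):

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Exporter self-metrics and translated statsd metrics live in separate
	// registries and are exposed on separate paths.
	exporterReg := prometheus.NewRegistry() // exporter-internal metrics
	statsdReg := prometheus.NewRegistry()   // stand-in for the PR's statsd registry

	http.Handle("/metrics", promhttp.HandlerFor(exporterReg, promhttp.HandlerOpts{}))
	http.Handle("/statsd", promhttp.HandlerFor(statsdReg, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":9102", nil))
}
```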
@matthiasr (Contributor) left a comment

I still think having a full-on registry is a very heavy-handed and low-level approach. I'm not sure it's a price I want to pay for supporting this case, which can also be solved by coupling the application and exporter lifecycles.

If we do want to do it, I'd rather keep track of things separately and then generate ConstMetrics like in many other exporters. However, I don't know how we could do that with summaries and histograms.

If the client library were to allow modifying an existing metric to extend the label set, we could do that (after #119) when we observe a wider label set for the first time, and pad narrower label sets on observation. @beorn7, is this something you would be willing to support in general? You mentioned at some point that you wanted to support some solution for this in the library; what form would that take? It would avoid having partial copies of the library code in multiple projects. @beorn7 let's discuss options in the coming week?

	close(metricChan)
}()
for _, metric := range r.metricsByID {
	go func(collector prometheus.Collector) {
A contributor commented on this hunk:

As pointed out in #76 (comment) this causes a huge spike in memory consumption because it spawns a goroutine for every single time series.

prometheus/client_golang#370 is a potential fix for that.

@jacksontj (Contributor, Author) commented Jan 28, 2018 via email

@beorn7 (Member) commented Jan 29, 2018

Re-implementing the registry to create inconsistent metrics sounds like a really bad idea, to be honest.
The ideas for prometheus/client_golang#47 and prometheus/client_golang#355 are around allowing collectors to be declared "unchecked", and then the registry in the client library will try its best to pad missing labels (using empty label values). This will guarantee that every single exposition is consistent, but it will not guarantee consistency over the lifetime of a binary (which is probably a price we can/need to pay to get out of this conundrum).

@matthiasr (Contributor):

So we just discussed this separately – the linked issues don't quite cut it for the statsd exporter. They work well with custom collectors that generate ConstMetrics, but this exporter doesn't work that way. It could do so somewhat sensibly for counters and gauges, but we'd have to recreate the quantile calculation from scratch for summaries, and the bucketing for histograms.

Another idea that we've kicked around is tracking the labels for a given metric name (this would be easy on top of #119), and when they change, unregistering the metric and registering a completely new one. That's not allowed at the moment anyway, but could be allowed without breaking the semantics too much. On top of that, we could fill out with empty labels as needed anyway. This would mean that at the transition, we would throw away all previous observations (unless we dig into the protobuf to rescue them and copy them over into the new metrics somehow).

There's no really pretty solution yet …

@beorn7 (Member) commented Jan 29, 2018

More thoughts:

The "unregister" hack could be implemented more cleanly (and without changing the semantics of Unregister) by using a local (not the global) registry and simply register all metrics (the newly labeled and the old ones that haven't changed) with the new registry. Then throw the old HTTP handler and the old registry away and wire up the new registry.

WRT throwing away the old observations: With summaries, you want a decay time anyway (by default 10m), so assuming label changes are rare, that's not a big deal. For histograms, throwing away the old observations is essentially a counter reset and will be dealt with by Prometheus, i.e. you'll only lose about half a scrape interval's worth of event counts.

I'd say let's go with the "throw away on label change" approach, i.e. whenever a previously unaccounted-for label shows up, recreate all affected metrics with a zero count in a new registry, register all unaffected metrics with the new registry, rewire the HTTP handler to the new registry, and throw away the old one.

Does that make sense?
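For concreteness, a minimal sketch of that rewiring (hypothetical names, not code from this PR): here the HTTP handler stays in place and the registry behind it is swapped atomically whenever a label set widens.

```go
package sketch

import (
	"net/http"
	"sync/atomic"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// current holds the registry being exposed right now; rebuild must run once
// before the handler serves its first scrape.
var current atomic.Value

// rebuild creates a fresh registry, re-registers the unaffected collectors
// (the metric whose label set changed is recreated elsewhere with its new
// labels and a zero count), and swaps the new registry in.
func rebuild(unaffected []prometheus.Collector) {
	reg := prometheus.NewRegistry()
	for _, c := range unaffected {
		reg.MustRegister(c)
	}
	current.Store(reg)
}

// handler serves whatever registry is current at scrape time, so the HTTP mux
// itself never needs to be rewired.
func handler() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		reg := current.Load().(*prometheus.Registry)
		promhttp.HandlerFor(reg, promhttp.HandlerOpts{}).ServeHTTP(w, r)
	})
}
```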

@matthiasr (Contributor) commented Jan 30, 2018 via email

@brian-brazil (Contributor):

Technically no, but I can imagine that causing artifacts. On the other hand applications should not be changing labels at runtime.

@matthiasr (Contributor):

The statsd exporter is not, by itself, an application – I would still recommend coupling its lifecycle to that of the application, but I also recognize that this does not make it a drop-in replacement for statsd (where a central deployment is much more common). Now, with the scheme @beorn7 proposed, we would re-register and reset the metric that changed, but we would not have a mechanism to do that to all metrics exposed by one application at the same time (because we don't know what they all are).

I'm not saying it's a bad idea, just writing down thoughts about the limitations. Whatever we choose to do, I'd want to document the different choices and tradeoffs really well. We could even make the metrics-reregistering an optional feature (and drop the event otherwise).

@beorn7 (Member) commented Jan 30, 2018

Could there be any strange side effects from not resetting all metrics at the same time? Say, for request totals and errors?

To quote from my Prometheus proverb collection: “First do rate, then aggregate.” That means that counter resets are taken care of before aggregating. I guess the scenario you have in mind is a total counter that doesn't change its labels, while error counters suddenly arrive with an additional label. Thus, the error counters get a reset but not the total counter. With the usual exception of losing a few increments, it should be fine.

@matthiasr (Contributor):

@jacksontj is this (wiping the whole metric and recreating with the new label set) a solution that would fit your needs?

@grobie (Member) commented Jan 31, 2018

How would that work out during a deployment, where some instances send the new label set and others still the old one? If I understand the proposal correctly, the exporter would constantly throw away the existing metric and create a new one.

@matthiasr (Contributor):

I wouldn't do this for shrinking label sets, those I would just pad out indefinitely. So it would only happen once per metric name, the first time an event with the wider label set is seen. Renaming a label would result in a "wider" label set that includes both.
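As a concrete sketch of that padding (the helper below is hypothetical, not part of this PR): given the widest label-name set seen so far for a metric, any label an incoming event did not carry gets an empty value.

```go
package sketch

import "github.com/prometheus/client_golang/prometheus"

// padLabels fills out an event's labels against the widest label-name set
// seen so far for a metric; labels the event did not carry become "".
func padLabels(allNames []string, got map[string]string) prometheus.Labels {
	padded := make(prometheus.Labels, len(allNames))
	for _, name := range allNames {
		padded[name] = got[name] // a missing key yields the empty string
	}
	return padded
}
```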

@jacksontj (Contributor, Author):

@matthiasr that would mean that the metrics sent from clients to this exporter would have labels added with empty values. IMO that's not workable, as it means I can no longer query based on the existence of label keys, since they are being mutated before storage.

The intention of this exporter is to be (as much as possible) a drop-in replacement for a statsd server/aggregator/relay/etc. With that in mind, we should avoid mutating the metrics as much as possible -- we want to store exactly what we were sent.

@matthiasr (Contributor):

Fundamentally, that's an impossible task. The data models are not the same so we can't store exactly what we get sent – only Graphite can.

The Prometheus server does not distinguish between empty and non-present labels, so you can't query based on label key presence anyway – the query language does not allow that. The only way to test for "label is not present" is to match {label=""}.

@matthiasr (Contributor):

I'm going to close this – the concrete changeset is very old, and I believe with prometheus/client_golang#425 we can solve this without a whole new registry.

The concrete problem in #63 was fixed from another angle.

@matthiasr closed this Aug 4, 2018