
Metricsregistry #120

Closed · wants to merge 4 commits

Conversation

@jacksontj (Contributor) commented Jan 27, 2018

Rebase of #74, fixes #63

TL;DR: the statsd exporter currently doesn't allow for duplicate/conflicting tag-sets, which is a problem for this exporter's use case. The exporter interface has no such restriction; it's simply a limitation of the Go Prometheus client. This can be fixed by implementing our own registry, which is what this PR does.
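To make the restriction concrete, here is a minimal, self-contained sketch (not code from this PR; the metric and label names are invented) of what the stock client_golang registry refuses:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Two collectors that reuse a metric name with different label sets --
	// exactly the situation a statsd bridge has to cope with.
	narrow := prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "requests_total", Help: "Requests."},
		[]string{"job"},
	)
	wide := prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "requests_total", Help: "Requests."},
		[]string{"job", "region"},
	)

	reg := prometheus.NewRegistry()
	reg.MustRegister(narrow) // fine
	if err := reg.Register(wide); err != nil {
		// The stock registry rejects the second collector because the label
		// dimensions for requests_total are inconsistent.
		fmt.Println("registration refused:", err)
	}
}
```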

Apologies for the long delay, other things came up as higher priority. So here is a rebased version against current master to start the conversation from.

As far as the outstanding comments from the previous PR:

Please update the help string for both flags to make the difference very very clear. -- Updated with the most recent commit on the branch

What happens when these are exceeded? What happens for someone with very many metrics? -- this is a hard-coded value copied from prometheus client_golang; when the channel hits capacity, it blocks collection until the buffered metrics have been consumed, effectively acting as a memory limiter (see the sketch below).

that PR is closed. what's the future here? -- We'll have to maintain this registry long-term. If we'd prefer, I can pull the registry out into a separate package (a directory in the same repo).
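For illustration, a small sketch of that backpressure behaviour (the constant name and value mirror client_golang's, but treat them as assumptions here rather than anything from this PR):

```go
package sketch

import "github.com/prometheus/client_golang/prometheus"

// capMetricChan mirrors the hard-coded channel buffer size in client_golang's
// registry (assumed here for illustration).
const capMetricChan = 1000

// collectAll fans metrics from all collectors into one bounded channel. Once
// the buffer is full, each Collect call blocks until the consumer drains some
// metrics, so memory use is capped instead of growing without bound.
func collectAll(collectors []prometheus.Collector) <-chan prometheus.Metric {
	metricChan := make(chan prometheus.Metric, capMetricChan)
	go func() {
		defer close(metricChan)
		for _, c := range collectors {
			c.Collect(metricChan)
		}
	}()
	return metricChan
}
```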

cc: @matthiasr

For the generic statsd case we have some issues with the restrictions
placed on us by the default golang client registry (discussion:
https://groups.google.com/forum/#!topic/prometheus-developers/Q2pRR-UlHI0).

This commit implements a simplistic metrics registry that puts no restrictions on the registration or deregistration of metrics (since they are all done programmatically).

Before, all the exporter metrics were mixed in with the statsd metrics. This changes that so that the exporter's own metrics are served on /metrics (by default) and the statsd metrics on /statsd (by default).

Fixes prometheus#80
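A rough sketch of that split (the paths come from the commit message; the port is the exporter's usual default, and plain registries stand in for the PR's custom statsd registry so the example is self-contained):

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Exporter self-metrics and translated statsd metrics live in separate
	// registries and are exposed on separate paths.
	exporterReg := prometheus.NewRegistry() // exporter-internal metrics
	statsdReg := prometheus.NewRegistry()   // stand-in for the PR's statsd registry

	http.Handle("/metrics", promhttp.HandlerFor(exporterReg, promhttp.HandlerOpts{}))
	http.Handle("/statsd", promhttp.HandlerFor(statsdReg, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":9102", nil))
}
```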
@matthiasr (Contributor) left a comment

I still think having a full-on registry is a very heavy-handed and low-level approach. I'm not sure it's a price I want to pay for supporting this case, which can also be solved by coupling the application and exporter lifecycles.

If we do want to do it, I'd rather keep track of things separately and then generate ConstMetrics like in many other exporters. However, I don't know how we could do that with summaries and histograms.

If the client library were to allow modifying an existing metric to extend the label set, we could do that (after #119) when we observe a wider label set for the first time, and pad narrower label sets on observation. @beorn7, is this something you would be willing to support in general? You mentioned at some point that you wanted to support some solution for this in the library; what form would that take? It would avoid having partial copies of the library code in multiple projects. @beorn7 let's discuss options in the coming week?

	close(metricChan)
}()
for _, metric := range r.metricsByID {
	go func(collector prometheus.Collector) {
A contributor commented on this hunk:

As pointed out in #76 (comment) this causes a huge spike in memory consumption because it spawns a goroutine for every single time series.

prometheus/client_golang#370 is a potential fix for that.

@jacksontj (Contributor, Author) commented Jan 28, 2018 via email

@beorn7 (Member) commented Jan 29, 2018

Re-implementing the registry to create inconsistent metrics sounds like a really bad idea, to be honest.
The ideas for prometheus/client_golang#47 and prometheus/client_golang#355 are around allowing collectors to be declared "unchecked", and then the registry in the client library will try its best to pad missing labels (using empty label values). This will guarantee that every single exposition is consistent, but it will not guarantee consistency over the lifetime of a binary (which is probably a price we can/need to pay to get out of this conundrum).

@matthiasr (Contributor):

So we just discussed this separately – the linked issues don't quite cut it for the statsd exporter. They work well with custom collectors that generate ConstMetrics, but this exporter doesn't work that way. It could do so somewhat sensibly for counters and gauges, but we'd have to recreate the quantile calculation from scratch for summaries, and the bucketing for histograms.

Another idea that we've kicked around is tracking the labels for a given metric name (this would be easy on top of #119), and when they change, unregistering the metric and registering a completely new one. That's not allowed at the moment anyway, but could be allowed without breaking the semantics too much. On top of that, we could fill out with empty labels as needed anyway. This would mean that at the transition, we would throw away all previous observations (unless we dig into the protobuf to rescue them and copy them over into the new metrics somehow).

There's no really pretty solution yet …

@beorn7 (Member) commented Jan 29, 2018

More thoughts:

The "unregister" hack could be implemented more cleanly (and without changing the semantics of Unregister) by using a local (not the global) registry and simply register all metrics (the newly labeled and the old ones that haven't changed) with the new registry. Then throw the old HTTP handler and the old registry away and wire up the new registry.

WRT throwing away the old observations: With summaries, you want a decay time anyway (by default 10m), so assuming label changes are rare, that's not a big deal. For histograms, throwing away the old observations is essentially a counter reset and will be dealt with by Prometheus, i.e. you'll only lose about half a scrape interval's worth of event counts.

I'd say let's go with the "throw away on label change" approach, i.e. whenever a previously unaccounted-for label shows up, recreate all affected metrics with a zero count in a new registry, register all unaffected metrics with the new registry, rewire the HTTP handler to the new registry, and throw away the old one.

Does that make sense?
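For concreteness, a minimal sketch of that rewiring (hypothetical names, not code from this PR): here the HTTP handler stays in place and the registry behind it is swapped atomically whenever a label set widens.

```go
package sketch

import (
	"net/http"
	"sync/atomic"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// current holds the registry being exposed right now; rebuild must run once
// before the handler serves its first scrape.
var current atomic.Value

// rebuild creates a fresh registry, re-registers the unaffected collectors
// (the metric whose label set changed is recreated elsewhere with its new
// labels and a zero count), and swaps the new registry in.
func rebuild(unaffected []prometheus.Collector) {
	reg := prometheus.NewRegistry()
	for _, c := range unaffected {
		reg.MustRegister(c)
	}
	current.Store(reg)
}

// handler serves whatever registry is current at scrape time, so the HTTP mux
// itself never needs to be rewired.
func handler() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		reg := current.Load().(*prometheus.Registry)
		promhttp.HandlerFor(reg, promhttp.HandlerOpts{}).ServeHTTP(w, r)
	})
}
```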

@matthiasr (Contributor) commented Jan 30, 2018 via email

@brian-brazil (Contributor):

Technically no, but I can imagine that causing artifacts. On the other hand applications should not be changing labels at runtime.

@matthiasr (Contributor):

The statsd exporter is not, by itself, an application – I would still recommend coupling its lifecycle to that of the application, but I also recognize that this does not make it a drop-in replacement for statsd (where a central deployment is much more common). Now, with the scheme @beorn7 proposed, we would re-register and reset the metric that changed, but we would not have a mechanism to do that to all metrics exposed by one application at the same time (because we don't know what they all are).

I'm not saying it's a bad idea, just writing down thoughts about the limitations. Whatever we choose to do, I'd want to document the different choices and tradeoffs really well. We could even make the metrics-reregistering an optional feature (and drop the event otherwise).

@beorn7 (Member) commented Jan 30, 2018

Could there be any strange side effects from not resetting all metrics at the same time? Say, for request totals and errors?

To quote from my Prometheus proverb collection: “First do rate, then aggregate.” That means that counter resets are taken care of before aggregating. I guess the scenario you have in mind is a total counter that doesn't change its labels, while error counters suddenly arrive with an additional label. Thus, the error counters get a reset but not the total counter. With the usual exception of losing a few increments, it should be fine.

@matthiasr (Contributor):

@jacksontj is this (wiping the whole metric and recreating with the new label set) a solution that would fit your needs?

@grobie (Member) commented Jan 31, 2018

How would that work out during a deployment, where some instances send the new label set and others still the old one? If I understand the proposal correctly, the exporter would constantly throw away the existing metric and create a new one.

@matthiasr (Contributor):

I wouldn't do this for shrinking label sets, those I would just pad out indefinitely. So it would only happen once per metric name, the first time an event with the wider label set is seen. Renaming a label would result in a "wider" label set that includes both.
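As a concrete sketch of that padding (the helper below is hypothetical, not part of this PR): given the widest label-name set seen so far for a metric, any label an incoming event did not carry gets an empty value.

```go
package sketch

import "github.com/prometheus/client_golang/prometheus"

// padLabels fills out an event's labels against the widest label-name set
// seen so far for a metric; labels the event did not carry become "".
func padLabels(allNames []string, got map[string]string) prometheus.Labels {
	padded := make(prometheus.Labels, len(allNames))
	for _, name := range allNames {
		padded[name] = got[name] // a missing key yields the empty string
	}
	return padded
}
```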

@jacksontj (Contributor, Author):

@matthiasr that would mean that the metrics sent from clients to this exporter would have labels added with empty values. IMO that's not workable, as it means I can no longer query based on the existence of label keys, since they are being mutated before storage.

The intention of this exporter is to be (as much as possible) a drop-in replacement for a statsd server/aggregator/relay/etc. With that in mind, we should avoid mutating the metrics as much as possible -- we want to store exactly what we were sent.

@matthiasr (Contributor):

Fundamentally, that's an impossible task. The data models are not the same so we can't store exactly what we get sent – only Graphite can.

The Prometheus server does not distinguish between empty and non-present labels, so you can't query based on label key presence anyway – the query language does not allow that. The only way to test for "label is not present" is to match {label=""}.

@matthiasr (Contributor):

I'm going to close this – the concrete changeset is very old, and I believe with prometheus/client_golang#425 we can solve this without a whole new registry.

The concrete problem in #63 was fixed from another angle.

@matthiasr closed this Aug 4, 2018