
Handling of conflicting metric values #63

Closed
acdha opened this issue Mar 8, 2017 · 27 comments

@acdha

acdha commented Mar 8, 2017

We had some developers on a new project starting to integrate statsd support in their application. statsd_exporter 0.3.0 is crashing constantly with this error message:

FATA[0000] A change of configuration created inconsistent metrics for "query_timer". You have to restart the statsd_exporter, and you should consider the effects on your monitoring setup. Error: a previously registered descriptor with the same fully-qualified name as Desc{fqName: "query_timer", help: "Metric autogenerated by statsd_exporter.", constLabels: {after_date="2017-01-19 00:00:00-05:00",application="…",component="…",environment="testing",server="…",sub_component="…"}, variableLabels: []} has different label names or a different help string  source=exporter.go:137

If I'm reading that correctly, this is due to them having two versions of the app sending inconsistent metric values. While an error in this situation seems reasonable, it actually causes the statsd_exporter process to crash.

@amahomet

guys, any news here?

@grobie
Member

grobie commented May 25, 2017

What behavior would you expect here? We can't register both conflicting metrics and export them. Just logging the error and continuing would result in silently ignoring all conflicting metrics.

@acdha
Author

acdha commented May 25, 2017

I hit this problem as a report that “Prometheus is down” when some developers were rolling out a new build with different help text. Exiting just meant that it was in a loop with Upstart restarting the service, so I'm not sure that was substantially better than an ERROR log message.

@grobie
Member

grobie commented May 25, 2017

@acdha So you would prefer writing out log lines and ignoring all metrics with the new signature until someone manually restarts the statsd_exporter in that case?

@acdha
Author

acdha commented May 25, 2017

I guess it's a judgement call whether you think it's better to be disruptive, forcing people to actually notice the problem, or to allow other applications to continue sending stats without interruption.

@acdha
Author

acdha commented May 25, 2017

In my case, running an instance shared by several teams, it would have been preferable if only the project which changed its stats experienced a gap, but there is an argument that it'd also be acceptable to simply say “monitor process flapping better”.

@grobie
Member

grobie commented May 25, 2017

I guess I wouldn't share a statsd exporter between teams for such reasons. We generally tend to prefer failing hard and early, as everything else usually makes debugging very difficult. That's why I'm a bit hesitant to implement a solution which will silently ignore metrics.

In general, I'd recommend using direct instrumentation with our client libraries instead of relying on the statsd_exporter so much.

@acdha
Author

acdha commented May 25, 2017

Yeah, in this case it was a shared instance among developers working on the same project - the person who was working on an update was trying to figure out why he was getting error messages from the statsd client when it terminated prematurely.

@grobie
Member

grobie commented May 25, 2017

Given we ignore a lot of metrics already in statsd_exporter, I'd personally be fine accepting such a pull request. Changing the behavior would require changing the signature of the *Container.Get() methods, like

func (c *CounterContainer) Get(metricName string, labels prometheus.Labels) prometheus.Counter {
	hash := hashNameAndLabels(metricName, labels)
	counter, ok := c.Elements[hash]
	if !ok {
		counter = prometheus.NewCounter(prometheus.CounterOpts{
			Name:        metricName,
			Help:        defaultHelp,
			ConstLabels: labels,
		})
		c.Elements[hash] = counter
		if err := prometheus.Register(counter); err != nil {
			log.Fatalf(regErrF, metricName, err)
		}
	}
	return counter
}
Instead of logging the error directly, it should be returned to the caller; the caller should then only call log.Errorf instead of log.Fatalf and continue with the next metric.
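
For illustration, a minimal sketch of that change (not an actual patch; the error-returning signature and the caller-side handling are assumptions):

func (c *CounterContainer) Get(metricName string, labels prometheus.Labels) (prometheus.Counter, error) {
	hash := hashNameAndLabels(metricName, labels)
	counter, ok := c.Elements[hash]
	if !ok {
		counter = prometheus.NewCounter(prometheus.CounterOpts{
			Name:        metricName,
			Help:        defaultHelp,
			ConstLabels: labels,
		})
		// Surface the conflict to the caller instead of exiting; in this
		// sketch the counter is only cached once registration succeeded,
		// so a conflicting name is not kept around.
		if err := prometheus.Register(counter); err != nil {
			return nil, err
		}
		c.Elements[hash] = counter
	}
	return counter, nil
}

The event-handling loop would then check the returned error, call log.Errorf, and move on to the next event rather than terminating the process.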

@SuperQ
Member

SuperQ commented May 26, 2017

Also, it would be good to have an exporter internal error counter so you can monitor for the problem.
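
For example, such an internal counter could look like the following (the metric name here is made up for illustration; it uses the prometheus client library):

// Hypothetical internal metric counting conflicting statsd events.
var eventConflicts = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "statsd_exporter_metric_conflicts_total",
	Help: "Total number of statsd events that conflicted with an already registered metric.",
})

func init() {
	prometheus.MustRegister(eventConflicts)
}

The exporter would call eventConflicts.Inc() wherever registration fails, so an alert can fire on a non-zero rate.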

@avplab

avplab commented Jun 1, 2017

Guys, my question is probably not related to statsd_exporter, but after we aligned the metric labels so as not to crash statsd_exporter, how do we restart Prometheus to recalculate the metrics?

@SuperQ
Member

SuperQ commented Jun 1, 2017

@avplab Please take your question to our community.

@jacksontj
Contributor

I just opened a PR to fix this (#72). Although this fixes the immediate issue of "the exporter dies", it doesn't solve the longer-term issue of "you have to restart the exporter".

The most common case where this would be an issue is the following: an app exists and is emitting metrics. A new release of the app goes out which adds or removes some tags; at this point those metrics are "broken" until the exporter is restarted. In this situation the "old" metrics are no longer being emitted, and as such we could remove them (given some TTL).

Because of this I was thinking of adding a feature to basically TTL out metrics that haven't been emitted for a while if a new metric is being emitted. Alternatively there could be some API call to "unregister" a metric, but that seems fairly clunky (and not very "statsd-esque").

I figured I'd float the idea here first, as it is related to this larger issue. If one of those (or some other option) is wanted, I'll open a separate PR for the feature, so we can get the crash fix in more quickly.
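
For concreteness, a rough sketch of the TTL bookkeeping (everything here, names included, is an assumption and not part of the PR; it uses the prometheus and time packages):

// Hypothetical bookkeeping for the TTL idea: remember when each metric was
// last updated and drop collectors that have gone quiet.
type trackedCollector struct {
	collector prometheus.Collector
	lastSeen  time.Time
}

func expireStale(tracked map[string]*trackedCollector, ttl time.Duration) {
	now := time.Now()
	for name, t := range tracked {
		if now.Sub(t.lastSeen) > ttl {
			// Unregister frees the name so it can later be re-registered
			// with a different label set or help string.
			prometheus.Unregister(t.collector)
			delete(tracked, name)
		}
	}
}

The exporter would refresh lastSeen on every event for a metric and run expireStale periodically, e.g. from a time.Ticker.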

@grobie grobie closed this as completed in d9aa6e2 Jul 18, 2017
@grobie
Member

grobie commented Jul 18, 2017

Thanks a lot @jacksontj. I merged your contribution.

For the discussion of how to handle such conflicting metrics in general, I'd recommend writing to prometheus-developers@.

@jacksontj
Contributor

Just for linkage (if anyone is interested) here is the thread on the developer list -- https://groups.google.com/forum/#!topic/prometheus-developers/Q2pRR-UlHI0

jacksontj added a commit to jacksontj/statsd_exporter that referenced this issue Jul 20, 2017
This patch simply moves the error message from a log.Fatalf() to a
log.Errorf() to continue on.

Fixes prometheus#63
@matthiasr
Contributor

I closed #74 because it had gone stale, but I am going to reopen this issue to track the underlying reasoning.

@matthiasr matthiasr reopened this Jan 18, 2018
@matthiasr matthiasr changed the title Crash when receiving conflicting metric values Handling of conflicting metric values Jan 29, 2018
@matthiasr
Contributor

x-ref: more discussion in #120

@tobiashenkel

The StatsD -> graphite pipeline supports multiple metric types on the same name nicely (via the type namespacing support within the graphite backend). Furthermore, there are things out there which send, e.g., counters and timers on the same name.

e.g.:

time="2018-02-23T06:21:34Z" level=info msg="lineToEvents:  zuul.geard.packet.GRAB_JOB_UNIQ:0|ms" source="exporter.go:428"
time="2018-02-23T06:21:34Z" level=info msg="lineToEvents:  zuul.geard.packet.GRAB_JOB_UNIQ:1|c" source="exporter.go:428"

So what do you think about the idea of extending the mapper with a type match, so that we can map differently typed metrics of the same name to different Prometheus metric names?

@TimoL

TimoL commented Feb 23, 2018

@tobiashenkel that would be the ideal solution, I guess.

@matthiasr would you like that solution? Any hints how to implement it?

@matthiasr
Contributor

matthiasr commented Feb 23, 2018 via email

@tobiashenkel

What about something like this?

mappings:
- match: test.timing.*.*.*
  match_metric_type: counter|gauge|timer
  name: "my_timer"
  labels:
    provider: "$2"
    outcome: "$3"
    job: "${1}_server"

@grobie
Member

grobie commented Feb 23, 2018 via email

@tobiashenkel

Both would work: one is nearer to the actual metric line, the other is more comprehensive in the config file.

I'd be fine with either way.

@matthiasr
Contributor

matthiasr commented Feb 23, 2018 via email

@gaizeror

I don't know if it's worth creating a new issue, so I'll try here first.
We are using the Datadog statsd Python client, and we can't handle conflicts well today.
I am writing this comment because we found out some metrics weren't being sent to statsd_exporter for a week.

  1. There are no logs AFAIK when a conflict occurs. It would be great to have an option to enable these logs, so we can fail early.
  2. There is no way to "ignore" conflicts and duplicate the metric when labels are different.
  3. There is no way to unit test it.

Any suggestions on how to handle such cases?

@matthiasr
Contributor

Ideally this should be three new issues 😂

  1. I agree, if you can figure out a way to log this, please send a PR! Sometimes this isn't so easy because the conflicts only arise at scrape time in the Prometheus client.

  2. When labels are different, there should not be a conflict. Can you (in a new issue) detail how the conflict arises? Are there two metrics conflicting with each other, or one conflicting with a built-in metric (like in influxdb_exporter#37, "Errors when push standart golang client metrics to influxdb_exporet")?

@matthiasr
Contributor

  3. Again, in a new issue, could you detail what exactly you would like to unit test, and ideally how? Since your app and the exporter communicate over the network, I'm not sure what exactly you mean to unit test.
