
Handling of conflicting metric values #63

Closed
acdha opened this issue Mar 8, 2017 · 27 comments

@acdha

acdha commented Mar 8, 2017

We had some developers on a new project starting to integrate statsd support in their application. statsd_exporter 0.3.0 is crashing constantly with this error message:

FATA[0000] A change of configuration created inconsistent metrics for "query_timer". You have to restart the statsd_exporter, and you should consider the effects on your monitoring setup. Error: a previously registered descriptor with the same fully-qualified name as Desc{fqName: "query_timer", help: "Metric autogenerated by statsd_exporter.", constLabels: {after_date="2017-01-19 00:00:00-05:00",application="…",component="…",environment="testing",server="…",sub_component="…"}, variableLabels: []} has different label names or a different help string  source=exporter.go:137

If I'm reading that correctly, this is due to them having two versions of the app sending inconsistent metric values. While an error in this situation seems reasonable, it actually causes the statsd_exporter process to crash.

@amahomet

guys, any news here?

@grobie
Member

grobie commented May 25, 2017

What behavior would you expect here? We can't register both conflicting metrics and export them. Just logging the error and continuing would result in silently ignoring all conflicting metrics.

@acdha
Author

acdha commented May 25, 2017

I hit this problem as a report that “Prometheus is down” when some developers were rolling out a new build with different help text. Exiting just meant that it was in a loop with Upstart restarting the service, so I'm not sure that was substantially better than an ERROR log message.

@grobie
Member

grobie commented May 25, 2017

@acdha So you would prefer writing out log lines and ignoring all metrics with the new signature until someone manually restarts the statsd_exporter in that case?

@acdha
Author

acdha commented May 25, 2017

I guess it's a judgement call whether you think it's better to be disruptive, forcing people to actually notice the problem, or to allow other applications to continue sending stats without interruption.

@acdha
Author

acdha commented May 25, 2017

In my case, running an instance shared by several teams, it would have been preferable if only the project which changed its stats experienced a gap, but there is an argument that it'd also be acceptable to simply say “monitor process flapping better”.

@grobie
Member

grobie commented May 25, 2017

I guess I wouldn't share a statsd exporter between teams for such reasons. We generally tend to prefer failing hard and early, as everything else usually makes debugging very difficult. That's why I'm a bit hesitant to implement a solution which will silently ignore metrics.

In general, I'd recommend using direct instrumentation with our client libraries instead of relying on the statsd_exporter so much.

@acdha
Author

acdha commented May 25, 2017

Yeah, in this case it was a shared instance among developers working on the same project - the person who was working on an update was trying to figure out why he was getting error messages from the statsd client when it terminated prematurely.

@grobie
Member

grobie commented May 25, 2017

Given we ignore a lot of metrics already in statsd_exporter, I'd personally be fine accepting such a pull request. Changing the behavior would require changing the signature of the *Container.Get() methods, like

func (c *CounterContainer) Get(metricName string, labels prometheus.Labels) prometheus.Counter {
	hash := hashNameAndLabels(metricName, labels)
	counter, ok := c.Elements[hash]
	if !ok {
		counter = prometheus.NewCounter(prometheus.CounterOpts{
			Name:        metricName,
			Help:        defaultHelp,
			ConstLabels: labels,
		})
		c.Elements[hash] = counter
		if err := prometheus.Register(counter); err != nil {
			log.Fatalf(regErrF, metricName, err)
		}
	}
	return counter
}
Instead of logging the error directly, it should be returned to the caller; the caller should then only call log.Errorf instead of log.Fatalf and continue with the next metric.
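
For illustration, a minimal sketch of that change (not an actual patch; the error-returning signature and the caller-side handling are assumptions):

func (c *CounterContainer) Get(metricName string, labels prometheus.Labels) (prometheus.Counter, error) {
	hash := hashNameAndLabels(metricName, labels)
	counter, ok := c.Elements[hash]
	if !ok {
		counter = prometheus.NewCounter(prometheus.CounterOpts{
			Name:        metricName,
			Help:        defaultHelp,
			ConstLabels: labels,
		})
		// Surface the conflict to the caller instead of exiting; in this
		// sketch the counter is only cached once registration succeeded,
		// so a conflicting name is not kept around.
		if err := prometheus.Register(counter); err != nil {
			return nil, err
		}
		c.Elements[hash] = counter
	}
	return counter, nil
}

The event-handling loop would then check the returned error, call log.Errorf, and move on to the next event rather than terminating the process.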

@SuperQ
Member

SuperQ commented May 26, 2017

Also, it would be good to have an exporter internal error counter so you can monitor for the problem.
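
For example, such an internal counter could look like the following (the metric name here is made up for illustration; it uses the prometheus client library):

// Hypothetical internal metric counting conflicting statsd events.
var eventConflicts = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "statsd_exporter_metric_conflicts_total",
	Help: "Total number of statsd events that conflicted with an already registered metric.",
})

func init() {
	prometheus.MustRegister(eventConflicts)
}

The exporter would call eventConflicts.Inc() wherever registration fails, so an alert can fire on a non-zero rate.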

@avplab

avplab commented Jun 1, 2017

Guys, my question is probably not related to statsd_exporter, but after we aligned the metric labels so as not to crash statsd_exporter, how do we restart Prometheus to recalculate the metrics?

@SuperQ
Member

SuperQ commented Jun 1, 2017

@avplab Please take your question to our community.

@jacksontj
Contributor

I just opened a PR to fix this (#72). Although this fixes the immediate issue of "the exporter dies", it doesn't solve the longer-term issue of "you have to restart the exporter".

The most common case where this would be an issue is the following: an app exists and is emitting metrics. A new release of the app goes out which adds or removes some tags; at this point those metrics are "broken" until the exporter is restarted. In this situation the "old" metrics are no longer being emitted, and as such we could remove them (given some TTL).

Because of this I was thinking of adding a feature to basically TTL out metrics that haven't been emitted for a while if a new metric is being emitted. Alternatively there could be some API call to "unregister" a metric, but that seems fairly clunky (and not very "statsd-esque").

I figured I'd float the idea here first, as it is related to this larger issue. If one of those (or some other option) is wanted, I'll open a separate PR for the feature, so we can get the crash fix in more quickly.
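
For concreteness, a rough sketch of the TTL bookkeeping (everything here, names included, is an assumption and not part of the PR; it uses the prometheus and time packages):

// Hypothetical bookkeeping for the TTL idea: remember when each metric was
// last updated and drop collectors that have gone quiet.
type trackedCollector struct {
	collector prometheus.Collector
	lastSeen  time.Time
}

func expireStale(tracked map[string]*trackedCollector, ttl time.Duration) {
	now := time.Now()
	for name, t := range tracked {
		if now.Sub(t.lastSeen) > ttl {
			// Unregister frees the name so it can later be re-registered
			// with a different label set or help string.
			prometheus.Unregister(t.collector)
			delete(tracked, name)
		}
	}
}

The exporter would refresh lastSeen on every event for a metric and run expireStale periodically, e.g. from a time.Ticker.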

@grobie grobie closed this as completed in d9aa6e2 Jul 18, 2017
@grobie
Member

grobie commented Jul 18, 2017

Thanks a lot @jacksontj. I merged your contribution.

For the discussion of how to handle such conflicting metrics in general, I'd recommend writing to prometheus-developers@.

@jacksontj
Contributor

Just for linkage (if anyone is interested) here is the thread on the developer list -- https://groups.google.com/forum/#!topic/prometheus-developers/Q2pRR-UlHI0

jacksontj added a commit to jacksontj/statsd_exporter that referenced this issue Jul 20, 2017
This patch simply moves the error message from a log.Fatalf() to a
log.Errorf() to continue on.

Fixes prometheus#63
@matthiasr
Contributor

I closed #74 because it had gone stale, but I am going to reopen this issue to track the underlying reasoning.

@matthiasr matthiasr reopened this Jan 18, 2018
@matthiasr matthiasr changed the title Crash when receiving conflicting metric values Handling of conflicting metric values Jan 29, 2018
@matthiasr
Contributor

x-ref: more discussion in #120

@tobiashenkel

The StatsD -> graphite pipeline supports multiple metric types on the same name nicely (via the type namespacing support within the graphite backend). Furthermore, there are things out there which send, e.g., counters and timers on the same name.

e.g.:

time="2018-02-23T06:21:34Z" level=info msg="lineToEvents:  zuul.geard.packet.GRAB_JOB_UNIQ:0|ms" source="exporter.go:428"
time="2018-02-23T06:21:34Z" level=info msg="lineToEvents:  zuul.geard.packet.GRAB_JOB_UNIQ:1|c" source="exporter.go:428"

So what do you think about the idea of extending the mapper with a type match, so that we can map differently typed metrics of the same name to different Prometheus metric names?

@TimoL

TimoL commented Feb 23, 2018

@tobiashenkel that would be the ideal solution, I guess.

@matthiasr would you like that solution? Any hints how to implement it?

@matthiasr
Contributor

matthiasr commented Feb 23, 2018 via email

@tobiashenkel

What about something like this?

mappings:
- match: test.timing.*.*.*
  match_metric_type: counter|gauge|timer
  name: "my_timer"
  labels:
    provider: "$2"
    outcome: "$3"
    job: "${1}_server"

@grobie
Member

grobie commented Feb 23, 2018 via email

@tobiashenkel

Both would work: one is nearer to the actual metric line, the other is more comprehensive in the config file.

I'd be fine with either way.

@matthiasr
Contributor

matthiasr commented Feb 23, 2018 via email

@gaizeror

I don't know if it's worth creating a new issue, so I'll try here first.
We are using the Datadog statsd Python client, and we can't handle conflicts well today.
I am writing this comment because we found out some metrics weren't being sent to statsd_exporter for a week.

  1. There are no logs AFAIK when a conflict occurs. It would be great to have an option to enable these logs, so we can fail early.
  2. There is no way to "ignore" conflicts and duplicate the metric when labels are different.
  3. There is no way to unit test it.

Any suggestions on how to handle such cases?

@matthiasr
Contributor

Ideally this should be three new issues 😂

  1. I agree, if you can figure out a way to log this, please send a PR! Sometimes this isn't so easy because the conflicts only arise at scrape time in the Prometheus client.

  2. When labels are different, there should not be a conflict. Can you (in a new issue) detail how the conflict arises? Are there two metrics conflicting with each other, or one conflicting with a built-in metric (like in influxdb_exporter#37, "Errors when push standart golang client metrics to influxdb_exporet")?

@matthiasr
Contributor

  3. Again, in a new issue, could you detail what exactly you would like to unit test, and ideally how? Since your app and the exporter communicate over the network, I'm not sure what exactly you mean to unit test.
