Fix use of client_golang to allow inconsistent labels on metrics #114

Closed
shlimp opened this issue Dec 20, 2017 · 17 comments · Fixed by #194

shlimp commented Dec 20, 2017

From time to time the metrics endpoint dies and starts returning 500 errors, and the only way to restore it is to restart the pod it's on.
Logs are empty and I didn't see any option to switch on more verbose logging.
What can I do to provide a more detailed log and help debug this?

@matthiasr matthiasr added the bug label Jan 8, 2018
@matthiasr

Ouch, that sounds bad. Could this be related to some metric that is submitted and poisons the registry?

Logging flags are currently missing, see #111. Also, #80 would still allow getting metrics about the exporter itself even when that happens. That might at least point to where more logging is needed …

@matthiasr

Also cross-linking #63 for handling of inconsistent labels.

TimoL commented Feb 21, 2018

We are facing the exact same issue.

We operate a CI system based on Zuul [1] and Nodepool [2]. Both are configured to push their statsd data directly to statsd-exporter. As soon as there's some load on the system, statsd-exporter goes into that "return error 500" mode.

Thanks to K8S liveness probes this is detected and the statsd-exporter is restarted, but it enters failure mode again within seconds. Result: we're sometimes missing >75% of the timeseries data in Prometheus.

We assume that Zuul or Nodepool pushes some StatsD metrics that the statsd-exporter does not handle properly. We could not yet trace the exact event that triggers the error state in statsd-exporter, but we plan to investigate the issue.

Additional note:
To make things worse, the /-/healthy API still reports "ok". I consider this a separate bug.

[1] https://docs.openstack.org/infra/zuul/
[2] https://docs.openstack.org/infra/nodepool/

TimoL commented Feb 23, 2018

We did some more investigation into this issue. Our initial assumption

Zuul or Nodepool pushes some StatsD metrics that the statsd-exporter does not handle properly

seems to be incorrect. All received statsd data looks fine.

After enabling more logging, we discovered something else:

"Config file changed (\"/opt/statsd-mapping/statsd-mapping.conf\": MODIFY|ATTRIB), attempting reload" source="main.go:114"
"Config reloaded successfully" source="main.go:120"

These reloads happen every 1 to 2 minutes.

Background:
We're running the statsd-exporter Docker container on OpenShift (K8s), and the statsd-mapping.conf file is mounted into the container as a ConfigMap. This seems to trigger the config reloads.

Our current assumption is a race condition between

  • config reloads and
  • incoming statsd events

@matthiasr I would be very interested in your opinion on this.

In the meantime we moved the statsd-mapping.conf file into the docker image and the config reloads disappeared. Now we need to wait for the next period of high load. I will provide an update when we get some results.

TimoL commented Feb 26, 2018

Ok, our assumption of a config reload race condition is also wrong.
The implementation uses a mutex and looks solid.

After silencing these periodic config reloads we still see the same issue with HTTP 500 errors being returned.

Need to dig deeper into this. Any help or ideas are very welcome ;-)

@matthiasr

Is there anything in the logs of the exporter during this time? Any indication why it's throwing a 500?

TimoL commented Mar 2, 2018

Maybe a stupid question (because of my lacking Go skills):
How do I get some kind of verbose logging out of the statsd exporter (and the underlying libraries)?

For now I cannot see anything suspicious in the logs. No errors of any kind. Only the warnings covered by #63 (and these were gathered by simply adding info logs):

time="2018-03-02T06:24:29Z" level=info msg="lineToEvents:  zuul.geard.packet.WORK_COMPLETE:0\|ms" source="exporter.go:428"
  | time="2018-03-02T06:24:29Z" level=info msg="lineToEvents:  zuul.geard.packet.WORK_COMPLETE:1\|c" source="exporter.go:428"
  | time="2018-03-02T06:24:29Z" level=info msg="A change of configuration created inconsistent metrics for \"zuul_geard_packet\". You have to restart the statsd_exporter, and you should consider the effects on your monitoring setup. Error: duplicate metrics collector registration attempted" source="exporter.go:295"
``` 

@matthiasr

The error is happening in the HTTP endpoint handled by the Prometheus client library, and it does not log errors by default. Additionally, we are using the deprecated handler from github.com/prometheus/client_golang/prometheus.

I think the simplest way to get logging is to switch to promhttp and pass an error logger in the HandlerOpts.
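For reference, a minimal sketch of that switch, assuming the promhttp package from client_golang; the registry wiring and the :9102 port here are illustrative, not the exporter's actual code:

```go
package main

import (
	"log"
	"net/http"
	"os"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Serve the default registry via promhttp instead of the deprecated
	// prometheus.Handler(), and log any errors hit while gathering metrics.
	handler := promhttp.HandlerFor(
		prometheus.DefaultGatherer,
		promhttp.HandlerOpts{
			// A stdlib *log.Logger satisfies promhttp's error-logging interface.
			ErrorLog:      log.New(os.Stderr, "promhttp: ", log.LstdFlags),
			ErrorHandling: promhttp.ContinueOnError,
		},
	)
	http.Handle("/metrics", handler)
	log.Fatal(http.ListenAndServe(":9102", nil)) // illustrative listen address
}
```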

TimoL commented Mar 5, 2018

@matthiasr Thanks for the hint, will test this asap.

TimoL commented Mar 8, 2018

@matthiasr with your instructions I was able to get the error message (added line breaks for better readability, removed project data):

time="2018-03-08T16:48:07Z" level=error msg="error gathering metrics:
3 error(s) occurred:\n* collected metric zuul_job_results
label:<name:\"branch\" value:\"master\" >
label:<name:\"hostname\" value:\"xxxxxxxxxx\" >
label:<name:\"job\" value:\"xxxxxxxxxx\" > 
label:<name:\"pipeline\" value:\"xxxxxxxxxx\" > 
label:<name:\"project\" value:\"xxxxxxxxxx\" > 
label:<name:\"result\" value:\"xxxxxxxxxx\" > 
label:<name:\"tenant\" value:\"xxxxxxxxxx\" > 
counter:<value:1 >  should be a Summary\n* collected metric zuul_job_results 
label:<name:\"branch\" value:\"xxxxxxxxxx\" > 
label:<name:\"hostname\" value:\"xxxxxxxxxx\" > 
label:<name:\"job\" value:\"xxxxxxxxxx\" > 
label:<name:\"pipeline\" value:\"xxxxxxxxxx\" > 
label:<name:\"project\" value:\"xxxxxxxxxx\" > 
label:<name:\"result\" value:\"xxxxxxxxxx\" > 
label:<name:\"tenant\" value:\"xxxxxxxxxx\" > 
counter:<value:1 >  should be a Summary\n* collected metric zuul_job_results 
label:<name:\"branch\" value:\"master\" > 
label:<name:\"hostname\" value:\"xxxxxxxxxx\" > 
label:<name:\"job\" value:\"xxxxxxxxxx\" > 
label:<name:\"pipeline\" value:\"xxxxxxxxxx\" > 
label:<name:\"project\" value:\"xxxxxxxxxx\" > 
label:<name:\"result\" value:\"xxxxxxxxxx\" > 
label:<name:\"tenant\" value:\"xxxxxxxxxx\" > 
counter:<value:1 >  should be a Summary\n"
source="<autogenerated>:1"

@matthiasr

This looks like you are emitting a mixture of statsd timers and counters under the same metric name, or under different metric names that get mapped to zuul_job_results. The Prometheus client library rejects that because it can't merge the two.

As a simple solution, drop the counters: the summary already includes a _count series that gives the number of observed timer events.

Would you mind submitting a PR with the logging fix?
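For illustration, a standalone client_golang sketch of the conflict described above (the metric name is taken from the log, everything else is illustrative, and this is not the exporter's actual code path): once a summary is registered under a name, a counter cannot be registered under the same name.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	reg := prometheus.NewRegistry()

	// Timer events mapped to zuul_job_results become a summary ...
	timers := prometheus.NewSummaryVec(
		prometheus.SummaryOpts{Name: "zuul_job_results", Help: "Zuul job results."},
		[]string{"result"},
	)
	// ... while counter events under the same name become a counter.
	counters := prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "zuul_job_results", Help: "Zuul job results."},
		[]string{"result"},
	)

	fmt.Println(reg.Register(timers))   // <nil>
	fmt.Println(reg.Register(counters)) // duplicate metrics collector registration attempted
}
```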

TimoL commented Jun 15, 2018

A PR to handle the underlying issue was just proposed here: #136

@matthiasr

Some thoughts in light of the changes in the Prometheus client library (copy from #170):

In principle this is something I want to support, but it is not easy to implement. The Prometheus client really prefers consistent metrics; I believe we need to convert to the Collector form and ConstMetrics, and make sure to mark the collector as Unchecked. Alternatively (or additionally), we could fully handle expanding label sets ourselves, but that may have many edge cases that need handling.

In any case, I would wait until #164 is in, it gets us one step closer to this by handling metrics on a name-by-name basis.
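For reference, a rough sketch of that Collector/ConstMetric shape, assuming client_golang >= 0.9; the statsdCollector type and the sample data are illustrative only:

```go
package main

import (
	"fmt"
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

type sample struct {
	name   string
	labels prometheus.Labels
	value  float64
}

type statsdCollector struct {
	mu      sync.Mutex
	samples []sample
}

// An empty Describe marks this collector as "unchecked": the registry will
// not enforce a fixed label set per metric name at gather time.
func (c *statsdCollector) Describe(ch chan<- *prometheus.Desc) {}

// Collect emits each stored sample as a ConstMetric, so label sets can
// differ from one sample to the next.
func (c *statsdCollector) Collect(ch chan<- prometheus.Metric) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, s := range c.samples {
		desc := prometheus.NewDesc(s.name, "statsd-derived metric", nil, s.labels)
		ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, s.value)
	}
}

func main() {
	reg := prometheus.NewRegistry()
	c := &statsdCollector{samples: []sample{
		{"zuul_job_results", prometheus.Labels{"result": "SUCCESS"}, 1},
		{"zuul_job_results", prometheus.Labels{"result": "SUCCESS", "branch": "master"}, 1},
	}}
	reg.MustRegister(c)

	mfs, err := reg.Gather()
	fmt.Println(len(mfs), err) // "1 <nil>": both samples gather despite differing label sets
}
```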

@matthiasr

I would like the Collector to be in a separate package, so that I can reuse it in graphite_exporter; see that exporter for a simple Collector/ConstMetrics-style example. The statsd case is a superset because it also supports counters, summaries, and histograms.

@matthiasr

From #179:

Currently, if you try to update a metric with a label set that is inconsistent with what's registered, the metric gets dropped. As of release 0.9 of the Prometheus Go client, specifically 417, inconsistent label dimensions are now allowed.

Since this is now officially allowed, we should indeed support it. I think (but have not looked into it in detail) that the issue is that we send some metric descriptors, so we don't get treated as an unchecked collector.
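A small illustration of that point, assuming client_golang >= 0.9 (both collector types are hypothetical): whether a collector is treated as checked or unchecked depends only on what its Describe method sends.

```go
package sketch

import "github.com/prometheus/client_golang/prometheus"

// checkedCollector sends a descriptor, so the registry treats it as
// "checked" and enforces the described label dimensions at gather time.
type checkedCollector struct{ desc *prometheus.Desc }

func (c *checkedCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }
func (c *checkedCollector) Collect(ch chan<- prometheus.Metric) { /* ... */ }

// uncheckedCollector sends nothing from Describe, so it is "unchecked"
// and may emit metrics with varying label sets under the same name.
type uncheckedCollector struct{}

func (u *uncheckedCollector) Describe(ch chan<- *prometheus.Desc) {}
func (u *uncheckedCollector) Collect(ch chan<- prometheus.Metric) { /* ... */ }
```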

@matthiasr matthiasr changed the title 500 error on metrics endpoint Fix use of client_golang to allow inconsistent labels on metrics Jan 17, 2019
@vsakhart

@matthiasr Any plans to work on this soon or are you willing to take a PR?

@matthiasr

PRs are always welcome for any issue! Sorry I didn't make that clear.
