Fix use of client_golang to allow inconsistent labels on metrics #114

Closed
shlimp opened this issue Dec 20, 2017 · 17 comments · Fixed by #194

shlimp commented Dec 20, 2017

From time to time the metrics endpoint dies and starts returning 500 errors, and the only way to restore it is to restart the pod it's on.
Logs are empty and I didn't see any option to switch on more verbose logging.
What can I do to provide a more detailed log and help debug this?

@matthiasr matthiasr added the bug label Jan 8, 2018
@matthiasr

Ouch, that sounds bad. Could this be related to some metric that is submitted and poisons the registry?

Logging flags are currently missing, see #111. Also, #80 would still allow getting metrics about the exporter itself even when that happens. That might at least point to where more logging is needed …

@matthiasr

Also cross-linking #63 for handling of inconsistent labels.

TimoL commented Feb 21, 2018

We are facing the exact same issue.

We operate a CI system based on Zuul [1] and Nodepool [2]. Both are configured to push their statsd data directly to statsd-exporter. As soon as there's some load on the system, statsd-exporter goes into that "return error 500" mode.

Thanks to K8S liveness probes this is detected and the statsd-exporter is restarted, but it enters failure mode again within seconds. Result: we're sometimes missing >75% of the timeseries data in Prometheus.

We assume that Zuul or Nodepool pushes some StatsD metrics that the statsd-exporter does not handle properly. We could not yet trace the exact event that triggers the error state in statsd-exporter, but we plan to investigate the issue.

Additional note:
To make things worse, the /-/healthy API still reports "ok". I consider this a separate bug.

[1] https://docs.openstack.org/infra/zuul/
[2] https://docs.openstack.org/infra/nodepool/

TimoL commented Feb 23, 2018

We did some more investigation into this issue. Our initial assumption

Zuul or Nodepool pushes some StatsD metrics that the statsd-exporter does not handle properly

seems to be incorrect. All received statsd data looks fine.

After enabling more logging, we discovered something else:

"Config file changed (\"/opt/statsd-mapping/statsd-mapping.conf\": MODIFY|ATTRIB), attempting reload" source="main.go:114"
"Config reloaded successfully" source="main.go:120"

These reloads happen every 1 to 2 minutes.

Background:
We're running the statsd-exporter Docker container on OpenShift (K8s), and the statsd-mapping.conf file is mounted into the container as a ConfigMap. This seems to trigger the config reloads.

Our current assumption is a race condition between

  • config reloads and
  • incoming statsd events

@matthiasr I would be very interested in your opinion on this.

In the meantime we moved the statsd-mapping.conf file into the docker image and the config reloads disappeared. Now we need to wait for the next period of high load. I will provide an update when we get some results.

TimoL commented Feb 26, 2018

Ok, our assumption of a config reload race condition is also wrong.
The implementation uses a mutex and looks solid.

After silencing these periodic config reloads we still see the same issue with HTTP 500 errors being returned.

Need to dig deeper into this. Any help or ideas are very welcome ;-)

@matthiasr

Is there anything in the logs of the exporter during this time? Any indication why it's throwing a 500?

TimoL commented Mar 2, 2018

Maybe a stupid question (because of my lacking Go skills):
How do I get some kind of verbose logging out of the statsd exporter (and the underlying libraries)?

For now I cannot see anything suspicious in the logs. No errors of any kind. Only the warnings covered by #63 (and these were gathered by simply adding info logs):

time="2018-03-02T06:24:29Z" level=info msg="lineToEvents:  zuul.geard.packet.WORK_COMPLETE:0\|ms" source="exporter.go:428"
  | time="2018-03-02T06:24:29Z" level=info msg="lineToEvents:  zuul.geard.packet.WORK_COMPLETE:1\|c" source="exporter.go:428"
  | time="2018-03-02T06:24:29Z" level=info msg="A change of configuration created inconsistent metrics for \"zuul_geard_packet\". You have to restart the statsd_exporter, and you should consider the effects on your monitoring setup. Error: duplicate metrics collector registration attempted" source="exporter.go:295"
``` 

@matthiasr

The error is happening in the HTTP endpoint handled by the Prometheus client library, and it does not log errors by default. Additionally, we are using the deprecated handler from github.com/prometheus/client_golang/prometheus.

I think the simplest way to get logging is to switch to promhttp and pass an error logger in the HandlerOpts.
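For reference, a minimal sketch of that switch, assuming the promhttp package from client_golang; the registry wiring and the :9102 port here are illustrative, not the exporter's actual code:

```go
package main

import (
	"log"
	"net/http"
	"os"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Serve the default registry via promhttp instead of the deprecated
	// prometheus.Handler(), and log any errors hit while gathering metrics.
	handler := promhttp.HandlerFor(
		prometheus.DefaultGatherer,
		promhttp.HandlerOpts{
			// A stdlib *log.Logger satisfies promhttp's error-logging interface.
			ErrorLog:      log.New(os.Stderr, "promhttp: ", log.LstdFlags),
			ErrorHandling: promhttp.ContinueOnError,
		},
	)
	http.Handle("/metrics", handler)
	log.Fatal(http.ListenAndServe(":9102", nil)) // illustrative listen address
}
```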

TimoL commented Mar 5, 2018

@matthiasr Thanks for the hint, will test this asap.

TimoL commented Mar 8, 2018

@matthiasr with your instructions I was able to get the error message (added line breaks for better readability, removed project data):

time="2018-03-08T16:48:07Z" level=error msg="error gathering metrics:
3 error(s) occurred:\n* collected metric zuul_job_results
label:<name:\"branch\" value:\"master\" >
label:<name:\"hostname\" value:\"xxxxxxxxxx\" >
label:<name:\"job\" value:\"xxxxxxxxxx\" > 
label:<name:\"pipeline\" value:\"xxxxxxxxxx\" > 
label:<name:\"project\" value:\"xxxxxxxxxx\" > 
label:<name:\"result\" value:\"xxxxxxxxxx\" > 
label:<name:\"tenant\" value:\"xxxxxxxxxx\" > 
counter:<value:1 >  should be a Summary\n* collected metric zuul_job_results 
label:<name:\"branch\" value:\"xxxxxxxxxx\" > 
label:<name:\"hostname\" value:\"xxxxxxxxxx\" > 
label:<name:\"job\" value:\"xxxxxxxxxx\" > 
label:<name:\"pipeline\" value:\"xxxxxxxxxx\" > 
label:<name:\"project\" value:\"xxxxxxxxxx\" > 
label:<name:\"result\" value:\"xxxxxxxxxx\" > 
label:<name:\"tenant\" value:\"xxxxxxxxxx\" > 
counter:<value:1 >  should be a Summary\n* collected metric zuul_job_results 
label:<name:\"branch\" value:\"master\" > 
label:<name:\"hostname\" value:\"xxxxxxxxxx\" > 
label:<name:\"job\" value:\"xxxxxxxxxx\" > 
label:<name:\"pipeline\" value:\"xxxxxxxxxx\" > 
label:<name:\"project\" value:\"xxxxxxxxxx\" > 
label:<name:\"result\" value:\"xxxxxxxxxx\" > 
label:<name:\"tenant\" value:\"xxxxxxxxxx\" > 
counter:<value:1 >  should be a Summary\n"
source="<autogenerated>:1"

@matthiasr

This looks like you are emitting a mixture of statsd timers and counters under the same metric name, or under different metric names that get mapped to zuul_job_results. The Prometheus client library rejects that because it can't merge the two.

As a simple solution, drop the counters: the summary already includes a _count series that gives the number of observed timer events.

Would you mind submitting a PR with the logging fix?
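For illustration, a standalone client_golang sketch of the conflict described above (the metric name is taken from the log, everything else is illustrative, and this is not the exporter's actual code path): once a summary is registered under a name, a counter cannot be registered under the same name.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	reg := prometheus.NewRegistry()

	// Timer events mapped to zuul_job_results become a summary ...
	timers := prometheus.NewSummaryVec(
		prometheus.SummaryOpts{Name: "zuul_job_results", Help: "Zuul job results."},
		[]string{"result"},
	)
	// ... while counter events under the same name become a counter.
	counters := prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "zuul_job_results", Help: "Zuul job results."},
		[]string{"result"},
	)

	fmt.Println(reg.Register(timers))   // <nil>
	fmt.Println(reg.Register(counters)) // duplicate metrics collector registration attempted
}
```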

TimoL commented Jun 15, 2018

A PR to handle the underlying issue was just proposed here: #136

@matthiasr

Some thoughts in light of the changes in the Prometheus client library (copy from #170):

In principle this is something I want to support, but it is not easy to implement. The Prometheus client really prefers consistent metrics; I believe we need to convert to the Collector form and ConstMetrics, and make sure to mark the collector as Unchecked. Alternatively (or additionally), we could fully handle expanding label sets ourselves, but that may have many edge cases that need handling.

In any case, I would wait until #164 is in, it gets us one step closer to this by handling metrics on a name-by-name basis.
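For reference, a rough sketch of that Collector/ConstMetric shape, assuming client_golang >= 0.9; the statsdCollector type and the sample data are illustrative only:

```go
package main

import (
	"fmt"
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

type sample struct {
	name   string
	labels prometheus.Labels
	value  float64
}

type statsdCollector struct {
	mu      sync.Mutex
	samples []sample
}

// An empty Describe marks this collector as "unchecked": the registry will
// not enforce a fixed label set per metric name at gather time.
func (c *statsdCollector) Describe(ch chan<- *prometheus.Desc) {}

// Collect emits each stored sample as a ConstMetric, so label sets can
// differ from one sample to the next.
func (c *statsdCollector) Collect(ch chan<- prometheus.Metric) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, s := range c.samples {
		desc := prometheus.NewDesc(s.name, "statsd-derived metric", nil, s.labels)
		ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, s.value)
	}
}

func main() {
	reg := prometheus.NewRegistry()
	c := &statsdCollector{samples: []sample{
		{"zuul_job_results", prometheus.Labels{"result": "SUCCESS"}, 1},
		{"zuul_job_results", prometheus.Labels{"result": "SUCCESS", "branch": "master"}, 1},
	}}
	reg.MustRegister(c)

	mfs, err := reg.Gather()
	fmt.Println(len(mfs), err) // "1 <nil>": both samples gather despite differing label sets
}
```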

@matthiasr

I would like the Collector to be in a separate package, so that I can reuse it in graphite_exporter; see that exporter for a simple Collector/ConstMetrics-style example. The statsd case is a superset because it also supports counters, summaries, and histograms.

@matthiasr

From #179:

Currently, if you try to update a metric with a label set that is inconsistent with what's registered, the metric gets dropped. As of release 0.9 of the Prometheus Go client, specifically 417, inconsistent label dimensions are now allowed.

Since this is now officially allowed, we should indeed support it. I think (but have not looked into it in detail) that the issue is that we send some metric descriptors, so we don't get treated as an unchecked collector.
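A small illustration of that point, assuming client_golang >= 0.9 (both collector types are hypothetical): whether a collector is treated as checked or unchecked depends only on what its Describe method sends.

```go
package sketch

import "github.com/prometheus/client_golang/prometheus"

// checkedCollector sends a descriptor, so the registry treats it as
// "checked" and enforces the described label dimensions at gather time.
type checkedCollector struct{ desc *prometheus.Desc }

func (c *checkedCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }
func (c *checkedCollector) Collect(ch chan<- prometheus.Metric) { /* ... */ }

// uncheckedCollector sends nothing from Describe, so it is "unchecked"
// and may emit metrics with varying label sets under the same name.
type uncheckedCollector struct{}

func (u *uncheckedCollector) Describe(ch chan<- *prometheus.Desc) {}
func (u *uncheckedCollector) Collect(ch chan<- prometheus.Metric) { /* ... */ }
```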

@matthiasr matthiasr changed the title 500 error on metrics endpoint Fix use of client_golang to allow inconsistent labels on metrics Jan 17, 2019
@vsakhart

@matthiasr Any plans to work on this soon or are you willing to take a PR?

@matthiasr

PRs are always welcome for any issue! Sorry I didn't make that clear.
