Fix use of client_golang to allow inconsistent labels on metrics #114
Comments
Ouch, that sounds bad. Could this be related to some metric that is submitted and poisons the registry? Logging flags are currently missing, see #111. Also, #80 would still allow getting metrics about the exporter itself even when that happens. That might at least point to where more logging is needed …
Also cross-linking #63 for handling of inconsistent labels.
We are facing the exact same issue. We operate a CI system based on Zuul [1] and Nodepool [2]. Both are configured to push their statsd data directly to statsd-exporter. As soon as there is some load on the system, statsd-exporter goes into that "return error 500" mode. Thanks to K8s liveness probes this is detected and the statsd-exporter is restarted, but it enters the failure mode again within seconds. Result: we are sometimes missing >75% of the time series data in Prometheus. We assume that Zuul or Nodepool pushes some StatsD metrics that the statsd-exporter does not handle properly. We could not yet trace the exact event that triggers the error state in statsd-exporter, but we plan to investigate the issue.

Additional note:
[1] https://docs.openstack.org/infra/zuul/
We did some more investigation on that issue. Our initial assumption seems to be incorrect: all received statsd data seems to be fine. After enabling more logging, we discovered something else: config reloads happen every 1 to 2 minutes. The current assumption is a race condition around the config reload.

@matthiasr I would be very interested in your opinion on this. In the meantime we moved the …
Ok, our assumption of a config reload race condition is also wrong. After silencing these periodic config reloads we still see the same issue with "http error 500 returned". Need to dig deeper into this. Any help or idea is very welcome ;-)
Is there anything in the logs of the exporter during this time? Any indication why it's throwing a 500?
Maybe a stupid question (because of my lacking Go skills): for now I cannot see anything suspicious in the logs. No error of any kind. Only warnings covered by #63 (and these were gathered by simply adding info logs):
The error is happening in the HTTP endpoint handled by the Prometheus client library, and it is not logging errors by default. Additionally, we are using the deprecated handler from the client library. I think the simplest way to get logging is to switch to promhttp and pass an ErrorLogger in the HandlerOpts.
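For reference, a minimal sketch of that approach, assuming the default gatherer and a hypothetical listen address; the actual wiring in statsd_exporter may differ:

```go
package main

import (
	"log"
	"net/http"
	"os"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Serve /metrics via promhttp instead of the deprecated handler, and
	// attach a logger so gather/encode errors show up on stderr instead of
	// only surfacing as opaque HTTP 500 responses.
	handler := promhttp.HandlerFor(prometheus.DefaultGatherer, promhttp.HandlerOpts{
		ErrorLog: log.New(os.Stderr, "promhttp: ", log.LstdFlags),
	})
	http.Handle("/metrics", handler)
	log.Fatal(http.ListenAndServe(":9102", nil))
}
```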
@matthiasr Thanks for the hint, will test this asap.
@matthiasr with your instructions I was able to get the error message (added line breaks for better readability, removed project data):
This looks like you are emitting a mixture of statsd timers and counters under the same metric name, or under different metric names that get mapped to the same Prometheus name. As a simple solution, drop the counters – the summary already includes a count. Would you mind submitting a PR with the logging fix?
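To illustrate the conflict (this is not the exporter's actual code path, and the metric name is made up): once a summary has claimed a name in a client_golang registry, a counter under the same name is rejected, and inconsistencies of this kind are what the client library turns into errors on /metrics.

```go
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	reg := prometheus.NewRegistry()

	// A statsd timer mapped to this name becomes a Prometheus summary.
	timer := prometheus.NewSummary(prometheus.SummaryOpts{
		Name: "job_duration_seconds",
		Help: "statsd timer",
	})
	reg.MustRegister(timer)

	// A statsd counter arriving under the same name cannot coexist with it:
	// the registry already owns that fully-qualified name.
	counter := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "job_duration_seconds",
		Help: "statsd counter",
	})
	if err := reg.Register(counter); err != nil {
		log.Println("conflict:", err) // registration fails instead of mixing types
	}
}
```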
A PR to handle the underlying issue was just proposed here: #136
Some thoughts in light of the changes in the Prometheus client library (copy from #170): In principle this is something I want to support, but it is not easy to implement. The Prometheus client really prefers consistent metrics; I believe we need to convert to the Collector form and ConstMetrics, and make sure to mark the collector as Unchecked. Alternatively/additionally, we could fully handle expanding label sets ourselves, but that may have many edge cases that need handling. In any case, I would wait until #164 is in; it gets us one step closer to this by handling metrics on a name-by-name basis.
I would like the Collector to be in a separate package, so that I can reuse it in graphite_exporter. Also see that project for a simple Collector/ConstMetrics-style exporter. The statsd case is a superset because it also supports counters, summaries, and histograms.
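For illustration, a sketch of what such an unchecked Collector/ConstMetrics approach could look like; the `sample` type and the `samples()` accessor are hypothetical stand-ins for however the exporter stores received events, not actual statsd_exporter code:

```go
package collector

import "github.com/prometheus/client_golang/prometheus"

// sample is a hypothetical in-memory representation of a received statsd event.
type sample struct {
	name        string
	labelNames  []string
	labelValues []string
	value       float64
}

// uncheckedCollector exposes whatever samples are currently stored, even if
// the label set for a given metric name varies between scrapes.
type uncheckedCollector struct {
	samples func() []sample // hypothetical accessor for the exporter's state
}

// Describe intentionally sends no descriptors. A collector that describes
// nothing is registered as "unchecked" by client_golang, so the registry
// skips its consistency checks for the metrics it collects.
func (c uncheckedCollector) Describe(chan<- *prometheus.Desc) {}

// Collect builds ConstMetrics on the fly from the stored samples.
func (c uncheckedCollector) Collect(ch chan<- prometheus.Metric) {
	for _, s := range c.samples() {
		desc := prometheus.NewDesc(s.name, "metric received over statsd", s.labelNames, nil)
		ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, s.value, s.labelValues...)
	}
}
```

This only emits gauges; counters, summaries, and histograms would need their own ConstMetric constructors (NewConstSummary, NewConstHistogram, and so on), which is part of what makes the statsd case harder.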
From #179:
Since this is now officially allowed, we should indeed support it. I think (but have not looked into it in detail) that the issue is that we send some metric descriptors, so we don't get treated as an unchecked collector.
@matthiasr Any plans to work on this soon, or are you willing to take a PR?
PRs are always welcome for any issue! Sorry I didn't make that clear.
From time to time, it looks like the metrics endpoint dies and starts returning 500 errors, and the only way to restore it is to restart the pod it's on.
Logs are empty and I didn't see any option to switch on more verbose logging.
What can I do in order to provide a more detailed log and help debug it?