Out of order exemplars with Prometheus Exporter #4193
Comments
@anuraaga @jkwatson it sounds like this may belong in https://github.com/open-telemetry/opentelemetry-java
Hi @irizzant - just wondering, are you able to provide the actual Prometheus HTTP output that is scraped? That would provide some context: I am curious whether there is a collection bug that is causing what should be valid timestamps to be recorded incorrectly, whether it's an output issue, or something else. We may be able to remove the timestamp, but I wouldn't expect the exemplar timestamp to be higher than the bucket timestamp in general.
Hi @anuraaga I can't provide the actual Prometheus HTTP output, but the out-of-order exemplars seem to have disappeared from the logs on their own. Nevertheless, rather than closing the issue, I'd still consider removing the timestamps from metrics.
Hi @anuraaga You can see lines like these:
where the exemplar timestamp is out of order. Please note this HTTP output comes from a single VM instrumented with the OpenTelemetry Java agent.
Any update on the above?
@irizzant I don't think anyone has had a chance to look into this. Are you able to provide a PR with the fix?
@jkwatson if you're asking me, no, I'm not able to do that. IMHO a possible fix could be removing timestamps from metrics.
I confirm the issue is still present with version 1.12.0. It would be nice if someone could help fix this issue, or point me to where I should make changes to remove timestamps from metrics.
@irizzant Can you point at a Prometheus specification for how exemplars are supposed to look? That might help get some clarity on what the solution should be (I'm not a Prometheus expert).
@jkwatson actually Prometheus acts just as a backend for the OpenTelemetry metrics format; the way metrics with exemplars should look is defined in the OpenTelemetry specs. Prometheus is complaining about out-of-order exemplars because the timestamps associated with the exemplars are out of order, so removing them would fix the issue.
If Prometheus gets grumpy about out-of-order exemplars, then the spec should have the required behavior documented, I think (it doesn't right now, of course).
@jkwatson in the specs I reported you can read:
So having a timestamp is not mandatory according to the specs, and it can be removed from the Prometheus exporter.
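(For reference: in the OpenMetrics text format the exemplar is the part after the # on a bucket line, and the trailing timestamp after the exemplar value is optional. A minimal sketch with an invented metric name, labels, and values, not taken from this exporter's actual output:)

demo_request_duration_bucket{le="25"} 8 # {trace_id="abc123",span_id="def456"} 15.2
demo_request_duration_bucket{le="100"} 11 # {trace_id="abc123",span_id="0123ab"} 42.0 1644500109.257

The first exemplar omits the timestamp entirely, which OpenMetrics allows; the second carries one as a Unix timestamp in seconds.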
I understand that, but I'd like to have this timestamp ordering also called out in the spec, if Prometheus requires it.
The OpenTelemetry spec is not for Prometheus only, so I don't understand why it should be reported there. Regardless, removing timestamps from exemplars doesn't depend on mentioning this issue and doesn't break compatibility with the specs. Of course this is not the ONLY solution; one could fix the out-of-order exemplars exported by the exporter, but keeping the timestamps is not necessary and causes problems like these.
Hi @irizzant - the OTel spec requires adding timestamps to exemplars when converting, so we would need to change the specification before updating here. We're not quite familiar enough with Prometheus to know what the best fix for the timestamps themselves would be. Intuitively, I can't find anything wrong with what we're exporting: we finalize aggregation at collection time, which gives the metric its timestamp, but by their nature exemplars come from some time in the past, so the reported timestamps seem correct. OpenMetrics, the exposition format, does not specify any restrictions on exemplar timestamps, so I can only believe that this is a bug in Prometheus's handling of the format. If the correct workaround is just not to report exemplar timestamps, then we should update the spec for that. I'd also confirm you're on the latest version of Prometheus and consider filing a bug against Prometheus to see what they think. Given the wording in the OpenMetrics spec, the current OTel spec also seems reasonable.
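(To make the timing argument above concrete, here is a minimal Java sketch; the instrument and span names are illustrative, and the comments assume the SDK's default trace-based exemplar selection. An exemplar is captured at record() time, inside the request's span, while the histogram point only gets its timestamp later, at scrape/collection time.)

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.metrics.DoubleHistogram;
import io.opentelemetry.api.metrics.Meter;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class ExemplarTimingSketch {
  public static void main(String[] args) throws InterruptedException {
    Meter meter = GlobalOpenTelemetry.get().getMeter("demo");
    Tracer tracer = GlobalOpenTelemetry.get().getTracer("demo");
    DoubleHistogram duration = meter.histogramBuilder("demo.request.duration").build();

    // Recording inside a sampled span is what produces an exemplar: the exemplar
    // carries the span/trace ids and the wall-clock time of the record() call.
    Span span = tracer.spanBuilder("handle-request").startSpan();
    try (Scope ignored = span.makeCurrent()) {
      duration.record(15.2); // exemplar timestamp is roughly "now"
    } finally {
      span.end();
    }

    // The histogram point itself is only stamped when the Prometheus exporter
    // collects it during a scrape, so every exemplar timestamp is necessarily
    // in the past relative to the point's timestamp.
    Thread.sleep(30_000); // a scrape happening here still exposes the older exemplar
  }
}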
Hi @anuraaga If you check this comment you can see for yourself that in a single scrape the exporter produces two out-of-order timestamps:
First it produces 1645794322.137, then 1645793793.776.
@irizzant But these are two time series, so ordering should be independent for them, right? Exemplars for one bucket will happen at times unrelated to those in another bucket.
Though I found that OpenMetrics seems to recommend not including timestamps anywhere, not just on exemplars. I filed this issue
@anuraaga in the same comment I mentioned you can find the raw output of the whole scrape.
All the timestamps reported within it are in increasing order, and in my opinion this is expected, since we're talking about a single scrape of a single app on a single VM: samples for one histogram should be collected in order. As you wrote, timestamps in metrics (and also in exemplars) are not a good practice and lead to problems, which is why I suggested removing them.
Yes I looked at the output. It matched my expectations.
Just to clarify, I think this misses what an exemplar actually is - if the first request within the collection interval was very slow and most of the others were fast after that, then the exemplar for the 99% bucket would definitely have come before the exemplar for the 50% bucket. There is an ordering we can expect of exemplars within a bucket, but ordering between buckets doesn't really make conceptual sense, since buckets are distinguished by latency, not by timing.
I get what you're saying, but this would mean it is conceptually impossible to collect exemplars and require ordering at the same time, because for any given bucket there could always be one request slower than the others.
Indeed, I believe it is impossible to require ordering between exemplars within a histogram's buckets. Exemplars are exemplary data within a collection interval - a subsequent collection interval will have its own exemplars. Exemplars will always have a timestamp in the past relative to the collection timestamp - in other words, the histogram's timestamp, since that's when the histogram is computed. But given that the order of exemplary requests is driven entirely by the actual request pattern (e.g. user requests against an API server), there is no way for the exemplars to have any expected ordering among the buckets. I'm not sure why Prometheus would reject this, and I do believe it's a bug in Prometheus. But we'll probably end up working around it by removing the timestamps globally anyway.
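(A toy, self-contained sketch of the scenario just described; all numbers and bucket boundaries are made up, and this is not the exporter's actual code. A slow request recorded early and a fast request recorded late give the low bucket a newer exemplar timestamp than the high bucket, so walking the buckets in boundary order yields "out-of-order" exemplar timestamps even though every recording happened in order.)

public class BucketOrderingSketch {
  public static void main(String[] args) {
    long[] boundaries = {25, 100, 1000};             // bucket upper bounds in ms (illustrative)
    long[] exemplarTs = new long[boundaries.length]; // last exemplar timestamp seen per bucket

    // {latencyMs, recordTimeMs}: the slow request happens first, the fast one later.
    long[][] requests = {{900, 900}, {10, 960}};

    for (long[] r : requests) {
      for (int i = 0; i < boundaries.length; i++) {
        if (r[0] <= boundaries[i]) { // the bucket this latency falls into
          exemplarTs[i] = r[1];      // the exemplar keeps the time of the recording
          break;
        }
      }
    }

    // Prints le=25 exemplarTs=960, le=100 exemplarTs=0, le=1000 exemplarTs=900:
    // scanning buckets from smallest to largest, the timestamps are not increasing,
    // and no recording order could make them increase in general.
    for (int i = 0; i < boundaries.length; i++) {
      System.out.println("le=" + boundaries[i] + " exemplarTs=" + exemplarTs[i]);
    }
  }
}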
Describe the bug
I have a Java app instrumented with OTEL agent.
This app exports metrics with OTEL Prometheus exporter.
Prometheus collects metrics + exemplars from the app endpoint, but its log is filled with out-of-order exemplar errors:
logs
ts=2022-02-10T13:37:46.764Z caller=scrape.go:1637 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring-custom-metrics/edimonitor-custom-metrics/1 target=http://10.101.25.88:9464/metrics msg="Out of order exemplar" exemplar="{Labels:{span_id=\"80e23b73d43c2784\", trace_id=\"3ce14cee2b67171958be2872540b1392\"} Value:15.246967 Ts:1644500109257 HasTs:true}" ts=2022-02-10T13:37:46.765Z caller=scrape.go:1637 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring-custom-metrics/edimonitor-custom-metrics/1 target=http://10.101.25.88:9464/metrics msg="Out of order exemplar" exemplar="{Labels:{span_id=\"b805ddedebd2b9a3\", trace_id=\"3ce14cee2b67171958be2872540b1392\"} Value:16.11189 Ts:1644500108869 HasTs:true}" ts=2022-02-10T13:37:46.765Z caller=scrape.go:1637 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring-custom-metrics/edimonitor-custom-metrics/1 target=http://10.101.25.88:9464/metrics msg="Out of order exemplar" exemplar="{Labels:{span_id=\"2ebf1ff7a445dcee\", trace_id=\"3ce14cee2b67171958be2872540b1392\"} Value:15.34602 Ts:1644500109361 HasTs:true}" ts=2022-02-10T13:37:46.765Z caller=scrape.go:1637 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring-custom-metrics/edimonitor-custom-metrics/1 target=http://10.101.25.88:9464/metrics msg="Out of order exemplar" exemplar="{Labels:{span_id=\"2adeb2df958e21a4\", trace_id=\"3ce14cee2b67171958be2872540b1392\"} Value:15.196914 Ts:1644500109274 HasTs:true}" ts=2022-02-10T13:37:46.765Z caller=scrape.go:1637 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring-custom-metrics/edimonitor-custom-metrics/1 target=http://10.101.25.88:9464/metrics msg="Out of order exemplar" exemplar="{Labels:{span_id=\"3c6cdc09b480d57e\", trace_id=\"3ce14cee2b67171958be2872540b1392\"} Value:15.603316 Ts:1644500109732 HasTs:true}" ts=2022-02-10T13:37:46.765Z caller=scrape.go:1637 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring-custom-metrics/edimonitor-custom-metrics/1 target=http://10.101.25.88:9464/metrics msg="Out of order exemplar" exemplar="{Labels:{span_id=\"5a775034db6844b0\", trace_id=\"3ce14cee2b67171958be2872540b1392\"} Value:14.823061 Ts:1644500108889 HasTs:true}" ts=2022-02-10T13:37:46.765Z caller=scrape.go:1637 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring-custom-metrics/edimonitor-custom-metrics/1 target=http://10.101.25.88:9464/metrics msg="Out of order exemplar" exemplar="{Labels:{span_id=\"b55bcb4351a12cf5\", trace_id=\"3ce14cee2b67171958be2872540b1392\"} Value:16.504856 Ts:1644500109299 HasTs:true}" ts=2022-02-10T13:37:46.766Z caller=scrape.go:1637 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring-custom-metrics/edimonitor-custom-metrics/1 target=http://10.101.25.88:9464/metrics msg="Out of order exemplar" exemplar="{Labels:{span_id=\"6ce64f19917ab619\", trace_id=\"3ce14cee2b67171958be2872540b1392\"} Value:15.824368 Ts:1644500108768 HasTs:true}" ts=2022-02-10T13:37:46.766Z caller=scrape.go:1637 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring-custom-metrics/edimonitor-custom-metrics/1 target=http://10.101.25.88:9464/metrics msg="Out of order exemplar" exemplar="{Labels:{span_id=\"26e13972a6eb78b4\", trace_id=\"3ce14cee2b67171958be2872540b1392\"} Value:15.484916 Ts:1644500108836 HasTs:true}" ts=2022-02-10T13:37:46.766Z caller=scrape.go:1637 level=debug component="scrape manager" 
scrape_pool=serviceMonitor/monitoring-custom-metrics/edimonitor-custom-metrics/1 target=http://10.101.25.88:9464/metrics msg="Out of order exemplar" exemplar="{Labels:{span_id=\"7aa4f77983d3b6f1\", trace_id=\"3ce14cee2b67171958be2872540b1392\"} Value:16.418786 Ts:1644500108751 HasTs:true}" ts=2022-02-10T13:37:46.766Z caller=scrape.go:1637 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring-custom-metrics/edimonitor-custom-metrics/1 target=http://10.101.25.88:9464/metrics msg="Out of order exemplar" exemplar="{Labels:{span_id=\"7bdbfaaac7446383\", trace_id=\"6a67a119fb84a5fcb466e04e978dde24\"} Value:142.779066 Ts:1644498142288 HasTs:true}" ts=2022-02-10T13:37:46.766Z caller=scrape.go:1637 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring-custom-metrics/edimonitor-custom-metrics/1 target=http://10.101.25.88:9464/metrics msg="Out of order exemplar" exemplar="{Labels:{span_id=\"5cb87ca55134f111\", trace_id=\"3ce14cee2b67171958be2872540b1392\"} Value:15.922487 Ts:1644500109702 HasTs:true}" ts=2022-02-10T13:37:46.766Z caller=scrape.go:1637 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring-custom-metrics/edimonitor-custom-metrics/1 target=http://10.101.25.88:9464/metrics msg="Out of order exemplar" exemplar="{Labels:{span_id=\"a38b9aa879def0cd\", trace_id=\"3ce14cee2b67171958be2872540b1392\"} Value:15.245832 Ts:1644500108471 HasTs:true}" ts=2022-02-10T13:37:46.767Z caller=scrape.go:1637 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring-custom-metrics/edimonitor-custom-metrics/1 target=http://10.101.25.88:9464/metrics msg="Out of order exemplar" exemplar="{Labels:{span_id=\"10aaadc3c4d70203\", trace_id=\"3ce14cee2b67171958be2872540b1392\"} Value:15.466594 Ts:1644500109050 HasTs:true}" ts=2022-02-10T13:37:46.767Z caller=scrape.go:1637 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring-custom-metrics/edimonitor-custom-metrics/1 target=http://10.101.25.88:9464/metrics msg="Out of order exemplar" exemplar="{Labels:{span_id=\"f4799f234aee8d1b\", trace_id=\"3ce14cee2b67171958be2872540b1392\"} Value:21.507684 Ts:1644500109096 HasTs:true}" ts=2022-02-10T13:37:46.767Z caller=scrape.go:1637 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring-custom-metrics/edimonitor-custom-metrics/1 target=http://10.101.25.88:9464/metrics msg="Out of order exemplar" exemplar="{Labels:{span_id=\"053c6c60a6f7e52f\", trace_id=\"3ce14cee2b67171958be2872540b1392\"} Value:15.987155 Ts:1644500108996 HasTs:true}" ts=2022-02-10T13:37:46.767Z caller=scrape.go:1637 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring-custom-metrics/edimonitor-custom-metrics/1 target=http://10.101.25.88:9464/metrics msg="Out of order exemplar" exemplar="{Labels:{span_id=\"519332bd1f2ab522\", trace_id=\"3ce14cee2b67171958be2872540b1392\"} Value:15.366706 Ts:1644500109206 HasTs:true}" ts=2022-02-10T13:37:46.767Z caller=scrape.go:1637 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring-custom-metrics/edimonitor-custom-metrics/1 target=http://10.101.25.88:9464/metrics msg="Out of order exemplar" exemplar="{Labels:{span_id=\"236bfeb651e5ce0c\", trace_id=\"3ce14cee2b67171958be2872540b1392\"} Value:15.013823 Ts:1644500108852 HasTs:true}" ts=2022-02-10T13:37:46.767Z caller=scrape.go:1637 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring-custom-metrics/edimonitor-custom-metrics/1 target=http://10.101.25.88:9464/metrics msg="Out of 
order exemplar" exemplar="{Labels:{span_id=\"063758d9b991efe6\", trace_id=\"3ce14cee2b67171958be2872540b1392\"} Value:15.082417 Ts:1644500109230 HasTs:true}" ts=2022-02-10T13:37:46.768Z caller=scrape.go:1637 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring-custom-metrics/edimonitor-custom-metrics/1 target=http://10.101.25.88:9464/metrics msg="Out of order exemplar" exemplar="{Labels:{span_id=\"ea6952890eba99f8\", trace_id=\"3ce14cee2b67171958be2872540b1392\"} Value:16.763646 Ts:1644500108925 HasTs:true}" ts=2022-02-10T13:37:46.768Z caller=scrape.go:1637 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring-custom-metrics/edimonitor-custom-metrics/1 target=http://10.101.25.88:9464/metrics msg="Out of order exemplar" exemplar="{Labels:{span_id=\"40beaa800f1486ef\", trace_id=\"3ce14cee2b67171958be2872540b1392\"} Value:15.241195 Ts:1644500109646 HasTs:true}" ts=2022-02-10T13:37:46.768Z caller=scrape.go:1637 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring-custom-metrics/edimonitor-custom-metrics/1 target=http://10.101.25.88:9464/metrics msg="Out of order exemplar" exemplar="{Labels:{span_id=\"0cd1443196dc4302\", trace_id=\"3ce14cee2b67171958be2872540b1392\"} Value:16.019285 Ts:1644500109013 HasTs:true}" ts=2022-02-10T13:37:46.768Z caller=scrape.go:1637 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring-custom-metrics/edimonitor-custom-metrics/1 target=http://10.101.25.88:9464/metrics msg="Out of order exemplar" exemplar="{Labels:{span_id=\"bbca1c96a279c281\", trace_id=\"3ce14cee2b67171958be2872540b1392\"} Value:15.511393 Ts:1644500109404 HasTs:true}" ts=2022-02-10T13:37:46.768Z caller=scrape.go:1637 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring-custom-metrics/edimonitor-custom-metrics/1 target=http://10.101.25.88:9464/metrics msg="Out of order exemplar" exemplar="{Labels:{span_id=\"7d5e88eda84bdf6b\", trace_id=\"3ce14cee2b67171958be2872540b1392\"} Value:15.720918 Ts:1644500109421 HasTs:true}" ts=2022-02-10T13:37:46.768Z caller=scrape.go:1637 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring-custom-metrics/edimonitor-custom-metrics/1 target=http://10.101.25.88:9464/metrics msg="Out of order exemplar" exemplar="{Labels:{span_id=\"ca858f8cdcb8fdec\", trace_id=\"3ce14cee2b67171958be2872540b1392\"} Value:15.777929 Ts:1644500109684 HasTs:true}" ts=2022-02-10T13:37:46.768Z caller=scrape.go:1573 level=warn component="scrape manager" scrape_pool=serviceMonitor/monitoring-custom-metrics/edimonitor-custom-metrics/1 target=http://10.101.25.88:9464/metrics msg="Error on ingesting out-of-order exemplars" num_dropped=65 ts=2022-02-10T13:37:54.632Z caller=scrape.go:1637 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring-custom-metrics/edimonitor-custom-metrics/1 target=http://10.101.25.90:9464/metrics msg="Out of order exemplar" exemplar="{Labels:{span_id=\"0930b875fa474b40\", trace_id=\"5826bc046633f8bc347c753ac32e4580\"} Value:1170.714014 Ts:1644499119665 HasTs:true}" ts=2022-02-10T13:37:54.633Z caller=scrape.go:1637 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring-custom-metrics/edimonitor-custom-metrics/1 target=http://10.101.25.90:9464/metrics msg="Out of order exemplar" exemplar="{Labels:{span_id=\"cf978cbe211cb3e3\", trace_id=\"91c9686ae5b8651e7c2f01f87618a32d\"} Value:796.384427 Ts:1644499400049 HasTs:true}" ts=2022-02-10T13:37:54.633Z caller=scrape.go:1637 level=debug component="scrape 
manager" scrape_pool=serviceMonitor/monitoring-custom-metrics/edimonitor-custom-metrics/1 target=http://10.101.25.90:9464/metrics msg="Out of order exemplar" exemplar="{Labels:{span_id=\"00e7cc7850ad16c3\", trace_id=\"f210e7dbe47f3c4848834b2e58962f6d\"} Value:1277.431793 Ts:1644499021411 HasTs:true}" ts=2022-02-10T13:37:54.633Z caller=scrape.go:1637 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring-custom-metrics/edimonitor-custom-metrics/1 target=http://10.101.25.90:9464/metrics msg="Out of order exemplar" exemplar="{Labels:{span_id=\"0919666b8cf5b377\", trace_id=\"83490b4b1f6a878b737e39f712cd23ac\"} Value:15.143474 Ts:1644494333371 HasTs:true}" ts=2022-02-10T13:37:54.647Z caller=scrape.go:1637 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring-custom-metrics/edimonitor-custom-metrics/1 target=http://10.101.25.90:9464/metrics msg="Out of order exemplar" exemplar="{Labels:{span_id=\"a8ca627ef96688f5\", trace_id=\"68e7ca3c4903934b6f054e425ded6cd6\"} Value:4740.193257 Ts:1644495614607 HasTs:true}" ts=2022-02-10T13:37:54.648Z caller=scrape.go:1573 level=warn component="scrape manager" scrape_pool=serviceMonitor/monitoring-custom-metrics/edimonitor-custom-metrics/1 target=http://10.101.25.90:9464/metrics msg="Error on ingesting out-of-order exemplars" num_dropped=5As far as I know this is related to the exemplars' timestamps, which in general should not be needed.
Is it possible to remove them, or can you suggest a way to fix this?
Steps to reproduce
If possible, provide a recipe for reproducing the error.
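No recipe was provided in the report. For orientation, a setup of the kind described above would typically be launched roughly as follows (the agent jar name, application jar, and port are placeholders/assumptions, not taken verbatim from this issue):

java -javaagent:opentelemetry-javaagent.jar \
     -Dotel.metrics.exporter=prometheus \
     -Dotel.exporter.prometheus.port=9464 \
     -jar my-app.jar

Then point a Prometheus server (with exemplar storage enabled) at http://<host>:9464/metrics and watch its scrape logs at debug level.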
What did you expect to see?
Exemplars should be scraped successfully
What did you see instead?
Exemplars dropped because of out of order timestamps
What version are you using?
(e.g., v0.9.0, 393e4a2, etc.)
v1.10.1
Environment
Compiler: (e.g., "AdoptOpenJDK 11.0.6")
OS: (e.g., "Ubuntu 20.04")
Runtime (if different from JDK above): (e.g., "Oracle JRE 8u251")
OS (if different from OS compiled on): (e.g., "Windows Server 2019")
Additional context
Add any other context about the problem here.