Thanos, Prometheus and Golang version used:

Object Storage Provider: Pure Storage

What happened:

Hey folks,

We've encountered an issue with our production Thanos deployment that we suspect may be a bug. In short, we use Thanos both in the sidecar approach alongside Prometheus and with Thanos Receive and remote write.
While most of our Thanos infrastructure was already on version 0.26, our Thanos Receive deployment and the querier in front of it were still running version 0.23. Last week we upgraded Thanos Receive to version 0.26 and noticed something quite surprising: our query latency from Grafana degraded by over 100x! Grafana took an additional 1-2 minutes to load data into panels, even for queries spanning just a few hours.
Tracing revealed that the query_range methods were still blazing fast (sub-second) in most cases. However, exemplar queries to Thanos Receive were very slow (30-50s) and in some cases timed out.
This came as an even greater shock because we do not use exemplars at all. We have never enabled exemplar collection in Prometheus, nor have we enabled exemplar storage in Thanos Receive (max exemplar storage is set to zero). I've also confirmed via the following metrics that we have absolutely no exemplars in storage:
To make things even more interesting, we only see this behavior with Thanos Receive; it is not observed on the thanos query -> thanos sidecar -> prometheus path. In most cases exemplar queries to the Thanos sidecar respond in less than 150ms, as can be seen in the above trace.
For the time being we've disabled exemplar querying from Grafana to restore query performance, but I think it is worth investigating what is causing this latency specifically with Thanos Receive.
What you expected to happen:
Thanos Receive exemplar queries perform comparably to Thanos sidecar exemplar queries, given that neither Thanos Receive nor Prometheus has exemplar storage enabled.
How to reproduce it (as minimally and precisely as possible):
Query Thanos Receive with exemplars enabled (see the sketch below).
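For illustration, a minimal timing check against the querier fronting Receive could look like the sketch below. It assumes the Prometheus-compatible query_exemplars endpoint exposed by Thanos Query; `QUERIER_URL` and the example selector are placeholders.

```python
# Hypothetical repro sketch: time an exemplar query against the querier that
# fronts Thanos Receive. QUERIER_URL and the selector are placeholders.
import time
import requests

QUERIER_URL = "http://thanos-query.example.com:9090"

end = time.time()
start = end - 3 * 3600  # a few hours, matching the panels that slowed down

t0 = time.time()
resp = requests.get(
    f"{QUERIER_URL}/api/v1/query_exemplars",
    params={
        "query": 'http_requests_total{job="example"}',  # any series selector
        "start": start,
        "end": end,
    },
)
print(f"status={resp.status_code} elapsed={time.time() - t0:.1f}s")
```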
Anything else we need to know:
Another observation worth mentioning: CPU usage on the querier in front of Thanos Receive increased by nearly 10x (up to 20 cores) while handling exemplar queries. It returned to its baseline level of 2 cores once exemplar querying was disabled.
If others encounter this issue and need to disable exemplar querying from Grafana, I've written a small script that finds dashboards with exemplars enabled and sets exemplar querying to false:
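For anyone rolling their own, a sketch of one way this could be done, assuming the Grafana HTTP API and a token with dashboard write access; `GRAFANA_URL`, `GRAFANA_TOKEN`, and the recursion over nested row panels are assumptions for illustration, not taken from the original script:

```python
#!/usr/bin/env python3
"""Sketch: disable exemplar querying on every Grafana dashboard that has it enabled."""
import os
import requests

GRAFANA_URL = os.environ.get("GRAFANA_URL", "http://localhost:3000")  # placeholder
HEADERS = {"Authorization": f"Bearer {os.environ['GRAFANA_TOKEN']}"}  # placeholder token

def disable_exemplars(panels):
    """Set exemplar: false on every query target; return True if anything changed."""
    changed = False
    for panel in panels or []:
        for target in panel.get("targets", []):
            if target.get("exemplar"):
                target["exemplar"] = False
                changed = True
        # Row panels nest their children under "panels".
        changed |= disable_exemplars(panel.get("panels"))
    return changed

def main():
    # List every dashboard, then re-save the ones that had exemplars enabled.
    search = requests.get(f"{GRAFANA_URL}/api/search", headers=HEADERS,
                          params={"type": "dash-db"}).json()
    for item in search:
        resp = requests.get(f"{GRAFANA_URL}/api/dashboards/uid/{item['uid']}",
                            headers=HEADERS).json()
        dashboard = resp["dashboard"]
        if disable_exemplars(dashboard.get("panels")):
            requests.post(f"{GRAFANA_URL}/api/dashboards/db", headers=HEADERS,
                          json={"dashboard": dashboard, "overwrite": True,
                                "message": "Disable exemplar querying"})
            print(f"Updated {item['title']}")

if __name__ == "__main__":
    main()
```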
Unfortunately Grafana had exemplars enabled by default, which resulted in several hundred dashboards needing to be updated. Luckily this has been fixed from version 8.5 onwards: grafana/grafana#45260
I am surprised exemplars are enabled by default. Will investigate.
It sounds like a naive fan-out implementation and some slowdown on the TSDB side as well, but that's expected from an experimental feature. Sorry for the mess.
I don't think exemplar storage is enabled by default, though the exemplar query endpoint is. A fan-out implementation on a large Receive deployment could very well explain the latency. What intrigues me, though, is that the proxy_exemplar behavior is different for the Thanos sidecar compared to Thanos Receive:
Namely, the thanos.Exemplars/Exemplars RPC appears to start at approximately the same time for the Thanos sidecar, whereas there is a clear sequential pattern for Thanos Receive.
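Purely as an illustration of why that trace shape matters (this is not the actual Thanos proxy code, and all names and timings below are made up): a sequential fan-out to downstream stores scales its latency with the number of stores, while a concurrent fan-out is bounded by the slowest store.

```python
# Illustrative only: sequential vs. concurrent fan-out over N downstream stores.
import asyncio
import time

STORE_LATENCY_S = 0.3   # pretend every downstream exemplar call takes 300ms
N_STORES = 10

async def fetch_exemplars(store_id: int) -> list:
    await asyncio.sleep(STORE_LATENCY_S)  # stand-in for a thanos.Exemplars RPC
    return []                             # no exemplars stored anywhere

async def sequential_fanout():
    # Wait for each store before contacting the next: total ~= N * latency.
    return [await fetch_exemplars(i) for i in range(N_STORES)]

async def concurrent_fanout():
    # Issue all calls up front: total ~= latency of the slowest store.
    return await asyncio.gather(*(fetch_exemplars(i) for i in range(N_STORES)))

for fanout in (sequential_fanout, concurrent_fanout):
    t0 = time.time()
    asyncio.run(fanout())
    print(f"{fanout.__name__}: {time.time() - t0:.1f}s")
```

With 10 stores the sequential version takes roughly ten times as long, which matches the staircase pattern visible in the Receive trace.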