Thanos Receive: High query latency for exemplars with zero exemplars in storage #5444

Closed
vanugrah opened this issue Jun 24, 2022 · 3 comments · Fixed by #5554

Comments

@vanugrah
Contributor

Thanos, Prometheus and Golang version used:

  • Thanos: 0.26.0
  • Prometheus: 2.35.0

Object Storage Provider:
Pure Storage

What happened:
Hey folks,

We've encountered an issue with our production Thanos deployment that we suspect may be a bug. Long story short, we are using Thanos both in the sidecar approach alongside Prometheus and with Thanos Receive and remote write.

While the majority of our Thanos infrastructure was already on version 0.26, our Thanos Receive deployment and the querier sitting in front of it were still running version 0.23. Last week we upgraded Thanos Receive to version 0.26 and began noticing something quite surprising: our query latency from Grafana degraded by over 100x! This resulted in Grafana taking an additional 1-2 minutes to load data into panels, even for queries spanning just a few hours.

Tracing revealed that the query_range methods were still blazing fast (subsecond) in most cases. However, exemplar queries to Thanos Receive were very slow (30-50s) and in some cases timed out.

[Screenshot: query trace, 2022-06-24 6:49:42 PM]

This came as an even greater shock because we do not use exemplars. We have never enabled exemplar collection in Prometheus, nor have we enabled exemplar storage in Thanos Receive (max exemplar storage is set to zero). I've also confirmed that we in fact have absolutely no exemplars in storage via the following metrics:

[Screenshot: exemplar storage metrics, 2022-06-24 6:54:02 PM]
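
For anyone who wants to run the same check, here is a minimal sketch (not taken from the screenshot above): it asks the Prometheus-compatible query API on the querier for the TSDB exemplar gauge. The address is a placeholder, and the metric name assumes the standard `prometheus_tsdb_exemplar_*` gauges are exposed by the receivers:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

func main() {
	// Placeholder address for the querier sitting in front of Receive.
	base := "http://thanos-query.example.com:9090"

	// prometheus_tsdb_exemplar_exemplars_in_storage is the Prometheus TSDB
	// gauge for exemplars currently held in memory; if exemplar storage was
	// never enabled it should be 0 (or absent) on every receiver.
	q := url.Values{"query": {"sum(prometheus_tsdb_exemplar_exemplars_in_storage)"}}

	resp, err := http.Get(base + "/api/v1/query?" + q.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var body struct {
		Data struct {
			Result []struct {
				Value [2]interface{} `json:"value"` // [unix timestamp, "value"]
			} `json:"result"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		panic(err)
	}
	for _, r := range body.Data.Result {
		fmt.Println("exemplars currently in storage:", r.Value[1])
	}
}
```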

To make things even more interesting, we are only seeing this behavior with Thanos Receive. This behavior is not observed between Thanos Query -> Thanos Sidecar -> Prometheus. In most cases exemplar queries to Thanos Sidecar respond in less than 150ms, as can be seen in the trace above.

For the time being we've disabled exemplar querying from Grafana to restore our query performance, but I think it is worth investigating what is causing this query latency specifically with Thanos Receive.

What you expected to happen:
Thanos Receive exemplar queries perform comparably to Thanos Sidecar exemplar queries, given that neither Thanos Receive nor Prometheus has exemplar storage enabled.

How to reproduce it (as minimally and precisely as possible):
Query Thanos Receive with exemplar querying enabled.
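
A rough sketch of that reproduction from the client side, assuming the querier in front of Receive exposes the Prometheus-compatible `/api/v1/query_exemplars` endpoint (the address and the query itself are placeholders; this simply times how long the exemplar request takes):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

func main() {
	// Placeholder querier address; the query can be anything Grafana would
	// normally attach exemplar lookups to.
	base := "http://thanos-query.example.com:9090"
	params := url.Values{
		"query": {"http_requests_total"},
		"start": {fmt.Sprint(time.Now().Add(-time.Hour).Unix())},
		"end":   {fmt.Sprint(time.Now().Unix())},
	}

	started := time.Now()
	resp, err := http.Get(base + "/api/v1/query_exemplars?" + params.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body) // drain so the measurement covers the full response

	fmt.Printf("status=%s elapsed=%s\n", resp.Status, time.Since(started))
}
```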

Anything else we need to know:
Another observation worth mentioning is that CPU usage on the Thanos Receive querier increased by nearly 10x (up to 20 cores) when handling exemplar queries. It returned to its baseline of 2 cores once exemplar querying was disabled.

[Screenshot: Thanos Receive querier CPU usage]

@vanugrah
Contributor Author

If others encounter this issue and need to disable exemplar querying from Grafana, I've written a small script that finds dashboards with exemplars enabled and sets exemplar querying to false:

https://github.com/vanugrah/grafana-disable-exemplars

Unfortunately, Grafana had exemplars enabled by default, which resulted in several hundred dashboards needing to be updated. Luckily, they have fixed this from version 8.5 onwards: grafana/grafana#45260
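
For reference, the sketch below shows the general shape of such a script rather than the linked script itself: list dashboards via the standard Grafana HTTP API (`/api/search`), fetch each dashboard (`/api/dashboards/uid/:uid`), flip any `exemplar: true` target to `false`, and save it back (`/api/dashboards/db`). The URL and token are placeholders, and it only walks top-level panels:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Placeholder values; adjust for your environment.
const (
	grafanaURL = "http://grafana.example.com"
	apiToken   = "REPLACE_ME" // a Grafana API token with dashboard write access
)

// doJSON performs an authenticated request and optionally decodes the JSON response.
func doJSON(method, path string, in, out interface{}) error {
	var buf bytes.Buffer
	if in != nil {
		if err := json.NewEncoder(&buf).Encode(in); err != nil {
			return err
		}
	}
	req, err := http.NewRequest(method, grafanaURL+path, &buf)
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Bearer "+apiToken)
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if out != nil {
		return json.NewDecoder(resp.Body).Decode(out)
	}
	return nil
}

func main() {
	// 1. List all dashboards.
	var hits []struct {
		UID string `json:"uid"`
	}
	if err := doJSON("GET", "/api/search?type=dash-db", nil, &hits); err != nil {
		panic(err)
	}

	for _, h := range hits {
		// 2. Fetch the full dashboard JSON.
		var wrapper map[string]interface{}
		if err := doJSON("GET", "/api/dashboards/uid/"+h.UID, nil, &wrapper); err != nil {
			continue
		}
		dash, _ := wrapper["dashboard"].(map[string]interface{})
		if dash == nil {
			continue
		}

		// 3. Flip `exemplar: true` to false on every top-level panel target.
		changed := false
		panels, _ := dash["panels"].([]interface{})
		for _, p := range panels {
			panel, _ := p.(map[string]interface{})
			targets, _ := panel["targets"].([]interface{})
			for _, t := range targets {
				if target, ok := t.(map[string]interface{}); ok && target["exemplar"] == true {
					target["exemplar"] = false
					changed = true
				}
			}
		}
		if !changed {
			continue
		}

		// 4. Save the modified dashboard back.
		payload := map[string]interface{}{
			"dashboard": dash,
			"overwrite": true,
			"message":   "disable exemplar querying",
		}
		if err := doJSON("POST", "/api/dashboards/db", payload, nil); err == nil {
			fmt.Println("updated dashboard", h.UID)
		}
	}
}
```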

@bwplotka
Member

bwplotka commented Jun 24, 2022

Thanks for the report and great points.

I am surprised exemplars are enabled by default. Will investigate.

It sounds like a naive fanout implementation and some slowdown on the TSDB side as well, but that's expected from an experimental feature. Sorry for the mess.

@vanugrah
Contributor Author

Hey Bartek, thanks for the quick response!

I don't think exemplar storage is enabled by default, though the exemplar query endpoint is. A fan-out implementation on a large Receive deployment could very well explain the latency. What intrigues me, though, is that the proxy_exemplar behavior is visibly different for Thanos Sidecar as compared to Thanos Receive:

[Screenshot: query trace, 2022-06-27 1:01:34 PM]

[Screenshot: query trace, 2022-06-27 1:01:53 PM]

Namely, the thanos.Exemplars/Exemplars RPCs appear to start at approximately the same time for Thanos Sidecar, whereas there is a clear sequential pattern for Thanos Receive.
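
To illustrate why that pattern matters, here is a minimal sketch (not the actual Thanos proxy code; `Client` and `Exemplars` are stand-ins for the per-store gRPC client): if the exemplars fanout queries each store sequentially, total latency is the sum across stores, while a concurrent fanout is bounded by the slowest single store:

```go
// Package fanout sketches the two query patterns discussed above.
package fanout

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// Client stands in for an exemplars client to a single store/receiver.
type Client interface {
	Exemplars(ctx context.Context, query string) ([]string, error)
}

// sequentialFanout queries stores one after another: total latency is the
// sum of the per-store latencies, matching the stair-step trace from Receive.
func sequentialFanout(ctx context.Context, stores []Client, q string) ([]string, error) {
	var out []string
	for _, s := range stores {
		res, err := s.Exemplars(ctx, q)
		if err != nil {
			return nil, err
		}
		out = append(out, res...)
	}
	return out, nil
}

// concurrentFanout queries all stores in parallel: total latency is roughly
// that of the slowest single store, matching the sidecar-style trace.
func concurrentFanout(ctx context.Context, stores []Client, q string) ([]string, error) {
	g, ctx := errgroup.WithContext(ctx)
	results := make([][]string, len(stores))
	for i, s := range stores {
		i, s := i, s // capture loop variables for the goroutine
		g.Go(func() error {
			res, err := s.Exemplars(ctx, q)
			results[i] = res
			return err
		})
	}
	if err := g.Wait(); err != nil {
		return nil, err
	}
	var out []string
	for _, r := range results {
		out = append(out, r...)
	}
	return out, nil
}
```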
