Thanos, Prometheus and Golang version used:

Object Storage Provider: Pure Storage

What happened:

Hey folks,

We've encountered an issue with our production Thanos deployment that we suspect may be a bug. In short, we use Thanos both in the sidecar approach alongside Prometheus and with Thanos Receive and remote write.
While most of our Thanos infrastructure was already on version 0.26, our Thanos Receive deployment and the querier in front of it were still running version 0.23. Last week we upgraded Thanos Receive to version 0.26 and noticed something quite surprising: our query latency from Grafana degraded by over 100x! Grafana took an additional 1-2 minutes to load data into panels, even for queries spanning just a few hours.
Tracing revealed that the query_range methods were still blazing fast (sub-second) in most cases. However, exemplar queries to Thanos Receive were very slow (30-50s) and in some cases timed out.
This came as an even greater shock because we do not use exemplars at all. We have never enabled exemplar collection in Prometheus, nor have we enabled exemplar storage in Thanos Receive (max exemplar storage is set to zero). I've also confirmed via the following metrics that we have absolutely no exemplars in storage:
To make things even more interesting, we only see this behavior with Thanos Receive; it is not observed on the thanos query -> thanos sidecar -> prometheus path. In most cases exemplar queries to the Thanos sidecar respond in less than 150ms, as can be seen in the above trace.
For the time being we've disabled exemplar querying from Grafana to restore query performance, but I think it is worth investigating what is causing this latency specifically with Thanos Receive.
What you expected to happen:
Thanos Receive exemplar queries perform comparably to Thanos sidecar exemplar queries, given that neither Thanos Receive nor Prometheus has exemplar storage enabled.
How to reproduce it (as minimally and precisely as possible):
Query Thanos Receive with exemplars enabled (see the sketch below).
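For illustration, a minimal timing check against the querier fronting Receive could look like the sketch below. It assumes the Prometheus-compatible query_exemplars endpoint exposed by Thanos Query; `QUERIER_URL` and the example selector are placeholders.

```python
# Hypothetical repro sketch: time an exemplar query against the querier that
# fronts Thanos Receive. QUERIER_URL and the selector are placeholders.
import time
import requests

QUERIER_URL = "http://thanos-query.example.com:9090"

end = time.time()
start = end - 3 * 3600  # a few hours, matching the panels that slowed down

t0 = time.time()
resp = requests.get(
    f"{QUERIER_URL}/api/v1/query_exemplars",
    params={
        "query": 'http_requests_total{job="example"}',  # any series selector
        "start": start,
        "end": end,
    },
)
print(f"status={resp.status_code} elapsed={time.time() - t0:.1f}s")
```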
Anything else we need to know:
Another observation worth mentioning: CPU usage on the querier in front of Thanos Receive increased by nearly 10x (up to 20 cores) while handling exemplar queries. It returned to its baseline level of 2 cores once exemplar querying was disabled.
If others encounter this issue and need to disable exemplar querying from Grafana, I've written a small script that finds dashboards with exemplars enabled and sets exemplar querying to false:
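For anyone rolling their own, a sketch of one way this could be done, assuming the Grafana HTTP API and a token with dashboard write access; `GRAFANA_URL`, `GRAFANA_TOKEN`, and the recursion over nested row panels are assumptions for illustration, not taken from the original script:

```python
#!/usr/bin/env python3
"""Sketch: disable exemplar querying on every Grafana dashboard that has it enabled."""
import os
import requests

GRAFANA_URL = os.environ.get("GRAFANA_URL", "http://localhost:3000")  # placeholder
HEADERS = {"Authorization": f"Bearer {os.environ['GRAFANA_TOKEN']}"}  # placeholder token

def disable_exemplars(panels):
    """Set exemplar: false on every query target; return True if anything changed."""
    changed = False
    for panel in panels or []:
        for target in panel.get("targets", []):
            if target.get("exemplar"):
                target["exemplar"] = False
                changed = True
        # Row panels nest their children under "panels".
        changed |= disable_exemplars(panel.get("panels"))
    return changed

def main():
    # List every dashboard, then re-save the ones that had exemplars enabled.
    search = requests.get(f"{GRAFANA_URL}/api/search", headers=HEADERS,
                          params={"type": "dash-db"}).json()
    for item in search:
        resp = requests.get(f"{GRAFANA_URL}/api/dashboards/uid/{item['uid']}",
                            headers=HEADERS).json()
        dashboard = resp["dashboard"]
        if disable_exemplars(dashboard.get("panels")):
            requests.post(f"{GRAFANA_URL}/api/dashboards/db", headers=HEADERS,
                          json={"dashboard": dashboard, "overwrite": True,
                                "message": "Disable exemplar querying"})
            print(f"Updated {item['title']}")

if __name__ == "__main__":
    main()
```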
Unfortunately Grafana had exemplars enabled by default, which resulted in several hundred dashboards needing to be updated. Luckily this has been fixed from version 8.5 onwards: grafana/grafana#45260
I am surprised exemplars are enabled by default. Will investigate.
It sounds like a naive fan-out implementation and some slowdown on the TSDB side as well, but that's expected from an experimental feature. Sorry for the mess.
I don't think exemplar storage is enabled by default, though the exemplar query endpoint is. A fan-out implementation on a large Receive deployment could very well explain the latency. What intrigues me, though, is that the proxy_exemplar behavior is different for the Thanos sidecar compared to Thanos Receive:
Namely, the thanos.Exemplars/Exemplars RPC appears to start at approximately the same time for the Thanos sidecar, whereas there is a clear sequential pattern for Thanos Receive.
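Purely as an illustration of why that trace shape matters (this is not the actual Thanos proxy code, and all names and timings below are made up): a sequential fan-out to downstream stores scales its latency with the number of stores, while a concurrent fan-out is bounded by the slowest store.

```python
# Illustrative only: sequential vs. concurrent fan-out over N downstream stores.
import asyncio
import time

STORE_LATENCY_S = 0.3   # pretend every downstream exemplar call takes 300ms
N_STORES = 10

async def fetch_exemplars(store_id: int) -> list:
    await asyncio.sleep(STORE_LATENCY_S)  # stand-in for a thanos.Exemplars RPC
    return []                             # no exemplars stored anywhere

async def sequential_fanout():
    # Wait for each store before contacting the next: total ~= N * latency.
    return [await fetch_exemplars(i) for i in range(N_STORES)]

async def concurrent_fanout():
    # Issue all calls up front: total ~= latency of the slowest store.
    return await asyncio.gather(*(fetch_exemplars(i) for i in range(N_STORES)))

for fanout in (sequential_fanout, concurrent_fanout):
    t0 = time.time()
    asyncio.run(fanout())
    print(f"{fanout.__name__}: {time.time() - t0:.1f}s")
```

With 10 stores the sequential version takes roughly ten times as long, which matches the staircase pattern visible in the Receive trace.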