Remote read API not performant in v2.13 #9764

rishabhkumar92 · 2024-10-29T04:38:46Z

Describe the bug

Hello Team,
We have been testing Thanos Querier with the Mimir remote read API and are seeing a significant performance difference between range queries and remote read. One of the first things I noticed was that the queries weren't sharded, which might be contributing to the majority of

To Reproduce

Steps to reproduce the behavior:

Start Mimir 2.13
Perform a remote read API call for a reasonably expensive query, similar to count(services_platform_service_request_count{namespace=~".*-staging$"}) by (namespace)

Expected behavior

This query takes ~2 seconds to execute when the range query API is used, and it should take approximately the same time with the remote read API. However, it took 15+ seconds.

Environment

Infrastructure: Kubernetes, Laptop
Deployment tool: jsonnet

Additional Context

NA

rishabhkumar92 · 2024-10-29T16:35:10Z

Attaching a screenshot showing that the querier spends a lot of time in data buffering even though the ingester finishes within a few ms.

rishabhkumar92 · 2024-10-29T17:19:48Z

I saw fixes around the remote read API not honoring hints (ref), but I saw this performance issue during an instant query, so this is a different issue from the hinting fixes.

pracucci · 2024-11-07T11:04:03Z

The first thing that comes to my mind is that remote read supports two response types (specs):

Samples
Encoded chunks

Fetching samples is much slower than fetching the encoded chunks. Could you make sure that Thanos requests STREAMED_XOR_CHUNKS, please? That's what Mimir querier internally requests from ingesters when you run a range or instant query.
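One way to check which response type a server actually returned is to look at the response's Content-Type header. The sketch below assumes the content types defined by the Prometheus remote read protocol (`application/x-protobuf` for samples, `application/x-streamed-protobuf; proto=prometheus.ChunkedReadResponse` for streamed chunks). The helper name `remote_read_response_type` is hypothetical and not part of Mimir or Thanos.

```python
def remote_read_response_type(content_type: str) -> str:
    """Classify a remote read response by its Content-Type header.

    Hypothetical helper: assumes the content types used by the
    Prometheus remote read protocol.
    """
    # Drop parameters such as "; proto=prometheus.ChunkedReadResponse".
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "application/x-streamed-protobuf":
        return "STREAMED_XOR_CHUNKS"
    if media_type == "application/x-protobuf":
        return "SAMPLES"
    return "UNKNOWN"


if __name__ == "__main__":
    header = "application/x-streamed-protobuf; proto=prometheus.ChunkedReadResponse"
    print(remote_read_response_type(header))
```

If a proxy or client log shows the samples content type on the remote read endpoint, the request did not negotiate streamed chunks.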

pracucci · 2024-11-07T15:26:32Z

Could you also share the full trace .json so I can look at it myself as well, please?

rishabhkumar92 · 2024-11-07T16:16:00Z

@pracucci regarding encoded chunks, I confirmed that encoded chunks are being used as the response type; this was introduced in Thanos a few years back (reference).

Regarding the full trace, I am still figuring out how to export it as full JSON. Also, we are issuing a federated query for 100s of tenants, which is far slower with remote read than with the range APIs.

rishabhkumar92 · 2024-11-07T23:13:28Z

Trace-4f3442-2024-11-07 15_12_26.json

Attaching a trace of an instant query which took 20+ seconds

pracucci · 2024-11-11T15:14:08Z

> Attaching a trace of an instant query which took 20+ seconds

Thanks. I tried to load it in the Jaeger UI but it doesn't work (apparently it's an invalid format for Jaeger). What format is the trace in? Which application did you use to export it? Sorry for this ping-pong, but it would be great if you could give me a trace that loads in the Jaeger UI.

To test it in Jaeger you can run it with:

docker run -p 16686:16686 jaegertracing/all-in-one:latest

Then upload the .json and see if it works. Thanks!

rishabhkumar92 · 2024-11-12T21:21:53Z

@pracucci I downloaded it from the Grafana UI. Can you try visualizing it in Grafana?

mattsimonsen · 2024-11-18T21:31:21Z

@rishabhkumar92 - I was unable to get Trace-4f3442-2024-11-07.15_12_26.json to load in Grafana Cloud, it fails with a parse error.

Could you try exporting for Jaeger and/or send a trace that will upload into Grafana Cloud as an alternative?

rishabhkumar92 · 2024-11-19T18:56:48Z

@mattsimonsen I was able to load the JSON in the Zipkin UI to visualize the trace. Unfortunately, we don't have a way to download a trace that is compatible with Jaeger.

https://github.com/openzipkin/zipkin?tab=readme-ov-file

pracucci · 2024-12-12T10:06:55Z

Sorry for the late reply.

In the trace we can see that the most deeply nested item with a high latency is SeriesChunksStreamReader.StartBuffering().

However, SeriesChunksStreamReader.StartBuffering() tracks the time it takes to stream series labels and chunks from an ingester to the querier. The high latency could be due either to the ingester being slow or to the querier being slow. If the querier is slow at reading, the latency of SeriesChunksStreamReader.StartBuffering() will be high because of the backpressure mechanism. Essentially, an ingester has a buffer of data to send to the querier; once the buffer is full, the ingester pauses the buffering until the buffer has room for more series data.
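The backpressure effect described above can be illustrated with a toy model. This is only a sketch using a bounded queue between two threads, not Mimir's actual implementation; the function name `stream` and all parameters are made up for illustration. The time measured on the sending side grows with the reader's per-item delay, even though the sender itself does no slow work.

```python
import queue
import threading
import time


def stream(num_series: int, buffer_size: int, consumer_delay: float) -> float:
    """Return the seconds the sender needed to hand over num_series items.

    Toy model: the sender plays the ingester, the reader plays the querier,
    and the bounded queue plays the ingester-side buffer.
    """
    buf: queue.Queue = queue.Queue(maxsize=buffer_size)

    def reader() -> None:
        for _ in range(num_series):
            buf.get()
            time.sleep(consumer_delay)  # per-item work on the reading side

    reader_thread = threading.Thread(target=reader)
    reader_thread.start()

    start = time.monotonic()
    for i in range(num_series):
        buf.put(i)  # blocks while the buffer is full (backpressure)
    elapsed = time.monotonic() - start

    reader_thread.join()
    return elapsed


if __name__ == "__main__":
    fast = stream(num_series=50, buffer_size=5, consumer_delay=0.0)
    slow = stream(num_series=50, buffer_size=5, consumer_delay=0.002)
    print(f"sender time with fast reader: {fast:.4f}s")
    print(f"sender time with slow reader: {slow:.4f}s")
```

In this model the sender-side timer cannot tell a slow sender from a slow reader, which is why the span alone does not identify the bottleneck.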

Since you mentioned that a range query fetching the same raw series is fast, I would guess it's not the ingester that is slow, but the querier.

The remote read API is a streaming API. This means that everything is implemented in a streaming way. The querier may be slow because its CPU is saturated, or because the client that sent the API request is slow at reading.

As you theorised, one reason why the querier may be slow is that remote read requests are not sharded. A single remote read request is executed in a single thread in the querier, so it doesn't scale to multiple CPU cores. Remote read is not a very common use case in Mimir, so we haven't invested in all the performance optimizations we've done for instant and range queries.

Another theory is that the client is slow at reading. In this case, seeing a high latency on the remote read API endpoint is just a side effect of a slow client. As a test to exclude this option, you could try to run the same query using the mimirtool remote-read stats command. It's a CLI command that runs a remote read against a remote endpoint and prints some stats about the queried series. The processing done by this command is lightweight, so I don't expect it to significantly impact the latency measurement. Obviously, run this command from a machine with a high-bandwidth network connection to your Mimir cluster.

I hope this can give you some insights to let you further investigate it.

rishabhkumar92 changed the title ~~Remote read API not performant~~ Remote read API not performant in v2.13 Oct 29, 2024