
Remote read API not performant in v2.13 #9764

Open
rishabhkumar92 opened this issue Oct 29, 2024 · 11 comments

@rishabhkumar92

Describe the bug

Hello Team,
We have been testing the Thanos querier with the Mimir remote read API and are seeing a significant performance difference between range queries and remote read. One of the first things I noticed was that queries aren't sharded, which might be contributing to the majority of the slowdown.

To Reproduce

Steps to reproduce the behavior:

  1. Start Mimir 2.13
  2. Perform a remote read API call for a reasonably expensive query, similar to count(services_platform_service_request_count{namespace=~".*-staging$"}) by (namespace) (see the example call after this list)
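
For reference, the range-query side of the comparison can be reproduced with a plain HTTP call like the one below (host, tenant, and time range are placeholders, not taken from this issue); the remote read side needs a protobuf-speaking client such as Thanos or mimirtool (see the examples further down in this thread):

curl -G 'http://<mimir-host>/prometheus/api/v1/query_range' \
  --data-urlencode 'query=count(services_platform_service_request_count{namespace=~".*-staging$"}) by (namespace)' \
  --data-urlencode 'start=2024-10-29T00:00:00Z' \
  --data-urlencode 'end=2024-10-29T01:00:00Z' \
  --data-urlencode 'step=60s' \
  -H 'X-Scope-OrgID: <tenant>'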

Expected behavior

This query takes ~2 seconds to execute when the query range API is used, and it should take approximately the same time with the remote read API too; however, it took 15+ seconds.

Environment

  • Infrastructure: Kubernetes, Laptop
  • Deployment tool: jsonnet

Additional Context

NA

@rishabhkumar92 rishabhkumar92 changed the title from "Remote read API not performant" to "Remote read API not performant in v2.13" on Oct 29, 2024
@rishabhkumar92
Author


Attaching a screenshot showing that the querier spends a lot of time buffering data even though the ingester finishes in a few milliseconds.

@rishabhkumar92
Author

I saw fixes around the remote read API not honoring hints (ref), but I saw these perf issues during an instant query, so this is a different issue from the hinting fixes.

@pracucci
Collaborator

pracucci commented Nov 7, 2024

The first thing that comes to my mind is that remote read supports two response types (specs):

  • Samples
  • Encoded chunks

Fetching samples is much slower than fetching the encoded chunks. Could you make sure that Thanos requests STREAMED_XOR_CHUNKS, please? That's what the Mimir querier internally requests from ingesters when you run a range or instant query.
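
For anyone who wants to verify what their client actually asks for, here is a minimal, self-contained Go sketch (endpoint URL, tenant header, and matchers are placeholders, not taken from this issue) that sends a remote read request advertising STREAMED_XOR_CHUNKS via the AcceptedResponseTypes field of prompb.ReadRequest; the response Content-Type reveals whether the server answered with streamed chunks or fell back to samples:

// Sketch only: endpoint, tenant, and matchers are placeholders.
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"

	"github.com/golang/snappy"
	"github.com/prometheus/prometheus/prompb"
)

func main() {
	now := time.Now()
	req := &prompb.ReadRequest{
		Queries: []*prompb.Query{{
			StartTimestampMs: now.Add(-time.Hour).UnixMilli(),
			EndTimestampMs:   now.UnixMilli(),
			Matchers: []*prompb.LabelMatcher{
				{Type: prompb.LabelMatcher_EQ, Name: "__name__", Value: "services_platform_service_request_count"},
				{Type: prompb.LabelMatcher_RE, Name: "namespace", Value: ".*-staging$"},
			},
		}},
		// Advertise support for streamed, encoded chunks. Without this the
		// server falls back to the (slower) SAMPLES response type.
		AcceptedResponseTypes: []prompb.ReadRequest_ResponseType{
			prompb.ReadRequest_STREAMED_XOR_CHUNKS,
		},
	}

	data, err := req.Marshal() // prompb types are gogo-generated and expose Marshal()
	if err != nil {
		panic(err)
	}

	httpReq, err := http.NewRequest(http.MethodPost,
		"http://mimir.example/prometheus/api/v1/read", // placeholder endpoint
		bytes.NewReader(snappy.Encode(nil, data)))
	if err != nil {
		panic(err)
	}
	httpReq.Header.Set("Content-Type", "application/x-protobuf")
	httpReq.Header.Set("Content-Encoding", "snappy")
	httpReq.Header.Set("X-Prometheus-Remote-Read-Version", "0.1.0")
	httpReq.Header.Set("X-Scope-OrgID", "tenant-1") // placeholder tenant

	resp, err := http.DefaultClient.Do(httpReq)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// A streamed-chunks response is identified by its Content-Type
	// (application/x-streamed-protobuf; proto=prometheus.ChunkedReadResponse).
	fmt.Println("status:      ", resp.Status)
	fmt.Println("content-type:", resp.Header.Get("Content-Type"))
}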

@pracucci
Collaborator

pracucci commented Nov 7, 2024

Could you also share the full trace .json so I can look at it myself as well, please?

@rishabhkumar92
Author

rishabhkumar92 commented Nov 7, 2024

@pracucci regarding encoded chunks, I confirmed that encoded chunks are being used as the response type; support was introduced in Thanos a few years back (reference).

Regarding the full trace, I am still figuring out how to export it as full JSON. Also, we are issuing a federated query across ~100 tenants, which is far slower with remote read compared to the range APIs.

@rishabhkumar92
Author

rishabhkumar92 commented Nov 7, 2024

Trace-4f3442-2024-11-07 15_12_26.json

Attaching a trace of an instant query which took 20+ seconds

@pracucci
Collaborator

Attaching a trace of an instant query which took 20+ seconds

Thanks. I tried to load it in the Jaeger UI but it doesn't work (apparently it's an invalid format for Jaeger). What format is the trace? Which application did you use to export it? Sorry for the ping-pong, but it would be great if you could give me a trace that loads in the Jaeger UI.

To test it in Jaeger you can run it with:

docker run -p 16686:16686 jaegertracing/all-in-one:latest

Then upload the .json and see if it works. Thanks!

@rishabhkumar92
Author

@pracucci I downloaded it from the Grafana UI; can you try visualizing it in Grafana?

@mattsimonsen

@rishabhkumar92 - I was unable to get Trace-4f3442-2024-11-07.15_12_26.json to load in Grafana Cloud; it fails with a parse error.

Could you try exporting for Jaeger and/or send a trace that will upload into Grafana Cloud as an alternative?

@rishabhkumar92
Author

@mattsimonsen I was able to load the JSON in the Zipkin UI to visualize the trace; unfortunately, we don't have a way to download a trace that is compatible with Jaeger.

https://github.com/openzipkin/zipkin?tab=readme-ov-file

@pracucci
Collaborator

pracucci commented Dec 12, 2024

Sorry for the late reply.

In the trace we can see that the most nested item with a high latency is SeriesChunksStreamReader.StartBuffering():

[trace screenshot]

However, SeriesChunksStreamReader.StartBuffering() tracks the time it takes to stream series labels and chunks from an ingester to the querier. The high latency could be due either to the ingester being slow or to the querier being slow. If the querier is slow at reading, the latency of SeriesChunksStreamReader.StartBuffering() will be high because of the backpressure mechanism. Essentially, an ingester has a buffer of data sent to the querier; once the buffer is full, the ingester pauses sending until the buffer has room for more series data.
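
To illustrate the effect, here is a toy Go sketch (not Mimir's actual code): the two sides behave like a producer and a consumer connected by a bounded buffer, and when the consumer drains slowly, the producer's "buffering" time balloons even though it has the data ready.

// Conceptual sketch of backpressure between a fast "ingester" and a slow "querier".
package main

import (
	"fmt"
	"time"
)

func main() {
	const bufferedBatches = 5
	batches := make(chan []byte, bufferedBatches) // bounded buffer between the two sides

	// "Ingester": produces series/chunk batches as fast as it can.
	go func() {
		defer close(batches)
		for i := 0; i < 20; i++ {
			start := time.Now()
			batches <- make([]byte, 1024) // blocks once the buffer is full
			if wait := time.Since(start); wait > time.Millisecond {
				fmt.Printf("ingester paused %v waiting for the querier (batch %d)\n", wait, i)
			}
		}
	}()

	// "Querier": a slow consumer, e.g. because the remote read client drains
	// the response slowly or the query runs on a single saturated core.
	for range batches {
		time.Sleep(20 * time.Millisecond)
	}
}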

Since you mentioned that a range query fetching the same raw series is fast, I would guess it's not the ingester that is slow, but the querier.

The remote read API is a streaming API, meaning everything is implemented in a streaming way. The querier may be slow because its CPU is saturated, or because the client that sent the API request is slow at reading.

As you theorised, one reason why the querier may be slow is simply that remote read requests are not sharded. A single remote read request is executed single-threaded in the querier, so it doesn't even scale to multiple CPU cores. Remote reads are not that common a use case in Mimir, so we haven't invested in all the performance optimizations we've done for instant and range queries.

Another theory is that the client is slow at reading. In this case, seeing a high latency on the remote read API endpoint is just a side effect of a slow client. As a test to exclude this option, you could try to run the same query using the mimirtool remote-read stats command. It's a CLI command that runs a remote read against a remote endpoint and prints some stats about the queried series. The processing done by this command is lightweight, so I don't expect it to significantly impact the latency measurement. Obviously, run this command from a machine with a high-bandwidth network connection to your Mimir cluster.
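
An invocation would look roughly like the following (address, path, and selector are placeholders; check mimirtool help remote-read for the exact flags available in your version, and add tenant/auth flags as needed for your setup):

mimirtool remote-read stats \
  --address http://<mimir-host> \
  --remote-read-path /prometheus/api/v1/read \
  --selector 'services_platform_service_request_count{namespace=~".*-staging$"}'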

I hope this gives you some insight to help you investigate further.
