More granular query performance metrics for Thanos store #4895

Open
moadz opened this issue Nov 23, 2021 · 10 comments

moadz commented Nov 23, 2021

Is your proposal related to a problem?

This is a follow-up to the discussion started by @ianbillett in #4674 about measuring query performance in Thanos.

tl;dr: We would like to measure query path latency in a reliable and consistent way to supplement our read path SLIs.

Currently, in Red Hat's internal observability service, we're using Avalanche in conjunction with the http_request_duration_seconds metric on the query_range and query handlers.

This gives us some idea of overall latency for each query request we process, but does not allow us to investigate the impact of 'query shape' and data distribution across our store targets.

Describe the solution you'd like

There are a few reasons why measuring the latency on the query and query_range endpoints is a poor indicator of how much 'work' Thanos is doing. Thanos fans out the 'select' phase of the query to each of its store targets (other stores, sidecars, etc.), streaming the series from each target over the network before executing the query on the resulting stream. This means that we actually touch far more series over the network than are returned in the final query response. An ideal metric for query performance must include the total number of samples and series retrieved remotely before the query is executed.

Ideally this metric will include:

  • [Must] measure the aggregate number of samples returned during the select phase of a fanned-out query
  • [Must] measure the aggregate number of series returned during the select phase of a fanned-out query
  • [Could] measure the aggregate response size (bytes) of fanned-out requests (this is less critical, as response size can be approximated from the number of series/samples)

Mercifully, we already capture these stats in tracing spans on the query path. Unfortunately, traces are sampled and are difficult to query reliably for SLIs. My suggestion is that we add a new histogram, thanos_store_query_duration_seconds, that persists the total elapsed time for the query, with labels for series_processed and samples_processed attached to each observation to allow querying along a particular dimension (e.g. the 0.99 request duration quantile for queries with > 1,000,000 samples returned). Ideally we would represent this with an N-dimensional histogram, but this is not possible with Prometheus.
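
As a rough, minimal sketch of what this could look like with prometheus/client_golang (the metric name and label names are taken from the proposal above; the buckets and everything else are illustrative, not final):

package store

import "github.com/prometheus/client_golang/prometheus"

// queryDuration sketches the proposed histogram: the observed value is the
// total elapsed time of the fanned-out select, and the two extra labels place
// the query into coarse series/samples buckets (how the label values are
// derived is discussed below). Registration with a Registerer is omitted.
var queryDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "thanos_store_query_duration_seconds",
		Help:    "Duration of the fanned-out select phase, partitioned by the amount of data touched.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"series_processed", "samples_processed"},
)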

Given that the number of samples/series returned varies widely from query to query, how do we manage cardinality?

Yes; in fact, if we used the raw sample/series counts as label values, I suspect this metric would be totally useless. Instead, we should define a set of 5-10 distinct 't-shirt size' buckets for the samples/series dimensions.

e.g. an exponential scale:

Samples:
s <= 1000 samples, m <= 10_000 samples, l <= 100_000 samples, xl <= 1_000_000 samples

Series:
s <= 10 series, m <= 100 series, l <= 1000 series, xl <= 10_000 series, xxl <= 100_000 series

So if a query returns 50,000 samples spanning 10 series in 32s, it would be recorded as an observation of 32s with the labels sample-size=l, series-size=s.

This would allow us to define query SLIs with respect to a specific number of series/samples and response time.

e.g. 90% of queries for up to 1,000,000 samples and up to 10 series complete in less than 2s.
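
Continuing the sketch above, a hypothetical helper could map raw counts onto these t-shirt sizes before observing the histogram. The thresholds and size names below simply mirror the example scale and are not meant to be final:

// sizeLabel maps a raw count onto the first size whose threshold it does not
// exceed; anything larger than every threshold falls into the last size.
func sizeLabel(count int64, thresholds []int64, sizes []string) string {
	for i, t := range thresholds {
		if count <= t {
			return sizes[i]
		}
	}
	return sizes[len(sizes)-1]
}

// observeSelect records one observation for a completed select phase.
func observeSelect(seconds float64, series, samples int64) {
	seriesSize := sizeLabel(series, []int64{10, 100, 1_000, 10_000, 100_000}, []string{"s", "m", "l", "xl", "xxl"})
	sampleSize := sizeLabel(samples, []int64{1_000, 10_000, 100_000, 1_000_000}, []string{"s", "m", "l", "xl"})

	// e.g. 50,000 samples across 10 series in 32s -> samples_processed="l", series_processed="s".
	queryDuration.WithLabelValues(seriesSize, sampleSize).Observe(seconds)
}

An SLI like the one above could then be computed by filtering the histogram's buckets on these label values before applying histogram_quantile.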

How will we determine appropriate buckets for our additional dimensions?

This is my primary concern. Given the countless permutations of node size, data distribution, and Thanos topology that might influence query sizes and response times, it is unlikely that any fixed set of thresholds will be appropriate for all deployments of Thanos. As a result, the 't-shirt sizes' will have to be configurable (possibly via runtime args) with some sensible defaults, so that users can tweak them. The obvious caveat is that once this metric is being recorded, any change to the bucketing would render previously recorded data incomparable/incomprehensible. I would appreciate guidance here if possible.

Describe alternatives you've considered

  • Separate histograms bucketed by series/samples instead of response time. This would be sub-optimal, as we would not be able to correlate both dimensions in the same observation.

I'm keen to understand how others have measured query performance (systematically) in the context of Thanos/Prometheus, or the potential issues with introducing a metric of this type.

bwplotka commented Nov 24, 2021

I think it would be nicer if you proposed this as a PR to Thanos, following the proposal process: https://thanos.io/tip/contributing/proposal-process.md/.

This will allow us to start a discussion on each separate part of the proposal.

Otherwise, I like it. I wonder only:

  • Are S/M/L configurable? (probably they should)
  • Maybe it's better to put literal numbers instead of S/M/L to avoid confusion if S, M, L are dynamic? We could name them series_le, samples_le (similar to a normal bucket's "lower or equal").

I don't see any issues if we keep the number of permutations to around 6, and maybe make it optional (:

bwplotka commented Nov 24, 2021

cc @GiedriusS @yeya24 @wiardvanrij, our SLO expert @metalmatze, @kakkoyun @brancz for more opinions (:

@wiardvanrij

Honestly, your questions go deeper than the level at which I work with our stack, so I haven't formed a complete opinion on it, other than that I pretty much agree with everything you said.

As for the solution: I remembered a potential new feature that might resolve your implementation problem, sparse histograms: https://youtu.be/F72Tk8iaWeA

moadz commented Nov 25, 2021

  • Are S/M/L configurable? (probably they should)

Yes. I'm contemplating what this configuration would look like. Perhaps passing in arrays for each of the 'histogram partitions' (i.e. the labels we attach to observations that categorise the number of series and the number of samples), e.g.:

// Buckets for labelling query performance metrics with respect to duration of query
--query.telemetry.request-duration-seconds-quantiles = [ 0.1, 0.25, 0.75, 1.25, 1.75, 2.5, 3, 5, 10 ]

// Buckets for labelling query performance metrics with respect to number of samples
--query.telemetry.sample-quantiles = [ 100, 1_000, 10_000, 100_000, 1_000_000 ]

// Buckets for labelling query performance metrics with respect to number of series
--query.telemetry.series-quantiles = [ 10, 100, 1000, 10_000, 100_000 ]
  • Maybe it's better to put literal numbers instead of S/M/L to avoid confusion if S, M, L are dynamic? We could name them series_le, samples_le (similar to a normal bucket's "lower or equal").

Yeah, agreed. If these are going to be configurable, it's probably useful to encode the literal numbers instead of t-shirt sizes to avoid confusion. On the other hand, I guess using named sizes instead of numbers helps when aggregating across bucket changes, since numeric label values would change if the thresholds are reconfigured.
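
To illustrate the series_le/samples_le idea, a hypothetical helper could turn a count into the smallest configured threshold that covers it, so the label carries a literal number. The thresholds used in the example comment are just the defaults sketched in the flags above:

import (
	"sort"
	"strconv"
)

// leLabel returns the smallest configured threshold that is >= count,
// formatted as a label value (e.g. samples_le="100000"); counts above every
// threshold get "+Inf", mirroring a normal bucket's upper bound.
func leLabel(count int64, thresholds []int64) string {
	sort.Slice(thresholds, func(i, j int) bool { return thresholds[i] < thresholds[j] })
	for _, t := range thresholds {
		if count <= t {
			return strconv.FormatInt(t, 10)
		}
	}
	return "+Inf"
}

// e.g. leLabel(50_000, []int64{100, 1_000, 10_000, 100_000, 1_000_000}) == "100000",
// so the earlier example observation would carry samples_le="100000" instead of "l".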

moadz commented Dec 8, 2021

The initial proposal is up, @bwplotka: #4932, with some caveats around what the metric will actually capture (specifically just the select phase of a query in thanos/store/proxy, which excludes any subsequent query processing).
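
For a sense of where such an observation could hook in, here is a very rough sketch of wrapping the select fan-out so that the whole streaming phase is timed and counted. The wrapper and counter types are hypothetical, not the actual thanos/store/proxy code, and it reuses the observeSelect helper sketched earlier:

import "time"

// selectStats is an illustrative counter that the select fan-out would update
// as merged series are streamed back to the querier.
type selectStats struct {
	series  int64
	samples int64
}

// instrumentedSelect times the fanned-out select and records a single
// histogram observation covering everything streamed during it.
func instrumentedSelect(selectFn func(stats *selectStats) error) error {
	stats := &selectStats{}
	start := time.Now()
	err := selectFn(stats)
	observeSelect(time.Since(start).Seconds(), stats.series, stats.samples)
	return err
}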

bill3tt commented Jan 6, 2022

@moadz you were talking about the extent of our ability to understand per-tenant usage of the querier, which got me thinking: it could be really nice to extend these query path timing metrics to include the tenant (or some configurable label) as an extra dimension.

This would increase the cardinality significantly, but I can imagine it being very useful.

moadz commented Jan 7, 2022

@ianbillett that would be very useful, but I wonder if it would encourage improper usage of this metric.

  • Single Responsibility Principle: this metric tracks query performance relative to query shape. If we explicitly want to track usage more closely, we would benefit more from a richer query logger (Add query logging/tracing #1706) or a separate metric, as opposed to instrumenting this through an already high-cardinality histogram.
  • Is the number of series/samples touched a good proxy for usage? We don't instrument bytes/bandwidth, and there are other factors that contribute to how long a query takes, such as where the data is being retrieved from and how many concurrent queries are being handled at that particular point. A secondary point is that we are bucketing these dimensions, so it would be difficult to determine exact numbers from this metric.
  • The metric is already quite high in cardinality. If we were to parameterise it on, for example, a set of LabelNames that, if present, are attached to the observation alongside their values, it would be very easy for a user to introduce orders of magnitude of performance impact without really thinking about it.

bill3tt commented Jan 11, 2022

You make some good points there @moadz - I think the richer query logger is what I had in mind :)

stale bot commented Apr 16, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale bot added the stale label Apr 16, 2022
GiedriusS removed the stale label Apr 17, 2022

stale bot commented Sep 21, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

stale bot added the stale label Sep 21, 2022