
proposal: Moving Caching part of query-frontend to separate project. #1672

Closed · bwplotka opened this issue Sep 17, 2019 · 10 comments
Labels: keepalive (Skipped by stale bot)

@bwplotka (Contributor)
Hi 👋

A month ago @tomwilkie merged a PR that makes query-frontend capable of caching responses for queries against any Prometheus API. Details were presented at the Prometheus London Meetup.

Now, this is an amazing piece of work, as it allows the simple and clear Cortex response caching (with day splitting!) to be used against any Prometheus-based backend. Requests against metric backends are often expensive, have small result output, and are issued concurrently and repetitively, so it makes sense to treat such a caching component as a must-have, even for vanilla Prometheus. As the Thanos maintainers, we have been looking for exactly something like this for some time. Overall, it definitely looks like Cortex and Thanos are trying to solve a very similar goal.

From the Thanos side, we want to make it the default caching solution that we recommend, document, and maintain.

However, such caching is still heavily bound to Cortex. It has quite a complex queuing engine, which has already been proposed to be extracted from the caching. I believe that splitting the caching into a separate project (promcache?), in some common org like https://github.com/prometheus-community, can have many advantages around contributing, clarity, and adoption. I enumerate some benefits further down.

Proposal

  1. Move the query-frontend caching logic to a separate Go module (plus a cmd to run it), e.g. https://github.com/prometheus-community/promcache
    • Name of the project is to be defined (:
  2. Add maintainers who want to help from both Cortex and Thanos as the project owners.
  3. Make it clear that this is a caching project for Prometheus API, Cortex, and Thanos backends.
    • Open questions:
      • What if other backends want something extra? VM, M3DB?
      • Should we embed retries and limits as well? (IMO yes)
  4. Allow Cortex to use it either as a library in query-frontend or to just point to query-frontend (without caching)
  5. Allow Thanos to use it as a library in Querier (potentially) or to spin it up on top of Querier (must-have)

If we agree on this, we (the Thanos team) are happy to spin this project up: prepare the repo, the Go module, and initial docs, and extract the caching logic from query-frontend. Then we can focus on embedding caching in existing components like Querier or query-frontend and use promcache as a library if needed.

Benefits of moving the caching part of query-frontend into a separate project:

  • Share responsibility for maintaining promcache across both Thanos and Cortex teams.
  • More focused project! (caching, retries, limits around Prometheus Query APIs)
    • Easier to understand, easier collaboration, documentation, starting up
    • Separate versioning
    • Easier to use as a library (fewer deps)
    • Easier to justify adjustments for Cortex & Thanos:
      • While some logic is common, there might be separate changes required for Cortex and Thanos:
        • Cortex: QoS, queueing, multi-tenancy;
        • Thanos: splitting by ranges other than days when using downsampled data, partial response logic, etc.
  • The first step to join forces and the collaboration between Cortex & Thanos!
    • Space to agree on a common queuing API inspired by Cortex that might be useful for Thanos or even vanilla Prometheus
    • Space to agree on multi-tenancy, QoS, retry, limits mechanisms together ❤️
  • Beneficial for Cortex itself:

What could be missing in the current query-frontend caching layer?

  • Client-side load balancing for the downstream API
    • E.g. in Kubernetes it's hard to load balance the Queriers equally (round-robin)
  • Adjustments for Thanos as mentioned above.
  • Caching other Prometheus APIs (label/values, series)
  • Other caching backends

Initial Google Doc proposal.

Thanks @gouthamve for the input so far!

cc @bboreham @tomwilkie and others (: What do you think?

@cyriltovena (Contributor)

I'm actually working on something very similar: I'm extracting everything Cortex-related from the frontend package to make it reusable in Loki.

@bwplotka (Contributor, Author)

> everything

What exactly? (:

@ivan-kiselev commented Sep 20, 2019

Yeah, I'm about to introduce cortex-frontend into an existing Thanos setup, and the Thanos-native "partial response" feature, which Cortex doesn't support, introduces some complexity into the configuration.

I mean, I understand that cortex-frontend is something that claims to be Prometheus API compatible, and partial_response is somewhat of a Thanos extension to it, thus isn't supposed to be supported out of the box, but it'd be nice to have.

@cyriltovena (Contributor)

My first step is to make the frontend package fully agnostic of backends.

  • Removing the middleware setup from the constructor.
  • Adding a setup method on the frontend to hook in middleware after construction.
  • Making a middleware interface that relies only on HTTP.
  • Moving the retry middleware to the frontend, as it can be agnostic.
  • Removing any references to the queryrange package.

The idea is to have backend-specific middleware, hooked in on startup, so Loki can have its own way of splitting queries but can still use the same retry, transport, and queue mechanisms as Cortex.

In the proposal here, we want to reuse the caching middleware but create new Thanos-specific ones (that may need to use the Thanos Store API); I believe the work I'm doing should also help.

However, I totally agree that having this in another project/repo would be easier for everyone. My only concern is: can we keep Loki on the table? E.g. I should be able to create middleware that is not compatible with the /query_range API of Prometheus.

@cyriltovena (Contributor) commented Sep 20, 2019

@homelessnessbo it seems like my work would be beneficial to you too: basically, using the frontend package with a non-Prometheus-compatible API.

@bwplotka (Contributor, Author) commented Oct 1, 2019

Sorry, I was out on holidays for a bit.

> However, I totally agree that having this in another project/repo would be easier for everyone. My only concern is: can we keep Loki on the table? E.g. I should be able to create middleware that is not compatible with the /query_range API of Prometheus.

@cyriltovena that's definitely a good question: how "generic" do we want to be? For logs, the characteristics are totally different: different APIs, and the format in the cache backend will probably be totally different as well. The risk with being generic is that we might end up with yet another L7 proxy (like Envoy) ;p So the question is how much we can reuse.

> In the proposal here, we want to reuse the caching middleware but create new Thanos-specific ones (that may need to use the Thanos Store API); I believe the work I'm doing should also help.

So we don't want to use the StoreAPI directly. In the same way, in Cortex, this caching middleware does not talk (queue) to ingesters or chunk stores directly. In both projects there is something like a Querier, which is the only component that does PromQL evaluation, so it can be exposed via something like the Prometheus API (although AFAIK Cortex uses a queuing gRPC service for that, it really has the same parameters). That's important, as it means that these caching middlewares, and the whole promcache project, can be focused on the Prometheus API with some extensions. Those extensions would differ between Cortex/Thanos and maybe other long-term storage projects. Hopefully, those differences can be reduced over time as well.

> I'm about to introduce cortex-frontend into an existing Thanos setup, and the Thanos-native "partial response" feature, which Cortex doesn't support, introduces some complexity into the configuration.

Yup, but for this our potential promcache would just add a certain argument (partial_response=false) to the HTTP Prometheus API calls when talking to Thanos, and that's it.

To sum up, bringing full Loki support here might be difficult, but I'm not sure; maybe we can be generic enough, or maybe we can allow reusing only some key middlewares. (: @gouthamve @tomwilkie @codesome @brancz any thoughts?

stale bot commented Feb 3, 2020

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Feb 3, 2020
@stale stale bot closed this as completed Feb 18, 2020
@bwplotka (Contributor, Author)

This would still be nice. (:

We are just about to start putting more work and design into this piece from the Thanos perspective.

BTW, how is query parallelization/sharding going?

cc @pracucci @metalmatze

@pracucci pracucci reopened this Feb 18, 2020
@stale stale bot removed the stale label Feb 18, 2020
@pracucci pracucci added the keepalive Skipped by stale bot label Feb 18, 2020
@bwplotka (Contributor, Author) commented Apr 14, 2020

OK, we ultimately bumped into a somewhat unexpected issue, which is "confusion" (: TL;DR: from the Thanos user side it's quite hard to deploy the Cortex frontend, as it's a bit inconsistent with what we have for Thanos (for example, configuration), so it's quite confusing for the community.

Still, we want to use the Cortex code for it, so we decided to create a new Thanos component called frontend, which really just imports and wraps the Cortex queryrange and frontend packages 🤗 So far contributing to Cortex has been quite smooth, so I don't see any immediate need to move this code into yet another repo 👍 Details: thanos-io/thanos#2434

We will make sure we contribute more to the Cortex frontend; it needs some care for sure (downsampling, subqueries, and more).

@bwplotka (Contributor, Author)

Context: thanos-io/thanos#2454
