
proposal: Moving Caching part of query-frontend to separate project. #1672

Closed · bwplotka opened this issue Sep 17, 2019 · 10 comments
Labels: keepalive (Skipped by stale bot)

@bwplotka (Contributor)
Hi 👋

A month ago @tomwilkie merged a PR that makes query-frontend capable of caching responses for queries against any Prometheus API. Details were presented at the Prometheus London Meetup.

Now, this is an amazing piece of work, as it allows the simple and clear Cortex response caching (with day splitting!) to be used against any Prometheus-based backend. Requests against metric backends are often expensive, have small result output, and are issued concurrently and repetitively, so it makes sense to treat such a caching component as a must-have, even for vanilla Prometheus. As the Thanos maintainers, we have been looking for exactly something like this for some time. Overall, it definitely looks like Cortex and Thanos are trying to solve a very similar goal.

From the Thanos side, we want to make it the default caching solution that we recommend, document, and maintain.

However, such caching is still heavily bound to Cortex. It has quite a complex queuing engine, which has already been proposed to be extracted from the caching. I believe that splitting the caching into a separate project (promcache?), in some common org like https://github.com/prometheus-community, can have many advantages around contributing, clarity, and adoption. I enumerate some benefits further down.

Proposal

  1. Move the query-frontend caching logic to a separate Go module (plus a cmd to run it), e.g. https://github.com/prometheus-community/promcache
    • Name of the project is to be defined (:
  2. Add maintainers who want to help from both Cortex and Thanos as the project owners.
  3. Make it clear that this is a caching project for Prometheus API, Cortex, and Thanos backends.
    • Open questions:
      • What if other backends want something extra? VM, M3DB?
      • Should we embed retries and limits as well? (IMO yes)
  4. Allow Cortex to use it either as a library in query-frontend or to just point to query-frontend (without caching)
  5. Allow Thanos to use it as a library in Querier (potentially) or to spin it up on top of Querier (must-have)

If we agree on this, we (the Thanos team) are happy to spin this project up: prepare the repo, the Go module, and initial docs, and extract the caching logic from query-frontend. Then we can focus on embedding caching in existing components like Querier or query-frontend and use promcache as a library if needed.

Benefits of moving the caching part of query-frontend into a separate project:

  • Share responsibility for maintaining promcache across both Thanos and Cortex teams.
  • More focused project! (caching, retries, limits around Prometheus Query APIs)
    • Easier to understand, easier collaboration, documentation, starting up
    • Separate versioning
    • Easier to use as a library (fewer deps)
    • Easier to justify adjustments for Cortex & Thanos:
      • While some logic is common, there might be separate changes required for Cortex and Thanos:
        • Cortex: QoS, queueing, multi-tenancy;
        • Thanos: splitting by ranges other than days when using downsampled data, partial response logic, etc.
  • The first step to join forces and the collaboration between Cortex & Thanos!
    • Space to agree on a common queuing API inspired by Cortex that might be useful for Thanos or even vanilla Prometheus
    • Space to agree on multi-tenancy, QoS, retry, limits mechanisms together ❤️
  • Beneficial for Cortex itself:

What could be missing in the current query-frontend caching layer?

  • Client-side load balancing for the downstream API
    • E.g. in Kubernetes it's hard to load balance the Queriers equally (round-robin)
  • Adjustments for Thanos as mentioned above.
  • Caching other Prometheus APIs (label/values, series)
  • Other caching backends

Initial Google Doc proposal.

Thanks @gouthamve for the input so far!

cc @bboreham @tomwilkie and others (: What do you think?

@cyriltovena (Contributor)

I'm actually working on something very similar: I'm extracting everything Cortex-related from the frontend package to make it reusable in Loki.

@bwplotka (Contributor, Author)

> everything

What exactly? (:

@ivan-kiselev commented Sep 20, 2019

Yeah, I'm about to introduce cortex-frontend into an existing Thanos setup, and the Thanos-native "partial response" feature, which Cortex doesn't support, introduces some complexity into the configuration.

I mean, I understand that cortex-frontend is something that claims to be Prometheus API compatible, and partial_response is somewhat of a Thanos extension to it, thus isn't supposed to be supported out of the box, but it'd be nice to have.

@cyriltovena (Contributor)

My first step is to make the frontend package fully agnostic of backends.

  • Removing the middleware setup from the constructor.
  • Adding a setup method on the frontend to hook in middleware after construction.
  • Making a middleware interface that relies only on HTTP.
  • Moving the retry middleware to the frontend, as it can be agnostic.
  • Removing any references to the queryrange package.

The idea is to have backend-specific middleware, hooked in on startup, so Loki can have its own way of splitting queries but can still use the same retry, transport, and queue mechanisms as Cortex.

In the proposal here, we want to reuse the caching middleware but create new Thanos-specific ones (that may need to use the Thanos Store API); I believe the work I'm doing should also help.

However, I totally agree that having this in another project/repo would be easier for everyone. My only concern is: can we keep Loki on the table? E.g. I should be able to create middleware that is not compatible with the /query_range API of Prometheus.

@cyriltovena (Contributor) commented Sep 20, 2019

@homelessnessbo it seems like my work would be beneficial to you too: basically, using the frontend package with a non-Prometheus-compatible API.

@bwplotka (Contributor, Author) commented Oct 1, 2019

Sorry, I was out on holidays for a bit.

> However, I totally agree that having this in another project/repo would be easier for everyone. My only concern is: can we keep Loki on the table? E.g. I should be able to create middleware that is not compatible with the /query_range API of Prometheus.

@cyriltovena that's definitely a good question: how "generic" do we want to be? For logs, the characteristics are totally different: different APIs, and the format in the cache backend will probably be totally different as well. The risk with being generic is that we might end up with yet another L7 proxy (like Envoy) ;p So the question is how much we can reuse.

> In the proposal here, we want to reuse the caching middleware but create new Thanos-specific ones (that may need to use the Thanos Store API); I believe the work I'm doing should also help.

So we don't want to use the StoreAPI directly. In the same way, in Cortex, this caching middleware does not talk (queue) to ingesters or chunk stores directly. In both projects there is something like a Querier, which is the only component that does PromQL evaluation, so it can be exposed via something like the Prometheus API (although AFAIK Cortex uses a queuing gRPC service for that, it really has the same parameters). That's important, as it means that these caching middlewares, and the whole promcache project, can be focused on the Prometheus API with some extensions. Those extensions would differ between Cortex/Thanos and maybe other long-term storage projects. Hopefully, those differences can be reduced over time as well.

> I'm about to introduce cortex-frontend into an existing Thanos setup, and the Thanos-native "partial response" feature, which Cortex doesn't support, introduces some complexity into the configuration.

Yup, but for this our potential promcache would just add a certain argument (partial_response=false) to the HTTP Prometheus API calls when talking to Thanos, and that's it.

To sum up, bringing full Loki support here might be difficult, but I'm not sure; maybe we can be generic enough, or maybe we can allow reusing only some key middlewares. (: @gouthamve @tomwilkie @codesome @brancz any thoughts?

stale bot commented Feb 3, 2020

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Feb 3, 2020
@stale stale bot closed this as completed Feb 18, 2020
@bwplotka (Contributor, Author)

This would still be nice. (:

We are just about to start putting more work and design into this piece from the Thanos perspective.

BTW, how is query parallelization/sharding going?

cc @pracucci @metalmatze

@pracucci pracucci reopened this Feb 18, 2020
@stale stale bot removed the stale label Feb 18, 2020
@pracucci pracucci added the keepalive Skipped by stale bot label Feb 18, 2020
@bwplotka (Contributor, Author) commented Apr 14, 2020

OK, we ultimately bumped into a somewhat unexpected issue, which is "confusion" (: TL;DR: from the Thanos user side it's quite hard to deploy the Cortex frontend, as it's a bit inconsistent with what we have for Thanos (for example, configuration), so it's quite confusing for the community.

Still, we want to use the Cortex code for it, so we decided to create a new Thanos component called frontend, which really just imports and wraps the Cortex queryrange and frontend packages 🤗 So far contributing to Cortex has been quite smooth, so I don't see any immediate need to move this code into yet another repo 👍 Details: thanos-io/thanos#2434

We will make sure we contribute more to the Cortex frontend; it needs some care for sure (downsampling, subqueries, and more).

@bwplotka (Contributor, Author)

Context: thanos-io/thanos#2454
