Add thanos query frontend sub command #2973

Merged
merged 13 commits into from
Aug 11, 2020

Conversation

yeya24 (Contributor) commented Aug 4, 2020

Signed-off-by: Ben Ye [email protected]

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

The current code is almost the same as the Cortex frontend. I just removed the sharding middleware because it seems to be specific to Cortex.

TODO:

  • Implement the Cortex cache.Cache interface; this should be done in another PR.
  • Unit tests and E2E tests.
  • Support parsing Thanos-specific query parameters such as dedup, max_source_resolution, partial_response, etc. These params should be part of the cache key because they affect the query results. We need to implement a Thanos codec to decode/encode Thanos query requests. See yeya24/thanos#159 (Supporting Thanos codec to parse Thanos specific query parameters).

Verification

@bwplotka bwplotka self-requested a review August 4, 2020 16:44
@yeya24 yeya24 mentioned this pull request Aug 5, 2020
2 tasks
brancz (Member) commented Aug 6, 2020

> IMHO Thanos should implement its own Request type just like the loki request. Is it better for me to include it in this pr, or in another pr?

I haven't had the chance to review this PR itself, but since this work will be ongoing, I would prefer to merge a minimum state that works good enough ™️ and then iterate on it. So to be explicit, I am for this, but in a follow-up PR.

@yeya24 yeya24 force-pushed the add-thanos-query-frontend branch 4 times, most recently from 33d71b9 to 115499a Compare August 7, 2020 15:22
@yeya24 yeya24 changed the title WIP: Add thanos query frontend sub command Add thanos query frontend sub command Aug 7, 2020
@yeya24 yeya24 marked this pull request as ready for review August 7, 2020 20:17
@yeya24 yeya24 requested a review from bwplotka August 7, 2020 20:28
bwplotka (Member) left a comment

This looks epic for a start 👍

Let's rebase on master since we merged the upgrade PR, wdyt? (:

Otherwise is good for first iteration 💪

The only thing is that I would try to create config as we have for other caches in store so we are ready for other cache providers (:

BoolVar(&c.cacheResults)

cmd.Flag("query-range.split-interval", "Split queries by an interval and execute in parallel, 0 disables it.").
Default("24h").DurationVar(&c.splitInterval)

Member

Let's leave it for now, but we probably need to expose it better (:


func (c *responseCacheConfig) registerFlag(cmd *kingpin.CmdClause) {
c.fifoCache.registerFlag(cmd)
cmd.Flag("query-range.response-cache-max-freshness", "Most recent allowed cacheable result, to prevent caching very recent results that might still be in flux.").

Member

I believe this should be overall general config not fifo? 🤔

yeya24 (Contributor Author), Aug 9, 2020

Yes, you are right. It is a general cache config so I named it query-range.response-cache-max-freshness and it is available for all caches.


// fifoCacheConfig defines configurations for Cortex fifo cache.
type fifoCacheConfig struct {
maxSizeBytes units.Base2Bytes

Member

BTW I think we should have those as cache client now and use our cache flags like in:

indexCacheConfig := extflag.RegisterPathOrContent(cmd, "index-cache.config",

Contributor Author

Done.

m[comp.String()] = func(g *run.Group, logger log.Logger, reg *prometheus.Registry, _ opentracing.Tracer, _ <-chan struct{}, _ bool) error {

return runQueryFrontend(
g,

Member

no need for this style - I think it can fit on a line (:

conf.queryRangeConfig.respCacheConfig.cacheMaxFreshness,
)

// TODO(yeya24): support other cache when available.

Member

👍

yeya24 (Contributor Author) commented Aug 7, 2020

I got some build errors after rebasing on master.

# github.com/cortexproject/cortex/pkg/querier/queryrange
../../../go/pkg/mod/github.com/cortexproject/[email protected]/pkg/querier/queryrange/value.go:92:2: invalid case parser.ValueTypeVector in switch on promRes.Data.ResultType (mismatched types parser.ValueType and string)
../../../go/pkg/mod/github.com/cortexproject/[email protected]/pkg/querier/queryrange/value.go:92:2: invalid case parser.ValueTypeMatrix in switch on promRes.Data.ResultType (mismatched types parser.ValueType and string)
!! command failed: build -o /home/yeya24/go/bin/thanos -ldflags -X github.com/prometheus/common/version.Version=0.14.0 -X github.com/prometheus/common/version.Revision=f3b29ab18ede95baad53cda564b490916efb5407 -X github.com/prometheus/common/version.Branch=add-thanos-query-frontend -X github.com/prometheus/common/version.BuildUser=yeya24@yeya24 -X github.com/prometheus/common/version.BuildDate=20200807-21:34:12  -extldflags '-static' -a -tags netgo github.com/thanos-io/thanos/cmd/thanos: exit status 2
make: *** [Makefile:114: build] Error 1

It seems the Prometheus dependency needs to be updated on the Cortex side as well.
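
The compiler message says the `case` constants have type `parser.ValueType` while the switch operand is a plain `string`. The sketch below reproduces that situation with a local stand-in type; the `ValueType` declared here is an assumption modelled on the error text, not code copied from Prometheus, and the conversion shown is only one possible way to make the types agree.

```go
package main

import "fmt"

// ValueType is a local stand-in for a named string type such as the
// parser.ValueType mentioned in the error message.
type ValueType string

const (
	ValueTypeVector ValueType = "vector"
	ValueTypeMatrix ValueType = "matrix"
)

// describe receives the result type as a plain string, like the field in
// the failing switch statement.
func describe(resultType string) string {
	// `switch resultType { case ValueTypeVector: ... }` does not compile:
	// a typed constant cannot be compared with a string operand.
	// Converting the operand makes both sides the same type.
	switch ValueType(resultType) {
	case ValueTypeVector:
		return "instant vector"
	case ValueTypeMatrix:
		return "range matrix"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(describe("vector"))
	fmt.Println(describe("matrix"))
	fmt.Println(describe("scalar"))
}
```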

bwplotka (Member) commented Aug 7, 2020

That might be true. Here comes cyclic deps (:

bwplotka (Member) commented Aug 7, 2020

Let's propose a PR on their side cc @pracucci

yeya24 (Contributor Author) commented Aug 8, 2020

> Let's propose a PR on their side cc @pracucci

I will prepare a pr.


r.Get("/labels", instr("labels", handleFunc))
r.Post("/labels", instr("labels", handleFunc))
}

Contributor Author

Is it necessary to add stores and rules APIs here?

Member

Well, it depends on how we handle this. I think the query frontend was passing through all undefined endpoints, but we should test that. If that's the case, why do we define labels, series, and values if they are not cached?

Member

Can we check it? (:

yeya24 (Contributor Author) commented Aug 9, 2020

I have updated my pr to support the response cache config file. PTAL when you have time. For the Cortex dependency error, I opened a pr cortexproject/cortex#3000 already.

@yeya24 yeya24 force-pushed the add-thanos-query-frontend branch 3 times, most recently from eda511a to 67c5609 Compare August 10, 2020 15:47
bwplotka (Member) left a comment

Hey, some suggestions, mostly around different paths. The current implementation is super vague about what is cached, what is split, what is retried, etc. What if some path is not in api.go? (:

Can we make it explicit? For example if no retry, splitting is expected for ALL but query_range, can we just remove all but query_range from API? maybe leaving query instant makes sense as slow query log is nice for those (: WDYT?

Beside that it's amazing! 💪 Great job! We are missing docs, but we can add those later (:

@@ -35,6 +35,7 @@ We use *breaking* word for marking changes that are not backward compatible (rel
- [#2865](https://github.com/thanos-io/thanos/pull/2865) ui: Migrate Thanos Ruler UI to React
- [#2964](https://github.com/thanos-io/thanos/pull/2964) Query: Add time range parameters to label APIs. Add `start` and `end` fields to Store API `LabelNamesRequest` and `LabelValuesRequest`.
- [#2996](https://github.com/thanos-io/thanos/pull/2996) Sidecar: Add `reloader_config_apply_errors_total` metric. Add new flags `--reloader.watch-interval`, and `--reloader.retry-interval`.
- [#2973](https://github.com/thanos-io/thanos/pull/2973) Add Thanos Query Frontend component.

Member

Docs, would be nice, maybe in next PR

cmd.Flag("query-range.max-query-parallelism", "Maximum number of queries will be scheduled in parallel by the frontend.").
Default("14").IntVar(&c.maxQueryParallelism)

cmd.Flag("query-range.response-cache-max-freshness", "Most recent allowed cacheable result, to prevent caching very recent results that might still be in flux.").

Member

We need better help flag description - in separate PR is ok

Member

essentially let's describe why this is needed

Contributor Author

Makes sense. We can help update the flag description in Cortex as well.


// TestRoundTripRetryMiddleware tests the retry middleware.
func TestRoundTripRetryMiddleware(t *testing.T) {
testRequest := &queryrange.PrometheusRequest{
Path: "/api/v1/query_range",

Member

What if path is totally different? not query range? Can we test it?

{
name: "disable split",
req: &queryrange.PrometheusRequest{
Path: "/api/v1/query_range",

Member

again, can we test different paths? just to see what to expect

bwplotka (Member) commented

Also, the tests still flake somehow. I wonder if it's SWIFT flakiness.

yeya24 (Contributor Author) commented Aug 10, 2020

> Well, it depends on how we handle this. I think the query frontend was passing through all undefined endpoints, but we should test that. If that's the case, why do we define labels, series, and values if they are not cached?

Not passing through. Please see cortexproject/cortex#2742. TBH I am not sure if this is the behavior we want to have.
Like Cortex, only defined endpoints are accessible now. That's why the frontend needs to register other endpoints.

Only query range results are cached. The workflow is:

  1. Non-query range but defined endpoints -> pass through to downstream URL. So no middlewares will be applied here.
  2. query range -> go through all defined middlewares and finally go to downstream.
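
The second path ("go through all defined middlewares and finally go to downstream") can be pictured with a small chaining sketch. The `handler`, `middleware`, `chain`, and `tag` names are hypothetical and simplified to strings; the real Cortex types operate on request and response objects, and the middleware order used here is only illustrative.

```go
package main

import "fmt"

// handler processes a query and returns a response; a simplified stand-in
// for a range-query handler.
type handler func(query string) string

// middleware wraps a handler with extra behavior (split, cache, retry, ...).
type middleware func(next handler) handler

// chain applies the middlewares so that the first one listed runs first and
// the downstream handler runs last.
func chain(downstream handler, mws ...middleware) handler {
	h := downstream
	for i := len(mws) - 1; i >= 0; i-- {
		h = mws[i](h)
	}
	return h
}

// tag returns a middleware that wraps the inner result with its name, which
// makes the call order visible in the output.
func tag(name string) middleware {
	return func(next handler) handler {
		return func(q string) string { return name + "(" + next(q) + ")" }
	}
}

func main() {
	downstream := func(q string) string { return "downstream:" + q }
	h := chain(downstream, tag("split"), tag("cache"), tag("retry"))
	fmt.Println(h("up"))
}
```

Non-range endpoints in path 1 would call `downstream` directly, skipping the chain.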

> this is weird. I think we put this tripperware for query, query range but also /label series label/.../values?

Umm, I agree we should separate labels for other endpoints as well. Cortex does the same thing

yeya24 (Contributor Author) commented Aug 10, 2020

bwplotka (Member) commented

In this case let's do the following:

  • Use handler in query_range and query since we will soon modify it to use our tripperware
  • For everything else let's have (SEPARATE!) passthrough handler. WDYT? (: Also I am surprised it does not work as our UI works just fine on query frontend 🤔
    TBH I don't feel like we need allowlist all of endpoints. Maybe we can catch query + query range for now and pass through rest?
    Make it more transparent (:

yeya24 (Contributor Author) commented Aug 11, 2020

> In this case let's do following:
>
>   • Use handler in query_range and query since we will soon modify it to use our tripperware
>   • For everything else let's have (SEPARATE!) passthrough handler. WDYT? (: Also I am surprised it does not work as our UI works just fine on query frontend 🤔
>     TBH I don't feel like we need allowlist all of endpoints. Maybe we can catch query + query range for now and pass through rest?
>     Make it more transparent (:

Hello, I agree that we shouldn't limit the endpoints here and it is better to pass through all other routes to downstream querier to make sure everything works fine (like Grafana).

But I am not sure why a separate handler is needed? Can I just reuse the downstream roundtripper implemented in Cortex frontend? https://github.com/cortexproject/cortex/blob/master/pkg/querier/frontend/frontend.go#L118. This works but TBH I don't know whether this is a good pattern or not.

I just found it not as easy as I thought to deal with the default route with github.com/prometheus/common/route package, so I removed the router and added the routing logic to the tripperware. WDYT? This way we don't need to have other routers.

	return func(next http.RoundTripper) http.RoundTripper {
		queryRangeTripper := queryrange.NewRoundTripper(next, codec, queryRangeMiddleware...)
		return frontend.RoundTripFunc(func(r *http.Request) (*http.Response, error) {
			switch r.URL.Path {
			case "/api/v1/query":
				if r.Method == http.MethodGet || r.Method == http.MethodPost {
					queriesCount.WithLabelValues(labelQuery).Inc()
				}
			case "/api/v1/query_range":
				if r.Method == http.MethodGet || r.Method == http.MethodPost {
					queriesCount.WithLabelValues(labelQueryRange).Inc()
					return queryRangeTripper.RoundTrip(r)
				}
			default:
			}
			return next.RoundTrip(r)
		})
	}, nil

yeya24 (Contributor Author) commented Aug 11, 2020

PR is updated and I added more test cases. PTAL tomorrow

bwplotka (Member) commented

Happy with whatever makes more sense (: Will look today

bwplotka (Member) left a comment

Amazing.

I like it. What I am only missing here are some docs (can do in next PR) and tests for different pass through items. Again good for next PR (:

})
return hf
}
srv.Handle("/", injectf(fe.Handler().ServeHTTP))

Member

👍


t.Run("same range query, cache hit.", func(t *testing.T) {
// Run the same range query again, the result can be retrieved from cache directly.
rangeQuery(

Member

can we check that other endpoints are pass through?

yeya24 (Contributor Author), Aug 11, 2020

I already tested labelNames and labelValues here https://github.com/thanos-io/thanos/blob/master/test/e2e/query_frontend_test.go#L99-L118. I am not sure whether this is what you want or not.

@bwplotka bwplotka merged commit 2ea2c2b into thanos-io:master Aug 11, 2020
@yeya24 yeya24 deleted the add-thanos-query-frontend branch August 11, 2020 12:41