query: add store.read-timeout parameter to avoid partial response failure when one of stores timed out #895
Conversation
Hi, Thanos team!
Generally I love the idea, but I think we need this timeout in a different place. (:
Plus some minor suggestions on the mocking. Let me know what you think.
Currently we still have one strange place: mergeSet. But I think that's an idea for the next pull request.
@povilasv continued on your PR, @d-ulyanov, here: #920. I hope you are ok with this (:
…lure when one of stores timed out
Co-Authored-By: d-ulyanov <[email protected]>
@d-ulyanov this is solved by #928
As another version of this has been merged, I think it's safe to close this PR. Please shout at me if it needs to be reopened or I made a mistake somewhere. Thank you a lot for your contribution! :)
@GiedriusS it's a different PR, so we still need this. I.e. this timeout is essentially a query timeout, and it cancels the context for the whole query. The current timeout doesn't work correctly because it sets the timeout on the PromQL engine.
Hi, @povilasv
@d-ulyanov sure thing, could you resolve the merge conflicts? I've lost the current state on this one; is this ready for review? Once it is, I can reread it and do another round of review :)
# Conflicts:
#	cmd/thanos/query.go
#	docs/components/query.md
#	pkg/store/proxy.go
#	pkg/store/proxy_test.go
Hey @povilasv
@povilasv It's cool that this PR will appear in v0.4.0.
I have time to finish this PR, but feel free to make any changes if you think something is wrong :) Edits from maintainers are allowed.
Also, please review my comments about testing.
cmd/thanos/query.go
@@ -93,6 +93,9 @@ func registerQuery(m map[string]setupFunc, app *kingpin.Application, name string

	storeResponseTimeout := modelDuration(cmd.Flag("store.response-timeout", "If a Store doesn't send any data in this specified duration then a Store will be ignored and partial data will be returned if it's enabled. 0 disables timeout.").Default("0ms"))

	storeReadTimeout := modelDuration(cmd.Flag("store.read-timeout", "Maximum time to read response from store. If request to one of stores is timed out and store.read-timeout < query.timeout partial response will be returned. If store.read-timeout >= query.timeout one of stores is timed out the client will get no data and timeout error.").
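Both flags above go through a `modelDuration` helper that isn't shown in this hunk. Below is a minimal sketch of how such a helper can work, assuming `*model.Duration` satisfies kingpin's `Value` interface; the helper name comes from the diff, the surrounding `main` is purely illustrative.

```go
package main

import (
	"fmt"
	"time"

	"github.com/prometheus/common/model"
	"gopkg.in/alecthomas/kingpin.v2"
)

// modelDuration registers a kingpin flag whose value is parsed into a
// Prometheus model.Duration (accepting values such as "0ms" or "2m").
func modelDuration(flag *kingpin.FlagClause) *model.Duration {
	value := new(model.Duration)
	flag.SetValue(value) // *model.Duration implements kingpin's Value interface.
	return value
}

func main() {
	app := kingpin.New("sketch", "duration flag example")
	storeResponseTimeout := modelDuration(
		app.Flag("store.response-timeout", "Per-store response timeout.").Default("0ms"))

	kingpin.MustParse(app.Parse([]string{"--store.response-timeout=30s"}))
	fmt.Println(time.Duration(*storeResponseTimeout)) // prints "30s"
}
```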
Agree partially.
In the code we have a `query` layer which also does some logic (deduplication), so we can't really say that `storeReadTimeout` guarantees the request will be killed in time; we are just limiting the store request time. That's why I think it's not honest to use `promql-engine-timeout` here, or we also need to add a ctx to the `query` layer.
@d-ulyanov FYI I've changed the flag names in your branch
@povilasv cool, many thanks!
Cool, thanks for the good work, but I think we can still improve it. Sorry for the review delay!
Some suggestions below. Please be careful on the compatibility side. We are not 1.0 API stable, BUT we don't want to totally confuse our users either.
Also, can I ask what's the point of this logic, what's the use case? E.g. WHEN would you set the (old) `query` timeout different to the `proxy/select/store.timeout` underneath PromQL?
I cannot find any immediate idea. Do you want to time out on select, but allow more time for PromQL?
@@ -44,7 +45,11 @@ type ProxyStore struct {
	component      component.StoreAPI
	selectorLabels labels.Labels

	// responseTimeout is a timeout for any GRPC operation during series query
missing trailing period in comment, same below
@@ -119,6 +126,8 @@ func newRespCh(ctx context.Context, buffer int) (*ctxRespSender, <-chan *storepb
	return &ctxRespSender{ctx: ctx, ch: respCh}, respCh, func() { close(respCh) }
}

// send writes response to sender channel
Please use full sentence comments. They are for humans (:
	return
}

// setError sets error (thread-safe)
full sentence comments
@@ -147,6 +156,9 @@ func (s *ProxyStore) Series(r *storepb.SeriesRequest, srv storepb.Store_SeriesSe
		respSender, respRecv, closeFn = newRespCh(gctx, 10)
	)

	storeCtx, cancel := s.contextWithTimeout(gctx)
Ok, what's the difference between this timeout and the overall gRPC request timeout or the gRPC server timeout? I believe Querier can control the same thing by just specifying the timeout here: https://github.com/improbable-eng/thanos/blob/1cd9ddd14999d6b074f34a4328e03f7ac3b7c26a/pkg/query/querier.go#L183
I would remove this from `proxy.go` completely and set this timeout on the querier client side. What do you think?
The effect is the same, if not better, as you missed passing this context to `gctx = errgroup.WithContext(srv.Context())`.
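A minimal sketch of the querier-side alternative suggested here, assuming a `storeReadTimeout` value plumbed in from the flag; the function and its signature are illustrative, not the PR's actual code. The deadline is attached to the context before calling the StoreAPI, so `proxy.go` needs no timeout logic of its own.

```go
package query

import (
	"context"
	"time"

	"github.com/improbable-eng/thanos/pkg/store/storepb"
)

// seriesWithTimeout issues a Series request whose context is bounded by the
// configured per-store read timeout. A zero timeout disables the deadline.
func seriesWithTimeout(
	ctx context.Context,
	client storepb.StoreClient,
	req *storepb.SeriesRequest,
	storeReadTimeout time.Duration,
) (storepb.Store_SeriesClient, context.CancelFunc, error) {
	if storeReadTimeout <= 0 {
		sc, err := client.Series(ctx, req)
		return sc, func() {}, err
	}

	ctx, cancel := context.WithTimeout(ctx, storeReadTimeout)
	sc, err := client.Series(ctx, req)
	if err != nil {
		cancel()
		return nil, nil, err
	}
	// The caller must call cancel once the stream has been fully consumed.
	return sc, cancel, nil
}
```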
> Also, can I ask what's the point of this logic, what's the use case? E.g. WHEN would you set (old) query timeout different to proxy/select/store.timeout underneath PromQL? Cannot find any immediate idea. Do you want to timeout on select, but allow more time for promQL?

Imagine that we have two stores: the first is very slow, the second is fast, and partial response is enabled.
The behaviour that we want to see: if the slow store times out, we still receive data from the second one.
So, to make this possible, store.read-timeout should be less than the PromQL timeout.
You're right, we can move it to querier.go and the effect will be the same at the moment (and it was my first implementation :) ).
I think it doesn't matter where we have this timeout; more important is that at the moment it doesn't work perfectly because of how we read from stores in mergedSet: reading from stores happens serially there, and the first, slow store still blocks reading from the second. We need some improvements there...
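A self-contained sketch of the behaviour described above; the store names, latencies, and the timeout value are made up for illustration. Each store gets its own deadline, so the slow store times out on its own and becomes a warning, while the fast store's data still comes back as a partial response.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// fetch simulates a store that needs `latency` to answer, honouring ctx.
func fetch(ctx context.Context, name string, latency time.Duration) (string, error) {
	select {
	case <-time.After(latency):
		return name + " data", nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

func main() {
	storeReadTimeout := 100 * time.Millisecond
	stores := map[string]time.Duration{
		"fast": 10 * time.Millisecond,
		"slow": 1 * time.Second,
	}

	var (
		mu       sync.Mutex
		results  []string
		warnings []string
		wg       sync.WaitGroup
	)
	for name, latency := range stores {
		wg.Add(1)
		go func(name string, latency time.Duration) {
			defer wg.Done()
			// Per-store deadline, shorter than the overall query timeout.
			ctx, cancel := context.WithTimeout(context.Background(), storeReadTimeout)
			defer cancel()

			res, err := fetch(ctx, name, latency)
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				warnings = append(warnings, name+": "+err.Error())
				return
			}
			results = append(results, res)
		}(name, latency)
	}
	wg.Wait()
	fmt.Println(results, warnings) // [fast data] [slow: context deadline exceeded]
}
```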
	responseTimeout time.Duration

	// queryTimeout is a timeout for entire request
	queryTimeout time.Duration
let's kill it IMO
	promqlTimeout := modelDuration(cmd.Flag("promql.timeout", "Maximum time to execute PromQL in query node.").
		Default("2m"))

	queryTimeout := modelDuration(cmd.Flag("query.timeout", "Maximum time to process request by query node. If a request to one of the stores has timed out and query.timeout < promql.timeout then a partial response will be returned. If query.timeout >= promql.timeout then only timeout error will be returned.").
This is quite confusing, I am having a hard time understanding what things we time out on here.
@@ -61,9 +61,6 @@ func registerQuery(m map[string]setupFunc, app *kingpin.Application, name string
	webExternalPrefix := cmd.Flag("web.external-prefix", "Static prefix for all HTML links and redirect URLs in the UI query web interface. Actual endpoints are still served on / or the web.route-prefix. This allows thanos UI to be served behind a reverse proxy that strips a URL sub-path.").Default("").String()
	webPrefixHeaderName := cmd.Flag("web.prefix-header", "Name of HTTP request header used for dynamic prefixing of UI links and redirects. This option is ignored if web.external-prefix argument is set. Security risk: enable this option only if a reverse proxy in front of thanos is resetting the header. The --web.prefix-header=X-Forwarded-Prefix option can be useful, for example, if Thanos UI is served via Traefik reverse proxy with PathPrefixStrip option enabled, which sends the stripped prefix value in X-Forwarded-Prefix header. This allows thanos UI to be served on a sub-path.").Default("").String()

	queryTimeout := modelDuration(cmd.Flag("query.timeout", "Maximum time to process query by query node.").
I would leave this as it was, reasons:

- `query` sounds like being related to the Query API, that's why this name, and it always included the `promql` one. That's why I would leave it like it was; the fact that query also queries something else underneath is hidden. Also, it matches the Prometheus `query` flags. And yet another thing: compatibility. Even though we are not `1.0`, this API change is a major hit to compatibility, as suddenly query timeout means something opposite to what it was ):
- `promql` sounds like only PromQL, however we know it involves the `proxy StoreAPI Series()` invocation. Moreover, not only one, but sometimes more than one! PromQL can run multiple `Select`s per single Query.

I really like this idea, but IMO it should be:

- `query.timeout` (as it was)
- `store.timeout` for the proxy.go client timeout.

Alternatively, if we want to be more explicit, maybe `query.select-timeout`? That would then also include the deduplication process.
agreed
After an offline discussion with @povilasv: I missed the fact that @povilasv meant to fix #919, but this PR is specified as addressing #835 ... So let's organize this a bit better:
Sorry for the confusion. @povilasv @d-ulyanov does it make sense to you?
I'll try to fix the context propagation in PromQL eval and let's see where it goes. Let's keep this open for now. Also, this PR has some cool stuff like …
agree
Thanks for your review. I agree that our current solution is not perfect. Store A - slow, store B - fast:
I propose the following:
Yes, a very related explanation of the confusing slow/fast store behaviour can be found here: #900 (comment)
Can we talk about this more? It's probably the most critical part of Thanos and might need some design first. So essentially, taking the ready responses first instead of the arbitrary Next of the SeriesSets that are merged inside iterators, right? I think we might need some work on the interface side, as SeriesSet …
I disagree, we still need to have …
@d-ulyanov if you think … But …
Currently we are reading from a buffered channel, which is asynchronous, so I'm not sure what you mean here. Re merging: please file an issue and write down what you want to do. IMO this is not the best place to discuss this. Thanks :)
@d-ulyanov @bwplotka So today I spent like 2 hours trying to figure out where we lose the context and I couldn't find it. But I think #1035 potentially simplifies the code in the context-passing case. As …
But in the client it does: …
So the code becomes a bit cleaner and we can see how the context propagates :)
Hi @povilasv
Initial issue: … Expected behaviour: we're receiving data from store B. At the moment it seems that we have a bottleneck here:
https://github.com/improbable-eng/thanos/blob/master/pkg/store/proxy.go#L214 On init it calls Next() on each store in … Further, in the cycle … My proposal:
We will allocate additional memory for this buffering, but I think it's not a big problem, is it? PS: We already have some buffer in seriesServer.
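A rough sketch of the buffering idea proposed here; the function name and buffer size are illustrative, not the PR's code. Each store's gRPC stream is drained in its own goroutine into a buffered channel, so a slow store no longer blocks reading from a fast one while the results are merged.

```go
package store

import (
	"github.com/improbable-eng/thanos/pkg/store/storepb"
)

// bufferStream drains a Series stream in the background and exposes it as a
// buffered channel. The channel is closed on EOF or on any receive error.
func bufferStream(cl storepb.Store_SeriesClient, buf int) <-chan *storepb.SeriesResponse {
	out := make(chan *storepb.SeriesResponse, buf)
	go func() {
		defer close(out)
		for {
			r, err := cl.Recv()
			if err != nil {
				// io.EOF or a real error; either way this store is done.
				return
			}
			out <- r
		}
	}()
	return out
}
```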
@d-ulyanov please create a separate issue for this. And let's stop bloating this conversation :D
@d-ulyanov Can we take this offline to avoid confusion? Sounds like we need to discuss this first. Slack or any other chat? Are you on the improbable-eng Slack maybe? (:
We decided offline that this flag is not helping much; we need to fix the SeriesSet iterator first.
This PR fixes partial response failure when one of the stores times out (if partial response is enabled).

Issue: #835

Changes

Added a `store.read-timeout` argument to Thanos Query. By default, `store.read-timeout = 2m` (the same as `query.timeout`). If `store.read-timeout > query.timeout`, it will be overridden by the `query.timeout` value.

This `store.read-timeout` is propagated to the gRPC StoreClient via context using `seriesServer`, see the `querier.go` diff. The timeout is finally handled inside the gRPC client.
Also, mock structures were added to the `store` package so they can be reused in unit tests in different packages.

Verification

Changes are covered by unit tests.