ui, server: add a timeout per node while collecting hot ranges #107796
Conversation
Force-pushed from 3940afa to 4f74c92
Reviewed 9 of 9 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @zachlite)
pkg/server/hot_ranges.go
line 21 at r1 (raw file):
var HotRangesNodeTimeout = settings.RegisterDurationSetting(
    settings.TenantWritable,
    "server.hot_ranges.node.timeout",
I would specify that this setting applies to hot ranges requests to make it more specific (i.e. server.request_hot_ranges.node.timeout) to avoid any confusion that this setting can affect hot range values.
pkg/server/pagination_test.go
line 422 at r1 (raw file):
defer log.Scope(t).Close(t)
ctx := context.Background()
server, _, _ := serverutils.StartServer(t, base.TestServerArgs{})
I'd expect this test to use a test cluster with at least several nodes. My concern is: how can this test dial another node?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @zachlite)
pkg/server/hot_ranges.go
line 23 at r1 (raw file):
"server.hot_ranges.node.timeout", "the duration allowed for a single node to return hot range data before the request is cancelled; if set to 0, there is no timeout", time.Minute*5,
Just out of curiosity, why was 5 minutes chosen as the default? It might be good to elaborate on this in a comment above the setting var, so people in the future have some context.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @koorosh)
pkg/server/hot_ranges.go
line 21 at r1 (raw file):
Previously, koorosh (Andrii Vorobiov) wrote…
I would specify that this setting applies to hot ranges requests to make it more specific (i.e. server.request_hot_ranges.node.timeout) to avoid any confusion that this setting can affect hot range values.
👍
pkg/server/hot_ranges.go
line 23 at r1 (raw file):
Previously, abarganier (Alex Barganier) wrote…
Just out of curiosity, why was 5 minutes chosen as the default? It might be good to elaborate on this in a comment above the setting var, so people in the future have some context.
Good call, I'll add context. It's my attempt to find a reasonable balance between "long enough" and "too long" for a reasonably large cluster.
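Roughly, the registration with that added context might look like the sketch below, reusing the RegisterDurationSetting form quoted above; the variable name and comment wording are illustrative, not the merged code, and the setting name anticipates the rename discussed earlier in this thread:

```go
package server

import (
	"time"

	"github.com/cockroachdb/cockroach/pkg/settings"
)

// HotRangesRequestNodeTimeout bounds how long a single node may take to
// return hot range data during the fan-out. The 5 minute default is a
// judgment call: long enough for a reasonably large cluster to respond,
// but short enough that one stuck node cannot hang the request forever.
// (Illustrative sketch; the name and comment are not the final code.)
var HotRangesRequestNodeTimeout = settings.RegisterDurationSetting(
	settings.TenantWritable,
	"server.hot_ranges_request.node.timeout",
	"the duration allowed for a single node to return hot range data before "+
		"the request is cancelled; if set to 0, there is no timeout",
	time.Minute*5,
)
```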
pkg/server/pagination_test.go
line 422 at r1 (raw file):
Previously, koorosh (Andrii Vorobiov) wrote…
I'd expect to use test cluster within this test to have several nodes at least. My concern about how this test can dial another node?
I don't think this test needs to cover any of the dialing logic. My goal is to just test the new code path that runs nodeFn with a timeout. The old code path and the dialing logic are covered by the existing test.
What do you think?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @koorosh)
pkg/server/hot_ranges.go
line 21 at r1 (raw file):
Previously, zachlite wrote…
👍
Actually, going to change to hot_ranges_request so it's similar to the existing proto names.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier)
pkg/server/pagination_test.go
line 422 at r1 (raw file):
Previously, zachlite wrote…
I don't think this test needs to cover any of the dialing logic. My goal is to just test the new code path that runs nodeFn with a timeout. The old code path and the dialing logic are covered by the existing test. What do you think?
Got it, it shouldn't test the logic on the other side of the call.
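To illustrate the narrow scope being described, here is a self-contained, standard-library-only sketch of such a single-node test, with made-up helper names rather than the actual pagination_test.go code:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runNodeFnWithTimeout is a stand-in for the new code path under test:
// it runs fn under a per-node timeout and returns whatever error results.
func runNodeFnWithTimeout(
	ctx context.Context, timeout time.Duration, fn func(context.Context) error,
) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	return fn(ctx)
}

func main() {
	// A node function that never returns until its context is cancelled,
	// simulating a node that hangs while collecting hot ranges.
	hang := func(ctx context.Context) error {
		<-ctx.Done()
		return ctx.Err()
	}
	err := runNodeFnWithTimeout(context.Background(), 50*time.Millisecond, hang)
	fmt.Println(errors.Is(err, context.DeadlineExceeded)) // expected: true
}
```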
Force-pushed from 4f74c92 to 06b3969
Requests for hot ranges are serviced by a cluster-wide fan-out, where non-trivial work is done on each node to provide a response. For each store, and for each hot range, we start a transaction with KV to look up descriptor info.

Previously, there was no upper bound set on the time a node could take to provide a response. This commit introduces a per-node timeout in the pagination logic, configurable with the new cluster setting server.hot_ranges_request.node.timeout. A value of 0 disables the timeout.

Error behavior and semantics are preserved. If a particular node times out, the fan-out continues as before, as if the node had failed to provide a response.

Informs cockroachdb#104269
Resolves cockroachdb#107627
Epic: none

Release note (ops change): Added a new cluster setting named server.hot_ranges_request.node.timeout, with a default value of 5 minutes. The setting controls the maximum amount of time that a hot ranges request will spend waiting for a node to provide a response. Set it to 0 to disable timeouts.
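As a rough, standard-library-only sketch of the per-node timeout described above (nodeFn, callWithTimeout, and the node IDs are illustrative stand-ins for the real pagination code):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// nodeFn stands in for the per-node work done during the fan-out
// (e.g. collecting hot ranges from a single node).
type nodeFn func(ctx context.Context, nodeID int) (string, error)

// callWithTimeout runs fn for one node, cancelling it if it exceeds
// timeout. A timeout of 0 disables the limit, matching the setting's
// documented semantics.
func callWithTimeout(
	ctx context.Context, nodeID int, timeout time.Duration, fn nodeFn,
) (string, error) {
	if timeout > 0 {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, timeout)
		defer cancel()
	}
	return fn(ctx, nodeID)
}

func main() {
	slow := func(ctx context.Context, nodeID int) (string, error) {
		select {
		case <-time.After(time.Second): // simulate a slow node
			return fmt.Sprintf("response from n%d", nodeID), nil
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
	// The node exceeds its 100ms budget, so the error is treated like any
	// other per-node failure and the fan-out would continue with the
	// remaining nodes.
	if _, err := callWithTimeout(context.Background(), 1, 100*time.Millisecond, slow); err != nil {
		fmt.Println("n1 skipped:", err)
	}
}
```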
Force-pushed from 06b3969 to 472d414
Reviewed 6 of 6 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @abarganier)
bors r+
Reviewed 1 of 9 files at r1, 6 of 6 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @abarganier, @koorosh, and @zachlite)
- 5 minutes feels excessive. As a user invoking the "hot ranges" pages, I would be disappointed if I don't see anything within a few seconds. Short of changing the architecture to make the display incremental (this would be the gold-standard UX), I feel the user would be best served by a configurable global timeout for node fan-outs (with a default of 1 minute, or perhaps even 10 seconds); then, during the fan-out, use context.WithDeadline to reduce the timeout available to each node depending on how much time has elapsed since the start of the hot ranges request (as in, measure startTime when the request is initially received from the client, then use the same deadline := startTime + maxTimeout for all fan-out requests; a sketch of this appears after this list).
- I wonder if we would like a more general-purpose "node fan-out timeout" cluster setting that would apply equally to the hot ranges but also the span stats and other situations where we collect data via fan-outs. In that case perhaps the name doesn't need to include the specific name of the service.
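A minimal sketch of the shared-deadline idea from the first bullet, assuming startTime is measured once when the client request arrives and the same deadline is applied to every per-node call (all names below are illustrative):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// fanOutWithGlobalDeadline applies one shared deadline, computed from the
// moment the client request starts, to every per-node call, so the whole
// fan-out completes within maxTimeout regardless of node count.
func fanOutWithGlobalDeadline(
	ctx context.Context,
	nodeIDs []int,
	maxTimeout time.Duration,
	call func(ctx context.Context, nodeID int) error,
) {
	startTime := time.Now()
	deadline := startTime.Add(maxTimeout)
	ctx, cancel := context.WithDeadline(ctx, deadline)
	defer cancel()

	for _, id := range nodeIDs {
		// Nodes contacted later in the fan-out simply get whatever budget
		// is left; a slow node does not fail the whole request.
		if err := call(ctx, id); err != nil {
			fmt.Printf("n%d: %v\n", id, err)
			continue
		}
		fmt.Printf("n%d: ok\n", id)
	}
}

func main() {
	fanOutWithGlobalDeadline(
		context.Background(),
		[]int{1, 2, 3},
		10*time.Second,
		func(ctx context.Context, nodeID int) error {
			// Placeholder for the real per-node RPC.
			return ctx.Err()
		},
	)
}
```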
What is the reason for making this public? Wondering if it's better to keep it hidden and only use it if an issue is reported.
Build succeeded.
Reviewable status: complete! 1 of 0 LGTMs obtained
-- commits
line 21 at r3:
Discussed offline:
configurable global timeout for node fan-outs
I wonder if we would like a more general-purpose "node fan-out timeout"
I think this concept is more readily understandable than a per-node timeout, but my concern is that the appropriate value will be dramatically different between clusters with a small number of nodes and clusters with a large number of nodes.
5 minutes feels excessive.
I agree. My concern was that I didn't want us to come back to Hard Rock and say "we fixed it", only for them to still experience timeouts. I think the cost of lowering it outweighs the cost of keeping it high?
make the display incremental
1000%.
We have a tracking issue here: #104584
I would love to spend all my time here, instead of these "quick fixes". Being able to incrementally load-and-show instead of waiting for the entire fan-out would be a dramatic improvement, and would actually enable us to achieve our "5 seconds or less" SLO. I love how Splunk does it, with the green line that progresses across the top of the screen.
pkg/server/hot_ranges.go
line 30 at r2 (raw file):
Previously, j82w (Jake) wrote…
What is the reason for making this public? Wondering if it's better to keep it hidden and only use it if an issue is reported.
I think that making it public increases the chances that we avoid an escalation.
Requests for hot ranges are serviced by a cluster-wide fan-out, where non-trivial work is done on each node to provide a response. For each store, and for each hot range, we start a transaction with KV to look up descriptor info.
Previously, there was no upper bound set on the time a node could take to provide a response. This commit introduces a per-node timeout in the pagination logic, which is configurable with the new cluster setting server.hot_ranges.node.timeout. A value of 0 will disable the timeout.
Error behavior and semantics are preserved. If a particular node times out, the fan-out continues as before, as if the node had failed to provide a response.
Informs #104269
Resolves #107627
Epic: none
Release note (ops change): Added a new cluster setting named server.hot_ranges.node.timeout, with a default value of 5 minutes. The setting controls the maximum amount of time that a hot ranges request will spend waiting for a node to provide a response. Set to 0 to disable timeouts.