ui, server: add a timeout per node while collecting hot ranges #107796

zachlite · 2023-07-28T15:00:06Z

Requests for hot ranges are serviced by a cluster wide fan-out, where non-trivial work is done on each node to provide a response. For each store, and for each hot range, we start a transaction with KV to look up descriptor info.

Previously, there was no upper-bound set on the time a node could take to provide a response. This commit introduces a per-node timeout in the pagination logic, and is configurable with the new cluster setting server.hot_ranges.node.timeout. A value of 0 will disable the timeout.

Error behavior and semantics are preserved. If a particular node times out, The fan-out continues as before, as if a node failed to provide a response.

Informs #104269
Resolves #107627
Epic: none
Release note (ops change): Added a new cluster setting named server.hot_ranges.node.timeout, with a default value of 5 minutes. The setting controls the maximum amount of time that a hot ranges request will spend waiting for a node to provide a response. Set to 0 to disable timeouts.

cockroach-teamcity · 2023-07-28T15:00:17Z

This change is

koorosh

Reviewed 9 of 9 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @zachlite)

pkg/server/hot_ranges.go line 21 at r1 (raw file):

var HotRangesNodeTimeout = settings.RegisterDurationSetting(
	settings.TenantWritable,
	"server.hot_ranges.node.timeout",

I would specify that this setting is about hot ranges request to make it more specific (ie server.request_hot_ranges.node.timeout) to avoid any confusing that this setting can affect hot ranges values.

pkg/server/pagination_test.go line 422 at r1 (raw file):

	defer log.Scope(t).Close(t)
	ctx := context.Background()
	server, _, _ := serverutils.StartServer(t, base.TestServerArgs{})

I'd expect to use test cluster within this test to have several nodes at least. My concern about how this test can dial another node?

abarganier

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @zachlite)

pkg/server/hot_ranges.go line 23 at r1 (raw file):

	"server.hot_ranges.node.timeout",
	"the duration allowed for a single node to return hot range data before the request is cancelled; if set to 0, there is no timeout",
	time.Minute*5,

Just out of curiosity, why was 5 minutes chosen as the default? It might be good to elaborate this in a comment above the setting var, so people in the future have some context.

zachlite

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @koorosh)

pkg/server/hot_ranges.go line 21 at r1 (raw file):

Previously, koorosh (Andrii Vorobiov) wrote…

I would specify that this setting is about hot ranges request to make it more specific (ie server.request_hot_ranges.node.timeout) to avoid any confusing that this setting can affect hot ranges values.

👍

pkg/server/hot_ranges.go line 23 at r1 (raw file):

Previously, abarganier (Alex Barganier) wrote…

Just out of curiosity, why was 5 minutes chosen as the default? It might be good to elaborate this in a comment above the setting var, so people in the future have some context.

Good call, I'll context. It's my attempt to find a reasonable balance between "long enough" and "too long" for a reasonably large cluster.

pkg/server/pagination_test.go line 422 at r1 (raw file):

Previously, koorosh (Andrii Vorobiov) wrote…

I'd expect to use test cluster within this test to have several nodes at least. My concern about how this test can dial another node?

I don't think this test needs to cover any of the dialing logic. My goal is to just test the new codepath that runs nodeFn with a timeout. The old code path and the dialing logic is covered by the existing test.

What do you think?

zachlite

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @koorosh)

pkg/server/hot_ranges.go line 21 at r1 (raw file):

Previously, zachlite wrote…

👍

Actually, going to change to hot_ranges_request so it's similar to the existing proto names.

koorosh

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier)

pkg/server/pagination_test.go line 422 at r1 (raw file):

Previously, zachlite wrote…

I don't think this test needs to cover any of the dialing logic. My goal is to just test the new codepath that runs nodeFn with a timeout. The old code path and the dialing logic is covered by the existing test.

What do you think?

Got it, it shouldn't test the logic on other side of the call.

Requests for hot ranges are serviced by a cluster wide fan-out, where non-trivial work is done on each node to provide a response. For each store, and for each hot range, we start a transaction with KV to look up descriptor info. Previously, there was no upper-bound set on the time a node could take to provide a response. This commit introduces a per-node timeout in the pagination logic, and is configurable with the new cluster setting server.hot_ranges_request.node.timeout. A value of 0 will disable the timeout. Error behavior and semantics are preserved. If a particular node times out, The fan-out continues as before, as if a node failed to provide a response. Informs cockroachdb#104269 Resolves cockroachdb#107627 Epic: none Release note (ops change): Added a new cluster setting named server.hot_ranges_request.node.timeout, with a default value of 5 minutes. The setting controls the maximum amount of time that a hot ranges request will spend waiting for a node to provide a response. Set to 0 to disable timeouts.

koorosh

Reviewed 6 of 6 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @abarganier)

zachlite · 2023-08-01T14:30:49Z

bors r+

knz

Reviewed 1 of 9 files at r1, 6 of 6 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @abarganier, @koorosh, and @zachlite)

-- commits line 21 at r3:

5 minutes feels excessive. As a user invoking the "hot ranges" pages I would be disappointed if I don't see anything within a few seconds.

Short of changing the architecture to make the display incremental (this would be gold standard UX), I feel the user would be best served by a configurable global timeout for node fan-outs (and its default be 1 minute or perhaps even 10 seconds).
then during the fan-out, use context.WithDeadline to reduce the timeout available to each node depending on how much time was elapsed since the start of the hot ranges request. (as in, measure startTime when the request is initially received for the client, then use the samedeadline := startTime + maxTimeout for all fan-out requests)

I wonder if we would like a more general-purpose "node fan-out timeout" cluster setting, that would apply equally to the hot ranges but also the span stats and other situations where we collect data via fan-outs. In that case perhaps the name doesn't need to include the specific name of the service.

j82w · 2023-08-01T15:11:34Z

pkg/server/hot_ranges.go line 30 at r2 (raw file):

	time.Minute*5,
	settings.NonNegativeDuration,
).WithPublic()

What is the reason for making this public? Wondering if it's better to keep it hidden and only use it if an issue is reported.

craig · 2023-08-01T15:14:33Z

Build succeeded:

Bazel Essential CI (Cockroach)

zachlite

Reviewable status: complete! 1 of 0 LGTMs obtained

-- commits line 21 at r3:
Discussed offline:

configurable global timeout for node fan-outs
I wonder if we would like a more general-purpose "node fan-out timeout"

I think this concept is more readily understandable than a per-node timeout, but my concern is that the appropriate value will be dramatically different between clusters with a small number of nodes and clusters with a large number of nodes.

5 minutes feels excessive.

I agree. My concern was that I didn't want us to come back to Hard Rock and say "we fixed it", only for them to still experience timeouts. I think the cost of lowering it outweighs the cost of keeping it high?

make the display incremental

1000% .
We have a tracking issue here: #104584
I would love to spend all my time here, instead of these "quick fixes". Being able to incrementally load-and-show instead of waiting for the entire fan-out would be a dramatic improvement, and would actually enable us to achieve our "5 seconds or less" SLO. I love how splunk does it with the green line that progresses across the top of the screen.

pkg/server/hot_ranges.go line 30 at r2 (raw file):

Previously, j82w (Jake) wrote…

What is the reason for making this public? Wondering if it's better to keep it hidden and only use it if an issue is reported.

I think that making it public increases the chances that we avoid an escalation.

zachlite requested review from a team July 28, 2023 15:00

zachlite requested a review from a team as a code owner July 28, 2023 15:00

zachlite requested review from abarganier and removed request for a team July 28, 2023 15:00

zachlite force-pushed the 230726.hot-ranges-node-timeout branch 2 times, most recently from 3940afa to 4f74c92 Compare July 28, 2023 17:32

zachlite requested a review from koorosh July 28, 2023 19:55

koorosh reviewed Jul 31, 2023

View reviewed changes

abarganier reviewed Jul 31, 2023

View reviewed changes

zachlite commented Jul 31, 2023

View reviewed changes

koorosh reviewed Jul 31, 2023

View reviewed changes

zachlite force-pushed the 230726.hot-ranges-node-timeout branch from 4f74c92 to 06b3969 Compare July 31, 2023 16:56

zachlite force-pushed the 230726.hot-ranges-node-timeout branch from 06b3969 to 472d414 Compare July 31, 2023 20:38

koorosh approved these changes Aug 1, 2023

View reviewed changes

knz reviewed Aug 1, 2023

View reviewed changes

craig bot merged commit da59f58 into cockroachdb:master Aug 1, 2023

zachlite commented Aug 1, 2023

View reviewed changes

cockroach-teamcity mentioned this pull request Aug 2, 2023

PR #107796 - ui, server: add a timeout per node while collecting hot ranges cockroachdb/docs#17592

Open

This was referenced Aug 18, 2023

release-23.1: server: make the span stats fan-out more fault tolerant #109008

Merged

release-23.1: ui, server: add a timeout per node while collecting hot ranges #109015

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ui, server: add a timeout per node while collecting hot ranges #107796

ui, server: add a timeout per node while collecting hot ranges #107796

zachlite commented Jul 28, 2023 •

edited

Loading

cockroach-teamcity commented Jul 28, 2023

koorosh left a comment

abarganier left a comment

zachlite left a comment

zachlite left a comment

koorosh left a comment

koorosh left a comment

zachlite commented Aug 1, 2023

knz left a comment

j82w commented Aug 1, 2023

craig bot commented Aug 1, 2023

zachlite left a comment

ui, server: add a timeout per node while collecting hot ranges #107796

ui, server: add a timeout per node while collecting hot ranges #107796

Conversation

zachlite commented Jul 28, 2023 • edited Loading

cockroach-teamcity commented Jul 28, 2023

koorosh left a comment

Choose a reason for hiding this comment

abarganier left a comment

Choose a reason for hiding this comment

zachlite left a comment

Choose a reason for hiding this comment

zachlite left a comment

Choose a reason for hiding this comment

koorosh left a comment

Choose a reason for hiding this comment

koorosh left a comment

Choose a reason for hiding this comment

zachlite commented Aug 1, 2023

knz left a comment

Choose a reason for hiding this comment

j82w commented Aug 1, 2023

craig bot commented Aug 1, 2023

zachlite left a comment

Choose a reason for hiding this comment

zachlite commented Jul 28, 2023 •

edited

Loading