kvprober: avoid repeated planning failures causing CRDB issues #61255

joshimhoff · 2021-03-01T14:30:45Z

Is your feature request related to a problem? Please describe.
kvprober scans kv.prober.planner.num_steps_to_plan_at_once rows worth of meta2 & unmarshalls to proto when it doesn't know what range to probe next, in order to construct a new probing plan: https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvprober/planner.go#L155. This implies that so long as kv.prober.planner.num_steps_to_plan_at_once is >1, scanning meta2 & unmarshalling isn't needed on every prober tick (which happens every kv.prober.read.interval seconds), and in fact, if kv.prober.planner.num_steps_to_plan_at_once is set to a large number, doing planning on every prober tick could spend more resources they we want.

One thing I'd like to avoid is repeated plan failures either causing a production outage or else making an existing production outage worse. One way I can imagine this kind of thing happening is overload / contention on meta2: Repeated plan failures (whether due to code bug in prober or non-prober related production outage) will lead to scanning kv.prober.planner.num_steps_to_plan_at_once rows worth of meta2 & unmarshalling every kv.prober.read.interval seconds. Depending on how the relevant cluster settings are set, could this introduce new problems?

Possibly the answer to this is it's fine. Possibly so long as kv.prober.planner.num_steps_to_plan_at_once is not set very very high, the worst case scenario of scanning kv.prober.planner.num_steps_to_plan_at_once rows worth of meta2 every kv.prober.read.interval seconds is acceptable. It's hard for me to judge, given my inexperience writing CRDB code, but it does make me nervous.

@tbg and @knz, curious to hear how you think about this, but PLEASE feel free to not think about this until after 21.2 is out / things are less busy!

Describe the solution you'd like
We can consider a few things:

Do an exponential backoff.
Allow setting a max frequency for scanning meta2 & unmarshalling. If can't plan and have no next Step, just increment a metric.

I like 2 better than 1. It sounds straight forward to implement and it limits max resource usage in the way we want. I think I'll give implementing it a shot now.

Additional context
#61074

The text was updated successfully, but these errors were encountered:

tbg · 2021-03-01T14:38:56Z

I prefer two as well for its simplicity.

…

On Mon, Mar 1, 2021 at 3:31 PM Josh Imhoff ***@***.***> wrote: *Is your feature request related to a problem? Please describe.* kvprober scans kv.prober.planner.num_steps_to_plan_at_once rows worth of meta2 & unmarshalls to proto when it doesn't know what range to probe next, in order to construct a new probing plan: https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvprober/planner.go#L155. This implies that so long as kv.prober.planner.num_steps_to_plan_at_once is >1, scanning meta2 & unmarshalling isn't needed on every prober tick (which happens every kv.prober.read.interval seconds), and in fact, if kv.prober.planner.num_steps_to_plan_at_once is set to a large number, doing planning on every prober tick could spend more resources they we want. One thing I'd like to avoid is repeated plan failures either causing a production outage or else making an existing production outage worse. One way I can imagine this kind of thing happening is overload / contention on meta2: Repeated plan failures (whether due to code bug in prober or non-prober related production outage) will lead to scanning kv.prober.planner.num_steps_to_plan_at_once rows worth of meta2 & unmarshalling *every* kv.prober.read.interval seconds. Depending on how the relevant cluster settings are set, could this introduce new problems? Possibly the answer to this is it's fine. Possibly so long as kv.prober.planner.num_steps_to_plan_at_once is not set very very high, the worst case scenario of scanning kv.prober.planner.num_steps_to_plan_at_once rows worth of meta2 every kv.prober.read.interval seconds is acceptable. It's hard for me to judge, given my inexperience writing CRDB code, but it does make me nervous. @tbg <https://github.com/tbg> and @knz <https://github.com/knz>, curious to hear how you think about this, but PLEASE feel free to not think about this until after 21.2 is out / things are less busy! *Describe the solution you'd like* We can consider a few things: 1. Do an exponential backoff. 2. Allow setting a max frequency for scanning meta2 & unmarshalling. If can't plan and have no next Step, just increment a metric. I like 2 better than 1. It sounds straight forward to implement and it limits max resource usage in the way we want. I think I'll give implementing it a shot now. *Additional context* #61074 <#61074> — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#61255>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABGXPZHKWY6AR4LFKFSAIXLTBOQKVANCNFSM4YMUSPPQ> .

joshimhoff · 2021-03-01T14:40:09Z

Nice! Thanks for quick response.

61275: kvprober: rate limit the planner r=joshimhoff a=joshimhoff #61255 **kvprober: rate limit the planner** This commit introduces a planning rate limit. This protects CRDB from planning executing too often, due to either issues with CRDB (e.g. meta2 unavailability) or bugs in kvprober. When planning does execute, kvprober scans kv.prober.planner.num_steps_to_plan_at_once rows worth of meta2 and unmarshalls the resulting range descriptors. Release justification: Auxiliary system that is off by default. Release note: None. Co-authored-by: Josh Imhoff <[email protected]> Co-authored-by: Tobias Grieger <[email protected]>

joshimhoff added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Mar 1, 2021

joshimhoff self-assigned this Mar 1, 2021

joshimhoff mentioned this issue Mar 1, 2021

kvprober: implement a prober that sends queries to KV in order to generate signal on the healthiness of KV client & below #61074

Closed

17 tasks

joshimhoff mentioned this issue Mar 1, 2021

kvprober: rate limit the planner #61275

Merged

joshimhoff closed this as completed Mar 4, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kvprober: avoid repeated planning failures causing CRDB issues #61255

kvprober: avoid repeated planning failures causing CRDB issues #61255

joshimhoff commented Mar 1, 2021

tbg commented Mar 1, 2021 via email

joshimhoff commented Mar 1, 2021

kvprober: avoid repeated planning failures causing CRDB issues #61255

kvprober: avoid repeated planning failures causing CRDB issues #61255

Comments

joshimhoff commented Mar 1, 2021

tbg commented Mar 1, 2021 via email

joshimhoff commented Mar 1, 2021