kvprober: avoid repeated planning failures causing CRDB issues #61255
Labels: C-enhancement (Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception))

Comments

joshimhoff added the C-enhancement label on Mar 1, 2021
I prefer two as well for its simplicity.
On Mon, Mar 1, 2021 at 3:31 PM, Josh Imhoff wrote:
*Is your feature request related to a problem? Please describe.*

kvprober scans kv.prober.planner.num_steps_to_plan_at_once rows worth of meta2 and unmarshals the results to protos when it doesn't know what range to probe next, in order to construct a new probing plan: https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvprober/planner.go#L155. This implies that so long as kv.prober.planner.num_steps_to_plan_at_once is >1, scanning meta2 and unmarshalling isn't needed on every prober tick (which happens every kv.prober.read.interval seconds); in fact, if kv.prober.planner.num_steps_to_plan_at_once is set to a large number, doing planning on every prober tick could spend more resources than we want.
One thing I'd like to avoid is repeated plan failures either causing a production outage or making an existing production outage worse. One way I can imagine this happening is overload / contention on meta2: repeated plan failures (whether due to a code bug in the prober or a non-prober-related production outage) will lead to scanning kv.prober.planner.num_steps_to_plan_at_once rows worth of meta2 and unmarshalling *every* kv.prober.read.interval seconds. Depending on how the relevant cluster settings are set, could this introduce new problems?

Possibly the answer is that it's fine. Possibly, so long as kv.prober.planner.num_steps_to_plan_at_once is not set very high, the worst-case scenario of scanning kv.prober.planner.num_steps_to_plan_at_once rows worth of meta2 every kv.prober.read.interval seconds is acceptable. It's hard for me to judge, given my inexperience writing CRDB code, but it does make me nervous.

@tbg and @knz, curious to hear how you think about this, but please feel free to not think about it until after 21.2 is out / things are less busy!
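To make the worst case concrete, here is a back-of-the-envelope sketch using made-up values for the two cluster settings (not the real defaults): when planning succeeds, one scan of num_steps_to_plan_at_once meta2 rows serves that many ticks, but if planning fails on every tick, the same scan happens every tick, a num_steps_to_plan_at_once-fold increase in meta2 read load.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const numStepsToPlanAtOnce = 100     // hypothetical setting value, not the real default
	const readInterval = 1 * time.Minute // hypothetical setting value, not the real default

	ticksPerHour := float64(time.Hour / readInterval)

	// Healthy: one meta2 scan of numStepsToPlanAtOnce rows serves that many ticks.
	plansPerHourHealthy := ticksPerHour / numStepsToPlanAtOnce
	// Persistent plan failure: every tick retries the scan and never caches any steps.
	plansPerHourFailing := ticksPerHour

	fmt.Printf("healthy: %.0f meta2 rows scanned per hour\n", plansPerHourHealthy*numStepsToPlanAtOnce) // 60
	fmt.Printf("failing: %.0f meta2 rows scanned per hour\n", plansPerHourFailing*numStepsToPlanAtOnce) // 6000
}
```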
*Describe the solution you'd like*

We can consider a few things:

1. Do an exponential backoff.
2. Allow setting a max frequency for scanning meta2 & unmarshalling. If we can't plan and have no next Step, just increment a metric.

I like 2 better than 1: it sounds straightforward to implement, and it limits max resource usage in the way we want. I think I'll give implementing it a shot now.
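As a rough sketch of what option 2 could look like (identifiers such as minPlanInterval and planSkips are illustrative, not real kvprober names): gate the meta2 scan behind a minimum interval, and when we are out of cached steps and still inside the cooldown window, skip the probe and increment a metric instead.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type Step struct{ RangeID int }

// scanMeta2 stands in for the scan-and-unmarshal planning work; it is made to
// fail here so the rate limit gets exercised.
func scanMeta2(numSteps int) ([]Step, error) {
	return nil, errors.New("meta2 unavailable")
}

type rateLimitedPlanner struct {
	numStepsToPlanAtOnce int
	minPlanInterval      time.Duration // max frequency for meta2 scans
	lastPlanAttempt      time.Time
	cached               []Step
	planSkips            int // stands in for a real metric
}

// Next returns the next cached step if there is one; otherwise it attempts to
// plan, but only if minPlanInterval has passed since the last attempt.
func (p *rateLimitedPlanner) Next(now time.Time) (Step, bool) {
	if len(p.cached) == 0 {
		if now.Sub(p.lastPlanAttempt) < p.minPlanInterval {
			p.planSkips++ // can't plan yet and have no next Step: just count it
			return Step{}, false
		}
		p.lastPlanAttempt = now
		steps, err := scanMeta2(p.numStepsToPlanAtOnce)
		if err != nil {
			return Step{}, false // planning failed; the next attempt is rate limited
		}
		p.cached = steps
	}
	next := p.cached[0]
	p.cached = p.cached[1:]
	return next, true
}

func main() {
	p := &rateLimitedPlanner{numStepsToPlanAtOnce: 100, minPlanInterval: 10 * time.Minute}
	start := time.Now()
	for tick := 0; tick < 5; tick++ {
		now := start.Add(time.Duration(tick) * time.Minute) // one "prober tick" per minute
		if _, ok := p.Next(now); !ok {
			fmt.Printf("tick %d: no step available (plan skips so far: %d)\n", tick, p.planSkips)
		}
	}
}
```

With these settings, five consecutive ticks trigger only one meta2 scan instead of five, which is the bound on worst-case resource usage the option is after.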
*Additional context*

#61074
Nice! Thanks for the quick response.
craig bot pushed a commit that referenced this issue on Mar 4, 2021:
61275: kvprober: rate limit the planner r=joshimhoff a=joshimhoff

#61255

**kvprober: rate limit the planner**

This commit introduces a planning rate limit. This protects CRDB from planning executing too often, due to either issues with CRDB (e.g. meta2 unavailability) or bugs in kvprober. When planning does execute, kvprober scans kv.prober.planner.num_steps_to_plan_at_once rows worth of meta2 and unmarshals the resulting range descriptors.

Release justification: Auxiliary system that is off by default.
Release note: None.

Co-authored-by: Josh Imhoff <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
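For comparison, the same "planning may run at most once per interval" idea can be expressed with a stock token-bucket limiter from golang.org/x/time/rate. This is only an illustration of the concept (and requires the external x/time module), not necessarily how the merged change implements it.

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// Allow at most one planning attempt every 10 minutes, with no bursting.
	planLimiter := rate.NewLimiter(rate.Every(10*time.Minute), 1)

	for tick := 0; tick < 3; tick++ {
		if planLimiter.Allow() {
			fmt.Printf("tick %d: planning allowed (scan meta2, build a new plan)\n", tick)
		} else {
			fmt.Printf("tick %d: planning rate limited (increment a skip metric)\n", tick)
		}
	}
}
```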