Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvprober: avoid repeated planning failures causing CRDB issues #61255

Closed
joshimhoff opened this issue Mar 1, 2021 · 2 comments
Closed

kvprober: avoid repeated planning failures causing CRDB issues #61255

joshimhoff opened this issue Mar 1, 2021 · 2 comments
Assignees
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@joshimhoff
Copy link
Collaborator

Is your feature request related to a problem? Please describe.
kvprober scans kv.prober.planner.num_steps_to_plan_at_once rows worth of meta2 & unmarshalls to proto when it doesn't know what range to probe next, in order to construct a new probing plan: https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvprober/planner.go#L155. This implies that so long as kv.prober.planner.num_steps_to_plan_at_once is >1, scanning meta2 & unmarshalling isn't needed on every prober tick (which happens every kv.prober.read.interval seconds), and in fact, if kv.prober.planner.num_steps_to_plan_at_once is set to a large number, doing planning on every prober tick could spend more resources they we want.

One thing I'd like to avoid is repeated plan failures either causing a production outage or else making an existing production outage worse. One way I can imagine this kind of thing happening is overload / contention on meta2: Repeated plan failures (whether due to code bug in prober or non-prober related production outage) will lead to scanning kv.prober.planner.num_steps_to_plan_at_once rows worth of meta2 & unmarshalling every kv.prober.read.interval seconds. Depending on how the relevant cluster settings are set, could this introduce new problems?

Possibly the answer to this is it's fine. Possibly so long as kv.prober.planner.num_steps_to_plan_at_once is not set very very high, the worst case scenario of scanning kv.prober.planner.num_steps_to_plan_at_once rows worth of meta2 every kv.prober.read.interval seconds is acceptable. It's hard for me to judge, given my inexperience writing CRDB code, but it does make me nervous.

@tbg and @knz, curious to hear how you think about this, but PLEASE feel free to not think about this until after 21.2 is out / things are less busy!

Describe the solution you'd like
We can consider a few things:

  1. Do an exponential backoff.
  2. Allow setting a max frequency for scanning meta2 & unmarshalling. If can't plan and have no next Step, just increment a metric.

I like 2 better than 1. It sounds straight forward to implement and it limits max resource usage in the way we want. I think I'll give implementing it a shot now.

Additional context
#61074

@joshimhoff joshimhoff added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Mar 1, 2021
@joshimhoff joshimhoff self-assigned this Mar 1, 2021
@tbg
Copy link
Member

tbg commented Mar 1, 2021 via email

@joshimhoff
Copy link
Collaborator Author

Nice! Thanks for quick response.

craig bot pushed a commit that referenced this issue Mar 4, 2021
61275: kvprober: rate limit the planner r=joshimhoff a=joshimhoff

#61255

**kvprober: rate limit the planner**

This commit introduces a planning rate limit. This protects CRDB from
planning executing too often, due to either issues with CRDB (e.g.
meta2 unavailability) or bugs in kvprober. When planning does execute,
kvprober scans kv.prober.planner.num_steps_to_plan_at_once rows worth
of meta2 and unmarshalls the resulting range descriptors.

Release justification: Auxiliary system that is off by default.
Release note: None.

Co-authored-by: Josh Imhoff <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Projects
None yet
Development

No branches or pull requests

2 participants