
storage: limit poor LSM health degradation via last line of defense protections in Pebble #79956

Closed
nicktrav opened this issue Apr 14, 2022 · 3 comments
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) no-issue-activity T-storage Storage Team X-stale

Comments

@nicktrav
Collaborator

nicktrav commented Apr 14, 2022

Currently, Cockroach relies heavily on Admission Control (AC) to protect the store on a node from falling into a "poor health" regime (most typically "LSM inversion", created when a large amount of data is resident in L0, across multiple sublevels, contributing to high read amplification). AC typically "knows more" about operations being executed on a store and can therefore make better decisions on what to allow / throttle than, say, Pebble itself.

However, there exist gaps in AC today that can result in poor LSM health (sometimes significant, as evidenced by a recent example, #79115). These poor health situations are typically difficult to recover from without invasive operator intervention (taking the affected node offline and performing an often lengthy manual compaction, or decommissioning the node entirely).

As a "last line of defense", Pebble could back-pressure or stall writes entirely to prevent a further degradation of health.

Currently, a hard limit on L0 sublevels exists, defaulting to 12 in Pebble and 1,000 in Cockroach. For comparison, AC currently considers a store overloaded at 20 L0 sublevels and 1,000 L0 files.
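For concreteness, here is a minimal sketch of where this hard limit lives in Pebble's options. The threshold value of 20 below is purely illustrative (roughly matching the AC overload heuristic mentioned above), not a recommended setting; Cockroach today sets it to 1,000, which effectively disables the stop.

```go
package main

import (
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	opts := &pebble.Options{
		// Start compacting out of L0 relatively early.
		L0CompactionThreshold: 2,
		// Hard stop: once L0 reaches this many sublevels, Pebble blocks
		// writes entirely. Pebble's default is 12; Cockroach overrides it
		// to 1,000. The value 20 here is illustrative only.
		L0StopWritesThreshold: 20,
	}
	db, err := pebble.Open("demo-db", opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```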

It should be noted that employing hard limits in Pebble as a last line of defense comes with some risks: writes below raft would be halted, and the failure modes would be less graceful (compared to AC, which still allows some progress in a store overload situation).

Any tuning of hard limits in Pebble should come with associated dynamically adjustable tuning knobs, allowing an operator to raise (or lower) the limits based on the circumstances.

The following provide some useful background reading on the topic:

Jira issue: CRDB-15710

@nicktrav nicktrav added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-storage Relating to our storage engine (Pebble) on-disk storage. T-storage Storage Team labels Apr 14, 2022
nicktrav added a commit to nicktrav/pebble that referenced this issue Apr 14, 2022
The current comment mentions that the threshold is used relative to L0
sublevel count only when the `Experimental.L0SublevelCompactions` option
is enabled. As of 32a5180, the L0 sublevel count is always used.

Update the comment.

Touches cockroachdb/cockroach#79956.
nicktrav added a commit to cockroachdb/pebble that referenced this issue Apr 14, 2022
The current comment mentions that the threshold is used relative to L0
sublevel count only when the `Experimental.L0SublevelCompactions` option
is enabled. As of 32a5180, the L0 sublevel count is always used.

Update the comment.

Touches cockroachdb/cockroach#79956.
@tbg
Member

tbg commented Apr 20, 2022

Have we considered being more targeted and failing/pausing liveness heartbeats under these stop conditions? We already do this:

```go
for _, eng := range engines {
	// We synchronously write to all disks before updating liveness because we
	// don't want any excessively slow disks to prevent leases from being
	// shifted to other nodes. A slow/stalled disk would block here and cause
	// the node to lose its leases.
	if err := storage.WriteSyncNoop(ctx, eng); err != nil {
		return Record{}, errors.Wrapf(err, "couldn't update node liveness because disk write failed")
	}
}
```

and could strengthen it by more aggressively skipping heartbeats when pebble indicates overload.
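A minimal sketch of that idea, assuming the heartbeat path can consult the engine's Pebble metrics directly; the 20-sublevel threshold and the shouldSkipHeartbeat helper are hypothetical, not existing CockroachDB code.

```go
package livenesssketch

import "github.com/cockroachdb/pebble"

// overloadedSublevels is an illustrative threshold, mirroring the L0
// sublevel count at which AC considers a store overloaded.
const overloadedSublevels = 20

// shouldSkipHeartbeat is a hypothetical helper: skip (or fail) the node's
// liveness heartbeat when the engine reports severe L0 inversion, so that
// leases move off the struggling node.
func shouldSkipHeartbeat(db *pebble.DB) bool {
	m := db.Metrics()
	// Levels[0].Sublevels is Pebble's count of L0 sublevels, the same
	// read-amplification signal discussed throughout this issue.
	return int(m.Levels[0].Sublevels) >= overloadedSublevels
}
```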

My default opinion is that engine backpressure is not a good idea. It is much more likely than targeted backpressure to have adverse effects across the cluster. For #79215, we are also discussing removing below-raft throttling that exists today in favor of better mechanisms.

I understand the desire to have a catch-all "fallback" but is it clear that the resulting behavior is better? We have a history of putting in mitigations that we don't understand well, and would like to make sure we're not repeating that here. Interested in any conversations that I've missed & experiments that were or could be run.

Assuming nodes reliably fail their heartbeats while in an L0-stop regime, the main source of incoming writes is raft messages (anything else?). I would be open to exploring ways to delay raft handling (dropping incoming messages is the "easiest" thing we could do; with a bit more elbow grease we could probably properly "pause" replication). Or put differently, if we solve the "follower writes" problem and "backpressure" heartbeats, do we have the catch-all that we're after?
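To illustrate the "drop incoming messages" option, here is a hypothetical gate on raft log-append traffic using the same engine-health signal; the threshold, the helper name, and the choice to drop only MsgApp are assumptions, not existing CockroachDB behavior.

```go
package raftsketch

import (
	"github.com/cockroachdb/pebble"
	"go.etcd.io/etcd/raft/v3/raftpb"
)

// shouldDropRaftMessage is a hypothetical gate: drop log-append traffic
// into this store while its engine is badly inverted. Raft will retransmit
// once the store recovers, at the cost of follower lag in the meantime.
func shouldDropRaftMessage(db *pebble.DB, msg raftpb.Message) bool {
	const overloadedSublevels = 20 // illustrative threshold
	overloaded := int(db.Metrics().Levels[0].Sublevels) >= overloadedSublevels
	return overloaded && msg.Type == raftpb.MsgApp
}
```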

@jbowens
Collaborator

jbowens commented Apr 21, 2022

I was thinking about our MaxConcurrentCompactions limit. It currently artificially restricts an engine's ability to reshape the LSM in order to protect foreground traffic. Recent discussions (plus this issue) are about artificially restricting foreground traffic to protect the engine's ability to reshape the LSM. These appear to be at odds.

I think it's premature to consider hard engine-wide limits before we've made MaxConcurrentCompactions adaptive and have a better sense of how well it avoids LSM inversion. Lifting MaxConcurrentCompactions improves the rate at which the LSM may be reshaped and implicitly backpressures writes through increased WAL write latency if write bandwidth is exhausted.

I know we're already planning on adaptively tuning MaxConcurrentCompactions according to resource utilization in 22.2. Instead of a hard engine stop at a certain level of inversion, we could lift MaxConcurrentCompactions up to GOMAXPROCS. This would effectively backpressure LSM writes that need to compete with compactions for write bandwidth and CPU. Even this might not be a good idea, because of the effect broad backpressure can have on cluster stability, like @tbg just mentioned.
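As a rough sketch of that "lift MaxConcurrentCompactions up to GOMAXPROCS" idea (assuming a Pebble version in which the option is supplied as a callback; in older releases it was a plain int), with the overload policy itself being purely illustrative:

```go
package compactsketch

import (
	"runtime"

	"github.com/cockroachdb/pebble"
)

// adaptiveCompactionOptions sketches an adaptive compaction-concurrency
// policy: under normal operation keep the usual small limit, but once the
// store looks inverted let compactions use up to GOMAXPROCS workers,
// competing with foreground traffic for CPU and write bandwidth.
func adaptiveCompactionOptions(storeIsInverted func() bool) *pebble.Options {
	return &pebble.Options{
		MaxConcurrentCompactions: func() int {
			if storeIsInverted() {
				return runtime.GOMAXPROCS(0)
			}
			return 3 // illustrative baseline, not Cockroach's actual setting
		},
	}
}
```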

Linking @sumeerbhola's much more nuanced thinking on the compaction concurrency limits.

@jlinder jlinder added sync-me and removed sync-me labels May 20, 2022
@github-actions

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Dec 11, 2023
@jbowens jbowens moved this to Done in [Deprecated] Storage Jun 4, 2024