
storage: limit poor LSM health degradation via last line of defense protections in Pebble #79956

Closed
nicktrav opened this issue Apr 14, 2022 · 3 comments
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) no-issue-activity T-storage Storage Team X-stale

Comments

@nicktrav
Collaborator

nicktrav commented Apr 14, 2022

Currently, Cockroach relies heavily on Admission Control (AC) to protect the store on a node from falling into a "poor health" regime (most typically "LSM inversion", created when a large amount of data is resident in L0, across multiple sublevels, contributing to high read amplification). AC typically "knows more" about operations being executed on a store and can therefore make better decisions on what to allow / throttle than, say, Pebble itself.

However, there exist gaps in AC today that can result in poor LSM health (sometimes significant, as evidenced by a recent example, #79115). These poor health situations are typically difficult to recover from without invasive operator intervention (taking the affected node offline and performing an often lengthy manual compaction, or decommissioning the node entirely).

As a "last line of defense", Pebble could back-pressure or stall writes entirely to prevent a further degradation of health.

Currently, a hard limit on L0 sublevels exists, defaulting to 12 in Pebble and 1,000 in Cockroach. For comparison, AC currently considers a store overloaded at 20 L0 sublevels and 1,000 L0 files.
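For concreteness, here is a minimal sketch of where this hard limit lives in Pebble's options. The threshold value of 20 below is purely illustrative (roughly matching the AC overload heuristic mentioned above), not a recommended setting; Cockroach today sets it to 1,000, which effectively disables the stop.

```go
package main

import (
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	opts := &pebble.Options{
		// Start compacting out of L0 relatively early.
		L0CompactionThreshold: 2,
		// Hard stop: once L0 reaches this many sublevels, Pebble blocks
		// writes entirely. Pebble's default is 12; Cockroach overrides it
		// to 1,000. The value 20 here is illustrative only.
		L0StopWritesThreshold: 20,
	}
	db, err := pebble.Open("demo-db", opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```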

It should be noted that employing hard limits in Pebble as a last line of defense comes with some risks: writes below raft would be halted, and the failure modes would be less graceful (compared to AC, which still allows some progress in a store overload situation).

Any tuning of hard limits in Pebble should come with associated dynamically adjustable tuning knobs, allowing an operator to raise (or lower) the limits based on the circumstances.

The following provide some useful background reading on the topic:

Jira issue: CRDB-15710

@nicktrav nicktrav added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-storage Relating to our storage engine (Pebble) on-disk storage. T-storage Storage Team labels Apr 14, 2022
nicktrav added a commit to nicktrav/pebble that referenced this issue Apr 14, 2022
The current comment mentions that the threshold is used relative to L0
sublevel count only when the `Experimental.L0SublevelCompactions` option
is enabled. As of 32a5180, the L0 sublevel count is always used.

Update the comment.

Touches cockroachdb/cockroach#79956.
nicktrav added a commit to cockroachdb/pebble that referenced this issue Apr 14, 2022
The current comment mentions that the threshold is used relative to L0
sublevel count only when the `Experimental.L0SublevelCompactions` option
is enabled. As of 32a5180, the L0 sublevel count is always used.

Update the comment.

Touches cockroachdb/cockroach#79956.
@tbg
Member

tbg commented Apr 20, 2022

Have we considered being more targeted and failing/pausing liveness heartbeats under these stop conditions? We already do this:

```go
for _, eng := range engines {
	// We synchronously write to all disks before updating liveness because we
	// don't want any excessively slow disks to prevent leases from being
	// shifted to other nodes. A slow/stalled disk would block here and cause
	// the node to lose its leases.
	if err := storage.WriteSyncNoop(ctx, eng); err != nil {
		return Record{}, errors.Wrapf(err, "couldn't update node liveness because disk write failed")
	}
}
```

and could strengthen it by more aggressively skipping heartbeats when pebble indicates overload.
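A minimal sketch of that idea, assuming the heartbeat path can consult the engine's Pebble metrics directly; the 20-sublevel threshold and the shouldSkipHeartbeat helper are hypothetical, not existing CockroachDB code.

```go
package livenesssketch

import "github.com/cockroachdb/pebble"

// overloadedSublevels is an illustrative threshold, mirroring the L0
// sublevel count at which AC considers a store overloaded.
const overloadedSublevels = 20

// shouldSkipHeartbeat is a hypothetical helper: skip (or fail) the node's
// liveness heartbeat when the engine reports severe L0 inversion, so that
// leases move off the struggling node.
func shouldSkipHeartbeat(db *pebble.DB) bool {
	m := db.Metrics()
	// Levels[0].Sublevels is Pebble's count of L0 sublevels, the same
	// read-amplification signal discussed throughout this issue.
	return int(m.Levels[0].Sublevels) >= overloadedSublevels
}
```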

My default opinion is that engine backpressure is not a good idea. It is much more likely than targeted backpressure to have adverse effects across the cluster. For #79215, we are also discussing removing below-raft throttling that exists today in favor of better mechanisms.

I understand the desire to have a catch-all "fallback" but is it clear that the resulting behavior is better? We have a history of putting in mitigations that we don't understand well, and would like to make sure we're not repeating that here. Interested in any conversations that I've missed & experiments that were or could be run.

Assuming nodes reliably fail their heartbeats while in an L0-stop regime, the main source of incoming writes is raft messages (anything else?). I would be open to exploring ways to delay raft handling (dropping incoming messages is the "easiest" thing we could do; with a bit more elbow grease we could probably properly "pause" replication). Or put differently, if we solve the "follower writes" problem and "backpressure" heartbeats, do we have the catch-all that we're after?
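To illustrate the "drop incoming messages" option, here is a hypothetical gate on raft log-append traffic using the same engine-health signal; the threshold, the helper name, and the choice to drop only MsgApp are assumptions, not existing CockroachDB behavior.

```go
package raftsketch

import (
	"github.com/cockroachdb/pebble"
	"go.etcd.io/etcd/raft/v3/raftpb"
)

// shouldDropRaftMessage is a hypothetical gate: drop log-append traffic
// into this store while its engine is badly inverted. Raft will retransmit
// once the store recovers, at the cost of follower lag in the meantime.
func shouldDropRaftMessage(db *pebble.DB, msg raftpb.Message) bool {
	const overloadedSublevels = 20 // illustrative threshold
	overloaded := int(db.Metrics().Levels[0].Sublevels) >= overloadedSublevels
	return overloaded && msg.Type == raftpb.MsgApp
}
```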

@jbowens
Collaborator

jbowens commented Apr 21, 2022

I was thinking about our MaxConcurrentCompactions limit. It currently artificially restricts an engine's ability to reshape the LSM in order to protect foreground traffic. Recent discussions (plus this issue) are about artificially restricting foreground traffic to protect the engine's ability to reshape the LSM. These appear to be at odds.

I think it's premature to consider hard engine-wide limits before we've made MaxConcurrentCompactions adaptive and have a better sense of how well it avoids LSM inversion. Lifting MaxConcurrentCompactions improves the rate at which the LSM may be reshaped and implicitly backpressures writes through increased WAL write latency if write bandwidth is exhausted.

I know we're already planning on adaptively tuning MaxConcurrentCompactions according to resource utilization in 22.2. Instead of a hard engine stop at a certain level of inversion, we could lift MaxConcurrentCompactions up to GOMAXPROCS. This would effectively backpressure LSM writes that need to compete with compactions for write bandwidth and CPU. Even this might not be a good idea, because of the effect broad backpressure can have on cluster stability, like @tbg just mentioned.
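As a rough sketch of that "lift MaxConcurrentCompactions up to GOMAXPROCS" idea (assuming a Pebble version in which the option is supplied as a callback; in older releases it was a plain int), with the overload policy itself being purely illustrative:

```go
package compactsketch

import (
	"runtime"

	"github.com/cockroachdb/pebble"
)

// adaptiveCompactionOptions sketches an adaptive compaction-concurrency
// policy: under normal operation keep the usual small limit, but once the
// store looks inverted let compactions use up to GOMAXPROCS workers,
// competing with foreground traffic for CPU and write bandwidth.
func adaptiveCompactionOptions(storeIsInverted func() bool) *pebble.Options {
	return &pebble.Options{
		MaxConcurrentCompactions: func() int {
			if storeIsInverted() {
				return runtime.GOMAXPROCS(0)
			}
			return 3 // illustrative baseline, not Cockroach's actual setting
		},
	}
}
```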

Linking @sumeerbhola's much more nuanced thinking on the compaction concurrency limits.

@jlinder jlinder added sync-me and removed sync-me labels May 20, 2022
@github-actions

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Dec 11, 2023
@jbowens jbowens moved this to Done in [Deprecated] Storage Jun 4, 2024