perf: snapshot keyspan bounds #1810

jbowens · 2022-07-14T16:54:33Z

Currently, a snapshot always applies universally across all keys in the database. In CockroachDB, snapshots are used to preserve state within the context of a single range. An LSM snapshot constructed to read range r₁ still prevents removal of obsolete keys in range r₂.

We could extend NewSnapshot to allow supplying a predicate p(k)→bool that configures the snapshot to only snapshot keys for which p returns true. During compaction and flushes, when a snapshot appears to produce a new snapshot stripe within the same key, p(k) is consulted and a new stripe is produced only if p(k)→true. This would allow overwritten keys in a non-snapshotted CockroachDB range to be dropped while preserving overwritten keys in a snapshotted CockroachDB range.
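To make the stripe rule concrete, here is a minimal sketch in Go of how a compaction could consult p(k) before treating a snapshot as a stripe boundary. Everything in it (`scopedSnapshot`, `prefixPred`, `stripeBoundaries`, `stripeIndex`, `canDropOlder`) is a hypothetical name for illustration, not an existing Pebble type or function, and it assumes a snapshot at sequence number s sees only versions with sequence numbers below s.

```go
package main

import (
	"bytes"
	"sort"
)

// scopedSnapshot is a hypothetical snapshot that only pins keys for which
// pred returns true. It is not a Pebble type.
type scopedSnapshot struct {
	seqNum uint64
	pred   func(userKey []byte) bool
}

// prefixPred is an illustrative predicate: it matches keys with the prefix.
func prefixPred(prefix string) func([]byte) bool {
	return func(k []byte) bool { return bytes.HasPrefix(k, []byte(prefix)) }
}

// stripeBoundaries returns, in ascending order, the sequence numbers of the
// snapshots whose predicate matches userKey. Snapshots that do not match
// contribute no stripe boundary for this key.
func stripeBoundaries(snaps []scopedSnapshot, userKey []byte) []uint64 {
	var bounds []uint64
	for _, s := range snaps {
		if s.pred(userKey) {
			bounds = append(bounds, s.seqNum)
		}
	}
	sort.Slice(bounds, func(i, j int) bool { return bounds[i] < bounds[j] })
	return bounds
}

// stripeIndex returns the number of boundaries that are <= seqNum, which
// identifies the stripe that a version written at seqNum falls into.
func stripeIndex(bounds []uint64, seqNum uint64) int {
	return sort.Search(len(bounds), func(i int) bool { return bounds[i] > seqNum })
}

// canDropOlder reports whether an older version of userKey is shadowed by a
// newer version in the same stripe, and so need not be preserved.
func canDropOlder(snaps []scopedSnapshot, userKey []byte, newerSeq, olderSeq uint64) bool {
	bounds := stripeBoundaries(snaps, userKey)
	return stripeIndex(bounds, newerSeq) == stripeIndex(bounds, olderSeq)
}
```

With a single snapshot at sequence number 10 scoped to the prefix `r1/`, versions 15 and 5 of `r1/a` land in different stripes and both are kept, while the same two versions of `r2/a` share a stripe and the older one can be dropped.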

There's still the question of what to do when an iterator constructed through Snapshot.NewIter reads a key for which the predicate p(k)→false:

1. Do nothing: the result of iterating over keys that do not match p(k) is undefined. The caller must be careful to never use iteration results outside the predicate. If the predicate is defined over swaths of the key space, this may be achieved by setting iterator bounds.
2. Filter: during iteration, before returning, test p(k), skipping the key if p(k)→false.
3. Read at a higher sequence number: when the iterator is constructed from the snapshot, record two sequence numbers: the current visible sequence number on the database, and the snapshot's sequence number. User keys for which p(k)→true are filtered at the snapshot's sequence number. User keys for which p(k)→false are filtered at the database's visible sequence number.
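As a rough illustration of how the filter and higher-sequence-number options differ, the sketch below decides visibility for a single key version. The names (`outOfScopePolicy`, `policyFilter`, `policyReadLatest`, `versionVisible`) are hypothetical and not part of Pebble, and the sketch assumes a version is visible at a sequence number bound when its own sequence number is strictly below that bound.

```go
package main

// outOfScopePolicy is a hypothetical knob selecting how an iterator treats
// keys for which the snapshot's predicate returns false.
type outOfScopePolicy int

const (
	// policyFilter hides keys outside the predicate entirely.
	policyFilter outOfScopePolicy = iota
	// policyReadLatest reads keys outside the predicate at the database's
	// visible sequence number, recorded when the iterator was constructed.
	policyReadLatest
)

// versionVisible reports whether a version of userKey written at seqNum
// should be surfaced by an iterator built from a predicate-scoped snapshot.
func versionVisible(
	policy outOfScopePolicy,
	pred func(userKey []byte) bool,
	snapshotSeq, dbVisibleSeq uint64,
	userKey []byte,
	seqNum uint64,
) bool {
	if pred(userKey) {
		// In scope: ordinary snapshot semantics.
		return seqNum < snapshotSeq
	}
	switch policy {
	case policyReadLatest:
		return seqNum < dbVisibleSeq
	default:
		return false
	}
}
```

The "do nothing" option has no counterpart here because it leaves the out-of-scope result undefined by design.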

I expect limiting the scope of active snapshots would reduce write amplification, in particular during periods of heavy rebalancing where there are open LSM snapshots and replicas are being simultaneously removed. Replica removal lays down range deletions, but those range deletions are unable to drop the replica's data. Compaction of these range deletions is still prioritized, because wide range deletions force ingested sstables into higher levels. The result is we suffer unnecessary write amplification moving the removed replica's data and the range tombstone into L6.

If we are to tackle this, I think we might want to expose a very limited interface, at least from the CockroachDB pkg/storage package, that meets our specific snapshot usages. This can help avoid the possibility of reading unsnapshotted keys while under the impression of reading through a consistent snapshot.

The amount of write amplification saved is still unknown. Adding metrics for the size of obsolete keys preserved during compactions (#1204) would help us prioritize.

Jira issue: PEBBLE-127

sumeerbhola · 2022-07-18T14:46:46Z

Can we make this less general, and limit predicates to a single key-span, and check that all iterator bounds are limited to that key-span?

jbowens · 2022-07-18T15:14:02Z

I think the fragmentation of a range's various key spaces (range-id, range-local, etc) forces us to snapshot multiple key spans together.

sumeerbhola · 2022-07-18T16:02:24Z

> I think the fragmentation of a range's various key spaces (range-id, range-local, etc) forces us to snapshot multiple key spans together.

Ah yes. We could still have it be explicitly represented as a set of spans, yes?

jbowens · 2022-07-18T16:09:06Z

Yeah, for sure
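A minimal sketch of what the explicit set-of-spans representation and the iterator-bounds check might look like, in Go. The `keySpan` type and both helper functions are hypothetical names, not Pebble APIs. Spans are assumed to be half-open `[Start, End)` and compared bytewise, and for simplicity the bounds check requires the iterator bounds to fit inside a single span rather than merging abutting spans.

```go
package main

import "bytes"

// keySpan is a hypothetical half-open user key span [Start, End).
type keySpan struct {
	Start, End []byte
}

// spansContainKey reports whether key falls inside any of the spans. This
// plays the role of the predicate p(k) when the snapshot is defined by spans.
func spansContainKey(spans []keySpan, key []byte) bool {
	for _, s := range spans {
		if bytes.Compare(key, s.Start) >= 0 && bytes.Compare(key, s.End) < 0 {
			return true
		}
	}
	return false
}

// boundsWithinSpans reports whether the iterator bounds [lower, upper) are
// fully contained in one of the snapshot's spans.
func boundsWithinSpans(spans []keySpan, lower, upper []byte) bool {
	for _, s := range spans {
		if bytes.Compare(lower, s.Start) >= 0 && bytes.Compare(upper, s.End) <= 0 {
			return true
		}
	}
	return false
}
```

Rejecting iterators whose bounds fail this check at construction time would rule out reads of unsnapshotted keys up front, instead of relying on per-key filtering during iteration.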

Introduce a new type `frontiers`, designed to monitor several different user key frontiers during a compaction. When a user key is encountered that equals or exceeds the configured frontier, the code that specified the frontier is notified and given an opportunity to set a new frontier. Internally, `frontiers` uses a heap (code largely copied from the merging iterator's heap) to avoid N key comparisons for every key.

This commit refactors the `limitFuncSplitter` type to make use of `frontiers`. The `limitFuncSplitter` type is used to split flushes to L0 flush split keys, and to split both flushes and compactions to avoid excessive overlap with grandparent files.

This change is motivated by cockroachdb#2156, which will introduce an additional compaction-output splitter that must perform key comparisons against the next key to decide when to split a compaction. Additionally, the `frontiers` type may also be useful for other uses, such as applying key-space-dependent logic during a compaction (e.g., compaction-time GC, disaggregated storage locality policies, or keyspan-restricted snapshots cockroachdb#1810).
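The frontier mechanism described in this commit message can be sketched roughly as follows. This is not Pebble's implementation: it uses the standard library's `container/heap` rather than the heap code copied from the merging iterator, and the field and method names (`key`, `reached`, `add`, `advance`) are assumptions made for illustration. The point is that only the smallest frontier is compared against each key, so a key that reaches no frontier costs one comparison regardless of how many frontiers are registered.

```go
package main

import (
	"bytes"
	"container/heap"
)

// frontier is a hypothetical registration: reached is invoked once a user
// key >= key is observed, and returns the next frontier key (which must be
// greater than the observed key) or nil to stop monitoring.
type frontier struct {
	key     []byte
	reached func(userKey []byte) []byte
}

// frontierHeap is a min-heap of frontiers ordered by key.
type frontierHeap []*frontier

func (h frontierHeap) Len() int           { return len(h) }
func (h frontierHeap) Less(i, j int) bool { return bytes.Compare(h[i].key, h[j].key) < 0 }
func (h frontierHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *frontierHeap) Push(x any)        { *h = append(*h, x.(*frontier)) }
func (h *frontierHeap) Pop() any {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// frontiers monitors several frontiers at once.
type frontiers struct {
	h frontierHeap
}

// add registers a frontier.
func (f *frontiers) add(fr *frontier) { heap.Push(&f.h, fr) }

// advance is called with each user key in ascending order. It notifies every
// frontier whose key is <= userKey, then re-positions or removes it.
func (f *frontiers) advance(userKey []byte) {
	for len(f.h) > 0 && bytes.Compare(userKey, f.h[0].key) >= 0 {
		fr := f.h[0]
		next := fr.reached(userKey)
		if next == nil || bytes.Compare(next, userKey) <= 0 {
			// No further frontier (or one that would not make progress):
			// stop monitoring it.
			heap.Pop(&f.h)
			continue
		}
		fr.key = next
		heap.Fix(&f.h, 0)
	}
}
```

For keyspan-restricted snapshots, a frontier could plausibly be placed at each span boundary so that the compaction learns when it enters or leaves a snapshotted span without testing every key against every span.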

jbowens added C-enhancement New feature or request T-storage A-storage labels Jul 14, 2022

jbowens mentioned this issue Jul 16, 2022

db: add snapshot-pinned keys sstable properties and metrics #1814

Merged

nicktrav mentioned this issue Jul 25, 2022

perf: relative positioning through broad range tombstones is slow #1070

Closed

jbowens added the A-write-amp potential to reduce write amplification label Jul 31, 2022

jbowens mentioned this issue Jan 22, 2023

db: refactor compaction splitting to reduce key comparisons #2259

Merged

jbowens mentioned this issue Mar 10, 2023

storage, kv: do best-effort GC of old versions during storage compactions cockroachdb/cockroach#57260

Open

jbowens mentioned this issue Apr 25, 2023

sstable: sstable-local key kinds for overwritten/deleted points #2465

Closed

jbowens changed the title ~~perf: snapshot predicates~~ perf: snapshot keyspan bounds Jun 8, 2023

erikgrinaker mentioned this issue Jun 9, 2023

kvserver: avoid using wide Pebble snapshots cockroachdb/cockroach#104661

Open

jbowens mentioned this issue Jun 22, 2023

db: use uncompensated scores to smoothe compaction picking scores #2663

Merged

jbowens added this to [Deprecated] Storage Jun 4, 2024

jbowens moved this to Backlog in [Deprecated] Storage Jun 4, 2024
