Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs/rfc: add RFC for point tombstone density compaction heuristic #3719

Merged

Conversation

anish-shanbhag
Copy link
Contributor

@anish-shanbhag anish-shanbhag commented Jul 1, 2024

Adds an RFC for a new compaction heuristic that considers density of point tombstones.

Informs: #918

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@anish-shanbhag anish-shanbhag marked this pull request as ready for review July 1, 2024 17:14
@anish-shanbhag anish-shanbhag requested a review from a team as a code owner July 1, 2024 17:14
@anish-shanbhag anish-shanbhag force-pushed the tombstone-density-rfc branch 2 times, most recently from 2cbb739 to 75b849c Compare July 1, 2024 19:34
Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still reading, but looking good so far

Reviewable status: 0 of 1 files reviewed, 4 unresolved discussions (waiting on @anish-shanbhag and @itsbilal)


docs/RFCS/20240701_tombstone_density_heuristic.md line 25 at r1 (raw file):

1. Tombstone buildup usually only happens in large stores (> few hundred MB) once multiple levels of the LSM start to be populated.
2. The LSM is "too well shaped" in that the size ratio between levels is at the proper value, so [compaction scores](https://github.com/cockroachdb/pebble/blob/3ef2e5b1f693dfbf78785e14f603a443af3c674b/compaction_picker.go#L919) for each level are all calculated to be <0 and thus no compactions are scheduled.

nit : this should be "<1": compaction scores range from 0 to infinity and a score of ≥ 1.0 is sufficient to schedule a new compaction out of a level


docs/RFCS/20240701_tombstone_density_heuristic.md line 27 at r1 (raw file):

2. The LSM is "too well shaped" in that the size ratio between levels is at the proper value, so [compaction scores](https://github.com/cockroachdb/pebble/blob/3ef2e5b1f693dfbf78785e14f603a443af3c674b/compaction_picker.go#L919) for each level are all calculated to be <0 and thus no compactions are scheduled.
	- Observed in [this escalation](https://github.com/cockroachlabs/support/issues/2628)
3. Read performance becomes especially bad when we have a high density of point tombstones in lower levels (L0-3) which span many SSTables in higher levels (L4-6).

nit: confusingly, we refer to L0-3 as "higher" levels, and refer to L6 as the bottommost level.


docs/RFCS/20240701_tombstone_density_heuristic.md line 34 at r1 (raw file):

		- We had the raft log sandwiched between the frequently-read keys used for expiration-based leases. Nodes used to gossip their liveness every few seconds, which writes and deletes messages from the raft log over and over. This makes the raft log span numerous SSTables that fill up the cache with tombstones, removing the frequently-read lease expiration keys from the cache. Liveness has since been changed to use a different implementation but the root cause still exists here.
5. Tombstones build up more often when KV pairs are small in size because more KVs fit into a single SSTable. In this case, heuristics that measure the possibility of disk space reclamation don't work because the tombstones take up little space despite filling up the key space.
6. The problem is specific to `SeekGE` and `SeekLT` because we have to iterate over all keys for these operations.

nit: this isn't quite true—it's also relevant for relative positioning operations Next and Prev; if you Next into a swath of tombstones, the iterator needs step through all the tombstones to arrive at the first live key.


docs/RFCS/20240701_tombstone_density_heuristic.md line 56 at r1 (raw file):

	- If at least `Z` buckets are tombstone dense, compact this table
- Adapt the [sliding window approach from RocksDB](https://github.com/facebook/rocksdb/blob/22fe23edc89e9842ed72b613de172cd80d3b00da/utilities/table_properties_collectors/compact_on_deletion_collector.cc#L33)
	- RocksDB uses an approach where they "slide a window" across the SSTable keys and schedule compaction if the window has a high enough ratio of tombstones. In other words, while writing if there are ever at least `X` tombstones in the last `Y` keys, compact this table

TIL — nice RocksDB archeology

@anish-shanbhag anish-shanbhag force-pushed the tombstone-density-rfc branch from aa6155b to 7c592cc Compare July 8, 2024 17:51
@itsbilal itsbilal requested a review from jbowens July 8, 2024 21:27
Copy link
Contributor

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Broadly this is looking great! Some comments based on what's here so far. I think we're already in a stage where we can decide on the level of granularity we want, and start implementing soon. This RFC can still stay here as a design doc for a little bit longer, but it shouldn't be a blocker for implementation work for much longer I'd hope.

Reviewable status: 0 of 1 files reviewed, 11 unresolved discussions (waiting on @anish-shanbhag and @jbowens)


docs/RFCS/20240701_tombstone_density_heuristic.md line 24 at r2 (raw file):

Even though our current heuristics take the potential amount of reclaimed space from compacting point tombstones into account, there are specific situations where this is not sufficient to prevent point tombstone buildup. Specifically, these are some of the factors that lead to this issue:

1. Tombstone buildup usually only happens in large stores (> few hundred MB) once multiple levels of the LSM start to be populated.

nit: we can say something bigger eg. > 10GB .


docs/RFCS/20240701_tombstone_density_heuristic.md line 32 at r2 (raw file):

	- We've seen this behavior multiple times with CockroachDB's implementation of the KV liveness range.
		- Observed in multiple escalations including [here](https://github.com/cockroachlabs/support/issues/2107) and [here](https://github.com/cockroachlabs/support/issues/2640) 
		- We had the raft log sandwiched between the frequently-read keys used for expiration-based leases. Nodes used to gossip their liveness every few seconds, which writes and deletes messages from the raft log over and over. This makes the raft log span numerous SSTables that fill up the cache with tombstones, removing the frequently-read lease expiration keys from the cache. Liveness has since been changed to use a different implementation but the root cause still exists here.

we can also list outbox-style workloads here - workloads that add rows at the end of a table and delete keys off the start. Once GC'd, this leads to a dense region of tombstones near the start, that keeps sliding forward.


docs/RFCS/20240701_tombstone_density_heuristic.md line 100 at r2 (raw file):

Given this method to query tombstone stats for arbitrary key ranges, here's a sketch of how the overall compaction process could look:
- After writing an SSTable, add this SSTable and all SSTables which overlap with its key range (using `version.Overlaps`) to a global set `needsTombstoneCheck` which marks them as possibly eligible for a tombstone density compaction
	- If the logic below ends up being fast enough, we could avoid having `needsTombstoneCheck` entirely and check whether compaction is needed during a write itself. But if not, we should defer the check in order to keep writes fast

The annotator framework should already take advantage of unchanged subtrees before/after the addition of an sstable, and retain the cached annotations. So I don't see the need to keep a separate bool. We should be able to capture any sstable-level metrics during sstable write time, and the compaction picker can just do the tombstone density query when it's picking a new compaction - lazily doing the calculation down the tree for any non-cached values.


docs/RFCS/20240701_tombstone_density_heuristic.md line 102 at r2 (raw file):

	- If the logic below ends up being fast enough, we could avoid having `needsTombstoneCheck` entirely and check whether compaction is needed during a write itself. But if not, we should defer the check in order to keep writes fast
- Inside [`pickAuto`](https://github.com/cockroachdb/pebble/blob/4981bd0e5e9538a032a4caf3a12d4571abb8c206/compaction_picker.go#L1324), we'll check whether any SSTable in `needsTombstoneCheck` should be compacted
	- Like elision-only and read-based compaction, we likely only want to do this check if there are no regular, size-based compactions that should be scheduled. Open question: how should these compactions be prioritized against elision-only and read-based compaction?

I can see a strong case for slotting this above read-triggered compactions in priority, and maybe even slightly ahead of elision-only compactions. Read-triggered compactions are truly the compactions we schedule when we have no other intuition on compactions to schedule, and make no assumption on keys being deleted or space being freed up. In this case there's a stronger case to be made about dropping the tombstones, so it should be higher priority than that.


docs/RFCS/20240701_tombstone_density_heuristic.md line 112 at r2 (raw file):

			- Note: if we use the technique from \# 2 above to find the tombstone-dense range `m->n` within this SSTable, we could also get more granular stats:
				- number of tombstones across the whole LSM between `m->n`
				- number of internal keys across the whole LSM between `m->n`

This can make a lot of sense if we happen to have tombstone dense ranges of keys that are right at sstable bounds, but the rest of the sstable(s) are still full of sets. That way we can surface a tombstone-dense keyrange even if no individual sstable has too many tombstones.

The one thing we'll need to keep in mind is that pebble doesn't have a good sense of how close together two keys are in the keyspace, so inevitably we'll also have to track keys in the not-tombstone-dense keyrange abutting a tombstone-dense keyrange. Once we know that, we can coalesce two tombstone-dense keyranges in the annotator if there's a small number of non-tombstone-dense keys in between. Does this make sense? Wonder what your thoughts are on addressing this issue.

My own preference would be to not worry about identifying tombstone-dense keyranges within sstables (because if we're heading in that direction we have to do more work of this sort to make it worthwhile), and to instead focus the range annotator work on better adding up tombstone ranges across sstables (treating each sstable as an indivisible unit) and across levels. This might be too coarse to perfectly solve all corner cases, but it'd solve the bulk of these issues and would be easier to envision. Thoughts?


docs/RFCS/20240701_tombstone_density_heuristic.md line 113 at r2 (raw file):

				- number of tombstones across the whole LSM between `m->n`
				- number of internal keys across the whole LSM between `m->n`
	- Note that stats about the whole LSM would be overestimates since we'd be including tombstones/keys from SSTables which only partially overlap the range. I think this should be fine?

This is fine because if we were to do a compaction for that range, we'd need to compact the entirety of that sstable anyway (and anything that overlaps), so some overestimation is okay.


docs/RFCS/20240701_tombstone_density_heuristic.md line 131 at r2 (raw file):

- cockroachdb/cockroach#113069 measures performance of liveness range scans since those were often slowed down by tombstones. Even though it looks like liveness logic has changed since then to avoid this, we could adapt the test to induce slow reads.
- Create reproduction roachtests for past slowdown scenarios we've observed
- RocksDB highlights that a [queue service built on top of a KV store](https://github.com/facebook/rocksdb/wiki/Implement-Queue-Service-Using-RocksDB) is a common situation where tombstones build up. We should write a test to capture performance for a queue

Yep, queue/outbox-style workloads definitely exhibit something like this. If you were to run a workload exhibiting this behaviour in cockroach, you could try a kv50 workload with a high delete percentage too, and lower the mvcc gcttl using ALTER RANGE default CONFIGURE ZONE USING gc.ttlseconds=3600 or so, so we start to lay down pebble tombstones in an hour of a key being deleted.

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 15 unresolved discussions (waiting on @anish-shanbhag and @jbowens)


docs/RFCS/20240701_tombstone_density_heuristic.md line 14 at r2 (raw file):

This design document outlines a proposal for improving compaction heuristics in Pebble, specifically targeting the reduction of point tombstone density to enhance read performance. It identifies the problem of high densities of point tombstones slowing down iteration due to unnecessary I/O and CPU usage. We propose various methods to identify regions of the LSM with high tombstone density and schedule the corresponding SSTables for compaction.

# Motivation

This section is great!


docs/RFCS/20240701_tombstone_density_heuristic.md line 42 at r2 (raw file):

### 1. Tombstone Ratio

The simplest way to detect a buildup of point tombstones is to define some threshold percentage (`TOMBSTONE_THRESHOLD`) which indicates that any SSTable where `NumDeletions/NumEntries > TOMBSTONE_THRESHOLD` should be compacted. For example, if `TOMBSTONE_THRESHOLD = 0.6` an SSTable with 10,000 internal keys would be scheduled for compaction if it has at least 6000 tombstones.

Probably we would want different thresholds on different levels.


docs/RFCS/20240701_tombstone_density_heuristic.md line 54 at r2 (raw file):

If we find that more granularity is needed on a per-SSTable basis, i.e. it's important to know where tombstones are clustered within an SSTable, there are two possible options:
- Divide the key range of the SSTable into buckets of `X` keys and calculate how many have `>Y%` tombstones in them
	- If at least `Z` buckets are tombstone dense, compact this table

Here the "buckets" could just be data blocks. We can give each data block a tombstone density score and then aggregate it into some final metric (e.g. a very rudimentary histogram). We can use the existing property collectors for this.

We could even consult (for each data block) how many files in the levels below overlap that data block (at the time of writing that table) and work that into the heuristic. For each block the number of tombstones multiplied by the number of overlapping files below would probably be the "score" we care about.


docs/RFCS/20240701_tombstone_density_heuristic.md line 62 at r2 (raw file):

Like the `TOMBSTONE_THRESHOLD` strategy, this only considers single SSTables, so we can just calculate these metrics on the fly while writing the SSTable and immediately schedule them for compaction if they meet a density criteria.

### 3. Key Range Statistics

I would recommend at least experimenting with a prototype with one of the simpler approaches before going down this route.

Copy link
Contributor Author

@anish-shanbhag anish-shanbhag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 15 unresolved discussions (waiting on @itsbilal, @jbowens, and @RaduBerinde)


docs/RFCS/20240701_tombstone_density_heuristic.md line 25 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit : this should be "<1": compaction scores range from 0 to infinity and a score of ≥ 1.0 is sufficient to schedule a new compaction out of a level

fixed


docs/RFCS/20240701_tombstone_density_heuristic.md line 27 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: confusingly, we refer to L0-3 as "higher" levels, and refer to L6 as the bottommost level.

Makes sense, fixed this


docs/RFCS/20240701_tombstone_density_heuristic.md line 34 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: this isn't quite true—it's also relevant for relative positioning operations Next and Prev; if you Next into a swath of tombstones, the iterator needs step through all the tombstones to arrive at the first live key.

Good point, I'll look into testing these as well


docs/RFCS/20240701_tombstone_density_heuristic.md line 24 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

nit: we can say something bigger eg. > 10GB .

fixed


docs/RFCS/20240701_tombstone_density_heuristic.md line 32 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

we can also list outbox-style workloads here - workloads that add rows at the end of a table and delete keys off the start. Once GC'd, this leads to a dense region of tombstones near the start, that keeps sliding forward.

Yep, definitely planning to test this - also added a note to the doc here


docs/RFCS/20240701_tombstone_density_heuristic.md line 42 at r2 (raw file):

Previously, RaduBerinde wrote…

Probably we would want different thresholds on different levels.

Could you clarify why we might want different thresholds? I'm guessing it's because the bottommost tables could still have large swaths of tombstones despite simultaneously containing mostly live keys, since the tables are much bigger. So would we want a lower threshold for e.g. L5 vs L2?


docs/RFCS/20240701_tombstone_density_heuristic.md line 54 at r2 (raw file):

Previously, RaduBerinde wrote…

Here the "buckets" could just be data blocks. We can give each data block a tombstone density score and then aggregate it into some final metric (e.g. a very rudimentary histogram). We can use the existing property collectors for this.

We could even consult (for each data block) how many files in the levels below overlap that data block (at the time of writing that table) and work that into the heuristic. For each block the number of tombstones multiplied by the number of overlapping files below would probably be the "score" we care about.

Great idea, I was looking into the property collectors and thought this might be a good approach.

One potential drawback might be the added size of each index block entry. Looks like we also considered keeping track of key/value disk size for Reader.EstimateDiskUsage but decided against it for this reason:

// TODO(sumeer): if we need more accuracy, without loading any data blocks

But if we store the density stats in a small value like a uint8/uint16, the 1-2 added bytes hopefully shouldn't have a meaningful impact on DB size.


docs/RFCS/20240701_tombstone_density_heuristic.md line 100 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

The annotator framework should already take advantage of unchanged subtrees before/after the addition of an sstable, and retain the cached annotations. So I don't see the need to keep a separate bool. We should be able to capture any sstable-level metrics during sstable write time, and the compaction picker can just do the tombstone density query when it's picking a new compaction - lazily doing the calculation down the tree for any non-cached values.

Makes sense, I think we can go without this. I was mainly concerned about the possible case where e.g. we compact from L4->L5 and this changes the amount of keys that are beneath a tombstone-dense region in an L0 SSTable. But the number of keys should only decrease or stay the same after the compaction, so an overestimate here should be fine.

@anish-shanbhag anish-shanbhag force-pushed the tombstone-density-rfc branch from 7c592cc to 99a7689 Compare July 9, 2024 21:32
Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 15 unresolved discussions (waiting on @anish-shanbhag, @itsbilal, and @jbowens)


docs/RFCS/20240701_tombstone_density_heuristic.md line 54 at r2 (raw file):

Previously, anish-shanbhag (Anish Shanbhag) wrote…

Great idea, I was looking into the property collectors and thought this might be a good approach.

One potential drawback might be the added size of each index block entry. Looks like we also considered keeping track of key/value disk size for Reader.EstimateDiskUsage but decided against it for this reason:

// TODO(sumeer): if we need more accuracy, without loading any data blocks

But if we store the density stats in a small value like a uint8/uint16, the 1-2 added bytes hopefully shouldn't have a meaningful impact on DB size.

I don't think increasing the handle is a huge deal. I interpret that TODO as saying that we don't want to increase it without good reason (I have a feeling it would have stayed a TODO regardless).

Each data block is ~32KB (uncompressed). If we add say 8 bytes to each block handle, that's only 0.02% of the total uncompressed size.

anish-shanbhag added a commit to anish-shanbhag/pebble that referenced this pull request Jul 15, 2024
Refactors `manifest.Annotator` to use generics and a simplified API.
This eliminates the need to perform pointer manipulation and unsafe
typecasting when defining a new Annotator.

This change also adds the concept of a "range annotation", which is
a computation that aggregates some value over a specific key range
within a level. Level-wide annotations are now computed internally as a
range annotation with a key range spanning the whole level. Range annotations
use the same B-tree caching behavior as regular annotations, so queries
remain fast even with thousands of tables because they avoid a sequential
iteration over a level's files.

The goal of this change is to expand and improve the Annotator interface
while not changing any existing behavior. However, there are a number of
potential use cases for range annotations which could be added next:
- Calculating the number of keys shadowed by a tombstone-dense key range,
  for use in the heuristic proposed at cockroachdb#3719
- Computing the total file size that a read compaction overlaps with,
  which is used to prevent read compactions that are too wide
  [here](https://github.com/cockroachdb/pebble/blob/9a4ea4dfc5a8129937e3fdc811ea87543d88565b/compaction_picker.go#L1930)
- Estimating disk usage for a key range without having to iterate over files, which is done [here](https://github.com/jbowens/pebble/blob/master/db.go#L2249)
- Calculating average value size and compression ratio for a key range,
  which we [currently use when estimating the potential space that compacting
  point tombstones would reclaim](https://github.com/jbowens/pebble/blob/646c6bab1af3c72dc7db59a0dcc38b5955fc15cc/table_stats.go#L350).
  Range annotations could also be used to implement the TODO from @jbowens.
- Estimating the reclaimed space from compacting range deletions, for which
  we also [currently use sequential iteration](https://github.com/jbowens/pebble/blob/master/table_stats.go#L557).
- Because annotations are in-memory, if we can find a way to refactor those
  last two without using I/O at all, then this would eliminate the need to
  defer table stats collection to a separate goroutine for newly written tables.
- Expand/rewrite the LSM visualizer tool (cockroachdb#508) to show overlapping ranges, as recommended
  in cockroachdb#1598. Range annotations would allow us to efficiently compute statistics
  including the # of sstables, # of keys, etc. in chunks of the keyspace and
  visualize this on a graph showing overlapping ranges from each level.

`BenchmarkNumFilesAnnotator` shows a slight speedup over master when compared
to the equivalent implementation of `orderStatistic`:
```
                     │     old     │                new                 │
                     │   sec/op    │   sec/op     vs base               │
NumFilesAnnotator-10   1.953µ ± 1%   1.618µ ± 3%  -17.15% (p=0.002 n=6)

                     │    old     │               new                │
                     │    B/op    │    B/op     vs base              │
NumFilesAnnotator-10   536.0 ± 0%   544.0 ± 0%  +1.49% (p=0.002 n=6)

                     │    old     │                new                │
                     │ allocs/op  │ allocs/op   vs base               │
NumFilesAnnotator-10   7.000 ± 0%   8.000 ± 0%  +14.29% (p=0.002 n=6)
```

`BenchmarkNumFilesRangeAnnotation` shows that range annotations remain fast
for arbitrary length ranges:
```
BenchmarkNumFilesRangeAnnotation-10    	  460471	      2191 ns/op	     944 B/op	       7 allocs/op
```
anish-shanbhag added a commit to anish-shanbhag/pebble that referenced this pull request Jul 15, 2024
Refactors `manifest.Annotator` to use generics and a simplified API.
This eliminates the need to perform pointer manipulation and unsafe
typecasting when defining a new Annotator.

This change also adds the concept of a "range annotation", which is
a computation that aggregates some value over a specific key range
within a level. Level-wide annotations are now computed internally as a
range annotation with a key range spanning the whole level. Range annotations
use the same B-tree caching behavior as regular annotations, so queries
remain fast even with thousands of tables because they avoid a sequential
iteration over a level's files.

The goal of this change is to expand and improve the Annotator interface
while not changing any existing behavior. However, there are a number of
potential use cases for range annotations which could be added next:
- Calculating the number of keys shadowed by a tombstone-dense key range,
  for use in the heuristic proposed at cockroachdb#3719
- Computing the total file size that a read compaction overlaps with,
  which is used to prevent read compactions that are too wide
  [here](https://github.com/cockroachdb/pebble/blob/9a4ea4dfc5a8129937e3fdc811ea87543d88565b/compaction_picker.go#L1930)
- Estimating disk usage for a key range without having to iterate over files, which is done [here](https://github.com/jbowens/pebble/blob/master/db.go#L2249)
- Calculating average value size and compression ratio for a key range,
  which we [currently use when estimating the potential space that compacting
  point tombstones would reclaim](https://github.com/jbowens/pebble/blob/646c6bab1af3c72dc7db59a0dcc38b5955fc15cc/table_stats.go#L350).
  Range annotations could also be used to implement the TODO from @jbowens.
- Estimating the reclaimed space from compacting range deletions, for which
  we also [currently use sequential iteration](https://github.com/jbowens/pebble/blob/master/table_stats.go#L557).
- Because annotations are in-memory, if we can find a way to refactor those
  last two without using I/O at all, then this would eliminate the need to
  defer table stats collection to a separate goroutine for newly written tables.
- Expand/rewrite the LSM visualizer tool (cockroachdb#508) to show overlapping ranges, as recommended
  in cockroachdb#1598. Range annotations would allow us to efficiently compute statistics
  including the # of sstables, # of keys, etc. in chunks of the keyspace and
  visualize this on a graph showing overlapping ranges from each level.

`BenchmarkNumFilesAnnotator` shows a slight speedup over master when compared
to the equivalent implementation of `orderStatistic`:
```
                     │     old     │                new                 │
                     │   sec/op    │   sec/op     vs base               │
NumFilesAnnotator-10   1.953µ ± 1%   1.618µ ± 3%  -17.15% (p=0.002 n=6)

                     │    old     │               new                │
                     │    B/op    │    B/op     vs base              │
NumFilesAnnotator-10   536.0 ± 0%   544.0 ± 0%  +1.49% (p=0.002 n=6)

                     │    old     │                new                │
                     │ allocs/op  │ allocs/op   vs base               │
NumFilesAnnotator-10   7.000 ± 0%   8.000 ± 0%  +14.29% (p=0.002 n=6)
```

`BenchmarkNumFilesRangeAnnotation` shows that range annotations remain fast
for arbitrary length ranges:
```
BenchmarkNumFilesRangeAnnotation-10    	  460471	      2191 ns/op	     944 B/op	       7 allocs/op
```
anish-shanbhag added a commit to anish-shanbhag/pebble that referenced this pull request Jul 15, 2024
Refactors `manifest.Annotator` to use generics and a simplified API.
This eliminates the need to perform pointer manipulation and unsafe
typecasting when defining a new Annotator.

This change also adds the concept of a "range annotation", which is
a computation that aggregates some value over a specific key range
within a level. Level-wide annotations are now computed internally as a
range annotation with a key range spanning the whole level. Range annotations
use the same B-tree caching behavior as regular annotations, so queries
remain fast even with thousands of tables because they avoid a sequential
iteration over a level's files.

The goal of this change is to expand and improve the Annotator interface
while not changing any existing behavior. However, there are a number of
potential use cases for range annotations which could be added next:
- Calculating the number of keys shadowed by a tombstone-dense key range,
  for use in the heuristic proposed at cockroachdb#3719
- Computing the total file size that a read compaction overlaps with,
  which is used to prevent read compactions that are too wide
  [here](https://github.com/cockroachdb/pebble/blob/9a4ea4dfc5a8129937e3fdc811ea87543d88565b/compaction_picker.go#L1930)
- Estimating disk usage for a key range without having to iterate over files, which is done [here](https://github.com/jbowens/pebble/blob/master/db.go#L2249)
- Calculating average value size and compression ratio for a key range,
  which we [currently use when estimating the potential space that compacting
  point tombstones would reclaim](https://github.com/jbowens/pebble/blob/646c6bab1af3c72dc7db59a0dcc38b5955fc15cc/table_stats.go#L350).
  Range annotations could also be used to implement the TODO from @jbowens.
- Estimating the reclaimed space from compacting range deletions, for which
  we also [currently use sequential iteration](https://github.com/jbowens/pebble/blob/master/table_stats.go#L557).
- Because annotations are in-memory, if we can find a way to refactor those
  last two without using I/O at all, then this would eliminate the need to
  defer table stats collection to a separate goroutine for newly written tables.
- Expand/rewrite the LSM visualizer tool (cockroachdb#508) to show overlapping ranges, as recommended
  in cockroachdb#1598. Range annotations would allow us to efficiently compute statistics
  including the # of sstables, # of keys, etc. in chunks of the keyspace and
  visualize this on a graph showing overlapping ranges from each level.

`BenchmarkNumFilesAnnotator` shows a slight speedup over master when compared
to the equivalent implementation of `orderStatistic`:
```
                     │     old     │                new                 │
                     │   sec/op    │   sec/op     vs base               │
NumFilesAnnotator-10   1.953µ ± 1%   1.618µ ± 3%  -17.15% (p=0.002 n=6)

                     │    old     │               new                │
                     │    B/op    │    B/op     vs base              │
NumFilesAnnotator-10   536.0 ± 0%   544.0 ± 0%  +1.49% (p=0.002 n=6)

                     │    old     │                new                │
                     │ allocs/op  │ allocs/op   vs base               │
NumFilesAnnotator-10   7.000 ± 0%   8.000 ± 0%  +14.29% (p=0.002 n=6)
```

`BenchmarkNumFilesRangeAnnotation` shows that range annotations remain fast
for arbitrary length ranges:
```
BenchmarkNumFilesRangeAnnotation-10    	  460471	      2191 ns/op	     944 B/op	       7 allocs/op
```
anish-shanbhag added a commit to anish-shanbhag/pebble that referenced this pull request Jul 15, 2024
Refactors `manifest.Annotator` to use generics and a simplified API.
This eliminates the need to perform pointer manipulation and unsafe
typecasting when defining a new Annotator.

This change also adds the concept of a "range annotation", which is
a computation that aggregates some value over a specific key range
within a level. Level-wide annotations are now computed internally as a
range annotation with a key range spanning the whole level. Range annotations
use the same B-tree caching behavior as regular annotations, so queries
remain fast even with thousands of tables because they avoid a sequential
iteration over a level's files.

The goal of this change is to expand and improve the Annotator interface
while not changing any existing behavior. However, there are a number of
potential use cases for range annotations which could be added next:
- Calculating the number of keys shadowed by a tombstone-dense key range,
  for use in the heuristic proposed at cockroachdb#3719
- Computing the total file size that a read compaction overlaps with,
  which is used to prevent read compactions that are too wide
  [here](https://github.com/cockroachdb/pebble/blob/9a4ea4dfc5a8129937e3fdc811ea87543d88565b/compaction_picker.go#L1930)
- Estimating disk usage for a key range without having to iterate over files, which is done [here](https://github.com/jbowens/pebble/blob/master/db.go#L2249)
- Calculating average value size and compression ratio for a key range,
  which we [currently use when estimating the potential space that compacting
  point tombstones would reclaim](https://github.com/jbowens/pebble/blob/646c6bab1af3c72dc7db59a0dcc38b5955fc15cc/table_stats.go#L350).
  Range annotations could also be used to implement the TODO from @jbowens.
- Estimating the reclaimed space from compacting range deletions, for which
  we also [currently use sequential iteration](https://github.com/jbowens/pebble/blob/master/table_stats.go#L557).
- Because annotations are in-memory, if we can find a way to refactor those
  last two without using I/O at all, then this would eliminate the need to
  defer table stats collection to a separate goroutine for newly written tables.
- Expand/rewrite the LSM visualizer tool (cockroachdb#508) to show overlapping ranges, as recommended
  in cockroachdb#1598. Range annotations would allow us to efficiently compute statistics
  including the # of sstables, # of keys, etc. in chunks of the keyspace and
  visualize this on a graph showing overlapping ranges from each level.

`BenchmarkNumFilesAnnotator` shows a slight speedup over master when compared
to the equivalent implementation of `orderStatistic`:
```
                     │     old     │                new                 │
                     │   sec/op    │   sec/op     vs base               │
NumFilesAnnotator-10   1.953µ ± 1%   1.618µ ± 3%  -17.15% (p=0.002 n=6)

                     │    old     │               new                │
                     │    B/op    │    B/op     vs base              │
NumFilesAnnotator-10   536.0 ± 0%   544.0 ± 0%  +1.49% (p=0.002 n=6)

                     │    old     │                new                │
                     │ allocs/op  │ allocs/op   vs base               │
NumFilesAnnotator-10   7.000 ± 0%   8.000 ± 0%  +14.29% (p=0.002 n=6)
```

`BenchmarkNumFilesRangeAnnotation` shows that range annotations remain fast
for arbitrary length ranges:
```
BenchmarkNumFilesRangeAnnotation-10    	  460471	      2191 ns/op	     944 B/op	       7 allocs/op
```
anish-shanbhag added a commit to anish-shanbhag/pebble that referenced this pull request Jul 26, 2024
This change adds a "range annotation" feature to Annotators , which
are computations that aggregate some value over a specific key range within
within a level. Level-wide annotations are now computed internally as a
range annotation with a key range spanning the whole level. Range annotations
use the same B-tree caching behavior as regular annotations, so queries
remain fast even with thousands of tables because they avoid a sequential
iteration over a level's files.

This PR only sets up range annotations without changing any existing
behavior. However, there are a number of potential use cases for range
annotations which could be added next:
- Calculating the number of keys shadowed by a tombstone-dense key range,
  for use in the heuristic proposed at cockroachdb#3719
- Computing the total file size that a read compaction overlaps with,
  which is used to prevent read compactions that are too wide
  [here](https://github.com/cockroachdb/pebble/blob/9a4ea4dfc5a8129937e3fdc811ea87543d88565b/compaction_picker.go#L1930)
- Estimating disk usage for a key range without having to iterate over files,
  which is done [here](https://github.com/jbowens/pebble/blob/master/db.go#L2249)
- Calculating average value size and compression ratio for a key range,
  which we [currently use when estimating the potential space that compacting
  point tombstones would reclaim](https://github.com/jbowens/pebble/blob/646c6bab1af3c72dc7db59a0dcc38b5955fc15cc/table_stats.go#L350).
  Range annotations could also be used to implement the TODO from @jbowens.
- Estimating the reclaimed space from compacting range deletions, for which
  we also [currently use sequential iteration](https://github.com/jbowens/pebble/blob/master/table_stats.go#L557).
- Because annotations are in-memory, if we can find a way to refactor those
  last two without using I/O at all, then this would eliminate the need to
  defer table stats collection to a separate goroutine for newly written
  tables.
- Refactoring [`db.ScanStatistics`](https://github.com/jbowens/pebble/blob/646c6bab1af3c72dc7db59a0dcc38b5955fc15cc/db.go#L2823),
  which could increase performance significantly over the current
  implementation where we scan every key in the DB.
- Expand the LSM visualizer tool (cockroachdb#508) or the LSM viewer (cockroachdb#3339) to show
  aggregate statistics about key ranges. Range annotations would allow us to efficiently
  compute statistics including the # of sstables, # of keys, etc. in chunks
  of the keyspace and visualize this on a graph showing overlapping ranges
  from each level.

`BenchmarkNumFilesRangeAnnotation` shows that range annotations are
significantly faster than using `version.Overlaps` to aggregate over
a key range:
```
pkg: github.com/cockroachdb/pebble/internal/manifest
BenchmarkNumFilesRangeAnnotation/annotator-10         	  232282	      4716 ns/op	     112 B/op	       7 allocs/op
BenchmarkNumFilesRangeAnnotation/overlaps-10          	    2110	    545482 ns/op	     400 B/op	       9 allocs/op```
```
Copy link
Contributor Author

@anish-shanbhag anish-shanbhag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated to reflect the final design of the heuristic at #3790 - hopefully we can merge this to serve as a future reference.

Reviewable status: 0 of 1 files reviewed, 15 unresolved discussions (waiting on @itsbilal, @jbowens, and @RaduBerinde)

Copy link
Contributor

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! :lgtm: save for some minor comments. This should be good to merge after those are addressed.

Reviewed all commit messages.
Reviewable status: 0 of 1 files reviewed, 18 unresolved discussions (waiting on @anish-shanbhag, @jbowens, and @RaduBerinde)


docs/RFCS/20240701_tombstone_density_heuristic.md line 2 at r4 (raw file):

- Feature Name: Compaction heuristics for point tombstone density
- Status: in-progress

Status: done, and possibly an end date or so below?


docs/RFCS/20240701_tombstone_density_heuristic.md line 26 at r4 (raw file):

1. Tombstone buildup usually only happens in large stores (> 10 GB) once multiple levels of the LSM start to be populated.
2. The LSM is "too well shaped" in that the size ratio between levels is at the proper value, so [compaction scores](https://github.com/cockroachdb/pebble/blob/3ef2e5b1f693dfbf78785e14f603a443af3c674b/compaction_picker.go#L919) for each level are all calculated to be <1 and thus no compactions are scheduled.
	- Observed in [this escalation](https://github.com/cockroachlabs/support/issues/2628)

RFCs are public (cockroachdb org) while support issues are not, so while these links are totally cool and I don't think we've leaked any sensitive info in this RFC, we should still be suffixing all of these support repo links with "(internal only)".


docs/RFCS/20240701_tombstone_density_heuristic.md line 45 at r4 (raw file):

The two criteria above are meant to eliminate wasteful data blocks that would consume unnecessary resources during reads. The count-based threshold prevents CPU waste, and the size-based threshold prevents I/O waste.

A table is considered eligible for the new tombstone compaction type if the percent of tombstone-dense data blocks (compared to the total number of data blocks in the table) is at least X%. The default is 5% and is configured via the new `options.Experimental.TombstoneDenseCompactionThreshold`. This option may also be set to `-1` to disable tombstone density compactions.

"may also be set to zero or a negative value"


docs/RFCS/20240701_tombstone_density_heuristic.md line 73 at r4 (raw file):

### 2. More Granularity

If we find that more granularity is needed on a per-SSTable basis, i.e. it's important to know where tombstones are clustered within an SSTable, there are two possible options:

Worth mentioning that the preferred approach (block-level tombstone density) is an example of an approach with greater granularity than the above, sstable-level approach.

Copy link
Contributor Author

@anish-shanbhag anish-shanbhag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've addressed these comments - TFTR!

Reviewable status: 0 of 1 files reviewed, 18 unresolved discussions (waiting on @itsbilal, @jbowens, and @RaduBerinde)

@anish-shanbhag anish-shanbhag merged commit f90b350 into cockroachdb:master Aug 19, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants