From f632070ec4a94a938e6de390e8bc73cc245e55b1 Mon Sep 17 00:00:00 2001
From: sumeerbhola
Date: Sat, 20 May 2023 11:17:44 -0400
Subject: [PATCH] sstable,db: introduce an sstable-internal ObsoleteBit in the
 key kind
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This bit marks keys that are obsolete because they do not carry the newest
seqnum for their user key (in that sstable), or because they are masked by a
RANGEDEL.

Setting the obsolete bit on point keys is advanced usage, so we support two
modes. Both must be truthful when they do set the obsolete bit, but they
differ in when they are allowed to leave it unset:

- Non-strict: In this mode, the bit does not need to be set for keys that
  are obsolete. Additionally, any sstable containing MERGE keys can only use
  this mode. An iterator over such an sstable, when configured to
  hideObsoletePoints, can expose multiple internal keys per user key, and
  can expose keys that are deleted by RANGEDELs in the same sstable. This is
  the mode that non-advanced users should use. Pebble without disaggregated
  storage will also use this mode, setting the obsolete bit on a best-effort
  basis to optimize iteration when snapshots have retained many obsolete
  keys.

- Strict: In this mode, every obsolete key must have the obsolete bit set,
  and no MERGE keys are permitted. An iterator over such an sstable, when
  configured to hideObsoletePoints, satisfies two properties:
  - S1: it will expose at most one internal key per user key, which is the
    most recent one.
  - S2: it will never expose keys that are deleted by RANGEDELs in the same
    sstable.

  This is the mode for two use cases in disaggregated storage (which will
  exclude parts of the key space that have MERGEs), for levels that contain
  sstables that can become foreign sstables:
  - Pebble compaction output to those levels.
  - CockroachDB ingest operations that ingest into those levels. Note that
    these are not the sstables corresponding to copied data for CockroachDB
    range snapshots. This case occurs for operations like index backfills,
    which trivially satisfy the strictness criteria since they write only
    one key per user key.

The strictness of the sstable is written to the Properties block.

The Writer implementation discovers keys that are obsolete because they
share a user key with the previous key. This can be done cheaply, since the
Writer already performs user key comparisons. For keys obsoleted by
RANGEDELs, the Writer relies on the caller.

On the read path, the obsolete bit is stripped by the blockIter. Since
everything that reads an sstable goes through a blockIter, this prevents any
leakage of the bit.

Some effort was made to reduce the regression on the iteration path, but
TableIterNext still shows a +5.84% regression. Some of the slowdown is
clawed back by improvements to Seek (e.g. SeekGE is now faster).
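To illustrate the write-side contract, here is a minimal, hypothetical
usage sketch. AddWithForceObsolete, IsStrictObsolete, and
TableFormatPebblev4 are from this patch; the constructor shape, the file
handle f, and the error handling are illustrative only:

    // Hypothetical sketch, assuming an sstable.NewWriter-style constructor.
    w := sstable.NewWriter(f, sstable.WriterOptions{
        TableFormat:      sstable.TableFormatPebblev4,
        IsStrictObsolete: true, // strict mode: no MERGE keys permitted
    })
    // The Writer discovers same-user-key obsolescence itself, via the key
    // comparisons it already performs. The caller passes forceObsolete=true
    // only for points it knows are deleted by a RANGEDEL in this sstable
    // (e.g. the compaction loop passes iter.snapshotPinned).
    if err := w.AddWithForceObsolete(key, value, forceObsolete); err != nil {
        return err
    }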
old is master:

name old time/op new time/op delta
BlockIterSeekGE/restart=16-16 474ns ± 1% 450ns ± 1% -5.16% (p=0.000 n=10+10)
BlockIterSeekLT/restart=16-16 520ns ± 0% 526ns ± 0% +1.20% (p=0.000 n=10+10)
BlockIterNext/restart=16-16 19.3ns ± 1% 21.0ns ± 0% +8.76% (p=0.000 n=10+10)
BlockIterPrev/restart=16-16 38.7ns ± 1% 39.9ns ± 0% +3.20% (p=0.000 n=9+9)
TableIterSeekGE/restart=16,compression=Snappy-16 1.65µs ± 1% 1.61µs ± 3% -2.24% (p=0.000 n=9+10)
TableIterSeekGE/restart=16,compression=ZSTD-16 1.67µs ± 3% 1.58µs ± 3% -5.11% (p=0.000 n=10+10)
TableIterSeekLT/restart=16,compression=Snappy-16 1.75µs ± 3% 1.68µs ± 2% -4.14% (p=0.000 n=10+9)
TableIterSeekLT/restart=16,compression=ZSTD-16 1.74µs ± 3% 1.69µs ± 3% -2.54% (p=0.001 n=10+10)
TableIterNext/restart=16,compression=Snappy-16 23.9ns ± 1% 25.3ns ± 0% +5.84% (p=0.000 n=10+10)
TableIterNext/restart=16,compression=ZSTD-16 23.9ns ± 1% 25.3ns ± 0% +5.78% (p=0.000 n=10+10)
TableIterPrev/restart=16,compression=Snappy-16 45.2ns ± 1% 46.2ns ± 1% +2.09% (p=0.000 n=10+10)
TableIterPrev/restart=16,compression=ZSTD-16 45.3ns ± 0% 46.3ns ± 0% +2.23% (p=0.000 n=8+9)
IteratorScanManyVersions/format=(Pebble,v2)/cache-size=20_M/read-value=false-16 51.7ns ± 1% 55.2ns ± 4% +6.82% (p=0.000 n=10+10)
IteratorScanManyVersions/format=(Pebble,v2)/cache-size=20_M/read-value=true-16 54.9ns ± 1% 56.4ns ± 3% +2.73% (p=0.000 n=10+10)
IteratorScanManyVersions/format=(Pebble,v2)/cache-size=150_M/read-value=false-16 35.0ns ± 1% 34.8ns ± 1% -0.56% (p=0.037 n=10+10)
IteratorScanManyVersions/format=(Pebble,v2)/cache-size=150_M/read-value=true-16 37.8ns ± 0% 38.0ns ± 1% +0.55% (p=0.018 n=9+10)
IteratorScanManyVersions/format=(Pebble,v3)/cache-size=20_M/read-value=false-16 41.5ns ± 2% 42.4ns ± 1% +2.18% (p=0.000 n=10+10)
IteratorScanManyVersions/format=(Pebble,v3)/cache-size=20_M/read-value=true-16 94.7ns ± 4% 97.0ns ± 8% ~ (p=0.133 n=9+10)
IteratorScanManyVersions/format=(Pebble,v3)/cache-size=150_M/read-value=false-16 35.4ns ± 2% 36.5ns ± 1% +2.97% (p=0.000 n=10+8)
IteratorScanManyVersions/format=(Pebble,v3)/cache-size=150_M/read-value=true-16 60.1ns ± 1% 57.8ns ± 0% -3.84% (p=0.000 n=9+9)
IteratorScanNextPrefix/versions=1/method=seek-ge/read-value=false-16 135ns ± 1% 136ns ± 1% +0.44% (p=0.009 n=9+10)
IteratorScanNextPrefix/versions=1/method=seek-ge/read-value=true-16 139ns ± 0% 139ns ± 0% +0.48% (p=0.000 n=10+8)
IteratorScanNextPrefix/versions=1/method=next-prefix/read-value=false-16 34.8ns ± 1% 35.5ns ± 2% +2.12% (p=0.000 n=9+10)
IteratorScanNextPrefix/versions=1/method=next-prefix/read-value=true-16 37.6ns ± 0% 38.6ns ± 1% +2.53% (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=2/method=seek-ge/read-value=false-16 215ns ± 1% 216ns ± 0% ~ (p=0.341 n=10+10)
IteratorScanNextPrefix/versions=2/method=seek-ge/read-value=true-16 220ns ± 1% 220ns ± 0% ~ (p=0.983 n=10+8)
IteratorScanNextPrefix/versions=2/method=next-prefix/read-value=false-16 41.6ns ± 1% 42.6ns ± 2% +2.42% (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=2/method=next-prefix/read-value=true-16 44.6ns ± 1% 45.6ns ± 1% +2.28% (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=10/method=seek-ge/read-value=false-16 2.16µs ± 0% 2.06µs ± 1% -4.27% (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=10/method=seek-ge/read-value=true-16 2.15µs ± 1% 2.07µs ± 0% -3.71% (p=0.000 n=9+10)
IteratorScanNextPrefix/versions=10/method=next-prefix/read-value=false-16 94.1ns ± 1% 95.9ns ± 2% +1.94% (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=10/method=next-prefix/read-value=true-16 97.5ns ± 1% 98.2ns ± 1% +0.69% (p=0.023 n=10+10)
IteratorScanNextPrefix/versions=100/method=seek-ge/read-value=false-16 2.81µs ± 1% 2.66µs ± 1% -5.29% (p=0.000 n=9+10)
IteratorScanNextPrefix/versions=100/method=seek-ge/read-value=true-16 2.82µs ± 1% 2.67µs ± 0% -5.47% (p=0.000 n=8+10)
IteratorScanNextPrefix/versions=100/method=next-prefix/read-value=false-16 689ns ± 4% 652ns ± 5% -5.32% (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=100/method=next-prefix/read-value=true-16 694ns ± 2% 657ns ± 1% -5.28% (p=0.000 n=10+8)

Looking at mergingIter, the Next regression seems tolerable, and SeekGE is
better.

name old time/op new time/op delta
MergingIterSeekGE/restart=16/count=1-16 1.25µs ± 3% 1.15µs ± 1% -8.51% (p=0.000 n=10+10)
MergingIterSeekGE/restart=16/count=2-16 2.49µs ± 2% 2.28µs ± 2% -8.39% (p=0.000 n=10+10)
MergingIterSeekGE/restart=16/count=3-16 3.82µs ± 3% 3.57µs ± 1% -6.54% (p=0.000 n=10+10)
MergingIterSeekGE/restart=16/count=4-16 5.31µs ± 2% 4.86µs ± 2% -8.39% (p=0.000 n=10+10)
MergingIterSeekGE/restart=16/count=5-16 6.88µs ± 1% 6.36µs ± 2% -7.49% (p=0.000 n=10+10)
MergingIterNext/restart=16/count=1-16 46.0ns ± 1% 46.6ns ± 1% +1.13% (p=0.000 n=10+10)
MergingIterNext/restart=16/count=2-16 72.8ns ± 1% 73.0ns ± 0% ~ (p=0.363 n=10+10)
MergingIterNext/restart=16/count=3-16 93.5ns ± 0% 93.1ns ± 1% ~ (p=0.507 n=10+9)
MergingIterNext/restart=16/count=4-16 104ns ± 0% 104ns ± 1% ~ (p=0.078 n=8+10)
MergingIterNext/restart=16/count=5-16 121ns ± 1% 121ns ± 1% -0.52% (p=0.008 n=10+10)
MergingIterPrev/restart=16/count=1-16 66.6ns ± 1% 67.8ns ± 1% +1.81% (p=0.000 n=10+10)
MergingIterPrev/restart=16/count=2-16 93.2ns ± 0% 94.4ns ± 1% +1.24% (p=0.000 n=10+10)
MergingIterPrev/restart=16/count=3-16 114ns ± 0% 114ns ± 1% +0.36% (p=0.032 n=9+10)
MergingIterPrev/restart=16/count=4-16 122ns ± 1% 123ns ± 0% +0.41% (p=0.014 n=10+9)
MergingIterPrev/restart=16/count=5-16 138ns ± 1% 138ns ± 0% +0.52% (p=0.012 n=10+10)
MergingIterSeqSeekGEWithBounds/levelCount=5-16 572ns ± 1% 572ns ± 0% ~ (p=0.842 n=10+9)
MergingIterSeqSeekPrefixGE/skip=1/use-next=false-16 1.85µs ± 1% 1.76µs ± 1% -4.85% (p=0.000 n=10+9)
MergingIterSeqSeekPrefixGE/skip=1/use-next=true-16 443ns ± 0% 444ns ± 1% ~ (p=0.255 n=10+10)
MergingIterSeqSeekPrefixGE/skip=2/use-next=false-16 1.86µs ± 1% 1.77µs ± 1% -4.63% (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=2/use-next=true-16 486ns ± 1% 482ns ± 1% -0.80% (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=4/use-next=false-16 1.93µs ± 1% 1.83µs ± 1% -4.95% (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=4/use-next=true-16 570ns ± 0% 567ns ± 2% -0.47% (p=0.020 n=10+10)
MergingIterSeqSeekPrefixGE/skip=8/use-next=false-16 2.12µs ± 0% 2.03µs ± 1% -4.38% (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=8/use-next=true-16 1.43µs ± 1% 1.39µs ± 1% -2.57% (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=16/use-next=false-16 2.28µs ± 1% 2.18µs ± 0% -4.54% (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=16/use-next=true-16 1.59µs ± 1% 1.53µs ± 1% -3.71% (p=0.000 n=10+9)

Finally, a read benchmark where all except the first key is obsolete shows
improvement.
BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=1_B/hide-obsolete=false-10 36 32300029 ns/op 2 B/op 0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=1_B/hide-obsolete=true-10 36 32418979 ns/op 3 B/op 0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=150_M/hide-obsolete=false-10 82 13357163 ns/op 1 B/op 0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=150_M/hide-obsolete=true-10 90 13256770 ns/op 1 B/op 0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=1_B/hide-obsolete=false-10 36 32396367 ns/op 2 B/op 0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=1_B/hide-obsolete=true-10 26086 46095 ns/op 0 B/op 0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=150_M/hide-obsolete=false-10 88 13226711 ns/op 1 B/op 0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=150_M/hide-obsolete=true-10 39171 30618 ns/op 0 B/op 0 allocs/op

Informs #2465
---
 batch.go                                      |   8 +-
 compaction.go                                 |   9 +-
 compaction_iter.go                            |   7 +
 compaction_iter_test.go                       |   2 +-
 compaction_test.go                            |   4 +-
 db.go                                         |   1 +
 external_iterator.go                          |  18 +-
 format_major_version.go                       |  20 +-
 format_major_version_test.go                  |  36 +-
 get_iter.go                                   |   4 +-
 internal/base/internal.go                     |  22 +-
 iterator.go                                   |  10 +
 level_iter.go                                 |   9 +
 level_iter_test.go                            |   4 +-
 metamorphic/generator.go                      |   4 +-
 metamorphic/ops.go                            |   2 +-
 options.go                                    |  42 ++-
 scan_internal.go                              |   1 +
 sstable/block.go                              | 264 ++++++++++-----
 sstable/block_property.go                     |   2 +-
 sstable/block_property_test.go                |   3 +-
 sstable/block_test.go                         |   2 +-
 sstable/data_test.go                          |  14 +-
 sstable/format.go                             | 150 +++++++++
 sstable/internal.go                           |   1 +
 sstable/options.go                            |  11 +
 sstable/properties.go                         |   6 +
 sstable/properties_test.go                    |   1 +
 sstable/reader.go                             |  86 +++--
 sstable/reader_test.go                        | 166 ++++++++-
 sstable/suffix_rewriter.go                    |  31 +-
 sstable/suffix_rewriter_test.go               |  28 +-
 sstable/table.go                              |  23 +-
 sstable/testdata/reader_hide_obsolete/iter    | 316 ++++++++++++++++++
 sstable/testdata/virtual_reader               |   4 +-
 sstable/testdata/writer_value_blocks          | 232 ++++++-------
 sstable/writer.go                             | 281 +++++++++++++---
 sstable/writer_test.go                        |  67 +++-
 table_cache.go                                |  37 +-
 testdata/checkpoint                           |  54 +--
 testdata/compaction_delete_only_hints         |  10 +-
 testdata/compaction_iter                      |   8 +-
 testdata/compaction_iter_delete_sized         |  12 +-
 testdata/compaction_iter_set_with_del         |  12 +-
 testdata/compaction_tombstones                |  26 +-
 testdata/event_listener                       |  80 ++---
 testdata/ingest                               |   6 +-
 ...manual_compaction_file_boundaries_delsized | 146 ++++----
 testdata/marked_for_compaction                |   4 +-
 testdata/metrics                              |   4 +-
 testdata/table_stats                          |   2 +-
 51 files changed, 1734 insertions(+), 558 deletions(-)
 create mode 100644 sstable/testdata/reader_hide_obsolete/iter

diff --git a/batch.go b/batch.go
index 0b76c42856..1ae2086083 100644
--- a/batch.go
+++ b/batch.go
@@ -432,8 +432,8 @@ func (b *Batch) refreshMemTableSize() {
 	case InternalKeyKindRangeKeySet, InternalKeyKindRangeKeyUnset, InternalKeyKindRangeKeyDelete:
 		b.countRangeKeys++
 	case InternalKeyKindDeleteSized:
-		if b.minimumFormatMajorVersion < ExperimentalFormatDeleteSized {
-			b.minimumFormatMajorVersion = ExperimentalFormatDeleteSized
+		if b.minimumFormatMajorVersion < ExperimentalFormatDeleteSizedAndObsolete {
+			b.minimumFormatMajorVersion = ExperimentalFormatDeleteSizedAndObsolete
 		}
 	case InternalKeyKindIngestSST:
 		if b.minimumFormatMajorVersion < FormatFlushableIngest {
@@ -729,8 +729,8 @@ func (b *Batch) DeleteSized(key []byte, deletedValueSize uint32, _ *WriteOptions
 //
complete key slice, letting the caller encode into the DeferredBatchOp.Key // slice and then call Finish() on the returned object. func (b *Batch) DeleteSizedDeferred(keyLen int, deletedValueSize uint32) *DeferredBatchOp { - if b.minimumFormatMajorVersion < ExperimentalFormatDeleteSized { - b.minimumFormatMajorVersion = ExperimentalFormatDeleteSized + if b.minimumFormatMajorVersion < ExperimentalFormatDeleteSizedAndObsolete { + b.minimumFormatMajorVersion = ExperimentalFormatDeleteSizedAndObsolete } // Encode the sum of the key length and the value in the value. diff --git a/compaction.go b/compaction.go index 5cee48d9ee..8e81724bd9 100644 --- a/compaction.go +++ b/compaction.go @@ -1415,6 +1415,10 @@ func (c *compaction) newInputIter( iterOpts := IterOptions{logger: c.logger} // TODO(bananabrick): Get rid of the extra manifest.Level parameter and fold it into // compactionLevel. + // + // TODO(bilal): when we start using strict obsolete sstables for L5 and L6 + // in disaggregated storage, and rely on the obsolete bit, we will also need + // to configure the levelIter at these levels to hide the obsolete points. addItersForLevel := func(level *compactionLevel, l manifest.Level) error { iters = append(iters, newLevelIter(iterOpts, c.cmp, nil /* split */, newIters, level.files.Iter(), l, &c.bytesIterated)) @@ -3235,7 +3239,10 @@ func (d *DB) runCompaction( return nil, pendingOutputs, stats, err } } - if err := tw.Add(*key, val); err != nil { + // iter.snapshotPinned is broader than whether the point was covered by + // a RANGEDEL, but it is harmless to pass true when the callee will also + // independently discover that the point is obsolete. + if err := tw.AddWithForceObsolete(*key, val, iter.snapshotPinned); err != nil { return nil, pendingOutputs, stats, err } if iter.snapshotPinned { diff --git a/compaction_iter.go b/compaction_iter.go index b013036b6b..6623e9343e 100644 --- a/compaction_iter.go +++ b/compaction_iter.go @@ -383,6 +383,13 @@ func (i *compactionIter) Next() (*InternalKey, []byte) { } else if cover == keyspan.CoversInvisibly { // i.iterKey would be deleted by a range deletion if there weren't // any open snapshots. Mark it as pinned. + // + // TODO(sumeer): there are multiple places in this file where we call + // i.rangeDelFrag.Covers and this is the only one where we are fiddling + // with i.snapshotPinned. i.snapshotPinned was previously being used + // only for stats, where a mistake does not lead to corruption. But it + // is also now being used for the forceObsolete bit in + // Writer.AddWithForceObsolete(). Give this more scrutiny. 
i.snapshotPinned = true } diff --git a/compaction_iter_test.go b/compaction_iter_test.go index 38f7cfc66a..6e3ad29f69 100644 --- a/compaction_iter_test.go +++ b/compaction_iter_test.go @@ -92,7 +92,7 @@ func TestCompactionIter(t *testing.T) { if formatVersion < FormatSetWithDelete { return "testdata/compaction_iter" } - if formatVersion < ExperimentalFormatDeleteSized { + if formatVersion < ExperimentalFormatDeleteSizedAndObsolete { return "testdata/compaction_iter_set_with_del" } return "testdata/compaction_iter_delete_sized" diff --git a/compaction_test.go b/compaction_test.go index 6571759139..6f1d2bf2d4 100644 --- a/compaction_test.go +++ b/compaction_test.go @@ -1607,11 +1607,11 @@ func TestManualCompaction(t *testing.T) { { testData: "testdata/manual_compaction_file_boundaries", minVersion: FormatMostCompatible, - maxVersion: ExperimentalFormatDeleteSized - 1, + maxVersion: ExperimentalFormatDeleteSizedAndObsolete - 1, }, { testData: "testdata/manual_compaction_file_boundaries_delsized", - minVersion: ExperimentalFormatDeleteSized, + minVersion: ExperimentalFormatDeleteSizedAndObsolete, maxVersion: internalFormatNewest, }, } diff --git a/db.go b/db.go index b4173ffe01..cd59a4cf82 100644 --- a/db.go +++ b/db.go @@ -1362,6 +1362,7 @@ func (i *Iterator) constructPointIter( levelsIndex := len(levels) mlevels = mlevels[:numMergingLevels] levels = levels[:numLevelIters] + i.opts.snapshotForHideObsoletePoints = buf.dbi.seqNum addLevelIterForFiles := func(files manifest.LevelIterator, level manifest.Level) { li := &levels[levelsIndex] diff --git a/external_iterator.go b/external_iterator.go index 3c45c2cd48..8fbef76e72 100644 --- a/external_iterator.go +++ b/external_iterator.go @@ -209,15 +209,15 @@ func createExternalPointIter(ctx context.Context, it *Iterator) (internalIterato pointIter internalIterator err error ) - pointIter, err = r.NewIterWithBlockPropertyFiltersAndContext( - ctx, - it.opts.LowerBound, - it.opts.UpperBound, - nil, /* BlockPropertiesFilterer */ - false, /* useFilterBlock */ - &it.stats.InternalStats, - sstable.TrivialReaderProvider{Reader: r}, - ) + // We could set hideObsoletePoints=true, since we are reading at + // InternalKeySeqNumMax, but we don't bother since these sstables should + // not have obsolete points (so the performance optimization is + // unnecessary), and we don't want to bother constructing a + // BlockPropertiesFilterer that includes obsoleteKeyBlockPropertyFilter. + pointIter, err = r.NewIterWithBlockPropertyFiltersAndContextEtc( + ctx, it.opts.LowerBound, it.opts.UpperBound, nil, /* BlockPropertiesFilterer */ + false /* hideObsoletePoints */, false, /* useFilterBlock */ + &it.stats.InternalStats, sstable.TrivialReaderProvider{Reader: r}) if err != nil { return nil, err } diff --git a/format_major_version.go b/format_major_version.go index b2a561fb8b..27cef3028c 100644 --- a/format_major_version.go +++ b/format_major_version.go @@ -151,11 +151,13 @@ const ( // compactions for files marked for compaction are complete. FormatPrePebblev1MarkedCompacted - // ExperimentalFormatDeleteSized is a format major version that adds support for - // deletion tombstones that encode the size of the value they're expected to - // delete. This format major version is required before the associated key - // kind may be committed through batch applications or ingests. 
- ExperimentalFormatDeleteSized + // ExperimentalFormatDeleteSizedAndObsolete is a format major version that adds support + // for deletion tombstones that encode the size of the value they're + // expected to delete. This format major version is required before the + // associated key kind may be committed through batch applications or + // ingests. It also adds support for keys that are marked obsolete (see + // sstable/format.go for details). + ExperimentalFormatDeleteSizedAndObsolete // internalFormatNewest holds the newest format major version, including // experimental ones excluded from the exported FormatNewest constant until @@ -182,7 +184,7 @@ func (v FormatMajorVersion) MaxTableFormat() sstable.TableFormat { return sstable.TableFormatPebblev2 case FormatSSTableValueBlocks, FormatFlushableIngest, FormatPrePebblev1MarkedCompacted: return sstable.TableFormatPebblev3 - case ExperimentalFormatDeleteSized: + case ExperimentalFormatDeleteSizedAndObsolete: return sstable.TableFormatPebblev4 default: panic(fmt.Sprintf("pebble: unsupported format major version: %s", v)) @@ -201,7 +203,7 @@ func (v FormatMajorVersion) MinTableFormat() sstable.TableFormat { case FormatMinTableFormatPebblev1, FormatPrePebblev1Marked, FormatUnusedPrePebblev1MarkedCompacted, FormatSSTableValueBlocks, FormatFlushableIngest, FormatPrePebblev1MarkedCompacted, - ExperimentalFormatDeleteSized: + ExperimentalFormatDeleteSizedAndObsolete: return sstable.TableFormatPebblev1 default: panic(fmt.Sprintf("pebble: unsupported format major version: %s", v)) @@ -334,8 +336,8 @@ var formatMajorVersionMigrations = map[FormatMajorVersion]func(*DB) error{ } return d.finalizeFormatVersUpgrade(FormatPrePebblev1MarkedCompacted) }, - ExperimentalFormatDeleteSized: func(d *DB) error { - return d.finalizeFormatVersUpgrade(ExperimentalFormatDeleteSized) + ExperimentalFormatDeleteSizedAndObsolete: func(d *DB) error { + return d.finalizeFormatVersUpgrade(ExperimentalFormatDeleteSizedAndObsolete) }, } diff --git a/format_major_version_test.go b/format_major_version_test.go index ca136351bf..571749d1c3 100644 --- a/format_major_version_test.go +++ b/format_major_version_test.go @@ -62,8 +62,8 @@ func TestRatchetFormat(t *testing.T) { require.Equal(t, FormatFlushableIngest, d.FormatMajorVersion()) require.NoError(t, d.RatchetFormatMajorVersion(FormatPrePebblev1MarkedCompacted)) require.Equal(t, FormatPrePebblev1MarkedCompacted, d.FormatMajorVersion()) - require.NoError(t, d.RatchetFormatMajorVersion(ExperimentalFormatDeleteSized)) - require.Equal(t, ExperimentalFormatDeleteSized, d.FormatMajorVersion()) + require.NoError(t, d.RatchetFormatMajorVersion(ExperimentalFormatDeleteSizedAndObsolete)) + require.Equal(t, ExperimentalFormatDeleteSizedAndObsolete, d.FormatMajorVersion()) require.NoError(t, d.Close()) @@ -212,22 +212,22 @@ func TestFormatMajorVersions_TableFormat(t *testing.T) { // fixture is intentionally verbose. 
m := map[FormatMajorVersion][2]sstable.TableFormat{ - FormatDefault: {sstable.TableFormatLevelDB, sstable.TableFormatRocksDBv2}, - FormatMostCompatible: {sstable.TableFormatLevelDB, sstable.TableFormatRocksDBv2}, - formatVersionedManifestMarker: {sstable.TableFormatLevelDB, sstable.TableFormatRocksDBv2}, - FormatVersioned: {sstable.TableFormatLevelDB, sstable.TableFormatRocksDBv2}, - FormatSetWithDelete: {sstable.TableFormatLevelDB, sstable.TableFormatRocksDBv2}, - FormatBlockPropertyCollector: {sstable.TableFormatLevelDB, sstable.TableFormatPebblev1}, - FormatSplitUserKeysMarked: {sstable.TableFormatLevelDB, sstable.TableFormatPebblev1}, - FormatSplitUserKeysMarkedCompacted: {sstable.TableFormatLevelDB, sstable.TableFormatPebblev1}, - FormatRangeKeys: {sstable.TableFormatLevelDB, sstable.TableFormatPebblev2}, - FormatMinTableFormatPebblev1: {sstable.TableFormatPebblev1, sstable.TableFormatPebblev2}, - FormatPrePebblev1Marked: {sstable.TableFormatPebblev1, sstable.TableFormatPebblev2}, - FormatUnusedPrePebblev1MarkedCompacted: {sstable.TableFormatPebblev1, sstable.TableFormatPebblev2}, - FormatSSTableValueBlocks: {sstable.TableFormatPebblev1, sstable.TableFormatPebblev3}, - FormatFlushableIngest: {sstable.TableFormatPebblev1, sstable.TableFormatPebblev3}, - FormatPrePebblev1MarkedCompacted: {sstable.TableFormatPebblev1, sstable.TableFormatPebblev3}, - ExperimentalFormatDeleteSized: {sstable.TableFormatPebblev1, sstable.TableFormatPebblev4}, + FormatDefault: {sstable.TableFormatLevelDB, sstable.TableFormatRocksDBv2}, + FormatMostCompatible: {sstable.TableFormatLevelDB, sstable.TableFormatRocksDBv2}, + formatVersionedManifestMarker: {sstable.TableFormatLevelDB, sstable.TableFormatRocksDBv2}, + FormatVersioned: {sstable.TableFormatLevelDB, sstable.TableFormatRocksDBv2}, + FormatSetWithDelete: {sstable.TableFormatLevelDB, sstable.TableFormatRocksDBv2}, + FormatBlockPropertyCollector: {sstable.TableFormatLevelDB, sstable.TableFormatPebblev1}, + FormatSplitUserKeysMarked: {sstable.TableFormatLevelDB, sstable.TableFormatPebblev1}, + FormatSplitUserKeysMarkedCompacted: {sstable.TableFormatLevelDB, sstable.TableFormatPebblev1}, + FormatRangeKeys: {sstable.TableFormatLevelDB, sstable.TableFormatPebblev2}, + FormatMinTableFormatPebblev1: {sstable.TableFormatPebblev1, sstable.TableFormatPebblev2}, + FormatPrePebblev1Marked: {sstable.TableFormatPebblev1, sstable.TableFormatPebblev2}, + FormatUnusedPrePebblev1MarkedCompacted: {sstable.TableFormatPebblev1, sstable.TableFormatPebblev2}, + FormatSSTableValueBlocks: {sstable.TableFormatPebblev1, sstable.TableFormatPebblev3}, + FormatFlushableIngest: {sstable.TableFormatPebblev1, sstable.TableFormatPebblev3}, + FormatPrePebblev1MarkedCompacted: {sstable.TableFormatPebblev1, sstable.TableFormatPebblev3}, + ExperimentalFormatDeleteSizedAndObsolete: {sstable.TableFormatPebblev1, sstable.TableFormatPebblev4}, } // Valid versions. 
diff --git a/get_iter.go b/get_iter.go index f1fc25caed..0ac0484e50 100644 --- a/get_iter.go +++ b/get_iter.go @@ -158,7 +158,7 @@ func (g *getIter) Next() (*InternalKey, base.LazyValue) { if n := len(g.l0); n > 0 { files := g.l0[n-1].Iter() g.l0 = g.l0[:n-1] - iterOpts := IterOptions{logger: g.logger} + iterOpts := IterOptions{logger: g.logger, snapshotForHideObsoletePoints: g.snapshot} g.levelIter.init(context.Background(), iterOpts, g.cmp, nil /* split */, g.newIters, files, manifest.L0Sublevel(n), internalIterOpts{}) g.levelIter.initRangeDel(&g.rangeDelIter) @@ -177,7 +177,7 @@ func (g *getIter) Next() (*InternalKey, base.LazyValue) { continue } - iterOpts := IterOptions{logger: g.logger} + iterOpts := IterOptions{logger: g.logger, snapshotForHideObsoletePoints: g.snapshot} g.levelIter.init(context.Background(), iterOpts, g.cmp, nil /* split */, g.newIters, g.version.Levels[g.level].Iter(), manifest.Level(g.level), internalIterOpts{}) g.levelIter.initRangeDel(&g.rangeDelIter) diff --git a/internal/base/internal.go b/internal/base/internal.go index 6f954bb851..082307a35b 100644 --- a/internal/base/internal.go +++ b/internal/base/internal.go @@ -24,6 +24,14 @@ const ( //InternalKeyKindColumnFamilyDeletion InternalKeyKind = 4 //InternalKeyKindColumnFamilyValue InternalKeyKind = 5 //InternalKeyKindColumnFamilyMerge InternalKeyKind = 6 + + // InternalKeyKindSingleDelete (SINGLEDEL) is a performance optimization + // solely for compactions (to reduce write amp and space amp). Readers other + // than compactions should treat SINGLEDEL as equivalent to a DEL. + // Historically, it was simpler for readers other than compactions to treat + // SINGLEDEL as equivalent to DEL, but as of the introduction of + // InternalKeyKindSSTableInternalObsoleteBit, this is also necessary for + // correctness. InternalKeyKindSingleDelete InternalKeyKind = 7 //InternalKeyKindColumnFamilySingleDelete InternalKeyKind = 8 //InternalKeyKindBeginPrepareXID InternalKeyKind = 9 @@ -71,7 +79,7 @@ const ( // value indicating the (len(key)+len(value)) of the shadowed entry the // tombstone is expected to delete. This value is used to inform compaction // heuristics, but is not required to be accurate for correctness. - InternalKeyKindDeleteSized = 23 + InternalKeyKindDeleteSized InternalKeyKind = 23 // This maximum value isn't part of the file format. Future extensions may // increase this value. @@ -84,12 +92,17 @@ const ( // seqNum. InternalKeyKindMax InternalKeyKind = 23 + // Internal to the sstable format. Not exposed by any sstable iterator. + // Declared here to prevent definition of valid key kinds that set this bit. + InternalKeyKindSSTableInternalObsoleteBit InternalKeyKind = 64 + InternalKeyKindSSTableInternalObsoleteMask InternalKeyKind = 191 + // InternalKeyZeroSeqnumMaxTrailer is the largest trailer with a // zero sequence number. - InternalKeyZeroSeqnumMaxTrailer = uint64(InternalKeyKindInvalid) + InternalKeyZeroSeqnumMaxTrailer = uint64(255) // A marker for an invalid key. - InternalKeyKindInvalid InternalKeyKind = 255 + InternalKeyKindInvalid InternalKeyKind = InternalKeyKindSSTableInternalObsoleteMask // InternalKeySeqNumBatch is a bit that is set on batch sequence numbers // which prevents those entries from being excluded from iteration. 
@@ -112,6 +125,9 @@ const ( InternalKeyBoundaryRangeKey = (InternalKeySeqNumMax << 8) | uint64(InternalKeyKindRangeKeySet) ) +// Assert InternalKeyKindSSTableInternalObsoleteBit > InternalKeyKindMax +const _ = uint(InternalKeyKindSSTableInternalObsoleteBit - InternalKeyKindMax - 1) + var internalKeyKindNames = []string{ InternalKeyKindDelete: "DEL", InternalKeyKindSet: "SET", diff --git a/iterator.go b/iterator.go index 996817c78b..9c1ff3eea9 100644 --- a/iterator.go +++ b/iterator.go @@ -569,6 +569,9 @@ func (i *Iterator) findNextEntry(limit []byte) { return case InternalKeyKindDelete, InternalKeyKindSingleDelete, InternalKeyKindDeleteSized: + // NB: treating InternalKeyKindSingleDelete as equivalent to DEL is not + // only simpler, but is also necessary for correctness due to + // InternalKeyKindSSTableInternalObsoleteBit. i.nextUserKey() continue @@ -632,6 +635,9 @@ func (i *Iterator) nextPointCurrentUserKey() bool { return false case InternalKeyKindDelete, InternalKeyKindSingleDelete, InternalKeyKindDeleteSized: + // NB: treating InternalKeyKindSingleDelete as equivalent to DEL is not + // only simpler, but is also necessary for correctness due to + // InternalKeyKindSSTableInternalObsoleteBit. return false case InternalKeyKindSet, InternalKeyKindSetWithDelete: @@ -1095,6 +1101,10 @@ func (i *Iterator) mergeNext(key InternalKey, valueMerger ValueMerger) { case InternalKeyKindDelete, InternalKeyKindSingleDelete, InternalKeyKindDeleteSized: // We've hit a deletion tombstone. Return everything up to this // point. + // + // NB: treating InternalKeyKindSingleDelete as equivalent to DEL is not + // only simpler, but is also necessary for correctness due to + // InternalKeyKindSSTableInternalObsoleteBit. return case InternalKeyKindSet, InternalKeyKindSetWithDelete: diff --git a/level_iter.go b/level_iter.go index 10ee903338..68e2e338a1 100644 --- a/level_iter.go +++ b/level_iter.go @@ -200,6 +200,11 @@ type levelIter struct { // cache when constructing new table iterators. internalOpts internalIterOpts + // Scratch space for the obsolete keys filter, when there are no other block + // property filters specified. See the performance note where + // IterOptions.PointKeyFilters is declared. + filtersBuf [1]BlockPropertyFilter + // Disable invariant checks even if they are otherwise enabled. Used by tests // which construct "impossible" situations (e.g. seeking to a key before the // lower bound). 
@@ -267,8 +272,12 @@ func (l *levelIter) init( l.upper = opts.UpperBound l.tableOpts.TableFilter = opts.TableFilter l.tableOpts.PointKeyFilters = opts.PointKeyFilters + if len(opts.PointKeyFilters) == 0 { + l.tableOpts.PointKeyFilters = l.filtersBuf[:0:1] + } l.tableOpts.UseL6Filters = opts.UseL6Filters l.tableOpts.level = l.level + l.tableOpts.snapshotForHideObsoletePoints = opts.snapshotForHideObsoletePoints l.cmp = cmp l.split = split l.iterFile = nil diff --git a/level_iter_test.go b/level_iter_test.go index af65b943b4..d2ec77b80b 100644 --- a/level_iter_test.go +++ b/level_iter_test.go @@ -163,8 +163,8 @@ func (lt *levelIterTest) newIters( ctx context.Context, file *manifest.FileMetadata, opts *IterOptions, iio internalIterOpts, ) (internalIterator, keyspan.FragmentIterator, error) { lt.itersCreated++ - iter, err := lt.readers[file.FileNum].NewIterWithBlockPropertyFiltersAndContext( - ctx, opts.LowerBound, opts.UpperBound, nil, true, iio.stats, + iter, err := lt.readers[file.FileNum].NewIterWithBlockPropertyFiltersAndContextEtc( + ctx, opts.LowerBound, opts.UpperBound, nil, false, true, iio.stats, sstable.TrivialReaderProvider{Reader: lt.readers[file.FileNum]}) if err != nil { return nil, nil, err diff --git a/metamorphic/generator.go b/metamorphic/generator.go index 049c901790..241f5b8679 100644 --- a/metamorphic/generator.go +++ b/metamorphic/generator.go @@ -514,8 +514,8 @@ func (g *generator) dbRatchetFormatMajorVersion() { // TODO(jackson): When FormatDeleteSized is stabilized, return this to just // using `FormatNewest`. newestTODO := pebble.FormatNewest - if newestTODO < pebble.ExperimentalFormatDeleteSized { - newestTODO = pebble.ExperimentalFormatDeleteSized + if newestTODO < pebble.ExperimentalFormatDeleteSizedAndObsolete { + newestTODO = pebble.ExperimentalFormatDeleteSizedAndObsolete } n := int(newestTODO - minimumFormatMajorVersion) diff --git a/metamorphic/ops.go b/metamorphic/ops.go index 77cff3de2a..78acba6882 100644 --- a/metamorphic/ops.go +++ b/metamorphic/ops.go @@ -192,7 +192,7 @@ type deleteOp struct { func (o *deleteOp) run(t *test, h historyRecorder) { w := t.getWriter(o.writerID) var err error - if t.testOpts.deleteSized && t.isFMV(pebble.ExperimentalFormatDeleteSized) { + if t.testOpts.deleteSized && t.isFMV(pebble.ExperimentalFormatDeleteSizedAndObsolete) { // Call DeleteSized with a deterministic size derived from the index. // The size does not need to be accurate for correctness. err = w.DeleteSized(o.key, hashSize(t.idx), t.writeOpts) diff --git a/options.go b/options.go index c415f77fa3..b30c19cb12 100644 --- a/options.go +++ b/options.go @@ -119,10 +119,13 @@ type IterOptions struct { // function can be used by multiple iterators, if the iterator is cloned. TableFilter func(userProps map[string]string) bool // PointKeyFilters can be used to avoid scanning tables and blocks in tables - // when iterating over point keys. It is requires that this slice is sorted in - // increasing order of the BlockPropertyFilter.ShortID. This slice represents - // an intersection across all filters, i.e., all filters must indicate that the - // block is relevant. + // when iterating over point keys. This slice represents an intersection + // across all filters, i.e., all filters must indicate that the block is + // relevant. + // + // Performance note: When len(PointKeyFilters) > 0, the caller should ensure + // that cap(PointKeyFilters) is at least len(PointKeyFilters)+1. This helps + // avoid allocations in Pebble internal code that mutates the slice. 
PointKeyFilters []BlockPropertyFilter // RangeKeyFilters can be usefd to avoid scanning tables and blocks in tables // when iterating over range keys. The same requirements that apply to @@ -181,6 +184,10 @@ type IterOptions struct { level manifest.Level // disableLazyCombinedIteration is an internal testing option. disableLazyCombinedIteration bool + // snapshotForHideObsoletePoints is specified for/by levelIter when opening + // files and is used to decide whether to hide obsolete points. A value of 0 + // implies obsolete points should not be hidden. + snapshotForHideObsoletePoints uint64 // NB: If adding new Options, you must account for them in iterator // construction and Iterator.SetOptions. @@ -632,18 +639,24 @@ type Options struct { ShortAttributeExtractor ShortAttributeExtractor // RequiredInPlaceValueBound specifies an optional span of user key - // prefixes for which the values must be stored with the key. This is - // useful for statically known exclusions to value separation. In - // CockroachDB, this will be used for the lock table key space that has - // non-empty suffixes, but those locks don't represent actual MVCC - // versions (the suffix ordering is arbitrary). We will also need to add - // support for dynamically configured exclusions (we want the default to - // be to allow Pebble to decide whether to separate the value or not, - // hence this is structured as exclusions), for example, for users of - // CockroachDB to dynamically exclude certain tables. + // prefixes that are not-MVCC, but have a suffix. For these the values + // must be stored with the key, since the concept of "older versions" is + // not defined. It is also useful for statically known exclusions to value + // separation. In CockroachDB, this will be used for the lock table key + // space that has non-empty suffixes, but those locks don't represent + // actual MVCC versions (the suffix ordering is arbitrary). We will also + // need to add support for dynamically configured exclusions (we want the + // default to be to allow Pebble to decide whether to separate the value + // or not, hence this is structured as exclusions), for example, for users + // of CockroachDB to dynamically exclude certain tables. // // Any change in exclusion behavior takes effect only on future written // sstables, and does not start rewriting existing sstables. + // + // Even ignoring changes in this setting, exclusions are interpreted as a + // guidance by Pebble, and not necessarily honored. Specifically, user + // keys with multiple Pebble-versions *may* have the older versions stored + // in value blocks. 
RequiredInPlaceValueBound UserKeyPrefixBound // DisableIngestAsFlushable disables lazy ingestion of sstables through @@ -1642,6 +1655,9 @@ func (o *Options) MakeWriterOptions(level int, format sstable.TableFormat) sstab if format >= sstable.TableFormatPebblev3 { writerOpts.ShortAttributeExtractor = o.Experimental.ShortAttributeExtractor writerOpts.RequiredInPlaceValueBound = o.Experimental.RequiredInPlaceValueBound + if format >= sstable.TableFormatPebblev4 && level == numLevels-1 { + writerOpts.WritingToLowestLevel = true + } } levelOpts := o.Level(level) writerOpts.BlockRestartInterval = levelOpts.BlockRestartInterval diff --git a/scan_internal.go b/scan_internal.go index beee42004f..7f7a2378c5 100644 --- a/scan_internal.go +++ b/scan_internal.go @@ -977,6 +977,7 @@ func (i *scanInternalIterator) constructPointIter(memtables flushableList, buf * mlevels = mlevels[:numMergingLevels] levels = levels[:numLevelIters] rangeDelLevels = rangeDelLevels[:numLevelIters] + i.opts.IterOptions.snapshotForHideObsoletePoints = i.seqNum addLevelIterForFiles := func(files manifest.LevelIterator, level manifest.Level) { li := &levels[levelsIndex] rli := &rangeDelLevels[levelsIndex] diff --git a/sstable/block.go b/sstable/block.go index 40ad4ec1bc..f7fb628b83 100644 --- a/sstable/block.go +++ b/sstable/block.go @@ -47,7 +47,10 @@ type blockWriter struct { // ensuring that any of the keys with the same prefix can be used to // assemble the full key when the prefix does change. restarts []uint32 - curKey []byte + // Do not read curKey directly from outside blockWriter since it can have + // the InternalKeyKindSSTableInternalObsoleteBit set. Use getCurKey() or + // getCurUserKey() instead. + curKey []byte // curValue excludes the optional prefix provided to // storeWithOptionalValuePrefix. curValue []byte @@ -79,6 +82,20 @@ const setHasSameKeyPrefixRestartMask uint32 = 1 << 31 const restartMaskLittleEndianHighByteWithoutSetHasSamePrefix byte = 0b0111_1111 const restartMaskLittleEndianHighByteOnlySetHasSamePrefix byte = 0b1000_0000 +func (w *blockWriter) getCurKey() InternalKey { + k := base.DecodeInternalKey(w.curKey) + k.Trailer = k.Trailer & trailerObsoleteMask + return k +} + +func (w *blockWriter) getCurUserKey() []byte { + n := len(w.curKey) - base.InternalTrailerLen + if n < 0 { + panic("corrupt key in blockWriter buffer") + } + return w.curKey[:n:n] +} + // If !addValuePrefix, the valuePrefix is ignored. func (w *blockWriter) storeWithOptionalValuePrefix( keySize int, @@ -191,12 +208,21 @@ func (w *blockWriter) storeWithOptionalValuePrefix( func (w *blockWriter) add(key InternalKey, value []byte) { w.addWithOptionalValuePrefix( - key, value, len(key.UserKey), false, 0, false) + key, false, value, len(key.UserKey), false, 0, false) } // Callers that always set addValuePrefix to false should use add() instead. +// +// isObsolete indicates whether this key-value pair is obsolete in this +// sstable (only applicable when writing data blocks) -- see the comment in +// table.go and the longer one in format.go. addValuePrefix adds a 1 byte +// prefix to the value, specified in valuePrefix -- this is used for data +// blocks in TableFormatPebblev3 onwards for SETs (see the comment in +// format.go, with more details in value_block.go). setHasSameKeyPrefix is +// also used in TableFormatPebblev3 onwards for SETs. 
func (w *blockWriter) addWithOptionalValuePrefix( key InternalKey, + isObsolete bool, value []byte, maxSharedKeyLen int, addValuePrefix bool, @@ -210,6 +236,9 @@ func (w *blockWriter) addWithOptionalValuePrefix( w.curKey = make([]byte, 0, size*2) } w.curKey = w.curKey[:size] + if isObsolete { + key.Trailer = key.Trailer | trailerObsoleteBit + } key.Encode(w.curKey) w.storeWithOptionalValuePrefix( @@ -381,6 +410,7 @@ type blockIter struct { vbr *valueBlockReader hasValuePrefix bool } + hideObsoletePoints bool } // blockIter implements the base.InternalIterator interface. @@ -388,14 +418,16 @@ var _ base.InternalIterator = (*blockIter)(nil) func newBlockIter(cmp Compare, block block) (*blockIter, error) { i := &blockIter{} - return i, i.init(cmp, block, 0) + return i, i.init(cmp, block, 0, false) } func (i *blockIter) String() string { return "block" } -func (i *blockIter) init(cmp Compare, block block, globalSeqNum uint64) error { +func (i *blockIter) init( + cmp Compare, block block, globalSeqNum uint64, hideObsoletePoints bool, +) error { numRestarts := int32(binary.LittleEndian.Uint32(block[len(block)-4:])) if numRestarts == 0 { return base.CorruptionErrorf("pebble/table: invalid table (block has no restart points)") @@ -408,6 +440,7 @@ func (i *blockIter) init(cmp Compare, block block, globalSeqNum uint64) error { i.data = block i.fullKey = i.fullKey[:0] i.val = nil + i.hideObsoletePoints = hideObsoletePoints i.clearCache() if i.restarts > 0 { if err := i.readFirstKey(); err != nil { @@ -420,10 +453,16 @@ func (i *blockIter) init(cmp Compare, block block, globalSeqNum uint64) error { return nil } -func (i *blockIter) initHandle(cmp Compare, block cache.Handle, globalSeqNum uint64) error { +// NB: two cases of hideObsoletePoints: +// - Local sstable iteration: globalSeqNum will be set iff the sstable was +// ingested. +// - Foreign sstable iteration: globalSeqNum is always set. +func (i *blockIter) initHandle( + cmp Compare, block cache.Handle, globalSeqNum uint64, hideObsoletePoints bool, +) error { i.cacheHandle.Release() i.cacheHandle = block - return i.init(cmp, block.Get(), globalSeqNum) + return i.init(cmp, block.Get(), globalSeqNum, hideObsoletePoints) } func (i *blockIter) invalidate() { @@ -590,7 +629,9 @@ func (i *blockIter) readFirstKey() error { // Manually inlining base.DecodeInternalKey provides a 5-10% speedup on // BlockIter benchmarks. if n := len(firstKey) - 8; n >= 0 { - i.firstKey.Trailer = binary.LittleEndian.Uint64(firstKey[n:]) + // NB: we do not track whether the firstKey is obsolete since the trailer + // of the firstKey is not used. + i.firstKey.Trailer = binary.LittleEndian.Uint64(firstKey[n:]) & trailerObsoleteMask i.firstKey.UserKey = firstKey[:n:n] if i.globalSeqNum != 0 { i.firstKey.SetSeqNum(i.globalSeqNum) @@ -603,11 +644,19 @@ func (i *blockIter) readFirstKey() error { return nil } -func (i *blockIter) decodeInternalKey(key []byte) { +// The sstable internal obsolete bit is set when writing a block and unset by +// blockIter, so no code outside block writing/reading code ever sees it. +const trailerObsoleteBit = uint64(base.InternalKeyKindSSTableInternalObsoleteBit) +const trailerObsoleteMask = (InternalKeySeqNumMax << 8) | uint64(base.InternalKeyKindSSTableInternalObsoleteMask) + +func (i *blockIter) decodeInternalKey(key []byte) (hiddenPoint bool) { // Manually inlining base.DecodeInternalKey provides a 5-10% speedup on // BlockIter benchmarks. 
if n := len(key) - 8; n >= 0 { - i.ikey.Trailer = binary.LittleEndian.Uint64(key[n:]) + trailer := binary.LittleEndian.Uint64(key[n:]) + hiddenPoint = i.hideObsoletePoints && + (trailer&trailerObsoleteBit != 0) + i.ikey.Trailer = trailer & trailerObsoleteMask i.ikey.UserKey = key[:n:n] if i.globalSeqNum != 0 { i.ikey.SetSeqNum(i.globalSeqNum) @@ -616,6 +665,7 @@ func (i *blockIter) decodeInternalKey(key []byte) { i.ikey.Trailer = uint64(InternalKeyKindInvalid) i.ikey.UserKey = nil } + return hiddenPoint } func (i *blockIter) clearCache() { @@ -640,13 +690,15 @@ func (i *blockIter) cacheEntry() { i.cachedBuf = append(i.cachedBuf, i.key...) } +func (i *blockIter) firstUserKey() []byte { + return i.firstKey.UserKey +} + // SeekGE implements internalIterator.SeekGE, as documented in the pebble // package. func (i *blockIter) SeekGE(key []byte, flags base.SeekGEFlags) (*InternalKey, base.LazyValue) { i.clearCache() - ikey := base.MakeSearchKey(key) - // Find the index of the smallest restart point whose key is > the key // sought; index will be numRestarts if there is no such restart point. i.offset = 0 @@ -700,23 +752,23 @@ func (i *blockIter) SeekGE(key []byte, flags base.SeekGEFlags) (*InternalKey, ba ptr = unsafe.Pointer(uintptr(ptr) + 5) } - // Manually inlining base.DecodeInternalKey provides a 5-10% speedup on - // BlockIter benchmarks. + // Manually inlining part of base.DecodeInternalKey provides a 5-10% + // speedup on BlockIter benchmarks. s := getBytes(ptr, int(v1)) - var k InternalKey + var k []byte if n := len(s) - 8; n >= 0 { - k.Trailer = binary.LittleEndian.Uint64(s[n:]) - k.UserKey = s[:n:n] - // NB: We can't have duplicate keys if the globalSeqNum != 0, so we - // leave the seqnum on this key as 0 as it won't affect our search - // since ikey has the maximum seqnum. - } else { - k.Trailer = uint64(InternalKeyKindInvalid) + k = s[:n:n] } + // Else k is invalid, and left as nil - if base.InternalCompare(i.cmp, ikey, k) >= 0 { + if i.cmp(key, k) > 0 { + // The search key is greater than the user key at this restart point. + // Search beyond this restart point, since we are trying to find the + // first restart point with a user key >= the search key. index = h + 1 // preserves f(i-1) == false } else { + // k >= search key, so prune everything after index (since index + // satisfies the property we are looking for). upper = h // preserves f(j) == true } } @@ -724,21 +776,25 @@ func (i *blockIter) SeekGE(key []byte, flags base.SeekGEFlags) (*InternalKey, ba // => answer is index. } - // Since keys are strictly increasing, if index > 0 then the restart point at - // index-1 will be the largest whose key is <= the key sought. If index == - // 0, then all keys in this block are larger than the key sought, and offset - // remains at zero. + // index is the first restart point with key >= search key. Define the keys + // between a restart point and the next restart point as belonging to that + // restart point. + // + // Since keys are strictly increasing, if index > 0 then the restart point + // at index-1 will be the first one that has some keys belonging to it that + // could be equal to the search key. If index == 0, then all keys in this + // block are larger than the key sought, and offset remains at zero. if index > 0 { i.offset = decodeRestart(i.data[i.restarts+4*(index-1):]) } i.readEntry() - i.decodeInternalKey(i.key) + hiddenPoint := i.decodeInternalKey(i.key) // Iterate from that restart point to somewhere >= the key sought. 
if !i.valid() { return nil, base.LazyValue{} } - if base.InternalCompare(i.cmp, i.ikey, ikey) >= 0 { + if !hiddenPoint && i.cmp(i.ikey.UserKey, key) >= 0 { // Initialize i.lazyValue if !i.lazyValueHandling.hasValuePrefix || base.TrailerKind(i.ikey.Trailer) != InternalKeyKindSet { @@ -751,7 +807,7 @@ func (i *blockIter) SeekGE(key []byte, flags base.SeekGEFlags) (*InternalKey, ba return &i.ikey, i.lazyValue } for i.Next(); i.valid(); i.Next() { - if base.InternalCompare(i.cmp, i.ikey, ikey) >= 0 { + if i.cmp(i.ikey.UserKey, key) >= 0 { // i.Next() has already initialized i.lazyValue. return &i.ikey, i.lazyValue } @@ -773,8 +829,6 @@ func (i *blockIter) SeekPrefixGE( func (i *blockIter) SeekLT(key []byte, flags base.SeekLTFlags) (*InternalKey, base.LazyValue) { i.clearCache() - ikey := base.MakeSearchKey(key) - // Find the index of the smallest restart point whose key is >= the key // sought; index will be numRestarts if there is no such restart point. i.offset = 0 @@ -828,23 +882,23 @@ func (i *blockIter) SeekLT(key []byte, flags base.SeekLTFlags) (*InternalKey, ba ptr = unsafe.Pointer(uintptr(ptr) + 5) } - // Manually inlining base.DecodeInternalKey provides a 5-10% speedup on - // BlockIter benchmarks. + // Manually inlining part of base.DecodeInternalKey provides a 5-10% + // speedup on BlockIter benchmarks. s := getBytes(ptr, int(v1)) - var k InternalKey + var k []byte if n := len(s) - 8; n >= 0 { - k.Trailer = binary.LittleEndian.Uint64(s[n:]) - k.UserKey = s[:n:n] - // NB: We can't have duplicate keys if the globalSeqNum != 0, so we - // leave the seqnum on this key as 0 as it won't affect our search - // since ikey has the maximum seqnum. - } else { - k.Trailer = uint64(InternalKeyKindInvalid) + k = s[:n:n] } + // Else k is invalid, and left as nil - if base.InternalCompare(i.cmp, ikey, k) > 0 { + if i.cmp(key, k) > 0 { + // The search key is greater than the user key at this restart point. + // Search beyond this restart point, since we are trying to find the + // first restart point with a user key >= the search key. index = h + 1 // preserves f(i-1) == false } else { + // k >= search key, so prune everything after index (since index + // satisfies the property we are looking for). upper = h // preserves f(j) == true } } @@ -852,8 +906,15 @@ func (i *blockIter) SeekLT(key []byte, flags base.SeekLTFlags) (*InternalKey, ba // => answer is index. } - // Since keys are strictly increasing, if index > 0 then the restart point at - // index-1 will be the largest whose key is < the key sought. + // index is the first restart point with key >= search key. Define the keys + // between a restart point and the next restart point as belonging to that + // restart point. Note that index could be equal to i.numRestarts, i.e., we + // are past the last restart. + // + // Since keys are strictly increasing, if index > 0 then the restart point + // at index-1 will be the first one that has some keys belonging to it that + // are less than the search key. If index == 0, then all keys in this block + // are larger than the search key, so there is no match. 
 	targetOffset := i.restarts
 	if index > 0 {
 		i.offset = decodeRestart(i.data[i.restarts+4*(index-1):])
@@ -876,24 +937,41 @@
 	for {
 		i.offset = i.nextOffset
 		i.readEntry()
-		i.decodeInternalKey(i.key)
-
-		if i.cmp(i.ikey.UserKey, ikey.UserKey) >= 0 {
+		// When hidden keys are common, there is additional optimization possible
+		// by not caching entries that are hidden (note that some calls to
+		// cacheEntry don't decode the internal key before caching, but checking
+		// whether a key is hidden does not require full decoding). However, we do
+		// need to use the blockEntry.offset in the cache for the first entry at
+		// the restart point to do the binary search when the cache is empty -- so
+		// we would need to cache that first entry (though not the key) even if it
+		// was hidden. Our current assumption is that if there are large numbers
+		// of hidden keys we will be able to skip whole blocks (using block
+		// property filters) so we don't bother optimizing.
+		hiddenPoint := i.decodeInternalKey(i.key)
+
+		// NB: we don't use the hiddenPoint return value of decodeInternalKey
+		// since we want to stop as soon as we reach a key >= ikey.UserKey, so
+		// that we can reverse.
+		if i.cmp(i.ikey.UserKey, key) >= 0 {
 			// The current key is greater than or equal to our search key. Back up to
 			// the previous key which was less than our search key. Note that this for
 			// loop will execute at least once with this if-block not being true, so
 			// the key we are backing up to is the last one this loop cached.
-			i.Prev()
-			// i.Prev() has already initialized i.lazyValue.
-			return &i.ikey, i.lazyValue
+			return i.Prev()
 		}
 
 		if i.nextOffset >= targetOffset {
-			// We've reached the end of the current restart block. Return the current
-			// key. When the restart interval is 1, the first iteration of the for
-			// loop will bring us here. In that case ikey is backed by the block so
-			// we get the desired key stability guarantee for the lifetime of the
-			// blockIter.
+			// We've reached the end of the current restart block. Return the
+			// current key if not hidden, else call Prev().
+			//
+			// When the restart interval is 1, the first iteration of the for loop
+			// will bring us here. In that case ikey is backed by the block so we
+			// get the desired key stability guarantee for the lifetime of the
+			// blockIter. That is, we never cache anything and therefore never
+			// return a key backed by cachedBuf.
+ if hiddenPoint { + return i.Prev() + } break } @@ -923,7 +1001,10 @@ func (i *blockIter) First() (*InternalKey, base.LazyValue) { } i.clearCache() i.readEntry() - i.decodeInternalKey(i.key) + hiddenPoint := i.decodeInternalKey(i.key) + if hiddenPoint { + return i.Next() + } if !i.lazyValueHandling.hasValuePrefix || base.TrailerKind(i.ikey.Trailer) != InternalKeyKindSet { i.lazyValue = base.MakeInPlaceValue(i.val) @@ -958,7 +1039,10 @@ func (i *blockIter) Last() (*InternalKey, base.LazyValue) { i.readEntry() } - i.decodeInternalKey(i.key) + hiddenPoint := i.decodeInternalKey(i.key) + if hiddenPoint { + return i.Prev() + } if !i.lazyValueHandling.hasValuePrefix || base.TrailerKind(i.ikey.Trailer) != InternalKeyKindSet { i.lazyValue = base.MakeInPlaceValue(i.val) @@ -988,6 +1072,7 @@ func (i *blockIter) Next() (*InternalKey, base.LazyValue) { i.clearCache() } +start: i.offset = i.nextOffset if !i.valid() { return nil, base.LazyValue{} @@ -995,11 +1080,17 @@ func (i *blockIter) Next() (*InternalKey, base.LazyValue) { i.readEntry() // Manually inlined version of i.decodeInternalKey(i.key). if n := len(i.key) - 8; n >= 0 { - i.ikey.Trailer = binary.LittleEndian.Uint64(i.key[n:]) + trailer := binary.LittleEndian.Uint64(i.key[n:]) + hiddenPoint := i.hideObsoletePoints && + (trailer&trailerObsoleteBit != 0) + i.ikey.Trailer = trailer & trailerObsoleteMask i.ikey.UserKey = i.key[:n:n] if i.globalSeqNum != 0 { i.ikey.SetSeqNum(i.globalSeqNum) } + if hiddenPoint { + goto start + } } else { i.ikey.Trailer = uint64(InternalKeyKindInvalid) i.ikey.UserKey = nil @@ -1015,8 +1106,8 @@ func (i *blockIter) Next() (*InternalKey, base.LazyValue) { return &i.ikey, i.lazyValue } -// nextPrefix is used for implementing NPrefix. -func (i *blockIter) nextPrefix(succKey []byte) (*InternalKey, base.LazyValue) { +// NextPrefix implements (base.InternalIterator).NextPrefix. +func (i *blockIter) NextPrefix(succKey []byte) (*InternalKey, base.LazyValue) { if i.lazyValueHandling.hasValuePrefix { return i.nextPrefixV3(succKey) } @@ -1141,6 +1232,7 @@ func (i *blockIter) nextPrefixV3(succKey []byte) (*InternalKey, base.LazyValue) // The trailer is written in little endian, so the key kind is the first // byte in the trailer that is encoded in the slice [unshared-8:unshared]. keyKind := InternalKeyKind((*[manual.MaxArrayLen]byte)(ptr)[unshared-8]) + keyKind = keyKind & base.InternalKeyKindSSTableInternalObsoleteMask prefixChanged := false if keyKind == InternalKeyKindSet { if invariants.Enabled && value == 0 { @@ -1256,8 +1348,12 @@ func (i *blockIter) nextPrefixV3(succKey []byte) (*InternalKey, base.LazyValue) i.key = i.fullKey } // Manually inlined version of i.decodeInternalKey(i.key). + hiddenPoint := false if n := len(i.key) - 8; n >= 0 { - i.ikey.Trailer = binary.LittleEndian.Uint64(i.key[n:]) + trailer := binary.LittleEndian.Uint64(i.key[n:]) + hiddenPoint = i.hideObsoletePoints && + (trailer&trailerObsoleteBit != 0) + i.ikey.Trailer = trailer & trailerObsoleteMask i.ikey.UserKey = i.key[:n:n] if i.globalSeqNum != 0 { i.ikey.SetSeqNum(i.globalSeqNum) @@ -1273,6 +1369,9 @@ func (i *blockIter) nextPrefixV3(succKey []byte) (*InternalKey, base.LazyValue) } if prefixChanged || i.cmp(i.ikey.UserKey, succKey) >= 0 { // Prefix has changed. 
+ if hiddenPoint { + return i.Next() + } if invariants.Enabled && !i.lazyValueHandling.hasValuePrefix { panic(errors.AssertionFailedf("nextPrefixV3 being run for non-v3 sstable")) } @@ -1294,23 +1393,11 @@ func (i *blockIter) nextPrefixV3(succKey []byte) (*InternalKey, base.LazyValue) return i.SeekGE(succKey, base.SeekGEFlagsNone) } -// NextPrefix implements (base.InternalIterator).NextPrefix -func (i *blockIter) NextPrefix(succKey []byte) (*InternalKey, base.LazyValue) { - const nextsBeforeSeek = 3 - k, v := i.Next() - for j := 1; k != nil && i.cmp(k.UserKey, succKey) < 0; j++ { - if j >= nextsBeforeSeek { - return i.SeekGE(succKey, base.SeekGEFlagsNone) - } - k, v = i.Next() - } - return k, v -} - // Prev implements internalIterator.Prev, as documented in the pebble // package. func (i *blockIter) Prev() (*InternalKey, base.LazyValue) { - if n := len(i.cached) - 1; n >= 0 { +start: + for n := len(i.cached) - 1; n >= 0; n-- { i.nextOffset = i.offset e := &i.cached[n] i.offset = e.offset @@ -1318,7 +1405,13 @@ func (i *blockIter) Prev() (*InternalKey, base.LazyValue) { // Manually inlined version of i.decodeInternalKey(i.key). i.key = i.cachedBuf[e.keyStart:e.keyEnd] if n := len(i.key) - 8; n >= 0 { - i.ikey.Trailer = binary.LittleEndian.Uint64(i.key[n:]) + trailer := binary.LittleEndian.Uint64(i.key[n:]) + hiddenPoint := i.hideObsoletePoints && + (trailer&trailerObsoleteBit != 0) + if hiddenPoint { + continue + } + i.ikey.Trailer = trailer & trailerObsoleteMask i.ikey.UserKey = i.key[:n:n] if i.globalSeqNum != 0 { i.ikey.SetSeqNum(i.globalSeqNum) @@ -1360,6 +1453,8 @@ func (i *blockIter) Prev() (*InternalKey, base.LazyValue) { // index ≤ h < upper offset := decodeRestart(i.data[i.restarts+4*h:]) if offset < targetOffset { + // Looking for the first restart that has offset >= targetOffset, so + // ignore h and earlier. index = h + 1 // preserves f(i-1) == false } else { upper = h // preserves f(j) == true @@ -1369,20 +1464,35 @@ func (i *blockIter) Prev() (*InternalKey, base.LazyValue) { // => answer is index. } + // index is first restart with offset >= targetOffset. Note that + // targetOffset may not be at a restart point since one can call Prev() + // after Next() (so the cache was not populated) and targetOffset refers to + // the current entry. index-1 must have an offset < targetOffset (it can't + // be equal to targetOffset since the binary search would have selected that + // as the index). i.offset = 0 if index > 0 { i.offset = decodeRestart(i.data[i.restarts+4*(index-1):]) } + // TODO(sumeer): why is the else case not an error given targetOffset is a + // valid offset. i.readEntry() + // We stop when i.nextOffset == targetOffset since the targetOffset is the + // entry we are stepping back from, and we don't need to cache the entry + // before it, since it is the candidate to return. for i.nextOffset < targetOffset { i.cacheEntry() i.offset = i.nextOffset i.readEntry() } - i.decodeInternalKey(i.key) + hiddenPoint := i.decodeInternalKey(i.key) + if hiddenPoint { + // Use the cache. 
+ goto start + } if !i.lazyValueHandling.hasValuePrefix || base.TrailerKind(i.ikey.Trailer) != InternalKeyKindSet { i.lazyValue = base.MakeInPlaceValue(i.val) diff --git a/sstable/block_property.go b/sstable/block_property.go index 1e986fbc3e..9bec5294dc 100644 --- a/sstable/block_property.go +++ b/sstable/block_property.go @@ -117,7 +117,7 @@ type BlockPropertyCollector interface { } // SuffixReplaceableBlockCollector is an extension to the BlockPropertyCollector -// interface that allows a block property collector to indicate the it supports +// interface that allows a block property collector to indicate that it supports // being *updated* during suffix replacement, i.e. when an existing SST in which // all keys have the same key suffix is updated to have a new suffix. // diff --git a/sstable/block_property_test.go b/sstable/block_property_test.go index ece6884a52..e7298eccb7 100644 --- a/sstable/block_property_test.go +++ b/sstable/block_property_test.go @@ -1320,7 +1320,8 @@ func runBlockPropsCmd(r *Reader, td *datadriven.TestData) string { if err != nil { return err.Error() } - if err := subiter.init(r.Compare, subIndex.Get(), 0 /* globalSeqNum */); err != nil { + if err := subiter.init( + r.Compare, subIndex.Get(), 0 /* globalSeqNum */, false); err != nil { return err.Error() } for key, value := subiter.First(); key != nil; key, value = subiter.Next() { diff --git a/sstable/block_test.go b/sstable/block_test.go index d191247640..7d1aed6e9f 100644 --- a/sstable/block_test.go +++ b/sstable/block_test.go @@ -56,7 +56,7 @@ func TestBlockWriterWithPrefix(t *testing.T) { valuePrefix valuePrefix, setHasSameKeyPrefix bool) { w.addWithOptionalValuePrefix( - key, value, len(key.UserKey), addValuePrefix, valuePrefix, setHasSameKeyPrefix) + key, false, value, len(key.UserKey), addValuePrefix, valuePrefix, setHasSameKeyPrefix) } addAdapter( ikey("apple"), []byte("red"), false, 0, true) diff --git a/sstable/data_test.go b/sstable/data_test.go index d89bed0767..9912bd070f 100644 --- a/sstable/data_test.go +++ b/sstable/data_test.go @@ -54,6 +54,10 @@ func optsFromArgs(td *datadriven.TestData, writerOpts *WriterOptions) error { writerOpts.FilterPolicy = bloom.FilterPolicy(10) case "comparer-split-4b-suffix": writerOpts.Comparer = test4bSuffixComparer + case "writing-to-lowest-level": + writerOpts.WritingToLowestLevel = true + case "is-strict-obsolete": + writerOpts.IsStrictObsolete = true } } return nil @@ -102,11 +106,19 @@ func runBuildCmd( continue } + forceObsolete := false + if strings.HasPrefix(data, "force-obsolete:") { + data = strings.TrimSpace(strings.TrimPrefix(data, "force-obsolete:")) + forceObsolete = true + } j := strings.Index(data, ":") key := base.ParseInternalKey(data[:j]) value := []byte(data[j+1:]) switch key.Kind() { case InternalKeyKindRangeDelete: + if forceObsolete { + return nil, nil, errors.Errorf("force-obsolete is not allowed for RANGEDEL") + } var err error func() { defer func() { @@ -124,7 +136,7 @@ func runBuildCmd( return nil, nil, err } default: - if err := w.Add(key, value); err != nil { + if err := w.AddWithForceObsolete(key, value, forceObsolete); err != nil { return nil, nil, err } } diff --git a/sstable/format.go b/sstable/format.go index e8c620028f..a1d9be01ad 100644 --- a/sstable/format.go +++ b/sstable/format.go @@ -29,6 +29,156 @@ const ( TableFormatMax = TableFormatPebblev4 ) +// TableFormatPebblev4, in addition to DELSIZED, introduces the use of +// InternalKeyKindSSTableInternalObsoleteBit. +// +// 1. 
Motivation
+//
+// We have various related problems caused by Pebble snapshots:
+//
+// - P1: RANGEDELs that delete points in the same sstable, but the points
+//   happen to not get deleted during compactions because of an open snapshot.
+//   This causes very expensive iteration, which has been observed in
+//   production deployments.
+//
+// - P2: When iterating over a foreign sstable (in disaggregated storage), we
+//   need to do (a) point collapsing to expose at most one point per user key,
+//   (b) apply RANGEDELs in the sstable to hide deleted points in the same
+//   sstable. This per-sstable point collapsing iteration needs to be very
+//   efficient (ideally as efficient from a CPU perspective as iteration over
+//   regular sstables) since foreign sstables can be very long-lived -- one of
+//   the goals of disaggregated storage is to scale compute and disk bandwidth
+//   resources as a function of the hot (from a write perspective) data and
+//   not the whole data, so we don't want to have to rewrite foreign sstables
+//   solely to improve read performance.
+//
+// The ideal solution for P2 would allow user-facing reads to utilize the
+// existing SST iterators (with slight modifications) and with no loss of
+// efficiency. And for P1 and P2 we would like to skip whole blocks of
+// overwritten/deleted points. Even when we can't skip whole blocks, avoiding
+// key comparisons at iteration time to discover what points are deleted is
+// very desirable, since keys can be long.
+//
+// We observe that:
+//
+// - Reads:
+//   - All user-facing reads in CockroachDB use iterators over the DB, hence
+//     have a higher read seqnum than all sstables (there are some rare cases
+//     that can violate this, but those are not important from a performance
+//     optimization perspective).
+//
+//   - Certain internal-facing reads in CockroachDB use snapshots, but the
+//     snapshots are short-lived enough that most L5 and L6 sstables will have
+//     all seqnums lower than the snapshot seqnum.
+//
+// - Writes:
+//   - We already do key comparisons between points when writing the sstable
+//     to ensure that the sstable invariant (monotonically increasing internal
+//     keys) is not violated. So we know which points share the same userkey,
+//     and thereby which points are obsolete because there is a more recent
+//     point in the same sstable.
+//
+//   - The compactionIter knows which point is deleted by a RANGEDEL even if
+//     the point still needs to be written because of a snapshot.
+//
+// So this known information can be encoded in the sstable at write time and
+// utilized for optimized reading.
+//
+// 2. Solution
+//
+// We primarily scope the solution to the following point kinds: SET,
+// SETWITHDEL, DEL, DELSIZED, SINGLEDEL. These are the ones marked locally
+// obsolete, i.e., obsolete within the sstable, and we can guarantee that at
+// most one point will be exposed per user key. MERGE keys create more
+// complexity: MERGE followed by MERGE causes multiple keys to not be
+// obsolete. The same applies to MERGE followed by SET/SETWITHDEL. Note that:
+//
+// - For regular sst iteration, the obsolete marking is a performance
+//   optimization, and multiple keys for the same userkey can be handled by
+//   higher layers in the iterator tree (specifically pebble.Iterator).
+//
+// - For foreign sst iteration, we disallow MERGEs to be written to such
+//   shared ssts (details below).
+//
+// The key kinds are marked with an obsolete bit
+// (InternalKeyKindSSTableInternalObsoleteBit) when the key-value pair is
+// obsolete.
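+//
+// For concreteness, a sketch of what the read path does (this mirrors the
+// manually inlined decoding in blockIter in this change; trailerObsoleteBit
+// and trailerObsoleteMask are the block-level constants corresponding to the
+// bit):
+//
+//	trailer := binary.LittleEndian.Uint64(key[len(key)-8:])
+//	hiddenPoint := hideObsoletePoints && (trailer&trailerObsoleteBit != 0)
+//	trailer &= trailerObsoleteMask // strip the bit before exposing the key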
This marking is done within blockWriter, based on information
+// passed to it by Writer. In turn, Writer uses a combination of key
+// comparisons and information provided by compactionIter to decide whether a
+// key-value pair is obsolete. Additionally, a Pebble-internal
+// BlockPropertyCollector (obsoleteKeyBlockPropertyCollector) is used to mark
+// blocks where all key-value pairs are obsolete. Since the common case is
+// non-obsolete blocks, this block property collector uses the empty byte
+// slice to represent a non-obsolete block, which consumes no space in
+// BlockHandleWithProperties.Props.
+//
+// At read time, the obsolete bit is only visible to the blockIter, which can
+// be optionally configured to hide obsolete points. This hiding is only
+// configured for data block iterators for sstables being read by user-facing
+// iterators at a seqnum greater than the max seqnum in the sstable.
+// Additionally, when this hiding is configured, a Pebble-internal block
+// property filter (obsoleteKeyBlockPropertyFilter) is used to skip whole
+// blocks that are obsolete.
+//
+// 2.1 Correctness
+//
+// Due to the level invariant, the sequence of seqnums for a user key in a
+// sstable represents a contiguous subsequence of the seqnums for the userkey
+// across the whole LSM, and is more recent than the seqnums in a sstable in a
+// lower level. So exposing exactly one point from a sstable for a userkey
+// will also mask the points for the userkey in lower levels. If we expose no
+// point because of a RANGEDEL, that RANGEDEL will also mask the points in
+// lower levels.
+//
+// Note that we do not need to do anything special at write time for
+// SETWITHDEL and SINGLEDEL. This is because these key kinds are treated
+// specially only by compactions, which do not hide obsolete points. For
+// regular reads, SETWITHDEL behaves the same as SET and SINGLEDEL behaves the
+// same as DEL.
+//
+// 2.2 Strictness and MERGE
+//
+// Setting the obsolete bit on point keys is advanced usage, so we support two
+// modes, both of which must be truthful when setting the obsolete bit, but
+// vary in when they don't set the obsolete bit.
+//
+// - Non-strict: In this mode, the bit does not need to be set for keys that
+//   are obsolete. Additionally, any sstable containing MERGE keys can only
+//   use this mode. An iterator over such an sstable, when configured to
+//   hideObsoletePoints, can expose multiple internal keys per user key, and
+//   can expose keys that are deleted by rangedels in the same sstable. This
+//   is the mode that non-advanced users should use. Pebble without
+//   disaggregated storage will also use this mode and will best-effort set
+//   the obsolete bit, to optimize iteration when snapshots have retained many
+//   obsolete keys.
+//
+// - Strict: In this mode, every obsolete key must have the obsolete bit set,
+//   and no MERGE keys are permitted. An iterator over such an sstable, when
+//   configured to hideObsoletePoints, satisfies two properties:
+//   - S1: will expose at most one internal key per user key, which is the
+//     most recent one.
+//   - S2: will never expose keys that are deleted by rangedels in the same
+//     sstable.
+//
+//   This is the mode for two use cases in disaggregated storage (which will
+//   exclude parts of the key space that have MERGEs), for levels that contain
+//   sstables that can become foreign sstables:
+//   - Pebble compaction output to these levels that can become foreign
+//     sstables.
+// +// - CockroachDB ingest operations that can ingest into the levels that can +// become foreign sstables. Note, these are not sstables corresponding to +// copied data for CockroachDB range snapshots. This case occurs for +// operations like index backfills: these trivially satisfy the strictness +// criteria since they only write one key per userkey. +// +// TODO(sumeer): this latter case is not currently supported, since only +// Writer.AddWithForceObsolete calls are permitted for writing strict +// obsolete sstables. This is done to reduce the likelihood of bugs. One +// simple way to lift this limitation would be to disallow adding any +// RANGEDELs when a Pebble-external writer is trying to construct a strict +// obsolete sstable. + // ParseTableFormat parses the given magic bytes and version into its // corresponding internal TableFormat. func ParseTableFormat(magic []byte, version uint32) (TableFormat, error) { diff --git a/sstable/internal.go b/sstable/internal.go index 039da91074..0fe7c9937d 100644 --- a/sstable/internal.go +++ b/sstable/internal.go @@ -23,6 +23,7 @@ const ( InternalKeyKindLogData = base.InternalKeyKindLogData InternalKeyKindSingleDelete = base.InternalKeyKindSingleDelete InternalKeyKindRangeDelete = base.InternalKeyKindRangeDelete + InternalKeyKindSetWithDelete = base.InternalKeyKindSetWithDelete InternalKeyKindDeleteSized = base.InternalKeyKindDeleteSized InternalKeyKindMax = base.InternalKeyKindMax InternalKeyKindInvalid = base.InternalKeyKindInvalid diff --git a/sstable/options.go b/sstable/options.go index 869e69b17c..a2d5e3e858 100644 --- a/sstable/options.go +++ b/sstable/options.go @@ -216,6 +216,17 @@ type WriterOptions struct { // by a wider range of tools and libraries. TableFormat TableFormat + // IsStrictObsolete is only relevant for >= TableFormatPebblev4. See comment + // in format.go. Must be false if format < TableFormatPebblev4. + // + // TODO(bilal): set this when writing shared ssts. + IsStrictObsolete bool + + // WritingToLowestLevel is only relevant for >= TableFormatPebblev4. It is + // used to set the obsolete bit on DEL/DELSIZED/SINGLEDEL if they are the + // youngest for a userkey. + WritingToLowestLevel bool + // TablePropertyCollectors is a list of TablePropertyCollector creation // functions. A new TablePropertyCollector is created for each sstable built // and lives for the lifetime of the table. diff --git a/sstable/properties.go b/sstable/properties.go index c85aa1db60..00881cb1ae 100644 --- a/sstable/properties.go +++ b/sstable/properties.go @@ -101,6 +101,9 @@ type Properties struct { IndexType uint32 `prop:"rocksdb.block.based.table.index.type"` // Whether delta encoding is used to encode the index values. IndexValueIsDeltaEncoded uint64 `prop:"rocksdb.index.value.is.delta.encoded"` + // For formats >= TableFormatPebblev4, this is set to true if the obsolete + // bit is strict for all the point keys. + IsStrictObsolete bool `prop:"pebble.obsolete.is_strict"` // The name of the merger used in this table. Empty if no merger is used. MergerName string `prop:"rocksdb.merge.operator"` // The number of blocks in this table. 
@@ -349,6 +352,9 @@ func (p *Properties) save(tblFormat TableFormat, w *rawBlockWriter) { p.saveUvarint(m, unsafe.Offsetof(p.IndexSize), p.IndexSize) p.saveUint32(m, unsafe.Offsetof(p.IndexType), p.IndexType) p.saveUvarint(m, unsafe.Offsetof(p.IndexValueIsDeltaEncoded), p.IndexValueIsDeltaEncoded) + if p.IsStrictObsolete { + p.saveBool(m, unsafe.Offsetof(p.IsStrictObsolete), p.IsStrictObsolete) + } if p.MergerName != "" { p.saveString(m, unsafe.Offsetof(p.MergerName), p.MergerName) } diff --git a/sstable/properties_test.go b/sstable/properties_test.go index 25d1e1ba1d..49deacf7c5 100644 --- a/sstable/properties_test.go +++ b/sstable/properties_test.go @@ -79,6 +79,7 @@ func TestPropertiesSave(t *testing.T) { IndexSize: 11, IndexType: 12, IndexValueIsDeltaEncoded: 13, + IsStrictObsolete: true, MergerName: "merge operator name", NumDataBlocks: 14, NumDeletions: 15, diff --git a/sstable/reader.go b/sstable/reader.go index b37e6f0186..8511ee1de7 100644 --- a/sstable/reader.go +++ b/sstable/reader.go @@ -323,6 +323,8 @@ type singleLevelIterator struct { // is high). useFilter bool lastBloomFilterMatched bool + + hideObsoletePoints bool } // singleLevelIterator implements the base.InternalIterator interface. @@ -404,7 +406,7 @@ func (i *singleLevelIterator) init( v *virtualState, lower, upper []byte, filterer *BlockPropertiesFilterer, - useFilter bool, + useFilter, hideObsoletePoints bool, stats *base.InternalIteratorStats, rp ReaderProvider, ) error { @@ -428,7 +430,8 @@ func (i *singleLevelIterator) init( i.reader = r i.cmp = r.Compare i.stats = stats - err = i.index.initHandle(i.cmp, indexH, r.Properties.GlobalSeqNum) + i.hideObsoletePoints = hideObsoletePoints + err = i.index.initHandle(i.cmp, indexH, r.Properties.GlobalSeqNum, false) if err != nil { // blockIter.Close releases indexH and always returns a nil error _ = i.index.Close() @@ -566,7 +569,7 @@ func (i *singleLevelIterator) loadBlock(dir int8) loadBlockResult { i.err = err return loadBlockFailed } - i.err = i.data.initHandle(i.cmp, block, i.reader.Properties.GlobalSeqNum) + i.err = i.data.initHandle(i.cmp, block, i.reader.Properties.GlobalSeqNum, i.hideObsoletePoints) if i.err != nil { // The block is partially loaded, and we don't want it to appear valid. i.data.invalidate() @@ -653,12 +656,13 @@ func (i *singleLevelIterator) resolveMaybeExcluded(dir int8) intersectsResult { } func (i *singleLevelIterator) initBoundsForAlreadyLoadedBlock() { - if i.data.firstKey.UserKey == nil { + if i.data.firstUserKey() == nil { panic("initBoundsForAlreadyLoadedBlock must not be called on empty or corrupted block") } i.blockLower = i.lower if i.blockLower != nil { - if i.data.firstKey.UserKey != nil && i.cmp(i.blockLower, i.data.firstKey.UserKey) < 0 { + firstUserKey := i.data.firstUserKey() + if firstUserKey != nil && i.cmp(i.blockLower, firstUserKey) < 0 { // The lower-bound is less than the first key in the block. No need // to check the lower-bound again for this block. i.blockLower = nil @@ -1066,7 +1070,7 @@ func (i *singleLevelIterator) SeekLT( var dontSeekWithinBlock bool if !i.data.isDataInvalidated() && !i.index.isDataInvalidated() && i.data.valid() && i.index.valid() && - boundsCmp < 0 && i.cmp(i.data.firstKey.UserKey, key) < 0 { + boundsCmp < 0 && i.cmp(i.data.firstUserKey(), key) < 0 { // Fast-path: The bounds have moved backward, and this SeekLT is // respecting the upper bound (guaranteed by Iterator). 
We know that // the iterator must already be positioned within or just outside the @@ -1314,7 +1318,7 @@ func (i *singleLevelIterator) NextPrefix(succKey []byte) (*InternalKey, base.Laz if i.err != nil { return nil, base.LazyValue{} } - if key, val := i.data.nextPrefix(succKey); key != nil { + if key, val := i.data.NextPrefix(succKey); key != nil { if i.blockUpper != nil { cmp := i.cmp(key.UserKey, i.blockUpper) if (!i.endKeyInclusive && cmp >= 0) || cmp > 0 { @@ -1763,8 +1767,7 @@ func (i *twoLevelIterator) loadIndex(dir int8) loadBlockResult { i.err = err return loadBlockFailed } - if i.err = i.index.initHandle( - i.cmp, indexBlock, i.reader.Properties.GlobalSeqNum); i.err == nil { + if i.err = i.index.initHandle(i.cmp, indexBlock, i.reader.Properties.GlobalSeqNum, false); i.err == nil { return loadBlockOK } return loadBlockFailed @@ -1842,7 +1845,7 @@ func (i *twoLevelIterator) init( v *virtualState, lower, upper []byte, filterer *BlockPropertiesFilterer, - useFilter bool, + useFilter, hideObsoletePoints bool, stats *base.InternalIteratorStats, rp ReaderProvider, ) error { @@ -1867,7 +1870,8 @@ func (i *twoLevelIterator) init( i.reader = r i.cmp = r.Compare i.stats = stats - err = i.topLevelIndex.initHandle(i.cmp, topLevelIndexH, r.Properties.GlobalSeqNum) + i.hideObsoletePoints = hideObsoletePoints + err = i.topLevelIndex.initHandle(i.cmp, topLevelIndexH, r.Properties.GlobalSeqNum, false) if err != nil { // blockIter.Close releases topLevelIndexH and always returns a nil error _ = i.topLevelIndex.Close() @@ -2907,21 +2911,20 @@ func (v *VirtualReader) NewCompactionIter( return v.reader.newCompactionIter(bytesIterated, rp, &v.vState) } -// NewIterWithBlockPropertyFiltersAndContext wraps +// NewIterWithBlockPropertyFiltersAndContextEtc wraps // Reader.NewIterWithBlockPropertyFiltersAndContext. We assume that the passed // in [lower, upper) bounds will have at least some overlap with the virtual // sstable bounds. No overlap is not currently supported in the iterator. -func (v *VirtualReader) NewIterWithBlockPropertyFiltersAndContext( +func (v *VirtualReader) NewIterWithBlockPropertyFiltersAndContextEtc( ctx context.Context, lower, upper []byte, filterer *BlockPropertiesFilterer, - useFilterBlock bool, + hideObsoletePoints, useFilterBlock bool, stats *base.InternalIteratorStats, rp ReaderProvider, ) (Iterator, error) { return v.reader.newIterWithBlockPropertyFiltersAndContext( - ctx, - lower, upper, filterer, useFilterBlock, stats, rp, &v.vState, + ctx, lower, upper, filterer, hideObsoletePoints, useFilterBlock, stats, rp, &v.vState, ) } @@ -3080,31 +3083,51 @@ func (r *Reader) NewIterWithBlockPropertyFilters( ) (Iterator, error) { return r.newIterWithBlockPropertyFiltersAndContext( context.Background(), - lower, upper, filterer, useFilterBlock, stats, rp, nil, + lower, upper, filterer, false, useFilterBlock, stats, rp, nil, ) } -// NewIterWithBlockPropertyFiltersAndContext is similar to +// NewIterWithBlockPropertyFiltersAndContextEtc is similar to // NewIterWithBlockPropertyFilters and additionally accepts a context for // tracing. -func (r *Reader) NewIterWithBlockPropertyFiltersAndContext( +// +// If hideObsoletePoints, the callee assumes that filterer already includes +// obsoleteKeyBlockPropertyFilter. The caller can satisfy this contract by +// first calling TryAddBlockPropertyFilterForHideObsoletePoints. 
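+//
+// A rough usage sketch (error handling elided; mirrors the tests in this
+// change):
+//
+//	hide, filters := r.TryAddBlockPropertyFilterForHideObsoletePoints(
+//		snapshotSeqNum, fileLargestSeqNum, pointKeyFilters)
+//	filterer := newBlockPropertiesFilterer(filters, nil)
+//	_, _ = filterer.intersectsUserPropsAndFinishInit(r.Properties.UserProperties)
+//	iter, _ := r.NewIterWithBlockPropertyFiltersAndContextEtc(
+//		ctx, lower, upper, filterer, hide, true /* useFilterBlock */, &stats, rp)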
+func (r *Reader) NewIterWithBlockPropertyFiltersAndContextEtc( ctx context.Context, lower, upper []byte, filterer *BlockPropertiesFilterer, - useFilterBlock bool, + hideObsoletePoints, useFilterBlock bool, stats *base.InternalIteratorStats, rp ReaderProvider, ) (Iterator, error) { return r.newIterWithBlockPropertyFiltersAndContext( - ctx, - lower, upper, filterer, useFilterBlock, stats, rp, nil, + ctx, lower, upper, filterer, hideObsoletePoints, useFilterBlock, stats, rp, nil, ) } +// TryAddBlockPropertyFilterForHideObsoletePoints is expected to be called +// before the call to NewIterWithBlockPropertyFiltersAndContextEtc, to get the +// value of hideObsoletePoints and potentially add a block property filter. +func (r *Reader) TryAddBlockPropertyFilterForHideObsoletePoints( + snapshotForHideObsoletePoints uint64, + fileLargestSeqNum uint64, + pointKeyFilters []BlockPropertyFilter, +) (hideObsoletePoints bool, filters []BlockPropertyFilter) { + hideObsoletePoints = r.tableFormat >= TableFormatPebblev4 && + snapshotForHideObsoletePoints > fileLargestSeqNum + if hideObsoletePoints { + pointKeyFilters = append(pointKeyFilters, obsoleteKeyBlockPropertyFilter{}) + } + return hideObsoletePoints, pointKeyFilters +} + func (r *Reader) newIterWithBlockPropertyFiltersAndContext( ctx context.Context, lower, upper []byte, filterer *BlockPropertiesFilterer, + hideObsoletePoints bool, useFilterBlock bool, stats *base.InternalIteratorStats, rp ReaderProvider, @@ -3115,7 +3138,7 @@ func (r *Reader) newIterWithBlockPropertyFiltersAndContext( // until the final iterator closes. if r.Properties.IndexType == twoLevelIndex { i := twoLevelIterPool.Get().(*twoLevelIterator) - err := i.init(ctx, r, v, lower, upper, filterer, useFilterBlock, stats, rp) + err := i.init(ctx, r, v, lower, upper, filterer, useFilterBlock, hideObsoletePoints, stats, rp) if err != nil { return nil, err } @@ -3123,7 +3146,7 @@ func (r *Reader) newIterWithBlockPropertyFiltersAndContext( } i := singleLevelIterPool.Get().(*singleLevelIterator) - err := i.init(ctx, r, v, lower, upper, filterer, useFilterBlock, stats, rp) + err := i.init(ctx, r, v, lower, upper, filterer, useFilterBlock, hideObsoletePoints, stats, rp) if err != nil { return nil, err } @@ -3155,7 +3178,8 @@ func (r *Reader) newCompactionIter( err := i.init( context.Background(), r, v, nil /* lower */, nil /* upper */, nil, - false /* useFilter */, nil /* stats */, rp, + false /* useFilter */, false, /* hideObsoletePoints */ + nil /* stats */, rp, ) if err != nil { return nil, err @@ -3169,7 +3193,8 @@ func (r *Reader) newCompactionIter( i := singleLevelIterPool.Get().(*singleLevelIterator) err := i.init( context.Background(), r, v, nil /* lower */, nil, /* upper */ - nil, false /* useFilter */, nil /* stats */, rp, + nil, false /* useFilter */, false, /* hideObsoletePoints */ + nil /* stats */, rp, ) if err != nil { return nil, err @@ -3207,7 +3232,7 @@ func (r *Reader) NewFixedSeqnumRangeDelIter(seqNum uint64) (keyspan.FragmentIter return nil, err } i := &fragmentBlockIter{elideSameSeqnum: true} - if err := i.blockIter.initHandle(r.Compare, h, seqNum); err != nil { + if err := i.blockIter.initHandle(r.Compare, h, seqNum, false); err != nil { return nil, err } return i, nil @@ -3228,7 +3253,7 @@ func (r *Reader) NewRawRangeKeyIter() (keyspan.FragmentIterator, error) { return nil, err } i := rangeKeyFragmentBlockIterPool.Get().(*rangeKeyFragmentBlockIter) - if err := i.blockIter.initHandle(r.Compare, h, r.Properties.GlobalSeqNum); err != nil { + if err := 
i.blockIter.initHandle(r.Compare, h, r.Properties.GlobalSeqNum, false); err != nil { return nil, err } return i, nil @@ -3388,7 +3413,7 @@ func (r *Reader) transformRangeDelV1(b []byte) ([]byte, error) { // tombstones. We need properly fragmented and sorted range tombstones in // order to serve from them directly. iter := &blockIter{} - if err := iter.init(r.Compare, b, r.Properties.GlobalSeqNum); err != nil { + if err := iter.init(r.Compare, b, r.Properties.GlobalSeqNum, false); err != nil { return nil, err } var tombstones []keyspan.Span @@ -3580,7 +3605,8 @@ func (r *Reader) Layout() (*Layout, error) { if err != nil { return nil, err } - if err := iter.init(r.Compare, subIndex.Get(), 0 /* globalSeqNum */); err != nil { + if err := iter.init(r.Compare, subIndex.Get(), 0, /* globalSeqNum */ + false /* hideObsoletePoints */); err != nil { return nil, err } for key, value := iter.First(); key != nil; key, value = iter.Next() { diff --git a/sstable/reader_test.go b/sstable/reader_test.go index 4d432a9533..9509ef42ec 100644 --- a/sstable/reader_test.go +++ b/sstable/reader_test.go @@ -414,15 +414,9 @@ func TestVirtualReader(t *testing.T) { } var stats base.InternalIteratorStats - iter, err := v.NewIterWithBlockPropertyFiltersAndContext( - context.Background(), - lower, - upper, - nil, - false, - &stats, - TrivialReaderProvider{Reader: r}, - ) + iter, err := v.NewIterWithBlockPropertyFiltersAndContextEtc( + context.Background(), lower, upper, nil, false, false, + &stats, TrivialReaderProvider{Reader: r}) if err != nil { return err.Error() } @@ -473,7 +467,7 @@ func TestReader(t *testing.T) { "prefixFilter": "testdata/prefixreader", } - for _, format := range []TableFormat{TableFormatPebblev2, TableFormatPebblev3} { + for _, format := range []TableFormat{TableFormatPebblev2, TableFormatPebblev3, TableFormatPebblev4} { for dName, blockSize := range blockSizes { for iName, indexBlockSize := range blockSizes { for lName, tableOpt := range writerOpts { @@ -497,6 +491,29 @@ func TestReader(t *testing.T) { } } +func TestReaderHideObsolete(t *testing.T) { + blockSizes := map[string]int{ + "1bytes": 1, + "5bytes": 5, + "10bytes": 10, + "25bytes": 25, + "Maxbytes": math.MaxInt32, + } + for dName, blockSize := range blockSizes { + opts := WriterOptions{ + TableFormat: TableFormatPebblev4, + BlockSize: blockSize, + IndexBlockSize: blockSize, + Comparer: testkeys.Comparer, + } + t.Run(fmt.Sprintf("blockSize=%s", dName), func(t *testing.T) { + runTestReader( + t, opts, "testdata/reader_hide_obsolete", + nil /* Reader */, 0, false, true) + }) + } +} + func TestHamletReader(t *testing.T) { prebuiltSSTs := []string{ "testdata/h.ldb", @@ -722,12 +739,25 @@ func runTestReader( } var stats base.InternalIteratorStats r.Properties.GlobalSeqNum = seqNum - var filterer *BlockPropertiesFilterer + var bpfs []BlockPropertyFilter if d.HasArg("block-property-filter") { var filterMin, filterMax uint64 d.ScanArgs(t, "block-property-filter", &filterMin, &filterMax) bpf := NewTestKeysBlockPropertyFilter(filterMin, filterMax) - filterer = newBlockPropertiesFilterer([]BlockPropertyFilter{bpf}, nil) + bpfs = append(bpfs, bpf) + } + hideObsoletePoints := false + if d.HasArg("hide-obsolete-points") { + d.ScanArgs(t, "hide-obsolete-points", &hideObsoletePoints) + if hideObsoletePoints { + hideObsoletePoints, bpfs = r.TryAddBlockPropertyFilterForHideObsoletePoints( + InternalKeySeqNumMax, InternalKeySeqNumMax-1, bpfs) + require.True(t, hideObsoletePoints) + } + } + var filterer *BlockPropertiesFilterer + if len(bpfs) > 0 { + 
filterer = newBlockPropertiesFilterer(bpfs, nil)
 		intersects, err := filterer.intersectsUserPropsAndFinishInit(r.Properties.UserProperties)
 		if err != nil {
@@ -737,10 +767,12 @@ func runTestReader(
 		return "table does not intersect BlockPropertyFilter"
 	}
 	}
-	iter, err := r.NewIterWithBlockPropertyFilters(
+	iter, err := r.NewIterWithBlockPropertyFiltersAndContextEtc(
+		context.Background(),
 		nil, /* lower */
 		nil, /* upper */
 		filterer,
+		hideObsoletePoints,
 		true, /* use filter block */
 		&stats,
 		TrivialReaderProvider{Reader: r},
@@ -1897,6 +1929,114 @@ func BenchmarkIteratorScanNextPrefix(b *testing.B) {
 	}
 }
 
+func BenchmarkIteratorScanObsolete(b *testing.B) {
+	options := WriterOptions{
+		BlockSize:            32 << 10,
+		BlockRestartInterval: 16,
+		FilterPolicy:         nil,
+		Compression:          SnappyCompression,
+		Comparer:             testkeys.Comparer,
+	}
+	const keyCount = 1 << 20
+	const keyLen = 10
+
+	// Take the very large keyspace consisting of alphabetic characters of
+	// lengths up to keyLen and reduce it down to keyCount keys by picking 1
+	// out of every keys.Count()/keyCount keys.
+	keys := testkeys.Alpha(keyLen)
+	keys = keys.EveryN(keys.Count() / keyCount)
+	if keys.Count() < keyCount {
+		b.Fatalf("expected %d keys, found %d", keyCount, keys.Count())
+	}
+	expectedKeyCount := keys.Count()
+	keyBuf := make([]byte, keyLen)
+	setupBench := func(b *testing.B, tableFormat TableFormat, cacheSize int64) *Reader {
+		mem := vfs.NewMem()
+		f0, err := mem.Create("bench")
+		require.NoError(b, err)
+		options.TableFormat = tableFormat
+		w := NewWriter(objstorageprovider.NewFileWritable(f0), options)
+		val := make([]byte, 100)
+		rng := rand.New(rand.NewSource(100))
+		for i := 0; i < keys.Count(); i++ {
+			n := testkeys.WriteKey(keyBuf, keys, i)
+			key := keyBuf[:n]
+			rng.Read(val)
+			forceObsolete := true
+			if i == 0 {
+				forceObsolete = false
+			}
+			require.NoError(b, w.AddWithForceObsolete(
+				base.MakeInternalKey(key, 0, InternalKeyKindSet), val, forceObsolete))
+		}
+		require.NoError(b, w.Close())
+		c := cache.New(cacheSize)
+		defer c.Unref()
+		// Re-open the filename for reading.
+		f0, err = mem.Open("bench")
+		require.NoError(b, err)
+		r, err := newReader(f0, ReaderOptions{
+			Cache:    c,
+			Comparer: testkeys.Comparer,
+		})
+		require.NoError(b, err)
+		return r
+	}
+	for _, format := range []TableFormat{TableFormatPebblev3, TableFormatPebblev4} {
+		b.Run(fmt.Sprintf("format=%s", format.String()), func(b *testing.B) {
+			// 150MiB results in a high cache hit rate for both formats.
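+			// A cache of 1 byte effectively disables caching, so that
+			// configuration also measures the block-load cost of the scan.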
+ for _, cacheSize := range []int64{1, 150 << 20} { + b.Run(fmt.Sprintf("cache-size=%s", humanize.IEC.Int64(cacheSize)), + func(b *testing.B) { + r := setupBench(b, format, cacheSize) + defer func() { + require.NoError(b, r.Close()) + }() + for _, hideObsoletePoints := range []bool{false, true} { + b.Run(fmt.Sprintf("hide-obsolete=%t", hideObsoletePoints), func(b *testing.B) { + var filterer *BlockPropertiesFilterer + if format == TableFormatPebblev4 && hideObsoletePoints { + filterer = newBlockPropertiesFilterer( + []BlockPropertyFilter{obsoleteKeyBlockPropertyFilter{}}, nil) + intersects, err := + filterer.intersectsUserPropsAndFinishInit(r.Properties.UserProperties) + if err != nil { + b.Fatalf("%s", err.Error()) + } + if !intersects { + b.Fatalf("sstable does not intersect") + } + } + iter, err := r.NewIterWithBlockPropertyFiltersAndContextEtc( + context.Background(), nil, nil, filterer, hideObsoletePoints, + true, nil, TrivialReaderProvider{Reader: r}) + require.NoError(b, err) + b.ResetTimer() + for i := 0; i < b.N; i++ { + count := 0 + k, _ := iter.First() + for k != nil { + count++ + k, _ = iter.Next() + } + if format == TableFormatPebblev4 && hideObsoletePoints { + if count != 1 { + b.Fatalf("found %d points", count) + } + } else { + if count != expectedKeyCount { + b.Fatalf("found %d points", count) + } + } + } + }) + } + }) + } + }) + } +} + func newReader(r ReadableFile, o ReaderOptions, extraOpts ...ReaderOption) (*Reader, error) { readable, err := NewSimpleReadable(r) if err != nil { diff --git a/sstable/suffix_rewriter.go b/sstable/suffix_rewriter.go index b1423fddb8..007d7666f6 100644 --- a/sstable/suffix_rewriter.go +++ b/sstable/suffix_rewriter.go @@ -50,6 +50,24 @@ func RewriteKeySuffixes( // same TableFormat as the input, which is returned in case the caller wants // to do some error checking. Suffix rewriting is meant to be efficient, and // allowing changes in the TableFormat detracts from that efficiency. +// +// Any obsolete bits that key-value pairs may be annotated with are ignored +// and lost during the rewrite. Additionally, the output sstable has the +// pebble.obsolete.is_strict property set to false. These limitations could be +// removed if needed. The current use case for +// RewriteKeySuffixesAndReturnFormat in CockroachDB is for MVCC-compliant file +// ingestion, where these files do not contain RANGEDELs and have one +// key-value pair per userkey -- so they trivially satisfy the strict +// criteria, and we don't need the obsolete bit as a performance optimization. +// For disaggregated storage, strict obsolete sstables are needed for L5 and +// L6, but at the time of writing, we expect such MVCC-compliant file +// ingestion to only ingest into levels L4 and higher. If this changes, we can +// do one of two things to get rid of this limitation: +// - Validate that there are no duplicate userkeys and no RANGEDELs in the +// sstable to be rewritten. Validating the former condition is non-trivial +// when rewriting blocks in parallel, so we could encode the pre-existing +// condition in a NumObsoletePoints property. +// - Preserve the obsolete bit (with changes to the blockIter). 
func RewriteKeySuffixesAndReturnFormat( sst []byte, rOpts ReaderOptions, @@ -184,7 +202,7 @@ func rewriteBlocks( if err != nil { return err } - if err := iter.init(r.Compare, inputBlock, r.Properties.GlobalSeqNum); err != nil { + if err := iter.init(r.Compare, inputBlock, r.Properties.GlobalSeqNum, false); err != nil { return err } @@ -434,7 +452,13 @@ func (c copyFilterWriter) policyName() string { return c.origPolicyName } // single loop over the Reader that writes each key to the Writer with the new // suffix. The is significantly slower than the parallelized rewriter, and does // more work to rederive filters, props, etc, however re-doing that work makes -// it less restrictive -- props no longer need to +// it less restrictive -- props no longer need to ?? +// +// Any obsolete bits that key-value pairs may be annotated with are ignored +// and lost during the rewrite. Some of the obsolete bits may be recreated -- +// specifically when there are multiple keys with the same user key. +// Additionally, the output sstable has the pebble.obsolete.is_strict property +// set to false. See the longer comment at RewriteKeySuffixesAndReturnFormat. func RewriteKeySuffixesViaWriter( r *Reader, out objstorage.Writable, o WriterOptions, from, to []byte, ) (*WriterMetadata, error) { @@ -442,6 +466,7 @@ func RewriteKeySuffixesViaWriter( return nil, errors.New("a valid splitter is required to rewrite suffixes") } + o.IsStrictObsolete = false w := NewWriter(out, o) defer func() { if w != nil { @@ -472,7 +497,7 @@ func RewriteKeySuffixesViaWriter( if err != nil { return nil, err } - if w.addPoint(scratch, val); err != nil { + if w.addPoint(scratch, val, false); err != nil { return nil, err } k, v = i.Next() diff --git a/sstable/suffix_rewriter_test.go b/sstable/suffix_rewriter_test.go index 0d9865c775..897e0dd40e 100644 --- a/sstable/suffix_rewriter_test.go +++ b/sstable/suffix_rewriter_test.go @@ -32,6 +32,9 @@ func TestRewriteSuffixProps(t *testing.T) { }, TableFormat: format, } + if format >= TableFormatPebblev4 { + wOpts.IsStrictObsolete = true + } const keyCount = 1e5 const rangeKeyCount = 100 @@ -52,11 +55,12 @@ func TestRewriteSuffixProps(t *testing.T) { expectedProps["bp3"] = string(interval{646, 647}.encode([]byte{0})) // Swap the order of two of the props so they have new shortIDs, and remove - // one. + // one. rwOpts inherits the IsStrictObsolete value from wOpts. rwOpts := wOpts - rwOpts.TableFormat = TableFormatPebblev2 if rand.Intn(2) != 0 { - rwOpts.TableFormat = TableFormatPebblev3 + rwOpts.TableFormat = TableFormatPebblev2 + rwOpts.IsStrictObsolete = false + t.Log("table format set to TableFormatPebblev2") } fmt.Printf("from format %s, to format %s\n", format.String(), rwOpts.TableFormat.String()) rwOpts.BlockPropertyCollectors = rwOpts.BlockPropertyCollectors[:3] @@ -72,6 +76,13 @@ func TestRewriteSuffixProps(t *testing.T) { defer r.Close() var sstBytes [2][]byte + adjustPropsForEffectiveFormat := func(effectiveFormat TableFormat) { + if effectiveFormat == TableFormatPebblev4 { + expectedProps["obsolete-key"] = string([]byte{3}) + } else { + delete(expectedProps, "obsolete-key") + } + } for i, byBlocks := range []bool{false, true} { t.Run(fmt.Sprintf("byBlocks=%v", byBlocks), func(t *testing.T) { rewrittenSST := &memFile{} @@ -82,9 +93,11 @@ func TestRewriteSuffixProps(t *testing.T) { // rwOpts.TableFormat is ignored. 
require.Equal(t, wOpts.TableFormat, rewriteFormat) require.NoError(t, err) + adjustPropsForEffectiveFormat(rewriteFormat) } else { _, err := RewriteKeySuffixesViaWriter(r, rewrittenSST, rwOpts, from, to) require.NoError(t, err) + adjustPropsForEffectiveFormat(rwOpts.TableFormat) } sstBytes[i] = rewrittenSST.Data() @@ -93,6 +106,7 @@ func TestRewriteSuffixProps(t *testing.T) { require.NoError(t, err) defer rRewritten.Close() require.Equal(t, expectedProps, rRewritten.Properties.UserProperties) + require.False(t, rRewritten.Properties.IsStrictObsolete) // Compare the block level props from the data blocks in the layout, // only if we did not do a rewrite from one format to another. If the @@ -119,7 +133,9 @@ func TestRewriteSuffixProps(t *testing.T) { for !newDecoder.done() { id, val, err := newDecoder.next() require.NoError(t, err) - newProps[id] = val + if int(id) < len(newProps) { + newProps[id] = val + } } require.Equal(t, oldProps[0], newProps[1]) ival.decode(newProps[0]) @@ -177,7 +193,9 @@ func make4bSuffixTestSST( binary.BigEndian.PutUint64(key[:8], 123) // 16-byte shared prefix binary.BigEndian.PutUint64(key[8:16], 456) binary.BigEndian.PutUint64(key[16:], uint64(i)) - if err := w.Set(key, key); err != nil { + err := w.AddWithForceObsolete( + base.MakeInternalKey(key, 0, InternalKeyKindSet), key, false) + if err != nil { t.Fatal(err) } } diff --git a/sstable/table.go b/sstable/table.go index 5267123029..040efafd62 100644 --- a/sstable/table.go +++ b/sstable/table.go @@ -160,8 +160,27 @@ The metaindex block also contains block handles as values, with keys being the names of the meta blocks. For a description of value blocks and the meta value index block, see -value_block.go - +value_block.go. + +Data blocks have some additional features: +- For TableFormatPebblev3 onwards: + - For SETs, the value has a 1 byte value prefix, which indicates whether the + value is inline, or in a separate value block, and indicates whether the + prefix of the userkey (as defined by split) has changed or not. See + value_block.go for details. + - The most significant bit of the restart points is used to indicate whether + userkey prefix has changed since the last restart point. See the detailed + comment in blockWriter. + - The maximum length of the "shared prefix" when encoding the key, is the + length of the prefix of the userkey (as defined by split) of the previous + key. + +- For TableFormatPebblev4 onwards: + - The key kinds may be altered to set the + InternalKeyKindSSTableInternalObsoleteBit if the key-value pair is obsolete + in the context of that sstable (for a reader that reads at a higher seqnum + than the highest seqnum in the sstable). For details, see the comment in + format.go. */ const ( diff --git a/sstable/testdata/reader_hide_obsolete/iter b/sstable/testdata/reader_hide_obsolete/iter new file mode 100644 index 0000000000..0582ba569a --- /dev/null +++ b/sstable/testdata/reader_hide_obsolete/iter @@ -0,0 +1,316 @@ +build +a.SET.1:A +b.SINGLEDEL.4: +b.SET.2:B +c.DEL.5: +c.SET.3:C +d.SET.4:D4 +d.SET.2:D2 +---- + +iter +first +next +next +next +next +next +next +next +prev +prev +prev +prev +prev +prev +prev +prev +---- +:A +: +:B +: +:C +:D4 +:D2 +. +:D2 +:D4 +:C +: +:B +: +:A +. + +iter hide-obsolete-points=true +first +next +next +next +next +prev +prev +prev +prev +prev +---- +:A +: +: +:D4 +. +:D4 +: +: +:A +. 
+ +iter hide-obsolete-points=true +seek-ge c +prev +prev +next +next +next +seek-lt c +next +prev +prev +---- +: +: +:A +: +: +:D4 +: +: +: +:A + +build +a.SET.3:A +a.MERGE.2:A2 +b.MERGE.20:B20 +b.MERGE.18:B18 +b.SET.16:B16 +b.SET.14:B14 +c.MERGE.30:C30 +c.MERGE.28:C28 +c.DEL.26: +---- + +iter +first +next +next +next +next +next +next +next +---- +:A +:A2 +:B20 +:B18 +:B16 +:B14 +:C30 +:C28 + +iter hide-obsolete-points=true +first +next +next +next +next +next +next +last +prev +prev +prev +prev +prev +prev +---- +:A +:B20 +:B18 +:B16 +:C30 +:C28 +. +:C28 +:C30 +:B16 +:B18 +:B20 +:A +. + +build +b.MERGE.20:B20 +b.SETWITHDEL.16:B16 +b.SETWITHDEL.14:B14 +c.MERGE.30:C30 +c.DELSIZED.28: +c.DEL.26: +---- + +iter +first +next +next +next +next +next +next +---- +:B20 +:B16 +:B14 +:C30 +: +: +. + +iter hide-obsolete-points=true +first +next +next +next +---- +:B20 +:B16 +:C30 +. + +build +b.SETWITHDEL.20:B20 +b.MERGE.16:B16 +b.MERGE.14:B14 +b.SET.12:B12 +---- + +iter +first +next +next +next +next +---- +:B20 +:B16 +:B14 +:B12 +. + +iter hide-obsolete-points=true +first +next +---- +:B20 +. + +build writing-to-lowest-level +a.SET.10:A10 +b.DEL.20: +b.MERGE.16:B16 +b.MERGE.14:B14 +b.SET.12:B12 +---- + +iter +first +next +next +next +next +next +---- +:A10 +: +:B16 +:B14 +:B12 +. + +iter hide-obsolete-points=true +first +next +---- +:A10 +. + +build writing-to-lowest-level +a.SET.10:A10 +b.DEL.16:B16 +b.SETWITHDEL.14:B14 +c.DELSIZED.30: +c.SET.28:C28 +d.SINGLEDEL.40: +d.MERGE.30:D30 +---- + +iter +first +next +next +next +next +next +next +next +---- +:A10 +:B16 +:B14 +: +:C28 +: +:D30 +. + +iter hide-obsolete-points=true +first +next +---- +:A10 +. + +build writing-to-lowest-level +force-obsolete: a.SET.1:A +force-obsolete: b.SINGLEDEL.4: +force-obsolete: b.SET.2:B +c.DEL.5: +force-obsolete: d.SET.10:D10 +---- + +iter +first +next +next +next +next +next +---- +:A +: +:B +: +:D10 +. + +iter hide-obsolete-points=true +first +---- +table does not intersect BlockPropertyFilter + +build is-strict-obsolete +d.SINGLEDEL.40: +d.MERGE.30:D30 +---- +MERGE not supported in a strict-obsolete sstable diff --git a/sstable/testdata/virtual_reader b/sstable/testdata/virtual_reader index b5b507d245..c0765e31a9 100644 --- a/sstable/testdata/virtual_reader +++ b/sstable/testdata/virtual_reader @@ -159,7 +159,7 @@ virtualize dd.SET.5-ddd.SET.6 ---- bounds: [dd#5,1-ddd#6,1] filenum: 000008 -props: 8,1 +props: 7,1 # Check lower bound enforcement during SeekPrefixGE. 
iter @@ -566,7 +566,7 @@ virtualize f.SET.6-h.SET.9 ---- bounds: [f#6,1-h#9,1] filenum: 000020 -props: 6,0 +props: 5,0 iter seek-lt z diff --git a/sstable/testdata/writer_value_blocks b/sstable/testdata/writer_value_blocks index 869b849afb..8590e42925 100644 --- a/sstable/testdata/writer_value_blocks +++ b/sstable/testdata/writer_value_blocks @@ -132,7 +132,7 @@ layout 72 record (21 = 3 [0] + 14 + 4) [restart] blue@8#16,1:value handle {valueLen:6 blockNum:0 offsetInBlock:5} 93 [restart 72] - 101 [trailer compression=none checksum=0xdc74261] + 101 [trailer compression=none checksum=0x4e65b9b6] 106 data (29) 106 record (21 = 3 [0] + 14 + 4) [restart] blue@6#16,1:value handle {valueLen:15 blockNum:1 offsetInBlock:0} @@ -146,77 +146,78 @@ layout 173 block:38/29 [restart] 192 [restart 173] 200 [trailer compression=none checksum=0x21d27815] - 205 index (27) + 205 index (30) 205 block:72/29 [restart] - 224 [restart 205] - 232 [trailer compression=none checksum=0xbae26eb3] - 237 index (22) - 237 block:106/29 [restart] - 251 [restart 237] - 259 [trailer compression=none checksum=0x802be702] - 264 top-index (77) - 264 block:140/28 [restart] - 285 block:173/27 [restart] - 305 block:205/27 [restart] - 325 block:237/22 [restart] - 340 [restart 264] - 344 [restart 285] - 348 [restart 305] - 352 [restart 325] - 341 [trailer compression=snappy checksum=0x6b2d79b] - 346 value-block (11) - 362 value-block (15) - 382 value-index (8) - 395 properties (785) - 395 pebble.num.value-blocks (27) [restart] - 422 pebble.num.values.in.value-blocks (21) - 443 pebble.value-blocks.size (21) - 464 rocksdb.block.based.table.index.type (43) - 507 rocksdb.block.based.table.prefix.filtering (20) - 527 rocksdb.block.based.table.whole.key.filtering (23) - 550 rocksdb.column.family.id (24) - 574 rocksdb.comparator (35) - 609 rocksdb.compression (16) - 625 rocksdb.compression_options (106) - 731 rocksdb.creation.time (16) - 747 rocksdb.data.size (14) - 761 rocksdb.deleted.keys (15) - 776 rocksdb.external_sst_file.global_seqno (41) - 817 rocksdb.external_sst_file.version (14) - 831 rocksdb.filter.size (15) - 846 rocksdb.fixed.key.length (18) - 864 rocksdb.format.version (17) - 881 rocksdb.index.key.is.user.key (25) - 906 rocksdb.index.partitions (14) - 920 rocksdb.index.size (9) - 929 rocksdb.index.value.is.delta.encoded (26) - 955 rocksdb.merge.operands (18) - 973 rocksdb.merge.operator (24) - 997 rocksdb.num.data.blocks (19) - 1016 rocksdb.num.entries (11) - 1027 rocksdb.num.range-deletions (19) - 1046 rocksdb.oldest.key.time (19) - 1065 rocksdb.prefix.extractor.name (31) - 1096 rocksdb.property.collectors (22) - 1118 rocksdb.raw.key.size (16) - 1134 rocksdb.raw.value.size (14) - 1148 rocksdb.top-level.index.size (24) - 1172 [restart 395] - 1180 [trailer compression=none checksum=0x4352b7fd] - 1185 meta-index (64) - 1185 pebble.value_index block:382/8 value-blocks-index-lengths: 1(num), 2(offset), 1(length) [restart] - 1212 rocksdb.properties block:395/785 [restart] - 1237 [restart 1185] - 1241 [restart 1212] - 1249 [trailer compression=none checksum=0xe7aed935] - 1254 footer (53) - 1254 checksum type: crc32c - 1255 meta: offset=1185, length=64 - 1258 index: offset=264, length=77 - 1261 [padding] - 1295 version: 4 - 1299 magic number: 0xf09faab3f09faab3 - 1307 EOF + 227 [restart 205] + 235 [trailer compression=none checksum=0xba0b26fe] + 240 index (22) + 240 block:106/29 [restart] + 254 [restart 240] + 262 [trailer compression=none checksum=0x802be702] + 267 top-index (85) + 267 block:140/28 [restart] + 288 block:173/27 
[restart] + 308 block:205/30 [restart] + 331 block:240/22 [restart] + 346 [restart 267] + 350 [restart 288] + 354 [restart 308] + 358 [restart 331] + 352 [trailer compression=snappy checksum=0x8bd0d63a] + 357 value-block (11) + 373 value-block (15) + 393 value-index (8) + 406 properties (813) + 406 obsolete-key (16) [restart] + 422 pebble.num.value-blocks (27) + 449 pebble.num.values.in.value-blocks (21) + 470 pebble.value-blocks.size (21) + 491 rocksdb.block.based.table.index.type (43) + 534 rocksdb.block.based.table.prefix.filtering (20) + 554 rocksdb.block.based.table.whole.key.filtering (23) + 577 rocksdb.column.family.id (24) + 601 rocksdb.comparator (35) + 636 rocksdb.compression (16) + 652 rocksdb.compression_options (106) + 758 rocksdb.creation.time (16) + 774 rocksdb.data.size (14) + 788 rocksdb.deleted.keys (15) + 803 rocksdb.external_sst_file.global_seqno (41) + 844 rocksdb.external_sst_file.version (14) + 858 rocksdb.filter.size (15) + 873 rocksdb.fixed.key.length (18) + 891 rocksdb.format.version (17) + 908 rocksdb.index.key.is.user.key (25) + 933 rocksdb.index.partitions (14) + 947 rocksdb.index.size (9) + 956 rocksdb.index.value.is.delta.encoded (26) + 982 rocksdb.merge.operands (18) + 1000 rocksdb.merge.operator (24) + 1024 rocksdb.num.data.blocks (19) + 1043 rocksdb.num.entries (11) + 1054 rocksdb.num.range-deletions (19) + 1073 rocksdb.oldest.key.time (19) + 1092 rocksdb.prefix.extractor.name (31) + 1123 rocksdb.property.collectors (34) + 1157 rocksdb.raw.key.size (16) + 1173 rocksdb.raw.value.size (14) + 1187 rocksdb.top-level.index.size (24) + 1211 [restart 406] + 1219 [trailer compression=none checksum=0xa07c3fa5] + 1224 meta-index (64) + 1224 pebble.value_index block:393/8 value-blocks-index-lengths: 1(num), 2(offset), 1(length) [restart] + 1251 rocksdb.properties block:406/813 [restart] + 1276 [restart 1224] + 1280 [restart 1251] + 1288 [trailer compression=none checksum=0xfe295720] + 1293 footer (53) + 1293 checksum type: crc32c + 1294 meta: offset=1224, length=64 + 1297 index: offset=267, length=85 + 1300 [padding] + 1334 version: 4 + 1338 magic number: 0xf09faab3f09faab3 + 1346 EOF # Require that [c,e) must be in-place. 
build in-place-bound=(c,e) @@ -296,47 +297,48 @@ layout 71 block:0/66 [restart] 85 [restart 71] 93 [trailer compression=none checksum=0xf80f5bcf] - 98 properties (715) - 98 pebble.raw.point-tombstone.key.size (39) [restart] - 137 rocksdb.block.based.table.index.type (43) - 180 rocksdb.block.based.table.prefix.filtering (20) - 200 rocksdb.block.based.table.whole.key.filtering (23) - 223 rocksdb.column.family.id (24) - 247 rocksdb.comparator (35) - 282 rocksdb.compression (16) - 298 rocksdb.compression_options (106) - 404 rocksdb.creation.time (16) - 420 rocksdb.data.size (13) - 433 rocksdb.deleted.keys (15) - 448 rocksdb.external_sst_file.global_seqno (41) - 489 rocksdb.external_sst_file.version (14) - 503 rocksdb.filter.size (15) - 518 rocksdb.fixed.key.length (18) - 536 rocksdb.format.version (17) - 553 rocksdb.index.key.is.user.key (25) - 578 rocksdb.index.size (8) - 586 rocksdb.index.value.is.delta.encoded (26) - 612 rocksdb.merge.operands (18) - 630 rocksdb.merge.operator (24) - 654 rocksdb.num.data.blocks (19) - 673 rocksdb.num.entries (11) - 684 rocksdb.num.range-deletions (19) - 703 rocksdb.oldest.key.time (19) - 722 rocksdb.prefix.extractor.name (31) - 753 rocksdb.property.collectors (22) - 775 rocksdb.raw.key.size (16) - 791 rocksdb.raw.value.size (14) - 805 [restart 98] - 813 [trailer compression=none checksum=0xfb9d6722] - 818 meta-index (32) - 818 rocksdb.properties block:98/715 [restart] - 842 [restart 818] - 850 [trailer compression=none checksum=0x8ca405dc] - 855 footer (53) - 855 checksum type: crc32c - 856 meta: offset=818, length=32 - 859 index: offset=71, length=22 - 861 [padding] - 896 version: 4 - 900 magic number: 0xf09faab3f09faab3 - 908 EOF + 98 properties (743) + 98 obsolete-key (16) [restart] + 114 pebble.raw.point-tombstone.key.size (39) + 153 rocksdb.block.based.table.index.type (43) + 196 rocksdb.block.based.table.prefix.filtering (20) + 216 rocksdb.block.based.table.whole.key.filtering (23) + 239 rocksdb.column.family.id (24) + 263 rocksdb.comparator (35) + 298 rocksdb.compression (16) + 314 rocksdb.compression_options (106) + 420 rocksdb.creation.time (16) + 436 rocksdb.data.size (13) + 449 rocksdb.deleted.keys (15) + 464 rocksdb.external_sst_file.global_seqno (41) + 505 rocksdb.external_sst_file.version (14) + 519 rocksdb.filter.size (15) + 534 rocksdb.fixed.key.length (18) + 552 rocksdb.format.version (17) + 569 rocksdb.index.key.is.user.key (25) + 594 rocksdb.index.size (8) + 602 rocksdb.index.value.is.delta.encoded (26) + 628 rocksdb.merge.operands (18) + 646 rocksdb.merge.operator (24) + 670 rocksdb.num.data.blocks (19) + 689 rocksdb.num.entries (11) + 700 rocksdb.num.range-deletions (19) + 719 rocksdb.oldest.key.time (19) + 738 rocksdb.prefix.extractor.name (31) + 769 rocksdb.property.collectors (34) + 803 rocksdb.raw.key.size (16) + 819 rocksdb.raw.value.size (14) + 833 [restart 98] + 841 [trailer compression=none checksum=0x71c150c9] + 846 meta-index (32) + 846 rocksdb.properties block:98/743 [restart] + 870 [restart 846] + 878 [trailer compression=none checksum=0x545483dc] + 883 footer (53) + 883 checksum type: crc32c + 884 meta: offset=846, length=32 + 887 index: offset=71, length=22 + 889 [padding] + 924 version: 4 + 928 magic number: 0xf09faab3f09faab3 + 936 EOF diff --git a/sstable/writer.go b/sstable/writer.go index 83ebad4ca0..0d495ce87d 100644 --- a/sstable/writer.go +++ b/sstable/writer.go @@ -132,6 +132,8 @@ type Writer struct { separator Separator successor Successor tableFormat TableFormat + isStrictObsolete bool + writingToLowestLevel 
bool
 	cache                *cache.Cache
 	restartInterval      int
 	checksumType         ChecksumType
@@ -166,6 +168,7 @@ type Writer struct {
 	props               Properties
 	propCollectors      []TablePropertyCollector
 	blockPropCollectors []BlockPropertyCollector
+	obsoleteCollector   obsoleteKeyBlockPropertyCollector
 	blockPropsEncoder   blockPropertiesEncoder
 	// filter accumulates the filter block. If populated, the filter ingests
 	// either the output of w.split (i.e. a prefix extractor) if w.split is not
@@ -215,6 +218,8 @@ type pointKeyInfo struct {
 	// prefixLen uses w.split, if not nil. Only computed when w.valueBlockWriter
 	// is not nil.
 	prefixLen int
+	// True iff the point was marked obsolete.
+	isObsolete bool
 }
 
 type coordinationState struct {
@@ -673,7 +678,12 @@ func (w *Writer) Set(key, value []byte) error {
 	if w.err != nil {
 		return w.err
 	}
-	return w.addPoint(base.MakeInternalKey(key, 0, InternalKeyKindSet), value)
+	if w.isStrictObsolete {
+		return errors.Errorf("use AddWithForceObsolete")
+	}
+	// forceObsolete is false based on the assumption that no RANGEDELs in the
+	// sstable delete the added points.
+	return w.addPoint(base.MakeInternalKey(key, 0, InternalKeyKindSet), value, false)
 }
 
 // Delete deletes the value for the given key. The sequence number is set to
@@ -685,7 +695,12 @@ func (w *Writer) Delete(key []byte) error {
 	if w.err != nil {
 		return w.err
 	}
-	return w.addPoint(base.MakeInternalKey(key, 0, InternalKeyKindDelete), nil)
+	if w.isStrictObsolete {
+		return errors.Errorf("use AddWithForceObsolete")
+	}
+	// forceObsolete is false based on the assumption that no RANGEDELs in the
+	// sstable delete the added points.
+	return w.addPoint(base.MakeInternalKey(key, 0, InternalKeyKindDelete), nil, false)
 }
 
 // DeleteRange deletes all of the keys (and values) in the range [start,end)
@@ -711,7 +726,13 @@ func (w *Writer) Merge(key, value []byte) error {
 	if w.err != nil {
 		return w.err
 	}
-	return w.addPoint(base.MakeInternalKey(key, 0, InternalKeyKindMerge), value)
+	if w.isStrictObsolete {
+		return errors.Errorf("use AddWithForceObsolete")
+	}
+	// forceObsolete is false based on the assumption that no RANGEDELs in the
+	// sstable delete the added points. If the user configured this writer to
+	// be strict-obsolete, addPoint will reject the addition of this MERGE.
+	return w.addPoint(base.MakeInternalKey(key, 0, InternalKeyKindMerge), value, false)
 }
 
 // Add adds a key/value pair to the table being written. For a given Writer,
@@ -721,6 +742,24 @@ func (w *Writer) Merge(key, value []byte) error {
 // point entries. Additionally, range deletion tombstones must be fragmented
 // (i.e. by keyspan.Fragmenter).
 func (w *Writer) Add(key InternalKey, value []byte) error {
+	if w.isStrictObsolete {
+		return errors.Errorf("use AddWithForceObsolete")
+	}
+	return w.AddWithForceObsolete(key, value, false)
+}
+
+// AddWithForceObsolete must be used when writing a strict-obsolete sstable.
+//
+// forceObsolete indicates whether the caller has determined that this key is
+// obsolete even though it may be the latest point key for this userkey. This
+// should be set to true for keys obsoleted by RANGEDELs, and is required for
+// strict-obsolete sstables.
+//
+// Note that there are two properties, S1 and S2 (see comment in format.go)
+// that strict-obsolete ssts must satisfy. S2, due to RANGEDELs, is solely the
+// responsibility of the caller. S1 is solely the responsibility of the
+// callee.
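+//
+// An illustrative sketch (names invented for the example; the datadriven
+// tests in this change exercise the same flow): writing a strict-obsolete
+// sstable where the caller knows b@5 is deleted by a RANGEDEL in the same
+// sstable, so S2 obliges it to pass forceObsolete=true for that point:
+//
+//	w := NewWriter(writable, WriterOptions{
+//		TableFormat:      TableFormatPebblev4,
+//		IsStrictObsolete: true,
+//	})
+//	_ = w.AddWithForceObsolete(
+//		base.MakeInternalKey([]byte("a"), 10, InternalKeyKindSet), aVal, false)
+//	_ = w.AddWithForceObsolete(
+//		base.MakeInternalKey([]byte("b"), 5, InternalKeyKindSet), bVal, true)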
+func (w *Writer) AddWithForceObsolete(key InternalKey, value []byte, forceObsolete bool) error { if w.err != nil { return w.err } @@ -735,7 +774,7 @@ func (w *Writer) Add(key InternalKey, value []byte) error { "pebble: range keys must be added via one of the RangeKey* functions") return w.err } - return w.addPoint(key, value) + return w.addPoint(key, value, forceObsolete) } func (w *Writer) makeAddPointDecisionV2(key InternalKey) error { @@ -745,12 +784,7 @@ func (w *Writer) makeAddPointDecisionV2(key InternalKey) error { return nil } if !w.disableKeyOrderChecks { - prevPointUserKey := w.dataBlockBuf.dataBlock.curKey - n := len(prevPointUserKey) - base.InternalTrailerLen - if n < 0 { - panic("corrupt key in blockWriter buffer") - } - prevPointUserKey = prevPointUserKey[:n:n] + prevPointUserKey := w.dataBlockBuf.dataBlock.getCurUserKey() cmpUser := w.compare(prevPointUserKey, key.UserKey) if cmpUser > 0 || (cmpUser == 0 && prevTrailer <= key.Trailer) { return errors.Errorf( @@ -762,9 +796,17 @@ func (w *Writer) makeAddPointDecisionV2(key InternalKey) error { return nil } +// REQUIRES: at least one point has been written to the Writer. +func (w *Writer) getLastPointUserKey() []byte { + if w.dataBlockBuf.dataBlock.nEntries == 0 { + panic(errors.AssertionFailedf("no point keys added to writer")) + } + return w.dataBlockBuf.dataBlock.getCurUserKey() +} + func (w *Writer) makeAddPointDecisionV3( key InternalKey, valueLen int, -) (setHasSamePrefix bool, writeToValueBlock bool, err error) { +) (setHasSamePrefix bool, writeToValueBlock bool, isObsolete bool, err error) { prevPointKeyInfo := w.lastPointKeyInfo w.lastPointKeyInfo.userKeyLen = len(key.UserKey) w.lastPointKeyInfo.prefixLen = w.lastPointKeyInfo.userKeyLen @@ -772,18 +814,16 @@ func (w *Writer) makeAddPointDecisionV3( w.lastPointKeyInfo.prefixLen = w.split(key.UserKey) } w.lastPointKeyInfo.trailer = key.Trailer - if w.dataBlockBuf.dataBlock.nEntries == 0 { - return false, false, nil - } - prevPointUserKey := w.dataBlockBuf.dataBlock.curKey - n := len(prevPointUserKey) - base.InternalTrailerLen - if n < 0 { - panic("corrupt key in blockWriter buffer") + w.lastPointKeyInfo.isObsolete = false + if !w.meta.HasPointKeys { + return false, false, false, nil } - prevPointUserKey = prevPointUserKey[:n:n] + keyKind := base.TrailerKind(key.Trailer) + prevPointUserKey := w.getLastPointUserKey() prevPointKey := InternalKey{UserKey: prevPointUserKey, Trailer: prevPointKeyInfo.trailer} - considerWriteToValueBlock := base.TrailerKind(prevPointKeyInfo.trailer) == InternalKeyKindSet && - base.TrailerKind(key.Trailer) == InternalKeyKindSet + prevKeyKind := base.TrailerKind(prevPointKeyInfo.trailer) + considerWriteToValueBlock := prevKeyKind == InternalKeyKindSet && + keyKind == InternalKeyKindSet if considerWriteToValueBlock && !w.requiredInPlaceValueBound.IsEmpty() { keyPrefix := key.UserKey[:w.lastPointKeyInfo.prefixLen] cmpUpper := w.compare( @@ -812,14 +852,59 @@ func (w *Writer) makeAddPointDecisionV3( } else { cmpUser = w.compare(prevPointUserKey, key.UserKey) } + keyKindIsSetOrMerge := keyKind == InternalKeyKindSet || keyKind == InternalKeyKindSetWithDelete || + keyKind == InternalKeyKindMerge + keyKindIsPointDelete := keyKind == InternalKeyKindDelete || + keyKind == InternalKeyKindSingleDelete || keyKind == InternalKeyKindDeleteSized + if !keyKindIsSetOrMerge && !keyKindIsPointDelete { + panic(errors.AssertionFailedf("unexpected key kind %s", keyKind.String())) + } + // If same user key, then the current key is obsolete if any of the + // 
following is true: + // C1 The prev key was obsolete. + // C2 The prev key was not a MERGE. + // C3 The current key is not a SET, SETWITHDELETE, or MERGE, i.e., it is + // some kind of point delete. This rule excludes SET, SETWITHDELETE, and + // MERGE since, when the previous key is a (non-obsolete) MERGE, their + // values will be merged into that previous key. + // + // Regardless of whether it is the same user key or not: + // C4 The current key is some kind of point delete, and we are writing to + // the lowest level, in which case it is also obsolete. + // + // C1 ensures that for a user key there is at most one transition from + // !obsolete to obsolete. Consider a user key k, for which the first n keys + // are not obsolete. We consider the various values of n: + // + // n = 0: This happens due to forceObsolete being set by the caller, or due + // to C4. forceObsolete must only be set due to a RANGEDEL, and that RANGEDEL + // must also delete all the lower seqnums for the same user key. C4 triggers + // due to a point delete, which deletes all the lower seqnums for the same + // user key. + // + // n = 1: This is the common case. It happens when the first key is not a + // MERGE, or the second key is some kind of point delete. + // + // n > 1: This is due to a sequence of MERGE keys, potentially followed by a + // single SET or SETWITHDELETE. + isObsolete = (cmpUser == 0 && + (prevKeyKind != InternalKeyKindMerge || prevPointKeyInfo.isObsolete || !keyKindIsSetOrMerge)) || + (w.writingToLowestLevel && + (keyKind == InternalKeyKindDelete || keyKind == InternalKeyKindSingleDelete || + keyKind == InternalKeyKindDeleteSized)) + // TODO(sumeer): storing isObsolete SET and SETWITHDEL in value blocks is + // possible, but requires some care in documenting and checking invariants. + // There is code that assumes nothing is in value blocks because of the + // single MVCC version (those cases should be ok). We have to ensure + // setHasSamePrefix is correctly initialized here etc. + if !w.disableKeyOrderChecks && (cmpUser > 0 || (cmpUser == 0 && prevPointKeyInfo.trailer <= key.Trailer)) { - return false, false, errors.Errorf( + return false, false, false, errors.Errorf( "pebble: keys must be added in strictly increasing order: %s, %s", prevPointKey.Pretty(w.formatKey), key.Pretty(w.formatKey)) } if !considerWriteToValueBlock { - return false, false, nil + return false, false, isObsolete, nil } // NB: it is possible that cmpUser == 0, i.e., these two SETs have identical // user keys (because of an open snapshot). This should be the rare case. @@ -832,19 +917,24 @@ if considerWriteToValueBlock && valueLen <= tinyValueThreshold { considerWriteToValueBlock = false } - return setHasSamePrefix, considerWriteToValueBlock, nil + return setHasSamePrefix, considerWriteToValueBlock, isObsolete, nil } -func (w *Writer) addPoint(key InternalKey, value []byte) error { +func (w *Writer) addPoint(key InternalKey, value []byte, forceObsolete bool) error { + if w.isStrictObsolete && key.Kind() == InternalKeyKindMerge { + return errors.Errorf("MERGE not supported in a strict-obsolete sstable") + } var err error var setHasSameKeyPrefix, writeToValueBlock, addPrefixToValueStoredWithKey bool + var isObsolete bool maxSharedKeyLen := len(key.UserKey) if w.valueBlockWriter != nil { // maxSharedKeyLen is limited to the prefix of the preceding key. If the // preceding key was in a different block, then the blockWriter will // ignore this maxSharedKeyLen.
maxSharedKeyLen = w.lastPointKeyInfo.prefixLen - setHasSameKeyPrefix, writeToValueBlock, err = w.makeAddPointDecisionV3(key, len(value)) + setHasSameKeyPrefix, writeToValueBlock, isObsolete, err = + w.makeAddPointDecisionV3(key, len(value)) addPrefixToValueStoredWithKey = base.TrailerKind(key.Trailer) == InternalKeyKindSet } else { err = w.makeAddPointDecisionV2(key) @@ -852,6 +942,8 @@ func (w *Writer) addPoint(key InternalKey, value []byte) error { if err != nil { return err } + isObsolete = w.tableFormat >= TableFormatPebblev4 && (isObsolete || forceObsolete) + w.lastPointKeyInfo.isObsolete = isObsolete var valueStoredWithKey []byte var prefix valuePrefix var valueStoredWithKeyLen int @@ -907,16 +999,19 @@ func (w *Writer) addPoint(key InternalKey, value []byte) error { return err } } + if w.tableFormat >= TableFormatPebblev4 { + w.obsoleteCollector.AddPoint(isObsolete) + } w.maybeAddToFilter(key.UserKey) w.dataBlockBuf.dataBlock.addWithOptionalValuePrefix( - key, valueStoredWithKey, maxSharedKeyLen, addPrefixToValueStoredWithKey, prefix, + key, isObsolete, valueStoredWithKey, maxSharedKeyLen, addPrefixToValueStoredWithKey, prefix, setHasSameKeyPrefix) w.meta.updateSeqNum(key.SeqNum()) if !w.meta.HasPointKeys { - k := base.DecodeInternalKey(w.dataBlockBuf.dataBlock.curKey) + k := w.dataBlockBuf.dataBlock.getCurKey() // NB: We need to ensure that SmallestPoint.UserKey is set, so we create // an InternalKey which is semantically identical to the key, but won't // have a nil UserKey. We do this, because key.UserKey could be nil, and @@ -968,7 +1063,7 @@ func (w *Writer) addTombstone(key InternalKey, value []byte) error { if !w.disableKeyOrderChecks && !w.rangeDelV1Format && w.rangeDelBlock.nEntries > 0 { // Check that tombstones are being added in fragmented order. If the two // tombstones overlap, their start and end keys must be identical. - prevKey := base.DecodeInternalKey(w.rangeDelBlock.curKey) + prevKey := w.rangeDelBlock.getCurKey() switch c := w.compare(prevKey.UserKey, key.UserKey); { case c > 0: w.err = errors.Errorf("pebble: keys must be added in order: %s, %s", @@ -1164,7 +1259,7 @@ func (w *Writer) encodeRangeKeySpan(span keyspan.Span) { func (w *Writer) addRangeKey(key InternalKey, value []byte) error { if !w.disableKeyOrderChecks && w.rangeKeyBlock.nEntries > 0 { - prevStartKey := base.DecodeInternalKey(w.rangeKeyBlock.curKey) + prevStartKey := w.rangeKeyBlock.getCurKey() prevEndKey, _, ok := rangekey.DecodeEndKey(prevStartKey.Kind(), w.rangeKeyBlock.curValue) if !ok { // We panic here as we should have previously decoded and validated this @@ -1310,7 +1405,7 @@ func (w *Writer) flush(key InternalKey) error { // dataBlockBuf is added back to a sync.Pool. In this particular case, the // byte slice which supports "sep" will eventually be copied when "sep" is // added to the index block. - prevKey := base.DecodeInternalKey(w.dataBlockBuf.dataBlock.curKey) + prevKey := w.dataBlockBuf.dataBlock.getCurKey() sep := w.indexEntrySep(prevKey, key, w.dataBlockBuf) // We determine that we should flush an index block from the Writer client // goroutine, but we actually finish the index block from the writeQueue. 
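For reference, the C1-C4 rules above reduce to a small predicate over the previous (newer) key and the current key. Below is a minimal standalone sketch of that predicate, not the patch's code: the real logic lives inline in makeAddPointDecisionV3, and addPoint additionally folds in forceObsolete and the TableFormatPebblev4 gate. All names in the sketch are illustrative.

package main

import "fmt"

// kind is a stand-in for pebble's InternalKeyKind; only the kinds that the
// obsolete-bit decision distinguishes are modeled.
type kind int

const (
	kindSet kind = iota
	kindSetWithDelete
	kindMerge
	kindDelete
	kindSingleDelete
	kindDeleteSized
)

func isSetOrMerge(k kind) bool {
	return k == kindSet || k == kindSetWithDelete || k == kindMerge
}

func isPointDelete(k kind) bool {
	return k == kindDelete || k == kindSingleDelete || k == kindDeleteSized
}

// obsolete mirrors C1-C4: with the same user key as the previous (newer)
// key, the current key is obsolete unless the previous key is a non-obsolete
// MERGE and the current key's value will be folded into it (C1-C3).
// Independently, a point delete written to the lowest level is obsolete,
// since nothing below the lowest level remains for it to delete (C4).
func obsolete(sameUserKey bool, prevKind kind, prevObsolete bool, curKind kind, lowestLevel bool) bool {
	c123 := sameUserKey && (prevKind != kindMerge || prevObsolete || !isSetOrMerge(curKind))
	c4 := lowestLevel && isPointDelete(curKind)
	return c123 || c4
}

func main() {
	fmt.Println(obsolete(true, kindSet, false, kindSet, false))    // true: shadowed by a newer SET (C2)
	fmt.Println(obsolete(true, kindMerge, false, kindSet, false))  // false: value merges into the newer MERGE
	fmt.Println(obsolete(true, kindMerge, true, kindMerge, false)) // true: the newer MERGE is itself obsolete (C1)
	fmt.Println(obsolete(false, kindSet, false, kindDelete, true)) // true: point delete at the lowest level (C4)
}

A strict-obsolete caller pairs this with AddWithForceObsolete(key, value, true) for any key it knows is deleted by a RANGEDEL in the same sstable: that is the S2 half of the contract only the caller can supply, while the predicate above is the S1 half the Writer computes itself.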
@@ -1600,7 +1695,7 @@ func (w *Writer) finishIndexBlock(indexBuf *indexBlockBuf, props []byte) error { nEntries: indexBuf.block.nEntries, properties: props, } w.indexSepAlloc, part.sep = cloneKeyWithBuf( - base.DecodeInternalKey(indexBuf.block.curKey), w.indexSepAlloc, + indexBuf.block.getCurKey(), w.indexSepAlloc, ) bk := indexBuf.finish() if len(w.indexBlockAlloc) < len(bk) { @@ -1804,7 +1899,7 @@ func (w *Writer) Close() (err error) { // however, if a dataBlock is flushed, then we add a key to the new w.dataBlockBuf in the // addPoint function after the flush occurs. if w.dataBlockBuf.dataBlock.nEntries >= 1 { - w.meta.SetLargestPointKey(base.DecodeInternalKey(w.dataBlockBuf.dataBlock.curKey).Clone()) + w.meta.SetLargestPointKey(w.dataBlockBuf.dataBlock.getCurKey().Clone()) } // Finish the last data block, or force an empty data block if there @@ -1818,7 +1913,7 @@ func (w *Writer) Close() (err error) { if err != nil { return err } - prevKey := base.DecodeInternalKey(w.dataBlockBuf.dataBlock.curKey) + prevKey := w.dataBlockBuf.dataBlock.getCurKey() if err := w.addIndexEntrySync(prevKey, InternalKey{}, bhp, w.dataBlockBuf.tmp[:]); err != nil { return err } @@ -1897,7 +1992,7 @@ func (w *Writer) Close() (err error) { var rangeKeyBH BlockHandle if w.props.NumRangeKeys() > 0 { - key := base.DecodeInternalKey(w.rangeKeyBlock.curKey) + key := w.rangeKeyBlock.getCurKey() kind := key.Kind() endKey, _, ok := rangekey.DecodeEndKey(kind, w.rangeKeyBlock.curValue) if !ok { @@ -2086,7 +2181,7 @@ func (o PreviousPointKeyOpt) UnsafeKey() base.InternalKey { if o.w.dataBlockBuf.dataBlock.nEntries >= 1 { // o.w.dataBlockBuf.dataBlock.curKey is guaranteed to point to the last point key // which was added to the Writer. - return base.DecodeInternalKey(o.w.dataBlockBuf.dataBlock.curKey) + return o.w.dataBlockBuf.dataBlock.getCurKey() } return base.InternalKey{} } @@ -2115,6 +2210,8 @@ func NewWriter(writable objstorage.Writable, o WriterOptions, extraOpts ...Write separator: o.Comparer.Separator, successor: o.Comparer.Successor, tableFormat: o.TableFormat, + isStrictObsolete: o.IsStrictObsolete, + writingToLowestLevel: o.WritingToLowestLevel, cache: o.Cache, restartInterval: o.BlockRestartInterval, checksumType: o.Checksum, @@ -2188,7 +2285,8 @@ func NewWriter(writable objstorage.Writable, o WriterOptions, extraOpts ...Write w.props.PropertyCollectorNames = "[]" w.props.ExternalFormatVersion = rocksDBExternalFormatVersion - if len(o.TablePropertyCollectors) > 0 || len(o.BlockPropertyCollectors) > 0 { + if len(o.TablePropertyCollectors) > 0 || len(o.BlockPropertyCollectors) > 0 || + w.tableFormat >= TableFormatPebblev4 { var buf bytes.Buffer buf.WriteString("[") if len(o.TablePropertyCollectors) > 0 { @@ -2201,16 +2299,22 @@ func NewWriter(writable objstorage.Writable, o WriterOptions, extraOpts ...Write buf.WriteString(w.propCollectors[i].Name()) } } + numBlockPropertyCollectors := len(o.BlockPropertyCollectors) + if w.tableFormat >= TableFormatPebblev4 { + numBlockPropertyCollectors++ + } + // shortID is a uint8, so we cannot exceed that number of block + // property collectors. + if numBlockPropertyCollectors > math.MaxUint8 { + w.err = errors.New("pebble: too many block property collectors") + return w + } + if numBlockPropertyCollectors > 0 { + w.blockPropCollectors = make([]BlockPropertyCollector, numBlockPropertyCollectors) + } if len(o.BlockPropertyCollectors) > 0 { - // shortID is a uint8, so we cannot exceed that number of block - // property collectors. 
- if len(o.BlockPropertyCollectors) > math.MaxUint8 { - w.err = errors.New("pebble: too many block property collectors") - return w - } // The shortID assigned to a collector is the same as its index in // this slice. - w.blockPropCollectors = make([]BlockPropertyCollector, len(o.BlockPropertyCollectors)) for i := range o.BlockPropertyCollectors { w.blockPropCollectors[i] = o.BlockPropertyCollectors[i]() if i > 0 || len(o.TablePropertyCollectors) > 0 { @@ -2219,6 +2323,13 @@ func NewWriter(writable objstorage.Writable, o WriterOptions, extraOpts ...Write buf.WriteString(w.blockPropCollectors[i].Name()) } } + if w.tableFormat >= TableFormatPebblev4 { + if numBlockPropertyCollectors > 1 || len(o.TablePropertyCollectors) > 0 { + buf.WriteString(",") + } + w.blockPropCollectors[numBlockPropertyCollectors-1] = &w.obsoleteCollector + buf.WriteString(w.obsoleteCollector.Name()) + } buf.WriteString("]") w.props.PropertyCollectorNames = buf.String() } @@ -2252,3 +2363,85 @@ func init() { } private.SSTableInternalProperties = internalGetProperties } + +// obsoleteKeyBlockPropertyCollector records, for the current data block, the +// current index block, and the table as a whole, whether at least one +// non-obsolete point key has been added. +type obsoleteKeyBlockPropertyCollector struct { + blockIsNonObsolete bool + indexIsNonObsolete bool + tableIsNonObsolete bool +} + +// encodeNonObsolete encodes the common case, at least one non-obsolete key, +// as zero bytes; the all-obsolete case is encoded as the single byte 't'. +func encodeNonObsolete(isNonObsolete bool, buf []byte) []byte { + if isNonObsolete { + return buf + } + return append(buf, 't') +} + +func (o *obsoleteKeyBlockPropertyCollector) Name() string { + return "obsolete-key" +} + +func (o *obsoleteKeyBlockPropertyCollector) Add(key InternalKey, value []byte) error { + // Ignore. + return nil +} + +func (o *obsoleteKeyBlockPropertyCollector) AddPoint(isObsolete bool) { + o.blockIsNonObsolete = o.blockIsNonObsolete || !isObsolete +} + +func (o *obsoleteKeyBlockPropertyCollector) FinishDataBlock(buf []byte) ([]byte, error) { + o.tableIsNonObsolete = o.tableIsNonObsolete || o.blockIsNonObsolete + return encodeNonObsolete(o.blockIsNonObsolete, buf), nil +} + +func (o *obsoleteKeyBlockPropertyCollector) AddPrevDataBlockToIndexBlock() { + o.indexIsNonObsolete = o.indexIsNonObsolete || o.blockIsNonObsolete + o.blockIsNonObsolete = false +} + +func (o *obsoleteKeyBlockPropertyCollector) FinishIndexBlock(buf []byte) ([]byte, error) { + indexIsNonObsolete := o.indexIsNonObsolete + o.indexIsNonObsolete = false + return encodeNonObsolete(indexIsNonObsolete, buf), nil +} + +func (o *obsoleteKeyBlockPropertyCollector) FinishTable(buf []byte) ([]byte, error) { + return encodeNonObsolete(o.tableIsNonObsolete, buf), nil +} + +func (o *obsoleteKeyBlockPropertyCollector) UpdateKeySuffixes( oldProp []byte, oldSuffix, newSuffix []byte, ) error { + _, err := propToIsObsolete(oldProp) + if err != nil { + return err + } + // Suffix rewriting currently loses the obsolete bit. + o.blockIsNonObsolete = true + return nil +} + +// NB: obsoleteKeyBlockPropertyFilter is stateless. This aspect of the filter +// is used in table_cache.go for in-place modification of a filters slice. +type obsoleteKeyBlockPropertyFilter struct{} + +func (o obsoleteKeyBlockPropertyFilter) Name() string { + return "obsolete-key" +} + +// Intersects returns true if the set represented by prop intersects with +// the set in the filter.
+func (o obsoleteKeyBlockPropertyFilter) Intersects(prop []byte) (bool, error) { + return propToIsObsolete(prop) +} + +// propToIsObsolete decodes the property written by encodeNonObsolete: an +// empty property (the common case) means the block contains at least one +// non-obsolete key, so the filter intersects it; the single byte 't' means +// every key in the block is obsolete. +func propToIsObsolete(prop []byte) (bool, error) { + if len(prop) == 0 { + return true, nil + } + if len(prop) > 1 || prop[0] != 't' { + return false, errors.Errorf("unexpected property %x", prop) + } + return false, nil +} diff --git a/sstable/writer_test.go b/sstable/writer_test.go index 4d504a8778..f6051ad0d8 100644 --- a/sstable/writer_test.go +++ b/sstable/writer_test.go @@ -13,6 +13,7 @@ import ( "strings" "sync" "testing" + "unsafe" "github.com/cockroachdb/datadriven" "github.com/cockroachdb/errors" @@ -883,7 +884,7 @@ func TestWriterRace(t *testing.T) { w.Add(base.MakeInternalKey(keys[ki], uint64(ki), InternalKeyKindSet), val), ) require.Equal( - t, base.DecodeInternalKey(w.dataBlockBuf.dataBlock.curKey).UserKey, keys[ki], + t, w.dataBlockBuf.dataBlock.getCurKey().UserKey, keys[ki], ) } require.NoError(t, w.Close()) @@ -907,6 +908,70 @@ func TestWriterRace(t *testing.T) { wg.Wait() } +func TestObsoleteBlockPropertyCollectorFilter(t *testing.T) { + var c obsoleteKeyBlockPropertyCollector + var f obsoleteKeyBlockPropertyFilter + require.Equal(t, c.Name(), f.Name()) + // Data block with 1 obsolete and 1 non-obsolete point. + c.AddPoint(false) + c.AddPoint(true) + finishAndCheck := func(finishFunc func([]byte) ([]byte, error), expectedIntersects bool) { + var buf [1]byte + prop, err := finishFunc(buf[:0:1]) + require.NoError(t, err) + expectedLength := 1 + if expectedIntersects { + // The common case is encoded in 0 bytes. + expectedLength = 0 + } + require.Equal(t, expectedLength, len(prop)) + // Confirm that the collector used the slice. + require.Equal(t, unsafe.Pointer(&buf[0]), unsafe.Pointer(&prop[:1][0])) + intersects, err := f.Intersects(prop) + require.NoError(t, err) + require.Equal(t, expectedIntersects, intersects) + } + finishAndCheck(c.FinishDataBlock, true) + c.AddPrevDataBlockToIndexBlock() + // Data block with only obsolete points. + c.AddPoint(true) + c.AddPoint(true) + finishAndCheck(c.FinishDataBlock, false) + c.AddPrevDataBlockToIndexBlock() + // Index block has one obsolete block and one non-obsolete block. + finishAndCheck(c.FinishIndexBlock, true) + + // Data block with obsolete point. + c.AddPoint(true) + finishAndCheck(c.FinishDataBlock, false) + c.AddPrevDataBlockToIndexBlock() + // Data block with obsolete point. + c.AddPoint(true) + finishAndCheck(c.FinishDataBlock, false) + c.AddPrevDataBlockToIndexBlock() + // Index block has only obsolete blocks. + finishAndCheck(c.FinishIndexBlock, false) + // Table is not obsolete. + finishAndCheck(c.FinishTable, true) + + // Reset the collector state. + c = obsoleteKeyBlockPropertyCollector{} + // Table with only obsolete blocks. + + // Data block with obsolete point. + c.AddPoint(true) + finishAndCheck(c.FinishDataBlock, false) + c.AddPrevDataBlockToIndexBlock() + // Data block with obsolete point. + c.AddPoint(true) + finishAndCheck(c.FinishDataBlock, false) + c.AddPrevDataBlockToIndexBlock() + // Index block has only obsolete blocks. + finishAndCheck(c.FinishIndexBlock, false) + // Table is obsolete.
+ finishAndCheck(c.FinishTable, false) +} + func BenchmarkWriter(b *testing.B) { keys := make([][]byte, 1e6) const keyLen = 24 diff --git a/table_cache.go b/table_cache.go index 7c416a737d..9c3b9d7fc9 100644 --- a/table_cache.go +++ b/table_cache.go @@ -375,6 +375,26 @@ func (c *tableCacheShard) newIters( return nil, nil, v.err } + hideObsoletePoints := false + if opts != nil { + // This code is appending (at most one filter) in-place to + // opts.PointKeyFilters even though the slice is shared for iterators in + // the same iterator tree. This is acceptable since all the following + // properties are true: + // - The iterator tree is single threaded, so the shared backing for the + // slice is being mutated in a single threaded manner. + // - Each shallow copy of the slice has its own notion of length. + // - The appended element is always the obsoleteKeyBlockPropertyFilter + // struct, which is stateless, so overwriting that struct when creating + // one sstable iterator is harmless to other sstable iterators that are + // relying on that struct. + // + // An alternative would be to have different slices for different sstable + // iterators, but that requires more work to avoid allocations. + hideObsoletePoints, opts.PointKeyFilters = + v.reader.TryAddBlockPropertyFilterForHideObsoletePoints( + opts.snapshotForHideObsoletePoints, file.LargestSeqNum, opts.PointKeyFilters) + } ok := true var filterer *sstable.BlockPropertiesFilterer var err error @@ -390,14 +410,7 @@ func (c *tableCacheShard) newIters( type iterCreator interface { NewFixedSeqnumRangeDelIter(seqNum uint64) (keyspan.FragmentIterator, error) NewRawRangeDelIter() (keyspan.FragmentIterator, error) - NewIterWithBlockPropertyFiltersAndContext( - ctx context.Context, - lower, upper []byte, - filterer *sstable.BlockPropertiesFilterer, - useFilterBlock bool, - stats *base.InternalIteratorStats, - rp sstable.ReaderProvider, - ) (sstable.Iterator, error) + NewIterWithBlockPropertyFiltersAndContextEtc(ctx context.Context, lower, upper []byte, filterer *sstable.BlockPropertiesFilterer, hideObsoletePoints, useFilterBlock bool, stats *base.InternalIteratorStats, rp sstable.ReaderProvider) (sstable.Iterator, error) NewCompactionIter( bytesIterated *uint64, rp sstable.ReaderProvider, @@ -484,11 +497,9 @@ func (c *tableCacheShard) newIters( if internalOpts.bytesIterated != nil { iter, err = ic.NewCompactionIter(internalOpts.bytesIterated, rp) } else { - iter, err = ic.NewIterWithBlockPropertyFiltersAndContext( - ctx, - opts.GetLowerBound(), opts.GetUpperBound(), - filterer, useFilter, internalOpts.stats, rp, - ) + iter, err = ic.NewIterWithBlockPropertyFiltersAndContextEtc( + ctx, opts.GetLowerBound(), opts.GetUpperBound(), filterer, hideObsoletePoints, useFilter, + internalOpts.stats, rp) } if err != nil { if rangeDelIter != nil { diff --git a/testdata/checkpoint b/testdata/checkpoint index 986d23b111..9c38a7c932 100644 --- a/testdata/checkpoint +++ b/testdata/checkpoint @@ -256,23 +256,23 @@ close: db/000009.sst sync: db sync: db/MANIFEST-000001 open: db/000005.sst -read-at(739, 53): db/000005.sst -read-at(702, 37): db/000005.sst -read-at(74, 628): db/000005.sst +read-at(767, 53): db/000005.sst +read-at(730, 37): db/000005.sst +read-at(74, 656): db/000005.sst read-at(47, 27): db/000005.sst open: db/000005.sst close: db/000005.sst open: db/000009.sst -read-at(734, 53): db/000009.sst -read-at(697, 37): db/000009.sst -read-at(69, 628): db/000009.sst +read-at(762, 53): db/000009.sst +read-at(725, 37): db/000009.sst +read-at(69, 656): 
db/000009.sst read-at(42, 27): db/000009.sst open: db/000009.sst close: db/000009.sst open: db/000007.sst -read-at(739, 53): db/000007.sst -read-at(702, 37): db/000007.sst -read-at(74, 628): db/000007.sst +read-at(767, 53): db/000007.sst +read-at(730, 37): db/000007.sst +read-at(74, 656): db/000007.sst read-at(47, 27): db/000007.sst open: db/000007.sst close: db/000007.sst @@ -341,15 +341,15 @@ close: checkpoints/checkpoint1/000006.log scan checkpoints/checkpoint1 ---- open: checkpoints/checkpoint1/000007.sst -read-at(739, 53): checkpoints/checkpoint1/000007.sst -read-at(702, 37): checkpoints/checkpoint1/000007.sst -read-at(74, 628): checkpoints/checkpoint1/000007.sst +read-at(767, 53): checkpoints/checkpoint1/000007.sst +read-at(730, 37): checkpoints/checkpoint1/000007.sst +read-at(74, 656): checkpoints/checkpoint1/000007.sst read-at(47, 27): checkpoints/checkpoint1/000007.sst read-at(0, 47): checkpoints/checkpoint1/000007.sst open: checkpoints/checkpoint1/000005.sst -read-at(739, 53): checkpoints/checkpoint1/000005.sst -read-at(702, 37): checkpoints/checkpoint1/000005.sst -read-at(74, 628): checkpoints/checkpoint1/000005.sst +read-at(767, 53): checkpoints/checkpoint1/000005.sst +read-at(730, 37): checkpoints/checkpoint1/000005.sst +read-at(74, 656): checkpoints/checkpoint1/000005.sst read-at(47, 27): checkpoints/checkpoint1/000005.sst read-at(0, 47): checkpoints/checkpoint1/000005.sst a 1 @@ -364,9 +364,9 @@ g 10 scan db ---- open: db/000010.sst -read-at(766, 53): db/000010.sst -read-at(729, 37): db/000010.sst -read-at(101, 628): db/000010.sst +read-at(794, 53): db/000010.sst +read-at(757, 37): db/000010.sst +read-at(101, 656): db/000010.sst read-at(74, 27): db/000010.sst read-at(0, 74): db/000010.sst a 1 @@ -406,9 +406,9 @@ close: checkpoints/checkpoint2/000006.log scan checkpoints/checkpoint2 ---- open: checkpoints/checkpoint2/000007.sst -read-at(739, 53): checkpoints/checkpoint2/000007.sst -read-at(702, 37): checkpoints/checkpoint2/000007.sst -read-at(74, 628): checkpoints/checkpoint2/000007.sst +read-at(767, 53): checkpoints/checkpoint2/000007.sst +read-at(730, 37): checkpoints/checkpoint2/000007.sst +read-at(74, 656): checkpoints/checkpoint2/000007.sst read-at(47, 27): checkpoints/checkpoint2/000007.sst read-at(0, 47): checkpoints/checkpoint2/000007.sst b 5 @@ -446,15 +446,15 @@ close: checkpoints/checkpoint3/000006.log scan checkpoints/checkpoint3 ---- open: checkpoints/checkpoint3/000007.sst -read-at(739, 53): checkpoints/checkpoint3/000007.sst -read-at(702, 37): checkpoints/checkpoint3/000007.sst -read-at(74, 628): checkpoints/checkpoint3/000007.sst +read-at(767, 53): checkpoints/checkpoint3/000007.sst +read-at(730, 37): checkpoints/checkpoint3/000007.sst +read-at(74, 656): checkpoints/checkpoint3/000007.sst read-at(47, 27): checkpoints/checkpoint3/000007.sst read-at(0, 47): checkpoints/checkpoint3/000007.sst open: checkpoints/checkpoint3/000005.sst -read-at(739, 53): checkpoints/checkpoint3/000005.sst -read-at(702, 37): checkpoints/checkpoint3/000005.sst -read-at(74, 628): checkpoints/checkpoint3/000005.sst +read-at(767, 53): checkpoints/checkpoint3/000005.sst +read-at(730, 37): checkpoints/checkpoint3/000005.sst +read-at(74, 656): checkpoints/checkpoint3/000005.sst read-at(47, 27): checkpoints/checkpoint3/000005.sst read-at(0, 47): checkpoints/checkpoint3/000005.sst a 1 diff --git a/testdata/compaction_delete_only_hints b/testdata/compaction_delete_only_hints index 6bee9d15e6..83606100dd 100644 --- a/testdata/compaction_delete_only_hints +++ 
b/testdata/compaction_delete_only_hints @@ -88,7 +88,7 @@ maybe-compact Deletion hints: (none) Compactions: - [JOB 100] compacted(delete-only) L2 [000005] (786 B) + L3 [000006] (786 B) -> L6 [] (0 B), in 1.0s (2.0s total), output rate 0 B/s + [JOB 100] compacted(delete-only) L2 [000005] (814 B) + L3 [000006] (814 B) -> L6 [] (0 B), in 1.0s (2.0s total), output rate 0 B/s # Verify that compaction correctly handles the presence of multiple # overlapping hints which might delete a file multiple times. All of the @@ -127,7 +127,7 @@ maybe-compact Deletion hints: (none) Compactions: - [JOB 100] compacted(delete-only) L2 [000006] (786 B) + L3 [000007] (786 B) -> L6 [] (0 B), in 1.0s (2.0s total), output rate 0 B/s + [JOB 100] compacted(delete-only) L2 [000006] (814 B) + L3 [000007] (814 B) -> L6 [] (0 B), in 1.0s (2.0s total), output rate 0 B/s # Test a range tombstone that is already compacted into L6. @@ -206,7 +206,7 @@ maybe-compact Deletion hints: (none) Compactions: - [JOB 100] compacted(delete-only) L2 [000005] (786 B) + L3 [000006] (786 B) -> L6 [] (0 B), in 1.0s (2.0s total), output rate 0 B/s + [JOB 100] compacted(delete-only) L2 [000005] (814 B) + L3 [000006] (814 B) -> L6 [] (0 B), in 1.0s (2.0s total), output rate 0 B/s # A deletion hint present on an sstable in a higher level should NOT result in a # deletion-only compaction incorrectly removing an sstable in L6 following an @@ -255,7 +255,7 @@ L0.000001 a-z seqnums(tombstone=5-27, file-smallest=0, type=point-key-only) close-snapshot 10 ---- -[JOB 100] compacted(elision-only) L6 [000004] (850 B) + L6 [] (0 B) -> L6 [000005] (771 B), in 1.0s (2.0s total), output rate 771 B/s +[JOB 100] compacted(elision-only) L6 [000004] (878 B) + L6 [] (0 B) -> L6 [000005] (799 B), in 1.0s (2.0s total), output rate 799 B/s # The deletion hint was removed by the elision-only compaction. get-hints @@ -414,4 +414,4 @@ maybe-compact Deletion hints: (none) Compactions: - [JOB 100] compacted(delete-only) L6 [000006 000007 000008 000009 000011] (4.5 K) -> L6 [] (0 B), in 1.0s (2.0s total), output rate 0 B/s + [JOB 100] compacted(delete-only) L6 [000006 000007 000008 000009 000011] (4.6 K) -> L6 [] (0 B), in 1.0s (2.0s total), output rate 0 B/s diff --git a/testdata/compaction_iter b/testdata/compaction_iter index 016cdbd5bf..a4e73827d5 100644 --- a/testdata/compaction_iter +++ b/testdata/compaction_iter @@ -238,7 +238,7 @@ a.SET.1:c iter first ---- -err=invalid internal key kind: 255 +err=invalid internal key kind: 191 define a.SET.2:b @@ -250,7 +250,7 @@ first next ---- a#2,1:b -err=invalid internal key kind: 255 +err=invalid internal key kind: 191 define a.MERGE.2:b @@ -262,7 +262,7 @@ first next ---- a#2,2:b -err=invalid internal key kind: 255 +err=invalid internal key kind: 191 define a.INVALID.2:c @@ -273,7 +273,7 @@ iter first tombstones ---- -err=invalid internal key kind: 255 +err=invalid internal key kind: 191 . 
define diff --git a/testdata/compaction_iter_delete_sized b/testdata/compaction_iter_delete_sized index c1663fcbfe..0483fd530a 100644 --- a/testdata/compaction_iter_delete_sized +++ b/testdata/compaction_iter_delete_sized @@ -238,7 +238,7 @@ a.SET.1:c iter first ---- -err=invalid internal key kind: 255 +err=invalid internal key kind: 191 define a.SET.2:b @@ -250,7 +250,7 @@ first next ---- a#2,18:b -err=invalid internal key kind: 255 +err=invalid internal key kind: 191 define a.MERGE.2:b @@ -262,7 +262,7 @@ first next ---- a#2,2:b -err=invalid internal key kind: 255 +err=invalid internal key kind: 191 define a.INVALID.2:c @@ -273,7 +273,7 @@ iter first tombstones ---- -err=invalid internal key kind: 255 +err=invalid internal key kind: 191 . define @@ -1338,7 +1338,7 @@ first next ---- a#2,18:b -err=invalid internal key kind: 255 +err=invalid internal key kind: 191 define a.SET.3:c @@ -1351,7 +1351,7 @@ first next ---- a#3,18:c -err=invalid internal key kind: 255 +err=invalid internal key kind: 191 # SINGLEDEL that meets a SETWITHDEL is transformed into a DEL. diff --git a/testdata/compaction_iter_set_with_del b/testdata/compaction_iter_set_with_del index 796984e0c7..0cb35f6b29 100644 --- a/testdata/compaction_iter_set_with_del +++ b/testdata/compaction_iter_set_with_del @@ -238,7 +238,7 @@ a.SET.1:c iter first ---- -err=invalid internal key kind: 255 +err=invalid internal key kind: 191 define a.SET.2:b @@ -250,7 +250,7 @@ first next ---- a#2,18:b -err=invalid internal key kind: 255 +err=invalid internal key kind: 191 define a.MERGE.2:b @@ -262,7 +262,7 @@ first next ---- a#2,2:b -err=invalid internal key kind: 255 +err=invalid internal key kind: 191 define a.INVALID.2:c @@ -273,7 +273,7 @@ iter first tombstones ---- -err=invalid internal key kind: 255 +err=invalid internal key kind: 191 . define @@ -1338,7 +1338,7 @@ first next ---- a#2,18:b -err=invalid internal key kind: 255 +err=invalid internal key kind: 191 define a.SET.3:c @@ -1351,7 +1351,7 @@ first next ---- a#3,18:c -err=invalid internal key kind: 255 +err=invalid internal key kind: 191 # SINGLEDEL that meets a SETWITHDEL is transformed into a DEL. diff --git a/testdata/compaction_tombstones b/testdata/compaction_tombstones index cfc249cef6..6f0546f3ce 100644 --- a/testdata/compaction_tombstones +++ b/testdata/compaction_tombstones @@ -41,7 +41,7 @@ range-deletions-bytes-estimate: 0 maybe-compact ---- -[JOB 100] compacted(elision-only) L6 [000004] (853 B) + L6 [] (0 B) -> L6 [] (0 B), in 1.0s (2.0s total), output rate 0 B/s +[JOB 100] compacted(elision-only) L6 [000004] (885 B) + L6 [] (0 B) -> L6 [] (0 B), in 1.0s (2.0s total), output rate 0 B/s # Test a table that straddles a snapshot. It should not be compacted. 
define snapshots=(50) auto-compactions=off @@ -80,12 +80,12 @@ wait-pending-table-stats num-entries: 2 num-deletions: 1 num-range-key-sets: 0 -point-deletions-bytes-estimate: 108 +point-deletions-bytes-estimate: 111 range-deletions-bytes-estimate: 0 maybe-compact ---- -[JOB 100] compacted(elision-only) L6 [000004] (823 B) + L6 [] (0 B) -> L6 [000005] (772 B), in 1.0s (2.0s total), output rate 772 B/s +[JOB 100] compacted(elision-only) L6 [000004] (851 B) + L6 [] (0 B) -> L6 [000005] (800 B), in 1.0s (2.0s total), output rate 800 B/s version ---- @@ -119,7 +119,7 @@ wait-pending-table-stats num-entries: 6 num-deletions: 2 num-range-key-sets: 0 -point-deletions-bytes-estimate: 48 +point-deletions-bytes-estimate: 49 range-deletions-bytes-estimate: 66 maybe-compact @@ -134,7 +134,7 @@ close-snapshot close-snapshot 103 ---- -[JOB 100] compacted(elision-only) L6 [000004] (1001 B) + L6 [] (0 B) -> L6 [] (0 B), in 1.0s (2.0s total), output rate 0 B/s +[JOB 100] compacted(elision-only) L6 [000004] (1.0 K) + L6 [] (0 B) -> L6 [] (0 B), in 1.0s (2.0s total), output rate 0 B/s # Test a table that contains both deletions and non-deletions, but whose # non-deletions well outnumber its deletions. The table should not be @@ -152,7 +152,7 @@ wait-pending-table-stats num-entries: 11 num-deletions: 1 num-range-key-sets: 0 -point-deletions-bytes-estimate: 26 +point-deletions-bytes-estimate: 27 range-deletions-bytes-estimate: 0 close-snapshot @@ -233,7 +233,7 @@ wait-pending-table-stats num-entries: 3 num-deletions: 3 num-range-key-sets: 0 -point-deletions-bytes-estimate: 6966 +point-deletions-bytes-estimate: 6994 range-deletions-bytes-estimate: 0 # By plain file size, 000005 should be picked because it is larger and @@ -243,7 +243,7 @@ range-deletions-bytes-estimate: 0 maybe-compact ---- -[JOB 100] compacted(default) L5 [000004] (833 B) + L6 [000006] (13 K) -> L6 [] (0 B), in 1.0s (2.0s total), output rate 0 B/s +[JOB 100] compacted(default) L5 [000004] (861 B) + L6 [000006] (13 K) -> L6 [] (0 B), in 1.0s (2.0s total), output rate 0 B/s # A table containing only range keys is not eligible for elision. # RANGEKEYDEL or RANGEKEYUNSET. @@ -323,7 +323,7 @@ range-deletions-bytes-estimate: 41 maybe-compact ---- -[JOB 100] compacted(elision-only) L6 [000004] (1003 B) + L6 [] (0 B) -> L6 [000005] (778 B), in 1.0s (2.0s total), output rate 778 B/s +[JOB 100] compacted(elision-only) L6 [000004] (1.0 K) + L6 [] (0 B) -> L6 [000005] (806 B), in 1.0s (2.0s total), output rate 806 B/s # Close the DB, asserting that the reference counts balance. close @@ -359,7 +359,7 @@ wait-pending-table-stats num-entries: 2 num-deletions: 1 num-range-key-sets: 0 -point-deletions-bytes-estimate: 2786 +point-deletions-bytes-estimate: 2797 range-deletions-bytes-estimate: 0 wait-pending-table-stats @@ -373,7 +373,7 @@ range-deletions-bytes-estimate: 8246 maybe-compact ---- -[JOB 100] compacted(default) L5 [000005] (850 B) + L6 [000007] (13 K) -> L6 [000008] (4.8 K), in 1.0s (2.0s total), output rate 4.8 K/s +[JOB 100] compacted(default) L5 [000005] (878 B) + L6 [000007] (13 K) -> L6 [000008] (4.8 K), in 1.0s (2.0s total), output rate 4.8 K/s # The same LSM as above. 
However, this time, with point tombstone weighting at # 2x, the table with the point tombstone (000004) will be selected as the @@ -402,7 +402,7 @@ wait-pending-table-stats num-entries: 2 num-deletions: 1 num-range-key-sets: 0 -point-deletions-bytes-estimate: 2786 +point-deletions-bytes-estimate: 2797 range-deletions-bytes-estimate: 0 wait-pending-table-stats @@ -416,4 +416,4 @@ range-deletions-bytes-estimate: 8246 maybe-compact ---- -[JOB 100] compacted(default) L5 [000005] (850 B) + L6 [000007] (13 K) -> L6 [000008] (4.8 K), in 1.0s (2.0s total), output rate 4.8 K/s +[JOB 100] compacted(default) L5 [000005] (878 B) + L6 [000007] (13 K) -> L6 [000008] (4.8 K), in 1.0s (2.0s total), output rate 4.8 K/s diff --git a/testdata/event_listener b/testdata/event_listener index 9d5fb953f6..172ee775d8 100644 --- a/testdata/event_listener +++ b/testdata/event_listener @@ -127,7 +127,7 @@ close: db/marker.manifest.000002.MANIFEST-000006 remove: db/marker.manifest.000001.MANIFEST-000001 sync: db [JOB 5] MANIFEST created 000006 -[JOB 5] flushed 1 memtable to L0 [000005] (771 B), in 1.0s (2.0s total), output rate 771 B/s +[JOB 5] flushed 1 memtable to L0 [000005] (799 B), in 1.0s (2.0s total), output rate 799 B/s compact ---- @@ -151,21 +151,21 @@ close: db/marker.manifest.000003.MANIFEST-000009 remove: db/marker.manifest.000002.MANIFEST-000006 sync: db [JOB 7] MANIFEST created 000009 -[JOB 7] flushed 1 memtable to L0 [000008] (771 B), in 1.0s (2.0s total), output rate 771 B/s +[JOB 7] flushed 1 memtable to L0 [000008] (799 B), in 1.0s (2.0s total), output rate 799 B/s remove: db/MANIFEST-000001 [JOB 7] MANIFEST deleted 000001 -[JOB 8] compacting(default) L0 [000005 000008] (1.5 K) + L6 [] (0 B) +[JOB 8] compacting(default) L0 [000005 000008] (1.6 K) + L6 [] (0 B) open: db/000005.sst -read-at(718, 53): db/000005.sst -read-at(681, 37): db/000005.sst -read-at(53, 628): db/000005.sst +read-at(746, 53): db/000005.sst +read-at(709, 37): db/000005.sst +read-at(53, 656): db/000005.sst read-at(26, 27): db/000005.sst open: db/000005.sst close: db/000005.sst open: db/000008.sst -read-at(718, 53): db/000008.sst -read-at(681, 37): db/000008.sst -read-at(53, 628): db/000008.sst +read-at(746, 53): db/000008.sst +read-at(709, 37): db/000008.sst +read-at(53, 656): db/000008.sst read-at(26, 27): db/000008.sst open: db/000008.sst close: db/000008.sst @@ -188,7 +188,7 @@ close: db/marker.manifest.000004.MANIFEST-000011 remove: db/marker.manifest.000003.MANIFEST-000009 sync: db [JOB 8] MANIFEST created 000011 -[JOB 8] compacted(default) L0 [000005 000008] (1.5 K) + L6 [] (0 B) -> L6 [000010] (771 B), in 1.0s (3.0s total), output rate 771 B/s +[JOB 8] compacted(default) L0 [000005 000008] (1.6 K) + L6 [] (0 B) -> L6 [000010] (799 B), in 1.0s (3.0s total), output rate 799 B/s close: db/000005.sst close: db/000008.sst remove: db/000005.sst @@ -223,7 +223,7 @@ close: db/marker.manifest.000005.MANIFEST-000014 remove: db/marker.manifest.000004.MANIFEST-000011 sync: db [JOB 10] MANIFEST created 000014 -[JOB 10] flushed 1 memtable to L0 [000013] (771 B), in 1.0s (2.0s total), output rate 771 B/s +[JOB 10] flushed 1 memtable to L0 [000013] (799 B), in 1.0s (2.0s total), output rate 799 B/s enable-file-deletions ---- @@ -233,9 +233,9 @@ remove: db/MANIFEST-000009 ingest ---- open: ext/0 -read-at(773, 53): ext/0 -read-at(736, 37): ext/0 -read-at(53, 683): ext/0 +read-at(801, 53): ext/0 +read-at(764, 37): ext/0 +read-at(53, 711): ext/0 read-at(26, 27): ext/0 read-at(0, 26): ext/0 close: ext/0 @@ -243,9 +243,9 @@ link: ext/0 -> 
db/000015.sst [JOB 12] ingesting: sstable created 000015 sync: db open: db/000013.sst -read-at(718, 53): db/000013.sst -read-at(681, 37): db/000013.sst -read-at(53, 628): db/000013.sst +read-at(746, 53): db/000013.sst +read-at(709, 37): db/000013.sst +read-at(53, 656): db/000013.sst read-at(26, 27): db/000013.sst read-at(0, 26): db/000013.sst create: db/MANIFEST-000016 @@ -257,28 +257,28 @@ remove: db/marker.manifest.000005.MANIFEST-000014 sync: db [JOB 12] MANIFEST created 000016 remove: ext/0 -[JOB 12] ingested L0:000015 (826 B) +[JOB 12] ingested L0:000015 (854 B) metrics ---- __level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp WAL 1 27 B - 48 B - - - - 108 B - - - 2.2 - 0 2 1.6 K 0.40 81 B 826 B 1 0 B 0 2.3 K 3 0 B 2 28.6 + 0 2 1.6 K 0.40 81 B 854 B 1 0 B 0 2.3 K 3 0 B 2 29.6 1 0 0 B 0.00 0 B 0 B 0 0 B 0 0 B 0 0 B 0 0.0 2 0 0 B 0.00 0 B 0 B 0 0 B 0 0 B 0 0 B 0 0.0 3 0 0 B 0.00 0 B 0 B 0 0 B 0 0 B 0 0 B 0 0.0 4 0 0 B 0.00 0 B 0 B 0 0 B 0 0 B 0 0 B 0 0.0 5 0 0 B 0.00 0 B 0 B 0 0 B 0 0 B 0 0 B 0 0.0 - 6 1 771 B - 1.5 K 0 B 0 0 B 0 771 B 1 1.5 K 1 0.5 - total 3 2.3 K - 934 B 826 B 1 0 B 0 3.9 K 4 1.5 K 3 4.3 + 6 1 799 B - 1.6 K 0 B 0 0 B 0 799 B 1 1.6 K 1 0.5 + total 3 2.4 K - 962 B 854 B 1 0 B 0 4.1 K 4 1.6 K 3 4.3 flush 3 0 B 0 0 (ingest = tables-ingested, move = ingested-as-flushable) -compact 1 2.3 K 0 B 0 (size == estimated-debt, score = in-progress-bytes, in = num-in-progress) +compact 1 2.4 K 0 B 0 (size == estimated-debt, score = in-progress-bytes, in = num-in-progress) ctype 1 0 0 0 0 0 0 (default, delete, elision, move, read, rewrite, multi-level) memtbl 1 256 K zmemtbl 0 0 B ztbl 0 0 B - bcache 8 1.4 K 11.1% (score == hit-rate) - tcache 1 752 B 40.0% (score == hit-rate) + bcache 8 1.5 K 11.1% (score == hit-rate) + tcache 1 760 B 40.0% (score == hit-rate) snaps 0 - 0 (score == earliest seq num) titers 0 filter - - 0.0% (score == utility) @@ -292,16 +292,16 @@ ingest-flushable ---- sync-data: wal/000012.log open: ext/a -read-at(773, 53): ext/a -read-at(736, 37): ext/a -read-at(53, 683): ext/a +read-at(801, 53): ext/a +read-at(764, 37): ext/a +read-at(53, 711): ext/a read-at(26, 27): ext/a read-at(0, 26): ext/a close: ext/a open: ext/b -read-at(773, 53): ext/b -read-at(736, 37): ext/b -read-at(53, 683): ext/b +read-at(801, 53): ext/b +read-at(764, 37): ext/b +read-at(53, 711): ext/b read-at(26, 27): ext/b read-at(0, 26): ext/b close: ext/b @@ -323,7 +323,7 @@ sync: wal [JOB 15] WAL created 000020 remove: ext/a remove: ext/b -[JOB 13] ingested as flushable 000017 (826 B), 000018 (826 B) +[JOB 13] ingested as flushable 000017 (854 B), 000018 (854 B) sync-data: wal/000020.log close: wal/000020.log create: wal/000021.log @@ -336,7 +336,7 @@ sync-data: db/000022.sst close: db/000022.sst sync: db sync: db/MANIFEST-000016 -[JOB 17] flushed 1 memtable to L0 [000022] (771 B), in 1.0s (2.0s total), output rate 771 B/s +[JOB 17] flushed 1 memtable to L0 [000022] (799 B), in 1.0s (2.0s total), output rate 799 B/s remove: db/MANIFEST-000011 [JOB 17] MANIFEST deleted 000011 [JOB 18] flushing 2 ingested tables @@ -348,7 +348,7 @@ close: db/marker.manifest.000007.MANIFEST-000023 remove: db/marker.manifest.000006.MANIFEST-000016 sync: db [JOB 18] MANIFEST created 000023 -[JOB 18] flushed 2 ingested flushables L0:000017 (826 B) + L6:000018 (826 B) in 1.0s (2.0s total), output rate 1.6 K/s +[JOB 18] flushed 2 ingested flushables L0:000017 (854 B) + L6:000018 (854 B) in 1.0s (2.0s total), output rate 1.7 K/s remove: db/MANIFEST-000014 [JOB 
18] MANIFEST deleted 000014 [JOB 19] flushing 1 memtable to L0 @@ -359,22 +359,22 @@ metrics ---- __level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp WAL 1 29 B - 82 B - - - - 110 B - - - 1.3 - 0 4 3.1 K 0.80 81 B 1.6 K 2 0 B 0 3.0 K 4 0 B 4 38.1 + 0 4 3.2 K 0.80 81 B 1.7 K 2 0 B 0 3.1 K 4 0 B 4 39.5 1 0 0 B 0.00 0 B 0 B 0 0 B 0 0 B 0 0 B 0 0.0 2 0 0 B 0.00 0 B 0 B 0 0 B 0 0 B 0 0 B 0 0.0 3 0 0 B 0.00 0 B 0 B 0 0 B 0 0 B 0 0 B 0 0.0 4 0 0 B 0.00 0 B 0 B 0 0 B 0 0 B 0 0 B 0 0.0 5 0 0 B 0.00 0 B 0 B 0 0 B 0 0 B 0 0 B 0 0.0 - 6 2 1.6 K - 1.5 K 826 B 1 0 B 0 771 B 1 1.5 K 1 0.5 - total 6 4.7 K - 2.5 K 2.4 K 3 0 B 0 6.3 K 5 1.5 K 5 2.5 - flush 6 1.6 K 2 1 (ingest = tables-ingested, move = ingested-as-flushable) -compact 1 4.7 K 0 B 0 (size == estimated-debt, score = in-progress-bytes, in = num-in-progress) + 6 2 1.6 K - 1.6 K 854 B 1 0 B 0 799 B 1 1.6 K 1 0.5 + total 6 4.8 K - 2.6 K 2.5 K 3 0 B 0 6.5 K 5 1.6 K 5 2.5 + flush 6 1.7 K 2 1 (ingest = tables-ingested, move = ingested-as-flushable) +compact 1 4.8 K 0 B 0 (size == estimated-debt, score = in-progress-bytes, in = num-in-progress) ctype 1 0 0 0 0 0 0 (default, delete, elision, move, read, rewrite, multi-level) memtbl 1 512 K zmemtbl 0 0 B ztbl 0 0 B - bcache 16 2.9 K 14.3% (score == hit-rate) - tcache 1 752 B 50.0% (score == hit-rate) + bcache 16 3.0 K 14.3% (score == hit-rate) + tcache 1 760 B 50.0% (score == hit-rate) snaps 0 - 0 (score == earliest seq num) titers 0 filter - - 0.0% (score == utility) diff --git a/testdata/ingest b/testdata/ingest index 3334e58254..346ee52946 100644 --- a/testdata/ingest +++ b/testdata/ingest @@ -48,12 +48,12 @@ compact 0 0 B 0 B 0 (size == esti zmemtbl 0 0 B ztbl 0 0 B bcache 8 1.5 K 42.9% (score == hit-rate) - tcache 1 752 B 50.0% (score == hit-rate) + tcache 1 760 B 50.0% (score == hit-rate) snaps 0 - 0 (score == earliest seq num) titers 0 filter - - 0.0% (score == utility) ingest 1 - + iter seek-ge a @@ -351,7 +351,7 @@ num-entries: 2 num-deletions: 2 num-range-key-sets: 0 point-deletions-bytes-estimate: 0 -range-deletions-bytes-estimate: 1666 +range-deletions-bytes-estimate: 1694 # A set operation takes precedence over a range deletion at the same # sequence number as can occur during ingestion. 
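The uniform 28-byte growth in the golden sstable sizes above and below appears to be the per-table cost of the new obsolete-key block property that every TableFormatPebblev4 table now carries, together with its entry in PropertyCollectorNames. The encoding is biased so that the common case, a block that still contains non-obsolete keys, costs zero bytes. A minimal standalone round-trip sketch, with illustrative names (encode mirrors encodeNonObsolete, intersects mirrors obsoleteKeyBlockPropertyFilter.Intersects):

package main

import (
	"errors"
	"fmt"
)

// encode writes zero bytes for a block with at least one non-obsolete key
// (the common case) and the single byte 't' for an all-obsolete block.
func encode(nonObsolete bool, buf []byte) []byte {
	if nonObsolete {
		return buf
	}
	return append(buf, 't')
}

// intersects reports whether an iterator that hides obsolete points must
// read the block: true iff the property says non-obsolete keys remain.
func intersects(prop []byte) (bool, error) {
	if len(prop) == 0 {
		return true, nil
	}
	if len(prop) > 1 || prop[0] != 't' {
		return false, errors.New("unexpected property")
	}
	return false, nil
}

func main() {
	for _, allObsolete := range []bool{false, true} {
		prop := encode(!allObsolete, nil)
		read, err := intersects(prop)
		if err != nil {
			panic(err)
		}
		fmt.Printf("allObsolete=%v prop=%q readBlock=%v\n", allObsolete, prop, read)
	}
}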
diff --git a/testdata/manual_compaction_file_boundaries_delsized b/testdata/manual_compaction_file_boundaries_delsized index b83e57e265..9ec939d662 100644 --- a/testdata/manual_compaction_file_boundaries_delsized +++ b/testdata/manual_compaction_file_boundaries_delsized @@ -290,60 +290,60 @@ compact a-zz L1 file-sizes ---- L2: - 000095:[a#101,1-az@1#129,1]: 7637 bytes (7.5 K) - 000096:[b@1#130,1-bz@1#156,1]: 6629 bytes (6.5 K) - 000097:[c@1#157,1-cz@1#183,1]: 6629 bytes (6.5 K) - 000098:[d@1#184,1-dz@1#210,1]: 6629 bytes (6.5 K) - 000099:[e@1#211,1-ez@1#237,1]: 6629 bytes (6.5 K) - 000100:[f@1#238,1-fz@1#264,1]: 6629 bytes (6.5 K) - 000101:[g@1#265,1-gz@1#291,1]: 6629 bytes (6.5 K) - 000102:[h@1#292,1-hz@1#318,1]: 6629 bytes (6.5 K) - 000103:[i@1#319,1-iz@1#345,1]: 6629 bytes (6.5 K) - 000104:[j@1#346,1-jz@1#372,1]: 6629 bytes (6.5 K) - 000105:[k@1#373,1-kz@1#399,1]: 6629 bytes (6.5 K) - 000106:[l@1#400,1-lz@1#426,1]: 6629 bytes (6.5 K) - 000107:[m@1#427,1-mz@1#453,1]: 6629 bytes (6.5 K) - 000108:[n@1#454,1-nz@1#480,1]: 6629 bytes (6.5 K) - 000109:[o@1#481,1-oz@1#507,1]: 6629 bytes (6.5 K) - 000110:[p@1#508,1-pz@1#534,1]: 6629 bytes (6.5 K) - 000111:[q@1#535,1-qz@1#561,1]: 6628 bytes (6.5 K) - 000112:[r@1#562,1-rz@1#588,1]: 6629 bytes (6.5 K) - 000113:[s@1#589,1-sz@1#615,1]: 6629 bytes (6.5 K) - 000114:[t@1#616,1-tz@1#642,1]: 6629 bytes (6.5 K) - 000115:[u@1#643,1-uz@1#669,1]: 6629 bytes (6.5 K) - 000116:[v@1#670,1-vz@1#696,1]: 6629 bytes (6.5 K) - 000117:[w@1#697,1-wz@1#723,1]: 6629 bytes (6.5 K) - 000118:[x@1#724,1-xz@1#750,1]: 6629 bytes (6.5 K) - 000119:[y@1#751,1-yz@1#777,1]: 6629 bytes (6.5 K) - 000120:[z#102,1-zr@1#796,1]: 5909 bytes (5.8 K) - 000121:[zs@1#797,1-zz@1#804,1]: 2491 bytes (2.4 K) + 000095:[a#101,1-az@1#129,1]: 7665 bytes (7.5 K) + 000096:[b@1#130,1-bz@1#156,1]: 6657 bytes (6.5 K) + 000097:[c@1#157,1-cz@1#183,1]: 6657 bytes (6.5 K) + 000098:[d@1#184,1-dz@1#210,1]: 6657 bytes (6.5 K) + 000099:[e@1#211,1-ez@1#237,1]: 6657 bytes (6.5 K) + 000100:[f@1#238,1-fz@1#264,1]: 6657 bytes (6.5 K) + 000101:[g@1#265,1-gz@1#291,1]: 6657 bytes (6.5 K) + 000102:[h@1#292,1-hz@1#318,1]: 6657 bytes (6.5 K) + 000103:[i@1#319,1-iz@1#345,1]: 6657 bytes (6.5 K) + 000104:[j@1#346,1-jz@1#372,1]: 6657 bytes (6.5 K) + 000105:[k@1#373,1-kz@1#399,1]: 6657 bytes (6.5 K) + 000106:[l@1#400,1-lz@1#426,1]: 6657 bytes (6.5 K) + 000107:[m@1#427,1-mz@1#453,1]: 6657 bytes (6.5 K) + 000108:[n@1#454,1-nz@1#480,1]: 6657 bytes (6.5 K) + 000109:[o@1#481,1-oz@1#507,1]: 6657 bytes (6.5 K) + 000110:[p@1#508,1-pz@1#534,1]: 6657 bytes (6.5 K) + 000111:[q@1#535,1-qz@1#561,1]: 6656 bytes (6.5 K) + 000112:[r@1#562,1-rz@1#588,1]: 6657 bytes (6.5 K) + 000113:[s@1#589,1-sz@1#615,1]: 6657 bytes (6.5 K) + 000114:[t@1#616,1-tz@1#642,1]: 6657 bytes (6.5 K) + 000115:[u@1#643,1-uz@1#669,1]: 6657 bytes (6.5 K) + 000116:[v@1#670,1-vz@1#696,1]: 6657 bytes (6.5 K) + 000117:[w@1#697,1-wz@1#723,1]: 6657 bytes (6.5 K) + 000118:[x@1#724,1-xz@1#750,1]: 6657 bytes (6.5 K) + 000119:[y@1#751,1-yz@1#777,1]: 6657 bytes (6.5 K) + 000120:[z#102,1-zr@1#796,1]: 5937 bytes (5.8 K) + 000121:[zs@1#797,1-zz@1#804,1]: 2519 bytes (2.5 K) L3: - 000005:[a#1,1-a#1,1]: 10776 bytes (10 K) - 000006:[b#2,1-b#2,1]: 10776 bytes (10 K) - 000007:[c#3,1-c#3,1]: 10776 bytes (10 K) - 000008:[d#4,1-d#4,1]: 10776 bytes (10 K) - 000009:[e#5,1-e#5,1]: 10776 bytes (10 K) - 000010:[f#6,1-f#6,1]: 10776 bytes (10 K) - 000011:[g#7,1-g#7,1]: 10776 bytes (10 K) - 000012:[h#8,1-h#8,1]: 10776 bytes (10 K) - 000013:[i#9,1-i#9,1]: 10776 bytes (10 K) - 000014:[j#10,1-j#10,1]: 10776 
bytes (10 K) - 000015:[k#11,1-k#11,1]: 10776 bytes (10 K) - 000016:[l#12,1-l#12,1]: 10776 bytes (10 K) - 000017:[m#13,1-m#13,1]: 10776 bytes (10 K) - 000018:[n#14,1-n#14,1]: 10776 bytes (10 K) - 000019:[o#15,1-o#15,1]: 10776 bytes (10 K) - 000020:[p#16,1-p#16,1]: 10776 bytes (10 K) - 000021:[q#17,1-q#17,1]: 10776 bytes (10 K) - 000022:[r#18,1-r#18,1]: 10776 bytes (10 K) - 000023:[s#19,1-s#19,1]: 10776 bytes (10 K) - 000024:[t#20,1-t#20,1]: 10776 bytes (10 K) - 000025:[u#21,1-u#21,1]: 10776 bytes (10 K) - 000026:[v#22,1-v#22,1]: 10776 bytes (10 K) - 000027:[w#23,1-w#23,1]: 10776 bytes (10 K) - 000028:[x#24,1-x#24,1]: 10776 bytes (10 K) - 000029:[y#25,1-y#25,1]: 10776 bytes (10 K) - 000030:[z#26,1-z#26,1]: 10776 bytes (10 K) + 000005:[a#1,1-a#1,1]: 10804 bytes (11 K) + 000006:[b#2,1-b#2,1]: 10804 bytes (11 K) + 000007:[c#3,1-c#3,1]: 10804 bytes (11 K) + 000008:[d#4,1-d#4,1]: 10804 bytes (11 K) + 000009:[e#5,1-e#5,1]: 10804 bytes (11 K) + 000010:[f#6,1-f#6,1]: 10804 bytes (11 K) + 000011:[g#7,1-g#7,1]: 10804 bytes (11 K) + 000012:[h#8,1-h#8,1]: 10804 bytes (11 K) + 000013:[i#9,1-i#9,1]: 10804 bytes (11 K) + 000014:[j#10,1-j#10,1]: 10804 bytes (11 K) + 000015:[k#11,1-k#11,1]: 10804 bytes (11 K) + 000016:[l#12,1-l#12,1]: 10804 bytes (11 K) + 000017:[m#13,1-m#13,1]: 10804 bytes (11 K) + 000018:[n#14,1-n#14,1]: 10804 bytes (11 K) + 000019:[o#15,1-o#15,1]: 10804 bytes (11 K) + 000020:[p#16,1-p#16,1]: 10804 bytes (11 K) + 000021:[q#17,1-q#17,1]: 10804 bytes (11 K) + 000022:[r#18,1-r#18,1]: 10804 bytes (11 K) + 000023:[s#19,1-s#19,1]: 10804 bytes (11 K) + 000024:[t#20,1-t#20,1]: 10804 bytes (11 K) + 000025:[u#21,1-u#21,1]: 10804 bytes (11 K) + 000026:[v#22,1-v#22,1]: 10804 bytes (11 K) + 000027:[w#23,1-w#23,1]: 10804 bytes (11 K) + 000028:[x#24,1-x#24,1]: 10804 bytes (11 K) + 000029:[y#25,1-y#25,1]: 10804 bytes (11 K) + 000030:[z#26,1-z#26,1]: 10804 bytes (11 K) # Test a scenario where there exists a grandparent file (in L3), but the L1->L2 # compaction doesn't reach it until late in the compaction. The output file @@ -399,11 +399,11 @@ compact a-zz L1 file-sizes ---- L2: - 000007:[a#201,1-j#210,1]: 10958 bytes (11 K) - 000008:[k#211,1-o#215,1]: 5865 bytes (5.7 K) - 000009:[z#102,1-z#102,1]: 781 bytes (781 B) + 000007:[a#201,1-j#210,1]: 10986 bytes (11 K) + 000008:[k#211,1-o#215,1]: 5893 bytes (5.8 K) + 000009:[z#102,1-z#102,1]: 809 bytes (809 B) L3: - 000006:[m#1,1-m#1,1]: 10776 bytes (10 K) + 000006:[m#1,1-m#1,1]: 10804 bytes (11 K) # Test the file-size splitter's adaptive tolerance for early-splitting at a # grandparent boundary. 
The L1->L2 compaction has many opportunities to split at @@ -501,20 +501,20 @@ compact a-zz L1 file-sizes ---- L2: - 000019:[a#201,1-e#205,1]: 5865 bytes (5.7 K) - 000020:[f#206,1-l#212,1]: 7893 bytes (7.7 K) - 000021:[m#213,1-z#102,1]: 3827 bytes (3.7 K) + 000019:[a#201,1-e#205,1]: 5893 bytes (5.8 K) + 000020:[f#206,1-l#212,1]: 7921 bytes (7.7 K) + 000021:[m#213,1-z#102,1]: 3855 bytes (3.8 K) L3: - 000006:[a#1,1-a#1,1]: 1776 bytes (1.7 K) - 000007:[ab#2,1-ab#2,1]: 1777 bytes (1.7 K) - 000008:[ac#3,1-ac#3,1]: 1777 bytes (1.7 K) - 000013:[ad#8,1-ad#8,1]: 1777 bytes (1.7 K) - 000012:[ad#7,1-ad#7,1]: 1777 bytes (1.7 K) - 000011:[ad#6,1-ad#6,1]: 1777 bytes (1.7 K) - 000010:[ad#5,1-ad#5,1]: 1777 bytes (1.7 K) - 000009:[ad#4,1-ad#4,1]: 1777 bytes (1.7 K) - 000014:[c#9,1-c#9,1]: 1776 bytes (1.7 K) - 000015:[d#10,1-d#10,1]: 1776 bytes (1.7 K) - 000016:[e#11,1-e#11,1]: 1776 bytes (1.7 K) - 000017:[f#12,1-f#12,1]: 1776 bytes (1.7 K) - 000018:[m#13,1-m#13,1]: 1776 bytes (1.7 K) + 000006:[a#1,1-a#1,1]: 1804 bytes (1.8 K) + 000007:[ab#2,1-ab#2,1]: 1805 bytes (1.8 K) + 000008:[ac#3,1-ac#3,1]: 1805 bytes (1.8 K) + 000013:[ad#8,1-ad#8,1]: 1805 bytes (1.8 K) + 000012:[ad#7,1-ad#7,1]: 1805 bytes (1.8 K) + 000011:[ad#6,1-ad#6,1]: 1805 bytes (1.8 K) + 000010:[ad#5,1-ad#5,1]: 1805 bytes (1.8 K) + 000009:[ad#4,1-ad#4,1]: 1805 bytes (1.8 K) + 000014:[c#9,1-c#9,1]: 1804 bytes (1.8 K) + 000015:[d#10,1-d#10,1]: 1804 bytes (1.8 K) + 000016:[e#11,1-e#11,1]: 1804 bytes (1.8 K) + 000017:[f#12,1-f#12,1]: 1804 bytes (1.8 K) + 000018:[m#13,1-m#13,1]: 1804 bytes (1.8 K) diff --git a/testdata/marked_for_compaction b/testdata/marked_for_compaction index 6d17017e96..fa99de8b58 100644 --- a/testdata/marked_for_compaction +++ b/testdata/marked_for_compaction @@ -20,8 +20,8 @@ marked L0.000004 maybe-compact ---- -[JOB 100] compacted(rewrite) L1 [000005] (779 B) + L1 [] (0 B) -> L1 [000006] (779 B), in 1.0s (2.0s total), output rate 779 B/s -[JOB 100] compacted(rewrite) L0 [000004] (774 B) + L0 [] (0 B) -> L0 [000007] (774 B), in 1.0s (2.0s total), output rate 774 B/s +[JOB 100] compacted(rewrite) L1 [000005] (807 B) + L1 [] (0 B) -> L1 [000006] (807 B), in 1.0s (2.0s total), output rate 807 B/s +[JOB 100] compacted(rewrite) L0 [000004] (802 B) + L0 [] (0 B) -> L0 [000007] (802 B), in 1.0s (2.0s total), output rate 802 B/s 0.0: 000007:[c#11,SET-c#11,SET] points:[c#11,SET-c#11,SET] 1: diff --git a/testdata/metrics b/testdata/metrics index 22c015068a..b30f480dfd 100644 --- a/testdata/metrics +++ b/testdata/metrics @@ -34,7 +34,7 @@ compact 0 0 B 0 B 0 (size == esti zmemtbl 1 256 K ztbl 0 0 B bcache 4 697 B 0.0% (score == hit-rate) - tcache 1 752 B 0.0% (score == hit-rate) + tcache 1 760 B 0.0% (score == hit-rate) snaps 0 - 0 (score == earliest seq num) titers 1 filter - - 0.0% (score == utility) @@ -148,7 +148,7 @@ compact 1 0 B 0 B 0 (size == esti zmemtbl 1 256 K ztbl 1 770 B bcache 4 697 B 42.9% (score == hit-rate) - tcache 1 752 B 66.7% (score == hit-rate) + tcache 1 760 B 66.7% (score == hit-rate) snaps 0 - 0 (score == earliest seq num) titers 1 filter - - 0.0% (score == utility) diff --git a/testdata/table_stats b/testdata/table_stats index d370ee88f2..0768426a95 100644 --- a/testdata/table_stats +++ b/testdata/table_stats @@ -542,5 +542,5 @@ wait-pending-table-stats num-entries: 5 num-deletions: 2 num-range-key-sets: 0 -point-deletions-bytes-estimate: 112974 +point-deletions-bytes-estimate: 113036 range-deletions-bytes-estimate: 0
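One last detail, visible in the compaction_iter golden changes above: an all-ones kind byte is now reported as 191 rather than 255. The read path clears the sstable-internal obsolete marker before interpreting the kind, and 255 &^ 64 = 191, which places the marker at bit 6 of the kind byte. A sketch of that masking, with an illustrative constant name:

package main

import "fmt"

// obsoleteBitMask is bit 6 of the key-kind byte; the name is illustrative,
// not the patch's identifier. Clearing it is why a corrupt kind byte of
// 0xff (255) surfaces as 0xbf (191) in the golden outputs.
const obsoleteBitMask = 64

func stripObsolete(kindByte byte) (kind byte, wasObsolete bool) {
	return kindByte &^ obsoleteBitMask, kindByte&obsoleteBitMask != 0
}

func main() {
	kind, obs := stripObsolete(255)
	fmt.Println(kind, obs) // 191 true
}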