sstable,db: introduce a sstable-internal ObsoleteBit in the key kind
This bit marks keys that are obsolete because they are not the newest
seqnum for their user key (in that sstable), or because they are masked by
a RANGEDEL.

Setting the obsolete bit on point keys is advanced usage, so we support two
modes. Both must be truthful when they do set the obsolete bit, but they
differ in when they are allowed to leave it unset.
- Non-strict: In this mode, the bit does not need to be set for keys that
  are obsolete. Additionally, any sstable containing MERGE keys can only
  use this mode. An iterator over such an sstable, when configured to
  hideObsoletePoints, can expose multiple internal keys per user key, and
  can expose keys that are deleted by RANGEDELs in the same sstable. This
  is the mode that non-advanced users should use. Pebble without
  disaggregated storage will also use this mode, setting the obsolete bit
  on a best-effort basis to optimize iteration when snapshots have retained
  many obsolete keys.

- Strict: In this mode, every obsolete key must have the obsolete bit set,
  and no MERGE keys are permitted. An iterator over such an sstable, when
  configured to hideObsoletePoints, satisfies two properties:
  - S1: it will expose at most one internal key per user key, which is the
    most recent one.
  - S2: it will never expose keys that are deleted by RANGEDELs in the same
    sstable.
  This is the mode for two use cases in disaggregated storage (which will
  exclude the parts of the key space that have MERGEs), for levels whose
  sstables can become foreign sstables:
  - Pebble compaction output to those levels.
  - CockroachDB ingest operations that ingest into those levels. Note that
    these are not the sstables corresponding to copied data for CockroachDB
    range snapshots. This case occurs for operations like index backfills,
    which trivially satisfy the strictness criteria since they only write
    one key per user key.

The strictness of the sstable is written to the Properties block.

The Writer implementation discovers keys that are obsolete because they
have the same user key as the previous key. This is cheap since the Writer
already does user key comparisons. For keys obsoleted by RANGEDELs, the
Writer relies on the caller.
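A minimal sketch of that rule (hypothetical helper, assuming import "bytes"
and using bytes.Equal in place of the configured Comparer; Pebble's actual
logic lives in Writer.AddWithForceObsolete):

    // isObsoletePoint reports whether the point being added should carry
    // the obsolete bit: either it repeats the user key of the previously
    // added point, or the caller has flagged it as covered by a RANGEDEL
    // in the same sstable via forceObsolete.
    func isObsoletePoint(prevUserKey, userKey []byte, forceObsolete bool) bool {
        sameAsPrev := prevUserKey != nil && bytes.Equal(prevUserKey, userKey)
        return sameAsPrev || forceObsolete
    }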

On the read path, the obsolete bit is removed by the blockIter. Since
everything reading an sstable uses a blockIter, this prevents any leakage
of the bit. Some effort was made to reduce the regression on the iteration
path, but TableIterNext still shows a +5.84% regression. Some of the
slowdown is clawed back by improvements to Seek (e.g., SeekGE is now faster).
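For illustration, a sketch of the bit manipulation, using the constant
values added to internal/base/internal.go in this commit (bit = 64,
mask = 191 = 255 &^ 64); the helper names are hypothetical:

    // markObsolete is what the Writer conceptually does to the kind byte of
    // an obsolete point before encoding it into a block.
    func markObsolete(kind uint8) uint8 {
        return kind | 64 // InternalKeyKindSSTableInternalObsoleteBit
    }

    // stripObsolete is what the blockIter conceptually does before surfacing
    // a key, so the bit never escapes the sstable package. The boolean is
    // used to skip the key when hideObsoletePoints is configured.
    func stripObsolete(kind uint8) (logicalKind uint8, obsolete bool) {
        // 191 is InternalKeyKindSSTableInternalObsoleteMask.
        return kind & 191, kind&64 != 0
    }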

old is master:

name                                                                              old time/op    new time/op    delta
BlockIterSeekGE/restart=16-16                                                        474ns ± 1%     450ns ± 1%  -5.16%  (p=0.000 n=10+10)
BlockIterSeekLT/restart=16-16                                                        520ns ± 0%     526ns ± 0%  +1.20%  (p=0.000 n=10+10)
BlockIterNext/restart=16-16                                                         19.3ns ± 1%    21.0ns ± 0%  +8.76%  (p=0.000 n=10+10)
BlockIterPrev/restart=16-16                                                         38.7ns ± 1%    39.9ns ± 0%  +3.20%  (p=0.000 n=9+9)
TableIterSeekGE/restart=16,compression=Snappy-16                                    1.65µs ± 1%    1.61µs ± 3%  -2.24%  (p=0.000 n=9+10)
TableIterSeekGE/restart=16,compression=ZSTD-16                                      1.67µs ± 3%    1.58µs ± 3%  -5.11%  (p=0.000 n=10+10)
TableIterSeekLT/restart=16,compression=Snappy-16                                    1.75µs ± 3%    1.68µs ± 2%  -4.14%  (p=0.000 n=10+9)
TableIterSeekLT/restart=16,compression=ZSTD-16                                      1.74µs ± 3%    1.69µs ± 3%  -2.54%  (p=0.001 n=10+10)
TableIterNext/restart=16,compression=Snappy-16                                      23.9ns ± 1%    25.3ns ± 0%  +5.84%  (p=0.000 n=10+10)
TableIterNext/restart=16,compression=ZSTD-16                                        23.9ns ± 1%    25.3ns ± 0%  +5.78%  (p=0.000 n=10+10)
TableIterPrev/restart=16,compression=Snappy-16                                      45.2ns ± 1%    46.2ns ± 1%  +2.09%  (p=0.000 n=10+10)
TableIterPrev/restart=16,compression=ZSTD-16                                        45.3ns ± 0%    46.3ns ± 0%  +2.23%  (p=0.000 n=8+9)
IteratorScanManyVersions/format=(Pebble,v2)/cache-size=20_M/read-value=false-16     51.7ns ± 1%    55.2ns ± 4%  +6.82%  (p=0.000 n=10+10)
IteratorScanManyVersions/format=(Pebble,v2)/cache-size=20_M/read-value=true-16      54.9ns ± 1%    56.4ns ± 3%  +2.73%  (p=0.000 n=10+10)
IteratorScanManyVersions/format=(Pebble,v2)/cache-size=150_M/read-value=false-16    35.0ns ± 1%    34.8ns ± 1%  -0.56%  (p=0.037 n=10+10)
IteratorScanManyVersions/format=(Pebble,v2)/cache-size=150_M/read-value=true-16     37.8ns ± 0%    38.0ns ± 1%  +0.55%  (p=0.018 n=9+10)
IteratorScanManyVersions/format=(Pebble,v3)/cache-size=20_M/read-value=false-16     41.5ns ± 2%    42.4ns ± 1%  +2.18%  (p=0.000 n=10+10)
IteratorScanManyVersions/format=(Pebble,v3)/cache-size=20_M/read-value=true-16      94.7ns ± 4%    97.0ns ± 8%    ~     (p=0.133 n=9+10)
IteratorScanManyVersions/format=(Pebble,v3)/cache-size=150_M/read-value=false-16    35.4ns ± 2%    36.5ns ± 1%  +2.97%  (p=0.000 n=10+8)
IteratorScanManyVersions/format=(Pebble,v3)/cache-size=150_M/read-value=true-16     60.1ns ± 1%    57.8ns ± 0%  -3.84%  (p=0.000 n=9+9)
IteratorScanNextPrefix/versions=1/method=seek-ge/read-value=false-16                 135ns ± 1%     136ns ± 1%  +0.44%  (p=0.009 n=9+10)
IteratorScanNextPrefix/versions=1/method=seek-ge/read-value=true-16                  139ns ± 0%     139ns ± 0%  +0.48%  (p=0.000 n=10+8)
IteratorScanNextPrefix/versions=1/method=next-prefix/read-value=false-16            34.8ns ± 1%    35.5ns ± 2%  +2.12%  (p=0.000 n=9+10)
IteratorScanNextPrefix/versions=1/method=next-prefix/read-value=true-16             37.6ns ± 0%    38.6ns ± 1%  +2.53%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=2/method=seek-ge/read-value=false-16                 215ns ± 1%     216ns ± 0%    ~     (p=0.341 n=10+10)
IteratorScanNextPrefix/versions=2/method=seek-ge/read-value=true-16                  220ns ± 1%     220ns ± 0%    ~     (p=0.983 n=10+8)
IteratorScanNextPrefix/versions=2/method=next-prefix/read-value=false-16            41.6ns ± 1%    42.6ns ± 2%  +2.42%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=2/method=next-prefix/read-value=true-16             44.6ns ± 1%    45.6ns ± 1%  +2.28%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=10/method=seek-ge/read-value=false-16               2.16µs ± 0%    2.06µs ± 1%  -4.27%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=10/method=seek-ge/read-value=true-16                2.15µs ± 1%    2.07µs ± 0%  -3.71%  (p=0.000 n=9+10)
IteratorScanNextPrefix/versions=10/method=next-prefix/read-value=false-16           94.1ns ± 1%    95.9ns ± 2%  +1.94%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=10/method=next-prefix/read-value=true-16            97.5ns ± 1%    98.2ns ± 1%  +0.69%  (p=0.023 n=10+10)
IteratorScanNextPrefix/versions=100/method=seek-ge/read-value=false-16              2.81µs ± 1%    2.66µs ± 1%  -5.29%  (p=0.000 n=9+10)
IteratorScanNextPrefix/versions=100/method=seek-ge/read-value=true-16               2.82µs ± 1%    2.67µs ± 0%  -5.47%  (p=0.000 n=8+10)
IteratorScanNextPrefix/versions=100/method=next-prefix/read-value=false-16           689ns ± 4%     652ns ± 5%  -5.32%  (p=0.000 n=10+10)
IteratorScanNextPrefix/versions=100/method=next-prefix/read-value=true-16            694ns ± 2%     657ns ± 1%  -5.28%  (p=0.000 n=10+8)

Looking at mergingIter, the Next regression seems tolerable, and SeekGE
is better.

name                                                  old time/op    new time/op    delta
MergingIterSeekGE/restart=16/count=1-16                 1.25µs ± 3%    1.15µs ± 1%  -8.51%  (p=0.000 n=10+10)
MergingIterSeekGE/restart=16/count=2-16                 2.49µs ± 2%    2.28µs ± 2%  -8.39%  (p=0.000 n=10+10)
MergingIterSeekGE/restart=16/count=3-16                 3.82µs ± 3%    3.57µs ± 1%  -6.54%  (p=0.000 n=10+10)
MergingIterSeekGE/restart=16/count=4-16                 5.31µs ± 2%    4.86µs ± 2%  -8.39%  (p=0.000 n=10+10)
MergingIterSeekGE/restart=16/count=5-16                 6.88µs ± 1%    6.36µs ± 2%  -7.49%  (p=0.000 n=10+10)
MergingIterNext/restart=16/count=1-16                   46.0ns ± 1%    46.6ns ± 1%  +1.13%  (p=0.000 n=10+10)
MergingIterNext/restart=16/count=2-16                   72.8ns ± 1%    73.0ns ± 0%    ~     (p=0.363 n=10+10)
MergingIterNext/restart=16/count=3-16                   93.5ns ± 0%    93.1ns ± 1%    ~     (p=0.507 n=10+9)
MergingIterNext/restart=16/count=4-16                    104ns ± 0%     104ns ± 1%    ~     (p=0.078 n=8+10)
MergingIterNext/restart=16/count=5-16                    121ns ± 1%     121ns ± 1%  -0.52%  (p=0.008 n=10+10)
MergingIterPrev/restart=16/count=1-16                   66.6ns ± 1%    67.8ns ± 1%  +1.81%  (p=0.000 n=10+10)
MergingIterPrev/restart=16/count=2-16                   93.2ns ± 0%    94.4ns ± 1%  +1.24%  (p=0.000 n=10+10)
MergingIterPrev/restart=16/count=3-16                    114ns ± 0%     114ns ± 1%  +0.36%  (p=0.032 n=9+10)
MergingIterPrev/restart=16/count=4-16                    122ns ± 1%     123ns ± 0%  +0.41%  (p=0.014 n=10+9)
MergingIterPrev/restart=16/count=5-16                    138ns ± 1%     138ns ± 0%  +0.52%  (p=0.012 n=10+10)
MergingIterSeqSeekGEWithBounds/levelCount=5-16           572ns ± 1%     572ns ± 0%    ~     (p=0.842 n=10+9)
MergingIterSeqSeekPrefixGE/skip=1/use-next=false-16     1.85µs ± 1%    1.76µs ± 1%  -4.85%  (p=0.000 n=10+9)
MergingIterSeqSeekPrefixGE/skip=1/use-next=true-16       443ns ± 0%     444ns ± 1%    ~     (p=0.255 n=10+10)
MergingIterSeqSeekPrefixGE/skip=2/use-next=false-16     1.86µs ± 1%    1.77µs ± 1%  -4.63%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=2/use-next=true-16       486ns ± 1%     482ns ± 1%  -0.80%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=4/use-next=false-16     1.93µs ± 1%    1.83µs ± 1%  -4.95%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=4/use-next=true-16       570ns ± 0%     567ns ± 2%  -0.47%  (p=0.020 n=10+10)
MergingIterSeqSeekPrefixGE/skip=8/use-next=false-16     2.12µs ± 0%    2.03µs ± 1%  -4.38%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=8/use-next=true-16      1.43µs ± 1%    1.39µs ± 1%  -2.57%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=16/use-next=false-16    2.28µs ± 1%    2.18µs ± 0%  -4.54%  (p=0.000 n=10+10)
MergingIterSeqSeekPrefixGE/skip=16/use-next=true-16     1.59µs ± 1%    1.53µs ± 1%  -3.71%  (p=0.000 n=10+9)

Finally, a read benchmark where all keys except the first are obsolete
shows a large improvement.

BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=1_B/hide-obsolete=false-10         	      36	  32300029 ns/op	       2 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=1_B/hide-obsolete=true-10          	      36	  32418979 ns/op	       3 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=150_M/hide-obsolete=false-10       	      82	  13357163 ns/op	       1 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v3)/cache-size=150_M/hide-obsolete=true-10        	      90	  13256770 ns/op	       1 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=1_B/hide-obsolete=false-10         	      36	  32396367 ns/op	       2 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=1_B/hide-obsolete=true-10          	   26086	     46095 ns/op	       0 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=150_M/hide-obsolete=false-10       	      88	  13226711 ns/op	       1 B/op	       0 allocs/op
BenchmarkIteratorScanObsolete/format=(Pebble,v4)/cache-size=150_M/hide-obsolete=true-10        	   39171	     30618 ns/op	       0 B/op	       0 allocs/op

Informs cockroachdb#2465
sumeerbhola committed Jun 1, 2023
1 parent ad14b30 commit 357cda7
Showing 43 changed files with 1,420 additions and 516 deletions.
9 changes: 8 additions & 1 deletion compaction.go
@@ -1415,6 +1415,10 @@ func (c *compaction) newInputIter(
iterOpts := IterOptions{logger: c.logger}
// TODO(bananabrick): Get rid of the extra manifest.Level parameter and fold it into
// compactionLevel.
//
// TODO(bilal): when we start using strict obsolete sstables for L5 and L6
// in disaggregated storage, and rely on the obsolete bit, we will also need
// to configure the levelIter at these levels to hide the obsolete points.
addItersForLevel := func(level *compactionLevel, l manifest.Level) error {
iters = append(iters, newLevelIter(iterOpts, c.cmp, nil /* split */, newIters,
level.files.Iter(), l, &c.bytesIterated))
@@ -3235,7 +3239,10 @@ func (d *DB) runCompaction(
return nil, pendingOutputs, stats, err
}
}
if err := tw.Add(*key, val); err != nil {
// iter.snapshotPinned is broader than whether the point was covered by
// a RANGEDEL, but it is harmless to pass true when the callee will also
// independently discover that the point is obsolete.
if err := tw.AddWithForceObsolete(*key, val, iter.snapshotPinned); err != nil {
return nil, pendingOutputs, stats, err
}
if iter.snapshotPinned {
7 changes: 7 additions & 0 deletions compaction_iter.go
@@ -383,6 +383,13 @@ func (i *compactionIter) Next() (*InternalKey, []byte) {
} else if cover == keyspan.CoversInvisibly {
// i.iterKey would be deleted by a range deletion if there weren't
// any open snapshots. Mark it as pinned.
//
// TODO(sumeer): there are multiple places in this file where we call
// i.rangeDelFrag.Covers and this is the only one where we are fiddling
// with i.snapshotPinned. i.snapshotPinned was previously being used
// only for stats, where a mistake does not lead to corruption. But it
// is also now being used for the forceObsolete bit in
// Writer.AddWithForceObsolete(). Give this more scrutiny.
i.snapshotPinned = true
}

1 change: 1 addition & 0 deletions db.go
@@ -1362,6 +1362,7 @@ func (i *Iterator) constructPointIter(
levelsIndex := len(levels)
mlevels = mlevels[:numMergingLevels]
levels = levels[:numLevelIters]
i.opts.snapshotForHideObsoletePoints = buf.dbi.seqNum
addLevelIterForFiles := func(files manifest.LevelIterator, level manifest.Level) {
li := &levels[levelsIndex]

18 changes: 9 additions & 9 deletions external_iterator.go
@@ -209,15 +209,15 @@ func createExternalPointIter(ctx context.Context, it *Iterator) (internalIterato
pointIter internalIterator
err error
)
pointIter, err = r.NewIterWithBlockPropertyFiltersAndContext(
ctx,
it.opts.LowerBound,
it.opts.UpperBound,
nil, /* BlockPropertiesFilterer */
false, /* useFilterBlock */
&it.stats.InternalStats,
sstable.TrivialReaderProvider{Reader: r},
)
// We could set hideObsoletePoints=true, since we are reading at
// InternalKeySeqNumMax, but we don't bother since these sstables should
// not have obsolete points (so the performance optimization is
// unnecessary), and we don't want to bother constructing a
// BlockPropertiesFilterer that includes obsoleteKeyBlockPropertyFilter.
pointIter, err = r.NewIterWithBlockPropertyFiltersAndContextEtc(
ctx, it.opts.LowerBound, it.opts.UpperBound, nil, /* BlockPropertiesFilterer */
false /* hideObsoletePoints */, false, /* useFilterBlock */
&it.stats.InternalStats, sstable.TrivialReaderProvider{Reader: r})
if err != nil {
return nil, err
}
4 changes: 2 additions & 2 deletions get_iter.go
@@ -158,7 +158,7 @@ func (g *getIter) Next() (*InternalKey, base.LazyValue) {
if n := len(g.l0); n > 0 {
files := g.l0[n-1].Iter()
g.l0 = g.l0[:n-1]
iterOpts := IterOptions{logger: g.logger}
iterOpts := IterOptions{logger: g.logger, snapshotForHideObsoletePoints: g.snapshot}
g.levelIter.init(context.Background(), iterOpts, g.cmp, nil /* split */, g.newIters,
files, manifest.L0Sublevel(n), internalIterOpts{})
g.levelIter.initRangeDel(&g.rangeDelIter)
@@ -177,7 +177,7 @@ func (g *getIter) Next() (*InternalKey, base.LazyValue) {
continue
}

iterOpts := IterOptions{logger: g.logger}
iterOpts := IterOptions{logger: g.logger, snapshotForHideObsoletePoints: g.snapshot}
g.levelIter.init(context.Background(), iterOpts, g.cmp, nil /* split */, g.newIters,
g.version.Levels[g.level].Iter(), manifest.Level(g.level), internalIterOpts{})
g.levelIter.initRangeDel(&g.rangeDelIter)
22 changes: 19 additions & 3 deletions internal/base/internal.go
@@ -24,6 +24,14 @@ const (
//InternalKeyKindColumnFamilyDeletion InternalKeyKind = 4
//InternalKeyKindColumnFamilyValue InternalKeyKind = 5
//InternalKeyKindColumnFamilyMerge InternalKeyKind = 6

// InternalKeyKindSingleDelete (SINGLEDEL) is a performance optimization
// solely for compactions (to reduce write amp and space amp). Readers other
// than compactions should treat SINGLEDEL as equivalent to a DEL.
// Historically, it was simpler for readers other than compactions to treat
// SINGLEDEL as equivalent to DEL, but as of the introduction of
// InternalKeyKindSSTableInternalObsoleteBit, this is also necessary for
// correctness.
InternalKeyKindSingleDelete InternalKeyKind = 7
//InternalKeyKindColumnFamilySingleDelete InternalKeyKind = 8
//InternalKeyKindBeginPrepareXID InternalKeyKind = 9
@@ -71,7 +79,7 @@ const (
// value indicating the (len(key)+len(value)) of the shadowed entry the
// tombstone is expected to delete. This value is used to inform compaction
// heuristics, but is not required to be accurate for correctness.
InternalKeyKindDeleteSized = 23
InternalKeyKindDeleteSized InternalKeyKind = 23

// This maximum value isn't part of the file format. Future extensions may
// increase this value.
Expand All @@ -84,12 +92,17 @@ const (
// seqNum.
InternalKeyKindMax InternalKeyKind = 23

// Internal to the sstable format. Not exposed by any sstable iterator.
// Declared here to prevent definition of valid key kinds that set this bit.
InternalKeyKindSSTableInternalObsoleteBit InternalKeyKind = 64
InternalKeyKindSSTableInternalObsoleteMask InternalKeyKind = 191

// InternalKeyZeroSeqnumMaxTrailer is the largest trailer with a
// zero sequence number.
InternalKeyZeroSeqnumMaxTrailer = uint64(InternalKeyKindInvalid)
InternalKeyZeroSeqnumMaxTrailer = uint64(255)

// A marker for an invalid key.
InternalKeyKindInvalid InternalKeyKind = 255
InternalKeyKindInvalid InternalKeyKind = InternalKeyKindSSTableInternalObsoleteMask

// InternalKeySeqNumBatch is a bit that is set on batch sequence numbers
// which prevents those entries from being excluded from iteration.
@@ -112,6 +125,9 @@ const (
InternalKeyBoundaryRangeKey = (InternalKeySeqNumMax << 8) | uint64(InternalKeyKindRangeKeySet)
)

// Assert InternalKeyKindSSTableInternalObsoleteBit > InternalKeyKindMax
const _ = uint(InternalKeyKindSSTableInternalObsoleteBit - InternalKeyKindMax - 1)

var internalKeyKindNames = []string{
InternalKeyKindDelete: "DEL",
InternalKeyKindSet: "SET",
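The "const _ = uint(...)" line above is a compile-time assertion: if
InternalKeyKindSSTableInternalObsoleteBit were not strictly greater than
InternalKeyKindMax, the constant expression would be negative and could not
be converted to an unsigned type, so the package would fail to compile. A
standalone sketch of the same trick (hypothetical constants):

    const bit, max = 64, 23
    // Compiles only because bit > max; with bit <= max the compiler rejects
    // this line with "constant ... overflows uint".
    const _ = uint(bit - max - 1)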
10 changes: 10 additions & 0 deletions iterator.go
@@ -569,6 +569,9 @@ func (i *Iterator) findNextEntry(limit []byte) {
return

case InternalKeyKindDelete, InternalKeyKindSingleDelete, InternalKeyKindDeleteSized:
// NB: treating InternalKeyKindSingleDelete as equivalent to DEL is not
// only simpler, but is also necessary for correctness due to
// InternalKeyKindSSTableInternalObsoleteBit.
i.nextUserKey()
continue

@@ -632,6 +635,9 @@ func (i *Iterator) nextPointCurrentUserKey() bool {
return false

case InternalKeyKindDelete, InternalKeyKindSingleDelete, InternalKeyKindDeleteSized:
// NB: treating InternalKeyKindSingleDelete as equivalent to DEL is not
// only simpler, but is also necessary for correctness due to
// InternalKeyKindSSTableInternalObsoleteBit.
return false

case InternalKeyKindSet, InternalKeyKindSetWithDelete:
@@ -1095,6 +1101,10 @@ func (i *Iterator) mergeNext(key InternalKey, valueMerger ValueMerger) {
case InternalKeyKindDelete, InternalKeyKindSingleDelete, InternalKeyKindDeleteSized:
// We've hit a deletion tombstone. Return everything up to this
// point.
//
// NB: treating InternalKeyKindSingleDelete as equivalent to DEL is not
// only simpler, but is also necessary for correctness due to
// InternalKeyKindSSTableInternalObsoleteBit.
return

case InternalKeyKindSet, InternalKeyKindSetWithDelete:
9 changes: 9 additions & 0 deletions level_iter.go
@@ -200,6 +200,11 @@ type levelIter struct {
// cache when constructing new table iterators.
internalOpts internalIterOpts

// Scratch space for the obsolete keys filter, when there are no other block
// property filters specified. See the performance note where
// IterOptions.PointKeyFilters is declared.
filtersBuf [1]BlockPropertyFilter

// Disable invariant checks even if they are otherwise enabled. Used by tests
// which construct "impossible" situations (e.g. seeking to a key before the
// lower bound).
@@ -267,8 +272,12 @@ func (l *levelIter) init(
l.upper = opts.UpperBound
l.tableOpts.TableFilter = opts.TableFilter
l.tableOpts.PointKeyFilters = opts.PointKeyFilters
if len(opts.PointKeyFilters) == 0 {
l.tableOpts.PointKeyFilters = l.filtersBuf[:0:1]
}
l.tableOpts.UseL6Filters = opts.UseL6Filters
l.tableOpts.level = l.level
l.tableOpts.snapshotForHideObsoletePoints = opts.snapshotForHideObsoletePoints
l.cmp = cmp
l.split = split
l.iterFile = nil
4 changes: 2 additions & 2 deletions level_iter_test.go
@@ -163,8 +163,8 @@ func (lt *levelIterTest) newIters(
ctx context.Context, file *manifest.FileMetadata, opts *IterOptions, iio internalIterOpts,
) (internalIterator, keyspan.FragmentIterator, error) {
lt.itersCreated++
iter, err := lt.readers[file.FileNum].NewIterWithBlockPropertyFiltersAndContext(
ctx, opts.LowerBound, opts.UpperBound, nil, true, iio.stats,
iter, err := lt.readers[file.FileNum].NewIterWithBlockPropertyFiltersAndContextEtc(
ctx, opts.LowerBound, opts.UpperBound, nil, false, true, iio.stats,
sstable.TrivialReaderProvider{Reader: lt.readers[file.FileNum]})
if err != nil {
return nil, nil, err
42 changes: 29 additions & 13 deletions options.go
@@ -119,10 +119,13 @@ type IterOptions struct {
// function can be used by multiple iterators, if the iterator is cloned.
TableFilter func(userProps map[string]string) bool
// PointKeyFilters can be used to avoid scanning tables and blocks in tables
// when iterating over point keys. It is requires that this slice is sorted in
// increasing order of the BlockPropertyFilter.ShortID. This slice represents
// an intersection across all filters, i.e., all filters must indicate that the
// block is relevant.
// when iterating over point keys. This slice represents an intersection
// across all filters, i.e., all filters must indicate that the block is
// relevant.
//
// Performance note: When len(PointKeyFilters) > 0, the caller should ensure
// that cap(PointKeyFilters) is at least len(PointKeyFilters)+1. This helps
// avoid allocations in Pebble internal code that mutates the slice.
PointKeyFilters []BlockPropertyFilter
// RangeKeyFilters can be used to avoid scanning tables and blocks in tables
// when iterating over range keys. The same requirements that apply to
@@ -181,6 +184,10 @@ type IterOptions struct {
level manifest.Level
// disableLazyCombinedIteration is an internal testing option.
disableLazyCombinedIteration bool
// snapshotForHideObsoletePoints is specified for/by levelIter when opening
// files and is used to decide whether to hide obsolete points. A value of 0
// implies obsolete points should not be hidden.
snapshotForHideObsoletePoints uint64

// NB: If adding new Options, you must account for them in iterator
// construction and Iterator.SetOptions.
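A caller-side illustration of the performance note on PointKeyFilters above
(myFilters is a hypothetical slice of filters the caller already has; the
extra capacity slot lets Pebble append its internal obsolete-key
block-property filter without reallocating):

    opts := &pebble.IterOptions{
        // Reserve one extra capacity slot beyond the caller's own filters.
        PointKeyFilters: append(
            make([]pebble.BlockPropertyFilter, 0, len(myFilters)+1),
            myFilters...),
    }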
@@ -632,18 +639,24 @@ type Options struct {
ShortAttributeExtractor ShortAttributeExtractor

// RequiredInPlaceValueBound specifies an optional span of user key
// prefixes for which the values must be stored with the key. This is
// useful for statically known exclusions to value separation. In
// CockroachDB, this will be used for the lock table key space that has
// non-empty suffixes, but those locks don't represent actual MVCC
// versions (the suffix ordering is arbitrary). We will also need to add
// support for dynamically configured exclusions (we want the default to
// be to allow Pebble to decide whether to separate the value or not,
// hence this is structured as exclusions), for example, for users of
// CockroachDB to dynamically exclude certain tables.
// prefixes that are not-MVCC, but have a suffix. For these the values
// must be stored with the key, since the concept of "older versions" is
// not defined. It is also useful for statically known exclusions to value
// separation. In CockroachDB, this will be used for the lock table key
// space that has non-empty suffixes, but those locks don't represent
// actual MVCC versions (the suffix ordering is arbitrary). We will also
// need to add support for dynamically configured exclusions (we want the
// default to be to allow Pebble to decide whether to separate the value
// or not, hence this is structured as exclusions), for example, for users
// of CockroachDB to dynamically exclude certain tables.
//
// Any change in exclusion behavior takes effect only on future written
// sstables, and does not start rewriting existing sstables.
//
// Even ignoring changes in this setting, exclusions are interpreted as a
// guidance by Pebble, and not necessarily honored. Specifically, user
// keys with multiple Pebble-versions *may* have the older versions stored
// in value blocks.
RequiredInPlaceValueBound UserKeyPrefixBound

// DisableIngestAsFlushable disables lazy ingestion of sstables through
@@ -1642,6 +1655,9 @@ func (o *Options) MakeWriterOptions(level int, format sstable.TableFormat) sstab
if format >= sstable.TableFormatPebblev3 {
writerOpts.ShortAttributeExtractor = o.Experimental.ShortAttributeExtractor
writerOpts.RequiredInPlaceValueBound = o.Experimental.RequiredInPlaceValueBound
if format >= sstable.TableFormatPebblev4 && level == numLevels-1 {
writerOpts.WritingToLowestLevel = true
}
}
levelOpts := o.Level(level)
writerOpts.BlockRestartInterval = levelOpts.BlockRestartInterval
1 change: 1 addition & 0 deletions scan_internal.go
@@ -977,6 +977,7 @@ func (i *scanInternalIterator) constructPointIter(memtables flushableList, buf *
mlevels = mlevels[:numMergingLevels]
levels = levels[:numLevelIters]
rangeDelLevels = rangeDelLevels[:numLevelIters]
i.opts.IterOptions.snapshotForHideObsoletePoints = i.seqNum
addLevelIterForFiles := func(files manifest.LevelIterator, level manifest.Level) {
li := &levels[levelsIndex]
rli := &rangeDelLevels[levelsIndex]