
db: copy user-provided bounds and optimize unchanged bounds #1675

Merged: 2 commits into cockroachdb:master from jbowens:bounds-copy on May 13, 2022

Conversation

@jbowens (Collaborator) commented on Apr 28, 2022

When a new iterator is constructed or when an existing iterator's bounds are
modified, copy the bounds into a Pebble-allocated slice. This gives Pebble
control over the lifetime of the bound buffers and allows Pebble to restore the
optimization for unchanged bounds that was removed in #1073.

Two buffers are used, one for the current bounds and one for old bounds. This
scheme allows low-level internal iterators in Pebble to compare new and old
bounds so that context-dependent optimizations may be applied.

These bounds buffers are pooled on the iterAlloc to reduce allocations.

These new semantics are simpler and allow Pebble's SetOptions to perform more
optimizations. They come at the cost of the additional key copies, regardless
of whether the Iterator's bounds will change.

Alternative to #1667.
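For illustration, here is a minimal Go sketch of the double-buffered copy scheme described above. The type and method names are assumptions made for the sketch, not Pebble's actual identifiers, and the real implementation additionally pools these buffers on the iterAlloc.

```go
// Package boundscopy sketches the double-buffered bounds-copying scheme.
package boundscopy

import "bytes"

// bounds owns two reusable buffers: buf[bufIdx] backs the current bounds and
// buf[1-bufIdx] still holds the previous bounds after a flip.
type bounds struct {
	lower, upper []byte
	buf          [2][]byte
	bufIdx       int
}

// setBounds copies the caller-provided bounds into the inactive buffer and
// flips the active index, so callers may reuse their slices immediately while
// lower-level iterators can still compare old and new bounds.
func (b *bounds) setBounds(lower, upper []byte) {
	i := 1 - b.bufIdx
	buf := b.buf[i][:0]
	buf = append(buf, lower...)
	buf = append(buf, upper...)
	b.buf[i], b.bufIdx = buf, i
	b.lower, b.upper = nil, nil
	if lower != nil {
		b.lower = buf[:len(lower):len(lower)]
	}
	if upper != nil {
		b.upper = buf[len(lower):]
	}
}

// unchanged reports whether the proposed bounds equal the current ones,
// enabling the unchanged-bounds fast path that #1073 removed and this change
// restores.
func (b *bounds) unchanged(lower, upper []byte) bool {
	return bytes.Equal(b.lower, lower) && bytes.Equal(b.upper, upper)
}
```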


@sumeerbhola (Collaborator) left a comment

We seem to have gone down a rabbit hole where we are having to make performance tradeoffs, or worry about which use cases we may miss when measuring for performance regressions. Can we avoid all these concerns?
We currently have a setup where:

  • Case 1: bounds have changed. There is one set of key copies in CockroachDB.
  • Case 2: bounds have not changed. There are zero key copies.

I can't actually remember where case 2 occurs. There is definitely MVCCGet, which uses a prefix iter and repeatedly calls setBounds(nil, nil) for each GetRequest contained in a BatchRequest -- this is not interesting. ScanRequests should fall into case 1. @erikgrinaker do you remember?

So let's focus on case 1. CockroachDB has to copy in order to convert from roachpb.Key to a Pebble user key. Since CockroachDB is making a promise of key stability until after the next SetBounds call, can we do something like the following in Pebble?

  • SetBounds has changed bounds: don't copy, since the caller has promised key stability. Remember that the bounds were not copied.
  • SetBounds does not change bounds: if the not-copied bit is not set, this is a no-op. If the not-copied bit is set, make a copy and pass it down as SetBounds(lower, upper, force-change=true), to disable any lower-level optimizations that seek using next (sketched below).

I realize this seems complicated. I am not suggesting we actually do it. But I am also searching for something with limited fallout.
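For concreteness, a rough Go sketch of that hypothetical scheme. It was not adopted, and every name in it, including the forceChange parameter, is illustrative rather than anything that exists in Pebble:

```go
package notcopied

import "bytes"

// childIter stands in for a lower-level internal iterator; forceChange is the
// hypothetical extra parameter that disables seek-using-next optimizations.
type childIter interface {
	SetBounds(lower, upper []byte, forceChange bool)
}

type iter struct {
	lower, upper []byte
	notCopied    bool // set when lower/upper alias caller-owned memory
	child        childIter
}

func (i *iter) SetBounds(lower, upper []byte) {
	if !bytes.Equal(lower, i.lower) || !bytes.Equal(upper, i.upper) {
		// Bounds changed: rely on the caller's promise of key stability until
		// the next SetBounds call, retain the slices without copying, and
		// remember that we did not copy.
		i.lower, i.upper = lower, upper
		i.notCopied = true
		i.child.SetBounds(lower, upper, false)
		return
	}
	if i.notCopied {
		// Bounds are byte-wise unchanged, but the caller's buffers may be
		// reused after this call: copy now and force lower levels to treat
		// the bounds as changed, disabling next-based seek optimizations.
		i.lower = append([]byte(nil), lower...)
		i.upper = append([]byte(nil), upper...)
		i.notCopied = false
		i.child.SetBounds(i.lower, i.upper, true)
	}
	// Otherwise the bounds are already owned copies of identical bytes: a
	// true no-op.
}
```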

Reviewable status: 0 of 8 files reviewed, all discussions resolved

@jbowens (Collaborator, Author) left a comment

> I am not suggesting we actually do it. But I am also searching for something with limited fallout.

I think #1667 is still viable to preserve the same performance characteristics for both cases (with the caveat that tracking down the metamorphic test failures may yet reveal an overlooked reason that the option is fundamentally flawed). This bounds-copying alternative seeks to address a concern about complexity and fragility rather than to shift a performance tradeoff.

I think we should try to get some sense of the actual impact of copying these bounds before we try to avoid it.

Reviewable status: 0 of 8 files reviewed, all discussions resolved

@jbowens requested a review from @erikgrinaker on April 28, 2022, 21:53
@erikgrinaker (Contributor) left a comment

> I can't actually remember where case 2 occurs. There is definitely MVCCGet, which uses a prefix iter and repeatedly calls setBounds(nil, nil) for each GetRequest contained in a BatchRequest -- this is not interesting. ScanRequests should fall into case 1. @erikgrinaker do you remember?

I was going to say the intent interleaving iterator, but it clearly uses different bounds for the engine iterator and cloned MVCC iterator.

Another example would be MVCCIncrementalIterator, where we'll be setting up a TBI and a normal iter with the same bounds, cloning via the batch/readonly. But I don't think it's used in any code paths that are hot enough to matter.

In general, I think we should initially focus on scans and gets (puts can be considered a variant of gets), since those are the most performance critical.

> I think we should try to get some sense of the actual impact of copying these bounds before we try to avoid it.

Yep, I'll run some quick tests.

Reviewable status: 0 of 8 files reviewed, all discussions resolved

@erikgrinaker (Contributor) commented:

I ran some benchmarks for scans and gets, using a single batch that clones an iterator and calls SetOptions() on it for every operation. Gets use Prefix: true and nil bounds, scans use random upper/lower bounds for every operation. I compared this PR with its immediate parent, on top of cockroachdb/cockroach#79993.

name                                                 old time/op    new time/op    delta
MVCCScan_Pebble/rows=1/versions=1/valueSize=8-24       3.37µs ± 2%    3.31µs ± 1%   -1.54%  (p=0.000 n=19+19)
MVCCScan_Pebble/rows=10/versions=1/valueSize=8-24      6.56µs ± 2%    6.53µs ± 1%   -0.54%  (p=0.013 n=18+18)
MVCCScan_Pebble/rows=100/versions=1/valueSize=8-24     30.3µs ± 2%    30.5µs ± 2%   +0.72%  (p=0.015 n=20+20)
MVCCScan_Pebble/rows=1000/versions=1/valueSize=8-24     257µs ± 1%     257µs ± 2%     ~     (p=0.517 n=17+20)
MVCCGet_Pebble/versions=1/valueSize=8-24               2.68µs ± 1%    2.91µs ± 1%   +8.64%  (p=0.000 n=20+16)
MVCCGet_Pebble/versions=10/valueSize=8-24              4.07µs ± 2%    4.39µs ± 3%   +7.94%  (p=0.000 n=18+19)
MVCCGet_Pebble/versions=100/valueSize=8-24             10.2µs ± 4%    11.4µs ± 5%  +11.46%  (p=0.000 n=20+20)

I would expect the copy overhead to show up in the scan benchmarks, but it appears to be negligible, so I don't think the copy is a problem in itself.

For gets, I looked at profile diffs, and it seemed like it was mostly down to SeekGE overhead. That's likely because this affects the lower-level seek optimization, and should be orthogonal to the copies themselves. Profile diff below:

[Screenshot: CPU profile diff, 2022-04-29 17:24]

@jbowens (Collaborator, Author) left a comment

> For gets, I looked at profile diffs, and it seemed like it was mostly down to SeekGE overhead. That's likely because this affects the lower-level seek optimization, and should be orthogonal to the copies themselves.

I agree it looks like we're losing the seek optimizations, but I don't understand why? Do you mind pushing a branch that I can experiment with?

Reviewable status: 0 of 8 files reviewed, all discussions resolved

@erikgrinaker (Contributor) left a comment

Sure, here you go:

erikgrinaker/cockroach@mvcc-range-tombstones-iterator-buffer-reuse...erikgrinaker:setoptions-benchmarks

E.g.:

$ dev bench --ignore-cache pkg/storage -f BenchmarkMVCCGet_Pebble -v --count 10
$ dev bench --ignore-cache pkg/storage -f '^BenchmarkMVCCScan_Pebble$' -v --count 10
$ dev bench --ignore-cache pkg/storage -f BenchmarkMVCCGet_Pebble --test-args '-test.cpuprofile=get.pprof'

Reviewable status: 0 of 8 files reviewed, all discussions resolved

@jbowens (Collaborator, Author) left a comment

Oh, is the original not reading through a batch? It makes sense that we're losing the seek-using-next optimization if we're reading through a batch now, but we should be able to claw that back with #1640. When SetOptions is called and the iterator's batch doesn't have any new mutations, we should still be able to apply the seek-using-next optimization.

Reviewable status: 0 of 8 files reviewed, all discussions resolved

@erikgrinaker (Contributor) left a comment

> Oh, is the original not reading through a batch? It makes sense that we're losing the seek-using-next optimization if we're reading through a batch now, but we should be able to claw that back with #1640. When SetOptions is called and the iterator's batch doesn't have any new mutations, we should still be able to apply the seek-using-next optimization.

They're both reading through a batch -- the old benchmark was run after the commit that changed the benchmark to use a batch. However, the old benchmark omitted the SetOptions call in CRDB when possible, while the new benchmark doesn't (not sure if it actually triggered here though since these were cloned iterators). If we additionally need #1675 to retain the SeekGE optimization in these cases then that makes sense.

In any case, for the copying I think the single-row scan is the benchmark we care about, since that uses random bounds for every scan. That looks viable based on these results at least.

Would it make sense to combine this and #1640 for a final benchmark before we merge this, just to verify that we're not regressing somehow?

Reviewable status: 0 of 8 files reviewed, all discussions resolved

@jbowens (Collaborator, Author) left a comment

Yeah, good call. This branch has the combination of the batch semantics and the bounds copy, preserving the seek-using-next optimizations with batches when possible: https://github.com/cockroachdb/pebble/compare/master...jbowens:bounds-copy-and-batch-semantics?expand=1

I'll try to compare it with the branch you linked above this afternoon.

Reviewable status: 0 of 8 files reviewed, all discussions resolved

@erikgrinaker (Contributor) commented:

Thanks! Looks like we still have the get regression though:

MVCCGet_Pebble/versions=1/valueSize=8-24    2.73µs ± 1%    2.94µs ± 1%  +7.49%  (p=0.000 n=10+9)

@jbowens force-pushed the bounds-copy branch 2 times, most recently from 18f80d4 to 4c98b0b on May 5, 2022, 14:46
@jbowens (Collaborator, Author) left a comment

I think the regression was actually additional overhead in the SetOptions noop case. Increasing the benchmark count when taking the cpu profile produced a more representative cpu profile. In that MVCCGet benchmark, we're testing the noop case since each iterator is configured with the same options.

The SetOptions function has to perform slightly more work to diff two IterOptions because pebble.IterOptions now exposes a larger surface area that must be compared. I was able to get it down to a ~3% regression by restructuring some conditionals. I might be able to get it down more.
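To illustrate the kind of conditional restructuring meant here, a sketch of a no-op check that compares cheap scalar fields before the byte-wise bounds comparison. iterOptions is a simplified stand-in for pebble.IterOptions, and the function name is made up:

```go
package noopcheck

import "bytes"

// iterOptions is a simplified stand-in for pebble.IterOptions; only the
// fields relevant to this sketch are included.
type iterOptions struct {
	LowerBound, UpperBound    []byte
	KeyTypes                  uint8
	OnlyReadGuaranteedDurable bool
}

// setOptionsIsNoop compares cheap scalar fields first and leaves the
// byte-wise bounds comparison for last, so the common no-op case (e.g. the
// MVCCGet benchmark, where every iterator uses the same options) stays cheap.
func setOptionsIsNoop(cur, next *iterOptions) bool {
	if cur.KeyTypes != next.KeyTypes ||
		cur.OnlyReadGuaranteedDurable != next.OnlyReadGuaranteedDurable {
		return false
	}
	return bytes.Equal(cur.LowerBound, next.LowerBound) &&
		bytes.Equal(cur.UpperBound, next.UpperBound)
}
```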

name                                      old time/op    new time/op    delta
MVCCGet_Pebble/versions=1/valueSize=8-10    2.26µs ± 6%    2.20µs ± 5%  -2.89%  (p=0.000 n=49+50)

name                                      old speed      new speed      delta
MVCCGet_Pebble/versions=1/valueSize=8-10  3.54MB/s ± 6%  3.64MB/s ± 5%  +2.97%  (p=0.000 n=49+50)

Reviewable status: 0 of 8 files reviewed, all discussions resolved

@jbowens (Collaborator, Author) left a comment

Errr, what am I saying, that's faster? Mind double checking that benchmark?

Reviewable status: 0 of 8 files reviewed, all discussions resolved

@jbowens (Collaborator, Author) left a comment

Combined with the changes from #1640, there is a regression because we also now need to detect whether the batch has been mutated. Let me see if I can't get that down.

name                                      old time/op    new time/op    delta
MVCCGet_Pebble/versions=1/valueSize=8-10    2.26µs ± 6%    2.37µs ± 7%  +4.76%  (p=0.000 n=49+49)

name                                      old speed      new speed      delta
MVCCGet_Pebble/versions=1/valueSize=8-10  3.54MB/s ± 6%  3.38MB/s ± 6%  -4.55%  (p=0.000 n=49+49)

Reviewable status: 0 of 8 files reviewed, all discussions resolved

@jbowens (Collaborator, Author) left a comment

With 2cd336795dae99c4fd422ca9db40cdf3c0ec81de on master...jbowens:bounds-copy-and-batch-semantics I got the delta back down to a minor performance win:

jackson@jacksonmbp cockroach % benchstat old.txt e5dabf12.txt
name                                      old time/op    new time/op    delta
MVCCGet_Pebble/versions=1/valueSize=8-10    2.26µs ± 6%    2.23µs ± 2%  -1.36%  (p=0.014 n=49+49)

name                                      old speed      new speed      delta
MVCCGet_Pebble/versions=1/valueSize=8-10  3.54MB/s ± 6%  3.59MB/s ± 1%  +1.35%  (p=0.010 n=49+48)

Reviewable status: 0 of 8 files reviewed, all discussions resolved

@erikgrinaker (Contributor) left a comment

Fantastic news, thanks for digging into this @jbowens! ❤️

Do the scan benchmarks look fine as well? If so, I think we're essentially good to go here.

Reviewable status: 0 of 10 files reviewed, all discussions resolved

@jbowens (Collaborator, Author) left a comment

Oof, I'm learning that these benchmarks are unstable and highly sensitive to the count. There's enough state (e.g., in iterator reuse) that running with --count 50 is dramatically different from running with --count 10.

Comparing old and new with --count 20, there's still a MVCCGet regression:

name                                                 old time/op    new time/op     delta
MVCCScan_Pebble/rows=1/versions=1/valueSize=8-10       3.28µs ± 3%     2.45µs ± 1%   -25.53%  (p=0.000 n=20+20)
MVCCScan_Pebble/rows=10/versions=1/valueSize=8-10      13.9µs ± 3%      4.8µs ± 2%   -65.64%  (p=0.000 n=20+20)
MVCCScan_Pebble/rows=100/versions=1/valueSize=8-10      113µs ± 2%       23µs ± 1%   -79.32%  (p=0.000 n=20+20)
MVCCScan_Pebble/rows=1000/versions=1/valueSize=8-10    1.09ms ± 1%     0.20ms ± 0%   -81.57%  (p=0.000 n=19+20)
MVCCGet_Pebble/versions=1/valueSize=8-10               1.99µs ± 2%     2.28µs ± 5%   +14.45%  (p=0.000 n=19+20)

name                                                 old speed      new speed       delta
MVCCScan_Pebble/rows=1/versions=1/valueSize=8-10     2.43MB/s ± 3%   3.27MB/s ± 1%   +34.35%  (p=0.000 n=20+20)
MVCCScan_Pebble/rows=10/versions=1/valueSize=8-10    5.76MB/s ± 3%  16.76MB/s ± 2%  +190.97%  (p=0.000 n=20+20)
MVCCScan_Pebble/rows=100/versions=1/valueSize=8-10   7.07MB/s ± 2%  34.17MB/s ± 1%  +383.51%  (p=0.000 n=20+20)
MVCCScan_Pebble/rows=1000/versions=1/valueSize=8-10  7.31MB/s ± 1%  39.69MB/s ± 0%  +442.70%  (p=0.000 n=19+20)
MVCCGet_Pebble/versions=1/valueSize=8-10             4.01MB/s ± 4%   3.52MB/s ± 5%   -12.40%  (p=0.000 n=20+20)

Diffing cpu profiles points to keyspan.SeekGE, the function used for seeking among range deletion tombstones and range keys.

Showing top 20 nodes out of 908
      flat  flat%   sum%        cum   cum%
     0.77s  2.26%  2.26%      1.14s  3.36%  github.com/cockroachdb/pebble/internal/keyspan.SeekGE
     0.76s  2.24%  4.50%      0.76s  2.24%  syscall.syscall6
     0.64s  1.88%  6.38%      0.63s  1.85%  syscall.syscall
    -0.35s  1.04%  5.33%     -1.32s  3.89%  github.com/cockroachdb/pebble/sstable.(*blockIter).SeekGE
    -0.35s  1.04%  4.29%     -0.45s  1.34%  github.com/cockroachdb/cockroach/pkg/storage.mvccGet
    -0.32s  0.94%  3.35%     -0.30s  0.87%  github.com/cockroachdb/pebble/sstable.(*blockIter).readEntry
    -0.32s  0.94%  2.41%     -0.56s  1.64%  github.com/cockroachdb/cockroach/pkg/storage.EngineKeyCompare
    -0.28s  0.82%  1.59%     -0.45s  1.31%  github.com/cockroachdb/pebble/sstable.(*blockIter).Next

I think this is a consequence of #1640. Previously a batch without any range deletions did not require a batch range deletion iterator to be added to the iterator. Now, we must add it to the iterator regardless, because we need a container to populate with the range deletions in case one is written to the batch later.

I think this is necessary and possibly something we can optimize down in subsequent PRs.

Reviewable status: 0 of 10 files reviewed, all discussions resolved

@erikgrinaker (Contributor) left a comment

> Oof, I'm learning that these benchmarks are unstable and highly sensitive to the count. There's enough state (e.g., in iterator reuse) that running with --count 50 is dramatically different from running with --count 10.

Huh, I thought each benchmark would be an independent run of the benchmark binary or something.

Anyway, this still isn't terrible -- in fact, the scan numbers are pretty fantastic (are we sure they're real?). If the gets are better than baseline for high counts that still seems like it might be a win.

> I think this is necessary and possibly something we can optimize down in subsequent PRs.

Ok. Let me do a couple of quick tests here myself (might have to wait until tomorrow). As long as we can land this in CRDB master (that doesn't have any of the range key changes yet) without any severe regressions on end-to-end benchmarks then I'm fine with micro-optimizing this later.

Reviewable status: 0 of 10 files reviewed, all discussions resolved

@jbowens (Collaborator, Author) left a comment

> (are we sure they're real?)

It'd be good to verify, but I think the wins are coming from the fact that with #1640 we can now use the try-seek-using-next optimization when we're reading through batches too, so long as SetOptions wasn't just called and needed to invalidate the iterator because there are new mutations to the batch. So they're "real" for the narrow access pattern of that benchmark, but I'm not sure whether they're representative of any win for real production workloads.

If I run the benchmarks once per process, the variance is high enough that I'm not getting a bump recognized as statistically significant:

name                                                 old time/op    new time/op    delta
MVCCScan_Pebble/rows=1/versions=1/valueSize=8-10       2.70µs ±23%    2.53µs ± 7%    ~     (p=0.780 n=40+40)
MVCCScan_Pebble/rows=10/versions=1/valueSize=8-10      8.20µs ±68%    5.99µs ±36%    ~     (p=0.459 n=40+40)
MVCCScan_Pebble/rows=100/versions=1/valueSize=8-10     58.0µs ±94%    36.4µs ±60%    ~     (p=0.262 n=40+40)
MVCCScan_Pebble/rows=1000/versions=1/valueSize=8-10     547µs ±98%     328µs ±60%    ~     (p=0.289 n=40+40)
MVCCGet_Pebble/versions=1/valueSize=8-10               2.04µs ± 4%    2.20µs ± 2%  +7.58%  (p=0.008 n=5+5)

name                                                 old speed      new speed      delta
MVCCScan_Pebble/rows=1/versions=1/valueSize=8-10     3.03MB/s ±20%  3.17MB/s ± 7%    ~     (p=0.776 n=40+40)
MVCCScan_Pebble/rows=10/versions=1/valueSize=8-10    12.7MB/s ±54%  14.3MB/s ±31%    ~     (p=0.459 n=40+40)
MVCCScan_Pebble/rows=100/versions=1/valueSize=8-10   23.4MB/s ±70%  26.3MB/s ±48%    ~     (p=0.264 n=40+40)
MVCCScan_Pebble/rows=1000/versions=1/valueSize=8-10  26.6MB/s ±72%  30.0MB/s ±49%    ~     (p=0.288 n=40+40)
MVCCGet_Pebble/versions=1/valueSize=8-10             3.92MB/s ± 4%  3.64MB/s ± 2%  -7.09%  (p=0.008 n=5+5)

> Ok. Let me do a couple of quick tests here myself (might have to wait until tomorrow).

Thanks, appreciate it. It'd be good to verify I'm not doing something silly collecting the scan benchmarks too.

Reviewable status: 0 of 10 files reviewed, all discussions resolved

@erikgrinaker (Contributor) left a comment

Ran some more benchmarks. On CRDB master there is no diff (not surprising, just wanted to check). With SetOptions(), my results are essentially the same as yours:

name                                                 old time/op    new time/op     delta
MVCCScan_Pebble/rows=1/versions=1/valueSize=8-24       4.36µs ± 1%     3.15µs ± 1%   -27.75%  (p=0.000 n=20+20)
MVCCScan_Pebble/rows=10/versions=1/valueSize=8-24      18.1µs ± 0%      6.1µs ± 2%   -66.36%  (p=0.000 n=19+20)
MVCCScan_Pebble/rows=100/versions=1/valueSize=8-24      147µs ± 0%       29µs ± 2%   -80.20%  (p=0.000 n=18+20)
MVCCScan_Pebble/rows=1000/versions=1/valueSize=8-24    1.44ms ± 1%     0.25ms ± 1%   -82.65%  (p=0.000 n=20+18)
MVCCGet_Pebble/versions=1/valueSize=8-24               2.63µs ± 0%     2.85µs ± 1%    +8.62%  (p=0.000 n=15+19)

I'm fine with merging this as-is for now, and revisiting optimizations.

Reviewable status: 0 of 10 files reviewed, all discussions resolved

@jbowens (Collaborator, Author) left a comment

Sounds good. @sumeerbhola mind giving this a review?

Reviewable status: 0 of 10 files reviewed, all discussions resolved

@sumeerbhola (Collaborator) left a comment

Reviewed 4 of 7 files at r1, 1 of 5 files at r6, 1 of 3 files at r8, 4 of 4 files at r10, 1 of 1 files at r11, all commit messages.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @jbowens)


iterator.go line 1794 at r10 (raw file):

	return err
}

I assume we have not changed the stability requirements for base.InternalIterator. InternalIterator.SetBounds doesn't say anything about the bounds stability -- could you add an explicit code comment there?


iterator.go line 1925 at r10 (raw file):

	if (i.pointIter != nil || !i.opts.pointKeys()) &&
		(i.rangeKey != nil || !i.opts.rangeKeys()) &&
		boundsEqual && o.KeyTypes == i.opts.KeyTypes &&

is o.KeyTypes == i.opts.KeyTypes subsumed by (i.pointIter != nil || !i.opts.pointKeys()) && (i.rangeKey != nil || !i.opts.rangeKeys())?
Even if the KeyTypes have changed, if we have closed certain iterators and ended up with exactly the iterators we need, that seems ok, yes?

Hmm, I see the problem. What if the condition was
((i.pointIter != nil) == o.pointKeys()) && ((i.rangeKey != nil) == o.rangeKeys())
and in the preceding section we did something like:

if i.pointIter != nil && (closeBoth || len(o.PointKeyFilters) > 0 || len(i.opts.PointKeyFilters) > 0 || !o.pointKeys()) {
	i.err = firstError(i.err, i.pointIter.Close())
	i.pointIter = nil
}

And similarly for range keys.
That is, we have one place we close iterators.

Or maybe I am missing the fact that we are not closing the point and range iterators even if they are not being used. Can you remind me what use case this optimization helps in?


testdata/iterator_bounds_lifetimes line 1 at r10 (raw file):

new-iter label=first lower=bar upper=foo

nice test


testdata/iterator_bounds_lifetimes line 56 at r10 (raw file):

second: ("a", "z")

# Test no-op set-options.

can we tell it is a noop? Should the test print out Iterator.boundsBufIdx?


testdata/iterator_bounds_lifetimes line 64 at r10 (raw file):


# Test SetOptions with unchanged bounds but changes to other options. SetOptions
# should hold onto the existing bounds buffers.

same question


testdata/iterator_seek_opt line 180 at r10 (raw file):

SeekPrefixGEs with trySeekUsingNext: 4

# Shifting bounds, with non-overlapping and monotonic bounds. A set-bounds sits

The non-overlapping and monotonic bounds optimization is important for sstables even though it isn't observable using the usual iterator stats since we don't plumb it via trySeekUsingNext. The InternalIteratorStats may make this observable. Can you try printing those out too, and check? There is a comment in the test about them being non-deterministic, but this test is laying out a particular set of sstables so I think it should be deterministic.

@jbowens (Collaborator, Author) left a comment

Reviewable status: 6 of 12 files reviewed, 6 unresolved discussions (waiting on @sumeerbhola)


iterator.go line 1794 at r10 (raw file):

Previously, sumeerbhola wrote…

I assume we have not changed the stability requirements for base.InternalIterator. InternalIterator.SetBounds doesn't say anything about the bounds stability -- could you add an explicit code comment there?

yeah, no changes to the internal iterator contract. Added documentation to the interface type and SetBounds.


iterator.go line 1925 at r10 (raw file):

Previously, sumeerbhola wrote…

is o.KeyTypes == i.opts.KeyTypes subsumed by (i.pointIter != nil || !i.opts.pointKeys()) && (i.rangeKey != nil || !i.opts.rangeKeys())?
Even if the KeyTypes have changed, if we have closed certain iterators and ended up with exactly the iterators we need, that seems ok, yes?

Hmm, I see the problem. What if the condition was
((i.pointIter != nil) == o.pointKeys()) && ((i.rangeKey != nil) == o.rangeKeys())
and in the preceding section we did something like:

if i.pointIter != nil && (closeBoth || len(o.PointKeyFilters) > 0 || len(i.opts.PointKeyFilters) > 0 || !o.pointKeys()) {
	i.err = firstError(i.err, i.pointIter.Close())
	i.pointIter = nil
}

And similarly for range keys.
That is, we have one place we close iterators.

Or maybe I am missing the fact that we are not closing the point and range iterators even if they are not being used. Can you remind me what use case this optimization helps in?

Yeah, I was trying to preserve the optimization for switching away from and then back to a key type. It's not based on any known CockroachDB iterator usage.

In the case that the key types change, we still need to switch i.iter to point to the correct iterator. If range keys are enabled, we also need to initialize the range key interleaving iterator to either wrap the appropriate point iterator or, if only range keys are requested, omit the point iterator entirely. Reinitializing the interleaving iterator requires invalidating the iterator and disabling seek-using-next optimizations.


testdata/iterator_bounds_lifetimes line 56 at r10 (raw file):

Previously, sumeerbhola wrote…

can we tell it is a noop? Should the test print out Iterator.boundsBufIdx?

good call, done.


testdata/iterator_bounds_lifetimes line 64 at r10 (raw file):

Previously, sumeerbhola wrote…

same question

Done.


testdata/iterator_seek_opt line 180 at r10 (raw file):

> The InternalIteratorStats may make this observable.

I think the current granularity of the internal stats doesn't provide any visibility here, because block reads aren't incremented when the reader attempts to load a block it is already positioned over. Whether or not we hit the optimization, the number of block reads and the number of keys surfaced to the merging iterator are constant.

@sumeerbhola (Collaborator) left a comment

:lgtm:

Reviewed 5 of 6 files at r12, 1 of 1 files at r13, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @jbowens and @sumeerbhola)


iterator_test.go line 1622 at r13 (raw file):

	seed := uint64(time.Now().UnixNano())
	t.Logf("seed = %d", seed)
	rng := rand.New(rand.NewSource(seed))

glad to see this randomized test.


iterator_test.go line 1628 at r13 (raw file):

		o = &IterOptions{KeyTypes: IterKeyType(rng.Intn(3))}
		if rng.Intn(2) == 1 {
			o.LowerBound = testkeys.KeyAt(ks, rng.Intn(ks.Count()), rng.Intn(ks.Count()))

With 26*26 keys it seems extremely rare to have bounds unchanged.
Should generateNewOptions take the previous options and with some probability keep bounds etc. the same?
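For illustration, roughly what that suggestion could look like; generateOptions and randomKey are made-up names standing in for the test's actual helpers, and bound ordering is not enforced here:

```go
package main

import (
	"fmt"
	"math/rand"

	"github.com/cockroachdb/pebble"
)

// randomKey returns a small random two-letter key.
func randomKey(rng *rand.Rand) []byte {
	return []byte{byte('a' + rng.Intn(26)), byte('a' + rng.Intn(26))}
}

// generateOptions takes the previous options and, with some probability,
// carries the previous bounds over unchanged so that the unchanged-bounds
// path is actually exercised.
func generateOptions(rng *rand.Rand, prev *pebble.IterOptions) *pebble.IterOptions {
	o := &pebble.IterOptions{KeyTypes: pebble.IterKeyType(rng.Intn(3))}
	if prev != nil && rng.Intn(2) == 1 {
		// Half the time, reuse the previous bounds verbatim.
		o.LowerBound, o.UpperBound = prev.LowerBound, prev.UpperBound
		return o
	}
	if rng.Intn(2) == 1 {
		o.LowerBound = randomKey(rng)
	}
	if rng.Intn(2) == 1 {
		o.UpperBound = randomKey(rng)
	}
	return o
}

func main() {
	rng := rand.New(rand.NewSource(1))
	var prev *pebble.IterOptions
	for i := 0; i < 4; i++ {
		prev = generateOptions(rng, prev)
		fmt.Printf("%v %q %q\n", prev.KeyTypes, prev.LowerBound, prev.UpperBound)
	}
}
```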

@jbowens (Collaborator, Author) left a comment

TFTR!

Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @sumeerbhola)


iterator_test.go line 1628 at r13 (raw file):

Previously, sumeerbhola wrote…

With 26*26 keys it seems extremely rare to have bounds unchanged.
Should generateNewOptions take the previous options and with some probability keep bounds etc. the same?

Done.

jbowens added 2 commits on May 13, 2022, 15:21:
When a new iterator is constructed or when an existing iterator's bounds are
modified, copy the bounds into a Pebble-allocated slice. This gives Pebble
control over the lifetime of the bound buffers and allows Pebble to restore the
optimization for unchanged bounds that was removed in cockroachdb#1073.

Two buffers are used, one for the current bounds and one for old bounds. This
scheme allows low-level internal iterators in Pebble to compare new and old
bounds so that context-dependent optimizations may be applied.

These bounds buffers are pooled on the iterAlloc to reduce allocations.

These new semantics are simpler and allow Pebble's SetOptions to perform more
optimizations. They come at the cost of the additional key copies, regardless
of whether the Iterator's bounds will change.

Supersedes cockroachdb#1667.

After passing bounds to Pebble, overwrite the provided buffers with garbage.
This helps ensure we never accidentally retain user-provided bounds buffers
after a NewIter/SetOptions/SetBounds call.
@jbowens merged commit b8c9a56 into cockroachdb:master on May 13, 2022
@jbowens deleted the bounds-copy branch on May 13, 2022, 19:35