sql: hint scan batch size by expected row count #62282
Conversation
Very nice find! I just have a few nits as usual :)
Reviewed 8 of 8 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis)
pkg/sql/colexec/colbuilder/execplan.go, line 741 at r1 (raw file):
}
scanOp, err := colfetcher.NewColBatchScan(
    ctx, streamingAllocator, flowCtx, evalCtx, core.TableReader, post, estimatedRowCount)
super nit: it's nicer to put the closing parenthesis on its own line.
pkg/sql/colfetcher/cfetcher.go, line 322 at r1 (raw file):
}
rf.machine.batch, reallocated = rf.allocator.ResetMaybeReallocate(
    rf.typs, rf.machine.batch, estimatedRowCount, rf.memoryLimit,
nit: can just use rf.estimatedRowCount directly since ResetMaybeReallocate will truncate itself.
pkg/sql/execinfrapb/processors.proto, line 84 at r1 (raw file):
repeated sql.sem.types.T result_types = 7;
optional uint64 estimated_row_count = 8 [(gogoproto.nullable) = false];
nit: I think this deserves at least a quick comment.
Reviewed 1 of 8 files at r2, 13 of 13 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis)
pkg/sql/colexec/colbuilder/execplan.go, line 684 at r3 (raw file):
post := &spec.Post
estimatedRowCount := spec.EstimatedRowCount
nit: it would probably be nice to either move into the TableReader case or leave a comment here that currently this information is set only for the table readers.
pkg/sql/logictest/testdata/logic_test/explain_analyze_plans_nonmetamorphic, line 1 at r3 (raw file):
# LogicTest: !metamorphic local
We probably don't need this file, right?
pkg/sql/tests/vectorized_batch_size_test.go, line 11 at r3 (raw file):
// licenses/APL.txt.
package tests
I'd move it to the colfetcher package.
pkg/sql/tests/vectorized_batch_size_test.go, line 28 at r3 (raw file):
)

func TestScanBatchSize(t *testing.T) {
nit: quick comment for the goal of the test would be beneficial.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)
pkg/sql/logictest/testdata/logic_test/explain_analyze_plans_nonmetamorphic, line 1 at r3 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
We probably don't need this file, right?
Done.
pkg/sql/tests/vectorized_batch_size_test.go, line 11 at r3 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I'd move it to the colfetcher package.
Done.
I did some … However, I also did some …
Interesting. @erikgrinaker, thank you for working on these benchmark results. I'm currently baffled why this PR would have made kv95 worse. I would expect that it would have been a no-op. I will look into this as well.
I did a couple of brief tests and couldn't reproduce the behavior you saw, but I'm not ruling out the possibility I've missed something just yet. Also, I wouldn't suspect that this fix could help with the slow regression you're looking at. The behavior for single-key primary key lookup workloads (like the ones in KV) should be exactly the same before and after this patch. 🤔
Yeah, it's weird. I'm fine with putting it down to noise, but I just did another run now and got similar results. The thing that jumps out at me is that it's always slower to ramp up with this change -- have a look at the instantaneous rates below:
Without this PR:
With this PR:
This pattern holds across all the runs I've done. But I don't have a good explanation for why this change should cause this.
@erikgrinaker does the pattern recur in your test setup with …
Could you share the exact invocations you used to do your tests?
Sure:
I'll do a run with a patched …
Seeing the same with …
Without PR:
With PR:
I am having a great deal of trouble reproducing this in my environment. Perhaps it's noise, perhaps it's not. I think I'd like to merge and backport this soon, unless people have objections - my bandwidth for investigating this will also decrease quickly this week.
Ok -- I think this is a pretty clear win for YCSB/E, and if it turns out not to be noise we can optimize that separately. Thanks!
Reviewed 6 of 6 files at r4.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis)
bors r+
I hope you don't mind - I've just rebased and updated.
bors r+
On the contrary, I very much appreciate it - thanks so much!
Build failed (retrying...):
"vectorized batch count" doesn't show up for some of the logic test configs :/
bors r-
Canceled.
Hm, I'm not sure what to do here. The flakes highlight the fact that introducing a purely vectorized-engine-related field to … Alternatively, we could either run …
I'll pick this up and move it over the finish line. I'll update …
Previously, the colfetcher ignored limit hints: it always fetched data from KV until its batch was full. This produces bad behavior if the batch size is larger than the limit hint. For example, if the expected row count was 500, causing us to create a 500-sized batch, but the limit hint for whatever reason was only 20, we would still go ahead and fetch 500 rows.

This, in practice, does not appear to show up too easily - if the optimizer is doing its job, the batch size should always be equal to the limit hint for limited scans.

Release note: None
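The interaction described in this commit message can be sketched as follows. This is a toy model, not the actual colfetcher code; `batchSizeWithLimitHint` is a hypothetical helper, and the 1024 cap is the vectorized engine's maximum batch size mentioned elsewhere in this PR:

```go
package main

import "fmt"

// batchSizeWithLimitHint is a hypothetical helper sketching the fix:
// the batch is sized from the expected row count, but a limit hint
// caps it so we never fetch more rows than the hint allows.
func batchSizeWithLimitHint(estimatedRowCount, limitHint, maxBatchSize int) int {
	size := estimatedRowCount
	if size > maxBatchSize {
		size = maxBatchSize
	}
	if limitHint > 0 && limitHint < size {
		size = limitHint
	}
	return size
}

func main() {
	// The example from the commit message: estimate 500, limit hint 20.
	fmt.Println(batchSizeWithLimitHint(500, 20, 1024)) // 20, not 500
	fmt.Println(batchSizeWithLimitHint(500, 0, 1024))  // 500: no hint, use the estimate
}
```

As the commit message notes, the optimizer usually makes the estimate equal to the limit hint for limited scans, so the cap mostly matters when the two disagree.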
Previously, the "dynamic batch size" strategy for the vectorized engine's batch allocator worked the same in every situation: batches would start at size 1, then double on each re-allocation, until they hit their maximum size of 1024.

Now, to improve performance for scans that return a number of rows somewhere in between 1 and 1024, we pass the optimizer's best guess of the number of rows that the scan will produce all the way down into the TableReader. That guess is used as the initial size of the batch if it's less than 1024.

Release note (performance improvement): improve the performance for the vectorized engine when scanning fewer than 1024 rows at a time.
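A rough sketch of the strategy described above, assuming a 1024-row maximum. `initialBatchSize` and `growBatchSize` are hypothetical helpers for illustration; in CockroachDB the real logic lives in the allocator (e.g. `ResetMaybeReallocate`):

```go
package main

import "fmt"

const maxBatchSize = 1024

// initialBatchSize seeds the batch from the optimizer's row count
// estimate when one is available, capped at the maximum; without an
// estimate the batch starts at 1 and relies on doubling.
func initialBatchSize(estimatedRowCount int) int {
	if estimatedRowCount <= 0 {
		return 1
	}
	if estimatedRowCount > maxBatchSize {
		return maxBatchSize
	}
	return estimatedRowCount
}

// growBatchSize doubles the capacity on each re-allocation, up to the max.
func growBatchSize(cur int) int {
	if cur*2 > maxBatchSize {
		return maxBatchSize
	}
	return cur * 2
}

func main() {
	fmt.Println(initialBatchSize(0))    // 1: no estimate, start small
	fmt.Println(initialBatchSize(500))  // 500: seeded from the optimizer's guess
	fmt.Println(initialBatchSize(5000)) // 1024: capped at the maximum
	fmt.Println(growBatchSize(512))     // 1024
}
```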
bors r+
Build failed (retrying...):
Build succeeded:
In cockroachdb#62282, the estimated row count was passed into the scan batch allocator to avoid growing the batch from 1. However, this also changed the default batch size from 1 to 1024 when no row count estimate was available, giving significant overhead when fetching small result sets. On `kv95/enc=false/nodes=1/cpu=32` this reduced performance from 66304 ops/s to 58862 ops/s (median of 5 runs), since these are single-row reads without estimates.

This patch reverts the default batch size to 1 when no row count estimate is available. This fully fixes the `kv95` performance regression. YCSB/E takes a small hit going from 1895 ops/s to 1786 ops/s, but this only seems to happen because it takes a while for the statistics to update: sometime within the first minute of the test (after the 1-minute ramp-up period), throughput abruptly changes from ~700 ops/s to ~1800 ops/s, so using a 2-minute ramp-up period in roachtest would mostly eliminate any difference.

Release note: None
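One way to see the trade-off behind this revert: with a doubling allocator, starting at 1 costs nothing extra for single-row reads, while a good estimate still spares large scans most of the doubling work. A toy sketch of that arithmetic (hypothetical helper, not the actual allocator code):

```go
package main

import "fmt"

// reallocations counts how many times a batch that starts at capacity
// `start` and doubles on each growth must be reallocated before it can
// hold `rows` rows (capacity capped at `max`).
func reallocations(start, rows, max int) int {
	n := 0
	for c := start; c < rows && c < max; c *= 2 {
		n++
	}
	return n
}

func main() {
	fmt.Println(reallocations(1, 1, 1024))      // 0: a default of 1 is free for single-row reads
	fmt.Println(reallocations(1, 1024, 1024))   // 10: the doubling cost a good estimate avoids
	fmt.Println(reallocations(500, 1024, 1024)) // 2: a seeded batch grows almost immediately
}
```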
62175: sql: add crdb_internal.reset_sql_stats() builtin r=asubiotto a=Azhng

Previously, there was no mechanism to immediately clear SQL statistics. Users would have to wait until the reset interval expires. This commit creates a builtin to immediately clear SQL stats.

Release note (sql change): SQL stats can now be cleared using crdb_internal.reset_sql_stats()

Addresses #33315

62492: roachtest: add hibernate-spatial test r=rafiss a=otan

* Bump hibernate to 5.4.30
* Add hibernate-spatial test which tests cockroachdb against hibernate's spatial suite. Used a separate suite because the directory magic of copying may not work since the set of running tests overlaps a bit.

Release note: None

62511: geo/geomfn: fix st_linelocatepoint to work with ZM coords r=otan a=andyyang890

Previously, st_linelocatepoint would panic when the line had Z and/or M coordinates.

Release note: None

62534: sql: default to batch size 1 in allocator r=yuzefovich a=erikgrinaker

In #62282, the estimated row count was passed into the scan batch allocator to avoid growing the batch from 1. However, this also changed the default batch size from 1 to 1024 when no row count estimate was available, giving significant overhead when fetching small result sets. On `kv95/enc=false/nodes=1/cpu=32` this reduced performance from 66304 ops/s to 58862 ops/s (median of 5 runs), since these are single-row reads without estimates.

This patch reverts the default batch size to 1 when no row count estimate is available. This fully fixes the `kv95` performance regression. YCSB/E takes a small hit going from 1895 ops/s to 1786 ops/s, but this only seems to happen because it takes a while for the statistics to update: sometime within the first minute of the test (after the 1-minute ramp-up period), throughput abruptly changes from ~700 ops/s to ~1800 ops/s, so using a 2-minute ramp-up period in roachtest would mostly eliminate any difference.

Resolves #62524.

Release note: None

62535: roachtest: use 2-minute ramp times for YCSB workloads r=yuzefovich a=erikgrinaker

In #62534 it was shown that it takes up to two minutes before we have good enough statistics to allocate appropriately sized batches. However, the YCSB workloads only had a 1-minute ramp time, which would skew the results as throughput would abruptly change when the statistics were updated during the test. This patch changes the ramp time for YCSB workloads to 2 minutes to make sure we have appropriate statistics before starting the actual test.

Release note: None

62548: bazel: mark logictest as working in bazel r=rickystewart a=rickystewart

Release note: None

62549: workload/schemachange: temporarily disable ADD REGION r=ajstorm,postamar a=otan

This is causing flakes in CI. Resolves #62503

Release note: None

Co-authored-by: Azhng <[email protected]>
Co-authored-by: Oliver Tan <[email protected]>
Co-authored-by: Andy Yang <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Might close #62198.