
sql: hint scan batch size by expected row count #62282

Merged

Conversation

@jordanlewis (Member)

Might close #62198 (YCSB/E performance regression on December 4th).

Previously, the "dynamic batch size" strategy for the vectorized
engine's batch allocator worked the same in every situation: batches
would start at size 1, then double on each re-allocation, until they hit
their maximum size of 1024.

Now, to improve performance for scans that return a number of rows
somewhere in between 1 and 1024, we pass in the optimizer's best guess
of the number of rows that the scan will produce all the way down into
the TableReader. That guess is used as the initial size of the batch if
it's less than 1024.

Release note (performance improvement): improved the performance of the
vectorized engine when scanning fewer than 1024 rows at a time.
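To make the strategy concrete, here is a minimal sketch in Go of the sizing logic described above. This is an editor's illustration, not the actual allocator code: the names are assumptions, and the no-estimate default of 1024 shown here was later changed back to 1 in #62534 (see further down in this thread).

```go
package main

import "fmt"

// maxBatchSize stands in for the engine's maximum batch size of 1024 rows.
const maxBatchSize = 1024

// nextBatchSize sketches the "dynamic batch size" strategy after this PR.
// Previously, the first batch always had size 1 and doubled on each
// reallocation. Now the optimizer's estimated row count, when it is below
// the maximum, seeds the initial batch size instead.
func nextBatchSize(currentSize, estimatedRowCount int) int {
	if currentSize == 0 {
		// First allocation: start from the estimate if it is useful.
		if estimatedRowCount > 0 && estimatedRowCount < maxBatchSize {
			return estimatedRowCount
		}
		// No estimate: this PR starts at the maximum (reverted in #62534).
		return maxBatchSize
	}
	// Subsequent reallocations: double, capped at the maximum.
	if doubled := currentSize * 2; doubled < maxBatchSize {
		return doubled
	}
	return maxBatchSize
}

func main() {
	// A scan estimated at 500 rows starts with a 500-row batch instead of
	// ramping up 1 -> 2 -> 4 -> ... across roughly ten reallocations.
	fmt.Println(nextBatchSize(0, 500))   // 500
	fmt.Println(nextBatchSize(500, 500)) // 1000
}
```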


@jordanlewis requested a review from a team on March 19, 2021 23:00
@yuzefovich (Member) left a comment

Very nice find! I just have a few nits as usual :)

Reviewed 8 of 8 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis)


pkg/sql/colexec/colbuilder/execplan.go, line 741 at r1 (raw file):

			}
			scanOp, err := colfetcher.NewColBatchScan(
				ctx, streamingAllocator, flowCtx, evalCtx, core.TableReader, post, estimatedRowCount)

super nit: it's nicer to put the closing parenthesis on a new line.


pkg/sql/colfetcher/cfetcher.go, line 322 at r1 (raw file):

	}
	rf.machine.batch, reallocated = rf.allocator.ResetMaybeReallocate(
		rf.typs, rf.machine.batch, estimatedRowCount, rf.memoryLimit,

nit: you can just pass rf.estimatedRowCount directly, since ResetMaybeReallocate will truncate it itself.
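For context, the truncation referred to here might look like the following simplified sketch; this is a hypothetical stand-in, not the real ResetMaybeReallocate signature. The point is that the allocator clamps whatever capacity the caller requests, so pre-truncating at the call site is redundant.

```go
// clampCapacity is a hypothetical simplification of the capping that
// ResetMaybeReallocate performs internally: the requested capacity is
// truncated to the maximum batch size, so callers can pass the raw
// estimated row count directly.
func clampCapacity(requested, maxBatchSize int) int {
	if requested > maxBatchSize {
		return maxBatchSize
	}
	return requested
}
```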


pkg/sql/execinfrapb/processors.proto, line 84 at r1 (raw file):

  repeated sql.sem.types.T result_types = 7;

  optional uint64 estimated_row_count = 8 [(gogoproto.nullable) = false];

nit: I think this deserves at least a quick comment.

@jordanlewis force-pushed the estimated-row-count-in-allocator branch from f66b666 to 11ab70c on March 20, 2021 00:14
@jordanlewis requested a review from a team as a code owner on March 20, 2021 00:14
@jordanlewis force-pushed the estimated-row-count-in-allocator branch 3 times, most recently from be14785 to 7f2ce8d on March 20, 2021 00:58
@yuzefovich (Member) left a comment

Reviewed 1 of 8 files at r2, 13 of 13 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis)


pkg/sql/colexec/colbuilder/execplan.go, line 684 at r3 (raw file):

	post := &spec.Post

	estimatedRowCount := spec.EstimatedRowCount

nit: it would probably be nice to either move this into the TableReader case or leave a comment here noting that this information is currently set only for table readers.


pkg/sql/logictest/testdata/logic_test/explain_analyze_plans_nonmetamorphic, line 1 at r3 (raw file):

# LogicTest: !metamorphic local

We probably don't need this file, right?


pkg/sql/tests/vectorized_batch_size_test.go, line 11 at r3 (raw file):

// licenses/APL.txt.

package tests

I'd move it to the colfetcher package.


pkg/sql/tests/vectorized_batch_size_test.go, line 28 at r3 (raw file):

)

func TestScanBatchSize(t *testing.T) {

nit: a quick comment describing the goal of the test would be beneficial.

@jordanlewis force-pushed the estimated-row-count-in-allocator branch from 7f2ce8d to ff09f45 on March 20, 2021 01:22
@jordanlewis (Member, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)


pkg/sql/logictest/testdata/logic_test/explain_analyze_plans_nonmetamorphic, line 1 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

We probably don't need this file, right?

Done.


pkg/sql/tests/vectorized_batch_size_test.go, line 11 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I'd move it to colfetcher package.

Done.

@jordanlewis force-pushed the estimated-row-count-in-allocator branch 5 times, most recently from 8bfae2e to 3d7eecc on March 20, 2021 23:57
@erikgrinaker (Contributor)

I did some ycsb/E/nodes=3 GCE runs to try this: 5 runs at 24e76d3 (without this PR) and 5 runs at 3d7eecc (with this PR). Happy to say that the median results were 792 ops/s before and 1829 ops/s after, which is even higher than the original ~1600 ops/s baseline back in November! Great work @jordanlewis!

However, I also did some kv95/enc=false/nodes=1/cpu=32 GCE runs to see if this might address #62156. Unfortunately, this seems to have regressed from 60589 ops/s without this PR to 56142 ops/s with it. The results were fairly noisy, so I did 10 runs each, and they should probably still be taken with a grain of salt. I don't think this should block the PR, but if you have any ideas for why this change might have caused kv95 to regress, that'd be worth looking into.

@jordanlewis (Member, Author)

Interesting. @erikgrinaker, thank you for working on these benchmark results.

I'm currently baffled as to why this PR would have made kv95 worse; I would have expected it to be a no-op there. I will look into this as well.

@jordanlewis (Member, Author)

I did a couple of brief tests and couldn't reproduce the behavior you saw, but I'm not ruling out the possibility that I've missed something just yet.

Also, I wouldn't expect this fix to help with the slow regression you're looking at. The behavior for single-key primary-key lookup workloads (like the ones in KV) should be exactly the same before and after this patch. 🤔

@erikgrinaker (Contributor) commented Mar 22, 2021

Yeah, it's weird. I'm fine with putting it down to noise, but I just did another run and got similar results. The thing that jumps out at me is that ramp-up is always slower with this change -- have a look at the instantaneous rates below:

Without this PR:

_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
    1.0s        0        40747.2        45317.0      0.9      3.5      7.3     15.2 read
    1.0s        0         2148.8         2389.8      3.3      8.9     12.6     17.8 write
    2.0s        0        47194.0        46255.5      0.9      3.1      7.3     16.3 read
    2.0s        0         2516.2         2453.0      3.1      9.4     13.1     18.9 write
    3.0s        0        49113.7        47206.9      0.8      2.9      7.1     18.9 read
    3.0s        0         2600.1         2502.0      3.1      8.9     12.1     19.9 write
    4.0s        0        49923.5        47886.7      0.8      2.9      6.8     16.8 read
    4.0s        0         2656.9         2540.7      3.4      9.4     12.1     21.0 write
    5.0s        0        51812.2        48671.6      0.8      2.4      6.6     17.8 read
    5.0s        0         2646.0         2561.8      2.9     10.0     15.2     21.0 write

With this PR:

_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
    1.0s        0        29460.3        33190.0      1.1      5.2      8.9     27.3 read
    1.0s        0         1538.0         1732.7      3.8     10.0     13.6     26.2 write
    2.0s        0        33780.4        33484.5      1.0      5.0      8.9     24.1 read
    2.0s        0         1736.3         1734.5      4.5     11.0     15.7     25.2 write
    3.0s        0        32794.3        33254.4      1.0      5.5     10.0     21.0 read
    3.0s        0         1746.0         1738.3      4.5     12.1     17.8     28.3 write
    4.0s        0        35227.5        33747.8      1.0      4.7      9.4     22.0 read
    4.0s        0         1867.9         1770.7      3.8      9.4     14.7     29.4 write
    5.0s        0        34809.9        33960.6      1.0      5.2     10.5     19.9 read
    5.0s        0         1874.7         1791.5      3.8     11.0     14.7     26.2 write

This pattern holds across all the runs I've done. But I don't have a good explanation for why this change should cause this.

@jordanlewis (Member, Author)

@erikgrinaker, does the pattern recur in your test setup with read-percent=100?

@jordanlewis (Member, Author)

Could you share the exact invocations you used to do your tests?

@erikgrinaker (Contributor) commented Mar 22, 2021

Sure:

$ roachtest -u grinaker --cockroach cockroachshort-linux-jordanlewis-pre run --count 5 --cpu-quota 320 '^kv95/enc=false/nodes=1/cpu=32$'

$ roachtest -u grinaker --cockroach cockroachshort-linux-jordanlewis-post run --count 5 --cpu-quota 320 '^kv95/enc=false/nodes=1/cpu=32$'

I'll do a run with a patched roachtest using read-percent=100 soon; I'm running some other benchmarks at the moment.

@erikgrinaker (Contributor) commented Mar 22, 2021

Seeing the same with read-percent=100: the median is 66267 ops/s vs. 61842 ops/s, with the same slow ramp-up:

Without PR:

_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
    1.0s        0        45943.9        52634.9      0.9      3.5      7.6     16.8 read
    2.0s        0        53887.3        53256.5      0.9      3.4      6.8     15.7 read
    3.0s        0        54726.6        53746.9      0.9      3.4      6.8     16.8 read
    4.0s        0        55622.4        54215.3      0.9      3.3      7.1     15.7 read
    5.0s        0        56558.9        54684.7      0.9      3.1      7.1     17.8 read
    6.0s        0        58279.0        55282.8      0.9      2.9      6.6     15.2 read
    7.0s        0        57671.5        55624.7      0.9      3.0      6.6     14.7 read
    8.0s        0        58504.5        55984.0      0.9      2.9      6.6     15.7 read
    9.0s        0        57625.4        56166.6      0.9      3.0      7.1     15.2 read
   10.0s        0        58540.9        56404.1      0.9      2.9      6.8     18.9 read

With PR:

_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
    1.0s        0        31237.3        35882.4      1.2      5.0      9.4     22.0 read
    2.0s        0        37625.5        36753.0      1.2      4.7      8.4     23.1 read
    3.0s        0        37586.4        37031.0      1.1      5.0      9.4     23.1 read
    4.0s        0        38858.7        37488.1      1.2      4.5      7.9     18.9 read
    5.0s        0        38599.1        37710.2      1.1      4.5      8.9     18.9 read
    6.0s        0        38557.1        37851.5      1.1      4.7      8.4     19.9 read
    7.0s        0        38632.2        37963.4      1.2      4.5      8.9     18.9 read
    8.0s        0        40179.1        38239.8      1.1      4.5      8.9     18.9 read
    9.0s        0        39869.8        38421.0      1.0      4.7      8.9     23.1 read
   10.0s        0        40219.6        38600.9      1.1      4.5      8.9     16.3 read

@jordanlewis (Member, Author)

I am having a great deal of trouble reproducing this in my environment. Perhaps it's noise, perhaps it's not. I think I'd like to merge and backport this soon, unless people have objections - my bandwidth for investigating this will also decrease quickly this week.

@erikgrinaker (Contributor)

OK -- I think this is a pretty clear win for YCSB/E, and if the kv95 regression turns out not to be noise, we can optimize that separately. Thanks!

@yuzefovich (Member) left a comment

:lgtm:

Reviewed 6 of 6 files at r4.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis)

@jordanlewis (Member, Author)

bors r+

@yuzefovich force-pushed the estimated-row-count-in-allocator branch from 3d7eecc to 957bb40 on March 22, 2021 23:48
@yuzefovich (Member)

I hope you don't mind - I've just rebased and updated the explain_analyze logic test.

bors r+

@jordanlewis (Member, Author)

On the contrary, I very much appreciate it - thanks so much!

@craig (bot) commented Mar 23, 2021

Build failed (retrying...):

@yuzefovich (Member)

"vectorized batch count" doesn't show up for some of the logic test configs :/

bors r-

@craig (bot) commented Mar 23, 2021

Canceled.

@yuzefovich (Member)

Hm, I'm not sure what to do here. These flakes highlight that introducing a purely vectorized-engine-related field into the EXPLAIN output is unfortunate. I wonder whether we should merge this PR without the changes to EXPLAIN (i.e. without a regression test), backport the fix, and then come up with a testing strategy.

Alternatively, we could either run the explain_analyze logic test only with vectorized configs (which would reduce our test coverage, though it was even worse until recently; see #62255) or duplicate the file for the non-vectorized configs.

@yuzefovich (Member)

I'll pick this up and move it over the finish line. I'll update the explain_analyze logic test to run only with vec-on configs so that it is no longer flaky.

Previously, the colfetcher ignored limit hints: it always fetched data
from KV until its batch was full. This produces bad behavior if the
batch size is larger than the limit hint. For example, if the expected
row count was 500, causing us to create a 500-sized batch, but the limit
hint for whatever reason was only 20, we would still go ahead and fetch
500 rows.

This, in practice, does not appear to show up too easily - if the
optimizer is doing its job, the batch size should always be equal to the
limit hint for limited scans.

Release note: None
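As an illustration of the fix this commit message describes (hypothetical helper name; the real logic lives in the colfetcher), the initial batch size should respect the limit hint as well as the row count estimate:

```go
// initialBatchSize is a hypothetical sketch of the limit-hint fix: even
// when the optimizer estimates 500 rows, a limit hint of 20 means the
// fetcher should neither allocate a 500-row batch nor fetch 500 rows.
func initialBatchSize(estimatedRowCount, limitHint, maxBatchSize int) int {
	size := maxBatchSize
	if estimatedRowCount > 0 && estimatedRowCount < size {
		size = estimatedRowCount
	}
	if limitHint > 0 && limitHint < size {
		size = limitHint
	}
	return size
}
```

With estimatedRowCount = 500 and limitHint = 20, this returns 20, matching the example in the commit message.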
Previously, the "dynamic batch size" strategy for the vectorized
engine's batch allocator worked the same in every situation: batches
would start at size 1, then double on each re-allocation, until they hit
their maximum size of 1024.

Now, to improve performance for scans that return a number of rows
somewhere in between 1 and 1024, we pass in the optimizer's best guess
of the number of rows that the scan will produce all the way down into
the TableReader. That guess is used as the initial size of the batch if
it's less than 1024.

Release note (performance improvement): improve the performance for the
vectorized engine when scanning fewer than 1024 rows at a time.
@yuzefovich force-pushed the estimated-row-count-in-allocator branch from 957bb40 to 5041b22 on March 23, 2021 23:46
@yuzefovich (Member)

bors r+

@craig (bot) commented Mar 24, 2021

Build failed (retrying...):

@craig (bot) commented Mar 24, 2021

Build succeeded:

@craig (bot) merged commit a56498f into cockroachdb:master on Mar 24, 2021
@erikgrinaker added a commit to erikgrinaker/cockroach that referenced this pull request on Mar 24, 2021:
In cockroachdb#62282, the estimated row count was passed into the scan batch
allocator to avoid growing the batch from 1. However, this also changed
the default batch size from 1 to 1024 when no row count estimate was
available, giving significant overhead when fetching small result sets.
On `kv95/enc=false/nodes=1/cpu=32` this reduced performance from 66304
ops/s to 58862 ops/s (median of 5 runs), since these are single-row
reads without estimates.

This patch reverts the default batch size to 1 when no row count
estimate is available. This fully fixes the `kv95` performance
regression. YCSB/E takes a small hit going from 1895 ops/s to 1786
ops/s, but this only seems to happen because it takes a while for the
statistics to update: sometime within the first minute of the test
(after the 1-minute ramp-up period), throughput abruptly changes from
~700 ops/s to ~1800 ops/s, so using a 2-minute ramp-up period in
roachtest would mostly eliminate any difference.

Release note: None
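A sketch of the revert this commit describes, under the same hypothetical names as the earlier sketches: without an estimate, the batch starts at 1 again and relies on doubling, so single-row reads (as in kv95) no longer pay for a 1024-row batch.

```go
// seedBatchSize sketches the fix in #62534: seed the batch size from the
// estimated row count when one is available, and otherwise fall back to
// the pre-#62282 starting size of 1, letting the doubling strategy grow
// the batch as needed.
func seedBatchSize(estimatedRowCount, maxBatchSize int) int {
	if estimatedRowCount == 0 {
		return 1 // no estimate: revert to the old dynamic ramp-up
	}
	if estimatedRowCount < maxBatchSize {
		return estimatedRowCount
	}
	return maxBatchSize
}
```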
@craig (bot) pushed a commit that referenced this pull request on Mar 24, 2021:
62175: sql: add crdb_internal.reset_sql_stats() builtin r=asubiotto a=Azhng

Previously, there was no mechanism to immediately clear SQL statistics;
users would have to wait until the reset interval expired. This commit
creates a builtin that immediately clears SQL stats.

Release note (sql change): SQL stats can now be cleared using
crdb_internal.reset_sql_stats()

Addresses #33315 

62492: roachtest: add hibernate-spatial test r=rafiss a=otan

* Bump hibernate to 5.4.30
* Add hibernate-spatial test, which tests cockroachdb against hibernate's
  spatial suite. A separate suite is used because the directory magic of
  copying may not work, since the sets of running tests overlap a bit.

Release note: None

62511: geo/geomfn: fix st_linelocatepoint to work with ZM coords r=otan a=andyyang890

Previously, st_linelocatepoint would panic when the
line had Z and/or M coordinates.

Release note: None

62534: sql: default to batch size 1 in allocator r=yuzefovich a=erikgrinaker

In #62282, the estimated row count was passed into the scan batch
allocator to avoid growing the batch from 1. However, this also changed
the default batch size from 1 to 1024 when no row count estimate was
available, giving significant overhead when fetching small result sets.
On `kv95/enc=false/nodes=1/cpu=32` this reduced performance from 66304
ops/s to 58862 ops/s (median of 5 runs), since these are single-row
reads without estimates.

This patch reverts the default batch size to 1 when no row count
estimate is available. This fully fixes the `kv95` performance
regression. YCSB/E takes a small hit going from 1895 ops/s to 1786
ops/s, but this only seems to happen because it takes a while for the
statistics to update: sometime within the first minute of the test
(after the 1-minute ramp-up period), throughput abruptly changes from
~700 ops/s to ~1800 ops/s, so using a 2-minute ramp-up period in
roachtest would mostly eliminate any difference.

Resolves #62524.

Release note: None

62535: roachtest: use 2-minute ramp times for YCSB workloads r=yuzefovich a=erikgrinaker

In #62534 it was shown that it takes up to two minutes before we have
good enough statistics to allocate appropriately sized batches. However,
the YCSB workloads only had a 1-minute ramp time, which would skew the
results as throughput would abruptly change when the statistics were
updated during the test.

This patch changes the ramp time for YCSB workloads to 2 minutes to make
sure we have appropriate statistics before starting the actual test.

Release note: None

62548: bazel: mark logictest as working in bazel r=rickystewart a=rickystewart

Release note: None

62549: workload/schemachange: temporarily disable ADD REGION r=ajstorm,postamar a=otan

This is causing flakes in CI.
Resolves #62503 

Release note: None

Co-authored-by: Azhng <[email protected]>
Co-authored-by: Oliver Tan <[email protected]>
Co-authored-by: Andy Yang <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
@yuzefovich pushed a commit to yuzefovich/cockroach that referenced this pull request on Mar 26, 2021
@jordanlewis deleted the estimated-row-count-in-allocator branch on March 26, 2021 19:02