Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql,storage: add support for COL_BATCH_RESPONSE scan format #94438

Merged
merged 1 commit into from
Jan 25, 2023

Conversation

yuzefovich
Copy link
Member

@yuzefovich yuzefovich commented Dec 29, 2022

This commit introduces a new COL_BATCH_RESPONSE scan format for Scans
and ReverseScans which results only in needed columns to be returned
from the KV server. In other words, this commit introduces the ability
to perform the KV projection pushdown.

The main idea of this feature is to use the injected decoding logic from
SQL in order to process each KV and keep only the needed parts (i.e.
necessary SQL columns). Those needed parts are then propagated back to
the KV client as coldata.Batch'es (serialized in the Apache Arrow format).

Here is the outline of all components involved:

     ┌────────────────────────────────────────────────┐
     │                       SQL                      │
     │________________________________________________│
     │          colfetcher.ColBatchDirectScan         │
     │                        │                       │
     │                        ▼                       │
     │                 row.txnKVFetcher               │
     │    (behind the row.KVBatchFetcher interface)   │
     └────────────────────────────────────────────────┘
                              │
                              ▼
     ┌────────────────────────────────────────────────┐
     │                    KV Client                   │
     └────────────────────────────────────────────────┘
                              │
                              ▼
     ┌────────────────────────────────────────────────┐
     │                    KV Server                   │
     │________________________________________________│
     │           colfetcher.cFetcherWrapper           │
     │ (behind the storage.CFetcherWrapper interface) │
     │                        │                       │
     │                        ▼                       │
     │              colfetcher.cFetcher               │
     │                        │                       │
     │                        ▼                       │
     │          storage.mvccScanFetchAdapter ────────┐│
     │    (behind the storage.NextKVer interface)    ││
     │                        │                      ││
     │                        ▼                      ││
     │           storage.pebbleMVCCScanner           ││
     │ (which put's KVs into storage.singleResults) <┘│
     └────────────────────────────────────────────────┘

On the KV client side, row.txnKVFetcher issues Scans and ReverseScans with
the COL_BATCH_RESPONSE format and returns the response (which contains the
columnar data) to the colfetcher.ColBatchDirectScan.

On the KV server side, we create a storage.CFetcherWrapper that asks the
colfetcher.cFetcher for the next coldata.Batch. The cFetcher, in turn,
fetches the next KV, decodes it, and keeps only values for the needed SQL
columns, discarding the rest of the KV. The KV is emitted by the
mvccScanFetchAdapter which - via the singleResults struct - exposes
access to the current KV that the pebbleMVCCScanner is pointing at.

Note that there is an additional "implicit synchronization" between
components that is not shown on this diagram. In particular,
storage.singleResults.maybeTrimPartialLastRow must be in sync with the
colfetcher.cFetcher which is achieved by

  • the cFetcher exposing access to the first key of the last incomplete SQL
    row via the FirstKeyOfRowGetter,
  • the singleResults using that key as the resume key for the response,
  • and the cFetcher removing that last partial SQL row when NextKV()
    returns partialRow=true.
    This "upstream" link (although breaking the layering a bit) allows us to
    avoid a performance penalty for handling the case with multiple column
    families. (This case is handled by the storage.pebbleResults via tracking
    offsets into the pebbleResults.repr.)

This code structure deserves some elaboration. First, there is a mismatch
between the "push" mode in which the pebbleMVCCScanner operates and the
"pull" mode that the NextKVer exposes. The adaption between two different
modes is achieved via the mvccScanFetcherAdapter grabbing (when the control
returns to it) the current unstable KV pair from the singleResults struct
which serves as a one KV pair buffer that the pebbleMVCCScanner puts into.
Second, in order be able to use the unstable KV pair without performing a
copy, the pebbleMVCCScanner stops at the current KV pair and returns the
control flow (which is exactly what pebbleMVCCScanner.getOne does) back to
the mvccScanFetcherAdapter, with the adapter advancing the scanner only when
the next KV pair is needed.

There are multiple scenarios which are currently not supported:

  • SQL cannot issue Get requests (likely will support in 23.1)
  • TraceKV option is not supported (likely will support in 23.1)
  • user-defined types other than enums are not supported (will not
    support in 23.1)
  • non-default key locking strength as well as SKIP LOCKED wait policy
    are not supported (will not support in 23.1).

The usage of this feature is currently disabled by default, but I intend
to enable it by default for multi-tenant setups. The rationale is that
currently there is a large performance hit when enabling it for
single-tenant deployments whereas it offers significant speed up in the
multi-tenant world.

The microbenchmarks show
the expected improvement in multi-tenant setups when the tenant runs in
a separate process whenever we don't need to decode all of the columns
from the table.

The TPCH numbers, though, don't show the expected speedup:

Q1:	before: 11.47s	after: 8.84s	 -22.89%
Q2:	before: 0.41s	after: 0.29s	 -27.71%
Q3:	before: 7.89s	after: 9.68s	 22.63%
Q4:	before: 4.48s	after: 4.52s	 0.86%
Q5:	before: 10.39s	after: 10.35s	 -0.29%
Q6:	before: 33.57s	after: 33.41s	 -0.48%
Q7:	before: 23.82s	after: 23.81s	 -0.02%
Q8:	before: 3.78s	after: 3.76s	 -0.68%
Q9:	before: 28.15s	after: 28.03s	 -0.42%
Q10:	before: 5.00s	after: 4.98s	 -0.42%
Q11:	before: 2.44s	after: 2.44s	 0.22%
Q12:	before: 34.78s	after: 34.65s	 -0.37%
Q13:	before: 3.20s	after: 2.94s	 -8.28%
Q14:	before: 3.13s	after: 3.21s	 2.43%
Q15:	before: 16.80s	after: 16.73s	 -0.38%
Q16:	before: 1.60s	after: 1.65s	 2.96%
Q17:	before: 0.85s	after: 0.96s	 13.04%
Q18:	before: 16.39s	after: 15.47s	 -5.61%
Q19:	before: 13.76s	after: 13.01s	 -5.45%
Q20:	before: 55.33s	after: 55.12s	 -0.38%
Q21:	before: 24.31s	after: 24.31s	 -0.00%
Q22:	before: 1.28s	after: 1.41s	 10.26%

At the moment, coldata.Batch that is included into the response is
always serialized into the Arrow format, but I intend to introduce the
local fastpath to avoid that serialization. That work will be done in
a follow-up and should be able to reduce the perf hit for single-tenant
deployments.

A quick note on the TODOs sprinkled in this commit:

  • TODO(yuzefovich) means that this will be left for 23.2 or later.
  • TODO(yuzefovich, 23.1) means that it should be addressed in 23.1.

A quick note on testing: this commit randomizes the fact whether the new
infrastructure is used in almost all test builds. Introducing some unit
testing (say, in storage package) seems rather annoying since we must
create keys that are valid SQL keys (i.e. have TableID / Index ID
prefix) and need to come with the corresponding
fetchpb.IndexFetchSpec. Not having unit tests in the storage seems
ok to me given that the "meat" of the work there is still done by the
pebbleMVCCScanner which is exercised using the regular Scans.
End-to-end testing is well covered by all of our existing tests which
now runs randomly. I did run the CI multiple times with the new feature
enabled by default with no failure, so I hope that it shouldn't become
flaky.

Addresses: #82323.
Informs: #87610.

Epic: CRDB-14837

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@yuzefovich yuzefovich force-pushed the kv-pushdown-all branch 17 times, most recently from 186d49a to a284dbb Compare January 5, 2023 19:27
@yuzefovich yuzefovich force-pushed the kv-pushdown-all branch 7 times, most recently from 07f8ff0 to f2fc46d Compare January 6, 2023 20:37
@yuzefovich yuzefovich force-pushed the kv-pushdown-all branch 4 times, most recently from d24e48c to 26940e1 Compare January 6, 2023 22:33
Copy link
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, and @sumeerbhola)


pkg/roachpb/api.proto line 673 at r7 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

[nit] rowin -> row in and same above.

Done.


pkg/sql/colfetcher/cfetcher.go line 812 at r6 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

I think I still want to push on this a bit more, since it would be nice to decouple things. What if we added a method to the NextKVer interface SetFirstKVOfRow that would be called by the cFetcher when it reaches stateDecodeFirstKVOfRow? I think that should give mvccScanFetchAdapter all the information it needs to implement my earlier suggestion, right?

Thanks for pushing on this. I implemented your suggestion in first WIP commit, but I didn't like one aspect of it - having to call SetFirstKeyOfRow on every row seen by the cFetcher, even when not using the direct columnar scans. I think it could have a noticeable performance impact, so I added another WIP commit which eliminates this function call on every SQL row and still decouples things as you suggested.

Overall, I probably like this change. My only hesitation is that now the trimming logic is split into two places - we rely on synchronizing singleResults returning the appropriate first key of the trimmed row and the cFetcher actually removing that row. However, the interfaces do seem cleaner, so I'm curious to hear what everyone thinks.


pkg/storage/col_mvcc.go line 110 at r7 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

[nit] stable now just depends on the implementation and is always true or false, right? Consider either making it a separate method on the interface or making it a field on the cFetcher itself.

Good point, extracted into another method in NextKVer interface.

Copy link
Collaborator

@DrewKimball DrewKimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2, @sumeerbhola, and @yuzefovich)


pkg/sql/colfetcher/cfetcher.go line 812 at r6 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Thanks for pushing on this. I implemented your suggestion in first WIP commit, but I didn't like one aspect of it - having to call SetFirstKeyOfRow on every row seen by the cFetcher, even when not using the direct columnar scans. I think it could have a noticeable performance impact, so I added another WIP commit which eliminates this function call on every SQL row and still decouples things as you suggested.

Overall, I probably like this change. My only hesitation is that now the trimming logic is split into two places - we rely on synchronizing singleResults returning the appropriate first key of the trimmed row and the cFetcher actually removing that row. However, the interfaces do seem cleaner, so I'm curious to hear what everyone thinks.

One last option - we could make the first key of the row an argument to NextKV.

Anyway, I think we've been pretty thorough at this point. I think I'd be happy with either of the WIPs you added. I'll do a final pass once you've finalized this bit but LGTM for now.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nits about comments. I am happy with the storage interfaces and code, so feel free to merge when @DrewKimball is satisfied.
:lgtm:

Reviewed 1 of 3 files at r10.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball, @michae2, and @yuzefovich)


pkg/storage/col_mvcc.go line 246 at r6 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

This part was very confusing - although the KV is unstable, in some cases the cFetcher doesn't need to hold onto it after the following NextKV call, so the copying is not needed (if there is only one column family in the table). The complication was that a copy needed to be performed for a very different reason in the old code path (in the row.KVFetcher). I refactored the code to clean things up, and now the implementation here should make more sense. In particular, we now just say that the returned KV is unstable, and the cFetcher itself decides whether to deep copy it or not.

Looks much cleaner now.


pkg/storage/col_mvcc.go line 124 at r10 (raw file):

	// When (ok=false,partialRow=true) is returned, the caller is expected to
	// discard all KVs that were part of the last SQL row that was incomplete.
	// The scan will be resumed from the last key provided by SetFirstKeyOfRow

was this supposed to say FirstKeyOfRowGetter?


pkg/storage/pebble_mvcc_scanner.go line 76 at r6 (raw file):

To me saying "before calling put() with the same key" includes both of the cases - i.e. the current getOne call hasn't called put with this key, nor the previous getOne calls have - which leads to this correct assumption

getOne could call put with key a and then enquire about key b, even though it has no intention of call put with key b in this call. Seems something that is permitted by this contract.
Even if it was not permitted, anything like this that is not obvious is worth spelling out in a code comment so that future readers can understand this super quickly.
Can you add something here about both the contract and how it implies that nothing is buffered in the singleResults implementation of results. I don't mind comments here that refer to any of the implementations since it helps folks understand why these constraints are being defined. Abstraction is good in terms of having clean code but we often want to understand why this abstraction its various invariants.

Copy link
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @michae2, and @sumeerbhola)


pkg/sql/vars.go line 490 at r6 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

IIRC ALTER ROLE ALL doesn't affect root user, and for logic tests I think we do use root often. The direct columnar scans need to be disabled for some types of logic tests, and it seems like we do need an explicit cluster setting to handle that. Perhaps we'd just need to do both ALTER ROLE ALL and ALTER ROLE root. Happy to be taught here since the choice of using the cluster setting can be driven simply by me being used to this way.

My current understanding of the situation is that we no longer allow cluster settings with sql.defaults prefix and perhaps we stir away engineers from introducing public cluster settings that also have corresponding session variables. In this case, this is a private undocumented setting, and I think it's ok to keep it that way.

That said, I did ask in slack, will see what SQL Session team recommends.


pkg/sql/colfetcher/cfetcher.go line 812 at r6 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

One last option - we could make the first key of the row an argument to NextKV.

Anyway, I think we've been pretty thorough at this point. I think I'd be happy with either of the WIPs you added. I'll do a final pass once you've finalized this bit but LGTM for now.

I like Getter method approach the most - passing a key when calling NextKV (which returns a key) seems wrong.


pkg/storage/col_mvcc.go line 124 at r10 (raw file):

Previously, sumeerbhola wrote…

was this supposed to say FirstKeyOfRowGetter?

Yes, this was a leftover, fixed.


pkg/storage/pebble_mvcc_scanner.go line 76 at r6 (raw file):

Previously, sumeerbhola wrote…

To me saying "before calling put() with the same key" includes both of the cases - i.e. the current getOne call hasn't called put with this key, nor the previous getOne calls have - which leads to this correct assumption

getOne could call put with key a and then enquire about key b, even though it has no intention of call put with key b in this call. Seems something that is permitted by this contract.
Even if it was not permitted, anything like this that is not obvious is worth spelling out in a code comment so that future readers can understand this super quickly.
Can you add something here about both the contract and how it implies that nothing is buffered in the singleResults implementation of results. I don't mind comments here that refer to any of the implementations since it helps folks understand why these constraints are being defined. Abstraction is good in terms of having clean code but we often want to understand why this abstraction its various invariants.

Thanks, makes sense, added this information into the comments.

Copy link
Collaborator

@DrewKimball DrewKimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Nicely done!

Reviewed 8 of 22 files at r7, 10 of 10 files at r11, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2, @sumeerbhola, and @yuzefovich)

@yuzefovich
Copy link
Member Author

I've got some performance numbers to share.

The microbenchmarks are as expected:

  • Count, SkipScan, and WideTableIgnoreColumns see a speed up for separate-process tenants. These benchmarks select only a subset of needed columns, so we're able to significantly reduce the amount of data to be transferred across process boundaries.
  • Scan and ScanFilter select all of the columns in the table, so we see a minor performance hit.
  • in-process tenants also see a minor performance hit. This will be improved in storage,colfetcher: implement local fast-path for COL_BATCH_RESPONSE #95033 where we will introduce a local fastpath that will avoid serialization of coldata.Batches
  • ScanFilter/limit=50 has the largest regression probably because we don't (yet) propagate the estimated row count. This should be fixed in 23.1.

TPCH numbers though aren't as good:

Q1:	before: 11.47s	after: 8.84s	 -22.89%
Q2:	before: 0.41s	after: 0.29s	 -27.71%
Q3:	before: 7.89s	after: 9.68s	 22.63%
Q4:	before: 4.48s	after: 4.52s	 0.86%
Q5:	before: 10.39s	after: 10.35s	 -0.29%
Q6:	before: 33.57s	after: 33.41s	 -0.48%
Q7:	before: 23.82s	after: 23.81s	 -0.02%
Q8:	before: 3.78s	after: 3.76s	 -0.68%
Q9:	before: 28.15s	after: 28.03s	 -0.42%
Q10:	before: 5.00s	after: 4.98s	 -0.42%
Q11:	before: 2.44s	after: 2.44s	 0.22%
Q12:	before: 34.78s	after: 34.65s	 -0.37%
Q13:	before: 3.20s	after: 2.94s	 -8.28%
Q14:	before: 3.13s	after: 3.21s	 2.43%
Q15:	before: 16.80s	after: 16.73s	 -0.38%
Q16:	before: 1.60s	after: 1.65s	 2.96%
Q17:	before: 0.85s	after: 0.96s	 13.04%
Q18:	before: 16.39s	after: 15.47s	 -5.61%
Q19:	before: 13.76s	after: 13.01s	 -5.45%
Q20:	before: 55.33s	after: 55.12s	 -0.38%
Q21:	before: 24.31s	after: 24.31s	 -0.00%
Q22:	before: 1.28s	after: 1.41s	 10.26%

This is a bit disappointing given the original very good numbers some time ago in the prototype. I did a bisect last week and couldn't reproduce those results. I'll do another bisect tomorrow, but most likely I messed something up when obtaining those numbers (although I do remember collecting them twice).

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @michae2, and @yuzefovich)


pkg/storage/pebble_mvcc_scanner.go line 80 at r11 (raw file):

	// returned on the NextKVer.NextKV call). Therefore, the singleResults can
	// make a determination on its own whether the given key belongs to the
	// first SQL row.

Now that singleResults is making a determination on its own, this constraint is not needed, yes?
It copies the first row eagerly in put, so even if that KV is still buffered, it is ok.

Copy link
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @DrewKimball, @michae2, and @sumeerbhola)


pkg/storage/pebble_mvcc_scanner.go line 80 at r11 (raw file):

Previously, sumeerbhola wrote…

Now that singleResults is making a determination on its own, this constraint is not needed, yes?
It copies the first row eagerly in put, so even if that KV is still buffered, it is ok.

Yes, you're right. I tightened the contract here a bit further since continuesFirstRow is only called if put has already been called at least once.

@yuzefovich
Copy link
Member Author

For reference, with the local fastpath for in-memory tenants (WIP in #95033) the benchmarks are here (the comparison is master vs this PR + #95033 with direct columnar scans always enabled). They show that for in-memory tenants (this is how UA will be achieved for self-hosted customers) the KV projection pushdown work already has benefits in some scenarios, and I hope with more targeted optimization work we can improve there noticeably.

@yuzefovich
Copy link
Member Author

Finally, I was able to reproduce the "great" results I obtained on the prototype two months ago. In short, they were due to a subtle bug in col_mvcc.go which didn't impact the correctness.

In particular, as can be seen in #95793 (ignoring the last commit) the prototype had the following:

	//res.NumKeys = mvccScanner.results.count
	//res.NumBytes = mvccScanner.results.bytes

after processing the Scan request via the direct columnar scans. As a result, after evaluating the Scan we didn't update limits of the BatchRequest (namely MaxSpanRequestKeys and TargetBytes) in DistSender.divideAndSendBatchToRanges which meant that these limits were applied on a per-ScanRequest basis rather than per-BatchRequest basis. This meant that we had to do fewer KV calls from the SQL land which improved the performance. The downside though is that the stability suffered since we could now receive a BatchResponse that exceeded TargetBytes significantly. This wasn't caught since the query results were still correct, we were effectively issuing BatchRequests with larger limits. I didn't run any "stability" tests on the prototype that would catch this.

Once this bug is fixed (i.e. the last commit in #95793 is applied), we get the results that show effectively no difference of the prototype on TPCH queries:

Q1:	before: 8.57s	after: 8.41s	 -1.84%
Q2:	before: 0.29s	after: 0.29s	 1.39%
Q3:	before: 9.55s	after: 9.68s	 1.35%
Q4:	before: 4.45s	after: 4.49s	 0.92%
Q5:	before: 10.39s	after: 10.35s	 -0.32%
Q6:	before: 33.58s	after: 33.46s	 -0.34%
Q7:	before: 23.82s	after: 23.82s	 0.02%
Q8:	before: 3.79s	after: 3.76s	 -0.74%
Q9:	before: 28.13s	after: 28.01s	 -0.45%
Q10:	before: 4.99s	after: 4.98s	 -0.32%
Q11:	before: 2.44s	after: 2.45s	 0.29%
Q12:	before: 34.75s	after: 34.63s	 -0.35%
Q13:	before: 3.15s	after: 2.93s	 -6.93%
Q14:	before: 3.27s	after: 3.24s	 -1.10%
Q15:	before: 16.82s	after: 16.73s	 -0.52%
Q16:	before: 1.80s	after: 1.71s	 -5.00%
Q17:	before: 1.01s	after: 0.94s	 -6.55%
Q18:	before: 30.14s	after: 28.46s	 -5.56%
Q19:	before: 13.77s	after: 13.01s	 -5.51%
Q20:	before: 55.32s	after: 55.12s	 -0.36%
Q21:	before: 24.32s	after: 24.33s	 0.05%
Q22:	before: 1.48s	after: 1.46s	 -1.69%

(It's interesting to observe that on the current polished PR there are some noticeable improvements and noticeable regressions, I haven't explored this in-depth though, and will do so in the next milestone.)

This begs the question whether we should increase these limits since it has such non-trivial impact on the performance (TBD whether this impact is more pronounced in the multi-tenant setups and whether the usage of the direct columnar scans play a role too). Perhaps, if we set GOMEMLIMIT on default, increasing these limits (I'm thinking mostly about the default value of TargetBytes which is currently 10MiB) will not significantly impact the stability. However, this is an orthogonal question.


Thus, I think there are no more blockers for this PR to be merged. It shows an expected improvement in microbenchmarks when only a subset of columns is needed in multi-tenant separate-process setups. I'll wait for Michael's approval.

@DrewKimball
Copy link
Collaborator

This begs the question whether we should increase these limits since it has such non-trivial impact on the performance (TBD whether this impact is more pronounced in the multi-tenant setups and whether the usage of the direct columnar scans play a role too). Perhaps, if we set GOMEMLIMIT on default, increasing these limits (I'm thinking mostly about the default value of TargetBytes which is currently 10MiB) will not significantly impact the stability. However, this is an orthogonal question.

Maybe this isn't the best place to discuss, but I will anyway :) Some memory usage in queries is hard to throttle up and down, but it seems like batch sizing isn't one of those things - the ResetMaybeReallocate stuff we already have does this pretty naturally. Maybe we could have some visibility at the operator level of how much memory is being used at a higher level, e.g. by all SQL queries, or maybe just by the current query. We could increase the actual limit from the default when overall memory usage is low, and decrease again when resource use is high (like would be the case for many concurrent queries).

yuzefovich added a commit to yuzefovich/cockroach that referenced this pull request Jan 25, 2023
This commit introduces a new `COL_BATCH_RESPONSE` scan format for Scans
and ReverseScans which results only in needed columns to be returned
from the KV server. In other words, this commit introduces the ability
to perform the KV projection pushdown.

The main idea of this feature is to use the injected decoding logic from
SQL in order to process each KV and keep only the needed parts (i.e.
necessary SQL columns). Those needed parts are then propagated back to
the KV client as coldata.Batch'es (serialized in the Apache Arrow format).

Here is the outline of all components involved:

     ┌────────────────────────────────────────────────┐
     │                       SQL                      │
     │________________________________________________│
     │          colfetcher.ColBatchDirectScan         │
     │                        │                       │
     │                        ▼                       │
     │                 row.txnKVFetcher               │
     │    (behind the row.KVBatchFetcher interface)   │
     └────────────────────────────────────────────────┘
                              │
                              ▼
     ┌────────────────────────────────────────────────┐
     │                    KV Client                   │
     └────────────────────────────────────────────────┘
                              │
                              ▼
     ┌────────────────────────────────────────────────┐
     │                    KV Server                   │
     │________________________________________________│
     │           colfetcher.cFetcherWrapper           │
     │ (behind the storage.CFetcherWrapper interface) │
     │                        │                       │
     │                        ▼                       │
     │              colfetcher.cFetcher               │
     │                        │                       │
     │                        ▼                       │
     │          storage.mvccScanFetchAdapter ────────┐│
     │    (behind the storage.NextKVer interface)    ││
     │                        │                      ││
     │                        ▼                      ││
     │           storage.pebbleMVCCScanner           ││
     │ (which put's KVs into storage.singleResults) <┘│
     └────────────────────────────────────────────────┘

On the KV client side, `row.txnKVFetcher` issues Scans and ReverseScans with
the `COL_BATCH_RESPONSE` format and returns the response (which contains the
columnar data) to the `colfetcher.ColBatchDirectScan`.

On the KV server side, we create a `storage.CFetcherWrapper` that asks the
`colfetcher.cFetcher` for the next `coldata.Batch`. The `cFetcher`, in turn,
fetches the next KV, decodes it, and keeps only values for the needed SQL
columns, discarding the rest of the KV. The KV is emitted by the
`mvccScanFetchAdapter` which - via the `singleResults` struct - exposes
access to the current KV that the `pebbleMVCCScanner` is pointing at.

Note that there is an additional "implicit synchronization" between
components that is not shown on this diagram. In particular,
`storage.singleResults.maybeTrimPartialLastRow` must be in sync with the
`colfetcher.cFetcher` which is achieved by
- the `cFetcher` exposing access to the first key of the last incomplete SQL
  row via the `FirstKeyOfRowGetter`,
- the `singleResults` using that key as the resume key for the response,
- and the `cFetcher` removing that last partial SQL row when `NextKV()`
  returns `partialRow=true`.
This "upstream" link (although breaking the layering a bit) allows us to
avoid a performance penalty for handling the case with multiple column
families. (This case is handled by the `storage.pebbleResults` via tracking
offsets into the `pebbleResults.repr`.)

This code structure deserves some elaboration. First, there is a mismatch
between the "push" mode in which the `pebbleMVCCScanner` operates and the
"pull" mode that the `NextKVer` exposes. The adaption between two different
modes is achieved via the `mvccScanFetcherAdapter` grabbing (when the control
returns to it) the current unstable KV pair from the `singleResults` struct
which serves as a one KV pair buffer that the `pebbleMVCCScanner` `put`s into.
Second, in order be able to use the unstable KV pair without performing a
copy, the `pebbleMVCCScanner` stops at the current KV pair and returns the
control flow (which is exactly what `pebbleMVCCScanner.getOne` does) back to
the `mvccScanFetcherAdapter`, with the adapter advancing the scanner only when
the next KV pair is needed.

There are multiple scenarios which are currently not supported:
- SQL cannot issue Get requests (likely will support in 23.1)
- `TraceKV` option is not supported (likely will support in 23.1)
- user-defined types other than enums are not supported (will _not_
support in 23.1)
- non-default key locking strength as well as SKIP LOCKED wait policy
are not supported (will _not_ support in 23.1).

The usage of this feature is currently disabled by default, but I intend
to enable it by default for multi-tenant setups. The rationale is that
currently there is a large performance hit when enabling it for
single-tenant deployments whereas it offers significant speed up in the
multi-tenant world.

The microbenchmarks [show](https://gist.github.com/yuzefovich/669c295a8a4fdffa6490532284c5a719)
the expected improvement in multi-tenant setups when the tenant runs in
a separate process whenever we don't need to decode all of the columns
from the table.

The TPCH numbers, though, don't show the expected speedup:
```
Q1:	before: 11.47s	after: 8.84s	 -22.89%
Q2:	before: 0.41s	after: 0.29s	 -27.71%
Q3:	before: 7.89s	after: 9.68s	 22.63%
Q4:	before: 4.48s	after: 4.52s	 0.86%
Q5:	before: 10.39s	after: 10.35s	 -0.29%
Q6:	before: 33.57s	after: 33.41s	 -0.48%
Q7:	before: 23.82s	after: 23.81s	 -0.02%
Q8:	before: 3.78s	after: 3.76s	 -0.68%
Q9:	before: 28.15s	after: 28.03s	 -0.42%
Q10:	before: 5.00s	after: 4.98s	 -0.42%
Q11:	before: 2.44s	after: 2.44s	 0.22%
Q12:	before: 34.78s	after: 34.65s	 -0.37%
Q13:	before: 3.20s	after: 2.94s	 -8.28%
Q14:	before: 3.13s	after: 3.21s	 2.43%
Q15:	before: 16.80s	after: 16.73s	 -0.38%
Q16:	before: 1.60s	after: 1.65s	 2.96%
Q17:	before: 0.85s	after: 0.96s	 13.04%
Q18:	before: 16.39s	after: 15.47s	 -5.61%
Q19:	before: 13.76s	after: 13.01s	 -5.45%
Q20:	before: 55.33s	after: 55.12s	 -0.38%
Q21:	before: 24.31s	after: 24.31s	 -0.00%
Q22:	before: 1.28s	after: 1.41s	 10.26%
```
Note that much better numbers were obtained on a buggy prototype. Those
numbers are invalid, and more details can be found
[here](cockroachdb#94438 (comment)).

At the moment, `coldata.Batch` that is included into the response is
always serialized into the Arrow format, but I intend to introduce the
local fastpath to avoid that serialization. That work will be done in
a follow-up and should be able to reduce the perf hit for single-tenant
deployments.

A quick note on the TODOs sprinkled in this commit:
- `TODO(yuzefovich)` means that this will be left for 23.2 or later.
- `TODO(yuzefovich, 23.1)` means that it should be addressed in 23.1.

A quick note on testing: this commit randomizes the fact whether the new
infrastructure is used in almost all test builds. Introducing some unit
testing (say, in `storage` package) seems rather annoying since we must
create keys that are valid SQL keys (i.e. have TableID / Index ID
prefix) and need to come with the corresponding
`fetchpb.IndexFetchSpec`. Not having unit tests in the `storage` seems
ok to me given that the "meat" of the work there is still done by the
`pebbleMVCCScanner` which is exercised using the regular Scans.
End-to-end testing is well covered by all of our existing tests which
now runs randomly. I did run the CI multiple times with the new feature
enabled by default with no failure, so I hope that it shouldn't become
flaky.

Release note: None
@yuzefovich
Copy link
Member Author

Some memory usage in queries is hard to throttle up and down, but it seems like batch sizing isn't one of those things - the ResetMaybeReallocate stuff we already have does this pretty naturally. Maybe we could have some visibility at the operator level of how much memory is being used at a higher level, e.g. by all SQL queries, or maybe just by the current query. We could increase the actual limit from the default when overall memory usage is low, and decrease again when resource use is high (like would be the case for many concurrent queries).

This sounds reasonable. However, setting reasonable value for GOMEMLIMIT by default seems like a must-have prerequisite to me for doing any of this work, so I don't think we'll be able to do anything here in 23.1.

Also, I just ran some quick benchmarks (using tpchvec/bench and multitenant/tpch) while varying the default value for rowinfra.defaultBatchBytesLimitProductionValue (which is where 10MiB default comes from), and they didn't show significant improvement when bumping it up to 20MiB or 100MiB, neither in single-tenant nor in multi-tenant world. Perhaps my explanation in #94438 (comment) is incomplete. This means that further investigation is needed to understand how exactly not setting NumKeys and/or NumBytes leads to the perf speedup.

Still, we now know for sure how those "great" numbers from the prototype were obtained.

Copy link
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Nice work! Thanks for waiting for me. Thank you for reworking those layer violations.

Interesting about the NumKeys / NumBytes omission causing the performance difference. Seems really promising, if we can figure out what else they affect besides BatchRequest limits.

Reviewed 2 of 55 files at r1, 2 of 15 files at r5, 11 of 22 files at r7, 7 of 10 files at r11, 2 of 3 files at r12, 10 of 11 files at r13, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 2 stale) (waiting on @DrewKimball, @sumeerbhola, and @yuzefovich)


pkg/sql/vars.go line 490 at r6 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

My current understanding of the situation is that we no longer allow cluster settings with sql.defaults prefix and perhaps we stir away engineers from introducing public cluster settings that also have corresponding session variables. In this case, this is a private undocumented setting, and I think it's ok to keep it that way.

That said, I did ask in slack, will see what SQL Session team recommends.

TIL about ALTER ROLE not working for root. Ok, this is fine.


pkg/storage/col_mvcc.go line 229 at r13 (raw file):

		// The KV wasn't added for whatever reason (e.g. it could have been
		// skipped over due to having been deleted), so just move on.
		return f.NextKV(ctx, mvccDecodingStrategy)

nit: Using recursion here makes me a little nervous. I don't think Go has tail-call optimization, so the stack could grow deep if we have to skip over many KVs. Instead of recursion could we use either goto or a loop?

This commit introduces a new `COL_BATCH_RESPONSE` scan format for Scans
and ReverseScans which results only in needed columns to be returned
from the KV server. In other words, this commit introduces the ability
to perform the KV projection pushdown.

The main idea of this feature is to use the injected decoding logic from
SQL in order to process each KV and keep only the needed parts (i.e.
necessary SQL columns). Those needed parts are then propagated back to
the KV client as coldata.Batch'es (serialized in the Apache Arrow format).

Here is the outline of all components involved:

     ┌────────────────────────────────────────────────┐
     │                       SQL                      │
     │________________________________________________│
     │          colfetcher.ColBatchDirectScan         │
     │                        │                       │
     │                        ▼                       │
     │                 row.txnKVFetcher               │
     │    (behind the row.KVBatchFetcher interface)   │
     └────────────────────────────────────────────────┘
                              │
                              ▼
     ┌────────────────────────────────────────────────┐
     │                    KV Client                   │
     └────────────────────────────────────────────────┘
                              │
                              ▼
     ┌────────────────────────────────────────────────┐
     │                    KV Server                   │
     │________________________________________________│
     │           colfetcher.cFetcherWrapper           │
     │ (behind the storage.CFetcherWrapper interface) │
     │                        │                       │
     │                        ▼                       │
     │              colfetcher.cFetcher               │
     │                        │                       │
     │                        ▼                       │
     │          storage.mvccScanFetchAdapter ────────┐│
     │    (behind the storage.NextKVer interface)    ││
     │                        │                      ││
     │                        ▼                      ││
     │           storage.pebbleMVCCScanner           ││
     │ (which put's KVs into storage.singleResults) <┘│
     └────────────────────────────────────────────────┘

On the KV client side, `row.txnKVFetcher` issues Scans and ReverseScans with
the `COL_BATCH_RESPONSE` format and returns the response (which contains the
columnar data) to the `colfetcher.ColBatchDirectScan`.

On the KV server side, we create a `storage.CFetcherWrapper` that asks the
`colfetcher.cFetcher` for the next `coldata.Batch`. The `cFetcher`, in turn,
fetches the next KV, decodes it, and keeps only values for the needed SQL
columns, discarding the rest of the KV. The KV is emitted by the
`mvccScanFetchAdapter` which - via the `singleResults` struct - exposes
access to the current KV that the `pebbleMVCCScanner` is pointing at.

Note that there is an additional "implicit synchronization" between
components that is not shown on this diagram. In particular,
`storage.singleResults.maybeTrimPartialLastRow` must be in sync with the
`colfetcher.cFetcher` which is achieved by
- the `cFetcher` exposing access to the first key of the last incomplete SQL
  row via the `FirstKeyOfRowGetter`,
- the `singleResults` using that key as the resume key for the response,
- and the `cFetcher` removing that last partial SQL row when `NextKV()`
  returns `partialRow=true`.
This "upstream" link (although breaking the layering a bit) allows us to
avoid a performance penalty for handling the case with multiple column
families. (This case is handled by the `storage.pebbleResults` via tracking
offsets into the `pebbleResults.repr`.)

This code structure deserves some elaboration. First, there is a mismatch
between the "push" mode in which the `pebbleMVCCScanner` operates and the
"pull" mode that the `NextKVer` exposes. The adaption between two different
modes is achieved via the `mvccScanFetcherAdapter` grabbing (when the control
returns to it) the current unstable KV pair from the `singleResults` struct
which serves as a one KV pair buffer that the `pebbleMVCCScanner` `put`s into.
Second, in order be able to use the unstable KV pair without performing a
copy, the `pebbleMVCCScanner` stops at the current KV pair and returns the
control flow (which is exactly what `pebbleMVCCScanner.getOne` does) back to
the `mvccScanFetcherAdapter`, with the adapter advancing the scanner only when
the next KV pair is needed.

There are multiple scenarios which are currently not supported:
- SQL cannot issue Get requests (likely will support in 23.1)
- `TraceKV` option is not supported (likely will support in 23.1)
- user-defined types other than enums are not supported (will _not_
support in 23.1)
- non-default key locking strength as well as SKIP LOCKED wait policy
are not supported (will _not_ support in 23.1).

The usage of this feature is currently disabled by default, but I intend
to enable it by default for multi-tenant setups. The rationale is that
currently there is a large performance hit when enabling it for
single-tenant deployments whereas it offers significant speed up in the
multi-tenant world.

The microbenchmarks [show](https://gist.github.com/yuzefovich/669c295a8a4fdffa6490532284c5a719)
the expected improvement in multi-tenant setups when the tenant runs in
a separate process whenever we don't need to decode all of the columns
from the table.

The TPCH numbers, though, don't show the expected speedup:
```
Q1:	before: 11.47s	after: 8.84s	 -22.89%
Q2:	before: 0.41s	after: 0.29s	 -27.71%
Q3:	before: 7.89s	after: 9.68s	 22.63%
Q4:	before: 4.48s	after: 4.52s	 0.86%
Q5:	before: 10.39s	after: 10.35s	 -0.29%
Q6:	before: 33.57s	after: 33.41s	 -0.48%
Q7:	before: 23.82s	after: 23.81s	 -0.02%
Q8:	before: 3.78s	after: 3.76s	 -0.68%
Q9:	before: 28.15s	after: 28.03s	 -0.42%
Q10:	before: 5.00s	after: 4.98s	 -0.42%
Q11:	before: 2.44s	after: 2.44s	 0.22%
Q12:	before: 34.78s	after: 34.65s	 -0.37%
Q13:	before: 3.20s	after: 2.94s	 -8.28%
Q14:	before: 3.13s	after: 3.21s	 2.43%
Q15:	before: 16.80s	after: 16.73s	 -0.38%
Q16:	before: 1.60s	after: 1.65s	 2.96%
Q17:	before: 0.85s	after: 0.96s	 13.04%
Q18:	before: 16.39s	after: 15.47s	 -5.61%
Q19:	before: 13.76s	after: 13.01s	 -5.45%
Q20:	before: 55.33s	after: 55.12s	 -0.38%
Q21:	before: 24.31s	after: 24.31s	 -0.00%
Q22:	before: 1.28s	after: 1.41s	 10.26%
```
Note that much better numbers were obtained on a buggy prototype. Those
numbers are invalid, and more details can be found
[here](cockroachdb#94438 (comment)).

At the moment, `coldata.Batch` that is included into the response is
always serialized into the Arrow format, but I intend to introduce the
local fastpath to avoid that serialization. That work will be done in
a follow-up and should be able to reduce the perf hit for single-tenant
deployments.

A quick note on the TODOs sprinkled in this commit:
- `TODO(yuzefovich)` means that this will be left for 23.2 or later.
- `TODO(yuzefovich, 23.1)` means that it should be addressed in 23.1.

A quick note on testing: this commit randomizes the fact whether the new
infrastructure is used in almost all test builds. Introducing some unit
testing (say, in `storage` package) seems rather annoying since we must
create keys that are valid SQL keys (i.e. have TableID / Index ID
prefix) and need to come with the corresponding
`fetchpb.IndexFetchSpec`. Not having unit tests in the `storage` seems
ok to me given that the "meat" of the work there is still done by the
`pebbleMVCCScanner` which is exercised using the regular Scans.
End-to-end testing is well covered by all of our existing tests which
now runs randomly. I did run the CI multiple times with the new feature
enabled by default with no failure, so I hope that it shouldn't become
flaky.

Release note: None
@yuzefovich yuzefovich requested a review from a team January 25, 2023 18:01
Copy link
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I needed to adjust one of the upgrade tests which was incorrectly written.

Thanks for the reviews everyone!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 3 stale) (waiting on @DrewKimball, @michae2, and @sumeerbhola)


pkg/storage/col_mvcc.go line 229 at r13 (raw file):

Previously, michae2 (Michael Erickson) wrote…

nit: Using recursion here makes me a little nervous. I don't think Go has tail-call optimization, so the stack could grow deep if we have to skip over many KVs. Instead of recursion could we use either goto or a loop?

I didn't see any usages of goto in our codebase, so I opted for the loop. However, I'm a bit concerned that the loop will have noticeable performance impact, so I left a TODO to explore this - if it does have a noticeable impact, then I'd rather keep the recursion since for the recursion to lead to a crash we'd need like 1 billion KVs to be skipped which I'm assuming should not be possible if the compaction are working as intended.

@craig craig bot merged commit 2105cd4 into cockroachdb:master Jan 25, 2023
@craig
Copy link
Contributor

craig bot commented Jan 25, 2023

Build succeeded:

@yuzefovich yuzefovich deleted the kv-pushdown-all branch January 25, 2023 20:34
yuzefovich added a commit to yuzefovich/cockroach that referenced this pull request Jan 26, 2023
This commit introduces a new `COL_BATCH_RESPONSE` scan format for Scans
and ReverseScans which results only in needed columns to be returned
from the KV server. In other words, this commit introduces the ability
to perform the KV projection pushdown.

The main idea of this feature is to use the injected decoding logic from
SQL in order to process each KV and keep only the needed parts (i.e.
necessary SQL columns). Those needed parts are then propagated back to
the KV client as coldata.Batch'es (serialized in the Apache Arrow format).

Here is the outline of all components involved:

     ┌────────────────────────────────────────────────┐
     │                       SQL                      │
     │________________________________________________│
     │          colfetcher.ColBatchDirectScan         │
     │                        │                       │
     │                        ▼                       │
     │                 row.txnKVFetcher               │
     │    (behind the row.KVBatchFetcher interface)   │
     └────────────────────────────────────────────────┘
                              │
                              ▼
     ┌────────────────────────────────────────────────┐
     │                    KV Client                   │
     └────────────────────────────────────────────────┘
                              │
                              ▼
     ┌────────────────────────────────────────────────┐
     │                    KV Server                   │
     │________________________________________________│
     │           colfetcher.cFetcherWrapper           │
     │ (behind the storage.CFetcherWrapper interface) │
     │                        │                       │
     │                        ▼                       │
     │              colfetcher.cFetcher               │
     │                        │                       │
     │                        ▼                       │
     │          storage.mvccScanFetchAdapter ────────┐│
     │    (behind the storage.NextKVer interface)    ││
     │                        │                      ││
     │                        ▼                      ││
     │           storage.pebbleMVCCScanner           ││
     │ (which put's KVs into storage.singleResults) <┘│
     └────────────────────────────────────────────────┘

On the KV client side, `row.txnKVFetcher` issues Scans and ReverseScans with
the `COL_BATCH_RESPONSE` format and returns the response (which contains the
columnar data) to the `colfetcher.ColBatchDirectScan`.

On the KV server side, we create a `storage.CFetcherWrapper` that asks the
`colfetcher.cFetcher` for the next `coldata.Batch`. The `cFetcher`, in turn,
fetches the next KV, decodes it, and keeps only values for the needed SQL
columns, discarding the rest of the KV. The KV is emitted by the
`mvccScanFetchAdapter` which - via the `singleResults` struct - exposes
access to the current KV that the `pebbleMVCCScanner` is pointing at.

Note that there is an additional "implicit synchronization" between
components that is not shown on this diagram. In particular,
`storage.singleResults.maybeTrimPartialLastRow` must be in sync with the
`colfetcher.cFetcher` which is achieved by
- the `cFetcher` exposing access to the first key of the last incomplete SQL
  row via the `FirstKeyOfRowGetter`,
- the `singleResults` using that key as the resume key for the response,
- and the `cFetcher` removing that last partial SQL row when `NextKV()`
  returns `partialRow=true`.
This "upstream" link (although breaking the layering a bit) allows us to
avoid a performance penalty for handling the case with multiple column
families. (This case is handled by the `storage.pebbleResults` via tracking
offsets into the `pebbleResults.repr`.)

This code structure deserves some elaboration. First, there is a mismatch
between the "push" mode in which the `pebbleMVCCScanner` operates and the
"pull" mode that the `NextKVer` exposes. The adaption between two different
modes is achieved via the `mvccScanFetcherAdapter` grabbing (when the control
returns to it) the current unstable KV pair from the `singleResults` struct
which serves as a one KV pair buffer that the `pebbleMVCCScanner` `put`s into.
Second, in order be able to use the unstable KV pair without performing a
copy, the `pebbleMVCCScanner` stops at the current KV pair and returns the
control flow (which is exactly what `pebbleMVCCScanner.getOne` does) back to
the `mvccScanFetcherAdapter`, with the adapter advancing the scanner only when
the next KV pair is needed.

There are multiple scenarios which are currently not supported:
- SQL cannot issue Get requests (likely will support in 23.1)
- `TraceKV` option is not supported (likely will support in 23.1)
- user-defined types other than enums are not supported (will _not_
support in 23.1)
- non-default key locking strength as well as SKIP LOCKED wait policy
are not supported (will _not_ support in 23.1).

The usage of this feature is currently disabled by default, but I intend
to enable it by default for multi-tenant setups. The rationale is that
currently there is a large performance hit when enabling it for
single-tenant deployments whereas it offers significant speed up in the
multi-tenant world.

The microbenchmarks [show](https://gist.github.com/yuzefovich/669c295a8a4fdffa6490532284c5a719)
the expected improvement in multi-tenant setups when the tenant runs in
a separate process whenever we don't need to decode all of the columns
from the table.

The TPCH numbers, though, don't show the expected speedup:
```
Q1:	before: 11.47s	after: 8.84s	 -22.89%
Q2:	before: 0.41s	after: 0.29s	 -27.71%
Q3:	before: 7.89s	after: 9.68s	 22.63%
Q4:	before: 4.48s	after: 4.52s	 0.86%
Q5:	before: 10.39s	after: 10.35s	 -0.29%
Q6:	before: 33.57s	after: 33.41s	 -0.48%
Q7:	before: 23.82s	after: 23.81s	 -0.02%
Q8:	before: 3.78s	after: 3.76s	 -0.68%
Q9:	before: 28.15s	after: 28.03s	 -0.42%
Q10:	before: 5.00s	after: 4.98s	 -0.42%
Q11:	before: 2.44s	after: 2.44s	 0.22%
Q12:	before: 34.78s	after: 34.65s	 -0.37%
Q13:	before: 3.20s	after: 2.94s	 -8.28%
Q14:	before: 3.13s	after: 3.21s	 2.43%
Q15:	before: 16.80s	after: 16.73s	 -0.38%
Q16:	before: 1.60s	after: 1.65s	 2.96%
Q17:	before: 0.85s	after: 0.96s	 13.04%
Q18:	before: 16.39s	after: 15.47s	 -5.61%
Q19:	before: 13.76s	after: 13.01s	 -5.45%
Q20:	before: 55.33s	after: 55.12s	 -0.38%
Q21:	before: 24.31s	after: 24.31s	 -0.00%
Q22:	before: 1.28s	after: 1.41s	 10.26%
```
Note that much better numbers were obtained on a buggy prototype. Those
numbers are invalid, and more details can be found
[here](cockroachdb#94438 (comment)).

At the moment, `coldata.Batch` that is included into the response is
always serialized into the Arrow format, but I intend to introduce the
local fastpath to avoid that serialization. That work will be done in
a follow-up and should be able to reduce the perf hit for single-tenant
deployments.

A quick note on the TODOs sprinkled in this commit:
- `TODO(yuzefovich)` means that this will be left for 23.2 or later.
- `TODO(yuzefovich, 23.1)` means that it should be addressed in 23.1.

A quick note on testing: this commit randomizes the fact whether the new
infrastructure is used in almost all test builds. Introducing some unit
testing (say, in `storage` package) seems rather annoying since we must
create keys that are valid SQL keys (i.e. have TableID / Index ID
prefix) and need to come with the corresponding
`fetchpb.IndexFetchSpec`. Not having unit tests in the `storage` seems
ok to me given that the "meat" of the work there is still done by the
`pebbleMVCCScanner` which is exercised using the regular Scans.
End-to-end testing is well covered by all of our existing tests which
now runs randomly. I did run the CI multiple times with the new feature
enabled by default with no failure, so I hope that it shouldn't become
flaky.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants