sql,storage: add support for COL_BATCH_RESPONSE scan format
This commit introduces a new `COL_BATCH_RESPONSE` scan format for Scans
and ReverseScans which results in only the needed columns being returned
from the KV server. In other words, this commit introduces the ability
to perform KV projection pushdown.

The main idea of this feature is to use the injected decoding logic from
SQL in order to process each KV and keep only the needed parts (i.e.
necessary SQL columns). Those needed parts are then propagated back to
the KV client as coldata.Batch'es (serialized in the Apache Arrow format).

Here is the outline of all components involved:

     ┌────────────────────────────────────────────────┐
     │                       SQL                      │
     │________________________________________________│
     │          colfetcher.ColBatchDirectScan         │
     │                        │                       │
     │                        ▼                       │
     │                 row.txnKVFetcher               │
     │    (behind the row.KVBatchFetcher interface)   │
     └────────────────────────────────────────────────┘
                              │
                              ▼
     ┌────────────────────────────────────────────────┐
     │                    KV Client                   │
     └────────────────────────────────────────────────┘
                              │
                              ▼
     ┌────────────────────────────────────────────────┐
     │                    KV Server                   │
     │________________________________________________│
     │           colfetcher.cFetcherWrapper           │
     │ (behind the storage.CFetcherWrapper interface) │
     │                        │                       │
     │                        ▼                       │
     │              colfetcher.cFetcher               │
     │                        │                       │
     │                        ▼                       │
     │          storage.mvccScanFetchAdapter ────────┐│
     │    (behind the storage.NextKVer interface)    ││
     │                        │                      ││
     │                        ▼                      ││
     │           storage.pebbleMVCCScanner           ││
     │  (which puts KVs into storage.singleResults) <┘│
     └────────────────────────────────────────────────┘

On the KV client side, `row.txnKVFetcher` issues Scans and ReverseScans with
the `COL_BATCH_RESPONSE` format and returns the response (which contains the
columnar data) to the `colfetcher.ColBatchDirectScan`.
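
In terms of the KV API, the new request shape looks roughly like the
following sketch. The `ScanFormat` and `IndexFetchSpec` fields match the
`api.proto` changes in this commit, while the helper itself is
hypothetical scaffolding standing in for `row.txnKVFetcher`:

```
import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/kv"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/sql/catalog/fetchpb"
)

// scanToCols is a hypothetical helper that issues a single Scan in the
// COL_BATCH_RESPONSE format (a sketch, not the actual fetcher code).
func scanToCols(
	ctx context.Context,
	txn *kv.Txn,
	spec *fetchpb.IndexFetchSpec,
	key, endKey roachpb.Key,
) ([][]byte, error) {
	ba := &roachpb.BatchRequest{}
	// A single "global" IndexFetchSpec on the BatchRequest header tells
	// the server which SQL columns to decode and keep.
	ba.IndexFetchSpec = spec
	ba.Add(&roachpb.ScanRequest{
		RequestHeader: roachpb.RequestHeader{Key: key, EndKey: endKey},
		ScanFormat:    roachpb.COL_BATCH_RESPONSE,
	})
	br, pErr := txn.Send(ctx, ba)
	if pErr != nil {
		return nil, pErr.GoError()
	}
	// Each element is one coldata.Batch serialized in the Apache Arrow
	// format.
	return br.Responses[0].GetInner().(*roachpb.ScanResponse).BatchResponses, nil
}
```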

On the KV server side, we create a `storage.CFetcherWrapper` that asks the
`colfetcher.cFetcher` for the next `coldata.Batch`. The `cFetcher`, in turn,
fetches the next KV, decodes it, and keeps only values for the needed SQL
columns, discarding the rest of the KV. The KV is emitted by the
`mvccScanFetchAdapter`, which exposes access (via the `singleResults`
struct) to the current KV that the `pebbleMVCCScanner` is pointing at.
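
The two server-side interfaces can be sketched as follows; the method
signatures are assumptions reconstructed from the names above rather
than the exact definitions:

```
// NextKVer is implemented by storage.mvccScanFetchAdapter and consumed
// by the cFetcher (assumed shape).
type NextKVer interface {
	// NextKV returns the next KV to decode. partialRow indicates that
	// the last SQL row in the result was cut short, so the cFetcher must
	// remove it from the in-progress coldata.Batch.
	NextKV(ctx context.Context) (ok bool, kv roachpb.KeyValue, partialRow bool, err error)
}

// CFetcherWrapper is implemented by colfetcher.cFetcherWrapper and
// consumed by the storage layer (assumed shape).
type CFetcherWrapper interface {
	// NextBatch returns the next coldata.Batch serialized in the Apache
	// Arrow format, or a zero-length result once the scan is exhausted.
	NextBatch(ctx context.Context) ([]byte, error)
}
```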

Note that there is an additional "implicit synchronization" between
components that is not shown on this diagram. In particular,
`storage.singleResults.maybeTrimPartialLastRow` must be kept in sync with
the `colfetcher.cFetcher`, which is achieved by:
- the `cFetcher` exposing access to the first key of the last incomplete SQL
  row via the `FirstKeyOfRowGetter`,
- the `singleResults` using that key as the resume key for the response,
- and the `cFetcher` removing that last partial SQL row when `NextKV()`
  returns `partialRow=true`.
This "upstream" link (although breaking the layering a bit) allows us to
avoid a performance penalty for handling the case with multiple column
families. (This case is handled by the `storage.pebbleResults` via tracking
offsets into the `pebbleResults.repr`.)
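
A minimal sketch of that handshake, with assumed names and signatures:

```
// FirstKeyOfRowGetter is provided by the cFetcher so that singleResults
// can trim the response to a SQL row boundary (assumed shape).
type FirstKeyOfRowGetter func() roachpb.Key

// maybeTrimPartialLastRow picks the resume key when the scan stops.
func (s *singleResults) maybeTrimPartialLastRow(stopKey roachpb.Key) roachpb.Key {
	if firstKey := s.firstKeyOfRowGetter(); firstKey != nil {
		// The last SQL row is incomplete: resume from its first key so
		// that the next request re-reads the whole row. The cFetcher
		// discards the partial row when NextKV() returns partialRow=true.
		return firstKey
	}
	return stopKey
}
```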

This code structure deserves some elaboration. First, there is a mismatch
between the "push" mode in which the `pebbleMVCCScanner` operates and the
"pull" mode that the `NextKVer` exposes. The adaptation between the two
modes is achieved via the `mvccScanFetchAdapter` grabbing (when the control
returns to it) the current unstable KV pair from the `singleResults` struct,
which serves as a one-KV-pair buffer that the `pebbleMVCCScanner` `put`s
into. Second, in order to be able to use the unstable KV pair without
performing a copy, the `pebbleMVCCScanner` stops at the current KV pair and
returns the control flow (which is exactly what `pebbleMVCCScanner.getOne`
does) back to the `mvccScanFetchAdapter`, with the adapter advancing the
scanner only when the next KV pair is needed.
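
A minimal sketch of that adaptation, with assumed names for the buffer
accessor and for the scanner's return values:

```
// NextKV advances the push-based scanner by exactly one KV and hands
// out the KV pair buffered in singleResults (a sketch, not the actual
// signatures).
func (f *mvccScanFetchAdapter) NextKV(
	ctx context.Context,
) (ok bool, kv roachpb.KeyValue, partialRow bool, err error) {
	// getOne stops right after the scanner puts the current unstable KV
	// pair into the singleResults buffer and returns control to us.
	ok, err = f.scanner.getOne(ctx)
	if !ok || err != nil {
		return false, roachpb.KeyValue{}, false, err
	}
	// Use the buffered KV without copying; it remains valid only until
	// the scanner is advanced again.
	return true, f.results.currentKV(), false, nil
}
```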

There are multiple scenarios which are currently not supported:
- SQL cannot issue Get requests (likely will support in 23.1)
- `TraceKV` option is not supported (likely will support in 23.1)
- user-defined types other than enums are not supported (will _not_
support in 23.1)
- non-default key locking strength as well as SKIP LOCKED wait policy
are not supported (will _not_ support in 23.1).

This feature is currently disabled by default, but I intend to enable it
by default for multi-tenant setups. The rationale is that there is
currently a large performance hit when enabling it for single-tenant
deployments, whereas it offers a significant speedup in the
multi-tenant world.
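
For experimentation, the feature can be toggled via the cluster setting
that the roachtest change below exercises; a minimal sketch using
database/sql:

```
// enableDirectScans flips the setting that gates this feature. The
// setting name comes from the roachtest change in this commit.
func enableDirectScans(db *sql.DB) error {
	_, err := db.Exec(
		"SET CLUSTER SETTING sql.distsql.direct_columnar_scans.enabled = true",
	)
	return err
}
```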

The microbenchmarks [show](https://gist.github.com/yuzefovich/669c295a8a4fdffa6490532284c5a719)
the expected improvement in multi-tenant setups when the tenant runs in
a separate process whenever we don't need to decode all of the columns
from the table.

The TPCH numbers, though, don't show the expected speedup:
```
Q1:	before: 11.47s	after: 8.84s	 -22.89%
Q2:	before: 0.41s	after: 0.29s	 -27.71%
Q3:	before: 7.89s	after: 9.68s	 22.63%
Q4:	before: 4.48s	after: 4.52s	 0.86%
Q5:	before: 10.39s	after: 10.35s	 -0.29%
Q6:	before: 33.57s	after: 33.41s	 -0.48%
Q7:	before: 23.82s	after: 23.81s	 -0.02%
Q8:	before: 3.78s	after: 3.76s	 -0.68%
Q9:	before: 28.15s	after: 28.03s	 -0.42%
Q10:	before: 5.00s	after: 4.98s	 -0.42%
Q11:	before: 2.44s	after: 2.44s	 0.22%
Q12:	before: 34.78s	after: 34.65s	 -0.37%
Q13:	before: 3.20s	after: 2.94s	 -8.28%
Q14:	before: 3.13s	after: 3.21s	 2.43%
Q15:	before: 16.80s	after: 16.73s	 -0.38%
Q16:	before: 1.60s	after: 1.65s	 2.96%
Q17:	before: 0.85s	after: 0.96s	 13.04%
Q18:	before: 16.39s	after: 15.47s	 -5.61%
Q19:	before: 13.76s	after: 13.01s	 -5.45%
Q20:	before: 55.33s	after: 55.12s	 -0.38%
Q21:	before: 24.31s	after: 24.31s	 -0.00%
Q22:	before: 1.28s	after: 1.41s	 10.26%
```
Note that much better numbers were obtained on a buggy prototype. Those
numbers are invalid, and more details can be found
[here](cockroachdb#94438 (comment)).

At the moment, the `coldata.Batch` that is included in the response is
always serialized into the Arrow format, but I intend to introduce a
local fastpath to avoid that serialization. That work will be done in
a follow-up and should reduce the perf hit for single-tenant
deployments.

A quick note on the TODOs sprinkled in this commit:
- `TODO(yuzefovich)` means that this will be left for 23.2 or later.
- `TODO(yuzefovich, 23.1)` means that it should be addressed in 23.1.

A quick note on testing: this commit randomizes whether the new
infrastructure is used in almost all test builds. Introducing some unit
testing (say, in the `storage` package) seems rather annoying since we
must create keys that are valid SQL keys (i.e. have a TableID / Index ID
prefix) and that come with the corresponding `fetchpb.IndexFetchSpec`.
Not having unit tests in the `storage` package seems ok to me given that
the "meat" of the work there is still done by the `pebbleMVCCScanner`,
which is exercised by the regular Scans. End-to-end testing is well
covered by all of our existing tests, which now run with the new code
path chosen at random. I ran CI multiple times with the new feature
enabled by default without failures, so I hope it won't become flaky.

Release note: None
yuzefovich committed Jan 26, 2023
1 parent 6fc1022 commit d153ece
Showing 58 changed files with 1,791 additions and 417 deletions.
2 changes: 1 addition & 1 deletion docs/generated/settings/settings-for-tenants.txt
@@ -297,4 +297,4 @@ trace.jaeger.agent string the address of a Jaeger agent to receive traces using
trace.opentelemetry.collector string address of an OpenTelemetry trace collector to receive traces using the otel gRPC protocol, as <host>:<port>. If no port is specified, 4317 will be used.
trace.span_registry.enabled boolean true if set, ongoing traces can be seen at https://<ui>/#/debug/tracez
trace.zipkin.collector string the address of a Zipkin instance to receive traces, as <host>:<port>. If no port is specified, 9411 will be used.
version version 1000022.2-32 set the active cluster version in the format '<major>.<minor>'
version version 1000022.2-34 set the active cluster version in the format '<major>.<minor>'
2 changes: 1 addition & 1 deletion docs/generated/settings/settings.html
@@ -235,6 +235,6 @@
<tr><td><div id="setting-trace-opentelemetry-collector" class="anchored"><code>trace.opentelemetry.collector</code></div></td><td>string</td><td><code></code></td><td>address of an OpenTelemetry trace collector to receive traces using the otel gRPC protocol, as &lt;host&gt;:&lt;port&gt;. If no port is specified, 4317 will be used.</td></tr>
<tr><td><div id="setting-trace-span-registry-enabled" class="anchored"><code>trace.span_registry.enabled</code></div></td><td>boolean</td><td><code>true</code></td><td>if set, ongoing traces can be seen at https://&lt;ui&gt;/#/debug/tracez</td></tr>
<tr><td><div id="setting-trace-zipkin-collector" class="anchored"><code>trace.zipkin.collector</code></div></td><td>string</td><td><code></code></td><td>the address of a Zipkin instance to receive traces, as &lt;host&gt;:&lt;port&gt;. If no port is specified, 9411 will be used.</td></tr>
<tr><td><div id="setting-version" class="anchored"><code>version</code></div></td><td>version</td><td><code>1000022.2-32</code></td><td>set the active cluster version in the format &#39;&lt;major&gt;.&lt;minor&gt;&#39;</td></tr>
<tr><td><div id="setting-version" class="anchored"><code>version</code></div></td><td>version</td><td><code>1000022.2-34</code></td><td>set the active cluster version in the format &#39;&lt;major&gt;.&lt;minor&gt;&#39;</td></tr>
</tbody>
</table>

3 changes: 3 additions & 0 deletions pkg/ccl/logictestccl/tests/3node-tenant/generated_test.go

9 changes: 9 additions & 0 deletions pkg/clusterversion/cockroach_versions.go
@@ -404,6 +404,10 @@ const (
// V23_1KeyVisualizerTablesAndJobs adds the system tables that support the key visualizer.
V23_1KeyVisualizerTablesAndJobs

// V23_1_KVDirectColumnarScans introduces the support of the "direct"
// columnar scans in the KV layer.
V23_1_KVDirectColumnarScans

// *************************************************
// Step (1): Add new versions here.
// Do not add new versions to a patch release.
@@ -694,6 +698,11 @@ var rawVersionsSingleton = keyedVersions{
Key: V23_1KeyVisualizerTablesAndJobs,
Version: roachpb.Version{Major: 22, Minor: 2, Internal: 32},
},
{
Key: V23_1_KVDirectColumnarScans,
Version: roachpb.Version{Major: 22, Minor: 2, Internal: 34},
},

// *************************************************
// Step (2): Add new versions here.
// Do not add new versions to a patch release.
3 changes: 3 additions & 0 deletions pkg/cmd/generate-logictest/templates.go
@@ -74,6 +74,9 @@ func runExecBuildLogicTest(t *testing.T, file string) {
serverArgs := logictest.TestServerArgs{
DisableWorkmemRandomization: true,{{ if .ForceProductionValues }}
ForceProductionValues: true,{{end}}
// Disable the direct scans in order to keep the output of EXPLAIN (VEC)
// deterministic.
DisableDirectColumnarScans: true,
}
logictest.RunLogicTest(t, serverArgs, configIdx, filepath.Join(execBuildLogicTestDir, file))
}
21 changes: 19 additions & 2 deletions pkg/cmd/roachtest/tests/multitenant_tpch.go
@@ -26,7 +26,9 @@ import (
// runMultiTenantTPCH runs TPCH queries on a cluster that is first used as a
// single-tenant deployment followed by a run of all queries in a multi-tenant
// deployment with a single SQL instance.
func runMultiTenantTPCH(ctx context.Context, t test.Test, c cluster.Cluster) {
func runMultiTenantTPCH(
ctx context.Context, t test.Test, c cluster.Cluster, enableDirectScans bool,
) {
c.Put(ctx, t.Cockroach(), "./cockroach", c.All())
c.Put(ctx, t.DeprecatedWorkload(), "./workload", c.Node(1))
c.Start(ctx, t.L(), option.DefaultStartOpts(), install.MakeClusterSettings(install.SecureOption(true)), c.All())
@@ -40,6 +42,11 @@ func runMultiTenantTPCH(ctx context.Context, t test.Test, c cluster.Cluster) {
// one at a time (using the given url as a parameter to the 'workload run'
// command). The runtimes are accumulated in the perf helper.
runTPCH := func(conn *gosql.DB, url string, setupIdx int) {
setting := fmt.Sprintf("SET CLUSTER SETTING sql.distsql.direct_columnar_scans.enabled = %t", enableDirectScans)
t.Status(setting)
if _, err := conn.Exec(setting); err != nil {
t.Fatal(err)
}
t.Status("restoring TPCH dataset for Scale Factor 1 in ", setupNames[setupIdx])
if err := loadTPCHDataset(
ctx, t, c, conn, 1 /* sf */, c.NewMonitor(ctx), c.All(), false, /* disableMergeQueue */
@@ -93,6 +100,16 @@ func registerMultiTenantTPCH(r registry.Registry) {
Name: "multitenant/tpch",
Owner: registry.OwnerSQLQueries,
Cluster: r.MakeClusterSpec(1 /* nodeCount */),
Run: runMultiTenantTPCH,
Run: func(ctx context.Context, t test.Test, c cluster.Cluster) {
runMultiTenantTPCH(ctx, t, c, false /* enableDirectScans */)
},
})
r.Add(registry.TestSpec{
Name: "multitenant/tpch_direct_scans",
Owner: registry.OwnerSQLQueries,
Cluster: r.MakeClusterSpec(1 /* nodeCount */),
Run: func(ctx context.Context, t test.Test, c cluster.Cluster) {
runMultiTenantTPCH(ctx, t, c, true /* enableDirectScans */)
},
})
}
8 changes: 8 additions & 0 deletions pkg/col/coldataext/extended_column_factory.go
@@ -32,6 +32,14 @@ func NewExtendedColumnFactory(evalCtx *eval.Context) coldata.ColumnFactory {
return &extendedColumnFactory{evalCtx: evalCtx}
}

// NewExtendedColumnFactoryNoEvalCtx returns an extendedColumnFactory that will
// be producing coldata.DatumVecs that aren't fully initialized - the eval
// context is not set on those vectors. This can be acceptable if the caller
// cannot provide the eval.Context but also doesn't intend to compare datums.
func NewExtendedColumnFactoryNoEvalCtx() coldata.ColumnFactory {
return &extendedColumnFactory{}
}

func (cf *extendedColumnFactory) MakeColumn(t *types.T, n int) coldata.Column {
if typeconv.TypeFamilyToCanonicalTypeFamily(t.Family()) == typeconv.DatumVecCanonicalTypeFamily {
return newDatumVec(t, n, cf.evalCtx)
2 changes: 2 additions & 0 deletions pkg/col/colserde/arrowbatchconverter.go
@@ -569,6 +569,8 @@ func getValueBytesAndOffsets(

// Release should be called once the converter is no longer needed so that its
// memory could be GCed.
// TODO(yuzefovich): consider renaming this to Close in order to not be confused
// with execreleasable.Releasable interface.
func (c *ArrowBatchConverter) Release(ctx context.Context) {
if c.acc != nil {
c.acc.Shrink(ctx, c.accountedFor)
9 changes: 9 additions & 0 deletions pkg/kv/kvserver/batcheval/cmd_reverse_scan.go
@@ -63,6 +63,15 @@ func ReverseScan(
return result.Result{}, err
}
reply.BatchResponses = scanRes.KVData
case roachpb.COL_BATCH_RESPONSE:
scanRes, err = storage.MVCCScanToCols(
ctx, reader, cArgs.Header.IndexFetchSpec, args.Key, args.EndKey,
h.Timestamp, opts, cArgs.EvalCtx.ClusterSettings(),
)
if err != nil {
return result.Result{}, err
}
reply.BatchResponses = scanRes.KVData
case roachpb.KEY_VALUES:
scanRes, err = storage.MVCCScan(
ctx, reader, args.Key, args.EndKey, h.Timestamp, opts)
9 changes: 9 additions & 0 deletions pkg/kv/kvserver/batcheval/cmd_scan.go
@@ -64,6 +64,15 @@ func Scan(
return result.Result{}, err
}
reply.BatchResponses = scanRes.KVData
case roachpb.COL_BATCH_RESPONSE:
scanRes, err = storage.MVCCScanToCols(
ctx, reader, cArgs.Header.IndexFetchSpec, args.Key, args.EndKey,
h.Timestamp, opts, cArgs.EvalCtx.ClusterSettings(),
)
if err != nil {
return result.Result{}, err
}
reply.BatchResponses = scanRes.KVData
case roachpb.KEY_VALUES:
scanRes, err = storage.MVCCScan(
ctx, reader, args.Key, args.EndKey, h.Timestamp, opts)
2 changes: 2 additions & 0 deletions pkg/kv/kvserver/batcheval/intent.go
@@ -125,6 +125,8 @@ func acquireUnreplicatedLocksOnKeys(
res.Local.AcquiredLocks[i] = roachpb.MakeLockAcquisition(txn, copyKey(row.Key), lock.Unreplicated)
}
return nil
case roachpb.COL_BATCH_RESPONSE:
return errors.AssertionFailedf("unexpectedly acquiring unreplicated locks with COL_BATCH_RESPONSE scan format")
default:
panic("unexpected scanFormat")
}
3 changes: 3 additions & 0 deletions pkg/kv/kvserver/replica_read.go
@@ -531,6 +531,9 @@ func (r *Replica) collectSpansRead(
resp := br.Responses[i].GetInner()

if ba.WaitPolicy == lock.WaitPolicy_SkipLocked && roachpb.CanSkipLocked(req) {
if ba.IndexFetchSpec != nil {
return nil, nil, errors.AssertionFailedf("unexpectedly IndexFetchSpec is set with SKIP LOCKED wait policy")
}
// If the request is using a SkipLocked wait policy, it behaves as if run
// at a lower isolation level for any keys that it skips over. If the read
// request did not return a key, it does not need to check for conflicts
4 changes: 4 additions & 0 deletions pkg/kv/kvserver/replica_tscache.go
@@ -23,6 +23,7 @@ import (
"github.com/cockroachdb/cockroach/pkg/util/hlc"
"github.com/cockroachdb/cockroach/pkg/util/log"
"github.com/cockroachdb/cockroach/pkg/util/uuid"
"github.com/cockroachdb/errors"
"github.com/cockroachdb/redact"
)

@@ -99,6 +100,9 @@ func (r *Replica) updateTimestampCache(
start, end := header.Key, header.EndKey

if ba.WaitPolicy == lock.WaitPolicy_SkipLocked && roachpb.CanSkipLocked(req) {
if ba.IndexFetchSpec != nil {
log.Errorf(ctx, "%v", errors.AssertionFailedf("unexpectedly IndexFetchSpec is set with SKIP LOCKED wait policy"))
}
// If the request is using a SkipLocked wait policy, it behaves as if run
// at a lower isolation level for any keys that it skips over. If the read
// request did not return a key, it does not make a claim about whether
2 changes: 2 additions & 0 deletions pkg/roachpb/BUILD.bazel
@@ -151,6 +151,7 @@ proto_library(
"//pkg/kv/kvserver/concurrency/lock:lock_proto",
"//pkg/kv/kvserver/readsummary/rspb:rspb_proto",
"//pkg/settings:settings_proto",
"//pkg/sql/catalog/fetchpb:fetchpb_proto",
"//pkg/storage/enginepb:enginepb_proto",
"//pkg/util:util_proto",
"//pkg/util/admission/admissionpb:admissionpb_proto",
@@ -173,6 +174,7 @@ go_proto_library(
"//pkg/kv/kvserver/concurrency/lock",
"//pkg/kv/kvserver/readsummary/rspb",
"//pkg/settings",
"//pkg/sql/catalog/fetchpb",
"//pkg/storage/enginepb",
"//pkg/util",
"//pkg/util/admission/admissionpb",
66 changes: 46 additions & 20 deletions pkg/roachpb/api.proto
@@ -20,6 +20,7 @@ import "roachpb/errors.proto";
import "roachpb/metadata.proto";
import "roachpb/span_config.proto";
import "settings/encoding.proto";
import "sql/catalog/fetchpb/index_fetch.proto";
import "storage/enginepb/mvcc.proto";
import "storage/enginepb/mvcc3.proto";
import "util/hlc/timestamp.proto";
@@ -557,6 +558,10 @@ enum ScanFormat {
// The batch_response format: a byte slice of alternating keys and values,
// each prefixed by their length as a varint.
BATCH_RESPONSE = 1;
// The serialized (in the Apache Arrow format) representation of
// coldata.Batch'es that contain only necessary (according to the
// fetchpb.IndexFetchSpec) columns.
COL_BATCH_RESPONSE = 2;
}


@@ -568,9 +573,9 @@ message ScanRequest {

RequestHeader header = 1 [(gogoproto.nullable) = false, (gogoproto.embed) = true];

// The desired format for the response. If set to BATCH_RESPONSE, the server
// will set the batch_responses field in the ScanResponse instead of the rows
// field.
// The desired format for the response. If set to BATCH_RESPONSE or
// COL_BATCH_RESPONSE, the server will set the batch_responses field in the
// ScanResponse instead of the rows field.
ScanFormat scan_format = 4;

// The desired key-level locking mode used during this scan. When set to None
@@ -601,12 +606,19 @@ message ScanResponse {
// revisit this decision if this ever becomes a problem.
repeated KeyValue intent_rows = 3 [(gogoproto.nullable) = false];

// If set, each item in this repeated bytes field contains part of the results
// in batch format - the key/value pairs are a buffer of varint-prefixed
// slices, alternating from key to value. Each entry in this field is
// complete - there are no key/value pairs that are split across more than one
// entry. There are num_keys total pairs across all entries, as defined by the
// ResponseHeader. If set, rows will not be set and vice versa.
// If set, then depending on the ScanFormat, each item in this repeated bytes
// field contains part of the results in batch format:
// - for BATCH_RESPONSE - the key/value pairs are a buffer of varint-prefixed
// slices, alternating from key to value. Each entry in this field is complete
// (i.e. there are no key/value pairs that are split across more than one
// entry). There are num_keys total pairs across all entries, as defined by
// the ResponseHeader.
// - for COL_BATCH_RESPONSE - each []byte is a single serialized (in the
// Apache Arrow format) coldata.Batch. Each SQL row in that coldata.Batch is
// complete. num_keys total key-value pairs were used to populate all of the
// coldata.Batch'es in this field.
//
// If set, rows will not be set and vice versa.
repeated bytes batch_responses = 4;
}

Expand All @@ -618,9 +630,9 @@ message ReverseScanRequest {

RequestHeader header = 1 [(gogoproto.nullable) = false, (gogoproto.embed) = true];

// The desired format for the response. If set to BATCH_RESPONSE, the server
// will set the batch_responses field in the ScanResponse instead of the rows
// field.
// The desired format for the response. If set to BATCH_RESPONSE or
// COL_BATCH_RESPONSE, the server will set the batch_responses field in the
// ScanResponse instead of the rows field.
ScanFormat scan_format = 4;

// The desired key-level locking mode used during this scan. When set to None
Expand Down Expand Up @@ -651,12 +663,19 @@ message ReverseScanResponse {
// revisit this decision if this ever becomes a problem.
repeated KeyValue intent_rows = 3 [(gogoproto.nullable) = false];

// If set, each item in this repeated bytes field contains part of the results
// in batch format - the key/value pairs are a buffer of varint-prefixed
// slices, alternating from key to value. Each entry in this field is
// complete - there are no key/value pairs that are split across more than one
// entry. There are num_keys total pairs across all entries, as defined by the
// ResponseHeader. If set, rows will not be set and vice versa.
// If set, then depending on the ScanFormat, each item in this repeated bytes
// field contains part of the results in batch format:
// - for BATCH_RESPONSE - the key/value pairs are a buffer of varint-prefixed
// slices, alternating from key to value. Each entry in this field is complete
// (i.e. there are no key/value pairs that are split across more than one
// entry). There are num_keys total pairs across all entries, as defined by
// the ResponseHeader.
// - for COL_BATCH_RESPONSE - each []byte is a single serialized (in the
// Apache Arrow format) coldata.Batch. Each SQL row in that coldata.Batch is
// complete. num_keys total key-value pairs were used to populate all of the
// coldata.Batch'es in this field.
//
// If set, rows will not be set and vice versa.
repeated bytes batch_responses = 4;
}

@@ -2525,8 +2544,6 @@ message Header {
// The given value specifies the maximum number of keys in a row (i.e. the
// number of column families). If any larger rows are found at the end of the
// result, an error is returned.
//
// Added in 22.1, callers must check the ScanWholeRows version gate first.
int32 whole_rows_of_size = 26;
// If true, allow returning an empty result when the first result exceeds a
// limit (e.g. TargetBytes). Only supported by Get, Scan, and ReverseScan.
@@ -2605,6 +2622,15 @@ message Header {
// If a recording mode is set, the response will have collect_spans filled in.
util.tracing.tracingpb.TraceInfo trace_info = 25;

// index_fetch_spec, if set, is used for ScanRequests and ReverseScanRequests
// with COL_BATCH_RESPONSE ScanFormat.
//
// The rationale for having this as a single "global" field on the
// BatchRequest is that SQL never issues requests touching different indexes
// in a single BatchRequest, so it would be redundant to copy this field into
// each Scan and ReverseScan.
sql.sqlbase.IndexFetchSpec index_fetch_spec = 29;

reserved 7, 10, 12, 14, 20;
}
