76202: coldata: change Bytes vector implementation r=yuzefovich a=yuzefovich

**coldata: change Bytes vector implementation**

This commit changes the way the `Bytes` vector is implemented in order to support the `Set` operation at arbitrary positions. Now, a `Bytes` vector is a slice of `element`s. An `element` describes a single `[]byte` value. If a value doesn't exceed 30 bytes, the whole value is stored within the `element`; otherwise, the `element` "points" at a region in a common `Bytes.buffer` where the non-inlined value is stored.

If the value is inlined, the layout of the `element` is as follows:

```
              24-byte header | 6-byte padding
element:      .............................. | length | true
Bytes.buffer: N/A
```

where the 30 dots denote the inlinable space, followed by a single byte for the length, followed by a boolean `true` indicating an inlined value.

If the value is non-inlined, the layout of the `element` is as follows:

```
                                                          padding
element:      .offset. | ..len... | ..cap... | xxxxxx | x | false
Bytes.buffer: xxxxxxxx | offset .... | xxxxxxxx
```

where the first 24 bytes contain our custom "header" of a byte slice that is backed by `Bytes.buffer`. The following 7 bytes (the padding and the `inlinedLength` byte) are unused, and the final boolean `false` indicates a non-inlined value.

We use a custom "slice header" for the non-inlined values rather than a plain `[]byte` for two reasons:
1. we can easily overwrite the whole struct when a value is inlined without causing any issues with the GC (because our slice header doesn't contain an actual slice, the GC doesn't have to track whether the backing array is live or not);
2. our slice header remains valid even when `Bytes.buffer` is reallocated. If we used `[]byte`, then whenever that reallocation occurred, we would have to iterate over all non-inlined values and update their byte slices to point into the new `Bytes.buffer`.
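The scheme above can be sketched in a few dozen lines of Go. This is a minimal illustration, not the actual coldata implementation: the type and field names (`element`, `payload`, `inlinedLength`, `maxInlineLength`) are assumptions, and the non-inlined header is encoded here with `encoding/binary` instead of the unsafe tricks a real columnar engine would use.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// maxInlineLength mirrors the 30-byte inlining threshold described above.
const maxInlineLength = 30

// element is a fixed-size, 32-byte record. For short values the payload
// holds the bytes themselves; for long values the first 16 bytes of the
// payload encode a custom "slice header" (offset, length) into the
// vector's shared buffer.
type element struct {
	payload       [maxInlineLength]byte
	inlinedLength byte
	inlined       bool
}

// Bytes is the vector: fixed-size elements plus one shared buffer holding
// every non-inlined value.
type Bytes struct {
	elements []element
	buffer   []byte
}

// Set writes v at an arbitrary position i -- the capability this redesign
// exists to provide.
func (b *Bytes) Set(i int, v []byte) {
	e := &b.elements[i]
	if len(v) <= maxInlineLength {
		e.inlined = true
		e.inlinedLength = byte(len(v))
		copy(e.payload[:], v)
		return
	}
	// Non-inlined: append to the shared buffer and store offset/length
	// rather than a real []byte, so the header stays valid even if a
	// later append reallocates b.buffer.
	e.inlined = false
	binary.LittleEndian.PutUint64(e.payload[0:8], uint64(len(b.buffer)))
	binary.LittleEndian.PutUint64(e.payload[8:16], uint64(len(v)))
	b.buffer = append(b.buffer, v...)
}

// Get returns the value at position i.
func (b *Bytes) Get(i int) []byte {
	e := &b.elements[i]
	if e.inlined {
		return e.payload[:e.inlinedLength]
	}
	offset := binary.LittleEndian.Uint64(e.payload[0:8])
	length := binary.LittleEndian.Uint64(e.payload[8:16])
	return b.buffer[offset : offset+length]
}

func main() {
	b := &Bytes{elements: make([]element, 3)}
	b.Set(2, []byte("short"))                                    // inlined
	b.Set(0, []byte("a value that is longer than thirty bytes")) // spills to buffer
	b.Set(1, []byte("mid"))
	fmt.Println(string(b.Get(0)))
	fmt.Println(string(b.Get(1)), string(b.Get(2)))
}
```

Note how `Set` at index 0 works even though indexes 1 and 2 are written out of order, which the old non-decreasing-offsets invariant would have forbidden.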
This new implementation was inspired by the Velox library from Facebook (https://facebookincubator.github.io/velox/develop/vectors.html#flat-vectors-scalar-types). Our original implementation was based on the Arrow format, so we were getting extremely cheap serialization/deserialization. With this change some of the benchmarks have regressed significantly:

```
Append/bytes/AppendSimple/NullProbability=0.0-24    356MB/s ± 5%   221MB/s ± 1%  -37.81%  (p=0.000 n=9+8)
Append/bytes/AppendWithSel/NullProbability=0.0-24   241MB/s ± 2%   105MB/s ± 4%  -56.49%  (p=0.000 n=9+10)
Append/bytes/AppendSimple/NullProbability=0.2-24    354MB/s ± 6%   219MB/s ± 4%  -38.11%  (p=0.000 n=8+9)
Append/bytes/AppendWithSel/NullProbability=0.2-24   229MB/s ± 3%    98MB/s ± 4%  -57.20%  (p=0.000 n=9+10)
Copy/bytes/CopySimple/NullProbability=0.0-24       39.7GB/s ± 1%  0.5GB/s ± 0%  -98.67%  (p=0.000 n=10+10)
Copy/bytes/CopyWithSel/NullProbability=0.0-24       753MB/s ± 1%   544MB/s ± 0%  -27.83%  (p=0.000 n=10+8)
Copy/bytes/CopySimple/NullProbability=0.2-24       28.9GB/s ± 1%  0.5GB/s ± 0%  -98.19%  (p=0.000 n=8+10)
Copy/bytes/CopyWithSel/NullProbability=0.2-24       634MB/s ± 0%   575MB/s ± 0%   -9.37%  (p=0.000 n=9+9)
```

```
ArrowBatchConverter/bytes/nullFraction=0.00/BatchToArrow-24  248GB/s ± 2%  3GB/s ± 3%  -98.90%  (p=0.000 n=10+10)
ArrowBatchConverter/bytes/nullFraction=0.25/BatchToArrow-24  240GB/s ± 2%  3GB/s ± 2%  -98.87%  (p=0.000 n=10+10)
ArrowBatchConverter/bytes/nullFraction=0.50/BatchToArrow-24  242GB/s ± 1%  3GB/s ± 2%  -98.86%  (p=0.000 n=10+10)
ArrowBatchConverter/bytes/nullFraction=0.00/ArrowToBatch-24  830GB/s ± 2%  3GB/s ± 2%  -99.59%  (p=0.000 n=10+10)
ArrowBatchConverter/bytes/nullFraction=0.25/ArrowToBatch-24  893GB/s ± 1%  3GB/s ± 1%  -99.62%  (p=0.000 n=10+8)
ArrowBatchConverter/bytes/nullFraction=0.50/ArrowToBatch-24  910GB/s ± 1%  3GB/s ± 1%  -99.62%  (p=0.000 n=10+9)
```

I'm not particularly worried about the regression in serialization (the absolute speeds are still high enough for it not to be a bottleneck), but the slowdown in the
`Append/Copy` benchmarks is worrisome. I have some ideas for how to improve those operations to reduce the regressions to acceptable levels, and I'll add those optimizations in a follow-up commit.

As a proxy for the overall performance change of the `Bytes` vector, I ran `BenchmarkSortUUID` with results [here](https://gist.github.com/yuzefovich/baaf6b4096cb3e2f1504ad2b38431cc6) and the unordered distinct benchmarks with results [here](https://gist.github.com/yuzefovich/d8b9166977ec56780cb1749e848f5935). They show that the number of allocations has decreased in all cases, although the total allocated space increased in some cases with a small number of rows. As a result, there are some performance regressions on small row counts. Overall, the results seem satisfactory. Notably, as a result of this change the performance of top K sort has improved drastically when bytes-like types are present in the input columns (because `CopySlice` used in the top K sorter no longer has to move the data around).

It's worth calling out that we still use the same Arrow format for serialization, so there is no need to bump the DistSQL version; the change in this commit is purely to the in-memory representation. An additional nice result of this change is that we no longer have to maintain non-decreasing offsets, which will allow us to remove calls to `SetLength` in many places where the length of the batch is not changed. This will be done in a follow-up commit.

Fixes: cockroachdb#75536.

Release note: None

Release justification: low risk, high benefit changes to existing functionality.

**coldata: optimize Append and Copy for Bytes vectors**

This commit optimizes the `Append/Copy` operations for `Bytes` vectors. The main idea is that a new primitive `copy` is introduced which copies one `element` of one vector into an `element` of another vector, removing the allocations when copying inlined values.
An additional improvement is preallocating enough `element`s when copying or appending many values.

```
name                                               old speed      new speed       delta
Append/bytes/AppendSimple/NullProbability=0.0-24    226MB/s ± 7%    221MB/s ± 8%      ~      (p=0.143 n=10+10)
Append/bytes/AppendWithSel/NullProbability=0.0-24   104MB/s ± 5%    189MB/s ± 1%   +81.57%   (p=0.000 n=10+8)
Append/bytes/AppendSimple/NullProbability=0.2-24    220MB/s ± 2%    229MB/s ± 7%    +4.23%   (p=0.004 n=8+10)
Append/bytes/AppendWithSel/NullProbability=0.2-24  98.4MB/s ± 5%  197.7MB/s ± 4%  +100.98%   (p=0.000 n=10+10)
Copy/bytes/CopySimple/NullProbability=0.0-24        521MB/s ± 0%   2981MB/s ± 1%  +471.63%   (p=0.000 n=10+10)
Copy/bytes/CopyWithSel/NullProbability=0.0-24       539MB/s ± 0%   1020MB/s ± 0%   +89.32%   (p=0.000 n=10+10)
Copy/bytes/CopySimple/NullProbability=0.2-24        518MB/s ± 0%   2913MB/s ± 0%  +462.62%   (p=0.000 n=10+8)
Copy/bytes/CopyWithSel/NullProbability=0.2-24       560MB/s ± 0%    875MB/s ± 0%   +56.12%   (p=0.000 n=10+8)

name                                               old alloc/op   new alloc/op    delta
Append/bytes/AppendSimple/NullProbability=0.0-24     76.1kB ± 8%    118.9kB ± 7%   +56.23%   (p=0.000 n=10+10)
Append/bytes/AppendWithSel/NullProbability=0.0-24     192kB ± 2%      131kB ± 1%   -32.15%   (p=0.000 n=10+8)
Append/bytes/AppendSimple/NullProbability=0.2-24     78.0kB ± 0%    117.2kB ± 3%   +50.27%   (p=0.000 n=8+9)
Append/bytes/AppendWithSel/NullProbability=0.2-24     198kB ± 1%      105kB ± 3%   -47.21%   (p=0.000 n=10+9)
Copy/bytes/CopySimple/NullProbability=0.0-24          0.00B           0.00B           ~      (all equal)
Copy/bytes/CopyWithSel/NullProbability=0.0-24         0.00B           0.00B           ~      (all equal)
Copy/bytes/CopySimple/NullProbability=0.2-24          0.00B           0.00B           ~      (all equal)
Copy/bytes/CopyWithSel/NullProbability=0.2-24         0.00B           0.00B           ~      (all equal)
```

I'm not sure what's up with some of the increases in allocations, since I don't see them locally on macOS.

Release note: None

Release justification: low risk, high benefit changes to existing functionality.
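The element-copy primitive above can be sketched as follows. This is an illustrative approximation, not the coldata code: the names (`elem`, `copyElement`, `isInline`) are invented, and the non-inlined header is shown as plain `offset`/`length` ints rather than the packed 32-byte layout.

```go
package main

import "fmt"

// maxInline mirrors the 30-byte inlining threshold.
const maxInline = 30

type elem struct {
	inline   [maxInline]byte
	length   int // length of the value, inlined or not
	offset   int // into Bytes.buffer, meaningful only when !isInline
	isInline bool
}

type Bytes struct {
	elems  []elem
	buffer []byte
}

func (b *Bytes) set(i int, v []byte) {
	e := elem{length: len(v)}
	if len(v) <= maxInline {
		e.isInline = true
		copy(e.inline[:], v)
	} else {
		e.offset = len(b.buffer)
		b.buffer = append(b.buffer, v...)
	}
	b.elems[i] = e
}

func (b *Bytes) get(i int) []byte {
	e := b.elems[i]
	if e.isInline {
		return e.inline[:e.length]
	}
	return b.buffer[e.offset : e.offset+e.length]
}

// copyElement copies one element between vectors. An inlined value moves
// by plain struct assignment -- no allocation at all -- while a
// non-inlined value is appended to dst's buffer exactly once and gets a
// fresh header pointing into that buffer.
func copyElement(dst *Bytes, dstIdx int, src *Bytes, srcIdx int) {
	e := src.elems[srcIdx]
	if !e.isInline {
		v := src.buffer[e.offset : e.offset+e.length]
		e.offset = len(dst.buffer)
		dst.buffer = append(dst.buffer, v...)
	}
	dst.elems[dstIdx] = e // for inlined values this is the entire copy
}

func main() {
	src := &Bytes{elems: make([]elem, 2)}
	src.set(0, []byte("inlined"))
	src.set(1, []byte("a non-inlined value that exceeds thirty bytes"))
	dst := &Bytes{elems: make([]elem, 2)}
	copyElement(dst, 1, src, 0)
	copyElement(dst, 0, src, 1)
	fmt.Println(string(dst.get(0)))
	fmt.Println(string(dst.get(1)))
}
```

The struct-assignment fast path is where the allocation savings for inlined values come from: copying a batch of short strings touches no heap at all.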
**colexec: cleanup top K with bytes-like types**

Since the previous commit made `Bytes` vectors support `Set` at arbitrary positions, we no longer have to use `CopySlice` in the top K sort implementation.

Release note: None

Release justification: low risk, high benefit changes to existing functionality.

**coldata: prohibit Append from the vector into itself**

Previously, `Append`ing from the vector into itself was allowed when `srcStartIdx == srcEndIdx` in order to be able to truncate the vector. One of the previous commits removed the only place where this was actually used (when implementing `Append` for bytes-like types when a selection vector is present), so this commit removes this no-longer-used feature.

Release note: None

Release justification: low risk cleanup.

**colexec: remove no longer needed calls to SetLength**

Previously, we required that operators that might populate a `Bytes` vector set the length on the output batch (which was needed to maintain the invariant of non-decreasing offsets in the old implementation of the `Bytes` vector) even if the length of the batch didn't change. This is no longer needed with the updated implementation, so this commit removes those redundant calls to `SetLength`.

Release note: None

Release justification: low risk cleanup.

Co-authored-by: Yahor Yuzefovich <[email protected]>