76202: coldata: change Bytes vector implementation r=yuzefovich a=yuzefovich

**coldata: change Bytes vector implementation**

This commit changes the implementation of the `Bytes` vector in order to
support the `Set` operation at arbitrary positions. A `Bytes` vector is
now a slice of `element`s.

`element` describes a single `[]byte` value. If a value doesn't exceed
30 bytes, then the whole value is stored within the `element`; otherwise,
the `element` "points" at a region in a common `Bytes.buffer` where the
non-inlined value is stored.

If the value is inlined, the layout of `element` is as follows:
```
           24-byte header | 6-byte padding
  element: .............................. | length | true
  Bytes.buffer: N/A
```
where the 30 dots describe the inlinable space, followed by a single byte
for the length and a boolean `true` indicating an inlined value.

If the value is non-inlined, the layout of `element` is as follows:
```
                                            padding
  element: .offset. | ..len... | ..cap... | xxxxxx | x | false
  Bytes.buffer: xxxxxxxx | offset .... | xxxxxxxx
```
where the first 24 bytes contain our custom "header" of a byte slice that
is backed by `Bytes.buffer`. The following 7 bytes (the padding and the
`inlinedLength` byte) are not used, and then we have a boolean `false`
indicating a non-inlined value.

We use a custom "slice header" for the non-inlined values rather than
a plain `[]byte` for two reasons:
1. we can easily overwrite the whole struct when a value is inlined
without causing any issues for the GC (because our slice header doesn't
contain an actual slice, the GC doesn't have to track whether the backing
array is live or not);
2. our slice header remains valid even when `Bytes.buffer` is reallocated.
If we were to use `[]byte`, then whenever that reallocation occurred, we
would have to iterate over all non-inlined values and update the byte
slices to point to the new `Bytes.buffer`.

This new implementation was inspired by the Velox library from Facebook
(https://facebookincubator.github.io/velox/develop/vectors.html#flat-vectors-scalar-types).

Our original implementation was based on the Arrow format, so we were
getting extremely cheap serialization/deserialization. With this change
some of the benchmarks have regressed significantly:
```
Append/bytes/AppendSimple/NullProbability=0.0-24    356MB/s ± 5%   221MB/s ± 1%    -37.81%  (p=0.000 n=9+8)
Append/bytes/AppendWithSel/NullProbability=0.0-24   241MB/s ± 2%   105MB/s ± 4%    -56.49%  (p=0.000 n=9+10)
Append/bytes/AppendSimple/NullProbability=0.2-24    354MB/s ± 6%   219MB/s ± 4%    -38.11%  (p=0.000 n=8+9)
Append/bytes/AppendWithSel/NullProbability=0.2-24   229MB/s ± 3%    98MB/s ± 4%    -57.20%  (p=0.000 n=9+10)
Copy/bytes/CopySimple/NullProbability=0.0-24       39.7GB/s ± 1%   0.5GB/s ± 0%    -98.67%  (p=0.000 n=10+10)
Copy/bytes/CopyWithSel/NullProbability=0.0-24       753MB/s ± 1%   544MB/s ± 0%    -27.83%  (p=0.000 n=10+8)
Copy/bytes/CopySimple/NullProbability=0.2-24       28.9GB/s ± 1%   0.5GB/s ± 0%    -98.19%  (p=0.000 n=8+10)
Copy/bytes/CopyWithSel/NullProbability=0.2-24       634MB/s ± 0%   575MB/s ± 0%     -9.37%  (p=0.000 n=9+9)
```
```
ArrowBatchConverter/bytes/nullFraction=0.00/BatchToArrow-24   248GB/s ± 2%      3GB/s ± 3%     -98.90%  (p=0.000 n=10+10)
ArrowBatchConverter/bytes/nullFraction=0.25/BatchToArrow-24   240GB/s ± 2%      3GB/s ± 2%     -98.87%  (p=0.000 n=10+10)
ArrowBatchConverter/bytes/nullFraction=0.50/BatchToArrow-24   242GB/s ± 1%      3GB/s ± 2%     -98.86%  (p=0.000 n=10+10)
ArrowBatchConverter/bytes/nullFraction=0.00/ArrowToBatch-24   830GB/s ± 2%      3GB/s ± 2%     -99.59%  (p=0.000 n=10+10)
ArrowBatchConverter/bytes/nullFraction=0.25/ArrowToBatch-24   893GB/s ± 1%      3GB/s ± 1%     -99.62%  (p=0.000 n=10+8)
ArrowBatchConverter/bytes/nullFraction=0.50/ArrowToBatch-24   910GB/s ± 1%      3GB/s ± 1%     -99.62%  (p=0.000 n=10+9)
```
I'm not particularly worried about the regression in serialization
(the absolute speeds are still high enough for it not to be
a bottleneck), but the slowdown in the `Append/Copy` benchmarks is
worrisome. I have some ideas for how to improve those operations to
reduce the regressions to acceptable levels, and I'll add those
optimizations in a follow-up commit.

As a proxy for change of performance of `Bytes` vector overall, I ran
`BenchmarkSortUUID` with results [here](https://gist.github.com/yuzefovich/baaf6b4096cb3e2f1504ad2b38431cc6)
and the unordered distinct benchmarks with results [here](https://gist.github.com/yuzefovich/d8b9166977ec56780cb1749e848f5935).
They show that the number of allocations has decreased in all cases,
although the total allocated space increased in some cases with a small
number of rows. As a result, there are some performance regressions on
small numbers of rows. Overall, the results seem satisfactory.

Notably, as a result of this change the performance of top K sort has
improved drastically when bytes-like types are present in the input
columns (because `CopySlice` used in top K sorter no longer has to move
the data around).

It's worth calling out that we still use the same Arrow format for
serialization, so there is no need to bump the DistSQL version because
the change of this commit is purely to the in-memory representation.

An additional nice result of this change is that we no longer have to
maintain the non-decreasing offsets, which will allow us to remove calls
to `SetLength` in many places where the length of the batch is not
changed. This will be done in the follow-up commit.

Fixes: cockroachdb#75536.

Release note: None

Release justification: low risk, high benefit changes to existing
functionality.

**coldata: optimize Append and Copy for Bytes vectors**

This commit optimizes the `Append/Copy` operations for `Bytes` vectors.
The main idea is a new `copy` primitive that copies one `element` of one
vector directly into an `element` of another vector, which removes the
allocations when copying inlined values.

An additional improvement is preallocating enough `element`s up front
when copying or appending many values.
```
name                                               old speed      new speed       delta
Append/bytes/AppendSimple/NullProbability=0.0-24    226MB/s ± 7%    221MB/s ± 8%      ~     (p=0.143 n=10+10)
Append/bytes/AppendWithSel/NullProbability=0.0-24   104MB/s ± 5%    189MB/s ± 1%   +81.57%  (p=0.000 n=10+8)
Append/bytes/AppendSimple/NullProbability=0.2-24    220MB/s ± 2%    229MB/s ± 7%    +4.23%  (p=0.004 n=8+10)
Append/bytes/AppendWithSel/NullProbability=0.2-24  98.4MB/s ± 5%  197.7MB/s ± 4%  +100.98%  (p=0.000 n=10+10)
Copy/bytes/CopySimple/NullProbability=0.0-24        521MB/s ± 0%   2981MB/s ± 1%  +471.63%  (p=0.000 n=10+10)
Copy/bytes/CopyWithSel/NullProbability=0.0-24       539MB/s ± 0%   1020MB/s ± 0%   +89.32%  (p=0.000 n=10+10)
Copy/bytes/CopySimple/NullProbability=0.2-24        518MB/s ± 0%   2913MB/s ± 0%  +462.62%  (p=0.000 n=10+8)
Copy/bytes/CopyWithSel/NullProbability=0.2-24       560MB/s ± 0%    875MB/s ± 0%   +56.12%  (p=0.000 n=10+8)

name                                               old alloc/op   new alloc/op    delta
Append/bytes/AppendSimple/NullProbability=0.0-24     76.1kB ± 8%    118.9kB ± 7%   +56.23%  (p=0.000 n=10+10)
Append/bytes/AppendWithSel/NullProbability=0.0-24     192kB ± 2%      131kB ± 1%   -32.15%  (p=0.000 n=10+8)
Append/bytes/AppendSimple/NullProbability=0.2-24     78.0kB ± 0%    117.2kB ± 3%   +50.27%  (p=0.000 n=8+9)
Append/bytes/AppendWithSel/NullProbability=0.2-24     198kB ± 1%      105kB ± 3%   -47.21%  (p=0.000 n=10+9)
Copy/bytes/CopySimple/NullProbability=0.0-24          0.00B           0.00B           ~     (all equal)
Copy/bytes/CopyWithSel/NullProbability=0.0-24         0.00B           0.00B           ~     (all equal)
Copy/bytes/CopySimple/NullProbability=0.2-24          0.00B           0.00B           ~     (all equal)
Copy/bytes/CopyWithSel/NullProbability=0.2-24         0.00B           0.00B           ~     (all equal)
```
I'm not sure what's causing some of the increases in allocations since
I don't see them locally on macOS.

Release note: None

Release justification: low risk, high benefit changes to existing
functionality.

**colexec: cleanup top K with bytes-like types**

Since the previous commit made `Bytes` vectors support `Set`s at
arbitrary positions, we no longer have to use `CopySlice` in the top
K sort implementation.

Release note: None

Release justification: low risk, high benefit changes to existing
functionality.

**coldata: prohibit Append from the vector into itself**

Previously, `Append`ing from a vector into itself was allowed when
`srcStartIdx == srcEndIdx` in order to be able to truncate the vector.
One of the previous commits removed the only place where this was
actually used (the implementation of `Append` for bytes-like types when
a selection vector is present), so this commit removes this
no-longer-used feature.

Release note: None

Release justification: low risk cleanup.

**colexec: remove no longer needed calls to SetLength**

Previously, we required that operators that might populate a `Bytes`
vector set the length on the output batch (which was needed in order to
maintain the invariant of non-decreasing offsets in the old
implementation of the `Bytes` vector) even if the length of the batch
didn't change. This is no longer needed given the updated
implementation, so this commit removes those redundant calls to
`SetLength`.

Release note: None

Release justification: low risk cleanup.

Co-authored-by: Yahor Yuzefovich <[email protected]>
craig[bot] and yuzefovich committed Mar 9, 2022
2 parents 5e1b8d6 + ec8e548 commit fe889cf
Showing 33 changed files with 779 additions and 4,195 deletions.
2 changes: 0 additions & 2 deletions pkg/col/coldata/BUILD.bazel
@@ -18,13 +18,11 @@ go_library(
deps = [
"//pkg/col/typeconv",
"//pkg/sql/colexecerror",
"//pkg/sql/memsize",
"//pkg/sql/types",
"//pkg/util",
"//pkg/util/buildutil",
"//pkg/util/duration",
"//pkg/util/json",
"//pkg/util/uuid",
"@com_github_cockroachdb_apd_v3//:apd",
"@com_github_cockroachdb_errors//:errors",
"@com_github_stretchr_testify//require",
36 changes: 3 additions & 33 deletions pkg/col/coldata/batch.go
@@ -126,9 +126,7 @@ func NewMemBatchWithCapacity(typs []*types.T, capacity int, factory ColumnFactor
col := &cols[i]
col.init(t, capacity, factory)
b.b[i] = col
if col.IsBytesLike() {
b.bytesVecIdxs.Add(i)
} else if col.CanonicalTypeFamily() == typeconv.DatumVecCanonicalTypeFamily {
if col.CanonicalTypeFamily() == typeconv.DatumVecCanonicalTypeFamily {
b.datumVecIdxs.Add(i)
}
}
@@ -201,11 +199,6 @@ type MemBatch struct {
capacity int
// b is the slice of columns in this batch.
b []Vec
// bytesVecIdxs stores the indices of all vectors of Bytes type in b. Bytes
// vectors require special handling, so rather than iterating over all
// vectors and checking whether they are of Bytes type we store this slice
// separately.
bytesVecIdxs util.FastIntSet
// datumVecIdxs stores the indices of all datum-backed vectors in b.
datumVecIdxs util.FastIntSet
useSel bool
@@ -257,27 +250,11 @@ func (m *MemBatch) SetSelection(b bool) {
// SetLength implements the Batch interface.
func (m *MemBatch) SetLength(length int) {
m.length = length
if length > 0 {
// In order to maintain the invariant of Bytes vectors we need to update
// offsets up to the element with the largest index that can be accessed
// by the batch.
maxIdx := length - 1
if m.useSel {
// Note that here we rely on the fact that selection vectors are
// increasing sequences.
maxIdx = m.sel[length-1]
}
for i, ok := m.bytesVecIdxs.Next(0); ok; i, ok = m.bytesVecIdxs.Next(i + 1) {
UpdateOffsetsToBeNonDecreasing(m.b[i], maxIdx+1)
}
}
}

// AppendCol implements the Batch interface.
func (m *MemBatch) AppendCol(col Vec) {
if col.IsBytesLike() {
m.bytesVecIdxs.Add(len(m.b))
} else if col.CanonicalTypeFamily() == typeconv.DatumVecCanonicalTypeFamily {
if col.CanonicalTypeFamily() == typeconv.DatumVecCanonicalTypeFamily {
m.datumVecIdxs.Add(len(m.b))
}
m.b = append(m.b, col)
@@ -318,11 +295,6 @@ func (m *MemBatch) Reset(typs []*types.T, length int, factory ColumnFactory) {
// since those will get reset in ResetInternalBatch anyway.
m.b = m.b[:len(typs)]
m.sel = m.sel[:length]
for i, ok := m.bytesVecIdxs.Next(0); ok; i, ok = m.bytesVecIdxs.Next(i + 1) {
if i >= len(typs) {
m.bytesVecIdxs.Remove(i)
}
}
for i, ok := m.datumVecIdxs.Next(0); ok; i, ok = m.datumVecIdxs.Next(i + 1) {
if i >= len(typs) {
m.datumVecIdxs.Remove(i)
@@ -339,11 +311,9 @@ func (m *MemBatch) ResetInternalBatch() int64 {
for _, v := range m.b {
if v.CanonicalTypeFamily() != types.UnknownFamily {
v.Nulls().UnsetNulls()
ResetIfBytesLike(v)
}
}
for i, ok := m.bytesVecIdxs.Next(0); ok; i, ok = m.bytesVecIdxs.Next(i + 1) {
Reset(m.b[i])
}
var released int64
for i, ok := m.datumVecIdxs.Next(0); ok; i, ok = m.datumVecIdxs.Next(i + 1) {
released += m.b[i].Datum().Reset()
