76202: coldata: change Bytes vector implementation r=yuzefovich a=yuzefovich

**coldata: change Bytes vector implementation**

This commit changes the way the `Bytes` vector is implemented in order to support the `Set` operation at arbitrary positions. Now, a `Bytes` vector is a slice of `element`s. An `element` describes a single `[]byte` value. If a value doesn't exceed 30 bytes, the whole value is stored within the `element`; otherwise, the `element` "points" at a region in a common `Bytes.buffer` where the non-inlined value is stored.

If the value is inlined, the layout of the `element` is as follows:

```
              24-byte header | 6-byte padding
element:      .............................. | length | true
Bytes.buffer: N/A
```

where the 30 dots denote the inlinable space, followed by a single byte for the length, followed by a boolean `true` indicating an inlined value.

If the value is non-inlined, the layout of the `element` is as follows:

```
                                                          padding
element:      .offset. | ..len... | ..cap... | xxxxxx | x | false
Bytes.buffer: xxxxxxxx | offset .... | xxxxxxxx
```

where the first 24 bytes contain our custom "header" of a byte slice that is backed by `Bytes.buffer`. The following 7 bytes (the padding and the `inlinedLength` byte) are unused, and the final boolean `false` indicates a non-inlined value.

We use a custom "slice header" for the non-inlined values rather than a plain `[]byte` for two reasons:
1. we can easily overwrite the whole struct when a value is inlined without causing any issues with the GC (because our slice header doesn't contain an actual slice, the GC doesn't have to track whether the backing array is live or not);
2. our slice header remains valid even when `Bytes.buffer` is reallocated. If we used `[]byte`, then whenever that reallocation occurred, we would have to iterate over all non-inlined values and update their byte slices to point into the new `Bytes.buffer`.
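The scheme above can be sketched in a few dozen lines of Go. This is a minimal illustration, not the actual coldata implementation: the type and field names (`element`, `payload`, `inlinedLength`, `maxInlineLength`) are assumptions, and the non-inlined header is encoded here with `encoding/binary` instead of the unsafe tricks a real columnar engine would use.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// maxInlineLength mirrors the 30-byte inlining threshold described above.
const maxInlineLength = 30

// element is a fixed-size, 32-byte record. For short values the payload
// holds the bytes themselves; for long values the first 16 bytes of the
// payload encode a custom "slice header" (offset, length) into the
// vector's shared buffer.
type element struct {
	payload       [maxInlineLength]byte
	inlinedLength byte
	inlined       bool
}

// Bytes is the vector: fixed-size elements plus one shared buffer holding
// every non-inlined value.
type Bytes struct {
	elements []element
	buffer   []byte
}

// Set writes v at an arbitrary position i -- the capability this redesign
// exists to provide.
func (b *Bytes) Set(i int, v []byte) {
	e := &b.elements[i]
	if len(v) <= maxInlineLength {
		e.inlined = true
		e.inlinedLength = byte(len(v))
		copy(e.payload[:], v)
		return
	}
	// Non-inlined: append to the shared buffer and store offset/length
	// rather than a real []byte, so the header stays valid even if a
	// later append reallocates b.buffer.
	e.inlined = false
	binary.LittleEndian.PutUint64(e.payload[0:8], uint64(len(b.buffer)))
	binary.LittleEndian.PutUint64(e.payload[8:16], uint64(len(v)))
	b.buffer = append(b.buffer, v...)
}

// Get returns the value at position i.
func (b *Bytes) Get(i int) []byte {
	e := &b.elements[i]
	if e.inlined {
		return e.payload[:e.inlinedLength]
	}
	offset := binary.LittleEndian.Uint64(e.payload[0:8])
	length := binary.LittleEndian.Uint64(e.payload[8:16])
	return b.buffer[offset : offset+length]
}

func main() {
	b := &Bytes{elements: make([]element, 3)}
	b.Set(2, []byte("short"))                                    // inlined
	b.Set(0, []byte("a value that is longer than thirty bytes")) // spills to buffer
	b.Set(1, []byte("mid"))
	fmt.Println(string(b.Get(0)))
	fmt.Println(string(b.Get(1)), string(b.Get(2)))
}
```

Note how `Set` at index 0 works even though indexes 1 and 2 are written out of order, which the old non-decreasing-offsets invariant would have forbidden.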
This new implementation was inspired by the Velox library from Facebook (https://facebookincubator.github.io/velox/develop/vectors.html#flat-vectors-scalar-types). Our original implementation was based on the Arrow format, so we were getting extremely cheap serialization/deserialization. With this change some of the benchmarks have regressed significantly:

```
Append/bytes/AppendSimple/NullProbability=0.0-24    356MB/s ± 5%   221MB/s ± 1%  -37.81%  (p=0.000 n=9+8)
Append/bytes/AppendWithSel/NullProbability=0.0-24   241MB/s ± 2%   105MB/s ± 4%  -56.49%  (p=0.000 n=9+10)
Append/bytes/AppendSimple/NullProbability=0.2-24    354MB/s ± 6%   219MB/s ± 4%  -38.11%  (p=0.000 n=8+9)
Append/bytes/AppendWithSel/NullProbability=0.2-24   229MB/s ± 3%    98MB/s ± 4%  -57.20%  (p=0.000 n=9+10)
Copy/bytes/CopySimple/NullProbability=0.0-24       39.7GB/s ± 1%  0.5GB/s ± 0%  -98.67%  (p=0.000 n=10+10)
Copy/bytes/CopyWithSel/NullProbability=0.0-24       753MB/s ± 1%   544MB/s ± 0%  -27.83%  (p=0.000 n=10+8)
Copy/bytes/CopySimple/NullProbability=0.2-24       28.9GB/s ± 1%  0.5GB/s ± 0%  -98.19%  (p=0.000 n=8+10)
Copy/bytes/CopyWithSel/NullProbability=0.2-24       634MB/s ± 0%   575MB/s ± 0%   -9.37%  (p=0.000 n=9+9)
```

```
ArrowBatchConverter/bytes/nullFraction=0.00/BatchToArrow-24  248GB/s ± 2%  3GB/s ± 3%  -98.90%  (p=0.000 n=10+10)
ArrowBatchConverter/bytes/nullFraction=0.25/BatchToArrow-24  240GB/s ± 2%  3GB/s ± 2%  -98.87%  (p=0.000 n=10+10)
ArrowBatchConverter/bytes/nullFraction=0.50/BatchToArrow-24  242GB/s ± 1%  3GB/s ± 2%  -98.86%  (p=0.000 n=10+10)
ArrowBatchConverter/bytes/nullFraction=0.00/ArrowToBatch-24  830GB/s ± 2%  3GB/s ± 2%  -99.59%  (p=0.000 n=10+10)
ArrowBatchConverter/bytes/nullFraction=0.25/ArrowToBatch-24  893GB/s ± 1%  3GB/s ± 1%  -99.62%  (p=0.000 n=10+8)
ArrowBatchConverter/bytes/nullFraction=0.50/ArrowToBatch-24  910GB/s ± 1%  3GB/s ± 1%  -99.62%  (p=0.000 n=10+9)
```

I'm not particularly worried about the regression in serialization (the absolute speeds are still high enough for it not to be a bottleneck), but the slowdown in the
`Append/Copy` benchmarks is worrisome. I have some ideas for how to improve those operations to reduce the regressions to acceptable levels, and I'll add those optimizations in a follow-up commit.

As a proxy for the overall performance change of the `Bytes` vector, I ran `BenchmarkSortUUID` with results [here](https://gist.github.com/yuzefovich/baaf6b4096cb3e2f1504ad2b38431cc6) and the unordered distinct benchmarks with results [here](https://gist.github.com/yuzefovich/d8b9166977ec56780cb1749e848f5935). They show that the number of allocations has decreased in all cases, although the total allocated space increased in some cases with a small number of rows. As a result, there are some performance regressions on small row counts. Overall, the results seem satisfactory. Notably, as a result of this change the performance of top K sort has improved drastically when bytes-like types are present in the input columns (because `CopySlice` used in the top K sorter no longer has to move the data around).

It's worth calling out that we still use the same Arrow format for serialization, so there is no need to bump the DistSQL version; the change in this commit is purely to the in-memory representation. An additional nice result of this change is that we no longer have to maintain non-decreasing offsets, which will allow us to remove calls to `SetLength` in many places where the length of the batch is not changed. This will be done in a follow-up commit.

Fixes: cockroachdb#75536.

Release note: None

Release justification: low risk, high benefit changes to existing functionality.

**coldata: optimize Append and Copy for Bytes vectors**

This commit optimizes the `Append/Copy` operations for `Bytes` vectors. The main idea is that a new primitive `copy` is introduced which copies one `element` of one vector into an `element` of another vector, removing the allocations when copying inlined values.
An additional improvement is preallocating enough `element`s when copying or appending many values.

```
name                                               old speed      new speed       delta
Append/bytes/AppendSimple/NullProbability=0.0-24    226MB/s ± 7%    221MB/s ± 8%      ~      (p=0.143 n=10+10)
Append/bytes/AppendWithSel/NullProbability=0.0-24   104MB/s ± 5%    189MB/s ± 1%   +81.57%   (p=0.000 n=10+8)
Append/bytes/AppendSimple/NullProbability=0.2-24    220MB/s ± 2%    229MB/s ± 7%    +4.23%   (p=0.004 n=8+10)
Append/bytes/AppendWithSel/NullProbability=0.2-24  98.4MB/s ± 5%  197.7MB/s ± 4%  +100.98%   (p=0.000 n=10+10)
Copy/bytes/CopySimple/NullProbability=0.0-24        521MB/s ± 0%   2981MB/s ± 1%  +471.63%   (p=0.000 n=10+10)
Copy/bytes/CopyWithSel/NullProbability=0.0-24       539MB/s ± 0%   1020MB/s ± 0%   +89.32%   (p=0.000 n=10+10)
Copy/bytes/CopySimple/NullProbability=0.2-24        518MB/s ± 0%   2913MB/s ± 0%  +462.62%   (p=0.000 n=10+8)
Copy/bytes/CopyWithSel/NullProbability=0.2-24       560MB/s ± 0%    875MB/s ± 0%   +56.12%   (p=0.000 n=10+8)

name                                               old alloc/op   new alloc/op    delta
Append/bytes/AppendSimple/NullProbability=0.0-24     76.1kB ± 8%    118.9kB ± 7%   +56.23%   (p=0.000 n=10+10)
Append/bytes/AppendWithSel/NullProbability=0.0-24     192kB ± 2%      131kB ± 1%   -32.15%   (p=0.000 n=10+8)
Append/bytes/AppendSimple/NullProbability=0.2-24     78.0kB ± 0%    117.2kB ± 3%   +50.27%   (p=0.000 n=8+9)
Append/bytes/AppendWithSel/NullProbability=0.2-24     198kB ± 1%      105kB ± 3%   -47.21%   (p=0.000 n=10+9)
Copy/bytes/CopySimple/NullProbability=0.0-24          0.00B           0.00B           ~      (all equal)
Copy/bytes/CopyWithSel/NullProbability=0.0-24         0.00B           0.00B           ~      (all equal)
Copy/bytes/CopySimple/NullProbability=0.2-24          0.00B           0.00B           ~      (all equal)
Copy/bytes/CopyWithSel/NullProbability=0.2-24         0.00B           0.00B           ~      (all equal)
```

I'm not sure what's up with some of the increases in allocations, since I don't see them locally on macOS.

Release note: None

Release justification: low risk, high benefit changes to existing functionality.
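The element-copy primitive above can be sketched as follows. This is an illustrative approximation, not the coldata code: the names (`elem`, `copyElement`, `isInline`) are invented, and the non-inlined header is shown as plain `offset`/`length` ints rather than the packed 32-byte layout.

```go
package main

import "fmt"

// maxInline mirrors the 30-byte inlining threshold.
const maxInline = 30

type elem struct {
	inline   [maxInline]byte
	length   int // length of the value, inlined or not
	offset   int // into Bytes.buffer, meaningful only when !isInline
	isInline bool
}

type Bytes struct {
	elems  []elem
	buffer []byte
}

func (b *Bytes) set(i int, v []byte) {
	e := elem{length: len(v)}
	if len(v) <= maxInline {
		e.isInline = true
		copy(e.inline[:], v)
	} else {
		e.offset = len(b.buffer)
		b.buffer = append(b.buffer, v...)
	}
	b.elems[i] = e
}

func (b *Bytes) get(i int) []byte {
	e := b.elems[i]
	if e.isInline {
		return e.inline[:e.length]
	}
	return b.buffer[e.offset : e.offset+e.length]
}

// copyElement copies one element between vectors. An inlined value moves
// by plain struct assignment -- no allocation at all -- while a
// non-inlined value is appended to dst's buffer exactly once and gets a
// fresh header pointing into that buffer.
func copyElement(dst *Bytes, dstIdx int, src *Bytes, srcIdx int) {
	e := src.elems[srcIdx]
	if !e.isInline {
		v := src.buffer[e.offset : e.offset+e.length]
		e.offset = len(dst.buffer)
		dst.buffer = append(dst.buffer, v...)
	}
	dst.elems[dstIdx] = e // for inlined values this is the entire copy
}

func main() {
	src := &Bytes{elems: make([]elem, 2)}
	src.set(0, []byte("inlined"))
	src.set(1, []byte("a non-inlined value that exceeds thirty bytes"))
	dst := &Bytes{elems: make([]elem, 2)}
	copyElement(dst, 1, src, 0)
	copyElement(dst, 0, src, 1)
	fmt.Println(string(dst.get(0)))
	fmt.Println(string(dst.get(1)))
}
```

The struct-assignment fast path is where the allocation savings for inlined values come from: copying a batch of short strings touches no heap at all.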
**colexec: cleanup top K with bytes-like types**

Since the previous commit made `Bytes` vectors support `Set` at arbitrary positions, we no longer have to use `CopySlice` in the top K sort implementation.

Release note: None

Release justification: low risk, high benefit changes to existing functionality.

**coldata: prohibit Append from the vector into itself**

Previously, `Append`ing from the vector into itself was allowed when `srcStartIdx == srcEndIdx` in order to be able to truncate the vector. One of the previous commits removed the only place where this was actually used (when implementing `Append` for bytes-like types when a selection vector is present), so this commit removes this no-longer-used feature.

Release note: None

Release justification: low risk cleanup.

**colexec: remove no longer needed calls to SetLength**

Previously, we required that operators that might populate a `Bytes` vector set the length on the output batch (which was needed to maintain the invariant of non-decreasing offsets in the old implementation of the `Bytes` vector) even if the length of the batch didn't change. This is no longer needed with the updated implementation, so this commit removes those redundant calls to `SetLength`.

Release note: None

Release justification: low risk cleanup.

Co-authored-by: Yahor Yuzefovich <[email protected]>