forked from apache/spark
[SPARK-39231][SQL] Use `ConstantColumnVector` instead of `On/OffHeapColumnVector` to store partition columns in `VectorizedParquetRecordReader`

### What changes were proposed in this pull request?
This PR changes `VectorizedParquetRecordReader` to store partition columns in a `ConstantColumnVector`, because a partition column vector is always a constant vector.

### Why are the changes needed?
1. A partition column vector is always a constant vector.
2. **Performance improvement**: `ConstantColumnVector` has better read and write performance than `OnHeapColumnVector` and `OffHeapColumnVector`. In the microbenchmark results the improvement is most visible for `StringType`: read throughput increases by about 2 times, and write throughput increases by more than 100 times.
3. **Memory saving**: `ConstantColumnVector` uses less memory than `OnHeapColumnVector` and `OffHeapColumnVector`. For a `UTF8String` vector of length 4096 (the default `batchSize`), `ConstantColumnVector` can save more than 90% of the memory compared with `OnHeapColumnVector`:
   - `ConstantColumnVector` stores only a single `UTF8String`
   - `OnHeapColumnVector` needs `arrayOffsets(int[4096])` + `arrayLengths(int[4096])` + `(UTF8String * 4096)`

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Pass GitHub Actions
- Add new UTs to test the new method introduced by this PR: `ColumnVectorUtils.fill(ConstantColumnVector col, InternalRow row, int fieldIdx)`
- Add a new micro benchmark to compare the read and write performance of a constant vector (simulating the partition column scenario) between `OnHeapColumnVector`, `OffHeapColumnVector` and `ConstantColumnVector`

Closes apache#36616 from LuciferYang/SPARK-39231.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Chao Sun <[email protected]>
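
The memory argument in the commit message can be illustrated with a small, self-contained sketch. This is a toy model, not Spark's code: the classes `PerRowStringVector` and `ConstantStringVector`, their methods, and the sample partition value are all hypothetical, and the byte counts ignore JVM object headers. It only assumes, as the commit message states, that a per-row vector keeps one offset and one length per row plus the string data for each row, while a constant vector keeps a single value.

```java
import java.nio.charset.StandardCharsets;

/** Toy comparison of a per-row string vector and a constant string vector (hypothetical classes). */
public class ConstantVectorSketch {

    /** Per-row layout: an offset and a length for every row, plus one copy of the bytes per row. */
    static final class PerRowStringVector {
        final int[] offsets;
        final int[] lengths;
        byte[] data = new byte[0];

        PerRowStringVector(int numRows) {
            offsets = new int[numRows];
            lengths = new int[numRows];
        }

        /** Writes the same value into every row, as a reader would for a partition column. */
        void fill(byte[] value) {
            data = new byte[value.length * offsets.length];
            for (int i = 0; i < offsets.length; i++) {
                offsets[i] = i * value.length;
                lengths[i] = value.length;
                System.arraycopy(value, 0, data, offsets[i], value.length);
            }
        }

        String get(int rowId) {
            return new String(data, offsets[rowId], lengths[rowId], StandardCharsets.UTF_8);
        }

        /** Array payload only: two int arrays plus the copied bytes. */
        long payloadBytes() {
            return 4L * offsets.length + 4L * lengths.length + data.length;
        }
    }

    /** Constant layout: one value that every row id resolves to. */
    static final class ConstantStringVector {
        final int numRows;
        byte[] value = new byte[0];

        ConstantStringVector(int numRows) {
            this.numRows = numRows;
        }

        /** A single write, independent of the number of rows. */
        void set(byte[] v) {
            value = v.clone();
        }

        String get(int rowId) {
            if (rowId < 0 || rowId >= numRows) {
                throw new IndexOutOfBoundsException("rowId " + rowId);
            }
            return new String(value, StandardCharsets.UTF_8);
        }

        long payloadBytes() {
            return value.length;
        }
    }

    public static void main(String[] args) {
        int batchSize = 4096; // default batch size quoted in the commit message
        byte[] partitionValue = "2022-05-20".getBytes(StandardCharsets.UTF_8); // made-up value

        PerRowStringVector perRow = new PerRowStringVector(batchSize);
        perRow.fill(partitionValue);

        ConstantStringVector constant = new ConstantStringVector(batchSize);
        constant.set(partitionValue);

        System.out.println("per-row payload bytes:  " + perRow.payloadBytes());
        System.out.println("constant payload bytes: " + constant.payloadBytes());
        System.out.println("same value at last row: "
            + perRow.get(batchSize - 1).equals(constant.get(batchSize - 1)));
    }
}
```

In this model the per-row vector's cost grows with the batch size on both the write path (`fill` loops over every row) and in memory, while the constant vector's cost depends only on the value itself. The real figures for Spark are the ones in the benchmark result files added by this commit.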
1 parent feae21c, commit c0a12cf
Showing 9 changed files with 1,414 additions and 14 deletions.
- sql/core/benchmarks/ConstantColumnVectorBenchmark-jdk11-results.txt: 280 changes (280 additions & 0 deletions)
- sql/core/benchmarks/ConstantColumnVectorBenchmark-jdk17-results.txt: 280 changes (280 additions & 0 deletions)
- sql/core/benchmarks/ConstantColumnVectorBenchmark-results.txt: 280 changes (280 additions & 0 deletions)