Vectorize DeltaBitPackDecoder, up to 5x faster decoding #1284
Conversation
Force-pushed from 2488fa2 to b693f8d
@@ -39,6 +39,7 @@ flate2 = { version = "1.0", optional = true }
 lz4 = { version = "1.23", optional = true }
 zstd = { version = "0.10", optional = true }
 chrono = { version = "0.4", default-features = false }
+num = "0.4"
I initially looked to extend the existing traits, but this ran into a couple of issues:

- `wrapping_add` doesn't make sense for all types, e.g. `ByteArray`
- A lot of effort has already been put into creating numeric traits, we might as well just benefit from this
- This crate is already used by arrow
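For illustration, a minimal sketch (my own code, not the PR's) of how the `num` traits combine here: `FromPrimitive` converts a decoded delta into the column's native type, and `WrappingAdd` accumulates it with wrapping overflow semantics, generically over `i32`/`i64`:

```rust
use num::traits::{FromPrimitive, WrappingAdd};

// Reconstruct values from a first value and a list of deltas, generic
// over any integer type implementing the `num` traits (e.g. i32, i64).
fn accumulate<T: FromPrimitive + WrappingAdd + Copy>(first: T, deltas: &[i64]) -> Vec<T> {
    let mut out = Vec::with_capacity(deltas.len() + 1);
    let mut current = first;
    out.push(current);
    for &delta in deltas {
        // from_i64 returns None if the delta doesn't fit in T
        let delta = T::from_i64(delta).expect("delta out of range");
        // wrapping_add avoids panicking on intentional overflow
        current = current.wrapping_add(&delta);
        out.push(current);
    }
    out
}
```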
mini_block_idx: 0,
delta_bit_width: 0,
delta_bit_widths: ByteBuffer::new(),
deltas_in_mini_block: vec![],
Previously this would read into a `Vec<i64>`; we now read directly into the output buffer.
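A hedged before/after sketch (the helper name is mine, not the PR's): rather than collecting deltas into an intermediate `Vec<i64>` and converting afterwards, the cumulative values are written straight into the caller's output slice:

```rust
// Write reconstructed values directly into `output`, one per delta,
// with no intermediate Vec<i64> allocation.
fn decode_into(first: i32, deltas: &[i32], output: &mut [i32]) {
    let mut current = first;
    for (out, &delta) in output.iter_mut().zip(deltas) {
        current = current.wrapping_add(delta);
        *out = current;
    }
}
```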
-impl<T: DataType> DeltaBitPackDecoder<T> {
+impl<T: DataType> DeltaBitPackDecoder<T>
+where
+    T::T: Default + FromPrimitive + WrappingAdd + Copy,
Thanks to #1277 we can now add constraints to the types accepted here - this encoding is only valid for `DataType` where `DataType::T` is `i32` or `i64`. `ParquetValueType` is a crate-local trait, so users can't define custom value types.
 // Per block info
-min_delta: i64,
+/// The minimum delta in the block
+min_delta: T::T,
We decode and operate on `T::T` instead of converting everything to `i64` and back.
Force-pushed from b693f8d to fa6233a
parquet/benches/arrow_reader.rs (outdated)
@@ -78,16 +78,15 @@ fn build_plain_encoded_int32_page_iterator(
     max_def_level
 };
 if def_level == max_def_level {
-    int32_value += 1;
-    values.push(int32_value);
+    values.push(rng.gen_range(0..1000));
This change makes the benchmark more realistic: a constant offset between consecutive values is the optimal case for `DeltaBitPackDecoder` - the `min_delta` will be 1, and all the miniblocks will have a bit width of 0.
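A short worked example of why an increasing sequence is the best case (standalone illustrative Rust, not the benchmark code): values 1, 2, 3, 4 have deltas [1, 1, 1], so `min_delta` is 1 and every adjusted delta is 0, which packs at bit width 0:

```rust
fn main() {
    let values = [1i64, 2, 3, 4];
    // DELTA_BINARY_PACKED stores deltas between consecutive values,
    let deltas: Vec<i64> = values.windows(2).map(|w| w[1] - w[0]).collect();
    // then subtracts the block's min_delta, so a constant step leaves zeros
    let min_delta = *deltas.iter().min().unwrap();
    let adjusted: Vec<i64> = deltas.iter().map(|d| d - min_delta).collect();
    assert_eq!(min_delta, 1);
    assert_eq!(adjusted, vec![0, 0, 0]); // packs at bit width 0
}
```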
Looks pretty good!
 self.values_per_mini_block = (block_size / self.num_mini_blocks) as usize;
-assert!(self.values_per_mini_block % 8 == 0);
+if self.values_per_mini_block % 32 != 0 {
In theory, parquet-mr allows `values_per_mini_block` to be a multiple of 8 (see https://issues.apache.org/jira/browse/PARQUET-2077), although I think it's very unlikely to happen.
This feels to me like a bug in parquet-mr 😅
The real question is if any actual parquet files have this pattern (`values_per_mini_block` being a multiple of 8 but not of 32).
I don't know the answer. In parquet-mr it is assumed to be a multiple of 8 in both the read and write paths. As I mentioned above, I personally think it's extremely unlikely to happen, since most users will just use the higher-level APIs in parquet-mr to write files, instead of directly using `DeltaBinaryPackingValuesWriterForInteger` or `DeltaBinaryPackingValuesWriterForLong`.
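For context, a sketch of the stricter check being discussed (names assumed, not the PR's exact code): the vectorized path unpacks 32 values at a time, so miniblock sizes that are only a multiple of 8 are rejected with an error rather than asserted on:

```rust
// Returns the miniblock size, or an error for sizes the vectorized
// unpack (which works in chunks of 32 values) cannot handle.
fn values_per_mini_block(block_size: usize, num_mini_blocks: usize) -> Result<usize, String> {
    let values_per_mini_block = block_size / num_mini_blocks;
    if values_per_mini_block % 32 != 0 {
        return Err(format!(
            "miniblock size {} must be a multiple of 32",
            values_per_mini_block
        ));
    }
    Ok(values_per_mini_block)
}
```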
@@ -541,6 +547,17 @@ impl BitReader {
     let mut i = 0;

+    if num_bits > 32 {
We can probably add `unpack64` similar to Arrow C++. It also has SIMD acceleration.
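For illustration, a scalar sketch of the case this branch must handle (my own code, not the crate's `BitReader` or a hypothetical `unpack64`): reading a single value wider than 32 bits from an LSB-first bit-packed buffer, as needed for `i64` columns:

```rust
// Read `num_bits` (up to 64) starting at `bit_offset`, LSB-first as in
// parquet's bit-packing, returning the value as a u64.
fn read_bits(data: &[u8], bit_offset: usize, num_bits: usize) -> u64 {
    let mut value: u64 = 0;
    for i in 0..num_bits {
        let bit = bit_offset + i;
        if (data[bit / 8] >> (bit % 8)) & 1 == 1 {
            value |= 1 << i;
        }
    }
    value
}
```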
Force-pushed from 0cc9fe9 to f987da9
@@ -17,15 +17,19 @@

 use arrow::array::Array;
 use arrow::datatypes::DataType;
 use criterion::{criterion_group, criterion_main, Criterion};
+use criterion::measurement::WallTime;
The benchmarks are reworked to allow using the same code for different encodings of integer primitives, along with different primitive types (Int32, Int64). It should be fairly mechanical to extend to other types should we wish to in future
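Roughly, the reworked shape looks like this (a sketch with assumed names, not the actual benchmark file): one criterion group per array type, with the encoding and null density carried in the individual benchmark names:

```rust
use criterion::{criterion_group, criterion_main, Criterion};

fn bench_int32(c: &mut Criterion) {
    // the array type is carried by the group name, the encoding and
    // null density by the individual benchmark names
    let mut group = c.benchmark_group("arrow_array_reader/Int32Array");
    for encoding in ["plain encoded", "binary packed", "dictionary encoded"] {
        group.bench_function(format!("{}, mandatory, no NULLs", encoding), |b| {
            b.iter(|| {
                // build a page iterator for `encoding` and decode it here
            });
        });
    }
    group.finish();
}

criterion_group!(benches, bench_int32);
criterion_main!(benches);
```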
    assert_eq!(count, EXPECTED_VALUE_COUNT);
},
);
group.bench_function("plain encoded, mandatory, no NULLs", |b| {
The type, i.e. `StringArray`, is now encoded in the group name.

Apologies for taking so long to get back to this, but I think this should incorporate all the feedback now.
Codecov Report
@@ Coverage Diff @@
## master #1284 +/- ##
==========================================
- Coverage 83.03% 83.00% -0.03%
==========================================
Files 180 180
Lines 52296 52753 +457
==========================================
+ Hits 43422 43788 +366
- Misses 8874 8965 +91
Continue to review full report at Codecov.
I read the logic pretty carefully and it makes sense to me. However, I don't think I am enough of an expert in this format (or code) to be able to offer a thorough review of just the logic on its own merits (though it does look very nice 👌 ).
Before we release this I would like to ensure that we get some more testing with real data. The idea would be to read with the existing decoder and the new decoder and check that the values are the same.
@tustvold and I can definitely use some internal parquet files, but it would be great to get others involved as well. Perhaps @maxburke or @jhorstmann have some things they could test with. We could also perhaps test with the files on apache/datafusion#1441
So my proposal:
- Merge this PR in
- Send a note to the mailing list / slack asking people to test with a pre-release version of arrow
Any thoughts?
@@ -78,16 +89,17 @@ fn build_plain_encoded_int32_page_iterator(
     max_def_level
 };
 if def_level == max_def_level {
-    int32_value += 1;
-    values.push(int32_value);
+    let value =
This changes the benchmark to use random numbers rather than an increasing sequence (1, 2, ...), right?
Yes, it's otherwise too optimal 😂
 self.values_per_mini_block = (block_size / self.num_mini_blocks) as usize;
-assert!(self.values_per_mini_block % 8 == 0);
+if self.values_per_mini_block % 32 != 0 {
> The real question is if any actual parquet files have this pattern (`values_per_mini_block` being a multiple of 8 but not of 32)
I'm personally fairly confident in the coverage from the fuzz tests, but getting more coverage, particularly of different encoders which may interpret the spec differently, definitely couldn't hurt and would add an extra safety guarantee 👍
🤔 I also ran the performance tests on a Google Cloud machine (c2-standard-16, CPU Intel Cascade Lake). I ran the benchmarks like this:

cargo bench -p parquet --bench arrow_reader --features=test_common,experimental -- --save-baseline after

And then compared:

alamb@instance-1:/data/arrow-rs$ critcmp before after
group after before
----- ----- ------
arrow_array_reader/Int32Array/binary packed, mandatory, no NULLs 1.00 31.5±0.04µs ? ?/sec 3.15 99.1±0.22µs ? ?/sec
arrow_array_reader/Int32Array/binary packed, optional, half NULLs 1.00 45.8±0.18µs ? ?/sec 1.73 79.5±0.32µs ? ?/sec
arrow_array_reader/Int32Array/binary packed, optional, no NULLs 1.00 49.1±0.11µs ? ?/sec 2.38 116.8±0.28µs ? ?/sec
arrow_array_reader/Int32Array/dictionary encoded, mandatory, no NULLs 1.00 36.1±0.07µs ? ?/sec 1.02 36.8±0.08µs ? ?/sec
arrow_array_reader/Int32Array/dictionary encoded, optional, half NULLs 1.00 48.6±0.11µs ? ?/sec 1.01 49.0±0.16µs ? ?/sec
arrow_array_reader/Int32Array/dictionary encoded, optional, no NULLs 1.00 53.6±0.13µs ? ?/sec 1.02 54.7±0.10µs ? ?/sec
arrow_array_reader/Int32Array/plain encoded, mandatory, no NULLs 1.01 4.7±0.24µs ? ?/sec 1.00 4.7±0.11µs ? ?/sec
arrow_array_reader/Int32Array/plain encoded, optional, half NULLs 1.00 32.2±0.09µs ? ?/sec 1.01 32.5±0.19µs ? ?/sec
arrow_array_reader/Int32Array/plain encoded, optional, no NULLs 1.00 22.4±0.64µs ? ?/sec 1.03 23.0±0.35µs ? ?/sec
arrow_array_reader/Int64Array/binary packed, mandatory, no NULLs 1.00 40.5±0.05µs ? ?/sec 5.77 233.9±0.68µs ? ?/sec
arrow_array_reader/Int64Array/binary packed, optional, half NULLs 1.00 51.3±0.08µs ? ?/sec 2.87 147.0±0.45µs ? ?/sec
arrow_array_reader/Int64Array/binary packed, optional, no NULLs 1.00 58.3±0.06µs ? ?/sec 4.31 251.3±0.24µs ? ?/sec
arrow_array_reader/Int64Array/dictionary encoded, mandatory, no NULLs 1.00 37.5±0.07µs ? ?/sec 1.02 38.3±0.15µs ? ?/sec
arrow_array_reader/Int64Array/dictionary encoded, optional, half NULLs 1.00 49.5±0.16µs ? ?/sec 1.02 50.4±0.21µs ? ?/sec
arrow_array_reader/Int64Array/dictionary encoded, optional, no NULLs 1.00 54.7±0.11µs ? ?/sec 1.02 56.0±0.11µs ? ?/sec
arrow_array_reader/Int64Array/plain encoded, mandatory, no NULLs 1.00 7.9±0.44µs ? ?/sec 1.02 8.0±0.26µs ? ?/sec
arrow_array_reader/Int64Array/plain encoded, optional, half NULLs 1.00 34.3±0.12µs ? ?/sec 1.00 34.3±0.06µs ? ?/sec
arrow_array_reader/Int64Array/plain encoded, optional, no NULLs 1.01 26.3±1.57µs ? ?/sec 1.00 26.1±0.44µs ? ?/sec
arrow_array_reader/StringArray/dictionary encoded, mandatory, no NULLs 1.00 204.7±1.03µs ? ?/sec 1.01 205.8±0.93µs ? ?/sec
arrow_array_reader/StringArray/dictionary encoded, optional, half NULLs 1.00 226.6±2.75µs ? ?/sec 1.00 226.9±0.76µs ? ?/sec
arrow_array_reader/StringArray/dictionary encoded, optional, no NULLs 1.01 223.2±1.35µs ? ?/sec 1.00 221.9±1.55µs ? ?/sec
arrow_array_reader/StringArray/plain encoded, mandatory, no NULLs 1.00 213.5±0.77µs ? ?/sec 1.00 212.9±0.61µs ? ?/sec
arrow_array_reader/StringArray/plain encoded, optional, half NULLs 1.01 233.5±0.63µs ? ?/sec 1.00 230.7±0.96µs ? ?/sec
arrow_array_reader/StringArray/plain encoded, optional, no NULLs 1.00 235.8±1.39µs ? ?/sec 1.00 236.1±0.86µs ? ?/sec
arrow_array_reader/StringDictionary/dictionary encoded, mandatory, no NULLs - new 1.00 28.4±0.07µs ? ?/sec 1.01 28.8±0.08µs ? ?/sec
arrow_array_reader/StringDictionary/dictionary encoded, mandatory, no NULLs - old 1.00 1910.0±3.25µs ? ?/sec 1.00 1913.3±8.46µs ? ?/sec
arrow_array_reader/StringDictionary/dictionary encoded, optional, half NULLs - new 1.00 45.8±0.39µs ? ?/sec 1.02 46.7±0.10µs ? ?/sec
arrow_array_reader/StringDictionary/dictionary encoded, optional, half NULLs - old 1.00 1726.2±3.24µs ? ?/sec 1.00 1731.8±6.23µs ? ?/sec
arrow_array_reader/StringDictionary/dictionary encoded, optional, no NULLs - new 1.00 46.0±0.11µs ? ?/sec 1.01 46.6±0.20µs ? ?/sec
arrow_array_reader/StringDictionary/dictionary encoded, optional, no NULLs - old 1.00 1963.8±3.34µs ? ?/sec 1.00 1968.7±4.28µs ? ?/sec

There are some very nice improvements (3x - 5x) in binary packed decoding 👨🍳 👌
LGTM too. Thanks @tustvold! The performance improvement looks great!
I spent a few hours this morning testing this reader against the existing implementation on an internal corpus of parquet files, using this tool: https://github.com/alamb/parquet_cmp. I am pleased to say the same values were produced.
I encourage everyone else with a similar corpus to give it a try and report their results 🚀
@alamb @tustvold do you see many production use cases of DeltaBinaryPacked encoding? My understanding is most people are still using the Parquet V1 format and hence PLAIN + DICTIONARY + RLE encodings. We also recently implemented support for DeltaBinaryPacked encoding in Spark, and the read performance is slower than PLAIN even for sorted data, see here. From the benchmark results above, it also appears to be slower than PLAIN (although on par with DICTIONARY).
At least for IOx, we're in control of the parquet data written, and so it is a case of us choosing to write using the encodings that give us the best balance of compression and performance. Ultimately I saw this decoder show up in a profile, as the dictionary encoding appears to spill to this, and realised there was clearly some low-hanging fruit here, so I coded it up. I'm personally optimistic there are further potential improvements that will make DeltaBinaryPacked have comparable decode performance to PLAIN, at which point it effectively becomes free compression, but I don't know that for sure 😅 So to directly answer your question: we do have production use cases of DeltaBinaryPacked encoding, but this is somewhat accidental, and we would likely switch to something else should it yield better performance characteristics.
FWIW one important use case in IOx is timestamps, where the delta between values is often very regular, so it compresses very well.
Thank you both for the input! This information is very useful. Yes, it does seem a good fit for the timestamp use case. We are also evaluating the V2 format here at Apple and trying to identify scenarios where it is clearly better than V1.
Which issue does this PR close?

Closes #1281.

Rationale for this change

Make `DeltaBitPackDecoder` faster.

What changes are included in this PR?

Adapts `DeltaBitPackDecoder` to eliminate intermediate buffering, and to vectorize better. The performance bump is not quite what I was hoping for - I suspect `unpack32` may not be vectorizing correctly - but it is still decent.

Are there any user-facing changes?

No, the encoding module is experimental.
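For reference, a scalar sketch of what an `unpack32`-style routine does (illustrative only; the real implementation is in the PR, and whether loops like this auto-vectorize is exactly the open question above): decode 32 values of `bit_width` bits each from an LSB-first packed buffer:

```rust
// Unpack 32 values of `bit_width` bits (0..=32) from `input` into `output`.
// Assumes `input` holds at least (32 * bit_width + 7) / 8 bytes.
fn unpack32_scalar(input: &[u8], bit_width: usize, output: &mut [u32; 32]) {
    for (i, out) in output.iter_mut().enumerate() {
        let bit = i * bit_width;
        let start = bit / 8;
        // load 8 bytes so the value plus its sub-byte offset always fits
        let mut buf = [0u8; 8];
        let len = input.len().saturating_sub(start).min(8);
        buf[..len].copy_from_slice(&input[start..start + len]);
        let word = u64::from_le_bytes(buf);
        let mask = if bit_width >= 32 { u32::MAX as u64 } else { (1u64 << bit_width) - 1 };
        *out = ((word >> (bit % 8)) & mask) as u32;
    }
}
```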