Vectorize DeltaBitPackDecoder, up to 5x faster decoding #1284
Conversation
Force-pushed from 2488fa2 to b693f8d
@@ -39,6 +39,7 @@ flate2 = { version = "1.0", optional = true }
 lz4 = { version = "1.23", optional = true }
 zstd = { version = "0.10", optional = true }
 chrono = { version = "0.4", default-features = false }
+num = "0.4"
I initially looked to extend the existing traits, but this ran into a couple of issues:

- `wrapping_add` doesn't make sense for all types, e.g. `ByteArray`
- A lot of effort has already been put into creating numeric traits, we might as well just benefit from this
- This crate is already used by arrow
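For illustration, a minimal sketch (my own code, not the PR's) of how the `num` traits combine here: `FromPrimitive` converts a decoded delta into the column's native type, and `WrappingAdd` accumulates it with wrapping overflow semantics, generically over `i32`/`i64`:

```rust
use num::traits::{FromPrimitive, WrappingAdd};

// Reconstruct values from a first value and a list of deltas, generic
// over any integer type implementing the `num` traits (e.g. i32, i64).
fn accumulate<T: FromPrimitive + WrappingAdd + Copy>(first: T, deltas: &[i64]) -> Vec<T> {
    let mut out = Vec::with_capacity(deltas.len() + 1);
    let mut current = first;
    out.push(current);
    for &delta in deltas {
        // from_i64 returns None if the delta doesn't fit in T
        let delta = T::from_i64(delta).expect("delta out of range");
        // wrapping_add avoids panicking on intentional overflow
        current = current.wrapping_add(&delta);
        out.push(current);
    }
    out
}
```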
mini_block_idx: 0,
delta_bit_width: 0,
delta_bit_widths: ByteBuffer::new(),
deltas_in_mini_block: vec![],
Previously this would read into a `Vec<i64>`; we now read directly into the output buffer.
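A hedged before/after sketch (the helper name is mine, not the PR's): rather than collecting deltas into an intermediate `Vec<i64>` and converting afterwards, the cumulative values are written straight into the caller's output slice:

```rust
// Write reconstructed values directly into `output`, one per delta,
// with no intermediate Vec<i64> allocation.
fn decode_into(first: i32, deltas: &[i32], output: &mut [i32]) {
    let mut current = first;
    for (out, &delta) in output.iter_mut().zip(deltas) {
        current = current.wrapping_add(delta);
        *out = current;
    }
}
```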
-impl<T: DataType> DeltaBitPackDecoder<T> {
+impl<T: DataType> DeltaBitPackDecoder<T>
+where
+    T::T: Default + FromPrimitive + WrappingAdd + Copy,
Thanks to #1277 we can now add constraints to the types accepted here - this encoding is only valid for `DataType` where `DataType::T` is `i32` or `i64`. `ParquetValueType` is a crate-local trait, so users can't define custom value types.
 // Per block info
-min_delta: i64,
+/// The minimum delta in the block
+min_delta: T::T,
We decode and operate on `T::T` instead of converting everything to `i64` and back.
Force-pushed from b693f8d to fa6233a
parquet/benches/arrow_reader.rs (outdated)
@@ -78,16 +78,15 @@ fn build_plain_encoded_int32_page_iterator(
     max_def_level
 };
 if def_level == max_def_level {
-    int32_value += 1;
-    values.push(int32_value);
+    values.push(rng.gen_range(0..1000));
This change makes the benchmark more realistic: a constant offset between consecutive values is the optimal case for `DeltaBitPackDecoder` - the `min_delta` will be 1, and all the miniblocks will have a bit width of 0.
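A short worked example of why an increasing sequence is the best case (standalone illustrative Rust, not the benchmark code): values 1, 2, 3, 4 have deltas [1, 1, 1], so `min_delta` is 1 and every adjusted delta is 0, which packs at bit width 0:

```rust
fn main() {
    let values = [1i64, 2, 3, 4];
    // DELTA_BINARY_PACKED stores deltas between consecutive values,
    let deltas: Vec<i64> = values.windows(2).map(|w| w[1] - w[0]).collect();
    // then subtracts the block's min_delta, so a constant step leaves zeros
    let min_delta = *deltas.iter().min().unwrap();
    let adjusted: Vec<i64> = deltas.iter().map(|d| d - min_delta).collect();
    assert_eq!(min_delta, 1);
    assert_eq!(adjusted, vec![0, 0, 0]); // packs at bit width 0
}
```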
Looks pretty good!
 self.values_per_mini_block = (block_size / self.num_mini_blocks) as usize;
-assert!(self.values_per_mini_block % 8 == 0);
+if self.values_per_mini_block % 32 != 0 {
In theory, parquet-mr allows `values_per_mini_block` to be a multiple of 8 (see https://issues.apache.org/jira/browse/PARQUET-2077), although I think it's very unlikely to happen.
This feels to me like a bug in parquet-mr 😅
The real question is if any actual parquet files have this pattern (`values_per_mini_block` being a multiple of 8 but not of 32).
I don't know the answer. In parquet-mr it is assumed to be a multiple of 8 in both the read and write paths. As I mentioned above, I personally think it's extremely unlikely to happen, since most users will just use the higher-level APIs in parquet-mr to write files, instead of directly using `DeltaBinaryPackingValuesWriterForInteger` or `DeltaBinaryPackingValuesWriterForLong`.
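For context, a sketch of the stricter check being discussed (names assumed, not the PR's exact code): the vectorized path unpacks 32 values at a time, so miniblock sizes that are only a multiple of 8 are rejected with an error rather than asserted on:

```rust
// Returns the miniblock size, or an error for sizes the vectorized
// unpack (which works in chunks of 32 values) cannot handle.
fn values_per_mini_block(block_size: usize, num_mini_blocks: usize) -> Result<usize, String> {
    let values_per_mini_block = block_size / num_mini_blocks;
    if values_per_mini_block % 32 != 0 {
        return Err(format!(
            "miniblock size {} must be a multiple of 32",
            values_per_mini_block
        ));
    }
    Ok(values_per_mini_block)
}
```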
@@ -541,6 +547,17 @@ impl BitReader {
     let mut i = 0;

+    if num_bits > 32 {
We can probably add `unpack64` similar to Arrow C++. It also has SIMD acceleration.
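For illustration, a scalar sketch of the case this branch must handle (my own code, not the crate's `BitReader` or a hypothetical `unpack64`): reading a single value wider than 32 bits from an LSB-first bit-packed buffer, as needed for `i64` columns:

```rust
// Read `num_bits` (up to 64) starting at `bit_offset`, LSB-first as in
// parquet's bit-packing, returning the value as a u64.
fn read_bits(data: &[u8], bit_offset: usize, num_bits: usize) -> u64 {
    let mut value: u64 = 0;
    for i in 0..num_bits {
        let bit = bit_offset + i;
        if (data[bit / 8] >> (bit % 8)) & 1 == 1 {
            value |= 1 << i;
        }
    }
    value
}
```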
Force-pushed from 0cc9fe9 to f987da9
@@ -17,15 +17,19 @@

 use arrow::array::Array;
 use arrow::datatypes::DataType;
 use criterion::{criterion_group, criterion_main, Criterion};
+use criterion::measurement::WallTime;
The benchmarks are reworked to allow using the same code for different encodings of integer primitives, along with different primitive types (Int32, Int64). It should be fairly mechanical to extend to other types should we wish to in future
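Roughly, the reworked shape looks like this (a sketch with assumed names, not the actual benchmark file): one criterion group per array type, with the encoding and null density carried in the individual benchmark names:

```rust
use criterion::{criterion_group, criterion_main, Criterion};

fn bench_int32(c: &mut Criterion) {
    // the array type is carried by the group name, the encoding and
    // null density by the individual benchmark names
    let mut group = c.benchmark_group("arrow_array_reader/Int32Array");
    for encoding in ["plain encoded", "binary packed", "dictionary encoded"] {
        group.bench_function(format!("{}, mandatory, no NULLs", encoding), |b| {
            b.iter(|| {
                // build a page iterator for `encoding` and decode it here
            });
        });
    }
    group.finish();
}

criterion_group!(benches, bench_int32);
criterion_main!(benches);
```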
    assert_eq!(count, EXPECTED_VALUE_COUNT);
},
);
group.bench_function("plain encoded, mandatory, no NULLs", |b| {
The type, i.e. `StringArray`, is now encoded in the group name.

Apologies for taking so long to get back to this, but I think this should incorporate all the feedback now.
Codecov Report
@@ Coverage Diff @@
## master #1284 +/- ##
==========================================
- Coverage 83.03% 83.00% -0.03%
==========================================
Files 180 180
Lines 52296 52753 +457
==========================================
+ Hits 43422 43788 +366
- Misses 8874 8965 +91
Continue to review full report at Codecov.
I read the logic pretty carefully and it makes sense to me. However, I don't think I am enough of an expert in this format (or code) to be able to offer a thorough review of just the logic on its own merits (though it does look very nice 👌 ).
Before we release this I would like to ensure that we get some more testing with real data. The idea would be to read with the existing decoder and the new decoder and check that the values are the same.
@tustvold and I can definitely use some internal parquet files, but it would be great to get others involved as well. Perhaps @maxburke or @jhorstmann have some things they could test with. We could also perhaps test with the files on apache/datafusion#1441
So my proposal:
- Merge this PR in
- Send a note to the mailing list / slack asking people to test with a pre-release version of arrow
Any thoughts?
@@ -78,16 +89,17 @@ fn build_plain_encoded_int32_page_iterator(
     max_def_level
 };
 if def_level == max_def_level {
-    int32_value += 1;
-    values.push(int32_value);
+    let value =
This changes the benchmark to use random numbers rather than an increasing sequence (1, 2, ...), right?
Yes, it's otherwise too optimal 😂
 self.values_per_mini_block = (block_size / self.num_mini_blocks) as usize;
-assert!(self.values_per_mini_block % 8 == 0);
+if self.values_per_mini_block % 32 != 0 {
> The real question is if any actual parquet files have this pattern (`values_per_mini_block` being a multiple of 8 but not of 32)
I'm personally fairly confident in the coverage from the fuzz tests, but getting more coverage, particularly of different encoders which may interpret the spec differently, definitely couldn't hurt and would add an extra safety guarantee 👍
🤔 I also ran the performance tests on a Google Cloud machine (c2-standard-16, CPU Intel Cascade Lake). I ran the benchmarks like this:

cargo bench -p parquet --bench arrow_reader --features=test_common,experimental -- --save-baseline after

And then compared:

alamb@instance-1:/data/arrow-rs$ critcmp before after
group after before
----- ----- ------
arrow_array_reader/Int32Array/binary packed, mandatory, no NULLs 1.00 31.5±0.04µs ? ?/sec 3.15 99.1±0.22µs ? ?/sec
arrow_array_reader/Int32Array/binary packed, optional, half NULLs 1.00 45.8±0.18µs ? ?/sec 1.73 79.5±0.32µs ? ?/sec
arrow_array_reader/Int32Array/binary packed, optional, no NULLs 1.00 49.1±0.11µs ? ?/sec 2.38 116.8±0.28µs ? ?/sec
arrow_array_reader/Int32Array/dictionary encoded, mandatory, no NULLs 1.00 36.1±0.07µs ? ?/sec 1.02 36.8±0.08µs ? ?/sec
arrow_array_reader/Int32Array/dictionary encoded, optional, half NULLs 1.00 48.6±0.11µs ? ?/sec 1.01 49.0±0.16µs ? ?/sec
arrow_array_reader/Int32Array/dictionary encoded, optional, no NULLs 1.00 53.6±0.13µs ? ?/sec 1.02 54.7±0.10µs ? ?/sec
arrow_array_reader/Int32Array/plain encoded, mandatory, no NULLs 1.01 4.7±0.24µs ? ?/sec 1.00 4.7±0.11µs ? ?/sec
arrow_array_reader/Int32Array/plain encoded, optional, half NULLs 1.00 32.2±0.09µs ? ?/sec 1.01 32.5±0.19µs ? ?/sec
arrow_array_reader/Int32Array/plain encoded, optional, no NULLs 1.00 22.4±0.64µs ? ?/sec 1.03 23.0±0.35µs ? ?/sec
arrow_array_reader/Int64Array/binary packed, mandatory, no NULLs 1.00 40.5±0.05µs ? ?/sec 5.77 233.9±0.68µs ? ?/sec
arrow_array_reader/Int64Array/binary packed, optional, half NULLs 1.00 51.3±0.08µs ? ?/sec 2.87 147.0±0.45µs ? ?/sec
arrow_array_reader/Int64Array/binary packed, optional, no NULLs 1.00 58.3±0.06µs ? ?/sec 4.31 251.3±0.24µs ? ?/sec
arrow_array_reader/Int64Array/dictionary encoded, mandatory, no NULLs 1.00 37.5±0.07µs ? ?/sec 1.02 38.3±0.15µs ? ?/sec
arrow_array_reader/Int64Array/dictionary encoded, optional, half NULLs 1.00 49.5±0.16µs ? ?/sec 1.02 50.4±0.21µs ? ?/sec
arrow_array_reader/Int64Array/dictionary encoded, optional, no NULLs 1.00 54.7±0.11µs ? ?/sec 1.02 56.0±0.11µs ? ?/sec
arrow_array_reader/Int64Array/plain encoded, mandatory, no NULLs 1.00 7.9±0.44µs ? ?/sec 1.02 8.0±0.26µs ? ?/sec
arrow_array_reader/Int64Array/plain encoded, optional, half NULLs 1.00 34.3±0.12µs ? ?/sec 1.00 34.3±0.06µs ? ?/sec
arrow_array_reader/Int64Array/plain encoded, optional, no NULLs 1.01 26.3±1.57µs ? ?/sec 1.00 26.1±0.44µs ? ?/sec
arrow_array_reader/StringArray/dictionary encoded, mandatory, no NULLs 1.00 204.7±1.03µs ? ?/sec 1.01 205.8±0.93µs ? ?/sec
arrow_array_reader/StringArray/dictionary encoded, optional, half NULLs 1.00 226.6±2.75µs ? ?/sec 1.00 226.9±0.76µs ? ?/sec
arrow_array_reader/StringArray/dictionary encoded, optional, no NULLs 1.01 223.2±1.35µs ? ?/sec 1.00 221.9±1.55µs ? ?/sec
arrow_array_reader/StringArray/plain encoded, mandatory, no NULLs 1.00 213.5±0.77µs ? ?/sec 1.00 212.9±0.61µs ? ?/sec
arrow_array_reader/StringArray/plain encoded, optional, half NULLs 1.01 233.5±0.63µs ? ?/sec 1.00 230.7±0.96µs ? ?/sec
arrow_array_reader/StringArray/plain encoded, optional, no NULLs 1.00 235.8±1.39µs ? ?/sec 1.00 236.1±0.86µs ? ?/sec
arrow_array_reader/StringDictionary/dictionary encoded, mandatory, no NULLs - new 1.00 28.4±0.07µs ? ?/sec 1.01 28.8±0.08µs ? ?/sec
arrow_array_reader/StringDictionary/dictionary encoded, mandatory, no NULLs - old 1.00 1910.0±3.25µs ? ?/sec 1.00 1913.3±8.46µs ? ?/sec
arrow_array_reader/StringDictionary/dictionary encoded, optional, half NULLs - new 1.00 45.8±0.39µs ? ?/sec 1.02 46.7±0.10µs ? ?/sec
arrow_array_reader/StringDictionary/dictionary encoded, optional, half NULLs - old 1.00 1726.2±3.24µs ? ?/sec 1.00 1731.8±6.23µs ? ?/sec
arrow_array_reader/StringDictionary/dictionary encoded, optional, no NULLs - new 1.00 46.0±0.11µs ? ?/sec 1.01 46.6±0.20µs ? ?/sec
arrow_array_reader/StringDictionary/dictionary encoded, optional, no NULLs - old 1.00 1963.8±3.34µs ? ?/sec 1.00 1968.7±4.28µs ? ?/sec

There are some very nice improvements (3x - 5x) in binary packed decoding 👨🍳 👌
LGTM too. Thanks @tustvold! The performance improvement looks great!
I spent a few hours this morning testing this reader against the existing implementation on an internal corpus of parquet files, using this tool: https://github.com/alamb/parquet_cmp. I am pleased to say the same values were produced.
I encourage everyone else with a similar corpus to give it a try and report their results 🚀
@alamb @tustvold do you see many production use cases of DeltaBinaryPacked encoding? My understanding is most people are still using the Parquet V1 format and hence PLAIN + DICTIONARY + RLE encodings. We also recently implemented support for DeltaBinaryPacked encoding in Spark, and the read performance is slower than PLAIN even for sorted data, see here. From the benchmark results above, it also appears to be slower than PLAIN (although on par with DICTIONARY).
At least for IOx, we're in control of the parquet data written, and so it is a case of us choosing to write using the encodings that give us the best balance of compression and performance. Ultimately I saw this decoder show up in a profile, as the dictionary encoding appears to spill to this, and realised there was clearly some low-hanging fruit here, so I coded it up. I'm personally optimistic there are further potential improvements that will make DeltaBinaryPacked have comparable decode performance to PLAIN, at which point it effectively becomes free compression, but I don't know that for sure 😅 So to directly answer your question: we do have production use cases of DeltaBinaryPacked encoding, but this is somewhat accidental, and we would likely switch to something else should it yield better performance characteristics.
FWIW one important use case in IOx is timestamps, where the delta between values is often very regular, so it compresses very well.
Thank you both for the input! This information is very useful. Yes, it does seem a good fit for the timestamp use case. We are also evaluating the V2 format here at Apple and trying to identify scenarios where it is clearly better than V1.
Which issue does this PR close?

Closes #1281.

Rationale for this change

Make `DeltaBitPackDecoder` faster.

What changes are included in this PR?

Adapts `DeltaBitPackDecoder` to eliminate intermediate buffering, and to vectorize better. The performance bump is not quite what I was hoping for - I suspect `unpack32` may not be vectorizing correctly - but it is still decent.

Are there any user-facing changes?

No, the encoding module is experimental.
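For reference, a scalar sketch of what an `unpack32`-style routine does (illustrative only; the real implementation is in the PR, and whether loops like this auto-vectorize is exactly the open question above): decode 32 values of `bit_width` bits each from an LSB-first packed buffer:

```rust
// Unpack 32 values of `bit_width` bits (0..=32) from `input` into `output`.
// Assumes `input` holds at least (32 * bit_width + 7) / 8 bytes.
fn unpack32_scalar(input: &[u8], bit_width: usize, output: &mut [u32; 32]) {
    for (i, out) in output.iter_mut().enumerate() {
        let bit = i * bit_width;
        let start = bit / 8;
        // load 8 bytes so the value plus its sub-byte offset always fits
        let mut buf = [0u8; 8];
        let len = input.len().saturating_sub(start).min(8);
        buf[..len].copy_from_slice(&input[start..start + len]);
        let word = u64::from_le_bytes(buf);
        let mask = if bit_width >= 32 { u32::MAX as u64 } else { (1u64 << bit_width) - 1 };
        *out = ((word >> (bit % 8)) & mask) as u32;
    }
}
```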