
Vectorize DeltaBitPackDecoder, up to 5x faster decoding #1284

Merged
merged 2 commits into apache:master from tustvold:delta-packed-reader on Feb 15, 2022

Conversation

@tustvold (Contributor) commented on Feb 7, 2022

Which issue does this PR close?

Closes #1281.

Rationale for this change

Make DeltaBitPackDecoder faster

What changes are included in this PR?

Adapts DeltaBitPackDecoder to eliminate intermediate buffering, and to vectorize better.

The performance bump is not quite what I was hoping for; I suspect unpack32 may not be vectorizing correctly. The improvement is still decent, though.

arrow_array_reader/read Int32Array, binary packed, mandatory, no NULLs                                                                             
                        time:   [20.642 us 20.650 us 20.659 us]
                        change: [-73.470% -73.455% -73.441%] (p = 0.00 < 0.05)
                        Performance has improved.
arrow_array_reader/read Int32Array, binary packed, optional, no NULLs                                                                             
                        time:   [32.453 us 32.459 us 32.466 us]
                        change: [-63.907% -63.896% -63.885%] (p = 0.00 < 0.05)
                        Performance has improved.
arrow_array_reader/read Int32Array, binary packed, optional, half NULLs                                                                             
                        time:   [35.435 us 35.445 us 35.456 us]
                        change: [-44.831% -44.794% -44.745%] (p = 0.00 < 0.05)
                        Performance has improved.

Are there any user-facing changes?

No, the encoding module is experimental.

@@ -39,6 +39,7 @@ flate2 = { version = "1.0", optional = true }
lz4 = { version = "1.23", optional = true }
zstd = { version = "0.10", optional = true }
chrono = { version = "0.4", default-features = false }
num = "0.4"
@tustvold (Contributor, Author) commented on Feb 8, 2022

I initially looked at extending the existing traits, but this ran into a couple of issues:

  • wrapping_add doesn't make sense for all types, e.g. ByteArray
  • A lot of effort has already been put into the num crate's numeric traits, so we might as well benefit from it

The num crate is already used by arrow.
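As a hedged illustration (not the PR's actual code) of why the num trait fits here — delta reconstruction is just repeated wrapping addition over an integer type, which is exactly what num's WrappingAdd abstracts, and which a type like ByteArray cannot provide:

```rust
// Sketch only: reconstruct values from deltas generically over any
// integer supporting wrapping addition, as the decoder's bound requires.
use num::traits::WrappingAdd;

fn reconstruct_values<T: WrappingAdd + Copy>(first: T, deltas: &[T]) -> Vec<T> {
    let mut out = Vec::with_capacity(deltas.len() + 1);
    let mut current = first;
    out.push(current);
    for delta in deltas {
        current = current.wrapping_add(delta);
        out.push(current);
    }
    out
}
```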

mini_block_idx: 0,
delta_bit_width: 0,
delta_bit_widths: ByteBuffer::new(),
deltas_in_mini_block: vec![],
@tustvold (Contributor, Author)

Previously this would read into a Vec<i64>; we now read directly into the output buffer.

impl<T: DataType> DeltaBitPackDecoder<T> {
impl<T: DataType> DeltaBitPackDecoder<T>
where
T::T: Default + FromPrimitive + WrappingAdd + Copy,
@tustvold (Contributor, Author) commented on Feb 8, 2022

Thanks to #1277 we can now add constraints to the types accepted here: this encoding is only valid for a DataType where DataType::T is i32 or i64. ParquetValueType is a crate-local trait, so users can't define custom value types.
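A hedged illustration of where the FromPrimitive bound helps (a hypothetical helper, not the PR's code): header values that were previously handled as i64 can be narrowed into the decoder's native T::T without a custom conversion trait:

```rust
// Hypothetical sketch: narrow an i64 header value (e.g. min_delta) into
// the decoder's native type (i32 or i64) via num's FromPrimitive.
use num::traits::FromPrimitive;

fn narrow_min_delta<T: FromPrimitive>(min_delta: i64) -> Option<T> {
    T::from_i64(min_delta) // None if the value does not fit in T
}
```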


// Per block info
min_delta: i64,
/// The minimum delta in the block
min_delta: T::T,
@tustvold (Contributor, Author)

We decode and operate on T::T directly, instead of converting everything to i64 and back.
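Roughly, the in-place fix-up looks like this (an illustrative sketch, not the exact PR code): the unpacked miniblock deltas land in the output slice and are turned into values in place, with no i64 staging buffer:

```rust
use num::traits::WrappingAdd;

// `out` holds the raw unpacked deltas on entry and the final values on exit.
fn fixup_in_place<T: WrappingAdd + Copy>(last_value: &mut T, min_delta: T, out: &mut [T]) {
    for v in out.iter_mut() {
        // value[i] = value[i-1] + min_delta + packed_delta[i]
        let delta = min_delta.wrapping_add(v);
        *last_value = last_value.wrapping_add(&delta);
        *v = *last_value;
    }
}
```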

@@ -78,16 +78,15 @@ fn build_plain_encoded_int32_page_iterator(
max_def_level
};
if def_level == max_def_level {
int32_value += 1;
values.push(int32_value);
values.push(rng.gen_range(0..1000));
@tustvold (Contributor, Author)

This change makes the benchmark more realistic. A constant offset between consecutive values is the optimal case for DeltaBitPackDecoder: the min_delta will be 1, and all the miniblocks will have a bit width of 0.
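To make the arithmetic concrete, a small standalone sketch (illustrative only):

```rust
// For values 1, 2, 3, ... every delta is 1, min_delta is 1, and every
// stored (delta - min_delta) is 0, so the miniblock bit width is 0.
// Random values in 0..1000 instead force several bits per delta.
fn miniblock_bit_width(values: &[i32]) -> u32 {
    let deltas: Vec<i64> = values
        .windows(2)
        .map(|w| w[1] as i64 - w[0] as i64)
        .collect();
    let min_delta = *deltas.iter().min().expect("need at least two values");
    let max_adjusted = deltas.iter().map(|d| (d - min_delta) as u64).max().unwrap();
    64 - max_adjusted.leading_zeros() // bits required per packed delta
}
```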

@tustvold tustvold marked this pull request as ready for review February 8, 2022 11:05
@alamb alamb requested a review from sunchao February 8, 2022 19:53
@sunchao (Member) left a comment

Looks pretty good!

Review threads on these files were resolved (outdated):
  • parquet/src/encodings/decoding.rs (3 threads)
  • parquet/src/util/bit_util.rs

self.values_per_mini_block = (block_size / self.num_mini_blocks) as usize;
assert!(self.values_per_mini_block % 8 == 0);
if self.values_per_mini_block % 32 != 0 {
@sunchao (Member)

In theory, parquet-mr allows values_per_mini_block to be a multiple of 8 (see https://issues.apache.org/jira/browse/PARQUET-2077), although I think it's very unlikely to happen in practice.

@tustvold (Contributor, Author)

This feels to me like a bug in parquet-mr 😅

@alamb (Contributor)

The real question is whether any actual parquet files have this pattern (values_per_mini_block being a multiple of 8 but not of 32).

@sunchao (Member)

I don't know the answer. In parquet-mr it is assumed to be a multiple of 8 in both the read and write paths. As I mentioned above, I personally think it's extremely unlikely to happen, since most users will just use the higher-level APIs in parquet-mr to write files, instead of directly using DeltaBinaryPackingValuesWriterForInteger or DeltaBinaryPackingValuesWriterForLong.
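As context for the discussion, a hedged sketch of the check in the diff above (illustrative names; the branch body isn't shown in the diff, so the error message here is assumed):

```rust
// Illustrative only: derive values_per_mini_block from the block header
// and reject layouts this vectorized reader cannot handle, rather than
// panicking via assert!. The spec'd parquet-mr layouts are multiples of 8;
// whole 32-value chunks are what unpack32 works on.
fn values_per_mini_block(block_size: usize, num_mini_blocks: usize) -> Result<usize, String> {
    let values_per_mini_block = block_size / num_mini_blocks;
    if values_per_mini_block % 32 != 0 {
        return Err(format!(
            "unsupported values_per_mini_block: {} (expected a multiple of 32)",
            values_per_mini_block
        ));
    }
    Ok(values_per_mini_block)
}
```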

@@ -541,6 +547,17 @@ impl BitReader {

let mut i = 0;

if num_bits > 32 {
@sunchao (Member)

We can probably add an unpack64 similar to Arrow C++'s, which also has SIMD acceleration.
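For reference, a minimal scalar sketch of the semantics an unpack64 would implement (the Arrow C++ version is generated and SIMD-accelerated; this is illustrative only, not the crate's API):

```rust
/// Scalar sketch only: unpack little-endian bit-packed values of
/// `num_bits` bits each from `input` into `output`. Assumes `input`
/// holds enough bytes to cover `output.len() * num_bits` bits.
fn unpack64_scalar(input: &[u8], output: &mut [u64], num_bits: usize) {
    assert!(num_bits <= 64);
    let mask = if num_bits == 64 { u64::MAX } else { (1u64 << num_bits) - 1 };
    let mut bit_offset = 0;
    for out in output.iter_mut() {
        let byte = bit_offset / 8;
        let shift = bit_offset % 8;
        // Gather up to 16 bytes so a value straddling a 64-bit
        // boundary is still read in one go.
        let mut word: u128 = 0;
        for (i, b) in input[byte..].iter().take(16).enumerate() {
            word |= (*b as u128) << (8 * i);
        }
        *out = ((word >> shift) as u64) & mask;
        bit_offset += num_bits;
    }
}
```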

Review threads on these files were resolved (outdated):
  • parquet/src/encodings/decoding.rs
  • parquet/benches/arrow_reader.rs
@@ -17,15 +17,19 @@

use arrow::array::Array;
use arrow::datatypes::DataType;
use criterion::{criterion_group, criterion_main, Criterion};
use criterion::measurement::WallTime;
@tustvold (Contributor, Author)

The benchmarks are reworked to allow using the same code for different encodings of integer primitives, along with different primitive types (Int32, Int64). It should be fairly mechanical to extend to other types should we wish to in future
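Roughly, the reworked benchmarks take this shape (a hedged sketch with illustrative names, not the actual helpers in arrow_reader.rs):

```rust
use criterion::measurement::WallTime;
use criterion::{criterion_group, criterion_main, BenchmarkGroup, Criterion};

// One generic body serves every (type, encoding, null-density) combination;
// the array type only appears in the Criterion group name.
fn add_benches(group: &mut BenchmarkGroup<WallTime>) {
    group.bench_function("binary packed, mandatory, no NULLs", |b| {
        b.iter(|| {
            // build the page iterator for this encoding and drain the
            // array reader here
        })
    });
}

fn criterion_benchmark(c: &mut Criterion) {
    for type_name in ["Int32Array", "Int64Array"] {
        let mut group = c.benchmark_group(format!("arrow_array_reader/{type_name}"));
        add_benches(&mut group);
        group.finish();
    }
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
```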

assert_eq!(count, EXPECTED_VALUE_COUNT);
},
);
group.bench_function("plain encoded, mandatory, no NULLs", |b| {
@tustvold (Contributor, Author)

The type, e.g. StringArray, is now encoded in the group name.

@tustvold (Contributor, Author)

Apologies for taking so long to get back to this, but I think this should incorporate all the feedback now.

@codecov-commenter

Codecov Report

Merging #1284 (f987da9) into master (936ed5e) will decrease coverage by 0.02%.
The diff coverage is 76.92%.


@@            Coverage Diff             @@
##           master    #1284      +/-   ##
==========================================
- Coverage   83.03%   83.00%   -0.03%     
==========================================
  Files         180      180              
  Lines       52296    52753     +457     
==========================================
+ Hits        43422    43788     +366     
- Misses       8874     8965      +91     
Impacted Files Coverage Δ
parquet/src/encodings/decoding.rs 89.46% <75.00%> (-1.27%) ⬇️
parquet/src/util/bit_util.rs 93.19% <87.50%> (+<0.01%) ⬆️
arrow/src/compute/kernels/filter.rs 84.77% <0.00%> (-7.73%) ⬇️
parquet/src/arrow/record_reader.rs 93.57% <0.00%> (-0.51%) ⬇️
parquet/src/data_type.rs 76.25% <0.00%> (-0.36%) ⬇️
parquet/src/schema/printer.rs 72.32% <0.00%> (-0.16%) ⬇️
arrow/src/array/array_binary.rs 93.53% <0.00%> (-0.12%) ⬇️
parquet/src/encodings/rle.rs 92.62% <0.00%> (-0.08%) ⬇️
arrow/src/array/builder.rs 86.73% <0.00%> (-0.04%) ⬇️
parquet/src/schema/types.rs 87.15% <0.00%> (-0.03%) ⬇️
... and 16 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 936ed5e...f987da9. Read the comment docs.

@alamb (Contributor) left a comment

I read the logic pretty carefully and it makes sense to me. However, I don't think I am enough of an expert in this format (or code) to offer a thorough review of the logic on its own merits (though it does look very nice 👌).

Before we release this I would like to ensure we get some more testing with real data. The idea would be to read with both the existing decoder and the new decoder and verify that the values are the same.

@tustvold and I can definitely use some internal parquet files, but it would be great to get others involved as well. Perhaps @maxburke or @jhorstmann have some things they could test with. We could also perhaps test with the files on apache/datafusion#1441.

So my proposal:

  1. Merge this PR in
  2. Send a note to the mailing list / slack asking people to test with a pre-release version of arrow

Any thoughts?

@@ -78,16 +89,17 @@ fn build_plain_encoded_int32_page_iterator(
max_def_level
};
if def_level == max_def_level {
int32_value += 1;
values.push(int32_value);
let value =
@alamb (Contributor)

This changes the benchmark to use random numbers rather than an increasing sequence (1, 2, ...), right?

@tustvold (Contributor, Author)

Yes, it's otherwise too optimal 😂



@tustvold (Contributor, Author)

I'm personally fairly confident in the coverage from the fuzz tests, but getting more coverage, particularly of different encoders which may interpret the spec differently, definitely couldn't hurt and would add an extra safety guarantee 👍

@alamb (Contributor) commented on Feb 13, 2022

🤔 I also ran the performance tests

On a Google Cloud machine (c2-standard-16, Intel Cascade Lake CPU)

before: alamb@3125344 / the https://github.com/alamb/arrow-rs/tree/alamb/delta-packed-reader-benches branch, based off master 936ed5e with the new benches

after: this branch, f987da9 / https://github.com/tustvold/arrow-rs/tree/delta-packed-reader

I ran the benchmarks like this

cargo bench -p parquet --bench arrow_reader --features=test_common,experimental -- --save-baseline after

And then compared:

alamb@instance-1:/data/arrow-rs$ critcmp before after
group                                                                                 after                                  before
-----                                                                                 -----                                  ------
arrow_array_reader/Int32Array/binary packed, mandatory, no NULLs                      1.00     31.5±0.04µs        ? ?/sec    3.15     99.1±0.22µs        ? ?/sec
arrow_array_reader/Int32Array/binary packed, optional, half NULLs                     1.00     45.8±0.18µs        ? ?/sec    1.73     79.5±0.32µs        ? ?/sec
arrow_array_reader/Int32Array/binary packed, optional, no NULLs                       1.00     49.1±0.11µs        ? ?/sec    2.38    116.8±0.28µs        ? ?/sec
arrow_array_reader/Int32Array/dictionary encoded, mandatory, no NULLs                 1.00     36.1±0.07µs        ? ?/sec    1.02     36.8±0.08µs        ? ?/sec
arrow_array_reader/Int32Array/dictionary encoded, optional, half NULLs                1.00     48.6±0.11µs        ? ?/sec    1.01     49.0±0.16µs        ? ?/sec
arrow_array_reader/Int32Array/dictionary encoded, optional, no NULLs                  1.00     53.6±0.13µs        ? ?/sec    1.02     54.7±0.10µs        ? ?/sec
arrow_array_reader/Int32Array/plain encoded, mandatory, no NULLs                      1.01      4.7±0.24µs        ? ?/sec    1.00      4.7±0.11µs        ? ?/sec
arrow_array_reader/Int32Array/plain encoded, optional, half NULLs                     1.00     32.2±0.09µs        ? ?/sec    1.01     32.5±0.19µs        ? ?/sec
arrow_array_reader/Int32Array/plain encoded, optional, no NULLs                       1.00     22.4±0.64µs        ? ?/sec    1.03     23.0±0.35µs        ? ?/sec
arrow_array_reader/Int64Array/binary packed, mandatory, no NULLs                      1.00     40.5±0.05µs        ? ?/sec    5.77    233.9±0.68µs        ? ?/sec
arrow_array_reader/Int64Array/binary packed, optional, half NULLs                     1.00     51.3±0.08µs        ? ?/sec    2.87    147.0±0.45µs        ? ?/sec
arrow_array_reader/Int64Array/binary packed, optional, no NULLs                       1.00     58.3±0.06µs        ? ?/sec    4.31    251.3±0.24µs        ? ?/sec
arrow_array_reader/Int64Array/dictionary encoded, mandatory, no NULLs                 1.00     37.5±0.07µs        ? ?/sec    1.02     38.3±0.15µs        ? ?/sec
arrow_array_reader/Int64Array/dictionary encoded, optional, half NULLs                1.00     49.5±0.16µs        ? ?/sec    1.02     50.4±0.21µs        ? ?/sec
arrow_array_reader/Int64Array/dictionary encoded, optional, no NULLs                  1.00     54.7±0.11µs        ? ?/sec    1.02     56.0±0.11µs        ? ?/sec
arrow_array_reader/Int64Array/plain encoded, mandatory, no NULLs                      1.00      7.9±0.44µs        ? ?/sec    1.02      8.0±0.26µs        ? ?/sec
arrow_array_reader/Int64Array/plain encoded, optional, half NULLs                     1.00     34.3±0.12µs        ? ?/sec    1.00     34.3±0.06µs        ? ?/sec
arrow_array_reader/Int64Array/plain encoded, optional, no NULLs                       1.01     26.3±1.57µs        ? ?/sec    1.00     26.1±0.44µs        ? ?/sec
arrow_array_reader/StringArray/dictionary encoded, mandatory, no NULLs                1.00    204.7±1.03µs        ? ?/sec    1.01    205.8±0.93µs        ? ?/sec
arrow_array_reader/StringArray/dictionary encoded, optional, half NULLs               1.00    226.6±2.75µs        ? ?/sec    1.00    226.9±0.76µs        ? ?/sec
arrow_array_reader/StringArray/dictionary encoded, optional, no NULLs                 1.01    223.2±1.35µs        ? ?/sec    1.00    221.9±1.55µs        ? ?/sec
arrow_array_reader/StringArray/plain encoded, mandatory, no NULLs                     1.00    213.5±0.77µs        ? ?/sec    1.00    212.9±0.61µs        ? ?/sec
arrow_array_reader/StringArray/plain encoded, optional, half NULLs                    1.01    233.5±0.63µs        ? ?/sec    1.00    230.7±0.96µs        ? ?/sec
arrow_array_reader/StringArray/plain encoded, optional, no NULLs                      1.00    235.8±1.39µs        ? ?/sec    1.00    236.1±0.86µs        ? ?/sec
arrow_array_reader/StringDictionary/dictionary encoded, mandatory, no NULLs - new     1.00     28.4±0.07µs        ? ?/sec    1.01     28.8±0.08µs        ? ?/sec
arrow_array_reader/StringDictionary/dictionary encoded, mandatory, no NULLs - old     1.00   1910.0±3.25µs        ? ?/sec    1.00   1913.3±8.46µs        ? ?/sec
arrow_array_reader/StringDictionary/dictionary encoded, optional, half NULLs - new    1.00     45.8±0.39µs        ? ?/sec    1.02     46.7±0.10µs        ? ?/sec
arrow_array_reader/StringDictionary/dictionary encoded, optional, half NULLs - old    1.00   1726.2±3.24µs        ? ?/sec    1.00   1731.8±6.23µs        ? ?/sec
arrow_array_reader/StringDictionary/dictionary encoded, optional, no NULLs - new      1.00     46.0±0.11µs        ? ?/sec    1.01     46.6±0.20µs        ? ?/sec
arrow_array_reader/StringDictionary/dictionary encoded, optional, no NULLs - old      1.00   1963.8±3.34µs        ? ?/sec    1.00   1968.7±4.28µs        ? ?/sec

There are some very nice improvements (3x - 5x) in binary packed decoding 👨‍🍳 👌


@sunchao (Member) left a comment

LGTM too. Thanks @tustvold! The performance improvement looks great!

@alamb (Contributor) commented on Feb 15, 2022

I spent a few hours this morning testing this reader against the existing implementation on an internal corpus of parquet files, using this tool: https://github.com/alamb/parquet_cmp

I am pleased to say the same values were produced:

107 files read with different readers compared successfully

I encourage everyone else with a similar corpus to give it a try and report their results.

🚀

@alamb merged commit 02d17ab into apache:master on Feb 15, 2022
@sunchao (Member) commented on Feb 15, 2022

@alamb @tustvold do you see many production use cases of the DeltaBinaryPacked encoding? My understanding is that most people are still using the Parquet V1 format, and hence the PLAIN + DICTIONARY + RLE encodings. We also recently implemented support for the DeltaBinaryPacked encoding in Spark, and the read performance is slower than PLAIN even for sorted data (see here). From the benchmark results above, it also appears to be slower than PLAIN (although on par with DICTIONARY).

@tustvold (Contributor, Author)

At least for IOx, we're in control of the parquet data written, so it is a case of us choosing the encodings that give us the best balance of compression and performance. Ultimately I saw this decoder show up in a profile, as the dictionary encoding appears to spill to it, and realised there was clearly some low-hanging fruit here, so I coded it up.

I'm personally optimistic there are further potential improvements that will make DeltaBinaryPacked have comparable decode performance to PLAIN, at which point it effectively becomes free compression, but I don't know that for sure 😅

So to directly answer your question, we do have production use cases of DeltaBinaryPacked encoding, but this is somewhat accidental and we would likely switch to something else should it yield better performance characteristics.

@alamb (Contributor) commented on Feb 15, 2022

FWIW one important use case in IOx is timestamps, where the delta between values is often very regular, so DeltaBinaryPacked is a good fit.
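As a toy illustration of that fit (invented numbers): timestamps sampled at a fixed interval produce a constant delta, which packs to a zero-bit-width miniblock.

```rust
fn main() {
    let start: i64 = 1_600_000_000_000_000_000; // an arbitrary epoch in nanoseconds
    let step: i64 = 10_000_000; // one sample every 10ms
    let timestamps: Vec<i64> = (0..1024).map(|i: i64| start + i * step).collect();

    // Every delta equals min_delta, so the packed (delta - min_delta)
    // values are all zero: DeltaBinaryPacked stores them in 0 bits.
    let deltas: Vec<i64> = timestamps.windows(2).map(|w| w[1] - w[0]).collect();
    assert!(deltas.iter().all(|&d| d == step));
}
```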

@sunchao (Member) commented on Feb 15, 2022

Thank you both for the input! This information is very useful. Yes, it does seem a good fit for the timestamp use case. We are also evaluating the V2 format here at Apple, trying to identify scenarios where it is clearly better than V1.

@alamb alamb changed the title Vectorized DeltaBitPackDecoder (#1281) Vectorize DeltaBitPackDecoder, up to 5x faster decoding (#1281) Feb 16, 2022
@alamb alamb changed the title Vectorize DeltaBitPackDecoder, up to 5x faster decoding (#1281) Vectorize DeltaBitPackDecoder, up to 5x faster decoding Feb 16, 2022
Labels: parquet (Changes to the parquet crate), performance
Projects: None yet
Development: successfully merging this pull request may close these issues: Speed up DeltaBitPackDecoder (#1281)
4 participants