Add unpack8, unpack16, unpack64 (#2276) ~10-50% faster #2278

tustvold · 2022-08-02T10:06:03Z

Which issue does this PR close?

Rationale for this change

Eliminates a load of unsafe code, and allows unpacking into the native type instead of always having to go via u32.

I'm not confident the generated code is necessarily optimal, but godbolt would suggest that LLVM at least takes a decent stab at using the bit-shuffling instructions - https://rust.godbolt.org/z/1aMdKT8qc

What changes are included in this PR?

Are there any user-facing changes?

tustvold · 2022-08-02T16:55:01Z

Roughly ~10% performance improvement, which isn't bad given half the objective was cleaning up the code with an eventual view to implementing #2257. The major return for DeltaBitPackDecoder would likely only be realizable with #2282 and larger miniblock sizes.

Performance vs master

arrow_array_reader/Int32Array/plain encoded, mandatory, no NULLs                                                                             
                        time:   [4.8459 us 4.8589 us 4.8732 us]
                        change: [+0.3572% +0.8566% +1.3081%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
arrow_array_reader/Int32Array/plain encoded, optional, no NULLs                                                                             
                        time:   [22.337 us 22.347 us 22.358 us]
                        change: [-1.0612% -0.7802% -0.5283%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
arrow_array_reader/Int32Array/plain encoded, optional, half NULLs                                                                             
                        time:   [28.798 us 28.815 us 28.833 us]
                        change: [-13.045% -12.928% -12.803%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
arrow_array_reader/Int32Array/binary packed, mandatory, no NULLs                                                                             
                        time:   [33.001 us 33.015 us 33.031 us]
                        change: [-18.056% -18.023% -17.989%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe
arrow_array_reader/Int32Array/binary packed, optional, no NULLs                                                                             
                        time:   [50.711 us 50.729 us 50.747 us]
                        change: [-12.294% -12.252% -12.205%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
arrow_array_reader/Int32Array/binary packed, optional, half NULLs                                                                             
                        time:   [43.718 us 43.753 us 43.792 us]
                        change: [-15.172% -15.098% -15.016%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
arrow_array_reader/Int32Array/dictionary encoded, mandatory, no NULLs                                                                             
                        time:   [33.403 us 33.410 us 33.419 us]
                        change: [-7.3073% -7.2057% -7.0230%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low severe
  5 (5.00%) low mild
  1 (1.00%) high mild
  3 (3.00%) high severe
arrow_array_reader/Int32Array/dictionary encoded, optional, no NULLs                                                                             
                        time:   [51.100 us 51.116 us 51.133 us]
                        change: [-4.6051% -4.5621% -4.5166%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe
arrow_array_reader/Int32Array/dictionary encoded, optional, half NULLs                                                                             
                        time:   [44.215 us 44.235 us 44.255 us]
                        change: [-9.9963% -9.9351% -9.8587%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

arrow_array_reader/Int64Array/plain encoded, mandatory, no NULLs                                                                             
                        time:   [8.0967 us 8.1230 us 8.1534 us]
                        change: [+0.7474% +1.1695% +1.5384%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe
arrow_array_reader/Int64Array/plain encoded, optional, no NULLs                                                                             
                        time:   [25.602 us 25.631 us 25.665 us]
                        change: [-1.4309% -1.0332% -0.6523%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
arrow_array_reader/Int64Array/plain encoded, optional, half NULLs                                                                             
                        time:   [31.643 us 31.672 us 31.700 us]
                        change: [+14.966% +15.048% +15.132%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  10 (10.00%) high mild
arrow_array_reader/Int64Array/binary packed, mandatory, no NULLs                                                                            
                        time:   [62.860 us 62.874 us 62.891 us]
                        change: [-1.3340% -1.2969% -1.2569%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe
arrow_array_reader/Int64Array/binary packed, optional, no NULLs                                                                            
                        time:   [80.933 us 80.982 us 81.059 us]
                        change: [-0.8099% -0.7311% -0.6512%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe
arrow_array_reader/Int64Array/binary packed, optional, half NULLs                                                                            
                        time:   [58.719 us 58.774 us 58.832 us]
                        change: [+6.0864% +6.1605% +6.2268%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  7 (7.00%) high mild
  4 (4.00%) high severe
arrow_array_reader/Int64Array/dictionary encoded, mandatory, no NULLs                                                                             
                        time:   [35.035 us 35.049 us 35.065 us]
                        change: [-12.831% -12.790% -12.750%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe
arrow_array_reader/Int64Array/dictionary encoded, optional, no NULLs                                                                            
                        time:   [52.687 us 52.703 us 52.720 us]
                        change: [-8.9727% -8.9393% -8.9011%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe
arrow_array_reader/Int64Array/dictionary encoded, optional, half NULLs                                                                             
                        time:   [46.217 us 46.235 us 46.255 us]
                        change: [+4.6787% +4.7709% +4.8563%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe

arrow_array_reader/StringArray/plain encoded, mandatory, no NULLs                                                                            
                        time:   [175.32 us 175.44 us 175.56 us]
                        change: [-3.4189% -3.2578% -3.0783%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
arrow_array_reader/StringArray/plain encoded, optional, no NULLs                                                                            
                        time:   [194.47 us 194.59 us 194.74 us]
                        change: [-3.1158% -3.0118% -2.8869%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
arrow_array_reader/StringArray/plain encoded, optional, half NULLs                                                                            
                        time:   [205.21 us 205.27 us 205.34 us]
                        change: [-6.1702% -6.0441% -5.9480%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe
arrow_array_reader/StringArray/dictionary encoded, mandatory, no NULLs                                                                            
                        time:   [133.67 us 133.89 us 134.13 us]
                        change: [-4.3816% -4.2028% -4.0310%] (p = 0.00 < 0.05)
                        Performance has improved.
arrow_array_reader/StringArray/dictionary encoded, optional, no NULLs                                                                            
                        time:   [153.30 us 153.54 us 153.79 us]
                        change: [-1.8329% -1.6595% -1.5029%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
arrow_array_reader/StringArray/dictionary encoded, optional, half NULLs                                                                            
                        time:   [183.34 us 183.49 us 183.67 us]
                        change: [-6.1044% -6.0061% -5.8991%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

arrow_array_reader/StringDictionary/dictionary encoded, mandatory, no NULLs                                                                             
                        time:   [26.463 us 26.471 us 26.480 us]
                        change: [-0.4894% -0.4345% -0.3755%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
arrow_array_reader/StringDictionary/dictionary encoded, optional, no NULLs                                                                             
                        time:   [43.910 us 43.924 us 43.941 us]
                        change: [-0.0280% +0.0426% +0.1125%] (p = 0.24 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
arrow_array_reader/StringDictionary/dictionary encoded, optional, half NULLs                                                                             
                        time:   [49.736 us 49.766 us 49.803 us]
                        change: [+3.9807% +4.0557% +4.1406%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe

For completeness, performance vs master with -C target-cpu=native. So only ~5% but not a regression which is the important thing.

arrow_array_reader/Int32Array/plain encoded, mandatory, no NULLs                                                                             
                        time:   [4.7553 us 4.7598 us 4.7661 us]
                        change: [-0.0821% +0.7986% +1.5839%] (p = 0.06 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe
arrow_array_reader/Int32Array/plain encoded, optional, no NULLs                                                                             
                        time:   [21.556 us 21.620 us 21.690 us]
                        change: [-0.9754% -0.5451% -0.0853%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe
arrow_array_reader/Int32Array/plain encoded, optional, half NULLs                                                                             
                        time:   [28.930 us 28.942 us 28.957 us]
                        change: [-12.993% -12.935% -12.879%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe
arrow_array_reader/Int32Array/binary packed, mandatory, no NULLs                                                                             
                        time:   [30.420 us 30.429 us 30.440 us]
                        change: [+2.5445% +2.7910% +2.9347%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  5 (5.00%) high mild
  10 (10.00%) high severe
arrow_array_reader/Int32Array/binary packed, optional, no NULLs                                                                             
                        time:   [47.153 us 47.164 us 47.175 us]
                        change: [+1.0496% +1.4538% +1.8614%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  6 (6.00%) high severe
arrow_array_reader/Int32Array/binary packed, optional, half NULLs                                                                             
                        time:   [42.822 us 42.856 us 42.897 us]
                        change: [-7.1747% -6.9019% -6.5972%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe
arrow_array_reader/Int32Array/dictionary encoded, mandatory, no NULLs                                                                             
                        time:   [30.892 us 30.899 us 30.905 us]
                        change: [-9.6399% -9.4767% -9.2311%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe
arrow_array_reader/Int32Array/dictionary encoded, optional, no NULLs                                                                             
                        time:   [47.705 us 47.719 us 47.733 us]
                        change: [-6.2662% -6.0597% -5.9385%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
arrow_array_reader/Int32Array/dictionary encoded, optional, half NULLs                                                                             
                        time:   [43.416 us 43.432 us 43.449 us]
                        change: [-9.5642% -9.3380% -9.2034%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

arrow_array_reader/Int64Array/plain encoded, mandatory, no NULLs                                                                             
                        time:   [7.9014 us 7.9765 us 8.0715 us]
                        change: [-2.1869% -1.2440% -0.1997%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 17 outliers among 100 measurements (17.00%)
  17 (17.00%) high severe
arrow_array_reader/Int64Array/plain encoded, optional, no NULLs                                                                             
                        time:   [25.072 us 25.171 us 25.303 us]
                        change: [-3.4804% -2.0406% -0.7844%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) high mild
  11 (11.00%) high severe
arrow_array_reader/Int64Array/plain encoded, optional, half NULLs                                                                             
                        time:   [30.868 us 30.886 us 30.907 us]
                        change: [-10.822% -10.526% -10.232%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe
arrow_array_reader/Int64Array/binary packed, mandatory, no NULLs                                                                            
                        time:   [62.747 us 62.765 us 62.787 us]
                        change: [+0.1408% +0.4562% +0.7639%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
arrow_array_reader/Int64Array/binary packed, optional, no NULLs                                                                            
                        time:   [79.580 us 79.602 us 79.628 us]
                        change: [+0.4857% +0.6106% +0.8242%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low mild
  9 (9.00%) high mild
  4 (4.00%) high severe
arrow_array_reader/Int64Array/binary packed, optional, half NULLs                                                                            
                        time:   [57.406 us 57.415 us 57.424 us]
                        change: [-8.3594% -8.0473% -7.7786%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
arrow_array_reader/Int64Array/dictionary encoded, mandatory, no NULLs                                                                             
                        time:   [32.319 us 32.323 us 32.328 us]
                        change: [-9.0431% -9.0036% -8.9690%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high mild
arrow_array_reader/Int64Array/dictionary encoded, optional, no NULLs                                                                             
                        time:   [49.277 us 49.289 us 49.305 us]
                        change: [-5.9239% -5.6357% -5.3767%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  4 (4.00%) high severe
arrow_array_reader/Int64Array/dictionary encoded, optional, half NULLs                                                                             
                        time:   [44.122 us 44.141 us 44.163 us]
                        change: [-10.843% -10.591% -10.341%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe

arrow_array_reader/StringArray/plain encoded, mandatory, no NULLs                                                                            
                        time:   [176.52 us 176.59 us 176.65 us]
                        change: [-7.9276% -7.8375% -7.7034%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe
arrow_array_reader/StringArray/plain encoded, optional, no NULLs                                                                            
                        time:   [194.73 us 194.83 us 194.94 us]
                        change: [-8.9202% -8.6621% -8.4009%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  12 (12.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe
arrow_array_reader/StringArray/plain encoded, optional, half NULLs                                                                            
                        time:   [218.35 us 218.44 us 218.55 us]
                        change: [-3.4216% -3.1171% -2.7915%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  4 (4.00%) high severe
arrow_array_reader/StringArray/dictionary encoded, mandatory, no NULLs                                                                            
                        time:   [135.36 us 135.61 us 135.90 us]
                        change: [-2.4626% -1.9884% -1.5822%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
arrow_array_reader/StringArray/dictionary encoded, optional, no NULLs                                                                            
                        time:   [153.32 us 153.48 us 153.67 us]
                        change: [-1.2992% -1.0836% -0.7954%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe
arrow_array_reader/StringArray/dictionary encoded, optional, half NULLs                                                                            
                        time:   [197.32 us 197.47 us 197.64 us]
                        change: [+1.4295% +1.6590% +1.8209%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

arrow_array_reader/StringDictionary/dictionary encoded, mandatory, no NULLs                                                                             
                        time:   [25.246 us 25.256 us 25.265 us]
                        change: [-3.9833% -3.8835% -3.7605%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
arrow_array_reader/StringDictionary/dictionary encoded, optional, no NULLs                                                                             
                        time:   [42.021 us 42.028 us 42.037 us]
                        change: [-2.7292% -2.4821% -2.3167%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe
arrow_array_reader/StringDictionary/dictionary encoded, optional, half NULLs                                                                             
                        time:   [48.923 us 48.940 us 48.959 us]
                        change: [+4.7617% +5.0899% +5.3886%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe

tustvold · 2022-08-02T17:06:28Z

Running the benchmarks with -C target-cpu=native and the block size of DeltaBitPackEncoder bumped up to 1024 from 128 and we get a 50% performance improvement for 64-bit integers

arrow_array_reader/Int32Array/binary packed, mandatory, no NULLs                                                                             
                        time:   [23.948 us 23.964 us 23.981 us]
                        change: [-5.5829% -5.2823% -4.9443%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  8 (8.00%) high mild
  4 (4.00%) high severe
arrow_array_reader/Int32Array/binary packed, optional, no NULLs                                                                             
                        time:   [40.763 us 40.771 us 40.781 us]
                        change: [-3.8441% -3.5619% -3.2927%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe
arrow_array_reader/Int32Array/binary packed, optional, half NULLs                                                                             
                        time:   [48.342 us 48.366 us 48.391 us]
                        change: [+9.3416% +9.6807% +10.033%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe

arrow_array_reader/Int64Array/binary packed, mandatory, no NULLs                                                                             
                        time:   [23.154 us 23.162 us 23.172 us]
                        change: [-56.875% -56.752% -56.626%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) high mild
  5 (5.00%) high severe
arrow_array_reader/Int64Array/binary packed, optional, no NULLs                                                                             
                        time:   [39.882 us 39.890 us 39.898 us]
                        change: [-43.358% -43.288% -43.164%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  9 (9.00%) high mild
  5 (5.00%) high severe
arrow_array_reader/Int64Array/binary packed, optional, half NULLs                                                                             
                        time:   [48.638 us 48.670 us 48.704 us]
                        change: [-16.193% -15.929% -15.647%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

I wonder if we should just bump the default block size to 256 for 64-bit integers 🤔

viirya

cc @sunchao This new bit pack utils look useful to us.

tustvold · 2022-08-02T17:32:22Z

parquet/src/util/bit_pack.rs

+                )
+            };
+
+            unroll!(for i in 0..$bits {


LLVM was very reticent to unroll this loop, this is not all that surprising given it isn't immediately obvious how much it collapses down. We therefore give it a helping hand 😄

tustvold · 2022-08-02T17:33:28Z

parquet/src/util/bit_pack.rs

+
+//! Vectorised bit-packing utilities
+
+macro_rules! unroll_impl {


This is kind of gross, but macros can't actually do maths, so we "emulate" it by building an expression tree of "1 + 1 + ...". This will all get compiled away

Can we get some comments in this macro so that future readers can have a hope of understanding what it does without hours of study ;)

Starting with inlining the comments from the PR is probably a good start :)

viirya · 2022-08-02T17:37:37Z

parquet/src/util/bit_pack.rs

+        unroll_impl!(1, $offset + 1, $v, $c);
+    }};
+    (4, $offset:expr, $v:ident, $c:block) => {{
+        unroll_impl!(2, 0, __v4, { unroll_impl!(2, __v4 * 2 + $offset, $v, $c) });


Hmm, where does __v4 comes? Seems not macro parameter?

It's an identifier created by this macro instantiation, and then passed into the child. This was the only way I could work out to do arithmetic in a macro, it's kind of wild 😅

For reference this is what it expands to

unroll!(for i in 0..4 { vec.push(i); });

#[allow(non_upper_case_globals)] { { { { const __v4: usize = 0; { { { const i: usize = (__v4 * 2 + 0); { vec.push(i); } } { const i: usize = ((__v4 * 2 + 0) + 1); { vec.push(i); } } } } } { const __v4: usize = (0 + 1); { { { const i: usize = (__v4 * 2 + 0); { vec.push(i); } } { const i: usize = ((__v4 * 2 + 0) + 1); { vec.push(i); } } } } } } } }

Probably worth adding some comments for future reference :)

alamb

I think this looks great. Really nice work @tustvold I reviewed the changes and test coverage and 👍

I especially like that this deletes code and makes it faster:

I think we should wait a few days before merging this to let @sunchao and/or @nevi-me review, if they would like

alamb · 2022-08-02T17:48:45Z

parquet/src/util/bit_pack.rs

+
+//! Vectorised bit-packing utilities
+
+macro_rules! unroll_impl {


Can we get some comments in this macro so that future readers can have a hope of understanding what it does without hours of study ;)

Starting with inlining the comments from the PR is probably a good start :)

alamb · 2022-08-02T17:49:21Z

parquet/src/util/bit_pack.rs

+    use super::*;
+    use rand::{thread_rng, Rng};
+
+    #[inline(never)]


Why this annotation?

Oops hold over from some godbolt shenanigans, will remove

alamb · 2022-08-02T17:53:08Z

parquet/src/util/bit_pack.rs

+///
+/// - force unrolling of a loop so that LLVM optimises it properly
+/// - call a const generic function with the loop value
+macro_rules! unroll {


An example might help to document this -- btw the use of matching patterns for unrolling i in 0..end is quite neat

alamb · 2022-08-02T17:54:16Z

parquet/src/util/bit_pack.rs

+                    let val = r(end_byte);
+                    let b = val << (NUM_BITS - end_bit_offset);
+
+                    output[i] = a | (b & mask);


I wonder if we could somehow remove the output[i] bounds check too? Perhaps with an iterator or append or something 🤔

As output is a fixed length slice, the bounds check is automatically elided

ah good call, I was confusing that with input: &[u8] 🤦

alamb · 2022-08-02T17:56:47Z

parquet/src/util/bit_util.rs

+            _ => unreachable!(),
+        }
+
+        // Try to read smaller batches if possible


Suggested change

// Try to read smaller batches if possible

// Try to read remainder, if any, in batches if possible

alamb · 2022-08-02T17:58:29Z

parquet/src/util/bit_util.rs

@@ -476,17 +477,6 @@ impl BitReader {

        let mut i = 0;



it isn't clear to me what the num_bits input parameter means (is it the number of bits to read after batch.len() whole T values?) -- can you possibly update the comments too?

I originally renamed this to bit_width, but we seem to use num_bits as the name for this concept in a lot of places. I've updated the doc comment

alamb · 2022-08-02T17:59:36Z

parquet/src/util/bit_util.rs

                match i {
                    0..=8 => test_get_batch_helper::<u8>(*s, i),
                    9..=16 => test_get_batch_helper::<u16>(*s, i),
-                    _ => test_get_batch_helper::<u32>(*s, i),
+                    17..=32 => test_get_batch_helper::<u32>(*s, i),


viirya

It looks great to reduce the code a lot. Thanks for the work. It is nicer if we can add some comments as @alamb suggested, as at first look, the macros look complicated.

sunchao · 2022-08-02T18:40:37Z

Thanks, this looks very nice. I'll take a look too today. BTW I think it's possible to apply SIMD on this code path to further improve the performance. Arrow C++ already did it, see https://issues.apache.org/jira/browse/ARROW-9702 and related.

tustvold · 2022-08-02T18:51:17Z

BTW I think it's possible to apply SIMD on this code path to further improve the performance

The motivation for the manual unrolling of the loops is so that LLVM does this for us. If you check the godbolt output linked in the PR you'll see it is using AVX2 instructions. This fits with what we've seen in general with the arrow compute kernels, where only AVX512 seems to need manual implementation.

viirya · 2022-08-02T20:08:51Z

parquet/src/util/bit_pack.rs

+    (16, $offset:expr, $v:ident, $c:block) => {{
+        unroll_impl!(4, 0, __v16, {
+            unroll_impl!(4, __v16 * 4 + $offset, $v, $c)
+        });
+    }};


Can we use unroll_impl!(8, ...) here?

sunchao · 2022-08-02T17:22:28Z

parquet/src/util/bit_pack.rs

+/// Macro that generates an unpack function taking the number of bits as a const generic
+macro_rules! unpack_impl {
+    ($t:ty, $bytes:literal, $bits:tt) => {
+        pub fn unpack<const NUM_BITS: usize>(input: &[u8], output: &mut [$t; $bits]) {


👍 when the original version was written const generic is not stabilized yet.

sunchao · 2022-08-02T17:46:55Z

parquet/src/util/bit_pack.rs

+        unroll_impl!(1, $offset + 1, $v, $c);
+    }};
+    (4, $offset:expr, $v:ident, $c:block) => {{
+        unroll_impl!(2, 0, __v4, { unroll_impl!(2, __v4 * 2 + $offset, $v, $c) });


Probably worth adding some comments for future reference :)

sunchao · 2022-08-02T20:37:36Z

parquet/src/util/bit_pack.rs

+        /// Unpack packed `input` into `output` with a bit width of `num_bits`
+        pub fn $name(input: &[u8], output: &mut [$t; $bits], num_bits: usize) {
+            // This will get optimised into a jump table
+            unroll!(for i in 0..$bits {


I feel it might be worth checkouting https://docs.rs/seq-macro/latest/seq_macro/, the generated output is easier to understand.

For example the output of unpack8 becomes:

pub fn unpack8(input: &[u8], output: &mut [u8; 8], num_bits: usize) { if 0 == num_bits { return unpack8::unpack::<0>(input, output); } if 1 == num_bits { return unpack8::unpack::<1>(input, output); } if 2 == num_bits { return unpack8::unpack::<2>(input, output); } if 3 == num_bits { return unpack8::unpack::<3>(input, output); } if 4 == num_bits { return unpack8::unpack::<4>(input, output); } if 5 == num_bits { return unpack8::unpack::<5>(input, output); } if 6 == num_bits { return unpack8::unpack::<6>(input, output); } if 7 == num_bits { return unpack8::unpack::<7>(input, output); }; if num_bits == 8 { return unpack8::unpack::<8>(input, output); } ::core::panicking::panic_fmt(::core::fmt::Arguments::new_v1(&["internal error: entered unreachable code: "], &[::core::fmt::ArgumentV1::new_display(&::core::fmt::Arguments::new_v1(&["invalid num_bits "], &[::core::fmt::ArgumentV1::new_display(&num_bits)]))])); }

instead of

/// Unpack packed `input` into `output` with a bit width of `num_bits` pub fn unpack8(input: &[u8], output: &mut [u8; 8], num_bits: usize) { #[allow(non_upper_case_globals)] { { { { const __v8: usize = 0; { { { { const __v4: usize = 0; { { { const i: usize = __v4 * 2 + (__v8 * 4 + 0); { if i == num_bits { return unpack8::unpack::(input, output); } } }; { const i: usize = __v4 * 2 + (__v8 * 4 + 0) + 1; { if i == num_bits { return unpack8::unpack::(input, output); } } }; } } }; { const __v4: usize = 0 + 1; { { { const i: usize = __v4 * 2 + (__v8 * 4 + 0); { if i == num_bits { return unpack8::unpack::(input, output); } } }; { const i: usize = __v4 * 2 + (__v8 * 4 + 0) + 1; { if i == num_bits { return unpack8::unpack::(input, output); } } }; } } }; }; } } }; { const __v8: usize = 0 + 1; { { { { const __v4: usize = 0; { { { const i: usize = __v4 * 2 + (__v8 * 4 + 0); { if i == num_bits { return unpack8::unpack::(input, output); } } }; { const i: usize = __v4 * 2 + (__v8 * 4 + 0) + 1; { if i == num_bits { return unpack8::unpack::(input, output); } } }; } } }; { const __v4: usize = 0 + 1; { { { const i: usize = __v4 * 2 + (__v8 * 4 + 0); { if i == num_bits { return unpack8::unpack::(input, output); } } }; { const i: usize = __v4 * 2 + (__v8 * 4 + 0) + 1; { if i == num_bits { return unpack8::unpack::(input, output); } } }; } } }; }; } } }; }; } }; if num_bits == 8 { return unpack8::unpack::<8>(input, output); } ::core::panicking::panic_fmt(::core::fmt::Arguments::new_v1(&["internal error: entered unreachable code: "], &[::core::fmt::ArgumentV1::new_display(&::core::fmt::Arguments::new_v1(&["invalid num_bits "], &[::core::fmt::ArgumentV1::new_display(&num_bits)]))])); }

and you may no longer need the hack in unroll_impl.

I wanted to avoid introducing a new dependency, as that has historically been controversial

jhorstmann · 2022-08-03T08:01:23Z

I think in this case using a dependency could be preferrable and lead to even faster code. A macro relying on implementation details of how identifiers are generated is indeed a bit ugly. I know that parquet2 is using the bitpacking crate, which uses optimized vectorized implementations without relying too much on llvm.

tustvold · 2022-08-03T08:14:38Z

I looked at the bitpacking crate but it only supports the equivalent of unpack32 which it achieves in much the same way relying on LLVM to optimise it. I'll switch this to using seq_macro as unroll appears to be the major sticking point

Dandandan

🎉

alamb

I say 🚢 🇮🇹 and I will keep my 🤐 about adding new dependencies -- I think seq is ok given what I see of the https://crates.io/crates/seq-macro

sunchao

LGTM too. seq-macro doesn't carry any transitive dependency and it seems well maintained, so I think it's fine to add it.

tustvold · 2022-08-03T20:46:00Z

Thank you all for reviewing and for the feedback 😃

ursabot · 2022-08-03T20:52:22Z

Benchmark runs are scheduled for baseline = 2cf4cd8 and contender = 8a092e3. 8a092e3 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

tustvold · 2022-08-04T09:09:23Z

#2319 bumps the defaults to make the most of this change, PTAL 😄

@tustvold

This PR ports apache/arrow-rs#2278 to parquet2. Credit to the design and implementation of the unpacking path go to @tustvold - it is 5-10% faster than the bitpacking crate 🚀 Additionally, it adds the corresponding packing code path, thereby completely replacing the dependency on bitpacking. It also adds some traits that allows code to be written via generics. A curious observation is that, with this PR, parquet2 no longer executes unsafe code (bitpacking had some) 🎉 Backward changes: renamed parquet2::encoding::bitpacking to parquet2::encoding::bitpacked parquet2::encoding::bitpacked::Decoder now has a generic parameter (output type) parquet2::encoding::bitpacked::Decoder::new's second parameter is now a usize

@tustvold

This PR ports apache/arrow-rs#2278 to parquet2. Credit to the design and implementation of the unpacking path go to @tustvold - it is 5-10% faster than the bitpacking crate 🚀 Additionally, it adds the corresponding packing code path, thereby completely replacing the dependency on bitpacking. It also adds some traits that allows code to be written via generics. A curious observation is that, with this PR, parquet2 no longer executes unsafe code (bitpacking had some) 🎉 Backward changes: renamed parquet2::encoding::bitpacking to parquet2::encoding::bitpacked parquet2::encoding::bitpacked::Decoder now has a generic parameter (output type) parquet2::encoding::bitpacked::Decoder::new's second parameter is now a usize

tustvold added 2 commits August 2, 2022 09:47

Add unpack8, unpack16, unpack64 (apache#2276)

fd82128

Add zero-extend fallback

612cc1d

github-actions bot added the parquet Changes to the parquet crate label Aug 2, 2022

Fix copy-elision

f5edaad

tustvold marked this pull request as ready for review August 2, 2022 16:45

tustvold changed the title ~~Add unpack8, unpack16, unpack64 (#2276)~~ Add unpack8, unpack16, unpack64 (#2276) ~10% faster Aug 2, 2022

viirya reviewed Aug 2, 2022

View reviewed changes

tustvold commented Aug 2, 2022

View reviewed changes

tustvold changed the title ~~Add unpack8, unpack16, unpack64 (#2276) ~10% faster~~ Add unpack8, unpack16, unpack64 (#2276) ~10%-50% faster Aug 2, 2022

tustvold changed the title ~~Add unpack8, unpack16, unpack64 (#2276) ~10%-50% faster~~ Add unpack8, unpack16, unpack64 (#2276) ~10-50% faster Aug 2, 2022

viirya reviewed Aug 2, 2022

View reviewed changes

alamb approved these changes Aug 2, 2022

View reviewed changes

viirya approved these changes Aug 2, 2022

View reviewed changes

viirya reviewed Aug 2, 2022

View reviewed changes

sunchao reviewed Aug 2, 2022

View reviewed changes

Switch to using seq_macro

5ee4a48

tustvold force-pushed the more-unpack branch from 78aaaf7 to 5ee4a48 Compare August 3, 2022 09:28

tustvold added 2 commits August 3, 2022 10:29

Remove unused function

2f43617

Update docs

8c39265

Dandandan approved these changes Aug 3, 2022

View reviewed changes

alamb approved these changes Aug 3, 2022

View reviewed changes

sunchao approved these changes Aug 3, 2022

View reviewed changes

tustvold merged commit 8a092e3 into apache:master Aug 3, 2022

tustvold mentioned this pull request Aug 4, 2022

Increase DeltaBitPackEncoder miniblock size to 64 for 64-bit integers (#2282) #2319

Merged

jorgecarleitao mentioned this pull request Aug 12, 2022

Improved bitpacking jorgecarleitao/parquet2#176

Merged


		//! Vectorised bit-packing utilities

		macro_rules! unroll_impl {

	// Try to read smaller batches if possible
	// Try to read remainder, if any, in batches if possible

Add unpack8, unpack16, unpack64 (#2276) ~10-50% faster #2278

Add unpack8, unpack16, unpack64 (#2276) ~10-50% faster #2278

Conversation

tustvold commented Aug 2, 2022 • edited Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

tustvold commented Aug 2, 2022

tustvold commented Aug 2, 2022

viirya left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

viirya left a comment

Choose a reason for hiding this comment

sunchao commented Aug 2, 2022

tustvold commented Aug 2, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jhorstmann commented Aug 3, 2022

tustvold commented Aug 3, 2022 • edited Loading

Dandandan left a comment

Choose a reason for hiding this comment

alamb left a comment

Choose a reason for hiding this comment

sunchao left a comment

Choose a reason for hiding this comment

tustvold commented Aug 3, 2022

ursabot commented Aug 3, 2022

tustvold commented Aug 4, 2022

tustvold commented Aug 2, 2022 •

edited

Loading

viirya left a comment •

edited

Loading

tustvold commented Aug 3, 2022 •

edited

Loading