ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode #1375

wpleonardo · 2023-01-11T06:49:46Z

What changes were proposed in this pull request?

The original ORC Rle-bit-packing decoder decodes values one by one. Intel AVX-512 provides 512-bit vector operations, so the decoder can unpack more data with far fewer CPU instructions, and the AVX-512 vector decode performs much better than before. In the functional micro-performance test, comparing the vector function vectorUnpackX with the original Rle-bit-packing decode function plainUnpackLongs, I expect AVX-512 vector decode to bring an average 6X ~ 7X latency improvement. In the real world, users store large amounts of data in the ORC format and need to decode hundreds or thousands of bytes at a time, so AVX-512 vector decode will make this processing more efficient.
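
To make the one-by-one baseline concrete, here is a minimal C++ sketch of scalar bit-unpacking. It is not ORC's plainUnpackLongs: the function name unpackScalar, its signature, and the MSB-first packing order are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar baseline (hypothetical): extract `count` values of `bitWidth` bits
// each (1..32), packed MSB-first with no padding between values.
std::vector<uint32_t> unpackScalar(const uint8_t* in, std::size_t count,
                                   unsigned bitWidth) {
  std::vector<uint32_t> out;
  out.reserve(count);
  std::size_t bitPos = 0;  // absolute bit offset into the input
  for (std::size_t i = 0; i < count; ++i) {
    uint32_t value = 0;
    for (unsigned b = 0; b < bitWidth; ++b, ++bitPos) {
      // Pull one bit at a time; bit 7 of each byte comes first.
      uint32_t bit = (in[bitPos / 8] >> (7 - bitPos % 8)) & 1u;
      value = (value << 1) | bit;
    }
    out.push_back(value);
  }
  return out;
}
```

Every value costs a loop over its bits here, which is the kind of per-value work a 512-bit vector path aims to spread across many values at once.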

In the real world, data values in ORC are usually no wider than 32 bits, so this PR supplies the vector decode code for value widths of up to 32 bits. For 8-bit, 16-bit, or other multiple-of-8 widths, the performance improvement is relatively small compared with widths that are not a multiple of 8.

Intel AVX512 instructions official link:
https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html

1. Added a cmake option named "BUILD_ENABLE_AVX512" to enable or disable this feature at build time.
   The default value of BUILD_ENABLE_AVX512 is OFF.
   For example, cmake .. -DCMAKE_BUILD_TYPE=release -DBUILD_ENABLE_AVX512=ON
   This builds the ORC library with AVX512 Bit-unpacking enabled.
2. Added the macro "ORC_HAVE_RUNTIME_AVX512" to control whether this feature's code is built into ORC.
3. Added the file "CpuInfoUtil.cc" to dynamically detect whether the current platform supports AVX-512. When ORC is built with AVX-512 enabled but the platform it runs on doesn't support AVX-512, it uses the original bit-packing decode function instead of the AVX-512 vector decode.
4. Added the functions "vectorUnpackX" to support X-bit value decode instead of the original function plainUnpackLongs or vectorUnpackX.
5. Added the testcases "RleV2BitUnpackAvx512Test" to verify N-bit value AVX-512 vector decode in the new testcase file TestRleVectorDecoder.cc.
6. Modified the function plainUnpackLongs by adding an output parameter uint64_t& startBit, which stores the number of bits left over after unpacking.
7. AVX-512 vector decode processes 512 bits of data per unpacking step. So if the data being unpacked is long enough, almost all of it can be processed by AVX-512. But if the data length (or block size) is too short, less than 512 bits, AVX-512 is not used and decoding falls back to the original one-by-one path.
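
The split described above between whole 512-bit blocks and a short tail can be sketched as follows. The names UnpackPlan and planUnpack are hypothetical, and the real decoder may batch values differently; this only illustrates the arithmetic.

```cpp
#include <cstddef>

// Hypothetical plan for one unpacking call.
struct UnpackPlan {
  std::size_t vectorValues;  // values fully contained in whole 512-bit blocks
  std::size_t scalarValues;  // remaining values, decoded one by one
  unsigned tailStartBit;     // bit offset (0..7) in the byte where the tail begins
};

UnpackPlan planUnpack(std::size_t numValues, unsigned bitWidth) {
  const std::size_t kBlockBits = 512;
  if (bitWidth == 0) return {0, numValues, 0};
  const std::size_t totalBits = numValues * bitWidth;
  // Fewer than 512 bits means zero full blocks, so everything stays scalar.
  const std::size_t fullBlocks = totalBits / kBlockBits;
  // Only values that end inside the full blocks go to the vector path.
  const std::size_t vectorValues = (fullBlocks * kBlockBits) / bitWidth;
  const unsigned tailStartBit =
      static_cast<unsigned>((vectorValues * bitWidth) % 8);
  return {vectorValues, numValues - vectorValues, tailStartBit};
}
```

For 1000 values of 3 bits this gives 853 vector values and 147 scalar values, with the tail starting 7 bits into a byte. A tail that starts mid-byte is one plausible reason a scalar routine would need a bit-offset parameter such as the startBit mentioned above, though the PR text does not spell that out.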

Add new files:

New Files	File Purpose
CpuInfoUtil.hh .cc	Dynamically detects whether the current platform supports AVX-512. If yes, the AVX-512 vector decode is used; if not, the original decode functions are used.
BitUnpackerAvx512.hh	Contains the new macros, arrays, and unions that the AVX-512 vector decode needs.
BpackingAvx512.hh .cc	Contains the AVX512 Bit-unpacking functions for 1~32 bit data.
BpackingDefault.hh .cc	Contains the default Bit-unpacking functions.
Dispatch.hh	Contains the dynamic dispatch according to the available DispatchLevel.
TestRleVectorDecoder.cc	New testcases for unit and functional testing of this feature.
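
How runtime detection and dispatch fit together can be sketched roughly as below. This is not the code in CpuInfoUtil.cc or Dispatch.hh: cpuHasAvx512, selectUnpack8, and both unpack8 functions are hypothetical names, the vector function is a stand-in without intrinsics, and __builtin_cpu_supports is a GCC/Clang builtin assumed to be available on x86.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical CPU probe. On non-x86 targets or other compilers it
// conservatively reports no AVX-512 support.
bool cpuHasAvx512() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
  return __builtin_cpu_supports("avx512f") && __builtin_cpu_supports("avx512bw");
#else
  return false;
#endif
}

using UnpackFn = void (*)(const uint8_t* in, uint32_t* out, std::size_t count);

// Default path: widen 8-bit values one at a time.
void unpack8Default(const uint8_t* in, uint32_t* out, std::size_t count) {
  for (std::size_t i = 0; i < count; ++i) out[i] = in[i];
}

// Stand-in for the vector path. A real version would use AVX-512 intrinsics;
// this placeholder only keeps the sketch portable and produces the same output.
void unpack8Vector(const uint8_t* in, uint32_t* out, std::size_t count) {
  for (std::size_t i = 0; i < count; ++i) out[i] = in[i];
}

// Pick the implementation once; the vector path needs both build-time and
// run-time support, otherwise fall back to the default path.
UnpackFn selectUnpack8(bool builtWithAvx512, bool cpuSupportsAvx512) {
  return (builtWithAvx512 && cpuSupportsAvx512) ? unpack8Vector : unpack8Default;
}
```

A caller would typically evaluate selectUnpack8(builtFlag, cpuHasAvx512()) once and reuse the returned pointer, so the CPU probe is not repeated per decode call.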

Why are the changes needed?

This improves the performance of Rle-bit-packing decode. In the functional micro-performance test, comparing the vector function vectorUnpackX with the original Rle-bit-packing decode function plainUnpackLongs, I expect AVX-512 vector decode to bring an average 6X ~ 7X latency improvement.
Intel improves CPU performance every year, and users analyze ORC data on newer platforms. The Intel SKX platform already supported AVX512 instructions 6 years ago. So we should upgrade ORC data unpacking to use this widely available CPU feature, which keeps ORC up to date.

How to enable AVX512 Bit-unpacking?

1. Enable the cmake option BUILD_ENABLE_AVX512 to build the ORC library with AVX512 enabled.
   cmake .. -DCMAKE_BUILD_TYPE=release -DBUILD_ENABLE_AVX512=ON
2. Set the environment variable when using the ORC library:
   export ORC_USER_SIMD_LEVEL=AVX512
   (Note: This parameter has only 2 values, "AVX512" and "none"; the value is case-insensitive.)
   If ORC_USER_SIMD_LEVEL=none is set, AVX512 Bit-unpacking is disabled.
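
Reading such an environment variable case-insensitively might look like the sketch below. The variable name ORC_USER_SIMD_LEVEL and its two values come from the description above; SimdLevel, parseSimdLevel, and userSimdLevel are hypothetical names, and treating an unset or unrecognized value as "no override" is an assumption rather than documented ORC behavior.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <string>

enum class SimdLevel { Unset, None, Avx512 };

// Parse a raw value (may be null) without regard to letter case.
SimdLevel parseSimdLevel(const char* raw) {
  if (raw == nullptr) return SimdLevel::Unset;
  std::string value(raw);
  std::transform(value.begin(), value.end(), value.begin(), [](unsigned char c) {
    return static_cast<char>(std::toupper(c));
  });
  if (value == "AVX512") return SimdLevel::Avx512;
  if (value == "NONE") return SimdLevel::None;
  return SimdLevel::Unset;  // assumption: unknown values are ignored
}

// Read the user's choice from the process environment.
SimdLevel userSimdLevel() {
  return parseSimdLevel(std::getenv("ORC_USER_SIMD_LEVEL"));
}
```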

How was this patch tested?

I created a new testcase file, TestRleVectorDecoder.cc. It contains the testcases below; to run them, turn on the cmake option -DBUILD_ENABLE_AVX512=ON and run on a platform that supports AVX-512. Every testcase contains 2 scenarios:

1. The blockSize increases from 1 to 10000, and the data length is 10240;
2. The blockSize increases from 1000 to 10000, and the data length increases from 1000 to 70000.
The testcases take a while to run, so I added a progress bar for each one.
Here is a sample of the progress bar output for one testcase:
[ RUN ] OrcTest/RleVectorTest.RleV2_basic_vector_decode_10bit/1
10bit Test 1st Part:[OK][#################################################################################][100%]
10bit Test 2nd Part:[OK][#################################################################################][100%]
For the main vector function vectorUnpackX, the test code coverage is up to 100%.
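
The shape of one such scenario, decoding the same packed data in chunks of a given blockSize and comparing against the input, can be sketched as below. This is not the code in TestRleVectorDecoder.cc: packBits, unpackChunk, and roundTripOk are hypothetical helpers, the decoder is a scalar stand-in, and the MSB-first layout is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack values MSB-first using bitWidth bits each (1..32).
std::vector<uint8_t> packBits(const std::vector<uint32_t>& values, unsigned bitWidth) {
  std::vector<uint8_t> out((values.size() * bitWidth + 7) / 8, 0);
  std::size_t bitPos = 0;
  for (uint32_t v : values) {
    for (unsigned b = bitWidth; b-- > 0; ++bitPos) {
      if ((v >> b) & 1u) out[bitPos / 8] |= static_cast<uint8_t>(0x80u >> (bitPos % 8));
    }
  }
  return out;
}

// Decode `count` values starting at bit offset `bitPos`, advancing the offset.
void unpackChunk(const std::vector<uint8_t>& in, std::size_t& bitPos, std::size_t count,
                 unsigned bitWidth, std::vector<uint32_t>& out) {
  for (std::size_t i = 0; i < count; ++i) {
    uint32_t value = 0;
    for (unsigned b = 0; b < bitWidth; ++b, ++bitPos) {
      value = (value << 1) | ((in[bitPos / 8] >> (7 - bitPos % 8)) & 1u);
    }
    out.push_back(value);
  }
}

// One scenario: decode dataLength values in chunks of blockSize and compare.
bool roundTripOk(unsigned bitWidth, std::size_t dataLength, std::size_t blockSize) {
  if (blockSize == 0 || bitWidth == 0 || bitWidth > 32) return false;
  const uint32_t mask = bitWidth == 32 ? 0xFFFFFFFFu : ((1u << bitWidth) - 1u);
  std::vector<uint32_t> values(dataLength);
  uint32_t state = 12345u;  // small deterministic generator for test data
  for (uint32_t& v : values) {
    state = state * 1664525u + 1013904223u;
    v = (state >> 7) & mask;
  }
  const std::vector<uint8_t> packed = packBits(values, bitWidth);
  std::vector<uint32_t> decoded;
  std::size_t bitPos = 0;
  for (std::size_t done = 0; done < dataLength;) {
    const std::size_t n = std::min(blockSize, dataLength - done);
    unpackChunk(packed, bitPos, n, bitWidth, decoded);
    done += n;
  }
  return decoded == values;
}
```

Sweeping blockSize over a range with a fixed data length, as in the first scenario, then amounts to calling roundTripOk in a loop and checking that every call returns true.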

New Testcases	Test Data Bit Size
RleV2_basic_vector_decode_1bit	1bit
RleV2_basic_vector_decode_2bit	2bit
RleV2_basic_vector_decode_3bit	3bit
RleV2_basic_vector_decode_4bit	4bit
RleV2_basic_vector_decode_5bit	5bit
RleV2_basic_vector_decode_6bit	6bit
RleV2_basic_vector_decode_7bit	7bit
RleV2_basic_vector_decode_9bit	9bit
RleV2_basic_vector_decode_10bit	10bit
RleV2_basic_vector_decode_11bit	11bit
RleV2_basic_vector_decode_12bit	12bit
RleV2_basic_vector_decode_13bit	13bit
RleV2_basic_vector_decode_14bit	14bit
RleV2_basic_vector_decode_15bit	15bit
RleV2_basic_vector_decode_16bit	16bit
RleV2_basic_vector_decode_17bit	17bit
RleV2_basic_vector_decode_18bit	18bit
RleV2_basic_vector_decode_19bit	19bit
RleV2_basic_vector_decode_20bit	20bit
RleV2_basic_vector_decode_21bit	21bit
RleV2_basic_vector_decode_22bit	22bit
RleV2_basic_vector_decode_23bit	23bit
RleV2_basic_vector_decode_24bit	24bit
RleV2_basic_vector_decode_26bit	26bit
RleV2_basic_vector_decode_28bit	28bit
RleV2_basic_vector_decode_30bit	30bit
RleV2_basic_vector_decode_32bit	32bit

dongjoon-hyun

Could you make CI happy, @wpleonardo ?

wgtmac · 2023-01-11T14:58:11Z

Welcome to the Apache ORC community! @wpleonardo

This feature looks promising. Will take a look this week.

cc @stiga-huang @coderex2522

wgtmac

I only did a preliminary review. Left some comments but mostly are cosmetic. Will take a deep look later.

CMakeLists.txt

c++/src/DetectPlatform.hh

CMakeLists.txt

c++/src/DetectPlatform.hh

c++/src/VectorDecoder.hh

c++/src/RleDecoderV2.cc

c++/src/RLEv2.hh

dongjoon-hyun

It seems there is still a format issue.

c++/test/TestRleVectorDecoder.cc:29:1: error: code should be clang-formatted [-Wclang-format-violations]
#include "wrap/orc-proto-wrapper.hh"
^
c++/test/TestRleVectorDecoder.cc:38:51: error: code should be clang-formatted [-Wclang-format-violations]
  const int DEFAULT_MEM_STREAM_SIZE = 1024 * 1024; // 1M
...

2. Dynamically check whether the current compiler and platform support AVX512; 3. Change the default value of the build option BUILD_ENABLE_AVX512 to "ON"; 4. Add the build option for the file TestRleVectorDecoder.cc, and try to fix the clang-format build issue.

dongjoon-hyun

If you don't mind, could you test this locally first?

96 warnings and 20 errors generated.
make[2]: *** [c++/src/CMakeFiles/orc.dir/build.make:461: c++/src/CMakeFiles/orc.dir/RleDecoderV2.cc.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:433: c++/src/CMakeFiles/orc.dir/all] Error 2

2. Change CMakeLists.txt some options

…orDecoder.cc 2. Change the option CXX_COMMON_FLAGS to CMAKE_CXX_FLAGS

wpleonardo · 2023-01-15T12:37:34Z

May I ask a question about the clang-format error in TestRleVectorDecoder.cc?
I have already used clang-format -style=google to format TestRleVectorDecoder.cc, but I still get clang-format errors in CI. Do we use -style=google for clang-format, or another style?
Thank you very much!

wgtmac · 2023-01-15T12:54:54Z

The clang-format we use is defined here: https://github.com/apache/orc/blob/main/.clang-format. You can simply use clang-format -i TestRleVectorDecoder.cc to format it automatically.

dongjoon-hyun

Gentle ping, @wpleonardo .

wpleonardo · 2023-01-28T01:01:56Z

Gentle ping, @wpleonardo .

Sorry, I was on holiday for the past few days. I will get back to work and follow your suggestions in the next few days.
Thank you very much!

CMakeLists.txt

Change the invoking way about bufferstart,bufferend parameters.

wgtmac · 2023-04-21T08:07:33Z

Thanks @stiga-huang and @wpleonardo!

1. Code format change

wpleonardo · 2023-04-21T13:17:36Z

Just fixed an AVX512 flag check issue on the Windows platform.
In the Windows CI test, the test machine's CPU doesn't have AVX512 flags, but the check in the CMake file did not catch this. The reason is that
check_cxx_compiler_flag("/arch:AVX512" COMPILER_SUPPORT_AVX512)
only checks whether the compiler can enable the use of AVX512 instructions (https://learn.microsoft.com/en-us/cpp/build/reference/arch-x64?view=msvc-170), not whether the CPU has AVX512 flags.
So, I changed the checking code to
check_cxx_compiler_flag("-mavx512f -mavx512cd -mavx512vl -mavx512dq -mavx512bw" COMPILER_SUPPORT_AVX512)
This is meant to verify directly whether the current CPU has AVX512 instructions.

2. Fix an AVX512 flags check issue on windows.

Modified cmakefile about the checking of AVX512.

wpleonardo · 2023-04-23T04:47:08Z

In cmake_modules/ConfigSimdLevel.cmake, I changed check_cxx_source_compiles to check_cxx_source_runs to make sure an AVX512 program can actually run on that machine.
https://github.com/wpleonardo/orc/blob/d6fd57d1c81709d6412fd506301aeffde39a3db6/cmake_modules/ConfigSimdLevel.cmake#L57
Please help me rerun the CI tests. Sorry for the repeated reruns.

wpleonardo · 2023-04-23T14:14:35Z

check_cxx_source_runs hangs on the Windows platform when the CPU doesn't have AVX512 flags.
So I changed check_cxx_source_runs back to check_cxx_source_compiles and added "grep avx512f /proc/cpuinfo" to check whether the CPU has AVX512 flags.
https://github.com/wpleonardo/orc/blob/1f2085e68ff4e691fb178080ec0c53e5b37286ea/cmake_modules/ConfigSimdLevel.cmake#L79
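
For readers unfamiliar with the cpuinfo check, the same idea can be expressed in C++ as a rough sketch. This is not the CMake logic from the PR: cpuinfoHasFlag is a hypothetical helper, and it matches whole tokens on "flags" lines, which is slightly stricter than a plain grep for a substring.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Return true if any "flags" line in cpuinfo-style text lists `flag`.
bool cpuinfoHasFlag(std::istream& cpuinfo, const std::string& flag) {
  std::string line;
  while (std::getline(cpuinfo, line)) {
    // Only lines that begin with "flags" carry the CPU feature list.
    if (line.compare(0, 5, "flags") != 0) continue;
    const std::string::size_type colon = line.find(':');
    if (colon == std::string::npos) continue;
    std::istringstream tokens(line.substr(colon + 1));
    std::string token;
    while (tokens >> token) {
      if (token == flag) return true;
    }
  }
  return false;
}
```

On Linux the stream would be a std::ifstream opened on /proc/cpuinfo. That file does not exist on Windows or macOS, so a check of this kind only says something about Linux hosts.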

wpleonardo · 2023-04-24T00:56:56Z

Hi @wgtmac @dongjoon-hyun @coderex2522 , CI test passed, do you have any other comments? Thank you very much!

…x_source_run back CHECK_CXX_SOURCE_COMPILES, and added "grep avx512f /proc/cpuinfo" to check CPU flags.

Because check_cxx_source_run will be hung on windows, change check_cx…

wpleonardo · 2023-04-26T01:05:45Z

Hi @dongjoon-hyun, welcome back from vacation! Do you have any other comments? Thank you very much!

wgtmac

I will merge it by the end of this week if no further comment.

wpleonardo · 2023-05-06T02:56:53Z

Thank you very much for your help, Gang! ^_^
B&R, Wang Peng

dongjoon-hyun

+1, LGTM.

cc @williamhyun

williamhyun

+1 LGTM

wgtmac · 2023-05-07T14:16:47Z

I have submitted this. Thanks all!

taiyang-li · 2023-10-10T08:36:58Z

@wpleonardo Do we have any performance benchmark for this PR? @alexey-milovidov Maybe you are interested in it.

I tried to use this feature in ClickHouse (https://github.com/clickHouse/ClickHouse), but I can't see any performance improvement.

Q: select * from file('/data1/clickhouse_official/data/user_files/test.orc') format Null;

With AVX512:

0 rows in set. Elapsed: 3.659 sec. Processed 1.13 million rows, 486.19 MB (308.68 thousand rows/s., 132.88 MB/s.)
0 rows in set. Elapsed: 3.653 sec. Processed 1.20 million rows, 517.87 MB (329.40 thousand rows/s., 141.76 MB/s.)
0 rows in set. Elapsed: 3.719 sec. Processed 1.13 million rows, 486.19 MB (303.70 thousand rows/s., 130.74 MB/s.)

Without AVX512

0 rows in set. Elapsed: 3.565 sec. Processed 1.13 million rows, 486.19 MB (316.81 thousand rows/s., 136.38 MB/s.)
0 rows in set. Elapsed: 3.540 sec. Processed 1.20 million rows, 517.87 MB (339.91 thousand rows/s., 146.28 MB/s.)
0 rows in set. Elapsed: 3.681 sec. Processed 1.20 million rows, 517.87 MB (326.90 thousand rows/s., 140.69 MB/s.)

About the test orc file:

$ du -sh test.orc                                                     
505M	test.orc


$ orc-metadata ./test.orc                           
{ "name": "./test.orc",
  "type": "struct<reporttime:bigint,appid:bigint,uid:bigint,platform:int,nettype:int,clientversioncode:bigint,sdkversioncode:bigint,statid:string,statversion:int,countrycode:string,language:string,model:string,osversion:string,channel:string,heartcount:int,msgcount:int,giftcount:int,barragecount:int,gid:string,entrytype:int,prefetchedms:int,linkdstate:int,networkavailable:int,starttimestamp:bigint,sessionlogints:int,medialogints:int,sdkboundts:int,msconnectedts:int,vsconnectedts:int,firstiframets:int,ownerstatus:int,stopreason:int,totaltime:int,cpuusageavg:int,memusageavg:int,backgroundtotal:bigint,foregroundtotal:bigint,firstvideopackts:int,firstvoicerecvts:int,firstvoiceplayts:int,firstiframeassemblets:int,uiinitts:int,uiloadedts:int,uiappearedts:int,setvideoviewts:int,blurviewdimissts:int,preparesdkinqueuets:int,preparesdkexects:int,startsdkinqueuets:int,startsdkexects:int,sdkjoinchannelinqueuets:int,sdkjoinchannelexects:int,lastsdkleavechannelinqueuets:int,lastsdkleavechannelexects:int,unused_1:int,unused_2:int,setvideoviewinqueuets:int,setvideoviewexects:int,livetype:int,audiostatus:int,firstiframesize:bigint,firstiframedecodetime:bigint,extras:bigint,entrancetype:int,entrancemode:int,mclientip:bigint,mnc:bigint,mcc:bigint,vsipsuccess:bigint,msipsuccess:bigint,vsipfail:bigint,msipfail:bigint,mediaflag:bigint,dispatchid:string,proxyflag:int,redirectcount:int,directorrescode:int,subentrancetab:string,logininfolist:array<struct<strategy:bigint,ip:bigint,loginStat:bigint,reserve1:bigint,reserve2:bigint>>,playcentertype:int,videomutetype:bigint,owneruid:bigint,extra:string>",
  "rows": 1203317,
  "stripe count": 12,
  "format": "0.12", "writer version": "future - 9",
  "compression": "snappy", "compression block": 65536,
  "file length": 529207118,
  "content": 529182229, "stripe stats": 21150, "footer": 3712, "postscript": 26,
  "row index stride": 10000,
  "user metadata": {
    "org.apache.spark.version": "3.3.2"
  },
  "stripes": [
    { "stripe": 0, "rows": 117760,
      "offset": 3, "length": 50876922,
      "index": 23728, "data": 50851823, "footer": 1371
    },
    { "stripe": 1, "rows": 117760,
      "offset": 50876925, "length": 50948680,
      "index": 23679, "data": 50923619, "footer": 1382
    },
    { "stripe": 2, "rows": 62050,
      "offset": 101825605, "length": 26902880,
      "index": 15322, "data": 26886211, "footer": 1347
    },
    { "stripe": 3, "rows": 117760,
      "offset": 128728485, "length": 50474083,
      "index": 24110, "data": 50448601, "footer": 1372
    },
    { "stripe": 4, "rows": 117760,
      "offset": 179202568, "length": 50413042,
      "index": 23858, "data": 50387825, "footer": 1359
    },
    { "stripe": 5, "rows": 63570,
      "offset": 229615610, "length": 27504277,
      "index": 14890, "data": 27488029, "footer": 1358
    },
    { "stripe": 6, "rows": 117760,
      "offset": 268435456, "length": 50981984,
      "index": 24191, "data": 50956424, "footer": 1369
    },
    { "stripe": 7, "rows": 117760,
      "offset": 319417440, "length": 51017894,
      "index": 23792, "data": 50992731, "footer": 1371
    },
    { "stripe": 8, "rows": 61720,
      "offset": 370435334, "length": 26840720,
      "index": 15246, "data": 26824109, "footer": 1365
    },
    { "stripe": 9, "rows": 117760,
      "offset": 397276054, "length": 49971095,
      "index": 23487, "data": 49946233, "footer": 1375
    },
    { "stripe": 10, "rows": 117760,
      "offset": 447247149, "length": 50259825,
      "index": 24090, "data": 50234369, "footer": 1366
    },
    { "stripe": 11, "rows": 73897,
      "offset": 497506974, "length": 31675255,
      "index": 16948, "data": 31656952, "footer": 1355
    }
  ]
}

wpleonardo · 2023-10-10T13:20:17Z

Yes, we have a performance micro-benchmark for this PR. If you use ORC's default aligned fixed bit widths, AVX512 bit-unpacking performs almost the same as non-AVX512. But if you use unaligned bit widths, AVX512 bit-unpacking has an almost 6X performance gain over non-AVX512, and its performance is close to that of non-AVX512 with aligned fixed bit widths.
So maybe you could check whether the ClickHouse ORC setting uses aligned bit widths.
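
One way to read the aligned versus unaligned distinction: with aligned packing, each width is rounded up to a size that scalar code can read cheaply. The rounding table below (1, 2, 4, then whole bytes) is an assumption about ORC's aligned mode and should be checked against the ORC source; alignedWidth is a hypothetical name.

```cpp
#include <cstdint>

// Hypothetical helper: round a bit width (0..64) up to the next "aligned"
// width, assumed here to be 1, 2, 4, or a whole number of bytes.
uint32_t alignedWidth(uint32_t bits) {
  if (bits <= 1) return 1;
  if (bits <= 2) return 2;
  if (bits <= 4) return 4;
  // Above 4 bits, round up to a multiple of 8.
  return ((bits + 7) / 8) * 8;
}
```

Under that assumption a 10-bit column is stored as 16 bits per value, which scalar code can read with plain byte loads, so there is little left for the vector path to win. At an unaligned 10 bits the scalar path has to shift and mask across byte boundaries, which is where the vector path would be expected to help.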

taiyang-li · 2023-10-11T03:39:52Z

@wpleonardo I tried, but still found no improvement.

orc file(snappy + unaligned) + avx512
0 rows in set. Elapsed: 3.478 sec. Processed 1.20 million rows, 539.37 MB (345.98 thousand rows/s., 155.08 MB/s.)
0 rows in set. Elapsed: 3.424 sec. Processed 1.20 million rows, 539.37 MB (351.44 thousand rows/s., 157.53 MB/s.)
0 rows in set. Elapsed: 3.444 sec. Processed 1.20 million rows, 539.37 MB (349.44 thousand rows/s., 156.63 MB/s.)


orc file (snappy + unaligned) +  none
0 rows in set. Elapsed: 3.362 sec. Processed 1.20 million rows, 539.37 MB (357.89 thousand rows/s., 160.42 MB/s.)
0 rows in set. Elapsed: 3.535 sec. Processed 1.20 million rows, 539.37 MB (340.43 thousand rows/s., 152.59 MB/s.)
0 rows in set. Elapsed: 3.370 sec. Processed 1.20 million rows, 539.37 MB (357.08 thousand rows/s., 160.06 MB/s.)
 

orc file (lz4 + unaligned) + avx512
0 rows in set. Elapsed: 3.075 sec. Processed 1.20 million rows, 1.90 GB (391.26 thousand rows/s., 618.31 MB/s.)
0 rows in set. Elapsed: 3.082 sec. Processed 1.20 million rows, 1.90 GB (390.46 thousand rows/s., 617.05 MB/s.)
0 rows in set. Elapsed: 3.014 sec. Processed 1.20 million rows, 1.90 GB (399.18 thousand rows/s., 630.82 MB/s.)


orc file (lz4 + unaligned) + none 
rows in set. Elapsed: 2.973 sec. Processed 1.20 million rows, 1.90 GB (404.76 thousand rows/s., 639.64 MB/s.)
0 rows in set. Elapsed: 3.070 sec. Processed 1.20 million rows, 1.90 GB (391.90 thousand rows/s., 619.32 MB/s.)
0 rows in set. Elapsed: 2.903 sec. Processed 1.20 million rows, 1.90 GB (414.51 thousand rows/s., 655.05 MB/s.)

wpleonardo · 2023-10-11T13:04:18Z

Could you do a simple test first, for example, just select the int64 column instead of all columns?

taiyang-li · 2023-10-12T06:28:19Z

@wpleonardo I still find no improvement when selecting only the int64 columns.

Q: select reporttime,appid,uid,clientversioncode,sdkversioncode,starttimestamp,backgroundtotal,foregroundtotal,firstiframesize,firstiframedecodetime,extras,mclientip,mnc,mcc,vsipsuccess,msipsuccess,vsipfail,msipfail,mediaflag,videomutetype,owneruid from file('lz4_new_bigolive_audience_stats_orc.orc') format Null;

without avx512:

localhost:9001, queries: 20, QPS: 2.256, RPS: 2715210.217, MiB/s: 4092.049, result RPS: 0.000, result MiB/s: 0.000.

0.000%		0.421 sec.	
10.000%		0.423 sec.	
20.000%		0.425 sec.	
30.000%		0.429 sec.	
40.000%		0.433 sec.	
50.000%		0.440 sec.	
60.000%		0.440 sec.	
70.000%		0.442 sec.	
80.000%		0.443 sec.	
90.000%		0.456 sec.	
95.000%		0.457 sec.	
99.000%		0.464 sec.	
99.900%		0.464 sec.	
99.990%		0.464 sec.

with avx512

localhost:9001, queries: 20, QPS: 2.216, RPS: 2665968.958, MiB/s: 4017.839, result RPS: 0.000, result MiB/s: 0.000.

0.000%		0.423 sec.	
10.000%		0.429 sec.	
20.000%		0.431 sec.	
30.000%		0.434 sec.	
40.000%		0.438 sec.	
50.000%		0.442 sec.	
60.000%		0.448 sec.	
70.000%		0.451 sec.	
80.000%		0.453 sec.	
90.000%		0.469 sec.	
95.000%		0.473 sec.	
99.000%		0.482 sec.	
99.900%		0.482 sec.	
99.990%		0.482 sec.

wpleonardo · 2023-10-13T11:58:53Z

Could you debug your program to check whether ORC is actually using AVX512 bit-unpacking, for example by checking whether the function "BitUnpackAVX512::readLongs" is invoked when you execute the query?
If ORC is using AVX512 bit-unpacking, then run "perf top" to check what share of the hotspots the AVX512 bit-unpacking functions account for, for example the "vectorUnpackX" functions.

… instructions

### What changes were proposed in this pull request?

The original ORC Rle-bit-packing decodes values one by one. Intel AVX-512 provides 512-bit vector operations that can accelerate the Rle-bit-packing decode process: far fewer CPU instructions are needed to unpack the same amount of data, so AVX-512 vector decode performs much better than the original code. In the functional micro-performance test, AVX-512 vector decode is expected to bring an average 6x to 7x latency improvement when comparing the vector function vectorUnpackX with the original Rle-bit-packing decode function plainUnpackLongs. In the real world, users store large amounts of data in the ORC format and need to decode hundreds or thousands of bytes at a time, so AVX-512 vector decode will be more efficient and will help this processing.

In the real world, the data values in ORC are usually narrower than 32 bits, so this PR supplies vector decode code for value widths up to 32 bits. For values that are 8 bits, 16 bits, or another multiple of 8 bits wide, the performance improvement will be relatively small compared with widths that are not a multiple of 8 bits.

Intel AVX512 instructions official link: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html

1. Added a cmake option named "BUILD_ENABLE_AVX512" to switch this feature on or off at build time. The default value of BUILD_ENABLE_AVX512 is OFF. For example, the following builds the ORC library with AVX512 Bit-unpacking enabled:
    cmake .. -DCMAKE_BUILD_TYPE=release -DBUILD_ENABLE_AVX512=ON
2. Added the macro "ORC_HAVE_RUNTIME_AVX512" to control whether this feature's code is built into ORC.
3. Added the file "CpuInfoUtil.cc" to dynamically detect whether the current platform supports AVX-512. When ORC is built with AVX-512 enabled but the platform it runs on does not support AVX-512, it uses the original bit-packing decode function instead of AVX-512 vector decode.
4. Added the functions "vectorUnpackX" to support X-bit value decode instead of the original function plainUnpackLongs or vectorUnpackX.
5. Added the testcases "RleV2BitUnpackAvx512Test" to verify N-bit value AVX-512 vector decode in the new testcase file TestRleVectorDecoder.cc.
6. Modified the function plainUnpackLongs by adding an output parameter uint64_t& startBit. This parameter stores the number of bits left over after unpacking.
7. AVX-512 vector decode processes 512 bits of data in every unpacking step. If the data being unpacked is long enough, almost all of it can be processed by AVX-512. If the data length (or block size) is too short, less than 512 bits, AVX-512 is not used and the code falls back to the original decode path, unpacking values one by one.

### Why are the changes needed?

This can improve the performance of Rle-bit-packing decode. In the functional micro-performance test, AVX-512 vector decode is expected to bring an average 6x to 7x latency improvement when comparing the vector function vectorUnpackX with the original Rle-bit-packing decode function plainUnpackLongs.

Intel improves CPU performance every year, and users run data analysis on ORC data on newer platforms. The Intel SKX platform already supported AVX512 instructions six years ago. ORC data unpacking should therefore be upgraded to use this widely available CPU feature, which keeps ORC in pace with the times.

### How to enable AVX512 Bit-unpacking?

1. Enable the cmake option BUILD_ENABLE_AVX512 to build the ORC library with AVX512 enabled:
    cmake .. -DCMAKE_BUILD_TYPE=release -DBUILD_ENABLE_AVX512=ON
2. Set the environment variable when using the ORC library:
    export ORC_USER_SIMD_LEVEL=AVX512
   (Note: this parameter accepts only two values, "AVX512" and "none"; the value is case-insensitive.) If ORC_USER_SIMD_LEVEL=none is set, AVX512 Bit-unpacking is disabled.

### How was this patch tested?

I created a new testcase file, TestRleVectorDecoder.cc. It contains the testcases below; to run them, turn on the cmake option -DBUILD_ENABLE_AVX512=ON and run on a platform that supports AVX-512. Every testcase contains 2 scenarios:

1. The blockSize increases from 1 to 10000, and the data length is 10240;
2. The blockSize increases from 1000 to 10000, and the data length increases from 1000 to 70000.

The testcases take a while to execute, so I added a progress bar for every testcase. Here is a demo print of the progress bar for one testcase:

    [ RUN ] OrcTest/RleVectorTest.RleV2_basic_vector_decode_10bit/1
    10bit Test 1st Part:[OK][#################################################################################][100%]
    10bit Test 2nd Part:[OK][#################################################################################][100%]

For the main vector function vectorUnpackX, test code coverage reaches 100%.

This closes apache#1375
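The selection logic described above (build option, runtime CPU check, ORC_USER_SIMD_LEVEL, and the 512-bit minimum) can be illustrated with a small sketch. This is not the code from the PR: the names `userWantsAvx512`, `useVectorPath` and `scalarUnpack` are hypothetical, treating unrecognised environment values as "none" is an assumption, and the scalar loop assumes values are packed most-significant-bit first. It only shows the shape of the "one value at a time" baseline and of the fallback decision.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: interpret a value of ORC_USER_SIMD_LEVEL.
// The description names two values, "AVX512" and "none", compared
// case-insensitively. Mapping anything else to "disabled" is an assumption.
bool userWantsAvx512(std::string level) {
  std::transform(level.begin(), level.end(), level.begin(), [](unsigned char c) {
    return static_cast<char>(std::toupper(c));
  });
  return level == "AVX512";
}

// Hypothetical dispatch rule: take the vector path only when the library was
// built with AVX-512, the CPU reports AVX-512, the user enabled it, and at
// least 512 bits of packed data are available. Otherwise fall back to the
// scalar path.
bool useVectorPath(bool builtWithAvx512, bool cpuHasAvx512, bool userEnabled,
                   std::size_t valueCount, unsigned bitWidth) {
  return builtWithAvx512 && cpuHasAvx512 && userEnabled &&
         valueCount * bitWidth >= 512;
}

// Scalar baseline: unpack `count` values of `bitWidth` bits each, one bit at a
// time, assuming most-significant-bit-first packing. The caller must supply at
// least ceil(count * bitWidth / 8) input bytes.
std::vector<uint64_t> scalarUnpack(const std::vector<uint8_t>& in,
                                   unsigned bitWidth, std::size_t count) {
  std::vector<uint64_t> out;
  out.reserve(count);
  std::size_t bitPos = 0;  // absolute bit offset into the input
  for (std::size_t i = 0; i < count; ++i) {
    uint64_t value = 0;
    for (unsigned b = 0; b < bitWidth; ++b, ++bitPos) {
      const uint8_t byte = in[bitPos / 8];
      const unsigned bit = (byte >> (7 - bitPos % 8)) & 1u;
      value = (value << 1) | bit;
    }
    out.push_back(value);
  }
  return out;
}
```

With these definitions, a block of 64 values at 8 bits each (exactly 512 bits) would qualify for the vector path, while 10 values at 10 bits each would be decoded by the scalar loop.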

wpleonardo added 6 commits January 10, 2023 14:17

Use AVX512 to optimize bit-packing decode functions. This will improve ORC bit packing performance. Only contains 1~32bit opt.

58c3ab6

Fix some conflicts.

acbc214

Fix some conflicts.

293d863

Fix the code format.

e7a9119

Modify TestRleVectorDecoder.cc to match the new format.

cfde08f

Fix a mistake on function name

8341943

github-actions bot added BUILD CPP labels Jan 11, 2023

dongjoon-hyun added this to the 1.9.0 milestone Jan 11, 2023

dongjoon-hyun reviewed Jan 11, 2023

wpleonardo changed the title ~~[C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode~~ ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode Jan 11, 2023

wpleonardo added 2 commits January 11, 2023 14:16

Modified code into namespace orc

e840649

Modify function name to fix a build issue.

c7962d5

wpleonardo added 3 commits January 12, 2023 06:42

Modify code format.

495a620

Fix a build issue about int64 has different printf format between macos and linux.

5c937e6

Fix build issue on windows.

a87c281

wgtmac reviewed Jan 12, 2023

Fix some code format issue and function name.

d8fcbe6

dongjoon-hyun reviewed Jan 13, 2023

1. Modified the code format; 2. Added a dynamic check of whether the current compiler and platform support AVX512; 3. Changed the default value of the build option BUILD_ENABLE_AVX512 to "ON"; 4. Added the build option for the file TestRleVectorDecoder.cc, and tried to fix the clang-format build issue.

668335c

dongjoon-hyun previously requested changes Jan 14, 2023

dongjoon-hyun marked this pull request as draft January 14, 2023 05:49

wpleonardo added 2 commits January 14, 2023 14:49

1. Use clang-format to modify the code format of TestRleVectorDecoder.cc 2. Change CMakeLists.txt some options

46daa2d

1. Use clang-format -style=google to format code style of TestRleVectorDecoder.cc 2. Change the option CXX_COMMON_FLAGS to CMAKE_CXX_FLAGS

415d1eb

dongjoon-hyun reviewed Jan 27, 2023

dongjoon-hyun reviewed Jan 30, 2023

Merge pull request #42 from wpleonardo/fix_comments

ce77b50

Change the invoking way about bufferstart,bufferend parameters.

Merge pull request #43 from wpleonardo/fix_comments

6c84d8d

1. Code format change

stiga-huang approved these changes Apr 21, 2023

Your Name and others added 3 commits April 21, 2023 09:59

Change the invoking way about bufferstart,bufferend parameters.

f3ff215

1. Code format change

af96de9

2. Fix an AVX512 flags check issue on windows.

Merge pull request #44 from wpleonardo/fix_comments

d6fd57d

Modified cmakefile about the checking of AVX512.

Modified cmakefile about the checking of AVX512.

0bfc862

Your Name and others added 3 commits April 23, 2023 22:07

Because check_cxx_source_run will be hung on windows, change check_cxx_source_run back to CHECK_CXX_SOURCE_COMPILES, and added "grep avx512f /proc/cpuinfo" to check CPU flags.

e584a42

Change check_cxx_source_runs back to CHECK_CXX_SOURCE_COMPILES

4d261eb

Merge pull request #45 from wpleonardo/fix_comments

1f2085e

Because check_cxx_source_run will be hung on windows, change check_cx…

wgtmac approved these changes May 6, 2023

dongjoon-hyun approved these changes May 6, 2023

williamhyun approved these changes May 7, 2023

wgtmac merged commit 0f2c5d3 into apache:main May 7, 2023

dongjoon-hyun mentioned this pull request Dec 12, 2023

Add a copyright messaging to the BpackingAvx512.hh #1691

Open

wpleonardo commented Jan 11, 2023

dongjoon-hyun left a comment

wgtmac commented Jan 11, 2023

wgtmac left a comment

dongjoon-hyun left a comment

dongjoon-hyun left a comment

wpleonardo commented Jan 15, 2023

wgtmac commented Jan 15, 2023

dongjoon-hyun left a comment

wpleonardo commented Jan 28, 2023

wgtmac commented Apr 21, 2023

wpleonardo commented Apr 21, 2023

wpleonardo commented Apr 23, 2023

wpleonardo commented Apr 23, 2023

wpleonardo commented Apr 24, 2023

wpleonardo commented Apr 26, 2023

wgtmac left a comment

wpleonardo commented May 6, 2023 via email

dongjoon-hyun left a comment

williamhyun left a comment

wgtmac commented May 7, 2023

taiyang-li commented Oct 10, 2023

wpleonardo commented Oct 10, 2023

taiyang-li commented Oct 11, 2023

wpleonardo commented Oct 11, 2023

taiyang-li commented Oct 12, 2023

wpleonardo commented Oct 13, 2023
