ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode #1375

Merged on May 7, 2023 (110 commits)

@wpleonardo wpleonardo commented Jan 11, 2023

What changes were proposed in this pull request?

The original ORC RLE bit-packing decoder decodes values one at a time. Intel AVX-512 provides 512-bit vector operations, so the decoder needs far fewer CPU instructions to unpack the same amount of data, and AVX-512 vector decode performs much better as a result. In the functional micro-benchmark, I estimate that AVX-512 vector decode brings an average 6x to 7x latency improvement when comparing the vector functions vectorUnpackX with the original decode function plainUnpackLongs. In real workloads, users store large amounts of data in ORC format and need to decode hundreds or thousands of bytes at a time, so AVX-512 vector decode makes this processing more efficient.

In practice, values stored in ORC are usually narrower than 32 bits, so this PR provides vectorized decoding for bit widths up to 32 bits. For widths that are a multiple of 8 (8-bit, 16-bit, and so on), the performance improvement is relatively small compared with widths that are not a multiple of 8.
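
To make the one-value-at-a-time baseline concrete, here is a minimal scalar sketch of bit-unpacking. It is not the ORC implementation: the function name `unpackScalar` and the MSB-first bit order are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar reference decoder (not ORC's plainUnpackLongs).
// Reads `count` values of `bitWidth` bits each from `packed`, assuming the
// most significant bit of each value comes first in the stream.
std::vector<uint64_t> unpackScalar(const std::vector<uint8_t>& packed,
                                   unsigned bitWidth, std::size_t count) {
  assert(bitWidth >= 1 && bitWidth <= 32);
  assert(packed.size() * 8 >= static_cast<std::size_t>(bitWidth) * count);
  std::vector<uint64_t> out;
  out.reserve(count);
  std::size_t bitPos = 0;  // absolute bit offset into the packed stream
  for (std::size_t i = 0; i < count; ++i) {
    uint64_t value = 0;
    for (unsigned b = 0; b < bitWidth; ++b, ++bitPos) {
      // Extract a single bit, taking the high bit of each byte first.
      const unsigned bit = (packed[bitPos / 8] >> (7 - bitPos % 8)) & 1u;
      value = (value << 1) | bit;
    }
    out.push_back(value);
  }
  return out;
}
```

Every output value costs a loop over its bits here, which is the kind of per-value work a 512-bit vector path avoids. For byte-multiple widths the scalar path can copy whole bytes, which is consistent with the smaller gain described for 8-bit and 16-bit values.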

Intel AVX512 instructions official link:
https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html

  1. Added a CMake option named "BUILD_ENABLE_AVX512" to enable or disable this feature at build time.
    The default value of BUILD_ENABLE_AVX512 is OFF.
    For example, cmake .. -DCMAKE_BUILD_TYPE=release -DBUILD_ENABLE_AVX512=ON
    builds the ORC library with AVX512 bit-unpacking enabled.
  2. Added the macro "ORC_HAVE_RUNTIME_AVX512" to control whether this feature's code is compiled into ORC.
  3. Added the file "CpuInfoUtil.cc" to dynamically detect whether the current platform supports AVX-512. When ORC is built with AVX-512 enabled but the platform it runs on doesn't support AVX-512, it uses the original bit-packing decode function instead of AVX-512 vector decode.
  4. Added the functions "vectorUnpackX" to decode X-bit values in place of the original function plainUnpackLongs.
  5. Added the testcases "RleV2BitUnpackAvx512Test" in the new testcase file TestRleVectorDecoder.cc to verify AVX-512 vector decode of N-bit values.
  6. Modified the function plainUnpackLongs by adding an output parameter uint64_t& startBit, which stores the number of bits left over after unpacking.
  7. AVX-512 vector decode processes 512 bits of data per unpacking step. If the data being unpacked is long enough, almost all of it can be processed by AVX-512. If the data length (or block size) is shorter than 512 bits, AVX-512 is not used and decoding falls back to the original one-by-one path.
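
The fallback rule in the last item can be sketched as a small planning function. This is a simplified model under stated assumptions, not the logic in the PR: `WorkSplit` and `splitWork` are hypothetical names, and the real decoder may group values differently per vector step.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type: how many values the vector path would handle and
// how many are left for the one-by-one scalar path.
struct WorkSplit {
  std::size_t vectorValues;
  std::size_t scalarValues;
};

// Assumption: the vector path consumes whole 512-bit chunks only, and any
// values not fully contained in those chunks go to the scalar path.
WorkSplit splitWork(unsigned bitWidth, std::size_t numValues, bool avx512Available) {
  assert(bitWidth >= 1 && bitWidth <= 32);
  constexpr std::size_t kVectorBits = 512;
  const std::size_t totalBits = numValues * bitWidth;
  if (!avx512Available || totalBits < kVectorBits) {
    return {0, numValues};  // too short or unsupported: scalar path only
  }
  const std::size_t wholeChunks = totalBits / kVectorBits;
  const std::size_t vectorValues = (wholeChunks * kVectorBits) / bitWidth;
  return {vectorValues, numValues - vectorValues};
}
```

For example, under this model 200 values of 3 bits (600 bits) yield one 512-bit chunk holding 170 whole values, with the remaining 30 decoded one by one.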

Add new files:

| New Files | File Purpose |
| --- | --- |
| CpuInfoUtil.hh .cc | Dynamically detects whether the current platform supports AVX-512. If it does, AVX-512 vector decode is used; if not, the original decode functions are used. |
| BitUnpackerAvx512.hh | Contains the new macros, arrays, and unions that AVX-512 vector decode needs. |
| BpackingAvx512.hh .cc | Contains the AVX512 bit-unpacking functions for 1~32 bit data. |
| BpackingDefault.hh .cc | Contains the default bit-unpacking functions. |
| Dispatch.hh | Contains the dynamic dispatch according to the available DispatchLevel. |
| TestRleVectorDecoder.cc | New testcases for unit and functional testing of this feature. |
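
As a rough illustration of what "dynamic dispatch according to the available DispatchLevel" can mean, the sketch below picks the highest-level implementation the current machine supports. The enum values, the `Implementation` struct, and `selectImplementation` are hypothetical and do not reflect the actual contents of Dispatch.hh.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical dispatch levels, ordered from least to most capable.
enum class DispatchLevel { NONE = 0, AVX512 = 1 };

// Hypothetical registry entry: a real one would hold a function pointer;
// a name is enough to show the selection rule.
struct Implementation {
  DispatchLevel level;
  std::string name;
};

// Returns the candidate with the highest level that does not exceed what the
// running CPU supports. A NONE-level default must always be registered.
const Implementation& selectImplementation(const std::vector<Implementation>& candidates,
                                           DispatchLevel supported) {
  assert(!candidates.empty());
  const Implementation* best = nullptr;
  for (const auto& candidate : candidates) {
    if (candidate.level > supported) continue;  // CPU cannot run this one
    if (best == nullptr || candidate.level > best->level) best = &candidate;
  }
  assert(best != nullptr);
  return *best;
}
```

With a default and an AVX-512 candidate registered, a machine reporting `DispatchLevel::AVX512` gets the vector version, and any other machine gets the default.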

Why are the changes needed?

This can improve the performance of RLE bit-packing decode. In the functional micro-benchmark, I estimate that AVX-512 vector decode brings an average 6x to 7x latency improvement when comparing the vector functions vectorUnpackX with the original decode function plainUnpackLongs.
Intel improves CPU performance every year, and users run data analysis on ORC data on newer platforms. The Intel SKX platform already supported AVX512 instructions 6 years ago, so ORC data unpacking should be upgraded to use this widely available CPU feature and keep ORC up to date.

How to enable AVX512 Bit-unpacking?

  1. Enable the CMake option BUILD_ENABLE_AVX512 to build the ORC library with AVX512 enabled.
    cmake .. -DCMAKE_BUILD_TYPE=release -DBUILD_ENABLE_AVX512=ON
  2. Set the environment variable when using the ORC library:
    export ORC_USER_SIMD_LEVEL=AVX512
    (Note: this variable accepts only two values, "AVX512" and "none", and the value is case-insensitive.)
    If ORC_USER_SIMD_LEVEL=none is set, AVX512 bit-unpacking is disabled.
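
A case-insensitive reading of ORC_USER_SIMD_LEVEL could look like the sketch below. The names `SimdLevel`, `parseSimdLevel`, and `simdLevelFromEnv` are hypothetical, and what ORC does when the variable is unset or holds another value is not stated here, so the fallback is left to the caller.

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical enum mirroring the two documented values.
enum class SimdLevel { NONE, AVX512 };

// Parses the raw variable value case-insensitively. Unset or unrecognized
// values return `fallback` (an assumption of this sketch).
SimdLevel parseSimdLevel(const char* raw, SimdLevel fallback) {
  if (raw == nullptr) return fallback;
  std::string value(raw);
  for (char& c : value) {
    c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
  }
  if (value == "AVX512") return SimdLevel::AVX512;
  if (value == "NONE") return SimdLevel::NONE;
  return fallback;
}

// Reads the environment variable named in the PR description.
SimdLevel simdLevelFromEnv(SimdLevel fallback) {
  return parseSimdLevel(std::getenv("ORC_USER_SIMD_LEVEL"), fallback);
}
```

Keeping the parsing separate from `std::getenv` makes the rule easy to test without touching the process environment.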

How was this patch tested?

I created a new testcase file, TestRleVectorDecoder.cc, containing the testcases below. To run them, enable the CMake option -DBUILD_ENABLE_AVX512=ON on a platform that supports AVX-512. Every testcase contains 2 scenarios:

  1. The blockSize increases from 1 to 10000, and the data length is 10240;
  2. The blockSize increases from 1000 to 10000, and the data length increases from 1000 to 70000.

The testcases take a while to run, so I added a progress bar to each one. Here is sample progress bar output from one testcase:

    [ RUN ] OrcTest/RleVectorTest.RleV2_basic_vector_decode_10bit/1
    10bit Test 1st Part:[OK][#################################################################################][100%]
    10bit Test 2nd Part:[OK][#################################################################################][100%]

Test code coverage of the main vector functions vectorUnpackX reaches 100%.

| New Testcases | Test Data Bit Size |
| --- | --- |
| RleV2_basic_vector_decode_1bit | 1bit |
| RleV2_basic_vector_decode_2bit | 2bit |
| RleV2_basic_vector_decode_3bit | 3bit |
| RleV2_basic_vector_decode_4bit | 4bit |
| RleV2_basic_vector_decode_5bit | 5bit |
| RleV2_basic_vector_decode_6bit | 6bit |
| RleV2_basic_vector_decode_7bit | 7bit |
| RleV2_basic_vector_decode_9bit | 9bit |
| RleV2_basic_vector_decode_10bit | 10bit |
| RleV2_basic_vector_decode_11bit | 11bit |
| RleV2_basic_vector_decode_12bit | 12bit |
| RleV2_basic_vector_decode_13bit | 13bit |
| RleV2_basic_vector_decode_14bit | 14bit |
| RleV2_basic_vector_decode_15bit | 15bit |
| RleV2_basic_vector_decode_16bit | 16bit |
| RleV2_basic_vector_decode_17bit | 17bit |
| RleV2_basic_vector_decode_18bit | 18bit |
| RleV2_basic_vector_decode_19bit | 19bit |
| RleV2_basic_vector_decode_20bit | 20bit |
| RleV2_basic_vector_decode_21bit | 21bit |
| RleV2_basic_vector_decode_22bit | 22bit |
| RleV2_basic_vector_decode_23bit | 23bit |
| RleV2_basic_vector_decode_24bit | 24bit |
| RleV2_basic_vector_decode_26bit | 26bit |
| RleV2_basic_vector_decode_28bit | 28bit |
| RleV2_basic_vector_decode_30bit | 30bit |
| RleV2_basic_vector_decode_32bit | 32bit |
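
The general shape of a per-bit-width decode test is a round trip: pack known values at a given width, decode them, and compare. The sketch below shows that idea with scalar helpers only. `packBits`, `unpackBits`, and `roundTripOk` are hypothetical names, the MSB-first layout is an assumption, and none of this is taken from TestRleVectorDecoder.cc.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Packs each value into `bitWidth` bits, most significant bit first.
std::vector<uint8_t> packBits(const std::vector<uint64_t>& values, unsigned bitWidth) {
  assert(bitWidth >= 1 && bitWidth <= 32);
  std::vector<uint8_t> out((values.size() * bitWidth + 7) / 8, 0);
  std::size_t bitPos = 0;
  for (uint64_t v : values) {
    for (unsigned b = bitWidth; b-- > 0; ++bitPos) {
      if ((v >> b) & 1u) {
        out[bitPos / 8] |= static_cast<uint8_t>(1u << (7 - bitPos % 8));
      }
    }
  }
  return out;
}

// Scalar inverse of packBits, decoding one value at a time.
std::vector<uint64_t> unpackBits(const std::vector<uint8_t>& packed,
                                 unsigned bitWidth, std::size_t count) {
  assert(packed.size() * 8 >= static_cast<std::size_t>(bitWidth) * count);
  std::vector<uint64_t> out(count, 0);
  std::size_t bitPos = 0;
  for (std::size_t i = 0; i < count; ++i) {
    for (unsigned b = 0; b < bitWidth; ++b, ++bitPos) {
      out[i] = (out[i] << 1) | ((packed[bitPos / 8] >> (7 - bitPos % 8)) & 1u);
    }
  }
  return out;
}

// Round-trips `count` deterministic values at the given width.
bool roundTripOk(unsigned bitWidth, std::size_t count) {
  const uint64_t mask = (uint64_t{1} << bitWidth) - 1;  // bitWidth <= 32
  std::vector<uint64_t> values(count);
  for (std::size_t i = 0; i < count; ++i) {
    values[i] = (i * 2654435761ull) & mask;  // fixed multiplicative pattern
  }
  return unpackBits(packBits(values, bitWidth), bitWidth, count) == values;
}
```

A vectorized decoder would be checked the same way, by substituting it for `unpackBits` and comparing its output against the scalar result across block sizes.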

@dongjoon-hyun dongjoon-hyun added this to the 1.9.0 milestone Jan 11, 2023
@dongjoon-hyun dongjoon-hyun left a comment

Could you make CI happy, @wpleonardo ?

@wpleonardo wpleonardo changed the title [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode Jan 11, 2023

wgtmac commented Jan 11, 2023

Welcome to the Apache ORC community! @wpleonardo

This feature looks promising. Will take a look this week.

cc @stiga-huang @coderex2522

@wgtmac wgtmac left a comment

I only did a preliminary review and left some comments, mostly cosmetic. Will take a deeper look later.

@dongjoon-hyun dongjoon-hyun left a comment

It seems there is still a format issue.

c++/test/TestRleVectorDecoder.cc:29:1: error: code should be clang-formatted [-Wclang-format-violations]
#include "wrap/orc-proto-wrapper.hh"
^
c++/test/TestRleVectorDecoder.cc:38:51: error: code should be clang-formatted [-Wclang-format-violations]
  const int DEFAULT_MEM_STREAM_SIZE = 1024 * 1024; // 1M
...

2. Dynamically detect whether the current compiler and platform support AVX512;
3. Change the default value of the build option BUILD_ENABLE_AVX512 to "ON";
4. Add the build option for the file TestRleVectorDecoder.cc, and try to fix the clang-format build issue.
@dongjoon-hyun dongjoon-hyun left a comment

If you don't mind, could you test this locally first?

96 warnings and 20 errors generated.
make[2]: *** [c++/src/CMakeFiles/orc.dir/build.make:461: c++/src/CMakeFiles/orc.dir/RleDecoderV2.cc.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:433: c++/src/CMakeFiles/orc.dir/all] Error 2

@dongjoon-hyun dongjoon-hyun marked this pull request as draft January 14, 2023 05:49
…orDecoder.cc

2. Change the option CXX_COMMON_FLAGS to CMAKE_CXX_FLAGS
@wpleonardo

May I ask a question about the clang-format errors in TestRleVectorDecoder.cc?
I have already used clang-format -style=google to format the file, but I still get clang-format errors in CI. Do we use -style=google with clang-format, or another style?
Thank you very much!


wgtmac commented Jan 15, 2023

> May I ask a question about the clang-format errors in TestRleVectorDecoder.cc? I have already used clang-format -style=google to format the file, but I still get clang-format errors in CI. Do we use -style=google with clang-format, or another style? Thank you very much!

The clang-format we use is defined here: https://github.com/apache/orc/blob/main/.clang-format. You can simply use clang-format -i TestRleVectorDecoder.cc to format it automatically.

@dongjoon-hyun dongjoon-hyun left a comment

Gentle ping, @wpleonardo .

@wpleonardo

> Gentle ping, @wpleonardo .

Sorry, I was on holiday for the past few days. I will get back to work and follow your suggestions in the next few days.
Thank you very much!

Change the invoking way about bufferstart,bufferend parameters.

wgtmac commented Apr 21, 2023

Thanks @stiga-huang and @wpleonardo!

@wpleonardo

Just fixed an AVX512 flag check issue on the Windows platform.
In the CI Windows test, the test machine's CPU doesn't have AVX512 flags, but the check in the CMake file did not detect this. The reason is that
check_cxx_compiler_flag("/arch:AVX512" COMPILER_SUPPORT_AVX512)
only checks whether the compiler can enable the use of AVX512 instructions (https://learn.microsoft.com/en-us/cpp/build/reference/arch-x64?view=msvc-170), not whether the CPU has AVX512 flags.
So I changed the check to
check_cxx_compiler_flag("-mavx512f -mavx512cd -mavx512vl -mavx512dq -mavx512bw" COMPILER_SUPPORT_AVX512)
which is intended to verify directly whether the current CPU has AVX512 instructions.
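
The gap between what the compiler can emit and what the CPU can execute can also be seen from C++. The sketch below is illustrative only and is not the check used in ORC: the function names are hypothetical, and the runtime probe relies on the GCC/Clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports`, so it reports false on other toolchains or non-x86 targets.

```cpp
#include <cassert>

// True when this translation unit was compiled with AVX-512F code generation
// enabled (for example -mavx512f on GCC/Clang or /arch:AVX512 on MSVC).
// This says nothing about the machine the binary later runs on.
constexpr bool compiledWithAvx512() {
#if defined(__AVX512F__)
  return true;
#else
  return false;
#endif
}

// True when the CPU executing this program reports AVX-512F support.
bool cpuHasAvx512() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  __builtin_cpu_init();
  return __builtin_cpu_supports("avx512f") != 0;
#else
  return false;  // unknown toolchain or non-x86 target in this sketch
#endif
}

// Simplification: require both. A library with runtime dispatch may instead
// compile only selected files with AVX-512 flags and rely on the runtime probe.
bool canUseAvx512() {
  return compiledWithAvx512() && cpuHasAvx512();
}
```

A compiler-flag check corresponds to the first function, while deciding whether AVX-512 code may actually run corresponds to the second.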

Your Name and others added 3 commits April 21, 2023 09:59
2. Fix an AVX512 flags check issue on windows.
Modified cmakefile about the checking of AVX512.
@wpleonardo

In cmake_modules/ConfigSimdLevel.cmake, I changed check_cxx_source_compiles to check_cxx_source_runs to make sure an AVX512 program can actually run on that machine.
https://github.com/wpleonardo/orc/blob/d6fd57d1c81709d6412fd506301aeffde39a3db6/cmake_modules/ConfigSimdLevel.cmake#L57
Please help me rerun the CI tests. Sorry for the repeated reruns.

@wpleonardo

check_cxx_source_runs hangs on the Windows platform when the CPU doesn't have AVX512 flags.
So I changed check_cxx_source_runs back to check_cxx_source_compiles and added "grep avx512f /proc/cpuinfo" to check whether the CPU has AVX512 flags.
https://github.com/wpleonardo/orc/blob/1f2085e68ff4e691fb178080ec0c53e5b37286ea/cmake_modules/ConfigSimdLevel.cmake#L79
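
For readers who want the same /proc/cpuinfo check in code rather than in a shell command, here is a minimal sketch. `cpuinfoHasFlag` is a hypothetical helper that takes the file contents as a string, and it assumes the Linux x86 layout in which a line starting with "flags" lists space-separated feature names after a colon.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Returns true if `flag` appears as a whole token on any "flags" line of the
// given /proc/cpuinfo text.
bool cpuinfoHasFlag(const std::string& cpuinfoText, const std::string& flag) {
  std::istringstream lines(cpuinfoText);
  std::string line;
  while (std::getline(lines, line)) {
    if (line.compare(0, 5, "flags") != 0) continue;  // not a flags line
    const std::size_t colon = line.find(':');
    if (colon == std::string::npos) continue;
    std::istringstream tokens(line.substr(colon + 1));
    std::string token;
    while (tokens >> token) {
      if (token == flag) return true;
    }
  }
  return false;
}
```

Unlike a plain grep, this matches whole tokens, so searching for "avx512" does not match "avx512f". The file exists only on Linux, so the sketch does not cover Windows hosts.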


wpleonardo commented Apr 24, 2023

Hi @wgtmac @dongjoon-hyun @coderex2522 , CI test passed, do you have any other comments? Thank you very much!

Your Name and others added 3 commits April 23, 2023 22:07
…x_source_run back CHECK_CXX_SOURCE_COMPILES, and added "grep avx512f /proc/cpuinfo" to check CPU flags.
Because check_cxx_source_run will be hung on windows, change check_cx…
@wpleonardo

Hi @dongjoon-hyun, welcome back from vacation! Do you have any other comments? Thank you very much!

@wgtmac wgtmac left a comment

I will merge it by the end of this week if no further comment.


wpleonardo commented May 6, 2023 via email

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

cc @williamhyun

@williamhyun williamhyun left a comment

+1 LGTM

@wgtmac wgtmac merged commit 0f2c5d3 into apache:main May 7, 2023

wgtmac commented May 7, 2023

I have submitted this. Thanks all!


taiyang-li commented Oct 10, 2023

@wpleonardo Do we have any performance benchmarks for this PR? @alexey-milovidov Maybe you are interested in it.

I tried to use this feature in ClickHouse (https://github.com/clickHouse/ClickHouse), but can't see any performance improvement.

Q: select * from file('/data1/clickhouse_official/data/user_files/test.orc') format Null;

With AVX512:

0 rows in set. Elapsed: 3.659 sec. Processed 1.13 million rows, 486.19 MB (308.68 thousand rows/s., 132.88 MB/s.)
0 rows in set. Elapsed: 3.653 sec. Processed 1.20 million rows, 517.87 MB (329.40 thousand rows/s., 141.76 MB/s.)
0 rows in set. Elapsed: 3.719 sec. Processed 1.13 million rows, 486.19 MB (303.70 thousand rows/s., 130.74 MB/s.)

Without AVX512

0 rows in set. Elapsed: 3.565 sec. Processed 1.13 million rows, 486.19 MB (316.81 thousand rows/s., 136.38 MB/s.)
0 rows in set. Elapsed: 3.540 sec. Processed 1.20 million rows, 517.87 MB (339.91 thousand rows/s., 146.28 MB/s.)
0 rows in set. Elapsed: 3.681 sec. Processed 1.20 million rows, 517.87 MB (326.90 thousand rows/s., 140.69 MB/s.)

About the test orc file:

$ du -sh test.orc                                                     
505M	test.orc


$ orc-metadata ./test.orc                           
{ "name": "./test.orc",
  "type": "struct<reporttime:bigint,appid:bigint,uid:bigint,platform:int,nettype:int,clientversioncode:bigint,sdkversioncode:bigint,statid:string,statversion:int,countrycode:string,language:string,model:string,osversion:string,channel:string,heartcount:int,msgcount:int,giftcount:int,barragecount:int,gid:string,entrytype:int,prefetchedms:int,linkdstate:int,networkavailable:int,starttimestamp:bigint,sessionlogints:int,medialogints:int,sdkboundts:int,msconnectedts:int,vsconnectedts:int,firstiframets:int,ownerstatus:int,stopreason:int,totaltime:int,cpuusageavg:int,memusageavg:int,backgroundtotal:bigint,foregroundtotal:bigint,firstvideopackts:int,firstvoicerecvts:int,firstvoiceplayts:int,firstiframeassemblets:int,uiinitts:int,uiloadedts:int,uiappearedts:int,setvideoviewts:int,blurviewdimissts:int,preparesdkinqueuets:int,preparesdkexects:int,startsdkinqueuets:int,startsdkexects:int,sdkjoinchannelinqueuets:int,sdkjoinchannelexects:int,lastsdkleavechannelinqueuets:int,lastsdkleavechannelexects:int,unused_1:int,unused_2:int,setvideoviewinqueuets:int,setvideoviewexects:int,livetype:int,audiostatus:int,firstiframesize:bigint,firstiframedecodetime:bigint,extras:bigint,entrancetype:int,entrancemode:int,mclientip:bigint,mnc:bigint,mcc:bigint,vsipsuccess:bigint,msipsuccess:bigint,vsipfail:bigint,msipfail:bigint,mediaflag:bigint,dispatchid:string,proxyflag:int,redirectcount:int,directorrescode:int,subentrancetab:string,logininfolist:array<struct<strategy:bigint,ip:bigint,loginStat:bigint,reserve1:bigint,reserve2:bigint>>,playcentertype:int,videomutetype:bigint,owneruid:bigint,extra:string>",
  "rows": 1203317,
  "stripe count": 12,
  "format": "0.12", "writer version": "future - 9",
  "compression": "snappy", "compression block": 65536,
  "file length": 529207118,
  "content": 529182229, "stripe stats": 21150, "footer": 3712, "postscript": 26,
  "row index stride": 10000,
  "user metadata": {
    "org.apache.spark.version": "3.3.2"
  },
  "stripes": [
    { "stripe": 0, "rows": 117760,
      "offset": 3, "length": 50876922,
      "index": 23728, "data": 50851823, "footer": 1371
    },
    { "stripe": 1, "rows": 117760,
      "offset": 50876925, "length": 50948680,
      "index": 23679, "data": 50923619, "footer": 1382
    },
    { "stripe": 2, "rows": 62050,
      "offset": 101825605, "length": 26902880,
      "index": 15322, "data": 26886211, "footer": 1347
    },
    { "stripe": 3, "rows": 117760,
      "offset": 128728485, "length": 50474083,
      "index": 24110, "data": 50448601, "footer": 1372
    },
    { "stripe": 4, "rows": 117760,
      "offset": 179202568, "length": 50413042,
      "index": 23858, "data": 50387825, "footer": 1359
    },
    { "stripe": 5, "rows": 63570,
      "offset": 229615610, "length": 27504277,
      "index": 14890, "data": 27488029, "footer": 1358
    },
    { "stripe": 6, "rows": 117760,
      "offset": 268435456, "length": 50981984,
      "index": 24191, "data": 50956424, "footer": 1369
    },
    { "stripe": 7, "rows": 117760,
      "offset": 319417440, "length": 51017894,
      "index": 23792, "data": 50992731, "footer": 1371
    },
    { "stripe": 8, "rows": 61720,
      "offset": 370435334, "length": 26840720,
      "index": 15246, "data": 26824109, "footer": 1365
    },
    { "stripe": 9, "rows": 117760,
      "offset": 397276054, "length": 49971095,
      "index": 23487, "data": 49946233, "footer": 1375
    },
    { "stripe": 10, "rows": 117760,
      "offset": 447247149, "length": 50259825,
      "index": 24090, "data": 50234369, "footer": 1366
    },
    { "stripe": 11, "rows": 73897,
      "offset": 497506974, "length": 31675255,
      "index": 16948, "data": 31656952, "footer": 1355
    }
  ]
}

@wpleonardo

Yes, we have a performance micro-benchmark for this PR. With ORC's default aligned fixed bit widths, AVX512 bit-unpacking performs almost the same as non-AVX512. With unaligned bit widths, AVX512 bit-unpacking shows almost a 6x gain over non-AVX512, and its performance is close to that of non-AVX512 with aligned fixed bit widths.
So, maybe you could check whether the ClickHouse ORC setting uses aligned bit widths.
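
To illustrate what an "aligned" bit width means here, the sketch below rounds a required width up to an aligned one. The set of aligned widths (1, 2, 4, 8, then multiples of 8 up to 64) is my understanding of how ORC's RLEv2 writer aligns widths and should be treated as an assumption. The function names are hypothetical.

```cpp
#include <cassert>

// Rounds `n` bits up to the nearest aligned width, assuming the aligned set
// is 1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64.
unsigned closestAlignedFixedBits(unsigned n) {
  assert(n <= 64);
  if (n <= 1) return 1;
  if (n <= 2) return 2;
  if (n <= 4) return 4;
  if (n <= 8) return 8;
  return ((n + 7) / 8) * 8;  // next multiple of 8
}

// With an aligned width, no value straddles a byte boundary: the width
// either divides 8 or is a multiple of 8.
bool neverStraddlesBytes(unsigned bitWidth) {
  return bitWidth != 0 && (8 % bitWidth == 0 || bitWidth % 8 == 0);
}
```

Under this assumption a value needing 10 bits is stored in 16 when alignment is on, so the scalar decoder already works on whole bytes and the vector path has little left to gain. That would be consistent with the benchmark behavior described above.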

@taiyang-li

@wpleonardo I tried, but still find no improvement

orc file(snappy + unaligned) + avx512
0 rows in set. Elapsed: 3.478 sec. Processed 1.20 million rows, 539.37 MB (345.98 thousand rows/s., 155.08 MB/s.)
0 rows in set. Elapsed: 3.424 sec. Processed 1.20 million rows, 539.37 MB (351.44 thousand rows/s., 157.53 MB/s.)
0 rows in set. Elapsed: 3.444 sec. Processed 1.20 million rows, 539.37 MB (349.44 thousand rows/s., 156.63 MB/s.)


orc file (snappy + unaligned) +  none
0 rows in set. Elapsed: 3.362 sec. Processed 1.20 million rows, 539.37 MB (357.89 thousand rows/s., 160.42 MB/s.)
0 rows in set. Elapsed: 3.535 sec. Processed 1.20 million rows, 539.37 MB (340.43 thousand rows/s., 152.59 MB/s.)
0 rows in set. Elapsed: 3.370 sec. Processed 1.20 million rows, 539.37 MB (357.08 thousand rows/s., 160.06 MB/s.)
 

orc file (lz4 + unaligned) + avx512
0 rows in set. Elapsed: 3.075 sec. Processed 1.20 million rows, 1.90 GB (391.26 thousand rows/s., 618.31 MB/s.)
0 rows in set. Elapsed: 3.082 sec. Processed 1.20 million rows, 1.90 GB (390.46 thousand rows/s., 617.05 MB/s.)
0 rows in set. Elapsed: 3.014 sec. Processed 1.20 million rows, 1.90 GB (399.18 thousand rows/s., 630.82 MB/s.)


orc file (lz4 + unaligned) + none 
rows in set. Elapsed: 2.973 sec. Processed 1.20 million rows, 1.90 GB (404.76 thousand rows/s., 639.64 MB/s.)
0 rows in set. Elapsed: 3.070 sec. Processed 1.20 million rows, 1.90 GB (391.90 thousand rows/s., 619.32 MB/s.)
0 rows in set. Elapsed: 2.903 sec. Processed 1.20 million rows, 1.90 GB (414.51 thousand rows/s., 655.05 MB/s.)

@wpleonardo

Could you do a simple test first, for example, just select the int64 column instead of all columns?


taiyang-li commented Oct 12, 2023

@wpleonardo still find no improvement if just select int64 type columns.

Q: select reporttime,appid,uid,clientversioncode,sdkversioncode,starttimestamp,backgroundtotal,foregroundtotal,firstiframesize,firstiframedecodetime,extras,mclientip,mnc,mcc,vsipsuccess,msipsuccess,vsipfail,msipfail,mediaflag,videomutetype,owneruid from file('lz4_new_bigolive_audience_stats_orc.orc') format Null;

without avx512:

localhost:9001, queries: 20, QPS: 2.256, RPS: 2715210.217, MiB/s: 4092.049, result RPS: 0.000, result MiB/s: 0.000.

0.000%		0.421 sec.	
10.000%		0.423 sec.	
20.000%		0.425 sec.	
30.000%		0.429 sec.	
40.000%		0.433 sec.	
50.000%		0.440 sec.	
60.000%		0.440 sec.	
70.000%		0.442 sec.	
80.000%		0.443 sec.	
90.000%		0.456 sec.	
95.000%		0.457 sec.	
99.000%		0.464 sec.	
99.900%		0.464 sec.	
99.990%		0.464 sec.	

with avx512

localhost:9001, queries: 20, QPS: 2.216, RPS: 2665968.958, MiB/s: 4017.839, result RPS: 0.000, result MiB/s: 0.000.

0.000%		0.423 sec.	
10.000%		0.429 sec.	
20.000%		0.431 sec.	
30.000%		0.434 sec.	
40.000%		0.438 sec.	
50.000%		0.442 sec.	
60.000%		0.448 sec.	
70.000%		0.451 sec.	
80.000%		0.453 sec.	
90.000%		0.469 sec.	
95.000%		0.473 sec.	
99.000%		0.482 sec.	
99.900%		0.482 sec.	
99.990%		0.482 sec.	

@wpleonardo

Could you debug your program to check whether ORC is actually using AVX512 bit-unpacking, for example by checking whether the function "BitUnpackAVX512::readLongs" is invoked when you execute the query?
If it is, run "perf top" to see what proportion of the hotspots falls in the AVX512 bit-unpacking functions, for example "vectorUnpack x".

cxzl25 pushed a commit to cxzl25/orc that referenced this pull request Jan 11, 2024
… instructions

### What changes were proposed in this pull request?

In the original ORC Rle-bit-packing, it decodes value one by one, and Intel AVX-512 brings the capabilities of 512-bit vector operations to accelerate the Rle-bit-packing decode process. We only need execute much less CPU instructions to unpacking more data than usual. So the performance of AVX-512 vector decode is much better than before. In the funcational micro-performance test I suppose AVX-512 vector decode could bring average 6X ~ 7X performance latency improvement compare vector function vectorUnpackX with the original Rle-bit-packing decode function plainUnpackLongs. In the real world, user will store large data with ORC data format, and need to decoding hundreds or thousands of bytes, AVX-512 vector decode will be more efficient and help to improve this processing.

In the real world, values stored in ORC are usually less than 32 bits wide, so this PR provides vector decode for value widths below 32 bits. For values that are 8 bits, 16 bits, or another multiple of 8 bits wide, the performance improvement is relatively small compared with widths that are not a multiple of 8.

Intel AVX512 instructions official link:
https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html

1. Added a cmake option named "BUILD_ENABLE_AVX512" to enable or disable this feature at build time.
The default value of BUILD_ENABLE_AVX512 is OFF.
For example: cmake .. -DCMAKE_BUILD_TYPE=release -DBUILD_ENABLE_AVX512=ON
This builds the ORC library with AVX512 bit-unpacking enabled.
2. Added the macro "ORC_HAVE_RUNTIME_AVX512" to control whether this feature's code is compiled into ORC.
3. Added the file "CpuInfoUtil.cc" to detect dynamically whether the current platform supports AVX-512. When ORC is built with AVX-512 enabled but the platform it runs on doesn't support AVX-512, it uses the original bit-packing decode function instead of AVX-512 vector decode.
4. Added the functions "vectorUnpackX" to decode X-bit values in place of the original function plainUnpackLongs.
5. Added the test cases "RleV2BitUnpackAvx512Test" in the new test file TestRleVectorDecoder.cc to verify AVX-512 vector decode of N-bit values.
6. Modified the function plainUnpackLongs by adding an output parameter uint64_t& startBit, which stores the number of bits left over after unpacking.
7. AVX-512 vector decode processes 512 bits of data per unpacking step. If the data being unpacked is long enough, almost all of it can be processed by AVX-512. If the data length (or block size) is too short, less than 512 bits, AVX-512 is not used and decoding falls back to the original one-by-one path.
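
The one-by-one path and the fallback rule in item 7 can be illustrated with a minimal sketch. This is not the ORC implementation: `unpackScalar` and `shouldUseVectorPath` are hypothetical names, the MSB-first bit order is assumed from the ORC RLE v2 format, and the exact 512-bit threshold check is an assumption based only on the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the ORC API): unpack `count` values of `bitWidth`
// bits each from `in`, reading bits MSB-first, one value at a time.
std::vector<int64_t> unpackScalar(const std::vector<uint8_t>& in, size_t count,
                                  uint32_t bitWidth) {
  assert(bitWidth >= 1 && bitWidth <= 32);
  assert(in.size() * 8 >= count * bitWidth);
  std::vector<int64_t> out;
  out.reserve(count);
  size_t bitPos = 0;  // absolute bit offset into the input buffer
  for (size_t i = 0; i < count; ++i) {
    uint64_t value = 0;
    for (uint32_t b = 0; b < bitWidth; ++b, ++bitPos) {
      uint8_t byte = in[bitPos / 8];
      uint64_t bit = (byte >> (7 - bitPos % 8)) & 1u;
      value = (value << 1) | bit;
    }
    out.push_back(static_cast<int64_t>(value));
  }
  return out;
}

// Hypothetical dispatch rule (assumption): take the vector path only when the
// CPU supports AVX-512 and at least one full 512-bit chunk of packed data exists.
bool shouldUseVectorPath(bool cpuHasAvx512, size_t count, uint32_t bitWidth) {
  return cpuHasAvx512 && count * bitWidth >= 512;
}
```

For example, the three bytes 0x29 0xCB 0xB8 hold eight 3-bit values, which is only 24 bits of packed data, so under this sketch they would always go through the scalar path.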

### Why are the changes needed?
This improves the performance of Rle-bit-packing decode. In the functional micro-performance test, comparing the vector functions vectorUnpackX with the original Rle-bit-packing decode function plainUnpackLongs, I estimate that AVX-512 vector decode brings an average 6X ~ 7X latency improvement.
Intel improves CPU performance every year, and users run ORC-based data analysis on newer platforms. The Intel SKX platform already supported AVX512 instructions 6 years ago. ORC's data unpacking should therefore take advantage of this widely available CPU feature so that ORC keeps pace with the hardware.

### How to enable AVX512 Bit-unpacking?
1. Enable the cmake option BUILD_ENABLE_AVX512 to build the ORC library with AVX512 enabled:
cmake .. -DCMAKE_BUILD_TYPE=release -DBUILD_ENABLE_AVX512=ON
2. Set the environment variable when using the ORC library:
export ORC_USER_SIMD_LEVEL=AVX512
(Note: this variable accepts only 2 values, "AVX512" and "none", and the value is case-insensitive.)
If ORC_USER_SIMD_LEVEL=none is set, AVX512 bit-unpacking is disabled.
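
As a rough illustration of the case-insensitive handling described in the note, the sketch below parses the variable's value. `SimdLevel`, `parseSimdLevel`, and `simdLevelFromEnv` are hypothetical names, not the ORC API. Treating unrecognized values as "none" is an assumption, and the behavior when the variable is unset is left to the caller because the description above does not specify it.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <string>

enum class SimdLevel { None, Avx512 };

// Hypothetical parser: compares case-insensitively by upper-casing first.
// Assumption: any value other than "AVX512" is treated as "none".
SimdLevel parseSimdLevel(std::string value) {
  std::transform(value.begin(), value.end(), value.begin(),
                 [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
  return value == "AVX512" ? SimdLevel::Avx512 : SimdLevel::None;
}

// Reads ORC_USER_SIMD_LEVEL; `whenUnset` is supplied by the caller because
// the unset behavior is not stated in the PR description.
SimdLevel simdLevelFromEnv(SimdLevel whenUnset) {
  const char* raw = std::getenv("ORC_USER_SIMD_LEVEL");
  return raw == nullptr ? whenUnset : parseSimdLevel(raw);
}
```

With this sketch, `parseSimdLevel("avx512")` and `parseSimdLevel("AVX512")` both select the AVX512 level, while `parseSimdLevel("None")` disables it.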

### How was this patch tested?
I created a new test file, TestRleVectorDecoder.cc, which contains the test cases below. To run them, enable the cmake option -DBUILD_ENABLE_AVX512=ON and run on a platform that supports AVX-512. Every test case covers 2 scenarios:
1. The blockSize increases from 1 to 10000, and the data length is 10240;
2. The blockSize increases from 1000 to 10000, and the data length increases from 1000 to 70000.
The test cases take a while to run, so I added a progress bar to each one.
Here is sample progress bar output from one test case:
[ RUN      ] OrcTest/RleVectorTest.RleV2_basic_vector_decode_10bit/1
10bit Test 1st Part:[OK][#################################################################################][100%]
10bit Test 2nd Part:[OK][#################################################################################][100%]
For the main vector functions vectorUnpackX, test code coverage reaches 100%.

This closes apache#1375