ORC-1356: [C++] Use Intel AVX-512 instructions to accelerate the Rle-bit-packing decode #1375
Conversation
ORC bit-packing performance. Only contains the 1~32-bit optimization.
Could you make CI happy, @wpleonardo?
Welcome to the Apache ORC community, @wpleonardo! This feature looks promising. Will take a look this week.
I only did a preliminary review and left some comments, mostly cosmetic. Will take a deeper look later.
It seems there is still a formatting issue:
c++/test/TestRleVectorDecoder.cc:29:1: error: code should be clang-formatted [-Wclang-format-violations]
#include "wrap/orc-proto-wrapper.hh"
^
c++/test/TestRleVectorDecoder.cc:38:51: error: code should be clang-formatted [-Wclang-format-violations]
const int DEFAULT_MEM_STREAM_SIZE = 1024 * 1024; // 1M
...
2. Dynamically detect whether the current compiler and platform support AVX512; 3. Change the default value of the build option BUILD_ENABLE_AVX512 to "ON"; 4. Add the build options for the file TestRleVectorDecoder.cc and try to fix the clang-format build issue.
If you don't mind, could you test this locally first?
96 warnings and 20 errors generated.
make[2]: *** [c++/src/CMakeFiles/orc.dir/build.make:461: c++/src/CMakeFiles/orc.dir/RleDecoderV2.cc.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:433: c++/src/CMakeFiles/orc.dir/all] Error 2
2. Change some options in CMakeLists.txt
…orDecoder.cc 2. Change the option CXX_COMMON_FLAGS to CMAKE_CXX_FLAGS
May I ask a question about the clang-format error on the file TestRleVectorDecoder.cc?
The clang-format configuration we use is defined here: https://github.com/apache/orc/blob/main/.clang-format. You can simply use …
Gentle ping, @wpleonardo.
Sorry, I was on holiday for the past few days. I will be back at work and follow your suggestions in the next few days.
Change how the bufferstart and bufferend parameters are passed.
Thanks @stiga-huang and @wpleonardo!
1. Code format change
Just fixed an AVX512 flag check issue on the Windows platform.
2. Fix an AVX512 flag check issue on Windows.
Modified the CMake file for the AVX512 check.
In cmake_modules/ConfigSimdLevel.cmake, changed check_cxx_source_compiles to check_cxx_source_runs to make sure an AVX512 program can actually run on the build machine.
check_cxx_source_runs hangs on the Windows platform when the CPU doesn't have the AVX512 flags.
Hi @wgtmac @dongjoon-hyun @coderex2522, the CI tests passed. Do you have any other comments? Thank you very much!
Because check_cxx_source_run will hang on Windows, changed check_cxx_source_run back to CHECK_CXX_SOURCE_COMPILES, and added "grep avx512f /proc/cpuinfo" to check the CPU flags.
Hi @dongjoon-hyun, welcome back from vacation! Do you have any other comments? Thank you very much!
I will merge it by the end of this week if there are no further comments.
Thank you very much for your help, Gang! ^_^
B&R,
Wang Peng
+1, LGTM.
cc @williamhyun
+1 LGTM
I have submitted this. Thanks all!
@wpleonardo Do we have any performance benchmarks for this PR? @alexey-milovidov Maybe you are interested in it. I tried to use this feature in ClickHouse (https://github.com/ClickHouse/ClickHouse), but can't see any performance improvement.
Q: With AVX512:
Without AVX512:
About the test ORC file:
Yes, we have a performance micro-benchmark for this PR. If you use ORC's default aligned fixed bit widths, AVX512 bit-unpacking has almost the same performance as non-AVX512. But if you use non-aligned bit widths, AVX512 bit-unpacking shows an almost 6X performance gain compared with non-AVX512, and its performance is close to that of non-AVX512 on aligned fixed bit widths.
@wpleonardo I tried, but still found no improvement.
Could you do a simple test first, for example, selecting just the int64 column instead of all columns?
@wpleonardo Still no improvement when selecting just the int64-type columns.
Q: without AVX512:
with AVX512:
Could you debug your program to check whether ORC is using AVX512 bit-unpacking, for example, by checking whether the function "BitUnpackAVX512::readLongs" is invoked when you execute the query?
What changes were proposed in this pull request?
In the original ORC Rle-bit-packing, values are decoded one by one. Intel AVX-512 brings 512-bit vector operations that accelerate the Rle-bit-packing decode: far fewer CPU instructions are needed to unpack the same amount of data, so the AVX-512 vector decode is much faster than before. In a functional micro-performance test, the AVX-512 vector decode functions (vectorUnpackX) showed an average 6X~7X latency improvement compared with the original Rle-bit-packing decode function plainUnpackLongs. In the real world, users store large amounts of data in the ORC format and need to decode hundreds or thousands of bytes, so the AVX-512 vector decode is more efficient and speeds up this processing.
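To make the idea concrete, below is a minimal, self-contained C++ sketch of scalar versus AVX-512 unpacking for the simple case of 16-bit values. It is not the PR's vectorUnpackX code (ORC's packed layout and variable bit widths are more involved); it assumes little-endian 16-bit values packed back to back, a compiler flag such as -mavx512f, and a CPU with AVX-512F. The function names unpack16Scalar/unpack16Avx512 are invented for illustration.

#include <immintrin.h>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Scalar path: decode one 16-bit value per iteration.
void unpack16Scalar(const uint8_t* in, int64_t* out, size_t count) {
  for (size_t i = 0; i < count; ++i) {
    uint16_t v;
    std::memcpy(&v, in + 2 * i, sizeof(v));
    out[i] = static_cast<int64_t>(v);
  }
}

// AVX-512 path: decode 8 values per iteration with one load, one widen, one store.
void unpack16Avx512(const uint8_t* in, int64_t* out, size_t count) {
  size_t i = 0;
  for (; i + 8 <= count; i += 8) {
    // Load 8 packed 16-bit values (128 bits)...
    __m128i packed = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in + 2 * i));
    // ...zero-extend them to 8 x 64-bit lanes in a single instruction...
    __m512i widened = _mm512_cvtepu16_epi64(packed);
    // ...and store 512 bits of output at once.
    _mm512_storeu_si512(out + i, widened);
  }
  unpack16Scalar(in + 2 * i, out + i, count - i);  // scalar tail for the remainder
}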
In practice, the value bit width in ORC is usually less than 32 bits, so this PR supplies the vector decode for value sizes below 32 bits. For values of 8 bits, 16 bits, or other multiples of 8 bits, the performance improvement is relatively small compared with non-8x bit widths.
Official link for the Intel AVX512 instructions:
https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
1. Added a CMake option named BUILD_ENABLE_AVX512 to enable or disable this feature at build time. The default value of BUILD_ENABLE_AVX512 is OFF. For example:
cmake .. -DCMAKE_BUILD_TYPE=release -DBUILD_ENABLE_AVX512=ON
This builds the ORC library with AVX512 bit-unpacking enabled.
2. Added the macro ORC_HAVE_RUNTIME_AVX512 to control whether this feature's code is built in ORC.
3. Added the new file CpuInfoUtil.cc to detect at runtime whether the current platform supports AVX-512. If ORC is built with AVX-512 enabled but the platform it runs on does not support AVX-512, the original bit-packing decode functions are used instead of the AVX-512 vector decode. (A minimal sketch of such a runtime check follows this list.)
4. Added the vectorUnpackX functions to decode X-bit values in place of the original plainUnpackLongs function.
5. Added the RleV2BitUnpackAvx512Test test cases to verify the N-bit AVX-512 vector decode in the new test file TestRleVectorDecoder.cc.
6. Modified the function plainUnpackLongs and added an output parameter uint64_t& startBit, which stores the number of bits left over after unpacking.
7. The AVX-512 vector decode processes 512 bits of data per unpacking step. If the current run of data is long enough, almost all of it can be processed with AVX-512; if the data length (or block size) is shorter than 512 bits, it falls back to the original one-by-one decode.
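As a rough illustration of item 3 (not ORC's actual CpuInfoUtil.cc code, and assuming a GCC/Clang build where __builtin_cpu_supports is available), the runtime decision could look like the following sketch, which also honors the ORC_USER_SIMD_LEVEL variable described below:

#include <cstdlib>
#include <cstring>

// Sketch only: return true if the CPU reports AVX-512F and the user has not
// disabled it via ORC_USER_SIMD_LEVEL=none. The real check in CpuInfoUtil.cc
// is more thorough (and the env value is case-insensitive); this handles just
// two spellings to stay short.
bool mayUseAvx512BitUnpack() {
  if (!__builtin_cpu_supports("avx512f")) {  // GCC/Clang builtin; not available on MSVC
    return false;
  }
  const char* level = std::getenv("ORC_USER_SIMD_LEVEL");
  if (level != nullptr &&
      (std::strcmp(level, "none") == 0 || std::strcmp(level, "NONE") == 0)) {
    return false;
  }
  return true;
}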
Why are the changes needed?
This improves the performance of the Rle-bit-packing decode. In a functional micro-performance test, the AVX-512 vector decode functions (vectorUnpackX) showed an average 6X~7X latency improvement compared with the original Rle-bit-packing decode function plainUnpackLongs.
Intel improves CPU performance every year, and users analyze data stored in the ORC format on these newer platforms. Intel SKX platforms have already supported AVX512 instructions for 6 years, so upgrading ORC's data unpacking to take advantage of this widely available CPU feature keeps ORC up to date.
How to enable AVX512 Bit-unpacking?
1. Enable the CMake option BUILD_ENABLE_AVX512; this builds the ORC library with AVX512 enabled:
cmake .. -DCMAKE_BUILD_TYPE=release -DBUILD_ENABLE_AVX512=ON
2. Set the environment variable when using the ORC library:
export ORC_USER_SIMD_LEVEL=AVX512
(Note: this variable accepts only two values, "AVX512" and "none", and the value is case-insensitive.)
If ORC_USER_SIMD_LEVEL=none is set, AVX512 bit-unpacking is disabled.
How was this patch tested?
I created a new test file, TestRleVectorDecoder.cc. It contains the test cases below; enable the CMake option -DBUILD_ENABLE_AVX512=ON and run them on a platform that supports AVX-512. Every test case contains 2 scenarios (a small sketch of this testing pattern is included at the end of this section):
1. The blockSize increases from 1 to 10000 and the data length is 10240;
2. The blockSize increases from 1000 to 10000 and the data length increases from 1000 to 70000.
The test cases run for a while, so I added a progress bar for every test case.
Here is a progress bar demo print of one testcase:
[ RUN ] OrcTest/RleVectorTest.RleV2_basic_vector_decode_10bit/1
10bit Test 1st Part:[OK][#################################################################################][100%]
10bit Test 2nd Part:[OK][#################################################################################][100%]
For the main vector functions vectorUnpackX, the test code coverage is up to 100%.
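For illustration, here is a small, self-contained GTest sketch (not the actual TestRleVectorDecoder.cc, which drives ORC's real RLEv2 decoder) of the blockSize-sweep pattern from scenario 1 above: it packs little-endian 16-bit values, decodes them blockSize values at a time, and checks the round trip. The helpers pack16/unpack16 and the sweep step of 499 are invented to keep the sketch short; the real tests sweep every block size.

#include <gtest/gtest.h>
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helpers for the sketch: pack/unpack little-endian 16-bit values.
static void pack16(const std::vector<int64_t>& in, std::vector<uint8_t>& out) {
  out.resize(in.size() * 2);
  for (size_t i = 0; i < in.size(); ++i) {
    out[2 * i] = static_cast<uint8_t>(in[i] & 0xff);
    out[2 * i + 1] = static_cast<uint8_t>((in[i] >> 8) & 0xff);
  }
}

static void unpack16(const uint8_t* in, int64_t* out, size_t count) {
  for (size_t i = 0; i < count; ++i) {
    out[i] = static_cast<int64_t>(in[2 * i]) |
             (static_cast<int64_t>(in[2 * i + 1]) << 8);
  }
}

TEST(BitUnpackSketch, BlockSizeSweep) {
  const size_t dataLength = 10240;  // scenario 1: fixed data length
  std::vector<int64_t> values(dataLength);
  for (size_t i = 0; i < dataLength; ++i) {
    values[i] = static_cast<int64_t>(i % 65536);
  }
  std::vector<uint8_t> packed;
  pack16(values, packed);

  for (size_t blockSize = 1; blockSize <= 10000; blockSize += 499) {
    std::vector<int64_t> decoded(dataLength);
    // Drive the decoder blockSize values at a time, mirroring how the real
    // tests feed the RLE decoder, then verify the round trip is lossless.
    for (size_t pos = 0; pos < dataLength; pos += blockSize) {
      size_t n = std::min(blockSize, dataLength - pos);
      unpack16(packed.data() + 2 * pos, decoded.data() + pos, n);
    }
    EXPECT_EQ(values, decoded);
  }
}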