-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-3849: [C++] Leverage Armv8 crc32 extension instructions to accelerate the hash computation for Arm64 #3010
Conversation
Unit Tests are passed on Arm64 and x86(Xeon E5-2650): Arm64:
x86:
|
+1 re basing on #3005. I will work on reviewing that today so we can merge soon |
@guyuqi what arm64 platform / system are you running this one? I wonder if there are public CI services available that we could use for testing on ARM |
This was merged to master so you can rebase on master now |
@@ -231,6 +231,26 @@ if (APPLE) | |||
set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -stdlib=libc++") | |||
endif() | |||
|
|||
if (CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64|AARCH64") | |||
CHECK_CXX_SOURCE_COMPILES(" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this required?
By the way there are CRC intrinsics for ARM64 and an accompanying preprocessor macro. See example at https://stackoverflow.com/questions/37066261/why-is-arm-feature-crc32-not-being-defined-by-the-compiler
(more doc at https://gcc.gnu.org/onlinedocs/gcc/ARM-C-Language-Extensions-_0028ACLE_0029.html#ARM-C-Language-Extensions-_0028ACLE_0029 apparently)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your comments.
First I request crc extension capabilities from the assembler , and then detect the flags "-march=armv8-a+crc" for Arm64 crc intrinsics. I think it's also a reliable solution. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the point is, it adds gratuitous complication. Testing for -march=armv8-a+crc
should be sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For not all compilers support the Armv8 crc intrinsics, it is not sufficient just to test for
-march=armv8-a+crc
.
If some compiler do not support intrinsics, it is needed to request crc extension capabilities from the assembler. So Arm asm : crc32cx
will be tested first, and then we test for -march=armv8-a+crc
to detect whether compiler supports intrinsics or not.
1. not support Arm crc
2. support Arm crc, but not support intrinsics
3. support intrinsics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which compilers don't support intrinsics? Apparently they are a ARM standard.
cpp/src/arrow/util/config.in.cmake
Outdated
#define MY_CONFIG_H | ||
|
||
/* Support Armv8 CRC instructions */ | ||
#cmakedefine ARROW_USE_ARMCE 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file shouldn't be required. Instead you can add preprocessor definitions directly in the cmake files, using e.g. the add_definitions command.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really thanks for comments.
'add_definitions' looks like a switch to enable/disable the macro by the command like
"cmake -Dxxx=on/off".
In my opinion, it's better to detect whether the macro is supported or not on cmake phase and dynamically to define or not define the Macro related. And it is also convenient and explicit to manage the macro definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the doc says, it "Adds definitions to the compiler command line" (emphasis mine). So you can use it to define a preprocessor symbol.
bdcf7b7
to
32f5f4f
Compare
cpp/src/arrow/util/armce-util.h
Outdated
|
||
// define our own implementations of the intrinsics instead. | ||
static inline uint32_t ARMCE_crc32_u8(uint32_t crc, uint8_t value) { | ||
__asm__("crc32cb %w[c], %w[c], %w[v]":[c]"+r"(crc):[v]"r"(value)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather only use the official intrinsics if possible. Inline assembler adds a maintenance burden.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inline asm is for the compilers that do not support intrinsics.
cpp/src/arrow/util/armce-util.h
Outdated
#endif | ||
static inline uint32_t crc32c_runtime_check(void) | ||
{ | ||
unsigned long auxv = getauxval(AT_HWCAP); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our style guide mandates 2-space indentation. I recommend you run the style checker using make clang-format
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, thanks. I'll fix it.
cpp/src/arrow/util/hash-util.h
Outdated
@@ -67,14 +76,37 @@ class HashUtil { | |||
return hash; | |||
} | |||
|
|||
static uint32_t CrcHashARMCE(const void* data, int32_t nbytes, uint32_t hash) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to reconcile the two functions instead of duplicating the code with different function names?
Same below for DoubleCRCHash
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, How about separating the code by using 'Arch macro' in CrcHash/DoubleCRCHash?
Like this:
static uint32_t CrcHash(...) {
.
..
...
while (p <= end - 4) {
#ifdef ARROW_HAVE_SSE4_2
hash = SSE4_crc32_u32(hash, *reinterpret_cast<const uint32_t*>(p));
#elif ARROW_HAVE_ARMCE
hash = ARMCE_crc32_u64(hash, *reinterpret_cast<const uint64_t*>(p));
#endif
p += 4;
}
...
...
.
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather have the switch outside of the functions. For example you could have:
#ifdef ARROW_HAVE_SSE4_2
#define HW_crc32_u32 SSE4_crc32_u32
#elif defined(ARROW_HAVE_ARMCE)
#define HW_crc32_u32 ARMCE_crc32_u32
#endif
By the way, "ARMCE" doesn't look like a recognized acronym, so I'd rather have e.g. ARROW_HAVE_ARM_CRC
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it makes sense.
cpp/src/arrow/util/hash-util.h
Outdated
template <> | ||
inline int HashUtil::DoubleHash<USE_ARMCRC>(const void* data, int32_t bytes, uint32_t seed) { | ||
// Need run time check | ||
if (crc32c_runtime_check()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for the record, is the runtime check cheap? The hash function will often be called for smallish strings, e.g. 20 to 40 bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does it mean for "cheap"?
Sorry, I'm not familiar with the hash function calling scenario.
If the hash function is called very frequently, the run time check should be called just once to avoid performance degradation. I will try to move the run time check
to a init
routine if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may want to find out by running the hashing-benchmark
with Arrow compiled in release mode. See if this PR actually gives a speedup :-)
(also, it would be nice to post the results so we non-ARM users can get an idea of the performance we're talking about)
cpp/src/arrow/util/hashing.h
Outdated
return ScalarHelper<uint64_t, AlgNum>::ComputeHash(h); | ||
} else { | ||
// Fall back on 64-bit Murmur2 for longer strings. | ||
// It has decent speed for medium-sized strings. There may be faster | ||
// hashes on long strings such as xxHash, but that may not matter much | ||
// for the typical length distribution of hash keys. | ||
return HashUtil::MurmurHash2_64(data, static_cast<int>(length), AlgNum); | ||
return HashUtil::Hash<USE_DEFAULT>(data, static_cast<int>(length), AlgNum); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather keep the explicit fallback to MurmurHash here, rather than Hash<USE_DEFAULT>
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's ok. I'll follow it.
423000b
to
b13ee1f
Compare
Update the PR basing on above comments . Average of 10 times:
Disable Arm CRC:
It shows ~8% improvement. I also run the |
Thanks @guyuqi. Can you run
|
I re-run the hashing-benchmark:
Disable Arm64 CRC32 extension:
It seems |
Very nice :-) |
By the way, which CPU / platform are you testing on? |
The code looks good to me now. What's remaining is that you need to fix the style issues spotted in CI: To do it locally (rather than wait for CI to validate your changes), best is to call |
…the hash computation for Arm64 The Hash utility leverages SSE4 to accelerate the Crc32 data hash computation for x86. Correspondingly, we will leverage the Arm crc32 extension instructions to accelerate the hash computation for AArch64. Change-Id: I7da36f8da8d1c32f10eef33a664e5e230f214c59 Signed-off-by: Yuqi Gu <[email protected]>
Change-Id: I3044a89bae619968e340636996f014a0134f1030 Signed-off-by: Yuqi Gu <[email protected]>
d86aee1
to
7ddeb0b
Compare
Change-Id: I02e8a52935376b4ecc700cd17edaeb871b2bd487
Codecov Report
@@ Coverage Diff @@
## master #3010 +/- ##
===========================================
+ Coverage 67.32% 88.06% +20.74%
===========================================
Files 58 425 +367
Lines 3718 64888 +61170
===========================================
+ Hits 2503 57146 +54643
- Misses 1114 7742 +6628
+ Partials 101 0 -101 Continue to review full report at Codecov.
|
Fix the coding style and rebase PR on master for conflict with #3037 . |
Thanks @guyuqi ! |
The 'Hash utility' leverages SSE4 to accelerate the Crc32 data hash computation for x86.
Correspondingly, we will leverage the Arm crc32 extension instructions
to accelerate the hash computation for Arm64.