rANS_static32x16pr_neon.c doesn't build on AArch32 #81

clausecker · 2023-04-17T21:05:05Z

Your code base has NEON-optimised code in rANS_static32x16pr_neon.c. Unfortunately, this code only builds on AArch64 (arm64). It does not build on AArch32 as many of the intrinsics used therein are not supported in AArch32 mode:

htscodecs/htscodecs/rANS_static32x16pr_neon.c:200:31: warning: implicit declaration of function 'vtrn1q_u32' is invalid in C99 [-Wimplicit-function-declaration]
            uint32x4_t A1_1 = vtrn1q_u32(A_1, B_1);
                              ^
htscodecs/htscodecs/rANS_static32x16pr_neon.c:200:24: error: initializing 'uint32x4_t' (vector of 4 'uint32_t' values) with an expression of incompatible type 'int'
            uint32x4_t A1_1 = vtrn1q_u32(A_1, B_1);
                       ^      ~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:201:24: error: initializing 'uint32x4_t' (vector of 4 'uint32_t' values) with an expression of incompatible type 'int'
            uint32x4_t C1_1 = vtrn1q_u32(C_1, D_1);
                       ^      ~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:202:31: warning: implicit declaration of function 'vtrn2q_u32' is invalid in C99 [-Wimplicit-function-declaration]
            uint32x4_t A2_1 = vtrn2q_u32(A_1, B_1);
                              ^
htscodecs/htscodecs/rANS_static32x16pr_neon.c:202:24: error: initializing 'uint32x4_t' (vector of 4 'uint32_t' values) with an expression of incompatible type 'int'
            uint32x4_t A2_1 = vtrn2q_u32(A_1, B_1);
                       ^      ~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:203:24: error: initializing 'uint32x4_t' (vector of 4 'uint32_t' values) with an expression of incompatible type 'int'
            uint32x4_t C2_1 = vtrn2q_u32(C_1, D_1);
                       ^      ~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:207:39: warning: implicit declaration of function 'vtrn1q_u64' is invalid in C99 [-Wimplicit-function-declaration]
            uint32x4_t Xmaxv1=u32_u64(vtrn1q_u64(u64_u32(A1_1),u64_u32(C1_1)));
                                      ^
htscodecs/htscodecs/rANS_static32x16pr_neon.c:207:31: error: passing 'int' to parameter of incompatible type 'uint64x2_t' (vector of 2 'uint64_t' values)
            uint32x4_t Xmaxv1=u32_u64(vtrn1q_u64(u64_u32(A1_1),u64_u32(C1_1)));
                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:205:42: note: expanded from macro 'u32_u64'
#define u32_u64(x) vreinterpretq_u32_u64((x))
                                         ^~~
/usr/lib/clang/13.0.0/include/arm_neon.h:32841:50: note: passing argument to parameter '__p0' here
__ai uint32x4_t vreinterpretq_u32_u64(uint64x2_t __p0) {
                                                 ^
htscodecs/htscodecs/rANS_static32x16pr_neon.c:208:39: warning: implicit declaration of function 'vtrn2q_u64' is invalid in C99 [-Wimplicit-function-declaration]
            uint32x4_t Biasv1=u32_u64(vtrn2q_u64(u64_u32(A1_1),u64_u32(C1_1)));
                                      ^
htscodecs/htscodecs/rANS_static32x16pr_neon.c:208:31: error: passing 'int' to parameter of incompatible type 'uint64x2_t' (vector of 2 'uint64_t' values)
            uint32x4_t Biasv1=u32_u64(vtrn2q_u64(u64_u32(A1_1),u64_u32(C1_1)));
                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:205:42: note: expanded from macro 'u32_u64'
#define u32_u64(x) vreinterpretq_u32_u64((x))
                                         ^~~
/usr/lib/clang/13.0.0/include/arm_neon.h:32841:50: note: passing argument to parameter '__p0' here
__ai uint32x4_t vreinterpretq_u32_u64(uint64x2_t __p0) {
                                                 ^
htscodecs/htscodecs/rANS_static32x16pr_neon.c:209:31: error: passing 'int' to parameter of incompatible type 'uint64x2_t' (vector of 2 'uint64_t' values)
            uint32x4_t RFv1  =u32_u64(vtrn1q_u64(u64_u32(A2_1),u64_u32(C2_1)));
                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:205:42: note: expanded from macro 'u32_u64'
#define u32_u64(x) vreinterpretq_u32_u64((x))
                                         ^~~
/usr/lib/clang/13.0.0/include/arm_neon.h:32841:50: note: passing argument to parameter '__p0' here
__ai uint32x4_t vreinterpretq_u32_u64(uint64x2_t __p0) {
                                                 ^
htscodecs/htscodecs/rANS_static32x16pr_neon.c:210:31: error: passing 'int' to parameter of incompatible type 'uint64x2_t' (vector of 2 'uint64_t' values)
            uint32x4_t FSv1  =u32_u64(vtrn2q_u64(u64_u32(A2_1),u64_u32(C2_1)));
                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:205:42: note: expanded from macro 'u32_u64'
#define u32_u64(x) vreinterpretq_u32_u64((x))
                                         ^~~
/usr/lib/clang/13.0.0/include/arm_neon.h:32841:50: note: passing argument to parameter '__p0' here
__ai uint32x4_t vreinterpretq_u32_u64(uint64x2_t __p0) {
                                                 ^
htscodecs/htscodecs/rANS_static32x16pr_neon.c:217:24: error: initializing 'uint32x4_t' (vector of 4 'uint32_t' values) with an expression of incompatible type 'int'
            uint32x4_t A1_2 = vtrn1q_u32(A_2, B_2);
                       ^      ~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:218:24: error: initializing 'uint32x4_t' (vector of 4 'uint32_t' values) with an expression of incompatible type 'int'
            uint32x4_t C1_2 = vtrn1q_u32(C_2, D_2);
                       ^      ~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:219:24: error: initializing 'uint32x4_t' (vector of 4 'uint32_t' values) with an expression of incompatible type 'int'
            uint32x4_t A2_2 = vtrn2q_u32(A_2, B_2);
                       ^      ~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:220:24: error: initializing 'uint32x4_t' (vector of 4 'uint32_t' values) with an expression of incompatible type 'int'
            uint32x4_t C2_2 = vtrn2q_u32(C_2, D_2);
                       ^      ~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:222:31: error: passing 'int' to parameter of incompatible type 'uint64x2_t' (vector of 2 'uint64_t' values)
            uint32x4_t Xmaxv2=u32_u64(vtrn1q_u64(u64_u32(A1_2),u64_u32(C1_2)));
                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:205:42: note: expanded from macro 'u32_u64'
#define u32_u64(x) vreinterpretq_u32_u64((x))
                                         ^~~
/usr/lib/clang/13.0.0/include/arm_neon.h:32841:50: note: passing argument to parameter '__p0' here
__ai uint32x4_t vreinterpretq_u32_u64(uint64x2_t __p0) {
                                                 ^
htscodecs/htscodecs/rANS_static32x16pr_neon.c:223:31: error: passing 'int' to parameter of incompatible type 'uint64x2_t' (vector of 2 'uint64_t' values)
            uint32x4_t Biasv2=u32_u64(vtrn2q_u64(u64_u32(A1_2),u64_u32(C1_2)));
                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:205:42: note: expanded from macro 'u32_u64'
#define u32_u64(x) vreinterpretq_u32_u64((x))
                                         ^~~
/usr/lib/clang/13.0.0/include/arm_neon.h:32841:50: note: passing argument to parameter '__p0' here
__ai uint32x4_t vreinterpretq_u32_u64(uint64x2_t __p0) {
                                                 ^
htscodecs/htscodecs/rANS_static32x16pr_neon.c:224:31: error: passing 'int' to parameter of incompatible type 'uint64x2_t' (vector of 2 'uint64_t' values)
            uint32x4_t RFv2  =u32_u64(vtrn1q_u64(u64_u32(A2_2),u64_u32(C2_2)));
                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:205:42: note: expanded from macro 'u32_u64'
#define u32_u64(x) vreinterpretq_u32_u64((x))
                                         ^~~
/usr/lib/clang/13.0.0/include/arm_neon.h:32841:50: note: passing argument to parameter '__p0' here
__ai uint32x4_t vreinterpretq_u32_u64(uint64x2_t __p0) {
                                                 ^
htscodecs/htscodecs/rANS_static32x16pr_neon.c:225:31: error: passing 'int' to parameter of incompatible type 'uint64x2_t' (vector of 2 'uint64_t' values)
            uint32x4_t FSv2  =u32_u64(vtrn2q_u64(u64_u32(A2_2),u64_u32(C2_2)));
                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:205:42: note: expanded from macro 'u32_u64'
#define u32_u64(x) vreinterpretq_u32_u64((x))
                                         ^~~
/usr/lib/clang/13.0.0/include/arm_neon.h:32841:50: note: passing argument to parameter '__p0' here
__ai uint32x4_t vreinterpretq_u32_u64(uint64x2_t __p0) {
                                                 ^
htscodecs/htscodecs/rANS_static32x16pr_neon.c:232:31: warning: implicit declaration of function 'vaddvq_u32' is invalid in C99 [-Wimplicit-function-declaration]
            uint32_t imask1 = vaddvq_u32(vandq_u32(Cv1, bit));
                              ^
htscodecs/htscodecs/rANS_static32x16pr_neon.c:237:21: warning: implicit declaration of function 'vqtbl1_u8' is invalid in C99 [-Wimplicit-function-declaration]
            norm1 = vqtbl1_u8(vreinterpretq_u8_u32(Rv1),vtab[imask1]);
                    ^
htscodecs/htscodecs/rANS_static32x16pr_neon.c:237:19: error: assigning to 'uint8x8_t' (vector of 8 'uint8_t' values) from incompatible type 'int'
            norm1 = vqtbl1_u8(vreinterpretq_u8_u32(Rv1),vtab[imask1]);
                  ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:238:19: error: assigning to 'uint8x8_t' (vector of 8 'uint8_t' values) from incompatible type 'int'
            norm2 = vqtbl1_u8(vreinterpretq_u8_u32(Rv2),vtab[imask2]);
                  ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
htscodecs/htscodecs/rANS_static32x16pr_neon.c:264:31: warning: implicit declaration of function 'vmull_high_u32' is invalid in C99 [-Wimplicit-function-declaration]
            uint64x2_t qvh1 = vmull_high_u32(Rv1, RFv1);
                              ^
htscodecs/htscodecs/rANS_static32x16pr_neon.c:264:24: error: initializing 'uint64x2_t' (vector of 2 'uint64_t' values) with an expression of incompatible type 'int'
            uint64x2_t qvh1 = vmull_high_u32(Rv1, RFv1);
                       ^      ~~~~~~~~~~~~~~~~~~~~~~~~~
fatal error: too many errors emitted, stopping now [-ferror-limit=]

Please adapt your build system to build this code on AArch64 only. Alternatively, fix the code so these files compile on AArch32, too.

The text was updated successfully, but these errors were encountered:

jkbonfield · 2023-04-18T08:19:37Z

Is this building from a release, or from the develop branch at github?

I assume this is related to the changes in samtools/htslib#1587, but I cannot test it as I don't have any way of testing on an AArch32 system (or does -m32 do this on a 64-bit system?).

See also #78. It sounds like Rob's changes are to use compile time #ifdef __ARM_NEON checks. I don't understand why that would attempt to add NEON instructions on a CPU that doesn't include them. Any advice on how to detect this properly at compile time would be welcomed.

clausecker · 2023-04-18T09:21:56Z

This was found on 1.17, your latest release.

ARMv7 systems do generally support NEON, but NEON on AArch32 is only a subset of NEON on AArch64. The ARM intrinsics guides specify which NEON intrinsics are available on both and which are AArch64 only. The problem is that many of the intrinsics your code uses are available for AArch64 only. The macro __ARM_NEON only shows if any sort of NEON support is available. This is true on AArch32 too. But not all intrinsics are available.

jkbonfield · 2023-04-18T10:12:32Z

Thanks. It sounds like we need to check __aarch64__ in addition to __ARM_NEON then. Eg https://github.com/daviesrob/htscodecs/blob/24eb2b7a1282d27002ef9848a2aa1dd92d38dd78/htscodecs/rANS_static32x16pr_neon.c#L35

clausecker · 2023-04-18T10:17:10Z

Yes, correct. Or rework the code so it only uses intrinsics that also work on AArch32. Or add polyfill for the missing intrinsics when running on AArch32.

jkbonfield · 2023-04-18T14:26:19Z

PR #82 should enable it to compiler, but the difference is trivial. I don't have SIMD for Aarch32 as that's complex and probably also fruitless, but it can obviously run the scalar implementation just fine.

Unfortunately I'm not able to test it as AWS Arm is Aarch64 only.

NEON alone isn't a sufficient guard as AArch32 also has some limited Neon capabilities. While we could no doubt have a 32-bit alternative, for now this is the simple fix and let aarch32 use the scalar implementation. Doing a 32-bit neon is a complex task and without having access to the hardware it's pretty much impossible. I also wouldn't have high hopes for any significant speed gains over scalar with only half the lanes available. Fixes samtools#81

clausecker · 2023-04-18T14:50:54Z

This matches the changes I plan to deploy as local patches to the FreeBSD port of this package. I've also patched the NEON detection in configure to ensure it only triggers if AArch64-only intrinsics are available.

jkbonfield · 2023-04-18T16:04:07Z

The neon tests in configure were removed in #78 as they conflict with MacOS multi-arch builds. The Mac basically has a compiler that performs 2 compilations, for ARM and Amd64 CPUs, so any tests of compiler capability (like does it support a -mavx2 option) will break. So instead it all needs to be done with on-the-fly CPU detection via ifdefs instead.

clausecker · 2023-04-18T16:06:48Z

@jkbonfield Good to hear. This is LGTM then from my side.

jkbonfield transferred this issue from samtools/htslib Apr 18, 2023

jkbonfield mentioned this issue Apr 18, 2023

Only use the ARM NEON 32-way unrolled rANS on AArch64. #82

Merged

daviesrob closed this as completed in 479b5be Apr 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rANS_static32x16pr_neon.c doesn't build on AArch32 #81

rANS_static32x16pr_neon.c doesn't build on AArch32 #81

clausecker commented Apr 17, 2023

jkbonfield commented Apr 18, 2023

clausecker commented Apr 18, 2023

jkbonfield commented Apr 18, 2023

clausecker commented Apr 18, 2023

jkbonfield commented Apr 18, 2023 •

edited

Loading

clausecker commented Apr 18, 2023

jkbonfield commented Apr 18, 2023

clausecker commented Apr 18, 2023

rANS_static32x16pr_neon.c doesn't build on AArch32 #81

rANS_static32x16pr_neon.c doesn't build on AArch32 #81

Comments

clausecker commented Apr 17, 2023

jkbonfield commented Apr 18, 2023

clausecker commented Apr 18, 2023

jkbonfield commented Apr 18, 2023

clausecker commented Apr 18, 2023

jkbonfield commented Apr 18, 2023 • edited Loading

clausecker commented Apr 18, 2023

jkbonfield commented Apr 18, 2023

clausecker commented Apr 18, 2023

jkbonfield commented Apr 18, 2023 •

edited

Loading