SIMDe 0.8.0

Summary

Implementations of all NEON intrinsics are now complete, up from 56.46% in the previous release! (@yyctw @wewe5215)
SIMDe PRs are tested using Fedora Rawhide (@junaruga)

For the entire project: 656 files changed, 202635 insertions(+), 1724 deletions(-)

For just the simde folder: 295 files changed, 47053 insertions(+), 896 deletions(-)

X86

There are a total of 6876 SIMD functions on x86, 2930 (42.61%) of which have been implemented in SIMDe so far. Of the 5160 functions currently in AVX-512, SIMDe implements 1510 (29.26%).

Note: Intel has removed the intrinsics that were unique to Intel Xeon Phi (ER, PF, 4MAPS, and 4VNNIW) from their intrinsic list. SIMDe will retain those few implementations we already had, but this changes how our completeness statistics are calculated.

Newly added function families

AES: 5 of 6 (83.33%)

Newly added AVX512 function families

castph: 1 of 9 (11.11%) implemented.
cvtus_storeu: 1 of 18 (5.56%) implemented.
fpclass: 3 of 24 (12.50%) implemented.
i32gather: 1 of 8 (12.50%) implemented.
i64gather: 8 of 8 💯
permutex: 3 of 12 (25.00%) implemented.
rcp14: 1 of 24 (4.17%) implemented.
reduce
reduce_max: 7 of 31 (22.58%) implemented.
reduce_min: 7 of 31 (22.58%) implemented.
shufflehi: 1 of 7 (14.29%) implemented.
shufflelo: 1 of 7 (14.29%) implemented.

Additions to existing families

AVX512BW: 7 additional, 337 total of 790 (42.66%)
AVX512DQ: 5 additional, 112 total of 376 (29.79%)
AVX512F: 48 additional, 1087 total of 2812 (38.66%)
AVX512_FP16: 15 additional, 17 total of 1105 (1.54%)

Neon

SIMDe currently implements 6670 out of 6670 (100.00%) NEON functions, up from 56.46% in the previous release!

Newly added families

abal
abal_high
abd
abdh
abdl_high
addhn_high
aes
bfdot
bfdot_lane
cadd_rot
cale
calt
cmla_lane
cmla_rot_lane
copy_lane
cvt_high
cvt_n
cvta
cvtn
cvtp
cvtx
cvtx_high
div
dupb_lane
duph_lane
eor3
fmlal
fms
fms_lane
fms_n
ld2_dup
ld2_lane
ld3_dup
ld3_lane
ld4_dup
maxnmv
minnmv
mla_lane
mla_high_lane
mls_lane
mlsl_high_lane
mmla
mull_high_lane
mull_high_n
mulx
mulx_lane
pmaxnm
pminnm
qdmlal
qdmlal_high
qdmlal_high_lane
qdmlal_high_n
qdmlal_lane
qdmlal_n
qdmlsl
qdmlsl_high
qdmlsl_high_lane
qdmlsl_high_n
qdmlsl_lane
qdmlsl_n
qdmlslh
qdmlslh_lane
qdmulhh
qdmulhh_lane
qdmull_high
qdmull_high_lane
qdmull_high_n
qdmull_lane
qdmull_n
qdmullh_lane
qmovun_high
qrdmlah
qrdmlah_lane
qrdmlahh
qrdmlahh_lane
qrdmlsh
qrdmlsh_lane
qrdmlshh
qrdmlshh_lane
qrdmulhh_lane
qrshl
qrshlh
qrshrn_high_n
qrshrnh_n
qrshrun_high_n
qrshrunh_n
qshl_n
qshlh_n
qshluh_n
qshrn_high_n
qshrnh_n
qshrun_high_n
qshrunh_n
raddhn
raddhn_high
rax
recp
rnd32x
rnd64z
rnda
rndx
rshrn_high_n
rsubhn
set_lane
sha1
sha1h
sha256
sha512
shll_high_n
shrn_high_n
sli_n
sm3
sm4
sqrt
st1_x2
st1_x3
st1_x4
st1q_x2
st1q_x3
st1q_x4
subhn_high
sudot_lane
usdot
usdot_lane

Finally complete families

cvtn
mla_lane

Details

simde-f16: improve _Float16 usage; better INFHF/NANHF defs 8910057 @mr-c
simde_float16: prefer __fp16 if available aba26f6 @mr-c

Implementation of Arm intrinsics

NEON

cvtn: vcvtnq_{s32_f32,s64_f64}: add SSE & AVX512 optimized implementations e134cc7 @mr-c
cvtn: vcvtnq_u32_f32 is a V8 function 8432c70 @mr-c
min: Remove non-working MMX specialization from simde_vmin_s16 6858b92 @M-HT
shll: Extend constant range in simde_vshll_n_XXX intrinsics (#1064) beb1c61 @M-HT
various: Implement some f16XN types and f16 related intrinsics. (#1071) aae2245 @yyctw
qtbl/qtbx polyfills for A32V7 a2fef9e @easyaspi314
arm: use SIMDE_ARCH_ARM_FMA 7198d6d @mr-c
arm neon: Complex operations from Armv8.3-a (#1077) d08d67c @wewe5215
more fp16 using intrinsics supported by architecture v7 (skip version) (#1081) 5e7c4d4 @yyctw
st1{,q}_*_x{2,3,4}: initial implementation (#1082) 879d1a0 @yyctw
part 1 of implement all intrinsics supported by architecture A64 (#1090) 2eedece @yyctw
Add AES instructions. 23adcd2 805ccd2 @yyctw
Modified simde_float16 to simde_float16_t (#1100) 8a05dc6 @yyctw
implement all intrinsics supported by architecture A64-remaining part (#1093) 018ba24 @yyctw
add enable vmlaq_laneq_f32 and vcvtq_n_f64_u64 c7d314b @yyctw
implement all bf16-related intrinsics (#1110) c59db7c @yyctw
arm/neon abs: negating INT_MIN is undefined behavior in C/C++ c200c16 @mr-c
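
The `abs` entry above touches a well-known C pitfall: `-INT32_MIN` overflows a signed integer, which is undefined behavior. One common way to avoid it is to negate in unsigned arithmetic, roughly as in the scalar sketch below. `wrapping_abs_i32` is a hypothetical helper name for illustration and is not taken from SIMDe's source; the sketch assumes the wrapping (non-saturating) behavior of the plain NEON `vabs` family.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not SIMDe's code): absolute value that wraps, so
 * INT32_MIN maps back to INT32_MIN, without ever negating a signed value. */
static int32_t wrapping_abs_i32(int32_t v) {
  uint32_t u = (uint32_t) v;   /* signed -> unsigned conversion is well defined */
  if (v < 0) {
    u = UINT32_C(0) - u;       /* unsigned negation wraps modulo 2^32 */
  }
  int32_t r;
  memcpy(&r, &u, sizeof(r));   /* reinterpret the bits as signed again */
  return r;
}
```

With this approach `wrapping_abs_i32(-5)` yields 5, and `wrapping_abs_i32(INT32_MIN)` yields `INT32_MIN` rather than invoking undefined behavior.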

SVE intrinsics

Improve performance of simde_mm512_add_epi32 (#1126) 6cde31c @AymenQ

WASM intrinsics

simd128: fix altivec_p7 version of wasm_f64x2_pmin 96d6e53 @mr-c
simd128: add missing unsigned functions ea5e283 @mr-c
simd128 f{32x4,64x2}_min: add workaround for a gcc<6 issue d5d6d10 @mr-c
detect support for Relaxed SIMD mode 2e66dd4 @mr-c
simd128/relaxed: begin MIPS implementations db8ad84 @mr-c
relaxed: add f{32x4,64x2}_relaxed_{min,max} 9d1a34e @mr-c
relaxed: updated names; reordered FMA operations 8cc8874 @mr-c

x86 intrinsics

sse{,2,4.1}, avx{,2} *_stream_{,load}: use __builtin_nontemporal_{load,store} 6ce6030 @mr-c
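
For context on the entry above: `__builtin_nontemporal_load` and `__builtin_nontemporal_store` are Clang builtins, so portable code usually guards them with `__has_builtin` and falls back to an ordinary access elsewhere. The following is a minimal sketch of that pattern for a single `int32_t`; the function names are hypothetical and this is not SIMDe's actual implementation.

```c
#include <stdint.h>

/* Compilers without __has_builtin simply take the fallback path. */
#ifndef __has_builtin
#  define __has_builtin(x) 0
#endif

/* Hypothetical helper: store a value with a non-temporal hint if available. */
static void stream_store_i32_sketch(int32_t *dst, int32_t value) {
#if __has_builtin(__builtin_nontemporal_store)
  __builtin_nontemporal_store(value, dst);  /* note: value first, pointer second */
#else
  *dst = value;                             /* plain store as a fallback */
#endif
}

/* Hypothetical helper: load a value with a non-temporal hint if available. */
static int32_t stream_load_i32_sketch(const int32_t *src) {
#if __has_builtin(__builtin_nontemporal_load)
  return __builtin_nontemporal_load(src);
#else
  return *src;                              /* plain load as a fallback */
#endif
}
```

The hint only affects caching behavior, so both paths produce the same observable result.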

SSE*

sse: Fix issues related to MXCSR register (#1060) 653aba8 @M-HT
sse: implement _mm_movelh_ps for Arm64 514564e @mr-c
sse _mm_movemask_ps: remove unused code fba97e4 @mr-c
sse2 mm_pause: more archs, add a basic test 692a2e8 @mr-
sse4.1: use logical OR instead of bitwise OR in neon impl of _mm_testnzc_si128 edd4678 @mr-c
sse4.1 _mm_testz_si128: fix backwards short circuit logic f132275 @mr-c
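
As a reference for the semantics behind the two `_mm_test*` entries above, here is a scalar sketch that models a 128-bit value as two 64-bit lanes. `_mm_testz_si128` returns 1 when `a AND b` is all zeros, and `_mm_testnzc_si128` returns 1 when both `a AND b` and `(NOT a) AND b` are nonzero. The type and function names are hypothetical, and this is not the code from the fixes themselves.

```c
#include <stdint.h>

/* Illustrative stand-in for a 128-bit vector: two 64-bit lanes. */
typedef struct { uint64_t lane[2]; } vec128_sketch;

/* 1 when (a & b) has no bits set in any lane, else 0. */
static int testz_sketch(vec128_sketch a, vec128_sketch b) {
  uint64_t and_bits = 0;
  for (int i = 0; i < 2; i++) {
    and_bits |= a.lane[i] & b.lane[i];      /* accumulate bits across lanes */
  }
  return and_bits == 0;
}

/* 1 when (a & b) and (~a & b) are both nonzero, else 0. */
static int testnzc_sketch(vec128_sketch a, vec128_sketch b) {
  uint64_t and_bits = 0, andnot_bits = 0;
  for (int i = 0; i < 2; i++) {
    and_bits    |=  a.lane[i] & b.lane[i];
    andnot_bits |= ~a.lane[i] & b.lane[i];
  }
  /* The lanes are merged bitwise above; the two flags are combined logically. */
  return (and_bits != 0) && (andnot_bits != 0);
}
```

Accumulating across all lanes before comparing avoids returning early on the wrong condition, at the cost of always visiting every lane.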

AVX

run test from #926 ce9708c @mr-c
simde_mm256_shuffle_pd fix for natural vector size < 128 1594d7c @mr-c

AVX2

correction of simde_mm256_sign_epi{8,16,32} (#1123) c376610 @Proudsalsa
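
For readers unfamiliar with the `sign` family mentioned above, the per-lane rule is: keep `a` when `b` is positive, produce 0 when `b` is zero, and negate `a` when `b` is negative. A scalar sketch for 8-bit lanes follows; the function names are hypothetical and the code only illustrates the documented semantics, not the corrected SIMDe implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* One lane of the "sign" operation. */
static int8_t sign_lane_i8_sketch(int8_t a, int8_t b) {
  if (b == 0) return 0;
  if (b > 0) return a;
  /* b < 0: -INT8_MIN does not fit in int8_t; the instruction wraps it back
   * to INT8_MIN, so that case is handled explicitly. */
  return (a == INT8_MIN) ? INT8_MIN : (int8_t) -a;
}

/* Apply the lane rule across n lanes (n would be 32 for a 256-bit vector). */
static void sign_i8_sketch(const int8_t *a, const int8_t *b, int8_t *r, size_t n) {
  for (size_t i = 0; i < n; i++) {
    r[i] = sign_lane_i8_sketch(a[i], b[i]);
  }
}
```

A scalar reference like this can be handy when checking a vectorized path lane by lane.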

AVX512

fpclass: naive implementation 353bf5f @mr-c
loadu: fix native detection 305f434 @mr-c
set: add simde_x_mm512_set_m256{,d} 67e0c50 @mr-c
gather: add MSVC native fallbacks 7b7e3f6 @mr-c
AVX512FP16 / m512h initial support e97691c @mr-c
fix many native aliases 75014b9 @mr-c

CLMUL

fix natives, some require VPCLMULQDQ f819c52 @mr-c

SVML

enable SIMDE_X86_SVML_NATIVE for MSVC 2019+ 593af95 @mr-c

AES

aes: initial implementation of most aes instructions (#1072) 8632391 @Vineg

MIPS MSA intrinsics

msa neon impl: float64x2_t is not avail in A32V7 ae4c4ab @mr-c

Arch support

x86(-64)

fix SIMDE_ARCH_X86_SSE4_2 define 5e4b308 @cbielow

arm64

x86 aes: add neon implementation using the crypto extension fb3554f @mr-

Altivec

neon/st1: disable last remaining AltiVec implementation 0521245 @mr-c

Power

sse2,wasm simd128: skip SIMDE_CONVERT_VECTOR_ implementations on PowerPC 4de999a @mr-c
wasm simd128: more powerpc fixes 7cb5691 @mr-c

Compiler Specific

GCC

GCC AVX512F: SIMDE_BUG_GCC_95399 was fixed in GCC 9.5, 10.4, 11.4, 12+ 3fa89c5 @mr-c
GCC x86/x64: SIMDE_BUG_GCC_98521 was fixed in 10.3 edde42e @mr-c
GCC x86: SIMDE_BUG_GCC_94482 was fixed in 8.5, 9.4, 10+ 43d86a3 @mr-c
Add workaround for GCC bug 111609 fdafd8e @M-HT
arm neon ld2: silence warnings at -O3 on gcc risc-v 8f56628 @mr-c
avx512 abs: refine GCC compiler checks for _mm512{,_mask}_abs_pd (#1118) 5405bbd @thomas-schlichter
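
Several entries above narrow a workaround to the GCC versions that still have the bug, where the fix was backported to more than one release branch. As a rough sketch, the "fixed in 9.5, 10.4, 11.4, 12+" case for bug 95399 could be expressed as the predicate below. The function name is hypothetical, and treating branches older than 9 as affected is an assumption made for the example, not something stated in the release notes.

```c
#include <stdbool.h>

/* Hypothetical predicate: true when the given GCC version predates the fix
 * for bug 95399 (fixed in 9.5, 10.4, 11.4, and every release from 12 on). */
static bool gcc_bug_95399_present_sketch(int major, int minor) {
  if (major >= 12) return false;   /* fixed on all newer major versions */
  switch (major) {
    case 11: return minor < 4;     /* fixed in 11.4 */
    case 10: return minor < 4;     /* fixed in 10.4 */
    case 9:  return minor < 5;     /* fixed in 9.5 */
    default: return true;          /* assumption: older branches never got the fix */
  }
}
```

In a header, the same logic would typically be evaluated at preprocessing time from `__GNUC__` and `__GNUC_MINOR__` rather than through a runtime function.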

Clang

clang powerpc: vec_bperm bug was fixed in clang-14 6feb28a @mr-c
clmul: aarch64 clang has difficulties with poly64x1_t 1e1bd76 @mr-c
aarch64: optimization bug 45541 was fixed in clang-15 7ca5712 @mr-c
A32V7: Don't trust clang for load multiple on A32V7 927f141 @easyaspi314
wasm: SIMDE_BUG_CLANG_60655 is fixed in the upcoming 17.0 release 25cebbe @mr-c
simde-detect-clang.h: add clang 17 detection 923f8ac 684baa1 50d98c1 @Coeur

ClangCL

fp16: don't use _Float16 on ClangCL if not supported 8a6b8c5 @mr-c
svml: don't enable SIMDE_X86_SVML_NATIVE for ClangCl c877fe5 @mr-

Emscripten

emcc tot: set -Wno-switch-default fdbd6b2 @mr-c

MSVC

avx512 types: avoid using native AVX512 types on MSVC unless required 029d749 @mr-c
arm neon: {u,s}addh apply arm64 windows workaround only on msvc<1938 (#1121) 14311d6 @Changqing-JING

Testing with Docker/Podman & CI

Update recipe for qemu git mode 54b8c8f @mr-c
riscv64 gcc: typo fix for endian little 7423339 @mr-c
add new cross sets; Ubuntu Focal and Bionic support b0b9710 @mr-c
native tests: also AVX512, MSA; fix WASM SIMD128 path bdd075b @mr-c
test-flags: support the x86 microarchitecture levels 518b777 @mr-c
ignore common build paths b3689ea @mr-c

Appveyor

preserve test log 9815161 @mr-c
save meson log on error 5207d83 @mr-

Circle CI

circleci: clang, set -Wno-unsafe-buffer-usage 24c93c2 @mr-c

GitHub Actions

upgrade qemu ; fixes remaining ppc64el fails! e91944b @mr-c
tidy matrix ordering for easier to read job names b52ac36 @mr-c
add clang-qemu: aarch64, riscv64, ppc64el, s390x 8a6dbab @mr-c
test armv7 with gcc-12 via qemu 8cd8de1 @mr-c
add armel to gcc and clang qemu matrices 4ca849b @mr-c
add armv7 to clang-qemu matrix a144aca @mr-c
use GCC 12 for adv x64 native testing + AVX512FP f156b41 @mr-c
expand mac-os/xcode testing matrix 8055410 @mr-c
fix macos-13+brew failure c6149de @mr-c
test with clang-16 e25ced8 @mr-c
add gcc-13 43ac8fc @mr-c
simplify x86 ISA matrix 6b7c1b3 @mr-c
run on commits to the primary branch to prime the cache 6055bfb @mr-c
build(deps): bump actions/checkout from 3 to 4 149d0af @dependabot[bot]
build(deps): bump github/codeql-action from 2 to 3 (#1138) 5026e66 @dependabot[bot]
build(deps): bump actions/setup-python from 4 to 5 (#1137) 2768da8 @dependabot[bot]
build(deps): bump actions/setup-dotnet from 3 to 4 (#1135) ed382cb @dependabot[bot]
build(deps): bump ad-m/github-push-action from 0.6.0 to 0.8.0 (#1134) 193be1b @dependabot[bot]
add new repo for clang-16 7ebd267 @mr-c
add clang-17 (#1127) d31de99 @mr-c
test mips64el using qemu on gcc12/clang16 934d86d @mr-c
disable {clang,gcc}-qemu mips64el; needs newer Ubuntu version 471a342 @mr-c
test WASM Relaxed SIMD da0604f @mr-c

Packit CI

Start testing SIMDe PRs using Fedora Rawhide d64b103 6ae0763 b309d89 4d55fc2 643c419 @junaruga

Travis

restart testing with Travis CI 93905f5 @mr-c

Misc

README: mark F16C as complete 2d87cf5 @mr-c
README: Give credit to creator/maintainer of the vcpkg for SIMDe ceb1e73 @mr-c
README: related projects: add AvxToNeon 13bf92a @mr-c
README: add more background links for supported ISAs c76450d @mr-c
README: turn Packit CI link into a deep link e9e1901 @mr-c
README: NEON is complete 7412139 @mr-c
docs: explain how to target a single test 2158ac7 @mr-c

New Contributors

@thomas-schlichter made their first contribution in #1118
@Proudsalsa made their first contribution in #1123
@Changqing-JING made their first contribution in #1121
@AymenQ made their first contribution in #1126
@Coeur made their first contribution in #1129
@dependabot made their first contribution in #1134
@cbielow made their first contribution in #1055
@M-HT made their first contribution in #1060
@yyctw made their first contribution in #1071
@Vineg made their first contribution in #1072
@wewe5215 made their first contribution in #1077

Full Changelog: v0.7.6...v0.8.0