v0.8.0
SIMDe 0.8.0
Summary
- Implementations of all NEON intrinsics are now complete, up from 56.46% in the previous release! (@yyctw @wewe5215)
- SIMDe PRs are tested using Fedora Rawhide (@junaruga)
For the entire project: 656 files changed, 202635 insertions(+), 1724 deletions(-)
For just the simde folder: 295 files changed, 47053 insertions(+), 896 deletions(-)
X86
There are a total of 6876 SIMD functions on x86, 2930 (43.17%) of which have been implemented in SIMDe so far. Specifically for AVX-512, of the 5160 functions currently in AVX-512, SIMDe implements 1510 (29.26%).
Note: Intel has removed the intrinsics that were unique to Intel Xeon Phi (ER, PF, 4MAPS, and 4VNNIW) from their intrinsic list. SIMDe will retain the few implementations we already had, but this changes how our completeness statistics are calculated.
Newly added function families
- AES: 5 of 6 (83.33%)
Newly added AVX512 function families
- castph: 1 of 9 (11.11%) implemented.
- cvtus_storeu: 1 of 18 (5.56%) implemented.
- fpclass: 3 of 24 (12.50%) implemented.
- i32gather: 1 of 8 (12.50%) implemented.
- i64gather: 8 of 8 💯
- permutex: 3 of 12 (25.00%) implemented.
- rcp14: 1 of 24 (4.17%) implemented.
- reduce
  - reduce_max: 7 of 31 (22.58%) implemented.
  - reduce_min: 7 of 31 (22.58%) implemented.
- shufflehi: 1 of 7 (14.29%) implemented.
- shufflelo: 1 of 7 (14.29%) implemented.
Additions to existing families
- AVX512BW: 7 additional, 337 total of 790 (42.66%)
- AVX512DQ: 5 additional, 112 total of 376 (29.79%)
- AVX512F: 48 additional, 1087 total of 2812 (38.66%)
- AVX512_FP16: 15 additional, 17 total of 1105 (1.54%)
Neon
SIMDe currently implements 6670 out of 6670 (100.00%) NEON functions; up from 56.46% in the previous release!
Newly added families
- abal
- abal_high
- abd
- abdh
- abdl_high
- addhn_high
- aes
- bfdot
- bfdot_lane
- cadd_rot
- cale
- calt
- cmla_lane
- cmla_rot_lane
- copy_lane
- cvt_high
- cvt_n
- cvta
- cvtn
- cvtp
- cvtx
- cvtx_high
- div
- dupb_lane
- duph_lane
- eor3
- fmlal
- fms
- fms_lane
- fms_n
- ld2_dup
- ld2_lane
- ld3_dup
- ld3_lane
- ld4_dup
- maxnmv
- minnmv
- mla_lane
- mla_high_lane
- mls_lane
- mlsl_high_lane
- mmla
- mull_high_lane
- mull_high_n
- mulx
- mulx_lane
- pmaxnm
- pminnm
- qdmlal
- qdmlal_high
- qdmlal_high_lane
- qdmlal_high_n
- qdmlal_lane
- qdmlal_n
- qdmlsl
- qdmlsl_high
- qdmlsl_high_lane
- qdmlsl_high_n
- qdmlsl_lane
- qdmlsl_n
- qdmlslh
- qdmlslh_lane
- qdmulhh
- qdmulhh_lane
- qdmull_high
- qdmull_high_lane
- qdmull_high_n
- qdmull_lane
- qdmull_n
- qdmullh_lane
- qmovun_high
- qrdmlah
- qrdmlah_lane
- qrdmlahh
- qrdmlahh_lane
- qrdmlsh
- qrdmlsh_lane
- qrdmlshh
- qrdmlshh_lane
- qrdmulhh_lane
- qrshl
- qrshlh
- qrshrn_high_n
- qrshrnh_n
- qrshrun_high_n
- qrshrunh_n
- qshl_n
- qshlh_n
- qshluh_n
- qshrn_high_n
- qshrnh_n
- qshrun_high_n
- qshrunh_n
- raddhn
- raddhn_high
- rax
- recp
- rnd32x
- rnd32z
- rnd64x
- rnd64z
- rnda
- rndx
- rshrn_high_n
- rsubhn
- rsubhn_high
- set_lane
- sha1
- sha1h
- sha256
- sha512
- shll_high_n
- shrn_high_n
- sli_n
- sm3
- sm4
- sqrt
- st1_x2
- st1_x3
- st1_x4
- st1q_x2
- st1q_x3
- st1q_x4
- subhn_high
- sudot_lane
- usdot
- usdot_lane
Finally complete families
- cvtn
- mla_lane
Details
- simde-f16: improve _Float16 usage; better INFHF/NANHF defs 8910057 @mr-c
- simde_float16: prefer __fp16 if available aba26f6 @mr-c
Implementation of Arm intrinsics
NEON
- cvtn: vcvtnq_{s32_f32,s64_f64}: add SSE & AVX512 optimized implementations e134cc7 @mr-c
- cvtn: vcvtnq_u32_f32 is a V8 function 8432c70 @mr-c
- min: Remove non-working MMX specialization from simde_vmin_s16 6858b92 @M-HT
- shll: Extend constant range in simde_vshll_n_XXX intrinsics (#1064) beb1c61 @M-HT
- various: Implement some f16XN types and f16 related intrinsics. (#1071) aae2245 @yyctw
- qtbl/qtbx polyfills for A32V7 a2fef9e @easyaspi314
- arm: use SIMDE_ARCH_ARM_FMA 7198d6d @mr-c
- arm neon: Complex operations from Armv8.3-a (#1077) d08d67c @wewe5215
- more fp16 using intrinsics supported by architecture v7 (skip version) (#1081) 5e7c4d4 @yyctw
- st1{,q}_*_x{2,3,4}: initial implementation (#1082) 879d1a0 @yyctw
- part 1 of implement all intrinsics supported by architecture A64 (#1090) 2eedece @yyctw
- Add AES instructions. 23adcd2 805ccd2 @yyctw
- Modified simde_float16 to simde_float16_t (#1100) 8a05dc6 @yyctw
- implement all intrinsics supported by architecture A64-remaining part (#1093) 018ba24 @yyctw
- add enable vmlaq_laneq_f32 and vcvtq_n_f64_u64 c7d314b @yyctw
- implement all bf16-related intrinsics (#1110) c59db7c @yyctw
- arm/neon abs: negating INT_MIN is undefined behavior in C/C++ c200c16 @mr-c
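The last item above concerns a general C pitfall: `-INT_MIN` overflows a signed integer, which is undefined behavior. The following is a minimal sketch of one way to compute a lane-wise absolute value without triggering it. It is not SIMDe's actual code; the helper names `wrapping_abs_s32` and `saturating_abs_s32` are hypothetical and only illustrate the wrapping and saturating behaviors for a single 32-bit lane.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of SIMDe): absolute value of one int32_t
 * lane with wrapping semantics. The negation is performed on the unsigned
 * representation, where wraparound is well defined, so INT32_MIN maps back
 * to INT32_MIN instead of invoking undefined behavior. */
static int32_t wrapping_abs_s32(int32_t v) {
  uint32_t u = (uint32_t) v;
  if (v < 0) {
    u = 0u - u; /* modular negation, defined for every input */
  }
  int32_t r;
  memcpy(&r, &u, sizeof r); /* reinterpret the bits without a signed overflow */
  return r;
}

/* Hypothetical helper (not part of SIMDe): saturating variant, which clamps
 * the one unrepresentable result to INT32_MAX. */
static int32_t saturating_abs_s32(int32_t v) {
  if (v == INT32_MIN) {
    return INT32_MAX;
  }
  return v < 0 ? -v : v; /* safe: v is never INT32_MIN here */
}
```

With these helpers, `wrapping_abs_s32(INT32_MIN)` yields `INT32_MIN` and `saturating_abs_s32(INT32_MIN)` yields `INT32_MAX`; every other input gives the ordinary absolute value.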
SVE Intrinsics
WASM intrinsics
- simd128: fix altivec_p7 version of wasm_f64x2_pmin 96d6e53 @mr-c
- simd128: add missing unsigned functions ea5e283 @mr-c
- simd128 f{32x4,64x2}_min: add workaround for a gcc<6 issue d5d6d10 @mr-c
- detect support for Relaxed SIMD mode 2e66dd4 @mr-c
- simd128/relaxed: begin MIPS implementations db8ad84 @mr-c
- relaxed: add f{32x4,64x2}_relaxed_{min,max} 9d1a34e @mr-c
- relaxed: updated names; reordered FMA operations 8cc8874 @mr-c
x86 intrinsics
SSE*
- sse: Fix issues related to MXCSR register (#1060) 653aba8 @M-HT
- sse: implement _mm_movelh_ps for Arm64 514564e @mr-c
- sse _mm_movemask_ps: remove unused code fba97e4 @mr-c
- sse2 mm_pause: more archs, add a basic test 692a2e8 @mr-
- sse4.1: use logical OR instead of bitwise OR in neon impl of _mm_testnzc_si128 edd4678 @mr-c
- sse4.1 _mm_testz_si128: fix backwards short circuit logic f132275 @mr-c
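For readers unfamiliar with the SSE4.1 test intrinsics touched by the last two fixes, here is a scalar sketch of their documented semantics. It does not reproduce SIMDe's implementation or the fixes themselves; the type `u128_sketch` and the `*_ref` function names are hypothetical, and a 128-bit vector is modeled as two 64-bit halves.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit integer vector. */
typedef struct {
  uint64_t lo;
  uint64_t hi;
} u128_sketch;

/* Models _mm_testz_si128: returns 1 when (a AND b) is all zeros (the ZF flag). */
static int testz_ref(u128_sketch a, u128_sketch b) {
  return ((a.lo & b.lo) | (a.hi & b.hi)) == 0;
}

/* Models _mm_testc_si128: returns 1 when ((NOT a) AND b) is all zeros (the CF flag). */
static int testc_ref(u128_sketch a, u128_sketch b) {
  return ((~a.lo & b.lo) | (~a.hi & b.hi)) == 0;
}

/* Models _mm_testnzc_si128: returns 1 only when both ZF and CF are 0.
 * Each flag must be evaluated over the whole 128 bits before combining. */
static int testnzc_ref(u128_sketch a, u128_sketch b) {
  return !testz_ref(a, b) && !testc_ref(a, b);
}
```

Note that the two conditions in `testnzc_ref` may be satisfied by different halves of the vector, which is why each flag is reduced across both halves first.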
AVX
- run test from #926 ce9708c @mr-c
- simde_mm256_shuffle_pd fix for natural vector size < 128 1594d7c @mr-c
AVX2
- correction of simde_mm256_sign_epi{8,16,32} (#1123) c376610 @Proudsalsa
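As background for the sign correction above, the following scalar sketch models what the x86 sign instructions compute per element: the first operand is negated, zeroed, or passed through depending on the sign of the second. This is an illustration of the documented semantics for 32-bit lanes, not the code from #1123; `sign_lane_s32` and `sign_epi32_ref` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-lane reference: returns -a when b < 0, 0 when b == 0,
 * and a when b > 0. The negation is done in unsigned arithmetic so that
 * a == INT32_MIN wraps to INT32_MIN rather than overflowing. */
static int32_t sign_lane_s32(int32_t a, int32_t b) {
  if (b == 0) {
    return 0;
  }
  if (b > 0) {
    return a;
  }
  uint32_t u = 0u - (uint32_t) a; /* modular negation */
  int32_t r;
  memcpy(&r, &u, sizeof r);
  return r;
}

/* Hypothetical 256-bit reference: applies the lane rule to 8 x int32_t. */
static void sign_epi32_ref(const int32_t a[8], const int32_t b[8], int32_t out[8]) {
  for (size_t i = 0; i < 8; i++) {
    out[i] = sign_lane_s32(a[i], b[i]);
  }
}
```

The 8-bit and 16-bit variants follow the same rule with narrower lanes.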
AVX512
- fpclass: naive implementation 353bf5f @mr-c
- loadu: fix native detection 305f434 @mr-c
- set: add simde_x_mm512_set_m256{,d} 67e0c50 @mr-c
- gather: add MSVC native fallbacks 7b7e3f6 @mr-c
- AVX512FP16 / m512h initial support e97691c @mr-c
- fix many native aliases 75014b9 @mr-c
CLMUL
SVML
AES
MIPS MSA intrinsics
Arch support
x86(-64)
arm64
Altivec
Power
- sse2,wasm simd128: skip SIMDE_CONVERT_VECTOR_ implementations on PowerPC 4de999a @mr-c
- wasm simd128: more powerpc fixes 7cb5691 @mr-c
Compiler Specific
GCC
- GCC AVX512F: SIMDE_BUG_GCC_95399 was fixed in GCC 9.5, 10.4, 11.4, 12+ 3fa89c5 @mr-c
- GCC x86/x64: SIMDE_BUG_GCC_98521 was fixed in 10.3 edde42e @mr-c
- GCC x86: SIMDE_BUG_GCC_94482 was fixed in 8.5, 9.4, 10+ 43d86a3 @mr-c
- Add workaround for GCC bug 111609 fdafd8e @M-HT
- arm neon ld2: silence warnings at -O3 on gcc risc-v 8f56628 @mr-c
- avx512 abs: refine GCC compiler checks for _mm512{,_mask}_abs_pd (#1118) 5405bbd @thomas-schlichter
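Several of the items above narrow a workaround to the compiler versions that still need it. When a fix is backported to multiple release branches, as with "fixed in 8.5, 9.4, 10+", the check has to be evaluated per branch. The sketch below illustrates that logic as a plain function; it is not SIMDe's macro machinery, the name `gcc_94482_needs_workaround` is hypothetical, and it assumes every release before the listed fix points is affected, which the notes do not state.

```c
/* Hypothetical helper (not part of SIMDe): decides whether a GCC release
 * still needs the workaround for a bug fixed in 8.5, 9.4, and 10+.
 * Assumption: every release older than those fix points is affected. */
static int gcc_94482_needs_workaround(int major, int minor) {
  if (major >= 10) {
    return 0; /* fixed on every 10+ branch */
  }
  if (major == 9) {
    return minor < 4; /* fixed on the 9 branch starting with 9.4 */
  }
  if (major == 8) {
    return minor < 5; /* fixed on the 8 branch starting with 8.5 */
  }
  return 1; /* older branches: assumed affected */
}
```

In a header, the same comparison would typically be expressed with the preprocessor using `__GNUC__` and `__GNUC_MINOR__` so that the workaround is compiled out entirely on fixed releases.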
Clang
- clang powerpc: vec_bperm bug was fixed in clang-14 6feb28a @mr-c
- clmul: aarch64 clang has difficulties with poly64x1_t 1e1bd76 @mr-c
- aarch64: optimization bug 45541 was fixed in clang-15 7ca5712 @mr-c
- A32V7: Don't trust clang for load multiple on A32V7 927f141 @easyaspi314
- wasm: SIMDE_BUG_CLANG_60655 is fixed in the upcoming 17.0 release 25cebbe @mr-c
- simde-detect-clang.h: add clang 17 detection 923f8ac 684baa1 50d98c1 @Coeur
ClangCL
- fp16: don't use _Float16 on ClangCL if not supported 8a6b8c5 @mr-c
- svml: don't enable SIMDE_X86_SVML_NATIVE for ClangCl c877fe5 @mr-
Emscripten
MSVC
- avx512 types: avoid using native AVX512 types on MSVC unless required 029d749 @mr-c
- arm neon: {u,s}addh apply arm64 windows workaround only on msvc<1938 (#1121) 14311d6 @Changqing-JING
Testing with Docker/Podman & CI
- Update recipe for qemu git mode 54b8c8f @mr-c
- riscv64 gcc: typo fix for endian little 7423339 @mr-c
- add new cross sets; Ubuntu Focal and Bionic support b0b9710 @mr-c
- native tests: also AVX512, MSA; fix WASM SIMD128 path bdd075b @mr-c
- test-flags: support the x86 microarchitecture levels 518b777 @mr-c
- ignore common build paths b3689ea @mr-c
Appveyor
Circle CI
GitHub Actions
- upgrade qemu; fixes remaining ppc64el fails! e91944b @mr-c
- tidy matrix ordering for easier-to-read job names b52ac36 @mr-c
- add clang-qemu: aarch64, riscv64, ppc64el, s390x 8a6dbab @mr-c
- test armv7 with gcc-12 via qemu 8cd8de1 @mr-c
- add armel to gcc and clang qemu matrices 4ca849b @mr-c
- add armv7 to clang-qemu matrix a144aca @mr-c
- use GCC 12 for adv x64 native testing + AVX512FP f156b41 @mr-c
- expand mac-os/xcode testing matrix 8055410 @mr-c
- fix macos-13+brew failure c6149de @mr-c
- test with clang-16 e25ced8 @mr-c
- add gcc-13 43ac8fc @mr-c
- simplify x86 ISA matrix 6b7c1b3 @mr-c
- run on commits to the primary branch to prime the cache 6055bfb @mr-c
- build(deps): bump actions/checkout from 3 to 4 149d0af @dependabot[bot]
- build(deps): bump github/codeql-action from 2 to 3 (#1138) 5026e66 @dependabot[bot]
- build(deps): bump actions/setup-python from 4 to 5 (#1137) 2768da8 @dependabot[bot]
- build(deps): bump actions/setup-dotnet from 3 to 4 (#1135) ed382cb @dependabot[bot]
- build(deps): bump ad-m/github-push-action from 0.6.0 to 0.8.0 (#1134) 193be1b @dependabot[bot]
- add new repo for clang-16 7ebd267 @mr-c
- add clang-17 (#1127) d31de99 @mr-c
- test mips64el using qemu on gcc12/clang16 934d86d @mr-c
- disable {clang,gcc}-qemu mips64el; needs newer Ubuntu version 471a342 @mr-c
- test WASM Relaxed SIMD da0604f @mr-c
Packit CI
Travis
Misc
- README: mark F16C as complete 2d87cf5 @mr-c
- README: Give credit to creator/maintainer of the vcpkg for SIMDe ceb1e73 @mr-c
- README: related projects: add AvxToNeon 13bf92a @mr-c
- README: add more background links for supported ISAs c76450d @mr-c
- README: turn Packit CI link into a deep link e9e1901 @mr-c
- README: NEON is complete 7412139 @mr-c
- docs: explain how to target a single test 2158ac7 @mr-c
New Contributors
- @thomas-schlichter made their first contribution in #1118
- @Proudsalsa made their first contribution in #1123
- @Changqing-JING made their first contribution in #1121
- @AymenQ made their first contribution in #1126
- @Coeur made their first contribution in #1129
- @dependabot made their first contribution in #1134
- @cbielow made their first contribution in #1055
- @M-HT made their first contribution in #1060
- @yyctw made their first contribution in #1071
- @Vineg made their first contribution in #1072
- @wewe5215 made their first contribution in #1077
Full Changelog: v0.7.6...v0.8.0