
v0.8.0

@mr-c released this on 14 Mar 13:03
· 98 commits to master since this release
589c7d5

SIMDe 0.8.0

Summary

  • Implementations of all NEON intrinsics are now complete (100%), up from 56.46% in the previous release! (@yyctw @wewe5215)
  • SIMDe PRs are now also tested on Fedora Rawhide (@junaruga)

For the entire project: 656 files changed, 202635 insertions(+), 1724 deletions(-)

For just the simde folder: 295 files changed, 47053 insertions(+), 896 deletions(-)

X86

There are a total of 6876 SIMD functions on x86, 2930 (42.61%) of which have been implemented in SIMDe so far. For AVX-512 specifically, SIMDe implements 1510 of the 5160 functions currently defined (29.26%).

Note: Intel has removed the intrinsics that were unique to Intel Xeon Phi (ER, PF, 4FMAPS, and 4VNNIW) from their intrinsics list. SIMDe will keep the few implementations of these that it already had, but this changes how our completeness statistics are calculated.

Newly added function families

  • AES: 5 of 6 (83.33%)

Newly added AVX512 function families

Additions to existing families

  • AVX512BW: 7 additional, 337 total of 790 (42.66%)
  • AVX512DQ: 5 additional, 112 total of 376 (29.79%)
  • AVX512F: 48 additional, 1087 total of 2812 (38.66%)
  • AVX512_FP16: 15 additional, 17 total of 1105 (1.54%)
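
Each coverage figure in these lists appears to be implemented ÷ total × 100, rounded to two decimal places. A tiny helper that reproduces them is sketched below; `coverage_pct` and `print_coverage` are hypothetical names for illustration, not part of SIMDe.

```c
#include <stdio.h>

/* Coverage as a percentage of the total; returns 0 for an empty family. */
double coverage_pct(unsigned implemented, unsigned total) {
    return total ? 100.0 * (double)implemented / (double)total : 0.0;
}

/* Print one line in the same "N total of M (P%)" shape as the list above. */
void print_coverage(const char *family, unsigned implemented, unsigned total) {
    printf("%s: %u total of %u (%.2f%%)\n", family, implemented, total,
           coverage_pct(implemented, total));
}
```

For example, `print_coverage("AVX512BW", 337, 790)` prints `AVX512BW: 337 total of 790 (42.66%)`.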

Neon

SIMDe currently implements 6670 out of 6670 (100.00%) NEON functions, up from 56.46% in the previous release!

Newly added families

  • abal
  • abal_high
  • abd
  • abdh
  • abdl_high
  • addhn_high
  • aes
  • bfdot
  • bfdot_lane
  • cadd_rot
  • cale
  • calt
  • cmla_lane
  • cmla_rot_lane
  • copy_lane
  • cvt_high
  • cvt_n
  • cvta
  • cvtn
  • cvtp
  • cvtx
  • cvtx_high
  • div
  • dupb_lane
  • duph_lane
  • eor3
  • fmlal
  • fms
  • fms_lane
  • fms_n
  • ld2_dup
  • ld2_lane
  • ld3_dup
  • ld3_lane
  • ld4_dup
  • maxnmv
  • minnmv
  • mla_lane
  • mlal_high_lane
  • mls_lane
  • mlsl_high_lane
  • mmla
  • mull_high_lane
  • mull_high_n
  • mulx
  • mulx_lane
  • pmaxnm
  • pminnm
  • qdmlal
  • qdmlal_high
  • qdmlal_high_lane
  • qdmlal_high_n
  • qdmlal_lane
  • qdmlal_n
  • qdmlsl
  • qdmlsl_high
  • qdmlsl_high_lane
  • qdmlsl_high_n
  • qdmlsl_lane
  • qdmlsl_n
  • qdmlslh
  • qdmlslh_lane
  • qdmulhh
  • qdmulhh_lane
  • qdmull_high
  • qdmull_high_lane
  • qdmull_high_n
  • qdmull_lane
  • qdmull_n
  • qdmullh_lane
  • qmovun_high
  • qrdmlah
  • qrdmlah_lane
  • qrdmlahh
  • qrdmlahh_lane
  • qrdmlsh
  • qrdmlsh_lane
  • qrdmlshh
  • qrdmlshh_lane
  • qrdmulhh_lane
  • qrshl
  • qrshlh
  • qrshrn_high_n
  • qrshrnh_n
  • qrshrun_high_n
  • qrshrunh_n
  • qshl_n
  • qshlh_n
  • qshluh_n
  • qshrn_high_n
  • qshrnh_n
  • qshrun_high_n
  • qshrunh_n
  • raddhn
  • raddhn_high
  • rax
  • recp
  • rnd32x
  • rnd64z
  • rnda
  • rndx
  • rshrn_high_n
  • rsubhn
  • set_lane
  • sha1
  • sha1h
  • sha256
  • sha512
  • shll_high_n
  • shrn_high_n
  • sli_n
  • sm3
  • sm4
  • sqrt
  • st1_x2
  • st1_x3
  • st1_x4
  • st1q_x2
  • st1q_x3
  • st1q_x4
  • subhn_high
  • sudot_lane
  • usdot
  • usdot_lane
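
Several of the newly added families are saturating scalar operations, for example qdmulhh and qrdmulhh (`vqdmulhh_s16` and `vqrdmulhh_s16` for 16-bit operands). The sketch below models their Arm-documented semantics in plain C, which can be handy as a test oracle; the `ref_*` names are hypothetical and this is not SIMDe's implementation.

```c
#include <stdint.h>

/* Clamp a wide intermediate to the int16_t range. */
static int16_t sat_s16(int64_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Model of vqdmulhh_s16: saturate((2 * a * b) >> 16).
   The 64-bit intermediate avoids overflow when a == b == INT16_MIN.
   Right-shifting a negative value is implementation-defined in C,
   but it is an arithmetic shift on GCC, Clang and MSVC. */
int16_t ref_qdmulhh_s16(int16_t a, int16_t b) {
    int64_t product = 2 * (int64_t)a * (int64_t)b;
    return sat_s16(product >> 16);
}

/* Model of vqrdmulhh_s16: same as above, but rounds by adding 1 << 15
   before taking the high half. */
int16_t ref_qrdmulhh_s16(int16_t a, int16_t b) {
    int64_t product = 2 * (int64_t)a * (int64_t)b + ((int64_t)1 << 15);
    return sat_s16(product >> 16);
}
```

For example, `ref_qdmulhh_s16(1, 16384)` returns 0 while `ref_qrdmulhh_s16(1, 16384)` returns 1, which shows the effect of the rounding constant; both saturate `INT16_MIN * INT16_MIN` to `INT16_MAX`.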

Finally complete families

  • cvtn
  • mla_lane

Details

  • simde-f16: improve _Float16 usage; better INFHF/NANHF defs 8910057 @mr-c
  • simde_float16: prefer __fp16 if available aba26f6 @mr-c
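
When a compiler offers neither `_Float16` nor `__fp16`, a half-precision value can only be carried around as its raw 16-bit pattern. The sketch below decodes such bits to `float` in portable C; `ref_half_bits_to_float` is a hypothetical name and not SIMDe's API. `0x7C00` and `0x7E00` are the standard binary16 encodings of +infinity and a quiet NaN.

```c
#include <stdint.h>
#include <string.h>

/* Decode an IEEE 754 binary16 bit pattern into a binary32 float. */
float ref_half_bits_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exponent = (h >> 10) & 0x1Fu;
    uint32_t mantissa = h & 0x3FFu;
    uint32_t bits;
    float result;

    if (exponent == 0x1Fu) {
        /* Infinity (mantissa == 0) or NaN: keep the payload in the top mantissa bits. */
        bits = sign | 0x7F800000u | (mantissa << 13);
    } else if (exponent != 0) {
        /* Normal number: rebias the exponent from 15 to 127 (15 + 112 == 127). */
        bits = sign | ((exponent + 112u) << 23) | (mantissa << 13);
    } else {
        /* Zero or subnormal: the value is mantissa * 2^-24, exact in binary32. */
        result = (float)mantissa * (1.0f / 16777216.0f);
        return sign ? -result : result;
    }
    memcpy(&result, &bits, sizeof result);
    return result;
}
```

For instance, `0x3C00` decodes to `1.0f` and `0x7BFF` to `65504.0f`, the largest finite binary16 value.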

Implementation of Arm intrinsics

NEON

SVE Intrinsics

WASM intrinsics

  • simd128: fix altivec_p7 version of wasm_f64x2_pmin 96d6e53 @mr-c
  • simd128: add missing unsigned functions ea5e283 @mr-c
  • simd128 f{32x4,64x2}_min: add workaround for a gcc<6 issue d5d6d10 @mr-c
  • detect support for Relaxed SIMD mode 2e66dd4 @mr-c
  • simd128/relaxed: begin MIPS implementations db8ad84 @mr-c
  • relaxed: add f{32x4,64x2}_relaxed_{min,max} 9d1a34e @mr-c
  • relaxed: updated names; reordered FMA operations 8cc8874 @mr-c
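
`wasm_f64x2_pmin` and `wasm_f64x2_pmax` are the WebAssembly "pseudo-minimum/maximum" operations, defined lane-wise as `b < a ? b : a` and `a < b ? b : a` rather than as IEEE `minNum`/`maxNum`. A scalar model is sketched below for checking a port such as the AltiVec one; the `ref_*` names are hypothetical.

```c
/* One lane of pmin/pmax. Any comparison involving NaN is false, so these
   return the first operand in that case (and also when a == b, e.g. -0.0 vs 0.0). */
double ref_pmin(double a, double b) { return b < a ? b : a; }
double ref_pmax(double a, double b) { return a < b ? b : a; }

/* Apply pmin to both lanes of an f64x2 vector stored as double[2]. */
void ref_f64x2_pmin(const double a[2], const double b[2], double out[2]) {
    for (int i = 0; i < 2; i++) {
        out[i] = ref_pmin(a[i], b[i]);
    }
}
```

The asymmetry is the point: unlike `fmin`, `ref_pmin(NaN, x)` yields NaN while `ref_pmin(x, NaN)` yields `x`, so a backend cannot simply substitute a symmetric NaN-aware minimum.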

x86 intrinsics

  • sse{,2,4.1}, avx{,2} *_stream_{,load}: use __builtin_nontemporal_{load,store} 6ce6030 @mr-c
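
`__builtin_nontemporal_store` and `__builtin_nontemporal_load` are Clang builtins that request non-temporal memory accesses without target-specific intrinsics. The guarded sketch below shows the general pattern of preferring the builtin and falling back to an ordinary store elsewhere (for example on GCC, which does not provide it); `REF_HAVE_NONTEMPORAL` and `ref_stream_store_i32` are hypothetical names, not SIMDe's code.

```c
#include <stdint.h>

/* Detect the builtin; the inner #if is skipped entirely when the
   compiler lacks __has_builtin. */
#if defined(__has_builtin)
#  if __has_builtin(__builtin_nontemporal_store)
#    define REF_HAVE_NONTEMPORAL 1
#  endif
#endif

/* Store value to *dst with a "don't cache" hint when available.
   The fallback has the same visible result, only without the hint. */
void ref_stream_store_i32(int32_t *dst, int32_t value) {
#if defined(REF_HAVE_NONTEMPORAL)
    __builtin_nontemporal_store(value, dst);
#else
    *dst = value;
#endif
}
```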

SSE*

  • sse: Fix issues related to MXCSR register (#1060) 653aba8 @M-HT
  • sse: implement _mm_movelh_ps for Arm64 514564e @mr-c
  • sse _mm_movemask_ps: remove unused code fba97e4 @mr-c
  • sse2 _mm_pause: more archs, add a basic test 692a2e8 @mr-
  • sse4.1: use logical OR instead of bitwise OR in neon impl of _mm_testnzc_si128 edd4678 @mr-c
  • sse4.1 _mm_testz_si128: fix backwards short circuit logic f132275 @mr-c
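
Per Intel's documentation, `_mm_testz_si128(a, b)` returns ZF, which is 1 when `a & b` is all zero, and `_mm_testnzc_si128(a, b)` returns 1 only when both ZF and CF are 0, where CF is 1 when `~a & b` is all zero. The sketch below is a portable reference model of those semantics with the 128-bit value split into two 64-bit halves; it is not SIMDe's NEON code, and the `ref_*` names are hypothetical.

```c
#include <stdint.h>

/* A 128-bit value modeled as two 64-bit halves. */
typedef struct { uint64_t lo, hi; } ref_u128;

/* ZF: 1 if (a & b) is zero across all 128 bits (what _mm_testz_si128 returns). */
int ref_testz(ref_u128 a, ref_u128 b) {
    return ((a.lo & b.lo) | (a.hi & b.hi)) == 0;
}

/* CF: 1 if (~a & b) is zero across all 128 bits (what _mm_testc_si128 returns). */
int ref_testc(ref_u128 a, ref_u128 b) {
    return ((~a.lo & b.lo) | (~a.hi & b.hi)) == 0;
}

/* 1 only if both ZF and CF are 0. Each flag must look at both halves,
   and the two resulting booleans are combined with logical AND. */
int ref_testnzc(ref_u128 a, ref_u128 b) {
    int any_and    = ((a.lo & b.lo) != 0) || ((a.hi & b.hi) != 0);
    int any_andnot = ((~a.lo & b.lo) != 0) || ((~a.hi & b.hi) != 0);
    return any_and && any_andnot;
}
```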

AVX

AVX2

AVX512

CLMUL

SVML

AES

MIPS MSA intrinsics

  • msa neon impl: float64x2_t is not avail in A32V7 ae4c4ab @mr-c

Arch support

x86(-64)

arm64

  • x86 aes: add neon implementation using the crypto extension fb3554f @mr-

Altivec

  • neon/st1: disable last remaining AltiVec implementation 0521245 @mr-c

Power

  • sse2,wasm simd128: skip SIMDE_CONVERT_VECTOR_ implementations on PowerPC 4de999a @mr-c
  • wasm simd128: more powerpc fixes 7cb5691 @mr-c

Compiler Specific

GCC

Clang

ClangCL

  • fp16: don't use _Float16 on ClangCL if not supported 8a6b8c5 @mr-c
  • svml: don't enable SIMDE_X86_SVML_NATIVE for ClangCL c877fe5 @mr-

Emscripten

MSVC

Testing with Docker/Podman & CI

Appveyor

Circle CI

  • circleci: clang, set -Wno-unsafe-buffer-usage 24c93c2 @mr-c

GitHub Actions

Packit CI

Travis

Misc

New Contributors

Full Changelog: v0.7.6...v0.8.0