Avx512 Keccak #7561

Merged
merged 12 commits into from
Oct 7, 2024

@benaadams benaadams commented Oct 7, 2024

Changes

  • Implement Keccak using Avx512 intrinsics

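The inner loop is as follows. It performs no stack writes (no spills); a few loop-invariant permutation index vectors are reloaded from the stack and from read-only data.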

G_M000_IG04:                ;; offset=0x0284
       vpxorq   zmm15, zmm6, zmm7
       vpxorq   zmm14, zmm8, zmm9
       vpxorq   zmm14, zmm15, zmm14
       vpxorq   zmm14, zmm14, zmm10
       vpermq   zmm15, zmm0, zmm14
       vpsllq   zmm0, zmm15, 1
       vpsrlq   zmm15, zmm15, 63
       vporq    zmm0, zmm0, zmm15
       vpermq   zmm14, zmm1, zmm14
       vpxorq   zmm0, zmm14, zmm0
       vpxorq   zmm6, zmm6, zmm0
       vpxorq   zmm7, zmm7, zmm0
       vpxorq   zmm8, zmm8, zmm0
       vpxorq   zmm9, zmm9, zmm0
       vpxorq   zmm10, zmm10, zmm0
       vprolvq  zmm6, zmm6, zmm2
       vprolvq  zmm7, zmm7, zmm3
       vprolvq  zmm8, zmm8, zmm4
       vprolvq  zmm9, zmm9, zmm5
       vprolvq  zmm10, zmm10, zmm16
       vmovaps  zmm0, zmm6
       vpermt2q zmm0, zmm17, zmm7
       vpermt2q zmm0, zmm18, zmm8
       vpermt2q zmm0, zmm19, zmm9
       vpermt2q zmm0, zmm20, zmm10
       vmovaps  zmm14, zmm6
       vpermt2q zmm14, zmm21, zmm7
       vpermt2q zmm14, zmm22, zmm8
       vpermt2q zmm14, zmm23, zmm9
       vpermt2q zmm14, zmm24, zmm10
       vmovaps  zmm15, zmm6
       vpermt2q zmm15, zmm25, zmm7
       vpermt2q zmm15, zmm26, zmm8
       vpermt2q zmm15, zmm27, zmm9
       vpermt2q zmm15, zmm28, zmm10
       vmovaps  zmm1, zmm6
       vpermt2q zmm1, zmm29, zmm7
       vpermt2q zmm1, zmm30, zmm8
       vpermt2q zmm1, zmm31, zmm9
       vpermt2q zmm1, zmm11, zmm10
       vpermt2q zmm6, zmm12, zmm7
       vpermt2q zmm6, zmm13, zmm8
       vmovups  zmm8, zmmword ptr [rsp+0xA0]
       vpermt2q zmm6, zmm8, zmm9
       vmovups  zmm9, zmmword ptr [rsp+0x60]
       vpermt2q zmm6, zmm9, zmm10
       vmovups  zmm7, zmmword ptr [reloc @RWD64]
       vpermq   zmm10, zmm7, zmm0
       vmovups  zmm7, zmmword ptr [reloc @RWD1792]
       vpermq   zmm7, zmm7, zmm0
       vpternlogq zmm0, zmm10, zmm7, -46
       vmovaps  zmm7, zmm0
       vmovups  zmm0, zmmword ptr [reloc @RWD64]
       vpermq   zmm0, zmm0, zmm14
       vmovups  zmm10, zmmword ptr [reloc @RWD1792]
       vpermq   zmm10, zmm10, zmm14
       vpternlogq zmm14, zmm0, zmm10, -46
       vmovups  zmm0, zmmword ptr [reloc @RWD64]
       vpermq   zmm0, zmm0, zmm15
       vmovups  zmm10, zmmword ptr [reloc @RWD1792]
       vpermq   zmm10, zmm10, zmm15

G_M000_IG05:                ;; offset=0x0418
       vpternlogq zmm15, zmm0, zmm10, -46
       vmovups  zmm0, zmmword ptr [reloc @RWD64]
       vpermq   zmm0, zmm0, zmm1
       vmovups  zmm10, zmmword ptr [reloc @RWD1792]
       vpermq   zmm10, zmm10, zmm1
       vpternlogq zmm1, zmm0, zmm10, -46
       vmovaps  zmm10, zmm1
       vmovups  zmm0, zmmword ptr [reloc @RWD64]
       vpermq   zmm0, zmm0, zmm6
       vmovups  zmm1, zmmword ptr [reloc @RWD1792]
       vpermq   zmm1, zmm1, zmm6
       vpternlogq zmm6, zmm0, zmm1, -46
       mov      r8d, edx
       vmovd    xmm0, qword ptr [rcx+8*r8+0x10]
       xor      r8d, r8d
       vpinsrq  xmm0, xmm0, r8, 1
       vxorps   xmm1, xmm1, xmm1
       vinserti128 ymm0, ymm0, xmm1, 1
       vxorps   ymm1, ymm1, ymm1
       vinserti64x4 zmm0, zmm0, ymm1, 1
       vpxorq   zmm7, zmm7, zmm0
       inc      edx
       cmp      eax, edx
       vmovups  zmm0, zmmword ptr [rsp+0x120]
       vmovups  zmm1, zmmword ptr [rsp+0xE0]
       jg       G_M000_IG08
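
The `-46` immediate on each `vpternlogq` above is the signed rendering of the byte `0xD2`. Read as an 8-entry truth table indexed by the three operand bits (destination, first source, second source), that byte encodes `a ^ (~b & c)`, which is the shape of Keccak's chi step. A minimal Python sketch that checks this bit by bit; it is not the PR's C# code, and the helper names are hypothetical:

```python
def ternlog_bit(imm8: int, a: int, b: int, c: int) -> int:
    """One result bit of VPTERNLOG: imm8 is a truth table indexed by
    (a << 2) | (b << 1) | c, where a is the destination operand bit."""
    return (imm8 >> ((a << 2) | (b << 1) | c)) & 1


def chi_bit(a: int, b: int, c: int) -> int:
    """Keccak chi on single bits: a XOR ((NOT b) AND c)."""
    return a ^ ((b ^ 1) & c)


# The disassembler prints the immediate as a signed byte; mask it back to 0..255.
IMM = -46 & 0xFF

# Compare the two functions over all eight input combinations.
MATCHES = all(
    ternlog_bit(IMM, a, b, c) == chi_bit(a, b, c)
    for a in (0, 1)
    for b in (0, 1)
    for c in (0, 1)
)

print(hex(IMM), MATCHES)
```

Which lanes `b` and `c` actually hold depends on the index vectors loaded from `RWD64` and `RWD1792`, which are not shown in the listing; presumably they select the lanes at x+1 and x+2 of each row.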

About 5% faster on Zen 4, which has no native 512-bit datapath (it executes AVX-512 instructions as two 256-bit halves); it should be faster on Zen 5.

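| Method  | ScenarioIndex | Mean     | Error   | StdDev   | Ratio | RatioSD | Ops/s     |
|---------|---------------|----------|---------|----------|-------|---------|-----------|
| Current | 1             | 374.6 ns | 6.53 ns | 9.57 ns  | 1.00  | 0.00    | 2,669,514 |
| Avx512F | 1             | 356.7 ns | 6.76 ns | 10.12 ns | 0.95  | 0.02    | 2,803,476 |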
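
As a quick sanity check on the benchmark figures, the Ratio and Ops/s columns follow directly from the reported means. A small Python sketch of that arithmetic, using only the numbers reported above (the variable names are illustrative):

```python
# Mean time per operation, in nanoseconds, as reported in the benchmark.
current_ns = 374.6
avx512_ns = 356.7

# Ratio column: candidate mean divided by baseline mean.
ratio = avx512_ns / current_ns

# Relative time saved, as a percentage of the baseline.
saving_pct = (1.0 - ratio) * 100.0

# Ops/s column: one second (1e9 ns) divided by the mean time per operation.
ops_current = 1e9 / current_ns
ops_avx512 = 1e9 / avx512_ns

print(f"ratio={ratio:.2f} saving={saving_pct:.1f}%")
print(f"ops/s current={ops_current:,.0f} avx512={ops_avx512:,.0f}")
```

The saving works out to a little under 5%, which is smaller than the reported standard deviations relative to the mean, so the result is best read as a modest improvement on this hardware.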

Types of changes

What types of changes does your code introduce?

  • Optimization

Testing

Requires testing

  • Yes
  • No

If yes, did you write tests?

  • Yes
  • No

@benaadams benaadams requested a review from Scooletz October 7, 2024 08:04

@LukaszRozmej LukaszRozmej left a comment

Should all those constant vectors be changed to static fields, or won't that help?

Two review threads on src/Nethermind/Nethermind.Core/Crypto/KeccakHash.cs (outdated, resolved)
@benaadams benaadams merged commit 50d7486 into master Oct 7, 2024
67 checks passed
@benaadams benaadams deleted the keccak-512 branch October 7, 2024 13:58

Scooletz commented Oct 7, 2024

For cases like this, maybe we should consider test suites that run only for platform-specific features? I tried to follow the permutations, but without a comparison in tests I cannot tell whether any of them is ✅
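
One way to get the comparison asked for here is to check a vectorized permutation against a plain scalar reference. The following is a minimal Python sketch of such a reference, modeled on the well-known compact Keccak-f[1600] formulation; it is not the project's test code, and the function names are hypothetical. Python's `hashlib` provides SHA3-256 (domain suffix `0x06`) rather than the original Keccak-256 padding (suffix `0x01`), but both use the same permutation, so agreeing with `hashlib` validates the permutation itself.

```python
import hashlib

MASK64 = (1 << 64) - 1


def rol64(value, count):
    """Rotate a 64-bit lane left by count bits."""
    count %= 64
    return ((value << count) | (value >> (64 - count))) & MASK64


def keccak_f1600(lanes):
    """Scalar Keccak-f[1600] on a 5x5 list of 64-bit lanes, indexed lanes[x][y]."""
    lfsr = 1
    for _ in range(24):
        # theta: column parities, then mix neighbouring columns into every lane
        parity = [lanes[x][0] ^ lanes[x][1] ^ lanes[x][2] ^ lanes[x][3] ^ lanes[x][4]
                  for x in range(5)]
        d = [parity[(x + 4) % 5] ^ rol64(parity[(x + 1) % 5], 1) for x in range(5)]
        lanes = [[lanes[x][y] ^ d[x] for y in range(5)] for x in range(5)]
        # rho and pi: rotate each lane and move it to its new position
        x, y = 1, 0
        current = lanes[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            current, lanes[x][y] = lanes[x][y], rol64(current, (t + 1) * (t + 2) // 2)
        # chi: a[x] ^= ~a[x+1] & a[x+2] within each row
        for y in range(5):
            row = [lanes[x][y] for x in range(5)]
            for x in range(5):
                lanes[x][y] = row[x] ^ (~row[(x + 1) % 5] & row[(x + 2) % 5])
        # iota: XOR the round constant (generated by an LFSR) into lane (0, 0)
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                lanes[0][0] ^= 1 << ((1 << j) - 1)
    return lanes


def sponge256(data, suffix):
    """256-bit sponge (rate 136 bytes); suffix 0x06 is SHA3-256, 0x01 is Keccak-256."""
    rate = 136
    padded = bytearray(data) + bytes([suffix])
    padded += bytes(-len(padded) % rate)
    padded[-1] ^= 0x80
    state = bytearray(200)
    for start in range(0, len(padded), rate):
        for i in range(rate):
            state[i] ^= padded[start + i]
        lanes = [[int.from_bytes(state[8 * (x + 5 * y):8 * (x + 5 * y) + 8], "little")
                  for y in range(5)] for x in range(5)]
        lanes = keccak_f1600(lanes)
        for x in range(5):
            for y in range(5):
                state[8 * (x + 5 * y):8 * (x + 5 * y) + 8] = lanes[x][y].to_bytes(8, "little")
    return bytes(state[:32])


def agrees_with_hashlib(messages):
    """True if the scalar sponge matches hashlib's SHA3-256 for every message."""
    return all(sponge256(m, 0x06) == hashlib.sha3_256(m).digest() for m in messages)


print(agrees_with_hashlib([b"", b"abc", b"x" * 135, b"y" * 136, bytes(range(200))]))
```

In a platform-specific suite, the same idea would apply in the project's own language: run the scalar path and the AVX-512 path on the same inputs, including lengths around the 136-byte rate boundary, and require identical digests, skipping the test when the CPU lacks AVX-512.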

benaadams added a commit that referenced this pull request Oct 7, 2024
benaadams added a commit that referenced this pull request Oct 7, 2024
rjnrohit pushed a commit that referenced this pull request Oct 10, 2024
rjnrohit pushed a commit that referenced this pull request Oct 10, 2024
benaadams added a commit that referenced this pull request Oct 26, 2024
@benaadams benaadams mentioned this pull request Oct 26, 2024
3 tasks

3 participants