Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assembler implementation for arm64 #51

Merged
merged 1 commit into from
Dec 4, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
189 changes: 189 additions & 0 deletions xxhash_arm64.s
Original file line number Diff line number Diff line change
@@ -0,0 +1,189 @@
// +build gc,!purego

#include "textflag.h"

// Register allocation.
#define digest R1
#define h R2 // Return value.
#define p R3 // Input pointer.
#define len R4
#define nblocks R5 // len / 32.
#define prime1 R7
#define prime2 R8
#define prime3 R9
#define prime4 R10
#define prime5 R11
#define v1 R12
#define v2 R13
#define v3 R14
#define v4 R15
#define x1 R20
#define x2 R21
#define x3 R22
#define x4 R23

#define round(acc, x) \
MADD prime2, acc, x, acc \
ROR $64-31, acc \
cespare marked this conversation as resolved.
Show resolved Hide resolved
MUL prime1, acc \

// x = round(0, x).
#define round0(x) \
MUL prime2, x \
ROR $64-31, x \
MUL prime1, x \

#define mergeRound(x) \
round0(x) \
EOR x, h \
MADD h, prime4, prime1, h \

// Update v[1-4] with 32-byte blocks. Assumes len >= 32.
#define blocksLoop() \
LSR $5, len, nblocks \
PCALIGN $16 \
cespare marked this conversation as resolved.
Show resolved Hide resolved
loop: \
LDP.P 32(p), (x1, x2) \
cespare marked this conversation as resolved.
Show resolved Hide resolved
round(v1, x1) \
LDP -16(p), (x3, x4) \
round(v2, x2) \
round(v3, x3) \
round(v4, x4) \
SUB $1, nblocks \
CBNZ nblocks, loop \


// The primes are repeated here to ensure that they're stored
// in a contiguous array, so we can load them with LDP.
DATA primes<> +0(SB)/8, $11400714785074694791
DATA primes<> +8(SB)/8, $14029467366897019727
DATA primes<>+16(SB)/8, $1609587929392839161
DATA primes<>+24(SB)/8, $9650029242287828579
DATA primes<>+32(SB)/8, $2870177450012600261
GLOBL primes<>(SB), NOPTR+RODATA, $40


// func Sum64(b []byte) uint64
TEXT ·Sum64(SB), NOFRAME+NOSPLIT, $0-32
cespare marked this conversation as resolved.
Show resolved Hide resolved
LDP b_base+0(FP), (p, len)

LDP primes<> +0(SB), (prime1, prime2)
LDP primes<>+16(SB), (prime3, prime4)
MOVD primes<>+32(SB), prime5

CMP $32, len
CSEL LT, prime5, ZR, h // if len < 32 { h = prime5 } else { h = 0 }
BLT afterLoop

ADD prime1, prime2, v1
MOVD prime2, v2
MOVD $0, v3
NEG prime1, v4

blocksLoop()

ROR $64-1, v1, x1
ROR $64-7, v2, x2
ADD x1, x2
ROR $64-12, v3, x3
ROR $64-18, v4, x4
ADD x3, x4
ADD x2, x4, h

mergeRound(v1)
mergeRound(v2)
mergeRound(v3)
mergeRound(v4)

afterLoop:
ADD len, h

TBZ $4, len, try8
LDP.P 16(p), (x1, x2)

round0(x1)
ROR $64-27, h
EOR x1 @> 64-27, h, h
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another place where it seems like you have a slightly convoluted sequence which I'm guessing is a performance optimization? To my mind, the obvious way to write these two lines would be

EOR x1, h
ROR $64-27, h

(and the same for many similar pairs of ROR/EOR below).

Same as before, my benchmarks showed essentially no difference with my simplified version. Were you seeing noticeable improvements from sequencing it this way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this is

ROR $64-27, x1
EOR x1, h

except that it doesn't write to x1, but that doesn't matter here.

I don't recall benchmarking this specifically, but I've found that code density in general can matter a lot on smaller devices.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this is

ROR $64-27, x1
EOR x1, h

I don't think that's correct. I double-checked that my two instructions do the same thing as the original (effectively: h = rol27(h^x1)) but mine seems simpler because it's doing the xor first so it doesn't need the ROR addressing mode on the EOR instruction. Your two instructions above seem to be doing something different.

I've found that code density in general can matter a lot on smaller devices.

What do you mean by code density here? It's two instructions either way, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. Sorry, I thought you meant only the last instruction.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After doing more thorough benchmarking I found that ordering the EOR after the ROR with a rotated register was a small, but measurable amount faster for some input sizes on some of my test machines. I ended up leaving that in.

MADD h, prime4, prime1, h

round0(x2)
ROR $64-27, h
EOR x2 @> 64-27, h
MADD h, prime4, prime1, h

try8:
TBZ $3, len, try4
MOVD.P 8(p), x1

round0(x1)
ROR $64-27, h
EOR x1 @> 64-27, h
MADD h, prime4, prime1, h

try4:
TBZ $2, len, try2
MOVWU.P 4(p), x2

MUL prime1, x2
ROR $64-23, h
EOR x2 @> 64-23, h
MADD h, prime3, prime2, h

try2:
TBZ $1, len, try1
MOVHU.P 2(p), x3
AND $255, x3, x1
LSR $8, x3, x2

MUL prime5, x1
ROR $64-11, h
EOR x1 @> 64-11, h
MUL prime1, h

MUL prime5, x2
ROR $64-11, h
EOR x2 @> 64-11, h
MUL prime1, h

try1:
TBZ $0, len, end
MOVBU (p), x4

MUL prime5, x4
ROR $64-11, h
EOR x4 @> 64-11, h
MUL prime1, h

end:
EOR h >> 33, h
MUL prime2, h
EOR h >> 29, h
MUL prime3, h
EOR h >> 32, h

MOVD h, ret+24(FP)
RET


// func writeBlocks(d *Digest, b []byte) int
//
// Assumes len(b) >= 32.
TEXT ·writeBlocks(SB), NOFRAME+NOSPLIT, $0-40
LDP primes<>(SB), (prime1, prime2)

// Load state. Assume v[1-4] are stored contiguously.
MOVD d+0(FP), digest
LDP 0(digest), (v1, v2)
LDP 16(digest), (v3, v4)

LDP b_base+8(FP), (p, len)

blocksLoop()

// Store updated state.
STP (v1, v2), 0(digest)
STP (v3, v4), 16(digest)

BIC $31, len
MOVD len, ret+32(FP)
RET
7 changes: 5 additions & 2 deletions xxhash_amd64.go → xxhash_asm.go
Original file line number Diff line number Diff line change
@@ -1,5 +1,8 @@
//go:build !appengine && gc && !purego
// +build !appengine,gc,!purego
//go:build (amd64 || arm64) && !appengine && gc && !purego
// +build amd64 arm64
// +build !appengine
// +build gc
// +build !purego

package xxhash

Expand Down
4 changes: 2 additions & 2 deletions xxhash_other.go
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
//go:build !amd64 || appengine || !gc || purego
// +build !amd64 appengine !gc purego
//go:build (!amd64 && !arm64) || appengine || !gc || purego
// +build !amd64,!arm64 appengine !gc purego

package xxhash

Expand Down