Assembler version of Sum64String #42

greatroar · 2020-11-03T13:37:06Z

This moves the forwarding of Sum64String to Sum64 into assembler. The Go version cannot be inlined by Go 1.15.3:

./xxhash_unsafe.go:28:6: cannot inline Sum64String: function too complex: cost 91 exceeds budget 80

The assembler version makes Sum64String somewhere between a bit and a lot faster for short strings:

name                                  old time/op    new time/op    delta
Hashes/xxhash,direct,bytes,n=5B-8       5.96ns ± 1%    5.92ns ± 1%   -0.56%  (p=0.000 n=17+18)
Hashes/xxhash,direct,string,n=5B-8      8.71ns ± 1%    5.91ns ± 0%  -32.15%  (p=0.000 n=19+19)
Hashes/xxhash,direct,bytes,n=100B-8     19.2ns ± 2%    18.9ns ± 1%   -1.25%  (p=0.000 n=18+18)
Hashes/xxhash,direct,string,n=100B-8    20.3ns ± 1%    17.7ns ± 1%  -12.74%  (p=0.000 n=17+20)
Hashes/xxhash,direct,bytes,n=4KB-8       278ns ± 1%     276ns ± 1%   -0.54%  (p=0.000 n=19+16)
Hashes/xxhash,direct,string,n=4KB-8      280ns ± 2%     276ns ± 1%   -1.35%  (p=0.000 n=18+20)
Hashes/xxhash,direct,bytes,n=10MB-8      767µs ± 2%     766µs ± 1%     ~     (p=0.775 n=18+19)
Hashes/xxhash,direct,string,n=10MB-8     773µs ± 1%     775µs ± 2%     ~     (p=0.496 n=20+19)

name                                  old speed      new speed      delta
Hashes/xxhash,direct,bytes,n=5B-8      839MB/s ± 1%   844MB/s ± 1%   +0.55%  (p=0.000 n=17+18)
Hashes/xxhash,direct,string,n=5B-8     574MB/s ± 1%   846MB/s ± 0%  +47.38%  (p=0.000 n=19+19)
Hashes/xxhash,direct,bytes,n=100B-8   5.21GB/s ± 2%  5.28GB/s ± 2%   +1.21%  (p=0.000 n=18+19)
Hashes/xxhash,direct,string,n=100B-8  4.92GB/s ± 1%  5.64GB/s ± 2%  +14.57%  (p=0.000 n=17+20)
Hashes/xxhash,direct,bytes,n=4KB-8    14.4GB/s ± 1%  14.5GB/s ± 1%   +0.44%  (p=0.000 n=19+17)
Hashes/xxhash,direct,string,n=4KB-8   14.3GB/s ± 2%  14.5GB/s ± 0%   +1.58%  (p=0.000 n=18+17)
Hashes/xxhash,direct,bytes,n=10MB-8   13.0GB/s ± 2%  13.1GB/s ± 1%     ~     (p=0.775 n=18+19)
Hashes/xxhash,direct,string,n=10MB-8  12.9GB/s ± 1%  12.9GB/s ± 2%     ~     (p=0.496 n=20+19)

The implementation is a bit messy because now there need to be two _unsafe.go files. I didn't see an easy way of avoiding that.

This makes Sum64String somewhere between a bit and a lot faster for short strings: name old time/op new time/op delta Hashes/xxhash,direct,bytes,n=5B-8 5.96ns ± 1% 5.92ns ± 1% -0.56% (p=0.000 n=17+18) Hashes/xxhash,direct,string,n=5B-8 8.71ns ± 1% 5.91ns ± 0% -32.15% (p=0.000 n=19+19) Hashes/xxhash,direct,bytes,n=100B-8 19.2ns ± 2% 18.9ns ± 1% -1.25% (p=0.000 n=18+18) Hashes/xxhash,direct,string,n=100B-8 20.3ns ± 1% 17.7ns ± 1% -12.74% (p=0.000 n=17+20) Hashes/xxhash,direct,bytes,n=4KB-8 278ns ± 1% 276ns ± 1% -0.54% (p=0.000 n=19+16) Hashes/xxhash,direct,string,n=4KB-8 280ns ± 2% 276ns ± 1% -1.35% (p=0.000 n=18+20) Hashes/xxhash,direct,bytes,n=10MB-8 767µs ± 2% 766µs ± 1% ~ (p=0.775 n=18+19) Hashes/xxhash,direct,string,n=10MB-8 773µs ± 1% 775µs ± 2% ~ (p=0.496 n=20+19) name old speed new speed delta Hashes/xxhash,direct,bytes,n=5B-8 839MB/s ± 1% 844MB/s ± 1% +0.55% (p=0.000 n=17+18) Hashes/xxhash,direct,string,n=5B-8 574MB/s ± 1% 846MB/s ± 0% +47.38% (p=0.000 n=19+19) Hashes/xxhash,direct,bytes,n=100B-8 5.21GB/s ± 2% 5.28GB/s ± 2% +1.21% (p=0.000 n=18+19) Hashes/xxhash,direct,string,n=100B-8 4.92GB/s ± 1% 5.64GB/s ± 2% +14.57% (p=0.000 n=17+20) Hashes/xxhash,direct,bytes,n=4KB-8 14.4GB/s ± 1% 14.5GB/s ± 1% +0.44% (p=0.000 n=19+17) Hashes/xxhash,direct,string,n=4KB-8 14.3GB/s ± 2% 14.5GB/s ± 0% +1.58% (p=0.000 n=18+17) Hashes/xxhash,direct,bytes,n=10MB-8 13.0GB/s ± 2% 13.1GB/s ± 1% ~ (p=0.775 n=18+19) Hashes/xxhash,direct,string,n=10MB-8 12.9GB/s ± 1% 12.9GB/s ± 2% ~ (p=0.496 n=20+19)

cespare

Hey @greatroar, thanks for sending this.

A better approach, if possible, would be to convince Go to inline the function. I looked into it a bit and discovered two issues:

The benchmark uses an indirect call which isn't inlined today (more context at Remove duplicated function definitions #22 (comment)). I added new benchmarks which use direct calls in e0ea1e3.
The inliner weights the unsafe conversion much higher than I would expect. I'll file a compiler bug.

I found a workaround for (2) and sent #50. The advantages of that approach are:

It can be inlined in non-amd64 environments (or with -tags purego).
Go (even unsafe Go) is easier to read, write, and debug than assembly.

cespare · 2020-11-20T04:46:32Z

.travis.yml

@@ -7,5 +7,6 @@ go:
  - master
 env:
  - TAGS=""
+  - TAGS="-tags appengine"


I don't think we really care about special appengine support these days. I filed #46.

Test appengine (no unsafe) version on Travis

a50fa75

greatroar force-pushed the sum64string-asm branch 5 times, most recently from 41a2696 to a16de4a Compare November 3, 2020 14:31

greatroar force-pushed the sum64string-asm branch from a16de4a to 518aee4 Compare November 19, 2020 09:45

cespare mentioned this pull request Nov 20, 2020

Allow Sum64String and (*Digest).WriteString to be inlined #50

Merged

cespare reviewed Nov 20, 2020

View reviewed changes

This was referenced Nov 20, 2020

Add an arm64 assembly implementation #45

Closed

Assembler implementation for arm64 #51

Merged

greatroar closed this Dec 6, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Assembler version of Sum64String #42

Assembler version of Sum64String #42

greatroar commented Nov 3, 2020 •

edited

Loading

cespare left a comment

cespare Nov 20, 2020

Assembler version of Sum64String #42

Assembler version of Sum64String #42

Conversation

greatroar commented Nov 3, 2020 • edited Loading

cespare left a comment

Choose a reason for hiding this comment

cespare Nov 20, 2020

Choose a reason for hiding this comment

greatroar commented Nov 3, 2020 •

edited

Loading