Allow Sum64String and (*Digest).WriteString to be inlined #50

cespare · 2020-11-20T04:41:09Z

This is an alternative approach to #42.

Ideally the compiler would do mid-stack inlining for Sum64String since it's a minimal unsafe wrapper around Sum64:

func Sum64String(s string) uint64 {
	var b []byte
	bh := (*reflect.SliceHeader)(unsafe.Pointer(&b))
	bh.Data = (*reflect.StringHeader)(unsafe.Pointer(&s)).Data
	bh.Len = len(s)
	bh.Cap = len(s)
	return Sum64(b)
}

Unfortunately, the weight the inliner computes is too high. I filed golang/go#42739.

In the meantime, I found some tricks (with help from @josharian) for writing the conversion so that the inliner computes a lower cost, which gets us below the inlining threshold.

Additionally, add tests to confirm that Sum64String and (*Digest).WriteString are inlined.
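One way such an inlining check could work is to build with `-gcflags=-m` and scan the compiler diagnostics for a `can inline` line per function. The following is only a sketch of the scanning half under that assumption. The helper name `inlinable` and the sample diagnostic text (file names, positions) are made up for illustration and are not taken from this PR's test.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// inlinable scans compiler diagnostics (the kind printed when building with
// -gcflags=-m) and reports, for each wanted name, whether a "can inline"
// line was seen for it. Hypothetical helper, not the test added in this PR.
func inlinable(diag string, want []string) map[string]bool {
	found := make(map[string]bool, len(want))
	for _, name := range want {
		found[name] = false
	}
	const marker = "can inline "
	sc := bufio.NewScanner(strings.NewReader(diag))
	for sc.Scan() {
		line := sc.Text()
		i := strings.Index(line, marker)
		if i < 0 {
			continue
		}
		// The function name is the first token after the marker; with
		// -m=2 it is followed by cost details, which are ignored here.
		fields := strings.Fields(line[i+len(marker):])
		if len(fields) == 0 {
			continue
		}
		if _, ok := found[fields[0]]; ok {
			found[fields[0]] = true
		}
	}
	return found
}

func main() {
	// Illustrative sample text only; positions and messages are assumed.
	sample := strings.Join([]string{
		"./xxhash_unsafe.go:10:6: can inline Sum64String",
		"./xxhash_unsafe.go:20:6: can inline (*Digest).WriteString",
		"./xxhash.go:30:6: cannot inline Sum64: function too complex",
	}, "\n")

	want := []string{"Sum64String", "(*Digest).WriteString"}
	got := inlinable(sample, want)
	for _, name := range want {
		fmt.Printf("%s inlinable: %v\n", name, got[name])
	}
}
```

A real test would feed this the combined output of the build command instead of a fixed string, and fail if any wanted name is missing.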

Benchmarks:

name                  old time/op    new time/op    delta
Sum64String/4B-12       4.78ns ± 1%    3.57ns ± 4%  -25.27%  (p=0.000 n=8+10)
Sum64String/100B-12     14.5ns ± 1%    12.9ns ± 0%  -10.76%  (p=0.000 n=9+10)
Sum64String/4KB-12       229ns ± 0%     229ns ± 1%     ~     (p=0.395 n=7+10)
Sum64String/10MB-12      628µs ± 1%     630µs ± 2%     ~     (p=1.000 n=9+10)
DigestString/4B-12      11.4ns ± 1%     9.7ns ± 1%  -14.95%  (p=0.000 n=10+10)
DigestString/100B-12    23.6ns ± 1%    21.3ns ± 2%   -9.65%  (p=0.000 n=10+10)
DigestString/4KB-12      241ns ± 1%     239ns ± 0%   -0.67%  (p=0.001 n=10+7)
DigestString/10MB-12     627µs ± 1%     628µs ± 1%     ~     (p=0.631 n=10+10)

name                  old speed      new speed      delta
Sum64String/4B-12      837MB/s ± 1%  1124MB/s ± 2%  +34.42%  (p=0.000 n=10+9)
Sum64String/100B-12   6.88GB/s ± 2%  7.72GB/s ± 1%  +12.16%  (p=0.000 n=10+10)
Sum64String/4KB-12    17.5GB/s ± 0%  17.5GB/s ± 1%     ~     (p=0.408 n=8+10)
Sum64String/10MB-12   15.9GB/s ± 1%  15.9GB/s ± 2%     ~     (p=1.000 n=9+10)
DigestString/4B-12     350MB/s ± 1%   411MB/s ± 1%  +17.55%  (p=0.000 n=10+10)
DigestString/100B-12  4.23GB/s ± 1%  4.69GB/s ± 1%  +10.84%  (p=0.000 n=10+9)
DigestString/4KB-12   16.6GB/s ± 1%  16.7GB/s ± 0%   +0.67%  (p=0.001 n=10+8)
DigestString/10MB-12  16.0GB/s ± 1%  15.9GB/s ± 1%     ~     (p=0.631 n=10+10)

And with -tags purego:

name                  old time/op    new time/op    delta
Sum64String/4B-12       5.57ns ± 1%    4.22ns ± 1%  -24.14%  (p=0.000 n=10+9)
Sum64String/100B-12     16.0ns ± 1%    14.8ns ± 0%   -7.27%  (p=0.000 n=10+6)
Sum64String/4KB-12       327ns ± 2%     325ns ± 1%     ~     (p=0.050 n=10+10)
Sum64String/10MB-12      866µs ± 3%     856µs ± 0%   -1.05%  (p=0.002 n=9+8)
DigestString/4B-12      11.2ns ± 1%    10.0ns ± 1%  -10.90%  (p=0.000 n=10+9)
DigestString/100B-12    25.5ns ± 1%    22.8ns ± 0%  -10.62%  (p=0.000 n=10+9)
DigestString/4KB-12      342ns ± 1%     340ns ± 1%   -0.56%  (p=0.018 n=9+10)
DigestString/10MB-12     877µs ± 1%     878µs ± 2%     ~     (p=0.400 n=10+9)

name                  old speed      new speed      delta
Sum64String/4B-12      718MB/s ± 1%   947MB/s ± 1%  +31.82%  (p=0.000 n=10+9)
Sum64String/100B-12   6.26GB/s ± 1%  6.75GB/s ± 1%   +7.81%  (p=0.000 n=10+10)
Sum64String/4KB-12    12.2GB/s ± 2%  12.3GB/s ± 1%   +0.70%  (p=0.022 n=10+9)
Sum64String/10MB-12   11.6GB/s ± 3%  11.7GB/s ± 0%   +1.05%  (p=0.002 n=9+8)
DigestString/4B-12     357MB/s ± 1%   401MB/s ± 1%  +12.32%  (p=0.000 n=10+9)
DigestString/100B-12  3.93GB/s ± 1%  4.40GB/s ± 0%  +11.95%  (p=0.000 n=10+9)
DigestString/4KB-12   11.7GB/s ± 1%  11.8GB/s ± 1%   +0.68%  (p=0.011 n=10+10)
DigestString/10MB-12  11.4GB/s ± 1%  11.4GB/s ± 2%     ~     (p=0.400 n=10+9)

/cc @greatroar

greatroar · 2020-11-20T09:01:29Z

xxhash_unsafe.go

+	// this way gets us down to the max cost (barely).
+
+	*(*string)(unsafe.Pointer(&b)) = s
+	(*sliceHeader)(unsafe.Pointer(&b)).cap = len(s)


I suppose this should work as long as the memory layout of []byte is exactly that of sliceHeader (see pattern (1) in the unsafe.Pointer docs). Between the two statements, b briefly has a nonzero length but zero capacity, so the only way I can see this breaking is if the compiler or GC at some point started checking that cap(b) >= len(b). That would also break the existing code, which sets Len before Cap (but that would be easier to fix).
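To make the layout argument concrete, here is a minimal standalone sketch of the two-statement conversion from the diff. The `sliceHeader` definition below is an assumption (three words: data pointer, length, capacity), since the diff excerpt does not show it, and `stringToBytes` is a hypothetical wrapper name.

```go
package main

import (
	"fmt"
	"unsafe"
)

// sliceHeader is an assumed mirror of the runtime layout of a slice.
// The definition used in the PR is not shown in the excerpt above.
type sliceHeader struct {
	data unsafe.Pointer
	len  int
	cap  int
}

// stringToBytes returns a []byte that aliases the bytes of s without copying.
// The result must never be written to, because string data is immutable.
func stringToBytes(s string) []byte {
	var b []byte
	// A string header is (data, len), which matches the first two words of
	// a slice header, so this store fills in b's data pointer and length.
	*(*string)(unsafe.Pointer(&b)) = s
	// At this point len(b) == len(s) but cap(b) == 0; set the capacity.
	(*sliceHeader)(unsafe.Pointer(&b)).cap = len(s)
	return b
}

func main() {
	b := stringToBytes("hello")
	fmt.Println(len(b), cap(b), string(b)) // prints "5 5 hello"
}
```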

greatroar · 2020-11-20T09:43:51Z

I ran xxhashbench on this, and it doesn't quite give the speedup that the assembler version gives.

Linux/amd64, old=this PR, new=#42:

name                                  old time/op    new time/op    delta
Hashes/xxhash,direct,bytes,n=5B-8       5.95ns ± 2%    5.92ns ± 1%   -0.39%  (p=0.028 n=18+18)
Hashes/xxhash,direct,string,n=5B-8      7.92ns ± 1%    5.91ns ± 0%  -25.39%  (p=0.000 n=19+19)
Hashes/xxhash,direct,bytes,n=100B-8     19.2ns ± 2%    18.9ns ± 1%   -1.44%  (p=0.000 n=19+18)
Hashes/xxhash,direct,string,n=100B-8    19.5ns ± 1%    17.7ns ± 1%   -9.16%  (p=0.000 n=17+20)
Hashes/xxhash,direct,bytes,n=4KB-8       278ns ± 1%     276ns ± 1%   -0.67%  (p=0.000 n=18+16)
Hashes/xxhash,direct,string,n=4KB-8      280ns ± 1%     276ns ± 1%   -1.29%  (p=0.000 n=20+20)
Hashes/xxhash,direct,bytes,n=10MB-8      769µs ± 1%     766µs ± 1%   -0.40%  (p=0.020 n=18+19)
Hashes/xxhash,direct,string,n=10MB-8     772µs ± 2%     775µs ± 2%     ~     (p=0.191 n=19+19)

name                                  old speed      new speed      delta
Hashes/xxhash,direct,bytes,n=5B-8      840MB/s ± 2%   844MB/s ± 1%   +0.48%  (p=0.017 n=19+18)
Hashes/xxhash,direct,string,n=5B-8     631MB/s ± 1%   846MB/s ± 0%  +34.04%  (p=0.000 n=19+19)
Hashes/xxhash,direct,bytes,n=100B-8   5.20GB/s ± 2%  5.28GB/s ± 2%   +1.51%  (p=0.000 n=19+19)
Hashes/xxhash,direct,string,n=100B-8  5.12GB/s ± 1%  5.64GB/s ± 2%  +10.05%  (p=0.000 n=18+20)
Hashes/xxhash,direct,bytes,n=4KB-8    14.4GB/s ± 2%  14.5GB/s ± 1%   +0.48%  (p=0.002 n=17+17)
Hashes/xxhash,direct,string,n=4KB-8   14.3GB/s ± 1%  14.5GB/s ± 0%   +1.53%  (p=0.000 n=20+17)
Hashes/xxhash,direct,bytes,n=10MB-8   13.0GB/s ± 1%  13.1GB/s ± 1%   +0.40%  (p=0.020 n=18+19)
Hashes/xxhash,direct,string,n=10MB-8  13.0GB/s ± 2%  12.9GB/s ± 2%     ~     (p=0.191 n=19+19)

My actual input strings are somewhere between the 5 and 100B marks, typically.

cespare · 2020-11-20T10:02:03Z

Yes, that's because xxhashbench uses indirect calls. What I was getting at in #42 (review) and #22 (comment) is that I don't think the indirect call performance matters that much.

I intend to delete xxhashbench (or at least rework it to use only direct calls). For now, I think we can focus on the new benchmarks I added in the xxhash package itself.
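As a rough illustration of the direct versus indirect distinction, the sketch below calls the same function once by name and once through a function value. The `sum` function is a hypothetical FNV-1a-style stand-in, not xxhash, and the exact way xxhashbench dispatches its calls is assumed here. A call by name is a candidate for inlining, while the compiler generally cannot inline a call through a function value whose target is only known at run time.

```go
package main

import "fmt"

// sum is a hypothetical stand-in for a small hash function (FNV-1a style).
// It is not xxhash; it only gives the two call paths something to call.
func sum(s string) uint64 {
	var h uint64 = 14695981039346656037
	for i := 0; i < len(s); i++ {
		h ^= uint64(s[i])
		h *= 1099511628211
	}
	return h
}

// hashFunc holds the function in a package-level variable, so calls through
// it are indirect calls.
var hashFunc func(string) uint64 = sum

// direct calls sum by name; the compiler may inline this call.
func direct(s string) uint64 { return sum(s) }

// indirect calls through the function value; this call is generally not
// inlined because the callee is not known statically.
func indirect(s string) uint64 { return hashFunc(s) }

func main() {
	// Both paths compute the same value; only the call overhead differs.
	fmt.Println(direct("abcd") == indirect("abcd"))
}
```

For short inputs such as the 4B and 5B cases in the tables above, the per-call overhead is a larger share of the total time, which is consistent with the deltas shrinking as the input grows.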

greatroar · 2020-11-20T10:23:45Z

Sorry, I read that but I was still running the wrong benchmarks. Here are the main package benchmarks, cherry-picked onto #42 (old) vs. this PR (new):

name                 old speed      new speed      delta
Sum64/4B-8            733MB/s ± 1%   796MB/s ± 0%  +8.63%  (p=0.000 n=10+10)
Sum64/100B-8         5.44GB/s ± 1%  5.56GB/s ± 1%  +2.11%  (p=0.000 n=10+10)
Sum64/4KB-8          14.3GB/s ± 1%  14.4GB/s ± 1%  +0.88%  (p=0.000 n=10+9)
Sum64/10MB-8         12.9GB/s ± 2%  12.9GB/s ± 2%    ~     (p=0.605 n=9+9)
Sum64String/4B-8      769MB/s ± 1%   810MB/s ± 1%  +5.28%  (p=0.000 n=10+9)
Sum64String/100B-8   5.73GB/s ± 1%  5.69GB/s ± 1%  -0.80%  (p=0.002 n=10+10)
Sum64String/4KB-8    14.5GB/s ± 1%  14.5GB/s ± 1%    ~     (p=0.297 n=9+9)
Sum64String/10MB-8   13.1GB/s ± 1%  13.0GB/s ± 1%    ~     (p=0.089 n=10+10)
DigestBytes/4B-8      267MB/s ± 0%   266MB/s ± 1%    ~     (p=0.079 n=9+10)
DigestBytes/100B-8   3.46GB/s ± 0%  3.46GB/s ± 0%    ~     (p=0.546 n=9+9)
DigestBytes/4KB-8    13.7GB/s ± 0%  13.6GB/s ± 0%  -0.43%  (p=0.000 n=10+10)
DigestBytes/10MB-8   13.0GB/s ± 1%  13.0GB/s ± 0%    ~     (p=0.758 n=9+7)
DigestString/4B-8     230MB/s ± 1%   228MB/s ± 2%    ~     (p=0.075 n=10+10)
DigestString/100B-8  2.83GB/s ± 1%  2.84GB/s ± 1%    ~     (p=0.631 n=10+10)
DigestString/4KB-8   13.4GB/s ± 3%  13.5GB/s ± 1%    ~     (p=0.661 n=10+9)
DigestString/10MB-8  13.0GB/s ± 1%  13.0GB/s ± 0%    ~     (p=0.074 n=8+9)

So this version is actually quite a bit faster in the common case of direct calls.

cespare · 2020-11-20T20:48:51Z

With help from @josharian I found a different way to write the conversions that allows both Sum64String and (*Digest).WriteString to be inlined. I updated the description and benchmarks.

I'm pretty happy with this for now so I think I'll merge.


cespare mentioned this pull request Nov 20, 2020

cmd/compile: make the inliner assign unsafe conversions lower costs golang/go#42739

Closed

greatroar reviewed Nov 20, 2020


greatroar mentioned this pull request Nov 20, 2020

Assembler implementation for arm64 #51

Merged

cespare force-pushed the sum64string-inline branch from 2e6c845 to 0dac3a2 Compare November 20, 2020 20:40

cespare changed the title ~~Trick the inliner into accepting Sum64String~~ Allow Sum64String and (*Digest).WriteString to be inlined Nov 20, 2020

cespare force-pushed the sum64string-inline branch from 0dac3a2 to a7909af Compare November 20, 2020 20:50

cespare merged commit a7909af into master Nov 20, 2020
