Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow Sum64String and (*Digest).WriteString to be inlined #50

Merged
merged 1 commit into from
Nov 20, 2020

Conversation

cespare
Copy link
Owner

@cespare cespare commented Nov 20, 2020

This is an alternative approach to #42.

Ideally the compiler would do mid-stack inlining for Sum64String since it's a minimal unsafe wrapper around Sum64:

func Sum64String(s string) uint64 {
	var b []byte
	bh := (*reflect.SliceHeader)(unsafe.Pointer(&b))
	bh.Data = (*reflect.StringHeader)(unsafe.Pointer(&s)).Data
	bh.Len = len(s)
	bh.Cap = len(s)
	return Sum64(b)
}

Unfortunately, the weight the inliner computes is too high. I filed golang/go#42739.

In the meantime, I found some tricks (with help from @josharian) to generate a lower cost that gets us below the threshold value.

Additionally, add tests to confirm that Sum64String and (*Digest).WriteString are inlined.

Benchmarks:

name                  old time/op    new time/op    delta
Sum64String/4B-12       4.78ns ± 1%    3.57ns ± 4%  -25.27%  (p=0.000 n=8+10)
Sum64String/100B-12     14.5ns ± 1%    12.9ns ± 0%  -10.76%  (p=0.000 n=9+10)
Sum64String/4KB-12       229ns ± 0%     229ns ± 1%     ~     (p=0.395 n=7+10)
Sum64String/10MB-12      628µs ± 1%     630µs ± 2%     ~     (p=1.000 n=9+10)
DigestString/4B-12      11.4ns ± 1%     9.7ns ± 1%  -14.95%  (p=0.000 n=10+10)
DigestString/100B-12    23.6ns ± 1%    21.3ns ± 2%   -9.65%  (p=0.000 n=10+10)
DigestString/4KB-12      241ns ± 1%     239ns ± 0%   -0.67%  (p=0.001 n=10+7)
DigestString/10MB-12     627µs ± 1%     628µs ± 1%     ~     (p=0.631 n=10+10)

name                  old speed      new speed      delta
Sum64String/4B-12      837MB/s ± 1%  1124MB/s ± 2%  +34.42%  (p=0.000 n=10+9)
Sum64String/100B-12   6.88GB/s ± 2%  7.72GB/s ± 1%  +12.16%  (p=0.000 n=10+10)
Sum64String/4KB-12    17.5GB/s ± 0%  17.5GB/s ± 1%     ~     (p=0.408 n=8+10)
Sum64String/10MB-12   15.9GB/s ± 1%  15.9GB/s ± 2%     ~     (p=1.000 n=9+10)
DigestString/4B-12     350MB/s ± 1%   411MB/s ± 1%  +17.55%  (p=0.000 n=10+10)
DigestString/100B-12  4.23GB/s ± 1%  4.69GB/s ± 1%  +10.84%  (p=0.000 n=10+9)
DigestString/4KB-12   16.6GB/s ± 1%  16.7GB/s ± 0%   +0.67%  (p=0.001 n=10+8)
DigestString/10MB-12  16.0GB/s ± 1%  15.9GB/s ± 1%     ~     (p=0.631 n=10+10)

And with -tags purego:

name                  old time/op    new time/op    delta
Sum64String/4B-12       5.57ns ± 1%    4.22ns ± 1%  -24.14%  (p=0.000 n=10+9)
Sum64String/100B-12     16.0ns ± 1%    14.8ns ± 0%   -7.27%  (p=0.000 n=10+6)
Sum64String/4KB-12       327ns ± 2%     325ns ± 1%     ~     (p=0.050 n=10+10)
Sum64String/10MB-12      866µs ± 3%     856µs ± 0%   -1.05%  (p=0.002 n=9+8)
DigestString/4B-12      11.2ns ± 1%    10.0ns ± 1%  -10.90%  (p=0.000 n=10+9)
DigestString/100B-12    25.5ns ± 1%    22.8ns ± 0%  -10.62%  (p=0.000 n=10+9)
DigestString/4KB-12      342ns ± 1%     340ns ± 1%   -0.56%  (p=0.018 n=9+10)
DigestString/10MB-12     877µs ± 1%     878µs ± 2%     ~     (p=0.400 n=10+9)

name                  old speed      new speed      delta
Sum64String/4B-12      718MB/s ± 1%   947MB/s ± 1%  +31.82%  (p=0.000 n=10+9)
Sum64String/100B-12   6.26GB/s ± 1%  6.75GB/s ± 1%   +7.81%  (p=0.000 n=10+10)
Sum64String/4KB-12    12.2GB/s ± 2%  12.3GB/s ± 1%   +0.70%  (p=0.022 n=10+9)
Sum64String/10MB-12   11.6GB/s ± 3%  11.7GB/s ± 0%   +1.05%  (p=0.002 n=9+8)
DigestString/4B-12     357MB/s ± 1%   401MB/s ± 1%  +12.32%  (p=0.000 n=10+9)
DigestString/100B-12  3.93GB/s ± 1%  4.40GB/s ± 0%  +11.95%  (p=0.000 n=10+9)
DigestString/4KB-12   11.7GB/s ± 1%  11.8GB/s ± 1%   +0.68%  (p=0.011 n=10+10)
DigestString/10MB-12  11.4GB/s ± 1%  11.4GB/s ± 2%     ~     (p=0.400 n=10+9)

/cc @greatroar

xxhash_unsafe.go Outdated
// this way gets us down to the max cost (barely).

*(*string)(unsafe.Pointer(&b)) = s
(*sliceHeader)(unsafe.Pointer(&b)).cap = len(s)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this should work as long as the memory layout of []byte is exactly that of sliceHeader (unsafe.Pointer docs promise (1)). The only way I can see this break is if the compiler or GC would at some point start checking that cap(b) >= len(b), which would also break the existing code (but that would be easier to fix).

@greatroar
Copy link
Contributor

I ran xxhashbench on this, and it doesn't quite give the speedup that the assembler version gives.

Linux/amd64, old=this PR, new=#42:

name                                  old time/op    new time/op    delta
Hashes/xxhash,direct,bytes,n=5B-8       5.95ns ± 2%    5.92ns ± 1%   -0.39%  (p=0.028 n=18+18)
Hashes/xxhash,direct,string,n=5B-8      7.92ns ± 1%    5.91ns ± 0%  -25.39%  (p=0.000 n=19+19)
Hashes/xxhash,direct,bytes,n=100B-8     19.2ns ± 2%    18.9ns ± 1%   -1.44%  (p=0.000 n=19+18)
Hashes/xxhash,direct,string,n=100B-8    19.5ns ± 1%    17.7ns ± 1%   -9.16%  (p=0.000 n=17+20)
Hashes/xxhash,direct,bytes,n=4KB-8       278ns ± 1%     276ns ± 1%   -0.67%  (p=0.000 n=18+16)
Hashes/xxhash,direct,string,n=4KB-8      280ns ± 1%     276ns ± 1%   -1.29%  (p=0.000 n=20+20)
Hashes/xxhash,direct,bytes,n=10MB-8      769µs ± 1%     766µs ± 1%   -0.40%  (p=0.020 n=18+19)
Hashes/xxhash,direct,string,n=10MB-8     772µs ± 2%     775µs ± 2%     ~     (p=0.191 n=19+19)

name                                  old speed      new speed      delta
Hashes/xxhash,direct,bytes,n=5B-8      840MB/s ± 2%   844MB/s ± 1%   +0.48%  (p=0.017 n=19+18)
Hashes/xxhash,direct,string,n=5B-8     631MB/s ± 1%   846MB/s ± 0%  +34.04%  (p=0.000 n=19+19)
Hashes/xxhash,direct,bytes,n=100B-8   5.20GB/s ± 2%  5.28GB/s ± 2%   +1.51%  (p=0.000 n=19+19)
Hashes/xxhash,direct,string,n=100B-8  5.12GB/s ± 1%  5.64GB/s ± 2%  +10.05%  (p=0.000 n=18+20)
Hashes/xxhash,direct,bytes,n=4KB-8    14.4GB/s ± 2%  14.5GB/s ± 1%   +0.48%  (p=0.002 n=17+17)
Hashes/xxhash,direct,string,n=4KB-8   14.3GB/s ± 1%  14.5GB/s ± 0%   +1.53%  (p=0.000 n=20+17)
Hashes/xxhash,direct,bytes,n=10MB-8   13.0GB/s ± 1%  13.1GB/s ± 1%   +0.40%  (p=0.020 n=18+19)
Hashes/xxhash,direct,string,n=10MB-8  13.0GB/s ± 2%  12.9GB/s ± 2%     ~     (p=0.191 n=19+19)

My actual input strings are somewhere between the 5 and 100B marks, typically.

@cespare
Copy link
Owner Author

cespare commented Nov 20, 2020

Yes, that's because xxhashbench uses indirect calls. What I was getting at in #42 (review) and #22 (comment) is that I don't think the indirect call performance matters that much.

I intend to delete xxhashbench (or at least rework it to use only direct calls). For now, I think we can focus on the new benchmarks I added in the xxhash package itself.

@greatroar
Copy link
Contributor

Sorry, I read that but I was still running the wrong benchmarks. Here's the main package ones cherry-picked onto #42 (old) vs. this PR (new):

name                 old speed      new speed      delta
Sum64/4B-8            733MB/s ± 1%   796MB/s ± 0%  +8.63%  (p=0.000 n=10+10)
Sum64/100B-8         5.44GB/s ± 1%  5.56GB/s ± 1%  +2.11%  (p=0.000 n=10+10)
Sum64/4KB-8          14.3GB/s ± 1%  14.4GB/s ± 1%  +0.88%  (p=0.000 n=10+9)
Sum64/10MB-8         12.9GB/s ± 2%  12.9GB/s ± 2%    ~     (p=0.605 n=9+9)
Sum64String/4B-8      769MB/s ± 1%   810MB/s ± 1%  +5.28%  (p=0.000 n=10+9)
Sum64String/100B-8   5.73GB/s ± 1%  5.69GB/s ± 1%  -0.80%  (p=0.002 n=10+10)
Sum64String/4KB-8    14.5GB/s ± 1%  14.5GB/s ± 1%    ~     (p=0.297 n=9+9)
Sum64String/10MB-8   13.1GB/s ± 1%  13.0GB/s ± 1%    ~     (p=0.089 n=10+10)
DigestBytes/4B-8      267MB/s ± 0%   266MB/s ± 1%    ~     (p=0.079 n=9+10)
DigestBytes/100B-8   3.46GB/s ± 0%  3.46GB/s ± 0%    ~     (p=0.546 n=9+9)
DigestBytes/4KB-8    13.7GB/s ± 0%  13.6GB/s ± 0%  -0.43%  (p=0.000 n=10+10)
DigestBytes/10MB-8   13.0GB/s ± 1%  13.0GB/s ± 0%    ~     (p=0.758 n=9+7)
DigestString/4B-8     230MB/s ± 1%   228MB/s ± 2%    ~     (p=0.075 n=10+10)
DigestString/100B-8  2.83GB/s ± 1%  2.84GB/s ± 1%    ~     (p=0.631 n=10+10)
DigestString/4KB-8   13.4GB/s ± 3%  13.5GB/s ± 1%    ~     (p=0.661 n=10+9)
DigestString/10MB-8  13.0GB/s ± 1%  13.0GB/s ± 0%    ~     (p=0.074 n=8+9)

So this version is actually quite a bit faster in the common case of direct calls.

@cespare cespare changed the title Trick the inliner into accepting Sum64String Allow Sum64String and (*Digest).WriteString to be inlined Nov 20, 2020
@cespare
Copy link
Owner Author

cespare commented Nov 20, 2020

With help from @josharian I got a different way to write the conversions that allow both Sum64String and (*Digest).WriteString to be inlined. I updated the description and benchmarks.

I'm pretty happy with this for now so I think I'll merge.

Benchmarks:

name                  old time/op    new time/op    delta
Sum64String/4B-12       4.78ns ± 1%    3.57ns ± 4%  -25.27%  (p=0.000 n=8+10)
Sum64String/100B-12     14.5ns ± 1%    12.9ns ± 0%  -10.76%  (p=0.000 n=9+10)
Sum64String/4KB-12       229ns ± 0%     229ns ± 1%     ~     (p=0.395 n=7+10)
Sum64String/10MB-12      628µs ± 1%     630µs ± 2%     ~     (p=1.000 n=9+10)
DigestString/4B-12      11.4ns ± 1%     9.7ns ± 1%  -14.95%  (p=0.000 n=10+10)
DigestString/100B-12    23.6ns ± 1%    21.3ns ± 2%   -9.65%  (p=0.000 n=10+10)
DigestString/4KB-12      241ns ± 1%     239ns ± 0%   -0.67%  (p=0.001 n=10+7)
DigestString/10MB-12     627µs ± 1%     628µs ± 1%     ~     (p=0.631 n=10+10)

name                  old speed      new speed      delta
Sum64String/4B-12      837MB/s ± 1%  1124MB/s ± 2%  +34.42%  (p=0.000 n=10+9)
Sum64String/100B-12   6.88GB/s ± 2%  7.72GB/s ± 1%  +12.16%  (p=0.000 n=10+10)
Sum64String/4KB-12    17.5GB/s ± 0%  17.5GB/s ± 1%     ~     (p=0.408 n=8+10)
Sum64String/10MB-12   15.9GB/s ± 1%  15.9GB/s ± 2%     ~     (p=1.000 n=9+10)
DigestString/4B-12     350MB/s ± 1%   411MB/s ± 1%  +17.55%  (p=0.000 n=10+10)
DigestString/100B-12  4.23GB/s ± 1%  4.69GB/s ± 1%  +10.84%  (p=0.000 n=10+9)
DigestString/4KB-12   16.6GB/s ± 1%  16.7GB/s ± 0%   +0.67%  (p=0.001 n=10+8)
DigestString/10MB-12  16.0GB/s ± 1%  15.9GB/s ± 1%     ~     (p=0.631 n=10+10)

And with -tags purego:

name                  old time/op    new time/op    delta
Sum64String/4B-12       5.57ns ± 1%    4.22ns ± 1%  -24.14%  (p=0.000 n=10+9)
Sum64String/100B-12     16.0ns ± 1%    14.8ns ± 0%   -7.27%  (p=0.000 n=10+6)
Sum64String/4KB-12       327ns ± 2%     325ns ± 1%     ~     (p=0.050 n=10+10)
Sum64String/10MB-12      866µs ± 3%     856µs ± 0%   -1.05%  (p=0.002 n=9+8)
DigestString/4B-12      11.2ns ± 1%    10.0ns ± 1%  -10.90%  (p=0.000 n=10+9)
DigestString/100B-12    25.5ns ± 1%    22.8ns ± 0%  -10.62%  (p=0.000 n=10+9)
DigestString/4KB-12      342ns ± 1%     340ns ± 1%   -0.56%  (p=0.018 n=9+10)
DigestString/10MB-12     877µs ± 1%     878µs ± 2%     ~     (p=0.400 n=10+9)

name                  old speed      new speed      delta
Sum64String/4B-12      718MB/s ± 1%   947MB/s ± 1%  +31.82%  (p=0.000 n=10+9)
Sum64String/100B-12   6.26GB/s ± 1%  6.75GB/s ± 1%   +7.81%  (p=0.000 n=10+10)
Sum64String/4KB-12    12.2GB/s ± 2%  12.3GB/s ± 1%   +0.70%  (p=0.022 n=10+9)
Sum64String/10MB-12   11.6GB/s ± 3%  11.7GB/s ± 0%   +1.05%  (p=0.002 n=9+8)
DigestString/4B-12     357MB/s ± 1%   401MB/s ± 1%  +12.32%  (p=0.000 n=10+9)
DigestString/100B-12  3.93GB/s ± 1%  4.40GB/s ± 0%  +11.95%  (p=0.000 n=10+9)
DigestString/4KB-12   11.7GB/s ± 1%  11.8GB/s ± 1%   +0.68%  (p=0.011 n=10+10)
DigestString/10MB-12  11.4GB/s ± 1%  11.4GB/s ± 2%     ~     (p=0.400 n=10+9)
@cespare cespare merged commit a7909af into master Nov 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants