-
Notifications
You must be signed in to change notification settings - Fork 124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow Sum64String and (*Digest).WriteString to be inlined #50
Conversation
xxhash_unsafe.go
Outdated
// this way gets us down to the max cost (barely). | ||
|
||
*(*string)(unsafe.Pointer(&b)) = s | ||
(*sliceHeader)(unsafe.Pointer(&b)).cap = len(s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose this should work as long as the memory layout of []byte is exactly that of sliceHeader (unsafe.Pointer docs promise (1)). The only way I can see this break is if the compiler or GC would at some point start checking that cap(b) >= len(b), which would also break the existing code (but that would be easier to fix).
I ran xxhashbench on this, and it doesn't quite give the speedup that the assembler version gives. Linux/amd64, old=this PR, new=#42:
My actual input strings are somewhere between the 5 and 100B marks, typically. |
Yes, that's because xxhashbench uses indirect calls. What I was getting at in #42 (review) and #22 (comment) is that I don't think the indirect call performance matters that much. I intend to delete xxhashbench (or at least rework it to use only direct calls). For now, I think we can focus on the new benchmarks I added in the xxhash package itself. |
Sorry, I read that but I was still running the wrong benchmarks. Here's the main package ones cherry-picked onto #42 (old) vs. this PR (new):
So this version is actually quite a bit faster in the common case of direct calls. |
2e6c845
to
0dac3a2
Compare
With help from @josharian I got a different way to write the conversions that allow both I'm pretty happy with this for now so I think I'll merge. |
Benchmarks: name old time/op new time/op delta Sum64String/4B-12 4.78ns ± 1% 3.57ns ± 4% -25.27% (p=0.000 n=8+10) Sum64String/100B-12 14.5ns ± 1% 12.9ns ± 0% -10.76% (p=0.000 n=9+10) Sum64String/4KB-12 229ns ± 0% 229ns ± 1% ~ (p=0.395 n=7+10) Sum64String/10MB-12 628µs ± 1% 630µs ± 2% ~ (p=1.000 n=9+10) DigestString/4B-12 11.4ns ± 1% 9.7ns ± 1% -14.95% (p=0.000 n=10+10) DigestString/100B-12 23.6ns ± 1% 21.3ns ± 2% -9.65% (p=0.000 n=10+10) DigestString/4KB-12 241ns ± 1% 239ns ± 0% -0.67% (p=0.001 n=10+7) DigestString/10MB-12 627µs ± 1% 628µs ± 1% ~ (p=0.631 n=10+10) name old speed new speed delta Sum64String/4B-12 837MB/s ± 1% 1124MB/s ± 2% +34.42% (p=0.000 n=10+9) Sum64String/100B-12 6.88GB/s ± 2% 7.72GB/s ± 1% +12.16% (p=0.000 n=10+10) Sum64String/4KB-12 17.5GB/s ± 0% 17.5GB/s ± 1% ~ (p=0.408 n=8+10) Sum64String/10MB-12 15.9GB/s ± 1% 15.9GB/s ± 2% ~ (p=1.000 n=9+10) DigestString/4B-12 350MB/s ± 1% 411MB/s ± 1% +17.55% (p=0.000 n=10+10) DigestString/100B-12 4.23GB/s ± 1% 4.69GB/s ± 1% +10.84% (p=0.000 n=10+9) DigestString/4KB-12 16.6GB/s ± 1% 16.7GB/s ± 0% +0.67% (p=0.001 n=10+8) DigestString/10MB-12 16.0GB/s ± 1% 15.9GB/s ± 1% ~ (p=0.631 n=10+10) And with -tags purego: name old time/op new time/op delta Sum64String/4B-12 5.57ns ± 1% 4.22ns ± 1% -24.14% (p=0.000 n=10+9) Sum64String/100B-12 16.0ns ± 1% 14.8ns ± 0% -7.27% (p=0.000 n=10+6) Sum64String/4KB-12 327ns ± 2% 325ns ± 1% ~ (p=0.050 n=10+10) Sum64String/10MB-12 866µs ± 3% 856µs ± 0% -1.05% (p=0.002 n=9+8) DigestString/4B-12 11.2ns ± 1% 10.0ns ± 1% -10.90% (p=0.000 n=10+9) DigestString/100B-12 25.5ns ± 1% 22.8ns ± 0% -10.62% (p=0.000 n=10+9) DigestString/4KB-12 342ns ± 1% 340ns ± 1% -0.56% (p=0.018 n=9+10) DigestString/10MB-12 877µs ± 1% 878µs ± 2% ~ (p=0.400 n=10+9) name old speed new speed delta Sum64String/4B-12 718MB/s ± 1% 947MB/s ± 1% +31.82% (p=0.000 n=10+9) Sum64String/100B-12 6.26GB/s ± 1% 6.75GB/s ± 1% +7.81% (p=0.000 n=10+10) Sum64String/4KB-12 12.2GB/s ± 2% 12.3GB/s ± 1% +0.70% (p=0.022 n=10+9) Sum64String/10MB-12 11.6GB/s ± 3% 11.7GB/s ± 0% +1.05% (p=0.002 n=9+8) DigestString/4B-12 357MB/s ± 1% 401MB/s ± 1% +12.32% (p=0.000 n=10+9) DigestString/100B-12 3.93GB/s ± 1% 4.40GB/s ± 0% +11.95% (p=0.000 n=10+9) DigestString/4KB-12 11.7GB/s ± 1% 11.8GB/s ± 1% +0.68% (p=0.011 n=10+10) DigestString/10MB-12 11.4GB/s ± 1% 11.4GB/s ± 2% ~ (p=0.400 n=10+9)
0dac3a2
to
a7909af
Compare
This is an alternative approach to #42.
Ideally the compiler would do mid-stack inlining for
Sum64String
since it's a minimal unsafe wrapper aroundSum64
:Unfortunately, the weight the inliner computes is too high. I filed golang/go#42739.
In the meantime, I found some tricks (with help from @josharian) to generate a lower cost that gets us below the threshold value.
Additionally, add tests to confirm that
Sum64String
and(*Digest).WriteString
are inlined.Benchmarks:
And with
-tags purego
:/cc @greatroar