Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assembler version of Sum64String #42

Closed
wants to merge 2 commits into from

Conversation

greatroar
Copy link
Contributor

@greatroar greatroar commented Nov 3, 2020

This moves the forwarding of Sum64String to Sum64 into assembler. The Go version cannot be inlined by Go 1.15.3:

./xxhash_unsafe.go:28:6: cannot inline Sum64String: function too complex: cost 91 exceeds budget 80

The assembler version makes Sum64String somewhere between a bit and a lot faster for short strings:

name                                  old time/op    new time/op    delta
Hashes/xxhash,direct,bytes,n=5B-8       5.96ns ± 1%    5.92ns ± 1%   -0.56%  (p=0.000 n=17+18)
Hashes/xxhash,direct,string,n=5B-8      8.71ns ± 1%    5.91ns ± 0%  -32.15%  (p=0.000 n=19+19)
Hashes/xxhash,direct,bytes,n=100B-8     19.2ns ± 2%    18.9ns ± 1%   -1.25%  (p=0.000 n=18+18)
Hashes/xxhash,direct,string,n=100B-8    20.3ns ± 1%    17.7ns ± 1%  -12.74%  (p=0.000 n=17+20)
Hashes/xxhash,direct,bytes,n=4KB-8       278ns ± 1%     276ns ± 1%   -0.54%  (p=0.000 n=19+16)
Hashes/xxhash,direct,string,n=4KB-8      280ns ± 2%     276ns ± 1%   -1.35%  (p=0.000 n=18+20)
Hashes/xxhash,direct,bytes,n=10MB-8      767µs ± 2%     766µs ± 1%     ~     (p=0.775 n=18+19)
Hashes/xxhash,direct,string,n=10MB-8     773µs ± 1%     775µs ± 2%     ~     (p=0.496 n=20+19)

name                                  old speed      new speed      delta
Hashes/xxhash,direct,bytes,n=5B-8      839MB/s ± 1%   844MB/s ± 1%   +0.55%  (p=0.000 n=17+18)
Hashes/xxhash,direct,string,n=5B-8     574MB/s ± 1%   846MB/s ± 0%  +47.38%  (p=0.000 n=19+19)
Hashes/xxhash,direct,bytes,n=100B-8   5.21GB/s ± 2%  5.28GB/s ± 2%   +1.21%  (p=0.000 n=18+19)
Hashes/xxhash,direct,string,n=100B-8  4.92GB/s ± 1%  5.64GB/s ± 2%  +14.57%  (p=0.000 n=17+20)
Hashes/xxhash,direct,bytes,n=4KB-8    14.4GB/s ± 1%  14.5GB/s ± 1%   +0.44%  (p=0.000 n=19+17)
Hashes/xxhash,direct,string,n=4KB-8   14.3GB/s ± 2%  14.5GB/s ± 0%   +1.58%  (p=0.000 n=18+17)
Hashes/xxhash,direct,bytes,n=10MB-8   13.0GB/s ± 2%  13.1GB/s ± 1%     ~     (p=0.775 n=18+19)
Hashes/xxhash,direct,string,n=10MB-8  12.9GB/s ± 1%  12.9GB/s ± 2%     ~     (p=0.496 n=20+19)

The implementation is a bit messy because now there need to be two _unsafe.go files. I didn't see an easy way of avoiding that.

@greatroar greatroar force-pushed the sum64string-asm branch 5 times, most recently from 41a2696 to a16de4a Compare November 3, 2020 14:31
This makes Sum64String somewhere between a bit and a lot faster for
short strings:

name                                  old time/op    new time/op    delta
Hashes/xxhash,direct,bytes,n=5B-8       5.96ns ± 1%    5.92ns ± 1%   -0.56%  (p=0.000 n=17+18)
Hashes/xxhash,direct,string,n=5B-8      8.71ns ± 1%    5.91ns ± 0%  -32.15%  (p=0.000 n=19+19)
Hashes/xxhash,direct,bytes,n=100B-8     19.2ns ± 2%    18.9ns ± 1%   -1.25%  (p=0.000 n=18+18)
Hashes/xxhash,direct,string,n=100B-8    20.3ns ± 1%    17.7ns ± 1%  -12.74%  (p=0.000 n=17+20)
Hashes/xxhash,direct,bytes,n=4KB-8       278ns ± 1%     276ns ± 1%   -0.54%  (p=0.000 n=19+16)
Hashes/xxhash,direct,string,n=4KB-8      280ns ± 2%     276ns ± 1%   -1.35%  (p=0.000 n=18+20)
Hashes/xxhash,direct,bytes,n=10MB-8      767µs ± 2%     766µs ± 1%     ~     (p=0.775 n=18+19)
Hashes/xxhash,direct,string,n=10MB-8     773µs ± 1%     775µs ± 2%     ~     (p=0.496 n=20+19)

name                                  old speed      new speed      delta
Hashes/xxhash,direct,bytes,n=5B-8      839MB/s ± 1%   844MB/s ± 1%   +0.55%  (p=0.000 n=17+18)
Hashes/xxhash,direct,string,n=5B-8     574MB/s ± 1%   846MB/s ± 0%  +47.38%  (p=0.000 n=19+19)
Hashes/xxhash,direct,bytes,n=100B-8   5.21GB/s ± 2%  5.28GB/s ± 2%   +1.21%  (p=0.000 n=18+19)
Hashes/xxhash,direct,string,n=100B-8  4.92GB/s ± 1%  5.64GB/s ± 2%  +14.57%  (p=0.000 n=17+20)
Hashes/xxhash,direct,bytes,n=4KB-8    14.4GB/s ± 1%  14.5GB/s ± 1%   +0.44%  (p=0.000 n=19+17)
Hashes/xxhash,direct,string,n=4KB-8   14.3GB/s ± 2%  14.5GB/s ± 0%   +1.58%  (p=0.000 n=18+17)
Hashes/xxhash,direct,bytes,n=10MB-8   13.0GB/s ± 2%  13.1GB/s ± 1%     ~     (p=0.775 n=18+19)
Hashes/xxhash,direct,string,n=10MB-8  12.9GB/s ± 1%  12.9GB/s ± 2%     ~     (p=0.496 n=20+19)
Copy link
Owner

@cespare cespare left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @greatroar, thanks for sending this.

A better approach, if possible, would be to convince Go to inline the function. I looked into it a bit and discovered two issues:

  1. The benchmark uses an indirect call which isn't inlined today (more context at Remove duplicated function definitions #22 (comment)). I added new benchmarks which use direct calls in e0ea1e3.
  2. The inliner weights the unsafe conversion much higher than I would expect. I'll file a compiler bug.

I found a workaround for (2) and sent #50. The advantages of that approach are:

  • It can be inlined in non-amd64 environments (or with -tags purego).
  • Go (even unsafe Go) is easier to read, write, and debug than assembly.

@@ -7,5 +7,6 @@ go:
- master
env:
- TAGS=""
- TAGS="-tags appengine"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we really care about special appengine support these days. I filed #46.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants