cmd/compile: add GOAMD64 architecture variants #25489
I'd like to add:
There is also a problem with opting into the 512-bit version of AVX512. |
I thought that the Go runtime and standard library already had ways to check for certain hardware features at run-time. How do these enhancements compare to that? If these were to be made at compile-time, would the run-time checks be removed? |
It is preferable to use runtime detection when possible. For any sizeable chunk of code, that's what we currently do. But especially for individual instructions, the overhead for runtime checking can be large. In that case, it can be preferable to "bake in" the decision during compilation. That's what such an environment variable would do. |
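The run-time detection pattern described above can be sketched as follows. This is a minimal, hypothetical example: the `hasPOPCNT` flag is stubbed so the sketch runs anywhere, whereas real code would read the feature bit from CPUID, e.g. via `golang.org/x/sys/cpu`.

```go
package main

import (
	"fmt"
	"math/bits"
)

// hasPOPCNT is a stand-in for a one-time CPUID check; real code would use
// something like cpu.X86.HasPOPCNT from golang.org/x/sys/cpu.
var hasPOPCNT = true

// onesCount illustrates the run-time check pattern: one branch per call.
// For a sizeable chunk of code the branch is noise, but for a single
// instruction it can dominate, which is why baking the decision in at
// compile time can win.
func onesCount(x uint64) int {
	if hasPOPCNT {
		return bits.OnesCount64(x) // intrinsified to POPCNT when available
	}
	// Portable fallback: clear the lowest set bit until none remain.
	n := 0
	for ; x != 0; x &= x - 1 {
		n++
	}
	return n
}

func main() {
	fmt.Println(onesCount(0xF0F0)) // 8
}
```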
Apologies for asking more questions - if we add an environment variable, wouldn't it be simpler and faster to simply use the optimal code for the hardware, getting rid of all run-time detection? Unless the point is to do a best-effort combination of both, so that very expensive code like crypto would still perform well on newer hardware even if the environment variable wasn't set up properly? |
We want to use the environment variable as little as possible, as it bakes the optimization decision in at compile time. As a consequence, the binary cannot then adapt at runtime; it would fail on an old machine, or not use optimal code on a new machine. |
Function multi-versioning was proposed some time ago as a compromise between fewer run-time checks and portability across older/newer CPUs. I can't find the discussion right now, but the approach was to check CPUID features once and then dispatch through the selected version. I have no specific proposals; moreover, I'm not familiar with the referenced technique in depth. |
For some cases an indirect call would be slower than a call + branch inside it, though. Needs a lot of investigation. There is also a JIT-like approach where we rewrite the function's code segment during startup if more optimized versions are available. This is not possible for every imaginable platform, but given this is mostly about AMD64, it seems doable in that context. |
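A hedged sketch of the indirect-call dispatch idea from the comments above: select the implementation once at startup, then every call site pays a function-pointer call instead of a per-call feature branch. The detection flag and all names here are hypothetical.

```go
package main

import (
	"fmt"
	"math/bits"
)

// cpuHasPOPCNT stands in for a one-time CPUID query at startup.
var cpuHasPOPCNT = true

// popCount is bound once in init; call sites then pay an indirect call
// rather than a feature branch, the trade-off discussed above.
var popCount func(uint64) int

func init() {
	if cpuHasPOPCNT {
		popCount = bits.OnesCount64 // hardware-accelerated version
	} else {
		popCount = popCountGeneric // portable version
	}
}

// popCountGeneric clears the lowest set bit until none remain.
func popCountGeneric(x uint64) int {
	n := 0
	for ; x != 0; x &= x - 1 {
		n++
	}
	return n
}

func main() {
	fmt.Println(popCount(0xFF)) // 8
}
```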
Great! I have been wanting this for a long time - https://groups.google.com/forum/#!topic/golang-nuts/b75xYuORaok :) My thinking was that using the non-destructive 3-operand form of VEX-encoded instructions will alleviate register pressure and make the binaries much smaller (unfortunately I know nothing about compiler internals, so I can't put together a PoC to confirm this). @ianlancetaylor had mentioned -
I've implemented a prototype in https://go-review.googlesource.com/c/go/+/117295 to see how beneficial this can be. |
With pleasure :) |
I have done some testing. On binary size, I am seeing a slight decrease 🎉. I tested with 3 production applications: 2 network services and 1 webapp. This is without

Service1 - 17488784 to 17482928 = 5856 bytes saved.

On the perf side, general application benchmarks do not show noticeable improvements, but micro-benchmarks targeting very specific functions do. One thing that I noticed is that the V(ADD|SUB|MUL|DIV)SD operations always act on registers, and not on memory operands, which they are especially useful for in reducing register pressure. E.g., I see patterns like this throughout -
which I believe can be optimized to -
That's all I could glean. Happy to provide more info if required. |
Benchmarks that I ran Friday, got busy and forgot to post:
I apologize for my stupidity. I was comparing with 1.10.2. 😞 I have updated my comment. The results match @dr2chase's observations. |
No need for the s-word; benchmarking is picky work that's easy to get wrong. I ended up writing a tool to help me do it because I could not trust myself to do it all correctly and reproducibly, and I really ought to make that tool automate even more of the steps. |
@dr2chase what machine are you using? I've tried to reproduce Growth_MultiSegment regression, but it showed no changes. |
It is "Model name: Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz" |
It'd be nice to be able to use |
I haven't got data on the performance impact yet, because I have to hack the compiler to do it, but we have a tight loop that's oring two 8KB blocks together, and we also want to popcount them, and right now, on amd64, adding popcount more than doubles runtime for the loop according to naive benchmarks. The difference from adding a popcount inside the loop is larger than I'd have expected:
becomes
Note, the "addq" in the middle is actually adding a previous popcount; this loop's popcount is added to a counter about four instructions later. So these aren't the actual instructions for this step, but they are the same instructions. I don't understand why there's an

If time allows, I'm gonna hack my local compiler to ditch that and see what the impact is like. |
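For context, the OR-plus-popcount hot loop described above might look roughly like this in Go. This is a sketch with hypothetical names, not the poster's actual code; the point is that without a baked-in instruction-set target, each `bits.OnesCount64` call carries a feature-check branch like the one shown in the assembly.

```go
package main

import (
	"fmt"
	"math/bits"
)

// orAndCount ORs src into dst word-by-word and returns the population
// count of the result. With a baked-in target, bits.OnesCount64 compiles
// to a bare POPCNT; otherwise the compiler emits a feature-check branch
// around a fallback call.
func orAndCount(dst, src []uint64) int {
	total := 0
	for i := range dst {
		dst[i] |= src[i]
		total += bits.OnesCount64(dst[i])
	}
	return total
}

func main() {
	dst := make([]uint64, 1024) // 8 KB blocks, as in the comment
	src := make([]uint64, 1024)
	dst[0], src[0] = 0x0F, 0xF0
	fmt.Println(orAndCount(dst, src)) // 8
}
```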
For the |
So I made a funny mistake which actually produced informative results: I wrote a benchmark for this in which I omitted any actual use of the results of the popcount operations. As it happens, this has the curious effect of generating the conditional jumps to call the popcount function if necessary, but not generating the actual popcount instructions (as the compiler figured out that the value being set to their sum was unused). So, for basically similar code, my reported times were: no popcount: 450ns/op (somewhat approximate values)

I then went ahead and did the fairly obvious thing of just taking away the runtime tests for OnesCount64, and with those, I get 444/444/520. What's interesting is that the cost of adding the popcount instructions without the conditional tests is dramatically lower (~80ns) than the cost of adding the popcount instructions with the conditional tests (~200ns more than the conditionals without the popcounts). So, for this specific function, it looks like the difference would be ~520ns vs. ~800ns. A 35% reduction in runtime for a hot loop is pretty tempting.

I don't really know enough about modern machines to know how common it is to find amd64 hardware without popcount. But I'd say that, yeah, I would be pretty interested in a target that allowed me to get that improvement on something that has been a pain point for us in the past. Just in case it's relevant, looking in more detail using 1.12.9:
Using the same thing, but with the conditional jump for OnesCount64 omitted:
Branch misses don't seem to be a significant factor here, but something about this is causing some stalling, giving us a noticeable drop in instructions/cycle, while also increasing instructions executed by about 35%. YMMV. |
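The benchmarking pitfall described above, where dead-code elimination removes the popcounts but keeps the feature-check branches, is commonly avoided with a package-level sink variable. A sketch (hypothetical names, not the poster's benchmark):

```go
package main

import (
	"fmt"
	"math/bits"
	"testing"
)

// sink is package-level; storing results into it keeps them observable,
// so the compiler cannot elide the popcount work.
var sink int

// BenchmarkOnesCount sums popcounts over an 8 KB buffer. Without the
// final store to sink, the compiler may prove the sum unused and drop the
// POPCNT instructions while still emitting the feature-check branches,
// which is the accidental measurement described in the comment above.
func BenchmarkOnesCount(b *testing.B) {
	buf := make([]uint64, 1024)
	for i := 0; i < b.N; i++ {
		n := 0
		for _, w := range buf {
			n += bits.OnesCount64(w)
		}
		sink = n
	}
}

func main() {
	// testing.Benchmark lets us run the benchmark outside "go test".
	r := testing.Benchmark(BenchmarkOnesCount)
	fmt.Println(r.N > 0) // true
}
```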
As another data point, https://go-review.googlesource.com/c/go/+/299491 is a proof-of-concept CL that adds support for a handful of Haswell instructions to the compiler, which I (a not particularly savvy micro-optimizer) was able to use to improve encoding/binary.Uvarint's throughput by 166% (from 616ns to 231ns). For comparison, the Go protobuf package includes a fully unrolled version of varint decoding (because varint decoding is a very hot code path within protobuf unmarshalling), which takes 347ns (+78% throughput). Applying the extra tricks from CL 299491 (using rotation instead of shifts, having a fast path for len(buf) >= 8 to eliminate bounds checks, early loading of the slice data pointer) bumps that a little further to about 315ns (+96%). Still measurably short of using the Haswell extensions.

I think large-scale cloud deployments would benefit from being able to statically build Go binaries that take advantage of more recent x86 instruction set extensions. Haswell (2013) is older than POWER9 (2017), which is configurable via GOPPC64. |
Varint decoding is also very hot code in the runtime, particularly during stack copying. See e.g. https://go-review.googlesource.com/43150. |
Another data point: https://go-review.googlesource.com/c/go/+/299649 reworks scanobject to use some Haswell instructions, and gets a 4% geomean performance improvement in compilebench. (Not bad for another 1 day hack.) I'm going to see if I can benchmark some longer-lived processes, and then also look into heapBitsSetType. |
FWIW, |
I asked Google's C++ toolchain team for advice here, and they pointed out that the x86-64 psABI now defines microarchitecture levels with specific feature sets: https://en.wikipedia.org/wiki/X86-64#Microarchitecture_Levels. See also this blog post from Red Hat giving more context: https://developers.redhat.com/blog/2021/01/05/building-red-hat-enterprise-linux-9-for-the-x86-64-v2-microarchitecture-level/

It seems like the three immediately useful targets are "baseline", "v2", and "v3". Baseline is what we support today. v2 and v3 each offer new instructions that could be used to better optimize current Go code (e.g., POPCNT in v2, BMI1/BMI2 in v3). I don't see any immediate benefit from supporting v4 today: Go doesn't have native SIMD support, and AVX512 hardware support seems comparatively less common still. (There are also lots of discussions about AVX512 having adverse system-level effects due to CPU down-throttling, etc.) So my tentative proposal would be

There have also been suggestions that binaries should check at startup whether the CPU supports the required features, so we don't get lots of SIGILL issue reports from users. That seems reasonable to me.

Also, there's been some brainstorming about various ways we could automatically compile code multiple ways and efficiently dispatch to the right code paths at a higher level, to balance binary size against runtime dispatch overhead. I think that's intriguing, but it's not obvious to me that we have a concrete solution here or anyone with time to investigate/implement it today. |
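The proposed startup check might be sketched like this. The feature flags are stubbed so the example runs anywhere; real code would read them from CPUID (e.g. via golang.org/x/sys/cpu), and only a few of the x86-64-v2 requirements (POPCNT, SSE4.2, CMPXCHG16B) are checked here for brevity.

```go
package main

import "fmt"

// Stubbed feature bits for a subset of the x86-64-v2 requirements. Real
// code would populate these from CPUID, e.g. via golang.org/x/sys/cpu.
var (
	hasPOPCNT = true
	hasSSE42  = true
	hasCX16   = true // CMPXCHG16B
)

// requireV2 sketches the startup check suggested above: fail with a clear
// error instead of letting the program die with SIGILL deep inside
// optimized code.
func requireV2() error {
	if !hasPOPCNT || !hasSSE42 || !hasCX16 {
		return fmt.Errorf("this binary requires an x86-64-v2 capable CPU")
	}
	return nil
}

func main() {
	if err := requireV2(); err != nil {
		panic(err)
	}
	fmt.Println("CPU supports x86-64-v2")
}
```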
How about also |
Historically we've avoided things like Which isn't to say that we can't change that behavior, but I think that should be a separate decision, and one that should apply to all the |
Make Go runtime throw if it's been compiled to assume instruction set extensions that aren't available on the CPU. This feature is originally suggested by mdempsky in golang#45453 and golang#25489. Updates golang#45453
Change https://golang.org/cl/351191 mentions this issue: |
Since #45453 is resolved, I think we can close this bug too. |
This is a data-gathering bug to collect instances in which a GOAMD64 environment variable would be helpful. Please feel free to add to it. My hope is that over time, if/when these use cases accumulate, we will get a clearer picture of whether to add GOAMD64, and if so, what shape it should take.
Uses:
cc @randall77 @martisch @TocarIP @quasilyte @dr2chase