runtime: perf improvements for runtime.memmove for ppc64le/ppc64 #14507
You do not get any alignment guarantee on the memmove inputs.

The unaligned loads and stores will work, but they are very inefficient. I handle that as part of my changes. Thanks.
It would be interesting to instrument memmove to see how it is called, both in terms of alignment and sizes. I think when copying anything containing pointers you'll have at least pointer alignment; maybe, if this could be determined at the call site, there would be a case for splitting memmove into one that assumes alignment and one that does not (and, as we all know by now, if you are copying pointers you have to do it 8 bytes at a time).
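A minimal sketch of that instrumentation idea, assuming hypothetical counters patched into the runtime (none of these names exist in the real runtime; `recordMemmove` stands in for a hook at the top of `runtime.memmove`):

```go
package main

import (
	"fmt"
	"math/bits"
	"unsafe"
)

var (
	alignHist [8]uint64  // index = (dst|src) & 7; bucket 0 means both pointers are 8-byte aligned
	sizeHist  [65]uint64 // index = bit length of n, i.e. log2-sized buckets
)

// recordMemmove is a hypothetical hook that memmove could call on entry.
func recordMemmove(dst, src unsafe.Pointer, n uintptr) {
	alignHist[(uintptr(dst)|uintptr(src))&7]++
	sizeHist[bits.Len(uint(n))]++
}

func main() {
	buf := make([]byte, 64)
	// Simulate the kind of call that copy() on byte slices can produce:
	// a 16-byte move between pointers with no particular alignment.
	recordMemmove(unsafe.Pointer(&buf[1]), unsafe.Pointer(&buf[19]), 16)
	fmt.Println("alignment buckets:", alignHist)
	fmt.Println("size buckets (moves under 128 bytes):", sizeHist[:8])
}
```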
It might be informative to break memmove calls down by type. There are:
#1, #2, and #4 are almost always aligned; though alignment there is very likely, I think we could only guarantee it at the #2 and #4 call sites. Alignment in case #3 is less likely.
Isn't it more accurate to say that the alignment of the arguments to memmove matches the alignment of the type being copied? When appending one []byte to another, there is no guarantee of alignment. So the question becomes whether the type's alignment is known at the call site of memmove. I think for the cases you list the calls all go via typedmemmove, which does have that information, so maybe it would be practical to have a separate memmove_aligned function. The size distribution is interesting too; there's always a temptation (for me, anyway) to focus on the asymptotic times and on making the larger copies as fast as possible, but I don't think that's the right target for most actual Go programs (I might be wrong; I'd love for someone to do something rigorous here).
typedmemmove is itself generic, so I guess one test/branch in typedmemmove could check that both the source and destination are aligned and call a specialized memmove routine. We could do the same thing in memmove itself with an or/test/branch. It doesn't seem like a big win. I agree with your feeling about sizes: it's more important for Go to make the one- and two-word memmoves fast. Of course it depends a lot on the benchmark.
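A minimal sketch of that or/test/branch, with hypothetical stand-ins for the runtime routines (the real runtime has no memmoveAligned; both routine names here are illustrative):

```go
package main

import "unsafe"

// Hypothetical stand-ins for the routines discussed above; the real
// runtime implements memmove in assembly and has no aligned variant.
func memmoveGeneric(dst, src unsafe.Pointer, n uintptr) {}
func memmoveAligned(dst, src unsafe.Pointer, n uintptr) {}

// dispatchMemmove shows the single or/test/branch: both pointers are
// 8-byte aligned exactly when the OR of their low three bits is zero.
func dispatchMemmove(dst, src unsafe.Pointer, n uintptr) {
	if (uintptr(dst)|uintptr(src))&7 == 0 {
		memmoveAligned(dst, src, n) // may assume 8-byte alignment
	} else {
		memmoveGeneric(dst, src, n)
	}
}

func main() {
	var x, y [16]byte
	dispatchMemmove(unsafe.Pointer(&x[0]), unsafe.Pointer(&y[0]), 16)
}
```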
I was thinking typedmemmove could check t.align, but I guess that doesn't …
I'm sorry, but I was wrong in my previous comment about the performance of unaligned loads and stores on power8. On earlier Power instruction sets unaligned loads and stores were very inefficient, but I have since found out that on power8 they perform well. So for now I think the main areas of performance improvement for memmove are: 1) figuring out whether the src and dst overlap (I don't think the current test is always correct, and it may decide to use backward moves when not necessary), and 2) moving large amounts of data. In the second case, unrolling the loop and using dcache hints will help. I'm doing more experiments and getting input from other power8 developers.
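On the first point, a minimal sketch of the condition under which a backward copy is actually required (the textbook test, not necessarily what memmove_ppc64x.s currently does): a forward copy is safe unless the destination starts strictly inside the source region.

```go
package main

import (
	"fmt"
	"unsafe"
)

// needBackward reports whether a forward copy would clobber not-yet-read
// source bytes: that happens only when dst lies strictly inside
// [src, src+n). Every other arrangement can be copied forward.
func needBackward(dst, src unsafe.Pointer, n uintptr) bool {
	d, s := uintptr(dst), uintptr(src)
	return d > s && d < s+n
}

func main() {
	buf := make([]byte, 32)
	p := func(i int) unsafe.Pointer { return unsafe.Pointer(&buf[i]) }
	fmt.Println(needBackward(p(4), p(0), 8))  // true: dst overlaps the source's tail
	fmt.Println(needBackward(p(0), p(4), 8))  // false: forward copy is safe
	fmt.Println(needBackward(p(16), p(0), 8)) // false: disjoint regions
}
```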
CL https://golang.org/cl/21990 mentions this issue.
This change improves the performance of memmove on ppc64 & ppc64le, mainly for moves >= 32 bytes. In addition, the test to detect backward moves was enhanced to avoid backward moves if source and dest were in different types of storage, since backward moves might not always be efficient.

Fixes golang#14507

The following shows some of the improvements from the test in the runtime package:

```
benchmark                           old MB/s     new MB/s    speedup
BenchmarkMemmove32                   4229.56      4717.13    1.12x
BenchmarkMemmove64                   6156.03      7810.42    1.27x
BenchmarkMemmove128                  7521.69     12468.54    1.66x
BenchmarkMemmove256                  6729.90     18260.33    2.71x
BenchmarkMemmove512                  8521.59     18033.81    2.12x
BenchmarkMemmove1024                 9760.92     25762.61    2.64x
BenchmarkMemmove2048                10241.00     29584.94    2.89x
BenchmarkMemmove4096                10399.37     31882.31    3.07x
BenchmarkMemmoveUnalignedDst16       1943.69      2258.33    1.16x
BenchmarkMemmoveUnalignedDst32       3885.08      3965.81    1.02x
BenchmarkMemmoveUnalignedDst64       5121.63      6965.54    1.36x
BenchmarkMemmoveUnalignedDst128      7212.34     11372.68    1.58x
BenchmarkMemmoveUnalignedDst256      6564.52     16913.59    2.58x
BenchmarkMemmoveUnalignedDst512      8364.35     17782.57    2.13x
BenchmarkMemmoveUnalignedDst1024     9539.87     24914.72    2.61x
BenchmarkMemmoveUnalignedDst2048     9199.23     21235.11    2.31x
BenchmarkMemmoveUnalignedDst4096    10077.39     25231.99    2.50x
BenchmarkMemmoveUnalignedSrc32       3249.83      3742.52    1.15x
BenchmarkMemmoveUnalignedSrc64       5562.35      6627.96    1.19x
BenchmarkMemmoveUnalignedSrc128      6023.98     10200.84    1.69x
BenchmarkMemmoveUnalignedSrc256      6921.83     15258.43    2.20x
BenchmarkMemmoveUnalignedSrc512      8593.13     16541.97    1.93x
BenchmarkMemmoveUnalignedSrc1024     9730.95     22927.84    2.36x
BenchmarkMemmoveUnalignedSrc2048     9793.28     21537.73    2.20x
BenchmarkMemmoveUnalignedSrc4096    10132.96     26295.06    2.60x
```

Change-Id: I73af59970d4c97c728deabb9708b31ec7e01bdf2
Reviewed-on: https://go-review.googlesource.com/21990
Reviewed-by: Bill O'Farrell <[email protected]>
Reviewed-by: Brad Fitzpatrick <[email protected]>
There is a ppc64(le)-specific implementation of memmove in runtime/memmove_ppc64x.s which works well in some cases, but I have found cases where further improvements can be made and I plan to work on those. On inspection I also have some concerns about the alignment assumptions.
There are no alignment checks for the source and destination in this code. Is there an assumption, when using the runtime.memmove function, that these pointers will refer to data that is 8-byte aligned? If the length is >= 8, it does ldu/stdu regardless of the alignment. I couldn't find any documentation stating that there is an expected alignment for the source and destination arguments of this function.
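For what it's worth, ordinary Go code can reach memmove with pointers that carry no 8-byte alignment guarantee: copy on byte slices at odd offsets hands memmove whatever pointers the slices happen to hold. A small illustration (the offsets are arbitrary):

```go
package main

import (
	"fmt"
	"unsafe"
)

func main() {
	buf := make([]byte, 64)
	// Slices starting at offsets 3 and 21 from the same base cannot
	// both be 8-byte aligned; copy() still has to handle them.
	dst, src := buf[3:19], buf[21:37]
	copy(dst, src) // lowered to a runtime.memmove call (modulo small-copy inlining)
	fmt.Printf("dst mod 8 = %d, src mod 8 = %d\n",
		uintptr(unsafe.Pointer(&dst[0]))%8,
		uintptr(unsafe.Pointer(&src[0]))%8)
}
```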