runtime: perf improvements for runtime.memmove for ppc64le/ppc64 #14507

laboger · 2016-02-25T16:50:08Z

There is a ppc64(le) specific implementation of memmove in runtime/memmove_ppc64x.s which works well in some cases but found some cases where further improvements can be made and plan to work on those. On inspection I also have some concerns about the alignment assumptions.

There are no alignment checks for the source and destination in this code. Is there an assumption when using the runtime.memmove function that these pointers will refer to data that is 8 byte aligned? If the length is >= 8 then it does ldu/stdu regardless of the alignment. I couldn't find any documentation stating that there is an expectation for the alignment of the source and destination with this function.

randall77 · 2016-02-25T16:58:23Z

You do not get any alignment guarantee on the memmove inputs.
If ppc can't do unaligned loads and stores, it does need fixup code.
If there isn't yet, you should definitely add a test for this.

On Thu, Feb 25, 2016 at 8:51 AM, laboger [email protected] wrote:

There is a ppc64(le) specific implementation of memmove in
runtime/memmove_ppc64x.s which works well in some cases but found some
cases where further improvements can be made and plan to work on those. On
inspection I also have some concerns about the alignment assumptions.

There are no alignment checks for the source and destination in this code.
Is there an assumption when using the runtime.memmove function that these
pointers will refer to data that is 8 byte aligned? If the length is >= 8
then it does ldu/stdu regardless of the alignment. I couldn't find any
documentation stating that there is an expectation for the alignment of the
source and destination with this function.

—
Reply to this email directly or view it on GitHub
#14507.

laboger · 2016-02-25T17:42:02Z

The unaligned loads and stores will work but they are very inefficient. I handle that as part of my changes. Thanks.

mwhudson · 2016-02-29T03:50:32Z

It would be interesting to instrument memmove to see how it is called, both in terms of alignment and sizes. I think when copying anything containing pointers, you'll have at least pointer alignment -- maybe if this could be determined at the call site there would be a case for splitting memmove into one that assumes alignment and one that does not (and as we all know by now, if you are copying pointers you have to do it 8 bytes at a time).

randall77 · 2016-02-29T04:04:39Z

It might be informative to break memmove calls down by type. There are:

Calls for runtime generics (maps, channels, interfaces, etc.)
Calls for append
Calls for string <-> []byte
Calls for compiler-generated copying of big user types

#1,#2, and #4 are almost always aligned. Although very likely, I think we could only guarantee it at #2 and #4 callsites. Alignment in case #3 is less likely.
For #2 and #4, the copies are typically large so a test and branch for alignment wouldn't add much relative cost.

mwhudson · 2016-02-29T09:37:51Z

Isn't it more fair to say that the alignment of the arguments to memmove match the alignment of the type being copied? When appending two []byte's, there's no guarantee of alignment. And so the question becomes whether the type's alignment is known at the callsite to memmove – I think for the cases you list, the calls are all via typedmemmove, which does have that information, so maybe it would be practical to have a separate memmove_aligned function.

The size distribution is interesting too, there's always this temptation (for me, anyway) to focus on the asymptotic times and making the larger copies as fast as possible, but I don't think that's the way to go for most actual Go programs (but I might be wrong; I'd love for someone to do something rigorous here)

randall77 · 2016-02-29T16:18:00Z

typedmemmove is itself generic. So I guess one test/branch in typedmemmove would ensure that both the source and destination are aligned and could call a specialized memmove routine. We can do the same thing in memmove itself with an or/test/branch. Doesn't seem like a big win.

I agree with your size feelings. It's more important for Go to get the 1 and 2 word memmoves fast. Of course it depends a lot on the benchmark.

mwhudson · 2016-02-29T20:53:05Z

I was thinking typedmemmove could check t.align but I guess that doesn't
buy much over checking the alignment in memmove indeed.

On 1 March 2016 at 05:18, Keith Randall [email protected] wrote:

typedmemmove is itself generic. So I guess one test/branch in typedmemmove
would ensure that both the source and destination are aligned and could
call a specialized memmove routine. We can do the same thing in memmove
itself with an or/test/branch. Doesn't seem like a big win.

I agree with your size feelings. It's more important for Go to get the 1
and 2 word memmoves fast. Of course it depends a lot on the benchmark.

—
Reply to this email directly or view it on GitHub
#14507 (comment).

laboger · 2016-03-08T14:53:10Z

I'm sorry but I was wrong in my previous comment about the performance of unaligned loads and stores on power8. On earlier Power instruction sets unaligned loads and stores were very inefficient but I have since found out that on power8 unaligned loads and stores perform well. So for now I think the main areas of performance improvement for the memmove are when 1) figuring out if the src and dst overlap (I don't think the current test is always correct and may decide to use backward moves when not necessary) and 2) when moving large amounts of data. In the second case, unrolling the loop and using dcache hints will help. I'm doing more experiments and getting input from other power8 developers.

laboger · 2016-04-13T14:12:45Z

https://go-review.googlesource.com/#/c/21990/

gopherbot · 2016-04-13T15:00:55Z

CL https://golang.org/cl/21990 mentions this issue.

This change improves the performance of memmove on ppc64 & ppc64le mainly for moves >=32 bytes. In addition, the test to detect backward moves was enhanced to avoid backward moves if source and dest were in different types of storage, since backward moves might not always be efficient. Fixes golang#14507 The following shows some of the improvements from the test in the runtime package: BenchmarkMemmove32 4229.56 4717.13 1.12x BenchmarkMemmove64 6156.03 7810.42 1.27x BenchmarkMemmove128 7521.69 12468.54 1.66x BenchmarkMemmove256 6729.90 18260.33 2.71x BenchmarkMemmove512 8521.59 18033.81 2.12x BenchmarkMemmove1024 9760.92 25762.61 2.64x BenchmarkMemmove2048 10241.00 29584.94 2.89x BenchmarkMemmove4096 10399.37 31882.31 3.07x BenchmarkMemmoveUnalignedDst16 1943.69 2258.33 1.16x BenchmarkMemmoveUnalignedDst32 3885.08 3965.81 1.02x BenchmarkMemmoveUnalignedDst64 5121.63 6965.54 1.36x BenchmarkMemmoveUnalignedDst128 7212.34 11372.68 1.58x BenchmarkMemmoveUnalignedDst256 6564.52 16913.59 2.58x BenchmarkMemmoveUnalignedDst512 8364.35 17782.57 2.13x BenchmarkMemmoveUnalignedDst1024 9539.87 24914.72 2.61x BenchmarkMemmoveUnalignedDst2048 9199.23 21235.11 2.31x BenchmarkMemmoveUnalignedDst4096 10077.39 25231.99 2.50x BenchmarkMemmoveUnalignedSrc32 3249.83 3742.52 1.15x BenchmarkMemmoveUnalignedSrc64 5562.35 6627.96 1.19x BenchmarkMemmoveUnalignedSrc128 6023.98 10200.84 1.69x BenchmarkMemmoveUnalignedSrc256 6921.83 15258.43 2.20x BenchmarkMemmoveUnalignedSrc512 8593.13 16541.97 1.93x BenchmarkMemmoveUnalignedSrc1024 9730.95 22927.84 2.36x BenchmarkMemmoveUnalignedSrc2048 9793.28 21537.73 2.20x BenchmarkMemmoveUnalignedSrc4096 10132.96 26295.06 2.60x Change-Id: I73af59970d4c97c728deabb9708b31ec7e01bdf2 Reviewed-on: https://go-review.googlesource.com/21990 Reviewed-by: Bill O'Farrell <[email protected]> Reviewed-by: Brad Fitzpatrick <[email protected]>

bradfitz added this to the Unplanned milestone Feb 25, 2016

bradfitz added the Performance label Feb 25, 2016

gopherbot closed this as completed in c4807d4 Apr 13, 2016

golang locked and limited conversation to collaborators Apr 13, 2017

gopherbot added the FrozenDueToAge label Apr 13, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

runtime: perf improvements for runtime.memmove for ppc64le/ppc64 #14507

runtime: perf improvements for runtime.memmove for ppc64le/ppc64 #14507

laboger commented Feb 25, 2016

randall77 commented Feb 25, 2016

laboger commented Feb 25, 2016

mwhudson commented Feb 29, 2016

randall77 commented Feb 29, 2016

mwhudson commented Feb 29, 2016

randall77 commented Feb 29, 2016

mwhudson commented Feb 29, 2016

laboger commented Mar 8, 2016

laboger commented Apr 13, 2016

gopherbot commented Apr 13, 2016

runtime: perf improvements for runtime.memmove for ppc64le/ppc64 #14507

runtime: perf improvements for runtime.memmove for ppc64le/ppc64 #14507

Comments

laboger commented Feb 25, 2016

randall77 commented Feb 25, 2016

laboger commented Feb 25, 2016

mwhudson commented Feb 29, 2016

randall77 commented Feb 29, 2016

mwhudson commented Feb 29, 2016

randall77 commented Feb 29, 2016

mwhudson commented Feb 29, 2016

laboger commented Mar 8, 2016

laboger commented Apr 13, 2016

gopherbot commented Apr 13, 2016