Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: perf improvements for runtime.memmove for ppc64le/ppc64 #14507

Closed
laboger opened this issue Feb 25, 2016 · 10 comments
Closed

runtime: perf improvements for runtime.memmove for ppc64le/ppc64 #14507

laboger opened this issue Feb 25, 2016 · 10 comments

Comments

@laboger
Copy link
Contributor

laboger commented Feb 25, 2016

There is a ppc64(le) specific implementation of memmove in runtime/memmove_ppc64x.s which works well in some cases but found some cases where further improvements can be made and plan to work on those. On inspection I also have some concerns about the alignment assumptions.

There are no alignment checks for the source and destination in this code. Is there an assumption when using the runtime.memmove function that these pointers will refer to data that is 8 byte aligned? If the length is >= 8 then it does ldu/stdu regardless of the alignment. I couldn't find any documentation stating that there is an expectation for the alignment of the source and destination with this function.

@bradfitz bradfitz added this to the Unplanned milestone Feb 25, 2016
@randall77
Copy link
Contributor

You do not get any alignment guarantee on the memmove inputs.
If ppc can't do unaligned loads and stores, it does need fixup code.
If there isn't yet, you should definitely add a test for this.

On Thu, Feb 25, 2016 at 8:51 AM, laboger [email protected] wrote:

There is a ppc64(le) specific implementation of memmove in
runtime/memmove_ppc64x.s which works well in some cases but found some
cases where further improvements can be made and plan to work on those. On
inspection I also have some concerns about the alignment assumptions.

There are no alignment checks for the source and destination in this code.
Is there an assumption when using the runtime.memmove function that these
pointers will refer to data that is 8 byte aligned? If the length is >= 8
then it does ldu/stdu regardless of the alignment. I couldn't find any
documentation stating that there is an expectation for the alignment of the
source and destination with this function.


Reply to this email directly or view it on GitHub
#14507.

@laboger
Copy link
Contributor Author

laboger commented Feb 25, 2016

The unaligned loads and stores will work but they are very inefficient. I handle that as part of my changes. Thanks.

@mwhudson
Copy link
Contributor

It would be interesting to instrument memmove to see how it is called, both in terms of alignment and sizes. I think when copying anything containing pointers, you'll have at least pointer alignment -- maybe if this could be determined at the call site there would be a case for splitting memmove into one that assumes alignment and one that does not (and as we all know by now, if you are copying pointers you have to do it 8 bytes at a time).

@randall77
Copy link
Contributor

It might be informative to break memmove calls down by type. There are:

  1. Calls for runtime generics (maps, channels, interfaces, etc.)
  2. Calls for append
  3. Calls for string <-> []byte
  4. Calls for compiler-generated copying of big user types

#1,#2, and #4 are almost always aligned. Although very likely, I think we could only guarantee it at #2 and #4 callsites. Alignment in case #3 is less likely.
For #2 and #4, the copies are typically large so a test and branch for alignment wouldn't add much relative cost.

@mwhudson
Copy link
Contributor

Isn't it more fair to say that the alignment of the arguments to memmove match the alignment of the type being copied? When appending two []byte's, there's no guarantee of alignment. And so the question becomes whether the type's alignment is known at the callsite to memmove – I think for the cases you list, the calls are all via typedmemmove, which does have that information, so maybe it would be practical to have a separate memmove_aligned function.

The size distribution is interesting too, there's always this temptation (for me, anyway) to focus on the asymptotic times and making the larger copies as fast as possible, but I don't think that's the way to go for most actual Go programs (but I might be wrong; I'd love for someone to do something rigorous here)

@randall77
Copy link
Contributor

typedmemmove is itself generic. So I guess one test/branch in typedmemmove would ensure that both the source and destination are aligned and could call a specialized memmove routine. We can do the same thing in memmove itself with an or/test/branch. Doesn't seem like a big win.

I agree with your size feelings. It's more important for Go to get the 1 and 2 word memmoves fast. Of course it depends a lot on the benchmark.

@mwhudson
Copy link
Contributor

I was thinking typedmemmove could check t.align but I guess that doesn't
buy much over checking the alignment in memmove indeed.

On 1 March 2016 at 05:18, Keith Randall [email protected] wrote:

typedmemmove is itself generic. So I guess one test/branch in typedmemmove
would ensure that both the source and destination are aligned and could
call a specialized memmove routine. We can do the same thing in memmove
itself with an or/test/branch. Doesn't seem like a big win.

I agree with your size feelings. It's more important for Go to get the 1
and 2 word memmoves fast. Of course it depends a lot on the benchmark.


Reply to this email directly or view it on GitHub
#14507 (comment).

@laboger
Copy link
Contributor Author

laboger commented Mar 8, 2016

I'm sorry but I was wrong in my previous comment about the performance of unaligned loads and stores on power8. On earlier Power instruction sets unaligned loads and stores were very inefficient but I have since found out that on power8 unaligned loads and stores perform well. So for now I think the main areas of performance improvement for the memmove are when 1) figuring out if the src and dst overlap (I don't think the current test is always correct and may decide to use backward moves when not necessary) and 2) when moving large amounts of data. In the second case, unrolling the loop and using dcache hints will help. I'm doing more experiments and getting input from other power8 developers.

@laboger
Copy link
Contributor Author

laboger commented Apr 13, 2016

@gopherbot
Copy link
Contributor

CL https://golang.org/cl/21990 mentions this issue.

ceseo pushed a commit to powertechpreview/go that referenced this issue May 3, 2016
This change improves the performance of memmove
on ppc64 & ppc64le mainly for moves >=32 bytes.
In addition, the test to detect backward moves
 was enhanced to avoid backward moves if source
and dest were in different types of storage, since
backward moves might not always be efficient.

Fixes golang#14507

The following shows some of the improvements from the test
in the runtime package:

BenchmarkMemmove32                   4229.56      4717.13      1.12x
BenchmarkMemmove64                   6156.03      7810.42      1.27x
BenchmarkMemmove128                  7521.69      12468.54     1.66x
BenchmarkMemmove256                  6729.90      18260.33     2.71x
BenchmarkMemmove512                  8521.59      18033.81     2.12x
BenchmarkMemmove1024                 9760.92      25762.61     2.64x
BenchmarkMemmove2048                 10241.00     29584.94     2.89x
BenchmarkMemmove4096                 10399.37     31882.31     3.07x

BenchmarkMemmoveUnalignedDst16       1943.69      2258.33      1.16x
BenchmarkMemmoveUnalignedDst32       3885.08      3965.81      1.02x
BenchmarkMemmoveUnalignedDst64       5121.63      6965.54      1.36x
BenchmarkMemmoveUnalignedDst128      7212.34      11372.68     1.58x
BenchmarkMemmoveUnalignedDst256      6564.52      16913.59     2.58x
BenchmarkMemmoveUnalignedDst512      8364.35      17782.57     2.13x
BenchmarkMemmoveUnalignedDst1024     9539.87      24914.72     2.61x
BenchmarkMemmoveUnalignedDst2048     9199.23      21235.11     2.31x
BenchmarkMemmoveUnalignedDst4096     10077.39     25231.99     2.50x

BenchmarkMemmoveUnalignedSrc32       3249.83      3742.52      1.15x
BenchmarkMemmoveUnalignedSrc64       5562.35      6627.96      1.19x
BenchmarkMemmoveUnalignedSrc128      6023.98      10200.84     1.69x
BenchmarkMemmoveUnalignedSrc256      6921.83      15258.43     2.20x
BenchmarkMemmoveUnalignedSrc512      8593.13      16541.97     1.93x
BenchmarkMemmoveUnalignedSrc1024     9730.95      22927.84     2.36x
BenchmarkMemmoveUnalignedSrc2048     9793.28      21537.73     2.20x
BenchmarkMemmoveUnalignedSrc4096     10132.96     26295.06     2.60x

Change-Id: I73af59970d4c97c728deabb9708b31ec7e01bdf2
Reviewed-on: https://go-review.googlesource.com/21990
Reviewed-by: Bill O'Farrell <[email protected]>
Reviewed-by: Brad Fitzpatrick <[email protected]>
@golang golang locked and limited conversation to collaborators Apr 13, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants