cmd/compile: loads/constants not lifted out of loop #15808
Comments
Happy to work on this; I prototyped a loop-invariant code hoister and tried it out, but it didn't (at the time) do anything exciting for performance on amd64. Note that this can increase register pressure. |
I thought the register allocator pre-allocated things like this before loop entry. I suspect that something is not working about that code, not that we need a whole new pass. (Not that a code hoister is a bad idea, just that it may not be needed for this issue.) |
If t is overwritten by another goroutine, the loop must refer to the new t, so the LEAQ is required inside the loop. If the loop count is large enough, the following should remove the LEAQ from the loop:

```go
func f(b []byte) {
	t := t // THIS
	for i := 0; i < 100000000; i++ {
		b[i%len(b)] = t[i%len(t)]
	}
}
``` |
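The shadowing trick above can be sketched as a complete program. This is an illustrative sketch: the package-level `t` and its contents are assumptions, not taken from the thread.

```go
package main

import "fmt"

// t is package-level and may be overwritten concurrently, so inside a
// loop the compiler must be conservative about reloading its header.
var t = []byte("0123456789abcdef")

// fill shadows t with a local copy; the local cannot change mid-loop,
// so its pointer and length are loop-invariant and hoistable.
func fill(b []byte) {
	t := t // local copy, as in the comment above
	for i := 0; i < len(b); i++ {
		b[i] = t[i%len(t)]
	}
}

func main() {
	b := make([]byte, 20)
	fill(b)
	fmt.Println(string(b))
}
```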
I just revived my old loop-invariant-code experiment, and here's a (very) preliminary result. (1) this pushes "loop-invariant" till it finds them all.
|
CL https://golang.org/cl/37338 mentions this issue. |
For reference - I was looking at hex.Encode. Citing from above:
the Encode() function is not reported as having an invariant detected in its loop. Thanks, |
Out of curiosity, Kirill, is 1.8 any better than tip for Encode? (Or vice versa, depending on which you are using?) There was a recent change to CSE constant strings. |
Josh, thanks for asking. I'm using tip by default. Tip is better than 1.8 for hex.Encode in the sense that 57370a8 (encoding/hex: change lookup table from string to array) is no longer needed, because there are no longer two loads. So the diff for the assembly, with the following patch applied:

```diff
--- a/src/encoding/hex/hex.go
+++ b/src/encoding/hex/hex.go
@@ -12,10 +12,11 @@ import (
 	"io"
 )

-var hextable = [16]byte{
-	'0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
-	'a', 'b', 'c', 'd', 'e', 'f',
-}
+//var hextable = [16]byte{
+//	'0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
+//	'a', 'b', 'c', 'd', 'e', 'f',
+//}
+const hextable = "0123456789abcdef"

 // EncodedLen returns the length of an encoding of n source bytes.
 // Specifically, it returns n * 2.
```

is only this:

```diff
--- a/0tip.sss
+++ b/0tipstr.sss
@@ -18,8 +18,8 @@ pc39:
 	MOVL	DI, R9
 	SHRB	$4, DIB
 	MOVBLZX	DIB, DI
-	LEAQ	·hextable(SB), R10
-	MOVBLZX	(R10)(DI*1), DI
+	LEAQ	go.string."0123456789abcdef"(SB), R10	// hex.go:32
+	MOVBLZX	(R10)(DI*1), DI	// hex.go:31
 	CMPQ	SI, CX
 	JCC	pc139
 	MOVB	DIB, (DX)(SI*1)
```

(thus reverting 57370a8 would make sense anyway, as then hextable will go to rodata and will probably be deduplicated by the linker with other such tables /cc @ALTree, @bradfitz).

For the present state - when hextable is the [16]byte array - the diff between 1.8 and tip is:

```diff
--- a/018.sss
+++ b/0tip.sss
@@ -5,34 +5,34 @@ TEXT ·Encode(SB), $8-56	// hex.go:29
 	FUNCDATA	$0, gclocals·b9de2a960cf046391bcd3b554f7fabca(SB)	(args)
 	FUNCDATA	$1, gclocals·69c1753bd5f81501d95132d08af04464(SB)	(locals)
 	MOVQ	src+32(FP), AX	// hex.go:30
+	TESTQ	AX, AX
+	JLE	pc115
 	MOVQ	dst+8(FP), CX
 	MOVQ	dst+0(FP), DX
 	MOVQ	src+24(FP), BX
 	MOVQ	$0, SI	// hex.go:29
-	CMPQ	SI, AX	// hex.go:30
-	JGE	$0, pc115
 pc39:
-	MOVBLZX	(BX), DI
-	MOVL	DI, R8	// hex.go:31
+	MOVBLZX	(BX), DI	// hex.go:30
+	MOVQ	SI, R8	// hex.go:31
+	SHLQ	$1, SI
+	MOVL	DI, R9
 	SHRB	$4, DIB
 	MOVBLZX	DIB, DI
-	LEAQ	·hextable(SB), R9
-	MOVBLZX	(R9)(DI*1), DI
-	MOVQ	SI, R10
-	SHLQ	$1, SI
+	LEAQ	·hextable(SB), R10
+	MOVBLZX	(R10)(DI*1), DI
 	CMPQ	SI, CX
-	JCC	$0, pc139
+	JCC	pc139
 	MOVB	DIB, (DX)(SI*1)
-	ANDL	$15, R8	// hex.go:32
-	MOVBLZX	(R9)(R8*1), DI
-	LEAQ	1(SI), R8
-	CMPQ	R8, CX
-	JCC	$0, pc132
-	MOVB	DIB, 1(DX)(SI*1)
+	LEAQ	1(SI), DI	// hex.go:32
+	ANDL	$15, R9
+	MOVBLZX	(R10)(R9*1), R9
+	CMPQ	DI, CX
+	JCC	pc132
+	MOVB	R9B, 1(DX)(SI*1)
 	INCQ	BX	// hex.go:30
-	LEAQ	1(R10), SI
+	LEAQ	1(R8), SI
 	CMPQ	SI, AX
-	JLT	$0, pc39
+	JLT	pc39
 pc115:
 	SHLQ	$1, AX	// hex.go:35
 	MOVQ	AX, _r2+48(FP)
```

which to me looks the same at first glance, modulo some different register allocation, reordering, and likely hints removed from jumps (@randall77 214be5b says likely hints are no longer needed "since #15837 was fixed", but #15837 is not yet fixed - maybe some thinko crept in?). Thanks, |
#15837 is fixed in the sense that it matters here, and in 214be5b. The obj library no longer reorders instructions. |
@randall77 thanks for feedback and good luck with finding what is going on with regressions in that CL. |
CL 27254 changed hextable to a byte array for performance. CL 28219 fixed the compiler so that that is no longer necessary. As Kirill notes in #15808, a string is preferable as the linker can easily de-dup it. So go back. No performance changes.

Change-Id: Ibef7d21d0f2507968a0606602c5dd57ed4a85b1b
Reviewed-on: https://go-review.googlesource.com/40970
Run-TryBot: Josh Bleecher Snyder <[email protected]>
Reviewed-by: Brad Fitzpatrick <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
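For reference, the pattern the commit restores looks roughly like this. This is a simplified sketch of hex.Encode, not the actual stdlib code: with a const string table, the data lives in rodata and the compiler can CSE the single address load.

```go
package main

import "fmt"

const hextable = "0123456789abcdef"

// encode writes the hex encoding of src into dst and returns the
// number of bytes written; each source byte indexes the const table
// twice (high nibble, then low nibble).
func encode(dst, src []byte) int {
	for i, v := range src {
		dst[i*2] = hextable[v>>4]
		dst[i*2+1] = hextable[v&0x0f]
	}
	return len(src) * 2
}

func main() {
	src := []byte{0xde, 0xad}
	dst := make([]byte, len(src)*2)
	encode(dst, src)
	fmt.Println(string(dst)) // "dead"
}
```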
CL https://golang.org/cl/40970 mentions this issue. |
@dr2chase, what's the status here? |
I have a CL up for review; there is some bad interaction with racewalk that I have yet to debug. It's unclear whether it helps; it seems not to hurt, and there might be code not in our benchmarks that it does help. See https://go-review.googlesource.com/c/37338 -- I think it fails to mention this bug. It may not help this particular bug; I'll try to check Thursday night or Friday. |
Status is now that TryBots are happy, but the Go1 benchmarks contain no loop-invariant code at all. Revcomp slows down, of course, but I think that is alignment-related. |
@dr2chase over at #18977 (comment), @navytux hacked together a clever way to try out different alignments, so that luck (or lack thereof) can be assessed. (Probably a coincidence, but @martisch was also puzzling over revcomp slowdowns in CL 38061.) |
From the CL: "Hoisting, but performance improvements are uninspiring." More important:
|
OK. I'll move this issue to 1.10, then. Feel free to move to Unplanned if you think that's better.
:)
Great! By the way, do you have any recommended reading for code layout? I don't know how much time I'll have in 1.10, but it seems like a promising and interesting area to pursue. (As usual, I have lots of ideas but not much background knowledge.) |
Ha, well I guess I was misled by the title of this issue and didn't read the rest very thoroughly. Maybe I can hack it so constants can be hoisted for ppc64le and see how much it helps. And I will turn on some traces to understand what is being hoisted. |
Before using some CPU instructions, we must check for their presence. We use global variables in the runtime package to record features. Prior to this CL, we issued a regular memory load for these features. The downside to this is that, because it is a regular memory load, it cannot be hoisted out of loops or otherwise reordered with other loads.

This CL introduces a new intrinsic just for checking cpu features. It still ends up resulting in a memory load, but that memory load can now be floated to the entry block and rematerialized as needed.

One downside is that the regular load could be combined with the comparison into a CMPBconstload+NE. This new intrinsic cannot; it generates MOVB+TESTB+NE. (It is possible that MOVBQZX+TESTQ+NE would be better.)

This CL does only amd64. It is easy to extend to other architectures.

For the benchmark in #36196, on my machine, this offers a mild speedup.

name      old time/op  new time/op  delta
FMA-8     1.39ns ± 6%  1.29ns ± 9%  -7.19%  (p=0.000 n=97+96)
NonFMA-8  2.03ns ±11%  2.04ns ±12%    ~     (p=0.618 n=99+98)

Updates #15808
Updates #36196

Change-Id: I75e2fcfcf5a6df1bdb80657a7143bed69fca6deb
Reviewed-on: https://go-review.googlesource.com/c/go/+/212360
Run-TryBot: Josh Bleecher Snyder <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Keith Randall <[email protected]>
Reviewed-by: Giovanni Bajo <[email protected]>
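The effect of the intrinsic can be imitated by hand: copy the feature flag into a local before the loop so the load happens once. A minimal sketch, where hasFMA and the fma helper are invented stand-ins for a real runtime feature variable and instruction, not the actual runtime code:

```go
package main

import "fmt"

// hasFMA stands in for a runtime CPU-feature variable; a regular load
// of it inside a loop cannot be hoisted or reordered by the compiler.
var hasFMA = true

// fma is a stand-in for a fused-multiply-add implementation.
func fma(x, y, z float64) float64 { return x*y + z }

func sum(xs []float64) float64 {
	useFMA := hasFMA // load the flag once, outside the loop
	var s float64
	for _, x := range xs {
		if useFMA {
			s = fma(x, 1, s)
		} else {
			s += x
		}
	}
	return s
}

func main() {
	fmt.Println(sum([]float64{1, 2, 3})) // 6
}
```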
Change https://golang.org/cl/227238 mentions this issue: |
In the commit message of CL 212360, I wrote:

> This new intrinsic ... generates MOVB+TESTB+NE.
> (It is possible that MOVBQZX+TESTQ+NE would be better.)

I should have tested. MOVBQZX+TESTQ+NE does in fact appear to be better. For the benchmark in #36196, on my machine:

name      old time/op  new time/op  delta
FMA-8     0.86ns ± 6%  0.70ns ± 5%  -18.79%  (p=0.000 n=98+97)
NonFMA-8  0.61ns ± 5%  0.60ns ± 4%   -0.74%  (p=0.001 n=100+97)

Interestingly, these are both considerably faster than the measurements I took a couple of months ago (1.4ns/2ns). It appears that CL 219131 (clearing VZEROUPPER in asyncPreempt) helped a lot. And FMA is now once again slower than NonFMA, although this change helps it regain some ground.

Updates #15808
Updates #36351
Updates #36196

Change-Id: I8a326289a963b1939aaa7eaa2fab2ec536467c7d
Reviewed-on: https://go-review.googlesource.com/c/go/+/227238
Run-TryBot: Josh Bleecher Snyder <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Keith Randall <[email protected]>
I have time now to look at this again. I was not able to successfully disable tighten in order to get constants moved out. I understand the purpose of tighten, but it seems like there should be an allowance for loops, where loads can be moved to the dominator block of the loop - maybe only for some targets like ppc64 and arm64, where there are more registers and loads cost more. Another question I had is whether a NilCheck should/could be moved out of a loop if the pointer or address it is checking is invariant. It seems like doing it once before the loop should be sufficient. |
The tighten pass will only move values toward their uses, it will not move values back. Moving loads and nil checks around is tricky, because we need panics to appear at the correct place. For instance, if a loop runs 0 times, we can't lift a nil check out of the loop. Loads that we know won't fault could be lifted, though. I think David's CL did something like that. I'd like to see something on this front, though. There are certainly things that are safe to lift out of loops that would help. |
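The zero-trip-count hazard described above can be shown concretely. In this sketch (the type and function names are invented for illustration), passing a nil pointer is legal precisely because the loop runs zero times and the pointer is never dereferenced; hoisting the nil check above the loop would wrongly panic:

```go
package main

import "fmt"

type node struct{ vals [4]int }

// sum dereferences p only inside the loop body. If n == 0 the body
// never runs, so the compiler cannot hoist the nil check (or the load
// through p) above the loop without first proving n > 0.
func sum(p *node, n int) int {
	s := 0
	for i := 0; i < n; i++ {
		s += p.vals[i%4]
	}
	return s
}

func main() {
	fmt.Println(sum(nil, 0)) // legal: zero iterations, p untouched
	fmt.Println(sum(&node{vals: [4]int{1, 2, 3, 4}}, 4))
}
```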
Yes I think it would be good to hoist constants. In the other compilers for ppc64, the base address of an array is moved back before the loop, but in Go the address is always loaded and the array element address computed each time in the loop. I tried changing regalloc to get the constants hoisted but couldn't get it to work, so based on what Josh said earlier I assumed that one of the tighten passes was preventing it.
Agreed, and if you can tell if the loop executes at least once then it should be possible to do this.
I think it is good to leave constants as rematerializable to avoid extra code to spill the value, but just should be treated differently with respect to hoisting. |
@dr2chase I don't see how loads of constants could be hoisted with your CL. It is only saving v's from the block as candidates to be hoisted, not any of their args or any subexpressions for that matter. |
One problem we have is constant loads that take a mem operand, in a loop where there are stores or calls. I'm a little bugged that critical edges is not run before early hoistloopiv; the code at lines 60-66 assumes that implicitly (it assumes that there is a unique parent). |
The filtered invariant set is based on what's returned from the loopinvariants function, which only returns v's that are invariant. So one case I'm looking at is package log's itoa function, where there is a multiply of a constant with a variable and I want the constant to be moved out. The multiply is not invariant, so it isn't in the filtered invariant set, which means the constant arg is not processed in the lines you mentioned above. |
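For reference, the shape in question is roughly this. The sketch below is modeled on log's itoa but simplified (the stdlib version also handles a width parameter): the constant operand of `q*10` must be materialized inside the loop unless it is hoisted.

```go
package main

import "fmt"

// itoa appends the decimal digits of i (i >= 0) to *buf, building them
// backward in a fixed buffer. The multiply q*10 inside the loop has a
// constant operand that is a hoisting candidate.
func itoa(buf *[]byte, i int) {
	var b [20]byte
	bp := len(b) - 1
	for i >= 10 {
		q := i / 10
		b[bp] = byte('0' + i - q*10) // constant 10 used every iteration
		bp--
		i = q
	}
	b[bp] = byte('0' + i)
	*buf = append(*buf, b[bp:]...)
}

func main() {
	var buf []byte
	itoa(&buf, 2017)
	fmt.Println(string(buf)) // "2017"
}
```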
But, in the late hoist, isn't the constant arg visible as a load of a constant, hence invariant? |
I added a few extra prints to the debug output like this:
which results in this:
So the MOVDconst has the constant value in its auxint. But as I read the code, the Phi is not already in ss so it takes the continue path and doesn't add the MULHD to the list. |
I'm looking at your latest CL on a simple test like this:
I would like the base address of the flags array (which is a constant) to be hoisted outside the loop. In this test, it is loaded in the first block and stays there through all the passes until it gets to regalloc.
But after regalloc:
I know there are checks for rematerializable with comments that say the loads need to be closer to their uses, so I suspect that is causing this. However, I think in the case where the constant is referenced within a loop, it should at least be able to be hoisted out of the loop. This situation happens where a loop is stepping through an array and the base address is a constant but can't be hoisted since constants are all rematerializable, which is common. |
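A minimal reproducer of that shape (the names here are invented): a loop indexing a package-level array, where the array's base address is an SB-relative constant that the register allocator prefers to rematerialize on each iteration rather than keep live across the loop.

```go
package main

import "fmt"

// flags is a package-level array; its base address is a link-time
// constant, so the allocator treats loading it as rematerializable.
var flags = [8]uint8{1, 0, 1, 1, 0, 0, 1, 0}

// countSet indexes flags each iteration; every access needs the base
// address, which is recomputed in the loop instead of hoisted.
func countSet() int {
	n := 0
	for i := 0; i < len(flags); i++ {
		if flags[i] != 0 {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(countSet()) // 4
}
```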
The current code, which is amd64-influenced, may not be aggressive enough to do what you want here; code (fragments) that don't reach a threshold aren't hoisted. Fighting with the register allocator is another issue. |
Conservatively hoist some loop invariants outside the loop. Updates golang#15808
Change https://go.dev/cl/478815 mentions this issue: |
Compiles to:
Both the LEAQ and the MOVQ "".b+8 could be done outside the loop.