Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/link: Incorrect symbol linked in darwin/arm64 #58935

Closed
sywhang opened this issue Mar 8, 2023 · 19 comments
Closed

cmd/link: Incorrect symbol linked in darwin/arm64 #58935

sywhang opened this issue Mar 8, 2023 · 19 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@sywhang
Copy link
Contributor

sywhang commented Mar 8, 2023

Hello,

I'm from the Go Platform team at Uber, and we've been running into what appears to be a linker bug in macOS/M1 while trying to upgrade to Go 1.20.

What version of Go are you using (go version)?

This repros on all of 1.20 minor point releases 1.20.2.

$ go version
1.20.2

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

The issue only manifests in M1 macs.

We are in a Bazel sandbox environment, using rules_go.

go env Output
$ go env
❯ go env
GO111MODULE="off"
GOARCH="arm64"
GOBIN="/Users/sungyoon/go/bin/"
GOCACHE="/Users/sungyoon/Library/Caches/go-build"
GOENV="/Users/sungyoon/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/sungyoon/go-code/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/sungyoon/go-code"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/private/var/tmp/_bazel_sungyoon/ac08da491286b6bc27a8f65f3e5696d3/external/go_sdk"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/private/var/tmp/_bazel_sungyoon/ac08da491286b6bc27a8f65f3e5696d3/external/go_sdk/pkg/tool/darwin_arm64"
GOVCS=""
GOVERSION="go1.20.2"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/zl/5_l54cc5231fssrrq3thj6z40000gn/T/go-build2387064609=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

We received reports of some tests in our Go Monorepo that are only failing in M1 after upgrading to Go 1.20.

The panic trace depends on the failing targets, but all of them panic in some form during init.

Invalid return address:

runtime: g 15: unexpected return pc for runtime.callers called from 0x0
stack: frame={sp:0x14001dd3e80, fp:0x14001dd3ef0} stack=[0x14001dd2000,0x14001dd4000)
0x0000014001dd3d80:  0x00000140004216f0  0x0000000104d6f903 <testing.tRunner+0x0000000000000033>
0x0000014001dd3d90:  0x00000001162e2560  0x00000001121c655b
0x0000014001dd3da0:  0x000000000000000f  0x000000011485543f
0x0000014001dd3db0:  0x000000000000001d  0x000000000000059a
0x0000014001dd3dc0:  0x0000000000000599  0x0000000104d6f8d0 <testing.tRunner+0x0000000000000000>
0x0000014001dd3dd0:  0x00000001162e2560  0x00000001183881a0
0x0000014001dd3de0:  0x0000000104d6f903 <testing.tRunner+0x0000000000000033>  0x00000001162e2560
0x0000014001dd3df0:  0x00000001121c655b  0x000000000000000f
0x0000014001dd3e00:  0x0000000000000000  0x000000010d318ef5
0x0000014001dd3e10:  0x000000000000002a  0x000000010d2d9395
0x0000014001dd3e20:  0x0000000000000026  0x000000010d44460a
0x0000014001dd3e30:  0x0000000000000036  0x000000010d22f3ca
0x0000014001dd3e40:  0x000000000000001b  0x000000010d3d07d4
0x0000014001dd3e50:  0x0000000000000032  0x000000010d2d936f
0x0000014001dd3e60:  0x0000000000000026  0x000000010d2906eb
0x0000014001dd3e70:  0x0000000000000021  0x0000000000000000
runtime.callers(0x0?, {0x0?, 0x0?, 0x0?})
    GOROOT/src/runtime/traceback.go:843 +0x78 fp=0x14001dd3ef0 sp=0x14001dd3e80 pc=0x104c09988
created by testing.(*T).Run
    GOROOT/src/testing/testing.go:1629 +0x36c

Panics from callee:
case 1:

goroutine 1 [running]:
reflect.Value.lenNonSlice({0x10c2a5b30?, 0x140091352c0?, 0x140091c3080?})
        GOROOT/src/reflect/value.go:1714 +0x160
github.com/go-playground/validator/v10.init.0()
        external/com_github_go_playground_validator_v10/postcode_regexes.go:170 +0x38

case 2:

panic: reflect: call of reflect.Value.Uint on kind31 Value

goroutine 1 [running]:
reflect.Value.Uint({0x0?, 0x0?, 0x111bd4680?})
        GOROOT/src/reflect/value.go:2666 +0xfc
github.com/shopspring/decimal.newFromFloat(0x3de5d8fd1fd19ccd, 0x3de5d8fd1fd19ccd, 0x1160395a0)
        external/com_github_shopspring_decimal/decimal.go:302 +0x154
github.com/shopspring/decimal.NewFromFloat(0x3de5d8fd1fd19ccd)
        external/com_github_shopspring_decimal/decimal.go:262 +0x7c
github.com/shopspring/decimal.init()
        external/com_github_shopspring_decimal/decimal.go:1735 +0x23c

Looking through the disassembly, we're seeing calls to runtime.duffzero getting linked with some arbitrary functions in the problematic targets. If the linked callee panics from invalid args, then it causes this panic to occur. Sometimes the panic happens because the linked target expects a different stack size from one caller set up, and panics from invalid return pc.

Below is part of the disassembled init func of one of monorepo dependencies: (github.com/shopspring/decimal).

  0x5ba0e        a93eeffd        STP (R29, R27), -24(RSP)                
  0x5ba12        d10063fd        SUB $24, RSP, R29                    
  0x5ba16        94000000        CALL 0(PC)                        [0:4]R_CALLARM64:runtime.duffzero<1>+52    
  0x5ba1a        d10023fd        SUB $8, RSP, R29                    
  0x5ba1e        f901dfff        MOVD ZR, 952(RSP)                    
  0x5ba22        f94023e1        MOVD 64(RSP), R1                    
  0x5ba26        910223e0        ADD $136, RSP, R0                    
  0x5ba2a        94000000        CALL 0(PC)                        [0:4]R_CALLARM64:github.com/shopspring/decimal.(*decimal).Assign    
  0x5ba2e        f94247e2        MOVD 1160(RSP), R2                    

This is what we see in the intermediate archive file generated for compile, before it's linked.

But in the final binary, we're seeing the linker somehow linked the call to runtime.duffzero with reflect.Value in the same init function:

  0x107c89d38        a93eeffd        STP (R29, R27), -24(RSP)                    
  0x107c89d3c        d10063fd        SUB $24, RSP, R29                        
  0x107c89d40        97fded21        CALL reflect.Value.Float.island(SB)                 // ???
  0x107c89d44        d10023fd        SUB $8, RSP, R29                        
  0x107c89d48        f901dfff        MOVD ZR, 952(RSP)                        
  0x107c89d4c        f94023e1        MOVD 64(RSP), R1                        
  0x107c89d50        910223e0        ADD $136, RSP, R0                        
  0x107c89d54        9400022f        CALL github.com/shopspring/decimal.(*decimal).Assign(SB)    
  0x107c89d58        f94247e2        MOVD 1160(RSP), R2                        

Similar issue happens with runtime.duffcopy in another target:

Pre-linking:

  0xe6364        90000014        ADRP 0(PC), R20                [0:8]R_ADDRARM64:gopkg.in/yaml%2ev3..stmp_41<1>    
  0xe6368        91000294        ADD $0, R20, R20            
  0xe636c        1000009b        ADR 16(PC), R27                
  0xe6370        a93eeffd        STP (R29, R27), -24(RSP)        
  0xe6374        d10063fd        SUB $24, RSP, R29            
  0xe6378        94000000        CALL 0(PC)                [0:4]R_CALLARM64:runtime.duffcopy<1>+288    
  0xe637c        d10023fd        SUB $8, RSP, R29            
  0xe6380        910c63e0        ADD $792, RSP, R0            
  0xe6384        f900bfe0        MOVD R0, 376(RSP)            
  0xe6388        3980001b        MOVB (R0), R27                
  0xe638c        90000000        ADRP 0(PC), R0                [0:8]R_ADDRARM64:type:bool    

Post-linking:

  0x107cb2884        d004f174        ADRP 165863424(PC), R20                
  0x107cb2888        91304294        ADD $3088, R20, R20                
  0x107cb288c        1000009b        ADR 16(PC), R27                    
  0x107cb2890        a93eeffd        STP (R29, R27), -24(RSP)            
  0x107cb2894        d10063fd        SUB $24, RSP, R29                
  0x107cb2898        97fd49a8        CALL reflect.New.island(SB)              // ???
  0x107cb289c        d10023fd        SUB $8, RSP, R29                
  0x107cb28a0        910c63e0        ADD $792, RSP, R0                
  0x107cb28a4        f900bfe0        MOVD R0, 376(RSP)                
  0x107cb28a8        3980001b        MOVB (R0), R27                    
  0x107cb28ac        90043da0        ADRP 142295040(PC), R0   

This issue does not occur with every binary that uses these dependencies, but only some of them. Another point worth noting is that when we change the binary layout by turning inline optimization off or all optimizations off with -N/-l gcflags, the issues go away, but it starts happening on some other targets that were passing with the optimizations.

This issue does not occur on any other environments we have (Linux amd64 or darwin amd64).

What did you expect to see?

Linker correctly links correct binaries.

What did you see instead?

Panics as described above.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Mar 8, 2023
@thanm thanm added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Mar 8, 2023
@thanm
Copy link
Contributor

thanm commented Mar 8, 2023

Thanks for the report.

It would be helpful to know what the exact linker invocation is, and also whether you are using internal linking or external linking (latter is what you would typically get if your program uses CGO).

@thanm
Copy link
Contributor

thanm commented Mar 8, 2023

@cherrymui @thanm

@cherrymui
Copy link
Member

It would be helpful to know what the exact linker invocation is, and also whether you are using internal linking or external linking (latter is what you would typically get if your program uses CGO).

Also, how big the binary is. (For very large binaries it needs trampolines, which could be where the problem is)
Thanks.

@cherrymui
Copy link
Member

while trying to upgrade to Go 1.20.

Does it not fail with Go 1.19.x? That information would also be helpful. Thanks.

@thanm thanm added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Mar 8, 2023
@sywhang
Copy link
Contributor Author

sywhang commented Mar 8, 2023

We are using external linking mode, with these flags:

-extld external/local_config_cc/cc_wrapper.sh -s -w -X '<target>' '-buildid=redacted' -extldflags '-undefined dynamic_lookup -headerpad_max_install_names -lm')

This cc_wrapper.sh is coming from our Bazel environment.

The binary is also fairly large - haven't collected the size of all of the failing targets but the ones I did investigate were all upwards of 300mb. Some of the failing targets are even larger, reaching up to ~800mb.

And yup, this does not happen on any of the 1.19.x releases.

@thanm
Copy link
Contributor

thanm commented Mar 8, 2023

The use of external linking plus the fact that runtime.duff* routines are involved suggests that this might be a problem similar to the one referred to in CL 469275 (where the external linker is getting confused due to the non-zero addend on the call relocation). With that in mind, if you could let us know what flavor of linker you are using (ld.ldd, ld.bfd, etc) as well as the version.

@sywhang
Copy link
Contributor Author

sywhang commented Mar 8, 2023

We're using ld.ldd.

❯ ld -v
@(#)PROGRAM:ld  PROJECT:ld64-820.1
BUILD 20:07:05 Nov  7 2022
configured to support archs: armv6 armv7 armv7s arm64 arm64e arm64_32 i386 x86_64 x86_64h armv6m armv7k armv7m armv7em
LTO support using: LLVM version 14.0.0, (clang-1400.0.29.202) (static support for 29, runtime is 29)
TAPI support using: Apple TAPI version 14.0.0 (tapi-1400.0.11)

@cherrymui
Copy link
Member

Thanks. For a mis-compiled binary, could you show the incorrect instruction, and from the symbol table the address of the incorrect target function, as well as all duff symbols (i.e. symbols containing runtime.duffzero or runtime.duffcopy in their names, possibly with some suffixes)? That could be helpful. Thanks.

@cherrymui
Copy link
Member

cherrymui commented Mar 8, 2023

Could you try if CL https://golang.org/cl/474620 makes any difference? (You'll need to rebuild the Go toolchain with the CL patched in.)

Initially I thought it may be related. But I'm not really sure now. Maybe it is something else...

@sywhang
Copy link
Contributor Author

sywhang commented Mar 8, 2023

For a mis-compiled binary, could you show the incorrect instruction, and from the symbol table the address of the incorrect target function, as well as all duff symbols

Sure. I pulled this from one of the failing binaries:

The incorrect instruction is at 0x107c82630.

  0x107c82630           97fe0d71                CALL reflect.Value.Uint.island(SB)

The incorrect target function (reflect.Value.Uint) is at 0x1000adc60.

TEXT reflect.Value.Uint(SB) GOROOT/src/reflect/value.go
  0x1000adc60           f9400b90                MOVD 16(R28), R16

Here are some duff symbols:

TEXT runtime.duffzero+220(SB) src/runtime/duff_arm64.s
  duff_arm64.s:63       0x100072a9c             a8817e9f                STP.P (ZR, ZR), 16(R20)

TEXT runtime.duffzero+224(SB) src/runtime/duff_arm64.s
  duff_arm64.s:64       0x100072aa0             a8817e9f                STP.P (ZR, ZR), 16(R20)

TEXT runtime.duffzero+228(SB) src/runtime/duff_arm64.s
  duff_arm64.s:65       0x100072aa4             a8817e9f                STP.P (ZR, ZR), 16(R20)

TEXT runtime.duffzero+232(SB) src/runtime/duff_arm64.s
  duff_arm64.s:66       0x100072aa8             a8817e9f                STP.P (ZR, ZR), 16(R20)

TEXT runtime.duffzero+236(SB) src/runtime/duff_arm64.s
  duff_arm64.s:67       0x100072aac             a8817e9f                STP.P (ZR, ZR), 16(R20)

TEXT runtime.duffzero+240(SB) src/runtime/duff_arm64.s
  duff_arm64.s:68       0x100072ab0             a8817e9f                STP.P (ZR, ZR), 16(R20)

TEXT runtime.duffzero+244(SB) src/runtime/duff_arm64.s
  duff_arm64.s:69       0x100072ab4             a8817e9f                STP.P (ZR, ZR), 16(R20)

TEXT runtime.duffzero+248(SB) src/runtime/duff_arm64.s
  duff_arm64.s:70       0x100072ab8             a8817e9f                STP.P (ZR, ZR), 16(R20)

...

TEXT runtime.duffcopy+336(SB) src/runtime/duff_arm64.s
  duff_arm64.s:201      0x100072c20             a8c16e9a                LDP.P 16(R20), (R26, R27)
  duff_arm64.s:202      0x100072c24             a8816eba                STP.P (R26, R27), 16(R21)

TEXT runtime.duffcopy+344(SB) src/runtime/duff_arm64.s
  duff_arm64.s:204      0x100072c28             a8c16e9a                LDP.P 16(R20), (R26, R27)
  duff_arm64.s:205      0x100072c2c             a8816eba                STP.P (R26, R27), 16(R21)

TEXT runtime.duffcopy+352(SB) src/runtime/duff_arm64.s
  duff_arm64.s:207      0x100072c30             a8c16e9a                LDP.P 16(R20), (R26, R27)
  duff_arm64.s:208      0x100072c34             a8816eba                STP.P (R26, R27), 16(R21)

TEXT runtime.duffcopy+360(SB) src/runtime/duff_arm64.s
  duff_arm64.s:210      0x100072c38             a8c16e9a                LDP.P 16(R20), (R26, R27)
  duff_arm64.s:211      0x100072c3c             a8816eba                STP.P (R26, R27), 16(R21)

TEXT runtime.duffcopy+368(SB) src/runtime/duff_arm64.s
  duff_arm64.s:213      0x100072c40             a8c16e9a                LDP.P 16(R20), (R26, R27)
  duff_arm64.s:214      0x100072c44             a8816eba                STP.P (R26, R27), 16(R21)

Could you try if CL https://golang.org/cl/474620 makes any difference?

I've actually found this code in the linker, and tried to do the same change you made. It doesn't make any difference.

@cherrymui
Copy link
Member

Thanks. I think I understand it now. Will send a CL tomorrow.

@cherrymui
Copy link
Member

Updated CL https://golang.org/cl/474620 to be what I think may work. Could you try that? Thanks!

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/474620 mentions this issue: cmd/link: use label symbols for Duff's devices on darwin/arm64

@sywhang
Copy link
Contributor Author

sywhang commented Mar 9, 2023

@cherrymui Thanks! I confirmed this does fix the issues in the known failing targets. Still waiting to run the entire test suite in our Go Monorepo. That'll take some time but I'll update once that's done.

@cherrymui
Copy link
Member

@gopherbot please backport this to Go 1.20 release. This could cause binaries to be built incorrectly. Thanks.

@gopherbot
Copy link
Contributor

Backport issue(s) opened: #58954 (for 1.20).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

@cherrymui cherrymui removed the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Mar 9, 2023
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/475175 mentions this issue: [release-branch.go1.20] cmd/link: use label symbols for Duff's devices on darwin/arm64

@sywhang
Copy link
Contributor Author

sywhang commented Mar 9, 2023

We've verified that the fix does not break any further builds so far. The full suite is yet to finish, but we've covered enough tests so far to call this fix good on our end. Thank you for the quick turnaround.

@cherrymui
Copy link
Member

@sywhang thanks for confirming!

gopherbot pushed a commit that referenced this issue Mar 15, 2023
…s on darwin/arm64

On darwin, the external linker generally supports CALL relocations
with addend. One exception is that for a very large binary when it
decides to insert a trampoline, instead of applying the addend to
the call target (in the trampoline), it applies the addend to the
CALL instruction in the caller, i.e. generating a call to
trampoline+addend, which is not the correct address and usually
points to unreloated functions.

To work around this, we use label symbols so the CALL is targeting
a label symbol without addend. To make things simple we always use
label symbols for CALLs with addend (in external linking mode on
darwin/arm64), even for small binaries.

Updates #58935.
Fixes #58954.

Change-Id: I38aed6b62a0496c277c589b5accbbef6aace8dd5
Reviewed-on: https://go-review.googlesource.com/c/go/+/474620
TryBot-Result: Gopher Robot <[email protected]>
Run-TryBot: Cherry Mui <[email protected]>
Reviewed-by: Than McIntosh <[email protected]>
(cherry picked from commit 7dbd6de)
Reviewed-on: https://go-review.googlesource.com/c/go/+/475175
@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Apr 17, 2023
@dmitshur dmitshur added this to the Go1.21 milestone Apr 17, 2023
@golang golang locked and limited conversation to collaborators Apr 16, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

5 participants