Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime, cgo: programs using Cocoa/OpenGL/Metal APIs on macOS exhibit problems at tip not seen in 1.19.4 #57263

Closed
dmitshur opened this issue Dec 12, 2022 · 21 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. OS-Darwin release-blocker
Milestone

Comments

@dmitshur
Copy link
Contributor

dmitshur commented Dec 12, 2022

What version of Go are you using (go version)?

$ go version
go version go1.20rc1 darwin/arm64

Does this issue reproduce with the latest release?

Not with the latest stable release.

This appears to be a regression that happens only with Go 1.20 RC 1 and at tip (as of yesterday), but doesn't happen at all with Go 1.19.4 or any older stable Go release.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="arm64"
GOBIN=""
GOCACHE="/Users/gopher/Library/Caches/go-build"
GOENV="/Users/gopher/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/gopher/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/gopher/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_arm64"
GOVCS=""
GOVERSION="go1.20rc1"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/s_/5sjqzr0j6xggz_xtzmq_8r4m00jwcj/T/go-build1742320560=/tmp/go-build -gno-record-gcc-switches -fno-common"

Generally a close-to-default install of Go 1.20 RC 1 on an M1-based Mac running latest macOS (13.0.1/22A400) and Xcode (14.1/14B47b).

What did you do?

I tried running various Go programs that use Cocoa and either OpenGL or Metal APIs via cgo to open a window and render graphics. This problem affects all of them in the same way. The smallest way to reproduce I can share at this time are the simple example programs in the go-gl org:

$ cd $(mktemp -d)
$ git clone https://github.com/go-gl/example && cd example
$ go run ./gl21-cube
$ go run ./gl41core-cube

There's no problem running simple cgo programs that don't use the Cocoa/OpenGL/Metal APIs.

What did you expect to see?

Normal program execution, no warnings or errors, just like with Go 1.19.4 or older.

What did you see instead?

Almost always, there are warnings/log messages printed including:

2022-12-12 11:40:49.511 gl21-cube[64718:25499557] +[NSXPCSharedListener endpointForReply:withListenerName:replyErrorCode:]: an error occurred while attempting to obtain endpoint for listener 'ClientCallsAuxiliary': Connection invalid
2022-12-12 11:40:49.518 gl21-cube[64718:25499557] Error received in message reply handler: Connection invalid
2022-12-12 11:40:49.518 gl21-cube[64718:25499593] Connection Invalid error for service com.apple.hiservices-xpcservice.

(Those log messages may show up only after the Cocoa window is selected.)

Sometimes it also prints:

FALLBACK (log once): Fallback to SW vertex processing because buildPipelineState failed
FALLBACK (log once): Fallback to SW fragment processing because buildPipelineState failed
UNSUPPORTED (log once): UNEXPECTED/FATAL?: buildPipelineState failed to build fragment-fallback PSO, m_disable_code: 1001000
FALLBACK (log once): Fallback to SW vertex processing, m_disable_code: 1000
FALLBACK (log once): Fallback to SW fragment processing, m_disable_code: 1000000

Also observed:

2022/12/11 14:08:22 Compiler encountered XPC_ERROR_CONNECTION_INVALID (is the OS shutting down?)

(The OS was not shutting down. Restarting had no effect.)

It's not completely deterministic: sometimes the program will exit due to an error, or fail to render graphics properly if it keeps running. However, running the same program with Go 1.19.4 will work okay, and then re-running it with Go 1.20 RC 1 will also work (still with warnings, but graphics will render normally). From a quick look, it seems that modifying a shader source will cause it to start to fail again, so it's possible running it with Go 1.19.4 causes those to be built successfully, cached and reused.

This might very well be a problem with the Go program accessing those macOS APIs in an unsafe way, or a problem in macOS itself, but it only started happening at tip and doesn't happen when reverting to an older stable Go version.

CC @golang/runtime.

@dmitshur dmitshur added OS-Darwin NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Dec 12, 2022
@dmitshur dmitshur added this to the Go1.20 milestone Dec 12, 2022
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Dec 12, 2022
@rsc
Copy link
Contributor

rsc commented Dec 13, 2022

Can you bisect to when it happened?

@gtownsend
Copy link

If this helps...:

I started seeing the first error from an existing binary (of a Go app) when I upgraded my Mac to "Ventura" (OS 13.0). It was nonfatal, and the app still worked, so I didn't pay a lot of attention. Rebuilding the app with Go 1.20.RC1 had no effect.

Today I upgraded to MacOS 13.1 and with both the old and new app binaries the initial error is followed by
SIGABRT: abort
PC=0x18cff66ec m=7 sigcode=0
signal arrived during cgo execution

I guess the main thing I'm suggesting is that it might possibly be a change in MacOS rather than a change in the Go code that provoked the problem.

@dmitshur
Copy link
Contributor Author

@rsc Sure, I'll run a bisect and see if I can spot what Go commit introduces this, thanks. (It should be easy as long as this stays reproducible for me.)

@gtownsend Thanks. In your case, does compiling the Go program with Go 1.19.4 make any difference? For me, when I tried it last, the problem would immediately go away whenever the program was built with 1.19.4.

I'll also see if today's macOS 13.1 release brings any new differences.

@gtownsend
Copy link

gtownsend commented Dec 13, 2022 via email

@dmitshur
Copy link
Contributor Author

Observing the same behavior after updating from macOS 13.0.1 to 13.1.

@rsc Bisect found that this issue reproduces as of CL 451735 ("runtime: work around Apple libc bugs to make exec stop hanging"). Reverting just that commit at tip is sufficient to restore problem-free behavior.

@dmitshur
Copy link
Contributor Author

dmitshur commented Dec 14, 2022

After re-reading #56784 I realized this likely has to do with fork/new process being involved during shader compilation.

In case it ends up being helpful, here's a more self-contained repro that uses the Metal API:

package main

/*
#cgo CFLAGS: -x objective-c
#cgo LDFLAGS: -framework Metal -framework CoreGraphics -framework Foundation

#import <Metal/Metal.h>

void CompileMetalShader() {
	id<MTLDevice> device = MTLCreateSystemDefaultDevice();
	if (!device) {
		printf("no Metal device\n");
		return;
	}

	NSError * err;
	id<MTLLibrary> library = [device newLibraryWithSource:@"// Empty shader 123."
	                                              options:NULL
	                                                error:&err];
	if (err) {
		printf("newLibraryWithSource error: %s\n", err.localizedDescription.UTF8String);
		return;
	}

	printf("ok\n");
}
*/
import "C"

func main() {
	C.CompileMetalShader()

	// Output (with CL 451735, with shader source that hasn't been previously compiled):
	// newLibraryWithSource error: Compiler encountered XPC_ERROR_CONNECTION_INVALID (is the OS shutting down?)

	// Output (without CL 451735):
	// ok
}

Edit: Added -framework Foundation; it's needed for linking to succeed in some, albeit not all, macOS environments.

@gtownsend
Copy link

gtownsend commented Dec 14, 2022 via email

@dmitshur
Copy link
Contributor Author

dmitshur commented Dec 14, 2022

Tentatively adding release-blocker since this might be preventing some cgo programs from running on a first class port.

@ericchiang
Copy link
Contributor

ericchiang commented Dec 21, 2022

#57419 is another issue that appears to be caused by go.dev/cl/451735

Reproducer:

package main

// #cgo darwin LDFLAGS: -framework PCSC
// #include <PCSC/winscard.h>
// #include <PCSC/wintypes.h>
import "C"
import "fmt"

func main() {
	var ctx C.SCARDCONTEXT
	rc := int64(C.SCardEstablishContext(C.SCARD_SCOPE_SYSTEM, nil, nil, &ctx))
	if rc < 0 {
		// Fix overflow.
		rc += (1 << 32)
	}
	fmt.Printf("0x%08x\n", rc)
	if rc == C.SCARD_S_SUCCESS {
		C.SCardReleaseContext(ctx)
	}
}

Without cl/451735 or using Go 1.19 and earlier, this prints 0x00000000 (SCARD_S_SUCCESS). On Go 1.20rc1 it prints 0x8010001d (SCARD_E_NO_SERVICE)

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/459175 mentions this issue: runtime: revert Apple libc atfork workaround

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/459176 mentions this issue: runtime: call __fork instead of fork on darwin

gopherbot pushed a commit that referenced this issue Dec 22, 2022
Revert CL 451735 (1f4394a), which fixed #33565 and #56784
but also introduced #57263.

I have a different fix to apply instead. Since the first fix was
never backported, it will be easiest to backport the new fix
if the new fix is done in a separate CL from the revert.

Change-Id: I6c8ea3a46e542ee4702675bbc058e29ccd2723e0
Reviewed-on: https://go-review.googlesource.com/c/go/+/459175
Reviewed-by: Cherry Mui <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Run-TryBot: Russ Cox <[email protected]>
Repository owner moved this from In Progress to Done in Go Compiler / Runtime Dec 22, 2022
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/459178 mentions this issue: runtime: call __fork instead of fork on darwin

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/459178 mentions this issue: [release-branch.go1.19] runtime: call __fork instead of fork on darwin

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/459179 mentions this issue: [release-branch.go1.18] runtime: call __fork instead of fork on darwin

@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Dec 22, 2022
gopherbot pushed a commit that referenced this issue Dec 22, 2022
Issues #33565 and #56784 were caused by hangs in the child process
after fork, while it ran atfork handlers that ran into slow paths that
didn't work in the child.

CL 451735 worked around those two issues by calling a couple functions
at startup to try to warm up those child paths. That mostly worked,
but it broke programs using cgo with certain macOS frameworks (#57263).

CL 459175 reverted CL 451735.

This CL introduces a different fix: bypass the atfork child handlers
entirely. For a general fork call where the child and parent are both
meant to keep executing the original program, atfork handlers can be
necessary to fix any state that would otherwise be tied to the parent
process. But Go only uses fork as preparation for exec, and it takes
care to limit what it attempts to do in the child between the fork and
exec. In particular it doesn't use any of the things that the macOS
atfork handlers are trying to fix up (malloc, xpc, others). So we can
use the low-level fork system call (__fork) instead of the
atfork-wrapped one.

The full list of functions that can be called in a child after fork in
exec_libc2.go is:

 - ptrace
 - setsid
 - setpgid
 - getpid
 - ioctl
 - chroot
 - setgroups
 - setgid
 - setuid
 - chdir
 - dup2
 - fcntl
 - close
 - execve
 - write
 - exit

I disassembled all of these while attached to a hung exec.test binary
and confirmed that nearly all of them are making direct kernel calls,
not using anything that the atfork handler needs to fix up.
The exceptions are ioctl, fcntl, and exit.

The ioctl and fcntl implementations do some extra work around the
kernel call but don't call any other functions, so they should still
be OK. (If not, we could use __ioctl and __fcntl instead, but without
a good reason, we should keep using the standard entry points.)

The exit implementation calls atexit handlers. That is almost
certainly inappropriate in a failed fork child, so this CL changes
that call to __exit on darwin. To avoid making unnecessary changes at
this point in the release cycle, this CL leaves OpenBSD calling plain
exit, even though that is probably a bug in the OpenBSD port
(filed #57446).

Fixes #33565.
Fixes #56784.
Fixes #57263.

Fixes #56837.

Change-Id: I26812c26a72bdd7fcf72ec41899ba11cf6b9c4ab
Reviewed-on: https://go-review.googlesource.com/c/go/+/459176
Reviewed-by: David Chase <[email protected]>
Reviewed-by: Cherry Mui <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Run-TryBot: Russ Cox <[email protected]>
Reviewed-on: https://go-review.googlesource.com/c/go/+/459178
gopherbot pushed a commit that referenced this issue Dec 22, 2022
Issues #33565 and #56784 were caused by hangs in the child process
after fork, while it ran atfork handlers that ran into slow paths that
didn't work in the child.

CL 451735 worked around those two issues by calling a couple functions
at startup to try to warm up those child paths. That mostly worked,
but it broke programs using cgo with certain macOS frameworks (#57263).

CL 459175 reverted CL 451735.

This CL introduces a different fix: bypass the atfork child handlers
entirely. For a general fork call where the child and parent are both
meant to keep executing the original program, atfork handlers can be
necessary to fix any state that would otherwise be tied to the parent
process. But Go only uses fork as preparation for exec, and it takes
care to limit what it attempts to do in the child between the fork and
exec. In particular it doesn't use any of the things that the macOS
atfork handlers are trying to fix up (malloc, xpc, others). So we can
use the low-level fork system call (__fork) instead of the
atfork-wrapped one.

The full list of functions that can be called in a child after fork in
exec_libc2.go is:

 - ptrace
 - setsid
 - setpgid
 - getpid
 - ioctl
 - chroot
 - setgroups
 - setgid
 - setuid
 - chdir
 - dup2
 - fcntl
 - close
 - execve
 - write
 - exit

I disassembled all of these while attached to a hung exec.test binary
and confirmed that nearly all of them are making direct kernel calls,
not using anything that the atfork handler needs to fix up.
The exceptions are ioctl, fcntl, and exit.

The ioctl and fcntl implementations do some extra work around the
kernel call but don't call any other functions, so they should still
be OK. (If not, we could use __ioctl and __fcntl instead, but without
a good reason, we should keep using the standard entry points.)

The exit implementation calls atexit handlers. That is almost
certainly inappropriate in a failed fork child, so this CL changes
that call to __exit on darwin. To avoid making unnecessary changes at
this point in the release cycle, this CL leaves OpenBSD calling plain
exit, even though that is probably a bug in the OpenBSD port
(filed #57446).

Fixes #33565.
Fixes #56784.
Fixes #57263.
Fixes #56836.

Change-Id: I26812c26a72bdd7fcf72ec41899ba11cf6b9c4ab
Reviewed-on: https://go-review.googlesource.com/c/go/+/459176
Reviewed-by: David Chase <[email protected]>
Reviewed-by: Cherry Mui <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Run-TryBot: Russ Cox <[email protected]>
Reviewed-on: https://go-review.googlesource.com/c/go/+/459179
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/460476 mentions this issue: runtime: Apple libc atfork workaround take 3

@tmm1
Copy link
Contributor

tmm1 commented Jan 6, 2023

We're tracking some customer crashes on M1 macOS 13.1 which seems to be related to the __fork commit. However, I have not yet been able to replicate on our own hardware running the same OS. I am sending the customer a build with the __fork commit reverted to see if it resolves the issue.

PC=0x1a95867c8 m=0 sigcode=0
signal arrived during cgo execution

goroutine 1 [syscall, locked to thread]:
runtime.cgocall(0x101b0e538, 0x14000d0fc48)
	runtime/cgocall.go:158 +0x54 fp=0x14000d0fc10 sp=0x14000d0fbd0 pc=0x100ac31b4
github.com/fancybits/channels-server/menubar._Cfunc_StartApp()
	_cgo_gotypes.go:91 +0x34 fp=0x14000d0fc40 sp=0x14000d0fc10 pc=0x1016c9684
github.com/fancybits/channels-server/menubar.Start({0x14000d1cd80, 0x7, 0x8})
	github.com/fancybits/channels-server/menubar/menubar.go:34 +0xe0 fp=0x14000d0fc70 sp=0x14000d0fc40 pc=0x1016c9920
main.StartMenu()
	github.com/fancybits/channels-server/menu_darwin.go:74 +0x424 fp=0x14000d0fd10 sp=0x14000d0fc70 pc=0x101afeb74
We are using cgo to populate an icon in the macOS menu bar.
/*
#cgo CFLAGS: -x objective-c -mmacosx-version-min=10.10
#cgo LDFLAGS: -framework Cocoa -framework Security
#import 
void SetupApp();
void StartApp();
void StopApp();
void add_menu_item(int, const char *);
void add_separator_item();
*/
import "C"
void SetupApp(void) {
  [NSAutoreleasePool new];
  [NSApplication sharedApplication];
  [NSApp setActivationPolicy:NSApplicationActivationPolicyAccessory];

SessionAttributeBits attributes;
OSStatus err = SessionGetInfo(callerSecuritySession, NULL, &attributes);
if ((err == noErr) && (attributes & sessionHasGraphicAccess)) {
menuDelegate = [MenuHandler new];
appMenu = [[NSMenu new] autorelease];

NSData *data;
float scale = [[NSScreen mainScreen] backingScaleFactor];
if (scale > 1) {
  data = [NSData dataWithBytes:menubar_2x_png length:menubar_2x_png_len];
} else {
  data = [NSData dataWithBytes:menubar_png length:menubar_png_len];
}
NSImage *icon = [[NSImage alloc] initWithData:data];
if (scale > 1) {
  icon.size = NSMakeSize(icon.size.width / scale, icon.size.height / scale);
}
[icon setTemplate:YES];

statusItem = [[[NSStatusBar systemStatusBar] statusItemWithLength:NSSquareStatusItemLength] retain];
[statusItem setMenu:appMenu];
[statusItem setImage:icon];
[statusItem setHighlightMode:YES];
[statusItem setToolTip:@"Channels DVR"];

}

[NSWorkspace.sharedWorkspace.notificationCenter addObserver:menuDelegate
selector:@selector(didWake:)
name:NSWorkspaceDidWakeNotification
object:NULL];

[NSApp activateIgnoringOtherApps:NO];
}

void StartApp() {
[NSApp run];
}

@rsc rsc moved this from Done to Todo in Go Compiler / Runtime Jan 8, 2023
@rsc rsc reopened this Jan 8, 2023
@github-project-automation github-project-automation bot moved this from Todo to In Progress in Go Compiler / Runtime Jan 8, 2023
@rsc
Copy link
Contributor

rsc commented Jan 8, 2023

Which version of Go are you using?
Reopening to make sure we resolve this one way or the other before Go 1.20.
Does your crashing program use os/exec?
And if so does it use os/exec before calling main.StartMenu?

@tmm1
Copy link
Contributor

tmm1 commented Jan 8, 2023

We are tracking release-branch.go1.19 in a fork that we use here: https://github.com/fancybits/go/tree/fancybits-1.19

The customer confirmed that reverting the __fork change fixed his crashes.

I also got more details about his setup, which is different than what we were trying ourselves:

It’s an M1 Max mini. It’s I’m running as a dedicated non-admin user and it’s running logged out. I have it running as a service via launchd.


And if so does it use os/exec before calling main.StartMenu?

Yes, it does use os/exec before calling StartMenu

@rsc
Copy link
Contributor

rsc commented Jan 8, 2023

Thank you for the quick confirmation.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/461115 mentions this issue: [release-branch.go1.19] runtime: revert "call __fork instead of fork on darwin"

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/461116 mentions this issue: [release-branch.go1.18] runtime: revert "call __fork instead of fork on darwin"

gopherbot pushed a commit that referenced this issue Jan 9, 2023
…on darwin"

A recent comment on #57263 reports an unexplained crash in a cgo program
that is fixed by reverting the __fork fix. We don't have any viable fix for the
os/exec bug at this point, so give up on a fix for the January point releases.

This reverts CL 459179 (commit 07b6ffb).

Fixes #57689.

Change-Id: I3b81de6bded399f47862325129e86a65c83d8e3b
Reviewed-on: https://go-review.googlesource.com/c/go/+/461116
Reviewed-by: Ian Lance Taylor <[email protected]>
Run-TryBot: Russ Cox <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Cherry Mui <[email protected]>
gopherbot pushed a commit that referenced this issue Jan 9, 2023
…on darwin"

A recent comment on #57263 reports an unexplained crash in a cgo program
that is fixed by reverting the __fork fix. We don't have any viable fix for the
os/exec bug at this point, so give up on a fix for the January point releases.

This reverts CL 459178 (commit 91bc4cd).

Fixes #57690.

Change-Id: Ieb38d9bc7f967e9a726429eab2ea515d5ca0847f
Reviewed-on: https://go-review.googlesource.com/c/go/+/461115
Run-TryBot: Russ Cox <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Cherry Mui <[email protected]>
@github-project-automation github-project-automation bot moved this from In Progress to Done in Go Compiler / Runtime Jan 10, 2023
@golang golang locked and limited conversation to collaborators Jan 10, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. OS-Darwin release-blocker
Projects
None yet
Development

No branches or pull requests

6 participants