Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

syscall: cgo calling of glibc nptl:setxid functions timing out in the face of pthread termination pressure #42494

Closed
AndrewGMorgan opened this issue Nov 10, 2020 · 11 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@AndrewGMorgan
Copy link
Contributor

AndrewGMorgan commented Nov 10, 2020

What version of Go are you using (go version)?

HEAD (1.16) release branch, but suspect earlier go toolchains might also suffer from this.

Does this issue reproduce with the latest release?

Don't know yet, but plan to start my investigation by writing a standalone go test case.

Update 2020-11-25: Yes, all versions of Go at least as far back as 1.10 (but probably earlier too, untested).

What operating system and processor architecture are you using (go env)?

All variants of Linux architectures.

What did you do?

Aggressively terminate threads while invoking glibc-backed ntpl:setxid functions linked with cgo.

What did you expect to see?

No crashes or timeouts.

What did you see instead?

In the 1.16 build tree, this yields timeouts when running the syscall.TestSetuidEtc() as root with a modification to the test explored in:

#42462 (comment)

The nocgo support for this test, featuring a new syscall.AllThreadsSyscall() implementation does not lockup. It is just the cgo redirection to glibc implementation of the various functions that times out in the face of this thread pressure - inside glibc code.

Example:

panic: test timed out after 3m0s

goroutine 5 [running]:
testing.(*M).startAlarm.func1()
	/workdir/go/src/testing/testing.go:1680 +0xcb
created by time.goFunc
	/workdir/go/src/time/sleep.go:167 +0x37

goroutine 1 [chan receive]:
testing.(*T).Run(0xa506c40, 0x81aaf83, 0xd, 0x81b63ac, 0x401)
	/workdir/go/src/testing/testing.go:1219 +0x236
testing.runTests.func1(0xa4829a0)
	/workdir/go/src/testing/testing.go:1491 +0x5a
testing.tRunner(0xa4829a0, 0xa43aeb0)
	/workdir/go/src/testing/testing.go:1173 +0xb7
testing.runTests(0xa4b0030, 0x829b140, 0x2e, 0x2e, 0x162925e3, 0xbfe2d74a, 0xe8de9c73, 0x29, 0x829cee0, 0xa43af24)
	/workdir/go/src/testing/testing.go:1489 +0x25a
testing.(*M).Run(0xa4d2000, 0x0)
	/workdir/go/src/testing/testing.go:1397 +0x178
syscall_test.TestMain(0xa4d2000)
	/workdir/go/src/syscall/syscall_linux_test.go:153 +0x97
main.main()
	_testmain.go:137 +0x110

goroutine 43 [chan receive]:
testing.(*T).Parallel(0xa5060e0)
	/workdir/go/src/testing/testing.go:1039 +0xfd
syscall_test.TestInvalidExec(0xa5060e0)
	/workdir/go/src/syscall/exec_unix_test.go:222 +0x21
testing.tRunner(0xa5060e0, 0x81b6324)
	/workdir/go/src/testing/testing.go:1173 +0xb7
created by testing.(*T).Run
	/workdir/go/src/testing/testing.go:1218 +0x217

goroutine 56 [syscall]:
syscall.cgocaller(0x8049700, 0xa4a3b8c, 0x1, 0x1, 0x804e7b0)
	/workdir/go/src/runtime/cgocall.go:109 +0x37
syscall.Setegid(0x1, 0xa5062a0, 0xa48fe30)
	/workdir/go/src/syscall/syscall_linux.go:1107 +0x61
syscall_test.TestSetuidEtc.func1(0xa48fe00, 0x81b63d8)
	/workdir/go/src/syscall/syscall_linux_test.go:627 +0x21
syscall_test.TestSetuidEtc(0xa506c40)
	/workdir/go/src/syscall/syscall_linux_test.go:661 +0x1dc
testing.tRunner(0xa506c40, 0x81b63ac)
	/workdir/go/src/testing/testing.go:1173 +0xb7
created by testing.(*T).Run
	/workdir/go/src/testing/testing.go:1218 +0x217
FAIL	syscall	180.019s
FAIL
go tool dist: Failed: exit status 1
@AndrewGMorgan
Copy link
Contributor Author

Please feel free to assign this issue to me.

@ianlancetaylor ianlancetaylor changed the title cgo calling of glibc nptl:setxid functions timing out in the face of pthread termination pressure syscall: cgo calling of glibc nptl:setxid functions timing out in the face of pthread termination pressure Nov 11, 2020
@ianlancetaylor ianlancetaylor added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Nov 11, 2020
@ianlancetaylor ianlancetaylor added this to the Go1.16 milestone Nov 11, 2020
@AndrewGMorgan
Copy link
Contributor Author

AndrewGMorgan commented Nov 11, 2020

It would appear that this is an issue at HEAD and at least as far back as go1.10.8 (I've not looked further back):

The test code, a cgo program that performs one privileged system call via glibc: seteuid() with or without go killing threads somewhat in parallel:

// Program singer investigates how stable go's use of pthreads is with
// respect to glibc's nptl:setxid backed C.seteuid() function call.
// There are 9 such setxid backed functions in glibc. We expect them
// all to have similar stability, but only experiment on one of them.
//
// This is to characterize: https://github.com/golang/go/issues/42494
package main

import (
	"flag"
	"fmt"
	"log"
	"runtime"
)

// #include <unistd.h>
import "C"

var (
	priv  = flag.Bool("seteuid", false, "switch euid back and forth 0<->1")
	kill  = flag.Bool("kill", false, "kill a thread each loop iteration")
	count = flag.Int("count", 50, "loop this many times")
	steps = flag.Int("steps", 1, "display a '.' every so many steps")
)

// killAThread locks the goroutine to a thread and exits. This
// has the effect of making the go runtime terminate the thread.
func killAThread(c <-chan struct{}) {
	runtime.LockOSThread()
	<-c
}

func main() {
	flag.Parse()

	if *steps < 1 {
		log.Fatal("--steps must be positive")
	}

	for i, s := *count, *steps; i > 0; i-- {
		if i%s == 0 {
			fmt.Print(".")
		}

		if *kill {
			c := make(chan struct{})
			go killAThread(c)
			close(c)
		}

		if *priv {
			if v := C.seteuid(C.uid_t(i & 1)); v != 0 {
				log.Fatalf(" seteuid(%d) failed", i&1)
			}
		}
	}

	fmt.Println(" done")
}

For the testing, I downloaded sdk builds for linux/amd64 and ran them on a Fedora release 32 machine running:

$ rpm -q glibc
glibc-2.31-4.fc32.x86_64

This is the test log:

$ for x in ~/sdk/* ; do ${x}/bin/go build singer.go && echo "${x##*/}:" ; for c in "--kill" "--seteuid" "--kill --seteuid" ; do echo -n "${c}: " ; sudo timeout -k9 10 ./singer $c ; done ; echo ; done
go1.10.8:
--kill: .................................................. done
--seteuid: .................................................. done
--kill --seteuid: .
go1.11.13:
--kill: .................................................. done
--seteuid: .................................................. done
--kill --seteuid: ...............
go1.12.17:
--kill: .................................................. done
--seteuid: .................................................. done
--kill --seteuid: ..
go1.13.15:
--kill: .................................................. done
--seteuid: .................................................. done
--kill --seteuid: .
go1.14.11:
--kill: .................................................. done
--seteuid: .................................................. done
--kill --seteuid: ..
go1.15.4:
--kill: .................................................. done
--seteuid: .................................................. done
--kill --seteuid: .......

As can be seen above:

  • looping over thread killing seems to work fine
  • looping over seteuid()ing seems to work fine
  • looping them together yields some sort of lockup on all releases of go

@AndrewGMorgan
Copy link
Contributor Author

I get essentially the same result using a modified version of the same program which uses the psx cgo package and doesn't depend on glibc's setxid mechanism:

// Program backupsinger investigates how stable go's use of pthreads is with
// respect to the psx package's way of supporting the seteuid() function.
//
// This is to characterize: https://github.com/golang/go/issues/42494
package main

import (
	"flag"
	"fmt"
	"log"
	"runtime"
	"syscall"

	"kernel.org/pub/linux/libs/security/libcap/psx"
)

var (
	priv  = flag.Bool("seteuid", false, "switch euid back and forth 0<->1")
	kill  = flag.Bool("kill", false, "kill a thread each loop iteration")
	count = flag.Int("count", 50, "loop this many times")
	steps = flag.Int("steps", 1, "display a '.' every so many steps")
)

// killAThread locks the goroutine to a thread and exits. This
// has the effect of making the go runtime terminate the thread.
func killAThread(c <-chan struct{}) {
	runtime.LockOSThread()
	<-c
}

// minusOne is -1 in the form of a uintptr for system call use.
const minusOne = ^uintptr(0)

func main() {
	flag.Parse()

	if *steps < 1 {
		log.Fatal("--steps must be positive")
	}

	for i, s := *count, *steps; i > 0; i-- {
		if i%s == 0 {
			fmt.Print(".")
		}

		if *kill {
			c := make(chan struct{})
			go killAThread(c)
			close(c)
		}

		if *priv {
			if _, _, e := psx.Syscall3(syscall.SYS_SETRESUID, minusOne, uintptr(i&1), minusOne); e != 0 {
				log.Fatalf(" psx:seteuid(%d) failed: %v", i&1, syscall.Errno(e))
			}
		}
	}

	fmt.Println(" done")
}

The result being:

--kill: .................................................. done
--seteuid: .................................................. done
--kill --seteuid: ....

I think I'm going to explore this issue using this psx package since I'm more familiar with those moving parts.

@AndrewGMorgan
Copy link
Contributor Author

So these tests are just stand alone cgo programs to focus on getting to the bottom of these failures from the #42462 (comment) :

2 of 23 SlowBots failed:
Failed on linux-amd64: https://storage.googleapis.com/go-build-log/d73fad79/linux-amd64_0f499959.log
Failed on linux-386: https://storage.googleapis.com/go-build-log/d73fad79/linux-386_1a608de2.log

When I filed this present bug, I had assumed the issue was generic to all Linux architectures, but on the face of it 21 build tests passed. Is there a list of what the other 21 tests are? Do any others include cgo builds executed as uid=0 ? Could this issue just be a x86 thing?

@AndrewGMorgan
Copy link
Contributor Author

AndrewGMorgan commented Nov 11, 2020

I guess for completeness, here is the same code which runs the new (targeted to 1.16) syscall.AllThreadsSyscall() function (this is just a more explicit variant of what the syscall.Seteuid() function does when not needing cgo):

// Program subsinger investigates how stable go's use of threads is with
// respect to the newly added syscall.AllThreadsSyscall() function.
//
// This is to characterize: https://github.com/golang/go/issues/42494
package main

import (
        "flag"
        "fmt"
        "log"
        "runtime"
        "syscall"
)

var (
        priv  = flag.Bool("seteuid", false, "switch euid back and forth 0<->1")
        kill  = flag.Bool("kill", false, "kill a thread each loop iteration")
        count = flag.Int("count", 50, "loop this many times")
        steps = flag.Int("steps", 1, "display a '.' every so many steps")
)

// killAThread locks the goroutine to a thread and exits. This
// has the effect of making the go runtime terminate the thread.
func killAThread(c <-chan struct{}) {
        runtime.LockOSThread()
        <-c
}

// minusOne is -1 in the form of a uintptr for system call use.
const minusOne = ^uintptr(0)

func main() {
        flag.Parse()

        if *steps < 1 {
                log.Fatal("--steps must be positive")
        }

        for i, s := *count, *steps; i > 0; i-- {
                if i%s == 0 {
                        fmt.Print(".")
                }

                if *kill {
                        c := make(chan struct{})
                        go killAThread(c)
                        close(c)
                }

                if *priv {
                        if _, _, e := syscall.AllThreadsSyscall(syscall.SYS_SETRESUID, minusOne, uintptr(i&1), minusOne); e != 0 {
                                log.Fatalf("AllThreadSyscall:seteuid(%d) failed: %v", i&1, syscall.Errno(e))
                        }
                }
        }

        fmt.Println(" done")
}

The result being the intended one:

--kill: .................................................. done
--seteuid: .................................................. done
--kill --seteuid: .................................................. done

@AndrewGMorgan
Copy link
Contributor Author

AndrewGMorgan commented Nov 12, 2020

Exploring the behavior against the psx implementation of POSIX semantics system call support, I was able to determine that the issue at least with that implementation is that when the thread dies, Go's runtime decides to block all interrupts before allowing the thread to actually exit.

As such, the workaround for the psx package is to unblock the single psx-interrupt that is pending on the thread as it is dying until the thread can execute its cleanup code. Here is the patch for that.

I've tagged that one psx/v0.2.46-rc1 so a go.mod file like this and the backupsinger.go source (see #42494 (comment)):

module backupsinger

go 1.15

require kernel.org/pub/linux/libs/security/libcap/psx v0.2.46-rc1

Runs as follows:

$ ~/sdk/go1.15.4/bin/go build backupsinger.go
$ for c in "--kill" "--seteuid" "--kill --seteuid" ; do echo -n "${c}: " ; sudo timeout -k9 10 ./backupsinger $c ; done 
--kill: .................................................. done
--seteuid: .................................................. done
--kill --seteuid: .................................................. done

Success!

Next up, try to figure out how to make the glibc version work (I suspect this may require a golang change).

@AndrewGMorgan
Copy link
Contributor Author

I have this working in my sandbox. It appears to be yet another corner case issue like these:

#6997 #3871 #9400 #12498

I'm wondering how to test for (musl) issues such as #39343 ?

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/269799 mentions this issue: runtime: linux iscgo support for not blocking nptl signals

@AndrewGMorgan
Copy link
Contributor Author

The above performs as follows:

$ ../bin/go test -c syscall
$ ldd syscall.test 
        linux-vdso.so.1 (0x00007ffd273e5000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f022ee34000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f022ec6f000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f022ee83000)
$ sudo ./syscall.test -test.run=SetuidEtc -test.count=10000
PASS

@AndrewGMorgan
Copy link
Contributor Author

I'm wondering if this should be marked a release blocker? Also, is there an annotation to consider back-porting the fix to one of the earlier releases?

@ianlancetaylor
Copy link
Member

It seems like this is not a new problem, so it shouldn't be a release blocker. Of course it would be good if we can fix it for 1.16, but we shouldn't block the release over it.

Similarly, this doesn't meet our usual backport criteria.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

3 participants