
runtime: async preemption causes short sleep in tickspersecond #63103

Closed
rhysh opened this issue Sep 19, 2023 · 6 comments
Labels
compiler/runtime FrozenDueToAge help wanted NeedsInvestigation
Milestone
Backlog

Comments

@rhysh
Contributor

rhysh commented Sep 19, 2023

What version of Go are you using (go version)?

$ go1.21 version
go version go1.21.1 darwin/arm64

Does this issue reproduce with the latest release?

Yes, Go 1.21 is the latest release series.

What operating system and processor architecture are you using (go env)?

go env Output
$ go1.21 env
GO111MODULE=''
GOARCH='arm64'
GOBIN=''
GOCACHE='/Users/rhys/Library/Caches/go-build'
GOENV='/Users/rhys/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/rhys/go/pkg/mod'
GONOPROXY='*'
GONOSUMDB='*'
GOOS='darwin'
GOPATH='/Users/rhys/go'
GOPRIVATE='*'
GOPROXY='direct'
GOROOT='/Users/rhys/go/version/go1.21'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='local'
GOTOOLDIR='/Users/rhys/go/version/go1.21/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='go1.21.1'
GCCGO='gccgo'
AR='ar'
CC='clang'
CXX='clang++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/lf/n4ry0qv5639_3z0nhyhy3z7c0000gs/T/go-build1797745903=/tmp/go-build -gno-record-gcc-switches -fno-common'

What did you do?

$ cat slowprof.go
package main

import (
	"flag"
	"io"
	"log"
	"runtime"
	"runtime/pprof"
)

func main() {
	cpu := flag.Bool("cpu", false, "collect a CPU profile")
	mutex := flag.Int("mutex", -1, "collect a mutex profile with specified debug level")
	block := flag.Int("block", -1, "collect a block profile with specified debug level")
	flag.Parse()

	if *cpu {
		err := pprof.StartCPUProfile(io.Discard)
		if err != nil {
			log.Fatalf("pprof.StartCPUProfile: %v", err)
		}
		pprof.StopCPUProfile()
	}

	if *mutex >= 0 {
		runtime.SetMutexProfileFraction(2)
		err := pprof.Lookup("mutex").WriteTo(io.Discard, *mutex)
		if err != nil {
			log.Fatalf(`pprof.Lookup("mutex").WriteTo: %v`, err)
		}
	}

	if *block >= 0 {
		runtime.SetBlockProfileRate(2)
		err := pprof.Lookup("block").WriteTo(io.Discard, *block)
		if err != nil {
			log.Fatalf(`pprof.Lookup("block").WriteTo: %v`, err)
		}
	}
}

$ go1.21 build -o /tmp/slowprof ./slowprof.go

$ for i in {1..10}; do time /tmp/slowprof; done                          
/tmp/slowprof  0.00s user 0.01s system 58% cpu 0.022 total
/tmp/slowprof  0.00s user 0.00s system 36% cpu 0.009 total
/tmp/slowprof  0.00s user 0.00s system 55% cpu 0.007 total
/tmp/slowprof  0.00s user 0.00s system 41% cpu 0.010 total
/tmp/slowprof  0.00s user 0.00s system 39% cpu 0.009 total
/tmp/slowprof  0.00s user 0.00s system 36% cpu 0.006 total
/tmp/slowprof  0.00s user 0.00s system 37% cpu 0.005 total
/tmp/slowprof  0.00s user 0.00s system 33% cpu 0.008 total
/tmp/slowprof  0.00s user 0.00s system 29% cpu 0.007 total
/tmp/slowprof  0.00s user 0.00s system 29% cpu 0.007 total


$ for i in {1..10}; do time GODEBUG=asyncpreemptoff=1 /tmp/slowprof -mutex=0; done
GODEBUG=asyncpreemptoff=1 /tmp/slowprof -mutex=0  0.00s user 0.01s system 7% cpu 0.120 total
GODEBUG=asyncpreemptoff=1 /tmp/slowprof -mutex=0  0.00s user 0.00s system 4% cpu 0.117 total
GODEBUG=asyncpreemptoff=1 /tmp/slowprof -mutex=0  0.00s user 0.01s system 7% cpu 0.120 total
GODEBUG=asyncpreemptoff=1 /tmp/slowprof -mutex=0  0.00s user 0.00s system 6% cpu 0.120 total
GODEBUG=asyncpreemptoff=1 /tmp/slowprof -mutex=0  0.00s user 0.00s system 5% cpu 0.115 total
GODEBUG=asyncpreemptoff=1 /tmp/slowprof -mutex=0  0.00s user 0.00s system 5% cpu 0.113 total
GODEBUG=asyncpreemptoff=1 /tmp/slowprof -mutex=0  0.00s user 0.00s system 5% cpu 0.112 total
GODEBUG=asyncpreemptoff=1 /tmp/slowprof -mutex=0  0.00s user 0.00s system 6% cpu 0.113 total
GODEBUG=asyncpreemptoff=1 /tmp/slowprof -mutex=0  0.00s user 0.00s system 5% cpu 0.111 total
GODEBUG=asyncpreemptoff=1 /tmp/slowprof -mutex=0  0.00s user 0.00s system 6% cpu 0.118 total

$ for i in {1..10}; do time GODEBUG=asyncpreemptoff=0 /tmp/slowprof -mutex=0; done
GODEBUG=asyncpreemptoff=0 /tmp/slowprof -mutex=0  0.00s user 0.01s system 34% cpu 0.035 total
GODEBUG=asyncpreemptoff=0 /tmp/slowprof -mutex=0  0.00s user 0.00s system 20% cpu 0.026 total
GODEBUG=asyncpreemptoff=0 /tmp/slowprof -mutex=0  0.00s user 0.00s system 19% cpu 0.026 total
GODEBUG=asyncpreemptoff=0 /tmp/slowprof -mutex=0  0.00s user 0.00s system 14% cpu 0.021 total
GODEBUG=asyncpreemptoff=0 /tmp/slowprof -mutex=0  0.00s user 0.00s system 12% cpu 0.021 total
GODEBUG=asyncpreemptoff=0 /tmp/slowprof -mutex=0  0.00s user 0.00s system 11% cpu 0.025 total
GODEBUG=asyncpreemptoff=0 /tmp/slowprof -mutex=0  0.00s user 0.00s system 18% cpu 0.022 total
GODEBUG=asyncpreemptoff=0 /tmp/slowprof -mutex=0  0.00s user 0.00s system 16% cpu 0.024 total
GODEBUG=asyncpreemptoff=0 /tmp/slowprof -mutex=0  0.00s user 0.00s system 17% cpu 0.024 total
GODEBUG=asyncpreemptoff=0 /tmp/slowprof -mutex=0  0.00s user 0.00s system 17% cpu 0.024 total

What did you expect to see?

I expected the code in runtime.tickspersecond that calls usleep(100 * 1000) to result in a delay of 100,000µs (100ms).

https://github.com/golang/go/blob/go1.21.1/src/runtime/runtime.go#L35
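
For reference, the relevant logic is roughly the following (lightly paraphrased from the linked file; the locking around the slow path and some edge-case checks are omitted):

// Paraphrase of runtime.tickspersecond as of go1.21.1, not the exact source.
func tickspersecond() int64 {
	// Return the cached value if a previous call already computed it.
	r := int64(atomic.Load64(&ticks.val))
	if r != 0 {
		return r
	}
	// Otherwise, sample both clocks across an intended 100ms window.
	t0 := nanotime()
	c0 := cputicks()
	usleep(100 * 1000) // the sleep that async preemption cuts short
	t1 := nanotime()
	c1 := cputicks()
	if t1 == t0 {
		t1++
	}
	// Ticks per second = CPU-tick delta scaled by the elapsed wall time.
	r = (c1 - c0) * 1000 * 1000 * 1000 / (t1 - t0)
	atomic.Store64(&ticks.val, uint64(r))
	return r
}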

What did you see instead?

Asynchronous preemption interrupts the call, both on darwin/arm64 and linux/amd64, resulting in a shorter delay (10–20ms).

Maybe the shorter delay is fine (see #63078), but if it's intentional then the code should change to reflect that intent.

gopherbot added the compiler/runtime label Sep 19, 2023
@cherrymui
Member

It looks to me like the code calculates ticks per second from the actual elapsed time and the CPU tick difference (https://cs.opensource.google/go/go/+/master:src/runtime/runtime.go;l=36-41), so a shorter delay should be fine. Do you suggest that we add a comment? Thanks.
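
To illustrate why the window length barely matters, here is a small self-contained sketch; fakeCPUTicks is a made-up stand-in for the runtime's internal cputicks, simulated as a 3 GHz counter so the program runs anywhere:

package main

import (
	"fmt"
	"time"
)

var progStart = time.Now()

// fakeCPUTicks stands in for the runtime's cputicks (e.g. the TSC on amd64),
// simulated here as a 3 GHz counter derived from the monotonic clock.
func fakeCPUTicks() int64 { return time.Since(progStart).Nanoseconds() * 3 }

// estimate computes ticks per second over the given window, the same way
// tickspersecond divides the CPU-tick delta by the elapsed time.
func estimate(window time.Duration) int64 {
	t0 := time.Now()
	c0 := fakeCPUTicks()
	time.Sleep(window)
	elapsed := time.Since(t0).Nanoseconds()
	c1 := fakeCPUTicks()
	return (c1 - c0) * int64(time.Second) / elapsed
}

func main() {
	// A 10ms window and a 100ms window give nearly identical estimates,
	// which is why an interrupted (shorter) sleep still yields a usable value.
	fmt.Println("10ms window: ", estimate(10*time.Millisecond))
	fmt.Println("100ms window:", estimate(100*time.Millisecond))
}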

@rhysh
Contributor Author

rhysh commented Sep 22, 2023

Right, the code is written to not rely on the sleep being exactly 100ms: scheduling delays in the OS could make it longer, async preemption in practice makes it shorter. That's all fine.

I'd like the code within that function to match what we see in practice. If it appears to say "sleep for 100ms" but it usually only sleeps for the scheduling quantum (10ms today), then I'd like either for the code to change so it sleeps for the whole 100ms, or for it to explicitly attempt sleeping for one scheduling quantum.

I'd like for the various profiles to impose a minimal overhead on applications. For short-lived programs this sort of delay can be a significant source of overhead: given a program that takes only 200ms to run, this adds either 50% (as written, +100ms) or 5% (in practice, +10ms with async preemption). So I'm also interested in ways to avoid delaying the app, maybe via capturing the two clocks unconditionally when the runtime boots up and again at 100ms. Callers of tickspersecond might block until then, or they might calculate the value early (but not cache it).
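
A rough, self-contained sketch of that idea follows; everything here is hypothetical (nanotime, cputicks, and the init/timer plumbing are stand-ins for whatever the runtime would actually use):

package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

var progStart = time.Now()

// Stand-ins for the runtime's internal clocks, stubbed so the sketch compiles;
// none of these are real runtime APIs.
func nanotime() int64 { return time.Since(progStart).Nanoseconds() }
func cputicks() int64 { return nanotime() * 3 } // pretend 3 GHz counter

var (
	bootT0, bootC0 int64        // clock samples taken at "runtime start-up"
	ticksPerSec    atomic.Int64 // final value, published after ~100ms
)

func init() {
	bootT0, bootC0 = nanotime(), cputicks()
	// Finish the calibration once a full 100ms window has elapsed.
	time.AfterFunc(100*time.Millisecond, func() {
		d := nanotime() - bootT0
		ticksPerSec.Store((cputicks() - bootC0) * int64(time.Second) / d)
	})
}

func tickspersecond() int64 {
	if r := ticksPerSec.Load(); r != 0 {
		return r // the 100ms-window value is ready
	}
	// Early caller: compute a provisional estimate from the boot samples,
	// but do not cache it; the final value will supersede it.
	d := nanotime() - bootT0
	if d == 0 {
		d = 1
	}
	return (cputicks() - bootC0) * int64(time.Second) / d
}

func main() {
	fmt.Println("early estimate:", tickspersecond())
	time.Sleep(150 * time.Millisecond)
	fmt.Println("final estimate:", tickspersecond())
}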

@cherrymui
Member

Okay, thanks. So the concern is mostly the delay for short-running programs?

Sleeping 10ms is probably fine. It is probably long enough for all CPUs to get a good ratio. Calculating it unconditionally at runtime startup also sounds reasonable to me.

@prattmic
Member

At a high level, I don't like tickspersecond in the first place. As someone who has written a package to track time from the TSC (which involves computing the TSC frequency), I can tell you it certainly isn't as simple as doing division once and assuming that the frequency never changes.

But getting off my high horse, this estimate is probably good enough for our use and I'm not sure we need to overhaul it.

I kind of like starting this computation at startup since it is cheap overall. But where would you suggest putting the check for completion? It bugs me a bit to unconditionally put it somewhere central like the scheduler. Maybe it could actually go in tickspersecond itself? i.e., it would be computed from runtime start until a mutex profile (or other user) is taken?

@rhysh
Contributor Author

rhysh commented Sep 22, 2023

So the concern is mostly the delay for short-running programs?

My practical concerns are the delay for short-running programs, and the delay in start-up time for programs that call runtime.SetBlockProfileRate (even though their later calls to fetch "mutex" and "block" profiles may not be time sensitive).

But where would you suggest putting the check for completion?

The first option is to make the first tickspersecond call do the calculation, whenever that might be. It might be after only 1ms — do we have timer resolution problems on some platforms, with that being less than an observable tick? It might be after several hours — does one of the clocks stop running if the program is on a laptop that went into hibernation?

A second option is for the runtime's boot process to also set a timer, which on expiration (10 or 100ms later) would finish the calculation. Calls to tickspersecond before that expiration could either block until then, or could collect their own view of the timers and return their own (not to be cached) calculation. If they're allowed to return early, then the calls to tickspersecond that have lasting effects (such as calculating the block profile rate) might arrange to receive the final "official" calculated value once the boot timer expires: atomic.Store64(&blockprofilerate, uint64(r)) (as appears in runtime.SetBlockProfileRate) would end up running a second time with an improved estimate for r.

It bugs me a bit to unconditionally put somewhere central like the scheduler.

I get the impression that sysmon isn't very popular. But is sysmon an option for this?

thanm added the NeedsInvestigation label Sep 25, 2023
mknyszek added this to the Backlog milestone Oct 4, 2023
@gopherbot
Contributor

Change https://go.dev/cl/538898 mentions this issue: runtime: improve tickspersecond
