cmd/pprof: macOS 12.6.1 profile overcounts system calls (again) #57722

felixge · 2023-01-10T15:18:14Z

What version of Go are you using (`go version`)?

$ go version
go version go1.19.4 darwin/arm64

This is on macOS 12.6.1 with an M1 chip, but the problem seems to affect Intel as well.

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (`go env`)?

go env Output

$ go env
GO111MODULE="auto"
GOARCH="arm64"
GOBIN=""
GOCACHE="/Users/MyUser/Library/Caches/go-build"
GOENV="/Users/MyUser/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS="-count=1"
GOHOSTARCH="arm64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/MyUser/go/pkg/mod"
GONOPROXY="github.com/MyOrg"
GONOSUMDB="github.com/MyOrg,go.myorg.io"
GOOS="darwin"
GOPATH="/Users/MyUser/go"
GOPRIVATE="github.com/MyOrg"
GOPROXY="binaries.myorg.io,https://proxy.golang.org,direct"
GOROOT="/opt/homebrew/Cellar/go/1.19.4/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/opt/homebrew/Cellar/go/1.19.4/libexec/pkg/tool/darwin_arm64"
GOVCS=""
GOVERSION="go1.19.4"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/MyUser/go/src/github.com/felixge/macos-profiler-bias/go.mod"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/0t/nwrdnfnd2mjdgr0z_p_f6xww0000gq/T/go-build3012663781=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

Create a CPU profile of a CPU-bound workload that is interleaved with short system calls.

For example TestInterleavedIO is a reproducer that is compressing a 250MB file containing random data.

What did you expect to see?

A profile that is similar to the one shown below for linux. Most of time should be spent on gzip compression, and a little bit on syscalls.

What did you see instead?

On macOS the majority of time is attributed to read and write syscalls. The gzip compression is barely showing up at all.

Sanity Check

TestSequentialIO implements the same workload, but instead of interleaved I/O, it does one big read first, followed by gzip compression, followed by a big write.

As expected, Linux produces a profile dominated by gzip compression.

And macOS now shows the right profile too.

Conclusion

macOS setitimer(2) seems to bias towards syscalls. This issue was reported and fixed in the past, see #17406 (2016) and #6047 (2013). So this could be a macOS regression.

I've uploaded all code, pprofs and screenshots to this google drive folder.

The text was updated successfully, but these errors were encountered:

cbedard · 2023-01-10T19:38:03Z

Testing on an Intel MacOS (MacOS 12.6.1)

go version

go version
go version go1.19.4 darwin/amd64

go env

go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/cameron.bedard/Library/Caches/go-build"
GOENV="/Users/cameron.bedard/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/cameron.bedard/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/cameron.bedard/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GOVCS=""
GOVERSION="go1.19.4"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/zs/r6mt7f1n66n9mw2rz8f9w2540000gp/T/go-build1170536968=/tmp/go-build -gno-record-gcc-switches -fno-common"

I'm a bit of a go noob (start during advent of code lol), so I wasn't able to set up the UI you used but my IDE's built in pprof output looks similar to yours. If you want the matching profiler UI and send me a how-to link I can set it up :)

TestInterleavedIO

TestSequentialIO

felixge · 2023-01-10T21:07:01Z

@cbedard thanks, this looks like Intel Macs have the same issue. The UI doesn't matter, this is fine :).

cagedmantis · 2023-01-11T16:32:13Z

@golang/runtime

prattmic · 2023-01-11T18:05:43Z

It would not surprise me if darwin has unfair SIGPROF delivery. e.g., if the system call return path updates the CPU timers, it may be more likely to trigger overflow and deliver SIGPROF to itself vs threads that do not make system calls.

prattmic · 2023-01-11T18:07:32Z

I think this could be tested with a simple C program. Two threads, one running a busy loop, the other with a busy loop that makes a system call. Do they both receive ~50% of SIGPROF signals?

cherrymui · 2023-01-11T18:16:49Z

I haven't read the sample code. Just wonder, for the "interleaved" case, are the compression and I/O running concurrently on different goroutines/threads, or they are on the same thread but just interleaved? If the latter, do you know what frequency or time period the interleaving is, and is it faster than the profile sampling frequency (100 Hz)? Thanks.

felixge · 2023-01-11T21:44:56Z

I haven't read the sample code. Just wonder, for the "interleaved" case, are the compression and I/O running concurrently on different goroutines/threads, or they are on the same thread but just interleaved?

It's the latter. But I suspect that there could be concurrent/parallel read-ahead happening on the OS level.

If the latter, do you know what frequency or time period the interleaving is, and is it faster than the profile sampling frequency (100 Hz)? Thanks.

According to the trace (shown below), the workload issues 53k syscalls in 3.7s, so at ~14kHz it's much faster than the profiling frequency.

It would not surprise me if darwin has unfair SIGPROF delivery. e.g., if the system call return path updates the CPU timers, it may be more likely to trigger overflow and deliver SIGPROF to itself vs threads that do not make system calls.
I think this could be tested with a simple C program. Two threads, one running a busy loop, the other with a busy loop that makes a system call. Do they both receive ~50% of SIGPROF signals?

What syscall would you use? It would have to be a non-blocking one to see 50%? But yeah, a small C reproducer could be useful for this.

prattmic · 2023-01-18T20:17:11Z

I'd start with getpid or something similar, though perhaps it needs to go through the scheduler, so a very briefly blocking system call may be needed.

mknyszek · 2023-01-18T20:17:28Z

In triage now, since it seems like the next step is a C reproducer to confirm unfair SIGPROF delivery? Does that sound right? (The assignee is mostly just for tracking, there's no obligation; feel free to remove it.)

adonovan · 2023-06-30T18:00:14Z

I tried to reproduce the problem in C with this program, but no luck. The compute thread gets around 1200 ticks each of 1ms CPU (which is a huge underestimate since /usr/bin/time reports 6.1s of user CPU), but the main thread gets 0 ticks unless I increase the argument to fib().

Any suggestions?

cherrymui · 2023-12-13T03:35:36Z

I spent some time look into this. I ran the original Go reproducer with both profiling and tracing turned on, and saw that while the syscalls and compression are interleaved, the profiling samples indeed mostly land on syscalls.

This is a (small portion of a) trace with InterleavedIO. The yellow bars on the second row are syscalls, which occupy a small fraction of the time. The pink/magenta bars are profiling samples. They are ~10 ms apart (as expected with 100 Hz profiling rate), but mostly land on syscalls, with many of them land at the end of syscalls, but not always (some land in the middle, and the syscall restarts). So clearly there is a bias.

Interestingly, there seems also some bias with SequentialIO.

This is a (total) trace of SequentialIO. The yellow blocks are syscalls (open, read, write), which takes about ~500 ms in total, or nearly ~10% of the total time. But in the profile there is very few samples on syscalls, much less than 10%. Zooming in the trace for the last syscall

The pink/magenta bars are profiling samples. There is a train of samples before the syscall, and then when the syscall starts, it gets exactly 1 sample. The syscall last much longer, but there is no more sample land on that syscall. So while for the interleaved case syscalls are over-sampled, for long running syscall, it is under-sampled (!).

(To rule out other possibilities, I also turned off async preemption, in case that the preemption signal interrupts syscalls or affects profiling signal delivery, but it doesn't make a difference. To make sure the runtime is doing the right accounting, I printed signal PCs and it is indeed that most of them landed in libc address range.)

C reproducer

Inspired by the interleaved IO and Alan's C program, I wrote this C program https://go.dev/play/p/x3zK__wMKly , which does interleaved (short) read syscall and (long) computation. On my machine it prints

fib pc 0x10213cb10, etext 0x102141000
read 183741000 ns (7.894187%), fib 2142026000 ns (92.029294%)
sys 1970 samples, fib 98 samples, other 0 samples

That is, read is ~8% of total time, fib (computation) is 92% of total time, whereas most of the profiling samples land in syscalls (or at least, libc functions). So there is a strong bias.

I also used the macOS builtin sample program to take a "sample profile", which shows

Sort by top of stack, same collapsed (when >= 5):
        fib  (in a.out)        1636
        read  (in libsystem_kernel.dylib)        149
        _sigtramp  (in libsystem_platform.dylib)        23

Most sample land in fib, as expected. This suggests that the kernel is capable of doing the sampling and accounting correctly, but somehow it doesn't deliver the profiling signals in a way that we expect...

cherrymui · 2023-12-13T14:06:31Z

For the long syscall, if it is mostly IO and the kernel counts only CPU time for profiling timers, the under-sampling may be okay.

For the short ones, maybe the kernel is more likely to update its accounting at syscalls and only then find that it should trigger a profiling timer and deliver the signal...

FiloSottile · 2024-01-20T14:33:34Z

Not sure a +1 is useful here, but this is still true in macOS 14.2.1 (23C71) and very visible in anything that calls crypto/rand.Read, making it appear as if all the time was spent on unix.GetEntropy.

felixge · 2024-03-12T07:14:58Z

also used the macOS builtin sample program to take a "sample profile", which shows [...]
This suggests that the kernel is capable of doing the sampling and accounting correctly, but somehow it doesn't deliver the profiling signals in a way that we expect...

I recently looked into how the sample program (and it's big brother spindump) works, and as far as I can tell, it's a wall clock profiler. I.e. it takes stack traces of all threads on a fixed interval, and there is no attempt to account for CPU time. However, it does record the thread state (On-CPU vs Off-CPU) when it takes samples, so filtering the resulting profile by samples that were On-CPU should yield a CPU profile.

gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jan 10, 2023

felixge changed the title ~~cmd/pprof: macOS 12.6.1 (M1) profile overcounts system calls (again)~~ cmd/pprof: macOS 12.6.1 profile overcounts system calls (again) Jan 10, 2023

cagedmantis added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jan 11, 2023

prattmic added this to Go Compiler / Runtime Jan 11, 2023

mknyszek added the OS-Darwin label Jan 18, 2023

mknyszek assigned felixge Jan 18, 2023

mknyszek added this to the Go1.21 milestone Jan 18, 2023

mknyszek moved this to In Progress in Go Compiler / Runtime Jan 18, 2023

mknyszek mentioned this issue Feb 16, 2023

runtime: diagnostics improvements tracking issue #57175

Open

19 tasks

adonovan mentioned this issue Mar 30, 2023

runtime/pprof: CPU profiles incorrect for kernels with broken setitimer support #13841

Open

prattmic modified the milestones: Go1.21, Go1.22 Jun 5, 2023

gopherbot modified the milestones: Go1.22, Go1.23 Feb 6, 2024

rhysh mentioned this issue Jul 16, 2024

runtime/pprof: GOOS=darwin over-reports CPU usage in mutex-heavy workload #68477

Open

gopherbot removed this from the Go1.23 milestone Aug 13, 2024

gopherbot added this to the Go1.24 milestone Aug 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmd/pprof: macOS 12.6.1 profile overcounts system calls (again) #57722

cmd/pprof: macOS 12.6.1 profile overcounts system calls (again) #57722

felixge commented Jan 10, 2023 •

edited

Loading

cbedard commented Jan 10, 2023 •

edited

Loading

felixge commented Jan 10, 2023

cagedmantis commented Jan 11, 2023

prattmic commented Jan 11, 2023 •

edited

Loading

prattmic commented Jan 11, 2023

cherrymui commented Jan 11, 2023

felixge commented Jan 11, 2023

prattmic commented Jan 18, 2023 •

edited

Loading

mknyszek commented Jan 18, 2023

adonovan commented Jun 30, 2023

cherrymui commented Dec 13, 2023 •

edited

Loading

cherrymui commented Dec 13, 2023

FiloSottile commented Jan 20, 2024

felixge commented Mar 12, 2024

cmd/pprof: macOS 12.6.1 profile overcounts system calls (again) #57722

cmd/pprof: macOS 12.6.1 profile overcounts system calls (again) #57722

Comments

felixge commented Jan 10, 2023 • edited Loading

What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (go env)?

What did you do?

What did you expect to see?

What did you see instead?

Sanity Check

Conclusion

cbedard commented Jan 10, 2023 • edited Loading

felixge commented Jan 10, 2023

cagedmantis commented Jan 11, 2023

prattmic commented Jan 11, 2023 • edited Loading

prattmic commented Jan 11, 2023

cherrymui commented Jan 11, 2023

felixge commented Jan 11, 2023

prattmic commented Jan 18, 2023 • edited Loading

mknyszek commented Jan 18, 2023

adonovan commented Jun 30, 2023

cherrymui commented Dec 13, 2023 • edited Loading

C reproducer

cherrymui commented Dec 13, 2023

FiloSottile commented Jan 20, 2024

felixge commented Mar 12, 2024

felixge commented Jan 10, 2023 •

edited

Loading

What version of Go are you using (`go version`)?

What operating system and processor architecture are you using (`go env`)?

cbedard commented Jan 10, 2023 •

edited

Loading

prattmic commented Jan 11, 2023 •

edited

Loading

prattmic commented Jan 18, 2023 •

edited

Loading

cherrymui commented Dec 13, 2023 •

edited

Loading