
proposal: runtime/pprof: add PMU-based profiles #36821

Open

chabbimilind opened this issue Jan 28, 2020 · 78 comments

@chabbimilind

chabbimilind commented Jan 28, 2020

What version of Go are you using (go version)?

$ go version
go version go1.13.4 linux/amd64

Does this issue reproduce with the latest release?

Yes; tested on go version devel +74d366f484 Mon Jan 27 20:45:39 2020 +0000 linux/amd64

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/xxx/.cache/go-build"
GOENV="/home/xxx/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/xxx/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/xxx/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/xxx/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build696469159=/tmp/go-build -gno-record-gcc-switches"

What did you do?

The following experiments demonstrate that pprof CPU profiles lack accuracy (closeness to the ground truth) and precision (repeatability across runs). The tests are highly predictable in nature and involve little runtime overhead (allocation, GC, system calls, etc.). They are carefully designed so that we can compare pprof CPU profiles against our expectations, and the evaluation shows a wide gap from those expectations. One should not confuse this with a runtime issue; the issue is with the OS timers used for sampling, which are coarse-grained and have a high skid. In a proposal/design document that will follow, I will propose a design to extend CPU profiling by sampling the CPU's Performance Monitoring Unit (PMU), aka hardware performance counters. PMU event-based sampling is a mature technology available on almost all modern CPUs. The PMU as a sampling agent offers many benefits (in no particular priority order):

  • Enhance sampling granularity from 100s of milliseconds to 100s of microseconds.
  • Avoid the precision and accuracy problems of OS timers.
  • Enable collecting rich hardware profiles such as cache misses, branch misses, and inter-socket cacheline transfers, to name a few, which help root-cause performance problems more easily.

There are two test cases in this issue: goroutine.go and serial.go.

test 1 (parallel)

Download the first test case, the goroutine.go program, from https://github.com/chabbimilind/GoPprofDemo/blob/master/goroutine.go
It has ten identical goroutines, f1-f10, and uses pprof to collect the CPU profiles.
Run the following command several times and notice that each time pprof reports a different amount of time spent in each of the ten routines.

go run goroutine.go && go tool pprof -top goroutine_prof
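
For reference, goroutine.go is roughly of the following shape; this is a minimal sketch rather than the exact program (the loop body, the iteration count, and the use of goroutine_prof as the output file name are assumptions chosen to match the command above):

package main

import (
	"os"
	"runtime/pprof"
	"sync"
)

// f1 is a CPU-bound loop; f2 through f10 are assumed to be exact copies of it.
//go:noinline
func f1(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i
	}
	return sum
}

func main() {
	// Write the CPU profile to goroutine_prof so that
	// `go tool pprof -top goroutine_prof` works as shown above.
	out, err := os.Create("goroutine_prof")
	if err != nil {
		panic(err)
	}
	pprof.StartCPUProfile(out)
	defer pprof.StopCPUProfile()

	var wg sync.WaitGroup
	for _, f := range []func(int) int{f1 /*, f2, ..., f10 */} {
		wg.Add(1)
		go func(fn func(int) int) {
			defer wg.Done()
			fn(2_000_000_000) // every goroutine does exactly the same amount of work
		}(f)
	}
	wg.Wait()
}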

test 2 (serial)

Download the second test case, the serial.go program, from https://github.com/chabbimilind/GoPprofDemo/blob/master/serial.go:
It has ten functions (A_expect_1_82 - J_expect_18_18). The function A_expect_1_82 is expected to consume 1.82% of the total execution time, J_expect_18_18 is expected to consume 18.18%, and so on. The code is serial, and there is complete data dependence between the functions and between iterations of the loop inside each function, to defeat any hardware-level optimizations.
Run the following command several times.
go run serial.go && go tool pprof -top serial_prof
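
The essence of serial.go is that loop-carried data dependence; a minimal sketch of the pattern (not the exact program; the loop body and the meaning of -m are assumptions) looks like this:

package main

import "fmt"

// Each iteration depends on the previous one, and each function's result
// feeds the next call, so the hardware cannot overlap or elide the work.
//go:noinline
func A_expect_1_82(seed, iters int) int {
	v := seed
	for i := 0; i < iters; i++ {
		v = v*31 + 1 // loop-carried dependence
	}
	return v
}

func main() {
	m := 100_000_000            // assumed; the real program takes this via -m=<int>
	v := A_expect_1_82(1, 1*m) // 1 unit of work -> expected 1/55 ≈ 1.82% of total time
	// B_expect_3_64(v, 2*m), C_expect_5_46(v, 3*m), ..., J_expect_18_18(v, 10*m)
	// follow the same shape with proportionally larger iteration counts,
	// giving expected shares of 2/55 ≈ 3.64%, 3/55 ≈ 5.46%, ..., 10/55 = 18.18%.
	fmt.Println(v)
}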

What did you expect to see?

For goroutine.go, each function (f1-f10) should be attributed exactly (or almost exactly) 10% of the execution time on each run.

For serial.go the time attribution should roughly follow the distribution shown below in each run.

FUNCTION NAME			EXPECTED RELATIVE TIME

A_expect_1_82			1.82%
B_expect_3_64			3.64%
C_expect_5_46			5.46%
D_expect_7_27			7.27%
E_expect_9_09			9.09%
F_expect_10_91			10.91%
G_expect_12_73			12.73%
H_expect_14_546			14.546%
I_expect_16_36			16.36%
J_expect_18_18			18.18%

What did you see instead?

test 1 (parallel)

Runs 1, 2, and 3 below show the pprof -top output of goroutine.go.
You will notice that within a single run (say Run 1), f1-f10 have a wide variance in the time attributed to them; the expectation is that each of them gets 10% of the execution time. There is up to a 6x difference between the function with the highest attribution (main.f7, 4210ms, in Run 1) and the function with the lowest attribution (main.f9, 700ms, in Run 1). This shows the poor accuracy (deviation from the ground truth) of pprof timer-based profiles.

Furthermore, the time attributed to a function varies widely from run to run. Notice how the top-10 ordering changes. In Run 1, main.f7 is shown to run for 4210ms, whereas in Run 2 it is shown to run for only 520ms. The expectation is that the measurements remain the same from run to run. This shows the poor precision (unrepeatability of measurement) of pprof timer-based profiles.

goroutine.go/Run 1:

File: goroutine
Type: cpu
Time: Jan 27, 2020 at 3:45pm (PST)
Duration: 6.70s, Total samples = 18060ms (269.37%)
Showing nodes accounting for 18060ms, 100% of 18060ms total
      flat  flat%   sum%        cum   cum%
    4210ms 23.31% 23.31%     4210ms 23.31%  main.f7
    2610ms 14.45% 37.76%     2610ms 14.45%  main.f2
    2010ms 11.13% 48.89%     2010ms 11.13%  main.f6
    1810ms 10.02% 58.91%     1810ms 10.02%  main.f10
    1780ms  9.86% 68.77%     1780ms  9.86%  main.f3
    1410ms  7.81% 76.58%     1410ms  7.81%  main.f1
    1310ms  7.25% 83.83%     1310ms  7.25%  main.f4
    1110ms  6.15% 89.98%     1110ms  6.15%  main.f5
    1110ms  6.15% 96.12%     1110ms  6.15%  main.f8
     700ms  3.88%   100%      700ms  3.88%  main.f9

goroutine.go/Run 2:

File: goroutine
Type: cpu
Time: Jan 27, 2020 at 3:45pm (PST)
Duration: 6.71s, Total samples = 17400ms (259.39%)
Showing nodes accounting for 17400ms, 100% of 17400ms total
      flat  flat%   sum%        cum   cum%
    3250ms 18.68% 18.68%     3250ms 18.68%  main.f2
    2180ms 12.53% 31.21%     2180ms 12.53%  main.f9
    2100ms 12.07% 43.28%     2100ms 12.07%  main.f1
    1770ms 10.17% 53.45%     1770ms 10.17%  main.f6
    1700ms  9.77% 63.22%     1700ms  9.77%  main.f5
    1550ms  8.91% 72.13%     1550ms  8.91%  main.f4
    1500ms  8.62% 80.75%     1500ms  8.62%  main.f8
    1440ms  8.28% 89.02%     1440ms  8.28%  main.f3
    1390ms  7.99% 97.01%     1390ms  7.99%  main.f10
     520ms  2.99%   100%      520ms  2.99%  main.f7

goroutine.go/Run 3:

File: goroutine
Type: cpu
Time: Jan 27, 2020 at 3:48pm (PST)
Duration: 6.71s, Total samples = 17.73s (264.31%)
Showing nodes accounting for 17.73s, 100% of 17.73s total
      flat  flat%   sum%        cum   cum%
     3.74s 21.09% 21.09%      3.74s 21.09%  main.f7
     2.08s 11.73% 32.83%      2.08s 11.73%  main.f9
     2.05s 11.56% 44.39%      2.05s 11.56%  main.f2
     1.85s 10.43% 54.82%      1.85s 10.43%  main.f10
     1.78s 10.04% 64.86%      1.78s 10.04%  main.f1
     1.43s  8.07% 72.93%      1.43s  8.07%  main.f3
     1.42s  8.01% 80.94%      1.42s  8.01%  main.f8
     1.18s  6.66% 87.59%      1.18s  6.66%  main.f6
     1.17s  6.60% 94.19%      1.17s  6.60%  main.f5
     1.03s  5.81%   100%      1.03s  5.81%  main.f4

test 2 (serial)

The output for go run serial.go && go tool pprof -top serial_prof for three runs is shown below.
Comparing the flat% (or cum%) against the expected percentage for each function shows a large difference. For example, main.H_expect_14_546, which is expected to consume 14.546% of the execution time, is attributed 25% of the execution time in Run 1. Furthermore, there is a lack of precision from run to run; for example, main.I_expect_16_36 is attributed 6.25% (20ms) of the execution time in Run 1, whereas it is attributed 21.88% (70ms) in Run 2.

serial.go/Run 1:

File: serial
Type: cpu
Time: Jan 27, 2020 at 1:42pm (PST)
Duration: 501.51ms, Total samples = 320ms (63.81%)
Showing nodes accounting for 320ms, 100% of 320ms total
      flat  flat%   sum%        cum   cum%
      80ms 25.00% 25.00%       80ms 25.00%  main.H_expect_14_546
      80ms 25.00% 50.00%       80ms 25.00%  main.J_expect_18_18
      60ms 18.75% 68.75%       60ms 18.75%  main.G_expect_12_73
      20ms  6.25% 75.00%       20ms  6.25%  main.B_expect_3_64
      20ms  6.25% 81.25%       20ms  6.25%  main.D_expect_7_27
      20ms  6.25% 87.50%       20ms  6.25%  main.F_expect_10_91
      20ms  6.25% 93.75%       20ms  6.25%  main.I_expect_16_36
      10ms  3.12% 96.88%       10ms  3.12%  main.A_expect_1_82
      10ms  3.12%   100%       10ms  3.12%  main.C_expect_5_46
         0     0%   100%      320ms   100%  main.main
         0     0%   100%      320ms   100%  runtime.main

serial.go/Run 2:

File: serial
Type: cpu
Time: Jan 27, 2020 at 1:44pm (PST)
Duration: 501.31ms, Total samples = 320ms (63.83%)
Showing nodes accounting for 320ms, 100% of 320ms total
      flat  flat%   sum%        cum   cum%
      70ms 21.88% 21.88%       70ms 21.88%  main.I_expect_16_36
      50ms 15.62% 37.50%       50ms 15.62%  main.J_expect_18_18
      40ms 12.50% 50.00%       40ms 12.50%  main.E_expect_9_09
      40ms 12.50% 62.50%       40ms 12.50%  main.F_expect_10_91
      40ms 12.50% 75.00%       40ms 12.50%  main.H_expect_14_546
      30ms  9.38% 84.38%       30ms  9.38%  main.D_expect_7_27
      20ms  6.25% 90.62%       20ms  6.25%  main.B_expect_3_64
      20ms  6.25% 96.88%       20ms  6.25%  main.G_expect_12_73
      10ms  3.12%   100%       10ms  3.12%  main.C_expect_5_46
         0     0%   100%      320ms   100%  main.main
         0     0%   100%      320ms   100%  runtime.main

serial.go/Run 3:

File: serial
Type: cpu
Time: Jan 27, 2020 at 1:45pm (PST)
Duration: 501.39ms, Total samples = 310ms (61.83%)
Showing nodes accounting for 310ms, 100% of 310ms total
      flat  flat%   sum%        cum   cum%
     110ms 35.48% 35.48%      110ms 35.48%  main.J_expect_18_18
      70ms 22.58% 58.06%       70ms 22.58%  main.G_expect_12_73
      60ms 19.35% 77.42%       60ms 19.35%  main.F_expect_10_91
      30ms  9.68% 87.10%       30ms  9.68%  main.I_expect_16_36
      20ms  6.45% 93.55%       20ms  6.45%  main.H_expect_14_546
      10ms  3.23% 96.77%       10ms  3.23%  main.B_expect_3_64
      10ms  3.23%   100%       10ms  3.23%  main.C_expect_5_46
         0     0%   100%      310ms   100%  main.main
         0     0%   100%      310ms   100%  runtime.main

Improved results with PMU-based profiling.

In a prototype PMU-based profiling implementation, below are the pprof profiles of the CPU cycles hardware performance counter for the same goroutine.go program. Notice that each function gets the same (or almost the same) CPU cycle attribution within a single run and across runs.

goroutine.go/Run 1:

File: goroutine
Type: cycles
Time: Jan 27, 2020 at 4:49pm (PST)
Showing nodes accounting for 234000000000, 100% of 234000000000 total
      flat  flat%   sum%        cum   cum%
23400000000 10.00% 10.00% 23400000000 10.00%  main.f1
23400000000 10.00% 20.00% 23400000000 10.00%  main.f10
23400000000 10.00% 30.00% 23400000000 10.00%  main.f2
23400000000 10.00% 40.00% 23400000000 10.00%  main.f3
23400000000 10.00% 50.00% 23400000000 10.00%  main.f4
23400000000 10.00% 60.00% 23400000000 10.00%  main.f5
23400000000 10.00% 70.00% 23400000000 10.00%  main.f6
23400000000 10.00% 80.00% 23400000000 10.00%  main.f7
23400000000 10.00% 90.00% 23400000000 10.00%  main.f8
23400000000 10.00%   100% 23400000000 10.00%  main.f9

goroutine.go/Run 2:

File: goroutine
Type: cycles
Time: Jan 27, 2020 at 4:51pm (PST)
Showing nodes accounting for 234000000000, 100% of 234000000000 total
      flat  flat%   sum%        cum   cum%
23800000000 10.17% 10.17% 23800000000 10.17%  main.f1
23500000000 10.04% 20.21% 23500000000 10.04%  main.f7
23500000000 10.04% 30.26% 23500000000 10.04%  main.f9
23400000000 10.00% 40.26% 23400000000 10.00%  main.f10
23400000000 10.00% 50.26% 23400000000 10.00%  main.f2
23400000000 10.00% 60.26% 23400000000 10.00%  main.f4
23400000000 10.00% 70.26% 23400000000 10.00%  main.f6
23400000000 10.00% 80.26% 23400000000 10.00%  main.f8
23300000000  9.96% 90.21% 23300000000  9.96%  main.f3
22900000000  9.79%   100% 22900000000  9.79%  main.f5

Below are the pprof profiles of the CPU cycles hardware performance counter for the same serial.go program. Notice that each function gets close to its expected CPU cycle attribution within a single run and across runs.

serial.go/Run 1:

File: serial
Type: cycles
Time: Jan 27, 2020 at 4:54pm (PST)
Showing nodes accounting for 1105000000, 100% of 1105000000 total
      flat  flat%   sum%        cum   cum%
 200000000 18.10% 18.10%  200000000 18.10%  main.J_expect_18_18
 183000000 16.56% 34.66%  183000000 16.56%  main.I_expect_16_36
 165000000 14.93% 49.59%  165000000 14.93%  main.H_expect_14_546
 137000000 12.40% 61.99%  137000000 12.40%  main.G_expect_12_73
 120000000 10.86% 72.85%  120000000 10.86%  main.F_expect_10_91
 100000000  9.05% 81.90%  100000000  9.05%  main.E_expect_9_09
  82000000  7.42% 89.32%   82000000  7.42%  main.D_expect_7_27
  63000000  5.70% 95.02%   63000000  5.70%  main.C_expect_5_46
  37000000  3.35% 98.37%   37000000  3.35%  main.B_expect_3_64
  18000000  1.63%   100%   18000000  1.63%  main.A_expect_1_82
         0     0%   100% 1105000000   100%  main.main
         0     0%   100% 1105000000   100%  runtime.main

serial.go/Run 2:

File: serial
Type: cycles
Time: Jan 27, 2020 at 4:54pm (PST)
Showing nodes accounting for 1105000000, 100% of 1105000000 total
      flat  flat%   sum%        cum   cum%
 200000000 18.10% 18.10%  200000000 18.10%  main.J_expect_18_18
 183000000 16.56% 34.66%  183000000 16.56%  main.I_expect_16_36
 159000000 14.39% 49.05%  159000000 14.39%  main.H_expect_14_546
 142000000 12.85% 61.90%  142000000 12.85%  main.G_expect_12_73
 119000000 10.77% 72.67%  119000000 10.77%  main.F_expect_10_91
 100000000  9.05% 81.72%  100000000  9.05%  main.E_expect_9_09
  82000000  7.42% 89.14%   82000000  7.42%  main.D_expect_7_27
  61000000  5.52% 94.66%   61000000  5.52%  main.C_expect_5_46
  40000000  3.62% 98.28%   40000000  3.62%  main.B_expect_3_64
  18000000  1.63% 99.91%   18000000  1.63%  main.A_expect_1_82
   1000000  0.09%   100% 1105000000   100%  main.main
         0     0%   100% 1105000000   100%  runtime.main

Dependence on the number of cores and length of test execution:

The results of the goroutine.go test depend on the number of CPU cores available. On a multi-core CPU, if you set GOMAXPROCS=1, goroutine.go will not show a huge variation, since each goroutine runs for several seconds. However, if you set GOMAXPROCS to a larger value, say 4, you will notice a significant measurement-attribution problem. One reason for this is that itimer samples on Linux are not guaranteed to be delivered to the thread whose timer expired.

The results of serial.go can change based on the execution time of each function. By passing the -m=<int> flag to the program, you can make it run for longer or shorter. Making it run longer makes the profiles more accurate, but when a function runs for less than 100ms, the accuracy is often low.

@chabbimilind chabbimilind changed the title Runtime/pprof OS-timer profiles lack precision and accuracy. Need PMU-based profiles. [Proposal] runtime/pprof cpuprofiles lack precision and accuracy; need PMU-based profiles. Jan 28, 2020
@ALTree ALTree changed the title [Proposal] runtime/pprof cpuprofiles lack precision and accuracy; need PMU-based profiles. Proposal: runtime/pprof cpuprofiles lack precision and accuracy; need PMU-based profiles Jan 29, 2020
@gopherbot gopherbot added this to the Proposal milestone Jan 29, 2020
@ALTree ALTree changed the title Proposal: runtime/pprof cpuprofiles lack precision and accuracy; need PMU-based profiles proposal: runtime/pprof cpuprofiles lack precision and accuracy; need PMU-based profiles Jan 29, 2020
@hyangah
Contributor

hyangah commented Jan 30, 2020

Will it be available on non-Linux OSes? If so, that's great. Previously, we've suggested Linux perf (which can use PMU events, right?) as an alternative to runtime/pprof to work around the known limitations of runtime/pprof. The pprof tool can interpret Linux perf output. If this proposal makes it possible to help non-Linux users and to allow extra information such as labels, etc., while not introducing too much complexity to the runtime, that will be super.

@aclements @rsc @dvyukov @rhysh

@chabbimilind
Author

Hi Hyang-Ah Hana Kim,
The design doc, which I will submit very soon, proposes a solution for Linux-based systems. If you have ideas on the Windows equivalent of perf_event_open, I will include the Windows extensions in the design doc.
An important use case we (at Uber) have in mind is fetching PMU-based pprof profiles over http, which is not possible via perf. Furthermore, profiling as a first party (the process samples itself, e.g., pprof) vs. as a second party (a different process samples a running service, e.g., perf) requires different security settings. Sampling as a first party has the upper hand.
Hence, enabling perf_events as a sampling engine inside the Go runtime is what I am proposing.
More design details will be posted soon.

-Milind

@josharian
Contributor

Related: #21295

@networkimprov

The pprof docs don't indicate that Linux perf delivers more accurate reports in general; is that the case? From https://golang.org/doc/diagnostics.html#profiling:

What other profilers can I use to profile Go programs?

On Linux, perf tools can be used for profiling Go programs. Perf can profile and unwind cgo/SWIG code and the kernel, so it can be useful to get insights into native/kernel performance bottlenecks. On macOS, the Instruments suite can be used to profile Go programs.

@chabbimilind
Author

chabbimilind commented Jan 30, 2020

Hi Liam,
perf is a command-line wrapper around the underlying perf_event_open system call, which provides access to the CPU's PMUs. It is neither correct nor incorrect to say perf delivers more accurate reports; PMUs can be configured in a much better way to yield superior profiles. As I have shown above, OS timer interrupts have a low resolution, they are not guaranteed to fire at the same fixed interval (just sometime after the specified time), and the timer interrupts can be delivered to any thread, regardless of the one whose timer expired.

@chabbimilind
Author

Related: #21295

Super useful; it makes this proposal even more desirable.

@networkimprov

What's required to make Linux perf deliver the superior accuracy you propose? Is it hard to configure? Apologies if that's an FAQ...

@chabbimilind
Author

chabbimilind commented Jan 30, 2020

Hi Liam,
Perf can be configured with a higher resolution to provide richer profiles. This proposal differs from simply using perf in the following three aspects:

  1. It will fix the current measurement problems of pprof by making it more accurate and precise.
  2. It will not require using a separate tool like perf.
  3. It will allow first-party profiling (the process profiling itself rather than a separate agent like perf attaching to a process), which allows profile collection over an http port. This latter aspect of profile collection over http/pprof is particularly important for our production services, where access to run the perf tool is restricted.
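
For context on point 3, this is how a service already exposes the OS-timer-based CPU profile first-party over HTTP with the standard net/http/pprof package; the hope is that a PMU-backed profile could be fetched the same way (the port below is arbitrary):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// The process profiles itself; a client fetches a 30-second CPU profile with:
	//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}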

@gopherbot
Contributor

Change https://golang.org/cl/219508 mentions this issue: proposal: hardware performance counters for CPU profiling.

@rsc
Contributor

rsc commented Mar 4, 2020

/cc @aclements

@rsc rsc changed the title proposal: runtime/pprof cpuprofiles lack precision and accuracy; need PMU-based profiles proposal: runtime/pprof: add PMU-based profiles Mar 4, 2020
@rsc
Contributor

rsc commented Mar 11, 2020

Thanks for the proposal doc (https://go.googlesource.com/proposal/+/refs/changes/08/219508/2/design/36821-perf-counter-pprof.md).

Using hardware counters to do a better job than setitimer seems fairly unobjectionable.

No new API would be an implementation detail and fairly easy to accept (assuming the implementation were not too large).

The hard part is when the new API starts getting added. We don't want to overfit to Linux, nor to x86. How much new API is really needed here? Can the proposal be useful with zero new API?

@chabbimilind
Author

chabbimilind commented Mar 11, 2020

Thanks Russ for your comments.
You can get a general sense of the amount of change by looking at this prototype code: https://github.com/uber-dev/go/commits/pmu_pprof
The changes are non-trivial, about 4000 lines of code. Most of it comes from defining constants for each CPU architecture and some assembly code for each architecture.

The proposal suggests introducing a handful of new APIs in the runtime/pprof package, the most prominent being StartCPUProfileWithConfig. I imagine your "zero new API" is not referring to this one.
If it is, please let me know if you have an idea of how this could be molded into zero new API.

Right now, we add a few PMU configurations that surface as public APIs, e.g., func CPUCycles(io.Writer, uint64) ProfilingOption and func CPUBranchInstructions(io.Writer, uint64) ProfilingOption, which can keep growing based on the demand.

If we don't want to keep this set growing, I have a solution that will introduce only one API that takes a string argument: func ConfigString(io.Writer, string) ProfilingOption. Here the string can be similar, in spirit, to the format used by Linux perf. For example, cycles:u@1000000 means profile CPU cycles in u=user space at a sampling interval of 1M. Another example is cache-misses:ukppp@10000, which means: use the cache-misses event to profile, in both u=user and k=kernel space, use the skid-free precision=3 (three p's) provided by the underlying platform (if any), and sample once every 10000 cache misses. Yet another example is r0423@1000: sample the r=raw event with mask=0x04 and event id=0x23 at a sampling interval of 1000. Let me know your thoughts on going with this one API instead of several.
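
To make the shape of that single API concrete, here is a hypothetical usage sketch; ConfigString and StartCPUProfileWithConfig are names proposed in this thread, not existing runtime/pprof API, and the event strings follow the format suggested above:

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("cycles_prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Hypothetical API; "cycles:u@1000000" = user-space CPU cycles,
	// one sample every 1M events.
	opt := pprof.ConfigString(f, "cycles:u@1000000")
	if err := pprof.StartCPUProfileWithConfig(opt); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	// Other strings in the proposed format:
	//   "cache-misses:ukppp@10000" - user+kernel cache misses, max precision, every 10k misses
	//   "r0423@1000"               - raw event with mask=0x04, event id=0x23, every 1000 events

	// ... workload to be profiled ...
}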

Let me know which parts of the proposal seem overfitted to Linux or x86. I don't think there is x86 overfitting, because any architecture on which Linux runs will implement the perf_event_open system call.
Luckily, implementations specific to Linux are in the Linux-specific files. At the time of this proposal, we do not have another OS from which we can borrow a PMU sampling implementation that would allow us to come up with a more canonical form and avoid any overfitting. Again, luckily, almost all of this is runtime-internal and can be modified if the need arises to incorporate another OS.

@songlh

songlh commented Mar 12, 2020

I have a very quick question. Are the performance numbers from the PMU mapped to goroutines? Based on my understanding, numbers from the PMU are per thread, while how goroutines are mapped to threads is managed by the runtime.

@chabbimilind
Author

PMUs are configured per OS thread. PMUs by themselves are unaware of goroutines. Just like an OS timer sample, the signal handler attributes the sample to the goroutine currently running on the OS thread.
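
To illustrate what "configured per OS thread" means, here is a minimal userspace sketch that binds a cycles counter to the calling thread using golang.org/x/sys/unix; it illustrates the mechanism only and is not the runtime implementation proposed here (the iteration count is arbitrary, and a little-endian host is assumed when decoding the counter value):

package main

import (
	"encoding/binary"
	"fmt"
	"os"
	"runtime"
	"unsafe"

	"golang.org/x/sys/unix"
)

func main() {
	// A PMU counter is bound to an OS thread, so pin this goroutine to its thread.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	attr := unix.PerfEventAttr{
		Type:   unix.PERF_TYPE_HARDWARE,
		Config: unix.PERF_COUNT_HW_CPU_CYCLES,
		Size:   uint32(unsafe.Sizeof(unix.PerfEventAttr{})),
		Bits:   unix.PerfBitDisabled | unix.PerfBitExcludeKernel | unix.PerfBitExcludeHv,
	}
	// pid=0, cpu=-1: count for the calling thread only, on whichever CPU it runs.
	fd, err := unix.PerfEventOpen(&attr, 0, -1, -1, unix.PERF_FLAG_FD_CLOEXEC)
	if err != nil {
		fmt.Fprintln(os.Stderr, "perf_event_open:", err)
		return
	}
	defer unix.Close(fd)

	unix.IoctlSetInt(fd, unix.PERF_EVENT_IOC_RESET, 0)
	unix.IoctlSetInt(fd, unix.PERF_EVENT_IOC_ENABLE, 0)

	sum := 0
	for i := 0; i < 100_000_000; i++ { // burn cycles on this thread
		sum += i
	}

	unix.IoctlSetInt(fd, unix.PERF_EVENT_IOC_DISABLE, 0)
	var buf [8]byte
	unix.Read(fd, buf[:]) // the counter value is a raw u64 in host byte order
	fmt.Printf("cycles on this thread: %d (sum=%d)\n", binary.LittleEndian.Uint64(buf[:]), sum)

	// A sampling profiler would instead set a sample period in attr and ask for a
	// signal on each counter overflow; the signal handler then attributes the
	// sample to whichever goroutine is running on this thread, as described above.
}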

@songlh

songlh commented Mar 12, 2020

@chabbimilind Thanks a lot for the reply! I feel the tool would be very helpful in performance optimization and performance failure diagnosis, since both of them rely on precise time measurement.

@ianlancetaylor
Member

@chabbimilind To a first approximation, no new API means no new API.

Is this proposal useful with no new API at all?

@chabbimilind
Author

@ianlancetaylor, the proposal will not work without a new API. That aspect is quite clearly stated in the proposal.

@egonelbre
Contributor

The closest thing I was able to find for Windows was https://docs.microsoft.com/en-us/windows/win32/api/_hcp/. But it requires a driver for configuring things. Windows Insider builds also have DTrace.

For OS X, Intel has special drivers https://www.campos.cc/post/using-intel-performance-counter-monitor-on-mac-os-x/.

Both platforms have separate programs to measure counters (xperf and Instruments). But there doesn't seem to be a nice way to invoke them programmatically.

I guess the canonical API is going to be processors themselves.

@chabbimilind
Author

@egonelbre, Windows and macOS, AFAIK, do not expose system calls to access PMUs in a clean way like Linux does. Depending on a library can be fragile because the libraries may not even exist, and there is also the question of whether we would ever make such cgo calls from inside the Go runtime.

@chabbimilind
Author

@aclements @dvyukov @rhysh, can you please complete your review of this proposal?

@rsc
Contributor

rsc commented Mar 25, 2020

@chabbimilind, I really did mean zero new API, which would mean making the current CPU-based profiles more accurate but not enable profiling other events like mispredicted branches and so on. Once we add those I fear the API just gets arbitrarily large, since there are so many different counters. And they vary by architecture.

We will take some time to read your doc, but we're not working as fast as normal these days. Thanks for your patience.

@chabbimilind
Author

@rsc, the concern around API growth is understandable. Hence, earlier in this thread, I proposed an alternative API that does not grow. It simply takes a string input, and the string parsing remains internal to the implementation of the single exposed API. This will allow all kinds of counter names / values / configs to be passed in as strings without needing any new APIs. Please do consider that alternative.

I cannot see how a zero-API change is possible. We need to keep the OS interval timer as the default sampling engine so that profiling works across all platforms in all environments. The hardware performance counter-based sampling engine has to be enabled by exposing at least one API.

@chabbimilind
Author

Correct. We can use kernel-provided stacks if needed. For example, in another profiler I worked on, we stitched the kernel call stacks from the perf-provided ring buffer but used the user-land frames from our own unwinding. I think there is quite a lot of flexibility (and decisions to be taken) on these details.

irfansharif added a commit to irfansharif/runner that referenced this issue May 12, 2022
Evaluate measured cputime using the tests from
github.com/golang/go/issues/36821, which demonstrate the inaccuracy and
imprecision of go's 100Hz cpu profiler. The results here suggest that
measured cputime is both accurate and precise with regards to computing
on-CPU time.

    === RUN   TestEquivalentGoroutines
        0's got  9.98% of total time
        1's got  9.53% of total time
        2's got  9.22% of total time
        3's got 10.42% of total time
        4's got  9.84% of total time
        5's got 10.43% of total time
        6's got 10.50% of total time
        7's got 10.21% of total time
        8's got 10.03% of total time
        9's got  9.86% of total time

    === RUN   TestProportionalGoroutines
        0's got  1.87% of total time (1.000000x)
        1's got  3.60% of total time (1.931999x)
        2's got  5.41% of total time (2.899312x)
        3's got  7.21% of total time (3.864451x)
        4's got  9.11% of total time (4.880925x)
        5's got 10.94% of total time (5.864723x)
        6's got 12.77% of total time (6.842004x)
        7's got 14.34% of total time (7.685840x)
        8's got 16.58% of total time (8.885060x)
        9's got 18.18% of total time (9.741030x)
irfansharif added a commit to irfansharif/runner that referenced this issue May 13, 2022
@rhysh
Contributor

rhysh commented May 13, 2022

As of Go 1.18, the built-in profiler uses per-thread timers (via timer_create) on Linux.

@chabbimilind , I ran your initial test code with go1.18.2 and saw that the results have improved from what you reported in early 2020. On a machine running Linux with 16 Intel threads (8 cores), I saw each of the 10 functions in the parallel test case reported as using 9–11% of the CPU time, across about 6900 samples. It is no longer limited to reporting ~250 samples per second, and now reports close to the 1000 per second (an average of 917 for the test run below) that I'd expect from 10 running goroutines. The serial test is short enough that each run collects only 30 samples, but when I extend it to run 100x longer (to collect about 3100 samples) I see each function get within 0.5% / 50 basis points of its expected fraction.

Now that the baseline has changed, could you refresh this issue with what incremental improvements you'd like to see? Is this list now:

  • being able to see other performance counters, such as cache misses or branch mis-prediction
  • being able to collect more data in less time via higher sampling rate

Thanks.

$ go run goroutine.go  && go tool pprof -top goroutine_prof
44438869911
44438869911
44438869911
44438869911
44438869911
44438869911
44438869911
44438869911
44438869911
44438869911
File: goroutine
Type: cpu
Time: May 13, 2022 at 4:26am (UTC)
Duration: 7.54s, Total samples = 69.07s (916.55%)
Showing nodes accounting for 69.07s, 100% of 69.07s total
      flat  flat%   sum%        cum   cum%
     7.43s 10.76% 10.76%      7.43s 10.76%  main.f2
     7.43s 10.76% 21.51%      7.43s 10.76%  main.f4
     7.43s 10.76% 32.27%      7.43s 10.76%  main.f5
     7.04s 10.19% 42.46%      7.04s 10.19%  main.f10
     6.95s 10.06% 52.53%      6.95s 10.06%  main.f9
     6.56s  9.50% 62.02%      6.56s  9.50%  main.f3
     6.56s  9.50% 71.52%      6.56s  9.50%  main.f6
     6.56s  9.50% 81.02%      6.56s  9.50%  main.f7
     6.56s  9.50% 90.52%      6.56s  9.50%  main.f8
     6.55s  9.48%   100%      6.55s  9.48%  main.f1
$ go run serial.go && go tool pprof -top serial_prof
2992155612185
File: serial
Type: cpu
Time: May 13, 2022 at 4:27am (UTC)
Duration: 30.85s, Total samples = 30670ms (99.41%)
Showing nodes accounting for 30660ms, 100% of 30670ms total
Dropped 1 node (cum <= 153.35ms)
      flat  flat%   sum%        cum   cum%
    5550ms 18.10% 18.10%     5560ms 18.13%  main.J_expect_18_18 (inline)
    5040ms 16.43% 34.53%     5040ms 16.43%  main.I_expect_16_36 (inline)
    4400ms 14.35% 48.88%     4400ms 14.35%  main.H_expect_14_546 (inline)
    3920ms 12.78% 61.66%     3920ms 12.78%  main.G_expect_12_73 (inline)
    3270ms 10.66% 72.32%     3270ms 10.66%  main.F_expect_10_91 (inline)
    2840ms  9.26% 81.58%     2840ms  9.26%  main.E_expect_9_09 (inline)
    2140ms  6.98% 88.56%     2140ms  6.98%  main.D_expect_7_27 (inline)
    1770ms  5.77% 94.33%     1770ms  5.77%  main.C_expect_5_46 (inline)
    1070ms  3.49% 97.82%     1070ms  3.49%  main.B_expect_3_64 (inline)
     660ms  2.15%   100%      660ms  2.15%  main.A_expect_1_82 (inline)
         0     0%   100%    30670ms   100%  main.main
         0     0%   100%    30670ms   100%  runtime.main

irfansharif added a commit to irfansharif/go that referenced this issue May 13, 2022
Fixes golang#41554.

This commit introduces a /sched/goroutine/running:nanoseconds metric,
defined as the total time spent by a goroutine in the running state.
This measurement is useful for systems that would benefit from
fine-grained CPU attribution.

An alternative to scheduler-backed CPU attribution would be the use of
profiler labels. Given it's common to spawn multiple goroutines for the
same task, goroutine-backed statistics can easily become misleading.
Profiler labels instead let you measure CPU performance across a set of
cooperating goroutines. That said, it has two downsides:
- performance overhead; for high-performance systems that care about
  fine-grained CPU attribution (databases for e.g. that want to measure
  total CPU time spent processing each request), profiler labels are too
  cost-prohibitive, especially given the Go runtime has a much cheaper
  and more granular view of the data needed
- inaccuracy and imprecision, as evaluated in
  golang#36821

It's worth noting that we already export /sched/latencies:seconds to
track scheduling latencies, i.e. time spent by a goroutine in the
runnable state (go-review.googlesource.com/c/go/+/308933). This commit
does effectively the same except for the running state on the requesting
goroutine. Users are free to use this metric to power histograms or
tracking on-CPU time across a set of goroutines.

Change-Id: Ie78336a3ddeca0521ae29cce57bc7a5ea67da297
@chabbimilind
Author

@rhysh, it is great to see timer_create being incorporated into the runtime profiler. I think this is a great first step in the right direction. Yes, the remaining items would be the two items you listed above.

@gopherbot
Contributor

Change https://go.dev/cl/410798 mentions this issue: runtime: support PMU events based cpu profiling for linux

@gopherbot
Contributor

Change https://go.dev/cl/410797 mentions this issue: runtime: add perfEventOpen, fcntl, and ioctl system calls

@gopherbot
Contributor

Change https://go.dev/cl/410799 mentions this issue: runtime: support more PMU events

@erifan

erifan commented Jun 7, 2022

Hi, since #42502 has been accepted, I submitted an implementation for it. At the same time, based on this new API, I added support for some PMU events for this issue, and the implementation does not introduce any new API. FYI.

@chabbimilind
Author

This is great news!
Although I have not fully reviewed the code, I noticed that the changes are not propagated to the http/pprof package, which is a useful API for fetching PMU profiles remotely.

@erifan

erifan commented Jun 8, 2022

I noticed that the changes are not propagated to http/pprof package, which is a useful API for fetching PMU profiles remotely.

Yes, there is currently no PMU support for http/pprof and testing. Because the OS timer cannot be completely replaced, a new API must be introduced. I'm not sure if this is appropriate, so I didn't add it.

@gopherbot
Contributor

Change https://go.dev/cl/410796 mentions this issue: runtime: add CPU profiling configuration structure in runtime

@RuinanSun
Contributor

RuinanSun commented Oct 27, 2023

Thanks @chabbimilind for this great proposal and @erifan for the implementation!

Recently I've been looking at this proposal but I still have a few questions:

  1. How do we handle non-Go threads?

The inherit bit only applies to child threads, not to existing threads, which means that if we want to profile the whole process, we have to set up the events for all threads (maybe from /proc/self/task?) via perf_event_open when the profiler is enabled. I think this will be tricky for threads created by C code, since we can't make sure no new C threads are created while we're going through the existing threads.

  2. Is there a skid between the event overflow and the signal handling?

I looked at some kernel code and found that the overflow signal is not delivered directly in the PMU interrupt handler. If my understanding is right, there will be a time gap between the overflow and the signal handling. This makes the profiling results less reliable, since the signal handler does not run at the exact point where the event overflowed. If we are going to add more events like branch misses in the future, the profiling results could be very inaccurate.

From my view, maybe we can read the profiling info directly from the perf ring buffer, but I think some runtime info will be lost and I'm not sure whether we can recover it. I will investigate further.
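
For reference, consuming samples from the ring buffer (instead of taking them in a signal handler) looks roughly like the sketch below. It assumes an event fd already opened via unix.PerfEventOpen with sample_type set to PERF_SAMPLE_IP only, assumes a little-endian host, and skips records that wrap around the end of the buffer; it is an illustration of the mechanism, not a proposed runtime implementation.

package main

import (
	"encoding/binary"
	"fmt"
	"sync/atomic"
	"unsafe"

	"golang.org/x/sys/unix"
)

const nDataPages = 8 // the data area must be a power-of-two number of pages

func readSamples(fd int) error {
	pageSize := unix.Getpagesize()
	// One metadata page followed by the data pages.
	buf, err := unix.Mmap(fd, 0, (1+nDataPages)*pageSize,
		unix.PROT_READ|unix.PROT_WRITE, unix.MAP_SHARED)
	if err != nil {
		return err
	}
	defer unix.Munmap(buf)

	meta := (*unix.PerfEventMmapPage)(unsafe.Pointer(&buf[0]))
	data := buf[pageSize:]
	ringSize := uint64(nDataPages * pageSize)

	head := atomic.LoadUint64(&meta.Data_head) // kernel's write cursor
	tail := meta.Data_tail                     // our read cursor
	for tail < head {
		off := int(tail % ringSize)
		if off+16 > len(data) {
			break // record wraps around; a real reader would copy it out in two pieces
		}
		// Every record starts with a perf_event_header: u32 type, u16 misc, u16 size.
		typ := binary.LittleEndian.Uint32(data[off : off+4])
		size := binary.LittleEndian.Uint16(data[off+6 : off+8])
		if typ == unix.PERF_RECORD_SAMPLE {
			// With sample_type == PERF_SAMPLE_IP, the payload is just the sampled PC.
			pc := binary.LittleEndian.Uint64(data[off+8 : off+16])
			fmt.Printf("sample pc=%#x\n", pc)
		}
		tail += uint64(size)
	}
	atomic.StoreUint64(&meta.Data_tail, tail) // tell the kernel these records are consumed
	return nil
}

func main() {
	// readSamples would be called with an fd from unix.PerfEventOpen after
	// enabling the event; the setup is omitted here.
	_ = readSamples
}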

@erifan

erifan commented Oct 31, 2023

I think this will be tricky for the threads created by C code since we can't make sure there are no newly created C threads when we're going through the existing threads.

This situation is indeed difficult to deal with, but it should be rare. Go's use of the PMU is similar to perf attach. It seems that perf attach does not handle this situation either, so maybe this is not a big problem.

From my view, maybe we can read the profiling info directly from the perf ring buffer. But I think some runtime info will be lost and I'm not sure whether we can recover it.

We can get the g from R28 if the goroutine is not dead when we write the samples. But if the goroutine is dead, that will be a problem.

Maybe you can test which of the two implementations is more accurate. My previous implementation showed that the skid from overflow to signal handling does not seem to be very large.

@RuinanSun
Contributor

RuinanSun commented Nov 14, 2023

This situation is indeed difficult to deal with, but it should be rare. Go's use of PMU is similar to perf attach. It seems that perf attach does not handle this situation neither? So maybe this is not a big problem.

I hope so; if so, the implementation will be much easier. Alternatively, we could pass the profiling event via an environment variable like GOPROFILE=cycles and set the event for the initial thread during runtime initialization, which makes our behavior more like perf record.

Is there a skid between the event overflow and the signal handling?

I sampled the following code with n=1e9, a sampling frequency of 5000, and event=cycles:

//go:noinline
func loop(n int) int {
	ret := 0
	for i := 0; i < n; i++ {
		if i&7 > 0 {
			ret += i * i
		} else {
			ret /= 2
		}
	}
	return ret
}
  • For the signal-based implementation, here is the result compared with perf:
[screenshot: signal-based sampling vs. perf]

At 0xb55dc and 0xb55e8, the samples show a deviation from perf.

  • For the ring-buffer-based implementation, the result is closer to perf:
[screenshot: ring-buffer-based sampling vs. perf]

This is a small case, but we can see a deviation in the signal-based sampling results. To keep the results accurate, I would suggest using the records in perf's ring buffer instead of relying on the signal.

From my view, maybe we can read the profiling info directly from the perf ring buffer. But I think some runtime info will be lost and I'm not sure whether we can recover it.

We can get the g from R28 if the goroutine is not dead when we write the samples. But if the goroutine is dead, that will be a problem.

I found that there is a labels field in every goroutine, which can be used for filtering in Go pprof. If we rely on perf's ring buffer, this info will be lost. I'm wondering whether we can store the labels on the stack so we can recover this info from the user stack, which can be collected by setting PERF_SAMPLE_STACK_USER in perf_event_open.

Personally, I think it will be expensive to recover the labels. Alternatively, we could just ignore this field when sampling with PMU events.


There is another issue if we use the kernel ring buffer: some call chains may be lost for cgo functions.

Before using the kernel's samples, we get the stack trace in the runtime, so we can get the whole call chain even when we are in a cgo function. The kernel, however, samples the call chain based on the frame pointer and only allows the frame pointer to move upwards in the stack, so we may lose some call chains when we cross a system stack.

I believe we need more discussions here before we are going forward. Any suggestions are welcome, thanks!

@andrewhodel

Having access to a unique number that is faster to obtain across multiple cores than an increment under a mutex is needed. This would allow the mutex to be removed for queue processing and serial data insertion.

This is accomplished on Linux with PERF_COUNT_HW_INSTRUCTIONS in perf_event_open, and as _CPUPROF_HW_INSTRUCTIONS in type cpuEvent int32 in the proposal document that the author of this issue wrote: https://go.googlesource.com/proposal/+/refs/changes/08/219508/2/design/36821-perf-counter-pprof.md

I hope that this is implemented in Go.

@prattmic
Member

@andrewhodel Could you clarify your use-case? Are you saying that you want to enable the PERF_COUNT_HW_INSTRUCTIONS event and then use the RDPMC instruction to read the event counter, and use that value as a globally unique identifier?

If so, isn't each counter per-core, and thus not actually guaranteed to be globally unique?

@chabbimilind
Author

chabbimilind commented Sep 17, 2024

I have a question and some comments about using the ring buffer.

The ring buffer is accurate for the PC, but the call stack has to be found by unwinding. For the leaf frame, which PC should we use? The one seen in the signal handler, or the one given by the ring buffer? Due to the skid between the time the PMU takes the PC sample and the time the signal is produced, the PC seen in the signal handler can be off from the PC recorded in the ring buffer.

We can unwind the stack from the signal handler and override the leaf-frame PC with the ring buffer PC, but this can be wrong if the real execution has moved on to another function by the time the signal is delivered.

I knew of this problem from the beginning in my proposal and preferred to use the PC seen in the signal handler. AFAIK, perf overrides the PC with the one seen in the ring buffer. I had a personal communication about this a while back with Stephane Eranian, who is a subject-matter expert.

I think it is probably "ok" to get the call stack from the signal and override the leaf PC with the PC given by the ring buffer, knowing that sometimes the call stack will look wacky.

@prattmic
Member

Hm, interesting problem, I hadn't thought about that. I agree that it seems OK to use the adjusted PC, especially if that is already what the Linux perf tool does.

Due to the skid between the time the PMU takes the PC sample and the time the signal is produced, the PC seen in the signal handler can be off from the PC recorded in the ring buffer.

Just to clarify, the impact is really the skid between the PMU count and CPU interrupt delivery? i.e., even a stack trace collected by the kernel in the interrupt handler has the same issue. I don't think there should be any skid between interrupt delivery and signal delivery? (After the interrupt, the next return to userspace should be directly to the signal handler) [1]

[1] Unless, I suppose, the skid happens exactly in the kernel where the thread is being descheduled and made visible to other CPUs to run before the interrupt occurs. That seems vanishingly rare, though.

@chabbimilind
Author

Yes, I meant the skid between when the PC is sampled and when the PMU raises an interrupt.
I am not aware of any skid between the OS interrupt handler and user-space signal delivery. If there were such a thing, the skid would have been enormous and we would have seen it.

Referring back to "precise_ip" in https://man7.org/linux/man-pages/man2/perf_event_open.2.html, it states the following:

" This controls the amount of skid. Skid is how many
instructions execute between an event of interest
happening and the kernel being able to stop and record the
event. Smaller skid is better and allows more accurate
reporting of which events correspond to which
instructions, but hardware is often limited with how small
this can be."

I don't know how accurate this statement is. My understanding is that the register snapshot in the ring buffer can be set to have zero skid, but there is no guarantee about when the kernel will unwind the stack and what stack state it will see. The kernel can use the register snapshot from the ring buffer for its unwinding, but what if the processor has moved on and made changes to the state of memory, e.g., a return address on the stack?
