Add profile: a CPU profiler (#620)
* Add profile: a CPU profiler

* move Perf to common class
brendangregg authored and 4ast committed Jul 22, 2016
1 parent 2947ee3 commit f4bf275
Showing 7 changed files with 1,409 additions and 62 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -100,6 +100,7 @@ Examples:
- tools/[oomkill](tools/oomkill.py): Trace the out-of-memory (OOM) killer. [Examples](tools/oomkill_example.txt).
- tools/[opensnoop](tools/opensnoop.py): Trace open() syscalls. [Examples](tools/opensnoop_example.txt).
- tools/[pidpersec](tools/pidpersec.py): Count new processes (via fork). [Examples](tools/pidpersec_example.txt).
- tools/[profile](tools/profile.py): Profile CPU usage by sampling stack traces at a timed interval. [Examples](tools/profile_example.txt).
- tools/[runqlat](tools/runqlat.py): Run queue (scheduler) latency as a histogram. [Examples](tools/runqlat_example.txt).
- tools/[softirqs](tools/softirqs.py): Measure soft IRQ (soft interrupt) event time. [Examples](tools/softirqs_example.txt).
- tools/[solisten](tools/solisten.py): Trace TCP socket listen. [Examples](tools/solisten_example.txt).
148 changes: 148 additions & 0 deletions man/man8/profile.8
@@ -0,0 +1,148 @@
.TH profile 8 "2016-07-17" "USER COMMANDS"
.SH NAME
profile \- Profile CPU usage by sampling stack traces. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B profile [\-adfh] [\-p PID] [\-U | \-K] [\-F FREQUENCY]
.B [\-\-stack\-storage\-size COUNT] [\-S FRAMES] [duration]
.SH DESCRIPTION
This is a CPU profiler. It works by taking samples of stack traces at timed
intervals. It will help you understand and quantify CPU usage: which code is
executing, and by how much, including both user-level and kernel code.

By default this samples at 49 Hertz (samples per second), across all CPUs.
This frequency can be tuned using a command line option. The reason for 49, and
not 50, is to avoid lock-step sampling.

This is also an efficient profiler, as stack traces are frequency counted in
kernel context, rather than passing each stack to user space for frequency
counting there. Only the unique stacks and counts are passed to user space
at the end of the profile, greatly reducing the kernel<->user transfer.

Note: if another perf-based sampling session is active, the output may become
polluted with its events. On older kernels, the output may also become
polluted with events from tracing sessions (when the kprobe is used instead of
the tracepoint). This may be filtered in a future version if it becomes a problem.
.SH REQUIREMENTS
CONFIG_BPF and bcc.

This also requires Linux 4.6+ (BPF_MAP_TYPE_STACK_TRACE support), and the
perf:perf_hrtimer tracepoint (currently a kernel patch). If the latter is
unavailable, this will try to use kprobes as a fallback (of perf_misc_flags()),
which may or may not work, depending on your kernel build. If the kprobe
doesn't work, this tool will either error on instrumentation, or instrument
successfully but generate no output.
.SH OPTIONS
.TP
\-h
Print usage message.
.TP
\-p PID
Trace this process ID only (filtered in-kernel). Without this, all CPUs are
profiled.
.TP
\-F FREQUENCY
Frequency to sample stacks (default 49).
.TP
\-f
Print output in folded stack format.
.TP
\-d
Include an output delimiter between kernel and user stacks (either "--", or,
in folded mode, "-").
.TP
\-U
Show stacks from user space only (no kernel space stacks).
.TP
\-K
Show stacks from kernel space only (no user space stacks).
.TP
\-\-stack\-storage\-size COUNT
The maximum number of unique stack traces that the kernel will count (default
2048). If the sampled count exceeds this, a warning will be printed.
.TP
\-S FRAMES
A fixed number of kernel frames to skip. By default, extra registers are
recorded so that the interrupt framework stack can be identified and excluded
from the output. If this isn't working on your architecture, or, if you'd
like to improve performance a tiny amount, then you can specify a fixed count
to skip. Note for debugging that the instruction pointer (IP) is printed as the
first frame, followed by the captured stack.
.TP
duration
Duration to trace, in seconds.
.SH EXAMPLES
.TP
Profile (sample) stack traces system-wide at 49 Hertz (samples per second) until Ctrl-C:
#
.B profile
.TP
Profile for 5 seconds only:
#
.B profile 5
.TP
Profile at 99 Hertz for 5 seconds only:
#
.B profile \-F 99 5
.TP
Profile PID 181 only:
#
.B profile \-p 181
.TP
Profile for 5 seconds and output in folded stack format (suitable as input for flame graphs), including a delimiter between kernel and user stacks:
#
.B profile \-df 5
.TP
Profile kernel stacks only:
#
.B profile \-K
.SH DEBUGGING
See "[unknown]" frames with bogus addresses? This can happen for different
reasons. Your best approach is to get Linux perf to work first, and then to
try this tool. E.g., run "perf record \-F 49 \-a \-g \-\- sleep 1; perf script"
and check for unknown frames there.

The most common reason for "[unknown]" frames is that the target software has
not been compiled with frame pointers, and so we can't use that simple method
for walking the stack. The fix in that case is to use software that does have
frame pointers, e.g., gcc \-fno\-omit\-frame\-pointer, or Java's
\-XX:+PreserveFramePointer.

Another reason for "[unknown]" frames is JIT compilers, which don't use a
traditional symbol table. The fix in that case is to populate a
/tmp/perf-PID.map file with the symbols, which this tool should read. How you
do this depends on the runtime (Java, Node.js).

If you seem to have unrelated samples in the output, check for other
sampling or tracing tools that may be running. The current version of this
tool can include their events if profiling happened concurrently. Those
samples may be filtered in a future version.
.SH OVERHEAD
This is an efficient profiler, as stack traces are frequency counted in
kernel context, and only the unique stacks and their counts are passed to
user space. Contrast this with the current "perf record -F 99 -a" method
of profiling, which writes each sample to user space (via a ring buffer),
and then to the file system (perf.data), which must be post-processed.

This uses perf_event_open to set up a timer which is instrumented by BPF,
and for efficiency it does not initialize the perf ring buffer, so the
redundant perf samples are not collected.

It's expected that the overhead while sampling at 49 Hertz (the default),
across all CPUs, should be negligible. If you increase the sample rate, the
overhead might begin to be measurable.
.SH SOURCE
This is from bcc.
.IP
https://github.com/iovisor/bcc
.PP
Also look in the bcc distribution for a companion profile_example.txt file
containing example usage, output, and commentary for this tool.
.SH OS
Linux
.SH STABILITY
Unstable - in development.
.SH AUTHOR
Brendan Gregg
.SH SEE ALSO
offcputime(8)
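
The man page's reason for defaulting to 49 Hertz rather than 50 (avoiding lock-step sampling) can be illustrated with a small simulation. This is a sketch and not part of the commit: the workload (a routine that runs for 1 ms at the start of every 20 ms period) and the helper name `busy_fraction_seen` are hypothetical, and samples are assumed to fire at exact multiples of 1/frequency.

```python
from fractions import Fraction


def busy_fraction_seen(sample_hz, n_samples, period, busy):
    """Return the fraction of samples that land inside the busy window.

    Hypothetical workload: busy for the first `busy` seconds of every
    `period` seconds, starting at t = 0. Samples fire at t = k / sample_hz.
    """
    hits = 0
    for k in range(n_samples):
        t = Fraction(k, sample_hz)
        phase = t % period  # position of this sample within the current period
        if phase < busy:
            hits += 1
    return Fraction(hits, n_samples)


# Assumed workload: a 50 Hz routine that runs 1 ms out of every 20 ms (5% busy).
period = Fraction(1, 50)
busy = Fraction(1, 1000)

seen_50 = busy_fraction_seen(50, 50, period, busy)  # one second at 50 Hz
seen_49 = busy_fraction_seen(49, 49, period, busy)  # one second at 49 Hz

print("true busy fraction:", float(busy / period))
print("seen at 50 Hz:", float(seen_50))
print("seen at 49 Hz:", float(seen_49))
```

Under these assumptions the 50 Hertz sampler lands on the routine every time and reports it as 100% busy, while the 49 Hertz sampler drifts through the period and hits it in 3 of 49 samples, which is close to the true 5%.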
3 changes: 2 additions & 1 deletion src/python/bcc/__init__.py
@@ -27,7 +27,8 @@
from .libbcc import lib, _CB_TYPE, bcc_symbol
from .procstat import ProcStat, ProcUtils
from .table import Table
from .tracepoint import Perf, Tracepoint
from .tracepoint import Tracepoint
from .perf import Perf
from .usyms import ProcessSymbols

_kprobe_limit = 1000
108 changes: 108 additions & 0 deletions src/python/bcc/perf.py
@@ -0,0 +1,108 @@
# Copyright 2016 Sasha Goldshtein
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import ctypes as ct
import multiprocessing
import os

class Perf(object):
    class perf_event_attr(ct.Structure):
        _fields_ = [
            ('type', ct.c_uint),
            ('size', ct.c_uint),
            ('config', ct.c_ulong),
            ('sample_period', ct.c_ulong),
            ('sample_type', ct.c_ulong),
            ('read_format', ct.c_ulong),
            ('flags', ct.c_ulong),
            ('wakeup_events', ct.c_uint),
            ('IGNORE3', ct.c_uint),
            ('IGNORE4', ct.c_ulong),
            ('IGNORE5', ct.c_ulong),
            ('IGNORE6', ct.c_ulong),
            ('IGNORE7', ct.c_uint),
            ('IGNORE8', ct.c_int),
            ('IGNORE9', ct.c_ulong),
            ('IGNORE10', ct.c_uint),
            ('IGNORE11', ct.c_uint)
        ]

    # x86 specific, from arch/x86/include/generated/uapi/asm/unistd_64.h
    NR_PERF_EVENT_OPEN = 298

    #
    # Selected constants from include/uapi/linux/perf_event.h.
    # Values copied during Linux 4.7 series.
    #

    # perf_type_id
    PERF_TYPE_HARDWARE = 0
    PERF_TYPE_SOFTWARE = 1
    PERF_TYPE_TRACEPOINT = 2

    # perf_event_sample_format
    PERF_SAMPLE_RAW = 1024 # it's a u32; could also try zero args

    # perf_event_attr
    PERF_ATTR_FLAG_FREQ = 1024

    # perf_event.h
    PERF_FLAG_FD_CLOEXEC = 8
    PERF_EVENT_IOC_SET_FILTER = 1074275334
    PERF_EVENT_IOC_ENABLE = 9216

    # fetch syscall routines
    libc = ct.CDLL('libc.so.6', use_errno=True)
    syscall = libc.syscall # not declaring vararg types
    ioctl = libc.ioctl # not declaring vararg types

    @staticmethod
    def _open_for_cpu(cpu, attr):
        pfd = Perf.syscall(Perf.NR_PERF_EVENT_OPEN, ct.byref(attr),
                           attr.pid, cpu, -1,
                           Perf.PERF_FLAG_FD_CLOEXEC)
        if pfd < 0:
            errno_ = ct.get_errno()
            raise OSError(errno_, os.strerror(errno_))

        if attr.type == Perf.PERF_TYPE_TRACEPOINT:
            if Perf.ioctl(pfd, Perf.PERF_EVENT_IOC_SET_FILTER,
                          "common_pid == -17") < 0:
                errno_ = ct.get_errno()
                raise OSError(errno_, os.strerror(errno_))

        # we don't setup the perf ring buffers, as we won't read them

        if Perf.ioctl(pfd, Perf.PERF_EVENT_IOC_ENABLE, 0) < 0:
            errno_ = ct.get_errno()
            raise OSError(errno_, os.strerror(errno_))

    @staticmethod
    def perf_event_open(tpoint_id, pid=-1, ptype=PERF_TYPE_TRACEPOINT,
                        freq=0):
        attr = Perf.perf_event_attr()
        attr.config = tpoint_id
        attr.pid = pid
        attr.type = ptype
        attr.sample_type = Perf.PERF_SAMPLE_RAW
        if freq > 0:
            # setup sampling
            attr.flags = Perf.PERF_ATTR_FLAG_FREQ # no mmap or comm
            attr.sample_period = freq
        else:
            attr.sample_period = 1
        attr.wakeup_events = 9999999 # don't wake up

        for cpu in range(0, multiprocessing.cpu_count()):
            Perf._open_for_cpu(cpu, attr)
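
The numeric constants in perf.py were copied from kernel headers. As a cross-check, the sketch below re-derives several of them. It is not part of the commit and rests on assumptions: the generic Linux ioctl request encoding from asm-generic/ioctl.h (as used on x86), a 64-bit build where a pointer is 8 bytes, and the `freq` flag occupying bit 10 of the perf_event_attr flag bitfield. The helper name `ioc` is made up for illustration.

```python
# Assumed bit layout of an ioctl request number (asm-generic/ioctl.h).
IOC_NRSHIFT = 0
IOC_TYPESHIFT = 8
IOC_SIZESHIFT = 16
IOC_DIRSHIFT = 30

IOC_NONE = 0   # no data transfer, as in _IO()
IOC_WRITE = 1  # userspace writes data to the kernel, as in _IOW()

POINTER_SIZE = 8  # assumption: sizeof(char *) on a 64-bit build


def ioc(direction, type_char, nr, size):
    """Pack direction, type character, number, and size into a request code."""
    return ((direction << IOC_DIRSHIFT)
            | (size << IOC_SIZESHIFT)
            | (ord(type_char) << IOC_TYPESHIFT)
            | (nr << IOC_NRSHIFT))


# perf_event.h defines these as _IO('$', 0) and _IOW('$', 6, char *).
PERF_EVENT_IOC_ENABLE = ioc(IOC_NONE, '$', 0, 0)
PERF_EVENT_IOC_SET_FILTER = ioc(IOC_WRITE, '$', 6, POINTER_SIZE)

# Single-bit flags, written as shifts instead of decimal literals.
PERF_SAMPLE_RAW = 1 << 10       # perf_event_sample_format
PERF_ATTR_FLAG_FREQ = 1 << 10   # assumed position of 'freq' in the bitfield
PERF_FLAG_FD_CLOEXEC = 1 << 3   # perf_event_open() flag

print(PERF_EVENT_IOC_ENABLE, PERF_EVENT_IOC_SET_FILTER)
```

Writing the values this way makes it easier to see which ones depend on the architecture: the ioctl codes depend on the encoding and pointer size, and the syscall number 298 is specific to x86-64, as the comment in perf.py notes.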
62 changes: 1 addition & 61 deletions src/python/bcc/tracepoint.py
@@ -17,65 +17,6 @@
import os
import re

class Perf(object):
    class perf_event_attr(ct.Structure):
        _fields_ = [
            ('type', ct.c_uint),
            ('size', ct.c_uint),
            ('config', ct.c_ulong),
            ('sample_period', ct.c_ulong),
            ('sample_type', ct.c_ulong),
            ('IGNORE1', ct.c_ulong),
            ('IGNORE2', ct.c_ulong),
            ('wakeup_events', ct.c_uint),
            ('IGNORE3', ct.c_uint),
            ('IGNORE4', ct.c_ulong),
            ('IGNORE5', ct.c_ulong),
            ('IGNORE6', ct.c_ulong),
            ('IGNORE7', ct.c_uint),
            ('IGNORE8', ct.c_int),
            ('IGNORE9', ct.c_ulong),
            ('IGNORE10', ct.c_uint),
            ('IGNORE11', ct.c_uint)
        ]

    NR_PERF_EVENT_OPEN = 298
    PERF_TYPE_TRACEPOINT = 2
    PERF_SAMPLE_RAW = 1024
    PERF_FLAG_FD_CLOEXEC = 8
    PERF_EVENT_IOC_SET_FILTER = 1074275334
    PERF_EVENT_IOC_ENABLE = 9216

    libc = ct.CDLL('libc.so.6', use_errno=True)
    syscall = libc.syscall # not declaring vararg types
    ioctl = libc.ioctl # not declaring vararg types

    @staticmethod
    def _open_for_cpu(cpu, attr):
        pfd = Perf.syscall(Perf.NR_PERF_EVENT_OPEN, ct.byref(attr),
                           -1, cpu, -1, Perf.PERF_FLAG_FD_CLOEXEC)
        if pfd < 0:
            errno_ = ct.get_errno()
            raise OSError(errno_, os.strerror(errno_))
        if Perf.ioctl(pfd, Perf.PERF_EVENT_IOC_SET_FILTER,
                      "common_pid == -17") < 0:
            errno_ = ct.get_errno()
            raise OSError(errno_, os.strerror(errno_))
        if Perf.ioctl(pfd, Perf.PERF_EVENT_IOC_ENABLE, 0) < 0:
            errno_ = ct.get_errno()
            raise OSError(errno_, os.strerror(errno_))

    @staticmethod
    def perf_event_open(tpoint_id):
        attr = Perf.perf_event_attr()
        attr.config = tpoint_id
        attr.type = Perf.PERF_TYPE_TRACEPOINT
        attr.sample_type = Perf.PERF_SAMPLE_RAW
        attr.sample_period = 1
        attr.wakeup_events = 1
        for cpu in range(0, multiprocessing.cpu_count()):
            Perf._open_for_cpu(cpu, attr)

class Tracepoint(object):
    enabled_tracepoints = []
    trace_root = "/sys/kernel/debug/tracing"
@@ -172,7 +113,7 @@ def enable_tracepoint(cls, category, event):
        if tp_id == -1:
            raise ValueError("no such tracepoint found: %s:%s" %
                             (category, event))
        Perf.perf_event_open(tp_id)
        Perf.perf_event_open(tp_id, ptype=Perf.PERF_TYPE_TRACEPOINT)
        new_tp = Tracepoint(category, event, tp_id)
        cls.enabled_tracepoints.append(new_tp)
        return new_tp
@@ -199,4 +140,3 @@ def attach(cls, bpf):
        if cls._any_tracepoints_enabled():
            bpf.attach_kprobe(event="tracing_generic_entry_update",
                              fn_name="__trace_entry_update")
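
The efficiency argument in the man page (count identical stacks first, then hand over only the unique stacks and their counts) and the folded output selected with -f can be sketched in plain Python. This is an illustration only and not part of the commit: the man page says the real tool counts stacks in kernel context, the sample stacks here are invented, and `count_stacks` and `to_folded` are hypothetical names. The folded layout shown (frames joined by semicolons, then a space and the count) is the form flame graph tooling accepts; the tool's actual lines may carry additional fields.

```python
from collections import Counter


def count_stacks(samples):
    """Frequency-count identical stacks so each unique stack is stored once."""
    return Counter(tuple(stack) for stack in samples)


def to_folded(counts):
    """Render each unique stack as 'frame;frame;... count', sorted by stack."""
    lines = []
    for stack, count in sorted(counts.items()):
        lines.append("%s %d" % (";".join(stack), count))
    return lines


# Invented samples, each listed root frame first.
samples = [
    ["main", "parse", "read_line"],
    ["main", "compute"],
    ["main", "parse", "read_line"],
]

counts = count_stacks(samples)
folded = to_folded(counts)
for line in folded:
    print(line)
```

Three samples collapse into two unique stacks here. With a long profile and a hot code path, the ratio is far larger, which is why transferring only the counted stacks keeps the kernel-to-user transfer small.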
