* Add profile: a CPU profiler
* move Perf to common class
1 parent 2947ee3 commit f4bf275
Showing 7 changed files with 1,409 additions and 62 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,148 @@ | ||
.TH profile 8 "2016-07-17" "USER COMMANDS" | ||
.SH NAME | ||
profile \- Profile CPU usage by sampling stack traces. Uses Linux eBPF/bcc. | ||
.SH SYNOPSIS | ||
.B profile [\-adfh] [\-p PID] [\-U | \-K] [\-F FREQUENCY] | ||
.B [\-\-stack\-storage\-size COUNT] [\-S FRAMES] [duration] | ||
.SH DESCRIPTION | ||
This is a CPU profiler. It works by taking samples of stack traces at timed | ||
intervals. It will help you understand and quantify CPU usage: which code is | ||
executing, and by how much, including both user-level and kernel code. | ||
|
||
By default this samples at 49 Hertz (samples per second), across all CPUs. | ||
This frequency can be tuned using a command line option. The reason for 49, and | ||
not 50, is to avoid lock-step sampling. | ||
|
||
This is also an efficient profiler, as stack traces are frequency counted in | ||
kernel context, rather than passing each stack to user space for frequency | ||
counting there. Only the unique stacks and counts are passed to user space | ||
at the end of the profile, greatly reducing the kernel<->user transfer. | ||
|
||
Note: if another perf-based sampling session is active, the output may become | ||
polluted with its events. On older kernels, the output may also become | ||
polluted with tracing sessions (when the kprobe is used instead of the | ||
tracepoint). This may be filtered in a future version if it becomes a problem. | ||
.SH REQUIREMENTS | ||
CONFIG_BPF and bcc. | ||
|
||
This also requires Linux 4.6+ (BPF_MAP_TYPE_STACK_TRACE support), and the | ||
perf:perf_hrtimer tracepoint (currently a kernel patch). If the latter is | ||
unavailable, this will try to use a kprobe on perf_misc_flags() as a fallback, | ||
which may or may not work, depending on your kernel build. If the kprobe | ||
doesn't work, this tool will either error on instrumentation, or instrument | ||
successfully but generate no output. | ||
.SH OPTIONS | ||
.TP | ||
\-h | ||
Print usage message. | ||
.TP | ||
\-p PID | ||
Trace this process ID only (filtered in-kernel). Without this, all CPUs are | ||
profiled. | ||
.TP | ||
\-F FREQUENCY | ||
Frequency to sample stacks, in Hertz (default 49). | ||
.TP | ||
\-f | ||
Print output in folded stack format. | ||
.TP | ||
\-d | ||
Include an output delimiter between kernel and user stacks (either "--", or, | ||
in folded mode, "-"). | ||
.TP | ||
\-U | ||
Show stacks from user space only (no kernel space stacks). | ||
.TP | ||
\-K | ||
Show stacks from kernel space only (no user space stacks). | ||
.TP | ||
\-\-stack\-storage\-size COUNT | ||
The maximum number of unique stack traces that the kernel will count (default | ||
2048). If the sampled count exceeds this, a warning will be printed. | ||
.TP | ||
\-S FRAMES | ||
A fixed number of kernel frames to skip. By default, extra registers are | ||
recorded so that the interrupt framework stack can be identified and excluded | ||
from the output. If this isn't working on your architecture, or if you'd | ||
like to improve performance a tiny amount, then you can specify a fixed count | ||
to skip. Note for debugging that the instruction pointer (IP) address is | ||
printed as the first frame, followed by the captured stack. | ||
.TP | ||
duration | ||
Duration to trace, in seconds. | ||
.SH EXAMPLES | ||
.TP | ||
Profile (sample) stack traces system-wide at 49 Hertz (samples per second) until Ctrl-C: | ||
# | ||
.B profile | ||
.TP | ||
Profile for 5 seconds only: | ||
# | ||
.B profile 5 | ||
.TP | ||
Profile at 99 Hertz for 5 seconds only: | ||
# | ||
.B profile -F 99 5 | ||
.TP | ||
Profile PID 181 only: | ||
# | ||
.B profile -p 181 | ||
.TP | ||
Profile for 5 seconds and output in folded stack format (suitable as input for flame graphs), including a delimiter between kernel and user stacks: | ||
# | ||
.B profile -df 5 | ||
.TP | ||
Profile kernel stacks only: | ||
# | ||
.B profile -K | ||
.SH DEBUGGING | ||
See "[unknown]" frames with bogus addresses? This can happen for different | ||
reasons. Your best approach is to get Linux perf to work first, and then to | ||
try this tool. E.g., run "perf record \-F 49 \-a \-g \-\- sleep 1; perf script" | ||
and check for unknown frames there. | ||
|
||
The most common reason for "[unknown]" frames is that the target software has | ||
not been compiled with frame pointers, and so we can't use that simple method | ||
for walking the stack. The fix in that case is to use software that does have | ||
frame pointers, e.g., gcc \-fno\-omit\-frame\-pointer, or Java's | ||
\-XX:+PreserveFramePointer. | ||
|
||
Another reason for "[unknown]" frames is JIT compilers, which don't use a | ||
traditional symbol table. The fix in that case is to populate a | ||
/tmp/perf-PID.map file with the symbols, which this tool should read. How you | ||
do this depends on the runtime (Java, Node.js). | ||
|
||
If you seem to have unrelated samples in the output, check for other | ||
sampling or tracing tools that may be running. The current version of this | ||
tool can include their events if profiling happened concurrently. Those | ||
samples may be filtered in a future version. | ||
.SH OVERHEAD | ||
This is an efficient profiler, as stack traces are frequency counted in | ||
kernel context, and only the unique stacks and their counts are passed to | ||
user space. Contrast this with the current "perf record -F 99 -a" method | ||
of profiling, which writes each sample to user space (via a ring buffer), | ||
and then to the file system (perf.data), which must be post-processed. | ||
|
||
This uses perf_event_open to set up a timer which is instrumented by BPF, | ||
and for efficiency it does not initialize the perf ring buffer, so the | ||
redundant perf samples are not collected. | ||
|
||
It's expected that the overhead while sampling at 49 Hertz (the default), | ||
across all CPUs, should be negligible. If you increase the sample rate, the | ||
overhead might begin to be measurable. | ||
.SH SOURCE | ||
This is from bcc. | ||
.IP | ||
https://github.com/iovisor/bcc | ||
.PP | ||
Also look in the bcc distribution for a companion _examples.txt file containing | ||
example usage, output, and commentary for this tool. | ||
.SH OS | ||
Linux | ||
.SH STABILITY | ||
Unstable - in development. | ||
.SH AUTHOR | ||
Brendan Gregg | ||
.SH SEE ALSO | ||
offcputime(8) |
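
The man page above describes two ideas that a short sketch may make concrete: stacks are frequency counted so that only unique stacks and their counts are reported, and the folded format (with the optional "-" delimiter between user and kernel frames) puts one stack per line. The following is a minimal user-space illustration only, not the tool's implementation. The function names `count_stacks` and `fold`, the frame names, and the leaf-first input ordering are all assumptions made for the example.

```python
from collections import Counter


def count_stacks(samples):
    """Frequency count samples; each is (user_frames, kernel_frames), leaf first."""
    counts = Counter()
    for user, kernel in samples:
        counts[(tuple(user), tuple(kernel))] += 1
    return counts


def fold(counts, delimiter=False):
    """Render unique stacks as 'root;...;leaf count' lines, least frequent first."""
    lines = []
    for (user, kernel), n in sorted(counts.items(), key=lambda kv: kv[1]):
        frames = list(reversed(user))       # user frames, root first
        if delimiter:
            frames.append("-")              # marks the user/kernel boundary
        frames.extend(reversed(kernel))     # kernel frames, root first
        lines.append("%s %d" % (";".join(frames), n))
    return lines


# Hypothetical samples: three timer ticks, two of which hit the same stack.
samples = [
    (("work", "main"), ("do_syscall",)),
    (("work", "main"), ("do_syscall",)),
    (("idle", "main"), ()),
]
folded = fold(count_stacks(samples), delimiter=True)
for line in folded:
    print(line)
```

Three samples collapse into two output lines, which is the reduction in kernel-to-user transfer that the OVERHEAD section refers to, shown here at toy scale.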
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,108 @@ | ||
# Copyright 2016 Sasha Goldshtein | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
import ctypes as ct | ||
import multiprocessing | ||
import os | ||
|
||
class Perf(object): | ||
    class perf_event_attr(ct.Structure): | ||
        _fields_ = [ | ||
            ('type', ct.c_uint), | ||
            ('size', ct.c_uint), | ||
            ('config', ct.c_ulong), | ||
            ('sample_period', ct.c_ulong), | ||
            ('sample_type', ct.c_ulong), | ||
            ('read_format', ct.c_ulong), | ||
            ('flags', ct.c_ulong), | ||
            ('wakeup_events', ct.c_uint), | ||
            ('IGNORE3', ct.c_uint), | ||
            ('IGNORE4', ct.c_ulong), | ||
            ('IGNORE5', ct.c_ulong), | ||
            ('IGNORE6', ct.c_ulong), | ||
            ('IGNORE7', ct.c_uint), | ||
            ('IGNORE8', ct.c_int), | ||
            ('IGNORE9', ct.c_ulong), | ||
            ('IGNORE10', ct.c_uint), | ||
            ('IGNORE11', ct.c_uint) | ||
        ] | ||
|
||
    # x86_64 specific, from arch/x86/include/generated/uapi/asm/unistd_64.h | ||
    NR_PERF_EVENT_OPEN = 298 | ||
|
||
    # | ||
    # Selected constants from include/uapi/linux/perf_event.h. | ||
    # Values copied during Linux 4.7 series. | ||
    # | ||
|
||
    # perf_type_id | ||
    PERF_TYPE_HARDWARE = 0 | ||
    PERF_TYPE_SOFTWARE = 1 | ||
    PERF_TYPE_TRACEPOINT = 2 | ||
|
||
    # perf_event_sample_format | ||
    PERF_SAMPLE_RAW = 1024 # it's a u32; could also try zero args | ||
|
||
    # perf_event_attr | ||
    PERF_ATTR_FLAG_FREQ = 1024 | ||
|
||
    # perf_event.h | ||
    PERF_FLAG_FD_CLOEXEC = 8 | ||
    PERF_EVENT_IOC_SET_FILTER = 1074275334 | ||
    PERF_EVENT_IOC_ENABLE = 9216 | ||
|
||
    # fetch syscall routines | ||
    libc = ct.CDLL('libc.so.6', use_errno=True) | ||
    syscall = libc.syscall # not declaring vararg types | ||
    ioctl = libc.ioctl # not declaring vararg types | ||
|
||
    @staticmethod | ||
    def _open_for_cpu(cpu, attr): | ||
        pfd = Perf.syscall(Perf.NR_PERF_EVENT_OPEN, ct.byref(attr), | ||
                           attr.pid, cpu, -1, | ||
                           Perf.PERF_FLAG_FD_CLOEXEC) | ||
        if pfd < 0: | ||
            errno_ = ct.get_errno() | ||
            raise OSError(errno_, os.strerror(errno_)) | ||
|
||
        if attr.type == Perf.PERF_TYPE_TRACEPOINT: | ||
            # bytes literal so ctypes passes a char * on Python 2 and 3 | ||
            if Perf.ioctl(pfd, Perf.PERF_EVENT_IOC_SET_FILTER, | ||
                          b"common_pid == -17") < 0: | ||
                errno_ = ct.get_errno() | ||
                raise OSError(errno_, os.strerror(errno_)) | ||
|
||
        # we don't set up the perf ring buffers, as we won't read them | ||
|
||
        if Perf.ioctl(pfd, Perf.PERF_EVENT_IOC_ENABLE, 0) < 0: | ||
            errno_ = ct.get_errno() | ||
            raise OSError(errno_, os.strerror(errno_)) | ||
|
||
    @staticmethod | ||
    def perf_event_open(tpoint_id, pid=-1, ptype=PERF_TYPE_TRACEPOINT, | ||
                        freq=0): | ||
        attr = Perf.perf_event_attr() | ||
        attr.config = tpoint_id | ||
        # pid is not a struct field; it is kept on the Python object | ||
        # so that _open_for_cpu can pass it to the syscall | ||
        attr.pid = pid | ||
        attr.type = ptype | ||
        attr.sample_type = Perf.PERF_SAMPLE_RAW | ||
        if freq > 0: | ||
            # set up sampling | ||
            attr.flags = Perf.PERF_ATTR_FLAG_FREQ # no mmap or comm | ||
            attr.sample_period = freq | ||
        else: | ||
            attr.sample_period = 1 | ||
        attr.wakeup_events = 9999999 # don't wake up | ||
|
||
        for cpu in range(0, multiprocessing.cpu_count()): | ||
            Perf._open_for_cpu(cpu, attr) |
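
The numeric constants in the class above are copied from kernel headers as plain integers. As a cross-check, the sketch below recomputes them from the ioctl encoding used by the kernel's asm-generic headers (which x86 uses) and from the bit positions in perf_event.h. It is an illustration, not part of the commit. The helper names `_ioc`, `_io`, and `_iow` are made up for this example, and the 8-byte pointer size is an assumption that holds on 64-bit systems only.

```python
# asm-generic ioctl layout: nr (8 bits) | type (8 bits) | size (14 bits) | dir (2 bits)
_IOC_NRSHIFT = 0
_IOC_TYPESHIFT = 8
_IOC_SIZESHIFT = 16
_IOC_DIRSHIFT = 30
_IOC_NONE = 0
_IOC_WRITE = 1

POINTER_SIZE = 8  # assumption: 64-bit build, sizeof(char *) == 8


def _ioc(direction, type_char, nr, size):
    """Pack the four ioctl fields into a request number."""
    return ((direction << _IOC_DIRSHIFT) | (size << _IOC_SIZESHIFT) |
            (ord(type_char) << _IOC_TYPESHIFT) | (nr << _IOC_NRSHIFT))


def _io(type_char, nr):
    """Request that carries no argument data."""
    return _ioc(_IOC_NONE, type_char, nr, 0)


def _iow(type_char, nr, size):
    """Request whose argument is written from user space to the kernel."""
    return _ioc(_IOC_WRITE, type_char, nr, size)


# perf_event.h defines these as _IO('$', 0) and _IOW('$', 6, char *)
PERF_EVENT_IOC_ENABLE = _io('$', 0)
PERF_EVENT_IOC_SET_FILTER = _iow('$', 6, POINTER_SIZE)

PERF_SAMPLE_RAW = 1 << 10       # perf_event_sample_format bit
PERF_ATTR_FLAG_FREQ = 1 << 10   # 'freq' is bit 10 of the perf_event_attr flags
PERF_FLAG_FD_CLOEXEC = 1 << 3

print(PERF_EVENT_IOC_ENABLE, PERF_EVENT_IOC_SET_FILTER)
```

Deriving the values this way makes it clearer which ones depend on the architecture: the ioctl numbers change with pointer size and ioctl encoding, while the flag bits do not.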