AKA the Portable Performance Tool. But really, the performance tool of last resort.
Too cool to find on Google! We live in the shadow of PowerPoint!
`ppt` is a performance tracing tool. You define different types of individual frames, and `ppt` will generate code to call and link into your executable. Then `ppt` will let you attach to your running executable to collect these traces.
This is quite useful when you want to know why your code performs a certain way in real-world scenarios.
A frame is essentially a C `struct` that you fill out and save to a circular buffer in shared memory, called a `buffer`. You can add any members you like to a frame to record relevant details about the event. `ppt` also has built-in support for using timers and CPU performance counters.
`ppt` puts low overhead first. When `ppt` isn’t attached to a running process, there is no buffer, and saving a frame is a no-op.
You need `stack`, the Haskell build tool (https://docs.haskellstack.org/en/stable/README/). Only tested on Linux. I’m quite sure it won’t work on macOS without serious changes (e.g., the ELF dependency).

With `nix`, it’s just `stack install`. The build output will indicate the directory (`~/.local/bin`) where the binaries are installed.
`nix` installs some dependencies, but not many. Without it, you just need to install two libraries:
| Library | Ubuntu packages |
|---|---|
| zlib | `zlib1g`, `zlib1g-dev` |
| libpfm4 | `libpfm4`, `libpfm4-dev` |
Once they’re installed, just run `stack install --no-nix`. The build output will indicate the directory (`~/.local/bin`) where the binaries are installed. You can ignore `elf`; only `ppt` matters.
Because they solve different problems. In a dreaded car analogy, `gprof` is a dynamometer check, `dtrace` is a PMU logger, and `ppt` is custom telemetry. All useful, all separate.
`dtrace` is an excellent tool. It instruments unaltered binaries using several built-in “providers”, and you can alter the program to give `dtrace` even more data. But it has two drawbacks:
- Attachment time: the more instrumentation you want on the execution, the longer it takes to attach to the process. This can leave the process unresponsive for longer than you may like.
- Statistics and histograms only: no access to raw data. You get histograms (or simple statistics) of the collected data, but the underlying raw values can (AFAIK) only be captured with a relatively expensive print statement.
`gprof` is a good tool. It provides a simple way to get a rough idea of where your CPU time goes. But it sacrifices a few things to get there:
- It chooses where the overhead goes: `gprof` instruments at the function level, which biases towards over-reporting the overhead of small functions (because, as a percentage, the `gprof`-induced overhead is higher there).
- You don’t get as much detail. `gprof` gives you basic statistics on a program execution, but much detail can be hidden behind those statistics. For example, the program’s performance could be dominated by some infrequent, but very expensive, situation.
In contrast, `ppt`’s data gives you the whole distribution (instead of statistics) for your data, and you choose what and when to instrument in the program’s execution.
The transfer protocol for sending a frame from the process under observation to `ppt` for recording is a pair of memory barriers, a copy of the data, and some additional writes to the sequence number. There’s no flow control or other synchronization between `ppt` and the instrumented app. `ppt` can’t slow down your app, even if you ^Z `ppt`.
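That protocol is in the same family as a seqlock writer. Purely as an illustration, here is our sketch of the idea — the slot layout and function are made up for this example, not `ppt`’s actual generated code:

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Hypothetical slot layout for one frame in the shared-memory buffer.
struct Slot {
    std::atomic<uint64_t> seq{0};  // even = stable, odd = write in progress
    unsigned char payload[64];
};

// Writer side: barriers and plain stores only, so the instrumented
// process never blocks on the reader. len must be <= sizeof payload.
void write_frame(Slot& slot, const void* frame, std::size_t len, uint64_t seqno) {
    slot.seq.store(2 * seqno + 1, std::memory_order_relaxed);  // mark in progress
    std::atomic_thread_fence(std::memory_order_release);       // barrier #1
    std::memcpy(slot.payload, frame, len);                     // copy the data
    std::atomic_thread_fence(std::memory_order_release);       // barrier #2
    slot.seq.store(2 * seqno + 2, std::memory_order_relaxed);  // publish
}
```

A reader that sees an odd sequence number, or a different number after copying, discards the slot and moves on; the writer never waits for anyone.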
The best use case for `ppt` is regression analysis of state or inputs against performance values. Embed input values, loop counts, and whatever else you want (add more than you think you’ll need; the instrumentation’s cheap) and measure what matters. The resulting records will directly contain cause and effect.
Yes. This is a flow-control policy. `ppt` does not make the writer (the process being instrumented) wait for the reader (`ppt`) to catch up. Instead, some data will be overwritten in the shared buffer before it’s read by `ppt` and saved to disk.
Each frame in a `ppt` buffer has a sequence number that monotonically increases. If `ppt` doesn’t capture some frames in time, the captured frames will have jumps in their sequence numbers.
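Detecting that loss is a one-pass scan over the captured sequence numbers. A small sketch (the helper name is ours; `ppt` itself just exposes the sequence numbers in the decoded output):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Count how many frames were lost, given the captured frames' sequence
// numbers in arrival order. Any jump larger than 1 is dropped frames.
uint64_t count_dropped(const std::vector<uint64_t>& seqnos) {
    uint64_t dropped = 0;
    for (std::size_t i = 1; i < seqnos.size(); ++i)
        if (seqnos[i] > seqnos[i - 1] + 1)
            dropped += seqnos[i] - seqnos[i - 1] - 1;
    return dropped;
}
```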
Roughly:
- Describe the frames you want to collect in a .spec file.
- Generate source for those frames.
- Add instrumentation to your program to fill in frames and save them to the buffer.
- Attach to the running program to collect your data.
- Convert the collected data to CSV for analysis.
It’s a lot of steps, but it has a few advantages:
- During your program’s run, PPT isn’t doing anything more than saving a shared memory buffer to disk periodically.
- You get lots of control in your instrumentation.
- You get raw data at the end, for deeper analysis.
emit C++;
buffer Minimal 512;
option time timespec realtime;
frame first {
int a, b, c;
interval time duration;
interval counter events;
}
frame second {
interval counter foos;
}
See the Buffer Syntax Reference for details.
Notice above that you can use the `counter` type. Its recorded value is a `uint64_t`, but the actual counter used is unspecified. When attaching, use the `-c` flag to specify which actual counters to use.
The counter names are as specified by `libpfm4`. You can use the `showevtinfo` command from that distribution to list the (many) counters available on your machine. Some highlights:
| Counter | Description |
|---|---|
| INSTRUCTION_RETIRED | Counts the number of instructions at retirement. For instructions that consist of multiple micro-ops, this event counts the retirement of the last micro-op of the instruction. |
| LLC_MISSES | Counts each cache-miss condition for references to the last-level cache. The event count may include speculation, but excludes cache-line fills due to hardware prefetch. |
| DTLB_LOAD_MISSES | DTLB load misses. See all modifiers in showevtinfo for details. |
| ITLB_MISSES | Instruction TLB misses. |
| L1-DCACHE-LOAD-MISSES | L1 data cache load misses. |
| L1-ICACHE-LOAD-MISSES | L1 instruction cache misses. |
Many, many more are available, but `showevtinfo` does a much better job than we can of explaining which counters you can use.
`ppt` allocates enough space for 3 counters’ worth of data in each `counter` member of a frame, and twice that for `interval counter`. You specify which counters you want on the command line in `ppt attach`. You can attach to the same process several times (sequentially) with different counters to get more than 3.
The current implementation uses a system call to read the counters in the generated code. Be aware of the performance impact of measurement.
$ ppt generate ./minimal.spec
Will generate source with this public API:
namespace ppt { namespace Minimal {
class first {
public:
struct timespec duration_start;
struct timespec duration_end;
uint64_t events_0_start= 0;
uint64_t events_0_end= 0;
uint64_t events_1_start= 0;
uint64_t events_1_end= 0;
uint64_t events_2_start= 0;
uint64_t events_2_end= 0;
int a= 0;
int b= 0;
int c= 0;
void save();
void snapshot_duration_start();
void snapshot_duration_end();
void snapshot_events_start();
void snapshot_events_end();
};
class second {
public:
uint64_t foos_0_start= 0;
uint64_t foos_0_end= 0;
uint64_t foos_1_start= 0;
uint64_t foos_1_end= 0;
uint64_t foos_2_start= 0;
uint64_t foos_2_end= 0;
void save();
void snapshot_foos_start();
void snapshot_foos_end();
};
}} // namespace ppt::Minimal
There are additional members in each class that `ppt` uses internally. Check out the generated code for details.
Generally:
- For simple scalar frame members, you should see a matching class member of the same name and type. Directly assign to it.
- For `time` members, a method called `snapshot_MEMBER()` is provided to save the current time.
- The same goes for `counter` members, only they save to 3 members at a time.
- For `interval` types, `ppt` adds `_start` and `_end` suffixes to distinguish the members for the start and end of the interval.
- When it’s done filling in the frame, your code should call `save()` to make the frame available for an attached `ppt` process to capture.
minimal-client: ppt-Minimal.hh ppt-Minimal.cc minimal-client.cc
g++ -o minimal-client minimal-client.cc ppt-Minimal.cc
#include "ppt-Minimal.hh"
int main() {
int acount = 0;
while (1) {
ppt::Minimal::first record;
// collect timestamp of when this starts.
record.snapshot_duration_start();
// snapshot performance counters
record.snapshot_events_start();
// Do anything you want here. For example, saving relevant parts of
// input, loop counts, etc.
record.a = 0xaaaa0000 + acount++;
record.b = acount - record.a;
record.c = 0xcccccccc;
// snapshot performance counters.
record.snapshot_events_end();
// snapshot timestamp.
record.snapshot_duration_end();
// save to buffer.
record.save();
}
return 0;
}
Roughly: make an instance of the frame type you want, fill it in, and then `save()` it. You can reuse the instance across `save()` calls. The members are directly accessible.
For example, if you’re processing a batch of events, you may not want to pay the overhead of repeated `snapshot_duration_start()` and `snapshot_duration_end()` calls. You can simply call one of them, and copy the member over to the other:
ppt::Minimal::first record;
bool first = true;
while (1) {
if (first) {
record.snapshot_duration_start();
first = false;
} else {
record.duration_start = record.duration_end;
}
// .. same innards as above.
record.snapshot_duration_end();
record.save();
}
You can do the same for the counters, but you may end up with a frame of distorted data if you detach and reattach (with different counters).
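The analogous trick for counters looks like this sketch. The member names come from the generated API shown earlier, but the class here is a stand-in we define for illustration, not the actual generated `ppt::Minimal::first`:

```cpp
#include <cstdint>

// Stand-in for the generated ppt::Minimal::first, reduced to its
// counter members for illustration.
struct first_stub {
    uint64_t events_0_start = 0, events_0_end = 0;
    uint64_t events_1_start = 0, events_1_end = 0;
    uint64_t events_2_start = 0, events_2_end = 0;
};

// Reuse the previous interval's end counts as the next interval's
// start, instead of paying for another counter-read system call.
void roll_counters(first_stub& rec) {
    rec.events_0_start = rec.events_0_end;
    rec.events_1_start = rec.events_1_end;
    rec.events_2_start = rec.events_2_end;
}
```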
$ ./minimal-client &
$ ppt attach -p $(pgrep minimal-client) -o output.bin
$ ppt convert output.bin
$ ls output.bin_output
minimal.csv
`ppt` becomes quite handy in two phases of your program’s lifecycle:
- During development, it provides good feedback on the implementation’s performance characteristics. This is very useful for:
  - Determining performance trade-offs.
  - Improving the performance of the system.
- During operations, it provides a good way to monitor the application:
  - Significantly faster to log than text.
  - Easier to analyze/plot.

Note that more operational support is planned, once I get around to learning (n)curses.
First, you’ll clearly have to figure out what you want to optimize. The latency in response to an event? The time through the main loop? Time to complete N items of work?
- Sort that out and come back. We’ll wait.
- Got it? Good. Here we go:
First figure out what you’re measuring:
- The performance metric: the number you want to make better. This can be, for example: frame rate (go higher!), latency (go lower!), or throughput (higher again!). It doesn’t have to be time-based. If you want to measure how many I/Os you do instead, you can do that.
- The unit of work. What does a single measurement look like? For frame rates, this would be the time for a single frame. For latency, a single time interval. For throughput, the time for a single item (or, if batching, two numbers: the number of items in the batch and the time taken for the batch).
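For the batched-throughput case, the per-item cost falls out of those two numbers. A trivial helper (ours, not part of `ppt`):

```cpp
#include <cstdint>

// Per-item cost from a batched measurement: batch size plus the
// nanoseconds the whole batch took. Illustrative helper only.
double per_item_ns(uint64_t items_in_batch, uint64_t batch_ns) {
    if (items_in_batch == 0) return 0.0;  // empty batch: nothing to report
    return static_cast<double>(batch_ns) / static_cast<double>(items_in_batch);
}
```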
Now, here’s what you want to instrument:
- The load on your program for this unit. The event you processed, some characteristic of how much data you processed in that iteration of your main loop, or the type/parameters of the item processed.
- The effort expended on this unit. This is where you do most of your
instrumentation. Things to record:
- How many iterations of each loop you run
- Which major branches (if conditions) you take
- Key performance counters
- Mostly we’re talking about cache misses
- Synchronization overhead
- How long you spent waiting for a mutex, for example.
Generally, you can start off with a rough breakdown of where you’re spending your effort, and drill down with more instrumentation once you see where the effort’s really going.
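As a sketch, the guidance above might turn into a spec like the following, using the syntax from the earlier example. The frame and member names here are made up; pick ones that match your program:

```
emit C++;
buffer MainLoop 512;
option time timespec realtime;
frame work_item {
    int input_bytes;
    int loop_iters;
    int took_slow_path;
    interval time duration;
    interval counter events;
}
```

Here `input_bytes` records the load, `loop_iters` and `took_slow_path` record the effort, and the interval members capture the time taken and whichever counters you choose at attach time.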
When optimizing a program, you can’t be sure that you’ve actually improved the performance without a benchmark for comparison. This doesn’t have to be hard; it can be set up like a unit test.
- Take some input that’s characteristic of reality.
- Run it through your code.
- Collect results.
Run before/after each change you want to compare, and you can tell if you’re doing better. As to how many times you want to run it: generally run it a bunch of times to get clean data, then investigate why it varies.
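Such a harness can be as small as this sketch. The function-pointer parameter and the input stand in for your own code; timing uses the C++ steady clock:

```cpp
#include <chrono>
#include <cstdint>
#include <vector>

// Unit-test-style benchmark: run a representative input through the code
// under test several times and collect wall-clock timings, so a
// before/after comparison is just two runs of this harness.
std::vector<double> bench(int runs,
                          uint64_t (*process)(const std::vector<int>&),
                          const std::vector<int>& input) {
    std::vector<double> seconds;
    for (int i = 0; i < runs; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        volatile uint64_t sink = process(input);  // keep the call from being elided
        (void)sink;
        auto t1 = std::chrono::steady_clock::now();
        seconds.push_back(std::chrono::duration<double>(t1 - t0).count());
    }
    return seconds;  // run it a bunch of times, then look at why it varies
}
```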
Then repeatedly go back, change your instrumentation, and re-benchmark until you can predict the performance of slightly different code or input loads. Once you have a real mapping between your load, your implementation, and its performance, you can start to alter the code to perform more like how you want it.
`ppt decode` emits CSV files, one per frame type. You can use Jupyter, R, or Excel to great success with that data. Yes, other output formats would be good too; I just haven’t had a use case to write more. CSV just keeps working, no matter how much it kinda sucks.
This intended use case doesn’t have the desired support in `ppt` yet. Generally, you can define separate buffers for monitoring, and in the future `ppt` should have a monitor mode that presents a live-decoded version of the data in such a buffer.

The essential issue is that we need to present the monitor data in a way that scales up easily. This may just be `ppt` emitting JSON on stdout.
- `ppt` is only maintained for x86_64 Linux. Not very portable, I know. It used to be used mostly on Solaris, which doesn’t really count. If anyone asks why I still call it portable: I do almost all the development for it while on public transportation.
- `ppt` attaches to a running process via `ptrace(2)`, which means that you can’t be debugging the process at the same time. As `ppt` only uses `ptrace(2)` when attaching and detaching, you can start the process, have `ppt` attach to it, and then have `gdb` attach to it. It should work, but isn’t exactly convenient.
- C++ code gen only. It’s what I use, so it’s what I developed this for. As long as the generated code follows the same format and protocol, and we can emit the same symbols into the executable, other languages should be straightforward.
  - Back when this was used on Solaris, the generator was C-based. It wouldn’t be hard to bring it back if it were useful.
  - VM-based languages are probably not as straightforward.
  - python3 support is work in progress.
- `save()` checks to see if a `ppt` process is attached or has just detached. In that case, it takes a few system calls to set up / tear down the internal buffering and performance counter measurement apparatus. See the generated code (it’s readable!) for details.
- The data transfer between `ppt` and the instrumented process is lossy.
  - You can detect loss by watching sequence numbers in the frames.
  - You can increase the buffer size to reduce loss, but you incur more cache misses that way. On the other hand, a larger buffer means `ppt` doesn’t have to wake up as often to get data; if you have a lot of cache churn anyway, you may prefer to keep that CPU core idle more often.
  - Perhaps non-temporal stores could be used in the future to mitigate this.
  - This is the price to pay to prevent the process-under-observation from blocking when the reader falls behind.
- Performance counters are limited to 3 at a time, and require a system call per use. There is experimental code to avoid the system call interface when the counters are architectural and can be read with `rdpmc` directly.
  - By experimental, I mean that it does what I think it’s supposed to, but still seems to crash the process a lot. Suggestions, pull requests, etc., are more than welcome.