Here are some recent and important revisions. Complete list of results.
Key: 📄: table, 📈: time plot, 🧠: memory plot
Most recent pystats on main (f6cc7c8)
date | fork/ref | hash/flags | vs. 3.10.4: | vs. 3.12.0: | vs. 3.13.0: | vs. base: |
---|---|---|---|---|---|---|
2024-10-26 | python/f6cc7c8bd01d8468af70 | f6cc7c8 | 1.22x 📄📈 | 1.05x 📄📈 | 1.08x 📄📈 | |
2024-10-26 | python/f6cc7c8bd01d8468af70 | f6cc7c8 (JIT) | 1.10x 📄📈 | 1.17x 📄📈 | 1.19x 📄📈 | 1.11x 📄📈🧠 |
2024-10-26 | python/f6cc7c8bd01d8468af70 | f6cc7c8 (NOGIL) | 1.26x 📄📈 | 1.61x 📄📈 | 1.63x 📄📈 | 1.53x 📄📈🧠 |
2024-10-07 | python/v3.13.0 | 60403a5 | 1.28x 📄📈 | 1.01x 📄📈 | | |
date | fork/ref | hash/flags | vs. 3.10.4: | vs. 3.12.0: | vs. 3.13.0: | vs. base: |
---|---|---|---|---|---|---|
2024-10-29 | faster-cpython/use_stackrefs | 7e6deef | 1.52x 📄📈 | 2.00x 📄📈 | 2.11x 📄📈 | |
2024-10-26 | python/f6cc7c8bd01d8468af70 | f6cc7c8 | 1.38x 📄📈 | 1.05x 📄📈 | 1.02x 📄📈 | |
2024-10-26 | python/f6cc7c8bd01d8468af70 | f6cc7c8 (JIT) | 1.37x 📄📈 | 1.04x 📄📈 | 1.03x 📄📈 | 1.01x 📄📈🧠 |
2024-10-07 | python/v3.13.0 | 60403a5 | 1.36x 📄📈 | 1.06x 📄📈 | | |
date | fork/ref | hash/flags | vs. 3.10.4: | vs. 3.12.0: | vs. 3.13.0: | vs. base: |
---|---|---|---|---|---|---|
2024-10-29 | faster-cpython/more_untracking | 1746ca4 | 1.23x 📄📈 | 1.06x 📄📈 | 1.04x 📄📈 | |
2024-10-26 | python/f6cc7c8bd01d8468af70 | f6cc7c8 | 1.23x 📄📈 | 1.06x 📄📈 | 1.06x 📄📈 | |
2024-10-26 | python/f6cc7c8bd01d8468af70 | f6cc7c8 (JIT) | 1.19x 📄📈 | 1.09x 📄📈 | 1.09x 📄📈 | 1.03x 📄📈🧠 |
2024-10-07 | python/v3.13.0 | 60403a5 | 1.28x 📄📈 | 1.00x 📄📈 | | |
date | fork/ref | hash/flags | vs. 3.10.4: | vs. 3.12.0: | vs. 3.13.0: | vs. base: |
---|---|---|---|---|---|---|
2024-10-26 | python/f6cc7c8bd01d8468af70 | f6cc7c8 | 1.15x 📄📈 | 1.03x 📄📈 | 1.07x 📄📈 | |
2024-10-26 | python/f6cc7c8bd01d8468af70 | f6cc7c8 (JIT) | 1.19x 📄📈 | 1.01x 📄📈 | 1.04x 📄📈 | 1.03x 📄📈 |
2024-10-25 | brandtbucher/justin_no_externs | 5791853 (JIT) | 1.18x 📄📈 | 1.00x 📄📈 | 1.05x 📄📈 | 1.02x 📄📈 |
2024-10-07 | python/v3.13.0 | 60403a5 | 1.22x 📄📈 | 1.07x 📄📈 | | |
date | fork/ref | hash/flags | vs. 3.10.4: | vs. 3.12.0: | vs. 3.13.0: | vs. base: |
---|---|---|---|---|---|---|
2024-10-26 | python/f6cc7c8bd01d8468af70 | f6cc7c8 | 1.11x 📄📈 | 1.11x 📄📈 | 1.00x 📄📈 | |
2024-10-26 | python/f6cc7c8bd01d8468af70 | f6cc7c8 (JIT) | 1.18x 📄📈 | 1.20x 📄📈 | 1.07x 📄📈 | 1.06x 📄📈 |
2024-10-25 | brandtbucher/justin_no_externs | 5791853 (JIT) | 1.15x 📄📈 | 1.16x 📄📈 | 1.04x 📄📈 | 1.02x 📄📈 |
2024-10-07 | python/v3.13.0 | 60403a5 | 1.09x 📄📈 | 1.13x 📄📈 | | |
date | fork/ref | hash/flags | vs. 3.10.4: | vs. 3.12.0: | vs. 3.13.0: | vs. base: |
---|---|---|---|---|---|---|
2024-10-26 | python/f6cc7c8bd01d8468af70 | f6cc7c8 | 1.29x 📄📈 | 1.08x 📄📈 | 1.08x 📄📈 | |
2024-10-26 | python/f6cc7c8bd01d8468af70 | f6cc7c8 (JIT) | 1.22x 📄📈 | 1.03x 📄📈 | 1.03x 📄📈 | 1.05x 📄📈🧠 |
2024-10-26 | python/f6cc7c8bd01d8468af70 | f6cc7c8 (NOGIL) | 1.13x 📄📈 | 1.33x 📄📈 | 1.30x 📄📈 | 1.40x 📄📈🧠 |
2024-10-07 | python/v3.13.0 | 60403a5 | 1.16x 📄📈 | 1.01x 📄📈 | | |
\* indicates that the exact same version of pyperformance was not used.
For the results above, the "faster/slower" result is the geometric mean of the speedups of the individual benchmarks. The "reliability (rel)" number is the likelihood that the change is faster or slower based on the Hierarchical Performance Testing (HPT) method. For more details, visit each individual result's README.md.
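For intuition, here is a small Python sketch of how such a geometric-mean speedup is computed from per-benchmark ratios (the ratios are made up for illustration):

```python
import math

# Hypothetical per-benchmark speedup ratios (baseline time / new time).
speedups = [1.25, 0.98, 1.10, 1.05]

# Geometric mean: the n-th root of the product, computed via logs for stability.
geomean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
print(f"{geomean:.2f}x")  # -> 1.09x
```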
Below are longitudinal timing results. There are also 🧠 longitudinal memory results.
Improvement of the HPT score of key merged benchmarks, computed with `pyperf compare`. The results have a resolution of 0.01 (1%).
- linux: Intel® Xeon® W-2255 CPU @ 3.70GHz, running Ubuntu 20.04 LTS, gcc 9.4.0
- linux2: 12th Gen Intel® Core™ i9-12900 @ 2.40 GHz, running Ubuntu 22.04 LTS, gcc 11.3.0
- linux-aarch64: ARM Neoverse N1, running Ubuntu 22.04 LTS, gcc 11.4.0
- macos: M1 arm64 Mac® Mini, running macOS 13.2.1, clang 1400.0.29.202
- windows: 12th Gen Intel® Core™ i9-12900 @ 2.40 GHz, running Windows 11 Pro (21H2, 22000.1696), MSVC v143
This is a CHANGELOG of how any derived data has changed:
- 2024-06-27: The HPT values (and the longitudinal plots that are based on them) now correctly exclude any benchmarks in `excluded_benchmarks.txt`.
Visit the benchmark action and click the "Run Workflow" button.
The available parameters are:
- `fork`: The fork of CPython to benchmark. If benchmarking a pull request, this would normally be your GitHub username.
- `ref`: The branch, tag or commit SHA to benchmark. If a SHA, it must be the full SHA, since finding it by a prefix is not supported.
- `machine`: The machine to run on. One of `linux-amd64` (default), `windows-amd64`, `darwin-arm64` or `all`.
- `benchmark_base`: If checked, the base of the selected branch will also be benchmarked. The base is determined by running `git merge-base upstream/main $ref`.
- `pystats`: If checked, collect the pystats from running the benchmarks.
To watch the progress of the benchmark, select it from the benchmark action page. It may be canceled from there as well. To show only your benchmark workflows, select your GitHub ID from the "Actor" dropdown.
When the benchmarking is complete, the results are published to this repository and will appear in the master table. Each set of benchmarks will have:
- The raw `.json` results from pyperformance.
- Comparisons against important reference releases, as well as the merge base of the branch if `benchmark_base` was selected. These include:
  - A markdown table produced by `pyperf compare_to`.
  - A set of "violin" plots showing the distribution of results for each benchmark.
  - A set of plots showing the memory change for each benchmark (for immediate bases only, on non-Windows platforms).
The most convenient way to get results locally is to clone this repo and `git pull` from it.
To automate benchmarking runs, it may be more convenient to use the GitHub CLI.
Once you have `gh` installed and configured, you can run benchmarks by cloning this repository and then, from inside it:

```
$ gh workflow run benchmark.yml -f fork=me -f ref=my_branch
```

Any of the parameters described above are available on the command line using the `-f key=value` syntax.
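For example, a run that benchmarks a branch on all machines and also benchmarks its merge base might look like this (the fork and branch names are placeholders):

```
$ gh workflow run benchmark.yml -f fork=me -f ref=my_branch -f machine=all -f benchmark_base=true
```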
To collect Linux perf sampling profile data for a benchmarking run, run the `_benchmark` action and check the `perf` checkbox.
If the default comparisons generated by this tool aren't sufficient, you can check out the repo and use the same infrastructure to generate any arbitrary comparison.
Check out a local copy of this repo:
```
$ git clone https://github.com/faster-cpython/benchmarking-public
```
Create a new virtual environment, activate it and install the dependencies into it:
```
$ cd benchmarking-public
$ python -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt
```
Run `bench_runner`'s `compare` tool:
```
usage: [-h] --output-dir OUTPUT_DIR [--type {1:n,n:n}] commit [commit ...]

Generate a set of comparisons between arbitrary commits. The commits
must already exist in the dataset.

positional arguments:
  commit                Commits to compare. Must be a git commit hash prefix. May optionally have a
                        friendly name after a comma, e.g. c0ffee,main. If ends with a "T", use the
                        Tier 2 run for that commit. If ends with a "J", use the JIT run for that
                        commit. If ends with a "N", use the NOGIL run for that commit.

options:
  -h, --help            show this help message and exit
  --output-dir OUTPUT_DIR
                        Directory to output results to.
  --type {1:n,n:n}      Compare the first commit to all others, or do the full product of all commits
```
For example:

```
$ python -m bench_runner compare e418fc3,default e418fc3J,jit --output comparison --type 1:n
```
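Following the suffix convention described in the help text, something like the following should compare the default, JIT, and free-threaded (NOGIL) runs of a single commit (the hash is purely illustrative):

```
$ python -m bench_runner compare c0ffee,default c0ffeeJ,jit c0ffeeN,nogil --output-dir comparison --type 1:n
```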
The infrastructure to make all of this work is the bench_runner project. Look there for more detailed developer docs.
The easiest way to reproduce what is here is to use the bench_runner project library directly, but if you want to run parts of it in a different context or better understand how the numbers are calculated, this section describes some of the things that the benchmarking infrastructure does.
These results combine benchmarks that live in the pyperformance and pyston/python-macrobenchmarks projects, so running the default set from `pyperformance` will definitely produce different results. To combine these benchmarks in the same run, clone both repos side-by-side in the same directory and use a manifest file to combine them. This file should be passed to `pyperformance run`:

```
pyperformance run --manifest benchmarks.manifest
```
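As a rough sketch only (the relative paths, and whether your copy of pyperformance supports the `[includes]` mechanism, are assumptions), such a manifest can pull in the default pyperformance suite plus the macrobenchmarks:

```
[includes]
<default>
../python-macrobenchmarks/benchmarks/MANIFEST
```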
Benchmarks and stats collection can happen in three different configurations. Here "configuration" may be a combination of both build-time and run-time flags:

- Default: A PGO build of CPython (`./configure --enable-optimizations --with-lto=yes`).
- Tier 2: The same build as above, but with the `PYTHON_UOPS` environment variable set at runtime to use the Tier 2 interpreter.
- JIT: A JIT and PGO build of CPython (`./configure --enable-optimizations --with-lto=yes --enable-experimental-jit`).

Information about the configuration of the run is in the `README.md` at the root of each run directory. The directory name will also include `PYTHON_UOPS` for Tier 2 and `JIT` for JIT.
To reduce the number of unknown variables when comparing results, runs are always compared against runs of the same configuration. Be aware that sometimes the base commit on main may predate the configuration becoming available, for example, before the JIT compiler was merged into main. (An exception to this rule is the weekly benchmarks of upstream main, where Tier 2 and JIT configurations are compared against default configurations of the same commit, but that isn't relevant for the common case of testing a pull request.)
An additional sharp edge is that, by default, `pyperformance` does not pass environment variables to the child process that actually does the work. Therefore, for a Tier 2 configuration, the `--inherit-environ=PYTHON_UOPS` flag must be passed to `pyperformance run` when running benchmarks.
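As an illustration, a Tier 2 run might look something like this (the manifest path is a placeholder):

```
$ PYTHON_UOPS=1 pyperformance run --manifest benchmarks.manifest --inherit-environ=PYTHON_UOPS
```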
For detailed information, see how configurations affect build-time flags in the GitHub Actions configuration.
Timing benchmarks are notoriously noisy. There are a few techniques to reduce this:

- Where available (on Linux), we use `pyperf tune` to set CPU affinity and other things that make the benchmarks more reproducible (see the example after this list). For this reason, we know that the benchmarks are more predictable on Linux than on the other platforms.
- `pyperf` has the concept of "warmup" runs, performed while caches are warming up and other things about the system are still stabilizing. These runs are excluded from the timing results. This is generally effective at reducing variability, but it may also exclude real work done during optimization, for example.
- We use the Hierarchical Performance Testing (HPT) method (see below) to statistically reduce the effect of benchmarks that have more variability. This is a different method than the simple geometric mean that `pyperf` uses by default. We provide both numbers in our results.
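For reference, the tuning step is exposed through `pyperf`'s `system` sub-command (run with root privileges; exactly which knobs it adjusts depends on the machine):

```
$ sudo python -m pyperf system tune    # adjust CPU frequency scaling, turbo boost, etc.
$ # ... run the benchmarks ...
$ sudo python -m pyperf system reset   # restore the original system settings
```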
`pystats` are a set of counters in CPython that measure things like the number of times each bytecode instruction is executed. (Detailed documentation of all of the counters should be added to CPython in the future.)
Collecting `pystats` requires a special build of CPython with pystats enabled (`./configure --enable-pystats`).
`pystats` must also be enabled at runtime, either using the `-X pystats` command line argument or `sys._stats_on()`. `pyperformance`/`pyperf` handles this step automatically when running on a pystats-enabled build. Stats collection is enabled during the actual benchmarking code, and disabled while running the "benchmarking harness" code in `pyperf` itself. `pyperf` has the concept of "warmup" runs, which allow things like cache lines to warm up before actually timing benchmarks. While they aren't included in the timing results, these warmup runs are included in pystats collection, since Tier 2/JIT traces are often created during warmup, and we don't want the stats to appear as if the traces ran but were not created.
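A minimal sketch of the runtime side, assuming a `--enable-pystats` build (the `sys._stats_*` helpers only exist in such builds, and `workload()` is a stand-in for the code being measured):

```python
import sys

def workload():
    # stand-in for the code actually being measured
    sum(i * i for i in range(10_000))

sys._stats_clear()   # reset any counters collected so far
sys._stats_on()      # start collecting pystats
workload()
sys._stats_off()     # stop collecting
sys._stats_dump()    # write the counters to /tmp/py_stats (or stderr if that directory doesn't exist)
```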
Any statistics collected are then dumped at exit to the `/tmp/py_stats` directory with a random filename. Lastly, the `Tools/scripts/summarize_stats.py` script (in the CPython repo) is used to read all of the files from `/tmp/py_stats` and produce a human-readable markdown summary and a JSON file with aggregate data (a rough end-to-end sketch follows the list below). Because of this design, it is imperative that:

- The `/tmp/py_stats` directory is cleared before data collection.
- No other Python processes are run that could also produce pystats data. In particular, this means benchmarks cannot run in parallel.
For more information, see the actual code to collect pystats.
Hierarchical performance testing (HPT) is a method introduced in this paper:
T. Chen, Y. Chen, Q. Guo, O. Temam, Y. Wu and W. Hu, "Statistical performance comparisons of computers," IEEE International Symposium on High-Performance Comp Architecture, New Orleans, LA, USA, 2012, pp. 1-12, doi: 10.1109/HPCA.2012.6169043.
From the abstract:
In traditional performance comparisons, the impact of performance variability is usually ignored (i.e., the means of performance measurements are compared regardless of the variability), or in the few cases where it is factored in using parametric confidence techniques, the confidence is either erroneously computed based on the distribution of performance measurements (with the implicit assumption that it obeys the normal law), instead of the distribution of sample mean of performance measurements, or too few measurements are considered for the distribution of sample mean to be normal. … We propose a non-parametric Hierarchical Performance Testing (HPT) framework for performance comparison, which is significantly more practical than standard parametric techniques because it does not require to collect a large number of measurements in order to achieve a normal distribution of the sample mean.
For each result, we compute a reliability score, as well as the estimated speedup at the 90th, 95th and 99th percentile.
The inclusion of HPT scores is considered experimental as we learn about their usefulness for decision-making.