Improving Python benchmarking tooling #1219
Here's my own wishlist/list of concerns with asv:

My biggest issue with asv is that it is very tied to one specific use case, namely, running a suite of benchmarks across a history of commits for a project and analysing the history of runtimes. When you stay within asv's designed use case, it works very well. However, it is very difficult to do anything that strays even a little bit from that use case. Examples of things that are difficult (or even impossible) to do currently with asv:
More broadly speaking, I think it's important to understand why we might want to have a benchmark. There are many different use cases for a benchmark. They can
asv is designed around use case 1, but it is very difficult or impossible to use it for the other use cases.

I will say that there are features of asv that I do like. I like that you can just set it running and it more or less just does what it should (although there are a few speed bumps here, like the fact that the percentage it shows while it runs isn't accurate). And I like that it produces nice static graphs that you can easily share on the web.

Outside of asv, but related to this discussion, I think that benchmarking CI hardware is a major problem. Running benchmarks on traditional CI is something that doesn't work well. We have been doing it on SymPy and it's so flaky that I (at least) generally just ignore it. For example, here are the runs on a recent PR of mine that only touches documentation (i.e., it cannot possibly affect any benchmarks), which show multiple benchmarking differences. It's hard to tell whether these are due to the benchmarks themselves being flaky or the hardware having inconsistent timing. Also, the general format of the CI output isn't very useful. It would be nice to have a more standardized system that produces nicer outputs.

Again, I don't know what the solution to these concerns should look like yet, whether we should try to expand asv's capabilities or build new tools (or use existing tools that I'm not yet familiar with), or both. For now, I just want to make sure that all the needs of the community are expressed so we can decide what is most important, and then figure out how to achieve it.
I would like to add that it would be nice to include a 'pytest-benchmark' study to see if that could replace asv in certain cases.
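For reference, a benchmark in pytest-benchmark is just a pytest test that uses its benchmark fixture. A minimal sketch (the expand call is only an illustrative stand-in, not one of the real benchmarks):

```python
# Run with `pytest` after installing pytest-benchmark; the `benchmark`
# fixture repeatedly times the callable it is given and reports statistics
# alongside the normal test results.
from sympy import expand, symbols


def test_expand(benchmark):
    x, y = symbols("x y")
    benchmark(expand, (x + y) ** 10)
```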
I agree with your points, and personally I think it'd help a lot if we had three somewhat independent components instead of asv as it is now:
The main problem here is that asv is mostly abandoned and has zero maintainers or contributors right now.
Just to be clear, one of the outcomes of the discussions at the summit was that we might be able to fix this and have some people work on this. But if we do so, it would be good to have a more guided effort. Most of the efforts that projects have put into benchmarking so far have been specific to their own project, and so have always been things that either can't be reused at all or are so tied to a specific use case that they are hard to generalize. That's why I want to start by gathering a list of the current needs so we can try to figure out what a good set of Python benchmarking tools should look like.
Is there a summary of the discussion that already started? What are some of the pain points from others in the NumFOCUS summit, and where was it felt energy should be invested?
We had a discussion about this today at our nasa-roses monthly project meeting which had representation from Numpy/SciPy/Pandas/SciKit-Learn + @asmeurer. There is consensus that several projects want to work on this in a coordinated way. We decided that opening this issue was a good first step towards coordinating. @jreback has a similar list of pain points and will be contributing them in this issue.
@wkerzendorf did anyone take notes at the NumFOCUS session? All of the pain points from the discussion that I can remember are mentioned in my comment above. I think the main takeaway from the discussion was that we should try to collaborate across projects to improve things, rather than continuing to do project-specific workarounds (hence this issue).
From @jbrockmendel, an asv wishlist:
You're doing a great job of highlighting the mantra that "benchmarking is hard"! It can mean so many different things, and in every case there are many pitfalls to avoid if your results are to be 'true'. If ASV overcame all these challenges it would be even more complex than it is now. I'd recommend separating concerns into:
ASV's modules already help with this, but there is no distinction at a user level. There are probably many ways to improve this (the most extreme version being a core package and plugin packages).
I'm not a dev on any of the Python scientific libraries, but I really rated asv's visualisations when prototyping/hacking something to work with a C++ codebase I worked on a few years ago. I always thought it'd be a cracking tool for benchmarking in general, and having to force things through Python was a shame. I'd love to contribute to this if there's a roadmap of some kind, particularly around standardisation of the JSON/inputs to the visual 'backend' of asv.
I think that you are just misinterpreting the output there. The "PR vs master" section shows no changes. However, the "master vs previous release" section (correctly) shows that some benchmarks are now running faster as a result of improvements that are not yet released. You can see the benchmark results for the PR that actually made those improvements here: sympy/sympy#23821 (comment)

There is some variability in timings, but I actually think that running the benchmarks in CI works fairly well for SymPy for much of the benchmark suite. The main problem with them for SymPy is that many operations are cached, so sometimes the benchmark results report something like a 50% slowdown, but it's actually a 50% slowdown on a cached result, where the actual time to do something once is much longer than the time reported by asv.
To me the biggest limitation of asv for SymPy is that I want to write benchmarks that can be shared across multiple projects in order to compare timings for different software, e.g. if there is a benchmark for a particular operation then I want to be able to reuse it for SymPy, SAGE, Pari, Julia, Maxima etc. Basically I don't want the benchmarks themselves to be written as Python code because I want to be able to use them with/from software that doesn't involve Python at all.

This actually extends beyond benchmarking to unit tests as well. There is no real need for unit tests and benchmarks to be different things. They are all just examples of things that can be done with the software. If SymPy's extensive unit test suite were usable to report detailed timing information on each operation then that would be a huge source of benchmarking information, but the unit tests are also just written in pytest style.
One problem we have with the

I can't decide if it is more reasonable to ask the C devs to allow the Python portion of the project to have its own separate build system that also builds/links the C code, or if it would be a fairly minor thing to provide a custom set of build commands to asv.

One other thing is that
This sounds like a similar sort of problem to the one I described above, where I needed to hack around asv's inability to install a different version of a dependency depending on the commit. I think the build stage needs to be much more customizable. Right now it makes some pretty hard assumptions about how the project is built/installed into a virtual environment and how that virtual environment is cached across runs. I'd also like for virtualenv isolation to be completely separated as a higher-level step from actually running the benchmarks, so that you can just "run" the benchmarks against the dev code with the current Python (similar to just running pytest vs. isolating the tests with something like tox).
A pattern I've noticed reading this over is that people are pretty happy writing benchmark suites with asv (except some nits around parameterized benchmarks), and with asv running in the background on CI generating benchmarks over time as a project evolves, but are unhappy using asv for other benchmarking tasks. In the past I wrote a set of benchmarks for my unyt library based on the

I wonder how much work it would be to replace asv's multiprocessing-based bespoke benchmarking with shell calls to pyperf.

Like others in this thread, I also find asv's codebase hard to grok, with lots of long function implementations that are tied to individual asv command-line options. It might also make sense to write new tools for writing a benchmark suite that runs pyperf under the hood (e.g. pyperformance does this for the Python benchmarks, but is not a general tool) and a tool for running a benchmark suite over a project's history, but given the number of downstream users of asv it might be pragmatic to keep user-visible changes minimal and instead try to refactor asv to make it more approachable.
Thanks Nathan. Maybe pyperf is the answer to the "just run a single benchmark" use case that is so hard with asv right now. I haven't had a chance to use pyperf before, but I like at least in principle some of the features (like the tuning feature).

Just to take one of the use cases that is currently not so easy with pure asv: how hard is it to take an existing benchmark (or set of benchmarks) from an asv-style benchmark suite and run them against some uncommitted development changes for a library? Does this usage already work out of the box, or would it require some changes to the benchmarking suite, or some new code that wraps pyperf?

I'm also curious how much of the asv benchmark-running internals are worth keeping and how much is already implemented by pyperf (things like isolating benchmark runs, doing proper statistics, and so on). By the way, I just now realized for the first time from reading your comment that pyperf and pyperformance are two separate projects.
Not trivial, but not all that bad really, at least for a simple benchmark. For example, here's a pyperf-based benchmark that uses one of the sympy benchmarks:
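A minimal sketch of that approach, with a simple expand call standing in for the actual function from the sympy benchmark suite:

```python
import pyperf

from sympy import expand, symbols


def bench_expand():
    # Stand-in for one of the functions in the sympy benchmark suite.
    x, y = symbols("x y")
    expand((x + y) ** 10)


runner = pyperf.Runner()
# Runner spawns worker processes, calibrates the number of loops, and
# reports mean +- standard deviation for the wrapped callable.
runner.bench_func("sympy_expand", bench_expand)
```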
To run against a development build of sympy, you'd simply install pyperf and the sympy development version into the Python environment you're working in and run pyperf using that environment. There are a few issues: this API doesn't support setup or teardown functions. There also doesn't appear to be a way to run pyperf from the command line by referring to a Python function; the CLI only accepts statements. I guess currently if you wanted to run some setup code you'd need to manually run it in the script first?
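One partial workaround, if string-based setup is acceptable, is pyperf's timeit-style API, which does take a setup statement:

```python
import pyperf

runner = pyperf.Runner()
# Runner.timeit measures a statement, running the setup code in each worker
# process before timing starts (similar to the stdlib timeit module).
runner.timeit(
    "sympy_expand",
    stmt="expand((x + y) ** 10)",
    setup="from sympy import expand, symbols; x, y = symbols('x y')",
)
```

The command line has a similar shape: python -m pyperf timeit accepts a --setup statement, but still nothing that refers to a Python function directly.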
I think this is mostly all implemented in pyperf but haven't done a detailed comparison.
Another thing that I think has only been indirectly referenced here is the ability to specify whether a benchmark can safely be rerun in the same process, or whether a benchmark needs to be rerun in a new process each time, which is also somewhat related to specifying whether
At the NumFOCUS summit a few weeks ago, several people had a conversation about the limitations of asv and the other Python benchmarking tooling. I'd like to use this issue to kick off the discussion about this so that we can gather a wishlist and a list of problems with the existing tooling, and figure out a plan on how to improve things. It's not clear yet whether that will mean improving asv, some other tool, or creating new tools.
Some people who I want to make sure are involved in this discussion (please feel free to CC others):
@jreback
@oscarbenjamin
@wkerzendorf
@jarrodmillman