Improving Python benchmarking tooling #1219
Here's my own wishlist/list of concerns with asv:

My biggest issue with asv is that it is very tied to one specific use case, namely, running a suite of benchmarks across a history of commits for a project and analysing the history of runtimes. When you stay within asv's designed use case, it works very well. However, it is very difficult to do anything that strays even a little bit from that use case. Examples of things that are difficult (or even impossible) to do currently with asv:
More broadly speaking, I think it's important to understand why we might want to have a benchmark. There are many different use cases for a benchmark. They can
asv is designed around use case 1, but it is very difficult or impossible to use it for the other use cases.

I will say that there are features of asv that I do like. I like that you can just set it running and it more or less just does what it should (although there are a few speed bumps here, like the fact that the percentage it shows while it runs isn't accurate). And I like that it produces nice static graphs that you can easily share on the web.

Outside of asv, but related to this discussion, I think that benchmarking CI hardware is a major problem. Running benchmarks on traditional CI is something that doesn't work well. We have been doing it on SymPy and it's so flaky that I (at least) generally just ignore it. For example, here are the runs on a recent PR of mine that only touches documentation (i.e., it cannot possibly affect any benchmarks), which show multiple benchmarking differences. It's hard to tell whether these are due to the benchmarks themselves being flaky or the hardware having inconsistent timing. Also, the general format of the CI output isn't very useful. It would be nice to have a more standardized system that produces nicer outputs.

Again, I don't know what the solution to these concerns should look like yet, whether we should try to expand asv's capabilities or build new tools (or use existing tools that I'm not yet familiar with), or both. For now, I just want to make sure that all the needs of the community are expressed so we can decide what is most important, and then figure out how to achieve it.
I would like to add that it would be nice to include a 'pytest-benchmark' study to see if that could replace asv in certain cases.
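For reference, a benchmark in pytest-benchmark is just a pytest test that uses its benchmark fixture. A minimal sketch (the expand call is only an illustrative stand-in, not one of the real benchmarks):

```python
# Run with `pytest` after installing pytest-benchmark; the `benchmark`
# fixture repeatedly times the callable it is given and reports statistics
# alongside the normal test results.
from sympy import expand, symbols


def test_expand(benchmark):
    x, y = symbols("x y")
    benchmark(expand, (x + y) ** 10)
```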
I agree with your points, and personally I think it'd help a lot if we had three somewhat independent components instead of asv as it is now:
The main problem here is that asv is mostly abandoned and has zero maintainers or contributors right now.
Just to be clear, one of the outcomes of the discussions at the summit was that we might be able to fix this and have some people work on this. But if we do so, it would be good to have a more guided effort. Most of the efforts that projects have put into benchmarking so far have been specific to their own project, and so have always been things that either can't be reused at all or are so tied to a specific use case that they are hard to generalize. That's why I want to start by gathering a list of the current needs so we can try to figure out what a good set of Python benchmarking tools should look like.
Is there a summary of the discussion that already started? What are some of the pain points from others in the NumFOCUS summit, and where was it felt energy should be invested?
We had a discussion about this today at our nasa-roses monthly project meeting which had representation from Numpy/SciPy/Pandas/SciKit-Learn + @asmeurer. There is consensus that several projects want to work on this in a coordinated way. We decided that opening this issue was a good first step towards coordinating. @jreback has a similar list of pain points and will be contributing them in this issue.
@wkerzendorf did anyone take notes at the NumFOCUS session? All of the pain points from the discussion that I can remember are mentioned in my comment above. I think the main takeaway from the discussion was that we should try to collaborate across projects to improve things, rather than continuing to do project-specific workarounds (hence this issue).
From @jbrockmendel, an asv wishlist:
You're doing a great job of highlighting the mantra that "benchmarking is hard"! It can mean so many different things, and in every case there are many pitfalls to avoid if your results are to be 'true'. If ASV overcame all these challenges it would be even more complex than it is now. I'd recommend separating concerns into:
ASV's modules already help with this, but there is no distinction at a user level. There are probably many ways to improve this (the most extreme version being a core package and plugin packages).
I'm not a dev on any of the Python scientific libraries, but I really rated asv's visualisations when prototyping/hacking something to work with a C++ codebase I worked on a few years ago. I always thought it'd be a cracking tool for benchmarking in general, and having to force things through Python was a shame. I'd love to contribute to this if there's a roadmap of some kind, particularly around standardisation of the JSON/inputs to the visual 'backend' of asv.
I think that you are just misinterpreting the output there. The "PR vs master" section shows no changes. However, the "master vs previous release" section (correctly) shows that some benchmarks are now running faster as a result of improvements that are not yet released. You can see the benchmark results for the PR that actually made those improvements here: sympy/sympy#23821 (comment)

There is some variability in timings, but I actually think that running the benchmarks in CI works fairly well for SymPy for much of the benchmark suite. The main problem with them for SymPy is that many operations are cached, so sometimes the benchmark results report something like a 50% slowdown, but it's actually a 50% slowdown on a cached result, where the actual time to do something once is much longer than the time reported by asv.
To me the biggest limitation of asv for SymPy is that I want to write benchmarks that can be shared across multiple projects in order to compare timings for different software, e.g. if there is a benchmark for a particular operation then I want to be able to reuse it for SymPy, SAGE, Pari, Julia, Maxima etc. Basically I don't want the benchmarks themselves to be written as Python code because I want to be able to use them with/from software that doesn't involve Python at all.

This actually extends beyond benchmarking to unit tests as well. There is no real need for unit tests and benchmarks to be different things. They are all just examples of things that can be done with the software. If SymPy's extensive unit test suite were usable to report detailed timing information on each operation then that would be a huge source of benchmarking information, but the unit tests are also just written in pytest style.
One problem we have with the

I can't decide if it is more reasonable to ask the C devs to allow the Python portion of the project to have its own separate build system that also builds/links the C code, or if it would be a fairly minor thing to provide a custom set of build commands to asv.

One other thing is that
This sounds like a similar sort of problem to the one I described above, where I needed to hack around asv's inability to install a different version of a dependency depending on the commit. I think the build stage needs to be much more customizable. Right now it makes some pretty hard assumptions about how the project is built/installed into a virtual environment and how that virtual environment is cached across runs. I'd also like for virtualenv isolation to be completely separated as a higher-level step from actually running the benchmarks, so that you can just "run" the benchmarks against the dev code with the current Python (similar to just running pytest vs. isolating the tests with something like tox).
A pattern I've noticed reading this over is that people are pretty happy writing benchmark suites with asv (except some nits around parameterized benchmarks), and with asv running in the background on CI generating benchmarks over time as a project evolves, but are unhappy using asv for other benchmarking tasks. In the past I wrote a set of benchmarks for my unyt library based on the

I wonder how much work it would be to replace asv's multiprocessing-based bespoke benchmarking with shell calls to pyperf.

Like others in this thread, I also find asv's codebase hard to grok, with lots of long function implementations that are tied to individual asv command-line options. It might also make sense to write new tools for writing a benchmark suite that runs pyperf under the hood (e.g. pyperformance does this for the Python benchmarks, but is not a general tool) and a tool for running a benchmark suite over a project's history, but given the number of downstream users of asv it might be pragmatic to keep user-visible changes minimal and instead try to refactor asv to make it more approachable.
Thanks Nathan. Maybe pyperf is the answer to the "just run a single benchmark" use case that is so hard with asv right now. I haven't had a chance to use pyperf before, but I like at least in principle some of the features (like the tuning feature).

Just to take one of the use cases that is currently not so easy with pure asv: how hard is it to take an existing benchmark (or set of benchmarks) from an asv-style benchmark suite and run them against some uncommitted development changes for a library? Does this usage already work out of the box, or would it require some changes to the benchmarking suite, or some new code that wraps pyperf?

I'm also curious how much of the asv benchmark-running internals are worth keeping and how much is already implemented by pyperf (things like isolating benchmark runs, doing proper statistics, and so on). By the way, I just now realized for the first time from reading your comment that pyperf and pyperformance are two separate projects.
Not trivial, but not all that bad really, at least for a simple benchmark. For example, here's a pyperf-based benchmark that uses one of the sympy benchmarks:
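A minimal sketch of that approach, with a simple expand call standing in for the actual function from the sympy benchmark suite:

```python
import pyperf

from sympy import expand, symbols


def bench_expand():
    # Stand-in for one of the functions in the sympy benchmark suite.
    x, y = symbols("x y")
    expand((x + y) ** 10)


runner = pyperf.Runner()
# Runner spawns worker processes, calibrates the number of loops, and
# reports mean +- standard deviation for the wrapped callable.
runner.bench_func("sympy_expand", bench_expand)
```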
To run against a development build of sympy, you'd simply install pyperf and the sympy development version into the Python environment you're working in and run pyperf using that environment. There are a few issues: this API doesn't support setup or teardown functions. There also doesn't appear to be a way to run pyperf from the command line by referring to a Python function; the CLI only accepts statements. I guess currently if you wanted to run some setup code you'd need to manually run it in the script first?
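One partial workaround, if string-based setup is acceptable, is pyperf's timeit-style API, which does take a setup statement:

```python
import pyperf

runner = pyperf.Runner()
# Runner.timeit measures a statement, running the setup code in each worker
# process before timing starts (similar to the stdlib timeit module).
runner.timeit(
    "sympy_expand",
    stmt="expand((x + y) ** 10)",
    setup="from sympy import expand, symbols; x, y = symbols('x y')",
)
```

The command line has a similar shape: python -m pyperf timeit accepts a --setup statement, but still nothing that refers to a Python function directly.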
I think this is mostly all implemented in pyperf but haven't done a detailed comparison.
Another thing that I think has only been indirectly referenced here is the ability to specify whether a benchmark can safely be rerun in the same process, or whether a benchmark needs to be rerun in a new process each time, which is also somewhat related to specifying whether
At the NumFOCUS summit a few weeks ago, several people had a conversation about the limitations of asv and the other Python benchmarking tooling. I'd like to use this issue to kick off the discussion about this so that we can gather a wishlist and a list of problems with the existing tooling, and figure out a plan on how to improve things. It's not clear yet whether that will mean improving asv, some other tool, or creating new tools.
Some people who I want to make sure are involved in this discussion (please feel free to CC others):
@jreback
@oscarbenjamin
@wkerzendorf
@jarrodmillman