Paired Benchmarks #18

Open
nvzqz opened this issue Oct 30, 2023 · 5 comments
Labels
enhancement (New feature or request)

Comments

@nvzqz
Owner

nvzqz commented Oct 30, 2023

Paired benchmarking spreads measurement noise evenly across the benchmarks being compared, so it largely cancels out when taking the difference. It is the approach used in Tango.

The general algorithm for measuring performance using paired benchmarking is as follows:

  1. Prepare the same input data for both the baseline and candidate algorithms.
  2. Execute the baseline algorithm and measure its execution time.
  3. Execute the candidate algorithm and measure its execution time.
  4. Record the difference in runtime between the baseline and candidate algorithms.
  5. These steps constitute a single run, which serves as a distinct benchmark observation. Multiple runs are then performed, with each subsequent run employing a new payload and a randomized order in which the baseline and candidate functions are executed. Randomization is necessary to account for different CPU states and cache effects.
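
As a concrete illustration of steps 1–5, here is a minimal, self-contained Rust sketch (not Divan's or Tango's actual code; `paired_run`, `make_input`, `baseline`, and `candidate` are hypothetical names, and the alternating-order flag stands in for real randomization):

```rust
use std::time::Instant;

/// One paired run: time `baseline` and `candidate` on the same input,
/// in a (pseudo-)randomized order, and return baseline_ns - candidate_ns.
fn paired_run<T>(
    make_input: impl Fn() -> T,
    baseline: impl Fn(&T),
    candidate: impl Fn(&T),
    run_baseline_first: bool,
) -> i128 {
    // 1. Same input data for both algorithms.
    let input = make_input();

    let time = |f: &dyn Fn(&T)| {
        let start = Instant::now();
        f(&input);
        start.elapsed().as_nanos() as i128
    };

    // 2./3. Execute both, in randomized order to account for CPU state and cache effects.
    let (b, c) = if run_baseline_first {
        let b = time(&baseline);
        let c = time(&candidate);
        (b, c)
    } else {
        let c = time(&candidate);
        let b = time(&baseline);
        (b, c)
    };

    // 4. The recorded observation is the difference, not the absolute times.
    b - c
}

fn main() {
    // 5. Many runs, each with a fresh payload and a new execution order.
    let mut diffs: Vec<i128> = Vec::new();
    for i in 0..1_000 {
        let diff = paired_run(
            || (0..10_000u64).collect::<Vec<_>>(),
            |v: &Vec<u64>| { std::hint::black_box(v.iter().sum::<u64>()); },
            |v: &Vec<u64>| { std::hint::black_box(v.iter().fold(0u64, |a, x| a + x)); },
            i % 2 == 0, // stand-in for real randomization
        );
        diffs.push(diff);
    }
    diffs.sort_unstable();
    println!("median diff: {} ns", diffs[diffs.len() / 2]);
}
```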

The advantages of using paired benchmarking include:

  • The difference metric is less susceptible to ephemeral biases in the system, as common-mode biases are eliminated.
  • Differences in means typically have lower variance since both algorithms are tested on the same input.
  • It becomes easier to identify and eliminate outliers that are not caused by algorithm changes, thereby enhancing test sensitivity.
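
As a side note on the variance point: if X and Y are the two measured times in a single run, the recorded observation is X - Y, and Var(X - Y) = Var(X) + Var(Y) - 2 Cov(X, Y). Because both measurements share the same input and nearly the same machine state, Cov(X, Y) is typically positive, so the difference has lower variance than it would for two independent measurements.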

@bazhenov, I would love to collaborate on what this approach would look like in Divan. 🙂

@nvzqz added the enhancement (New feature or request) label Oct 30, 2023
@nvzqz pinned this issue Oct 30, 2023
@mert-kurttutan

Hi, this feature can be really fruitful for a project of mine.
But I am also interested in comparing more than 2 versions of the same function.
Just putting some ideas out there: is there a possibility of generalizing paired benchmarks to N-way benchmarks, where N is the number of versions of the function to compare?

@bazhenov

is there a possibility of generalizing paired benchmarks to N-way benchmarks

It is possible to build N-1 pairwise measurements, if that solves your problem. Which leads me to the following:

But I am also interested in comparing more than 2 versions of the same function.

It's an interesting use case. Can you elaborate more on the task you are trying to solve?

@mert-kurttutan

My case: 1 Rust version (= my implementation) + 2 C versions (with different algorithms) of a performance-critical function.
I am also measuring how the performance scales with the input size.

In the case of N-1 pairs it seems OK, but it also means running the inner versions twice as often.
Let N_0, N_1, ..., N_n be different implementations of the same function; here the inner versions would be N_1, ..., N_{n-1}.

@nvzqz
Owner Author

nvzqz commented Nov 21, 2023

I'm convinced that the best way to implement this today would be to abuse async/await to get coroutines, so that we can swap out the user's stack between benchmarks. This would allow Bencher::bench to not require a 'static lifetime, whereas boxing and storing Box<dyn Fn() -> impl Future> would require a 'static bound on the closure. I think this is feasible, but it unfortunately requires rewriting core components spanning from the Divan entry point through the Benchers passed to each benchmark.
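
For illustration only (this is not Divan's implementation, just a minimal sketch of the coroutine trick under discussion): each benchmark body is written as an `async fn` that yields after every timed iteration, and a hand-rolled driver polls the two futures alternately with a no-op waker, so their iterations interleave without any `'static` bound on the closures.

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};
use std::time::{Duration, Instant};

/// A future that returns `Pending` once, handing control back to the driver.
struct YieldNow(bool);

impl Future for YieldNow {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            Poll::Pending
        }
    }
}

/// One benchmark body as a coroutine: time an iteration, record it, yield.
async fn bench_coroutine(mut f: impl FnMut(), iters: usize, samples: &mut Vec<Duration>) {
    for _ in 0..iters {
        let start = Instant::now();
        f();
        samples.push(start.elapsed());
        YieldNow(false).await; // suspend so the other benchmark gets a turn
    }
}

// A minimal no-op waker so the futures can be polled without an async runtime.
unsafe fn vt_clone(_: *const ()) -> RawWaker { noop_raw_waker() }
unsafe fn vt_noop(_: *const ()) {}
static VTABLE: RawWakerVTable = RawWakerVTable::new(vt_clone, vt_noop, vt_noop, vt_noop);
fn noop_raw_waker() -> RawWaker { RawWaker::new(std::ptr::null(), &VTABLE) }

fn main() {
    let (mut a, mut b) = (Vec::new(), Vec::new());
    {
        // The closures borrow local data; nothing here needs a 'static bound.
        let mut fut_a = Box::pin(bench_coroutine(
            || { std::hint::black_box((0u64..10_000).sum::<u64>()); }, 100, &mut a));
        let mut fut_b = Box::pin(bench_coroutine(
            || { std::hint::black_box((1u64..20).product::<u64>()); }, 100, &mut b));

        let waker = unsafe { Waker::from_raw(noop_raw_waker()) };
        let mut cx = Context::from_waker(&waker);
        let (mut done_a, mut done_b) = (false, false);
        // Drive both coroutines alternately: one timed iteration each per lap.
        while !(done_a && done_b) {
            if !done_a { done_a = fut_a.as_mut().poll(&mut cx).is_ready(); }
            if !done_b { done_b = fut_b.as_mut().poll(&mut cx).is_ready(); }
        }
    }
    println!("collected {} + {} interleaved samples", a.len(), b.len());
}
```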

But I am also interested in comparing more than 2 versions of the same function.

I don't see why we would be limited to 2 benchmarks. We could theoretically interleave an arbitrary number of benchmark runs.

@bazhenov

bazhenov commented Jan 31, 2024

Sorry for the late reply. After several months of experience with different approaches in tango, I think I have more or less a complete approach to paired testing.

The way it works is the following. Each tango benchmark executable is a mixed object (a binary and a shared library at the same time). The binary part contains the tango benchmark runner, and the shared-library part (FFI) contains the benchmark registry. Therefore the runner can load benchmarks from itself as well as from any other tango binary. This way several benchmarks can be run simultaneously.
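
A rough sketch of the mixed-object pattern might look like the following. This is not Tango's actual SPI: the `spi_*` symbol names, the `./baseline/libbench.so` path, the use of the `libloading` crate, and the assumption that the crate is built with both `bin` and `cdylib` crate types are all illustrative guesses.

```rust
// --- exported half: benchmark registry behind a small C ABI ---

fn bench_sum() -> u64 { (0u64..10_000).sum() }
fn bench_product() -> u64 { (1u64..20).product() }

static BENCHMARKS: &[(&str, fn() -> u64)] = &[
    ("sum", bench_sum),
    ("product", bench_product),
];

#[no_mangle]
pub extern "C" fn spi_benchmark_count() -> usize {
    BENCHMARKS.len()
}

/// Run benchmark `index` for `iters` iterations; return total elapsed nanoseconds.
#[no_mangle]
pub extern "C" fn spi_run_benchmark(index: usize, iters: u64) -> u64 {
    let f = BENCHMARKS[index].1;
    let start = std::time::Instant::now();
    for _ in 0..iters {
        std::hint::black_box(f());
    }
    start.elapsed().as_nanos() as u64
}

// --- runner half: load another build exposing the same symbols ---
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A build of the same interface from another branch/commit (path is hypothetical).
    let other = unsafe { libloading::Library::new("./baseline/libbench.so")? };
    let their_run: libloading::Symbol<unsafe extern "C" fn(usize, u64) -> u64> =
        unsafe { other.get(b"spi_run_benchmark")? };

    // Paired loop: interleave our build and the loaded build, record differences.
    let mut diffs = Vec::new();
    for _ in 0..100 {
        let ours = spi_run_benchmark(0, 1_000);
        let theirs = unsafe { their_run(0, 1_000) };
        diffs.push(ours as i64 - theirs as i64);
    }
    diffs.sort_unstable();
    println!("median diff: {} ns", diffs[diffs.len() / 2]);
    Ok(())
}
```

Since the two halves only have to agree on a small C-ABI surface, the loaded library can come from a different branch, compiler version, or even language, which is what enables the scenarios listed below.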

IMO this approach solves a lot of different problems:

  • this approach allows us to test the difference in performance not only between different algorithms in the same code, but also between different versions of the same algorithm. This is very useful for performance regression testing, which is, I believe, by far the most prevalent kind of performance testing
  • it is relatively easy to integrate into CI pipelines, because benchmarks can be built independently from different VCS branches
  • it allows testing the effect of non-code factors on performance: compiler/linker options, different compiler versions, etc.
  • inter-language tests – binaries are not required to be written in the same language as long as they implement the same SPI in a compatible way
  • it allows the same binary to be used for both classical (pointwise) and paired testing. For classical tests, a single binary is enough

[Attachment: Untitled Diagram (drawio)]
