Paired Benchmarks #18

Open
nvzqz opened this issue Oct 30, 2023 · 5 comments
Labels
enhancement (New feature or request)

Comments

@nvzqz
Owner

nvzqz commented Oct 30, 2023

Paired benchmarking spreads measurement noise evenly across the benchmarks being compared, so it largely cancels out when taking the difference. It is the approach used in Tango.

The general algorithm for measuring performance using paired benchmarking is as follows:

  1. Prepare the same input data for both the baseline and candidate algorithms.
  2. Execute the baseline algorithm and measure its execution time.
  3. Execute the candidate algorithm and measure its execution time.
  4. Record the difference in runtime between the baseline and candidate algorithms.
  5. These steps constitute a single run, which serves as a distinct benchmark observation. Multiple runs are then performed, with each subsequent run employing a new payload and a randomized order in which the baseline and candidate functions are executed. Randomization is necessary to account for different CPU states and cache effects.
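
As a concrete illustration of steps 1–5, here is a minimal, self-contained Rust sketch (not Divan's or Tango's actual code; `paired_run`, `make_input`, `baseline`, and `candidate` are hypothetical names, and the alternating-order flag stands in for real randomization):

```rust
use std::time::Instant;

/// One paired run: time `baseline` and `candidate` on the same input,
/// in a (pseudo-)randomized order, and return baseline_ns - candidate_ns.
fn paired_run<T>(
    make_input: impl Fn() -> T,
    baseline: impl Fn(&T),
    candidate: impl Fn(&T),
    run_baseline_first: bool,
) -> i128 {
    // 1. Same input data for both algorithms.
    let input = make_input();

    let time = |f: &dyn Fn(&T)| {
        let start = Instant::now();
        f(&input);
        start.elapsed().as_nanos() as i128
    };

    // 2./3. Execute both, in randomized order to account for CPU state and cache effects.
    let (b, c) = if run_baseline_first {
        let b = time(&baseline);
        let c = time(&candidate);
        (b, c)
    } else {
        let c = time(&candidate);
        let b = time(&baseline);
        (b, c)
    };

    // 4. The recorded observation is the difference, not the absolute times.
    b - c
}

fn main() {
    // 5. Many runs, each with a fresh payload and a new execution order.
    let mut diffs: Vec<i128> = Vec::new();
    for i in 0..1_000 {
        let diff = paired_run(
            || (0..10_000u64).collect::<Vec<_>>(),
            |v: &Vec<u64>| { std::hint::black_box(v.iter().sum::<u64>()); },
            |v: &Vec<u64>| { std::hint::black_box(v.iter().fold(0u64, |a, x| a + x)); },
            i % 2 == 0, // stand-in for real randomization
        );
        diffs.push(diff);
    }
    diffs.sort_unstable();
    println!("median diff: {} ns", diffs[diffs.len() / 2]);
}
```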

The advantages of using paired benchmarking include:

  • The difference metric is less susceptible to ephemeral biases in the system, as common-mode biases are eliminated.
  • Differences in means typically have lower variance since both algorithms are tested on the same input.
  • It becomes easier to identify and eliminate outliers that are not caused by algorithm changes, thereby enhancing test sensitivity.
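
As a side note on the variance point: if X and Y are the two measured times in a single run, the recorded observation is X - Y, and Var(X - Y) = Var(X) + Var(Y) - 2 Cov(X, Y). Because both measurements share the same input and nearly the same machine state, Cov(X, Y) is typically positive, so the difference has lower variance than it would for two independent measurements.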

@bazhenov, I would love to collaborate on what this approach would look like in Divan. 🙂

@nvzqz added the enhancement (New feature or request) label Oct 30, 2023
@nvzqz pinned this issue Oct 30, 2023
@mert-kurttutan

Hi, this feature can be really fruitful for a project of mine.
But I am also interested in comparing more than 2 versions of the same function.
Just putting some ideas out there: is there a possibility of generalizing paired benchmarks to N-way benchmarks, where N is the number of versions of the function to compare?

@bazhenov

is there a possibility of generalizing paired benchmarks to N-way benchmarks

It is possible to build N-1 pairwise measurements, if that solves your problem. Which leads me to the following:

But I am also interested in comparing more than 2 versions of the same function.

It's an interesting use case. Can you elaborate more on the task you are trying to solve?

@mert-kurttutan

My case: 1 Rust version (= my implementation) + 2 C versions (with different algorithms) of a performance-critical function.
I am also measuring how the performance scales with the input size.

In the case of N-1 pairs it seems OK, but it also means running the inner versions twice as often.
Let N_0, N_1, ..., N_n be different implementations of the same function; here the inner versions would be N_1, ..., N_{n-1}.

@nvzqz
Owner Author

nvzqz commented Nov 21, 2023

I'm convinced that the best way to implement this today would be to abuse async/await to get coroutines, so that we can swap out the user's stack between benchmarks. This would allow Bencher::bench to not require a 'static lifetime, whereas boxing and storing Box<dyn Fn() -> impl Future> would require a 'static bound on the closure. I think this is feasible, but it unfortunately requires rewriting core components spanning from the Divan entry point through the Benchers passed to each benchmark.
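
For illustration only (this is not Divan's implementation, just a minimal sketch of the coroutine trick under discussion): each benchmark body is written as an `async fn` that yields after every timed iteration, and a hand-rolled driver polls the two futures alternately with a no-op waker, so their iterations interleave without any `'static` bound on the closures.

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};
use std::time::{Duration, Instant};

/// A future that returns `Pending` once, handing control back to the driver.
struct YieldNow(bool);

impl Future for YieldNow {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            Poll::Pending
        }
    }
}

/// One benchmark body as a coroutine: time an iteration, record it, yield.
async fn bench_coroutine(mut f: impl FnMut(), iters: usize, samples: &mut Vec<Duration>) {
    for _ in 0..iters {
        let start = Instant::now();
        f();
        samples.push(start.elapsed());
        YieldNow(false).await; // suspend so the other benchmark gets a turn
    }
}

// A minimal no-op waker so the futures can be polled without an async runtime.
unsafe fn vt_clone(_: *const ()) -> RawWaker { noop_raw_waker() }
unsafe fn vt_noop(_: *const ()) {}
static VTABLE: RawWakerVTable = RawWakerVTable::new(vt_clone, vt_noop, vt_noop, vt_noop);
fn noop_raw_waker() -> RawWaker { RawWaker::new(std::ptr::null(), &VTABLE) }

fn main() {
    let (mut a, mut b) = (Vec::new(), Vec::new());
    {
        // The closures borrow local data; nothing here needs a 'static bound.
        let mut fut_a = Box::pin(bench_coroutine(
            || { std::hint::black_box((0u64..10_000).sum::<u64>()); }, 100, &mut a));
        let mut fut_b = Box::pin(bench_coroutine(
            || { std::hint::black_box((1u64..20).product::<u64>()); }, 100, &mut b));

        let waker = unsafe { Waker::from_raw(noop_raw_waker()) };
        let mut cx = Context::from_waker(&waker);
        let (mut done_a, mut done_b) = (false, false);
        // Drive both coroutines alternately: one timed iteration each per lap.
        while !(done_a && done_b) {
            if !done_a { done_a = fut_a.as_mut().poll(&mut cx).is_ready(); }
            if !done_b { done_b = fut_b.as_mut().poll(&mut cx).is_ready(); }
        }
    }
    println!("collected {} + {} interleaved samples", a.len(), b.len());
}
```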

But I am also interested in comparing more than 2 versions of the same function.

I don't see why we would be limited to 2 benchmarks. We could theoretically interleave an arbitrary number of benchmark runs.

@bazhenov

bazhenov commented Jan 31, 2024

Sorry for the late reply. After several months of experience with different approaches in tango, I think I have more or less a complete approach to paired testing.

The way it works is the following. Each tango benchmark executable is a mixed object (a binary and a shared library at the same time). The binary part contains the tango benchmark runner, and the shared-library part (FFI) contains the benchmark registry. Therefore the runner can load benchmarks from itself as well as from any other tango binary. This way several benchmarks can be run simultaneously.
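
A rough sketch of the mixed-object pattern might look like the following. This is not Tango's actual SPI: the `spi_*` symbol names, the `./baseline/libbench.so` path, the use of the `libloading` crate, and the assumption that the crate is built with both `bin` and `cdylib` crate types are all illustrative guesses.

```rust
// --- exported half: benchmark registry behind a small C ABI ---

fn bench_sum() -> u64 { (0u64..10_000).sum() }
fn bench_product() -> u64 { (1u64..20).product() }

static BENCHMARKS: &[(&str, fn() -> u64)] = &[
    ("sum", bench_sum),
    ("product", bench_product),
];

#[no_mangle]
pub extern "C" fn spi_benchmark_count() -> usize {
    BENCHMARKS.len()
}

/// Run benchmark `index` for `iters` iterations; return total elapsed nanoseconds.
#[no_mangle]
pub extern "C" fn spi_run_benchmark(index: usize, iters: u64) -> u64 {
    let f = BENCHMARKS[index].1;
    let start = std::time::Instant::now();
    for _ in 0..iters {
        std::hint::black_box(f());
    }
    start.elapsed().as_nanos() as u64
}

// --- runner half: load another build exposing the same symbols ---
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A build of the same interface from another branch/commit (path is hypothetical).
    let other = unsafe { libloading::Library::new("./baseline/libbench.so")? };
    let their_run: libloading::Symbol<unsafe extern "C" fn(usize, u64) -> u64> =
        unsafe { other.get(b"spi_run_benchmark")? };

    // Paired loop: interleave our build and the loaded build, record differences.
    let mut diffs = Vec::new();
    for _ in 0..100 {
        let ours = spi_run_benchmark(0, 1_000);
        let theirs = unsafe { their_run(0, 1_000) };
        diffs.push(ours as i64 - theirs as i64);
    }
    diffs.sort_unstable();
    println!("median diff: {} ns", diffs[diffs.len() / 2]);
    Ok(())
}
```

Since the two halves only have to agree on a small C-ABI surface, the loaded library can come from a different branch, compiler version, or even language, which is what enables the scenarios listed below.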

IMO this approach solves a lot of different problems:

  • this approach allows us to test the difference in performance not only between different algorithms in the same code, but also between different versions of the same algorithm. This is very useful for performance regression testing, which is, I believe, by far the most prevalent kind of performance testing
  • it is relatively easy to integrate into CI pipelines, because benchmarks can be built independently from different VCS branches
  • it allows testing the effect of non-code factors on performance: compiler/linker options, different compiler versions, etc.
  • inter-language tests – binaries are not required to be written in the same language as long as they implement the same SPI in a compatible way
  • it allows the same binary to be used for both classical (pointwise) and paired testing. For classical tests, a single binary is enough

[Attachment: Untitled Diagram (drawio)]
