[FR] Using Random Interleaving to reduce run-to-run variance. #1051
Comments
I've been working with @haih-g on this internally at Google and it's made a staggering difference to the variance of benchmark runs. I completely support this work (obviously) but I'd like to hear any concerns from anyone, or any questions, in case there are use-cases not covered by our internal tests.
FWIW I agree that the results are usually noisy; it wouldn't be bad if that could be improved.
See #1105
…le#1051) Based on the original implementation by Hai Huang (@haih-g) from google#1105.
…le#1051) Inspired by the original implementation by Hai Huang (@haih-g) from google#1105. The original implementation had design deficiencies that weren't really addressable without a redesign, so it was reverted. In essence, the original implementation consisted of two separable parts: (1) reducing the amount of time each repetition is run for, and symmetrically increasing the repetition count, and (2) running the repetitions in random order. While it worked fine for the usual case, it broke down when the user specified repetitions (it would completely ignore that request) or specified a per-repetition min time (while it would still adjust the repetition count, it would not adjust the per-repetition time, leading to much greater run times). Here, as I originally suggested in the original review, I'm separating the features and only dealing with a single one: running repetitions in random order. Now that the runs/repetitions are no longer in order, the tooling may wish to sort the output, and indeed `compare.py` has been updated to do that: google#1168.
tl;dr: Introduce random interleaving, which is able to reduce run-to-run variance by 40%, and make it the default behavior. No change on the command line. The output is equivalent to setting `--benchmark_repetitions=12 --benchmark_min_time=0.042`. This won't increase test execution time for most cases, as `12 x 0.042 = 0.5`, which is the default min time. To disable, set `--benchmark_enable_random_interleaving=false --benchmark_repetitions=1 --benchmark_min_time=0.5`. One can also adjust `--benchmark_repetitions` and `--benchmark_min_time` to balance between variance reduction and execution cost.
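For context, these flags are ordinary command-line options of a binary built against the benchmark library; below is a minimal sketch of such a binary (the benchmark name and body are placeholders, not part of this proposal).

```cpp
#include <benchmark/benchmark.h>

#include <string>

// Placeholder benchmark, used only to show where the flags apply; run the
// resulting binary with e.g. --benchmark_repetitions=12 --benchmark_min_time=0.042.
static void BM_BuildString(benchmark::State& state) {
  for (auto _ : state) {
    std::string s(64, 'x');
    benchmark::DoNotOptimize(s);
  }
}
BENCHMARK(BM_BuildString);

BENCHMARK_MAIN();
```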
Being a dynamic measurement, microbenchmarking is vulnerable to noise, i.e. run-to-run variance. Even with identical software versions and testing environments, the performance measurements may differ when a microbenchmark is executed twice. Heightened run-to-run variance is a key challenge to reliable performance measurement and reduces the effectiveness of microbenchmarking as a way to track performance regressions.
One cause of the run-to-run variance is the dynamic state of the process, including but not limited to cache warmup, memory allocation, and thread scheduling. When running the benchmarking process twice, even with identical software versions and testing environments, the dynamic state of the process may change, and benchmarks that are sensitive to those states may see different performance numbers.
We use a microbenchmark to illustrate the impact of the dynamic state of the process. The microbenchmark measures the latency in nanoseconds for a random API, which generates statistics on a large dataset by evaluating a model on each slice of the data. We set `--benchmark_repetitions` to `12` and perform two runs. We also set `--benchmark_min_time` to `0.5 / 12` so the total execution time does not change.

*Figure 1: Run 1 - most sample points above 43M.*

*Figure 2: Run 2 - most sample points below 43M.*
As shown in Figures 1 and 2, in one run most of the sample points are above 43M, while in the other run most sample points are below 43M. Given that the two runs use identical software versions and testing environments, the only difference is the dynamic state of the benchmark process.
We propose a novel approach, Random Interleaving, for executing microbenchmarks that lowers run-to-run variance without increasing the execution cost.
In our approach, all the iterations for a single API are not executed in one shot. Instead, the iterations are divided into multiple chunks, and each chunk is executed in one shot. Because we still run the same number of iterations, we don't increase the execution cost (modulo benchmark startup/shutdown costs).
Usually a single microbenchmark executable contains multiple benchmarks. In our approach, we interleave the execution of the chunks from one benchmark with those from the others. Thus, between chunks, we introduce a chance for the state of the process to change before the next chunk, and that state may affect the performance of the next chunk. To better understand this, consider cache warm-up, new objects created in memory, new threads started by the scheduler, etc.; each of these may change the state of the process and affect performance. If instead we ran all the chunks from the same benchmark back to back, only the first chunk or the first couple of chunks would exercise these state changes. With interleaving, each chunk has an opportunity to exercise them.

Worth noting is that in real production, no API being benchmarked is expected to be executed repeatedly for many iterations. Instead, multiple related APIs are executed in coordination to carry out a use case or scenario. Interleaving makes the benchmark execution mimic the real production environment.
In addition, the interleaving is done in a random fashion. This mimics a real production environment, where different use cases or scenarios can happen at the same time and related APIs may be executed in various orders.
A user can control the number of chunks to balance, within a single chunk, how much time is spent exercising state changes against how much time is spent running iterations and measuring performance.
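To make the scheme concrete, here is a minimal sketch of the scheduling idea described above: each benchmark's time budget is split into a fixed number of chunks, all chunks from all benchmarks are pooled, and the pool is shuffled before execution. This is illustrative only, not the library's actual implementation; the benchmark names and the chunk count are placeholders.

```cpp
#include <algorithm>
#include <cstdio>
#include <random>
#include <string>
#include <vector>

// One (benchmark, chunk) entry in the interleaved schedule.
struct Chunk {
  std::string benchmark;  // which benchmark this chunk belongs to
  int index;              // which chunk of that benchmark
};

int main() {
  // Hypothetical benchmarks in one executable.
  const std::vector<std::string> benchmarks = {"BM_Foo", "BM_Bar", "BM_Baz"};
  const int chunks_per_benchmark = 12;  // e.g. --benchmark_repetitions=12

  // Split each benchmark's budget into chunks and pool them all.
  std::vector<Chunk> schedule;
  for (const auto& name : benchmarks)
    for (int i = 0; i < chunks_per_benchmark; ++i)
      schedule.push_back({name, i});

  // Random interleaving: execute the pooled chunks in a shuffled order.
  std::mt19937 rng{std::random_device{}()};
  std::shuffle(schedule.begin(), schedule.end(), rng);

  for (const auto& c : schedule)
    std::printf("run %s, chunk %d, for ~1/%d of its min time\n",
                c.benchmark.c_str(), c.index, chunks_per_benchmark);
  return 0;
}
```

Under this scheme the total number of measured iterations stays roughly the same; only the order and granularity of execution change.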
We also capture a performance data point per chunk. Previously, the microbenchmark framework produced only a single data point per repetition of a benchmark because all iterations were executed in one shot. Now we have multiple data points per benchmark. This enables users to run an A/B t-test or u-test (both require multiple sample points from both A and B) to gauge the statistical significance of performance differences.
This would be different from the existing tooling that runs a u-test between two sets of repetitions of benchmarks. That exists to suggest any statistically significant delta between benchmarks, while this would enable statistical analysis of a single benchmark across chunks of iterations.
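As one illustration of the kind of per-chunk analysis this enables, the sketch below computes Welch's t statistic from two hypothetical sets of per-chunk sample points. The data is made up, and turning the statistic into a p-value requires a Student's t CDF from a statistics library, which is omitted here.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <vector>

// Mean, sample variance, and count of one set of per-chunk sample points.
struct Summary {
  double mean;
  double var;
  std::size_t n;
};

static Summary Summarize(const std::vector<double>& x) {
  const double mean = std::accumulate(x.begin(), x.end(), 0.0) / x.size();
  double ss = 0.0;
  for (double v : x) ss += (v - mean) * (v - mean);
  return {mean, ss / (x.size() - 1), x.size()};
}

int main() {
  // Hypothetical per-chunk latencies (ns) for the same API under builds A and B.
  const std::vector<double> a = {43.2e6, 43.5e6, 42.9e6, 43.1e6, 43.4e6, 43.0e6};
  const std::vector<double> b = {42.1e6, 42.4e6, 42.0e6, 42.3e6, 41.9e6, 42.2e6};

  const Summary sa = Summarize(a), sb = Summarize(b);
  const double se = std::sqrt(sa.var / sa.n + sb.var / sb.n);  // standard error of the difference
  std::printf("Welch's t = %.3f\n", (sa.mean - sb.mean) / se);
  return 0;
}
```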
Furthermore, the multiple data points expose the intra-run variance, which was not available before. The intra-run variance is useful information for understanding the nature of the API, such as whether it is sensitive to state changes in the process, and may reveal optimization opportunities.
*Figure 3: Run with Random Interleaving - sample points both above and below 43M.*
With random interleaving, we are able to exercise different dynamic states of the process and see sample points both above and below 43M. This exposes the variance within a single run. By calculating the mean of the 12 sample points per run, we average out the variance within each run. We run the benchmark 50 times, and for each run we capture the mean latency of the 12 sample points. We then calculate STDEV / MEAN over the 50 means. The run-to-run variance is reduced by 40%.
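For reference, the STDEV / MEAN figure is the coefficient of variation of the per-run means. A small sketch of that computation, with made-up per-run means standing in for real measurements:

```cpp
#include <cmath>
#include <cstdio>
#include <numeric>
#include <vector>

// Coefficient of variation (sample stdev / mean) of per-run mean latencies,
// used above as the measure of run-to-run variance.
static double CoefficientOfVariation(const std::vector<double>& run_means) {
  const double mean =
      std::accumulate(run_means.begin(), run_means.end(), 0.0) / run_means.size();
  double ss = 0.0;
  for (double m : run_means) ss += (m - mean) * (m - mean);
  return std::sqrt(ss / (run_means.size() - 1)) / mean;
}

int main() {
  // Hypothetical per-run means (each the mean of 12 per-chunk sample points).
  const std::vector<double> run_means = {43.1e6, 42.8e6, 43.4e6, 42.9e6, 43.2e6};
  std::printf("run-to-run variance (STDEV / MEAN) = %.4f\n",
              CoefficientOfVariation(run_means));
  return 0;
}
```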
*Table: run-to-run variance with `--benchmark_repetitions=12 --benchmark_min_time=0.042` with random interleaving vs. `--benchmark_repetitions=12 --benchmark_min_time=0.042` without random interleaving.*

Worth noting that `--benchmark_repetitions=12 --benchmark_min_time=0.042` is not equivalent to `--benchmark_repetitions=1 --benchmark_min_time=0.5`: in the former, the setup / teardown are executed 12 times, while only once in the latter. Thus the table above is not necessarily a comparison between random interleaving and the default behavior. We run the comparison against the default behavior for ~800 microbenchmarks, and the average improvement in run-to-run variance is ~50%.
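To ground the setup / teardown point, that cost corresponds to work such as fixture SetUp/TearDown, which is paid for every repetition. A hypothetical sketch (the fixture, its data, and the benchmark name are illustrative only):

```cpp
#include <benchmark/benchmark.h>

#include <cstddef>
#include <vector>

// Hypothetical fixture: SetUp/TearDown stand in for the per-repetition
// setup / teardown cost discussed above.
class BigVectorFixture : public benchmark::Fixture {
 public:
  void SetUp(::benchmark::State& state) override {
    data_.assign(static_cast<std::size_t>(state.range(0)), 1);  // paid once per repetition
  }
  void TearDown(::benchmark::State&) override { data_.clear(); }

  std::vector<int> data_;
};

BENCHMARK_DEFINE_F(BigVectorFixture, Sum)(benchmark::State& state) {
  for (auto _ : state) {
    long sum = 0;
    for (int v : data_) sum += v;
    benchmark::DoNotOptimize(sum);
  }
}
BENCHMARK_REGISTER_F(BigVectorFixture, Sum)->Arg(1 << 20);
```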
Also worth noting is that random interleaving reduces run-to-run variance, but at a potential cost of increased intra-run variance. In effect, it reduces run-to-run variance by exposing the variance within each run, where it can be averaged out and thus is not surfaced as run-to-run variance. If the purpose is to track regressions, reducing run-to-run variance is more important. We are not aware of any, but there might be cases where the intra-run variance itself is of interest, and random interleaving may not be the right solution for them.
Yet another note is that there is no one-size-fits-all parameter for `--benchmark_repetitions` in random interleaving. We tested 10, 12, and 20, and 12 maximizes the variance reduction for the ~800 microbenchmarks we have tested.

Random interleaving may increase the execution time in some cases, for example when the setup / teardown cost is high (see below). In such cases, a user may want to pick a different `--benchmark_repetitions` to balance the benefit and the additional cost, if any, of random interleaving.

In general, a higher `--benchmark_repetitions` means more chances to exercise different states of the process, and results in less run-to-run variance. But because we want to keep the total execution time constant, a higher `--benchmark_repetitions` means a lower `--benchmark_min_time`. If the benchmark is expected to warm up the cache, less `--benchmark_min_time` per repetition may not be enough for the cache warm-up, and the user may not measure the right thing. Furthermore, the user may end up with fewer iterations (a cold cache means longer latency, which in turn means fewer iterations in a fixed amount of time), and fewer iterations may lead to increased variance.

If possible, it is recommended to increase `--benchmark_repetitions` without decreasing `--benchmark_min_time`. A user will spend more time running the benchmark, but with much more stable results. If the setup / teardown cost is high, then one may want to reduce `--benchmark_repetitions` to bring down the execution time. However, the user may then end up with heightened run-to-run variance, which renders the result inconclusive.
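As a possible complement to the global flags, repetitions and minimum time can also be set per benchmark in code, which is one way to follow the advice above (more repetitions without a smaller per-repetition min time) for selected benchmarks only. A hedged sketch; the benchmark name, body, and values are placeholders:

```cpp
#include <benchmark/benchmark.h>

#include <vector>

// Placeholder benchmark; only the Repetitions/MinTime registration matters here.
static void BM_SumVector(benchmark::State& state) {
  std::vector<int> v(1024, 1);
  for (auto _ : state) {
    long sum = 0;
    for (int x : v) sum += x;
    benchmark::DoNotOptimize(sum);
  }
}
// Keep the default 0.5 s per repetition but ask for 12 repetitions,
// trading a longer total run time for more stable results.
BENCHMARK(BM_SumVector)->MinTime(0.5)->Repetitions(12);

BENCHMARK_MAIN();
```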