[FR] Using Random Interleaving to reduce run-to-run variance. #1051

Closed
haihuang-ml opened this issue Oct 5, 2020 · 3 comments
Labels
enhancement, next-release (PRs or Issues that should be included in the next release)

Comments

@haihuang-ml
Contributor

tl;dr: Introduce random interleaving, which reduces run-to-run variance by ~40%, and make it the default behavior. No change to the command line is required. The output is equivalent to setting --benchmark_repetitions=12 --benchmark_min_time=0.042. This won't increase test execution time in most cases, since 12 x 0.042 ≈ 0.5, the default min time. To disable it, set --benchmark_enable_random_interleaving=false --benchmark_repetitions=1 --benchmark_min_time=0.5. One can also adjust --benchmark_repetitions and --benchmark_min_time to balance variance reduction against execution cost.
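For concreteness, here is a minimal sketch of a benchmark binary and the flags involved (BM_example is a placeholder; --benchmark_enable_random_interleaving is the new flag proposed here):

#include <benchmark/benchmark.h>

static void BM_example(benchmark::State& state) {
  int x = 0;
  for (auto _ : state) {
    benchmark::DoNotOptimize(x += 1);  // trivial body, stands in for a real API call
  }
}
BENCHMARK(BM_example);
BENCHMARK_MAIN();

// With the proposed defaults, a plain run of ./example behaves like:
//   ./example --benchmark_repetitions=12 --benchmark_min_time=0.042
// To restore today's single-repetition behavior:
//   ./example --benchmark_enable_random_interleaving=false \
//             --benchmark_repetitions=1 --benchmark_min_time=0.5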

Being a dynamic measurement, microbenchmarking is vulnerable to noise, or run-to-run variance. Even with identical software versions and testing environments, the performance measurements may differ when a microbenchmark is executed twice. Heightened run-to-run variance is a key challenge to reliable performance measurement and reduces the effectiveness of microbenchmarking as a way to track performance regressions.

One cause of run-to-run variance is the dynamic state of the process, including but not limited to cache warm-up, memory allocation, and thread scheduling. When the benchmarking process is run twice, even with identical software versions and testing environments, the dynamic state of the process may change, and benchmarks that are sensitive to that state may see different performance numbers.

We use a microbenchmark to illustrate the impact of the dynamic state of the process. The microbenchmark measures the latency in nanoseconds of a random API, which generates statistics on a large dataset by evaluating a model on each slice of the data. We set --benchmark_repetitions to 12 and perform two runs. We also set --benchmark_min_time to 0.5 / 12 so the total execution time does not change.

Figure 1: Run 1 - most sample points above 43M.

Figure 2: Run 2 - most sample points below 43M.

As shown in Figures 1 and 2, in one run most of the sample points are above 43M while in the other run most sample points are below 43M. Given that the two runs use identical software versions and testing environments, the only difference is the dynamic state of the benchmark process.

We propose a novel approach, Random Interleaving, for executing microbenchmarks that lowers run-to-run variance without increasing the execution cost.

In our approach, the iterations for a single API are no longer executed in one shot. Instead, the iterations are divided into multiple chunks, each of which is executed in one shot. Because we still run the same number of iterations, we don't increase the execution cost (modulo benchmark startup/shutdown costs).

Usually a single microbenchmark executable contains multiple benchmarks. In our approach, we interleave the execution of the chunks from one benchmark with that of the chunks from the others. Thus between chunks we introduce a chance for the initial state of the process to change before the next chunk runs, and that state may affect the next chunk's performance. To better understand this, consider cache warm-up, new objects created in memory, new threads started by the scheduler, etc.; each of these may change the state of the process and affect performance. If instead we run all the chunks from the same benchmark back to back, only the first chunk or two exercises those state changes. With interleaving, every chunk has an opportunity to exercise them. It is worth noting that in real production, no API being benchmarked is expected to be executed repeatedly for many iterations; instead, multiple related APIs are executed in coordination to carry out a use case or scenario. Interleaving makes the benchmark execution mimic the real production environment more closely.

In addition, the interleaving is done in a random order. This mimics a real production environment, where different use cases or scenarios can happen at the same time and related APIs may be executed in various orders.
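To make the scheduling concrete, here is a rough sketch, purely for illustration: the Chunk struct and BuildRandomInterleavedSchedule helper are hypothetical and not the actual implementation, but they show the idea of splitting each benchmark's repetitions into chunks and running them in a shuffled order.

#include <algorithm>
#include <random>
#include <vector>

struct Chunk {
  int benchmark_index;   // which benchmark this chunk belongs to
  int repetition_index;  // which of that benchmark's repetitions it is
};

// Build one chunk per (benchmark, repetition) pair, then shuffle so chunks
// from different benchmarks are interleaved in a random order.
std::vector<Chunk> BuildRandomInterleavedSchedule(int num_benchmarks,
                                                  int repetitions,
                                                  std::mt19937& rng) {
  std::vector<Chunk> schedule;
  for (int b = 0; b < num_benchmarks; ++b)
    for (int r = 0; r < repetitions; ++r)
      schedule.push_back({b, r});
  std::shuffle(schedule.begin(), schedule.end(), rng);
  return schedule;
}

// Each chunk then runs for min_time / repetitions (e.g. 0.5 / 12 ≈ 0.042s)
// and is reported as its own sample point.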

A user can control the number of chunks to balance, within a single chunk, how much time is spent exercising state changes against how much time is spent running iterations and measuring performance.

We also capture a performance data point per chunk. Previously the microbenchmark framework produced only a single data point per repetition of a benchmark because all iterations were executed in one shot. Now we have multiple data points per benchmark. This enables users to run an A/B t-test or u-test (both require multiple sample points from both A and B) to gauge the statistical significance of performance differences.
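As an illustration only (not something the framework provides), a Welch's t-statistic over the per-chunk sample points from an A run and a B run could be computed roughly like this:

#include <cmath>
#include <numeric>
#include <vector>

double Mean(const std::vector<double>& v) {
  return std::accumulate(v.begin(), v.end(), 0.0) / v.size();
}

double SampleVariance(const std::vector<double>& v, double mean) {
  double ss = 0.0;
  for (double x : v) ss += (x - mean) * (x - mean);
  return ss / (v.size() - 1);
}

// Welch's t-statistic for two sets of per-chunk latencies (e.g. the 12 sample
// points from an A build and a B build); a larger |t| suggests a more
// significant difference between the two.
double WelchT(const std::vector<double>& a, const std::vector<double>& b) {
  const double ma = Mean(a), mb = Mean(b);
  const double va = SampleVariance(a, ma), vb = SampleVariance(b, mb);
  return (ma - mb) / std::sqrt(va / a.size() + vb / b.size());
}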

This differs from the existing tooling that runs a u-test between two sets of repetitions of benchmarks. That exists to suggest a statistically significant delta between benchmarks, while this enables statistical analysis of a single benchmark across chunks of iterations.

Furthermore, the multiple data points expose the intra-run variance, which was not available before. The intra-run variance helps users understand the nature of the API, in particular whether it is sensitive to state changes in the process, and may reveal optimization opportunities.

Figure 3: Run with Random Interleaving - sample points both above and below 43M.

With random interleaving, we are able to exercise different dynamic states of the process and see sample points both above and below 43M. This exposes the variance within a single run. By calculating the mean of the 12 sample points per run, we average out the variance within each run. We run the benchmark 50 times, and for each run we capture the mean latency over the 12 sample points. We then calculate STDEV / MEAN over the 50 means. The run-to-run variance is reduced by 40%.
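For clarity, the STDEV / MEAN figure is computed over the per-run means, roughly as in this sketch (the RunToRunCV helper is hypothetical):

#include <cmath>
#include <numeric>
#include <vector>

// runs[i] holds the 12 per-chunk sample points of run i (50 runs in total).
// Average each run, then report STDEV / MEAN over the 50 per-run means.
double RunToRunCV(const std::vector<std::vector<double>>& runs) {
  std::vector<double> run_means;
  for (const auto& run : runs)
    run_means.push_back(std::accumulate(run.begin(), run.end(), 0.0) / run.size());
  const double mean =
      std::accumulate(run_means.begin(), run_means.end(), 0.0) / run_means.size();
  double ss = 0.0;
  for (double m : run_means) ss += (m - mean) * (m - mean);
  const double stdev = std::sqrt(ss / (run_means.size() - 1));
  return stdev / mean;  // e.g. 1.01% with random interleaving in the table below
}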

| Config | # Runs | STDEV% | MAD% |
| --- | --- | --- | --- |
| --benchmark_repetitions=12 --benchmark_min_time=0.042 with random interleaving | 50 | 1.01% | 0.67% |
| --benchmark_repetitions=12 --benchmark_min_time=0.042 without random interleaving | 50 | 1.37% | 1.25% |

It is worth noting that --benchmark_repetitions=12 --benchmark_min_time=0.042 is not equivalent to --benchmark_repetitions=1 --benchmark_min_time=0.5. In the former, the setup / teardown code is executed 12 times; in the latter, only once.

static void BM_foo(benchmark::State& state) {
  // Setup code
  for (auto _ : state) {
    // Benchmark iteration code
  }
  // Teardown code
}
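
For reference, the same repetition / min-time split can also be requested per benchmark through the existing registration API, e.g.:

// Setup and teardown in BM_foo run once per repetition, so this executes
// them 12 times, versus once with Repetitions(1) and MinTime(0.5).
BENCHMARK(BM_foo)->Repetitions(12)->MinTime(0.042);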

Thus the table above is not necessarily a comparison between random interleaving and the default behavior. We ran the comparison against the default behavior for ~800 microbenchmarks, and the average improvement in run-to-run variance is ~50%.

Also worth noting is that random interleaving reduces run-to-run variance at the potential cost of increased intra-run variance. In fact, it reduces the run-to-run variance by exposing the variance within each run, where it can be averaged out instead of showing up as run-to-run variance. If the purpose is to track regressions, reducing run-to-run variance is more important. We are not aware of any, but there might be cases where the intra-run variance is of interest, and random interleaving may not be the right solution for them.

Yet another note: there is no one-size-fits-all value for --benchmark_repetitions in random interleaving. We tested 10, 12, and 20, and 12 maximized the variance reduction for the ~800 microbenchmarks we tested.

Random interleaving may increase the execution time if:

  1. The microbenchmark runs < 12 iterations by default
  2. The setup / teardown is costly

In these cases, a user may want to pick a different --benchmark_repetitions to balance the benefit against the additional cost, if any, of random interleaving.

In general, a higher --benchmark_repetitions means more chances to exercise different states of the process, and results in lower run-to-run variance. But because we want to keep the total execution time constant, a higher --benchmark_repetitions means a lower --benchmark_min_time. If the benchmark is expected to warm up the cache, less --benchmark_min_time per repetition may not be enough for the warm-up, and the user may not measure the right thing. Furthermore, the user may end up with fewer iterations (a cold cache means longer latency, which in turn means fewer iterations in a fixed amount of time), and fewer iterations may lead to increased variance.

If possible, it is recommended to increase --benchmark_repetitions without decreasing --benchmark_min_time. A user will spend more time running the benchmark, but will get much more stable results.
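For example, keeping the default per-repetition time while raising the repetition count (roughly 12x the runtime of a single 0.5s repetition) could be expressed as:

// ~6s per benchmark instead of ~0.5s, but with 12 sample points to average over.
BENCHMARK(BM_foo)->Repetitions(12)->MinTime(0.5);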

If the setup / teardown cost is high, one may want to reduce --benchmark_repetitions to bring down the execution time. However, the user may then end up with heightened run-to-run variance that renders the results inconclusive.

@dmah42 dmah42 added the enhancement and next-release (PRs or Issues that should be included in the next release) labels Oct 6, 2020
@dmah42
Member

dmah42 commented Oct 6, 2020

I've been working with @haih-g on this internally at Google and it's made a staggering difference to the variance of benchmark runs. I completely support this work (obviously) but I'd like to hear any concerns from anyone, or any questions, in case there are use-cases not covered by our internal tests.

@LebedevRI
Collaborator

FWIW I agree that the results are usually noisy, and it wouldn't be bad if that could be improved,
although without looking at the actual fix it is hard to say anything in more detail.

@haihuang-ml
Contributor Author

See #1105

@haihuang-ml haihuang-ml reopened this Mar 28, 2021
@dmah42 dmah42 closed this as completed May 20, 2021
@LebedevRI LebedevRI reopened this Jun 1, 2021
LebedevRI added a commit to LebedevRI/benchmark that referenced this issue Jun 1, 2021
…le#1051)

Based on the original implementation by Hai Huang (@haih-g)
from google#1105.
LebedevRI added a commit to LebedevRI/benchmark that referenced this issue Jun 3, 2021
…le#1051)

LebedevRI added a commit to LebedevRI/benchmark that referenced this issue Jun 3, 2021
…le#1051)

Inspired by the original implementation by Hai Huang (@haih-g)
from google#1105.

The original implementation had design deficiencies that
weren't really addressable without redesign, so it was reverted.

In essence, the original implementation consisted of two separable parts:
* reducing the amount of time each repetition is run for, and symmetrically increasing the repetition count
* running the repetitions in random order

While it worked fine for the usual case, it broke down when a user specified repetitions
(it would completely ignore that request) or specified a per-repetition min time (while it would
still adjust the repetition count, it would not adjust the per-repetition time,
leading to much greater run times).

Here, as I was originally suggesting in the review, I'm separating the features
and only dealing with a single one: running repetitions in random order.

Now that the runs/repetitions are no longer in-order, the tooling may wish to sort the output,
and indeed `compare.py` has been updated to do that: google#1168.
LebedevRI added a commit to LebedevRI/benchmark that referenced this issue Jun 3, 2021
…le#1051)

vincenzopalazzo pushed a commit to vincenzopalazzo/benchmark that referenced this issue Feb 8, 2022
…le#1051) (google#1163)
