[FR] Using Random Interleaving to reduce run-to-run variance. #1051
Comments
I've been working with @haih-g on this internally at Google and it's made a staggering difference to the variance of benchmark runs. I completely support this work (obviously) but I'd like to hear any concerns from anyone, or any questions, in case there are use-cases not covered by our internal tests.
FWIW I agree that the results are usually noisy; it wouldn't be bad if that could be improved.
See #1105
…le#1051) Based on the original implementation by Hai Huang (@haih-g) from google#1105.
…le#1051) Inspired by the original implementation by Hai Huang (@haih-g) from google#1105. The original implementation had design deficiencies that weren't really addressable without a redesign, so it was reverted. In essence, the original implementation consisted of two separable parts: (1) reducing the amount of time each repetition is run for, and symmetrically increasing the repetition count, and (2) running the repetitions in random order. While it worked fine for the usual case, it broke down when the user specified repetitions (it would completely ignore that request) or specified a per-repetition min time (while it would still adjust the repetition count, it would not adjust the per-repetition time, leading to much greater run times). Here, as I originally suggested in the original review, I'm separating the features and only dealing with a single one: running repetitions in random order. Now that the runs/repetitions are no longer in order, the tooling may wish to sort the output, and indeed `compare.py` has been updated to do that: google#1168.
tl;dr: Introduce random interleaving, which is able to reduce run-to-run variance by 40%, and make it the default behavior. No change on the command line. The output is equivalent to setting `--benchmark_repetitions=12 --benchmark_min_time=0.042`. This won't increase test execution time for most cases, as `12 x 0.042 = 0.5`, which is the default min time. To disable, set `--benchmark_enable_random_interleaving=false --benchmark_repetitions=1 --benchmark_min_time=0.5`. One can also adjust `--benchmark_repetitions` and `--benchmark_min_time` to balance between variance reduction and execution cost.
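For context, these flags are ordinary command-line options of a binary built against the benchmark library; below is a minimal sketch of such a binary (the benchmark name and body are placeholders, not part of this proposal).

```cpp
#include <benchmark/benchmark.h>

#include <string>

// Placeholder benchmark, used only to show where the flags apply; run the
// resulting binary with e.g. --benchmark_repetitions=12 --benchmark_min_time=0.042.
static void BM_BuildString(benchmark::State& state) {
  for (auto _ : state) {
    std::string s(64, 'x');
    benchmark::DoNotOptimize(s);
  }
}
BENCHMARK(BM_BuildString);

BENCHMARK_MAIN();
```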
Being a dynamic measurement, microbenchmarking is vulnerable to noise, i.e. run-to-run variance. Even with identical software versions and testing environments, the performance measurements may differ when a microbenchmark is executed twice. Heightened run-to-run variance is a key challenge to reliable performance measurement and reduces the effectiveness of microbenchmarking as a way to track performance regressions.
One cause of the run-to-run variance is the dynamic state of the process, including but not limited to cache warmup, memory allocation, and thread scheduling. When running the benchmarking process twice, even with identical software versions and testing environments, the dynamic state of the process may change, and benchmarks that are sensitive to those states may see different performance numbers.
We use a microbenchmark to illustrate the impact of the dynamic state of the process. The microbenchmark measures the latency in nanoseconds for a random API, which generates statistics on a large dataset by evaluating a model on each slice of the data. We set `--benchmark_repetitions` to `12` and perform two runs. We also set `--benchmark_min_time` to `0.5 / 12` so the total execution time does not change.

*Figure 1: Run 1 - most sample points above 43M.*

*Figure 2: Run 2 - most sample points below 43M.*
As shown in Figures 1 and 2, in one run most of the sample points are above 43M, while in the other run most sample points are below 43M. Given that the two runs use identical software versions and testing environments, the only difference is the dynamic state of the benchmark process.
We propose a novel approach, Random Interleaving, for executing microbenchmarks that lowers run-to-run variance without increasing the execution cost.
In our approach, all the iterations for a single API are not executed in one shot. Instead, the iterations are divided into multiple chunks, and each chunk is executed in one shot. Because we still run the same number of iterations, we don't increase the execution cost (modulo benchmark startup/shutdown costs).
Usually a single microbenchmark executable contains multiple benchmarks. In our approach, we interleave the execution of the chunks from one benchmark with those from the others. Thus, between chunks, we introduce a chance for the state of the process to change before the next chunk, and that state may affect the performance of the next chunk. To better understand this, consider cache warm-up, new objects created in memory, new threads started by the scheduler, etc.; each of these may change the state of the process and affect performance. If instead we ran all the chunks from the same benchmark back to back, only the first chunk or the first couple of chunks would exercise these state changes. With interleaving, each chunk has an opportunity to exercise them.

Worth noting is that in real production, no API being benchmarked is expected to be executed repeatedly for many iterations. Instead, multiple related APIs are executed in coordination to carry out a use case or scenario. Interleaving makes the benchmark execution mimic the real production environment.
In addition, the interleaving is done in a random fashion. This mimics a real production environment, where different use cases or scenarios can happen at the same time and related APIs may be executed in various orders.
A user can control the number of chunks to balance, within a single chunk, how much time is spent exercising state changes against how much time is spent running iterations and measuring performance.
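To make the scheme concrete, here is a minimal sketch of the scheduling idea described above: each benchmark's time budget is split into a fixed number of chunks, all chunks from all benchmarks are pooled, and the pool is shuffled before execution. This is illustrative only, not the library's actual implementation; the benchmark names and the chunk count are placeholders.

```cpp
#include <algorithm>
#include <cstdio>
#include <random>
#include <string>
#include <vector>

// One (benchmark, chunk) entry in the interleaved schedule.
struct Chunk {
  std::string benchmark;  // which benchmark this chunk belongs to
  int index;              // which chunk of that benchmark
};

int main() {
  // Hypothetical benchmarks in one executable.
  const std::vector<std::string> benchmarks = {"BM_Foo", "BM_Bar", "BM_Baz"};
  const int chunks_per_benchmark = 12;  // e.g. --benchmark_repetitions=12

  // Split each benchmark's budget into chunks and pool them all.
  std::vector<Chunk> schedule;
  for (const auto& name : benchmarks)
    for (int i = 0; i < chunks_per_benchmark; ++i)
      schedule.push_back({name, i});

  // Random interleaving: execute the pooled chunks in a shuffled order.
  std::mt19937 rng{std::random_device{}()};
  std::shuffle(schedule.begin(), schedule.end(), rng);

  for (const auto& c : schedule)
    std::printf("run %s, chunk %d, for ~1/%d of its min time\n",
                c.benchmark.c_str(), c.index, chunks_per_benchmark);
  return 0;
}
```

Under this scheme the total number of measured iterations stays roughly the same; only the order and granularity of execution change.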
We also capture a performance data point per chunk. Previously, the microbenchmark framework produced only a single data point per repetition of a benchmark because all iterations were executed in one shot. Now we have multiple data points per benchmark. This enables users to run an A/B t-test or u-test (both require multiple sample points from both A and B) to gauge the statistical significance of performance differences.
This would be different from the existing tooling that runs a u-test between two sets of repetitions of benchmarks. That exists to suggest any statistically significant delta between benchmarks, while this would enable statistical analysis of a single benchmark across chunks of iterations.
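As one illustration of the kind of per-chunk analysis this enables, the sketch below computes Welch's t statistic from two hypothetical sets of per-chunk sample points. The data is made up, and turning the statistic into a p-value requires a Student's t CDF from a statistics library, which is omitted here.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <vector>

// Mean, sample variance, and count of one set of per-chunk sample points.
struct Summary {
  double mean;
  double var;
  std::size_t n;
};

static Summary Summarize(const std::vector<double>& x) {
  const double mean = std::accumulate(x.begin(), x.end(), 0.0) / x.size();
  double ss = 0.0;
  for (double v : x) ss += (v - mean) * (v - mean);
  return {mean, ss / (x.size() - 1), x.size()};
}

int main() {
  // Hypothetical per-chunk latencies (ns) for the same API under builds A and B.
  const std::vector<double> a = {43.2e6, 43.5e6, 42.9e6, 43.1e6, 43.4e6, 43.0e6};
  const std::vector<double> b = {42.1e6, 42.4e6, 42.0e6, 42.3e6, 41.9e6, 42.2e6};

  const Summary sa = Summarize(a), sb = Summarize(b);
  const double se = std::sqrt(sa.var / sa.n + sb.var / sb.n);  // standard error of the difference
  std::printf("Welch's t = %.3f\n", (sa.mean - sb.mean) / se);
  return 0;
}
```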
Furthermore, the multiple data points expose the intra-run variance, which was not available before. The intra-run variance is useful information for understanding the nature of the API, such as whether it is sensitive to state changes in the process, and may reveal optimization opportunities.
*Figure 3: Run with Random Interleaving - sample points both above and below 43M.*
With random interleaving, we are able to exercise different dynamic states of the process and see sample points both above and below 43M. This exposes the variance within a single run. By calculating the mean of the 12 sample points per run, we average out the variance within each run. We run the benchmark 50 times, and for each run we capture the mean latency of the 12 sample points. We then calculate STDEV / MEAN over the 50 means. The run-to-run variance is reduced by 40%.
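For reference, the STDEV / MEAN figure is the coefficient of variation of the per-run means. A small sketch of that computation, with made-up per-run means standing in for real measurements:

```cpp
#include <cmath>
#include <cstdio>
#include <numeric>
#include <vector>

// Coefficient of variation (sample stdev / mean) of per-run mean latencies,
// used above as the measure of run-to-run variance.
static double CoefficientOfVariation(const std::vector<double>& run_means) {
  const double mean =
      std::accumulate(run_means.begin(), run_means.end(), 0.0) / run_means.size();
  double ss = 0.0;
  for (double m : run_means) ss += (m - mean) * (m - mean);
  return std::sqrt(ss / (run_means.size() - 1)) / mean;
}

int main() {
  // Hypothetical per-run means (each the mean of 12 per-chunk sample points).
  const std::vector<double> run_means = {43.1e6, 42.8e6, 43.4e6, 42.9e6, 43.2e6};
  std::printf("run-to-run variance (STDEV / MEAN) = %.4f\n",
              CoefficientOfVariation(run_means));
  return 0;
}
```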
*Table: run-to-run variance with `--benchmark_repetitions=12 --benchmark_min_time=0.042` with random interleaving vs. `--benchmark_repetitions=12 --benchmark_min_time=0.042` without random interleaving.*

Worth noting that `--benchmark_repetitions=12 --benchmark_min_time=0.042` is not equivalent to `--benchmark_repetitions=1 --benchmark_min_time=0.5`: in the former, the setup / teardown are executed 12 times, while only once in the latter. Thus the table above is not necessarily a comparison between random interleaving and the default behavior. We run the comparison against the default behavior for ~800 microbenchmarks, and the average improvement in run-to-run variance is ~50%.
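To ground the setup / teardown point, that cost corresponds to work such as fixture SetUp/TearDown, which is paid for every repetition. A hypothetical sketch (the fixture, its data, and the benchmark name are illustrative only):

```cpp
#include <benchmark/benchmark.h>

#include <cstddef>
#include <vector>

// Hypothetical fixture: SetUp/TearDown stand in for the per-repetition
// setup / teardown cost discussed above.
class BigVectorFixture : public benchmark::Fixture {
 public:
  void SetUp(::benchmark::State& state) override {
    data_.assign(static_cast<std::size_t>(state.range(0)), 1);  // paid once per repetition
  }
  void TearDown(::benchmark::State&) override { data_.clear(); }

  std::vector<int> data_;
};

BENCHMARK_DEFINE_F(BigVectorFixture, Sum)(benchmark::State& state) {
  for (auto _ : state) {
    long sum = 0;
    for (int v : data_) sum += v;
    benchmark::DoNotOptimize(sum);
  }
}
BENCHMARK_REGISTER_F(BigVectorFixture, Sum)->Arg(1 << 20);
```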
Also worth noting is that random interleaving reduces run-to-run variance, but at a potential cost of increased intra-run variance. In effect, it reduces run-to-run variance by exposing the variance within each run, where it can be averaged out and thus is not surfaced as run-to-run variance. If the purpose is to track regressions, reducing run-to-run variance is more important. We are not aware of any, but there might be cases where the intra-run variance itself is of interest, and random interleaving may not be the right solution for them.
Yet another note is that there is no one-size-fits-all parameter for `--benchmark_repetitions` in random interleaving. We tested 10, 12, and 20, and 12 maximizes the variance reduction for the ~800 microbenchmarks we have tested.

Random interleaving may increase the execution time in some cases, for example when the setup / teardown cost is high (see below). In such cases, a user may want to pick a different `--benchmark_repetitions` to balance the benefit and the additional cost, if any, of random interleaving.

In general, a higher `--benchmark_repetitions` means more chances to exercise different states of the process, and results in less run-to-run variance. But because we want to keep the total execution time constant, a higher `--benchmark_repetitions` means a lower `--benchmark_min_time`. If the benchmark is expected to warm up the cache, less `--benchmark_min_time` per repetition may not be enough for the cache warm-up, and the user may not measure the right thing. Furthermore, the user may end up with fewer iterations (a cold cache means longer latency, which in turn means fewer iterations in a fixed amount of time), and fewer iterations may lead to increased variance.

If possible, it is recommended to increase `--benchmark_repetitions` without decreasing `--benchmark_min_time`. A user will spend more time running the benchmark, but with much more stable results. If the setup / teardown cost is high, then one may want to reduce `--benchmark_repetitions` to bring down the execution time. However, the user may then end up with heightened run-to-run variance, which renders the result inconclusive.
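As a possible complement to the global flags, repetitions and minimum time can also be set per benchmark in code, which is one way to follow the advice above (more repetitions without a smaller per-repetition min time) for selected benchmarks only. A hedged sketch; the benchmark name, body, and values are placeholders:

```cpp
#include <benchmark/benchmark.h>

#include <vector>

// Placeholder benchmark; only the Repetitions/MinTime registration matters here.
static void BM_SumVector(benchmark::State& state) {
  std::vector<int> v(1024, 1);
  for (auto _ : state) {
    long sum = 0;
    for (int x : v) sum += x;
    benchmark::DoNotOptimize(sum);
  }
}
// Keep the default 0.5 s per repetition but ask for 12 repetitions,
// trading a longer total run time for more stable results.
BENCHMARK(BM_SumVector)->MinTime(0.5)->Repetitions(12);

BENCHMARK_MAIN();
```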