Parallel Benchmarking
Benchee offers the parallel key to execute each benchmarking job in parallel. E.g. with parallel: 4, each defined benchmarking function is spawned and executed in 4 tasks; Benchee waits until all 4 of those finish before it goes on to benchmark the next function.
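As an illustration, here is a minimal sketch in the spirit of the samples shipped with the repository, assuming a recent Benchee version; the list, the functions and the option values are placeholders for this example, not necessarily the exact contents of samples/run_parallel.exs:

```elixir
list = Enum.to_list(1..10_000)
map_fun = fn i -> [i, i * i] end

Benchee.run(
  %{
    "flat_map"    => fn -> Enum.flat_map(list, map_fun) end,
    "map.flatten" => fn -> list |> Enum.map(map_fun) |> List.flatten() end
  },
  # run each job in 4 processes at the same time
  parallel: 4,
  time: 5
)
```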
Benchee treats the results of a parallel benchmark basically as if they had been obtained sequentially, so the reported results do not account for the fact that they were gathered in parallel. The results show the average time it took to call a function across all processes. So, if there were no slowdown due to parallel execution (there is, see below), executing with parallel: 4, time: 5 is more or less the same as executing with parallel: 1, time: 20.
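In other words (ignoring contention), these two hypothetical runs collect a comparable amount of measurement time per job, 4 processes x 5 seconds versus 1 process x 20 seconds; the "noop" job is just a placeholder for this sketch:

```elixir
jobs = %{"noop" => fn -> :ok end}

Benchee.run(jobs, parallel: 4, time: 5)
Benchee.run(jobs, parallel: 1, time: 20)
```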
While this is great, it also carries certain risks.
First, it's important to know that most modern CPUs boost their clock speed when not all cores are occupied. This means that, even if the system is not overloaded, benchmarking numbers will likely get worse the higher the parallelism is.
To showcase this and the effect of overloading, I ran the same benchmark on my 4-core system under a normal working load (browser, music, editor, GUI etc. open). As you will see, 2 processes are just a bit slower (expected, because of the single-core boost), and after that performance degrades roughly as expected. The relative performance between the benchmarked functions also stays roughly the same; if anything, the difference often becomes more pronounced.
You can also see that the standard deviation gets progressively bigger, as (I think) it becomes more common for a process to have to wait to be scheduled for execution (either by the OS or by the BEAM VM).
tobi@happy ~/github/benchee $ mix run samples/run.exs # parallel 1
Benchmarking flat_map...
Benchmarking map.flatten...

Name              ips        average    deviation      median
map.flatten   1276.37       783.47μs    (±12.28%)    759.00μs
flat_map       878.60      1138.17μs     (±6.82%)   1185.00μs

Comparison:
map.flatten   1276.37
flat_map       878.60 - 1.45x slower

tobi@happy ~/github/benchee $ mix run samples/run_parallel.exs # parallel 2
Benchmarking flat_map...
Benchmarking map.flatten...

Name              ips        average    deviation      median
map.flatten   1230.53       812.66μs    (±19.86%)    761.00μs
flat_map       713.82      1400.92μs     (±5.63%)   1416.00μs

Comparison:
map.flatten   1230.53
flat_map       713.82 - 1.72x slower

tobi@happy ~/github/benchee $ mix run samples/run_parallel.exs # parallel 3
Benchmarking flat_map...
Benchmarking map.flatten...

Name              ips        average    deviation      median
map.flatten   1012.77       987.39μs    (±29.53%)    913.00μs
flat_map       513.44      1947.63μs     (±6.91%)   1943.50μs

Comparison:
map.flatten   1012.77
flat_map       513.44 - 1.97x slower

tobi@happy ~/github/benchee $ mix run samples/run_parallel.exs # parallel 4
Benchmarking flat_map...
Benchmarking map.flatten...

Name              ips        average    deviation      median
map.flatten    954.88      1047.25μs    (±34.02%)    957.00μs
flat_map       452.38      2210.55μs    (±21.05%)   1914.00μs

Comparison:
map.flatten    954.88
flat_map       452.38 - 2.11x slower

tobi@happy ~/github/benchee $ mix run samples/run_parallel.exs # parallel 12
Benchmarking flat_map...
Benchmarking map.flatten...

Name              ips        average    deviation      median
map.flatten    296.63      3371.18μs    (±57.60%)   2827.00μs
flat_map       186.96      5348.74μs    (±42.14%)   5769.50μs

Comparison:
map.flatten    296.63
flat_map       186.96 - 1.59x slower
Of course, overloading the system with 12 processes is very counterproductive and a lot slower than it ought to be :D
That said, if you want to see how a system behaves under load, overloading might be exactly what you want in order to stress test the system. This is the original use case for which this option was introduced, in the words of the contributor:
I needed to benchmark integration tests for a telephony system we wrote - with this system the tests actually interfere with each other (they're using an Ecto repo) and I wanted to see how far I could push the system as a whole. Making this small change to Benchee worked perfectly for what I needed :)