Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interleaved benchmark execution (multiple commands) #21

Open
sharkdp opened this issue Jan 21, 2018 · 39 comments
Open

Interleaved benchmark execution (multiple commands) #21

sharkdp opened this issue Jan 21, 2018 · 39 comments

Comments

@sharkdp
Copy link
Owner

sharkdp commented Jan 21, 2018

CyberShadow on HN:

This would be useful when comparing two similar commands, as interleaving them makes it less likely that e.g. a load spike will unfavorably affect only one of them, or due to e.g. thermal throttling negatively affecting the last command.

@stevepentland
Copy link
Contributor

I'd be interested in putting this functionality into the tool. I guess we'd take it in as an option and have to change how the execution state is currently being tracked. Unless you were of the opinion that this would be the new default way to run timings.

@sharkdp
Copy link
Owner Author

sharkdp commented Mar 22, 2018

I would like to discuss this first before we put this into action.

In principle, the rationale behind this feature sounds sensible. However, we also have a few features that already work in a similar direction:

  • We show the standard deviation, minimum and maximum of all runtimes for each commands. These statistical quantities give a quite good indication if something went wrong.
  • More importantly, we have an outlier detection which would most probably indicate if any load spikes affect the benchmark.
  • We also have a special indicator that detects if the first run of a benchmark was significantly slower (indicating caching effects). We could maybe add something similar to detect thermal throttling (I have never experienced this myself. How would I trigger this?).

Adding a --interleave-benchmarks option would have a few disadvantages (when used):

  • There will be only progress bar for all the benchmarks. Providing an intermediate feedback (current estimate) will be harder. We have to wait until all benchmarks are finished
  • What if we have two commands where one is fast and the other is slow? Do we run the same amount of benchmarks for both - just to be able to always interleave them?
  • If we want to use caching effects (via --warmup), the interleaving could actually interfere if the two (or more) commands do not use the same cache.

@sharkdp
Copy link
Owner Author

sharkdp commented Mar 22, 2018

Also, simplicity is a desirable goal for me. If there are command line options that we don't necessarily need, I would tend to leave them out.

@stevepentland
Copy link
Contributor

That's true, the benefits are minimal and the downsides/problems presented may not be worth it. While an interesting idea in general, it may not be a correct approach for this tool. If there are already enough other facilities available in the existing feature set, this should probably be closed and we can move on to the other issues.

@sharkdp sharkdp closed this as completed Mar 23, 2018
@comex
Copy link

comex commented Aug 4, 2020

I'm benchmarking some code that interacts heavily with the filesystem and OS, and sorely missing this functionality. Instantaneous load spikes are one thing, since they can be removed by outlier detection or at least detected by a high variance. But there are also many ways in which the state of the system can vary slowly over time. For example:

  • Thermal throttling, as the OP mentioned. On my MacBook Pro this kicks in after only a few seconds of high load (saturating all CPUs), but it accumulates over time, and recovery also takes time. (Of course, desktops will often have enough thermal headroom to never throttle.) If you happen to be on macOS you can watch the current throttling level by running pmset -g thermlog; I don't know the equivalent for other OSes.

  • OS background activity: long-running processes may spin up and down over time. Of course, ideally you would disable all such processes before running benchmarks, but it's easy to miss some.

  • The kernel page cache will gradually discard (if clean) or swap out (if dirty) memory that hasn't been accessed recently. Similarly, the vnode/inode cache (as it's called on macOS and Linux respectively) will throw out files that haven't been accessed recently. Admittedly, in most cases this shouldn't have much effect on a benchmark which mostly keeps performing the same operations – and thus touching the same pages and files – over and over. But it can have some effect.

  • SSDs often have their own DRAM caches which can fill over time before gradually draining to the actual flash storage.

When system state changes slowly enough, it won't show up in the form of outlier runs; it will just bias the results in favor of whichever parameters were tested under more favorable conditions.

While not a panacea, interleaving would greatly mitigate this effect.

@sharkdp
Copy link
Owner Author

sharkdp commented Aug 15, 2020

Thank you very much for the detailed explanation. All of this sounds reasonable.

The strategy how this could be used in hyperfine is not completely clear to me (see "disadvantages" above), but I am opening this for further discussion.

@sharkdp sharkdp reopened this Aug 15, 2020
@shepmaster
Copy link

I find myself in the thermal throttling on a laptop camp as well. I also have other cases that don't make sense — for example, a recent test run I did was:

with change:     3.551s
without change:  3.439s
with change:     3.311s

My "workaround" is to try and scale up the warmup and test iterations in a (futile?) attempt to try to heat-saturate the processor so that the before and after binaries run under similar throttled conditions.

Another potential solution for this particular case would be to run tests like:

  • full set of A
  • full set of B
  • full set of B (optional?)
  • full set of A

And ensure that the A->B and B->A runs report "close enough" to each other.

@sharkdp
Copy link
Owner Author

sharkdp commented Feb 15, 2021

I find myself in the thermal throttling on a laptop camp as well.

I think we should try to find a solution for that and document it here in the repository. I'm not yet convinced that the proposal in this ticket is the best solution (it might be).

Alternative solutions that come to my mind right now:

  • Rather obvious: disable thermal throttling for the benchmark. On Linux, the "scaling governor" can be changed to performance. The C++ micro benchmark library gbenchmark actually prints a warning, if /sys/devices/system/cpu/*/cpufreq/scaling_governor is set to anything other than performance.
  • Your "workaround" solution, using a large number of warmup runs.
  • The opposite of that: Use --prepare "sleep 10" (with a sufficiently large time) to make sure that everything has cooled down before each timing run.
  • Your ABBA strategy.

I also have other cases that don't make sense — for example, a recent test run I did was:

with change:     3.551s
without change:  3.439s
with change:     3.311s

Could you go into more detail? These numbers are pretty much meaningless without the standard deviation. Could it be just noise or is this reproducible? Are you sure that all caching effects are eliminated (either via warmup runs or by resetting the caches with --prepare)?

@shepmaster
Copy link

Could you go into more detail?

I'd rather not... but I will 😢

In that specific case, I was neglecting to build my binary correctly. I was doing cargo build --release instead of cargo build --release -p my-binary. That specific case was pure user error. 🤦

That being said, that's an example of some of the variance I get between three test runs of 10 or 25 iterations (I forget which at this point).

@sharkdp
Copy link
Owner Author

sharkdp commented Feb 15, 2021

That being said, that's an example of some of the variance I get between three test runs of 10 or 25 iterations (I forget which at this point).

Ok, but do you consider this to be a problem? That's like a ~5% variation in runtime. What did hyperfine actually report as standard deviation? How IO-heavy is your process (tends to be much more noisy when benchmarked)?

@romgrk
Copy link

romgrk commented Aug 3, 2021

Not much to add here but just to say I'd also be happy to have this option. I've been benchmarking a very IO heavy program (reading & writing GBs of data) and the benchmarks are very unreliable from one run to the next (even though I'm clearing disk caches on every iteration). Which version of the program is run first influences a lot the results.

@sharkdp
Copy link
Owner Author

sharkdp commented Oct 17, 2021

Not much to add here but just to say I'd also be happy to have this option. I've been benchmarking a very IO heavy program (reading & writing GBs of data) and the benchmarks are very unreliable from one run to the next (even though I'm clearing disk caches on every iteration). Which version of the program is run first influences a lot the results.

If this is a consistent result, I assume that some other form of caching is going on somewhere.

@comex
Copy link

comex commented Oct 17, 2021

True. But if it's, say, an SSD's internal SLC cache, I don't think there's any way to clear it other than just waiting.

Even if you do wait, both the filesystem's and the SSD's internal data structures are going to end up in different states than they started in, which might affect performance, albeit probably not by much.

@bhansconnect
Copy link

Hi, Just realized that I accidentally made a duplicate of this issue and wanted to post some of my context here.

I have run into issues with benchmarks that are very sensitive to minor condition changes, but have clear speed difference when run correctly. For example, I have done some benchmarking of compiling rather small files and also doing linking. That does a solid amount of file IO and will always have a decent amount of deviation. I have found that some of these cases, If the benchmarks are run non-interleaved, I will have to do something like -w 100 -m 1000 to get accurate results of which process is faster. This will get accurate results but will take a while to run. Since I am only interested in which case is faster, I don't really care about getting exact results (just clear performance indications). As a result, at multiple occasions, I have hacked together python scripts that time the results with interleaving. Even with a relatively small number of runs equivalent to something like -w 10 -m 50, I am able to rather consistent indications of which script is faster and approximately by how much. The important thing is that both test runs get hit pretty evenly by all of the performance issues (scheduling, cpu temp, disk io from anything else on the machine, etc).

I have also seen similar cases when measure certain hot loop code that has rather minor performance optimization. If I simply run the scripts non-interleaved, and don't do it for long enough, the first run script will almost always be the faster. To my knowledge, this is simply due to the cpu heating up and thermal throttling. Something I was running earlier today, I started testing with -w 1000 -m 5000 in order to get consistent results. I think that some types of testing simply perform very poorly due even minor amounts of system noise. As such, I found that interleaved testing can be paramount to getting quick results.

I guess as an extra note, the other option to enable shorter runs while getting decent results is to fully reboot the machine and make sure not to open apps that could interfere with benchmarking. This works, but is a lot of extra effort to deal with something that is mostly fixed by interleaving.

Hopefully these comments help to present a clear picture of why this feature could be extremely useful.

PS I run zfs even on a laptop and it can really lead to less consistent with benchmarks that have any sort of file io even with prepare dropping caches.

@bhansconnect
Copy link

bhansconnect commented Feb 22, 2022

Oh, one extra note, I think standard error is much more important than standard deviation in these cases. For example, if you have a script that takes 20 millisecond to run with a standard deviation of 5 ms, and another script that only takes 18 ms, but the standard deviation is also 5 ms, looking at mean + standard deviation will never tell you which is faster. That being said, after enough runs the stardard error will trend towards zero making it clear that the 18 ms script is actually faster. We may want to also expose standard error, because it will help to properly distinguish whether the difference was just due to noise and regular deviation, or if the results were meaningfully different.

@sharkdp
Copy link
Owner Author

sharkdp commented Feb 22, 2022

Oh, one extra note, I think standard error is much more important than standard deviation in these cases. For example, if you have a script that takes 20 millisecond to run with a standard deviation of 5 ms, and another script that only takes 18 ms, but the standard deviation is also 5 ms, looking at mean + standard deviation will never tell you which is faster. That being said, after enough runs the stardard error will trend towards zero making it clear that the 18 ms script is actually faster. We may want to also expose standard error, because it will help to properly distinguish whether the difference was just due to noise and regular deviation, or if the results were meaningfully different.

Thanks for bringing this up. I have thought about this often, and I was never sure which would be better to show. Both values are very useful. The problem is that most users probably don't know the difference very well.

But you are absolutely right, of course. A command with 18 ± 5 ms can be faster than 20 ± 5 ms in a statistically significant way, but there is no way to tell from the current output. We do, however, ship a scripts/welch_ttest.py Python script that performs a statistical test on exported benchmark results to answer this particular question: is command A actually faster than command B?

We also have a scripts/advanced_statistics.py script where we could very easily include the standard error.

But I'm also open to discussions whether or not this should be in the default output of hyperfine.

@sharkdp
Copy link
Owner Author

sharkdp commented Feb 22, 2022

I have run into issues with benchmarks that are very sensitive to minor condition changes, but have clear speed difference when run correctly. For example, I have done some benchmarking of compiling rather small files and also doing linking. That does a solid amount of file IO and will always have a decent amount of deviation. I have found that some of these cases, If the benchmarks are run non-interleaved, I will have to do something like -w 100 -m 1000 to get accurate results of which process is faster. This will get accurate results but will take a while to run. Since I am only interested in which case is faster, I don't really care about getting exact results (just clear performance indications). As a result, at multiple occasions, I have hacked together python scripts that time the results with interleaving. Even with a relatively small number of runs equivalent to something like -w 10 -m 50, I am able to rather consistent indications of which script is faster and approximately by how much. The important thing is that both test runs get hit pretty evenly by all of the performance issues (scheduling, cpu temp, disk io from anything else on the machine, etc).

It would be great if you could perform such a benchmark in a naive kind of way and share the results. Something like

hyperfine --runs 1000 --export-json benchmark.json          "cmd1" "cmd2"
hyperfine --runs 1000 --export-json benchmark_reversed.json "cmd2" "cmd1"

I would love to take a closer look at the results and the progression of the runtimes.

@bhansconnect
Copy link

bhansconnect commented Feb 25, 2022

Ok, so here is an example of a command that timing can be a nightmare for. So this command over a very long term average, will run in about 18ms. If you look at the sample timing that I posted, the timing average until line 646 is approximately 18ms, but after line 646, the average looks to be closer to 12ms.

I am almost 100% certain the timing issues are related to the amount of mmapping and file io done, but I am specifically trying to time the command with a warm cache. This command is essentially never used with a cold cache and those timings are not helpful. When timing this command against other variants of itself, or similar commands, they will generally run into the same sorts of crazy timing jumps. Generally speaking, when interleaved, both versions get timing peaks and valleys at the same time. Hopefully this gives enough context even if it isn't exactly what you asked for.

@bhansconnect
Copy link

I just accidentally recorded an even better example.

I wanted to test 3 versions of an executable against a different command. I expected hyperfine would run: version1, version2, version3, different command. Instead, it ran: version1, different command, version2, different command, version3, different command because that is how parameters work (not a bug, just my mistake). The different command didn't use the parameters at all.

Anyway, the different command took:

  • 9.2 ms ± 0.2 ms the first time running
  • 9.7 ms ± 1.9 ms the second time running
  • 15.2 ms ± 3.7 ms the third time running

Depending on which run I look at, it could be faster or about the same speed as most of the other commands I was testing. Hopefully my wording made sense here.

@sharkdp
Copy link
Owner Author

sharkdp commented Feb 27, 2022

Very interesting. Thank you very much. Here is how a histogram of the first benchmark looks like (python scripts/plot_histogram.py timings.json --bins=300). It's not just two distinct peaks, but at least four(!):

image

Could you please tell us a bit more about the command you are running and maybe the system that you are running it on (if possible)? What kind of background processes are running (browser, spotify, dropbox)?

Also related: #481

@bhansconnect
Copy link

Thanks for pointing out the 4 peaks. I am used to this having 2 peaks but not 4. Turns out that a daemon I had running in the background was turning turbo on and off. That is what was leading to the much larger distribution swing. That also makes sense as to why interleaving would stabilize the issue. Turbo mode would be on for plenty long enough that all the commands would execute multiple times before turning back off. So all commands would see the same highs and lows from turbo mode.

Anyway, this still leaves me with 2 peaks that cause problems when comparing multiple versions of this command. I just collected a much longer trace with turbo mode always on. This recording was done, in performance mode, with turbo always on, after a restart (and minor delay), with the internet off, and only a terminal + htop open. I scanned through all of the running processes, and nothing looks more interesting than what would be found in a normal ubuntu install. Generaly speaking, the terminal windows had the highest CPU usage at about 2% because this app is so short running and does a lot of file io. Extra note about my setup, I run ZFS as my file system, which I am sure creates a more complex caching and file io scenario, but I am not sure if it is the root cause for the deviation.

For more context on the command, this is compiler and linker running. So the base process is about what would be expected: load source files, process them until you have assembly, write that to the file system, do linking to get an executable. The language is a bit weird because it embeds into a binary compiled by another language. Due to this property, we have our own custom linker that surgically appends our output object on to an existing binary updating what is needed. As such, we do a lot of file io and memory mapping: Copy preprocessed binary to output location, mmap it, load a metadata file, mmap our output bytes, copy the bytes over, do a bunch of random updates across the file to make sure it is ready, and finally dump it to disk.

We also can run with a traditional linker. There are the timings from running with a regualar linker. Looking at these timings, they are much much smoother (though maybe still bimodal?). The regular linker actually does much cleaner file io due to not mmaping and writing to disk linearly after building in memory.

Given how much more bimodal the surgical linker is, I would posit that all of the memory mapping and potentially random file io is causing the clear peaks, but that is really just a guess.

@sharkdp
Copy link
Owner Author

sharkdp commented Feb 27, 2022

Very interesting. This is how all 10,000 benchmark runs (except for a few slow outliers outside of the y-axis range) look like in sequence:

image

We can clearly see the bimodal distribution. We can also see that the pattern stays constant over time. Except for a few interferences(?) around n=1550 and n=8700.

For comparison, this is how the regular linker benchmark looks like:

image

And for completeness, here are the two histograms:

image

image

Given that the bimodal pattern stays constant over time, I actually don't really see why an interleaved benchmark execution would help here. Maybe that's just the "true" runtime distribution (for whatever reason)?

@bhansconnect
Copy link

bhansconnect commented Feb 27, 2022

So I know that it looks like a consistent bimodal distribution, but there is a correlation between points that are next to each other. If you look at the timings and imagine 2 different scenarios. The first scenario (non-interleaved) is that the first 5000 points belong to one command and the last 5000 points belong to the other command. The second scenario (interleaved) is that the even points belong to one command and the odd points belong to the other command.

Here is the graph for the non-interleaved scenario:
non-interleaved-data

Here is the graph of the interleaved scenario:
interleaved-data

Looking at the 2 charts. For the non-interleaved scenario, the orange bars slightly trail the blue bars, while in the interleaved scenario, they are essentially even.

To be more specific on numbering. In the non-interleaved scenario, the second half is slower than the first half by 0.5 ms with a standard error of 0.13 ms (outside of the stand error and theoretically significant). In the interleaved scenario, the results are essential even with a difference of 0.02 ms with a standard error of 0.12 ms (well within the standard error).

Note: this holds true even when removing the first 2000 points as a warmup.

@sharkdp
Copy link
Owner Author

sharkdp commented Feb 28, 2022

I see, so there's a lot going on here 😄. I'm not sure if we can trust non-robust estimators like mean and standard deviation here. The outliers have too large of an effect: here is a moving average view of all benchmark runs with a window size of num_runs = 1000 (note that I'm excluding the boundary regions here):

image

The plateaus in the data come from outliers way outside the visible range:

image

So these three bursts of outliers could also lead to the imbalance between the first and the second half in terms of mean run time.

If we look at robust estimators (median and median absolute deviation / MAD), the picture looks a bit different:

>>> def mad(X): return np.median(np.abs(X-np.median(X)))
... 
>>> np.median(times[:5000]), mad(times[:5000])
(0.01861407567, 0.0008725065000000001)
>>> np.median(times[5000:]), mad(times[5000:])
(0.01871475567, 0.0009133250000000013)

Using MAD/sqrt(N) as a robust estimator for the "standard error", we then find:

first half:  18.61 ± 0.01 ms
second half: 18.71 ± 0.01 ms

which still indicates a (statistically significant) difference of 0.1 ms between first and second half.

@bhansconnect
Copy link

Wow! Thanks for the detailed analysis. It is really helpful to see.

I guess this leaves me with the question: what is the correct plan of action given all of this information?

  • Interleaving still functionally seems to work as a solution for my use case, but I now feel like that may be more of an adhoc solution. It just helps to even out the number of interferences that each command sees. If my commands were longer, there is still a good chance that the interference could hit one command more than another even if they are interleaved.
  • Maybe I simply need to run with hyperfine and then use some post processing scripts to dig into the data more. I could generate these sorts of graphs with runs that happen to be close together.
  • Maybe hyperfine should add some flags to enable it to generate more statistics than just mean and standard deviation. It could also expose a flag to make it ignore or clean up outliers. Not sure what the most robust way to deal with outliers is.
  • Other?

@CyberShadow
Copy link

Interleave and use the minimum measurement. :)

@bhansconnect
Copy link

bhansconnect commented Feb 28, 2022

I don't think that would work well. I think my example falls into the same category as "networking" in the talk. The effects of file io matter to the performance. Just taking the minimum would remove a lot of the characteristics of the program that are extremely important. But interesting idea for certain classes of benchmarks. Also, I don't think interleaving matters if you are just taking the minimum.

@CyberShadow
Copy link

CyberShadow commented Feb 28, 2022

I think my example falls into the same category as "networking" in the talk. The effects of file io matter to the performance.

If you are doing I/O on a filesystem backed by a physical storage device, you are pulling in a number of very complex systems into the benchmark. I would suggest testing on an isolated tmpfs.

Just taking the minimum would remove a lot of the characteristics of the program that are extremely important.

I agree that taking the minimum doesn't make a lot of sense if the tested code is nondeterministic. One such cause of nondeterminism is state which outlasts the benchmarked program's execution, which would include the state of the filesystem (in RAM and on disk), OS mechanisms like caching layers, and any backing physical devices. I think such a benchmark would overall just not be very meaningful, unless each invocation started from a clean state (i.e. the filesystem is created and mounted from scratch every time).

Also, I don't think interleaving matters if you are just taking the minimum.

It can matter in case of a lag spike which would cover one benchmarked command's non-interleaved execution.

@bhansconnect
Copy link

I would suggest testing on an isolated tmpfs.

I definitely agree that can help, but I don't think that is a solution I want to take. The issues is that the complexities of the file system are huge part of the performance here. If those complexities are remove, the results are not accurate. Linking is essentially all file io. Yes, it does some other stuff and may even generate the final binary in memory before dumping, but there is no substantial CPU work. As such, taking advantage of a tmpfs can totally distort the performance picture.

@CyberShadow
Copy link

If you would like to include measuring the performance of the filesystem implementation as part of your benchmark, I would suggest to use a clean filesystem backed by RAM, fresh for every test.

If you would like to also include the physical storage device as part of your benchmark, I would suggest performing an ATA reset, and TRIM / ATA secure erase (or the NVMe equivalents) for flash devices, for every test.

Otherwise, each iteration depends on the previous iteration's state, plus any undefined initial state, plus interference with all other processes which interact with these systems. This just doesn't make for a very meaningful measurement - the results will be effectively unreproducible.

@bhansconnect
Copy link

So I decided to dig into this for my application just to better understand what really causes the performance variations. Turns out that there are 3 core distributions to the execution speed of my applications:

  1. Nothing cached: When you drop all caches, you, unsurprisingly, get single significantly slower distributions. In my case somewhere in the range of taking between 1.5 and 2x longer to run.
  2. "Lightly" cached: This distribution can be produced by clearing the cache and then running the app once before measuring the performance. This is the slower of the 2 peaks in the graphs produced above. Note: tends to get produced with a longer tail where the faster of the 2 peaks above is.
  3. "Heavily" cached: This distributions can be produced by clearing the cache and then running the app multiple times before measuring the performance. I arbitrarily ran it 10 times before measuring. This is the faster of the 2 peaks in the graphs produced above. Note: tends to get produced with a longer tail where the slower of the 2 peaks above is.

To my surprise, running on a tmpfs backed by memory vs on zfs backed by ssd made very minimal differences to any of these distributions.

These results make me really wish syncing and dropping caches wasn't so slow. I think that "lightly" cached scenario would be a good target for general benchmarking of this app, but that takes astronomically longer to run than not clearing caches at all only to record a slightly cleaner result. I might just have to do some sort of processing of the data to remove one of the peaks.

@sharkdp
Copy link
Owner Author

sharkdp commented Mar 5, 2022

Thank you for the feedback on this.

These results make me really wish syncing and dropping caches wasn't so slow.

How many runs do you perform? I would hope that 10-100 would be enough for reasonable statistics. Especially if you "control" the caching behavior. Not >1000 like you did for the full histograms above.

Do you know what the reason for the "lightly" and "heavily" cached scenarios is?

Why do we see the "lightly"-cached peak in the histograms at all? Shouldn't we mostly see the "heavily" cached scenario?

@bhansconnect
Copy link

I apparently didn't save my timings. Probably put them in the git directory and later cleared them, but I still have some clarifications.

First, this is what 100 samples heavily cached looks like. You can see the rough form of the relative ratio of the 2 peaks:
100-samples-heavy-cache
It takes about 8 and a half minutes to generate this graph with a warmup of 10 samples. In that time, I can generate approximately 50,000 samples with 5,000 warmup samples if I am not clearing the cache. So, that is what I mean by it is really slow. Though I guess the main peak is in the right spot, so maybe it is valid to run less samples.

As an extra note, this is approximately what my graph with 2000 samples looked like:
estimated_graph
Hopefully the graph explains the rough idea of what I meant by longer tail at the "lightly" cached peak.

As for the why. No clear idea, sadly. My only guess is maybe the files that are loaded directly get cached differently then the files that get mmaped. So "lightly" cached may only deal with files being cached and "heavily" cached may be when both files and mmaps are cached.

@bhansconnect
Copy link

So, did some more digging and found the cause of the bimodal distribution. It actually isn't cause by reading caching at all. It is caused by synchronously flushing an mmap to disk. Apparently that is bimodal.

If I change from a synchronous flush at the end of linking to an asynchronous flush, the bimodal distribution goes away. This sadly doesn't not quite seem to be a full fix in my case because it adds a weird behavior where there is about a 1 in 1000 chance that the execution will freeze for about 4 seconds. I assume this is due to the app trying to close before the mmap has finished flushing. No idea why that would freeze for so long, but probably a kernel problem of some sort.

@CyberShadow
Copy link

Even when the mmap is of a RAM-backed file? That's interesting, maybe you could dump /proc/<pid>/stack to see why it's taking so long when it's taking over a millisecond. Same with the 4-second delay, perhaps it might be due to accumulated writes that are being flushed all at once.

@bhansconnect
Copy link

bhansconnect commented Mar 6, 2022

So, I just had the realization that on my current machine /tmp isn't a tmpfs. So all of my test that I thought were ram backed, weren't. With a properly RAM-backed file, this is also fixed. Thanks so much for all of the help even if it ended up not really being hyperfine related.

Same with the 4-second delay, perhaps it might be due to accumulated writes that are being flushed all at once.

This looks to also be correct. I added -p 'sync output_file' and the giant 4 second long flushing no longer happens.

Thanks a ton for helping me dig through this. Really goes to show how many things can mess up benchmarking.

@allisonkarlitskaya
Copy link

I'm also in the laptop thermal throttle camp.

I have an ugly little workaround I use to avoid this issue. It works by using named pipes in the setup/prepare/cleanup functions to coordinate two hyperfine instances to take alternating turns, each waiting for the other to complete its next run:

# process "a"
hyperfine -r10 -s 'mkfifo a b || true; cat a' -p '>b; cat a' 'sleep 0.5'

# process "b" (from same working dir as above)
hyperfine -r10 -p '>a; cat b' -c '>a' 'sleep 0.2'

It only works if the number of iterations is fixed and equal on both sides.

There are probably nicer/fancier ways to do this, but the above solution works for me.

@sharkdp
Copy link
Owner Author

sharkdp commented Nov 23, 2024

I think I changed my mind on this. This is a feature that should be supported natively by hyperfine.

@johnsonjh
Copy link

johnsonjh commented Dec 28, 2024

@sharkdp I would love to see this feature.

I have a particular laptop which always runs about 20% faster for 10 minutes or so, then settles down to the baseline. I've been unable to "fix" this by BIOS settings or performance settings in the operating system, after many attempts. This appears not to be a "turbo" artifact but a thermal limit. Other than specifying a long sleep as a cleanup command (which allows for the CPU to cool down and makes testing take forever), alternating the tests between command command 1 and command 2 would produce fairer results.

Alternately, some sort of options for cleaning the data would be useful, but probably would be too easy to "abuse". For example, automatic removal of outliers by calculating an upper and lower bounds by taking N times (where a useful N is probably somewhere between 1.2 and 5, configurable by the user) standard deviations from the mean of the tested values, assuming an otherwise normal distribution.

@sharkdp sharkdp changed the title Option to interleave benchmarks for multiple commands Interleaved benchmark execution (multiple commands) Dec 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants