Outlier detection seems to report false positives #2
Thank you for the feedback and detailed information! I have noticed this as well, but it uses the same modified Z-scores algorithm as hyperfine. I think if you plugged those times into your test suite, you would get the same results. Note that hyperfine had also been missing 50% of the outliers because of sharkdp/hyperfine#329. The algorithm does not like it when many of the times are the same, so the "really well defined runtime" is actually the issue. In this case, it is considering the seven |
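To make that failure mode concrete, here is a minimal sketch of modified Z-score outlier detection. It uses the textbook Iglewicz-Hoaglin constant (0.6745) and threshold (3.5) and made-up times; the exact constants in hyperfine and in this port may differ:

/* Minimal sketch of modified Z-score outlier detection (Iglewicz-Hoaglin:
   flag x when |0.6745 * (x - median) / MAD| > 3.5).  The constants are the
   textbook values and the times below are made up; both may differ from
   what hyperfine or this port actually uses. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static int cmp(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* median of v[0..n-1]; sorts v in place */
static double median(double *v, int n) {
    qsort(v, n, sizeof *v, cmp);
    return n % 2 ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
}

int main(void) {
    double times[] = {4.1, 4.1, 4.1, 4.1, 4.1, 4.1, 4.1, 4.2, 4.2, 4.3};
    int n = sizeof times / sizeof *times;
    double tmp[n], dev[n];

    for (int i = 0; i < n; ++i) tmp[i] = times[i];
    double med = median(tmp, n);

    for (int i = 0; i < n; ++i) dev[i] = fabs(times[i] - med);
    double mad = median(dev, n); /* median absolute deviation */

    /* When more than half of the times are identical, MAD is 0 and every
       sample that differs from the median at all gets an infinite score,
       so a "really well defined runtime" produces lots of "outliers". */
    for (int i = 0; i < n; ++i) {
        double z = times[i] == med ? 0.0
                 : mad > 0.0      ? 0.6745 * fabs(times[i] - med) / mad
                                  : INFINITY;
        printf("%.3f  modified z = %6.2f%s\n", times[i], z,
               z > 3.5 ? "  <- outlier" : "");
    }
    return 0;
}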
You are right. I just did that, and it also reports 10 outliers. So the problem is the internal precision of "Benchmarking-Tool". Hyperfine uses microsecond precision internally and shows times like these (with too many digits...):
... resulting in 0 outliers. |
Thank you for checking that!
Yes, this port uses the Bash |
It reports times in milliseconds, but measures in microseconds. That's the resolution we naturally get from Unix time functions. I said that I don't know how to make a more precise measurement, because we launch an intermediate shell, which acts as a source of noise on the order of 1 millisecond.
Not sure about the 1 tick ~ 10 ms thing, because you can absolutely measure times at microsecond resolutions:

#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>
#include <math.h>

int main() {
    struct timeval st, et;
    gettimeofday(&st, NULL);
    for (int i = 0; i < 10000; ++i) {
        sqrt(i);
    }
    gettimeofday(&et, NULL);
    int elapsed =
        ((et.tv_sec - st.tv_sec) * 1000000) + (et.tv_usec - st.tv_usec);
    printf("Elapsed time: %d micro seconds\n", elapsed);
}

This program consistently reports times around 15 us. If I use 10 times as many iterations, it takes around 150 us. |
Yes, it can measure times at microsecond resolution, but unless you are running a tickless kernel where programs are guaranteed not to get context switched, I do not think you can accurately measure runtimes shorter than 1 tick. Depending on how many other processes are running and how many context switches occur, measurements of less than 1 CPU tick are usually noisy and inaccurate. For your example C program, if a context switch happened in the middle of that for loop, it could report up to ~10015 us instead of ~15 us. I ran your C program 100,000 times with these commands (it takes a few minutes to run):

gcc -Wall -O3 -o test test.c
for i in {0..100000}; do ./test; done | sed -n 's/^.*: //p' | sort -nr | { head; echo '…'; tail; }

The times ranged from 12 to 3881 us. Since the C for loop is obviously CPU bound with no I/O, I believe context switches are the only thing that can account for the huge difference in times. You can also run this command to generate interference and increase the number of context switches. |
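The same spread is visible from inside a single process. Here is a rough sketch of mine (the sample and iteration counts are arbitrary) that times the same sqrt loop many times and prints the minimum, median, and maximum; most samples cluster near the minimum, while a few are inflated by preemption:

/* Time the same CPU-bound loop many times within one process and report
   min/median/max, to show how much of the spread comes from preemption
   rather than from the work itself.  Illustrative sketch only. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <math.h>

#define SAMPLES 1000
#define ITERS   10000

static int cmp(const void *a, const void *b) {
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

int main(void) {
    static long us[SAMPLES];
    volatile double sink = 0.0; /* keep the loop from being optimized away */

    for (int s = 0; s < SAMPLES; ++s) {
        struct timeval st, et;
        gettimeofday(&st, NULL);
        for (int i = 0; i < ITERS; ++i)
            sink += sqrt((double)i);
        gettimeofday(&et, NULL);
        us[s] = (et.tv_sec - st.tv_sec) * 1000000L + (et.tv_usec - st.tv_usec);
    }

    qsort(us, SAMPLES, sizeof *us, cmp);
    printf("min %ld us, median %ld us, max %ld us\n",
           us[0], us[SAMPLES / 2], us[SAMPLES - 1]);
    return 0;
}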
Well, ok. But that would be reported as an outlier. To dig a little deeper, I wrote this C program (I chose C out of curiosity and to have the least amount of overhead possible):
#include <stdio.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char* argv[]) {
    if (argc < 2) {
        printf("Usage: clock [prog] [args...]\n");
        return 1;
    }
    struct timeval start, end;
    int status;
    gettimeofday(&start, NULL);
    int pid = fork();
    if (pid == 0) {
        execv(argv[1], argv + 1);
        return 1;
    } else if (pid < 0) {
        printf("Error creating child process\n");
        return 1;
    } else {
        if (waitpid(pid, &status, 0) != pid) {
            printf("Failed waiting\n");
            return 1;
        }
    }
    gettimeofday(&end, NULL);
    if (WEXITSTATUS(status)) {
        printf("Child process failed with exit status %d\n",
               WEXITSTATUS(status));
        return 1;
    }
    int elapsed =
        ((end.tv_sec - start.tv_sec) * 1000000) + (end.tv_usec - start.tv_usec);
    printf("Elapsed time: %d us\n", elapsed);
}

... and compile with

gcc -O2 -Wall clock.c -o clock

It accurately measures actual runtimes:
The ~800 us overhead here likely comes from the fork/execve/wait syscalls. We can see a similar overhead when timing really simple programs like
Or when timing the program
But there is some signal. When timing the program
Now obviously, there can be interference from other programs, causing context switches. Here, I had
But nevertheless, it seems possible to measure execution times faster than 10 ms. Yes, there is some overhead from the actual measurement process (on the order of several hundred microseconds). And yes, a context switch to another program might completely spoil the measurement. But 10 ms is not a fundamental limit for benchmarking external programs. |
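As a rough way to quantify that overhead (an illustration of mine, assuming /bin/true exists at that path), one can time an essentially empty child many times and take the minimum; the minimum estimates the fixed cost of fork/execv/waitpid and could in principle be subtracted from real measurements:

/* Estimate the fork/exec/wait overhead by timing an (almost) empty child
   many times and keeping the smallest measurement.  Sketch only. */
#include <stdio.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    long min_us = -1;
    for (int i = 0; i < 200; ++i) {
        struct timeval start, end;
        gettimeofday(&start, NULL);
        pid_t pid = fork();
        if (pid == 0) {
            execl("/bin/true", "true", (char *)NULL);
            _exit(127); /* exec failed */
        } else if (pid < 0) {
            perror("fork");
            return 1;
        }
        int status;
        waitpid(pid, &status, 0);
        gettimeofday(&end, NULL);
        long us = (end.tv_sec - start.tv_sec) * 1000000L
                + (end.tv_usec - start.tv_usec);
        if (min_us < 0 || us < min_us)
            min_us = us;
    }
    printf("estimated fork/exec/wait overhead: ~%ld us\n", min_us);
    return 0;
}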
How do you know that it is accurate and that one or more context switches did not occur? I am not aware of a way to determine if a context switch occurred on a specific process, but you can use the Pressure Stall Information (PSI) added to version 4.20 of the Linux kernel to determine the percentage of runnable processes that were delayed, meaning they were context switched. Just run Note that only running processes can be context switched. The
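For reference, the CPU pressure figures PSI exposes can be read from /proc/pressure/cpu. This is just one way to look at them, not necessarily the command referred to above, and it requires a kernel (4.20 or newer) built with PSI support:

#include <stdio.h>

int main(void) {
    /* /proc/pressure/cpu reports, among other things, the share of time
       runnable tasks were stalled waiting for a CPU over the last
       10/60/300 seconds, e.g. "some avg10=0.00 avg60=0.12 ...". */
    FILE *f = fopen("/proc/pressure/cpu", "r");
    if (!f) {
        perror("/proc/pressure/cpu"); /* kernel too old or PSI disabled */
        return 1;
    }
    char line[256];
    while (fgets(line, sizeof line, f))
        fputs(line, stdout);
    fclose(f);
    return 0;
}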
I never claimed that it was a limit. You can of course benchmark programs with a runtime of less than 1 tick. I just believe that you cannot do so accurately. As your second and fifth examples show, context switches can produce noise spanning several orders of magnitude, making it extremely difficult to determine the actual runtime. I am assuming this is at least partially why hyperfine outputs a warning that the "results might be inaccurate" if you try to benchmark a command that takes less than 5 ms. As with your first C program, you would need to run your

for i in {0..100000}; do ./clock ./test; done | grep 'us' | sed -n 's/^.*: //p' | sort -nr | { head; echo '…'; tail; }

BTW, I am very impressed that you took the time to create these C programs! |
Interesting, I did not realize it could display this.
Yes, this would be my hope as well. Benchmarking fast programs would obviously be very useful. Unfortunately, even seemingly "quiet" systems with no open applications still usually have hundreds or more background processes and services that can cause context switches (just run
Thanks again for your feedback!
Yes, I agree! I will think about this and see if I can figure out a potential solution. I will also consider adding a new option to my port to use

BTW, nice graphs. (I always wonder why no one has created a CLI tool yet to output graphs like those to the console, but I think it would be a cool Rust project that could save users from having to copy/paste the data into an external application.) |
The outlier detection seems to report a lot of outliers, even for commands that have a really well defined runtime:
The actual times are: