
Outlier detection seems to report false positives #2

Open
sharkdp opened this issue Oct 17, 2020 · 9 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)


sharkdp commented Oct 17, 2020

The outlier detection seems to report a lot of outliers, even for commands that have a really well defined runtime:

▶ bash time.sh -j test.json 'sleep 0.1'
Benchmark #1: sleep 0.1
  Time (x̅ mean ± σ std dev):      0.1019s ±  0.0006s          [User: 0.0014s, System: 0.0004s]
  Range (min … x̃ median … max):   0.101s …  0.102s …  0.103s   CPU:   1.8%, 29 runs

Warning: 10 statistical outlier(s) were detected (> 14.826 modified Z-scores or about 10σ std devs). Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the warmup or prepare options.

The actual times are:

{
  "results": [
    {
      "command": "sleep 0.1",
      "mean": 0.101862068965517,
      "stddev": 0.000570791219210244,
      "median": 0.102,
      "user": 0.00144827586206897,
      "system": 0.000413793103448276,
      "min": 0.101,
      "max": 0.103,
      "times": [
        0.102,
        0.102,
        0.102,
        0.102,
        0.102,
        0.102,
        0.102,
        0.102,
        0.102,
        0.103,
        0.102,
        0.102,
        0.101,
        0.101,
        0.102,
        0.103,
        0.101,
        0.103,
        0.101,
        0.102,
        0.102,
        0.102,
        0.102,
        0.101,
        0.102,
        0.102,
        0.102,
        0.101,
        0.101
      ]
    }
  ]
}
tdulcet added the enhancement and help wanted labels on Oct 18, 2020
tdulcet (Owner) commented Oct 19, 2020

Thank you for the feedback and detailed information! I have noticed this as well, but it uses the same modified Z-scores algorithm as hyperfine. I think if you plugged those times into your test suite, you would get the same results. Note that hyperfine had also been missing 50% of the outliers because of sharkdp/hyperfine#329.

The algorithm does not like it when many of the times are the same, so the "really well defined runtime" is actually the issue. In this case, it is considering the seven 0.101s and three 0.103s as outliers since there are 19 0.102s. I do not think there is anything I can do other than using a different algorithm to detect outliers.

sharkdp (Author) commented Oct 24, 2020

I think if you plugged those times into your test suite, you would get the same results.

You are right. I just did that - and it also reports 10 outliers.

So the problem is the internal precision of "Benchmarking-Tool". Hyperfine uses microsecond precision internally and shows times like these (with too many digits...):

    0.10058405167500001,
    0.101267848675,
    0.101255919675,
    0.10136278267500001,
    0.101470384675,
    0.10146407267500002,
    0.10123459267500001,
    0.10098665167500001,
    0.10117769467500001,
    0.101385481675,
    0.10121925167500001,
    0.10138741767500001,
    0.10133836867500001,
    0.101490163675,
    0.10130392067499999,
    0.101319621675,
    0.101263722675,
    0.101326950675,
    0.101310556675,
    0.101273586675,
    0.10111377267500002,
    0.101259305675,
    0.10116451167500001,
    0.10138028067500002,
    0.10156199367500002,
    0.10149002767500001,
    0.101329747675,
    0.101347716675,
    0.101443286675,

... resulting in 0 outliers.

tdulcet (Owner) commented Oct 25, 2020

You are right. I just did that - and it also reports 10 outliers.

Thank you for checking that!

So the problem is the internal precision of "Benchmarking-Tool". Hyperfine uses a microsecond precision internally and shows times like these (with too many digits...)

Yes, this port uses the Bash time builtin, which only supports millisecond resolution. That is the same resolution you said hyperfine provides in #1. I could use the GNU time command, but that would have a significant performance cost. I also do not think most systems can accurately/precisely measure time faster than 1 ms anyway, considering that a CPU tick is usually 10 ms. Therefore, I think the only solution is to use a different algorithm to detect outliers.

sharkdp (Author) commented Oct 25, 2020

That is the same resolution you said hyperfine provides in #1.

It reports times in milliseconds, but measures in microseconds. That's the resolution we naturally get from the Unix time functions. I said that I don't know how to make a more precise measurement because we launch an intermediate shell, which acts as a source of noise on the order of 1 millisecond.

I also do not think most systems can accurately/precisely measure time faster than 1 ms anyway, considering that a CPU tick is usually 10 ms.

Not sure about the 1 tick ~ 10 ms thing, because you can absolutely measure times at microsecond resolutions:

#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>
#include <math.h>

int main() {
    struct timeval st, et;

    gettimeofday(&st, NULL);
    for (int i = 0; i < 10000; ++i) {
        sqrt(i);
    }
    gettimeofday(&et, NULL);

    int elapsed =
        ((et.tv_sec - st.tv_sec) * 1000000) + (et.tv_usec - st.tv_usec);

    printf("Elapsed time: %d micro seconds\n", elapsed);
}

This program consistently reports times around 15 us. If I use 10 times as many iterations, it takes around 150 us.

tdulcet (Owner) commented Oct 26, 2020

Not sure about the 1 tick ~ 10 ms thing, because you can absolutely measure times at microsecond resolutions:

Yes, it can measure times at microsecond resolution, but unless you are running a tickless kernel where programs are guaranteed not to get context switched, I do not think you can accurately measure runtimes shorter than 1 tick. Depending on how many other processes are running and how many context switches occur, measurements of less than 1 CPU tick are usually noisy and inaccurate. For your example C program, if a context switch happened in the middle of that for loop, it could report up to ~10015 us instead of ~15 us.

I ran your C program 100,000 times with these commands (it takes a few minutes to run):

gcc -Wall -O3 -o test test.c
for i in {0..100000}; do ./test; done | sed -n 's/^.*: //p' | sort -nr | { head; echo ''; tail; }

The times ranged from 12 to 3881 us. Since the C for loop is obviously CPU bound with no I/O, I believe this is the only thing that can account for the huge difference in times. You can also run this command to generate interference and increase the number of context switches.

sharkdp (Author) commented Oct 26, 2020

For your example C program, if a context switch happened in the middle of that for loop, it could report up to ~10015 us instead of ~15 us.

Well, ok. But that would be reported as an outlier.

To dig a little deeper, I wrote this C program (I chose C out of curiosity and to have the least amount of overhead possible):

clock.c:

#include <stdio.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char* argv[]) {
    if (argc < 2) {
        printf("Usage: clock [prog] [args...]\n");
        return 1;
    }

    struct timeval start, end;
    int status;

    gettimeofday(&start, NULL);

    int pid = fork();
    if (pid == 0) {
        execv(argv[1], argv + 1);
        return 1;
    } else if (pid < 0) {
        printf("Error creating child process\n");
        return 1;
    } else {
        if (waitpid(pid, &status, 0) != pid) {
            printf("Failed waiting\n");
            return 1;
        }
    }

    gettimeofday(&end, NULL);

    if (WEXITSTATUS(status)) {
        printf("Child process failed with exit status %d\n",
               WEXITSTATUS(status));
        return 1;
    }

    int elapsed =
        ((end.tv_sec - start.tv_sec) * 1000000) + (end.tv_usec - start.tv_usec);

    printf("Elapsed time: %d us\n", elapsed);
}

... and compile with:

gcc -O2 -Wall clock.c -o clock

It accurately measures actual runtimes:

▶ ./clock /usr/bin/sleep 0.010
Elapsed time: 10755 us

The ~800 us overhead here likely comes from the fork/execve/wait syscalls. We can see a similar overhead when timing really simple programs like echo:

▶ ./clock /usr/bin/echo hello world
hello world
Elapsed time: 559 us

Or when timing the program nothing which is a statically linked executable for int main() {}:

▶ ./clock ./nothing
Elapsed time: 551 us

But there is some signal. When timing the program sleep500us which does usleep(500), we see times like:

▶ ./clock ./sleep500us
Elapsed time: 1024 us

Now obviously, there can be interference from other programs - causing context switches. Here, I had stress --cpu 8 running in the background:

▶ ./clock /usr/bin/echo hello world   
hello world
Elapsed time: 22744 us

But nevertheless, it seems possible to measure execution times faster than 10 ms. Yes, there is some overhead from the measurement process itself (on the order of several hundred microseconds). And yes, a context switch to another program might completely spoil a measurement. But 10 ms is not a fundamental limit for benchmarking external programs.

tdulcet (Owner) commented Oct 27, 2020

It accurately measures actual runtimes

How do you know that it is accurate and that one or more context switches did not occur? I am not aware of a way to determine if a context switch occurred on a specific process, but you can use the Pressure Stall Information (PSI) added to version 4.20 of the Linux kernel to determine the percentage of runnable processes that were delayed, meaning they were context switched. Just run cat /proc/pressure/cpu or use my System Usage Information script, which produces more human readable color output. Although without running a tickless kernel, there is no way to keep those percentages at zero.

Note that only running processes can be context switched. The sleep command and the usleep() call in your first and fourth examples, respectively, put the process into a sleep state, so those runtimes should still be relatively accurate since context switches cannot occur.

But 10 ms it not a fundamental limit for benchmarking external programs.

I never claimed that it was a limit. You can of course benchmark programs with a runtime of less than 1 tick; I just believe that you cannot do so accurately. As your second and fifth examples show, context switches can produce many orders of magnitude of noise, making it extremely difficult to determine the actual runtime. I am assuming this is at least partially why hyperfine outputs a warning that the "results might be inaccurate" if you try to benchmark a command that takes less than 5 ms.

As with your first C program, you would need to run your clock program a few thousand times to see the issue. For example, here is a command to have it run your first program:

for i in {0..100000}; do ./clock ./test; done | grep 'us' | sed -n 's/^.*: //p' | sort -nr | { head; echo ''; tail; }

BTW, I am very impressed that you took the time to create these C programs!

sharkdp (Author) commented Oct 27, 2020

How do you know that it is accurate and that one or more context switches did not occur? I am not aware of a way to determine if a context switch occurred on a specific process

You can use /usr/bin/time (not the bash builtin) with the -v option to see the number of context switches:

▶ /usr/bin/time -v fd
…
	Voluntary context switches: 228
	Involuntary context switches: 101
…

It is available in rusage information, see man 2 getrusage:

           struct rusage {
               //
               long   ru_nvcsw;         /* voluntary context switches */
               long   ru_nivcsw;        /* involuntary context switches */
           };

Note that only running processes can be context switched. The sleep command and the sleep() system calls in your first and fourth examples respectively put the process into a sleep state, so the runtimes should still be relatively accurate since context switches cannot occur.

Good point!

I never claimed that it was a limit. You can of course benchmark programs with a runtime of less than 1 tick. I just believe that you cannot do so accurately. As your second and fifth examples show, context switches can produce many orders of magnitude of noise, making it extremely difficult to determine the actual runtime.

I am also not claiming the opposite 😄. I am mostly just curious if it can be done. My hope would be that one could benchmark fast programs (on the order of milliseconds) accurately on a quiet machine (which should be the default when running benchmarks) by repeatedly measuring the runtime (which is what hyperfine or your tool does). We would then hopefully see a large percentage of executions without any context switches plus a few outliers from benchmarks where the kernel intervened.

I am assuming this is at least partially why hyperfine outputs a warning that the "results might be inaccurate" if you try to benchmark a command that takes less than 5 ms.

Yes and no. The warning is mostly there due to the limitation that hyperfine runs the command through an intermediate shell, which adds some overhead (that is also noisy). We try to compensate for that, but I am pretty sure the compensation is not accurate enough to reliably measure runtimes below 5 ms.

for i in {0..100000}; do ./clock ./test; done | grep 'us' | sed -n 's/^.*: //p' | sort -nr | { head; echo '…'; tail; }

Will take a closer look at this. The distribution of times looks like this (10,000 runs):

[histogram of the measured times]

And with a logarithmic y-axis:

[same histogram, logarithmic y-axis]

So there is a clear peak at ~500 us, but there is also a wide tail with only a few definitive outliers. So maybe it's not that easy after all.

BTW, I am very impressed that you took the time to create these C programs!

Well - I learned a lot from this thread and I am always interested in making progress in my projects. It would be great if we could extend accurate benchmarking to the 500us - 5 ms range. And also potentially be more accurate in measurements > 5ms.

tdulcet (Owner) commented Oct 28, 2020

You can use /usr/bin/time (not the bash builtin) with the -v option to see the number of context switches

Interesting, I did not realize it could display this.

My hope would be that one could benchmark fast programs (on the order of milliseconds) accurately on a quiet machine (which should be the default when running benchmarks) by repeatedly measuring the runtime (which is what hyperfine or your tool does).

Yes, this would be my hope as well. Benchmarking fast programs would obviously be very useful. Unfortunately, even seemingly "quiet" systems with no open applications usually still have hundreds or more background processes and services that can cause context switches (just run htop or top to see them).

Well - I learned a lot from this thread and I am always interested in making progress in my projects.

Thanks again for your feedback!

It would be great if we could extend accurate benchmarking to the 500us - 5 ms range. And also potentially be more accurate in measurements > 5ms.

Yes, I agree! I will think about this and see if I can figure out a potential solution. I will also consider adding a new option to my port to use /usr/bin/time instead of the Bash time builtin to solve the original issue of false positives.

BTW, nice graphs. (I always wonder why no one has created a CLI tool yet to output graphs like those to the console, but I think it would be a cool Rust project that could save users from having to copy/paste the data into an external application.)
