Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow baseline per category #617

Closed
mgravell opened this issue Jan 9, 2018 · 6 comments
Closed

Allow baseline per category #617

mgravell opened this issue Jan 9, 2018 · 6 comments
Assignees
Milestone

Comments

@mgravell
Copy link
Member

mgravell commented Jan 9, 2018

Edit: if there's already a way to do this and I'm just being ignorant: please say!

Often I'm trying to compare alternative implementations that can impact multiple scenarios - essentially baseline vs option 1 (vs option 2 etc), but for multiple separate tests A, B, C.

The Baseline = true feature is great, but only really allows a single baseline. If multiple methods are marked as Baseline, then the test runner fails and complains at you.

However, [BenchmarkCategory(...)] exists (via #248). It is currently only used to filter tests to run, but it could be much richer:

  • the result grids could be split by category
  • the relative comparisons against baseline could be computed by category

so instead of:

                   Method |         Mean |      Error |     StdDev |      Op/s | Scaled | ScaledSD |    Gen 0 |   Gen 1 |  Allocated |
------------------------- |-------------:|-----------:|-----------:|----------:|-------:|---------:|---------:|--------:|-----------:|
         'single, Stream' |     61.32 us |  1.2102 us |  1.8842 us | 16,309.04 |   1.00 |     0.00 |   2.2583 |  0.0292 |   13.88 KB |
 'single, ReadOnlyBuffer' |     52.98 us |  0.1676 us |  0.1568 us | 18,874.14 |   0.86 |     0.03 |   2.2583 |  0.0042 |   13.88 KB |
          'multi, Stream' | 14,285.07 us | 54.3681 us | 50.8560 us |     70.00 | 233.18 |     6.95 | 564.1667 | 10.0000 | 3470.77 KB |
  'multi, ReadOnlyBuffer' | 13,219.59 us | 34.1087 us | 31.9053 us |     75.65 | 215.79 |     6.41 | 564.1667 |  0.8333 | 3470.76 KB |

we could have:


Category: 'multi'

                   Method |         Mean |      Error |     StdDev |      Op/s | Scaled | ScaledSD |    Gen 0 |   Gen 1 |  Allocated |
------------------------- |-------------:|-----------:|-----------:|----------:|-------:|---------:|---------:|--------:|-----------:|
                 'Stream' |     61.32 us |  1.2102 us |  1.8842 us | 16,309.04 |   1.00 |     0.00 |   2.2583 |  0.0292 |   13.88 KB |
         'ReadOnlyBuffer' |     52.98 us |  0.1676 us |  0.1568 us | 18,874.14 |   0.86 |     0.03 |   2.2583 |  0.0042 |   13.88 KB |

Category: 'single'

                   Method |         Mean |      Error |     StdDev |      Op/s | Scaled | ScaledSD |    Gen 0 |   Gen 1 |  Allocated |
------------------------- |-------------:|-----------:|-----------:|----------:|-------:|---------:|---------:|--------:|-----------:|
                 'Stream' | 14,285.07 us | 54.3681 us | 50.8560 us |     70.00 |   1.00 |     *.** | 564.1667 | 10.0000 | 3470.77 KB |
         'ReadOnlyBuffer' | 13,219.59 us | 34.1087 us | 31.9053 us |     75.65 |   0.93 |     *.** | 564.1667 |  0.8333 | 3470.76 KB |

with Scaled / ScaledSD being relative to the Baseline (if one) in that same category.

(*.** is just where I haven't "done the math" by hand; to be clear: "single" and "multi" here are completely different tests - it isn't just more of the same - naming is hard)

If necessary, this could be an opt-in SplitResultsByCategory feature on custom options. Or it could be implicit: "there's multiple baselines == split by category" (since this won't have worked previously, this can't change existing working behaviour)

@adamsitnik
Copy link
Member

I like this idea! @AndreyAkinshin what about you?

@AndreyAkinshin
Copy link
Member

I think that it's better to introduce a unified way for grouping results. In each group, we should get exactly one baseline.
Here are some use cases for it:

  1. Compare the performance of different methods (current default basline behavior).
  2. Compare the performance of different methods in a category (this issue).
  3. Compare the performance of the same method for different jobs (e.g. .NET Core vs. .NET Framework for the same method).
  4. Compare the performance of the same method for different param values (e.g. I have a CacheStrategy enum property and I would like to compare different strategies for the same method).
  5. A combination of these use cases.

With a "unified grouping approach", it should be easy to write simple API for specific use cases (e.g. SplitResultsByCategory).

A few additional thoughts:

  • Such "grouping" should be useful not only for baselines but also for ranking
  • The feature should be combined with OrderProviders (we should be able to order lines inside each group and order the groups)
  • Grouping should be obvious (when I look at 0.34 in the baseline column, it should be obvious where is the 1.0 value for this line; when I look at 5 in the rank column, it should be obvious where are the 1, 2, 3, 4 which correspond to this line). I like the approach by @mgravell with several tables, but I'm a little concerned that this approach takes too much space. It would be great to keep one table for all results, but I don't know how to make grouping obvious for all users (especially for those who look at the BenchmarkDotNet summary table for the first time).

It's a definitely a very useful feature and it should be easy to implement it (I guess, I need one hour for the first proof-of-concept implementation). The most difficult task here is simple, obvious, and user-friendly API and the summary table presentation.

Any thoughts?

@mgravell
Copy link
Member Author

mgravell commented Jan 9, 2018 via email

@AndreyAkinshin
Copy link
Member

@mgravell, LGTM.
@adamsitnik, what do you think?

@adamsitnik
Copy link
Member

@AndreyAkinshin LGTM

@AndreyAkinshin AndreyAkinshin self-assigned this Jan 11, 2018
AndreyAkinshin added a commit that referenced this issue Jan 11, 2018
AndreyAkinshin added a commit that referenced this issue Jan 11, 2018
AndreyAkinshin added a commit that referenced this issue Jan 12, 2018
adamsitnik added a commit that referenced this issue Jan 12, 2018
@AndreyAkinshin AndreyAkinshin added this to the v0.10.12 milestone Jan 12, 2018
alinasmirnova pushed a commit to alinasmirnova/BenchmarkDotNet that referenced this issue Sep 22, 2018
@SnakyBeaky
Copy link
Contributor

For anyone that might end up in this issue while looking for this functionality, the attribute GroupBenchmarksByAttribute was added and allows multiple baselines (and with CategoriesColumnAttribute too).

For example, this benchmark:

[SimpleJob, GroupBenchmarksBy(BenchmarkLogicalGroupRule.ByCategory), CategoriesColumn]
public class SomeBenchmark
{
    [Benchmark(Baseline = true), BenchmarkCategory("fast")]
    public async Task FastA()
    {
        await Task.Delay(100);
    }
    
    [Benchmark, BenchmarkCategory("fast")]
    public async Task FastB()
    {
        await Task.Delay(150);
    }
    
    [Benchmark(Baseline = true), BenchmarkCategory("slow")]
    public async Task SlowA()
    {
        await Task.Delay(1000);
    }
    
    [Benchmark, BenchmarkCategory("slow")]
    public async Task SlowB()
    {
        await Task.Delay(1500);
    }
}

Will output something like:

| Method | Categories |       Mean |   Error |  StdDev | Ratio | RatioSD |
|------- |----------- |-----------:|--------:|--------:|------:|--------:|
|  FastA |       fast |   110.0 ms | 1.75 ms | 1.64 ms |  1.00 |    0.00 |
|  FastB |       fast |   157.2 ms | 1.50 ms | 1.17 ms |  1.43 |    0.03 |
|        |            |            |         |         |       |         |
|  SlowA |       slow | 1,007.5 ms | 3.67 ms | 3.43 ms |  1.00 |    0.00 |
|  SlowB |       slow | 1,509.3 ms | 4.60 ms | 4.31 ms |  1.50 |    0.01 |

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants