
Benchmark Runner results shown in viewer #390

Merged 8 commits into main on Dec 19, 2023

Conversation

@jakethekoenig (Member) commented Dec 13, 2023

The BenchmarkResult class and its viewer are made more general with the following changes:

  • Attributes of a benchmark result can be marked aggregatable, and if they are, the summary will compute and display the appropriate statistic for them: average, percent, or sum. The property is not displayed if it is None for all benchmarks. If only a subset of the benchmarks define it, that fraction is given in parentheses next to the aggregated number (see the sketch after this list).
  • Attributes of a benchmark can specify how they should be displayed: text, code, transcript, or json.
    Now if you introduce a new benchmark with a new property, you can add the property to the BenchmarkResults class along with the appropriate metadata. This won't break previous benchmarks, and if appropriate it is easy to set the property for them as well.
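
Roughly, the metadata-driven design could look like the following. This is a minimal sketch only; the metadata keys, field names, and summarize helper here are illustrative assumptions, not the actual API from this PR.

from dataclasses import dataclass, field, fields
from typing import Any, Optional


@dataclass
class BenchmarkResult:
    name: str
    # How a field is aggregated for the header and how it is displayed
    # is carried in the field's metadata.
    cost: Optional[float] = field(
        default=None, metadata={"aggregation": "sum", "display": "text"}
    )
    passed: Optional[bool] = field(
        default=None, metadata={"aggregation": "percent", "display": "text"}
    )
    response: Optional[str] = field(default=None, metadata={"display": "transcript"})


def summarize(results: list[BenchmarkResult]) -> dict[str, Any]:
    # Compute the header statistic for every aggregatable field.
    summary: dict[str, Any] = {}
    for f in fields(BenchmarkResult):
        how = f.metadata.get("aggregation")
        if how is None:
            continue  # non-aggregated fields are only shown on click-through
        values = [v for r in results if (v := getattr(r, f.name)) is not None]
        if not values:
            continue  # a property that is None for all benchmarks is not displayed
        percent_set = len(values) / len(results)  # shown when only a subset defines it
        if how == "sum":
            stat: Any = sum(values)
        elif how == "average":
            stat = sum(values) / len(values)
        else:  # "percent"
            stat = 100 * sum(values) / len(values)
        summary[f.name] = (stat, percent_set)
    return summary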

misc:

  • Previously the data was rewritten to txt on multiple runs. This is fixed.
  • Rendering the summary is now handled by the summary object itself.

todo:

  • Actually use it in the benchmark runner.
  • Aggregate indentation/syntax errors.
  • Have the Jinja template iterate over non-aggregated display properties instead of hard-coding them.
  • Group exercises in the selector, so that exercises that differ only by prompt or are just repeat iterations are together.
  • When a field represents an aggregation of a subset, denote that with parentheses.

The goal is for this to be reused by the new benchmark runner. A
benchmark result's fields can specify how they are aggregated, in which
case their total/count/percent-true/histogram will be displayed in the
header; otherwise they will be viewable by clicking through to the
benchmark.

TODO: actually use it in the benchmark runner. Have indentation/syntax
errors be aggregated.
@jakethekoenig marked this pull request as draft December 13, 2023 19:09
@jakethekoenig marked this pull request as ready for review December 14, 2023 16:32
@jakethekoenig (Member Author) commented:

Here's a screenshot of what it looks like on the new benchmarks:
[screenshot]

You can reproduce this with the command:

pytest tests/benchmarks/benchmark_runner.py --benchmark -s --retries 2

It takes a couple of minutes to run; the following is faster:

pytest tests/benchmarks/benchmark_runner.py --benchmark -s --benchmarks clojure

@jakethekoenig changed the title from "Benchmark Result Summary Introduced" to "Benchmark Runner results shown in viewer" Dec 14, 2023
@granawkins (Member) commented:

It seems not all of the code is appearing in the code box?
[screenshot]

@jakethekoenig (Member Author) replied:

> It seems not all of the code is appearing in the code box?

Oops, git diffs don't display untracked files. I think I'll run git add . and git diff --staged.
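
A sketch of that stage/diff/unstage approach with GitPython (an assumption about the eventual fix, not necessarily the helper that landed in git_handler):

from git import Repo


def diff_including_untracked(repo: Repo) -> str:
    # Plain `git diff` ignores untracked files, so stage everything first.
    repo.git.add(".")
    try:
        return repo.git.diff("--staged")
    finally:
        # Unstage again so the repo is left as we found it.
        repo.git.reset()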

@granawkins (Member) left a comment:

I like it, should help us understand the shoggoth a bit better.

</style>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.9.0/styles/default.min.css">
<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.9.0/highlight.min.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
A reviewer (Member) commented on the snippet above:
It looks like this is making booleans bold/green and strings and numbers red. I like having colors but I mistook this for indicating failure/success. Could it be different colors?

@jakethekoenig (Member Author) replied:
Oh yeah, that does sort of seem bad in the JSON code blocks, where true mostly means an error was detected. I'll check. If it's hard I'll just turn off syntax highlighting for JSON, since it doesn't add much anyway.

if len(values) == 0:
    summary[name] = (0, 0)
else:
    percent_set = len(values) / len(self.results)
A reviewer (Member) commented on the snippet above:

I wonder if n/N would be clearer than a percent? I had to guess what the two different percents meant my first time.


diff = repo.git.diff(start_commit)
# Set syntax and response grade information
diff = repo.git.diff()
@granawkins (Member) commented on the snippet above:
afaik this only saves edits, not new/untracked files. I ran into the same thing in #374 and added a helper func in git_handler. (I think this is why the code was missing in my earlier comment)

@granawkins (Member) added:
Now I see you already caught it. I think stage/diff/unstage is the right approach too.

@biobootloader (Member) commented Dec 15, 2023

Very nice! I really like the top bar.

Instead of showing a percentage in parentheses when only a subset of benchmarks define an attribute, how about we show the actual counts? I.e. below it could show: 87.5% (8/14). That'd be useful even when all benchmarks do define the attribute, so you can instantly see how significant the percentage is (i.e. 100% 2/2 vs 100% 20/20).
[screenshot]

Edit: oh I see this was already suggested 😄
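
A hypothetical helper for this suggestion, showing the aggregated percentage together with the raw counts (names are illustrative, not from the PR):

def format_stat(percent: float, defined: int, total: int) -> str:
    # e.g. format_stat(87.5, 8, 14) -> '87.5% (8/14)'
    return f"{percent:.1f}% ({defined}/{total})"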

@biobootloader (Member) commented:

No code showing for this one either, but this seems different as this file shouldn't be untracked?

[screenshot]

@jakethekoenig (Member Author) replied:

> No code showing for this one either, but this seems different as this file shouldn't be untracked?

I think this is an example of Mentat making a mistake. You can see in the code block that the line it's replacing already says white-space: pre-wrap. I think no diff is actually correct for that example.

@jakethekoenig (Member Author) commented:

After this PR I'll edit the cron job to run the new benchmarks as well as all the JavaScript, Python, and Clojure exercisms. When I do that I'll change the benchmark summary to be saved as JSON, get a summary message from it, and send it along with the link to Slack. I'll also sync that JSON to its own bucket so it will be easy to parse old data and make graphs in the future.
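
A sketch of that planned follow-up, assuming the summary is a dataclass and using a placeholder Slack webhook URL (none of this is code from the PR):

import json
import urllib.request
from dataclasses import asdict


def publish_summary(summary, webhook_url: str, viewer_link: str) -> None:
    # Save the summary as JSON; this file would be synced to its own bucket.
    with open("summary.json", "w") as f:
        json.dump(asdict(summary), f, indent=2)
    # Derive a short message and post it to Slack along with the viewer link.
    body = json.dumps({"text": f"Benchmarks finished: {viewer_link}"}).encode()
    request = urllib.request.Request(
        webhook_url, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request)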

@granawkins (Member) commented:

LGTM

@jakethekoenig merged commit 0ccf3f9 into main on Dec 19, 2023
16 checks passed