
Benchmark Runner results shown in viewer #390

Merged 8 commits into main on Dec 19, 2023

Conversation

@jakethekoenig (Member) commented Dec 13, 2023

The BenchmarkResult class and its viewer are made more general with the following changes:

  • Attributes of a benchmark result can be marked aggregatable, and if they are, the summary will compute and display the appropriate statistic for them: average, percent, or sum. The property is not displayed if it is None for all benchmarks. If only a subset of the benchmarks define it, that fraction is given in parentheses next to the aggregated number (see the sketch after this list).
  • Attributes of a benchmark can specify how they should be displayed: text, code, transcript, or json.
    Now if you introduce a new benchmark with a new property, you can add the property to the BenchmarkResults class along with the appropriate metadata. This won't break previous benchmarks, and if appropriate it is easy to set the property for them as well.
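
Roughly, the metadata-driven design could look like the following. This is a minimal sketch only; the metadata keys, field names, and summarize helper here are illustrative assumptions, not the actual API from this PR.

from dataclasses import dataclass, field, fields
from typing import Any, Optional


@dataclass
class BenchmarkResult:
    name: str
    # How a field is aggregated for the header and how it is displayed
    # is carried in the field's metadata.
    cost: Optional[float] = field(
        default=None, metadata={"aggregation": "sum", "display": "text"}
    )
    passed: Optional[bool] = field(
        default=None, metadata={"aggregation": "percent", "display": "text"}
    )
    response: Optional[str] = field(default=None, metadata={"display": "transcript"})


def summarize(results: list[BenchmarkResult]) -> dict[str, Any]:
    # Compute the header statistic for every aggregatable field.
    summary: dict[str, Any] = {}
    for f in fields(BenchmarkResult):
        how = f.metadata.get("aggregation")
        if how is None:
            continue  # non-aggregated fields are only shown on click-through
        values = [v for r in results if (v := getattr(r, f.name)) is not None]
        if not values:
            continue  # a property that is None for all benchmarks is not displayed
        percent_set = len(values) / len(results)  # shown when only a subset defines it
        if how == "sum":
            stat: Any = sum(values)
        elif how == "average":
            stat = sum(values) / len(values)
        else:  # "percent"
            stat = 100 * sum(values) / len(values)
        summary[f.name] = (stat, percent_set)
    return summary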

misc:

  • Previously the data was rewritten to txt on multiple runs. This is fixed.
  • Rendering the summary is now handled by the summary object itself.

todo:

  • Actually use it in the benchmark runner.
  • Aggregate indentation/syntax errors.
  • Have the Jinja template iterate over non-aggregated display properties instead of hard-coding them.
  • Group exercises in the selector, so that exercises that differ only by prompt or are just repeat iterations are together.
  • When a field represents an aggregation of a subset, denote that with parentheses.

The goal is for this to be reused by the new benchmark runner. A
benchmark result's fields can specify how they are aggregated, in which
case their total/count/percent-true/histogram will be displayed in the
header; otherwise they will be viewable by clicking through to the
benchmark.

TODO: actually use it in the benchmark runner. Have indentation/syntax
errors be aggregated.
@jakethekoenig marked this pull request as draft December 13, 2023 19:09
@jakethekoenig marked this pull request as ready for review December 14, 2023 16:32
@jakethekoenig (Member Author) commented:

Here's a screenshot of what it looks like on the new benchmarks:
[screenshot]

You can reproduce this with the command:

pytest tests/benchmarks/benchmark_runner.py --benchmark -s --retries 2

It takes a couple of minutes to run; the following is faster:

pytest tests/benchmarks/benchmark_runner.py --benchmark -s --benchmarks clojure

@jakethekoenig changed the title from "Benchmark Result Summary Introduced" to "Benchmark Runner results shown in viewer" Dec 14, 2023
@granawkins (Member) commented:

It seems not all of the code is appearing in the code box?
[screenshot]

@jakethekoenig (Member Author) replied:

> It seems not all of the code is appearing in the code box?

Oops, git diffs don't display untracked files. I think I'll run git add . and git diff --staged.
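
A sketch of that stage/diff/unstage approach with GitPython (an assumption about the eventual fix, not necessarily the helper that landed in git_handler):

from git import Repo


def diff_including_untracked(repo: Repo) -> str:
    # Plain `git diff` ignores untracked files, so stage everything first.
    repo.git.add(".")
    try:
        return repo.git.diff("--staged")
    finally:
        # Unstage again so the repo is left as we found it.
        repo.git.reset()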

@granawkins (Member) left a comment:

I like it, should help us understand the shoggoth a bit better.

</style>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.9.0/styles/default.min.css">
<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.9.0/highlight.min.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
A reviewer (Member) commented on the snippet above:
It looks like this is making booleans bold/green and strings and numbers red. I like having colors but I mistook this for indicating failure/success. Could it be different colors?

@jakethekoenig (Member Author) replied:
Oh yeah, that does sort of seem bad in the JSON code blocks, where true mostly means an error was detected. I'll check. If it's hard I'll just turn off syntax highlighting for JSON, since it doesn't add much anyway.

if len(values) == 0:
    summary[name] = (0, 0)
else:
    percent_set = len(values) / len(self.results)
A reviewer (Member) commented on the snippet above:

I wonder if n/N would be clearer than a percent? I had to guess what the two different percents meant my first time.


diff = repo.git.diff(start_commit)
# Set syntax and response grade information
diff = repo.git.diff()
@granawkins (Member) commented on the snippet above:
afaik this only saves edits, not new/untracked files. I ran into the same thing in #374 and added a helper func in git_handler. (I think this is why the code was missing in my earlier comment)

@granawkins (Member) added:
Now I see you already caught it. I think stage/diff/unstage is the right approach too.

@biobootloader (Member) commented Dec 15, 2023

Very nice! I really like the top bar.

Instead of showing a percentage in parentheses when only a subset of benchmarks define an attribute, how about we show the actual counts? I.e. below it could show: 87.5% (8/14). That'd be useful even when all benchmarks do define the attribute, so you can instantly see how significant the percentage is (i.e. 100% 2/2 vs 100% 20/20).
[screenshot]

Edit: oh I see this was already suggested 😄
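
A hypothetical helper for this suggestion, showing the aggregated percentage together with the raw counts (names are illustrative, not from the PR):

def format_stat(percent: float, defined: int, total: int) -> str:
    # e.g. format_stat(87.5, 8, 14) -> '87.5% (8/14)'
    return f"{percent:.1f}% ({defined}/{total})"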

@biobootloader (Member) commented:

No code showing for this one either, but this seems different as this file shouldn't be untracked?

[screenshot]

@jakethekoenig (Member Author) replied:

> No code showing for this one either, but this seems different as this file shouldn't be untracked?

I think this is an example of Mentat making a mistake. You can see in the code block that the line it's replacing already says white-space: pre-wrap. I think no diff is actually correct for that example.

@jakethekoenig (Member Author) commented:

After this PR I'll edit the cron job to run the new benchmarks as well as all the JavaScript, Python, and Clojure exercisms. When I do that I'll change the benchmark summary to be saved as JSON, get a summary message from it, and send it along with the link to Slack. I'll also sync that JSON to its own bucket so it will be easy to parse old data and make graphs in the future.
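
A sketch of that planned follow-up, assuming the summary is a dataclass and using a placeholder Slack webhook URL (none of this is code from the PR):

import json
import urllib.request
from dataclasses import asdict


def publish_summary(summary, webhook_url: str, viewer_link: str) -> None:
    # Save the summary as JSON; this file would be synced to its own bucket.
    with open("summary.json", "w") as f:
        json.dump(asdict(summary), f, indent=2)
    # Derive a short message and post it to Slack along with the viewer link.
    body = json.dumps({"text": f"Benchmarks finished: {viewer_link}"}).encode()
    request = urllib.request.Request(
        webhook_url, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request)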

@granawkins (Member) commented:

LGTM

@jakethekoenig merged commit 0ccf3f9 into main on Dec 19, 2023
16 checks passed