Benchmark outcomes record #392

wpietri · 2024-07-16T23:09:20Z

Produces a JSON version of the benchmark alongside the HTML files. I'm not sure this is totally right, so I'm looking forward to feedback on the format.
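As a rough illustration of the idea (not the format this PR actually emits), a run could write a small JSON record into the same output directory as the HTML. Only `run_uid` is a name taken from this PR's commit messages; the function name, file name, and every other field below are assumptions made for the sketch.

```python
import json
import tempfile
import uuid
from datetime import datetime, timezone
from pathlib import Path


def write_outcome_record(output_dir, benchmark_uid, sut_scores):
    """Write a JSON record of one run into output_dir and return its path.

    Hypothetical layout: apart from `run_uid`, none of these keys are
    confirmed by the PR.
    """
    record = {
        "run_uid": f"run-{uuid.uuid4()}",  # assumed format for the run id
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "benchmark": {"uid": benchmark_uid},
        "scores": sut_scores,
    }
    path = Path(output_dir) / "benchmark_record.json"  # assumed file name
    path.write_text(json.dumps(record, indent=2, sort_keys=True), encoding="utf-8")
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        written = write_outcome_record(tmp, "example_benchmark-0.5", {"example-sut": 0.9})
        print(written.read_text(encoding="utf-8"))
```

Keeping the record a plain dictionary until the last step makes it easy to extend with further metadata before serializing.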

github-actions · 2024-07-16T23:09:32Z

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

wpietri · 2024-07-17T21:21:22Z

Ok, @bkorycki and @dhosterman, this is actually ready for final review now.

bkorycki

Nice work!

src/modelbench/hazards.py

src/modelbench/run.py

bkorycki · 2024-07-18T03:38:33Z

src/modelbench/modelgauge_runner.py

@@ -55,6 +55,18 @@ class ModelGaugeSut(SutDescription, Enum):
    WIZARDLM_13B = "wizardlm-13b", "WizardLM v1.2 (13B)", TogetherChatSUT, "WizardLM/WizardLM-13B-V1.2"
    # YI_34B_CHAT = "yi-34b", "01-ai Yi Chat (34B)", TogetherChatSUT, "zero-one-ai/Yi-34B-Chat"

+    def instance(self, secrets):


Why do we need these methods?

wpietri · 2024-07-19

Moving instance creation here lets me unify duplicate code, and it gives me a place to cache the instance actually used for the run, which is needed to dump out the outcome JSON.
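The diff above shows only the signature `instance(self, secrets)`, so the following is a minimal sketch of the caching idea described in the reply, not the PR's implementation. `FakeSut`, `SketchSut`, the member values, and the `initialization_record` attribute are all hypothetical stand-ins, and the sketch uses a plain `Enum` with `__init__` rather than the `SutDescription` mixin.

```python
from enum import Enum


class FakeSut:
    """Hypothetical stand-in for a real SUT client class."""

    def __init__(self, key, model_name, secrets):
        self.key = key
        self.model_name = model_name
        # Remember how this object was built so it can be reported later.
        self.initialization_record = {
            "class": type(self).__name__,
            "args": [key, model_name],
        }


class SketchSut(Enum):
    ALPHA = ("alpha", "Alpha (7B)", FakeSut, "vendor/alpha-7b")
    BETA = ("beta", "Beta (13B)", FakeSut, "vendor/beta-13b")

    def __init__(self, key, display_name, sut_class, model_name):
        self.key = key
        self.display_name = display_name
        self.sut_class = sut_class
        self.model_name = model_name
        self._cached_instance = None

    def instance(self, secrets):
        """Create the SUT on first use, then return the same object."""
        if self._cached_instance is None:
            self._cached_instance = self.sut_class(self.key, self.model_name, secrets)
        return self._cached_instance


if __name__ == "__main__":
    sut = SketchSut.ALPHA.instance({"api_key": "not-a-real-key"})
    print(sut.initialization_record)
```

One consequence of caching this way is that secrets passed on later calls are ignored once an instance exists, which is acceptable only if a run uses a single set of secrets.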

bkorycki · 2024-07-18T03:43:41Z

src/modelbench/uid.py

+import casefy
+
+
+class HasUid:


I left a few comments about this in the other PR!
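Only the `import casefy` line and the `class HasUid:` header are visible in this hunk, so the body of the class is not known from this thread. As a guess at the general shape of such a mixin, the sketch below derives a uid from the class name plus a few named attributes. `HasUidSketch`, `uid_fields`, and `ExampleBenchmark` are hypothetical, and `snake_case` is a standard-library stand-in for a case-conversion helper rather than the casefy call used in the PR.

```python
import re


def snake_case(text):
    """Stdlib stand-in for a case-conversion helper."""
    # Put an underscore at lower-to-upper boundaries, e.g. "MyClass" -> "My_Class".
    text = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", text)
    # Collapse every other run of non-alphanumeric characters to one underscore.
    return re.sub(r"[^A-Za-z0-9]+", "_", text).strip("_").lower()


class HasUidSketch:
    """Hypothetical mixin: subclasses list the attributes that identify them."""

    uid_fields = ()

    @property
    def uid(self):
        parts = [snake_case(type(self).__name__)]
        parts += [snake_case(str(getattr(self, field))) for field in self.uid_fields]
        return "-".join(parts)


class ExampleBenchmark(HasUidSketch):
    uid_fields = ("version",)

    def __init__(self, version):
        self.version = version


if __name__ == "__main__":
    print(ExampleBenchmark("0.5").uid)  # example_benchmark-0_5
```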

src/modelbench/record.py

wpietri · 2024-07-19T19:44:09Z

Ok @dhosterman and @bkorycki, I think I have resolved all the outstanding issues and requests on this one.

dhosterman · 2024-07-22T15:26:04Z

This works great so far, but it fails when attempting to use --anonymize.

dhosterman · 2024-07-22T15:56:27Z

I also noticed that the content data has a uid for the benchmark that differs from the uid in the benchmark data. We might want to go through and make sure those are aligned, as well as the versions.

dhosterman

Looks great and I'm already using it! Thanks, William!

wpietri added 7 commits June 27, 2024 07:27

Add HasUid and apply it to Benchmark and Hazard.

2e464e8

Pleasing the formatting gods.

cedb78a

Add basic output, plus metadata. More to come.

c0001c5

Remove accidental paste.

1db7a43

Remove accidental paste.

5041a9a

Merge remote-tracking branch 'origin/benchmark_outcomes_record' into benchmark_outcomes_record

b373bb1

# Conflicts: # tests/test_record.py

Add SUT initialization and git-derived metadata on the code.

0dabb0a

Make modelgauge's notion of a SUT know how to instantiate itself and cache the instance used, so that the initialization info is available later.

wpietri requested review from dhosterman and bkorycki July 16, 2024 23:09

wpietri requested a review from a team as a code owner July 16, 2024 23:09

wpietri added 6 commits July 16, 2024 21:10

Removing unneeded test.

75e7af3

Removing unneeded test.

ae61704

Making test work no matter how you check it out.

f482fa2

Making test work no matter how you check it out.

04889bd

Making test work no matter how you check it out.

d5189f3

Make HazardDefinitions cache Tests, making later output easier and more accurate. Add `run_uid`. Remove some duplication in the JSON. Add JSON output to normal benchmark run.

09be1ec

wpietri added 8 commits July 17, 2024 19:10

Add SUT initialization and git-derived metadata on the code.

b49203a

Make modelgauge's notion of a SUT know how to instantiate itself and cache the instance used, so that the initialization info is available later.

Removing unneeded test.

8c566cb

Removing unneeded test.

f2c4ce9

Making test work no matter how you check it out.

a66a2d8

Making test work no matter how you check it out.

cc45d2c

Making test work no matter how you check it out.

fe766e3

Make HazardDefinitions cache Tests, making later output easier and more accurate. Add `run_uid`. Remove some duplication in the JSON. Add JSON output to normal benchmark run.

6359e2f

Merge remote-tracking branch 'origin/benchmark_outcomes_record' into benchmark_outcomes_record

3127b6a

bkorycki reviewed Jul 18, 2024

wpietri added 3 commits July 18, 2024 17:37

Thanks to Barbara's keen eye, fixing a bug (and adding a test for it).

182672f

Add SUT initialization and git-derived metadata on the code.

450e2b8

Make modelgauge's notion of a SUT know how to instantiate itself and cache the instance used, so that the initialization info is available later.

Merging from main.

e9e635d

wpietri and others added 7 commits July 19, 2024 09:27

Making test work no matter how you check it out.

4d8f4fc

Making test work no matter how you check it out.

10f1001

Make HazardDefinitions cache Tests, making later output easier and more accurate. Add `run_uid`. Remove some duplication in the JSON. Add JSON output to normal benchmark run.

f860779

Thanks to Barbara's keen eye, fixing a bug (and adding a test for it).

3bd6b69

Per Barbara, make this more Pydantic-idiomatic.

a5a9415

Merge remote-tracking branch 'origin/benchmark_outcomes_record' into benchmark_outcomes_record

42e639f

Merge branch 'main' into benchmark_outcomes_record

4c4d7a3

wpietri requested a review from bkorycki July 19, 2024 14:37

wpietri added 3 commits July 19, 2024 10:11

Fix formatting after merge.

f10e87a

Fix test score key in JSON to be a UID.

94db658

Adding content and reference scores.

cbbfe6f

bkorycki approved these changes Jul 19, 2024

wpietri added 7 commits July 22, 2024 19:21

fix anonymous case for json

c0003c2

verifying initialization makes it to JSON

0673d15

Handling case where modelbench is installed not using git.

db26323

Adding library info to json

f19c4f7

Removing null when tests aren't loaded for hazard.

033a69c

Removing benchmark uid from content, using one in class instead.

bca20a1

Fixing anonymous runs.

4cc3ddc

dhosterman approved these changes Jul 24, 2024

wpietri merged commit 90ad71c into main Jul 24, 2024
4 checks passed

github-actions bot locked and limited conversation to collaborators Jul 24, 2024

wpietri deleted the benchmark_outcomes_record branch September 30, 2024 12:45
