feat: evaluation framework #101

pkarw · 2024-12-11T18:28:44Z

For testing/training the dispatcher, and for ongoing checkups and optimization, we should evaluate the results with an LLM, maybe a different model to cross-check.

Related to #64 #68

pkarw · 2024-12-11T18:31:10Z

LLM as a Judge pattern

Inspiration: https://github.com/braintrustdata/autoevals

pkarw · 2024-12-11T18:32:54Z

This should work like a BDD testing framework - you build the case and test the output (or outputs - for example, from two different task schedulers) comparing to train/test data.
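A minimal sketch of what such a case-based run could look like, assuming a pluggable scorer. All names here (`EvalCase`, `Candidate`, `runSuite`, `tokenOverlap`, `schedulerA`, `schedulerB`) are hypothetical and not part of the project's API. The token-overlap scorer is only a stand-in so the example runs offline; in the LLM-as-a-Judge setup a model call would take its place.

```typescript
// Hypothetical types for illustration only; not the project's API.
interface EvalCase {
  given: string;    // input handed to the system under test
  expected: string; // reference answer from the train/test data
}

interface EvalResult {
  candidate: string;
  given: string;
  score: number;   // 0..1, higher is better
  passed: boolean; // score >= threshold
}

type Candidate = (input: string) => string;
type Scorer = (output: string, expected: string) => number;

function tokens(text: string): string[] {
  return text.toLowerCase().split(/\s+/).filter(function (t) { return t.length > 0; });
}

// Stand-in scorer: fraction of expected tokens that appear in the output.
// An LLM judge would replace this function.
const tokenOverlap: Scorer = function (output, expected) {
  const want = tokens(expected);
  if (want.length === 0) return output.trim() === "" ? 1 : 0;
  const got = tokens(output);
  const hits = want.filter(function (t) { return got.indexOf(t) !== -1; }).length;
  return hits / want.length;
};

// Runs every case against every candidate and scores each output.
function runSuite(
  candidates: { [name: string]: Candidate },
  cases: EvalCase[],
  scorer: Scorer,
  threshold: number = 0.8
): EvalResult[] {
  const results: EvalResult[] = [];
  Object.keys(candidates).forEach(function (name) {
    cases.forEach(function (c) {
      const score = scorer(candidates[name](c.given), c.expected);
      results.push({ candidate: name, given: c.given, score: score, passed: score >= threshold });
    });
  });
  return results;
}

// Usage: compare two (hypothetical) task schedulers on the same case.
const cases: EvalCase[] = [{ given: "plan a trip", expected: "book flight then hotel" }];
const results = runSuite(
  {
    schedulerA: function () { return "book flight then hotel"; },
    schedulerB: function () { return "book hotel"; },
  },
  cases,
  tokenOverlap
);
console.log(results);
```

Keeping the scorer as a plain function argument means the same cases can be scored first with a cheap heuristic and later with a judge model, without touching the cases themselves.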

pkarw · 2024-12-11T19:50:41Z

Credits @grabbou

borisyankov · 2024-12-12T08:33:28Z

Matt Pocock is currently cooking up exactly this:
https://github.com/mattpocock/evalite

grabbou · 2024-12-16T06:56:30Z

http://braintrust.dev is also looking good

The idea is that the BDD evaluation takes place after every iteration, which is a little suboptimal but, on the other hand, lets users control the entire flow (and, for example, break it if something unexpected happens). Related to: #101
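As a rough illustration of that trade-off, here is a sketch of a loop that runs a check after every iteration and lets the caller stop the flow early. The names (`Step`, `Check`, `iterateWithChecks`) are hypothetical and do not reflect how #137 implements it.

```typescript
// Hypothetical names for illustration only; not the project's API.
interface StepVerdict {
  ok: boolean;
  reason: string;
}

type Step = (state: string[]) => string[];                     // one workflow iteration
type Check = (state: string[], iteration: number) => StepVerdict;

interface RunOutcome {
  state: string[];
  iterations: number;
  stoppedBy: string | null; // reason of the failing check, or null if none failed
}

// Runs `step` up to `maxIterations` times and evaluates the state after each one.
// The extra check per iteration is the cost; being able to break early is the gain.
function iterateWithChecks(step: Step, check: Check, maxIterations: number): RunOutcome {
  let state: string[] = [];
  for (let i = 1; i <= maxIterations; i++) {
    state = step(state);
    const verdict = check(state, i);
    if (!verdict.ok) {
      return { state: state, iterations: i, stoppedBy: verdict.reason };
    }
  }
  return { state: state, iterations: maxIterations, stoppedBy: null };
}

// Usage: each iteration appends one message; the check breaks the flow
// once more than three messages have piled up.
const outcome = iterateWithChecks(
  function (state) { return state.concat("message " + (state.length + 1)); },
  function (state) {
    return state.length > 3
      ? { ok: false, reason: "too many messages" }
      : { ok: true, reason: "" };
  },
  10
);
console.log(outcome.iterations, outcome.stoppedBy);
```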

pkarw · 2024-12-20T12:01:00Z

Implemented in #137

pkarw mentioned this issue Dec 17, 2024

feat: BDD and evaluation framework (#101) #137

Merged

pkarw closed this as completed Dec 20, 2024
