feat: evaluation framework #101

Closed
pkarw opened this issue Dec 11, 2024 · 6 comments

Comments

@pkarw
Collaborator

pkarw commented Dec 11, 2024

For testing/training the dispatcher, and for ongoing checkups and optimization, we should rely on an LLM, maybe a different model to cross-check the results.

Related to #64 #68

@pkarw
Collaborator Author

pkarw commented Dec 11, 2024

LLM as a Judge pattern

Inspiration: https://github.com/braintrustdata/autoevals
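To make the pattern concrete, here is a minimal sketch of an LLM-as-a-judge scorer. Everything in it is an assumption for illustration: `Judge`, `Verdict`, `buildJudgePrompt`, and `judgeOutput` are hypothetical names, not the autoevals API or anything from this repo. The judge is passed in as a plain synchronous function so the sketch runs without a model; a real judge would be an async call to a second LLM.

```typescript
// Hypothetical sketch: a second model (the "judge") grades another model's output.
// None of these names come from this project or from autoevals.
type Judge = (prompt: string) => string;

interface Verdict {
  score: number; // clamped to the range 0..1
  rationale: string;
}

// Build the grading prompt that is sent to the judge model.
function buildJudgePrompt(task: string, output: string, criteria: string): string {
  return [
    "You are grading the output of another model.",
    `Task: ${task}`,
    `Output: ${output}`,
    `Criteria: ${criteria}`,
    'Reply with JSON: {"score": <number between 0 and 1>, "rationale": "<short reason>"}',
  ].join("\n");
}

// Ask the judge and parse its reply; an unparseable reply scores 0.
function judgeOutput(judge: Judge, task: string, output: string, criteria: string): Verdict {
  const reply = judge(buildJudgePrompt(task, output, criteria));
  try {
    const parsed = JSON.parse(reply);
    const score = Number(parsed.score);
    if (!Number.isFinite(score)) throw new Error("score is not a number");
    return {
      score: Math.min(1, Math.max(0, score)),
      rationale: String(parsed.rationale ?? ""),
    };
  } catch {
    return { score: 0, rationale: "unparseable judge reply" };
  }
}

// Stub judge standing in for a real second model.
const stubJudge: Judge = () => '{"score": 0.8, "rationale": "covers the request"}';
const verdict = judgeOutput(
  stubJudge,
  "Summarize the issue",
  "An evaluation framework was added.",
  "Is the summary accurate?",
);
```

Treating a malformed judge reply as a score of 0 is one possible design choice; retrying or raising an error would be equally reasonable.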

@pkarw
Collaborator Author

pkarw commented Dec 11, 2024

This should work like a BDD testing framework: you build the case and test the output (or outputs, for example from two different task schedulers) against train/test data.
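A rough sketch of what such a BDD-style run could look like, comparing two task schedulers against the same cases. All names here (`Scenario`, `Expectation`, `Candidate`, `runScenarios`, and both toy schedulers) are hypothetical and only illustrate the idea; they are not the framework's API.

```typescript
// Hypothetical BDD-style evaluation: each scenario has a "given" input and
// named expectations that are checked against every candidate's output.
interface Expectation {
  description: string;
  check: (output: string) => boolean;
}

interface Scenario {
  given: string;
  expectations: Expectation[];
}

interface Result {
  candidate: string;
  given: string;
  description: string;
  passed: boolean;
}

// A candidate is anything that maps an input to an output, e.g. a task scheduler.
type Candidate = (input: string) => string;

// Run every scenario against every candidate and collect one result per expectation.
function runScenarios(candidates: Record<string, Candidate>, scenarios: Scenario[]): Result[] {
  const results: Result[] = [];
  for (const name of Object.keys(candidates)) {
    const run = candidates[name];
    for (const scenario of scenarios) {
      const output = run(scenario.given);
      for (const expectation of scenario.expectations) {
        results.push({
          candidate: name,
          given: scenario.given,
          description: expectation.description,
          passed: expectation.check(output),
        });
      }
    }
  }
  return results;
}

// Test data: the expected routing for one input.
const scenarios: Scenario[] = [
  {
    given: "translate 'hello' to French",
    expectations: [
      { description: "routes to the translator", check: (o) => o === "translator" },
    ],
  },
];

// Two toy schedulers to compare.
const candidates: Record<string, Candidate> = {
  keywordScheduler: (input) => (input.includes("translate") ? "translator" : "generalist"),
  naiveScheduler: () => "generalist",
};

const results = runScenarios(candidates, scenarios);
```

Here the checks are plain predicates; an LLM judge could be plugged in as the `check` function when the expected output cannot be compared literally.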

@pkarw
Collaborator Author

pkarw commented Dec 11, 2024

Credits @grabbou

@borisyankov
Collaborator

Matt Pocock is currently cooking exactly this:
https://github.com/mattpocock/evalite

@grabbou
Collaborator

grabbou commented Dec 16, 2024

http://braintrust.dev is also looking good

pkarw added a commit that referenced this issue Dec 20, 2024
The idea is that the BDD evaluation takes place after every iteration,
which is a little bit suboptimal but, on the other hand, lets the users
control the entire flow (and, for example, break it if something
unexpected happens).
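
The per-iteration flow described above could be sketched roughly as follows. This is an illustration under assumptions, not the code from the referenced commit: `StepResult`, `Decision`, `Evaluate`, and `runWithEvaluation` are hypothetical names, and the steps are synchronous stubs rather than real agent iterations.

```typescript
// Hypothetical sketch: evaluate after every iteration and let the caller stop the flow.
interface StepResult {
  iteration: number;
  output: string;
}

type Decision = "continue" | "stop";

// User-supplied evaluation, called once after each iteration.
type Evaluate = (step: StepResult) => Decision;

function runWithEvaluation(steps: Array<() => string>, evaluate: Evaluate): StepResult[] {
  const history: StepResult[] = [];
  let iteration = 0;
  for (const step of steps) {
    const result: StepResult = { iteration, output: step() };
    history.push(result);
    // Evaluating every iteration costs extra calls, but the user keeps control
    // and can break the flow as soon as something unexpected shows up.
    if (evaluate(result) === "stop") break;
    iteration++;
  }
  return history;
}

// Stub iterations: the second one produces an unexpected result.
const steps = [
  () => "plan drafted",
  () => "ERROR: tool missing",
  () => "final answer",
];

const history = runWithEvaluation(steps, (step) =>
  step.output.startsWith("ERROR") ? "stop" : "continue",
);
```

With these stubs the run stops after the second iteration, so the third step never executes.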

Related to: #101
@pkarw
Collaborator Author

pkarw commented Dec 20, 2024

Implemented in #137

@pkarw pkarw closed this as completed Dec 20, 2024