WIP: Evaluate 1B-Instruct on GSM8k #82
base: main
Conversation
OK, I fixed a bug and ran both the vanilla Llama-3.1-1B-Instruct and the sampler with this branch's settings on 200 samples of GSM8k. Here are the results. @xjdr-alt With the entropix sampler, the run took ~10 min. These are preliminary results, of course.
After applying the chat template, the results differ significantly. I benchmarked the 3B-Instruct model and could reproduce Meta's original result from their blog post without the sampler. Without the sampler the run took ~30 min at batch size 1; with the entropix sampler it took ~10 hours. Both were benchmarked on a single 4090. Is there anything wrong with this branch? @xjdr-alt
I ran the vanilla model and the entropix sampler as two separate runs, both with the same lm-eval-harness version.
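For reference, here is a hedged sketch of what those two runs could look like through lm-eval-harness's Python API. The HF model id, `limit=200`, and the chat-template flags are assumptions pieced together from the comments above, not the exact commands used:

```python
# Sketch only: the model id, limit, and flags below are assumptions, not the
# exact commands from this thread.
from lm_eval import simple_evaluate

# Vanilla Hugging Face model with default sampling:
vanilla_results = simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.2-1B-Instruct",  # assumed HF id
    tasks=["gsm8k_cot_llama"],
    batch_size=1,
    apply_chat_template=True,   # format prompts with the model's chat template
    fewshot_as_multiturn=True,  # one user/assistant turn per few-shot example
    limit=200,                  # 200 GSM8k samples, as in the comment above
)

# Entropix sampler via the branch's model wrapper (constructor is hypothetical):
# from entropix.eval_main import CustomLLaMAModel
# entropix_results = simple_evaluate(
#     model=CustomLLaMAModel(),
#     tasks=["gsm8k_cot_llama"],
#     batch_size=1,
#     apply_chat_template=True,
#     fewshot_as_multiturn=True,
#     limit=200,
# )
```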
Based off https://github.com/xjdr-alt/entropix/blob/70B/entropix/eval_main.py.

This is still WIP: for a correct comparison, `apply_chat_template()` must be implemented for the `CustomLLaMAModel` to use the `gsm8k_cot_llama` task. See also the official docs of the `gsm8k` task in `lm-evaluation-harness`. I will run the evaluation overnight without applying the chat template and without using multi-turn for few-shot.
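Since `apply_chat_template()` is the missing piece, here is a minimal sketch of what it could look like, assuming `CustomLLaMAModel` wraps a Hugging Face tokenizer (the constructor and attribute names are assumptions, not this branch's actual code). lm-evaluation-harness calls `LM.apply_chat_template()` on the model wrapper to turn a chat history into a prompt string when evaluation runs with the chat template enabled:

```python
# Minimal sketch; the real CustomLLaMAModel in this branch may differ.
from typing import Dict, List

from transformers import AutoTokenizer


class CustomLLaMAModel:  # in the branch this subclasses lm_eval.api.model.LM
    def __init__(self, pretrained: str = "meta-llama/Llama-3.2-1B-Instruct"):
        self.tokenizer = AutoTokenizer.from_pretrained(pretrained)

    def apply_chat_template(self, chat_history: List[Dict[str, str]]) -> str:
        # Delegate to the tokenizer's built-in chat template and append the
        # assistant header so generation starts at the model's turn.
        return self.tokenizer.apply_chat_template(
            chat_history,
            tokenize=False,
            add_generation_prompt=True,
        )

    @property
    def tokenizer_name(self) -> str:
        # The harness uses this to key cached chat-templated requests.
        return self.tokenizer.name_or_path
```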