MMLU detailed numbers #8
Comments
We run the MMLU benchmark with the script here: https://github.com/hendrycks/test/pull/13/files, and we change the 2048 in this line
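A minimal sketch of the context-length filter being discussed, assuming a Hendrycks-style harness and an HF tokenizer (the formatting helper is illustrative, not the linked PR's exact code): the prompt starts with up to 5 shots, and shots are dropped until the tokenized prompt fits the window, with the hard-coded 2048 raised to 8192 for XGen-8k.

```python
# Sketch (not the linked PR's exact code) of the few-shot prompt length filter.
def build_prompt(dev_examples, question, tokenizer, k=5, max_len=8192):  # original hard-codes 2048
    def format_example(ex):
        return ex  # placeholder: the real script formats question + choices + answer

    while k >= 0:
        prompt = "".join(format_example(ex) for ex in dev_examples[:k]) + question
        if len(tokenizer(prompt).input_ids) <= max_len:
            break
        k -= 1  # drop few-shot examples until the prompt fits the context window
    return prompt
```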
Thanks for the clarification! I converted XGen to EasyLM and got 34.88% with the original Hendrycks code and the 2048 filter (in the same codebase, I could reproduce the numbers for OpenLLaMA 7B and 3B). Update: after filtering by 8192 I got 34.99%. Do you recall any other changes to the evaluation, or is it rather some issue on my side? I'm using the base model from here: https://huggingface.co/Salesforce/xgen-7b-8k-base BTW I'm using fp32
34.98 on my side with 8192. So the only diff I can imagine vs. 36.2 could be FP32 vs. FP16.
We run all the evaluations with bf16; the model was pretrained with fp32, though.
The detailed scores are here (5-shot, 8192 seq length, bf16):
@syzymon could you try this script without converting the model? https://github.com/hendrycks/test/pull/13/files
Unfortunately I only have TPU compute, so I cannot run PyTorch/HF, but I would greatly appreciate it if someone else could run this on GPU with bf16/fp32 and report their numbers.
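For anyone picking this up, a minimal sketch of loading the checkpoint at a chosen precision with the standard transformers API (the model id comes from the thread; trust_remote_code is needed for XGen's custom tokenizer, and the dtype can be swapped between bf16/fp16/fp32 to compare runs):

```python
# Sketch: load the XGen checkpoint at a chosen precision for evaluation.
# Swap torch.bfloat16 for torch.float16 or torch.float32 to compare precisions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Salesforce/xgen-7b-8k-base"
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.bfloat16, device_map="auto"
).eval()
```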
Okay, I ran it, using 8192 and dtype=float16. The main differences with chain-of-thought-hub are:
In the end there are differences that I cannot explain, in both directions. I'll try to investigate by digging into one specific category: Average accuracy 0.240 - abstract_algebra
I think there is a mistake in this implementation: https://github.com/hendrycks/test/pull/13/files. I will ask the author directly there; see my comment here: hendrycks/test#13
So the score is 35.8 when scoring with " A", " B", " C", " D" (the answer letters with a leading space): Average accuracy 0.320 - abstract_algebra
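A minimal sketch of the scoring detail under discussion, assuming an HF causal LM (this mirrors the idea, not the linked script's exact code): each question is scored by comparing the next-token logits of the four answer letters tokenized with a leading space, since that is the token a BPE-style tokenizer actually produces after "Answer:".

```python
# Sketch: compare next-token probabilities of the four answer letters.
# With BPE-style tokenizers the model predicts " A" (space included) after
# "Answer:", so scoring "A" without the space picks the wrong token ids.
import torch

def choice_probs(model, tokenizer, prompt, with_space=True):
    letters = [" A", " B", " C", " D"] if with_space else ["A", "B", "C", "D"]
    ids = [tokenizer(l, add_special_tokens=False).input_ids[0] for l in letters]
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]   # logits at the last position
    return torch.softmax(logits[ids], dim=-1)    # probabilities over A/B/C/D only
```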
@vince62s Hi Vince, thanks a lot for your update. Is 35.8 the average score from averaging all the categories (each category contributes equally), or from merging the results of all tasks and then computing the score over them (a weighted average by the number of examples)?
This 35.8 is what the script spits out, and it is: weighted_acc = np.mean(np.concatenate(all_cors)). If you simply average over tasks (up to world_religions), it's 36.25.
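To make the two aggregations explicit, a small sketch with toy data (all_cors stands in for the script's list of per-task 0/1 correctness arrays): the script's number is the example-weighted mean, while the 36.25 figure is the unweighted mean over task accuracies.

```python
# Sketch: the two ways of aggregating MMLU accuracy discussed above.
import numpy as np

# Toy stand-in for the script's all_cors: one 0/1 correctness array per task.
all_cors = [np.array([1, 0, 1]), np.array([1, 1, 1, 0, 0])]

weighted_acc = np.mean(np.concatenate(all_cors))   # every example counts equally
macro_acc = np.mean([c.mean() for c in all_cors])  # every task counts equally
print(weighted_acc, macro_acc)
```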
Hi guys,
This is awesome work.
I converted xgen7B (as I did for many other models) to the OpenNMT-py format and I just scored it with the MMLU (chain-of-thought-hub) implementation, which is close to the original one.
I am getting 34.68 for XGen, hence I would like to check the detailed numbers; maybe there are some gaps I can identify.
Many thanks.
https://github.com/OpenNMT/OpenNMT-py/blob/master/eval_llm/MMLU/readme.md
NB: I am running it in FP16