mmlu detailed numbers #8

Closed
vince62s opened this issue Jun 29, 2023 · 12 comments

vince62s commented Jun 29, 2023

Hi guys,
This is awesome work.
I converted xgen7B to the OpenNMT-py format (as I did for many other models) and scored it with an MMLU (chain-of-thought) implementation that is close to the original one.
I am getting 34.68 for xgen, so I would like to check the detailed numbers; maybe there are some gaps I can identify.

Many thanks.
https://github.com/OpenNMT/OpenNMT-py/blob/master/eval_llm/MMLU/readme.md

NB: I am running it in FP16.

tianxie-9 commented Jun 29, 2023

We run the MMLU benchmark with the script here: https://github.com/hendrycks/test/pull/13/files, and we change the 2048 in the line "while input_ids.shape[-1] > 2048:" to 4096 or 8192 for our 4k or 8k models.
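
For reference, here is a minimal sketch of what that truncation loop typically looks like; the build_prompt helper and the placeholder examples are assumptions for illustration, not the actual PR code:

```python
# Sketch only (assumed structure, not the exact code from the PR): the k-shot
# prompt is rebuilt with fewer dev examples until it fits the context window.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Salesforce/xgen-7b-8k-base", trust_remote_code=True
)
max_len = 8192  # 2048 in the original script; 4096 or 8192 for the 4k/8k models

dev_examples = ["Question 1 ... Answer: A", "Question 2 ... Answer: C"]  # placeholders
test_question = "Question ... Answer:"

def build_prompt(shots, question):
    # Hypothetical helper: joins the few-shot examples and the test question.
    return "\n\n".join(shots + [question])

k = len(dev_examples)
prompt = build_prompt(dev_examples[:k], test_question)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
while input_ids.shape[-1] > max_len and k > 0:
    k -= 1  # drop one few-shot example and re-tokenize
    prompt = build_prompt(dev_examples[:k], test_question)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
```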

syzymon commented Jun 30, 2023

Thanks for the clarification! I converted XGen to EasyLM and got 34.88% on the original Hendrycks code with the 2048 filter (in the same codebase, I could reproduce the numbers for openllama 7B and 3B).

Update: after filtering by 8192, I got 34.99%. Do you recall any other changes to the evaluation, or is it rather some issue on my side? I'm using the base model from here: https://huggingface.co/Salesforce/xgen-7b-8k-base

BTW, I'm using fp32.

vince62s commented Jul 1, 2023

34.98 on my side with 8192, so the only diff I can imagine vs the 36.2 could be FP32 vs FP16.

@tianxie-9

We run all the evaluations with bf16; the model was pretrained with fp32, though.

tianxie-9 commented Jul 1, 2023

The detailed scores are here (5-shot, 8192 seq length, bf16):
{
  "subcategories": {
    "math": 0.25093984962406013,
    "health": 0.3884146341463415,
    "physics": 0.3234375,
    "business": 0.4302059496567506,
    "biology": 0.3436123348017621,
    "chemistry": 0.2706270627062706,
    "computer science": 0.3592233009708738,
    "economics": 0.32075471698113206,
    "engineering": 0.46206896551724136,
    "philosophy": 0.31759443339960236,
    "other": 0.4454935622317597,
    "history": 0.421505376344086,
    "geography": 0.36363636363636365,
    "politics": 0.4012345679012346,
    "psychology": 0.43560933448573896,
    "culture": 0.46987951807228917,
    "law": 0.3170731707317073
  },
  "categories": {
    "STEM": 0.3071570576540755,
    "humanities": 0.3379383634431456,
    "social sciences": 0.3997400064998375,
    "other (business, health, misc.)": 0.41455891425046265
  },
  "weighted_accuracy": 0.3625551915681527
}

@tianxie-9

@syzymon could you try this script without converting the model? https://github.com/hendrycks/test/pull/13/files
And use bf16 or fp32. We use fp32 in training and bf16 in evaluations.

syzymon commented Jul 1, 2023

Unfortunately, I only have TPU compute, so I cannot run PyTorch/HF, but I would greatly appreciate it if someone else could run this on GPU with bf16/fp32 and report their numbers.

vince62s commented Jul 3, 2023

Okay, I ran it using 8192 and dtype=float16.
Using 2048 instead of 8192 gives the exact same numbers EXCEPT for two tasks (high_school_european_history and high_school_us_history), leading to 36.4 instead of 36.5.

The main differences with chain-of-thought-hub are:

  1. we don't use the argmax of the logits but the actual generated next token, which may give a very slight difference when the next token is not in [A, B, C, D]; when I checked the details, the difference was marginal (see the sketch after this list)
  2. the way we average "ALL" is not the same: Hendrycks computes an average by category and then an average of the categories, which means each category contributes equally to the grand average, whereas chain-of-thought-hub totals the correct answers across all tasks and computes the score from that. My 34.98 becomes 35.3 from this standpoint.
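
To illustrate point 1, here is a minimal sketch of the two selection strategies; the model name, dtype and prompt are placeholders, not the exact code of either evaluation harness:

```python
# Sketch of the two answer-selection strategies discussed in point 1
# (model name, dtype and prompt are placeholders, not actual harness code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Salesforce/xgen-7b-8k-base"
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.float16, device_map="auto"
)

prompt = "...five-shot MMLU prompt ending with 'Answer:'"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    logits = model(input_ids).logits[0, -1]  # next-token logits

# Hendrycks-style: argmax restricted to the four option tokens, so the
# prediction is always one of A/B/C/D (which exact option token to use is
# the " A" vs "A" question discussed further down in the thread).
option_ids = [tokenizer(c, add_special_tokens=False).input_ids[0] for c in "ABCD"]
pred_restricted = "ABCD"[torch.argmax(logits[option_ids]).item()]

# Next-token style: unrestricted argmax over the whole vocabulary; the
# predicted token may fall outside A/B/C/D and then counts as wrong.
pred_free = tokenizer.decode(torch.argmax(logits).item()).strip()
```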

In the end, there are differences that I cannot explain, in both directions. I'll try to investigate by digging into one specific category.

Average accuracy 0.240 - abstract_algebra
Average accuracy 0.326 - anatomy
Average accuracy 0.401 - astronomy
Average accuracy 0.310 - business_ethics
Average accuracy 0.325 - clinical_knowledge
Average accuracy 0.396 - college_biology
Average accuracy 0.220 - college_chemistry
Average accuracy 0.380 - college_computer_science
Average accuracy 0.290 - college_mathematics
Average accuracy 0.341 - college_medicine
Average accuracy 0.225 - college_physics
Average accuracy 0.390 - computer_security
Average accuracy 0.362 - conceptual_physics
Average accuracy 0.237 - econometrics
Average accuracy 0.469 - electrical_engineering
Average accuracy 0.278 - elementary_mathematics
Average accuracy 0.206 - formal_logic
Average accuracy 0.310 - global_facts
Average accuracy 0.313 - high_school_biology
Average accuracy 0.276 - high_school_chemistry
Average accuracy 0.410 - high_school_computer_science
Average accuracy 0.448 - high_school_european_history
Average accuracy 0.384 - high_school_geography
Average accuracy 0.503 - high_school_government_and_politics
Average accuracy 0.338 - high_school_macroeconomics
Average accuracy 0.237 - high_school_mathematics
Average accuracy 0.311 - high_school_microeconomics
Average accuracy 0.272 - high_school_physics
Average accuracy 0.523 - high_school_psychology
Average accuracy 0.222 - high_school_statistics
Average accuracy 0.426 - high_school_us_history
Average accuracy 0.447 - high_school_world_history
Average accuracy 0.430 - human_aging
Average accuracy 0.435 - human_sexuality
Average accuracy 0.446 - international_law
Average accuracy 0.435 - jurisprudence
Average accuracy 0.350 - logical_fallacies
Average accuracy 0.277 - machine_learning
Average accuracy 0.388 - management
Average accuracy 0.521 - marketing
Average accuracy 0.480 - medical_genetics
Average accuracy 0.508 - miscellaneous
Average accuracy 0.387 - moral_disputes
Average accuracy 0.242 - moral_scenarios
Average accuracy 0.405 - nutrition
Average accuracy 0.354 - philosophy
Average accuracy 0.401 - prehistory
Average accuracy 0.316 - professional_accounting
Average accuracy 0.304 - professional_law
Average accuracy 0.449 - professional_medicine
Average accuracy 0.358 - professional_psychology
Average accuracy 0.436 - public_relations
Average accuracy 0.273 - security_studies
Average accuracy 0.502 - sociology
Average accuracy 0.480 - us_foreign_policy
Average accuracy 0.410 - virology
Average accuracy 0.573 - world_religions
Average accuracy 0.254 - math
Average accuracy 0.395 - health
Average accuracy 0.328 - physics
Average accuracy 0.442 - business
Average accuracy 0.339 - biology
Average accuracy 0.257 - chemistry
Average accuracy 0.362 - computer science
Average accuracy 0.314 - economics
Average accuracy 0.469 - engineering
Average accuracy 0.319 - philosophy
Average accuracy 0.445 - other
Average accuracy 0.427 - history
Average accuracy 0.384 - geography
Average accuracy 0.401 - politics
Average accuracy 0.436 - psychology
Average accuracy 0.476 - culture
Average accuracy 0.322 - law
Average accuracy 0.308 - STEM
Average accuracy 0.341 - humanities
Average accuracy 0.400 - social sciences
Average accuracy 0.419 - other (business, health, misc.)
Average accuracy: 0.365

vince62s commented Jul 3, 2023

I think there is a mistake in this implementation: https://github.com/hendrycks/test/pull/13/files. I will ask the author directly there.

see my comment here: hendrycks/test#13

vince62s commented Jul 3, 2023

So the score is 35.8 when using " A", " B", " C", " D" (with a leading space) as the answer tokens.
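
For context, a tiny sketch of why the leading space can matter with this kind of tokenizer; the model name is taken from the thread, and the exact ids depend on the tokenizer:

```python
# Sketch: with BPE-style tokenizers, "A" and " A" usually map to different
# token ids, so scoring the logits of the wrong variant compares tokens the
# model rarely produces right after "Answer:".
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Salesforce/xgen-7b-8k-base", trust_remote_code=True
)
for c in "ABCD":
    no_space = tokenizer(c, add_special_tokens=False).input_ids
    with_space = tokenizer(" " + c, add_special_tokens=False).input_ids
    print(c, no_space, with_space)  # typically two different single-token ids
```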

Average accuracy 0.320 - abstract_algebra
Average accuracy 0.333 - anatomy
Average accuracy 0.355 - astronomy
Average accuracy 0.290 - business_ethics
Average accuracy 0.306 - clinical_knowledge
Average accuracy 0.403 - college_biology
Average accuracy 0.280 - college_chemistry
Average accuracy 0.360 - college_computer_science
Average accuracy 0.300 - college_mathematics
Average accuracy 0.318 - college_medicine
Average accuracy 0.245 - college_physics
Average accuracy 0.390 - computer_security
Average accuracy 0.332 - conceptual_physics
Average accuracy 0.289 - econometrics
Average accuracy 0.400 - electrical_engineering
Average accuracy 0.280 - elementary_mathematics
Average accuracy 0.246 - formal_logic
Average accuracy 0.320 - global_facts
Average accuracy 0.316 - high_school_biology
Average accuracy 0.271 - high_school_chemistry
Average accuracy 0.350 - high_school_computer_science
Average accuracy 0.424 - high_school_european_history
Average accuracy 0.389 - high_school_geography
Average accuracy 0.487 - high_school_government_and_politics
Average accuracy 0.341 - high_school_macroeconomics
Average accuracy 0.230 - high_school_mathematics
Average accuracy 0.328 - high_school_microeconomics
Average accuracy 0.272 - high_school_physics
Average accuracy 0.516 - high_school_psychology
Average accuracy 0.153 - high_school_statistics
Average accuracy 0.436 - high_school_us_history
Average accuracy 0.435 - high_school_world_history
Average accuracy 0.453 - human_aging
Average accuracy 0.359 - human_sexuality
Average accuracy 0.496 - international_law
Average accuracy 0.370 - jurisprudence
Average accuracy 0.362 - logical_fallacies
Average accuracy 0.259 - machine_learning
Average accuracy 0.379 - management
Average accuracy 0.538 - marketing
Average accuracy 0.410 - medical_genetics
Average accuracy 0.520 - miscellaneous
Average accuracy 0.338 - moral_disputes
Average accuracy 0.242 - moral_scenarios
Average accuracy 0.369 - nutrition
Average accuracy 0.347 - philosophy
Average accuracy 0.386 - prehistory
Average accuracy 0.316 - professional_accounting
Average accuracy 0.295 - professional_law
Average accuracy 0.434 - professional_medicine
Average accuracy 0.333 - professional_psychology
Average accuracy 0.436 - public_relations
Average accuracy 0.261 - security_studies
Average accuracy 0.522 - sociology
Average accuracy 0.540 - us_foreign_policy
Average accuracy 0.386 - virology
Average accuracy 0.614 - world_religions
Average accuracy 0.247 - math
Average accuracy 0.377 - health
Average accuracy 0.309 - physics
Average accuracy 0.444 - business
Average accuracy 0.344 - biology
Average accuracy 0.274 - chemistry
Average accuracy 0.337 - computer science
Average accuracy 0.329 - economics
Average accuracy 0.400 - engineering
Average accuracy 0.317 - philosophy
Average accuracy 0.453 - other
Average accuracy 0.416 - history
Average accuracy 0.389 - geography
Average accuracy 0.401 - politics
Average accuracy 0.419 - psychology
Average accuracy 0.458 - culture
Average accuracy 0.314 - law
Average accuracy 0.297 - STEM
Average accuracy 0.335 - humanities
Average accuracy 0.396 - social sciences
Average accuracy 0.413 - other (business, health, misc.)
Average accuracy: 0.358

@congyingxia

@vince62s Hi Vince, thanks a lot for your update. Is 35.8 the average score obtained by averaging all the categories (each category contributes equally), or by merging the results of all tasks and computing the score from that (a weighted average by the number of examples)?

vince62s commented Jul 5, 2023

This 35.8 is what the script spits out, and it is computed as weighted_acc = np.mean(np.concatenate(all_cors)),
where all_cors holds the per-example correctness of all tasks, which means that a task with more examples than the others contributes more.

If you simply average the tasks (up to world_religions), it's 36.25.
If you average the subcategories (math to law), it's 36.64.
If you average the categories (STEM to other), it's 36.02.
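
To make the distinction concrete, here is a small numpy sketch of the micro vs macro average; the per-task arrays are made-up placeholders:

```python
# Sketch of micro vs. macro averaging over MMLU tasks (placeholder data).
import numpy as np

# One array of per-example correctness (1 = correct) per task (fake data).
all_cors = [
    np.array([1, 0, 1, 1, 0], dtype=float),                  # small task
    np.array([1, 0, 0, 0, 1, 0, 0, 1, 0, 0], dtype=float),   # larger task
]

# Micro average (what the script reports): pool every example, so larger
# tasks contribute more to the grand average.
weighted_acc = np.mean(np.concatenate(all_cors))

# Macro average: score each task first, then average the task scores, so
# every task contributes equally regardless of its size.
macro_acc = np.mean([cors.mean() for cors in all_cors])

print(weighted_acc, macro_acc)
```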
