mmlu detailed numbers #8

Closed
vince62s opened this issue Jun 29, 2023 · 12 comments

vince62s commented Jun 29, 2023

Hi guys,
This is awesome work.
I converted xgen7B to the OpenNMT-py format (as I did for many other models) and scored it with an MMLU (chain-of-thought) implementation that is close to the original one.
I am getting 34.68 for xgen, so I would like to check the detailed numbers; maybe there are some gaps I can identify.

Many thanks.
https://github.com/OpenNMT/OpenNMT-py/blob/master/eval_llm/MMLU/readme.md

NB: I am running it in FP16.

tianxie-9 commented Jun 29, 2023

We run the MMLU benchmark with the script here: https://github.com/hendrycks/test/pull/13/files, and we change the 2048 in the line "while input_ids.shape[-1] > 2048:" to 4096 or 8192 for our 4k or 8k models.
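
For reference, here is a minimal sketch of what that truncation loop typically looks like; the build_prompt helper and the placeholder examples are assumptions for illustration, not the actual PR code:

```python
# Sketch only (assumed structure, not the exact code from the PR): the k-shot
# prompt is rebuilt with fewer dev examples until it fits the context window.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Salesforce/xgen-7b-8k-base", trust_remote_code=True
)
max_len = 8192  # 2048 in the original script; 4096 or 8192 for the 4k/8k models

dev_examples = ["Question 1 ... Answer: A", "Question 2 ... Answer: C"]  # placeholders
test_question = "Question ... Answer:"

def build_prompt(shots, question):
    # Hypothetical helper: joins the few-shot examples and the test question.
    return "\n\n".join(shots + [question])

k = len(dev_examples)
prompt = build_prompt(dev_examples[:k], test_question)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
while input_ids.shape[-1] > max_len and k > 0:
    k -= 1  # drop one few-shot example and re-tokenize
    prompt = build_prompt(dev_examples[:k], test_question)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
```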

syzymon commented Jun 30, 2023

Thanks for the clarification! I converted XGen to EasyLM and got 34.88% on the original Hendrycks code with the 2048 filter (in the same codebase, I could reproduce the numbers for openllama 7B and 3B).

Update: after filtering by 8192, I got 34.99%. Do you recall any other changes to the evaluation, or is it rather some issue on my side? I'm using the base model from here: https://huggingface.co/Salesforce/xgen-7b-8k-base

BTW, I'm using fp32.

vince62s commented Jul 1, 2023

34.98 on my side with 8192, so the only diff I can imagine vs the 36.2 could be FP32 vs FP16.

@tianxie-9

We run all the evaluations with bf16; the model was pretrained with fp32, though.

tianxie-9 commented Jul 1, 2023

The detailed scores are here (5-shot, 8192 seq length, bf16):
{
  "subcategories": {
    "math": 0.25093984962406013,
    "health": 0.3884146341463415,
    "physics": 0.3234375,
    "business": 0.4302059496567506,
    "biology": 0.3436123348017621,
    "chemistry": 0.2706270627062706,
    "computer science": 0.3592233009708738,
    "economics": 0.32075471698113206,
    "engineering": 0.46206896551724136,
    "philosophy": 0.31759443339960236,
    "other": 0.4454935622317597,
    "history": 0.421505376344086,
    "geography": 0.36363636363636365,
    "politics": 0.4012345679012346,
    "psychology": 0.43560933448573896,
    "culture": 0.46987951807228917,
    "law": 0.3170731707317073
  },
  "categories": {
    "STEM": 0.3071570576540755,
    "humanities": 0.3379383634431456,
    "social sciences": 0.3997400064998375,
    "other (business, health, misc.)": 0.41455891425046265
  },
  "weighted_accuracy": 0.3625551915681527
}

@tianxie-9

@syzymon could you try this script without converting the model? https://github.com/hendrycks/test/pull/13/files
And use bf16 or fp32. We use fp32 in training and bf16 in evaluations.

syzymon commented Jul 1, 2023

Unfortunately, I only have TPU compute, so I cannot run PyTorch/HF, but I would greatly appreciate it if someone else could run this on GPU with bf16/fp32 and report their numbers.

vince62s commented Jul 3, 2023

Okay, I ran it using 8192 and dtype=float16.
Using 2048 instead of 8192 gives the exact same numbers EXCEPT for two tasks (high_school_european_history and high_school_us_history), leading to 36.4 instead of 36.5.

The main differences with chain-of-thought-hub are:

  1. we don't use the argmax of the logits but the actual generated next token, which may give a very slight difference when the next token is not in [A, B, C, D]; when I checked the details, the difference was marginal (see the sketch after this list)
  2. the way we average "ALL" is not the same: Hendrycks computes an average by category and then an average of the categories, which means each category contributes equally to the grand average, whereas chain-of-thought-hub totals the correct answers across all tasks and computes the score from that. My 34.98 becomes 35.3 from this standpoint.
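
To illustrate point 1, here is a minimal sketch of the two selection strategies; the model name, dtype and prompt are placeholders, not the exact code of either evaluation harness:

```python
# Sketch of the two answer-selection strategies discussed in point 1
# (model name, dtype and prompt are placeholders, not actual harness code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Salesforce/xgen-7b-8k-base"
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.float16, device_map="auto"
)

prompt = "...five-shot MMLU prompt ending with 'Answer:'"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    logits = model(input_ids).logits[0, -1]  # next-token logits

# Hendrycks-style: argmax restricted to the four option tokens, so the
# prediction is always one of A/B/C/D (which exact option token to use is
# the " A" vs "A" question discussed further down in the thread).
option_ids = [tokenizer(c, add_special_tokens=False).input_ids[0] for c in "ABCD"]
pred_restricted = "ABCD"[torch.argmax(logits[option_ids]).item()]

# Next-token style: unrestricted argmax over the whole vocabulary; the
# predicted token may fall outside A/B/C/D and then counts as wrong.
pred_free = tokenizer.decode(torch.argmax(logits).item()).strip()
```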

In the end, there are differences that I cannot explain, in both directions. I'll try to investigate by digging into one specific category.

Average accuracy 0.240 - abstract_algebra
Average accuracy 0.326 - anatomy
Average accuracy 0.401 - astronomy
Average accuracy 0.310 - business_ethics
Average accuracy 0.325 - clinical_knowledge
Average accuracy 0.396 - college_biology
Average accuracy 0.220 - college_chemistry
Average accuracy 0.380 - college_computer_science
Average accuracy 0.290 - college_mathematics
Average accuracy 0.341 - college_medicine
Average accuracy 0.225 - college_physics
Average accuracy 0.390 - computer_security
Average accuracy 0.362 - conceptual_physics
Average accuracy 0.237 - econometrics
Average accuracy 0.469 - electrical_engineering
Average accuracy 0.278 - elementary_mathematics
Average accuracy 0.206 - formal_logic
Average accuracy 0.310 - global_facts
Average accuracy 0.313 - high_school_biology
Average accuracy 0.276 - high_school_chemistry
Average accuracy 0.410 - high_school_computer_science
Average accuracy 0.448 - high_school_european_history
Average accuracy 0.384 - high_school_geography
Average accuracy 0.503 - high_school_government_and_politics
Average accuracy 0.338 - high_school_macroeconomics
Average accuracy 0.237 - high_school_mathematics
Average accuracy 0.311 - high_school_microeconomics
Average accuracy 0.272 - high_school_physics
Average accuracy 0.523 - high_school_psychology
Average accuracy 0.222 - high_school_statistics
Average accuracy 0.426 - high_school_us_history
Average accuracy 0.447 - high_school_world_history
Average accuracy 0.430 - human_aging
Average accuracy 0.435 - human_sexuality
Average accuracy 0.446 - international_law
Average accuracy 0.435 - jurisprudence
Average accuracy 0.350 - logical_fallacies
Average accuracy 0.277 - machine_learning
Average accuracy 0.388 - management
Average accuracy 0.521 - marketing
Average accuracy 0.480 - medical_genetics
Average accuracy 0.508 - miscellaneous
Average accuracy 0.387 - moral_disputes
Average accuracy 0.242 - moral_scenarios
Average accuracy 0.405 - nutrition
Average accuracy 0.354 - philosophy
Average accuracy 0.401 - prehistory
Average accuracy 0.316 - professional_accounting
Average accuracy 0.304 - professional_law
Average accuracy 0.449 - professional_medicine
Average accuracy 0.358 - professional_psychology
Average accuracy 0.436 - public_relations
Average accuracy 0.273 - security_studies
Average accuracy 0.502 - sociology
Average accuracy 0.480 - us_foreign_policy
Average accuracy 0.410 - virology
Average accuracy 0.573 - world_religions
Average accuracy 0.254 - math
Average accuracy 0.395 - health
Average accuracy 0.328 - physics
Average accuracy 0.442 - business
Average accuracy 0.339 - biology
Average accuracy 0.257 - chemistry
Average accuracy 0.362 - computer science
Average accuracy 0.314 - economics
Average accuracy 0.469 - engineering
Average accuracy 0.319 - philosophy
Average accuracy 0.445 - other
Average accuracy 0.427 - history
Average accuracy 0.384 - geography
Average accuracy 0.401 - politics
Average accuracy 0.436 - psychology
Average accuracy 0.476 - culture
Average accuracy 0.322 - law
Average accuracy 0.308 - STEM
Average accuracy 0.341 - humanities
Average accuracy 0.400 - social sciences
Average accuracy 0.419 - other (business, health, misc.)
Average accuracy: 0.365

vince62s commented Jul 3, 2023

I think there is a mistake in this implementation: https://github.com/hendrycks/test/pull/13/files. I will ask the author directly there.

see my comment here: hendrycks/test#13

vince62s commented Jul 3, 2023

So the score is 35.8 when using " A", " B", " C", " D" (with a leading space) as the answer tokens.
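
For context, a tiny sketch of why the leading space can matter with this kind of tokenizer; the model name is taken from the thread, and the exact ids depend on the tokenizer:

```python
# Sketch: with BPE-style tokenizers, "A" and " A" usually map to different
# token ids, so scoring the logits of the wrong variant compares tokens the
# model rarely produces right after "Answer:".
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Salesforce/xgen-7b-8k-base", trust_remote_code=True
)
for c in "ABCD":
    no_space = tokenizer(c, add_special_tokens=False).input_ids
    with_space = tokenizer(" " + c, add_special_tokens=False).input_ids
    print(c, no_space, with_space)  # typically two different single-token ids
```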

Average accuracy 0.320 - abstract_algebra
Average accuracy 0.333 - anatomy
Average accuracy 0.355 - astronomy
Average accuracy 0.290 - business_ethics
Average accuracy 0.306 - clinical_knowledge
Average accuracy 0.403 - college_biology
Average accuracy 0.280 - college_chemistry
Average accuracy 0.360 - college_computer_science
Average accuracy 0.300 - college_mathematics
Average accuracy 0.318 - college_medicine
Average accuracy 0.245 - college_physics
Average accuracy 0.390 - computer_security
Average accuracy 0.332 - conceptual_physics
Average accuracy 0.289 - econometrics
Average accuracy 0.400 - electrical_engineering
Average accuracy 0.280 - elementary_mathematics
Average accuracy 0.246 - formal_logic
Average accuracy 0.320 - global_facts
Average accuracy 0.316 - high_school_biology
Average accuracy 0.271 - high_school_chemistry
Average accuracy 0.350 - high_school_computer_science
Average accuracy 0.424 - high_school_european_history
Average accuracy 0.389 - high_school_geography
Average accuracy 0.487 - high_school_government_and_politics
Average accuracy 0.341 - high_school_macroeconomics
Average accuracy 0.230 - high_school_mathematics
Average accuracy 0.328 - high_school_microeconomics
Average accuracy 0.272 - high_school_physics
Average accuracy 0.516 - high_school_psychology
Average accuracy 0.153 - high_school_statistics
Average accuracy 0.436 - high_school_us_history
Average accuracy 0.435 - high_school_world_history
Average accuracy 0.453 - human_aging
Average accuracy 0.359 - human_sexuality
Average accuracy 0.496 - international_law
Average accuracy 0.370 - jurisprudence
Average accuracy 0.362 - logical_fallacies
Average accuracy 0.259 - machine_learning
Average accuracy 0.379 - management
Average accuracy 0.538 - marketing
Average accuracy 0.410 - medical_genetics
Average accuracy 0.520 - miscellaneous
Average accuracy 0.338 - moral_disputes
Average accuracy 0.242 - moral_scenarios
Average accuracy 0.369 - nutrition
Average accuracy 0.347 - philosophy
Average accuracy 0.386 - prehistory
Average accuracy 0.316 - professional_accounting
Average accuracy 0.295 - professional_law
Average accuracy 0.434 - professional_medicine
Average accuracy 0.333 - professional_psychology
Average accuracy 0.436 - public_relations
Average accuracy 0.261 - security_studies
Average accuracy 0.522 - sociology
Average accuracy 0.540 - us_foreign_policy
Average accuracy 0.386 - virology
Average accuracy 0.614 - world_religions
Average accuracy 0.247 - math
Average accuracy 0.377 - health
Average accuracy 0.309 - physics
Average accuracy 0.444 - business
Average accuracy 0.344 - biology
Average accuracy 0.274 - chemistry
Average accuracy 0.337 - computer science
Average accuracy 0.329 - economics
Average accuracy 0.400 - engineering
Average accuracy 0.317 - philosophy
Average accuracy 0.453 - other
Average accuracy 0.416 - history
Average accuracy 0.389 - geography
Average accuracy 0.401 - politics
Average accuracy 0.419 - psychology
Average accuracy 0.458 - culture
Average accuracy 0.314 - law
Average accuracy 0.297 - STEM
Average accuracy 0.335 - humanities
Average accuracy 0.396 - social sciences
Average accuracy 0.413 - other (business, health, misc.)
Average accuracy: 0.358

@congyingxia

@vince62s Hi Vince, thanks a lot for your update. Is 35.8 the average score obtained by averaging all the categories (each category contributes equally), or by merging the results of all tasks and computing the score from that (a weighted average by the number of examples)?

vince62s commented Jul 5, 2023

This 35.8 is what the script spits out, and it is computed as weighted_acc = np.mean(np.concatenate(all_cors)),
where all_cors holds the per-example correctness of all tasks, which means that a task with more examples than the others contributes more.

If you simply average the tasks (up to world_religions), it's 36.25.
If you average the subcategories (math to law), it's 36.64.
If you average the categories (STEM to other), it's 36.02.
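
To make the distinction concrete, here is a small numpy sketch of the micro vs macro average; the per-task arrays are made-up placeholders:

```python
# Sketch of micro vs. macro averaging over MMLU tasks (placeholder data).
import numpy as np

# One array of per-example correctness (1 = correct) per task (fake data).
all_cors = [
    np.array([1, 0, 1, 1, 0], dtype=float),                  # small task
    np.array([1, 0, 0, 0, 1, 0, 0, 1, 0, 0], dtype=float),   # larger task
]

# Micro average (what the script reports): pool every example, so larger
# tasks contribute more to the grand average.
weighted_acc = np.mean(np.concatenate(all_cors))

# Macro average: score each task first, then average the task scores, so
# every task contributes equally regardless of its size.
macro_acc = np.mean([cors.mean() for cors in all_cors])

print(weighted_acc, macro_acc)
```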
