datasets/MMLU/
. Test set containing 57 subtasks.
In every text, the content before [SEP]
is the question, and the content after [SEP]
is the standard answer to that question.
examples/eval_mmlu_2x32B.sh
. The evaluation results for mmlu could be obtained by running this program.
The variables in the code should be set as follows:
Variable name | Description |
---|---|
CHECKPOINT_PATH |
the path that saves the checkpoint to be evaluated. |
TOKENIZER_MODEL_PATH |
the path that saves the tokenizer. |
MATH_DATA |
the path that saves the evaluation set. |
OUTPUT_PATH |
the path that saves the evaluation results. |
Run the following command to evaluate the model's performance on the test set:
bash -x examples/eval_mmlu_2x32B.sh
The evaluation result will be saved in the path of OUTPUT_PATH
. In the text, the content before [SEP]
is the question, and the content after [SEP]
is the answer of our model to that question.
tasks/MMLU/score_mmlu.py
. The accuracy of evaluation results for mmlu could be obtained by running this program.
The path variables in the code should be set as follows:
Variable name | Description |
---|---|
origin_file_path |
Path of evaluation set file. |
eval_file_path |
Path for saving the evaluation result file. |
txt_eval_res_dir |
Path for storing distinguished results. Files ending with _true contain correctly results, while those ending in _false contain incorrectly results. The CSV file and jsonl file store accuracy statistics for each subtask. |
Run the following command to evaluate the model's performance on the test set:
python score_mmlu.py
"Number of correct answers" and "Number of incorrect answers" respectively represent the total count of correct and incorrect answers for all tasks, while "accuracy" represents the overall accuracy.