[Fix] Fix Math Evaluation with Judge Model Evaluator & Add README (open-compass#1103)

* Add Math Evaluation with Judge Model Evaluator

* Add Math Evaluation with Judge Model Evaluator

* Add Math Evaluation with Judge Model Evaluator

* Add Math Evaluation with Judge Model Evaluator

* Fix Llama-3 meta template

* Fix MATH with JudgeLM Evaluation

* Fix MATH with JudgeLM Evaluation

* Fix MATH with JudgeLM Evaluation

* Fix MATH with JudgeLM Evaluation

---------

Co-authored-by: liuhongwei <[email protected]>
liushz and liuhongwei authored Apr 28, 2024
1 parent 0b7de67 commit a6f67e1
Showing 6 changed files with 454 additions and 18 deletions.
37 changes: 37 additions & 0 deletions configs/datasets/math/math_gen_78ced2.py
@@ -0,0 +1,37 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import MATHDataset, MATHEvaluator, math_postprocess

QUERY_TEMPLATE = """
Solve the following math problem step by step. The last line of your response should be of the form ANSWER: $ANSWER (without quotes) where $ANSWER is the answer to the problem.
{problem}
Remember to put your answer on its own line after "ANSWER:", and you do not need to use a \\boxed command.
""".strip()

math_reader_cfg = dict(input_columns=['problem'], output_column='solution')

math_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(round=[
            dict(role="HUMAN", prompt=QUERY_TEMPLATE),
        ])),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer, max_out_len=512))

math_eval_cfg = dict(
    evaluator=dict(type=MATHEvaluator), pred_postprocessor=dict(type=math_postprocess))

math_datasets = [
    dict(
        type=MATHDataset,
        abbr='math',
        path='./data/math/math.json',
        reader_cfg=math_reader_cfg,
        infer_cfg=math_infer_cfg,
        eval_cfg=math_eval_cfg)
]
2 changes: 1 addition & 1 deletion configs/datasets/math/math_llm_judge.py
@@ -19,7 +19,7 @@
dict(role="HUMAN", prompt=QUERY_TEMPLATE),
])),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=512))
inferencer=dict(type=GenInferencer, max_out_len=1024))

math_eval_cfg = dict(
evaluator=dict(type=MATHEvaluator), pred_postprocessor=dict(type=math_postprocess))
52 changes: 39 additions & 13 deletions configs/eval_math_llm_judge.py
@@ -2,10 +2,8 @@
from mmengine.config import read_base
with read_base():
    from .models.hf_llama.hf_llama3_8b_instruct import models as hf_llama3_8b_instruct_model  # noqa: F401, F403
-    from .models.hf_internlm.hf_internlm2_chat_20b import models as hf_internlm2_chat_20b_model  # noqa: F401, F403
    from .models.hf_llama.hf_llama3_70b_instruct import models as hf_llama3_70b_instruct_model  # noqa: F401, F403
    from .datasets.math.math_llm_judge import math_datasets  # noqa: F401, F403
-from opencompass.models.openai_api import OpenAIAllesAPIN
from opencompass.datasets import math_judement_preprocess
from opencompass.partitioners import NaivePartitioner, SizePartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
@@ -22,41 +20,69 @@
# -------------Prompt Settings ----------------------------------------
eng_obj_prompt = """
Look at the following two expressions (answers to a math problem) and judge whether they are equivalent. Only perform trivial simplifications
Examples:
Expression 1: $2x+3$
Expression 2: $3+2x$
-Result: [[Correct]]
+[Yes]
Expression 1: 3/2
Expression 2: 1.5
-Result: [[Correct]]
+[Yes]
Expression 1: $x^2+2x+1$
Expression 2: $y^2+2y+1$
-Result: [[Incorrect]]
+[No]
Expression 1: $x^2+2x+1$
Expression 2: $(x+1)^2$
-Result: [[Correct]]
+[Yes]
Expression 1: 3245/5
Expression 2: 649
-Result: [[Incorrect]]
+[No]
(these are actually equal, don't mark them equivalent if you need to do nontrivial simplifications)
Expression 1: 2/(-3)
Expression 2: -2/3
-Result: [[Correct]]
+[Yes]
(trivial simplifications are allowed)
Expression 1: 72 degrees
Expression 2: 72
-Result: [[Correct]]
+[Yes]
(give benefit of the doubt to units)
Expression 1: 64
Expression 2: 64 square feet
-Result: [[Correct]]
+[Yes]
(give benefit of the doubt to units)
+Expression 1: 64
+Expression 2:
+[No]
+(only mark as equivalent if both expressions are nonempty)
---
YOUR TASK
-Respond with only "Result: [[Correct]]" or "Result: [[Incorrect]]" (without quotes). Do not include a rationale.
+Respond with only "[Yes]" or "[No]" (without quotes). Do not include a rationale.
Expression 1: {obj_gold}
-Expression 2: {prediction}
-""".strip()
+Expression 2: {prediction}
+"""

# -------------Inference Stage ----------------------------------------
# eval models
186 changes: 186 additions & 0 deletions docs/en/advanced_guides/objective_judgelm_evaluation.md
@@ -0,0 +1,186 @@
# Using Large Models as JudgeLLM for Objective Evaluation

## Introduction

Traditional objective evaluation compares model output against a standard answer. In practice, however, a model's predictions vary with its instruction-following ability, and post-processing functions are imperfect, so the answer may be extracted incorrectly before it is compared with the reference, which makes the reported scores inaccurate. To address this, we adopt a process similar to subjective evaluation and introduce a JudgeLLM after the prediction stage to assess whether the model's response is consistent with the standard answer ([LLM-as-a-Judge](https://arxiv.org/abs/2306.05685)).
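
As a concrete (hypothetical) illustration of the failure mode, the sketch below shows a naive rule-based check that extracts an answer with a regular expression and compares strings; it rejects a mathematically equivalent answer, which is exactly the gap a JudgeLLM is meant to close. None of this is OpenCompass code.

```python
import re

gold = '1/2'
prediction = 'After simplifying, the final answer is 0.5'

# Naive rule-based scoring: regex extraction followed by exact string match.
match = re.search(r'answer is\s*(\S+)', prediction)
naive_correct = match is not None and match.group(1) == gold
print(naive_correct)  # False, even though 0.5 and 1/2 denote the same value

# A JudgeLLM is instead shown both expressions and asked whether they are
# equivalent, so harmless formatting differences are not counted as errors.
```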

Currently, any model supported by the OpenCompass repository can be used directly as the JudgeLLM; support for dedicated JudgeLLM models is also planned.

## Currently Supported Objective Evaluation Datasets

1. MATH ([https://github.com/hendrycks/math](https://github.com/hendrycks/math))

## Custom JudgeLLM Objective Dataset Evaluation

OpenCompass currently supports most datasets that use `GenInferencer` for inference. The specific process for custom JudgeLLM objective evaluation includes:

1. Build an evaluation configuration that uses an API model or an open-source model to generate answers to the questions.
2. Use a selected judge model (JudgeLLM) to assess those outputs against the standard answers.

### Step One: Building Evaluation Configurations, Using MATH as an Example

Below is the config for evaluating the MATH dataset with a JudgeLLM, where the model under evaluation is *Llama3-8b-instruct* and the JudgeLLM is *Llama3-70b-instruct*. For the complete settings, see `configs/eval_math_llm_judge.py`; the annotated version below is abridged to help readers understand the structure of the configuration file.

```python
# Most of the code in this file is copied from https://github.com/openai/simple-evals/blob/main/math_eval.py
from mmengine.config import read_base
with read_base():
    from .models.hf_llama.hf_llama3_8b_instruct import models as hf_llama3_8b_instruct_model  # noqa: F401, F403
    from .models.hf_llama.hf_llama3_70b_instruct import models as hf_llama3_70b_instruct_model  # noqa: F401, F403
    from .datasets.math.math_llm_judge import math_datasets  # noqa: F401, F403
from opencompass.datasets import math_judement_preprocess
from opencompass.partitioners import NaivePartitioner, SizePartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.partitioners.sub_size import SubjectiveSizePartitioner
from opencompass.runners import LocalRunner
from opencompass.runners import SlurmSequentialRunner
from opencompass.tasks import OpenICLInferTask
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
from opencompass.summarizers import AllObjSummarizer
from opencompass.openicl.icl_evaluator import LMEvaluator
from opencompass.openicl.icl_prompt_template import PromptTemplate


# ------------- Prompt Settings ----------------------------------------
# Evaluation template; modify it as needed. The JudgeLLM typically answers with [Yes] or [No]. For the MATH dataset, the evaluation template is as follows:
eng_obj_prompt = """
Look at the following two expressions (answers to a math problem) and judge whether they are equivalent. Only perform trivial simplifications
Examples:
Expression 1: $2x+3$
Expression 2: $3+2x$
[Yes]
Expression 1: 3/2
Expression 2: 1.5
[Yes]
Expression 1: $x^2+2x+1$
Expression 2: $y^2+2y+1$
[No]
Expression 1: $x^2+2x+1$
Expression 2: $(x+1)^2$
[Yes]
Expression 1: 3245/5
Expression 2: 649
[No]
(these are actually equal, don't mark them equivalent if you need to do nontrivial simplifications)
Expression 1: 2/(-3)
Expression 2: -2/3
[Yes]
(trivial simplifications are allowed)
Expression 1: 72 degrees
Expression 2: 72
[Yes]
(give benefit of the doubt to units)
Expression 1: 64
Expression 2: 64 square feet
[Yes]
(give benefit of the doubt to units)
Expression 1: 64
Expression 2:
[No]
(only mark as equivalent if both expressions are nonempty)
---
YOUR TASK
Respond with only "[Yes]" or "[No]" (without quotes). Do not include a rationale.
Expression 1: {obj_gold}
Expression 2: {prediction}
"""

# ------------- Inference Phase ----------------------------------------
# Models to be evaluated
models = [*hf_llama3_8b_instruct_model]
# Judge models (JudgeLLM)
judge_models = hf_llama3_70b_instruct_model

eng_datasets = [*math_datasets]
chn_datasets = []
datasets = eng_datasets + chn_datasets


for d in eng_datasets:
    d['eval_cfg'] = dict(
        evaluator=dict(
            type=LMEvaluator,
            # If you need to preprocess model predictions before judging,
            # you can specify a pred_postprocessor function here
            pred_postprocessor=dict(type=math_judement_preprocess),
            prompt_template=dict(
                type=PromptTemplate,
                template=dict(round=[
                    dict(
                        role='HUMAN',
                        prompt=eng_obj_prompt
                    ),
                ]),
            ),
        ),
        pred_role="BOT",
    )

infer = dict(
    partitioner=dict(type=SizePartitioner, max_task_size=40000),
    runner=dict(
        type=LocalRunner,
        max_num_workers=256,
        task=dict(type=OpenICLInferTask)),
)

# ------------- Evaluation Configuration --------------------------------
eval = dict(
    partitioner=dict(
        type=SubjectiveSizePartitioner, max_task_size=80000, mode='singlescore', models=models, judge_models=judge_models,
    ),
    runner=dict(type=LocalRunner,
                max_num_workers=16, task=dict(type=SubjectiveEvalTask)),
)

summarizer = dict(
    type=AllObjSummarizer
)

# Output folder
work_dir = 'outputs/obj_all/'
```
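
For reference, the `pred_postprocessor=dict(type=math_judement_preprocess)` entry is there to reduce the model's free-form reasoning to its final answer before the JudgeLLM sees it. The snippet below is only a minimal sketch of that idea, not the actual OpenCompass implementation; it assumes the prediction ends with the `ANSWER: ...` line that the MATH query template in this setup asks for.

```python
import re


def extract_final_answer(prediction: str) -> str:
    """Illustrative only: return the text after the last 'ANSWER:' marker,
    falling back to the raw prediction if the marker is missing."""
    matches = re.findall(r'ANSWER:\s*(.+)', prediction)
    return matches[-1].strip() if matches else prediction.strip()


print(extract_final_answer('Step 1: expand.\nStep 2: simplify.\nANSWER: 2/(-3)'))  # -> 2/(-3)
```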

### Step Two: Launch Evaluation and Output Results

```shell
python run.py eval_math_llm_judge.py
```

This launches two stages of evaluation: in the first, the models under evaluation run inference to produce predicted answers to the questions; in the second, the JudgeLLM judges whether each predicted answer is consistent with the standard answer and scores it.
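
To make the second stage concrete, here is a tiny self-contained sketch (not OpenCompass internals) of how the `{obj_gold}` and `{prediction}` placeholders in the judge template are filled for a single sample; the JudgeLLM is then expected to answer `[Yes]` or `[No]`.

```python
# Only the tail of the judge template is reproduced here for brevity.
judge_template = 'Expression 1: {obj_gold}\nExpression 2: {prediction}'

gold, prediction = '\\frac{1}{2}', '0.5'
judge_query = judge_template.format(obj_gold=gold, prediction=prediction)
print(judge_query)
# Expression 1: \frac{1}{2}
# Expression 2: 0.5
```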

- The results of model predictions will be saved in `output/.../timestamp/predictions/xxmodel/xxx.json`
- The JudgeLLM's evaluation responses will be saved in `output/.../timestamp/results/xxmodel/xxx.json`
- The evaluation report will be output to `output/.../timestamp/summary/timestamp/xxx.csv`
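
If you want to inspect the report programmatically rather than opening the CSV by hand, a sketch like the following works; it assumes the default `work_dir = 'outputs/obj_all/'` from the config above, so adjust the glob pattern to your own output directory and timestamp.

```python
import csv
import glob

# Pick the most recent summary CSV produced under the configured work_dir.
summary_files = sorted(glob.glob('outputs/obj_all/*/summary/*/*.csv'))
with open(summary_files[-1], newline='') as f:
    for row in csv.reader(f):
        print(row)
```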

## Results

With Llama3-8b-instruct as the model under evaluation and Llama3-70b-instruct as the JudgeLLM, the results on the MATH dataset are as follows:

| Model | JudgeLLM Evaluation | Naive Evaluation |
| ------------------- | ------------------- | ---------------- |
| llama-3-8b-instruct | 27.7 | 27.8 |