We assume that you have finished running the inference script and that the JSON result files are now on your local disk under `result_dir/wild`.
MODEL="Yi-1.5-9B-Chat-Test" # your model name
bash evaluation/run_eval_v2_batch.score.sh $MODEL # individual scoring
bash evaluation/run_eval_v2_batch.sh $MODEL gpt-4-turbo-2024-04-09 # pairwise eval with gpt-4-turbo
bash evaluation/run_eval_v2_batch.sh $MODEL claude-3-haiku-20240307 # pairwise eval with Claude-3-Opus
bash evaluation/run_eval_v2_batch.sh $MODEL Llama-2-70b-chat-hf # pairwise eval with Llama-2-70b-chat
# Now you should have the .batch_submit.jsonl files in the output_dir
Before submitting, you can inspect the batch-submit files to check that the requests look correct.
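For example, here is a minimal Python sketch that peeks at the first request in one of the files (the path is illustrative; substitute your own output_dir, evaluator, and model name):

```python
# Peek at the first request in a batch-submit file (the path is an example).
import json

path = "eval_results/v2.0522/score.v2/eval=gpt-4-turbo-2024-04-09/Yi-1.5-9B-Chat-Test.batch-submit.jsonl"
with open(path) as f:
    first_request = json.loads(f.readline())  # the file is JSONL: one request per line

print(json.dumps(first_request, indent=2)[:2000])  # truncate for readability
```

Once the requests look good, submit the batch jobs: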
MODEL="Yi-1.5-9B-Chat-Test" # your model name
python src/openai_batch_eval/submit_batch.py eval_results/v2.0522/pairwise.v2/eval=gpt-4-turbo-2024-04-09/ref=gpt-4-turbo-2024-04-09/$MODEL.batch-submit.jsonl
python src/openai_batch_eval/submit_batch.py eval_results/v2.0522/pairwise.v2/eval=gpt-4-turbo-2024-04-09/ref=claude-3-haiku-20240307/$MODEL.batch-submit.jsonl
python src/openai_batch_eval/submit_batch.py eval_results/v2.0522/pairwise.v2/eval=gpt-4-turbo-2024-04-09/ref=Llama-2-70b-chat-hf/$MODEL.batch-submit.jsonl
python src/openai_batch_eval/submit_batch.py eval_results/v2.0522/score.v2/eval=gpt-4-turbo-2024-04-09/$MODEL.batch-submit.jsonl
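Under the hood, submission presumably goes through the OpenAI Batch API. A minimal, illustrative sketch of that flow (this is not the actual `submit_batch.py`; the file path is the example from above, and `OPENAI_API_KEY` must be set in your environment):

```python
# Illustrative sketch of the OpenAI Batch API flow; NOT the actual submit_batch.py.
from openai import OpenAI

client = OpenAI()

# Upload the JSONL request file with purpose="batch".
batch_file = client.files.create(
    file=open("eval_results/v2.0522/score.v2/eval=gpt-4-turbo-2024-04-09/Yi-1.5-9B-Chat-Test.batch-submit.jsonl", "rb"),
    purpose="batch",
)

# Create the batch job against the chat completions endpoint.
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(f"Batch submitted. ID: {batch.id}")
```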
Each submission command will print a batch ID, e.g., `Batch submitted. ID: batch_ZiiPf06AvELbqjPhf6qxJNls`, which you can use to check the status of the batch job:
```bash
python src/openai_batch_eval/check_batch_status_with_id.py batch_ZiiPf06AvELbqjPhf6qxJNls
# repeat this command until all batch jobs are finished
```
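If you prefer, you can also poll the status with a small script of your own; a sketch using the OpenAI Python SDK (the batch ID below is the example one from above):

```python
# Poll a batch job until it reaches a terminal state.
import time
from openai import OpenAI

client = OpenAI()
batch_id = "batch_ZiiPf06AvELbqjPhf6qxJNls"  # example ID from above

while True:
    batch = client.batches.retrieve(batch_id)
    print(f"{batch_id}: {batch.status}")
    if batch.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(60)  # check once a minute
```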
The final formatted results will be saved as follows:
```
eval_results/v2.0522/pairwise.v2/eval=gpt-4-turbo-2024-04-09/ref=gpt-4-turbo-2024-04-09/${MODEL}.json
eval_results/v2.0522/pairwise.v2/eval=gpt-4-turbo-2024-04-09/ref=claude-3-haiku-20240307/${MODEL}.json
eval_results/v2.0522/pairwise.v2/eval=gpt-4-turbo-2024-04-09/ref=Llama-2-70b-chat-hf/${MODEL}.json
eval_results/v2.0522/score.v2/eval=gpt-4-turbo-2024-04-09/${MODEL}.json
```
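Once all batch jobs are finished, a quick sanity check (a small sketch, assuming the paths above; set `MODEL` to your model name) confirms the four files are in place:

```python
# Sanity check: make sure all four formatted result files exist.
from pathlib import Path

MODEL = "Yi-1.5-9B-Chat-Test"  # your model name
paths = [
    f"eval_results/v2.0522/pairwise.v2/eval=gpt-4-turbo-2024-04-09/ref=gpt-4-turbo-2024-04-09/{MODEL}.json",
    f"eval_results/v2.0522/pairwise.v2/eval=gpt-4-turbo-2024-04-09/ref=claude-3-haiku-20240307/{MODEL}.json",
    f"eval_results/v2.0522/pairwise.v2/eval=gpt-4-turbo-2024-04-09/ref=Llama-2-70b-chat-hf/{MODEL}.json",
    f"eval_results/v2.0522/score.v2/eval=gpt-4-turbo-2024-04-09/{MODEL}.json",
]
for p in paths:
    status = "ok" if Path(p).exists() else "MISSING"
    print(f"{status:7s} {p}")
```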
To view the results, run:

```bash
bash leaderboard/show_eval.sh
```

If you only ran the score-based evaluation, use:

```bash
bash leaderboard/show_eval.sh score_only
```

The second argument is `K`, the length margin for the length penalty. You can set it to -1 or leave it empty to disable the length penalty.
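The exact penalty rule is implemented in the leaderboard scripts; purely as an illustration, assuming a pairwise win is downgraded to a tie when the winner's response is longer than the loser's by more than `K` characters, the idea looks like this:

```python
# Illustration only; this is NOT the leaderboard's actual implementation.
# Assumption: a pairwise win is downgraded to a tie when the winning
# response is longer than the losing one by more than K characters.
def apply_length_penalty(outcome: str, winner_len: int, loser_len: int, k: int) -> str:
    if k < 0:  # K = -1 (or empty) disables the penalty
        return outcome
    if outcome == "win" and (winner_len - loser_len) > k:
        return "tie"
    return outcome

# Example: a win by a response 600 characters longer than its opponent,
# with K = 500, is counted as a tie.
print(apply_length_penalty("win", 2100, 1500, 500))  # -> "tie"
```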