🏆Leaderboard • 🔥Quick Start • 🐛Issues • 📜Citation • 🙏Acknowledgement
Latest News 🔥
- [24/10/09] We release the second version of the RACE paper.
- [24/10/09] We add the evaluation results of 9 LLMs (including `o1-mini-2024-09-12`) to the RACE leaderboard.
- [24/10/01] We have improved the calculation methods for readability-related metrics and enhanced the robustness of the code post-processing techniques.
- [24/10/01] We have revised the test code in the LeetCode evaluation data to support cases with multiple correct answers.
- [24/07/24] We add the evaluation results of `claude-3.5-sonnet` and `Qwen2-72B-Instruct` to the RACE leaderboard.
- [24/07/16] We release our RACE benchmark, leaderboard, and paper.
RACE is a multi-dimensional benchmark for code generation that focuses on Readability, mAintainability, Correctness, and Efficiency. Its goal is to evaluate LLMs' ability to generate code that is correct and meets the requirements of real-world development scenarios. The benchmark is designed with various real-world demands across different demand-dependent dimensions, making it more applicable to practical scenarios. To facilitate evaluation on RACE, we provide easy-to-use evaluation scripts, and evaluating in a virtualized environment ensures that the generated code is executed securely.
The overall evaluation pipeline is shown above. First, we summarize multiple representative factors for each dimension based on their respective quality definitions. Second, we design several reasonable customized requirements for each factor and integrate them into the task descriptions, requiring the model to generate code that is both correct and meets these requirements. Finally, leveraging static analysis and runtime monitoring techniques, we develop evaluation metrics tailored to each factor to achieve accurate and efficient evaluation. Read our paper Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models for further information.
To start, please run the following to prepare the environment:
pip install -e .
To use vLLM as the generation backend, please run the following command instead:
pip install -e .[vllm_gen]
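After installation, you can optionally sanity-check the environment. This is a minimal check that assumes the project installs a package importable as `race`, matching the module path used by the Docker evaluation commands later in this guide:

# Optional sanity check (assumes the editable install exposes a `race` package)
python -c "import race; print(race.__name__)"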
Taking `Readability` as an example, use the following command to generate code samples from a model; the samples are saved in JSON Lines (jsonl) format. `root` refers to the directory of the output files, and `backend` supports `openai` and `vllm`. To use the `openai` backend, please ensure that the environment variables `${API_BASE}` and `${API_KEY}` are configured.
scripts/gen_readability.sh ${model} ${root} ${backend}
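For instance, a concrete invocation might look like the following. The model name, output directory, and API endpoint here are placeholder values for illustration; substitute your own.

# Hypothetical example values; replace with your own model, output directory, and credentials
export API_BASE="https://api.openai.com/v1"   # assumed OpenAI-compatible endpoint
export API_KEY="sk-..."                       # your API key
scripts/gen_readability.sh gpt-4o-mini outputs openai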
⏬ More commands for other dimensions:
# For `Correctness`
scripts/gen_correctness.sh ${model} ${root} ${backend}
# For `Maintainability`
scripts/gen_maintainability.sh ${model} ${root} ${backend}
# For `Efficiency`
scripts/gen_efficiency.sh ${model} ${root} ${backend}
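To generate samples for all four dimensions in one go, a simple loop over the per-dimension scripts works. This is a sketch that assumes each script takes the same three positional arguments shown above.

# Sketch: run every generation script with the same model/root/backend arguments
for dim in readability correctness maintainability efficiency; do
    scripts/gen_${dim}.sh ${model} ${root} ${backend}
done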
Use the following command to read the code sample file, extract valid code from the model's generated responses, and save it to a file with a `parsed` suffix.
python scripts/parse_generated_file.py \
    --generated_file_path ${generated_file_path} \
    --model ${model}
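For example, if the generation step above wrote its samples to a file such as outputs/readability_samples.jsonl (a hypothetical filename used here only for illustration), the call would look like:

# Hypothetical paths for illustration; point these at your actual generation output
python scripts/parse_generated_file.py \
    --generated_file_path outputs/readability_samples.jsonl \
    --model gpt-4o-mini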
The evaluation of the generated code is divided into two parts: 1) assessing the correctness of the generated code, and 2) computing various non-execution-based metrics.
First, build the Docker image, which serves as the environment for evaluating the code by execution.
docker build --rm -f "./Dockerfile" -t race:latest "."
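Once the build finishes, you can confirm the image is available before running any evaluations, using a standard Docker command:

# List the freshly built image; the tag should match the one used above
docker images race:latest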
Then, taking code readability as an example, test the correctness of the LLM-generated code based on the test cases. In this context, the evaluation of the generated code under the `correctness` dimension is distributed across the `readability`, `maintainability`, and `efficiency` dimensions.
scripts/eval_c_readability.sh ${model} ${root}
⏬ More commands for other dimensions:
# For `Maintainability`
scripts/eval_c_maintainability.sh ${model} ${root}
# For `Efficiency`
scripts/eval_c_efficiency.sh ${model} ${root}
# For only `Correctness`
scripts/eval_c_correctness.sh ${model} ${root}
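As with generation, the correctness evaluations can be scripted in a loop. This is a sketch that assumes each script takes the same model and root arguments shown above.

# Sketch: run the execution-based evaluation for every dimension
for dim in readability maintainability efficiency correctness; do
    scripts/eval_c_${dim}.sh ${model} ${root}
done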
Here are further details on how to evaluate the correctness of LLM-generated code under a single factor.
# For `Readability`
docker run -v $(pwd):/data race:latest race.codeeval.evaluate_pipeline_evalplus \
    --dataset [humaneval|mbpp] \
    --samples "/data/outputs/${parsed_generated_file}"
# For `Maintainability (MI Metric)`
docker run -v $(pwd):/data race:latest race.codeeval.evaluate_pipeline_classeval test_pipeline \
    --model_name ${model} \
    --generated_data_path "/data/outputs/${generated_file}" \
    --root "/data/outputs"
# For `Maintainability (Modularity)`
docker run -v $(pwd):/data race:latest race.codeeval.evaluate_pipeline_leetcode_style test_pipeline_simple \
    --model_name ${model} \
    --evaluation_test_case_path "/data/data/leetcode/evaluation_tests.jsonl" \
    --generated_data_path "/data/outputs/${parsed_generated_file}" \
    --result_path "/data/outputs/${results_file}" \
    --temp_path "/data/outputs"
# For `Efficiency`
docker run -v $(pwd):/data race:latest race.codeeval.evaluate_pipeline_leetcode_style test_pipeline_complexity \
    --model_name ${model} \
    --evaluation_test_case_path "/data/data/leetcode_efficiency/complexity_evaluation_test_cases.jsonl" \
    --evaluation_efficiency_data_path "/data/data/leetcode_efficiency/complexity_evaluation_data.jsonl" \
    --generated_data_path "/data/outputs/${parsed_generated_file}" \
    --result_path "/data/outputs/${results_file}" \
    --temp_path "/data/outputs"
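Putting it together for the `Readability` factor, a concrete run might look like this. The parsed filename below is a hypothetical example; use the `parsed` file produced by the parsing step, and pick the dataset (`humaneval` or `mbpp`) that matches your generation setup.

# Hypothetical parsed filename for illustration
docker run -v $(pwd):/data race:latest race.codeeval.evaluate_pipeline_evalplus \
    --dataset humaneval \
    --samples "/data/outputs/readability_samples_parsed.jsonl"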
Finally, get the evaluation results based on the specific metrics. Take `Readability` as an example:
python scripts/get_metric_readability.py \
--model ${model} \
--output_path_root ${root}
⏬ More commands for other dimensions:
# For `Correctness`
python scripts/get_metric_correctness.py \
--model ${model} \
--output_path_root ${root}
# For `Maintainability`
python scripts/get_metric_maintainability.py \
--model ${model} \
--output_path_root ${root}
# For `Efficiency`
python scripts/get_metric_efficiency.py \
--model ${model} \
--output_path_root ${root}
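To collect the metrics for all dimensions at once, the per-dimension scripts can likewise be looped over. This sketch assumes they all accept the same two arguments shown above.

# Sketch: compute metrics for every dimension with the same model/root arguments
for dim in readability correctness maintainability efficiency; do
    python scripts/get_metric_${dim}.py --model ${model} --output_path_root ${root}
done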
To use vLLM to accelerate inference for DeepSeek-Coder-V2, an additional branch version needs to be installed (https://github.com/zwd003/vllm).
@misc{zheng2024race,
title={Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models},
author={Jiasheng Zheng and Boxi Cao and Zhengzhao Ma and Ruotong Pan and Hongyu Lin and Yaojie Lu and Xianpei Han and Le Sun},
year={2024},
eprint={2407.11470},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2407.11470},
}