🏆Leaderboard • 🔥Quick Start • 🐛Issues • 📜Citation • 🙏Acknowledgement
Latest News 🔥
- [24/10/09] We release the second version of the RACE paper.
- [24/10/09] We add the evaluation results of 9 LLMs (including `o1-mini-2024-09-12`) to the RACE leaderboard.
- [24/10/01] We have improved the calculation methods for readability-related metrics and enhanced the robustness of the code post-processing techniques.
- [24/10/01] We have revised the test code in the LeetCode evaluation data to support cases with multiple correct answers.
- [24/07/24] We add the evaluation results of `claude-3.5-sonnet` and `Qwen2-72B-Instruct` to the RACE leaderboard.
- [24/07/16] We release our RACE benchmark, leaderboard, and paper.
RACE is a multi-dimensional benchmark for code generation that focuses on Readability, mAintainability, Correctness, and Efficiency. Its goal is to evaluate LLMs' ability to generate code that is correct and meets the requirements of real-world development scenarios. The benchmark is designed with various real-world demands across different demand-dependent dimensions, making it more applicable to practical scenarios. To facilitate evaluation on RACE, we provide easy-to-use evaluation scripts, and evaluating in a virtualized environment ensures that the generated code is executed securely.
The overall evaluation pipeline is shown above. First, we summarize multiple representative factors for each dimension based on their respective quality definitions. Second, we design several reasonable customized requirements for each factor and integrate them into the task descriptions, requiring the model to generate code that is both correct and meets these requirements. Finally, leveraging static analysis and runtime monitoring techniques, we develop evaluation metrics tailored to each factor to achieve accurate and efficient evaluation. Read our paper Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models for further information.
To start, please run the following to prepare the environment:
pip install -e .
To use vLLM as the generation backend, please run the following command instead:
pip install -e .[vllm_gen]
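After installation, you can optionally sanity-check the environment. This is a minimal check that assumes the project installs a package importable as `race`, matching the module path used by the Docker evaluation commands later in this guide:

# Optional sanity check (assumes the editable install exposes a `race` package)
python -c "import race; print(race.__name__)"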
Taking `Readability` as an example, use the following command to generate code samples from a model; the samples are saved in JSON Lines (jsonl) format. `root` refers to the directory of the output files, and `backend` supports `openai` and `vllm`. To use the `openai` backend, please ensure that the environment variables `${API_BASE}` and `${API_KEY}` are configured.
scripts/gen_readability.sh ${model} ${root} ${backend}
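For instance, a concrete invocation might look like the following. The model name, output directory, and API endpoint here are placeholder values for illustration; substitute your own.

# Hypothetical example values; replace with your own model, output directory, and credentials
export API_BASE="https://api.openai.com/v1"   # assumed OpenAI-compatible endpoint
export API_KEY="sk-..."                       # your API key
scripts/gen_readability.sh gpt-4o-mini outputs openai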
⏬ More commands for other dimensions:
# For `Correctness`
scripts/gen_correctness.sh ${model} ${root} ${backend}
# For `Maintainability`
scripts/gen_maintainability.sh ${model} ${root} ${backend}
# For `Efficiency`
scripts/gen_efficiency.sh ${model} ${root} ${backend}
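To generate samples for all four dimensions in one go, a simple loop over the per-dimension scripts works. This is a sketch that assumes each script takes the same three positional arguments shown above.

# Sketch: run every generation script with the same model/root/backend arguments
for dim in readability correctness maintainability efficiency; do
    scripts/gen_${dim}.sh ${model} ${root} ${backend}
done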
Use the following command to read the code sample file, extract valid code from the model's generated responses, and save it to a file with a `parsed` suffix.
python scripts/parse_generated_file.py \
    --generated_file_path ${generated_file_path} \
    --model ${model}
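For example, if the generation step above wrote its samples to a file such as outputs/readability_samples.jsonl (a hypothetical filename used here only for illustration), the call would look like:

# Hypothetical paths for illustration; point these at your actual generation output
python scripts/parse_generated_file.py \
    --generated_file_path outputs/readability_samples.jsonl \
    --model gpt-4o-mini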
The evaluation of the generated code is divided into two parts: 1) assessing the correctness of the generated code, and 2) computing various non-execution-based metrics.
First, build the Docker image, which serves as the environment for evaluating the code by execution.
docker build --rm -f "./Dockerfile" -t race:latest "."
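Once the build finishes, you can confirm the image is available before running any evaluations, using a standard Docker command:

# List the freshly built image; the tag should match the one used above
docker images race:latest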
Then, taking code readability as an example, test the correctness of the LLM-generated code based on the test cases. In this context, the evaluation of the generated code under the `correctness` dimension is distributed across the `readability`, `maintainability`, and `efficiency` dimensions.
scripts/eval_c_readability.sh ${model} ${root}
⏬ More commands for other dimensions:
# For `Maintainability`
scripts/eval_c_maintainability.sh ${model} ${root}
# For `Efficiency`
scripts/eval_c_efficiency.sh ${model} ${root}
# For only `Correctness`
scripts/eval_c_correctness.sh ${model} ${root}
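As with generation, the correctness evaluations can be scripted in a loop. This is a sketch that assumes each script takes the same model and root arguments shown above.

# Sketch: run the execution-based evaluation for every dimension
for dim in readability maintainability efficiency correctness; do
    scripts/eval_c_${dim}.sh ${model} ${root}
done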
Here are further details on how to evaluate the correctness of LLM-generated code under a single factor.
# For `Readability`
docker run -v $(pwd):/data race:latest race.codeeval.evaluate_pipeline_evalplus \
    --dataset [humaneval|mbpp] \
    --samples "/data/outputs/${parsed_generated_file}"
# For `Maintainability (MI Metric)`
docker run -v $(pwd):/data race:latest race.codeeval.evaluate_pipeline_classeval test_pipeline \
    --model_name ${model} \
    --generated_data_path "/data/outputs/${generated_file}" \
    --root "/data/outputs"
# For `Maintainability (Modularity)`
docker run -v $(pwd):/data race:latest race.codeeval.evaluate_pipeline_leetcode_style test_pipeline_simple \
    --model_name ${model} \
    --evaluation_test_case_path "/data/data/leetcode/evaluation_tests.jsonl" \
    --generated_data_path "/data/outputs/${parsed_generated_file}" \
    --result_path "/data/outputs/${results_file}" \
    --temp_path "/data/outputs"
# For `Efficiency`
docker run -v $(pwd):/data race:latest race.codeeval.evaluate_pipeline_leetcode_style test_pipeline_complexity \
    --model_name ${model} \
    --evaluation_test_case_path "/data/data/leetcode_efficiency/complexity_evaluation_test_cases.jsonl" \
    --evaluation_efficiency_data_path "/data/data/leetcode_efficiency/complexity_evaluation_data.jsonl" \
    --generated_data_path "/data/outputs/${parsed_generated_file}" \
    --result_path "/data/outputs/${results_file}" \
    --temp_path "/data/outputs"
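Putting it together for the `Readability` factor, a concrete run might look like this. The parsed filename below is a hypothetical example; use the `parsed` file produced by the parsing step, and pick the dataset (`humaneval` or `mbpp`) that matches your generation setup.

# Hypothetical parsed filename for illustration
docker run -v $(pwd):/data race:latest race.codeeval.evaluate_pipeline_evalplus \
    --dataset humaneval \
    --samples "/data/outputs/readability_samples_parsed.jsonl"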
Finally, get the evaluation results based on the specific metrics. Take `Readability` as an example:
python scripts/get_metric_readability.py \
--model ${model} \
--output_path_root ${root}
⏬ More commands for other dimensions:
# For `Correctness`
python scripts/get_metric_correctness.py \
--model ${model} \
--output_path_root ${root}
# For `Maintainability`
python scripts/get_metric_maintainability.py \
--model ${model} \
--output_path_root ${root}
# For `Efficiency`
python scripts/get_metric_efficiency.py \
--model ${model} \
--output_path_root ${root}
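To collect the metrics for all dimensions at once, the per-dimension scripts can likewise be looped over. This sketch assumes they all accept the same two arguments shown above.

# Sketch: compute metrics for every dimension with the same model/root arguments
for dim in readability correctness maintainability efficiency; do
    python scripts/get_metric_${dim}.py --model ${model} --output_path_root ${root}
done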
To use vLLM to accelerate inference for DeepSeek-Coder-V2, an additional branch version needs to be installed (https://github.com/zwd003/vllm).
@misc{zheng2024race,
title={Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models},
author={Jiasheng Zheng and Boxi Cao and Zhengzhao Ma and Ruotong Pan and Hongyu Lin and Yaojie Lu and Xianpei Han and Le Sun},
year={2024},
eprint={2407.11470},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2407.11470},
}