
[RACE benchmark logo]

🏆 Leaderboard | 🔥 Quick Start | 🐛 Issues | 📜 Citation | 🙏 Acknowledgement


Latest News 🔥

  • [24/10/09] We released the second version of the RACE paper.
  • [24/10/09] We added the evaluation results of 9 LLMs (including o1-mini-2024-09-12) to the RACE leaderboard.
  • [24/10/01] We improved the calculation of the readability-related metrics and made the code post-processing more robust.
  • [24/10/01] We revised the test code in the LeetCode evaluation data to support cases with multiple correct answers.
  • [24/07/24] We added the evaluation results of claude-3.5-sonnet and Qwen2-72B-Instruct to the RACE leaderboard.
  • [24/07/16] We released the RACE benchmark, leaderboard, and paper.

🏎️ About

RACE is a multi-dimensional benchmark for code generation that focuses on Readability, mAintainability, Correctness, and Efficiency. Its goal is to evaluate LLMs' ability to generate code that is not only correct but also meets the requirements of real-world development scenarios. The benchmark covers a variety of real-world requirements across different demand-dependent dimensions, making it more applicable to practical scenarios. To make RACE easy to use, we provide simple evaluation scripts, and all generated code is executed in a virtualized environment to keep the evaluation secure.

[Overview of the RACE evaluation pipeline]

The overall evaluation pipeline is shown above. First, we summarize multiple representative factors for each dimension based on its quality definition. Second, we design reasonable customized requirements for each factor and integrate them into the task descriptions, requiring the model to generate code that is both correct and meets these requirements. Finally, leveraging static analysis and runtime monitoring techniques, we develop evaluation metrics tailored to each factor to achieve accurate and efficient evaluation. Read our paper Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models for further details.

🔥 Quick Start

To start, please run the following to prepare the environment:

pip install -e .

To use the vllm backend, run the following command:

pip install -e .[vllm_gen]
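For reference, a full setup from a fresh clone might look like the following sketch (the clone URL is inferred from the repository name jszheng21/RACE; adjust the paths to your environment):

# Clone the repository and install RACE in editable mode
git clone https://github.com/jszheng21/RACE.git
cd RACE
pip install -e .
# Optionally include the vllm generation extras
pip install -e .[vllm_gen]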

Code Generation

Taking Readability as an example, use the following command to generate code samples from a model; the samples are saved in JSON Lines (jsonl) format. ${root} is the directory for the output files, and ${backend} can be either openai or vllm. When using the openai backend, make sure the environment variables ${API_BASE} and ${API_KEY} are set (see the example after the command list below).

scripts/gen_readability.sh ${model} ${root} ${backend}
⏬ More commands for other dimensions:
# For `Correctness`
scripts/gen_correctness.sh ${model} ${root} ${backend}

# For `Maintainability`
scripts/gen_maintainability.sh ${model} ${root} ${backend}

# For `Efficiency`
scripts/gen_efficiency.sh ${model} ${root} ${backend}
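As a concrete sketch of a generation run with the openai backend (the endpoint, key, model name, and output directory below are placeholders, not values prescribed by RACE):

# Configure the OpenAI-compatible endpoint used by the openai backend
export API_BASE="https://api.openai.com/v1"   # placeholder endpoint
export API_KEY="sk-..."                       # your API key

# Generate readability-oriented samples for a hypothetical model
scripts/gen_readability.sh gpt-4o-mini outputs openai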

Code Post-processing

Use the following command to read the code sample file, extract the valid code from the model's responses, and save it to a file with a parsed suffix.

python scripts/parse_generated_file.py \
    --generated_file_path ${generated_file_path} \
    --model ${model}
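For example, assuming the generation step produced outputs/gpt-4o-mini/readability.jsonl (a hypothetical path; the actual file name depends on the generation scripts), the call would be:

# Hypothetical paths, for illustration only
python scripts/parse_generated_file.py \
    --generated_file_path outputs/gpt-4o-mini/readability.jsonl \
    --model gpt-4o-mini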

Code Evaluation

The evaluation of the generated code is divided into two parts: 1) assessing the correctness of the generated code, and 2) computing various non-execution-based metrics.

First, build the Docker image that serves as the environment for evaluating the code by execution.

docker build --rm -f "./Dockerfile" -t race:latest "."
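Optionally, confirm that the image was built before running the evaluations (a standard Docker check, not part of the RACE scripts):

# The image should be listed with the latest tag
docker images race:latest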

Then, using code readability as an example, test the correctness of the LLM-generated code against the test cases. Note that the correctness evaluation of the generated code is distributed across the readability, maintainability, and efficiency dimensions. (A concrete example with placeholder arguments follows the command list below.)

scripts/eval_c_readability.sh ${model} ${root}
⏬ More commands for other dimensions:
# For `Maintainability`
scripts/eval_c_maintainability.sh ${model} ${root}

# For `Efficiency`
scripts/eval_c_efficiency.sh ${model} ${root}

# For only `Correctness`
scripts/eval_c_correctness.sh ${model} ${root}
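For example, with the placeholder model and output directory used earlier:

# Run the correctness checks for the readability-split data
scripts/eval_c_readability.sh gpt-4o-mini outputs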

Here are further details on how to evaluate the correctness of LLM-generated code under a single factor.

# For `Readability`
docker run -v $(pwd):/data race:latest race.codeeval.evaluate_pipeline_evalplus \
    --dataset [humaneval|mbpp] \
    --samples "/data/outputs/${parsed_generated_file}"

# For `Maintainability (MI Metric)`
docker run -v $(pwd):/data race:latest race.codeeval.evaluate_pipeline_classeval test_pipeline \
    --model_name ${model} \
    --generated_data_path "/data/outputs/${generated_file}" \
    --root "/data/outputs"

# For `Maintainability (Modularity)`
docker run -v $(pwd):/data race:latest race.codeeval.evaluate_pipeline_leetcode_style test_pipeline_simple \
    --model_name ${model} \
    --evaluation_test_case_path "/data/data/leetcode/evaluation_tests.jsonl" \
    --generated_data_path "/data/outputs/${parsed_generated_file}" \
    --result_path "/data/outputs/${results_file}" \
    --temp_path "/data/outputs"

# For `Efficiency`
docker run -v $(pwd):/data race:latest race.codeeval.evaluate_pipeline_leetcode_style test_pipeline_complexity \
    --model_name ${model} \
    --evaluation_test_case_path "/data/data/leetcode_efficiency/complexity_evaluation_test_cases.jsonl" \
    --evaluation_efficiency_data_path "/data/data/leetcode_efficiency/complexity_evaluation_data.jsonl" \
    --generated_data_path "/data/outputs/${parsed_generated_file}" \
    --result_path "/data/outputs/${results_file}" \
    --temp_path "/data/outputs"
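For instance, with the hypothetical file names used earlier, the readability correctness check could be invoked as follows (paths are illustrative only):

# Hypothetical substitution of the placeholders above
docker run -v $(pwd):/data race:latest race.codeeval.evaluate_pipeline_evalplus \
    --dataset humaneval \
    --samples "/data/outputs/gpt-4o-mini/readability_parsed.jsonl"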

Finally, compute the evaluation results for the corresponding metrics. Taking Readability as an example:

python scripts/get_metric_readability.py \
    --model ${model} \
    --output_path_root ${root}
⏬ More commands for other dimensions:
# For `Correctness`
python scripts/get_metric_correctness.py \
    --model ${model} \
    --output_path_root ${root}

# For `Maintainability`
python scripts/get_metric_maintainability.py \
    --model ${model} \
    --output_path_root ${root}

# For `Efficiency`
python scripts/get_metric_efficiency.py \
    --model ${model} \
    --output_path_root ${root}
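Putting the steps together, an end-to-end run for the Readability dimension looks roughly like the following (the model name, output directory, and file names are placeholders; the exact file names depend on the scripts):

# 1. Generate samples
scripts/gen_readability.sh gpt-4o-mini outputs openai

# 2. Extract valid code from the raw generations (hypothetical path)
python scripts/parse_generated_file.py \
    --generated_file_path outputs/gpt-4o-mini/readability.jsonl \
    --model gpt-4o-mini

# 3. Check correctness inside the Docker sandbox
scripts/eval_c_readability.sh gpt-4o-mini outputs

# 4. Aggregate the readability metrics
python scripts/get_metric_readability.py \
    --model gpt-4o-mini \
    --output_path_root outputs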

🐛 Issues

To use vllm to accelerate inference for DeepSeek-Coder-V2, a dedicated branch of vllm needs to be installed (https://github.com/zwd003/vllm).
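A minimal sketch of installing that branch from source (not part of the RACE scripts; a working CUDA build environment for vllm is assumed):

# Install vllm from the dedicated fork (builds from source)
pip install git+https://github.com/zwd003/vllm.git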

📜 Citation

@misc{zheng2024race,
      title={Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models}, 
      author={Jiasheng Zheng and Boxi Cao and Zhengzhao Ma and Ruotong Pan and Hongyu Lin and Yaojie Lu and Xianpei Han and Le Sun},
      year={2024},
      eprint={2407.11470},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2407.11470}, 
}

🙏 Acknowledgement
