Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code"
🏠 Home Page • 💻 Data • 🏆 Leaderboard
LiveCodeBench is a comprehensive and contamination-free evaluation of LLMs for code, which continuously collects new problems over time from contests across three competition platforms, namely LeetCode, AtCoder, and CodeForces.
Distinctly, LiveCodeBench also focuses on a broader range of code-related capabilities beyond code generation, such as self-repair, code execution, and test output prediction.
Currently, LiveCodeBench hosts four hundred high-quality coding problems that were published between May 2023 and February 2024.
You can install the dependencies using pip:
pip install -U pebble datasets pyext # primary dependencies
pip install -U vllm google-generativeai anthropic openai mistralai # optional dependencies depending on models
Alternatively, you can use the requirements.txt file:
pip install -r requirements.txt
We provide benchmarks for different code capability scenarios.
We use vllm for inference with local models. By default, we set tensor_parallel_size=${num_gpus} to parallelize inference across all available GPUs; this can be configured using the --tensor_parallel_size flag as required.
To run inference, use the following command:
python -m lcb_runner.runner.main --model {model_name} --scenario codegeneration
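For example, a local model can be restricted to two GPUs as follows (the model name remains a placeholder, and the degree of parallelism is only illustrative):
python -m lcb_runner.runner.main --model {model_name} --scenario codegeneration --tensor_parallel_size 2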
Additionally, the --use_cache flag caches the generated outputs, and the --continue_existing flag reuses existing dumped results. The --multiprocess flag parallelizes queries to API servers (adjust it according to rate limits).
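For instance, a cached, parallelized run against an API-served model might look like the following (this assumes --multiprocess accepts a worker count; the value 10 is only illustrative):
python -m lcb_runner.runner.main --model {model_name} --scenario codegeneration --use_cache --continue_existing --multiprocess 10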
We compute pass@1 and pass@5 metrics for model evaluations. We use a modified version of the checker released with the APPS benchmark: we identified and fixed some unhandled edge cases in the original checker and simplified it based on our collected dataset. To run the evaluation, use the following command:
python -m lcb_runner.runner.main --model {model_name} --scenario codegeneration --evaluate
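For reference, pass@k is typically computed with the standard unbiased estimator popularized by the HumanEval/Codex evaluation; the sketch below illustrates that formula and may differ in details from the exact implementation in this repository.
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    # n: total generations per problem, c: generations passing all tests, k: k in pass@k
    if n - c < k:
        return 1.0  # every size-k sample necessarily contains a correct solution
    # 1 - C(n - c, k) / C(n, k), computed as a running product for numerical stability
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# e.g., 10 generations per problem with 3 passing ones
print(pass_at_k(10, 3, 1), pass_at_k(10, 3, 5))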
Note that time limits can cause slight variation (< 0.3 points) in the computed pass@1 and pass@5 metrics.
If you observe a significant variation in performance, lower the --num_process_evaluate flag or increase the --timeout flag, as shown in the example below. Please report particular issues caused by improper timeouts here.
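For example, if evaluations appear to be timing out, you might reduce evaluation parallelism and raise the per-run time limit (both values below are illustrative only):
python -m lcb_runner.runner.main --model {model_name} --scenario codegeneration --evaluate --num_process_evaluate 4 --timeout 30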
Finally, to get scores over different time windows, you can use the ./lcb_runner/evaluation/compute_scores.py script.
In particular, you can provide the --start_date and --end_date flags (in YYYY-MM-DD format) to get scores over the specified time window. In our paper, to counter contamination in the DeepSeek models, we only report results on problems released after August 2023. You can replicate those evaluations using:
python -m lcb_runner.evaluation.compute_scores --eval_all_file {saved_eval_all_file} --start_date 2023-09-01
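Both flags can also be combined to score a bounded window, for example (dates are illustrative):
python -m lcb_runner.evaluation.compute_scores --eval_all_file {saved_eval_all_file} --start_date 2023-09-01 --end_date 2024-02-01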
Finally, our Leaderboard uses ExecEval, a more comprehensive, Docker-based evaluation harness for the generated programs. We will release an alternative evaluation script soon.
To run the test output prediction scenario, you can simply run:
python -m lcb_runner.runner.main --model {model_name} --scenario testoutputprediction --evaluate
To add support for new models, we have implemented an extensible framework for adding new models and customizing prompts appropriately.
Step 1: Add a new model to the ./lcb_runner/lm_styles.py file. Specifically, extend the LMStyle class to add a new model family and add the model to the LanguageModelList array.
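As a rough, self-contained sketch of what such an entry might look like (the stand-in LMStyle and LanguageModel definitions below are assumptions about the shapes in ./lcb_runner/lm_styles.py; check the actual file for the real fields and adapt accordingly):
from dataclasses import dataclass
from datetime import datetime
from enum import Enum

class LMStyle(Enum):                       # hypothetical stand-in for the real class
    DeepSeekCodeInstruct = "DeepSeekCodeInstruct"
    MyNewModelFamily = "MyNewModelFamily"  # new model family added in Step 1

@dataclass
class LanguageModel:                       # assumed shape of a model entry
    model_name: str                        # e.g. a Hugging Face identifier or API model name
    model_repr: str                        # display name used when saving results
    model_style: LMStyle                   # the family added above
    release_date: datetime                 # used for time-window / contamination filtering

LanguageModelList = [
    LanguageModel("my-org/my-new-model", "My-New-Model",
                  LMStyle.MyNewModelFamily, datetime(2024, 3, 1)),
]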
Step 2: Since we use instruction-tuned models, we allow configuring the instruction for each model. Modify the ./lcb_runner/prompts/generation.py file to add a new prompt for the model in the format_prompt_generation function.
For example, the prompt for the DeepSeekCodeInstruct family of models looks as follows:
if LanguageModelStyle == LMStyle.DeepSeekCodeInstruct:
    prompt = f"{PromptConstants.SYSTEM_MESSAGE_DEEPSEEK}\n\n"
    prompt += f"{get_deepseekcode_question_template_answer(question)}"
    return prompt
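Following that pattern, a branch for a hypothetical new model family could look like the sketch below (the family name, system message, and question-template helper are all placeholders you would define yourself):
if LanguageModelStyle == LMStyle.MyNewModelFamily:  # hypothetical family from Step 1
    # your own system message and question-formatting helper, not existing constants
    prompt = "You are an expert Python programmer. Solve the problem below.\n\n"
    prompt += f"{get_mynewmodel_question_template_answer(question)}"
    return prompt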
LiveCodeBench can be used to evaluate LLM performance over different time windows (using problem release dates to filter the evaluation set). This allows us to detect and mitigate potential contamination and to evaluate LLMs on new problems.
Next, we evaluate models on different code capabilities and find that the relative performance of models changes across tasks (left), highlighting the need for holistic evaluation of LLMs for code.
We also find evidence of possible overfitting on HumanEval (right): models that perform well on HumanEval do not necessarily perform well on LiveCodeBench. In the scatterplot above, the models cluster into two groups, shaded in red and green. The red group contains models that perform well on HumanEval but poorly on LiveCodeBench, while the green group contains models that perform well on both.
For more details, please refer to our paper.
@article{jain2024livecodebench,
author = {Naman Jain and King Han and Alex Gu and Wen-Ding Li and Fanjia Yan and Tianjun Zhang and Sida Wang and Armando Solar-Lezama and Koushik Sen and Ion Stoica},
title = {LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code},
year = {2024},
journal = {arXiv preprint},
}