🏠[Project Homepage] 📃[Paper Link]
Official repository for the paper “Are Large Language Models Good Statisticians?”.
[Sept 26, 2024] 🎉 Our paper has been accepted by the NeurIPS'24 Datasets and Benchmarks Track!
[May 26, 2024] 🤗 StatQA is released!
Large Language Models (LLMs) have demonstrated impressive capabilities across a range of scientific tasks including mathematics, physics, and chemistry. Despite their successes, the effectiveness of LLMs in handling complex statistical tasks remains systematically under-explored. To bridge this gap, we introduce StatQA, a new benchmark designed for statistical analysis tasks. StatQA comprises 11,623 examples tailored to evaluate LLMs' proficiency in specialized statistical tasks and their applicability assessment capabilities, particularly for hypothesis testing methods. We systematically experiment with representative LLMs using various prompting strategies and show that even state-of-the-art models such as GPT-4o achieve a best performance of only 64.83%, indicating significant room for improvement. Notably, while open-source LLMs (e.g. LLaMA-3) show limited capability, those fine-tuned ones exhibit marked improvements, outperforming all in-context learning-based methods (e.g. GPT-4o). Moreover, our comparative human experiments highlight a striking contrast in error types between LLMs and humans: LLMs primarily make applicability errors, whereas humans mostly make statistical task confusion errors. This divergence highlights distinct areas of proficiency and deficiency, suggesting that combining LLM and human expertise could lead to complementary strengths, inviting further investigation into their collaborative potential.
We recommend creating a conda virtual environment to run our project.
First, create the environment and install the required Python libraries from requirements.txt:
conda create --name newEnv python=3.11
conda activate newEnv
pip install -r requirements.txt
An OpenAI API key is used in this project, so please set your own API key in gpt_config.txt like this:
https://api.openai.com/v1/chat/completions
yourapikey
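For reference, the following is a minimal Python sketch of reading such a two-line config (endpoint on the first line, key on the second); the actual parsing used by the evaluation scripts may differ.
# Minimal sketch (not the repository's actual parsing): read the API
# endpoint (line 1) and the API key (line 2) from gpt_config.txt,
# matching the example above.
with open("gpt_config.txt", "r", encoding="utf-8") as f:
    lines = [line.strip() for line in f if line.strip()]

api_url, api_key = lines[0], lines[1]
print(f"Endpoint: {api_url}, key loaded: {bool(api_key)}")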
8× NVIDIA RTX 4090: for LLaMA model evaluation experiments.
NVIDIA A800 (80G): for fine-tuning.
For our benchmarks StatQA and mini-StatQA, we provide both CSV- and JSON-formatted versions in the StatQA/ directory. If you want to conduct evaluations using them, please go to the Evaluation section.
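For a quick look at the data, you can load either format with pandas. Note that the file names in the sketch below are placeholders; use the actual file names under StatQA/.
# Quick inspection of the benchmark; file names below are placeholders,
# substitute the actual CSV/JSON files under StatQA/.
import pandas as pd

statqa_csv = pd.read_csv("StatQA/StatQA.csv")      # placeholder name
statqa_json = pd.read_json("StatQA/StatQA.json")   # placeholder name
print(statqa_csv.shape, statqa_csv.columns.tolist())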
If you want to reproduce the construction of StatQA and mini-StatQA yourself, we provide the following scripts:
Preprocessing and information extraction:
sh Script/info_extraction.sh
Benchmark construction:
sh Script/benchmark_construction.sh
The constructed StatQA and mini-StatQA benchmarks will be stored in Data/Integrated Dataset/Balanced Benchmark/. Note that this process can take many hours and consume a considerable number of API tokens, so please be patient, or directly use the benchmark we already provide in StatQA/.
We evaluate LLMs' capabilities on mini-StatQA. The responses generated by LLMs will be stored in Model Answer/Origin Answer/.
The first step is to generate the prompts for the different prompting strategies:
sh Script/prompt_organization.sh
To run experiments on the LLaMA-2 and LLaMA-3 models, please configure your own model paths in Evaluation/llama_evaluation.py. You can also set parallel_num depending on your GPUs (a rough sketch of how such a setting might be used follows the path settings below).
# Path settings
if model_type == '2_7b':
    model_path = "/your_path_to/Llama-2-7b-chat-hf"
    parallel_num = 2
elif model_type == '2_13b':
    model_path = "/your_path_to/Llama-2-13b-chat-hf"
    parallel_num = 8
elif model_type == '3_8b_instruct':
    model_path = "/your_path_to/Meta-Llama-3-8B-Instruct"
    parallel_num = 4
elif model_type == '3_8b':
    model_path = "/your_path_to/Meta-Llama-3-8B"
    parallel_num = 4
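As a rough, purely illustrative sketch (not the repository's implementation), one common pattern for such a setting is to split the evaluation data into parallel_num shards, e.g. one per GPU or worker process:
# Illustrative only: split the evaluation examples into `parallel_num`
# shards; the actual logic in Evaluation/llama_evaluation.py may differ.
import math

def make_shards(examples, parallel_num):
    shard_size = math.ceil(len(examples) / parallel_num)
    return [examples[i * shard_size:(i + 1) * shard_size]
            for i in range(parallel_num)]

print([len(s) for s in make_shards(list(range(10)), parallel_num=4)])  # [3, 3, 3, 1]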
Then, you can perform evaluations for LLaMA-2/3 models by:
# --model_type: model to evaluate, e.g. "2_7b" for LLaMA-2-7b
# --trick: prompting strategy, e.g. "zero-shot"
python Evaluation/llama_evaluation.py \
    --model_type "2_7b" \
    --trick "zero-shot"
We also provide the llama_exp.sh script to run evaluations for LLaMA-2-7b/13b and LLaMA-3-8b (with and without Instruct) with different prompting strategies. You can modify it to meet your needs and run it as follows:
sh Script/llama_exp.sh
Please ensure your OpenAI API key is correctly set and note that the evaluation of GPT models will consume your API tokens.
You can perform evaluations for GPT models by:
# --selected_model: model to evaluate, e.g. "gpt-3.5-turbo"
# --trick: prompting strategy, e.g. "zero-shot"
python Evaluation/gpt_evaluation.py \
    --selected_model "gpt-3.5-turbo" \
    --trick "zero-shot"
We also provide the gpt_exp.sh script to run evaluations for ChatGPT (gpt-3.5-turbo), GPT-4, and the recently released GPT-4o with different prompting strategies. You can modify it to meet your needs and run it as follows:
sh Script/gpt_exp.sh
We use a similar procedure but different source tabular data to obtain the training set: Data/Integrated Dataset/Dataset with Prompt/Training Set/D_train for zero-shot.csv. The training data converted into the format required for fine-tuning is stored in Finetuning/LLaMA-Factory/.
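LLaMA-Factory commonly consumes instruction-tuning data in an Alpaca-style JSON format (instruction/input/output fields). The sketch below only illustrates what a converted record might look like; the exact fields and contents of the files in Finetuning/LLaMA-Factory/ may differ.
# Sketch of an Alpaca-style record, the format LLaMA-Factory commonly
# consumes; the actual contents of our converted files may differ.
import json

record = {
    "instruction": "Select the relevant columns and applicable statistical methods for the question below.",  # hypothetical
    "input": "Table column information and a statistical question from D_train.",                             # hypothetical
    "output": '{"columns": [...], "methods": [...]}',                                                          # hypothetical
}
with open("finetune_example.json", "w", encoding="utf-8") as f:
    json.dump([record], f, ensure_ascii=False, indent=2)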
We utilize LLaMA-Factory to fine-tune our models.
Install the LLaMA-Factory environment:
git submodule update --init --recursive
cd LLaMA-Factory
pip install -e .[torch,metrics]
cp -r ../Finetuning/LLaMA-Factory/* data/
Before you start fine-tuning, please make sure that the model path and save path are set correctly in the .yaml files under Finetuning/Config/. For example:
### model
model_name_or_path: your_path/Llama-2-7b-hf
### output
output_dir: your_path/saves/llama2-7b/lora/sft
We use an A800 (80G) to fine-tune LLaMA2-7B, LLaMA3-8B, and LLaMA3-8B-Instruct:
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train ../Finetuning/Config/llama2_7b_lora_sft.yaml
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train ../Finetuning/Config/llama3_8b_lora_sft.yaml
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train ../Finetuning/Config/llama3_8b_instruct_lora_sft.yaml
To generate responses from fine-tuned LLaMA2-7B, LLaMA3-8B, and LLaMA3-8B-Instruct:
CUDA_VISIBLE_DEVICES=0 llamafactory-cli chat ../Finetuning/Config/llama2_7b_lora_sft_inference.yaml
CUDA_VISIBLE_DEVICES=0 llamafactory-cli chat ../Finetuning/Config/llama3_8b_lora_sft_inference.yaml
CUDA_VISIBLE_DEVICES=0 llamafactory-cli chat ../Finetuning/Config/llama3_8b_instruct_lora_sft_inference.yaml
All predicted results on the mini-StatQA test set will be stored at:
LLaMA-Factory/saves/{MODEL}/lora/predict/generated_predictions.jsonl
Note that {MODEL} can be one of "llama2-7b", "llama3-8b", and "llama3-8b-instruct".
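generated_predictions.jsonl is a JSON-lines file; a minimal loading sketch is shown below. The key names ("prompt", "predict", "label") follow recent LLaMA-Factory versions and may vary, so adjust them to your output.
# Minimal sketch: load LLaMA-Factory's generated_predictions.jsonl.
# Key names such as "prompt", "predict", and "label" may vary by version.
import json

path = "LLaMA-Factory/saves/llama3-8b/lora/predict/generated_predictions.jsonl"
with open(path, "r", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]
print(len(records), records[0].get("predict", "")[:100])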
⚠ Attention: You may encounter errors when running these commands directly on RTX 4000-series GPUs such as the RTX 4090 because of hardware limitations. Therefore, we also provide a bash script for users who want to fine-tune models with our dataset on RTX 4000-series GPUs; please refer to Finetuning/LLaMA-Factory/sft_rtx4000.sh.
To analyze the LLMs' answers and evaluate their capabilities, you can run the answer_analysis.sh script, which performs preprocessing, accuracy calculation, and performance and error analysis.
sh Script/answer_analysis.sh
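For intuition only, a heavily simplified exact-match accuracy sketch is given below; the actual answer_analysis pipeline also performs preprocessing, per-task breakdowns, and error-type analysis, and extract_answer here is a hypothetical stand-in for its real parsing logic.
# Heavily simplified sketch: exact-match accuracy over parsed answers.
# `extract_answer` is a hypothetical stand-in for the repository's
# actual answer-parsing logic.
def exact_match_accuracy(predictions, references, extract_answer):
    correct = sum(extract_answer(p) == extract_answer(r)
                  for p, r in zip(predictions, references))
    return correct / len(references) if references else 0.0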
If you find our work useful or inspiring, please kindly cite:
@misc{zhu2024large,
title={Are Large Language Models Good Statisticians?},
author={Yizhang Zhu and Shiyin Du and Boyan Li and Yuyu Luo and Nan Tang},
year={2024},
eprint={2406.07815},
archivePrefix={arXiv},
primaryClass={cs.CL}
}