🏠[Project Homepage] 📃[Paper Link]
Official repository for the paper “Are Large Language Models Good Statisticians?”.
[Sept 26, 2024] 🎉 Our paper has been accepted by the NeurIPS'24 Datasets and Benchmarks Track!
[May 26, 2024] 🤗 StatQA is released!
Large Language Models (LLMs) have demonstrated impressive capabilities across a range of scientific tasks including mathematics, physics, and chemistry. Despite their successes, the effectiveness of LLMs in handling complex statistical tasks remains systematically under-explored. To bridge this gap, we introduce StatQA, a new benchmark designed for statistical analysis tasks. StatQA comprises 11,623 examples tailored to evaluate LLMs' proficiency in specialized statistical tasks and their applicability assessment capabilities, particularly for hypothesis testing methods. We systematically experiment with representative LLMs using various prompting strategies and show that even state-of-the-art models such as GPT-4o achieve a best performance of only 64.83%, indicating significant room for improvement. Notably, while open-source LLMs (e.g. LLaMA-3) show limited capability, those fine-tuned ones exhibit marked improvements, outperforming all in-context learning-based methods (e.g. GPT-4o). Moreover, our comparative human experiments highlight a striking contrast in error types between LLMs and humans: LLMs primarily make applicability errors, whereas humans mostly make statistical task confusion errors. This divergence highlights distinct areas of proficiency and deficiency, suggesting that combining LLM and human expertise could lead to complementary strengths, inviting further investigation into their collaborative potential.
We recommend creating a conda virtual environment to run our project.
First, create the environment and install the required Python libraries from requirements.txt:
conda create --name newEnv python=3.11
conda activate newEnv
pip install -r requirements.txt
An OpenAI API key is used in this project, so please set your own API key in gpt_config.txt like this:
https://api.openai.com/v1/chat/completions
yourapikey
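For reference, the following is a minimal Python sketch of reading such a two-line config (endpoint on the first line, key on the second); the actual parsing used by the evaluation scripts may differ.
# Minimal sketch (not the repository's actual parsing): read the API
# endpoint (line 1) and the API key (line 2) from gpt_config.txt,
# matching the example above.
with open("gpt_config.txt", "r", encoding="utf-8") as f:
    lines = [line.strip() for line in f if line.strip()]

api_url, api_key = lines[0], lines[1]
print(f"Endpoint: {api_url}, key loaded: {bool(api_key)}")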
8× NVIDIA RTX 4090: for LLaMA model evaluation experiments.
NVIDIA A800 (80G): for fine-tuning.
For our benchmarks StatQA and mini-StatQA, we provide both CSV- and JSON-formatted versions in the StatQA/ directory. If you want to conduct evaluations using them, please go to the Evaluation section.
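For a quick look at the data, you can load either format with pandas. Note that the file names in the sketch below are placeholders; use the actual file names under StatQA/.
# Quick inspection of the benchmark; file names below are placeholders,
# substitute the actual CSV/JSON files under StatQA/.
import pandas as pd

statqa_csv = pd.read_csv("StatQA/StatQA.csv")      # placeholder name
statqa_json = pd.read_json("StatQA/StatQA.json")   # placeholder name
print(statqa_csv.shape, statqa_csv.columns.tolist())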
If you want to reproduce the construction of StatQA and mini-StatQA yourself, we provide the following scripts:
Preprocessing and information extraction:
sh Script/info_extraction.sh
Benchmark construction:
sh Script/benchmark_construction.sh
The constructed StatQA and mini-StatQA benchmarks will be stored in Data/Integrated Dataset/Balanced Benchmark/. Note that this process can take many hours and consume a considerable number of API tokens, so please be patient, or directly use the benchmark we already provide in StatQA/.
We evaluate LLMs' capabilities on mini-StatQA. The responses generated by LLMs will be stored in Model Answer/Origin Answer/.
The first step is to generate the prompts for the different prompting strategies:
sh Script/prompt_organization.sh
To run experiments on the LLaMA-2 and LLaMA-3 models, please configure your own model paths in Evaluation/llama_evaluation.py. You can also set parallel_num depending on your GPUs (a rough sketch of how such a setting might be used follows the path settings below).
# Path settings
if model_type == '2_7b':
    model_path = "/your_path_to/Llama-2-7b-chat-hf"
    parallel_num = 2
elif model_type == '2_13b':
    model_path = "/your_path_to/Llama-2-13b-chat-hf"
    parallel_num = 8
elif model_type == '3_8b_instruct':
    model_path = "/your_path_to/Meta-Llama-3-8B-Instruct"
    parallel_num = 4
elif model_type == '3_8b':
    model_path = "/your_path_to/Meta-Llama-3-8B"
    parallel_num = 4
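As a rough, purely illustrative sketch (not the repository's implementation), one common pattern for such a setting is to split the evaluation data into parallel_num shards, e.g. one per GPU or worker process:
# Illustrative only: split the evaluation examples into `parallel_num`
# shards; the actual logic in Evaluation/llama_evaluation.py may differ.
import math

def make_shards(examples, parallel_num):
    shard_size = math.ceil(len(examples) / parallel_num)
    return [examples[i * shard_size:(i + 1) * shard_size]
            for i in range(parallel_num)]

print([len(s) for s in make_shards(list(range(10)), parallel_num=4)])  # [3, 3, 3, 1]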
Then, you can perform evaluations for LLaMA-2/3 models by:
# --model_type: model to evaluate, e.g. "2_7b" for LLaMA-2-7b
# --trick: prompting strategy, e.g. "zero-shot"
python Evaluation/llama_evaluation.py \
    --model_type "2_7b" \
    --trick "zero-shot"
We also provide the llama_exp.sh script to run evaluations for LLaMA-2-7b/13b and LLaMA-3-8b (with and without Instruct) with different prompting strategies. You can modify it to meet your needs and run it as follows:
sh Script/llama_exp.sh
Please ensure your OpenAI API key is correctly set and note that the evaluation of GPT models will consume your API tokens.
You can perform evaluations for GPT models by:
# --selected_model: model to evaluate, e.g. "gpt-3.5-turbo"
# --trick: prompting strategy, e.g. "zero-shot"
python Evaluation/gpt_evaluation.py \
    --selected_model "gpt-3.5-turbo" \
    --trick "zero-shot"
We also provide the gpt_exp.sh script to run evaluations for ChatGPT (gpt-3.5-turbo), GPT-4, and the recently released GPT-4o with different prompting strategies. You can modify it to meet your needs and run it as follows:
sh Script/gpt_exp.sh
We use a similar procedure but different source tabular data to obtain the training set: Data/Integrated Dataset/Dataset with Prompt/Training Set/D_train for zero-shot.csv. The training data converted into the format required for fine-tuning is stored in Finetuning/LLaMA-Factory/.
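LLaMA-Factory commonly consumes instruction-tuning data in an Alpaca-style JSON format (instruction/input/output fields). The sketch below only illustrates what a converted record might look like; the exact fields and contents of the files in Finetuning/LLaMA-Factory/ may differ.
# Sketch of an Alpaca-style record, the format LLaMA-Factory commonly
# consumes; the actual contents of our converted files may differ.
import json

record = {
    "instruction": "Select the relevant columns and applicable statistical methods for the question below.",  # hypothetical
    "input": "Table column information and a statistical question from D_train.",                             # hypothetical
    "output": '{"columns": [...], "methods": [...]}',                                                          # hypothetical
}
with open("finetune_example.json", "w", encoding="utf-8") as f:
    json.dump([record], f, ensure_ascii=False, indent=2)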
We utilize LLaMA-Factory to fine-tune our models.
Install the LLaMA-Factory environment:
git submodule update --init --recursive
cd LLaMA-Factory
pip install -e .[torch,metrics]
cp -r ../Finetuning/LLaMA-Factory/* data/
Before you start fine-tuning, please make sure that the model path and save path are set correctly in the .yaml files under Finetuning/Config/. For example:
### model
model_name_or_path: your_path/Llama-2-7b-hf
### output
output_dir: your_path/saves/llama2-7b/lora/sft
We use an A800 (80G) to fine-tune LLaMA2-7B, LLaMA3-8B, and LLaMA3-8B-Instruct:
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train ../Finetuning/Config/llama2_7b_lora_sft.yaml
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train ../Finetuning/Config/llama3_8b_lora_sft.yaml
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train ../Finetuning/Config/llama3_8b_instruct_lora_sft.yaml
To generate responses from fine-tuned LLaMA2-7B, LLaMA3-8B, and LLaMA3-8B-Instruct:
CUDA_VISIBLE_DEVICES=0 llamafactory-cli chat ../Finetuning/Config/llama2_7b_lora_sft_inference.yaml
CUDA_VISIBLE_DEVICES=0 llamafactory-cli chat ../Finetuning/Config/llama3_8b_lora_sft_inference.yaml
CUDA_VISIBLE_DEVICES=0 llamafactory-cli chat ../Finetuning/Config/llama3_8b_instruct_lora_sft_inference.yaml
All predicted results on the mini-StatQA test set will be stored at:
LLaMA-Factory/saves/{MODEL}/lora/predict/generated_predictions.jsonl
Note that {MODEL} can be one of "llama2-7b", "llama3-8b", and "llama3-8b-instruct".
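generated_predictions.jsonl is a JSON-lines file; a minimal loading sketch is shown below. The key names ("prompt", "predict", "label") follow recent LLaMA-Factory versions and may vary, so adjust them to your output.
# Minimal sketch: load LLaMA-Factory's generated_predictions.jsonl.
# Key names such as "prompt", "predict", and "label" may vary by version.
import json

path = "LLaMA-Factory/saves/llama3-8b/lora/predict/generated_predictions.jsonl"
with open(path, "r", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]
print(len(records), records[0].get("predict", "")[:100])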
⚠ Attention: You may encounter errors when running these commands directly on RTX 4000-series GPUs such as the RTX 4090 because of hardware limitations. Therefore, we also provide a bash script for users who want to fine-tune models with our dataset on RTX 4000-series GPUs; please refer to Finetuning/LLaMA-Factory/sft_rtx4000.sh.
To analyze the LLMs' answers and evaluate their capabilities, you can run the answer_analysis.sh script, which performs preprocessing, accuracy calculation, and performance and error analysis.
sh Script/answer_analysis.sh
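For intuition only, a heavily simplified exact-match accuracy sketch is given below; the actual answer_analysis pipeline also performs preprocessing, per-task breakdowns, and error-type analysis, and extract_answer here is a hypothetical stand-in for its real parsing logic.
# Heavily simplified sketch: exact-match accuracy over parsed answers.
# `extract_answer` is a hypothetical stand-in for the repository's
# actual answer-parsing logic.
def exact_match_accuracy(predictions, references, extract_answer):
    correct = sum(extract_answer(p) == extract_answer(r)
                  for p, r in zip(predictions, references))
    return correct / len(references) if references else 0.0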
If you find our work useful or inspiring, please kindly cite:
@misc{zhu2024large,
title={Are Large Language Models Good Statisticians?},
author={Yizhang Zhu and Shiyin Du and Boyan Li and Yuyu Luo and Nan Tang},
year={2024},
eprint={2406.07815},
archivePrefix={arXiv},
primaryClass={cs.CL}
}