# Consolidated C-Eval Benchmark Guide for Single-GPU and Multi-GPU Environments (#12618)

Merged (2 commits) on Dec 26, 2024.
Changed file: `python/llm/dev/benchmark/ceval/README.md` (127 additions, 11 deletions)
## C-Eval Benchmark Test Guide

This guide provides instructions for running the C-Eval benchmark test in both single-GPU and multi-GPU environments. [C-Eval](https://cevalbenchmark.com) is a comprehensive multi-level, multi-discipline Chinese evaluation suite for foundation models. It consists of 13,948 multiple-choice questions spanning 52 diverse disciplines and four difficulty levels. For more details, see the [C-Eval paper](https://arxiv.org/abs/2305.08322) and [GitHub repository](https://github.com/hkust-nlp/ceval).

---

### Single-GPU Environment

#### 1. Download Dataset

Download and unzip the dataset for evaluation:
```bash
wget https://huggingface.co/datasets/ceval/ceval-exam/resolve/main/ceval-exam.zip
mkdir data
mv ceval-exam.zip data
cd data; unzip ceval-exam.zip
```
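
If you want to confirm the unzip succeeded, you can list the extracted contents. A minimal check, assuming the archive follows the standard C-Eval layout of `dev`, `val`, and `test` splits with one CSV file per subject:
```bash
# Still inside the data/ directory from the previous step
ls              # expect dev/, val/, and test/ directories
ls val | head   # expect per-subject CSV files, e.g. computer_network_val.csv
```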

#### 2. Run Evaluation

Use the following command to run the evaluation:
```bash
bash run.sh
```

Contents of `run.sh`:
```bash
export IPEX_LLM_LAST_LM_HEAD=0
python eval.py \
    --model_path "path to model" \
    # ... remaining arguments not shown ...
```

> **Note**
>
> - `eval_type`: There are two types of evaluations:
> - `validation`: Runs on the validation dataset and outputs evaluation scores.
> - `test`: Runs on the test dataset and outputs a `submission.json` file for submission to [C-Eval](https://cevalbenchmark.com) to get evaluation scores.
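
For concreteness, a test-split run would look something like the sketch below. The full argument list of `eval.py` is collapsed in the diff above, so the exact flag name here is an assumption based on the note:
```bash
# Hypothetical sketch: run on the test split and produce submission.json
# for upload to https://cevalbenchmark.com (flag name assumed, not confirmed)
python eval.py \
    --model_path "path to model" \
    --eval_type test
```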

---

### Multi-GPU Environment

#### 1. Prepare Environment

1. **Set Docker Image and Container Name**:
```bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest
export CONTAINER_NAME=ceval-benchmark
```

2. **Start Docker Container**:
```bash
docker run -td \
--privileged \
--net=host \
--device=/dev/dri \
--name=$CONTAINER_NAME \
-v /home/intel/LLM:/llm/models/ \
-e no_proxy=localhost,127.0.0.1 \
-e http_proxy=$HTTP_PROXY \
-e https_proxy=$HTTPS_PROXY \
--shm-size="16g" \
$DOCKER_IMAGE
```

3. **Enter the Container**:
```bash
docker exec -it $CONTAINER_NAME bash
```
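
Before configuring the harness, it can be worth verifying that the container actually sees the GPUs. A minimal sketch, assuming the image ships Intel's oneAPI `sycl-ls` utility:
```bash
# Inside the container: list SYCL devices. Each Intel GPU should appear as a
# Level Zero device; if none do, re-check the --device=/dev/dri mapping above.
sycl-ls
```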

#### 2. Configure `lm-evaluation-harness`

1. **Clone the Repository**:
```bash
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
```

2. **Update Multi-GPU Support File**:
Update `lm_eval/models/vllm_causallms.py` with the multi-GPU changes from the following compare view (one way to apply them is sketched after this list):
[Update Multi-GPU Support File](https://github.com/EleutherAI/lm-evaluation-harness/compare/main...liu-shaojun:lm-evaluation-harness:multi-arc?expand=1)

3. **Install Dependencies**:
```bash
pip install -e .
```
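
As referenced in step 2, one possible way to apply the modified file is to fetch it directly from the fork and branch named in the compare URL (`liu-shaojun/lm-evaluation-harness`, branch `multi-arc`); this is a sketch, not the only workflow:
```bash
# Run inside the lm-evaluation-harness checkout: fetch the fork's branch
# and check out only the modified file from it
git remote add multi-arc https://github.com/liu-shaojun/lm-evaluation-harness
git fetch multi-arc multi-arc
git checkout FETCH_HEAD -- lm_eval/models/vllm_causallms.py
```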

#### 3. Configure Environment Variables

Set environment variables required for multi-GPU execution:
```bash
export CCL_WORKER_COUNT=2
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
export CCL_SAME_STREAM=1
export CCL_BLOCKING_WAIT=0

export SYCL_CACHE_PERSISTENT=1
export FI_PROVIDER=shm
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export TORCH_LLM_ALLREDUCE=0
```

Load Intel OneCCL environment variables:
```bash
source /opt/intel/1ccl-wks/setvars.sh
```

#### 4. Run Evaluation

Use the following command to run the C-Eval benchmark:
```bash
lm_eval --model vllm \
--model_args pretrained=/llm/models/CodeLlama-34b/,dtype=float16,max_model_len=2048,device=xpu,load_in_low_bit=fp8,tensor_parallel_size=4,distributed_executor_backend="ray",gpu_memory_utilization=0.90,trust_remote_code=True \
--tasks ceval-valid \
--batch_size 2 \
--num_fewshot 0 \
--output_path c-eval-result
```

#### 5. Notes

- **Model and Parameter Adjustments** (an adjusted example command is sketched after these notes):
- **`pretrained`**: Replace with the desired model path, e.g., `/llm/models/CodeLlama-7b/`.
- **`load_in_low_bit`**: Set to `fp8` or other precision options based on hardware and task requirements.
- **`tensor_parallel_size`**: Adjust based on the number of GPUs and memory. Recommended to match the GPU count.
- **`batch_size`**: Increase to accelerate testing, but ensure it does not cause OOM errors. Recommended values are `2` or `3`.
- **`num_fewshot`**: Specify the number of few-shot examples. Default is `0`. Increasing this value can improve model contextual understanding but may significantly increase input length and runtime.

- **Logging**:
To log both to the console and a file, use:
```bash
lm_eval --model vllm ... | tee c-eval.log
```

- **Container Debugging**:
Ensure the paths for the model and tasks are correctly set, e.g., check if `/llm/models/` is properly mounted in the container.
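
Putting the notes above together, the step 4 command adjusted for a smaller model on two GPUs might look like the following sketch (the model path and parameter values are illustrative, not prescriptive):
```bash
lm_eval --model vllm \
  --model_args pretrained=/llm/models/CodeLlama-7b/,dtype=float16,max_model_len=2048,device=xpu,load_in_low_bit=fp8,tensor_parallel_size=2,distributed_executor_backend="ray",gpu_memory_utilization=0.90,trust_remote_code=True \
  --tasks ceval-valid \
  --batch_size 3 \
  --num_fewshot 5 \
  --output_path c-eval-result | tee c-eval.log
```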

---

By following the above steps, you can successfully run the C-Eval benchmark in both single-GPU and multi-GPU environments.
