Updated Readme for LLM_Bench #1

# Benchmarking Script for Large Language Models

This script provides a unified approach to estimate performance for Large Language Models (LLMs). It leverages pipelines provided by Optimum-Intel and allows performance estimation for PyTorch and OpenVINO models using nearly identical code and pre-collected models.

## Usage

### 1. Prepare Python Virtual Environment for LLM Benchmarking

``` bash
python3 -m venv ov-llm-bench-env
source ov-llm-bench-env/bin/activate
pip install --upgrade pip

git clone https://github.com/openvinotoolkit/openvino.genai.git
cd openvino.genai/llm_bench/python/
pip install -r requirements.txt
```

> Note:
> For existing Python environments, run the following command to ensure that all dependencies are installed with the latest versions:
> `pip install -U --upgrade-strategy eager -r requirements.txt`
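
As an optional sanity check, you can confirm that OpenVINO was installed into the environment (the reported version depends on what `requirements.txt` resolves to):

```bash
python -c "from openvino.runtime import get_version; print(get_version())"
```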

#### (Optional) Hugging Face Login

Log in to Hugging Face if you want to use non-public models:

```bash
huggingface-cli login
```

### 2. Convert Model to OpenVINO IR Format

The `optimum-cli` tool simplifies converting Hugging Face models to OpenVINO IR format.
- Detailed documentation can be found in the [Optimum-Intel documentation](https://huggingface.co/docs/optimum/main/en/intel/openvino/export).
- To learn more about weight compression, see the [NNCF Weight Compression Guide](https://docs.openvino.ai/2024/openvino-workflow/model-optimization-guide/weight-compression.html).
- For additional guidance on running inference with OpenVINO for LLMs, see the [OpenVINO LLM Inference Guide](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide.html).

**Usage:**

```bash
optimum-cli export openvino --model <MODEL_ID> --weight-format <PRECISION> <OUTPUT_DIR>

optimum-cli export openvino -h # For detailed information
```

* `--model <MODEL_ID>`: model ID for downloading from the [Hugging Face Hub](https://huggingface.co/models), or a path to a local directory containing the PyTorch model.
* `--weight-format <PRECISION>`: precision for model conversion. Available options: `fp32, fp16, int8, int4, mxfp4`.
* `<OUTPUT_DIR>`: output directory for saving the generated OpenVINO model.

**NOTE:**
- Models larger than 1 billion parameters are exported to the OpenVINO format with 8-bit weights by default. You can disable it with `--weight-format fp32`.

**Example:**
```bash
optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf --weight-format fp16 models/llama-2-7b-chat
```
**Resulting file structure:**

```console
models
└── llama-2-7b-chat
    ├── config.json
    ├── generation_config.json
    ├── openvino_detokenizer.bin
    ├── openvino_detokenizer.xml
    ├── openvino_model.bin
    ├── openvino_model.xml
    ├── openvino_tokenizer.bin
    ├── openvino_tokenizer.xml
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    ├── tokenizer.json
    └── tokenizer.model
```
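
The same command can also produce weight-compressed models. Below is a minimal sketch of a 4-bit export using only the options described above (the output directory name is illustrative; see the NNCF Weight Compression Guide linked earlier for compression-specific parameters):

```bash
optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf --weight-format int4 models/llama-2-7b-chat-int4
```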

### 3. Benchmark LLM Model

To benchmark the performance of the LLM, use the following command:

``` bash
python benchmark.py -m <model> -d <device> -r <report_csv> -f <framework> -p <prompt text> -n <num_iters>
# e.g.
python benchmark.py -m models/llama-2-7b-chat/ -n 2
python benchmark.py -m models/llama-2-7b-chat/ -p "What is openvino?" -n 2
python benchmark.py -m models/llama-2-7b-chat/ -pf prompts/llama-2-7b-chat_l.jsonl -n 2
```

**Parameters:**
- `-m`: Path to the model.
- `-d`: Inference device (default: CPU).
- `-r`: Path to the output CSV report.
- `-f`: Framework (default: ov).
- `-p`: Interactive prompt text.
- `-pf`: Path to a JSONL file containing prompts (see the example below).
- `-n`: Number of benchmarking iterations (default: 0; when greater than 0, the first iteration is excluded from the results).
- `-ic`: Limit on the output token size (default: 512) for text generation and code generation models.
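
The prompt file passed with `-pf` is a JSONL file with one JSON object per line. The sketch below assumes the schema used by the sample files in the `prompts/` directory, where the prompt text is stored under a `prompt` key; check the bundled samples if your file is not picked up:

```json
{"prompt": "What is OpenVINO?"}
{"prompt": "Summarize the main advantages of model weight compression."}
```

A prompt file can be combined with the other options above, for example to benchmark on CPU and write a CSV report (file names are illustrative):

```bash
python benchmark.py -m models/llama-2-7b-chat/ -pf prompts/llama-2-7b-chat_l.jsonl -n 2 -d CPU -r report.csv
```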

**Additional options:**
``` bash
python ./benchmark.py -h # for more information
```

#### Benchmarking the Original PyTorch Model
To benchmark the original PyTorch model, first download the model locally, then run the benchmark with PyTorch selected as the framework via the `-f pt` parameter:

```bash
# Download PyTorch Model
huggingface-cli download meta-llama/Llama-2-7b-chat-hf --local-dir models/llama-2-7b-chat/pytorch
# Benchmark with PyTorch Framework
python benchmark.py -m models/llama-2-7b-chat/pytorch -n 2 -f pt
```

> **Note:** If needed, you can install a specific OpenVINO version using pip:
> ``` bash
> # e.g.
> pip install openvino==2024.4.0
> # Optional, install the openvino nightly package if needed.
> # OpenVINO nightly is pre-release software and has not undergone full release validation or qualification.
> pip uninstall openvino
> pip install --upgrade --pre openvino openvino-tokenizers --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
> ```

## 4. Benchmark LLM with `torch.compile()`

The `--torch_compile_backend` option enables you to use `torch.compile()` to accelerate PyTorch models by compiling them into optimized kernels using a specified backend.

Before benchmarking, you need to download the original PyTorch model. Use the following command to download the model locally:

```bash
huggingface-cli download meta-llama/Llama-2-7b-chat-hf --local-dir models/llama-2-7b-chat/pytorch
```

To run the benchmarking script with `torch.compile()`, use the `--torch_compile_backend` option to specify the backend. You can choose between `pytorch` or `openvino` (default). Example:

```bash
python ./benchmark.py -m models/llama-2-7b-chat/pytorch -d CPU --torch_compile_backend openvino
```

> **Note:** To use `torch.compile()` with CUDA GPUs, you need to install the nightly version of PyTorch:
>
> ```bash
> pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118
> ```


## 5. Running on 2-Socket Platforms

The benchmarking script sets `openvino.properties.streams.num(1)` by default. For multi-socket platforms, use `numactl` on Linux or the `--load_config` option to modify behavior.

| OpenVINO Version | Behaviors |
|:--------------------|:------------------------------------------------|
| Before 2024.0.0 | `streams.num(1)` <br>executes on 2 sockets. |
| 2024.0.0 | `streams.num(1)` <br>executes on the same socket the application is running on. |

For example, using `--load_config config.json` with the following content will result in `streams.num(1)` with execution on 2 sockets:
```json
{
"INFERENCE_NUM_THREADS": <NUMBER>
}
```
`<NUMBER>` is the number of total physical cores in 2 sockets.
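
On Linux, `numactl` can additionally be used to pin the benchmark to both sockets when running with this config (the node numbers below are illustrative; adjust them to your machine):

```bash
numactl --cpunodebind=0,1 --membind=0,1 python benchmark.py -m models/llama-2-7b-chat/ -n 2 --load_config config.json
```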

## 6. Additional Resources

- **Error Troubleshooting:** Check the [NOTES.md](./doc/NOTES.md) for solutions to known issues.
- **Image Generation Configuration:** Refer to [IMAGE_GEN.md](./doc/IMAGE_GEN.md) for setting parameters for image generation models.