diff --git a/llm_bench/python/README.md b/llm_bench/python/README.md
index b49ad980ab..3ef58f113a 100755
--- a/llm_bench/python/README.md
+++ b/llm_bench/python/README.md
@@ -1,140 +1,165 @@
-# Benchmarking script for large language models
+# Benchmarking Script for Large Language Models
 
-This script provides a unified approach to estimate performance for Large Language Models.
-It is based on pipelines provided by Optimum-Intel and allows to estimate performance for
-pytorch and openvino models, using almost the same code and precollected models.
+This script provides a unified approach to estimate performance for Large Language Models (LLMs). It leverages pipelines provided by Optimum-Intel and allows performance estimation for PyTorch and OpenVINO models using nearly identical code and pre-collected models.
 
-## Usage
-### 1. Start a Python virtual environment
+### 1. Prepare Python Virtual Environment for LLM Benchmarking
 
 ``` bash
-python3 -m venv python-env
-source python-env/bin/activate
+python3 -m venv ov-llm-bench-env
+source ov-llm-bench-env/bin/activate
 pip install --upgrade pip
-pip install -r requirements.txt
+
+git clone https://github.com/openvinotoolkit/openvino.genai.git
+cd openvino.genai/llm_bench/python/
+pip install -r requirements.txt
 ```
-> Note:
-> If you are using an existing python environment, recommend following command to use all the dependencies with latest versions:
-> pip install -U --upgrade-strategy eager -r requirements.txt
-### 2. Convert a model to OpenVINO IR
-
-The optimum-cli tool allows you to convert models from Hugging Face to the OpenVINO IR format. More detailed info about tool usage can be found in [Optimum Intel documentation](https://huggingface.co/docs/optimum/main/en/intel/openvino/export)
+
+> Note:
+> For existing Python environments, run the following command to ensure that all dependencies are installed with the latest versions:
+> `pip install -U --upgrade-strategy eager -r requirements.txt`
 
-Prerequisites:
-install conversion dependencies using `requirements.txt`
+#### (Optional) Hugging Face Login
 
-Usage:
+Log in to Hugging Face if you want to use non-public models:
 
 ```bash
-optimum-cli export openvino --model <MODEL_NAME> --weight-format <PRECISION> <OUTPUT_DIR>
+huggingface-cli login
 ```
 
-Paramters:
-* `--model <MODEL_NAME>` - <MODEL_NAME> model_id for downloading from huggngface_hub (https://huggingface.co/models) or path with directory where pytorch model located.
-* `--weight-format` - precision for model conversion fp32, fp16, int8, int4
-* `<OUTPUT_DIR>` - output directory for saving OpenVINO model.
+### 2. Convert Model to OpenVINO IR Format
+
+The `optimum-cli` tool simplifies converting Hugging Face models to OpenVINO IR format.
+- Detailed documentation can be found in the [Optimum-Intel documentation](https://huggingface.co/docs/optimum/main/en/intel/openvino/export).
+- To learn more about weight compression, see the [NNCF Weight Compression Guide](https://docs.openvino.ai/2024/openvino-workflow/model-optimization-guide/weight-compression.html); a short sketch follows this list.
+- For additional guidance on running inference with OpenVINO for LLMs, see the [OpenVINO LLM Inference Guide](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide.html).
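+
+As a quick illustration of the weight compression mentioned above, 4-bit weights can be requested directly at export time. This is only a sketch: the compression options shown (`--group-size`, `--ratio`) and their defaults depend on the installed `optimum-intel` version, so verify them with `optimum-cli export openvino -h` first.
+
+```bash
+# Sketch: int4 weight compression at export time (option availability depends on your optimum-intel version)
+optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf \
+  --weight-format int4 --group-size 128 --ratio 0.8 models/llama-2-7b-chat-int4
+```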
-Usage example:
-```bash
-optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf --weight-format fp16 models/llama-2-7b-chat
-```
+**Usage:**
 
-the result of running the command will have the following file structure:
+```bash
+optimum-cli export openvino --model <MODEL_ID> --weight-format <PRECISION> <OUTPUT_DIR>
 
-    |-llama-2-7b-chat
-      |-pytorch
-        |-dldt
-          |-FP16
-            |-openvino_model.xml
-            |-openvino_model.bin
-            |-config.json
-            |-generation_config.json
-            |-tokenizer_config.json
-            |-tokenizer.json
-            |-tokenizer.model
-            |-special_tokens_map.json
+optimum-cli export openvino -h # For detailed information
 ```
 
-### 3. Benchmarking
+* `--model <MODEL_ID>`: model ID for downloading from [huggingface_hub](https://huggingface.co/models) or a path to a directory containing a PyTorch model.
+* `--weight-format <PRECISION>`: precision for model conversion. Available options: `fp32, fp16, int8, int4, mxfp4`.
+* `<OUTPUT_DIR>`: output directory for saving the generated OpenVINO model.
 
-Prerequisites:
-install benchmarking dependencies using `requirements.txt`
+**NOTE:**
+- Models larger than 1 billion parameters are exported to the OpenVINO format with 8-bit weights by default. You can disable this by explicitly setting `--weight-format fp32`.
 
-``` bash
-pip install -r requirements.txt
+**Example:**
+```bash
+optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf --weight-format fp16 models/llama-2-7b-chat
 ```
 
-note: **You can specify the installed OpenVINO version through pip install**
-``` bash
-# e.g.
-pip install openvino==2023.3.0
+**Resulting file structure:**
+
+```console
+    models
+    └── llama-2-7b-chat
+        ├── config.json
+        ├── generation_config.json
+        ├── openvino_detokenizer.bin
+        ├── openvino_detokenizer.xml
+        ├── openvino_model.bin
+        ├── openvino_model.xml
+        ├── openvino_tokenizer.bin
+        ├── openvino_tokenizer.xml
+        ├── special_tokens_map.json
+        ├── tokenizer_config.json
+        ├── tokenizer.json
+        └── tokenizer.model
 ```
 
-### 4. Run the following command to test the performance of one LLM model
+### 3. Benchmark LLM Model
+
+To benchmark the performance of the LLM, use the following command:
+
 ``` bash
 python benchmark.py -m <model> -d <device> -r <report_csv> -f <framework> -p <prompt text> -n <num_iters>
 # e.g.
-python benchmark.py -m models/llama-2-7b-chat/pytorch/dldt/FP32 -n 2
-python benchmark.py -m models/llama-2-7b-chat/pytorch/dldt/FP32 -p "What is openvino?" -n 2
-python benchmark.py -m models/llama-2-7b-chat/pytorch/dldt/FP32 -pf prompts/llama-2-7b-chat_l.jsonl -n 2
+python benchmark.py -m models/llama-2-7b-chat/ -n 2
+python benchmark.py -m models/llama-2-7b-chat/ -p "What is openvino?" -n 2
+python benchmark.py -m models/llama-2-7b-chat/ -pf prompts/llama-2-7b-chat_l.jsonl -n 2
 ```
 
-Parameters:
-* `-m` - model path
-* `-d` - inference device (default=cpu)
-* `-r` - report csv
-* `-f` - framework (default=ov)
-* `-p` - interactive prompt text
-* `-pf` - path of JSONL file including interactive prompts
-* `-n` - number of benchmarking iterations, if the value greater 0, will exclude the first iteration. (default=0)
-* `-ic` - limit the output token size (default 512) of text_gen and code_gen models.
-
+**Parameters:**
+- `-m`: Path to the model.
+- `-d`: Inference device (default: CPU).
+- `-r`: Path to the CSV report.
+- `-f`: Framework (default: ov).
+- `-p`: Interactive prompt text.
+- `-pf`: Path to a JSONL file containing prompts (see the example format below).
+- `-n`: Number of benchmarking iterations (default: 0); when greater than 0, the first iteration is excluded from the results.
+- `-ic`: Limit the output token size (default: 512) for text generation and code generation models.
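+
+For reference, each line of a `-pf` file is a standalone JSON object containing the prompt text. The layout below is an assumed minimal example (the `my_prompts.jsonl` name is made up here); compare it with the bundled files under `prompts/` before relying on it, since the expected keys may differ per model type:
+
+```bash
+# Assumed prompt-file layout: verify against the existing files in prompts/
+printf '%s\n' '{"prompt": "What is OpenVINO?"}' '{"prompt": "Why is model benchmarking useful?"}' > prompts/my_prompts.jsonl
+python benchmark.py -m models/llama-2-7b-chat/ -pf prompts/my_prompts.jsonl -n 2
+```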
+
+**Additional options:**
 ``` bash
 python ./benchmark.py -h # for more information
 ```
 
-## Running `torch.compile()`
+#### Benchmarking the Original PyTorch Model
+To benchmark the original PyTorch model, first download the model locally, then run the benchmark with PyTorch selected as the framework (`-f pt`):
 
-The option `--torch_compile_backend` uses `torch.compile()` to speed up
-the PyTorch code by compiling it into optimized kernels using a selected backend.
+```bash
+# Download PyTorch Model
+huggingface-cli download meta-llama/Llama-2-7b-chat-hf --local-dir models/llama-2-7b-chat/pytorch
+# Benchmark with PyTorch Framework
+python benchmark.py -m models/llama-2-7b-chat/pytorch -n 2 -f pt
+```
 
-Prerequisites: install benchmarking dependencies using requirements.txt
+> **Note:** If needed, you can install a specific OpenVINO version using pip:
+> ``` bash
+> # e.g.
+> pip install openvino==2024.4.0
+> # Optional: install the OpenVINO nightly package if needed.
+> # OpenVINO nightly is pre-release software and has not undergone full release validation or qualification.
+> pip uninstall openvino
+> pip install --upgrade --pre openvino openvino-tokenizers --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
+> ```
 
-``` bash
-pip install -r requirements.txt
-```
+### 4. Benchmark LLM with `torch.compile()`
+
+The `--torch_compile_backend` option enables you to use `torch.compile()` to accelerate PyTorch models by compiling them into optimized kernels using a specified backend.
 
-In order to run the `torch.compile()` on CUDA GPU, install additionally the nightly PyTorch version:
+Before benchmarking, you need to download the original PyTorch model. Use the following command to download the model locally:
 
 ```bash
-pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118
+huggingface-cli download meta-llama/Llama-2-7b-chat-hf --local-dir models/llama-2-7b-chat/pytorch
 ```
 
-Add the option `--torch_compile_backend` with the desired backend: `pytorch` or `openvino` (default) while running the benchmarking script:
+To run the benchmarking script with `torch.compile()`, use the `--torch_compile_backend` option to specify the backend. You can choose between `pytorch` or `openvino` (default). Example:
 
 ```bash
 python ./benchmark.py -m models/llama-2-7b-chat/pytorch -d CPU --torch_compile_backend openvino
 ```
 
-## Run on 2 sockets platform
+> **Note:** To use `torch.compile()` with CUDA GPUs, you need to install the nightly version of PyTorch:
+>
+> ```bash
+> pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118
+> ```
+
-benchmark.py sets openvino.properties.streams.num(1) by default
+### 5. Running on 2-Socket Platforms
 
-| OpenVINO version | Behaviors |
+The benchmarking script sets `openvino.properties.streams.num(1)` by default. For multi-socket platforms, use `numactl` on Linux or the `--load_config` option to modify this behavior (see the examples below).
+
+| OpenVINO Version     | Behavior |
 |:--------------------|:------------------------------------------------|
-| Before 2024.0.0 | streams.num(1) <br>execute on 2 sockets. |
-| 2024.0.0 | streams.num(1) <br>execute on the same socket as the APP is running on. |
+| Before 2024.0.0      | streams.num(1) <br>executes on 2 sockets. |
+| 2024.0.0             | streams.num(1) <br>executes on the same socket as the application is running on. |
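+
+For example, on Linux the benchmark can be pinned to a single socket with `numactl`. This is a sketch: the NUMA node index below is illustrative and depends on your system topology (check it with `numactl --hardware`):
+
+```bash
+# Run the benchmark only on NUMA node 0 (cores and memory of the first socket)
+numactl --cpunodebind=0 --membind=0 python benchmark.py -m models/llama-2-7b-chat/ -n 2
+```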
-numactl on Linux or --load_config for benchmark.py can be used to change the behaviors.
+For example, passing `--load_config config.json` with the following content will result in `streams.num(1)` and execution on 2 sockets:
+```json
+{
+    "INFERENCE_NUM_THREADS": <NUMBER>
+}
+```
+`<NUMBER>` is the total number of physical cores across the 2 sockets.
 
-For example, --load_config config.json as following in OpenVINO 2024.0.0 will result in streams.num(1) and execute on 2 sockets.
-```
-{"INFERENCE_NUM_THREADS":<NUMBER>}
-```
-`<NUMBER>` is the number of total physical cores in 2 sockets
 
+### 6. Additional Resources
-## Additional Resources
-### 1. NOTE
-> If you encounter any errors, please check **[NOTES.md](./doc/NOTES.md)** which provides solutions to the known errors.
-### 2. Image generation
-> To configure more parameters for image generation models, reference to **[IMAGE_GEN.md](./doc/IMAGE_GEN.md)**
+
+- **Error Troubleshooting:** Check [NOTES.md](./doc/NOTES.md) for solutions to known issues.
+- **Image Generation Configuration:** Refer to [IMAGE_GEN.md](./doc/IMAGE_GEN.md) for configuring parameters of image generation models.
\ No newline at end of file
diff --git a/llm_bench/python/convert.py b/llm_bench/python/convert.py
index ae676bc269..9d4d05fb99 100644
--- a/llm_bench/python/convert.py
+++ b/llm_bench/python/convert.py
@@ -1464,6 +1464,7 @@ def main():
     add_stateful_model_arguments(parser)
 
     args = parser.parse_args()
+    log.warning("[DEPRECATED] Not for production use! Please use 'optimum-intel' to generate the IRs. For details, please check: https://github.com/openvinotoolkit/openvino.genai/blob/master/llm_bench/python/README.md#2-convert-model-to-openvino-ir-format")
     log.info(f"openvino runtime version: {get_version()}")
     model_type = get_convert_model_type(args.model_id.lower())
     converter = converters[model_type]