This example showcases inference of text-generation Large Language Models (LLMs): chatglm, LLaMA, Qwen, and other models with the same signature. The application doesn't have many configuration options, to encourage the reader to explore and modify the source code. For example, the device used for inference, CPU by default, can be changed to GPU by editing the source. The sample features `ov::genai::LLMPipeline` and configures it to use multiple beam groups. There is also a Jupyter notebook which provides an example of an LLM-powered chatbot in Python.
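A minimal sketch of how the pipeline is used, assuming a model exported as shown below; the grouped-beam-search fields come from `ov::genai::GenerationConfig`, and the specific values here are illustrative, not prescribed by the sample:

```cpp
#include <iostream>
#include <openvino/genai/llm_pipeline.hpp>

int main(int argc, char* argv[]) {
    // argv[1] is the directory with the exported OpenVINO model,
    // e.g. TinyLlama-1.1B-Chat-v1.0 from the export step below.
    ov::genai::LLMPipeline pipe(argv[1], "CPU");

    // Grouped beam search: 15 beams split into 3 groups; the diversity
    // penalty pushes the groups toward different continuations.
    ov::genai::GenerationConfig config;
    config.max_new_tokens = 20;
    config.num_beam_groups = 3;
    config.num_beams = 15;
    config.diversity_penalty = 1.0f;
    config.num_return_sequences = config.num_beams;

    // Prints all returned beams for the prompt.
    std::cout << pipe.generate("Why is the Sun yellow?", config) << '\n';
}
```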
The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version. Installing `../../requirements.txt` is not required for deployment if the model has already been exported.
```sh
pip install --upgrade-strategy eager -r ../../requirements.txt
optimum-cli export openvino --trust-remote-code --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 TinyLlama-1.1B-Chat-v1.0
```
```sh
beam_search_causal_lm TinyLlama-1.1B-Chat-v1.0 "Why is the Sun yellow?"
```
To enable Unicode characters for Windows cmd, open Region settings from Control panel. Administrative -> Change system locale -> Beta: Use Unicode UTF-8 for worldwide language support -> OK. Reboot.
Discrete GPUs (dGPUs) usually provide better performance compared to CPUs. It is recommended to run larger models on a dGPU with 32GB+ RAM. For example, the model meta-llama/Llama-2-13b-chat-hf can benefit from being run on a dGPU. Modify the source code to change the device for inference to the GPU.
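For example, a sketch of that change, reusing the pipeline construction from the snippet above (the variable name `models_path` is illustrative):

```cpp
// Construct the pipeline on a discrete GPU instead of the default CPU.
ov::genai::LLMPipeline pipe(models_path, "GPU");
```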
See https://github.com/openvinotoolkit/openvino.genai/blob/master/src/README.md#supported-models for the list of supported models.