2.2.3 Backend: vLLM
Handle: vllm
URL: http://localhost:33911
A high-throughput and memory-efficient inference and serving engine for LLMs
Once you've found a model you want to run, you can configure it with Harbor:
# Quickly look up some of the compatible quants
harbor hf find awq
harbor hf find gptq
# This propagates the settings
# to the relevant configuration files
harbor vllm model google/gemma-2-2b-it
# To run a gated model, ensure that you've
# also set your Huggingface API Token
harbor hf token <your-token>
harbor up vllm
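For example, to serve one of the quants found above (the repository below is only an illustration; substitute the one you actually picked, and note that --quantization is vLLM's own flag for selecting the quantization method):
# Illustrative AWQ repo; replace with the quant you found
harbor vllm model TheBloke/Mistral-7B-Instruct-v0.2-AWQ
# Tell vLLM which quantization method to use
harbor vllm args '--quantization awq'
harbor up vllm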
Models served by vLLM should be available in the Open WebUI by default.
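vLLM also exposes an OpenAI-compatible API on the URL listed at the top of this page, so you can sanity-check the service directly (assuming the default port and no API key configured):
# List the models the running vLLM instance is serving
curl http://localhost:33911/v1/models
# Send a test chat request; the model name must match what you configured
curl http://localhost:33911/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "google/gemma-2-2b-it", "messages": [{"role": "user", "content": "Hello!"}]}'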
You can configure specific portions of vLLM via the Harbor CLI:
# See original CLI help
harbor run vllm --help
# Get/Set the extra arguments
harbor vllm args
harbor vllm args '--dtype bfloat16 --code-revision 3.5'
# Select attention backend
harbor vllm attention ROCM_FLASH
# Set the host port vLLM is exposed on
harbor config set vllm.host.port 4090
# Get/set desired vLLM version
harbor vllm version # v0.5.3
# Command accepts a docker tag
harbor vllm version latest
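Putting it together, a typical tweak-and-restart loop might look like the sketch below (the version tag and the --max-model-len value are only examples, and re-running harbor up vllm is assumed to recreate the container with the new settings):
# Pin a specific vLLM image tag instead of the default
harbor vllm version v0.6.3
# Cap the context length; --max-model-len is a standard vLLM flag
harbor vllm args '--max-model-len 8192'
# Bring the service back up with the new settings
harbor up vllm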
You can specify more options directly via the .env file.
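For reference, a minimal sketch of such overrides, assuming Harbor's convention of mapping dotted config keys (e.g. vllm.host.port) to HARBOR_-prefixed variables; the exact variable names below are an assumption, so check your .env for the real keys:
# Assumed variable names, derived from the dotted config keys used above
HARBOR_VLLM_MODEL="google/gemma-2-2b-it"
HARBOR_VLLM_EXTRA_ARGS="--dtype bfloat16"
HARBOR_VLLM_HOST_PORT=33911
HARBOR_VLLM_VERSION="v0.5.3"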
Below are some steps to take if you're running out of VRAM (no magic, though).
vLLM supports partial offloading to the CPU, similar to llama.cpp and some other backends. This can be configured via the --cpu-offload-gb flag.
harbor vllm args --cpu-offload-gb 4
When loading the model, VRAM usage can spike while the CUDA graphs are being captured. Graph capture can be disabled via the --enforce-eager flag.
harbor vllm args --enforce-eager
You can reduce the fraction of VRAM allocated for the model executor via the --gpu-memory-utilization flag. The value ranges from 0 to 1.0 and defaults to 0.9.
harbor vllm args --gpu-memory-utilization 0.8
You can move inference to the CPU entirely by setting the --device cpu flag.
harbor vllm args --device cpu
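These flags can also be combined; a sketch for a constrained setup (the numbers are illustrative, tune them for your hardware):
# Offload 4 GB to the CPU, skip CUDA graph capture, leave some VRAM headroom
harbor vllm args '--cpu-offload-gb 4 --enforce-eager --gpu-memory-utilization 0.8'
harbor up vllm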