kv.run

A model serving framework for various research and production scenarios, built seamlessly on the PyTorch and Hugging Face ecosystems.


(Limited) comparison of popular model serving solutions

| Solution | Inference backend | Serving backend | Advanced kernel support | Model support |
| --- | --- | --- | --- | --- |
| Hugging Face TGI | PyTorch | HF TGI (Rust) | Paged + Flash attention | Language |
| DeepSpeed MII | PyTorch | DeepSpeed (Python) | DeepSpeed-Kernels | Language |
| TensorRT-LLM | TensorRT-LLM | TensorRT-LLM (C++) | TensorRT XQA | Language |
| vLLM | vLLM | vLLM (Python) | Paged + Flash attention | Language |
| kv.run | PyTorch | HF TGI + more (Rust) | Paged + Flash attention, FlashInfer | Language, diffusion models (soon) |

Installation

Install Rust:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
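After the installer finishes, make sure cargo is on your PATH before running the build steps below. With a default rustup installation this is typically:

# Load the cargo environment set up by rustup (default install location assumed)
source "$HOME/.cargo/env"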

Install Protobuf

sudo apt-get install libssl-dev gcc -y
PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP

Install Kernel Libraries (optional)

# Install FlashInfer
# For CUDA 12.1 & torch 2.3
pip install flashinfer==0.1.1 -i https://flashinfer.ai/whl/cu121/torch2.3
# For other CUDA & torch versions, please check https://docs.flashinfer.ai/installation.html

# Install Flash and Paged Attention
cd server && make install-flash-attention && make install-vllm-cuda && make install-flash-attention-v2-cuda

Build Code Base

make install

Build Docker Image (optional)

Dockerfile_kvrun provides a script for building the Docker image. Pre-built Docker images will be provided shortly.
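As a rough sketch (the image tag is arbitrary, and the port mapping assumes the server listens on port 3000 as in the query examples below):

# Build the image from the provided Dockerfile; the tag "kvrun" is just an example
docker build -f Dockerfile_kvrun -t kvrun .
# Run with GPU access, exposing port 3000 as assumed by the curl examples below
docker run --gpus all -p 3000:3000 kvrun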

Usages

Deploy services

text-generation-launcher --model-id tjluyao/llama-3-8b

You can pass --disable-flashinfer to fall back to classic TGI serving.
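For example, appending the flag to the launch command above:

# Same launch as above, but forcing the classic TGI code path (no FlashInfer)
text-generation-launcher --model-id tjluyao/llama-3-8b --disable-flashinfer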

Query the model

You can query the model either through curl:

curl 127.0.0.1:3000/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"lora_id": "tjluyao/llama-3-8b-math", "max_new_tokens":20}}' -H 'Content-Type: application/json'

or by using the Python client; please refer to README.md.

Local API tests

cd server/examples && python test_local_api.py

Local UI demo

(Inherited from Punica)

python server/examples/test_ui.py
(See the demo video: demo.mp4)

Using quantized models

Add --quantize [Method] to the command above, for example:

text-generation-launcher --model-id TechxGenus/gemma-2b-GPTQ --lora-ids tjluyao/gemma-2b-it-math --quantize gptq

The supported quantization methods include:

  • AWQ: 4-bit. Requires an AWQ-quantized model.
  • EETQ: 8-bit. Works with any model.
  • GPTQ: 4-bit. Requires a GPTQ-quantized model.
  • bitsandbytes: 8-bit. Works with any model.

For AWQ, EETQ, and GPTQ quantization, you need to build their specific kernels:

# AWQ
cd server && make install-awq
git clone https://github.com/casper-hansen/AutoAWQ && cd AutoAWQ
pip install -e .
# EETQ
cd server && make install-eetq
# GPTQ
git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
pip install -vvv --no-build-isolation -e .
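For example, since EETQ works with any model, it can be applied directly to the unquantized base model used earlier (a sketch, assuming the EETQ kernel above has been installed):

# EETQ 8-bit quantization of the base model from the earlier examples
text-generation-launcher --model-id tjluyao/llama-3-8b --quantize eetq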

Multi-LoRA support

  • To load LoRA adapters, you can either (1) specify them in the launcher arguments with --lora-ids (quoted, since the list is semicolon-separated):

text-generation-launcher --model-id tjluyao/llama-3-8b --lora-ids "tjluyao/llama-3-8b-math;tjluyao/llama-3-8b-zh"

    or (2) load them dynamically from the client after the model is launched:

curl 127.0.0.1:3000/download_lora_adapter -X POST -d '{"lora_id":"tjluyao/llama-3-8b-math"}' -H 'Content-Type: application/json'

  • To query the model with an adapter, pass lora_id in the request parameters (make sure the adapter has been loaded):

curl 127.0.0.1:3000/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"lora_id": "tjluyao/llama-3-8b-math", "max_new_tokens":20}}' -H 'Content-Type: application/json'

Benchmarks

Testing Llama-2-7b on an RTX 6000 Ada (Vast AI):

| Step | Batch size | Average FlashInfer (tokens/sec) | Average TGI (tokens/sec) |
| --- | --- | --- | --- |
| Prefill | 1 | 52.16 | 41.14 |
| | 2 | 101.64 | 78.69 |
| | 4 | 191.48 | 154.11 |
| | 8 | 323.21 | 290.82 |
| | 16 | 512.50 | 538.15 |
| | 32 | 697.89 | 783.61 |
| Decode | 1 | 56.55 | 40.84 |
| | 2 | 108.55 | 77.85 |
| | 4 | 207.10 | 154.27 |
| | 8 | 383.92 | 297.53 |
| | 16 | 682.78 | 562.83 |
| | 32 | 1119.92 | 993.33 |

Testing Llama-2-7b on an RTX 3090 (Vast AI):

| Step | Batch size | Average FlashInfer (tokens/sec) | Average TGI (tokens/sec) |
| --- | --- | --- | --- |
| Prefill | 1 | 44.33 | 23.32 |
| | 2 | 74.81 | 46.68 |
| | 4 | 133.93 | 90.51 |
| | 8 | 189.78 | 168.27 |
| | 16 | 231.24 | 218.12 |
| | 32 | 270.12 | 265.74 |
| Decode | 1 | 50.21 | 23.13 |
| | 2 | 89.70 | 47.26 |
| | 4 | 174.92 | 93.09 |
| | 8 | 324.06 | 175.21 |
| | 16 | 567.67 | 337.92 |
| | 32 | 861.50 | 601.03 |

Model and kernel support matrix

Note: L = Language, I = Image

| Model | MoE | Size | Modality | Flash & Paged Attention | FlashInfer |
| --- | --- | --- | --- | --- | --- |
| Idefics | | 9B | L, I ⇒ L | | |
| Idefics 2 | | 8B | L, I ⇒ L | | |
| Llava Next (1.6) | | 13B | L, I ⇒ L | | |
| Llama 2 | | 7B | L ⇒ L | | |
| Llama 3 | | 8B | L ⇒ L | | |
| Phi 1.5 | | 2.7B | L ⇒ L | | |
| Phi 3 | | 3.8B | L ⇒ L | | |
| Gemma | | 2B | L ⇒ L | | |
| Cohere | | 104B | L ⇒ L | | |
| Dbrx | | 132B | L ⇒ L | | |
| Mamba | | 2.8B | L ⇒ L | | |
| Mistral | | 7B | L ⇒ L | | |
| Mixtral | | 8x22B | L ⇒ L | | |
| Gpt Bigcode | | 1.1B | L ⇒ L | | |
| Baichuan | | 7B | L ⇒ L | | |
| Falcon | | 7B | L ⇒ L | | |
| StarCoder 2 | | 15B | L ⇒ L | | |
| Qwen 2 | | 7B | L ⇒ L | | |
| Qwen 1.5 | | 7B | L ⇒ L | | |
| Opt | | 6.7B | L ⇒ L | | |
| T5 | | 11B | L ⇒ L | | |
| Galactica | | 120B | L ⇒ L | | |
| SantaCoder | | 1.1B | L ⇒ L | | |
| Bloom | | 560M | L ⇒ L | | |
| Mpt | | 7B | L ⇒ L | | |
| Gpt2 | | 124M | L ⇒ L | | |
| Gpt Neox | | 20B | L ⇒ L | | |
| Yi 1.5 | | 9B | L ⇒ L | | |
| ChatGLM 4 | | 9B | L ⇒ L | | |
