Rust LLM Serving Framework
- Paged Attention
- Continuous Batching
- Quantization
  - AWQ
  - SqueezeLLM
- Models
  - Llama
  - Gemma
  - ChatGLM
Examples
$ cargo run --release --example llm_engine_example -- --model <llama model dir> --gpu-memory-utilization 0.95 --block-size 8 --max-model-len 1024
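To get a feel for how `--block-size` and `--max-model-len` interact, here is a rough sketch of the block arithmetic behind paged attention. It assumes `--block-size` counts tokens per KV-cache block, as in vLLM-style paged attention; this README does not confirm that. `blocks_needed` is a hypothetical helper written for illustration and is not part of this crate.

```rust
/// Number of fixed-size KV-cache blocks needed to hold `num_tokens` tokens.
/// Hypothetical helper: assumes `block_size` is a token count per block.
fn blocks_needed(num_tokens: usize, block_size: usize) -> usize {
    assert!(block_size > 0, "block size must be positive");
    // Ceiling division: a partially filled block still occupies a whole block.
    (num_tokens + block_size - 1) / block_size
}

fn main() {
    // Values taken from the example command above.
    let block_size = 8;
    let max_model_len = 1024;

    // A sequence at the maximum length fills 1024 / 8 = 128 blocks.
    println!(
        "blocks for a full-length sequence: {}",
        blocks_needed(max_model_len, block_size)
    );

    // A 13-token prompt needs 2 blocks; the second one is only partly used.
    println!(
        "blocks for a 13-token prompt: {}",
        blocks_needed(13, block_size)
    );
}
```

Under that assumption, a smaller block size wastes fewer slots in the last block of each sequence but means more blocks to track per sequence.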
API Server
$ cargo build --release
$ ./target/release/entrypoints --model <llama model dir> --gpu-memory-utilization 0.95 --block-size 8 --max-model-len 1024 --host 0.0.0.0 --port 8000