
v0.6.4

Released by @github-actions on 15 Nov 07:32 · 63 commits to main since this release · commit 02dbf30

Highlights

Model Support

  • New LLMs and VLMs: Idefics3 (#9767), H2OVL-Mississippi (#9747), Qwen2-Audio (#9248), Pixtral models in the HF Transformers format (#9036), FalconMamba (#9325), Florence-2 language backbone (#9555)
  • New encoder-only embedding models: BERT (#9056), RoBERTa & XLM-RoBERTa (#9387)
  • Expanded task support: Llama embeddings (#9806), Math-Shepherd (Mistral reward modeling) (#9697), Qwen2 classification (#9704), Qwen2 embeddings (#10184), VLM2Vec (Phi-3-Vision embeddings) (#9303), E5-V (LLaVA-NeXT embeddings) (#9576), Qwen2-VL embeddings (#9944)
    • Add user-configurable --task parameter for models that support both generation and embedding (#9424)
    • Chat-based Embeddings API (#9759)
  • Tool-calling parsers for Granite 3.0 (#9027), Jamba (#9154), and granite-20b-functioncalling (#8339)
  • LoRA support for Granite 3.0 MoE (#9673), Idefics3 (#10281), Llama embeddings (#10071), Qwen (#9622), Qwen2-VL (#10022)
  • BNB quantization support for Idefics3 (#10310), Mllama (#9720), Qwen2 (#9467, #9574), MiniCPMV (#9891)
  • Unified multi-modal processor for VLM (#10040, #10044)
  • Simplify model interface (#9933, #10237, #9938, #9958, #10007, #9978, #9983, #10205)
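As a sketch of how the new --task flag (#9424) and the chat-based Embeddings API (#9759) fit together — the model name, flag spelling, and request shape below are illustrative, based on the PR titles rather than verified against the released CLI:

```shell
# Serve a model that supports both generation and embedding,
# pinning it to the embedding task (PR #9424).
vllm serve TIGER-Lab/VLM2Vec-Full --task embedding

# Query the chat-based Embeddings API (PR #9759): the request takes
# a "messages" list like the Chat Completions API, and the server
# returns an embedding for the rendered conversation.
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
        "model": "TIGER-Lab/VLM2Vec-Full",
        "messages": [{"role": "user", "content": "Represent this sentence."}]
      }'
```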

Hardware Support

  • Gaudi: Add Intel Gaudi (HPU) inference backend (#6143)
  • CPU: Add embedding models support for CPU backend (#10193)
  • TPU: Correctly profile peak memory usage and upgrade PyTorch XLA (#9438)
  • Triton: Add a Triton scaled_mm_triton kernel supporting FP8 and INT8 SmoothQuant (symmetric case) (#9857)

Performance

  • Combine chunked prefill with speculative decoding (#9291)
  • Improve fused_moe performance (#9384)

Engine Core

  • Override HF config.json via CLI (#5836)
  • Add goodput metric support (#9338)
  • Move parallel sampling out of the vLLM core, paving the way for the V1 engine (#9302)
  • Add stateless process group for easier integration with RLHF and disaggregated prefill (#10216, #10072)
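A minimal sketch of the new CLI config override (#5836). The --hf-overrides flag name and the JSON shape here are assumptions inferred from the PR title, not verified against the released CLI; check vllm serve --help for the exact spelling:

```shell
# Override fields of the model's Hugging Face config.json at launch
# time instead of editing the file on disk (flag name assumed from
# PR #5836; the rope_scaling values are purely illustrative).
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --hf-overrides '{"rope_scaling": {"rope_type": "dynamic", "factor": 2.0}}'
```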

Others

  • Improvements to the pull request experience with DCO, mergify, stale bot, etc. (#9436, #9512, #9513, #9259, #10082, #10285, #9803)
  • Dropped support for Python 3.8 (#10038, #8464)
  • Basic Integration Test For TPU (#9968)
  • Document the class hierarchy in vLLM (#10240) and explain the integration with Hugging Face (#10173)
  • Benchmark throughput now supports image input (#9851)


Full Changelog: v0.6.3...v0.6.4