v0.6.4
Highlights
- Significant progress in the V1 engine core refactor (#9826, #10135, #10288, #10211, #10225, #10228, #10268, #9954, #10272, #9971, #10224, #10166, #9289, #10058, #9888, #9972, #10059, #9945, #9679, #9871, #10227, #10245, #9629, #10097, #10203, #10148). You can find more details on the design and upcoming plans in our recent meetup slides.
- Significant progress in `torch.compile` support. Many models now work with torch.compile and TorchInductor; see our meetup slides for more details, and the minimal usage sketch after this list. (#9775, #9614, #9639, #9641, #9876, #9946, #9589, #9896, #9637, #9300, #9947, #9138, #9715, #9866, #9632, #9858, #9889)
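A minimal, hedged sketch of trying the torch.compile path with the offline `LLM` API. The `VLLM_TORCH_COMPILE_LEVEL` environment variable and its value are assumptions about this release and may not match the actual knob; the model name is purely illustrative.

```python
# Hedged sketch: enabling the torch.compile path for offline inference.
# ASSUMPTION: VLLM_TORCH_COMPILE_LEVEL selects the compilation level in this
# release; verify the exact knob against your installed version.
import os

os.environ["VLLM_TORCH_COMPILE_LEVEL"] = "3"  # assumed: 0 = eager, higher = more compilation

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=32)
outputs = llm.generate(["vLLM compiles models with"], params)
print(outputs[0].outputs[0].text)
```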
Model Support
- New LLMs and VLMs: Idefics3 (#9767), H2OVL-Mississippi (#9747), Qwen2-Audio (#9248), Pixtral models in the HF Transformers format (#9036), FalconMamba (#9325), Florence-2 language backbone (#9555)
- New encoder-only embedding models: BERT (#9056), RoBERTa & XLM-RoBERTa (#9387)
- Expanded task support: Llama embeddings (#9806), Math-Shepherd (Mistral reward modeling) (#9697), Qwen2 classification (#9704), Qwen2 embeddings (#10184), VLM2Vec (Phi-3-Vision embeddings) (#9303), E5-V (LLaVA-NeXT embeddings) (#9576), Qwen2-VL embeddings (#9944); see the usage sketch after this list
- Tool calling parser for Granite 3.0 (#9027), Jamba (#9154), granite-20b-functioncalling (#8339)
- LoRA support for Granite 3.0 MoE (#9673), Idefics3 (#10281), Llama embeddings (#10071), Qwen (#9622), Qwen2-VL (#10022)
- BNB quantization support for Idefics3 (#10310), Mllama (#9720), Qwen2 (#9467, #9574), MiniCPMV (#9891)
- Unified multi-modal processor for VLM (#10040, #10044)
- Simplify model interface (#9933, #10237, #9938, #9958, #10007, #9978, #9983, #10205)
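A small, hedged sketch of the expanded embedding task support using the offline `LLM` API. The `task="embedding"` argument and the model name are assumptions for illustration and may not match the exact spelling in this release.

```python
# Hedged sketch: running a text-embedding model offline.
# ASSUMPTION: task="embedding" selects the embedding task for models that
# support both generation and embedding (#9424); verify the exact argument.
from vllm import LLM

llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embedding")
outputs = llm.encode(["What is the capital of France?"])

# Each output carries one embedding vector for its prompt.
print(len(outputs[0].outputs.embedding))
```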
Hardware Support
- Gaudi: Add Intel Gaudi (HPU) inference backend (#6143)
- CPU: Add embedding models support for CPU backend (#10193)
- TPU: Correctly profile peak memory usage & Upgrade PyTorch XLA (#9438)
- Triton: Add Triton implementation for scaled_mm_triton to support fp8 and int8 SmoothQuant, symmetric case (#9857)
Performance
Engine Core
- Override HF `config.json` via CLI (#5836); see the sketch after this list
- Add goodput metric support (#9338)
- Move parallel sampling out from vllm core, paving way for V1 engine (#9302)
- Add stateless process group for easier integration with RLHF and disaggregated prefill (#10216, #10072)
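A hedged sketch of overriding fields from a model's `config.json` when constructing the engine. The `hf_overrides` keyword argument and the overridden field are assumptions for illustration; check the CLI/engine arguments of your installed version for the exact spelling.

```python
# Hedged sketch: overriding a value from the model's Hugging Face config.json.
# ASSUMPTION: the LLM constructor accepts an hf_overrides dict mirroring the
# new CLI option (#5836); the rope_theta value is purely illustrative.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    hf_overrides={"rope_theta": 1_000_000.0},  # assumed override key/value
)
print(llm.llm_engine.model_config.hf_config.rope_theta)
```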
Others
- Improvements to the pull request experience with DCO, mergify, stale bot, etc. (#9436, #9512, #9513, #9259, #10082, #10285, #9803)
- Dropped support for Python 3.8 (#10038, #8464)
- Basic Integration Test For TPU (#9968)
- Document the class hierarchy in vLLM (#10240) and explain the integration with Hugging Face (#10173)
- Benchmark throughput now supports image input (#9851)
What's Changed
- [TPU] Fix TPU SMEM OOM by Pallas paged attention kernel by @WoosukKwon in #9350
- [Frontend] merge beam search implementations by @LunrEclipse in #9296
- [Model] Make llama3.2 support multiple and interleaved images by @xiangxu-google in #9095
- [Bugfix] Clean up some cruft in mamba.py by @tlrmchlsmth in #9343
- [Frontend] Clarify model_type error messages by @stevegrubb in #9345
- [Doc] Fix code formatting in spec_decode.rst by @mgoin in #9348
- [Bugfix] Update InternVL input mapper to support image embeds by @hhzhang16 in #9351
- [BugFix] Fix chat API continuous usage stats by @njhill in #9357
- pass ignore_eos parameter to all benchmark_serving calls by @gracehonv in #9349
- [Misc] Directly use compressed-tensors for checkpoint definitions by @mgoin in #8909
- [Bugfix] Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids by @CatherineSue in #9034
- [Bugfix][CI/Build] Fix CUDA 11.8 Build by @LucasWilkinson in #9386
- [Bugfix] Molmo text-only input bug fix by @mrsalehi in #9397
- [Misc] Standardize RoPE handling for Qwen2-VL by @DarkLight1337 in #9250
- [Model] VLM2Vec, the first multimodal embedding model in vLLM by @DarkLight1337 in #9303
- [CI/Build] Test VLM embeddings by @DarkLight1337 in #9406
- [Core] Rename input data types by @DarkLight1337 in #8688
- [Misc] Consolidate example usage of OpenAI client for multimodal models by @ywang96 in #9412
- [Model] Support SDPA attention for Molmo vision backbone by @Isotr0py in #9410
- Support mistral interleaved attn by @patrickvonplaten in #9414
- [Kernel][Model] Improve continuous batching for Jamba and Mamba by @mzusman in #9189
- [Model][Bugfix] Add FATReLU activation and support for openbmb/MiniCPM-S-1B-sft by @streaver91 in #9396
- [Performance][Spec Decode] Optimize ngram lookup performance by @LiuXiaoxuanPKU in #9333
- [CI/Build] mypy: Resolve some errors from checking vllm/engine by @russellb in #9267
- [Bugfix][Kernel] Prevent integer overflow in fp8 dynamic per-token quantize kernel by @tlrmchlsmth in #9425
- [BugFix] [Kernel] Fix GPU SEGV occurring in int8 kernels by @rasmith in #9391
- Add notes on the use of Slack by @terrytangyuan in #9442
- [Kernel] Add Exllama as a backend for compressed-tensors by @LucasWilkinson in #9395
- [Misc] Print stack trace using `logger.exception` by @DarkLight1337 in #9461
- [misc] CUDA Time Layerwise Profiler by @LucasWilkinson in #8337
- [Bugfix] Allow prefill of assistant response when using `mistral_common` by @sasha0552 in #9446
- [TPU] Call torch._sync(param) during weight loading by @WoosukKwon in #9437
- [Hardware][CPU] compressed-tensor INT8 W8A8 AZP support by @bigPYJ1151 in #9344
- [Core] Deprecating block manager v1 and make block manager v2 default by @KuntaiDu in #8704
- [CI/Build] remove .github from .dockerignore, add dirty repo check by @dtrifiro in #9375
- [Misc] Remove commit id file by @DarkLight1337 in #9470
- [torch.compile] Fine-grained CustomOp enabling mechanism by @ProExpertProg in #9300
- [Bugfix] Fix support for dimension like integers and ScalarType by @bnellnm in #9299
- [Bugfix] Add random_seed to sample_hf_requests in benchmark_serving script by @wukaixingxp in #9013
- [Bugfix] Print warnings related to `mistral_common` tokenizer only once by @sasha0552 in #9468
- [Hardware][Neuron] Simplify model load for transformers-neuronx library by @sssrijan-amazon in #9380
- Support `BERTModel` (first encoder-only embedding model) by @robertgshaw2-neuralmagic in #9056
- [BugFix] Stop silent failures on compressed-tensors parsing by @dsikka in #9381
- [Bugfix][Core] Use torch.cuda.memory_stats() to profile peak memory usage by @joerunde in #9352
- [Qwen2.5] Support bnb quant for Qwen2.5 by @blueyo0 in #9467
- [CI/Build] Use commit hash references for github actions by @russellb in #9430
- [BugFix] Typing fixes to RequestOutput.prompt and beam search by @njhill in #9473
- [Frontend][Feature] Add jamba tool parser by @tomeras91 in #9154
- [BugFix] Fix and simplify completion API usage streaming by @njhill in #9475
- [CI/Build] Fix lint errors in mistral tokenizer by @DarkLight1337 in #9504
- [Bugfix] Fix offline_inference_with_prefix.py by @tlrmchlsmth in #9505
- [Misc] benchmark: Add option to set max concurrency by @russellb in #9390
- [Model] Add user-configurable task for models that support both generation and embedding by @DarkLight1337 in #9424
- [CI/Build] Add error matching config for mypy by @russellb in #9512
- [Model] Support Pixtral models in the HF Transformers format by @mgoin in #9036
- [MISC] Add lora requests to metrics by @coolkp in #9477
- [MISC] Consolidate cleanup() and refactor offline_inference_with_prefix.py by @comaniac in #9510
- [Kernel] Add env variable to force flashinfer backend to enable tensor cores by @tdoublep in #9497
- [Bugfix] Fix offline mode when using `mistral_common` by @sasha0552 in #9457
- 🐛 fix torch memory profiling by @joerunde in #9516
- [Frontend] Avoid creating guided decoding LogitsProcessor unnecessarily by @njhill in #9521
- [Doc] update gpu-memory-utilization flag docs by @joerunde in #9507
- [CI/Build] Add error matching for ruff output by @russellb in #9513
- [CI/Build] Configure matcher for actionlint workflow by @russellb in #9511
- [Frontend] Support simpler image input format by @yue-anyscale in #9478
- [Bugfix] Fix missing task for speculative decoding by @DarkLight1337 in #9524
- [Model][Pixtral] Optimizations for input_processor_for_pixtral_hf by @mgoin in #9514
- [Bugfix] Pass json-schema to GuidedDecodingParams and make test stronger by @heheda12345 in #9530
- [Model][Pixtral] Use memory_efficient_attention for PixtralHFVision by @mgoin in #9520
- [Kernel] Support sliding window in flash attention backend by @heheda12345 in #9403
- [Frontend][Misc] Goodput metric support by @Imss27 in #9338
- [CI/Build] Split up decoder-only LM tests by @DarkLight1337 in #9488
- [Doc] Consistent naming of attention backends by @tdoublep in #9498
- [Model] FalconMamba Support by @dhiaEddineRhaiem in #9325
- [Bugfix][Misc]: fix graph capture for decoder by @yudian0504 in #9549
- [BugFix] Use correct python3 binary in Docker.ppc64le entrypoint by @varad-ahirwadkar in #9492
- [Model][Bugfix] Fix batching with multi-image in PixtralHF by @mgoin in #9518
- [Frontend] Reduce frequency of client cancellation checking by @njhill in #7959
- [doc] fix format by @youkaichao in #9562
- [BugFix] Update draft model TP size check to allow matching target TP size by @njhill in #9394
- [Frontend] Don't log duplicate error stacktrace for every request in the batch by @wallashss in #9023
- [CI] Make format checker error message more user-friendly by using emoji by @KuntaiDu in #9564
- 🐛 Fixup more test failures from memory profiling by @joerunde in #9563
- [core] move parallel sampling out from vllm core by @youkaichao in #9302
- [Bugfix]: serialize config instances by value when using --trust-remote-code by @tjohnson31415 in #6751
- [CI/Build] Remove unnecessary `fork_new_process` by @DarkLight1337 in #9484
- [Bugfix][OpenVINO] fix_dockerfile_openvino by @ngrozae in #9552
- [Bugfix]: phi.py get rope_theta from config file by @Falko1 in #9503
- [CI/Build] Replaced some models on tests for smaller ones by @wallashss in #9570
- [Core] Remove evictor_v1 by @KuntaiDu in #9572
- [Doc] Use shell code-blocks and fix section headers by @rafvasq in #9508
- support TP in qwen2 bnb by @chenqianfzh in #9574
- [Hardware][CPU] using current_platform.is_cpu by @wangshuai09 in #9536
- [V1] Implement vLLM V1 [1/N] by @WoosukKwon in #9289
- [CI/Build][LoRA] Temporarily fix long context failure issue by @jeejeelee in #9579
- [Neuron] [Bugfix] Fix neuron startup by @xendo in #9374
- [Model][VLM] Initialize support for Mono-InternVL model by @Isotr0py in #9528
- [Bugfix] Eagle: change config name for fc bias by @gopalsarda in #9580
- [Hardware][Intel CPU][DOC] Update docs for CPU backend by @zhouyuan in #6212
- [Frontend] Support custom request_id from request by @guoyuhong in #9550
- [BugFix] Prevent exporting duplicate OpenTelemetry spans by @ronensc in #9017
- [torch.compile] auto infer dynamic_arg_dims from type annotation by @youkaichao in #9589
- [Bugfix] fix detokenizer shallow copy by @aurickq in #5919
- [Misc] Make benchmarks use EngineArgs by @JArnoldAMD in #9529
- [Bugfix] Fix spurious "No compiled cutlass_scaled_mm ..." for W8A8 on Turing by @LucasWilkinson in #9487
- [BugFix] Fix metrics error for --num-scheduler-steps > 1 by @yuleil in #8234
- [Doc]: Update tensorizer docs to include vllm[tensorizer] by @sethkimmel3 in #7889
- [Bugfix] Generate exactly input_len tokens in benchmark_throughput by @heheda12345 in #9592
- [Misc] Add an env var VLLM_LOGGING_PREFIX, if set, it will be prepend to all logging messages by @sfc-gh-zhwang in #9590
- [Model] Support E5-V by @DarkLight1337 in #9576
- [Build] Fix `FetchContent` multiple build issue by @ProExpertProg in #9596
- [Hardware][XPU] using current_platform.is_xpu by @MengqingCao in #9605
- [Model] Initialize Florence-2 language backbone support by @Isotr0py in #9555
- [VLM] Post-layernorm override and quant config in vision encoder by @DarkLight1337 in #9217
- [Model] Add min_pixels / max_pixels to Qwen2VL as mm_processor_kwargs by @alex-jw-brooks in #9612
- [Bugfix] Fix `_init_vision_model` in NVLM_D model by @DarkLight1337 in #9611
- [misc] comment to avoid future confusion about baichuan by @youkaichao in #9620
- [Bugfix] Fix divide by zero when serving Mamba models by @tlrmchlsmth in #9617
- [Misc] Separate total and output tokens in benchmark_throughput.py by @mgoin in #8914
- [torch.compile] Adding torch compile annotations to some models by @CRZbulabula in #9614
- [Frontend] Enable Online Multi-image Support for MLlama by @alex-jw-brooks in #9393
- [Model] Add Qwen2-Audio model support by @faychu in #9248
- [CI/Build] Add bot to close stale issues and PRs by @russellb in #9436
- [Bugfix][Model] Fix Mllama SDPA illegal memory access for batched multi-image by @mgoin in #9626
- [Bugfix] Use "vision_model" prefix for MllamaVisionModel by @mgoin in #9628
- [Bugfix]: Make chat content text allow type content by @vrdn-23 in #9358
- [XPU] avoid triton import for xpu by @yma11 in #9440
- [Bugfix] Fix PP for ChatGLM and Molmo, and weight loading for Qwen2.5-Math-RM by @DarkLight1337 in #9422
- [V1][Bugfix] Clean up requests when aborted by @WoosukKwon in #9629
- [core] simplify seq group code by @youkaichao in #9569
- [torch.compile] Adding torch compile annotations to some models by @CRZbulabula in #9639
- [Kernel] add kernel for FATReLU by @jeejeelee in #9610
- [torch.compile] expanding support and fix allgather compilation by @CRZbulabula in #9637
- [Doc] Move additional tips/notes to the top by @DarkLight1337 in #9647
- [Bugfix]Disable the post_norm layer of the vision encoder for LLaVA models by @litianjian in #9653
- Increase operation per run limit for "Close inactive issues and PRs" workflow by @hmellor in #9661
- [torch.compile] Adding torch compile annotations to some models by @CRZbulabula in #9641
- [CI/Build] Fix VLM test failures when using transformers v4.46 by @DarkLight1337 in #9666
- [Model] Compute Llava Next Max Tokens / Dummy Data From Gridpoints by @alex-jw-brooks in #9650
- [Log][Bugfix] Fix default value check for `image_url.detail` by @mgoin in #9663
- [Performance][Kernel] Fused_moe Performance Improvement by @charlifu in #9384
- [Bugfix] Remove xformers requirement for Pixtral by @mgoin in #9597
- [ci/Build] Skip Chameleon for transformers 4.46.0 on broadcast test #9675 by @khluu in #9676
- [Model] add a lora module for granite 3.0 MoE models by @willmj in #9673
- [V1] Support sliding window attention by @WoosukKwon in #9679
- [Bugfix] Fix compressed_tensors_moe bad config.strategy by @mgoin in #9677
- [Doc] Improve quickstart documentation by @rafvasq in #9256
- [Bugfix] Fix crash with llama 3.2 vision models and guided decoding by @tjohnson31415 in #9631
- [Bugfix] Streaming continuous_usage_stats default to False by @samos123 in #9709
- [Hardware][openvino] is_openvino --> current_platform.is_openvino by @MengqingCao in #9716
- Fix: MI100 Support By Bypassing Custom Paged Attention by @MErkinSag in #9560
- [Frontend] Bad words sampling parameter by @Alvant in #9717
- [Model] Add classification Task with Qwen2ForSequenceClassification by @kakao-kevin-us in #9704
- [Misc] SpecDecodeWorker supports profiling by @Abatom in #9719
- [core] cudagraph output with tensor weak reference by @youkaichao in #9724
- [Misc] Upgrade to pytorch 2.5 by @bnellnm in #9588
- Fix cache management in "Close inactive issues and PRs" actions workflow by @hmellor in #9734
- [Bugfix] Fix load config when using bools by @madt2709 in #9533
- [Hardware][ROCM] using current_platform.is_rocm by @wangshuai09 in #9642
- [torch.compile] support moe models by @youkaichao in #9632
- Fix beam search eos by @robertgshaw2-neuralmagic in #9627
- [Bugfix] Fix ray instance detect issue by @yma11 in #9439
- [CI/Build] Adopt Mergify for auto-labeling PRs by @russellb in #9259
- [Model][VLM] Add multi-video support for LLaVA-Onevision by @litianjian in #8905
- [torch.compile] Adding "torch compile" annotations to some models by @CRZbulabula in #9758
- [misc] avoid circular import by @youkaichao in #9765
- [torch.compile] add deepseek v2 compile by @youkaichao in #9775
- [Doc] fix third-party model example by @russellb in #9771
- [Model][LoRA]LoRA support added for Qwen by @jeejeelee in #9622
- [Doc] Specify async engine args in docs by @DarkLight1337 in #9726
- [Bugfix] Use temporary directory in registry by @DarkLight1337 in #9721
- [Frontend] re-enable multi-modality input in the new beam search implementation by @FerdinandZhong in #9427
- [Model] Add BNB quantization support for Mllama by @Isotr0py in #9720
- [Hardware] using current_platform.seed_everything by @wangshuai09 in #9785
- [Misc] Add metrics for request queue time, forward time, and execute time by @Abatom in #9659
- Fix the log to correct guide user to install modelscope by @tastelikefeet in #9793
- [Bugfix] Use host argument to bind to interface by @svenseeberg in #9798
- [Misc]: Typo fix: Renaming classes (casualLM -> causalLM) by @yannicks1 in #9801
- [Model] Add LlamaEmbeddingModel as an embedding Implementation of LlamaModel by @jsato8094 in #9806
- [CI][Bugfix] Skip chameleon for transformers 4.46.1 by @mgoin in #9808
- [CI/Build] mergify: fix rules for ci/build label by @russellb in #9804
- [MISC] Set label value to timestamp over 0, to keep track of recent history by @coolkp in #9777
- [Bugfix][Frontend] Guard against bad token ids by @joerunde in #9634
- [Model] tool calling support for ibm-granite/granite-20b-functioncalling by @wseaton in #8339
- [Docs] Add notes about Snowflake Meetup by @simon-mo in #9814
- [Bugfix] Fix prefix strings for quantized VLMs by @mgoin in #9772
- [core][distributed] fix custom allreduce in pytorch 2.5 by @youkaichao in #9815
- Update README.md by @LiuXiaoxuanPKU in #9819
- [Bugfix][VLM] Make apply_fp8_linear work with >2D input by @mgoin in #9812
- [ci/build] Pin CI dependencies version with pip-compile by @khluu in #9810
- [Bugfix] Fix multi nodes TP+PP for XPU by @yma11 in #8884
- [Doc] Add the DCO to CONTRIBUTING.md by @russellb in #9803
- [torch.compile] rework compile control with piecewise cudagraph by @youkaichao in #9715
- [Misc] Specify minimum pynvml version by @jeejeelee in #9827
- [TPU] Correctly profile peak memory usage & Upgrade PyTorch XLA by @WoosukKwon in #9438
- [CI/Build] VLM Test Consolidation by @alex-jw-brooks in #9372
- [Model] Support math-shepherd-mistral-7b-prm model by @Went-Liang in #9697
- [Misc] Add chunked-prefill support on FlashInfer. by @elfiegg in #9781
- [Bugfix][core] replace heartbeat with pid check by @joerunde in #9818
- [Doc] link bug for multistep guided decoding by @joerunde in #9843
- [Neuron] Update Dockerfile.neuron to fix build failure by @hbikki in #9822
- [doc] update pp support by @youkaichao in #9853
- [CI/Build] Simplify exception trace in api server tests by @CRZbulabula in #9787
- [torch.compile] upgrade tests by @youkaichao in #9858
- [Misc][OpenAI] deprecate max_tokens in favor of new max_completion_tokens field for chat completion endpoint by @gcalmettes in #9837
- Revert "[Bugfix] Use host argument to bind to interface (#9798)" by @khluu in #9852
- [Model] Support quantization of Qwen2VisionTransformer for Qwen2-VL by @mgoin in #9817
- [Misc] Remove deprecated arg for cuda graph capture by @ywang96 in #9864
- [Doc] Update Qwen documentation by @jeejeelee in #9869
- [CI/Build] Add Model Tests for Qwen2-VL by @alex-jw-brooks in #9846
- [CI/Build] Adding a forced docker system prune to clean up space by @Alexei-V-Ivanov-AMD in #9849
- [Bugfix] Fix `illegal memory access` error with chunked prefill, prefix caching, block manager v2 and xformers enabled together by @sasha0552 in #9532
- [BugFix][Kernel] Fix Illegal memory access in causal_conv1d in H100 by @mzusman in #9838
- [ci/build] Configure dependabot to update pip dependencies by @khluu in #9811
- [Bugfix][Frontend] Reject guided decoding in multistep mode by @joerunde in #9892
- [torch.compile] directly register custom op by @youkaichao in #9896
- [Bugfix] Fix layer skip logic with bitsandbytes by @mgoin in #9887
- [torch.compile] rework test plans by @youkaichao in #9866
- [Model] Support bitsandbytes for MiniCPMV by @mgoin in #9891
- [torch.compile] Adding torch compile annotations to some models by @CRZbulabula in #9876
- [Doc] Update multi-input support by @DarkLight1337 in #9906
- [Frontend] Chat-based Embeddings API by @DarkLight1337 in #9759
- [CI/Build] Add Model Tests for PixtralHF by @mgoin in #9813
- [Frontend] Use a proper chat template for VLM2Vec by @DarkLight1337 in #9912
- [Bugfix] Fix edge cases for MistralTokenizer by @tjohnson31415 in #9625
- [Core] Refactor: Clean up unused argument preemption_mode in Scheduler._preempt by @andrejonasson in #9696
- [torch.compile] use interpreter with stable api from pytorch by @youkaichao in #9889
- [Bugfix/Core] Remove assertion for Flashinfer k_scale and v_scale by @pavanimajety in #9861
- [1/N] pass the complete config from engine to executor by @youkaichao in #9933
- [Bugfix] PicklingError on RayTaskError by @GeneDer in #9934
- Bump the patch-update group with 10 updates by @dependabot in #9897
- [Core][VLM] Add precise multi-modal placeholder tracking by @petersalas in #8346
- [ci/build] Have dependabot ignore pinned dependencies by @khluu in #9935
- [Encoder Decoder] Add flash_attn kernel support for encoder-decoder models by @sroy745 in #9559
- [torch.compile] fix cpu broken code by @youkaichao in #9947
- [Docs] Update Granite 3.0 models in supported models table by @njhill in #9930
- [Doc] Updated tpu-installation.rst with more details by @mikegre-google in #9926
- [2/N] executor pass the complete config to worker/modelrunner by @youkaichao in #9938
- [V1] Fix `EngineArgs` refactor on V1 by @robertgshaw2-neuralmagic in #9954
- [bugfix] fix chatglm dummy_data_for_glmv by @youkaichao in #9955
- [3/N] model runner pass the whole config to model by @youkaichao in #9958
- [CI/Build] Quoting around > by @nokados in #9956
- [torch.compile] Adding torch compile annotations to vision-language models by @CRZbulabula in #9946
- [bugfix] fix tests by @youkaichao in #9959
- [V1] Support per-request seed by @njhill in #9945
- [Model] Add support for H2OVL-Mississippi models by @cooleel in #9747
- [V1] Fix Configs by @robertgshaw2-neuralmagic in #9971
- [Bugfix] Fix MiniCPMV and Mllama BNB bug by @jeejeelee in #9917
- [Bugfix]Using the correct type hints by @gshtras in #9885
- [Misc] Compute query_start_loc/seq_start_loc on CPU by @zhengy001 in #9447
- [Bugfix] Fix E2EL mean and median stats by @daitran2k1 in #9984
- [Bugfix][OpenVINO] Fix circular reference #9939 by @MengqingCao in #9974
- [Frontend] Multi-Modality Support for Loading Local Image Files by @chaunceyjiang in #9915
- [4/N] make quant config first-class citizen by @youkaichao in #9978
- [Misc]Reduce BNB static variable by @jeejeelee in #9987
- [Model] factoring out MambaMixer out of Jamba by @mzusman in #8993
- [CI] Basic Integration Test For TPU by @robertgshaw2-neuralmagic in #9968
- [Bugfix][CI/Build][Hardware][AMD] Shard ID parameters in AMD tests running parallel jobs by @hissu-hyvarinen in #9279
- [Doc] Update VLM doc about loading from local files by @ywang96 in #9999
- [Bugfix] Fix `MQLLMEngine` hanging by @robertgshaw2-neuralmagic in #9973
- [Misc] Refactor benchmark_throughput.py by @lk-chen in #9779
- [Frontend] Add max_tokens prometheus metric by @tomeras91 in #9881
- [Bugfix] Upgrade to pytorch 2.5.1 by @bnellnm in #10001
- [4.5/N] bugfix for quant config in speculative decode by @youkaichao in #10007
- [Bugfix] Respect modules_to_not_convert within awq_marlin by @mgoin in #9895
- [Core] Use os.sched_yield in ShmRingBuffer instead of time.sleep by @tlrmchlsmth in #9994
- [Core] Make encoder-decoder inputs a nested structure to be more composable by @DarkLight1337 in #9604
- [Bugfix] Fixup Mamba by @tlrmchlsmth in #10004
- [BugFix] Lazy import ray by @GeneDer in #10021
- [Misc] vllm CLI flags should be ordered for better user readability by @chaunceyjiang in #10017
- [Frontend] Fix tcp port reservation for api server by @russellb in #10012
- Refactor TPU requirements file and pin build dependencies by @richardsliu in #10010
- [Misc] Add logging for CUDA memory by @yangalan123 in #10027
- [CI/Build] Limit github CI jobs based on files changed by @russellb in #9928
- [Model] Support quantization of PixtralHFTransformer for PixtralHF by @mgoin in #9921
- [Feature] Update benchmark_throughput.py to support image input by @lk-chen in #9851
- [Misc] Modify BNB parameter name by @jeejeelee in #9997
- [CI] Prune tests/models/decoder_only/language/* tests by @mgoin in #9940
- [CI] Prune back the number of tests in tests/kernels/* by @mgoin in #9932
- [bugfix] fix weak ref in piecewise cudagraph and tractable test by @youkaichao in #10048
- [Bugfix] Properly propagate trust_remote_code settings by @zifeitong in #10047
- [Bugfix] Fix pickle of input when async output processing is on by @wallashss in #9931
- [Bugfix][SpecDecode] kv corruption with bonus tokens in spec decode by @llsj14 in #9730
- [v1] reduce graph capture time for piecewise cudagraph by @youkaichao in #10059
- [Misc] Sort the list of embedding models by @DarkLight1337 in #10037
- [Model][OpenVINO] Fix regressions from #8346 by @petersalas in #10045
- [Bugfix] Fix edge-case crash when using chat with the Mistral Tekken Tokenizer by @tjohnson31415 in #10051
- [Bugfix] Gpt-j-6B patch kv_scale to k_scale path by @arakowsk-amd in #10063
- [Bugfix] Remove CustomChatCompletionContentPartParam multimodal input type by @zifeitong in #10054
- [V1] Integrate Piecewise CUDA graphs by @WoosukKwon in #10058
- [distributed] add function to create ipc buffers directly by @youkaichao in #10064
- [CI/Build] drop support for Python 3.8 EOL by @aarnphm in #8464
- [CI/Build] Fix large_gpu_mark reason by @Isotr0py in #10070
- [Hardware][Intel-Gaudi] Add Intel Gaudi (HPU) inference backend by @kzawora-intel in #6143
- [Hotfix] Fix ruff errors by @WoosukKwon in #10073
- [Model][LoRA]LoRA support added for LlamaEmbeddingModel by @jeejeelee in #10071
- [Model] Add Idefics3 support by @jeejeelee in #9767
- [Model][LoRA]LoRA support added for Qwen2VLForConditionalGeneration by @ericperfect in #10022
- Remove ScaledActivation for AWQ by @mgoin in #10057
- [CI/Build] Drop Python 3.8 support by @russellb in #10038
- [CI/Build] change conflict PR comment from mergify by @russellb in #10080
- [V1] Make v1 more testable by @joerunde in #9888
- [CI/Build] Always run the ruff workflow by @russellb in #10092
- [core][distributed] add stateless_init_process_group by @youkaichao in #10072
- [Bugfix] Fix FP8 torch._scaled_mm fallback for torch>2.5 with CUDA<12.4 by @mgoin in #10095
- [Misc][XPU] Upgrade to Pytorch 2.5 for xpu backend by @yma11 in #9823
- [Frontend] Adjust try/except blocks in API impl by @njhill in #10056
- [Hardware][CPU] Update torch 2.5 by @bigPYJ1151 in #9911
- [doc] add back Python 3.8 ABI by @youkaichao in #10100
- [V1][BugFix] Fix Generator construction in greedy + seed case by @njhill in #10097
- [Misc] Consolidate ModelConfig code related to HF config by @DarkLight1337 in #10104
- [CI/Build] re-add codespell to CI by @russellb in #10083
- [Doc] Improve benchmark documentation by @rafvasq in #9927
- [Core][Distributed] Refactor ipc buffer init in CustomAllreduce by @hanzhi713 in #10030
- [CI/Build] Improve mypy + python version matrix by @russellb in #10041
- Adds method to read the pooling types from model's files by @flaviabeo in #9506
- [Frontend] Fix multiple values for keyword argument error (#10075) by @DIYer22 in #10076
- [Hardware][CPU][bugfix] Fix half dtype support on AVX2-only target by @bigPYJ1151 in #10108
- [Bugfix] Make image processor respect `mm_processor_kwargs` for Qwen2-VL by @li-plus in #10112
- [Misc] Add Gamma-Distribution Request Generation Support for Serving Benchmark. by @spliii in #10105
- [Frontend] Tool calling parser for Granite 3.0 models by @maxdebayser in #9027
- [Feature] [Spec decode]: Combine chunked prefill with speculative decoding by @NickLucche in #9291
- [CI/Build] Always run mypy by @russellb in #10122
- [CI/Build] Add shell script linting using shellcheck by @russellb in #7925
- [CI/Build] Automate PR body text cleanup by @russellb in #10082
- Bump actions/setup-python from 5.2.0 to 5.3.0 by @dependabot in #9745
- Online video support for VLMs by @litianjian in #10020
- Bump actions/checkout from 4.2.1 to 4.2.2 by @dependabot in #9746
- [Misc] Add environment variables collection in collect_env.py tool by @ycool in #9293
- [V1] Add all_token_ids attribute to Request by @WoosukKwon in #10135
- [V1] Prefix caching (take 2) by @comaniac in #9972
- [CI/Build] Give PR cleanup job PR write access by @russellb in #10139
- [Doc] Update FAQ links in spec_decode.rst by @whyiug in #9662
- [Bugfix] Add error handling when server cannot respond any valid tokens by @DearPlanet in #5895
- [Misc] Fix ImportError causing by triton by @MengqingCao in #9493
- [Doc] Move CONTRIBUTING to docs site by @russellb in #9924
- Fixes a typo about 'max_decode_seq_len' which causes crashes with cuda graph. by @sighingnow in #9285
- Add hf_transfer to testing image by @mgoin in #10096
- [Misc] Fix typo in #5895 by @DarkLight1337 in #10145
- [Bugfix][XPU] Fix xpu tp by introducing XpuCommunicator by @yma11 in #10144
- [Model] Expose size to Idefics3 as mm_processor_kwargs by @Isotr0py in #10146
- [V1]Enable APC by default only for text models by @ywang96 in #10148
- [CI/Build] Update CPU tests to include all "standard" tests by @DarkLight1337 in #5481
- Fix edge case Mistral tokenizer by @patrickvonplaten in #10152
- Disable spec-decode + chunked-prefill for draft models with tensor parallelism > 1 by @sroy745 in #10136
- [Misc] Improve Web UI by @rafvasq in #10090
- [V1] Fix non-cudagraph op name by @WoosukKwon in #10166
- [CI/Build] Ignore .gitignored files for shellcheck by @ProExpertProg in #10162
- Rename vllm.logging to vllm.logging_utils by @flozi00 in #10134
- [torch.compile] Fuse RMSNorm with quant by @ProExpertProg in #9138
- [Bugfix] Fix SymIntArrayRef expected to contain only concrete integers by @bnellnm in #10170
- [Kernel][Triton] Add Triton implementation for scaled_mm_triton to support fp8 and int8 SmoothQuant, symmetric case by @rasmith in #9857
- [CI/Build] Adding timeout in CPU CI to avoid CPU test queue blocking by @bigPYJ1151 in #6892
- [0/N] Rename `MultiModalInputs` to `MultiModalKwargs` by @DarkLight1337 in #10040
- [Bugfix] Ignore GPTQ quantization of Qwen2-VL visual module by @mgoin in #10169
- [CI/Build] Fix VLM broadcast tests `tensor_parallel_size` passing by @Isotr0py in #10161
- [Doc] Adjust RunLLM location by @DarkLight1337 in #10176
- [5/N] pass the whole config to model by @youkaichao in #9983
- [CI/Build] Add run-hpu-test.sh script by @xuechendi in #10167
- [Bugfix] Enable some fp8 and quantized fullgraph tests by @bnellnm in #10171
- [bugfix] fix broken tests of mlp speculator by @youkaichao in #10177
- [doc] explaining the integration with huggingface by @youkaichao in #10173
- bugfix: fix the bug that stream generate not work by @caijizhuo in #2756
- [Frontend] add `add_request_id` middleware by @cjackal in #9594
- [Frontend][Core] Override HF `config.json` via CLI by @KrishnaM251 in #5836
- [CI/Build] Split up models tests by @DarkLight1337 in #10069
- [ci][build] limit cmake version by @youkaichao in #10188
- [Doc] Fix typo error in CONTRIBUTING.md by @FuryMartin in #10190
- [doc] Polish the integration with huggingface doc by @CRZbulabula in #10195
- [Misc] small fixes to function tracing file path by @ShawnD200 in #9543
- [misc] improve cloudpickle registration and tests by @youkaichao in #10202
- [Doc] Fix typo error in vllm/entrypoints/openai/cli_args.py by @yansh97 in #10196
- [doc] improve debugging code by @youkaichao in #10206
- [6/N] pass whole config to inner model by @youkaichao in #10205
- Bump the patch-update group with 5 updates by @dependabot in #10210
- [Hardware][CPU] Add embedding models support for CPU backend by @Isotr0py in #10193
- [LoRA][Kernel] Remove the unused libentry module by @jeejeelee in #10214
- [V1] Allow `tokenizer_mode` and `trust_remote_code` for Detokenizer by @ywang96 in #10211
- [Bugfix][Hardware][CPU] Fix broken encoder-decoder CPU runner by @Isotr0py in #10218
- [Metrics] add more metrics by @HarryWu99 in #4464
- [Doc] fix doc string typo in block_manager `swap_out` function by @yyccli in #10212
- [core][distributed] add stateless process group by @youkaichao in #10216
- Bump actions/setup-python from 5.2.0 to 5.3.0 by @dependabot in #10209
- [V1] Fix detokenizer ports by @WoosukKwon in #10224
- [V1] Do not use inductor for piecewise CUDA graphs by @WoosukKwon in #10225
- [v1][torch.compile] support managing cudagraph buffer by @youkaichao in #10203
- [V1] Use custom ops for piecewise CUDA graphs by @WoosukKwon in #10227
- Add docs on serving with Llama Stack by @terrytangyuan in #10183
- [misc][distributed] auto port selection and disable tests by @youkaichao in #10226
- [V1] Enable custom ops with piecewise CUDA graphs by @WoosukKwon in #10228
- Make shutil rename in python_only_dev by @shcheglovnd in #10233
- [V1] `AsyncLLM` Implementation by @robertgshaw2-neuralmagic in #9826
- [doc] update debugging guide by @youkaichao in #10236
- [Doc] Update help text for `--distributed-executor-backend` by @russellb in #10231
- [1/N] torch.compile user interface design by @youkaichao in #10237
- [Misc][LoRA] Replace hardcoded cuda device with configurable argument by @jeejeelee in #10223
- Splitting attention kernel file by @maleksan85 in #10091
- [doc] explain the class hierarchy in vLLM by @youkaichao in #10240
- [CI][CPU]refactor CPU tests to allow to bind with different cores by @zhouyuan in #10222
- [BugFix] Do not raise a `ValueError` when `tool_choice` is set to the supported `none` option and `tools` are not defined. by @gcalmettes in #10000
- [Misc] Fix Idefics3Model argument by @jeejeelee in #10255
- [Bugfix] Fix QwenModel argument by @DamonFool in #10262
- [Frontend] Add per-request number of cached token stats by @zifeitong in #10174
- [V1] Use pickle for serializing EngineCoreRequest & Add multimodal inputs to EngineCoreRequest by @WoosukKwon in #10245
- [Encoder Decoder] Update Mllama to run with both FlashAttention and XFormers by @sroy745 in #9982
- [LoRA] Adds support for bias in LoRA by @followumesh in #5733
- [V1] Enable Inductor when using piecewise CUDA graphs by @WoosukKwon in #10268
- [doc] fix location of runllm widget by @youkaichao in #10266
- [doc] improve debugging doc by @youkaichao in #10270
- Revert "[ci][build] limit cmake version" by @youkaichao in #10271
- [V1] Fix CI tests on V1 engine by @WoosukKwon in #10272
- [core][distributed] use tcp store directly by @youkaichao in #10275
- [V1] Support VLMs with fine-grained scheduling by @WoosukKwon in #9871
- Bump to compressed-tensors v0.8.0 by @dsikka in #10279
- [Doc] Fix typo in arg_utils.py by @xyang16 in #10264
- [Model] Add support for Qwen2-VL video embeddings input & multiple image embeddings input with varied resolutions by @imkero in #10221
- [Model] Adding Support for Qwen2VL as an Embedding Model. Using MrLight/dse-qwen2-2b-mrl-v1 by @FurtherAI in #9944
- [Core] Flashinfer - Remove advance step size restriction by @pavanimajety in #10282
- [Model][LoRA]LoRA support added for idefics3 by @B-201 in #10281
- [V1] Add missing tokenizer options for `Detokenizer` by @ywang96 in #10288
- [1/N] Initial prototype for multi-modal processor by @DarkLight1337 in #10044
- [Bugfix] bitsandbytes models fail to run pipeline parallel by @HoangCongDuc in #10200
- [Bugfix] Fix tensor parallel for qwen2 classification model by @Isotr0py in #10297
- [misc] error early for old-style class by @youkaichao in #10304
- [Misc] format.sh: Simplify tool_version_check by @russellb in #10305
- [Frontend] Pythonic tool parser by @mdepinet in #9859
- [BugFix]: properly deserialize `tool_calls` iterator before processing by mistral-common when MistralTokenizer is used by @gcalmettes in #9951
- [Model] Add BNB quantization support for Idefics3 by @B-201 in #10310
- [ci][distributed] disable hanging tests by @youkaichao in #10317
- [CI/Build] Fix CPU CI online inference timeout by @Isotr0py in #10314
- [CI/Build] Make shellcheck happy by @DarkLight1337 in #10285
- [Docs] Publish meetup slides by @WoosukKwon in #10331
- Support Roberta embedding models by @maxdebayser in #9387
- [Perf] Reduce peak memory usage of llama by @andoorve in #10339
- [Bugfix] use AF_INET6 instead of AF_INET for OpenAI Compatible Server by @jxpxxzj in #9583
- [Tool parsing] Improve / correct mistral tool parsing by @patrickvonplaten in #10333
- [Bugfix] Fix unable to load some models by @DarkLight1337 in #10312
- [bugfix] Fix static asymmetric quantization case by @ProExpertProg in #10334
- [Misc] Change RedundantReshapesPass and FusionPass logging from info to debug by @tlrmchlsmth in #10308
- [Model] Support Qwen2 embeddings and use tags to select model tests by @DarkLight1337 in #10184
- [Bugfix] Qwen-vl output is inconsistent in speculative decoding by @skylee-01 in #10350
- [Misc] Consolidate pooler config overrides by @DarkLight1337 in #10351
- [Build] skip renaming files for release wheels pipeline by @simon-mo in #9671
New Contributors
- @gracehonv made their first contribution in #9349
- @streaver91 made their first contribution in #9396
- @wukaixingxp made their first contribution in #9013
- @sssrijan-amazon made their first contribution in #9380
- @coolkp made their first contribution in #9477
- @yue-anyscale made their first contribution in #9478
- @dhiaEddineRhaiem made their first contribution in #9325
- @yudian0504 made their first contribution in #9549
- @ngrozae made their first contribution in #9552
- @Falko1 made their first contribution in #9503
- @wangshuai09 made their first contribution in #9536
- @gopalsarda made their first contribution in #9580
- @guoyuhong made their first contribution in #9550
- @JArnoldAMD made their first contribution in #9529
- @yuleil made their first contribution in #8234
- @sethkimmel3 made their first contribution in #7889
- @MengqingCao made their first contribution in #9605
- @CRZbulabula made their first contribution in #9614
- @faychu made their first contribution in #9248
- @vrdn-23 made their first contribution in #9358
- @willmj made their first contribution in #9673
- @samos123 made their first contribution in #9709
- @MErkinSag made their first contribution in #9560
- @Alvant made their first contribution in #9717
- @kakao-kevin-us made their first contribution in #9704
- @madt2709 made their first contribution in #9533
- @FerdinandZhong made their first contribution in #9427
- @svenseeberg made their first contribution in #9798
- @yannicks1 made their first contribution in #9801
- @wseaton made their first contribution in #8339
- @Went-Liang made their first contribution in #9697
- @andrejonasson made their first contribution in #9696
- @GeneDer made their first contribution in #9934
- @mikegre-google made their first contribution in #9926
- @nokados made their first contribution in #9956
- @cooleel made their first contribution in #9747
- @zhengy001 made their first contribution in #9447
- @daitran2k1 made their first contribution in #9984
- @chaunceyjiang made their first contribution in #9915
- @hissu-hyvarinen made their first contribution in #9279
- @lk-chen made their first contribution in #9779
- @yangalan123 made their first contribution in #10027
- @llsj14 made their first contribution in #9730
- @arakowsk-amd made their first contribution in #10063
- @kzawora-intel made their first contribution in #6143
- @DIYer22 made their first contribution in #10076
- @li-plus made their first contribution in #10112
- @spliii made their first contribution in #10105
- @flozi00 made their first contribution in #10134
- @xuechendi made their first contribution in #10167
- @caijizhuo made their first contribution in #2756
- @cjackal made their first contribution in #9594
- @KrishnaM251 made their first contribution in #5836
- @FuryMartin made their first contribution in #10190
- @ShawnD200 made their first contribution in #9543
- @yansh97 made their first contribution in #10196
- @yyccli made their first contribution in #10212
- @shcheglovnd made their first contribution in #10233
- @maleksan85 made their first contribution in #10091
- @followumesh made their first contribution in #5733
- @imkero made their first contribution in #10221
- @B-201 made their first contribution in #10281
- @HoangCongDuc made their first contribution in #10200
- @mdepinet made their first contribution in #9859
- @jxpxxzj made their first contribution in #9583
- @skylee-01 made their first contribution in #10350
Full Changelog: v0.6.3...v0.6.4