v0.6.4
Highlights
- Significant progress in the V1 engine core refactor (#9826, #10135, #10288, #10211, #10225, #10228, #10268, #9954, #10272, #9971, #10224, #10166, #9289, #10058, #9888, #9972, #10059, #9945, #9679, #9871, #10227, #10245, #9629, #10097, #10203, #10148). You can find more details on the design and upcoming plans in our recent meetup slides.
- Significant progress in `torch.compile` support. Many models now work with torch.compile and TorchInductor; see our meetup slides for more details, and the minimal usage sketch after this list. (#9775, #9614, #9639, #9641, #9876, #9946, #9589, #9896, #9637, #9300, #9947, #9138, #9715, #9866, #9632, #9858, #9889)
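A minimal, hedged sketch of trying the torch.compile path with the offline `LLM` API. The `VLLM_TORCH_COMPILE_LEVEL` environment variable and its value are assumptions about this release and may not match the actual knob; the model name is purely illustrative.

```python
# Hedged sketch: enabling the torch.compile path for offline inference.
# ASSUMPTION: VLLM_TORCH_COMPILE_LEVEL selects the compilation level in this
# release; verify the exact knob against your installed version.
import os

os.environ["VLLM_TORCH_COMPILE_LEVEL"] = "3"  # assumed: 0 = eager, higher = more compilation

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=32)
outputs = llm.generate(["vLLM compiles models with"], params)
print(outputs[0].outputs[0].text)
```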
Model Support
- New LLMs and VLMs: Idefics3 (#9767), H2OVL-Mississippi (#9747), Qwen2-Audio (#9248), Pixtral models in the HF Transformers format (#9036), FalconMamba (#9325), Florence-2 language backbone (#9555)
- New encoder-only embedding models: BERT (#9056), RoBERTa & XLM-RoBERTa (#9387)
- Expanded task support: Llama embeddings (#9806), Math-Shepherd (Mistral reward modeling) (#9697), Qwen2 classification (#9704), Qwen2 embeddings (#10184), VLM2Vec (Phi-3-Vision embeddings) (#9303), E5-V (LLaVA-NeXT embeddings) (#9576), Qwen2-VL embeddings (#9944); see the usage sketch after this list
- Tool calling parser for Granite 3.0 (#9027), Jamba (#9154), granite-20b-functioncalling (#8339)
- LoRA support for Granite 3.0 MoE (#9673), Idefics3 (#10281), Llama embeddings (#10071), Qwen (#9622), Qwen2-VL (#10022)
- BNB quantization support for Idefics3 (#10310), Mllama (#9720), Qwen2 (#9467, #9574), MiniCPMV (#9891)
- Unified multi-modal processor for VLM (#10040, #10044)
- Simplify model interface (#9933, #10237, #9938, #9958, #10007, #9978, #9983, #10205)
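A small, hedged sketch of the expanded embedding task support using the offline `LLM` API. The `task="embedding"` argument and the model name are assumptions for illustration and may not match the exact spelling in this release.

```python
# Hedged sketch: running a text-embedding model offline.
# ASSUMPTION: task="embedding" selects the embedding task for models that
# support both generation and embedding (#9424); verify the exact argument.
from vllm import LLM

llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embedding")
outputs = llm.encode(["What is the capital of France?"])

# Each output carries one embedding vector for its prompt.
print(len(outputs[0].outputs.embedding))
```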
Hardware Support
- Gaudi: Add Intel Gaudi (HPU) inference backend (#6143)
- CPU: Add embedding models support for CPU backend (#10193)
- TPU: Correctly profile peak memory usage & Upgrade PyTorch XLA (#9438)
- Triton: Add Triton implementation for scaled_mm_triton to support fp8 and int8 SmoothQuant, symmetric case (#9857)
Performance
Engine Core
- Override HF `config.json` via CLI (#5836); see the sketch after this list
- Add goodput metric support (#9338)
- Move parallel sampling out from vllm core, paving way for V1 engine (#9302)
- Add stateless process group for easier integration with RLHF and disaggregated prefill (#10216, #10072)
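A hedged sketch of overriding fields from a model's `config.json` when constructing the engine. The `hf_overrides` keyword argument and the overridden field are assumptions for illustration; check the CLI/engine arguments of your installed version for the exact spelling.

```python
# Hedged sketch: overriding a value from the model's Hugging Face config.json.
# ASSUMPTION: the LLM constructor accepts an hf_overrides dict mirroring the
# new CLI option (#5836); the rope_theta value is purely illustrative.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    hf_overrides={"rope_theta": 1_000_000.0},  # assumed override key/value
)
print(llm.llm_engine.model_config.hf_config.rope_theta)
```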
Others
- Improvements to the pull request experience with DCO, mergify, stale bot, etc. (#9436, #9512, #9513, #9259, #10082, #10285, #9803)
- Dropped support for Python 3.8 (#10038, #8464)
- Basic Integration Test For TPU (#9968)
- Document the class hierarchy in vLLM (#10240) and explain the integration with Hugging Face (#10173)
- Benchmark throughput now supports image input (#9851)
What's Changed
- [TPU] Fix TPU SMEM OOM by Pallas paged attention kernel by @WoosukKwon in #9350
- [Frontend] merge beam search implementations by @LunrEclipse in #9296
- [Model] Make llama3.2 support multiple and interleaved images by @xiangxu-google in #9095
- [Bugfix] Clean up some cruft in mamba.py by @tlrmchlsmth in #9343
- [Frontend] Clarify model_type error messages by @stevegrubb in #9345
- [Doc] Fix code formatting in spec_decode.rst by @mgoin in #9348
- [Bugfix] Update InternVL input mapper to support image embeds by @hhzhang16 in #9351
- [BugFix] Fix chat API continuous usage stats by @njhill in #9357
- pass ignore_eos parameter to all benchmark_serving calls by @gracehonv in #9349
- [Misc] Directly use compressed-tensors for checkpoint definitions by @mgoin in #8909
- [Bugfix] Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids by @CatherineSue in #9034
- [Bugfix][CI/Build] Fix CUDA 11.8 Build by @LucasWilkinson in #9386
- [Bugfix] Molmo text-only input bug fix by @mrsalehi in #9397
- [Misc] Standardize RoPE handling for Qwen2-VL by @DarkLight1337 in #9250
- [Model] VLM2Vec, the first multimodal embedding model in vLLM by @DarkLight1337 in #9303
- [CI/Build] Test VLM embeddings by @DarkLight1337 in #9406
- [Core] Rename input data types by @DarkLight1337 in #8688
- [Misc] Consolidate example usage of OpenAI client for multimodal models by @ywang96 in #9412
- [Model] Support SDPA attention for Molmo vision backbone by @Isotr0py in #9410
- Support mistral interleaved attn by @patrickvonplaten in #9414
- [Kernel][Model] Improve continuous batching for Jamba and Mamba by @mzusman in #9189
- [Model][Bugfix] Add FATReLU activation and support for openbmb/MiniCPM-S-1B-sft by @streaver91 in #9396
- [Performance][Spec Decode] Optimize ngram lookup performance by @LiuXiaoxuanPKU in #9333
- [CI/Build] mypy: Resolve some errors from checking vllm/engine by @russellb in #9267
- [Bugfix][Kernel] Prevent integer overflow in fp8 dynamic per-token quantize kernel by @tlrmchlsmth in #9425
- [BugFix] [Kernel] Fix GPU SEGV occurring in int8 kernels by @rasmith in #9391
- Add notes on the use of Slack by @terrytangyuan in #9442
- [Kernel] Add Exllama as a backend for compressed-tensors by @LucasWilkinson in #9395
- [Misc] Print stack trace using `logger.exception` by @DarkLight1337 in #9461
- [misc] CUDA Time Layerwise Profiler by @LucasWilkinson in #8337
- [Bugfix] Allow prefill of assistant response when using `mistral_common` by @sasha0552 in #9446
- [TPU] Call torch._sync(param) during weight loading by @WoosukKwon in #9437
- [Hardware][CPU] compressed-tensor INT8 W8A8 AZP support by @bigPYJ1151 in #9344
- [Core] Deprecating block manager v1 and make block manager v2 default by @KuntaiDu in #8704
- [CI/Build] remove .github from .dockerignore, add dirty repo check by @dtrifiro in #9375
- [Misc] Remove commit id file by @DarkLight1337 in #9470
- [torch.compile] Fine-grained CustomOp enabling mechanism by @ProExpertProg in #9300
- [Bugfix] Fix support for dimension like integers and ScalarType by @bnellnm in #9299
- [Bugfix] Add random_seed to sample_hf_requests in benchmark_serving script by @wukaixingxp in #9013
- [Bugfix] Print warnings related to `mistral_common` tokenizer only once by @sasha0552 in #9468
- [Hardware][Neuron] Simplify model load for transformers-neuronx library by @sssrijan-amazon in #9380
- Support `BERTModel` (first encoder-only embedding model) by @robertgshaw2-neuralmagic in #9056
- [BugFix] Stop silent failures on compressed-tensors parsing by @dsikka in #9381
- [Bugfix][Core] Use torch.cuda.memory_stats() to profile peak memory usage by @joerunde in #9352
- [Qwen2.5] Support bnb quant for Qwen2.5 by @blueyo0 in #9467
- [CI/Build] Use commit hash references for github actions by @russellb in #9430
- [BugFix] Typing fixes to RequestOutput.prompt and beam search by @njhill in #9473
- [Frontend][Feature] Add jamba tool parser by @tomeras91 in #9154
- [BugFix] Fix and simplify completion API usage streaming by @njhill in #9475
- [CI/Build] Fix lint errors in mistral tokenizer by @DarkLight1337 in #9504
- [Bugfix] Fix offline_inference_with_prefix.py by @tlrmchlsmth in #9505
- [Misc] benchmark: Add option to set max concurrency by @russellb in #9390
- [Model] Add user-configurable task for models that support both generation and embedding by @DarkLight1337 in #9424
- [CI/Build] Add error matching config for mypy by @russellb in #9512
- [Model] Support Pixtral models in the HF Transformers format by @mgoin in #9036
- [MISC] Add lora requests to metrics by @coolkp in #9477
- [MISC] Consolidate cleanup() and refactor offline_inference_with_prefix.py by @comaniac in #9510
- [Kernel] Add env variable to force flashinfer backend to enable tensor cores by @tdoublep in #9497
- [Bugfix] Fix offline mode when using `mistral_common` by @sasha0552 in #9457
- 🐛 fix torch memory profiling by @joerunde in #9516
- [Frontend] Avoid creating guided decoding LogitsProcessor unnecessarily by @njhill in #9521
- [Doc] update gpu-memory-utilization flag docs by @joerunde in #9507
- [CI/Build] Add error matching for ruff output by @russellb in #9513
- [CI/Build] Configure matcher for actionlint workflow by @russellb in #9511
- [Frontend] Support simpler image input format by @yue-anyscale in #9478
- [Bugfix] Fix missing task for speculative decoding by @DarkLight1337 in #9524
- [Model][Pixtral] Optimizations for input_processor_for_pixtral_hf by @mgoin in #9514
- [Bugfix] Pass json-schema to GuidedDecodingParams and make test stronger by @heheda12345 in #9530
- [Model][Pixtral] Use memory_efficient_attention for PixtralHFVision by @mgoin in #9520
- [Kernel] Support sliding window in flash attention backend by @heheda12345 in #9403
- [Frontend][Misc] Goodput metric support by @Imss27 in #9338
- [CI/Build] Split up decoder-only LM tests by @DarkLight1337 in #9488
- [Doc] Consistent naming of attention backends by @tdoublep in #9498
- [Model] FalconMamba Support by @dhiaEddineRhaiem in #9325
- [Bugfix][Misc]: fix graph capture for decoder by @yudian0504 in #9549
- [BugFix] Use correct python3 binary in Docker.ppc64le entrypoint by @varad-ahirwadkar in #9492
- [Model][Bugfix] Fix batching with multi-image in PixtralHF by @mgoin in #9518
- [Frontend] Reduce frequency of client cancellation checking by @njhill in #7959
- [doc] fix format by @youkaichao in #9562
- [BugFix] Update draft model TP size check to allow matching target TP size by @njhill in #9394
- [Frontend] Don't log duplicate error stacktrace for every request in the batch by @wallashss in #9023
- [CI] Make format checker error message more user-friendly by using emoji by @KuntaiDu in #9564
- 🐛 Fixup more test failures from memory profiling by @joerunde in #9563
- [core] move parallel sampling out from vllm core by @youkaichao in #9302
- [Bugfix]: serialize config instances by value when using --trust-remote-code by @tjohnson31415 in #6751
- [CI/Build] Remove unnecessary `fork_new_process` by @DarkLight1337 in #9484
- [Bugfix][OpenVINO] fix_dockerfile_openvino by @ngrozae in #9552
- [Bugfix]: phi.py get rope_theta from config file by @Falko1 in #9503
- [CI/Build] Replaced some models on tests for smaller ones by @wallashss in #9570
- [Core] Remove evictor_v1 by @KuntaiDu in #9572
- [Doc] Use shell code-blocks and fix section headers by @rafvasq in #9508
- support TP in qwen2 bnb by @chenqianfzh in #9574
- [Hardware][CPU] using current_platform.is_cpu by @wangshuai09 in #9536
- [V1] Implement vLLM V1 [1/N] by @WoosukKwon in #9289
- [CI/Build][LoRA] Temporarily fix long context failure issue by @jeejeelee in #9579
- [Neuron] [Bugfix] Fix neuron startup by @xendo in #9374
- [Model][VLM] Initialize support for Mono-InternVL model by @Isotr0py in #9528
- [Bugfix] Eagle: change config name for fc bias by @gopalsarda in #9580
- [Hardware][Intel CPU][DOC] Update docs for CPU backend by @zhouyuan in #6212
- [Frontend] Support custom request_id from request by @guoyuhong in #9550
- [BugFix] Prevent exporting duplicate OpenTelemetry spans by @ronensc in #9017
- [torch.compile] auto infer dynamic_arg_dims from type annotation by @youkaichao in #9589
- [Bugfix] fix detokenizer shallow copy by @aurickq in #5919
- [Misc] Make benchmarks use EngineArgs by @JArnoldAMD in #9529
- [Bugfix] Fix spurious "No compiled cutlass_scaled_mm ..." for W8A8 on Turing by @LucasWilkinson in #9487
- [BugFix] Fix metrics error for --num-scheduler-steps > 1 by @yuleil in #8234
- [Doc]: Update tensorizer docs to include vllm[tensorizer] by @sethkimmel3 in #7889
- [Bugfix] Generate exactly input_len tokens in benchmark_throughput by @heheda12345 in #9592
- [Misc] Add an env var VLLM_LOGGING_PREFIX, if set, it will be prepend to all logging messages by @sfc-gh-zhwang in #9590
- [Model] Support E5-V by @DarkLight1337 in #9576
- [Build] Fix `FetchContent` multiple build issue by @ProExpertProg in #9596
- [Hardware][XPU] using current_platform.is_xpu by @MengqingCao in #9605
- [Model] Initialize Florence-2 language backbone support by @Isotr0py in #9555
- [VLM] Post-layernorm override and quant config in vision encoder by @DarkLight1337 in #9217
- [Model] Add min_pixels / max_pixels to Qwen2VL as mm_processor_kwargs by @alex-jw-brooks in #9612
- [Bugfix] Fix `_init_vision_model` in NVLM_D model by @DarkLight1337 in #9611
- [misc] comment to avoid future confusion about baichuan by @youkaichao in #9620
- [Bugfix] Fix divide by zero when serving Mamba models by @tlrmchlsmth in #9617
- [Misc] Separate total and output tokens in benchmark_throughput.py by @mgoin in #8914
- [torch.compile] Adding torch compile annotations to some models by @CRZbulabula in #9614
- [Frontend] Enable Online Multi-image Support for MLlama by @alex-jw-brooks in #9393
- [Model] Add Qwen2-Audio model support by @faychu in #9248
- [CI/Build] Add bot to close stale issues and PRs by @russellb in #9436
- [Bugfix][Model] Fix Mllama SDPA illegal memory access for batched multi-image by @mgoin in #9626
- [Bugfix] Use "vision_model" prefix for MllamaVisionModel by @mgoin in #9628
- [Bugfix]: Make chat content text allow type content by @vrdn-23 in #9358
- [XPU] avoid triton import for xpu by @yma11 in #9440
- [Bugfix] Fix PP for ChatGLM and Molmo, and weight loading for Qwen2.5-Math-RM by @DarkLight1337 in #9422
- [V1][Bugfix] Clean up requests when aborted by @WoosukKwon in #9629
- [core] simplify seq group code by @youkaichao in #9569
- [torch.compile] Adding torch compile annotations to some models by @CRZbulabula in #9639
- [Kernel] add kernel for FATReLU by @jeejeelee in #9610
- [torch.compile] expanding support and fix allgather compilation by @CRZbulabula in #9637
- [Doc] Move additional tips/notes to the top by @DarkLight1337 in #9647
- [Bugfix]Disable the post_norm layer of the vision encoder for LLaVA models by @litianjian in #9653
- Increase operation per run limit for "Close inactive issues and PRs" workflow by @hmellor in #9661
- [torch.compile] Adding torch compile annotations to some models by @CRZbulabula in #9641
- [CI/Build] Fix VLM test failures when using transformers v4.46 by @DarkLight1337 in #9666
- [Model] Compute Llava Next Max Tokens / Dummy Data From Gridpoints by @alex-jw-brooks in #9650
- [Log][Bugfix] Fix default value check for `image_url.detail` by @mgoin in #9663
- [Performance][Kernel] Fused_moe Performance Improvement by @charlifu in #9384
- [Bugfix] Remove xformers requirement for Pixtral by @mgoin in #9597
- [ci/Build] Skip Chameleon for transformers 4.46.0 on broadcast test #9675 by @khluu in #9676
- [Model] add a lora module for granite 3.0 MoE models by @willmj in #9673
- [V1] Support sliding window attention by @WoosukKwon in #9679
- [Bugfix] Fix compressed_tensors_moe bad config.strategy by @mgoin in #9677
- [Doc] Improve quickstart documentation by @rafvasq in #9256
- [Bugfix] Fix crash with llama 3.2 vision models and guided decoding by @tjohnson31415 in #9631
- [Bugfix] Streaming continuous_usage_stats default to False by @samos123 in #9709
- [Hardware][openvino] is_openvino --> current_platform.is_openvino by @MengqingCao in #9716
- Fix: MI100 Support By Bypassing Custom Paged Attention by @MErkinSag in #9560
- [Frontend] Bad words sampling parameter by @Alvant in #9717
- [Model] Add classification Task with Qwen2ForSequenceClassification by @kakao-kevin-us in #9704
- [Misc] SpecDecodeWorker supports profiling by @Abatom in #9719
- [core] cudagraph output with tensor weak reference by @youkaichao in #9724
- [Misc] Upgrade to pytorch 2.5 by @bnellnm in #9588
- Fix cache management in "Close inactive issues and PRs" actions workflow by @hmellor in #9734
- [Bugfix] Fix load config when using bools by @madt2709 in #9533
- [Hardware][ROCM] using current_platform.is_rocm by @wangshuai09 in #9642
- [torch.compile] support moe models by @youkaichao in #9632
- Fix beam search eos by @robertgshaw2-neuralmagic in #9627
- [Bugfix] Fix ray instance detect issue by @yma11 in #9439
- [CI/Build] Adopt Mergify for auto-labeling PRs by @russellb in #9259
- [Model][VLM] Add multi-video support for LLaVA-Onevision by @litianjian in #8905
- [torch.compile] Adding "torch compile" annotations to some models by @CRZbulabula in #9758
- [misc] avoid circular import by @youkaichao in #9765
- [torch.compile] add deepseek v2 compile by @youkaichao in #9775
- [Doc] fix third-party model example by @russellb in #9771
- [Model][LoRA]LoRA support added for Qwen by @jeejeelee in #9622
- [Doc] Specify async engine args in docs by @DarkLight1337 in #9726
- [Bugfix] Use temporary directory in registry by @DarkLight1337 in #9721
- [Frontend] re-enable multi-modality input in the new beam search implementation by @FerdinandZhong in #9427
- [Model] Add BNB quantization support for Mllama by @Isotr0py in #9720
- [Hardware] using current_platform.seed_everything by @wangshuai09 in #9785
- [Misc] Add metrics for request queue time, forward time, and execute time by @Abatom in #9659
- Fix the log to correct guide user to install modelscope by @tastelikefeet in #9793
- [Bugfix] Use host argument to bind to interface by @svenseeberg in #9798
- [Misc]: Typo fix: Renaming classes (casualLM -> causalLM) by @yannicks1 in #9801
- [Model] Add LlamaEmbeddingModel as an embedding Implementation of LlamaModel by @jsato8094 in #9806
- [CI][Bugfix] Skip chameleon for transformers 4.46.1 by @mgoin in #9808
- [CI/Build] mergify: fix rules for ci/build label by @russellb in #9804
- [MISC] Set label value to timestamp over 0, to keep track of recent history by @coolkp in #9777
- [Bugfix][Frontend] Guard against bad token ids by @joerunde in #9634
- [Model] tool calling support for ibm-granite/granite-20b-functioncalling by @wseaton in #8339
- [Docs] Add notes about Snowflake Meetup by @simon-mo in #9814
- [Bugfix] Fix prefix strings for quantized VLMs by @mgoin in #9772
- [core][distributed] fix custom allreduce in pytorch 2.5 by @youkaichao in #9815
- Update README.md by @LiuXiaoxuanPKU in #9819
- [Bugfix][VLM] Make apply_fp8_linear work with >2D input by @mgoin in #9812
- [ci/build] Pin CI dependencies version with pip-compile by @khluu in #9810
- [Bugfix] Fix multi nodes TP+PP for XPU by @yma11 in #8884
- [Doc] Add the DCO to CONTRIBUTING.md by @russellb in #9803
- [torch.compile] rework compile control with piecewise cudagraph by @youkaichao in #9715
- [Misc] Specify minimum pynvml version by @jeejeelee in #9827
- [TPU] Correctly profile peak memory usage & Upgrade PyTorch XLA by @WoosukKwon in #9438
- [CI/Build] VLM Test Consolidation by @alex-jw-brooks in #9372
- [Model] Support math-shepherd-mistral-7b-prm model by @Went-Liang in #9697
- [Misc] Add chunked-prefill support on FlashInfer. by @elfiegg in #9781
- [Bugfix][core] replace heartbeat with pid check by @joerunde in #9818
- [Doc] link bug for multistep guided decoding by @joerunde in #9843
- [Neuron] Update Dockerfile.neuron to fix build failure by @hbikki in #9822
- [doc] update pp support by @youkaichao in #9853
- [CI/Build] Simplify exception trace in api server tests by @CRZbulabula in #9787
- [torch.compile] upgrade tests by @youkaichao in #9858
- [Misc][OpenAI] deprecate max_tokens in favor of new max_completion_tokens field for chat completion endpoint by @gcalmettes in #9837
- Revert "[Bugfix] Use host argument to bind to interface (#9798)" by @khluu in #9852
- [Model] Support quantization of Qwen2VisionTransformer for Qwen2-VL by @mgoin in #9817
- [Misc] Remove deprecated arg for cuda graph capture by @ywang96 in #9864
- [Doc] Update Qwen documentation by @jeejeelee in #9869
- [CI/Build] Add Model Tests for Qwen2-VL by @alex-jw-brooks in #9846
- [CI/Build] Adding a forced docker system prune to clean up space by @Alexei-V-Ivanov-AMD in #9849
- [Bugfix] Fix `illegal memory access` error with chunked prefill, prefix caching, block manager v2 and xformers enabled together by @sasha0552 in #9532
- [BugFix][Kernel] Fix Illegal memory access in causal_conv1d in H100 by @mzusman in #9838
- [ci/build] Configure dependabot to update pip dependencies by @khluu in #9811
- [Bugfix][Frontend] Reject guided decoding in multistep mode by @joerunde in #9892
- [torch.compile] directly register custom op by @youkaichao in #9896
- [Bugfix] Fix layer skip logic with bitsandbytes by @mgoin in #9887
- [torch.compile] rework test plans by @youkaichao in #9866
- [Model] Support bitsandbytes for MiniCPMV by @mgoin in #9891
- [torch.compile] Adding torch compile annotations to some models by @CRZbulabula in #9876
- [Doc] Update multi-input support by @DarkLight1337 in #9906
- [Frontend] Chat-based Embeddings API by @DarkLight1337 in #9759
- [CI/Build] Add Model Tests for PixtralHF by @mgoin in #9813
- [Frontend] Use a proper chat template for VLM2Vec by @DarkLight1337 in #9912
- [Bugfix] Fix edge cases for MistralTokenizer by @tjohnson31415 in #9625
- [Core] Refactor: Clean up unused argument preemption_mode in Scheduler._preempt by @andrejonasson in #9696
- [torch.compile] use interpreter with stable api from pytorch by @youkaichao in #9889
- [Bugfix/Core] Remove assertion for Flashinfer k_scale and v_scale by @pavanimajety in #9861
- [1/N] pass the complete config from engine to executor by @youkaichao in #9933
- [Bugfix] PicklingError on RayTaskError by @GeneDer in #9934
- Bump the patch-update group with 10 updates by @dependabot in #9897
- [Core][VLM] Add precise multi-modal placeholder tracking by @petersalas in #8346
- [ci/build] Have dependabot ignore pinned dependencies by @khluu in #9935
- [Encoder Decoder] Add flash_attn kernel support for encoder-decoder models by @sroy745 in #9559
- [torch.compile] fix cpu broken code by @youkaichao in #9947
- [Docs] Update Granite 3.0 models in supported models table by @njhill in #9930
- [Doc] Updated tpu-installation.rst with more details by @mikegre-google in #9926
- [2/N] executor pass the complete config to worker/modelrunner by @youkaichao in #9938
- [V1] Fix `EngineArgs` refactor on V1 by @robertgshaw2-neuralmagic in #9954
- [bugfix] fix chatglm dummy_data_for_glmv by @youkaichao in #9955
- [3/N] model runner pass the whole config to model by @youkaichao in #9958
- [CI/Build] Quoting around > by @nokados in #9956
- [torch.compile] Adding torch compile annotations to vision-language models by @CRZbulabula in #9946
- [bugfix] fix tests by @youkaichao in #9959
- [V1] Support per-request seed by @njhill in #9945
- [Model] Add support for H2OVL-Mississippi models by @cooleel in #9747
- [V1] Fix Configs by @robertgshaw2-neuralmagic in #9971
- [Bugfix] Fix MiniCPMV and Mllama BNB bug by @jeejeelee in #9917
- [Bugfix]Using the correct type hints by @gshtras in #9885
- [Misc] Compute query_start_loc/seq_start_loc on CPU by @zhengy001 in #9447
- [Bugfix] Fix E2EL mean and median stats by @daitran2k1 in #9984
- [Bugfix][OpenVINO] Fix circular reference #9939 by @MengqingCao in #9974
- [Frontend] Multi-Modality Support for Loading Local Image Files by @chaunceyjiang in #9915
- [4/N] make quant config first-class citizen by @youkaichao in #9978
- [Misc]Reduce BNB static variable by @jeejeelee in #9987
- [Model] factoring out MambaMixer out of Jamba by @mzusman in #8993
- [CI] Basic Integration Test For TPU by @robertgshaw2-neuralmagic in #9968
- [Bugfix][CI/Build][Hardware][AMD] Shard ID parameters in AMD tests running parallel jobs by @hissu-hyvarinen in #9279
- [Doc] Update VLM doc about loading from local files by @ywang96 in #9999
- [Bugfix] Fix `MQLLMEngine` hanging by @robertgshaw2-neuralmagic in #9973
- [Misc] Refactor benchmark_throughput.py by @lk-chen in #9779
- [Frontend] Add max_tokens prometheus metric by @tomeras91 in #9881
- [Bugfix] Upgrade to pytorch 2.5.1 by @bnellnm in #10001
- [4.5/N] bugfix for quant config in speculative decode by @youkaichao in #10007
- [Bugfix] Respect modules_to_not_convert within awq_marlin by @mgoin in #9895
- [Core] Use os.sched_yield in ShmRingBuffer instead of time.sleep by @tlrmchlsmth in #9994
- [Core] Make encoder-decoder inputs a nested structure to be more composable by @DarkLight1337 in #9604
- [Bugfix] Fixup Mamba by @tlrmchlsmth in #10004
- [BugFix] Lazy import ray by @GeneDer in #10021
- [Misc] vllm CLI flags should be ordered for better user readability by @chaunceyjiang in #10017
- [Frontend] Fix tcp port reservation for api server by @russellb in #10012
- Refactor TPU requirements file and pin build dependencies by @richardsliu in #10010
- [Misc] Add logging for CUDA memory by @yangalan123 in #10027
- [CI/Build] Limit github CI jobs based on files changed by @russellb in #9928
- [Model] Support quantization of PixtralHFTransformer for PixtralHF by @mgoin in #9921
- [Feature] Update benchmark_throughput.py to support image input by @lk-chen in #9851
- [Misc] Modify BNB parameter name by @jeejeelee in #9997
- [CI] Prune tests/models/decoder_only/language/* tests by @mgoin in #9940
- [CI] Prune back the number of tests in tests/kernels/* by @mgoin in #9932
- [bugfix] fix weak ref in piecewise cudagraph and tractable test by @youkaichao in #10048
- [Bugfix] Properly propagate trust_remote_code settings by @zifeitong in #10047
- [Bugfix] Fix pickle of input when async output processing is on by @wallashss in #9931
- [Bugfix][SpecDecode] kv corruption with bonus tokens in spec decode by @llsj14 in #9730
- [v1] reduce graph capture time for piecewise cudagraph by @youkaichao in #10059
- [Misc] Sort the list of embedding models by @DarkLight1337 in #10037
- [Model][OpenVINO] Fix regressions from #8346 by @petersalas in #10045
- [Bugfix] Fix edge-case crash when using chat with the Mistral Tekken Tokenizer by @tjohnson31415 in #10051
- [Bugfix] Gpt-j-6B patch kv_scale to k_scale path by @arakowsk-amd in #10063
- [Bugfix] Remove CustomChatCompletionContentPartParam multimodal input type by @zifeitong in #10054
- [V1] Integrate Piecewise CUDA graphs by @WoosukKwon in #10058
- [distributed] add function to create ipc buffers directly by @youkaichao in #10064
- [CI/Build] drop support for Python 3.8 EOL by @aarnphm in #8464
- [CI/Build] Fix large_gpu_mark reason by @Isotr0py in #10070
- [Hardware][Intel-Gaudi] Add Intel Gaudi (HPU) inference backend by @kzawora-intel in #6143
- [Hotfix] Fix ruff errors by @WoosukKwon in #10073
- [Model][LoRA]LoRA support added for LlamaEmbeddingModel by @jeejeelee in #10071
- [Model] Add Idefics3 support by @jeejeelee in #9767
- [Model][LoRA]LoRA support added for Qwen2VLForConditionalGeneration by @ericperfect in #10022
- Remove ScaledActivation for AWQ by @mgoin in #10057
- [CI/Build] Drop Python 3.8 support by @russellb in #10038
- [CI/Build] change conflict PR comment from mergify by @russellb in #10080
- [V1] Make v1 more testable by @joerunde in #9888
- [CI/Build] Always run the ruff workflow by @russellb in #10092
- [core][distributed] add stateless_init_process_group by @youkaichao in #10072
- [Bugfix] Fix FP8 torch._scaled_mm fallback for torch>2.5 with CUDA<12.4 by @mgoin in #10095
- [Misc][XPU] Upgrade to Pytorch 2.5 for xpu backend by @yma11 in #9823
- [Frontend] Adjust try/except blocks in API impl by @njhill in #10056
- [Hardware][CPU] Update torch 2.5 by @bigPYJ1151 in #9911
- [doc] add back Python 3.8 ABI by @youkaichao in #10100
- [V1][BugFix] Fix Generator construction in greedy + seed case by @njhill in #10097
- [Misc] Consolidate ModelConfig code related to HF config by @DarkLight1337 in #10104
- [CI/Build] re-add codespell to CI by @russellb in #10083
- [Doc] Improve benchmark documentation by @rafvasq in #9927
- [Core][Distributed] Refactor ipc buffer init in CustomAllreduce by @hanzhi713 in #10030
- [CI/Build] Improve mypy + python version matrix by @russellb in #10041
- Adds method to read the pooling types from model's files by @flaviabeo in #9506
- [Frontend] Fix multiple values for keyword argument error (#10075) by @DIYer22 in #10076
- [Hardware][CPU][bugfix] Fix half dtype support on AVX2-only target by @bigPYJ1151 in #10108
- [Bugfix] Make image processor respect `mm_processor_kwargs` for Qwen2-VL by @li-plus in #10112
- [Misc] Add Gamma-Distribution Request Generation Support for Serving Benchmark. by @spliii in #10105
- [Frontend] Tool calling parser for Granite 3.0 models by @maxdebayser in #9027
- [Feature] [Spec decode]: Combine chunked prefill with speculative decoding by @NickLucche in #9291
- [CI/Build] Always run mypy by @russellb in #10122
- [CI/Build] Add shell script linting using shellcheck by @russellb in #7925
- [CI/Build] Automate PR body text cleanup by @russellb in #10082
- Bump actions/setup-python from 5.2.0 to 5.3.0 by @dependabot in #9745
- Online video support for VLMs by @litianjian in #10020
- Bump actions/checkout from 4.2.1 to 4.2.2 by @dependabot in #9746
- [Misc] Add environment variables collection in collect_env.py tool by @ycool in #9293
- [V1] Add all_token_ids attribute to Request by @WoosukKwon in #10135
- [V1] Prefix caching (take 2) by @comaniac in #9972
- [CI/Build] Give PR cleanup job PR write access by @russellb in #10139
- [Doc] Update FAQ links in spec_decode.rst by @whyiug in #9662
- [Bugfix] Add error handling when server cannot respond any valid tokens by @DearPlanet in #5895
- [Misc] Fix ImportError causing by triton by @MengqingCao in #9493
- [Doc] Move CONTRIBUTING to docs site by @russellb in #9924
- Fixes a typo about 'max_decode_seq_len' which causes crashes with cuda graph. by @sighingnow in #9285
- Add hf_transfer to testing image by @mgoin in #10096
- [Misc] Fix typo in #5895 by @DarkLight1337 in #10145
- [Bugfix][XPU] Fix xpu tp by introducing XpuCommunicator by @yma11 in #10144
- [Model] Expose size to Idefics3 as mm_processor_kwargs by @Isotr0py in #10146
- [V1]Enable APC by default only for text models by @ywang96 in #10148
- [CI/Build] Update CPU tests to include all "standard" tests by @DarkLight1337 in #5481
- Fix edge case Mistral tokenizer by @patrickvonplaten in #10152
- Disable spec-decode + chunked-prefill for draft models with tensor parallelism > 1 by @sroy745 in #10136
- [Misc] Improve Web UI by @rafvasq in #10090
- [V1] Fix non-cudagraph op name by @WoosukKwon in #10166
- [CI/Build] Ignore .gitignored files for shellcheck by @ProExpertProg in #10162
- Rename vllm.logging to vllm.logging_utils by @flozi00 in #10134
- [torch.compile] Fuse RMSNorm with quant by @ProExpertProg in #9138
- [Bugfix] Fix SymIntArrayRef expected to contain only concrete integers by @bnellnm in #10170
- [Kernel][Triton] Add Triton implementation for scaled_mm_triton to support fp8 and int8 SmoothQuant, symmetric case by @rasmith in #9857
- [CI/Build] Adding timeout in CPU CI to avoid CPU test queue blocking by @bigPYJ1151 in #6892
- [0/N] Rename `MultiModalInputs` to `MultiModalKwargs` by @DarkLight1337 in #10040
- [Bugfix] Ignore GPTQ quantization of Qwen2-VL visual module by @mgoin in #10169
- [CI/Build] Fix VLM broadcast tests `tensor_parallel_size` passing by @Isotr0py in #10161
- [Doc] Adjust RunLLM location by @DarkLight1337 in #10176
- [5/N] pass the whole config to model by @youkaichao in #9983
- [CI/Build] Add run-hpu-test.sh script by @xuechendi in #10167
- [Bugfix] Enable some fp8 and quantized fullgraph tests by @bnellnm in #10171
- [bugfix] fix broken tests of mlp speculator by @youkaichao in #10177
- [doc] explaining the integration with huggingface by @youkaichao in #10173
- bugfix: fix the bug that stream generate not work by @caijizhuo in #2756
- [Frontend] add `add_request_id` middleware by @cjackal in #9594
- [Frontend][Core] Override HF `config.json` via CLI by @KrishnaM251 in #5836
- [CI/Build] Split up models tests by @DarkLight1337 in #10069
- [ci][build] limit cmake version by @youkaichao in #10188
- [Doc] Fix typo error in CONTRIBUTING.md by @FuryMartin in #10190
- [doc] Polish the integration with huggingface doc by @CRZbulabula in #10195
- [Misc] small fixes to function tracing file path by @ShawnD200 in #9543
- [misc] improve cloudpickle registration and tests by @youkaichao in #10202
- [Doc] Fix typo error in vllm/entrypoints/openai/cli_args.py by @yansh97 in #10196
- [doc] improve debugging code by @youkaichao in #10206
- [6/N] pass whole config to inner model by @youkaichao in #10205
- Bump the patch-update group with 5 updates by @dependabot in #10210
- [Hardware][CPU] Add embedding models support for CPU backend by @Isotr0py in #10193
- [LoRA][Kernel] Remove the unused libentry module by @jeejeelee in #10214
- [V1] Allow `tokenizer_mode` and `trust_remote_code` for Detokenizer by @ywang96 in #10211
- [Bugfix][Hardware][CPU] Fix broken encoder-decoder CPU runner by @Isotr0py in #10218
- [Metrics] add more metrics by @HarryWu99 in #4464
- [Doc] fix doc string typo in block_manager `swap_out` function by @yyccli in #10212
- [core][distributed] add stateless process group by @youkaichao in #10216
- Bump actions/setup-python from 5.2.0 to 5.3.0 by @dependabot in #10209
- [V1] Fix detokenizer ports by @WoosukKwon in #10224
- [V1] Do not use inductor for piecewise CUDA graphs by @WoosukKwon in #10225
- [v1][torch.compile] support managing cudagraph buffer by @youkaichao in #10203
- [V1] Use custom ops for piecewise CUDA graphs by @WoosukKwon in #10227
- Add docs on serving with Llama Stack by @terrytangyuan in #10183
- [misc][distributed] auto port selection and disable tests by @youkaichao in #10226
- [V1] Enable custom ops with piecewise CUDA graphs by @WoosukKwon in #10228
- Make shutil rename in python_only_dev by @shcheglovnd in #10233
- [V1] `AsyncLLM` Implementation by @robertgshaw2-neuralmagic in #9826
- [doc] update debugging guide by @youkaichao in #10236
- [Doc] Update help text for `--distributed-executor-backend` by @russellb in #10231
- [1/N] torch.compile user interface design by @youkaichao in #10237
- [Misc][LoRA] Replace hardcoded cuda device with configurable argument by @jeejeelee in #10223
- Splitting attention kernel file by @maleksan85 in #10091
- [doc] explain the class hierarchy in vLLM by @youkaichao in #10240
- [CI][CPU]refactor CPU tests to allow to bind with different cores by @zhouyuan in #10222
- [BugFix] Do not raise a `ValueError` when `tool_choice` is set to the supported `none` option and `tools` are not defined. by @gcalmettes in #10000
- [Misc] Fix Idefics3Model argument by @jeejeelee in #10255
- [Bugfix] Fix QwenModel argument by @DamonFool in #10262
- [Frontend] Add per-request number of cached token stats by @zifeitong in #10174
- [V1] Use pickle for serializing EngineCoreRequest & Add multimodal inputs to EngineCoreRequest by @WoosukKwon in #10245
- [Encoder Decoder] Update Mllama to run with both FlashAttention and XFormers by @sroy745 in #9982
- [LoRA] Adds support for bias in LoRA by @followumesh in #5733
- [V1] Enable Inductor when using piecewise CUDA graphs by @WoosukKwon in #10268
- [doc] fix location of runllm widget by @youkaichao in #10266
- [doc] improve debugging doc by @youkaichao in #10270
- Revert "[ci][build] limit cmake version" by @youkaichao in #10271
- [V1] Fix CI tests on V1 engine by @WoosukKwon in #10272
- [core][distributed] use tcp store directly by @youkaichao in #10275
- [V1] Support VLMs with fine-grained scheduling by @WoosukKwon in #9871
- Bump to compressed-tensors v0.8.0 by @dsikka in #10279
- [Doc] Fix typo in arg_utils.py by @xyang16 in #10264
- [Model] Add support for Qwen2-VL video embeddings input & multiple image embeddings input with varied resolutions by @imkero in #10221
- [Model] Adding Support for Qwen2VL as an Embedding Model. Using MrLight/dse-qwen2-2b-mrl-v1 by @FurtherAI in #9944
- [Core] Flashinfer - Remove advance step size restriction by @pavanimajety in #10282
- [Model][LoRA]LoRA support added for idefics3 by @B-201 in #10281
- [V1] Add missing tokenizer options for `Detokenizer` by @ywang96 in #10288
- [1/N] Initial prototype for multi-modal processor by @DarkLight1337 in #10044
- [Bugfix] bitsandbytes models fail to run pipeline parallel by @HoangCongDuc in #10200
- [Bugfix] Fix tensor parallel for qwen2 classification model by @Isotr0py in #10297
- [misc] error early for old-style class by @youkaichao in #10304
- [Misc] format.sh: Simplify tool_version_check by @russellb in #10305
- [Frontend] Pythonic tool parser by @mdepinet in #9859
- [BugFix]: properly deserialize `tool_calls` iterator before processing by mistral-common when MistralTokenizer is used by @gcalmettes in #9951
- [Model] Add BNB quantization support for Idefics3 by @B-201 in #10310
- [ci][distributed] disable hanging tests by @youkaichao in #10317
- [CI/Build] Fix CPU CI online inference timeout by @Isotr0py in #10314
- [CI/Build] Make shellcheck happy by @DarkLight1337 in #10285
- [Docs] Publish meetup slides by @WoosukKwon in #10331
- Support Roberta embedding models by @maxdebayser in #9387
- [Perf] Reduce peak memory usage of llama by @andoorve in #10339
- [Bugfix] use AF_INET6 instead of AF_INET for OpenAI Compatible Server by @jxpxxzj in #9583
- [Tool parsing] Improve / correct mistral tool parsing by @patrickvonplaten in #10333
- [Bugfix] Fix unable to load some models by @DarkLight1337 in #10312
- [bugfix] Fix static asymmetric quantization case by @ProExpertProg in #10334
- [Misc] Change RedundantReshapesPass and FusionPass logging from info to debug by @tlrmchlsmth in #10308
- [Model] Support Qwen2 embeddings and use tags to select model tests by @DarkLight1337 in #10184
- [Bugfix] Qwen-vl output is inconsistent in speculative decoding by @skylee-01 in #10350
- [Misc] Consolidate pooler config overrides by @DarkLight1337 in #10351
- [Build] skip renaming files for release wheels pipeline by @simon-mo in #9671
New Contributors
- @gracehonv made their first contribution in #9349
- @streaver91 made their first contribution in #9396
- @wukaixingxp made their first contribution in #9013
- @sssrijan-amazon made their first contribution in #9380
- @coolkp made their first contribution in #9477
- @yue-anyscale made their first contribution in #9478
- @dhiaEddineRhaiem made their first contribution in #9325
- @yudian0504 made their first contribution in #9549
- @ngrozae made their first contribution in #9552
- @Falko1 made their first contribution in #9503
- @wangshuai09 made their first contribution in #9536
- @gopalsarda made their first contribution in #9580
- @guoyuhong made their first contribution in #9550
- @JArnoldAMD made their first contribution in #9529
- @yuleil made their first contribution in #8234
- @sethkimmel3 made their first contribution in #7889
- @MengqingCao made their first contribution in #9605
- @CRZbulabula made their first contribution in #9614
- @faychu made their first contribution in #9248
- @vrdn-23 made their first contribution in #9358
- @willmj made their first contribution in #9673
- @samos123 made their first contribution in #9709
- @MErkinSag made their first contribution in #9560
- @Alvant made their first contribution in #9717
- @kakao-kevin-us made their first contribution in #9704
- @madt2709 made their first contribution in #9533
- @FerdinandZhong made their first contribution in #9427
- @svenseeberg made their first contribution in #9798
- @yannicks1 made their first contribution in #9801
- @wseaton made their first contribution in #8339
- @Went-Liang made their first contribution in #9697
- @andrejonasson made their first contribution in #9696
- @GeneDer made their first contribution in #9934
- @mikegre-google made their first contribution in #9926
- @nokados made their first contribution in #9956
- @cooleel made their first contribution in #9747
- @zhengy001 made their first contribution in #9447
- @daitran2k1 made their first contribution in #9984
- @chaunceyjiang made their first contribution in #9915
- @hissu-hyvarinen made their first contribution in #9279
- @lk-chen made their first contribution in #9779
- @yangalan123 made their first contribution in #10027
- @llsj14 made their first contribution in #9730
- @arakowsk-amd made their first contribution in #10063
- @kzawora-intel made their first contribution in #6143
- @DIYer22 made their first contribution in #10076
- @li-plus made their first contribution in #10112
- @spliii made their first contribution in #10105
- @flozi00 made their first contribution in #10134
- @xuechendi made their first contribution in #10167
- @caijizhuo made their first contribution in #2756
- @cjackal made their first contribution in #9594
- @KrishnaM251 made their first contribution in #5836
- @FuryMartin made their first contribution in #10190
- @ShawnD200 made their first contribution in #9543
- @yansh97 made their first contribution in #10196
- @yyccli made their first contribution in #10212
- @shcheglovnd made their first contribution in #10233
- @maleksan85 made their first contribution in #10091
- @followumesh made their first contribution in #5733
- @imkero made their first contribution in #10221
- @B-201 made their first contribution in #10281
- @HoangCongDuc made their first contribution in #10200
- @mdepinet made their first contribution in #9859
- @jxpxxzj made their first contribution in #9583
- @skylee-01 made their first contribution in #10350
Full Changelog: v0.6.3...v0.6.4