v0.4.0

github-actions released this 30 Mar 01:54
51c31bc

Major changes

Models

Production features

  • Automatic prefix caching (#2762, #3703), which lets a long system prompt be cached automatically and reused across requests. Pass the --enable-prefix-caching flag to turn it on.
  • Support for json_object in the OpenAI server for arbitrary JSON output, a --use-delay flag to improve time to first token across many requests, and min_tokens to suppress EOS until a minimum number of tokens has been generated.
  • Progress on the chunked prefill scheduler (#3236, #3538) and speculative decoding (#3103).
  • The custom all-reduce kernel has been re-enabled after further robustness fixes.
  • Replaced the cupy dependency because of its bugs.
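
To illustrate the idea behind automatic prefix caching, here is a toy sketch in plain Python. It is not vLLM's implementation: the class name ToyPrefixCache, the block size, and the token values are all hypothetical and chosen only for illustration. It assumes prompts are split into fixed-size token blocks and that a block can be reused only when every block before it is identical.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # hypothetical block size, chosen only for illustration


class ToyPrefixCache:
    """Toy block-level prefix cache (illustrative only, not vLLM's code)."""

    def __init__(self) -> None:
        # Key: (id of the preceding block, tokens in this block) -> block id.
        self.blocks: Dict[Tuple[int, Tuple[int, ...]], int] = {}

    def lookup_or_insert(self, token_ids: List[int]) -> Tuple[int, int]:
        """Return (reused_blocks, new_blocks) for the full blocks of a prompt."""
        reused = new = 0
        parent_id = -1  # sentinel for "no preceding block"
        # Only full blocks are considered; a trailing partial block is skipped.
        for start in range(0, len(token_ids) - BLOCK_SIZE + 1, BLOCK_SIZE):
            block = tuple(token_ids[start:start + BLOCK_SIZE])
            # Including the parent id means a block matches only after an
            # identical prefix, not merely when its own tokens match.
            key = (parent_id, block)
            block_id = self.blocks.get(key)
            if block_id is None:
                block_id = len(self.blocks)
                self.blocks[key] = block_id
                new += 1
            else:
                reused += 1
            parent_id = block_id
        return reused, new


system_prompt = list(range(8))  # stands in for a long shared system prompt
cache = ToyPrefixCache()
first = cache.lookup_or_insert(system_prompt + [100, 101, 102, 103])
second = cache.lookup_or_insert(system_prompt + [200, 201, 202, 203])
print(first)   # (0, 3): nothing cached yet, three new blocks
print(second)  # (2, 1): the two system-prompt blocks are reused
```

In this sketch the second request recomputes only the block that differs, which is the effect the flag is meant to provide for a shared system prompt.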

Hardware

  • Improved Neuron support for AWS Inferentia.
  • CMake-based build system for extensibility.

Ecosystem

  • Extensive serving benchmark refactoring (#3277)
  • Usage statistics collection (#2852)

What's Changed

New Contributors

Full Changelog: v0.3.3...v0.4.0