v0.8.0 - Performance Improvements #124
benbrandt announced in Announcements
What's New
Significantly fewer allocations necessary when generating chunks. This should result in a performance improvement for most use cases. This was achieved both by reusing pre-allocated collections and by memoizing chunk size calculations, since that is often the bottleneck, and tokenizer libraries tend to be very allocation heavy!
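To give a sense of the memoization idea, here is a minimal, illustrative sketch (not the crate's actual internals): repeated size measurements of the same candidate range are cached so the expensive size function, such as a tokenizer call, runs at most once per range. The `MemoizedSizer` type and its keying scheme are hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical memoizing wrapper around an expensive chunk-size function
/// (for example a tokenizer call), keyed by the byte range of the candidate
/// chunk within the original text.
struct MemoizedSizer<F: Fn(&str) -> usize> {
    size_fn: F,
    cache: HashMap<(usize, usize), usize>,
}

impl<F: Fn(&str) -> usize> MemoizedSizer<F> {
    fn new(size_fn: F) -> Self {
        Self {
            size_fn,
            cache: HashMap::new(),
        }
    }

    /// Returns the size of `text[start..end]`, computing it at most once per range.
    fn size(&mut self, text: &str, start: usize, end: usize) -> usize {
        let size_fn = &self.size_fn;
        *self
            .cache
            .entry((start, end))
            .or_insert_with(|| size_fn(&text[start..end]))
    }
}

fn main() {
    // Stand-in "tokenizer": chunk size is just the character count here.
    let mut sizer = MemoizedSizer::new(|s: &str| s.chars().count());
    let text = "Some document text that gets measured repeatedly while chunking.";

    // While searching for the best split point, the same candidate range is
    // often measured many times; the second call below is a cache hit.
    let first = sizer.size(text, 0, 25);
    let second = sizer.size(text, 0, 25);
    assert_eq!(first, second);
}
```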
Benchmarks show fewer allocations both in the core chunking algorithm and when using tokenizers to calculate chunk sizes.
Breaking Changes
- Fixed a bug in the `MarkdownSplitter` logic that caused some strange split points.
- The `Text` semantic level in `MarkdownSplitter` has been merged with inline elements to also find better split points inside content.
- These changes mostly affect the `MarkdownSplitter`, but there were some cases of different behavior in the `TextSplitter` as well if chunks are not trimmed.

All of the above can cause different chunks to be output than before, depending on the text. So, even though these are bug fixes to bring intended behavior, they are being treated as a major version bump.
Full Changelog: v0.7.0...v0.8.0
This discussion was created from the release v0.8.0 - Performance Improvements.