v0.14.0
What's New
Performance fixes for large documents. Worst-case performance on certain documents was abysmal: processing could effectively run forever. This release ensures that, in the worst case, the splitter no longer binary searches over the entire document, as it did before. That was prohibitively expensive, especially for the tokenizer implementations; the search space now always has a safe upper bound.
For the "happy path", this new approach also led to big speed gains in the `CodeSplitter` (50%+ speed increase in some cases), marginal regressions in the `MarkdownSplitter`, and not much difference in the `TextSplitter`. Overall, performance should now be more consistent across documents, since previously it wasn't uncommon for a document with certain formatting to hit the worst-case scenario.
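The bounding idea behind the expanding search window can be sketched as follows: instead of binary searching over every remaining section boundary in the document, first grow an upper bound exponentially (a galloping search) until the chunk no longer fits, then binary search only within that window. This is an illustrative sketch, not the crate's actual implementation; `fits` and `max_fitting` are hypothetical names, with `fits(n)` standing in for "the first `n` sections still fit within the chunk capacity".

```rust
/// Find the largest `n <= num_sections` for which `fits(n)` is true,
/// assuming `fits` is monotone (once false, it stays false).
fn max_fitting(num_sections: usize, fits: impl Fn(usize) -> bool) -> usize {
    // Gallop: grow the upper bound 1, 2, 4, 8, ... instead of
    // searching over the entire document up front.
    let mut hi = 1;
    while hi < num_sections && fits(hi) {
        hi *= 2;
    }
    let hi = hi.min(num_sections);
    let mut lo = hi / 2; // everything up to hi/2 is already known to fit

    // Standard binary search, but only within the bounded window [lo, hi].
    let mut hi = hi;
    while lo < hi {
        let mid = lo + (hi - lo + 1) / 2;
        if fits(mid) {
            lo = mid;
        } else {
            hi = mid - 1;
        }
    }
    lo
}

fn main() {
    // Example: pretend chunks stop fitting after 37 sections.
    // The search touches ~2*log2(37) probes, not log2(1000).
    let result = max_fitting(1_000, |n| n <= 37);
    assert_eq!(result, 37);
}
```

Because the number of probes now scales with the size of the answer rather than the size of the document, a document that produces many small chunks no longer pays for its total length on every chunk.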
Breaking Changes
- Chunk output may be slightly different because of the changes to the search optimizations. The previous optimization occasionally caused the splitter to stop too soon. In most cases you may see no difference. The change was most pronounced in the `MarkdownSplitter` at very small sizes, and in any splitter using `RustTokenizers`, because of its offset behavior.
- Rust: `ChunkSize` has been removed. This was a holdover from a previous internal optimization, which turned out not to be very accurate anyway.
- This makes implementing a custom `ChunkSizer` much easier: you now only need to return the size of the chunk as a `usize`. Tokenization implementations often had to do extra work to calculate the size as well, which is no longer necessary.
Before

```rust
pub trait ChunkSizer {
    // Required method
    fn chunk_size(&self, chunk: &str, capacity: &ChunkCapacity) -> ChunkSize;
}
```

After

```rust
pub trait ChunkSizer {
    // Required method
    fn size(&self, chunk: &str) -> usize;
}
```
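With the simplified trait, a custom sizer reduces to a single method. A minimal sketch of what an implementation might look like; the trait is redeclared locally so the example is self-contained, and the `CharCount` type is hypothetical, not part of the crate:

```rust
// The simplified trait shape from this release: only a size is needed.
pub trait ChunkSizer {
    // Required method
    fn size(&self, chunk: &str) -> usize;
}

// Hypothetical custom sizer measuring chunks in Unicode scalar values
// rather than bytes or tokens.
struct CharCount;

impl ChunkSizer for CharCount {
    fn size(&self, chunk: &str) -> usize {
        chunk.chars().count()
    }
}

fn main() {
    let sizer = CharCount;
    // "héllo" is 6 bytes, but 5 characters.
    assert_eq!(sizer.size("héllo"), 5);
}
```

Since there is no longer a `ChunkCapacity` parameter or a `ChunkSize` return type to construct, tokenizer-backed sizers can simply return the token count directly.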
- Optimization for SemanticSplitRange searching by @benbrandt in #219
- Performance Optimization: Expanding binary search window by @benbrandt in #231
Full Changelog: v0.13.3...v0.14.0