Prep v0.8.0 release
benbrandt committed Mar 25, 2024
1 parent ab62a2b commit c4f5c86
Showing 3 changed files with 23 additions and 3 deletions.
20 changes: 20 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,25 @@
# Changelog

## v0.8.0

### What's New

[Significantly fewer allocations](https://github.com/benbrandt/text-splitter/pull/121) are necessary when generating chunks, which should improve performance for most use cases. This was achieved by reusing pre-allocated collections and by memoizing chunk size calculations, since these are often the bottleneck and tokenizer libraries tend to be very allocation-heavy.

Benchmarks show:

- **20-40% fewer** allocations caused by the core algorithm.
- **Up to 20% fewer** allocations when using tokenizers to calculate chunk sizes.
- In some cases, especially with Markdown, these improvements can also result in **up to 20% faster** chunk generation.
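
To illustrate the two techniques named above (memoizing chunk size calculations and reusing an already-allocated collection), here is a minimal sketch using only the standard library. It is not the crate's actual implementation: `MemoizedSizer`, `count_words`, and the `calls` counter are hypothetical names invented for this example, and a simple word counter stands in for a real tokenizer.

```rust
use std::collections::HashMap;
use std::ops::Range;

/// Hypothetical stand-in for an expensive sizer such as a tokenizer.
fn count_words(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Caches sizes by byte range so that checking the same candidate
/// chunk twice does not invoke the sizer twice.
struct MemoizedSizer<F: Fn(&str) -> usize> {
    sizer: F,
    cache: HashMap<Range<usize>, usize>,
    /// Number of times the underlying sizer actually ran (for illustration).
    calls: usize,
}

impl<F: Fn(&str) -> usize> MemoizedSizer<F> {
    fn new(sizer: F) -> Self {
        Self { sizer, cache: HashMap::new(), calls: 0 }
    }

    /// Returns the size of `text[range]`, computing it at most once per range.
    fn size(&mut self, text: &str, range: Range<usize>) -> usize {
        if let Some(&size) = self.cache.get(&range) {
            return size;
        }
        self.calls += 1;
        let size = (self.sizer)(&text[range.clone()]);
        self.cache.insert(range, size);
        size
    }

    /// Drops cached entries but keeps the map's allocated capacity,
    /// so the same collection can be reused for the next chunk.
    fn clear(&mut self) {
        self.cache.clear();
    }
}
```

With `let mut sizer = MemoizedSizer::new(count_words);`, calling `sizer.size(text, 0..7)` twice runs `count_words` only once; calling `sizer.clear()` between chunks empties the cache without releasing its memory.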

### Breaking Changes

- Fixed a bug in the `MarkdownSplitter` logic that caused some strange split points.
- The `Text` semantic level in `MarkdownSplitter` has been merged with inline elements to also find better split points inside content.
- Fixed a bug that could occasionally cause the algorithm to use a lower semantic level than necessary.

All of the above mostly affect the `MarkdownSplitter` and will cause it to output different chunks than before.

## v0.7.0

### What's New
4 changes: 2 additions & 2 deletions Cargo.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion Cargo.toml
@@ -2,7 +2,7 @@
members = ["bindings/*"]

[workspace.package]
-version = "0.7.0"
+version = "0.8.0"
authors = ["Ben Brandt <[email protected]>"]
edition = "2021"
description = "Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens (when used with large language models)."