Releases · benbrandt/text-splitter

01 Jun 07:59

v0.4.0

cb674db

What's New

New Chunk Capacity (can now size chunks with Ranges)

Added a new ChunkCapacity trait. When calling splitter.chunks() or splitter.chunk_indices(), the chunk_size argument has been replaced with chunk_capacity, which can be anything that implements the ChunkCapacity trait. All of the following can now be passed in:

usize
Range<usize>
RangeFrom<usize>
RangeFull
RangeInclusive<usize>
RangeTo<usize>
RangeToInclusive<usize>

This is useful when you have a maximum chunk size but don't want to fill it completely every time. In embedding use cases, for example, you may have a maximum context size but not want to muddy the embeddings with lots of neighboring semantic elements. You can now express this with a range, and a chunk stops filling up once its size falls within the range.
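To make the range behaviour concrete, here is a minimal, self-contained sketch. It does not use the crate: the `Capacity` trait, its `fits` method, and the `pack_words` helper are hypothetical names invented for illustration, and the real ChunkCapacity trait may be defined differently. It only shows the idea that a plain usize fills a chunk as far as possible, while a range lets a chunk close as soon as its size lands inside the range.

```rust
use std::cmp::Ordering;
use std::ops::{Range, RangeInclusive};

/// Hypothetical stand-in for a capacity trait (not the crate's definition).
trait Capacity {
    /// Less = still room to grow, Equal = acceptable size, Greater = too big.
    fn fits(&self, size: usize) -> Ordering;
}

impl Capacity for usize {
    fn fits(&self, size: usize) -> Ordering {
        // A single number is only "reached" when the chunk is exactly full.
        size.cmp(self)
    }
}

impl Capacity for Range<usize> {
    fn fits(&self, size: usize) -> Ordering {
        if size < self.start {
            Ordering::Less
        } else if size < self.end {
            Ordering::Equal
        } else {
            Ordering::Greater
        }
    }
}

impl Capacity for RangeInclusive<usize> {
    fn fits(&self, size: usize) -> Ordering {
        if size < *self.start() {
            Ordering::Less
        } else if size <= *self.end() {
            Ordering::Equal
        } else {
            Ordering::Greater
        }
    }
}

/// Greedily merge whitespace-separated words into chunks, measured in
/// characters. A chunk is closed as soon as it lands inside the capacity.
fn pack_words(text: &str, capacity: &impl Capacity) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    for word in text.split_whitespace() {
        let candidate = if current.is_empty() {
            word.to_string()
        } else {
            format!("{current} {word}")
        };
        match capacity.fits(candidate.chars().count()) {
            Ordering::Less => current = candidate,
            Ordering::Equal => {
                chunks.push(candidate);
                current.clear();
            }
            Ordering::Greater => {
                // Adding the word would overflow: close what we have
                // and start the next chunk with this word.
                if !current.is_empty() {
                    chunks.push(std::mem::take(&mut current));
                }
                current = word.to_string();
            }
        }
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

With the input "aa bb cc dd ee", a capacity of 8 yields the chunks "aa bb cc" and "dd ee", while a capacity of 4..8 yields "aa bb", "cc dd", and "ee", because each chunk stops growing once it reaches at least 4 characters.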

Simplified Chunk Sizing traits

The ChunkSizer trait has been simplified and allows for various calculations of chunk size. It no longer requires full validation logic, since that now happens within the TextSplitter itself.
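As a rough illustration of this split in responsibilities, the sketch below uses a hypothetical `Sizer` trait with a single `chunk_size` method. The trait name, the `Words` type, and the `within_limit` function are assumptions made for this example, not the crate's actual definitions. The sizer only measures, and the comparison against a limit lives in one place on the splitter side.

```rust
/// Hypothetical sizing trait: it only reports how big a chunk is.
trait Sizer {
    fn chunk_size(&self, chunk: &str) -> usize;
}

/// Measures a chunk by its number of characters (not bytes).
struct Characters;

impl Sizer for Characters {
    fn chunk_size(&self, chunk: &str) -> usize {
        chunk.chars().count()
    }
}

/// Measures a chunk by its number of whitespace-separated words.
struct Words;

impl Sizer for Words {
    fn chunk_size(&self, chunk: &str) -> usize {
        chunk.split_whitespace().count()
    }
}

/// The validation step, written once and shared by every sizer.
fn within_limit<S: Sizer>(sizer: &S, chunk: &str, max: usize) -> bool {
    sizer.chunk_size(chunk) <= max
}
```

Adding a new way to measure chunks then only needs one small method, for example a tokenizer-backed sizer that returns a token count.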

Breaking Changes

ChunkValidator trait removed. Implement ChunkSizer instead, which only requires calculating chunk_size and not the full validation logic.
TokenCount trait removed. You can just use ChunkSizer directly instead.
Internal TextChunks iterator is no longer pub.

23 May 05:20

benbrandt

v0.3.1

ba3c01b

v0.3.1

What's Changed

Handle more semantic levels of line breaks by @benbrandt in #9
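One way to picture "semantic levels of line breaks" is to treat longer runs of consecutive newlines as stronger boundaries. The helper below is a sketch under that assumption. `newline_runs` is a hypothetical function, not part of the crate, and it only looks at '\n' characters.

```rust
/// Return (byte offset, run length) for every run of consecutive '\n'
/// characters. A longer run can be read as a higher semantic level.
fn newline_runs(text: &str) -> Vec<(usize, usize)> {
    let mut runs = Vec::new();
    // The run currently being extended, if any.
    let mut current: Option<(usize, usize)> = None;
    for (i, ch) in text.char_indices() {
        if ch == '\n' {
            current = match current {
                Some((start, len)) => Some((start, len + 1)),
                None => Some((i, 1)),
            };
        } else if let Some(run) = current.take() {
            // A non-newline character ends the run.
            runs.push(run);
        }
    }
    if let Some(run) = current {
        runs.push(run);
    }
    runs
}
```

A splitter built on this could try the longest runs first and fall back to shorter ones only when a section is still too large.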

Full Changelog: v0.3.0...v0.3.1

Contributors

benbrandt

19 May 03:53

benbrandt

v0.3.0

ebc2c76

v0.3.0 - Feature renaming + Optimized splitting algorithm

What's Changed

Breaking Changes

Feature names now match the names of the tokenizer crates to prevent conflicts in the future:
- huggingface -> tokenizers
- tiktoken -> tiktoken-rs

Features

Moved from recursive approach to iterative approach to avoid stack overflow issues by @benbrandt in #7
Relax MSRV to 1.60.0

Full Changelog: v0.2.2...v0.3.0

Contributors

benbrandt

08 May 16:39

benbrandt

v0.2.2

20e882b

v0.2.2 - Add all features to docs.rs

Add all features to docs.rs

Full Changelog: v0.2.1...v0.2.2

08 May 15:21

benbrandt

v0.2.1

7ed519a

v0.2.1

New Features

impl Default for TextSplitter using Characters. Character count is used for chunk length by default.
Specify the current MSRV (1.62.1)

Full Changelog: v0.2.0...v0.2.1

08 May 01:52

benbrandt

v0.2.0

ef3ca61

v0.2.0 - Simpler chunking interface

Breaking Changes

Simpler Chunking API

Simplified API for the main use case. TextSplitter now only exposes two chunking methods:

chunks
chunk_indices

The other methods are now private. They were likely to cause confusion, since they don't return the semantic units themselves, but merged versions.

You now also specify the chunk size directly in these methods, which allows reusing the TextSplitter for different chunk sizes.
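The shape of this interface can be sketched without the crate. In the example below, `Splitter` is a hypothetical stand-in that cuts fixed-size character windows. The real TextSplitter splits on semantic boundaries, and this sketch returns a Vec for simplicity. The offsets here are byte offsets, which is an assumption of this example. The point is only that the size is an argument of each call, so one splitter value can be reused with different sizes.

```rust
/// Hypothetical stand-in for a splitter that stores no chunk size.
struct Splitter;

impl Splitter {
    /// Return (byte offset, chunk) pairs of at most `chunk_size` characters.
    fn chunk_indices<'a>(&self, text: &'a str, chunk_size: usize) -> Vec<(usize, &'a str)> {
        let size = chunk_size.max(1); // guard against a zero size
        let mut out = Vec::new();
        let mut start = 0;
        while start < text.len() {
            let rest = &text[start..];
            // Byte length of the first `size` characters of `rest`.
            let len = rest
                .char_indices()
                .nth(size)
                .map_or(rest.len(), |(i, _)| i);
            out.push((start, &rest[..len]));
            start += len;
        }
        out
    }

    /// Same chunks as `chunk_indices`, without the offsets.
    fn chunks<'a>(&self, text: &'a str, chunk_size: usize) -> Vec<&'a str> {
        self.chunk_indices(text, chunk_size)
            .into_iter()
            .map(|(_, chunk)| chunk)
            .collect()
    }
}
```

Calling `Splitter.chunks("abcdef", 4)` gives "abcd" and "ef", and calling it again on the same value with a size of 2 gives "ab", "cd", and "ef".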

Allow passing in tokenizers directly

Rather than wrapping a tokenizer in another struct, you can instead just pass a tokenizer directly into TextSplitter::new.

Bug Fixes

Better handling of recursive paragraph chunking when both double and single newline splits are used.

05 May 19:01

benbrandt

v0.1.0

4a7060f

v0.1.0 - Initial Release

Initial release to crates.io
