Python: v0.2.0 - Hugging Face Tokenizer support

@benbrandt benbrandt released this 12 Jun 08:37

What's New

  • New HuggingFaceTextSplitter, which lets you use Hugging Face's tokenizers package to size chunks by token count with a tokenizer of your choice:
from semantic_text_splitter import HuggingFaceTextSplitter
from tokenizers import Tokenizer

# Maximum number of tokens in a chunk
max_tokens = 1000

# Load any Hugging Face tokenizer to measure chunk sizes in tokens
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
# Optionally pass trim_chunks=False so the splitter does not trim whitespace for you
splitter = HuggingFaceTextSplitter(tokenizer, trim_chunks=False)

chunks = splitter.chunks("your document text", max_tokens)

Breaking Changes

  • trim_chunks now defaults to True instead of False. For most use cases, trimming the surrounding whitespace is the desired behavior, especially when using chunk size ranges. Pass trim_chunks=False to keep the previous behavior, as shown below.
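
A minimal sketch of the new default, using the HuggingFaceTextSplitter introduced above; the document text and chunk size here are placeholder values:

from semantic_text_splitter import HuggingFaceTextSplitter
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# v0.2.0 default: trim_chunks is True, so chunks come back with
# surrounding whitespace removed.
trimming_splitter = HuggingFaceTextSplitter(tokenizer)

# Opt back in to the old behavior by passing trim_chunks=False.
non_trimming_splitter = HuggingFaceTextSplitter(tokenizer, trim_chunks=False)

trimmed = trimming_splitter.chunks("  your document text  ", 1000)
untrimmed = non_trimming_splitter.chunks("  your document text  ", 1000)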

Full Changelog: python-v0.1.4...python-v0.2.0