Python: v0.2.0 - Hugging Face Tokenizer support

@benbrandt benbrandt released this 12 Jun 08:37

What's New

  • New HuggingFaceTextSplitter, which lets you use Hugging Face's tokenizers package to size chunks by token count with a tokenizer of your choice:
from semantic_text_splitter import HuggingFaceTextSplitter
from tokenizers import Tokenizer

# Maximum number of tokens in a chunk
max_tokens = 1000

# Load any Hugging Face tokenizer to measure chunk sizes in tokens
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
# Optionally pass trim_chunks=False so the splitter does not trim whitespace for you
splitter = HuggingFaceTextSplitter(tokenizer, trim_chunks=False)

chunks = splitter.chunks("your document text", max_tokens)

Breaking Changes

  • trim_chunks now defaults to True instead of False. For most use cases, trimming the surrounding whitespace is the desired behavior, especially when using chunk size ranges. Pass trim_chunks=False to keep the previous behavior, as shown below.
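
A minimal sketch of the new default, using the HuggingFaceTextSplitter introduced above; the document text and chunk size here are placeholder values:

from semantic_text_splitter import HuggingFaceTextSplitter
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# v0.2.0 default: trim_chunks is True, so chunks come back with
# surrounding whitespace removed.
trimming_splitter = HuggingFaceTextSplitter(tokenizer)

# Opt back in to the old behavior by passing trim_chunks=False.
non_trimming_splitter = HuggingFaceTextSplitter(tokenizer, trim_chunks=False)

trimmed = trimming_splitter.chunks("  your document text  ", 1000)
untrimmed = non_trimming_splitter.chunks("  your document text  ", 1000)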

Full Changelog: python-v0.1.4...python-v0.2.0