Python: v0.2.0 - Hugging Face Tokenizer support
What's New
- New `HuggingFaceTextSplitter`, which allows for using Hugging Face's `tokenizers` package to count chunks by tokens with a tokenizer of your choice.
```python
from semantic_text_splitter import HuggingFaceTextSplitter
from tokenizers import Tokenizer

# Maximum number of tokens in a chunk
max_tokens = 1000

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
# Optionally have the splitter not trim whitespace for you
splitter = HuggingFaceTextSplitter(tokenizer, trim_chunks=False)

chunks = splitter.chunks("your document text", max_tokens)
```
Breaking Changes
- `trim_chunks` now defaults to `True` instead of `False`. For most use cases, this is the desired behavior, especially with chunk ranges. If you relied on the old behavior, you can still opt out explicitly, as sketched below.
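A minimal sketch of keeping the previous untrimmed behavior, assuming the same constructor shown in the example above:

```python
from semantic_text_splitter import HuggingFaceTextSplitter
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# New default: leading/trailing whitespace is trimmed from each chunk
trimming_splitter = HuggingFaceTextSplitter(tokenizer)

# Pass trim_chunks=False explicitly to keep the previous (untrimmed) behavior
untrimmed_splitter = HuggingFaceTextSplitter(tokenizer, trim_chunks=False)
```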
Full Changelog: python-v0.1.4...python-v0.2.0