Commit
feat!: special tokens encoded by default
Special tokens are now encoded by default by both the Hugging Face and Tiktoken tokenizers. This is closer to the default behavior on the Python side, and it ensures that any tokens a model adds at the beginning or end of a sequence are counted toward the chunk size as well.
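To illustrate why this is a breaking change, here is a minimal sketch using a hypothetical toy tokenizer (one token per whitespace-separated word, optionally wrapped in `<s>` and `</s>`). The names `encode`, `token_len`, and `fits` are assumptions made for illustration and are not this crate's API. The sketch only shows how counting special tokens changes whether a chunk fits a given capacity.

```rust
// Hypothetical special tokens that a model might add around every sequence.
const BOS: &str = "<s>";
const EOS: &str = "</s>";

/// Toy tokenizer (assumption): one token per whitespace-separated word,
/// optionally wrapped in begin/end-of-sequence special tokens.
fn encode(text: &str, add_special_tokens: bool) -> Vec<String> {
    let mut tokens = Vec::new();
    if add_special_tokens {
        tokens.push(String::from(BOS));
    }
    tokens.extend(text.split_whitespace().map(String::from));
    if add_special_tokens {
        tokens.push(String::from(EOS));
    }
    tokens
}

/// Length of a candidate chunk, measured in tokens.
fn token_len(text: &str, add_special_tokens: bool) -> usize {
    encode(text, add_special_tokens).len()
}

/// Whether a candidate chunk fits within `capacity` tokens.
fn fits(text: &str, capacity: usize, add_special_tokens: bool) -> bool {
    token_len(text, add_special_tokens) <= capacity
}
```

With this toy model, `token_len("But soft what light", false)` is 4, while `token_len("But soft what light", true)` is 6. A chunk that exactly filled a 4-token capacity before therefore no longer fits once special tokens are counted, which is consistent with the snapshot tests changing most at the small chunk size.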
Showing 13 changed files with 4,000 additions and 3,267 deletions.
@@ -2,7 +2,7 @@
 members = ["bindings/*"]

 [workspace.package]
-version = "0.20.0"
+version = "0.21.0"
 authors = ["Ben Brandt <[email protected]>"]
 edition = "2021"
 description = "Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens, and is callable from Rust and Python."
Changed snapshot files (large diffs are not rendered):
tests/snapshots/snapshots__romeo_and_juliet_Tokenizers_trim_32.snap (1,172 changes: 655 additions & 517 deletions)
tests/snapshots/snapshots__romeo_and_juliet_Tokenizers_trim_512.snap (23 changes: 12 additions & 11 deletions)
tests/snapshots/snapshots__romeo_and_juliet_Tokenizers_trim_false_32.snap (1,172 changes: 655 additions & 517 deletions)
tests/snapshots/snapshots__romeo_and_juliet_Tokenizers_trim_false_512.snap (23 changes: 12 additions & 11 deletions)
tests/snapshots/snapshots__room_with_a_view_Tokenizers_trim_32.snap (2,392 changes: 1,307 additions & 1,085 deletions)
tests/snapshots/snapshots__room_with_a_view_Tokenizers_trim_512.snap (33 changes: 17 additions & 16 deletions)
tests/snapshots/snapshots__room_with_a_view_Tokenizers_trim_false_32.snap (2,392 changes: 1,307 additions & 1,085 deletions)
tests/snapshots/snapshots__room_with_a_view_Tokenizers_trim_false_512.snap (33 changes: 17 additions & 16 deletions)