Python splitters optionally provide chunk char offsets #135

benbrandt · 2024-04-04T15:50:13Z

It can be helpful to know where a given chunk falls within the entire text. On the Rust side, you can get each chunk along with its corresponding byte offset, but the Python package had no comparable method.

Because Rust byte offsets aren't useful in Python, each one is mapped to the character index at which the chunk begins. Since Python strings are indexed by character, this number should work directly for slicing, comparison, and matching against the original text.
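To illustrate why the mapping matters, here is a minimal standalone sketch that does not use this package's API. The helper `byte_to_char_offset` is a hypothetical name made up for this example, and it assumes the byte offset is a UTF-8 offset that falls on a character boundary, as Rust `str` offsets do.

```python
def byte_to_char_offset(text: str, byte_offset: int) -> int:
    """Convert a UTF-8 byte offset into a Python character index.

    Hypothetical helper for illustration only; not part of the package.
    """
    encoded = text.encode("utf-8")
    # Decoding the prefix raises UnicodeDecodeError if the offset
    # lands inside a multi-byte character.
    return len(encoded[:byte_offset].decode("utf-8"))


text = "héllo wörld"
chunk = "wörld"

# Byte offset of the chunk, as a byte-oriented API would report it.
byte_offset = text.encode("utf-8").find(chunk.encode("utf-8"))

# Character offset, which is what Python string slicing expects.
char_offset = byte_to_char_offset(text, byte_offset)

# Slicing with the character offset recovers the chunk exactly.
recovered = text[char_offset:char_offset + len(chunk)]

# Slicing with the raw byte offset starts one character too late here,
# because "é" occupies two bytes but only one character.
misaligned = text[byte_offset:byte_offset + len(chunk)]
```

With non-ASCII text the two offsets diverge, so only the character offset lines up with Python slicing.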

Closes #133


benbrandt self-assigned this Apr 4, 2024

benbrandt added 2 commits April 4, 2024 17:56

Prep 0.9.1 release

89ecc79

fix: try to make sure the CI isn't using the pip index/cache when installing in CI

f427771

benbrandt merged commit 17bc95a into main Apr 4, 2024
21 checks passed

benbrandt deleted the 133-char-indices branch April 4, 2024 16:18
