build(deps): bump semantic-text-splitter from 0.13.3 to 0.14.0 in the minor group #923

Closed
wants to merge 1 commit into from

@dependabot dependabot bot commented on behalf of github Jun 24, 2024

Bumps the minor group with 1 update: semantic-text-splitter.

Updates semantic-text-splitter from 0.13.3 to 0.14.0

Release notes

Sourced from semantic-text-splitter's releases.

v0.14.0

What's New

Performance fixes for large documents. The worst-case performance for certain documents was abysmal, to the point that splitting them appeared to run forever. This release ensures that, even in the worst case, the splitter no longer binary searches over the entire document, as it did before. Searching the whole document is prohibitively expensive, especially for the tokenizer implementations; the search space should now always have a safe upper bound.

For the "happy path", this new approach also led to big speed gains in the CodeSplitter (50%+ speed increase in some cases), marginal regressions in the MarkdownSplitter, and not much difference in the TextSplitter. But overall, the performance should be more consistent across documents, since it wasn't uncommon for a document with certain formatting to hit the worst-case scenario previously.
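
To make the idea of a bounded search space concrete, here is a minimal sketch in Rust. It is not the crate's implementation: the function name `largest_fit`, its parameters, and the doubling strategy are all assumptions chosen for illustration. It grows a probe window geometrically until a candidate overflows the capacity, then binary searches only inside that window instead of over every candidate break point in the document.

```rust
/// Illustrative only: NOT the crate's algorithm. All names here are hypothetical.
///
/// `ends` holds ascending byte offsets where a chunk may end. `size` measures a
/// piece of text and is assumed to be non-decreasing as the prefix grows.
/// Returns (how many leading candidates fit in `capacity`, number of `size` calls).
pub fn largest_fit(
    text: &str,
    ends: &[usize],
    capacity: usize,
    size: impl Fn(&str) -> usize,
) -> (usize, usize) {
    let mut calls = 0usize;
    let mut measure = |end: usize| {
        calls += 1;
        size(&text[..end])
    };

    let mut lo = 0; // candidates [0, lo) are known to fit
    let mut hi = ends.len(); // candidates at index >= hi are known not to fit

    // Phase 1: probe indices 0, 1, 3, 7, ... until one overflows the capacity.
    // This caps the search window near the answer, not at the document length.
    let mut probe = 0;
    let mut step = 1;
    while probe < ends.len() {
        if measure(ends[probe]) <= capacity {
            lo = probe + 1;
            probe += step;
            step *= 2;
        } else {
            hi = probe;
            break;
        }
    }

    // Phase 2: ordinary binary search, restricted to the window [lo, hi).
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if measure(ends[mid]) <= capacity {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }

    (lo, calls)
}
```

With a sizer such as `|s| s.len()`, the number of `size` calls in this sketch scales with the logarithm of the answer rather than with the number of candidates in the whole document, which matters when each call runs a tokenizer.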

Breaking Changes

  • Chunk output may be slightly different because of the changes to the search optimizations. The previous optimization occasionally caused the splitter to stop too soon. In most cases, you may see no difference. The early stopping was most pronounced in the MarkdownSplitter at very small chunk sizes, and in any splitter using RustTokenizers because of its offset behavior.

Rust

  • ChunkSize has been removed. This was a holdover from a previous internal optimization, which turned out to not be very accurate anyway.
  • This makes implementing a custom ChunkSizer much easier, as you now only need to return the size of the chunk as a usize. Previously, tokenization implementations often had to do extra work to calculate the size, which is no longer necessary.

Before

pub trait ChunkSizer {
    // Required method
    fn chunk_size(&self, chunk: &str, capacity: &ChunkCapacity) -> ChunkSize;
}

After

pub trait ChunkSizer {
    // Required method
    fn size(&self, chunk: &str) -> usize;
}
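
As a rough sketch of what a custom sizer might look like against the new signature: the trait below is a local copy of the "After" definition so the example compiles on its own, while `WordCount` and `fits` are hypothetical names introduced only for illustration. In real code you would import the trait from the crate rather than redefine it.

```rust
/// Local copy of the trait shape quoted under "After", so this sketch is
/// self-contained. In a real project, import the trait from the crate instead.
pub trait ChunkSizer {
    // Required method
    fn size(&self, chunk: &str) -> usize;
}

/// Hypothetical sizer: measures a chunk by its whitespace-separated word count.
pub struct WordCount;

impl ChunkSizer for WordCount {
    fn size(&self, chunk: &str) -> usize {
        // Only a plain number is returned; no capacity is passed in or compared.
        chunk.split_whitespace().count()
    }
}

/// Hypothetical helper: since the sizer no longer receives a capacity,
/// any comparison against a limit happens outside the sizer.
pub fn fits(sizer: &impl ChunkSizer, chunk: &str, max_size: usize) -> bool {
    sizer.size(chunk) <= max_size
}
```

For example, `WordCount.size("one two  three")` counts three words, and `fits(&WordCount, "a b c", 2)` is false because three words exceed the limit of two.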

Full Changelog: benbrandt/text-splitter@v0.13.3...v0.14.0

Commits
  • 7c3cbbd Update changelog with details about the fix
  • b8b2184 New attempt at finding best effort binary search window
  • 53a31b5 Remove need for ChunkSize in public interface
  • 14e0699 Use current stats to make a more accurate guess
  • ef3c61b Start to update the changelog
  • c003481 Remove incorrect max_encoded_offset optimization
  • 8d57618 Expanding binary search window
  • b1b39d1 Bump the minor group with 2 updates
  • f31f1e5 Bump the minor group in /docs with 5 updates
  • 52e8f8f Bump the minor group in /docs with 2 updates
  • Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Bumps the minor group with 1 update: [semantic-text-splitter](https://github.com/benbrandt/text-splitter).


Updates `semantic-text-splitter` from 0.13.3 to 0.14.0
- [Release notes](https://github.com/benbrandt/text-splitter/releases)
- [Changelog](https://github.com/benbrandt/text-splitter/blob/main/CHANGELOG.md)
- [Commits](benbrandt/text-splitter@v0.13.3...v0.14.0)

---
updated-dependencies:
- dependency-name: semantic-text-splitter
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: minor
...

Signed-off-by: dependabot[bot] <[email protected]>
@dependabot dependabot bot added the dependencies (Pull requests that update a dependency file) and python (Pull requests that update Python code) labels Jun 24, 2024
dependabot bot commented on behalf of github Jun 25, 2024

Looks like semantic-text-splitter is updatable in another way, so this is no longer needed.

@dependabot dependabot bot closed this Jun 25, 2024
@dependabot dependabot bot deleted the dependabot/pip/minor-8402e85d18 branch June 25, 2024 07:20