Use a binary search to fill TextChunks #64
Conversation
Hey @bradfier thanks so much for reaching out and trying this! I've never really been happy with the current algorithm with large chunk sizes for the exact reason you state here: tokenizing is expensive. I really wish these tokenizer libraries returned tokens in an iterator so I could at least stop tokenizing once I know that I am beyond the chunk size already (If I only want a chunk size of 200 and I am tokenizing a chunk of 2000 tokens, I could save a lot of work once I got to token 201)... But I probably won't solve that any time soon :)
But I think you've actually found a solution here that helps out a lot and makes total sense. We need to explore the search space, and binary search is obviously a much more efficient way to do that.
I think the performance issue with smaller chunk sizes is that `next_sections` ultimately returns sections of the text at the current semantic level for the rest of the entire document. Because we know that the next higher semantic level doesn't fit, we could either stop returning items after the end of that bigger section, or find some other way to limit the binary search space to stay within the bounds of the offset of the next highest semantic level. Once you get down to word level or lower, we would be binary searching over the entire rest of the document at a very granular level, which would explain the long times.
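As a concrete illustration of the binary search being discussed, here is a minimal standalone sketch; `largest_fitting_end`, the byte offsets, and the `fits` closure are all made up for the example rather than taken from the crate, and `fits` stands in for the tokenizer-based capacity check, which is assumed to be monotone:

```rust
/// `fits(end)` answers "does the text up to byte offset `end` fit in the chunk?"
/// and is assumed to be monotone (once an offset is too big, every larger one is too).
fn largest_fitting_end(section_ends: &[usize], fits: impl Fn(usize) -> bool) -> Option<usize> {
    let (mut low, mut high) = (0usize, section_ends.len());
    let mut best = None;
    while low < high {
        let mid = low + (high - low) / 2;
        if fits(section_ends[mid]) {
            best = Some(section_ends[mid]);
            low = mid + 1; // this prefix fits, try to grow the chunk
        } else {
            high = mid; // too big, shrink the search space
        }
    }
    best
}

fn main() {
    // Pretend each semantic section ends at these byte offsets, and that the
    // capacity check is "at most 10 tokens" with roughly 4 bytes per token.
    let ends = [3, 9, 17, 42, 80, 200];
    let end = largest_fitting_end(&ends, |end| end / 4 <= 10);
    println!("largest fitting chunk ends at byte {end:?}"); // Some(42)
}
```

The point of the sketch is that the capacity check runs only O(log n) times per chunk instead of once for every candidate section.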
Regarding the range method of providing a chunk size, this came out of an experiment with character counting where we wanted some minimum chunk size, but were willing to let it grow to some upper bound if necessary to get a more meaningful chunk. I'm not sure that experiment really panned out, though; I'll check internally at my company to see if we are even using it anymore.
But I see two options:
- I remove the range option, if it doesn't seem to be used very much, in exchange for the massive performance gain.
- We just stop the binary search once we find a spot somewhere in the range. It is different behavior, and I'd highlight it as a breaking change, but it also seems reasonable: you gave us a range, and we are delivering chunks within that range.
I'll take a closer look at your code tomorrow with some fresh eyes. If you want to make your PR editable by me on your end, I can maybe play with `next_sections`, or I could do that change independently and one of us can rebase.
Whichever route we go, I want to have your contribution noted and acknowledged! Thanks again!
Thanks for taking a look! That all seems reasonable. Relaxing the constraint to 'we just give you something that satisfies the bounds' would make it easier, though the tests will obviously all break. A possible improvement to the algorithm that I haven't yet had a chance to try would be to start at the beginning of the input and advance stepwise exponentially by section until the first step that exceeds the capacity. This would have the advantage of filling smaller chunks without considering the latter half of the document at all, while preserving the positive behaviour for larger data.
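A rough, self-contained sketch of that exponential-advance idea, read here as a gallop followed by a binary search inside the final bracket (that pairing is an interpretation of the comment, not something stated in it); `fits(n)`, meaning "the first n sections fit in the chunk", is a stand-in and is assumed to be monotone:

```rust
fn exponential_then_binary(num_sections: usize, fits: impl Fn(usize) -> bool) -> usize {
    if num_sections == 0 || !fits(1) {
        return 0;
    }
    // Gallop forward: 1, 2, 4, 8, ... sections, until doubling would overshoot
    // either the capacity or the number of available sections.
    let mut hi = 1;
    while hi * 2 <= num_sections && fits(hi * 2) {
        hi *= 2;
    }
    // Binary search only inside the last bracket for the largest count that fits.
    let mut low = hi; // known to fit
    let mut high = (hi * 2).min(num_sections); // may or may not fit
    while low < high {
        let mid = low + (high - low + 1) / 2;
        if fits(mid) {
            low = mid;
        } else {
            high = mid - 1;
        }
    }
    low
}

fn main() {
    // 1000 sections of ~5 tokens each, with room for 120 tokens per chunk.
    let taken = exponential_then_binary(1000, |n| n * 5 <= 120);
    assert_eq!(taken, 24);
}
```

Because the gallop brackets the answer near the start of the remaining input, small chunks never touch the far end of the document, which matches the intent described above.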
It looks like GitHub doesn't support allowing maintainer edits on Org-originating PRs, but feel free to edit it anyway, push it somewhere, and I'll rebase onto it!
Ok great @bradfier, and no worries. I was thinking about it more, and I think I can make the change to limiting `next_sections`.
@bradfier I've merged the update to `next_sections`. All of my snapshot tests passed, which I use as a kind of test for the edge cases around this sort of logic, and the current timing remains the same, so this doesn't change existing behavior. It does, though, allow you to have a smaller search space when you use these methods. In case it is useful, I also pushed up the offsets as well as the `str` in all of the methods. It might cause a conflict, but it could help you avoid summing up lengths or something like that, since for an end cursor you should always be able to do `offset + str.len()`.
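A tiny self-contained illustration of that `offset + str.len()` point; the `(byte_offset, text)` tuple shape here is an assumption for the example, not the crate's exact return type:

```rust
fn end_cursor(sections: &[(usize, &str)]) -> Option<usize> {
    // The end cursor of the last consumed section is just offset + length;
    // no need to sum up the lengths of everything before it.
    sections.last().map(|&(offset, s)| offset + s.len())
}

fn main() {
    let sections: [(usize, &str); 3] = [(0, "Hello, "), (7, "world"), (12, "!")];
    assert_eq!(end_cursor(&sections), Some(13));
}
```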
@bradfier what would be helpful for me to know is:
Once you have the rebase done, though, I can pull down your branch and take a look at the changes in the tests. I've gotten a feeling over time for how the chunks change, and I think I can tell whether the difference in chunking behavior for ranges is significant. Thanks again for all of your work on this; this is a huge area I always wanted to improve, and I am very grateful you found my crate useful enough to put in the extra work to make it work for your use case!
Hi @benbrandt sorry I didn't get to this until the weekend. Running my tests on
Using the B-search rebased on
Admittedly this wasn't a particularly clever rebase; I didn't use the offset you're now returning from `next_sections`, just the section lengths:

```rust
let section_lengths: Vec<usize> = self.next_sections()?.map(|(_, s)| s.len()).collect();
```

This branch is rebased and pushed. We probably still need a hybrid approach for filling small chunks from big input; if you don't get to it before me I'll have a look too.
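Not the PR's actual code, but a self-contained sketch of how those collected section lengths could feed a binary search: fold them into running end offsets and let `partition_point` do the searching. The purely byte-based capacity check is a deliberate simplification standing in for the real tokenizer test:

```rust
fn sections_that_fit(section_lengths: &[usize], capacity_bytes: usize) -> usize {
    // Turn the per-section lengths into running end offsets.
    let ends: Vec<usize> = section_lengths
        .iter()
        .scan(0usize, |end, len| {
            *end += len;
            Some(*end)
        })
        .collect();
    // `partition_point` binary searches for how many prefixes stay within capacity.
    ends.partition_point(|&end| end <= capacity_bytes)
}

fn main() {
    let lengths = [3, 9, 17, 42, 80];
    assert_eq!(sections_that_fit(&lengths, 30), 3); // 3 + 9 + 17 = 29 fits
}
```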
Thanks for this @bradfier; I also didn't get a chance to work more on this until recently. I do have one more optimization to cut down on the amount of chunking. My hunch now is that booting up the tokenizer at all is costly, which is why the small chunk size didn't benefit much. I'll test it out with my test documents. The one you are using is about 128k, you said?
That's right. It's about 130k of English text, with a decent amount of punctuation used in lieu of formatting, which gives a useful pessimistic case for a tokeniser expecting punctuation to be on word boundaries.
@bradfier apologies, it has been a crazy week. I spent some time yesterday on my other attempt to reuse the tokenization from the semantic level selection, but forgot yet again that tokenization really needs to happen on the text in question; you can't pull the tokens out of a longer text... Anyway, I will work on integrating your change because it is much better than the status quo.
@bradfier a few more questions: what settings are you using, i.e. chunk size, trimming of chunks, and which tokenizer specifically? I added some benchmarks which seem to have very different performance characteristics from what you are seeing. It's possible it is just a matter of differences in texts, but I want to make sure. I also attempted to integrate your binary search implementation, but wasn't seeing the same speed-up, so again I wanted to check. Thanks
@bradfier I've adopted your approach, along with a few more optimizations, here: #71. At this point I see overall improvements, and they far outweigh any regressions, but I'd still be curious to have you test it with your document and tokenizer to make sure. I will merge my PR regardless, since I see quite a large improvement anyway, and if it still isn't working for you, let me know and we can see what else we can do. Thanks again for all of your help on this. Sorry it took so long, but I was also able to find a way to adapt the binary search to look for the smallest chunk in ranges, with only very minor breaking behavior. I tried giving you credit on the main commit that added the core logic, since it was heavily inspired by you. (It says it isn't verified, but hopefully GitHub still properly gives you credit.)
`TextChunks::next_chunk` is O(n) over the larger of the desired chunk size or the input size. This is especially apparent when using a relatively slow tokenizer like `tiktoken-rs` and requesting a large chunk size.

Here we change `next_chunk` to use a binary search to fill the chunk from semantic sections, such that the largest possible chunk satisfying the capacity requirements is returned on each iteration. This implementation is O(log n) over chunk size and input size.

This slightly changes the semantics of the chunker where a chunk size range is supplied by the caller: the original implementation would return the smallest possible chunk that satisfies the requirements, while this one returns the largest possible chunk.
Motivation
In our application, passing a 128KiB text document through `text-splitter` with a requested chunk size of `80000` ran for several minutes (in release mode, on an M2 Pro CPU) without completing.

Run time examples (128KiB markdown file input):
Existing Implementation
New Implementation
This is probably too much of a regression for cases where the chunk size is small, so I've marked the PR as a draft as I'm happy to change it. @benbrandt, do you think it's worth having a heuristic where we select a different mechanism for chunk fill depending on the requested chunk size?
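For concreteness, the heuristic being asked about could be as simple as the sketch below; the names and the threshold are invented for illustration, and a real cutoff would need to come from benchmarking both fill strategies:

```rust
enum FillStrategy {
    /// The existing linear fill, which stays cheap when chunks are small.
    Linear,
    /// The binary-search fill, which wins for large chunk sizes.
    BinarySearch,
}

/// Hypothetical threshold; a real value would come from benchmarks.
const LINEAR_FILL_MAX_CHUNK_SIZE: usize = 512;

fn pick_strategy(requested_chunk_size: usize) -> FillStrategy {
    if requested_chunk_size <= LINEAR_FILL_MAX_CHUNK_SIZE {
        FillStrategy::Linear
    } else {
        FillStrategy::BinarySearch
    }
}

fn main() {
    assert!(matches!(pick_strategy(200), FillStrategy::Linear));
    assert!(matches!(pick_strategy(80_000), FillStrategy::BinarySearch));
}
```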