Use a binary search to fill TextChunks #64
Conversation
Hey @bradfier thanks so much for reaching out and trying this! I've never really been happy with the current algorithm with large chunk sizes for the exact reason you state here: tokenizing is expensive. I really wish these tokenizer libraries returned tokens in an iterator so I could at least stop tokenizing once I know that I am beyond the chunk size already (If I only want a chunk size of 200 and I am tokenizing a chunk of 2000 tokens, I could save a lot of work once I got to token 201)... But I probably won't solve that any time soon :)
But I think you've actually found a solution here that helps out a lot and makes total sense. We need to explore the search space, and binary search is obviously a much more efficient way to do that.
I think the performance issue with smaller chunk sizes is that `next_sections` ultimately returns sections of the text at the current semantic level for the rest of the entire document. Because we know that the next higher semantic level doesn't fit, we could either stop returning items after the end of that bigger section, or find some other way to limit the binary search space to stay within the bounds of the offset of the next highest semantic level. Once you get down to word level or lower, we would be binary searching over the entire rest of the document at a very granular level, which would explain the long times.
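As a concrete illustration of the binary search being discussed, here is a minimal standalone sketch; `largest_fitting_end`, the byte offsets, and the `fits` closure are all made up for the example rather than taken from the crate, and `fits` stands in for the tokenizer-based capacity check, which is assumed to be monotone:

```rust
/// `fits(end)` answers "does the text up to byte offset `end` fit in the chunk?"
/// and is assumed to be monotone (once an offset is too big, every larger one is too).
fn largest_fitting_end(section_ends: &[usize], fits: impl Fn(usize) -> bool) -> Option<usize> {
    let (mut low, mut high) = (0usize, section_ends.len());
    let mut best = None;
    while low < high {
        let mid = low + (high - low) / 2;
        if fits(section_ends[mid]) {
            best = Some(section_ends[mid]);
            low = mid + 1; // this prefix fits, try to grow the chunk
        } else {
            high = mid; // too big, shrink the search space
        }
    }
    best
}

fn main() {
    // Pretend each semantic section ends at these byte offsets, and that the
    // capacity check is "at most 10 tokens" with roughly 4 bytes per token.
    let ends = [3, 9, 17, 42, 80, 200];
    let end = largest_fitting_end(&ends, |end| end / 4 <= 10);
    println!("largest fitting chunk ends at byte {end:?}"); // Some(42)
}
```

The point of the sketch is that the capacity check runs only O(log n) times per chunk instead of once for every candidate section.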
Regarding the range method of providing a chunk size, this came out of an experiment with character counting where we wanted some minimum chunk size, but were willing to let it grow to some upper bound if necessary to get a more meaningful chunk. I'm not sure that experiment really panned out, though; I'll check internally at my company to see if we are even using it anymore.
But I see two options:
- I remove the range option, if it doesn't seem to be used very much, in exchange for the massive performance gain.
- We just stop the binary search once we find a spot somewhere in the range. It is different behavior, and I'd highlight it as a breaking change, but it also seems reasonable: you gave us a range, and we are delivering chunks within that range.
I'll take a closer look at your code tomorrow with some fresh eyes. If you want to make your PR editable by me on your end, I can maybe play with `next_sections`, or I could do that change independently and one of us can rebase.
Whichever route we go, I want to have your contribution noted and acknowledged! Thanks again!
Thanks for taking a look! That all seems reasonable. Relaxing the constraint to 'we just give you something that satisfies the bounds' would make it easier, though the tests will obviously all break. A possible improvement to the algorithm that I haven't yet had a chance to try would be to start at the beginning of the input and advance stepwise exponentially by section until the first step that exceeds the capacity. This would have the advantage of filling smaller chunks without considering the latter half of the document at all, while preserving the positive behaviour for larger data.
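A rough, self-contained sketch of that exponential-advance idea, read here as a gallop followed by a binary search inside the final bracket (that pairing is an interpretation of the comment, not something stated in it); `fits(n)`, meaning "the first n sections fit in the chunk", is a stand-in and is assumed to be monotone:

```rust
fn exponential_then_binary(num_sections: usize, fits: impl Fn(usize) -> bool) -> usize {
    if num_sections == 0 || !fits(1) {
        return 0;
    }
    // Gallop forward: 1, 2, 4, 8, ... sections, until doubling would overshoot
    // either the capacity or the number of available sections.
    let mut hi = 1;
    while hi * 2 <= num_sections && fits(hi * 2) {
        hi *= 2;
    }
    // Binary search only inside the last bracket for the largest count that fits.
    let mut low = hi; // known to fit
    let mut high = (hi * 2).min(num_sections); // may or may not fit
    while low < high {
        let mid = low + (high - low + 1) / 2;
        if fits(mid) {
            low = mid;
        } else {
            high = mid - 1;
        }
    }
    low
}

fn main() {
    // 1000 sections of ~5 tokens each, with room for 120 tokens per chunk.
    let taken = exponential_then_binary(1000, |n| n * 5 <= 120);
    assert_eq!(taken, 24);
}
```

Because the gallop brackets the answer near the start of the remaining input, small chunks never touch the far end of the document, which matches the intent described above.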
It looks like GitHub doesn't support allowing maintainer edits on Org-originating PRs, but feel free to edit it anyway, push it somewhere, and I'll rebase onto it!
Ok great @bradfier, and no worries. I was thinking about it more, and I think I can make the change to limiting `next_sections`.
@bradfier I've merged the update to `next_sections`. All of my snapshot tests passed, which I use as a kind of test for the edge cases around this sort of logic, and the current timing remains the same, so this doesn't change existing behavior. It does, though, allow you to have a smaller search space when you use these methods. In case it is useful, I also pushed up the offsets as well as the `str` in all of the methods. It might cause a conflict, but it could help you avoid summing up lengths or something like that, since for an end cursor you should always be able to do `offset + str.len()`.
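A tiny self-contained illustration of that `offset + str.len()` point; the `(byte_offset, text)` tuple shape here is an assumption for the example, not the crate's exact return type:

```rust
fn end_cursor(sections: &[(usize, &str)]) -> Option<usize> {
    // The end cursor of the last consumed section is just offset + length;
    // no need to sum up the lengths of everything before it.
    sections.last().map(|&(offset, s)| offset + s.len())
}

fn main() {
    let sections: [(usize, &str); 3] = [(0, "Hello, "), (7, "world"), (12, "!")];
    assert_eq!(end_cursor(&sections), Some(13));
}
```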
@bradfier what would be helpful for me to know is:
Once you have the rebase done, though, I can pull down your branch and take a look at the changes in the tests. I've gotten a feeling over time for how the chunks change, and I think I can tell whether the difference in chunking behavior for ranges is significant. Thanks again for all of your work on this; this is a huge area I always wanted to improve, and I am very grateful you found my crate useful enough to put in the extra work to make it work for your use case!
Hi @benbrandt sorry I didn't get to this until the weekend. Running my tests on
Using the B-search rebased on
Admittedly this wasn't a particularly clever rebase; I didn't use the offset you're now returning from `next_sections`, just the section lengths:

```rust
let section_lengths: Vec<usize> = self.next_sections()?.map(|(_, s)| s.len()).collect();
```

This branch is rebased and pushed. We probably still need a hybrid approach for filling small chunks from big input; if you don't get to it before me I'll have a look too.
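Not the PR's actual code, but a self-contained sketch of how those collected section lengths could feed a binary search: fold them into running end offsets and let `partition_point` do the searching. The purely byte-based capacity check is a deliberate simplification standing in for the real tokenizer test:

```rust
fn sections_that_fit(section_lengths: &[usize], capacity_bytes: usize) -> usize {
    // Turn the per-section lengths into running end offsets.
    let ends: Vec<usize> = section_lengths
        .iter()
        .scan(0usize, |end, len| {
            *end += len;
            Some(*end)
        })
        .collect();
    // `partition_point` binary searches for how many prefixes stay within capacity.
    ends.partition_point(|&end| end <= capacity_bytes)
}

fn main() {
    let lengths = [3, 9, 17, 42, 80];
    assert_eq!(sections_that_fit(&lengths, 30), 3); // 3 + 9 + 17 = 29 fits
}
```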
Thanks for this @bradfier; I also didn't get a chance to work more on this until recently. I do have one more optimization to cut down on the amount of chunking. My hunch now is that booting up the tokenizer at all is costly, which is why the small chunk size didn't benefit much. I'll test it out with my test documents. The one you are using is about 128k, you said?
That's right. It's about 130k of English text, with a decent amount of punctuation used in lieu of formatting, which gives a useful pessimistic case for a tokeniser expecting punctuation to be on word boundaries.
@bradfier apologies, it has been a crazy week. I spent some time yesterday on my other attempt to reuse the tokenization from the semantic level selection, but forgot yet again that tokenization really needs to happen on the text in question; you can't pull the tokens out of a longer text... Anyway, I will work on integrating your change because it is much better than the status quo.
@bradfier a few more questions: what settings are you using, i.e. chunk size, trimming of chunks, and which tokenizer specifically? I added some benchmarks which seem to have very different performance characteristics from what you are seeing. It's possible it is just a matter of differences in texts, but I want to make sure. I also attempted to integrate your binary search implementation, but wasn't seeing the same speed-up, so again I wanted to check. Thanks
@bradfier I've adopted your approach, along with a few more optimizations, here: #71. At this point I see overall improvements, and they far outweigh any regressions, but I'd still be curious to have you test it with your document and tokenizer to make sure. I will merge my PR regardless, since I see quite a large improvement anyway, and if it still isn't working for you, let me know and we can see what else we can do. Thanks again for all of your help on this. Sorry it took so long, but I was also able to find a way to adapt the binary search to look for the smallest chunk in ranges, with only very minor breaking behavior. I tried giving you credit on the main commit that added the core logic, since it was heavily inspired by you. (It says it isn't verified, but hopefully GitHub still properly gives you credit.)
`TextChunks::next_chunk` is O(n) over the larger of the desired chunk size or the input size. This is especially apparent when using a relatively slow tokenizer like `tiktoken-rs` and requesting a large chunk size.

Here we change `next_chunk` to use a binary search to fill the chunk from semantic sections, such that the largest possible chunk satisfying the capacity requirements is returned on each iteration. This implementation is O(log n) over chunk size and input size.

This slightly changes the semantics of the chunker where a chunk size range is supplied by the caller: the original implementation would return the smallest possible chunk that satisfies the requirements, while this one returns the largest possible chunk.
Motivation
In our application, passing a 128KiB text document through `text-splitter` with a requested chunk size of `80000` ran for several minutes (in release mode, on an M2 Pro CPU) without completing.

Run time examples (128KiB markdown file input):
Existing Implementation
New Implementation
This is probably too much of a regression for cases where the chunk size is small, so I've marked the PR as a draft as I'm happy to change it. @benbrandt, do you think it's worth having a heuristic where we select a different mechanism for chunk fill depending on the requested chunk size?
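For concreteness, the heuristic being asked about could be as simple as the sketch below; the names and the threshold are invented for illustration, and a real cutoff would need to come from benchmarking both fill strategies:

```rust
enum FillStrategy {
    /// The existing linear fill, which stays cheap when chunks are small.
    Linear,
    /// The binary-search fill, which wins for large chunk sizes.
    BinarySearch,
}

/// Hypothetical threshold; a real value would come from benchmarks.
const LINEAR_FILL_MAX_CHUNK_SIZE: usize = 512;

fn pick_strategy(requested_chunk_size: usize) -> FillStrategy {
    if requested_chunk_size <= LINEAR_FILL_MAX_CHUNK_SIZE {
        FillStrategy::Linear
    } else {
        FillStrategy::BinarySearch
    }
}

fn main() {
    assert!(matches!(pick_strategy(200), FillStrategy::Linear));
    assert!(matches!(pick_strategy(80_000), FillStrategy::BinarySearch));
}
```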