Use a binary search to fill TextChunks #64

Closed
wants to merge 1 commit into from

Conversation

Contributor

@bradfier bradfier commented Nov 29, 2023

TextChunks::next_chunk is O(n) over the larger of the desired chunk size or the input size. This is especially apparent when using a relatively slow tokenizer like tiktoken-rs and requesting a large chunk size.

Here we change next_chunk to use a binary search to fill the chunk from semantic sections, such that the largest possible chunk satisfying the capacity requirements is returned on each iteration. This implementation is O(log n) over chunk size and input size.

This slightly changes the semantics of the chunker when a chunk size range is supplied by the caller: the original implementation would return the smallest possible chunk that satisfies the requirements, whereas this one returns the largest possible chunk.
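
For illustration, a minimal sketch of the approach (not the PR's actual code; `size_of` stands in for whatever sizer or tokenizer is configured):

```rust
/// Sketch only: find the largest number of leading semantic sections whose
/// concatenation still fits within `capacity`, by binary searching over the
/// prefix length instead of growing the chunk one section at a time.
fn largest_fitting_prefix(
    sections: &[&str],
    capacity: usize,
    size_of: impl Fn(&str) -> usize, // e.g. a token count from the tokenizer
) -> usize {
    let (mut lo, mut hi) = (0, sections.len());
    while lo < hi {
        let mid = (lo + hi + 1) / 2; // bias upward so the loop terminates
        if size_of(&sections[..mid].concat()) <= capacity {
            lo = mid; // the first `mid` sections fit; try a larger prefix
        } else {
            hi = mid - 1; // too big; shrink the prefix
        }
    }
    lo
}
```

Note that each probe still re-tokenizes the candidate prefix; the win is that the number of probes is logarithmic in the number of sections.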

Motivation

In our application, passing a 128KiB text document through text-splitter with a requested chunk size of 80000 ran for several minutes (in release mode, on an M2 Pro CPU) without completing.

Runtime examples (128 KiB markdown file input):

Existing Implementation

| Chunk Size | Time |
| --- | --- |
| 50 | 86.502916ms |
| 100 | 121.535625ms |
| 500 | 491.113125ms |
| 1000 | 955.680333ms |
| 2000 | 1.893095167s |
| 5000 | 4.574126375s |
| 10000 | 9.171867625s |
| 25000 | 25.724966459s |
| 50000 | 49.521662416s |

New Implementation

| Chunk Size | Time |
| --- | --- |
| 50 | 4.247918083s |
| 100 | 1.954472334s |
| 1000 | 264.748125ms |
| 2000 | 191.925416ms |
| 5000 | 141.216625ms |
| 10000 | 137.836417ms |
| 25000 | 132.183417ms |
| 50000 | 144.889083ms |

This is probably too much of a regression for cases where the chunk size is small, so I've marked the PR as a draft and I'm happy to change it. @benbrandt, do you think it's worth having a heuristic where we select a different chunk-fill mechanism depending on the requested chunk size?

Owner

@benbrandt benbrandt left a comment


Hey @bradfier, thanks so much for reaching out and trying this! I've never really been happy with the current algorithm at large chunk sizes, for the exact reason you state here: tokenizing is expensive. I really wish these tokenizer libraries returned tokens in an iterator so I could at least stop tokenizing once I know I am already beyond the chunk size (if I only want a chunk size of 200 and I am tokenizing a chunk of 2000 tokens, I could save a lot of work once I got to token 201)... But I probably won't solve that any time soon :)

But I think you've actually found a solution here that helps out a lot and makes total sense. We need to explore the search space, and binary search is obviously a much more efficient way to do that.

I think the performance issue with smaller chunk sizes is that ultimately next_sections returns sections of the text at the current semantic level for the rest of the entire document.

Because we know that the next-higher semantic level doesn't fit, we could either stop returning items after the end of that bigger section, or find some other way to limit the binary search space so it stays within the bounds of the offset of the next-highest semantic level. I think once you get down to the word level or lower, we would now be binary searching over the entire rest of the document at a very granular level, which would explain the long times.
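
In case it helps to picture it, something along these lines (purely illustrative; the names and the `(offset, text)` shape are hypothetical, not the crate's actual API):

```rust
/// Sketch only: cap the candidate sections at the end of the enclosing
/// higher-level section, so a binary search over them never wanders into
/// the rest of the document at a very granular level.
fn limited_sections<'a>(
    sections: impl Iterator<Item = (usize, &'a str)>, // (byte offset, text)
    higher_level_end: usize,
) -> impl Iterator<Item = (usize, &'a str)> {
    sections.take_while(move |(offset, _)| *offset < higher_level_end)
}
```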

Regarding the range method of providing a chunk size: this came out of an experiment with character counting where we wanted some minimum chunk size, but were willing to let it go to some upper bound if necessary to get a more meaningful chunk. I'm not sure that experiment really panned out, though; I'll check internally at my company to see if we are even using that anymore.

But I see two options:

  1. I remove the option, if it doesn't seem to be used very much, in exchange for the massive performance gain.
  2. We just stop the binary search once we find a spot somewhere in the range. It is different behavior, and I'd highlight the breaking change, but it also seems reasonable: you gave us a range and we are delivering chunks within that range.

I'll take a closer look tomorrow with some fresh eyes at your code. If you want to make your PR editable by me on your end, I can maybe play with next_sections, or I could do that change independently and one of us can rebase.

Whichever route we go, I want to have your contribution noted and acknowledged! Thanks again!

@bradfier
Contributor Author

Thanks for taking a look!

That all seems reasonable; relaxing the constraint to 'we just give you something that satisfies the bounds' would make it easier - the tests will obviously all break, though.

A possible improvement to the algorithm, which I haven't yet had a chance to try, would be to start at the beginning of the input and advance exponentially by section until the first step that returns Greater - at which point we could commence the binary search to find the precise largest chunk to fill.

This would have an advantage in that it would fill smaller chunks without considering the latter half of the document at all, while preserving the positive behaviour for larger data.
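
A rough sketch of that hybrid (illustrative only, not code from this branch; `size_of` is again a stand-in for the configured sizer):

```rust
/// Sketch only: gallop forward by doubling the prefix length until the
/// candidate chunk no longer fits, then binary search just the last
/// doubling interval for the exact boundary.
fn largest_fitting_prefix_galloping(
    sections: &[&str],
    capacity: usize,
    size_of: impl Fn(&str) -> usize,
) -> usize {
    let fits = |count: usize| size_of(&sections[..count].concat()) <= capacity;

    // Probe prefixes of length 1, 2, 4, 8, ... until one overshoots the
    // capacity or we run past the end of the document.
    let mut bound = 1;
    while bound < sections.len() && fits(bound) {
        bound *= 2;
    }

    // The largest fitting prefix lies in [bound / 2, min(bound, len)];
    // binary search only that window, so small chunks never consider the
    // far end of the document.
    let mut lo = bound / 2;
    let mut hi = bound.min(sections.len());
    while lo < hi {
        let mid = (lo + hi + 1) / 2;
        if fits(mid) {
            lo = mid;
        } else {
            hi = mid - 1;
        }
    }
    lo
}
```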

@bradfier
Contributor Author

I'll take a closer look tomorrow with some fresh eyes at your code. If you want to make your PR editable by me on your end, I can maybe play with next_sections, or I could do that change independently and one of us can rebase.

It looks like GitHub doesn't support allowing maintainer edits on Org-originating PRs, but feel free to edit it anyway, push it somewhere and I'll rebase onto it!

@benbrandt
Owner

benbrandt commented Nov 30, 2023

Ok great @bradfier, and no worries. I was thinking about it more, and I think I can make the change to limiting next_sections without breaking anything, and then you can rebase and see if it solves the performance regression problem. Then I will think more about how to handle the range interface, and whether you need to change the method.

@benbrandt
Owner

@bradfier I've merged the update to next_sections in #65. Regardless, I think it enforces a constraint I want to have in the code anyway: a case that logically never happened, but could have.

All of my snapshot tests passed (I use them as a kind of test for the edge cases around this sort of logic), and the current timing remains the same, so this doesn't change existing behavior.

This does allow you to have a smaller search space when you use next_sections for your binary search, though, which may improve performance for all cases, so we might not need a heuristic for which approach to use.

In case it is useful, I also pushed up the offsets as well as the str in all of the methods. It might cause a conflict, but it should help you avoid summing up lengths or something like that, since for an end cursor you should always be able to do offset + str.len().
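
In other words (a sketch; the exact tuple shape may differ from what the crate actually returns):

```rust
// Sketch: with each semantic section returned as (byte offset, text),
// the end cursor falls out directly instead of summing section lengths.
fn end_cursor(section: (usize, &str)) -> usize {
    let (offset, text) = section;
    offset + text.len()
}
```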

@benbrandt
Owner

@bradfier what would be helpful for me to know is:

  • After rebasing, does performance improve now across all chunk sizes?
  • If yes, then it is a question of whether we use your idea of keeping the current range behavior, or change the behavior.

Once you have the rebase done, though, I can pull down your branch and take a look at the changes in the tests. I've gotten a feel over time for how the chunks change, and I think I can tell whether the difference in chunking behavior for ranges is significant.

Thanks again for all of your work on this; it's a huge area I always wanted to improve, and I am very grateful you found my crate useful enough to put in the extra work to make it work for your use case!

@bradfier
Contributor Author

bradfier commented Dec 9, 2023

Hi @benbrandt sorry I didn't get to this until the weekend.

Running my tests on main without using the B-search gives similar results to before:

| Chunk Size | Runtime |
| --- | --- |
| 50 | 84.465333ms |
| 100 | 121.103042ms |
| 500 | 474.750416ms |
| 1000 | 936.216084ms |
| 2000 | 1.846219958s |
| 5000 | 4.41498075s |
| 10000 | 9.018605s |
| 25000 | 24.658028291s |
| 50000 | 28.459285208s |

Using the B-search rebased on main gives similarly pessimistic results for small chunk sizes:

| Chunk Size | Runtime |
| --- | --- |
| 50 | 3.994563375s |
| 100 | 1.813223667s |
| 500 | 406.425291ms |
| 1000 | 248.38975ms |
| 2000 | 175.771209ms |
| 5000 | 128.846042ms |
| 10000 | 126.754459ms |
| 25000 | 122.127584ms |
| 50000 | 132.624042ms |

Admittedly this wasn't a particularly clever rebase: I didn't use the offset you're now returning from next_sections, because summing up those str::len values is a drop in the runtime bucket. You might want to tweak that; this line is the offender if you do:

        let section_lengths: Vec<usize> = self.next_sections()?.map(|(_, s)| s.len()).collect();

This branch is rebased and pushed. We probably do still need a hybrid approach for filling small chunks from a big input; if you don't get to it before me, I'll have a look too.

@benbrandt
Owner

benbrandt commented Dec 13, 2023

Thanks for this @bradfier, I also didn't get a chance to work more on this until recently.

I do have one more optimization to cut down on the amount of chunking. My hunch now is that booting up the tokenizer at all is costly, which is why the small chunk size didn't benefit much.

I'll test it out with my test documents. The one you are using is about 128k, you said?

@bradfier
Contributor Author

That's right. It's about 130k of English text, with a decent amount of punctuation used in lieu of formatting, which gives a useful pessimistic case for a tokeniser that expects punctuation to fall on word boundaries.

Think ------- to underline headings, and a contents page like:

Section One............................1

Section Two............................14

@benbrandt
Owner

@bradfier apologies, it has been a crazy week. I spent some time yesterday on my other attempt to reuse the tokenization from the semantic level selection, but forgot yet again that tokenization really needs to be done on the text in question; you can't pull the tokens from a longer text...

Anyway, I will work on integrating your change because it is much better than the status quo.

@benbrandt
Owner

@bradfier a few more questions: what settings are you using? I.e. chunk size, trimming chunks, and which tokenizer specifically?

I added some benchmarks which seem to have very different performance characteristics to what you are seeing. It's possible it is just a matter of differences in texts, but I want to make sure.

I also attempted to integrate your binary search implementation, but wasn't seeing the same speed up, so again I wanted to check. Thanks

@benbrandt
Owner

@bradfier I've adopted your approach, along with a few more optimizations here: #71

At this point overall I see improvements, and the improvements far outweigh any regressions. But I'd still be curious to have you test it with your document and tokenizer to make sure. I will merge my PR regardless, since I see quite a large improvement anyway. And if it still isn't working for you let me know, and we can see what else we can do.

Thanks again for all of your help on this. Sorry it took so long, but I was also able to find a way to adapt the binary search to look for the smallest chunk in ranges, with only very minor breaking behavior.

I tried giving you credit on the main commit that added the core logic, since it was heavily inspired by you. (It says it isn't verified, but hopefully GitHub still properly gives you credit.)

@benbrandt
Owner

@bradfier closing this PR in favor of #71

If you have any more measurements, let me know though. As I said, happy to see if we can improve this further. Thanks again for opening this!

@benbrandt benbrandt closed this Dec 27, 2023