
Reduce number of allocations in chunk generation #115

Closed · 4 tasks done
benbrandt opened this issue Mar 23, 2024 · 0 comments · Fixed by #121
benbrandt (Owner) commented Mar 23, 2024

There are two areas in the chunk generation loop that have opportunities for improvement:

Reusing collections

In order to support the new binary search method of finding the right chunk size, we are now allocating some vectors so we can efficiently search across the upcoming chunks. A quick win would be to reuse an existing vec so that we don't have to allocate new memory on every iteration, but instead reuse the same allocation until we are done with the entire text.

Todo:

  • For the two vectors we create for allowing binary search, instead put these elements in a reused vec that lives on TextChunks
  • Make sure this is cleared at the end of every iteration of next_chunk

Challenges:

Doing this will require more methods taking a mutable reference to self (&mut self), which can make things difficult. It is possible everything goes well with a naive approach; this is just a warning. A rough sketch of the reuse pattern follows.
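A minimal sketch of the idea, not the actual text-splitter internals: the field name search_sizes, the upcoming parameter, and the body of next_chunk are hypothetical; only the TextChunks type and the next_chunk method name come from the issue.

```rust
/// Hypothetical sketch: scratch space lives on the iterator struct
/// instead of being reallocated on every call to `next_chunk`.
struct TextChunks {
    /// Reused across iterations for the binary search over upcoming chunks.
    search_sizes: Vec<usize>,
}

impl TextChunks {
    fn next_chunk(&mut self, upcoming: &[&str]) -> Option<String> {
        // Reuse the allocation from the previous iteration rather than
        // building a fresh Vec each time.
        self.search_sizes.clear();
        self.search_sizes.extend(upcoming.iter().map(|s| s.len()));

        // ... binary search over `self.search_sizes` to pick the largest
        // fitting chunk would go here ...

        upcoming.first().map(|s| s.to_string())
    }
}
```

Note that because the scratch vec is a field, any helper that fills or reads it now needs `&mut self`, which is exactly the borrow-checker friction flagged above.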

Tokenization

Most tokenizers allocate multiple Strings, Vecs, HashMaps, and other items on the heap each time they tokenize text. Ideally these libraries would offer an option to generate only the token ids themselves (some form of integer), since that is all of the information we need. But since that is unlikely, and tokenization is in the hot path of chunk calculation, reducing the number of times actual tokenization is necessary gives us quick performance wins in both allocations and CPU usage, since tokenization isn't cheap (see current benchmark output).

Since tokenization is identical for the same strings, we can likely memoize the output of chunk_size for the same string. We will have to allocate to store the results of each section of text, but this should be vastly cheaper than tokenizing again.

While not every result will be reused, multiple semantic levels quite often contain the same chunk of text, because, for example, there isn't always a difference between a character and a grapheme. Also, we tokenize each level to find the levels that could fit, and then have to check them again once we generate the chunk itself, so those results can be reused.

Todo:

  • Store the results of chunk_size post trimming into a shared, reused HashMap<Range<usize>, ChunkSize> on TextChunks, where the Range represents the range of bytes for the tokenized text. Only run chunk_size again if we have a cache miss.
  • Clear the hashmap whenever we move the cursor, since all future tokenization will have ranges that start at a later offset, which invalidates all of the cached values. (A rough sketch of this cache is below.)
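A minimal sketch of the caching scheme under the assumptions above. TextChunks, ChunkSize, chunk_size, and the HashMap<Range<usize>, ChunkSize> key type come from the issue; the field names cursor and chunk_size_cache, the advance_cursor method, and the tokenize_len placeholder are hypothetical.

```rust
use std::collections::HashMap;
use std::ops::Range;

/// Hypothetical stand-in for the crate's ChunkSize type.
#[derive(Clone, Copy)]
struct ChunkSize(usize);

struct TextChunks {
    cursor: usize,
    /// Memoized chunk sizes keyed by the byte range of the tokenized text.
    chunk_size_cache: HashMap<Range<usize>, ChunkSize>,
}

impl TextChunks {
    fn chunk_size(&mut self, range: Range<usize>, text: &str) -> ChunkSize {
        // Only tokenize on a cache miss; the same byte range of the same
        // text always tokenizes to the same size.
        *self
            .chunk_size_cache
            .entry(range)
            .or_insert_with(|| ChunkSize(tokenize_len(text)))
    }

    fn advance_cursor(&mut self, new_offset: usize) {
        self.cursor = new_offset;
        // Moving the cursor invalidates every cached entry, since all future
        // ranges start at a later offset.
        self.chunk_size_cache.clear();
    }
}

/// Placeholder for a real tokenizer call.
fn tokenize_len(text: &str) -> usize {
    text.split_whitespace().count()
}
```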
@benbrandt benbrandt converted this from a draft issue Mar 23, 2024
@benbrandt benbrandt changed the title Optimize number of allocations from tokenizers Optimize number of allocations in chunk generation Mar 23, 2024
@benbrandt benbrandt moved this from Backlog to Ready in text-splitter Roadmap Mar 23, 2024
@benbrandt benbrandt changed the title Optimize number of allocations in chunk generation Reduce number of allocations in chunk generation Mar 23, 2024
@benbrandt benbrandt moved this from Ready to In progress in text-splitter Roadmap Mar 23, 2024
@benbrandt benbrandt self-assigned this Mar 23, 2024
@benbrandt benbrandt linked a pull request Mar 24, 2024 that will close this issue
@github-project-automation github-project-automation bot moved this from In progress to Done in text-splitter Roadmap Mar 25, 2024