
Reduce number of allocations in chunk generation #115

Closed · 4 tasks done
benbrandt opened this issue Mar 23, 2024 · 0 comments · Fixed by #121
benbrandt (Owner) commented Mar 23, 2024

There are two areas in the chunk generation loop that have opportunities for improvement:

Reusing collections

In order to support the new binary search method of finding the right chunk size, we are now allocating some vectors so we can efficiently search across the upcoming chunks. A quick win would be to reuse an existing vec so that we don't have to allocate new memory on every iteration, but instead reuse the same allocation until we are done with the entire text.

Todo:

  • For the two vectors we create for allowing binary search, instead put these elements in a reused vec that lives on TextChunks
  • Make sure this is cleared at the end of every iteration of next_chunk

Challenges:

Doing this will require more methods taking a mutable reference to self (&mut self), which can make things difficult. It is possible everything goes well with a naive approach; this is just a warning. A rough sketch of the reuse pattern follows.
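A minimal sketch of the idea, not the actual text-splitter internals: the field name search_sizes, the upcoming parameter, and the body of next_chunk are hypothetical; only the TextChunks type and the next_chunk method name come from the issue.

```rust
/// Hypothetical sketch: scratch space lives on the iterator struct
/// instead of being reallocated on every call to `next_chunk`.
struct TextChunks {
    /// Reused across iterations for the binary search over upcoming chunks.
    search_sizes: Vec<usize>,
}

impl TextChunks {
    fn next_chunk(&mut self, upcoming: &[&str]) -> Option<String> {
        // Reuse the allocation from the previous iteration rather than
        // building a fresh Vec each time.
        self.search_sizes.clear();
        self.search_sizes.extend(upcoming.iter().map(|s| s.len()));

        // ... binary search over `self.search_sizes` to pick the largest
        // fitting chunk would go here ...

        upcoming.first().map(|s| s.to_string())
    }
}
```

Note that because the scratch vec is a field, any helper that fills or reads it now needs `&mut self`, which is exactly the borrow-checker friction flagged above.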

Tokenization

Most tokenizers allocate multiple Strings, Vecs, HashMaps, and other items on the heap each time they tokenize text. Ideally these libraries would offer an option to generate only the token ids themselves (some form of integer), since that is all of the information we need. But since that is unlikely, and tokenization is in the hot path of chunk calculation, reducing the number of times actual tokenization is necessary gives us quick performance wins in both allocations and CPU usage, since tokenization isn't cheap (see current benchmark output).

Since tokenization is identical for the same strings, we can likely memoize the output of chunk_size for the same string. We will have to allocate to store the results of each section of text, but this should be vastly cheaper than tokenizing again.

While not every result will be reused, multiple semantic levels quite often contain the same chunk of text, because, for example, there isn't always a difference between a character and a grapheme. Also, we tokenize each level to find the levels that could fit, and then have to check them again once we generate the chunk itself, so those results can be reused.

Todo:

  • Store the results of chunk_size post trimming into a shared, reused HashMap<Range<usize>, ChunkSize> on TextChunks, where the Range represents the range of bytes for the tokenized text. Only run chunk_size again if we have a cache miss.
  • Clear the hashmap whenever we move the cursor, since all future tokenization will have ranges that start at a later offset, which invalidates all of the cached values. (A rough sketch of this cache is below.)
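A minimal sketch of the caching scheme under the assumptions above. TextChunks, ChunkSize, chunk_size, and the HashMap<Range<usize>, ChunkSize> key type come from the issue; the field names cursor and chunk_size_cache, the advance_cursor method, and the tokenize_len placeholder are hypothetical.

```rust
use std::collections::HashMap;
use std::ops::Range;

/// Hypothetical stand-in for the crate's ChunkSize type.
#[derive(Clone, Copy)]
struct ChunkSize(usize);

struct TextChunks {
    cursor: usize,
    /// Memoized chunk sizes keyed by the byte range of the tokenized text.
    chunk_size_cache: HashMap<Range<usize>, ChunkSize>,
}

impl TextChunks {
    fn chunk_size(&mut self, range: Range<usize>, text: &str) -> ChunkSize {
        // Only tokenize on a cache miss; the same byte range of the same
        // text always tokenizes to the same size.
        *self
            .chunk_size_cache
            .entry(range)
            .or_insert_with(|| ChunkSize(tokenize_len(text)))
    }

    fn advance_cursor(&mut self, new_offset: usize) {
        self.cursor = new_offset;
        // Moving the cursor invalidates every cached entry, since all future
        // ranges start at a later offset.
        self.chunk_size_cache.clear();
    }
}

/// Placeholder for a real tokenizer call.
fn tokenize_len(text: &str) -> usize {
    text.split_whitespace().count()
}
```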
@benbrandt benbrandt converted this from a draft issue Mar 23, 2024
@benbrandt benbrandt changed the title Optimize number of allocations from tokenizers Optimize number of allocations in chunk generation Mar 23, 2024
@benbrandt benbrandt moved this from Backlog to Ready in text-splitter Roadmap Mar 23, 2024
@benbrandt benbrandt changed the title Optimize number of allocations in chunk generation Reduce number of allocations in chunk generation Mar 23, 2024
@benbrandt benbrandt moved this from Ready to In progress in text-splitter Roadmap Mar 23, 2024
@benbrandt benbrandt self-assigned this Mar 23, 2024
@benbrandt benbrandt linked a pull request Mar 24, 2024 that will close this issue
@github-project-automation github-project-automation bot moved this from In progress to Done in text-splitter Roadmap Mar 25, 2024