There are two areas in the chunk generation loop that have opportunities for improvement:
Reusing collections
In order to support the new binary search method of finding the right chunk size, we are now allocating some vectors so we can efficiently search across the upcoming chunks. A quick win would be to reuse an existing Vec so that we don't have to allocate new memory on every iteration, but can instead keep reusing the same allocation until we are done with the entire text.
Todo:
For the two vectors we create to allow binary search, instead put these elements in reused Vecs that live on TextChunks (see the sketch after the challenges note below)
Make sure these are cleared at the end of every iteration of next_chunk
Challenges:
Doing this will require more methods taking a mutable reference to self (&mut self), which can make things difficult. It is possible that everything goes well with a naive approach; this is just a warning.
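A minimal sketch of the idea, assuming the scratch buffers live as fields on TextChunks and are cleared at the end of next_chunk. The field names (level_ranges, level_fits) and the exact shape of the struct are invented here for illustration, not the crate's actual API:

```rust
use std::ops::Range;

// Sketch only: the real TextChunks struct in text-splitter is generic over the
// chunk sizer and semantic split levels; the names below are assumptions.
struct TextChunks {
    // Scratch buffers kept on the struct so their allocations survive across
    // iterations instead of being rebuilt inside every call to next_chunk.
    level_ranges: Vec<Range<usize>>,
    level_fits: Vec<bool>,
}

impl TextChunks {
    // Taking &mut self here is the ripple effect mentioned in the challenges:
    // any helper that fills the scratch buffers now also needs &mut self.
    fn next_chunk(&mut self) -> Option<Range<usize>> {
        // ... populate self.level_ranges / self.level_fits for the upcoming
        // sections and binary search across them to pick the next chunk ...
        let chunk = self.level_ranges.first().cloned();

        // Clear at the end of every iteration: clear() keeps the capacity but
        // drops the stale contents, so the next call allocates nothing new.
        self.level_ranges.clear();
        self.level_fits.clear();

        chunk
    }
}
```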
Tokenization
Most tokenizers allocate multiple Strings, Vecs, HashMaps, and other items on the heap each time they tokenize text. Ideally there would be an option in these libraries to generate only the token ids themselves (some form of integer), since that is all the information we need. But since that is unlikely, and tokenization is in the hot path of chunk calculation, reducing the number of times actual tokenization is necessary can give some quick performance wins in terms of both allocations and CPU usage, since tokenization isn't cheap (see the current benchmark output).
Since tokenization is identical for the same string, we can memoize the output of chunk_size. We will have to allocate to store the result for each section of text, but this should be vastly cheaper than tokenizing again.
While not every result will be reused, multiple semantic levels quite often contain the same chunk of text, because there isn't always a difference between a character and a grapheme, for example. Also, we tokenize each level to find the levels that could fit, and then have to check them again once we generate the chunk itself, at which point the earlier result can be reused.
Todo:
Store the results of chunk_size, post-trimming, in a shared, reused HashMap<Range<usize>, ChunkSize> on TextChunks, where the Range represents the byte range of the tokenized text. Only run chunk_size again if we have a cache miss.
Clear the HashMap whenever we move the cursor, since this invalidates all of the cached values: all future tokenization will have ranges that start at a later offset. (A sketch of this cache follows below.)
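A minimal sketch of the cache, assuming a HashMap<Range<usize>, ChunkSize> field on TextChunks that is consulted inside chunk_size and cleared whenever the cursor advances. The field and method names are assumptions, and tokenize is only a placeholder for the real tokenizer call:

```rust
use std::collections::HashMap;
use std::ops::Range;

// Hypothetical stand-in for the crate's ChunkSize type.
#[derive(Clone, Copy)]
struct ChunkSize {
    size: usize,
}

struct TextChunks<'text> {
    text: &'text str,
    cursor: usize,
    // Cache keyed by the byte range of the (trimmed) text that was measured.
    chunk_size_cache: HashMap<Range<usize>, ChunkSize>,
}

impl<'text> TextChunks<'text> {
    fn chunk_size(&mut self, range: Range<usize>) -> ChunkSize {
        // Only tokenize on a cache miss; identical ranges reuse the stored result.
        if let Some(cached) = self.chunk_size_cache.get(&range) {
            return *cached;
        }
        let size = Self::tokenize(&self.text[range.clone()]);
        let chunk_size = ChunkSize { size };
        self.chunk_size_cache.insert(range, chunk_size);
        chunk_size
    }

    fn advance_cursor(&mut self, offset: usize) {
        self.cursor = offset;
        // All future ranges start at a later offset, so nothing cached before
        // this point can be hit again; drop every entry but keep the capacity.
        self.chunk_size_cache.clear();
    }

    // Placeholder for the real (expensive) tokenizer call.
    fn tokenize(text: &str) -> usize {
        text.split_whitespace().count()
    }
}
```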