Make responses start faster by removing unnecessary cleanup calls #6625
The `clear_torch_cache()` function takes about 0.08 seconds to run because it includes a call to `gc.collect()`. Previously, this function was called twice before each generation to address memory leaks in Transformers during text streaming.

Changes made:

- Removed `clear_torch_cache()` calls for loaders other than Transformers, saving approximately 0.2 seconds per generation and making replies start faster both in the UI and the API.
- Reduced the number of `clear_torch_cache()` calls for Transformers from two to one, cutting the time spent on this function by half.
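
The change described above can be illustrated with a minimal sketch. It is not the code from this PR: `clear_cache_stand_in()`, `prepare_generation()`, `time_cleanup()`, and the `"llama.cpp"` loader name are hypothetical, and the stand-in only runs `gc.collect()` rather than the real `clear_torch_cache()`.

```python
import gc
import time

# Counter kept only so the sketch can show how often cleanup runs.
cleanup_calls = {"count": 0}


def clear_cache_stand_in() -> None:
    """Hypothetical stand-in for clear_torch_cache(); it only runs gc.collect()."""
    cleanup_calls["count"] += 1
    gc.collect()


def prepare_generation(loader: str) -> None:
    """Run cleanup once, and only when the loader is Transformers."""
    if loader == "Transformers":
        clear_cache_stand_in()
    # Other loaders skip the cleanup entirely (assumed behavior after the change).


def time_cleanup(repeats: int = 5) -> float:
    """Return the average wall-clock seconds taken by one gc.collect() call."""
    start = time.perf_counter()
    for _ in range(repeats):
        gc.collect()
    return (time.perf_counter() - start) / repeats


if __name__ == "__main__":
    prepare_generation("llama.cpp")      # hypothetical non-Transformers loader
    prepare_generation("Transformers")
    print("cleanup calls:", cleanup_calls["count"])
    print(f"avg gc.collect() time: {time_cleanup():.4f} s")
```

The measured time depends on how many objects the process holds, so a small script like this will likely report far less than the 0.08 seconds seen with a loaded model.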