Unload and reload models on request #471
Conversation
I like the reloading idea, as I've been switching to another model and then returning to the updated model. I'm not sure what the app state would be in an 'unloaded' condition; perhaps we just need the reload implementation?
Well, the state after unloading the checkpoint would be undetermined. One won't be able to generate a response, but the resulting error is not fatal, and one can resume chat text generation once the model is loaded back in; that much I tested. The core idea and the use case are simple: when
It now shows a message in the console when unloading weights. Also, reload_model() calls unload_model() first to free the memory, so that multiple reloads won't overfill it.
In the latest gradio version, there is now a circle icon in dropdown menus that unselects the currently selected option. I have modified the PR to use this button to unload the model from memory. Your buttons were more functional because they allowed the very same model to be reloaded without having to locate it in the dropdown list, but I found that they occupied a lot of space for a very niche feature. It should still be possible to create unload/reload buttons inside an extension.
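Wiring the dropdown's clear button to unloading could look roughly like this. The handler and the loader stub below are hypothetical; the one assumption taken from the comment above is that gradio passes `None` as the dropdown value when the circle icon unselects the current option:

```python
class DummyModel:
    # Stand-in for a loaded checkpoint; the real object is a transformers model.
    def __init__(self, name):
        self.name = name


def load_model(name):
    # Hypothetical loader stub.
    return DummyModel(name)


def handle_model_change(selected_name, current_model):
    """Map the dropdown value to a model object.

    A value of None (the circle icon clearing the selection) is
    treated as an unload request.
    """
    if selected_name is None:
        # Returning None drops the reference to the weights, letting
        # them be garbage-collected so VRAM can be reclaimed.
        return None
    if current_model is not None and current_model.name == selected_name:
        return current_model  # same model already loaded, nothing to do
    return load_model(selected_name)
```

This also illustrates the trade-off mentioned above: since clearing the dropdown discards the selection, reloading the same model requires picking it again from the list, which dedicated unload/reload buttons would avoid.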
That's a nice way to save space!
An important step toward running different neural networks in parallel on the same GPU.
The core idea and the use case are simple: when oobabooga is used alongside other memory hogs like Stable Diffusion (the sd-api-pictures extension) or Tortoise-TTS (not yet implemented), this simple unload function leaves a lot more video memory for those other neural networks to work with. Once they finish their jobs, the LLM can be returned to VRAM. This is the first of the possible improvements to the memory handling discussed in #309.
Tested on my machine: unloading Pyg-2.7B-8bit is almost instant, and loading it back (from the RAM cache) takes ~7 seconds, which I consider an acceptable delay compared to the image generation itself.
Pyg-6B-8bit is a bit slower but still tolerable.