[Q] Memory Requirements for Different Model Sizes #13
Since the original models use FP16 and llama.cpp quantizes to 4-bit, the memory requirements are around 4 times smaller than the original:
On an M1 Max with 64 GB RAM, 4-bit 65B: 38.5 GB, 850 ms per token.
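These figures line up with a back-of-the-envelope estimate. A minimal sketch (the nominal parameter counts and the ~4.5 effective bits per weight for ggml's 4-bit format are my assumptions, not numbers from this thread):

```python
# Rough weights-only memory estimate at different precisions.
# Parameter counts are nominal (e.g. "30B" actually has slightly more);
# ggml's 4-bit quantization stores a bit more than 4 bits per weight once
# per-block scale factors are included (~4.5 bits is an assumed figure).
PARAMS = {"7B": 7e9, "13B": 13e9, "30B": 30e9, "65B": 65e9}

def weights_gib(n_params: float, bits_per_weight: float) -> float:
    """GiB needed for the weights alone (no KV cache or scratch buffers)."""
    return n_params * bits_per_weight / 8 / 2**30

for name, n in PARAMS.items():
    print(f"{name}: FP16 ~{weights_gib(n, 16):.0f} GiB, "
          f"4-bit ~{weights_gib(n, 4.5):.1f} GiB")
```

For 65B this gives roughly 34 GiB of weights; the 38.5 GB observed above also includes the KV cache and runtime buffers, so the two are consistent.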
For the record: Intel® Core™ i5-7600K CPU @ 3.80GHz × 4, 16 GB RAM, under Ubuntu, the 13B model runs with acceptable response time. Note that, as mentioned in previous comments, the -t 4 parameter gives the best results. Great work!
Should add these to the README.
@prusnak is that PC RAM or GPU VRAM?
llama.cpp runs on the CPU, not the GPU, so it's PC RAM.
Is it possible that at some point we will get a video card version?
I don't think so. You can run the original Whisper model on a GPU: https://github.com/openai/whisper
FWIW, running on my M2 MacBook Air with 8 GB of RAM comes to a grinding halt. On first run, about 2-3 minutes of a completely unresponsive machine (mouse and keyboard locked), then about 10-20 seconds per response word. I didn't expect great response times, but that's a bit slower than anticipated.
Close every other app, and ideally reboot to a clean state. This should help. If you see an unresponsive machine, then it is swapping memory to disk. 8 GB is not that much, especially if you have browsers, Slack, etc. running.
Also make sure you're using 4 threads instead of 8: you don't want to be using any of the 4 efficiency cores.
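For illustration, here's one way to automate that choice on Apple Silicon. A small sketch (assumes macOS; the `hw.perflevel0.physicalcpu` sysctl key reports the performance-core count on recent macOS, and the fallback value is arbitrary):

```python
import subprocess

def performance_cores(default: int = 4) -> int:
    """Performance-core count on Apple Silicon via sysctl (macOS only).

    Falls back to `default` if the key is missing or the call fails."""
    try:
        out = subprocess.run(
            ["sysctl", "-n", "hw.perflevel0.physicalcpu"],
            capture_output=True, text=True, check=True,
        )
        return int(out.stdout.strip())
    except (subprocess.CalledProcessError, FileNotFoundError, ValueError):
        return default

# Pass the result to llama.cpp as its thread-count flag, e.g. -t 4.
print(f"-t {performance_cores()}")
```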
Requirements added in #269.
32 GB is probably a little too optimistic: I have 32 GB of DDR4 clocked at 3600 MHz and it generates a token every 2 minutes.
Yeah, 38.5 GB is more realistic, since the whole model is loaded into memory as of now. See https://github.com/ggerganov/llama.cpp#memorydisk-requirements for current values.
I see. That makes more sense, since you mention the whole model is loaded into memory as of now. Linux would probably run better in this case, thanks to its better swap handling and lower memory usage. Thanks!
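As a rough illustration of why not fitting in RAM is so punishing (my own arithmetic, not from the thread): generating a token reads every weight once, so whatever part of the model is not resident in RAM must come from disk on every single token. A lower-bound sketch (the disk throughput is an assumed figure):

```python
# Hypothetical lower bound on per-token latency once the model spills to disk.
# Assumes each token touches all weights once and only the non-resident part
# is re-read; real swap thrashing with random access is far slower.
model_gb = 38.5   # 4-bit 65B, figure from this thread
ram_gb = 32.0     # total RAM; the usable amount is lower still
disk_gbps = 2.0   # assumed SSD sequential read speed, GB/s

spill_gb = max(0.0, model_gb - ram_gb)
print(f"at least {spill_gb:.1f} GB read from disk per token "
      f"=> >= {spill_gb / disk_gbps:.1f} s/token before any compute")
```

That floor is a few seconds per token even under ideal sequential reads; the observed 2 minutes per token is consistent with the much slower random-access pattern of actual swapping.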
What languages does it work with? Does it work with the same input and output languages as GPT?
We actually build a dylib on macOS.