-
There is no GPU use during the process; only the CPU does the work... I know, right? It's black magic what he has achieved!
-
Note that even the simplest C++ paradigms carry a cost. For instance, classes with virtual methods can increase the cost of a function call tenfold, since those calls are no longer made directly but routed through virtual function tables. You need constructors, destructors, extra (de)allocations of memory and all that nonsense. All these individually small things add up. C is about as fast as you can get without handcrafting assembly.

As an example, the latest compiled Windows binary weighs just 190 KB for llama.exe and 85 KB for quantize.exe, with very few imports. Obviously the code paths actually exercised matter most, so size doesn't correlate perfectly with speed, but do me a favor: open the binary in your favorite disassembler (Ghidra, IDA Pro, etc.), then open the disassembly of some other implementation and compare, and you'll instantly see what I'm talking about. I am delighted to see such lean and fast code; it is a rare sight these days. I think most people don't really grasp what an exceptional job @ggerganov has done here, condensing what is in essence a pretty complex thing into such a minimal, fast, close-to-hardware implementation without resorting to outside libraries. This not only makes it very fast but also very portable, since it has no dependencies. Most people would just slap one chonky library on top of another and then wonder why the code isn't fast.

That said, for these kinds of workloads GPUs are dramatically faster, so even less-optimized GPU code should outrun a well-optimized CPU implementation like this one. A CUDA implementation written in the llama.cpp style would be exceptionally fast.
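To make the vtable point concrete, here is a minimal C++ sketch contrasting a direct call with a virtual one. The names are purely illustrative, not from llama.cpp, and the actual overhead depends heavily on the compiler and workload:

```cpp
#include <cstdio>

// Direct call: the compiler knows the target at compile time and can inline it.
static float scale_direct(float x) { return x * 2.0f; }

// Virtual call: the target is looked up through the object's vtable at runtime,
// which adds an indirect branch per call and usually blocks inlining.
struct Op {
    virtual ~Op() = default;
    virtual float apply(float x) const = 0;
};

struct Scale : Op {
    float apply(float x) const override { return x * 2.0f; }
};

int main() {
    const Op *op = new Scale();      // heap allocation + hidden vtable pointer
    float a = scale_direct(1.5f);    // direct, inlinable
    float b = op->apply(1.5f);       // indirect, dispatched through the vtable
    std::printf("%f %f\n", a, b);
    delete op;
    return 0;
}
```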
-
Memory bandwidth is the bottleneck with these models: no one has several gigabytes of cache yet, and the ratio of instruction throughput to memory speed is massive. Given that, I would recommend using a 4-bit quantized model on the 3090; there is little difference in output quality. https://github.com/qwopqwop200/GPTQ-for-LLaMA/ If you want it faster, you need the whole model to fit on your graphics card, which means switching to the 30B model (4-bit). Let us know how it goes.
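A rough way to see why bandwidth dominates: each generated token has to stream essentially all of the weights through memory once, so tokens/sec is bounded above by bandwidth divided by model size. A back-of-envelope sketch, where the bandwidth and size figures are assumptions rather than measurements:

```cpp
#include <cstdio>

int main() {
    // Upper bound: tokens/s <= memory bandwidth / bytes of weights read per token.
    const double model_gb = 30e9 * 0.5 / 1e9; // ~15 GB of 4-bit 30B weights (ignoring scales)
    const double gpu_bw   = 936.0;            // GB/s, RTX 3090 spec bandwidth (assumed)
    const double cpu_bw   = 50.0;             // GB/s, typical dual-channel DDR4 (assumed)

    std::printf("3090 bound: ~%.0f tok/s\n", gpu_bw / model_gb); // ~62
    std::printf("CPU  bound: ~%.0f tok/s\n", cpu_bw / model_gb); // ~3
    return 0;
}
```

Real throughput will land well below these bounds, but the ratio explains both why quantizing helps (fewer bytes per token) and why a GPU should, in principle, be far ahead of any CPU.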
-
I've been testing the 8-bit 6B LLaMA on my 3090, and my results were at best as fast as your CPU video.
I noticed the GPU does not get used much, so I assume the C++ code is well optimized and the GPU implementation I ran was not.
Maybe someone has additional insights?
If the CPU gives such high speed, a 3090 should deliver hundreds of tokens/sec or more; I was running at 4-5.