is it too much of me to ask for an MPI option like llama.cpp? #286
This may or may not be a stupid question, but what is MPI?
@turboderp It's not a stupid question if you've tried this on a Raspberry Pi cluster. Basically, MPI (Message Passing Interface) enables clustering for llama models.
Again, just my thoughts.
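To make the clustering idea concrete: below is a minimal sketch, assuming mpi4py and NumPy, of how a llama-style model's layers could be split across MPI ranks in a pipeline, with each rank passing its activations to the next. The layer count, hidden size, and `run_layer` stub are illustrative assumptions; this is not ExLlama or llama.cpp code.

```python
# Minimal sketch (not ExLlama or llama.cpp code): pipeline-splitting a
# transformer's layers across MPI ranks with mpi4py.
# Run with e.g.:  mpirun -n 4 python pipeline_sketch.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

NUM_LAYERS = 32   # hypothetical model depth; assumes it divides evenly by size
HIDDEN = 4096     # hypothetical hidden-state size

# Each rank owns a contiguous slice of the layers.
layers_per_rank = NUM_LAYERS // size
my_layers = range(rank * layers_per_rank, (rank + 1) * layers_per_rank)

def run_layer(layer_idx, hidden_state):
    """Placeholder for the real per-layer compute (attention + MLP)."""
    return hidden_state  # identity here, just to show the data flow

if rank == 0:
    # First stage starts from a stand-in for the embedding output.
    hidden = np.zeros(HIDDEN, dtype=np.float32)
else:
    # Later stages block until the previous rank sends its activations.
    hidden = np.empty(HIDDEN, dtype=np.float32)
    comm.Recv(hidden, source=rank - 1)

for i in my_layers:
    hidden = run_layer(i, hidden)

if rank < size - 1:
    comm.Send(hidden, dest=rank + 1)   # hand off to the next pipeline stage
else:
    print(f"final hidden state computed on rank {rank}")
```

As far as I know, llama.cpp's MPI mode follows roughly this shape, distributing the model's layers across hosts and forwarding activations between them, which is what lets a cluster of small devices (like Raspberry Pis) hold a model none of them could fit alone.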
I don't know, ExLlama is really focused on consumer GPUs. This would be asking for a complete rewrite so it can run on clusters of embedded devices instead, and it basically boils down to "can this project be llama.cpp instead?" So I don't really think it's realistic.

As for the name, I didn't really give it much thought. It doesn't have those connotations to me, is all I can say, I guess. Think of it as "extra", maybe? And it's not categorically the fastest way to run Llama, either. It really depends on the use case.
P.S.: v2 is impressive. You guys are doing great.
Benchmarks tend to become outdated very quickly. When ExLlama first came out there was no CUDA support in llama.cpp at all, for instance, AutoGPTQ didn't exist, and GPTQ-for-Llama was still using essentially the same kernel written for the original GPTQ paper. Since then, llama.cpp has had a huge amount of work put into it, AutoGPTQ has incorporated the ExLlama (v1) kernel, and there's also AWQ, vLLM... something called OmniQuant..?

ExLlama is definitely a fast option, and depending on what you need to do, what your hardware setup is, etc., it may be the fastest in your case. If you want to run an inference server for an online chat service, you should probably look at TGI or vLLM or something. If you want to run on Apple Silicon, llama.cpp is (I think?) the only way to go. If you have an older NVIDIA GPU (Pascal or earlier), AutoGPTQ is probably still the best option. So it all depends.
I've always been looking for the optimal (cheapest) way to run large models, and I'm kind of tired of going to extremes (because I'd need to "upgrade", and that means my other devices become "obsolete").

However, is an MPI option on the roadmap? I'd really hope to see it happen.

Thanks in advance for the great work, by the way.