Has anyone got a 405B model running with node-llama-cpp? #302
Replies: 2 comments 1 reply
-
The beta of version 3 has much better CUDA support; consider using it instead of the stable version. You can check what it detects with `npx --no node-llama-cpp inspect gpu`.
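If it's easier to check from code, here is a minimal sketch of the programmatic equivalent, assuming the v3 beta's `getLlama` API and its `gpu` option (adjust to whatever your installed beta actually exposes):

```typescript
import {getLlama} from "node-llama-cpp";

// Requesting CUDA explicitly (rather than "auto") so we don't silently fall back to the CPU.
const llama = await getLlama({gpu: "cuda"});

// Reports which backend was actually loaded ("cuda", "metal", "vulkan", or false).
console.log("Active GPU backend:", llama.gpu);
```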
-
Hi Gilad, I eventually got this working! I was actually already on the beta (beta 44, to be specific) and also building a very recent version of llama.cpp from source. In the end, the fix was assigning an absolutely humongous amount of RAM: 256GB, to be precise. RAM, not VRAM, note! So I now have it working on a cluster of 8x H100s, each with 80GB of VRAM, with 64 CPUs and 256GB of RAM. I expected this monstrous machine to run it pretty fast, but all I'm getting is a miserable 6 tokens/sec. I can see from the llama.cpp logs that all the tensors are being offloaded to the GPUs, so it doesn't seem to be a CUDA problem. Any thoughts? In particular, are there any flags I should be setting when building llama.cpp?
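For reference, this is roughly how I'm loading the model and measuring the 6 tokens/sec. Treat it as a sketch: the `gpuLayers: "max"`, `batchSize` and `maxTokens` option names are my reading of the v3 beta docs, and the model path is just a placeholder for the first shard of the split GGUF.

```typescript
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const llama = await getLlama({gpu: "cuda"});

// Placeholder path: the first shard of the split 405B GGUF.
const model = await llama.loadModel({
    modelPath: "/models/Hermes-3-Llama-3.1-405B/model-00001-of-00009.gguf",
    gpuLayers: "max"   // request that every layer be offloaded to the GPUs
});

const context = await model.createContext({batchSize: 512});
const session = new LlamaChatSession({contextSequence: context.getSequence()});

// Crude throughput check: time one generation and count the tokens produced.
const start = Date.now();
const answer = await session.prompt("Summarize the plot of Hamlet.", {maxTokens: 256});
const seconds = (Date.now() - start) / 1000;
const tokens = model.tokenize(answer).length;
console.log(`${tokens} tokens in ${seconds.toFixed(1)}s (~${(tokens / seconds).toFixed(1)} tok/s)`);
```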
-
Trying to run the new Hermes 3 405B model. I have a server with 8 NVIDIA A100s, but none of the layers are being offloaded and I keep getting mysterious CUDA errors. Log sample:
Has anyone had any luck with this model, or with any other 405B model? I haven't used node-llama-cpp with multiple GPUs before; is there anything special I need to do? I have it working fine with the Hermes 3 8B model on a single A100. A minimal sketch of what I'm running is below.
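In case it helps to see code, this is essentially it; the `gpu` and `gpuLayers` options are my reading of the node-llama-cpp docs, and the path is a placeholder for the first shard of the split GGUF:

```typescript
import {getLlama} from "node-llama-cpp";

// Request CUDA explicitly; if the CUDA build can't be loaded this should fail here
// rather than quietly running on the CPU.
const llama = await getLlama({gpu: "cuda"});
console.log("GPU backend:", llama.gpu);

// Placeholder path: point at the first shard only. llama.cpp reads the remaining
// split-GGUF parts itself and spreads the offloaded layers across the visible GPUs.
const model = await llama.loadModel({
    modelPath: "/models/Hermes-3-Llama-3.1-405B/model-00001-of-00009.gguf",
    gpuLayers: "max"   // request that all layers be offloaded to the GPUs
});

console.log("Model loaded");
```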