Has anyone got a 405B model running with node-llama-cpp? #302

debrisapron · 2024-08-22T20:30:47Z

debrisapron
Aug 22, 2024

Trying to run the new Hermes 3 405B model. I have a server with 8 nvidia A100s but none of the layers are being offloaded and I keep getting mysterious CUDA errors. Log sample:

ggml_cuda_init: failed to initialize CUDA: unknown error
llm_load_tensors: ggml ctx size =    0.53 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/127 layers to GPU
llm_load_tensors:        CPU buffer size = 231801.72 MiB
CUDA error: unknown error
llama_new_context_with_model: n_ctx      = 131072
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_cuda_host_malloc: failed to allocate 64512.00 MiB of pinned memory: unknown error

Has anyone had any luck with this model, or any other 405B model? I haven't used node-llama-cpp with multiple GPUs before, is there anything special I need to do? I have it working fine with the Hermes 3 8B model on a single A100.

giladgd · 2024-08-23T15:54:27Z

giladgd
Aug 23, 2024
Maintainer

The beta of version 3 has much better support for CUDA, consider using it instead of the stable version.
Also, try running this command (inside of a project that uses the beta of version 3) to check how your GPU setup is recognized by llama.cpp, this may help you spot any configuration or compatibility issues:

npx --no node-llama-cpp inspect gpu

0 replies

debrisapron · 2024-08-27T17:14:18Z

debrisapron
Aug 27, 2024
Author

Hi Gilad, so I eventually got this working! I was actually already on the beta (beta 44 to be specific), and also building a very recent version of llama.cpp from source. In the end the thing that fixed it was assigning an absolutely humungous amount of RAM, 256GB to be precise. RAM not VRAM note!

So as I say I have it working on a cluster of 8xH100s, each with 80GB VRAM, on 64 cpus & 256GB of RAM. I expected this monstrous machine to be able to run it pretty fast, but all I'm getting is a miserable 6 tokens/sec! I can see from the llama.cpp logs that all the tensors are being offloaded to the GPUs so it doesn't seem like it's a CUDA problem. Any thoughts? In particular, are there any flags I should be setting when building llama.cpp?

1 reply

giladgd Aug 29, 2024
Maintainer

I'm not aware of any specific CMake build flag that might be the cure for this, but here are a few suggestions of things you might want to try:

Try playing with different values of the threads option when creating a context. I'm working on optimizing the default value for this but haven't released it yet.
Try enabling flash attention on the model by setting the defaultContextFlashAttention option to true.
Try limiting the context size and see if it has any effect on the performance.
Keep in mind that not all GGUF files were converted from source properly, so switching to a different GGUF from a different source might help. Also, try using different quantization types, since different quantizations have different backend implementations, so you might find that a specific version of the model works better with CUDA on your hardware than another version of the model.

I haven't yet tested using llama.cpp on a machine with many high-end GPUs, so let me know whether and how you managed to get it working, so I can improve node-llama-cpp or mention the solution in the docs of version 3 (which I'm actively working on).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Has anyone got a 405B model running with node-llama-cpp? #302

{{title}}

Replies: 2 comments 1 reply

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

Has anyone got a 405B model running with node-llama-cpp? #302

debrisapron Aug 22, 2024

Replies: 2 comments · 1 reply

giladgd Aug 23, 2024 Maintainer

debrisapron Aug 27, 2024 Author

giladgd Aug 29, 2024 Maintainer

debrisapron
Aug 22, 2024

Replies: 2 comments 1 reply

giladgd
Aug 23, 2024
Maintainer

debrisapron
Aug 27, 2024
Author

giladgd Aug 29, 2024
Maintainer