
Model doesn't load on >2 GPU anymore. Says ggml_new_object: not enough space in the context's memory pool #4114

Closed
Ph0rk0z opened this issue Nov 17, 2023 · 9 comments · Fixed by #4115

Comments


Ph0rk0z commented Nov 17, 2023

Expected Behavior

Model loaded to 2x3090 + 1 or 2 P40 loads and functions:

llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 2192.00 MB
llama_new_context_with_model: kv self size  = 2192.00 MB
llama_build_graph: non-view tensors processed: 3155/3155
llama_new_context_with_model: compute buffer total size = 574.63 MB
llama_new_context_with_model: VRAM scratch buffer: 568.00 MB
llama_new_context_with_model: total VRAM used: 65972.68 MB (model: 63212.67 MB, context: 2760.00 MB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
2023-11-16 12:33:35 INFO:Loaded the model in 136.40 seconds.

Current Behavior

Model fails with an error:


ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llm_load_tensors: mem required  =  141.08 MB
llm_load_tensors: offloading 137 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 140/140 layers to GPU
llm_load_tensors: VRAM used: 63212.67 MB
....................................................................................................
llama_new_context_with_model: n_ctx      = 4096
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 2192.00 MB
llama_new_context_with_model: kv self size  = 2192.00 MB
ggml_new_object: not enough space in the context's memory pool (needed 1638880, available 1638544)
Segmentation fault (core dumped)

Failure Information (for bugs)

I'm mainly using the Python bindings: v2.17 works and v2.18 doesn't, with the same settings. Whether I try to load a 180B or a 120B model, this is what I get. I have more than enough VRAM, but for some reason it dies allocating CPU RAM even though the model has already been loaded.

I tried NUMA and mlock to no avail. This is using the MMQ kernels, so nothing there should have changed.

The last commit it was working on was df9d129.

I tried reverting 1cf2850 manually but that wasn't it.

I will also try with today's commits and update this issue with what happens. I ruled out the Python wrapper as the cause by pairing the 2.18 bindings with a llama.cpp revision that is known to work.

slaren (Collaborator) commented Nov 17, 2023

Does increasing LLAMA_MAX_NODES in llama.cpp fix it?
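For reference, LLAMA_MAX_NODES is a compile-time constant near the top of llama.cpp, so increasing it means editing the source and rebuilding. A minimal sketch of the change; the default value shown is an assumption about this revision, not a quote of it:

// llama.cpp -- raise the graph node limit and rebuild
//#define LLAMA_MAX_NODES 4096   // assumed default at this revision
#define LLAMA_MAX_NODES 8192     // larger models build larger graphs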

Ph0rk0z (Author) commented Nov 17, 2023

Just tested it, and no (I set it to 2048), though that does halve the values it complains about. I have over 256 GB of CPU RAM, so it's interesting that it fails after asking for 80 GB and reporting 163 GB free.

slaren (Collaborator) commented Nov 17, 2023

You have to increase it, not decrease it. Try at least 8192.

Ph0rk0z (Author) commented Nov 17, 2023

Ok, I will do that.

Ph0rk0z (Author) commented Nov 17, 2023

Ok, that worked; the model loads again. Is there any reason to set this number higher?

slaren (Collaborator) commented Nov 17, 2023

It sets the maximum number of tensors in the computation graphs. Generally we want to keep it as low as possible to avoid wasting memory, but it seems that the larger models require a higher value.
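To make that concrete: the host-side scratch that backs the graph's ggml context is sized from this constant, so every tensor object created while building the graph has to fit into that pool. Roughly (paraphrased from memory, not a verbatim quote of llama.cpp at this revision):

// llama_new_context_with_model: the compute buffer is proportional to the node limit
ctx->buf_compute.resize(ggml_tensor_overhead()*LLAMA_MAX_NODES + ggml_graph_overhead());

// llama_build_graph: the graph itself is capped at the same limit
struct ggml_cgraph * gf = ggml_new_graph_custom(ctx0, LLAMA_MAX_NODES, false);

A 120B/180B model with 137 repeating layers simply creates more tensor objects than the default pool can hold, which is exactly the "ggml_new_object: not enough space in the context's memory pool" failure above.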

Ph0rk0z (Author) commented Nov 17, 2023

And that only affects CPU memory? I don't think I noticed any difference in VRAM.

slaren (Collaborator) commented Nov 17, 2023

Yes, it only increases CPU RAM usage, not VRAM.
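As a rough sanity check on how small that cost is, here is a tiny standalone calculation using only the byte counts from the error message above; the default node count of 4096 and the bump to 8192 are assumptions, not values stated in this thread:

#include <cstddef>
#include <cstdio>

int main() {
    // byte counts copied from the error message in this issue
    const size_t available = 1638544; // pool size with the default node limit
    const size_t needed    = 1638880; // what the large model's graph required

    // assumed values: default LLAMA_MAX_NODES and the suggested bump
    const size_t default_nodes = 4096;
    const size_t bumped_nodes  = 8192;

    const double per_node = double(available) / default_nodes; // ~400 bytes per node
    printf("approx. overhead per graph node: %.0f bytes\n", per_node);
    printf("shortfall that caused the failure: %zu bytes\n", needed - available);
    printf("extra host RAM from bumping to %zu nodes: ~%.1f MiB\n",
           bumped_nodes, per_node * (bumped_nodes - default_nodes) / (1024.0 * 1024.0));
    return 0;
}

In other words, doubling the limit costs on the order of a couple of MiB of host RAM, which is why it does not show up in VRAM usage at all.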

countzero commented:

I could reproduce the error trying to launch the Goliath 120B model:

ggml_new_object: not enough space in the context's memory pool (needed 1638880, available 1638544)

@slaren Thanks for the fix!
@ggerganov Thanks for the new release!

I can confirm that the issue is now fixed as of https://github.com/ggerganov/llama.cpp/releases/tag/b1535.

3 participants