"Not enough space in the context's memory pool" exception in 1.52 #563

Vladonai · 2023-12-13T17:45:18Z

The new version of Koboldcpp (1.52) terminates with an error when trying to load the model "nethena-mlewd-xwin-23b.Q3_K_M.gguf":

This was not the case with previous versions of the program. I have 64GB of RAM and 8GB of VRAM.

The model is from here: https://huggingface.co/TheBloke/Nethena-MLewd-Xwin-23B-GGUF/tree/main

7erminalVelociraptor · 2023-12-13T23:10:32Z

Same error, different model (Goliath 120b). Worked fine until I updated, can't get it to load no matter what flags I set.

VL4DST3R · 2023-12-14T00:01:47Z

Yeah, I get the same with multiple models I've tried. I figured it was the larger vram requirements noted in the changelog:

Partial per-layer KV offloading is now merged for CUDA. Important: this means that the number of layers you can offload to GPU might be reduced, as each layer now takes up more space. To avoid per-layer KV offloading, use the --usecublas lowvram option (equivalent to -nkvo in llama.cpp). Fully offloaded models should behave the same as before.

but reducing the offloaded layers even to extreme levels changes nothing.

7erminalVelociraptor · 2023-12-14T02:55:14Z

Yeah, I get the same with multiple models I've tried. I figured it was the larger vram requirements noted in the changelog:

Partial per-layer KV offloading is now merged for CUDA. Important: this means that the number of layers you can offload to GPU might be reduced, as each layer now takes up more space. To avoid per-layer KV offloading, use the --usecublas lowvram option (equivalent to -nkvo in llama.cpp). Fully offloaded models should behave the same as before.

but reducing the offloaded layers even to extreme levels changes nothing.

Also tried running --cublas lowvram but still no dice, I even tried using --clblast instead of cublas and still the same error. Tried lower context size, the same. Tried using --nommap but that just filled my ram so completely that the destop locked up and only a hard reboot was possible.

It's unlikely to be the enviroment, as it looks like OP uses windows and I'm on Arch Linux. If the entire release was completely broken confused users would have been streaming in from the get go, but for now that doesn't seem to be the case.

What's going on?

LostRuins · 2023-12-14T07:54:14Z

I can repro this and have found a solution, but it is a bit strange as to why it happens. I'll also open an issue upstream as they'll likely have the same problem.

LostRuins · 2023-12-14T08:19:47Z

I think I fixed it but if someone can verify with ggerganov#4461 would be good.

VL4DST3R · 2023-12-14T08:45:49Z

@LostRuins From reading your other thread this seems to be the caused by the "partial per-layer KV offloading" in the latest version, but I have to ask: what exactly does it do/mean exactly? Tried looking it up on the wiki here and more broadly online but came up blank. Also is it preferred over the increase in vram usage?

LostRuins · 2023-12-14T08:55:32Z

Actually this issue is not related to the partial KV offloading at all. It was caused by a numerical precision overflow, where a float cannot fully represent a very big number accurately.

Previously, KV is only offloaded at the end after all other layers have been offloaded. Partial KV offloading is being able to offload a portion of the KV progressively alongside the other layers, which usually results in a faster speed if you're able to offload like 3/4 of all layers.

I personally don't find it that useful, and usually disable it for myself. But some people seem to like it.

VL4DST3R · 2023-12-14T09:01:48Z

I personally don't find it that useful, and usually disable it for myself. But some people seem to like it.

So you use --usecublas lowvram from now on then? Shouldn't it be an optional flag then and not on by default?

Also is there any objective way to benchmark this? I'm guessing... seeing how a model runs with the same amount of gpu memory taken?

LostRuins · 2023-12-14T09:04:11Z

You are right, I am setting lowvram to be disabled by default in the next version.

Benchmark by measuring time taken with a bunch of prompts. I have done so before - see ggerganov#4309

Some people claim it's better for them - it is a trade off as I would rather have slower processing but faster generation.

VL4DST3R · 2023-12-14T09:19:34Z

You are right, I am setting lowvram to be disabled by default in the next version.

Isn't that already the case? And no no, I meant it as "is using this flag how you personally plan to counteract this change from now on in your own personal gens?" since you said you prefer the old behavior with faster generation vs processing (just like me)

I was trying to figure out what i need to change in my settings so that this doesn't affect me negatively.

LostRuins · 2023-12-14T09:31:04Z

I was referring to the GUI defaults, when running in GUI mode. I will leave the lowvram checkbox unchecked by default in the next version. If you're using the command line, then it will be whatever you set it.

You don't have to change anything. All existing configs and command line args will work.

VL4DST3R · 2023-12-14T10:03:24Z

Got it but my point was that if one wants to keep the slower processing and faster generation (the pre 1.52 behavior when not fully offloading to vram) the only way currently seems to be using lowvram, but I know this also leaves out scratch buffers, which I imagine results in a speed decrease. Is there a way to just not use the partial offloading specifically, leaving everything else as-is?

LostRuins · 2023-12-14T10:09:49Z

Ah no, the scratch buffer thing was for pre-gguf model behavior. There is no difference anymore.

Basically:

For full offload, do not enable lowvram
If you enable lowvram, and do not full offload, behavior is identical to v1.51 with no lowvram. Because lowvram does not affect GGUF models prior to this version, and in fact has already been removed upstream.

Vladonai · 2023-12-14T10:57:31Z

Some people claim it's better for them - it is a trade off as I would rather have slower processing but faster generation.

I don't understand why this is even necessary. The task of initial processing of a large context is much, MUCH better solved by preserving the model context on exit. Everything else is a disadvantage.

Tested the new version (1.52.1) with the model from the initial post. Now it loads and works.

VL4DST3R · 2023-12-14T20:22:48Z

Partial KV offloading is being able to offload a portion of the KV progressively alongside the other layers, which usually results in a faster speed if you're able to offload like 3/4 of all layers.

So even with the reduced amount of layers given the size increase it should still generate faster if over 3/4 of all layers? How? I thought it only affected processing and nothing else?

Ah no, the scratch buffer thing was for pre-gguf model behavior. There is no difference anymore.

But if one were to still use pre-gguf models (e.g. airoboros 33b) It would still be detrimental, no? 👀

I know I'm biased here since I value actual gen speed over processing, but wouldn't it be easier to have it (partial offloading) as an optional flag for people who want to use it specifically? You mentioned a very good point on the other thread namely:

Overall, trading a ~40% increase PP speed for a 22% reduced generation speed does not really feel like a worthwhile trade to me, especially if we take into account the reduced need for rapid prompt processing while using context shifting. I think perhaps you were testing at -c 512 which has a much smaller difference in number of layers offloaded. At 4096 ctx, the difference will probably be even greater.

which leads me to believe this will be most likely detrimental for most kcpp users since I imagine almost everyone makes use of smart context/context shifting to reduce subsequent processing.

LostRuins · 2023-12-15T02:29:23Z

Yeah if you are using pre-gguf models, then you should toggle the lowvram off/on as needed.

Partial offloading is and optional flag, it's disabled with --lowvram (and thats all that lowvram does for GGUF models, which is disabling all KV offloading). I previously tried out making it lowvram by default but then others objected too.

VL4DST3R · 2023-12-15T10:52:37Z

Partial offloading is and optional flag, it's disabled with --lowvram

It is optional in the sense that it can be toggled, yes, but since it's not the default state, doesn't that make the lowvram flag the optional one in this context instead?

I previously tried out making it lowvram by default but then others objected too.

I understand, hard to please everyone. Ultimately given that the issue isn't really an issue per-se and just a preference, I guess ultimately you should have the final say in how this behavior should be set by default.

Vladonai · 2023-12-15T11:30:05Z

I understand, hard to please everyone. Ultimately given that the issue isn't really an issue per-se and just a preference, I guess ultimately you should have the final say in how this behavior should be set by default.

Am I understanding correctly that if I didn't read this thread, I would suddenly lose ~20% of generation speed after upgrading to 1.52? :)

VL4DST3R · 2023-12-15T11:33:23Z

Not exactly. You would now OOM a lot faster, and fixing that would net you some speed reduction. It's why I figured it should be pointed out as an issue but oh well.

LostRuins mentioned this issue Dec 14, 2023

Fix "not enough space in the context's memory pool" error when loading certain models. ggerganov/llama.cpp#4461

Merged

Vladonai closed this as completed Dec 16, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

"Not enough space in the context's memory pool" exception in 1.52 #563

"Not enough space in the context's memory pool" exception in 1.52 #563

Vladonai commented Dec 13, 2023

7erminalVelociraptor commented Dec 13, 2023

VL4DST3R commented Dec 14, 2023

7erminalVelociraptor commented Dec 14, 2023 •

edited

Loading

LostRuins commented Dec 14, 2023

LostRuins commented Dec 14, 2023

VL4DST3R commented Dec 14, 2023

LostRuins commented Dec 14, 2023

VL4DST3R commented Dec 14, 2023

LostRuins commented Dec 14, 2023

VL4DST3R commented Dec 14, 2023 •

edited

Loading

LostRuins commented Dec 14, 2023

VL4DST3R commented Dec 14, 2023 •

edited

Loading

LostRuins commented Dec 14, 2023 •

edited

Loading

Vladonai commented Dec 14, 2023

VL4DST3R commented Dec 14, 2023

LostRuins commented Dec 15, 2023

VL4DST3R commented Dec 15, 2023

Vladonai commented Dec 15, 2023 •

edited

Loading

VL4DST3R commented Dec 15, 2023 •

edited

Loading

"Not enough space in the context's memory pool" exception in 1.52 #563

"Not enough space in the context's memory pool" exception in 1.52 #563

Comments

Vladonai commented Dec 13, 2023

7erminalVelociraptor commented Dec 13, 2023

VL4DST3R commented Dec 14, 2023

7erminalVelociraptor commented Dec 14, 2023 • edited Loading

LostRuins commented Dec 14, 2023

LostRuins commented Dec 14, 2023

VL4DST3R commented Dec 14, 2023

LostRuins commented Dec 14, 2023

VL4DST3R commented Dec 14, 2023

LostRuins commented Dec 14, 2023

VL4DST3R commented Dec 14, 2023 • edited Loading

LostRuins commented Dec 14, 2023

VL4DST3R commented Dec 14, 2023 • edited Loading

LostRuins commented Dec 14, 2023 • edited Loading

Vladonai commented Dec 14, 2023

VL4DST3R commented Dec 14, 2023

LostRuins commented Dec 15, 2023

VL4DST3R commented Dec 15, 2023

Vladonai commented Dec 15, 2023 • edited Loading

VL4DST3R commented Dec 15, 2023 • edited Loading

7erminalVelociraptor commented Dec 14, 2023 •

edited

Loading

VL4DST3R commented Dec 14, 2023 •

edited

Loading

VL4DST3R commented Dec 14, 2023 •

edited

Loading

LostRuins commented Dec 14, 2023 •

edited

Loading

Vladonai commented Dec 15, 2023 •

edited

Loading

VL4DST3R commented Dec 15, 2023 •

edited

Loading