[METAL] GPU Inference fails due to buffer error (buffer "data" size is larger than buffer maximum) #1815
Comments
You might be interested to check this PR.
FYI, I have the same problem on Intel macOS with an AMD 6900XT GPU. Except in my case it happens on all models: even a q4_0 7B can't be loaded when GPU offload is enabled.
Ah thanks, I didn't find this (and the discussion is apparently not about the merged PR but about a future/necessary PR), but apparently the problem is that there is a limit at 1/2 of total memory, which would explain everything up to 13B working and 33B (with 20GB/32GB) not: #1696 (comment). Happy to test solutions if someone is working on this. @TheBloke Then your error is probably unrelated; the 7B should fit into GPU memory, but I don't know anything at all about non-Apple GPUs and Metal. Thanks for your great work btw :).
Does #1817 fix your issue?
No, it stays the same.
@ikawrakow Not working in my case; it still shows a similar error message.
Update: Q2_K seems to be the only one that works with Metal enabled.
Note: Reported numbers are based on …
It looks like Apple has decided to limit the maximum length of a buffer to some fraction of the available memory. In my case, on a 64 GB M2 Max laptop, the maximum buffer length is reported as …
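For anyone who wants to check their own machine, here is a minimal Objective-C sketch (a standalone tool of my own, not part of llama.cpp) that prints the two relevant device limits:

```objc
// query_limits.m — print the Metal buffer limits discussed above.
// Build: clang -framework Metal -framework Foundation -framework CoreGraphics query_limits.m
#import <Metal/Metal.h>
#import <stdio.h>

int main(void) {
    id<MTLDevice> device = MTLCreateSystemDefaultDevice();
    if (!device) {
        fprintf(stderr, "no Metal device found\n");
        return 1;
    }
    // Largest single MTLBuffer the device will create; allocations above
    // this fail regardless of how much memory is actually free.
    printf("maxBufferLength:              %lu bytes\n",
           (unsigned long)device.maxBufferLength);
    // Total resident memory Metal recommends across all live buffers.
    printf("recommendedMaxWorkingSetSize: %llu bytes\n",
           device.recommendedMaxWorkingSetSize);
    return 0;
}
```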
Yes. M1 Max with 32GB RAM. Does that mean there is no way to load a 33B q4_0 model with 32GB RAM?
Looking at the code, it seems the model is being passed to Metal as a single, no-copy buffer. My guess is that the change required to split the model into 2 or more buffers to circumvent the maximum buffer length shouldn't be too invasive.

I just tried the 33B model on my laptop and it works fine: …
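To make the splitting idea concrete, here is a rough sketch (a hypothetical helper of mine, not the actual llama.cpp code), assuming the weights live in one page-aligned region whose size is a multiple of the page size, as `newBufferWithBytesNoCopy` requires:

```objc
// Hypothetical sketch: carve one large weights region into several no-copy
// Metal buffers, each below device.maxBufferLength.
// Assumes `data` is page-aligned (e.g. from mmap) and `size` is a multiple
// of the page size.
#import <Metal/Metal.h>
#include <unistd.h>

static BOOL add_split_buffers(id<MTLDevice> device, void *data, NSUInteger size,
                              NSMutableArray<id<MTLBuffer>> *out) {
    const NSUInteger page  = (NSUInteger)getpagesize();
    // Chunk size: the hard cap rounded down to a page boundary.
    const NSUInteger chunk = device.maxBufferLength & ~(page - 1);
    for (NSUInteger off = 0; off < size; off += chunk) {
        const NSUInteger len = MIN(chunk, size - off);
        id<MTLBuffer> buf = [device newBufferWithBytesNoCopy:(char *)data + off
                                                      length:len
                                                     options:MTLResourceStorageModeShared
                                                 deallocator:nil];
        if (buf == nil) return NO;
        [out addObject:buf];
    }
    return YES;
}
// Caveat: a fixed-size split like this can cut a tensor in half at a chunk
// boundary — exactly the complication raised in the comments below.
```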
You're right, I should raise a separate issue! |
The fix to this is the approach discussed in #1696. I think the proposed solution there has to work, but not 100% sure.
@ggerganov Isn't the splitting somewhat tricky? I mean, we cannot just split the model at arbitrary offsets because some tensors may end up split in the middle, which will lead to garbage results. But if we attempt to split at tensor boundaries, those may not be page aligned, and the buffer being given to the Metal framework must be page aligned. We cannot have overlapping buffers either, which would be needed to have a tensor completely within a buffer and the buffer be page aligned, unless the …

I'm asking because I thought I could do this quickly, but it turns out to be trickier than it might seem from the discussion in #1696.
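For what it's worth: if Metal does accept overlapping no-copy buffers created over the same mapping (as far as I can tell it does), one scheme sidesteps both the alignment and the mid-tensor problem. Make consecutive chunks overlap by at least the largest tensor size rounded up to a page, so every tensor lies wholly inside some chunk. A sketch with hypothetical names, under that assumption:

```objc
// Hypothetical sketch of an overlapping-chunk layout: consecutive chunks
// overlap by the largest tensor size (page-rounded), so any tensor that
// would straddle one chunk's end is wholly contained in the next chunk.
#include <stddef.h>

typedef struct { size_t start, len; } chunk_t;

// total: size of the weights region; max_tensor: largest tensor in bytes;
// page: page size; cap: device.maxBufferLength, assumed page-aligned.
// Writes the chunk layout to `out` and returns the number of chunks.
static size_t plan_chunks(size_t total, size_t max_tensor,
                          size_t page, size_t cap, chunk_t *out) {
    const size_t ovlp = ((max_tensor + page - 1) / page) * page;
    const size_t step = cap - ovlp; // distance between chunk starts
    size_t n = 0;
    for (size_t off = 0; off < total; off += step) {
        out[n].start = off;
        out[n].len   = (total - off < cap) ? (total - off) : cap;
        n++;
    }
    return n;
}

// A tensor spanning [t, t + ts) is then bound to chunk i = t / step: that
// chunk starts at or before t and, because ts <= ovlp, ends at or after
// t + ts, so no tensor is ever split mid-buffer.
```

Every chunk start is a page multiple (both `cap` and `ovlp` are), so the page-alignment requirement is satisfied without moving any tensor.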
@kiltyj If you've made any progress, please upload it or make a branch for us to continue work on.
I just created #1825 to capture the code I've written so far. As I note there, buffer splitting seems to be working for smaller models, but there's still something I'm missing that is causing issues when I try to split up (e.g.) a 65B model. I'll keep poking at this after hours as I can, but if anyone spots anything, let me know!
Forgive me for being a novice here. Is this the reason why my M2 Max 64GB system is not able to load models larger than 30B and use them with GPU acceleration?
@CyborgArmy83 Very likely.
That's crazy! So is the whole concept of using a large amount of unified memory for GPU/Metal flawed, or is there something we can do? Maybe people should also report this to Apple to see if they can update something in the Metal framework. Or am I missing a key piece of understanding here?
Has this been fixed? I'm still seeing this when trying to load …
FYI, I've been hacking around with some ideas related to this issue in #2069. I don't think it's quite ready for merging and there's still a lot to figure out, but I'd be happy for more eyes/ideas.
I own a MacBook Pro M2 with 32GB of memory and am trying to do inference with a 33B model. Without Metal (i.e. without the `-ngl 1` flag) this works fine, and 13B models also work fine both with and without Metal. There is sufficient free memory available.

Inference always fails with the error:
```
ggml_metal_add_buffer: buffer 'data' size 18300780544 is larger than buffer maximum of 17179869184
```
Is this known/expected, and are there any workarounds? The mentioned "buffer maximum" of 17179869184 stays the same regardless of how much memory is free.
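For context on that number: 17179869184 bytes is exactly 16 GiB, i.e. half of the machine's 32 GB of unified memory, which matches the "limit at 1/2 of total memory" observation earlier in the thread. A paraphrased sketch of the kind of guard that emits this message (not the verbatim `ggml-metal.m` source; the helper name is mine):

```objc
// Paraphrased sketch of the guard behind the error above: Metal cannot
// create a buffer longer than device.maxBufferLength, so the size is
// checked up front and Metal initialization fails if it does not fit.
#import <Metal/Metal.h>
#include <stdbool.h>
#include <stdio.h>

static bool metal_buffer_fits(id<MTLDevice> device, const char *name,
                              size_t data_size) {
    if (data_size > (size_t)device.maxBufferLength) {
        fprintf(stderr,
                "ggml_metal_add_buffer: buffer '%s' size %zu is larger than buffer maximum of %zu\n",
                name, data_size, (size_t)device.maxBufferLength);
        return false; // caller aborts GPU offload
    }
    return true;
}
```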