Bug: CUDA error: out of memory - Phi-3 Mini 128k prompted with 20k+ tokens on 4GB GPU #7885
Comments
This is not necessarily a bug but rather an issue with how efficiently resources are used, i.e. performance. And I can only speak for myself, but I will definitely not debug the performance of downstream projects.
There is definitely a bug in the GPU memory allocation code. We are experiencing several major issues:
The problem reported by kozuch is also caused by this issue.
The right behavior would be to allocate all of the GPU memory the model needs at load time, so that no additional allocations are required later (this would prevent the crashes).
That is just how those things are implemented. As of right now large batch matrix multiplication uses FP16 cuBLAS GEMM, so you need to allocate temporary buffers for the dequantized weight matrices. If you compile with LLAMA_CUDA_FORCE_MMQ the MMQ kernels are used instead and those temporary buffers are not needed.
I will happily review any PRs in that regard.
That one could be an actual bug. If you want it fixed, provide instructions for reproduction and do a git bisect to identify the exact commit that introduced the increase in memory consumption.
I forgot: enabling FlashAttention via -fa also helps with memory use.
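For readers following along, a minimal invocation sketch; the `-fa` flag is the one referenced later in this thread, while the binary name, model path, and other parameters are placeholders that may differ by version:

```bash
# Sketch: enable FlashAttention at runtime with -fa (the flag discussed in this thread).
# Binary name (main vs. llama-cli), model path, and context size are placeholders.
./llama-cli -m ./phi-3-mini-128k-instruct.Q4_K_M.gguf -c 8192 -ngl 99 -fa -p "Hello"
```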
Thank you Johannes, I will try the FlashAttention option. I am afraid that with the current GPU allocation code the library is not usable in production software unless you use only the CPU (very slow). I know that the GPU part is a lot of complex work... hopefully someone will try to correct it. I would be willing to test it when it is ready. For me this is a critical, major issue: the library should work correctly in basic multi-GPU and CPU setups. I see endless PRs with additions that, to me, are unimportant compared to this.
That is unfortunate, but as it is there are only 2 devs working on CUDA (slaren and me). My work on the open llama.cpp repository is unpaid and I only really care about my own use cases. You could try a business inquiry if you really need specific changes.
Thank you Johannes, and of course also thank you @slaren for your contribution to the CUDA part! I am not interested in the commercial options; I am here to support open source. I don't think that their business venture will be successful if the basics of the library are not good... the core should be robust and work well. The code is 5-10 times faster on the GPU when it does not crash, so for me the CUDA part is important, and your contribution is important! I am contributing to several open source projects; here I am mostly testing and trying to help with ideas and finding the right direction. Since there is 20% more GPU memory allocation compared to one month ago, it is possible that someone tried to pre-allocate some memory for the context but made a mistake somewhere. If you have some time and interest to improve the CUDA code in the library, here are my comments:
Could you please give us a summary of the compilation options that could reduce the problems mentioned above, with some explanation of what they do and their advantages and disadvantages? A bit more documentation would help, and that is probably not a lot of work. You already mentioned:
What about GGML_CUDA_NO_VMM?
The bottleneck is not testing or ideas, it is the amount of time that developers have. As I said, I will happily review any PRs that improve memory allocation since that will be significantly less work than having to implement it myself.
That is not going to work without major changes to the GPU code. It would be better to pre-allocate the memory.
It is. Any additional VRAM use comes from temporary buffers. With LLAMA_CUDA_FORCE_MMQ and -fa that should be 95% fixed.
Since you said that you would help with testing, please do a git bisect and identify the bad commit.
llama.cpp (C++ user code) internally uses ggml (C library code).
That's going to make it worse I think.
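As a sketch of the build option discussed above (LLAMA_CUDA_FORCE_MMQ, which defines GGML_CUDA_FORCE_MMQ for ggml); the exact CMake variable names have shifted between versions, so treat these as assumptions to verify against the build docs of the version in use:

```bash
# Sketch: build with CUDA and force the MMQ kernels so large-batch matrix
# multiplication avoids the FP16 cuBLAS path and its temporary dequantization
# buffers. Variable names (LLAMA_CUDA, LLAMA_CUDA_FORCE_MMQ) match this era of
# llama.cpp but may differ in other versions.
cmake -B build -DLLAMA_CUDA=ON -DLLAMA_CUDA_FORCE_MMQ=ON
cmake --build build --config Release
```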
There is no "20% more GPU memory increase" - you are likely using a larger context (the default context size has changed recently). The memory estimation problem has been discussed in the past and it is not trivial to fix. Therefore, projects using llama.cpp should set the context size explicitly instead of relying on the default.
I have done this already, Johannes; here is the most probable reason: #6170
Unfortunately, there is, Georgi. I did not change anything in my code and the GPU memory use increased by about 20%. This may of course also be caused by you changing some default compilation options... or by the above PR. The CUDA code is too complex for people who did not develop it, so it is difficult to assess. If you take a model and run inference with a version somewhere before #6170 and with the current version, you will see the memory increase.
You did not do a git bisect. You said that an unspecified old version was good and that master is bad. You need to tell us the exact commit that causes the problem (based on actual testing using only llama.cpp code) and how you tested it. Otherwise there is no chance whatsoever of this getting fixed for you. When I tested VRAM use, it was exactly the same going back as far as b1875 (15.01.24). Older versions had worse VRAM use (I only tested one release every few months).
I have reported this issue before; the #6170 reference comes from there: #6909 (comment)
Could you please check the state of the two parameters LLAMA_CUDA_FORCE_MMQ and -fa in your test? There must be a reason why you get different test results and do not see the 20% GPU/CUDA memory increase.
I know, that is not actionable information for us. Do a git bisect.
They were not used.
I am not familiar with git bisect, but I have just compared two versions using your test (I had to modify it a bit because it was not completely correct). I have also tested some other models, and the difference in GPU memory use was sometimes an increase of more than 100%! I guess it also has something to do with the type and size of the model... GPU memory use has definitely increased considerably! I will not be able to help find the exact commit that caused this because it is just too much work to search through and run hundreds of PRs. The code to run:
That's why we're telling you to use git bisect: it automatically gives you commits to test, and the number of tests you have to do scales only logarithmically with the number of commits. None of the devs experience the issue, so unless someone who does pins down the exact commit that causes the problem, the issue has a 0% chance of getting fixed.
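A sketch of the suggested workflow; the tag b1875 is taken from the earlier comment and only serves as an example of a known-good point:

```bash
# Sketch of the bisect workflow: git picks the commits to test, roughly
# log2(number of commits) builds in total.
git bisect start
git bisect bad master        # current state shows the higher VRAM use
git bisect good b1875        # a release known to be good (example from above)
# For each commit git checks out: rebuild, measure VRAM, then run
#   git bisect good   or   git bisect bad
# until git prints the first bad commit, then:
git bisect reset
```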
I have done it, but the result is not logical: it points to a release from 2 weeks ago, while I had already spotted this problem in April. Maybe the problem is that commits are not directly linked to releases and bisect points at commits (I was testing the releases). Here are the results:
@ggerganov is that the commit that changed the default context size? |
When setting the context size to a specific value there is still a memory increase, but it is much smaller. The context size is therefore not the reason for my 20% GPU memory increase, but it is likely the main reason for the memory increase in the Releases. This is not consistent with the documentation, because when the context size is not given (0) the library is supposed to use the context size from the model (according to the docs). Anyway, this is not important for me because the context size should always be set. Please try to publish the preprocessor parameters used for compiling the binaries in the Releases; maybe the reason lies there. I am compiling the binaries on Windows myself with the same parameters as always, so maybe in the Releases these parameters changed to keep up with the code changes.
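A sketch of how to make such measurements comparable, under the assumption that the context size is pinned explicitly; binary name and model path are placeholders:

```bash
# Sketch: pin the context size with -c so VRAM numbers are comparable across
# versions, then read GPU memory use while the model is loaded.
# Binary name and model path are placeholders.
./llama-cli -m ./model.gguf -c 4096 -ngl 99 -n 128 -p "test"
# In a second terminal, while the above is still running:
nvidia-smi --query-gpu=memory.used --format=csv
```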
It should not be, because (until very recently) it was not using int8 tensor cores, so the performance on Turing or newer was bad.
While doing these tests I have noticed something very strange about the binaries: their size increases drastically with nearly every release. You should maybe check this.
We are aware; that's just a consequence of pre-compiling multiple kernel versions for various combinations of head sizes, batch sizes, and quantization formats. I really don't think it matters nowadays since even 1 GB of SSD space costs only a few cents.
I went on and did some more testing. Johannes told me to try LLAMA_CUDA_FORCE_MMQ. In my tests LLAMA_CUDA_FORCE_MMQ did not have any effect, so I searched the code and found that it does nothing other than set the GGML_CUDA_FORCE_MMQ preprocessor directive for ggml. Then I searched for GGML_CUDA_FORCE_MMQ and found it in ggml-cuda.cu, but there it does nothing other than write one line to the log. So my question is: what happened to the code supporting GGML_CUDA_FORCE_MMQ?
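The check described above can be reproduced with a plain text search; the file layout (ggml-cuda.cu plus a ggml-cuda/ directory) matches the source tree of that period and is an assumption for other versions:

```bash
# Sketch: locate every use of the preprocessor flags in the CUDA backend.
# If the only hit for GGML_CUDA_FORCE_MMQ is a log message, the flag no longer
# influences kernel selection. Paths assume the source layout of that period.
grep -rn "GGML_CUDA_FORCE_MMQ" ggml-cuda.cu ggml-cuda/
grep -rn "CUDA_USE_TENSOR_CORES" ggml-cuda.cu ggml-cuda/
```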
If GGML_CUDA_FORCE_MMQ is not defined, CUDA_USE_TENSOR_CORES is defined on hardware that supports it, and the cuBLAS/tensor core path is used instead of MMQ.
I think that that code is also missing, because CUDA_USE_TENSOR_CORES is not set if GGML_CUDA_FORCE_MMQ is not set. I have the feeling that this will be the reason for my 20% memory increase: the code has probably changed. The only thing I do not understand is why I don't get the same memory increase when using the Releases from llama.cpp. It would be useful to see what parameters are used to compile the Releases; I guess that those parameters have also changed in some way. I have also noted that I get this during testing: "ggml_backend_cuda_graph_compute: disabling CUDA graphs due to batch size > 1 [ffn_inp-0] [4096 100 1 1]". I guess the batch size is always > 1, thus CUDA graphs are always disabled now. Furthermore, from testing:
This should be changed to 60 and 70 only, because there is no significant speed difference between 60 and 61 and it adds an extra 10 MB to the DLL size. Thank you Johannes for your answers. I have not found the reason for the GPU memory increase, but FlashAttention helped to decrease memory use a bit, so I guess that I will go with this and not search further for the reason.
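For reference, the set of pre-compiled architectures can be narrowed via the standard CMake variable; the values below mirror the suggestion above and are only an example:

```bash
# Sketch: restrict the compiled CUDA architectures to shrink the binary.
# The list "60;70" mirrors the suggestion above; adjust it to the GPUs you target.
cmake -B build -DLLAMA_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="60;70"
cmake --build build --config Release
```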
Yes there is. The __dp4a instruction that MMQ relies on is only available with compute capability 6.1, not 6.0.
Good catch, Johannes!
This issue was closed because it has been inactive for 14 days since being marked as stale. |
What happened?
I get a CUDA out of memory error when sending a large prompt (about 20k+ tokens) to the Phi-3 Mini 128k model on a laptop with an Nvidia A2000 with 4GB of RAM. At first about 3.3GB of GPU RAM and 8GB of CPU RAM is used by ollama, then the GPU RAM usage slowly rises (3.4GB, 3.5GB, etc.) and after about a minute it throws the error, probably when GPU RAM is exhausted (3.9GB is the latest value in Task Manager). The inference does not return any token (as an answer) before crashing. Attaching the server log. Using Win11 + Ollama 0.1.42 + VS Code (1.90.0) + Continue plugin (v0.8.40).
The expected behavior would be not to crash, and maybe to reallocate the memory somehow so that GPU memory does not get exhausted. I want to disable GPU usage in ollama (to test CPU-only inference - I have 64GB of RAM) but I am not able to find out how to turn the GPU off (even though I recently saw there is a command for it, I am not able to find it again).
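A hedged note for the CPU-only test: in Ollama the num_gpu option controls how many layers are offloaded (it maps to llama.cpp's n_gpu_layers), so setting it to 0 should keep inference on the CPU; the model name below is a placeholder:

```bash
# Sketch: ask Ollama to offload zero layers to the GPU (CPU-only inference).
# num_gpu maps to llama.cpp's n_gpu_layers; the model name is a placeholder.
curl http://localhost:11434/api/generate -d '{
  "model": "phi3",
  "prompt": "Hello",
  "options": { "num_gpu": 0 }
}'
```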
Actual error:
This is reported via Ollama and full logs are in the issue there: ollama/ollama#4985
Name and Version
See linked ollama issue.
What operating system are you seeing the problem on?
Windows
Relevant log output