
Bug: CUDA error: out of memory - Phi-3 Mini 128k prompted with 20k+ tokens on 4GB GPU #7885

Closed
kozuch opened this issue Jun 11, 2024 · 28 comments
Labels: 3rd party (Issue related to a 3rd party project), bug-unconfirmed, stale

Comments


kozuch commented Jun 11, 2024

What happened?

I get a CUDA out-of-memory error when sending a large prompt (about 20k+ tokens) to the Phi-3 Mini 128k model on a laptop with an Nvidia A2000 with 4GB of VRAM. At first ollama uses about 3.3GB of GPU RAM and 8GB of CPU RAM, then the GPU RAM usage slowly rises (3.4GB, 3.5GB, etc.) and after about a minute it throws the error, probably when GPU RAM is exhausted (3.9GB is the last value shown in Task Manager). The inference does not return any tokens before crashing. Attaching server log. Setup: Win11 + Ollama 0.1.42 + VS Code (1.90.0) + Continue plugin (v0.8.40).

The expected behavior would be not to crash, perhaps by reallocating the memory somehow so that GPU memory does not get exhausted. I would like to disable GPU usage in ollama (to test CPU-only inference - I have 64GB of system RAM), but I cannot find out how to turn the GPU off (I saw recently that there is a command for it, but I cannot find it again).
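A minimal sketch of forcing CPU-only inference, assuming Ollama's standard REST API options and the usual llama.cpp CLI flags (the model names and the context size below are placeholders, not taken from this report):

# Ollama: set the number of GPU-offloaded layers to 0 via the API options.
curl http://localhost:11434/api/generate -d '{
  "model": "phi3:mini-128k",
  "prompt": "Hello",
  "options": { "num_gpu": 0, "num_ctx": 8192 }
}'

# Plain llama.cpp: keep every layer on the CPU with -ngl 0.
./llama-cli -m phi-3-mini-128k-instruct.gguf -ngl 0 -c 8192 -p "Hello"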

Actual error:

CUDA error: out of memory
  current device: 0, in function alloc at C:\a\ollama\ollama\llm\llama.cpp\ggml-cuda.cu:375
  cuMemSetAccess(pool_addr + pool_size, reserve_size, &access, 1)
GGML_ASSERT: C:\a\ollama\ollama\llm\llama.cpp\ggml-cuda.cu:100: !"CUDA error"

This is reported via Ollama and full logs are in the issue there: ollama/ollama#4985

Name and Version

See linked ollama issue.

What operating system are you seeing the problem on?

Windows

Relevant log output

See linked ollama issue.
kozuch added the bug-unconfirmed and critical severity (Used to report critical severity bugs in llama.cpp, e.g. Crashing, Corrupted, Dataloss) labels on Jun 11, 2024
JohannesGaessler (Collaborator) commented Jun 11, 2024

This is not necessarily a bug but rather an issue with how efficiently resources are used, i.e. performance. And I can only speak for myself, but I will definitely not debug the performance of downstream projects.

ggerganov added the 3rd party (Issue related to a 3rd party project) label and removed the critical severity label on Jun 12, 2024

zsogitbe commented Jun 13, 2024

There is definitely a bug in the GPU memory allocation code. We are experiencing several major issues:

  • GPU memory is not fully allocated after loading the model, but it keeps allocating more when doing the first inference.
  • The main problem with the above arises when there is no GPU memory left on the main GPU for the context (KV cache?) to be fully allocated. This can happen on several occasions, for example if the model does not fit into GPU memory and is only partially loaded, or if there are two models in GPU memory and the second model completely fills the main GPU. When the library then tries to allocate the context it crashes because there is no GPU memory left on the main GPU. The library crashes regularly while using GPUs.

The problem reported by kozuch is also caused by this issue.

  • One more issue we have found is that approximately 20% more GPU memory is used today than about a month ago with the same model (this is probably related to the later GPU allocation of the context).

  • One more issue is something strange with the split mode. When setting Row as the split mode, the number of offloaded layers still needs to be set. And while using the Layer split mode the library also crashes regularly and uses more GPU memory compared to the Row split mode. I think that the GPU memory allocation should be completely reworked or validated in some way.

The right way of working would be for all GPU memory to be allocated when the model is loaded, so that no more memory needs to be allocated later (this would prevent the crashes).

JohannesGaessler (Collaborator) commented:

GPU memory is not fully allocated after loading the model, but it keeps allocating more when doing the first inference.

One more issue is something strange with the split mode. When setting Row as the split mode, the number of offloaded layers still needs to be set. And while using the Layer split mode the library also crashes regularly and uses more GPU memory compared to the Row split mode.

That is just how those things are implemented. As of right now, large-batch matrix multiplication uses FP16 cuBLAS GEMM, so you need to allocate temporary buffers for the dequantized weight matrices. If you compile with LLAMA_CUDA_FORCE_MMQ you can use custom kernels that work directly on the quantized data, but there are still performance issues on Volta or newer.

And --split-mode row allocates the entire context on a single GPU, so it's not surprising that you OOM faster.
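To make the two knobs above concrete, a rough sketch using the build option and CLI flag names discussed in this thread (the model path is a placeholder, and the exact build invocation may differ depending on your setup):

# Build with the quantized MMQ kernels forced on instead of FP16 cuBLAS GEMM.
cmake -B build -DLLAMA_CUDA=ON -DLLAMA_CUDA_FORCE_MMQ=ON
cmake --build build --config Release

# Layer split (default): whole layers are distributed across the GPUs.
./build/bin/llama-cli -m model.gguf -ngl 99 -sm layer -p "test"

# Row split: matrix rows are split across GPUs, but the context stays on the main GPU.
./build/bin/llama-cli -m model.gguf -ngl 99 -sm row -p "test"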

I think that the GPU memory allocation should be completely reworked or validated in some way.

I will happily review any PRs in that regard.

One more issue we have found is that approximately 20% more GPU memory is used today than about 1 month ago with the same model (this has to do probably with the later GPU allocation of the context).

That one could be an actual bug. If you want it fixed, provide instructions for reproduction and do a git bisect to identify the exact commit that introduced the increase in memory consumption.

JohannesGaessler (Collaborator) commented:

I forgot: enabling FlashAttention via -fa should also cut down a lot on temporary buffers needed for matrix multiplication. (Right now it cannot be the default because not all backends support it.)
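For example (a sketch; the model path and context size are placeholders):

# FlashAttention is opt-in per run via the -fa flag.
./build/bin/llama-cli -m model.gguf -ngl 99 -c 8192 -fa -p "test"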

zsogitbe commented:

Thank you Johannes, I will try the FlashAttention option.

I am afraid that with the current GPU allocation code the library is not usable in production software unless you use the CPU only (which is very slow). I know that the GPU part is a lot of complex work... hopefully someone will try to correct it. I would be willing to test it when it is ready.

For me this is a critical issue; the library should work correctly in basic multi-GPU and CPU setups. I see endless PRs with additions that, compared to this, are unimportant to me.

JohannesGaessler (Collaborator) commented:

For me this is a critical issue; the library should work correctly in basic multi-GPU and CPU setups. I see endless PRs with additions that, compared to this, are unimportant to me.

That is unfortunate, but as it stands there are only 2 devs working on CUDA (slaren and me). My work on the open llama.cpp repository is unpaid and I only really care about my own use cases. You could try a business inquiry if you really need specific changes.

zsogitbe commented:

Thank you Johannes, and of course also thank you @slaren for your contribution to the CUDA part! I am not interested in the commercial options; I am here to support open source. I don't think that their business venture will be successful if the basics of the library are not good... the core should be robust and work well. The code is 5-10 times faster on the GPU when it does not crash, so for me the CUDA part is important and your contribution is important! I am contributing to several open source projects; here I am mostly testing and trying to help with ideas and finding the right direction.

Since there is 20% more GPU memory allocation compared to one month ago, it is possible that someone tried to pre-allocate some memory for the context but made a mistake somewhere. If you have some time and interest to improve the CUDA code in the library, here are my comments:

  • the main problem is the second allocation of GPU memory (probably the context - KV cache), because it causes major errors in several circumstances. I think the code tries to allocate the GPU memory for the context on the main GPU and, if there is no memory left there, it just crashes. The code should check whether there is memory left on one of the available GPUs and, if so, allocate the memory there; if there is nothing left, allocate in system memory and use the CPU for that part, etc. Maybe there are other optimized options.
    You can test the memory allocation amount by running an older version of the library and comparing it to the current version; you will see the 20% higher GPU memory use.
  • The second allocation should be prevented; there should be only one allocation of GPU memory, when the model is loaded. The context size should probably be fixed at loading time. The GPU may be used by several applications and models, so we cannot be sure that there will be memory left on the GPU for the second allocation.
  • if possible the GPU memory use should be decreased back to the level where it was approximately one month ago. 20% more GPU memory use is too much.
  • The abort() in GGML_ASSERT(!"CUDA error"); should be replaced with throwing an exception in some way. A library that simply crashes like this is last-century C-style coding and should be avoided.

Could you please give us a summary of the compilation options that could reduce the problems mentioned above, with some explanation of what they do and their advantages and disadvantages? A bit more documentation would help, and that is probably not a lot of work. You already mentioned:

  • LLAMA_CUDA_FORCE_MMQ
  • FlashAttention via -fa

What about GGML_CUDA_NO_VMM?

JohannesGaessler (Collaborator) commented:

I am contributing to several open source projects; here I am mostly testing and trying to help with ideas and finding the right direction.

The bottleneck is not testing or ideas, it is the amount of time that developers have. As I said, I will happily review any PRs that improve memory allocation since that will be significantly less work than having to implement it myself.

The code should check whether there is memory left on one of the available GPUs and, if so, allocate the memory there; if there is nothing left, allocate in system memory and use the CPU for that part, etc. Maybe there are other optimized options.

That is not going to work without major changes to the GPU code. It would be better to pre-allocate the memory.

The context size should probably be fixed at loading time.

It is. Any additional VRAM use comes from temporary buffers. With LLAMA_CUDA_FORCE_MMQ and -fa that should be 95% fixed.

if possible the GPU memory use should be decreased back to the level where it was approximately one month ago. 20% more GPU memory use is too much.

Since you said that you would help with testing, please do a git bisect and identify the bad commit.

The abort() in GGML_ASSERT(!"CUDA error"); should be replaced with throwing an exception in some way. A library that simply crashes like this is last-century C-style coding and should be avoided.

llama.cpp (C++ user code) internally uses ggml (C library code).

What about GGML_CUDA_NO_VMM?

That's going to make it worse I think.

ggerganov (Owner) commented:

There is no "20% more GPU memory increase" - you are likely using a larger context (the default is now -c 0, i.e. the full training context). In order for us to do anything, you need to provide specific repro steps using llama.cpp examples; otherwise we directly assume user error.

The memory estimation problem has been discussed in the past and it is not trivial to fix. Therefore, projects using llama.cpp have to know in advance the set of conditions under which they will be running and constrain this set to be compatible with the underlying hardware. You cannot expect to load arbitrary models into memory from different applications sharing the same GPU and have everything just run smoothly.
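As an illustration of the context-size point (a sketch; the model path is a placeholder and 8192 is only an example value):

# With the new default of -c 0 the full training context is used; for a 128k-context
# model the KV cache alone can exhaust a 4 GB GPU. Pinning -c keeps it predictable.
./build/bin/llama-cli -m phi-3-mini-128k.gguf -ngl 99 -c 8192 -fa -p "test"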

zsogitbe commented:

if possible the GPU memory use should be decreased back to the level where it was approximately one month ago. 20% more GPU memory use is too much.

Since you said that you would help with testing, please do a git bisect and identify the bad commit.

I have done this already Johannes, here is the most probable reason: #6170

zsogitbe commented:

There is no "20% more GPU memory increase" - you are likely using a larger context (the default is now -c 0, i.e. the full training context).

Unfortunately, there is, Georgi. I did not change anything in my code and the GPU memory use increased by about 20%. This may of course also be caused by a change in some default compilation options... or by the above PR. The CUDA code is too complex for people who did not develop it, so it is difficult to assess. If you take a model and run inference with a version from somewhere before #6170 and with the current version, you will see the memory increase.

JohannesGaessler (Collaborator) commented:

I have done this already Johannes, here is the most probable reason: #6170

You did not do a git bisect. You said that an unspecified old version was good and that master is bad. You need to tell us the exact commit that causes the problem (based on actual testing using only llama.cpp code) and how you tested it. Otherwise there is no chance whatsoever of this getting fixed for you. When I tested VRAM use with

./main --model models/opt/llama_2-q4_0.gguf --n-gpu-layers 99 -n 100 -e prompt.txt --n-predict -1 --ignore-eos

the VRAM use going back as far as b1875 (15.01.24) was exactly the same. Older versions had worse VRAM use (I only tested one release every few months).
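For anyone repeating this measurement, a minimal way to log VRAM during such a run (a sketch; nvidia-smi ships with the NVIDIA driver, and the model path is a placeholder):

# Record GPU memory use once per second in the background, run the test, then stop.
nvidia-smi --query-gpu=timestamp,memory.used --format=csv -l 1 > vram.log &
./main --model models/opt/llama_2-q4_0.gguf --n-gpu-layers 99 -n 100 --ignore-eos
kill $!
sort -t, -k2 -n vram.log | tail -n 1   # rough peak memory.used reading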


zsogitbe commented Jun 14, 2024

I have done this already Johannes, here is the most probable reason: #6170

You did not do a git bisect. You said that an unspecified old version was good and that master is bad. You need to tell us the exact commit that causes the problem (based on actual testing using only llama.cpp code) and how you tested it. Otherwise there is no chance whatsoever of this getting fixed for you. When I tested VRAM use with

I have reported this issue before, the #6170 comes from there: #6909 (comment)

./main --model models/opt/llama_2-q4_0.gguf --n-gpu-layers 99 -n 100 -e prompt.txt --n-predict -1 --ignore-eos

the VRAM use going back as far as b1875 (15.01.24) was exactly the same. Older versions had worse VRAM use (I only tested one release every few months).

Could you please check the state of the two parameters LLAMA_CUDA_FORCE_MMQ and -fa in your test? There must be a reason why you get different test results and do not see the 20% GPU/CUDA memory increase.

JohannesGaessler (Collaborator) commented:

I have reported this issue before, the #6170 comes from there: #6909 (comment)

I know, that is not actionable information for us. Do a git bisect.

Could you please check the state of the two parameters LLAMA_CUDA_FORCE_MMQ and -fa in your test?

They were not used.

zsogitbe commented:

I don't know git bisect, but I have just tested two releases with your test command (I had to modify it a bit because it was not completely correct):

  • llama-b2380-bin-win-cublas-cu12.2.0-x64 (10/03/2024)

  • llama-b3146-bin-win-cuda-cu12.2.0-x64 (14/06/2024)

I have also tested some other models, and the difference in GPU memory use was sometimes more than a 100% increase! I guess it also has something to do with the type and size of the model...

The GPU memory use has definitely increased considerably!

I will not be able to help to find the exact commit which caused this because it is just too much work to search and run hundreds of PRs.

The code to run:

llama-b2380-bin-win-cublas-cu12.2.0-x64\main.exe --model phi-2.Q8_0.gguf -sm none --n-gpu-layers 99 -n 100 -p "I want you to write an essay about an apple." --n-predict -1 --ignore-eos

llama-b3146-bin-win-cuda-cu12.2.0-x64\llama-cli.exe --model phi-2.Q8_0.gguf -sm none --gpu-layers 99 -n 100 -p "I want you to write an essay about an apple." --predict -1 --ignore-eos 

JohannesGaessler (Collaborator) commented:

I will not be able to help to find the exact commit which caused this because it is just too much work to search and run hundreds of PRs.

That's why we're telling you to use git bisect: it automatically gives you commits to test, and the number of tests you have to do scales only logarithmically with the number of commits. None of the devs experience the issue, so unless someone who does pins down the exact commit that causes the problem, the issue has a 0% chance of getting fixed.
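The loop looks roughly like this (a sketch; the commit hashes, the build command, and the test command are placeholders for whatever reproduces the regression on your machine):

git bisect start
git bisect bad                        # the current checkout shows the high VRAM use
git bisect good <known-good-commit>   # e.g. an older build that behaved well
# git checks out a commit roughly halfway in between; rebuild and measure VRAM:
cmake -B build -DLLAMA_CUDA=ON && cmake --build build --config Release
./build/bin/llama-cli -m model.gguf -ngl 99 -n 100 -p "test"
git bisect good    # or "git bisect bad", depending on the measurement
# repeat until git prints "<hash> is the first bad commit", then clean up:
git bisect reset

The session pasted in the next comment is exactly this loop, applied between a March build and a June build.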

zsogitbe commented:

I have done it, but the result does not seem logical: it points to a commit from two weeks ago, while I had already spotted this problem in April. Maybe the problem is that commits are not directly linked to releases and the bisect points at commits (whereas I was testing the releases).

Here are the results:

D:\_GitHub\llama.cpp>git bisect start
status: waiting for both good and bad commits

D:\_GitHub\llama.cpp>git bisect good d894f352bf433157232dc8dc54eacd50014e898e
status: waiting for bad commit, 1 good commit known

D:\_GitHub\llama.cpp>git bisect bad f8ec8877b75774fc6c47559d529dac423877bcad
Bisecting: 385 revisions left to test after this (roughly 9 steps)
[d2c898f746a527f09effb061829e68b2e1812a28] ci : tmp disable gguf-split (#6983)

D:\_GitHub\llama.cpp>git bisect good d2c898f746a527f09effb061829e68b2e1812a28
Bisecting: 192 revisions left to test after this (roughly 8 steps)
[fcf6538ba6702c55eaec70da9a75c81d04900a72] CUDA: fix unused warning in mmq.cu (#7442)

D:\_GitHub\llama.cpp>git bisect good fcf6538ba6702c55eaec70da9a75c81d04900a72
Bisecting: 96 revisions left to test after this (roughly 7 steps)
[1af511fc22cba4959dd8bced5501df9e8af6ddf9] Add convert.py removal to hot topics (#7662)

D:\_GitHub\llama.cpp>git bisect good 1af511fc22cba4959dd8bced5501df9e8af6ddf9
Bisecting: 48 revisions left to test after this (roughly 6 steps)
[c9ee7118d5644dd3df70ea6878b36a9761616aab] check for nans in imatrix and quantize (#7807)

D:\_GitHub\llama.cpp>git bisect bad c9ee7118d5644dd3df70ea6878b36a9761616aab
Bisecting: 23 revisions left to test after this (roughly 5 steps)
[bde7cd3cd949c1a85d3a199498ac98e78039d46f] llama : offload to RPC in addition to other backends (#7640)

D:\_GitHub\llama.cpp>git bisect good bde7cd3cd949c1a85d3a199498ac98e78039d46f
Bisecting: 11 revisions left to test after this (roughly 4 steps)
[9973e81c5ccf4f31b3980f5aa73f5cfea8699860] readme : remove -ins (#7759)

D:\_GitHub\llama.cpp>git bisect bad 9973e81c5ccf4f31b3980f5aa73f5cfea8699860
Bisecting: 5 revisions left to test after this (roughly 3 steps)
[5ca0944a153b65724d51b2f484139aa25ccb7a8b] readme : remove obsolete Zig instructions (#7471)

D:\_GitHub\llama.cpp>git bisect good 5ca0944a153b65724d51b2f484139aa25ccb7a8b
Bisecting: 2 revisions left to test after this (roughly 2 steps)
[1442677f92e45a475be7b4d056e3633d1d6f813b] common : refactor cli arg parsing (#7675)

D:\_GitHub\llama.cpp>git bisect bad 1442677f92e45a475be7b4d056e3633d1d6f813b
Bisecting: 0 revisions left to test after this (roughly 1 step)
[554c247caffed64465f372661f2826640cb10430] ggml : remove OpenCL (#7735)

D:\_GitHub\llama.cpp>git bisect good 554c247caffed64465f372661f2826640cb10430
1442677f92e45a475be7b4d056e3633d1d6f813b is the first bad commit
commit 1442677f92e45a475be7b4d056e3633d1d6f813b
Author: Georgi Gerganov <[email protected]>
Date:   Tue Jun 4 21:23:39 2024 +0300

    common : refactor cli arg parsing (#7675)

    * common : gpt_params_parse do not print usage

    * common : rework usage print (wip)

    * common : valign

    * common : rework print_usage

    * infill : remove cfg support

    * common : reorder args

    * server : deduplicate parameters

    ggml-ci

    * common : add missing header

    ggml-ci

    * common : remote --random-prompt usages

    ggml-ci

    * examples : migrate to gpt_params

    ggml-ci

    * batched-bench : migrate to gpt_params

    * retrieval : migrate to gpt_params

    * common : change defaults for escape and n_ctx

    * common : remove chatml and instruct params

    ggml-ci

    * common : passkey use gpt_params

 common/common.cpp                            | 823 ++++++++++++++++++---------
 common/common.h                              | 105 +++-
 examples/batched-bench/README.md             |   8 +-
 examples/batched-bench/batched-bench.cpp     |  92 +--
 examples/batched/README.md                   |   2 +-
 examples/batched/batched.cpp                 |  73 +--
 examples/embedding/embedding.cpp             |   4 +-
 examples/eval-callback/eval-callback.cpp     |   6 +-
 examples/gguf-split/tests.sh                 |  10 +-
 examples/gritlm/gritlm.cpp                   |   2 +
 examples/imatrix/imatrix.cpp                 |   8 +-
 examples/infill/infill.cpp                   | 134 +----
 examples/llama-bench/llama-bench.cpp         |  48 +-
 examples/llava/llava-cli.cpp                 |  14 +-
 examples/lookahead/lookahead.cpp             |   3 +-
 examples/lookup/lookup-create.cpp            |   2 +
 examples/lookup/lookup-stats.cpp             |   1 +
 examples/lookup/lookup.cpp                   |   1 +
 examples/main/README.md                      |   5 +-
 examples/main/main.cpp                       |  69 +--
 examples/parallel/parallel.cpp               |   3 +-
 examples/passkey/README.md                   |   2 +-
 examples/passkey/passkey.cpp                 |  66 +--
 examples/perplexity/perplexity.cpp           |  12 +-
 examples/quantize/tests.sh                   |   4 +-
 examples/retrieval/retrieval.cpp             |  90 +--
 examples/save-load-state/save-load-state.cpp |   1 +
 examples/server/server.cpp                   | 700 +++--------------------
 examples/server/utils.hpp                    |   7 -
 examples/simple/README.md                    |   2 +-
 examples/simple/simple.cpp                   |  50 +-
 examples/speculative/speculative.cpp         |   3 +-
 llama.cpp                                    |   2 +-
 scripts/run-with-preset.py                   |   4 +-
 34 files changed, 900 insertions(+), 1456 deletions(-)

JohannesGaessler (Collaborator) commented:

@ggerganov is that the commit that changed the default context size?

zsogitbe commented:

When setting the context size to a specific value there is still a memory increase, but it is much smaller. The context size will not be the reason for my 20% GPU memory increase, but it is probably the main reason for the memory increase in the releases. This is not consistent with the documentation, because when the context size is not given (0) the library is supposed to use the context size from the model (according to the docs). Anyway, this is not important for me because the context size should always be set.

Please try to publish the preprocessor parameters used for compiling the release binaries; maybe the reason lies there. I compile the binaries on Windows myself with the same parameters as always, but maybe the parameters used for the releases were changed to keep up with the code changes.
For example, is LLAMA_CUDA_FORCE_MMQ set for all of the Releases?

JohannesGaessler (Collaborator) commented:

For example, is LLAMA_CUDA_FORCE_MMQ set for all of the Releases?

It should not be, because (until very recently) it was not using int8 tensor cores, so the performance on Turing or newer was bad.

zsogitbe commented:

While doing these tests I noticed something very strange about the binaries: their size increases drastically with nearly every release. You should maybe check this.

10 March 2024  => 12.6 MB
29 April 2024  => 18.7 MB
21 May 2024    => 37.0 MB
4 June 2024    => 58.9 MB
14 June 2024   => 75.6 MB

JohannesGaessler (Collaborator) commented:

We are aware; that's just a consequence of pre-compiling multiple kernel versions for various combinations of head sizes, batch sizes, and quantization formats. I really don't think it matters nowadays, since even 1 GB of SSD space costs only a few cents.

zsogitbe commented:

I went on and did some more testing. Johannes told me to try LLAMA_CUDA_FORCE_MMQ. It did not have any effect, so I searched the code and found that it does nothing other than set the GGML_CUDA_FORCE_MMQ preprocessor directive for ggml. Then I searched for GGML_CUDA_FORCE_MMQ and found it in ggml-cuda.cu, but it does nothing other than write one line to the log. So my question is: what happened to the code behind GGML_CUDA_FORCE_MMQ?

JohannesGaessler (Collaborator) commented:

If GGML_CUDA_FORCE_MMQ is not defined, CUDA_USE_TENSOR_CORES gets defined, which then has an actual effect. As of right now this is a misnomer, though, since MMQ recently gained tensor core support.


zsogitbe commented Jun 16, 2024

I think that code is also missing, because CUDA_USE_TENSOR_CORES is not set when GGML_CUDA_FORCE_MMQ is not set. I have the feeling that this is the reason for my 20% memory increase; the code has probably changed. The only thing I do not understand is why I don't get the same memory increase when using the releases from llama.cpp. It would be useful to see which parameters are used to compile the releases; I guess those parameters have also changed in some way.

I have also noted that I get this during testing: "ggml_backend_cuda_graph_compute: disabling CUDA graphs due to batch size > 1 [ffn_inp-0] [4096 100 1 1]". I guess that the batch size is always > 1, so CUDA graphs are always disabled now.

Furthermore, from testing:

--generate-code=arch=compute_60,code=[compute_60,sm_60] --generate-code=arch=compute_61,code=[compute_61,sm_61] --generate-code=arch=compute_70,code=[compute_70,sm_70]

This should be changed to 60 and 70 only because there is no significant speed difference between 60 and 61 and it adds an extra 10 MB to the dll size.

Thank you Johannes for your answers. I have not found the reason for the GPU memory increase, but FlashAttention helped decrease memory use a bit, so I guess I will go with this and not search further for the reason.

JohannesGaessler (Collaborator) commented:

This should be changed to 60 and 70 only because there is no significant speed difference between 60 and 61 and it adds an extra 10 MB to the dll size.

Yes, there is. The __dp4a instruction is critical for Pascal performance and is only available with CC 6.1, not CC 6.0.
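If binary size is a concern, the set of target architectures can also be narrowed at build time (a sketch; CMAKE_CUDA_ARCHITECTURES is the standard CMake variable for this, and the list shown is only an example that keeps CC 6.1 for __dp4a):

# Compile kernels only for the compute capabilities you actually deploy to.
cmake -B build -DLLAMA_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="61;70;75"
cmake --build build --config Release

Dropping an architecture only saves size for machines that never need it, so the list has to match the actual deployment targets.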

zsogitbe commented:

Good catch Johannes!

github-actions bot added the stale label on Jul 17, 2024

github-actions bot commented Aug 1, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed on Aug 1, 2024