Memory consumption for inference with Llama2-7B is weird #28651
Comments
Hi @c3ianwu
Thanks @younesbelkada. Modified my script:
The plot looks like this: The gradient of the linear sloping bit is still the same (about 0.065, double what we expect). It also looks like clearing the cache is having the desired effect, but the memory consumption for generation is still off. For the beginning bit, I assume it's allocating some memory prior to generation (I guess since we expect to generate at least some tokens)? That would explain the flat line. I am running this on a GCP container in a Jupyter notebook; thought it might be worth mentioning given the Flask issue mentioned in huggingface/accelerate#614 (comment).
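For anyone who wants to reproduce this kind of per-step trace, here is a minimal sketch (not the script used above, which is not shown): a forward hook on the model fires once per decoding step, so logging torch.cuda.memory_allocated() there gives the curve whose slope is being discussed. The checkpoint, prompt, batch size of 64, and generation length are all assumptions.

```python
# Sketch only: checkpoint, prompt, batch size and generation length are assumed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
tok.pad_token = tok.eos_token  # Llama has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

memory_trace = []  # bytes allocated after each forward pass (one pass per decoding step)

def record_memory(module, args, output):
    memory_trace.append(torch.cuda.memory_allocated())

hook = model.register_forward_hook(record_memory)

prompts = ["Some prompt"] * 64  # placeholder batch of 64 sequences
inputs = tok(prompts, return_tensors="pt", padding=True).to("cuda")
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=256)
hook.remove()

# Successive differences approximate the per-step memory cost of the kv cache.
deltas_gb = [(b - a) * 1e-9 for a, b in zip(memory_trace, memory_trace[1:])]
```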
Hi dude, TL;DR: pass […]. I was running into the same issue as you did. It turns out that it was due to the update of […].
Hi @g-h-chen. Thanks for the insights, will try these. Just to mention, I have been facing similar issues while running Mistral 7B locally. Below is the code snippet I am using:

```python
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

# begin initializing HF items, need auth token for these
hf_auth = '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'

model = transformers.AutoModelForCausalLM.from_pretrained(...)
tokenizer = transformers.AutoTokenizer.from_pretrained(...)
generate_text = transformers.pipeline(...)

table_list = [...]  # list of 50 HTML tables
for i, text in enumerate(table_list):
    ...
```

I have an A100 80 GB GPU, but after 28 tables the loop hits an OOM error. I am not sure why memory keeps filling up during inference; ideally it should be released after each inference, or am I wrong somewhere?
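Not a confirmed fix for the OOM above, just a common pattern worth trying in that loop: drop references to each output and release cached blocks between iterations so memory does not accumulate across the 50 tables. This is a sketch against the names in the snippet above (generate_text, table_list); max_new_tokens is an assumed setting.

```python
import gc
import torch

results = []
for i, text in enumerate(table_list):  # table_list as defined in the snippet above
    out = generate_text(text, max_new_tokens=512)  # max_new_tokens is an assumed value
    results.append(out[0]["generated_text"])
    del out                   # drop the reference so the output tensors can be freed
    gc.collect()
    torch.cuda.empty_cache()  # release unused cached GPU memory between iterations
```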
@g-h-chen, not sure this is the fix. I have tried the same steps with the eos token set and I'm getting the same memory profile as before. Also, if anything we want it to hit max_new_tokens every time (for memory profiling), so that we can be sure it is outputting sequences of the length we expect. The theoretical calculations I provide in the issue description assume that outputs of a particular length have been produced.
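On the point about always hitting max_new_tokens: one way to guarantee full-length outputs for profiling (a sketch, reusing the model and inputs from the hook example above) is to set min_new_tokens to the same value, which suppresses EOS until that many tokens have been generated.

```python
# Force every sequence to the full length so the memory profile corresponds to
# a known number of generated tokens. `model` and `inputs` are assumed to exist
# as in the earlier sketch.
out = model.generate(
    **inputs,
    min_new_tokens=256,  # EOS is ignored until 256 new tokens have been produced
    max_new_tokens=256,
)
```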
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
The same problem.
See #30536. I would recommend everyone use the static cache with torch.compile!
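For reference, a sketch of what that recommendation typically looks like with generate(); the checkpoint is a placeholder and the exact flags depend on the transformers version.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

# Pre-allocate the kv cache once instead of growing (and re-copying) it every step.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tok("Hello", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```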
System Info

transformers version: 4.36.2

Who can help?

@ArthurZucker @younesbelkada @gan

Information

Tasks

An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
I am trying to track GPU memory consumption when doing inference with Llama2-7B. This is my set-up:
I ran
This is the plot:
Expected behavior
I tried to compute theoretical numbers. I estimated the number of input tokens:
which returns 12992. Taking the model to be 7B params ~ 14 GB in bf16, and assuming that the kv cache consumes

4*num_layers*d_model = 4*32*4096 = 524,288 bytes/token,

we get an estimated

14 + (12992*524288)*1e-9 ≈ 20.8 GB

before anything is generated, which looks about right from the graph.

Using the same logic, we know that each additional generation step should cost (via the kv cache)

524,288*64 bytes ≈ 0.034 GB/step

of memory. Looking at the gradient of the linear portion of the plot, we get ~0.067 GB/step instead, which is around double the amount.
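For reference, the arithmetic above can be reproduced in a few lines; the batch size of 64 is inferred from the *64 factor, everything else uses the numbers quoted in this section.

```python
bytes_per_param = 2                                   # bf16
weights_gb = 7e9 * bytes_per_param * 1e-9             # ~14 GB of weights

n_layers, d_model = 32, 4096
kv_bytes_per_token = 2 * 2 * n_layers * d_model       # K and V, 2 bytes each = 524,288

prompt_tokens = 12992
prefill_gb = weights_gb + prompt_tokens * kv_bytes_per_token * 1e-9  # ~20.8 GB

batch_size = 64                                       # assumed: 64 sequences in the batch
per_step_gb = batch_size * kv_bytes_per_token * 1e-9  # ~0.034 GB per generated step
print(prefill_gb, per_step_gb)
```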