Memory consumption for inference with Llama2-7B is weird #28651

Closed

c3ianwu opened this issue Jan 22, 2024 · 8 comments


c3ianwu commented Jan 22, 2024

System Info

  • transformers version: 4.36.2
  • Platform: Linux-5.15.107+-x86_64-with-glibc2.31
  • Python version: 3.10.13
  • Huggingface_hub version: 0.20.1
  • Safetensors version: 0.4.1
  • Accelerate version: 0.22.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2+cu118 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed

Who can help?

@ArthurZucker @younesbelkada @gan

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am trying to track GPU memory consumption when doing inference with Llama2-7B. This is my set-up:

import json
import tqdm
import warnings
warnings.filterwarnings('ignore')
import time

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import datasets
import matplotlib.pyplot as plt

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.bfloat16)
model.to(device=0)

prompt_data = datasets.load_from_disk("/data/metamath_100k_2048/train") # this is just some supervised training text data
prompts = prompt_data["inputs"] # this is a list of strings


class LocalModel:

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def generate(self, prompts, do_sample=False, temperature=0, top_k=0, top_p=0, repetition_penalty=1.0, max_new_tokens=128):
        self.tokenizer.pad_token = self.tokenizer.eos_token
        tokenized_inputs = self.tokenizer(prompts, return_tensors="pt", padding=True).to(self.model.device)
        inputs = tokenized_inputs["input_ids"]
        attention_mask = tokenized_inputs["attention_mask"]
        tic = time.time()
        logits = self.model.generate(input_ids=inputs, 
                                     attention_mask=attention_mask, 
                                     do_sample=do_sample, 
                                     temperature=temperature, 
                                     top_k=top_k, 
                                     top_p=top_p, 
                                     repetition_penalty=repetition_penalty,
                                     max_new_tokens=max_new_tokens)
        max_alloc = torch.cuda.max_memory_allocated(0) / 1e9
        print("Peak GPU Memory Consumption: {}".format(torch.cuda.max_memory_allocated(0) / 1e9))
        torch.cuda.reset_peak_memory_stats(0)
        toc = time.time()
        print("Time for generation: {}".format(toc - tic))
        return max_alloc

I ran

local_model = LocalModel(model, tokenizer)

alloc = []
x = [0, 2, 4, 6, 8, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160]
for i in x:
    alloc.append(local_model.generate(prompts[:64], max_new_tokens=i))


plt.scatter(x, alloc)
plt.xlabel("Max New Tokens")
plt.ylabel("Peak Mem Usage / GB")
plt.show()

This is the plot:

[Screenshot: scatter plot of peak GPU memory usage (GB) against max new tokens]

Expected behavior

I tried to compute theoretical numbers. I estimated the number of input tokens:

def calculate_prompt_tokens(tokenizer, prompts, batch_size):
    tokenizer.pad_token = tokenizer.eos_token
    tokens = tokenizer(prompts[:batch_size], return_tensors="pt", padding=True)
    return tokens["input_ids"].shape[0] * tokens["input_ids"].shape[1]

calculate_prompt_tokens(tokenizer, prompts, batch_size=64)

which returns 12992. Taking the model to be 7B params ≈ 14 GB in bf16, and assuming the KV cache consumes 2 (K and V) * 2 bytes (bf16) * num_layers * d_model = 4 * 32 * 4096 = 524,288 bytes/token, we get an estimated 14 + (12992 * 524288) * 1e-9 ≈ 20.8 GB before anything is generated, which looks about right from the graph.

Using the same logic, each additional generation step should cost (via the KV cache) 524,288 * 64 bytes ≈ 0.034 GB / step of memory. Looking at the gradient of the linear portion of the plot, we instead get ~0.067 GB / step, which is around double that amount.
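Spelling the estimate out as a quick sketch (using the standard Llama-2-7B config values of 32 layers and hidden size 4096, plus the prompt-token count from above; nothing here is measured):

# Back-of-the-envelope memory estimate for Llama-2-7B inference in bf16
NUM_PARAMS = 7e9
BYTES_PER_VALUE = 2                    # bf16
NUM_LAYERS = 32
D_MODEL = 4096
BATCH_SIZE = 64
PROMPT_TOKENS = 12992                  # from calculate_prompt_tokens above

# KV cache per token: K and V tensors, one per layer, d_model values each, 2 bytes per value
kv_bytes_per_token = 2 * NUM_LAYERS * D_MODEL * BYTES_PER_VALUE   # 524,288 bytes

weights_gb = NUM_PARAMS * BYTES_PER_VALUE / 1e9                   # ~14 GB
prefill_kv_gb = PROMPT_TOKENS * kv_bytes_per_token / 1e9          # ~6.8 GB
per_step_gb = BATCH_SIZE * kv_bytes_per_token / 1e9               # ~0.034 GB per generated token

print("Before generation: ~{:.1f} GB".format(weights_gb + prefill_kv_gb))   # ~20.8 GB
print("Expected growth: ~{:.4f} GB / step".format(per_step_gb))             # ~0.0336 GB / step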

  1. Why is the memory consumed for generation greater than expected?
  2. What's going on in the early portion of the plot? Why is there a big jump at the start?
younesbelkada (Contributor) commented Jan 23, 2024

Hi @c3ianwu
This is interesting. I am not 100% sure what is wrong here, but I can give you some insights.
When designing the tests for quantization, we ran multiple tests with generate and I used to get OOMs on our CI machines, which have ~16GB of GPU RAM. The fix was simply to empty the CUDA cache after each test. Maybe the CUDA cache somehow accumulates here and causes this behaviour. Can you try calling torch.cuda.empty_cache() after the generate call? You should also run import gc; gc.collect() before that call.
For reference, check out this thread huggingface/accelerate#614 (comment) from @ydshieh
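Concretely, something along these lines after each generate call (a rough sketch, untested):

import gc

import torch

# out = model.generate(input_ids=inputs, attention_mask=attention_mask, max_new_tokens=128)

# Drop Python-side references first, then release the cached CUDA blocks
gc.collect()
torch.cuda.empty_cache()

# Optionally reset the peak-memory counter so the next measurement starts from a clean slate
torch.cuda.reset_peak_memory_stats(0)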

c3ianwu (Author) commented Jan 23, 2024

Thanks @younesbelkada.

Modified my script:

import gc

class LocalModel:

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def generate(self, prompts, do_sample=False, temperature=0, top_k=0, top_p=0, repetition_penalty=1.0, max_new_tokens=128):
        self.tokenizer.pad_token = self.tokenizer.eos_token
        tokenized_inputs = self.tokenizer(prompts, return_tensors="pt", padding=True).to(self.model.device)
        inputs = tokenized_inputs["input_ids"]
        attention_mask = tokenized_inputs["attention_mask"]
        tic = time.time()
        logits = self.model.generate(input_ids=inputs, 
                                     attention_mask=attention_mask, 
                                     do_sample=do_sample, 
                                     temperature=temperature, 
                                     top_k=top_k, 
                                     top_p=top_p, 
                                     repetition_penalty=repetition_penalty,
                                     max_new_tokens=max_new_tokens)
        max_alloc = torch.cuda.max_memory_allocated(0) / 1e9
        print("Peak GPU Memory Consumption: {}".format(max_alloc))
        gc.collect()
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats(0)
        after_clearing_alloc = torch.cuda.max_memory_allocated(0) / 1e9
        print("After clearing: {}".format(after_clearing_alloc))
        toc = time.time()
        print("Input tokens: {}".format(len(inputs[0])))
        print("Output tokens: {}".format(len(logits[0])))
        print("Time for generation: {}".format(toc - tic))
        return max_alloc, after_clearing_alloc

The plot looks like this:

[Screenshot: updated scatter plot of peak GPU memory usage (GB) against max new tokens, before and after clearing the cache]

The gradient of the linearly sloping part is still the same (about 0.065 GB / step, double what we expect). It also looks like clearing the cache is having the desired effect, but the memory consumption for generation is still off.

For the beginning bit - I assume it's allocating some memory prior to generation (I guess since we expect to generate at least some tokens)? That would explain the flat line.

I am running this in a Jupyter notebook on a GCP container. Thought it might be worth mentioning given the Flask issue discussed in huggingface/accelerate#614 (comment).


g-h-chen commented Jan 29, 2024

Hi dude,

TL;DR: pass eos_token_id=tokenizer.eos_token_id in model.generate().

I was running into the same issue. It turns out it was due to a transformers update: you now have to pass eos_token_id to model.generate(), otherwise it won't stop generating until it hits max_new_tokens or an OOM is triggered.
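For example (a sketch based on the generate call above; setting pad_token_id as well is optional but silences the pad_token_id warning):

outputs = model.generate(input_ids=inputs,
                         attention_mask=attention_mask,
                         eos_token_id=tokenizer.eos_token_id,  # stop as soon as EOS is generated
                         pad_token_id=tokenizer.eos_token_id,
                         max_new_tokens=128)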

@harry7171

Hi @g-h-chen,

Thanks for the insights, I will try these. Just to mention, I have been facing similar issues while running Mistral-7B locally.

Below is the code snippet I am using:

import transformers
from torch import cuda

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# begin initializing HF items, need auth token for these
hf_auth = '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'
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth,
    cache_dir="/fs/scratch/SX_ETL3_GenAI_Team/models/"
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    # quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth,
    cache_dir="/fs/scratch/SX_ETL3_GenAI_Team/models/"
)
model.eval()
print(f"Model loaded on {device}")

tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth,
    cache_dir="/fs/scratch/SX_ETL3_GenAI_Team/models/"
)

generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True,    # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    temperature=0.2,          # 'randomness' of outputs, 0.0 is the min and 1.0 the max
    max_new_tokens=512,       # max number of tokens to generate in the output
    cache_dir=None,
    device_map='auto'
    # repetition_penalty=1.1  # without this output begins repeating
)

table_list = [...]  # list of 50 HTML tables

for i, text in enumerate(table_list):
    print(i)
    result = generate_text(f"""Summarize the following table in detail, dont abbreviate or expand any abbreviations, keep the information as precise as possible from original text:
{text}""")
    print(result[0]['generated_text'])
    print('=' * 50)


I have an A100 80 GB GPU, but while iterating over the tables I hit an OOM error after about 28 tables. I am not sure why the memory keeps filling up during inference. Ideally it should release memory after each inference, or am I wrong somewhere here?
Any help would be appreciated.

c3ianwu (Author) commented Feb 12, 2024

eos_token_id=tokenizer.eos_token_id

@g-h-chen not sure this is the fix. I have tried the same steps with the eos token set and I'm getting the same memory profile as before.

Also, if anything we want it to hit max_new_tokens every time (for memory profiling) so that we can be sure it is outputting sequences of the length we expect. The theoretical calculations I provided above assume that outputs of a particular length have been produced.
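If it helps, one way to force fixed-length outputs for profiling (assuming a transformers version that supports min_new_tokens, which 4.36 does) is to pin the minimum to the maximum:

outputs = model.generate(input_ids=inputs,
                         attention_mask=attention_mask,
                         max_new_tokens=max_new_tokens,
                         min_new_tokens=max_new_tokens,  # EOS is suppressed until exactly this many new tokens exist
                         eos_token_id=tokenizer.eos_token_id)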


github-actions bot commented Mar 8, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.


bxrjmfh commented May 13, 2024

I am seeing the same problem.

@ArthurZucker (Collaborator)

See #30536; I would recommend everyone use the static cache with torch.compile!
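Roughly, the pattern looks like this (a minimal sketch; cache_implementation="static" requires a recent transformers release, and the exact compile settings may differ from what #30536 describes):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.bfloat16).to("cuda")

# Pre-allocate the KV cache to a fixed size instead of growing it every decoding step
model.generation_config.cache_implementation = "static"

# Compile the forward pass; the fixed cache shape avoids recompilation at each step
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))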
