Memory consumption for inference with Llama2-7B is weird #28651

Closed

c3ianwu opened this issue Jan 22, 2024 · 8 comments


c3ianwu commented Jan 22, 2024

System Info

  • transformers version: 4.36.2
  • Platform: Linux-5.15.107+-x86_64-with-glibc2.31
  • Python version: 3.10.13
  • Huggingface_hub version: 0.20.1
  • Safetensors version: 0.4.1
  • Accelerate version: 0.22.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2+cu118 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed

Who can help?

@ArthurZucker @younesbelkada @gan

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am trying to track GPU memory consumption when doing inference with Llama2-7B. This is my set-up:

import json
import tqdm
import warnings
warnings.filterwarnings('ignore')
import time

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import datasets
import matplotlib.pyplot as plt

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.bfloat16)
model.to(device=0)

prompt_data = datasets.load_from_disk("/data/metamath_100k_2048/train") # this is just some supervised training text data
prompts = prompt_data["inputs"] # this is a list of strings


class LocalModel:

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def generate(self, prompts, do_sample=False, temperature=0, top_k=0, top_p=0, repetition_penalty=1.0, max_new_tokens=128):
        self.tokenizer.pad_token = self.tokenizer.eos_token
        tokenized_inputs = self.tokenizer(prompts, return_tensors="pt", padding=True).to(self.model.device)
        inputs = tokenized_inputs["input_ids"]
        attention_mask = tokenized_inputs["attention_mask"]
        tic = time.time()
        logits = self.model.generate(input_ids=inputs, 
                                     attention_mask=attention_mask, 
                                     do_sample=do_sample, 
                                     temperature=temperature, 
                                     top_k=top_k, 
                                     top_p=top_p, 
                                     repetition_penalty=repetition_penalty,
                                     max_new_tokens=max_new_tokens)
        max_alloc = torch.cuda.max_memory_allocated(0) / 1e9
        print("Peak GPU Memory Consumption: {}".format(torch.cuda.max_memory_allocated(0) / 1e9))
        torch.cuda.reset_peak_memory_stats(0)
        toc = time.time()
        print("Time for generation: {}".format(toc - tic))
        return max_alloc

I ran

local_model = LocalModel(model, tokenizer)

alloc = []
x = [0, 2, 4, 6, 8, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160]
for i in x:
    alloc.append(local_model.generate(prompts[:64], max_new_tokens=i))


plt.scatter(x, alloc)
plt.xlabel("Max New Tokens")
plt.ylabel("Peak Mem Usage / GB")
plt.show()

This is the plot:

[Screenshot: scatter plot of peak GPU memory usage (GB) against max new tokens]

Expected behavior

I tried to compute theoretical numbers. I estimated the number of input tokens:

def calculate_prompt_tokens(tokenizer, prompts, batch_size):
    tokenizer.pad_token = tokenizer.eos_token
    tokens = tokenizer(prompts[:batch_size], return_tensors="pt", padding=True)
    return tokens["input_ids"].shape[0] * tokens["input_ids"].shape[1]

calculate_prompt_tokens(tokenizer, prompts, batch_size=64)

which returns 12992. Taking the model to be 7B params ≈ 14 GB in bf16, and assuming the KV cache consumes 2 (K and V) * 2 bytes (bf16) * num_layers * d_model = 4 * 32 * 4096 = 524,288 bytes/token, we get an estimated 14 + (12992 * 524288) * 1e-9 ≈ 20.8 GB before anything is generated, which looks about right from the graph.

Using the same logic, each additional generation step should cost (via the KV cache) 524,288 * 64 bytes ≈ 0.034 GB / step of memory. Looking at the gradient of the linear portion of the plot, we instead get ~0.067 GB / step, which is around double that amount.
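Spelling the estimate out as a quick sketch (using the standard Llama-2-7B config values of 32 layers and hidden size 4096, plus the prompt-token count from above; nothing here is measured):

# Back-of-the-envelope memory estimate for Llama-2-7B inference in bf16
NUM_PARAMS = 7e9
BYTES_PER_VALUE = 2                    # bf16
NUM_LAYERS = 32
D_MODEL = 4096
BATCH_SIZE = 64
PROMPT_TOKENS = 12992                  # from calculate_prompt_tokens above

# KV cache per token: K and V tensors, one per layer, d_model values each, 2 bytes per value
kv_bytes_per_token = 2 * NUM_LAYERS * D_MODEL * BYTES_PER_VALUE   # 524,288 bytes

weights_gb = NUM_PARAMS * BYTES_PER_VALUE / 1e9                   # ~14 GB
prefill_kv_gb = PROMPT_TOKENS * kv_bytes_per_token / 1e9          # ~6.8 GB
per_step_gb = BATCH_SIZE * kv_bytes_per_token / 1e9               # ~0.034 GB per generated token

print("Before generation: ~{:.1f} GB".format(weights_gb + prefill_kv_gb))   # ~20.8 GB
print("Expected growth: ~{:.4f} GB / step".format(per_step_gb))             # ~0.0336 GB / step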

  1. Why is the memory consumed for generation greater than expected?
  2. What's going on in the early portion of the plot? Why is there a big jump at the start?
younesbelkada (Contributor) commented Jan 23, 2024

Hi @c3ianwu
This is interesting. I am not 100% sure what is wrong here, but I can give you some insights.
When designing the tests for quantization, we ran multiple tests with generate and I used to get OOMs on our CI machines, which have ~16GB of GPU RAM. The fix was simply to empty the CUDA cache after each test. Maybe the CUDA cache somehow accumulates here and causes this behaviour. Can you try calling torch.cuda.empty_cache() after the generate call? You should also run import gc; gc.collect() before that call.
For reference, check out this thread huggingface/accelerate#614 (comment) from @ydshieh
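Concretely, something along these lines after each generate call (a rough sketch, untested):

import gc

import torch

# out = model.generate(input_ids=inputs, attention_mask=attention_mask, max_new_tokens=128)

# Drop Python-side references first, then release the cached CUDA blocks
gc.collect()
torch.cuda.empty_cache()

# Optionally reset the peak-memory counter so the next measurement starts from a clean slate
torch.cuda.reset_peak_memory_stats(0)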

c3ianwu (Author) commented Jan 23, 2024

Thanks @younesbelkada.

Modified my script:

import gc

class LocalModel:

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def generate(self, prompts, do_sample=False, temperature=0, top_k=0, top_p=0, repetition_penalty=1.0, max_new_tokens=128):
        self.tokenizer.pad_token = self.tokenizer.eos_token
        tokenized_inputs = self.tokenizer(prompts, return_tensors="pt", padding=True).to(self.model.device)
        inputs = tokenized_inputs["input_ids"]
        attention_mask = tokenized_inputs["attention_mask"]
        tic = time.time()
        logits = self.model.generate(input_ids=inputs, 
                                     attention_mask=attention_mask, 
                                     do_sample=do_sample, 
                                     temperature=temperature, 
                                     top_k=top_k, 
                                     top_p=top_p, 
                                     repetition_penalty=repetition_penalty,
                                     max_new_tokens=max_new_tokens)
        max_alloc = torch.cuda.max_memory_allocated(0) / 1e9
        print("Peak GPU Memory Consumption: {}".format(max_alloc))
        gc.collect()
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats(0)
        after_clearing_alloc = torch.cuda.max_memory_allocated(0) / 1e9
        print("After clearing: {}".format(after_clearing_alloc))
        toc = time.time()
        print("Input tokens: {}".format(len(inputs[0])))
        print("Output tokens: {}".format(len(logits[0])))
        print("Time for generation: {}".format(toc - tic))
        return max_alloc, after_clearing_alloc

The plot looks like this:

[Screenshot: updated scatter plot of peak GPU memory usage (GB) against max new tokens, before and after clearing the cache]

The gradient of the linearly sloping part is still the same (about 0.065 GB / step, double what we expect). It also looks like clearing the cache is having the desired effect, but the memory consumption for generation is still off.

For the beginning bit - I assume it's allocating some memory prior to generation (I guess since we expect to generate at least some tokens)? That would explain the flat line.

I am running this in a Jupyter notebook on a GCP container. Thought it might be worth mentioning given the Flask issue discussed in huggingface/accelerate#614 (comment).


g-h-chen commented Jan 29, 2024

Hi dude,

TL;DR: pass eos_token_id=tokenizer.eos_token_id in model.generate().

I was running into the same issue. It turns out it was due to a transformers update: you now have to pass eos_token_id to model.generate(), otherwise it won't stop generating until it hits max_new_tokens or an OOM is triggered.
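For example (a sketch based on the generate call above; setting pad_token_id as well is optional but silences the pad_token_id warning):

outputs = model.generate(input_ids=inputs,
                         attention_mask=attention_mask,
                         eos_token_id=tokenizer.eos_token_id,  # stop as soon as EOS is generated
                         pad_token_id=tokenizer.eos_token_id,
                         max_new_tokens=128)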

@harry7171

Hi @g-h-chen,

Thanks for the insights, I will try these. Just to mention, I have been facing similar issues while running Mistral-7B locally.

Below is the code snippet I am using:

import transformers
from torch import cuda

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# begin initializing HF items, need auth token for these
hf_auth = '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'
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth,
    cache_dir="/fs/scratch/SX_ETL3_GenAI_Team/models/"
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    # quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth,
    cache_dir="/fs/scratch/SX_ETL3_GenAI_Team/models/"
)
model.eval()
print(f"Model loaded on {device}")

tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth,
    cache_dir="/fs/scratch/SX_ETL3_GenAI_Team/models/"
)

generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True,    # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    temperature=0.2,          # 'randomness' of outputs, 0.0 is the min and 1.0 the max
    max_new_tokens=512,       # max number of tokens to generate in the output
    cache_dir=None,
    device_map='auto'
    # repetition_penalty=1.1  # without this output begins repeating
)

table_list = [...]  # list of 50 HTML tables

for i, text in enumerate(table_list):
    print(i)
    result = generate_text(f"""Summarize the following table in detail, dont abbreviate or expand any abbreviations, keep the information as precise as possible from original text:
{text}""")
    print(result[0]['generated_text'])
    print('=' * 50)


I have an A100 80 GB GPU, but while iterating over the tables I hit an OOM error after about 28 tables. I am not sure why the memory keeps filling up during inference. Ideally it should release memory after each inference, or am I wrong somewhere here?
Any help would be appreciated.

c3ianwu (Author) commented Feb 12, 2024

eos_token_id=tokenizer.eos_token_id

@g-h-chen not sure this is the fix. I have tried the same steps with the eos token set and I'm getting the same memory profile as before.

Also, if anything we want it to hit max_new_tokens every time (for memory profiling) so that we can be sure it is outputting sequences of the length we expect. The theoretical calculations I provided above assume that outputs of a particular length have been produced.
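If it helps, one way to force fixed-length outputs for profiling (assuming a transformers version that supports min_new_tokens, which 4.36 does) is to pin the minimum to the maximum:

outputs = model.generate(input_ids=inputs,
                         attention_mask=attention_mask,
                         max_new_tokens=max_new_tokens,
                         min_new_tokens=max_new_tokens,  # EOS is suppressed until exactly this many new tokens exist
                         eos_token_id=tokenizer.eos_token_id)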


github-actions bot commented Mar 8, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.


bxrjmfh commented May 13, 2024

I am seeing the same problem.

@ArthurZucker (Collaborator)

See #30536; I would recommend everyone use the static cache with torch.compile!
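Roughly, the pattern looks like this (a minimal sketch; cache_implementation="static" requires a recent transformers release, and the exact compile settings may differ from what #30536 describes):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.bfloat16).to("cuda")

# Pre-allocate the KV cache to a fixed size instead of growing it every decoding step
model.generation_config.cache_implementation = "static"

# Compile the forward pass; the fixed cache shape avoids recompilation at each step
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))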
