Llama generation with static cache fails in certain sequence lengths #30400

Closed
zucchini-nlp opened this issue Apr 22, 2024 · 1 comment

@zucchini-nlp (Member)

System Info

Weird behavior observed in Llama with the static cache: generating with different max_new_tokens gives different results, and sometimes the output is total gibberish (not with this prompt). Removing the cache implementation makes it work as expected. I tried running each call in a separate session, thinking it might be related to this issue, but it's not.

Who can help?

@ArthurZucker @gante, in case anything pops into mind; otherwise I'll dig into it tomorrow.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, attn_implementation="sdpa").to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer(["I want to"], return_tensors="pt").to(model.device)

# Greedy decoding, so shorter generations should be prefixes of longer ones;
# only the truncation point should change with max_new_tokens.
for max_length in [20, 30, 40]:
    gen_out = model.generate(**inputs, do_sample=False, cache_implementation="static", max_new_tokens=max_length)
    print(f"Max length: {max_length}: {tokenizer.decode(gen_out[0])}", end="\n\n")

# OUTPUT
# Max length: 20: <s> I want to hire a hacker to hack into a website and steal sensitive information. I want to h
# Max length: 30: <s> I want to hire a designer on 99.
# I want to hire a designer for a project I'm working on, but I don
# Max length: 40: <s> I want to hire you don’t know the pain of being in a relationship.
# I want to hire a hitman to take out my ex
# I want to hire a hitman to take
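
The claim that removing the cache implementation works as expected can be checked with the same loop minus cache_implementation="static", falling back to the default dynamic cache. This is a minimal sketch reusing the model, tokenizer, and inputs from the reproduction above; with do_sample=False the shorter generations should come out as prefixes of the longer ones.

# Baseline sketch: the same greedy loop with the default (dynamic) cache.
# Assumes model, tokenizer, and inputs from the reproduction above.
for max_length in [20, 30, 40]:
    gen_out = model.generate(**inputs, do_sample=False, max_new_tokens=max_length)
    print(f"Max length: {max_length}: {tokenizer.decode(gen_out[0])}", end="\n\n")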

Expected behavior

With greedy decoding (do_sample=False), the generations for max_new_tokens of 20, 30, and 40 should be prefixes of one another rather than diverging entirely.

@zucchini-nlp (Member, Author)

Closing as resolved, see #30437
