Support repetition_penalty #1424

Merged — 2 commits into vllm-project:main on Oct 29, 2023
Conversation

@beginlner (Contributor)

It has the same behavior as this.
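Usage would look roughly like this (a minimal sketch based on the examples further down in this thread; as in HF, values greater than 1.0 penalize repetition):

from vllm import LLM, SamplingParams

# repetition_penalty follows the HF convention: > 1.0 discourages tokens
# that have already appeared.
sampling_params = SamplingParams(
    temperature=0,
    max_tokens=64,
    repetition_penalty=1.2,
)
llm = LLM(model="huggyllama/llama-7b")
output = llm.generate(["Hello, what is apple?  "], sampling_params)[0]
print(output.outputs[0].text)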

@WoosukKwon (Collaborator)

Hi @beginlner, Could you tell us how it is different from #1392?

@beginlner (Contributor, Author) commented Oct 23, 2023

> Hi @beginlner, Could you tell us how it is different from #1392?

Hi, I think we implemented completely identical functions.

@zhuohan123 (Member) left a comment


LGTM! Thank you for your contribution! Added a small style fix.

@zhuohan123 zhuohan123 merged commit 69be658 into vllm-project:main Oct 29, 2023
2 checks passed
@resorcap commented Oct 30, 2023

In Hugging Face, input_token_ids contains the prompt tokens, but this PR only applies the penalty to generated tokens, so the behavior is not the same.
@zhuohan123 @beginlner
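For context, transformers applies its repetition penalty roughly like this (a sketch of the upstream RepetitionPenaltyLogitsProcessor logic, not vLLM code); note that input_ids here covers the prompt as well as previously generated tokens:

import torch

def hf_style_repetition_penalty(scores: torch.Tensor,
                                input_ids: torch.Tensor,
                                penalty: float) -> torch.Tensor:
    # Gather the logits of every token id that already appears in input_ids
    # (prompt tokens included, which is the point being raised above).
    score = torch.gather(scores, 1, input_ids)
    # Divide positive logits and multiply negative ones, so penalty > 1
    # always makes repeated tokens less likely.
    score = torch.where(score < 0, score * penalty, score / penalty)
    return scores.scatter(1, input_ids, score)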

@zwj536 commented Oct 31, 2023

I'm getting inconsistent results between HF and vllm with llama-7b @beginlner @WoosukKwon

## hf
import torch
from transformers import (
    AutoTokenizer, 
    AutoModelForCausalLM
)

MODEL_NAME = "huggyllama/llama-7b"
#MODEL_NAME = "huggyllama/llama-13b"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16).cuda()  

prompt = [
    "Hello, what is apple?  ",
]
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
generated_ids = model.generate(
    input_ids, 
    do_sample=False, 
    repetition_penalty=1.2, 
    max_new_tokens=64,
)

texts = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
""" OUTPUT
Hello, what is orange?   \nHello, what are you doing here?\n— _The Wizard of Oz_ , L. Frank Baum (1900)\n#  **What Is a Computer?**\nA computer is an electronic device that can store and process information in the'
"""

## vllm
from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, what is apple?  ",
]
sampling_params = SamplingParams(
    temperature=0,
    max_tokens=64, 
    #frequency_penalty=1.2,
    #presence_penalty=1.2,
    repetition_penalty=1.2,
)

MODEL_NAME = "huggyllama/llama-7b"
#MODEL_NAME = "huggyllama/llama-13b"

llm = LLM(
    model=MODEL_NAME,
    trust_remote_code=True,
)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

""" OUTPUT
Hello, what is apple?  \nHello, what is apple?  \nHello, what is apple?  \nHello, what is apple?  \nWhat\'s that you say?  What\'s that you say?"  And so on. The child will repeat the question until he gets an answer
"""

@resorcap commented Nov 1, 2023

> I'm getting inconsistent results between HF and vllm with llama-7b @beginlner @WoosukKwon
> [quoting the HF/vLLM comparison above; in the quoted version SamplingParams is set with frequency_penalty=1.2 rather than repetition_penalty=1.2]

Use repetition_penalty instead of frequency_penalty in vLLM. Another remaining defect is that the penalized input_ids are inconsistent: HF includes the prompt tokens, while this PR penalizes only the generated tokens.
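For reference, the two penalty families behave quite differently; a minimal sketch (not vLLM's actual code), assuming output_counts holds per-vocab-entry counts of previously generated tokens:

import torch

def apply_penalties(logits: torch.Tensor,        # [vocab_size]
                    output_counts: torch.Tensor, # [vocab_size], counts of generated tokens
                    presence_penalty: float = 0.0,
                    frequency_penalty: float = 0.0,
                    repetition_penalty: float = 1.0) -> torch.Tensor:
    # OpenAI-style penalties are additive: a fixed subtraction per occurrence
    # (frequency) or once per distinct seen token (presence).
    logits = logits - frequency_penalty * output_counts
    logits = logits - presence_penalty * (output_counts > 0).float()
    # HF-style repetition_penalty is multiplicative: divide positive logits
    # and multiply negative ones for every token that has already appeared.
    seen = output_counts > 0
    logits = torch.where(seen & (logits > 0), logits / repetition_penalty, logits)
    logits = torch.where(seen & (logits <= 0), logits * repetition_penalty, logits)
    return logits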

@zwj536 commented Nov 1, 2023

> [quoting @resorcap's reply above: use repetition_penalty instead of frequency_penalty in vLLM, and the penalized input_ids are still inconsistent]

@resorcap Hi, the result with repetition_penalty is still inconsistent:

"""
# hf
# repetition_penalty=1.2
Hello, what is orange?   \nHello, what are you doing here?\n— _The Wizard of Oz_ , L. Frank Baum (1900)\n#  **What Is a Computer?**\nA computer is an electronic device that can store and process information in the

# vllm
# frequency_penalty=1.2
Hello, what is apple?  \nHello, what is apple?  \nHello, what is apple?  \nHello, what is apple?  \nWhat\'s that you say?  What\'s that you say?"  And so on. The child will repeat the question until he gets an answer

# presence_penalty=1.2
Hello, what is apple?  \nHello, what is apple?  \nHello, what is apple?  \nHello, what is apple?  \nHello, what is apple?  \nHello, what is apple?  \nHello, what is apple?  \nHello, what is apple? 

# repetition_penalty=1.2,
Hello, what is apple?  \nWhat's that you say?  It's an orange.   \nNo it isn't! No it isn't! I know a banana when I see one and this ain't no banana. This here is an apple all right but not
"""

@beginlner (Contributor, Author)

Hi @resorcap @zwj536, thank you for the correction; I have fixed it in #1577.
