Refactor llama family models #2637
Conversation
Hi @zhuohan123 @WoosukKwon @simon-mo, I have completed most of this PR. Any suggestions on this idea?
Overall this looks good to me. Please do put up a test result log showing all the models still work.
Thanks, I will post test results later.
Test Code:

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is"
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)
# Create an LLM.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", dtype="half", tensor_parallel_size=2, enforce_eager=True, trust_remote_code=True)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Model Results:
meta-llama/Llama-2-7b-chat-hf
BAAI/AquilaChat-7B
BAAI/AquilaChat2-7B
stabilityai/stablelm-3b-4e1t
01-ai/Yi-6B-Chat
mistralai/Mistral-7B-Instruct-v0.1
mistralai/Mistral-7B-Instruct-v0.2
internlm/internlm-chat-7b
internlm/internlm2-chat-7b
Qwen/Qwen-7B-Chat
baichuan-inc/Baichuan2-13B-Chat
Hi @simon-mo, I have posted the test results. PTAL.
This looks like a great change to me. I think you can really speed up getting it merged if you split it into two PRs. That will make it much easier to review (if the second PR is more complicated, you can do it model by model).
Merging, as this is a great simplification. We do need to add more accuracy tests; we will follow up once #2844 is in place.
@esmeetu it looks like the LoRA test failure is related to this PR: https://buildkite.com/vllm/ci/builds/1165#018da191-b529-4516-af17-10858fd5b73e
This reverts commit 5c976a7.
@esmeetu We chatted with @WoosukKwon and I think he only wants to unify the first class of models. How about we just do them one by one and decide if it makes sense for each model? I made an example in #2854 and made you co-author :)
@esmeetu As @pcmoritz mentioned, we reverted the PR to make the change smaller. Would you be able to make the change for the first-class models?
For the second-class models, I feel isolating them from the first-class models makes more sense, since integrating the models can potentially block (or slow down the development of) the optimizations for the first-class models.
@pcmoritz @WoosukKwon Alright, I appreciate your advice. I'll work on making the integration clearer.
This PR is experimental and aims to reduce redundant code in the llama family of models. We can register support for those models through the _MODELS variable in models/__init__.py (pointing them at the existing llama model, as sketched below) and remove the repeated model files, because some llama-architecture models differ only in layer names, norm function, and weight-loading logic. This makes it easier to extend model support without changing every model file, and it keeps the repository smaller and clearer.
More discussion about this idea is needed.
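For illustration, here is a minimal sketch of the registry idea (a sketch only; the exact entries and tuple layout of vLLM's _MODELS registry in this PR may differ): architectures that differ from llama only superficially would all resolve to the shared llama implementation instead of keeping a near-identical file each.

# Sketch of vllm/model_executor/models/__init__.py (illustrative entries only).
# Each architecture name maps to (module_name, class_name); llama-family
# architectures point at the same llama module rather than their own files.
_MODELS = {
    "LlamaForCausalLM": ("llama", "LlamaForCausalLM"),
    # Llama-family architectures reuse the llama implementation:
    "AquilaForCausalLM": ("llama", "LlamaForCausalLM"),
    "InternLMForCausalLM": ("llama", "LlamaForCausalLM"),
    "MistralForCausalLM": ("llama", "LlamaForCausalLM"),
    "YiForCausalLM": ("llama", "LlamaForCausalLM"),
    # Architectures with real structural differences keep their own modules:
    "QWenLMHeadModel": ("qwen", "QWenLMHeadModel"),
}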
The main differences among the llama-family models:
- Config: a model's config.json can be loaded with the plain PretrainedConfig from transformers if it needs no custom attribute_map (see the example after this list).
- Norm: some models use LayerNorm, and some use RMSNorm.
- qwen2 (special case).
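For the config point above, a hedged example of what a custom attribute_map looks like (the class, model_type, and field names here are hypothetical; only the attribute_map mechanism of transformers' PretrainedConfig is real): a model whose config.json uses non-standard field names can still expose the attribute names a shared llama implementation reads.

from transformers import PretrainedConfig

class ExampleLlamaLikeConfig(PretrainedConfig):  # hypothetical config class
    model_type = "example_llama_like"  # hypothetical model type
    # Reads/writes of the left-hand names are redirected to the right-hand
    # names actually stored in config.json.
    attribute_map = {
        "hidden_size": "n_embd",
        "num_attention_heads": "n_head",
        "num_hidden_layers": "n_layer",
    }

    def __init__(self, n_embd=4096, n_head=32, n_layer=32, **kwargs):
        self.n_embd = n_embd
        self.n_head = n_head
        self.n_layer = n_layer
        super().__init__(**kwargs)

config = ExampleLlamaLikeConfig()
print(config.hidden_size)  # 4096, resolved through attribute_map

A model whose config.json already uses the standard names needs no such map and can be loaded with the plain PretrainedConfig.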