[Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales #4343
Conversation
Overall LGTM!
The way the activation scaling method is propagated is a bit tricky, but it should become a lot easier with #4342. We can merge this PR first since it's more straightforward and involves no design changes.
@pcmoritz Thanks for the PR! The code looks very clean. I like it! Left some minor comments on the code style. Please check them out.
@classmethod
def get_config_filenames(cls) -> List[str]:
-    return []
+    return ["quantize_config.json"]
Just wondering: for static quantization, can we just have this scaling factor in config.json? I'm not sure if this is a better decision than having a separate quantization config file, but it seems feasible. WDYT?
Yes, it is feasible. The reason I put it into the safetensors files is that they are a better place to store tensors than config.json, and for some schemes these tensors can be larger (e.g. for per-channel quantization). Since the checkpoint already needs to be rewritten (to add the quant_config.json and also possibly to convert weights to FP8), I don't think it is a big problem to rewrite the safetensors to include activation scales. This is also what https://huggingface.co/FriendliAI/Mistral-7B-Instruct-v0.2-fp8 does (note that their model format is otherwise not very useful since it stores the weights as int8).
I assume that a standard for FP8 will emerge, and I would expect it to store the scales in the safetensors files -- for weight scales this is a no-brainer, since it keeps them close to the weights and ensures they stay consistent with the quantized weights, but it also makes sense for activation scales. Once we have a standard, we should use that. Right now, I don't think trying to invent our own standard in quantize_config.json is a good idea (since it involves a schema), whereas storing scales in the safetensors files is pretty canonical and doesn't require us to invent a lot of convention.
These are my reasons -- let me know if you prefer otherwise :)
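For concreteness, here is a minimal sketch of what storing activation scales next to the weights could look like, assuming a checkpoint already split into safetensors shards. The shard name and the `.act_scale` tensor naming are purely illustrative, not the convention used by this PR.

```python
# Hedged sketch (not from this PR): one way to add per-layer activation
# scales to an existing safetensors shard. Names and values are made up.
import torch
from safetensors.torch import load_file, save_file

shard = "model-00001-of-00019.safetensors"  # hypothetical shard name
tensors = load_file(shard)

# Calibrated activation scales, keyed by a layer prefix (example value).
act_scales = {"model.layers.0.block_sparse_moe": 0.0123}

for prefix, scale in act_scales.items():
    tensors[f"{prefix}.act_scale"] = torch.tensor([scale], dtype=torch.float32)

save_file(tensors, shard)
```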
Also cc @robertgshaw2-neuralmagic who also thought about this a bunch I think :)
I agree with @pcmoritz, the scales should be in a safetensors file.
The mental model for the quantize_config.json is that it would hold metadata about what is in the safetensors file.
So examples could be:
- Datatype of the weights
- Whether the activations are static or dynamic (so we don't have to peek into safetensors)
- Channelwise vs not, etc.
For this first implementation we don't need this, but if we start supporting various different schemes, then we will (since we need to know this when create_weights is called, which happens before we look at the safetensors file).
Also, quantize_config.json is mostly not used anymore, even for AutoGPTQ. The config.json typically has a quantization_config.
https://huggingface.co/astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit/blob/main/config.json
We only fall back to quantize_config.json if quantization_config is not found in the config.json.
need_act_scales = (self.use_fp8 and
                   linear_method.quant_config.act_scaling == "static")
self.as_scale = nn.Parameter(
    torch.zeros(1, device="cuda", dtype=torch.float32),
nit: Theoretically, we shouldn't use "cuda" in the model code. Since the GPU worker sets "cuda" as the default device in torch, device="cuda" is not necessary. It's also not good for compatibility with non-CUDA devices.
This rule is unfortunately violated for Mixtral and other MoE models. 😢
Ok, I'll make a follow-up PR to remove the device="cuda" -- since we also specify it explicitly for the other parameters, I don't want to be inconsistent in this PR :)
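As a minimal sketch of the point above (torch.set_default_device is used here just for illustration; the actual worker setup in vLLM may differ), the explicit device argument becomes unnecessary once a default device is set:

```python
import torch
import torch.nn as nn

# A GPU worker would set the default device once during initialization.
if torch.cuda.is_available():
    torch.set_default_device("cuda")

# No device="cuda" needed: the tensor is created on the default device.
as_scale = nn.Parameter(torch.zeros(1, dtype=torch.float32),
                        requires_grad=False)
print(as_scale.device)  # cuda:0 on a GPU machine, cpu otherwise
```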
Co-authored-by: Woosuk Kwon <[email protected]>
Hey @pcmoritz, generally looks good.
Question -- what is the UX for this?
I see three cases:
1. We have an fp8 model with dynamic activation quantization
2. We have an fp8 model with static activation quantization
3. We have an fp16 model which we want to quantize automatically and run with dynamic quantization
For 1-2, I don't think the user should have to specify that they want fp8. For these cases, I think we will need a simple quantization_config inside the config.json that we can use to parse which of the cases we have.
For 3, there is no quantization config needed. The user should have to specify that they want to use fp8 in this case.
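As a hedged sketch of how these three cases could be told apart at load time (the quant_method and activation_scheme field names are assumptions for illustration, not necessarily what the PR or config.json uses):

```python
# Hypothetical sketch, not the code in this PR.
import json


def detect_fp8_case(config_path: str, user_requested_fp8: bool) -> str:
    with open(config_path) as f:
        hf_config = json.load(f)

    quant_cfg = hf_config.get("quantization_config")
    if quant_cfg and quant_cfg.get("quant_method") == "fp8":
        # Cases 1-2: the checkpoint is already fp8; no user flag needed.
        if quant_cfg.get("activation_scheme") == "static":
            return "fp8 checkpoint, static activation scales"
        return "fp8 checkpoint, dynamic activation scales"

    if user_requested_fp8:
        # Case 3: fp16 checkpoint quantized on the fly at load time.
        return "fp16 checkpoint, quantize to fp8 with dynamic scales"

    return "no fp8"
```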
})
set_weight_attrs(self.a2s_scale, {
    "weight_loader": self.weight_loader,
})

def weight_loader(self, param: nn.Parameter, loaded_weight: torch.Tensor,
Unrelated to this PR, I think we should have an MoELayer that is shared across models.
All of these changes currently only impact Mixtral, but could also be applied to other models. Since we have all this generic logic in the model definitions, we are losing out on running other models with these features.
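A rough skeleton of that idea, with hypothetical names and signatures (not vLLM's actual API), might look like this:

```python
import torch
import torch.nn as nn


class SharedMoELayer(nn.Module):
    """Hypothetical shared MoE layer owning expert weights and fp8 scales."""

    def __init__(self, num_experts: int, hidden_size: int,
                 intermediate_size: int, use_fp8: bool = False):
        super().__init__()
        # Fused gate/up projections and down projections for all experts.
        self.ws = nn.Parameter(
            torch.empty(num_experts, 2 * intermediate_size, hidden_size))
        self.w2s = nn.Parameter(
            torch.empty(num_experts, hidden_size, intermediate_size))
        # A static activation scale is only needed for fp8 with static scaling.
        self.a_scale = (nn.Parameter(torch.zeros(1), requires_grad=False)
                        if use_fp8 else None)

    def forward(self, hidden_states: torch.Tensor,
                router_logits: torch.Tensor) -> torch.Tensor:
        # A real implementation would call the fused MoE kernel here,
        # passing the weights and (optionally) the fp8 scales.
        raise NotImplementedError
```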
Yeah exactly -- for 1 and 2, the config.json in the model checkpoint will have the right quantization_config.
quant_config_files = [
    f for f in config_files if any(
        f.endswith(x) for x in possible_config_filenames)
]
For GPTQ/AWQ, having a quantize_config.json is not necessarily required, so I think this check could break models that have:
- quantization_config in config.json
- no quantize_config.json
For example: https://huggingface.co/casperhansen/llama-3-70b-instruct-awq/tree/main
Our CI does not seem to have any models with this setup.
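A hedged sketch of the fallback being described (prefer quantization_config from config.json, only then look for a standalone quantize_config.json); this mirrors the discussion rather than vLLM's actual loader code:

```python
import json
import os
from typing import Optional


def load_quant_config(model_dir: str) -> Optional[dict]:
    with open(os.path.join(model_dir, "config.json")) as f:
        hf_config = json.load(f)

    # Preferred: quantization_config embedded in config.json
    # (as in the AWQ checkpoint linked above).
    if "quantization_config" in hf_config:
        return hf_config["quantization_config"]

    # Fallback: a separate quantize_config.json, if the checkpoint ships one.
    fallback = os.path.join(model_dir, "quantize_config.json")
    if os.path.exists(fallback):
        with open(fallback) as f:
            return json.load(f)

    return None
```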
Thanks, I fixed this now by removing the quantize_config.json support -- we can just use config.json for specifying the quantization for FP8 checkpoints for the time being :)
Remove the device="cuda" declarations in mixtral as promised in #4343
This PR contains the performance improvements mentioned in #4244
The performance numbers are as follows (TP2, H100) -- with mistralai/Mixtral-8x7B-Instruct-v0.1:
Note that to get the performance improvements from static scaling, you will need a model checkpoint with the scales. Instructions on how to create the checkpoint:
- Copy the original Mixtral-8x7B-Instruct-v0.1 checkpoint directory over to Mixtral-8x7B-Instruct-v0.1-fp8.
- Inside the Mixtral-8x7B-Instruct-v0.1-fp8 directory, run the conversion script (with mixtral_scales.pth generated as described in #3208 (comment)).
- Then create a file quantize_config.json inside of Mixtral-8x7B-Instruct-v0.1-fp8 with the quantization settings (a hedged sketch of plausible contents follows this list).
- You can then run the model with vLLM, pointing it at the new checkpoint directory.
Note that this is probably not the best way to generate these scales; I'm looking forward to suggestions on how to generate better ones :)
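The exact quantize_config.json contents are not reproduced in this thread; the following is only a plausible sketch. The act_scaling field matches the check in the Mixtral code quoted earlier, while the quant_method value is an assumption.

```python
# Hypothetical sketch: write a minimal quantize_config.json of the kind this
# PR reads. "act_scaling": "static" matches the check quoted above; the
# "quant_method" field is an assumption, not confirmed by this thread.
import json
import os

ckpt_dir = "Mixtral-8x7B-Instruct-v0.1-fp8"
quant_config = {
    "quant_method": "fp8",
    "act_scaling": "static",
}
with open(os.path.join(ckpt_dir, "quantize_config.json"), "w") as f:
    json.dump(quant_config, f, indent=2)
```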