Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales #4343
[Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales #4343
Changes from 28 commits
ccee5d3
8f71c79
6eb01e0
dc89cbc
be60845
4613cb5
3d95d86
642763f
706e931
9a3c78c
1b6f020
b09bcec
052e2b3
b33c6d7
475f58d
56b4880
be37154
9c54d19
c5155ea
948cca7
3feb887
df16316
794f1a1
c13b6a4
5a230ed
80069c9
5ce17d0
92d5162
521a4c8
f330056
2ec375c
72e1e42
0f5824c
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wondering: For static quantization, can we just have this scaling factor in
config.json
? I'm not sure if this is a better decision than having a separate quantization config file, but it seems feasible. WDYT?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it is feasible. The reason why I put it into the safetensor files is because that's a better place to store tensors than the config.json and for some schemes, these tensors can be larger (e.g. for per-channel quantization). Since the checkpoint already needs to be rewritten (to add the quant_config.json and also possibly to convert weights to FP8), I don't think it is a big problem to rewrite the safetensors to include activation scales. This is also what https://huggingface.co/FriendliAI/Mistral-7B-Instruct-v0.2-fp8 does (note that their model format is otherwise not very useful since it stores the weights as int8).
I assume that a standard for FP8 will emerge, and I would expect it to store the scales in the safetensor files -- this is a no brainer for weight scales to keep them close to the weights and make sure they are consistent with the quantized weights but also makes sense for activation scales. Once we have a standard, we should use that. Right now, I don't think trying to invent our own standard in
quantize_config.json
is a good idea (since it involves a schema), whereas storing scales in the safetensor scales is pretty canonical and doesn't require us to invent a lot of convention.These are my reasons -- let me know if you prefer otherwise :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also cc @robertgshaw2-neuralmagic who also thought about this a bunch I think :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @pcmoritz , the
scales
should be in a safetensors fileThe mental model for the
quantize_config.json
is that it would hold metadata about what is in thesafetensors
file.So examples could be:
For this first implementation, we don't this, but if we start supporting various different schemes, then we will (since we need to know this when
create_weights
is called - which happens before we look at thesafetensors
file)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, mostly
quantize_config
is not used anymore, even forAutoGPTQ
The
config.json
typically has aquantization_config
.https://huggingface.co/astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit/blob/main/config.json
We only fall back to
quantize_config.json
ifquantization_config
is not found in theconfig.json
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for GPTQ/AWQ, having a
quantize_config.json
is not necessarily required. So I think this check could break models that have:quantization_config
inconfig.json
quantize_config.json
For example: https://huggingface.co/casperhansen/llama-3-70b-instruct-awq/tree/main
Our CI does not seem to have any models with this setup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I fixed this now by removing the quantize_config.json support -- we can just use
config.json
for specifying the quantization for FP8 checkpoints for the time being :)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Theoretically, we shouldn't use "cuda" in the model code. Since the GPU worker sets "cuda" as the default device in torch, device="cuda" is not necessary. Also, it's not good for the compatibility with non-CUDA devices.
This rule is violated for Mixtral and other MoE models unfortunately. 😢
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll make a follow up PR to remove the
device="cuda"
-- since we also specify it explicitly for the other parameters, I don't want to be inconsistent for this PR :)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated to this PR, I think we should have an
MoELayer
that is shared across modelsAll of these changes are currently only impacting
Mixtral
, but could also be applied to other models. Since we have all this generic logic in the model definitions, we are losing out at running others with these features