[Kernel] FP8 support for MoE kernel / Mixtral #4244

pcmoritz · 2024-04-21T20:33:30Z

This PR is the first step towards fixing #3208.

It implements dynamic per-tensor scaling (see #4118), so users do not need to compute activation scales on a calibration dataset and they also don't need to convert their model checkpoints. It is enough to specify the quantization="fp8" argument. You can try out the PR like this:

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2, quantization="fp8")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Performance: This PR focuses on keeping the code clean while still achieving reasonable performance. A number of optimizations that significantly improve performance (similar to the numbers in #3954) will be submitted in a follow-up PR. With this PR, the results are as follows:

[Image: Screenshot 2024-04-21 at 1 31 50 PM]

Accuracy: The accuracy with this PR on MMLU on mistralai/Mixtral-8x7B-v0.1 is as follows:

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7018|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6472|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7673|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8099|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6131|±  |0.0083|

This compares favorably with the fp16 results, which are:

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7020|±  |0.1313|
| - humanities     |N/A    |none  |     5|acc   |0.6425|±  |0.1349|
| - other          |N/A    |none  |     5|acc   |0.7744|±  |0.1038|
| - social_sciences|N/A    |none  |     5|acc   |0.8131|±  |0.0695|
| - stem           |N/A    |none  |     5|acc   |0.6108|±  |0.1383|
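To make the comparison concrete, the per-group differences between the two tables can be computed directly. This is a small sketch that only restates the numbers above; the dictionary names are illustrative.

```python
# Accuracy values copied from the two MMLU tables above.
fp8 = {
    "mmlu": 0.7018,
    "humanities": 0.6472,
    "other": 0.7673,
    "social_sciences": 0.8099,
    "stem": 0.6131,
}
fp16 = {
    "mmlu": 0.7020,
    "humanities": 0.6425,
    "other": 0.7744,
    "social_sciences": 0.8131,
    "stem": 0.6108,
}

# Difference fp8 - fp16 per group, rounded to the precision of the tables.
deltas = {group: round(fp8[group] - fp16[group], 4) for group in fp8}
largest_gap = max(abs(delta) for delta in deltas.values())

for group, delta in deltas.items():
    print(f"{group:>16}: {delta:+.4f}")
print(f"largest absolute gap: {largest_gap:.4f}")
```

The overall MMLU difference is 0.0002, and the largest per-group gap (0.0071, on "other") is about the size of the fp8 standard error reported for that group.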

Happy hacking!


mgoin · 2024-04-22T19:00:58Z

vllm/model_executor/models/mixtral.py

+            self.ws = nn.Parameter(ws, requires_grad=False)
+            self.w2s = nn.Parameter(w2s, requires_grad=False)


If we define a CUDA device for the original Parameter, should we do the same here?

ws will inherit the device from self.ws.data through torch.empty_like, so I don't think we need to specify a CUDA device here (and we do want to remove it from the original Parameter going forward, too).

comaniac

LGTM!
btw, the scaled quantization kernel shows 2-3x speedup over the PyTorch op implementation in FP8 linear method on my L4, so we should expect some improvements if we use this op in the FP8 linear method. I'll send a follow-up PR after this.

mgoin

Nice job, looks good to me!

mgoin · 2024-04-22T23:04:38Z

vllm/model_executor/layers/fused_moe/fused_moe.py

@@ -129,7 +136,10 @@ def fused_moe_kernel(
                    mask=offs_k[:, None] < K - k * BLOCK_SIZE_K,
                    other=0.0)
        # We accumulate along the K dimension.
-        accumulator += tl.dot(a, b)
+        if use_fp8:
+            accumulator = tl.dot(a, b, acc=accumulator, allow_tf32=True)


nit: allow_tf32 is deprecated, and the recommended replacement is input_precision="tf32".
It might also be worth noting whether this is required for fp8 to work or is just a performance optimization.

The input_precision="tf32" interface is not part of the Triton 2.2.0 release yet (that is the version we are using; Triton is pinned to 2.2.0 via the PyTorch version we pin).

I tried without this flag and can't measure a performance difference, so I will remove it. It seems to be the default for devices with tensor cores anyway (https://triton-lang.org/main/python-api/generated/triton.language.dot.html#triton.language.dot). Removing the flag also makes us robust against the allow_tf32 vs. input_precision interface change. Once we have upgraded Triton, it might be worth looking at triton-lang/triton#3234.

cadedaniel

Great work! Approving after a light pass given @comaniac and @mgoin's approval.

One question:

> It implements dynamic per-tensor scaling (see #4118), so users do not need to compute activation scales on a calibration dataset and they also don't need to convert their model checkpoints.

Doesn't this approach still depend on the choice of scaling factor as not all datasets will have the same activation patterns for a given layer? Perhaps the d(quality)/d(per-tensor scaling factor) is small versus other methods, but there is still some impact on quality by the chosen per-tensor scaling factor.

pcmoritz · 2024-04-23T22:14:03Z

In a way, it selects the best scaling factor possible at runtime (assuming per-tensor scaling, without subdividing the tensor further), since it uses the actual activations of the current batch in the forward pass to compute the scaling factor (vs. using a scale that was computed on an offline dataset). The resulting fp8 tensor will always have 448.0, the largest finite value of the e4m3 format, as its maximum absolute value, and you can't quantize a given tensor better than that.
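The idea can be sketched in plain Python. This is an illustration only, not the kernel from this PR: `round_to_e4m3` is a hypothetical helper that simulates rounding to the e4m3 value grid (4 exponent bits, 3 mantissa bits, maximum 448.0), and the function names are made up for the example.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value of the e4m3 format


def round_to_e4m3(x: float) -> float:
    """Simulate rounding to the nearest e4m3 value (hypothetical helper)."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    # frexp returns x = m * 2**e with 0.5 <= |m| < 1, so floor(log2|x|) = e - 1.
    # Exponents below -6 fall into the subnormal range, which shares one step size.
    exponent = max(math.frexp(x)[1] - 1, -6)
    step = 2.0 ** (exponent - 3)  # 3 mantissa bits -> 8 steps per power of two
    return round(x / step) * step


def dynamic_per_tensor_quantize(values):
    """Pick the scale from the tensor itself so that its max |value| maps to 448."""
    amax = max(abs(v) for v in values)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    quantized = [round_to_e4m3(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    return [q * scale for q in quantized]


activations = [0.1, -2.5, 7.0, 0.03]
quantized, scale = dynamic_per_tensor_quantize(activations)
print(quantized)                     # values on the simulated fp8 grid
print(scale)                         # amax / 448
print(dequantize(quantized, scale))  # approximate reconstruction
```

Because the scale is derived from the current tensor, the largest element always lands exactly on 448.0, so the full representable range is used regardless of the input distribution.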


pingzhuu · 2024-04-25T08:09:55Z

vllm/model_executor/layers/fused_moe/fused_moe.py

+        if use_fp8:
+            accumulator = tl.dot(a, b, acc=accumulator)


Thanks for your great work! I have a question here: why does fp8 need to do this? Is there any difference?

This is a great question. I'm doing it because it supports fp8 fast accumulation; see:

https://github.com/NVIDIA/cutlass/blob/5c447dd84f8ae0e1d48ff9a2eae26ce8c4958101/CHANGELOG.md?plain=1#L67

https://github.com/openai/triton/blob/5623cdc5fb2d497d2d48cea89170707a97029219/python/triton/ops/matmul.py#L132

The downside is that you can't do rescaling before adding it to the accumulator.
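A toy model of that trade-off, in plain Python rather than Triton: `dot` below is a stand-in written for this example (it is not `tl.dot`), and the scales are made-up values. It assumes a single scale per tensor, in which case the rescaling can be deferred until after the whole K loop and gives the same result as rescaling every partial product.

```python
def dot(a, b, acc=0.0):
    """Toy fused multiply-accumulate: returns acc + sum(a_i * b_i)."""
    for x, y in zip(a, b):
        acc += x * y
    return acc


def chunks(values, size):
    """Split a list into consecutive blocks, like the K-dimension loop."""
    return [values[i:i + size] for i in range(0, len(values), size)]


# Quantized inputs and hypothetical per-tensor scales.
a_q = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
b_q = [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
a_scale, b_scale = 0.5, 0.25

# Fused form: the accumulator is passed into dot, so no rescaling can
# happen between blocks. The scales are applied once at the end.
fused = 0.0
for a_blk, b_blk in zip(chunks(a_q, 2), chunks(b_q, 2)):
    fused = dot(a_blk, b_blk, acc=fused)
fused *= a_scale * b_scale

# Unfused form: each partial product is rescaled before it is added.
rescaled = 0.0
for a_blk, b_blk in zip(chunks(a_q, 2), chunks(b_q, 2)):
    rescaled += a_scale * b_scale * dot(a_blk, b_blk)

print(fused, rescaled)
```

With per-block scales that differ from one K block to the next, only the unfused form would be possible, which is the limitation mentioned above.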


pcmoritz added 30 commits April 17, 2024 13:40

add initial single block kernel

ab7963e

update

45225aa

use blocks

69b52cc

fix

dd6f680

update

cb89c0f

port fp8 code

4351703

config

c303674

update

267f856

custom ops

d85fb1a

update

96e3f8b

update

0690411

fix initialization

130899b

add fp8_silu_and_mul_kernel

0a10737

update

ab9fec4

fix

10a5697

fix

c89d2a8

convert in kernel

9435467

cleanup

609f493

conversion

d790697

update

400a7e1

update

4b2c8f4

update

cc2a488

update

dc6add9

update

0af9edc

update

f2a934d

update

ce663ec

Merge branch 'main' into mixtral-fp8-final

4047a93

update

77bdc3e

Use MoE for fp8 quant

bb123dd

fix

d212d2d

update

aedd33d

mgoin reviewed Apr 22, 2024

pcmoritz added 3 commits April 22, 2024 12:11

keep fused_moe interface

4aa77c9

typo

69ad2dc

fixloading config file

bae81d3

comaniac approved these changes Apr 22, 2024

mgoin reviewed Apr 22, 2024

pcmoritz added 8 commits April 22, 2024 19:06

update

b733cea

update

d53b1fc

update

5ef2ee9

fix

8807300

update

a15a7b5

format

8fd40c1

align

0f93811

rerun ci

fbbfc61

cadedaniel approved these changes Apr 23, 2024

rerun ci

725270e

pcmoritz enabled auto-merge (squash) April 23, 2024 23:19

pcmoritz merged commit eace8bf into vllm-project:main Apr 24, 2024
47 checks passed

pcmoritz mentioned this pull request Apr 24, 2024

[Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales #4343

Merged

pingzhuu reviewed Apr 25, 2024

pcmoritz mentioned this pull request Apr 30, 2024

[Bugfix][Kernel] Fix compute_type for MoE kernel #4463

Merged

dtrifiro mentioned this pull request May 15, 2024

bump ubi base image tag opendatahub-io/vllm#24

Merged
