GPTQ & AWQ Fused MOE #2761
base: main
Conversation
@chu-tianxiang Great job on optimizing GPTQ! Is there another option besides repacking for AWQ? |
I can implement the AWQ kernel based on current AWQ gemm implementation too. Which do you think is better? |
I would prefer it if you can base it on the current AWQ GEMM kernel |
I have updated the AWQ kernels. AWQ GEMM uses tensor cores and has better performance at large batch sizes, which turns out to be better suited to the MoE case. |
This is excellent work! Looking forward to seeing this merged for a big speedup. |
@chu-tianxiang On a side note, I tried importing the kernels from here to AutoAWQ and I am getting CUDA illegal memory access on multi-GPU while it works fine on a single GPU. It triggers at However, I do not get the same issue in vLLM. Do you have any way or idea to address this issue for AutoAWQ? |
Could you please provide the branch / code to reproduce? vLLM uses separate processes for tensor parallelism while AutoAWQ and transformers use torch hooks for pipeline parallelism. An initial guess is that |
Hi @chu-tianxiang, I added an issue to track it. I attempted to put a device guard in place; it fixes the illegal memory access error, but the generated output then becomes garbage. See details in the issue below. |
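For reference, the device-guard idea being discussed amounts to pinning the active CUDA device to the input tensor's device before launching the kernel. A minimal Python-side sketch of that pattern (the kernel name here is a placeholder, not the actual binding from this PR):

```python
import torch

def launch_with_device_guard(hidden_states: torch.Tensor, *kernel_args):
    # Ensure the kernel launches on the device that owns the tensors, rather
    # than the default device 0 - a common cause of illegal memory access in
    # multi-GPU (pipeline-parallel) setups.
    with torch.cuda.device(hidden_states.device):
        return fused_moe_kernel(hidden_states, *kernel_args)  # placeholder name
```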
I built this branch and ran the tests; the tests that were added in this PR all seem to pass:

lroberts@GPU77B9:~/update-vllm-env/vllm-source/vllm$ python3.10 -m pytest tests/kernels/test_moe.py -k "test_fused_moe_gptq or test_fused_moe_awq"
======================================================================================= test session starts =======================================================================================
platform linux -- Python 3.10.12, pytest-7.4.4, pluggy-1.3.0
rootdir: /home/lroberts/update-vllm-env/vllm-source/vllm
plugins: asyncio-0.23.3, forked-1.6.0, anyio-3.7.1
asyncio: mode=strict
collected 1299 items / 291 deselected / 1008 selected
tests/kernels/test_moe.py ................................................................................................................................................................. [ 15%]
........................................................................................................................................................................................... [ 34%]
........................................................................................................................................................................................... [ 53%]
........................................................................................................................................................................................... [ 71%]
........................................................................................................................................................................................... [ 90%]
................................................................................................... [100%]
======================================================================================== warnings summary =========================================================================================
../../../../../usr/lib/python3/dist-packages/requests/__init__.py:87
/usr/lib/python3/dist-packages/requests/__init__.py:87: RequestsDependencyWarning: urllib3 (2.1.0) or chardet (5.2.0) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
../../../.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:121
/home/lroberts/.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:121: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
@validator("best_of")
../../../.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:140
/home/lroberts/.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:140: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
@validator("repetition_penalty")
../../../.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:146
/home/lroberts/.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:146: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
@validator("seed")
../../../.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:152
/home/lroberts/.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:152: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
@validator("temperature")
../../../.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:158
/home/lroberts/.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:158: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
@validator("top_k")
../../../.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:164
/home/lroberts/.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:164: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
@validator("top_p")
../../../.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:170
/home/lroberts/.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:170: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
@validator("truncate")
../../../.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:176
/home/lroberts/.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:176: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
@validator("typical_p")
../../../.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:204
/home/lroberts/.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:204: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
@validator("inputs")
../../../.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:210
/home/lroberts/.local/lib/python3.10/site-packages/huggingface_hub/inference/_text_generation.py:210: PydanticDeprecatedSince20: Pydantic V1 style `@validator` validators are deprecated. You should migrate to Pydantic V2 style `@field_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
@validator("stream")
../../../.local/lib/python3.10/site-packages/cupy/_environment.py:404
/home/lroberts/.local/lib/python3.10/site-packages/cupy/_environment.py:404: UserWarning:
nccl library could not be loaded.
Reason: ImportError (libnccl.so.2: cannot open shared object file: No such file or directory)
You can install the library by:
$ python -m cupyx.tools.install_library --library nccl --cuda 12.x
warnings.warn(msg)
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================================================================= 1008 passed, 291 deselected, 12 warnings in 19.40s ========================================================================
EDIT: some details on environment |
@lroberts7 it seems your tests are failing for reasons unrelated to this PR. I think you may have an environment issue or some problem with the GPUs. |
The PR previously broke the Mixtral unit test and I pushed a fix for it, but I'm still seeing |
Hi everyone, thank you for the active development on this PR. We would really like to include this in the next release. However, we identified a few issues: (1) the code makes some significant changes to the existing MoE implementation that need to be carefully reviewed, (2) there are some merge conflicts, and (3) the main "code owners" who are familiar with the code path for recent MoE changes, @pcmoritz and @WoosukKwon, are short on bandwidth. Therefore, we would like to push this to the next release, v0.4.1, which is targeted for around mid-April. |
Thanks for all the attention. I fixed the conflicts and added quantization support for the Qwen2Moe model. Tested with
Btw, yapf and isort seem to have conflicting format rules; I'm not sure how that could be handled. |
@chu-tianxiang Thanks for the great PR. I have one major piece of feedback. This PR effectively supports two cases: (1) the quantization method has a fused MoE kernel, and (2) it does not.
Supporting both of these cases adds significant complexity to the implementation, since we now have a big if statement in each of the core methods in the model definition:

```python
if not isinstance(self.linear_method, UnquantizedLinearMethod) and not self.linear_method.quant_config.support_fused_moe():
    ...  # case 2 --> there is not a fused kernel
else:
    ...  # case 1 --> there is a fused kernel
```

This impacts each of the core methods in the model definitions.
Since we now have kernels for GPTQ and AWQ, which are by far the most popular quantization methods, I think it makes sense to remove support for case 2 and simply fail if the user tries to run a quantization method that does not support fused_moe execution. This will dramatically simplify the code and make it much easier to (a) maintain and (b) add new MoE models in the future. Neural Magic is already working on a fused MoE version of Marlin as well, so it will really just be SqueezeLLM that lacks a fused kernel. I think this is a completely worthwhile tradeoff. |
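A rough sketch of the fail-fast path suggested above, reusing the `support_fused_moe()` check quoted earlier (the error type and message are illustrative, not taken from the PR):

```python
# Illustrative only: keep a single fused-MoE code path and fail explicitly for
# quantization methods that do not provide a fused kernel.
if (not isinstance(self.linear_method, UnquantizedLinearMethod)
        and not self.linear_method.quant_config.support_fused_moe()):
    raise NotImplementedError(
        "This quantization method does not provide a fused MoE kernel; "
        "only methods with fused support (e.g. GPTQ, AWQ) can run MoE models.")
# ... fused-MoE path continues here ...
```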
@robertgshaw2-neuralmagic Thanks for the suggestion; the current logic does increase the code complexity of the MoE models quite a bit. Inspired by your analysis, I think the root cause of the complexity is that fused MoE uses tensor parallelism while the unfused path uses expert parallelism. Maybe we can change the unfused MoE implementation from expert parallelism back to the original tensor-parallel approach; if that works out, we can have simple code and full quantization support at the same time. |
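To make the distinction concrete, here is a toy sketch (not code from this PR) of the two sharding schemes: expert parallelism places whole experts on different ranks, while tensor parallelism gives every rank a shard of every expert, which is what lets the unfused path reuse the existing quantized linear layers.

```python
# Toy illustration: 8 experts, 2 tensor-parallel ranks.
num_experts, tp_size, rank = 8, 2, 0
intermediate_size = 14336

# Expert parallel: this rank owns a subset of experts in full.
my_experts = list(range(rank * num_experts // tp_size,
                        (rank + 1) * num_experts // tp_size))  # experts 0..3

# Tensor parallel: this rank owns a slice of *every* expert's intermediate dim.
shard = intermediate_size // tp_size
my_slice = slice(rank * shard, (rank + 1) * shard)  # columns 0..7167 of each expert
```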
@chu-tianxiang Are you okay if I make a proposal for a refactor to the logic? |
Sure, please feel free to do so. |
@chu-tianxiang LMK when you're ready for a re-review. The refactor I have been working on basically makes a shared
Since all that logic is duplicated across each model, I thought it made sense to abstract it into a new |
@robertgshaw2-neuralmagic Thanks, it is ready now. Following your suggestion, I removed the expert parallel part and it's much cleaner. Currently tested on Mixtral (fp16, awq, gptq-4bit, gptq-3bit), Deepseek (fp16, gptq-4bit) and Qwen-moe (fp16, gptq-4bit). |
Sweet - thanks @chu-tianxiang Will take a look later this week. |
Hey @chu-tianxiang - these changes are looking much better; the logic is much simpler and easier to parse. The final architectural change I would want to see to feel comfortable merging is to abstract the fused MoE layer logic from the model definitions into a new class in
The key problem with the proposal I laid out is that it makes the mapping of the vLLM state dict to the HF state dict more difficult, so we will need to handle this in each model's load_weights.

FusedMoELinear proposal:

```python
class FusedMoELinear(torch.nn.Module):
    def __init__(shapes, linear_method):
        # gate_up_proj
        self.ws = linear_method.create_moe_weights(shapes)
        set_weight_attrs(self.ws, {
            "weight_loader": self.weight_loader_merged_column,
        })

        # down_proj
        self.w2s = linear_method.create_moe_weights(shapes)
        set_weight_attrs(self.w2s, {
            "weight_loader": self.weight_loader_row_parallel,
        })
        # ...

    # weight loader for gate_up_proj
    def weight_loader_merged_column(param, loaded_weight, expert_id):
        # refactor to share with MergedColumnParallel? << make method static in MergedColumnParallel?
        pass

    # weight loader for down_proj
    def weight_loader_row_parallel(param, loaded_weight, expert_id):
        # refactor to share with MergedColumnParallel? << make method static in RowColumnLinear?
        pass

    def forward(hidden_states, router_logits):
        linear_method.apply_moe_weights(...)
```

Then, this layer would be part of Mixtral:

```python
class MixtralMoE(torch.nn.Module):
    def __init__():
        self.gate = ReplicatedLinear()
        # note: this breaks the disk state dict (model.layers.0.mlp.w1 --> model.layers.0.mlp.fused_moe.ws)
        self.fused_moe = FusedMoE()

    def forward(hidden):
        router_logits = gate(hidden)
        return self.fused_moe(hidden, router_logits)

    # handle complexity of state dict remapping here
    def load_weights():
        # model.layers.0.mlp.w1 --> model.layers.0.mlp.fused_moe.ws
        # model.layers.0.mlp.w2 --> model.layers.0.mlp.fused_moe.w2
        # model.layers.0.mlp.w3 --> model.layers.0.mlp.fused_moe.w3
        pass
```

WDYT? |
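A hypothetical sketch of the state-dict remapping that the `load_weights` comments above describe; the key patterns follow those comments, and the `ws`/`w2s` parameter names follow the FusedMoELinear sketch (none of this is code from the PR):

```python
# Route per-expert checkpoint keys to the fused parameters, so e.g.
# "model.layers.0.mlp.w1" and ".w3" load into "fused_moe.ws" (gate_up) and
# ".w2" loads into "fused_moe.w2s" (down projection).
def remap_moe_key(name: str) -> str:
    prefix, _, leaf = name.rpartition(".")
    if leaf in ("w1", "w3"):
        return f"{prefix}.fused_moe.ws"
    if leaf == "w2":
        return f"{prefix}.fused_moe.w2s"
    return name
```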
See my comments for the final requested architectural changes. Also, we need tests for this.
It would be nice to have some big-model tests and small-model tests (that can run on a single GPU).
The small-model tests should be possible for Deepseek and Qwen, as they have sizes that fit on a single GPU.
Hi, is this PR still active? Looking forward to it being merged. |
Sorry, I've been quite busy with personal life over the past month, which left me with little time to update. Additionally, when I attempted to update last month, I encountered some conflicts that were hard to resolve. Originally I created the MoE weights by adding an axis to every weight in |
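A minimal sketch of the "extra axis" idea mentioned above, shown for plain fp16 weights; for the quantized paths the same leading num_experts dimension would be added to the packed tensors (qweight, scales, zeros) instead. Shapes here are Mixtral-like but assumed for illustration:

```python
import torch

num_experts, hidden_size, intermediate_size = 8, 4096, 14336

# Stack every expert's weights along a new leading expert axis so a single
# fused kernel can index the right expert per token block.
w13 = torch.empty(num_experts, 2 * intermediate_size, hidden_size, dtype=torch.float16)  # gate + up proj
w2 = torch.empty(num_experts, hidden_size, intermediate_size, dtype=torch.float16)       # down proj
```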
Does this PR support deepseek-v2 awq? |
@fengyang95 This was intended for v1 but should be extendable to v2. I hope @robertgshaw2-neuralmagic is able to pick this up at some point to get it through :) |
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you! |
This pull request has merge conflicts that must be resolved before it can be merged. |
Thanks to the very smart MoE align strategy introduced in #2453, each block only uses a single expert, making it much easier to adapt to quantized methods. This PR refactors the code to support quantized fused MoE and adds GPTQ group GEMM kernels based on exllamav2.
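As a rough illustration of why that alignment helps (a simplified sketch of the idea, not the kernel from #2453): token indices are grouped by their routed expert and each expert's segment is padded to the block size, so every thread block touches exactly one expert's (quantized) weights and only needs to dequantize that one expert.

```python
import torch

# Toy example: expert assignment for 6 (token, top-k) pairs over 3 experts.
topk_ids = torch.tensor([2, 0, 1, 0, 2, 1])
num_experts, block_size = 3, 4

# Group token indices by expert, then pad each expert's group to a multiple of
# block_size so no thread block ever mixes experts.
sorted_token_ids = torch.argsort(topk_ids)                      # tokens grouped by expert id
tokens_per_expert = torch.bincount(topk_ids, minlength=num_experts)
padded_per_expert = ((tokens_per_expert + block_size - 1) // block_size) * block_size
```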
tokens/s of Mixtral measured on an A100 using benchmark_latency.py with input_len=256 and output_len=1024.
Todo:
- via repacking