Mixtral fused MoE: Fix multi-GPU #341

Closed
casper-hansen opened this issue Feb 15, 2024 · 6 comments

@casper-hansen
Owner

casper-hansen commented Feb 15, 2024

Currently, multi-GPU inference is not supported with the fused MoE modules because it triggers an illegal memory access error. I believe the error comes from moe_alig_block_size.

Kernels installed from: https://github.com/casper-hansen/AutoAWQ_kernels

Attempted solutions

Example to reproduce:

  1. Modify to allow fused MoE with multi-GPU: https://github.com/casper-hansen/AutoAWQ/blob/main/awq/models/mixtral.py#L130
  2. Run python examples/generate.py (with quant_path = "casperhansen/mixtral-instruct-awq"); a minimal loading sketch is included below.

Solutions tried:

  • Adding an OptionalCUDAGuard for every torch tensor that goes into moe_alig_block_size.
    • Result: the output produced is garbage, plus a segmentation fault at the end of generation.

    const at::cuda::OptionalCUDAGuard device_guard_topk_ids(device_of(topk_ids));
    const at::cuda::OptionalCUDAGuard device_guard_sorted(device_of(sorted_token_ids));
    const at::cuda::OptionalCUDAGuard device_guard_experts(device_of(experts_ids));
    const at::cuda::OptionalCUDAGuard device_guard_num_tokens(device_of(num_tokens_post_pad));
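
For reference, a minimal sketch of the multi-GPU run in step 2. This is illustrative only: it assumes from_quantized forwards device_map to accelerate so the layers get split across both GPUs, and the exact loading arguments may differ from examples/generate.py.

    # Sketch of loading the quantized Mixtral with fused layers across 2 GPUs.
    # Assumption: from_quantized accepts device_map and hands it to accelerate.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    quant_path = "casperhansen/mixtral-instruct-awq"

    model = AutoAWQForCausalLM.from_quantized(
        quant_path,
        fuse_layers=True,      # enables the fused MoE modules under test
        device_map="auto",     # split the model across both GPUs
    )
    tokenizer = AutoTokenizer.from_pretrained(quant_path)

    prompt = "[INST] Explain mixture-of-experts routing. [/INST]"
    tokens = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
    output = model.generate(tokens, max_new_tokens=64)
    print(tokenizer.decode(output[0], skip_special_tokens=True))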
@chu-tianxiang

It's weird. I added exactly the above device guard code to AutoAWQ_kernels and modified the multi-GPU part. Running python examples/generate.py with 2 GPUs yields coherent responses, and I checked with Nsight Systems that the fused kernels are called and the layers are executed correctly on the two GPUs.

@casper-hansen
Owner Author

casper-hansen commented Feb 16, 2024

It's weird. I added exactly the above device guard code to AutoAWQ_kernels and modified the multi-GPU part. Running python examples/generate.py with 2 GPUs yields coherent responses, and I checked with Nsight Systems that the fused kernels are called and the layers are executed correctly on the two GPUs.

Okay, that is interesting then! That would suggest one of the GPUs I rented had an issue. Let me try again and thanks for testing it out!

EDIT: Did you also modify the code here to allow fusing with the new modules? https://github.com/casper-hansen/AutoAWQ/blob/main/awq/models/mixtral.py#L130

@casper-hansen
Owner Author

Ok, I tested it on 2x 4090! Seems fixed. Thanks for making the suggestion and going through with testing it.

@casper-hansen
Owner Author

The previous issue is now fixed on the main branch and published in the new AutoAWQ-kernels package on PyPI. However, it seems the Triton kernel fails in the same way. When the previous layer was executed on cuda:0 and the next layer is on cuda:1 (with every tensor already on cuda:1), it throws an error. This is only triggered with a context length >= 1024, as that is the threshold at which we dequantize and use the FP16 kernels.

  File "/workspace/AutoAWQ/awq/modules/fused/moe.py", line 35, in forward
    final_hidden_states = apply_moe_weights(
  File "/workspace/AutoAWQ/awq/modules/fused/moe.py", line 63, in apply_moe_weights
    return fused_moe(x, dequant_w1, dequant_w2, gating_output, topk, renormalize)
  File "/workspace/AutoAWQ/awq/modules/fused/moe.py", line 431, in fused_moe
    invoke_fused_moe_kernel(
  File "/workspace/AutoAWQ/awq/modules/fused/moe.py", line 296, in invoke_fused_moe_kernel
    fused_moe_kernel[grid](
  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 550, in run
    bin.c_wrapper(
ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)

@chu-tianxiang

I think the Triton kernel also needs a GPU device context added, similar to this.
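
For illustration, here is a self-contained toy example of that pattern: the same with torch.cuda.device(...) context would wrap the fused_moe_kernel[grid](...) launch in awq/modules/fused/moe.py. The add_one kernel below is just a stand-in, not code from the repo.

    # Toy stand-in showing the device-context fix for a Triton launch when the
    # tensors live on a GPU that is not the current device (e.g. cuda:1 while
    # the current device is cuda:0).
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_one_kernel(x_ptr, n, BLOCK: tl.constexpr):
        pid = tl.program_id(0)
        offs = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        tl.store(x_ptr + offs, x + 1, mask=mask)

    def add_one(x: torch.Tensor) -> torch.Tensor:
        n = x.numel()
        grid = (triton.cdiv(n, 1024),)
        # Switch the current CUDA device to match the tensor before launching;
        # without this, launching from a process whose current device is a
        # different GPU can fail with "Pointer argument (at 0) cannot be
        # accessed from Triton (cpu tensor?)".
        with torch.cuda.device(x.device):
            add_one_kernel[grid](x, n, BLOCK=1024)
        return x

    if torch.cuda.device_count() >= 2:
        x = torch.zeros(4096, device="cuda:1")
        print(add_one(x).sum().item())  # 4096.0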

@casper-hansen
Owner Author

I think the Triton kernel also needs a GPU device context added, similar to this.

You are right, this fixed it. After more careful benchmarking with different problem sizes, I found that dequantizing the large stacked weights leads to increased memory usage without any speed improvement in prefilling. Thus, I am removing it and simplifying the forward pass.
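
For context, a back-of-the-envelope estimate of that memory cost, using the public Mixtral-8x7B config (hidden_size 4096, intermediate_size 14336, 8 experts per MoE layer). This is illustrative arithmetic, not a measurement from the benchmarks above.

    # Rough per-layer estimate of the extra memory needed to materialize the
    # dequantized FP16 copy of the stacked expert weights (w1/gate, w3/up and
    # w2/down per expert), compared to the packed 4-bit weights.
    hidden_size = 4096
    intermediate_size = 14336
    num_experts = 8

    params_per_expert = 3 * hidden_size * intermediate_size
    params_per_layer = num_experts * params_per_expert

    fp16_bytes = 2 * params_per_layer        # dequantized FP16 copy
    int4_bytes = params_per_layer // 2       # packed 4-bit weights (ignoring scales/zeros)

    print(f"FP16 copy per MoE layer: {fp16_bytes / 2**30:.2f} GiB")   # ~2.62 GiB
    print(f"Packed 4-bit per layer:  {int4_bytes / 2**30:.2f} GiB")   # ~0.66 GiB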

Thanks for all your hard work and guidance @chu-tianxiang. I will attempt to make the best of it in AutoAWQ and get the fused MoE modules into transformers as well.
