Mixtral fused MoE: Fix multi-GPU #341

Closed
casper-hansen opened this issue Feb 15, 2024 · 6 comments

@casper-hansen
Owner

casper-hansen commented Feb 15, 2024

Currently, multi-GPU inference is not supported with the fused MoE modules because it triggers an illegal memory access error. I believe the error comes from moe_alig_block_size.

Kernels installed from: https://github.com/casper-hansen/AutoAWQ_kernels

Attempted solutions

Example to reproduce:

  1. Modify to allow fused MoE with multi-GPU: https://github.com/casper-hansen/AutoAWQ/blob/main/awq/models/mixtral.py#L130
  2. Run python examples/generate.py (with quant_path = "casperhansen/mixtral-instruct-awq"); a minimal loading sketch is included below.

Solutions tried:

  • Adding an OptionalCUDAGuard for every torch tensor that goes into moe_alig_block_size.
    • Result: the output produced is garbage, plus a segmentation fault at the end of generation.

    const at::cuda::OptionalCUDAGuard device_guard_topk_ids(device_of(topk_ids));
    const at::cuda::OptionalCUDAGuard device_guard_sorted(device_of(sorted_token_ids));
    const at::cuda::OptionalCUDAGuard device_guard_experts(device_of(experts_ids));
    const at::cuda::OptionalCUDAGuard device_guard_num_tokens(device_of(num_tokens_post_pad));
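
For reference, a minimal sketch of the multi-GPU run in step 2. This is illustrative only: it assumes from_quantized forwards device_map to accelerate so the layers get split across both GPUs, and the exact loading arguments may differ from examples/generate.py.

    # Sketch of loading the quantized Mixtral with fused layers across 2 GPUs.
    # Assumption: from_quantized accepts device_map and hands it to accelerate.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    quant_path = "casperhansen/mixtral-instruct-awq"

    model = AutoAWQForCausalLM.from_quantized(
        quant_path,
        fuse_layers=True,      # enables the fused MoE modules under test
        device_map="auto",     # split the model across both GPUs
    )
    tokenizer = AutoTokenizer.from_pretrained(quant_path)

    prompt = "[INST] Explain mixture-of-experts routing. [/INST]"
    tokens = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
    output = model.generate(tokens, max_new_tokens=64)
    print(tokenizer.decode(output[0], skip_special_tokens=True))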
@chu-tianxiang

It's weird. I added exactly the above device guard code to AutoAWQ_kernels and modified the multi-GPU part. Running python examples/generate.py with 2 GPUs yields coherent responses, and I checked with Nsight Systems that the fused kernels are called and the layers are executed correctly on the two GPUs.

@casper-hansen
Owner Author

casper-hansen commented Feb 16, 2024

It's weird. I added exactly the above device guard code to AutoAWQ_kernels and modified the multi-GPU part. Running python examples/generate.py with 2 GPUs yields coherent responses, and I checked with Nsight Systems that the fused kernels are called and the layers are executed correctly on the two GPUs.

Okay, that is interesting then! That would suggest one of the GPUs I rented had an issue. Let me try again and thanks for testing it out!

EDIT: Did you also modify the code here to allow fusing with the new modules? https://github.com/casper-hansen/AutoAWQ/blob/main/awq/models/mixtral.py#L130

@casper-hansen
Owner Author

Ok, I tested it on 2x 4090! Seems fixed. Thanks for making the suggestion and going through with testing it.

@casper-hansen
Owner Author

The previous issue is now fixed on the main branch and published in the new AutoAWQ-kernels package on PyPI. However, it seems the Triton kernel fails in the same way. When the previous layer was executed on cuda:0 and the next layer is on cuda:1 (with every tensor already on cuda:1), it throws an error. This is only triggered with a context length >= 1024, as that is the threshold at which we dequantize and use the FP16 kernels.

  File "/workspace/AutoAWQ/awq/modules/fused/moe.py", line 35, in forward
    final_hidden_states = apply_moe_weights(
  File "/workspace/AutoAWQ/awq/modules/fused/moe.py", line 63, in apply_moe_weights
    return fused_moe(x, dequant_w1, dequant_w2, gating_output, topk, renormalize)
  File "/workspace/AutoAWQ/awq/modules/fused/moe.py", line 431, in fused_moe
    invoke_fused_moe_kernel(
  File "/workspace/AutoAWQ/awq/modules/fused/moe.py", line 296, in invoke_fused_moe_kernel
    fused_moe_kernel[grid](
  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 550, in run
    bin.c_wrapper(
ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)

@chu-tianxiang

I think the Triton kernel also needs a GPU device context added, similar to this.
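
For illustration, here is a self-contained toy example of that pattern: the same with torch.cuda.device(...) context would wrap the fused_moe_kernel[grid](...) launch in awq/modules/fused/moe.py. The add_one kernel below is just a stand-in, not code from the repo.

    # Toy stand-in showing the device-context fix for a Triton launch when the
    # tensors live on a GPU that is not the current device (e.g. cuda:1 while
    # the current device is cuda:0).
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_one_kernel(x_ptr, n, BLOCK: tl.constexpr):
        pid = tl.program_id(0)
        offs = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        tl.store(x_ptr + offs, x + 1, mask=mask)

    def add_one(x: torch.Tensor) -> torch.Tensor:
        n = x.numel()
        grid = (triton.cdiv(n, 1024),)
        # Switch the current CUDA device to match the tensor before launching;
        # without this, launching from a process whose current device is a
        # different GPU can fail with "Pointer argument (at 0) cannot be
        # accessed from Triton (cpu tensor?)".
        with torch.cuda.device(x.device):
            add_one_kernel[grid](x, n, BLOCK=1024)
        return x

    if torch.cuda.device_count() >= 2:
        x = torch.zeros(4096, device="cuda:1")
        print(add_one(x).sum().item())  # 4096.0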

@casper-hansen
Owner Author

I think the Triton kernel also needs a GPU device context added, similar to this.

You are right, this fixed it. After more careful benchmarking with different problem sizes, I found that dequantizing the large stacked weights leads to increased memory usage without any speed improvement in prefilling. Thus, I am removing it and simplifying the forward pass.
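
For context, a back-of-the-envelope estimate of that memory cost, using the public Mixtral-8x7B config (hidden_size 4096, intermediate_size 14336, 8 experts per MoE layer). This is illustrative arithmetic, not a measurement from the benchmarks above.

    # Rough per-layer estimate of the extra memory needed to materialize the
    # dequantized FP16 copy of the stacked expert weights (w1/gate, w3/up and
    # w2/down per expert), compared to the packed 4-bit weights.
    hidden_size = 4096
    intermediate_size = 14336
    num_experts = 8

    params_per_expert = 3 * hidden_size * intermediate_size
    params_per_layer = num_experts * params_per_expert

    fp16_bytes = 2 * params_per_layer        # dequantized FP16 copy
    int4_bytes = params_per_layer // 2       # packed 4-bit weights (ignoring scales/zeros)

    print(f"FP16 copy per MoE layer: {fp16_bytes / 2**30:.2f} GiB")   # ~2.62 GiB
    print(f"Packed 4-bit per layer:  {int4_bytes / 2**30:.2f} GiB")   # ~0.66 GiB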

Thanks for all your hard work and guidance @chu-tianxiang. I will attempt to make the best of it in AutoAWQ and get the fused MoE modules into transformers as well.
