[Kernel] Support Microsoft Runtime Kernel Lib for our Low Precision Computation - BitBLAS #6036
Conversation
Nice!
BTW, are there any tools available that can automatically resolve lint issues?
vllm/model_executor/layers/quantization/gptq_bitblas.py:28:1: E402 Module level import not at top of file
vllm/model_executor/layers/quantization/gptq_bitblas.py:28:8: F811 Redefinition of unused `bitblas` from line 21
vllm/model_executor/layers/quantization/gptq_bitblas.py:29:1: E402 Module level import not at top of file
vllm/model_executor/layers/quantization/gptq_bitblas.py:66:81: E501 Line too long (107 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:172:81: E501 Line too long (85 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:222:81: E501 Line too long (105 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:230:81: E501 Line too long (89 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:233:81: E501 Line too long (110 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:236:81: E501 Line too long (99 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:242:81: E501 Line too long (84 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:253:81: E501 Line too long (94 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:414:81: E501 Line too long (86 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:417:29: G004 Logging statement uses f-string
vllm/model_executor/layers/quantization/gptq_bitblas.py:420:17: G004 Logging statement uses f-string
vllm/model_executor/layers/quantization/gptq_bitblas.py:427:81: E501 Line too long (103 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:433:81: E501 Line too long (116 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:454:81: E501 Line too long (82 > 80)
@LeiWang1999 thanks for the WIP, very cool interface with bitblas as a package. Can you explain whether the GPTQ benchmarking results in vLLM were run with the base "gptq" kernels or with the "gptq_marlin" interface to take advantage of the Marlin kernels? This will be important for comparing against the current baseline we consider for GPTQ models in vLLM.
Thanks! The benchmarking at that time used the exllamav2 kernels; we will examine the comparison with the Marlin kernel.
Hi all, I recently updated the support for 1.58-bit models and the related BitBLAS inference kernel for vLLM.
We will soon benchmark against Marlin. It also looks like the docs build failed because of the bitblas dependency; do you have any ideas on how to fix this? Should we put the bitblas requirement into doc/requirements, or is there an option to skip this dependency? @mgoin
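For reference, one common way to keep an optional dependency like bitblas out of the docs build is to mock it in the Sphinx configuration. This is a minimal sketch, assuming the docs are built with Sphinx autodoc; it is not necessarily the fix adopted in this PR, and the conf.py path is illustrative:

```python
# docs/source/conf.py (illustrative path): mock heavyweight optional imports so
# Sphinx autodoc can import modules that do `import bitblas` without the
# package being installed in the docs build environment.
autodoc_mock_imports = [
    "bitblas",
]
```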
@mgoin, @LucasWilkinson, thanks for your detailed and valuable review comments; modifications and improvements have been applied, please take a look :)
I think this essentially looks good to go with these last fixes @LeiWang1999
This pull request has merge conflicts that must be resolved before it can be merged.
Co-authored-by: Michael Goin <[email protected]>
@mgoin apologies for the delayed response. In the latest update, we double-checked correctness, optimized INT8 GEMM performance with DP4A on V100, and added support for high-performance GEMM on MI300 within BitBLAS. We've also updated the recent benchmark results, which you can find at bitblas-benchmark. Additionally, we released version 0.1.0 as part of this pull request. Let's work together to get this pull request in and start planning the next PR for BitNet :)
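For context on the DP4A path mentioned above: DP4A is a CUDA instruction that multiplies four packed INT8 pairs and accumulates the result into an INT32. The sketch below only emulates that arithmetic in Python for illustration; it is not BitBLAS or vLLM code:

```python
import numpy as np

def dp4a(a_packed: np.ndarray, b_packed: np.ndarray, acc: int) -> int:
    """Emulate DP4A: dot product of four int8 lanes, accumulated into int32."""
    return int(acc + np.dot(a_packed.astype(np.int32), b_packed.astype(np.int32)))

a = np.array([1, -2, 3, 4], dtype=np.int8)
b = np.array([5, 6, -7, 8], dtype=np.int8)
print(dp4a(a, b, 0))  # 1*5 + (-2)*6 + 3*(-7) + 4*8 = 4
```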
This pull request has merge conflicts that must be resolved before it can be merged.
Happy new year and thanks for your patience over the holidays! This looks good to me to land; just a few nits, plus resolving your merge conflicts.
You need to update your docs changes to work with the new .md format.
def find_flash_attn_supported_head_dims(self, head_dim: int) -> int:
    """
    Find the closest head dimension to the given head dimension that
    is supported by Flash Attention.
    """
    from vllm.attention.backends.flash_attn import FlashAttentionBackend

    FLASHATTN_SUPPORTED_HEAD_DIMS = (
        FlashAttentionBackend.get_supported_head_sizes())

    for supported_head_dim in FLASHATTN_SUPPORTED_HEAD_DIMS:
        if head_dim <= supported_head_dim:
            return supported_head_dim
    raise ValueError(
        f"Head dimension {head_dim} is not supported by Flash Attention. "
        f"Supported head dimensions are {FLASHATTN_SUPPORTED_HEAD_DIMS}.")
This seems like an unrelated and unused change? There are a few other changes in this file as well, but I could understand if this is just the formatter switching up
BITBLAS_SUPPORTED_SYM = [False, True]


# For binary size and compile time, we don't support the same types for with and
This comment doesn't look finished
A_dtype,
W_dtype,
out_dtype,
accum_dtype,
layout,
with_bias,
group_size,
with_scaling,
with_zeros,
zeros_mode,
Maybe you could make all these configs more readable if you put all of these args in a list and unpacked them, such as (1, 16384, 16384, *shared_args), where shared_args = [A_dtype, W_dtype, out_dtype, accum_dtype, layout, with_bias, group_size, with_scaling, with_zeros, zeros_mode].
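A minimal sketch of that suggestion, with illustrative values (e.g. a UINT4-weight, FP16-activation case) rather than the exact configuration used in this PR's tests:

```python
# Gather the arguments shared by every test case into one list and splat it
# into each (M, N, K, ...) tuple. Values below are illustrative only.
shared_args = [
    "float16",  # A_dtype
    "uint4",    # W_dtype
    "float16",  # out_dtype
    "float16",  # accum_dtype
    "nt",       # layout
    False,      # with_bias
    128,        # group_size
    True,       # with_scaling
    False,      # with_zeros
    None,       # zeros_mode
]

test_configs = [
    (1, 16384, 16384, *shared_args),
    (16, 16384, 16384, *shared_args),
]
```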
Hi all, this PR introduces support for the Microsoft Runtime Kernel Library to enhance our low precision computation capabilities.
Brief Introduction of BitBLAS
BitBLAS is a library to support mixed-precision BLAS operations on GPUs, for example, the $W_{wdtype}A_{adtype}$ mixed-precision matrix multiplication where $C_{cdtype}[M, N] = A_{adtype}[M, K] \times W_{wdtype}[N, K]$. BitBLAS aims to support efficient mixed-precision DNN model deployment, especially the $W_{wdtype}A_{adtype}$ quantization in large language models (LLMs), for example, the $W_{UINT4}A_{FP16}$ in GPTQ, the $W_{INT2}A_{FP16}$ in BitDistiller, and the $W_{INT2}A_{INT8}$ in BitNet-b1.58.
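To illustrate the interface this PR builds on, the sketch below constructs a $W_{INT4}A_{FP16}$ operator through BitBLAS's MatmulConfig/Matmul API; the field names mirror those in the test snippet earlier in this thread, but the concrete shape and quantization settings are assumptions for the example, not code from this PR:

```python
import bitblas

# Build a mixed-precision GEMM operator: FP16 activations (A), INT4 weights (W),
# FP16 accumulation and output. Shapes and settings are illustrative only.
config = bitblas.MatmulConfig(
    M=1,                  # rows of the activation A
    N=16384,              # output features (rows of the stored weight W)
    K=16384,              # reduction dimension
    A_dtype="float16",
    W_dtype="int4",
    out_dtype="float16",
    accum_dtype="float16",
    layout="nt",          # A non-transposed, W stored transposed: C[M, N] = A[M, K] x W[N, K]^T
    with_bias=False,
    group_size=None,      # no grouped quantization in this minimal example
    with_scaling=False,
    with_zeros=False,
    zeros_mode=None,
)
matmul = bitblas.Matmul(config=config)
# The resulting operator is then called on a half-precision activation tensor
# and a weight that has been packed into BitBLAS's low-bit storage format.
```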
PR Overview
This PR integrates BitBLAS into vLLM by adding examples of its usage. We provide two forms:
Below are the benchmarking results that we evaluated several months ago:
TODO ITEMS
Any feedback and suggestions to improve this integration are appreciated.