
[Kernel] Support Microsoft Runtime Kernel Lib for our Low Precision Computation - BitBLAS #6036

Open · wants to merge 73 commits into base: main

Conversation

@LeiWang1999 commented Jul 1, 2024

Hi all, this PR introduces support for the Microsoft Runtime Kernel Library to enhance our low precision computation capabilities.

Brief Introduction of BitBLAS

BitBLAS is a library to support mixed-precision BLAS operations on GPUs, for example, the $W_{wdtype}A_{adtype}$ mixed-precision matrix multiplication where $C_{cdtype}[M, N] = A_{adtype}[M, K] \times W_{wdtype}[N, K]$.
BitBLAS aims to support efficient mixed-precision DNN model deployment, especially the $W_{wdtype}A_{adtype}$ quantization in large language models (LLMs), for example, the $W_{UINT4}A_{FP16}$ in GPTQ, the $W_{INT2}A_{FP16}$ in BitDistiller, the $W_{INT2}A_{INT8}$ in BitNet-b1.58.
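
For context, the shape of the BitBLAS interface that this PR builds on looks roughly like the following sketch (the M/N/K shapes and dtypes here are illustrative assumptions, not values taken from this PR):

```python
import bitblas

# Illustrative W_INT4 x A_FP16 GEMM; shapes and dtypes are placeholder assumptions.
config = bitblas.MatmulConfig(
    M=1,                    # rows of the activation A
    N=4096,                 # output features (rows of the weight W)
    K=4096,                 # reduction dimension
    A_dtype="float16",      # activation dtype
    W_dtype="int4",         # low-precision weight dtype
    accum_dtype="float16",  # accumulation dtype
    out_dtype="float16",    # output dtype
    layout="nt",            # A non-transposed, W transposed: C[M, N] = A[M, K] x W[N, K]^T
    with_bias=False,
)
matmul = bitblas.Matmul(config=config)  # builds the mixed-precision operator for this config
```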

PR Overview

This PR integrates BitBLAS into vLLM and adds examples of its usage. We provide two forms (a usage sketch follows the list):

  1. Load from GPTQ Checkpoints: This allows the loading of models from GPTQ format checkpoints.
  2. Load from GPTQ CKPT with BitBLAS Format: This enables the loading of models using the BitBLAS format for further optimized performance.
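
A minimal usage sketch from the vLLM side, assuming a hypothetical checkpoint name and assuming the quantization method strings mirror the backends added in this PR (gptq_bitblas for GPTQ checkpoints, bitblas for repacked BitBLAS checkpoints; the exact names are defined in the PR's quantization layer):

```python
from vllm import LLM

# 1. GPTQ checkpoint executed through the BitBLAS kernels.
#    "some-org/llama-2-7b-gptq" is a hypothetical model ID.
llm_from_gptq = LLM(model="some-org/llama-2-7b-gptq",
                    quantization="gptq_bitblas")

# 2. Checkpoint already repacked into the BitBLAS format.
llm_from_bitblas = LLM(model="some-org/llama-2-7b-bitblas",
                       quantization="bitblas")

print(llm_from_gptq.generate("Hello, BitBLAS!")[0].outputs[0].text)
```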

Below are the benchmarking results that we evaluated several months ago:

TODO ITEMS

  • Update and provide the latest benchmarking results.
  • 1.58Bits Model
  • Provide Benchmark/Test Scripts

Any feedback and suggestions to improve this integration are appreciated.

@robertgshaw2-redhat (Collaborator)

Nice!

@LeiWang1999 (Author) commented Jul 1, 2024

BTW, are there any tools available that can automatically resolve lint issues?

vllm/model_executor/layers/quantization/gptq_bitblas.py:28:1: E402 Module level import not at top of file
vllm/model_executor/layers/quantization/gptq_bitblas.py:28:8: F811 Redefinition of unused `bitblas` from line 21
vllm/model_executor/layers/quantization/gptq_bitblas.py:29:1: E402 Module level import not at top of file
vllm/model_executor/layers/quantization/gptq_bitblas.py:66:81: E501 Line too long (107 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:172:81: E501 Line too long (85 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:222:81: E501 Line too long (105 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:230:81: E501 Line too long (89 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:233:81: E501 Line too long (110 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:236:81: E501 Line too long (99 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:242:81: E501 Line too long (84 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:253:81: E501 Line too long (94 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:414:81: E501 Line too long (86 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:417:29: G004 Logging statement uses f-string
vllm/model_executor/layers/quantization/gptq_bitblas.py:420:17: G004 Logging statement uses f-string
vllm/model_executor/layers/quantization/gptq_bitblas.py:427:81: E501 Line too long (103 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:433:81: E501 Line too long (116 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:454:81: E501 Line too long (82 > 80)

@robertgshaw2-redhat (Collaborator)

BTW, are there any tools available that can automatically resolve lint issues?

./format.sh fixes whatever it can automatically, but not everything can be auto-fixed (especially line length).

@mgoin (Member) commented Jul 1, 2024

@LeiWang1999 thanks for the WIP, very cool interface with bitblas as a package. Can you explain if the GPTQ benchmarking results in vLLM were run with the base "gptq" kernels or using the "gptq_marlin" interface to take advantage of Marlin kernels? This will be important to compare with the current baseline we consider for GPTQ models in vLLM

@LeiWang1999 (Author)

@LeiWang1999 thanks for the WIP, very cool interface with bitblas as a package. Can you explain if the GPTQ benchmarking results in vLLM were run with the base "gptq" kernels or using the "gptq_marlin" interface to take advantage of Marlin kernels? This will be important to compare with the current baseline we consider for GPTQ models in vLLM

Thanks. Our benchmarks at that time used the exllamav2-based kernels; we will add a comparison against the Marlin kernel.
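
For reference, a minimal sketch of how the two GPTQ baselines are selected in vLLM for such a comparison (the checkpoint name is a hypothetical placeholder):

```python
from vllm import LLM

# Baseline 1: the base GPTQ path (exllamav2-based kernels at the time of this discussion).
llm_gptq = LLM(model="some-org/llama-2-7b-gptq", quantization="gptq")

# Baseline 2: the Marlin-accelerated GPTQ path.
llm_gptq_marlin = LLM(model="some-org/llama-2-7b-gptq", quantization="gptq_marlin")
```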

@LeiWang1999 (Author) commented Jul 19, 2024

Hi all, I recently updated support for the 1.58-bit model and the related BitBLAS inference kernel for vLLM.

Throughput in tokens per second (tok/s):

model                 framework                 BS16 IN32 OUT128   BS1 IN512 OUT1024   BS32 IN32 OUT128
bitnet-3b-1.58bits    pytorch                   106.83             49.34               209.03
bitnet-3b-1.58bits    pytorch-bitblas           240.33             103.09              493.31
bitnet-3b-1.58bits    vllm-bitblas              379.25             117.43              752.55
bitnet-3b-1.58bits    vllm-bitblas-cuda-graph   2543.58            1621.08             2731.79

@LeiWang1999 LeiWang1999 marked this pull request as ready for review July 19, 2024 04:23
@LeiWang1999 (Author)

We will benchmark against Marlin soon. It also looks like the docs build failed because of the bitblas dependency; do you have any ideas for fixing this? Should we add the bitblas requirement to doc/requirements, or is there an option to skip this dependency? @mgoin

@LeiWang1999 (Author)

@mgoin, @LucasWilkinson, thanks for your detailed and valuable review comments. The requested modifications and improvements have been applied; please take a look :)

@mgoin (Member) left a comment:

I think this essentially looks good to go with these last fixes @LeiWang1999

Resolved review threads on: docs/source/quantization/bitblas.rst (3 comments), vllm/model_executor/layers/quantization/kernels/bitblas.py (1 comment)

mergify bot commented Nov 19, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LeiWang1999.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot removed the needs-rebase label Dec 19, 2024
@LeiWang1999 (Author)

@mgoin apologies for the delayed response. In the latest update, we double-checked the correctness, optimized INT8 GEMM performance with DP4A on V100, and added support for high-performance GEMM on MI300 within BitBLAS. We’ve also updated recent benchmark results, which you can find at bitblas-benchmark.

Additionally, we released version 0.1.0 as part of this pull request. Let's work together to get this PR in and start planning the next one for BitNet :)

@mgoin mgoin changed the title [Kernel] Support Microsoft Runtime Kernel Lib for our Low Precision Computation [Kernel] Support Microsoft Runtime Kernel Lib for our Low Precision Computation - BitBLAS Jan 2, 2025

mergify bot commented Jan 2, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LeiWang1999.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 2, 2025
@mgoin (Member) left a comment:

Happy new year and thanks for your patience over the holidays! This looks good to me to land; just a few nits, plus the merge conflicts need to be resolved.

Review comment (Member):

You need to update your docs changes to work with the new .md format

Comment on lines +685 to +701
def find_flash_attn_supported_head_dims(self, head_dim: int) -> int:
    """
    Find the closest head dimension to the given head dimension that
    is supported by Flash Attention.
    """
    from vllm.attention.backends.flash_attn import FlashAttentionBackend

    FLASHATTN_SUPPORTED_HEAD_DIMS = (
        FlashAttentionBackend.get_supported_head_sizes())

    for supported_head_dim in FLASHATTN_SUPPORTED_HEAD_DIMS:
        if head_dim <= supported_head_dim:
            return supported_head_dim
    raise ValueError(
        f"Head dimension {head_dim} is not supported by Flash Attention."
        f"Supported head dimensions are {FLASHATTN_SUPPORTED_HEAD_DIMS}.")

Review comment (Member):

This seems like an unrelated and unused change? There are a few other changes in this file as well, but I could understand if this is just the formatter switching up

BITBLAS_SUPPORTED_SYM = [False, True]


# For binary size and compile time, we don't support the same types for with and
Review comment (Member):

This comment doesn't look finished

Comment on lines +129 to +138
A_dtype,
W_dtype,
out_dtype,
accum_dtype,
layout,
with_bias,
group_size,
with_scaling,
with_zeros,
zeros_mode,
Review comment (Member):

Maybe you could make all these configs more readable by putting the shared args in a list and unpacking them, e.g. (1, 16384, 16384, *shared_args) where shared_args = [A_dtype, W_dtype, out_dtype, accum_dtype, layout, with_bias, group_size, with_scaling, with_zeros, zeros_mode]. A quick sketch follows.
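
A quick sketch of that suggestion; the dtype and layout values below are placeholder assumptions standing in for the benchmark's real configuration:

```python
# Placeholder values; the real ones come from the benchmark configuration in this PR.
A_dtype, W_dtype, out_dtype, accum_dtype = "float16", "int4", "float16", "float16"
layout, with_bias, group_size = "nt", False, -1
with_scaling, with_zeros, zeros_mode = False, False, None

shared_args = [A_dtype, W_dtype, out_dtype, accum_dtype, layout,
               with_bias, group_size, with_scaling, with_zeros, zeros_mode]

# Each benchmark entry then only spells out the (M, N, K) shape that varies.
configs = [
    (1, 16384, 16384, *shared_args),
    (1, 8192, 8192, *shared_args),   # second shape is a placeholder example
]
```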

Labels: documentation, needs-rebase
4 participants