[Dev][TL] Integrate TL Dequant Implementation into BitBLAS OPs #214

LeiWang1999 · 2024-10-06T15:20:36Z

This pull request makes significant changes to the BitBLAS library: it adds new schedulers, improves the matrix multiplication operations, and updates tests and dependencies. The most important changes are grouped by theme below.

New Schedulers and Enhancements:

Added support for float16 and int8 target data types in get_lop3_intrin_group function in bitblas/gpu/intrin/lop3.py. ([bitblas/gpu/intrin/lop3.pyR1680-R1690](https://github.com/microsoft/BitBLAS/pull/214/files#diff-15fd74b90c3b956e9864e35778f26b27f6c9a7cfae35037967f420fda9a0bbe5R1680-R1690))
Introduced a new scheduler for weight dequantization in bitblas/ops/general_matmul/__init__.py and updated the _select_scheduler method to return this new scheduler. ([[1]](https://github.com/microsoft/BitBLAS/pull/214/files#diff-74fe5dd2824cb03a0fb2b0a913a2fc5caeb9c08e5368c318cd32b3af7e6f52edR15), [[2]](https://github.com/microsoft/BitBLAS/pull/214/files#diff-74fe5dd2824cb03a0fb2b0a913a2fc5caeb9c08e5368c318cd32b3af7e6f52edL594-R614))
Added a new MatmulDequantizeScheduler and related functions in bitblas/ops/general_matmul/tilelang/dequantize/__init__.py. ([bitblas/ops/general_matmul/tilelang/dequantize/__init__.pyR3-R102](https://github.com/microsoft/BitBLAS/pull/214/files#diff-422bb6fd30915da2280e418fe97aab5bdf246b548321577b128cd1652bd68ec2R3-R102))
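The scheduler selection described above can be pictured with a minimal sketch. This is not the code from this PR: `MatmulConfigSketch`, `needs_dequantize`, and `select_scheduler_name` are hypothetical names, and the rule "dequantize whenever the weight dtype differs from the activation dtype" is an assumption made for illustration. Only the scheduler names `MatmulScheduler` and `MatmulDequantizeScheduler` come from the PR description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatmulConfigSketch:
    """Hypothetical stand-in for an operator config; field names are assumptions."""
    A_dtype: str = "float16"  # activation dtype
    W_dtype: str = "float16"  # storage dtype of the weights


def needs_dequantize(config: MatmulConfigSketch) -> bool:
    # Assumed rule: weights stored in a different (usually narrower) type
    # than the activations must be decoded before the multiply.
    return config.W_dtype != config.A_dtype


def select_scheduler_name(config: MatmulConfigSketch) -> str:
    # Return only the *name* of the scheduler, so the sketch does not
    # depend on BitBLAS being installed.
    if needs_dequantize(config):
        return "MatmulDequantizeScheduler"
    return "MatmulScheduler"
```

For example, a config with `W_dtype="int4"` and `A_dtype="float16"` would select the dequantize path in this sketch, while a pure float16 config would stay on the dense path.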

Code Simplification and Refactoring:

Refactored the main function in bitblas/ops/general_matmul/tilelang/dense/matmul_tensorcore.py for better readability and maintainability. ([[1]](https://github.com/microsoft/BitBLAS/pull/214/files#diff-eacc57f40c9e810b3503e297685ed5eb8372c922201e18a4be1c1c2e20c93353L515-R522), [[2]](https://github.com/microsoft/BitBLAS/pull/214/files#diff-eacc57f40c9e810b3503e297685ed5eb8372c922201e18a4be1c1c2e20c93353L652-R662), [[3]](https://github.com/microsoft/BitBLAS/pull/214/files#diff-eacc57f40c9e810b3503e297685ed5eb8372c922201e18a4be1c1c2e20c93353L686-R700), [[4]](https://github.com/microsoft/BitBLAS/pull/214/files#diff-eacc57f40c9e810b3503e297685ed5eb8372c922201e18a4be1c1c2e20c93353L867-R884), [[5]](https://github.com/microsoft/BitBLAS/pull/214/files#diff-eacc57f40c9e810b3503e297685ed5eb8372c922201e18a4be1c1c2e20c93353L979-R999), [[6]](https://github.com/microsoft/BitBLAS/pull/214/files#diff-eacc57f40c9e810b3503e297685ed5eb8372c922201e18a4be1c1c2e20c93353L1009-R1033))
Updated the __repr__ method in bitblas/ops/general_matmul/tilelang/dense/matmul_tensorcore.py to include warp_M and warp_N details. ([bitblas/ops/general_matmul/tilelang/dense/matmul_tensorcore.pyR320-R321](https://github.com/microsoft/BitBLAS/pull/214/files#diff-eacc57f40c9e810b3503e297685ed5eb8372c922201e18a4be1c1c2e20c93353R320-R321))

Testing Enhancements:

Added new test functions matmul_torch_forward and matmul_torch_forward_dequant to testing/python/operators/test_general_matmul_ops_backend_tl.py for validating matrix multiplication operations with and without dequantization. ([[1]](https://github.com/microsoft/BitBLAS/pull/214/files#diff-cb0d29b36116f888480a1c2a5cee67a69ad3d5434522ce51fb92731242adc2cfR78-R227), [[2]](https://github.com/microsoft/BitBLAS/pull/214/files#diff-cb0d29b36116f888480a1c2a5cee67a69ad3d5434522ce51fb92731242adc2cfR248-R259))
Included MatmulDequantizeScheduler in the imports of testing/python/operators/test_general_matmul_tilelang_kernel.py. ([testing/python/operators/test_general_matmul_tilelang_kernel.pyR13-R15](https://github.com/microsoft/BitBLAS/pull/214/files#diff-052c67c47659338f3612e7c47384c54f88929e9b32c40d9e797b3f5307ff3896R13-R15))
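To make the idea of validating a matmul "with dequantization" concrete, here is a small pure-Python reference, offered as a sketch only. It is not the body of `matmul_torch_forward_dequant`. The packing order (two unsigned 4-bit values per byte, low nibble first), the N x K weight layout, and the helper names `pack_int4`, `unpack_int4`, `matmul_nt`, and `matmul_dequant_reference` are all assumptions; the kernels in BitBLAS may use a different, interleaved layout.

```python
def pack_int4(values):
    """Pack unsigned 4-bit values (0..15) two per byte, low nibble first."""
    assert len(values) % 2 == 0, "need an even number of 4-bit values"
    return [(values[i] & 0xF) | ((values[i + 1] & 0xF) << 4)
            for i in range(0, len(values), 2)]


def unpack_int4(packed):
    """Inverse of pack_int4: recover the 4-bit values from each byte."""
    out = []
    for byte in packed:
        out.append(byte & 0xF)         # low nibble was stored first
        out.append((byte >> 4) & 0xF)  # high nibble second
    return out


def matmul_nt(a, w):
    """C[m][n] = sum_k A[m][k] * W[n][k], with weights stored as N x K."""
    return [[sum(x * y for x, y in zip(row, w_row)) for w_row in w]
            for row in a]


def matmul_dequant_reference(a, packed_w, scale=1.0, zero=0):
    """Decode the packed weights, then run the plain matmul as a reference."""
    w = [[(q - zero) * scale for q in unpack_int4(row)] for row in packed_w]
    return matmul_nt(a, w)
```

A test in this style would pack known weights, run the kernel under test on the packed data, and compare its output against `matmul_dequant_reference` within a tolerance.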

Dependency Updates:

Updated yapf version in requirements-dev.txt and requirements-test.txt to 0.40.2. ([[1]](https://github.com/microsoft/BitBLAS/pull/214/files#diff-2b4945591edfeaa4cf4d3f155e66d4b43d1bda7a55d881d5cf3107f1b05abbbcL2-R2), [[2]](https://github.com/microsoft/BitBLAS/pull/214/files#diff-685da804fbcac569d75387e475e57d1de687a54c6c41b3aa4057694cfb5abc4bL2-R2))

Miscellaneous:

Added interleave_weight function to the imports in bitblas/quantization/__init__.py. ([bitblas/quantization/__init__.pyL12-R16](https://github.com/microsoft/BitBLAS/pull/214/files#diff-aeb54b540a85cbc63bdf9e661a713906e58ddb8c69f56d090bd811e2ba9b4b97L12-R16))

The select_scheduler function in the dense/__init__.py module has been refactored to use a fine-grained interface. This change provides more flexibility and enables the implementation of high-performance kernels.

Update MatmulScheduler class in matmul_tensorcore.py

The MatmulScheduler class in the matmul_tensorcore.py module has been updated to calculate the number of threads based on the block size and warp size. This ensures optimal GPU warp configuration for NVIDIA GPUs.

Improve test_general_matmul_tilelang_kernel.py

The test_general_matmul_tilelang_kernel.py module has been improved to include additional test cases and assertions for correctness.
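The thread-count calculation mentioned for MatmulScheduler can be sketched as follows. This is an illustration under stated assumptions, not the code in matmul_tensorcore.py: the parameter names `block_row_warps` and `block_col_warps` and the helper names are hypothetical, and the formula "warps per block times warp size" is one common way to derive the launch size. The value 32 is the warp size on current NVIDIA GPUs.

```python
WARP_SIZE = 32  # threads per warp on current NVIDIA GPUs


def threads_per_block(block_row_warps, block_col_warps, warp_size=WARP_SIZE):
    """Threads to launch when a block is tiled by a grid of warps."""
    if block_row_warps <= 0 or block_col_warps <= 0:
        raise ValueError("warp counts must be positive")
    return block_row_warps * block_col_warps * warp_size


def block_tile(block_row_warps, block_col_warps, warp_M, warp_N):
    """Output tile (block_M, block_N) covered by one thread block,
    assuming each warp owns a warp_M x warp_N sub-tile."""
    return block_row_warps * warp_M, block_col_warps * warp_N
```

With a 2 x 2 grid of warps and a 32 x 32 tile per warp, this sketch gives 128 threads per block and a 64 x 64 block tile. Deriving the thread count from the warp layout, rather than hard-coding it, keeps the launch configuration consistent when the tile sizes are tuned.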

…inetuning

…ps_dynamic

…_tilelang_kernel.py to use centered random values for input tensors

…ps_dynamic

…t tensors

…alled

…ps_dynamic

…eduler

LeiWang1999 · 2024-10-06T15:24:47Z

Although TL gives us more flexibility when writing kernels, it is hard to implement all dequant kernels within a simple template (as we do with our TIR-schedule-based template).

LeiWang1999 added 30 commits September 28, 2024 07:43

Refactor tilelang dequantize module and add matmul_blocked_weight_only function

f3b1eb9

remove un-implemented code.

730d13e

Implement BaseScheduler to wrap some related items.

8047ee7

lint fix

64db065

test skip

cef04a8

Refactor tilelang dequantize module and add matmul_blocked_weight_only function

f1652e9

Merge branch 'main' of https://github.com/microsoft/BitBLAS into tl_ops_dynamic

4f6c545

test fix

c485b68

hardware tuning demo

ebe42a6

Merge branch 'main' of https://github.com/microsoft/BitBLAS into tl_ops_dynamic

88230ec

remove debug related items.

44246a1

implement tuner and cache fix

bb51e15

Merge branch 'main' of https://github.com/microsoft/BitBLAS into tl_ops_dynamic

f42a3b9

lint fix

de7ae18

test case fix.

ef40bd8

Adapt Tuning Space generation with Roller

85f0a5f

Merge branch 'main' of https://github.com/microsoft/BitBLAS into tl_ops_dynamic

e9f7db3

lint fix

9e31336

Refactor select_scheduler function for fine-grained interface

f1378d4

Refactor NotImplementedError message in BaseTLHint class

137cce3

Update submodule reference in 3rdparty/tvm

fc19fa2

Refactor matmul_finetune function to use topk=20 for hardware-aware finetuning

fe51bb1

Refactor submodule reference in 3rdparty/tvm

79878cb

lint fix

0fc7ab9

Refactor test_general_matmul_tilelang_impl.py and test_tilelang_gemm.py

255e925

Refactor MatmulConfig to enable weight propagation on supported devices

df47f63

Merge branch 'main' of https://github.com/microsoft/BitBLAS into tl_ops_dynamic

826255d

Refactor test_general_matmul_tilelang_impl.py and test_general_matmul_tilelang_kernel.py to use centered random values for input tensors

48dc94e

test fix

82f39d7

LeiWang1999 added 14 commits October 2, 2024 18:27

Merge branch 'main' of https://github.com/microsoft/BitBLAS into tl_ops_dynamic

02ef258

test fix

e753ef2

Refactor flash attention tests to use centered random values for input tensors

f6dd744

Refactor flash attention tests to use centered random values for input tensors

7417372

Refactor flash attention tests to skip test if flash_attn is not installed

145a850

lint fix

3384458

test fix

82f50ea

test fix

d2ed936

test fix

6c56273

Merge branch 'main' of https://github.com/microsoft/BitBLAS into tl_ops_dynamic

2e59e58

Refactor quantization module imports

074b9ca

lint fix

0923344

Update yapf version in requirements-dev.txt and requirements-test.txt

b30bcd4

Refactor shared memory to global memory storage in MatmulFineGrainScheduler

d0a88ac

LeiWang1999 added 5 commits October 6, 2024 16:32

test fix

62303e2

format

01dc3f9

test fix

c621664

Refactor tensorcore policy to use list comprehension for readability

f934635

lint fix

754cf75

This was referenced Oct 7, 2024

[Feature Request] Enhance Simplification to remove unused function arguments #215

Closed

[Feature Request] Enhance Simplification to remove unused function arguments TileLang/tvm#1

Closed

LeiWang1999 merged commit a6d627c into microsoft:main Oct 7, 2024
5 of 6 checks passed
