diff --git a/README.md b/README.md
index 2cb81502136e..7914482fe59f 100644
--- a/README.md
+++ b/README.md
@@ -9,7 +9,7 @@ Some of the key features of BitBLAS include:
   - $W_{wdtype}A_{adtype}$ mixed-precision matrix multiplication including FP16xINT4/2/1, INT8xINT4/2/1, etc. Please checkout [support matrix](#support-matrix) for detailed data types support.
   - Matrix multiplication like FP16xFP16 and INT8xINT8.
   - Auto-Tensorization for TensorCore-like hardware instructions.
-  - Implemented [integration](./integration/) to [PyTorch](https://pytorch.org/), [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) and [vLLM](https://github.com/vllm-project/vllm) for LLM deployment. Please checkout [benchmark summary](#benchmark-summary) for detailed end2end LLM inference performance.
+  - Implemented [integration](/integration/) to [PyTorch](https://pytorch.org/), [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) and [vLLM](https://github.com/vllm-project/vllm) for LLM deployment. Please checkout [benchmark summary](#benchmark-summary) for detailed end2end LLM inference performance.
   - BitBLAS first implemented $W_{INT2}A_{INT8}$ GEMV/GEMM in [BitNet-b1.58](https://arxiv.org/abs/2402.17764) with 8x/2x speedup over cuBLAS $W_{FP16}A_{FP16}$ on A100, please checkout [op_benchmark_a100_int2_scaling](images/figures/op_benchmark_a100_int2_scaling.png) for detailed benchmark results.
   - Support customizing mixed-precision DNN operations for your specific scenarios via the flexible DSL (TIR Script).
@@ -68,16 +68,16 @@ We are continuously expanding the support matrix. If you have any specific requi
 
 ## Getting Started
 
-- [Installation](./docs/Installation.md):
-  To install BitBLAS, please checkout the document [installation](./docs/Installation.md). Also Make sure you already have the cuda toolkit (version >= 11) installed in the system. Or you can easily install from `pip install bitblas` in the root directory.
+- [Installation](/docs/Installation.md):
+  To install BitBLAS, please checkout the document [installation](/docs/Installation.md). Also Make sure you already have the cuda toolkit (version >= 11) installed in the system. Or you can easily install from `pip install bitblas` in the root directory.
 
-- [QuickStart](./docs/QuickStart.md): BitBLAS provides two Python APIs to perform mixed-precision matrix multiplication:
+- [QuickStart](/docs/QuickStart.md): BitBLAS provides two Python APIs to perform mixed-precision matrix multiplication:
   - ```bitblas.Matmul``` implements the $W_{wdtype}A_{adtype}$ mixed-precision matrix multiplication of $C_{cdtype}[M, N] = A_{adtype}[M, K] \times W_{wdtype}[N, K]$.
   - ```bitblas.Linear``` is a PyTorch ```nn.Linear```-like module to support a Linear of mixed-precision.
 
-- [Integration](./integration/): Explore how BitBLAS seamlessly integrates with LLM deployment frameworks through our examples. Discover the ease of integrating BitBLAS with PyTorch, AutoGPTQ, and vLLM in the 3rd-party integration examples.
+- [Integration](/integration/): Explore how BitBLAS seamlessly integrates with LLM deployment frameworks through our examples. Discover the ease of integrating BitBLAS with PyTorch, AutoGPTQ, and vLLM in the 3rd-party integration examples.
 
-- [Customization](./docs/ExtendOperatorsWithDSL.md): BitBLAS supports implementing customized mixed-precision DNN operations rather than matrix multiplication with the flexible DSL (TIR Script).
+- [Customization](/docs/ExtendOperatorsWithDSL.md): BitBLAS supports implementing customized mixed-precision DNN operations rather than matrix multiplication with the flexible DSL (TIR Script).
 
 ## Contributing
 
diff --git a/benchmark/README.md b/benchmark/README.md
index dc86280a780e..d02596ffb10b 100644
--- a/benchmark/README.md
+++ b/benchmark/README.md
@@ -80,9 +80,13 @@ The benchmark configurations for each test scenario are detailed below:
 
 ## Benchmark Images
 
-INT8xINT1 Matmul BS Scaling on A100.
+BitNet-b1.58 INT8xINT2 Matmul BS Scaling on A100.
 
-![int8xint1_scaling](../images/figures/op_benchmark_a100_int1_scaling.png)
+![int8xint2_scaling](../images/figures/op_benchmark_a100_int2_scaling.png)
+
+INT8xUINT1 Matmul BS Scaling on A100.
+
+![int8xuint1_scaling](../images/figures/op_benchmark_a100_uint1_scaling.png)
 
 3090 Related benchmark numbers
 
diff --git a/docs/PythonAPI.md b/docs/PythonAPI.md
index e2925a147be8..518ec6baf09f 100644
--- a/docs/PythonAPI.md
+++ b/docs/PythonAPI.md
@@ -16,8 +16,12 @@
 - **K** *(int)*: The common dimension of matrices A and W.
 - **A_dtype** *(str, default='float16')*: The data type of matrix A.
   - Choices: `'float16'`, `'int8'`.
-- **W_dtype** *(str, default='float16')*: The data type of matrix W. Also acts as a wrapper for source_format and bit.
-  - Choices: `'float16'`, `'int8'`, `'int4'`, `'int2'`, `'int1'`, `'fp4_e2m1'`, `'nf4'`.
+- **W_dtype** *(str, optional)*: Data type of the weights. Default: `'float16'`.
+  - Choices: `'float16'`, `'int8'`, `'int4'`, `'int2'`, `'int1'`, `'uint4'`, `'uint2'`, `'uint1'`, `'fp4_e2m1'`, `'nf4'`.
+  - The range of the INT formats:
+    - `'int4'`: [-8, 7]
+    - `'int2'`: [-2, 1]
+    - `'int1'`: [-1, 1]
 - **accum_dtype** *(str, default='float16')*: The data type used for accumulation during the matrix multiplication.
   - Choices: `'float16'`, `'int32'`.
 - **out_dtype** *(str, default='float16')*: The data type of the output matrix.
@@ -25,8 +29,6 @@
 - **layout** *(Literal['nn', 'nt', 'tn', 'tt'], default='nt')*: The layout of the matrix multiplication operation. The matrix is stored in row-major.
   - `'nn'`: Both matrices are non-transposed.
   - `'nt'`: Matrix A is non-transposed, and matrix W is transposed.
-  - `'tn'`: Matrix A is transposed, and matrix W is non-transposed.
-  - `'tt'`: Both matrices are transposed.
 - **with_bias** *(bool, default=False)*: Indicates whether a bias vector is added to the output.
 - **group_size** *(int, default=-1)*: The group size for quantization, -1 indicates no grouping.
 - **with_scaling** *(bool, default=False)*: Indicates whether scaling is applied during quantization.
@@ -90,7 +92,11 @@ Applies a linear transformation to the incoming data: $out[M, N] = A[M, K] \time
 - **A_dtype** *(str, optional)*: Data type of the input tensor. Default: `'float16'`.
   - Choices: `'float16'`, `'int8'`.
 - **W_dtype** *(str, optional)*: Data type of the weights. Default: `'float16'`.
-  - Choices: `'float16'`, `'int8'`, `'int4'`, `'int2'`, `'int1'`, `'fp4_e2m1'`, `'af4'`.
+  - Choices: `'float16'`, `'int8'`, `'int4'`, `'int2'`, `'int1'`, `'uint4'`, `'uint2'`, `'uint1'`, `'fp4_e2m1'`, `'nf4'`.
+  - The range of the INT formats:
+    - `'int4'`: [-8, 7]
+    - `'int2'`: [-2, 1]
+    - `'int1'`: [-1, 1]
 - **accum_dtype** *(str, optional)*: Data type for accumulation. Default: `'float16'`.
   - Choices: `'float16'`, `'int32'`.
 - **out_dtype** *(str, optional)*: Data type of the output tensor. Default: `'float16'`.
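The `W_dtype` additions above are easier to read next to a usage sketch. The snippet below is a minimal, illustrative example of an INT8xINT2 `bitblas.Matmul` built from the parameters documented in this hunk; it assumes the `bitblas.MatmulConfig` wrapper and the `transform_weight` packing helper from the QuickStart, so treat those names and exact signatures as assumptions rather than part of this change.

```python
import bitblas
import torch

# Hypothetical INT8 x INT2 configuration using the parameter names documented above.
config = bitblas.MatmulConfig(
    M=1,                  # GEMV-shaped workload
    N=1024,
    K=1024,
    A_dtype="int8",       # activations
    W_dtype="int2",       # signed 2-bit weights, documented range [-2, 1]
    accum_dtype="int32",  # accumulate in int32 for int8 inputs
    out_dtype="int32",
    layout="nt",          # A non-transposed, W transposed
    with_bias=False,
)
matmul = bitblas.Matmul(config=config)

# Random weights in the int2 range; transform_weight (assumed helper) packs them
# into the compressed storage the kernel expects.
A = torch.randint(-127, 128, (1, 1024), dtype=torch.int8, device="cuda")
W = torch.randint(-2, 2, (1024, 1024), dtype=torch.int8, device="cuda")
W_packed = matmul.transform_weight(W)
C = matmul(A, W_packed)
```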
diff --git a/images/figures/op_benchmark_a100_int2_scaling.png b/images/figures/op_benchmark_a100_int2_scaling.png
new file mode 100644
index 000000000000..b0c4dd58e629
Binary files /dev/null and b/images/figures/op_benchmark_a100_int2_scaling.png differ
diff --git a/images/figures/op_benchmark_a100_uint1_scaling.png b/images/figures/op_benchmark_a100_uint1_scaling.png
new file mode 100644
index 000000000000..4899c289ab57
Binary files /dev/null and b/images/figures/op_benchmark_a100_uint1_scaling.png differ
diff --git a/python/bitblas/gpu/intrin/lop3.py b/python/bitblas/gpu/intrin/lop3.py
index 43c91a6454a9..70819362ad08 100644
--- a/python/bitblas/gpu/intrin/lop3.py
+++ b/python/bitblas/gpu/intrin/lop3.py
@@ -633,14 +633,14 @@
     static constexpr uint immLut = (0xf0 & 0xcc) | 0xaa; // 0b11101010
     static constexpr uint BOTTOM_MASK = 0x03030303;      // 0xf -> 0b11 select 0,3
     static constexpr uint I8s_MAGIC_NUM = 0x00000000;    // 1024
-    static constexpr uint MEDIAN_NUM = 0x01010101;
+    static constexpr uint MEDIAN_NUM = 0x02020202;
 #pragma unroll
     for (int i = 0; i < (N / 4); i++)
     {
         asm volatile("lop3.b32 %0, %1, %2, %3, %4;\\n"
                      : "=r"(i8s[i])
                      : "r"(i2b >> (2 * i)), "n"(BOTTOM_MASK), "n"(I8s_MAGIC_NUM), "n"(immLut));
-        i8s[i] = __vsubss4(i8s[i], MEDIAN_NUM);
+        i8s[i] = __vsub4(i8s[i], MEDIAN_NUM);
     }
 }
 template
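For context on the two changed constants: the 2-bit fields extracted via `lop3` take the unsigned values 0..3, and subtracting 2 from each byte maps them onto the signed `'int2'` range [-2, 1] documented in PythonAPI.md, which is consistent with `MEDIAN_NUM` becoming `0x02020202` and the subtraction switching to the plain byte-wise `__vsub4`. The host-side sketch below only emulates that value mapping (code minus 2); the packing order and per-byte layout are simplifications for illustration, not the exact interleaving the lop3 path produces.

```python
import numpy as np

def pack_int2(vals):
    """Pack signed int2 values (range [-2, 1]) as unsigned 2-bit codes, 4 per byte.
    The little-endian-within-byte order here is an assumption for illustration."""
    codes = (np.asarray(vals, dtype=np.int8) + 2).astype(np.uint8)  # map [-2, 1] -> 0..3
    packed = np.zeros(len(codes) // 4, dtype=np.uint8)
    for i, c in enumerate(codes):
        packed[i // 4] |= c << (2 * (i % 4))
    return packed

def decode_int2_to_int8(packed):
    """Emulate the decode step: mask out each 2-bit field, then subtract the median (2)."""
    out = []
    for byte in packed:
        for j in range(4):
            field = (int(byte) >> (2 * j)) & 0b11  # BOTTOM_MASK selects two bits
            out.append(field - 2)                  # byte-wise subtraction, as __vsub4 does
    return np.array(out, dtype=np.int8)

vals = np.array([-2, -1, 0, 1, 1, 0, -1, -2], dtype=np.int8)
assert np.array_equal(decode_int2_to_int8(pack_int2(vals)), vals)
```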