FP6 quantization end-to-end. #5234

loadams · 2024-03-06T21:41:38Z

The user interface: microsoft/DeepSpeed-MII#433
The nv-a6000 CI run against the MII branch linked above is [here](https://github.com/microsoft/DeepSpeed/actions/runs/8192124606).

Co-authored-by: Zhen Zheng [email protected]
Co-authored-by: Shiyang Chen [email protected]
Co-authored-by: Arash Bakhtiari [email protected]
Co-authored-by: Haojun Xia [email protected]

* Initialize the fp6-quant-kernel integration.
* Add necessary parameters of kernel interfaces and the linear layer selection logic.
* upload kernel code
* The simple script for debugging.
* fix typo
* update
* fix split k
* Fix some errors and add test case.
* Workspace for Inference Kernels (#1)
* Add transform_param functions and update format.
* kernel debug
* fix include
* Update core_ops.cpp
* Add split k support
* fix
* Fix kernel error
* update
* update
* Fix rebase errors.
* Add missed include.
* Fix the bug that the attribute uses the weight information for mem alloc.
* Avoid GPU preallocation during weight loading.
* Add support of larger shapes for gated activation kernel.
* update
* model update
* fix all weight preprocessing
* Add split-k heuristic.
* Avoid reading scale attribute on non-quantized tensors.
* Change the scales from attributes to new tensors. Provide the end-to-end script given HuggingFace model id.
* Hard-coded commented out the scales in the kernel to workaround the bug.
* Support the user config for quantization. Fix kernel bug.
* Per operator test functions.
* Multiply scales by 1e12 according to the kernel design.
* Revert "Workspace for Inference Kernels (#1)". This reverts commit 1528732.
* Remove the format-only changes.
* Put the quantization into the transform_param function.

---------

Co-authored-by: Shiyang Chen <[email protected]>
Co-authored-by: Haojun Xia <[email protected]>

loadams · 2024-03-07T18:07:13Z

A6000 CI that points to the required branch for MII is here

The user interface: microsoft/DeepSpeed-MII#433
nv-a6000 ci running against the MII branch linked above is [here](https://github.com/microsoft/DeepSpeed/actions/runs/8192124606)

Co-authored-by: Zhen Zheng [email protected]
Co-authored-by: Shiyang Chen [email protected]
Co-authored-by: Arash Bakhtiari [email protected]
Co-authored-by: Haojun Xia [email protected]

---------

Co-authored-by: ZHENG, Zhen <[email protected]>
Co-authored-by: Shiyang Chen <[email protected]>
Co-authored-by: Haojun Xia <[email protected]>
Co-authored-by: Arash Bakhtiari <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>

JamesTheZ and others added 30 commits March 5, 2024 00:56

Update CUDA kernels and clean codes.

91bb4d7

Make the quantizer on GPU.

1c2131d

[WIP] Fix the bug of FP16-to-FP6 data packing.

1ba45fd

Add FP6 end-to-end unit tests

ff6c3c3

Refine the FP16-to-FP6 cast logic.

368a763

Add unit tests for FP6 quantizer

6c45a84

Fix FP16-FP6 cast problems.

90b710d

Update FP6 kernels.

f8e3acf

Fix the bug of subnormal FP6 casting and the 2bit/4bit tensor allocation.

b025c5a

Clean code.

6ed67f7

pre-commit

20b543c

Deal with the subnormal FP6 and FP16 values and refine the UT.

c43947a

Update according to review comments.

a6d2f2f

Fix the CI workflow problem for FP6 end-to-end.

62a2d49

Fix at::nullopt and at::optional conflicts.

118af37

Refine split-k setting.

56eb8b9

Remove debug files.

0ddbfd1

Only compile the kernel body for SM >= 8.0.

35c82f2

Fix the GPU architecture requirement of FP6 kernel.

63489d1

Update deepspeed/inference/v2/config_v2.py

ed00ac9

Update deepspeed/inference/v2/config_v2.py

b15a1a1

refactor fp6 tests, fix import error

c2e6ebb

Update deepspeed/inference/v2/modules/implementations/linear/quantized_linear.py

fb8887c

Update requirements.txt

77f3883

revert testing to fix A6000 test

f6bcdee

Update pydantic version

e1a4ce0

fix pydantic import

e86611f

Fix some review comments.

7e28144

Pin pydantic to latest version

f8454a0

Add the missed torch import.

bed775e
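
Several of the commits above deal with FP16-to-FP6 casting and subnormal values. This thread does not state which FP6 bit layout the kernels use, so the following is only a rough sketch under an assumed layout: 1 sign bit, 3 exponent bits, 2 mantissa bits, exponent bias 3, and no inf/NaN encodings. The function names are hypothetical and are not DeepSpeed APIs. The rounding is a simple nearest-value lookup that saturates at the largest magnitude, and it may not match the kernel's tie-breaking behavior.

```python
def fp6_e3m2_values():
    """Enumerate the non-negative values of an ASSUMED E3M2 layout (bias 3)."""
    bias = 3
    values = []
    for exponent in range(8):       # 3 exponent bits
        for mantissa in range(4):   # 2 mantissa bits
            if exponent == 0:
                # Subnormal: no implicit leading 1, fixed exponent of (1 - bias).
                values.append((mantissa / 4) * 2.0 ** (1 - bias))
            else:
                # Normal: implicit leading 1.
                values.append((1 + mantissa / 4) * 2.0 ** (exponent - bias))
    return values  # ascending order, 32 entries


def quantize_to_fp6(x):
    """Round x to the nearest value on the assumed grid, saturating at the max.

    Ties resolve toward the smaller magnitude here; a real kernel may differ.
    """
    grid = fp6_e3m2_values()
    magnitude = min(abs(x), grid[-1])
    nearest = min(grid, key=lambda v: abs(v - magnitude))
    return -nearest if x < 0 else nearest


if __name__ == "__main__":
    for sample in (0.07, 1.0, -3.1, 100.0):
        print(sample, "->", quantize_to_fp6(sample))
```

Under this assumed layout the smallest positive subnormal is 0.0625 and the largest magnitude is 28.0, which illustrates why subnormal handling and saturation both matter when casting from FP16.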

loadams requested review from mrwyattii, awan-10, arashb and tjruwase as code owners March 6, 2024 21:41

xiaoxiawu-microsoft enabled auto-merge March 6, 2024 22:06

arashb requested a review from xiaoxiawu-microsoft March 6, 2024 22:07

xiaoxiawu-microsoft approved these changes Mar 6, 2024

mrwyattii approved these changes Mar 6, 2024

arashb approved these changes Mar 6, 2024

loadams mentioned this pull request Mar 6, 2024

Add quantization config option microsoft/DeepSpeed-MII#433

Merged

loadams disabled auto-merge March 7, 2024 00:59

loadams added 2 commits March 6, 2024 16:59

Merge branch 'master' into features/rebase-quant-fp6

f34312a

Merge branch 'master' into features/rebase-quant-fp6

4a91788

xiaoxiawu-microsoft enabled auto-merge March 8, 2024 00:18

loadams disabled auto-merge March 8, 2024 00:44

loadams merged commit ccfdb84 into master Mar 8, 2024
16 of 17 checks passed

loadams added a commit that referenced this pull request Apr 4, 2024

Update file that was modified in #5234

91789b5
