Fix INT8-quantization for BLOOM, OPT, and Neo-X #2662
Closed
This PR addresses #2616 and #2379.
It also adds support for INT8 inference of the different model architectures, quantizing directly from the HF checkpoint. Here is an example using the DeepSpeedExamples inference test suite, running facebook/opt-30b on only one 32GB NVIDIA V100 card:
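The exact test-suite invocation is not reproduced here; below is a minimal sketch of this kind of run, assuming the standard `deepspeed.init_inference` API. The prompt and generation settings are illustrative, not taken from the test suite.

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-30b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Quantize to INT8 directly from the HF checkpoint while injecting
# DeepSpeed's optimized inference kernels.
engine = deepspeed.init_inference(
    model,
    mp_size=1,                       # single 32GB V100, no tensor parallelism
    dtype=torch.int8,
    replace_with_kernel_inject=True,
)

# Generate 50 tokens from an illustrative input prompt.
inputs = tokenizer("DeepSpeed is", return_tensors="pt").to("cuda")
outputs = engine.module.generate(inputs.input_ids, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```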
producing the following text:
Note that memory is very tight here; however, we can still generate 50 tokens from the input text!