
Fix INT8-quantization for BLOOM, OPT, and Neo-X #2662

Closed
wants to merge 9 commits

Conversation

@RezaYazdaniAminabadi (Contributor) commented Jan 2, 2023

This PR addresses #2616 and #2379

This PR also adds support for INT8 inference of different model architectures, quantizing directly from the HF checkpoint. Here is an example using the DeepSpeedExamples inference test-suite, running facebook/opt-30b on a single 32GB NVIDIA V100 card:

deepspeed --num_nodes 1 --num_gpus 1 inference-test.py --ds_inference --use_kernel --name facebook/opt-30b --use_meta_tensor --checkpoint_path ~/.cache/huggingface/hub/models--facebook--opt-30b/snapshots/463007d7da4e87fe962909a027811a8c0b32ede8/ --dtype int8

producing the following text:

------------------------------------------------------
Free memory : 0.238525 (GigaBytes)  
Total memory: 31.748535 (GigaBytes)  
Requested memory: 0.140137 (GigaBytes) 
Setting maximum total tokens (input + output) to 82 
------------------------------------------------------
generation time is 10.450812101364136 sec

in=DeepSpeed is a machine learning framework
out=DeepSpeed is a machine learning framework for large-scale, complex data

DeepSpeed is a machine learning framework specifically designed to solve some of the most complex and large-scale problems. The goal of DeepSpeed is to provide a rich infrastructure on top of which researchers can build highly
------------------------------------------------------------
[2023-01-04 11:23:05,806] [INFO] [launch.py:350:main] Process 33466 exits successfully.

Note that memory is very tight here; even so, we can still generate 50 tokens from the input text!
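For context, the core idea behind quantizing from the HF checkpoint is per-channel symmetric INT8 weight quantization. The sketch below is illustrative only (assumed behavior, not this PR's actual CUDA kernels or DeepSpeed's internal API): each weight row is scaled so its maximum magnitude maps to 127, stored as INT8, and dequantized by multiplying back by the per-row scale.

```python
# Illustrative sketch of per-channel symmetric INT8 weight quantization.
# NOT DeepSpeed's kernel code; a minimal NumPy model of the technique.
import numpy as np

def quantize_int8(weight: np.ndarray):
    """Quantize a 2-D weight matrix to INT8 with one scale per output row."""
    # Symmetric range: scale each row so its max magnitude maps to 127.
    scales = np.abs(weight).max(axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero
    q = np.clip(np.round(weight / scales), -128, 127).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    # Recover an approximate float matrix from the INT8 values and scales.
    return q.astype(np.float32) * scales

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()
print(f"max abs quantization error: {err:.4f}")
```

The rounding error per element is bounded by half a quantization step (0.5 × the row scale), which is why INT8 inference can track FP16 quality closely for well-conditioned weight matrices.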

@RezaYazdaniAminabadi RezaYazdaniAminabadi marked this pull request as ready for review January 3, 2023 23:06
@jeffra (Collaborator) commented Feb 10, 2023

#2725 replaces this PR

jeffra closed this on Feb 10, 2023