Fix INT8-quantization for BLOOM, OPT, and Neo-X #2662
Closed
This PR addresses #2616 and #2379.
It also adds support for INT8 inference of the different model architectures, quantizing directly from the HF checkpoint. Here is an example using the DeepSpeedExamples inference test suite, running facebook/opt-30b on only one 32GB NVIDIA V100 card:
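The exact test-suite invocation is not reproduced here; below is a minimal sketch of this kind of run, assuming the standard `deepspeed.init_inference` API. The prompt and generation settings are illustrative, not taken from the test suite.

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-30b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Quantize to INT8 directly from the HF checkpoint while injecting
# DeepSpeed's optimized inference kernels.
engine = deepspeed.init_inference(
    model,
    mp_size=1,                       # single 32GB V100, no tensor parallelism
    dtype=torch.int8,
    replace_with_kernel_inject=True,
)

# Generate 50 tokens from an illustrative input prompt.
inputs = tokenizer("DeepSpeed is", return_tensors="pt").to("cuda")
outputs = engine.module.generate(inputs.input_ids, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```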
producing the following text:
Note that memory is very tight here; however, we can still generate 50 tokens from the input text!