DeepSpeed inference support for int8 parameters on BLOOM? #330
Recently, HuggingFace `transformers` added a new feature: int8 quantization for all HuggingFace models. This feature can reduce the size of large models by up to 2× without a significant loss in performance. Is it possible for DeepSpeed inference to support int8 quantization for BLOOM? According to the DeepSpeed inference tutorial, DeepSpeed inference supports fp32, fp16, and int8 parameters. But when I tried BLOOM with the inference script and changed `dtype=torch.int8` on line 194, an error was raised. Any chance DeepSpeed inference could support int8 quantization for BLOOM?
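For reference, a minimal sketch of the kind of call the change above amounts to; the checkpoint name, `mp_size` value, and surrounding setup are illustrative, not the exact inference script:

```python
# Minimal sketch, not the actual inference script: load BLOOM and hand it
# to DeepSpeed-inference with the dtype switched from fp16 to int8.
import torch
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom",
                                             torch_dtype=torch.float16)

# This mirrors the single-line change described above: the tutorial lists
# int8 among the supported dtypes, but passing it for an fp16 checkpoint
# is what raises the error reported in this issue.
model = deepspeed.init_inference(
    model,
    mp_size=8,                        # tensor-parallel degree (illustrative)
    dtype=torch.int8,                 # was torch.float16 / torch.half
    replace_with_kernel_inject=True,  # use DeepSpeed's fused kernels
)
```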
Comments

@pai4451 #328 (comment)
As an alternative, you can use it in HuggingFace too. I haven't tried it either, though.
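For concreteness, a sketch of the `transformers` int8 path being referred to (the bitsandbytes LLM.int8() integration); untested in this thread, and it needs `bitsandbytes` and `accelerate` installed:

```python
# Hedged sketch of loading BLOOM in int8 through HuggingFace transformers.
# Requires: pip install bitsandbytes accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",
    device_map="auto",   # shard layers across the available GPUs
    load_in_8bit=True,   # int8 weights, roughly halving memory vs fp16
)

inputs = tokenizer("BLOOM is", return_tensors="pt").to(0)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```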
@mayank31398 I am running my server without Internet available, so I can’t use …
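A hedged aside on the no-Internet constraint: `transformers` can load entirely from disk if the checkpoint was copied over beforehand; the local path below is hypothetical:

```python
# Hypothetical offline-loading sketch: assumes the BLOOM checkpoint was
# already copied to the server at /data/bloom (path is made up).
import os
os.environ["TRANSFORMERS_OFFLINE"] = "1"  # set before importing transformers

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("/data/bloom")  # no network calls
```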
Quantization with int8 requires knowledge distillation and might need significant compute. |
Also, can you provide the ds config you use to run on 16 GPUs?
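The config asked about above isn't shown in this thread. As a rough sketch only: for DeepSpeed-inference the per-run settings usually live in the launcher command and the `init_inference()` call rather than a JSON ds config; the hostfile contents and script name below are assumptions:

```python
# Hedged sketch of a 16-GPU (2 nodes x 8 GPUs) launch, e.g.:
#
#   deepspeed --num_nodes 2 --num_gpus 8 --hostfile hostfile bloom-ds-inference.py
#
# with the script reading the world size the launcher exports:
import os

world_size = int(os.getenv("WORLD_SIZE", "1"))  # 16 when launched as above
print(f"tensor-parallel degree: {world_size}")  # passed as mp_size to init_inference
```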