[Question] How to preshard a model for tensor parallelism #2379
Hi @lanking520, for loading this presharded version:

import os
import torch
import deepspeed
from huggingface_hub import snapshot_download

model = "microsoft/bloom-deepspeed-inference-fp16"
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

repo_root = snapshot_download(model)
checkpoints_json = os.path.join(repo_root, "ds_inference_config.json")

model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    base_dir=repo_root,
    dtype=torch.half,
    checkpoint=checkpoints_json,
    replace_method="auto",
    replace_with_kernel_inject=True,
)

@RezaYazdaniAminabadi would be able to give more details about creating presharded versions for other models. |
@mrwyattii thanks for the reply, I know how to load the model. But I'm wondering more about how we can pre-shard the model. This also applies to other models like OPT. Loading them on CPU and sharding them takes a very long time. Maybe it could be done as a one-off: save the sharded model to disk and skip loading it on CPU the next time. I would appreciate any instructions you can share on saving the sharded model with DeepSpeed. |
Here is how I did it (if I correctly understand the issue): save the model in, say, 5-GB shards (this step will require 2x model-size CPU memory and then a bit more) and then use the resulting model like so:
|
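A minimal sketch of the size-based re-sharding presumably being described here, using the standard Hugging Face save_pretrained API (model name and output path are placeholders):

import torch
from transformers import AutoModelForCausalLM

# Load the full model on CPU once (per the comment above, this needs roughly
# 2x the model size in host RAM), then re-save it as ~5 GB shards so later
# loads can skip the monolithic checkpoint file.
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-7b1", torch_dtype=torch.float16
)
model.save_pretrained("/data/bloom-5gb-shards", max_shard_size="5GB")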
Nope, this only shards the model based on its size. It doesn't determine which part of the model goes to which GPU in DeepSpeed. Tensor parallelism means vertical sharding, where the shards are defined by the TP degree as well as by size. In fact, if you look at the INT8 BLOOM model, each GPU gets 4 vertical shards (TP4), distributed across 8 GPUs. This cannot be done without using DeepSpeed itself. |
@RezaYazdaniAminabadi I found your PR here really helpful: #2132 |
Hi @lanking520, Thanks for your interest in this part. |
@lanking520 can you try this again? #2547 should address your issue |
nice, will test them today |
@jeffra @RezaYazdaniAminabadi do you happen to have a code sample for the OPT model? |
@lanking520 Here is a small code sample for saving a sharded OPT model:

import os
import torch
import transformers
import deepspeed
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--save_ckpt", action="store_true")
parser.add_argument("--local_rank", type=int, default=0)
parser.add_argument("--world_size", type=int, default=int(os.getenv("WORLD_SIZE", 1)))
args = parser.parse_args()

model_name = "facebook/opt-1.3b"
inputs = ["DeepSpeed is the"]
ckpt_path = "/data/sharded-opt-model/"

inf_config = {
    "replace_with_kernel_inject": True,
    "dtype": torch.float16,
    "replace_method": "auto",
    "enable_cuda_graph": False,
    "tensor_parallel": {"tp_size": args.world_size},
}

config = transformers.AutoConfig.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

if args.save_ckpt:
    inf_config["save_mp_checkpoint_path"] = ckpt_path
    model = transformers.AutoModelForCausalLM.from_config(
        config, torch_dtype=torch.float16
    )
else:
    inf_config["checkpoint"] = os.path.join(ckpt_path, "ds_inference_config.json")
    with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
        model = transformers.AutoModelForCausalLM.from_config(
            config, torch_dtype=torch.float16
        )

model = deepspeed.init_inference(model, config=inf_config)

if not args.save_ckpt:
    tokens = tokenizer.batch_encode_plus(inputs, return_tensors="pt", padding=True)
    for t in tokens:
        if torch.is_tensor(tokens[t]):
            tokens[t] = tokens[t].to(f"cuda:{args.local_rank}")
    greedy_output = model.generate(**tokens)
    outputs = tokenizer.batch_decode(greedy_output, skip_special_tokens=True)
    if args.local_rank == 0:
        print(outputs)

To save the checkpoint, run the script with the --save_ckpt flag. Verify the sharded checkpoints were created:

venv ❯ ls /data/sharded-opt-model
ds_inference_config.json tp_00_00.pt tp_00_02.pt tp_00_04.pt tp_00_06.pt tp_01_00.pt tp_01_02.pt tp_01_04.pt tp_01_06.pt
non-tp.pt tp_00_01.pt tp_00_03.pt tp_00_05.pt tp_00_07.pt tp_01_01.pt tp_01_03.pt tp_01_05.pt tp_01_07.pt

Then load the sharded checkpoint and run a query by running the script again without --save_ckpt. |
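For reference, a presumable way to launch the script above (assuming it is saved as example.py; two GPUs match the tp_00_*/tp_01_* shards in the listing):

# Save the TP-sharded checkpoint (loads the real weights once, then writes shards)
$ deepspeed --num_gpus 2 example.py --save_ckpt

# Later runs: load the pre-sharded checkpoint and run the query
$ deepspeed --num_gpus 2 example.py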
@mrwyattii thanks for sharing. I am able to get this file, but it is not loadable at all when I try to get it back. I tried the BLOOM way but it doesn't work that way. |
This is the standard way of loading back a BLOOM model. It is not working for OPT. |
@lanking520 I've updated the code in my previous comment to include loading the model from the saved checkpoint. |
Also check out the example we have in the DeepSpeedExamples repo: #2547 has some information about how to run that script. |
When I used example.py, the model was saved fine, but there was a problem loading it back from ckpt_path.
(+)
With the input ["Deepspeed is the"], I get bad output. For the model to load normally, it must be given |
@slrsnpdla I'm able to reproduce the bad output you are seeing. This appears to only happen for some models. I've extended the unit tests for sharded checkpoints to include a correctness test in #2643. I'm seeing failures for gpt-neo and gpt-j models as well. |
Also reproducible on my end: I tested OPT, GPT-Neo, and GPT-J, and all of them are broken in this way. |
Hi @lanking520, I am working on resolving this issue. I will let you know once I have the solution tested completely. |
Hi @RezaYazdaniAminabadi or @jeffra do we have any weekly/monthly community meeting for DeepSpeed? I would like to attend if there is one. |
Hi @lanking520, I have verified several model architectures with this PR and this test suite. All work fine on my side. Could you please try this on your end and see if the issue is resolved?
Regarding your last question, I don't think there is any meeting currently set up, but I think this is a great idea. I'll let @jeffra or @tjruwase chime in here, and we might be able to set something up. Thanks, |
Will start testing this week. Thanks @RezaYazdaniAminabadi. |
Hi @mrwyattii, thank you for this example! I have two questions regarding your code.
My first question is: is there a way to load the model into RAM only once, instead of 4 times, to save RAM and still save pre-sharded checkpoints? My second question (might be a stupid one) is: when you run
|
When we took your PR and tested it with your test suite, we got the following error for both OPT 1.3B and GPT-J 6B:
|
@RezaYazdaniAminabadi @lekurile have you had any luck avoiding the above error? ^^ |
Hi @lanking520, I think the error that you're seeing may come from the flag |
@Wenhan-Tan this is why the meta tensor is here:
It is a must-have step in order to use DeepSpeed checkpoint loading. You need a placeholder in place of the full model. I think this still holds true with @RezaYazdaniAminabadi's commit. We need to pass the model body in so DeepSpeed can equip it with the full weights. For some reason, the checkpoint weights were not picked up by DeepSpeed. |
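The step being referred to is presumably the meta-device placeholder from the OPT example earlier in the thread; a self-contained restatement (model name and checkpoint path reuse the values from that example):

import os
import torch
import deepspeed
import transformers

model_name = "facebook/opt-1.3b"          # same model as the earlier example
ckpt_path = "/data/sharded-opt-model/"    # directory holding the saved TP shards
config = transformers.AutoConfig.from_pretrained(model_name)

inf_config = {
    "replace_with_kernel_inject": True,
    "dtype": torch.float16,
    "tensor_parallel": {"tp_size": int(os.getenv("WORLD_SIZE", "1"))},
    # Point DeepSpeed at the pre-sharded checkpoint description file:
    "checkpoint": os.path.join(ckpt_path, "ds_inference_config.json"),
}

# Build an empty placeholder on the "meta" device (no real weights allocated);
# init_inference then loads each TP shard directly onto its GPU.
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = transformers.AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)
model = deepspeed.init_inference(model, config=inf_config)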
@lanking520 You're right! I also ran the script, and the checkpoint weight loading worked on my machine. It makes me wonder whether the weight saving was successful on @sindhuvahinis's machine. |
@Wenhan-Tan @RezaYazdaniAminabadi I verified it again on a larger instance with 8 GPUs. Did you test the GPT-J 6B model? I was able to generate checkpoints, but loading back the generated checkpoints throws the following error.
To reproduce, I am using the DeepSpeed test suite:
|
Hi @sindhuvahinis , I didn't try GPTJ but it did work on GPTNeox for me |
@mrwyattii @slrsnpdla @lanking520, regarding the bad outputs: in the provided script, at the presharding step, there could perhaps be an issue with the use of from_config, which builds a freshly initialized (random-weight) model, so the shards that get saved never contain the pretrained weights. Please see if the following example, which uses from_pretrained at that step instead, makes sense.

$ diff -u example-original.py example-modified.py
--- example-original.py
+++ example-modified.py
@@ -26,8 +26,8 @@
 if args.save_ckpt:
     inf_config["save_mp_checkpoint_path"] = ckpt_path
-    model = transformers.AutoModelForCausalLM.from_config(
-        config, torch_dtype=torch.float16
+    model = transformers.AutoModelForCausalLM.from_pretrained(
+        pretrained_model_name_or_path=model_name, torch_dtype=torch.float16
     )
 else:
     inf_config["checkpoint"] = os.path.join(ckpt_path, "ds_inference_config.json")

Using the original script to do the sharding, and then verifying that the outputs are not correct:

$ deepspeed --num_gpus 1 example-original.py --save_ckpt
(..)
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.0009701251983642578 seconds
Saving tp-sharded checkpoints
[] [INFO] [launch.py:350:main] Process 1195148 exits successfully.
$ ls /data/sharded-opt-model/
ds_inference_config.json non-tp.pt tp_00_00.pt tp_00_01.pt tp_00_02.pt tp_00_03.pt tp_00_04.pt tp_00_05.pt tp_00_06.pt tp_00_07.pt
$ deepspeed --num_gpus 1 example-original.py
(..)
Requested memory: 0.375000 (GigaBytes)
Setting maximum total tokens (input + output) to 1024
------------------------------------------------------
['DeepSpeed is the grant grant ► grantrecentElsaElsa grantrecent Observer simplify IndigoElsaElsa Indigo']
[] [INFO] [launch.py:350:main] Process 1200105 exits successfully.

On my end the modified script produces the expected, reproducible results:

$ rm -f /data/sharded-opt-model/*
$ deepspeed --num_gpus 1 example-modified.py --save_ckpt
(..)
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.0009124279022216797 seconds
Saving tp-sharded checkpoints
[] [INFO] [launch.py:350:main] Process 1209907 exits successfully.
$ ls /data/sharded-opt-model/
ds_inference_config.json non-tp.pt tp_00_00.pt tp_00_01.pt tp_00_02.pt tp_00_03.pt tp_00_04.pt tp_00_05.pt tp_00_06.pt tp_00_07.pt
$ deepspeed --num_gpus 1 example-modified.py
(..)
Requested memory: 0.375000 (GigaBytes)
Setting maximum total tokens (input + output) to 1024
------------------------------------------------------
["DeepSpeed is the best.\nI've been using DeepSpeed for a while now. It"]
[] [INFO] [launch.py:350:main] Process 1211738 exits successfully.

DeepSpeed version: 0.8.0 |
I am trying to run this with GPT-NeoX 20B. I managed to save a sharded model with the example above on a single A6000 GPU, and inference looks correct. However, when I try to save the sharded model using 4 A4000s, the script takes a long time (it also needs a very large amount of CPU RAM, as discussed above). The script has been running for more than 2 hours and the GPUs are at 100% usage. Has anyone had the same experience with this? |
Hi @simoroma, for GPT-NeoX 20B it took me a little less than 2 hours to save the sharded model. Did you leave the script running and save the sharded model successfully? If not, it could be that you don't have enough RAM or GPU memory. |
Thanks @Wenhan-Tan, it ran for 4 hours and I stopped it. I have 4 A4000s and 200 GB of RAM. The code first correctly uses about 165 GB of RAM, then starts to use the GPUs at 100% with about 10 GB of VRAM each, but it never gets to saving the sharded model. If I run the same code on a single A6000 it works correctly, and I can also use it for inference on a single GPU. If I take the sharded model saved with a single GPU and move it to the pod with 4 GPUs, the model gets correctly split across the 4 GPUs at inference time, about 10 GB of VRAM each, but I never get inference results: the code gets stuck and the GPUs are at 100% utilization. |
Hi @simoroma , I never tried loading 1-GPU sharded model on 2 GPUs using DeepSpeed. I ran the same script on 2 A100-40GB GPUs and both saving and inference work for GPTNeox 20B. If your 4 A4000s have more GPU memory than your single A6000, then this is probably a bug. |
The sharded model produced on a single A6000 could be loaded correctly on 4 RTX 3090 GPUs, though the results were gibberish. I don't know why, but I had no issues saving and then loading the model with 4 RTX 3090s, while the script was getting stuck with 4 A4000s. |
Hi, thanks for the above suggestions. I managed to pre-shard the OPT model. Now I want to pre-shard a T5 model (like T0 or Flan-T5) but I failed. Here is my code:
And it reports these errors:
Could anyone help me solve this problem? |
I ran into the same problem, but I have no idea how to solve it. |
Hi, I followed this tutorial and succeeded. The code is:
|
the same |
Have you solved this problem? When I use DeepSpeed inference and run
@kevinuserdd Hi, what you're seeing is exactly how it's supposed to be. I looked all over the internet and this appears to be an unsolved problem: you have to have X times the model size in RAM in order to parallelize your model across X GPUs. If you have any ideas or find any solutions, please let me know. This would be incredibly helpful for people who do not have unlimited RAM. |
Hi all, as of DeepSpeed 0.8.3, the following models can be loaded from a Hugging Face checkpoint with low CPU memory:
We have a CI run nightly to verify this. The following models support native sharding in DeepSpeed:
We have tested all four of these and they work. If you need help, please feel free to send us an email and we can work with you on a sharding solution. |
But I have four A100 GPUs, and the GPU memory is sufficient. The BLOOM model I tested has the same parameter count as ChatGLM. When I use BLOOM for inference, the memory of the multiple GPUs is shared, but with ChatGLM it isn't. I suspect it's related to the model itself, and I can't parallelize the model. |
@kevinuserdd Hi, sorry, I read your response wrong. I only see higher RAM usage while parallelizing; the extra RAM is released once my model is successfully parallelized onto the GPUs, and my total GPU memory usage doesn't increase the way RAM does. Maybe ChatGLM is not yet fully supported for parallelization. |
@lanking520 Is Llama 7B/65B supported? |
Yes |
Hi @mrwyattii, would it be possible to use this script for AutoTP rather than kernel injection?
|
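For what it's worth, AutoTP in DeepSpeed is normally selected by leaving kernel injection off while still requesting a tensor-parallel size, so a hypothetical AutoTP variant of the earlier inference config might look like the sketch below. Whether save_mp_checkpoint_path produces loadable shards in AutoTP mode is not confirmed in this thread.

import os
import torch
import deepspeed
import transformers

# Hypothetical AutoTP variant of the earlier inf_config: kernel injection disabled,
# tensor parallelism still requested, so DeepSpeed auto-shards supported layers.
model = transformers.AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b", torch_dtype=torch.float16
)
inf_config = {
    "replace_with_kernel_inject": False,
    "dtype": torch.float16,
    "tensor_parallel": {"tp_size": int(os.getenv("WORLD_SIZE", "1"))},
}
model = deepspeed.init_inference(model, config=inf_config)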
Currently we are trying to run inference with the pretrained BLOOM model. However, loading takes very long due to DeepSpeed sharding at runtime. Since there is a pre-sharded version of BLOOM:
would it be possible to share the script that produced it, or any guidance on how to preshard a model to speed up the loading experience?