Fix random token-generation issue + MP-checkpoint loading/saving #2132
Conversation
Just fixed it, please give it a try.
Looks like I need to add a new key to my checkpoint.json? Is it mandatory? What value should I put in it for the Hugging Face checkpoint file list?
EDIT: I looked at the code and set it to …
After getting past the …
My new checkpoints JSON file looks like this:
{
  "type": "BLOOM-176B",
  "base_dir": "/home/ubuntu/.cache/deepspeed/bigscience/bloom",
  "checkpoints": ["BLOOM-176B-non-tp.pt", "BLOOM-176B-non-tp.pt", "BLOOM-176B-non-tp.pt", "BLOOM-176B-non-tp.pt", "BLOOM-176B-non-tp.pt", "BLOOM-176B-non-tp.pt", "BLOOM-176B-non-tp.pt", "BLOOM-176B-non-tp.pt", "BLOOM-176B-tp_00.pt", "BLOOM-176B-tp_01.pt", "BLOOM-176B-tp_02.pt", "BLOOM-176B-tp_03.pt", "BLOOM-176B-tp_04.pt", "BLOOM-176B-tp_05.pt", "BLOOM-176B-tp_06.pt", "BLOOM-176B-tp_07.pt"],
  "version": 1.0,
  "parallelization": "tp",
  "mp_size": 8
}
EDIT: I found the …
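For reference, a minimal sketch of how a checkpoints JSON like the one above is typically consumed: it is handed to deepspeed.init_inference through the checkpoint argument. The file name checkpoints.json and the model construction below are illustrative, not taken from this thread.

import deepspeed
import torch
from transformers import AutoConfig, AutoModelForCausalLM

# Build the model without weights on the meta device, then let DeepSpeed load
# the shards listed under "checkpoints" in the JSON.
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(
        AutoConfig.from_pretrained("bigscience/bloom"))

model = deepspeed.init_inference(
    model,
    mp_size=8,                       # must match "mp_size" in the JSON
    dtype=torch.float16,
    replace_with_kernel_inject=True,
    checkpoint="checkpoints.json",   # illustrative path to the JSON above
)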
v in dict(replaced_module.state_dict()).items()
if transformer_name not in k
}),
non_tp_ckpt_name)
f'{save_mp_checkpoint_path}/{non_tp_ckpt_name}'
that's true, it's not saved correctly. I am gonna fix it now
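A hedged reconstruction of what that fix presumably looks like, reusing the names visible in the diff fragment above (replaced_module, transformer_name, non_tp_ckpt_name, save_mp_checkpoint_path); this is a sketch, not the exact DeepSpeed change:

import torch

# Keep only the parameters that are not tensor-parallel transformer weights
# (embeddings and other shared parameters) and write them under
# save_mp_checkpoint_path instead of the bare file name.
non_tp_state_dict = {
    k: v
    for k, v in dict(replaced_module.state_dict()).items()
    if transformer_name not in k
}
torch.save(non_tp_state_dict, f'{save_mp_checkpoint_path}/{non_tp_ckpt_name}')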
@zcrypt0, it also generates a config file under the same path that you can use to run inference with
EDIT: I just noticed the change in the non-TP file size; I will give it a try soon.
@RezaYazdaniAminabadi Just tested and it works without a hitch, nice! 👍
Still getting this error @RezaYazdaniAminabadi, running with batch size = 1.
Ran this with CUDA 11.6 and DeepSpeed on the master branch.
@RezaYazdaniAminabadi how long is cached TP model loading supposed to take?
self.model is loaded using HF AutoModel as in bloom-ds-inference.py.
nvm
@RezaYazdaniAminabadi I am seeing … again after updating to the master branch and saving without providing a checkpoint JSON.
Just want to double check: did your install include this commit in master? #2237
@jeffra yes, I am on the latest commit.
I use this code, run with:
deepspeed --num_gpus 8 scripts/bloom-inference-server/cache_ds_checkpoints.py --model_name bigscience/bloom --dtype fp16 --save_mp_checkpoint_path ../DS_cache

import argparse
import os

import deepspeed
import torch
from transformers import AutoConfig, AutoModelForCausalLM


def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()

    group = parser.add_argument_group(title="launch config")
    group.add_argument("--local_rank", required=False,
                       type=int, help="used by dist launchers")
    group.add_argument("--save_mp_checkpoint_path", required=True,
                       type=str, help="MP checkpoints path for DS inference")

    group = parser.add_argument_group(title="model")
    group.add_argument("--model_name", type=str,
                       required=True, help="model to use")
    group.add_argument("--dtype", type=str, required=True,
                       choices=["bf16", "fp16"], help="dtype for model")

    args = parser.parse_args()

    if (args.dtype == "bf16"):
        args.dtype = torch.bfloat16
    elif (args.dtype == "fp16"):
        args.dtype = torch.float16

    return args


def main() -> None:
    args = get_args()

    if (args.local_rank == 0):
        print("Loading model...")

    world_size = int(os.getenv("WORLD_SIZE", "1"))

    # Load model
    with deepspeed.OnDevice(dtype=args.dtype, device="meta"):
        model = AutoModelForCausalLM.from_config(
            AutoConfig.from_pretrained(args.model_name),
            torch_dtype=torch.bfloat16
        )
    model = model.eval()

    if (args.dtype == torch.float16):
        model = deepspeed.init_inference(
            model,
            mp_size=world_size,
            dtype=args.dtype,
            replace_with_kernel_inject=True,
            save_mp_checkpoint_path=args.save_mp_checkpoint_path
        )
    elif (args.dtype == torch.bfloat16):
        raise NotImplementedError("bfloat16 is not yet supported")

    print("Model loaded")


if (__name__ == "__main__"):
    main()
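For the later inference run, a sketch of how the cached shards would then be loaded; it mirrors the script above but passes checkpoint= (pointing at the config that gets written next to the shards) instead of save_mp_checkpoint_path=. The file name ds_inference_config.json is an assumption and may differ in your DeepSpeed version.

import os

import deepspeed
import torch
from transformers import AutoConfig, AutoModelForCausalLM

# Assumed name of the config generated alongside the cached MP checkpoints.
checkpoints_json = os.path.join("../DS_cache", "ds_inference_config.json")

with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(
        AutoConfig.from_pretrained("bigscience/bloom"))
model = model.eval()

model = deepspeed.init_inference(
    model,
    mp_size=int(os.getenv("WORLD_SIZE", "1")),
    dtype=torch.float16,
    replace_with_kernel_inject=True,
    checkpoint=checkpoints_json,  # load pre-sharded checkpoints instead of re-sharding
)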
@jeffra This issue is blocking bigscience-workshop/Megatron-DeepSpeed#328 |
@jeffra ^^ |
@mayank31398 Hi, I am also facing this issue. Is it solved now?
@pai4451 not yet |
This PR fixes the token-generation issue caused by different random seeds across MP ranks. It also adds the ability to load/save MP-partitioned checkpoints to speed up checkpoint loading for inference.
cc: @stas00 @jeffra
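For readers unfamiliar with the seed problem, the general technique is to make every model-parallel rank sample with the same RNG state; a minimal sketch (not the exact code in this PR) is:

import torch
import torch.distributed as dist

def sync_random_seed(src: int = 0) -> int:
    """Broadcast one seed from rank `src` and seed every rank's RNG with it,
    so sampled tokens agree across model-parallel ranks.
    Assumes torch.distributed is already initialized."""
    if dist.get_rank() == src:
        seed = torch.randint(0, 2**31 - 1, (1,), device="cuda")
    else:
        seed = torch.zeros(1, dtype=torch.long, device="cuda")
    dist.broadcast(seed, src=src)
    seed = int(seed.item())
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    return seed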