
[deepspeed] zero inference #14253

Merged: 8 commits into huggingface:master on Nov 23, 2021
Conversation

@stas00 (Contributor) commented Nov 2, 2021

This PR extends the HF/DS integration to support DeepSpeed ZeRO inference. We no longer need to waste GPU memory allocating an optimizer/scheduler only to drop them, and in some cases this enables what wasn't possible before: users without the extra GPU memory were hitting OOM during inference.
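For illustration, a minimal sketch of how this can be used (the config file name, `model`, and `eval_dataset` are placeholders, not from this PR):

from transformers import Trainer, TrainingArguments

# ds_config.json: a ZeRO Stage 3 config with no "optimizer"/"scheduler"
# sections - inference needs neither, so no GPU memory is spent on them
args = TrainingArguments(
    output_dir="output_dir",         # placeholder
    deepspeed="ds_config.json",      # placeholder config file
    per_device_eval_batch_size=1,
)
trainer = Trainer(model=model, args=args)  # `model` assumed defined elsewhere

# run e.g. under the deepspeed launcher; evaluation now initializes
# DeepSpeed without allocating optimizer/scheduler states
metrics = trainer.evaluate(eval_dataset=eval_dataset)  # `eval_dataset` assumed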

Blocking events:

@jeffra, @sgugger

The CI failures appear to be unrelated to this PR.

@stas00 stas00 marked this pull request as ready for review November 20, 2021 02:14
@sgugger (Collaborator) left a comment

Thanks for adding this new mode!


Inference:

1. DeepSpeed ZeRO Inference - same as Training but doesn't require Optimizer
@sgugger (Collaborator):

Can we make a real sentence? I don't understand what this means.
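To make the bullet concrete: an inference config is just the training config minus the training-only sections (illustrative values, Python dict form of the JSON file):

# Illustrative ZeRO-3 training config
train_config = {
    "zero_optimization": {"stage": 3},
    "optimizer": {"type": "AdamW", "params": {"lr": 3e-5}},
    "scheduler": {"type": "WarmupLR", "params": {"warmup_num_steps": 500}},
    "train_micro_batch_size_per_gpu": 1,
}

# the same config works for ZeRO inference once the optimizer/scheduler
# sections are dropped
inference_config = {
    k: v for k, v in train_config.items() if k not in ("optimizer", "scheduler")
}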

docs/source/main_classes/deepspeed.rst (outdated, resolved)
@@ -111,6 +111,24 @@ def get_value(self, ds_key_long, default=None):
            return default
        return config.get(ds_key, default)

    def del_config_sub_tree(self, ds_key_long, must_exist=False):
        config = self.config
@sgugger (Collaborator):

This method deserves a docstring.

            config = config.get(node)
            if config is None:
                if must_exist:
                    raise ValueError(f"Can't find {ds_key_long} entry in the config")
@sgugger (Collaborator):

Might be worth printing the config as well?

Suggested change:
-                    raise ValueError(f"Can't find {ds_key_long} entry in the config")
+                    raise ValueError(f"Can't find {ds_key_long} entry in the config {config}")

@stas00 (Contributor, Author):

That won't work (`config` is already `None` at that point), but I will fix it to dump the config, thank you.
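A rough sketch of the helper with the docstring and the config dump folded in (a reconstruction for illustration; the exact code is in the PR):

# method of the DeepSpeed config wrapper class shown in the diff above
def del_config_sub_tree(self, ds_key_long, must_exist=False):
    """
    Deletes a sub-section of the config if it's found.

    Unless `must_exist` is True, the section doesn't have to exist.
    """
    config = self.config

    # walk down the dotted path, e.g. "optimizer.params"
    nodes = ds_key_long.split(".")
    for node in nodes:
        parent_config = config
        config = config.get(node)
        if config is None:
            if must_exist:
                # dump the full config: the local `config` is None here,
                # so interpolating it would not help
                raise ValueError(f"Can't find {ds_key_long} entry in the config: {self.config}")
            return

    # the leaf was found - remove it from its parent
    parent_config.pop(nodes[-1])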

# resume config update - some bits like `model` and `num_training_steps` only become available during train
hf_deepspeed_config.trainer_config_finalize(args, model, num_training_steps)

def deepspeed_optim_sched(trainer, hf_deepspeed_config, args, num_training_steps):
@sgugger (Collaborator):

This new method deserves a docstring.
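For context, a sketch of how the factored-out helper lets the training and inference paths diverge (the helper name and branch shape are assumptions based on the diff, not the PR's exact code):

def _init_optim_sched(trainer, hf_deepspeed_config, args, num_training_steps, inference):
    if inference:
        # ZeRO inference needs no optimizer/scheduler; drop their config
        # sections in case a training config is being re-used for inference
        hf_deepspeed_config.del_config_sub_tree("optimizer", must_exist=False)
        hf_deepspeed_config.del_config_sub_tree("scheduler", must_exist=False)
        return None, None
    # training path: build both from the config (or the Trainer's defaults)
    return deepspeed_optim_sched(trainer, hf_deepspeed_config, args, num_training_steps)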

@stas00 (Contributor, Author) commented Nov 21, 2021

Thanks a lot for the review and the suggestions, Sylvain. All addressed, please have another look at your convenience. Thank you.

@stas00 stas00 merged commit 956a483 into huggingface:master Nov 23, 2021
@stas00 stas00 deleted the ds-zero-inference branch November 23, 2021 22:09