[Deepspeed Inference] HF Integration #14426
base: main
Conversation
@@ -33,6 +33,54 @@
logger = logging.get_logger(__name__)


inference_custom_map = dict(
    electra=dict(ElectraLayer=("output.dense")),
@stas00 This will only parallelize the output.dense layer, and the other parts will be duplicated on all GPUs, resulting in memory inefficiency.
To parallelize all parts, information about every layer has to be provided. That would be similar to the policy of the existing DeepSpeed Inference, and not very different from the policy I used in Parallelformers.
@RezaYazdaniAminabadi Am I right? Or do you have a different opinion?
As the PR says, this is very early. Basically all I did was convert an example that Reza gave me so that it is integrated into the HF Trainer. So I am treating it as a black box for now and waiting for Reza to complete the project before trying to understand how it works.
But I trust Reza will be happy to answer your question.
@hyunwoongko, this only shows which linear layers would require an all_reduce. So this is not going to use the same policy as when injecting the kernels. You can find more detail on how the other layers are partitioned in the replace_module function in DeepSpeed. But basically, this policy here just shows which parts need to be partitioned horizontally, whereas the rest are partitioned vertically. Does it make sense?
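To make the partitioning direction concrete, here is a minimal single-process sketch (not code from this PR; the shapes and layer names are illustrative assumptions) of the scheme described above: the first linear of a block is split along its output features and needs no communication, while the second is split along its input features, so each rank ends up with a partial sum that has to be combined with an all_reduce. That is why only layers like output.dense need to appear in the map.

```python
import torch

# Single-process simulation of splitting an FFN block across 2 "GPUs".
# Shapes and names are illustrative, not taken from the PR.
torch.manual_seed(0)
hidden, intermediate, world_size = 8, 32, 2

x = torch.randn(1, hidden)              # input, replicated on every rank
w1 = torch.randn(intermediate, hidden)  # first linear (e.g. intermediate.dense)
w2 = torch.randn(hidden, intermediate)  # second linear (e.g. output.dense)

reference = x @ w1.t() @ w2.t()         # unpartitioned result

# "Vertical" split of w1 along its output dimension: each rank holds a slice of
# the intermediate activation, no communication needed.
w1_shards = w1.chunk(world_size, dim=0)
# "Horizontal" split of w2 along its input dimension: each rank produces only a
# partial sum of the final output, so the results must be reduced across ranks.
w2_shards = w2.chunk(world_size, dim=1)

partials = [(x @ w1_shards[r].t()) @ w2_shards[r].t() for r in range(world_size)]
combined = sum(partials)                # stands in for torch.distributed.all_reduce

assert torch.allclose(reference, combined, atol=1e-5)
```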
Thank you for the explanatory notes, @RezaYazdaniAminabadi - I have added them to the file, so this is covered.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
This PR works on an integration of Deepspeed Inference, which implements Tensor Parallelism. This is different from Deepspeed ZeRO inference.
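For context, a rough sketch of what DeepSpeed Inference with tensor parallelism looks like when driven directly rather than through the Trainer (this is not the PR's code; the model name, injection policy and settings below are illustrative assumptions, and it needs GPUs and the deepspeed launcher to actually run):

```python
# Run with the DeepSpeed launcher, e.g.: deepspeed --num_gpus 2 this_script.py
import torch
import deepspeed
from transformers import AutoModel, AutoTokenizer
from transformers.models.electra.modeling_electra import ElectraLayer

model_name = "google/electra-small-discriminator"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# mp_size is the tensor-parallel degree; injection_policy tells DeepSpeed which
# linear layers' outputs need an all_reduce (compare inference_custom_map above).
model = deepspeed.init_inference(
    model,
    mp_size=2,
    dtype=torch.half,
    injection_policy={ElectraLayer: ("output.dense",)},
)

device = torch.device("cuda", torch.cuda.current_device())
inputs = tokenizer("DeepSpeed Inference test", return_tensors="pt").to(device)
with torch.no_grad():
    print(model(**inputs).last_hidden_state.shape)
```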
This is a very early draft.
To try:
It currently hangs with `--num_gpus > 1`: one GPU finishes processing while the other is stuck preparing inputs, so the synchronization between the GPUs still needs to be figured out.