-
Notifications
You must be signed in to change notification settings - Fork 27.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove usage of local variables related with model parallel and move … #13039
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -249,10 +249,14 @@ def forward( | |
else: | ||
token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=self.position_ids.device) | ||
|
||
local_rank = os.getenv("LOCAL_RANK") | ||
if local_rank is not None: | ||
token_type_ids = token_type_ids.to(local_rank) | ||
|
||
Comment on lines
+252
to
+255
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @sgugger, this is new to and surely the verbatim code shouldn't be replicated everywhere as it doesn't contribute to model's readability and should be abstracted into an API - put it in the super-class or having a util function? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok Firstly, We can start models that doesn't have token type ids like gpt2 or gpt neo. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I didn't say that. What I suggested is that we work out all the kinks in all models that are unique, and any similar models get worked out afterwards. So if you're finding yourself replicating the same code - it's not unique - if that's a good rough guideline to follow. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry. I think I misunderstood what you said. (As you know, I'm not good at English because I've never lived in an English-speaking country.) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it is better to define a new abstract class or write a utility function than to use the os module directly. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Please don't worry about that, even those with great grasp of a certain language misunderstand each other very often. It's normal. So we will just try to explain again in a different way until we understand each other. And code samples are the best way to communicate at times ;) So to try again, what I was proposing is to pick all the models that require unique handling. And once those are figured out we can replicate all the similar ones. e.g. we have some models which are 95% identical to 5-10 others. Alternatively, we can also take a very different approach. We can pick just one model - say T5 or GPT2 and completely port it, and then do a few more models, etc, etc. The drawback in this approach is that it'd be more difficult to see how to generalize, but I think getting one model working sooner is more practical and we can get users to start experimenting and report flaws sooner than later. Also it'll be much easier to see that we are doing the right thing, and of course tests will be needed, and we can test right away. @sgugger, what's your take on this? Do the integration in stages across all models? Or do 1-2 most popular models, and then replay to other models, generalizing on the way where needed? |
||
if inputs_embeds is None: | ||
inputs_embeds = self.word_embeddings(input_ids) | ||
token_type_embeddings = self.token_type_embeddings(token_type_ids) | ||
|
||
token_type_embeddings = self.token_type_embeddings(token_type_ids) | ||
embeddings = inputs_embeds + token_type_embeddings | ||
if self.position_embedding_type == "absolute": | ||
position_embeddings = self.position_embeddings(position_ids) | ||
|
@@ -711,6 +715,10 @@ def forward( | |
else: | ||
token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device) | ||
|
||
local_rank = os.getenv("LOCAL_RANK") | ||
if local_rank is not None: | ||
token_type_ids = token_type_ids.to(local_rank) | ||
|
||
extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2) | ||
extended_attention_mask = extended_attention_mask.to(dtype=self.dtype) # fp16 compatibility | ||
extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0 | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -253,7 +253,7 @@ def forward( | |
|
||
attn_output = attn_output.view(bsz, self.num_heads, tgt_len, self.head_dim) | ||
attn_output = attn_output.transpose(1, 2) | ||
attn_output = attn_output.reshape(bsz, tgt_len, embed_dim) | ||
attn_output = attn_output.reshape(bsz, tgt_len, self.embed_dim) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. if we do that, I propose we change the earlier code:
so that there will be no confusing that way we know we already use There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Great. |
||
|
||
attn_output = self.out_proj(attn_output) | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we are going to move the variable a moment later to another device, as proposed next, this is wasteful.
We should get the device figured out first and then create the variable directly on the target device.
apologies if this comment is confusing I'm referring to lines 250-254. so this line and the 3 code lines after it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. we can add another one else statement for this.
But let's start with the model without token type id as you said above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why adding an else statement, I don't quite follow?
I suggested we figure out the device first, and then in one go create the variable on the right device. i.e. your rank code goes before the creation, the correct device is set and then there is no
.to()
to do.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought of following code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If model parallel is not applied, the LOCAL_RANK variable does not exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but still I'd prefer to abstract
os.getenv("LOCAL_RANK"))
and include meaningful exceptions should it be invalid for whatever reason. i.e.transfromers
needs to have a defined API to get the rank and not rely just on env var.It'd also hide all that checking if it's not defined. Let's perhaps start with a helper util in
modeling_utils.py
to add the abstractionThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. It would be good to implement mpu like nvidia megatron and provide it as a utility function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/mpu/initialize.py#L255
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we will absolutely need to implement MPU anyway, so let's to do it right from the get-going!
You have a choice of Megatron or deepspeed MPU versions, I may have seen some other.
I think I may have even started to port one while trying to plug my PP PR into Deepspeed's 3D. yes, I did:
https://github.com/huggingface/transformers/pull/9765/files#diff-48e672da3865f77a2e1d38954e8e075c0f1e02c7306f163847b9b8ecc56ede24
I see I took it from megatron.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we can re-use some other parts of my work, but things have probably changed quite a lot in transformers to try to rescue much. there was some of the mpu use in the modeling_t5.py, but if I remember it wasn't quite finished.