-
Notifications
You must be signed in to change notification settings - Fork 27.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplify Tensor Parallel implementation with PyTorch TP #34184
Conversation
Cc: @lessw2020 @HamidShojanazeri |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really really nice! I think the only thing I would change is to have the tp_plan
loadable and potentially defined in the config. Given that it does not hold any weights, it would allow other frameworks to rely on it as well! 🤗
Also we would place device = torch.device(f"cuda:{rank}")
dist.init_process_group("nccl", device_id=device)
device_mesh = dist.init_device_mesh("cuda", (world_size,))
with device:
model = AutoModelForCausalLM.from_pretrained(
model_id,
) potentially inside |
WDYT? |
We will also add this to the DEFAULT_TP_PLAN = {
"model.layer.*.q_proj":"column_parallel"
} this way we can init lama config with it! |
cc @SunMarc as per our discussion! |
Thanks @ArthurZucker @SunMarc for your review. Good idea to move Since the top-level model can have different I also made the TP styles "type neutral" in the config as you suggested. Translation to torch TP types is done only when applying TP. |
Re moving distributed / device_mesh init code inside It is a good direction in making the workflow even simpler for users. As in, wouldn't it be nice if the model is distributed after it is created? I think a UI design would be helpful as to how users express the intention of "please create the model in a distributed manner". For example, would
If that kwarg is given, I think the intention is clear enough, and we can call Re launcher: I might slightly prefer the above "explicit expression" to "implicit dependency" on Regardless, I agree that it would be very nice if we can create distributed models out of `from_pretrained" directly, and I am really excite to help that way :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks a lot better!
I think we want to have some small tests to make sure we properly serialize the TP-Plan, and also properly dispatch the model (slow tests, we can probably add them ourselves!)
- would need some documentation as well, we can do that in a follow up PR for sure
On of the most important question now is: What's the approach regarding ForCausalLM
.
This might be for @SunMarc as it's transformers
internal:
_base_model_tp_plan
will be consumed by the PreTrainedModel
➡ this way we can enforce some checks:
- check that all models that inherit from
PreTrainedModel
have a TP plan - check nesting (ForCausalLM has a module that has a TP Plan, if it has other module they should have a TP plan
src/transformers/pytorch_utils.py
Outdated
def translate_to_torch_parallel_style(style: str): | ||
""" | ||
In model configurations, we use a neutral type (string) to specify parallel | ||
styles, here we translate them into torch.distributed tensor-parallel | ||
types. | ||
""" | ||
if not isinstance(style, str): | ||
raise ValueError( | ||
f"Unsupported parallel style type {type(style)}, expected str" | ||
) | ||
|
||
if style == "colwise": | ||
return ColwiseParallel() | ||
elif style == "rowwise": | ||
return RowwiseParallel() | ||
elif style == "colwise_rep": | ||
return ColwiseParallel(output_layouts=Replicate()) | ||
else: | ||
raise ValueError(f"Unsupported parallel style value: {style}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A mapping from tp style to the correct function might be better.
We are also thinking of defining a TensorParallelConfig, your feedback is welcome here, as I don't know the variety of classes / args that might be used!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comment! Indeed a mapping style would look better.
The only caveat is that the returned value here is an object rather than a constant or class (see the ()
behind ColwiseParallel
). If we prepare a map, we may be always returning the same object -- the parallelize_module
API may be able to support it I guess, I am just not sure if there is a contract guaranteeing that today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pitching in :)
We should be able to use the same object since it applies required parallel operation to the module and returns a new copy - https://github.com/pytorch/pytorch/blob/86d4b7d60b264cae5a04a1b20719bcd7a5752a4c/torch/distributed/tensor/parallel/api.py#L95
Have also tested it empirically while benchmarking (#34194)
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SOunds good!
@@ -141,6 +141,16 @@ class LlamaConfig(PretrainedConfig): | |||
|
|||
model_type = "llama" | |||
keys_to_ignore_at_inference = ["past_key_values"] | |||
# Default tensor parallel plan for base model `LlamaModel` | |||
_base_model_tp_plan = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to allow this for external use by removing _
so that we can allow users to define tp plan tweaks from config.json
?
Given that, shall we as well allow for providing custom tp plan as input to LlamaConfig()
that overrides the default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good idea. We can make this public once we prove things work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, base_model_tp_plan
should be supported as input to the PreTrainedConfig!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, this variable base_model_tp_plan
has to be added to PreTrainedConfig
class PretrainedConfig(PushToHubMixin): |
{}
which I believe is best possible default for any config sub class inheriting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kmehant @ArthurZucker for the suggestion. I moved base_model_tp_plan
to PretrainedConfig
in the latest commit.
Summarizing my discussion with @kmehant re collaboration plan between this PR and #34194:
|
Sounds good, let's focus on this one first then! |
Dumb question: is there a way to bypass the |
Okay, I just temporarily disabled the "Copied from" comments in some derivative models. Will add them back once we extend TP plans to those models. |
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey! So basically places where you are disabling it , you just need to do something like:
# Copied from transformers.models.llama.modeling_llama.LlamaModel with Llama->Olmoe,lm_head->new_name
for example
and before anything run make fix-copies
which would apply your changes!
Thanks @ArthurZucker . I tried that method, but it doesn't seem to work. |
Ah that is because you probably modified |
@ArthurZucker Thanks again for the suggestions.
|
Merged! let me review one last time 🤗 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! 🔥 some small things left to do but super close to merge!
@@ -1442,6 +1449,9 @@ def post_init(self): | |||
""" | |||
self.init_weights() | |||
self._backward_compatibility_gradient_checkpointing() | |||
# If current model is a base model, attach `base_model_tp_plan` from config | |||
if self.base_model is self: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, but self._tp_plan
should be None only for the base model no?
We could maybe add something like if not self.base_model is self and self._tp_plan is None and self.supports_tp_plan
raise an error in the futur, to force people to add the TP plan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, would be nice to have a generic way for all models that have support_tp_plan=True
instead of just llama, setting everyone up for success in adding support for other models!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, both me and @kmehant are interested in extending this mechanism to other models. We can think of an way to "copy" the tp_plan to other models with similar architectures (just like how HF copies modeling code between similar models). Maybe covering all AutoModelForCausalLM
would be a good next step. Follow-up PR will come soon :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ArthurZucker for the close look! I fixed the places you pointed out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, both me and @kmehant are interested in extending this mechanism to other models. We can think of an way to "copy" the tp_plan to other models with similar architectures (just like how HF copies modeling code between similar models). Maybe covering all AutoModelForCausalLM
would be a good next step. Follow-up PR will come soon :)
@ArthurZucker CI is all green now. (Previous failure seems to come from infra instability). Is there anything else I should add before merge? Thanks! |
Nope merging! 🤗 |
I wanted to check if |
SUper kudos for the PR! 🚀 |
Hi @ArthurZucker and @kwen2501 - it seems this PR now requires that users have torch 2.5+, is that correct? When trying to run the DeepSpeed unit tests with the latest transformers and torch 2.3 I'm getting errors around not being able to import from
The specific error:
Opened #34795 to track if that's better visibility wise. |
@ArthurZucker @kwen2501 Planning for a HF blog? I am in! |
Sure! Won't have bandwidth to start a blog but feel free to do it! 🤗 |
…34184) * Simplify Tensor Parallel implementation with PyTorch TP * Move tp_plan to config * Lint * Format and warning * Disable copy-from check * Conditionally get attr from config * make fix-copies * Move base_model_tp_plan to PretrainedConfig * Move TP into from_pretrained * Add device context for load * Do not serialize * Move _tp_plan setting to post_init * Add has_tp_plan * Add test_tp * Add 'Multi-gpu inference' doc * Add backward support for device type identification * Auto-detect accelerator * supports_tp_plan * copyright year * Fix copy
…34184) * Simplify Tensor Parallel implementation with PyTorch TP * Move tp_plan to config * Lint * Format and warning * Disable copy-from check * Conditionally get attr from config * make fix-copies * Move base_model_tp_plan to PretrainedConfig * Move TP into from_pretrained * Add device context for load * Do not serialize * Move _tp_plan setting to post_init * Add has_tp_plan * Add test_tp * Add 'Multi-gpu inference' doc * Add backward support for device type identification * Auto-detect accelerator * supports_tp_plan * copyright year * Fix copy
What does this PR do?
This PR uses the
torch.distributed.tensor.parallel
subpackage to implement Tensor Parallel for Llama (as an example).The motivation is multi-fold:
to make modeling code simple as single-worker case:
all manual TP implementations under
if self.config.pretraining_tp > 1
can be removed.to make tensor parallelism easily accessible by users:
added a
model.tensor_parallel(device_mesh)
method that allows users to turn a single-proc model into a parallel model. !- Please guide me to a right place to put this function/method ifPreTrainedModel
is not a preferred place. -!Note:
This is just a demo. The removal of
if self.config.pretraining_tp > 1
may break workflows elsewhere, but it is just intended to calculate how much code can be saved, and hopefully it would be possible to direct it to the mechanism in this PR.User script:
Launch command:
torchrun --standalone --nproc-per-node 4 tp_hf.py
Output:
Test
Doc
perf_infer_gpu_multi.md
Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
Models:
Library:
Integrations:
HF projects:
Cc PyTorch Distributed Team:
@gnadathur @wanchaol @fduwjj @wz337 @wconstab @tianyu-l