
Add AWS Neuron torchrun support #20806

Merged: 12 commits into huggingface:main on Jan 18, 2023

Conversation

jeffhataws (Contributor) commented on Dec 17, 2022

What does this PR do?

This PR adds torchrun support for the AWS Neuron SDK.

The existing HF tutorial for the Neuron SDK requires users to modify the HF example scripts (e.g., run_glue.py). This change minimizes the modifications required.

This change will require the upcoming AWS Neuron PyTorch 1.13 release.

This is an update to #19907.
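
For illustration only (this is not the code added by the PR): with torchrun support, a stock script can rely on the "xla" process-group backend that PyTorch/XLA registers and on the environment variables torchrun exports, with no launcher-specific wiring. A minimal sketch, assuming a Trainium instance with torch-xla and the Neuron SDK installed:

import torch.distributed as dist
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # noqa: F401, registers the "xla" backend

# torchrun exports RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT, so the
# process group can be created without any script-side launcher logic.
dist.init_process_group(backend="xla")

device = xm.xla_device()  # the NeuronCore is exposed as an XLA device
print(f"rank {dist.get_rank()}/{dist.get_world_size()} running on {device}")

Launched with the standard CLI, e.g. torchrun --nproc_per_node=2 run_glue.py ..., the unmodified example script then picks up the distributed context on its own.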

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sgugger

@jeffhataws changed the title from "Add aws neuron torchrun support" to "Add AWS Neuron torchrun support and transformer model-type support for compiler" on Dec 17, 2022
HuggingFaceDocBuilderDev commented on Dec 17, 2022

The documentation is not available anymore as the PR was closed or merged.

sgugger (Collaborator) left a comment


Thanks for adding this new integration. The test won't be run on our CI since torch_neuroncore is not installed. Is it possible to install it in regular images, or do we need to be on an AWS instance?

[Review comment on src/transformers/training_args.py: outdated, resolved]
philschmid (Member) commented
@jeffhataws could you explain a bit more how users would benefit from this? I quickly checked the HF tutorial, and with the change you propose users would still need to modify the scripts, e.g.:

import os

import torch
import transformers
from packaging import version
from transformers import Trainer, __version__

# Fixup to enable distributed training with XLA
if version.parse(__version__) < version.parse("4.20.0"):
    Trainer._wrap_model = lambda self, model, training=True: model
else:
    Trainer._wrap_model = lambda self, model, training=True, dataloader=None: model

# Workaround for NaNs seen with transformers version >= 4.21.0
# https://github.com/aws-neuron/aws-neuron-sdk/issues/593
if os.environ.get("XLA_USE_BF16") or os.environ.get("XLA_DOWNCAST_BF16"):
    transformers.modeling_utils.get_parameter_dtype = lambda x: torch.bfloat16

jeffhataws (Contributor, Author) commented on Dec 20, 2022

> Thanks for adding this new integration. The test won't be run on our CI since torch_neuroncore is not installed. Is it possible to install it in regular images, or do we need to be on an AWS instance?

Yes, for this test we will need a Trainium instance. Over time, once pytorch/xla#3609 is released, we can make it more generic for GPU/XLA. For now, the Neuron team will test this. The test is currently passing on a Trainium instance.

jeffhataws (Contributor, Author) commented

> @jeffhataws could you explain a bit more how users would benefit from this? I quickly checked the HF tutorial, and with the change you propose users would still need to modify the scripts [...]

The first workaround is for missing DDP support, which will be available in Neuron's PyTorch-XLA version 1.13 (a future release). The second workaround is already fixed in transformers==4.25.1 by #20562.
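
For context, the reason the first workaround can be dropped after the 1.13 release: DDP can then wrap the model directly on top of the xla process group, so overriding Trainer._wrap_model is no longer needed. A rough sketch of what becomes possible (assumes PT-XLA >= 1.13 on a Neuron device and a torchrun launch; this is an illustration, not code from this PR):

import torch
import torch.distributed as dist
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # noqa: F401, registers the "xla" backend
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="xla")
model = torch.nn.Linear(4, 2).to(xm.xla_device())
# With PT-XLA 1.13, DDP itself handles gradient synchronization over the
# xla backend, so no _wrap_model override is required.
ddp_model = DDP(model, gradient_as_bucket_view=True)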

@jeffhataws changed the title from "Add AWS Neuron torchrun support and transformer model-type support for compiler" to "Add AWS Neuron torchrun support" on Dec 22, 2022
sgugger (Collaborator) commented on Jan 3, 2023

Thanks for the clarifications. Let's wait until the release of Neuron's PyTorch-XLA version 1.13 to merge this, then?

jeffhataws (Contributor, Author) commented

> Thanks for the clarifications. Let's wait until the release of Neuron's PyTorch-XLA version 1.13 to merge this, then?

@sgugger since we already have a workaround for the DDP wrapper (overriding the _wrap_model function), we can actually merge this first. The reasons are that 1) we want it in the next transformers release, ahead of 1.13, and 2) I will need this change to post another PR for the default compiler flag for the transformer model type. Let me know if this is acceptable.

sgugger (Collaborator) commented on Jan 18, 2023

Thanks for your patience on this.

@sgugger sgugger merged commit c59d71b into huggingface:main Jan 18, 2023
@jeffhataws deleted the add_aws_neuron_torchrun_support branch on January 19, 2023.
ts2095 pushed a commit to ts2095/transformers that referenced this pull request Jan 20, 2023
* Add XLA torchrun support

* Clarify that currently DDP doesn't work with torch.distributed XLA backend yet

* Enable DDP with torchrun and XLA (now available in PT-XLA 1.13)

* Add check for AWS Neuron availability and AWS Neuron specific compiler flag

* Change the new test's name to TestTrainerDistributedNeuronCore

* Remove "assert" and replace raised exception

* Remove compiler flag as it is optional. If needed, will be another PR.

* Use TORCHELASTIC_RUN_ID to determine whether torchrun is used
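
As a hedged illustration of the last commit message above (the helper name here is hypothetical, not the one added by the PR), the detection boils down to an environment check:

import os

def launched_with_torchrun() -> bool:
    # torchrun (torch.distributed.elastic) exports TORCHELASTIC_RUN_ID to
    # every worker it spawns, so its presence distinguishes a torchrun
    # launch from a plain `python script.py` invocation.
    return "TORCHELASTIC_RUN_ID" in os.environ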
venkat-natchi pushed a commit to venkat-natchi/transformers that referenced this pull request Jan 22, 2023 (same commit list as above)
miyu386 pushed a commit to miyu386/transformers that referenced this pull request Feb 9, 2023 (same commit list as above)