
Merge Spawn Strategy in Lite #14707

Closed
awaelchli opened this issue Sep 14, 2022 · 2 comments · Fixed by #14952
Labels: fabric lightning.fabric.Fabric · priority: 1 Medium priority task · refactor · strategy: ddp DistributedDataParallel
Milestone: pl:future

Comments

awaelchli (Contributor) commented Sep 14, 2022

🚀 Feature

The DDPStrategy and DDPSpawnStrategy are two very similar strategies. Fundamentally, they differ only in how processes are created and managed. Their logic can be merged and de-duplicated, since over time we have factored everything related to process launching and cluster environments out into separate classes.
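For reference, a minimal sketch of that separation. The class names mirror the launcher classes that exist in Lite, but the interfaces here are simplified assumptions, not the actual signatures:

```python
from typing import Any, Callable


class _Launcher:
    """How worker processes are created lives here, not in the strategy."""

    def launch(self, function: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        raise NotImplementedError


class _SubprocessScriptLauncher(_Launcher):
    """Re-launches the training script in external processes (classic DDP)."""

    def launch(self, function: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        # would start num_processes - 1 copies of the script via subprocess,
        # then run `function` in the main process
        return function(*args, **kwargs)


class _MultiProcessingLauncher(_Launcher):
    """Starts workers in-process (what DDPSpawnStrategy uses today)."""

    def __init__(self, start_method: str = "spawn") -> None:
        self._start_method = start_method

    def launch(self, function: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        # would hand `function` to torch.multiprocessing.start_processes
        return function(*args, **kwargs)
```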

Motivation

  • Less maintenance of duplicated code
  • Easier to do in Lite first, thanks to its reduced complexity; the merged implementation can then be integrated into PL eventually, as PL becomes more dependent on Lite
  • Code that is easier to understand for maintainers and users alike

Pitch

Merge the two implementations into a single class named DDPStrategy, and register that same class under different names with different init args for the launcher settings.
Logical consequence: DDPShardedStrategy and DDPSpawnShardedStrategy will also merge.
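A rough sketch of what this could look like. The registry API, the `start_method` argument, and the launcher selection shown here are assumptions for illustration, not the final implementation:

```python
from typing import Any, Dict, Tuple


class StrategyRegistry:
    """Minimal stand-in for Lite's strategy registry."""

    _registry: Dict[str, Tuple[type, dict]] = {}

    @classmethod
    def register(cls, name: str, strategy_cls: type, **init_kwargs: Any) -> None:
        cls._registry[name] = (strategy_cls, init_kwargs)

    @classmethod
    def get(cls, name: str) -> Any:
        strategy_cls, init_kwargs = cls._registry[name]
        return strategy_cls(**init_kwargs)


class DDPStrategy:
    """One class covering both launch modes; only the launcher differs."""

    def __init__(self, start_method: str = "popen") -> None:
        self._start_method = start_method
        self._launcher = None

    def _configure_launcher(self) -> None:
        if self._start_method == "popen":
            # external script re-launch (old DDPStrategy behavior)
            self._launcher = "subprocess-script launcher"
        else:
            # in-process workers (old DDPSpawnStrategy behavior)
            self._launcher = f"multiprocessing launcher ({self._start_method})"


# The same class, registered under both names with different init args:
StrategyRegistry.register("ddp", DDPStrategy, start_method="popen")
StrategyRegistry.register("ddp_spawn", DDPStrategy, start_method="spawn")

assert type(StrategyRegistry.get("ddp")) is type(StrategyRegistry.get("ddp_spawn"))
```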

Note:

  • The TPUSpawnStrategy should first be refactored to subclass ParallelStrategy directly. Its implementation is vastly different from DDPSpawnStrategy, and it has its own launcher anyway.

Alternatives

Leave as is.

Additional context

Lite now has its own implementations.



cc @justusschock @awaelchli @rohitgr7 @tchaton @carmocca @kaushikb11 @akihironitta

awaelchli added the needs triage Waiting to be triaged by maintainers, refactor, fabric lightning.fabric.Fabric, strategy: ddp DistributedDataParallel, and strategy: ddp spawn labels and removed the needs triage label on Sep 14, 2022
awaelchli added this to the pl:future milestone on Sep 14, 2022
awaelchli added the priority: 1 Medium priority task label on Sep 14, 2022
carmocca (Contributor) commented

Looking forward to this simplification! Are there any known blockers? Or is it just a matter of doing it right and deprecating what's necessary?

awaelchli (Contributor, Author) commented Sep 15, 2022

No big blockers. In PL, it would be more work because there are still some differences there (e.g., the process reconciliation feature). In Lite, it's much easier. A few details, like handling the local rank and multi-node setups, need to be done correctly; the rest is identical code.
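For illustration, a sketch of the rank bookkeeping that differs between the two launch modes, assuming the usual `LOCAL_RANK`/node-rank environment-variable conventions; the function names here are hypothetical:

```python
import os


def local_rank_for(start_method: str, process_idx: int = 0) -> int:
    """Illustrative only: where the local rank comes from in each mode.

    - script launch ("popen"): each worker is launched externally and reads
      LOCAL_RANK from its environment
    - spawn/fork: the launcher creates the workers itself and passes the
      process index directly
    """
    if start_method == "popen":
        return int(os.environ.get("LOCAL_RANK", 0))
    return process_idx


def global_rank(node_rank: int, local_rank: int, devices_per_node: int) -> int:
    # multi-node: the global rank combines the node rank and the local rank
    return node_rank * devices_per_node + local_rank
```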
