PyTorch Fully Sharded Data Parallel (FSDP) Integration #147

Open · parthraut wants to merge 21 commits into master
Conversation

@parthraut (Collaborator)

Integrates PyTorch Fully Sharded Data Parallel (FSDP) into Zeus. GlobalPowerLimitOptimizer now performs distributed operations (all-reduce) so that it makes the correct power limit decision across all workers.

  • Implemented a generic zeus.framework.all_reduce, which currently invokes torch.distributed.all_reduce when PyTorch is the framework (a sketch of the PyTorch path is shown after this list).
  • Added a train_fsdp.py example and relevant documentation.
  • Added the relevant use cases to GlobalPowerLimitOptimizer.__init__.
  • Implemented zeus.framework.is_distributed, which is used to warn the user when multiple GPUs are being monitored in a distributed context.
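
For illustration only, here is a minimal sketch of the PyTorch path of such a framework-generic all-reduce helper. It is a simplified assumption, not the exact code added in this PR (the real helper also covers the JAX path and integrates with the helpers in zeus/utils/framework.py).

```python
# Minimal sketch (assumption: not the exact helper in this PR) of a
# framework-generic all-reduce, showing only the PyTorch path.
from typing import Literal

import torch
import torch.distributed as dist


def all_reduce(values: list[float], operation: Literal["sum", "max"]) -> list[float]:
    """Reduce `values` element-wise across all workers; no-op when not distributed."""
    if not (dist.is_available() and dist.is_initialized()):
        # Single-process run: there is nothing to reduce across workers.
        return values
    tensor = torch.tensor(values, dtype=torch.float64)
    op = dist.ReduceOp.SUM if operation == "sum" else dist.ReduceOp.MAX
    dist.all_reduce(tensor, op=op)
    return tensor.tolist()
```

A helper like this is what lets GlobalPowerLimitOptimizer combine per-rank measurements before choosing a single power limit for all workers.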

@jaywonchung (Member)

Please rebase to the current master and push. It's impossible to review with all the changes from past commits.

@jaywonchung (Member) left a comment


Thanks @parthraut! I'm requesting some changes. Please let me know if anything is unclear.

Resolved review threads (outdated):
  • examples/power_limit_optimizer/README.md (3 threads)
  • examples/power_limit_optimizer/train_dp.py
  • zeus/monitor/energy.py
  • zeus/utils/framework.py
Comment on lines 140 to 154
```python
    if jax_is_available():
        # JAX cross-device all-reduce not yet implemented
        return sum(object) if operation == "sum" else max(object)

    raise RuntimeError("No framework is available.")


def is_distributed() -> bool:
    """Check if the current execution is distributed across multiple devices."""
    if torch_is_available(ensure_cuda=False):
        torch = MODULE_CACHE["torch"]
        return torch.distributed.is_available() and torch.distributed.is_initialized()
    if jax_is_available():
        return False  # JAX not yet implemented
    return False
```
@jaywonchung (Member)


  1. If this was going to be left unimplemented, it should have raised a NotImplementedError instead of silently doing the wrong thing (see the sketch after this list).
  2. This PR will be merged after JAX counterparts are implemented. No need to have a full JAX training script; I'm fine with it being tested manually with a quick script that imports and uses all_reduce and is_distributed with JAX.
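
As a sketch only (not the final code in this PR), point 1 would amount to replacing the JAX branch in the excerpt above with an explicit failure:

```python
if jax_is_available():
    # Fail loudly instead of silently reducing only the local values.
    raise NotImplementedError("all_reduce is not yet implemented for JAX.")

raise RuntimeError("No framework is available.")
```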

@parthraut (Collaborator, Author)


I will include the JAX implementation in this PR.

@jaywonchung (Member)


Thanks!

@jaywonchung (Member)


Line 265 is now broken, because the previous implementation assumed that len(zeus_monitor.gpu_indices) gives the current world size. Let's just switch the default optimum_selector to MaxSlowdownConstraint(factor=1.1).
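
For context, a hedged usage sketch of the proposed default; the import paths are assumptions based on the files touched in this PR and may differ in other Zeus versions:

```python
# Sketch: explicitly selecting the proposed default optimum selector.
# Import paths are assumptions based on the files touched in this PR.
from zeus.monitor import ZeusMonitor
from zeus.optimizer.power_limit import GlobalPowerLimitOptimizer, MaxSlowdownConstraint

monitor = ZeusMonitor(gpu_indices=[0])
plo = GlobalPowerLimitOptimizer(
    monitor,
    optimum_selector=MaxSlowdownConstraint(factor=1.1),  # tolerate at most ~10% slowdown
)
```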

@parthraut (Collaborator, Author)


Or we could use torch.distributed.get_world_size (and something analogous for JAX) by defining a generic framework function zeus.framework.get_world_size. What do you think?
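
A hypothetical sketch of what such a helper could look like, reusing torch_is_available, jax_is_available, and MODULE_CACHE from the excerpt above (this suggestion was not adopted; see the reply below):

```python
def get_world_size() -> int:
    """Return the number of distributed workers, or 1 if execution is not distributed."""
    if torch_is_available(ensure_cuda=False):
        torch = MODULE_CACHE["torch"]
        if torch.distributed.is_available() and torch.distributed.is_initialized():
            return torch.distributed.get_world_size()
        return 1
    if jax_is_available():
        jax = MODULE_CACHE["jax"]
        return jax.process_count()
    return 1
```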

@jaywonchung (Member)


Nah, I wouldn't bother for this one. Now I think MaxSlowdownConstraint is a better default; the original one is from 2022.

Resolved review threads (outdated):
  • zeus/utils/framework.py
  • zeus/optimizer/power_limit.py
Code excerpt under discussion (truncated):

```python
return sum(object) if operation == "sum" else max(object)
# Check if not distributed
jax = MODULE_CACHE["jax"]
if jax.process_count() == 1:
```
@jaywonchung (Member)


@parthraut (Collaborator, Author) on Dec 21, 2024


Yes, it should be. Fixed that, thanks.
