Plan to support FSDP2? #2873

Open
ByronHsu opened this issue Jun 19, 2024 · 10 comments · May be fixed by #3231
Assignees
Labels
enhancement (New feature or request), feature request (Request for a new feature to be added to Accelerate)

Comments

@ByronHsu

ByronHsu commented Jun 19, 2024

FSDP2 provides a smaller memory footprint, compatibility with torch.compile, and more flexibility thanks to per-parameter sharding. Does Hugging Face plan to support FSDP2?

https://github.com/pytorch/torchtitan/blob/main/docs/fsdp.md

@BenjaminBossan
Member

Thanks for bringing FSDP2 to our (or at least my) attention. The changes described in the document you linked sound very reasonable and could remove some of the common pain points of using FSDP.

Reading this, I got the impression that this is a very new addition to PyTorch. Searching for fully_shard in the PyTorch docs returns no hits, which reinforces this impression. But looking at the actual code, it's already 2 years old! So I'm now confused about the state of this feature: is it going to be officially released soon, or is it more of an experimental feature that may or may not see continued work? Do you have any insights on that, @ByronHsu?

@ByronHsu
Author

ByronHsu commented Jun 20, 2024

Thanks @BenjaminBossan! If I understand correctly, the PyTorch team wants to replace FSDP1 with FSDP2 in the long term.
I saw that it has already been integrated into torchtitan. Maybe we can make some plans for accelerate too? Otherwise, users cannot use torch.compile with FSDP in HF. cc PyTorch team @awgu @msaroufim

@awgu
Contributor

awgu commented Jun 21, 2024

> But looking at the actual code, it's already 2 years old!

Very sorry for the confusion! There are two separate functions called fully_shard, one being 2 years old and one being new from this year. For historical context, we were experimenting with approaches to implementing FSDP that were not an nn.Module wrapper like FullyShardedDataParallel. This led to the distributed/_composable folder, and the APIs were all verbs, hence fully_shard. The original fully_shard called into the same underlying code as FullyShardedDataParallel. The new fully_shard (FSDP2) is a standalone implementation.

We proposed FSDP2 as a prototype for the 2.4 release, and we are investing in it heavily.

@BenjaminBossan
Member

Thanks a lot for clearing up my confusion. In that case, I think it makes sense to wait until FSDP2 is released and then run experiments with accelerate to see how it can best be supported.

@muellerzr added the enhancement and feature request labels Jul 1, 2024
@muellerzr
Collaborator

The main worry with FSDPv2 is whether it's stable enough that it makes sense to include it in Accelerate. In the worst case, we can keep a draft PR open and/or ship it as an experimental feature (and advertise it as such).

So my main question is:

  • How stable is it already? What ETA is there for it to be considered "stable"?

I planned on looking into FSDP2 in the near future anyway, so I'm open to having some early-ish support for it in Accelerate, as long as I can get a full grasp of how far along its development is.

(We did something similar with PiPPy, so it's okay to do so here too.)

I know we need to do some heavy uprooting to add custom process support to Accelerate, which I believe FSDP2 relies on, if I'm not mistaken?

@muellerzr
Collaborator

What would be helpful on my end is some bare-bones FSDP2 examples in PyTorch showing how things operate end-to-end.

@raghukiran1224

A bare-bones example of FSDP2 is available in https://github.com/pytorch/torchtitan.

@muellerzr
Collaborator

Thanks @raghukiran1224 :) Yes indeed, I plan on looking into these with some of the torch folks. Getting something small going is in our near future. (Probably highly experimental, since they're still not settled on things yet.)

@muellerzr moved this to TODO Feature in Accelerate 1.0.0 Roadmap Jul 29, 2024
@muellerzr self-assigned this Jul 29, 2024
@kmehant linked a pull request Nov 8, 2024 that will close this issue
5 tasks
@kmehant

kmehant commented Nov 8, 2024

We have been looking at this and will be happy to help bring FSDP2 into accelerate as an experimental parallelism option. RFC PR - #3231

cc: @raghukiran1224 @ashokponkumar @prjayach @awgu

@FindDefinition

I look forward to more generic N-D parallelism support (device mesh, TP, CP) rather than FSDP2 only. I implemented a simple AcceleratorNd by inheritance to support this, but found that a lot of internal code needs to change to cover more cases:

  1. accelerator.gather and other distributed ops must be performed in the data-parallel group instead of the default group
  2. the dataloader/scheduler must use the data-parallel group rank instead of the global rank
  3. save_pretrained doesn't support DTensor (using PyTorch DCP in accelerate and transformers could resolve this)
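
Points 1 and 2 can be illustrated with a small pure-Python sketch. It assumes a 2-D mesh laid out row-major as (data parallel, tensor parallel); the helper names (mesh_coords, shard_indices, dp_group_members) are hypothetical and not part of Accelerate or PyTorch. The idea is that ranks in the same tensor-parallel group must see the same batch, so data sharding has to key on the data-parallel rank rather than the global rank.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MeshCoords:
    """Position of one process in a hypothetical 2-D (dp, tp) mesh."""
    dp_rank: int
    tp_rank: int


def mesh_coords(global_rank: int, dp_size: int, tp_size: int) -> MeshCoords:
    """Map a global rank to (dp_rank, tp_rank), assuming a row-major layout."""
    if not 0 <= global_rank < dp_size * tp_size:
        raise ValueError("global_rank is outside the mesh")
    return MeshCoords(dp_rank=global_rank // tp_size, tp_rank=global_rank % tp_size)


def shard_indices(num_samples: int, rank: int, num_replicas: int) -> List[int]:
    """Strided split of sample indices across replicas."""
    return list(range(rank, num_samples, num_replicas))


def dp_group_members(global_rank: int, dp_size: int, tp_size: int) -> List[int]:
    """Global ranks sharing this rank's tp_rank, i.e. its data-parallel group."""
    tp_rank = global_rank % tp_size
    return [dp * tp_size + tp_rank for dp in range(dp_size)]


if __name__ == "__main__":
    dp_size, tp_size, num_samples = 2, 2, 8
    for rank in range(dp_size * tp_size):
        coords = mesh_coords(rank, dp_size, tp_size)
        by_global = shard_indices(num_samples, rank, dp_size * tp_size)
        by_dp = shard_indices(num_samples, coords.dp_rank, dp_size)
        print(rank, coords, "global-rank shard:", by_global, "dp-rank shard:", by_dp)
```

Sharding by global rank hands every process a different slice, including tensor-parallel peers. Sharding by dp_rank gives ranks 0 and 1 (same dp_rank) identical data, and a gather over dp_group_members collects one result per data-parallel replica rather than one per process.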

Projects
Status: In Progress

7 participants