[train_engine] support fsdp #2412
It would be great to include a link explaining the difference between lambda wrap and transformer wrap, to make this easier to learn (thanks 周哥 :))
Judging from how they are used, lambda wrap is applied at the whole-encoder level while transformer wrap is applied at the individual-layer level. Why are both of these partitioning schemes needed?
Each wrap means that, around the wrapped module's forward, the gradients at its inputs/outputs and the optimizer-state shards incur one all-gather communication.
FSDP is quite flexible: wrapping is how you control the granularity of the sharding. Wrapping at the encoder/decoder granularity effectively shards only the optimizer state, not the gradients (memory savings comparable to ZeRO-1), while wrapping each individual layer gives layer-level sharding, comparable to ZeRO-2.
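As a concrete illustration of the two granularities, here is a minimal sketch (not code from this PR; `ToyLayer` and `ToyModel` are hypothetical stand-ins for the real encoder/decoder modules) using PyTorch's `lambda_auto_wrap_policy` and `transformer_auto_wrap_policy`:

```python
# Minimal sketch; assumes torch.distributed is already initialized
# (e.g. launched via torchrun). ToyLayer/ToyModel are hypothetical.
import functools

import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import (
    lambda_auto_wrap_policy,
    transformer_auto_wrap_policy,
)


class ToyLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.ff = nn.Linear(16, 16)

    def forward(self, x):
        return self.ff(x)


class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(*[ToyLayer() for _ in range(4)])
        self.decoder = nn.Sequential(*[ToyLayer() for _ in range(2)])

    def forward(self, x):
        return self.decoder(self.encoder(x))


model = ToyModel()

# Coarse granularity: the lambda selects whole submodules, so the entire
# encoder (and decoder) becomes a single FSDP unit; its parameters are
# all-gathered once per forward of that unit.
coarse_policy = functools.partial(
    lambda_auto_wrap_policy,
    lambda_fn=lambda m: m is model.encoder or m is model.decoder,
)

# Fine granularity: every ToyLayer instance becomes its own FSDP unit,
# so parameters are gathered and resharded layer by layer.
fine_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={ToyLayer},
)

# Pick one policy when wrapping:
fsdp_model = FSDP(model, auto_wrap_policy=fine_policy)
```

With the fine-grained policy, only one layer's parameters need to live unsharded at a time, which lowers peak memory at the cost of more frequent all-gather/reshard communication.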
For the details, see https://openmmlab.medium.com/its-2023-is-pytorch-s-fsdp-the-best-choice-for-training-large-models-fe8d2848832f