Loss mismatch across different DP sizes in SFT training #382

Open
yspMing opened this issue Nov 19, 2024 · 1 comment
Comments

@yspMing

yspMing commented Nov 19, 2024

When training DeepSeek in SFT mode, running the same job on 8 GPUs and on 16 GPUs produces first-iteration losses that do not match. Both runs use a TP=1, PP=1 parallelism split.

Further analysis shows that the loss of the first microbatch matches between the two runs, but the subsequent microbatches do not. The first-iteration losses are shown below:
[Screenshot: first-iteration loss comparison, 2024-11-19 21:44:01]

With GBS set to 32 and MBS set to 1, each GPU runs 4 microbatches in the 8-GPU case and 2 microbatches in the 16-GPU case. The highlighted entries are the losses that match; the rest do not. Is there a bug in the SFT code, or is this the expected behavior?
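For reference, a quick sanity check of the microbatch counts above, assuming the usual Megatron relation microbatches_per_rank = GBS / (MBS × DP):

```python
# Microbatch counts implied by the reported config (GBS=32, MBS=1),
# assuming microbatches_per_rank = GBS // (MBS * DP).
gbs, mbs = 32, 1
for dp in (8, 16):
    print(f"DP={dp}: {gbs // (mbs * dp)} microbatches per rank")
# DP=8: 4 microbatches per rank
# DP=16: 2 microbatches per rank
```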

@lostkevin
Contributor

This is most likely the effect of Megatron's data sharding, which is enabled by default: it splits the dataset evenly across the DP ranks and then shuffles within each rank. Consider 8 samples [0, 1, 2, 3, 4, 5, 6, 7]. With DP=2 the data is split as rank0: [0,1,2,3], rank1: [4,5,6,7]; with DP=4 it is split as rank0: [0,1], rank1: [2,3], rank2: [4,5], rank3: [6,7].

Notice that the data at the start of DP4's rank2 coincides with the start of DP2's rank1, which matches the pattern in your screenshot. A minimal sketch of this sharding scheme follows.
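For illustration, a hypothetical Python sketch of the sharding behavior described above (the `shard_samples` helper and the per-rank seeding are assumptions for demonstration, not Megatron's actual implementation):

```python
import random

def shard_samples(samples, dp_size, seed=0):
    """Split sample indices contiguously across DP ranks, then shuffle within each rank."""
    per_rank = len(samples) // dp_size
    shards = []
    for rank in range(dp_size):
        # Contiguous slice for this rank (returns a copy, so the original is untouched).
        shard = samples[rank * per_rank : (rank + 1) * per_rank]
        # Shuffle only within the rank's own shard.
        random.Random(seed + rank).shuffle(shard)
        shards.append(shard)
    return shards

samples = list(range(8))
print(shard_samples(samples, dp_size=2))  # rank1 holds a permutation of [4, 5, 6, 7]
print(shard_samples(samples, dp_size=4))  # rank2 holds a permutation of [4, 5], drawn from
                                          # the same samples that begin DP2 rank1's shard
```

Because the shard boundaries move when DP changes, each rank sees a different sample order, so per-microbatch losses diverge after the first microbatch even though the global dataset is identical.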
