When training DeepSeek in SFT mode, running the same job on 8 GPUs and on 16 GPUs produces first-iteration losses that do not match; both runs use a tp=1, pp=1 partitioning.
Further analysis shows that the loss of the first microbatch does match, but the subsequent ones do not. The first-iteration losses are shown in the figure.
With gbs=32 and mbs=1, each of the 8 GPUs runs 4 microbatches, while each of the 16 GPUs runs 2. The highlighted entries in the table are the losses that match; the rest do not. Is there a bug in the SFT code, or is this the expected behavior? (A quick sanity check of the microbatch counts follows below.)
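For reference, the per-rank microbatch counts quoted above follow from gbs / (mbs × DP size); with tp=1 and pp=1 the DP size equals the GPU count. A quick check:

```python
# Microbatches per DP rank = gbs / (mbs * dp_world_size);
# with tp=1 and pp=1, dp_world_size equals the GPU count.
gbs, mbs = 32, 1
for gpus in (8, 16):
    print(f"{gpus} GPUs -> {gbs // (mbs * gpus)} microbatches per DP rank")
# 8 GPUs -> 4 microbatches per DP rank
# 16 GPUs -> 2 microbatches per DP rank
```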
This is most likely caused by Megatron's data sharding, which is enabled by default: it splits the data evenly across the DP ranks and then shuffles within each rank. Consider 8 samples [0, 1, 2, 3, 4, 5, 6, 7]. With DP=2 the data is split as rank0: [0,1,2,3], rank1: [4,5,6,7]; with DP=4 it is split as rank0: [0,1], rank1: [2,3], rank2: [4,5], rank3: [6,7].
Note that DP4's rank2 starts with the same samples as DP2's rank1, which matches the pattern in your figure: only the overlapping leading samples produce identical losses.
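A minimal sketch of this partitioning behavior (not Megatron's actual implementation; `shard_samples` is a hypothetical helper, and the within-rank shuffle is omitted for clarity):

```python
def shard_samples(samples, dp_world_size):
    """Split samples into contiguous per-rank blocks, one per DP rank.
    Megatron then shuffles within each block, omitted here."""
    per_rank = len(samples) // dp_world_size
    return [samples[r * per_rank:(r + 1) * per_rank]
            for r in range(dp_world_size)]

samples = list(range(8))
print(shard_samples(samples, 2))  # DP2: [[0, 1, 2, 3], [4, 5, 6, 7]]
print(shard_samples(samples, 4))  # DP4: [[0, 1], [2, 3], [4, 5], [6, 7]]
# DP4 rank2 ([4, 5]) begins with the same samples as DP2 rank1 ([4, 5, 6, 7]),
# so the first microbatch losses on those ranks match while later ones diverge.
```

Under this scheme, changing the DP size changes which samples land on which rank (and in which order after the per-rank shuffle), so per-microbatch losses beyond the overlapping prefixes are not expected to match across the 8-GPU and 16-GPU runs.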