
[AutoParallel] Fit allreduce_matmul_grad_overlapping when using master grad #61865

Conversation

AndSonder
Contributor

@AndSonder AndSonder commented Feb 20, 2024

PR types

Bug fixes

PR changes

Others

Description

Currently, when allreduce_matmul_grad_overlapping and master_grad are enabled at the same time, the cast op is placed at the wrong position, so it takes an uninitialized tensor as input, which then causes a kernel selection error:

[screenshot: kernel selection error]

To handle this case, we need to move the ops that depend on dy to after the second matmul introduced by allreduce_matmul_grad_overlapping (a sketch of the idea follows the figure below):

[figure: op order after moving the dy-dependent ops behind the second matmul]
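The reordering can be thought of as follows. This is a minimal, self-contained sketch of the idea, not the actual Paddle pass: ops are modeled as plain dicts, and `move_dy_dependent_ops`, `dy_name`, and `second_matmul_idx` are hypothetical names introduced here for illustration only.

```python
def move_dy_dependent_ops(ops, dy_name, second_matmul_idx):
    """Defer ops that read `dy` (e.g. the cast inserted by master_grad) until
    after the second matmul produced by splitting matmul_grad, so they never
    consume an uninitialized tensor."""
    deferred, kept = [], []
    for idx, op in enumerate(ops):
        if idx < second_matmul_idx and dy_name in op["inputs"]:
            deferred.append(op)  # would read dy before it is computed
        else:
            kept.append(op)
    # Re-insert the deferred ops right after the second matmul.
    insert_at = kept.index(ops[second_matmul_idx]) + 1
    return kept[:insert_at] + deferred + kept[insert_at:]


if __name__ == "__main__":
    # Hypothetical op sequence before the fix: the cast reads dy, but dy is
    # only produced by the second matmul further down the list.
    ops = [
        {"type": "matmul_grad_part1", "inputs": ["x", "out_grad"]},
        {"type": "cast", "inputs": ["dy"]},  # misplaced: dy not ready yet
        {"type": "c_allreduce_sum", "inputs": ["dx"]},
        {"type": "matmul_grad_part2", "inputs": ["x", "out_grad"]},  # produces dy
    ]
    reordered = move_dy_dependent_ops(ops, "dy", second_matmul_idx=3)
    print([op["type"] for op in reordered])
    # ['matmul_grad_part1', 'c_allreduce_sum', 'matmul_grad_part2', 'cast']
```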

Test environment:

  • PaddleNLP develop llama model (hidden_layer changed to 4)
  • 4-card 1080 Ti server

Testing shows that the llama model's loss matches the loss before this PR's changes.


paddle-bot bot commented Feb 20, 2024

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@paddle-bot paddle-bot bot added the contributor External developers label Feb 20, 2024
Contributor

@From00 From00 left a comment


LGTM

@From00 From00 merged commit 2823a59 into PaddlePaddle:develop Feb 26, 2024
30 checks passed
@AndSonder AndSonder deleted the fit_allreduce_matmul_grad_overlapping_when_open_master_grad branch April 23, 2024 13:56
Labels
contributor External developers