【Hackathon 6th Fundable Projects 3 No.13】barrier #67310
Your PR was submitted successfully. Thank you for contributing to this open-source project!
sendbuff, recvbuff, numel, dtype, nccl_red_type, comm->comm(), stream));
phi::backends::gpu::GpuStreamSync(stream);
VLOG(3) << "old NCCLCommContext has rid " << ring_id;
}
Operators under phi use only the new communication library, i.e. FLAGS_dynamic_static_unified_comm = True. So
keep only L46-L66 and delete L68-L75.
"not found in comm_context_manager.",
std::to_string(ring_id)));
auto comm_ctx = static_cast<phi::distributed::NCCLCommContext*>(
comm_context_manager.Get(std::to_string(ring_id)));
Singletons must not be used inside a phi kernel.
- Remove the comm_context_manager code and its header include.
- Change L55 to
auto comm_ctx = static_cast<distributed::NCCLCommContext*>(dev_ctx.GetCommContext());
template <typename T, typename Context>
void BarrierOpCUDAKernel(const Context& dev_ctx,
const DenseTensor& x_in,
Rename BarrierOpCUDAKernel to BarrierKernel.
phi::distributed::CommContextManager::GetInstance();
if (comm_context_manager.Has(std::to_string(ring_id))) {
auto* comm_context = static_cast<phi::distributed::GlooCommContext*>(
comm_context_manager.Get(std::to_string(ring_id)));
Singletons must not be used inside a phi kernel.
- Remove the comm_context_manager code and its header include.
- Replace L34-35 with
auto comm_ctx = static_cast<distributed::NCCLCommContext*>(dev_ctx.GetCommContext());
PADDLE_ENFORCE_NE(
comm_ctx,
nullptr,
errors::Unavailable("NCCLCommContext is nullptr, collective op should "
"has ring_id attr."));
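For readers outside the Paddle codebase, here is a minimal, self-contained sketch of the pattern being requested: the kernel reads its communication context from the device context it is handed, instead of looking it up through a process-wide singleton, and checks it for null before use. `CommContext`, `NCCLCommContext`, `MockDeviceContext`, and `BarrierKernelSketch` are hypothetical stand-ins written for this illustration, not the real Paddle classes, and a plain exception stands in for `PADDLE_ENFORCE_NE`.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical base type standing in for a communication context.
struct CommContext {
  virtual ~CommContext() = default;
};

// Hypothetical NCCL-flavoured context; Barrier() only counts calls here.
struct NCCLCommContext : CommContext {
  int barrier_calls = 0;
  void Barrier() { ++barrier_calls; }
};

// Hypothetical device context that carries the comm context with it.
struct MockDeviceContext {
  CommContext* comm_ctx = nullptr;
  CommContext* GetCommContext() const { return comm_ctx; }
};

// The kernel depends only on what dev_ctx provides; no global lookup.
template <typename Context>
void BarrierKernelSketch(const Context& dev_ctx) {
  auto* comm_ctx = static_cast<NCCLCommContext*>(dev_ctx.GetCommContext());
  if (comm_ctx == nullptr) {
    // Stand-in for the PADDLE_ENFORCE_NE check quoted in the review.
    throw std::runtime_error("NCCLCommContext is nullptr");
  }
  comm_ctx->Barrier();
}
```

Because the context arrives through `dev_ctx`, the sketch can be exercised with a locally constructed context and no global state.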
It looks like NCCL cannot be used in the CPU kernel. May I change it to auto comm_ctx =
static_cast<distributed::GlooCommContext*>(dev_ctx.GetCommContext()); instead?
template <typename T, typename Context>
void BarrierOpCUDAKernel(const Context& dev_ctx,
const DenseTensor& x_in,
int ring_id,
ring_id is not used in the kernel; remove it here.
@@ -455,6 +455,17 @@
data_type : param
inplace : (in_sum_1 -> out_sum_1), (in_sum_2 -> out_sum_2), (in_sum_3 -> out_sum_3), (in_num_accumulates -> out_num_accumulates), (in_old_num_accumulates -> out_old_num_accumulates), (in_num_updates -> out_num_updates)

- op : barrier
args : (Tensor x, int ring_id = 0)
Note: the ring_id here must not be removed.
template <typename T, typename Context>
void BarrierKernel(const Context& dev_ctx,
const DenseTensor& x_in,
int ring_id,
ring_id is not used in the kernel; remove it here.
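The thread asks for ring_id to stay in the op's args while being dropped from the kernel signature, without spelling out why. One plausible reading, stated here as an assumption rather than a description of Paddle's internals, is that an op-level step uses ring_id to choose a communication context and attaches it to the device context before the kernel runs. The sketch below illustrates only that idea; `RingCommContext`, `RingDeviceContext`, `BindCommContextByRingId`, and `BarrierKernelNoRingId` are hypothetical names invented for this example.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical communication context; Barrier() only counts calls here.
struct RingCommContext {
  int barrier_calls = 0;
  void Barrier() { ++barrier_calls; }
};

// Hypothetical device context that can have a comm context attached.
struct RingDeviceContext {
  RingCommContext* comm_ctx = nullptr;
  void SetCommContext(RingCommContext* ctx) { comm_ctx = ctx; }
  RingCommContext* GetCommContext() const { return comm_ctx; }
};

// Assumed op-level step: ring_id picks the context, outside the kernel.
// An unknown ring_id leaves the device context with a null comm context.
inline void BindCommContextByRingId(
    const std::map<std::string, RingCommContext*>& registry,
    int ring_id,
    RingDeviceContext* dev_ctx) {
  auto it = registry.find(std::to_string(ring_id));
  dev_ctx->SetCommContext(it == registry.end() ? nullptr : it->second);
}

// Kernel-level step: no ring_id parameter; it uses whatever was attached.
// Returns false when no context is attached, true after calling Barrier().
inline bool BarrierKernelNoRingId(const RingDeviceContext& dev_ctx) {
  RingCommContext* comm_ctx = dev_ctx.GetCommContext();
  if (comm_ctx == nullptr) {
    return false;
  }
  comm_ctx->Barrier();
  return true;
}
```

Under this assumption the attribute is still needed at the op level to select the ring, which would be consistent with keeping it in the YAML entry.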
Sorry to inform you that 9a050a3's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
PR Category
Operator Mechanism
PR Types
Others
Description
Migrate barrier; still looking into it...