
【Hackathon 6th Fundable Projects 3 No.13】barrier #67310

Closed

Conversation

smallpoxscattered
Contributor

PR Category

Operator Mechanism

PR Types

Others

Description

Migrating barrier; still looking into it...


paddle-bot bot commented Aug 11, 2024

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI first. See the Paddle CI Manual for details.

@paddle-bot paddle-bot bot added the contributor External developers label Aug 11, 2024
@smallpoxscattered smallpoxscattered changed the title 【Hackathon 6th Fundable Projects 3 No.13】barrie 【Hackathon 6th Fundable Projects 3 No.13】barrier Aug 12, 2024
sendbuff, recvbuff, numel, dtype, nccl_red_type, comm->comm(), stream));
phi::backends::gpu::GpuStreamSync(stream);
VLOG(3) << "old NCCLCommContext has rid " << ring_id;
}
Contributor

@liym27 liym27 Aug 12, 2024


Operators under phi only use the new communication library, i.e. FLAGS_dynamic_static_unified_comm = True. So keep only L46-L66 and delete L68-L75.

"not found in comm_context_manager.",
std::to_string(ring_id)));
auto comm_ctx = static_cast<phi::distributed::NCCLCommContext*>(
comm_context_manager.Get(std::to_string(ring_id)));
Contributor

@liym27 liym27 Aug 12, 2024


Singletons must not be used in phi kernels.

  1. Remove the comm_context_manager related code and header files.
  2. Change L55 to
    auto comm_ctx = static_cast<distributed::NCCLCommContext*>(dev_ctx.GetCommContext());
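
For context, a minimal sketch of what the migrated GPU kernel could look like after both review suggestions (new communication context only, no singleton manager). The one-element all-reduce mirrors the original fluid barrier op; the exact NCCLCommContext::AllReduce signature used here is an assumption, not taken from this PR:

template <typename T, typename Context>
void BarrierKernel(const Context& dev_ctx,
                   const DenseTensor& x_in,
                   DenseTensor* out) {
  // The comm context comes from the device context instead of a singleton manager.
  auto comm_ctx = static_cast<distributed::NCCLCommContext*>(
      dev_ctx.GetCommContext());
  PADDLE_ENFORCE_NE(
      comm_ctx,
      nullptr,
      errors::Unavailable("NCCLCommContext is nullptr, collective op should "
                          "has ring_id attr."));
  dev_ctx.template Alloc<T>(out);
  auto stream = dev_ctx.stream();
  // Barrier = all-reduce a one-element tensor, then synchronize the stream.
  comm_ctx->AllReduce(out, x_in, ncclSum, stream);
  phi::backends::gpu::GpuStreamSync(stream);
}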


template <typename T, typename Context>
void BarrierOpCUDAKernel(const Context& dev_ctx,
const DenseTensor& x_in,
Contributor


Rename BarrierOpCUDAKernel to BarrierKernel.

phi::distributed::CommContextManager::GetInstance();
if (comm_context_manager.Has(std::to_string(ring_id))) {
auto* comm_context = static_cast<phi::distributed::GlooCommContext*>(
comm_context_manager.Get(std::to_string(ring_id)));
Contributor


Singletons must not be used in phi kernels.

  1. Remove the comm_context_manager related code and header files.
  2. Change L34-35 to
auto comm_ctx = static_cast<distributed::NCCLCommContext*>(dev_ctx.GetCommContext());
PADDLE_ENFORCE_NE(
      comm_ctx,
      nullptr,
      errors::Unavailable("NCCLCommContext is nullptr, collective op should "
                          "has ring_id attr."));

Contributor Author


Hi, it seems NCCL cannot be used in the CPU kernel. May I change it to auto comm_ctx =
static_cast<distributed::GlooCommContext*>(dev_ctx.GetCommContext()); instead?
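
A minimal sketch of the CPU kernel with that change, for reference. Whether GlooCommContext exposes a Barrier() helper is an assumption here; if it does not, an all-reduce on a one-element tensor would achieve the same synchronization:

template <typename T, typename Context>
void BarrierKernel(const Context& dev_ctx,
                   const DenseTensor& x_in,
                   DenseTensor* out) {
  // On CPU the bound communication context is Gloo-based rather than NCCL.
  auto comm_ctx = static_cast<distributed::GlooCommContext*>(
      dev_ctx.GetCommContext());
  PADDLE_ENFORCE_NE(
      comm_ctx,
      nullptr,
      errors::Unavailable("GlooCommContext is nullptr, collective op should "
                          "has ring_id attr."));
  // Assumption: Barrier() blocks until every rank in the ring reaches it.
  comm_ctx->Barrier();
}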

template <typename T, typename Context>
void BarrierOpCUDAKernel(const Context& dev_ctx,
const DenseTensor& x_in,
int ring_id,
Contributor


ring_id is not used in the kernel; delete it here.

@@ -455,6 +455,17 @@
data_type : param
inplace : (in_sum_1 -> out_sum_1), (in_sum_2 -> out_sum_2), (in_sum_3 -> out_sum_3), (in_num_accumulates -> out_num_accumulates), (in_old_num_accumulates -> out_old_num_accumulates), (in_num_updates -> out_num_updates)

- op : barrier
args : (Tensor x, int ring_id = 0)
Contributor


Note: the ring_id here must not be removed.

template <typename T, typename Context>
void BarrierKernel(const Context& dev_ctx,
const DenseTensor& x_in,
int ring_id,
Contributor


ring_id is not used in the kernel; delete it here.
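
Putting the two comments together, a hypothetical final shape: ring_id stays in the ops.yaml args (Tensor x, int ring_id = 0) but is dropped from the kernel parameter list, since the bound communication context already identifies the ring.

// Hypothetical final signature (not from this PR):
template <typename T, typename Context>
void BarrierKernel(const Context& dev_ctx,
                   const DenseTensor& x_in,
                   DenseTensor* out);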

@PaddlePaddle PaddlePaddle locked and limited conversation to collaborators Aug 14, 2024
@PaddlePaddle PaddlePaddle unlocked this conversation Aug 14, 2024

paddle-ci-bot bot commented Aug 31, 2024

Sorry to inform you that 9a050a3's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@luotao1 luotao1 closed this Dec 3, 2024