Revisions to the "Running Distributed Training" document #1942

Closed
luotao1 opened this issue Apr 28, 2017 · 3 comments
Contributor

luotao1 commented Apr 28, 2017

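Reason for the change: with the v2 API, the way training jobs are run on an MPI cluster has also changed, so the content and title of the corresponding document need to change as well. In addition, because we already have Kubernetes documentation, I suggest titling this document "MPI Distributed Training".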

Files to modify: one Chinese-language file

Web page link: "Running Distributed Training" on the develop branch

@Yancey1989
Contributor

Training with the v2 API on a Kubernetes cluster is still under development. PR: #1906

@typhoonzero
Contributor

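As I understand it, the current example uses fabric and scripts to start the pserver and trainer separately on the cluster. Should we add instructions for using OpenMPI, or keep the current fabric approach?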

@typhoonzero typhoonzero self-assigned this May 2, 2017
@typhoonzero
Contributor

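I have discussed this with @luotao1. Since no document currently describes in detail how to start the pserver and trainer, this document will focus on how to launch and configure distributed training. It will briefly mention that fabric or MPI management tools can be used to manage the cluster, without going into detail.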
