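Reason for change: With the v2 API, the way training jobs are run on an MPI cluster has also changed, so the content and title of the corresponding document need to change as well. In addition, because we already have a Kubernetes document, I suggest titling this one "MPI Distributed Training".
File to modify: one Chinese-language file
Web link: "Running Distributed Training" on the develop branch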
Training on a Kubernetes cluster with the v2 API is still under development. PR: #1906
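As I understand it, the current example uses fabric and scripts to start the pserver and trainer separately on the cluster. Should we add a version that uses OpenMPI, or keep the current fabric approach?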
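Discussed with @luotao1. Since there is currently no document that describes in detail how to start the pserver and trainer, this document will focus on how to launch and configure distributed training. It will briefly mention that fabric or an MPI management tool can be used to manage the cluster, without going into detail.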
typhoonzero