Paddle Cloud plan #1860

Closed
typhoonzero opened this issue Apr 24, 2017 · 9 comments

@typhoonzero
Contributor

typhoonzero commented Apr 24, 2017

Phase 1: demo-ready, and open to a subset of users (closed beta)

Deadline: 2017-05-31

paddle-cloud-design

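  • Only need to support Jupyter notebook on PaddlePaddle cloud when doing distributed training.
    • The Jupyter notebook needs to receive callbacks that include the cost, so it can plot on its own (similar to single-machine usage, where plotting happens in the callback).
    • PaddlePaddle Server --yanxu (if the first version only supports Jupyter notebook on cloud, this may not be needed)
      End user's train function talks to PaddlePaddle server, which invokes Docker to build images.
  • Paddle Cloud web page -- prototype first -- wuyi, yanxu, gongweibao
    • Write training code in the web page
    • Package and submit training jobs in the cloud
    • Visualize the training process (animated cost curve)
    • Monitor job running status
    • View training logs
    • View personal quota
    • Support inference applications (not considered for now)
  • RBAC--wuyi
    • Log in (register) with a Baidu account --> activate the account --> configure the namespace, etc.
  • Storage: GlusterFS -- weibao
    • Permissions and quotas
    • Uploading and sharding training data
    • Weigh performance and settle the deployment plan for the demo version. Using an internal storage system such as BFS would also bring an adaptation cost with Kubernetes.
  • Develop scale-up support for pserver and trainer
  • Investigate Network Policy for network isolation
    • Does not affect the demo, but it is required; otherwise the setup is insecure
  • GPU resources -- Done
  • v2 distributed training: emphasize scale-up and make the change visible.
    • Scale up training jobs
      • How do we show the effect after scaling up?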
  • Implement master program. (helin)
    • master,trainer,pserver service discovery. (helin)
    • master trainer communication. (helin)
  • Implement fault tolerant parameter server.
    • Do we need to rewrite the parameter server? How much effort is it to add fault tolerance in C++? If the effort is greater than or equal to that of a rewrite in golang, maybe we can rewrite it in golang.
    • Do we need to support sparse parameter update in v1?
    • What kind of update rule does parameter server need to support in v1? maybe only simple "add" (no momentum based).
  • Implement fault tolerant trainer.
    • able to scale up trainer.
    • it involves changes for python part and native part (c++ or golang). We need to define a clean C api for python to use.
  • Setup etcd service (no detail specification in the design doc, will not reuse etcd from k8s, according to distributed training: should we re-use etcd from k8s? #1807).
    • How to control etcd access namespace?
  • Upload custom dataset to cluster.
    • Do we need to support merging data files into one big custom file to speed up sequential reads? Is this performance work needed in the first version?
    • How can the trainer read the dataset, and be backward / forward compatible, maybe we need a reader "driver" for each dataset?

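Risks:
- In the long run, high-performance storage needs in-depth support.
- Developing the web pages takes real effort, and we have few people.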


Design docs:


4/24/2017 meeting minutes:
scope information for first version:
pserver:
⁃ TCP only, no RDMA support
⁃ sparse updates are out of scope
⁃ support dynamic scaling of trainers
⁃ synchronous SGD
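
To make the pserver scope above concrete, here is a minimal in-process sketch of synchronous SGD with a trainer count that can change between steps. It is only an illustration under assumptions: the class `SyncParameterServer` and its methods are hypothetical names, there is no networking, and the update is a plain averaged-gradient step (dense parameters only).

```python
from typing import Dict, List


class SyncParameterServer:
    """Toy stand-in for a pserver doing synchronous SGD (dense parameters only)."""

    def __init__(self, params: Dict[str, List[float]], num_trainers: int, lr: float):
        self.params = params
        self.num_trainers = num_trainers
        self.lr = lr
        self._pending: List[Dict[str, List[float]]] = []

    def set_num_trainers(self, n: int) -> None:
        # Assumption: scaling is only requested between steps, never mid-step.
        self.num_trainers = n

    def push_gradient(self, grads: Dict[str, List[float]]) -> bool:
        """Buffer one trainer's gradient; apply the update once all have arrived."""
        self._pending.append(grads)
        if len(self._pending) < self.num_trainers:
            return False
        self._apply()
        return True

    def _apply(self) -> None:
        n = len(self._pending)
        for name, values in self.params.items():
            for i in range(len(values)):
                avg = sum(g[name][i] for g in self._pending) / n
                values[i] -= self.lr * avg
        self._pending.clear()
```

The barrier in `push_gradient` is what makes the step synchronous: no parameter changes until every currently registered trainer has reported.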

trainer:
⁃ pserver client
⁃ fetch a task ID and process data task by task
⁃ dynamic scaling; the demo emphasizes scale-up and makes the change visible.
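
The trainer behaviour listed above (fetch a task ID, then process that task's data) could look roughly like the loop below. This is a sketch, not the planned implementation: `run_trainer` and its four callbacks are hypothetical, and the master is faked with a plain list.

```python
from typing import Callable, List, Optional


def run_trainer(fetch_task: Callable[[], Optional[int]],
                load_task: Callable[[int], List[float]],
                train_batch: Callable[[List[float]], None],
                report_done: Callable[[int], None]) -> int:
    """Pull task IDs until none are left; return how many tasks were finished."""
    finished = 0
    while True:
        task_id = fetch_task()
        if task_id is None:  # assumption: None means no more work
            break
        train_batch(load_task(task_id))
        report_done(task_id)
        finished += 1
    return finished


# Fake master state for the demo: three tasks, each a small list of records.
tasks = {0: [1.0, 2.0], 1: [3.0], 2: [4.0, 5.0]}
queue = [0, 1, 2]
seen: List[float] = []
done: List[int] = []

count = run_trainer(
    fetch_task=lambda: queue.pop(0) if queue else None,
    load_task=lambda task_id: tasks[task_id],
    train_batch=seen.extend,
    report_done=done.append,
)
```

Because a trainer only ever holds one task at a time, adding or removing trainers changes how fast the queue drains without changing the loop itself.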

master:
⁃ service discovery
⁃ task assignment
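
As a rough illustration of task assignment on the master side, the sketch below tracks todo, pending, and done tasks and puts a lost trainer's tasks back in the queue. `TaskDispatcher` and its method names are hypothetical; how the master learns that a trainer is gone is left out.

```python
from collections import deque
from typing import Deque, Dict, Optional, Set


class TaskDispatcher:
    """Toy master-side bookkeeping for handing tasks to trainers."""

    def __init__(self, num_tasks: int):
        self.todo: Deque[int] = deque(range(num_tasks))
        self.pending: Dict[int, str] = {}  # task id -> trainer id
        self.done: Set[int] = set()

    def assign(self, trainer_id: str) -> Optional[int]:
        """Give the next task to a trainer, or None if nothing is waiting."""
        if not self.todo:
            return None
        task_id = self.todo.popleft()
        self.pending[task_id] = trainer_id
        return task_id

    def finish(self, task_id: int) -> None:
        # Ignore reports for tasks that are not pending (e.g. already re-queued).
        if self.pending.pop(task_id, None) is not None:
            self.done.add(task_id)

    def trainer_lost(self, trainer_id: str) -> None:
        """Re-queue every task that the lost trainer was still holding."""
        lost = sorted(t for t, owner in self.pending.items() if owner == trainer_id)
        for task_id in lost:
            del self.pending[task_id]
            self.todo.append(task_id)
```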

paddle server:
⁃ build docker image on Kubernetes
⁃ start the paddle job

paddle client:
⁃ submit cluster jobs (Python code; add an optional argument for paddle.train, which contains the distributed training configuration.)
⁃ command line: paddle upload/download
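
One way to picture the "optional argument" idea for the client is sketched below. Nothing here is the real `paddle.train` signature: `train`, `DistTrainConfig`, `submit_job`, and every field name are hypothetical placeholders used only to show the shape of the API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DistTrainConfig:
    # Hypothetical fields; the issue does not specify the configuration contents.
    job_name: str
    trainers: int = 1
    pservers: int = 1


def submit_job(cfg: DistTrainConfig) -> str:
    # Placeholder for "package the code and start the job on the cluster".
    return f"submitted {cfg.job_name}: {cfg.trainers} trainers, {cfg.pservers} pservers"


def train(train_fn: Callable[[], str],
          dist_config: Optional[DistTrainConfig] = None) -> str:
    """Run locally by default; submit to the cluster when a config is given."""
    if dist_config is None:
        return train_fn()
    return submit_job(dist_config)


local_result = train(lambda: "trained locally")
cluster_result = train(lambda: "trained locally",
                       DistTrainConfig("demo", trainers=4, pservers=2))
```

Keeping the argument optional means existing single-machine scripts keep working unchanged.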

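  • Users are not allowed to plot inside distributed training; they can only print logs.
  • Paddle will provide a dynamic chart of the cost.
  • Whether the parameter server needs to be rewritten requires more investigation.
  • Consensus was reached on the PR question.
  • Division of work (see the issue comments, the 4/24/2017 plan)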
@jacquesqiao
Member

jacquesqiao commented Apr 24, 2017

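In phase 1, fault tolerance should be an important point.

Changed "fault tolerance" to "scaling up and showing the performance change".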

@helinwang
Contributor

helinwang commented Apr 24, 2017

4/24/2017 plan:

  • prototype master, trainer, pserver service discovery through etcd in golang.
  • look into if need to rewrite parameter server and trainer in golang.
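
For readers unfamiliar with the pattern, service discovery through a key-value store with expiring keys can be sketched as follows. This is an in-memory stand-in and does not use the etcd API; `LeaseRegistry`, the key layout, and the addresses are all made up for illustration, and the clock is passed in explicitly so the behaviour is deterministic.

```python
from typing import Dict, List, Tuple


class LeaseRegistry:
    """In-memory stand-in for a key-value store whose keys expire (not etcd)."""

    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[str, float]] = {}  # key -> (address, expiry)

    def register(self, key: str, address: str, ttl: float, now: float) -> None:
        # Calling register again before expiry acts as a keep-alive.
        self._entries[key] = (address, now + ttl)

    def discover(self, prefix: str, now: float) -> List[str]:
        """Return addresses of all unexpired entries under a key prefix."""
        return sorted(addr for key, (addr, expiry) in self._entries.items()
                      if key.startswith(prefix) and expiry > now)


registry = LeaseRegistry()
registry.register("/pserver/0", "10.0.0.1:7000", ttl=10, now=0)
registry.register("/pserver/1", "10.0.0.2:7000", ttl=10, now=5)
alive_early = registry.discover("/pserver/", now=8)
alive_late = registry.discover("/pserver/", now=12)  # entry 0 has expired
```

A process that stops refreshing its key simply disappears from `discover`, which is the property the master, trainers, and pservers would rely on.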

@jacquesqiao
Member

I'll look into the parameter server situation.

@typhoonzero
Contributor Author

typhoonzero commented Apr 24, 2017

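  • Unify user authentication between paddle cloud and k8s (using OpenID)
  • Kubernetes RBAC user permissions
  • Prototype web page login and Jupyter notebook launch using python + django + bootstrap
  • Plot a dynamic chart of the training cost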
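
The cost chart item can be pictured as a callback that records each reported cost and exposes a smoothed series for plotting. The sketch below assumes a hypothetical `CostEvent` carrying a batch ID and a cost; it does not use any real PaddlePaddle event class and leaves the actual drawing to whatever plotting tool the page uses.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CostEvent:
    # Hypothetical event shape: one cost value per finished batch.
    batch_id: int
    cost: float


@dataclass
class CostRecorder:
    window: int = 3
    costs: List[float] = field(default_factory=list)

    def __call__(self, event: CostEvent) -> None:
        """Use the recorder itself as the training callback."""
        self.costs.append(event.cost)

    def smoothed(self) -> List[float]:
        """Trailing moving average, one point per recorded cost."""
        out = []
        for i in range(len(self.costs)):
            chunk = self.costs[max(0, i - self.window + 1): i + 1]
            out.append(sum(chunk) / len(chunk))
        return out


recorder = CostRecorder(window=2)
for batch_id, cost in enumerate([4.0, 2.0, 3.0, 1.0]):
    recorder(CostEvent(batch_id, cost))
curve = recorder.smoothed()
```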

@Yancey1989
Contributor

  • Submit cluster jobs from Python code
  • Paddle server builds the Docker image and starts the training job.

@gongweibao
Contributor

  • Setup etcd service (no detail specification in the design doc, will not reuse etcd from k8s, according to distributed training: should we re-use etcd from k8s? #1807).

    • How to control etcd access namespace?
  • Upload custom dataset to cluster.

    • Permissions and quotas
    • Uploading and downloading training data
    • Do we need to support merging data files into one big custom file to speed up sequential reads? Is this performance work needed in the first version?
    • How can the trainer read the dataset, and be backward / forward compatible, maybe we need a reader "driver" for each dataset?
  • Implement fault tolerant trainer.

    • able to scale up trainer.
    • it involves changes for python part and native part (c++ or golang). We need to define a clean C api for python to use.
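
The reader "driver" question above could be approached with a small registry keyed by a format name that includes a version, so old datasets stay readable when a new format is added. This is only a sketch of that idea: `register_reader`, `open_dataset`, and the format names `csv-v1` and `lines-v1` are hypothetical, and datasets are passed in as strings to keep the example self-contained.

```python
import csv
import io
from typing import Callable, Dict, Iterator, List

Reader = Callable[[str], Iterator[List[str]]]
_READERS: Dict[str, Reader] = {}


def register_reader(fmt: str) -> Callable[[Reader], Reader]:
    """Decorator that registers a reader driver under a versioned format name."""
    def wrap(fn: Reader) -> Reader:
        _READERS[fmt] = fn
        return fn
    return wrap


@register_reader("csv-v1")
def read_csv(text: str) -> Iterator[List[str]]:
    yield from csv.reader(io.StringIO(text))


@register_reader("lines-v1")
def read_lines(text: str) -> Iterator[List[str]]:
    for line in text.splitlines():
        yield [line]


def open_dataset(fmt: str, payload: str) -> Iterator[List[str]]:
    """Look up the driver for a format and return its record iterator."""
    try:
        reader = _READERS[fmt]
    except KeyError:
        raise ValueError(f"no reader driver registered for format {fmt!r}")
    return reader(payload)
```

The trainer then only depends on `open_dataset`, and supporting a new dataset layout means registering one more driver.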

Uncertain:

  • paddlecloud web development

@dzhwinter
Contributor

4/24/2017 plan:
1. Look into the parameter server details: whether we need to add fault tolerance in C++, and auto scaling of the parameter server process.
2. Add parameter server checkpointing and recovery from a checkpoint file.
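
As a rough picture of checkpointing and recovery, the sketch below writes parameters to a temporary file and renames it into place, so a crash mid-write never leaves a half-written checkpoint. The function names and the JSON layout are assumptions for illustration; a real parameter server would use its own binary format.

```python
import json
import os
import tempfile
from typing import Dict, List, Optional


def save_checkpoint(path: str, step: int, params: Dict[str, List[float]]) -> None:
    """Write the checkpoint to a temp file, then atomically rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "params": params}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> Optional[dict]:
    """Return the last saved state, or None when starting from scratch."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)
```

On restart, a process would call `load_checkpoint` first and fall back to fresh initialization only when it returns `None`.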

@helinwang
Contributor

4/26/2017 plan

  1. master's detailed design doc.
  2. prototype master, trainer, pserver communication with etcd.
  3. file server with cephfs in golang.

@typhoonzero
Contributor Author

Closing this issue, we can track the status in "Project": https://github.com/PaddlePaddle/Paddle/projects/18


6 participants