
[Proposal] Support ClusterSpec Propagation Feature in TF 1.14 #1141

Closed
zhujl1991 opened this issue Mar 12, 2020 · 7 comments

@zhujl1991
Member

Goals

Since version 1.14, TensorFlow has supported the ClusterSpec Propagation feature, which "allows TensorFlow workers to be booted independently of each other, and with no knowledge about others". This essentially allows us to add or remove workers on the fly. Specifically, it makes two features possible:

  1. Worker Failover: If a worker fails (e.g., OOM) or is evicted (e.g., due to insufficient resources), training continues. Once the failed worker restarts, it can rejoin the training job dynamically without interrupting the training process.
  2. Scale Workers Up/Down: During training, we can dynamically add or remove workers on the fly as needed. This is particularly helpful for online learning -- use more workers during peak time and fewer during off-peak time.

The goal of this proposal is to allow tf-operator to support this.
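
For concreteness, here is a minimal Python sketch against the TF 1.x API (hostnames and task indices are hypothetical) of what "booted independently" means: a worker starts with a sparse ClusterSpec that contains only its own address.

```python
import tensorflow as tf

# This worker boots knowing only its own address (task 3); it has no
# knowledge of the other workers in the job. ClusterSpec accepts a dict
# keyed by task index, so the job does not have to be dense.
cluster = tf.train.ClusterSpec({"worker": {3: "worker-3:2222"}})
server = tf.train.Server(cluster, job_name="worker", task_index=3)

# The full cluster is supplied later, at session-creation time, via
# tf.ConfigProto(cluster_def=...); the master then propagates it to the
# workers, which is what lets workers join and leave dynamically.
server.join()
```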

Current Issues

In order to support the new features, there are a couple of issues that need to be solved:

  1. Support the sparse ClusterSpec form mentioned here (see the TF_CONFIG sketch after this list).
  2. Support manually scaling workers up/down.
  3. Change the status update logic (e.g., a failed worker should not cause the whole training job to be marked as failed).
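
To make issue 1 concrete, this is a sketch of the sparse TF_CONFIG a worker pod would receive (the job name and hostnames are hypothetical, and it assumes TF's TF_CONFIG parsing accepts the dict form that tf.train.ClusterSpec accepts): the "worker" entry maps only the pod's own task index to its address instead of listing every replica.

```python
import json
import os

# Sparse form: "worker" is a dict keyed by task index rather than a
# dense list, so task 3 can boot without knowing the other workers.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "ps": ["myjob-ps-0:2222"],
        "worker": {"3": "myjob-worker-3:2222"},
    },
    "task": {"type": "worker", "index": 3},
})
```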

Implementation Details

The work can be divided into the following tasks:

  1. In TFJobSpec, add a boolean field AllowDynamicWorker.
  2. When AllowDynamicWorker == true, reconcile TFJobs on every sync pass. The change needs to be made here.
  3. When AllowDynamicWorker == true, use the sparse form in TF_CONFIG here.
  4. Handle the case where a worker index is larger than replicas here. When AllowDynamicWorker == true, implement the scale-down logic, i.e., remove workers starting from the one with the largest index until the number of workers equals replicas, here (see the sketch after this list). The same change needs to be made for services here and here.
  5. Change the status update logic here.
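
As an illustration of the scale-down ordering in task 4 (the real change lands in the Go controller; this Python sketch with a hypothetical helper name only shows the selection order):

```python
def workers_to_delete(active_indices, replicas):
    """Return the worker indices to remove, highest index first,
    so that exactly `replicas` workers remain."""
    surplus = len(active_indices) - replicas
    return sorted(active_indices, reverse=True)[:max(0, surplus)]

# With 5 running workers and replicas scaled down to 3,
# workers 4 and 3 are removed.
assert workers_to_delete([0, 1, 2, 3, 4], 3) == [4, 3]
```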

@gaocegege
Member

LGTM. It is a helpful feature.

/cc @richardsliu @johnugeorge

@zhujl1991
Member Author

@gaocegege The first PR is #1142. It looks like I'm not allowed to add you as a reviewer; can you take a look when you get a chance? Thanks.

@ChanYiLin
Member

I like the feature, sounds good!

Is the current dynamic worker support available in all training modes, e.g., allreduce, parameter server, or even sync/async training?

I am also interested in implementing it, because my graduation thesis was exactly this: implementing an autoscaling and PS/worker location-aware scheduling controller based on tf-operator, which had a lot of limitations at the time.

You can refer to my thesis
http://www.scitepress.org/DigitalLibrary/Link.aspx?doi=10.5220/0007707605690577
and our implementation called DRAGON
https://github.com/NTHU-LSALAB/DRAGON

So I would also like to know more about your implementation details, or maybe we can work on this together. Thanks 😊

@zhujl1991
Member Author

Cool! I'll submit a PR (already done internally) that implements item 4 of the Implementation Details in a very naive way. The PR already meets our needs for now; I think we can work together on top of it later to make the functionality more sophisticated.

@johnugeorge
Member

Interesting feature 👍

@zhujl1991
Member Author

Closed in #1149
