
[Proposal] Support ClusterSpec Propagation Feature in TF 1.14 #1141

Closed
zhujl1991 opened this issue Mar 12, 2020 · 7 comments

@zhujl1991
Member

Goals

Since version 1.14, TensorFlow has supported the ClusterSpec Propagation feature, which "allows TensorFlow workers to be booted independently of each other, and with no knowledge about others". This essentially allows us to add or remove workers on the fly. Specifically, it makes two features possible:

  1. Worker Failover: If a worker fails (e.g., OOM) or is evicted (e.g., due to insufficient resources), training continues. Once the failed worker restarts, it can rejoin the training job dynamically without interrupting the training process.
  2. Scale Workers Up/Down: During training, we can dynamically add or remove workers on the fly as needed. This is particularly helpful for online learning -- use more workers during peak time and fewer during off-peak time.

The goal of this proposal is to allow tf-operator to support this.
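
For concreteness, here is a minimal Python sketch against the TF 1.x API (hostnames and task indices are hypothetical) of what "booted independently" means: a worker starts with a sparse ClusterSpec that contains only its own address.

```python
import tensorflow as tf

# This worker boots knowing only its own address (task 3); it has no
# knowledge of the other workers in the job. ClusterSpec accepts a dict
# keyed by task index, so the job does not have to be dense.
cluster = tf.train.ClusterSpec({"worker": {3: "worker-3:2222"}})
server = tf.train.Server(cluster, job_name="worker", task_index=3)

# The full cluster is supplied later, at session-creation time, via
# tf.ConfigProto(cluster_def=...); the master then propagates it to the
# workers, which is what lets workers join and leave dynamically.
server.join()
```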

Current Issues

In order to support the new features, there are a couple of issues that need to be solved:

  1. Support the sparse ClusterSpec form mentioned here (see the TF_CONFIG sketch after this list).
  2. Support manually scaling workers up/down.
  3. Change the status update logic (e.g., a failed worker should not cause the whole training job to be marked as failed).
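
To make issue 1 concrete, this is a sketch of the sparse TF_CONFIG a worker pod would receive (the job name and hostnames are hypothetical, and it assumes TF's TF_CONFIG parsing accepts the dict form that tf.train.ClusterSpec accepts): the "worker" entry maps only the pod's own task index to its address instead of listing every replica.

```python
import json
import os

# Sparse form: "worker" is a dict keyed by task index rather than a
# dense list, so task 3 can boot without knowing the other workers.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "ps": ["myjob-ps-0:2222"],
        "worker": {"3": "myjob-worker-3:2222"},
    },
    "task": {"type": "worker", "index": 3},
})
```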

Implementation Details

The work can be divided into the following tasks:

  1. In TFJobSpec, add a boolean field AllowDynamicWorker.
  2. When AllowDynamicWorker == true, reconcile TFJobs on every sync pass. The change needs to be made here.
  3. When AllowDynamicWorker == true, use the sparse form in TF_CONFIG here.
  4. Handle the case where a worker index is larger than replicas here. When AllowDynamicWorker == true, implement the scale-down logic, i.e., remove workers starting from the one with the largest index until the number of workers equals replicas, here (see the sketch after this list). The same change needs to be made for services here and here.
  5. Change the status update logic here.
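
As an illustration of the scale-down ordering in task 4 (the real change lands in the Go controller; this Python sketch with a hypothetical helper name only shows the selection order):

```python
def workers_to_delete(active_indices, replicas):
    """Return the worker indices to remove, highest index first,
    so that exactly `replicas` workers remain."""
    surplus = len(active_indices) - replicas
    return sorted(active_indices, reverse=True)[:max(0, surplus)]

# With 5 running workers and replicas scaled down to 3,
# workers 4 and 3 are removed.
assert workers_to_delete([0, 1, 2, 3, 4], 3) == [4, 3]
```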

@gaocegege
Member

LGTM. It is a helpful feature.

/cc @richardsliu @johnugeorge

@zhujl1991
Member Author

@gaocegege The first PR is #1142. It looks like I'm not allowed to add you as a reviewer; can you take a look when you get a chance? Thanks.

@ChanYiLin
Member

I like the feature, sounds good!

Is the current dynamic worker support available in all training modes, e.g., allreduce, parameter server, or even sync/async training?

I am also interested in implementing it, because my graduation thesis was exactly this: implementing an autoscaling and PS/worker location-aware scheduling controller based on tf-operator, which had a lot of limitations at the time.

You can refer to my thesis
http://www.scitepress.org/DigitalLibrary/Link.aspx?doi=10.5220/0007707605690577
and our implementation called DRAGON
https://github.com/NTHU-LSALAB/DRAGON

So I would also like to know more about your implementation details, or maybe we can work on this together. Thanks 😊

@zhujl1991
Member Author

Cool! I'll submit a PR (already done internally) that implements item 4 of the Implementation Details in a very naive way. The PR already meets our needs for now; I think we can work together on top of it later to make the functionality more sophisticated.

@johnugeorge
Member

Interesting feature 👍

@zhujl1991
Member Author

Closed in #1149
