[Proposal] Support ClusterSpec Propagation Feature in TF 1.14 #1141
Comments
LGTM. It is a helpful feature.
@gaocegege The first PR is here #1142. Looks like I'm not allowed to add you as the reviewer. Can you take a look when you get a chance? Thanks.
I like the feature, sounds good! Is the current dynamic worker also supported in all training modes, e.g. allreduce, parameter server, or even sync/async training? I am also interested in implementing it, because my graduation thesis was exactly to implement an autoscaling and ps/worker location-aware scheduling controller based on tf-operator, which had a lot of limitations at the time. You can refer to my thesis. So I would also like to know more about your implementation details, or maybe we can work on this together. Thanks 😊
Cool! I'll submit a PR, which has been done internally, implemented in a very naive way based on the 4th item in the
Interesting feature 👍
Closed in #1149 |
Goals
Since TensorFlow 1.14, TensorFlow has supported the ClusterSpec Propagation feature, which "allows TensorFlow workers to be booted independently of each other, and with no knowledge about others". This essentially allows us to add/remove workers on-the-fly. Specifically, this makes two features possible: scaling the number of workers up, and scaling it down, while a job is running.

The goal of this proposal is to allow tf-operator to support this.
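As a rough illustration (not part of the original proposal), the "sparse" cluster form maps explicit task indices to addresses instead of listing every worker in order, so a worker can boot knowing only its own slot and the membership can change later. The host names below are made up, and a real TFJob would use the pods' DNS names:

```python
import json
import os

# Hypothetical sparse TF_CONFIG: "worker" is a dict keyed by task index,
# so worker 1 can be absent without renumbering workers 0 and 2.
sparse_tf_config = {
    "cluster": {
        "worker": {"0": "worker-0:2222", "2": "worker-2:2222"},
        "ps": {"0": "ps-0:2222"},
    },
    "task": {"type": "worker", "index": 0},
}

# The operator would set this on each container's environment.
os.environ["TF_CONFIG"] = json.dumps(sparse_tf_config)
print(os.environ["TF_CONFIG"])
```

Compare this with the dense form, where `"worker"` is a plain list and removing a worker shifts every later index.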
Current Issues

In order to support the new features, there are a couple of issues that need to be solved:

- The `ClusterSpec` issue mentioned here.

Implementation Details
The work can be divided into the following tasks:

- Add a new field `AllowDynamicWorker`.
- When `AllowDynamicWorker == true`, reconcile TFJobs every single time. The change needs to be made here.
- When `AllowDynamicWorker == true`, use the sparse form in `TF_CONFIG`, here.
- When `AllowDynamicWorker == true`, implement the scale-down logic, i.e., remove workers starting from the one with the largest index until the number of workers equals `replicas`, here. The same change needs to be done for `service` here and here.
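The scale-down rule in the last task can be sketched as follows. This is a hypothetical helper, not tf-operator code; it assumes the controller knows the set of active worker indices and the desired `replicas` count:

```python
def workers_to_delete(active_indices, replicas):
    """Return the worker indices to remove, largest index first,
    so that len(active_indices) - len(result) == replicas."""
    excess = len(active_indices) - replicas
    if excess <= 0:
        # Already at or below the desired replica count: nothing to delete.
        return []
    # Remove from the largest index down, as the proposal describes.
    return sorted(active_indices, reverse=True)[:excess]
```

Deleting from the largest index keeps the remaining indices dense at the low end, which plays well with the sparse `TF_CONFIG` form since surviving workers never need to be renumbered.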