tensorflow.proto and respective changes #71
/cc @kumare3
message DistributedTensorflowTrainingTask {
    // number of worker, ps, chief replicas spawned in the cluster for this job
    int32 workers = 1;
    int32 ps = 2;
Can we add a comment here: PS -> Parameter server?
Also, should it be "ps_replicas"?
    // number of worker, ps, chief replicas spawned in the cluster for this job
    int32 workers = 1;
    int32 ps = 2;
    int32 chief = 3;
Please also add a comment explaining what "chief" means. Should it be "chief_replicas"? I actually do not know this: can "chief" have different "min" and "max" values? From the CRD it seems that it can only be 1. In that case, can we just omit this value here?
https://github.com/kubeflow/tf-operator/blob/master/examples/crd/crd-v1.yaml#L37
In the Flyte plugin it is possible to provide this as configuration instead. Here is an example: https://github.com/lyft/flyteplugins/blob/master/go/tasks/plugins/k8s/spark/spark.go#L37
The value is configured here:
https://github.com/lyft/flyte/blob/master/kustomize/overlays/sandbox/propeller/plugins/spark/config.yaml#L2
The reason is that this would make things simpler for the user. But of course this depends on whether "chief" can have different values.
The number of chief replicas can be 0.
TF has a multi-worker distributed strategy that launches only worker nodes, without ps or chief.
example: https://github.com/kubeflow/tf-operator/blob/master/examples/v1/distribution_strategy/keras-API/multi_worker_tfjob.yaml
TF-operator checks whether the spec contains a Chief:
https://github.com/kubeflow/tf-operator/blob/24798375fead22bb1a78f3039565ccf5a9fae017/pkg/controller.v1/tensorflow/status.go#L89-L142
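For reference, a minimal sketch of how the message might look with the review suggestions above applied (explanatory comments plus the `ps_replicas` / `chief_replicas` names). The `syntax` line and the package name are assumptions added only so the fragment is self-contained; the field numbers are those from the diff, and the final naming is up to the PR author.

```protobuf
// Sketch only: syntax and package are assumed, not taken from the PR diff.
syntax = "proto3";

package flyteidl.plugins;

message DistributedTensorflowTrainingTask {
    // Number of worker replicas spawned in the cluster for this job.
    int32 workers = 1;

    // PS -> Parameter server.
    // Number of parameter server replicas spawned in the cluster for this job.
    int32 ps_replicas = 2;

    // Number of chief replicas spawned in the cluster for this job.
    // May be 0, e.g. for a multi-worker distribution strategy that runs workers only.
    int32 chief_replicas = 3;
}
```

Renaming a field while keeping its number does not change the proto3 binary encoding, so the rename would only affect generated code and JSON field names.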
@pingsutw Thank you for the PR. Just a couple of comments; otherwise this looks good. Ping me and we can merge it soon.
@kumare3 Thanks for the review.
LGTM. Quick question: do the workers etc. need mem/cpu?
I think it should be added, but I found
@katrogan Thanks for the reminder.
I've updated the version.
TL;DR
Proto files and generated code for the TensorFlow Flyte plugin.
Type
Are all requirements met?
Complete description
How did you fix the bug, make the feature etc. Link to any design docs etc
Tracking Issue
N/A
Follow-up issue
N/A