Skip to content
This repository has been archived by the owner on Oct 23, 2023. It is now read-only.

tensorflow.proto and respective changes #71

Merged
merged 5 commits into from
Jul 6, 2020

Conversation

pingsutw
Copy link
Member

@pingsutw pingsutw commented Jul 2, 2020

TL;DR

proto files and generated ones for TensorFlow Flyte plugin

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

How did you fix the bug, make the feature etc. Link to any design docs etc

Tracking Issue

N/A

Follow-up issue

N/A

@pingsutw
Copy link
Member Author

pingsutw commented Jul 2, 2020

/cc @kumare3

message DistributedTensorflowTrainingTask {
// number of worker, ps, chief replicas spawned in the cluster for this job
int32 workers = 1;
int32 ps = 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we put a command PS -> Parameter server?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also should it be "ps_replicas"

// number of worker, ps, chief replicas spawned in the cluster for this job
int32 workers = 1;
int32 ps = 2;
int32 chief = 3;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also a comment to explain what "chief" means. Should it be "chief_replicas". I acutally do not know this, can "chief" have different "min" and "max". From the CRD it seems that it can only be 1. In that case can we just omit this value here
https://github.com/kubeflow/tf-operator/blob/master/examples/crd/crd-v1.yaml#L37

In FlytePlugin it is possible to provide this as a configuration. Here is an example - https://github.com/lyft/flyteplugins/blob/master/go/tasks/plugins/k8s/spark/spark.go#L37

This value is configured as here
https://github.com/lyft/flyte/blob/master/kustomize/overlays/sandbox/propeller/plugins/spark/config.yaml#L2

Reason being, we can make it simpler for the user. But, ofcourse this depends on whether "chief" can have different values

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The chief replicas number could be 0.
TF has a multi-worker distributed strategy, they will only launch worker nodes without ps and chief.
example: https://github.com/kubeflow/tf-operator/blob/master/examples/v1/distribution_strategy/keras-API/multi_worker_tfjob.yaml
TF-operator will check spec whether contains Chief
https://github.com/kubeflow/tf-operator/blob/24798375fead22bb1a78f3039565ccf5a9fae017/pkg/controller.v1/tensorflow/status.go#L89-L142

@kumare3
Copy link
Contributor

kumare3 commented Jul 6, 2020

@pingsutw Thank you for the PR, just a couple comments. Otherwise this looks good. Ping me and we can merge it soon

@pingsutw
Copy link
Member Author

pingsutw commented Jul 6, 2020

@kumare3 Thanks for the review.
I just updated my PR.
update include:

  • rename protobuf
    ps -> ps_replicas,
    chief -> chief_replicas
  • add a comment for ps

kumare3
kumare3 previously approved these changes Jul 6, 2020
Copy link
Contributor

@kumare3 kumare3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm, qq; do the workers etc need mem/cpu?

@pingsutw
Copy link
Member Author

pingsutw commented Jul 6, 2020

Lgtm, qq; do the workers etc need mem/cpu?

I think it should be added, but I found mem/cpu also don't include in PyTorch proto.
In order to make it consistency, I didn't add them. Maybe we could add it later if I run into some problems in other PRs.
https://github.com/lyft/flyteidl/blob/master/protos/flyteidl/plugins/pytorch.proto

@katrogan
Copy link
Contributor

katrogan commented Jul 6, 2020

hey @pingsutw mind incrementing the version here and here as part of this change?

@pingsutw
Copy link
Member Author

pingsutw commented Jul 6, 2020

@katrogan Thanks for the reminder.
The new version is 0.17.35?

@pingsutw
Copy link
Member Author

pingsutw commented Jul 6, 2020

I've updated the version.

@kumare3 kumare3 self-requested a review July 6, 2020 17:25
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants