
[CRD] Request for input and output dirs in TFJobSpec #224

Closed
gaocegege opened this issue Dec 14, 2017 · 7 comments

gaocegege commented Dec 14, 2017

A typical training job has one input dir, DataDir, and three output dirs: ModelDir, LogDir, and ExportDir.

DataDir is the path to the dataset, ModelDir is the path to checkpoints, graphs, etc., LogDir is the path to the logs, and ExportDir is the path to the saved model.

Users often need to specify these dirs, so could we add them as optional fields in TfJobSpec?

For example,

```diff
// TODO(jlewi): Need to define the actual configuration for the TensorFlow TfJob.
type TfJobSpec struct {
	// TODO(jlewi): Can we get rid of this and use some value from Kubernetes or a random id.
	RuntimeId string

+	// DataDir specifies the path to the dataset.
+	DataDir string `json:"dataDir,omitempty"`
+	// ModelDir specifies the path to checkpoints, graphs, etc.
+	ModelDir string `json:"modelDir,omitempty"`
+	// LogDir specifies the path to the tf.events files.
+	LogDir string `json:"logDir,omitempty"`
+	// ExportDir specifies the path to the saved model.
+	ExportDir string `json:"exportDir,omitempty"`

	// TensorBoard specifies the configuration to start a TensorBoard deployment.
	TensorBoard *TensorBoardSpec `json:"tensorboard"`

	// ReplicaSpecs specifies the TF replicas to run.
	ReplicaSpecs []*TfReplicaSpec `json:"replicaSpecs"`

	// TfImage defines the TensorFlow docker image that should be used for TensorBoard
	// and the default parameter server.
	TfImage string `json:"tfImage,omitempty"`

	// TerminationPolicy specifies the condition under which the TfJob should be considered finished.
	TerminationPolicy *TerminationPolicySpec `json:"terminationPolicy,omitempty"`
}
```

/cc @DjangoPeng

label/feature

jlewi commented Dec 15, 2017

How would these values be passed to the user's code?

What's the advantage of this proposal over just letting the user specify command line arguments or environment variables as they see fit?

Why not let users rely on their tool of choice (e.g. helm or ksonnet) to specify the TfJob spec conveniently?

DjangoPeng commented Dec 15, 2017

@jlewi Hi Jeremy, I think it's a trade-off between flexibility and complexity.

We can divide users into two kinds according to their familiarity with TensorFlow: the TensorFlow expert, who already knows how to train, visualize, and serve models, and the TensorFlow newbie, who is still learning the TensorFlow API.

  • For TensorFlow experts, the optional fields above would not affect them at all.
  • For the TensorFlow newbie, the optional fields would be a kind of suggestion.

jlewi commented Dec 15, 2017

> For the TensorFlow newbie, the optional fields would be a kind of suggestion.

Suppose I don't understand TensorFlow very well and I just copied someone else's TensorFlow example. If I look at the proposed TfJob spec, I'd probably expect that just specifying LogDir, DataDir, and ModelDir would set the corresponding values for my TF program.

But this would most likely not work.

First, I'd need to understand how TfJob passes those values into the TensorFlow program, for example, which command line arguments it uses.

I'd then need to figure out which command line arguments my program expects.

Only if TfJob used the same command line arguments as my TF program would this work.

So a user has to understand how information is passed into their TF program (e.g. which command line arguments it uses) regardless. It's not clear to me that having TfJob give special treatment to certain arguments really helps; it seems like this just adds another layer that users would have to understand.
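To make the mismatch concrete, here is a minimal sketch of the kind of translation TfJob would have to do. The function and flag names (`buildArgs`, `--data_dir`, etc.) are assumptions for illustration only, not part of TfJob; nothing guarantees the user's TF program parses these exact flags.

```go
package main

import "fmt"

// buildArgs translates spec-level dirs into hypothetical container flags.
// Empty fields are skipped, mirroring the omitempty proposal above.
func buildArgs(dataDir, modelDir, logDir, exportDir string) []string {
	var args []string
	pairs := []struct{ flag, val string }{
		{"--data_dir", dataDir},
		{"--model_dir", modelDir},
		{"--log_dir", logDir},
		{"--export_dir", exportDir},
	}
	for _, p := range pairs {
		if p.val != "" {
			args = append(args, p.flag+"="+p.val)
		}
	}
	return args
}

func main() {
	// Only the set dirs are emitted; the user's program must accept
	// exactly these flag names for the fields to have any effect.
	fmt.Println(buildArgs("/mnt/data", "", "/mnt/logs", ""))
	// → [--data_dir=/mnt/data --log_dir=/mnt/logs]
}
```

If the user's script instead expects, say, `--train_dir`, the spec fields silently do nothing, which is the layering concern raised here.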

DjangoPeng commented

@jlewi Thanks for your replies. Since TensorFlow is just an ML library rather than a framework such as Hadoop/Spark, we have to trade off user flexibility against system completeness. Of course, we can expose all interfaces and configs to users, but in that case we only provide a tool to launch TensorFlow training jobs, and it is hard to provide a Training-TensorBoard-Serving pipeline. The key point is the scope of our goal.

jlewi commented Dec 21, 2017

@DjangoPeng I agree there's room for providing simpler APIs and tools. My preference, though, would be to build this as a separate layer on top of the core layers.

For example, if you want a simpler and more constrained API than TfJobSpec, why not define a new CRD with the desired API and implement it as a lambda, using kube-metacontroller, that just converts the spec to a TfJobSpec?

The reason I like this layering approach is that the higher up the stack you go, the more variability there will be. For example, we might have multiple lambdas corresponding to different TF frameworks (such as the TF Estimator API).

If there's a clear separation between the higher-level tooling and the underlying core APIs, it's easier to experiment with a variety of different higher-level APIs without affecting the core components.

If we start adding fields to the TfJob spec to make certain use cases easier, I think we are more likely to wind up in a place where the TfJob API isn't clean, and this will complicate the implementation, since we will have to add logic to handle all the different cases.
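The layering idea can be sketched as a pure conversion function, the core of what a metacontroller-style lambda would do. `SimpleTFJobSpec` is an imagined higher-level CRD, and the `TfJobSpec`/`TfReplicaSpec` types here are simplified stand-ins for the real ones; the flag names are assumptions a real lambda would pin to a known convention (e.g. the TF Estimator API).

```go
package main

import "fmt"

// SimpleTFJobSpec is a hypothetical, more constrained API surface.
type SimpleTFJobSpec struct {
	DataDir  string
	ModelDir string
	Workers  int
}

// Simplified stand-ins for the core types; the real specs carry more fields.
type TfReplicaSpec struct {
	Replicas int
	Args     []string
}

type TfJobSpec struct {
	ReplicaSpecs []*TfReplicaSpec
}

// convert lowers the simple spec into the core TfJobSpec. The convenience
// fields live only in the higher layer; the core API stays unchanged.
func convert(s SimpleTFJobSpec) TfJobSpec {
	// Assumed flag convention for illustration only.
	args := []string{"--data_dir=" + s.DataDir, "--model_dir=" + s.ModelDir}
	return TfJobSpec{
		ReplicaSpecs: []*TfReplicaSpec{{Replicas: s.Workers, Args: args}},
	}
}

func main() {
	out := convert(SimpleTFJobSpec{DataDir: "/data", ModelDir: "/model", Workers: 2})
	fmt.Println(out.ReplicaSpecs[0].Replicas, out.ReplicaSpecs[0].Args)
	// → 2 [--data_dir=/data --model_dir=/model]
}
```

Because the conversion is a separate layer, several such lambdas could coexist for different TF frameworks without any of them touching the core spec.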

jlewi commented Jan 25, 2018

@gaocegege @DjangoPeng What do you think? Can we close this or do you think we still need these fields?

jlewi added the area/api label Jan 25, 2018

gaocegege commented

I think we could close the issue.
