
[CRD] Request for input and output dirs in TFJobSpec #224

Closed
gaocegege opened this issue Dec 14, 2017 · 7 comments

gaocegege commented Dec 14, 2017

A typical training job has one input dir, DataDir, and three output dirs: ModelDir, LogDir, and ExportDir.

DataDir is the path to the dataset, ModelDir is the path to checkpoints, graphs, etc., LogDir is the path to the logs, and ExportDir is the path to the saved model.

Users often need to specify these dirs, so could we add them as optional fields in TfJobSpec?

For example,

```diff
// TODO(jlewi): Need to define the actual configuration for the TensorFlow TfJob.
type TfJobSpec struct {
	// TODO(jlewi): Can we get rid of this and use some value from Kubernetes or a random id.
	RuntimeId string

+	// DataDir specifies the path to the dataset.
+	DataDir string `json:"dataDir,omitempty"`
+	// ModelDir specifies the path to checkpoints, graphs, etc.
+	ModelDir string `json:"modelDir,omitempty"`
+	// LogDir specifies the path to the tf.events files.
+	LogDir string `json:"logDir,omitempty"`
+	// ExportDir specifies the path to the saved model.
+	ExportDir string `json:"exportDir,omitempty"`

	// TensorBoard specifies the configuration to start a TensorBoard deployment.
	TensorBoard *TensorBoardSpec `json:"tensorboard"`

	// ReplicaSpecs specifies the TF replicas to run.
	ReplicaSpecs []*TfReplicaSpec `json:"replicaSpecs"`

	// TfImage defines the TensorFlow docker image that should be used for TensorBoard
	// and the default parameter server.
	TfImage string `json:"tfImage,omitempty"`

	// TerminationPolicy specifies the condition under which the TfJob should be considered finished.
	TerminationPolicy *TerminationPolicySpec `json:"terminationPolicy,omitempty"`
}
```

/cc @DjangoPeng

label/feature

jlewi commented Dec 15, 2017

How would these values be passed to the user's code?

What's the advantage of this proposal over just letting the user specify command line arguments or environment variables as they see fit?

Why not let users rely on their tool of choice (e.g. helm or ksonnet) to specify the TfJob spec conveniently?

DjangoPeng commented Dec 15, 2017

@jlewi Hi Jeremy, I think it's a trade-off between flexibility and complexity.

We can divide users into two kinds according to their familiarity with TensorFlow: the TensorFlow expert, who already knows how to train, visualize, and serve models, and the TensorFlow newbie, who is still learning the TensorFlow API.

  • For TensorFlow experts, the optional fields above would not affect them at all.
  • For the TensorFlow newbie, the optional fields would be a kind of suggestion.

jlewi commented Dec 15, 2017

> For the TensorFlow newbie, the optional fields would be a kind of suggestion.

Suppose I don't understand TensorFlow very well and I just copied someone else's TensorFlow example. If I look at the proposed TfJob spec, I'd probably expect that just specifying LogDir, DataDir, and ModelDir would set the corresponding values for my TF program.

But this would most likely not work.

First, I'd need to understand how TfJob passes those values into the TensorFlow program, for example, which command line arguments it uses.

I'd then need to figure out which command line arguments my program expects.

Only if TfJob used the same command line arguments as my TF program would this work.

So a user has to understand how information is passed into their TF program (e.g. which command line arguments it uses) regardless. It's not clear to me that having TfJob give special treatment to certain arguments really helps; it seems like this just adds another layer that users would have to understand.
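To make the mismatch concrete, here is a minimal sketch of the kind of translation TfJob would have to do. The function and flag names (`buildArgs`, `--data_dir`, etc.) are assumptions for illustration only, not part of TfJob; nothing guarantees the user's TF program parses these exact flags.

```go
package main

import "fmt"

// buildArgs translates spec-level dirs into hypothetical container flags.
// Empty fields are skipped, mirroring the omitempty proposal above.
func buildArgs(dataDir, modelDir, logDir, exportDir string) []string {
	var args []string
	pairs := []struct{ flag, val string }{
		{"--data_dir", dataDir},
		{"--model_dir", modelDir},
		{"--log_dir", logDir},
		{"--export_dir", exportDir},
	}
	for _, p := range pairs {
		if p.val != "" {
			args = append(args, p.flag+"="+p.val)
		}
	}
	return args
}

func main() {
	// Only the set dirs are emitted; the user's program must accept
	// exactly these flag names for the fields to have any effect.
	fmt.Println(buildArgs("/mnt/data", "", "/mnt/logs", ""))
	// → [--data_dir=/mnt/data --log_dir=/mnt/logs]
}
```

If the user's script instead expects, say, `--train_dir`, the spec fields silently do nothing, which is the layering concern raised here.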

DjangoPeng commented

@jlewi Thanks for your replies. Since TensorFlow is just an ML library rather than a framework such as Hadoop/Spark, we have to trade off user flexibility against system completeness. Of course, we can expose all interfaces and configs to users, but in that case we only provide a tool to launch TensorFlow training jobs, and it is hard to provide a Training-TensorBoard-Serving pipeline. The key point is the scope of our goal.

jlewi commented Dec 21, 2017

@DjangoPeng I agree there's room for providing simpler APIs and tools. My preference, though, would be to build this as a separate layer on top of the core layers.

For example, if you want a simpler and more constrained API than TfJobSpec, why not define a new CRD with the desired API and implement it as a lambda, using kube-metacontroller, that just converts the spec to a TfJobSpec?

The reason I like this layering approach is that the higher up the stack you go, the more variability there will be. For example, we might have multiple lambdas corresponding to different TF frameworks (such as the TF Estimator API).

If there's a clear separation between the higher-level tooling and the underlying core APIs, it's easier to experiment with a variety of different higher-level APIs without affecting the core components.

If we start adding fields to the TfJob spec to make certain use cases easier, I think we are more likely to wind up in a place where the TfJob API isn't clean, and this will complicate the implementation, since we will have to add logic to handle all the different cases.
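The layering idea can be sketched as a pure conversion function, the core of what a metacontroller-style lambda would do. `SimpleTFJobSpec` is an imagined higher-level CRD, and the `TfJobSpec`/`TfReplicaSpec` types here are simplified stand-ins for the real ones; the flag names are assumptions a real lambda would pin to a known convention (e.g. the TF Estimator API).

```go
package main

import "fmt"

// SimpleTFJobSpec is a hypothetical, more constrained API surface.
type SimpleTFJobSpec struct {
	DataDir  string
	ModelDir string
	Workers  int
}

// Simplified stand-ins for the core types; the real specs carry more fields.
type TfReplicaSpec struct {
	Replicas int
	Args     []string
}

type TfJobSpec struct {
	ReplicaSpecs []*TfReplicaSpec
}

// convert lowers the simple spec into the core TfJobSpec. The convenience
// fields live only in the higher layer; the core API stays unchanged.
func convert(s SimpleTFJobSpec) TfJobSpec {
	// Assumed flag convention for illustration only.
	args := []string{"--data_dir=" + s.DataDir, "--model_dir=" + s.ModelDir}
	return TfJobSpec{
		ReplicaSpecs: []*TfReplicaSpec{{Replicas: s.Workers, Args: args}},
	}
}

func main() {
	out := convert(SimpleTFJobSpec{DataDir: "/data", ModelDir: "/model", Workers: 2})
	fmt.Println(out.ReplicaSpecs[0].Replicas, out.ReplicaSpecs[0].Args)
	// → 2 [--data_dir=/data --model_dir=/model]
}
```

Because the conversion is a separate layer, several such lambdas could coexist for different TF frameworks without any of them touching the core spec.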

jlewi commented Jan 25, 2018

@gaocegege @DjangoPeng What do you think? Can we close this or do you think we still need these fields?

jlewi added the area/api label Jan 25, 2018

gaocegege commented

I think we could close the issue.
