fix broken link (#305)
Gaojin CAO authored and jlewi committed Jan 15, 2018
1 parent 2109be9 commit 77e272a
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions tf_job_design_doc.md
@@ -78,14 +78,14 @@ As illustrated by Fig 1, I made an explicit decision not to try to hide or repla

The controller can be used to configure defaults for TfJob to create a simpler user experience. The most common use for this right now is supporting GPUs. To use GPUs, the NVIDIA drivers and libraries need to be mounted from the host into the container. This step should become unnecessary with Kubernetes 1.8. The TfJob controller will automatically add these volume mounts based on configuration specified when the controller is started. This prevents users from having to specify them for each job. Instead, only the cluster administrator who deploys the TfJob controller needs to know how the volumes should be configured.

-Another use case is minimizing the boilerplate users have to write to run standard processes (e.g. [Parameter Servers](https://github.com/jlewi/mlkube.io/pull/36#discussion_r141135711) or TensorBoard) using official TF Docker images.
+Another use case is minimizing the boilerplate users have to write to run standard processes (e.g. [Parameter Servers](https://github.com/tensorflow/k8s/pull/36#discussion_r141135711) or TensorBoard) using official TF Docker images.


## Controller

The controller manages a distributed TfJob by creating a series of Job controllers (Fig 2). The TfJob controller sets the environment variable TF_CONFIG to make the TensorFlow cluster spec and replica type (PS, WORKER, MASTER) and replica index available to TensorFlow code. The Job controller takes care of restarting TensorFlow processes that terminate due to an error. Additional logic in the TfJob controller looks at exit codes and fails the job if a TF process exits with an exit code indicating a permanent error. The TfJob controller treats exit codes of 1-127 as permanent errors; this is an arbitrary convention.
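
To illustrate how TensorFlow code might consume TF_CONFIG, here is a minimal Python sketch. It assumes the variable holds a JSON object with a `cluster` map and a `task` entry carrying `type` and `index`; the helper name `parse_tf_config` and the host names in the example value are hypothetical and not taken from the design doc.

```python
import json
import os


def parse_tf_config(raw=None):
    """Return (cluster_spec, replica_type, replica_index) from TF_CONFIG.

    Assumes a JSON layout with "cluster" and "task" keys; missing keys
    yield an empty cluster and None for the type and index.
    """
    if raw is None:
        raw = os.environ.get("TF_CONFIG", "{}")
    config = json.loads(raw)
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    return cluster, task.get("type"), task.get("index")


# Hypothetical value of the kind the controller might set for worker replica 1.
example = json.dumps({
    "cluster": {
        "master": ["master-0:2222"],
        "ps": ["ps-0:2222"],
        "worker": ["worker-0:2222", "worker-1:2222"],
    },
    "task": {"type": "worker", "index": 1},
})

cluster, replica_type, replica_index = parse_tf_config(example)
```

Each replica can use the parsed type and index to decide which role it plays in the cluster.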

-When the master exits successfully or with a permanent error the job is considered finished. There is an open issue([issues/61](https://github.com/jlewi/mlkube.io/issues/61)) to make the changes necessary to support evaluation with the Estimator API in 1.4. The pods aren't deleted until the TfJob is deleted. This allows the logs to be fetched via kubectl logs.
+When the master exits successfully or with a permanent error the job is considered finished. There is an open issue([issues/61](https://github.com/tensorflow/k8s/issues/61)) to make the changes necessary to support evaluation with the Estimator API in 1.4. The pods aren't deleted until the TfJob is deleted. This allows the logs to be fetched via kubectl logs.
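
The exit-code convention and the completion rule described above can be sketched in Python as follows. This is an illustration only, not the controller's actual code: the function names are hypothetical, and treating every non-zero code outside 1-127 as retryable is an assumption drawn from the statement that the Job controller restarts processes that terminate due to an error.

```python
def is_permanent_error(exit_code):
    """Exit codes 1-127 are treated as permanent errors (an arbitrary convention)."""
    return 1 <= exit_code <= 127


def job_finished(master_exit_code):
    """Decide whether the job is finished from the master's exit code.

    None means the master is still running. Codes outside 0-127 are
    assumed to be retryable, so the job is not considered finished.
    """
    if master_exit_code is None:
        return False
    return master_exit_code == 0 or is_permanent_error(master_exit_code)
```

Under this sketch, a master killed by a signal (commonly reported as a code above 127) would be restarted rather than ending the job.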

![Resources for TfJob](diagrams/tfjob_k8s_resources.svg)
