Docs should refer to Kubeflow user guide for deploying the TFJob operator. (#412)

* We don't have the resources to support and maintain ksonnet and helm packages.
* We want to focus on just using ksonnet to deploy Kubeflow.
jlewi authored Feb 27, 2018
1 parent c54cda9 commit 0094aaa
Showing 1 changed file with 2 additions and 125 deletions.
127 changes: 2 additions & 125 deletions README.md
@@ -46,132 +46,9 @@ TFJob requires Kubernetes >= 1.8

## Installing the TFJob CRD and operator on your k8s cluster

1. Ensure helm is running on your cluster
Please refer to the [Kubeflow user guide](https://github.com/kubeflow/kubeflow/blob/master/user_guide.md).
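
For orientation, the sketch below shows the ksonnet-based flow that the user guide describes; the registry URI, package, and prototype names here are assumptions taken from that guide, so treat the guide itself as authoritative.

```
# Create a ksonnet app pointed at your current kubectl context (app name is illustrative).
ks init my-kubeflow
cd my-kubeflow
# Add the Kubeflow registry and install the core package (which includes the TFJob operator).
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflow
ks pkg install kubeflow/core
# Generate and deploy the core components into the default environment.
ks generate core kubeflow-core --name=kubeflow-core
ks apply default -c kubeflow-core
```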

* On GKE with K8s 1.8, follow these
[instructions](https://docs.helm.sh/using_helm/#tiller-namespaces-and-rbac)
to set up appropriate service accounts for tiller.

* Azure K8s clusters should have service accounts configured by
default for tiller.

1. Deploy the operator

For RBAC-enabled clusters:
```
CHART=https://storage.googleapis.com/tf-on-k8s-dogfood-releases/latest/tf-job-operator-chart-latest.tgz
helm install ${CHART} -n tf-job --wait --replace --set rbac.install=true,cloud=<gke or azure>
```

* If you aren't running on GKE or Azure, don't set `cloud`.

For non-RBAC enabled clusters:
```
CHART=https://storage.googleapis.com/tf-on-k8s-dogfood-releases/latest/tf-job-operator-chart-latest.tgz
helm install ${CHART} -n tf-job --wait --replace --set cloud=<gke or azure>
```

* The above instructions use the latest release.
* Releases are versioned; you can see the list of versions with:
  ```
  gsutil ls gs://tf-on-k8s-dogfood-releases
  ```
* **Avoiding Breakages**
  * During Alpha there are no guarantees about TFJob API compatibility.
  * To avoid being broken by changes, you can pin to a particular version of the helm chart and control when you upgrade (see the pinning sketch after these steps).
1. Make sure the operator is running
```
kubectl get pods
NAME READY STATUS RESTARTS AGE
tf-job-operator-3083500267-wxj43 1/1 Running 0 48m
```
1. Run the helm tests
```
helm test tf-job
RUNNING: tf-job-tfjob-test-pqxkwk
PASSED: tf-job-tfjob-test-pqxkwk
```
### Installing `kubeflow/tf-operator`'s Dashboard

> **Caution: the dashboard is in a very early stage of development!**

`kubeflow/tf-operator` also includes a dashboard allowing you to monitor and create `TFJobs` through a web UI.
To deploy the dashboard, set `dashboard.install` to `true`.
Note that by default the dashboard is only accessible from within the cluster or by proxying, since the default `ServiceType` is `ClusterIP`.
If you wish to expose the dashboard through an external IP, set `dashboard.serviceType` to `LoadBalancer`.
For example, to enable the dashboard and also expose it externally:
```
CHART=https://storage.googleapis.com/tf-on-k8s-dogfood-releases/latest/tf-job-operator-chart-latest.tgz
helm install ${CHART} -n tf-job --wait --replace --set cloud=<gke or azure>,dashboard.install=true,dashboard.serviceType=LoadBalancer
```
This should create a service named `tf-job-dashboard` as well as an additional deployment named `tf-job-dashboard`.
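
If you keep the default `ClusterIP` service type, one way to reach the dashboard from your workstation is to port-forward the dashboard pod; the sketch below assumes the dashboard container listens on port 8080, which you should verify for your deployment.

```
# Find the pod created by the tf-job-dashboard deployment.
kubectl get pods | grep tf-job-dashboard
# Forward a local port to it (replace the pod name; container port 8080 is an assumption).
kubectl port-forward tf-job-dashboard-<pod-suffix> 8080:8080
# Then browse to http://localhost:8080
```
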
### Configuring the CRD
The CRD must be configured properly to work with your specific Kubernetes cluster.
Since it will be mounting GPU drivers into your pods, the CRD needs to know where to find them on the Kubernetes agents. It also needs to know which environment variables need to be injected into the pods.
If your Kubernetes cluster is running on GKE or Azure (ACS, AKS, acs-engine), simply pass the provider name to the `helm install` command (or set it in `tf-job-operator-chart/values.yaml`).
For **GKE**:
```
helm install ${CHART} -n tf-job --wait --replace --set cloud=gke
```
For **Azure**:
```
helm install ${CHART} -n tf-job --wait --replace --set cloud=azure
```
If the cluster is not hosted on GKE or Azure, you will need to specify a custom configuration.
To do so, create a `ConfigMap` with your desired settings.
This is the structure of the expected configuration file:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tf-job-operator-config
  namespace: default
data:
  controller-config-file.yaml: |
    accelerators:
      alpha.kubernetes.io/nvidia-gpu:
        volumes:
          - name: <volume-name>       # Desired name of the volume, ex: nvidia-libs
            mountPath: <mount-path>   # Path where this should be mounted
            hostPath: <host-path>     # Path on the host machine
          - name: <volume2-name>      # optional
            mountPath: <mount-path>
            hostPath: <host-path>
        envVars:
          - name: <env-var-name>      # Name of the environment variable, ex: LD_LIBRARY_PATH
            value: <env-value>        # Value of the environment variable
```
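
For illustration only, a filled-in map on a hypothetical cluster whose NVIDIA libraries live under `/usr/lib/nvidia-384` might look like the sketch below; the paths and the `LD_LIBRARY_PATH` value are assumptions you should replace with whatever matches your nodes.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tf-job-operator-config
  namespace: default
data:
  controller-config-file.yaml: |
    accelerators:
      alpha.kubernetes.io/nvidia-gpu:
        volumes:
          - name: nvidia-libs                    # arbitrary volume name
            mountPath: /usr/local/nvidia/lib64   # where the libraries appear inside the pod
            hostPath: /usr/lib/nvidia-384        # where they live on the node (assumption)
        envVars:
          - name: LD_LIBRARY_PATH
            value: /usr/local/nvidia/lib64       # must match the mountPath above
```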

Then simply create the `ConfigMap` and install the Helm chart (**the order matters**) without specifying any cloud provider:

```
kubectl create configmap tf-job-operator-config --from-file <your-configmap-path> --dry-run -o yaml | kubectl replace -f -
helm install ${CHART} -n tf-job --wait --replace
```

Subsequently, any pod requesting a resource of type `alpha.kubernetes.io/nvidia-gpu` will have these volumes/volume mounts and environment variables injected at creation.
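
As a quick smoke test of that injection, a bare pod requesting the accelerator resource might look like the minimal sketch below; the pod name and image are placeholders, not part of the chart.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test              # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: main
      image: <your-gpu-image>       # placeholder image
      resources:
        limits:
          alpha.kubernetes.io/nvidia-gpu: 1   # triggers the volume/env injection described above
```
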
We recommend deploying Kubeflow in order to use the TFJob operator.

## Creating a job

