#71 Simplify accelerators config #90

Merged 6 commits on Oct 26, 2017
Changes from 2 commits
70 changes: 36 additions & 34 deletions README.md
@@ -11,7 +11,7 @@
TfJob provides a Kubernetes custom resource that makes it easy to
run distributed or non-distributed TensorFlow jobs on Kubernetes.

Using a CRD gives users the ability to create and manage TF Jobs just like built-in K8s resources. For example, to
Using a Custom Resource Definition (CRD) gives users the ability to create and manage TF Jobs just like built-in K8s resources. For example, to
create a job

```
@@ -36,20 +36,20 @@ CRD please refer to
Custom Resources require Kubernetes >= 1.7


## Installing the CRD and operator on your k8s cluster
## Installing the TfJob CRD and operator on your k8s cluster

1. Deploy the operator

For non-RBAC enabled clusters:
```
CHART=https://storage.googleapis.com/tf-on-k8s-dogfood-releases/latest/tf-job-operator-chart-latest.tgz
helm install ${CHART} -n tf-job --wait --replace
helm install ${CHART} -n tf-job --wait --replace --set cloud=<gce or azure>
Review comment (Contributor): gce->gke

```

For RBAC-enabled clusters:
```
CHART=https://storage.googleapis.com/tf-on-k8s-dogfood-releases/latest/tf-job-operator-chart-latest.tgz
helm install ${CHART} -n tf-job --wait --replace --set rbac.install=true
helm install ${CHART} -n tf-job --wait --replace --set rbac.install=true,cloud=<gce or azure>
Review comment (Contributor): gce->gke

```

* The above instructions use the latest release.
@@ -84,44 +84,48 @@ Custom Resources require Kubernetes >= 1.7

### Configuring the CRD

The CRD can be configured via a [ConfigMap](https://kubernetes.io/docs/api-reference/v1.8/#configmap-v1-core)
that provides a [ControllerConfig](https://github.com/tensorflow/k8s/blob/master/pkg/spec/controller.go) serialized
as YAML. The config controls how the CRD manages TensorFlow jobs.
The CRD must be configured properly to work with your specific Kubernetes cluster.
Since it will be mounting GPU drivers into your pods, the CRD needs to know where to find them on the Kubernetes agents. It also needs to know which environment variables need to be injected into the pods.

Currently, the most important use for [ControllerConfig](https://github.com/tensorflow/k8s/blob/master/pkg/spec/controller.go)
is specifying environment variables and volumes that must be mounted from the
host into containers to configure GPUS.
If your Kubernetes cluster is running on GCE (GKE) or Azure (ACS, AKS, acs-engine), simply pass the provider name to `helm install` (or set it in `values.yaml`).

The TfJob controller can be configured with a list of volumes that should be mounted from the host into the container
to make GPUs work. Here's an example [ControllerConfig](https://github.com/tensorflow/k8s/blob/master/pkg/spec/controller.go):
For **GCE**
Review comment (Contributor): GKE

```
helm install ${CHART} -n tf-job --wait --replace --set cloud=gce
Review comment (Contributor): gce -> gke

```

For **Azure**:
```
helm install ${CHART} -n tf-job --wait --replace --set cloud=azure
```

If the cluster is not hosted on GCE or Azure, you will need to specify a custom configuration.
To do so, edit `${CHART}/custom-config.yaml` with your desired settings.
Review comment (Contributor):

I think editing the chart is undesirable because it requires the user to download and unpack the chart.

What if, instead of passing a file to helm, the user creates the config map manually? For example:

kubectl create configmap tf-job-operator-config --from-file=path/to/bar
helm install ${CHART} -n tf-job --wait --replace --set cloud=none

If cloud=none, then the helm chart doesn't include the tf-job-operator-config defined in the template.

Reply (Contributor Author):

Done.
One slight difference: you don't need to specify any value for cloud. As long as it is neither gke nor azure, the chart will not create a ConfigMap.


This is the structure of the configuration file:

```yaml
accelerators:
alpha.kubernetes.io/nvidia-gpu:
volumes:
- name: nvidia-libraries
mountPath: /usr/local/nvidia/lib64 # This path is special; it is expected to be present in `/etc/ld.so.conf` inside the container image.
hostPath: /home/kubernetes/bin/nvidia/lib
- name: nvidia-debug-tools # optional
mountPath: /usr/local/bin/nvidia
hostPath: /home/kubernetes/bin/nvidia/bin
- name: <volume-name> # Desired name of the volume, ex: nvidia-libs
mountPath: <mount-path> # Path where this should be mounted
hostPath: <host-path> # Path on the host machine
- name: <volume2-name> # optional
mountPath: <mount-path>
hostPath: <host-path>
envVars:
- name: <env-var-name> # Name of the environment variable, ex: LD_LIBRARY_PATH
value: <env-value> # Value of the environment variable
```

Here **alpha.kubernetes.io/nvidia-gpu** is the K8s resource name used for a GPU. The config above says that
any container which uses this resource should have the volumes mentioned mounted into the container
from the host.
Then simply install the Helm chart without specifying any cloud provider:

The config is usually specified using a K8s ConfigMap to stage the config
on a volume mounted into the Pod running the controller, and then passing
the config into the controller via the --controller_config_file flag.

The helm package for the controller includes a config map suitable for GKE.
This ConfigMap may need to be modified for your cluster if you aren't using
GKE.

There's an open [issue](https://github.com/tensorflow/k8s/issues/71) to
better support non GKE clusters
```
helm install ${CHART} -n tf-job --wait --replace
```

Subsequently, any pod requesting a resource of type `alpha.kubernetes.io/nvidia-gpu` will have these Volumes/VolumeMounts and environment variables injected at creation.
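
The injection described above can be sketched in Python. This is only an illustration of the idea, not the operator's actual implementation: the `inject_accelerator_config` helper and the dictionary layout of the pod spec are assumptions made for this example, and treating both `limits` and `requests` as a "request" for the resource is also an assumption. The config values are the Azure ones from this chart.

```python
import copy

GPU_RESOURCE = "alpha.kubernetes.io/nvidia-gpu"

# Accelerator config in the same shape as the `accelerators` section
# (values taken from the chart's Azure configuration).
ACCELERATORS = {
    GPU_RESOURCE: {
        "volumes": [
            {
                "name": "lib",
                "mountPath": "/usr/lib/nvidia",
                "hostPath": "/usr/lib/nvidia-384",
            },
        ],
        "envVars": [
            {
                "name": "LD_LIBRARY_PATH",
                "value": "/usr/lib/nvidia:/usr/lib/x86_64-linux-gnu",
            },
        ],
    },
}


def inject_accelerator_config(pod_spec, accelerators):
    """Hypothetical helper: return a copy of pod_spec with volumes, mounts
    and env vars added for containers that ask for a configured resource."""
    spec = copy.deepcopy(pod_spec)
    for container in spec.get("containers", []):
        resources = container.get("resources", {})
        # Assumption: a resource counts as requested if it appears in
        # either `limits` or `requests`.
        requested = set(resources.get("limits", {})) | set(resources.get("requests", {}))
        for resource, config in accelerators.items():
            if resource not in requested:
                continue
            for vol in config.get("volumes", []):
                # Pod-level hostPath volume, added only once per pod.
                volumes = spec.setdefault("volumes", [])
                if all(v["name"] != vol["name"] for v in volumes):
                    volumes.append(
                        {"name": vol["name"], "hostPath": {"path": vol["hostPath"]}}
                    )
                # Container-level mount that points at that volume.
                container.setdefault("volumeMounts", []).append(
                    {"name": vol["name"], "mountPath": vol["mountPath"]}
                )
            for env in config.get("envVars", []):
                container.setdefault("env", []).append(
                    {"name": env["name"], "value": env["value"]}
                )
    return spec


# Usage: a single container that asks for one GPU.
pod = {
    "containers": [
        {"name": "tensorflow", "resources": {"limits": {GPU_RESOURCE: 1}}},
    ],
}
patched = inject_accelerator_config(pod, ACCELERATORS)
```

A container that does not ask for the GPU resource is returned unchanged, which matches the intent that only GPU pods receive the driver mounts.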

## Creating a job

@@ -247,7 +251,7 @@ in your job. The table below describes the important fields in

#### TensorBoard on Azure

On Azure you can store your event files on an azure file and use
On Azure you can store your event files on Azure Files and use
volumes to make them available to TensorBoard.

```
@@ -258,7 +262,6 @@ metadata:
spec:
replica_specs:
- replicas: 1
tfPort: 2222
tfReplicaType: MASTER
template:
spec:
@@ -271,7 +274,6 @@ spec:
restartPolicy: OnFailure
tensorboard:
logDir: /tmp/tensorflow
serviceType: LoadBalancer
volumes:
- name: azurefile
azureFile:
2 changes: 1 addition & 1 deletion tf-job-operator-chart/Chart.yaml
@@ -1,6 +1,6 @@
name: tf-job-operator-chart
home: https://github.com/jlewi/mlkube.io
version: 0.1.0
version: 0.2.0
appVersion: 0.1.0
description: K8s Custom Resource and Operator For TensorFlow jobs
sources:
26 changes: 26 additions & 0 deletions tf-job-operator-chart/custom-config.yaml
@@ -0,0 +1,26 @@
# In this file you can specify a custom configuration for TfJob, such as specific driver mounts
# and environment variable.
# Note that configurations for GCE and Azure (ACS, acs-engine, AKS) are already available: simply set cloud=gce
# or cloud=azure in values.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: tf-job-operator-config
namespace: default
data:
controller_config_file.yaml: |
grpcServerFilePath: /opt/mlkube/grpc_tensorflow_server/grpc_tensorflow_server.py
accelerators:
alpha.kubernetes.io/nvidia-gpu:
# These are all the Volumes and VolumeMounts that should be added to any pod requesting
# a resource of type "alpha.kubernetes.io/nvidia-gpu"
volumes:
- name: nvidia-libraries
mountPath: /usr/local/nvidia/lib64 # This path is special; it is expected to be present in `/etc/ld.so.conf` inside the container image.
hostPath: /home/kubernetes/bin/nvidia/lib
- name: nvidia-debug-tools # optional
mountPath: /usr/local/bin/nvidia
hostPath: /home/kubernetes/bin/nvidia/bin
# These are all the environment variables that should be added to any pod requesting
# a resource of type "alpha.kubernetes.io/nvidia-gpu", such as LD_LIBRARY_PATH
envVars:
47 changes: 47 additions & 0 deletions tf-job-operator-chart/templates/config.yaml
@@ -0,0 +1,47 @@
{{- $cloud := .Values.cloud | default "" -}}

{{ if eq $cloud "azure" }}
apiVersion: v1
kind: ConfigMap
metadata:
name: tf-job-operator-config
namespace: default
data:
controller_config_file.yaml: |
grpcServerFilePath: /opt/mlkube/grpc_tensorflow_server/grpc_tensorflow_server.py
accelerators:
alpha.kubernetes.io/nvidia-gpu:
envVars:
- name: LD_LIBRARY_PATH
value: /usr/lib/nvidia:/usr/lib/x86_64-linux-gnu
volumes:
- name: lib
mountPath: /usr/lib/nvidia
hostPath: /usr/lib/nvidia-384
- name: bin
mountPath: /usr/local/nvidia/bin
hostPath: /usr/lib/nvidia-384/bin
- name: libcuda
mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1
hostPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1
{{ else if eq $cloud "gce" }}
apiVersion: v1
kind: ConfigMap
metadata:
name: tf-job-operator-config
namespace: default
data:
controller_config_file.yaml: |
grpcServerFilePath: /opt/mlkube/grpc_tensorflow_server/grpc_tensorflow_server.py
accelerators:
alpha.kubernetes.io/nvidia-gpu:
volumes:
- name: nvidia-libraries
mountPath: /usr/local/nvidia/lib64 # This path is special; it is expected to be present in `/etc/ld.so.conf` inside the container image.
hostPath: /home/kubernetes/bin/nvidia/lib
- name: nvidia-debug-tools # optional
mountPath: /usr/local/bin/nvidia
hostPath: /home/kubernetes/bin/nvidia/bin
{{ else if eq $cloud ""}}
{{ .Files.Get "custom-config.yaml"}}
{{ end }}
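
The branching in this template can be read as a small decision function. The Python sketch below only mirrors the `if` / `else if` chain as it appears in this revision of the diff; `render_config`, `AZURE_CONFIG` and `GCE_CONFIG` are hypothetical names used for illustration, and the strings stand in for the full ConfigMap manifests.

```python
from typing import Optional

# Placeholders for the two ConfigMap manifests embedded in the template.
AZURE_CONFIG = "azure ConfigMap"
GCE_CONFIG = "gce ConfigMap"


def render_config(cloud: Optional[str], custom_config: str) -> str:
    """Mimic the if / else-if chain of templates/config.yaml."""
    cloud = cloud or ""  # corresponds to `.Values.cloud | default ""`
    if cloud == "azure":
        return AZURE_CONFIG
    if cloud == "gce":
        return GCE_CONFIG
    if cloud == "":
        # corresponds to `.Files.Get "custom-config.yaml"`
        return custom_config
    # Any other value matches no branch, so nothing is rendered.
    return ""


# Usage: leaving `cloud` unset falls through to the custom configuration.
rendered = render_config(None, "contents of custom-config.yaml")
```

Note that in this revision an unrecognised provider name produces no ConfigMap at all, while an unset value pulls in `custom-config.yaml`.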
18 changes: 0 additions & 18 deletions tf-job-operator-chart/templates/deployment.yaml
@@ -1,21 +1,3 @@
apiVersion: v1
kind: ConfigMap
metadata:
name: tf-job-operator-config
namespace: default
data:
controller_config_file.yaml: |
grpcServerFilePath: /opt/mlkube/grpc_tensorflow_server/grpc_tensorflow_server.py
accelerators:
alpha.kubernetes.io/nvidia-gpu:
volumes:
- name: nvidia-libraries
mountPath: /usr/local/nvidia/lib64 # This path is special; it is expected to be present in `/etc/ld.so.conf` inside the container image.
hostPath: /home/kubernetes/bin/nvidia/lib
- name: nvidia-debug-tools # optional
mountPath: /usr/local/bin/nvidia
hostPath: /home/kubernetes/bin/nvidia/bin
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
5 changes: 5 additions & 0 deletions tf-job-operator-chart/values.yaml
@@ -2,6 +2,11 @@
image: gcr.io/tf-on-k8s-dogfood/tf_operator:latest
test_image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff

# Which cloud provider is kubernetes hosted on.
# Supported values are gce or azure.
# If no value is provided, the configuration specified in custom-config.yaml will be applied
cloud:

## Install Default RBAC roles and bindings
rbac:
install: false