#71 Simplify accelerators config (PR #90)
Changes from 2 commits: c6622d2, aef19e6, a6140cf, 4741721, bab3d25, eeddb9f
@@ -11,7 +11,7 @@
  TfJob provides a Kubernetes custom resource that makes it easy to
  run distributed or non-distributed TensorFlow jobs on Kubernetes.

- Using a CRD gives users the ability to create and manage TF Jobs just like builtin K8s resources. For example to
+ Using a Custom Resource Definition (CRD) gives users the ability to create and manage TF Jobs just like builtin K8s resources. For example to
  create a job

  ```
@@ -36,20 +36,20 @@ CRD please refer to
  Custom Resources require Kubernetes >= 1.7

- ## Installing the CRD and operator on your k8s cluster
+ ## Installing the TfJob CRD and operator on your k8s cluster

  1. Deploy the operator

     For non-RBAC enabled clusters:
     ```
     CHART=https://storage.googleapis.com/tf-on-k8s-dogfood-releases/latest/tf-job-operator-chart-latest.tgz
-    helm install ${CHART} -n tf-job --wait --replace
+    helm install ${CHART} -n tf-job --wait --replace --set cloud=<gce or azure>
     ```

     For RBAC-enabled clusters:
     ```
     CHART=https://storage.googleapis.com/tf-on-k8s-dogfood-releases/latest/tf-job-operator-chart-latest.tgz
-    helm install ${CHART} -n tf-job --wait --replace --set rbac.install=true
+    helm install ${CHART} -n tf-job --wait --replace --set rbac.install=true,cloud=<gce or azure>
     ```

> Review comment: gce -> gke

  * The above instructions use the latest release.
@@ -84,44 +84,48 @@ Custom Resources require Kubernetes >= 1.7

  ### Configuring the CRD

- The CRD can be configured via a [ConfigMap](https://kubernetes.io/docs/api-reference/v1.8/#configmap-v1-core)
- that provides a [ControllerConfig](https://github.com/tensorflow/k8s/blob/master/pkg/spec/controller.go) serialized
- as YAML. The config controls how the CRD manages TensorFlow jobs.
+ The CRD must be configured properly to work with your specific Kubernetes cluster.
+ Since it will be mounting GPU drivers into your pods, the CRD needs to know where to find them on the Kubernetes agents. It also needs to know which environment variables need to be injected into the pods.

- Currently, the most important use for [ControllerConfig](https://github.com/tensorflow/k8s/blob/master/pkg/spec/controller.go)
- is specifying environment variables and volumes that must be mounted from the
- host into containers to configure GPUs.
+ If your Kubernetes cluster is running on GCE (GKE) or Azure (ACS, AKS, acs-engine), simply pass the provider name to the helm install (or set it in `values.yaml`).

- The TfJob controller can be configured with a list of volumes that should be mounted from the host into the container
- to make GPUs work. Here's an example [ControllerConfig](https://github.com/tensorflow/k8s/blob/master/pkg/spec/controller.go):
+ For **GCE**:

> Review comment: GKE

+ ```
+ helm install ${CHART} -n tf-job --wait --replace --set cloud=gce
+ ```

> Review comment: gce -> gke

+ For **Azure**:
+ ```
+ helm install ${CHART} -n tf-job --wait --replace --set cloud=azure
+ ```

+ If the cluster is not hosted on GCE or Azure, you will need to specify a custom configuration.
+ To do so, edit `${CHART}/custom-config.yaml` with your desired settings.

> Review comment: I think editing the chart is undesirable because it requires the user to download/unpack the chart. What if, instead of passing in a file to helm, the user has to create the config map manually? E.g., if cloud=none then the helm chart doesn't include the tf-job-operator-config defined in the template.
>
> Reply: Done.
+ This is the structure of the configuration file:

  ```yaml
  accelerators:
    alpha.kubernetes.io/nvidia-gpu:
      volumes:
-       - name: nvidia-libraries
-         mountPath: /usr/local/nvidia/lib64 # This path is special; it is expected to be present in `/etc/ld.so.conf` inside the container image.
-         hostPath: /home/kubernetes/bin/nvidia/lib
-       - name: nvidia-debug-tools # optional
-         mountPath: /usr/local/bin/nvidia
-         hostPath: /home/kubernetes/bin/nvidia/bin
+       - name: <volume-name> # Desired name of the volume, ex: nvidia-libs
+         mountPath: <mount-path> # Path where this should be mounted
+         hostPath: <host-path> # Path on the host machine
+       - name: <volume2-name> # optional
+         mountPath: <mount-path>
+         hostPath: <host-path>
+     envVars:
+       - name: <env-var-name> # Name of the environment variable, ex: LD_LIBRARY_PATH
+         value: <env-value> # Value of the environment variable
  ```
- Here **alpha.kubernetes.io/nvidia-gpu** is the K8s resource name used for a GPU. The config above says that
- any container which uses this resource should have the volumes mentioned mounted into the container
- from the host.
+ Then simply install the Helm chart without specifying any cloud provider:

- The config is usually specified using a K8s ConfigMap to stage the config
- on a volume mounted into the Pod running the controller, and then passing
- the config into the controller via the --controller_config_file flag.

- The helm package for the controller includes a config map suitable for GKE.
- This ConfigMap may need to be modified for your cluster if you aren't using
- GKE.

- There's an open [issue](https://github.com/tensorflow/k8s/issues/71) to
- better support non GKE clusters

+ ```
+ helm install ${CHART} -n tf-job --wait --replace
+ ```

+ Subsequently, any pod requesting a resource of type `alpha.kubernetes.io/nvidia-gpu` will have these Volumes/VolumeMounts and environment variables injected at creation.
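As an illustrative sketch only (not the operator's actual code, which lives in `pkg/spec/controller.go`), the injection described above amounts to merging the accelerator entry's volumes and environment variables into every container that requests the configured resource. All names and paths below are taken from the examples in this PR; the helper function itself is hypothetical:

```python
# Hypothetical sketch of accelerator injection: for each container requesting
# a configured resource, add the matching volume mounts and env vars, and add
# the host-path volumes to the pod. Not the real TfJob operator code.

ACCELERATOR_CONFIG = {
    "alpha.kubernetes.io/nvidia-gpu": {
        "volumes": [
            {"name": "nvidia-libraries",
             "mountPath": "/usr/local/nvidia/lib64",
             "hostPath": "/home/kubernetes/bin/nvidia/lib"},
        ],
        "envVars": [
            # Value is illustrative, not prescribed by the PR.
            {"name": "LD_LIBRARY_PATH", "value": "/usr/local/nvidia/lib64"},
        ],
    }
}

def inject_accelerators(pod_spec, config=ACCELERATOR_CONFIG):
    """Mutate pod_spec in place, mirroring the injection behavior described above."""
    for container in pod_spec["containers"]:
        limits = container.get("resources", {}).get("limits", {})
        for resource, entry in config.items():
            if resource not in limits:
                continue  # container did not request this accelerator
            for vol in entry["volumes"]:
                container.setdefault("volumeMounts", []).append(
                    {"name": vol["name"], "mountPath": vol["mountPath"]})
                pod_spec.setdefault("volumes", []).append(
                    {"name": vol["name"], "hostPath": {"path": vol["hostPath"]}})
            for env in entry.get("envVars", []):
                container.setdefault("env", []).append(dict(env))
    return pod_spec

pod = {"containers": [{"name": "tf",
                       "resources": {"limits": {"alpha.kubernetes.io/nvidia-gpu": 1}}}]}
inject_accelerators(pod)
print(pod["containers"][0]["env"][0]["name"])   # LD_LIBRARY_PATH
print(pod["volumes"][0]["hostPath"]["path"])    # /home/kubernetes/bin/nvidia/lib
```

A container without the `alpha.kubernetes.io/nvidia-gpu` limit is left untouched, which is why the config is keyed by resource name.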
  ## Creating a job
@@ -247,7 +251,7 @@ in your job. The table below describes the important fields in

  #### TensorBoard on Azure

- On Azure you can store your event files on an azure file and use
+ On Azure you can store your event files on an Azure Files share and use
  volumes to make them available to TensorBoard.
  ```

@@ -258,7 +262,6 @@ metadata:
  spec:
    replica_specs:
      - replicas: 1
-       tfPort: 2222
        tfReplicaType: MASTER
        template:
          spec:

@@ -271,7 +274,6 @@ spec:
        restartPolicy: OnFailure
      tensorboard:
        logDir: /tmp/tensorflow
-       serviceType: LoadBalancer
      volumes:
        - name: azurefile
          azureFile:
**New file** (likely the chart's `custom-config.yaml`, per the discussion above):

@@ -0,0 +1,26 @@
+ # In this file you can specify a custom configuration for TfJob, such as specific driver mounts
+ # and environment variables.
+ # Note that configurations for GCE and Azure (ACS, acs-engine, AKS) are already available: simply set cloud=gce
+ # or cloud=azure in values.yaml
+ apiVersion: v1
+ kind: ConfigMap
+ metadata:
+   name: tf-job-operator-config
+   namespace: default
+ data:
+   controller_config_file.yaml: |
+     grpcServerFilePath: /opt/mlkube/grpc_tensorflow_server/grpc_tensorflow_server.py
+     accelerators:
+       alpha.kubernetes.io/nvidia-gpu:
+         # These are all the Volumes and VolumeMounts that should be added to any pod requesting
+         # a resource of type "alpha.kubernetes.io/nvidia-gpu"
+         volumes:
+           - name: nvidia-libraries
+             mountPath: /usr/local/nvidia/lib64 # This path is special; it is expected to be present in `/etc/ld.so.conf` inside the container image.
+             hostPath: /home/kubernetes/bin/nvidia/lib
+           - name: nvidia-debug-tools # optional
+             mountPath: /usr/local/bin/nvidia
+             hostPath: /home/kubernetes/bin/nvidia/bin
+         # These are all the environment variables that should be added to any pod requesting
+         # a resource of type "alpha.kubernetes.io/nvidia-gpu", such as LD_LIBRARY_PATH
+         envVars:
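A custom config like the file above must follow the ControllerConfig shape: each volume entry needs `name`, `mountPath`, and `hostPath`, and each env var needs `name` and `value`. A minimal structural check, sketched in Python (the validator is a hypothetical helper, not part of this repo):

```python
# Hypothetical sanity check for one accelerator entry of a ControllerConfig.
# Note envVars may be empty/None, as in the custom-config.yaml template above.
def validate_accelerator(entry: dict) -> list:
    errors = []
    for i, vol in enumerate(entry.get("volumes", []) or []):
        for key in ("name", "mountPath", "hostPath"):
            if key not in vol:
                errors.append(f"volumes[{i}] missing {key}")
    for i, env in enumerate(entry.get("envVars", []) or []):
        for key in ("name", "value"):
            if key not in env:
                errors.append(f"envVars[{i}] missing {key}")
    return errors

entry = {
    "volumes": [
        {"name": "nvidia-libraries",
         "mountPath": "/usr/local/nvidia/lib64",
         "hostPath": "/home/kubernetes/bin/nvidia/lib"},
    ],
    "envVars": [{"name": "LD_LIBRARY_PATH"}],  # missing "value"
}
print(validate_accelerator(entry))  # ['envVars[0] missing value']
```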
**New file** (likely the chart's tf-job-operator-config template):

@@ -0,0 +1,47 @@
+ {{- $cloud := .Values.cloud | default "" -}}

+ {{ if eq $cloud "azure" }}
+ apiVersion: v1
+ kind: ConfigMap
+ metadata:
+   name: tf-job-operator-config
+   namespace: default
+ data:
+   controller_config_file.yaml: |
+     grpcServerFilePath: /opt/mlkube/grpc_tensorflow_server/grpc_tensorflow_server.py
+     accelerators:
+       alpha.kubernetes.io/nvidia-gpu:
+         envVars:
+           - name: LD_LIBRARY_PATH
+             value: /usr/lib/nvidia:/usr/lib/x86_64-linux-gnu
+         volumes:
+           - name: lib
+             mountPath: /usr/lib/nvidia
+             hostPath: /usr/lib/nvidia-384
+           - name: bin
+             mountPath: /usr/local/nvidia/bin
+             hostPath: /usr/lib/nvidia-384/bin
+           - name: libcuda
+             mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1
+             hostPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1
+ {{ else if eq $cloud "gce" }}
+ apiVersion: v1
+ kind: ConfigMap
+ metadata:
+   name: tf-job-operator-config
+   namespace: default
+ data:
+   controller_config_file.yaml: |
+     grpcServerFilePath: /opt/mlkube/grpc_tensorflow_server/grpc_tensorflow_server.py
+     accelerators:
+       alpha.kubernetes.io/nvidia-gpu:
+         volumes:
+           - name: nvidia-libraries
+             mountPath: /usr/local/nvidia/lib64 # This path is special; it is expected to be present in `/etc/ld.so.conf` inside the container image.
+             hostPath: /home/kubernetes/bin/nvidia/lib
+           - name: nvidia-debug-tools # optional
+             mountPath: /usr/local/bin/nvidia
+             hostPath: /home/kubernetes/bin/nvidia/bin
+ {{ else if eq $cloud "" }}
+ {{ .Files.Get "custom-config.yaml" }}
+ {{ end }}

> Review comment: gce -> gke
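The template's branching can be summarized as: a built-in ControllerConfig for a known cloud, otherwise fall back to the user-supplied `custom-config.yaml`, and render nothing for an unrecognized value. A small Python sketch of that selection logic (illustrative only; the constants stand in for the ConfigMap bodies and are not real chart values):

```python
# Illustrative model of the chart template's cloud-selection branches.
# The string constants are placeholders for the ConfigMap manifests above.

AZURE_CONFIG = "azure controller config"
GCE_CONFIG = "gce controller config"

def select_config(cloud: str, custom_config: str) -> str:
    """Mimics the {{ if eq $cloud ... }} branching in the template."""
    if cloud == "azure":
        return AZURE_CONFIG
    elif cloud == "gce":
        return GCE_CONFIG
    elif cloud == "":
        # Corresponds to {{ .Files.Get "custom-config.yaml" }}
        return custom_config
    return ""  # unrecognized cloud: no branch matches, nothing is rendered

print(select_config("azure", "my custom config"))  # azure controller config
print(select_config("", "my custom config"))       # my custom config
```

The empty-string default (from `.Values.cloud | default ""`) is what routes users without a cloud flag to their own `custom-config.yaml`.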