A bunch of changes to support distributed training using tf.estimator (kubeflow#265)

* Unify the code for training with Keras and TF.Estimator

Create a single train.py and trainer.py which use Keras inside TensorFlow (tf.keras).
Provide options to train with either Keras or TF.Estimator.
The code to train with TF.Estimator doesn't work yet.

See kubeflow#196
The original PR (kubeflow#203) worked around a blocking issue with Keras and TF.Estimator by commenting
out certain layers in the model architecture, leading to a model that wouldn't generate meaningful
predictions.
We weren't able to get TF.Estimator working, but this PR should make it easier to troubleshoot further.

We've unified the existing code so that we don't duplicate it just to train with TF.Estimator.
We've added unit tests that can be used to verify that training with TF.Estimator works. These tests
can also be used to reproduce the current errors with TF.Estimator.
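The unified-trainer idea above can be sketched roughly as follows. All names here are hypothetical, not the actual trainer.py API; the real Estimator path would wrap the compiled Keras model with `tf.keras.estimator.model_to_estimator`:

```python
# Illustrative sketch only: one code path builds the Keras model, and a
# flag decides whether to train it directly or hand it to tf.estimator.
# All names are hypothetical, not the actual trainer.py API.

def build_model():
    # In the real code this returns a compiled tf.keras model.
    return {"name": "seq2seq-summarizer"}


def train(model, use_estimator=False):
    if use_estimator:
        # Estimator path: the real code would call
        # tf.keras.estimator.model_to_estimator(keras_model=model)
        # and then estimator.train(...). This path is currently broken
        # (see kubeflow#196).
        return "trained-with-estimator"
    # Keras path: the real code would call model.fit(...).
    return "trained-with-keras"
```

Keeping a single `build_model` for both paths is what makes the unification worthwhile: any fix to the Estimator path automatically applies to the model the Keras path trains.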
Add a Makefile to build the Docker image

Add a NFS PVC to our Kubeflow demo deployment.

Create a tfjob-estimator component in our ksonnet component.

Changes to distributed/train.py as part of merging it with notebooks/train.py:
* Add command line arguments to specify paths rather than hard coding them.
* Remove the code at the start of train.py that waits until the input data
becomes available.
* I think the original intent was to allow the TFJob to be started simultaneously with the preprocessing
job and just block until the data is available
* That should be unnecessary since we can just run the preprocessing job as a separate job.
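The hard-coded-paths change could look roughly like this (the flag names are assumptions, not necessarily the ones train.py uses):

```python
import argparse


def parse_args(argv=None):
    # Paths come from the command line instead of being hard coded,
    # so the same script works in a notebook, a TFJob, or locally.
    parser = argparse.ArgumentParser(
        description="GitHub issue summarization trainer")
    parser.add_argument("--input_data", required=True,
                        help="Path to the preprocessed issues CSV.")
    parser.add_argument("--output_model", required=True,
                        help="Where to write the trained model.")
    return parser.parse_args(argv)
```

For example, a TFJob spec could then pass `--input_data=/mnt/kubeflow-gcfs/gh-demo/data/github_issues.csv` while a local run points at a scratch directory, with no code changes.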

Fix notebooks/train.py (kubeflow#186)

The code wasn't actually calling `model.fit()`.
Add a unittest to verify we can invoke fit and evaluate without throwing exceptions.
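The kind of smoke test described above can be sketched like this (the stub stands in for the real Keras model; names are illustrative):

```python
import unittest


class StubModel:
    # Stands in for the real Keras model; fit/evaluate just record calls,
    # so the test exercises the calling code without TensorFlow installed.
    def fit(self, x, y, epochs=1):
        self.fitted = True

    def evaluate(self, x, y):
        return 0.0


class TrainSmokeTest(unittest.TestCase):
    def test_fit_and_evaluate_do_not_raise(self):
        model = StubModel()
        model.fit([[1], [2]], [0, 1], epochs=1)
        self.assertTrue(model.fitted)
        self.assertEqual(model.evaluate([[1]], [0]), 0.0)
```

Run with `python -m unittest`. The point of such a test is exactly the regression fixed here: it fails loudly if the training script stops invoking `fit` or `evaluate`.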

* Address comments.
jlewi authored and sven committed Apr 1, 2019
1 parent 3f3392f commit 126ad24
Showing 20 changed files with 6,577 additions and 556 deletions.
5 changes: 5 additions & 0 deletions github_issue_summarization/02_distributed_training.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,10 @@
# Distributed training using Estimator

Distributed training with Keras currently doesn't work; see

* kubeflow/examples#280
* kubeflow/examples#96

Requires TensorFlow 1.9 or later.
Requires [StorageClass](https://kubernetes.io/docs/concepts/storage/storage-classes/) capable of creating ReadWriteMany persistent volumes.
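For orientation, each TFJob replica receives a `TF_CONFIG` environment variable that `tf.estimator` reads (via `tf.estimator.RunConfig`) to discover the cluster. A sketch of what one worker sees — host names and task assignment here are made up:

```python
import json
import os

# Illustrative TF_CONFIG as the TFJob controller would set it for one worker;
# the replica names and ports are hypothetical.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "master": ["tfjob-estimator-master-0:2222"],
        "worker": ["tfjob-estimator-worker-0:2222",
                   "tfjob-estimator-worker-1:2222"],
        "ps": ["tfjob-estimator-ps-0:2222"],
    },
    "task": {"type": "worker", "index": 0},
})

# tf.estimator.RunConfig() parses this automatically at construction time.
tf_config = json.loads(os.environ["TF_CONFIG"])
```
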

51 changes: 51 additions & 0 deletions github_issue_summarization/demo/README.md
@@ -36,6 +36,8 @@ Here are the instructions for setting up the demo.

1. Follow the [instructions](https://www.kubeflow.org/docs/guides/gke/cloud-filestore/) to set up an NFS share

* This is needed to do distributed training with the TF estimator example

1. Create static IP for serving **gh-demo.kubeflow.org**

```
@@ -77,4 +79,53 @@ Here are the instructions for setting up the demo.
cd gh-app
ks env add gh-public --namespace=gh-public
ks apply gh-public
```
### Training and Deploying the model.
We use the ksonnet app in [github/kubeflow/examples/github_issue_summarization/ks-kubeflow](https://github.com/kubeflow/examples/tree/master/github_issue_summarization/ks-kubeflow)
The current environment is
```
export ENV=gh-demo-1003
```
Set a bucket for the job output
```
DAY=$(date +%Y%m%d)
ks param set --env=${ENV} tfjob-v1alpha2 output_model_gcs_bucket kubecon-gh-demo
ks param set --env=${ENV} tfjob-v1alpha2 output_model_gcs_path gh-demo/${DAY}/output
```
Run the job
```
ks apply ${ENV} -c tfjob-v1alpha2
```
#### Using TF Estimator with Keras
1. Copy the data to the GCFS mount by launching a notebook and then running the following commands
```
!mkdir -p /mnt/kubeflow-gcfs/gh-demo/data
!gcloud auth activate-service-account --key-file=${GOOGLE_APPLICATION_CREDENTIALS}
!gsutil cp gs://kubeflow-examples/github-issue-summarization-data/github-issues.zip /mnt/kubeflow-gcfs/gh-demo/data
!unzip /mnt/kubeflow-gcfs/gh-demo/data/github-issues.zip
!cp github_issues.csv /mnt/kubeflow-gcfs/gh-demo/data/
```
* TODO(jlewi): Can we modify the existing job that downloads data to a PVC to do this?
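After the copy, a quick sanity check can be run in the same notebook. This is a hypothetical helper (pure stdlib, assuming the path used above); it is not part of the repo:

```python
import csv

# Hypothetical sanity check: confirm the copied CSV is readable where
# the training job expects it. The path matches the copy step above.
data_path = "/mnt/kubeflow-gcfs/gh-demo/data/github_issues.csv"


def peek_csv(path, n=3):
    """Return the first n rows of a CSV file as lists of strings."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        return [row for _, row in zip(range(n), reader)]

# In the notebook: peek_csv(data_path) should print the header row
# plus the first couple of issues.
```
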
1. Run the estimator job
```
ks apply ${ENV} -c tfjob-estimator
```
1. Run TensorBoard
```
ks apply ${ENV} -c tensorboard-pvc-tb
```
@@ -0,0 +1,6 @@
local env = std.extVar("__ksonnet/environments");
local params = std.extVar("__ksonnet/params").components["google-cloud-filestore-pv"];

local google_cloud_file_store_pv = import "kubeflow/core/google-cloud-filestore-pv.libsonnet";
local instance = google_cloud_file_store_pv.new(env, params);
instance.list(instance.all)
@@ -25,7 +25,7 @@
jupyterhub: {
accessLocalFs: 'false',
cloud: 'gke',
disks: 'null',
disks: 'kubeflow-gcfs',
gcpSecretName: 'user-gcp-sa',
image: 'gcr.io/kubeflow/jupyterhub-k8s:v20180531-3bb991b1',
jupyterHubAuthenticator: 'iap',
@@ -104,14 +104,21 @@
secretName: 'envoy-ingress-tls',
},
seldon: {
apifeServiceType: "NodePort",
name: "seldon",
namespace: "null",
operatorJavaOpts: "null",
operatorSpringOpts: "null",
seldonVersion: "0.2.3",
withApife: "false",
withRbac: "true",
apifeServiceType: 'NodePort',
name: 'seldon',
namespace: 'null',
operatorJavaOpts: 'null',
operatorSpringOpts: 'null',
seldonVersion: '0.2.3',
withApife: 'false',
withRbac: 'true',
},
"google-cloud-filestore-pv": {
image: 'gcr.io/kubeflow-images-public/ubuntu:18.04',
name: 'kubeflow-gcfs',
path: '/kubeflow',
serverIP: '10.33.75.194',
storageCapacity: '20',
},
},
}
53 changes: 0 additions & 53 deletions github_issue_summarization/distributed/storage.yaml

This file was deleted.

69 changes: 0 additions & 69 deletions github_issue_summarization/distributed/tfjob.yaml

This file was deleted.

