
A bunch of changes to support distributed training using tf.estimator #265

Merged · 2 commits · Nov 8, 2018

Conversation

@jlewi (Contributor) commented Oct 9, 2018

  • Unify the code for training with Keras and TF.Estimator

    • Create a single train.py and trainer.py which use Keras inside TensorFlow
    • Provide options to train with either Keras or TF.Estimator (see the sketch after this list)
  • The code to train with TF.Estimator doesn't work

  • We weren't able to get TF.Estimator working, but this PR should make it easier to troubleshoot further

    • We've unified the existing code so that we don't duplicate it just to train with TF.Estimator
    • We've added unit tests that can be used to verify that training with TF.Estimator works. These tests
      can also be used to reproduce the current errors with TF.Estimator.
  • Add a Makefile to build the Docker image

  • Add an NFS PVC to our Kubeflow demo deployment.

  • Create a tfjob-estimator component in our ksonnet component.

  • Changes to distributed/train.py as part of merging with notebooks/train.py
    * Add command line arguments to specify paths rather than hard-coding them.
    * Remove the code at the start of train.py that waits until the input data
      becomes available.
    * I think the original intent was to allow the TFJob to be started simultaneously with the preprocessing
      job and just block until the data is available.
    * That should be unnecessary since we can just run the preprocessing job as a separate job.

  • Fix notebooks/train.py ([GH Issue Summarization] Fix tfjob training #186)

    • The code wasn't actually calling model.fit()
    • Add a unit test to verify we can invoke fit and evaluate without throwing exceptions.
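
A minimal sketch of the Keras/Estimator dispatch described above (the --mode flag and the stub functions are illustrative, not the actual train.py interface):

import argparse

def train_keras(args):
    """Placeholder: build the tf.keras model and call model.fit()."""

def train_estimator(args):
    """Placeholder: convert the model with model_to_estimator and call
    tf.estimator.train_and_evaluate()."""

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--mode', default='keras', choices=['keras', 'estimator'],
                        help="Train with Keras' fit() or with TF.Estimator.")
    args = parser.parse_args()
    if args.mode == 'keras':
        train_keras(args)
    else:
        train_estimator(args)

if __name__ == '__main__':
    main()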


@texasmichelle (Member)

Ready for a rebase

# Word embedding for encoder (ex: Issue Body)
x = tf.keras.layers.Embedding(
    num_encoder_tokens, latent_dim, name='Body-Word-Embedding', mask_zero=False)(encoder_inputs)
x = tf.keras.layers.BatchNormalization(name='Encoder-Batchnorm-1')(x)
@jlewi (Contributor, Author)

@inc0 @hamelsmu It doesn't look like the output of BatchNormalization is used anywhere? Is this expected?
Does Keras just automagically connect the current layer to the last layer?

@hamelsmu (Member) commented Oct 17, 2018

@jlewi look at line 160 below; the output of the batch norm layer is used as input to the GRU layer:

_, state_h = tf.keras.layers.GRU(latent_dim, return_state=True, name='Encoder-Last-GRU')(x)

@jlewi (Contributor, Author)

Ack. Resolved.

#x = GRU(latent_dim, name='Encoder-Intermediate-GRU', return_sequences=True)(x)
#x = BatchNormalization(name='Encoder-Batchnorm-2')(x)
# We do not need the `encoder_output` just the hidden state.
_, state_h = tf.keras.layers.GRU(latent_dim, return_state=True, name='Encoder-Last-GRU')(x)
@jlewi (Contributor, Author)

@inc0 @hamelsmu Doesn't look like state_h is used anywhere?

Member

It was originally supposed to be used in line 166

encoder_model = tf.keras.Model(inputs=encoder_inputs, outputs=state_h, name='Encoder-Model')

Encapsulating a subset of the layers as a model object makes it easier to extract the encoder, as it logically groups those layers together under the name "Encoder-Model". However, depending on the use case you are demonstrating here, maybe you do not care about this?

However, per my other comment, you will eventually need to initialize the decoder with the last hidden state of the encoder which was formerly done via the commented out encoder_model object.
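
For reference, a minimal self-contained sketch of that grouping (vocabulary size and latent_dim are illustrative):

import tensorflow as tf

num_encoder_tokens, latent_dim = 5000, 300
encoder_inputs = tf.keras.layers.Input(shape=(None,), name='Encoder-Input')
x = tf.keras.layers.Embedding(
    num_encoder_tokens, latent_dim, name='Body-Word-Embedding', mask_zero=False)(encoder_inputs)
x = tf.keras.layers.BatchNormalization(name='Encoder-Batchnorm-1')(x)
_, state_h = tf.keras.layers.GRU(latent_dim, return_state=True, name='Encoder-Last-GRU')(x)
# Grouping these layers under a named Model lets the encoder run on its own
# at inference time, e.g. to embed issue bodies without invoking the decoder.
encoder_model = tf.keras.Model(inputs=encoder_inputs, outputs=state_h, name='Encoder-Model')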

# encode without decoding if we want to.
encoder_model = tf.keras.Model(inputs=encoder_inputs, outputs=state_h, name='Encoder-Model')
# TODO(jlewi): I commented out the following two lines
# encoder_model = tf.keras.Model(inputs=encoder_inputs, outputs=state_h, name='Encoder-Model')
@jlewi (Contributor, Author)

@inc0 @hamelsmu It doesn't look like seq2seq_encoder_out is used anywhere; can we delete these two lines of code?

Member

@jlewi the seq2seq_encoder_out should be used to initialize the state of the decoder. However, that code has been commented out, apparently as an interim step by someone? See my comments below.

@jlewi (Contributor, Author)

I commented out this line and changed the line below.

@jlewi (Contributor, Author) commented Oct 17, 2018

@inc0 @hamelsmu Could you please take a look at train.py? I left a few questions for you. I don't think I made any substantive changes. I noticed, however (because of lint errors), that various variables' outputs aren't being used. Does this mean we can delete those lines?

It's unclear to me whether calls to tf.keras.layers are building up some internal state storing the graph, such that even though we don't use the output we still need to call the function.
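
For context, a minimal standalone sketch of how the functional API builds the graph (layer names here are illustrative): the model is traced backwards from its outputs, so a layer call whose output tensor never reaches the outputs is not part of the model, and such lines are safe to delete.

import tensorflow as tf

inputs = tf.keras.layers.Input(shape=(10,))
x = tf.keras.layers.Dense(8, name='used')(inputs)    # feeds the model output
_ = tf.keras.layers.Dense(8, name='unused')(inputs)  # dangling; excluded from the model
outputs = tf.keras.layers.Dense(1)(x)
model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.summary()  # lists 'used' but not 'unused'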

x = tf.keras.layers.BatchNormalization(name='Encoder-Batchnorm-1')(x)
# Intermediate GRU layer (optional)
#x = GRU(latent_dim, name='Encoder-Intermediate-GRU', return_sequences=True)(x)
#x = BatchNormalization(name='Encoder-Batchnorm-2')(x)
Member

Let's get rid of the commented lines maybe?

@jlewi (Contributor, Author)

Done.

decoder_gru = tf.keras.layers.GRU(
    latent_dim, return_state=True, return_sequences=True, name='Decoder-GRU')
# FIXME: seems to be running into https://github.com/keras-team/keras/issues/9761
decoder_gru_output, _ = decoder_gru(dec_bn)  # , initial_state=seq2seq_encoder_out)
@hamelsmu (Member) commented Oct 17, 2018

@jlewi this appears to be a bug. The decoder should be initialized with the last hidden state of the encoder; here it is not being initialized with anything. I understand there may have been some breaking changes to the Keras API per the # FIXME comment, so I would suggest holding this PR for now until a workaround is found. This relates to your question about seq2seq_encoder_out being used.
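
For reference, a minimal sketch of the intended wiring (shapes illustrative; the two Input layers stand in for the decoder's batchnorm output and the encoder's final state):

import tensorflow as tf

latent_dim = 300
dec_bn = tf.keras.layers.Input(shape=(None, latent_dim))
seq2seq_encoder_out = tf.keras.layers.Input(shape=(latent_dim,))
decoder_gru = tf.keras.layers.GRU(
    latent_dim, return_state=True, return_sequences=True, name='Decoder-GRU')
# Conditioning the decoder on the encoded issue body:
decoder_gru_output, _ = decoder_gru(dec_bn, initial_state=seq2seq_encoder_out)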

@jlewi (Contributor, Author)

Got it, thanks.

@hamelsmu (Member)

@jlewi I added some comments. I haven't kept up with all the changes in the code base, so hopefully my comments are not completely out of context, but I think I identified some issues.


@jlewi (Contributor, Author) commented Oct 17, 2018

@hamelsmu Thanks.

@inc0 Any idea how to address the issues Hamel mentioned (see comments)? I think you identified the issue (the FIXME) in your original PR introducing tf.Estimator.

In this PR the only change I want to make is to pass in certain values as command line arguments rather than hard-coding them.

@inc0 left a comment

Change tfjob.yaml to use the new image, please, and make sure the commands are correct.

@inc0 commented Oct 18, 2018

@jlewi @hamelsmu not sure. The bug in Keras doesn't seem to be getting worked on.

@hamelsmu (Member) commented Oct 18, 2018 via email

@inc0 commented Oct 18, 2018

Sure - no, but it did raise a traceback when I tried it :)

@jlewi (Contributor, Author) commented Oct 22, 2018

@inc0 If I'm understanding Hamel's comments, then there are issues with the original code and we wouldn't actually be training a good-quality model. Do you think you could pick this up and try to fix it?

Was the issue only with distributed training, or did it happen with non-distributed training as well?

@jlewi (Contributor, Author) commented Oct 22, 2018

I ran the latest iteration of the code after uncommenting the lines.

Here's the stack trace:

Traceback (most recent call last):
  File "/issues/train.py", line 217, in <module>
    main()
  File "/issues/train.py", line 214, in main
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 637, in run
    getattr(self, task_to_run)()
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 674, in run_master
    self._start_distributed_training(saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 788, in _start_distributed_training
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 356, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1181, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1211, in _train_model_default
    features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1169, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/estimator/keras.py", line 261, in model_fn
    labels)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/estimator/keras.py", line 199, in _clone_and_build_model
    optimizer_iterations=global_step)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/keras/models.py", line 434, in clone_and_build_model
    clone = clone_model(model, input_tensors=input_tensors)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/keras/models.py", line 255, in clone_model
    return _clone_functional_model(model, input_tensors=input_tensors)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/keras/models.py", line 163, in _clone_functional_model
    **kwargs))
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/keras/layers/recurrent.py", line 617, in __call__
    self._num_constants)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/keras/layers/recurrent.py", line 2392, in _standardize_args
    assert initial_state is None and constants is None
AssertionError

Full logs: logs.txt

@inc0 is this the same error you ran into?

@jlewi (Contributor, Author) commented Oct 22, 2018

@inc0 @hamelsmu

If you look at the comment keras-team/keras#9761 (comment), it suggests that there was an API change in Keras to handle the issue. I tried changing the call to match that syntax.

I'm now getting the following error:

ValueError: An `initial_state` was passed that is not compatible with `cell.state_size`. Received `state_spec`=[InputSpec(shape=(None, 300), ndim=2), InputSpec(shape=(None, 300), ndim=2)]; however `cell.state_size` is [300]

Any idea how to fix this? Do I need to flatten the output of the previous layer?

I added a unit test, "train_test.py", which can reproduce the error.
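
One illustrative hypothesis for this error (not verified against this exact code): a GRU carries a single hidden-state tensor, so initial_state must be one tensor of shape (batch, latent_dim); an LSTM-style two-element list [h, c] produces exactly this state_spec/state_size mismatch.

import tensorflow as tf

latent_dim = 300
dec_bn = tf.keras.layers.Input(shape=(None, latent_dim))
state = tf.keras.layers.Input(shape=(latent_dim,))
gru = tf.keras.layers.GRU(latent_dim, return_sequences=True, return_state=True)
out, h = gru(dec_bn, initial_state=state)             # OK: one state tensor
# out, h = gru(dec_bn, initial_state=[state, state])  # -> "not compatible with cell.state_size"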

@jlewi (Contributor, Author) commented Oct 23, 2018

I tried following pascalxia/keras@6750e1e and commenting out the call to _standardize_args in recurrent.py.

When I did that, I ended up getting a different error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1626, in _create_c_op
    c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Dimensions must be equal, but are 0 and 300 for 'add' (op: 'Add') with input shapes: [0], [?,300].

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train_test.py", line 17, in test_train
    train.train_model(args)
  File "/examples/github_issue_summarization/distributed/train.py", line 148, in train_model
    decoder_gru_output, _ = decoder_gru(dec_bn, initial_state=seq2seq_encoder_out)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/keras/layers/recurrent.py", line 629, in __call__
    additional_inputs += initial_state
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 890, in r_binary_op_wrapper
    return func(x, y, name=name)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 301, in add
    "Add", x=x, y=y, name=name)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3272, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1790, in __init__
    control_input_ops)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1629, in _create_c_op
    raise ValueError(str(e))
ValueError: Dimensions must be equal, but are 0 and 300 for 'add' (op: 'Add') with input shapes: [0], [?,300].

Not sure if this is an improvement or not.

@jlewi (Contributor, Author) commented Oct 23, 2018

I've refactored the code so that we can train the model using either Keras' fit function or TF.Estimator. I was hoping this would give us a way to make the code work.

Training with Keras works, but it looks like inference is giving me a problem.

The error with inference is:

Traceback (most recent call last):
  File "train_test.py", line 32, in test_keras
    train.main(args)
  File "/examples/github_issue_summarization/notebooks/train.py", line 196, in main
    model_trainer.evaluate_keras()
  File "/examples/github_issue_summarization/notebooks/trainer.py", line 222, in evaluate_keras
    seq2seq_model=self.seq2seq_Model)
  File "/examples/github_issue_summarization/notebooks/seq2seq_utils.py", line 231, in __init__
    self.decoder_model = extract_decoder_model(seq2seq_model)
  File "/examples/github_issue_summarization/notebooks/seq2seq_utils.py", line 208, in extract_decoder_model
    gru_out, gru_state_out = model.get_layer('Decoder-GRU')([dec_bn, gru_inference_state_input])
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/keras/layers/recurrent.py", line 659, in __call__
    output = super(RNN, self).__call__(full_input, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 796, in __call__
    inputs, outputs, args, kwargs)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 908, in _set_connectivity_metadata_
    input_tensors=inputs, output_tensors=outputs, arguments=kwargs)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1089, in _add_inbound_node
    arguments=arguments)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1760, in __init__
    layer.outbound_nodes.append(self)
AttributeError: 'InputLayer' object has no attribute 'outbound_nodes'

As far as I can tell, the code is exactly the same as the current code in train.py and train.ipynb. The only difference is that we are using the Keras code in TF, not the separate Keras package.

@inc0 and @hamelsmu any ideas?

@inc0 commented Oct 23, 2018

Maybe model_to_estimator has a missing feature? It tries to inspect the model, and perhaps it didn't notice the output tensor? I'll check it out when I'm back home (next week).
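
For reference, a minimal sketch of the conversion in question (toy model for illustration); model_to_estimator clones the compiled Keras model inside the Estimator's model_fn, which is where the clone_model call in the earlier traceback originates:

import tensorflow as tf

inputs = tf.keras.layers.Input(shape=(10,), name='x')
outputs = tf.keras.layers.Dense(1)(inputs)
model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='adam', loss='mse')
estimator = tf.keras.estimator.model_to_estimator(
    keras_model=model, model_dir='/tmp/model_dir')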

@jlewi (Contributor, Author) commented Oct 23, 2018

I suspect a versioning issue. It looks like the Docker container is using:

from tensorflow import keras  
In [2]: keras.__version__                                                                                                                             
Out[2]: '2.1.6-tf'

In [4]: tensorflow.__version__                                                                                                                        
Out[4]: '1.11.0'

Keras is on 2.2
https://github.com/keras-team/keras/releases

2.1.6 looks like it's from April:
https://github.com/keras-team/keras/releases/tag/2.1.6

@jlewi (Contributor, Author) commented Oct 24, 2018

/assign @amygdala

@jlewi changed the title from "[WIP] A bunch of changes to support distributed training using tf.estimator" to "A bunch of changes to support distributed training using tf.estimator" Oct 24, 2018
@jlewi (Contributor, Author) commented Oct 24, 2018

@amygdala I think this is pretty much ready to review.

@@ -1,5 +1,10 @@
# Distributed training using Estimator

Distributed training with keras currently doesn't work; see

* kubeflow/examples#280

You're training with Keras + tf-estimator, just not tf.keras. Is this still accurate?

@jlewi (Contributor, Author)

I don't know how to run distributed training without TF.Estimator.

The code using TF.Estimator doesn't work. So right now we don't have a way to run distributed training.

@@ -1,69 +0,0 @@
---
apiVersion: "kubeflow.org/v1alpha2"

Any reason we're deleting it? It would still make sense to keep it for educational purposes, for people not familiar with ksonnet.

@jlewi (Contributor, Author)

The TF.Estimator code doesn't work, so running the code with the YAML will just produce errors. I think keeping it right now is just a source of confusion. If/when we fix the code to work with TF.Estimator, we can restore it.

Member

+1 to reducing confusion by removing files that don't run


But the only changes needed are the command and image, so it'd be very easy to make the YAML work, and I think it's more readable than ksonnet. Why not fix the YAML rather than delete it?

@jlewi (Contributor, Author)

If you'd like to do that in a follow-on PR, I'm open to it. I'd like to get the example back to a healthy state, and having fewer files is helpful.

@amygdala (Collaborator)

I'll dig into this on the weekend.

@jlewi (Contributor, Author) commented Oct 27, 2018

Thanks @amygdala

Commit message:

Unify the code for training with Keras and TF.Estimator.

Create a single train.py and trainer.py which use Keras inside TensorFlow.
Provide options to train with either Keras or TF.Estimator.
The code to train with TF.Estimator doesn't work.

See kubeflow#196.
The original PR (kubeflow#203) worked around a blocking issue with Keras and TF.Estimator by commenting
out certain layers in the model architecture, leading to a model that wouldn't generate meaningful
predictions.
We weren't able to get TF.Estimator working, but this PR should make it easier to troubleshoot further.

We've unified the existing code so that we don't duplicate it just to train with TF.Estimator.
We've added unit tests that can be used to verify that training with TF.Estimator works. These tests
can also be used to reproduce the current errors with TF.Estimator.

Add a Makefile to build the Docker image.

Add an NFS PVC to our Kubeflow demo deployment.

Create a tfjob-estimator component in our ksonnet component.

Changes to distributed/train.py as part of merging with notebooks/train.py:
* Add command line arguments to specify paths rather than hard-coding them.
* Remove the code at the start of train.py that waits until the input data
  becomes available.
* I think the original intent was to allow the TFJob to be started simultaneously with the preprocessing
  job and just block until the data is available.
* That should be unnecessary since we can just run the preprocessing job as a separate job.

Fix notebooks/train.py (kubeflow#186):
* The code wasn't actually calling model.fit().
* Add a unit test to verify we can invoke fit and evaluate without throwing exceptions.
@jlewi (Contributor, Author) commented Nov 7, 2018

/unassign @amygdala
/assign @richardsliu
@richardsliu Would you mind reviewing this? Amy's a bit busy.

@k8s-ci-robot assigned richardsliu and unassigned amygdala Nov 7, 2018

### Training and Deploying the model.

We use the ksonnet app in **github/kubeflow/examples/github_issue_summarization/ks-kubeflow**
Contributor

Should this be a link instead?

@jlewi (Contributor, Author)

Fixed

# We'd like to switch to importing keras from TensorFlow in order to support
# TF.Estimator but using tensorflow.keras we can't train a model either using
# Keras' fit function or using TF.Estimator.
# from tensorflow import keras
Contributor

Remove if not needed.

@jlewi (Contributor, Author)

Fixed

if num_samples:
traindf, self.test_df = train_test_split(pd.read_csv(data_file).sample(
n=num_samples),
test_size=.10)
Contributor

Indentation seems off.

@jlewi (Contributor, Author)

Fixed.


self.output_dir = output_dir

self.tf_config = os.environ.get('TF_CONFIG', '{}')
Contributor

Is this needed for keras?

@jlewi (Contributor, Author)

This is needed to use TF Estimator.
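
For context, a sketch of what TF_CONFIG carries (the example value is illustrative): tf.estimator reads this environment variable to learn the cluster layout and this process's role during distributed training, and Kubeflow's TFJob operator sets it on each replica; Keras' fit() ignores it.

import json
import os

os.environ.setdefault('TF_CONFIG', json.dumps({
    'cluster': {'master': ['master-0:2222'], 'worker': ['worker-0:2222']},
    'task': {'type': 'worker', 'index': 0},
}))
tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
print(tf_config['task'])  # e.g. {'type': 'worker', 'index': 0}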

@richardsliu (Contributor)

/lgtm
/approve

@jlewi (Contributor, Author) commented Nov 8, 2018

/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: jlewi, richardsliu

@k8s-ci-robot merged commit 1043bc0 into kubeflow:master Nov 8, 2018
yixinshi pushed a commit to yixinshi/examples that referenced this pull request Nov 30, 2018.
Svendegroote91 pushed a commit to Svendegroote91/examples that referenced this pull request Dec 6, 2018.
Svendegroote91 pushed a commit to Svendegroote91/examples that referenced this pull request Apr 1, 2019.