A bunch of changes to support distributed training using tf.estimator #265
Conversation
Ready for a rebase.
```python
# Word embedding for encoder (ex: Issue Body)
x = tf.keras.layers.Embedding(
    num_encoder_tokens, latent_dim, name='Body-Word-Embedding', mask_zero=False)(encoder_inputs)
x = tf.keras.layers.BatchNormalization(name='Encoder-Batchnorm-1')(x)
```
@jlewi look at line 160 below; note that the output of the batch norm layer is used as input to the GRU layer:

```python
_, state_h = tf.keras.layers.GRU(latent_dim, return_state=True, name='Encoder-Last-GRU')(x)
```
Ack. Resolved.
```python
#x = GRU(latent_dim, name='Encoder-Intermediate-GRU', return_sequences=True)(x)
#x = BatchNormalization(name='Encoder-Batchnorm-2')(x)
# We do not need the `encoder_output`, just the hidden state.
_, state_h = tf.keras.layers.GRU(latent_dim, return_state=True, name='Encoder-Last-GRU')(x)
```
It was originally supposed to be used in line 166:

```python
encoder_model = tf.keras.Model(inputs=encoder_inputs, outputs=state_h, name='Encoder-Model')
```
Encapsulating a subset of the layers as a Model object makes it easier to extract the encoder, since it logically groups those layers together under the name "Encoder-Model". However, depending on the use case you are demonstrating here, maybe you do not care about this? Per my other comment, though, you will eventually need to initialize the decoder with the last hidden state of the encoder, which was formerly done via the commented-out encoder_model object. A sketch of the pattern follows.
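For reference, a minimal sketch of the pattern described here, with assumed values for latent_dim, vocabulary size, and sequence length (these are illustrative, not the PR's actual settings):

```python
import tensorflow as tf

latent_dim = 300           # assumed for illustration
num_encoder_tokens = 8000  # assumed vocabulary size
doc_length = 70            # assumed max input length

# Encoder layers, as in the diff above.
encoder_inputs = tf.keras.layers.Input(shape=(doc_length,), name='Encoder-Input')
x = tf.keras.layers.Embedding(
    num_encoder_tokens, latent_dim, name='Body-Word-Embedding',
    mask_zero=False)(encoder_inputs)
x = tf.keras.layers.BatchNormalization(name='Encoder-Batchnorm-1')(x)
_, state_h = tf.keras.layers.GRU(
    latent_dim, return_state=True, name='Encoder-Last-GRU')(x)

# Grouping the layers under one Model makes the encoder trivially reusable:
# encoder_model can later be called on its own to embed new documents.
encoder_model = tf.keras.Model(
    inputs=encoder_inputs, outputs=state_h, name='Encoder-Model')
```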
```python
# encode without decoding if we want to.
encoder_model = tf.keras.Model(inputs=encoder_inputs, outputs=state_h, name='Encoder-Model')
# TODO(jlewi): I commented out the following two lines
# encoder_model = tf.keras.Model(inputs=encoder_inputs, outputs=state_h, name='Encoder-Model')
```
@jlewi the seq2seq_encoder_out should be used to initialize the state of the decoder. However, that code has been commented out, apparently as an interim step by someone? See my comments below.
I commented this line and changed the line below.
@inc0 @hamelsmu Could you please take a look at train.py? I left a few questions for you. I don't think I made any substantive changes. I noticed, however (because of lint errors), that the outputs of various calls aren't being used. Does this mean we can delete those lines? It's unclear to me whether a call to tf.keras.layers builds up some internal state storing the graph, such that even though we don't use the output we still need to call the function.
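For reference, a small illustration (hypothetical layers, not from this PR) of how the functional API treats such calls: invoking a layer on a tensor only records a node, and when a Model is constructed only layers on the path from inputs to outputs are kept, so a call whose output is never used does not end up in the model and can be deleted.

```python
import tensorflow as tf

inputs = tf.keras.layers.Input(shape=(4,))
used = tf.keras.layers.Dense(8, name='used')(inputs)
unused = tf.keras.layers.Dense(8, name='unused')(inputs)  # output never consumed

model = tf.keras.Model(inputs=inputs, outputs=used)
# The 'unused' layer is not part of the model: it is absent from
# model.layers and receives no gradient updates during training.
print([layer.name for layer in model.layers])
```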
```python
x = tf.keras.layers.BatchNormalization(name='Encoder-Batchnorm-1')(x)
# Intermediate GRU layer (optional)
#x = GRU(latent_dim, name='Encoder-Intermediate-GRU', return_sequences=True)(x)
#x = BatchNormalization(name='Encoder-Batchnorm-2')(x)
```
Let's get rid of the commented lines maybe?
Done.
```python
decoder_gru = tf.keras.layers.GRU(
    latent_dim, return_state=True, return_sequences=True, name='Decoder-GRU')
# FIXME: seems to be running into this https://github.com/keras-team/keras/issues/9761
decoder_gru_output, _ = decoder_gru(dec_bn)  # , initial_state=seq2seq_encoder_out)
```
@jlewi this appears to be a bug. The decoder should be initialized with the last hidden state of the encoder; here it is not being initialized with anything. I understand there may have been some breaking changes to the Keras API, per the FIXME comments, so I would suggest holding this PR for now until a workaround is found. This relates to your question about seq2seq_encoder_out being used; the intended wiring is sketched below.
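A sketch of the wiring this comment describes, using the tensor names from the diff (dec_bn and seq2seq_encoder_out):

```python
decoder_gru = tf.keras.layers.GRU(
    latent_dim, return_state=True, return_sequences=True, name='Decoder-GRU')

# Intended call: seed the decoder with the encoder's final hidden state
# (this is the argument the FIXME above comments out).
decoder_gru_output, _ = decoder_gru(dec_bn, initial_state=seq2seq_encoder_out)
```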
Got it, thanks.
@jlewi I added some comments. I haven't kept up with all the changes in the code base, so hopefully my comments are not completely out of context, but I think I identified some issues.
@hamelsmu Thanks. @inc0 Any idea how to address the issues Hamel mentioned (see comments)? I think you identified the issue (the FIXME) in your original PR introducing tf.Estimator. In this PR the only change I want to make is to pass in certain values as command-line arguments rather than hard-coding them.
Please change tfjob.yaml to use the new image, and make sure the commands are correct.
Are we sure about this bug? This code doesn't run with the hidden state being passed? What is the error we get? Curious, as this is something I want to keep an eye on.
> On Thu, Oct 18, 2018 at 7:13 PM Michał Jastrzębski wrote:
>
> @jlewi @hamelsmu not sure. Bug in Keras doesn't seem to be worked on.
Sure; no, but it did raise a traceback when I tried it. :)
@inc0 If I'm understanding Hamel's comments, there are issues with the original code and we wouldn't actually be training a good-quality model. Do you think you could pick this up and try to fix it? Was the issue only with distributed training, or did it happen with non-distributed training as well?
I ran the latest iteration of the code after uncommenting the lines. Here's the stack trace:

Full logs are linked. @inc0, is this the same error you ran into?
If you look at the comment, it suggests that there was an API change in Keras to handle the issue. I tried changing the call to match that syntax, and I'm now getting the following error.

Any idea how to fix this? Do I need to flatten the output of the previous layer? I added a unit test, train_test.py, which can reproduce the error.
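A hypothetical sketch of what such a reproduction test might look like at toy scale; the PR's actual train_test.py may differ, and all sizes here are made up:

```python
import unittest

import numpy as np
import tensorflow as tf


def build_tiny_seq2seq(vocab=100, latent_dim=16, doc_len=12):
    # Mirrors the encoder/decoder layers discussed in this PR, at toy size.
    enc_in = tf.keras.layers.Input(shape=(doc_len,))
    x = tf.keras.layers.Embedding(vocab, latent_dim, mask_zero=False)(enc_in)
    x = tf.keras.layers.BatchNormalization()(x)
    _, state_h = tf.keras.layers.GRU(latent_dim, return_state=True)(x)

    dec_in = tf.keras.layers.Input(shape=(None,))
    d = tf.keras.layers.Embedding(vocab, latent_dim, mask_zero=False)(dec_in)
    d = tf.keras.layers.BatchNormalization()(d)
    gru = tf.keras.layers.GRU(latent_dim, return_state=True, return_sequences=True)
    # The initial_state argument is the call that ran into keras#9761
    # in the TF versions discussed above.
    d, _ = gru(d, initial_state=state_h)
    out = tf.keras.layers.Dense(vocab, activation='softmax')(d)
    model = tf.keras.Model([enc_in, dec_in], out)
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
    return model


class TrainTest(unittest.TestCase):

    def test_fit(self):
        enc = np.random.randint(0, 100, size=(8, 12))
        dec = np.random.randint(0, 100, size=(8, 12))
        target = np.expand_dims(dec, -1)
        model = build_tiny_seq2seq()
        model.fit([enc, dec], target, epochs=1, batch_size=4)


if __name__ == '__main__':
    unittest.main()
```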
I tried following that suggestion and commenting out the call to _standardize_args in recurrent.py. When I did that, I ended up getting a different error. Not sure if this is an improvement or not.
I've refactored the code so that we can train the model either using Keras' fit function or using TF.Estimator (a sketch of the two paths follows below). Training with Keras works, but it looks like inference is giving me a problem. The error with inference is:

As far as I can tell, the code is exactly the same as the current code in train.py and in train.ipynb. The only difference is that we are using the Keras code in TF, not the separate Keras package.
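A minimal sketch of the two training paths described here; the function and variable names are placeholders, not the PR's actual API:

```python
import tensorflow as tf


def train(model, mode, x, y, train_input_fn=None, model_dir=None):
    """Train a compiled tf.keras model via Keras fit or via TF.Estimator."""
    if mode == 'keras':
        # Plain Keras path: this is the one that currently works.
        model.fit(x, y, epochs=1)
    elif mode == 'estimator':
        # Estimator path: convert the compiled Keras model, then train.
        estimator = tf.keras.estimator.model_to_estimator(
            keras_model=model, model_dir=model_dir)
        estimator.train(input_fn=train_input_fn)
```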
Maybe model_to_estimator has a missing feature? It tries to inspect the model and perhaps didn't notice the output tensor? I'll check it out when I'm back home (next week).
I suspect a versioning issue. It looks like the Docker container is using an older Keras: Keras is on 2.2, and 2.1.6 looks like it's from April. (A quick version check is sketched below.)
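One quick way to confirm the suspicion is to print the versions inside the container; note that the Keras bundled with TensorFlow reports its own version string:

```python
import tensorflow as tf

print('tensorflow:', tf.__version__)
print('tf.keras:', tf.keras.__version__)  # e.g. '2.1.6-tf' for the bundled copy

import keras  # the standalone package, if installed
print('standalone keras:', keras.__version__)
```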
/assign @amygdala

@amygdala I think this is pretty much ready to review.
```
@@ -1,5 +1,10 @@
# Distributed training using Estimator

Distributed training with keras currently doesn't work; see

* kubeflow/examples#280
```
You're training with Keras + tf-estimator, just not tf.keras. Is this still accurate?
I don't know how to run distributed training without TF.Estimator.
The code using TF.Estimator doesn't work. So right now we don't have a way to run distributed training.
```
@@ -1,69 +0,0 @@
---
apiVersion: "kubeflow.org/v1alpha2"
```
Any reason we're deleting it? It would still make sense to keep it for educational purposes, for people not familiar with ksonnet.
The TF.Estimator code doesn't work, so running the code with the YAML will just produce errors; I think keeping it right now is just a source of confusion. If/when we fix the code to work with TF.Estimator, we can restore it.
+1 to reducing confusion by removing files that don't run
But the only changes needed are the command and the image, so it'd be very easy to make the YAML work, and I think it's more readable than ksonnet. Why not fix the YAML rather than delete it?
If you'd like to do that in a follow-on PR, I'm open to it. I'd like to get the example back to a healthy state, and having fewer files is helpful.
I'll dig into this on the weekend.

Thanks @amygdala
Unify the code for training with Keras and TF.Estimator:

* Create a single train.py and trainer.py which use Keras inside TensorFlow.
* Provide options to train with either Keras or TF.Estimator.
* The code to train with TF.Estimator doesn't work; see kubeflow#196. The original PR (kubeflow#203) worked around a blocking issue with Keras and TF.Estimator by commenting out certain layers in the model architecture, leading to a model that wouldn't generate meaningful predictions.
* We weren't able to get TF.Estimator working, but this PR should make it easier to troubleshoot further.
* We've unified the existing code so that we don't duplicate it just to train with TF.Estimator.
* We've added unit tests that can be used to verify that training with TF.Estimator works; they can also be used to reproduce the current errors with TF.Estimator.

Other changes:

* Add a Makefile to build the Docker image.
* Add an NFS PVC to our Kubeflow demo deployment.
* Create a tfjob-estimator component in our ksonnet app.
* Changes to distributed/train.py as part of merging with notebooks/train.py:
  * Add command-line arguments to specify paths rather than hard-coding them (a sketch of such flags follows below).
  * Remove the code at the start of train.py that waits until the input data becomes available. I think the original intent was to allow the TFJob to be started simultaneously with the preprocessing job and just block until the data is available; that should be unnecessary, since we can run the preprocessing job as a separate job.
* Fix notebooks/train.py (kubeflow#186): the code wasn't actually calling Model.fit. Add a unit test to verify we can invoke fit and evaluate without throwing exceptions.
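As a hedged sketch, the command-line flags mentioned above might look roughly like this (flag names are illustrative, not necessarily the ones in train.py):

```python
import argparse


def parse_args():
    parser = argparse.ArgumentParser(description='Train the issue summarizer.')
    parser.add_argument('--input_data', required=True,
                        help='Path to the preprocessed training data.')
    parser.add_argument('--output_dir', required=True,
                        help='Directory for checkpoints and the saved model.')
    parser.add_argument('--mode', choices=['keras', 'estimator'],
                        default='keras', help='Which training path to use.')
    return parser.parse_args()
```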
/unassign @amygdala
### Training and Deploying the model.

We use the ksonnet app in **github/kubeflow/examples/github_issue_summarization/ks-kubeflow**.
Should this be a link instead?
Fixed
```python
# We'd like to switch to importing keras from TensorFlow in order to support
# TF.Estimator but using tensorflow.keras we can't train a model either using
# Keras' fit function or using TF.Estimator.
# from tensorflow import keras
```
Remove if not needed.
Fixed
```python
if num_samples:
    traindf, self.test_df = train_test_split(
        pd.read_csv(data_file).sample(n=num_samples),
        test_size=.10)
```
Indentation seems off.
Fixed.
```python
self.output_dir = output_dir

self.tf_config = os.environ.get('TF_CONFIG', '{}')
```
Is this needed for keras?
This is needed to use TF.Estimator; an example of the TF_CONFIG format that TFJob sets is sketched below.
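For context, TF.Estimator discovers the cluster topology from the TF_CONFIG environment variable, which TFJob sets on each replica. A representative value (addresses and roles are illustrative) looks like this:

```python
import json
import os

tf_config = {
    'cluster': {
        'master': ['master-0:2222'],
        'worker': ['worker-0:2222', 'worker-1:2222'],
        'ps': ['ps-0:2222'],
    },
    'task': {'type': 'worker', 'index': 0},
}
# TFJob sets this automatically; shown here only to illustrate the format
# that os.environ.get('TF_CONFIG', '{}') parses.
os.environ['TF_CONFIG'] = json.dumps(tf_config)
```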
/lgtm

/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: jlewi, richardsliu.