A bunch of changes to support distributed training using tf.estimator #265
Conversation
Ready for a rebase.
```python
# Word embedding for encoder (ex: Issue Body)
x = tf.keras.layers.Embedding(
    num_encoder_tokens, latent_dim, name='Body-Word-Embedding', mask_zero=False)(encoder_inputs)
x = tf.keras.layers.BatchNormalization(name='Encoder-Batchnorm-1')(x)
```
@jlewi look at line 160 below; note that the output of the batch norm layer is used as input to the GRU layer:

```python
_, state_h = tf.keras.layers.GRU(latent_dim, return_state=True, name='Encoder-Last-GRU')(x)
```
Ack. Resolved.
```python
#x = GRU(latent_dim, name='Encoder-Intermediate-GRU', return_sequences=True)(x)
#x = BatchNormalization(name='Encoder-Batchnorm-2')(x)
# We do not need the `encoder_output`, just the hidden state.
_, state_h = tf.keras.layers.GRU(latent_dim, return_state=True, name='Encoder-Last-GRU')(x)
```
It was originally supposed to be used in line 166:

```python
encoder_model = tf.keras.Model(inputs=encoder_inputs, outputs=state_h, name='Encoder-Model')
```
Encapsulating a subset of the layers as a Model object makes it easier to extract the encoder, since it logically groups those layers together under the name "Encoder-Model". However, depending on the use case you are demonstrating here, maybe you do not care about this? Per my other comment, though, you will eventually need to initialize the decoder with the last hidden state of the encoder, which was formerly done via the commented-out encoder_model object. A sketch of the pattern follows.
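For reference, a minimal sketch of the pattern described here, with assumed values for latent_dim, vocabulary size, and sequence length (these are illustrative, not the PR's actual settings):

```python
import tensorflow as tf

latent_dim = 300           # assumed for illustration
num_encoder_tokens = 8000  # assumed vocabulary size
doc_length = 70            # assumed max input length

# Encoder layers, as in the diff above.
encoder_inputs = tf.keras.layers.Input(shape=(doc_length,), name='Encoder-Input')
x = tf.keras.layers.Embedding(
    num_encoder_tokens, latent_dim, name='Body-Word-Embedding',
    mask_zero=False)(encoder_inputs)
x = tf.keras.layers.BatchNormalization(name='Encoder-Batchnorm-1')(x)
_, state_h = tf.keras.layers.GRU(
    latent_dim, return_state=True, name='Encoder-Last-GRU')(x)

# Grouping the layers under one Model makes the encoder trivially reusable:
# encoder_model can later be called on its own to embed new documents.
encoder_model = tf.keras.Model(
    inputs=encoder_inputs, outputs=state_h, name='Encoder-Model')
```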
```python
# encode without decoding if we want to.
encoder_model = tf.keras.Model(inputs=encoder_inputs, outputs=state_h, name='Encoder-Model')
# TODO(jlewi): I commented out the following two lines
# encoder_model = tf.keras.Model(inputs=encoder_inputs, outputs=state_h, name='Encoder-Model')
```
@jlewi the seq2seq_encoder_out should be used to initialize the state of the decoder. However, that code has been commented out, apparently as an interim step by someone? See my comments below.
I commented this line and changed the line below.
@inc0 @hamelsmu Could you please take a look at train.py? I left a few questions for you. I don't think I made any substantive changes. I noticed, however (because of lint errors), that the outputs of various calls aren't being used. Does this mean we can delete those lines? It's unclear to me whether a call to tf.keras.layers builds up some internal state storing the graph, such that even though we don't use the output we still need to call the function.
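For reference, a small illustration (hypothetical layers, not from this PR) of how the functional API treats such calls: invoking a layer on a tensor only records a node, and when a Model is constructed only layers on the path from inputs to outputs are kept, so a call whose output is never used does not end up in the model and can be deleted.

```python
import tensorflow as tf

inputs = tf.keras.layers.Input(shape=(4,))
used = tf.keras.layers.Dense(8, name='used')(inputs)
unused = tf.keras.layers.Dense(8, name='unused')(inputs)  # output never consumed

model = tf.keras.Model(inputs=inputs, outputs=used)
# The 'unused' layer is not part of the model: it is absent from
# model.layers and receives no gradient updates during training.
print([layer.name for layer in model.layers])
```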
```python
x = tf.keras.layers.BatchNormalization(name='Encoder-Batchnorm-1')(x)
# Intermediate GRU layer (optional)
#x = GRU(latent_dim, name='Encoder-Intermediate-GRU', return_sequences=True)(x)
#x = BatchNormalization(name='Encoder-Batchnorm-2')(x)
```
Let's get rid of the commented lines maybe?
Done.
```python
decoder_gru = tf.keras.layers.GRU(
    latent_dim, return_state=True, return_sequences=True, name='Decoder-GRU')
# FIXME: seems to be running into this https://github.com/keras-team/keras/issues/9761
decoder_gru_output, _ = decoder_gru(dec_bn)  # , initial_state=seq2seq_encoder_out)
```
@jlewi this appears to be a bug. The decoder should be initialized with the last hidden state of the encoder; here it is not being initialized with anything. I understand there may have been some breaking changes to the Keras API, per the FIXME comments, so I would suggest holding this PR for now until a workaround is found. This relates to your question about seq2seq_encoder_out being used; the intended wiring is sketched below.
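A sketch of the wiring this comment describes, using the tensor names from the diff (dec_bn and seq2seq_encoder_out):

```python
decoder_gru = tf.keras.layers.GRU(
    latent_dim, return_state=True, return_sequences=True, name='Decoder-GRU')

# Intended call: seed the decoder with the encoder's final hidden state
# (this is the argument the FIXME above comments out).
decoder_gru_output, _ = decoder_gru(dec_bn, initial_state=seq2seq_encoder_out)
```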
Got it, thanks.
@jlewi I added some comments. I haven't kept up with all the changes in the code base, so hopefully my comments are not completely out of context, but I think I identified some issues.
@hamelsmu Thanks. @inc0 Any idea how to address the issues Hamel mentioned (see comments)? I think you identified the issue (the FIXME) in your original PR introducing tf.Estimator. In this PR the only change I want to make is to pass in certain values as command-line arguments rather than hard-coding them.
Please change tfjob.yaml to use the new image, and make sure the commands are correct.
Are we sure about this bug? This code doesn't run with the hidden state being passed? What is the error we get? Curious, as this is something I want to keep an eye on.
> On Thu, Oct 18, 2018 at 7:13 PM Michał Jastrzębski wrote:
>
> @jlewi @hamelsmu not sure. Bug in Keras doesn't seem to be worked on.
Sure; no, but it did raise a traceback when I tried it. :)
@inc0 If I'm understanding Hamel's comments, there are issues with the original code and we wouldn't actually be training a good-quality model. Do you think you could pick this up and try to fix it? Was the issue only with distributed training, or did it happen with non-distributed training as well?
I ran the latest iteration of the code after uncommenting the lines. Here's the stack trace:

Full logs are linked. @inc0, is this the same error you ran into?
If you look at the comment, it suggests that there was an API change in Keras to handle the issue. I tried changing the call to match that syntax, and I'm now getting the following error.

Any idea how to fix this? Do I need to flatten the output of the previous layer? I added a unit test, train_test.py, which can reproduce the error.
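A hypothetical sketch of what such a reproduction test might look like at toy scale; the PR's actual train_test.py may differ, and all sizes here are made up:

```python
import unittest

import numpy as np
import tensorflow as tf


def build_tiny_seq2seq(vocab=100, latent_dim=16, doc_len=12):
    # Mirrors the encoder/decoder layers discussed in this PR, at toy size.
    enc_in = tf.keras.layers.Input(shape=(doc_len,))
    x = tf.keras.layers.Embedding(vocab, latent_dim, mask_zero=False)(enc_in)
    x = tf.keras.layers.BatchNormalization()(x)
    _, state_h = tf.keras.layers.GRU(latent_dim, return_state=True)(x)

    dec_in = tf.keras.layers.Input(shape=(None,))
    d = tf.keras.layers.Embedding(vocab, latent_dim, mask_zero=False)(dec_in)
    d = tf.keras.layers.BatchNormalization()(d)
    gru = tf.keras.layers.GRU(latent_dim, return_state=True, return_sequences=True)
    # The initial_state argument is the call that ran into keras#9761
    # in the TF versions discussed above.
    d, _ = gru(d, initial_state=state_h)
    out = tf.keras.layers.Dense(vocab, activation='softmax')(d)
    model = tf.keras.Model([enc_in, dec_in], out)
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
    return model


class TrainTest(unittest.TestCase):

    def test_fit(self):
        enc = np.random.randint(0, 100, size=(8, 12))
        dec = np.random.randint(0, 100, size=(8, 12))
        target = np.expand_dims(dec, -1)
        model = build_tiny_seq2seq()
        model.fit([enc, dec], target, epochs=1, batch_size=4)


if __name__ == '__main__':
    unittest.main()
```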
I tried following that suggestion and commenting out the call to _standardize_args in recurrent.py. When I did that, I ended up getting a different error. Not sure if this is an improvement or not.
I've refactored the code so that we can train the model either using Keras' fit function or using TF.Estimator (a sketch of the two paths follows below). Training with Keras works, but it looks like inference is giving me a problem. The error with inference is:

As far as I can tell, the code is exactly the same as the current code in train.py and in train.ipynb. The only difference is that we are using the Keras code in TF, not the separate Keras package.
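A minimal sketch of the two training paths described here; the function and variable names are placeholders, not the PR's actual API:

```python
import tensorflow as tf


def train(model, mode, x, y, train_input_fn=None, model_dir=None):
    """Train a compiled tf.keras model via Keras fit or via TF.Estimator."""
    if mode == 'keras':
        # Plain Keras path: this is the one that currently works.
        model.fit(x, y, epochs=1)
    elif mode == 'estimator':
        # Estimator path: convert the compiled Keras model, then train.
        estimator = tf.keras.estimator.model_to_estimator(
            keras_model=model, model_dir=model_dir)
        estimator.train(input_fn=train_input_fn)
```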
Maybe model_to_estimator has a missing feature? It tries to inspect the model and perhaps didn't notice the output tensor? I'll check it out when I'm back home (next week).
I suspect a versioning issue. It looks like the Docker container is using an older Keras: Keras is on 2.2, and 2.1.6 looks like it's from April. (A quick version check is sketched below.)
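One quick way to confirm the suspicion is to print the versions inside the container; note that the Keras bundled with TensorFlow reports its own version string:

```python
import tensorflow as tf

print('tensorflow:', tf.__version__)
print('tf.keras:', tf.keras.__version__)  # e.g. '2.1.6-tf' for the bundled copy

import keras  # the standalone package, if installed
print('standalone keras:', keras.__version__)
```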
/assign @amygdala

@amygdala I think this is pretty much ready to review.
```
@@ -1,5 +1,10 @@
# Distributed training using Estimator

Distributed training with keras currently doesn't work; see

* kubeflow/examples#280
```
You're training with Keras + tf-estimator, just not tf.keras. Is this still accurate?
I don't know how to run distributed training without TF.Estimator.
The code using TF.Estimator doesn't work. So right now we don't have a way to run distributed training.
```
@@ -1,69 +0,0 @@
---
apiVersion: "kubeflow.org/v1alpha2"
```
Any reason we're deleting it? It would still make sense to keep it for educational purposes, for people not familiar with ksonnet.
The TF.Estimator code doesn't work, so running the code with the YAML will just produce errors; I think keeping it right now is just a source of confusion. If/when we fix the code to work with TF.Estimator, we can restore it.
+1 to reducing confusion by removing files that don't run
But the only changes needed are the command and the image, so it'd be very easy to make the YAML work, and I think it's more readable than ksonnet. Why not fix the YAML rather than delete it?
If you'd like to do that in a follow-on PR, I'm open to it. I'd like to get the example back to a healthy state, and having fewer files is helpful.
I'll dig into this on the weekend.

Thanks @amygdala
Unify the code for training with Keras and TF.Estimator:

* Create a single train.py and trainer.py which use Keras inside TensorFlow.
* Provide options to train with either Keras or TF.Estimator.
* The code to train with TF.Estimator doesn't work; see kubeflow#196. The original PR (kubeflow#203) worked around a blocking issue with Keras and TF.Estimator by commenting out certain layers in the model architecture, leading to a model that wouldn't generate meaningful predictions.
* We weren't able to get TF.Estimator working, but this PR should make it easier to troubleshoot further.
* We've unified the existing code so that we don't duplicate it just to train with TF.Estimator.
* We've added unit tests that can be used to verify that training with TF.Estimator works; they can also be used to reproduce the current errors with TF.Estimator.

Other changes:

* Add a Makefile to build the Docker image.
* Add an NFS PVC to our Kubeflow demo deployment.
* Create a tfjob-estimator component in our ksonnet app.
* Changes to distributed/train.py as part of merging with notebooks/train.py:
  * Add command-line arguments to specify paths rather than hard-coding them (a sketch of such flags follows below).
  * Remove the code at the start of train.py that waits until the input data becomes available. I think the original intent was to allow the TFJob to be started simultaneously with the preprocessing job and just block until the data is available; that should be unnecessary, since we can run the preprocessing job as a separate job.
* Fix notebooks/train.py (kubeflow#186): the code wasn't actually calling Model.fit. Add a unit test to verify we can invoke fit and evaluate without throwing exceptions.
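As a hedged sketch, the command-line flags mentioned above might look roughly like this (flag names are illustrative, not necessarily the ones in train.py):

```python
import argparse


def parse_args():
    parser = argparse.ArgumentParser(description='Train the issue summarizer.')
    parser.add_argument('--input_data', required=True,
                        help='Path to the preprocessed training data.')
    parser.add_argument('--output_dir', required=True,
                        help='Directory for checkpoints and the saved model.')
    parser.add_argument('--mode', choices=['keras', 'estimator'],
                        default='keras', help='Which training path to use.')
    return parser.parse_args()
```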
/unassign @amygdala
### Training and Deploying the model.

We use the ksonnet app in **github/kubeflow/examples/github_issue_summarization/ks-kubeflow**.
Should this be a link instead?
Fixed
```python
# We'd like to switch to importing keras from TensorFlow in order to support
# TF.Estimator but using tensorflow.keras we can't train a model either using
# Keras' fit function or using TF.Estimator.
# from tensorflow import keras
```
Remove if not needed.
Fixed
```python
if num_samples:
    traindf, self.test_df = train_test_split(
        pd.read_csv(data_file).sample(n=num_samples),
        test_size=.10)
```
Indentation seems off.
Fixed.
```python
self.output_dir = output_dir

self.tf_config = os.environ.get('TF_CONFIG', '{}')
```
Is this needed for keras?
This is needed to use TF.Estimator; an example of the TF_CONFIG format that TFJob sets is sketched below.
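For context, TF.Estimator discovers the cluster topology from the TF_CONFIG environment variable, which TFJob sets on each replica. A representative value (addresses and roles are illustrative) looks like this:

```python
import json
import os

tf_config = {
    'cluster': {
        'master': ['master-0:2222'],
        'worker': ['worker-0:2222', 'worker-1:2222'],
        'ps': ['ps-0:2222'],
    },
    'task': {'type': 'worker', 'index': 0},
}
# TFJob sets this automatically; shown here only to illustrate the format
# that os.environ.get('TF_CONFIG', '{}') parses.
os.environ['TF_CONFIG'] = json.dumps(tf_config)
```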
/lgtm

/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: jlewi, richardsliu.