
RFC: Generalizing tf.data batching using windowing and reducers #5

Merged
merged 3 commits into master from ewilderj-rfc-tfdata on Sep 19, 2018

Conversation

ewilderj
Contributor

This RFC will be open for comment until Friday, August 10th, 2018.

cc @jsimsa @mrry

Generalizing tf.data batching using windowing and reducers

| Status        | Proposed              |
| :------------ | :-------------------- |
| **Author(s)** | Jiri Simsa (Google)   |
| **Sponsor**   | Derek Murray (Google) |
| **Updated**   | 2018-07-26            |

Objective

This proposal addresses the known limitations of the current tf.data batching API:

  • it provides a mechanism for padded batching of sparse tensors
  • it facilitates customization of batching logic (users can now express batching logic as a pure Python function)
  • it enables application of different batching logic on different components

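The RFC's examples are not reproduced in this description; the sketch below is not the proposal's verbatim code, but illustrates the core idea using the tf.data.Dataset.window API that eventually shipped: a window is itself a dataset, so batching logic becomes an ordinary Python function over that dataset, and different components can be batched differently.

```python
import tensorflow as tf

# Rough sketch of the idea above (not the RFC's verbatim example): each window is
# itself a dataset, so custom, per-component batching is just a function over windows.
dense = tf.data.Dataset.range(8)
ragged = tf.data.Dataset.range(8).map(lambda x: tf.range(x % 3 + 1))
dataset = tf.data.Dataset.zip((dense, ragged))

def batch_window(dense_window, ragged_window):
  # Regular batching for the fixed-shape component, padded batching for the
  # variable-length component.
  return tf.data.Dataset.zip((
      dense_window.batch(4),
      ragged_window.padded_batch(4, padded_shapes=[None])))

batched = dataset.window(4).flat_map(batch_window)
for dense_batch, padded_batch in batched:
  print(dense_batch.numpy(), padded_batch.numpy().shape)
```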
@ewilderj ewilderj changed the title Create 20180726-tf-data-windowing-reducers.md Generalizing tf.data batching using windowing and reducers Jul 26, 2018
@ewilderj ewilderj changed the title Generalizing tf.data batching using windowing and reducers RFC: Generalizing tf.data batching using windowing and reducers Jul 26, 2018
@ewilderj ewilderj added the RFC: Proposed RFC Design Document label Jul 26, 2018

## Open Questions

* Any interest in the window transformation supporting parameters for specifying the window shift and stride (similar to tf.contrib.data.sliding_window_batch)? Is there any other type of windowing that people are interested in?
Contributor

@bhack bhack Jul 26, 2018

What about the video analysis or video inpainting subfields? Do we have some common shift and stride use cases for these?

Contributor

The stride/shift use cases for video analysis that I am aware of are covered by tf.contrib.data.sliding_window_batch.
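For reference, the window transformation as it eventually landed, `tf.data.Dataset.window(size, shift, stride, drop_remainder)` (see the commit referenced later in this thread), can express the sliding_window_batch behavior directly; a small sketch:

```python
import tensorflow as tf

# Sliding windows of 3 consecutive elements, advancing one element at a time.
dataset = tf.data.Dataset.range(8)
sliding = dataset.window(size=3, shift=1, drop_remainder=True).flat_map(
    lambda window: window.batch(3))

for batch in sliding:
  print(batch.numpy())  # [0 1 2], [1 2 3], ..., [5 6 7]
```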

def count(dataset):
  """Counts the elements of a dataset."""

  def init_fn(_):


What is expected to be passed into the init_fn?

Contributor

It is the group key, which only makes sense if the reducer is used as part of the tf.contrib.data.group_by_reducer. We are still debating whether to have an unused init_fn argument in the non-keyed reducer or different interfaces for keyed and non-keyed reducers.
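A sketch of the keyed case described here, showing the init function receiving the group key produced by key_func. It is written against the tf.data.experimental names that the tf.contrib.data symbols later moved to; the exact symbol names are an assumption of this sketch, not part of the RFC text.

```python
import tensorflow as tf

def init_fn(key):        # `key` is the group key (here: element % 2)
  return tf.constant(0, dtype=tf.int64)

def reduce_fn(state, _):
  return state + 1

def finalize_fn(state):
  return state

reducer = tf.data.experimental.Reducer(init_fn, reduce_fn, finalize_fn)

# Count even and odd numbers separately.
counts = tf.data.Dataset.range(10).apply(
    tf.data.experimental.group_by_reducer(lambda x: x % 2, reducer))

for c in counts:
  print(c.numpy())  # 5, 5
```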

@carlthome

Super!

Having control over batching directly in the Python API would be great, and it would let me solve tensorflow/tensorflow#20781 by making a tf.data.Reducer that sorts the tensor on the batch axis by some metadata like sequence id.
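For illustration only (not part of the RFC), the core of such a transformation might look like the following; the helper name and the assumption that elements are already batched (features, keys) pairs are hypothetical:

```python
import tensorflow as tf

# Hypothetical helper: reorder an already-batched element along the batch axis
# by a per-example key such as a sequence id or length.
def sort_batch_by_key(batch, keys):
  order = tf.argsort(keys)
  return tf.gather(batch, order), tf.gather(keys, order)

# Usage sketch, assuming `dataset` yields (features, keys) batches:
# dataset = dataset.map(sort_batch_by_key)
```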

@Mistobaan

It feels like consuming the dataset in batches should be pushed downstream in the API, something along the lines of .get_next(batch_size=#). The reduce API is something you would apply to the content of the dataset, not to its packaging/consumption.

@jsimsa
Contributor

jsimsa commented Jul 30, 2018

Thank you for your comment @Mistobaan. How would you account for the limitations that the current proposal aims to address? In particular, I am not sure how .get_next(batch_size=#) would enable users to express different types of batching (e.g. regular, padded, truncated, ...) or per-component batching. I can also imagine that giving the user program the flexibility to change the batch dimension of consecutive elements of the pipeline would hamper certain optimizations (e.g. prefetching).

@carlthome

carlthome commented Aug 2, 2018

Looking a little more closely at this, I wonder if the return value of tf.data.Dataset.reduce() shouldn't also be a tf.data.Dataset, because that seems to be slowly becoming the expected object type in high-level APIs such as tf.estimator and tf.keras, particularly with eager execution:

for minibatch in dataset:
    model.train_on_batch(minibatch)

And with lazy (graph) execution:

estimator.train(lambda: dataset)

model.fit(dataset, steps=num_batches_in_dataset)

It will probably be confusing if only some of the Dataset methods are for chaining. Thoughts?

@jsimsa
Contributor

jsimsa commented Aug 10, 2018

@carlthome I would actually argue that, because of the integration with tf.estimator or tf.keras, you would want the result to be a tensor as opposed to a dataset.

As illustrated by the examples in the proposal, one common use of tf.data.Dataset.reduce() is to reduce windows of tensors of a dataset to a single tensor. If the reduced element were a dataset instead, we would end up with a dataset of datasets, which is something that tf.estimator or tf.keras would not necessarily know how to handle. If you want to pass a single tensor produced by tf.data.Dataset.reduce() to tf.estimator or tf.keras, you can always use tf.data.Dataset.from_tensors to wrap it in a dataset.
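For instance, a minimal sketch using the reduce signature that eventually shipped (initial state plus reduce function, rather than the RFC's Reducer object):

```python
import tensorflow as tf

# reduce yields a tensor, and tf.data.Dataset.from_tensors wraps it back into a
# single-element dataset when an API expects a dataset.
total = tf.data.Dataset.range(10).reduce(
    tf.constant(0, dtype=tf.int64), lambda state, value: state + value)
wrapped = tf.data.Dataset.from_tensors(total)  # a dataset with one element: 45
```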


#### Example 0: Count Dense Tensors

To introduce the concept of tf.data to readers unfamiliar with it, we illustrate how it can be used to count the elements of a dataset:


Did you mean "To introduce the concept of tf.data.Reducer"?
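Example 0 itself is not quoted in this thread; for reference, a minimal sketch of counting with the reduce API as it eventually shipped (the RFC's Reducer-object form differs slightly):

```python
import tensorflow as tf

def count(dataset):
  """Counts the elements of a dataset."""
  return dataset.reduce(tf.constant(0, dtype=tf.int64),
                        lambda state, _: state + 1)

print(count(tf.data.Dataset.range(42)).numpy())  # 42
```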

tensorflow-copybara pushed a commit to tensorflow/tensorflow that referenced this pull request Sep 17, 2018
[tf.data] Introducing `tf.data.Dataset.window(size, shift, stride, drop_remainder)`, which can be used for combining elements of the input dataset into "windows". A window is itself a finite dataset and, among other things, can be used for generalized batching (see tensorflow/community#5 for details).

PiperOrigin-RevId: 213360134
Edd Wilder-James added 2 commits September 19, 2018 13:34
Updates following the review committee.
Status => accepted. Update revision date.
@ewilderj ewilderj added RFC: Accepted RFC Design Document: Accepted by Review and removed RFC: Proposed RFC Design Document labels Sep 19, 2018
@ewilderj ewilderj merged commit ce1b08d into master Sep 19, 2018
@ewilderj ewilderj deleted the ewilderj-rfc-tfdata branch September 19, 2018 20:53
ganny26 added a commit to ganny26/tensorflow that referenced this pull request Sep 21, 2018
@georgesterpu

Hi @jsimsa
Is it possible to use the improved API to pad sequences with something other than constant values?
The tf.pad transformation supports the "reflect" and "symmetric" padding modes. Is a PaddedBatchDataset currently limited to the "constant" padding mode?

@jsimsa
Contributor

jsimsa commented Nov 26, 2019

As Example 1 illustrates, the API allows you to write custom padding logic, applying an arbitrary map transformation (e.g. using tf.pad) to the elements to be padded.

In contrast, padded_batch is limited to constant (or maximum) padding.

@georgesterpu

This looks really great, @jsimsa!
Would you please help me with an example of a pad_fn that, when applied to a WindowDataset, produces symmetrically right-padded sequences up to the length of the longest one? There are a few differences between the examples in the RFC and the current codebase, and I can't get the snippet below to work:

import tensorflow as tf
import numpy as np

data = []
for _ in range(50):
    random_length = np.random.randint(low=10, high=100)
    data.append(tf.ones([random_length, 128]))

dataset = tf.data.Dataset.from_generator(lambda: data, output_types=tf.float32)


def pad_fn(sequences):
    return tf.pad(sequences, paddings=None, mode='SYMMETRIC')


dataset = dataset.window(4)
dataset = dataset.map(pad_fn)

for elem in dataset:
    print(elem)  # expected shape: [4, max_ts_batch, 128]
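For what it's worth, here is a hedged sketch of one way the snippet above could be arranged with the window API: each window is itself a dataset, so the per-window maximum length can be computed with a nested reduce before padding and batching. This has not been verified against a specific TF release, and note that tf.pad's SYMMETRIC mode only allows padding up to the input's own length in each dimension, so windows mixing very short and very long sequences would still fail.

```python
import tensorflow as tf

WINDOW_SIZE = 4
lengths = (12, 20, 17, 15, 30, 28, 25, 22)  # chosen so the SYMMETRIC cap holds
data = [tf.ones([n, 128]) for n in lengths]

dataset = tf.data.Dataset.from_generator(
    lambda: data, output_types=tf.float32, output_shapes=[None, 128])

def pad_window(window):
  # `window` is a finite dataset of [time, 128] tensors.
  max_len = window.map(lambda x: tf.shape(x)[0]).reduce(
      tf.constant(0, dtype=tf.int32), tf.maximum)
  pad_one = lambda x: tf.pad(
      x, [[0, max_len - tf.shape(x)[0]], [0, 0]], mode='SYMMETRIC')
  return window.map(pad_one).batch(WINDOW_SIZE)

dataset = dataset.window(WINDOW_SIZE).flat_map(pad_window)

for elem in dataset:
  print(elem.shape)  # (4, max_len_in_window, 128)
```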

theadactyl pushed a commit that referenced this pull request Sep 29, 2020
Updates based on recent changes and discussion on RFC PR #262
Labels
RFC: Accepted RFC Design Document: Accepted by Review

7 participants