A model defined with the high-level TensorFlow Fold API is a single Block object that converts an input data structure of some kind to a tensor or tuple of tensors. There are a number of ways to accomplish this, depending on the use-case.
The simplest way to run a block is to eval individual inputs interactively from a REPL. For example:
td.OneHot(5).eval(3) => array([ 0., 0., 0., 1., 0.], dtype=float32)
This works for composite blocks also:
(td.Scalar() >>
td.AllOf(td.Function(tf.negative), td.Function(tf.square))).eval(2)
=> (array(-2.0, dtype=float32), array(4.0, dtype=float32))
Eval also works on blocks that produce sequences of indeterminate length (i.e. varying on a per-example basis) as outputs:
td.Map(td.Scalar() >> td.Function(tf.square)).eval(xrange(5))
=> [array(0.0, dtype=float32),
array(1.0, dtype=float32),
array(4.0, dtype=float32),
array(9.0, dtype=float32),
array(16.0, dtype=float32)]
Such blocks cannot be compiled and run on batches of inputs, because TensorFlow does not support ragged-edged tensors. Fold does however provide the Metric block, which supports batched outputs that vary in size on a per-example basis.
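As a rough sketch of how this can be used (assuming an active TensorFlow session sess, and using the Compiler machinery introduced in the next paragraphs), we can square each element of a variable-length input, record the result as a metric, and read the concatenated per-example values back from Compiler.metric_tensors:
# A hedged sketch: td.Void() discards the sequence output so that the block
# as a whole can be compiled; the recorded metric values are still exposed.
block = (td.Map(td.Scalar() >> td.Function(tf.square) >> td.Metric('squares'))
         >> td.Void())
compiler = td.Compiler(block)
sess.run(compiler.metric_tensors['squares'],
         compiler.build_feed_dict([range(3)]))
=> array([ 0., 1., 4.], dtype=float32)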
Typically we want to run Fold on batches of inputs, to exploit TensorFlow parallelism. In this case, we first create an explicit Compiler, which compiles our model down to a TensorFlow graph. The compiler will also do type inference and validation on the model. The outputs of the model are exposed as ordinary TensorFlow tensors, which can be connected to TensorFlow loss functions and optimizers in the usual way.
By default, the graph produced by the compiler uses a placeholder, Compiler.loom_input_tensor, for input data. This gets filled in with loom inputs (serialized protobufs) using the TensorFlow feed_dict mechanism. During training, the compiler is also used to convert batches of input data into feed_dicts, by building them into loom inputs.
The overall structure of a running model follows this outline:
# Define the Fold model.
root_block = insert_some_block_definition_here()
# Compile root_block to get a TensorFlow model that we can run.
compiler = td.Compiler(root_block)
# Get the TensorFlow tensor(s) that correspond to the outputs of root_block.
(model_output,) = compiler.output_tensors
# Compute a loss of some kind using TensorFlow.
loss = tf.nn.l2_loss(model_output)
# Hook up a TensorFlow optimizer.
train_op = tf.train.AdamOptimizer(FLAGS.learning_rate).minimize(loss)
# Convert a batch of examples into a TensorFlow feed_dict.
fd = compiler.build_feed_dict(batch_of_examples)
# Now train the model on the batch of examples.
sess.run(train_op, feed_dict=fd)
If your dataset is small enough to fit in memory, we recommend building loom inputs once and keeping them around. A simple Fold training loop for small datasets is as follows:
train_set = compiler.build_loom_inputs(load_examples(train_file))
train_feed_dict = {}  # add feeds for e.g. dropout here
dev_feed_dict = compiler.build_feed_dict(load_examples(dev_file))
for epoch, shuffled in enumerate(td.epochs(train_set, num_epochs), 1):
  train_loss = 0.0
  for batch in td.group_by_batches(shuffled, batch_size):
    train_feed_dict[compiler.loom_input_tensor] = batch
    _, batch_loss = sess.run([train, loss], train_feed_dict)
    train_loss += batch_loss
  dev_loss = sess.run(loss, dev_feed_dict)
  print 'epoch: %s train: %s dev: %s' % (epoch, train_loss, dev_loss)
The code assumes that you have already created a compiler for your model, have train and loss tensors, and have a load_examples function.
This idiom exploits Python generators to do work as needed and avoid blocking.
- Compiler.build_loom_inputs takes an iterable of examples and returns an iterable of loom inputs in one-to-one correspondence.
- Compiler.build_feed_dict takes an iterable of examples and returns a lazily computed feed dictionary. The feed values are cached internally, so they only get computed once.
- td.epochs is a simple utility function that repeatedly yields an iterable, caching and shuffling it after the first yield. The reason for using build_loom_inputs for our training set is so that we can shuffle after each epoch and get different batches on each pass.
- td.group_by_batches is an even simpler utility that lazily batches an iterable into lists. A short sketch of td.epochs and td.group_by_batches in isolation follows below.
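As a toy, hedged illustration of these two utilities on plain Python lists (no compiler or session involved):
# group_by_batches lazily yields lists of at most the given batch size.
list(td.group_by_batches(range(5), 2)) => [[0, 1], [2, 3], [4]]
# epochs yields its items the given number of times, shuffling after the first pass.
for shuffled in td.epochs([1, 2, 3], 2):
  print list(shuffled)  # e.g. [1, 2, 3] on the first pass, [3, 1, 2] on the second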
What all this means is that TensorFlow training begins as soon as the first batch of examples has been built into loom inputs. To take full advantage of this, load_examples should ideally return a generator.
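For instance, a hypothetical load_examples could lazily parse one example per line of a file (parse_example here is an assumed user-defined helper, not part of Fold):
def load_examples(filename):
  # Yielding lazily lets loom-input building (and hence training) start
  # before the whole file has been read.
  with open(filename) as f:
    for line in f:
      yield parse_example(line)  # hypothetical parser for a single example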
If your dataset is too large to fit in memory, or if you only want to process each example once (e.g. for inference), it is cleaner and more efficient to use Compiler.build_loom_input_batched, like so:
for batch_feed in compiler.build_loom_input_batched(examples, batch_size):
  sess.run(fetches, {compiler.loom_input_tensor: batch_feed, ...})
  ...
As you'd expect, build_loom_input_batched is a generator, yielding batch feed values one at a time. So if examples is also a generator, only batch_size examples will get materialized in memory at a time.
Fold also supports reading input from a tensor, following the standard TF idiom. For example, let's say we have a tensor loom_input_batch containing batches of precomputed loom inputs (e.g. queued up from files):
compiler = td.Compiler(root_block, init_loom=False)
compiler.init_loom(loom_input_tensor=loom_input_batch)
...
We can now use compiler.output_tensors without needing to feed any values.
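To make "queued up from files" concrete, here is one hedged sketch using the standard TF 1.x queue-based input pipeline, assuming the serialized loom inputs (as produced by e.g. Compiler.build_loom_inputs) were previously written to a TFRecord file named train_loom_inputs.tfrecord (a hypothetical filename):
# Read serialized loom inputs back from a TFRecord file and batch them.
filename_queue = tf.train.string_input_producer(['train_loom_inputs.tfrecord'])
_, serialized_loom_input = tf.TFRecordReader().read(filename_queue)
loom_input_batch = tf.train.batch([serialized_loom_input], batch_size=100)
# Queue runners (tf.train.start_queue_runners) must be started before running
# compiler.output_tensors.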
If the input to your model is a string, then it can be built directly (without separately computing loom inputs) like so:
compiler = td.Compiler(root_block, init_loom=False)
compiler.init_loom(input_tensor=model_input_batch)
For example:
x = tf.constant(['foo', 'bar', 'foobar'])
compiler = td.Compiler(td.Length(), init_loom=False)
compiler.init_loom(input_tensor=x)
sess.run(compiler.output_tensors) => [array([ 3., 3., 6.], dtype=float32)]
TensorFlow exploits multiple cores to speed up operations like matrix multiplication, which are implemented in C++. Fold evaluation is currently implemented in Python, so we run into the notorious global interpreter lock. We can overcome this using a pool of subprocesses, which can easily be created using the Compiler.multiprocessing_pool context manager. Inside a multiprocessing pool context, all calls to the compiler that build loom inputs will seamlessly use the pool to evaluate examples in parallel. By default results will be unordered, which is faster. For example:
with compiler.multiprocessing_pool():
  batch_feeds = compiler.build_loom_input_batched(examples, batch_size)
  sess.run(fetches, {compiler.loom_input_tensor: next(batch_feeds)})
Here Session.run will be called on the first batch of examples to get computed, which will not necessarily be the first batch_size elements of the examples iterable. To preserve order, simply pass ordered=True to build_loom_input_batched (or build_feed_dict, or build_loom_inputs).
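For example, a hedged minimal variant of the batched loop above that preserves order:
for batch_feed in compiler.build_loom_input_batched(examples, batch_size, ordered=True):
  # Batches now arrive in the same order as the examples iterable.
  sess.run(fetches, {compiler.loom_input_tensor: batch_feed})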
The implementation uses the standard Python multiprocessing library. When the context manager exits, the pool is closed and we block until all ongoing work is completed.