
Converter bugfixes/improvements #119

Merged: 18 commits into master from converter2 on Jan 28, 2020
Conversation

drasmuss (Member)

Fixes some issues found in the converter for more complex models, and improves the performance of converted networks.

drasmuss force-pushed the converter2 branch 2 times, most recently from a517739 to 30ef415 (December 19, 2019)
drasmuss force-pushed the converter2 branch 7 times, most recently from 00ffd7a to 85f9846 (January 10, 2020)
drasmuss changed the title from "WIP Converter bugfixes/improvements" to "Converter bugfixes/improvements" (January 10, 2020)
drasmuss force-pushed the converter2 branch 2 times, most recently from 11ceedc to cf275f5 (January 16, 2020)
drasmuss (Member Author)

Note for posterity: the transform=None changes caused performance to decrease on the integrator benchmark, but after looking into this for a while I am fairly confident that this is a quirk of that particular model rather than a general issue. Removing some of the unnecessary x*1 ElementwiseInc operators means that one of the remaining ElementwiseInc operators ends up writing to a partial signal block rather than the whole signal block, which is less efficient. That is just an artifact of how the operators and signals happen to be ordered/merged in that particular model, not an effect we'd expect to see in general. Larger-scale tests (e.g. on the Spaun model) show an overall speedup with the transform=None changes.

In case it is useful in the future, here is a benchmark script I made to test the cost of different read/write types:

from collections import defaultdict
import timeit

import numpy as np
import tensorflow as tf
from tensorflow.python.eager import profiler
from tensorflow.python.ops.gen_state_ops import (
    TemporaryVariable,
    DestroyTemporaryVariable,
)

tf.config.optimizer.set_experimental_options({"disable_meta_optimizer": True})

minibatch_size = 64
base_shape = (minibatch_size, 16384)
read_write_size = 4096
reps = 1000

with tf.Graph().as_default() as graph:
    results = defaultdict(list)

    # random (integer) indices into the feature axis of the base signal
    idxs = tf.constant(
        np.random.randint(0, base_shape[1], size=read_write_size), dtype=tf.int32
    )
    # (minibatch, index) coordinate pairs for the scatter_nd operations
    idxs_nd = tf.stack(
        tf.meshgrid(tf.range(minibatch_size, dtype=tf.int32), idxs, indexing="ij"),
        axis=-1,
    )

    base = tf.compat.v1.placeholder(shape=base_shape, dtype=tf.float32)

    read_identity = tf.compat.v1.placeholder(
        shape=(minibatch_size, read_write_size), dtype=tf.float32
    )

    # chain each repetition after the previous one via control dependencies,
    # so that the ops cannot be executed in parallel or pruned
    for i in range(reps):
        with tf.control_dependencies(results["read_identity"]):
            results["read_identity"] = [read_identity]

        with tf.control_dependencies(results["read_slice"]):
            results["read_slice"] = [
                tf.strided_slice(base, [0, 0], [minibatch_size, read_write_size])
            ]

        with tf.control_dependencies(results["read_gather"]):
            results["read_gather"] = [tf.gather(base, idxs, axis=1)]

        with tf.control_dependencies(results["read_slice_concat"]):
            results["read_slice_concat"] = [
                tf.concat(
                    [
                        tf.strided_slice(
                            base, [0, 0], [minibatch_size, read_write_size // 2]
                        ),
                        tf.strided_slice(
                            base,
                            [0, base_shape[1] - read_write_size // 2],
                            [minibatch_size, base_shape[1]],
                        ),
                    ],
                    axis=1,
                )
            ]

        with tf.control_dependencies(results["write_assign"]):
            results["write_assign"] = [read_identity]

        with tf.control_dependencies(results["write_assign_add"]):
            if i == 0:
                results["write_assign_add"] = [read_identity]
            else:
                results["write_assign_add"] = [
                    results["write_assign_add"][0] + read_identity
                ]

        with tf.control_dependencies(results["write_scatter_add"]):
            results["write_scatter_add"] = [
                tf.tensor_scatter_nd_add(base, idxs_nd, read_identity)
            ]

        with tf.control_dependencies(results["write_scatter_update"]):
            results["write_scatter_update"] = [
                tf.tensor_scatter_nd_update(base, idxs_nd, read_identity)
            ]

        with tf.control_dependencies(results["write_temp_var_add"]):
            var = TemporaryVariable(shape=base.shape, dtype=base.dtype)
            var_name = var.op.name
            var = tf.compat.v1.assign(var, base)
            var = tf.compat.v1.scatter_nd_add(var, idxs_nd, read_identity)
            results["write_temp_var_add"] = [
                DestroyTemporaryVariable(ref=var, var_name=var_name)
            ]

        with tf.control_dependencies(results["write_temp_var_update"]):
            var = TemporaryVariable(shape=base.shape, dtype=base.dtype)
            var_name = var.op.name
            var = tf.compat.v1.assign(var, base)
            var = tf.compat.v1.scatter_nd_update(var, idxs_nd, read_identity)
            results["write_temp_var_update"] = [
                DestroyTemporaryVariable(ref=var, var_name=var_name)
            ]

    # change all the results to the same output, to remove i/o discrepancies
    for k, v in results.items():
        with tf.control_dependencies(v):
            results[k] = tf.constant(1)

with tf.compat.v1.Session(graph=graph) as sess:
    feed_dict = {
        base: np.random.uniform(size=base_shape),
        read_identity: np.random.uniform(size=(minibatch_size, read_write_size)),
    }

    # profiler.start()
    sess.run(results, feed_dict=feed_dict)
    # profiler.save("tmp2_profile", profiler.stop())

    for key, vals in results.items():
        print(key)

        # take the minimum over 50 runs, to reduce timing noise
        time = 1e10
        for _ in range(50):
            start = timeit.default_timer()
            sess.run(vals, feed_dict=feed_dict)
            time = min(time, timeit.default_timer() - start)

        print(time)

drasmuss force-pushed the converter2 branch 3 times, most recently from a1a3e11 to 0267ba7 (January 17, 2020)
hunse (Collaborator) commented Jan 20, 2020

I'm getting the following warnings in master:

/home/ehunsber/workspace/nengo-dl/nengo_dl/converter.py:1097: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  broadcast_scale[slices] = scale[i]
/home/ehunsber/workspace/nengo-dl/nengo_dl/converter.py:1098: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  broadcast_bias[slices] = bias[i]

We might want to fix these in this PR; I think it's just a matter of using `tuple(slices)` in those lines.
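
For reference, here is a minimal sketch of the warning and the fix (the array names and shapes are illustrative, not the actual converter code):

import numpy as np

broadcast_scale = np.zeros((2, 3, 4))
scale = np.ones(2)
slices = [slice(0, 1), slice(None), slice(None)]  # indices built up as a list

# broadcast_scale[slices] = scale[0]        # FutureWarning: non-tuple sequence index
broadcast_scale[tuple(slices)] = scale[0]   # indexing with a tuple is the supported form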

drasmuss (Member Author)

That's done in commit c03bd58, and some other instances in commit bab3950 (I just stuck them in with a larger commit as they came up, since the change was so minor).

tbekolay (Member) left a comment:


This all LGTM! Had a few small questions/comments that could lead to changes but also could be left as is, so I'll wait on your response before proceeding @drasmuss.

Review threads (resolved): nengo_dl/tensor_node.py, nengo_dl/converter.py (two outdated threads), CHANGES.rst
Commit messages:

- Store TensorSignal indices as slices instead of full lists of indices (see the first sketch after this list).
- Store initial values of base arrays more efficiently. This can improve the speed of state updates, and likely doesn't make a significant difference to memory size (relative to all the other internal state on the GPU).
- The is_gpu_available function is deprecated.
- Use sys.executable in the TF GPU check, which ensures that the check uses the same Python executable as the source script (see the second sketch after this list).
- The behaviour of the batch_size parameter was changed slightly.
- Some changes were made to the ABR GPU server, which sped things up slightly.
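
As a hypothetical illustration of the indices-as-slices idea (the names below are illustrative, not the actual TensorSignal implementation): when a signal's indices are contiguous, a slice selects the same elements as a full index list but is much more compact, and it lets TensorFlow use a cheap strided-slice read instead of a gather (compare read_slice vs. read_gather in the benchmark above).

import numpy as np

base = np.arange(16384)

idx_list = list(range(4096))  # full list of indices, one Python int per element
idx_slice = slice(0, 4096)    # equivalent compact representation

assert np.array_equal(base[idx_list], base[idx_slice])

And a minimal sketch of the sys.executable idea behind the TF GPU check (the actual check in nengo-dl may differ): launching the query via sys.executable guarantees it runs under the same interpreter, and therefore sees the same installed TensorFlow, as the calling script.

import subprocess
import sys

# run the GPU query in a subprocess under the same Python interpreter
result = subprocess.run(
    [
        sys.executable,
        "-c",
        "import tensorflow as tf; "
        "print(len(tf.config.list_physical_devices('GPU')))",
    ],
    capture_output=True,
    text=True,
)
has_gpu = int(result.stdout.strip()) > 0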
tbekolay (Member) left a comment:


All LGTM now, merging when CI finishes.

tbekolay merged commit e0c3479 into master on Jan 28, 2020
tbekolay deleted the converter2 branch on March 4, 2020