From f7438f593e3a61cb98ee27edec56f37afcd4eb4b Mon Sep 17 00:00:00 2001
From: Logan Weber <36520469+weberlo@users.noreply.github.com>
Date: Fri, 12 Apr 2019 15:43:37 -0700
Subject: [PATCH] [Relay] Add gradient operator tutorial docs (#2751)

* Add gradient operator tutorial docs

* Incorporate Steven's and Ziheng's feedback

* Remove TODO about `collapse_sum_like`

* Add more examples
---
 docs/dev/relay_add_op.rst     | 104 ++++++++++++++++++++++++++++++++++
 src/relay/pass/pattern_util.h |   5 ++
 2 files changed, 109 insertions(+)

diff --git a/docs/dev/relay_add_op.rst b/docs/dev/relay_add_op.rst
index c17e8318bc1f..466dca038185 100644
--- a/docs/dev/relay_add_op.rst
+++ b/docs/dev/relay_add_op.rst
@@ -156,6 +156,110 @@ before producing the call node:
         tup = Tuple(list(args))
         return _make.concat(tup)
 
+Gradient Operators
+------------------
+
+Gradient operators are important for writing differentiable programs in
+Relay. While Relay's autodiff algorithm can differentiate first-class
+language constructs, operators are opaque to it. Because Relay can't look
+into an operator's implementation, an explicit differentiation rule must be
+provided.
+
+Both Python and C++ can be used to write gradient operators, but we focus
+our examples on Python, as it is more commonly used.
+
+Adding a Gradient in Python
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A collection of Python gradient operators can be found in
+``python/tvm/relay/op/_tensor_grad.py``. We will walk through two
+representative examples: ``sigmoid`` and ``multiply``.
+
+.. code:: python
+
+    @register_gradient("sigmoid")
+    def sigmoid_grad(orig, grad):
+        """Returns [grad * sigmoid(x) * (1 - sigmoid(x))]."""
+        return [grad * orig * (ones_like(orig) - orig)]
+
+The inputs here are the original operator call ``orig`` and a gradient
+``grad`` to accumulate into. What we return is a list, where the element at
+the i'th index is the derivative of the operator with respect to the
+operator's i'th input. In general, the gradient returns a list with as many
+elements as there are inputs to the base operator.
+
+Before analyzing this definition further, we should recall the derivative of
+the sigmoid function: :math:`\frac{\partial \sigma}{\partial x} =
+\sigma(x)(1 - \sigma(x))`. The definition above looks similar to the
+mathematical definition, but there is one important addition, which we
+describe below.
+
+The term ``orig * (ones_like(orig) - orig)`` directly matches the
+derivative, because ``orig`` here is the sigmoid output. But we're not just
+interested in how to compute the gradient of this one function; we're
+interested in composing this gradient with other gradients, so we can
+accumulate the gradient across an entire program. This is where the ``grad``
+term comes in. In the expression ``grad * orig * (ones_like(orig) - orig)``,
+multiplying by ``grad`` specifies how to compose the derivative with the
+gradient thus far.
+
+Now, we consider ``multiply``, a slightly more interesting example:
+
+.. code:: python
+
+    @register_gradient("multiply")
+    def multiply_grad(orig, grad):
+        """Returns [grad * y, grad * x]."""
+        x, y = orig.args
+        return [collapse_sum_like(grad * y, x),
+                collapse_sum_like(grad * x, y)]
+
+In this example, there are two elements in the returned list, because
+``multiply`` is a binary operator. Recall that if :math:`f(x, y) = xy`, the
+partial derivatives are :math:`\frac{\partial f}{\partial x} = y` and
+:math:`\frac{\partial f}{\partial y} = x`.
+
+There is one required step for ``multiply`` that is not required for
+``sigmoid``, because ``multiply`` has broadcasting semantics. Since the
+shape of ``grad`` might not match the shape of the inputs, we use
+``collapse_sum_like`` to take the contents of the ``grad * y`` and
+``grad * x`` terms and make their shape match the shape of the input we're
+differentiating with respect to.
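+
+As a rough illustration of this collapsing step, here is a NumPy sketch (not
+TVM code; ``collapse_sum_like_np`` is a stand-in we define ourselves, not a
+Relay API). With ``x`` of shape ``(3,)`` and ``y`` of shape ``(2, 3)``, the
+gradient of ``x * y`` has shape ``(2, 3)``, so the gradient with respect to
+``x`` must be summed back down over the broadcast axis:
+
+.. code:: python
+
+    import numpy as np
+
+    def collapse_sum_like_np(values, like):
+        """NumPy stand-in: sum `values` over broadcast axes to match `like`'s shape."""
+        while values.ndim > like.ndim:           # collapse extra leading axes
+            values = values.sum(axis=0)
+        for axis, dim in enumerate(like.shape):  # collapse axes that were size 1
+            if dim == 1 and values.shape[axis] != 1:
+                values = values.sum(axis=axis, keepdims=True)
+        return values
+
+    x = np.random.rand(3).astype("float32")       # shape (3,)
+    y = np.random.rand(2, 3).astype("float32")    # shape (2, 3)
+    grad = np.ones((2, 3), dtype="float32")       # incoming gradient for x * y
+
+    dx = collapse_sum_like_np(grad * y, x)        # shape (3,), matches x
+    dy = collapse_sum_like_np(grad * x, y)        # shape (2, 3), matches y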
+
+Adding a Gradient in C++
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Adding a gradient in C++ is similar to adding one in Python, but the
+interface for registering is slightly different.
+
+First, make sure ``src/relay/pass/pattern_util.h`` is included. It provides
+helper functions for creating nodes in the Relay AST. Then, define the
+gradient in a similar fashion to the Python example:
+
+.. code:: c
+
+    tvm::Array<Expr> MultiplyGrad(const Expr& orig_call, const Expr& output_grad) {
+        const Call& call = orig_call.Downcast<Call>();
+        return { CollapseSumLike(Multiply(output_grad, call.args[1]), call.args[0]),
+                 CollapseSumLike(Multiply(output_grad, call.args[0]), call.args[1]) };
+    }
+
+Notice that in C++ we can't use the same operator overloading that we have
+in Python, and we need to downcast, so the implementation is more verbose.
+Even so, we can easily verify that this definition mirrors the earlier
+example in Python.
+
+Now, instead of using a Python decorator, we tack a ``set_attr`` call for
+"FPrimalGradient" onto the end of the base operator's registration, in order
+to register the gradient.
+
+.. code:: c
+
+    RELAY_REGISTER_OP("multiply")
+        // ...
+        // Set other attributes
+        // ...
+        .set_attr<FPrimalGradient>("FPrimalGradient", MultiplyGrad);
+
 Summary
 -------
 
diff --git a/src/relay/pass/pattern_util.h b/src/relay/pass/pattern_util.h
index 22307d12303e..1e4060fe6c75 100644
--- a/src/relay/pass/pattern_util.h
+++ b/src/relay/pass/pattern_util.h
@@ -328,6 +328,11 @@ inline Expr OnesLike(Expr e) {
   return CallNode::make(op, {e});
 }
 
+inline Expr CollapseSumLike(Expr data, Expr collapse_type) {
+  static const Op& op = Op::Get("collapse_sum_like");
+  return CallNode::make(op, {data, collapse_type});
+}
+
 inline Expr Power(Expr lhs, Expr rhs) {
   static const Op& op = Op::Get("power");
   return CallNode::make(op, {lhs, rhs}, Attrs(), {});
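
Once "FPrimalGradient" is registered, the gradients above are picked up by Relay's gradient pass. Below is a minimal sketch of exercising them end to end, assuming a TVM build where ``relay.transform.gradient`` is the Python entry point (the exact API has moved between TVM versions):

.. code:: python

    import tvm
    from tvm import relay

    # Shapes are chosen so that ``multiply`` broadcasts, which exercises the
    # collapse_sum_like step in multiply's registered gradient.
    x = relay.var("x", shape=(3,), dtype="float32")
    y = relay.var("y", shape=(2, 3), dtype="float32")
    func = relay.Function([x, y], relay.sigmoid(x * y))

    mod = tvm.IRModule.from_expr(func)
    mod = relay.transform.InferType()(mod)

    # The gradient pass rewrites the function to return the original output
    # together with the gradients w.r.t. each input, composing the registered
    # FPrimalGradient rules for sigmoid and multiply.
    grad_func = relay.transform.gradient(mod["main"], mod)
    print(grad_func)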