diff --git a/dev/adjoints/index.html b/dev/adjoints/index.html index 192ee663c..91deee4fb 100644 --- a/dev/adjoints/index.html +++ b/dev/adjoints/index.html @@ -100,4 +100,4 @@ 1 levels of nesting julia> grad(x -> x*grad(f, x), 1); -2 levels of nesting +2 levels of nesting diff --git a/dev/complex/index.html b/dev/complex/index.html index 31d31425f..b6957d1fb 100644 --- a/dev/complex/index.html +++ b/dev/complex/index.html @@ -27,4 +27,4 @@ (8.0 + 12.0im, 0.0 + 0.0im) julia> wirtinger(x -> abs2(x), 1+2im) -(1.0 - 2.0im, 1.0 + 2.0im) +(1.0 - 2.0im, 1.0 + 2.0im) diff --git a/dev/glossary/index.html b/dev/glossary/index.html index 95b4cb060..490fa5d05 100644 --- a/dev/glossary/index.html +++ b/dev/glossary/index.html @@ -6,4 +6,4 @@ -

Glossary

Differentiation is a minefield of conflicting and overlapping terminology, partly because the ideas have been re-discovered in many different fields (e.g. calculus and differential geometry, the traditional AD community, deep learning, finance, etc.). Many of these terms are not well-defined, and different communities may disagree on the details. Nevertheless, we aim to at least say how we use these terms, which will be helpful when reading over Zygote issues, discussions and source code.

The list is certainly not complete; if you see new terms you'd like defined, or would like to add one yourself, please do open an issue or PR.

Adjoint: See pullback. Used when defining new pullbacks (i.e. the @adjoint macro) since this involves defining the adjoint of the Jacobian, in most cases.
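For instance, a minimal sketch of defining a custom adjoint with the @adjoint macro (mul is a made-up function for illustration; compare the adjoints page of these docs):

julia> using Zygote

julia> mul(a, b) = a * b;

julia> Zygote.@adjoint mul(a, b) = mul(a, b), c̄ -> (c̄ * b, c̄ * a);

julia> gradient(mul, 2.0, 3.0)
(3.0, 2.0)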

Backpropagation: Essentially equivalent to "reverse-mode AD". Used particularly in the machine learning world to refer to simple chains of functions f(g(h(x))), but has generalised beyond that.

Derivative: Given a scalar function $y = f(x)$, the derivative is $\frac{\partial y}{\partial x}$. "Partial" is taken for granted in AD; there's no interesting distinction between partial and total derivatives for our purposes. It's all in the eye of the beholder.

Differential: Given a function $f(x)$, the linearisation $\partial f$ such that $f(x + \epsilon) \approx f(x) + \partial f \epsilon$. This is a generalisation of the derivative since it applies to, for example, vector-to-vector functions ($\partial f$ is a Jacobian) and holomorphic complex functions ($\partial f$ is the first Wirtinger derivative). This is not, in general, what Zygote calculates, though differentials can usually be derived from gradients.

IR: Intermediate Representation. Essentially source code, but usually lower level – e.g. control flow constructs like loops and branches have all been replaced by gotos. The idea is that it's harder for humans to read/write but easier to manipulate programmatically. Worth looking at SSA form as a paradigmatic example.

Gradient: See sensitivity. There is no technical difference in Zygote's view, though "gradient" sometimes distinguishes the sensitivity we actually want from e.g. the internal ones that Zygote produces as it backpropagates.

Graph: ML people tend to think of models as "computation graphs", but this is no more true than any program is a graph. In fact, pretty much anything is a graph if you squint hard enough. This also refers to the data structure that e.g. TensorFlow and PyTorch build to represent your model, but see trace for that.

Pullback: Given $y = f(x)$, the function $\bar x = \operatorname{back}(\bar y)$ taking an output sensitivity to an input sensitivity. In other words, the function back in y, back = Zygote.pullback(f, x).
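For example, a small sketch at the REPL (the numbers are just sin(0.5) and cos(0.5)):

julia> using Zygote

julia> y, back = Zygote.pullback(sin, 0.5);

julia> y
0.479425538604203

julia> back(1.0)  # x̄ for ȳ = 1, i.e. cos(0.5)
(0.8775825618903728,)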

Sensitivity: Used to refer to the gradient $\bar x = \frac{\partial l}{\partial x}$ with some scalar loss $l$. In other words, you have a value $x$ (which need not be scalar) at some point in your program, and $\bar x$ tells you how you should change that value to decrease the loss. In the AD world, sometimes used to refer to adjoint rules.

Source to Source Differentiation: Or Source Code Transformation (SCT). As opposed to tracing programs to simplify them, an alternative is to operate directly on a language's source code or IR, generating new source code for pullbacks. This describes Zygote and Swift for TensorFlow, as well as Tapenade and a few other older ADs that worked on C source files. Zygote and Swift are unusual in that they work on in-memory IR rather than text source.

To an extent, tracing ADs can be viewed as source transform of a Wengert list / trace. The key difference is that the trace is a lossy representation of the original semantics, which causes problems with e.g. control flow. Systems which can preserve some of those semantics (e.g. autograph) begin to blur the line here, though they are still not nearly as expressive as language IRs.

Symbolic Differentiation: Used to refer to differentiation of "mathematical expressions", that is, things like 3x^2 + sin(x). Often distinguished from AD, though this is somewhat arbitrary; you can happily produce a symbolic adjoint for a Wengert list, the only difference being that you're allowed to make variable bindings. So it's really just a special case of AD on an unusually limited language.

Tape: This term can refer to pretty much any part of an AD implementation. In particular confusion is caused by conflating the trace with the set of values sometimes closed over by a pullback. Autograd has a combined trace/closure data structure which is usually described as the tape. On the other hand, PyTorch described their implementation as tape-free because the trace/closure is stored as a DAG rather than a vector, so basically all bets are off here.

Trace: A recording of each mathematical operation used by a program, made at runtime and usually forming a Wengert list. Traces may or may not also record actual runtime values (e.g. PyTorch vs. TensorFlow). They can often be treated as an IR and compiled, but are distinguished from true IRs in that they unroll and inline all control flow, functions and data structures. The tracing process can be thought of as a kind of partial evaluation, though tracers are typically much less worried about losing information.

Vector-Jacobian product: see pullback. So called because all pullbacks are linear functions that can be represented by (left) multiplication with the Jacobian matrix.
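As a sketch, for a vector-valued function the pullback applied to v̄ computes exactly v̄ᵀJ (the function f below is made up for illustration):

julia> using Zygote

julia> f(x) = [x[1]^2, x[1]*x[2]];

julia> _, back = Zygote.pullback(f, [3.0, 4.0]);

julia> back([1.0, 1.0])  # v̄ᵀJ with J = [6 0; 4 3]
([10.0, 3.0],)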

Wengert List: A set of simple variable assignments and mathematical expressions, forming a directed graph. Can be thought of as a limited programming language with variable bindings and numerical functions but no control flow or data structures. If you trace a program for AD it will typically take this form.
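For instance, a hand-written sketch of y = x * sin(x) in Wengert-list form, one primitive operation per binding:

y1 = sin(x)
y2 = x * y1   # y2 is the result y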

diff --git a/dev/index.html b/dev/index.html index 09ea4fc4e..9231197ba 100644 --- a/dev/index.html +++ b/dev/index.html @@ -79,7 +79,7 @@ p = size(x, d) sum(x.^p .+ y) end -([14.0, 22.0], 2.0, nothing)source
julia> linear(θ, x) = θ[:W] * x .+ θ[:b]
 linear (generic function with 1 method)
 
 julia> x = rand(5);
@@ -121,7 +121,7 @@
  8.0  80.0  800.0
 
 julia> haskey(g, z)  # only x and y are parameters
-false
source
julia> W = rand(2, 5); b = rand(2);
 
 julia> linear(x) = W * x .+ b
 linear (generic function with 2 methods)
@@ -130,4 +130,4 @@
 Grads(...)
 
 julia> grads[W], grads[b] # access gradients using arrays as keys
-([0.652543 … 0.683588], [1.0, 1.0])

Here grads is a dictionary-like object, whose keys are the same parameters we indicated in Params. (In fact it wraps a dictionary using objectid(W) as keys, which does not change if the values in W are mutated).

This implicit style is the one presently used by Flux.jl, a closely related machine learning library. It uses structs like Linear above to define layers, and the function Flux.params(model) returns a Params object containing all the parameters of all layers. See its documentation for more details. When using Zygote for most other purposes, however, the explicit style is usually preferred.
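As a hedged end-to-end sketch of the implicit style (W2 and b2 are made up for this example, with shapes chosen so the gradients are easy to check):

julia> using Zygote

julia> W2, b2 = rand(2, 3), rand(2);

julia> g = gradient(() -> sum(W2 * ones(3) .+ b2), Params([W2, b2]));

julia> g[b2]
2-element Vector{Float64}:
 1.0
 1.0

julia> g[W2] == ones(2, 3)
true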

diff --git a/dev/internals/index.html b/dev/internals/index.html index 1a546b097..915db68bb 100644 --- a/dev/internals/index.html +++ b/dev/internals/index.html @@ -135,4 +135,4 @@ julia> y, back = Zygote._pullback(bad, 1); julia> back(1) # ok, here's our issue. Lather, rinse, repeat. -ERROR: bad

Of course, our goal is that you never have to do this, but until Zygote is more mature it can be a useful way to narrow down test cases.

diff --git a/dev/limitations/index.html b/dev/limitations/index.html index af0a17af1..e9ef6c01b 100644 --- a/dev/limitations/index.html +++ b/dev/limitations/index.html @@ -87,4 +87,4 @@ tot += x^n # binds symbol `tot` to new value end return tot -end

However, such re-binding sometimes confuses Zygote, especially if the type of the value changes, and especially if the variable is "boxed", as will happen if you re-bind it from within a closure (such as the function created by a do block).
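For concreteness, a made-up sketch of the kind of closure re-binding meant here (not taken from these docs; whether Zygote handles it gracefully can depend on the version):

# `total` is re-bound inside the `do` block, so Julia boxes it;
# this is exactly the situation the warning above describes.
function sum_with_foreach(xs)
    total = zero(eltype(xs))
    foreach(xs) do x
        total += x   # re-binds the boxed variable `total`
    end
    return total
end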

Second derivatives

In principle Zygote supports taking derivatives of derivatives. There are, however, a few problems in practice.

The issue tracker has a label for second order, which will outline where the bodies are buried.

Often using a different AD system over Zygote is a better solution. This is what hessian does, using ForwardDiff over Zygote, but other combinations are possible. (Note that rules defined here mean that Zygote over ForwardDiff is translated to ForwardDiff over ForwardDiff.)
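For example, a small sketch with the exported hessian (forward over reverse under the hood; the result is just the diagonal 6x):

julia> using Zygote

julia> hessian(x -> sum(x.^3), [1.0, 2.0])
2×2 Matrix{Float64}:
 6.0   0.0
 0.0  12.0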

diff --git a/dev/profiling/index.html b/dev/profiling/index.html index ccbaa2506..22b011d16 100644 --- a/dev/profiling/index.html +++ b/dev/profiling/index.html @@ -28,4 +28,4 @@ │ %2 = (Base.mul_int)(Δ, 1)::Int64 │ %3 = (Zygote.tuple)(nothing, %1, %2)::PartialTuple(Tuple{Nothing,Int64,Int64}, Any[Const(nothing, false), Int64, Int64]) └── return %3 -) => Tuple{Nothing,Int64,Int64} +) => Tuple{Nothing,Int64,Int64} diff --git a/dev/search/index.html b/dev/search/index.html index 8c108b9f0..b25075005 100644 --- a/dev/search/index.html +++ b/dev/search/index.html @@ -6,4 +6,4 @@
diff --git a/dev/utils/index.html b/dev/utils/index.html index 91ef506a5..a8cb5653e 100644 --- a/dev/utils/index.html +++ b/dev/utils/index.html @@ -23,7 +23,7 @@ ([4 4 4], nothing) julia> gradient((a,t) -> sum(a .* t[1]) + t[2], [1,2,3], (4,5)) # gradient understands the tuple -([4 4 4], (6, 1))source
      jacobian(loss, ::Params)

      Like gradient with implicit parameters, this method takes a zero-argument function and returns an IdDict-like object, now containing the Jacobian for each parameter.

      Examples

      julia> xs = [1 2; 3 4]; ys = [5,7,9];
       
       julia> Jxy = jacobian(() -> ys[1:2] .+ sum(xs.^2), Params([xs, ys]))
       Grads(...)
      @@ -36,7 +36,7 @@
       julia> Jxy[xs]
       2×4 Matrix{Int64}:
        2  6  4  8
      - 2  6  4  8
      source
      Zygote.hessianFunction
      hessian(f, x)

      Construct the Hessian ∂²f/∂x², where x is a real number or an array, and f(x) is a real number. When x is an array, the result is a matrix H[i,j] = ∂²f/∂x[i]∂x[j], using linear indexing x[i] even if the argument is higher-dimensional.

      This uses forward over reverse, ForwardDiff over Zygote, calling hessian_dual(f, x). See hessian_reverse for an all-Zygote alternative.

      See also diaghessian to compute only the diagonal part.

      Examples

      julia> hessian(x -> x[1]*x[2], randn(2))
       2×2 Matrix{Float64}:
        0.0  1.0
        1.0  0.0
      @@ -49,7 +49,7 @@
        0   0   0  24
       
       julia> hessian(sin, pi/2)
      --1.0
      source
      Zygote.hessian_reverseFunction
      hessian_reverse(f, x)

      This should be equivalent to hessian(f, x), but implemented using reverse over reverse mode, all Zygote. (This is usually much slower, and more likely to find errors.)
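A short sketch (hypothetical REPL session; for a smooth function the answer should match hessian):

julia> using Zygote

julia> Zygote.hessian_reverse(x -> x[1]*x[2], [2.0, 3.0])
2×2 Matrix{Float64}:
 0.0  1.0
 1.0  0.0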

      source
      Zygote.diaghessianFunction
      diaghessian(f, args...) -> Tuple

      Diagonal part of the Hessian. Returns a tuple containing, for each argument x, h of the same shape with h[i] = Hᵢᵢ = ∂²y/∂x[i]∂x[i]. The original evaluation y = f(args...) must give a real number y.

      For one vector argument x, this is equivalent to (diag(hessian(f,x)),). Like hessian it uses ForwardDiff over Zygote.

      Warning

      For arguments of any type except Number & AbstractArray, the result is nothing.

      Examples

      julia> diaghessian(x -> sum(x.^3), [1 2; 3 4])[1]
       2×2 Matrix{Int64}:
         6  12
        18  24
      @@ -66,7 +66,7 @@
       julia> hessian(xy -> atan(xy[1], xy[2]), [1, 2])  # full Hessian is not diagonal
       2×2 Matrix{Float64}:
        -0.16  -0.12
      - -0.12   0.16
      source

      Zygote also provides a set of helpful utilities. These are all "user-level" tools – in other words you could have written them easily yourself, but they live in Zygote for convenience.

      See ChainRules.ignore_derivatives if you want to exclude some of your code from the gradient calculation. This replaces previous Zygote-specific ignore and dropgrad functionality.
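A hedged sketch of the do-block form (the name lives in ChainRulesCore, which Zygote builds on; check its docs for the exact API):

julia> using Zygote, ChainRulesCore

julia> gradient(3.0) do x
         y = ChainRulesCore.ignore_derivatives() do
           x^2  # treated as a constant by the gradient
         end
         x * y
       end
(9.0,)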

      Zygote.withgradientFunction
      withgradient(f, args...)
       withgradient(f, ::Params)

      Returns both the value of the function and the gradient, as a named tuple.

      julia> y, ∇ = withgradient(/, 1, 2)
       (val = 0.5, grad = (0.5, -0.25))
       
      @@ -87,8 +87,8 @@
       
       julia> res.grad[w]
       1-element Vector{Float64}:
      - 6.0
      source
      Zygote.withjacobianFunction
      withjacobian(f, args...)

      Returns both the value f(args...) and the jacobian as a named tuple.

      julia> withjacobian(cumsum, [1,2,3])
      -(val = [1, 3, 6], grad = ([1 0 0; 1 1 0; 1 1 1],))
      source
      Zygote.@showgradMacro
      @showgrad(x) -> x

      Much like @show, but shows the gradient about to accumulate to x. Useful for debugging gradients.

      julia> gradient(2, 3) do a, b
                @showgrad(a)*b
              end
       ∂(a) = 3
      @@ -103,7 +103,7 @@
                a*b
              end
       ∂(a) = nothing
      -(3, 2)
      source
      Zygote.hookFunction
      hook(x̄ -> ..., x) -> x

      Gradient hooks. Allows you to apply an arbitrary function to the gradient for x.

      julia> gradient(2, 3) do a, b
                hook(ā -> @show(ā), a)*b
              end
       ā = 3
      @@ -112,7 +112,7 @@
       julia> gradient(2, 3) do a, b
                hook(-, a)*b
              end
      -(-3, 2)
      source
      Zygote.BufferType
      Buffer(xs, ...)

      Buffer is an array-like type which is mutable when taking gradients. You can construct a Buffer with the same syntax as similar (e.g. Buffer(xs, 5)) and then use normal indexing. Finally, use copy to get back a normal array.

      For example:

      julia> function vstack(xs)
                  buf = Buffer(xs, length(xs), 5)
                  for i = 1:5
                    buf[:, i] = xs
      @@ -128,7 +128,7 @@
        3  3  3  3  3
       
       julia> gradient(x -> sum(vstack(x)), [1, 2, 3])
      -([5.0, 5.0, 5.0],)

      Buffer is not an AbstractArray and can't be used for linear algebra operations like matrix multiplication. This prevents it from being captured by pullbacks.

      copy is a semantic copy, but does not allocate memory. Instead the Buffer is made immutable after copying.

      source
      Zygote.forwarddiffFunction
      forwarddiff(f, x; chunk_threshold = ForwardDiff.DEFAULT_CHUNK_THRESHOLD) -> f(x)

      Runs f(x) as usual, but instructs Zygote to differentiate f using forward mode, rather than the usual reverse mode. The chunk_threshold argument controls the maximum chunk size (c.f. ForwardDiff documentation).

      Forward mode takes time linear in length(x) but only has constant memory overhead, and is very efficient for scalars, so in some cases this can be a useful optimisation.

      julia> function pow(x, n)
                r = one(x)
                for i = 1:n
                  r *= x
      @@ -151,7 +151,7 @@
         forwarddiff([a, b]) do (a, b)
           a*b
         end
      -end
      source
      Zygote.checkpointedFunction
      checkpointed(f, xs...)

      Use gradient checkpointing on the call f(xs...). This means that checkpointed(f, xs...) === f(xs...), but when computing the derivative intermediate results from the forward pass of f will not be stored. Instead the forward pass will be repeated, when computing the derivative. This saves memory at the cost of increasing execution time.

      Warning

      If f is not a pure function, checkpointed will likely give wrong results.
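A minimal sketch (sumsq is made up for illustration; the gradient should match differentiating it directly):

julia> using Zygote

julia> sumsq(x) = sum(abs2, x);

julia> gradient(x -> Zygote.checkpointed(sumsq, x), [1.0, 2.0, 3.0])
([2.0, 4.0, 6.0],)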

      source

      Params and Grads can be copied to and from arrays using the copy! function.
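A hedged sketch of round-tripping parameters through a flat vector, assuming the copy! methods described above (w, b and v are made up for this example):

julia> using Zygote

julia> w, b = rand(3), rand(2);

julia> ps = Params([w, b]);

julia> v = zeros(5);

julia> copy!(v, ps);  # flatten all parameters into v

julia> sum(v) ≈ sum(w) + sum(b)
true

julia> copy!(ps, zeros(5));  # write values from a vector back into w and b

julia> iszero(w) && iszero(b)
true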

      Working with Grads

      Map, broadcast, and iteration are supported for the dictionary-like Grads objects. These operations are value based and preserve the keys.

      using Zygote, Test
       
       w, x1, x2, b = rand(2), rand(2), rand(2), rand(2)
       
      @@ -180,4 +180,4 @@
       # note that gradients must be w.r.t. to the same parameter key set
       gs3 = gradient(() -> sum(tanh.(w .* x2)), Params([w]))
       # gs3 does not have the key b
      -@test_throws ArgumentError gs1 .+ gs3
      +@test_throws ArgumentError gs1 .+ gs3