Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jacobian types #220

Closed
ChrisRackauckas opened this issue Dec 11, 2017 · 26 comments
Closed

Jacobian types #220

ChrisRackauckas opened this issue Dec 11, 2017 · 26 comments
Assignees

Comments

@ChrisRackauckas
Copy link
Member

We need to be able to handle non-dense Jacobians. I plan on doing this by allowing Jacobian types to be chosen in the ODEProblem etc. types as a jac_prototype which allows the user to pass a function which generates the type which will be used for the Jacobian. This will let the user make use of a BandedMatrix from BandedMatrices.jl, or a sparse matrix, or even allows for matrix-free representations of the Jacobian.

@ChrisRackauckas
Copy link
Member Author

The solution here needs another tier. Here's a full explanation. In stiff solvers, you always need to solve the linear system

(M-gamma*J)x = Wx = b

where x is the unknown, M is the mass matrix (for standard ODEs, M=I), gamma is some real number (and is proportional to dt), J is the Jacobian of the ODE/SDE/DDE/etc, and b is some constant. The standard idea is to let the user choose the type of J, say Dense, Banded, Diagonal, or Sparse. Then that with the promotion of the mass matrix type determines the type of W which is then used in the linear solves.

This is fine if you want to actually store the values. But in many cases you may have a matrix-free representation of the Jacobian and mass matrix. In that case, you have in theory a matrix-free representation of W, so it would be nice to be able to set the type of W directly. This would allow for matrix-free Krylov methods to use a user chosen W via its action W*v, so the interface would just be that W needs A_mul_B!.

Then the third level, which is really only used by DSLs, is for "users" to give W^(-1). While inverses are not usually a numerically good idea, what I've seen is it's okay in this case because symbolic engines and compilers will optimize the redundant operations in the W^(-1) expression, and it's only possible to even compute this symbolically if the Jacobian is sparse enough that the terms here are not huge. So in this case, things like @ode_def supply W^(-1) and the system is solved by a single matrix multiplication.

The question then is how to give this full flexibility to the user. Sundials allows the user to plug in at two places: you can give a function to compute the Jacobian or a function for J*v. From J*v they can compute W*v = M*v - gamma*J*v. The nice thing is that then the user doesn't have to worry about getting the extra pieces correct: the mass matrix and gamma interaction is done for you.

But there is a nice connection to DAEs though. For DAEs, the Jacobian you need to specify is

dG/d(du) + gamma*dG/du

which, if you look at the ODE, is just the W matrix (since G(du,u,t) = 0 moves the du over so the sign flips). So setting W directly is natural in the DAE case, along with the actions W*x.

The bright side is that the J*v approach is perfectly fine performance-wise. This is because

W*v = M*v - gamma*J*v

requires doing the matrix interactions first, and then broadcasting the result.

A_mul_B!(mass_cache,M,v)
A_mul_B!(jac_cache,J,v)
@. Wv = mass_cache - gamma * jac_cache

is how that's done. That can only be condensed when the matrices are diagonal, which we can special case for diagonal mass matrices (and if the Jacobian is diagonal... it's not actually a system of ODEs anyways, but we can specialize that if we really wanted to too).

So then there's three perfectly fine options. The first is this set of rules:

Option 1

  1. The user can set J to some mutable type, along with a mass matrix. Then by type promotion W's type is determined and W = M - gamma * I is calculated.
  2. The user can give a function for J*v, then a function for W*v is created and given as the input to the linear solver which the user can override. A default can be set to GMRES in this case.
  3. The user can give a function for W^(-1) which is then used to directly calculate x.

That's (1)+(2) is the Sundials/LSODE/Hairer older style. The Julia updated version of that is the set of rules:

Option 2

  1. Same as before.
  2. The user can give J as some lazy operator type which defines A_mul_B! and/or *. This is used to build the W*v operator.
  3. Same as before.

Or there's the version:

Option 3

  1. Same as before
  2. the user gives a function which returns some W type which defines A_mul_B! and/or *.
  3. Same as before.

So really it's just different ways to do the more advanced version. In the end, the user-definable linear solve function gets a W, whether it's some array/matrix type or a lazy operator. In version 1 it could also be a function, which I think is more difficult and we should make use of the operator/linear map types. I am inclined to go with the second set of rules since it's easier for users to reason about the Jacobian and there's really no optimizations that can be done by the user writing the full W. However, then the connection to DAEs is somewhat muddled, but it might be good to abstract that.

Let's think about making (2) work. In this case, (1) is obvious and the user just sets jac_prototype to whatever array type they want, and mutates it via f(::Type{Val{:jac}},J,u,p,t). We can build off of that and allow jac_prototype to be a lazy operator for example, and then f(::Type{Val{:jac}},J,u,p,t) would allow the user to update it given (u,p,t) (and DiffEqArrayOperator etc. has update_coefficients! for this). But then we'd have to have some way of knowing not to do

      for j in 1:length(u), i in 1:length(u)
          @inbounds W[i,j] = mass_matrix[i,j] - γ*J[i,j]
      end

and instead build the lazy W operator. An option in the ODEProblem etc. could be lazy_jac which would default to true for AbstractSparseMatrix or Array Jacobians.

Or in case 3, there can be W_prototype along with f(::Type{Val{:W}},J,u,p,gamma,t) for updating the W type.

I'm not sure which one is more intuitive. Option 1 I think is out because using types can make this nicer to pass around extra properties. Option 2 makes it possible to interop back to the old form via closures and stuff like that, but makes the native Julia codes nicer. Option 3 may be more intuitive than Option 2, or maybe it's less intuitive because it's exposing implementation details. Right now, Option 2 is very slightly in the lead for me.

@tkf
Copy link

tkf commented Feb 9, 2018

Hi, nice write-up! I prefer Option 2 (or Option 1; but not 3), because of two reasons:

👍 Forward propagation of tangent vectors is easier in Option 2 (and 1)

It's related to what I wrote here:
SciML/OrdinaryDiffEq.jl#250

If there is an API to get a lazy Jacobian operator from an ODEProblem and DiscreteProblem then I can construct a new problem struct that does phase space and tangent space propagation. Just to be concrete, this is how I do it manually for now:

@with_kw struct Lorenz63Param
    σ::Float64 = 10.0
    ρ::Float64 = 28.0
    β::Float64 = 8/3
end

@inline function phase_dynamics!(du, u, p, t)
    @unpack σ, ρ, β = p
    du[1] = σ * (u[2] - u[1])
    du[2] = u[1] *- u[3]) - u[2]
    du[3] = u[1] * u[2] - β*u[3]
end

@inline @views function tangent_dynamics!(du, u, p, t)
    @unpack σ, ρ, β = p

    # Calculate du[:, 1] = f(u[:, 1])
    phase_dynamics!(du[:, 1], u[:, 1], p, t)

    # Calculate du[:, 2:end] = Jacobian * u[:, 2:end]
    du[1, 2:end] .= σ .* (u[2, 2:end] .- u[1, 2:end])
    du[2, 2:end] .=
        u[1, 2:end] .*.- u[3, 1]) .-
        u[1, 1] .* u[3, 2:end] .-
        u[2, 2:end]
    du[3, 2:end] .=
        u[1, 2:end] .* u[2, 1] .+ u[1, 1] .* u[2, 2:end] .-
        β .* u[3, 2:end]
end

This construction can be fully automated, since you'll be able to do it by:

@inline @views function tangent_dynamics!(du, u, phase_ode::ODEProblem, t)
    # Calculate du[:, 1] = f(u[:, 1])
    phase_ode.f(du[:, 1], u[:, 1], phase_ode.p, t)

    # Calculate du[:, 2:end] = Jacobian * u[:, 2:end]
    J = somehow get the Jacobian
    A_mul_B!(du[:, 2:end], J, v)
end

u0 = zeros(length(phase_ode.u0), length(phase_ode.u0) + 1)
u0[:, 1] = phase_ode.u0
u0[:, 2:end] = eye(length(phase_ode.u0))
tangent_prob = ODEProblem(tangent_dynamics!, u0, phase_ode.tspan, phase_ode)

This can also be done by Option 1 but not with Option 3. So I'm selfishly hoping that Option 1 or 2 will be chosen :)

Additional API?

Actually, I wonder if it makes sense for user to specify "fused" function in which left-multiplication of the Jacobian and the time derivative calculations are done simultaneously; i.e., an interface to accept the above tangent_dynamics! function directly.

(Side note: of course, I suppose you can't assume the type of the phase space in general so the function signature has to be something like tangent_dynamics!(Jv, du, v, u, p, t))

Hypothetically, for a sparse and large system (> L1 cache), I think calculating the du[i] and Jv[i] simultaneously (or the "transposed version" of it: du[:] and Jv[:] from the effect of u[j]) may help memory locality.

👍 Backward propagation can easily be supported by Option 2

I haven't looked at how parameter estimations work internally in DifferentialEquations.jl yet but a common way to do it in discrete systems (actually probably only in recurrent neural nets) is the back-propagation through time. So this would be very useful when you are fitting DiscreteProblem. I'm not sure if that's common in ODEs, but supposedly you can do the same. In this case, you need to back-propagate vectors through transposed Jacobian. With Option 2, user can just provide Base.At_mul_B! for the custom Jacobian type.

This backward propagation is also used in the calculation of covariant Lyapunov vectors. So, it would be quite nice to have this as a common interface.

@tkf
Copy link

tkf commented Feb 9, 2018

All I wanted say in the point "Forward propagation of tangent vectors..." was that "I need Jacobian, not W". Just realized that I didn't need the code to say it :)

@ChrisRackauckas
Copy link
Member Author

I haven't looked at how parameter estimations work internally in DifferentialEquations.jl yet but a common way to do it in discrete systems (actually probably only in recurrent neural nets) is the back-propagation through time. So this would be very useful when you are fitting DiscreteProblem. I'm not sure if that's common in ODEs, but supposedly you can do the same. In this case, you need to back-propagate vectors through transposed Jacobian. With Option 2, user can just provide Base.At_mul_B! for the custom Jacobian type.

Well, it's similar. It's adjoint sensitivity analysis, where you solve a backwards ODE. This uses the Jacobian too. These are all good points for working with the Jacobian directly, so that takes option 3 off the table.

http://docs.juliadiffeq.org/latest/analysis/sensitivity.html

So yeah, we want Jacobians from users for all this stuff.

Actually, I wonder if it makes sense for user to specify "fused" function in which left-multiplication of the Jacobian and the time derivative calculations are done simultaneously; i.e., an interface to accept the above tangent_dynamics! function directly.

I hadn't considered that. Yes, something like Optim.jl's fg! is a good idea. But how far does it go? Do we also fold in the time-gradient or parameter Jacobian calculations even though those don't happen with each stiff solver?

@tkf
Copy link

tkf commented Feb 9, 2018

So direct access to lazy Jacobian is coming. Awesome!

Re: the last sentence..., my English parser doesn't work... :) Are you suggesting to accept only the function to calculate f and J and get rid of Option 2(2)? I was suggesting to add a new interface "Option 2(4)" in which users can pass a function of signature (say) (Jv, du, v, u, p, t) where Jv .= Jacobian * v and du .= f(u) are computed. Internally, you can generate this function from Option 2(1) and 2(2).

@ChrisRackauckas
Copy link
Member Author

I was wondering about Option 2(4). (Jv, du, v, u, p, t), or should it be (J, du, v, u, p, t), or should it also add a spot for the time-gradient or parameter Jacobian? I think Optim.jl went all the way and then the interface is that the user checks if an input is nothing.

https://github.com/JuliaNLSolvers/NLSolversBase.jl#example

But the question really is, what optimizations can be done like this and to which of these functions can the optimizations be applied together?

@tkf
Copy link

tkf commented Feb 9, 2018

I see. I think fJ!(Jv, du, v, u, p, t) is the most flexible and optimizable version. You can call it with fJ!(Jv, du, I, u, p, t) to get Jv = J. User can even dispatch on type of v to write a specific implementation for directly calculating J, if that makes sense.

@tkf
Copy link

tkf commented Feb 9, 2018

Just to be clear if I understand the plan, if Option 2 is implemented, end-users can do something like:

p = ODEProblem(...; J = J,     # Option 2(1) and 2(2)
                    W⁻¹ = W⁻¹,  # Option 2(3)
                    fJ! = fJ!)  # Option 2(4)

where any of J, W⁻¹, fJ! can be provided or missing. End-users can provide all of them if it is appropriate for the algorithms they are using. For example, passing both J and W⁻¹ makes sense if you are using stiff solver and fitting the parameter. Passing both J and fJ! also makes sense if you are calculating the covariant Lyapunov vectors or using Hessian-free learning algorithm since both J * v and J^t v are required and having extra performance boost for J * v from fJ! is nice.

Then, the integrators and other downstream libraries would have a common interface to call those functions. Something like J_mul_B!(Jv, prob, B) to use J (or fJ! if J is missing) in prob::ODEProblem (or in integrator?).

Is this the direction you are heading?

@tkf
Copy link

tkf commented Feb 9, 2018

Probably I should call it Jf! if Jv is the first argument :)

@tkf
Copy link

tkf commented Feb 9, 2018

In Option 2(2), if you want to calculate J * v (or v * J) on an existing phase-space solution (so that fJ! is not the right function to call), you need to do something like what update_func for DiffEqArrayOperator is doing, right? Like this:

update!(J, sol(t), p, t)  # some hypothetical function
A_mul_B!(Jv, J, v)

On the other hand, in Option 1(2), the same code would be

J_mul_B!(Jv, sol(t), p, u, t)  # J_mul_B! is defined by user

Isn't Option 1(2) cleaner? I first thought the lazy operator approach is cool but not sure if it's better.

Also I wonder if it eliminate some optimization opportunity from Julia, because even if the Jacobian is totally lazy, passing u, p, t has to go through one indirection:

function update!(J::CustomJacobianType, u, p, t)
    J.u = u
    J.p = p
    J.t = t
end

function A_mul_B!(Y, J::CustomJacobianType, v)
    ... use J.u, J.p, J.t here ...
end

I don't think Julia can optimize away this since update! function has to mutate J. Of course, those assignments are super cheep but if you are calling update! & A_mul_B! combo in a tight loop, some clever optimizations (like with @polly?) that would have happened with J_mul_B! may not be possible. I know this is totally hypothetical but forcing a mutation in a loop when it is avoidable sounds a bit suspicious API design...

@ChrisRackauckas
Copy link
Member Author

All you should have to do is call update_coefficients!(J,u,p,t) to fix up the operator, and then it can be used J*v as needed. The mutation like that would never show up in profiling, and most of the time I think you'd just directly change some stencil values and not actually save the times in there.

Isn't Option 1(2) cleaner?

Not necessarily, since in many cases like using IterativeSolvers.jl we will have to build the operator anyways. DiffEq will only be using it for Krylov solvers.

In fact, I'm now wondering what fJ! is actually used for. There's no integrator that I know that actually needs to compute the two of them together. J*v is only done in Krylov iterations, in which case a derivative would have already been calculated and J*v will be calculated alone tons and tons of times.

The question then is: where will this actually be used, and what kind of optimizations are possible?

@tkf
Copy link

tkf commented Feb 9, 2018

I'm not arguing mutation itself would be time-consuming (of course not). I was worrying that the mutation makes it impossible to do some kind of optimizations like automatic loop unrolling. Though I have no idea if that's really going to happen. I need to come up with some example...

The question then is: where will this actually be used, and what kind of optimizations are possible?

fJ! is all I need for calculating the Lyapunov exponents. OK, but that's not used in DifferentialEquations.jl. All I can imagine that would be related to DifferentialEquations.jl is for the parameter fits. If you do online fitting (as in real-time recurrent learning) or Hessian-free Optimization (http://machinelearning.wustl.edu/mlpapers/paper_files/icml2010_Martens10.pdf) then forward-propagating vectors is relevant.

I think the optimization you can do would be something cache related. But I don't have any example at the moment.

@tkf
Copy link

tkf commented Feb 10, 2018

OK, so here is an example that I came up with. It's a (discrete-time) recurrent neural network. The full code is here: https://gist.github.com/tkf/3668ccf9aa704e5f1f321629ea71250f

I compared two implementations of !fJ: separated_tangent! and fused_tangent!:

@inline function phase_dynamics!(du, u, rnn, t)
    rnn.s .= tanh.(u .+ rnn.b)
    A_mul_B!(du, rnn.W, rnn.s)
end

@inline function separated_tangent!(du, u, rnn, t)
    @views phase_dynamics!(du[:, 1], u[:, 1], rnn, t)

    Y1 = @view du[:, 2:end]
    Y0 = @view u[:, 2:end]
    slopes = (rnn.s .= 1 .- rnn.s.^2)
    n, m = size(Y1)

    Y1 .= 0
    @inbounds for k in 1:m
        for j in 1:n
            @views Y1[:, k] .+= rnn.W[:, j] .* (slopes[j] * Y0[j, k])
        end
    end
end

@inline @generated function fused_tangent!(du, u, rnn, t,
                                           ::Type{Val{m}}) where {m}
    quote
        n, l = size(du)
        @assert l - 1 == $m

        du .= 0
        @inbounds for j in 1:n
            sj = tanh(u[j, 1] + rnn.b[j])
            slope = 1 - sj^2
            @simd for i = 1:n
                du[i, 1] += rnn.W[i, j] * sj
                @nexprs $m k->(du[i, k+1] += rnn.W[i, j] * slope * u[j, k+1])
            end
        end
    end
end

The idea is that, if the weight W is big, the memory access for W becomes dominating part of the computation. So, accessing W[i, j] twice as in separated_tangent! would kill the performance. Indeed, for system size > 1000 fused_tangent! is ~40% faster. Also, there is more than 20% speedup even at moderate system size = 128.:

image

Disclaimer: This speedup is only evident if you want to evaluate J*v for a few vectors v. The graph above is with two vectors. If you need full matrix-matrix multiplication J*B* then I don't know if there is a better way to do it. However, for parameter fitting, only one vector has to be multiplied by Jacobian so it's fine. For Lyapunov analysis, you may want to calculate first few Lyapunov exponents so this speedup can be relevant in those cases.

@tkf
Copy link

tkf commented Feb 10, 2018

Side note: The swich from separated_tangent! being faster to slower happens at system size 2^6 = 64. Here, the weight matrix W has length(W) = 64^2 = 4096 which is 32 KiB = my L1 cache size. But actually the whole memory consumption should be larger than that because there are other arrays. I'm a bit puzzled for why separated_tangent! can perform better in this case...

@ChrisRackauckas
Copy link
Member Author

I'm confused by your fused_tangent!: shouldn't it have two output arrays its mutating: Jv and du?

@ChrisRackauckas
Copy link
Member Author

You're accessing du in row-major order for the fused_tangent! BTW.

@ChrisRackauckas
Copy link
Member Author

ChrisRackauckas commented Feb 10, 2018

Looking at this more, backprop wouldn't make use of it because it doesn't need f values, it just uses interpolant values:

https://github.com/JuliaDiffEq/DiffEqSensitivity.jl/blob/master/src/adjoint_sensitivity.jl

Forward sensitivity could make use of it

https://github.com/JuliaDiffEq/DiffEqSensitivity.jl/blob/master/src/local_sensitivity.jl

But since J*v is needed multiple times, doing Jv and du together isn't very helpful since that's only one of the many calculations Jv calculations. And none of the integrators would make use of it internally. So we need to make really sure that this will actually be used somewhere, and that it matters for performance, before commiting to support it.

@ChrisRackauckas
Copy link
Member Author

I played with your example a bit. What you were actually measuring was just that the Cartesian method was faster. Both without Cartesian and both with Cartesian need to be measured.

using Base.Cartesian: @nexprs
type RNN{TM, TV}
    W::TM
    b::TV
    s::TV

    function RNN(W::TM, b::TV) where {TM <: AbstractMatrix,
                                      TV <: AbstractVector}
        (n,) = size(b)
        @assert size(W) == (n, n)
        return new{TM, TV}(W, b, similar(b))
    end
end

@inline function phase_dynamics!(du, u, rnn, t)
    rnn.s .= tanh.(u .+ rnn.b)
    A_mul_B!(du, rnn.W, rnn.s)
end

function separated_tangent!(du, u, rnn, t)
    du .= 0
    @views phase_dynamics!(du[:, 1], u[:, 1], rnn, t)

    slopes = (rnn.s .= 1 .- rnn.s.^2)
    n, m = size(du)
    sze = size(rnn.W,1)
    @inbounds for k in 2:m
        for j in 1:n
            for i in 1:sze
                du[i, k] += rnn.W[i, j] * (slopes[j] * u[j, k])
            end
        end
    end
end

@generated function separated_tangent!(du, u, rnn, t,
                                           ::Type{Val{m}}) where {m}
    quote
        n, l = size(du)
        @assert l - 1 == $m
        @views phase_dynamics!(du[:, 1], u[:, 1], rnn, t)
        du[:,2:end] .= 0
        slopes = rnn.s
        @inbounds for i in 1:length(slopes)
            slopes[i] = 1 - slopes[i]^2
        end
        @inbounds for j in 1:n
            slope = slopes[j]
            for i = 1:n
                @nexprs $m k->(du[i, k+1] += rnn.W[i, j] * slope * u[j, k+1])
            end
        end
    end
end

function fused_tangent!(du, u, rnn, t)
    n, l = size(du)
    du .= 0
    @inbounds for j in 1:n
        sj = tanh(u[j, 1] + rnn.b[j])
        slope = 1 - sj^2
        for i = 1:n
            du[i, 1] += rnn.W[i, j] * sj
            for k in 1:l-1
                du[i, k+1] += rnn.W[i, j] * slope * u[j, k+1]
            end
        end
    end
end

@inline @generated function fused_tangent!(du, u, rnn, t,
                                           ::Type{Val{m}}) where {m}
    quote
        n, l = size(du)
        @assert l - 1 == $m

        du .= 0
        @inbounds for j in 1:n
            sj = tanh(u[j, 1] + rnn.b[j])
            slope = 1 - sj^2
            @simd for i = 1:n
                du[i, 1] += rnn.W[i, j] * sj
                @nexprs $m k->(du[i, k+1] += rnn.W[i, j] * slope * u[j, k+1])
            end
        end
    end
end

using Base.Test
using BenchmarkTools

st_trials = []
vst_trials = []
vft_trials = []
ft_trials = []

for n in 2.^(5:11)
    k = 2
    rnn = RNN(randn(n, n), randn(n))
    u0 = randn(n, 1 + k)
    u1 = similar(u0)
    u2 = similar(u0)
    u3 = similar(u0)
    u4 = similar(u0)

    separated_tangent!(u1, u0, rnn, 0)
    fused_tangent!(u2, u0, rnn, 0, Val{k})
    fused_tangent!(u3, u0, rnn, 0)
    separated_tangent!(u4, u0, rnn, 0, Val{k})
    @test u1  u2
    @test u1  u3
    @test u1  u4

    push!(st_trials, @benchmark separated_tangent!($u1, $u0, $rnn, 0))
    push!(vst_trials, @benchmark separated_tangent!($u4, $u0, $rnn, 0, Val{$k}))
    push!(vft_trials, @benchmark fused_tangent!($u2, $u0, $rnn, 0, Val{$k}))
    push!(ft_trials, @benchmark fused_tangent!($u3, $u0, $rnn, 0))
end

using Plots

st_mtime = [est.time for est in map(minimum, st_trials)]
ft_mtime = [est.time for est in map(minimum, ft_trials)]
vst_mtime = [est.time for est in map(minimum, vst_trials)]
vft_mtime = [est.time for est in map(minimum, vft_trials)]

plt = plot(2.^(5:11), ft_mtime ./ st_mtime, xlabel="system size", ylabel="relative time", title = "Standard")
savefig("standard.png")
plt = plot(2.^(5:11), vft_mtime ./ vst_mtime, xlabel="system size", ylabel="relative time", title = "Cartesian")
savefig("cartesian.png")

standard

cartesian

It's not entirely clear to me, when you compare apples-to-apples, that this is a big difference. In fact, with multiple applications of J*v (Krylov iterations) the first one will be much better because you are storing the slopes, and so you can do a quick time check to see if you should re-evaluate the slopes, whereas the second one doesn't immediately have a way to handle this.

@ChrisRackauckas
Copy link
Member Author

Here's where I am at.

Option 3 is out.

Option 1 would require that DiffEq knows how to find out when the Jacobian is a lazy function or dense. It would need separate code paths each time the Jacobian is involved to either call the function or to A_mul_B. Option 2 still needs to check if it's lazy or indexable to do the W calculation, or build a lazy W. Option 2 also fits with the DiffEqOperators.jl interface. While there are some cases where it would be nice to directly pass in u,p,t instead of *, we can make this easier via a FunctionOperator <: DiffEqOperator.

So Option 2 > Option 1. If this ends up being a wrong idea, we can always add on Option 1's style. In that case, the changes to support this is quite minimal.

The next thing is whether fJ!(J, du, u, p, t) makes sense, or fJ!(Jv, du, v, u, p, t), or other variants. First of all, DiffEq would not make use of these internally. The integrators have a bunch of special triggers to avoid calculating the Jacobians each step and instead use approximation when it can. The only place in JuliaDiffEq where this could be used is forward sensitivity equations. In that case, I'm not convinced there's actually a performance benefit, at least if it's just a view to a function call and then a matrix multiplication later.

https://github.com/JuliaDiffEq/DiffEqSensitivity.jl/blob/master/src/local_sensitivity.jl

To make this worthwhile, it would have to be fJ!(J, du, u, p, t) or all fJ!(Jv, du, v, u, p, t) would have to support v a matrix, and even then (tests above) I'm not convinced that it will actually be faster.

The kicker is that it can always be added later. If we see the need, we can always add fJ!, so I'm going to table this for now.

@tkf
Copy link

tkf commented Feb 11, 2018

You're accessing du in row-major order for the fused_tangent! BTW.

Of course I realized that, but I thought it'd be fine since that's the penalty for fused_tangent!. Otherwise it's not clear if I'm comparing the code or memory layout.

Looking at this more, backprop wouldn't make use of it because it doesn't need f values

Also, BPTT needs transposed Jacobian so yea, it's irrelevant. But if you need online fit, then you probably do real-time recurrent learning (RTRL) and then it's relevant. (RTRL is to ForwardDiff as BPTT is to ReverseDiff.) Hessian-free learning needs both J*v and v*J. I suppose those are not in DifferentialEquations.jl yet but I'm just pointing out the places that it could be used, since parameter fitting seems to be one of the focus in DifferentialEquations.jl.

Forward sensitivity could make use of it [...] But since J*v is needed multiple times, doing Jv and du together isn't very helpful since that's only one of the many calculations Jv calculations.

Doesn't "J*v is needed multiple times" just mean that you need v to be a matrix? My example is doing it, BTW.

I played with your example a bit. What you were actually measuring was just that the Cartesian method was faster. Both without Cartesian and both with Cartesian need to be measured.

I see. Thank you very much! 40% was a big difference and I should be careful about the confirmation bias.

It's not entirely clear to me, when you compare apples-to-apples, that this is a big difference.

So in Cartesian figure it says 10% speed up for > 2000 but that's not enough? (OK, I'm not sure if the trend continue.)

Anyway, this is not super hard to have it in each libraries codebase so I think ti's fine if DifferentialEquations.jl does not support it. I just thought it'd be nice to have it since it is used at least in two different scientific disciplines (machine learning and dynamical systems). Adding fJ! support will be backward-compatible so probably it's not necessary to discuss it here.

@tobydriscoll
Copy link

Late to the party but...even if the user supplies just J in whatever form, won't they potentially want to precondition for W?

@ChrisRackauckas
Copy link
Member Author

yes, I wonder how Sundials documents this.

@ChrisRackauckas
Copy link
Member Author

One thing I've noticed since the earlier discussion is that there are some algorithms which need to work directly with the Jacobian instead of with W. For example, exponential Rosenbrock methods and EPIRK integrators need the Jacobian and never use W. So option 3 is out: working at the level of W isn't always appropriate. Option 2 is much easier to implement since we can just allow someone to return a multiplying type, so I'm definitely going in that direction. So I think that the solution is to implement Option 2 and then if we do need fJ! support we can return to add it later as an optimization for @tkf et al., but for now getting this done is of high priority.

I assigned the GSoC student @MSeeker1340 to handle this and hopefully it will be completed before the end of July. I plan on showing all sorts of cool matrix-free use cases in the JuliaCon talk.

@MSeeker1340
Copy link

Haven't looked at the forward/backward propagation part yet (though it is sure interesting), let me just pitch in and bring in the some of the updates to OrdinaryDiffEq and DiffEqOperators to the discussion.

  1. The old trait-based interface f(::Type{Val{:jac}},J,u,p,t) should now be replaced by f.jac(J,u,p,t) for f the updated ODEFunction by @YingboMa. So the end user interface

    p = ODEProblem(_f, u0, tspan; jac=J,     # Option 2(1) and 2(2)
                   invW=W⁻¹)                 # Option 2(3)

    should now directly translate to ODEFunction(f; jac=J, invW=W⁻¹). The future update for fJ! support should also be on this level.

  2. If jac_prototype is a DiffEqArrayOperator or an AbstractDerivativeOperator from DiffEqOperators, then we can simply set f.jac = update_coefficients! because they use the same (J,u,p,t) interface. The jac parameter in ODEProblem will now instead be used as the update function to combine with jac_prototype to construct the desired lazy operator.

  3. Speaking of laziness, a lazy W operator can be constructed simply as W = M - gamma*J from a lazy M and J operator using the operator composition stuff I developed for DiffEqOperators back in April (New DiffEqOperator Interface for Array and Composite Operators DiffEqOperators.jl#60). And if J is lazy while M is a regular matrix, then a convert method from AbstractMatrix to DiffEqArrayOperator should cover the case.
    This is neat because the construction of W only needs to be done once in the cache initialization step. calc_W! now trivially calls update_coefficients! for W, which simply recursively updates M and J. Things can get a little more complicated if the time step (and by extension gamma) is allowed to adapt, for which we should make sure to also update gamma whenever dt changes.

  4. @YingboMa the rewrite of SplitFunction also ties in with this, because the jacobian (and by extension W) for f = f1 + f2 can be automatically generated if the jacobian for both f1 and f2 is known. There're also a lot of special scenarios to consider also, e.g. when f1 is linear and when f2 is a constant (for affine problems). Of course we can still manually specify the jacobian/W but it would be nice to the end user if automation can be used.

It seems to me that the majority of work will in fact lead me back to DiffEqOperators and lazy operator composition, which to me is like going a full circle ;) @ChrisRackauckas is there anything else you'd like to add?

@ChrisRackauckas
Copy link
Member Author

I have been thinking that instead of 1 we should just expect users to use the ODEFunction constructor directly. But putting it in the problem constructor may be an easier interface. We would just have to make sure we send the right kwargs over to the ODEFunction call to do it, and we would have to do this with each problem type. This may be easier than documenting each DiffEqFunction type. Another question that might arise then is whether we need a separate type, or whether this really should be just a big problem type with a lot of spots for functions. This idea for the new way to handle all of the extra functions really isn't settled yet but it's at least in the right direction enough to start implementing and using it.

2 is exactly what I had in mind.

3 is a great idea. That solves it well. Users shouldn't ever touch gamma so yes, on the solver side it will have to do something like make a new lazy composition to change gamma and that's more than simple with your idea.

4: should it be a tuple of Jacobians?

Yeah, it sounds like this solution will require fixing operator compositions on v0.7 👍

@ChrisRackauckas
Copy link
Member Author

Done via SciML/OrdinaryDiffEq.jl#443 . Similar changes to StochasticDiffEq.jl will follow shortly. This already exists for Sparse support in Sundials.jl. Any others will happen as they do since this issue was for the general interface. If there's a solver that is compatible with more Jacobian types and they should be added, those should get solver package issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants