Jacobian types #220
The solution here needs another tier. Here's a full explanation.

In stiff solvers, you always need to solve the linear system (M - gamma*J)x = Wx = b. This is fine if you want to actually store the values. But in many cases you may have a matrix-free representation of the Jacobian and the mass matrix; in that case, you have in theory a matrix-free representation of W. Then the third level, which is really only used by DSLs, is for "users" to give the inverse of W directly.

The question then is how to give this full flexibility to the user. Sundials allows the user to plug in at two places: you can give a function to compute the Jacobian, or a function for Jacobian-vector products (J*v).

But there is a nice connection to DAEs. For DAEs, the Jacobian you need to specify is dG/d(du) + gamma*dG/du which, if you look at the ODE case, is just W. The bright side is that W*v = M*v - gamma*J*v requires doing the matrix actions first and then broadcasting the result:

```julia
A_mul_B!(mass_cache, M, v)
A_mul_B!(jac_cache, J, v)
@. Wv = mass_cache - gamma * jac_cache
```

is how that's done. That can only be condensed when the matrices are diagonal, which we can special-case for diagonal mass matrices (and if the Jacobian is diagonal it's not actually a system of ODEs anyway, but we could specialize that too if we really wanted to).
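For the diagonal mass-matrix special case, a sketch of what the condensed version could look like (purely illustrative, not the actual specialization in OrdinaryDiffEq):

```julia
# Assuming M isa Diagonal: M*v is elementwise, so only one matrix pass (J*v)
# is needed and everything else fuses into a single broadcast.
A_mul_B!(jac_cache, J, v)                 # jac_cache = J*v
@. Wv = M.diag * v - gamma * jac_cache    # Wv = M*v - gamma*J*v in one fused pass
```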
So then there are three perfectly fine options. The first is Option 1's set of rules. That combination of (1) and (2) is the Sundials/LSODE/Hairer older style. The Julia-updated version of that is Option 2's set of rules. Or there's Option 3's version.
So really it's just different ways to do the more advanced version. In the end, the user-definable linear solve function gets a W in some form.

Let's think about making Option 2 work. In this case, (1) is obvious: the user just supplies a dense Jacobian and we build W with

```julia
for j in 1:length(u), i in 1:length(u)
    @inbounds W[i,j] = mass_matrix[i,j] - γ*J[i,j]
end
```

With a matrix-free Jacobian we would skip that dense loop and instead build the lazy W.
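A minimal sketch of what such a lazy W could look like (hypothetical `LazyW` name and layout, not an actual DiffEq type; in practice gamma changes every step, so it would be mutable or rebuilt):

```julia
import Base: A_mul_B!

# Lazy W = M - gamma*J, defined only through its action on vectors.
struct LazyW{TM,TJ,T,C}
    M::TM           # mass matrix or mass operator
    J::TJ           # Jacobian operator; only needs to support A_mul_B!
    gamma::T
    mass_cache::C   # temporary for M*v
    jac_cache::C    # temporary for J*v
end

function A_mul_B!(Wv, W::LazyW, v)
    A_mul_B!(W.mass_cache, W.M, v)
    A_mul_B!(W.jac_cache, W.J, v)
    @. Wv = W.mass_cache - W.gamma * W.jac_cache
    return Wv
end
```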
Or, in Option 3, there can be a W (or its inverse) supplied directly. I'm not sure which one is more intuitive. I think Option 1 is out, because using types makes it nicer to pass around extra properties. Option 2 makes it possible to interop back to the old form via closures and the like, but makes the native Julia codes nicer. Option 3 may be more intuitive than Option 2, or maybe it's less intuitive because it exposes implementation details. Right now, Option 2 is very slightly in the lead for me. |
Hi, nice write-up! I prefer Option 2 (or Option 1; but not 3), for two reasons:

👍 Forward propagation of tangent vectors is easier in Option 2 (and 1)

It's related to what I wrote here: if there is an API to get a lazy Jacobian operator from an ODEProblem, then tangent dynamics like the following (written by hand here for Lorenz 63) can be constructed automatically:

```julia
using Parameters  # for @with_kw / @unpack

@with_kw struct Lorenz63Param
    σ::Float64 = 10.0
    ρ::Float64 = 28.0
    β::Float64 = 8/3
end

@inline function phase_dynamics!(du, u, p, t)
    @unpack σ, ρ, β = p
    du[1] = σ * (u[2] - u[1])
    du[2] = u[1] * (ρ - u[3]) - u[2]
    du[3] = u[1] * u[2] - β*u[3]
end

@inline @views function tangent_dynamics!(du, u, p, t)
    @unpack σ, ρ, β = p
    # Calculate du[:, 1] = f(u[:, 1])
    phase_dynamics!(du[:, 1], u[:, 1], p, t)
    # Calculate du[:, 2:end] = Jacobian * u[:, 2:end]
    du[1, 2:end] .= σ .* (u[2, 2:end] .- u[1, 2:end])
    du[2, 2:end] .=
        u[1, 2:end] .* (ρ .- u[3, 1]) .-
        u[1, 1] .* u[3, 2:end] .-
        u[2, 2:end]
    du[3, 2:end] .=
        u[1, 2:end] .* u[2, 1] .+ u[1, 1] .* u[2, 2:end] .-
        β .* u[3, 2:end]
end
```

This construction can be fully automated, since you'll be able to do it by:

```julia
@inline @views function tangent_dynamics!(du, u, phase_ode::ODEProblem, t)
    # Calculate du[:, 1] = f(u[:, 1])
    phase_ode.f(du[:, 1], u[:, 1], phase_ode.p, t)
    # Calculate du[:, 2:end] = Jacobian * u[:, 2:end]
    J = somehow get the Jacobian
    A_mul_B!(du[:, 2:end], J, u[:, 2:end])
end

u0 = zeros(length(phase_ode.u0), length(phase_ode.u0) + 1)
u0[:, 1] = phase_ode.u0
u0[:, 2:end] = eye(length(phase_ode.u0))
tangent_prob = ODEProblem(tangent_dynamics!, u0, phase_ode.tspan, phase_ode)
```

This can also be done with Option 1, but not with Option 3. So I'm selfishly hoping that Option 1 or 2 will be chosen :)

Additional API?

Actually, I wonder if it makes sense for the user to specify a "fused" function in which the left-multiplication by the Jacobian and the time-derivative calculation are done simultaneously, i.e., an interface to accept something like the tangent_dynamics! above directly. (Side note: of course, I suppose you can't assume the type of the phase space in general, so the actual function signature would have to be more general than the one above.) Hypothetically, for a sparse and large system (larger than L1 cache), I think calculating the time derivative and the Jacobian products together could be faster, since the system only has to be streamed through the cache once.

👍 Backward propagation can easily be supported by Option 2

I haven't looked at how parameter estimation works internally in DifferentialEquations.jl yet, but a common way to do it in discrete systems (actually probably only in recurrent neural nets) is back-propagation through time. So this would be very useful when you are fitting model parameters. This backward propagation is also used in the calculation of covariant Lyapunov vectors. So it would be quite nice to have this as a common interface. |
All I wanted to say in the point "Forward propagation of tangent vectors..." was that "I need the Jacobian, not W". Just realized that I didn't need the code to say it :) |
Well, it's similar. It's adjoint sensitivity analysis, where you solve a backwards ODE, and this uses the Jacobian too (see http://docs.juliadiffeq.org/latest/analysis/sensitivity.html). These are all good points for working with the Jacobian directly, so that takes Option 3 off the table. So yeah, we want Jacobians from users for all this stuff.
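For reference, the backwards ODE in continuous adjoint sensitivity has roughly the standard form lambda' = -(df/du)'*lambda - (dg/du)', solved backwards from lambda(T) = 0 (textbook notation, not DiffEq-specific), so what it repeatedly needs is transposed Jacobian-vector products rather than W.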
I hadn't considered that. Yes, something like Optim.jl's fg! / only_fg! interface. |
So direct access to the lazy Jacobian is coming. Awesome! Re: the last sentence..., my English parser doesn't work... :) Are you suggesting to accept only the function that calculates both at once? |
I was wondering about Option 2(4). https://github.com/JuliaNLSolvers/NLSolversBase.jl#example But the question really is: what optimizations can be done like this, and to which of these functions can the optimizations be applied together? |
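For reference, the NLSolversBase/Optim fused interface linked above looks roughly like this (example adapted from their docs; not a DiffEq API):

```julia
using Optim

# Fused objective-and-gradient: shared work is computed once, then whichever
# of F (value) and G (gradient) the caller requested gets filled in.
function fg!(F, G, x)
    common = x .- 1               # work shared by value and gradient
    if G !== nothing
        G .= 2 .* common          # gradient, written in place
    end
    if F !== nothing
        return sum(abs2, common)  # objective value
    end
    nothing
end

optimize(Optim.only_fg!(fg!), zeros(2), LBFGS())
```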
I see. I think … |
Just to be clear that I understand the plan: if Option 2 is implemented, end-users can do something like …, where any of those pieces could be omitted. Then, the integrators and other downstream libraries would have a common interface to call those functions. Is this the direction you are heading? |
Probably I should call it … |
In Option 2(2), if you want to calculate …

On the other hand, in Option 1(2), the same code would be …

Isn't Option 1(2) cleaner? I first thought the lazy operator approach was cool, but I'm not sure it's better. Also, I wonder if it eliminates some optimization opportunities for Julia, because even if the Jacobian is totally lazy, passing …
I don't think Julia can optimize away this since … |
All you should have to do is call …
Not necessarily, since in many cases, like using IterativeSolvers.jl, we will have to build the operator anyways. DiffEq will only be using it for Krylov solvers. In fact, the question I'm now wondering about is: where will this actually be used, and what kind of optimizations are possible? |
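As a concrete illustration of that point (generic packages, hypothetical names, not DiffEq internals): for a Krylov solver you end up wrapping the W action in an operator object anyway, e.g.

```julia
using LinearMaps, IterativeSolvers

n = 100
gamma = 1e-2
M = Diagonal(ones(n))        # stand-in mass matrix
J = randn(n, n)              # stand-in; J*v would itself be matrix-free in practice

# W*v = M*v - gamma*J*v, defined only through its action on a vector
w_mul!(Wv, v) = (Wv .= M*v .- gamma .* (J*v); Wv)

W = LinearMap{Float64}(w_mul!, n; ismutating = true)
x = gmres(W, randn(n))       # the Krylov solve never materializes W
```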
I'm not arguing that mutation itself would be time-consuming (of course not). I was worried that the mutation makes it impossible to do some kinds of optimizations, like automatic loop unrolling. Though I have no idea if that's really going to happen. I need to come up with some example...
I think the optimization you can do would be something cache-related. But I don't have any example at the moment. |
OK, so here is an example that I came up with. It's a (discrete-time) recurrent neural network. The full code is here: https://gist.github.com/tkf/3668ccf9aa704e5f1f321629ea71250f I compared two implementations of the tangent dynamics, a separated one and a fused one:

```julia
using Base.Cartesian: @nexprs  # for @nexprs in the generated function

@inline function phase_dynamics!(du, u, rnn, t)
    rnn.s .= tanh.(u .+ rnn.b)
    A_mul_B!(du, rnn.W, rnn.s)
end

# Separated: phase dynamics first, then the Jacobian applied to the tangents.
@inline function separated_tangent!(du, u, rnn, t)
    @views phase_dynamics!(du[:, 1], u[:, 1], rnn, t)
    Y1 = @view du[:, 2:end]
    Y0 = @view u[:, 2:end]
    slopes = (rnn.s .= 1 .- rnn.s.^2)
    n, m = size(Y1)
    Y1 .= 0
    @inbounds for k in 1:m
        for j in 1:n
            @views Y1[:, k] .+= rnn.W[:, j] .* (slopes[j] * Y0[j, k])
        end
    end
end

# Fused: phase dynamics and tangent propagation in a single sweep over W.
@inline @generated function fused_tangent!(du, u, rnn, t,
                                           ::Type{Val{m}}) where {m}
    quote
        n, l = size(du)
        @assert l - 1 == $m
        du .= 0
        @inbounds for j in 1:n
            sj = tanh(u[j, 1] + rnn.b[j])
            slope = 1 - sj^2
            @simd for i = 1:n
                du[i, 1] += rnn.W[i, j] * sj
                @nexprs $m k->(du[i, k+1] += rnn.W[i, j] * slope * u[j, k+1])
            end
        end
    end
end
```

The idea is that, if the weight matrix is larger than the cache, the fused version only has to stream it through once while computing both the time derivative and the tangent propagation. Disclaimer: this speedup is only evident if you want to evaluate the phase dynamics and the tangent dynamics at the same time. |
Side note: the switch from … |
I'm confused by your … |
You're accessing … |
Looking at this more, backprop wouldn't make use of it, because it doesn't need f at the same time as the Jacobian product: https://github.com/JuliaDiffEq/DiffEqSensitivity.jl/blob/master/src/adjoint_sensitivity.jl Forward sensitivity could make use of it: https://github.com/JuliaDiffEq/DiffEqSensitivity.jl/blob/master/src/local_sensitivity.jl But since … |
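For context, the forward (local) sensitivity equations have roughly the standard form dS_i/dt = (df/du)*S_i + df/dp_i for each sensitivity S_i = du/dp_i (textbook notation, not DiffEq-specific), i.e. one Jacobian-vector product per parameter evaluated at the same (u, t) as f.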
I played with your example a bit. What you were actually measuring was just that the Cartesian method was faster. Both versions without Cartesian and both with Cartesian need to be measured.

```julia
using Base.Cartesian: @nexprs

type RNN{TM, TV}
    W::TM
    b::TV
    s::TV
    function RNN(W::TM, b::TV) where {TM <: AbstractMatrix,
                                      TV <: AbstractVector}
        (n,) = size(b)
        @assert size(W) == (n, n)
        return new{TM, TV}(W, b, similar(b))
    end
end

@inline function phase_dynamics!(du, u, rnn, t)
    rnn.s .= tanh.(u .+ rnn.b)
    A_mul_B!(du, rnn.W, rnn.s)
end

# Separated: phase dynamics first, then Jacobian applied to the tangents.
function separated_tangent!(du, u, rnn, t)
    du .= 0
    @views phase_dynamics!(du[:, 1], u[:, 1], rnn, t)
    slopes = (rnn.s .= 1 .- rnn.s.^2)
    n, m = size(du)
    sze = size(rnn.W, 1)
    @inbounds for k in 2:m
        for j in 1:n
            for i in 1:sze
                du[i, k] += rnn.W[i, j] * (slopes[j] * u[j, k])
            end
        end
    end
end

# Separated, with the tangent loop unrolled via Base.Cartesian.
@generated function separated_tangent!(du, u, rnn, t,
                                       ::Type{Val{m}}) where {m}
    quote
        n, l = size(du)
        @assert l - 1 == $m
        @views phase_dynamics!(du[:, 1], u[:, 1], rnn, t)
        du[:, 2:end] .= 0
        slopes = rnn.s
        @inbounds for i in 1:length(slopes)
            slopes[i] = 1 - slopes[i]^2
        end
        @inbounds for j in 1:n
            slope = slopes[j]
            for i = 1:n
                @nexprs $m k->(du[i, k+1] += rnn.W[i, j] * slope * u[j, k+1])
            end
        end
    end
end

# Fused: phase dynamics and tangent propagation in a single sweep over W.
function fused_tangent!(du, u, rnn, t)
    n, l = size(du)
    du .= 0
    @inbounds for j in 1:n
        sj = tanh(u[j, 1] + rnn.b[j])
        slope = 1 - sj^2
        for i = 1:n
            du[i, 1] += rnn.W[i, j] * sj
            for k in 1:l-1
                du[i, k+1] += rnn.W[i, j] * slope * u[j, k+1]
            end
        end
    end
end

# Fused, with the tangent loop unrolled via Base.Cartesian.
@inline @generated function fused_tangent!(du, u, rnn, t,
                                           ::Type{Val{m}}) where {m}
    quote
        n, l = size(du)
        @assert l - 1 == $m
        du .= 0
        @inbounds for j in 1:n
            sj = tanh(u[j, 1] + rnn.b[j])
            slope = 1 - sj^2
            @simd for i = 1:n
                du[i, 1] += rnn.W[i, j] * sj
                @nexprs $m k->(du[i, k+1] += rnn.W[i, j] * slope * u[j, k+1])
            end
        end
    end
end

using Base.Test
using BenchmarkTools

st_trials = []
vst_trials = []
vft_trials = []
ft_trials = []
for n in 2 .^ (5:11)
    k = 2
    rnn = RNN(randn(n, n), randn(n))
    u0 = randn(n, 1 + k)
    u1 = similar(u0)
    u2 = similar(u0)
    u3 = similar(u0)
    u4 = similar(u0)
    separated_tangent!(u1, u0, rnn, 0)
    fused_tangent!(u2, u0, rnn, 0, Val{k})
    fused_tangent!(u3, u0, rnn, 0)
    separated_tangent!(u4, u0, rnn, 0, Val{k})
    @test u1 ≈ u2
    @test u1 ≈ u3
    @test u1 ≈ u4
    push!(st_trials, @benchmark separated_tangent!($u1, $u0, $rnn, 0))
    push!(vst_trials, @benchmark separated_tangent!($u4, $u0, $rnn, 0, Val{$k}))
    push!(vft_trials, @benchmark fused_tangent!($u2, $u0, $rnn, 0, Val{$k}))
    push!(ft_trials, @benchmark fused_tangent!($u3, $u0, $rnn, 0))
end

using Plots
st_mtime = [est.time for est in map(minimum, st_trials)]
ft_mtime = [est.time for est in map(minimum, ft_trials)]
vst_mtime = [est.time for est in map(minimum, vst_trials)]
vft_mtime = [est.time for est in map(minimum, vft_trials)]
plt = plot(2 .^ (5:11), ft_mtime ./ st_mtime, xlabel="system size", ylabel="relative time", title = "Standard")
savefig("standard.png")
plt = plot(2 .^ (5:11), vft_mtime ./ vst_mtime, xlabel="system size", ylabel="relative time", title = "Cartesian")
savefig("cartesian.png")
```

It's not entirely clear to me, when you compare apples-to-apples, that this is a big difference. In fact, with multiple applications of … |
Here's where I am at. Option 3 is out. Option 1 would require that DiffEq knows how to find out whether the Jacobian is a lazy function or dense; it would need separate code paths every place the Jacobian is involved, to either call the function or use the matrix directly. So Option 2 > Option 1. If this ends up being a wrong idea, we can always add on Option 1's style later; in that case, the changes to support this are quite minimal.

The next thing is whether the fused interface is worthwhile, e.g. for forward sensitivity: https://github.com/JuliaDiffEq/DiffEqSensitivity.jl/blob/master/src/local_sensitivity.jl To make this worthwhile, it would have to be … The kicker is that it can always be added later: if we see the need, we can always add it. |
Of course I realized that, but I thought it'd be fine since that's the penalty for …
Also, BPTT needs the transposed Jacobian, so yeah, it's irrelevant there. But if you need online fitting, then you probably do real-time recurrent learning (RTRL), and then it's relevant. (RTRL is to ForwardDiff as BPTT is to ReverseDiff.) Hessian-free learning needs both the forward and the transposed products, …
Doesn't "…"
I see. Thank you very much! 40% was a big difference and I should be careful about confirmation bias.
So in the Cartesian figure it says ~10% speedup for sizes > 2000, but that's not enough? (OK, I'm not sure if the trend continues.) Anyway, this is not super hard to have in each library's codebase, so I think it's fine if DifferentialEquations.jl does not support it. I just thought it'd be nice to have since it is used in at least two different scientific disciplines (machine learning and dynamical systems). Adding … |
Late to the party but...even if the user supplies just J in whatever form, won't they potentially want to precondition for W? |
Yes, I wonder how Sundials documents this. |
One thing I've noticed since the earlier discussion is that there are some algorithms which need to work directly with the Jacobian instead of with W. For example, exponential Rosenbrock methods and EPIRK integrators need the Jacobian and never use W. So Option 3 is out: working at the level of W isn't always appropriate. Option 2 is much easier to implement since we can just allow someone to return a multiplying type, so I'm definitely going in that direction. So I think that the solution is to implement Option 2 and then, if we do need Option 1's style, add it later. I assigned the GSoC student @MSeeker1340 to handle this and hopefully it will be completed before the end of July. I plan on showing all sorts of cool matrix-free use cases in the JuliaCon talk. |
Haven't looked at the forward/backward propagation part yet (though it sure is interesting); let me just pitch in and bring some of the updates to OrdinaryDiffEq and DiffEqOperators into the discussion.
It seems to me that the majority of the work will in fact lead me back to DiffEqOperators and lazy operator composition, which to me is like going full circle ;) @ChrisRackauckas is there anything else you'd like to add? |
I have been thinking that instead of …
Yeah, it sounds like this solution will require fixing operator compositions on v0.7 👍 |
Done via SciML/OrdinaryDiffEq.jl#443. Similar changes to StochasticDiffEq.jl will follow shortly. Sparse support already exists in Sundials.jl. Any others will happen as needed, since this issue was for the general interface. If there's a solver that is compatible with more Jacobian types which should be added, that should get an issue in the solver's package. |
We need to be able to handle non-dense Jacobians. I plan on doing this by allowing Jacobian types to be chosen in the ODEProblem etc. types as a jac_prototype, which allows the user to pass a function which generates the type which will be used for the Jacobian. This will let the user make use of a BandedMatrix from BandedMatrices.jl, or a sparse matrix, or even allows for matrix-free representations of the Jacobian.
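A sketch of how that looks from the user side in the API as it eventually landed (where jac_prototype is an instance of the matrix type rather than a generating function; the problem here is a made-up tridiagonal RHS just for illustration):

```julia
using OrdinaryDiffEq, SparseArrays

# Tridiagonal "discrete Laplacian"-like RHS, so the Jacobian is sparse/banded.
function f!(du, u, p, t)
    n = length(u)
    du[1] = -2u[1] + u[2]
    for i in 2:n-1
        du[i] = u[i-1] - 2u[i] + u[i+1]
    end
    du[n] = u[n-1] - 2u[n]
end

n = 32
jac_proto = spdiagm(-1 => ones(n-1), 0 => fill(-2.0, n), 1 => ones(n-1))

func = ODEFunction(f!; jac_prototype = jac_proto)
prob = ODEProblem(func, rand(n), (0.0, 1.0))
sol = solve(prob, Rosenbrock23())  # W is built and factored in the sparse type
```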