Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Second order for vector to vector functions #206

Closed
gdalle opened this issue Apr 24, 2024 · 10 comments
Closed

Second order for vector to vector functions #206

gdalle opened this issue Apr 24, 2024 · 10 comments
Labels
core Related to the core utilities of the package

Comments

@gdalle
Copy link
Member

gdalle commented Apr 24, 2024

At the moment I don't know if there is wide user interest, but feel free to contribute here.

See initial discussion with @timholy in JuliaDiff/AbstractDifferentiation.jl#134

@timholy
Copy link

timholy commented Apr 24, 2024

xref JuliaDiff/ForwardDiff.jl#61 (and quite possibly others, but I haven't searched)

@gdalle gdalle added the core Related to the core utilities of the package label Apr 24, 2024
@timholy
Copy link

timholy commented May 10, 2024

As far as naming goes, perhaps this could still be called Hessian. It's a little inconsistent that we give the gradient of a vector-valued function a special name, "Jacobian," but that's basically a historical artifact. The right way to think about it is that the jacobian is not a matrix: it's a 2-tensor that is both contravariant (for the "vector-valued" portion) and covariant (for the gradient portion). There's really no confusion between these, except for the fact that in Julia we handle this by adding an extra dimension and we don't use any encoding to specify which "dimension" is covariant and which is contravariant.

The second order derivative of a vector-valued function is, of course, a rank-3 tensor with one contravariant and two covariant "dimensions." Again there doesn't have to be any actual ambiguity about these things, if we encoded the meaning of the dimensions. Heck, one could even define Hessian-(contravariant)vector products just fine with no ambiguity, because there are only two candidates for that (the two covariant "dimensions") and they are symmetric.

@adrhill
Copy link
Collaborator

adrhill commented May 10, 2024

we give the gradient of a vector-valued function a special name, "Jacobian," but that's basically a historical artifact. The right way to think about it is that the jacobian is not a matrix: it's a 2-tensor that is both contravariant (for the "vector-valued" portion) and covariant (for the gradient portion).

To add to what you are saying, Jacobians are the matrix representation of two linear maps: pushforwards and (transposed) pullbacks, which are at the heart of forward- and reverse-mode AD (and DI).
For $f: \mathcal{M} \rightarrow \mathcal{N}$, pushforwards map tangent spaces on $\mathcal{M}$ onto the corresponding tangent space on $\mathcal{N}$.
Pullbacks map cotangent spaces on $\mathcal{N}$ onto corresponding cotangent spaces on $\mathcal{M}$.

I'm guessing the popularity of Jacobians comes from frameworks like JAX calling their pushforwards Jacobian-vector products and pullbacks vector-Jacobian products, even though they are implemented as matrix-free linear maps. To some degree, this is just a question of semantics.

@timholy
Copy link

timholy commented May 25, 2024

@gdalle, are you now satisfied that this isn't "weird"? I could resubmit JuliaDiff/AbstractDifferentiation.jl#134 here.

@gdalle
Copy link
Member Author

gdalle commented May 25, 2024

I understand the concept, but I've been thinking more about your specific application to optimization with a Lagrangian:

$$L: (x, \lambda) \in \mathbb{R}^n \times \mathbb{R}^m \longmapsto f(x) + \lambda^\top c(x)$$

I'm still unsure which option is more wasteful:

  • computing the joint scalar Hessian $\nabla_{(x,\lambda)}^2 L \in \mathbb{R}^{(n+m)^2}$ and keeping only the upper left corner $\nabla_x^2 L \in \mathbb{R}^{n^2}$
  • computing the vector Hessian of $\phi: x \longmapsto (f(x), c_1(x), \dots, c_m(x))$, let's call it $\overset{\text{vec}}{\nabla^2} \phi \in \mathbb{R}^{(1+m) \times n \times n}$, and then summing along the first dimension

Perhaps @amontoison would have a clue, we've been talking about this too.

@gdalle
Copy link
Member Author

gdalle commented May 25, 2024

Either way, if you want the vector Hessian, you can get it very easily as follows:

using DifferentiationInterface
import ForwardDiff

function value_jacobian_and_vector_hessian(f, backend, x)
    y = f(x)
    J(x) = jacobian(f, backend, x)
    J, H = value_and_jacobian(J, backend, x)
    return y, J, reshape(H, eachindex(y), eachindex(x), eachindex(x)) 
end

I'm reluctant to add it to DifferentiationInterface because:

  1. It's very easy to code
  2. It's very expensive to test and document. Our test suite is the true added value of the package, and adding one more operator would require at least 500 LOCs off the top of my head.
  3. No backend natively supports this, so the implementation above is essentially optimal...
  4. ... up to preparation. That is where we could possibly improve performance. Unfortunately, preparation for second-order operators is tricky, and I haven't figured it out fully yet.

Since you've already been bitten by the last item in #252, you can look at the other issues discussing it:

@timholy
Copy link

timholy commented May 25, 2024

computing the joint scalar Hessian ... and keeping only the upper left corner

This would be viable, but you probably wouldn't want to throw it away: Newton's method for constrained optimization consists of using a first order expansion of $\nabla_x L$ with respect to $x + \Delta x$, $\lambda + \Delta \lambda$. So the $x, \lambda$ terms are useful. The $\lambda, \lambda$ ones are not because they are zero (the Lagrangian is linear in $\lambda$).

That said, there are cases where you might still want the individual Hessians (your second proposal). For unconstrained optimization, trust-region methods are very popular, and they essentially ask "was the quadratic expansion fairly accurate?" You could imagine asking that about each individual constraint as well, in which case you might want to have them disentangled from $L$. But that's a bit more speculative.

Anyway, it's unfortunate that 2nd order is so hard. Presumably Julia is not the only ecosystem grappling with this?

@gdalle
Copy link
Member Author

gdalle commented Sep 24, 2024

As of today's main branch, DI allows you to keep some of the arguments constant in a differentiation operator. You can thus prepare a Lagrangian Hessian with some multiplicator $\lambda_0$, and then compute it with other values $\lambda \neq \lambda_0$, which I think was your main goal here.

If this is satisfactory for constrained optimization, I don't think I'm gonna implement vector-Hessians in the near future, so I'll close this for now.

using DifferentiationInterface, LinearAlgebra
using Enzyme: Enzyme

obj(x) = sum(abs2, x)
cons(x) = diff(x) .^ 3
lag(x, λ) = obj(x) + dot(λ, cons(x))

x, λ = float.(1:4), float.(1:3);
prep = prepare_hessian(lag, AutoEnzyme(), x, Constant(rand(size(λ)...)));
hessian(lag, prep, AutoEnzyme(), x, Constant(λ))

@gdalle gdalle closed this as completed Sep 24, 2024
@timholy
Copy link

timholy commented Sep 24, 2024

Sounds very exciting, looking forward to trying it!

@Vaibhavdixit02
Copy link
Contributor

Very exciting, thanks for the great work @gdalle!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Related to the core utilities of the package
Projects
None yet
Development

No branches or pull requests

4 participants