Efficient AD for the simulation of gradients #504

FelixBenning opened this issue May 2, 2023 · 6 comments

FelixBenning commented May 2, 2023

Introduction: How would you simulate gradients

If you want to simulate the gradient of a random function $Z$, it turns out that you simply need to take derivatives of the covariance function, since (w.l.o.g. $Z$ is centered)

$$\text{Cov}(\nabla Z(x), Z(y)) = \mathbb{E}[\nabla Z(x) Z(y)] = \nabla_x \mathbb{E}[Z(x)Z(y)] = \nabla_x C(x,y)$$

And similarly

$$\text{Cov}(\nabla Z(x), \nabla Z(y)) = \nabla_x\nabla_y C(x,y)$$

For stationary covariance functions $C(x,y) = C(x-y)$ this simplifies to

$$\begin{aligned} \nabla_x C(x,y) &= \nabla C(x-y)\\\ \nabla_x\nabla_y C(x,y) &= -\nabla^2 C(x-y) \end{aligned}$$

So if we consider the multivariate random function $T_1(x) = (Z(x), \nabla Z(x))$, then its covariance kernel is given by

$$\begin{aligned} C_{T_1}(x,y) &= \begin{pmatrix} C(x,y) & \nabla_y C(x,y)^T\\\ \nabla_x C(x,y) & \nabla_x \nabla_y C(x,y) \end{pmatrix}\\\ &\overset{\mathllap{\text{stationary}}}= \begin{pmatrix} C(x-y) & -\nabla C(x-y)^T\\\ \nabla C(x-y) & -\nabla^2C(x-y) \end{pmatrix}\\\ &\overset{\mathllap{\text{isotropic}}}= \begin{pmatrix} C(d) & -f'\bigl(\frac{\|d\|^2}{2}\bigr)d^T\\\ f'\bigl(\frac{\|d\|^2}{2}\bigr) d & -\Bigl[f''\bigl(\frac{\|d\|^2}2\bigr)dd^T + f'\bigl(\frac{\|d\|^2}2\bigr) \mathbb{I}\Bigr] \end{pmatrix} \end{aligned}$$

where we use $d=x-y$ and $C(d) = f\bigl(\frac{\|d\|^2}2\bigr)$ in the last equation.
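
To make this concrete, here is a minimal sketch for the squared exponential kernel $C(d) = \exp(-\|d\|^2/2)$, i.e. $f(s) = e^{-s}$, $f'(s) = -e^{-s}$, $f''(s) = e^{-s}$ (the helper name se_gradient_block is mine, purely for illustration):

using LinearAlgebra

function se_gradient_block(x::AbstractVector, y::AbstractVector)
	d = x - y
	s = dot(d, d) / 2
	f, f′, f′′ = exp(-s), -exp(-s), exp(-s) # f, f', f'' evaluated at ‖d‖²/2
	return [f      -f′*d'
	        f′*d   -(f′′*d*d' + f′*I(length(d)))]
end

se_gradient_block([1.0, 2.0], [0.0, 1.0]) # 3×3 cross-covariance of (Z(x), ∇Z(x)) and (Z(y), ∇Z(y))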

Performance Considerations

In principle you could just directly apply autodiff (AD) to any kernel $C$ to obtain $C_{T_1}$, but since the cost of (forward-mode) autodiff scales with the number of input arguments, this would be really wasteful when we really only need to differentiate a one-dimensional function in the isotropic case.

Unfortunately, the way that KernelFunctions implements length scales results in general kernel functions, so I am not completely sure how to tell the compiler that "these derivatives are much simpler than they look".

One possibility might be to add the abstract types IsotropicKernel and StationaryKernel and carry these types over when transformations do not violate them: scaling would violate neither, more general affine transformations would violate isotropy but not stationarity, and so on. This could probably be done with type parameters.
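
As a rough sketch of what I mean (all the trait names here are hypothetical, nothing like this exists in KernelFunctions), the structure information could be attached via a function that transformations propagate:

using KernelFunctions: KernelFunctions as KF

abstract type KernelStructure end
struct Isotropic <: KernelStructure end
struct Stationary <: KernelStructure end
struct Generic <: KernelStructure end

structure(::KF.Kernel) = Generic()                  # safe fallback
structure(::KF.SqExponentialKernel) = Isotropic()
structure(::KF.MaternKernel) = Isotropic()

# rescaling the inputs preserves isotropy, a general linear map only preserves stationarity
structure(k::KF.TransformedKernel) = structure(k.transform, structure(k.kernel))
structure(::KF.ScaleTransform, s::KernelStructure) = s
structure(::KF.LinearTransform, ::Isotropic) = Stationary()
structure(::KF.LinearTransform, s::KernelStructure) = s
structure(::KF.Transform, ::KernelStructure) = Generic() # unknown transform: assume nothing

The derivative machinery could then dispatch on structure(k) and pick the cheap one-dimensional formula whenever the kernel is known to be isotropic.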

But even once that is implemented, how do you tell autodiff what to differentiate? I have seen the file chainrules.jl in this repository, so I thought I would ask if someone already knows how to implement this.

Considerations for kernels with multiple outputs

Since you implemented kernels for multiple outputs as an extension of the input space, reusing the derivative $f'\bigl(\frac{\|d\|^2}{2}\bigr)$ also becomes more complicated.

Maybe this is all premature optimization, as the evaluation of the kernel is, complexity-wise, in the shadow of the Cholesky decomposition.

Extension: Simulate $n$-th order derivatives

In principle you could similarly simulate $T_n(x) = (Z(x), Z'(x), \dots, Z^{(n)}(x))$. But AD is already underdeveloped for Hessians, so I don't know how to get this to work. It might be possible to write custom logic for certain isotropic random functions, such as the squared exponential one.

What do you think? I handcrafted something for first order derivatives in a personal project, but for KernelFunctions.jl a more general approach is probably needed.

FelixBenning commented May 12, 2023

EDIT: I replaced this comment completely since I found a better way to do this (this might make some of the following comments superfluous).

I think this might be the way to code this. By encoding the partial derivatives in the points passed to the kernel, it is easy to mark the index and position at which to take the partial derivative, and this information is stored with the location. So if you store all previously evaluated locations, you also know at which locations a derivative was taken.

using KernelFunctions: Kernel, KernelFunctions as KF
using OneHotArrays: OneHotVector
using InteractiveUtils: subtypes # needed for the loop over kernel types below
import ForwardDiff as FD
import LinearAlgebra as LA

"""
Take the partial derivative of a function `fun` with input dimension `dim`.
If `partials=(i,j)`, then (∂ᵢ∂ⱼ fun) is returned.
"""
function partial(fun, dim, partials)
	# note: no default for `partials` here, the two-positional-argument call is taken
	# by the kernel method with keyword arguments below
	next = iterate(partials)
	if !isnothing(next)
		ii, state = next
		# differentiate along the ii-th coordinate and recurse on the remaining partials
		return partial(
			x -> FD.derivative(0) do dx
				fun(x .+ dx * OneHotVector(ii, dim))
			end,
			dim,
			Base.rest(partials, state),
		)
	end
	return fun
end

"""
Take the partial derivatives of a kernel-like function with two `dim`-dimensional
inputs (i.e. a `2dim`-dimensional input in total); `partials_x` is applied to the
first argument and `partials_y` to the second.
"""
function partial(k, dim; partials_x=(), partials_y=())
	f(x, y) = partial(t -> k(t, y), dim, partials_x)(x)
	return (x, y) -> partial(t -> f(x, t), dim, partials_y)(y)
end

struct Pt{Dim}
	pos::AbstractArray # the actual position
	partial
	# if `partial` is empty and the point is passed to a kernel k, evaluate k at `pos`;
	# if partial = (i,j), evaluate ∂ᵢ∂ⱼk (in this argument) at `pos` instead.
	# This allows existing kernels to be overloaded for Pt.
end

Pt(x; partial=()) = Pt{length(x)}(x, partial) # convenience constructor

for T in subtypes(Kernel)
	# this allows the passing of Pt to kernels (which makes them differentiable by default)
	(k::T)(x::Pt{Dim}, y::Pt{Dim}) where {Dim} = evaluate_(k, x, y)
	(k::T)(x::Pt{Dim}, y) where {Dim} = evaluate_(k, x, Pt(y))
	(k::T)(x, y::Pt{Dim}) where {Dim} = evaluate_(k, Pt(x), y)
end

"""
Unbox the partial-derivative instructions from the `Pt` arguments, apply them to `k`,
and evaluate the result at the positions stored in the `Pt`s.
"""
function evaluate_(k::T, x::Pt{Dim}, y::Pt{Dim}) where {Dim, T<:Kernel}
	return partial(
		k, Dim,
		partials_x=x.partial, partials_y=y.partial
	)(x.pos, y.pos)
end

k = KF.MaternKernel()

k([1],[1]) # Cov(Z(x), Z(y)) where x=[1], y=[1]
k([1], Pt([1], partial=(1,1))) # Cov(Z(x), ∂₁∂₁Z(y)) where x=[1], y=[1]
k(Pt([1], partial=(1,)), [2]) # Cov(∂₁Z(x), Z(y)) where x=[1], y=[2]
k(Pt([1,2], partial=(1,)), Pt([1,2], partial=(2,))) # Cov(∂₁Z(x), ∂₂Z(y)) where x=[1,2], y=[1,2]


k = KF.SEKernel()

kappa(d) = exp(-LA.dot(d, d) / 2) # manual reference: k(x, y) == kappa(x - y)

# ∂₁ with respect to x, no derivative with respect to y
g = partial(k, 1, partials_x=(1,), partials_y=())
g([1.0], [0.0]) # ≈ FD.derivative(t -> kappa(t), 1.0), i.e. -exp(-1/2)

@devmotion
Member

I don't follow completely - what are you after in the end? Derivatives of the kernel functions with respect to one or both arguments? IIRC many kernels in KernelFunctions already support AD - and importantly not only ForwardDiff but also reverse-mode backends such as ReverseDiff, Zygote, and (AFAIK to a lesser extent) Enzyme. Depending on the input- and output dimensions of the functions you try to differentiate, you might want to use forward-mode or reverse-mode (a rule of thumb when working with differential equations is e.g. to use ForwardDiff when working with < 100 equations and reverse-mode otherwise: https://docs.sciml.ai/SciMLSensitivity/stable/getting_started/#When-Should-You-Use-Forward-or-Reverse-Mode?).

Regardless of your exact intentions, I recommend you have a look at https://docs.julialang.org/en/v1/manual/performance-tips if you want to improve the performance of your code. For instance, probably you would want to avoid structs with abstract fields (https://docs.julialang.org/en/v1/manual/performance-tips/#Avoid-fields-with-abstract-type).
The loop for T in subtypes(Kernel) is also problematic since it will only cover subtypes of Kernel that are known when this line is run; any kernel that is defined later (e.g., in a package loaded by a user) is missed. Pt(x;partial=()) = Pt{length(x)}(x, partial) is problematic since the return type cannot be inferred, as it depends on dynamic information (the length of x). Generally, I am not sure if one should encode the desired partial derivatives in the inputs.
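
For illustration, a concretely typed variant could look roughly like this (just a sketch, the name Pt2 and the exact parameterization are illustrative, and the Dim parameter would still depend on runtime information):

# sketch of a concretely typed alternative (illustrative only, not a final design)
struct Pt2{Dim,A<:AbstractVector,P<:Tuple}
	pos::A     # concrete array type instead of an abstract field
	partial::P # partial-derivative indices as a (possibly empty) tuple
end

Pt2(x::AbstractVector; partial::Tuple=()) = Pt2{length(x),typeof(x),typeof(partial)}(x, partial)
# the Dim parameter still depends on length(x) at runtime; dropping it, or storing a
# StaticArrays.SVector, would be needed to make the constructor fully inferrable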

FelixBenning commented May 12, 2023

Let me try a different way to phrase this: let $Z(x)$ be a Gaussian process with zero mean. I can sample from this if I know the covariance kernel

$$k(x,y) = \text{Cov}(Z(x), Z(y)) = \mathbb{E}[Z(x)Z(y)]$$

So let's say I want to simulate $Z(x_1),\dots, Z(x_n)$: you would feed $x_1,\dots, x_n$ into an AbstractGP with kernel k to obtain a FiniteGP and then use rand on it. But I want to sample

$$Z(x_1), \nabla Z(x_1), ..., Z(x_n), \nabla Z(x_n)$$

how would that work? Well, first of all, this is not just sampling from $Z$; it is sampling from $\tilde{Z} = (Z, \nabla Z)$. How would you do that? You need the covariance function of $\tilde{Z}$ (which is a multivariate kernel, since $\tilde{Z}$ is vector-valued). What does that multivariate kernel look like? That is what I discuss in the introduction of this issue; for now let us just call this kernel $\tilde{k}$.

So how do you implement multivariate kernels? This package made the design decision to encode the output index in the input and, without loss of generality, not to have multivariate kernels: the position $x$ becomes $(x,i)$, where $i$ denotes the output index of the multivariate kernel. In our case, the first entry of the vector $\tilde{Z}$ is the actual GP $Z$, while the following entries are the individual partial derivatives. So instead of $(x,i)$ I encode Pt(x, partial=...), since a tuple of partials is much more interpretable than an integer index i. This is why the partial derivatives are encoded in the inputs - a ripple effect of the "no multivariate kernels" policy (which I don't disagree with).
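
For reference, a small sketch of that existing encoding (using KernelFunctions.MOInput; the Pt(x, partial=...) above plays the same role as the output index i here):

using KernelFunctions: KernelFunctions as KF

x = [rand(2) for _ in 1:3]
KF.MOInput(x, 2) # lazy AbstractVector of (xⱼ, output_index) tuples, of length 3 * 2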

Does that clear things up?

Now that I have motivated p = Pt(x, partial=(i,)), which can be interpreted as $\tilde{Z}(p) = \partial_i Z(x)$, we want to know the covariance kernel $\tilde{k}$. So let us for example look at

$$\text{Cov}( Z(x), \partial_i Z(y)) = \mathbb{E}[ Z(x) \partial_i Z(y)] = \partial_{y_i} \mathbb{E}[Z(x) Z(y)] = \partial_{y_i} k(x,y)$$

So here we see that the entries of $\tilde{k}$ just consist of different partial derivatives of $k$.
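
As a quick numerical sanity check of this identity (a sketch; for the squared exponential kernel $k(x,y) = \exp(-\|x-y\|^2/2)$ the closed form is $\partial_{y_i} k(x,y) = (x_i - y_i)\,k(x,y)$):

import ForwardDiff as FD
using KernelFunctions: KernelFunctions as KF

k = KF.SEKernel()
x, y, i = [1.0, 2.0], [0.5, 1.0], 2

FD.gradient(t -> k(x, t), y)[i] ≈ (x[i] - y[i]) * k(x, y) # true: this entry of k̃ is just ∂_{yᵢ}k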

@FelixBenning
Author

I don't particularly like for T in subtypes(Kernel) either, but if I only implement

(k::Kernel)(x::Pt, y::Pt)

then dispatch cannot tell whether to use this or

(k::SimpleKernel)(x,y)

so if I want all kernel variants to first go through the "turning it into $\tilde{k}$" process before getting called, I would need to implement

(k::T)(x::Pt, y::Pt) where {T<: Kernel}

and the above is not allowed, so I tried the next best thing (which I am not happy with either).
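
A toy example of the ambiguity, independent of KernelFunctions (all type names here are made up):

abstract type AbstractK end            # stand-in for Kernel
abstract type SimpleK <: AbstractK end # stand-in for SimpleKernel
struct ToyKernel <: SimpleK end
struct ToyPt end                       # stand-in for Pt

(k::SimpleK)(x, y) = "plain evaluation"
(k::AbstractK)(x::ToyPt, y::ToyPt) = "derivative evaluation"

ToyKernel()(ToyPt(), ToyPt()) # MethodError: ambiguous, neither method is more specific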

@FelixBenning
Author

Regarding the forward diff vs. backward diff part: backward (reverse-mode) diff would be more efficient since kernels have many input dimensions and only one output. But since we want to obtain $\tilde{k}$, which is only supposed to return the covariance of one partial derivative with another, calculating the entire gradient does not make sense, as it would just be thrown away.

This is the disadvantage of the "no multivariate kernels" policy: it prevents possible performance improvements from calculating $k((x,1), (y,1)), \dots, k((x,n), (y,m))$ in one go.
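
As a sketch of what I mean (the names are mine, and I use a ForwardDiff gradient as a stand-in for what a reverse-mode backend would compute): a single entry of $\tilde{k}$ only needs one directional forward-mode derivative, while a gradient call computes all entries and throws most of them away.

import ForwardDiff as FD
using KernelFunctions: KernelFunctions as KF
using OneHotArrays: OneHotVector

k = KF.SEKernel()
x, y, i = [1.0, 2.0], [0.5, 1.0], 2

single_entry = FD.derivative(t -> k(x + t * OneHotVector(i, length(x)), y), 0.0) # only ∂_{xᵢ}k
full_gradient = FD.gradient(t -> k(t, y), x)                                     # all of ∇ₓk
single_entry ≈ full_gradient[i] # true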

@FelixBenning
Author

Someone pointed out the CovarianceFunctions.jl package to me (https://github.com/SebastianAment/CovarianceFunctions.jl#gradient-kernels), which apparently implements something similar. There is also an issue about interoperability with KernelFunctions.jl (SebastianAment/CovarianceFunctions.jl#2), so I will try to understand this package and see whether it does everything I want.
