Implementation of BFGS optimizer #90

Merged · 48 commits · Dec 4, 2023

Commits (48)
9387f2c
Remove old transformer script (Lux and Flux dependencies).
benedict-96 Nov 27, 2023
a47bdef
Changed name to something more descriptive.
benedict-96 Nov 27, 2023
dd31393
Now only doing things once instead of multiple times (computation of …
benedict-96 Nov 28, 2023
877f32f
Now exporting BFGS-related structs.
benedict-96 Nov 28, 2023
e22bc4f
Cache and Optimizer for BFGS.
benedict-96 Nov 28, 2023
afa46c6
Added routines for vec, zero, get_backend, assign! and copy for the s…
benedict-96 Nov 28, 2023
e3c4dfd
Minor change of one comment and now allow different arrays to be opti…
benedict-96 Nov 28, 2023
fd5854e
Added routine for bfgs
benedict-96 Nov 28, 2023
2790fea
Added routines for vec (and its inverse), zero, get_backend, assign! …
benedict-96 Nov 28, 2023
80dbaa7
Added some comments.
benedict-96 Nov 28, 2023
af717fc
Put comments in front of functions that are probably not needed.
benedict-96 Nov 28, 2023
6fa0937
Started adding documentation for BFSG.
benedict-96 Nov 28, 2023
c80b1c6
Added new transformer script explicitly for bfgs, up to now not diffe…
benedict-96 Nov 28, 2023
3cc8976
Changed backend to GPU (in general) instead of CPU.
benedict-96 Nov 28, 2023
dc64ce1
Added a test for the vectorization.
benedict-96 Nov 28, 2023
3082631
changed some hyperparameters.
benedict-96 Nov 29, 2023
7426e9c
Adjusted script to new syntax. Not working at the moment.
benedict-96 Nov 29, 2023
3d718df
Moved DataLoader to the front of the file <- DataLoader doesn't depen…
benedict-96 Nov 29, 2023
6031a94
Changed equals sign to broadcasting operation (assign! and broadcast …
benedict-96 Nov 29, 2023
e9b6741
Changed equals sign to broadcast operation (assign! should serve the …
benedict-96 Nov 29, 2023
9ddaa52
Added another optimize_for_one_epoch! routine for the case if we have…
benedict-96 Nov 29, 2023
4a4b8a2
Added an accuracy routine that takes neural network as input.
benedict-96 Nov 29, 2023
5aa3bf8
main change is that we now add a small scalar to Y'*S to make sure it…
benedict-96 Nov 29, 2023
4d3f2a1
Added a constructor for optimizer if the input is a neural network.
benedict-96 Nov 29, 2023
83e2702
Added a short test script for the bfgs optimizer.
benedict-96 Nov 29, 2023
420cd34
Now nor loading LinearAlgebra anymore.
benedict-96 Nov 29, 2023
6ec91a8
Fixed typo dimesnion -> dimension.
benedict-96 Nov 29, 2023
5fbed13
Finished a rough outline of the documentation. Included a derivation …
benedict-96 Dec 3, 2023
f2aa41d
Fixed a typo that originated from not having defined for .
benedict-96 Dec 3, 2023
90f59cf
Decreased batch size.
benedict-96 Dec 3, 2023
8cd3bda
Added documentation for and .
benedict-96 Dec 3, 2023
7f07da4
Fixed a problem that had its origin in not being able to index GPU ar…
benedict-96 Dec 3, 2023
4bff014
Added a test for bfgs on the Stiefel manifold.
benedict-96 Dec 3, 2023
a33ef13
Added an architecture for the classification transformer as a .
benedict-96 Dec 3, 2023
f2099fa
Removed the temporary workaround and the cpu allocation.
benedict-96 Dec 3, 2023
9740b83
Added an additional way of computing H. These two probably are equiva…
benedict-96 Dec 3, 2023
24634f2
Added new file stiefel_projection.jl for the array of the same name.
benedict-96 Dec 4, 2023
0bd6c09
Removed file. The remaining One struct is not needed anymore.
benedict-96 Dec 4, 2023
c8d6522
Removed a comment.
benedict-96 Dec 4, 2023
f7c9ff0
Now using the newly implemented StiefelProjection struct. This gets r…
benedict-96 Dec 4, 2023
a5c2026
Removed arrays/auxiliary.jl
benedict-96 Dec 4, 2023
2823f5b
Added another constructor for the case when we just have two integers…
benedict-96 Dec 4, 2023
54b21bd
Added a missing type dependency.
benedict-96 Dec 4, 2023
9bde4ba
Changed the way the indexing is done for the Grassmann manifold. This…
benedict-96 Dec 4, 2023
2420ad9
Increased number of steps for convergence test. Was occasionally fail…
benedict-96 Dec 4, 2023
015d324
Also importing function init_optimizer_cache now because this is no l…
benedict-96 Dec 4, 2023
0513445
Added restriction that the type has to be a number. Needed for Julia …
benedict-96 Dec 4, 2023
41a46c4
Merge branch 'main' into bfgs
benedict-96 Dec 4, 2023
137 changes: 137 additions & 0 deletions docs/src/optimizers/bfgs_optimizer.md
@@ -0,0 +1,137 @@
# The BFGS Algorithm

The presentation shown here is largely taken from chapters 3 and 6 of reference [wright2006numerical](@cite) with a derivation based on an [online comment](https://math.stackexchange.com/questions/2091867/quasi-newton-methods-understanding-dfp-updating-formula). The Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm is a second-order optimizer that can also be used to train a neural network.

It is a version of a *quasi-Newton* method and is therefore especially suited for convex problems. As is the case with any other (quasi-)Newton method, the BFGS algorithm approximates the objective with a quadratic function in each optimization step:
```math
m_k(x) = f(x_k) + (\nabla_{x_k}f)^T(x - x_k) + \frac{1}{2}(x - x_k)^TB_k(x - x_k),
```
where ``B_k`` is referred to as the *approximate Hessian*. We further require ``B_k`` to be symmetric and positive definite. Differentiating the above expression and setting the derivative to zero gives us:
```math
\nabla_xm_k = \nabla_{x_k}f + B_k(x - x_k) = 0,
```
or written differently:
```math
x - x_k = -B_k^{-1}\nabla_{x_k}f.
```
From now on we will call this value ``p_k := x - x_k`` and refer to it as the *search direction*. The new iterate is then:
```math
x_{k+1} = x_k + \alpha_kp_k,
```
where ``\alpha_k`` is the *step length*. Techniques that describe how to pick an appropriate ``\alpha_k`` are called *line-search methods* and are discussed below. First we discuss what requirements we impose on ``B_k``. A first reasonable condition would be to require the gradient of ``m_k`` to be equal to that of ``f`` at the points ``x_{k-1}`` and ``x_k``:
```math
\begin{aligned}
\nabla_{x_k}m_k & = \nabla_{x_k}f + B_k(x_k - x_k) & \overset{!}{=} \nabla_{x_k}f \text{ and } \\
\nabla_{x_{k-1}}m_k & = \nabla_{x_k}f + B_k(x_{k-1} - x_k) & \overset{!}{=} \nabla_{x_{k-1}}f.
\end{aligned}
```
The first one of these conditions is of course automatically satisfied. The second one can be rewritten as:
```math
B_k(x_k - x_{k-1}) \overset{!}{=} \nabla_{x_k}f - \nabla_{x_{k-1}}f.
```

The following notations are often used:
```math
s_{k-1} := \alpha_{k-1}p_{k-1} := x_{k} - x_{k-1} \text{ and } y_{k-1} := \nabla_{x_k}f - \nabla_{x_{k-1}}f.
```

The condition mentioned above then becomes:
```math
B_ks_{k-1} \overset{!}{=} y_{k-1},
```
and we call it the *secant equation*. A second condition we impose on ``B_k`` is that it has to be positive-definite at the point ``s_{k-1}``:
```math
s_{k-1}^Ty_{k-1} > 0.
```
This is referred to as the *curvature condition*. If we impose the *Wolfe conditions*, the *curvature condition* holds automatically. The Wolfe conditions are stated with respect to the parameter ``\alpha_k``.

The *Wolfe conditions* are:
1. ``f(x_k+\alpha{}p_k)\leq{}f(x_k) + c_1\alpha(\nabla_{x_k}f)^Tp_k`` for ``c_1\in(0,1)``.
2. ``(\nabla_{(x_k + \alpha_kp_k)}f)^Tp_k \geq c_2(\nabla_{x_k}f)^Tp_k`` for ``c_2\in(c_1,1)``.

A possible choice for ``c_1`` and ``c_2`` is ``10^{-4}`` and ``0.9`` (see [wright2006numerical](@cite)). The two Wolfe conditions above are called the *sufficient decrease condition* and the *curvature condition* respectively. Note that the second Wolfe condition is stronger than the curvature condition ``s_{k-1}^Ty_{k-1} > 0`` stated above, assuming the first Wolfe condition holds:
```math
(\nabla_{x_k}f)^Tp_{k-1} - c_2(\nabla_{x_{k-1}}f)^Tp_{k-1} = y_{k-1}^Tp_{k-1} + (1 - c_2)(\nabla_{x_{k-1}}f)^Tp_{k-1} \geq 0,
```
and by the first Wolfe condition the second term satisfies ``(1 - c_2)(\nabla_{x_{k-1}}f)^Tp_{k-1}\geq\frac{1-c_2}{c_1\alpha_{k-1}}(f(x_k) - f(x_{k-1}))``. Since ``p_{k-1}`` is a descent direction, this second term is negative, and hence ``y_{k-1}^Tp_{k-1} > 0``, which (after multiplying by ``\alpha_{k-1} > 0``) is precisely the curvature condition.
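
To make the line-search step concrete, the following is a minimal backtracking line-search sketch in Julia that enforces the sufficient decrease (first) Wolfe condition. The function name, the halving rule and the toy objective are illustrative assumptions and not the line search used in this package.
```julia
# Minimal backtracking line search (illustrative sketch, not the package implementation).
# f: objective, gradf: gradient of f, x: current iterate, p: (descent) search direction.
function backtracking_linesearch(f, gradf, x, p; α = 1.0, c₁ = 1e-4, ρ = 0.5)
    fx = f(x)
    slope = gradf(x)' * p            # (∇f)ᵀp, negative for a descent direction
    # shrink α until the sufficient decrease (first Wolfe) condition holds
    while f(x + α * p) > fx + c₁ * α * slope
        α *= ρ
    end
    α
end

# usage sketch on a toy quadratic objective
f(x) = sum(abs2, x)
gradf(x) = 2x
x₀ = [1.0, 2.0]
p = -gradf(x₀)                       # steepest descent direction
α = backtracking_linesearch(f, gradf, x₀, p)
```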

In order to pick the ideal ``B_k`` we solve the following problem:
```math
\begin{aligned}
\min_B & ||B - B_{k-1}||_W \\
\text{s.t.} & B = B^T\text{ and }Bs_{k-1}=y_{k-1},
\end{aligned}
```
where the first condition is symmetry and the second one is the secant equation. For the norm ``||\cdot||_W`` we pick the weighted Frobenius norm:
```math
||A||_W := ||W^{1/2}AW^{1/2}||_F,
```
where ``||\cdot||_F`` is the usual Frobenius norm[^1] and the matrix ``W = \bar{B}_{k-1}^{-1}`` is the inverse of the *average Hessian* ``\bar{B}_{k-1}``:
```math
\bar{B}_{k-1} = \int_0^1 \nabla^2f(x_{k-1} + \tau\alpha_{k-1}p_{k-1})d\tau.
```
[^1]: The Frobenius norm is ``||A||_F^2 = \sum_{i,j}a_{ij}^2``.
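
As a small illustration, the weighted norm can be written in Julia as follows; this is a sketch assuming a symmetric positive-definite ``W``, and the helper name is made up for this example.
```julia
using LinearAlgebra

# ‖A‖_W = ‖W^{1/2} A W^{1/2}‖_F for symmetric positive-definite W (illustrative helper)
weighted_frobenius_norm(A, W) = norm(sqrt(Symmetric(W)) * A * sqrt(Symmetric(W)))
```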

In order to find the ideal ``B_k`` under the conditions described above, we introduce some notation:
- ``\tilde{B}_{k-1} := W^{1/2}B_{k-1}W^{1/2}``,
- ``\tilde{B} := W^{1/2}BW^{1/2}``,
- ``\tilde{y}_{k-1} := W^{1/2}y_{k-1}``,
- ``\tilde{s}_{k-1} := W^{-1/2}s_{k-1}``.

With this notation we can rewrite the problem of finding ``B_k`` as:
```math
\begin{aligned}
\min_{\tilde{B}} & ||\tilde{B} - \tilde{B}_{k-1}||_F \\
\text{s.t.} & \tilde{B} = \tilde{B}^T\text{ and }\tilde{B}\tilde{s}_{k-1}=\tilde{y}_{k-1}.
\end{aligned}
```

We further have ``Wy_{k-1} = s_{k-1}`` and therefore ``\tilde{y}_{k-1} = \tilde{s}_{k-1}``; this follows from the fundamental theorem of calculus, as shown below.
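
Indeed, applying the fundamental theorem of calculus along the segment from ``x_{k-1}`` to ``x_k = x_{k-1} + \alpha_{k-1}p_{k-1}`` gives
```math
y_{k-1} = \nabla_{x_k}f - \nabla_{x_{k-1}}f = \left(\int_0^1 \nabla^2f(x_{k-1} + \tau\alpha_{k-1}p_{k-1})d\tau\right)\alpha_{k-1}p_{k-1} = \bar{B}_{k-1}s_{k-1},
```
so ``Wy_{k-1} = \bar{B}_{k-1}^{-1}\bar{B}_{k-1}s_{k-1} = s_{k-1}``.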

Now we rewrite ``B`` and ``B_{k-1}`` in a new basis ``U = [u|u_\perp]``, where ``u := \tilde{s}_{k-1}/||\tilde{s}_{k-1}||`` and ``u_\perp`` spans the orthogonal complement[^2] of ``u``:

[^2]: So we must have ``u^Tu_\perp=0`` and further ``u_\perp^Tu_\perp=\mathbb{I}``.

```math
\begin{aligned}
U^T\tilde{B}_{k-1}U - U^T\tilde{B}U = \begin{bmatrix} u^T \\ u_\perp^T \end{bmatrix}(\tilde{B}_{k-1} - \tilde{B})\begin{bmatrix} u & u_\perp \end{bmatrix} = \\
\begin{bmatrix}
u^T\tilde{B}_{k-1}u - 1 & u^T\tilde{B}_{k-1}u_\perp \\
u_\perp^T\tilde{B}_{k-1}u & u_\perp^T(\tilde{B}_{k-1}-\tilde{B})u_\perp
\end{bmatrix}.
\end{aligned}
```
By a property of the Frobenius norm:
```math
||\tilde{B}_{k-1} - \tilde{B}||^2_F = (u^T\tilde{B}_{k-1}u - 1)^2 + ||u^T\tilde{B}_{k-1}u_\perp||_F^2 + ||u_\perp^T\tilde{B}_{k-1}u||_F^2 + ||u_\perp^T(\tilde{B}_{k-1} - \tilde{B})u_\perp||_F^2.
```

We see that ``\tilde{B}`` only appears in the last term, which should therefore be made zero. This then gives:
```math
\tilde{B} = U\begin{bmatrix} 1 & 0 \\ 0 & u^T_\perp\tilde{B}_{k-1}u_\perp \end{bmatrix}U^T = uu^T + (\mathbb{I}-uu^T)\tilde{B}_{k-1}(\mathbb{I}-uu^T).
```

If we now map back to the original coordinate system, the ideal solution for ``B_k`` is:
```math
B_k = (\mathbb{I} - \frac{1}{y_{k-1}^Ts_{k-1}}y_{k-1}s_{k-1}^T)B_{k-1}(\mathbb{I} - \frac{1}{y_{k-1}^Ts_{k-1}}s_{k-1}y_{k-1}^T) + \frac{1}{y_{k-1}^Ts_{k-1}}y_{k-1}y_{k-1}^T.
```

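One can check directly that this update satisfies the secant equation: the factor ``(\mathbb{I} - \frac{1}{y_{k-1}^Ts_{k-1}}s_{k-1}y_{k-1}^T)`` maps ``s_{k-1}`` to zero, so
```math
B_ks_{k-1} = \frac{1}{y_{k-1}^Ts_{k-1}}y_{k-1}(y_{k-1}^Ts_{k-1}) = y_{k-1}.
```
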
What we need in practice, however, is not ``B_k`` but its inverse ``H_k := B_k^{-1}``, because the search direction is computed by applying the inverse of the approximate Hessian to the gradient: ``p_k = -H_k\nabla_{x_k}f``. To get ``H_k`` from the expression for ``B_k`` above we can use the *Sherman-Morrison-Woodbury formula*[^3] to obtain:

[^3]: The *Sherman-Morrison-Woodbury formula* states ``(A + UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1}``.

```math
H_{k} = H_{k-1} - \frac{H_{k-1}y_{k-1}y_{k-1}^TH_{k-1}}{y_{k-1}^TH_{k-1}y_{k-1}} + \frac{s_{k-1}s_{k-1}^T}{y_{k-1}^Ts_{k-1}}.
```

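Below is a minimal sketch in plain Julia of how this update can be applied on a toy quadratic problem. The function names (`update_H`, `quasi_newton_step`), the fixed step length and the toy objective are illustrative assumptions; they do not reflect the implementation added in this pull request (which, per the commit messages above, also adds a small scalar to `Y'*S` and supports parameters on the Stiefel manifold).
```julia
using LinearAlgebra

# One update of the approximate inverse Hessian H (the formula derived above), with
# s = x_k - x_{k-1} and y = ∇f(x_k) - ∇f(x_{k-1}).  Illustrative sketch only.
function update_H(H, s, y)
    Hy = H * y
    H - (Hy * Hy') / dot(y, Hy) + (s * s') / dot(y, s)
end

# One quasi-Newton step with a fixed step length α (a line search would normally pick α).
function quasi_newton_step(x, H, gradf; α = 1.0)
    p = -H * gradf(x)              # search direction p_k = -H_k ∇f(x_k)
    x + α * p
end

# Toy quadratic problem: f(x) = ½ xᵀAx, so ∇f(x) = Ax and the exact Hessian is A.
A = [4.0 1.0; 1.0 3.0]
gradf(x) = A * x

x_prev = [1.0, 1.0]
H = Matrix{Float64}(I, 2, 2)       # start from the identity
x = quasi_newton_step(x_prev, H, gradf, α = 0.1)

s = x - x_prev
y = gradf(x) - gradf(x_prev)
H = update_H(H, s, y)

# The updated H satisfies the secant equation H y = s (up to floating-point error).
@assert isapprox(H * y, s; atol = 1e-12)
```
In an actual optimizer these steps are iterated, with ``\alpha`` chosen by a line search such as the one sketched above.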

TODO: Example where this works well!

## References
@book{wright2006numerical,
    title={Numerical Optimization},
    author={Nocedal, Jorge and Wright, Stephen J.},
    year={2006},
    publisher={Springer Science+Business Media},
    location={New York, NY}
}
114 changes: 0 additions & 114 deletions scripts/transformer.jl

This file was deleted.

24 changes: 11 additions & 13 deletions scripts/transformer_new.jl → scripts/transformer_bfgs.jl
@@ -15,7 +15,7 @@ image_dim = 28
 patch_length = 7
 transformer_dim = 49
 n_heads = 7
-n_layers = 16
+n_layers = 1
 number_of_patch = (image_dim÷patch_length)^2
 batch_size = 2048
 activation = softmax
@@ -41,7 +41,6 @@ backend, train_x, test_x, train_y, test_y =
 test_y
 end
 
-
 #encoder layer - final layer has to be added for evaluation purposes!
 model1 = Chain(Transformer(patch_length^2, n_heads, n_layers, Stiefel=false, add_connection=add_connection),
 Classification(patch_length^2, 10, activation))
@@ -50,7 +49,7 @@ model2 = Chain(Transformer(patch_length^2, n_heads, n_layers, Stiefel=true, add_
 Classification(patch_length^2, 10, activation))
 
 # err_freq is the frequency with which the error is computed (e.g. every 100 steps)
-function transformer_training(Ψᵉ::Chain; backend=CPU(), n_epochs=100, opt=AdamOptimizer())
+function transformer_training(Ψᵉ::Chain; backend=backend, n_epochs=100, opt=BFGSOptimizer(1f-3))
 # call data loader
 dl = DataLoader(train_x, train_y)
 dl_test = DataLoader(test_x, test_y)
@@ -85,20 +84,19 @@ function transformer_training(Ψᵉ::Chain; backend=CPU(), n_epochs=100, opt=Ada
 loss_array, ps, total_time, accuracy_score
 end
 
-
-loss_array2, ps2, total_time2, accuracy_score2 = transformer_training(model2, backend=backend, n_epochs=n_epochs)
-loss_array1, ps1, total_time1, accuracy_score1 = transformer_training(model1, backend=backend, n_epochs=n_epochs)
-loss_array3, ps3, total_time3, accuracy_score3 = transformer_training(model2, backend=backend, n_epochs=n_epochs, opt=GradientOptimizer(0.001))
-loss_array4, ps4, total_time4, accuracy_score4 = transformer_training(model2, backend=backend, n_epochs=n_epochs, opt=MomentumOptimizer(0.001, 0.5))
+loss_array1, ps1, total_time1, accuracy_score1 = transformer_training(model1, backend=backend, n_epochs=n_epochs, opt=BFGSOptimizer(1f-3))
+loss_array2, ps2, total_time2, accuracy_score2 = transformer_training(model2, backend=backend, n_epochs=n_epochs, opt=BFGSOptimizer(1f-3))
+loss_array3, ps3, total_time3, accuracy_score3 = transformer_training(model2, backend=backend, n_epochs=n_epochs, opt=GradientOptimizer(1f-3))
+loss_array4, ps4, total_time4, accuracy_score4 = transformer_training(model2, backend=backend, n_epochs=n_epochs, opt=AdamOptimizer())
 
 p1 = plot(loss_array1, color=1, label="Regular weights", ylimits=(0.,1.4), linewidth=2)
-plot!(p1, loss_array2, color=2, label="Weights on Stiefel Manifold", linewidth=2)
-png(p1, "Stiefel_Regular")
+plot!(p1, loss_array2, color=3, label="Weights on Stiefel Manifold", linewidth=2)
+png(p1, "BFGS_Stiefel_Regular")
 
-p2 = plot(loss_array2, color=2, label="Adam", ylimits=(0.,1.4), linewidth=2)
+p2 = plot(loss_array2, color=3, label="BFGS", ylimits=(0.,1.4), linewidth=2)
 plot!(p2, loss_array3, color=1, label="Gradient", linewidth=2)
-plot!(p2, loss_array4, color=3, label="Momentum", linewidth=2)
-png(p2, "Adam_Gradient_Momentum")
+plot!(p2, loss_array4, color=2, label="Adam", linewidth=2)
+png(p2, "BFGS_Gradient_Adam")
 
 text_string =
 "n_epochs: " * string(n_epochs) * "\n"