diff --git a/docs/src/optimizers/bfgs_optimizer.md b/docs/src/optimizers/bfgs_optimizer.md
new file mode 100644
index 000000000..b7bcd6e3a
--- /dev/null
+++ b/docs/src/optimizers/bfgs_optimizer.md
@@ -0,0 +1,137 @@
+# The BFGS Algorithm
+
+The presentation shown here is largely taken from chapters 3 and 6 of reference [wright2006numerical](@cite) with a derivation based on an [online comment](https://math.stackexchange.com/questions/2091867/quasi-newton-methods-understanding-dfp-updating-formula). The Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm is a second-order optimizer that can also be used to train a neural network.
+
+It is a version of a *quasi-Newton* method and is therefore especially suited for convex problems. As is the case with any other (quasi-)Newton method, the BFGS algorithm approximates the objective with a quadratic function in each optimization step:
+```math
+m_k(x) = f(x_k) + (\nabla_{x_k}f)^T(x - x_k) + \frac{1}{2}(x - x_k)^TB_k(x - x_k),
+```
+where ``B_k`` is referred to as the *approximate Hessian*. We further require ``B_k`` to be symmetric and positive definite. Differentiating the above expression and setting the derivative to zero gives us:
+```math
+\nabla_xm_k = \nabla_{x_k}f + B_k(x - x_k) = 0,
+```
+or written differently:
+```math
+x - x_k = -B_k^{-1}\nabla_{x_k}f.
+```
+From now on we will call this value ``p_k := x - x_k`` and refer to it as the *search direction*. The new iterate then is:
+```math
+x_{k+1} = x_k + \alpha_kp_k,
+```
+where ``\alpha_k`` is the *step length*. Techniques that describe how to pick an appropriate ``\alpha_k`` are called *line-search methods* and are discussed below. First we discuss what requirements we impose on ``B_k``. A first reasonable condition would be to require the gradient of ``m_k`` to be equal to that of ``f`` at the points ``x_{k-1}`` and ``x_k``:
+```math
+\begin{aligned}
+\nabla_{x_k}m_k & = \nabla_{x_k}f + B_k(x_k - x_k) & \overset{!}{=} \nabla_{x_k}f \text{ and } \\
+\nabla_{x_{k-1}}m_k & = \nabla_{x_k}f + B_k(x_{k-1} - x_k) & \overset{!}{=} \nabla_{x_{k-1}}f.
+\end{aligned}
+```
+The first one of these conditions is of course automatically satisfied. The second one can be rewritten as:
+```math
+B_k(x_k - x_{k-1}) \overset{!}{=} \nabla_{x_k}f - \nabla_{x_{k-1}}f.
+```
+
+The following notations are often used:
+```math
+s_{k-1} := \alpha_{k-1}p_{k-1} := x_{k} - x_{k-1} \text{ and } y_{k-1} := \nabla_{x_k}f - \nabla_{x_{k-1}}f.
+```
+
+The condition above then becomes:
+```math
+B_ks_{k-1} \overset{!}{=} y_{k-1},
+```
+and we call it the *secant equation*. A second condition we impose on ``B_k`` is that it has to be positive-definite along the direction ``s_{k-1}``:
+```math
+s_{k-1}^Ty_{k-1} > 0.
+```
+This is referred to as the *curvature condition*. If we impose the *Wolfe conditions*, the *curvature condition* holds automatically. The Wolfe conditions are stated with respect to the step length ``\alpha``.
+
+The *Wolfe conditions* are:
+1. ``f(x_k+\alpha{}p_k)\leq{}f(x_k) + c_1\alpha(\nabla_{x_k}f)^Tp_k`` for ``c_1\in(0,1)``.
+2. ``(\nabla_{(x_k + \alpha{}p_k)}f)^Tp_k \geq c_2(\nabla_{x_k}f)^Tp_k`` for ``c_2\in(c_1,1)``.
+
+A possible choice for ``c_1`` and ``c_2`` is ``c_1 = 10^{-4}`` and ``c_2 = 0.9`` (see [wright2006numerical](@cite)). The two Wolfe conditions above are called the *sufficient decrease condition* and the *curvature condition* respectively.
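+
+To make this more concrete, the following sketch shows a simple backtracking line search together with a check of the two Wolfe conditions. It is only meant as an illustration and is not part of `GeometricMachineLearning`; the functions `satisfies_wolfe` and `backtracking_line_search` as well as the quadratic test objective are made up for this example. (Also note that the `BFGSOptimizer` discussed below uses a fixed step length ``\eta`` instead of a line search.)
+```julia
+using LinearAlgebra: dot
+
+# check the two Wolfe conditions for a given step length α
+function satisfies_wolfe(f, ∇f, x, p, α; c₁ = 1e-4, c₂ = 0.9)
+    sufficient_decrease = f(x + α * p) ≤ f(x) + c₁ * α * dot(∇f(x), p)
+    curvature = dot(∇f(x + α * p), p) ≥ c₂ * dot(∇f(x), p)
+    sufficient_decrease && curvature
+end
+
+# backtracking: shrink α until the sufficient decrease condition holds
+function backtracking_line_search(f, ∇f, x, p; α = 1.0, ρ = 0.5, c₁ = 1e-4)
+    while f(x + α * p) > f(x) + c₁ * α * dot(∇f(x), p)
+        α *= ρ
+    end
+    α
+end
+
+# convex quadratic with minimum at [1, 2] and the steepest descent direction
+f(x) = sum(abs2, x - [1., 2.])
+∇f(x) = 2 * (x - [1., 2.])
+x = [0., 0.]
+p = -∇f(x)
+α = backtracking_line_search(f, ∇f, x, p)
+satisfies_wolfe(f, ∇f, x, p, α)
+```
+For this particular objective the backtracking search returns ``\alpha = 0.5``, and at this step length both Wolfe conditions are satisfied; a plain backtracking search by itself only guarantees the first (sufficient decrease) condition.
+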
+Note that the second Wolfe condition (also called curvature condition) is stronger than the curvature condition stated above: applying it to the step from ``x_{k-1}`` to ``x_k`` gives
+```math
+(\nabla_{x_k}f)^Tp_{k-1} - c_2(\nabla_{x_{k-1}}f)^Tp_{k-1} = y_{k-1}^Tp_{k-1} + (1 - c_2)(\nabla_{x_{k-1}}f)^Tp_{k-1} \geq 0,
+```
+and the second term in this expression is negative, because ``p_{k-1}`` is a descent direction: ``(\nabla_{x_{k-1}}f)^Tp_{k-1} = -(\nabla_{x_{k-1}}f)^TB_{k-1}^{-1}\nabla_{x_{k-1}}f < 0`` since ``B_{k-1}`` is positive definite. Hence ``y_{k-1}^Tp_{k-1} > 0``, and multiplying with ``\alpha_{k-1} > 0`` yields the curvature condition ``s_{k-1}^Ty_{k-1} > 0``.
+
+In order to pick the ideal ``B_k`` we solve the following problem:
+```math
+\begin{aligned}
+\min_B & ||B - B_{k-1}||_W \\
+\text{s.t.} & B = B^T\text{ and }Bs_{k-1}=y_{k-1},
+\end{aligned}
+```
+where the first condition is symmetry and the second one is the secant equation. For the norm ``||\cdot||_W`` we pick the weighted Frobenius norm:
+```math
+||A||_W := ||W^{1/2}AW^{1/2}||_F,
+```
+where ``||\cdot||_F`` is the usual Frobenius norm[^1] and the matrix ``W = \bar{B}_{k-1}^{-1}`` is the inverse of the *average Hessian*:
+```math
+\bar{B}_{k-1} = \int_0^1 \nabla^2f(x_{k-1} + \tau\alpha_{k-1}p_{k-1})d\tau.
+```
+[^1]: The Frobenius norm is ``||A||_F^2 = \sum_{i,j}a_{ij}^2``.
+
+In order to find the ideal ``B_k`` under the conditions described above, we introduce some notation:
+- ``\tilde{B}_{k-1} := W^{1/2}B_{k-1}W^{1/2}``,
+- ``\tilde{B} := W^{1/2}BW^{1/2}``,
+- ``\tilde{y}_{k-1} := W^{1/2}y_{k-1}``,
+- ``\tilde{s}_{k-1} := W^{-1/2}s_{k-1}``.
+
+With this notation we can rewrite the problem of finding ``B_k`` as:
+```math
+\begin{aligned}
+\min_{\tilde{B}} & ||\tilde{B} - \tilde{B}_{k-1}||_F \\
+\text{s.t.} & \tilde{B} = \tilde{B}^T\text{ and }\tilde{B}\tilde{s}_{k-1}=\tilde{y}_{k-1}.
+\end{aligned}
+```
+
+We further have ``Wy_{k-1} = s_{k-1}``; this is a consequence of the definition of the average Hessian, since by the fundamental theorem of calculus ``\bar{B}_{k-1}s_{k-1} = y_{k-1}``. Therefore ``\tilde{y}_{k-1} = \tilde{s}_{k-1}``.
+
+Now we rewrite ``B`` and ``B_{k-1}`` in a new basis ``U = [u|u_\perp]``, where ``u := \tilde{s}_{k-1}/||\tilde{s}_{k-1}||`` and ``u_\perp`` is an orthogonal complement[^2] of ``u``:
+
+[^2]: So we must have ``u^Tu_\perp=0`` and further ``u_\perp^Tu_\perp=\mathbb{I}``.
+
+```math
+\begin{aligned}
+U^T\tilde{B}_{k-1}U - U^T\tilde{B}U & = \begin{bmatrix} u^T \\ u_\perp^T \end{bmatrix}(\tilde{B}_{k-1} - \tilde{B})\begin{bmatrix} u & u_\perp \end{bmatrix} \\
+& = \begin{bmatrix}
+    u^T\tilde{B}_{k-1}u - 1 & u^T\tilde{B}_{k-1}u_\perp \\
+    u_\perp^T\tilde{B}_{k-1}u & u_\perp^T(\tilde{B}_{k-1}-\tilde{B})u_\perp
+\end{bmatrix},
+\end{aligned}
+```
+where we have used the constraint ``\tilde{B}\tilde{s}_{k-1} = \tilde{y}_{k-1} = \tilde{s}_{k-1}``, i.e. ``\tilde{B}u = u``. By a property of the Frobenius norm:
+```math
+||\tilde{B}_{k-1} - \tilde{B}||^2_F = (u^T\tilde{B}_{k-1}u - 1)^2 + ||u^T\tilde{B}_{k-1}u_\perp||_F^2 + ||u_\perp^T\tilde{B}_{k-1}u||_F^2 + ||u_\perp^T(\tilde{B}_{k-1} - \tilde{B})u_\perp||_F^2.
+```
+
+We see that ``\tilde{B}`` only appears in the last term, which should therefore be made zero. This then gives:
+```math
+\tilde{B} = U\begin{bmatrix} 1 & 0 \\ 0 & u^T_\perp\tilde{B}_{k-1}u_\perp \end{bmatrix}U^T = uu^T + (\mathbb{I}-uu^T)\tilde{B}_{k-1}(\mathbb{I}-uu^T).
+```
+
+If we now map back to the original coordinate system, the ideal solution for ``B_k`` is:
+```math
+B_k = (\mathbb{I} - \frac{1}{y_{k-1}^Ts_{k-1}}y_{k-1}s_{k-1}^T)B_{k-1}(\mathbb{I} - \frac{1}{y_{k-1}^Ts_{k-1}}s_{k-1}y_{k-1}^T) + \frac{1}{y_{k-1}^Ts_{k-1}}y_{k-1}y_{k-1}^T.
+```
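+
+As a quick sanity check, the following sketch verifies numerically that the update above produces a symmetric matrix that satisfies the secant equation ``B_ks_{k-1} = y_{k-1}``. It is not part of `GeometricMachineLearning`, just an illustration with randomly generated data:
+```julia
+using LinearAlgebra
+
+n = 5
+B_prev = Matrix(Symmetric(rand(n, n))) + n * I   # symmetric positive definite B_{k-1}
+s = rand(n)                                      # step s_{k-1}
+y = B_prev * s + 0.1 * rand(n)                   # gradient difference y_{k-1}; here s'y > 0 holds
+
+ρ = 1 / (y' * s)
+B = (I - ρ * y * s') * B_prev * (I - ρ * s * y') + ρ * y * y'
+
+B * s ≈ y    # secant equation
+B ≈ B'       # symmetry
+```
+Note that the update only requires ``s_{k-1}``, ``y_{k-1}`` and the previous approximation ``B_{k-1}``; the average Hessian ``\bar{B}_{k-1}`` never has to be computed explicitly.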
+
+What we need in practice however is not ``B_k``, but its inverse ``H_k``, since the search direction is computed as ``p_k = -H_k\nabla_{x_k}f``. To get ``H_k`` based on the expression for ``B_k`` above we can use the *Sherman-Morrison-Woodbury formula*[^3] to obtain:
+
+[^3]: The *Sherman-Morrison-Woodbury formula* states ``(A + UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1}``.
+
+```math
+H_{k} = H_{k-1} - \frac{H_{k-1}y_{k-1}y_{k-1}^TH_{k-1}}{y_{k-1}^TH_{k-1}y_{k-1}} + \frac{s_{k-1}s_{k-1}^T}{y_{k-1}^Ts_{k-1}}.
+```
+
+
+TODO: Example where this works well!
+
+## References
+@book{wright2006numerical,
+  title={Numerical Optimization},
+  author={Stephen J. Wright and Jorge Nocedal},
+  year={2006},
+  publisher={Springer Science+Business Media},
+  location={New York, NY}
+}
\ No newline at end of file
diff --git a/scripts/transformer.jl b/scripts/transformer.jl
deleted file mode 100644
index 0f43b50b0..000000000
--- a/scripts/transformer.jl
+++ /dev/null
@@ -1,114 +0,0 @@
-"""
-TODO: Add a better predictor at the end! It should set the biggest value of the softmax to 1 and the rest to zero!
-"""
-
-using GeometricMachineLearning, LinearAlgebra, ProgressMeter, Plots
-import Lux, Zygote, Random, MLDatasets, Flux, Lux.gpu
-
-#MNIST images are 28×28, so a sequence_length of 16 = 4² means the image patches are of size 7² = 49
-image_dim = 28
-patch_length = 7
-n_heads = 7
-n_layers = 5
-patch_number = (image_dim÷patch_length)^2
-
-train_x, train_y = MLDatasets.MNIST(split=:train)[:]
-test_x, test_y = MLDatasets.MNIST(split=:test)[:]
-
-#preprocessing steps
-train_x = Tuple(map(i -> sc_embed(split_and_flatten(train_x[:,:,i], patch_length)) #=|> gpu=#, 1:size(train_x,3)))
-test_x = Tuple(map(i -> sc_embed(split_and_flatten(test_x[:,:,i], patch_length)) #=|> gpu=#, 1:size(test_x,3)))
-
-#implement this encoding yourself!
-train_y = Flux.onehotbatch(train_y, 0:9) #|> gpu
-test_y = Flux.onehotbatch(test_y, 0:9) #|> gpu
-
-
-#encoder layer - final layer has to be added for evaluation purposes!
-Ψᵉ₁ = Lux.Chain(
-    #Embedding(patch_length^2, patch_number),
-    Transformer(patch_length^2, n_heads, n_layers, Stiefel=false),
-    Lux.Dense(patch_length^2, 10, Lux.σ, use_bias=false)
-)
-
-Ψᵉ₂ = Lux.Chain(
-    #Embedding(patch_length^2, patch_number),
-    Transformer(patch_length^2, n_heads, n_layers, Stiefel=true),
-    Lux.Dense(patch_length^2, 10, Lux.σ, use_bias=false)
-)
-
-#err_freq is the frequency with which the error is computed (e.g. every 100 steps)
-function transformer_training(Ψᵉ::Lux.Chain, batch_size=64, training_steps=10000, err_freq=100, o=AdamOptimizer())
-    ps, st = Lux.setup(Random.default_rng(), Ψᵉ) #.|> gpu
-
-    #loss_sing
-    #note that the loss here is the first column of the output of the entire model; similar to what was done in the ViT paper.
-    function loss_sing(ps, x, y)
-        norm(Lux.apply(Ψᵉ, x, ps, st)[1][:,1] - y)
-    end
-    function loss_sing(ps, train_x, train_y, index)
-        loss_sing(ps, train_x[index], train_y[:, index])
-    end
-    function full_loss(ps, train_x, train_y)
-        num = length(train_x)
-        mapreduce(index -> loss_sing(ps, train_x, train_y, index), +, 1:num)
-    end
-
-    num = length(train_x)
-
-    cache = init_optimizer_cache(o, ps)
-
-    loss_array = zeros(training_steps÷err_freq + 1)
-    loss_array[1] = full_loss(ps, train_x, train_y)/num
-    println("initial loss: ", loss_array[1])
-
-    @showprogress "Training network ..." 
for i in 1:training_steps - index₁ = Int(ceil(rand()*num)) - x = train_x[index₁] #|> gpu - y = train_y[:, index₁] #|> gpu - l, pb = Zygote.pullback(ps -> loss_sing(ps, x, y), ps) - dp = pb(one(l))[1] - #dp = Zyogte.gradient(ps -> loss_sing(ps, x, y), ps)[1] - - indices = Int.(ceil.(rand(batch_size -1)*num)) - for index in indices - x = train_x[index] #|> gpu - y = train_y[:, index] #|> gpu - l, pb = Zygote.pullback(ps -> loss_sing(ps, x, y), ps) - dp = _add(dp, pb(one(l))[1]) - end - optimization_step!(o, Ψᵉ, ps, cache, dp) - if i%err_freq == 0 - loss_array[1+i÷err_freq] = full_loss(ps, train_x, train_y)/num - end - end - println("final loss: ", loss_array[end]) - println("final test loss: ", full_loss(ps, test_x, test_y)/length(test_x),"\n") - - loss_array -end - -batch_size = 16 -training_steps = 10000 -err_freq = 100 -o = AdamOptimizer() - -loss_array₁ = transformer_training(Ψᵉ₁, batch_size, training_steps, err_freq, o) -loss_array₂ = transformer_training(Ψᵉ₂, batch_size, training_steps, err_freq, o) - -steps = vcat(1:err_freq:training_steps, training_steps+1) .- 1 - -p₁ = plot(steps, loss_array₁, label="Regular weights", linewidth=2, size=(800,500)) -plot!(p₁, steps, loss_array₂, label="Weights on the Stiefel manifold", linewidth=2) -ylims!(0.,2.2) -png(p₁, "transformer_stiefel_reg_comp") - - -loss_array₃ = transformer_training(Ψᵉ₂, batch_size, training_steps, err_freq, StandardOptimizer(0.001)) -loss_array₄ = transformer_training(Ψᵉ₂, batch_size, training_steps, err_freq, MomentumOptimizer(0.001, 0.5)) - -p₂ = plot(steps, loss_array₃, label="Standard Optimizer",linewidth=2, size=(800,500)) -plot!(p₂, steps, loss_array₂, label="Adam Optimizer",linewidth=2) -plot!(p₂, steps, loss_array₄, label="Momentum Optimizer",linewidth=2) -ylims!(0.,2.2) -png(p₂, "transformer_stiefel_ad_mom_stan") \ No newline at end of file diff --git a/scripts/transformer_new.jl b/scripts/transformer_bfgs.jl similarity index 86% rename from scripts/transformer_new.jl rename to scripts/transformer_bfgs.jl index daff099ee..30c5d9615 100644 --- a/scripts/transformer_new.jl +++ b/scripts/transformer_bfgs.jl @@ -15,7 +15,7 @@ image_dim = 28 patch_length = 7 transformer_dim = 49 n_heads = 7 -n_layers = 16 +n_layers = 1 number_of_patch = (image_dim÷patch_length)^2 batch_size = 2048 activation = softmax @@ -41,7 +41,6 @@ backend, train_x, test_x, train_y, test_y = test_y end - #encoder layer - final layer has to be added for evaluation purposes! model1 = Chain(Transformer(patch_length^2, n_heads, n_layers, Stiefel=false, add_connection=add_connection), Classification(patch_length^2, 10, activation)) @@ -50,7 +49,7 @@ model2 = Chain(Transformer(patch_length^2, n_heads, n_layers, Stiefel=true, add_ Classification(patch_length^2, 10, activation)) # err_freq is the frequency with which the error is computed (e.g. 
every 100 steps) -function transformer_training(Ψᵉ::Chain; backend=CPU(), n_epochs=100, opt=AdamOptimizer()) +function transformer_training(Ψᵉ::Chain; backend=backend, n_epochs=100, opt=BFGSOptimizer(1f-3)) # call data loader dl = DataLoader(train_x, train_y) dl_test = DataLoader(test_x, test_y) @@ -85,20 +84,19 @@ function transformer_training(Ψᵉ::Chain; backend=CPU(), n_epochs=100, opt=Ada loss_array, ps, total_time, accuracy_score end - -loss_array2, ps2, total_time2, accuracy_score2 = transformer_training(model2, backend=backend, n_epochs=n_epochs) -loss_array1, ps1, total_time1, accuracy_score1 = transformer_training(model1, backend=backend, n_epochs=n_epochs) -loss_array3, ps3, total_time3, accuracy_score3 = transformer_training(model2, backend=backend, n_epochs=n_epochs, opt=GradientOptimizer(0.001)) -loss_array4, ps4, total_time4, accuracy_score4 = transformer_training(model2, backend=backend, n_epochs=n_epochs, opt=MomentumOptimizer(0.001, 0.5)) +loss_array1, ps1, total_time1, accuracy_score1 = transformer_training(model1, backend=backend, n_epochs=n_epochs, opt=BFGSOptimizer(1f-3)) +loss_array2, ps2, total_time2, accuracy_score2 = transformer_training(model2, backend=backend, n_epochs=n_epochs, opt=BFGSOptimizer(1f-3)) +loss_array3, ps3, total_time3, accuracy_score3 = transformer_training(model2, backend=backend, n_epochs=n_epochs, opt=GradientOptimizer(1f-3)) +loss_array4, ps4, total_time4, accuracy_score4 = transformer_training(model2, backend=backend, n_epochs=n_epochs, opt=AdamOptimizer()) p1 = plot(loss_array1, color=1, label="Regular weights", ylimits=(0.,1.4), linewidth=2) -plot!(p1, loss_array2, color=2, label="Weights on Stiefel Manifold", linewidth=2) -png(p1, "Stiefel_Regular") +plot!(p1, loss_array2, color=3, label="Weights on Stiefel Manifold", linewidth=2) +png(p1, "BFGS_Stiefel_Regular") -p2 = plot(loss_array2, color=2, label="Adam", ylimits=(0.,1.4), linewidth=2) +p2 = plot(loss_array2, color=3, label="BFGS", ylimits=(0.,1.4), linewidth=2) plot!(p2, loss_array3, color=1, label="Gradient", linewidth=2) -plot!(p2, loss_array4, color=3, label="Momentum", linewidth=2) -png(p2, "Adam_Gradient_Momentum") +plot!(p2, loss_array4, color=2, label="Adam", linewidth=2) +png(p2, "BFGS_Gradient_Adam") text_string = "n_epochs: " * string(n_epochs) * "\n" diff --git a/scripts/transformer_mnist.jl b/scripts/transformer_mnist.jl new file mode 100644 index 000000000..6adc3e661 --- /dev/null +++ b/scripts/transformer_mnist.jl @@ -0,0 +1,105 @@ +""" +TODO: Add a better predictor at the end! It should set the biggest value of the softmax to 1 and the rest to zero! +""" + +using GeometricMachineLearning, ProgressMeter, Plots, CUDA +import Zygote, MLDatasets + +# MNIST images are 28×28, so a sequence_length of 16 = 4² means the image patches are of size 7² = 49 +image_dim = 28 +patch_length = 7 +transformer_dim = 49 +n_heads = 7 +n_layers = 16 +number_of_patch = (image_dim÷patch_length)^2 +batch_size = 2048 +activation = softmax +n_epochs = 1000 +add_connection = false + +train_x, train_y = MLDatasets.MNIST(split=:train)[:] +test_x, test_y = MLDatasets.MNIST(split=:test)[:] + +# use CUDA backend if available. 
else use CPU()
+backend, train_x, test_x, train_y, test_y = 
+    try 
+        CUDABackend(), 
+        train_x |> cu, 
+        test_x |> cu, 
+        train_y |> cu, 
+        test_y |> cu 
+    catch
+        CPU(), 
+        train_x, 
+        test_x, 
+        train_y, 
+        test_y 
+end
+
+dl = DataLoader(train_x, train_y)
+dl_test = DataLoader(test_x, test_y)
+
+# the difference between the first and the second model is that we put the weights on the Stiefel manifold in the second case
+model1 = ClassificationTransformer(dl, n_heads=n_heads, n_layers=n_layers, Stiefel=false, add_connection=add_connection)
+model2 = ClassificationTransformer(dl, n_heads=n_heads, n_layers=n_layers, Stiefel=true, add_connection=add_connection)
+
+batch = Batch(batch_size)
+
+# err_freq is the frequency with which the error is computed (e.g. every 100 steps)
+function transformer_training(Ψᵉ::GeometricMachineLearning.Architecture; n_epochs=100, opt=AdamOptimizer())
+    nn = NeuralNetwork(Ψᵉ, backend, eltype(dl))
+    optimizer_instance = Optimizer(opt, nn)
+
+    println("initial test accuracy: ", GeometricMachineLearning.accuracy(nn, dl_test), "\n")
+
+    progress_object = Progress(n_epochs; enabled=true)
+
+    # use the `time` function to get the system time.
+    init_time = time()
+    total_time = time() - init_time
+
+    loss_array = zeros(eltype(train_x), n_epochs)
+    for i in 1:n_epochs
+        # there is some functionality in a recent PR that streamlines some of this -> make sure to include this!
+        loss_val = optimize_for_one_epoch!(optimizer_instance, nn, dl, batch)
+
+        ProgressMeter.next!(progress_object; showvalues = [(:TrainingLoss, loss_val)])
+        loss_array[i] = loss_val
+
+        # update runtime
+        total_time = time() - init_time
+    end
+
+    accuracy_score = GeometricMachineLearning.accuracy(nn, dl_test)
+    println("final test accuracy: ", accuracy_score, "\n")
+
+    loss_array, nn, total_time, accuracy_score
+end
+
+
+loss_array2, nn2, total_time2, accuracy_score2 = transformer_training(model2, n_epochs=n_epochs)
+loss_array1, nn1, total_time1, accuracy_score1 = transformer_training(model1, n_epochs=n_epochs)
+loss_array3, nn3, total_time3, accuracy_score3 = transformer_training(model2, n_epochs=n_epochs, opt=GradientOptimizer(0.001))
+loss_array4, nn4, total_time4, accuracy_score4 = transformer_training(model2, n_epochs=n_epochs, opt=MomentumOptimizer(0.001, 0.5))
+
+p1 = plot(loss_array1, color=1, label="Regular weights", ylimits=(0.,1.4), linewidth=2)
+plot!(p1, loss_array2, color=2, label="Weights on Stiefel Manifold", linewidth=2)
+png(p1, "Stiefel_Regular")
+
+p2 = plot(loss_array2, color=2, label="Adam", ylimits=(0.,1.4), linewidth=2)
+plot!(p2, loss_array3, color=1, label="Gradient", linewidth=2)
+plot!(p2, loss_array4, color=3, label="Momentum", linewidth=2)
+png(p2, "Adam_Gradient_Momentum")
+
+text_string = 
+    "n_epochs: " * string(n_epochs) * "\n" *
+    "Regular weights: time: " * string(total_time1) * " classification accuracy: " * string(accuracy_score1) * "\n" *
+    "Stiefel weights: time: " * string(total_time2) * " classification accuracy: " * string(accuracy_score2) * "\n" *
+    "GradientOptimizer: time: " * string(total_time3) * " classification accuracy: " * string(accuracy_score3) * "\n" *
+    "MomentumOptimizer: time: " * string(total_time4) * " classification accuracy: " * string(accuracy_score4) * "\n"
+
+display(text_string)
+
+open("measure_times"*string(backend), "w") do file
+    write(file, text_string)
+end
diff --git a/src/GeometricMachineLearning.jl b/src/GeometricMachineLearning.jl
index aa0262bb0..cc6d70674 100644
--- a/src/GeometricMachineLearning.jl
+++ b/src/GeometricMachineLearning.jl
@@ -45,6 +45,9 @@ module GeometricMachineLearning
 
     export σ, sigmoid, softmax
 
+    # the functionality in the script doesn't require anything else defined in GML, but some of the other scripts in that folder do.
+    include("data_loader/data_loader.jl")
+
     include("kernels/assign_q_and_p.jl")
     include("kernels/tensor_mat_mul.jl")
     include("kernels/tensor_tensor_mul.jl")
@@ -91,7 +94,6 @@ module GeometricMachineLearning
     include("arrays/skew_symmetric.jl")
     include("arrays/stiefel_lie_algebra_horizontal.jl")
    include("arrays/grassmann_lie_algebra_horizontal.jl")
-    include("arrays/auxiliary.jl")
 
     export SymmetricMatrix, SymplecticPotential, SkewSymMatrix
     export StiefelLieAlgHorMatrix
@@ -110,6 +112,8 @@ module GeometricMachineLearning
     # include("manifolds/symplectic_stiefel_manifold.jl")
     include("manifolds/grassmann_manifold.jl")
 
+    include("arrays/stiefel_projection.jl")
+
     export StiefelManifold, SymplecticStiefelManifold, GrassmannManifold, Manifold
 
     export rgrad, metric
@@ -119,6 +123,8 @@ module GeometricMachineLearning
     include("optimizers/gradient_optimizer.jl")
     include("optimizers/momentum_optimizer.jl")
     include("optimizers/adam_optimizer.jl")
+    include("optimizers/bfgs_cache.jl")
+    include("optimizers/bfgs_optimizer.jl")
     include("optimizers/init_optimizer_cache.jl")
 
     include("optimizers/manifold_related/global_sections.jl")
@@ -150,10 +156,10 @@ module GeometricMachineLearning
     export GradientOptimizer, GradientCache
     export MomentumOptimizer, MomentumCache
     export AdamOptimizer, AdamCache
+    export BFGSOptimizer, BFGSCache
     export Optimizer
     export optimization_step!
 
-    export init_optimizer_cache
 
     export GlobalSection, apply_section
     export global_rep
@@ -234,6 +240,7 @@ module GeometricMachineLearning
     include("architectures/variable_width_network.jl")
     include("architectures/recurrent_neural_network.jl")
     include("architectures/LSTM_neural_network.jl")
+    include("architectures/transformer_neural_network.jl")
 
     export HamiltonianNeuralNetwork
     export LagrangianNeuralNetwork
@@ -242,11 +249,11 @@ module GeometricMachineLearning
     export GSympNet
     export RecurrentNeuralNetwork
     export LSTMNeuralNetwork
+    export ClassificationTransformer
 
     export train!, apply!, jacobian!
     export Iterate_Sympnet
 
-    export default_arch
 
     include("architectures/default_architecture.jl")
@@ -356,7 +363,6 @@ module GeometricMachineLearning
     export integrate, integrate_step!
 
     include("integrator/sympnet_integrator.jl")
 
-    include("reduced_system/system_type.jl")
     include("reduced_system/reduced_system.jl")
 
diff --git a/src/architectures/transformer_neural_network.jl b/src/architectures/transformer_neural_network.jl
new file mode 100644
index 000000000..1ca4ae4ec
--- /dev/null
+++ b/src/architectures/transformer_neural_network.jl
@@ -0,0 +1,32 @@
+@doc raw"""
+This is a transformer neural network for classification purposes. At the moment this is only used for training on MNIST, but can in theory be used for any classification problem.
+
+It has to be called with a `DataLoader` that stores an input and an output tensor. The optional arguments are:
+- `n_heads`: The number of heads in the `MultiHeadAttention` (mha) layers. Default: `7`.
+- `n_layers`: The number of transformer layers. Default: `16`.
+- `activation`: The activation function. Default: `softmax`.
+- `Stiefel`: Whether the matrices in the mha layers are on the **Stiefel manifold**.
+- `add_connection`: Whether the input is appended to the output of the mha layer. 
(skip connection) +""" +struct ClassificationTransformer{AT} <: Architecture + transformer_dim::Int + classification_dim::Int + n_heads::Int + n_layers::Int + activation::AT + Stiefel::Bool + add_connection::Bool + function ClassificationTransformer(transformer_dim::Int, classification_dim::Int, n_heads::Int, n_layers::Int, σ::AT, Stiefel::Bool, add_connection::Bool) where AT + new{AT}(transformer_dim, classification_dim, n_heads, n_layers, σ, Stiefel, add_connection) + end + function ClassificationTransformer(dl::DataLoader{T, BT, CT}; n_heads::Int=7, n_layers::Int=16, activation::AT=softmax, Stiefel::Bool=true, add_connection::Bool=true) where {T, T1, BT<:AbstractArray{T, 3}, CT<:AbstractArray{T1, 3}, AT} + new{AT}(dl.input_dim, dl.output_dim, n_heads, n_layers, activation, Stiefel, add_connection) + end +end + +function Chain(arch::ClassificationTransformer) + Chain( + Transformer(arch.transformer_dim, arch.n_heads, arch.n_layers, Stiefel=arch.Stiefel, add_connection=arch.add_connection).layers..., + Classification(arch.transformer_dim, arch.classification_dim, arch.activation) + ) +end \ No newline at end of file diff --git a/src/arrays/auxiliary.jl b/src/arrays/auxiliary.jl deleted file mode 100644 index a5497779e..000000000 --- a/src/arrays/auxiliary.jl +++ /dev/null @@ -1,35 +0,0 @@ -""" -Auxiliary arrays - these should proably be put into different files or given another name. -""" - -struct StiefelProjection{T} <: AbstractMatrix{T} - N::Integer - n::Integer - StiefelProjection(N, n, T=Float64) = new{T}(N,n) -end - -function Base.getindex(::StiefelProjection{T},i,j) where T - if i == j - return T(1.) - end - return T(0.) -end - -Base.parent(E::StiefelProjection) = (E.N, E.n) - -Base.size(E::StiefelProjection) = (E.N, E.n) - -struct One{T} <: AbstractMatrix{T} - n::Integer - One(n, T=Float64) = new{T}(n) -end - -function Base.getindex(::One{T}, i, j) where T -if i == j - return T(1.) - end - return T(0.) -end -Base.parent(mat::One) = (mat.n) -Base.size(mat::One) = (mat.n, mat.n) - diff --git a/src/arrays/skew_symmetric.jl b/src/arrays/skew_symmetric.jl index c3e91ab81..e1ab85fe2 100644 --- a/src/arrays/skew_symmetric.jl +++ b/src/arrays/skew_symmetric.jl @@ -188,4 +188,27 @@ end # the first matrix is multiplied onto A2 in order for it to not be SkewSymMatrix! function Base.:*(A1::SkewSymMatrix{T}, A2::SkewSymMatrix{T}) where T A1*(one(A2)*A2) +end + +@doc raw""" +If `vec` is applied onto `SkewSymMatrix`, then the output is the associated vector. 
+""" +function Base.vec(A::SkewSymMatrix) + A.S +end + +function Base.zero(A::SkewSymMatrix) + SkewSymMatrix(zero(A.S), A.n) +end + +function KernelAbstractions.get_backend(A::SkewSymMatrix) + KernelAbstractions.get_backend(A.S) +end + +function assign!(B::SkewSymMatrix{T}, C::SkewSymMatrix{T}) where T + B.S .= C.S +end + +function Base.copy(A::SkewSymMatrix) + SkewSymMatrix(copy(A.S), A.n) end \ No newline at end of file diff --git a/src/arrays/stiefel_lie_algebra_horizontal.jl b/src/arrays/stiefel_lie_algebra_horizontal.jl index a1a1a6f28..71558e485 100644 --- a/src/arrays/stiefel_lie_algebra_horizontal.jl +++ b/src/arrays/stiefel_lie_algebra_horizontal.jl @@ -168,4 +168,55 @@ function LinearAlgebra.mul!(C::StiefelLieAlgHorMatrix, A::StiefelLieAlgHorMatrix mul!(C.B, A.B, α) end LinearAlgebra.mul!(C::StiefelLieAlgHorMatrix, α::Real, A::StiefelLieAlgHorMatrix) = mul!(C, A, α) -LinearAlgebra.rmul!(C::StiefelLieAlgHorMatrix, α::Real) = mul!(C, C, α) \ No newline at end of file +LinearAlgebra.rmul!(C::StiefelLieAlgHorMatrix, α::Real) = mul!(C, C, α) + +function Base.vec(A::StiefelLieAlgHorMatrix) + vcat(vec(A.A), vec(A.B)) +end + +function StiefelLieAlgHorMatrix(V::AbstractVector, N::Int, n::Int) + # length of skew-symmetric matrix + skew_sym_size = n*(n-1)÷2 + # size of matrix component + matrix_size = (N-n)*n + @assert length(V) == skew_sym_size + matrix_size + StiefelLieAlgHorMatrix( + SkewSymMatrix(@view(V[1:skew_sym_size]), n), + reshape(@view(V[(skew_sym_size+1):(skew_sym_size+matrix_size)]), (N-n), n), + N, + n + ) +end + +function Base.zero(B::StiefelLieAlgHorMatrix) + StiefelLieAlgHorMatrix( + zero(B.A), + zero(B.B), + B.N, + B.n + ) +end + +function KernelAbstractions.get_backend(B::StiefelLieAlgHorMatrix) + KernelAbstractions.get_backend(B.B) +end + +# assign funciton; also implement this for other arrays! +function assign!(B::StiefelLieAlgHorMatrix{T}, C::StiefelLieAlgHorMatrix{T}) where T + assign!(B.A, C.A) + B.B .= C.B +end + +function Base.copy(B::StiefelLieAlgHorMatrix) + StiefelLieAlgHorMatrix( + copy(B.A), + copy(B.B), + B.N, + B.n + ) +end + +# fallback -> put this somewhere else! +function assign!(A::AbstractArray, B::AbstractArray) + A .= B +end \ No newline at end of file diff --git a/src/arrays/stiefel_projection.jl b/src/arrays/stiefel_projection.jl new file mode 100644 index 000000000..82c04c2ac --- /dev/null +++ b/src/arrays/stiefel_projection.jl @@ -0,0 +1,51 @@ +@doc raw""" +An array that essentially does `vcat(I(n), zeros(N-n, n))` with GPU support. It has **three inner constructors**. The **first one** is called with the following arguments: +1. `backend`: backends as supported by `KernelAbstractions`. +2. `T::Type` +3. `N::Integer` +4. `n::Integer` + +The **second constructor** is called by supplying a matrix as input. The constructor will then extract the backend, the type and the dimensions of that matrix. + +The **third constructor** is called by supplying an instance of `StiefelLieAlgHorMatrix`. + +Technically this should be a subtype of `StiefelManifold`. +""" +struct StiefelProjection{T, AT} <: AbstractMatrix{T} + N::Integer + n::Integer + A::AT + function StiefelProjection(backend, T::Type, N::Integer, n::Integer) + A = KernelAbstractions.zeros(backend, T, N, n) + assign_ones_for_stiefel_projection! 
= assign_ones_for_stiefel_projection_kernel!(backend) + assign_ones_for_stiefel_projection!(A, ndrange=n) + new{T, typeof(A)}(N,n, A) + end + + function StiefelProjection(A::AbstractMatrix{T}) where T + StiefelProjection(KernelAbstractions.get_backend(A), T, size(A)...) + end + + function StiefelProjection(B::StiefelLieAlgHorMatrix{T}) where T + StiefelProjection(KernelAbstractions.get_backend(B), T, B.N, B.n) + end +end + +@kernel function assign_ones_for_stiefel_projection_kernel!(A::AbstractArray{T}) where T + i = @index(Global) + A[i, i] = one(T) +end + +""" +Outer constructor for `StiefelProjection`. This works with two integers as input and optionally the type. +""" +StiefelProjection(N::Integer, n::Integer, T::Type=Float64) = StiefelProjection(CPU(), T, N, n) + +Base.size(E::StiefelProjection) = (E.N, E.n) +Base.getindex(E::StiefelProjection, i, j) = getindex(E.A, i, j) +Base.:+(E::StiefelProjection, A::AbstractMatrix) = E.A + A +Base.:+(A::AbstractMatrix, E::StiefelProjection) = +(E, A) +Base.vcat(A::AbstractVecOrMat{T}, E::StiefelProjection{T}) where {T<:Number} = vcat(A, E.A) +Base.vcat(E::StiefelProjection{T}, A::AbstractVecOrMat{T}) where {T<:Number} = vcat(E.A, A) +Base.hcat(A::AbstractVecOrMat{T}, E::StiefelProjection{T}) where {T<:Number} = hcat(A, E.A) +Base.hcat(E::StiefelProjection{T}, A::AbstractVecOrMat{T}) where {T<:Number} = hcat(E.A, A) \ No newline at end of file diff --git a/src/data_loader/batch.jl b/src/data_loader/batch.jl index c85ede47c..bbb9e4e39 100644 --- a/src/data_loader/batch.jl +++ b/src/data_loader/batch.jl @@ -83,6 +83,10 @@ function optimize_for_one_epoch!(opt::Optimizer, model, ps::Union{Tuple, NamedTu optimize_for_one_epoch!(opt, model, ps, dl, batch, loss) end +function optimize_for_one_epoch!(opt::Optimizer, nn::NeuralNetwork, dl::DataLoader, batch::Batch) + optimize_for_one_epoch!(opt, nn.model, nn.params, dl, batch) +end + """ TODO: Add ProgressMeter!!! """ @@ -105,7 +109,6 @@ end function (o::Optimizer)(nn::NeuralNetwork, dl::DataLoader, batch::Batch, n_epochs::Int, loss) - progress_object = ProgressMeter.Progress(n_epochs; enabled=true) loss_array = zeros(n_epochs) for i in 1:n_epochs diff --git a/src/data_loader/data_loader.jl b/src/data_loader/data_loader.jl index 34d65c71a..ae6a45089 100644 --- a/src/data_loader/data_loader.jl +++ b/src/data_loader/data_loader.jl @@ -47,7 +47,7 @@ end The DataLoader is called with a tensor and a vector. For the moment this is always interpreted to be the MNIST data set. """ function DataLoader(data::AbstractArray{T, 3}, target::AbstractVector{T1}; patch_length=7) where {T, T1} - @info "You provided a tensor and a vector as input. This will be treated as a classification problem (MNIST). Tensor axes: (i) & (ii) image axes and (iii) parameter dimesnion." + @info "You provided a tensor and a vector as input. This will be treated as a classification problem (MNIST). Tensor axes: (i) & (ii) image axes and (iii) parameter dimension." 
im_dim₁, im_dim₂, n_params = size(data) @assert length(target) == n_params number_of_patches = (im_dim₁÷patch_length)*(im_dim₂÷patch_length) @@ -153,4 +153,8 @@ function accuracy(model::Chain, ps::Tuple, dl::DataLoader{T, AT, BT}) where {T, # get tensor of maximum elements tensor_of_maximum_elements[ind] .= T1(1) (size(dl.output, 3)-sum(abs.(dl.output - tensor_of_maximum_elements))/T1(2))/size(dl.output, 3) -end \ No newline at end of file +end + +accuracy(nn::NeuralNetwork, dl::DataLoader) = accuracy(nn.model, nn.params, dl) + +Base.eltype(::DataLoader{T}) where T = T \ No newline at end of file diff --git a/src/layers/multi_head_attention.jl b/src/layers/multi_head_attention.jl index 0ae8757f6..05c3c53cc 100644 --- a/src/layers/multi_head_attention.jl +++ b/src/layers/multi_head_attention.jl @@ -75,7 +75,10 @@ function initialparameters(backend::KernelAbstractions.Backend, T::Type, d::Mult (PQ=PQ, PK=PK, PV=PV) end -function (d::MultiHeadAttention{M, M, Stiefel, Retraction, true})(x::AbstractMatrix{T}, ps::NamedTuple) where {M, Stiefel, Retraction, T} +@doc raw""" +Applies MHA to an abstract matrix. This is the same independent of whether the input is added to the output or not. +""" +function compute_output_of_mha(d::MultiHeadAttention{M, M}, x::AbstractMatrix{T}, ps::NamedTuple) where {M, T} dim, input_length = size(x) @assert dim == M @@ -84,35 +87,18 @@ function (d::MultiHeadAttention{M, M, Stiefel, Retraction, true})(x::AbstractMat key = Symbol("head_"*string(i)) output = vcat(output, ps.PV[key]'*x*softmax((ps.PQ[key]'*x)'*(ps.PK[key]'*x)/T(sqrt(dim)))) end - x + output -end - -function (d::MultiHeadAttention{M, M, Stiefel, Retraction, false})(x::AbstractMatrix{T}, ps::NamedTuple) where {M, Stiefel, Retraction, T} - dim, input_length = size(x) - @assert dim == M - - output = similar(x, 0, input_length) - for i in 1:d.n_heads - key = Symbol("head_"*string(i)) - output = vcat(output, ps.PV[key]'*x*softmax((ps.PQ[key]'*x)'*(ps.PK[key]'*x)/T(sqrt(dim)))) - end output end -function (d::MultiHeadAttention{M, M, Stiefel, Retraction, true})(x::AbstractArray{T, 3}, ps::NamedTuple) where {M, Stiefel, Retraction, T} + +function compute_output_of_mha(d::MultiHeadAttention{M, M}, x::AbstractArray{T, 3}, ps::NamedTuple) where {M, T} Dₕ = M ÷ d.n_heads dim, input_length, number_data = size(x) @assert dim == M - + # initialize the output output = similar(x, 0, input_length, number_data) - # initialize the results of various tensor matrix multiplications - Q_tensor = similar(x, Dₕ, input_length, number_data) - K_tensor = similar(x, Dₕ, input_length, number_data) - V_tensor = similar(x, Dₕ, input_length, number_data) - QK_tensor = similar(x, input_length, input_length, number_data) - # this is the result of a single head attention block single_head_output = similar(x, Dₕ, input_length, number_data) @@ -127,35 +113,15 @@ function (d::MultiHeadAttention{M, M, Stiefel, Retraction, true})(x::AbstractArr output = vcat(output, single_head_output) # KernelAbstractions.synchronize(backend) end - x + output + output end -function (d::MultiHeadAttention{M, M, Stiefel, Retraction, false})(x::AbstractArray{T, 3}, ps::NamedTuple) where {M, Stiefel, Retraction, T} - Dₕ = M ÷ d.n_heads - dim, input_length, number_data = size(x) - @assert dim == M - - output = similar(x, 0, input_length, number_data) - - Q_tensor = similar(x, Dₕ, input_length, number_data) - K_tensor = similar(x, Dₕ, input_length, number_data) - V_tensor = similar(x, Dₕ, input_length, number_data) - QK_tensor = similar(x, input_length, input_length, 
number_data)
-
-    single_head_output = similar(x, Dₕ, input_length, number_data)
-
-    for i in 1:d.n_heads
-        key = Symbol("head_"*string(i))
-        Q_tensor = mat_tensor_mul(ps.PQ[key]', x)
-        K_tensor = mat_tensor_mul(ps.PK[key]', x)
-        V_tensor = mat_tensor_mul(ps.PV[key]', x)
-        QK_tensor = tensor_transpose_tensor_mul(Q_tensor, K_tensor)
+function (d::MultiHeadAttention{M, M, Stiefel, Retraction, true})(x::AbstractArray, ps::NamedTuple) where {M, Stiefel, Retraction}
+    x + compute_output_of_mha(d, x, ps)
+end
 
-        single_head_output = tensor_tensor_mul(V_tensor, softmax(QK_tensor/T(sqrt(dim))))
-        output = vcat(output, single_head_output)
-        # KernelAbstractions.synchronize(backend)
-    end
-    output
+function (d::MultiHeadAttention{M, M, Stiefel, Retraction, false})(x::AbstractArray, ps::NamedTuple) where {M, Stiefel, Retraction}
+    compute_output_of_mha(d, x, ps)
 end
 
 import ChainRules
diff --git a/src/manifolds/stiefel_manifold.jl b/src/manifolds/stiefel_manifold.jl
index cbe750c55..4149dcd2d 100644
--- a/src/manifolds/stiefel_manifold.jl
+++ b/src/manifolds/stiefel_manifold.jl
@@ -1,8 +1,10 @@
+@doc raw"""
+An implementation of the Stiefel manifold. It has various convenience functions associated with it:
+- check
+- rand
+- rgrad
+- metric
 """
-maybe consider dividing the output in the check functions by n!
-TODO: Implement sampling procedures!!
-"""
-
 mutable struct StiefelManifold{T, AT <: AbstractMatrix{T}} <: Manifold{T}
     A::AT
     function StiefelManifold(A::AbstractMatrix)
@@ -65,10 +67,30 @@ function Base.:*(Y::Adjoint{T, StiefelManifold{T, AT}}, B::AbstractMatrix) where
     Y.parent.A'*B
 end
 
+@doc raw"""
+Computes the Riemannian gradient for the Stiefel manifold given an element ``Y\in{}St(N,n)`` and a matrix ``\nabla{}L\in\mathbb{R}^{N\times{}n}`` (the Euclidean gradient). It computes the Riemannian gradient with respect to the canonical metric (see the documentation for the function `metric` for an explanation of this).
+The precise form of the mapping is:
+```math
+\mathtt{rgrad}(Y, \nabla{}L) \mapsto \nabla{}L - Y(\nabla{}L)^TY
+```
+It is called with inputs:
+- `Y::StiefelManifold`
+- `e_grad::AbstractMatrix`: i.e. the Euclidean gradient (what was called ``\nabla{}L`` above).
+"""
 function rgrad(Y::StiefelManifold, e_grad::AbstractMatrix)
     e_grad - Y.A*(e_grad'*Y.A)
 end
 
+@doc raw"""
+Implements the canonical Riemannian metric for the Stiefel manifold:
+```math
+g_Y: (\Delta_1, \Delta_2) \mapsto \mathrm{tr}(\Delta_1^T(\mathbb{I} - \frac{1}{2}YY^T)\Delta_2).
+```
+It is called with:
+- `Y::StiefelManifold`
+- `Δ₁::AbstractMatrix`
+- `Δ₂::AbstractMatrix`
+"""
 function metric(Y::StiefelManifold, Δ₁::AbstractMatrix, Δ₂::AbstractMatrix)
     LinearAlgebra.tr(Δ₁'*(I - .5*Y.A*Y.A')*Δ₂)
 end
diff --git a/src/optimizers/adam_optimizer.jl b/src/optimizers/adam_optimizer.jl
index 8e5378f71..8dd5d895b 100644
--- a/src/optimizers/adam_optimizer.jl
+++ b/src/optimizers/adam_optimizer.jl
@@ -11,13 +11,13 @@ struct AdamOptimizer{T<:Real} <: OptimizerMethod
     AdamOptimizer(η = 1f-3, ρ₁ = 9f-1, ρ₂ = 9.9f-1, δ = 3f-7; T=typeof(η)) = new{T}(T(η), T(ρ₁), T(ρ₂), T(δ))
 end
 
-function update!(o::Optimizer{<:AdamOptimizer{T}}, C::AdamCache, B::AbstractVecOrMat) where T
+function update!(o::Optimizer{<:AdamOptimizer{T}}, C::AdamCache, B::AbstractArray) where T
     add!(C.B₁, ((o.method.ρ₁ - o.method.ρ₁^o.step)/(T(1.) - o.method.ρ₁^o.step))*C.B₁, ((T(1.) - o.method.ρ₁)/(T(1.) - o.method.ρ₁^o.step))*B)
     add!(C.B₂, ((o.method.ρ₂ - o.method.ρ₂^o.step)/(T(1.) - o.method.ρ₂^o.step))*C.B₂, ((T(1.) - o.method.ρ₂)/(T(1.) - o.method.ρ₂^o.step))*⊙²(B))
     mul!(B, -o.method.η, /ᵉˡᵉ(C.B₁, scalar_add(racᵉˡᵉ(C.B₂), o.method.δ)))
 end
 
-#fallbacks:
+# defaults:
 ⊙²(A::AbstractVecOrMat) = A.^2
 racᵉˡᵉ(A::AbstractVecOrMat) = sqrt.(A)
 /ᵉˡᵉ(A::AbstractVecOrMat, B::AbstractVecOrMat) = A./B
diff --git a/src/optimizers/bfgs_cache.jl b/src/optimizers/bfgs_cache.jl
new file mode 100644
index 000000000..26f1c0b96
--- /dev/null
+++ b/src/optimizers/bfgs_cache.jl
@@ -0,0 +1,51 @@
+@doc raw"""
+The cache for the BFGS optimizer.
+
+It stores the gradient from the previous time step `B`, the previous update step `S` and the inverse of the Hessian matrix `H`.
+
+It is important to note that setting up this cache already requires a derivative! This is not the case for the other optimizers.
+"""
+struct BFGSCache{T, AT<:AbstractArray{T}} <: AbstractCache
+    B::AT
+    S::AT
+    H::AbstractMatrix{T}
+    function BFGSCache(B::AbstractArray)
+        new{eltype(B), typeof(zero(B))}(zero(B), zero(B), initialize_hessian_inverse(zero(B)))
+    end
+end
+
+@doc raw"""
+In order to initialize `BFGSCache` we first need gradient information. This is why we initially have this `BFGSDummyCache` until gradient information is available.
+
+NOTE: we may not need this.
+"""
+struct BFGSDummyCache{T, AT<:AbstractArray{T}} <: AbstractCache
+    function BFGSDummyCache(B::AbstractArray)
+        new{eltype(B), typeof(zero(B))}()
+    end
+end
+
+@kernel function assign_diagonal_ones_kernel!(B::AbstractMatrix{T}) where T
+    i = @index(Global)
+    B[i, i] = one(T)
+end
+
+@doc raw"""
+This initializes the inverse of the Hessian for various arrays. This requires an implementation of a *vectorization operation* `vec`. This is important for custom arrays.
+"""
+function initialize_hessian_inverse(B::AbstractArray{T}) where T
+    length_of_array = length(vec(B))
+    backend = KernelAbstractions.get_backend(B)
+    H = KernelAbstractions.zeros(backend, T, length_of_array, length_of_array)
+    assign_diagonal_ones! = assign_diagonal_ones_kernel!(backend)
+    assign_diagonal_ones!(H, ndrange=length_of_array)
+    H
+end
+
+setup_bfgs_cache(ps::NamedTuple) = apply_toNT(setup_bfgs_cache, ps)
+setup_bfgs_cache(ps::Tuple) = Tuple([setup_bfgs_cache(x) for x in ps])
+setup_bfgs_cache(B::AbstractArray) = BFGSCache(B)
+
+setup_bfgs_dummy_cache(ps::NamedTuple) = apply_toNT(setup_bfgs_dummy_cache, ps)
+setup_bfgs_dummy_cache(ps::Tuple) = Tuple([setup_bfgs_dummy_cache(x) for x in ps])
+setup_bfgs_dummy_cache(B::AbstractArray) = BFGSDummyCache(B)
\ No newline at end of file
diff --git a/src/optimizers/bfgs_optimizer.jl b/src/optimizers/bfgs_optimizer.jl
new file mode 100644
index 000000000..a57bd8ae4
--- /dev/null
+++ b/src/optimizers/bfgs_optimizer.jl
@@ -0,0 +1,48 @@
+@doc raw"""
+This is an implementation of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) optimizer.
+"""
+struct BFGSOptimizer{T<:Real} <: OptimizerMethod
+    η::T
+    δ::T
+
+    function BFGSOptimizer(η::T = 1f-2, δ=1f-7) where T
+        new{T}(η, T(δ))
+    end
+end
+
+@doc raw"""
+Optimization for an entire neural network with BFGS. What is different in this case is that we still have to initialize the cache. 
+ +If `o.step == 1`, then we initialize the cache +""" +function update!(o::Optimizer{<:BFGSOptimizer}, C::CT, B::AbstractArray{T}) where {T, CT<:BFGSCache{T}} + if o.step == 1 + bfgs_initialization!(o, C, B) + else + bfgs_update!(o, C, B) + end +end + +function bfgs_update!(o::Optimizer{<:BFGSOptimizer}, C::CT, B::AbstractArray{T}) where {T, CT<:BFGSCache{T}} + # in the first step we compute the difference between the current and the previous mapped gradients: + Y = vec(B - C.B) + # in the second step we update H (write a test to check that this preserves symmetry) + vecS = vec(C.S) + # the *term for the second condition* appears many times in the expression. + SY = vecS' * Y + o.method.δ + # C.H .= C.H + (SY + Y' * C.H * Y) / (SY ^ 2) * vecS * vecS' - (C.H * Y * vecS' + vecS * (C.H * Y)' ) / SY + # the two computations of the H matrix should be equivalent. Check this!! + HY = C.H * Y + C.H .= C.H - HY * HY' / (Y' * HY + o.method.δ) + vecS * vecS' / SY + # in the third step we compute the final velocity + mul!(vecS, C.H, vec(B)) + mul!(C.S, -o.method.η, C.S) + assign!(C.B, copy(B)) + assign!(B, copy(C.S)) +end + +function bfgs_initialization!(o::Optimizer{<:BFGSOptimizer}, C::CT, B::AbstractArray{T}) where {T, CT<:BFGSCache{T}} + mul!(C.S, -o.method.η, B) + assign!(C.B, copy(B)) + assign!(B, copy(C.S)) +end \ No newline at end of file diff --git a/src/optimizers/init_optimizer_cache.jl b/src/optimizers/init_optimizer_cache.jl index 599c13b70..5c1c341d1 100644 --- a/src/optimizers/init_optimizer_cache.jl +++ b/src/optimizers/init_optimizer_cache.jl @@ -1,7 +1,8 @@ -""" -Wrapper for the functions setup_adam_cache, setup_momentum_cache, setup_gradient_cache. -These appear outside of optimizer_caches.jl because the OptimizerMethods first have to be defined. +@doc raw""" +Wrapper for the functions `setup_adam_cache`, `setup_momentum_cache`, `setup_gradient_cache`, `setup_bfgs_cache`. +These appear outside of `optimizer_caches.jl` because the `OptimizerMethods` first have to be defined. """ init_optimizer_cache(::GradientOptimizer, x) = setup_gradient_cache(x) init_optimizer_cache(::MomentumOptimizer, x) = setup_momentum_cache(x) -init_optimizer_cache(::AdamOptimizer, x) = setup_adam_cache(x) \ No newline at end of file +init_optimizer_cache(::AdamOptimizer, x) = setup_adam_cache(x) +init_optimizer_cache(::BFGSOptimizer, x) = setup_bfgs_cache(x) \ No newline at end of file diff --git a/src/optimizers/manifold_related/global_sections.jl b/src/optimizers/manifold_related/global_sections.jl index 5eb80bdf8..2d79f7c3c 100644 --- a/src/optimizers/manifold_related/global_sections.jl +++ b/src/optimizers/manifold_related/global_sections.jl @@ -1,4 +1,4 @@ -""" +@doc raw""" This implements global sections for the Stiefel manifold and the Symplectic Stiefel manifold. In practice this is implemented using Householder reflections, with the auxiliary column vectors given by: @@ -14,10 +14,9 @@ Maybe consider dividing the output in the check functions by n! Implement a general global section here!!!! Tₓ𝔐 → G×𝔤 !!!!!! (think about random initialization!) """ -#global section maps an element of the manifold to its associated Lie group! struct GlobalSection{T, AT} Y::AT - #for now the only lift that is implemented is the Stiefel one - these types will have to be expanded! + # for now the only lift that is implemented is the Stiefel one - these types will have to be expanded! 
λ::Union{LinearAlgebra.QRCompactWYQ, LinearAlgebra.QRPackedQ, Nothing} function GlobalSection(Y::AbstractVecOrMat) @@ -30,26 +29,22 @@ function GlobalSection(ps::NamedTuple) apply_toNT(GlobalSection, ps) end -#this is an application G×𝔐 → 𝔐 +# this is an application G×𝔐 → 𝔐 function apply_section(λY::GlobalSection{T, AT}, Y₂::AT) where {T, AT<:StiefelManifold{T}} N, n = size(λY.Y) @assert (N, n) == size(Y₂) - #temporary solution for the moment - projection_matrix₁ = typeof(Y₂.A)(hcat(I(n), zeros(T, n, N-n))) - projection_matrix₂ = typeof(Y₂.A)(hcat(zeros(T, N-n, n), I(N-n))) + backend = KernelAbstractions.get_backend(Y₂) StiefelManifold( - λY.Y.A*(projection_matrix₁*Y₂.A) + λY.λ*vcat(projection_matrix₂*Y₂.A, typeof(Y₂.A)(zeros(T, n, n))) + λY.Y.A * Y₂.A[1:n, :] + λY.λ*vcat(Y₂.A[(n+1):N, :], KernelAbstractions.zeros(backend, T, n, n)) ) end function apply_section!(Y::AT, λY::GlobalSection{T, AT}, Y₂::AT) where {T, AT<:StiefelManifold{T}} N, n = size(λY.Y) @assert (N, n) == size(Y₂) == size(Y) - #temporary solution for the moment - projection_matrix₁ = typeof(Y₂.A)(hcat(I(n), zeros(T, n, N-n))) - projection_matrix₂ = typeof(Y₂.A)(hcat(zeros(T, N-n, n), I(N-n))) - Y.A .= λY.Y*(projection_matrix₁*Y₂) + λY.λ*vcat(projection_matrix₂*Y₂, typeof(Y₂.A)(zeros(T, n, n))) + backend = KernelAbstractions.get_backend(Y) + Y.A .= λY.Y * Y₂.A[1:n, :] + λY.λ*vcat(Y₂.A[(n+1):N, :], KernelAbstractions.zeros(backend, T, n, n)) end function apply_section(λY::GlobalSection{T, AT}, Y₂::AT) where {T, AT<:GrassmannManifold{T}} @@ -91,11 +86,9 @@ end function global_rep(λY::GlobalSection{T, AT}, Δ::AbstractMatrix{T}) where {T, AT<:StiefelManifold{T}} N, n = size(λY.Y) - #temporary workaround - projection_matrix = typeof(Δ)(hcat(I(N-n), zeros(T, N-n, n))) StiefelLieAlgHorMatrix( SkewSymMatrix(λY.Y.A'*Δ), - projection_matrix*(λY.λ'*Δ), + (λY.λ'*Δ)[1:(N-n), 1:n], N, n ) @@ -103,10 +96,9 @@ end function global_rep(λY::GlobalSection{T, AT}, Δ::AbstractMatrix{T}) where {T, AT<:GrassmannManifold{T}} N, n = size(λY.Y) - #temporary workaround - projection_matrix = typeof(Δ)(hcat(zeros(T, N-n, n), I(N-n))) GrassmannLieAlgHorMatrix( - projection_matrix*(λY.λ'*Δ), + # (λY.λ'*Δ)[(n+1):N, 1:n], + (λY.λ' * Δ)[1:(N-n), 1:n], N, n ) diff --git a/src/optimizers/manifold_related/retractions.jl b/src/optimizers/manifold_related/retractions.jl index 11b552406..b2bc37dc0 100644 --- a/src/optimizers/manifold_related/retractions.jl +++ b/src/optimizers/manifold_related/retractions.jl @@ -41,16 +41,12 @@ geodesic(B::NamedTuple) = apply_toNT(geodesic, B) #you will have to fix the scalar indexing problem wrt to SkewSymMatrix! function geodesic(B::StiefelLieAlgHorMatrix{T}) where T - N, n = B.N, B.n - E = typeof(B.B)(StiefelProjection(N, n, T)) - # expression from which matrix exponential and inverse have to be computed - unit = typeof(B.B)(I(n)) - # delete this line eventually!!! 
- # A_mat = typeof(B.B)(SkewSymMatrix(Vector(B.A.S), n)) - A_mat = B.A - exponent = hcat(vcat(T(.5)*(A_mat*one(A_mat)), T(.25)*A_mat*A_mat - B.B'*B.B), vcat(unit, T(.5)*(A_mat*one(A_mat)))) + E = StiefelProjection(B) + unit = one(B.A) + A_mat = B.A * unit + exponent = hcat(vcat(T(.5) * A_mat, T(.25) * B.A * A_mat - B.B' * B.B), vcat(unit, T(.5) * A_mat)) StiefelManifold( - E + hcat(vcat(T(.5)*A_mat*one(A_mat), B.B), E)*𝔄(exponent)*vcat(unit, T(.5)*A_mat*one(A_mat)) + E + hcat(vcat(T(.5) * A_mat, B.B), E) * 𝔄(exponent) * vcat(unit, T(.5) * A_mat) ) end @@ -68,19 +64,18 @@ end cayley(B::NamedTuple) = apply_toNT(cayley, B) function cayley(B::StiefelLieAlgHorMatrix{T}) where T - N, n = B.N, B.n - E = typeof(B.B)(StiefelProjection(N, n, T)) - unit = typeof(B.B)(I(n)) - A_mat = B.A*one(B.A) - A_mat2 = B.A*B.A - BB = B.B'*B.B + E = StiefelProjection(B) + unit = one(B.A) + A_mat = B.A * one(B.A) + A_mat2 = B.A * B.A + BB = B.B' * B.B - exponent = hcat(vcat(unit - T(.25)*A_mat, T(.5)*BB - T(.125)*A_mat2), vcat(-T(.5)*unit, unit - T(.25)*A_mat)) + exponent = hcat(vcat(unit - T(.25) * A_mat, T(.5) * BB - T(.125) * A_mat2), vcat(-T(.5) * unit, unit - T(.25) * A_mat)) StiefelManifold( E + - T(.5)*hcat(vcat(T(.5)*A_mat, B.B), vcat(unit, zero(B.B)))* + T(.5) * hcat(vcat(T(.5) * A_mat, B.B), vcat(unit, zero(B.B)))* ( - vcat(unit, T(0.5)*A_mat) + exponent \ (vcat(unit, T(0.5)*A_mat) + vcat(T(0.5)*A_mat, T(0.25)*A_mat2 - T(0.5)*BB)) + vcat(unit, T(0.5) * A_mat) + exponent \ (vcat(unit, T(0.5) * A_mat) + vcat(T(0.5) * A_mat, T(0.25) * A_mat2 - T(0.5) * BB)) ) ) end \ No newline at end of file diff --git a/src/optimizers/optimizer.jl b/src/optimizers/optimizer.jl index 056d7af36..7cd78d037 100644 --- a/src/optimizers/optimizer.jl +++ b/src/optimizers/optimizer.jl @@ -1,4 +1,4 @@ -""" +@doc raw""" Optimizer struct that stores the 'method' (i.e. Adam with corresponding hyperparameters), the cache and the optimization step. It takes as input an optimization method and the parameters of a network. @@ -46,6 +46,15 @@ function optimization_step!(o::Optimizer, d::Union{AbstractExplicitLayer, Abstra apply_section!(ps, λY, ps₂) end +@doc raw""" +Optimization for an entire neural networks, the way this function should be called. + +inputs: +- `o::Optimizer` +- `model::Chain` +- `ps::Tuple` +- `dx::Tuple` +""" function optimization_step!(o::Optimizer, model::Chain, ps::Tuple, dx::Tuple) o.step += 1 for (index, element) in zip(eachindex(model.layers), model.layers) @@ -69,6 +78,7 @@ function rgrad(Y::AbstractVecOrMat, dx::AbstractVecOrMat) dx end +# do we need those two? function update!(m::Optimizer, C::NamedTuple, B::NamedTuple) apply_toNT(m, C, B, update!) end diff --git a/src/optimizers/optimizer_caches.jl b/src/optimizers/optimizer_caches.jl index 78e113709..b8a21cb0a 100644 --- a/src/optimizers/optimizer_caches.jl +++ b/src/optimizers/optimizer_caches.jl @@ -1,8 +1,9 @@ -""" +@doc raw""" AbstractCache has subtypes: -AdamCache -MomentumCache -GradientCache +- AdamCache +- MomentumCache +- GradientCache +- BFGSCache All of them can be initialized with providing an array (also supporting manifold types). 
""" @@ -14,15 +15,15 @@ abstract type AbstractCache end struct AdamCache{T, AT <: AbstractArray{T}} <: AbstractCache B₁::AT B₂::AT - function AdamCache(B::AbstractArray) - new{eltype(B), typeof(zero(B))}(zero(B), zero(B)) + function AdamCache(Y::AbstractArray) + new{eltype(Y), typeof(zero(Y))}(zero(Y), zero(Y)) end end struct MomentumCache{T, AT <: AbstractArray{T}} <:AbstractCache B::AT - function MomentumCache(B::AbstractArray) - new{eltype(B), typeof(zero(B))}(zero(B)) + function MomentumCache(Y::AbstractArray) + new{eltype(Y), typeof(zero(Y))}(zero(Y)) end end @@ -32,9 +33,10 @@ GradientCache(::AbstractArray) = GradientCache() ############################################################################# # All the setup_cache functions -setup_adam_cache(B::AbstractArray) = reshape([setup_adam_cache(b) for b in B], size(B)) -setup_momentum_cache(B::AbstractArray) = reshape([setup_momentum_cache(b) for b in B], size(B)) -setup_gradient_cache(B::AbstractArray) = reshape([setup_gradient_cache(b) for b in B], size(B)) +# I don't really understand what we need these for ??? +# setup_adam_cache(B::AbstractArray) = reshape([setup_adam_cache(b) for b in B], size(B)) +# setup_momentum_cache(B::AbstractArray) = reshape([setup_momentum_cache(b) for b in B], size(B)) +# setup_gradient_cache(B::AbstractArray) = reshape([setup_gradient_cache(b) for b in B], size(B)) setup_adam_cache(ps::NamedTuple) = apply_toNT(setup_adam_cache, ps) setup_momentum_cache(ps::NamedTuple) = apply_toNT(setup_momentum_cache, ps) diff --git a/test/arrays/array_tests.jl b/test/arrays/array_tests.jl index 323aa877c..fd9269046 100644 --- a/test/arrays/array_tests.jl +++ b/test/arrays/array_tests.jl @@ -125,6 +125,10 @@ function sympl_lie_alg_add_sub_test(N, n) @test all(abs.(projection(W₁ - W₂) .- S₄) .< 1e-10) end +function stiefel_lie_alg_vectorization_test(N, n; T=Float32) + A = rand(StiefelLieAlgHorMatrix{T}, N, n) + @test isapprox(StiefelLieAlgHorMatrix(vec(A), N, n), A) +end # TODO: tests for ADAM functions @@ -150,4 +154,5 @@ for (N, n) ∈ zip(N_vec, n_vec) sympl_proj_test(N,n) stiefel_lie_alg_add_sub_test(N,n) sympl_lie_alg_add_sub_test(N,n) + stiefel_lie_alg_vectorization_test(N, n) end diff --git a/test/optimizers/bfgs_optimizer.jl b/test/optimizers/bfgs_optimizer.jl new file mode 100644 index 000000000..cc540b063 --- /dev/null +++ b/test/optimizers/bfgs_optimizer.jl @@ -0,0 +1,108 @@ +using GeometricMachineLearning +using Zygote +using Test +using LinearAlgebra: norm, svd + +@doc raw""" +This tests the BFGS optimizer. 
+""" +function bfgs_optimizer(N) + B = inv(rand(N, N)) + loss(A) = norm(A - B) ^ (3) + A = randn(N, N) + loss1 = loss(A) + opt = BFGSOptimizer(1e-3) + optimizer_instance = Optimizer(opt, A) + ∇loss = gradient(loss, A)[1] + GeometricMachineLearning.bfgs_initialization!(optimizer_instance, optimizer_instance.cache, ∇loss) + A .= A + ∇loss + loss2 = loss(A) + ∇loss = gradient(loss, A)[1] + GeometricMachineLearning.bfgs_update!(optimizer_instance, optimizer_instance.cache, ∇loss) + A .= A + ∇loss + loss3 = loss(A) + @test loss1 > loss2 > loss3 +end + +bfgs_optimizer(10) + +function bfgs_optimizer2(N, n_iterates=10) + losses = zeros(n_iterates+2) + B = inv(rand(N, N)) + loss(A) = norm(A - B) ^ (3) + A = randn(N, N) + losses[1] = loss(A) + opt = BFGSOptimizer(1e-3) + optimizer_instance = Optimizer(opt, A) + ∇loss = gradient(loss, A)[1] + GeometricMachineLearning.bfgs_initialization!(optimizer_instance, optimizer_instance.cache, ∇loss) + A .= A + ∇loss + losses[2] = loss(A) + for i in 1:n_iterates + ∇loss = gradient(loss, A)[1] + GeometricMachineLearning.bfgs_update!(optimizer_instance, optimizer_instance.cache, ∇loss) + A .= A + ∇loss + losses[i+2] = loss(A) + end + losses +end + +# bfgs_optimizer2(10) + +function stiefel_optimizer_test(N, n; T=Float32, n_iterates=10, η=1f-2) + opt=BFGSOptimizer(T(η)) + A = rand(StiefelManifold{T}, N, n) + B = T(10)*randn(T, N, N) + ideal_A = svd(B).U[:, 1:n] + loss(A) = norm(B - A * A' * B) + ideal_error = norm(ideal_A) + losses = zeros(n_iterates+2) + losses[1] = loss(A) + optimizer_instance = Optimizer(opt, A) + λA = GlobalSection(A) + grad_loss = global_rep(λA, rgrad(A, gradient(loss, A)[1])) + GeometricMachineLearning.bfgs_initialization!(optimizer_instance, optimizer_instance.cache, grad_loss) + # geodesic for `grad_loss` + exp_grad_loss = GeometricMachineLearning.geodesic(grad_loss) + GeometricMachineLearning.apply_section!(A, λA, exp_grad_loss) + losses[2] = loss(A) + for i in 1:n_iterates + λA = GlobalSection(A) + grad_loss = global_rep(λA, rgrad(A, gradient(loss, A)[1])) + GeometricMachineLearning.bfgs_update!(optimizer_instance, optimizer_instance.cache, grad_loss) + # geodesic for `grad_loss` + exp_grad_loss = GeometricMachineLearning.geodesic(grad_loss) + GeometricMachineLearning.apply_section!(A, λA, exp_grad_loss) + losses[i+2] = loss(A) + end + losses, ideal_error, check(A) +end + +function stiefel_adam_test(N, n; T=Float32, n_iterates=10) + opt=AdamOptimizer() + A = rand(StiefelManifold{T}, N, n) + B = T(10)*randn(T, N, N) + ideal_A = svd(B).U[:, 1:n] + loss(A) = norm(B - A * A' * B) + ideal_error = norm(ideal_A) + losses = zeros(n_iterates+2) + losses[1] = loss(A) + optimizer_instance = Optimizer(opt, A) + λA = GlobalSection(A) + grad_loss = global_rep(λA, rgrad(A, gradient(loss, A)[1])) + GeometricMachineLearning.update!(optimizer_instance, optimizer_instance.cache, grad_loss) + # geodesic for `grad_loss` + exp_grad_loss = GeometricMachineLearning.geodesic(grad_loss) + GeometricMachineLearning.apply_section!(A, λA, exp_grad_loss) + losses[2] = loss(A) + for i in 1:n_iterates + λA = GlobalSection(A) + grad_loss = global_rep(λA, rgrad(A, gradient(loss, A)[1])) + GeometricMachineLearning.update!(optimizer_instance, optimizer_instance.cache, grad_loss) + # geodesic for `grad_loss` + exp_grad_loss = GeometricMachineLearning.geodesic(grad_loss) + GeometricMachineLearning.apply_section!(A, λA, exp_grad_loss) + losses[i+2] = loss(A) + end + losses, ideal_error, check(A) +end \ No newline at end of file diff --git 
a/test/optimizers/optimizer_convergence_tests/svd_optim.jl b/test/optimizers/optimizer_convergence_tests/svd_optim.jl index 165817461..492076458 100644 --- a/test/optimizers/optimizer_convergence_tests/svd_optim.jl +++ b/test/optimizers/optimizer_convergence_tests/svd_optim.jl @@ -23,7 +23,7 @@ A = [ 0.06476993260924702 0.8369280855305259 0.6245358125914054 0.140729967064 0.7317681833003449 0.9051355184962627 0.3376918522349117 0.436545092402125 0.3462196925686055 ] -function svd_test(A, n, train_steps=1000, tol=1e-1; retraction=Cayley()) +function svd_test(A, n, train_steps=2000, tol=1e-1; retraction=Cayley()) N = size(A,1) U, Σ, Vt = svd(A) U_result = U[:, 1:n] diff --git a/test/transformer_related/multi_head_attention_stiefel_retraction.jl b/test/transformer_related/multi_head_attention_stiefel_retraction.jl index 549502401..842069da5 100644 --- a/test/transformer_related/multi_head_attention_stiefel_retraction.jl +++ b/test/transformer_related/multi_head_attention_stiefel_retraction.jl @@ -7,6 +7,7 @@ import Random, Test, Lux, LinearAlgebra, KernelAbstractions using GeometricMachineLearning, Test using GeometricMachineLearning: geodesic using GeometricMachineLearning: cayley +using GeometricMachineLearning: init_optimizer_cache dim = 64 n_heads = 8