diff --git a/docs/src/optimizers/bfgs_optimizer.md b/docs/src/optimizers/bfgs_optimizer.md
new file mode 100644
index 000000000..b7bcd6e3a
--- /dev/null
+++ b/docs/src/optimizers/bfgs_optimizer.md
@@ -0,0 +1,137 @@
+# The BFGS Algorithm
+
+The presentation shown here is largely taken from chapters 3 and 6 of reference [wright2006numerical](@cite) with a derivation based on an [online comment](https://math.stackexchange.com/questions/2091867/quasi-newton-methods-understanding-dfp-updating-formula). The Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm is a second-order optimizer that can also be used to train a neural network.
+
+It is a version of a *quasi-Newton* method and is therefore especially suited for convex problems. As is the case with any other (quasi-)Newton method, the BFGS algorithm approximates the objective with a quadratic function in each optimization step:
+```math
+m_k(x) = f(x_k) + (\nabla_{x_k}f)^T(x - x_k) + \frac{1}{2}(x - x_k)^TB_k(x - x_k),
+```
+where ``B_k`` is referred to as the *approximate Hessian*. We further require ``B_k`` to be symmetric and positive definite. Differentiating the above expression and setting the derivative to zero gives us:
+```math
+\nabla_xm_k = \nabla_{x_k}f + B_k(x - x_k) = 0,
+```
+or written differently:
+```math
+x - x_k = -B_k^{-1}\nabla_{x_k}f.
+```
+From now on we will call this value ``p_k := x - x_k`` and refer to it as the *search direction*. The new iterate then is:
+```math
+x_{k+1} = x_k + \alpha_kp_k,
+```
+where ``\alpha_k`` is the *step length*. Techniques that describe how to pick an appropriate ``\alpha_k`` are called *line-search methods* and are discussed below. First we discuss what requirements we impose on ``B_k``. A first reasonable condition would be to require the gradient of ``m_k`` to be equal to that of ``f`` at the points ``x_{k-1}`` and ``x_k``:
+```math
+\begin{aligned}
+\nabla_{x_k}m_k & = \nabla_{x_k}f + B_k(x_k - x_k) & \overset{!}{=} \nabla_{x_k}f \text{ and } \\
+\nabla_{x_{k-1}}m_k & = \nabla_{x_k}f + B_k(x_{k-1} - x_k) & \overset{!}{=} \nabla_{x_{k-1}}f.
+\end{aligned}
+```
+The first one of these conditions is of course automatically satisfied. The second one can be rewritten as:
+```math
+B_k(x_k - x_{k-1}) \overset{!}{=} \nabla_{x_k}f - \nabla_{x_{k-1}}f.
+```
+
+The following notations are often used:
+```math
+s_{k-1} := \alpha_{k-1}p_{k-1} := x_{k} - x_{k-1} \text{ and } y_{k-1} := \nabla_{x_k}f - \nabla_{x_{k-1}}f.
+```
+
+The condition above then becomes:
+```math
+B_ks_{k-1} \overset{!}{=} y_{k-1},
+```
+and we call it the *secant equation*. A second condition we impose on ``B_k`` is that it has to be positive-definite along the direction ``s_{k-1}``:
+```math
+s_{k-1}^Ty_{k-1} > 0.
+```
+This is referred to as the *curvature condition*. If we impose the *Wolfe conditions*, the *curvature condition* holds automatically. The Wolfe conditions are stated with respect to the step length ``\alpha``.
+
+The *Wolfe conditions* are:
+1. ``f(x_k+\alpha{}p_k)\leq{}f(x_k) + c_1\alpha(\nabla_{x_k}f)^Tp_k`` for ``c_1\in(0,1)``.
+2. ``(\nabla_{(x_k + \alpha{}p_k)}f)^Tp_k \geq c_2(\nabla_{x_k}f)^Tp_k`` for ``c_2\in(c_1,1)``.
+
+A possible choice for ``c_1`` and ``c_2`` is ``c_1 = 10^{-4}`` and ``c_2 = 0.9`` (see [wright2006numerical](@cite)). The two Wolfe conditions above are called the *sufficient decrease condition* and the *curvature condition* respectively.
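+
+To make this more concrete, the following sketch shows a simple backtracking line search together with a check of the two Wolfe conditions. It is only meant as an illustration and is not part of `GeometricMachineLearning`; the functions `satisfies_wolfe` and `backtracking_line_search` as well as the quadratic test objective are made up for this example. (Also note that the `BFGSOptimizer` discussed below uses a fixed step length ``\eta`` instead of a line search.)
+```julia
+using LinearAlgebra: dot
+
+# check the two Wolfe conditions for a given step length α
+function satisfies_wolfe(f, ∇f, x, p, α; c₁ = 1e-4, c₂ = 0.9)
+    sufficient_decrease = f(x + α * p) ≤ f(x) + c₁ * α * dot(∇f(x), p)
+    curvature = dot(∇f(x + α * p), p) ≥ c₂ * dot(∇f(x), p)
+    sufficient_decrease && curvature
+end
+
+# backtracking: shrink α until the sufficient decrease condition holds
+function backtracking_line_search(f, ∇f, x, p; α = 1.0, ρ = 0.5, c₁ = 1e-4)
+    while f(x + α * p) > f(x) + c₁ * α * dot(∇f(x), p)
+        α *= ρ
+    end
+    α
+end
+
+# convex quadratic with minimum at [1, 2] and the steepest descent direction
+f(x) = sum(abs2, x - [1., 2.])
+∇f(x) = 2 * (x - [1., 2.])
+x = [0., 0.]
+p = -∇f(x)
+α = backtracking_line_search(f, ∇f, x, p)
+satisfies_wolfe(f, ∇f, x, p, α)
+```
+For this particular objective the backtracking search returns ``\alpha = 0.5``, and at this step length both Wolfe conditions are satisfied; a plain backtracking search by itself only guarantees the first (sufficient decrease) condition.
+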
+Note that the second Wolfe condition (also called curvature condition) is stronger than the curvature condition stated above: applying it to the step from ``x_{k-1}`` to ``x_k`` gives
+```math
+(\nabla_{x_k}f)^Tp_{k-1} - c_2(\nabla_{x_{k-1}}f)^Tp_{k-1} = y_{k-1}^Tp_{k-1} + (1 - c_2)(\nabla_{x_{k-1}}f)^Tp_{k-1} \geq 0,
+```
+and the second term in this expression is negative, because ``p_{k-1}`` is a descent direction: ``(\nabla_{x_{k-1}}f)^Tp_{k-1} = -(\nabla_{x_{k-1}}f)^TB_{k-1}^{-1}\nabla_{x_{k-1}}f < 0`` since ``B_{k-1}`` is positive definite. Hence ``y_{k-1}^Tp_{k-1} > 0``, and multiplying with ``\alpha_{k-1} > 0`` yields the curvature condition ``s_{k-1}^Ty_{k-1} > 0``.
+
+In order to pick the ideal ``B_k`` we solve the following problem:
+```math
+\begin{aligned}
+\min_B & ||B - B_{k-1}||_W \\
+\text{s.t.} & B = B^T\text{ and }Bs_{k-1}=y_{k-1},
+\end{aligned}
+```
+where the first condition is symmetry and the second one is the secant equation. For the norm ``||\cdot||_W`` we pick the weighted Frobenius norm:
+```math
+||A||_W := ||W^{1/2}AW^{1/2}||_F,
+```
+where ``||\cdot||_F`` is the usual Frobenius norm[^1] and the matrix ``W = \bar{B}_{k-1}^{-1}`` is the inverse of the *average Hessian*:
+```math
+\bar{B}_{k-1} = \int_0^1 \nabla^2f(x_{k-1} + \tau\alpha_{k-1}p_{k-1})d\tau.
+```
+[^1]: The Frobenius norm is ``||A||_F^2 = \sum_{i,j}a_{ij}^2``.
+
+In order to find the ideal ``B_k`` under the conditions described above, we introduce some notation:
+- ``\tilde{B}_{k-1} := W^{1/2}B_{k-1}W^{1/2}``,
+- ``\tilde{B} := W^{1/2}BW^{1/2}``,
+- ``\tilde{y}_{k-1} := W^{1/2}y_{k-1}``,
+- ``\tilde{s}_{k-1} := W^{-1/2}s_{k-1}``.
+
+With this notation we can rewrite the problem of finding ``B_k`` as:
+```math
+\begin{aligned}
+\min_{\tilde{B}} & ||\tilde{B} - \tilde{B}_{k-1}||_F \\
+\text{s.t.} & \tilde{B} = \tilde{B}^T\text{ and }\tilde{B}\tilde{s}_{k-1}=\tilde{y}_{k-1}.
+\end{aligned}
+```
+
+We further have ``Wy_{k-1} = s_{k-1}``; this is a consequence of the definition of the average Hessian, since by the fundamental theorem of calculus ``\bar{B}_{k-1}s_{k-1} = y_{k-1}``. Therefore ``\tilde{y}_{k-1} = \tilde{s}_{k-1}``.
+
+Now we rewrite ``B`` and ``B_{k-1}`` in a new basis ``U = [u|u_\perp]``, where ``u := \tilde{s}_{k-1}/||\tilde{s}_{k-1}||`` and ``u_\perp`` is an orthogonal complement[^2] of ``u``:
+
+[^2]: So we must have ``u^Tu_\perp=0`` and further ``u_\perp^Tu_\perp=\mathbb{I}``.
+
+```math
+\begin{aligned}
+U^T\tilde{B}_{k-1}U - U^T\tilde{B}U & = \begin{bmatrix} u^T \\ u_\perp^T \end{bmatrix}(\tilde{B}_{k-1} - \tilde{B})\begin{bmatrix} u & u_\perp \end{bmatrix} \\
+& = \begin{bmatrix}
+    u^T\tilde{B}_{k-1}u - 1 & u^T\tilde{B}_{k-1}u_\perp \\
+    u_\perp^T\tilde{B}_{k-1}u & u_\perp^T(\tilde{B}_{k-1}-\tilde{B})u_\perp
+\end{bmatrix},
+\end{aligned}
+```
+where we have used the constraint ``\tilde{B}\tilde{s}_{k-1} = \tilde{y}_{k-1} = \tilde{s}_{k-1}``, i.e. ``\tilde{B}u = u``. By a property of the Frobenius norm:
+```math
+||\tilde{B}_{k-1} - \tilde{B}||^2_F = (u^T\tilde{B}_{k-1}u - 1)^2 + ||u^T\tilde{B}_{k-1}u_\perp||_F^2 + ||u_\perp^T\tilde{B}_{k-1}u||_F^2 + ||u_\perp^T(\tilde{B}_{k-1} - \tilde{B})u_\perp||_F^2.
+```
+
+We see that ``\tilde{B}`` only appears in the last term, which should therefore be made zero. This then gives:
+```math
+\tilde{B} = U\begin{bmatrix} 1 & 0 \\ 0 & u^T_\perp\tilde{B}_{k-1}u_\perp \end{bmatrix}U^T = uu^T + (\mathbb{I}-uu^T)\tilde{B}_{k-1}(\mathbb{I}-uu^T).
+```
+
+If we now map back to the original coordinate system, the ideal solution for ``B_k`` is:
+```math
+B_k = (\mathbb{I} - \frac{1}{y_{k-1}^Ts_{k-1}}y_{k-1}s_{k-1}^T)B_{k-1}(\mathbb{I} - \frac{1}{y_{k-1}^Ts_{k-1}}s_{k-1}y_{k-1}^T) + \frac{1}{y_{k-1}^Ts_{k-1}}y_{k-1}y_{k-1}^T.
+```
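+
+As a quick sanity check, the following sketch verifies numerically that the update above produces a symmetric matrix that satisfies the secant equation ``B_ks_{k-1} = y_{k-1}``. It is not part of `GeometricMachineLearning`, just an illustration with randomly generated data:
+```julia
+using LinearAlgebra
+
+n = 5
+B_prev = Matrix(Symmetric(rand(n, n))) + n * I   # symmetric positive definite B_{k-1}
+s = rand(n)                                      # step s_{k-1}
+y = B_prev * s + 0.1 * rand(n)                   # gradient difference y_{k-1}; here s'y > 0 holds
+
+ρ = 1 / (y' * s)
+B = (I - ρ * y * s') * B_prev * (I - ρ * s * y') + ρ * y * y'
+
+B * s ≈ y    # secant equation
+B ≈ B'       # symmetry
+```
+Note that the update only requires ``s_{k-1}``, ``y_{k-1}`` and the previous approximation ``B_{k-1}``; the average Hessian ``\bar{B}_{k-1}`` never has to be computed explicitly.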
+
+What we need in practice however is not ``B_k``, but its inverse ``H_k``, since the search direction is computed as ``p_k = -H_k\nabla_{x_k}f``. To get ``H_k`` based on the expression for ``B_k`` above we can use the *Sherman-Morrison-Woodbury formula*[^3] to obtain:
+
+[^3]: The *Sherman-Morrison-Woodbury formula* states ``(A + UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1}``.
+
+```math
+H_{k} = H_{k-1} - \frac{H_{k-1}y_{k-1}y_{k-1}^TH_{k-1}}{y_{k-1}^TH_{k-1}y_{k-1}} + \frac{s_{k-1}s_{k-1}^T}{y_{k-1}^Ts_{k-1}}.
+```
+
+
+TODO: Example where this works well!
+
+## References
+@book{wright2006numerical,
+  title={Numerical Optimization},
+  author={Stephen J. Wright and Jorge Nocedal},
+  year={2006},
+  publisher={Springer Science+Business Media},
+  location={New York, NY}
+}
\ No newline at end of file
diff --git a/scripts/transformer.jl b/scripts/transformer.jl
deleted file mode 100644
index 0f43b50b0..000000000
--- a/scripts/transformer.jl
+++ /dev/null
@@ -1,114 +0,0 @@
-"""
-TODO: Add a better predictor at the end! It should set the biggest value of the softmax to 1 and the rest to zero!
-"""
-
-using GeometricMachineLearning, LinearAlgebra, ProgressMeter, Plots
-import Lux, Zygote, Random, MLDatasets, Flux, Lux.gpu
-
-#MNIST images are 28×28, so a sequence_length of 16 = 4² means the image patches are of size 7² = 49
-image_dim = 28
-patch_length = 7
-n_heads = 7
-n_layers = 5
-patch_number = (image_dim÷patch_length)^2
-
-train_x, train_y = MLDatasets.MNIST(split=:train)[:]
-test_x, test_y = MLDatasets.MNIST(split=:test)[:]
-
-#preprocessing steps
-train_x = Tuple(map(i -> sc_embed(split_and_flatten(train_x[:,:,i], patch_length)) #=|> gpu=#, 1:size(train_x,3)))
-test_x = Tuple(map(i -> sc_embed(split_and_flatten(test_x[:,:,i], patch_length)) #=|> gpu=#, 1:size(test_x,3)))
-
-#implement this encoding yourself!
-train_y = Flux.onehotbatch(train_y, 0:9) #|> gpu
-test_y = Flux.onehotbatch(test_y, 0:9) #|> gpu
-
-
-#encoder layer - final layer has to be added for evaluation purposes!
-Ψᵉ₁ = Lux.Chain(
-    #Embedding(patch_length^2, patch_number),
-    Transformer(patch_length^2, n_heads, n_layers, Stiefel=false),
-    Lux.Dense(patch_length^2, 10, Lux.σ, use_bias=false)
-)
-
-Ψᵉ₂ = Lux.Chain(
-    #Embedding(patch_length^2, patch_number),
-    Transformer(patch_length^2, n_heads, n_layers, Stiefel=true),
-    Lux.Dense(patch_length^2, 10, Lux.σ, use_bias=false)
-)
-
-#err_freq is the frequency with which the error is computed (e.g. every 100 steps)
-function transformer_training(Ψᵉ::Lux.Chain, batch_size=64, training_steps=10000, err_freq=100, o=AdamOptimizer())
-    ps, st = Lux.setup(Random.default_rng(), Ψᵉ) #.|> gpu
-
-    #loss_sing
-    #note that the loss here is the first column of the output of the entire model; similar to what was done in the ViT paper.
-    function loss_sing(ps, x, y)
-        norm(Lux.apply(Ψᵉ, x, ps, st)[1][:,1] - y)
-    end
-    function loss_sing(ps, train_x, train_y, index)
-        loss_sing(ps, train_x[index], train_y[:, index])
-    end
-    function full_loss(ps, train_x, train_y)
-        num = length(train_x)
-        mapreduce(index -> loss_sing(ps, train_x, train_y, index), +, 1:num)
-    end
-
-    num = length(train_x)
-
-    cache = init_optimizer_cache(o, ps)
-
-    loss_array = zeros(training_steps÷err_freq + 1)
-    loss_array[1] = full_loss(ps, train_x, train_y)/num
-    println("initial loss: ", loss_array[1])
-
-    @showprogress "Training network ..." 
for i in 1:training_steps - index₁ = Int(ceil(rand()*num)) - x = train_x[index₁] #|> gpu - y = train_y[:, index₁] #|> gpu - l, pb = Zygote.pullback(ps -> loss_sing(ps, x, y), ps) - dp = pb(one(l))[1] - #dp = Zyogte.gradient(ps -> loss_sing(ps, x, y), ps)[1] - - indices = Int.(ceil.(rand(batch_size -1)*num)) - for index in indices - x = train_x[index] #|> gpu - y = train_y[:, index] #|> gpu - l, pb = Zygote.pullback(ps -> loss_sing(ps, x, y), ps) - dp = _add(dp, pb(one(l))[1]) - end - optimization_step!(o, Ψᵉ, ps, cache, dp) - if i%err_freq == 0 - loss_array[1+i÷err_freq] = full_loss(ps, train_x, train_y)/num - end - end - println("final loss: ", loss_array[end]) - println("final test loss: ", full_loss(ps, test_x, test_y)/length(test_x),"\n") - - loss_array -end - -batch_size = 16 -training_steps = 10000 -err_freq = 100 -o = AdamOptimizer() - -loss_array₁ = transformer_training(Ψᵉ₁, batch_size, training_steps, err_freq, o) -loss_array₂ = transformer_training(Ψᵉ₂, batch_size, training_steps, err_freq, o) - -steps = vcat(1:err_freq:training_steps, training_steps+1) .- 1 - -p₁ = plot(steps, loss_array₁, label="Regular weights", linewidth=2, size=(800,500)) -plot!(p₁, steps, loss_array₂, label="Weights on the Stiefel manifold", linewidth=2) -ylims!(0.,2.2) -png(p₁, "transformer_stiefel_reg_comp") - - -loss_array₃ = transformer_training(Ψᵉ₂, batch_size, training_steps, err_freq, StandardOptimizer(0.001)) -loss_array₄ = transformer_training(Ψᵉ₂, batch_size, training_steps, err_freq, MomentumOptimizer(0.001, 0.5)) - -p₂ = plot(steps, loss_array₃, label="Standard Optimizer",linewidth=2, size=(800,500)) -plot!(p₂, steps, loss_array₂, label="Adam Optimizer",linewidth=2) -plot!(p₂, steps, loss_array₄, label="Momentum Optimizer",linewidth=2) -ylims!(0.,2.2) -png(p₂, "transformer_stiefel_ad_mom_stan") \ No newline at end of file diff --git a/scripts/transformer_new.jl b/scripts/transformer_bfgs.jl similarity index 86% rename from scripts/transformer_new.jl rename to scripts/transformer_bfgs.jl index daff099ee..30c5d9615 100644 --- a/scripts/transformer_new.jl +++ b/scripts/transformer_bfgs.jl @@ -15,7 +15,7 @@ image_dim = 28 patch_length = 7 transformer_dim = 49 n_heads = 7 -n_layers = 16 +n_layers = 1 number_of_patch = (image_dim÷patch_length)^2 batch_size = 2048 activation = softmax @@ -41,7 +41,6 @@ backend, train_x, test_x, train_y, test_y = test_y end - #encoder layer - final layer has to be added for evaluation purposes! model1 = Chain(Transformer(patch_length^2, n_heads, n_layers, Stiefel=false, add_connection=add_connection), Classification(patch_length^2, 10, activation)) @@ -50,7 +49,7 @@ model2 = Chain(Transformer(patch_length^2, n_heads, n_layers, Stiefel=true, add_ Classification(patch_length^2, 10, activation)) # err_freq is the frequency with which the error is computed (e.g. 
every 100 steps) -function transformer_training(Ψᵉ::Chain; backend=CPU(), n_epochs=100, opt=AdamOptimizer()) +function transformer_training(Ψᵉ::Chain; backend=backend, n_epochs=100, opt=BFGSOptimizer(1f-3)) # call data loader dl = DataLoader(train_x, train_y) dl_test = DataLoader(test_x, test_y) @@ -85,20 +84,19 @@ function transformer_training(Ψᵉ::Chain; backend=CPU(), n_epochs=100, opt=Ada loss_array, ps, total_time, accuracy_score end - -loss_array2, ps2, total_time2, accuracy_score2 = transformer_training(model2, backend=backend, n_epochs=n_epochs) -loss_array1, ps1, total_time1, accuracy_score1 = transformer_training(model1, backend=backend, n_epochs=n_epochs) -loss_array3, ps3, total_time3, accuracy_score3 = transformer_training(model2, backend=backend, n_epochs=n_epochs, opt=GradientOptimizer(0.001)) -loss_array4, ps4, total_time4, accuracy_score4 = transformer_training(model2, backend=backend, n_epochs=n_epochs, opt=MomentumOptimizer(0.001, 0.5)) +loss_array1, ps1, total_time1, accuracy_score1 = transformer_training(model1, backend=backend, n_epochs=n_epochs, opt=BFGSOptimizer(1f-3)) +loss_array2, ps2, total_time2, accuracy_score2 = transformer_training(model2, backend=backend, n_epochs=n_epochs, opt=BFGSOptimizer(1f-3)) +loss_array3, ps3, total_time3, accuracy_score3 = transformer_training(model2, backend=backend, n_epochs=n_epochs, opt=GradientOptimizer(1f-3)) +loss_array4, ps4, total_time4, accuracy_score4 = transformer_training(model2, backend=backend, n_epochs=n_epochs, opt=AdamOptimizer()) p1 = plot(loss_array1, color=1, label="Regular weights", ylimits=(0.,1.4), linewidth=2) -plot!(p1, loss_array2, color=2, label="Weights on Stiefel Manifold", linewidth=2) -png(p1, "Stiefel_Regular") +plot!(p1, loss_array2, color=3, label="Weights on Stiefel Manifold", linewidth=2) +png(p1, "BFGS_Stiefel_Regular") -p2 = plot(loss_array2, color=2, label="Adam", ylimits=(0.,1.4), linewidth=2) +p2 = plot(loss_array2, color=3, label="BFGS", ylimits=(0.,1.4), linewidth=2) plot!(p2, loss_array3, color=1, label="Gradient", linewidth=2) -plot!(p2, loss_array4, color=3, label="Momentum", linewidth=2) -png(p2, "Adam_Gradient_Momentum") +plot!(p2, loss_array4, color=2, label="Adam", linewidth=2) +png(p2, "BFGS_Gradient_Adam") text_string = "n_epochs: " * string(n_epochs) * "\n" diff --git a/scripts/transformer_mnist.jl b/scripts/transformer_mnist.jl new file mode 100644 index 000000000..6adc3e661 --- /dev/null +++ b/scripts/transformer_mnist.jl @@ -0,0 +1,105 @@ +""" +TODO: Add a better predictor at the end! It should set the biggest value of the softmax to 1 and the rest to zero! +""" + +using GeometricMachineLearning, ProgressMeter, Plots, CUDA +import Zygote, MLDatasets + +# MNIST images are 28×28, so a sequence_length of 16 = 4² means the image patches are of size 7² = 49 +image_dim = 28 +patch_length = 7 +transformer_dim = 49 +n_heads = 7 +n_layers = 16 +number_of_patch = (image_dim÷patch_length)^2 +batch_size = 2048 +activation = softmax +n_epochs = 1000 +add_connection = false + +train_x, train_y = MLDatasets.MNIST(split=:train)[:] +test_x, test_y = MLDatasets.MNIST(split=:test)[:] + +# use CUDA backend if available. 
else use CPU()
+backend, train_x, test_x, train_y, test_y = 
+    try 
+        CUDABackend(), 
+        train_x |> cu, 
+        test_x |> cu, 
+        train_y |> cu, 
+        test_y |> cu 
+    catch
+        CPU(), 
+        train_x, 
+        test_x, 
+        train_y, 
+        test_y 
+end
+
+dl = DataLoader(train_x, train_y)
+dl_test = DataLoader(test_x, test_y)
+
+# the difference between the first and the second model is that we put the weights on the Stiefel manifold in the second case
+model1 = ClassificationTransformer(dl, n_heads=n_heads, n_layers=n_layers, Stiefel=false, add_connection=add_connection)
+model2 = ClassificationTransformer(dl, n_heads=n_heads, n_layers=n_layers, Stiefel=true, add_connection=add_connection)
+
+batch = Batch(batch_size)
+
+# err_freq is the frequency with which the error is computed (e.g. every 100 steps)
+function transformer_training(Ψᵉ::GeometricMachineLearning.Architecture; n_epochs=100, opt=AdamOptimizer())
+    nn = NeuralNetwork(Ψᵉ, backend, eltype(dl))
+    optimizer_instance = Optimizer(opt, nn)
+
+    println("initial test accuracy: ", GeometricMachineLearning.accuracy(nn, dl_test), "\n")
+
+    progress_object = Progress(n_epochs; enabled=true)
+
+    # use the `time` function to get the system time.
+    init_time = time()
+    total_time = time() - init_time
+
+    loss_array = zeros(eltype(train_x), n_epochs)
+    for i in 1:n_epochs
+        # there is some functionality in a recent PR that streamlines some of this -> make sure to include this!
+        loss_val = optimize_for_one_epoch!(optimizer_instance, nn, dl, batch)
+
+        ProgressMeter.next!(progress_object; showvalues = [(:TrainingLoss, loss_val)])
+        loss_array[i] = loss_val
+
+        # update runtime
+        total_time = time() - init_time
+    end
+
+    accuracy_score = GeometricMachineLearning.accuracy(nn, dl_test)
+    println("final test accuracy: ", accuracy_score, "\n")
+
+    loss_array, nn, total_time, accuracy_score
+end
+
+
+loss_array2, nn2, total_time2, accuracy_score2 = transformer_training(model2, n_epochs=n_epochs)
+loss_array1, nn1, total_time1, accuracy_score1 = transformer_training(model1, n_epochs=n_epochs)
+loss_array3, nn3, total_time3, accuracy_score3 = transformer_training(model2, n_epochs=n_epochs, opt=GradientOptimizer(0.001))
+loss_array4, nn4, total_time4, accuracy_score4 = transformer_training(model2, n_epochs=n_epochs, opt=MomentumOptimizer(0.001, 0.5))
+
+p1 = plot(loss_array1, color=1, label="Regular weights", ylimits=(0.,1.4), linewidth=2)
+plot!(p1, loss_array2, color=2, label="Weights on Stiefel Manifold", linewidth=2)
+png(p1, "Stiefel_Regular")
+
+p2 = plot(loss_array2, color=2, label="Adam", ylimits=(0.,1.4), linewidth=2)
+plot!(p2, loss_array3, color=1, label="Gradient", linewidth=2)
+plot!(p2, loss_array4, color=3, label="Momentum", linewidth=2)
+png(p2, "Adam_Gradient_Momentum")
+
+text_string = 
+    "n_epochs: " * string(n_epochs) * "\n" *
+    "Regular weights: time: " * string(total_time1) * " classification accuracy: " * string(accuracy_score1) * "\n" *
+    "Stiefel weights: time: " * string(total_time2) * " classification accuracy: " * string(accuracy_score2) * "\n" *
+    "GradientOptimizer: time: " * string(total_time3) * " classification accuracy: " * string(accuracy_score3) * "\n" *
+    "MomentumOptimizer: time: " * string(total_time4) * " classification accuracy: " * string(accuracy_score4) * "\n"
+
+display(text_string)
+
+open("measure_times"*string(backend), "w") do file
+    write(file, text_string)
+end
diff --git a/src/GeometricMachineLearning.jl b/src/GeometricMachineLearning.jl
index aa0262bb0..cc6d70674 100644
--- a/src/GeometricMachineLearning.jl
+++ b/src/GeometricMachineLearning.jl
@@ -45,6 +45,9 @@ module GeometricMachineLearning
 
     export σ, sigmoid, softmax
 
+    # the functionality in the script doesn't require anything else defined in GML, but some of the other scripts in that folder do.
+    include("data_loader/data_loader.jl")
+
     include("kernels/assign_q_and_p.jl")
     include("kernels/tensor_mat_mul.jl")
     include("kernels/tensor_tensor_mul.jl")
@@ -91,7 +94,6 @@ module GeometricMachineLearning
     include("arrays/skew_symmetric.jl")
     include("arrays/stiefel_lie_algebra_horizontal.jl")
    include("arrays/grassmann_lie_algebra_horizontal.jl")
-    include("arrays/auxiliary.jl")
 
     export SymmetricMatrix, SymplecticPotential, SkewSymMatrix
     export StiefelLieAlgHorMatrix
@@ -110,6 +112,8 @@ module GeometricMachineLearning
     # include("manifolds/symplectic_stiefel_manifold.jl")
     include("manifolds/grassmann_manifold.jl")
 
+    include("arrays/stiefel_projection.jl")
+
     export StiefelManifold, SymplecticStiefelManifold, GrassmannManifold, Manifold
 
     export rgrad, metric
@@ -119,6 +123,8 @@ module GeometricMachineLearning
     include("optimizers/gradient_optimizer.jl")
     include("optimizers/momentum_optimizer.jl")
     include("optimizers/adam_optimizer.jl")
+    include("optimizers/bfgs_cache.jl")
+    include("optimizers/bfgs_optimizer.jl")
     include("optimizers/init_optimizer_cache.jl")
 
     include("optimizers/manifold_related/global_sections.jl")
@@ -150,10 +156,10 @@ module GeometricMachineLearning
     export GradientOptimizer, GradientCache
     export MomentumOptimizer, MomentumCache
     export AdamOptimizer, AdamCache
+    export BFGSOptimizer, BFGSCache
     export Optimizer
     export optimization_step!
 
-    export init_optimizer_cache
 
     export GlobalSection, apply_section
     export global_rep
@@ -234,6 +240,7 @@ module GeometricMachineLearning
     include("architectures/variable_width_network.jl")
     include("architectures/recurrent_neural_network.jl")
     include("architectures/LSTM_neural_network.jl")
+    include("architectures/transformer_neural_network.jl")
 
     export HamiltonianNeuralNetwork
     export LagrangianNeuralNetwork
@@ -242,11 +249,11 @@ module GeometricMachineLearning
     export GSympNet
     export RecurrentNeuralNetwork
     export LSTMNeuralNetwork
+    export ClassificationTransformer
 
     export train!, apply!, jacobian!
     export Iterate_Sympnet
 
-    export default_arch
 
     include("architectures/default_architecture.jl")
@@ -356,7 +363,6 @@ module GeometricMachineLearning
     export integrate, integrate_step!
 
     include("integrator/sympnet_integrator.jl")
 
-    include("reduced_system/system_type.jl")
     include("reduced_system/reduced_system.jl")
 
diff --git a/src/architectures/transformer_neural_network.jl b/src/architectures/transformer_neural_network.jl
new file mode 100644
index 000000000..1ca4ae4ec
--- /dev/null
+++ b/src/architectures/transformer_neural_network.jl
@@ -0,0 +1,32 @@
+@doc raw"""
+This is a transformer neural network for classification purposes. At the moment this is only used for training on MNIST, but can in theory be used for any classification problem.
+
+It has to be called with a `DataLoader` that stores an input and an output tensor. The optional arguments are:
+- `n_heads`: The number of heads in the `MultiHeadAttention` (mha) layers. Default: `7`.
+- `n_layers`: The number of transformer layers. Default: `16`.
+- `activation`: The activation function. Default: `softmax`.
+- `Stiefel`: Whether the matrices in the mha layers are on the **Stiefel manifold**.
+- `add_connection`: Whether the input is appended to the output of the mha layer. 
(skip connection) +""" +struct ClassificationTransformer{AT} <: Architecture + transformer_dim::Int + classification_dim::Int + n_heads::Int + n_layers::Int + activation::AT + Stiefel::Bool + add_connection::Bool + function ClassificationTransformer(transformer_dim::Int, classification_dim::Int, n_heads::Int, n_layers::Int, σ::AT, Stiefel::Bool, add_connection::Bool) where AT + new{AT}(transformer_dim, classification_dim, n_heads, n_layers, σ, Stiefel, add_connection) + end + function ClassificationTransformer(dl::DataLoader{T, BT, CT}; n_heads::Int=7, n_layers::Int=16, activation::AT=softmax, Stiefel::Bool=true, add_connection::Bool=true) where {T, T1, BT<:AbstractArray{T, 3}, CT<:AbstractArray{T1, 3}, AT} + new{AT}(dl.input_dim, dl.output_dim, n_heads, n_layers, activation, Stiefel, add_connection) + end +end + +function Chain(arch::ClassificationTransformer) + Chain( + Transformer(arch.transformer_dim, arch.n_heads, arch.n_layers, Stiefel=arch.Stiefel, add_connection=arch.add_connection).layers..., + Classification(arch.transformer_dim, arch.classification_dim, arch.activation) + ) +end \ No newline at end of file diff --git a/src/arrays/auxiliary.jl b/src/arrays/auxiliary.jl deleted file mode 100644 index a5497779e..000000000 --- a/src/arrays/auxiliary.jl +++ /dev/null @@ -1,35 +0,0 @@ -""" -Auxiliary arrays - these should proably be put into different files or given another name. -""" - -struct StiefelProjection{T} <: AbstractMatrix{T} - N::Integer - n::Integer - StiefelProjection(N, n, T=Float64) = new{T}(N,n) -end - -function Base.getindex(::StiefelProjection{T},i,j) where T - if i == j - return T(1.) - end - return T(0.) -end - -Base.parent(E::StiefelProjection) = (E.N, E.n) - -Base.size(E::StiefelProjection) = (E.N, E.n) - -struct One{T} <: AbstractMatrix{T} - n::Integer - One(n, T=Float64) = new{T}(n) -end - -function Base.getindex(::One{T}, i, j) where T -if i == j - return T(1.) - end - return T(0.) -end -Base.parent(mat::One) = (mat.n) -Base.size(mat::One) = (mat.n, mat.n) - diff --git a/src/arrays/skew_symmetric.jl b/src/arrays/skew_symmetric.jl index c3e91ab81..e1ab85fe2 100644 --- a/src/arrays/skew_symmetric.jl +++ b/src/arrays/skew_symmetric.jl @@ -188,4 +188,27 @@ end # the first matrix is multiplied onto A2 in order for it to not be SkewSymMatrix! function Base.:*(A1::SkewSymMatrix{T}, A2::SkewSymMatrix{T}) where T A1*(one(A2)*A2) +end + +@doc raw""" +If `vec` is applied onto `SkewSymMatrix`, then the output is the associated vector. 
+""" +function Base.vec(A::SkewSymMatrix) + A.S +end + +function Base.zero(A::SkewSymMatrix) + SkewSymMatrix(zero(A.S), A.n) +end + +function KernelAbstractions.get_backend(A::SkewSymMatrix) + KernelAbstractions.get_backend(A.S) +end + +function assign!(B::SkewSymMatrix{T}, C::SkewSymMatrix{T}) where T + B.S .= C.S +end + +function Base.copy(A::SkewSymMatrix) + SkewSymMatrix(copy(A.S), A.n) end \ No newline at end of file diff --git a/src/arrays/stiefel_lie_algebra_horizontal.jl b/src/arrays/stiefel_lie_algebra_horizontal.jl index a1a1a6f28..71558e485 100644 --- a/src/arrays/stiefel_lie_algebra_horizontal.jl +++ b/src/arrays/stiefel_lie_algebra_horizontal.jl @@ -168,4 +168,55 @@ function LinearAlgebra.mul!(C::StiefelLieAlgHorMatrix, A::StiefelLieAlgHorMatrix mul!(C.B, A.B, α) end LinearAlgebra.mul!(C::StiefelLieAlgHorMatrix, α::Real, A::StiefelLieAlgHorMatrix) = mul!(C, A, α) -LinearAlgebra.rmul!(C::StiefelLieAlgHorMatrix, α::Real) = mul!(C, C, α) \ No newline at end of file +LinearAlgebra.rmul!(C::StiefelLieAlgHorMatrix, α::Real) = mul!(C, C, α) + +function Base.vec(A::StiefelLieAlgHorMatrix) + vcat(vec(A.A), vec(A.B)) +end + +function StiefelLieAlgHorMatrix(V::AbstractVector, N::Int, n::Int) + # length of skew-symmetric matrix + skew_sym_size = n*(n-1)÷2 + # size of matrix component + matrix_size = (N-n)*n + @assert length(V) == skew_sym_size + matrix_size + StiefelLieAlgHorMatrix( + SkewSymMatrix(@view(V[1:skew_sym_size]), n), + reshape(@view(V[(skew_sym_size+1):(skew_sym_size+matrix_size)]), (N-n), n), + N, + n + ) +end + +function Base.zero(B::StiefelLieAlgHorMatrix) + StiefelLieAlgHorMatrix( + zero(B.A), + zero(B.B), + B.N, + B.n + ) +end + +function KernelAbstractions.get_backend(B::StiefelLieAlgHorMatrix) + KernelAbstractions.get_backend(B.B) +end + +# assign funciton; also implement this for other arrays! +function assign!(B::StiefelLieAlgHorMatrix{T}, C::StiefelLieAlgHorMatrix{T}) where T + assign!(B.A, C.A) + B.B .= C.B +end + +function Base.copy(B::StiefelLieAlgHorMatrix) + StiefelLieAlgHorMatrix( + copy(B.A), + copy(B.B), + B.N, + B.n + ) +end + +# fallback -> put this somewhere else! +function assign!(A::AbstractArray, B::AbstractArray) + A .= B +end \ No newline at end of file diff --git a/src/arrays/stiefel_projection.jl b/src/arrays/stiefel_projection.jl new file mode 100644 index 000000000..82c04c2ac --- /dev/null +++ b/src/arrays/stiefel_projection.jl @@ -0,0 +1,51 @@ +@doc raw""" +An array that essentially does `vcat(I(n), zeros(N-n, n))` with GPU support. It has **three inner constructors**. The **first one** is called with the following arguments: +1. `backend`: backends as supported by `KernelAbstractions`. +2. `T::Type` +3. `N::Integer` +4. `n::Integer` + +The **second constructor** is called by supplying a matrix as input. The constructor will then extract the backend, the type and the dimensions of that matrix. + +The **third constructor** is called by supplying an instance of `StiefelLieAlgHorMatrix`. + +Technically this should be a subtype of `StiefelManifold`. +""" +struct StiefelProjection{T, AT} <: AbstractMatrix{T} + N::Integer + n::Integer + A::AT + function StiefelProjection(backend, T::Type, N::Integer, n::Integer) + A = KernelAbstractions.zeros(backend, T, N, n) + assign_ones_for_stiefel_projection! 
= assign_ones_for_stiefel_projection_kernel!(backend) + assign_ones_for_stiefel_projection!(A, ndrange=n) + new{T, typeof(A)}(N,n, A) + end + + function StiefelProjection(A::AbstractMatrix{T}) where T + StiefelProjection(KernelAbstractions.get_backend(A), T, size(A)...) + end + + function StiefelProjection(B::StiefelLieAlgHorMatrix{T}) where T + StiefelProjection(KernelAbstractions.get_backend(B), T, B.N, B.n) + end +end + +@kernel function assign_ones_for_stiefel_projection_kernel!(A::AbstractArray{T}) where T + i = @index(Global) + A[i, i] = one(T) +end + +""" +Outer constructor for `StiefelProjection`. This works with two integers as input and optionally the type. +""" +StiefelProjection(N::Integer, n::Integer, T::Type=Float64) = StiefelProjection(CPU(), T, N, n) + +Base.size(E::StiefelProjection) = (E.N, E.n) +Base.getindex(E::StiefelProjection, i, j) = getindex(E.A, i, j) +Base.:+(E::StiefelProjection, A::AbstractMatrix) = E.A + A +Base.:+(A::AbstractMatrix, E::StiefelProjection) = +(E, A) +Base.vcat(A::AbstractVecOrMat{T}, E::StiefelProjection{T}) where {T<:Number} = vcat(A, E.A) +Base.vcat(E::StiefelProjection{T}, A::AbstractVecOrMat{T}) where {T<:Number} = vcat(E.A, A) +Base.hcat(A::AbstractVecOrMat{T}, E::StiefelProjection{T}) where {T<:Number} = hcat(A, E.A) +Base.hcat(E::StiefelProjection{T}, A::AbstractVecOrMat{T}) where {T<:Number} = hcat(E.A, A) \ No newline at end of file diff --git a/src/data_loader/batch.jl b/src/data_loader/batch.jl index c85ede47c..bbb9e4e39 100644 --- a/src/data_loader/batch.jl +++ b/src/data_loader/batch.jl @@ -83,6 +83,10 @@ function optimize_for_one_epoch!(opt::Optimizer, model, ps::Union{Tuple, NamedTu optimize_for_one_epoch!(opt, model, ps, dl, batch, loss) end +function optimize_for_one_epoch!(opt::Optimizer, nn::NeuralNetwork, dl::DataLoader, batch::Batch) + optimize_for_one_epoch!(opt, nn.model, nn.params, dl, batch) +end + """ TODO: Add ProgressMeter!!! """ @@ -105,7 +109,6 @@ end function (o::Optimizer)(nn::NeuralNetwork, dl::DataLoader, batch::Batch, n_epochs::Int, loss) - progress_object = ProgressMeter.Progress(n_epochs; enabled=true) loss_array = zeros(n_epochs) for i in 1:n_epochs diff --git a/src/data_loader/data_loader.jl b/src/data_loader/data_loader.jl index 34d65c71a..ae6a45089 100644 --- a/src/data_loader/data_loader.jl +++ b/src/data_loader/data_loader.jl @@ -47,7 +47,7 @@ end The DataLoader is called with a tensor and a vector. For the moment this is always interpreted to be the MNIST data set. """ function DataLoader(data::AbstractArray{T, 3}, target::AbstractVector{T1}; patch_length=7) where {T, T1} - @info "You provided a tensor and a vector as input. This will be treated as a classification problem (MNIST). Tensor axes: (i) & (ii) image axes and (iii) parameter dimesnion." + @info "You provided a tensor and a vector as input. This will be treated as a classification problem (MNIST). Tensor axes: (i) & (ii) image axes and (iii) parameter dimension." 
im_dim₁, im_dim₂, n_params = size(data) @assert length(target) == n_params number_of_patches = (im_dim₁÷patch_length)*(im_dim₂÷patch_length) @@ -153,4 +153,8 @@ function accuracy(model::Chain, ps::Tuple, dl::DataLoader{T, AT, BT}) where {T, # get tensor of maximum elements tensor_of_maximum_elements[ind] .= T1(1) (size(dl.output, 3)-sum(abs.(dl.output - tensor_of_maximum_elements))/T1(2))/size(dl.output, 3) -end \ No newline at end of file +end + +accuracy(nn::NeuralNetwork, dl::DataLoader) = accuracy(nn.model, nn.params, dl) + +Base.eltype(::DataLoader{T}) where T = T \ No newline at end of file diff --git a/src/layers/multi_head_attention.jl b/src/layers/multi_head_attention.jl index 0ae8757f6..05c3c53cc 100644 --- a/src/layers/multi_head_attention.jl +++ b/src/layers/multi_head_attention.jl @@ -75,7 +75,10 @@ function initialparameters(backend::KernelAbstractions.Backend, T::Type, d::Mult (PQ=PQ, PK=PK, PV=PV) end -function (d::MultiHeadAttention{M, M, Stiefel, Retraction, true})(x::AbstractMatrix{T}, ps::NamedTuple) where {M, Stiefel, Retraction, T} +@doc raw""" +Applies MHA to an abstract matrix. This is the same independent of whether the input is added to the output or not. +""" +function compute_output_of_mha(d::MultiHeadAttention{M, M}, x::AbstractMatrix{T}, ps::NamedTuple) where {M, T} dim, input_length = size(x) @assert dim == M @@ -84,35 +87,18 @@ function (d::MultiHeadAttention{M, M, Stiefel, Retraction, true})(x::AbstractMat key = Symbol("head_"*string(i)) output = vcat(output, ps.PV[key]'*x*softmax((ps.PQ[key]'*x)'*(ps.PK[key]'*x)/T(sqrt(dim)))) end - x + output -end - -function (d::MultiHeadAttention{M, M, Stiefel, Retraction, false})(x::AbstractMatrix{T}, ps::NamedTuple) where {M, Stiefel, Retraction, T} - dim, input_length = size(x) - @assert dim == M - - output = similar(x, 0, input_length) - for i in 1:d.n_heads - key = Symbol("head_"*string(i)) - output = vcat(output, ps.PV[key]'*x*softmax((ps.PQ[key]'*x)'*(ps.PK[key]'*x)/T(sqrt(dim)))) - end output end -function (d::MultiHeadAttention{M, M, Stiefel, Retraction, true})(x::AbstractArray{T, 3}, ps::NamedTuple) where {M, Stiefel, Retraction, T} + +function compute_output_of_mha(d::MultiHeadAttention{M, M}, x::AbstractArray{T, 3}, ps::NamedTuple) where {M, T} Dₕ = M ÷ d.n_heads dim, input_length, number_data = size(x) @assert dim == M - + # initialize the output output = similar(x, 0, input_length, number_data) - # initialize the results of various tensor matrix multiplications - Q_tensor = similar(x, Dₕ, input_length, number_data) - K_tensor = similar(x, Dₕ, input_length, number_data) - V_tensor = similar(x, Dₕ, input_length, number_data) - QK_tensor = similar(x, input_length, input_length, number_data) - # this is the result of a single head attention block single_head_output = similar(x, Dₕ, input_length, number_data) @@ -127,35 +113,15 @@ function (d::MultiHeadAttention{M, M, Stiefel, Retraction, true})(x::AbstractArr output = vcat(output, single_head_output) # KernelAbstractions.synchronize(backend) end - x + output + output end -function (d::MultiHeadAttention{M, M, Stiefel, Retraction, false})(x::AbstractArray{T, 3}, ps::NamedTuple) where {M, Stiefel, Retraction, T} - Dₕ = M ÷ d.n_heads - dim, input_length, number_data = size(x) - @assert dim == M - - output = similar(x, 0, input_length, number_data) - - Q_tensor = similar(x, Dₕ, input_length, number_data) - K_tensor = similar(x, Dₕ, input_length, number_data) - V_tensor = similar(x, Dₕ, input_length, number_data) - QK_tensor = similar(x, input_length, input_length, 
number_data)
-
-    single_head_output = similar(x, Dₕ, input_length, number_data)
-
-    for i in 1:d.n_heads
-        key = Symbol("head_"*string(i))
-        Q_tensor = mat_tensor_mul(ps.PQ[key]', x)
-        K_tensor = mat_tensor_mul(ps.PK[key]', x)
-        V_tensor = mat_tensor_mul(ps.PV[key]', x)
-        QK_tensor = tensor_transpose_tensor_mul(Q_tensor, K_tensor)
+function (d::MultiHeadAttention{M, M, Stiefel, Retraction, true})(x::AbstractArray, ps::NamedTuple) where {M, Stiefel, Retraction}
+    x + compute_output_of_mha(d, x, ps)
+end
 
-        single_head_output = tensor_tensor_mul(V_tensor, softmax(QK_tensor/T(sqrt(dim))))
-        output = vcat(output, single_head_output)
-        # KernelAbstractions.synchronize(backend)
-    end
-    output
+function (d::MultiHeadAttention{M, M, Stiefel, Retraction, false})(x::AbstractArray, ps::NamedTuple) where {M, Stiefel, Retraction}
+    compute_output_of_mha(d, x, ps)
 end
 
 import ChainRules
diff --git a/src/manifolds/stiefel_manifold.jl b/src/manifolds/stiefel_manifold.jl
index cbe750c55..4149dcd2d 100644
--- a/src/manifolds/stiefel_manifold.jl
+++ b/src/manifolds/stiefel_manifold.jl
@@ -1,8 +1,10 @@
+@doc raw"""
+An implementation of the Stiefel manifold. It has various convenience functions associated with it:
+- check
+- rand
+- rgrad
+- metric
 """
-maybe consider dividing the output in the check functions by n!
-TODO: Implement sampling procedures!!
-"""
-
 mutable struct StiefelManifold{T, AT <: AbstractMatrix{T}} <: Manifold{T}
     A::AT
     function StiefelManifold(A::AbstractMatrix)
@@ -65,10 +67,30 @@ function Base.:*(Y::Adjoint{T, StiefelManifold{T, AT}}, B::AbstractMatrix) where
     Y.parent.A'*B
 end
 
+@doc raw"""
+Computes the Riemannian gradient for the Stiefel manifold given an element ``Y\in{}St(N,n)`` and a matrix ``\nabla{}L\in\mathbb{R}^{N\times{}n}`` (the Euclidean gradient). It computes the Riemannian gradient with respect to the canonical metric (see the documentation for the function `metric` for an explanation of this).
+The precise form of the mapping is:
+```math
+\mathtt{rgrad}(Y, \nabla{}L) \mapsto \nabla{}L - Y(\nabla{}L)^TY
+```
+It is called with inputs:
+- `Y::StiefelManifold`
+- `e_grad::AbstractMatrix`: i.e. the Euclidean gradient (what was called ``\nabla{}L`` above).
+"""
 function rgrad(Y::StiefelManifold, e_grad::AbstractMatrix)
     e_grad - Y.A*(e_grad'*Y.A)
 end
 
+@doc raw"""
+Implements the canonical Riemannian metric for the Stiefel manifold:
+```math
+g_Y: (\Delta_1, \Delta_2) \mapsto \mathrm{tr}(\Delta_1^T(\mathbb{I} - \frac{1}{2}YY^T)\Delta_2).
+```
+It is called with:
+- `Y::StiefelManifold`
+- `Δ₁::AbstractMatrix`
+- `Δ₂::AbstractMatrix`
+"""
 function metric(Y::StiefelManifold, Δ₁::AbstractMatrix, Δ₂::AbstractMatrix)
     LinearAlgebra.tr(Δ₁'*(I - .5*Y.A*Y.A')*Δ₂)
 end
diff --git a/src/optimizers/adam_optimizer.jl b/src/optimizers/adam_optimizer.jl
index 8e5378f71..8dd5d895b 100644
--- a/src/optimizers/adam_optimizer.jl
+++ b/src/optimizers/adam_optimizer.jl
@@ -11,13 +11,13 @@ struct AdamOptimizer{T<:Real} <: OptimizerMethod
     AdamOptimizer(η = 1f-3, ρ₁ = 9f-1, ρ₂ = 9.9f-1, δ = 3f-7; T=typeof(η)) = new{T}(T(η), T(ρ₁), T(ρ₂), T(δ))
 end
 
-function update!(o::Optimizer{<:AdamOptimizer{T}}, C::AdamCache, B::AbstractVecOrMat) where T
+function update!(o::Optimizer{<:AdamOptimizer{T}}, C::AdamCache, B::AbstractArray) where T
     add!(C.B₁, ((o.method.ρ₁ - o.method.ρ₁^o.step)/(T(1.) - o.method.ρ₁^o.step))*C.B₁, ((T(1.) - o.method.ρ₁)/(T(1.) - o.method.ρ₁^o.step))*B)
     add!(C.B₂, ((o.method.ρ₂ - o.method.ρ₂^o.step)/(T(1.) - o.method.ρ₂^o.step))*C.B₂, ((T(1.) - o.method.ρ₂)/(T(1.) - o.method.ρ₂^o.step))*⊙²(B))
     mul!(B, -o.method.η, /ᵉˡᵉ(C.B₁, scalar_add(racᵉˡᵉ(C.B₂), o.method.δ)))
 end
 
-#fallbacks:
+# defaults:
 ⊙²(A::AbstractVecOrMat) = A.^2
 racᵉˡᵉ(A::AbstractVecOrMat) = sqrt.(A)
 /ᵉˡᵉ(A::AbstractVecOrMat, B::AbstractVecOrMat) = A./B
diff --git a/src/optimizers/bfgs_cache.jl b/src/optimizers/bfgs_cache.jl
new file mode 100644
index 000000000..26f1c0b96
--- /dev/null
+++ b/src/optimizers/bfgs_cache.jl
@@ -0,0 +1,51 @@
+@doc raw"""
+The cache for the BFGS optimizer.
+
+It stores the gradient from the previous time step `B`, the previous update step `S` and the inverse of the Hessian matrix `H`.
+
+It is important to note that setting up this cache already requires a derivative! This is not the case for the other optimizers.
+"""
+struct BFGSCache{T, AT<:AbstractArray{T}} <: AbstractCache
+    B::AT
+    S::AT
+    H::AbstractMatrix{T}
+    function BFGSCache(B::AbstractArray)
+        new{eltype(B), typeof(zero(B))}(zero(B), zero(B), initialize_hessian_inverse(zero(B)))
+    end
+end
+
+@doc raw"""
+In order to initialize `BFGSCache` we first need gradient information. This is why we initially have this `BFGSDummyCache` until gradient information is available.
+
+NOTE: we may not need this.
+"""
+struct BFGSDummyCache{T, AT<:AbstractArray{T}} <: AbstractCache
+    function BFGSDummyCache(B::AbstractArray)
+        new{eltype(B), typeof(zero(B))}()
+    end
+end
+
+@kernel function assign_diagonal_ones_kernel!(B::AbstractMatrix{T}) where T
+    i = @index(Global)
+    B[i, i] = one(T)
+end
+
+@doc raw"""
+This initializes the inverse of the Hessian for various arrays. This requires an implementation of a *vectorization operation* `vec`. This is important for custom arrays.
+"""
+function initialize_hessian_inverse(B::AbstractArray{T}) where T
+    length_of_array = length(vec(B))
+    backend = KernelAbstractions.get_backend(B)
+    H = KernelAbstractions.zeros(backend, T, length_of_array, length_of_array)
+    assign_diagonal_ones! = assign_diagonal_ones_kernel!(backend)
+    assign_diagonal_ones!(H, ndrange=length_of_array)
+    H
+end
+
+setup_bfgs_cache(ps::NamedTuple) = apply_toNT(setup_bfgs_cache, ps)
+setup_bfgs_cache(ps::Tuple) = Tuple([setup_bfgs_cache(x) for x in ps])
+setup_bfgs_cache(B::AbstractArray) = BFGSCache(B)
+
+setup_bfgs_dummy_cache(ps::NamedTuple) = apply_toNT(setup_bfgs_dummy_cache, ps)
+setup_bfgs_dummy_cache(ps::Tuple) = Tuple([setup_bfgs_dummy_cache(x) for x in ps])
+setup_bfgs_dummy_cache(B::AbstractArray) = BFGSDummyCache(B)
\ No newline at end of file
diff --git a/src/optimizers/bfgs_optimizer.jl b/src/optimizers/bfgs_optimizer.jl
new file mode 100644
index 000000000..a57bd8ae4
--- /dev/null
+++ b/src/optimizers/bfgs_optimizer.jl
@@ -0,0 +1,48 @@
+@doc raw"""
+This is an implementation of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) optimizer.
+"""
+struct BFGSOptimizer{T<:Real} <: OptimizerMethod
+    η::T
+    δ::T
+
+    function BFGSOptimizer(η::T = 1f-2, δ=1f-7) where T
+        new{T}(η, T(δ))
+    end
+end
+
+@doc raw"""
+Optimization for an entire neural network with BFGS. What is different in this case is that we still have to initialize the cache. 
+ +If `o.step == 1`, then we initialize the cache +""" +function update!(o::Optimizer{<:BFGSOptimizer}, C::CT, B::AbstractArray{T}) where {T, CT<:BFGSCache{T}} + if o.step == 1 + bfgs_initialization!(o, C, B) + else + bfgs_update!(o, C, B) + end +end + +function bfgs_update!(o::Optimizer{<:BFGSOptimizer}, C::CT, B::AbstractArray{T}) where {T, CT<:BFGSCache{T}} + # in the first step we compute the difference between the current and the previous mapped gradients: + Y = vec(B - C.B) + # in the second step we update H (write a test to check that this preserves symmetry) + vecS = vec(C.S) + # the *term for the second condition* appears many times in the expression. + SY = vecS' * Y + o.method.δ + # C.H .= C.H + (SY + Y' * C.H * Y) / (SY ^ 2) * vecS * vecS' - (C.H * Y * vecS' + vecS * (C.H * Y)' ) / SY + # the two computations of the H matrix should be equivalent. Check this!! + HY = C.H * Y + C.H .= C.H - HY * HY' / (Y' * HY + o.method.δ) + vecS * vecS' / SY + # in the third step we compute the final velocity + mul!(vecS, C.H, vec(B)) + mul!(C.S, -o.method.η, C.S) + assign!(C.B, copy(B)) + assign!(B, copy(C.S)) +end + +function bfgs_initialization!(o::Optimizer{<:BFGSOptimizer}, C::CT, B::AbstractArray{T}) where {T, CT<:BFGSCache{T}} + mul!(C.S, -o.method.η, B) + assign!(C.B, copy(B)) + assign!(B, copy(C.S)) +end \ No newline at end of file diff --git a/src/optimizers/init_optimizer_cache.jl b/src/optimizers/init_optimizer_cache.jl index 599c13b70..5c1c341d1 100644 --- a/src/optimizers/init_optimizer_cache.jl +++ b/src/optimizers/init_optimizer_cache.jl @@ -1,7 +1,8 @@ -""" -Wrapper for the functions setup_adam_cache, setup_momentum_cache, setup_gradient_cache. -These appear outside of optimizer_caches.jl because the OptimizerMethods first have to be defined. +@doc raw""" +Wrapper for the functions `setup_adam_cache`, `setup_momentum_cache`, `setup_gradient_cache`, `setup_bfgs_cache`. +These appear outside of `optimizer_caches.jl` because the `OptimizerMethods` first have to be defined. """ init_optimizer_cache(::GradientOptimizer, x) = setup_gradient_cache(x) init_optimizer_cache(::MomentumOptimizer, x) = setup_momentum_cache(x) -init_optimizer_cache(::AdamOptimizer, x) = setup_adam_cache(x) \ No newline at end of file +init_optimizer_cache(::AdamOptimizer, x) = setup_adam_cache(x) +init_optimizer_cache(::BFGSOptimizer, x) = setup_bfgs_cache(x) \ No newline at end of file diff --git a/src/optimizers/manifold_related/global_sections.jl b/src/optimizers/manifold_related/global_sections.jl index 5eb80bdf8..2d79f7c3c 100644 --- a/src/optimizers/manifold_related/global_sections.jl +++ b/src/optimizers/manifold_related/global_sections.jl @@ -1,4 +1,4 @@ -""" +@doc raw""" This implements global sections for the Stiefel manifold and the Symplectic Stiefel manifold. In practice this is implemented using Householder reflections, with the auxiliary column vectors given by: @@ -14,10 +14,9 @@ Maybe consider dividing the output in the check functions by n! Implement a general global section here!!!! Tₓ𝔐 → G×𝔤 !!!!!! (think about random initialization!) """ -#global section maps an element of the manifold to its associated Lie group! struct GlobalSection{T, AT} Y::AT - #for now the only lift that is implemented is the Stiefel one - these types will have to be expanded! + # for now the only lift that is implemented is the Stiefel one - these types will have to be expanded! 
λ::Union{LinearAlgebra.QRCompactWYQ, LinearAlgebra.QRPackedQ, Nothing} function GlobalSection(Y::AbstractVecOrMat) @@ -30,26 +29,22 @@ function GlobalSection(ps::NamedTuple) apply_toNT(GlobalSection, ps) end -#this is an application G×𝔐 → 𝔐 +# this is an application G×𝔐 → 𝔐 function apply_section(λY::GlobalSection{T, AT}, Y₂::AT) where {T, AT<:StiefelManifold{T}} N, n = size(λY.Y) @assert (N, n) == size(Y₂) - #temporary solution for the moment - projection_matrix₁ = typeof(Y₂.A)(hcat(I(n), zeros(T, n, N-n))) - projection_matrix₂ = typeof(Y₂.A)(hcat(zeros(T, N-n, n), I(N-n))) + backend = KernelAbstractions.get_backend(Y₂) StiefelManifold( - λY.Y.A*(projection_matrix₁*Y₂.A) + λY.λ*vcat(projection_matrix₂*Y₂.A, typeof(Y₂.A)(zeros(T, n, n))) + λY.Y.A * Y₂.A[1:n, :] + λY.λ*vcat(Y₂.A[(n+1):N, :], KernelAbstractions.zeros(backend, T, n, n)) ) end function apply_section!(Y::AT, λY::GlobalSection{T, AT}, Y₂::AT) where {T, AT<:StiefelManifold{T}} N, n = size(λY.Y) @assert (N, n) == size(Y₂) == size(Y) - #temporary solution for the moment - projection_matrix₁ = typeof(Y₂.A)(hcat(I(n), zeros(T, n, N-n))) - projection_matrix₂ = typeof(Y₂.A)(hcat(zeros(T, N-n, n), I(N-n))) - Y.A .= λY.Y*(projection_matrix₁*Y₂) + λY.λ*vcat(projection_matrix₂*Y₂, typeof(Y₂.A)(zeros(T, n, n))) + backend = KernelAbstractions.get_backend(Y) + Y.A .= λY.Y * Y₂.A[1:n, :] + λY.λ*vcat(Y₂.A[(n+1):N, :], KernelAbstractions.zeros(backend, T, n, n)) end function apply_section(λY::GlobalSection{T, AT}, Y₂::AT) where {T, AT<:GrassmannManifold{T}} @@ -91,11 +86,9 @@ end function global_rep(λY::GlobalSection{T, AT}, Δ::AbstractMatrix{T}) where {T, AT<:StiefelManifold{T}} N, n = size(λY.Y) - #temporary workaround - projection_matrix = typeof(Δ)(hcat(I(N-n), zeros(T, N-n, n))) StiefelLieAlgHorMatrix( SkewSymMatrix(λY.Y.A'*Δ), - projection_matrix*(λY.λ'*Δ), + (λY.λ'*Δ)[1:(N-n), 1:n], N, n ) @@ -103,10 +96,9 @@ end function global_rep(λY::GlobalSection{T, AT}, Δ::AbstractMatrix{T}) where {T, AT<:GrassmannManifold{T}} N, n = size(λY.Y) - #temporary workaround - projection_matrix = typeof(Δ)(hcat(zeros(T, N-n, n), I(N-n))) GrassmannLieAlgHorMatrix( - projection_matrix*(λY.λ'*Δ), + # (λY.λ'*Δ)[(n+1):N, 1:n], + (λY.λ' * Δ)[1:(N-n), 1:n], N, n ) diff --git a/src/optimizers/manifold_related/retractions.jl b/src/optimizers/manifold_related/retractions.jl index 11b552406..b2bc37dc0 100644 --- a/src/optimizers/manifold_related/retractions.jl +++ b/src/optimizers/manifold_related/retractions.jl @@ -41,16 +41,12 @@ geodesic(B::NamedTuple) = apply_toNT(geodesic, B) #you will have to fix the scalar indexing problem wrt to SkewSymMatrix! function geodesic(B::StiefelLieAlgHorMatrix{T}) where T - N, n = B.N, B.n - E = typeof(B.B)(StiefelProjection(N, n, T)) - # expression from which matrix exponential and inverse have to be computed - unit = typeof(B.B)(I(n)) - # delete this line eventually!!! 
- # A_mat = typeof(B.B)(SkewSymMatrix(Vector(B.A.S), n)) - A_mat = B.A - exponent = hcat(vcat(T(.5)*(A_mat*one(A_mat)), T(.25)*A_mat*A_mat - B.B'*B.B), vcat(unit, T(.5)*(A_mat*one(A_mat)))) + E = StiefelProjection(B) + unit = one(B.A) + A_mat = B.A * unit + exponent = hcat(vcat(T(.5) * A_mat, T(.25) * B.A * A_mat - B.B' * B.B), vcat(unit, T(.5) * A_mat)) StiefelManifold( - E + hcat(vcat(T(.5)*A_mat*one(A_mat), B.B), E)*𝔄(exponent)*vcat(unit, T(.5)*A_mat*one(A_mat)) + E + hcat(vcat(T(.5) * A_mat, B.B), E) * 𝔄(exponent) * vcat(unit, T(.5) * A_mat) ) end @@ -68,19 +64,18 @@ end cayley(B::NamedTuple) = apply_toNT(cayley, B) function cayley(B::StiefelLieAlgHorMatrix{T}) where T - N, n = B.N, B.n - E = typeof(B.B)(StiefelProjection(N, n, T)) - unit = typeof(B.B)(I(n)) - A_mat = B.A*one(B.A) - A_mat2 = B.A*B.A - BB = B.B'*B.B + E = StiefelProjection(B) + unit = one(B.A) + A_mat = B.A * one(B.A) + A_mat2 = B.A * B.A + BB = B.B' * B.B - exponent = hcat(vcat(unit - T(.25)*A_mat, T(.5)*BB - T(.125)*A_mat2), vcat(-T(.5)*unit, unit - T(.25)*A_mat)) + exponent = hcat(vcat(unit - T(.25) * A_mat, T(.5) * BB - T(.125) * A_mat2), vcat(-T(.5) * unit, unit - T(.25) * A_mat)) StiefelManifold( E + - T(.5)*hcat(vcat(T(.5)*A_mat, B.B), vcat(unit, zero(B.B)))* + T(.5) * hcat(vcat(T(.5) * A_mat, B.B), vcat(unit, zero(B.B)))* ( - vcat(unit, T(0.5)*A_mat) + exponent \ (vcat(unit, T(0.5)*A_mat) + vcat(T(0.5)*A_mat, T(0.25)*A_mat2 - T(0.5)*BB)) + vcat(unit, T(0.5) * A_mat) + exponent \ (vcat(unit, T(0.5) * A_mat) + vcat(T(0.5) * A_mat, T(0.25) * A_mat2 - T(0.5) * BB)) ) ) end \ No newline at end of file diff --git a/src/optimizers/optimizer.jl b/src/optimizers/optimizer.jl index 056d7af36..7cd78d037 100644 --- a/src/optimizers/optimizer.jl +++ b/src/optimizers/optimizer.jl @@ -1,4 +1,4 @@ -""" +@doc raw""" Optimizer struct that stores the 'method' (i.e. Adam with corresponding hyperparameters), the cache and the optimization step. It takes as input an optimization method and the parameters of a network. @@ -46,6 +46,15 @@ function optimization_step!(o::Optimizer, d::Union{AbstractExplicitLayer, Abstra apply_section!(ps, λY, ps₂) end +@doc raw""" +Optimization for an entire neural networks, the way this function should be called. + +inputs: +- `o::Optimizer` +- `model::Chain` +- `ps::Tuple` +- `dx::Tuple` +""" function optimization_step!(o::Optimizer, model::Chain, ps::Tuple, dx::Tuple) o.step += 1 for (index, element) in zip(eachindex(model.layers), model.layers) @@ -69,6 +78,7 @@ function rgrad(Y::AbstractVecOrMat, dx::AbstractVecOrMat) dx end +# do we need those two? function update!(m::Optimizer, C::NamedTuple, B::NamedTuple) apply_toNT(m, C, B, update!) end diff --git a/src/optimizers/optimizer_caches.jl b/src/optimizers/optimizer_caches.jl index 78e113709..b8a21cb0a 100644 --- a/src/optimizers/optimizer_caches.jl +++ b/src/optimizers/optimizer_caches.jl @@ -1,8 +1,9 @@ -""" +@doc raw""" AbstractCache has subtypes: -AdamCache -MomentumCache -GradientCache +- AdamCache +- MomentumCache +- GradientCache +- BFGSCache All of them can be initialized with providing an array (also supporting manifold types). 
""" @@ -14,15 +15,15 @@ abstract type AbstractCache end struct AdamCache{T, AT <: AbstractArray{T}} <: AbstractCache B₁::AT B₂::AT - function AdamCache(B::AbstractArray) - new{eltype(B), typeof(zero(B))}(zero(B), zero(B)) + function AdamCache(Y::AbstractArray) + new{eltype(Y), typeof(zero(Y))}(zero(Y), zero(Y)) end end struct MomentumCache{T, AT <: AbstractArray{T}} <:AbstractCache B::AT - function MomentumCache(B::AbstractArray) - new{eltype(B), typeof(zero(B))}(zero(B)) + function MomentumCache(Y::AbstractArray) + new{eltype(Y), typeof(zero(Y))}(zero(Y)) end end @@ -32,9 +33,10 @@ GradientCache(::AbstractArray) = GradientCache() ############################################################################# # All the setup_cache functions -setup_adam_cache(B::AbstractArray) = reshape([setup_adam_cache(b) for b in B], size(B)) -setup_momentum_cache(B::AbstractArray) = reshape([setup_momentum_cache(b) for b in B], size(B)) -setup_gradient_cache(B::AbstractArray) = reshape([setup_gradient_cache(b) for b in B], size(B)) +# I don't really understand what we need these for ??? +# setup_adam_cache(B::AbstractArray) = reshape([setup_adam_cache(b) for b in B], size(B)) +# setup_momentum_cache(B::AbstractArray) = reshape([setup_momentum_cache(b) for b in B], size(B)) +# setup_gradient_cache(B::AbstractArray) = reshape([setup_gradient_cache(b) for b in B], size(B)) setup_adam_cache(ps::NamedTuple) = apply_toNT(setup_adam_cache, ps) setup_momentum_cache(ps::NamedTuple) = apply_toNT(setup_momentum_cache, ps) diff --git a/test/arrays/array_tests.jl b/test/arrays/array_tests.jl index 323aa877c..fd9269046 100644 --- a/test/arrays/array_tests.jl +++ b/test/arrays/array_tests.jl @@ -125,6 +125,10 @@ function sympl_lie_alg_add_sub_test(N, n) @test all(abs.(projection(W₁ - W₂) .- S₄) .< 1e-10) end +function stiefel_lie_alg_vectorization_test(N, n; T=Float32) + A = rand(StiefelLieAlgHorMatrix{T}, N, n) + @test isapprox(StiefelLieAlgHorMatrix(vec(A), N, n), A) +end # TODO: tests for ADAM functions @@ -150,4 +154,5 @@ for (N, n) ∈ zip(N_vec, n_vec) sympl_proj_test(N,n) stiefel_lie_alg_add_sub_test(N,n) sympl_lie_alg_add_sub_test(N,n) + stiefel_lie_alg_vectorization_test(N, n) end diff --git a/test/optimizers/bfgs_optimizer.jl b/test/optimizers/bfgs_optimizer.jl new file mode 100644 index 000000000..cc540b063 --- /dev/null +++ b/test/optimizers/bfgs_optimizer.jl @@ -0,0 +1,108 @@ +using GeometricMachineLearning +using Zygote +using Test +using LinearAlgebra: norm, svd + +@doc raw""" +This tests the BFGS optimizer. 
+""" +function bfgs_optimizer(N) + B = inv(rand(N, N)) + loss(A) = norm(A - B) ^ (3) + A = randn(N, N) + loss1 = loss(A) + opt = BFGSOptimizer(1e-3) + optimizer_instance = Optimizer(opt, A) + ∇loss = gradient(loss, A)[1] + GeometricMachineLearning.bfgs_initialization!(optimizer_instance, optimizer_instance.cache, ∇loss) + A .= A + ∇loss + loss2 = loss(A) + ∇loss = gradient(loss, A)[1] + GeometricMachineLearning.bfgs_update!(optimizer_instance, optimizer_instance.cache, ∇loss) + A .= A + ∇loss + loss3 = loss(A) + @test loss1 > loss2 > loss3 +end + +bfgs_optimizer(10) + +function bfgs_optimizer2(N, n_iterates=10) + losses = zeros(n_iterates+2) + B = inv(rand(N, N)) + loss(A) = norm(A - B) ^ (3) + A = randn(N, N) + losses[1] = loss(A) + opt = BFGSOptimizer(1e-3) + optimizer_instance = Optimizer(opt, A) + ∇loss = gradient(loss, A)[1] + GeometricMachineLearning.bfgs_initialization!(optimizer_instance, optimizer_instance.cache, ∇loss) + A .= A + ∇loss + losses[2] = loss(A) + for i in 1:n_iterates + ∇loss = gradient(loss, A)[1] + GeometricMachineLearning.bfgs_update!(optimizer_instance, optimizer_instance.cache, ∇loss) + A .= A + ∇loss + losses[i+2] = loss(A) + end + losses +end + +# bfgs_optimizer2(10) + +function stiefel_optimizer_test(N, n; T=Float32, n_iterates=10, η=1f-2) + opt=BFGSOptimizer(T(η)) + A = rand(StiefelManifold{T}, N, n) + B = T(10)*randn(T, N, N) + ideal_A = svd(B).U[:, 1:n] + loss(A) = norm(B - A * A' * B) + ideal_error = norm(ideal_A) + losses = zeros(n_iterates+2) + losses[1] = loss(A) + optimizer_instance = Optimizer(opt, A) + λA = GlobalSection(A) + grad_loss = global_rep(λA, rgrad(A, gradient(loss, A)[1])) + GeometricMachineLearning.bfgs_initialization!(optimizer_instance, optimizer_instance.cache, grad_loss) + # geodesic for `grad_loss` + exp_grad_loss = GeometricMachineLearning.geodesic(grad_loss) + GeometricMachineLearning.apply_section!(A, λA, exp_grad_loss) + losses[2] = loss(A) + for i in 1:n_iterates + λA = GlobalSection(A) + grad_loss = global_rep(λA, rgrad(A, gradient(loss, A)[1])) + GeometricMachineLearning.bfgs_update!(optimizer_instance, optimizer_instance.cache, grad_loss) + # geodesic for `grad_loss` + exp_grad_loss = GeometricMachineLearning.geodesic(grad_loss) + GeometricMachineLearning.apply_section!(A, λA, exp_grad_loss) + losses[i+2] = loss(A) + end + losses, ideal_error, check(A) +end + +function stiefel_adam_test(N, n; T=Float32, n_iterates=10) + opt=AdamOptimizer() + A = rand(StiefelManifold{T}, N, n) + B = T(10)*randn(T, N, N) + ideal_A = svd(B).U[:, 1:n] + loss(A) = norm(B - A * A' * B) + ideal_error = norm(ideal_A) + losses = zeros(n_iterates+2) + losses[1] = loss(A) + optimizer_instance = Optimizer(opt, A) + λA = GlobalSection(A) + grad_loss = global_rep(λA, rgrad(A, gradient(loss, A)[1])) + GeometricMachineLearning.update!(optimizer_instance, optimizer_instance.cache, grad_loss) + # geodesic for `grad_loss` + exp_grad_loss = GeometricMachineLearning.geodesic(grad_loss) + GeometricMachineLearning.apply_section!(A, λA, exp_grad_loss) + losses[2] = loss(A) + for i in 1:n_iterates + λA = GlobalSection(A) + grad_loss = global_rep(λA, rgrad(A, gradient(loss, A)[1])) + GeometricMachineLearning.update!(optimizer_instance, optimizer_instance.cache, grad_loss) + # geodesic for `grad_loss` + exp_grad_loss = GeometricMachineLearning.geodesic(grad_loss) + GeometricMachineLearning.apply_section!(A, λA, exp_grad_loss) + losses[i+2] = loss(A) + end + losses, ideal_error, check(A) +end \ No newline at end of file diff --git 
a/test/optimizers/optimizer_convergence_tests/svd_optim.jl b/test/optimizers/optimizer_convergence_tests/svd_optim.jl index 165817461..492076458 100644 --- a/test/optimizers/optimizer_convergence_tests/svd_optim.jl +++ b/test/optimizers/optimizer_convergence_tests/svd_optim.jl @@ -23,7 +23,7 @@ A = [ 0.06476993260924702 0.8369280855305259 0.6245358125914054 0.140729967064 0.7317681833003449 0.9051355184962627 0.3376918522349117 0.436545092402125 0.3462196925686055 ] -function svd_test(A, n, train_steps=1000, tol=1e-1; retraction=Cayley()) +function svd_test(A, n, train_steps=2000, tol=1e-1; retraction=Cayley()) N = size(A,1) U, Σ, Vt = svd(A) U_result = U[:, 1:n] diff --git a/test/transformer_related/multi_head_attention_stiefel_retraction.jl b/test/transformer_related/multi_head_attention_stiefel_retraction.jl index 549502401..842069da5 100644 --- a/test/transformer_related/multi_head_attention_stiefel_retraction.jl +++ b/test/transformer_related/multi_head_attention_stiefel_retraction.jl @@ -7,6 +7,7 @@ import Random, Test, Lux, LinearAlgebra, KernelAbstractions using GeometricMachineLearning, Test using GeometricMachineLearning: geodesic using GeometricMachineLearning: cayley +using GeometricMachineLearning: init_optimizer_cache dim = 64 n_heads = 8