diff --git a/latest/Optimizer/index.html b/latest/Optimizer/index.html index 9d0949770..41c006f3b 100644 --- a/latest/Optimizer/index.html +++ b/latest/Optimizer/index.html @@ -1,2 +1,2 @@ -Optimizer Framework · GeometricMachineLearning.jl

Optimizer

In order to generalize neural network optimizers to homogeneous spaces, a class of manifolds we often encounter in machine learning, we have to find a global tangent space representation which we call $\mathfrak{g}^\mathrm{hor}$ here.

Starting from an element of the tangent space $T_Y\mathcal{M}$[1], we need to perform two mappings to arrive at $\mathfrak{g}^\mathrm{hor}$, which we refer to by $\Omega$ and a red horizontal arrow:

Here the mapping $\Omega$ is a horizontal lift from the tangent space onto the horizontal component of the Lie algebra at $Y$.

The red line maps the horizontal component at $Y$, i.e. $\mathfrak{g}^{\mathrm{hor},Y}$, to the global horizontal component $\mathfrak{g}^\mathrm{hor}$.

The $\mathrm{cache}$ stores information about previous optimization steps and is dependent on the optimizer. The elements of the $\mathrm{cache}$ are also in $\mathfrak{g}^\mathrm{hor}$. Based on this the optimizer (Adam in this case) computes a final velocity, which is the input of a retraction. Because this update is performed in $\mathfrak{g}^{\mathrm{hor}}\equiv{}T_Y\mathcal{M}$, we still need to perform a mapping, called apply_section here, that then finally updates the network parameters. The two red lines are described in the section on global sections.

References

  • Brantner B. Generalizing Adam To Manifolds For Efficiently Training Transformers[J]. arXiv preprint arXiv:2305.16901, 2023.
  • 1In practice this is obtained by first using an AD routine on a loss function $L$, and then computing the Riemannian gradient based on this. See the section on the Stiefel manifold for an example of this.
+Optimizer Framework · GeometricMachineLearning.jl

Optimizer

In order to generalize neural network optimizers to homogeneous spaces, a class of manifolds we often encounter in machine learning, we have to find a global tangent space representation which we call $\mathfrak{g}^\mathrm{hor}$ here.

Starting from an element of the tangent space $T_Y\mathcal{M}$[1], we need to perform two mappings to arrive at $\mathfrak{g}^\mathrm{hor}$, which we refer to by $\Omega$ and a red horizontal arrow:

Here the mapping $\Omega$ is a horizontal lift from the tangent space onto the horizontal component of the Lie algebra at $Y$.

The red line maps the horizontal component at $Y$, i.e. $\mathfrak{g}^{\mathrm{hor},Y}$, to the global horizontal component $\mathfrak{g}^\mathrm{hor}$.

The $\mathrm{cache}$ stores information about previous optimization steps and is dependent on the optimizer. The elements of the $\mathrm{cache}$ are also in $\mathfrak{g}^\mathrm{hor}$. Based on this the optimizer (Adam in this case) computes a final velocity, which is the input of a retraction. Because this update is performed in $\mathfrak{g}^{\mathrm{hor}}\equiv{}T_Y\mathcal{M}$, we still need to perform a mapping, called apply_section here, that then finally updates the network parameters. The two red lines are described in the section on global sections.

References

  • Brantner B. Generalizing Adam To Manifolds For Efficiently Training Transformers[J]. arXiv preprint arXiv:2305.16901, 2023.
  • 1In practice this is obtained by first using an AD routine on a loss function $L$, and then computing the Riemannian gradient based on this. See the section on the Stiefel manifold for an example of this.
diff --git a/latest/architectures/sympnet/index.html b/latest/architectures/sympnet/index.html index fa36c87da..038d16584 100644 --- a/latest/architectures/sympnet/index.html +++ b/latest/architectures/sympnet/index.html @@ -1,5 +1,5 @@ -SympNet · GeometricMachineLearning.jl

SympNet

This page documents the SympNet architecture and its implementation in GeometricMachineLearning.jl.

Quick overview of the theory of SympNet

Principle

SympNets are a new type of neural network that propose a novel approach to computing the trajectory of a Hamiltonian system in phase space. Let us denote by $(q,p)=(q_1,\ldots,q_d,p_1,\ldots,p_d)\in \mathbb{R}^{2d}$ the phase space coordinates, with $q\in \mathbb{R}^{d}$ the generalized position and $p\in \mathbb{R}^{d}$ the generalized momentum. Given a physical problem, a SympNet takes a phase space element $(q,p)$ and aims to compute the next point $(q',p')$ of the trajectory one time step later, while preserving the well-known symplectic structure of Hamiltonian systems. The way SympNets preserve the symplectic structure is specific to them: the preservation is intrinsic to the neural network. Indeed, a SympNet is not built from traditional layers but from symplectic layers (described later), which turn the traditional universal approximation theorem into a symplectic one: a SympNet can approximate any symplectic function, provided certain conditions on the activation function are satisfied.

A SympNet (denoted $\Phi$ in the following) is thus a symplecticity-preserving integrator from $\mathbb{R}^{d} \times \mathbb{R}^{d}$ to $\mathbb{R}^{d} \times \mathbb{R}^{d}$ which can compute, from an initial condition $(q_0,p_0)$, a sequence of phase space elements of a trajectory $(q_n,p_n)=\Phi(q_{n-1},p_{n-1})=\ldots=\Phi^n(q_0,p_0)$. The time step between predictions is not a parameter we can choose but is determined by the temporal frequency of the training data. SympNets can handle both temporally regular data, i.e. with a fixed time step between samples, and temporally irregular data, i.e. with a variable time step.

Architecture of SympNets

With GeometricMachineLearning.jl, it is possible to implement two types of architectures: LA-SympNets and G-SympNets.

LA-SympNet

LA-SympNets are built by alternating two types of layers: symplectic linear layers and symplectic activation layers. For a given integer $n$, a symplectic linear layer is defined by

\[\mathcal{L}^{n,up} +SympNet · GeometricMachineLearning.jl

SympNet

This page documents the SympNet architecture and its implementation in GeometricMachineLearning.jl.

Quick overview of the theory of SympNet

Principle

SympNets are a new type of neural network that propose a novel approach to computing the trajectory of a Hamiltonian system in phase space. Let us denote by $(q,p)=(q_1,\ldots,q_d,p_1,\ldots,p_d)\in \mathbb{R}^{2d}$ the phase space coordinates, with $q\in \mathbb{R}^{d}$ the generalized position and $p\in \mathbb{R}^{d}$ the generalized momentum. Given a physical problem, a SympNet takes a phase space element $(q,p)$ and aims to compute the next point $(q',p')$ of the trajectory one time step later, while preserving the well-known symplectic structure of Hamiltonian systems. The way SympNets preserve the symplectic structure is specific to them: the preservation is intrinsic to the neural network. Indeed, a SympNet is not built from traditional layers but from symplectic layers (described later), which turn the traditional universal approximation theorem into a symplectic one: a SympNet can approximate any symplectic function, provided certain conditions on the activation function are satisfied.

A SympNet (denoted $\Phi$ in the following) is thus a symplecticity-preserving integrator from $\mathbb{R}^{d} \times \mathbb{R}^{d}$ to $\mathbb{R}^{d} \times \mathbb{R}^{d}$ which can compute, from an initial condition $(q_0,p_0)$, a sequence of phase space elements of a trajectory $(q_n,p_n)=\Phi(q_{n-1},p_{n-1})=\ldots=\Phi^n(q_0,p_0)$. The time step between predictions is not a parameter we can choose but is determined by the temporal frequency of the training data. SympNets can handle both temporally regular data, i.e. with a fixed time step between samples, and temporally irregular data, i.e. with a variable time step.

Architecture of SympNets

With GeometricMachineLearning.jl, it is possible to implement two types of architectures: LA-SympNets and G-SympNets.

LA-SympNet

LA-SympNets are built by alternating two types of layers: symplectic linear layers and symplectic activation layers. For a given integer $n$, a symplectic linear layer is defined by

\[\mathcal{L}^{n,up} \begin{pmatrix} q \\ p \\ @@ -106,4 +106,4 @@ # perform training (returns array that contains the total loss for each training step) total_loss = train!(nn, opt, data_q, data_p; ntraining = nruns, batch_size = nbatch)

The train! function will change the parameters of the neural network and return a vector containing the evolution of the value of the loss function during training. Default values for the arguments ntraining and batch_size are $1000$ and $10$, respectively.

The training data data_q and data_p must be matrices in $\mathbb{R}^{n\times d}$, where $n$ is the number of data points and $d$ is half the dimension of the system, i.e. data_q[i,j] is $q_j(t_i)$, where $(t_1,...,t_n)$ are the times of the training data.
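To make the expected layout concrete, here is a minimal sketch in plain Julia (independent of the example above; the harmonic oscillator data are only illustrative) that builds such matrices with $d=1$ and $n=200$, using the same variable names data_q, data_p, q0 and p0 as in the snippets on this page:

# harmonic oscillator: q(t) = sin(t), p(t) = cos(t), so d = 1
n, d = 200, 1
t = range(0, 20; length = n)

data_q = reshape(sin.(t), n, d)    # n × d matrix, data_q[i, j] = q_j(t_i)
data_p = reshape(cos.(t), n, d)    # n × d matrix, data_p[i, j] = p_j(t_i)

q0, p0 = data_q[1, :], data_p[1, :]    # initial condition for the predictions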

Then we can make predictions. Let's compare the initial data with a prediction starting from the same phase space point using the provided function Iterate_Sympnet:

#predictions
-q_learned, p_learned = Iterate_Sympnet(nn, q0, p0; n_points = size(data_q,1))

+q_learned, p_learned = Iterate_Sympnet(nn, q0, p0; n_points = size(data_q,1))

diff --git a/latest/arrays/stiefel_lie_alg_horizontal/index.html b/latest/arrays/stiefel_lie_alg_horizontal/index.html index 98d91e257..ccef823d0 100644 --- a/latest/arrays/stiefel_lie_alg_horizontal/index.html +++ b/latest/arrays/stiefel_lie_alg_horizontal/index.html @@ -1,8 +1,8 @@ -Global Tangent Space · GeometricMachineLearning.jl

Horizontal component of the Lie algebra $\mathfrak{g}$

What we use to generalize Adam (and other optimizers) to manifolds is a global tangent space representation of the homogeneous spaces.

For the Stiefel manifold, this global tangent space representation takes a simple form:

\[B = \begin{bmatrix} +Global Tangent Space · GeometricMachineLearning.jl

Horizontal component of the Lie algebra $\mathfrak{g}$

What we use to generalize Adam (and other optimizers) to manifolds is a global tangent space representation of the homogeneous spaces.

For the Stiefel manifold, this global tangent space representation takes a simple form:

\[B = \begin{bmatrix} A & -B^T \\ B & \mathbb{O} \end{bmatrix},\]

where $A\in\mathbb{R}^{n\times{}n}$ is skew-symmetric and $B\in\mathbb{R}^{(N-n)\times{}n}$ is arbitrary. In GeometricMachineLearning the struct StiefelLieAlgHorMatrix implements elements of this form.
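As an illustration (plain Julia, without using the package type), the following sketch assembles a matrix of this block form and checks that it is an element of $\mathfrak{so}(N)$, i.e. skew-symmetric:

using LinearAlgebra

N, n = 10, 5
A = rand(n, n); A = A - A'            # skew-symmetric n × n block
B = rand(N - n, n)                    # arbitrary (N-n) × n block

C = [A -B'; B zeros(N - n, N - n)]    # the block matrix from above

@assert C' ≈ -C                       # C ∈ 𝔰𝔬(N)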

Theoretical background

Vertical and horizontal components

The Stiefel manifold is a homogeneous space obtained from $SO(N)$ by identifying two matrices whose first $n$ columns coincide. Another way of expressing this is:

\[A_1 \sim A_2 \iff A_1E = A_2E\]

for

\[E = \begin{bmatrix} \mathbb{I} \\ \mathbb{O}\end{bmatrix}.\]

The tangent space $T_ESt(n,N)$ can also be expressed that way:

\[T_ESt(n,N) = \mathfrak{g}\cdot{}E = \{BE:B\in\mathfrak{g}\}.\]

The kernel of the mapping $\mathfrak{g}\to{}T_ESt(n,N), B\mapsto{}BE$ is referred to as $\mathfrak{g}^{\mathrm{ver},E}$, the vertical component of the Lie algebra at $E$. It is clear that elements belonging to $\mathfrak{g}^{\mathrm{ver},E}$ are of the following form:

\[\begin{bmatrix} \hat{\mathbb{O}} & \tilde{\mathbb{O}}^T \\ \tilde{\mathbb{O}} & C -\end{bmatrix},\]

where $\hat{\mathbb{O}}\in\mathbb{R}^{n\times{}n}$ is a "small" zero matrix and $\tilde{\mathbb{O}}\in\mathbb{R}^{(N-n)\times{}n}$ is a bigger one; $C\in\mathbb{R}^{(N-n)\times{}(N-n)}$ is a skew-symmetric matrix.

We can then take the orthogonal complement of this space (with respect to the canonical metric). We will denote it by $\mathfrak{g}^{\mathrm{hor},E}\equiv\mathfrak{g}^\mathrm{hor}$ and call it the horizontal component. Its elements are of the form described at the top of this page.

Special functions

You can also draw random elements from $\mathfrak{g}^\mathrm{hor}$ through e.g.

rand(CUDADevice(), StiefelLieAlgHorMatrix{Float32}, 10, 5)

In this example: $N=10$ and $n=5$.

+\end{bmatrix},\]

where $\hat{\mathbb{O}}\in\mathbb{R}^{n\times{}n}$ is a "small" zero matrix and $\tilde{\mathbb{O}}\in\mathbb{R}^{(N-n)\times{}n}$ is a bigger one; $C\in\mathbb{R}^{(N-n)\times{}(N-n)}$ is a skew-symmetric matrix.

We can then take the orthogonal complement of this space (with respect to the canonical metric). We will denote it by $\mathfrak{g}^{\mathrm{hor},E}\equiv\mathfrak{g}^\mathrm{hor}$ and call it the horizontal component. Its elements are of the form described at the top of this page.

Special functions

You can also draw random elements from $\mathfrak{g}^\mathrm{hor}$ through e.g.

rand(CUDADevice(), StiefelLieAlgHorMatrix{Float32}, 10, 5)

In this example: $N=10$ and $n=5$.

diff --git a/latest/data_loader/data_loader/index.html b/latest/data_loader/data_loader/index.html new file mode 100644 index 000000000..e18bff4d5 --- /dev/null +++ b/latest/data_loader/data_loader/index.html @@ -0,0 +1,2 @@ + +Routines · GeometricMachineLearning.jl

Data Loader

GeometricMachineLearning provides flexible routines to load and manage data for training neural networks. DataLoader has several constructors:

  1. If provided with a tensor, then it assumes the first axis is the system dimension, the second axis is the dimension of the parameter space, and the third axis gives the time evolution of the system (see the sketch after this list).

  2. If provided with a tensor and a vector, it assumes the data are related to a classification task.
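As a sketch of the axis convention in point 1 (plain Julia; the DataLoader call is commented out since the exact constructor arguments may differ between versions):

# a system of dimension 4, three parameter instances, 100 time steps
sys_dim, n_params, n_time_steps = 4, 3, 100
data = rand(Float32, sys_dim, n_params, n_time_steps)

# this tensor is what the first constructor expects, e.g.
# dl = DataLoader(data)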

diff --git a/latest/data_loader/snapshot_matrix/index.html b/latest/data_loader/snapshot_matrix/index.html new file mode 100644 index 000000000..3efc31f7f --- /dev/null +++ b/latest/data_loader/snapshot_matrix/index.html @@ -0,0 +1,8 @@ + +Snapshot matrix · GeometricMachineLearning.jl

Snapshot matrix

The snapshot matrix stores solutions of the high-dimensional ODE (obtained from discretizing a PDE). This is then used to construct reduced bases in a data-driven way. So (for a single parameter[1]) the snapshot matrix takes the following form:

\[M = \left[\begin{array}{c:c:c:c} +\hat{u}_1(t_0) & \hat{u}_1(t_1) & \quad\ldots\quad & \hat{u}_1(t_f) \\ +\hat{u}_2(t_0) & \hat{u}_2(t_1) & \ldots & \hat{u}_2(t_f) \\ +\hat{u}_3(t_0) & \hat{u}_3(t_1) & \ldots & \hat{u}_3(t_f) \\ +\ldots & \ldots & \ldots & \ldots \\ +\hat{u}_{2N}(t_0) & \hat{u}_{2N}(t_1) & \ldots & \hat{u}_{2N}(t_f) \\ +\end{array}\right].\]
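As a sketch (plain Julia, with a placeholder standing in for an actual full-order solver), the snapshot matrix can be assembled by concatenating the solution snapshots column by column:

full_order_solution(t) = rand(10)    # placeholder for the FOM solver, here 2N = 10

ts = range(0.0, 1.0; length = 51)    # t_0, …, t_f
M = reduce(hcat, full_order_solution(t) for t in ts)

size(M)    # (10, 51): each column is one snapshot û(t_k)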

  • 1If we deal with a parametrized PDE then there are two stages at which the snapshot matrix has to be processed: the offline stage and the online stage.
diff --git a/latest/images/adam_optimizer.png b/latest/images/adam_optimizer.png index 8b5d1a8c5..62569f311 100644 Binary files a/latest/images/adam_optimizer.png and b/latest/images/adam_optimizer.png differ diff --git a/latest/images/adam_optimizer.tex b/latest/images/adam_optimizer.tex index 4789dfd19..88ecc758c 100644 --- a/latest/images/adam_optimizer.tex +++ b/latest/images/adam_optimizer.tex @@ -24,7 +24,7 @@ \coordinate[right of=R3, xshift=.5cm] (addition); \node[fit=(R2)(cache)(R3)(R4)(addition),draw, ultra thick, rounded corners, label=below:$\mathtt{optimization\_step(o, ::}\mathbb{R}^N\mathtt{,::}\mathbb{R}^N\mathtt{)}$] (optimization_step) {}; - \draw[->] (R2) -- (R3) node[pos=.5, sloped, below] {Adam}; + \draw[->] (R2) -- (cache) node[pos=.5, sloped, below] {Adam}; \draw[->] (cache) -- (R3) node[pos=.5, sloped, above] {Adam}; \draw[->] (R3) -- (R4) node[pos=.5, right] {Addition}; \draw[->, mgreen] (R1) -- (R2) node[pos=.5, left] {\color{mgreen}AD}; diff --git a/latest/images/general_optimization.pdf b/latest/images/general_optimization.pdf index d8da647b7..2e7ebbe67 100644 Binary files a/latest/images/general_optimization.pdf and b/latest/images/general_optimization.pdf differ diff --git a/latest/images/general_optimization.png b/latest/images/general_optimization.png index ead084d68..c7b692435 100644 Binary files a/latest/images/general_optimization.png and b/latest/images/general_optimization.png differ diff --git a/latest/images/general_optimization.tex b/latest/images/general_optimization.tex index bc7a4e2b3..4b68d8f04 100644 --- a/latest/images/general_optimization.tex +++ b/latest/images/general_optimization.tex @@ -18,7 +18,7 @@ \draw[->] (TYM) -- (ghorY) node[pos=.5, left] {$\Omega$}; \draw[->, mred] (ghorY) -- (ghor); - \draw[->] (ghor) -- (ghor2) node[pos=.5, sloped, below] {Adam}; + \draw[->] (ghor) -- (cache) node[pos=.5, sloped, below] {Adam}; \draw[->] (cache) -- (ghor2) node[pos=.5, sloped, above] {Adam}; \draw[->] (ghor2) -- (M) node[pos=.5, right] {Retraction}; \end{tikzpicture} diff --git a/latest/images/general_optimization_with_boundary.png b/latest/images/general_optimization_with_boundary.png index 4c1012cb1..a4b32b6d7 100644 Binary files a/latest/images/general_optimization_with_boundary.png and b/latest/images/general_optimization_with_boundary.png differ diff --git a/latest/images/general_optimization_with_boundary.tex b/latest/images/general_optimization_with_boundary.tex index 9ce0295e9..a576864d9 100644 --- a/latest/images/general_optimization_with_boundary.tex +++ b/latest/images/general_optimization_with_boundary.tex @@ -31,7 +31,7 @@ \draw[->] (TYM) -- (ghorY) node[pos=.5, left] {$\Omega$}; \draw[->, mred] (ghorY) -- (ghor); - \draw[->] (ghor) -- (ghor2) node[pos=.5, sloped, below] {Adam}; + \draw[->] (ghor) -- (cache) node[pos=.5, sloped, below] {Adam}; \draw[->] (cache) -- (ghor2) node[pos=.5, sloped, above] {Adam}; \draw[->] (ghor2) -- (M) node[pos=.5, right] {Retraction}; \draw[->, mgreen] (M2) -- (Euc) node[pos=.5, left] {\color{mgreen}AD}; diff --git a/latest/images/mha.png b/latest/images/mha.png new file mode 100644 index 000000000..40628aa5f Binary files /dev/null and b/latest/images/mha.png differ diff --git a/latest/images/solution_manifold_2.png b/latest/images/solution_manifold_2.png new file mode 100644 index 000000000..3ed6cf860 Binary files /dev/null and b/latest/images/solution_manifold_2.png differ diff --git a/latest/images/symplectic_autoencoder.png b/latest/images/symplectic_autoencoder.png new file mode 100644 index 
000000000..d8dee8a98 Binary files /dev/null and b/latest/images/symplectic_autoencoder.png differ diff --git a/latest/index.html b/latest/index.html index 60e393f66..6ae2768ff 100644 --- a/latest/index.html +++ b/latest/index.html @@ -1,2 +1,2 @@ -Home · GeometricMachineLearning.jl

Geometric Machine Learning

GeometricMachineLearning.jl implements various scientific machine learning models that aim at learning dynamical systems with geometric structure, such as Hamiltonian (symplectic) or Lagrangian (variational) systems.

Installation

GeometricMachineLearning.jl and all of its dependencies can be installed via the Julia REPL by typing

]add GeometricMachineLearning

Architectures

Manifolds

+Home · GeometricMachineLearning.jl

Geometric Machine Learning

GeometricMachineLearning.jl implements various scientific machine learning models that aim at learning dynamical systems with geometric structure, such as Hamiltonian (symplectic) or Lagrangian (variational) systems.

Installation

GeometricMachineLearning.jl and all of its dependencies can be installed via the Julia REPL by typing

]add GeometricMachineLearning

Architectures

Manifolds

diff --git a/latest/layers/attention_layer/index.html b/latest/layers/attention_layer/index.html index b9529d76e..7c3867c0c 100644 --- a/latest/layers/attention_layer/index.html +++ b/latest/layers/attention_layer/index.html @@ -1,5 +1,5 @@ -Attention · GeometricMachineLearning.jl

The Attention Layer

The attention layer (and the orthonormal activation function defined for it) was specifically designed to generalize transformers to symplectic data. Usually a self-attention layer takes the following form:

\[Z := [z^{(1)}, \ldots, z^{(T)}] \mapsto Z\mathrm{softmax}((P^QZ)^T(P^KZ)),\]

where we left out the linear mapping onto the values $P^V$.

The idea behind this is that we can perform a non-linear re-weighting of the columns of $Z$ by multiplying with a $Z$-dependent matrix from the right and therefore take the sequential nature of the data into account (which is not possible with standard feedforward neural networks). After the attention step the transformer applies a simple ResNet from the left.

What the softmax does is a vector-wise operation, i.e. it operates on each column of an input matrix $A = [a_1, \ldots, a_T]$. The result is a sequence of probability vectors $[p^{(1)}, \ldots, p^{(T)}]$ for which

\[\sum_{i=1}^Tp^{(j)}_i=1\quad\forall{}j\in\{1,\dots,T\}.\]

What we want to construct is a symplectic transformation that is transformer-like. For this we modify the attention layer the following way:

\[Z := [z^{(1)}, \ldots, z^{(T)}] \mapsto Z\sigma((P^QZ)^T(P^KZ)),\]

where $\sigma(A)=\exp(\mathtt{upper\_triangular\_asymmetrize}(A))$ and

\[[\mathtt{upper\_triangular\_asymmetrize}(A)]_{ij} = \begin{cases} a_{ij} & \text{if $i<j$} \\ -a_{ji} & \text{if $i>j$} \\ 0 & \text{else.}\end{cases}\]

This has as a consequence that the matrix $\Lambda(Z) := \sigma((P^QZ)^T(P^KZ))$ is orthonormal and hence preserves an extended symplectic structure. To make this more clear, consider that the transformer maps sequences of vectors to sequences of vectors, i.e. $V\times\cdots\times{}V \ni [z^1, \ldots, z^T] \mapsto [\hat{z}^1, \ldots, \hat{z}^T]$. We can define a symplectic structure on $V\times\cdots\times{}V$ by rearranging $[z^1, \ldots, z^T]$ into a vector. We do this in the following way:

\[\tilde{Z} = \begin{pmatrix} q^{(1)}_1 \\ q^{(2)}_1 \\ \cdots \\ q^{(T)}_1 \\ q^{(1)}_2 \\ \cdots \\ q^{(T)}_d \\ p^{(1)}_1 \\ p^{(2)}_1 \\ \cdots \\ p^{(T)}_1 \\ p^{(1)}_2 \\ \cdots \\ p^{(T)}_d \end{pmatrix}.\]

The symplectic structure on this big space is then:

\[\mathbb{J}=\begin{pmatrix} +Attention · GeometricMachineLearning.jl

The Attention Layer

The attention layer (and the orthonormal activation function defined for it) was specifically designed to generalize transformers to symplectic data. Usually a self-attention layer takes the following form:

\[Z := [z^{(1)}, \ldots, z^{(T)}] \mapsto Z\mathrm{softmax}((P^QZ)^T(P^KZ)),\]

where we left out the linear mapping onto the values $P^V$.

The idea behind this is that we can perform a non-linear re-weighting of the columns of $Z$ by multiplying with a $Z$-dependent matrix from the right and therefore take the sequential nature of the data into account (which is not possible with standard feedforward neural networks). After the attention step the transformer applies a simple ResNet from the left.

What the softmax does is a vector-wise operation, i.e. it operates on each column of an input matrix $A = [a_1, \ldots, a_T]$. The result is a sequence of probability vectors $[p^{(1)}, \ldots, p^{(T)}]$ for which

\[\sum_{i=1}^Tp^{(j)}_i=1\quad\forall{}j\in\{1,\dots,T\}.\]

What we want to construct is a symplectic transformation that is transformer-like. For this we modify the attention layer the following way:

\[Z := [z^{(1)}, \ldots, z^{(T)}] \mapsto Z\sigma((P^QZ)^T(P^KZ)),\]

where $\sigma(A)=\exp(\mathtt{upper\_triangular\_asymmetrize}(A))$ and

\[[\mathtt{upper\_triangular\_asymmetrize}(A)]_{ij} = \begin{cases} a_{ij} & \text{if $i<j$} \\ -a_{ji} & \text{if $i>j$} \\ 0 & \text{else.}\end{cases}\]
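To illustrate this definition, here is a short sketch in plain Julia (the function name mirrors the notation above; it is not the package implementation). It checks that the resulting matrix exponential has orthonormal columns, since the asymmetrized matrix is skew-symmetric:

using LinearAlgebra

function upper_triangular_asymmetrize(A::AbstractMatrix)
    U = triu(A, 1)    # strictly upper-triangular part of A
    U - U'            # a_ij for i < j, -a_ji for i > j, 0 on the diagonal
end

σ(A) = exp(upper_triangular_asymmetrize(A))    # matrix exponential

A = rand(4, 4)
Λ = σ(A)
@assert Λ' * Λ ≈ one(Λ)    # Λ has orthonormal columns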

This has as a consequence that the matrix $\Lambda(Z) := \sigma((P^QZ)^T(P^KZ))$ is orthonormal and hence preserves an extended symplectic structure. To make this more clear, consider that the transformer maps sequences of vectors to sequences of vectors, i.e. $V\times\cdots\times{}V \ni [z^1, \ldots, z^T] \mapsto [\hat{z}^1, \ldots, \hat{z}^T]$. We can define a symplectic structure on $V\times\cdots\times{}V$ by rearranging $[z^1, \ldots, z^T]$ into a vector. We do this in the following way:

\[\tilde{Z} = \begin{pmatrix} q^{(1)}_1 \\ q^{(2)}_1 \\ \cdots \\ q^{(T)}_1 \\ q^{(1)}_2 \\ \cdots \\ q^{(T)}_d \\ p^{(1)}_1 \\ p^{(2)}_1 \\ \cdots \\ p^{(T)}_1 \\ p^{(1)}_2 \\ \cdots \\ p^{(T)}_d \end{pmatrix}.\]

The symplectic structure on this big space is then:

\[\mathbb{J}=\begin{pmatrix} \mathbb{O}_{dT} & \mathbb{I}_{dT} \\ -\mathbb{I}_{dT} & \mathbb{O}_{dT} \end{pmatrix}.\]

Multiplying with the matrix $\Lambda(Z)$ from the right onto $[z^1, \ldots, z^T]$ corresponds to applying the sparse matrix

\[\tilde{\Lambda}(Z)=\left[ @@ -8,4 +8,4 @@ \vdots & \ddots & \vdots \\ \mathbb{O}_T & \cdots & \Lambda(Z) \end{array} -\right]\]

from the left onto the big vector.

+\right]\]

from the left onto the big vector.

diff --git a/latest/layers/multihead_attention_layer/index.html b/latest/layers/multihead_attention_layer/index.html index a5c930e4a..f0729c9eb 100644 --- a/latest/layers/multihead_attention_layer/index.html +++ b/latest/layers/multihead_attention_layer/index.html @@ -1,2 +1,2 @@ -Multihead Attention · GeometricMachineLearning.jl

Multihead Attention Layer

In order to arrive from the attention layer at the multihead attention layer we only have to do a simple modification:

References

  • Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
+Multihead Attention · GeometricMachineLearning.jl

Multihead Attention Layer

In order to arrive from the attention layer at the multihead attention layer we have to do a few modifications:

Note that these neural networks were originally developed for natural language processing (NLP) tasks and the terminology used here bears some resemblance to that field. The input to a multihead attention layer typically comprises three components:

  1. Values $V\in\mathbb{R}^{n\times{}T}$: a matrix whose columns are value vectors,
  2. Queries $Q\in\mathbb{R}^{n\times{}T}$: a matrix whose columns are query vectors,
  3. Keys $K\in\mathbb{R}^{n\times{}T}$: a matrix whose columns are key vectors.

Regular attention performs the following operation:

\[\mathrm{Attention}(Q,K,V) = V\mathrm{softmax}(\frac{K^TQ}{\sqrt{n}}),\]

where $n$ is the dimension of the vectors in $V$, $Q$ and $K$. The softmax activation function here acts column-wise, so it can be seen as a transformation $\mathrm{softmax}:\mathbb{R}^{T}\to\mathbb{R}^T$ with $[\mathrm{softmax}(v)]_i = e^{v_i}/\left(\sum_{j=1}^Te^{v_j}\right)$. The $K^TQ$ term is a similarity matrix between the queries and the keys.
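A minimal sketch of this operation in plain Julia (illustrative only, not the package implementation):

n, T = 4, 6
V, Q, K = rand(n, T), rand(n, T), rand(n, T)

# column-wise softmax: softmax: ℝᵀ → ℝᵀ applied to every column
function softmax_cols(A)
    E = exp.(A)
    E ./ sum(E; dims = 1)
end

attention(Q, K, V) = V * softmax_cols(K' * Q / sqrt(n))

size(attention(Q, K, V))    # (n, T): same size as the input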

The transformer contains a self-attention mechanism, i.e. it takes an input $X$ and then transforms it linearly to $V$, $Q$ and $K$, i.e. $V = P^VX$, $Q = P^QX$ and $K = P^KX$. What distinguishes the multihead attention layer from the single-head attention layer is that there is not just one $P^V$, $P^Q$ and $P^K$, but several: one for each head of the multihead attention layer. After computing the individual values, queries and keys, and after applying the softmax, the outputs are concatenated in order to obtain an array of the same size as the input array:

Here the various $P$ matrices can be interpreted as projections onto lower-dimensional subspaces, hence the letter $P$. Because of this interpretation as projection matrices onto smaller spaces that should capture features in the input data, it makes sense to constrain these matrices to be elements of the Stiefel manifold.

References

  • Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
diff --git a/latest/library/index.html b/latest/library/index.html index 80609a254..8639a5cdc 100644 --- a/latest/library/index.html +++ b/latest/library/index.html @@ -1,4 +1,32 @@ -Library · GeometricMachineLearning.jl

GeometricMachineLearning Library Functions

GeometricMachineLearning.AbstractRetractionType

AbstractRetraction is a type that comprises all retraction methods for manifolds. For every manifold layer one has to specify a retraction method that takes the layer and elements of the (global) tangent space.

source
GeometricMachineLearning.DataLoaderType

Data Loader is a struct that creates an instance based on a tensor (or different input format) and is designed to make training convenient.

Implemented: If the data loader is called with a single tensor, a batchsize and an outputsize, then the batch is drawn randomly in the relevant range and the output is assigned accordingly.

The fields of the struct are the following: - data: The input data with axes (i) system dimension, (ii) number of parameters and (iii) number of time steps. - batch: A tensor in which the current batch is stored. - targettensor: A tensor in which the current target is stored. - output: The tensor from which the output is drawn - this may be of type Nothing if the constructor is only called with one tensor. - sysdim: The ``dimension'' of the system, i.e. what is taken as input by a regular neural network. - seqlength: The length batches should have. - batchsize: - outputsize: The size of the second axis of the output tensor (predictionwindow, outputsize=1 in most cases) - nparams: The number of parameters that are present in the data set (length of second axis). - ntimesteps: Number of time steps (length of third axis).

For drawing the batch, the sampling is done over nparams and ntimesteps (here seqlength and output_size are also taken into account).

TODO: Implement DataLoader that works well with GeometricEnsembles etc.

source
GeometricMachineLearning.OptimizerType

Optimizer struct that stores the 'method' (i.e. Adam with corresponding hyperparameters), the cache and the optimization step.

It takes as input an optimization method and the parameters of a network.

source
GeometricMachineLearning.SymmetricMatrixType

A SymmetricMatrix is a matrix | a S | | S b |

The first index is the row index, the second one the column index.

If the constructor is called with a matrix as input it returns a symmetric matrix via the projection A ↦ .5*(A + Aᵀ). This is a projection defined via the canonical metric (A,B) ↦ tr(AᵀB).

TODO: Overload Adjoint operation for SymmetricMatrix!! (Aᵀ = A)

source
GeometricMachineLearning.SympNetLayerType

Implements the various layers from the SympNet paper: (https://www.sciencedirect.com/science/article/abs/pii/S0893608020303063). Its components are of the form:

\[\begin{pmatrix} +Library · GeometricMachineLearning.jl

GeometricMachineLearning Library Functions

GeometricMachineLearning.AbstractRetractionType

AbstractRetraction is a type that comprises all retraction methods for manifolds. For every manifold layer one has to specify a retraction method that takes the layer and elements of the (global) tangent space.

source
GeometricMachineLearning.DataLoaderType

Data Loader is a struct that creates an instance based on a tensor (or different input format) and is designed to make training convenient.

Implemented:

If the data loader is called with a single tensor, a batchsize and an outputsize, then the batch is drawn randomly in the relevant range and the output is assigned accordingly.

The fields of the struct are the following:

   - data: The input data with axes (i) system dimension, (ii) number of parameters and (iii) number of time steps.
+
+   - batch: A tensor in which the current batch is stored.
+
+   - target_tensor: A tensor in which the current target is stored.
+
+   - output: The tensor from which the output is drawn - this may be of type Nothing if the constructor is only called with one tensor.
+
+   - sys_dim: The ``dimension'' of the system, i.e. what is taken as input by a regular neural network.
+
+   - seq_length: The length batches should have.
+
+   - batch_size:
+
+   - output_size: The size of the second axis of the output tensor (prediction_window, output_size=1 in most cases)
+
+   - n_params: The number of parameters that are present in the data set (length of second axis)
+
+   - n_time_steps: Number of time steps (length of third axis)

For drawing the batch, the sampling is done over n_params and n_time_steps (here seq_length and output_size are also taken into account).

TODO: Implement DataLoader that works well with GeometricEnsembles etc.

source
GeometricMachineLearning.OptimizerType

Optimizer struct that stores the 'method' (i.e. Adam with corresponding hyperparameters), the cache and the optimization step.

It takes as input an optimization method and the parameters of a network.

source
GeometricMachineLearning.PSDLayerType

This is a PSD-like layer used for symplectic autoencoders. One layer has the following shape:

\[A = \begin{pmatrix} \Phi & \mathbb{O} \\ \mathbb{O} & \Phi \end{pmatrix},\]

where $\Phi$ is an element of the regular Stiefel manifold.

source
GeometricMachineLearning.SymmetricMatrixType

A SymmetricMatrix is a matrix | a S | | S b |

The first index is the row index, the second one the column index.

If the constructor is called with a matrix as input it returns a symmetric matrix via the projection A ↦ .5*(A + Aᵀ). This is a projection defined via the canonical metric (A,B) ↦ tr(AᵀB).

TODO: Overload Adjoint operation for SymmetricMatrix!! (Aᵀ = A)

source
GeometricMachineLearning.SympNetLayerType

Implements the various layers from the SympNet paper: (https://www.sciencedirect.com/science/article/abs/pii/S0893608020303063). Its components are of the form:

\[\begin{pmatrix} I & \nabla{}V \\ 0 & I -\end{pmatrix},\]

with $V(p) = \sum_ia_i\Sigma(\sum_jk_{ij}p_j+b_i)$, where $\Sigma$ is the antiderivative of the activation function $\sigma$ (one-layer neural network). Such layers are by construction symplectic.

For the linear layer, the activation and the bias are left out, and for the activation layer K and b are left out!

source
GeometricMachineLearning.TransformerMethod

The architecture for a "transformer encoder" is essentially taken from arXiv:2010.11929, but with the difference that 𝐧𝐨 layer normalization is employed. This is because we still need to find a generalization of layer normalization to manifolds.

source
GeometricMachineLearning.assign_batch_kernel!Method

Takes as input a "batch tensor" (to which the data are assigned), the whole data tensor and two vectors, "params" and "time_steps", that include the specific parameters and time steps we want to assign.

Note that this assigns sequential data, e.g. for processing by a transformer.

source
GeometricMachineLearning.draw_batch!Method

This function draws random time steps and parameters and, based on these, assigns the batch and the output.

For ODE DataLoader: (i) batch input tensor to be written on (ii) batch output tensor to be written on (iii) data tensor

For MNIST DataLoader: (i) batch input tensor to be written on (ii) batch output tensor to be written on (iii) data tensor (iv) target tensor

source
GeometricMachineLearning.train!Function
train!(...)

Performs training of a neural network on data using a given training method.

Different ways of use:

train!(neuralnetwork, data, optimizer = GradientOptimizer(1e-2), training_method; nruns = 1000, batch_size = default(data, type), showprogress = false )

Arguments

  • neuralnetwork::LuxNeuralNetwork : the neural network using LuxBackend
  • data : the data (see TrainingData)
  • optimizer = GradientOptimizer: the optimization method (see Optimizer)
  • training_method : specify the loss function used
  • nruns : number of iterations through the process (with default value)
  • batch_size : size of batch of data used for each step
source
GeometricMachineLearning.train!Method
train!(neuralnetwork, data, optimizer, training_method; nruns = 1000, batch_size, showprogress = false )

Arguments

  • neuralnetwork::LuxNeuralNetwork : the neural network using LuxBackend
  • data::AbstractTrainingData : the data
  • ``
source
+\end{pmatrix},\]

with $V(p) = \sum_ia_i\Sigma(\sum_jk_{ij}p_j+b_i)$, where $\Sigma$ is the antiderivative of the activation function $\sigma$ (one-layer neural network). Such layers are by construction symplectic.

For the linear layer, the activation and the bias are left out, and for the activation layer K and b are left out!

source
GeometricMachineLearning.TransformerMethod

The architecture for a "transformer encoder" is essentially taken from arXiv:2010.11929, but with the difference that 𝐧𝐨 layer normalization is employed. This is because we still need to find a generalization of layer normalization to manifolds.

source
GeometricMachineLearning.assign_batch_kernel!Method

Takes as input a "batch tensor" (to which the data are assigned), the whole data tensor and two vectors, "params" and "time_steps", that include the specific parameters and time steps we want to assign.

Note that this assigns sequential data, e.g. for processing by a transformer.

source
GeometricMachineLearning.draw_batch!Method

This function draws random time steps and parameters and, based on these, assigns the batch and the output.

For ODE DataLoader: (i) batch input tensor to be written on (ii) batch output tensor to be written on (iii) data tensor

For MNIST DataLoader: (i) batch input tensor to be written on (ii) batch output tensor to be written on (iii) data tensor (iv) target tensor

source
GeometricMachineLearning.train!Function
train!(...)

Performs training of a neural network on data using a given training method.

Different ways of use:

train!(neuralnetwork, data, optimizer = GradientOptimizer(1e-2), training_method; nruns = 1000, batch_size = default(data, type), showprogress = false )

Arguments

  • neuralnetwork::LuxNeuralNetwork : the neural network using LuxBackend
  • data : the data (see TrainingData)
  • optimizer = GradientOptimizer: the optimization method (see Optimizer)
  • training_method : specify the loss function used
  • nruns : number of iterations through the process (with default value)
  • batch_size : size of batch of data used for each step
source
GeometricMachineLearning.train!Method
train!(neuralnetwork, data, optimizer, training_method; nruns = 1000, batch_size, showprogress = false )

Arguments

  • neuralnetwork::LuxNeuralNetwork : the neural network using LuxBackend
  • data::AbstractTrainingData : the data
  • ``
source
diff --git a/latest/manifolds/grassmann_manifold/index.html b/latest/manifolds/grassmann_manifold/index.html index 5fce4c8df..e248bd78a 100644 --- a/latest/manifolds/grassmann_manifold/index.html +++ b/latest/manifolds/grassmann_manifold/index.html @@ -1,9 +1,9 @@ -Grassmann · GeometricMachineLearning.jl

Grassmann Manifold

(The description of the Grassmann manifold is based on that of the Stiefel manifold, so this should be read first.)

An element of the Grassmann manifold $G(n,N)$ is a vector subspace $\subset\mathbb{R}^N$ of dimension $n$, and each such subspace can be represented by a full-rank matrix $A\in\mathbb{R}^{N\times{}n}$; the full space takes the form $G(n,N) = \mathbb{R}^{N\times{}n}/\sim$ where the equivalence relation is $A\sim{}B \iff \exists{}C\in\mathbb{R}^{n\times{}n}\text{ s.t. }AC = B$. One can find a parametrization of the manifold in the following way: because the matrix $A$ has full rank, it must contain $n$ linearly independent rows: $i_1, \ldots, i_n$. For simplicity assume that $i_1 = 1, i_2=2, \ldots, i_n=n$ and call the matrix made up of these rows $C$. Then the mapping to the coordinate chart is $A\mapsto{}AC^{-1}$, and the last $N-n$ rows of $AC^{-1}$ are the coordinates.

The tangent space for this element can then be represented through matrices:

\[\begin{pmatrix} +Grassmann · GeometricMachineLearning.jl

Grassmann Manifold

(The description of the Grassmann manifold is based on that of the Stiefel manifold, so this should be read first.)

An element of the Grassmann manifold $G(n,N)$ is a vector subspace $\subset\mathbb{R}^N$ of dimension $n$, and each such subspace can be represented by a full-rank matrix $A\in\mathbb{R}^{N\times{}n}$; the full space takes the form $G(n,N) = \mathbb{R}^{N\times{}n}/\sim$ where the equivalence relation is $A\sim{}B \iff \exists{}C\in\mathbb{R}^{n\times{}n}\text{ s.t. }AC = B$. One can find a parametrization of the manifold in the following way: because the matrix $A$ has full rank, it must contain $n$ linearly independent rows: $i_1, \ldots, i_n$. For simplicity assume that $i_1 = 1, i_2=2, \ldots, i_n=n$ and call the matrix made up of these rows $C$. Then the mapping to the coordinate chart is $A\mapsto{}AC^{-1}$, and the last $N-n$ rows of $AC^{-1}$ are the coordinates.

The tangent space for this element can then be represented through matrices:

\[\begin{pmatrix} 0 & \cdots & 0 \\ \cdots & \cdots & \cdots \\ 0 & \cdots & 0 \\ a_{11} & \cdots & a_{1n} \\ \cdots & \cdots & \cdots \\ a_{(N-n)1} & \cdots & a_{(N-n)n} -\end{pmatrix}.\]

The Grassmann manifold can also be seen as the Stiefel manifold modulo an equivalence relation. This leads to the following (which is used for optimization):

\[\mathfrak{g}^\mathrm{hor} = \mathfrak{g}^{\mathrm{hor},E} = \left\{\begin{pmatrix} 0 & -B^T \\ B & 0 \end{pmatrix}: \text{$B$ arbitrary}\right\}.\]

+\end{pmatrix}.\]

The Grassmann manifold can also be seen as the Stiefel manifold modulo an equivalence relation. This leads to the following (which is used for optimization):

\[\mathfrak{g}^\mathrm{hor} = \mathfrak{g}^{\mathrm{hor},E} = \left\{\begin{pmatrix} 0 & -B^T \\ B & 0 \end{pmatrix}: \text{$B$ arbitrary}\right\}.\]

diff --git a/latest/manifolds/homogeneous_spaces/index.html b/latest/manifolds/homogeneous_spaces/index.html index 531c9d59c..2d388645a 100644 --- a/latest/manifolds/homogeneous_spaces/index.html +++ b/latest/manifolds/homogeneous_spaces/index.html @@ -1,2 +1,2 @@ -Homogeneous Spaces · GeometricMachineLearning.jl

Homogeneous Spaces

Homogeneous spaces are manifolds $\mathcal{M}$ on which a Lie group $G$ acts transitively, i.e.

\[\forall X,Y\in\mathcal{M} \exists{}A\in{}G\text{ s.t. }AX = Y.\]

Now fix a distinguished element $E\in\mathcal{M}$. We can also establish an isomorphism between $\mathcal{M}$ and the quotient space $G/\sim$ with the equivalence relation:

\[A_1 \sim A_2 \iff A_1E = A_2E.\]

Note that this is independent of the chosen $E$.

The tangent spaces of $\mathcal{M}$ are of the form $T_Y\mathcal{M} = \mathfrak{g}\cdot{}Y$, i.e. can be fully described through its Lie algebra. Based on this we can perform a splitting of $\mathfrak{g}$ into two parts:

  1. The vertical component $\mathfrak{g}^{\mathrm{ver},Y}$ is the kernel of the map $\mathfrak{g}\to{}T_Y\mathcal{M}, V \mapsto VY$, i.e. $\mathfrak{g}^{\mathrm{ver},Y} = \{V\in\mathfrak{g}:VY = 0\}.$

  2. The horizontal component $\mathfrak{g}^{\mathrm{hor},Y}$ is the orthogonal complement of $\mathfrak{g}^{\mathrm{ver},Y}$ in $\mathfrak{g}$. It is isomorphic to $T_Y\mathcal{M}$.

We will refer to the mapping from $T_Y\mathcal{M}$ to $\mathfrak{g}^{\mathrm{hor}, Y}$ by $\Omega$. If we have now defined a metric $\langle\cdot,\cdot\rangle$ on $\mathfrak{g}$, then this induces a Riemannian metric on $\mathcal{M}$:

\[g_Y(\Delta_1, \Delta_2) = \langle\Omega(Y,\Delta_1),\Omega(Y,\Delta_2)\rangle\text{ for $\Delta_1,\Delta_2\in{}T_Y\mathcal{M}$.}\]

Two examples of homogeneous spaces implemented in GeometricMachineLearning are the Stiefel and the Grassmann manifold.

References

  • Frankel, Theodore. The geometry of physics: an introduction. Cambridge university press, 2011.
+Homogeneous Spaces · GeometricMachineLearning.jl

Homogeneous Spaces

Homogeneous spaces are manifolds $\mathcal{M}$ on which a Lie group $G$ acts transitively, i.e.

\[\forall X,Y\in\mathcal{M} \exists{}A\in{}G\text{ s.t. }AX = Y.\]

Now fix a distinguished element $E\in\mathcal{M}$. We can also establish an isomorphism between $\mathcal{M}$ and the quotient space $G/\sim$ with the equivalence relation:

\[A_1 \sim A_2 \iff A_1E = A_2E.\]

Note that this is independent of the chosen $E$.

The tangent spaces of $\mathcal{M}$ are of the form $T_Y\mathcal{M} = \mathfrak{g}\cdot{}Y$, i.e. can be fully described through its Lie algebra. Based on this we can perform a splitting of $\mathfrak{g}$ into two parts:

  1. The vertical component $\mathfrak{g}^{\mathrm{ver},Y}$ is the kernel of the map $\mathfrak{g}\to{}T_Y\mathcal{M}, V \mapsto VY$, i.e. $\mathfrak{g}^{\mathrm{ver},Y} = \{V\in\mathfrak{g}:VY = 0\}.$

  2. The horizontal component $\mathfrak{g}^{\mathrm{hor},Y}$ is the orthogonal complement of $\mathfrak{g}^{\mathrm{ver},Y}$ in $\mathfrak{g}$. It is isomorphic to $T_Y\mathcal{M}$.

We will refer to the mapping from $T_Y\mathcal{M}$ to $\mathfrak{g}^{\mathrm{hor}, Y}$ by $\Omega$. If we have now defined a metric $\langle\cdot,\cdot\rangle$ on $\mathfrak{g}$, then this induces a Riemannian metric on $\mathcal{M}$:

\[g_Y(\Delta_1, \Delta_2) = \langle\Omega(Y,\Delta_1),\Omega(Y,\Delta_2)\rangle\text{ for $\Delta_1,\Delta_2\in{}T_Y\mathcal{M}$.}\]

Two examples of homogeneous spaces implemented in GeometricMachineLearning are the Stiefel and the Grassmann manifold.
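As a concrete sketch for the Stiefel manifold (plain Julia; the explicit formula for $\Omega$ used below is the one employed for the canonical metric in Brantner (2023) and is stated here as an assumption, since it is derived elsewhere), one can check that $\Omega(Y,\Delta)$ is skew-symmetric and that it is indeed a lift, i.e. $\Omega(Y,\Delta)Y = \Delta$:

using LinearAlgebra

N, n = 6, 3
Y = Matrix(qr(rand(N, n)).Q)[:, 1:n]    # Y ∈ St(n, N), i.e. YᵀY = I

W = rand(N, N); W = W - W'              # a tangent vector Δ = WY ∈ T_Y St(n, N)
Δ = W * Y

# assumed horizontal lift for the Stiefel manifold (canonical metric)
Ω(Y, Δ) = (I - Y * Y' / 2) * Δ * Y' - Y * Δ' * (I - Y * Y' / 2)

B = Ω(Y, Δ)
@assert B' ≈ -B    # B ∈ 𝔰𝔬(N)
@assert B * Y ≈ Δ  # Ω(Y, Δ) reproduces Δ, so no information is lost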

References

  • Frankel, Theodore. The geometry of physics: an introduction. Cambridge university press, 2011.
diff --git a/latest/manifolds/stiefel_manifold/index.html b/latest/manifolds/stiefel_manifold/index.html index 7ab64b9be..63af7e611 100644 --- a/latest/manifolds/stiefel_manifold/index.html +++ b/latest/manifolds/stiefel_manifold/index.html @@ -1,5 +1,5 @@ -Stiefel · GeometricMachineLearning.jl

Stiefel manifold

The Stiefel manifold $St(n, N)$ is the space (a homogeneous space) of all orthonormal frames in $\mathbb{R}^{N\times{}n}$, i.e. matrices $Y\in\mathbb{R}^{N\times{}n}$ s.t. $Y^TY = \mathbb{I}_n$. It can also be seen as the special orthogonal group $SO(N)$ modulo an equivalence relation: $A\sim{}B\iff{}AE = BE$ for

\[E = \begin{bmatrix} +Stiefel · GeometricMachineLearning.jl

Stiefel manifold

The Stiefel manifold $St(n, N)$ is the space (a homogeneous space) of all orthonormal frames in $\mathbb{R}^{N\times{}n}$, i.e. matrices $Y\in\mathbb{R}^{N\times{}n}$ s.t. $Y^TY = \mathbb{I}_n$. It can also be seen as the special orthogonal group $SO(N)$ modulo an equivalence relation: $A\sim{}B\iff{}AE = BE$ for

\[E = \begin{bmatrix} \mathbb{I}_n \\ \mathbb{O} -\end{bmatrix}\in\mathcal{M},\]

which is the canonical element of the Stiefel manifold. In words: the first $n$ columns of $A$ and $B$ are the same.

The tangent space to the element $Y\in{}St(n,N)$ can easily be determined:

\[T_YSt(n,N)=\{\Delta:\Delta^TY + Y^T\Delta = 0\}.\]

The Lie algebra of $SO(N)$ is $\mathfrak{so}(N):=\{V\in\mathbb{R}^{N\times{}N}:V^T + V = 0\}$ and the canonical metric associated with it is simply $(V_1,V_2)\mapsto\frac{1}{2}\mathrm{Tr}(V_1^TV_2)$.

The Riemannian Gradient

For matrix manifolds (like the Stiefel manifold), the Riemannian gradient of a function can be easily determined computationally:

The Euclidean gradient of a function $L$ is equivalent to an element of the cotangent space $T^*_Y\mathcal{M}$ via:

\[\langle\nabla{}L,\cdot\rangle:T_Y\mathcal{M} \to \mathbb{R}, \Delta \mapsto \sum_{ij}[\nabla{}L]_{ij}[\Delta]_{ij} = \mathrm{Tr}(\nabla{}L^T\Delta).\]

We can then utilize the Riemannian metric on $\mathcal{M}$ to map the element from the cotangent space (i.e. $\nabla{}L$) to the tangent space. This element is called $\mathrm{grad}_{(\cdot)}L$ here. Explicitly, it is given by:

\[ \mathrm{grad}_YL = \nabla_YL - Y(\nabla_YL)^TY\]

rgrad

What was referred to as $\nabla{}L$ before can in practice be obtained with an AD routine. We then use the function rgrad to map this Euclidean gradient to an element of $T_YSt(n,N)$. This mapping has the property:

\[\mathrm{Tr}((\nabla{}L)^T\Delta) = g_Y(\mathtt{rgrad}(Y, \nabla{}L), \Delta) \quad \forall\Delta\in{}T_YSt(n,N),\]

where $g$ is the Riemannian metric.

+\end{bmatrix}\in\mathcal{M},\]

which is the canonical element of the Stiefel manifold. In words: the first $n$ columns of $A$ and $B$ are the same.

The tangent space to the element $Y\in{}St(n,N)$ can easily be determined:

\[T_YSt(n,N)=\{\Delta:\Delta^TY + Y^T\Delta = 0\}.\]

The Lie algebra of $SO(N)$ is $\mathfrak{so}(N):=\{V\in\mathbb{R}^{N\times{}N}:V^T + V = 0\}$ and the canonical metric associated with it is simply $(V_1,V_2)\mapsto\frac{1}{2}\mathrm{Tr}(V_1^TV_2)$.

The Riemannian Gradient

For matrix manifolds (like the Stiefel manifold), the Riemannian gradient of a function can be easily determined computationally:

The Euclidean gradient of a function $L$ is equivalent to an element of the cotangent space $T^*_Y\mathcal{M}$ via:

\[\langle\nabla{}L,\cdot\rangle:T_Y\mathcal{M} \to \mathbb{R}, \Delta \mapsto \sum_{ij}[\nabla{}L]_{ij}[\Delta]_{ij} = \mathrm{Tr}(\nabla{}L^T\Delta).\]

We can then utilize the Riemannian metric on $\mathcal{M}$ to map the element from the cotangent space (i.e. $\nabla{}L$) to the tangent space. This element is called $\mathrm{grad}_{(\cdot)}L$ here. Explicitly, it is given by:

\[ \mathrm{grad}_YL = \nabla_YL - Y(\nabla_YL)^TY\]

rgrad

What was referred to as $\nabla{}L$ before can in practice be obtained with an AD routine. We then use the function rgrad to map this Euclidean gradient to an element of $T_YSt(n,N)$. This mapping has the property:

\[\mathrm{Tr}((\nabla{}L)^T\Delta) = g_Y(\mathtt{rgrad}(Y, \nabla{}L), \Delta) \quad \forall\Delta\in{}T_YSt(n,N),\]

where $g$ is the Riemannian metric.
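A short sketch of rgrad in plain Julia (illustrative; the canonical metric written out below is an assumption, stated here only to check the defining property numerically):

using LinearAlgebra

N, n = 6, 3
Y = Matrix(qr(rand(N, n)).Q)[:, 1:n]    # Y ∈ St(n, N)
∇L = rand(N, n)                         # Euclidean gradient, e.g. from an AD routine

rgrad(Y, ∇L) = ∇L - Y * ∇L' * Y         # the formula from above

Δ = rgrad(Y, ∇L)
@assert norm(Δ' * Y + Y' * Δ) < 1e-10   # Δ ∈ T_Y St(n, N)

# assumed canonical metric: g_Y(Δ₁, Δ₂) = Tr(Δ₁ᵀ(I - ½YYᵀ)Δ₂)
g(Y, Δ₁, Δ₂) = tr(Δ₁' * (I - Y * Y' / 2) * Δ₂)

W = rand(N, N); W = W - W'              # a random tangent vector Δ₂ = WY
Δ₂ = W * Y
@assert tr(∇L' * Δ₂) ≈ g(Y, rgrad(Y, ∇L), Δ₂)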

diff --git a/latest/optimizers/adam_optimizer/index.html b/latest/optimizers/adam_optimizer/index.html index 956391908..4e5025426 100644 --- a/latest/optimizers/adam_optimizer/index.html +++ b/latest/optimizers/adam_optimizer/index.html @@ -1,2 +1,2 @@ -Adam Optimizer · GeometricMachineLearning.jl

The Adam Optimizer

The Adam optimizer is one of the most widely used (if not the most widely used) neural network optimizers. Like most modern neural network optimizers it contains a cache that is updated based on first-order gradient information; in a second step, the cache is used to compute a velocity estimate for updating the neural network weights.

Here we first describe the Adam algorithm for the case where all the weights are on a vector space and then show how to generalize this to the case where the weights are on a manifold.

All weights on a vector space

The cache of the Adam optimizer consists of first and second moments. The first moments $B_1$ store linear information about the current and previous gradients, and the second moments $B_2$ store quadratic information about current and previous gradients (all computed from a first-order gradient).

If all the weights are on a vector space, then we directly compute updates for $B_1$ and $B_2$:

  1. \[B_1 \gets ((\rho_1 - \rho_1^t)/(1 - \rho_1^t))\cdot{}B_1 + (1 - \rho_1)/(1 - \rho_1^t)\cdot{}\nabla{}L,\]

  2. \[B_2 \gets ((\rho_2 - \rho_2^t)/(1 - \rho_2^t))\cdot{}B_2 + (1 - \rho_2)/(1 - \rho_2^t)\cdot\nabla{}L\odot\nabla{}L,\]

    where $\odot:\mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R}^n$ is the Hadamard product: $[a\odot{}b]_i = a_ib_i$. $\rho_1$ and $\rho_2$ are hyperparameters. Their defaults, $\rho_1=0.9$ and $\rho_2=0.99$, are taken from (Goodfellow et al., 2016, page 301). After having updated the cache (i.e. $B_1$ and $B_2$) we compute a velocity (step 3) with which the parameters $Y_t$ are then updated (step 4).

  3. \[W_t\gets -\eta{}B_1/\sqrt{B_2 + \delta},\]

  4. \[Y_{t+1} \gets Y_t + W_t,\]

Here $\eta$ (with default 0.01) is the learning rate and $\delta$ (with default $3\cdot10^{-7}$) is a small constant that is added for stability. The division, square root and addition in step 3 are performed element-wise.

Weights on manifolds

The problem with generalizing Adam to manifolds is that the Hadamard product $\odot$ as well as the other element-wise operations ($/$, $\sqrt{}$ and $+$ in step 3 above) lack a clear geometric interpretation. In GeometricMachineLearning we get around this issue by utilizing a so-called global tangent space representation.

References

  • Goodfellow I, Bengio Y, Courville A. Deep learning[M]. MIT press, 2016.
+Adam Optimizer · GeometricMachineLearning.jl

The Adam Optimizer

The Adam optimizer is one of the most widely used (if not the most widely used) neural network optimizers. Like most modern neural network optimizers it contains a cache that is updated based on first-order gradient information; in a second step, the cache is used to compute a velocity estimate for updating the neural network weights.

Here we first describe the Adam algorithm for the case where all the weights are on a vector space and then show how to generalize this to the case where the weights are on a manifold.

All weights on a vector space

The cache of the Adam optimizer consists of first and second moments. The first moments $B_1$ store linear information about the current and previous gradients, and the second moments $B_2$ store quadratic information about current and previous gradients (all computed from a first-order gradient).

If all the weights are on a vector space, then we directly compute updates for $B_1$ and $B_2$:

  1. \[B_1 \gets ((\rho_1 - \rho_1^t)/(1 - \rho_1^t))\cdot{}B_1 + (1 - \rho_1)/(1 - \rho_1^t)\cdot{}\nabla{}L,\]

  2. \[B_2 \gets ((\rho_2 - \rho_2^t)/(1 - \rho_2^t))\cdot{}B_2 + (1 - \rho_2)/(1 - \rho_2^t)\cdot\nabla{}L\odot\nabla{}L,\]

    where $\odot:\mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R}^n$ is the Hadamard product: $[a\odot{}b]_i = a_ib_i$. $\rho_1$ and $\rho_2$ are hyperparameters. Their defaults, $\rho_1=0.9$ and $\rho_2=0.99$, are taken from (Goodfellow et al., 2016, page 301). After having updated the cache (i.e. $B_1$ and $B_2$) we compute a velocity (step 3) with which the parameters $Y_t$ are then updated (step 4).

  3. \[W_t\gets -\eta{}B_1/\sqrt{B_2 + \delta},\]

  4. \[Y_{t+1} \gets Y_t + W_t,\]

Here $\eta$ (with default 0.01) is the learning rate and $\delta$ (with default $3\cdot10^{-7}$) is a small constant that is added for stability. The division, square root and addition in step 3 are performed element-wise.
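A minimal sketch of steps 1 to 4 in plain Julia (illustrative only; in the package this is handled by the Optimizer and its cache):

# one Adam update for weights Y on a vector space; t is the current time step
function adam_step!(Y, B₁, B₂, ∇L, t; ρ₁ = 0.9, ρ₂ = 0.99, η = 0.01, δ = 3e-7)
    @. B₁ = (ρ₁ - ρ₁^t) / (1 - ρ₁^t) * B₁ + (1 - ρ₁) / (1 - ρ₁^t) * ∇L        # step 1
    @. B₂ = (ρ₂ - ρ₂^t) / (1 - ρ₂^t) * B₂ + (1 - ρ₂) / (1 - ρ₂^t) * ∇L * ∇L   # step 2
    W = @. -η * B₁ / sqrt(B₂ + δ)                                             # step 3
    Y .+= W                                                                   # step 4
    Y
end

# one step on the toy loss L(Y) = ½‖Y‖², for which ∇L = Y
Y  = rand(5)
B₁ = zero(Y); B₂ = zero(Y)
adam_step!(Y, B₁, B₂, copy(Y), 1)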

Weights on manifolds

The problem with generalizing Adam to manifolds is that the Hadamard product $\odot$ as well as the other element-wise operations ($/$, $\sqrt{}$ and $+$ in step 3 above) lack a clear geometric interpretation. In GeometricMachineLearning we get around this issue by utilizing a so-called global tangent space representation.

References

  • Goodfellow I, Bengio Y, Courville A. Deep learning[M]. MIT press, 2016.
diff --git a/latest/optimizers/manifold_related/cayley/index.html b/latest/optimizers/manifold_related/cayley/index.html new file mode 100644 index 000000000..6db3a3bbf --- /dev/null +++ b/latest/optimizers/manifold_related/cayley/index.html @@ -0,0 +1,16 @@ + +Cayley Retraction · GeometricMachineLearning.jl

The Cayley Retraction

The Cayley transformation is one of the most popular retractions. For several matrix Lie groups it is a mapping from the Lie algebra $\mathfrak{g}$ onto the Lie group $G$. The Cayley retraction reads:

\[ \mathrm{Cayley}(C) = \left(\mathbb{I} -\frac{1}{2}C\right)^{-1}\left(\mathbb{I} +\frac{1}{2}C\right).\]

This is easily checked to be a retraction, i.e. $\mathrm{Cayley}(\mathbb{O}) = \mathbb{I}$ and $\frac{\partial}{\partial{}t}\Big|_{t=0}\mathrm{Cayley}(tC) = C$.
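As a quick sketch in plain Julia (illustrative; the package works with the structured form of $\mathfrak{g}^\mathrm{hor}$ discussed below rather than with a dense inverse):

using LinearAlgebra

cayley(C) = inv(I - C / 2) * (I + C / 2)

N = 6
C = rand(N, N); C = C - C'           # C ∈ 𝔰𝔬(N)

O = cayley(C)
@assert O' * O ≈ one(C)              # the Cayley transform of a skew-symmetric matrix is orthogonal
@assert cayley(zero(C)) ≈ one(C)     # Cayley(𝕆) = 𝕀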

What we need in practice is not the computation of the Cayley transform of an arbitrary matrix, but the Cayley transform of an element of $\mathfrak{g}^\mathrm{hor}$, the global tangent space representation.

The elements of $\mathfrak{g}^\mathrm{hor}$ can be written as:

\[C = \begin{bmatrix} + A & -B^T \\ + B & \mathbb{O} +\end{bmatrix} = \begin{bmatrix} \frac{1}{2}A & \mathbb{I} \\ B & \mathbb{O} \end{bmatrix} \begin{bmatrix} \mathbb{I} & \mathbb{O} \\ \frac{1}{2}A & -B^T \end{bmatrix},\]

where the second expression exploits the sparse structure of the array, i.e. it is a product of an $N\times2n$ matrix with a $2n\times{}N$ matrix. We can hence use the Sherman-Morrison-Woodbury formula to obtain:

\[(\mathbb{I} - \frac{1}{2}UV)^{-1} = \mathbb{I} + \frac{1}{2}U(\mathbb{I} - \frac{1}{2}VU)^{-1}V\]

So what we have to invert is the term

\[\mathbb{I} - \frac{1}{2}\begin{bmatrix} \mathbb{I} & \mathbb{O} \\ \frac{1}{2}A & -B^T \end{bmatrix}\begin{bmatrix} \frac{1}{2}A & \mathbb{I} \\ B & \mathbb{O} \end{bmatrix} = +\begin{bmatrix} \mathbb{I} - \frac{1}{4}A & - \frac{1}{2}\mathbb{I} \\ \frac{1}{2}B^TB - \frac{1}{8}A^2 & \mathbb{I} - \frac{1}{4}A \end{bmatrix}.\]

The whole Cayley transform is then:

\[\left(\mathbb{I} + \frac{1}{2}\begin{bmatrix} \frac{1}{2}A & \mathbb{I} \\ B & \mathbb{O} \end{bmatrix} \begin{bmatrix} \mathbb{I} - \frac{1}{4}A & - \frac{1}{2}\mathbb{I} \\ \frac{1}{2}B^TB - \frac{1}{8}A^2 & \mathbb{I} - \frac{1}{4}A \end{bmatrix}^{-1} \begin{bmatrix} \mathbb{I} & \mathbb{O} \\ \frac{1}{2}A & -B^T \end{bmatrix} \right)\left( E + \frac{1}{2}\begin{bmatrix} \frac{1}{2}A & \mathbb{I} \\ B & \mathbb{O} \end{bmatrix} \begin{bmatrix} \mathbb{I} \\ \frac{1}{2}A \end{bmatrix}\ \right) = \\ + +E + \frac{1}{2}\begin{bmatrix} \frac{1}{2}A & \mathbb{I} \\ B & \mathbb{O} \end{bmatrix}\left( + \begin{bmatrix} \mathbb{I} \\ \frac{1}{2}A \end{bmatrix} + + \begin{bmatrix} \mathbb{I} - \frac{1}{4}A & - \frac{1}{2}\mathbb{I} \\ \frac{1}{2}B^TB - \frac{1}{8}A^2 & \mathbb{I} - \frac{1}{4}A \end{bmatrix}^{-1}\left( + + \begin{bmatrix} \mathbb{I} \\ \frac{1}{2}A \end{bmatrix} + + \begin{bmatrix} \frac{1}{2}A \\ \frac{1}{4}A^2 - \frac{1}{2}B^TB \end{bmatrix} + + \right) + \right)\]

Note that for computational reasons we compute $\mathrm{Cayley}(C)E$ instead of just the Cayley transform (see the section on retractions).
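
The two identities used above (the factorization of $C$ and the Sherman-Morrison-Woodbury step) can be verified numerically with a few lines of Julia. This is an independent check, not the package's implementation; the point is that only a $2n\times2n$ matrix has to be inverted:

    using LinearAlgebra

    N, n = 10, 3
    A = rand(n, n); A = A - A'                     # skew-symmetric block
    B = rand(N - n, n)

    C = [A -B'; B zeros(N - n, N - n)]             # element of 𝔤ʰᵒʳ
    U = [A/2 I(n); B zeros(N - n, n)]              # N × 2n
    V = [I(n) zeros(n, N - n); A/2 -B']            # 2n × N
    @assert C ≈ U * V                              # the sparse factorization of C

    lhs = inv(I(N) - U * V / 2)                         # naive N×N inverse
    rhs = I(N) + U * inv(I(2n) - V * U / 2) * V / 2     # only a 2n×2n inverse
    @assert lhs ≈ rhs                              # Sherman-Morrison-Woodbury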

diff --git a/latest/optimizers/manifold_related/geodesic/index.html b/latest/optimizers/manifold_related/geodesic/index.html new file mode 100644 index 000000000..a9bed393a --- /dev/null +++ b/latest/optimizers/manifold_related/geodesic/index.html @@ -0,0 +1,2 @@ + +Geodesic Retraction · GeometricMachineLearning.jl

Geodesic Retraction

General retractions are approximations of the exponential map. In GeometricMachineLearning we can, instead of using an approximation, solve the geodesic equation exactly (up to numerical error) by specifying Geodesic() as the argument of layers that have manifold weights.
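
To illustrate that retractions approximate the exponential map, here is a small standalone Julia sketch (own code) comparing the exact geodesic map with the Cayley retraction for a small skew-symmetric matrix; the two differ only at third order in $B$.

    using LinearAlgebra

    X = rand(5, 5)
    B = 0.01 * (X - X')                          # small element of 𝔰𝔬(5)

    geodesic_map = exp(B)                        # exact geodesic (matrix exponential)
    cayley_map   = inv(I - B / 2) * (I + B / 2)  # Cayley approximation

    norm(geodesic_map - cayley_map)              # O(‖B‖³), i.e. tiny here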

diff --git a/latest/optimizers/manifold_related/global_sections/index.html b/latest/optimizers/manifold_related/global_sections/index.html index 585c0618a..5127da729 100644 --- a/latest/optimizers/manifold_related/global_sections/index.html +++ b/latest/optimizers/manifold_related/global_sections/index.html @@ -1,2 +1,2 @@ -Global Sections · GeometricMachineLearning.jl

+Global Sections · GeometricMachineLearning.jl

Global Sections

Global sections are needed for the generalization of Adam and other optimizers to homogeneous spaces. They are necessary to perform the two mappings represented by the horizontal and vertical red lines in the section on the general optimizer framework.

Computing the global section

In differential geometry a section is always associated with some bundle; in our case this bundle is $\pi:G\to\mathcal{M},A\mapsto{}AE$. A section is a mapping $\lambda:\mathcal{M}\to{}G$ for which $\pi$ is a left inverse, i.e. $\pi\circ\lambda = \mathrm{id}$.

For the Stiefel manifold $St(n, N)\subset\mathbb{R}^{N\times{}n}$ we compute the global section in the following way:

  1. Start with an element $Y\in{}St(n,N)$,
  2. Draw a random matrix $A\in\mathbb{R}^{N\times{}(N-n)}$,
  3. Remove the subspace spanned by $Y$ from the range of $A$: $A\gets{}A-YY^TA$,
  4. Compute a QR decomposition of $A$ and take as section $\lambda(Y) = [Y, Q_{[1:N, 1:(N-n)]}]$.

It is easy to check that $\lambda(Y)\in{}G=SO(N)$.
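
A standalone Julia sketch of steps 1 to 4 (the package provides GlobalSection for this, see the next paragraph); the final checks confirm orthogonality and the section property $\lambda(Y)E = Y$:

    using LinearAlgebra

    N, n = 8, 3
    Y = Matrix(qr(rand(N, n)).Q)[:, 1:n]         # step 1: Y with orthonormal columns

    A = rand(N, N - n)                           # step 2
    A = A - Y * (Y' * A)                         # step 3: remove span(Y) from range(A)
    Q = Matrix(qr(A).Q)[:, 1:(N - n)]            # step 4: orthonormal completion
    λY = [Y Q]

    @assert λY' * λY ≈ I(N)                      # λ(Y) is orthogonal
    @assert λY[:, 1:n] ≈ Y                       # section property: λ(Y)E = Y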

In GeometricMachineLearning, GlobalSection takes an element $Y\in{}St(n,N)\equiv$StiefelManifold{T} and returns an instance of GlobalSection{T, StiefelManifold{T}}. The application $O(N)\times{}St(n,N)\to{}St(n,N)$ is done with the functions apply_section! and apply_section.

Computing the global tangent space representation based on a global section

The output of the horizontal lift $\Omega$ is an element of $\mathfrak{g}^{\mathrm{hor},Y}$. This mapping satisfies $\Omega(Y, B{}Y) = B$ for $B\in\mathfrak{g}^{\mathrm{hor},Y}$, i.e. there is no information loss and no projection is performed. We can then map $B\in\mathfrak{g}^{\mathrm{hor},Y}$ to $\mathfrak{g}^\mathrm{hor}$ with $B\mapsto{}\lambda(Y)^{-1}B\lambda(Y)$.

The function global_rep performs both mappings at once[1], i.e. it takes an instance of GlobalSection and an element of $T_YSt(n,N)$, and then returns an element of $\frak{g}^\mathrm{hor}\equiv$StiefelLieAlgHorMatrix.

Optimization

The output of global_rep is then used for all the optimization steps.

References

  • Frankel, Theodore. The geometry of physics: an introduction. Cambridge University Press, 2011.
  • 1For computational reasons.
diff --git a/latest/optimizers/manifold_related/horizontal_lift/index.html b/latest/optimizers/manifold_related/horizontal_lift/index.html index 83faec79c..b9ae17513 100644 --- a/latest/optimizers/manifold_related/horizontal_lift/index.html +++ b/latest/optimizers/manifold_related/horizontal_lift/index.html @@ -1,2 +1,2 @@ -Horizontal Lift · GeometricMachineLearning.jl

+Horizontal Lift · GeometricMachineLearning.jl

The Horizontal Lift

For each element $Y\in\mathcal{M}$ we can perform a splitting $\mathfrak{g} = \mathfrak{g}^{\mathrm{hor}, Y}\oplus\mathfrak{g}^{\mathrm{ver}, Y}$, where the two subspaces are the horizontal and the vertical component of $\mathfrak{g}$ at $Y$ respectively. For homogeneous spaces: $T_Y\mathcal{M} = \mathfrak{g}\cdot{}Y$, i.e. every tangent space to $\mathcal{M}$ can be expressed through the application of the Lie algebra to the relevant element. The vertical component consists of those elements of $\mathfrak{g}$ which are mapped to the zero element of $T_Y\mathcal{M}$, i.e.

\[\mathfrak{g}^{\mathrm{ver}, Y} := \mathrm{ker}(\mathfrak{g}\to{}T_Y\mathcal{M}).\]

The orthogonal complement[1] of $\mathfrak{g}^{\mathrm{ver}, Y}$ is the horizontal component and is referred to by $\mathfrak{g}^{\mathrm{hor}, Y}$. This is naturally isomorphic to $T_Y\mathcal{M}$. For the Stiefel manifold the horizontal lift has the simple form:

\[\Omega(Y, V) = \left(\mathbb{I} - \frac{1}{2}YY^T\right)VY^T - YV^T\left(\mathbb{I} - \frac{1}{2}YY^T\right).\]

If the element $Y$ is the distinct element $E$, then the elements of $\mathfrak{g}^{\mathrm{hor},E}$ take a particularly simple form, see Global Tangent Space for a description of this.
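
The lift above can be checked numerically. The following standalone Julia sketch (own code, not the package routine) verifies that $\Omega(Y,V)$ is skew-symmetric and that applying it to $Y$ recovers $V$:

    using LinearAlgebra

    N, n = 10, 4
    Y = Matrix(qr(rand(N, n)).Q)[:, 1:n]          # Y ∈ St(n, N)
    W = rand(N, n)
    V = W - Y * (Y' * W + W' * Y) / 2             # project W onto T_Y St(n, N)

    Ω = (I - Y * Y' / 2) * V * Y' - Y * V' * (I - Y * Y' / 2)

    @assert Ω ≈ -Ω'                               # Ω(Y, V) ∈ 𝔰𝔬(N)
    @assert Ω * Y ≈ V                             # applying the lift to Y gives back V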

  • 1The orthogonal complement is taken with respect to a metric defined on $\mathfrak{g}$. For the case of $G=SO(N)$ and $\mathfrak{g}=\mathfrak{so}(N) = \{A:A+A^T =0\}$ this metric can be chosen as $(A_1,A_2)\mapsto{}\frac{1}{2}\mathrm{Tr}(A_1^TA_2)$.
diff --git a/latest/optimizers/manifold_related/retractions/index.html b/latest/optimizers/manifold_related/retractions/index.html index 90711db1a..aa5825c0b 100644 --- a/latest/optimizers/manifold_related/retractions/index.html +++ b/latest/optimizers/manifold_related/retractions/index.html @@ -1,2 +1,2 @@ -Retractions · GeometricMachineLearning.jl

+Retractions · GeometricMachineLearning.jl

Retractions

Classical Definition

Classically, retractions are defined as smooth maps

\[R: T\mathcal{M}\to\mathcal{M}:(x,v)\mapsto{}R_x(v)\]

such that each curve $c(t) := R_x(tv)$ satisfies $c(0) = x$ and $c'(0) = v$.

In GeometricMachineLearning

Retractions are maps from the horizontal component of the Lie algebra $\mathfrak{g}^\mathrm{hor}$ to the respective manifold.

For optimization in neural networks (almost always first order) we solve a gradient flow equation

\[\dot{W} = -\mathrm{grad}_WL, \]

where $\mathrm{grad}_WL$ is the Riemannian gradient of the loss function $L$ evaluated at position $W$.

If we deal with Euclidean spaces (vector spaces), then the Riemannian gradient is just the result of an AD routine and the solution of the equation above can be approximated with $W^{t+1} \gets W^t - \eta\nabla_{W^t}L$, where $\eta$ is the learning rate.

For manifolds, after we obtained the Riemannian gradient (see e.g. the section on Stiefel manifold), we have to solve a geodesic equation. This is a canonical ODE associated with any Riemannian manifold.

The general theory of Riemannian manifolds is rather complicated, but for the neural networks treated in GeometricMachineLearning, we only rely on optimization of matrix Lie groups and homogeneous spaces, which is much simpler.

For Lie groups each tangent space is isomorphic to its Lie algebra $\mathfrak{g}\equiv{}T_\mathbb{I}G$. The geodesic map from $\mathfrak{g}$ to $G$, for matrix Lie groups with bi-invariant Riemannian metric like $SO(N)$, is simply the application of the matrix exponential $\exp$. Alternatively this can be replaced by the Cayley transform (see (Absil et al, 2008)).

Starting from this basic map $\exp:\mathfrak{g}\to{}G$ we can build mappings for more complicated cases:

  1. General tangent space to a Lie group $T_AG$: The geodesic map for an element $V\in{}T_AG$ is simply $A\exp(A^{-1}V)$.

  2. Special tangent space to a homogeneous space $T_E\mathcal{M}$: For $V=BE\in{}T_E\mathcal{M}$ the exponential map is simply $\exp(B)E$.

  3. General tangent space to a homogeneous space $T_Y\mathcal{M}$ with $Y = AE$: For $\Delta=ABE\in{}T_Y\mathcal{M}$ the exponential map is simply $A\exp(B)E$. This is the general case which we deal with.

The general theory behind points 2. and 3. is discussed in chapter 11 of (O'Neill, 1983). The function retraction in GeometricMachineLearning performs $\mathfrak{g}^\mathrm{hor}\to\mathcal{M}$, which is the second of the above points. To get the third from the second point, we simply have to multiply with a matrix from the left. This step is done with apply_section and represented through the red vertical line in the diagram on the general optimizer framework.
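
A standalone Julia sketch of point 2 above (own code): for $S\in\mathfrak{so}(N)$ the curve $t\mapsto\exp(tS)E$ stays on the Stiefel manifold and its derivative at $t=0$ is $SE$.

    using LinearAlgebra

    N, n = 6, 2
    E = Matrix{Float64}(I, N, n)                  # the canonical element [𝕀; 𝕆]
    S = rand(N, N); S = S - S'                    # S ∈ 𝔰𝔬(N)

    Y = exp(S) * E
    @assert Y' * Y ≈ I(n)                         # exp(S)E ∈ St(n, N)

    t = 1e-6
    @assert norm((exp(t * S) * E - E) / t - S * E) < 1e-4   # c'(0) = SE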

Word of caution

The Lie group corresponding to the Stiefel manifold, $SO(N)$, has a bi-invariant Riemannian metric associated with it: $(B_1,B_2)\mapsto \mathrm{Tr}(B_1^TB_2)$. For other Lie groups (e.g. the symplectic group) the situation is slightly more difficult (see (Bendokat et al, 2021)).

References

  • Absil P A, Mahony R, Sepulchre R. Optimization algorithms on matrix manifolds[M]. Princeton University Press, 2008.

  • Bendokat T, Zimmermann R. The real symplectic Stiefel and Grassmann manifolds: metrics, geodesics and applications[J]. arXiv preprint arXiv:2108.12447, 2021.

  • O'Neill, Barrett. Semi-Riemannian geometry with applications to relativity. Academic press, 1983.

diff --git a/latest/reduced_order_modeling/autoencoder/index.html b/latest/reduced_order_modeling/autoencoder/index.html new file mode 100644 index 000000000..1abd6c173 --- /dev/null +++ b/latest/reduced_order_modeling/autoencoder/index.html @@ -0,0 +1,2 @@ + +POD and Autoencoders · GeometricMachineLearning.jl

Reduced Order Modeling and Autoencoders

Reduced order modeling is a data-driven technique that exploits the structure of parametric PDEs to make solving those PDEs easier.

Consider a parametric PDE written in the form: $F(z(\mu);\mu)=0$ where $z(\mu)$ evolves on an infinite-dimensional Hilbert space $V$.

In modeling any PDE we have to choose a discretization (particle discretization, finite element method, ...) of $V$, which will be denoted by $V_h$.

Solution manifold

To any parametric PDE we associate a solution manifold:

\[\mathcal{M} = \{z(\mu):F(z(\mu);\mu)=0, \mu\in\mathbb{P}\}.\]

In the image above a 2-dimensional solution manifold is visualized as a sub-manifold in 3-dimensional space. In general the embedding space is an infinite-dimensional function space.

As an example of this consider the 1-dimensional wave equation:

\[\partial_{tt}^2q(t,\xi;\mu) = \mu^2\partial_{\xi\xi}^2q(t,\xi;\mu)\text{ on }I\times\Omega,\]

where $I = (0,1)$ and $\Omega=(-1/2,1/2)$. As initial condition for the first derivative we have $\partial_tq(0,\xi;\mu) = -\mu\partial_\xi{}q_0(\xi;\mu)$ and furthermore $q(t,\xi;\mu)=0$ on the boundary (i.e. $\xi\in\{-1/2,1/2\}$).

The solution manifold is a 1-dimensional submanifold:

\[\mathcal{M} = \{(t, \xi)\mapsto{}q(t,\xi;\mu)=q_0(\xi-\mu{}t;\mu):\mu\in\mathbb{P}\subset\mathbb{R}\}.\]

If we provide an initial condition $q_0$, a parameter instance $\mu$ and a time $t$, then $\xi\mapsto{}q(t,\xi;\mu)$ will be the momentary solution. If we consider the time evolution of $q(t,\xi;\mu)$, then it evolves on a two-dimensional submanifold $\bar{\mathcal{M}} := \{\xi\mapsto{}q(t,\xi;\mu):t\in{}I,\mu\in\mathbb{P}\}$.
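
This can also be checked numerically. The following Julia sketch verifies the travelling-wave solution with central finite differences; the particular bump $q_0$ is chosen only for illustration and ignores the boundary condition:

    q₀(ξ) = exp(-50 * ξ^2)
    q(t, ξ, μ) = q₀(ξ - μ * t)                   # travelling wave

    μ, t, ξ, h = 0.6, 0.3, 0.1, 1e-4
    ∂tt = (q(t + h, ξ, μ) - 2 * q(t, ξ, μ) + q(t - h, ξ, μ)) / h^2
    ∂ξξ = (q(t, ξ + h, μ) - 2 * q(t, ξ, μ) + q(t, ξ - h, μ)) / h^2

    @assert abs(∂tt - μ^2 * ∂ξξ) < 1e-4          # the PDE holds up to discretization error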

General workflow

In reduced order modeling we aim to construct a mapping to a space that is close to this solution manifold. This is done through the following steps:

  1. Discretize the PDE.

  2. Solve the discretized PDE for a certain set of parameter instances $\mu\in\mathbb{P}$.

  3. Build a reduced basis with the data obtained from having solved the discretized PDE. This step consists of finding two mappings: the reduction $\mathcal{P}$ and the reconstruction $\mathcal{R}$.

The third step can be done with various machine learning (ML) techniques. Traditionally the most popular of these has been Proper orthogonal decomposition (POD), but in recent years autoencoders have also become a popular alternative (see (Fresca et al, 2021)).
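
For POD the two mappings are particularly simple. The following Julia sketch assumes the snapshots from step 2 are stored column-wise in a (here synthetic) matrix S; both the reduction and the reconstruction are given by the leading left singular vectors.

    using LinearAlgebra

    S = rand(100, 50)                 # 100-dimensional discretization, 50 snapshots
    r = 10                            # reduced dimension
    U = svd(S).U[:, 1:r]              # POD basis

    reduction(z)      = U' * z        # 𝒫 : ℝ¹⁰⁰ → ℝ¹⁰
    reconstruction(w) = U * w         # ℛ : ℝ¹⁰ → ℝ¹⁰⁰

    z = S[:, 1]
    w = reduction(z)
    @assert reduction(reconstruction(w)) ≈ w              # 𝒫 ∘ ℛ = id on the reduced space
    norm(reconstruction(reduction(z)) - z) / norm(z)      # relative projection error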

References

  • Fresca, Stefania, Luca Dede, and Andrea Manzoni. "A comprehensive deep learning-based approach to reduced order modeling of nonlinear time-dependent parametrized PDEs." Journal of Scientific Computing 87 (2021): 1-36.
diff --git a/latest/reduced_order_modeling/kolmogorov_n_width/index.html b/latest/reduced_order_modeling/kolmogorov_n_width/index.html new file mode 100644 index 000000000..80702f68d --- /dev/null +++ b/latest/reduced_order_modeling/kolmogorov_n_width/index.html @@ -0,0 +1,2 @@ + +Kolmogorov n-width · GeometricMachineLearning.jl

Kolmogorov $n$-width

The Kolmogorov $n$-width measures how well some set $\mathcal{M}$ (typically the solution manifold) can be approximated with a linear subspace:

\[d_n(\mathcal{M}) := \mathrm{inf}_{V_n\subset{}V;\,\mathrm{dim}V_n=n}\mathrm{sup}_{u\in\mathcal{M}}\mathrm{inf}_{v_n\in{}V_n}|| u - v_n ||_V,\]

with $\mathcal{M}\subset{}V$, where $V$ is a (typically infinite-dimensional) Banach space. For advection-dominated problems (among others) the decay of the Kolmogorov $n$-width is very slow, i.e. one has to pick $n$ very high in order to obtain useful approximations (see (Greif and Urban, 2019) and (Blickhan, 2023)).

In order to overcome this, techniques based on neural networks (see e.g. (Lee and Carlberg, 2020)) and optimal transport (see e.g. (Blickhan, 2023)) have been used.

References

  • Greif, Constantin, and Karsten Urban. "Decay of the Kolmogorov N-width for wave problems." Applied Mathematics Letters 96 (2019): 216-222.
  • Blickhan, Tobias. "A registration method for reduced basis problems using linear optimal transport." arXiv preprint arXiv:2304.14884 (2023).
  • Lee, Kookjin, and Kevin T. Carlberg. "Model reduction of dynamical systems on nonlinear manifolds using deep convolutional autoencoders." Journal of Computational Physics 404 (2020): 108973.
diff --git a/latest/reduced_order_modeling/symplectic_autoencoder/index.html b/latest/reduced_order_modeling/symplectic_autoencoder/index.html new file mode 100644 index 000000000..6857a55b4 --- /dev/null +++ b/latest/reduced_order_modeling/symplectic_autoencoder/index.html @@ -0,0 +1,12 @@ + +PSD and Symplectic Autoencoders · GeometricMachineLearning.jl

Symplectic Autoencoder

Symplectic Autoencoders are a type of neural network suitable for treating Hamiltonian parametrized PDEs with slowly decaying Kolmogorov $n$-width. They are based on proper symplectic decomposition (PSD) and symplectic neural networks (SympNets).

Hamiltonian Model Order Reduction

Hamiltonian PDEs are partial differential equations that, like their ODE counterparts, have a Hamiltonian associated with them. An example of this is the linear wave equation (see (Buchfink et al, 2023)) with Hamiltonian

\[\mathcal{H}(q, p; \mu) := \frac{1}{2}\int_\Omega\mu^2(\partial_\xi{}q(t,\xi;\mu))^2 + p(t,\xi;\mu)^2d\xi.\]

The PDE corresponding to this Hamiltonian can be obtained similarly as in the ODE case:

\[\partial_t{}q(t,\xi;\mu) = \frac{\delta{}\mathcal{H}}{\delta{}p} = p(t,\xi;\mu), \quad \partial_t{}p(t,\xi;\mu) = -\frac{\delta{}\mathcal{H}}{\delta{}q} = \mu^2\partial_{\xi{}\xi}q(t,\xi;\mu)\]

Symplectic Solution Manifold

As with regular parametric PDEs, we also associate a solution manifold with Hamiltonian PDEs. This is a finite-dimensional manifold on which the dynamics can be described through a Hamiltonian ODE.

Workflow for Symplectic ROM

As with any other reduced order modeling technique we first discretize the PDE. This should be done with a structure-preserving scheme, thus yielding a (high-dimensional) Hamiltonian ODE as a result. Discretizing the wave equation above with finite differences yields a Hamiltonian system:

\[\mathcal{H}_\mathrm{discr}(z(t;\mu);\mu) := \frac{1}{2}z(t;\mu)^T\begin{bmatrix} -\mu^2D_{\xi{}\xi} & \mathbb{O} \\ \mathbb{O} & \mathbb{I} \end{bmatrix} z(t;\mu).\]

In Hamiltonian reduced order modeling we try to find a symplectic submanifold of the solution space[1] that captures the dynamics of the full system as well as possible.

Similar to the regular PDE case we again build an encoder $\Psi^\mathrm{enc}$ and a decoder $\Psi^\mathrm{dec}$; but now both these mappings are required to be symplectic!

Concretely this means:

  1. The encoder is a mapping from a high-dimensional symplectic space to a low-dimensional symplectic space, i.e. $\Psi^\mathrm{enc}:\mathbb{R}^{2N}\to\mathbb{R}^{2n}$ such that $\nabla\Psi^\mathrm{enc}\mathbb{J}_{2N}(\nabla\Psi^\mathrm{enc})^T = \mathbb{J}_{2n}$.
  2. The decoder is a mapping from a low-dimensional symplectic space to a high-dimensional symplectic space, i.e. $\Psi^\mathrm{dec}:\mathbb{R}^{2n}\to\mathbb{R}^{2N}$ such that $(\nabla\Psi^\mathrm{dec})^T\mathbb{J}_{2N}\nabla\Psi^\mathrm{dec} = \mathbb{J}_{2n}$.

If these two maps are constrained to linear maps, then one can easily find good solutions with proper symplectic decomposition (PSD).

Proper Symplectic Decomposition

For PSD the two mappings $\Psi^\mathrm{enc}$ and $\Psi^\mathrm{dec}$ are constrained to be linear, orthonormal (i.e. $\Psi^T\Psi = \mathbb{I}$) and symplectic. The easiest way to enforce this is through the so-called cotangent lift:

\[\Psi_\mathrm{CL} = \begin{bmatrix} \Phi & \mathbb{O} \\ \mathbb{O} & \Phi \end{bmatrix},\]

where $\Phi\in{}St(n,N)\subset\mathbb{R}^{N\times{}n}$, i.e. $\Phi$ is an element of the Stiefel manifold. If the snapshot matrix is of the form:

\[M = \left[\begin{array}{c:c:c:c} \hat{q}_1(t_0) & \hat{q}_1(t_1) & \quad\ldots\quad & \hat{q}_1(t_f) \\ \hat{q}_2(t_0) & \hat{q}_2(t_1) & \ldots & \hat{q}_2(t_f) \\ \ldots & \ldots & \ldots & \ldots \\ \hat{q}_N(t_0) & \hat{q}_N(t_1) & \ldots & \hat{q}_N(t_f) \\ \hat{p}_1(t_0) & \hat{p}_1(t_1) & \ldots & \hat{p}_1(t_f) \\ \hat{p}_2(t_0) & \hat{p}_2(t_1) & \ldots & \hat{p}_2(t_f) \\ \ldots & \ldots & \ldots & \ldots \\ \hat{p}_{N}(t_0) & \hat{p}_{N}(t_1) & \ldots & \hat{p}_{N}(t_f) \end{array}\right],\]

then $\Phi$ can be computed in a very straight-forward manner:

  1. Rearrange the rows of the matrix $M$ such that we end up with an $N\times2(f+1)$ matrix: $\hat{M} := [M_q, M_p]$.
  2. Perform SVD: $\hat{M} = U\Sigma{}V^T$; set $\Phi\gets{}U\mathtt{[:,1:n]}$.

For details on the cotangent lift (and other methods for linear symplectic model reduction) consult (Peng and Mohseni, 2016).
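
A standalone Julia sketch of this construction with a synthetic snapshot matrix (not the package's implementation); the final checks verify that $\Psi_\mathrm{CL}$ is orthonormal and symplectic:

    using LinearAlgebra

    N, n, f = 20, 4, 30
    Mq, Mp = rand(N, f + 1), rand(N, f + 1)       # q- and p-parts of the snapshot matrix

    Mhat = [Mq Mp]                                # step 1: N × 2(f+1)
    Φ = svd(Mhat).U[:, 1:n]                       # step 2: leading left singular vectors

    Ψ = [Φ zeros(N, n); zeros(N, n) Φ]            # Ψ_CL, a 2N × 2n matrix
    J(m) = [zeros(m, m) I(m); -I(m) zeros(m, m)]  # Poisson matrix 𝕁 of size 2m × 2m

    @assert Ψ' * Ψ ≈ I(2n)                        # orthonormal: ΨᵀΨ = 𝕀
    @assert Ψ' * J(N) * Ψ ≈ J(n)                  # symplectic: Ψᵀ𝕁_(2N)Ψ = 𝕁_(2n)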

Symplectic Autoencoders

PSD suffers from the same shortcomings as regular POD: it is a linear map and the approximation space $\tilde{\mathcal{M}}= \{\Psi^\mathrm{dec}(z_r)\in\mathbb{R}^{2N}:z_r\in\mathbb{R}^{2n}\}$ is strictly linear. For problems with slowly-decaying Kolmogorov $n$-width this leads to very poor approximations.

In order to overcome this difficulty we use neural networks, more specifically SympNets, together with cotangent lift-like matrices. The resulting architecture, the symplectic autoencoder, is demonstrated in the following image:

So we alternate between SympNet and PSD layers. Because all the PSD layers are based on matrices $\Phi\in{}St(n,N)$ we have to optimize on the Stiefel manifold.

References

  • Buchfink, Patrick, Silke Glas, and Bernard Haasdonk. "Symplectic model reduction of Hamiltonian systems on nonlinear manifolds and approximation with weakly symplectic autoencoder." SIAM Journal on Scientific Computing 45.2 (2023): A289-A311.
  • Peng, Liqian, and Kamran Mohseni. "Symplectic model reduction of Hamiltonian systems." SIAM Journal on Scientific Computing 38.1 (2016): A1-A27.
  • 1The submanifold is: $\tilde{\mathcal{M}} = \{\Psi^\mathrm{dec}(z_r)\in\mathbb{R}^{2N}:z_r\in\mathbb{R}^{2n}\}$ where $z_r$ is the reduced state of the system.
      diff --git a/latest/search_index.js b/latest/search_index.js index 022acd549..a8bab6a77 100644 --- a/latest/search_index.js +++ b/latest/search_index.js @@ -1,3 +1,3 @@ var documenterSearchIndex = {"docs": -[{"location":"manifolds/stiefel_manifold/#Stiefel-manifold","page":"Stiefel","title":"Stiefel manifold","text":"","category":"section"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"The Stiefel manifold St(n N) is the space (a homogeneous space) of all orthonormal frames in mathbbR^Ntimesn, i.e. matrices YinmathbbR^Ntimesn s.t. Y^TY = mathbbI_n. It can also be seen as the special orthonormal group SO(N) modulo an equivalence relation: AsimBiffAE = BE for ","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"E = beginbmatrix\nmathbbI_n \nmathbbO\nendbmatrixinmathcalM","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"which is the canonical element of the Stiefel manifold. In words: the first n columns of A and B are the same.","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"The tangent space to the element YinSt(nN) can easily be determined: ","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"T_YSt(nN)=DeltaDelta^TY + Y^TDelta = 0","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"The Lie algebra of SO(N) is mathfrakso(N)=VinmathbbR^NtimesNV^T + V = 0 and the canonical metric associated with it is simply (V_1V_2)mapstofrac12mathrmTr(V_1^TV_2).","category":"page"},{"location":"manifolds/stiefel_manifold/#The-Riemannian-Gradient","page":"Stiefel","title":"The Riemannian Gradient","text":"","category":"section"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"For matrix manifolds (like the Stiefel manifold), the Riemannian gradient of a function can be easily determined computationally:","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"The Euclidean gradient of a function L is equivalent to an element of the cotangent space T^*_YmathcalM via: ","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"langlenablaLcdotrangleT_YmathcalM to mathbbR Delta mapsto sum_ijnablaL_ijDelta_ij = mathrmTr(nablaL^TDelta)","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"We can then utilize the Riemannian metric on mathcalM to map the element from the cotangent space (i.e. nablaL) to the tangent space. This element is called mathrmgrad_(cdot)L here. Explicitly, it is given by: ","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":" mathrmgrad_YL = nabla_YL - Y(nabla_YL)^TY","category":"page"},{"location":"manifolds/stiefel_manifold/#rgrad","page":"Stiefel","title":"rgrad","text":"","category":"section"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"What was referred to as nablaL before can in practice be obtained with an AD routine. We then use the function rgrad to map this Euclidean gradient to inT_YSt(nN). 
This mapping has the property: ","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"mathrmTr((nablaL)^TDelta) = g_Y(mathttrgrad(Y nablaL) Delta) forallDeltainT_YSt(nN)","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"and g is the Riemannian metric.","category":"page"},{"location":"optimizers/manifold_related/retractions/#Retractions","page":"Retractions","title":"Retractions","text":"","category":"section"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"Retractions are a map from the horizontal component of the Lie algebra mathfrakg^mathrmhor to the respective manifold.","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"For optimization in neural networks (almost always first order) we solve a gradient flow equation ","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"dotW = -mathrmgrad_WL ","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"where mathrmgrad_WL is the Riemannian gradient of the loss function L evaluated at position W.","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"If we deal with Euclidean spaces (vector spaces), then the Riemannian gradient is just the result of an AD routine and the solution of the equation above can be approximated with W^t+1 gets W^t - etanabla_W^tL, where eta is the learning rate. ","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"For manifolds, after we obtained the Riemannian gradient (see e.g. the section on Stiefel manifold), we have to solve a geodesic equation. This is a canonical ODE associated with any Riemannian manifold. ","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"The general theory of Riemannian manifolds is rather complicated, but for the neural networks treated in GeometricMachineLearning, we only rely on optimization of matrix Lie groups and homogeneous spaces, which is much simpler. ","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"For Lie groups each tangent space is isomorphic to its Lie algebra mathfrakgequivT_mathbbIG. The geodesic map from mathfrakg to G, for matrix Lie groups with bi-invariant Riemannian metric like SO(N), is simply the application of the matrix exponential exp. Alternatively this can be replaced by the Cayley transform (see (Absil et al, 2008).)","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"Starting from this basic map expmathfrakgtoG we can build mappings for more complicated cases: ","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"General tangent space to a Lie group T_AG: The geodesic map for an element VinT_AG is simply Aexp(A^-1V).\nSpecial tangent space to a homogeneous space T_EmathcalM: For V=BEinT_EmathcalM the exponential map is simply exp(B)E. 
\nGeneral tangent space to a homogeneous space T_YmathcalM with Y = AE: For Delta=ABEinT_YmathcalM the exponential map is simply Aexp(B)E. This is the general case which we deal with. ","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"The general theory behind points 2. and 3. is discussed in chapter 11 of (O'Neill, 1983). The function retraction in GeometricMachineLearning performs mathfrakg^mathrmhortomathcalM, which is the second of the above points. To get the third from the second point, we simply have to multiply with a matrix from the left. This step is done with apply_section and represented through the red vertical line in the diagram on the general optimizer framework.","category":"page"},{"location":"optimizers/manifold_related/retractions/#Word-of-caution","page":"Retractions","title":"Word of caution","text":"","category":"section"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"The Lie group corresponding to the Stiefel manifold SO(N) has a bi-invariant Riemannian metric associated with it: (B_1B_2)mapsto mathrmTr(B_1^TB_2). For other Lie groups (e.g. the symplectic group) the situation is slightly more difficult (see (Bendokat et al, 2021).)","category":"page"},{"location":"optimizers/manifold_related/retractions/#References","page":"Retractions","title":"References","text":"","category":"section"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"Absil P A, Mahony R, Sepulchre R. Optimization algorithms on matrix manifolds[M]. Princeton University Press, 2008.\nBendokat T, Zimmermann R. The real symplectic Stiefel and Grassmann manifolds: metrics, geodesics and applications[J]. arXiv preprint arXiv:2108.12447, 2021.\nO'Neill, Barrett. Semi-Riemannian geometry with applications to relativity. Academic press, 1983.","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/#Horizontal-component-of-the-Lie-algebra-\\mathfrak{g}","page":"Global Tangent Space","title":"Horizontal component of the Lie algebra mathfrakg","text":"","category":"section"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"What we use to optimize Adam (and other algorithms) to manifolds is a global tangent space representation of the homogeneous spaces. ","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"For the Stiefel manifold, the homogeneous space takes a simple form: ","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"B = beginbmatrix\n A -B^T \n B mathbbO\nendbmatrix","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"where AinmathbbR^ntimesn is skew-symmetric and BinmathbbR^Ntimesn is arbitary. 
In GeometricMachineLearning the struct StiefelLieAlgHorMatrix implements elements of this form.","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/#Theoretical-background","page":"Global Tangent Space","title":"Theoretical background","text":"","category":"section"},{"location":"arrays/stiefel_lie_alg_horizontal/#Vertical-and-horizontal-components","page":"Global Tangent Space","title":"Vertical and horizontal components","text":"","category":"section"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"The Stiefel manifold is a homogeneous space obtained from SO(N) by setting two matrices, whose first n columns conincide, equivalent. Another way of expressing this is: ","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"A_1 sim A_2 iff A_1E = A_2E","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"for ","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"E = beginbmatrix mathbbI mathbbOendbmatrix","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"The tangent space T_ESt(nN) can also be expressed that way:","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"T_ESt(nN) = mathfrakgcdotE = BEBinmathfrakg","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"The kernel of the mapping mathfrakgtoT_ESt(nN) BmapstoBE is referred to as mathfrakg^mathrmverE, the vertical component of the Lie algebra at E. It is clear that elements belonging to mathfrakg^mathrmverE are of the following form: ","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"beginbmatrix\nhatmathbbO tildemathbbO^T \ntildemathbbO C\nendbmatrix","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"where hatmathbbOinmathbbR^ntimesn is a \"small\" matrix and tildemathbbOinmathbbR^Ntimesn is a bigger one. CinmathbbR^NtimesN is a skew-symmetric matrix. ","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"We can then take the orthogonal complement of this matrix (with respect to the canonical metric). We will denote this by mathfrakg^mathrmhorEequivmathfrakg^mathrmhor and call it the horizontal component. Its elements are of the form described on top of this page.","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/#Special-functions","page":"Global Tangent Space","title":"Special functions","text":"","category":"section"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"You can also draw random elements from mathfrakg^mathrmhor through e.g. 
","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"rand(CUDADevice(), StiefelLieAlgHorMatrix{Float32}, 10, 5)","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"In this example: N=10 and n=5.","category":"page"},{"location":"library/","page":"Library","title":"Library","text":"CurrentModule = GeometricMachineLearning","category":"page"},{"location":"library/#GeometricMachineLearning-Library-Functions","page":"Library","title":"GeometricMachineLearning Library Functions","text":"","category":"section"},{"location":"library/","page":"Library","title":"Library","text":"Modules = [GeometricMachineLearning]","category":"page"},{"location":"library/#GeometricMachineLearning.AbstractCache","page":"Library","title":"GeometricMachineLearning.AbstractCache","text":"AbstractCache has subtypes: AdamCache MomentumCache GradientCache\n\nAll of them can be initialized with providing an array (also supporting manifold types).\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.AbstractRetraction","page":"Library","title":"GeometricMachineLearning.AbstractRetraction","text":"AbstractRetraction is a type that comprises all retraction methods for manifolds. For every manifold layer one has to specify a retraction method that takes the layer and elements of the (global) tangent space.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.AdamOptimizer","page":"Library","title":"GeometricMachineLearning.AdamOptimizer","text":"Defines the Adam Optimizer. Algorithm and suggested defaults are taken from (Goodfellow et al., 2016, page 301), except for δ, because single precision is used!\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.Attention","page":"Library","title":"GeometricMachineLearning.Attention","text":"MultiHeadAttention (MHA) serves as a preprocessing step in the transformer. It reweights the input vectors bases on correlations within those data. \n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.BlockIdentityLowerMatrix","page":"Library","title":"GeometricMachineLearning.BlockIdentityLowerMatrix","text":"A BlockIdentityLowerMatrix is a matrix with blocks | 1 0 | | S 1 | Currently, it only implements a custom mul! method, exploiting this structure.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.BlockIdentityUpperMatrix","page":"Library","title":"GeometricMachineLearning.BlockIdentityUpperMatrix","text":"A BlockIdentityUpperMatrix is a matrix with blocks | 1 S | | 0 1 | Currently, it only implements a custom mul! method, exploiting this structure.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.Classification","page":"Library","title":"GeometricMachineLearning.Classification","text":"Classification Layer that takes a matrix as an input and returns a vector that is used for MNIST classification. 
\n\nTODO: Implement picking the last vector.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.DataLoader","page":"Library","title":"GeometricMachineLearning.DataLoader","text":"Data Loader is a struct that creates an instance based on a tensor (or different input format) and is designed to make training convenient.\n\nImplemented: If the data loader is called with a single tensor, a batchsize and an outputsize, then the batch is drawn randomly in the relevant range and the output is assigned accordingly.\n\nThe fields of the struct are the following: - data: The input data with axes (i) system dimension, (ii) number of parameters and (iii) number of time steps. - batch: A tensor in which the current batch is stored. - targettensor: A tensor in which the current target is stored. - output: The tensor from which the output is drawn - this may be of type Nothing if the constructor is only called with one tensor. - sysdim: The ``dimension'' of the system, i.e. what is taken as input by a regular neural network. - seqlength: The length batches should have. - batchsize: - outputsize: The size of the second axis of the output tensor (predictionwindow, outputsize=1 in most cases) - nparams: The number of parameters that are present in the data set (length of second axis). - ntimesteps: Number of time steps (length of third axis).\n\nFor drawing the batch, the sampling is done over nparams and ntimesteps (here seqlength and output_size are also taken into account).\n\nTODO: Implement DataLoader that works well with GeometricEnsembles etc. \n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.GradientOptimizer","page":"Library","title":"GeometricMachineLearning.GradientOptimizer","text":"Define the Gradient optimizer, i.e. W ← W - η*∇f(W) Or the riemannian manifold equivalent, if applicable.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.GrassmannLayer","page":"Library","title":"GeometricMachineLearning.GrassmannLayer","text":"Defines a layer that performs simple multiplication with an element of the Grassmann manifold.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.GrassmannManifold","page":"Library","title":"GeometricMachineLearning.GrassmannManifold","text":"maybe consider dividing the output in the check functions by n! TODO: Implement sampling procedures!!\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.LayerWithManifold","page":"Library","title":"GeometricMachineLearning.LayerWithManifold","text":"Additional types to make handling manifolds more readable.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.ManifoldLayer","page":"Library","title":"GeometricMachineLearning.ManifoldLayer","text":"This defines a manifold layer that only has one matrix-valued manifold A associated with it does xmapstoAx. \n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.MomentumOptimizer","page":"Library","title":"GeometricMachineLearning.MomentumOptimizer","text":"Define the Momentum optimizer, i.e. V ← αV - ∇f(W) W ← W + ηV Or the riemannian manifold equivalent, if applicable.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.MultiHeadAttention","page":"Library","title":"GeometricMachineLearning.MultiHeadAttention","text":"MultiHeadAttention (MHA) serves as a preprocessing step in the transformer. It reweights the input vectors bases on correlations within those data. 
\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.Optimizer","page":"Library","title":"GeometricMachineLearning.Optimizer","text":"Optimizer struct that stores the 'method' (i.e. Adam with corresponding hyperparameters), the cache and the optimization step.\n\nIt takes as input an optimization method and the parameters of a network. \n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.StiefelLayer","page":"Library","title":"GeometricMachineLearning.StiefelLayer","text":"Defines a layer that performs simple multiplication with an element of the Stiefel manifold.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.SymmetricMatrix","page":"Library","title":"GeometricMachineLearning.SymmetricMatrix","text":"A SymmetricMatrix is a matrix | a S | | S b |\n\nThe first index is the row index, the second one the column index.\n\nIf the constructor is called with a matrix as input it returns a symmetric matrix via the projection A ↦ .5*(A + Aᵀ). This is a projection defined via the canonical metric (A,B) ↦ tr(AᵀB).\n\nTODO: Overload Adjoint operation for SymmetricMatrix!! (Aᵀ = A)\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.SympNetLayer","page":"Library","title":"GeometricMachineLearning.SympNetLayer","text":"Implements the various layers from the SympNet paper: (https://www.sciencedirect.com/science/article/abs/pii/S0893608020303063). Its components are of the form: \n\nbeginpmatrix\n I nablaV 0 I \nendpmatrix\n\nwith V(p) = sum_ia_iSigma(sum_jk_ijp_j+b_i), where Sigma is the antiderivative of the activation function sigma (one-layer neural network). Such layers are by construction symplectic.\n\nFor the linear layer, the activation and the bias are left out, and for the activation layer K and b are left out!\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.Transformer-Tuple{Integer, Integer, Integer}","page":"Library","title":"GeometricMachineLearning.Transformer","text":"The architecture for a \"transformer encoder\" is essentially taken from arXiv:2010.11929, but with the difference that 𝐧𝐨 layer normalization is employed. This is because we still need to find a generalization of layer normalization to manifolds. \n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.assign_batch_kernel!-Tuple{Any}","page":"Library","title":"GeometricMachineLearning.assign_batch_kernel!","text":"Takes as input a batch tensor (to which the data are assigned) the whole data tensor and two vectorsparams'' and ``time_steps'' that include the specific parameters and time steps we want to assign. \n\nNote that this assigns sequential data! For e.g. being processed by a transformer.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.assign_output_estimate-Union{Tuple{T}, Tuple{AbstractArray{T, 3}, Any}} where T","page":"Library","title":"GeometricMachineLearning.assign_output_estimate","text":"Closely related to the transformer. It takes the last prediction_window columns of the output and uses is for the final prediction.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.assign_output_kernel!-Tuple{Any}","page":"Library","title":"GeometricMachineLearning.assign_output_kernel!","text":"This should be used together with assignbatchkernel!. It assigns the corresponding output (i.e. 
target).\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.augment_zeros_kernel!-Tuple{Any}","page":"Library","title":"GeometricMachineLearning.augment_zeros_kernel!","text":"Used for differentiating assignoutputestimate (this appears in the loss). \n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.draw_batch!-Union{Tuple{T}, Tuple{AbstractArray{T, 3}, AbstractArray{T, 3}, AbstractArray{T, 3}}} where T","page":"Library","title":"GeometricMachineLearning.draw_batch!","text":"This function draws random time steps and parameters and based on these assign the batch and the output.\n\nFor ODE DataLoader: (i) batch input tensor to be written on (ii) batch output tensor to be written on (iii) data tensor \n\nFor MNIST DataLoader: (i) batch input tensor to be written on (ii) batch output tensor to be written on (iii) data tensor (iv) target tensor\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.init_optimizer_cache-Tuple{GradientOptimizer, Any}","page":"Library","title":"GeometricMachineLearning.init_optimizer_cache","text":"Wrapper for the functions setupadamcache, setupmomentumcache, setupgradientcache. These appear outside of optimizer_caches.jl because the OptimizerMethods first have to be defined.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.patch_index-Union{Tuple{T}, Tuple{T, T, T}, NTuple{4, T}} where T<:Integer","page":"Library","title":"GeometricMachineLearning.patch_index","text":"Based on coordinates i,j this returns the batch index (for MNIST data set for now).\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.train!","page":"Library","title":"GeometricMachineLearning.train!","text":"train!(...)\n\nPerform a training of a neural networks on data using given method a training Method\n\nDifferent ways of use:\n\ntrain!(neuralnetwork, data, optimizer = GradientOptimizer(1e-2), training_method; nruns = 1000, batch_size = default(data, type), showprogress = false )\n\nArguments\n\nneuralnetwork::LuxNeuralNetwork : the neural net work using LuxBackend\ndata : the data (see TrainingData)\noptimizer = GradientOptimizer: the optimization method (see Optimizer)\ntraining_method : specify the loss function used \nnruns : number of iteration through the process with default value \nbatch_size : size of batch of data used for each step\n\n\n\n\n\n","category":"function"},{"location":"library/#GeometricMachineLearning.train!-Tuple{AbstractNeuralNetworks.AbstractNeuralNetwork{<:AbstractNeuralNetworks.Architecture}, AbstractTrainingData, TrainingParameters}","page":"Library","title":"GeometricMachineLearning.train!","text":"train!(neuralnetwork, data, optimizer, training_method; nruns = 1000, batch_size, showprogress = false )\n\nArguments\n\nneuralnetwork::LuxNeuralNetwork : the neural net work using LuxBackend\ndata::AbstractTrainingData : the data\n``\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.within_patch_index-Union{Tuple{T}, Tuple{T, T, T}} where T<:Integer","page":"Library","title":"GeometricMachineLearning.within_patch_index","text":"Based on coordinates i,j this returns the index within the batch\n\n\n\n\n\n","category":"method"},{"location":"optimizers/adam_optimizer/#The-Adam-Optimizer","page":"Adam Optimizer","title":"The Adam Optimizer","text":"","category":"section"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"The Adam Optimizer is one 
of the most widely (if not the most widely used) neural network optimizer. Like most modern neural network optimizers it contains a cache that is updated based on first-order gradient information and then, in a second step, the cache is used to compute a velocity estimate for updating the neural networ weights. ","category":"page"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"Here we first describe the Adam algorithm for the case where all the weights are on a vector space and then show how to generalize this to the case where the weights are on a manifold. ","category":"page"},{"location":"optimizers/adam_optimizer/#All-weights-on-a-vector-space","page":"Adam Optimizer","title":"All weights on a vector space","text":"","category":"section"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"The cache of the Adam optimizer consists of first and second moments. The first moments B_1 store linear information about the current and previous gradients, and the second moments B_2 store quadratic information about current and previous gradients (all computed from a first-order gradient). ","category":"page"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"If all the weights are on a vector space, then we directly compute updates for B_1 and B_2:","category":"page"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"B_1 gets ((rho_1 - rho_1^t)(1 - rho_1^t))cdotB_1 + (1 - rho_1)(1 - rho_1^t)cdotnablaL\nB_2 gets ((rho_2 - rho_1^t)(1 - rho_2^t))cdotB_2 + (1 - rho_2)(1 - rho_2^t)cdotnablaLodotnablaL\nwhere odotmathbbR^ntimesmathbbR^ntomathbbR^n is the Hadamard product: aodotb_i = a_ib_i. rho_1 and rho_2 are hyperparameters. Their defaults, rho_1=09 and rho_2=099, are taken from (Goodfellow et al., 2016, page 301). After having updated the cache (i.e. B_1 and B_2) we compute a velocity (step 3) with which the parameters Y_t are then updated (step 4).\nW_tgets -etaB_1sqrtB_2 + delta\nY_t+1 gets Y_t + W_t","category":"page"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"Here eta (with default 0.01) is the learning rate and delta (with default 3cdot10^-7) is a small constant that is added for stability. The division, square root and addition in step 3 are performed element-wise. ","category":"page"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"(Image: )","category":"page"},{"location":"optimizers/adam_optimizer/#Weights-on-manifolds","page":"Adam Optimizer","title":"Weights on manifolds","text":"","category":"section"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"The problem with generalizing Adam to manifolds is that the Hadamard product odot as well as the other element-wise operations (, sqrt and + in step 3 above) lack a clear geometric interpretation. In GeometricMachineLearning we get around this issue by utilizing a so-called global tangent space representation. ","category":"page"},{"location":"optimizers/adam_optimizer/#References","page":"Adam Optimizer","title":"References","text":"","category":"section"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"Goodfellow I, Bengio Y, Courville A. Deep learning[M]. 
MIT press, 2016.","category":"page"},{"location":"layers/multihead_attention_layer/#Multihead-Attention-Layer","page":"Multihead Attention","title":"Multihead Attention Layer","text":"","category":"section"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"In order to arrive from the attention layer at the multihead attention layer we only have to do a simple modification: ","category":"page"},{"location":"layers/multihead_attention_layer/#References","page":"Multihead Attention","title":"References","text":"","category":"section"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"Vaswani, Ashish, et al. \"Attention is all you need.\" Advances in neural information processing systems 30 (2017).","category":"page"},{"location":"layers/attention_layer/#The-Attention-Layer","page":"Attention","title":"The Attention Layer","text":"","category":"section"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"The attention layer (and the orthonormal activation function defined for it) was specifically designed to generalize transformers to symplectic data. Usually a self-attention layer takes the following form: ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"Z = z^(1) ldots z^(T) mapsto Zmathrmsoftmax((P^QZ)^T(P^KZ))","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"where we left out the linear mapping onto the values P^V. ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"The idea behind is that we can perform a non-linear re-weighting of the columns of Z by multiplying with a Z-dependent matrix from the right and therefore take the sequential nature of the data into account (which is not possible with normal neural networks). After the attention step the transformer applies a simple ResNet from the left.","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"What the softmax does is a vector-wise operation, i.e. it operates on each column of an input matrix A = a_1 ldots a_T. The result is a sequence of probability vectors p^(1) ldots p^(T) for which ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"sum_i=1^Tp^(j)_i=1quadforalljin1dotsT","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"What we want to construct is a symplectic transformation that is transformer-like. For this we modify the attention layer the following way: ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"Z = z^(1) ldots z^(T) mapsto Zsigma((P^QZ)^T(P^KZ))","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"where sigma(A)=exp(mathttupper_triangular_asymmetrize(A)) and ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"mathttupper_triangular_asymmetrize(A)_ij = begincases a_ij textif ij -a_ji textif ij 0 textelseendcases","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"This has as a consequence that the matrix Lambda(Z) = sigma((P^QZ)^T(P^KZ)) is orthonormal and hence preserves an extended symplectic structure. 
To make this more clear, consider that the transformer maps sequences of vectors to sequences of vectors, i.e. VtimescdotstimesV ni z^1 ldots z^T mapsto hatz^1 ldots hatz^T. We can define a symplectic structure on VtimescdotstimesV by rearranging z^1 ldots z^T into a vector. We do this in the following way: ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"tildeZ = beginpmatrix q^(1)_1 q^(2)_1 cdots q^(T)_1 q^(1)_2 cdots q^(T)_d p^(1)_1 p^(2)_1 cdots p^(T)_1 p^(1)_2 cdots p^(T)_d endpmatrix","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"The symplectic structure on this big space is then: ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"mathbbJ=beginpmatrix\n mathbbO_dT mathbbI_dT \n -mathbbI_dT mathbbO_dT\nendpmatrix","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"Multiplying with the matrix Lambda(Z) from the right onto z^1 ldots z^T corresponds to applying the sparse matrix ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"tildeLambda(Z)=left\nbeginarrayccc\n Lambda(Z) cdots mathbbO_T \n vdots ddots vdots \n mathbbO_T cdots Lambda(Z) \n endarray\nright","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"from the left onto the big vector. ","category":"page"},{"location":"manifolds/homogeneous_spaces/#Homogeneous-Spaces","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"","category":"section"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"Homogeneous spaces are manifolds mathcalM on which a Lie group G acts transitively, i.e. ","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"forall XYinmathcalM existsAinGtext st AX = Y","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"Now fix a distinct element EinmathcalM. We can also establish an isomorphism between mathcalM and the quotient space Gsim with the equivalence relation: ","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"A_1 sim A_2 iff A_1E = A_2E","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"Note that this is independent of the chosen E.","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"The tangent spaces of mathcalM are of the form T_YmathcalM = mathfrakgcdotY, i.e. can be fully described through its Lie algebra. Based on this we can perform a splitting of mathfrakg into two parts:","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"The vertical component mathfrakg^mathrmverY is the kernel of the map mathfrakgtoT_YmathcalM V mapsto VY, i.e. mathfrakg^mathrmverY = VinmathfrakgVY = 0\nThe horizontal component mathfrakg^mathrmhorY is the orthogonal complement of mathfrakg^mathrmverY in mathfrakg. 
It is isomorphic to T_YmathcalM.","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"We will refer to the mapping from T_YmathcalM to mathfrakg^mathrmhor Y by Omega. If we have now defined a metric langlecdotcdotrangle on mathfrakg, then this induces a Riemannian metric on mathcalM:","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"g_Y(Delta_1 Delta_2) = langleOmega(YDelta_1)Omega(YDelta_2)rangletext for Delta_1Delta_2inT_YmathcalM","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"Two examples of homogeneous spaces implemented in GeometricMachineLearning are the Stiefel and the Grassmann manifold.","category":"page"},{"location":"manifolds/homogeneous_spaces/#References","page":"Homogeneous Spaces","title":"References","text":"","category":"section"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"Frankel, Theodore. The geometry of physics: an introduction. Cambridge university press, 2011.","category":"page"},{"location":"optimizers/manifold_related/global_sections/#Global-Sections","page":"Global Sections","title":"Global Sections","text":"","category":"section"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"Global sections are needed for the generalization of Adam and other optimizers to homogeneous spaces. They are necessary to perform the two mappings represented by horizontal and vertical red lines in the section on the general optimizer framework.","category":"page"},{"location":"optimizers/manifold_related/global_sections/#Computing-the-global-section","page":"Global Sections","title":"Computing the global section","text":"","category":"section"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"In differential geometry a section is always associated to some bundle, in our case this bundle is piGtomathcalMAmapstoAE. A section is a mapping lambdamathcalMtoG for which pi is a left inverse, i.e. picirclambda = mathrmid. ","category":"page"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"For the Stiefel manifold St(n N)subsetmathbbR^Ntimesn we compute the global section the following way: ","category":"page"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"Start with an element YinSt(nN),\nDraw a random matrix AinmathbbR^Ntimes(N-n),\nRemove the subspace spanned by Y from the range of A: AgetsA-YY^TA\nCompute a QR decomposition of A and take as section lambda(Y) = Y Q_1N 1(N-n).","category":"page"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"It is easy to check that lambda(Y)inG=SO(N).","category":"page"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"In GeometricMachineLearning, GlobalSection takes an element YinSt(nN)equivStiefelManifold{T} and returns an instance of GlobalSection{T, StiefelManifold{T}}. The application O(N)timesSt(nN)toSt(nN) is done with the functions apply_section! 
and apply_section.","category":"page"},{"location":"optimizers/manifold_related/global_sections/#Computing-the-global-tangent-space-representation-based-on-a-global-section","page":"Global Sections","title":"Computing the global tangent space representation based on a global section","text":"","category":"section"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"The output of the horizontal lift Omega is an element of mathfrakg^mathrmhorY. For this mapping Omega(Y BY) = B if Binmathfrakg^mathrmhorY, i.e. there is no information loss and no projection is performed. We can map the Binmathfrakg^mathrmhorY to mathfrakg^mathrmhor with Bmapstolambda(Y)^-1Blambda(Y).","category":"page"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"The function global_rep performs both mappings at once[1], i.e. it takes an instance of GlobalSection and an element of T_YSt(nN), and then returns an element of mathfrakg^mathrmhorequivStiefelLieAlgHorMatrix.","category":"page"},{"location":"optimizers/manifold_related/global_sections/#Optimization","page":"Global Sections","title":"Optimization","text":"","category":"section"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"The output of global_rep is then used for all the optimization steps.","category":"page"},{"location":"optimizers/manifold_related/global_sections/#References","page":"Global Sections","title":"References","text":"","category":"section"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"Frankel, Theodore. The geometry of physics: an introduction. Cambridge university press, 2011.","category":"page"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"[1]: For computational reasons.","category":"page"},{"location":"architectures/sympnet/#SympNet","page":"SympNet","title":"SympNet","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"This page documents the SympNet architecture and its implementation in GeometricMachineLearning.jl.","category":"page"},{"location":"architectures/sympnet/#Quick-overview-of-the-theory-of-SympNet","page":"SympNet","title":"Quick overview of the theory of SympNet","text":"","category":"section"},{"location":"architectures/sympnet/#Principle","page":"SympNet","title":"Principle","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"SympNets are a new type of neural network that propose a new approach to computing the trajectory of a Hamiltonian system in phase space. Let us denote by (qp)=(q_1q_dp_1p_d)in mathbbR^2d the phase space with qin mathbbR^d the generalized position and pin mathbbR^d the generalized momentum. Given a physical problem, a SympNet takes a phase space element (qp) and aims to compute the phase space element one time step later along the trajectory, while preserving the well-known symplectic structure of Hamiltonian systems. The way a SympNet preserves the symplectic structure is specific to this architecture: the preservation is intrinsic to the neural network itself. 
Indeed, a SympNet is not built from traditional layers but from symplectic layers (described later), which turn the traditional universal approximation theorem into a symplectic one: a SympNet can approximate any symplectic map, provided certain conditions on the activation function hold.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"A SympNet (denoted by Phi in the following) is thus an integrator from mathbbR^d times mathbbR^d to mathbbR^d times mathbbR^d that preserves symplecticity and can compute, from an initial condition (q_0p_0), a sequence of phase space elements of a trajectory (q_np_n)=Phi(q_n-1p_n-1)==Phi^n(q_0p_0). The time step between predictions is not a parameter we can choose but is related to the temporal frequency of the training data. SympNets can handle both temporally regular data, i.e. with a fixed time step between data points, and temporally irregular data, i.e. with a variable time step. ","category":"page"},{"location":"architectures/sympnet/#Architecture-of-SympNets","page":"SympNet","title":"Architecture of SympNets","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"With GeometricMachineLearning.jl, it is possible to implement two types of architectures: LA-SympNets and G-SympNets. ","category":"page"},{"location":"architectures/sympnet/#LA-SympNet","page":"SympNet","title":"LA-SympNet","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"(Image: )","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"LA-SympNets are made by alternating two types of layers: symplectic linear layers and symplectic activation layers. For a given integer n, a symplectic linear layer is defined by","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"mathcalL^nup\nbeginpmatrix\n q \n p \nendpmatrix\n = \nbeginpmatrix \n I S^n0 \n 0S^n I \nendpmatrix\n cdots \nbeginpmatrix \n I 0 \n S^2 I \nendpmatrix\nbeginpmatrix \n I S^1 \n 0 I \nendpmatrix\nbeginpmatrix\n q \n p \nendpmatrix\n+ b ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"or ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"mathcalL^nlow\nbeginpmatrix q \n p endpmatrix = \n beginpmatrix \n I 0S^n \n S^n0 I\n endpmatrix cdots \n beginpmatrix \n I S^2 \n 0 I\n endpmatrix\n beginpmatrix \n I 0 \n S^1 I\n endpmatrix\n beginpmatrix q \n p endpmatrix\n + b ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"The parameters to learn are the symmetric matrices S^iinmathbbR^dtimes d and the bias binmathbbR^2d. The integer n is the width of the symplectic linear layer. If ngeq9, the symplectic linear layers can represent any linear symplectic map, so n need not be larger than 9. We denote the set of symplectic linear layers by mathcalM^L. This type of layer plays the role of a standard linear layer. 
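As a plain-Julia reading of the definition above (a sketch under the stated definition, not the layer implementation in GeometricMachineLearning), an up symplectic linear layer alternates shears of q by S^1, of p by S^2, and so on, and finally adds the bias b:

using LinearAlgebra

# Sketch of L^{n,up}: apply the factors [I S1; 0 I], [I 0; S2 I], ... in turn,
# starting with an update of q, then add the bias b in R^{2d}.
function linear_layer_up(q, p, S, b)
    for (i, Si) in enumerate(S)      # S = [S1, ..., Sn], each a symmetric d×d matrix
        if isodd(i)
            q = q + Symmetric(Si) * p
        else
            p = p + Symmetric(Si) * q
        end
    end
    d = length(q)
    q + b[1:d], p + b[d+1:end]
end

d = 2
S = [rand(d, d) for _ in 1:3]        # width n = 3
b = zeros(2d)
q, p = rand(d), rand(d)
q_new, p_new = linear_layer_up(q, p, S, b)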
","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"For a given activation function sigma, a symplectic activation layer is defined by","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":" mathcalA^up beginpmatrix q \n p endpmatrix = \n beginbmatrix \n Ihatsigma^a \n 0I\n endbmatrix beginpmatrix q \n p endpmatrix =\n beginpmatrix \n mathrmdiag(a)sigma(p)+q \n p\n endpmatrix","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"or","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":" mathcalA^low beginpmatrix q \n p endpmatrix = \n beginbmatrix \n I0 \n hatsigma^aI\n endbmatrix beginpmatrix q \n p endpmatrix\n =\n beginpmatrix \n q \n mathrmdiag(a)sigma(q)+p\n endpmatrix","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"The parameters to learn are the weights ainmathbbR^d. This type of layers plays the role of standard activation layers layers. We note the set of symplectic activation layers mathcalM^A. ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"A LA-SympNet is a function of the form Psi=l_k+1 circ a_k circ v_k circ cdots circ a_1 circ l_1 where (l_i)_1leq ileq k+1 subset (mathcalM^L)^k+1 and ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"(a_i)_1leq ileq k subset (mathcalM^A)^k","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":".","category":"page"},{"location":"architectures/sympnet/#G-SympNet","page":"SympNet","title":"G-SympNet","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"G-SympNets are an alternative to LA-SympNet. They are constituated with only one kind of layers called gradient layers. For a given activation function sigma and an integer ngeq d, a gradient layers is a symplectic map from mathbbR^2d to mathbbR^2d defined by","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":" mathcalG^up beginpmatrix q \n p endpmatrix = \n beginbmatrix \n Ihatsigma^Kab \n 0I\n endbmatrix beginpmatrix q \n p endpmatrix =\n beginpmatrix \n K^T mathrmdiag(a)sigma(Kp+b)+q \n p\n endpmatrix","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"or","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":" mathcalG^low beginpmatrix q \n p endpmatrix = \n beginbmatrix \n I0 \n hatsigma^KabI\n endbmatrix beginpmatrix q \n p endpmatrix\n =\n beginpmatrix \n q \n K^T mathrmdiag(a)sigma(Kq+b)+p\n endpmatrix","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"The parameters of this layer are the scale matrix KinmathbbR^ntimes d, the bias binmathbbR^n and the vector of weights ainmathbbR^n. The idea is that hatsigma^Kab can approximate any function of the form nabla V, hence the name of this layer. 
The integer n is called the width of the gradient layer.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"If we denote by mathcalM^G the set of gradient layers, a G-SympNet is a function of the form Psi=g_k circ g_k-1 circ cdots circ g_1 where (g_i)_1leq ileq k subset (mathcalM^G)^k.","category":"page"},{"location":"architectures/sympnet/#Universal-approximation-theorems","page":"SympNet","title":"Universal approximation theorems","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"We now state the universal approximation theorems for both architectures. But first let us give a few definitions. ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Let U be an open set of mathbbR^2d, and let us denote by SP^r(U) the set of C^r smooth symplectic maps on U. We equip the set of C^r smooth maps from a compact set K of mathbbR^n to mathbbR^n, for any positive integer n, with the norm","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"f_C^r(KmathbbR^n) = undersetalphaleq rsum underset1leq i leq nmaxundersetxin Ksup D^alpha f_i(x)","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"where the differential operator D^alpha is defined for any map of C^r(mathbbR^nmathbbR) by ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"D^alpha f = fracpartial^alpha fpartial x_1^alpha_1x_n^alpha_n","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"with alpha = alpha_1 ++ alpha_n. ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Definition Let sigma be a real map and rin mathbbN. sigma is r-finite if sigmain C^r(mathbbRmathbbR) and int D^rsigma(x)dx +infty.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Definition Let mnrin mathbbN with mn0 be given, U an open set of mathbbR^m, and IJsubset C^r(UmathbbR^n). We say J is r-uniformly dense on compacta in I if J subset I and for any fin I, epsilon0, and any compact Ksubset U, there exists gin J such that f-g_C^r(KmathbbR^n) epsilon.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"We can now state the theorems.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Theorem (Approximation theorem for LA-SympNet) For any positive integer r0 and open set Uin mathbbR^2d, the set of LA-SympNets is r-uniformly dense on compacta in SP^r(U) if the activation function sigma is r-finite.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Theorem (Approximation theorem for G-SympNet) For any positive integer r0 and open set Uin mathbbR^2d, the set of G-SympNets is r-uniformly dense on compacta in SP^r(U) if the activation function sigma is r-finite.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"These two theorems establish the well-foundedness of the SympNet approach. 
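The property these theorems rely on, namely that each layer (and hence any composition of layers) is a symplectic map, can also be checked numerically. The sketch below is an illustration only: it builds a gradient layer as above and verifies that its finite-difference Jacobian J satisfies J' * Jcan * J = Jcan, with Jcan the canonical symplectic matrix.

using LinearAlgebra

# Gradient layer G^up acting on x = [q; p] (same map as sketched above).
function glayer(x, K, a, b)
    d = length(x) ÷ 2
    q, p = x[1:d], x[d+1:end]
    vcat(q .+ K' * (a .* tanh.(K * p .+ b)), p)
end

# Central finite-difference Jacobian.
function fd_jacobian(f, x; h = 1e-6)
    J = zeros(length(f(x)), length(x))
    for j in eachindex(x)
        e = zeros(length(x)); e[j] = h
        J[:, j] = (f(x .+ e) .- f(x .- e)) ./ (2h)
    end
    J
end

d, n = 2, 5
K, a, b = rand(n, d), rand(n), rand(n)
x = rand(2d)
J = fd_jacobian(x -> glayer(x, K, a, b), x)
Id = Matrix{Float64}(I, d, d)
Jcan = [zeros(d, d) Id; -Id zeros(d, d)]
J' * Jcan * J ≈ Jcan              # true up to finite-difference error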
","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Example of r-finite functions","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"sigmoid sigma(x)=frac11+e^-x for any positive integer r, \ntanh tanh(x)=frace^x-e^-xe^x+e^-x for any positive integer r. ","category":"page"},{"location":"architectures/sympnet/#SympNet-with-GeometricMachineLearning.jl","page":"SympNet","title":"SympNet with GeometricMachineLearning.jl","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"With GeometricMachineLearning.jl, it is really easy to implement and train a SympNet. The steps are the following :","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Create the architecture in one line with the function GSympNet or LASympNet,\nCreate the neural networks depending a backend (e.g. with Lux),\nCreate an optimizer for the training step,\nTrain the neural networks with the train!function.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Both LA-SympNet and G-SympNet architectures can be generated in one line with GeometricMachineLearning.jl.","category":"page"},{"location":"architectures/sympnet/#LA-SympNet-2","page":"SympNet","title":"LA-SympNet","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"To create a LA-SympNet, one needs to write","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"lasympnet = LASympNet(dim; width=9, nhidden=1, activation=tanh, init_uplow_linear=[true,false], \n init_uplow_act=[true,false],init_sym_matrices=Lux.glorot_uniform, init_bias=Lux.zeros32, \n init_weight=Lux.glorot_uniform) ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"LASympNet takes one obligatory argument:","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"dim : the dimension of the phase space,","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"and several keywords argument :","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"width : the width for all the symplectic linear layers with default value set to 9 (if width>9, width is set to 9),\nnhidden : the number of pairs of symplectic linear and activation layers with default value set to 0 (i.e LA-SympNet is a single symplectic linear layer),\nactivation : the activation function for all the symplectic activations layers with default value set to tanh,\ninituplowlinear : a vector of boolean whose the ith coordinate is true only if all the symplectic linear layers in (i mod length(init_uplow_linear))-th position is up (for example the default value is [true,false] which represents an alternation of up and low symplectic linear layers),\ninituplowact : a vector of boolean whose the ith coordinate is true only if all the symplectic activation layers in (i mod length(init_uplow_act))-th position is up (for example the default value is [true,false] which represents an alternation of up and low symplectic activation layers),\ninitsymmatrices: the function which gives the way to initialize the symmetric matrices S^i of symplectic linear layers,\ninit_bias: the function which gives the way to initialize the vector of bias 
b,\ninit_weight: the function used to initialize the weight vector a.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"The default values of the last three keyword arguments use Lux functions.","category":"page"},{"location":"architectures/sympnet/#G-SympNet-2","page":"SympNet","title":"G-SympNet","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"To create a G-SympNet, one needs to write","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"gsympnet = GSympNet(dim; width=dim, nhidden=1, activation=tanh, init_uplow=[true,false], init_weight=Lux.glorot_uniform, \ninit_bias=Lux.zeros32, init_scale=Lux.glorot_uniform) ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"GSympNet takes one obligatory argument:","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"dim: the dimension of the phase space,","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"and several keyword arguments:","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"width: the width for all the gradient layers, with default value set to dim so that widthgeqdim,\nnhidden: the number of gradient layers, with default value set to 1,\nactivation: the activation function for all the gradient layers, with default value set to tanh,\ninit_uplow: a vector of booleans whose i-th coordinate is true only if the gradient layer in the (i mod length(init_uplow))-th position is up (for example the default value is [true,false], which represents an alternation of up and low gradient layers),\ninit_weight: the function used to initialize the vector of weights a,\ninit_bias: the function used to initialize the bias vector b,\ninit_scale: the function used to initialize the scale matrix K.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"The default values of the last three keyword arguments use Lux functions.","category":"page"},{"location":"architectures/sympnet/#Loss-function","page":"SympNet","title":"Loss function","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"To train a SympNet, one needs data along a trajectory, so that the model is trained to perform an integration. These data are (QP), where Qij (respectively Pij) is the real number q_j(t_i) (respectively p_j(t_i)), i.e. the j-th coordinate of the generalized position (respectively momentum) at the i-th time step. 
One also needs a loss function, defined as:","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Loss(QP) = undersetisum d(Phi(Qi-Pi-) Qi- Pi-^T)","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"where d is a distance on mathbbR^d.","category":"page"},{"location":"architectures/sympnet/#Examples","page":"SympNet","title":"Examples","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Let us see how to use it on several examples.","category":"page"},{"location":"architectures/sympnet/#Example-of-a-pendulum-with-G-SympNet","page":"SympNet","title":"Example of a pendulum with G-SympNet","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Let us begin with a simple example, the pendulum system, the Hamiltonian of which is ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"H(qp)inmathbbR^2 mapsto frac12p^2-cos(q) in mathbbR","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"The first thing to do is to create an architecture, in this example a G-SympNet.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"# number of inputs/dimension of system\nconst ninput = 2\n# layer dimension for gradient module \nconst ld = 10 \n# hidden layers\nconst ln = 4\n# activation function\nconst act = tanh\n\n# Creation of a G-SympNet architecture \ngsympnet = GSympNet(ninput, width=ld, nhidden=ln, activation=act)\n\n# Creation of a LA-SympNet architecture \nlasympnet = LASympNet(ninput, nhidden=ln, activation=act)","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Then we can create the neural networks depending on the backend. Here we will use Lux:","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"# create Lux network\nnn = NeuralNetwork(gsympnet, LuxBackend())","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"We have to define an optimizer which will be used in the training of the SympNet. For more details on optimizers, please see the corresponding documentation (Optimizer.md). For example, let us use a momentum optimizer:","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"# Optimiser\nopt = MomentumOptimizer(1e-2, 0.5)","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"We can now perform the training of the neural networks. The syntax is the following:","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"# number of training runs\nconst nruns = 10000\n# Batchsize used to compute the gradient of the loss function with respect to the parameters of the neural networks.\nconst nbatch = 10\n\n# perform training (returns array that contains the total loss for each training step)\ntotal_loss = train!(nn, opt, data_q, data_p; ntraining = nruns, batch_size = nbatch)","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"The train! function changes the parameters of the neural networks and returns a vector containing the evolution of the value of the loss function during the training. 
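For concreteness, trajectory data for the pendulum above could be produced as in the sketch below (illustrative only; pendulum_data is a hypothetical helper based on a symplectic Euler step and is not part of GeometricMachineLearning.jl). The required shape of data_q and data_p is described just below.

# Hypothetical helper: integrate the pendulum H(q,p) = p^2/2 - cos(q) with a
# symplectic Euler step and store the trajectory as data_q[i, j] = q_j(t_i),
# data_p[i, j] = p_j(t_i).
function pendulum_data(q0, p0; Δt = 0.1, nsteps = 1000)
    data_q = zeros(nsteps + 1, length(q0))
    data_p = zeros(nsteps + 1, length(p0))
    data_q[1, :] .= q0
    data_p[1, :] .= p0
    q, p = copy(q0), copy(p0)
    for i in 1:nsteps
        p = p .- Δt .* sin.(q)    # dp/dt = -∂H/∂q = -sin(q)
        q = q .+ Δt .* p          # dq/dt =  ∂H/∂p = p
        data_q[i + 1, :] .= q
        data_p[i + 1, :] .= p
    end
    data_q, data_p
end

data_q, data_p = pendulum_data([0.5], [0.0])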
Default values for the arguments ntraining and batch_size are respectively 1000 and 10.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"The training data data_q and data_p must be matrices in mathbbR^ntimes d where n is the length of the data and d is half the dimension of the system, i.e. data_q[i,j] is q_j(t_i) where (t_1t_n) are the corresponding times of the training data.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Then we can make predictions. Let's compare the initial data with a prediction starting from the same phase space point using the provided function Iterate_Sympnet:","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"#predictions\nq_learned, p_learned = Iterate_Sympnet(nn, q0, p0; n_points = size(data_q,1))","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"(Image: )","category":"page"},{"location":"Optimizer/#Optimizer","page":"Optimizer Framework","title":"Optimizer","text":"","category":"section"},{"location":"Optimizer/","page":"Optimizer Framework","title":"Optimizer Framework","text":"In order to generalize neural network optimizers to homogeneous spaces, a class of manifolds we often encounter in machine learning, we have to find a global tangent space representation which we call mathfrakg^mathrmhor here. ","category":"page"},{"location":"Optimizer/","page":"Optimizer Framework","title":"Optimizer Framework","text":"Starting from an element of the tangent space T_YmathcalM[1], we need to perform two mappings to arrive at mathfrakg^mathrmhor, which we refer to by Omega and a red horizontal arrow:","category":"page"},{"location":"Optimizer/","page":"Optimizer Framework","title":"Optimizer Framework","text":"(Image: )","category":"page"},{"location":"Optimizer/","page":"Optimizer Framework","title":"Optimizer Framework","text":"Here the mapping Omega is a horizontal lift from the tangent space onto the horizontal component of the Lie algebra at Y. ","category":"page"},{"location":"Optimizer/","page":"Optimizer Framework","title":"Optimizer Framework","text":"The red line maps the horizontal component at Y, i.e. mathfrakg^mathrmhorY, to the horizontal component at mathfrakg^mathrmhor.","category":"page"},{"location":"Optimizer/","page":"Optimizer Framework","title":"Optimizer Framework","text":"The mathrmcache stores information about previous optimization steps and is dependent on the optimizer. The elements of the mathrmcache are also in mathfrakg^mathrmhor. Based on this the optimizer (Adam in this case) computes a final velocity, which is the input of a retraction. Because this update is done for mathfrakg^mathrmhorequivT_YmathcalM, we still need to perform a mapping, called apply_section here, that then finally updates the network parameters. The two red lines are described in global sections.","category":"page"},{"location":"Optimizer/#References","page":"Optimizer Framework","title":"References","text":"","category":"section"},{"location":"Optimizer/","page":"Optimizer Framework","title":"Optimizer Framework","text":"Brantner B. Generalizing Adam To Manifolds For Efficiently Training Transformers[J]. 
arXiv preprint arXiv:2305.16901, 2023.","category":"page"},{"location":"Optimizer/","page":"Optimizer Framework","title":"Optimizer Framework","text":"[1]: In practice this is obtained by first using an AD routine on a loss function L, and then computing the Riemannian gradient based on this. See the section of the Stiefel manifold for an example of this.","category":"page"},{"location":"manifolds/grassmann_manifold/#Grassmann-Manifold","page":"Grassmann","title":"Grassmann Manifold","text":"","category":"section"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"(The description of the Grassmann manifold is based on that of the Stiefel manifold, so this should be read first.)","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"An element of the Grassmann manifold G(nN) is a vector subspace subsetmathbbR^N of dimension n, and each such subspace can be represented by a full-rank matrix AinmathbbR^Ntimesn and the full space takes the form G(nN) = mathbbR^Ntimesnsim where the equivalence relation is AsimB iff existsCinmathbbR^ntimesntext st AC = B. One can find a parametrization of the manifold the following way: Because the matrix A has full rank, there have to be n independent columns in it: i_1 ldots i_n. For simplicity assume that i_1 = 1 i_2=2 ldots i_n=n and call the matrix made up by these columns C. Then the mapping to the coordinate chart is: AC^-1 and the last N-n columns are the coordinates. ","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"The tangent space for this element can then be represented through matrices: ","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"beginpmatrix\n 0 cdots 0 \n cdots cdots cdots \n 0 cdots 0 \n a_11 cdots a_1n \n cdots cdots cdots \n a_(N-n)1 cdots a_(N-n)n\nendpmatrix","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"The Grassmann manifold can also be seen as the Stiefel manifold modulo an equivalence class. 
This leads to the following (which is used for optimization):","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"mathfrakg^mathrmhor = mathfrakg^mathrmhorE = leftbeginpmatrix 0 -B^T B 0 endpmatrix textB arbitraryright","category":"page"},{"location":"","page":"Home","title":"Home","text":"CurrentModule = GeometricMachineLearning","category":"page"},{"location":"#Geometric-Machine-Learning","page":"Home","title":"Geometric Machine Learning","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"GeometricMachineLearning.jl implements various scientific machine learning models that aim at learning dynamical systems with geometric structure, such as Hamiltonian (symplectic) or Lagrangian (variational) systems.","category":"page"},{"location":"#Installation","page":"Home","title":"Installation","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"GeometricMachineLearning.jl and all of its dependencies can be installed via the Julia REPL by typing ","category":"page"},{"location":"","page":"Home","title":"Home","text":"]add GeometricMachineLearning","category":"page"},{"location":"#Architectures","page":"Home","title":"Architectures","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"Pages = [\n \"architectures/sympnet.md\",\n]","category":"page"},{"location":"#Manifolds","page":"Home","title":"Manifolds","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"Pages = [\n \"manifolds/grassmann_manifold.md\",\n \"manifolds/stiefel_manifold.md\",\n]","category":"page"},{"location":"optimizers/manifold_related/horizontal_lift/#The-Horizontal-Lift","page":"Horizontal Lift","title":"The Horizontal Lift","text":"","category":"section"},{"location":"optimizers/manifold_related/horizontal_lift/","page":"Horizontal Lift","title":"Horizontal Lift","text":"For each element YinmathcalM we can perform a splitting mathfrakg = mathfrakg^mathrmhor Yoplusmathfrakg^mathrmver Y, where the two subspaces are the horizontal and the vertical component of mathfrakg at Y respectively. For homogeneous spaces: T_YmathcalM = mathfrakgcdotY, i.e. every tangent space to mathcalM can be expressed through the application of the Lie algebra to the relevant element. The vertical component consists of those elements of mathfrakg which are mapped to the zero element of T_YmathcalM, i.e. ","category":"page"},{"location":"optimizers/manifold_related/horizontal_lift/","page":"Horizontal Lift","title":"Horizontal Lift","text":"mathfrakg^mathrmver Y = mathrmker(mathfrakgtoT_YmathcalM)","category":"page"},{"location":"optimizers/manifold_related/horizontal_lift/","page":"Horizontal Lift","title":"Horizontal Lift","text":"The orthogonal complement[1] of mathfrakg^mathrmver Y is the horizontal component and is referred to by mathfrakg^mathrmhor Y. This is naturally isomorphic to T_YmathcalM. 
For the Stiefel manifold the horizontal lift has the simple form: ","category":"page"},{"location":"optimizers/manifold_related/horizontal_lift/","page":"Horizontal Lift","title":"Horizontal Lift","text":"Omega(Y V) = left(mathbbI - frac12right)VY^T - YV^T(mathbbI - frac12YY^T)","category":"page"},{"location":"optimizers/manifold_related/horizontal_lift/","page":"Horizontal Lift","title":"Horizontal Lift","text":"If the element Y is the distinct element E, then the elements of mathfrakg^mathrmhorE take a particularly simple form, see Global Tangent Space for a description of this. ","category":"page"},{"location":"optimizers/manifold_related/horizontal_lift/","page":"Horizontal Lift","title":"Horizontal Lift","text":"[1]: The orthogonal complement is taken with respect to a metric defined on mathfrakg. For the case of G=SO(N) and mathfrakg=mathfrakso(N) = AA+A^T =0 this metric can be chosen as (A_1A_2)mapstofrac12A_1^TA_2.","category":"page"}] +[{"location":"manifolds/stiefel_manifold/#Stiefel-manifold","page":"Stiefel","title":"Stiefel manifold","text":"","category":"section"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"The Stiefel manifold St(n N) is the space (a homogeneous space) of all orthonormal frames in mathbbR^Ntimesn, i.e. matrices YinmathbbR^Ntimesn s.t. Y^TY = mathbbI_n. It can also be seen as the special orthonormal group SO(N) modulo an equivalence relation: AsimBiffAE = BE for ","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"E = beginbmatrix\nmathbbI_n \nmathbbO\nendbmatrixinmathcalM","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"which is the canonical element of the Stiefel manifold. In words: the first n columns of A and B are the same.","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"The tangent space to the element YinSt(nN) can easily be determined: ","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"T_YSt(nN)=DeltaDelta^TY + Y^TDelta = 0","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"The Lie algebra of SO(N) is mathfrakso(N)=VinmathbbR^NtimesNV^T + V = 0 and the canonical metric associated with it is simply (V_1V_2)mapstofrac12mathrmTr(V_1^TV_2).","category":"page"},{"location":"manifolds/stiefel_manifold/#The-Riemannian-Gradient","page":"Stiefel","title":"The Riemannian Gradient","text":"","category":"section"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"For matrix manifolds (like the Stiefel manifold), the Riemannian gradient of a function can be easily determined computationally:","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"The Euclidean gradient of a function L is equivalent to an element of the cotangent space T^*_YmathcalM via: ","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"langlenablaLcdotrangleT_YmathcalM to mathbbR Delta mapsto sum_ijnablaL_ijDelta_ij = mathrmTr(nablaL^TDelta)","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"We can then utilize the Riemannian metric on mathcalM to map the element from the cotangent space (i.e. nablaL) to the tangent space. This element is called mathrmgrad_(cdot)L here. 
Explicitly, it is given by: ","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":" mathrmgrad_YL = nabla_YL - Y(nabla_YL)^TY","category":"page"},{"location":"manifolds/stiefel_manifold/#rgrad","page":"Stiefel","title":"rgrad","text":"","category":"section"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"What was referred to as nablaL before can in practice be obtained with an AD routine. We then use the function rgrad to map this Euclidean gradient to an element of T_YSt(nN). This mapping has the property: ","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"mathrmTr((nablaL)^TDelta) = g_Y(mathttrgrad(Y nablaL) Delta) forallDeltainT_YSt(nN)","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"where g is the Riemannian metric.","category":"page"},{"location":"optimizers/manifold_related/horizontal_lift/#The-Horizontal-Lift","page":"Horizontal Lift","title":"The Horizontal Lift","text":"","category":"section"},{"location":"optimizers/manifold_related/horizontal_lift/","page":"Horizontal Lift","title":"Horizontal Lift","text":"For each element YinmathcalM we can perform a splitting mathfrakg = mathfrakg^mathrmhor Yoplusmathfrakg^mathrmver Y, where the two subspaces are the horizontal and the vertical component of mathfrakg at Y respectively. For homogeneous spaces: T_YmathcalM = mathfrakgcdotY, i.e. every tangent space to mathcalM can be expressed through the application of the Lie algebra to the relevant element. The vertical component consists of those elements of mathfrakg which are mapped to the zero element of T_YmathcalM, i.e. ","category":"page"},{"location":"optimizers/manifold_related/horizontal_lift/","page":"Horizontal Lift","title":"Horizontal Lift","text":"mathfrakg^mathrmver Y = mathrmker(mathfrakgtoT_YmathcalM)","category":"page"},{"location":"optimizers/manifold_related/horizontal_lift/","page":"Horizontal Lift","title":"Horizontal Lift","text":"The orthogonal complement[1] of mathfrakg^mathrmver Y is the horizontal component and is referred to by mathfrakg^mathrmhor Y. This is naturally isomorphic to T_YmathcalM. For the Stiefel manifold the horizontal lift has the simple form: ","category":"page"},{"location":"optimizers/manifold_related/horizontal_lift/","page":"Horizontal Lift","title":"Horizontal Lift","text":"Omega(Y V) = left(mathbbI - frac12YY^Tright)VY^T - YV^T(mathbbI - frac12YY^T)","category":"page"},{"location":"optimizers/manifold_related/horizontal_lift/","page":"Horizontal Lift","title":"Horizontal Lift","text":"If the element Y is the distinct element E, then the elements of mathfrakg^mathrmhorE take a particularly simple form, see Global Tangent Space for a description of this. ","category":"page"},{"location":"optimizers/manifold_related/horizontal_lift/","page":"Horizontal Lift","title":"Horizontal Lift","text":"[1]: The orthogonal complement is taken with respect to a metric defined on mathfrakg. 
For the case of G=SO(N) and mathfrakg=mathfrakso(N) = AA+A^T =0 this metric can be chosen as (A_1A_2)mapstofrac12mathrmTr(A_1^TA_2).","category":"page"},{"location":"optimizers/manifold_related/retractions/#Retractions","page":"Retractions","title":"Retractions","text":"","category":"section"},{"location":"optimizers/manifold_related/retractions/#Classical-Definition","page":"Retractions","title":"Classical Definition","text":"","category":"section"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"Classically, retractions are defined as smooth maps ","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"R TmathcalMtomathcalM(xv)mapstoR_x(v)","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"such that each curve c(t) = R_x(tv) satisfies c(0) = x and c'(0) = v.","category":"page"},{"location":"optimizers/manifold_related/retractions/#In-GeometricMachineLearning","page":"Retractions","title":"In GeometricMachineLearning","text":"","category":"section"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"Retractions are maps from the horizontal component of the Lie algebra mathfrakg^mathrmhor to the respective manifold.","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"For optimization in neural networks (almost always first order) we solve a gradient flow equation ","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"dotW = -mathrmgrad_WL ","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"where mathrmgrad_WL is the Riemannian gradient of the loss function L evaluated at position W.","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"If we deal with Euclidean spaces (vector spaces), then the Riemannian gradient is just the result of an AD routine and the solution of the equation above can be approximated with W^t+1 gets W^t - etanabla_W^tL, where eta is the learning rate. ","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"For manifolds, after we have obtained the Riemannian gradient (see e.g. the section on the Stiefel manifold), we have to solve a geodesic equation. This is a canonical ODE associated with any Riemannian manifold. ","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"The general theory of Riemannian manifolds is rather complicated, but for the neural networks treated in GeometricMachineLearning, we only rely on optimization of matrix Lie groups and homogeneous spaces, which is much simpler. ","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"For Lie groups each tangent space is isomorphic to its Lie algebra mathfrakgequivT_mathbbIG. The geodesic map from mathfrakg to G, for matrix Lie groups with a bi-invariant Riemannian metric like SO(N), is simply the application of the matrix exponential exp. 
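To make this concrete for the Stiefel case, the sketch below (illustrative only, not the retraction function of GeometricMachineLearning) assembles an element of g^hor in the block form [A -B^T; B 0], applies the matrix exponential and multiplies with the canonical element E; the result has orthonormal columns, i.e. it lies on St(n, N).

using LinearAlgebra

# Sketch of the geodesic map g^hor -> St(n, N), B_hor ↦ exp(B_hor)E.
function geodesic_retraction(A::AbstractMatrix, B::AbstractMatrix)
    n = size(A, 1)
    N = n + size(B, 1)
    B_hor = [A -B'; B zeros(N - n, N - n)]           # element of g^hor
    E = [Matrix{Float64}(I, n, n); zeros(N - n, n)]  # canonical element of St(n, N)
    exp(B_hor) * E
end

A = let M = rand(3, 3); M - M' end    # skew-symmetric n×n block
B = rand(7, 3)                        # (N-n)×n block
Y = geodesic_retraction(A, B)
Y' * Y ≈ I                            # true: Y lies on St(3, 10)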
Alternatively this can be replaced by the Cayley transform (see (Absil et al, 2008).)","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"Starting from this basic map expmathfrakgtoG we can build mappings for more complicated cases: ","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"General tangent space to a Lie group T_AG: The geodesic map for an element VinT_AG is simply Aexp(A^-1V).\nSpecial tangent space to a homogeneous space T_EmathcalM: For V=BEinT_EmathcalM the exponential map is simply exp(B)E. \nGeneral tangent space to a homogeneous space T_YmathcalM with Y = AE: For Delta=ABEinT_YmathcalM the exponential map is simply Aexp(B)E. This is the general case which we deal with. ","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"The general theory behind points 2. and 3. is discussed in chapter 11 of (O'Neill, 1983). The function retraction in GeometricMachineLearning performs mathfrakg^mathrmhortomathcalM, which is the second of the above points. To get the third from the second point, we simply have to multiply with a matrix from the left. This step is done with apply_section and represented through the red vertical line in the diagram on the general optimizer framework.","category":"page"},{"location":"optimizers/manifold_related/retractions/#Word-of-caution","page":"Retractions","title":"Word of caution","text":"","category":"section"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"The Lie group corresponding to the Stiefel manifold SO(N) has a bi-invariant Riemannian metric associated with it: (B_1B_2)mapsto mathrmTr(B_1^TB_2). For other Lie groups (e.g. the symplectic group) the situation is slightly more difficult (see (Bendokat et al, 2021).)","category":"page"},{"location":"optimizers/manifold_related/retractions/#References","page":"Retractions","title":"References","text":"","category":"section"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"Absil P A, Mahony R, Sepulchre R. Optimization algorithms on matrix manifolds[M]. Princeton University Press, 2008.\nBendokat T, Zimmermann R. The real symplectic Stiefel and Grassmann manifolds: metrics, geodesics and applications[J]. arXiv preprint arXiv:2108.12447, 2021.\nO'Neill, Barrett. Semi-Riemannian geometry with applications to relativity. Academic press, 1983.","category":"page"},{"location":"reduced_order_modeling/autoencoder/#Reduced-Order-modeling-and-Autoencoders","page":"POD and Autoencoders","title":"Reduced Order modeling and Autoencoders","text":"","category":"section"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"Reduced order modeling is a data-driven technique that exploits the structure of parametric PDEs to make solving those PDEs easier.","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"Consider a parametric PDE written in the form: F(z(mu)mu)=0 where z(mu) evolves on a infinite-dimensional Hilbert space V. 
","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"In modeling any PDE we have to choose a discretization (particle discretization, finite element method, ...) of V, which will be denoted by V_h. ","category":"page"},{"location":"reduced_order_modeling/autoencoder/#Solution-manifold","page":"POD and Autoencoders","title":"Solution manifold","text":"","category":"section"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"To any parametric PDE we associate a solution manifold: ","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"mathcalM = z(mu)F(z(mu)mu)=0 muinmathbbP","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"(Image: )","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"In the image above a 2-dimensional solution manifold is visualized as a sub-manifold in 3-dimensional space. In general the embedding space is an infinite-dimensional function space.","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"As an example of this consider the 1-dimensional wave equation: ","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"partial_tt^2q(tximu) = mu^2partial_xixi^2q(tximu)text on ItimesOmega","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"where I = (01) and Omega=(-1212). As initial condition for the first derivative we have partial_tq(0ximu) = -mupartial_xiq_0(ximu) and furthermore q(tximu)=0 on the boundary (i.e. xiin-1212).","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"The solution manifold is a 1-dimensional submanifold: ","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"mathcalM = (t xi)mapstoq(tximu)=q_0(xi-mutmu)muinmathbbPsubsetmathbbR","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"If we provide an initial condition u_0, a parameter instance mu and a time t, then ximapstoq(tximu) will be the momentary solution. If we consider the time evolution of q(tximu), then it evolves on a two-dimensional submanifold barmathcalM = ximapstoq(tximu)tinImuinmathbbP.","category":"page"},{"location":"reduced_order_modeling/autoencoder/#General-workflow","page":"POD and Autoencoders","title":"General workflow","text":"","category":"section"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"In reduced order modeling we aim to construct a mapping to a space that is close to this solution manifold. 
This is done through the following steps: ","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"Discretize the PDE.\nSolve the discretized PDE for a certain set of parameter instances muinmathbbP.\nBuild a reduced basis with the data obtained from having solved the discretized PDE. This step consists of finding two mappings: the reduction mathcalP and the reconstruction mathcalR.","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"The third step can be done with various machine learning (ML) techniques. Traditionally the most popular of these has been Proper orthogonal decomposition (POD), but in recent years autoencoders have also become a popular alternative (see (Fresca et al, 2021)). ","category":"page"},{"location":"reduced_order_modeling/autoencoder/#References","page":"POD and Autoencoders","title":"References","text":"","category":"section"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"Fresca, Stefania, Luca Dede, and Andrea Manzoni. \"A comprehensive deep learning-based approach to reduced order modeling of nonlinear time-dependent parametrized PDEs.\" Journal of Scientific Computing 87 (2021): 1-36.","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/#Horizontal-component-of-the-Lie-algebra-\\mathfrak{g}","page":"Global Tangent Space","title":"Horizontal component of the Lie algebra mathfrakg","text":"","category":"section"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"What we use to generalize Adam (and other algorithms) to manifolds is a global tangent space representation of the homogeneous spaces. ","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"For the Stiefel manifold, this global tangent space representation takes a simple form: ","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"B = beginbmatrix\n A -B^T \n B mathbbO\nendbmatrix","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"where AinmathbbR^ntimesn is skew-symmetric and BinmathbbR^(N-n)timesn is arbitrary. In GeometricMachineLearning the struct StiefelLieAlgHorMatrix implements elements of this form.","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/#Theoretical-background","page":"Global Tangent Space","title":"Theoretical background","text":"","category":"section"},{"location":"arrays/stiefel_lie_alg_horizontal/#Vertical-and-horizontal-components","page":"Global Tangent Space","title":"Vertical and horizontal components","text":"","category":"section"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"The Stiefel manifold is a homogeneous space obtained from SO(N) by setting two matrices, whose first n columns coincide, equivalent. 
Another way of expressing this is: ","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"A_1 sim A_2 iff A_1E = A_2E","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"for ","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"E = beginbmatrix mathbbI mathbbOendbmatrix","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"The tangent space T_ESt(nN) can also be expressed that way:","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"T_ESt(nN) = mathfrakgcdotE = BEBinmathfrakg","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"The kernel of the mapping mathfrakgtoT_ESt(nN) BmapstoBE is referred to as mathfrakg^mathrmverE, the vertical component of the Lie algebra at E. It is clear that elements belonging to mathfrakg^mathrmverE are of the following form: ","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"beginbmatrix\nhatmathbbO tildemathbbO^T \ntildemathbbO C\nendbmatrix","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"where hatmathbbOinmathbbR^ntimesn is a \"small\" matrix and tildemathbbOinmathbbR^Ntimesn is a bigger one. CinmathbbR^NtimesN is a skew-symmetric matrix. ","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"We can then take the orthogonal complement of this matrix (with respect to the canonical metric). We will denote this by mathfrakg^mathrmhorEequivmathfrakg^mathrmhor and call it the horizontal component. Its elements are of the form described on top of this page.","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/#Special-functions","page":"Global Tangent Space","title":"Special functions","text":"","category":"section"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"You can also draw random elements from mathfrakg^mathrmhor through e.g. 
","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"rand(CUDADevice(), StiefelLieAlgHorMatrix{Float32}, 10, 5)","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Global Tangent Space","title":"Global Tangent Space","text":"In this example: N=10 and n=5.","category":"page"},{"location":"library/","page":"Library","title":"Library","text":"CurrentModule = GeometricMachineLearning","category":"page"},{"location":"library/#GeometricMachineLearning-Library-Functions","page":"Library","title":"GeometricMachineLearning Library Functions","text":"","category":"section"},{"location":"library/","page":"Library","title":"Library","text":"Modules = [GeometricMachineLearning]","category":"page"},{"location":"library/#GeometricMachineLearning.AbstractCache","page":"Library","title":"GeometricMachineLearning.AbstractCache","text":"AbstractCache has subtypes: AdamCache MomentumCache GradientCache\n\nAll of them can be initialized with providing an array (also supporting manifold types).\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.AbstractRetraction","page":"Library","title":"GeometricMachineLearning.AbstractRetraction","text":"AbstractRetraction is a type that comprises all retraction methods for manifolds. For every manifold layer one has to specify a retraction method that takes the layer and elements of the (global) tangent space.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.AdamOptimizer","page":"Library","title":"GeometricMachineLearning.AdamOptimizer","text":"Defines the Adam Optimizer. Algorithm and suggested defaults are taken from (Goodfellow et al., 2016, page 301), except for δ, because single precision is used!\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.Attention","page":"Library","title":"GeometricMachineLearning.Attention","text":"MultiHeadAttention (MHA) serves as a preprocessing step in the transformer. It reweights the input vectors bases on correlations within those data. \n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.BlockIdentityLowerMatrix","page":"Library","title":"GeometricMachineLearning.BlockIdentityLowerMatrix","text":"A BlockIdentityLowerMatrix is a matrix with blocks | 1 0 | | S 1 | Currently, it only implements a custom mul! method, exploiting this structure.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.BlockIdentityUpperMatrix","page":"Library","title":"GeometricMachineLearning.BlockIdentityUpperMatrix","text":"A BlockIdentityUpperMatrix is a matrix with blocks | 1 S | | 0 1 | Currently, it only implements a custom mul! method, exploiting this structure.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.Classification","page":"Library","title":"GeometricMachineLearning.Classification","text":"Classification Layer that takes a matrix as an input and returns a vector that is used for MNIST classification. \n\nTODO: Implement picking the last vector.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.DataLoader","page":"Library","title":"GeometricMachineLearning.DataLoader","text":"Data Loader is a struct that creates an instance based on a tensor (or different input format) and is designed to make training convenient. 
\n\nImplemented: \n\nIf the data loader is called with a single tensor, a batchsize and an outputsize, then the batch is drawn randomly in the relevant range and the output is assigned accordingly.\n\nThe fields of the struct are the following: \n\n - data: The input data with axes (i) system dimension, (ii) number of parameters and (iii) number of time steps.\n\n - batch: A tensor in which the current batch is stored.\n\n - target_tensor: A tensor in which the current target is stored.\n\n - output: The tensor from which the output is drawn - this may be of type Nothing if the constructor is only called with one tensor.\n\n - sys_dim: The ``dimension'' of the system, i.e. what is taken as input by a regular neural network.\n\n - seq_length: The length batches should have.\n\n - batch_size:\n\n - output_size: The size of the second axis of the output tensor (prediction_window, output_size=1 in most cases)\n\n - n_params: The number of parameters that are present in the data set (length of second axis)\n\n - n_time_steps: Number of time steps (length of third axis)\n\nFor drawing the batch, the sampling is done over nparams and ntimesteps (here seqlength and output_size are also taken into account).\n\nTODO: Implement DataLoader that works well with GeometricEnsembles etc.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.GradientOptimizer","page":"Library","title":"GeometricMachineLearning.GradientOptimizer","text":"Define the Gradient optimizer, i.e. W ← W - η*∇f(W) Or the riemannian manifold equivalent, if applicable.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.GrassmannLayer","page":"Library","title":"GeometricMachineLearning.GrassmannLayer","text":"Defines a layer that performs simple multiplication with an element of the Grassmann manifold.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.GrassmannManifold","page":"Library","title":"GeometricMachineLearning.GrassmannManifold","text":"maybe consider dividing the output in the check functions by n! TODO: Implement sampling procedures!!\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.LayerWithManifold","page":"Library","title":"GeometricMachineLearning.LayerWithManifold","text":"Additional types to make handling manifolds more readable.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.ManifoldLayer","page":"Library","title":"GeometricMachineLearning.ManifoldLayer","text":"This defines a manifold layer that only has one matrix-valued manifold A associated with it does xmapstoAx. \n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.MomentumOptimizer","page":"Library","title":"GeometricMachineLearning.MomentumOptimizer","text":"Define the Momentum optimizer, i.e. V ← αV - ∇f(W) W ← W + ηV Or the riemannian manifold equivalent, if applicable.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.MultiHeadAttention","page":"Library","title":"GeometricMachineLearning.MultiHeadAttention","text":"MultiHeadAttention (MHA) serves as a preprocessing step in the transformer. It reweights the input vectors bases on correlations within those data. \n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.Optimizer","page":"Library","title":"GeometricMachineLearning.Optimizer","text":"Optimizer struct that stores the 'method' (i.e. 
Adam with corresponding hyperparameters), the cache and the optimization step.\n\nIt takes as input an optimization method and the parameters of a network. \n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.PSDLayer","page":"Library","title":"GeometricMachineLearning.PSDLayer","text":"This is a PSD-like layer used for symplectic autoencoders. One layer has the following shape:\n\n|Φ 0|\n\nA = |0 Φ|, where Φ is an element of the regular Stiefel manifold. \n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.StiefelLayer","page":"Library","title":"GeometricMachineLearning.StiefelLayer","text":"Defines a layer that performs simple multiplication with an element of the Stiefel manifold.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.SymmetricMatrix","page":"Library","title":"GeometricMachineLearning.SymmetricMatrix","text":"A SymmetricMatrix is a matrix | a S | | S b |\n\nThe first index is the row index, the second one the column index.\n\nIf the constructor is called with a matrix as input it returns a symmetric matrix via the projection A ↦ .5*(A + Aᵀ). This is a projection defined via the canonical metric (A,B) ↦ tr(AᵀB).\n\nTODO: Overload Adjoint operation for SymmetricMatrix!! (Aᵀ = A)\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.SympNetLayer","page":"Library","title":"GeometricMachineLearning.SympNetLayer","text":"Implements the various layers from the SympNet paper: (https://www.sciencedirect.com/science/article/abs/pii/S0893608020303063). Its components are of the form: \n\nbeginpmatrix\n I nablaV 0 I \nendpmatrix\n\nwith V(p) = sum_ia_iSigma(sum_jk_ijp_j+b_i), where Sigma is the antiderivative of the activation function sigma (one-layer neural network). Such layers are by construction symplectic.\n\nFor the linear layer, the activation and the bias are left out, and for the activation layer K and b are left out!\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.TrainingData","page":"Library","title":"GeometricMachineLearning.TrainingData","text":"TrainingData stores: \n\n - problem \n\n - shape \n\n - get \n\n - symbols \n\n - dim \n\n - noisemaker\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.Transformer-Tuple{Integer, Integer, Integer}","page":"Library","title":"GeometricMachineLearning.Transformer","text":"The architecture for a \"transformer encoder\" is essentially taken from arXiv:2010.11929, but with the difference that 𝐧𝐨 layer normalization is employed. This is because we still need to find a generalization of layer normalization to manifolds. \n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.assign_batch_kernel!-Tuple{Any}","page":"Library","title":"GeometricMachineLearning.assign_batch_kernel!","text":"Takes as input a batch tensor (to which the data are assigned) the whole data tensor and two vectorsparams'' and ``time_steps'' that include the specific parameters and time steps we want to assign. \n\nNote that this assigns sequential data! For e.g. being processed by a transformer.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.assign_output_estimate-Union{Tuple{T}, Tuple{AbstractArray{T, 3}, Any}} where T","page":"Library","title":"GeometricMachineLearning.assign_output_estimate","text":"Closely related to the transformer. 
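The GradientOptimizer and MomentumOptimizer docstrings above state their update rules; for weights on a plain vector space they amount to only a few lines. The following is a minimal sketch with a toy quadratic loss; the function names and the loss are assumptions for illustration, not library code.

```julia
# toy quadratic loss L(W) = ‖W‖² and its gradient (illustrative assumption)
∇L(W) = 2 .* W

# gradient step:  W ← W − η ∇L(W)
gradient_step(W, η) = W .- η .* ∇L(W)

# momentum step:  V ← αV − ∇L(W),  W ← W + ηV
function momentum_step(W, V, η, α)
    V = α .* V .- ∇L(W)
    W = W .+ η .* V
    return W, V
end

function run_momentum(W; η = 0.1, α = 0.5, steps = 100)
    V = zero(W)
    for _ in 1:steps
        W, V = momentum_step(W, V, η, α)
    end
    return W
end

sum(abs2, run_momentum(randn(5, 5)))   # ≈ 0: the toy loss is driven to its minimum
```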
It takes the last prediction_window columns of the output and uses it for the final prediction.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.assign_output_kernel!-Tuple{Any}","page":"Library","title":"GeometricMachineLearning.assign_output_kernel!","text":"This should be used together with assignbatchkernel!. It assigns the corresponding output (i.e. target).\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.augment_zeros_kernel!-Tuple{Any}","page":"Library","title":"GeometricMachineLearning.augment_zeros_kernel!","text":"Used for differentiating assignoutputestimate (this appears in the loss). \n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.draw_batch!-Union{Tuple{T}, Tuple{AbstractArray{T, 3}, AbstractArray{T, 3}, AbstractArray{T, 3}}} where T","page":"Library","title":"GeometricMachineLearning.draw_batch!","text":"This function draws random time steps and parameters and based on these assigns the batch and the output.\n\nFor ODE DataLoader: (i) batch input tensor to be written on (ii) batch output tensor to be written on (iii) data tensor \n\nFor MNIST DataLoader: (i) batch input tensor to be written on (ii) batch output tensor to be written on (iii) data tensor (iv) target tensor\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.draw_batch!-Union{Tuple{T}, Tuple{AbstractMatrix{T}, AbstractMatrix{T}}} where T","page":"Library","title":"GeometricMachineLearning.draw_batch!","text":"This assigns the batch if the data are in the form of a matrix.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.init_optimizer_cache-Tuple{GradientOptimizer, Any}","page":"Library","title":"GeometricMachineLearning.init_optimizer_cache","text":"Wrapper for the functions setupadamcache, setupmomentumcache, setupgradientcache. 
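The draw_batch! docstrings above sample over the parameter and time axes of a data tensor. A stand-alone illustration of that sampling pattern (not the library routine itself; all sizes are assumptions) could look like this:

```julia
# data tensor with axes (system dimension, parameters, time steps)
sys_dim, n_params, n_time_steps = 4, 8, 100
data = randn(Float32, sys_dim, n_params, n_time_steps)

batch_size, seq_length = 16, 10

# draw random (parameter, starting time step) pairs and copy the
# corresponding windows into a batch tensor
batch = zeros(Float32, sys_dim, seq_length, batch_size)
for k in 1:batch_size
    p = rand(1:n_params)
    t = rand(1:(n_time_steps - seq_length + 1))
    batch[:, :, k] .= data[:, p, t:(t + seq_length - 1)]
end
size(batch)   # (4, 10, 16)
```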
These appear outside of optimizer_caches.jl because the OptimizerMethods first have to be defined.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.patch_index-Union{Tuple{T}, Tuple{T, T, T}, NTuple{4, T}} where T<:Integer","page":"Library","title":"GeometricMachineLearning.patch_index","text":"Based on coordinates i,j this returns the batch index (for MNIST data set for now).\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.train!","page":"Library","title":"GeometricMachineLearning.train!","text":"train!(...)\n\nPerform a training of a neural networks on data using given method a training Method\n\nDifferent ways of use:\n\ntrain!(neuralnetwork, data, optimizer = GradientOptimizer(1e-2), training_method; nruns = 1000, batch_size = default(data, type), showprogress = false )\n\nArguments\n\nneuralnetwork::LuxNeuralNetwork : the neural net work using LuxBackend\ndata : the data (see TrainingData)\noptimizer = GradientOptimizer: the optimization method (see Optimizer)\ntraining_method : specify the loss function used \nnruns : number of iteration through the process with default value \nbatch_size : size of batch of data used for each step\n\n\n\n\n\n","category":"function"},{"location":"library/#GeometricMachineLearning.train!-Tuple{AbstractNeuralNetworks.AbstractNeuralNetwork{<:AbstractNeuralNetworks.Architecture}, AbstractTrainingData, TrainingParameters}","page":"Library","title":"GeometricMachineLearning.train!","text":"train!(neuralnetwork, data, optimizer, training_method; nruns = 1000, batch_size, showprogress = false )\n\nArguments\n\nneuralnetwork::LuxNeuralNetwork : the neural net work using LuxBackend\ndata::AbstractTrainingData : the data\n``\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.within_patch_index-Union{Tuple{T}, Tuple{T, T, T}} where T<:Integer","page":"Library","title":"GeometricMachineLearning.within_patch_index","text":"Based on coordinates i,j this returns the index within the batch\n\n\n\n\n\n","category":"method"},{"location":"optimizers/adam_optimizer/#The-Adam-Optimizer","page":"Adam Optimizer","title":"The Adam Optimizer","text":"","category":"section"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"The Adam Optimizer is one of the most widely (if not the most widely used) neural network optimizer. Like most modern neural network optimizers it contains a cache that is updated based on first-order gradient information and then, in a second step, the cache is used to compute a velocity estimate for updating the neural networ weights. ","category":"page"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"Here we first describe the Adam algorithm for the case where all the weights are on a vector space and then show how to generalize this to the case where the weights are on a manifold. ","category":"page"},{"location":"optimizers/adam_optimizer/#All-weights-on-a-vector-space","page":"Adam Optimizer","title":"All weights on a vector space","text":"","category":"section"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"The cache of the Adam optimizer consists of first and second moments. The first moments B_1 store linear information about the current and previous gradients, and the second moments B_2 store quadratic information about current and previous gradients (all computed from a first-order gradient). 
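The four update steps are written out explicitly below; when all weights live on a vector space they translate directly into code. The following is a minimal sketch of one Adam step with the defaults quoted in this section (the mutable cache struct and the toy loss are illustrative assumptions, not the library's AdamCache; the second-moment update uses ρ₂ throughout):

```julia
mutable struct AdamState{T}
    B₁::T
    B₂::T
    t::Int
end

function adam_step!(s::AdamState, W, ∇L; ρ₁ = 0.9, ρ₂ = 0.99, η = 0.01, δ = 3e-7)
    s.t += 1
    s.B₁ .= ((ρ₁ - ρ₁^s.t) / (1 - ρ₁^s.t)) .* s.B₁ .+ ((1 - ρ₁) / (1 - ρ₁^s.t)) .* ∇L
    s.B₂ .= ((ρ₂ - ρ₂^s.t) / (1 - ρ₂^s.t)) .* s.B₂ .+ ((1 - ρ₂) / (1 - ρ₂^s.t)) .* (∇L .* ∇L)
    W .-= η .* s.B₁ ./ (sqrt.(s.B₂) .+ δ)   # velocity and update, element-wise
    return W
end

W = randn(10)
s = AdamState(zero(W), zero(W), 0)
for _ in 1:200
    adam_step!(s, W, 2 .* W)                # toy loss L(W) = ‖W‖², so ∇L = 2W
end
sum(abs2, W)                                # ≈ 0
```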
","category":"page"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"If all the weights are on a vector space, then we directly compute updates for B_1 and B_2:","category":"page"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"B_1 gets ((rho_1 - rho_1^t)(1 - rho_1^t))cdotB_1 + (1 - rho_1)(1 - rho_1^t)cdotnablaL\nB_2 gets ((rho_2 - rho_1^t)(1 - rho_2^t))cdotB_2 + (1 - rho_2)(1 - rho_2^t)cdotnablaLodotnablaL\nwhere odotmathbbR^ntimesmathbbR^ntomathbbR^n is the Hadamard product: aodotb_i = a_ib_i. rho_1 and rho_2 are hyperparameters. Their defaults, rho_1=09 and rho_2=099, are taken from (Goodfellow et al., 2016, page 301). After having updated the cache (i.e. B_1 and B_2) we compute a velocity (step 3) with which the parameters Y_t are then updated (step 4).\nW_tgets -etaB_1sqrtB_2 + delta\nY_t+1 gets Y_t + W_t","category":"page"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"Here eta (with default 0.01) is the learning rate and delta (with default 3cdot10^-7) is a small constant that is added for stability. The division, square root and addition in step 3 are performed element-wise. ","category":"page"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"(Image: )","category":"page"},{"location":"optimizers/adam_optimizer/#Weights-on-manifolds","page":"Adam Optimizer","title":"Weights on manifolds","text":"","category":"section"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"The problem with generalizing Adam to manifolds is that the Hadamard product odot as well as the other element-wise operations (, sqrt and + in step 3 above) lack a clear geometric interpretation. In GeometricMachineLearning we get around this issue by utilizing a so-called global tangent space representation. ","category":"page"},{"location":"optimizers/adam_optimizer/#References","page":"Adam Optimizer","title":"References","text":"","category":"section"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"Goodfellow I, Bengio Y, Courville A. Deep learning[M]. MIT press, 2016.","category":"page"},{"location":"data_loader/data_loader/#Data-Loader","page":"Routines","title":"Data Loader","text":"","category":"section"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"GeometricMachineLearning provides flexible routines to load and manage data for training neural networks. DataLoader has several constructors: ","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"If provided with a tensor, then it assumes the first axis is the system dimension, the second axis is the dimension of the parameter space, and the third axis gives the time evolution of the system. \nIf provided with a tensor and a vector, it assumes the data are related to a classification task. ","category":"page"},{"location":"tutorials/mnist_tutorial/#MNIST-tutorial","page":"MNIST","title":"MNIST tutorial","text":"","category":"section"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"This is a short tutorial that shows how we can use GeometricMachineLearning to build a vision transformer and apply it for MNIST, while also putting some of the weights on a manifold. 
","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"First, we need to import the relevant packages: ","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"using GeometricMachineLearning, CUDA\nimport Zygote, MLDatasets","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"In this example Zygote as an AD routine and we get the dataset from MLDatasets. First we need to load the data set, and put it on GPU (if you have one):","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"train_x, train_y = MLDatasets.MNIST(split=:train)[:]\ntest_x, test_y = MLDatasets.MNIST(split=:test)[:]\ntrain_x = train_x |> cu \ntest_x = test_x |> cu \ntrain_y = train_y |> cu \ntest_y = test_y |> cu","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"GeometricMachineLearning has built-in data loaders that make it particularly easy to handle data: ","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"patch_length = 7\ndl = DataLoader(train_x, train_y, batch_size=512, patch_length=patch_length)\ndl_test = DataLoader(train_x, train_y, batch_size=length(y), patch_length=patch_length)","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"The second line in the above code snippet indicates that we use the entire data set as one \"batch\" when processing the test set. For training, the batch size was here set to 512. ","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"ps = initialparameters(backend, eltype(dl.data), Ψᵉ) \n\noptimizer_instance = Optimizer(o, ps)\n\nprintln(\"initial test accuracy: \", accuracy(Ψᵉ, ps, dl_test), \"\\n\")\n\nprogress_object = Progress(n_training_steps; enabled=true)\n\nloss_array = zeros(eltype(train_x), n_training_steps)\nfor i in 1:n_training_steps\n redraw_batch!(dl)\n # get rid of try catch statement. This softmax issue should be solvable!\n loss_val, pb = try Zygote.pullback(ps -> loss(Ψᵉ, ps, dl), ps)\n catch\n loss_array[i] = loss_array[i-1] \n continue \n end\n dp = pb(one(loss_val))[1]\n\n optimization_step!(optimizer_instance, Ψᵉ, ps, dp)\n ProgressMeter.next!(progress_object; showvalues = [(:TrainingLoss, loss_val)]) \n loss_array[i] = loss_val\nend\n\nprintln(\"final test accuracy: \", accuracy(Ψᵉ, ps, dl_test), \"\\n\")","category":"page"},{"location":"layers/multihead_attention_layer/#Multihead-Attention-Layer","page":"Multihead Attention","title":"Multihead Attention Layer","text":"","category":"section"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"In order to arrive from the attention layer at the multihead attention layer we have to do a few modifications: ","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"Note that these neural networks were originally developed for natural language processing (NLP) tasks and the terminology used here bears some resemblance to that field. 
The input to a multihead attention layer typicaly comprises three components:","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"Values VinmathbbR^ntimesT: a matrix whose columns are value vectors, \nQueries QinmathbbR^ntimesT: a matrix whose columns are query vectors, \nKeys KinmathbbR^ntimesT: a matrix whose columns are key vectors.","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"Regular attention performs the following operation: ","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"mathrmAttention(QKV) = Vmathrmsoftmax(fracK^TQsqrtn)","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"where n is the dimension of the vectors in V, Q and K. The softmax activation function here acts column-wise, so it can be seen as a transformation mathrmsoftmaxmathbbR^TtomathbbR^T with mathrmsoftmax(v)_i = e^v_ileft(sum_j=1e^v_jright). The K^TQ term is a similarity matrix between the queries and the vectors. ","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"The transformer contains a self-attention mechanism, i.e. takes an input X and then transforms it linearly to V, Q and K, i.e. V = P^VX, Q = P^QX and K = P^KX. What distinguishes the multihead attention layer from the singlehead attention layer, is that there is not just one P^V, P^Q and P^K, but there are several: one for each head of the multihead attention layer. After computing the individual values, queries and vectors, and after applying the softmax, the outputs are then concatenated together in order to obtain again an array that is of the same size as the input array:","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"(Image: )","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"Here the various P matrices can be interpreted as being projections onto lower-dimensional subspaces, hence the designation by the letter P. Because of this interpretation as projection matrices onto smaller spaces that should capture features in the input data it makes sense to constrain these elements to be part of the Stiefel manifold. ","category":"page"},{"location":"layers/multihead_attention_layer/#References","page":"Multihead Attention","title":"References","text":"","category":"section"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"Vaswani, Ashish, et al. \"Attention is all you need.\" Advances in neural information processing systems 30 (2017).","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/#Symplectic-Autoencoder","page":"PSD and Symplectic Autoencoders","title":"Symplectic Autoencoder","text":"","category":"section"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"Symplectic Autoencoders are a type of neural network suitable for treating Hamiltonian parametrized PDEs with slowly decaying Kolmogorov n-width. 
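A self-contained version of the attention operation above, with the column-wise softmax spelled out, may help fix the shapes in mind. This is only a plain-Julia sketch with arbitrary dimensions, not the library's MultiHeadAttention layer.

```julia
using LinearAlgebra

# column-wise softmax: every column of A becomes a probability vector
function colsoftmax(A)
    E = exp.(A .- maximum(A; dims = 1))     # subtract the column maximum for stability
    E ./ sum(E; dims = 1)
end

attention(Q, K, V) = V * colsoftmax(K' * Q ./ sqrt(size(V, 1)))

n, T = 8, 5                                 # vector dimension and sequence length
X = randn(n, T)

# self-attention: V, Q and K are linear images of the same input X
P_V, P_Q, P_K = randn(n, n), randn(n, n), randn(n, n)
Y = attention(P_Q * X, P_K * X, P_V * X)
size(Y)                                     # (8, 5): same shape as the input
```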
It is based on proper symplectic decomposition (PSD) and symplectic neural networks (SympNets).","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/#Hamiltonian-Model-Order-Reduction","page":"PSD and Symplectic Autoencoders","title":"Hamiltonian Model Order Reduction","text":"","category":"section"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"Hamiltonian PDEs are partial differential equations that, like its ODE counterpart, have a Hamiltonian associated with it. An example of this is the linear wave equation (see (Buckfink et al, 2023)) with Hamiltonian ","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"mathcalH(q p mu) = frac12int_Omegamu^2(partial_xiq(tximu))^2 + p(tximu)^2dxi","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"The PDE for to this Hamiltonian can be obtained similarly as in the ODE case:","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"partial_tq(tximu) = fracdeltamathcalHdeltap = p(tximu) quad partial_tp(tximu) = -fracdeltamathcalHdeltaq = mu^2partial_xixiq(tximu)","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/#Symplectic-Solution-Manifold","page":"PSD and Symplectic Autoencoders","title":"Symplectic Solution Manifold","text":"","category":"section"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"As with regular parametric PDEs, we also associate a solution manifold with Hamiltonian PDEs. This is a finite-dimensional manifold, on which the dynamics can be described through a Hamiltonian ODE. I NEED A PROOF OR SOME EXPLANATION FOR THIS!","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/#Workflow-for-Symplectic-ROM","page":"PSD and Symplectic Autoencoders","title":"Workflow for Symplectic ROM","text":"","category":"section"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"As with any other reduced order modeling technique we first discretize the PDE. This should be done with a structure-preserving scheme, thus yielding a (high-dimensional) Hamiltonian ODE as a result. 
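The Hamiltonian of the discretized system is written out just below. As an illustration of this first step of the workflow, a centered-difference discretization of the wave equation can be set up as follows; the grid size, the parameter value and the boundary handling are assumptions of this sketch.

```julia
using LinearAlgebra

N  = 128                        # number of grid points (assumption)
Δξ = 1 / (N + 1)
μ  = 0.8                        # parameter instance (assumption)

# second-order centered-difference Laplacian with homogeneous Dirichlet boundary conditions
D = SymTridiagonal(fill(-2.0, N), ones(N - 1)) / Δξ^2

# discretized Hamiltonian H(q, p) = ½ (qᵀ(-μ²D)q + pᵀp)
H(q, p) = 0.5 * (dot(q, -μ^2 .* (D * q)) + dot(p, p))

# the corresponding canonical Hamiltonian ODE:  q̇ = ∂H/∂p = p,  ṗ = -∂H/∂q = μ²Dq
vectorfield(q, p) = (p, μ^2 .* (D * q))
```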
Discretizing the wave equation above with finite differences yields a Hamiltonian system: ","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"mathcalH_mathrmdiscr(z(tmu)mu) = frac12x(tmu)^Tbeginbmatrix -mu^2D_xixi mathbbO mathbbO mathbbI endbmatrix x(tmu)","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"In Hamiltonian reduced order modelling we try to find a symplectic submanifold of the solution space[1] that captures the dynamics of the full system as well as possible.","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"[1]: The submanifold is: tildemathcalM = Psi^mathrmdec(z_r)inmathbbR^2Nu_rinmathrmR^2n where z_r is the reduced state of the system. ","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"Similar to the regular PDE case we again build an encoder Psi^mathrmenc and a decoder Psi^mathrmdec; but now both these mappings are required to be symplectic!","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"Concretely this means: ","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"The encoder is a mapping from a high-dimensional symplectic space to a low-dimensional symplectic space, i.e. Psi^mathrmencmathbbR^2NtomathbbR^2n such that nablaPsi^mathrmencmathbbJ_2N(nablaPsi^mathrmenc)^T = mathbbJ_2n.\nThe decoder is a mapping from a low-dimensional symplectic space to a high-dimensional symplectic space, i.e. Psi^mathrmdecmathbbR^2ntomathbbR^2N such that (nablaPsi^mathrmdec)^TmathbbJ_2NnablaPsi^mathrmdec = mathbbJ_2n.","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"If these two maps are constrained to linear maps, then one can easily find good solutions with proper symplectic decomposition (PSD).","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/#Proper-Symplectic-Decomposition","page":"PSD and Symplectic Autoencoders","title":"Proper Symplectic Decomposition","text":"","category":"section"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"For PSD the two mappings Psi^mathrmenc and Psi^mathrmdec are constrained to be linear, orthonormal (i.e. Psi^TPsi = mathbbI) and symplectic. The easiest way to enforce this is through the so-called cotangent lift: ","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"Psi_mathrmCL = \nbeginbmatrix Phi mathbbO mathbbO Phi endbmatrix","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"and PhiinSt(nN)submathbbR^Ntimesn, i.e. 
is an element of the Stiefel manifold. If the snapshot matrix is of the form: ","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"M = leftbeginarraycccc\nhatq_1(t_0) hatq_1(t_1) quadldotsquad hatq_1(t_f) \nhatq_2(t_0) hatq_2(t_1) ldots hatq_2(t_f) \nldots ldots ldots ldots \nhatq_N(t_0) hatq_N(t_1) ldots hatq_N(t_f) \nhatp_1(t_0) hatp_1(t_1) ldots hatp_1(t_f) \nhatp_2(t_0) hatp_2(t_1) ldots hatp_2(t_f) \nldots ldots ldots ldots \nhatp_N(t_0) hatp_N(t_1) ldots hatp_N(t_f) \nendarrayright","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"then Phi can be computed in a very straight-forward manner: ","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"Rearrange the rows of the matrix M such that we end up with a Ntimes2(f+1) matrix: hatM = M_q M_p.\nPerform SVD: hatM = USigmaV^T; set PhigetsUmathtt1n.","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"For details on the cotangent lift (and other methods for linear symplectic model reduction) consult (Peng and Mohseni, 2016).","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/#Symplectic-Autoencoders","page":"PSD and Symplectic Autoencoders","title":"Symplectic Autoencoders","text":"","category":"section"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"PSD suffers from the similar shortcomings as regular POD: it is a linear map and the approximation space tildemathcalM= Psi^mathrmdec(z_r)inmathbbR^2Nu_rinmathrmR^2n is strictly linear. For problems with slowly-decaying Kolmogorov n-width this leads to very poor approximations. ","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"In order to overcome this difficulty we use neural networks, more specifically SympNets, together with cotangent lift-like matrices. The resulting architecture, symplectic autoencoders, are demonstrated in the following image: ","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"(Image: )","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"So we alternate between SympNet and PSD layers. Because all the PSD layers are based on matrices PhiinSt(nN) we have to optimize on the Stiefel manifold.","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/#References","page":"PSD and Symplectic Autoencoders","title":"References","text":"","category":"section"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"Buchfink, Patrick, Silke Glas, and Bernard Haasdonk. 
\"Symplectic model reduction of Hamiltonian systems on nonlinear manifolds and approximation with weakly symplectic autoencoder.\" SIAM Journal on Scientific Computing 45.2 (2023): A289-A311.\nPeng, Liqian, and Kamran Mohseni. \"Symplectic model reduction of Hamiltonian systems.\" SIAM Journal on Scientific Computing 38.1 (2016): A1-A27.","category":"page"},{"location":"layers/attention_layer/#The-Attention-Layer","page":"Attention","title":"The Attention Layer","text":"","category":"section"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"The attention layer (and the orthonormal activation function defined for it) was specifically designed to generalize transformers to symplectic data. Usually a self-attention layer takes the following form: ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"Z = z^(1) ldots z^(T) mapsto Zmathrmsoftmax((P^QZ)^T(P^KZ))","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"where we left out the linear mapping onto the values P^V. ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"The idea behind is that we can perform a non-linear re-weighting of the columns of Z by multiplying with a Z-dependent matrix from the right and therefore take the sequential nature of the data into account (which is not possible with normal neural networks). After the attention step the transformer applies a simple ResNet from the left.","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"What the softmax does is a vector-wise operation, i.e. it operates on each column of an input matrix A = a_1 ldots a_T. The result is a sequence of probability vectors p^(1) ldots p^(T) for which ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"sum_i=1^Tp^(j)_i=1quadforalljin1dotsT","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"What we want to construct is a symplectic transformation that is transformer-like. For this we modify the attention layer the following way: ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"Z = z^(1) ldots z^(T) mapsto Zsigma((P^QZ)^T(P^KZ))","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"where sigma(A)=exp(mathttupper_triangular_asymmetrize(A)) and ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"mathttupper_triangular_asymmetrize(A)_ij = begincases a_ij textif ij -a_ji textif ij 0 textelseendcases","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"This has as a consequence that the matrix Lambda(Z) = sigma((P^QZ)^T(P^KZ)) is orthonormal and hence preserves an extended symplectic structure. To make this more clear, consider that the transformer maps sequences of vectors to sequences of vectors, i.e. VtimescdotstimesV ni z^1 ldots z^T mapsto hatz^1 ldots hatz^T. We can define a symplectic structure on VtimescdotstimesV by rearranging z^1 ldots z^T into a vector. 
We do this in the following way: ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"tildeZ = beginpmatrix q^(1)_1 q^(2)_1 cdots q^(T)_1 q^(1)_2 cdots q^(T)_d p^(1)_1 p^(2)_1 cdots p^(T)_1 p^(1)_2 cdots p^(T)_d endpmatrix","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"The symplectic structure on this big space is then: ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"mathbbJ=beginpmatrix\n mathbbO_dT mathbbI_dT \n -mathbbI_dT mathbbO_dT\nendpmatrix","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"Multiplying with the matrix Lambda(Z) from the right onto z^1 ldots z^T corresponds to applying the sparse matrix ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"tildeLambda(Z)=left\nbeginarrayccc\n Lambda(Z) cdots mathbbO_T \n vdots ddots vdots \n mathbbO_T cdots Lambda(Z) \n endarray\nright","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"from the left onto the big vector. ","category":"page"},{"location":"manifolds/homogeneous_spaces/#Homogeneous-Spaces","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"","category":"section"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"Homogeneous spaces are manifolds mathcalM on which a Lie group G acts transitively, i.e. ","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"forall XYinmathcalM existsAinGtext st AX = Y","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"Now fix a distinct element EinmathcalM. We can also establish an isomorphism between mathcalM and the quotient space Gsim with the equivalence relation: ","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"A_1 sim A_2 iff A_1E = A_2E","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"Note that this is independent of the chosen E.","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"The tangent spaces of mathcalM are of the form T_YmathcalM = mathfrakgcdotY, i.e. can be fully described through its Lie algebra. Based on this we can perform a splitting of mathfrakg into two parts:","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"The vertical component mathfrakg^mathrmverY is the kernel of the map mathfrakgtoT_YmathcalM V mapsto VY, i.e. mathfrakg^mathrmverY = VinmathfrakgVY = 0\nThe horizontal component mathfrakg^mathrmhorY is the orthogonal complement of mathfrakg^mathrmverY in mathfrakg. It is isomorphic to T_YmathcalM.","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"We will refer to the mapping from T_YmathcalM to mathfrakg^mathrmhor Y by Omega. 
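For the Stiefel manifold with distinct element E = [I; 0] this splitting can be checked numerically. The block sizes below are chosen so that all matrices lie in so(N), matching the horizontal form [A -Bᵀ; B 0] used throughout; the code is purely an illustration, not library functionality.

```julia
using LinearAlgebra

N, n = 6, 2
E = Matrix{Float64}(I, N, n)                 # the distinct element E = [I; 0]

skew(M) = 0.5 .* (M .- M')

# a vertical element: only the lower-right block is populated
C = skew(randn(N - n, N - n))
V = [zeros(n, n) zeros(n, N - n);
     zeros(N - n, n) C]

# a horizontal element: [A -Bᵀ; B 0] with A skew-symmetric
A  = skew(randn(n, n))
B  = randn(N - n, n)
Hh = [A (-B');
      B zeros(N - n, N - n)]

norm(V * E)           # ≈ 0: vertical elements are in the kernel of B ↦ BE
tr(V' * Hh)           # ≈ 0: horizontal and vertical components are orthogonal
```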
If we have now defined a metric langlecdotcdotrangle on mathfrakg, then this induces a Riemannian metric on mathcalM:","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"g_Y(Delta_1 Delta_2) = langleOmega(YDelta_1)Omega(YDelta_2)rangletext for Delta_1Delta_2inT_YmathcalM","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"Two examples of homogeneous spaces implemented in GeometricMachineLearning are the Stiefel and the Grassmann manifold.","category":"page"},{"location":"manifolds/homogeneous_spaces/#References","page":"Homogeneous Spaces","title":"References","text":"","category":"section"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"Frankel, Theodore. The geometry of physics: an introduction. Cambridge university press, 2011.","category":"page"},{"location":"reduced_order_modeling/kolmogorov_n_width/#Kolmogorov-n-width","page":"Kolmogorov n-width","title":"Kolmogorov n-width","text":"","category":"section"},{"location":"reduced_order_modeling/kolmogorov_n_width/","page":"Kolmogorov n-width","title":"Kolmogorov n-width","text":"The Kolmogorov n-width measures how well some set mathcalM (typically the solution manifold) can be approximated with a linear subspace:","category":"page"},{"location":"reduced_order_modeling/kolmogorov_n_width/","page":"Kolmogorov n-width","title":"Kolmogorov n-width","text":"d_n(mathcalM) = mathrminf_V_nsubVmathrmdimV_n=nmathrmsup(uinmathcalM)mathrminf_v_ninV_n u - v_n _V","category":"page"},{"location":"reduced_order_modeling/kolmogorov_n_width/","page":"Kolmogorov n-width","title":"Kolmogorov n-width","text":"with mathcalMsubV and V is a (typically infinite-dimensional) Banach space. For advection-dominated problems (among others) the decay of the Kolmogorov n-width is very slow, i.e. one has to pick n very high in order to obtain useful approximations (see (Greif and Urban, 2019) and (Blickhan, 2023)).","category":"page"},{"location":"reduced_order_modeling/kolmogorov_n_width/","page":"Kolmogorov n-width","title":"Kolmogorov n-width","text":"In order to overcome this, techniques based on neural networks (see e.g. (Lee and Carlberg, 2020)) and optimal transport (see e.g. (Blickhan, 2023)) have been used. ","category":"page"},{"location":"reduced_order_modeling/kolmogorov_n_width/#References","page":"Kolmogorov n-width","title":"References","text":"","category":"section"},{"location":"reduced_order_modeling/kolmogorov_n_width/","page":"Kolmogorov n-width","title":"Kolmogorov n-width","text":"Greif, Constantin, and Karsten Urban. \"Decay of the Kolmogorov N-width for wave problems.\" Applied Mathematics Letters 96 (2019): 216-222.\nBlickhan, Tobias. \"A registration method for reduced basis problems using linear optimal transport.\" arXiv preprint arXiv:2304.14884 (2023).\nLee, Kookjin, and Kevin T. Carlberg. 
\"Model reduction of dynamical systems on nonlinear manifolds using deep convolutional autoencoders.\" Journal of Computational Physics 404 (2020): 108973.","category":"page"},{"location":"optimizers/manifold_related/global_sections/#Global-Sections","page":"Global Sections","title":"Global Sections","text":"","category":"section"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"Global sections are needed needed for the generalization of Adam and other optimizers to homogeneous spaces. They are necessary to perform the two mappings represented represented by horizontal and vertical red lines in the section on the general optimizer framework.","category":"page"},{"location":"optimizers/manifold_related/global_sections/#Computing-the-global-section","page":"Global Sections","title":"Computing the global section","text":"","category":"section"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"In differential geometry a section is always associated to some bundle, in our case this bundle is piGtomathcalMAmapstoAE. A section is a mapping mathcalMtoG for which pi is a left inverse, i.e. picirclambda = mathrmid. ","category":"page"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"For the Stiefel manifold St(n N)subsetmathbbR^Ntimesn we compute the global section the following way: ","category":"page"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"Start with an element YinSt(nN),\nDraw a random matrix AinmathbbR^Ntimes(N-n),\nRemove the subspace spanned by Y from the range of A: AgetsA-YY^TA\nCompute a QR decomposition of A and take as section lambda(Y) = Y Q_1N 1(N-n).","category":"page"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"It is easy to check that lambda(Y)inG=SO(N).","category":"page"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"In GeometricMachineLearning, GlobalSection takes an element of YinSt(nN)equivStiefelManifold{T} and returns an instance of GlobalSection{T, StiefelManifold{T}}. The application O(N)timesSt(nN)toSt(nN) is done with the functions apply_section! and apply_section.","category":"page"},{"location":"optimizers/manifold_related/global_sections/#Computing-the-global-tangent-space-representation-based-on-a-global-section","page":"Global Sections","title":"Computing the global tangent space representation based on a global section","text":"","category":"section"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"The output of the horizontal lift Omega is an element of mathfrakg^mathrmhorY. For this mapping Omega(Y BY) = B if Binmathfrakg^mathrmhorY, i.e. there is no information loss and no projection is performed. We can map the Binmathfrakg^mathrmhorY to mathfrakg^mathrmhor with Bmapstolambda(Y)^-1Blambda(Y).","category":"page"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"The function global_rep performs both mappings at once[1], i.e. 
it takes an instance of GlobalSection and an element of T_YSt(nN), and then returns an element of mathfrakg^mathrmhorequivStiefelLieAlgHorMatrix.","category":"page"},{"location":"optimizers/manifold_related/global_sections/#Optimization","page":"Global Sections","title":"Optimization","text":"","category":"section"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"The output of global_rep is then used for all the optimization steps.","category":"page"},{"location":"optimizers/manifold_related/global_sections/#References","page":"Global Sections","title":"References","text":"","category":"section"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"Frankel, Theodore. The geometry of physics: an introduction. Cambridge university press, 2011.","category":"page"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"[1]: For computational reasons.","category":"page"},{"location":"optimizers/manifold_related/geodesic/#Geodesic-Retraction","page":"Geodesic Retraction","title":"Geodesic Retraction","text":"","category":"section"},{"location":"optimizers/manifold_related/geodesic/","page":"Geodesic Retraction","title":"Geodesic Retraction","text":"General retractions are approximations of the exponential map. In GeometricMachineLearning we can, instead of using an approximation, solve the geodesic equation exactly (up to numerical error) by specifying Geodesic() as the argument of layers that have manifold weights. ","category":"page"},{"location":"optimizers/manifold_related/cayley/#The-Cayley-Retraction","page":"Cayley Retraction","title":"The Cayley Retraction","text":"","category":"section"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"The Cayley transformation is one of the most popular retractions. For several matrix Lie groups it is a mapping from the Lie algebra mathfrakg onto the Lie group G. The Cayley retraction reads: ","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":" mathrmCayley(C) = left(mathbbI -frac12Cright)^-1left(mathbbI +frac12Cright)","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"This is easily checked to be a retraction, i.e. mathrmCayley(mathbbO) = mathbbI and fracpartialpartialtmathrmCayley(tC) = C.","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"What we need in practice is not the computation of the Cayley transform of an arbitrary matrix, but the Cayley transform of an element of mathfrakg^mathrmhor, the global tangent space representation. 
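Before the sparse structure of the horizontal component is exploited below via the Sherman-Morrison-Woodbury formula, a dense one-line implementation makes the retraction properties easy to verify. The block sizes are arbitrary and this is not the library's optimized routine, only a sketch.

```julia
using LinearAlgebra

cayley(C) = inv(I - 0.5 .* C) * (I + 0.5 .* C)

# an element of the horizontal component:  [A -Bᵀ; B 0]  with A skew-symmetric
N, n = 10, 3
A = let M = randn(n, n); 0.5 .* (M .- M') end
B = randn(N - n, n)
C = [A (-B');
     B zeros(N - n, N - n)]

Q = cayley(C)
norm(Q' * Q - I)          # ≈ 0: the Cayley transform of a skew-symmetric matrix is orthogonal
norm(cayley(zero(C)) - I) # ≈ 0: Cayley(0) = I, as required for a retraction
```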
","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"The elements of mathfrakg^mathrmhor can be written as: ","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"C = beginbmatrix\n A -B^T \n B mathbbO\nendbmatrix = beginbmatrix frac12A mathbbI B mathbbO endbmatrix beginbmatrix mathbbI mathbbO frac12A -B^T endbmatrix","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"where the second expression exploits the sparse structure of the array, i.e. it is a multiplication of a Ntimes2n with a 2ntimesN matrix. We can hence use the Sherman-Morrison-Woodbury formula to obtain:","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"(mathbbI - frac12UV)^-1 = mathbbI + frac12U(mathbbI - frac12VU)^-1V","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"So what we have to invert is the term ","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"mathbbI - frac12beginbmatrix mathbbI mathbbO frac12A -B^T endbmatrixbeginbmatrix frac12A mathbbI B mathbbO endbmatrix = \nbeginbmatrix mathbbI - frac14A - frac12mathbbI frac12B^TB - frac18A^2 mathbbI - frac14A endbmatrix","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"The whole cayley transform is then: ","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"left(mathbbI + frac12beginbmatrix frac12A mathbbI B mathbbO endbmatrix beginbmatrix mathbbI - frac14A - frac12mathbbI frac12B^TB - frac18A^2 mathbbI - frac14A endbmatrix^-1 beginbmatrix mathbbI mathbbO frac12A -B^T endbmatrix right)left( E + frac12beginbmatrix frac12A mathbbI B mathbbO endbmatrix beginbmatrix mathbbI frac12A endbmatrix right) = \n\nE + frac12beginbmatrix frac12A mathbbI B mathbbO endbmatrixleft(\n beginbmatrix mathbbI frac12A endbmatrix + \n beginbmatrix mathbbI - frac14A - frac12mathbbI frac12B^TB - frac18A^2 mathbbI - frac14A endbmatrix^-1left(\n\n beginbmatrix mathbbI frac12A endbmatrix + \n beginbmatrix frac12A frac14A^2 - frac12B^TB endbmatrix\n\n right)\n right)","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"Note that for computational reason we compute mathrmCayley(C)E instead of just the Cayley transform (see the section on retractions).","category":"page"},{"location":"architectures/sympnet/#SympNet","page":"SympNet","title":"SympNet","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"This page documents the SympNet architecture and its implementation in GeometricMachineLearning.jl.","category":"page"},{"location":"architectures/sympnet/#Quick-overview-of-the-theory-of-SympNet","page":"SympNet","title":"Quick overview of the theory of SympNet","text":"","category":"section"},{"location":"architectures/sympnet/#Principle","page":"SympNet","title":"Principle","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"SympNets is a new type of neural network 
that offers a new approach to computing the trajectory of a Hamiltonian system in phase space. Let us denote by (qp)=(q_1q_dp_1p_d)in mathbbR^2d the phase space with qin mathbbR^d the generalized position and pin mathbbR^d the generalized momentum. Given a physical problem, a SympNet takes a phase space element (qp) and aims to compute the next point (qp) of the trajectory in phase space one time step later, while preserving the well-known symplectic structure of Hamiltonian systems. The way a SympNet preserves the symplectic structure is specific to this architecture: the preservation is built into the network itself. Indeed, a SympNet is not made of traditional layers but of symplectic layers (described later), which turn the classical universal approximation theorem into a symplectic one: under suitable conditions on the activation function, a SympNet can approximate any symplectic map.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"A SympNet (denoted Phi in the following) is thus a symplecticity-preserving integrator from mathbbR^d times mathbbR^d to mathbbR^d times mathbbR^d which can compute, from an initial condition (q_0p_0), a sequence of phase space elements of a trajectory (q_np_n)=Phi(q_n-1p_n-1)==Phi^n(q_0p_0). The time step between predictions is not a parameter we can choose but is related to the temporal frequency of the training data. SympNets can handle both temporally regular data, i.e. with a fixed time step between data points, and temporally irregular data, i.e. with a variable time step. ","category":"page"},{"location":"architectures/sympnet/#Architecture-of-SympNets","page":"SympNet","title":"Architecture of SympNets","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"With GeometricMachineLearning.jl, it is possible to implement two types of architectures: LA-SympNets and G-SympNets. ","category":"page"},{"location":"architectures/sympnet/#LA-SympNet","page":"SympNet","title":"LA-SympNet","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"(Image: )","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"LA-SympNets are made of alternating layers of two types: symplectic linear layers and symplectic activation layers. For a given integer n, a symplectic linear layer is defined by","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"mathcalL^nup\nbeginpmatrix\n q \n p \nendpmatrix\n = \nbeginpmatrix \n I S^n0 \n 0S^n I \nendpmatrix\n cdots \nbeginpmatrix \n I 0 \n S^2 I \nendpmatrix\nbeginpmatrix \n I S^1 \n 0 I \nendpmatrix\nbeginpmatrix\n q \n p \nendpmatrix\n+ b ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"or ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"mathcalL^nlow\nbeginpmatrix q \n p endpmatrix = \n beginpmatrix \n I 0S^n \n S^n0 I\n endpmatrix cdots \n beginpmatrix \n I S^2 \n 0 I\n endpmatrix\n beginpmatrix \n I 0 \n S^1 I\n endpmatrix\n beginpmatrix q \n p endpmatrix\n + b ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"The parameters to learn are the symmetric matrices S^iinmathbbR^dtimes d and the bias binmathbbR^2d. The integer n is the width of the symplectic linear layer. 
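To make the structure of the symplectic linear layers concrete, here is a minimal sketch (plain Julia, independent of the actual GeometricMachineLearning layer types) of a single triangular factor of the form [I S; 0 I] together with a check that it is symplectic; the full layer is a product of such factors plus a bias:

    using LinearAlgebra

    d = 3
    S = Matrix(Symmetric(randn(d, d)))                     # the learnable symmetric matrix

    M = [Matrix(I, d, d) S; zeros(d, d) Matrix(I, d, d)]   # acts as (q, p) ↦ (q + S p, p)
    J = [zeros(d, d) Matrix(I, d, d); -Matrix(I, d, d) zeros(d, d)]   # canonical symplectic matrix

    @assert M' * J * M ≈ J   # the factor is symplectic; products of such factors remain symplectic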
If ngeq9, the symplectic linear layers can represent any linear symplectic map, so n need not be larger than 9. We denote the set of symplectic linear layers by mathcalM^L. This type of layer plays the role of a standard linear layer. ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"For a given activation function sigma, a symplectic activation layer is defined by","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":" mathcalA^up beginpmatrix q \n p endpmatrix = \n beginbmatrix \n Ihatsigma^a \n 0I\n endbmatrix beginpmatrix q \n p endpmatrix =\n beginpmatrix \n mathrmdiag(a)sigma(p)+q \n p\n endpmatrix","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"or","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":" mathcalA^low beginpmatrix q \n p endpmatrix = \n beginbmatrix \n I0 \n hatsigma^aI\n endbmatrix beginpmatrix q \n p endpmatrix\n =\n beginpmatrix \n q \n mathrmdiag(a)sigma(q)+p\n endpmatrix","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"The parameters to learn are the weights ainmathbbR^d. This type of layer plays the role of a standard activation layer. We denote the set of symplectic activation layers by mathcalM^A. ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"A LA-SympNet is a function of the form Psi=l_k+1 circ a_k circ l_k circ cdots circ a_1 circ l_1 where (l_i)_1leq ileq k+1 subset (mathcalM^L)^k+1 and ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"(a_i)_1leq ileq k subset (mathcalM^A)^k","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":".","category":"page"},{"location":"architectures/sympnet/#G-SympNet","page":"SympNet","title":"G-SympNet","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"G-SympNets are an alternative to LA-SympNets. They are built from only one kind of layer, called gradient layers. For a given activation function sigma and an integer ngeq d, a gradient layer is a symplectic map from mathbbR^2d to mathbbR^2d defined by","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":" mathcalG^up beginpmatrix q \n p endpmatrix = \n beginbmatrix \n Ihatsigma^Kab \n 0I\n endbmatrix beginpmatrix q \n p endpmatrix =\n beginpmatrix \n K^T mathrmdiag(a)sigma(Kp+b)+q \n p\n endpmatrix","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"or","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":" mathcalG^low beginpmatrix q \n p endpmatrix = \n beginbmatrix \n I0 \n hatsigma^KabI\n endbmatrix beginpmatrix q \n p endpmatrix\n =\n beginpmatrix \n q \n K^T mathrmdiag(a)sigma(Kq+b)+p\n endpmatrix","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"The parameters of this layer are the scale matrix KinmathbbR^ntimes d, the bias binmathbbR^n and the vector of weights ainmathbbR^n. The idea is that hatsigma^Kab can approximate any function of the form nabla V, hence the name of this layer. 
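The following is a small sketch of the up gradient layer defined above (plain Julia for illustration; the function name gradient_layer_up is chosen here and is not part of the package API):

    # 'up' gradient layer: (q, p) ↦ (q + Kᵀ diag(a) σ(K p + b), p)
    function gradient_layer_up(q, p, K, a, b, σ)
        q_new = q + K' * (a .* σ.(K * p .+ b))   # q is shifted by a field that is a gradient in p
        return q_new, p                          # p stays fixed, so the map is a symplectic shear
    end

    d, n = 2, 10                                 # half-dimension of phase space and layer width
    K, a, b = randn(n, d), randn(n), randn(n)
    q, p = randn(d), randn(d)
    q1, p1 = gradient_layer_up(q, p, K, a, b, tanh)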
The integer n is called the width of the gradient layer.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"If we denote by mathcalM^G the set of gradient layers, a G-SympNet is a function of the form Psi=g_k circ g_k-1 circ cdots circ g_1 where (g_i)_1leq ileq k subset (mathcalM^G)^k.","category":"page"},{"location":"architectures/sympnet/#Universal-approximation-theorems","page":"SympNet","title":"Universal approximation theorems","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"We now state the universal approximation theorems for both architectures. But first we need a few definitions. ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Let U be an open set of mathbbR^2d, and let us denote by SP^r(U) the set of C^r smooth symplectic maps on U. Let us give a topology on the set of C^r smooth maps from a compact set K of mathbbR^n to mathbbR^n, for any positive integer n, through the norm","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"f_C^r(KmathbbR^n) = undersetalphaleq rsum underset1leq i leq nmaxundersetxin Ksup D^alpha f_i(x)","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"where the differential operator D^alpha is defined for any map of C^r(mathbbR^nmathbbR) by ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"D^alpha f = fracpartial^alpha fpartial x_1^alpha_1x_n^alpha_n","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"with alpha = alpha_1 ++ alpha_n. ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Definition Let sigma be a real map and rin mathbbN. We say sigma is r-finite if sigmain C^r(mathbbRmathbbR) and int D^rsigma(x)dx +infty.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Definition Let mnrin mathbbN with mn0 be given, U an open set of mathbbR^m, and IJsubset C^r(UmathbbR^n). We say J is r-uniformly dense on compacta in I if J subset I and for any fin I, epsilon0, and any compact Ksubset U, there exists gin J such that f-g_C^r(KmathbbR^n) epsilon.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"We can now state the theorems.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Theorem (Approximation theorem for LA-SympNet) For any positive integer r0 and open set Uin mathbbR^2d, the set of LA-SympNets is r-uniformly dense on compacta in SP^r(U) if the activation function sigma is r-finite.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Theorem (Approximation theorem for G-SympNet) For any positive integer r0 and open set Uin mathbbR^2d, the set of G-SympNets is r-uniformly dense on compacta in SP^r(U) if the activation function sigma is r-finite.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"These two theorems establish the well-foundedness of the SympNet architectures: they are expressive enough to approximate arbitrary symplectic maps. 
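The examples listed below (sigmoid and tanh) satisfy this condition. As a quick illustration (added here for clarity, not part of the original text), for r = 1 the sigmoid is 1-finite because its derivative integrates to a finite value:

    \int_{\mathbb{R}} |D^1\sigma(x)|\,dx = \int_{\mathbb{R}} \sigma(x)\big(1 - \sigma(x)\big)\,dx = \big[\sigma(x)\big]_{-\infty}^{+\infty} = 1 < +\infty.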
","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Example of r-finite functions","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"sigmoid sigma(x)=frac11+e^-x for any positive integer r, \ntanh tanh(x)=frace^x-e^-xe^x+e^-x for any positive integer r. ","category":"page"},{"location":"architectures/sympnet/#SympNet-with-GeometricMachineLearning.jl","page":"SympNet","title":"SympNet with GeometricMachineLearning.jl","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"With GeometricMachineLearning.jl, it is really easy to implement and train a SympNet. The steps are the following :","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Create the architecture in one line with the function GSympNet or LASympNet,\nCreate the neural networks depending a backend (e.g. with Lux),\nCreate an optimizer for the training step,\nTrain the neural networks with the train!function.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Both LA-SympNet and G-SympNet architectures can be generated in one line with GeometricMachineLearning.jl.","category":"page"},{"location":"architectures/sympnet/#LA-SympNet-2","page":"SympNet","title":"LA-SympNet","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"To create a LA-SympNet, one needs to write","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"lasympnet = LASympNet(dim; width=9, nhidden=1, activation=tanh, init_uplow_linear=[true,false], \n init_uplow_act=[true,false],init_sym_matrices=Lux.glorot_uniform, init_bias=Lux.zeros32, \n init_weight=Lux.glorot_uniform) ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"LASympNet takes one obligatory argument:","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"dim : the dimension of the phase space,","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"and several keywords argument :","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"width : the width for all the symplectic linear layers with default value set to 9 (if width>9, width is set to 9),\nnhidden : the number of pairs of symplectic linear and activation layers with default value set to 0 (i.e LA-SympNet is a single symplectic linear layer),\nactivation : the activation function for all the symplectic activations layers with default value set to tanh,\ninituplowlinear : a vector of boolean whose the ith coordinate is true only if all the symplectic linear layers in (i mod length(init_uplow_linear))-th position is up (for example the default value is [true,false] which represents an alternation of up and low symplectic linear layers),\ninituplowact : a vector of boolean whose the ith coordinate is true only if all the symplectic activation layers in (i mod length(init_uplow_act))-th position is up (for example the default value is [true,false] which represents an alternation of up and low symplectic activation layers),\ninitsymmatrices: the function which gives the way to initialize the symmetric matrices S^i of symplectic linear layers,\ninit_bias: the function which gives the way to initialize the vector of bias 
b,\ninit_weight: the function used to initialize the weight vector a.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"The default values of the last three keyword arguments use Lux functions.","category":"page"},{"location":"architectures/sympnet/#G-SympNet-2","page":"SympNet","title":"G-SympNet","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"To create a G-SympNet, one needs to write","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"gsympnet = GSympNet(dim; width=dim, nhidden=1, activation=tanh, init_uplow=[true,false], init_weight=Lux.glorot_uniform, \ninit_bias=Lux.zeros32, init_scale=Lux.glorot_uniform) ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"GSympNet takes one obligatory argument:","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"dim : the dimension of the phase space,","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"and several keyword arguments:","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"width : the width of all the gradient layers with default value set to dim so that widthgeqdim,\nnhidden : the number of gradient layers with default value set to 1,\nactivation : the activation function for all the gradient layers with default value set to tanh,\ninit_uplow: a vector of booleans whose i-th coordinate is true if the gradient layers in the (i mod length(init_uplow))-th position are up (for example the default value is [true,false] which represents an alternation of up and low gradient layers),\ninit_weight: the function used to initialize the weight vector a,\ninit_bias: the function used to initialize the bias vector b,\ninit_scale: the function used to initialize the scale matrix K.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"The default values of the last three keyword arguments use Lux functions.","category":"page"},{"location":"architectures/sympnet/#Loss-function","page":"SympNet","title":"Loss function","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"To train the SympNet, one needs data along a trajectory, so that the model is trained to perform an integration. These data are (QP), where Qij (respectively Pij) is the real number q_j(t_i) (respectively p_j(t_i)), i.e. the j-th coordinate of the generalized position (respectively momentum) at the i-th time step. 
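As an illustration of how such training data can be generated (a minimal sketch using the symplectic Euler scheme for the pendulum treated further below; the helper name pendulum_data is chosen here for illustration, while data_q and data_p match the names used later in the training call), one could write:

    # integrate the pendulum q̇ = p, ṗ = -sin(q) with symplectic Euler and store the trajectory
    # as matrices with Q[i, j] = q_j(t_i) and P[i, j] = p_j(t_i) (here d = 1)
    function pendulum_data(q0, p0, h, nsteps)
        Q, P = zeros(nsteps + 1, 1), zeros(nsteps + 1, 1)
        Q[1, 1], P[1, 1] = q0, p0
        for i in 1:nsteps
            P[i + 1, 1] = P[i, 1] - h * sin(Q[i, 1])   # update p first
            Q[i + 1, 1] = Q[i, 1] + h * P[i + 1, 1]    # then update q with the new p
        end
        return Q, P
    end

    data_q, data_p = pendulum_data(0.0, 1.0, 0.1, 200)   # matrices of the shape described above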
One also needs a loss function, defined as:","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Loss(QP) = undersetisum d(Phi(Qi- Pi-) Qi+1- Pi+1-^T)","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"where d is a distance on mathbbR^2d.","category":"page"},{"location":"architectures/sympnet/#Examples","page":"SympNet","title":"Examples","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Let us see how to use it on several examples.","category":"page"},{"location":"architectures/sympnet/#Example-of-a-pendulum-with-G-SympNet","page":"SympNet","title":"Example of a pendulum with G-SympNet","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Let us begin with a simple example, the pendulum system, the Hamiltonian of which is ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"H(qp)inmathbbR^2 mapsto frac12p^2-cos(q) in mathbbR","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"The first thing to do is to create an architecture, in this example a G-SympNet.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"# number of inputs/dimension of system\nconst ninput = 2\n# layer dimension for gradient module \nconst ld = 10 \n# hidden layers\nconst ln = 4\n# activation function\nconst act = tanh\n\n# Creation of a G-SympNet architecture \ngsympnet = GSympNet(ninput, width=ld, nhidden=ln, activation=act)\n\n# Creation of a LA-SympNet architecture \nlasympnet = LASympNet(ninput, nhidden=ln, activation=act)","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Then we can create the neural networks depending on the backend. Here we will use Lux:","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"# create Lux network\nnn = NeuralNetwork(gsympnet, LuxBackend())","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"We have to define an optimizer which will be used in the training of the SympNet. For more details on optimizers, please see the corresponding documentation (Optimizer.md). For example, let us use a momentum optimizer:","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"# Optimiser\nopt = MomentumOptimizer(1e-2, 0.5)","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"We can now perform the training of the neural networks. The syntax is the following:","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"# number of training runs\nconst nruns = 10000\n# Batchsize used to compute the gradient of the loss function with respect to the parameters of the neural networks.\nconst nbatch = 10\n\n# perform training (returns array that contains the total loss for each training step)\ntotal_loss = train!(nn, opt, data_q, data_p; ntraining = nruns, batch_size = nbatch)","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"The train! function will update the parameters of the neural networks and return a vector containing the evolution of the loss function during the training. 
Default values for the arguments ntraining and batch_size are respectively 1000 and 10.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"The training data data_q and data_p must be matrices in mathbbR^ntimes d, where n is the number of time steps and d is half the dimension of the system, i.e. data_q[i,j] is q_j(t_i) where (t_1t_n) are the times corresponding to the training data.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Then we can make predictions. Let's compare the initial data with a prediction starting from the same phase space point using the provided function Iterate_Sympnet:","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"#predictions\nq_learned, p_learned = Iterate_Sympnet(nn, q0, p0; n_points = size(data_q,1))","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"(Image: )","category":"page"},{"location":"Optimizer/#Optimizer","page":"Optimizer Framework","title":"Optimizer","text":"","category":"section"},{"location":"Optimizer/","page":"Optimizer Framework","title":"Optimizer Framework","text":"In order to generalize neural network optimizers to homogeneous spaces, a class of manifolds we often encounter in machine learning, we have to find a global tangent space representation which we call mathfrakg^mathrmhor here. ","category":"page"},{"location":"Optimizer/","page":"Optimizer Framework","title":"Optimizer Framework","text":"Starting from an element of the tangent space T_YmathcalM[1], we need to perform two mappings to arrive at mathfrakg^mathrmhor, which we refer to by Omega and a red horizontal arrow:","category":"page"},{"location":"Optimizer/","page":"Optimizer Framework","title":"Optimizer Framework","text":"(Image: )","category":"page"},{"location":"Optimizer/","page":"Optimizer Framework","title":"Optimizer Framework","text":"Here the mapping Omega is a horizontal lift from the tangent space onto the horizontal component of the Lie algebra at Y. ","category":"page"},{"location":"Optimizer/","page":"Optimizer Framework","title":"Optimizer Framework","text":"The red line maps the horizontal component at Y, i.e. mathfrakg^mathrmhorY, to the horizontal component at mathfrakg^mathrmhor.","category":"page"},{"location":"Optimizer/","page":"Optimizer Framework","title":"Optimizer Framework","text":"The mathrmcache stores information about previous optimization steps and is dependent on the optimizer. The elements of the mathrmcache are also in mathfrakg^mathrmhor. Based on this the optimizer (Adam in this case) computes a final velocity, which is the input of a retraction. Because this update is done for mathfrakg^mathrmhorequivT_YmathcalM, we still need to perform a mapping, called apply_section here, that then finally updates the network parameters. The two red lines are described in global sections.","category":"page"},{"location":"Optimizer/#References","page":"Optimizer Framework","title":"References","text":"","category":"section"},{"location":"Optimizer/","page":"Optimizer Framework","title":"Optimizer Framework","text":"Brantner B. Generalizing Adam To Manifolds For Efficiently Training Transformers[J]. 
arXiv preprint arXiv:2305.16901, 2023.","category":"page"},{"location":"Optimizer/","page":"Optimizer Framework","title":"Optimizer Framework","text":"[1]: In practice this is obtained by first using an AD routine on a loss function L, and then computing the Riemannian gradient based on this. See the section of the Stiefel manifold for an example of this.","category":"page"},{"location":"manifolds/grassmann_manifold/#Grassmann-Manifold","page":"Grassmann","title":"Grassmann Manifold","text":"","category":"section"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"(The description of the Grassmann manifold is based on that of the Stiefel manifold, so this should be read first.)","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"An element of the Grassmann manifold G(nN) is a vector subspace subsetmathbbR^N of dimension n, and each such subspace can be represented by a full-rank matrix AinmathbbR^Ntimesn and the full space takes the form G(nN) = mathbbR^Ntimesnsim where the equivalence relation is AsimB iff existsCinmathbbR^ntimesntext st AC = B. One can find a parametrization of the manifold in the following way: Because the matrix A has full rank, it must contain n linearly independent rows: i_1 ldots i_n. For simplicity assume that i_1 = 1 i_2=2 ldots i_n=n and call the matrix made up by these rows C. Then the mapping to the coordinate chart is: AC^-1 and the last N-n rows are the coordinates. ","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"The tangent space for this element can then be represented through matrices: ","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"beginpmatrix\n 0 cdots 0 \n cdots cdots cdots \n 0 cdots 0 \n a_11 cdots a_1n \n cdots cdots cdots \n a_(N-n)1 cdots a_(N-n)n\nendpmatrix","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"The Grassmann manifold can also be seen as the Stiefel manifold modulo an equivalence class. 
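To illustrate the coordinate chart AC^-1 described above (a plain-Julia sketch for illustration only), one can check that the canonical representative of a subspace has the identity in its first n rows, so that only the last N-n rows carry coordinate information:

    using LinearAlgebra

    N, n = 5, 2
    A = randn(N, n)            # a full-rank representative of an element of G(n, N)

    C = A[1:n, :]              # the n × n block assumed to be invertible
    rep = A / C                # A C^(-1): same subspace, canonical representative

    @assert rep[1:n, :] ≈ I    # the first n rows are the identity
    coords = rep[n+1:end, :]   # the last N - n rows are the chart coordinates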
This leads to the following (which is used for optimization):","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"mathfrakg^mathrmhor = mathfrakg^mathrmhorE = leftbeginpmatrix 0 -B^T B 0 endpmatrix textB arbitraryright","category":"page"},{"location":"","page":"Home","title":"Home","text":"CurrentModule = GeometricMachineLearning","category":"page"},{"location":"#Geometric-Machine-Learning","page":"Home","title":"Geometric Machine Learning","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"GeometricMachineLearning.jl implements various scientific machine learning models that aim at learning dynamical systems with geometric structure, such as Hamiltonian (symplectic) or Lagrangian (variational) systems.","category":"page"},{"location":"#Installation","page":"Home","title":"Installation","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"GeometricMachineLearning.jl and all of its dependencies can be installed via the Julia REPL by typing ","category":"page"},{"location":"","page":"Home","title":"Home","text":"]add GeometricMachineLearning","category":"page"},{"location":"#Architectures","page":"Home","title":"Architectures","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"Pages = [\n \"architectures/sympnet.md\",\n]","category":"page"},{"location":"#Manifolds","page":"Home","title":"Manifolds","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"Pages = [\n \"manifolds/grassmann_manifold.md\",\n \"manifolds/stiefel_manifold.md\",\n]","category":"page"},{"location":"data_loader/snapshot_matrix/#Snapshot-matrix","page":"Snapshot matrix","title":"Snapshot matrix","text":"","category":"section"},{"location":"data_loader/snapshot_matrix/","page":"Snapshot matrix","title":"Snapshot matrix","text":"The snapshot matrix stores solutions of the high-dimensional ODE (obtained from discretizing a PDE). This is then used to construct reduced bases in a data-driven way. So (for a single parameter[1]) the snapshot matrix takes the following form: ","category":"page"},{"location":"data_loader/snapshot_matrix/","page":"Snapshot matrix","title":"Snapshot matrix","text":"[1]: If we deal with a parametrized PDE then there are two stages at which the snapshot matrix has to be processed: the offline stage and the online stage. ","category":"page"},{"location":"data_loader/snapshot_matrix/","page":"Snapshot matrix","title":"Snapshot matrix","text":"M = leftbeginarraycccc\nhatu_1(t_0) hatu_1(t_1) quadldotsquad hatu_1(t_f) \nhatu_2(t_0) hatu_2(t_1) ldots hatu_2(t_f) \nhatu_3(t_0) hatu_3(t_1) ldots hatu_3(t_f) \nldots ldots ldots ldots \nhatu_2N(t_0) hatu_2N(t_1) ldots hatu_2N(t_f) \nendarrayright","category":"page"}] } diff --git a/latest/tutorials/mnist_tutorial/index.html b/latest/tutorials/mnist_tutorial/index.html new file mode 100644 index 000000000..a762064ae --- /dev/null +++ b/latest/tutorials/mnist_tutorial/index.html @@ -0,0 +1,34 @@ + +MNIST · GeometricMachineLearning.jl
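To make the structure of M concrete (a small sketch with stand-in data, not tied to any particular solver), the snapshot matrix is simply the horizontal concatenation of the solution vectors at the stored time steps:

    twoN, nsnapshots = 100, 50
    snapshots = [randn(twoN) for _ in 1:nsnapshots]   # stand-ins for the ODE solutions û(t_k)

    M = hcat(snapshots...)    # column k of M is the snapshot at time t_(k-1); size(M) == (twoN, nsnapshots)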

      MNIST tutorial

      This is a short tutorial that shows how we can use GeometricMachineLearning to build a vision transformer and apply it to the MNIST dataset, while also putting some of the weights on a manifold.

      First, we need to import the relevant packages:

      using GeometricMachineLearning, CUDA
      +import Zygote, MLDatasets

      In this example we use Zygote as an AD routine and we get the dataset from MLDatasets. First we need to load the data set and put it on GPU (if you have one):

      train_x, train_y = MLDatasets.MNIST(split=:train)[:]
      +test_x, test_y = MLDatasets.MNIST(split=:test)[:]
      +train_x = train_x |> cu 
      +test_x = test_x |> cu 
      +train_y = train_y |> cu 
      +test_y = test_y |> cu

      GeometricMachineLearning has built-in data loaders that make it particularly easy to handle data:

      patch_length = 7
      +dl = DataLoader(train_x, train_y, batch_size=512, patch_length=patch_length)
      +dl_test = DataLoader(test_x, test_y, batch_size=length(test_y), patch_length=patch_length)

      The second line in the above code snippet indicates that we use the entire data set as one "batch" when processing the test set. For training, the batch size was here set to 512.
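      Note that the training script below assumes that the vision transformer Ψᵉ, the compute backend, the optimizer method o, the number of training steps n_training_steps and the loss and accuracy functions have already been defined earlier in the full tutorial, and that ProgressMeter is loaded (for Progress and ProgressMeter.next!).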

      ps = initialparameters(backend, eltype(dl.data), Ψᵉ) 
      +
      +optimizer_instance = Optimizer(o, ps)
      +
      +println("initial test accuracy: ", accuracy(Ψᵉ, ps, dl_test), "\n")
      +
      +progress_object = Progress(n_training_steps; enabled=true)
      +
      +loss_array = zeros(eltype(train_x), n_training_steps)
      +for i in 1:n_training_steps
      +    redraw_batch!(dl)
      +    # get rid of try catch statement. This softmax issue should be solvable!
      +    loss_val, pb = try Zygote.pullback(ps -> loss(Ψᵉ, ps, dl), ps)
      +    catch
      +        loss_array[i] = i > 1 ? loss_array[i-1] : zero(eltype(loss_array))  # guard the very first iteration
      +        continue 
      +    end
      +    dp = pb(one(loss_val))[1]
      +
      +    optimization_step!(optimizer_instance, Ψᵉ, ps, dp)
      +    ProgressMeter.next!(progress_object; showvalues = [(:TrainingLoss, loss_val)])   
      +    loss_array[i] = loss_val
      +end
      +
      +println("final test accuracy: ", accuracy(Ψᵉ, ps, dl_test), "\n")