diff --git a/latest/.documenter-siteinfo.json b/latest/.documenter-siteinfo.json index 954ddde5e..83eeb80eb 100644 --- a/latest/.documenter-siteinfo.json +++ b/latest/.documenter-siteinfo.json @@ -1 +1 @@ -{"documenter":{"julia_version":"1.10.2","generation_timestamp":"2024-04-15T15:41:40","documenter_version":"1.4.0"}} \ No newline at end of file +{"documenter":{"julia_version":"1.10.2","generation_timestamp":"2024-04-18T15:51:47","documenter_version":"1.4.0"}} \ No newline at end of file diff --git a/latest/GeometricMachineLearning.bib b/latest/GeometricMachineLearning.bib index f02a7fb06..c342eb5e1 100644 --- a/latest/GeometricMachineLearning.bib +++ b/latest/GeometricMachineLearning.bib @@ -264,4 +264,13 @@ @book{jacobs1992discrete publisher={Birkh{\"a}user Verlag}, address={Basel, Switzerland}, year={1992} +} + +@article{feng1998step, + title={The step-transition operators for multi-step methods of ODE's}, + author={Feng, Kang}, + journal={Journal of Computational Mathematics}, + pages={193--202}, + year={1998}, + publisher={JSTOR} } \ No newline at end of file diff --git a/latest/Optimizer/index.html b/latest/Optimizer/index.html index 97be2d39c..39f247c86 100644 --- a/latest/Optimizer/index.html +++ b/latest/Optimizer/index.html @@ -1,2 +1,2 @@ -Optimizers · GeometricMachineLearning.jl

Optimizer

In order to generalize neural network optimizers to homogeneous spaces, a class of manifolds we often encounter in machine learning, we have to find a global tangent space representation which we call $\mathfrak{g}^\mathrm{hor}$ here.

Starting from an element of the tangent space $T_Y\mathcal{M}$[1], we need to perform two mappings to arrive at $\mathfrak{g}^\mathrm{hor}$, which we refer to by $\Omega$ and a red horizontal arrow:

Here the mapping $\Omega$ is a horizontal lift from the tangent space onto the horizontal component of the Lie algebra at $Y$.

The red line maps the horizontal component at $Y$, i.e. $\mathfrak{g}^{\mathrm{hor},Y}$, to the global tangent space representation $\mathfrak{g}^\mathrm{hor}$.

The $\mathrm{cache}$ stores information about previous optimization steps and is dependent on the optimizer. The elements of the $\mathrm{cache}$ are also in $\mathfrak{g}^\mathrm{hor}$. Based on this the optimizer (Adam in this case) computes a final velocity, which is the input of a retraction. Because this update is done for $\mathfrak{g}^{\mathrm{hor}}\equiv{}T_Y\mathcal{M}$, we still need to perform a mapping, called apply_section here, that then finally updates the network parameters. The two red lines are described in global sections.

References

[15]
B. Brantner. Generalizing Adam To Manifolds For Efficiently Training Transformers, arXiv preprint arXiv:2305.16901 (2023).
  • 1In practice this is obtained by first using an AD routine on a loss function $L$, and then computing the Riemannian gradient based on this. See the section of the Stiefel manifold for an example of this.
+Optimizers · GeometricMachineLearning.jl

Optimizer

In order to generalize neural network optimizers to homogeneous spaces, a class of manifolds we often encounter in machine learning, we have to find a global tangent space representation which we call $\mathfrak{g}^\mathrm{hor}$ here.

Starting from an element of the tangent space $T_Y\mathcal{M}$[1], we need to perform two mappings to arrive at $\mathfrak{g}^\mathrm{hor}$, which we refer to by $\Omega$ and a red horizontal arrow:

Here the mapping $\Omega$ is a horizontal lift from the tangent space onto the horizontal component of the Lie algebra at $Y$.

The red line maps the horizontal component at $Y$, i.e. $\mathfrak{g}^{\mathrm{hor},Y}$, to the global tangent space representation $\mathfrak{g}^\mathrm{hor}$.

The $\mathrm{cache}$ stores information about previous optimization steps and is dependent on the optimizer. The elements of the $\mathrm{cache}$ are also in $\mathfrak{g}^\mathrm{hor}$. Based on this the optimizer (Adam in this case) computes a final velocity, which is the input of a retraction. Because this update is done for $\mathfrak{g}^{\mathrm{hor}}\equiv{}T_Y\mathcal{M}$, we still need to perform a mapping, called apply_section here, that then finally updates the network parameters. The two red lines are described in global sections.

References

[24]
B. Brantner. Generalizing Adam To Manifolds For Efficiently Training Transformers, arXiv preprint arXiv:2305.16901 (2023).
  • 1In practice this is obtained by first using an AD routine on a loss function $L$, and then computing the Riemannian gradient based on this. See the section of the Stiefel manifold for an example of this.
diff --git a/latest/architectures/autoencoders/index.html b/latest/architectures/autoencoders/index.html index c97bfe899..a42b9ed02 100644 --- a/latest/architectures/autoencoders/index.html +++ b/latest/architectures/autoencoders/index.html @@ -1,3 +1,3 @@ -Variational Autoencoders · GeometricMachineLearning.jl

Variational Autoencoders

Variational autoencoders (Lee and Carlberg, 2020) train on the following set:

\[\mathcal{X}(\mathbb{P}_\mathrm{train}) := \{\mathbf{x}^k(\mu) - \mathbf{x}^0(\mu):0\leq{}k\leq{}K,\mu\in\mathbb{P}_\mathrm{train}\},\]

where $\mathbf{x}^k(\mu)\approx\mathbf{x}(t^k;\mu)$. Note that $\mathbf{0}\in\mathcal{X}(\mathbb{P}_\mathrm{train})$ as $k$ can also be zero.

The encoder $\Psi^\mathrm{enc}$ and decoder $\Psi^\mathrm{dec}$ are then trained on this set $\mathcal{X}(\mathbb{P}_\mathrm{train})$ by minimizing the reconstruction error:

\[|| \mathbf{x} - \Psi^\mathrm{dec}\circ\Psi^\mathrm{enc}(\mathbf{x}) ||\text{ for $\mathbf{x}\in\mathcal{X}(\mathbb{P}_\mathrm{train})$}.\]

Initial condition

No matter the parameter $\mu$ the initial condition in the reduced system is always $\mathbf{x}_{r,0}(\mu) = \mathbf{x}_{r,0} = \Psi^\mathrm{enc}(\mathbf{0})$.

Reconstructed solution

In order to arrive at the reconstructed solution one first has to decode the reduced state and then add the reference state:

\[\mathbf{x}^\mathrm{reconstr}(t;\mu) = \mathbf{x}^\mathrm{ref}(\mu) + \Psi^\mathrm{dec}(\mathbf{x}_r(t;\mu)),\]

where $\mathbf{x}^\mathrm{ref}(\mu) = \mathbf{x}(t_0;\mu) - \Psi^\mathrm{dec}\circ\Psi^\mathrm{enc}(\mathbf{0})$.

Symplectic reduced vector field

A symplectic vector field is one whose flow conserves the symplectic structure $\mathbb{J}$. This is equivalent[1] to there existing a Hamiltonian $H$ s.t. the vector field $X$ can be written as $X = \mathbb{J}\nabla{}H$.

If the full-order Hamiltonian is $H^\mathrm{full}\equiv{}H$ we can obtain another Hamiltonian on the reduced space by simply setting:

\[H^\mathrm{red}(\mathbf{x}_r(t;\mu)) = H(\mathbf{x}^\mathrm{reconstr}(t;\mu)) = H(\mathbf{x}^\mathrm{ref}(\mu) + \Psi^\mathrm{dec}(\mathbf{x}_r(t;\mu))).\]

The ODE associated to this Hamiltonian is also the one corresponding to Manifold Galerkin ROM (see (Lee and Carlberg, 2020)).

Manifold Galerkin ROM

Define the FOM ODE residual as:

\[r: (\mathbf{v}, \xi, \tau; \mu) \mapsto \mathbf{v} - f(\xi, \tau; \mu).\]

The reduced ODE is then defined to be:

\[\dot{\hat{\mathbf{x}}}(t;\mu) = \mathrm{arg\,{}min}_{\hat{\mathbf{v}}\in\mathbb{R}^p}|| r(\mathcal{J}(\hat{\mathbf{x}}(t;\mu))\hat{\mathbf{v}},\hat{\mathbf{x}}^\mathrm{ref}(\mu) + \Psi^\mathrm{dec}(\hat{\mathbf{x}}(t;\mu)),t;\mu) ||_2^2,\]

where $\mathcal{J}$ is the Jacobian of the decoder $\Psi^\mathrm{dec}$. This leads to:

\[\mathcal{J}(\hat{\mathbf{x}}(t;\mu))\hat{\mathbf{v}} - f(\hat{\mathbf{x}}^\mathrm{ref}(\mu) + \Psi^\mathrm{dec}(\hat{\mathbf{x}}(t;\mu)), t; \mu) \overset{!}{=} 0 \implies -\hat{\mathbf{v}} = \mathcal{J}(\hat{\mathbf{x}}(t;\mu))^+f(\hat{\mathbf{x}}^\mathrm{ref}(\mu) + \Psi^\mathrm{dec}(\hat{\mathbf{x}}(t;\mu)), t; \mu),\]

where $\mathcal{J}(\hat{\mathbf{x}}(t;\mu))^+$ is the pseudoinverse of $\mathcal{J}(\hat{\mathbf{x}}(t;\mu))$. Because $\mathcal{J}(\hat{\mathbf{x}}(t;\mu))$ is a symplectic matrix the pseudoinverse is the symplectic inverse (see (Peng and Mohseni, 2016)).

Furthermore, because $f$ is Hamiltonian, the vector field describing $\dot{\hat{\mathbf{x}}}(t;\mu)$ will also be Hamiltonian.

References

  • K. Lee and K. Carlberg. “Model reduction of dynamical systems on nonlinear manifolds using deep convolutional autoencoders”. In: Journal of Computational Physics 404 (2020), p. 108973.
  • L. Peng and K. Mohseni. “Symplectic model reduction of Hamiltonian systems”. In: SIAM Journal on Scientific Computing 38.1 (2016), A1–A27.
  • 1Technically speaking the definitions are equivalent only for simply-connected manifolds, so also for vector spaces.
+Variational Autoencoders · GeometricMachineLearning.jl

Variational Autoencoders

Variational autoencoders (Lee and Carlberg, 2020) train on the following set:

\[\mathcal{X}(\mathbb{P}_\mathrm{train}) := \{\mathbf{x}^k(\mu) - \mathbf{x}^0(\mu):0\leq{}k\leq{}K,\mu\in\mathbb{P}_\mathrm{train}\},\]

where $\mathbf{x}^k(\mu)\approx\mathbf{x}(t^k;\mu)$. Note that $\mathbf{0}\in\mathcal{X}(\mathbb{P}_\mathrm{train})$ as $k$ can also be zero.

The encoder $\Psi^\mathrm{enc}$ and decoder $\Psi^\mathrm{dec}$ are then trained on this set $\mathcal{X}(\mathbb{P}_\mathrm{train})$ by minimizing the reconstruction error:

\[|| \mathbf{x} - \Psi^\mathrm{dec}\circ\Psi^\mathrm{enc}(\mathbf{x}) ||\text{ for $\mathbf{x}\in\mathcal{X}(\mathbb{P}_\mathrm{train})$}.\]

Initial condition

No matter the parameter $\mu$ the initial condition in the reduced system is always $\mathbf{x}_{r,0}(\mu) = \mathbf{x}_{r,0} = \Psi^\mathrm{enc}(\mathbf{0})$.

Reconstructed solution

In order to arrive at the reconstructed solution one first has to decode the reduced state and then add the reference state:

\[\mathbf{x}^\mathrm{reconstr}(t;\mu) = \mathbf{x}^\mathrm{ref}(\mu) + \Psi^\mathrm{dec}(\mathbf{x}_r(t;\mu)),\]

where $\mathbf{x}^\mathrm{ref}(\mu) = \mathbf{x}(t_0;\mu) - \Psi^\mathrm{dec}\circ\Psi^\mathrm{enc}(\mathbf{0})$.
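
As a consistency check, one can evaluate the reconstruction at $t_0$. This is a sketch that assumes the reduced initial condition $\mathbf{x}_r(t_0;\mu) = \Psi^\mathrm{enc}(\mathbf{0})$ from the previous section and a reference state that subtracts the decoded reduced initial condition $\Psi^\mathrm{dec}\circ\Psi^\mathrm{enc}(\mathbf{0})$:

\[\mathbf{x}^\mathrm{reconstr}(t_0;\mu) = \mathbf{x}^\mathrm{ref}(\mu) + \Psi^\mathrm{dec}(\Psi^\mathrm{enc}(\mathbf{0})) = \mathbf{x}(t_0;\mu) - \Psi^\mathrm{dec}\circ\Psi^\mathrm{enc}(\mathbf{0}) + \Psi^\mathrm{dec}\circ\Psi^\mathrm{enc}(\mathbf{0}) = \mathbf{x}(t_0;\mu).\]

Under these assumptions the reconstruction reproduces the full-order initial condition exactly, even if the autoencoder does not map $\mathbf{0}$ to $\mathbf{0}$.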

Symplectic reduced vector field

A symplectic vector field is one whose flow conserves the symplectic structure $\mathbb{J}$. This is equivalent[1] to there existing a Hamiltonian $H$ s.t. the vector field $X$ can be written as $X = \mathbb{J}\nabla{}H$.

If the full-order Hamiltonian is $H^\mathrm{full}\equiv{}H$ we can obtain another Hamiltonian on the reduced space by simply setting:

\[H^\mathrm{red}(\mathbf{x}_r(t;\mu)) = H(\mathbf{x}^\mathrm{reconstr}(t;\mu)) = H(\mathbf{x}^\mathrm{ref}(\mu) + \Psi^\mathrm{dec}(\mathbf{x}_r(t;\mu))).\]

The ODE associated to this Hamiltonian is also the one corresponding to Manifold Galerkin ROM (see (Lee and Carlberg, 2020)).

Manifold Galerkin ROM

Define the FOM ODE residual as:

\[r: (\mathbf{v}, \xi, \tau; \mu) \mapsto \mathbf{v} - f(\xi, \tau; \mu).\]

The reduced ODE is then defined to be:

\[\dot{\hat{\mathbf{x}}}(t;\mu) = \mathrm{arg\,{}min}_{\hat{\mathbf{v}}\in\mathbb{R}^p}|| r(\mathcal{J}(\hat{\mathbf{x}}(t;\mu))\hat{\mathbf{v}},\hat{\mathbf{x}}^\mathrm{ref}(\mu) + \Psi^\mathrm{dec}(\hat{\mathbf{x}}(t;\mu)),t;\mu) ||_2^2,\]

where $\mathcal{J}$ is the Jacobian of the decoder $\Psi^\mathrm{dec}$. This leads to:

\[\mathcal{J}(\hat{\mathbf{x}}(t;\mu))\hat{\mathbf{v}} - f(\hat{\mathbf{x}}^\mathrm{ref}(\mu) + \Psi^\mathrm{dec}(\hat{\mathbf{x}}(t;\mu)), t; \mu) \overset{!}{=} 0 \implies +\hat{\mathbf{v}} = \mathcal{J}(\hat{\mathbf{x}}(t;\mu))^+f(\hat{\mathbf{x}}^\mathrm{ref}(\mu) + \Psi^\mathrm{dec}(\hat{\mathbf{x}}(t;\mu)), t; \mu),\]

where $\mathcal{J}(\hat{\mathbf{x}}(t;\mu))^+$ is the pseudoinverse of $\mathcal{J}(\hat{\mathbf{x}}(t;\mu))$. Because $\mathcal{J}(\hat{\mathbf{x}}(t;\mu))$ is a symplectic matrix the pseudoinverse is the symplectic inverse (see (Peng and Mohseni, 2016)).
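
The symplectic inverse can be illustrated numerically. The following is a minimal sketch in plain Julia, not the implementation in GeometricMachineLearning.jl: the helper names `poisson_matrix` and `symplectic_inverse` are hypothetical, the formula $A^+ = \mathbb{J}_{2n}^TA^T\mathbb{J}_{2N}$ is the one from (Peng and Mohseni, 2016), and a "cotangent lift" matrix stands in for the decoder Jacobian.

```julia
using LinearAlgebra

# Canonical symplectic (Poisson) matrix of size 2m × 2m: [0 I; -I 0].
function poisson_matrix(m)
    Id = Matrix{Float64}(I, m, m)
    O = zeros(m, m)
    return [O Id; -Id O]
end

# Symplectic inverse A⁺ = J₂ₙᵀ Aᵀ J₂N of a tall matrix A with 2N rows and 2n columns.
function symplectic_inverse(A)
    N, n = size(A, 1) ÷ 2, size(A, 2) ÷ 2
    return poisson_matrix(n)' * A' * poisson_matrix(N)
end

# Hypothetical stand-in for the decoder Jacobian: the cotangent lift [Φ 0; 0 Φ]
# with orthonormal columns Φ, which satisfies Aᵀ J₂N A = J₂ₙ by construction.
N, n = 5, 2
Φ = Matrix(qr(randn(N, n)).Q)[:, 1:n]   # N × n with ΦᵀΦ = I
A = [Φ zeros(N, n); zeros(N, n) Φ]      # 2N × 2n

is_symplectic = A' * poisson_matrix(N) * A ≈ poisson_matrix(n)
is_left_inverse = symplectic_inverse(A) * A ≈ Matrix{Float64}(I, 2n, 2n)
```

For this particular example, whose columns are orthonormal, the symplectic inverse also agrees with the Moore-Penrose pseudoinverse `pinv(A)`.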

Furthermore, because $f$ is Hamiltonian, the vector field describing $\dot{\hat{\mathbf{x}}}(t;\mu)$ will also be Hamiltonian.
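
This claim can be made explicit in a short derivation. It is a sketch assuming $f = \mathbb{J}_{2N}\nabla{}H$ and the symplectic inverse $\mathcal{J}^+ = \mathbb{J}_{2n}^T\mathcal{J}^T\mathbb{J}_{2N}$ as in (Peng and Mohseni, 2016); the subscripts on $\mathbb{J}$ are notation introduced here to distinguish the full space from the reduced space, and the arguments of $\mathcal{J}$ and $\nabla{}H$ are suppressed:

\[\hat{\mathbf{v}} = \mathcal{J}^+\mathbb{J}_{2N}\nabla{}H = \mathbb{J}_{2n}^T\mathcal{J}^T\mathbb{J}_{2N}\mathbb{J}_{2N}\nabla{}H = -\mathbb{J}_{2n}^T\mathcal{J}^T\nabla{}H = \mathbb{J}_{2n}\mathcal{J}^T\nabla{}H = \mathbb{J}_{2n}\nabla{}H^\mathrm{red}.\]

Here we used $\mathbb{J}_{2N}\mathbb{J}_{2N} = -\mathbb{I}$, $\mathbb{J}_{2n}^T = -\mathbb{J}_{2n}$ and the chain rule $\nabla{}H^\mathrm{red}(\hat{\mathbf{x}}) = \mathcal{J}(\hat{\mathbf{x}})^T\nabla{}H(\hat{\mathbf{x}}^\mathrm{ref}(\mu) + \Psi^\mathrm{dec}(\hat{\mathbf{x}}))$.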

References

  • K. Lee and K. Carlberg. “Model reduction of dynamical systems on nonlinear manifolds using deep convolutional autoencoders”. In: Journal of Computational Physics 404 (2020), p. 108973.
  • L. Peng and K. Mohseni. “Symplectic model reduction of Hamiltonian systems”. In: SIAM Journal on Scientific Computing 38.1 (2016), A1–A27.
  • 1Technically speaking the definitions are equivalent only for simply-connected manifolds, so also for vector spaces.
diff --git a/latest/architectures/sympnet/index.html b/latest/architectures/sympnet/index.html index 1de2d8e31..c2edfc2ef 100644 --- a/latest/architectures/sympnet/index.html +++ b/latest/architectures/sympnet/index.html @@ -1,5 +1,5 @@ -SympNet · GeometricMachineLearning.jl

SympNet

This document discusses the SympNet architecture and its implementation in GeometricMachineLearning.jl.

Quick overview of the theory of SympNets

Principle

SympNets (see [1] for the eponymous paper) are a type of neural network that can model the trajectory of a Hamiltonian system in phase space. Take $(q^T,p^T)^T=(q_1,\ldots,q_d,p_1,\ldots,p_d)^T\in \mathbb{R}^{2d}$ as the coordinates in phase space, where $q=(q_1, \ldots, q_d)^T\in \mathbb{R}^{d}$ is referred to as the position and $p=(p_1, \ldots, p_d)^T\in \mathbb{R}^{d}$ the momentum. Given a point $(q^T,p^T)^T$ in $\mathbb{R}^{2d}$ the SympNet aims to compute the next point $((q')^T,(p')^T)^T$ and thus predicts the trajectory while preserving the symplectic structure of the system. SympNets enforce symplecticity strongly, meaning that this property is hard-coded into the network architecture. The layers are reminiscent of traditional neural network feedforward layers, but have a strong restriction imposed on them in order to be symplectic.

SympNets can be viewed as a "symplectic integrator" (see [2] and [3]). Their goal is to predict, based on an initial condition $((q^{(0)})^T,(p^{(0)})^T)^T$, a sequence of points in phase space that fit the training data as well as possible:

\[\begin{pmatrix} q^{(0)} \\ p^{(0)} \end{pmatrix}, \cdots, \begin{pmatrix} \tilde{q}^{(1)} \\ \tilde{p}^{(1)} \end{pmatrix}, \cdots \begin{pmatrix} \tilde{q}^{(n)} \\ \tilde{p}^{(n)} \end{pmatrix}.\]

The tilde in the above equation indicates predicted data. The time step between predictions is not a parameter we can choose but is related to the temporal frequency of the training data. This means that if data is recorded in an interval of e.g. 0.1 seconds, then this will be the time step of our integrator.

There are two types of SympNet architectures: $LA$-SympNets and $G$-SympNets.

$LA$-SympNet

The first type of SympNets, $LA$-SympNets, are obtained from composing two types of layers: symplectic linear layers and symplectic activation layers. For a given integer $n$, a symplectic linear layer is defined by

\[\mathcal{L}^{n,q} +SympNet · GeometricMachineLearning.jl

SympNet

This document discusses the SympNet architecture and its implementation in GeometricMachineLearning.jl.

Quick overview of the theory of SympNets

Principle

SympNets (see [1] for the eponymous paper) are a type of neural network that can model the trajectory of a Hamiltonian system in phase space. Take $(q^T,p^T)^T=(q_1,\ldots,q_d,p_1,\ldots,p_d)^T\in \mathbb{R}^{2d}$ as the coordinates in phase space, where $q=(q_1, \ldots, q_d)^T\in \mathbb{R}^{d}$ is referred to as the position and $p=(p_1, \ldots, p_d)^T\in \mathbb{R}^{d}$ the momentum. Given a point $(q^T,p^T)^T$ in $\mathbb{R}^{2d}$ the SympNet aims to compute the next point $((q')^T,(p')^T)^T$ and thus predicts the trajectory while preserving the symplectic structure of the system. SympNets enforce symplecticity strongly, meaning that this property is hard-coded into the network architecture. The layers are reminiscent of traditional neural network feedforward layers, but have a strong restriction imposed on them in order to be symplectic.

SympNets can be viewed as a "symplectic integrator" (see [2] and [3]). Their goal is to predict, based on an initial condition $((q^{(0)})^T,(p^{(0)})^T)^T$, a sequence of points in phase space that fit the training data as well as possible:

\[\begin{pmatrix} q^{(0)} \\ p^{(0)} \end{pmatrix}, \cdots, \begin{pmatrix} \tilde{q}^{(1)} \\ \tilde{p}^{(1)} \end{pmatrix}, \cdots \begin{pmatrix} \tilde{q}^{(n)} \\ \tilde{p}^{(n)} \end{pmatrix}.\]

The tilde in the above equation indicates predicted data. The time step between predictions is not a parameter we can choose but is related to the temporal frequency of the training data. This means that if data is recorded in an interval of e.g. 0.1 seconds, then this will be the time step of our integrator.

There are two types of SympNet architectures: $LA$-SympNets and $G$-SympNets.

$LA$-SympNet

The first type of SympNets, $LA$-SympNets, are obtained from composing two types of layers: symplectic linear layers and symplectic activation layers. For a given integer $n$, a symplectic linear layer is defined by

\[\mathcal{L}^{n,q} \begin{pmatrix} q \\ p \\ @@ -81,4 +81,4 @@ \begin{pmatrix} q \\ K^T \mathrm{diag}(a)\sigma(Kq+b)+p - \end{pmatrix}.\]

The parameters of this layer are the scaling matrix $K\in\mathbb{R}^{m\times d}$, the bias $b\in\mathbb{R}^{m}$ and the scaling vector $a\in\mathbb{R}^{m}$. The name "gradient layer" has its origin in the fact that the expression $[K^T\mathrm{diag}(a)\sigma(Kq+b)]_i = \sum_jk_{ji}a_j\sigma(\sum_\ell{}k_{j\ell}q_\ell+b_j)$ is the gradient of a function $\sum_ja_j\tilde{\sigma}(\sum_\ell{}k_{j\ell}q_\ell+b_j)$, where $\tilde{\sigma}$ is the antiderivative of $\sigma$. The first dimension of $K$ we refer to as the upscaling dimension.

If we denote by $\mathcal{M}^G$ the set of gradient layers, a $G$-SympNet is a function of the form $\Psi=g_k \circ g_{k-1} \circ \cdots \circ g_0$ where $(g_i)_{0\leq i\leq k} \subset (\mathcal{M}^G)^k$. The index $k$ is again the number of hidden layers.

Further note here the different roles played by round and square brackets: the latter indicates a nonlinear operation as opposed to a regular vector or matrix.

Universal approximation theorems

In order to state the universal approximation theorem for both architectures we first need a few definitions:

Let $U$ be an open set of $\mathbb{R}^{2d}$, and let us denote by $\mathcal{SP}^r(U)$ the set of $C^r$ smooth symplectic maps on $U$. We now define a topology on $C^r(K, \mathbb{R}^n)$, the set of $C^r$-smooth maps from a compact set $K\subset\mathbb{R}^{n}$ to $\mathbb{R}^{n}$ through the norm

\[||f||_{C^r(K,\mathbb{R}^{n})} = \underset{|\alpha|\leq r}{\sum} \underset{1\leq i \leq n}{\max}\underset{x\in K}{\sup} |D^\alpha f_i(x)|,\]

where the differential operator $D^\alpha$ is defined by

\[D^\alpha f = \frac{\partial^{|\alpha|} f}{\partial x_1^{\alpha_1}\cdots\partial x_n^{\alpha_n}},\]

with $|\alpha| = \alpha_1 +\cdots+ \alpha_n$.

Definition $\sigma$ is $r$-finite if $\sigma\in C^r(\mathbb{R},\mathbb{R})$ and $\int |D^r\sigma(x)|dx <+\infty$.

Definition Let $m,n,r\in \mathbb{N}$ with $m,n>0$ be given, $U$ an open set of $\mathbb{R}^m$, and $I,J\subset C^r(U,\mathbb{R}^n)$. We say $J$ is $r$-uniformly dense on compacta in $I$ if $J \subset I$ and for any $f\in I$, $\epsilon>0$, and any compact $K\subset U$, there exists $g\in J$ such that $||f-g||_{C^r(K,\mathbb{R}^{n})} < \epsilon$.

We can now state the universal approximation theorems:

Theorem (Approximation theorem for LA-SympNet) For any positive integer $r>0$ and open set $U\subset \mathbb{R}^{2d}$, the set of $LA$-SympNets is $r$-uniformly dense on compacta in $\mathcal{SP}^r(U)$ if the activation function $\sigma$ is $r$-finite.

Theorem (Approximation theorem for G-SympNet) For any positive integer $r>0$ and open set $U\subset \mathbb{R}^{2d}$, the set of $G$-SympNets is $r$-uniformly dense on compacta in $\mathcal{SP}^r(U)$ if the activation function $\sigma$ is $r$-finite.

There are many $r$-finite activation functions commonly used in neural networks, for example:

  • sigmoid $\sigma(x)=\frac{1}{1+e^{-x}}$ for any positive integer $r$,
  • tanh $\tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}$ for any positive integer $r$.

The universal approximation theorems state that we can, in principle, get arbitrarily close to any symplectomorphism defined on $\mathbb{R}^{2d}$. But this does not tell us anything about how to optimize the network. This can be done with any common neural network optimizer, and these optimizers always rely on a corresponding loss function.

Loss function

To train the SympNet, one needs data along a trajectory so that the model learns to perform an integration step. These data are $(Q,P)$, where $Q[i,j]$ (respectively $P[i,j]$) is the real number $q_j(t_i)$ (respectively $p_j(t_i)$), i.e. the $j$-th coordinate of the generalized position (respectively momentum) at the $i$-th time step. One also needs a loss function, defined as:

\[Loss(Q,P) = \underset{i}{\sum} d(\Phi(Q[i,-],P[i,-]), [Q[i+1,-] P[i+1,-]]^T)\]

where $\Phi$ denotes the SympNet and $d$ is a distance on $\mathbb{R}^{2d}$.

See the tutorial section for an introduction into using SympNets with GeometricMachineLearning.jl.

References

[1]
P. Jin, Z. Zhang, A. Zhu, Y. Tang and G. E. Karniadakis. SympNets: Intrinsic structure-preserving symplectic networks for identifying Hamiltonian systems. Neural Networks 132, 166–179 (2020).
  • 1Note that if $k=1$ then the $LA$-SympNet consists of only one linear layer.
+ \end{pmatrix}.\]

The parameters of this layer are the scaling matrix $K\in\mathbb{R}^{m\times d}$, the bias $b\in\mathbb{R}^{m}$ and the scaling vector $a\in\mathbb{R}^{m}$. The name "gradient layer" has its origin in the fact that the expression $[K^T\mathrm{diag}(a)\sigma(Kq+b)]_i = \sum_jk_{ji}a_j\sigma(\sum_\ell{}k_{j\ell}q_\ell+b_j)$ is the gradient of a function $\sum_ja_j\tilde{\sigma}(\sum_\ell{}k_{j\ell}q_\ell+b_j)$, where $\tilde{\sigma}$ is the antiderivative of $\sigma$. The first dimension of $K$ we refer to as the upscaling dimension.
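
The symplecticity of such a gradient layer can be checked numerically. The following is a minimal sketch in plain Julia, not the layer type provided by GeometricMachineLearning.jl: the function names are hypothetical, $\sigma = \tanh$ is chosen for illustration, and the Jacobian is written down by hand as $\begin{pmatrix} \mathbb{I} & \mathbb{O} \\ S & \mathbb{I} \end{pmatrix}$ with the symmetric block $S = K^T\mathrm{diag}(a\odot\sigma'(Kq+b))K$.

```julia
using LinearAlgebra

# Gradient layer that changes p: (q, p) ↦ (q, Kᵀ diag(a) σ(Kq + b) + p), here with σ = tanh.
gradient_layer(q, p, K, a, b) = (q, K' * (a .* tanh.(K * q .+ b)) .+ p)

# Jacobian of the layer with respect to (q, p): [I 0; S I] with the symmetric block
# S = Kᵀ diag(a .* σ'(Kq + b)) K, using tanh'(x) = 1 - tanh(x)^2.
function gradient_layer_jacobian(q, K, a, b)
    d = length(q)
    dσ = 1 .- tanh.(K * q .+ b) .^ 2
    S = K' * Diagonal(a .* dσ) * K
    Id = Matrix{Float64}(I, d, d)
    return [Id zeros(d, d); S Id]
end

d, m = 2, 4                                   # m is the upscaling dimension
K, a, b = randn(m, d), randn(m), randn(m)
q, p = randn(d), randn(d)

q_new, p_new = gradient_layer(q, p, K, a, b)

# Canonical symplectic matrix on ℝ^{2d} and the symplecticity condition MᵀJM = J.
Id = Matrix{Float64}(I, d, d)
J = [zeros(d, d) Id; -Id zeros(d, d)]
M = gradient_layer_jacobian(q, K, a, b)
layer_is_symplectic = M' * J * M ≈ J
```

The condition holds for every choice of $K$, $a$ and $b$ because $S$ is symmetric, which is the sense in which symplecticity is hard-coded into the layer.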

If we denote by $\mathcal{M}^G$ the set of gradient layers, a $G$-SympNet is a function of the form $\Psi=g_k \circ g_{k-1} \circ \cdots \circ g_0$ where $(g_i)_{0\leq i\leq k} \subset (\mathcal{M}^G)^k$. The index $k$ is again the number of hidden layers.

Further note here the different roles played by round and square brackets: the latter indicates a nonlinear operation as opposed to a regular vector or matrix.

Universal approximation theorems

In order to state the universal approximation theorem for both architectures we first need a few definitions:

Let $U$ be an open set of $\mathbb{R}^{2d}$, and let us denote by $\mathcal{SP}^r(U)$ the set of $C^r$ smooth symplectic maps on $U$. We now define a topology on $C^r(K, \mathbb{R}^n)$, the set of $C^r$-smooth maps from a compact set $K\subset\mathbb{R}^{n}$ to $\mathbb{R}^{n}$ through the norm

\[||f||_{C^r(K,\mathbb{R}^{n})} = \underset{|\alpha|\leq r}{\sum} \underset{1\leq i \leq n}{\max}\underset{x\in K}{\sup} |D^\alpha f_i(x)|,\]

where the differential operator $D^\alpha$ is defined by

\[D^\alpha f = \frac{\partial^{|\alpha|} f}{\partial x_1^{\alpha_1}\cdots\partial x_n^{\alpha_n}},\]

with $|\alpha| = \alpha_1 +\cdots+ \alpha_n$.

Definition $\sigma$ is $r$-finite if $\sigma\in C^r(\mathbb{R},\mathbb{R})$ and $\int |D^r\sigma(x)|dx <+\infty$.

Definition Let $m,n,r\in \mathbb{N}$ with $m,n>0$ be given, $U$ an open set of $\mathbb{R}^m$, and $I,J\subset C^r(U,\mathbb{R}^n)$. We say $J$ is $r$-uniformly dense on compacta in $I$ if $J \subset I$ and for any $f\in I$, $\epsilon>0$, and any compact $K\subset U$, there exists $g\in J$ such that $||f-g||_{C^r(K,\mathbb{R}^{n})} < \epsilon$.

We can now state the universal approximation theorems:

Theorem (Approximation theorem for LA-SympNet) For any positive integer $r>0$ and open set $U\subset \mathbb{R}^{2d}$, the set of $LA$-SympNets is $r$-uniformly dense on compacta in $\mathcal{SP}^r(U)$ if the activation function $\sigma$ is $r$-finite.

Theorem (Approximation theorem for G-SympNet) For any positive integer $r>0$ and open set $U\subset \mathbb{R}^{2d}$, the set of $G$-SympNets is $r$-uniformly dense on compacta in $\mathcal{SP}^r(U)$ if the activation function $\sigma$ is $r$-finite.

There are many $r$-finite activation functions commonly used in neural networks, for example:

  • sigmoid $\sigma(x)=\frac{1}{1+e^{-x}}$ for any positive integer $r$,
  • tanh $\tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}$ for any positive integer $r$.
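
For $r=1$ the condition of the definition can be verified by hand for both examples, because both first derivatives are positive and the integral therefore reduces to the difference of the limits at $\pm\infty$:

\[\int_{-\infty}^{\infty}|\sigma'(x)|dx = \lim_{x\to\infty}\sigma(x) - \lim_{x\to-\infty}\sigma(x) = 1 - 0 = 1, \qquad \int_{-\infty}^{\infty}|\tanh'(x)|dx = 1 - (-1) = 2.\]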

The universal approximation theorems state that we can, in principle, get arbitrarily close to any symplectomorphism defined on $\mathbb{R}^{2d}$. But this does not tell us anything about how to optimize the network. This can be done with any common neural network optimizer, and these optimizers always rely on a corresponding loss function.

Loss function

To train the SympNet, one needs data along a trajectory so that the model learns to perform an integration step. These data are $(Q,P)$, where $Q[i,j]$ (respectively $P[i,j]$) is the real number $q_j(t_i)$ (respectively $p_j(t_i)$), i.e. the $j$-th coordinate of the generalized position (respectively momentum) at the $i$-th time step. One also needs a loss function, defined as:

\[Loss(Q,P) = \underset{i}{\sum} d(\Phi(Q[i,-],P[i,-]), [Q[i+1,-] P[i+1,-]]^T)\]

where $\Phi$ denotes the SympNet and $d$ is a distance on $\mathbb{R}^{2d}$.
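
A loss of this kind could be computed as in the following sketch. It rests on several assumptions that are not taken from the library: the prediction for time step $i$ is compared with the data at time step $i+1$, $d$ is the squared Euclidean distance, the network is any callable returning a tuple `(q, p)`, and the name `one_step_loss` as well as the toy data are hypothetical.

```julia
# One-step loss: sum over i of d(Φ(Q[i, :], P[i, :]), (Q[i+1, :], P[i+1, :])),
# with d chosen as the squared Euclidean distance (an assumption).
function one_step_loss(Φ, Q, P)
    T = size(Q, 1)
    return sum(1:T-1) do i
        q_pred, p_pred = Φ(Q[i, :], P[i, :])
        sum(abs2, q_pred .- Q[i+1, :]) + sum(abs2, p_pred .- P[i+1, :])
    end
end

# Hypothetical toy data with 4 time steps and d = 2: q grows by 1 per step, p shrinks by 1.
Q = [Float64(i + j) for i in 1:4, j in 1:2]
P = -Q

# A map that reproduces the toy data exactly, and the identity map for comparison.
shift(q, p) = (q .+ 1, p .- 1)
identity_map(q, p) = (q, p)

loss_exact = one_step_loss(shift, Q, P)
loss_identity = one_step_loss(identity_map, Q, P)
```

The exact map gives a vanishing loss, while the identity map is penalized at every step, which is why the target has to be taken from the following time step.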

See the tutorial section for an introduction into using SympNets with GeometricMachineLearning.jl.

References

[1]
P. Jin, Z. Zhang, A. Zhu, Y. Tang and G. E. Karniadakis. SympNets: Intrinsic structure-preserving symplectic networks for identifying Hamiltonian systems. Neural Networks 132, 166–179 (2020).
  • 1Note that if $k=1$ then the $LA$-SympNet consists of only one linear layer.
diff --git a/latest/arrays/grassmann_lie_alg_hor_matrix/index.html b/latest/arrays/grassmann_lie_alg_hor_matrix/index.html index 4c12893f5..85d991a41 100644 --- a/latest/arrays/grassmann_lie_alg_hor_matrix/index.html +++ b/latest/arrays/grassmann_lie_alg_hor_matrix/index.html @@ -1,5 +1,5 @@ -Grassmann Global Tangent Space · GeometricMachineLearning.jl

The horizontal component of the Lie algebra $\mathfrak{g}$ for the Grassmann manifold

Tangent space to the element $\mathcal{E}$

Consider the tangent space to the distinguished element $\mathcal{E}=\mathrm{span}(E)\in{}Gr(n,N)$, where $E$ is again:

\[E = \begin{bmatrix} +Grassmann Global Tangent Space · GeometricMachineLearning.jl

The horizontal component of the Lie algebra $\mathfrak{g}$ for the Grassmann manifold

Tangent space to the element $\mathcal{E}$

Consider the tangent space to the distinguished element $\mathcal{E}=\mathrm{span}(E)\in{}Gr(n,N)$, where $E$ is again:

\[E = \begin{bmatrix} \mathbb{I}_n \\ \mathbb{O} \end{bmatrix}.\]

The tangent space $T_\mathcal{E}Gr(n,N)$ can be represented through matrices:

\[\begin{pmatrix} @@ -9,4 +9,4 @@ a_{11} & \cdots & a_{1n} \\ \cdots & \cdots & \cdots \\ a_{(N-n)1} & \cdots & a_{(N-n)n} -\end{pmatrix},\]

where we have used the identification $T_\mathcal{E}Gr(n,N)\to{}T_E\mathcal{S}_E$ that was discussed in the section on the Grassmann manifold. The Grassmann manifold can also be seen as the Stiefel manifold modulo an equivalence class. This leads to the following (which is used for optimization):

\[\mathfrak{g}^\mathrm{hor} = \mathfrak{g}^{\mathrm{hor},\mathcal{E}} = \left\{\begin{pmatrix} 0 & -B^T \\ B & 0 \end{pmatrix}: \text{$B$ arbitrary}\right\}.\]

This is equivalent to the horizontal component of $\mathfrak{g}$ for the Stiefel manifold for the case when $A$ is zero. This is a reflection of the rotational invariance of the Grassmann manifold: the skew-symmetric matrices $A$ are connected to the group of rotations $O(n)$ which is factored out in the Grassmann manifold $Gr(n,N)\simeq{}St(n,N)/O(n)$.

+\end{pmatrix},\]

where we have used the identification $T_\mathcal{E}Gr(n,N)\to{}T_E\mathcal{S}_E$ that was discussed in the section on the Grassmann manifold. The Grassmann manifold can also be seen as the Stiefel manifold modulo an equivalence class. This leads to the following (which is used for optimization):

\[\mathfrak{g}^\mathrm{hor} = \mathfrak{g}^{\mathrm{hor},\mathcal{E}} = \left\{\begin{pmatrix} 0 & -B^T \\ B & 0 \end{pmatrix}: \text{$B$ arbitrary}\right\}.\]

This is equivalent to the horizontal component of $\mathfrak{g}$ for the Stiefel manifold for the case when $A$ is zero. This is a reflection of the rotational invariance of the Grassmann manifold: the skew-symmetric matrices $A$ are connected to the group of rotations $O(n)$ which is factored out in the Grassmann manifold $Gr(n,N)\simeq{}St(n,N)/O(n)$.
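
The connection between $\mathfrak{g}^\mathrm{hor}$ and the matrix representation of $T_\mathcal{E}Gr(n,N)$ can be seen by applying an element of $\mathfrak{g}^\mathrm{hor}$ to $E$. In this short worked step $B\in\mathbb{R}^{(N-n)\times{}n}$, a size implied by the block structure rather than stated above:

\[\begin{pmatrix} 0 & -B^T \\ B & 0 \end{pmatrix}E = \begin{pmatrix} 0 & -B^T \\ B & 0 \end{pmatrix}\begin{bmatrix} \mathbb{I}_n \\ \mathbb{O} \end{bmatrix} = \begin{bmatrix} \mathbb{O} \\ B \end{bmatrix}.\]

The result has a zero upper block and an arbitrary lower $(N-n)\times{}n$ block, which is the form of the tangent vectors given earlier.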

diff --git a/latest/arrays/skew_symmetric_matrix/index.html b/latest/arrays/skew_symmetric_matrix/index.html index 32a1f152f..ec2e9d90e 100644 --- a/latest/arrays/skew_symmetric_matrix/index.html +++ b/latest/arrays/skew_symmetric_matrix/index.html @@ -1,19 +1,19 @@ -Symmetric and Skew-Symmetric Matrices · GeometricMachineLearning.jl

SymmetricMatrix and SkewSymMatrix

There are special implementations of symmetric and skew-symmetric matrices in GeometricMachineLearning.jl. They are implemented to work on GPU and for multiplication with tensors. The following image demonstrates how the data necessary for an instance of SkewSymMatrix are stored[1]:


So what is stored internally is a vector of size $n(n-1)/2$ for the skew-symmetric matrix and a vector of size $n(n+1)/2$ for the symmetric matrix. We can sample a random skew-symmetric matrix:

using GeometricMachineLearning # hide
+Symmetric and Skew-Symmetric Matrices · GeometricMachineLearning.jl

SymmetricMatrix and SkewSymMatrix

There are special implementations of symmetric and skew-symmetric matrices in GeometricMachineLearning.jl. They are implemented to work on GPU and for multiplication with tensors. The following image demonstrates how the data necessary for an instance of SkewSymMatrix are stored[1]:


So what is stored internally is a vector of size $n(n-1)/2$ for the skew-symmetric matrix and a vector of size $n(n+1)/2$ for the symmetric matrix. We can sample a random skew-symmetric matrix:

using GeometricMachineLearning # hide
 
 A = rand(SkewSymMatrix, 5)
5×5 SkewSymMatrix{Float64, Vector{Float64}}:
- 0.0       -0.386481   -0.587248  -0.58316    -0.296752
- 0.386481   0.0        -0.75987   -0.0395198  -0.791269
- 0.587248   0.75987     0.0       -0.993098   -0.806084
- 0.58316    0.0395198   0.993098   0.0        -0.144389
- 0.296752   0.791269    0.806084   0.144389    0.0

and then access the vector:

A.S
10-element Vector{Float64}:
- 0.3864805493023272
- 0.5872480099511458
- 0.7598695437592404
- 0.5831596656088702
- 0.039519799779951015
- 0.9930982798738633
- 0.296751627407034
- 0.7912689001441062
- 0.8060839030961772
- 0.14438925368580213
  • 1It works similarly for SymmetricMatrix.
+ 0.0         -0.610054    -0.00398528  -0.622094  -0.380788
+ 0.610054     0.0         -0.992344    -0.180899  -0.00394746
+ 0.00398528   0.992344     0.0         -0.14235   -0.917795
+ 0.622094     0.180899     0.14235      0.0       -0.160516
+ 0.380788     0.00394746   0.917795     0.160516   0.0

and then access the vector:

A.S
10-element Vector{Float64}:
+ 0.6100544430751406
+ 0.003985283505292925
+ 0.9923437417075986
+ 0.6220939557264527
+ 0.18089911064993713
+ 0.14235002431430632
+ 0.38078806629037143
+ 0.003947461004538244
+ 0.917794697764087
+ 0.1605158243435728
  • 1It works similarly for SymmetricMatrix.
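
The relation between the stored vector and the matrix entries can be reproduced in plain Julia. This is a sketch under an assumption inferred from the printed output above, not from the library source: the strictly lower triangle is stored row by row, so the entry $(i,j)$ with $i>j$ sits at position $(i-1)(i-2)/2+j$ of the vector. The helper `dense_skew` is hypothetical.

```julia
# Entries of A.S copied from the newer output shown above.
S = [0.6100544430751406, 0.003985283505292925, 0.9923437417075986,
     0.6220939557264527, 0.18089911064993713, 0.14235002431430632,
     0.38078806629037143, 0.003947461004538244, 0.917794697764087,
     0.1605158243435728]

# Rebuild the dense n × n matrix from the vector of the strictly lower triangle
# (assumed row-wise ordering); the upper triangle follows from skew-symmetry.
function dense_skew(S, n)
    A = zeros(n, n)
    for i in 2:n, j in 1:i-1
        s = S[(i - 1) * (i - 2) ÷ 2 + j]
        A[i, j] = s
        A[j, i] = -s
    end
    return A
end

A_dense = dense_skew(S, 5)
```

With this ordering `A_dense[2, 1]` equals the first stored entry and `A_dense[5, 4]` the last one, which agrees with the printed matrix up to rounding.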
diff --git a/latest/arrays/stiefel_lie_alg_horizontal/index.html b/latest/arrays/stiefel_lie_alg_horizontal/index.html index be8a470ff..b69b905b5 100644 --- a/latest/arrays/stiefel_lie_alg_horizontal/index.html +++ b/latest/arrays/stiefel_lie_alg_horizontal/index.html @@ -1,8 +1,8 @@ -Stiefel Global Tangent Space · GeometricMachineLearning.jl

Horizontal component of the Lie algebra $\mathfrak{g}$

What we use to generalize Adam (and other algorithms) to manifolds is a global tangent space representation of the homogeneous spaces.

For the Stiefel manifold, this global tangent space representation takes a simple form:

\[\mathcal{B} = \begin{bmatrix} +Stiefel Global Tangent Space · GeometricMachineLearning.jl

Horizontal component of the Lie algebra $\mathfrak{g}$

What we use to generalize Adam (and other algorithms) to manifolds is a global tangent space representation of the homogeneous spaces.

For the Stiefel manifold, this global tangent space representation takes a simple form:

\[\mathcal{B} = \begin{bmatrix} A & -B^T \\ B & \mathbb{O} \end{bmatrix},\]

where $A\in\mathbb{R}^{n\times{}n}$ is skew-symmetric and $B\in\mathbb{R}^{(N-n)\times{}n}$ is arbitrary. In GeometricMachineLearning the struct StiefelLieAlgHorMatrix implements elements of this form.
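The block structure can be illustrated with plain dense matrices. The following is only a sketch using Julia's standard library and not the StiefelLieAlgHorMatrix struct itself; the variable names are hypothetical, and the lower-left block is taken to have $N - n$ rows so that the assembled matrix is $N\times{}N$.

```julia
using LinearAlgebra

N, n = 10, 5
m = N - n

A = randn(n, n)
A = (A - A') / 2        # skew-symmetric n×n block
B = randn(m, n)         # arbitrary lower-left block
Z = zeros(m, m)         # zero block in the lower-right corner

# Assemble the full N×N matrix [A -Bᵀ; B 0] (dense, for illustration only).
H = [A (-B'); B Z]

println(size(H))
println(norm(H + H'))   # vanishes because H is skew-symmetric
```

Storing only $A$ and $B$ instead of the dense matrix is what makes such a representation cheap; the dense form here serves only to check the structure.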

Theoretical background

Vertical and horizontal components

The Stiefel manifold $St(n, N)$ is a homogeneous space obtained from $SO(N)$ by declaring two matrices equivalent if their first $n$ columns coincide. Another way of expressing this is:

\[A_1 \sim A_2 \iff A_1E = A_2E\]

for

\[E = \begin{bmatrix} \mathbb{I} \\ \mathbb{O}\end{bmatrix}.\]

Because $St(n,N)$ is a homogeneous space, $SO(N)$ acts transitively on it: starting from any element $Y\in{}St(n,N)$, the action of $SO(N)$ can produce any other element of $St(n,N)$. A similar statement is also true regarding the tangent spaces of $St(n,N)$, namely:

\[T_YSt(n,N) = \mathfrak{g}\cdot{}Y,\]

i.e. every tangent space can be expressed through an action of the associated Lie algebra.

The kernel of the mapping $\mathfrak{g}\to{}T_YSt(n,N), B\mapsto{}BY$ is referred to as $\mathfrak{g}^{\mathrm{ver},Y}$, the vertical component of the Lie algebra at $Y$. In the case $Y=E$ it is easy to see that elements belonging to $\mathfrak{g}^{\mathrm{ver},E}$ are of the following form:

\[\begin{bmatrix} \hat{\mathbb{O}} & \tilde{\mathbb{O}}^T \\ \tilde{\mathbb{O}} & C -\end{bmatrix},\]

where $\hat{\mathbb{O}}\in\mathbb{R}^{n\times{}n}$ is a "small" zero matrix and $\tilde{\mathbb{O}}\in\mathbb{R}^{(N-n)\times{}n}$ is a bigger one. $C\in\mathbb{R}^{(N-n)\times{}(N-n)}$ is a skew-symmetric matrix.

The orthogonal complement of the vertical component is referred to as the horizontal component and denoted by $\mathfrak{g}^{\mathrm{hor}, Y}$. It is isomorphic to $T_YSt(n,N)$ and this isomorphism can be found explicitly. In the case of the Stiefel manifold:

\[\Omega(Y, \cdot):T_YSt(n,N)\to\mathfrak{g}^{\mathrm{hor},Y},\, \Delta \mapsto (\mathbb{I} - \frac{1}{2}YY^T)\Delta{}Y^T - Y\Delta^T(\mathbb{I} - \frac{1}{2}YY^T)\]

For the special case $Y=E$ we write $\mathfrak{g}^{\mathrm{hor},E}=:\mathfrak{g}^\mathrm{hor}$. Its elements are of the form described at the top of this page.

Special functions

You can also draw random elements from $\mathfrak{g}^\mathrm{hor}$ through e.g.

rand(CUDADevice(), StiefelLieAlgHorMatrix{Float32}, 10, 5)

In this example: $N=10$ and $n=5$.

+\end{bmatrix},\]

where $\hat{\mathbb{O}}\in\mathbb{R}^{n\times{}n}$ is a "small" zero matrix and $\tilde{\mathbb{O}}\in\mathbb{R}^{(N-n)\times{}n}$ is a bigger one. $C\in\mathbb{R}^{(N-n)\times{}(N-n)}$ is a skew-symmetric matrix.

The orthogonal complement of the vertical component is referred to as the horizontal component and denoted by $\mathfrak{g}^{\mathrm{hor}, Y}$. It is isomorphic to $T_YSt(n,N)$ and this isomorphism can be found explicitly. In the case of the Stiefel manifold:

\[\Omega(Y, \cdot):T_YSt(n,N)\to\mathfrak{g}^{\mathrm{hor},Y},\, \Delta \mapsto (\mathbb{I} - \frac{1}{2}YY^T)\Delta{}Y^T - Y\Delta^T(\mathbb{I} - \frac{1}{2}YY^T)\]
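The lift can be checked numerically with plain dense matrices. The following is only an illustrative sketch using Julia's standard library; the function name Ω and the way $Y$ and $\Delta$ are generated are assumptions made here and are not the package's implementation.

```julia
using LinearAlgebra, Random

Random.seed!(1)
N, n = 10, 5

# A point on St(n, N): the first n columns of an element of SO(N),
# obtained here as the matrix exponential of a skew-symmetric matrix.
S = 0.1 .* randn(N, N)
Y = exp(S - S')[:, 1:n]

# A tangent vector at Y: project an arbitrary matrix so that YᵀΔ is skew-symmetric.
V = randn(N, n)
Δ = V - Y * (Y' * V + V' * Y) / 2

# The map Ω(Y, ·) from the formula above.
Ω(Y, Δ) = (I - Y * Y' / 2) * Δ * Y' - Y * Δ' * (I - Y * Y' / 2)

G = Ω(Y, Δ)
println(norm(G + G'))     # G is skew-symmetric, i.e. an element of the Lie algebra
println(norm(G * Y - Δ))  # applying G to Y recovers the tangent vector Δ
```

Both printed norms should be close to zero, which confirms that $\Omega(Y, \Delta)$ lies in $\mathfrak{g}$ and satisfies $\Omega(Y, \Delta)Y = \Delta$ for this example.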

For the special case $Y=E$ we write $\mathfrak{g}^{\mathrm{hor},E}=:\mathfrak{g}^\mathrm{hor}$. Its elements are of the form described at the top of this page.

Special functions

You can also draw random elements from $\mathfrak{g}^\mathrm{hor}$ through e.g.

rand(CUDADevice(), StiefelLieAlgHorMatrix{Float32}, 10, 5)

In this example: $N=10$ and $n=5$.

diff --git a/latest/data_loader/TODO/index.html b/latest/data_loader/TODO/index.html index 92db5f5b1..028cacee3 100644 --- a/latest/data_loader/TODO/index.html +++ b/latest/data_loader/TODO/index.html @@ -1,2 +1,2 @@ -DATA Loader TODO · GeometricMachineLearning.jl

DATA Loader TODO

  • [x] Implement @views instead of allocating a new array in every step.
  • [x] Implement sampling without replacement.
  • [x] Store information on the epoch and the current loss.
  • [x] Usually the training loss is computed over the entire data set, we are probably going to do this for one epoch via

\[loss_e = \frac{1}{|batches|}\sum_{batch\in{}batches}loss(batch).\]

Point 4 makes sense because the output of an AD routine is the value of the loss function as well as the pullback.

+DATA Loader TODO · GeometricMachineLearning.jl

DATA Loader TODO

  • [x] Implement @views instead of allocating a new array in every step.
  • [x] Implement sampling without replacement.
  • [x] Store information on the epoch and the current loss.
  • [x] Usually the training loss is computed over the entire data set, we are probably going to do this for one epoch via

\[loss_e = \frac{1}{|batches|}\sum_{batch\in{}batches}loss(batch).\]

Point 4 makes sense because the output of an AD routine is the value of the loss function as well as the pullback.

diff --git a/latest/data_loader/data_loader/index.html b/latest/data_loader/data_loader/index.html index d6cfa8cf1..b7e37c040 100644 --- a/latest/data_loader/data_loader/index.html +++ b/latest/data_loader/data_loader/index.html @@ -1,28 +1,28 @@ -Routines · GeometricMachineLearning.jl

Data Loader

Data Loader is a struct that creates an instance based on a tensor (or different input format) and is designed to make training convenient.

Constructor

The data loader can be called with various inputs:

  • A single vector: If the data loader is called with a single vector (and no other arguments are given), then this is interpreted as an autoencoder problem, i.e. the second axis indicates parameter values and/or time steps and the system has a single degree of freedom (i.e. the system dimension is one).
  • A single matrix: If the data loader is called with a single matrix (and no other arguments are given), then this is interpreted as an autoencoder problem, i.e. the first axis is assumed to indicate the degrees of freedom of the system and the second axis indicates parameter values and/or time steps.
  • A single tensor: If the data loader is called with a single tensor, then this is interpreted as an integration problem with the second axis indicating the time step and the third one indicating the parameters.
  • A tensor and a vector: This is a special case (MNIST classification problem). For the MNIST problem, for example, the inputs are $n_p$ matrices (first input argument) and $n_p$ integers (second input argument).
  • A NamedTuple with fields q and p: The NamedTuple contains (i) two matrices or (ii) two tensors.
  • An EnsembleSolution: The EnsembleSolution typically comes from GeometricProblems.

When we supply a single vector or a single matrix as input to DataLoader and further set autoencoder = false (keyword argument), then the data are stored as an integration problem and the second axis is assumed to indicate time steps.

The data loader can be called with various types of arrays as input, for example a snapshot matrix:

SnapshotMatrix = rand(Float32, 10, 100)
+Routines · GeometricMachineLearning.jl

Data Loader

Data Loader is a struct that creates an instance based on a tensor (or different input format) and is designed to make training convenient.

Constructor

The data loader can be called with various inputs:

  • A single vector: If the data loader is called with a single vector (and no other arguments are given), then this is interpreted as an autoencoder problem, i.e. the second axis indicates parameter values and/or time steps and the system has a single degree of freedom (i.e. the system dimension is one).
  • A single matrix: If the data loader is called with a single matrix (and no other arguments are given), then this is interpreted as an autoencoder problem, i.e. the first axis is assumed to indicate the degrees of freedom of the system and the second axis indicates parameter values and/or time steps.
  • A single tensor: If the data loader is called with a single tensor, then this is interpreted as an integration problem with the second axis indicating the time step and the third one indicating the parameters.
  • A tensor and a vector: This is a special case (MNIST classification problem). For the MNIST problem, for example, the inputs are $n_p$ matrices (first input argument) and $n_p$ integers (second input argument).
  • A NamedTuple with fields q and p: The NamedTuple contains (i) two matrices or (ii) two tensors.
  • An EnsembleSolution: The EnsembleSolution typically comes from GeometricProblems.

When we supply a single vector or a single matrix as input to DataLoader and further set autoencoder = false (keyword argument), then the data are stored as an integration problem and the second axis is assumed to indicate time steps.

The data loader can be called with various types of arrays as input, for example a snapshot matrix:

SnapshotMatrix = rand(Float32, 10, 100)
 
-dl = DataLoader(SnapshotMatrix)
DataLoader{Float32, Array{Float32, 3}, Nothing, :RegularData}(Float32[0.24799973; 0.3912878; … ; 0.64255303; 0.47915155;;; 0.24978966; 0.12909335; … ; 0.032393575; 0.18256313;;; 0.67669857; 0.53128225; … ; 0.6543872; 0.7989751;;; … ;;; 0.3812762; 0.29651302; … ; 0.027504563; 0.8203194;;; 0.090964735; 0.25212443; … ; 0.197765; 0.5558512;;; 0.75686467; 0.3287887; … ; 0.27345914; 0.38904423], nothing, 10, 1, 100, nothing, nothing)

or a snapshot tensor:

SnapshotTensor = rand(Float32, 10, 100, 5)
+dl = DataLoader(SnapshotMatrix)
DataLoader{Float32, Array{Float32, 3}, Nothing, :RegularData}(Float32[0.21329117; 0.5349811; … ; 0.060578704; 0.9527543;;; 0.18569678; 0.24320775; … ; 0.04518521; 0.22125131;;; 0.97603726; 0.54176146; … ; 0.5287888; 0.53149116;;; … ;;; 0.75115865; 0.2662626; … ; 0.41072345; 0.37011862;;; 0.83626455; 0.74022704; … ; 0.5978349; 0.5407477;;; 0.6977506; 0.09609294; … ; 0.34777403; 0.29850447], nothing, 10, 1, 100, nothing, nothing)

or a snapshot tensor:

SnapshotTensor = rand(Float32, 10, 100, 5)
 
-dl = DataLoader(SnapshotTensor)
DataLoader{Float32, Array{Float32, 3}, Nothing, :TimeSeries}(Float32[0.13292462 0.8675269 … 0.7886783 0.6382356; 0.22432995 0.43434638 … 0.82763267 0.41215235; … ; 0.29805195 0.04973328 … 0.23735166 0.6277334; 0.31186426 0.59075046 … 0.43199152 0.29078525;;; 0.614533 0.37017387 … 0.71245986 0.96971285; 0.98493093 0.031980693 … 0.2118206 0.86453134; … ; 0.2347545 0.7969141 … 0.030377686 0.53750414; 0.0451746 0.1492858 … 0.8383721 0.9667157;;; 0.07496083 0.7030878 … 0.41185933 0.8934395; 0.4459529 0.523878 … 0.27900404 0.4171574; … ; 0.9874781 0.93942124 … 0.611218 0.2650187; 0.24236047 0.13514596 … 0.4100657 0.44404012;;; 0.18803877 0.4530118 … 0.31370008 0.19260997; 0.5630235 0.97113204 … 0.31883442 0.536943; … ; 0.35017914 0.95842427 … 0.5171707 0.5879185; 0.44269866 0.81681085 … 0.78467906 0.037288547;;; 0.85101664 0.19772315 … 0.19121706 0.10227281; 0.40648496 0.049670577 … 0.88277376 0.54447365; … ; 0.9227476 0.87870425 … 0.14486456 0.39141983; 0.8475448 0.2717154 … 0.36061782 0.061712027], nothing, 10, 100, 5, nothing, nothing)

Here the DataLoader has different properties :RegularData and :TimeSeries. This indicates that in the first case we treat all columns in the input tensor independently (this is mostly used for autoencoder problems), whereas in the second case we have time series-like data, which are mostly used for integration problems. We can also treat a problem with a matrix as input as a time series-like problem by providing an additional keyword argument: autoencoder=false:

SnapshotMatrix = rand(Float32, 10, 100)
+dl = DataLoader(SnapshotTensor)
DataLoader{Float32, Array{Float32, 3}, Nothing, :TimeSeries}(Float32[0.6299093 0.5080978 … 0.5730255 0.16093194; 0.85916823 0.6776185 … 0.9058492 0.041073978; … ; 0.2637154 0.20470506 … 0.21398133 0.39398128; 0.3766297 0.6702371 … 0.14960146 0.7540684;;; 0.27853662 0.31455904 … 0.7930825 0.36608016; 0.24653143 0.059753776 … 0.9465062 0.44592172; … ; 0.998021 0.42324793 … 0.2518189 0.96451; 0.20836818 0.8083292 … 0.34346628 0.67479604;;; 0.4044118 0.6446118 … 0.98114467 0.8183864; 0.7549607 0.59253055 … 0.817721 0.98026544; … ; 0.7290771 0.542317 … 0.7217573 0.6233855; 0.40719545 0.43148184 … 0.26494503 0.7616534;;; 0.039343 0.5738849 … 0.79412687 0.22947001; 0.5207316 0.07618922 … 0.3597306 0.4990015; … ; 0.9814506 0.46019953 … 0.9067676 0.4165418; 0.7366005 0.33311248 … 0.52213496 0.6476879;;; 0.18017292 0.2616762 … 0.22677249 0.88968194; 0.24929476 0.96156615 … 0.6026798 0.40530246; … ; 0.85775167 0.5682184 … 0.51712865 0.32396287; 0.38374996 0.6410177 … 0.8430548 0.45784992], nothing, 10, 100, 5, nothing, nothing)

Here the DataLoader has different properties :RegularData and :TimeSeries. This indicates that in the first case we treat all columns in the input tensor independently (this is mostly used for autoencoder problems), whereas in the second case we have time series-like data, which are mostly used for integration problems. We can also treat a problem with a matrix as input as a time series-like problem by providing an additional keyword argument: autoencoder=false:

SnapshotMatrix = rand(Float32, 10, 100)
 
 dl = DataLoader(SnapshotMatrix; autoencoder=false)
 dl.input_time_steps
100

DataLoader can also be called with a NamedTuple that has q and p as keys.

In this case the field input_dim of DataLoader is interpreted as the sum of the $q$- and $p$-dimensions, i.e. if $q$ and $p$ both evolve on $\mathbb{R}^n$, then input_dim is $2n$.

SymplecticSnapshotTensor = (q = rand(Float32, 10, 100, 5), p = rand(Float32, 10, 100, 5))
 
-dl = DataLoader(SymplecticSnapshotTensor)
DataLoader{Float32, @NamedTuple{q::Array{Float32, 3}, p::Array{Float32, 3}}, Nothing, :TimeSeries}((q = Float32[0.15644282 0.7073246 … 0.8249736 0.4740212; 0.21305132 0.16514009 … 0.16166991 0.8499346; … ; 0.024424314 0.728299 … 0.2242353 0.38895804; 0.60622776 0.95988494 … 0.9982563 0.44332892;;; 0.5877351 0.0811286 … 0.20768261 0.46987653; 0.5586822 0.9591143 … 0.5811738 0.9609387; … ; 0.5536761 0.36311072 … 0.1073218 0.33813965; 0.3897627 0.17892796 … 0.43707442 0.28174007;;; 0.99351263 0.99489754 … 0.0545007 0.4126557; 0.44839823 0.43165177 … 0.48689526 0.9167457; … ; 0.9912741 0.34914166 … 0.31272775 0.6109176; 0.91831887 0.4854077 … 0.58914196 0.31581843;;; 0.9183417 0.9080335 … 0.89300394 0.2955643; 0.4026698 0.7506492 … 0.5082923 0.939201; … ; 0.4048438 0.08850813 … 0.9773625 0.7758554; 0.4620188 0.53673846 … 0.7752993 0.929549;;; 0.70152813 0.70411754 … 0.4844982 0.55630016; 0.09341776 0.08390343 … 0.43498808 0.8259233; … ; 0.9540665 0.71738315 … 0.31836736 0.3504672; 0.08056831 0.11129582 … 0.18075955 0.84167683], p = Float32[0.24897355 0.84406173 … 0.8821633 0.7332919; 0.35389954 0.33957994 … 0.44175637 0.04273534; … ; 0.018718898 0.59222263 … 0.0070593357 0.8794088; 0.45509583 0.14810872 … 0.0074183345 0.8188201;;; 0.65852267 0.79771835 … 0.29122263 0.29927695; 0.8663034 0.9410317 … 0.02499646 0.09458858; … ; 0.3977123 0.6167766 … 0.026077628 0.22049832; 0.6863993 0.5649373 … 0.7501161 0.7416841;;; 0.21843678 0.4145254 … 0.7728342 0.051957905; 0.3930192 0.3803016 … 0.6458817 0.8798668; … ; 0.73145735 0.21938437 … 0.4649971 0.09445888; 0.036848545 0.11338526 … 0.9082846 0.23102015;;; 0.14175218 0.8213072 … 0.6976974 0.7727129; 0.27857906 0.026842952 … 0.46511102 0.29197192; … ; 0.29707193 0.93260044 … 0.38173282 0.023017883; 0.36474448 0.86480606 … 0.36332452 0.16360503;;; 0.14413375 0.31174034 … 0.3052305 0.31921828; 0.7458084 0.9550317 … 0.039779007 0.16072929; … ; 0.41656 0.24584025 … 0.25366008 0.4624713; 0.21026474 0.75346506 … 0.7817601 
0.96784276]), nothing, 20, 100, 5, nothing, nothing)
dl.input_dim
20

The Batch struct

Batch is a struct whose functor acts on an instance of DataLoader to produce a sequence of training samples for training for one epoch.

The Constructor

The constructor for Batch is called with:

  • batch_size::Int
  • seq_length::Int (optional)
  • prediction_window::Int (optional)

The first one of these arguments is required; it indicates the number of training samples in a batch. If we deal with time series data then we can additionally supply a sequence length and a prediction window as input arguments to Batch. These indicate the number of input vectors and the number of output vectors.

The functor

An instance of Batch can be called on an instance of DataLoader to produce a sequence of samples that contain all the input data, i.e. for training for one epoch. The output of applying batch::Batch to dl::DataLoader is a tuple of vectors of index pairs. Each of these pairs contains two integers: the first is the time index and the second one is the parameter index.

matrix_data = rand(Float32, 2, 10)
+dl = DataLoader(SymplecticSnapshotTensor)
DataLoader{Float32, @NamedTuple{q::Array{Float32, 3}, p::Array{Float32, 3}}, Nothing, :TimeSeries}((q = Float32[0.7791366 0.28073764 … 0.9862997 0.30020016; 0.7440304 0.18946952 … 0.22494721 0.6579873; … ; 0.6939318 0.18089384 … 0.75817144 0.75737053; 0.21608377 0.05419606 … 0.9309881 0.84315276;;; 0.4892816 0.32990253 … 0.6673117 0.37662077; 0.47587126 0.49790013 … 0.60261637 0.48116523; … ; 0.78565806 0.24955648 … 0.14914101 0.5634272; 0.5075221 0.7401721 … 0.84214705 0.4099716;;; 0.11747193 0.34643006 … 0.61611444 0.06825787; 0.13054687 0.8988276 … 0.23735303 0.88616043; … ; 0.5568334 0.78899366 … 0.27619523 0.20393169; 0.9208061 0.9108033 … 0.20435917 0.43289655;;; 0.8698324 0.60437876 … 0.76151127 0.8353441; 0.8823013 0.89698875 … 0.969461 0.94291246; … ; 0.4953075 0.8342314 … 0.33299845 0.14747047; 0.47873944 0.3389067 … 0.105866134 0.4911;;; 0.5670072 0.8998616 … 0.07160187 0.46568286; 0.7891013 0.784264 … 0.072253704 0.4323724; … ; 0.13048267 0.64141244 … 0.41664118 0.19565916; 0.9595706 0.58772904 … 0.8873439 0.2282213], p = Float32[0.8571841 0.46687567 … 0.49983686 0.13969034; 0.43319297 0.118955374 … 0.29352862 0.79054767; … ; 0.16126317 0.26671147 … 0.022414565 0.8941822; 0.9232449 0.56121236 … 0.6927374 0.41106206;;; 0.7507395 0.09502411 … 0.2725559 0.38396645; 0.24457109 0.96957535 … 0.39122403 0.7505981; … ; 0.22301441 0.9049039 … 0.0025550127 0.3290789; 0.84067076 0.64391595 … 0.07551992 0.78211766;;; 0.46823806 0.7967339 … 0.40415215 0.62230915; 0.006547153 0.82625693 … 0.14502281 0.21186227; … ; 0.095276654 0.85156894 … 0.33828026 0.6474707; 0.3929432 0.037352264 … 0.23333037 0.6576426;;; 0.34381378 0.9628466 … 0.58575636 0.11351174; 0.2104814 0.29876453 … 0.75060034 0.57417625; … ; 0.9730569 0.39612782 … 0.7958593 0.6515283; 0.1408338 0.51213044 … 0.04157412 0.70600814;;; 0.32858634 0.99018466 … 0.65625966 0.84543324; 0.60571367 0.4936729 … 0.7362143 0.08068818; … ; 0.4874196 0.84893006 … 0.3774628 0.1426813; 0.6773069 0.8764256 … 0.62613535 
0.20771247]), nothing, 20, 100, 5, nothing, nothing)
dl.input_dim
20

The Batch struct

Batch is a struct whose functor acts on an instance of DataLoader to produce a sequence of training samples for training for one epoch.

The Constructor

The constructor for Batch is called with:

  • batch_size::Int
  • seq_length::Int (optional)
  • prediction_window::Int (optional)

The first one of these arguments is required; it indicates the number of training samples in a batch. If we deal with time series data then we can additionally supply a sequence length and a prediction window as input arguments to Batch. These indicate the number of input vectors and the number of output vectors.

The functor

An instance of Batch can be called on an instance of DataLoader to produce a sequence of samples that contain all the input data, i.e. for training for one epoch. The output of applying batch::Batch to dl::DataLoader is a tuple of vectors of index pairs. Each of these pairs contains two integers: the first is the time index and the second one is the parameter index.

matrix_data = rand(Float32, 2, 10)
 dl = DataLoader(matrix_data; autoencoder = true)
 
 batch = Batch(3)
-batch(dl)
([(1, 1), (1, 5), (1, 3)], [(1, 2), (1, 6), (1, 8)], [(1, 9), (1, 10), (1, 4)], [(1, 7)])

This also works if the data are in $qp$ form:

qp_data = (q = rand(Float32, 2, 10), p = rand(Float32, 2, 10))
+batch(dl)
([(1, 7), (1, 10), (1, 3)], [(1, 2), (1, 1), (1, 8)], [(1, 9), (1, 5), (1, 6)], [(1, 4)])

This also works if the data are in $qp$ form:

qp_data = (q = rand(Float32, 2, 10), p = rand(Float32, 2, 10))
 dl = DataLoader(qp_data; autoencoder = true)
 
 batch = Batch(3)
-batch(dl)
([(1, 1), (1, 6), (1, 7)], [(1, 9), (1, 8), (1, 2)], [(1, 10), (1, 5), (1, 3)], [(1, 4)])

In those two examples the autoencoder keyword was set to true (the default). This is why the first index was always 1. This changes if we set autoencoder = false:

qp_data = (q = rand(Float32, 2, 10), p = rand(Float32, 2, 10))
+batch(dl)
([(1, 10), (1, 5), (1, 1)], [(1, 2), (1, 6), (1, 4)], [(1, 8), (1, 3), (1, 9)], [(1, 7)])

In those two examples the autoencoder keyword was set to true (the default). This is why the first index was always 1. This changes if we set autoencoder = false:

qp_data = (q = rand(Float32, 2, 10), p = rand(Float32, 2, 10))
 dl = DataLoader(qp_data; autoencoder = false) # false is default
 
 batch = Batch(3)
-batch(dl)
([(3, 1), (8, 1), (1, 1)], [(7, 1), (4, 1), (6, 1)], [(2, 1), (5, 1), (9, 1)])

Specifically the routines do the following:

  1. $\mathtt{n\_indices}\leftarrow \mathtt{n\_params}\lor\mathtt{input\_time\_steps},$
  2. $\mathtt{indices} \leftarrow \mathtt{shuffle}(\mathtt{1:\mathtt{n\_indices}}),$
  3. $\mathcal{I}_i \leftarrow \mathtt{indices[(i - 1)} \cdot \mathtt{batch\_size} + 1 \mathtt{:} i \cdot \mathtt{batch\_size]}\text{ for }i=1, \ldots, (\mathrm{last} -1),$
  4. $\mathcal{I}_\mathtt{last} \leftarrow \mathtt{indices[}(\mathtt{n\_batches} - 1) \cdot \mathtt{batch\_size} + 1\mathtt{:end]}.$

Note that the routines are implemented in such a way that no index appears twice.

Sampling from a tensor

We can also sample tensor data.

qp_data = (q = rand(Float32, 2, 20, 3), p = rand(Float32, 2, 20, 3))
+batch(dl)
([(1, 1), (7, 1), (9, 1)], [(2, 1), (4, 1), (3, 1)], [(5, 1), (8, 1), (6, 1)])

Specifically the routines do the following:

  1. $\mathtt{n\_indices}\leftarrow \mathtt{n\_params}\lor\mathtt{input\_time\_steps},$
  2. $\mathtt{indices} \leftarrow \mathtt{shuffle}(\mathtt{1:\mathtt{n\_indices}}),$
  3. $\mathcal{I}_i \leftarrow \mathtt{indices[(i - 1)} \cdot \mathtt{batch\_size} + 1 \mathtt{:} i \cdot \mathtt{batch\_size]}\text{ for }i=1, \ldots, (\mathrm{last} -1),$
  4. $\mathcal{I}_\mathtt{last} \leftarrow \mathtt{indices[}(\mathtt{n\_batches} - 1) \cdot \mathtt{batch\_size} + 1\mathtt{:end]}.$

Note that the routines are implemented in such a way that no index appears twice.
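The steps listed above can be sketched in a few lines of Julia. This is a minimal sketch using only the standard library; the helper name batch_indices is hypothetical and the code is not the implementation used by Batch.

```julia
using Random

# Hypothetical helper: shuffle the indices once, then cut them into
# consecutive chunks of length batch_size (the last chunk may be shorter).
function batch_indices(n_indices::Int, batch_size::Int)
    indices = shuffle(1:n_indices)
    return [collect(part) for part in Iterators.partition(indices, batch_size)]
end

batches = batch_indices(10, 3)
println(length.(batches))   # sizes of the individual batches
```

Because the indices are shuffled once and then partitioned, every index is used exactly once per epoch.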

Sampling from a tensor

We can also sample tensor data.

qp_data = (q = rand(Float32, 2, 20, 3), p = rand(Float32, 2, 20, 3))
 dl = DataLoader(qp_data)
 
 # also specify sequence length here
 batch = Batch(4, 5)
-batch(dl)
([(10, 3), (7, 3), (4, 3), (9, 3)], [(2, 3), (11, 3), (3, 3), (1, 3)], [(8, 3), (5, 3), (6, 3), (10, 2)], [(7, 2), (4, 2), (9, 2), (2, 2)], [(11, 2), (3, 2), (1, 2), (8, 2)], [(5, 2), (6, 2), (10, 1), (7, 1)], [(4, 1), (9, 1), (2, 1), (11, 1)], [(3, 1), (1, 1), (8, 1), (5, 1)], [(6, 1)])

Sampling from a tensor is done the following way ($\mathcal{I}_i$ again denotes the batch indices for the $i$-th batch):

  1. $\mathtt{time\_indices} \leftarrow \mathtt{shuffle}(\mathtt{1:}(\mathtt{input\_time\_steps} - \mathtt{seq\_length} - \mathtt{prediction\_window})),$
  2. $\mathtt{parameter\_indices} \leftarrow \mathtt{shuffle}(\mathtt{1:n\_params}),$
  3. $\mathtt{complete\_indices} \leftarrow \mathtt{product}(\mathtt{time\_indices}, \mathtt{parameter\_indices}),$
  4. $\mathcal{I}_i \leftarrow \mathtt{complete\_indices[}(i - 1) \cdot \mathtt{batch\_size} + 1 : i \cdot \mathtt{batch\_size]}\text{ for }i=1, \ldots, (\mathrm{last} -1),$
  5. $\mathcal{I}_\mathrm{last} \leftarrow \mathtt{complete\_indices[}(\mathrm{last} - 1) \cdot \mathtt{batch\_size} + 1\mathtt{:end]}.$

This algorithm can be visualized the following way (here batch_size = 4):

Here the sampling is performed over the second axis (the time step dimension) and the third axis (the parameter dimension). Whereas each block has thickness 1 in the $x$ direction (i.e. pertains to a single parameter), its length in the $y$ direction is seq_length. Each batch contains batch_size such blocks. By construction no block is sampled twice during a training epoch, but different blocks may overlap.

+batch(dl)
([(6, 1), (8, 1), (10, 1), (1, 1)], [(11, 1), (4, 1), (3, 1), (5, 1)], [(9, 1), (7, 1), (2, 1), (6, 2)], [(8, 2), (10, 2), (1, 2), (11, 2)], [(4, 2), (3, 2), (5, 2), (9, 2)], [(7, 2), (2, 2), (6, 3), (8, 3)], [(10, 3), (1, 3), (11, 3), (4, 3)], [(3, 3), (5, 3), (9, 3), (7, 3)], [(2, 3)])

Sampling from a tensor is done the following way ($\mathcal{I}_i$ again denotes the batch indices for the $i$-th batch):

  1. $\mathtt{time\_indices} \leftarrow \mathtt{shuffle}(\mathtt{1:}(\mathtt{input\_time\_steps} - \mathtt{seq\_length} - \mathtt{prediction\_window})),$
  2. $\mathtt{parameter\_indices} \leftarrow \mathtt{shuffle}(\mathtt{1:n\_params}),$
  3. $\mathtt{complete\_indices} \leftarrow \mathtt{product}(\mathtt{time\_indices}, \mathtt{parameter\_indices}),$
  4. $\mathcal{I}_i \leftarrow \mathtt{complete\_indices[}(i - 1) \cdot \mathtt{batch\_size} + 1 : i \cdot \mathtt{batch\_size]}\text{ for }i=1, \ldots, (\mathrm{last} -1),$
  5. $\mathcal{I}_\mathrm{last} \leftarrow \mathtt{complete\_indices[}(\mathrm{last} - 1) \cdot \mathtt{batch\_size} + 1\mathtt{:end]}.$
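The five steps above can be sketched as follows. This is only an illustration using Julia's standard library; the helper name tensor_batch_indices and the argument values are assumptions made for this example and are not taken from the package.

```julia
using Random

# Hypothetical helper following the five steps listed above.
function tensor_batch_indices(input_time_steps, n_params, batch_size, seq_length, prediction_window)
    time_indices = shuffle(1:(input_time_steps - seq_length - prediction_window))
    parameter_indices = shuffle(1:n_params)
    # All (time index, parameter index) pairs; the time index varies fastest.
    complete_indices = vec(collect(Iterators.product(time_indices, parameter_indices)))
    return [collect(part) for part in Iterators.partition(complete_indices, batch_size)]
end

# Assumed values: 20 time steps, 3 parameters, batch size 4,
# sequence length 5 and prediction window 4.
batches = tensor_batch_indices(20, 3, 4, 5, 4)
println(length(batches))
```

With these values there are $11 \cdot 3 = 33$ index pairs, so the sketch produces eight full batches and a final batch with a single pair.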

This algorithm can be visualized the following way (here batch_size = 4):

Here the sampling is performed over the second axis (the time step dimension) and the third axis (the parameter dimension). Whereas each block has thickness 1 in the $x$ direction (i.e. pertains to a single parameter), its length in the $y$ direction is seq_length. Each batch contains batch_size such blocks. By construction no block is sampled twice during a training epoch, but different blocks may overlap.

diff --git a/latest/data_loader/snapshot_matrix/index.html b/latest/data_loader/snapshot_matrix/index.html index 5cddbd9c8..ee3fca15f 100644 --- a/latest/data_loader/snapshot_matrix/index.html +++ b/latest/data_loader/snapshot_matrix/index.html @@ -1,8 +1,8 @@ -Snapshot matrix & tensor · GeometricMachineLearning.jl

Snapshot matrix

The snapshot matrix stores solutions of the high-dimensional ODE (obtained from discretizing a PDE). This is then used to construct reduced bases in a data-driven way. So (for a single parameter[1]) the snapshot matrix takes the following form:

\[M = \left[\begin{array}{c:c:c:c} +Snapshot matrix & tensor · GeometricMachineLearning.jl

Snapshot matrix

The snapshot matrix stores solutions of the high-dimensional ODE (obtained from discretizing a PDE). This is then used to construct reduced bases in a data-driven way. So (for a single parameter[1]) the snapshot matrix takes the following form:

\[M = \left[\begin{array}{c:c:c:c} \hat{u}_1(t_0) & \hat{u}_1(t_1) & \quad\ldots\quad & \hat{u}_1(t_f) \\ \hat{u}_2(t_0) & \hat{u}_2(t_1) & \ldots & \hat{u}_2(t_f) \\ \hat{u}_3(t_0) & \hat{u}_3(t_1) & \ldots & \hat{u}_3(t_f) \\ \ldots & \ldots & \ldots & \ldots \\ \hat{u}_{2N}(t_0) & \hat{u}_{2N}(t_1) & \ldots & \hat{u}_{2N}(t_f) \\ -\end{array}\right].\]

In the above example we store a matrix whose first axis is the system dimension (i.e. a state is an element of $\mathbb{R}^{2N}$) and whose second axis gives the time step.

The starting point for using the snapshot matrix as data for a machine learning model is that all the columns of $M$ live on a lower-dimensional solution manifold and we can use techniques such as POD and autoencoders to find this solution manifold. We also note that the second axis of $M$ does not necessarily indicate time but can also represent various parameters (including initial conditions). The second axis in the DataLoader struct is therefore saved in the field n_params.

Snapshot tensor

The snapshot tensor fulfills the same role as the snapshot matrix but has a third axis that describes different initial parameters (such as different initial conditions).

When drawing training samples from the snapshot tensor we also need to specify a sequence length (as an argument to the Batch struct). When sampling a batch from the snapshot tensor we sample over the starting point of the time interval (which is of length seq_length) and the third axis of the tensor (the parameters). The total number of batches in this case is $\lceil\mathtt{(dl.input\_time\_steps - batch.seq\_length) * dl.n\_params / batch.batch\_size}\rceil$.

  • 1If we deal with a parametrized PDE then there are two stages at which the snapshot matrix has to be processed: the offline stage and the online stage.
+\end{array}\right].\]

In the above example we store a matrix whose first axis is the system dimension (i.e. a state is an element of $\mathbb{R}^{2N}$) and whose second axis gives the time step.

The starting point for using the snapshot matrix as data for a machine learning model is that all the columns of $M$ live on a lower-dimensional solution manifold and we can use techniques such as POD and autoencoders to find this solution manifold. We also note that the second axis of $M$ does not necessarily indicate time but can also represent various parameters (including initial conditions). The second axis in the DataLoader struct is therefore saved in the field n_params.

Snapshot tensor

The snapshot tensor fulfills the same role as the snapshot matrix but has a third axis that describes different initial parameters (such as different initial conditions).

When drawing training samples from the snapshot tensor we also need to specify a sequence length (as an argument to the Batch struct). When sampling a batch from the snapshot tensor we sample over the starting point of the time interval (which is of length seq_length) and the third axis of the tensor (the parameters). The total number of batches in this case is $\lceil\mathtt{(dl.input\_time\_steps - batch.seq\_length) * dl.n\_params / batch.batch\_size}\rceil$.
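As a small worked example of the batch count as stated above, the ceiling of the quotient can be computed with integer arithmetic. The variables below are hypothetical stand-ins for the fields dl.input_time_steps, batch.seq_length, dl.n_params and batch.batch_size, and the values are chosen only for illustration.

```julia
# Hypothetical values standing in for the DataLoader and Batch fields.
input_time_steps, seq_length, n_params, batch_size = 100, 10, 5, 32

# cld(a, b) is the ceiling of a / b computed on integers.
n_batches = cld((input_time_steps - seq_length) * n_params, batch_size)
println(n_batches)   # 450 samples in batches of 32 gives 15 batches
```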

  • 1If we deal with a parametrized PDE then there are two stages at which the snapshot matrix has to be processed: the offline stage and the online stage.
diff --git a/latest/index.html b/latest/index.html index 1f69d75fd..afd745b70 100644 --- a/latest/index.html +++ b/latest/index.html @@ -1,2 +1,2 @@ -Home · GeometricMachineLearning.jl

Geometric Machine Learning

GeometricMachineLearning.jl implements various scientific machine learning models that aim at learning dynamical systems with geometric structure, such as Hamiltonian (symplectic) or Lagrangian (variational) systems.

Installation

GeometricMachineLearning.jl and all of its dependencies can be installed via the Julia REPL by typing

]add GeometricMachineLearning

Architectures

There are several architectures tailored towards problems in scientific machine learning implemented in GeometricMachineLearning.

Manifolds

GeometricMachineLearning supports putting neural network weights on manifolds. These include:

Special Neural Network Layer

Many layers have been adapted in order to be used for problems in scientific machine learning. These include:

Tutorials

Tutorials for using GeometricMachineLearning are:

Reduced Order Modeling

A short description of the key concepts in reduced order modeling (where GeometricMachineLearning can be used) is given in:

+Home · GeometricMachineLearning.jl

Geometric Machine Learning

GeometricMachineLearning.jl implements various scientific machine learning models that aim at learning dynamical systems with geometric structure, such as Hamiltonian (symplectic) or Lagrangian (variational) systems.

Installation

GeometricMachineLearning.jl and all of its dependencies can be installed via the Julia REPL by typing

]add GeometricMachineLearning

Architectures

There are several architectures tailored towards problems in scientific machine learning implemented in GeometricMachineLearning.

Manifolds

GeometricMachineLearning supports putting neural network weights on manifolds. These include:

Special Neural Network Layer

Many layers have been adapted for use in scientific machine learning problems, including:

Tutorials

Tutorials for using GeometricMachineLearning are:

Reduced Order Modeling

Short descriptions of the key concepts in reduced order modeling (where GeometricMachineLearning can be used) are given in:

diff --git a/latest/layers/attention_layer/index.html b/latest/layers/attention_layer/index.html index e5bf3a7be..56fcf99a4 100644 --- a/latest/layers/attention_layer/index.html +++ b/latest/layers/attention_layer/index.html @@ -1,11 +1,20 @@ -Attention · GeometricMachineLearning.jl

The Attention Layer

The attention mechanism was originally applied for image and natural language processing (NLP) tasks. In (Bahdanau et al, 2014) ``additive'' attention is used:

\[(z_q, z_k) \mapsto v^T\sigma(Wz_q + Uz_k).\]

However ``multiplicative'' attention is more straightforward to interpret and cheaper to handle computationally:

\[(z_q, z_k) \mapsto z_q^TWz_k.\]

Regardless of the type of attention used, they all try to compute correlations among input sequences on whose basis further neural network-based computation is performed. So given two input sequences $(z_q^{(1)}, \ldots, z_q^{(T)})$ and $(z_k^{(1)}, \ldots, z_k^{(T)})$, various attention mechanisms always return an output $C\in\mathbb{R}^{T\times{}T}$ with entries $[C]_{ij} = \mathtt{attention}(z_q^{(i)}, z_k^{(j)}$.

Self Attention

Attention in GeometricMachineLearning

The attention layer (and the orthonormal activation function defined for it) in GeometricMachineLearning was specifically designed to generalize transformers to symplectic data. Usually a self-attention layer takes the following form:

\[Z := [z^{(1)}, \ldots, z^{(T)}] \mapsto Z\mathrm{softmax}((P^QZ)^T(P^KZ)),\]

where we left out the linear mapping onto the values $P^V$.

The idea behind is that we can perform a non-linear re-weighting of the columns of $Z$ by multiplying with a $Z$-dependent matrix from the right and therefore take the sequential nature of the data into account (which is not possible with normal neural networks). After the attention step the transformer applies a simple ResNet from the left.

What the softmax does is a vector-wise operation, i.e. it operates on each column of an input matrix $A = [a_1, \ldots, a_T]$. The result is a sequence of probability vectors $[p^{(1)}, \ldots, p^{(T)}]$ for which

\[\sum_{i=1}^Tp^{(j)}_i=1\quad\forall{}j\in\{1,\dots,T\}.\]

What we want to construct is a symplectic transformation that is transformer-like. For this we modify the attention layer the following way:

\[Z := [z^{(1)}, \ldots, z^{(T)}] \mapsto Z\sigma((P^QZ)^T(P^KZ)),\]

where $\sigma(A)=\exp(\mathtt{upper\_triangular{\_asymmetrize}}(A))$ and

\[[\mathtt{upper\_triangular\_asymmetrize}(A)]_{ij} = \begin{cases} a_{ij} & \text{if $i<j$} \\ -a_{ji} & \text{if $i>j$} \\ 0 & \text{else.}\end{cases}\]

This has as a consequence that the matrix $\Lambda(Z) := \sigma((P^QZ)^T(P^KZ))$ is orthonormal and hence preserves an extended symplectic structure. To make this more clear, consider that the transformer maps sequences of vectors to sequences of vectors, i.e. $V\times\cdots\times{}V \ni [z^1, \ldots, z^T] \mapsto [\hat{z}^1, \ldots, \hat{z}^T]$. We can define a symplectic structure on $V\times\cdots\times{}V$ by rearranging $[z^1, \ldots, z^T]$ into a vector. We do this in the following way:

\[\tilde{Z} = \begin{pmatrix} q^{(1)}_1 \\ q^{(2)}_1 \\ \cdots \\ q^{(T)}_1 \\ q^{(1)}_2 \\ \cdots \\ q^{(T)}_d \\ p^{(1)}_1 \\ p^{(2)}_1 \\ \cdots \\ p^{(T)}_1 \\ p^{(1)}_2 \\ \cdots \\ p^{(T)}_d \end{pmatrix}.\]

The symplectic structure on this big space is then:

\[\mathbb{J}=\begin{pmatrix} - \mathbb{O}_{dT} & \mathbb{I}_{dT} \\ - -\mathbb{I}_{dT} & \mathbb{O}_{dT} -\end{pmatrix}.\]

Multiplying with the matrix $\Lambda(Z)$ from the right onto $[z^1, \ldots, z^T]$ corresponds to applying the sparse matrix

\[\tilde{\Lambda}(Z)=\left[ -\begin{array}{ccc} - \Lambda(Z) & \cdots & \mathbb{O}_T \\ - \vdots & \ddots & \vdots \\ - \mathbb{O}_T & \cdots & \Lambda(Z) - \end{array} -\right]\]

from the left onto the big vector.

Historical Note

Attention was used before, but always in connection with recurrent neural networks (see (Luong et al, 2015) and (Bahdanau et al, 2014)).

References

[20]
D. Bahdanau, K. Cho and Y. Bengio. Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473 (2014).
[19]
M.-T. Luong, H. Pham and C. D. Manning. Effective approaches to attention-based neural machine translation, arXiv preprint arXiv:1508.04025 (2015).
+Attention · GeometricMachineLearning.jl

The Attention Layer

The attention mechanism was originally developed for image and natural language processing (NLP) tasks. It is motivated by the need to handle time series data in an efficient way[1]. Its essential idea is to compute correlations between vectors in input sequences. I.e. given sequences

\[(z_q^{(1)}, z_q^{(2)}, \ldots, z_q^{(T)}) \text{ and } (z_k^{(1)}, z_k^{(2)}, \ldots, z_k^{(T)}),\]

an attention mechanism computes pair-wise correlations between all combinations of two input vectors from these sequences. In [13] "additive" attention is used to compute such correlations:

\[(z_q, z_k) \mapsto v^T\sigma(Wz_q + Uz_k), \]

where $z_q, z_k \in \mathbb{R}^d$ are elements of the input sequences. The learnable parameters are $W, U \in \mathbb{R}^{n\times{}d}$ and $v \in \mathbb{R}^n$.

However, multiplicative attention (see e.g. [14]) is more straightforward to interpret and cheaper to handle computationally:

\[(z_q, z_k) \mapsto z_q^TWz_k,\]

where $W \in \mathbb{R}^{d\times{}d}$ is a learnable weight matrix with respect to which correlations are computed as scalar products. Regardless of the type of attention used, all attention mechanisms compute correlations among input sequences, on the basis of which further computation is performed. Given two input sequences $Z_q = (z_q^{(1)}, \ldots, z_q^{(T)})$ and $Z_k = (z_k^{(1)}, \ldots, z_k^{(T)})$, we can arrange the various correlations into a correlation matrix $C\in\mathbb{R}^{T\times{}T}$ with entries $[C]_{ij} = \mathtt{attention}(z_q^{(i)}, z_k^{(j)})$. In the case of multiplicative attention this matrix is just $C = Z_q^TWZ_k$.
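
The arrangement of pairwise correlations into a matrix can be illustrated with a minimal sketch. It uses only Julia's `LinearAlgebra` standard library and not the package's own layers; the sizes `d` and `T` and the random weight `W` are assumptions made for illustration.

```julia
using LinearAlgebra

# Hypothetical sizes: vectors of dimension d, sequences of length T.
d, T = 3, 4
Z_q = randn(d, T)   # columns are the vectors z_q^{(i)}
Z_k = randn(d, T)   # columns are the vectors z_k^{(j)}
W   = randn(d, d)   # weight matrix (random here, learnable in a real layer)

# multiplicative attention for a single pair of vectors: z_q^T W z_k
attention(z_q, z_k) = dot(z_q, W * z_k)

# entry-wise construction of the T×T correlation matrix ...
C_entries = [attention(Z_q[:, i], Z_k[:, j]) for i in 1:T, j in 1:T]

# ... and the equivalent matrix expression
C = Z_q' * W * Z_k
```

Both constructions give the same $T\times{}T$ matrix; the matrix expression is the one that is cheap to evaluate.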

Reweighting of the input sequence

In GeometricMachineLearning we always compute self-attention, meaning that the two input sequences $Z_q$ and $Z_k$ are the same, i.e. $Z = Z_q = Z_k$.[2]

This is then used to reweight the columns in the input sequence $Z$. For this we first apply a nonlinearity $\sigma$ onto $C$ and then multiply $\sigma(C)$ onto $Z$ from the right, i.e. the output of the attention layer is $Z\sigma(C)$. So we perform the following mappings:

\[Z \xrightarrow{\mathrm{correlations}} C(Z) =: C \xrightarrow{\sigma} \sigma(C) \xrightarrow{\text{right multiplication}} Z \sigma(C).\]

After the right multiplication the output is of the following form:

\[ [\sum_{i=1}^Tp^{(1)}_iz^{(i)}, \ldots, \sum_{i=1}^Tp^{(T)}_iz^{(i)}],\]

for $p^{(i)} = [\sigma(C)]_{\bullet{}i}$. What is learned during training are $T$ different linear combinations of the input vectors, where the coefficients $p^{(i)}_j$ in these linear combinations depend on the input $Z$ nonlinearly.
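
The chain of mappings above can be traced in a few lines of plain Julia. This is a sketch under assumptions: the nonlinearity $\sigma$ is taken to be a column-wise softmax (the traditional choice), and the helper `softmax_columns` as well as the sizes are hypothetical names introduced here, not part of the package.

```julia
# column-wise softmax; subtracting the column maxima is a common numerical safeguard
function softmax_columns(C)
    E = exp.(C .- maximum(C; dims=1))
    return E ./ sum(E; dims=1)
end

d, T = 2, 3
Z = randn(d, T)           # input sequence, one vector per column
W = randn(d, d)           # weight matrix (random here)

C = Z' * W * Z            # correlations
P = softmax_columns(C)    # σ(C); column i is p^{(i)}
output = Z * P            # right multiplication

# column j of the output written out as a linear combination of the input vectors
combination(j) = sum(P[i, j] * Z[:, i] for i in 1:T)
```

Each column of `output` agrees with the corresponding `combination(j)`, which is the sum formula stated above.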

VolumePreservingAttention in GeometricMachineLearning

The attention layer (and the activation function $\sigma$ defined for it) in GeometricMachineLearning was specifically designed for data coming from physical systems that can be described through a divergence-free or a symplectic vector field. Traditionally the nonlinearity in the attention mechanism is a softmax[3] (see [14]) and the self-attention layer performs the following mapping:

\[Z := [z^{(1)}, \ldots, z^{(T)}] \mapsto Z\mathrm{softmax}(Z^TWZ).\]

The softmax activation acts vector-wise, i.e. if we supply it with a matrix $C$ as input it returns:

\[\mathrm{softmax}(C) = [\mathrm{softmax}(c_{\bullet{}1}), \ldots, \mathrm{softmax}(c_{\bullet{}T})].\]

The output of a softmax is a probability vector (also called stochastic vector) and the matrix $P = [p^{(1)}, \ldots, p^{(T)}]$, where each column is a probability vector, is sometimes referred to as a stochastic matrix (see [15]). This attention mechanism finds application in transformer neural networks [14]. The problem with this matrix from a geometric point of view is that all the columns are computed independently of each other, so the nonlinear transformation could in theory produce a stochastic matrix for which all columns are identical and thus lead to a loss of information. So the softmax activation function is inherently non-geometric.
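
The loss of information mentioned above can be made concrete with a small constructed example. The sketch below is plain Julia; `softmax_columns` is a hypothetical helper and the matrices are chosen by hand so that all columns of the correlation matrix coincide.

```julia
# column-wise softmax (hypothetical helper, not the package's implementation)
function softmax_columns(C)
    E = exp.(C .- maximum(C; dims=1))
    return E ./ sum(E; dims=1)
end

# a correlation matrix whose columns are all identical
C = [1.0 1.0 1.0;
     2.0 2.0 2.0;
     0.5 0.5 0.5]

P = softmax_columns(C)          # stochastic matrix with identical columns
column_sums = sum(P; dims=1)    # every column sums to one

Z = [1.0 0.0 2.0;
     0.0 1.0 3.0]
output = Z * P                  # all output columns coincide
```

Here three distinct input vectors are mapped to three identical output vectors, so the input cannot be recovered from the output.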

Besides the traditional attention mechanism GeometricMachineLearning therefore also has a volume-preserving transformation that fulfills a similar role. There are two approaches implemented to realize similar transformations. Both of them however utilize the Cayley transform to produce orthogonal matrices $\sigma(C)$ instead of stochastic matrices. For an orthogonal matrix $\Sigma$ we have $\Sigma^T\Sigma = \mathbb{I}$, so all the columns are linearly independent which is not necessarily true for a stochastic matrix $P$. The following explains how this new activation function is implemented.

The Cayley transform

The Cayley transform maps from skew-symmetric matrices to orthonormal matrices[4]. It takes the form:

\[\mathrm{Cayley}: A \mapsto (\mathbb{I} - A)(\mathbb{I} + A)^{-1}.\]

We can easily check that $\mathrm{Cayley}(A)$ is orthogonal if $A$ is skew-symmetric. For this consider $\varepsilon \mapsto A(\varepsilon)\in\mathcal{S}_\mathrm{skew}$ with $A(0) = \mathbb{O}$ and $A'(0) = B$. Then we have:

\[\frac{\delta\mathrm{Cayley}}{\delta{}A} = \frac{d}{d\varepsilon}|_{\varepsilon=0} \mathrm{Cayley}(A(\varepsilon))^T \mathrm{Cayley}(A(\varepsilon)) = \mathbb{O}.\]
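
The orthogonality can also be verified directly, without differentiating. The short derivation below assumes that $\mathbb{I} + A$ is invertible (which holds for a real skew-symmetric $A$, since its eigenvalues are purely imaginary) and uses $A^T = -A$ together with the fact that $\mathbb{I} - A$ and $\mathbb{I} + A$ commute:

\[\begin{aligned} \mathrm{Cayley}(A)^T\mathrm{Cayley}(A) &= (\mathbb{I} + A)^{-T}(\mathbb{I} - A)^T(\mathbb{I} - A)(\mathbb{I} + A)^{-1} \\ &= (\mathbb{I} - A)^{-1}(\mathbb{I} + A)(\mathbb{I} - A)(\mathbb{I} + A)^{-1} \\ &= (\mathbb{I} - A)^{-1}(\mathbb{I} - A)(\mathbb{I} + A)(\mathbb{I} + A)^{-1} = \mathbb{I}. \end{aligned}\]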

In order to use the Cayley transform as an activation function we further need a mapping from the input $Z$ to a skew-symmetric matrix. This is realized in two ways in GeometricMachineLearning: via a scalar-product with a skew-symmetric weighting and via a scalar-product with an arbitrary weighting.

First approach: scalar products with a skew-symmetric weighting

For this the attention layer is modified in the following way:

\[Z := [z^{(1)}, \ldots, z^{(T)}] \mapsto Z\sigma(Z^TAZ),\]

where $\sigma(C)=\mathrm{Cayley}(C)$ and $A$ is a skew-symmetric matrix that is learnable, i.e. the parameters of the attention layer are stored in $A$.
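
A minimal sketch of this first approach, written with Julia's `LinearAlgebra` standard library only. The function name `cayley`, the sizes and the random skew-symmetric weight are assumptions for illustration; this is not the package's layer implementation.

```julia
using LinearAlgebra

# Cayley transform as defined above; `/` multiplies by the inverse of (I + A) from the right
cayley(A) = (I - A) / (I + A)

d, T = 4, 3
Z = randn(d, T)       # input sequence, one vector per column
B = randn(d, d)
A = B - B'            # skew-symmetric weighting (random here, learnable in a real layer)

C = Z' * A * Z        # T×T and again skew-symmetric
Λ = cayley(C)         # orthogonal T×T matrix
output = Z * Λ        # reweighted sequence
```

Because `A` is skew-symmetric, so is `Z' * A * Z`, which is why the Cayley transform can be applied without any further processing of the correlation matrix.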

Second approach: scalar products with an arbitrary weighting

For this approach we compute correlations between the input vectors with an arbitrary weighting $A$. The correlations we consider here are based on:

\[(z^{(2)})^TAz^{(1)}, (z^{(3)})^TAz^{(1)}, \ldots, (z^{(T)})^TAz^{(1)}, (z^{(3)})^TAz^{(2)}, \ldots, (z^{(T)})^TAz^{(2)}, \ldots, (z^{(T)})^TAz^{(T-1)}.\]

So in total we consider correlations $(z^{(i)})^TAz^{(j)}$ for which $i > j$. We now arrange these correlations into a skew-symmetric matrix:

\[C = \begin{bmatrix} 0 & -(z^{(2)})^TAz^{(1)} & -(z^{(3)})^TAz^{(1)} & \ldots & -(z^{(T)})^TAz^{(1)} \\ (z^{(2)})^TAz^{(1)} & 0 & -(z^{(3)})^TAz^{(2)} & \ldots & -(z^{(T)})^TAz^{(2)} \\ \ldots & \ldots & \ldots & \ldots & \ldots \\ (z^{(T)})^TAz^{(1)} & (z^{(T)})^TAz^{(2)} & (z^{(T)})^TAz^{(3)} & \ldots & 0 \end{bmatrix}.\]

This correlation matrix can now again be used as an input for the Cayley transform to produce an orthogonal matrix.
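
The second approach can be sketched in the same way. The code below is an illustration with standard-library Julia only; it assumes an arbitrary (here random) weighting matrix as in the heading of this subsection, and the names `cayley`, `M` and `L` are introduced for this example.

```julia
using LinearAlgebra

cayley(S) = (I - S) / (I + S)

d, T = 4, 3
Z = randn(d, T)       # input sequence, one vector per column
A = randn(d, d)       # arbitrary weighting (random here)

M = Z' * A * Z        # [M]_{ij} = (z^{(i)})^T A z^{(j)}
L = tril(M, -1)       # keep only the correlations with i > j
C = L - L'            # skew-symmetric arrangement of these correlations
Λ = cayley(C)         # orthogonal T×T matrix
```

Only the strictly lower-triangular part of `M` enters `C`; the upper part is filled with the negated transpose so that `C` is skew-symmetric by construction.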

How is structure preserved?

In order to discuss how structure is preserved we first have to define what structure we mean precisely. This structure is strongly inspired by traditional multi-step methods (see [17]). We now define what volume preservation means for the product space $\mathbb{R}^{d}\times\cdots\times\mathbb{R}^{d}\equiv\times_\text{$T$ times}\mathbb{R}^{d}$.

Consider an isomorphism $\hat{}: \times_\text{($T$ times)}\mathbb{R}^{d}\stackrel{\approx}{\longrightarrow}\mathbb{R}^{dT}$. Specifically, this isomorphism takes the form:

\[Z = \left[\begin{array}{cccc} z_1^{(1)} & z_1^{(2)} & \quad\cdots\quad & z_1^{(T)} \\ z_2^{(1)} & z_2^{(2)} & \cdots & z_2^{(T)} \\ \cdots & \cdots & \cdots & \cdots \\ z_d^{(1)} & z_d^{(2)} & \cdots & z_d^{(T)} \end{array}\right] \mapsto \left[\begin{array}{c} z_1^{(1)} \\ z_1^{(2)} \\ \cdots \\ z_1^{(T)} \\ z_2^{(1)} \\ \cdots \\ z_d^{(T)} \end{array}\right] =: Z_\mathrm{vec}.\]

We write $\hat{Z} := Z_\mathrm{vec}$ for the image of $Z$ and refer to the inverse of $Z \mapsto \hat{Z}$ as $Y \mapsto \tilde{Y}$. In the following we also write $\hat{\varphi}$ for the mapping $\,\hat{}\circ\varphi\circ\tilde{}\,$.

DEFINITION: We say that a mapping $\varphi: \times_\text{$T$ times}\mathbb{R}^{d} \to \times_\text{$T$ times}\mathbb{R}^{d}$ is volume-preserving if the associated $\hat{\varphi}$ is volume-preserving.

In the transformed coordinate system (in terms of the vector $Z_\mathrm{vec}$ defined above), right multiplication of $Z$ by the orthogonal matrix $\Lambda(Z) := \sigma(C(Z))$ is equivalent to multiplication by a sparse matrix $\tilde\Lambda(Z)$ from the left:

\[ \tilde{\Lambda}(Z) Z_\mathrm{vec} := \begin{pmatrix} \Lambda(Z)^T & \mathbb{O} & \cdots & \mathbb{O} \\ \mathbb{O} & \Lambda(Z)^T & \cdots & \mathbb{O} \\ \cdots & \cdots & \ddots & \cdots \\ \mathbb{O} & \mathbb{O} & \cdots & \Lambda(Z)^T \\ \end{pmatrix} \left[\begin{array}{c} z_1^{(1)} \\ z_1^{(2)} \\ \ldots \\ z_1^{(T)} \\ z_2^{(1)} \\ \ldots \\ z_d^{(T)} \end{array}\right] .\]

The transpose appears because $Z_\mathrm{vec}$ stacks the rows of $Z$. $\tilde{\Lambda}(Z)$ in the equation above is easily shown to be an orthogonal matrix: it is block-diagonal and each of its $d$ diagonal blocks is orthogonal.
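
The correspondence between right multiplication and the block-diagonal matrix can be checked numerically. The sketch below uses standard-library Julia only; `flatten` is a hypothetical helper that implements the row-wise stacking into $Z_\mathrm{vec}$, and `Λ` is just some orthogonal matrix obtained from a random skew-symmetric one. Under this stacking the diagonal blocks come out as the transpose of `Λ`; orthogonality of the big matrix does not depend on this detail.

```julia
using LinearAlgebra

d, T = 2, 3
Z = randn(d, T)

S = randn(T, T); S = S - S'       # random skew-symmetric matrix
Λ = (I - S) / (I + S)             # some orthogonal T×T matrix (Cayley transform)

# stack the rows of Z: z_1^{(1)}, …, z_1^{(T)}, z_2^{(1)}, …, z_d^{(T)}
flatten(Z) = vec(permutedims(Z))

# block-diagonal matrix with d diagonal blocks of size T×T
Λ_big = kron(Matrix{Float64}(I, d, d), Matrix(Λ'))

lhs = Λ_big * flatten(Z)          # multiplication from the left on the big vector
rhs = flatten(Z * Λ)              # right multiplication on the matrix, then stacking
```

Both `lhs` and `rhs` describe the same vector, and `Λ_big` inherits orthogonality from its diagonal blocks.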

Historical Note

Attention was used before, but always in connection with recurrent neural networks (see [18] and [13]).

References

[13]
D. Bahdanau, K. Cho and Y. Bengio. Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473 (2014).
[18]
M.-T. Luong, H. Pham and C. D. Manning. Effective approaches to attention-based neural machine translation, arXiv preprint arXiv:1508.04025 (2015).
  • 1Recurrent neural networks have the same motivation.
  • 2Multihead attention also falls into this category. Here the input $Z$ is multiplied from the left with several projection matrices $P^Q_i$ and $P^K_i$, where $i$ indicates the head. For each head we then compute a correlation matrix $(P^Q_i Z)^T(P^K_i Z)$.
  • 3The softmax acts on the matrix $C$ in a vector-wise manner, i.e. it operates on each column of the input matrix $C = [c^{(1)}, \ldots, c^{(T)}]$. The result is a sequence of probability vectors $[p^{(1)}, \ldots, p^{(T)}]$ for which $\sum_{i=1}^Tp^{(j)}_i=1\quad\forall{}j\in\{1,\dots,T\}.$
  • 4A matrix $A$ is skew-symmetric if $A = -A^T$ and a matrix $B$ is orthonormal if $B^TB = \mathbb{I}$. The orthonormal matrices form a Lie group, i.e. the set of orthonormal matrices can be endowed with the structure of a differentiable manifold and this set also satisfies the group axioms. The corresponding Lie algebra consists of the skew-symmetric matrices and the Cayley transform is a so-called retraction in this case. For more details consult e.g. [2] and [16].
diff --git a/latest/layers/multihead_attention_layer/index.html b/latest/layers/multihead_attention_layer/index.html index 2050729cd..e41ba8a47 100644 --- a/latest/layers/multihead_attention_layer/index.html +++ b/latest/layers/multihead_attention_layer/index.html @@ -1,2 +1,2 @@ -Multihead Attention · GeometricMachineLearning.jl

Multihead Attention Layer

In order to arrive from the attention layer at the multihead attention layer we have to do a few modifications:

Note that these neural networks were originally developed for natural language processing (NLP) tasks and the terminology used here bears some resemblance to that field. The input to a multihead attention layer typicaly comprises three components:

  1. Values $V\in\mathbb{R}^{n\times{}T}$: a matrix whose columns are value vectors,
  2. Queries $Q\in\mathbb{R}^{n\times{}T}$: a matrix whose columns are query vectors,
  3. Keys $K\in\mathbb{R}^{n\times{}T}$: a matrix whose columns are key vectors.

Regular attention performs the following operation:

\[\mathrm{Attention}(Q,K,V) = V\mathrm{softmax}(\frac{K^TQ}{\sqrt{n}}),\]

where $n$ is the dimension of the vectors in $V$, $Q$ and $K$. The softmax activation function here acts column-wise, so it can be seen as a transformation $\mathrm{softmax}:\mathbb{R}^{T}\to\mathbb{R}^T$ with $[\mathrm{softmax}(v)]_i = e^{v_i}/\left(\sum_{j=1}e^{v_j}\right)$. The $K^TQ$ term is a similarity matrix between the queries and the vectors.

The transformer contains a self-attention mechanism, i.e. takes an input $X$ and then transforms it linearly to $V$, $Q$ and $K$, i.e. $V = P^VX$, $Q = P^QX$ and $K = P^KX$. What distinguishes the multihead attention layer from the singlehead attention layer, is that there is not just one $P^V$, $P^Q$ and $P^K$, but there are several: one for each head of the multihead attention layer. After computing the individual values, queries and vectors, and after applying the softmax, the outputs are then concatenated together in order to obtain again an array that is of the same size as the input array:

Here the various $P$ matrices can be interpreted as being projections onto lower-dimensional subspaces, hence the designation by the letter $P$. Because of this interpretation as projection matrices onto smaller spaces that should capture features in the input data it makes sense to constrain these elements to be part of the Stiefel manifold.

Computing Correlations in the Multihead-Attention Layer

The attention mechanism describes a reweighting of the "values" $V_i$ based on correlations between the "keys" $K_i$ and the "queries" $Q_i$. First note the structure of these matrices: they are all a collection of $T$ vectors $(N\div\mathtt{n\_heads})$-dimensional vectors, i.e. $V_i=[v_i^{(1)}, \ldots, v_i^{(T)}], K_i=[k_i^{(1)}, \ldots, k_i^{(T)}], Q_i=[q_i^{(1)}, \ldots, q_i^{(T)}]$ . Those vectors have been obtained by applying the respective projection matrices onto the original input $I_i\in\mathbb{R}^{N\times{}T}$.

When performing the reweighting of the columns of $V_i$ we first compute the correlations between the vectors in $K_i$ and in $Q_i$ and store the results in a correlation matrix $C_i$:

\[ [C_i]_{mn} = \left(k_i^{(m)}\right)^Tq_i^{(n)}.\]

The columns of this correlation matrix are than rescaled with a softmax function, obtaining a matrix of probability vectors $\mathcal{P}_i$:

\[ [\mathcal{P}_i]_{\bullet{}n} = \mathrm{softmax}([C_i]_{\bullet{}n}).\]

Finally the matrix $\mathcal{P}_i$ is multiplied onto $V_i$ from the right, resulting in 16 convex combinations of the 16 vectors $v_i^{(m)}$ with $m=1,\ldots,T$:

\[ V_i\mathcal{P}_i = \left[\sum_{m=1}^{16}[\mathcal{P}_i]_{m,1}v_i^{(m)}, \ldots, \sum_{m=1}^{T}[\mathcal{P}_i]_{m,T}v_i^{(m)}\right].\]

With this we can now give a better interpretation of what the projection matrices $W_i^V$, $W_i^K$ and $W_i^Q$ should do: they map the original data to lower-dimensional subspaces. We then compute correlations between the representation in the $K$ and in the $Q$ basis and use this correlation to perform a convex reweighting of the vectors in the $V$ basis. These reweighted values are then fed into a standard feedforward neural network.

Because the main task of the $W_i^V$, $W_i^K$ and $W_i^Q$ matrices here is for them to find bases, it makes sense to constrain them onto the Stiefel manifold; they do not and should not have the maximum possible generality.

References

[21]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser and I. Polosukhin. Attention is all you need. Advances in neural information processing systems 30 (2017).
+Multihead Attention · GeometricMachineLearning.jl

Multihead Attention Layer

In order to arrive from the attention layer at the multihead attention layer we have to make a few modifications:

Note that these neural networks were originally developed for natural language processing (NLP) tasks and the terminology used here bears some resemblance to that field. The input to a multihead attention layer typically comprises three components:

  1. Values $V\in\mathbb{R}^{n\times{}T}$: a matrix whose columns are value vectors,
  2. Queries $Q\in\mathbb{R}^{n\times{}T}$: a matrix whose columns are query vectors,
  3. Keys $K\in\mathbb{R}^{n\times{}T}$: a matrix whose columns are key vectors.

Regular attention performs the following operation:

\[\mathrm{Attention}(Q,K,V) = V\mathrm{softmax}(\frac{K^TQ}{\sqrt{n}}),\]

where $n$ is the dimension of the vectors in $V$, $Q$ and $K$. The softmax activation function here acts column-wise, so it can be seen as a transformation $\mathrm{softmax}:\mathbb{R}^{T}\to\mathbb{R}^T$ with $[\mathrm{softmax}(v)]_i = e^{v_i}/\left(\sum_{j=1}^Te^{v_j}\right)$. The $K^TQ$ term is a similarity matrix between the queries and the keys.
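
The attention formula above translates almost literally into code. This is a minimal sketch in plain Julia, not the package's layer; `softmax_columns` and `attention` are hypothetical helper names and the inputs are random.

```julia
# column-wise softmax; subtracting the column maxima is a common numerical safeguard
function softmax_columns(C)
    E = exp.(C .- maximum(C; dims=1))
    return E ./ sum(E; dims=1)
end

# Attention(Q, K, V) = V softmax(KᵀQ / √n), with n the dimension of the vectors
attention(Q, K, V) = V * softmax_columns(K' * Q ./ sqrt(size(Q, 1)))

n, T = 4, 5
Q, K, V = randn(n, T), randn(n, T), randn(n, T)

weights = softmax_columns(K' * Q ./ sqrt(n))   # T×T matrix of probability vectors
out = attention(Q, K, V)                       # n×T, same shape as V
```

Every column of `weights` is a probability vector, so every column of `out` is a convex combination of the columns of `V`.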

The transformer contains a self-attention mechanism, i.e. it takes an input $X$ and transforms it linearly to $V$, $Q$ and $K$ via $V = P^VX$, $Q = P^QX$ and $K = P^KX$. What distinguishes the multihead attention layer from the singlehead attention layer is that there is not just one $P^V$, $P^Q$ and $P^K$, but several: one for each head of the multihead attention layer. After computing the individual values, queries and keys, and after applying the softmax, the outputs are concatenated in order to again obtain an array that is of the same size as the input array:

Here the various $P$ matrices can be interpreted as being projections onto lower-dimensional subspaces, hence the designation by the letter $P$. Because of this interpretation as projection matrices onto smaller spaces that should capture features in the input data it makes sense to constrain these elements to be part of the Stiefel manifold.
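
The splitting into heads and the final concatenation can be sketched as follows. This is an illustration with standard-library Julia only, not the package's implementation. It assumes that `N` is divisible by `n_heads` and that the heads are concatenated vertically; `random_projection` is a hypothetical helper that draws a matrix with orthonormal rows via a QR decomposition, mimicking the Stiefel constraint mentioned above.

```julia
using LinearAlgebra

# column-wise softmax (hypothetical helper)
function softmax_columns(C)
    E = exp.(C .- maximum(C; dims=1))
    return E ./ sum(E; dims=1)
end

N, T, n_heads = 6, 4, 3
n = N ÷ n_heads                   # dimension of the vectors within one head
X = randn(N, T)                   # input array

# n×N matrix with orthonormal rows; its transpose has orthonormal columns
random_projection() = Matrix(qr(randn(N, n)).Q)'

heads = map(1:n_heads) do _
    P_V, P_Q, P_K = random_projection(), random_projection(), random_projection()
    V, Q, K = P_V * X, P_Q * X, P_K * X
    V * softmax_columns(K' * Q ./ sqrt(n))    # single-head attention, n×T
end

output = reduce(vcat, heads)      # concatenation of the heads, N×T again
```

Each head works in an $(N\div\mathtt{n\_heads})$-dimensional space, and stacking the `n_heads` results restores the size of the input.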

Computing Correlations in the Multihead-Attention Layer

The attention mechanism describes a reweighting of the "values" $V_i$ based on correlations between the "keys" $K_i$ and the "queries" $Q_i$. First note the structure of these matrices: they are all collections of $T$ vectors of dimension $N\div\mathtt{n\_heads}$, i.e. $V_i=[v_i^{(1)}, \ldots, v_i^{(T)}], K_i=[k_i^{(1)}, \ldots, k_i^{(T)}], Q_i=[q_i^{(1)}, \ldots, q_i^{(T)}]$. Those vectors have been obtained by applying the respective projection matrices onto the original input $I_i\in\mathbb{R}^{N\times{}T}$.

When performing the reweighting of the columns of $V_i$ we first compute the correlations between the vectors in $K_i$ and in $Q_i$ and store the results in a correlation matrix $C_i$:

\[ [C_i]_{mn} = \left(k_i^{(m)}\right)^Tq_i^{(n)}.\]

The columns of this correlation matrix are then rescaled with a softmax function, obtaining a matrix of probability vectors $\mathcal{P}_i$:

\[ [\mathcal{P}_i]_{\bullet{}n} = \mathrm{softmax}([C_i]_{\bullet{}n}).\]

Finally the matrix $\mathcal{P}_i$ is multiplied onto $V_i$ from the right, resulting in $T$ convex combinations of the $T$ vectors $v_i^{(m)}$ with $m=1,\ldots,T$:

\[ V_i\mathcal{P}_i = \left[\sum_{m=1}^{T}[\mathcal{P}_i]_{m,1}v_i^{(m)}, \ldots, \sum_{m=1}^{T}[\mathcal{P}_i]_{m,T}v_i^{(m)}\right].\]

With this we can now give a better interpretation of what the projection matrices $W_i^V$, $W_i^K$ and $W_i^Q$ (written $P^V$, $P^K$ and $P^Q$ above) should do: they map the original data to lower-dimensional subspaces. We then compute correlations between the representation in the $K$ and in the $Q$ basis and use this correlation to perform a convex reweighting of the vectors in the $V$ basis. These reweighted values are then fed into a standard feedforward neural network.

Because the main task of the $W_i^V$, $W_i^K$ and $W_i^Q$ matrices here is for them to find bases, it makes sense to constrain them onto the Stiefel manifold; they do not and should not have the maximum possible generality.

References

[14]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser and I. Polosukhin. Attention is all you need. Advances in neural information processing systems 30 (2017).
diff --git a/latest/layers/volume_preserving_feedforward/index.html b/latest/layers/volume_preserving_feedforward/index.html index 905bf61be..6c0bb7626 100644 --- a/latest/layers/volume_preserving_feedforward/index.html +++ b/latest/layers/volume_preserving_feedforward/index.html @@ -1,5 +1,5 @@ -Volume-Preserving Layers · GeometricMachineLearning.jl

Volume-Preserving Feedforward Layer

Volume preserving feedforward layers are a special type of ResNet layer for which we restrict the weight matrices to be of a particular form. I.e. each layer computes:

\[x \mapsto x + \sigma(Ax + b),\]

where $\sigma$ is a nonlinearity, $A$ is the weight and $b$ is the bias. The matrix $A$ is either a lower-triangular matrix $L$ or an upper-triangular matrix $U$[1]. The lower triangular matrix is of the form (the upper-triangular layer is simply the transpose of the lower triangular):

\[L = \begin{pmatrix} +Volume-Preserving Layers · GeometricMachineLearning.jl

Volume-Preserving Feedforward Layer

Volume-preserving feedforward layers are a special type of ResNet layer for which we restrict the weight matrices to be of a particular form. I.e. each layer computes:

\[x \mapsto x + \sigma(Ax + b),\]

where $\sigma$ is a nonlinearity, $A$ is the weight and $b$ is the bias. The matrix $A$ is either a lower-triangular matrix $L$ or an upper-triangular matrix $U$[1]. The lower triangular matrix is of the form (the upper-triangular layer is simply the transpose of the lower triangular):

\[L = \begin{pmatrix} 0 & 0 & \cdots & 0 \\ a_{21} & \ddots & & \vdots \\ \vdots & \ddots & \ddots & \vdots \\ @@ -9,4 +9,4 @@ b_{21} & \ddots & & \vdots \\ \vdots & \ddots & \ddots & \vdots \\ b_{n1} & \cdots & b_{n(n-1)} & 1 -\end{pmatrix},\]

and the determinant of $J$ is 1, i.e. the map is volume-preserving.

Neural network architecture

Volume-preserving feedforward neural networks should be used as Architectures in GeometricMachineLearning. The constructor for them is:

The constructor is called with the following arguments:

  • sys_dim::Int: The system dimension.
  • n_blocks::Int: The number of blocks in the neural network (containing linear layers and nonlinear layers). Default is 1.
  • n_linear::Int: The number of linear VolumePreservingLowerLayers and VolumePreservingUpperLayers in one block. Default is 1.
  • activation: The activation function for the nonlinear layers in a block.
  • init_upper::Bool=false (keyword argument): Specifies if the first layer is lower or upper.

The constructor produces the following architecture[2]:

Here LinearLowerLayer performs $x \mapsto x + Lx$ and NonLinearLowerLayer performs $x \mapsto x + \sigma(Lx + b)$. The activation function $\sigma$ is the forth input argument to the constructor and tanh by default.

Note on Sympnets

As SympNets are symplectic maps, they also conserve phase space volume and therefore form a subcategory of volume-preserving feedforward layers.

  • 1Implemented as LowerTriangular and UpperTriangular in GeometricMachineLearning.
  • 2Based on the input arguments n_linear and n_blocks. In this example init_upper is set to false, which means that the first layer is of type lower followed by a layer of type upper.
+\end{pmatrix},\]

and the determinant of $J$ is 1, i.e. the map is volume-preserving.
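
The unit determinant can be checked numerically. The sketch below assumes that $J$ denotes the Jacobian of the layer $x \mapsto x + \sigma(Lx + b)$ with a strictly lower-triangular $L$ and $\sigma = \tanh$ (the intermediate lines are elided in this diff). It uses standard-library Julia only, and the names `layer` and `J` are introduced here for illustration.

```julia
using LinearAlgebra

n = 4
L = tril(randn(n, n), -1)     # strictly lower-triangular weight (zero diagonal)
b = randn(n)                  # bias
x = randn(n)                  # point at which the Jacobian is evaluated

# the layer x ↦ x + σ(Lx + b) with σ = tanh
layer(x) = x + tanh.(L * x + b)

# Jacobian by the chain rule: 𝕀 + diag(σ'(Lx + b)) L, with σ' = 1 - tanh²
J = I + Diagonal(1 .- tanh.(L * x + b) .^ 2) * L

volume_factor = det(J)        # product of the diagonal entries of a triangular matrix
```

Since `J` is lower-triangular with ones on its diagonal, its determinant does not depend on the values of `L`, `b` or `x`.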

Neural network architecture

Volume-preserving feedforward neural networks should be used as Architectures in GeometricMachineLearning. The constructor for them is:

The constructor is called with the following arguments:

  • sys_dim::Int: The system dimension.
  • n_blocks::Int: The number of blocks in the neural network (containing linear layers and nonlinear layers). Default is 1.
  • n_linear::Int: The number of linear VolumePreservingLowerLayers and VolumePreservingUpperLayers in one block. Default is 1.
  • activation: The activation function for the nonlinear layers in a block.
  • init_upper::Bool=false (keyword argument): Specifies if the first layer is lower or upper.

The constructor produces the following architecture[2]:

Here LinearLowerLayer performs $x \mapsto x + Lx$ and NonLinearLowerLayer performs $x \mapsto x + \sigma(Lx + b)$. The activation function $\sigma$ is the fourth input argument to the constructor and tanh by default.

Note on SympNets

As SympNets are symplectic maps, they also conserve phase space volume and therefore form a subcategory of volume-preserving feedforward layers.

  • 1Implemented as LowerTriangular and UpperTriangular in GeometricMachineLearning.
  • 2Based on the input arguments n_linear and n_blocks. In this example init_upper is set to false, which means that the first layer is of type lower followed by a layer of type upper.
diff --git a/latest/library/index.html b/latest/library/index.html index 280639dce..03da3a2da 100644 --- a/latest/library/index.html +++ b/latest/library/index.html @@ -1,25 +1,25 @@ -Library · GeometricMachineLearning.jl

GeometricMachineLearning Library Functions

GeometricMachineLearning.AbstractRetractionType

AbstractRetraction is a type that comprises all retraction methods for manifolds. For every manifold layer one has to specify a retraction method that takes the layer and elements of the (global) tangent space.

source
GeometricMachineLearning.ActivationLayerPMethod

Performs:

\[\begin{pmatrix} +Library · GeometricMachineLearning.jl

GeometricMachineLearning Library Functions

GeometricMachineLearning.AbstractRetractionType

AbstractRetraction is a type that comprises all retraction methods for manifolds. For every manifold layer one has to specify a retraction method that takes the layer and elements of the (global) tangent space.

source
GeometricMachineLearning.AdamOptimizerWithDecayType

Defines the Adam Optimizer with weight decay.

Constructors

The default constructor takes as input:

  • n_epochs::Int
  • η₁: the learning rate at the start
  • η₂: the learning rate at the end
  • ρ₁: the decay parameter for the first moment
  • ρ₂: the decay parameter for the second moment
  • δ: the safety parameter
  • T (keyword argument): the type.

The second constructor is called with:

  • n_epochs::Int
  • T

... the rest are keyword arguments

source
GeometricMachineLearning.BFGSCacheType

The cache for the BFGS optimizer.

It stores an array for the previous time step B and the inverse of the Hessian matrix H.

It is important to note that setting up this cache already requires a derivative! This is not the case for the other optimizers.

source
GeometricMachineLearning.BFGSDummyCacheType

In order to initialize BFGSCache we first need gradient information. This is why we initially have this BFGSDummyCache until gradient information is available.

NOTE: we may not need this.

source
GeometricMachineLearning.BatchType

Batch is a struct whose functor acts on an instance of DataLoader to produce a sequence of training samples for training for one epoch.

The Constructor

The constructor for Batch is called with:

  • batch_size::Int
  • seq_length::Int (optional)
  • prediction_window::Int (optional)

The first one of these arguments is required; it indicates the number of training samples in a batch. If we deal with time series data then we can additionally supply a sequence length and a prediction window as input arguments to Batch. These indicate the number of input vectors and the number of output vectors.

The functor

An instance of Batch can be called on an instance of DataLoader to produce a sequence of samples that contain all the input data, i.e. for training for one epoch. The output of applying batch::Batch to dl::DataLoader is a tuple of vectors of integers. Each of these vectors contains two integers: the first is the time index and the second one is the parameter index.

source
GeometricMachineLearning.ClassificationType

Classification Layer that takes a matrix as an input and returns a vector that is used for MNIST classification.

It has the following arguments:

  • M: input dimension
  • N: output dimension
  • activation: the activation function

And the following optional argument:

  • average: If this is set to true, then the output is computed as $\frac{1}{N}\sum_{i=1}^N[input]_{\bullet{}i}$. If set to false (the default) it picks the last column of the input.
source
GeometricMachineLearning.ClassificationTransformerType

This is a transformer neural network for classification purposes. At the moment this is only used for training on MNIST, but can in theory be used for any classification problem.

It has to be called with a DataLoader that stores an input and an output tensor. The optional arguments are:

  • n_heads: The number of heads in the MultiHeadAttention (mha) layers. Default: 7.
  • n_layers: The number of transformer layers. Default: 16.
  • activation: The activation function. Default: softmax.
  • Stiefel: Whether the matrices in the mha layers are on the Stiefel manifold.
  • add_connection: Whether the input is appended to the output of the mha layer. (skip connection)
source
GeometricMachineLearning.DataLoaderType

Data Loader is a struct that creates an instance based on a tensor (or different input format) and is designed to make training convenient.

Constructor

The data loader can be called with various inputs:

  • A single vector: If the data loader is called with a single vector (and no other arguments are given), then this is interpreted as an autoencoder problem, i.e. the second axis indicates parameter values and/or time steps and the system has a single degree of freedom (i.e. the system dimension is one).
  • A single matrix: If the data loader is called with a single matrix (and no other arguments are given), then this is interpreted as an autoencoder problem, i.e. the first axis is assumed to indicate the degrees of freedom of the system and the second axis indicates parameter values and/or time steps.
  • A single tensor: If the data loader is called with a single tensor, then this is interpreted as an integration problem with the second axis indicating the time step and the third one indicating the parameters.
  • A tensor and a vector: This is a special case (MNIST classification problem). For the MNIST problem for example the input are $n_p$ matrices (first input argument) and $n_p$ integers (second input argument).
  • A NamedTuple with fields q and p: The NamedTuple contains (i) two matrices or (ii) two tensors.
  • An EnsembleSolution: The EnsembleSolution typically comes from GeometricProblems.

When we supply a single vector or a single matrix as input to DataLoader and further set autoencoder = false (keyword argument), then the data are stored as an integration problem and the second axis is assumed to indicate time steps.

Fields of DataLoader

The fields of the DataLoader struct are the following:

  • input: The input data with axes (i) system dimension, (ii) number of time steps and (iii) number of parameters.
  • output: The tensor that contains the output (supervised learning). This may be of type Nothing if the constructor is only called with one tensor (unsupervised learning).
  • input_dim: The dimension of the system, i.e. what is taken as input by a regular neural network.
  • input_time_steps: The length of the entire time series (length of the second axis).
  • n_params: The number of parameters that are present in the data set (length of the third axis).
  • output_dim: The dimension of the output tensor (first axis). If output is of type Nothing, then this is also of type Nothing.
  • output_time_steps: The size of the second axis of the output tensor. If output is of type Nothing, then this is also of type Nothing.

The input and output fields of DataLoader

Even though the arguments to the Constructor may be vectors or matrices, internally DataLoader always stores tensors.

source
GeometricMachineLearning.DataLoaderMethod

Data Loader is a struct that creates an instance based on a tensor (or different input format) and is designed to make training convenient.

Constructor

The data loader can be called with various inputs:

  • A single vector: If the data loader is called with a single vector (and no other arguments are given), then this is interpreted as an autoencoder problem, i.e. the second axis indicates parameter values and/or time steps and the system has a single degree of freedom (i.e. the system dimension is one).
  • A single matrix: If the data loader is called with a single matrix (and no other arguments are given), then this is interpreted as an autoencoder problem, i.e. the first axis is assumed to indicate the degrees of freedom of the system and the second axis indicates parameter values and/or time steps.
  • A single tensor: If the data loader is called with a single tensor, then this is interpreted as an integration problem with the second axis indicating the time step and the third one indicating the parameters.
  • A tensor and a vector: This is a special case (MNIST classification problem). For the MNIST problem for example the input are $n_p$ matrices (first input argument) and $n_p$ integers (second input argument).
  • A NamedTuple with fields q and p: The NamedTuple contains (i) two matrices or (ii) two tensors.
  • An EnsembleSolution: The EnsembleSolution typically comes from GeometricProblems.

When we supply a single vector or a single matrix as input to DataLoader and further set autoencoder = false (keyword argument), then the data are stored as an integration problem and the second axis is assumed to indicate time steps.

source
GeometricMachineLearning.GSympNetType

GSympNet is called with a single input argument, the system dimension, or with an instance of DataLoader. Optional input arguments are:

  • upscaling_dimension::Int: The upscaling dimension of the gradient layer. See the documentation for GradientLayerQ and GradientLayerP for further explanation. The default is 2*dim.
  • nhidden::Int: The number of hidden layers (i.e. layers that are not input or output layers). The default is 2.
  • activation: The activation function that is applied. By default this is tanh.
  • init_upper::Bool: Initialize the gradient layer so that it first modifies the $q$-component. The default is true.
source
GeometricMachineLearning.GlobalSectionType

This implements global sections for the Stiefel manifold and the Symplectic Stiefel manifold.

In practice this is implemented using Householder reflections, with the auxiliary column vectors given by the unit vectors $e_i$ (a one in the $i$-th spot and zeros elsewhere) for $i$ in $(n+1)$ to $N$ (or with random columns).

Maybe consider dividing the output in the check functions by n!

Implement a general global section here!!!! Tₓ𝔐 → G×𝔤 !!!!!! (think about random initialization!)

source
GeometricMachineLearning.AdamOptimizerWithDecayType

Defines the Adam Optimizer with weight decay.

Constructors

The default constructor takes as input:

  • n_epochs::Int
  • η₁: the learning rate at the start
  • η₂: the learning rate at the end
  • ρ₁: the decay parameter for the first moment
  • ρ₂: the decay parameter for the second moment
  • δ: the safety parameter
  • T (keyword argument): the type.

The second constructor is called with:

  • n_epochs::Int
  • T

... the rest are keyword arguments

source
GeometricMachineLearning.BFGSCacheType

The cache for the BFGS optimizer.

It stores an array for the previous time step B and the inverse of the Hessian matrix H.

It is important to note that setting up this cache already requires a derivative! This is not the case for the other optimizers.

source
GeometricMachineLearning.BFGSDummyCacheType

In order to initialize BFGSCache we first need gradient information. This is why we initially have this BFGSDummyCache until gradient information is available.

NOTE: we may not need this.

source
GeometricMachineLearning.BatchType

Batch is a struct whose functor acts on an instance of DataLoader to produce a sequence of training samples for training for one epoch.

The Constructor

The constructor for Batch is called with:

  • batch_size::Int
  • seq_length::Int (optional)
  • prediction_window::Int (optional)

The first one of these arguments is required; it indicates the number of training samples in a batch. If we deal with time series data then we can additionally supply a sequence length and a prediction window as input arguments to Batch. These indicate the number of input vectors and the number of output vectors.

The functor

An instance of Batch can be called on an instance of DataLoader to produce a sequence of samples that contain all the input data, i.e. for training for one epoch. The output of applying batch::Batch to dl::DataLoader is a tuple of vectors of integers. Each of these vectors contains two integers: the first is the time index and the second one is the parameter index.

source
GeometricMachineLearning.ClassificationType

Classification Layer that takes a matrix as an input and returns a vector that is used for MNIST classification.

It has the following arguments:

  • M: input dimension
  • N: output dimension
  • activation: the activation function

And the following optional argument:

  • average: If this is set to true, then the output is computed as $\frac{1}{N}\sum_{i=1}^N[input]_{\bullet{}i}$. If set to false (the default) it picks the last column of the input.
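The two reduction modes selected by average can be sketched in plain Julia. This is only an illustration of the column reduction described above, not the layer itself (which also applies weights and an activation); the helper name reduce_columns is hypothetical.

```julia
# Hypothetical helper (not part of GeometricMachineLearning): reduce a matrix to a vector
# either by averaging over its columns or by picking the last column.
reduce_columns(X::AbstractMatrix; average::Bool = false) =
    average ? vec(sum(X, dims = 2)) ./ size(X, 2) : X[:, end]

X = [1.0 2.0 3.0; 4.0 5.0 9.0]

last_column = reduce_columns(X)                  # default: last column of the input
column_mean = reduce_columns(X; average = true)  # mean over the columns
```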
source
GeometricMachineLearning.ClassificationTransformerType

This is a transformer neural network for classification purposes. At the moment this is only used for training on MNIST, but can in theory be used for any classification problem.

It has to be called with a DataLoader that stores an input and an output tensor. The optional arguments are:

  • n_heads: The number of heads in the MultiHeadAttention (mha) layers. Default: 7.
  • n_layers: The number of transformer layers. Default: 16.
  • activation: The activation function. Default: softmax.
  • Stiefel: Whether the matrices in the mha layers are on the Stiefel manifold.
  • add_connection: Whether the input is appended to the output of the mha layer. (skip connection)
source
GeometricMachineLearning.DataLoaderType

Data Loader is a struct that creates an instance based on a tensor (or different input format) and is designed to make training convenient.

Constructor

The data loader can be called with various inputs:

  • A single vector: If the data loader is called with a single vector (and no other arguments are given), then this is interpreted as an autoencoder problem, i.e. the second axis indicates parameter values and/or time steps and the system has a single degree of freedom (i.e. the system dimension is one).
  • A single matrix: If the data loader is called with a single matrix (and no other arguments are given), then this is interpreted as an autoencoder problem, i.e. the first axis is assumed to indicate the degrees of freedom of the system and the second axis indicates parameter values and/or time steps.
  • A single tensor: If the data loader is called with a single tensor, then this is interpreted as an integration problem with the second axis indicating the time step and the third one indicating the parameters.
  • A tensor and a vector: This is a special case (MNIST classification problem). For the MNIST problem for example the input are $n_p$ matrices (first input argument) and $n_p$ integers (second input argument).
  • A NamedTuple with fields q and p: The NamedTuple contains (i) two matrices or (ii) two tensors.
  • An EnsembleSolution: The EnsembleSolution typically comes from GeometricProblems.

When we supply a single vector or a single matrix as input to DataLoader and further set autoencoder = false (keyword argument), then the data are stored as an integration problem and the second axis is assumed to indicate time steps.

Fields of DataLoader

The fields of the DataLoader struct are the following:

  • input: The input data with axes (i) system dimension, (ii) number of time steps and (iii) number of parameters.
  • output: The tensor that contains the output (supervised learning). This may be of type Nothing if the constructor is only called with one tensor (unsupervised learning).
  • input_dim: The dimension of the system, i.e. what is taken as input by a regular neural network.
  • input_time_steps: The length of the entire time series (length of the second axis).
  • n_params: The number of parameters that are present in the data set (length of the third axis).
  • output_dim: The dimension of the output tensor (first axis). If output is of type Nothing, then this is also of type Nothing.
  • output_time_steps: The size of the second axis of the output tensor. If output is of type Nothing, then this is also of type Nothing.

The input and output fields of DataLoader

Even though the arguments to the Constructor may be vectors or matrices, internally DataLoader always stores tensors.

source
GeometricMachineLearning.DataLoaderMethod

Data Loader is a struct that creates an instance based on a tensor (or different input format) and is designed to make training convenient.

Constructor

The data loader can be called with various inputs:

  • A single vector: If the data loader is called with a single vector (and no other arguments are given), then this is interpreted as an autoencoder problem, i.e. the second axis indicates parameter values and/or time steps and the system has a single degree of freedom (i.e. the system dimension is one).
  • A single matrix: If the data loader is called with a single matrix (and no other arguments are given), then this is interpreted as an autoencoder problem, i.e. the first axis is assumed to indicate the degrees of freedom of the system and the second axis indicates parameter values and/or time steps.
  • A single tensor: If the data loader is called with a single tensor, then this is interpreted as an integration problem with the second axis indicating the time step and the third one indicating the parameters.
  • A tensor and a vector: This is a special case (MNIST classification problem). For the MNIST problem for example the input are $n_p$ matrices (first input argument) and $n_p$ integers (second input argument).
  • A NamedTuple with fields q and p: The NamedTuple contains (i) two matrices or (ii) two tensors.
  • An EnsembleSolution: The EnsembleSolution typically comes from GeometricProblems.

When we supply a single vector or a single matrix as input to DataLoader and further set autoencoder = false (keyword argument), then the data are stored as an integration problem and the second axis is assumed to indicate time steps.

source
GeometricMachineLearning.GSympNetType

GSympNet is called with a single input argument, the system dimension, or with an instance of DataLoader. Optional input arguments are:

  • upscaling_dimension::Int: The upscaling dimension of the gradient layer. See the documentation for GradientLayerQ and GradientLayerP for further explanation. The default is 2*dim.
  • nhidden::Int: The number of hidden layers (i.e. layers that are not input or output layers). The default is 2.
  • activation: The activation function that is applied. By default this is tanh.
  • init_upper::Bool: Initialize the gradient layer so that it first modifies the $q$-component. The default is true.
source
GeometricMachineLearning.GlobalSectionType

This implements global sections for the Stiefel manifold and the Symplectic Stiefel manifold.

In practice this is implemented using Householder reflections, with the auxiliary column vectors given by the unit vectors $e_i$ (a one in the $i$-th spot and zeros elsewhere) for $i$ in $(n+1)$ to $N$ (or with random columns).

Maybe consider dividing the output in the check functions by n!

Implement a general global section here!!!! Tₓ𝔐 → G×𝔤 !!!!!! (think about random initialization!)

source
GeometricMachineLearning.GradientLayerPMethod

The gradient layer that changes the $p$ component. It is of the form:

\[\begin{bmatrix} \mathbb{I} & \mathbb{O} \\ \nabla{}V & \mathbb{I} \end{bmatrix},\]

with $V(q) = \sum_{i=1}^Ma_i\Sigma(\sum_jk_{ij}q_j+b_i)$, where $\Sigma$ is the antiderivative of the activation function $\sigma$ (one-layer neural network). We refer to $M$ as the upscaling dimension. Such layers are by construction symplectic.

source
GeometricMachineLearning.GradientLayerQMethod

The gradient layer that changes the $q$ component. It is of the form:

\[\begin{bmatrix} \mathbb{I} & \nabla{}V \\ \mathbb{O} & \mathbb{I} \end{bmatrix},\]

with $V(p) = \sum_{i=1}^Ma_i\Sigma(\sum_jk_{ij}p_j+b_i)$, where $\Sigma$ is the antiderivative of the activation function $\sigma$ (one-layer neural network). We refer to $M$ as the upscaling dimension. Such layers are by construction symplectic.

source
GeometricMachineLearning.GradientLayerQMethod

The gradient layer that changes the $q$ component. It is of the form:

\[\begin{bmatrix} \mathbb{I} & \nabla{}V \\ \mathbb{O} & \mathbb{I} \end{bmatrix},\]

with $V(p) = \sum_{i=1}^Ma_i\Sigma(\sum_jk_{ij}p_j+b_i)$, where $\Sigma$ is the antiderivative of the activation function $\sigma$ (one-layer neural network). We refer to $M$ as the upscaling dimension. Such layers are by construction symplectic.
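As a minimal sketch of the formula above (plain Julia, not the package's implementation), the layer that changes $q$ can be written out explicitly. The choice $\sigma = \tanh$, the helper names grad_V and gradient_layer_q, and the concrete values of K, a and b are assumptions made for illustration only.

```julia
using LinearAlgebra

# ∇V(p) for V(p) = Σᵢ aᵢ Σ(Σⱼ kᵢⱼ pⱼ + bᵢ) with Σ' = σ; here σ = tanh (an assumption).
grad_V(p, K, a, b) = K' * (a .* tanh.(K * p .+ b))

# Hypothetical helper: q is shifted by ∇V(p), p is left unchanged.
gradient_layer_q(q, p, K, a, b) = (q .+ grad_V(p, K, a, b), p)

n, M = 2, 4                                  # half the system dimension and the upscaling dimension
K = reshape(collect(1.0:8.0), M, n) ./ 10    # M × n weight matrix
a = [0.5, -0.25, 1.0, 0.75]
b = zeros(M)
q, p = [1.0, 2.0], [0.1, -0.2]

q_new, p_new = gradient_layer_q(q, p, K, a, b)

# The Jacobian of p ↦ ∇V(p) is Kᵀ diag(a ⊙ σ'(Kp + b)) K, which is symmetric;
# this symmetry is what makes the layer symplectic.
hessian_V = K' * Diagonal(a .* sech.(K * p .+ b) .^ 2) * K
```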

source
GeometricMachineLearning.LASympNetType

LASympNet is called with a single input argument, the system dimension, or with an instance of DataLoader. Optional input arguments are:

  • depth::Int: The number of linear layers that are applied. The default is 5.
  • nhidden::Int: The number of hidden layers (i.e. layers that are not input or output layers). The default is 2.
  • activation: The activation function that is applied. By default this is tanh.
  • init_upper_linear::Bool: Initialize the linear layer so that it first modifies the $q$-component. The default is true.
  • init_upper_act::Bool: Initialize the activation layer so that it first modifies the $q$-component. The default is true.
source
GeometricMachineLearning.LinearLayerPMethod

Equivalent to a left multiplication by the matrix:

\[\begin{pmatrix} \mathbb{I} & \mathbb{O} \\ B & \mathbb{I} \end{pmatrix},\]

where $B$ is a symmetric matrix.

source
GeometricMachineLearning.LASympNetType

LASympNet is called with a single input argument, the system dimension, or with an instance of DataLoader. Optional input arguments are:

  • depth::Int: The number of linear layers that are applied. The default is 5.
  • nhidden::Int: The number of hidden layers (i.e. layers that are not input or output layers). The default is 2.
  • activation: The activation function that is applied. By default this is tanh.
  • init_upper_linear::Bool: Initialize the linear layer so that it first modifies the $q$-component. The default is true.
  • init_upper_act::Bool: Initialize the activation layer so that it first modifies the $q$-component. The default is true.
source
GeometricMachineLearning.LowerTriangularType

A lower-triangular matrix is an $n\times{}n$ matrix that has ones on the diagonal and zeros on the upper triangular.

The data are stored in a vector $S$ similarly to SkewSymMatrix.

The struct has two fields: S and n. The first stores all the entries of the matrix in a sparse fashion (in a vector) and the second is the dimension $n$ for $A\in\mathbb{R}^{n\times{}n}$.

source
GeometricMachineLearning.OptimizerType

Optimizer struct that stores the 'method' (i.e. Adam with corresponding hyperparameters), the cache and the optimization step.

It takes as input an optimization method and the parameters of a network.

For technical reasons we first specify an OptimizerMethod that stores all the hyperparameters of the optimizer.

source
GeometricMachineLearning.OptimizerMethod

A functor for Optimizer. It is called with:

  • nn::NeuralNetwork
  • dl::DataLoader
  • batch::Batch
  • n_epochs::Int
  • loss

The last argument is a function through which Zygote differentiates. This argument is optional; if it is not supplied GeometricMachineLearning defaults to an appropriate loss for the DataLoader.

source
GeometricMachineLearning.PSDLayerType

This is a PSD-like layer used for symplectic autoencoders. One layer has the following shape:

\[A = \begin{bmatrix} \Phi & \mathbb{O} \\ \mathbb{O} & \Phi \end{bmatrix},\]

where $\Phi$ is an element of the Stiefel manifold $St(n, N)$.

The constructor of PSDLayer is called by PSDLayer(M, N; retraction=retraction):

  • M is the input dimension.
  • N is the output dimension.
  • retraction is an instance of a struct with supertype AbstractRetraction. The only options at the moment are Geodesic() and Cayley().
source
GeometricMachineLearning.ReducedSystemType

ReducedSystem computes the reconstructed dynamics in the full system based on the reduced one. Optionally it can be compared to the FOM solution.

It can be called using the following constructor: ReducedSystem(N, n, encoder, decoder, fullvectorfield, reducedvectorfield, params, tspan, tstep, ics, projection_error) where

  • encoder: a function $\mathbb{R}^{2N}\mapsto{}\mathbb{R}^{2n}$
  • decoder: a (differentiable) function $\mathbb{R}^{2n}\mapsto\mathbb{R}^{2N}$
  • fullvectorfield: a (differentiable) mapping defined the same way as in GeometricIntegrators
  • reducedvectorfield: a (differentiable) mapping defined the same way as in GeometricIntegrators
  • params: a NamedTuple that parametrizes the vector fields (the same for fullvectorfield and reducedvectorfield)
  • tspan: a tuple (t₀, tₗ) that specifies start and end point of the time interval over which integration is performed.
  • tstep: the time step
  • ics: the initial condition for the big system.
  • projection_error: the error $||M - \mathcal{R}\circ\mathcal{P}(M)||$ where $M$ is the snapshot matrix; $\mathcal{P}$ and $\mathcal{R}$ are the reduction and reconstruction respectively.
source
GeometricMachineLearning.RegularTransformerIntegratorType

The regular transformer used as an integrator (multi-step method).

The constructor is called with the following arguments:

  • sys_dim::Int
  • transformer_dim::Int: the default is transformer_dim = sys_dim.
  • n_blocks::Int: The default is 1.
  • n_heads::Int: the number of heads in the multihead attention layer (default is n_heads = sys_dim)
  • L::Int: the number of transformer blocks (default is L = 2).
  • upscaling_activation: by default identity
  • resnet_activation: by default tanh
  • add_connection::Bool=true (keyword argument): if the input should be added to the output.
source
GeometricMachineLearning.SkewSymMatrixType

A SkewSymMatrix is a matrix $A$ s.t. $A^T = -A$.

If the constructor is called with a matrix as input it returns a skew-symmetric matrix via the projection $A \mapsto \frac{1}{2}(A - A^T)$. This is a projection defined via the canonical metric $\mathbb{R}^{n\times{}n}\times\mathbb{R}^{n\times{}n}\to\mathbb{R}, (A,B) \mapsto \mathrm{Tr}(A^TB)$.

The first index is the row index, the second one the column index.

The struct has two fields: S and n. The first stores all the entries of the matrix in a sparse fashion (in a vector) and the second is the dimension $n$ for $A\in\mathbb{R}^{n\times{}n}$.

source
GeometricMachineLearning.StiefelLieAlgHorMatrixType

StiefelLieAlgHorMatrix is the horizontal component of the Lie algebra of skew-symmetric matrices (with respect to the canonical metric). The projection here is $\pi:S \to SE$, where

\[E = \begin{pmatrix} \mathbb{I}_{n} \\ \mathbb{O}_{(N-n)\times{}n} \end{pmatrix}.\]

The matrix $E$ is implemented under StiefelProjection in GeometricMachineLearning.

An element of StiefelLieAlgHorMatrix takes the form:

\[\begin{pmatrix} A & -B^T \\ B & \mathbb{O} \end{pmatrix},\]

where $A$ is skew-symmetric (this is SkewSymMatrix in GeometricMachineLearning).

source
GeometricMachineLearning.LowerTriangularType

A lower-triangular matrix is an $n\times{}n$ matrix that has ones on the diagonal and zeros on the upper triangular.

The data are stored in a vector $S$ similarly to SkewSymMatrix.

The struct has two fields: S and n. The first stores all the entries of the matrix in a sparse fashion (in a vector) and the second is the dimension $n$ for $A\in\mathbb{R}^{n\times{}n}$.

source
GeometricMachineLearning.OptimizerType

Optimizer struct that stores the 'method' (i.e. Adam with corresponding hyperparameters), the cache and the optimization step.

It takes as input an optimization method and the parameters of a network.

For technical reasons we first specify an OptimizerMethod that stores all the hyperparameters of the optimizer.

source
GeometricMachineLearning.OptimizerMethod

A functor for Optimizer. It is called with:

  • nn::NeuralNetwork
  • dl::DataLoader
  • batch::Batch
  • n_epochs::Int
  • loss

The last argument is a function through which Zygote differentiates. This argument is optional; if it is not supplied GeometricMachineLearning defaults to an appropriate loss for the DataLoader.

source
GeometricMachineLearning.PSDLayerType

This is a PSD-like layer used for symplectic autoencoders. One layer has the following shape:

\[A = \begin{bmatrix} \Phi & \mathbb{O} \\ \mathbb{O} & \Phi \end{bmatrix},\]

where $\Phi$ is an element of the Stiefel manifold $St(n, N)$.

The constructor of PSDLayer is called by PSDLayer(M, N; retraction=retraction):

  • M is the input dimension.
  • N is the output dimension.
  • retraction is an instance of a struct with supertype AbstractRetraction. The only options at the moment are Geodesic() and Cayley().
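The block structure of $A$ can be checked numerically. The following is a sketch in plain Julia under stated assumptions: the helper J (the canonical symplectic matrix) and the QR-based construction of $\Phi$ are illustrative choices and are not taken from the package. The check shows that $A^TJ_{2N}A = J_{2n}$ whenever $\Phi$ has orthonormal columns.

```julia
using LinearAlgebra

# Hypothetical helper: canonical symplectic matrix of size 2n × 2n.
J(n) = [zeros(n, n) Matrix(1.0I, n, n); -Matrix(1.0I, n, n) zeros(n, n)]

N, n = 5, 2

# An N × n matrix with orthonormal columns (an element of the Stiefel manifold),
# obtained here from a QR decomposition of a random matrix.
Φ = Matrix(qr(randn(N, n)).Q)[:, 1:n]

# The PSD-like matrix A = [Φ O; O Φ] of size 2N × 2n.
A = [Φ zeros(N, n); zeros(N, n) Φ]

# Symplecticity check: AᵀJ₂ₙ-type identity for the rectangular case.
symplectic_residual = norm(A' * J(N) * A - J(n))
```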
source
GeometricMachineLearning.ReducedSystemType

ReducedSystem computes the reconstructed dynamics in the full system based on the reduced one. Optionally it can be compared to the FOM solution.

It can be called using the following constructor: ReducedSystem(N, n, encoder, decoder, fullvectorfield, reducedvectorfield, params, tspan, tstep, ics, projection_error) where

  • encoder: a function $\mathbb{R}^{2N}\mapsto{}\mathbb{R}^{2n}$
  • decoder: a (differentiable) function $\mathbb{R}^{2n}\mapsto\mathbb{R}^{2N}$
  • fullvectorfield: a (differentiable) mapping defined the same way as in GeometricIntegrators
  • reducedvectorfield: a (differentiable) mapping defined the same way as in GeometricIntegrators
  • params: a NamedTuple that parametrizes the vector fields (the same for fullvectorfield and reducedvectorfield)
  • tspan: a tuple (t₀, tₗ) that specifies start and end point of the time interval over which integration is performed.
  • tstep: the time step
  • ics: the initial condition for the big system.
  • projection_error: the error $||M - \mathcal{R}\circ\mathcal{P}(M)||$ where $M$ is the snapshot matrix; $\mathcal{P}$ and $\mathcal{R}$ are the reduction and reconstruction respectively.
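To illustrate the projection error, here is a toy sketch in plain Julia. It assumes a linear reduction and reconstruction built from a matrix $\Phi$ with orthonormal columns and uses the Frobenius norm; the names P, R and projection_error are hypothetical and the dimensions are chosen only for readability.

```julia
using LinearAlgebra

# Hypothetical linear reduction/reconstruction pair.
Φ = [1.0 0.0; 0.0 1.0; 0.0 0.0]          # 3 × 2 with ΦᵀΦ = I
P(x) = Φ' * x                            # reduction (encoder)
R(z) = Φ * z                             # reconstruction (decoder)

# Frobenius norm of M - R∘P(M); the choice of norm is an assumption.
projection_error(M) = norm(M - R(P(M)))

M = [1.0 2.0; 3.0 4.0; 0.5 0.0]          # snapshot matrix, one snapshot per column
err = projection_error(M)                # only the third row of M is lost
```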
source
GeometricMachineLearning.RegularTransformerIntegratorType

The regular transformer used as an integrator (multi-step method).

The constructor is called with the following arguments:

  • sys_dim::Int
  • transformer_dim::Int: the default is transformer_dim = sys_dim.
  • n_blocks::Int: The default is 1.
  • n_heads::Int: the number of heads in the multihead attention layer (default is n_heads = sys_dim)
  • L::Int: the number of transformer blocks (default is L = 2).
  • upscaling_activation: by default identity
  • resnet_activation: by default tanh
  • add_connection::Bool=true (keyword argument): if the input should be added to the output.
source
GeometricMachineLearning.SkewSymMatrixType

A SkewSymMatrix is a matrix $A$ s.t. $A^T = -A$.

If the constructor is called with a matrix as input it returns a skew-symmetric matrix via the projection $A \mapsto \frac{1}{2}(A - A^T)$. This is a projection defined via the canonical metric $\mathbb{R}^{n\times{}n}\times\mathbb{R}^{n\times{}n}\to\mathbb{R}, (A,B) \mapsto \mathrm{Tr}(A^TB)$.
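The projection can be tried out with dense matrices. This is a sketch in plain Julia; skew and sym are stand-in helpers, not the SkewSymMatrix constructor. It shows that the map is idempotent and that the skew-symmetric and symmetric parts are orthogonal with respect to $(A,B) \mapsto \mathrm{Tr}(A^TB)$.

```julia
using LinearAlgebra

# Skew-symmetrization A ↦ (A - Aᵀ)/2 and the complementary symmetrization A ↦ (A + Aᵀ)/2.
skew(A) = (A - A') / 2
sym(A) = (A + A') / 2

A = [1.0 2.0 3.0; 4.0 5.0 6.0; 7.0 8.0 10.0]
S = skew(A)

# Inner product of the two parts under the canonical metric Tr(AᵀB).
inner = tr(S' * sym(A))
```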

The first index is the row index, the second one the column index.

The struct has two fields: S and n. The first stores all the entries of the matrix in a sparse fashion (in a vector) and the second is the dimension $n$ for $A\in\mathbb{R}^{n\times{}n}$.

source
GeometricMachineLearning.StiefelLieAlgHorMatrixType

StiefelLieAlgHorMatrix is the horizontal component of the Lie algebra of skew-symmetric matrices (with respect to the canonical metric). The projection here is $\pi:S \to SE$, where

\[E = \begin{pmatrix} \mathbb{I}_{n} \\ \mathbb{O}_{(N-n)\times{}n} \end{pmatrix}.\]

The matrix $E$ is implemented under StiefelProjection in GeometricMachineLearning.

An element of StiefelLieAlgHorMatrix takes the form:

\[\begin{pmatrix} A & -B^T \\ B & \mathbb{O} \end{pmatrix},\]

where $A$ is skew-symmetric (this is SkewSymMatrix in GeometricMachineLearning).

If the constructor is called with a big $N\times{}N$ matrix, then the projection is performed the following way:

\[\begin{pmatrix} A & B_1 \\ B_2 & D \end{pmatrix} \mapsto \begin{pmatrix} \mathrm{skew}(A) & -B_2^T \\ B_2 & \mathbb{O} \end{pmatrix}.\]

The operation $\mathrm{skew}:\mathbb{R}^{n\times{}n}\to\mathcal{S}_\mathrm{skew}(n)$ is the skew-symmetrization operation. This is equivalent to calling the constructor of SkewSymMatrix with an $n\times{}n$ matrix.
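The projection of a big matrix can be sketched with dense arrays. The following plain-Julia example is illustrative only: horizontal_projection is a hypothetical helper, and the package stores these matrices in a dedicated struct rather than as dense arrays. It also shows that multiplying with $E$ keeps exactly the first $n$ columns.

```julia
using LinearAlgebra

skew(A) = (A - A') / 2

# Hypothetical helper: keep skew(A) and B₂, overwrite the upper right block with -B₂ᵀ
# and the lower right block with zeros.
function horizontal_projection(X::AbstractMatrix, n::Integer)
    N = size(X, 1)
    A = X[1:n, 1:n]
    B₂ = X[(n + 1):N, 1:n]
    [skew(A) (-B₂'); B₂ zeros(N - n, N - n)]
end

N, n = 4, 2
X = reshape(collect(1.0:16.0), N, N)
H = horizontal_projection(X, n)

# E = [I; O], i.e. what vcat(I(n), zeros(N-n, n)) describes.
E = [Matrix(1.0I, n, n); zeros(N - n, n)]
HE = H * E
```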

source
GeometricMachineLearning.StiefelProjectionType

An array that essentially does vcat(I(n), zeros(N-n, n)) with GPU support. It has three inner constructors. The first one is called with the following arguments:

  1. backend: backends as supported by KernelAbstractions.
  2. T::Type
  3. N::Integer
  4. n::Integer

The second constructor is called by supplying a matrix as input. The constructor will then extract the backend, the type and the dimensions of that matrix.

The third constructor is called by supplying an instance of StiefelLieAlgHorMatrix.

Technically this should be a subtype of StiefelManifold.

source
GeometricMachineLearning.SymmetricMatrixType

A SymmetricMatrix $A$ is a matrix s.t. $A^T = A$.

If the constructor is called with a matrix as input it returns a symmetric matrix via the projection:

\[A \mapsto \frac{1}{2}(A + A^T).\]

This is a projection defined via the canonical metric $(A,B) \mapsto \mathrm{tr}(A^TB)$.

Internally the struct saves a vector $S$ of size $n(n+1)\div2$. The conversion is done the following way:

\[[A]_{ij} = \begin{cases} S[( (i-1) i ) \div 2 + j] & \text{if $i\geq{}j$}\\ S[( (j-1) j ) \div 2 + i] & \text{else}. \end{cases}\]

So $S$ stores a string of vectors taken from $A$: $S = [\tilde{a}_1, \tilde{a}_2, \ldots, \tilde{a}_n]$ with $\tilde{a}_i = [[A]_{i1},[A]_{i2},\ldots,[A]_{ii}]$.

source
GeometricMachineLearning.SympNetLayerType

Implements the various layers from the SympNet paper: (https://www.sciencedirect.com/science/article/abs/pii/S0893608020303063). This is a super type of Gradient, Activation and Linear.

For the linear layer, the activation and the bias are left out, and for the activation layer $K$ and $b$ are left out!

source
GeometricMachineLearning.SymplecticPotentialType

SymplecticPotential(n)

Returns a symplectic matrix of size 2n x 2n

\[\begin{pmatrix} \mathbb{O} & \mathbb{I} \\ -\mathbb{I} & \mathbb{O} \end{pmatrix}.\]

source
GeometricMachineLearning.StiefelProjectionType

An array that essentially does vcat(I(n), zeros(N-n, n)) with GPU support. It has three inner constructors. The first one is called with the following arguments:

  1. backend: backends as supported by KernelAbstractions.
  2. T::Type
  3. N::Integer
  4. n::Integer

The second constructor is called by supplying a matrix as input. The constructor will then extract the backend, the type and the dimensions of that matrix.

The third constructor is called by supplying an instance of StiefelLieAlgHorMatrix.

Technically this should be a subtype of StiefelManifold.

source
GeometricMachineLearning.SymmetricMatrixType

A SymmetricMatrix $A$ is a matrix s.t. $A^T = A$.

If the constructor is called with a matrix as input it returns a symmetric matrix via the projection:

\[A \mapsto \frac{1}{2}(A + A^T).\]

This is a projection defined via the canonical metric $(A,B) \mapsto \mathrm{tr}(A^TB)$.

Internally the struct saves a vector $S$ of size $n(n+1)\div2$. The conversion is done the following way:

\[[A]_{ij} = \begin{cases} S[( (i-1) i ) \div 2 + j] & \text{if $i\geq{}j$}\\ S[( (j-1) j ) \div 2 + i] & \text{else}. \end{cases}\]

So $S$ stores a string of vectors taken from $A$: $S = [\tilde{a}_1, \tilde{a}_2, \ldots, \tilde{a}_n]$ with $\tilde{a}_i = [[A]_{i1},[A]_{i2},\ldots,[A]_{ii}]$.

source
GeometricMachineLearning.UpperTriangularType

An upper-triangular matrix is an $n\times{}n$ matrix that has ones on the diagonal and zeros on the lower triangular.

The data are stored in a vector $S$ similarly to SkewSymMatrix.

The struct has two fields: S and n. The first stores all the entries of the matrix in a sparse fashion (in a vector) and the second is the dimension $n$ for $A\in\mathbb{R}^{n\times{}n}$.

source
GeometricMachineLearning.VolumePreservingAttentionType

Volume-preserving attention (single head attention)

Drawbacks:

  • the super fast activation is only implemented for sequence lengths of 2, 3, 4 and 5.
  • other sequence lengths only work on CPU for now (the LU decomposition has to be implemented to work for tensors in parallel).

Constructor

The constructor is called with:

  • dim::Int: The system dimension
  • seq_length::Int: The sequence length to be considered. The default is zero, i.e. arbitrary sequence lengths; this works for all sequence lengths but doesn't apply the super-fast activation.
  • skew_sym::Bool (keyword argument): specifies if the weight matrix is skew symmetric or arbitrary (default is false).

Functor

Applying a layer of type VolumePreservingAttention does the following:

  • First we perform the operation $X \mapsto X^T A X =: C$, where $X\in\mathbb{R}^{N\times\mathtt{seq\_length}}$ is a matrix containing time series data and $A$ is the skew symmetric matrix associated with the layer.
  • In a second step we compute the Cayley transform of $C$; $\Lambda = \mathrm{Cayley}(C)$.
  • The output of the layer is then $X\Lambda$.
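
The three steps above can be sketched in plain Julia. This is only an illustration and not the package's implementation: the helper names `cayley` and `volume_preserving_attention_sketch` are hypothetical, and the Cayley transform is written in one common convention, $\mathrm{Cayley}(C) = (\mathbb{I} - C)^{-1}(\mathbb{I} + C)$, which may differ from the scaling used in the package.

```julia
using LinearAlgebra

# Cayley transform in one common convention (an assumption, see the text above).
cayley(C::AbstractMatrix) = (I - C) \ (I + C)

# Hypothetical helper that mirrors the three steps of the functor.
function volume_preserving_attention_sketch(X::AbstractMatrix, A::AbstractMatrix)
    C = X' * A * X      # seq_length × seq_length; skew-symmetric if A is
    Λ = cayley(C)       # orthogonal whenever C is skew-symmetric
    return X * Λ        # recombines the columns (time steps) of X
end

N, seq_length = 4, 3
B = randn(N, N)
A = B - B'                        # a skew-symmetric weight matrix
X = randn(N, seq_length)
Λ = cayley(X' * A * X)
Y = volume_preserving_attention_sketch(X, A)
```

For a skew-symmetric $A$ the matrix $\Lambda$ is orthogonal with determinant one, so the map $X \mapsto X\Lambda$ only recombines the columns of $X$ without changing volume.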
source
GeometricMachineLearning.VolumePreservingFeedForwardType

Realizes a volume-preserving neural network as a combination of VolumePreservingLowerLayer and VolumePreservingUpperLayer.

Constructor

The constructor is called with the following arguments:

  • sys_dim::Int: The system dimension.
  • n_blocks::Int: The number of blocks in the neural network (containing linear layers and nonlinear layers). Default is 1.
  • n_linear::Int: The number of linear VolumePreservingLowerLayers and VolumePreservingUpperLayers in one block. Default is 1.
  • activation: The activation function for the nonlinear layers in a block.
  • init_upper::Bool=false (keyword argument): Specifies if the first layer is lower or upper.
source
GeometricMachineLearning.VolumePreservingFeedForwardLayerType

Super-type of VolumePreservingLowerLayer and VolumePreservingUpperLayer. The layers do the following:

\[x \mapsto \begin{cases} \sigma(Lx + b) & \text{where $L$ is }\mathtt{LowerTriangular} \\ \sigma(Ux + b) & \text{where $U$ is }\mathtt{UpperTriangular}. \end{cases}\]

The functor can be applied to a vector, a matrix or a tensor.

Constructor

The constructors are called with:

  • sys_dim::Int: the system dimension.
  • activation=tanh: the activation function.
  • include_bias::Bool=true (keyword argument): specifies whether a bias should be used.
source
AbstractNeuralNetworks.update!Method

Optimization for an entire neural network with BFGS. What is different in this case is that we still have to initialize the cache.

If o.step == 1, then we initialize the cache.

source
Base.iterateMethod

This function computes a trajectory for a Transformer that has already been trained, for validation purposes.

It takes as input:

  • nn: a NeuralNetwork (that has been trained).
  • ics: initial conditions (a matrix in $\mathbb{R}^{2n\times\mathtt{seq\_length}}$ or NamedTuple of two matrices in $\mathbb{R}^{n\times\mathtt{seq\_length}}$)
  • n_points::Int=100 (keyword argument): The number of steps for which we run the prediction.
  • prediction_window::Int=size(ics.q, 2): The prediction window (i.e. the number of steps we predict into the future) is equal to the sequence length (i.e. the number of input time steps) by default.
source
Base.iterateMethod

This function computes a trajectory for a SympNet that has already been trained, for validation purposes.

It takes as input:

  • nn: a NeuralNetwork (that has been trained).
  • ics: initial conditions (a NamedTuple of two vectors)
source
Base.vecMethod

If vec is applied onto Triangular, then the output is the associated vector.

source
Base.vecMethod

If vec is applied onto SkewSymMatrix, then the output is the associated vector.

source
GeometricMachineLearning.GradientFunction

This is an old constructor and will be deprecated. For change_q=true it is equivalent to GradientLayerQ; for change_q=false it is equivalent to GradientLayerP.

If full_grad=false then ActivationLayer is called.

source
GeometricMachineLearning.TransformerMethod

The architecture for a "transformer encoder" is essentially taken from arXiv:2010.11929, but with the difference that no layer normalization is employed. This is because we still need to find a generalization of layer normalization to manifolds.

The transformer is called with the following inputs:

  • dim: the dimension of the transformer
  • n_heads: the number of heads
  • L: the number of transformer blocks

In addition we have the following optional arguments:

  • activation: the activation function used for the ResNet (tanh by default)
  • Stiefel::Bool: if the matrices $P^V$, $P^Q$ and $P^K$ should live on a manifold (false by default)
  • retraction: which retraction should be used (Geodesic() by default)
  • add_connection::Bool: if the input should be added to the output after the MultiHeadAttention layer is used (true by default)
  • use_bias::Bool: If the ResNet should use a bias (true by default)
source
GeometricMachineLearning.accuracyMethod

Computes the accuracy (as opposed to the loss) of a neural network classifier.

It takes as input:

  • model::Chain
  • ps: parameters of the network
  • dl::DataLoader
source
GeometricMachineLearning.apply_layer_to_nt_and_return_arrayMethod

This function is used in the wrappers where the input to the SympNet layers is not a NamedTuple (as it should be) but an AbstractArray.

It converts the Array to a NamedTuple (via assign_q_and_p), then calls the SympNet routine(s) and converts back to an AbstractArray (with vcat).

source
GeometricMachineLearning.assign_batch_kernel!Method

Takes as input a batch tensor (to which the data are assigned), the whole data tensor and two vectors params and time_steps that include the specific parameters and time steps we want to assign.

Note that this assigns sequential data, e.g. for being processed by a transformer.

source
GeometricMachineLearning.assign_output_estimateMethod

The function assign_output_estimate is closely related to the transformer. It takes the last prediction_window columns of the output and uses them for the final prediction, i.e.

\[\mathbb{R}^{N\times{}T}\to\mathbb{R}^{N\times\mathtt{pw}}, \begin{bmatrix} z^{(1)}_1 & \cdots & z^{(T)}_1 \\ \cdots & \cdots & \cdots \\ z^{(1)}_N & \cdots & z^{(T)}_N \end{bmatrix} \mapsto \begin{bmatrix} z^{(T - \mathtt{pw} + 1)}_1 & \cdots & z^{(T)}_1 \\ \cdots & \cdots & \cdots \\ z^{(T - \mathtt{pw} + 1)}_N & \cdots & z^{(T)}_N\end{bmatrix} \]

source
GeometricMachineLearning.assign_q_and_pMethod

Allocates two new arrays q and p whose first dimension is half of that of the input x. This should also be supplied through the second argument N.

The output is a Tuple containing q and p.

source
GeometricMachineLearning.init_optimizer_cacheMethod

Wrapper for the functions setup_adam_cache, setup_momentum_cache, setup_gradient_cache, setup_bfgs_cache. These appear outside of optimizer_caches.jl because the OptimizerMethods first have to be defined.

source
GeometricMachineLearning.lossMethod

Wrapper if we deal with a neural network.

You can supply an instance of NeuralNetwork instead of the two arguments model (of type Union{Chain, AbstractExplicitLayer}) and parameters (of type Union{Tuple, NamedTuple}).

source
GeometricMachineLearning.lossMethod

Computes the loss for a neural network and a data set. The computed loss is

\[||output - \mathcal{NN}(input)||_F/||output||_F,\]

where $||A||_F := \sqrt{\sum_{i_1,\ldots,i_k}|a_{i_1,\ldots,i_k}|^2}$ is the Frobenius norm.

It takes as input:

  • model::Union{Chain, AbstractExplicitLayer}
  • ps::Union{Tuple, NamedTuple}
  • input::Union{Array, NamedTuple}
  • output::Union{Array, NamedTuple}
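
The relative error above can be sketched as follows. This is a minimal illustration for plain arrays only: `relative_loss_sketch` and `toy_nn` are hypothetical names, and `nn` stands for any callable that plays the role of the model evaluated with its parameters. Julia's `norm` applied to an array of any dimension computes the Frobenius norm defined above.

```julia
using LinearAlgebra

# Hypothetical stand-in for the loss: relative error in the Frobenius norm.
relative_loss_sketch(nn, input, output) = norm(output - nn(input)) / norm(output)

# A toy "network": multiplication with a fixed matrix.
W = [1.0 0.0; 0.0 2.0]
toy_nn(x) = W * x

input  = [1.0 2.0; 3.0 4.0]
output = W * input

exact_loss = relative_loss_sketch(toy_nn, input, output)       # prediction is exact
zero_loss  = relative_loss_sketch(x -> zero(x), input, output) # prediction is all zeros
```

Dividing by $||output||_F$ makes the value independent of the scale of the data: a network that always predicts zero has loss one.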
source
GeometricMachineLearning.lossMethod

The autoencoder loss:

\[||output - \mathcal{NN}(input)||_F/||output||_F.\]

It takes as input:

  • model::Union{Chain, AbstractExplicitLayer}
  • ps::Union{Tuple, NamedTuple}
  • input::Union{Array, NamedTuple}
source
GeometricMachineLearning.metricMethod

Implements the canonical Riemannian metric for the Stiefel manifold:

\[g_Y: (\Delta_1, \Delta_2) \mapsto \mathrm{tr}(\Delta_1^T(\mathbb{I} - \frac{1}{2}YY^T)\Delta_2).\]

It is called with:

  • Y::StiefelManifold
  • Δ₁::AbstractMatrix
  • Δ₂::AbstractMatrix
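
As a small worked example, the formula can be evaluated directly with dense matrices. The function name `canonical_metric_sketch` is hypothetical and plain matrices are used in place of the StiefelManifold type.

```julia
using LinearAlgebra

# Direct transcription of g_Y(Δ₁, Δ₂) = tr(Δ₁ᵀ (I - ½ Y Yᵀ) Δ₂).
canonical_metric_sketch(Y, Δ₁, Δ₂) = tr(Δ₁' * (I - Y * Y' / 2) * Δ₂)

Y  = [1.0 0.0; 0.0 1.0; 0.0 0.0]      # an element of St(3, 2)
Δa = [0.0 -1.0; 1.0 0.0; 0.0 0.0]     # tangent vector of the form Y * (skew matrix)
Δb = [0.0 0.0; 0.0 0.0; 1.0 0.0]      # tangent vector orthogonal to the columns of Y

g_aa = canonical_metric_sketch(Y, Δa, Δa)
g_bb = canonical_metric_sketch(Y, Δb, Δb)
g_ab = canonical_metric_sketch(Y, Δa, Δb)
```

Here `Δa` has squared Euclidean norm 2 but canonical norm 1: the factor $\frac{1}{2}$ halves the weight of directions that stay within the span of $Y$, whereas `Δb` is weighted as in the Euclidean case.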
source
GeometricMachineLearning.onehotbatchMethod

One-hot-batch encoding of a vector of integers: $input\in\{0,1,\ldots,9\}^\ell$. The output is a tensor of shape $10\times1\times\ell$.

\[0 \mapsto \begin{bmatrix} 1 & 0 & \ldots & 0 \end{bmatrix}.\]

In more abstract terms: $i \mapsto e_i$.
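
A minimal sketch of such an encoding is given below. The name `onehotbatch_sketch` is hypothetical, and the sketch assumes that digit $i$ sets entry $i + 1$ of the corresponding column, which matches the example for $0$ above when the basis vectors are counted from one.

```julia
# Hypothetical re-implementation for illustration: digits 0..9 become columns
# of a 10 × 1 × ℓ tensor with a single one per column.
function onehotbatch_sketch(labels::AbstractVector{<:Integer})
    output = zeros(Int, 10, 1, length(labels))
    for (k, i) in enumerate(labels)
        output[i + 1, 1, k] = 1    # Julia arrays are 1-indexed
    end
    return output
end

encoded = onehotbatch_sketch([0, 3, 9])
```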

source
GeometricMachineLearning.optimization_step!Method

Optimization for a single layer.

inputs:

  • o::Optimizer
  • d::Union{AbstractExplicitLayer, AbstractExplicitCell}
  • ps::NamedTuple: the parameters
  • C::NamedTuple: NamedTuple of the caches
  • dx::NamedTuple: NamedTuple of the derivatives (output of AD routine)

ps, C and dx must have the same keys.

source
GeometricMachineLearning.optimize_for_one_epoch!Method

Optimize for an entire epoch. For this you have to supply:

  • an instance of the optimizer.
  • the neural network model
  • the parameters of the model
  • the data (in form of DataLoader)
  • an instance of Batch that contains batch_size (and optionally seq_length)

With the optional argument:

  • the loss, which takes the model, the parameters ps and an instance of DataLoader as input.

The output of optimize_for_one_epoch! is the average loss over all batches of the epoch:

\[output = \frac{1}{\mathtt{steps\_per\_epoch}}\sum_{t=1}^\mathtt{steps\_per\_epoch}loss(\theta^{(t-1)}).\]

This is done because any reverse differentiation routine always has two outputs: a pullback and the value of the function it is differentiating. In the case of Zygote: loss_value, pullback = Zygote.pullback(ps -> loss(ps), ps) (if the loss only depends on the parameters).

source
GeometricMachineLearning.rgradMethod

Computes the Riemannian gradient for the Stiefel manifold given an element $Y\in{}St(N,n)$ and a matrix $\nabla{}L\in\mathbb{R}^{N\times{}n}$ (the Euclidean gradient). It computes the Riemannian gradient with respect to the canonical metric (see the documentation for the function metric for an explanation of this). The precise form of the mapping is:

\[\mathtt{rgrad}: (Y, \nabla{}L) \mapsto \nabla{}L - Y(\nabla{}L)^TY\]

It is called with inputs:

  • Y::StiefelManifold
  • e_grad::AbstractMatrix: i.e. the Euclidean gradient (what was called $\nabla{}L$ above).
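
The mapping can be checked on a small example with dense matrices. The name `rgrad_sketch` is hypothetical and plain matrices stand in for the StiefelManifold type. The result should lie in the tangent space $T_Y St(N,n)$, i.e. satisfy $Y^T\Delta + \Delta^TY = \mathbb{O}$.

```julia
using LinearAlgebra

# Direct transcription of ∇L - Y (∇L)ᵀ Y.
rgrad_sketch(Y, ∇L) = ∇L - Y * ∇L' * Y

Y  = [1.0 0.0; 0.0 1.0; 0.0 0.0]      # an element of St(3, 2)
∇L = [1.0 2.0; 3.0 4.0; 5.0 6.0]      # some Euclidean gradient

G = rgrad_sketch(Y, ∇L)
tangency_defect = norm(Y' * G + G' * Y)   # vanishes for tangent vectors
```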
source
GeometricMachineLearning.split_and_flattenMethod

split_and_flatten takes a tensor as input and produces another one as output (essentially rearranges the input data in an intricate way) so that it can easily be processed with a transformer.

The optional arguments are:

  • patch_length: by default this is 7.
  • number_of_patches: by default this is 16.
source
GeometricMachineLearning.tensor_mat_skew_sym_assignMethod

Takes as input:

  • Z::AbstractArray{T, 3}: A tensor that stores a bunch of time series.
  • A::AbstractMatrix: A matrix that is used to perform various scalar products.

For one of these time series the function performs the following computation:

\[ (z^{(i)}, z^{(j)}) \mapsto (z^{(i)})^TAz^{(j)} \text{ for } i > j.\]

This results in $n(n-1)\div2$ scalar products. These scalar products are written into a lower-triangular matrix and the final output of the function is a tensor of these lower-triangular matrices.
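
A sequential sketch of this computation is shown below. It is not the kernel used in the package: the name `skew_sym_products_sketch` is hypothetical, and it assumes that the second index of `Z` is the time step and the third index enumerates the time series.

```julia
using LinearAlgebra

# For every time series k, store (z⁽ⁱ⁾)ᵀ A z⁽ʲ⁾ for i > j below the diagonal.
function skew_sym_products_sketch(Z::AbstractArray{T, 3}, A::AbstractMatrix{T}) where T
    seq_length, batch = size(Z, 2), size(Z, 3)
    C = zeros(T, seq_length, seq_length, batch)
    for k in 1:batch, i in 2:seq_length, j in 1:(i - 1)
        C[i, j, k] = dot(Z[:, i, k], A, Z[:, j, k])   # three-argument dot: xᵀ A y
    end
    return C
end

Z = randn(3, 4, 2)     # 2 time series with 4 time steps in dimension 3
A = randn(3, 3)
C = skew_sym_products_sketch(Z, A)
```

With four time steps each slice `C[:, :, k]` contains $4\cdot3\div2 = 6$ scalar products below the diagonal.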

source
GeometricMachineLearning.train!Function
train!(...)

Trains a neural network on data using a given training method.

Different ways of use:

train!(neuralnetwork, data, optimizer = GradientOptimizer(1e-2), training_method; nruns = 1000, batch_size = default(data, type), showprogress = false )

Arguments

  • neuralnetwork::LuxNeuralNetwork : the neural network using LuxBackend
  • data : the data (see TrainingData)
  • optimizer = GradientOptimizer: the optimization method (see Optimizer)
  • training_method : specify the loss function used
  • nruns : number of iterations through the process (1000 by default)
  • batch_size : size of batch of data used for each step
source
GeometricMachineLearning.train!Method
train!(neuralnetwork, data, optimizer, training_method; nruns = 1000, batch_size, showprogress = false )

Arguments

  • neuralnetwork::LuxNeuralNetwork : the neural network using LuxBackend
  • data::AbstractTrainingData : the data
source
GeometricMachineLearning.transformer_lossMethod

The transformer loss works similarly to the regular loss, but with the difference that $\mathcal{NN}(input)$ and $output$ may have different sizes.

It takes as input:

  • model::Union{Chain, AbstractExplicitLayer}
  • ps::Union{Tuple, NamedTuple}
  • input::Union{Array, NamedTuple}
  • output::Union{Array, NamedTuple}
source
diff --git a/latest/manifolds/basic_topology/index.html b/latest/manifolds/basic_topology/index.html index 651243d90..479000ba8 100644 --- a/latest/manifolds/basic_topology/index.html +++ b/latest/manifolds/basic_topology/index.html @@ -1,2 +1,2 @@ -Concepts from General Topology · GeometricMachineLearning.jl

Basic Concepts of General Topology

On this page we discuss basic notions of topology that are necessary to define and work with manifolds. Here we largely omit concrete examples and only define concepts that are necessary for defining a manifold[1], namely the properties of being Hausdorff and second countable. For a wide range of examples and a detailed discussion of the theory see e.g. [5]. The theory presented here is also covered, in rudimentary form, in most differential geometry books such as [6] and [7].

Definition: A topological space is a set $\mathcal{M}$ for which we define a collection of subsets of $\mathcal{M}$, which we denote by $\mathcal{T}$ and call the open subsets. $\mathcal{T}$ further has to satisfy the following three conditions:

  1. The empty set and $\mathcal{M}$ belong to $\mathcal{T}$.
  2. Any union of an arbitrary number of elements of $\mathcal{T}$ again belongs to $\mathcal{T}$.
  3. Any intersection of a finite number of elements of $\mathcal{T}$ again belongs to $\mathcal{T}$.
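
For a finite set the three conditions can be checked mechanically. The following sketch is purely illustrative (the name `is_topology_sketch` is hypothetical). For a finite collection it is enough to test pairwise unions and intersections, since closure under arbitrary unions then follows by induction.

```julia
# Checks the three conditions for a finite collection T of subsets of M.
function is_topology_sketch(M::Set{Int}, T::Vector{Set{Int}})
    all(U -> issubset(U, M), T) || return false
    # condition 1: the empty set and M belong to T
    (Set{Int}() in T && M in T) || return false
    # conditions 2 and 3: closure under (pairwise) unions and intersections
    for U in T, V in T
        union(U, V) in T || return false
        intersect(U, V) in T || return false
    end
    return true
end

M = Set([1, 2, 3])
nested        = [Set{Int}(), Set([1]), Set([1, 2]), M]
missing_union = [Set{Int}(), Set([1]), Set([2]), M]   # {1} ∪ {2} = {1, 2} is missing

nested_ok        = is_topology_sketch(M, nested)
missing_union_ok = is_topology_sketch(M, missing_union)
```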

Based on this definition of a topological space we can now define what it means to be Hausdorff: Definition: A topological space $\mathcal{M}$ is said to be Hausdorff if for any two distinct points $x,y\in\mathcal{M}$ we can find two open sets $U_x,U_y\in\mathcal{T}$ s.t. $x\in{}U_x, y\in{}U_y$ and $U_x\cap{}U_y=\{\}$.

We now give the second definition that we need for defining manifolds, that of second countability: Definition: A topological space $\mathcal{M}$ is said to be second-countable if we can find a countable subcollection of $\mathcal{T}$ called $\mathcal{U}$ s.t. $\forall{}U\in\mathcal{T}$ and $x\in{}U$ we can find an element $V\in\mathcal{U}$ for which $x\in{}V\subset{}U$.

We now give a few definitions and results that are needed for the inverse function theorem which is essential for practical applications of manifold theory.

Definition: A mapping $f$ between topological spaces $\mathcal{M}$ and $\mathcal{N}$ is called continuous if the preimage of every open set is again an open set, i.e. if $f^{-1}\{U\}\in\mathcal{T}$ for $U$ open in $\mathcal{N}$ and $\mathcal{T}$ the topology on $\mathcal{M}$.

Definition: A closed set of a topological space $\mathcal{M}$ is one whose complement is an open set, i.e. $F$ is closed if $F^c\in\mathcal{T}$, where the superscript ${}^c$ indicates the complement. For closed sets we thus have the following three properties:

  1. The empty set and $\mathcal{M}$ are closed sets.
  2. Any union of a finite number of closed sets is again closed.
  3. Any intersection of an arbitrary number of closed sets is again closed.

Theorem: The definition of continuity is equivalent to the following, second definition: $f:\mathcal{M}\to\mathcal{N}$ is continuous if $f^{-1}\{F\}\subset\mathcal{M}$ is a closed set for each closed set $F\subset\mathcal{N}$.

Proof: First assume that $f$ is continuous according to the first definition and not to the second. Then there is a closed set $F\subset\mathcal{N}$ for which $f^{-1}\{F\}$ is not closed but $f^{-1}\{F^c\}$ is open. But $f^{-1}\{F^c\} = \{x\in\mathcal{M}:f(x)\not\in{}F\} = (f^{-1}\{F\})^c$ cannot be open, else $f^{-1}\{F\}$ would be closed. A contradiction. The implication of the first definition under assumption of the second can be shown analogously.

Theorem: The property of a set $F$ being closed is equivalent to the following statement: If a point $y$ is such that for every open set $U$ containing it we have $U\cap{}F\neq\{\}$ then this point is contained in $F$.

Proof: We first prove that if a set is closed then the statement holds. Consider a closed set $F$ and suppose there is a point $y\not\in{}F$ s.t. every open set containing $y$ has nonempty intersection with $F$. But the complement $F^c$ is itself an open set containing $y$, and $F^c\cap{}F=\{\}$, which is a clear contradiction. Now assume the above statement for a set $F$ and further assume $F$ is not closed. Its complement $F^c$ is thus not open. Now consider the interior of this set: $\mathrm{int}(F^c):=\cup\{U\in\mathcal{T}:U\subset{}F^c\}$, i.e. the biggest open set contained within $F^c$. Hence there must be a point $y$ which is in $F^c$ but is not in its interior, else $F^c$ would be equal to its interior, i.e. would be open. By the assumed statement we must further be able to find an open set $U$ that contains $y$ but is also contained in $F^c$, else $y$ would be an element of $F$. But then $y$ lies in the interior of $F^c$. A contradiction.

Definition: An open cover of a topological space $\mathcal{M}$ is a (not necessarily countable) collection of open sets $\{U_i\}_{i\in\mathcal{I}}$ s.t. their union contains $\mathcal{M}$. A finite open cover is a collection of a finite number of open sets that cover $\mathcal{M}$. We say that an open cover is reducible to a finite cover if we can find a finite number of elements in the open cover whose union still contains $\mathcal{M}$.

Definition: A topological space $\mathcal{M}$ is called compact if every open cover is reducible to a finite cover.

Theorem: Consider a continuous function $f:\mathcal{M}\to\mathcal{N}$ and a compact set $K\subset\mathcal{M}$. Then $f(K)$ is also compact.

Proof: Consider an open cover of $f(K)$: $\{U_i\}_{i\in\mathcal{I}}$. Then $\{f^{-1}\{U_i\}\}_{i\in\mathcal{I}}$ is an open cover of $K$ and hence reducible to a finite cover $\{f^{-1}\{U_i\}\}_{i\in\{i_1,\ldots,i_n\}}$. But then $\{U_i\}_{i\in\{i_1,\ldots,i_n\}}$ also covers $f(K)$.

Theorem: A closed subset of a compact space is compact.

Proof: Call the closed set $F$ and consider an open cover of this set: $\{U_i\}_{i\in\mathcal{I}}$. Then this open cover combined with $F^c$ is an open cover for the entire compact space, hence reducible to a finite cover. Removing $F^c$ from this finite cover leaves finitely many of the $U_i$ that still cover $F$.

Theorem: A compact subset of a Hausdorff space is closed.

Proof: Consider a compact subset $K$. If $K$ is not closed, then there has to be a point $y\not\in{}K$ s.t. every open set containing $y$ intersects $K$. Because the surrounding space is Hausdorff we can find, for every $z\in{}K$, a pair of open sets $U_z\ni{}z$ and $U_{z,y}\ni{}y$ with $U_z\cap{}U_{z,y}=\{\}$. The open cover $\{U_z\}_{z\in{}K}$ is then reducible to a finite cover $\{U_z\}_{z\in\{z_1, \ldots, z_n\}}$. The intersection $\cap_{z\in\{z_1, \ldots, z_n\}}U_{z,y}$ is then an open set that contains $y$ but has no intersection with $K$. A contradiction.

Theorem: If $\mathcal{M}$ is compact and $\mathcal{N}$ is Hausdorff, then the inverse of a continuous, invertible function $f:\mathcal{M}\to\mathcal{N}$ is again continuous, i.e. $f(V)$ is an open set in $\mathcal{N}$ for $V\in\mathcal{T}$.

Proof: We can equivalently show that every closed set is mapped to a closed set. Consider a closed set $K\subset\mathcal{M}$. It is compact because $\mathcal{M}$ is compact. Its image $f(K)$ is again compact and hence closed because $\mathcal{N}$ is Hausdorff.

References

[7]
R. L. Bishop and S. I. Goldberg. Tensor Analysis on Manifolds (Dover Publications, 1980).
[6]
S. Lang. Fundamentals of differential geometry. Vol. 191 (Springer Science & Business Media, 2012).
[5]
S. Lipschutz. General Topology (McGraw-Hill Book Company, 1965).
  • 1Some authors (see e.g. [6]) do not require these properties. But since they constitute very weak restrictions and are always satisfied by the manifolds relevant for our purposes we require them here.
+Concepts from General Topology · GeometricMachineLearning.jl

Basic Concepts of General Topology

On this page we discuss basic notions of topology that are necessary to define and work with manifolds. Here we largely omit concrete examples and only define concepts that are necessary for defining a manifold[1], namely the properties of being Hausdorff and second countable. For a wide range of examples and a detailed discussion of the theory see e.g. [5]. The theory presented here is also covered, in rudimentary form, in most differential geometry books such as [6] and [7].

Definition: A topological space is a set $\mathcal{M}$ for which we define a collection of subsets of $\mathcal{M}$, which we denote by $\mathcal{T}$ and call the open subsets. $\mathcal{T}$ further has to satisfy the following three conditions:

  1. The empty set and $\mathcal{M}$ belong to $\mathcal{T}$.
  2. Any union of an arbitrary number of elements of $\mathcal{T}$ again belongs to $\mathcal{T}$.
  3. Any intersection of a finite number of elements of $\mathcal{T}$ again belongs to $\mathcal{T}$.

Based on this definition of a topological space we can now define what it means to be Hausdorff: Definition: A topological space $\mathcal{M}$ is said to be Hausdorff if for any two distinct points $x,y\in\mathcal{M}$ we can find two open sets $U_x,U_y\in\mathcal{T}$ s.t. $x\in{}U_x, y\in{}U_y$ and $U_x\cap{}U_y=\{\}$.
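On a finite set both definitions can be checked by brute force. The following is a minimal sketch, not part of GeometricMachineLearning.jl: the helper names `is_topology` and `is_hausdorff` are hypothetical, and the two example topologies (the discrete topology and the Sierpinski topology on a two-point set) are chosen purely for illustration.

```python
from itertools import combinations


def is_topology(points, opens):
    """Check the three axioms for a finite collection of candidate open sets."""
    opens = {frozenset(u) for u in opens}
    if frozenset() not in opens or frozenset(points) not in opens:
        return False
    # For a finite collection, closure under pairwise unions and pairwise
    # intersections already implies closure under arbitrary ones.
    return all((u | v) in opens and (u & v) in opens
               for u, v in combinations(opens, 2))


def is_hausdorff(points, opens):
    """Check that any two distinct points have disjoint open neighborhoods."""
    opens = [frozenset(u) for u in opens]
    return all(
        any(x in u and y in v and not (u & v) for u in opens for v in opens)
        for x, y in combinations(points, 2)
    )


points = [0, 1]
discrete = [set(), {0}, {1}, {0, 1}]
sierpinski = [set(), {0}, {0, 1}]

print(is_topology(points, discrete), is_hausdorff(points, discrete))
print(is_topology(points, sierpinski), is_hausdorff(points, sierpinski))
```

Both collections satisfy the axioms of a topology, but only the discrete one separates the points $0$ and $1$: in the Sierpinski topology every open set containing $1$ also contains $0$.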

We now give the second definition that we need for defining manifolds, that of second countability: Definition: A topological space $\mathcal{M}$ is said to be second-countable if we can find a countable subcollection of $\mathcal{T}$ called $\mathcal{U}$ s.t. $\forall{}U\in\mathcal{T}$ and $x\in{}U$ we can find an element $V\in\mathcal{U}$ for which $x\in{}V\subset{}U$.

We now give a few definitions and results that are needed for the inverse function theorem which is essential for practical applications of manifold theory.

Definition: A mapping $f$ between topological spaces $\mathcal{M}$ and $\mathcal{N}$ is called continuous if the preimage of every open set is again an open set, i.e. if $f^{-1}\{U\}\in\mathcal{T}$ for $U$ open in $\mathcal{N}$ and $\mathcal{T}$ the topology on $\mathcal{M}$.

Definition: A closed set of a topological space $\mathcal{M}$ is one whose complement is an open set, i.e. $F$ is closed if $F^c\in\mathcal{T}$, where the superscript ${}^c$ indicates the complement. For closed sets we thus have the following three properties:

  1. The empty set and $\mathcal{M}$ are closed sets.
  2. Any union of a finite number of closed sets is again closed.
  3. Any intersection of an arbitrary number of closed sets is again closed.

Theorem: The definition of continuity is equivalent to the following, second definition: $f:\mathcal{M}\to\mathcal{N}$ is continuous if $f^{-1}\{F\}\subset\mathcal{M}$ is a closed set for each closed set $F\subset\mathcal{N}$.

Proof: First assume that $f$ is continuous according to the first definition and not to the second. Then there is a closed set $F\subset\mathcal{N}$ for which $f^{-1}\{F\}$ is not closed, while $f^{-1}\{F^c\}$ is open because $F^c$ is open. But $f^{-1}\{F^c\} = \{x\in\mathcal{M}:f(x)\not\in{}F\} = (f^{-1}\{F\})^c$ cannot be open, else $f^{-1}\{F\}$ would be closed. A contradiction. The implication of the first definition under assumption of the second can be shown analogously.
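The equivalence of the two definitions can be observed directly on finite spaces. The sketch below is only an illustration under stated assumptions: maps are stored as Python dictionaries, the function names are hypothetical, and the Sierpinski topology on $\{0,1\}$ is used on both sides.

```python
def preimage(f, domain, subset):
    """Return the preimage of `subset` under the map f (given as a dict)."""
    return frozenset(x for x in domain if f[x] in subset)


def is_continuous_open(f, dom, opens_dom, opens_cod):
    """First definition: preimages of open sets are open."""
    opens_dom = {frozenset(u) for u in opens_dom}
    return all(preimage(f, dom, u) in opens_dom for u in opens_cod)


def is_continuous_closed(f, dom, cod, opens_dom, opens_cod):
    """Second definition: preimages of closed sets are closed."""
    closed_dom = {frozenset(dom) - frozenset(u) for u in opens_dom}
    closed_cod = [frozenset(cod) - frozenset(u) for u in opens_cod]
    return all(preimage(f, dom, c) in closed_dom for c in closed_cod)


space = [0, 1]
sierpinski = [set(), {0}, {0, 1}]
identity = {0: 0, 1: 1}
swap = {0: 1, 1: 0}

for f in (identity, swap):
    by_open = is_continuous_open(f, space, sierpinski, sierpinski)
    by_closed = is_continuous_closed(f, space, space, sierpinski, sierpinski)
    print(by_open, by_closed)
```

The identity is continuous and the swap is not, and in both cases the two criteria return the same answer, as the theorem predicts.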

Theorem: The property of a set $F$ being closed is equivalent to the following statement: If a point $y$ is such that for every open set $U$ containing it we have $U\cap{}F\neq\{\}$ then this point is contained in $F$.

Proof: We first prove that if a set is closed then the statement holds. Consider a closed set $F$ and a point $y\not\in{}F$ s.t. every open set containing $y$ has nonempty intersection with $F$. But the complement $F^c$ is an open set that contains $y$ and has empty intersection with $F$, which is a clear contradiction. Now assume the above statement for a set $F$ and further assume $F$ is not closed. Its complement $F^c$ is thus not open. Now consider the interior of this set: $\mathrm{int}(F^c):=\cup\{U\in\mathcal{T}:U\subset{}F^c\}$, i.e. the biggest open set contained within $F^c$. Hence there must be a point $y$ which is in $F^c$ but is not in its interior, else $F^c$ would be equal to its interior, i.e. would be open. Because $y\not\in{}F$, the assumed statement gives us an open set $U$ that contains $y$ and satisfies $U\cap{}F=\{\}$, i.e. $U\subset{}F^c$. But then $y\in{}U\subset\mathrm{int}(F^c)$. A contradiction.

Definition: An open cover of a topological space $\mathcal{M}$ is a (not necessarily countable) collection of open sets $\{U_i\}_{i\in\mathcal{I}}$ s.t. their union contains $\mathcal{M}$. A finite open cover is a collection of a finite number of open sets that cover $\mathcal{M}$. We say that an open cover is reducible to a finite cover if we can find a finite number of elements in the open cover whose union still contains $\mathcal{M}$.

Definition: A topological space $\mathcal{M}$ is called compact if every open cover is reducible to a finite cover.

Theorem: Consider a continuous function $f:\mathcal{M}\to\mathcal{N}$ and a compact set $K\subset\mathcal{M}$. Then $f(K)$ is also compact.

Proof: Consider an open cover of $f(K)$: $\{U_i\}_{i\in\mathcal{I}}$. Then $\{f^{-1}\{U_i\}\}_{i\in\mathcal{I}}$ is an open cover of $K$ and hence reducible to a finite cover $\{f^{-1}\{U_i\}\}_{i\in\{i_1,\ldots,i_n\}}$. But then $\{U_i\}_{i\in\{i_1,\ldots,i_n\}}$ also covers $f(K)$.

Theorem: A closed subset of a compact space is compact.

Proof: Call the closed set $F$ and consider an open cover of this set: $\{U_i\}_{i\in\mathcal{I}}$. Then this open cover combined with $F^c$ is an open cover for the entire compact space, hence reducible to a finite cover. Removing $F^c$ from this finite cover leaves finitely many of the $U_i$ that still cover $F$.

Theorem: A compact subset of a Hausdorff space is closed.

Proof: Consider a compact subset $K$. If $K$ is not closed, then there has to be a point $y\not\in{}K$ s.t. every open set containing $y$ intersects $K$. Because the surrounding space is Hausdorff we can find, for every $z\in{}K$, a pair of open sets $U_z\ni{}z$ and $U_{z,y}\ni{}y$ with $U_z\cap{}U_{z,y}=\{\}$. The open cover $\{U_z\}_{z\in{}K}$ is then reducible to a finite cover $\{U_z\}_{z\in\{z_1, \ldots, z_n\}}$. The intersection $\cap_{z\in\{z_1, \ldots, z_n\}}U_{z,y}$ is then an open set that contains $y$ but has no intersection with $K$. A contradiction.

Theorem: If $\mathcal{M}$ is compact and $\mathcal{N}$ is Hausdorff, then the inverse of a continuous, invertible function $f:\mathcal{M}\to\mathcal{N}$ is again continuous, i.e. $f(V)$ is an open set in $\mathcal{N}$ for $V\in\mathcal{T}$.

Proof: We can equivalently show that every closed set is mapped to a closed set. Consider a closed set $K\subset\mathcal{M}$. It is compact because $\mathcal{M}$ is compact. Its image $f(K)$ is again compact and hence closed because $\mathcal{N}$ is Hausdorff.

References

[7]
R. L. Bishop and S. I. Goldberg. Tensor Analysis on Manifolds (Dover Publications, 1980).
[6]
S. Lang. Fundamentals of differential geometry. Vol. 191 (Springer Science & Business Media, 2012).
[5]
S. Lipschutz. General Topology (McGraw-Hill Book Company, 1965).
  • 1Some authors (see e.g. [6]) do not require these properties. But since they constitute very weak restrictions and are always satisfied by the manifolds relevant for our purposes we require them here.
diff --git a/latest/manifolds/existence_and_uniqueness_theorem/index.html b/latest/manifolds/existence_and_uniqueness_theorem/index.html index 034748689..f9cfec030 100644 --- a/latest/manifolds/existence_and_uniqueness_theorem/index.html +++ b/latest/manifolds/existence_and_uniqueness_theorem/index.html @@ -1,5 +1,5 @@ -Differential Equations and the EAU theorem · GeometricMachineLearning.jl

The Existence-And-Uniqueness Theorem

In order to prove the existence-and-uniqueness theorem we first need another theorem, the Banach fixed-point theorem, for which we also need another definition.

Definition: A contraction mapping is a map $T:\mathbb{R}^N\to\mathbb{R}^N$ for which there exists $q\in[0,1)$ s.t. $\forall{}x,y\in\mathbb{R}^N,\,||T(x)-T(y)||\leq{}q||x-y||$.

Theorem (Banach fixed-point theorem): Every contraction mapping $T$ admits a unique fixed point $x^*$ (i.e. a point $x^*$ s.t. $T(x^*)=x^*$) and this point can be found by taking an arbitrary point $x_0\in\mathbb{R}^N$ and taking the limit $\lim_{n\to\infty}T^n(x_0)$.

Proof (Banach fixed-point theorem): Take an arbitrary point $x_0\in\mathbb{R}^N$ and consider the sequence $(x_n)_{n\in\mathbb{N}}$ with $x_n:=T^n(x_0)$. Then it holds that (for $m>n$):

\[\begin{aligned} +Differential Equations and the EAU theorem · GeometricMachineLearning.jl

The Existence-And-Uniqueness Theorem

In order to prove the existence-and-uniqueness theorem we first need another theorem, the Banach fixed-point theorem, for which we also need another definition.

Definition: A contraction mapping is a map $T:\mathbb{R}^N\to\mathbb{R}^N$ for which there exists $q\in[0,1)$ s.t. $\forall{}x,y\in\mathbb{R}^N,\,||T(x)-T(y)||\leq{}q||x-y||$.

Theorem (Banach fixed-point theorem): Every contraction mapping $T$ admits a unique fixed point $x^*$ (i.e. a point $x^*$ s.t. $T(x^*)=x^*$) and this point can be found by taking an arbitrary point $x_0\in\mathbb{R}^N$ and taking the limit $\lim_{n\to\infty}T^n(x_0)$.

Proof (Banach fixed-point theorem): Take an arbitrary point $x_0\in\mathbb{R}^N$ and consider the sequence $(x_n)_{n\in\mathbb{N}}$ with $x_n:=T^n(x_0)$. Then it holds that (for $m>n$):

\[\begin{aligned} |x_m - x_n| & \leq |x_m - x_{m-1}| + |x_{m-1} - x_{m-2}| + \cdots + |x_{m-(m-n-1)}-x_{n}| \\ & = |x_{n+(m-n)} - x_{n+(m-n-1)}| + \cdots + |x_{n+1} - x_n| \\ & \leq \sum_{i=0}^{m-n-1}q^i|x_{n+1} - x_n| \\ @@ -8,4 +8,4 @@ \end{aligned}\]

where we have used the triangle inequality in the first line. If we now let $m$ on the right-hand side first go to infinity then we get

\[\begin{aligned} |x_m-x_n| & \leq q^n|x_1 -x_0|\sum_{i=0}^{\infty}q^i & =q^n|x_1 -x_0| \frac{1}{1-q}, -\end{aligned}\]

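proving that the sequence is Cauchy. Because $\mathbb{R}^N$ is a complete metric space we get that $(x_n)_{n\in\mathbb{N}}$ is a convergent sequence. We call the limit of this sequence $x^*$. Because $T$ is continuous we have $T(x^*) = \lim_{n\to\infty}T(x_n) = \lim_{n\to\infty}x_{n+1} = x^*$, so $x^*$ is a fixed point. It is also unique: for a second fixed point $y^*$ we would have $||x^*-y^*|| = ||T(x^*)-T(y^*)||\leq{}q||x^*-y^*||$, which forces $x^*=y^*$. This completes the proof of the Banach fixed-point theorem.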

+\end{aligned}\]

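proving that the sequence is Cauchy. Because $\mathbb{R}^N$ is a complete metric space we get that $(x_n)_{n\in\mathbb{N}}$ is a convergent sequence. We call the limit of this sequence $x^*$. Because $T$ is continuous we have $T(x^*) = \lim_{n\to\infty}T(x_n) = \lim_{n\to\infty}x_{n+1} = x^*$, so $x^*$ is a fixed point. It is also unique: for a second fixed point $y^*$ we would have $||x^*-y^*|| = ||T(x^*)-T(y^*)||\leq{}q||x^*-y^*||$, which forces $x^*=y^*$. This completes the proof of the Banach fixed-point theorem.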
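The iteration $x_n = T^n(x_0)$ from the proof can be tried out numerically. The sketch below is an illustration only: the affine map $T(x) = qx + 1$ with $q = 1/2$ is an assumed toy example (its fixed point is $x^* = 2$), and `fixed_point_iterates` is a hypothetical helper name. It also checks the bound $|x^* - x_n| \leq \frac{q^n}{1-q}|x_1 - x_0|$, which follows from the Cauchy estimate by letting $m\to\infty$.

```python
def fixed_point_iterates(T, x0, n):
    """Return the list [x_0, T(x_0), T^2(x_0), ..., T^n(x_0)]."""
    xs = [x0]
    for _ in range(n):
        xs.append(T(xs[-1]))
    return xs


q = 0.5


def T(x):
    # |T(x) - T(y)| = q |x - y|, so T is a contraction with constant q.
    return q * x + 1.0


x_star = 2.0  # solves T(x) = x for this particular map
xs = fixed_point_iterates(T, 0.0, 40)

# Compare the actual error with the a priori bound q^n / (1 - q) * |x_1 - x_0|.
bounds_hold = all(
    abs(x_star - x) <= q**n / (1 - q) * abs(xs[1] - xs[0]) + 1e-12
    for n, x in enumerate(xs)
)
print(xs[-1], bounds_hold)
```

For this map the bound is attained with equality, which is why a small tolerance is added in the comparison.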

diff --git a/latest/manifolds/grassmann_manifold/index.html b/latest/manifolds/grassmann_manifold/index.html index 705bd0db5..0b5c5e13d 100644 --- a/latest/manifolds/grassmann_manifold/index.html +++ b/latest/manifolds/grassmann_manifold/index.html @@ -1,2 +1,2 @@ -Grassmann · GeometricMachineLearning.jl

Grassmann Manifold

(The description of the Grassmann manifold is based on that of the Stiefel manifold, so this should be read first.)

An element of the Grassmann manifold $Gr(n,N)$ is a vector subspace of $\mathbb{R}^N$ of dimension $n$. Each such subspace (i.e. element of the Grassmann manifold) can be represented by a full-rank matrix $A\in\mathbb{R}^{N\times{}n}$ and we identify two elements with the following equivalence relation:

\[A_1 \sim A_2 \iff \exists{}C\in\mathbb{R}^{n\times{}n}\text{ s.t. }A_1C = A_2.\]

The resulting manifold is of dimension $n(N-n)$. One can find a parametrization of the manifold the following way: Because the matrix $Y\in\mathbb{R}^{N\times{}n}$ has full rank, there have to be $n$ independent rows in it: $i_1, \ldots, i_n$. For simplicity assume that $i_1 = 1, i_2=2, \ldots, i_n=n$ and call the matrix made up of these rows $C$. Then the mapping to the coordinate chart is: $YC^{-1}$ and the last $N-n$ rows are the coordinates.

We can also define the Grassmann manifold based on the Stiefel manifold since elements of the Stiefel manifold are already full-rank matrices. In this case we have the following equivalence relation (for $Y_1, Y_2\in{}St(n,N)$):

\[Y_1 \sim Y_2 \iff \exists{}C\in{}O(n)\text{ s.t. }Y_1C = Y_2.\]

The Riemannian Gradient

Obtaining the Riemannian Gradient for the Grassmann manifold is slightly more difficult than it is in the case of the Stiefel manifold. Since the Grassmann manifold can be obtained from the Stiefel manifold through an equivalence relation however, we can use this as a starting point. In a first step we identify charts on the Grassmann manifold to make dealing with it easier. For this consider the following open cover of the Grassmann manifold (also see [8]):

\[\{\mathcal{U}_W\}_{W\in{}St(n, N)} \quad\text{where}\quad \mathcal{U}_W = \{\mathrm{span}(Y):\mathrm{det}(W^TY)\neq0\}.\]

We can find a canonical bijective mapping from the set $\mathcal{U}_W$ to the set $\mathcal{S}_W := \{Y\in\mathbb{R}^{N\times{}n}:W^TY=\mathbb{I}_n\}$:

\[\sigma_W: \mathcal{U}_W \to \mathcal{S}_W,\, \mathcal{Y}=\mathrm{span}(Y)\mapsto{}Y(W^TY)^{-1} =: \hat{Y}.\]

That $\sigma_W$ is well-defined is easy to see: Consider $YC$ with $C\in\mathbb{R}^{n\times{}n}$ non-singular. Then $YC(W^TYC)^{-1}=Y(W^TY)^{-1} = \hat{Y}$. With this isomorphism we can also find a representation of elements of the tangent space:

\[T_\mathcal{Y}\sigma_W: T_\mathcal{Y}Gr(n,N)\to{}T_{\hat{Y}}\mathcal{S}_W,\, \xi \mapsto (\xi_{\diamond{}Y} -\hat{Y}(W^T\xi_{\diamond{}Y}))(W^TY)^{-1}.\]

$\xi_{\diamond{}Y}$ is the representation of $\xi\in{}T_\mathcal{Y}Gr(n,N)$ for the point $Y\in{}St(n,N)$, i.e. $T_Y\pi(\xi_{\diamond{}Y}) = \xi$; because the map $\sigma_W$ does not care about the representation of $\mathrm{span}(Y)$ we can perform the variations in $St(n,N)$[1]:

\[\frac{d}{dt}\Big|_{t=0}Y(t)(W^TY(t))^{-1} = (\dot{Y}(0) - Y(W^TY)^{-1}W^T\dot{Y}(0))(W^TY)^{-1},\]

where $\dot{Y}(0)\in{}T_YSt(n,N)$. Also note that the representation of $\xi$ in $T_YSt(n,N)$ is not unique in general, but $T_\mathcal{Y}\sigma_W$ is still well-defined. To see this consider two curves $Y(t)$ and $\bar{Y}(t)$ for which we have $Y(0) = \bar{Y}(0) = Y$ and further $T\pi(\dot{Y}(0)) = T\pi(\dot{\bar{Y}}(0))$. This is equivalent to being able to find a $C(\cdot):(-\varepsilon,\varepsilon)\to{}O(n)$ for which $C(0)=\mathbb{I}_n$ s.t. $\bar{Y}(t) = Y(t)C(t)$. We thus have $\dot{\bar{Y}}(0) = \dot{Y}(0) + Y\dot{C}(0)$ and if we replace $\xi_{\diamond{}Y}$ above with the second term in the expression we get: $Y\dot{C}(0) - \hat{Y}W^T(Y\dot{C}(0)) = 0$. The parametrization of $T_\mathcal{Y}Gr(n,N)$ with $T_\mathcal{Y}\sigma_W$ is thus independent of the choice of $\dot{C}(0)$ and hence of $\xi_{\diamond{}Y}$ and is therefore well-defined.

Further note that we have $T_\mathcal{Y}\mathcal{U}_W = T_\mathcal{Y}Gr(n,N)$ because $\mathcal{U}_W$ is an open subset of $Gr(n,N)$. We thus can identify the tangent space $T_\mathcal{Y}Gr(n,N)$ with the following set (where we again have $\hat{Y}=Y(W^TY)^{-1}$):

\[T_{\hat{Y}}\mathcal{S}_W = \{(\Delta - Y(W^TY)^{-1}W^T\Delta)(W^TY)^{-1}: Y\in{}St(n,N)\text{ s.t. }\mathrm{span}(Y)=\mathcal{Y}\text{ and }\Delta\in{}T_YSt(n,N)\}.\]

If we now further take $W=Y$[2] then we get the identification:

\[T_\mathcal{Y}Gr(n,N) \equiv \{\Delta - YY^T\Delta: Y\in{}St(n,N)\text{ s.t. }\mathrm{span}(Y)=\mathcal{Y}\text{ and }\Delta\in{}T_YSt(n,N)\},\]

which is very easy to handle computationally (we simply store and change the matrix $Y$ that represents an element of the Grassmann manifold). The Riemannian gradient is then

\[\mathrm{grad}_\mathcal{Y}^{Gr}L = \mathrm{grad}_Y^{St}L - YY^T\mathrm{grad}_Y^{St}L = \nabla_Y{}L - YY^T\nabla_YL,\]

where $\nabla_Y{}L$ again is the Euclidean gradient as in the Stiefel manifold case.

  • 1I.e. $Y(t)\in{}St(n,N)$ for $t\in(-\varepsilon,\varepsilon)$. We also set $Y(0) = Y$.
  • 2We can pick any element $W$ to construct the charts for a neighborhood around the point $\mathcal{Y}\in{}Gr(n,N)$ as long as we have $\mathrm{det}(W^TY)\neq0$ for $\mathrm{span}(Y)=\mathcal{Y}$.
+Grassmann · GeometricMachineLearning.jl

Grassmann Manifold

(The description of the Grassmann manifold is based on that of the Stiefel manifold, so this should be read first.)

An element of the Grassmann manifold $Gr(n,N)$ is a vector subspace of $\mathbb{R}^N$ of dimension $n$. Each such subspace (i.e. element of the Grassmann manifold) can be represented by a full-rank matrix $A\in\mathbb{R}^{N\times{}n}$ and we identify two elements with the following equivalence relation:

\[A_1 \sim A_2 \iff \exists{}C\in\mathbb{R}^{n\times{}n}\text{ s.t. }A_1C = A_2.\]

The resulting manifold is of dimension $n(N-n)$. One can find a parametrization of the manifold the following way: Because the matrix $Y\in\mathbb{R}^{N\times{}n}$ has full rank, there have to be $n$ independent rows in it: $i_1, \ldots, i_n$. For simplicity assume that $i_1 = 1, i_2=2, \ldots, i_n=n$ and call the matrix made up of these rows $C$. Then the mapping to the coordinate chart is: $YC^{-1}$ and the last $N-n$ rows are the coordinates.

We can also define the Grassmann manifold based on the Stiefel manifold since elements of the Stiefel manifold are already full-rank matrices. In this case we have the following equivalence relation (for $Y_1, Y_2\in{}St(n,N)$):

\[Y_1 \sim Y_2 \iff \exists{}C\in{}O(n)\text{ s.t. }Y_1C = Y_2.\]

The Riemannian Gradient

Obtaining the Riemannian Gradient for the Grassmann manifold is slightly more difficult than it is in the case of the Stiefel manifold. Since the Grassmann manifold can be obtained from the Stiefel manifold through an equivalence relation however, we can use this as a starting point. In a first step we identify charts on the Grassmann manifold to make dealing with it easier. For this consider the following open cover of the Grassmann manifold (also see [8]):

\[\{\mathcal{U}_W\}_{W\in{}St(n, N)} \quad\text{where}\quad \mathcal{U}_W = \{\mathrm{span}(Y):\mathrm{det}(W^TY)\neq0\}.\]

We can find a canonical bijective mapping from the set $\mathcal{U}_W$ to the set $\mathcal{S}_W := \{Y\in\mathbb{R}^{N\times{}n}:W^TY=\mathbb{I}_n\}$:

\[\sigma_W: \mathcal{U}_W \to \mathcal{S}_W,\, \mathcal{Y}=\mathrm{span}(Y)\mapsto{}Y(W^TY)^{-1} =: \hat{Y}.\]
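As a quick numerical sanity check of the chart map $\sigma_W$ just defined, the sketch below evaluates $Y(W^TY)^{-1}$ for two representatives $Y$ and $YC$ of the same subspace. It is a minimal illustration in plain Python for $N=3$, $n=2$; the helper names and the concrete matrices `W`, `Y` and `C` are assumptions made for the example, and $Y$ is only taken to be full-rank (not necessarily an element of the Stiefel manifold).

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def inv2(A):
    """Invert a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def sigma(W, Y):
    """Chart map: span(Y) -> Y (W^T Y)^{-1}, valid if det(W^T Y) != 0."""
    return matmul(Y, inv2(matmul(transpose(W), Y)))


W = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
Y = [[1.0, 2.0], [0.0, 1.0], [3.0, 4.0]]
C = [[2.0, 1.0], [1.0, 1.0]]  # non-singular, so Y and YC span the same subspace

Y_hat = sigma(W, Y)
Y_hat_from_YC = sigma(W, matmul(Y, C))
print(Y_hat)
print(Y_hat_from_YC)
```

Both calls return the same matrix $\hat{Y}$, and $W^T\hat{Y}$ is the identity, so $\hat{Y}$ indeed lies in $\mathcal{S}_W$.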

That $\sigma_W$ is well-defined is easy to see: Consider $YC$ with $C\in\mathbb{R}^{n\times{}n}$ non-singular. Then $YC(W^TYC)^{-1}=Y(W^TY)^{-1} = \hat{Y}$. With this isomorphism we can also find a representation of elements of the tangent space:

\[T_\mathcal{Y}\sigma_W: T_\mathcal{Y}Gr(n,N)\to{}T_{\hat{Y}}\mathcal{S}_W,\, \xi \mapsto (\xi_{\diamond{}Y} -\hat{Y}(W^T\xi_{\diamond{}Y}))(W^TY)^{-1}.\]

$\xi_{\diamond{}Y}$ is the representation of $\xi\in{}T_\mathcal{Y}Gr(n,N)$ for the point $Y\in{}St(n,N)$, i.e. $T_Y\pi(\xi_{\diamond{}Y}) = \xi$; because the map $\sigma_W$ does not care about the representation of $\mathrm{span}(Y)$ we can perform the variations in $St(n,N)$[1]:

\[\frac{d}{dt}\Big|_{t=0}Y(t)(W^TY(t))^{-1} = (\dot{Y}(0) - Y(W^TY)^{-1}W^T\dot{Y}(0))(W^TY)^{-1},\]

where $\dot{Y}(0)\in{}T_YSt(n,N)$. Also note that the representation of $\xi$ in $T_YSt(n,N)$ is not unique in general, but $T_\mathcal{Y}\sigma_W$ is still well-defined. To see this consider two curves $Y(t)$ and $\bar{Y}(t)$ for which we have $Y(0) = \bar{Y}(0) = Y$ and further $T\pi(\dot{Y}(0)) = T\pi(\dot{\bar{Y}}(0))$. This is equivalent to being able to find a $C(\cdot):(-\varepsilon,\varepsilon)\to{}O(n)$ for which $C(0)=\mathbb{I}_n$ s.t. $\bar{Y}(t) = Y(t)C(t)$. We thus have $\dot{\bar{Y}}(0) = \dot{Y}(0) + Y\dot{C}(0)$ and if we replace $\xi_{\diamond{}Y}$ above with the second term in the expression we get: $Y\dot{C}(0) - \hat{Y}W^T(Y\dot{C}(0)) = 0$. The parametrization of $T_\mathcal{Y}Gr(n,N)$ with $T_\mathcal{Y}\sigma_W$ is thus independent of the choice of $\dot{C}(0)$ and hence of $\xi_{\diamond{}Y}$ and is therefore well-defined.
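For reference, the derivative formula for $Y(t)(W^TY(t))^{-1}$ used above can be obtained from the product rule together with the standard identity $\frac{d}{dt}A(t)^{-1} = -A(t)^{-1}\dot{A}(t)A(t)^{-1}$, applied to $A(t) = W^TY(t)$ (the symbol $A(t)$ is introduced here only for this remark):

\[\begin{aligned} \frac{d}{dt}\Big|_{t=0}Y(t)(W^TY(t))^{-1} & = \dot{Y}(0)(W^TY)^{-1} - Y(W^TY)^{-1}W^T\dot{Y}(0)(W^TY)^{-1} \\ & = (\dot{Y}(0) - Y(W^TY)^{-1}W^T\dot{Y}(0))(W^TY)^{-1}. \end{aligned}\]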

Further note that we have $T_\mathcal{Y}\mathcal{U}_W = T_\mathcal{Y}Gr(n,N)$ because $\mathcal{U}_W$ is an open subset of $Gr(n,N)$. We thus can identify the tangent space $T_\mathcal{Y}Gr(n,N)$ with the following set (where we again have $\hat{Y}=Y(W^TY)^{-1}$):

\[T_{\hat{Y}}\mathcal{S}_W = \{(\Delta - Y(W^TY)^{-1}W^T\Delta)(W^TY)^{-1}: Y\in{}St(n,N)\text{ s.t. }\mathrm{span}(Y)=\mathcal{Y}\text{ and }\Delta\in{}T_YSt(n,N)\}.\]

If we now further take $W=Y$[2] then we get the identification:

\[T_\mathcal{Y}Gr(n,N) \equiv \{\Delta - YY^T\Delta: Y\in{}St(n,N)\text{ s.t. }\mathrm{span}(Y)=\mathcal{Y}\text{ and }\Delta\in{}T_YSt(n,N)\},\]

which is very easy to handle computationally (we simply store and change the matrix $Y$ that represents an element of the Grassmann manifold). The Riemannian gradient is then

\[\mathrm{grad}_\mathcal{Y}^{Gr}L = \mathrm{grad}_Y^{St}L - YY^T\mathrm{grad}_Y^{St}L = \nabla_Y{}L - YY^T\nabla_YL,\]

where $\nabla_Y{}L$ again is the Euclidean gradient as in the Stiefel manifold case.
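The gradient formula amounts to projecting the Euclidean gradient with $\mathbb{I} - YY^T$. The sketch below is a plain-Python illustration and not the implementation in GeometricMachineLearning.jl: the function name `grassmann_gradient`, the matrix `Y` (with orthonormal columns) and the stand-in Euclidean gradient `G` are all assumptions made for the example. It checks that the result satisfies $Y^T\mathrm{grad}^{Gr}L = 0$.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def grassmann_gradient(Y, G):
    """Return G - Y Y^T G for a Euclidean gradient G at the representative Y."""
    YYtG = matmul(Y, matmul(transpose(Y), G))
    return [[g - p for g, p in zip(g_row, p_row)]
            for g_row, p_row in zip(G, YYtG)]


# Y has orthonormal columns, i.e. it is an element of St(2, 3).
Y = [[0.6, 0.0], [0.8, 0.0], [0.0, 1.0]]
# Arbitrary stand-in for the Euclidean gradient of some loss L at Y.
G = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

grad = grassmann_gradient(Y, G)
residual = matmul(transpose(Y), grad)  # should vanish up to rounding
print(grad)
print(residual)
```

Because the projection removes every component of `G` that lies in the column space of `Y`, applying `grassmann_gradient` a second time leaves the result unchanged.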

  • 1I.e. $Y(t)\in{}St(n,N)$ for $t\in(-\varepsilon,\varepsilon)$. We also set $Y(0) = Y$.
  • 2We can pick any element $W$ to construct the charts for a neighborhood around the point $\mathcal{Y}\in{}Gr(n,N)$ as long as we have $\mathrm{det}(W^TY)\neq0$ for $\mathrm{span}(Y)=\mathcal{Y}$.
diff --git a/latest/manifolds/homogeneous_spaces/index.html b/latest/manifolds/homogeneous_spaces/index.html index ce31a1564..bf6811910 100644 --- a/latest/manifolds/homogeneous_spaces/index.html +++ b/latest/manifolds/homogeneous_spaces/index.html @@ -1,2 +1,2 @@ -Homogeneous Spaces · GeometricMachineLearning.jl

Homogeneous Spaces

Homogeneous spaces are manifolds $\mathcal{M}$ on which a Lie group $G$ acts transitively, i.e.

\[\forall X,Y\in\mathcal{M} \exists{}A\in{}G\text{ s.t. }AX = Y.\]

Now fix a distinguished element $E\in\mathcal{M}$. We can also establish an isomorphism between $\mathcal{M}$ and the quotient space $G/\sim$ with the equivalence relation:

\[A_1 \sim A_2 \iff A_1E = A_2E.\]

Note that this is independent of the chosen $E$.

The tangent spaces of $\mathcal{M}$ are of the form $T_Y\mathcal{M} = \mathfrak{g}\cdot{}Y$, i.e. they can be fully described through the Lie algebra $\mathfrak{g}$ of $G$. Based on this we can perform a splitting of $\mathfrak{g}$ into two parts:

  1. The vertical component $\mathfrak{g}^{\mathrm{ver},Y}$ is the kernel of the map $\mathfrak{g}\to{}T_Y\mathcal{M}, V \mapsto VY$, i.e. $\mathfrak{g}^{\mathrm{ver},Y} = \{V\in\mathfrak{g}:VY = 0\}.$

  2. The horizontal component $\mathfrak{g}^{\mathrm{hor},Y}$ is the orthogonal complement of $\mathfrak{g}^{\mathrm{ver},Y}$ in $\mathfrak{g}$. It is isomorphic to $T_Y\mathcal{M}$.

We will refer to the mapping from $T_Y\mathcal{M}$ to $\mathfrak{g}^{\mathrm{hor}, Y}$ by $\Omega$. If we have now defined a metric $\langle\cdot,\cdot\rangle$ on $\mathfrak{g}$, then this induces a Riemannian metric on $\mathcal{M}$:

\[g_Y(\Delta_1, \Delta_2) = \langle\Omega(Y,\Delta_1),\Omega(Y,\Delta_2)\rangle\text{ for $\Delta_1,\Delta_2\in{}T_Y\mathcal{M}$.}\]

Two examples of homogeneous spaces implemented in GeometricMachineLearning are the Stiefel and the Grassmann manifold.

References

  • Frankel, Theodore. The geometry of physics: an introduction. Cambridge university press, 2011.
+Homogeneous Spaces · GeometricMachineLearning.jl

Homogeneous Spaces

Homogeneous spaces are manifolds $\mathcal{M}$ on which a Lie group $G$ acts transitively, i.e.

\[\forall X,Y\in\mathcal{M} \exists{}A\in{}G\text{ s.t. }AX = Y.\]
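A simple instance of a transitive action, chosen here only for illustration and not taken from the text above, is the rotation group $SO(2)$ acting on the unit circle: for any two unit vectors $X$ and $Y$ there is a rotation $A$ with $AX = Y$. The helper names in the following sketch are hypothetical.

```python
import math


def rotation_between(X, Y):
    """Return the 2x2 rotation matrix A with A X = Y for unit vectors X, Y."""
    theta = math.atan2(Y[1], Y[0]) - math.atan2(X[1], X[0])
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def apply(A, X):
    """Apply a 2x2 matrix to a vector in the plane."""
    return [A[0][0] * X[0] + A[0][1] * X[1],
            A[1][0] * X[0] + A[1][1] * X[1]]


X = [1.0, 0.0]
Y = [math.cos(2.0), math.sin(2.0)]
A = rotation_between(X, Y)
AX = apply(A, X)
print(AX, Y)
```

The matrix `A` is orthogonal with determinant one by construction, so it is an element of $SO(2)$, and `AX` agrees with `Y` up to rounding.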

Now fix a distinguished element $E\in\mathcal{M}$. We can also establish an isomorphism between $\mathcal{M}$ and the quotient space $G/\sim$ with the equivalence relation:

\[A_1 \sim A_2 \iff A_1E = A_2E.\]

Note that this is independent of the chosen $E$.

The tangent spaces of $\mathcal{M}$ are of the form $T_Y\mathcal{M} = \mathfrak{g}\cdot{}Y$, i.e. they can be fully described through the Lie algebra $\mathfrak{g}$ of $G$. Based on this we can perform a splitting of $\mathfrak{g}$ into two parts:

  1. The vertical component $\mathfrak{g}^{\mathrm{ver},Y}$ is the kernel of the map $\mathfrak{g}\to{}T_Y\mathcal{M}, V \mapsto VY$, i.e. $\mathfrak{g}^{\mathrm{ver},Y} = \{V\in\mathfrak{g}:VY = 0\}.$

  2. The horizontal component $\mathfrak{g}^{\mathrm{hor},Y}$ is the orthogonal complement of $\mathfrak{g}^{\mathrm{ver},Y}$ in $\mathfrak{g}$. It is isomorphic to $T_Y\mathcal{M}$.

We will refer to the mapping from $T_Y\mathcal{M}$ to $\mathfrak{g}^{\mathrm{hor}, Y}$ by $\Omega$. If we have now defined a metric $\langle\cdot,\cdot\rangle$ on $\mathfrak{g}$, then this induces a Riemannian metric on $\mathcal{M}$:

\[g_Y(\Delta_1, \Delta_2) = \langle\Omega(Y,\Delta_1),\Omega(Y,\Delta_2)\rangle\text{ for $\Delta_1,\Delta_2\in{}T_Y\mathcal{M}$.}\]

Two examples of homogeneous spaces implemented in GeometricMachineLearning are the Stiefel and the Grassmann manifold.

References

  • Frankel, Theodore. The geometry of physics: an introduction. Cambridge university press, 2011.
diff --git a/latest/manifolds/inverse_function_theorem/index.html b/latest/manifolds/inverse_function_theorem/index.html index bee7fec55..6d3db8901 100644 --- a/latest/manifolds/inverse_function_theorem/index.html +++ b/latest/manifolds/inverse_function_theorem/index.html @@ -1,7 +1,7 @@ -The Inverse Function Theorem · GeometricMachineLearning.jl

The Inverse Function Theorem

The inverse function theorem gives a sufficient condition on a vector-valued function to be invertible in a neighborhood of a specific point. This theorem is critical in developing a theory of manifolds and serves as a basis for the submersion theorem. Here we first state the theorem and then give a proof.

Theorem (Inverse function theorem): Consider a vector-valued differentiable function $F:\mathbb{R}^N\to\mathbb{R}^N$ and assume its Jacobian is non-degenerate at a point $x\in\mathbb{R}^N$. Then there exists a neighborhood $U$ that contains $F(x)$ and on which $F$ is invertible, i.e. $\exists{}H:U\to\mathbb{R}^N$ s.t. $\forall{}y\in{}U,\,F\circ{}H(y) = y$ and the inverse is differentiable.

Proof: Consider a mapping $F:\mathbb{R}^N\to\mathbb{R}^N$ and assume its Jacobian has full rank at the point $x$, i.e. $\det{}F'(x)\neq0$. Now consider a ball around $x$ whose radius $r$ we do not yet fix and two points $y$ and $z$ in that ball: $y,z\in{}B(x,r)$. We further introduce the function $G(y):=F(y)-F'(x)y$. By the mean value theorem we have $|G(z) - G(y)|\leq|z-y|\sup_{0<t<1}||G'(y + t(z-y))||$ where $||\cdot||$ is the operator norm. Because $G'$ is continuous and $G'(x)=0$ there must exist an $r$ s.t. $\forall{}w\in{}B(x,r),\,||G'(w)||<\frac{1}{2}||F'(x)^{-1}||^{-1}$. $F$ must then be injective on $B(x,r)$ (and hence invertible on $F(B(x,r))$). Assume for the moment it is not. We can then find two distinct elements $y, z\in{}B(x,r)$ s.t. $F(z) - F(y) = 0$. This implies $|G(z) - G(y)| = |F'(x)(z - y)|\geq||F'(x)^{-1}||^{-1}|z-y|$, which contradicts the bound $|G(z)-G(y)|\leq\frac{1}{2}||F'(x)^{-1}||^{-1}|z-y|$ obtained from the mean value theorem. The inverse (which we call $H:F(B(x,r))\to{}B(x,r)$) is also continuous by the last theorem presented in the section on basic topological concepts[1]. We still have to prove differentiability of the inverse. We now prove that the derivative of $H$ at $F(x)$ exists and that it is equal to $F'(x)^{-1}$. For this we denote $F(x)$ by $\xi$ and let $\eta$ with $\xi+\eta\in{}F(B(x,r))$ go to zero.

\[\begin{aligned} +The Inverse Function Theorem · GeometricMachineLearning.jl

The Inverse Function Theorem

The inverse function theorem gives a sufficient condition on a vector-valued function to be invertible in a neighborhood of a specific point. This theorem is critical in developing a theory of manifolds and serves as a basis for the submersion theorem. Here we first state the theorem and then give a proof.

Theorem (Inverse function theorem): Consider a vector-valued differentiable function $F:\mathbb{R}^N\to\mathbb{R}^N$ and assume its Jacobian is non-degenerate at a point $x\in\mathbb{R}^N$. Then there exists a neighborhood $U$ that contains $F(x)$ and on which $F$ is invertible, i.e. $\exists{}H:U\to\mathbb{R}^N$ s.t. $\forall{}y\in{}U,\,F\circ{}H(y) = y$ and the inverse is differentiable.

Proof: Consider a mapping $F:\mathbb{R}^N\to\mathbb{R}^N$ and assume its Jacobian has full rank at the point $x$, i.e. $\det{}F'(x)\neq0$. Now consider a ball around $x$ whose radius $r$ we do not yet fix and two points $y$ and $z$ in that ball: $y,z\in{}B(x,r)$. We further introduce the function $G(y):=F(y)-F'(x)y$. By the mean value theorem we have $|G(z) - G(y)|\leq|z-y|\sup_{0<t<1}||G'(y + t(z-y))||$ where $||\cdot||$ is the operator norm. Because $G'$ is continuous and $G'(x)=0$ there must exist an $r$ s.t. $\forall{}w\in{}B(x,r),\,||G'(w)||<\frac{1}{2}||F'(x)^{-1}||^{-1}$. $F$ must then be injective on $B(x,r)$ (and hence invertible on $F(B(x,r))$). Assume for the moment it is not. We can then find two distinct elements $y, z\in{}B(x,r)$ s.t. $F(z) - F(y) = 0$. This implies $|G(z) - G(y)| = |F'(x)(z - y)|\geq||F'(x)^{-1}||^{-1}|z-y|$, which contradicts the bound $|G(z)-G(y)|\leq\frac{1}{2}||F'(x)^{-1}||^{-1}|z-y|$ obtained from the mean value theorem. The inverse (which we call $H:F(B(x,r))\to{}B(x,r)$) is also continuous by the last theorem presented in the section on basic topological concepts[1]. We still have to prove differentiability of the inverse. We now prove that the derivative of $H$ at $F(x)$ exists and that it is equal to $F'(x)^{-1}$. For this we denote $F(x)$ by $\xi$ and let $\eta$ with $\xi+\eta\in{}F(B(x,r))$ go to zero.

\[\begin{aligned} |\eta|^{-1}|H(\xi+\eta) - H(\xi) - F'(x)^{-1}\eta| & \leq |\eta|^{-1}||F'(x)^{-1}|||F'(x)H(\xi+\eta)-F'(x)H(\xi) -\eta| \\ & = |\eta|^{-1}||F'(x)^{-1}|||F(H(\xi+\eta)) - G(H(\xi+\eta)) - F(H(\xi)) + G(H(\xi)) - \eta| \\ & = |\eta|^{-1}||F'(x)^{-1}|||\xi + \eta - G(H(\xi+\eta)) - \xi + G(H(\xi)) - \eta| \\ & = |\eta|^{-1}||F'(x)^{-1}|||G(H(\xi+\eta)) - G(H(\xi))|, -\end{aligned}\]

and this goes to zero as $\eta$ goes to zero, because $H$ is continuous and therefore $H(\xi+\eta)$ goes to $H(\xi)=x$ and the expression on the right goes to zero as well.

References

[6]
S. Lang. Fundamentals of differential geometry. Vol. 191 (Springer Science & Business Media, 2012).
  • 1In order to apply said theorem we must have a mapping from a compact space to a Hausdorff space. The image is clearly Hausdorff. For compactness, we could further restrict our ball to $B(x,r/2)$, then $G$ and its inverse are at least continuous on the closure of $B(x,r/2)$ (or its image respectively) and hence also on $B(x,r/2)$.
+\end{aligned}\]

and this goes to zero as $\eta$ goes to zero, because $H$ is continuous and therefore $H(\xi+\eta)$ goes to $H(\xi)=x$ and the expression on the right goes to zero as well.
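The final limit can be spelled out in a little more detail. The following is only a sketch; it assumes that $G(y) = F(y) - F'(x)y$ (so that $G'(x) = 0$) and that $r$ was chosen such that $||G'(w)||\leq\frac{1}{2}||F'(x)^{-1}||^{-1}$ for all $w\in{}B(x,r)$. Under these assumptions $H$ is Lipschitz continuous and the quotient above is controlled by $G'$ near $x$:

\[\begin{aligned}
|z - y| & \leq ||F'(x)^{-1}||\,|F'(x)(z - y)| \leq ||F'(x)^{-1}||\left(|F(z) - F(y)| + |G(z) - G(y)|\right) \leq ||F'(x)^{-1}||\,|F(z) - F(y)| + \frac{1}{2}|z - y|, \\
|H(\xi+\eta) - H(\xi)| & \leq 2||F'(x)^{-1}||\,|\eta| \quad\text{(set $z = H(\xi+\eta)$ and $y = H(\xi)$ in the first line)}, \\
|\eta|^{-1}|G(H(\xi+\eta)) - G(H(\xi))| & \leq 2||F'(x)^{-1}||\sup_{0<t<1}||G'(x + t(H(\xi+\eta) - x))||.
\end{aligned}\]

The supremum in the last line goes to zero because $G'$ is continuous, $G'(x) = 0$ and $H(\xi+\eta)\to{}x$.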

References

[6]
S. Lang. Fundamentals of differential geometry. Vol. 191 (Springer Science & Business Media, 2012).
  • 1In order to apply said theorem we must have a mapping from a compact space to a Hausdorff space. The image is clearly Hausdorff. For compactness, we could further restrict our ball to $B(x,r/2)$, then $F$ and its inverse are at least continuous on the closure of $B(x,r/2)$ (or its image respectively) and hence also on $B(x,r/2)$.
diff --git a/latest/manifolds/manifolds/index.html b/latest/manifolds/manifolds/index.html index 8e9c9c16c..5c172eaeb 100644 --- a/latest/manifolds/manifolds/index.html +++ b/latest/manifolds/manifolds/index.html @@ -1,2 +1,2 @@ -General Theory on Manifolds · GeometricMachineLearning.jl

(Matrix) Manifolds

Manifolds are topological spaces that locally look like vector spaces. In the following we restrict ourselves to finite-dimensional manifolds. Definition: A finite-dimensional smooth manifold of dimension $n$ is a second-countable Hausdorff space $\mathcal{M}$ for which $\forall{}x\in\mathcal{M}$ we can find a neighborhood $U$ that contains $x$ and a corresponding homeomorphism $\varphi_U:U\cong{}W\subset\mathbb{R}^n$ where $W$ is an open subset. The homeomorphisms $\varphi_U$ are referred to as coordinate charts. If two such coordinate charts overlap, i.e. if $U_1\cap{}U_2\neq\{\}$, then the map $\varphi_{U_2}^{-1}\circ\varphi_{U_1}$ is $C^\infty$.

One example of a manifold that is also important for GeometricMachineLearning.jl is the Lie group[1] of orthonormal matrices $SO(N)$. Before we can proof that $SO(N)$ is a manifold we first need another definition and a theorem:

Definition: Consider a smooth mapping $g: \mathcal{M}\to\mathcal{N}$ from one manifold to another. A point $B\in\mathcal{M}$ is called a regular value of $\mathcal{M}$ if $\forall{}A\in{}g^{-1}\{B\}$ the map $T_Ag:T_A\mathcal{M}\to{}T_{g(A)}\mathcal{N}$ is surjective.

Theorem: Consider a smooth map $g:\mathcal{M}\to\mathcal{N}$ from one manifold to another. Then the preimage of a regular point $B$ of $\mathcal{N}$ is a submanifold of $\mathcal{M}$. Furthermore the codimension of $g^{-1}\{B\}$ is equal to the dimension of $\mathcal{N}$ and the tangent space $T_A(g^{-1}\{B\})$ is equal to the kernel of $T_Ag$. This is known as the preimage theorem.

Proof:

Theorem: The group $SO(N)$ is a Lie group (i.e. has manifold structure). Proof: The vector space $\mathbb{R}^{N\times{}N}$ clearly has manifold structure. The group $SO(N)$ is equivalent to one of the level sets of the mapping: $f:\mathbb{R}^{N\times{}N}\to\mathcal{S}(N), A\mapsto{}A^TA$, i.e. it is the component of $f^{-1}\{\mathbb{I}\}$ that contains $\mathbb{I}$. We still need to proof that $\mathbb{I}$ is a regular point of $f$, i.e. that for $A\in{}SO(N)$ the mapping $T_Af$ is surjective. This means that $\forall{}B\in\mathcal{S}(N), A\in\mathbb{R}^{N\times{}N}$ $\exists{}C\in\mathbb{R}^{N\times{}N}$ s.t. $C^TA + A^TC = B$. The element $C=\frac{1}{2}AB\in\mathcal{R}^{N\times{}N}$ satisfies this property.

With the definition above we can generalize the notion of an ordinary differential equation (ODE) on a vector space to an ordinary differential equation on a manifold:

Definition: An ODE on a manifold is a mapping that assigns to each element of the manifold $A\in\mathcal{M}$ an element of the corresponding tangent space $T_A\mathcal{M}$.

  • 1Lie groups are manifolds that also have a group structure, i.e. there is an operation $\mathcal{M}\times\mathcal{M}\to\mathcal{M},(a,b)\mapsto{}ab$ s.t. $(ab)c = a(bc)$ and $\exists{}e\mathcal{M}$ s.t. $ae$ = $a$ $\forall{}a\in\mathcal{M}$.
+General Theory on Manifolds · GeometricMachineLearning.jl

(Matrix) Manifolds

Manifolds are topological spaces that locally look like vector spaces. In the following we restrict ourselves to finite-dimensional manifolds. Definition: A finite-dimensional smooth manifold of dimension $n$ is a second-countable Hausdorff space $\mathcal{M}$ for which $\forall{}x\in\mathcal{M}$ we can find a neighborhood $U$ that contains $x$ and a corresponding homeomorphism $\varphi_U:U\cong{}W\subset\mathbb{R}^n$ where $W$ is an open subset. The homeomorphisms $\varphi_U$ are referred to as coordinate charts. If two such coordinate charts overlap, i.e. if $U_1\cap{}U_2\neq\{\}$, then the map $\varphi_{U_2}\circ\varphi_{U_1}^{-1}$ (defined on $\varphi_{U_1}(U_1\cap{}U_2)$) is $C^\infty$.

One example of a manifold that is also important for GeometricMachineLearning.jl is the Lie group[1] of special orthogonal matrices $SO(N)$. Before we can prove that $SO(N)$ is a manifold we first need another definition and a theorem:

Definition: Consider a smooth mapping $g: \mathcal{M}\to\mathcal{N}$ from one manifold to another. A point $B\in\mathcal{N}$ is called a regular value of $g$ if $\forall{}A\in{}g^{-1}\{B\}$ the map $T_Ag:T_A\mathcal{M}\to{}T_{g(A)}\mathcal{N}$ is surjective.

Theorem: Consider a smooth map $g:\mathcal{M}\to\mathcal{N}$ from one manifold to another. Then the preimage of a regular value $B\in\mathcal{N}$ is a submanifold of $\mathcal{M}$. Furthermore the codimension of $g^{-1}\{B\}$ is equal to the dimension of $\mathcal{N}$ and the tangent space $T_A(g^{-1}\{B\})$ is equal to the kernel of $T_Ag$. This is known as the preimage theorem.

Proof:

Theorem: The group $SO(N)$ is a Lie group (i.e. has manifold structure). Proof: The vector space $\mathbb{R}^{N\times{}N}$ clearly has manifold structure. The group $SO(N)$ is equivalent to one of the level sets of the mapping: $f:\mathbb{R}^{N\times{}N}\to\mathcal{S}(N), A\mapsto{}A^TA$, i.e. it is the component of $f^{-1}\{\mathbb{I}\}$ that contains $\mathbb{I}$. We still need to prove that $\mathbb{I}$ is a regular value of $f$, i.e. that for $A\in{}SO(N)$ the mapping $T_Af:C\mapsto{}C^TA + A^TC$ is surjective. This means that $\forall{}B\in\mathcal{S}(N), A\in{}SO(N)$ $\exists{}C\in\mathbb{R}^{N\times{}N}$ s.t. $C^TA + A^TC = B$. The element $C=\frac{1}{2}AB\in\mathbb{R}^{N\times{}N}$ satisfies this property, because $B^T = B$ and $A^TA = \mathbb{I}$.
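The surjectivity argument can be checked numerically. The following is a minimal sketch in plain Python (not part of GeometricMachineLearning.jl); the helper names `matmul`, `transpose`, `add`, `scale` and `tangent_map` are introduced here only for illustration. It takes a rotation $A\in{}SO(2)$, a symmetric matrix $B$, and confirms that $C=\frac{1}{2}AB$ is mapped to $B$ by $C\mapsto{}C^TA + A^TC$.

```python
import math

def matmul(X, Y):
    """Product of two matrices stored as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def scale(c, X):
    return [[c * a for a in row] for row in X]

def tangent_map(A, C):
    """Tangent map of f(A) = A^T A at A, applied to C: C^T A + A^T C."""
    return add(matmul(transpose(C), A), matmul(transpose(A), C))

# A rotation matrix, i.e. an element of SO(2).
theta = 0.3
A = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]

# An arbitrary symmetric matrix B that we want to hit.
B = [[2.0, -1.0],
     [-1.0, 5.0]]

# The candidate preimage from the proof.
C = scale(0.5, matmul(A, B))

image = tangent_map(A, C)
max_error = max(abs(image[i][j] - B[i][j]) for i in range(2) for j in range(2))
```

Here `max_error` is zero up to floating-point rounding, which mirrors the computation $C^TA + A^TC = \frac{1}{2}B^TA^TA + \frac{1}{2}A^TAB = B$.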

With the definition above we can generalize the notion of an ordinary differential equation (ODE) on a vector space to an ordinary differential equation on a manifold:

Definition: An ODE on a manifold is a mapping that assigns to each element of the manifold $A\in\mathcal{M}$ an element of the corresponding tangent space $T_A\mathcal{M}$.

  • 1Lie groups are manifolds that also have a group structure, i.e. there is an operation $\mathcal{M}\times\mathcal{M}\to\mathcal{M},(a,b)\mapsto{}ab$ s.t. $(ab)c = a(bc)$, $\exists{}e\in\mathcal{M}$ s.t. $ae = ea = a$ $\forall{}a\in\mathcal{M}$, and every element has an inverse.
diff --git a/latest/manifolds/stiefel_manifold/index.html b/latest/manifolds/stiefel_manifold/index.html index 167e4b093..a396f3d88 100644 --- a/latest/manifolds/stiefel_manifold/index.html +++ b/latest/manifolds/stiefel_manifold/index.html @@ -1,5 +1,5 @@ -Stiefel · GeometricMachineLearning.jl

Stiefel manifold

The Stiefel manifold $St(n, N)$ is the space (a homogeneous space) of all orthonormal frames in $\mathbb{R}^{N\times{}n}$, i.e. matrices $Y\in\mathbb{R}^{N\times{}n}$ s.t. $Y^TY = \mathbb{I}_n$. It can also be seen as the special orthonormal group $SO(N)$ modulo an equivalence relation: $A\sim{}B\iff{}AE = BE$ for

\[E = \begin{bmatrix} +Stiefel · GeometricMachineLearning.jl

Stiefel manifold

The Stiefel manifold $St(n, N)$ is the space (a homogeneous space) of all orthonormal frames in $\mathbb{R}^{N\times{}n}$, i.e. matrices $Y\in\mathbb{R}^{N\times{}n}$ s.t. $Y^TY = \mathbb{I}_n$. It can also be seen as the special orthogonal group $SO(N)$ modulo an equivalence relation: $A\sim{}B\iff{}AE = BE$ for

\[E = \begin{bmatrix} \mathbb{I}_n \\ \mathbb{O} -\end{bmatrix}\in\mathcal{M},\]

which is the canonical element of the Stiefel manifold. In words: the first $n$ columns of $A$ and $B$ are the same.

The tangent space to the element $Y\in{}St(n,N)$ can easily be determined:

\[T_YSt(n,N)=\{\Delta:\Delta^TY + Y^T\Delta = 0\}.\]

The Lie algebra of $SO(N)$ is $\mathfrak{so}(N):=\{V\in\mathbb{R}^{N\times{}N}:V^T + V = 0\}$ and the canonical metric associated with it is simply $(V_1,V_2)\mapsto\frac{1}{2}\mathrm{Tr}(V_1^TV_2)$.

The Riemannian Gradient

For matrix manifolds (like the Stiefel manifold), the Riemannian gradient of a function can be easily determined computationally:

The Euclidean gradient of a function $L$ is equivalent to an element of the cotangent space $T^*_Y\mathcal{M}$ via:

\[\langle\nabla{}L,\cdot\rangle:T_Y\mathcal{M} \to \mathbb{R}, \Delta \mapsto \sum_{ij}[\nabla{}L]_{ij}[\Delta]_{ij} = \mathrm{Tr}(\nabla{}L^T\Delta).\]

We can then utilize the Riemannian metric on $\mathcal{M}$ to map the element from the cotangent space (i.e. $\nabla{}L$) to the tangent space. This element is called $\mathrm{grad}_{(\cdot)}L$ here. Explicitly, it is given by:

\[ \mathrm{grad}_YL = \nabla_YL - Y(\nabla_YL)^TY\]

rgrad

What was referred to as $\nabla{}L$ before can in practice be obtained with an AD routine. We then use the function rgrad to map this Euclidean gradient to $\in{}T_YSt(n,N)$. This mapping has the property:

\[\mathrm{Tr}((\nabla{}L)^T\Delta) = g_Y(\mathtt{rgrad}(Y, \nabla{}L), \Delta) \forall\Delta\in{}T_YSt(n,N)\]

and $g$ is the Riemannian metric.

+\end{bmatrix}\in\mathcal{M},\]

which is the canonical element of the Stiefel manifold. In words: the first $n$ columns of $A$ and $B$ are the same.

The tangent space to the element $Y\in{}St(n,N)$ can easily be determined:

\[T_YSt(n,N)=\{\Delta:\Delta^TY + Y^T\Delta = 0\}.\]

The Lie algebra of $SO(N)$ is $\mathfrak{so}(N):=\{V\in\mathbb{R}^{N\times{}N}:V^T + V = 0\}$ and the canonical metric associated with it is simply $(V_1,V_2)\mapsto\frac{1}{2}\mathrm{Tr}(V_1^TV_2)$.

The Riemannian Gradient

For matrix manifolds (like the Stiefel manifold), the Riemannian gradient of a function can be easily determined computationally:

The Euclidean gradient of a function $L$ is equivalent to an element of the cotangent space $T^*_Y\mathcal{M}$ via:

\[\langle\nabla{}L,\cdot\rangle:T_Y\mathcal{M} \to \mathbb{R}, \Delta \mapsto \sum_{ij}[\nabla{}L]_{ij}[\Delta]_{ij} = \mathrm{Tr}(\nabla{}L^T\Delta).\]

We can then utilize the Riemannian metric on $\mathcal{M}$ to map the element from the cotangent space (i.e. $\nabla{}L$) to the tangent space. This element is called $\mathrm{grad}_{(\cdot)}L$ here. Explicitly, it is given by:

\[ \mathrm{grad}_YL = \nabla_YL - Y(\nabla_YL)^TY\]
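A quick way to build confidence in this formula is to check that its output really lies in the tangent space, i.e. that $\Delta = \mathrm{grad}_YL$ satisfies $\Delta^TY + Y^T\Delta = 0$. The sketch below does this in plain Python for $St(2, 3)$; the function name `riemannian_gradient` and the matrix helpers are assumptions made for this illustration and are not the library's API.

```python
import math

def matmul(X, Y):
    """Product of two matrices stored as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def sub(X, Y):
    return [[a - b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def riemannian_gradient(Y, G):
    """Map a Euclidean gradient G to G - Y G^T Y (hypothetical helper)."""
    return sub(G, matmul(matmul(Y, transpose(G)), Y))

# An element of St(2, 3): two orthonormal columns in R^3.
phi = 0.4
Y = [[math.cos(phi), 0.0],
     [math.sin(phi), 0.0],
     [0.0, 1.0]]

# An arbitrary Euclidean gradient of the same shape as Y.
G = [[1.0, -2.0],
     [0.5, 3.0],
     [-1.5, 0.25]]

Delta = riemannian_gradient(Y, G)

# Tangent space condition: Delta^T Y + Y^T Delta should vanish.
defect = add(matmul(transpose(Delta), Y), matmul(transpose(Y), Delta))
max_defect = max(abs(entry) for row in defect for entry in row)
```

The defect vanishes up to rounding because $Y^T\Delta = Y^T\nabla{}L - (\nabla{}L)^TY$ is skew-symmetric whenever $Y^TY = \mathbb{I}_n$.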

rgrad

What was referred to as $\nabla{}L$ before can in practice be obtained with an AD routine. We then use the function rgrad to map this Euclidean gradient to an element of $T_YSt(n,N)$. This mapping has the property:

\[\mathrm{Tr}((\nabla{}L)^T\Delta) = g_Y(\mathtt{rgrad}(Y, \nabla{}L), \Delta) \forall\Delta\in{}T_YSt(n,N)\]

and $g$ is the Riemannian metric.

diff --git a/latest/manifolds/submersion_theorem/index.html b/latest/manifolds/submersion_theorem/index.html index 6a5dab830..a082e4e13 100644 --- a/latest/manifolds/submersion_theorem/index.html +++ b/latest/manifolds/submersion_theorem/index.html @@ -1,2 +1,2 @@ -The Submersion Theorem · GeometricMachineLearning.jl
+The Submersion Theorem · GeometricMachineLearning.jl
diff --git a/latest/objects.inv b/latest/objects.inv index f8e627cd7..74b78e822 100644 Binary files a/latest/objects.inv and b/latest/objects.inv differ diff --git a/latest/optimizers/adam_optimizer/index.html b/latest/optimizers/adam_optimizer/index.html index 33f0ea37c..015d1f2f2 100644 --- a/latest/optimizers/adam_optimizer/index.html +++ b/latest/optimizers/adam_optimizer/index.html @@ -1,2 +1,2 @@ -Adam Optimizer · GeometricMachineLearning.jl

The Adam Optimizer

The Adam Optimizer is one of the most widely (if not the most widely used) neural network optimizer. Like most modern neural network optimizers it contains a cache that is updated based on first-order gradient information and then, in a second step, the cache is used to compute a velocity estimate for updating the neural networ weights.

Here we first describe the Adam algorithm for the case where all the weights are on a vector space and then show how to generalize this to the case where the weights are on a manifold.

All weights on a vector space

The cache of the Adam optimizer consists of first and second moments. The first moments $B_1$ store linear information about the current and previous gradients, and the second moments $B_2$ store quadratic information about current and previous gradients (all computed from a first-order gradient).

If all the weights are on a vector space, then we directly compute updates for $B_1$ and $B_2$:

  1. \[B_1 \gets ((\rho_1 - \rho_1^t)/(1 - \rho_1^t))\cdot{}B_1 + (1 - \rho_1)/(1 - \rho_1^t)\cdot{}\nabla{}L,\]

  2. \[B_2 \gets ((\rho_2 - \rho_1^t)/(1 - \rho_2^t))\cdot{}B_2 + (1 - \rho_2)/(1 - \rho_2^t)\cdot\nabla{}L\odot\nabla{}L,\]

    where $\odot:\mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R}^n$ is the Hadamard product: $[a\odot{}b]_i = a_ib_i$. $\rho_1$ and $\rho_2$ are hyperparameters. Their defaults, $\rho_1=0.9$ and $\rho_2=0.99$, are taken from (Goodfellow et al., 2016, page 301). After having updated the cache (i.e. $B_1$ and $B_2$) we compute a velocity (step 3) with which the parameters $Y_t$ are then updated (step 4).

  3. \[W_t\gets -\eta{}B_1/\sqrt{B_2 + \delta},\]

  4. \[Y_{t+1} \gets Y_t + W_t,\]

Here $\eta$ (with default 0.01) is the learning rate and $\delta$ (with default $3\cdot10^{-7}$) is a small constant that is added for stability. The division, square root and addition in step 3 are performed element-wise.

Weights on manifolds

The problem with generalizing Adam to manifolds is that the Hadamard product $\odot$ as well as the other element-wise operations ($/$, $\sqrt{}$ and $+$ in step 3 above) lack a clear geometric interpretation. In GeometricMachineLearning we get around this issue by utilizing a so-called global tangent space representation.

References

  • Goodfellow I, Bengio Y, Courville A. Deep learning[M]. MIT press, 2016.
[25]
I. Goodfellow, Y. Bengio and A. Courville. Deep learning (MIT press, Cambridge, MA, 2016).
+Adam Optimizer · GeometricMachineLearning.jl

The Adam Optimizer

The Adam Optimizer is one of the most widely used (if not the most widely used) neural network optimizers. Like most modern neural network optimizers it contains a cache that is updated based on first-order gradient information and then, in a second step, the cache is used to compute a velocity estimate for updating the neural network weights.

Here we first describe the Adam algorithm for the case where all the weights are on a vector space and then show how to generalize this to the case where the weights are on a manifold.

All weights on a vector space

The cache of the Adam optimizer consists of first and second moments. The first moments $B_1$ store linear information about the current and previous gradients, and the second moments $B_2$ store quadratic information about current and previous gradients (all computed from a first-order gradient).

If all the weights are on a vector space, then we directly compute updates for $B_1$ and $B_2$:

  1. \[B_1 \gets ((\rho_1 - \rho_1^t)/(1 - \rho_1^t))\cdot{}B_1 + (1 - \rho_1)/(1 - \rho_1^t)\cdot{}\nabla{}L,\]

  2. \[B_2 \gets ((\rho_2 - \rho_2^t)/(1 - \rho_2^t))\cdot{}B_2 + (1 - \rho_2)/(1 - \rho_2^t)\cdot\nabla{}L\odot\nabla{}L,\]

    where $\odot:\mathbb{R}^n\times\mathbb{R}^n\to\mathbb{R}^n$ is the Hadamard product: $[a\odot{}b]_i = a_ib_i$. $\rho_1$ and $\rho_2$ are hyperparameters. Their defaults, $\rho_1=0.9$ and $\rho_2=0.99$, are taken from (Goodfellow et al., 2016, page 301). After having updated the cache (i.e. $B_1$ and $B_2$) we compute a velocity (step 3) with which the parameters $Y_t$ are then updated (step 4).

  3. \[W_t\gets -\eta{}B_1/\sqrt{B_2 + \delta},\]

  4. \[Y_{t+1} \gets Y_t + W_t,\]

Here $\eta$ (with default 0.01) is the learning rate and $\delta$ (with default $3\cdot10^{-7}$) is a small constant that is added for stability. The division, square root and addition in step 3 are performed element-wise.
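Steps 1 to 4 can be sketched in a few lines of plain Python. This is an illustration under stated assumptions and not the implementation used in GeometricMachineLearning.jl: the function name `adam_step` is hypothetical, the weights are stored as a flat list, and the bias correction in step 2 is assumed to use $\rho_2^t$ throughout. The defaults are the ones quoted above.

```python
import math

def adam_step(Y, B1, B2, grad, t, eta=0.01, rho1=0.9, rho2=0.99, delta=3e-7):
    """One Adam update for weights Y on a vector space (hypothetical helper).

    B1, B2 are the first and second moments, grad is the Euclidean gradient
    and t >= 1 is the iteration counter.
    """
    # Step 1: update the first moments.
    B1 = [(rho1 - rho1**t) / (1 - rho1**t) * b + (1 - rho1) / (1 - rho1**t) * g
          for b, g in zip(B1, grad)]
    # Step 2: update the second moments (element-wise square of the gradient).
    B2 = [(rho2 - rho2**t) / (1 - rho2**t) * b + (1 - rho2) / (1 - rho2**t) * g * g
          for b, g in zip(B2, grad)]
    # Step 3: velocity; division, square root and addition are element-wise.
    W = [-eta * b1 / math.sqrt(b2 + delta) for b1, b2 in zip(B1, B2)]
    # Step 4: update the weights.
    Y = [y + w for y, w in zip(Y, W)]
    return Y, B1, B2

# First step for L(Y) = sum(Y_i^2), whose gradient is 2Y.
Y0 = [1.0, -2.0]
grad0 = [2.0 * y for y in Y0]
Y1, B1, B2 = adam_step(Y0, [0.0, 0.0], [0.0, 0.0], grad0, t=1)
```

For $t = 1$ the prefactors of the old cache vanish, so `B1` equals the gradient and `B2` its element-wise square. Every component of the weights therefore moves by almost exactly $\eta$ in the first step, independent of the magnitude of the gradient.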

Weights on manifolds

The problem with generalizing Adam to manifolds is that the Hadamard product $\odot$ as well as the other element-wise operations ($/$, $\sqrt{}$ and $+$ in step 3 above) lack a clear geometric interpretation. In GeometricMachineLearning we get around this issue by utilizing a so-called global tangent space representation.

References

[31]
I. Goodfellow, Y. Bengio and A. Courville. Deep learning (MIT press, Cambridge, MA, 2016).
diff --git a/latest/optimizers/bfgs_optimizer/index.html b/latest/optimizers/bfgs_optimizer/index.html index c31b053ea..b73bc7e3b 100644 --- a/latest/optimizers/bfgs_optimizer/index.html +++ b/latest/optimizers/bfgs_optimizer/index.html @@ -1,8 +1,8 @@ -BFGS Optimizer · GeometricMachineLearning.jl

The BFGS Algorithm

The presentation shown here is largely taken from chapters 3 and 6 of reference [9] with a derivation based on an online comment. The Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm is a second order optimizer that can be also be used to train a neural network.

It is a version of a quasi-Newton method and is therefore especially suited for convex problems. As is the case with any other (quasi-)Newton method the BFGS algorithm approximates the objective with a quadratic function in each optimization step:

\[m_k(x) = f(x_k) + (\nabla_{x_k}f)^T(x - x_k) + \frac{1}{2}(x - x_k)^TB_k(x - x_k),\]

where $B_k$ is referred to as the approximate Hessian. We further require $B_k$ to be symmetric and positive definite. Differentiating the above expression and setting the derivative to zero gives us:

\[\nabla_xm_k = \nabla_{x_k}f + B_k(x - x_k) = 0,\]

or written differently:

\[x - x_k = -B_k^{-1}\nabla_{x_k}f.\]

This value we will from now on call $p_k := x - x_k$ and refer to as the search direction. The new iterate then is:

\[x_{k+1} = x_k + \alpha_kp_k,\]

where $\alpha_k$ is the step length. Techniques that describe how to pick an appropriate $\alpha_k$ are called line-search methods and are discussed below. First we discuss what requirements we impose on $B_k$. A first reasonable condition would be to require the gradient of $m_k$ to be equal to that of $f$ at the points $x_{k-1}$ and $x_k$:

\[\begin{aligned} +BFGS Optimizer · GeometricMachineLearning.jl

The BFGS Algorithm

The presentation shown here is largely taken from chapters 3 and 6 of reference [12] with a derivation based on an online comment. The Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm is a second order optimizer that can also be used to train a neural network.

It is a version of a quasi-Newton method and is therefore especially suited for convex problems. As is the case with any other (quasi-)Newton method the BFGS algorithm approximates the objective with a quadratic function in each optimization step:

\[m_k(x) = f(x_k) + (\nabla_{x_k}f)^T(x - x_k) + \frac{1}{2}(x - x_k)^TB_k(x - x_k),\]

where $B_k$ is referred to as the approximate Hessian. We further require $B_k$ to be symmetric and positive definite. Differentiating the above expression and setting the derivative to zero gives us:

\[\nabla_xm_k = \nabla_{x_k}f + B_k(x - x_k) = 0,\]

or written differently:

\[x - x_k = -B_k^{-1}\nabla_{x_k}f.\]

This value we will from now on call $p_k := x - x_k$ and refer to as the search direction. The new iterate then is:

\[x_{k+1} = x_k + \alpha_kp_k,\]

where $\alpha_k$ is the step length. Techniques that describe how to pick an appropriate $\alpha_k$ are called line-search methods and are discussed below. First we discuss what requirements we impose on $B_k$. A first reasonable condition would be to require the gradient of $m_k$ to be equal to that of $f$ at the points $x_{k-1}$ and $x_k$:

\[\begin{aligned} \nabla_{x_k}m_k & = \nabla_{x_k}f + B_k(x_k - x_k) & \overset{!}{=} \nabla_{x_k}f \text{ and } \\ \nabla_{x_{k-1}}m_k & = \nabla_{x_k}f + B_k(x_{k-1} - x_k) & \overset{!}{=} \nabla_{x_{k-1}}f. -\end{aligned}\]

The first one of these conditions is of course automatically satisfied. The second one can be rewritten as:

\[B_k(x_k - x_{k-1}) = \overset{!}{=} \nabla_{x_k}f - \nabla_{x_{k-1}}f. \]

The following notations are often used:

\[s_{k-1} := \alpha_{k-1}p_{k-1} := x_{k} - x_{k-1} \text{ and } y_{k-1} := \nabla_{x_k}f - \nabla_{x_{k-1}}f. \]

The conditions mentioned above then becomes:

\[B_ks_{k-1} \overset{!}{=} y_{k-1},\]

and we call it the secant equation. A second condition we impose on $B_k$ is that is has to be positive-definite at point $s_{k-1}$:

\[s_{k-1}^Ty_{k-1} > 0.\]

This is referred to as the curvature condition. If we impose the Wolfe conditions, the curvature condition hold automatically. The Wolfe conditions are stated with respect to the parameter $\alpha_k$.

The Wolfe conditions are:

  1. $f(x_k+\alpha{}p_k)\leq{}f(x_k) + c_1\alpha(\nabla_{x_k}f)^Tp_k$ for $c_1\in(0,1)$.
  2. $(\nabla_{(x_k + \alpha_kp_k)}f)^Tp_k \geq c_2(\nabla_{x_k}f)^Tp_k$ for $c_2\in(c_1,1)$.

A possible choice for $c_1$ and $c_2$ are $10^{-4}$ and $0.9$ (see [9]). The two Wolfe conditions above are respectively called the sufficient decrease condition and the curvature condition respectively. Note that the second Wolfe condition (also called curvature condition) is stronger than the one mentioned before under the assumption that the first Wolfe condition is true:

\[(\nabla_{x_k}f)^Tp_{k-1} - c_2(\nabla_{x_{k-1}}f)^Tp_{k-1} = y_{k-1}^Tp_{k-1} + (1 - c_2)(\nabla_{x_{k-1}}f)^Tp_{k-1} \geq 0,\]

and the second term in this expression is $(1 - c_2)(\nabla_{x_{k-1}}f)^Tp_{k-1}\geq\frac{1-c_2}{c_1\alpha_{k-1}}(f(x_k) - f(x_{k-1}))$, which is negative.

In order to pick the ideal $B_k$ we solve the following problem:

\[\begin{aligned} +\end{aligned}\]

The first one of these conditions is of course automatically satisfied. The second one can be rewritten as:

\[B_k(x_k - x_{k-1}) \overset{!}{=} \nabla_{x_k}f - \nabla_{x_{k-1}}f. \]

The following notations are often used:

\[s_{k-1} := \alpha_{k-1}p_{k-1} := x_{k} - x_{k-1} \text{ and } y_{k-1} := \nabla_{x_k}f - \nabla_{x_{k-1}}f. \]

The condition mentioned above then becomes:

\[B_ks_{k-1} \overset{!}{=} y_{k-1},\]

and we call it the secant equation. A second condition we impose on $B_k$ is that it has to be positive-definite in the direction $s_{k-1}$, i.e. $s_{k-1}^TB_ks_{k-1} > 0$. Together with the secant equation this reads:

\[s_{k-1}^Ty_{k-1} > 0.\]

This is referred to as the curvature condition. If we impose the Wolfe conditions, the curvature condition holds automatically. The Wolfe conditions are stated with respect to the parameter $\alpha_k$.

The Wolfe conditions are:

  1. $f(x_k+\alpha_kp_k)\leq{}f(x_k) + c_1\alpha_k(\nabla_{x_k}f)^Tp_k$ for $c_1\in(0,1)$.
  2. $(\nabla_{(x_k + \alpha_kp_k)}f)^Tp_k \geq c_2(\nabla_{x_k}f)^Tp_k$ for $c_2\in(c_1,1)$.

A possible choice for $c_1$ and $c_2$ is $10^{-4}$ and $0.9$ (see [12]). The two Wolfe conditions above are called the sufficient decrease condition and the curvature condition respectively. Note that the second Wolfe condition (also called curvature condition) is stronger than the one mentioned before under the assumption that the first Wolfe condition is true:

\[(\nabla_{x_k}f)^Tp_{k-1} - c_2(\nabla_{x_{k-1}}f)^Tp_{k-1} = y_{k-1}^Tp_{k-1} + (1 - c_2)(\nabla_{x_{k-1}}f)^Tp_{k-1} \geq 0,\]

and the second term in this expression is $(1 - c_2)(\nabla_{x_{k-1}}f)^Tp_{k-1}\geq\frac{1-c_2}{c_1\alpha_{k-1}}(f(x_k) - f(x_{k-1}))$, which is negative.
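To make the two Wolfe conditions concrete, the following sketch evaluates them for a simple quadratic objective. It is an illustration only: the objective $f(x) = \frac{1}{2}x^Tx$, the function name `wolfe_conditions` and the chosen step lengths are assumptions made for this example, while $c_1 = 10^{-4}$ and $c_2 = 0.9$ are the values quoted above.

```python
def f(x):
    """Example objective f(x) = 0.5 * x^T x."""
    return 0.5 * sum(xi * xi for xi in x)

def grad_f(x):
    """Gradient of the example objective."""
    return list(x)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def wolfe_conditions(x, p, alpha, c1=1e-4, c2=0.9):
    """Return (sufficient_decrease, curvature) for step length alpha."""
    x_new = [xi + alpha * pi for xi, pi in zip(x, p)]
    slope = dot(grad_f(x), p)  # directional derivative at x along p
    sufficient_decrease = f(x_new) <= f(x) + c1 * alpha * slope
    curvature = dot(grad_f(x_new), p) >= c2 * slope
    return sufficient_decrease, curvature

x = [1.0, 0.0]
p = [-1.0, 0.0]  # steepest descent direction at x

good_step = wolfe_conditions(x, p, alpha=1.0)    # lands on the minimizer
too_long = wolfe_conditions(x, p, alpha=2.5)     # overshoots, f increases
too_short = wolfe_conditions(x, p, alpha=0.01)   # slope has barely changed
```

The first condition rules out steps that are too long, the second one rules out steps that are too short; only `good_step` satisfies both.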

In order to pick the ideal $B_k$ we solve the following problem:

\[\begin{aligned} \min_B & ||B - B_{k-1}||_W \\ \text{s.t.} & B = B^T\text{ and }Bs_{k-1}=y_{k-1}, \end{aligned}\]

where the first condition is symmetry and the second one is the secant equation. For the norm $||\cdot||_W$ we pick the weighted Frobenius norm:

\[||A||_W := ||W^{1/2}AW^{1/2}||_F,\]

where $||\cdot||_F$ is the usual Frobenius norm[1] and the matrix $W=\bar{G}_{k-1}^{-1}$ is the inverse of the average Hessian:

\[\bar{G}_{k-1} = \int_0^1 \nabla^2f(x_{k-1} + \tau\alpha_{k-1}p_{k-1})d\tau.\]

In order to find the ideal $B_k$ under the conditions described above, we introduce some notation:

  • $\tilde{B}_{k-1} := W^{1/2}B_{k-1}W^{1/2}$,
  • $\tilde{B} := W^{1/2}BW^{1/2}$,
  • $\tilde{y}_{k-1} := W^{1/2}y_{k-1}$,
  • $\tilde{s}_{k-1} := W^{-1/2}s_{k-1}$.

With this notation we can rewrite the problem of finding $B_k$ as:

\[\begin{aligned} @@ -14,4 +14,4 @@ u^T\tilde{B}_{k-1}u - 1 & u^T\tilde{B}_{k-1}u \\ u_\perp^T\tilde{B}_{k-1}u & u_\perp^T(\tilde{B}_{k-1}-\tilde{B}_k)u_\perp \end{bmatrix}. -\end{aligned}\]

By a property of the Frobenius norm:

\[||\tilde{B}_{k-1} - \tilde{B}||^2_F = (u^T\tilde{B}_{k-1} -1)^2 + ||u^T\tilde{B}_{k-1}u_\perp||_F^2 + ||u_\perp^T\tilde{B}_{k-1}u||_F^2 + ||u_\perp^T(\tilde{B}_{k-1} - \tilde{B})u_\perp||_F^2.\]

We see that $\tilde{B}$ only appears in the last term, which should therefore be made zero. This then gives:

\[\tilde{B} = U\begin{bmatrix} 1 & 0 \\ 0 & u^T_\perp\tilde{B}_{k-1}u_\perp \end{bmatrix} = uu^T + (\mathbb{I}-uu^T)\tilde{B}_{k-1}(\mathbb{I}-uu^T).\]

If we now map back to the original coordinate system, the ideal solution for $B_k$ is:

\[B_k = (\mathbb{I} - \frac{1}{y_{k-1}^Ts_{k-1}}y_{k-1}s_{k-1}^T)B_{k-1}(\mathbb{I} - \frac{1}{y_{k-1}^Ts_{k-1}}s_{k-1}y_{k-1}^T) + \frac{1}{y_{k-1}^Ts_{k-1}}y_ky_k^T.\]

What we need in practice however is not $B_k$, but its inverse $H_k$. This is because we need to find $s_{k-1}$ based on $y_{k-1}$. To get $H_k$ based on the expression for $B_k$ above we can use the Sherman-Morrison-Woodbury formula[3] to obtain:

\[H_{k} = H_{k-1} - \frac{H_{k-1}y_{k-1}y_{k-1}^TH_{k-1}}{y_{k-1}^TH_{k-1}y_{k-1}} + \frac{s_{k-1}s_{k-1}^T}{y_{k-1}^Ts_{k-1}}.\]

TODO: Example where this works well!

References

[9]
J. N. Stephen J. Wright. Numerical optimization (Springer Science+Business Media, 2006).
  • 1The Frobenius norm is $||A||_F^2 = \sum_{i,j}a_{ij}^2$.
  • 2So we must have $u^Tu_\perp=0$ and further $u_\perp^Tu_\perp=\mathbb{I}$.
  • 3The Sherman-Morrison-Woodbury formula states $(A + UCV)^{-1} = A^{-1} - A^{-1} - A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1}$.
+\end{aligned}\]

By a property of the Frobenius norm:

\[||\tilde{B}_{k-1} - \tilde{B}||^2_F = (u^T\tilde{B}_{k-1}u -1)^2 + ||u^T\tilde{B}_{k-1}u_\perp||_F^2 + ||u_\perp^T\tilde{B}_{k-1}u||_F^2 + ||u_\perp^T(\tilde{B}_{k-1} - \tilde{B})u_\perp||_F^2.\]

We see that $\tilde{B}$ only appears in the last term, which should therefore be made zero. This then gives:

\[\tilde{B} = U\begin{bmatrix} 1 & 0 \\ 0 & u^T_\perp\tilde{B}_{k-1}u_\perp \end{bmatrix}U^T = uu^T + (\mathbb{I}-uu^T)\tilde{B}_{k-1}(\mathbb{I}-uu^T).\]

If we now map back to the original coordinate system, the ideal solution for $B_k$ is:

\[B_k = (\mathbb{I} - \frac{1}{y_{k-1}^Ts_{k-1}}y_{k-1}s_{k-1}^T)B_{k-1}(\mathbb{I} - \frac{1}{y_{k-1}^Ts_{k-1}}s_{k-1}y_{k-1}^T) + \frac{1}{y_{k-1}^Ts_{k-1}}y_{k-1}y_{k-1}^T.\]

What we need in practice however is not $B_k$, but its inverse $H_k$. This is because we need to find $s_{k-1}$ based on $y_{k-1}$. To get $H_k$ based on the expression for $B_k$ above we can use the Sherman-Morrison-Woodbury formula[3] to obtain:

\[H_{k} = H_{k-1} - \frac{H_{k-1}y_{k-1}y_{k-1}^TH_{k-1}}{y_{k-1}^TH_{k-1}y_{k-1}} + \frac{s_{k-1}s_{k-1}^T}{y_{k-1}^Ts_{k-1}}.\]
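A useful sanity check for this update is the secant equation in its inverse form, $H_ky_{k-1} = s_{k-1}$. The sketch below implements the formula above in plain Python and verifies this property; the function name `inverse_hessian_update` is an assumption made for this example, and $H_{k-1}$ is assumed to be symmetric so that $H_{k-1}y_{k-1}y_{k-1}^TH_{k-1}$ is the outer product of $H_{k-1}y_{k-1}$ with itself.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def matvec(M, v):
    return [dot(row, v) for row in M]

def inverse_hessian_update(H, s, y):
    """Apply H - (H y y^T H) / (y^T H y) + (s s^T) / (y^T s).

    H is assumed symmetric; s and y must satisfy y^T s > 0.
    """
    Hy = matvec(H, y)
    yHy = dot(y, Hy)
    ys = dot(y, s)
    n = len(s)
    return [[H[i][j] - Hy[i] * Hy[j] / yHy + s[i] * s[j] / ys
             for j in range(n)] for i in range(n)]

# Start from the identity as the initial inverse Hessian approximation.
H_prev = [[1.0, 0.0],
          [0.0, 1.0]]
s = [1.0, 2.0]  # step x_k - x_{k-1}
y = [2.0, 1.0]  # gradient difference; y^T s = 4 > 0

H_new = inverse_hessian_update(H_prev, s, y)
secant_residual = [a - b for a, b in zip(matvec(H_new, y), s)]
```

The residual vanishes up to rounding because the second term removes $H_{k-1}y_{k-1}$ and the third term contributes exactly $s_{k-1}$ when applied to $y_{k-1}$.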

TODO: Example where this works well!

References

[12]
J. Nocedal and S. J. Wright. Numerical optimization (Springer Science+Business Media, 2006).
  • 1The Frobenius norm is $||A||_F^2 = \sum_{i,j}a_{ij}^2$.
  • 2So we must have $u^Tu_\perp=0$ and further $u_\perp^Tu_\perp=\mathbb{I}$.
  • 3The Sherman-Morrison-Woodbury formula states $(A + UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1}$.
diff --git a/latest/optimizers/general_optimization/index.html b/latest/optimizers/general_optimization/index.html index fe8650353..6305f228e 100644 --- a/latest/optimizers/general_optimization/index.html +++ b/latest/optimizers/general_optimization/index.html @@ -1,2 +1,2 @@ -General Optimization · GeometricMachineLearning.jl

Optimization for Neural Networks

Optimization for neural networks is (almost always) some variation on gradient descent. The most basic form of gradient descent is a discretization of the gradient flow equation:

\[\dot{\theta} = -\nabla_\theta{}L,\]

by means of a Euler time-stepping scheme:

\[\theta^{t+1} = \theta^{t} - h\nabla_{\theta^{t}}L,\]

where $\eta$ (the time step of the Euler scheme) is referred to as the learning rate

This equation can easily be generalized to manifolds by replacing the Euclidean gradient $\nabla_{\theta^{t}L}$ by a Riemannian gradient $-h\mathrm{grad}_{\theta^{t}}L$ and addition by $-h\nabla_{\theta^{t}}L$ with a retraction by $-h\mathrm{grad}_{\theta^{t}}L$.

+General Optimization · GeometricMachineLearning.jl

Optimization for Neural Networks

Optimization for neural networks is (almost always) some variation on gradient descent. The most basic form of gradient descent is a discretization of the gradient flow equation:

\[\dot{\theta} = -\nabla_\theta{}L,\]

by means of an Euler time-stepping scheme:

\[\theta^{t+1} = \theta^{t} - h\nabla_{\theta^{t}}L,\]

where $h$ (the time step of the Euler scheme) is referred to as the learning rate.
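The Euler scheme above translates directly into code. The following is a minimal sketch in plain Python; the loss $L(\theta) = \frac{1}{2}\theta^T\theta$ and the names `grad_L` and `gradient_descent` are assumptions chosen for illustration.

```python
def grad_L(theta):
    """Gradient of the example loss L(theta) = 0.5 * theta^T theta."""
    return list(theta)

def gradient_descent(theta, h, steps):
    """Repeat the explicit Euler step theta <- theta - h * grad_L(theta)."""
    for _ in range(steps):
        g = grad_L(theta)
        theta = [t - h * gi for t, gi in zip(theta, g)]
    return theta

theta_final = gradient_descent([1.0, -2.0], h=0.1, steps=100)
```

For this loss every step multiplies $\theta$ by $(1 - h)$, so the iterates contract towards the minimizer at the origin as long as $0 < h < 2$.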

This equation can easily be generalized to manifolds by replacing the Euclidean gradient $\nabla_{\theta^{t}}L$ by a Riemannian gradient $\mathrm{grad}_{\theta^{t}}L$ and addition by $-h\nabla_{\theta^{t}}L$ with a retraction by $-h\mathrm{grad}_{\theta^{t}}L$.

diff --git a/latest/optimizers/manifold_related/cayley/index.html b/latest/optimizers/manifold_related/cayley/index.html index 423b296b9..94be8833b 100644 --- a/latest/optimizers/manifold_related/cayley/index.html +++ b/latest/optimizers/manifold_related/cayley/index.html @@ -1,5 +1,5 @@ -Cayley Retraction · GeometricMachineLearning.jl

The Cayley Retraction

The Cayley transformation is one of the most popular retractions. For several matrix Lie groups it is a mapping from the Lie algebra $\mathfrak{g}$ onto the Lie group $G$. They Cayley retraction reads:

\[ \mathrm{Cayley}(C) = \left(\mathbb{I} -\frac{1}{2}C\right)^{-1}\left(\mathbb{I} +\frac{1}{2}C\right).\]

This is easily checked to be a retraction, i.e. $\mathrm{Cayley}(\mathbb{O}) = \mathbb{I}$ and $\frac{\partial}{\partial{}t}\mathrm{Cayley}(tC) = C$.

What we need in practice is not the computation of the Cayley transform of an arbitrary matrix, but the Cayley transform of an element of $\mathfrak{g}^\mathrm{hor}$, the global tangent space representation.

The elements of $\mathfrak{g}^\mathrm{hor}$ can be written as:

\[C = \begin{bmatrix} +Cayley Retraction · GeometricMachineLearning.jl

The Cayley Retraction

The Cayley transformation is one of the most popular retractions. For several matrix Lie groups it is a mapping from the Lie algebra $\mathfrak{g}$ onto the Lie group $G$. The Cayley retraction reads:

\[ \mathrm{Cayley}(C) = \left(\mathbb{I} -\frac{1}{2}C\right)^{-1}\left(\mathbb{I} +\frac{1}{2}C\right).\]

This is easily checked to be a retraction, i.e. $\mathrm{Cayley}(\mathbb{O}) = \mathbb{I}$ and $\frac{\partial}{\partial{}t}\big|_{t=0}\mathrm{Cayley}(tC) = C$.
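That the Cayley transform maps the Lie algebra into the group can be verified numerically in the simplest case $\mathfrak{so}(2)\to{}SO(2)$. The sketch below is plain Python and only meant as an illustration; the function name `cayley_2x2` is hypothetical and the explicit $2\times2$ inverse is used to keep the example self-contained.

```python
def cayley_2x2(C):
    """Cayley transform (I - C/2)^{-1} (I + C/2) of a 2x2 matrix."""
    m = [[1.0 - 0.5 * C[0][0], -0.5 * C[0][1]],
         [-0.5 * C[1][0], 1.0 - 0.5 * C[1][1]]]   # I - C/2
    p = [[1.0 + 0.5 * C[0][0], 0.5 * C[0][1]],
         [0.5 * C[1][0], 1.0 + 0.5 * C[1][1]]]    # I + C/2
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    m_inv = [[m[1][1] / det, -m[0][1] / det],
             [-m[1][0] / det, m[0][0] / det]]
    return [[sum(m_inv[i][k] * p[k][j] for k in range(2))
             for j in range(2)] for i in range(2)]

# A skew-symmetric matrix, i.e. an element of so(2).
a = 0.7
C = [[0.0, -a],
     [a, 0.0]]
Q = cayley_2x2(C)

# Q^T Q should be the identity and det(Q) should be 1.
gram = [[sum(Q[k][i] * Q[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)]
orthogonality_error = max(abs(gram[i][j] - (1.0 if i == j else 0.0))
                          for i in range(2) for j in range(2))
det_Q = Q[0][0] * Q[1][1] - Q[0][1] * Q[1][0]

# The zero matrix is mapped to the identity.
identity_image = cayley_2x2([[0.0, 0.0], [0.0, 0.0]])
```

For skew-symmetric $C$ the matrix $\mathbb{I} - \frac{1}{2}C$ is always invertible, so the transform is well defined on all of $\mathfrak{so}(N)$.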

What we need in practice is not the computation of the Cayley transform of an arbitrary matrix, but the Cayley transform of an element of $\mathfrak{g}^\mathrm{hor}$, the global tangent space representation.

The elements of $\mathfrak{g}^\mathrm{hor}$ can be written as:

\[C = \begin{bmatrix} A & -B^T \\ B & \mathbb{O} \end{bmatrix} = \begin{bmatrix} \frac{1}{2}A & \mathbb{I} \\ B & \mathbb{O} \end{bmatrix} \begin{bmatrix} \mathbb{I} & \mathbb{O} \\ \frac{1}{2}A & -B^T \end{bmatrix},\]

where the second expression exploits the sparse structure of the array, i.e. it is a multiplication of an $N\times2n$ with a $2n\times{}N$ matrix. We can hence use the Sherman-Morrison-Woodbury formula to obtain:

\[(\mathbb{I} - \frac{1}{2}UV)^{-1} = \mathbb{I} + \frac{1}{2}U(\mathbb{I} - \frac{1}{2}VU)^{-1}V\]
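This identity can be verified directly by multiplying the right-hand side with $\mathbb{I} - \frac{1}{2}UV$ (a short check, assuming that $\mathbb{I} - \frac{1}{2}VU$ is invertible):

\[\begin{aligned}
\left(\mathbb{I} - \frac{1}{2}UV\right)\left(\mathbb{I} + \frac{1}{2}U(\mathbb{I} - \frac{1}{2}VU)^{-1}V\right) & = \mathbb{I} - \frac{1}{2}UV + \frac{1}{2}U(\mathbb{I} - \frac{1}{2}VU)^{-1}V - \frac{1}{4}UVU(\mathbb{I} - \frac{1}{2}VU)^{-1}V \\
& = \mathbb{I} - \frac{1}{2}UV + \frac{1}{2}U(\mathbb{I} - \frac{1}{2}VU)(\mathbb{I} - \frac{1}{2}VU)^{-1}V \\
& = \mathbb{I} - \frac{1}{2}UV + \frac{1}{2}UV = \mathbb{I}.
\end{aligned}\]

The benefit is that $\mathbb{I} - \frac{1}{2}VU$ is only a $2n\times2n$ matrix, whereas $\mathbb{I} - \frac{1}{2}UV$ is $N\times{}N$.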

So what we have to invert is the term

\[\mathbb{I} - \frac{1}{2}\begin{bmatrix} \mathbb{I} & \mathbb{O} \\ \frac{1}{2}A & -B^T \end{bmatrix}\begin{bmatrix} \frac{1}{2}A & \mathbb{I} \\ B & \mathbb{O} \end{bmatrix} = @@ -10,4 +10,4 @@ \begin{bmatrix} \mathbb{I} \\ \frac{1}{2}A \end{bmatrix} + \begin{bmatrix} \frac{1}{2}A \\ \frac{1}{4}A^2 - \frac{1}{2}B^TB \end{bmatrix} \right) - \right)\]

Note that for computational reasons we compute $\mathrm{Cayley}(C)E$ instead of just the Cayley transform (see the section on retractions).

+ \right)\]

Note that for computational reasons we compute $\mathrm{Cayley}(C)E$ instead of just the Cayley transform (see the section on retractions).

diff --git a/latest/optimizers/manifold_related/geodesic/index.html b/latest/optimizers/manifold_related/geodesic/index.html index b88fdc598..36f1ffc81 100644 --- a/latest/optimizers/manifold_related/geodesic/index.html +++ b/latest/optimizers/manifold_related/geodesic/index.html @@ -1,2 +1,2 @@ -Geodesic Retraction · GeometricMachineLearning.jl

Geodesic Retraction

General retractions are approximations of the exponential map. In GeometricMachineLearning we can, instead of using an approximation, solve the geodesic equation exactly (up to numerical error) by specifying Geodesic() as the argument of layers that have manifold weights.

+Geodesic Retraction · GeometricMachineLearning.jl

Geodesic Retraction

General retractions are approximations of the exponential map. In GeometricMachineLearning we can, instead of using an approximation, solve the geodesic equation exactly (up to numerical error) by specifying Geodesic() as the argument of layers that have manifold weights.

diff --git a/latest/optimizers/manifold_related/global_sections/index.html b/latest/optimizers/manifold_related/global_sections/index.html index d330db072..6bc9955c7 100644 --- a/latest/optimizers/manifold_related/global_sections/index.html +++ b/latest/optimizers/manifold_related/global_sections/index.html @@ -1,5 +1,5 @@ -Global Sections · GeometricMachineLearning.jl

Global Sections

Global sections are needed for the generalization of Adam and other optimizers to homogeneous spaces. They are necessary to perform the two mappings represented by horizontal and vertical red lines in the section on the general optimizer framework.

Computing the global section

In differential geometry a section is always associated to some bundle; in our case this bundle is $\pi:G\to\mathcal{M},A\mapsto{}AE$. A section is a mapping $\lambda:\mathcal{M}\to{}G$ for which $\pi$ is a left inverse, i.e. $\pi\circ\lambda = \mathrm{id}$.

For the Stiefel manifold $St(n, N)\subset\mathbb{R}^{N\times{}n}$ we compute the global section the following way:

  1. Start with an element $Y\in{}St(n,N)$,
  2. Draw a random matrix $A\in\mathbb{R}^{N\times{}(N-n)}$,
  3. Remove the subspace spanned by $Y$ from the range of $A$: $A\gets{}A-YY^TA$
  4. Compute a QR decomposition of $A$ and take as section $\lambda(Y) = [Y, Q_{[1:N, 1:(N-n)]}] =: [Y, \bar{\lambda}]$.

It is easy to check that $\lambda(Y)\in{}G=SO(N)$.

In GeometricMachineLearning, GlobalSection takes an element $Y\in{}St(n,N)\equiv$StiefelManifold{T} and returns an instance of GlobalSection{T, StiefelManifold{T}}. The application $O(N)\times{}St(n,N)\to{}St(n,N)$ is done with the functions apply_section! and apply_section.

Computing the global tangent space representation based on a global section

The output of the horizontal lift $\Omega$ is an element of $\mathfrak{g}^{\mathrm{hor},Y}$. For this mapping $\Omega(Y, B{}Y) = B$ if $B\in\mathfrak{g}^{\mathrm{hor},Y}$, i.e. there is no information loss and no projection is performed. We can map the $B\in\mathfrak{g}^{\mathrm{hor},Y}$ to $\mathfrak{g}^\mathrm{hor}$ with $B\mapsto{}\lambda(Y)^{-1}B\lambda(Y)$.

The function global_rep performs both mappings at once[1], i.e. it takes an instance of GlobalSection and an element of $T_YSt(n,N)$, and then returns an element of $\frak{g}^\mathrm{hor}\equiv$StiefelLieAlgHorMatrix.

In practice we use the following:

\[\begin{aligned} +Global Sections · GeometricMachineLearning.jl

Global Sections

Global sections are needed for the generalization of Adam and other optimizers to homogeneous spaces. They are necessary to perform the two mappings represented by horizontal and vertical red lines in the section on the general optimizer framework.

Computing the global section

In differential geometry a section is always associated to some bundle; in our case this bundle is $\pi:G\to\mathcal{M},A\mapsto{}AE$. A section is a mapping $\lambda:\mathcal{M}\to{}G$ for which $\pi$ is a left inverse, i.e. $\pi\circ\lambda = \mathrm{id}$.

For the Stiefel manifold $St(n, N)\subset\mathbb{R}^{N\times{}n}$ we compute the global section the following way:

  1. Start with an element $Y\in{}St(n,N)$,
  2. Draw a random matrix $A\in\mathbb{R}^{N\times{}(N-n)}$,
  3. Remove the subspace spanned by $Y$ from the range of $A$: $A\gets{}A-YY^TA$
  4. Compute a QR decomposition of $A$ and take as section $\lambda(Y) = [Y, Q_{[1:N, 1:(N-n)]}] =: [Y, \bar{\lambda}]$.

It is easy to check that $\lambda(Y)\in{}G=SO(N)$.
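The four steps above can be sketched as follows. This is an illustration in plain Python and not the package's GlobalSection implementation: the function name `global_section` and the `seed` argument are assumptions, and the QR decomposition of step 4 is swapped for modified Gram-Schmidt, which removes the span of $Y$ and orthonormalizes the random columns in one loop.

```python
import random

def global_section(Y, seed=0):
    """Return lambda(Y) = [Y, Q] as an N x N nested list.

    Y is an N x n nested list with orthonormal columns. Gram-Schmidt is used
    here in place of the QR decomposition mentioned in the text.
    """
    rng = random.Random(seed)
    N, n = len(Y), len(Y[0])
    cols = [[Y[i][j] for i in range(N)] for j in range(n)]   # columns of Y
    for _ in range(N - n):
        a = [rng.gauss(0.0, 1.0) for _ in range(N)]          # random column of A
        for q in cols:                                       # remove span of Y and of earlier columns
            dot = sum(ai * qi for ai, qi in zip(a, q))
            a = [ai - dot * qi for ai, qi in zip(a, q)]
        norm = sum(ai * ai for ai in a) ** 0.5
        cols.append([ai / norm for ai in a])
    return [[cols[j][i] for j in range(N)] for i in range(N)]

Y = [[0.6], [0.8], [0.0]]        # an element of St(1, 3), illustrative values
lam = global_section(Y)          # 3 x 3 matrix whose first column is Y
```

By construction the first $n$ columns of `lam` are $Y$, i.e. $\lambda(Y)E = Y$, and the columns of `lam` are orthonormal up to floating-point error.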

In GeometricMachineLearning, GlobalSection takes an element $Y\in{}St(n,N)\equiv$StiefelManifold{T} and returns an instance of GlobalSection{T, StiefelManifold{T}}. The application $O(N)\times{}St(n,N)\to{}St(n,N)$ is done with the functions apply_section! and apply_section.

Computing the global tangent space representation based on a global section

The output of the horizontal lift $\Omega$ is an element of $\mathfrak{g}^{\mathrm{hor},Y}$. For this mapping $\Omega(Y, B{}Y) = B$ if $B\in\mathfrak{g}^{\mathrm{hor},Y}$, i.e. there is no information loss and no projection is performed. We can map the $B\in\mathfrak{g}^{\mathrm{hor},Y}$ to $\mathfrak{g}^\mathrm{hor}$ with $B\mapsto{}\lambda(Y)^{-1}B\lambda(Y)$.

The function global_rep performs both mappings at once[1], i.e. it takes an instance of GlobalSection and an element of $T_YSt(n,N)$, and then returns an element of $\frak{g}^\mathrm{hor}\equiv$StiefelLieAlgHorMatrix.

In practice we use the following:

\[\begin{aligned} \lambda(Y)^T\Omega(Y,\Delta)\lambda(Y) & = \lambda(Y)^T[(\mathbb{I} - \frac{1}{2}YY^T)\Delta{}Y^T - Y\Delta^T(\mathbb{I} - \frac{1}{2}YY^T)]\lambda(Y) \\ & = \lambda(Y)^T[(\mathbb{I} - \frac{1}{2}YY^T)\Delta{}E^T - Y\Delta^T(\lambda(Y) - \frac{1}{2}YE^T)] \\ & = \lambda(Y)^T\Delta{}E^T - \frac{1}{2}EY^T\Delta{}E^T - E\Delta^T\lambda(Y) + \frac{1}{2}E\Delta^TYE^T \\ @@ -7,4 +7,4 @@ & = \begin{bmatrix} Y^T\Delta{}E^T \\ \bar{\lambda}\Delta{}E^T \end{bmatrix} + E\Delta^TYE^T - \begin{bmatrix}E\Delta^TY & E\Delta^T\bar{\lambda} \end{bmatrix} \\ & = EY^T\Delta{}E^T + E\Delta^TYE^T - E\Delta^TYE^T + \begin{bmatrix} \mathbb{O} \\ \bar{\lambda}\Delta{}E^T \end{bmatrix} - \begin{bmatrix} \mathbb{O} & E\Delta^T\bar{\lambda} \end{bmatrix} \\ & = EY^T\Delta{}E^T + \begin{bmatrix} \mathbb{O} \\ \bar{\lambda}\Delta{}E^T \end{bmatrix} - \begin{bmatrix} \mathbb{O} & E\Delta^T\bar{\lambda} \end{bmatrix}, -\end{aligned}\]

meaning that for an element of the horizontal component of the Lie algebra $\mathfrak{g}^\mathrm{hor}$ we store $A=Y^T\Delta$ and $B=\bar{\lambda}^T\Delta$.

Optimization

The output of global_rep is then used for all the optimization steps.

References

[24]
T. Frankel. The geometry of physics: an introduction (Cambridge university press, Cambridge, UK, 2011).
  • 1For computational reasons.
+\end{aligned}\]

meaning that for an element of the horizontal component of the Lie algebra $\mathfrak{g}^\mathrm{hor}$ we store $A=Y^T\Delta$ and $B=\bar{\lambda}^T\Delta$.

Optimization

The output of global_rep is then used for all the optimization steps.

References

[30]
T. Frankel. The geometry of physics: an introduction (Cambridge university press, Cambridge, UK, 2011).
  • 1For computational reasons.
diff --git a/latest/optimizers/manifold_related/horizontal_lift/index.html b/latest/optimizers/manifold_related/horizontal_lift/index.html index 0e5b4a4ec..494a8710e 100644 --- a/latest/optimizers/manifold_related/horizontal_lift/index.html +++ b/latest/optimizers/manifold_related/horizontal_lift/index.html @@ -1,2 +1,2 @@ -Horizontal Lift · GeometricMachineLearning.jl

The Horizontal Lift

For each element $Y\in\mathcal{M}$ we can perform a splitting $\mathfrak{g} = \mathfrak{g}^{\mathrm{hor}, Y}\oplus\mathfrak{g}^{\mathrm{ver}, Y}$, where the two subspaces are the horizontal and the vertical component of $\mathfrak{g}$ at $Y$ respectively. For homogeneous spaces: $T_Y\mathcal{M} = \mathfrak{g}\cdot{}Y$, i.e. every tangent space to $\mathcal{M}$ can be expressed through the application of the Lie algebra to the relevant element. The vertical component consists of those elements of $\mathfrak{g}$ which are mapped to the zero element of $T_Y\mathcal{M}$, i.e.

\[\mathfrak{g}^{\mathrm{ver}, Y} := \mathrm{ker}(\mathfrak{g}\to{}T_Y\mathcal{M}).\]

The orthogonal complement[1] of $\mathfrak{g}^{\mathrm{ver}, Y}$ is the horizontal component and is referred to by $\mathfrak{g}^{\mathrm{hor}, Y}$. This is naturally isomorphic to $T_Y\mathcal{M}$. For the Stiefel manifold the horizontal lift has the simple form:

\[\Omega(Y, V) = \left(\mathbb{I} - \frac{1}{2}YY^T\right)VY^T - YV^T(\mathbb{I} - \frac{1}{2}YY^T).\]

If the element $Y$ is the distinguished element $E$, then the elements of $\mathfrak{g}^{\mathrm{hor},E}$ take a particularly simple form, see Global Tangent Space for a description of this.

  • 1The orthogonal complement is taken with respect to a metric defined on $\mathfrak{g}$. For the case of $G=SO(N)$ and $\mathfrak{g}=\mathfrak{so}(N) = \{A:A+A^T =0\}$ this metric can be chosen as $(A_1,A_2)\mapsto{}\frac{1}{2}\mathrm{Tr}(A_1^TA_2)$.
+Horizontal Lift · GeometricMachineLearning.jl

The Horizontal Lift

For each element $Y\in\mathcal{M}$ we can perform a splitting $\mathfrak{g} = \mathfrak{g}^{\mathrm{hor}, Y}\oplus\mathfrak{g}^{\mathrm{ver}, Y}$, where the two subspaces are the horizontal and the vertical component of $\mathfrak{g}$ at $Y$ respectively. For homogeneous spaces: $T_Y\mathcal{M} = \mathfrak{g}\cdot{}Y$, i.e. every tangent space to $\mathcal{M}$ can be expressed through the application of the Lie algebra to the relevant element. The vertical component consists of those elements of $\mathfrak{g}$ which are mapped to the zero element of $T_Y\mathcal{M}$, i.e.

\[\mathfrak{g}^{\mathrm{ver}, Y} := \mathrm{ker}(\mathfrak{g}\to{}T_Y\mathcal{M}).\]

The orthogonal complement[1] of $\mathfrak{g}^{\mathrm{ver}, Y}$ is the horizontal component and is referred to by $\mathfrak{g}^{\mathrm{hor}, Y}$. This is naturally isomorphic to $T_Y\mathcal{M}$. For the Stiefel manifold the horizontal lift has the simple form:

\[\Omega(Y, V) = \left(\mathbb{I} - \frac{1}{2}YY^T\right)VY^T - YV^T(\mathbb{I} - \frac{1}{2}YY^T).\]

If the element $Y$ is the distinguished element $E$, then the elements of $\mathfrak{g}^{\mathrm{hor},E}$ take a particularly simple form, see Global Tangent Space for a description of this.
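As an illustration, the following plain-Python sketch (helper names are assumptions, not the package's API) evaluates the lift in the form $\Omega(Y, V) = (\mathbb{I} - \frac{1}{2}YY^T)VY^T - YV^T(\mathbb{I} - \frac{1}{2}YY^T)$, which is also the form used in the section on global sections. It checks for one example that $\Omega(Y, V)$ is skew-symmetric and that $\Omega(Y, V)Y = V$ for a tangent vector $V$, i.e. one with $Y^TV + V^TY = \mathbb{O}$.

```python
def transpose(X):
    return [list(row) for row in zip(*X)]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def horizontal_lift(Y, V):
    """Omega(Y, V) = (I - YY^T/2) V Y^T - Y V^T (I - YY^T/2)."""
    N = len(Y)
    YYt = matmul(Y, transpose(Y))
    P = [[(1.0 if i == j else 0.0) - 0.5 * YYt[i][j] for j in range(N)]
         for i in range(N)]                                # I - YY^T/2
    left = matmul(matmul(P, V), transpose(Y))              # (I - YY^T/2) V Y^T
    right = matmul(matmul(Y, transpose(V)), P)             # Y V^T (I - YY^T/2)
    return [[left[i][j] - right[i][j] for j in range(N)] for i in range(N)]

# Y = E in St(2, 3) and a tangent vector V (Y^T V is skew-symmetric); values are illustrative.
Y = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
V = [[0.0, -0.5], [0.5, 0.0], [0.2, -0.1]]
Omega = horizontal_lift(Y, V)
OmegaY = matmul(Omega, Y)        # should reproduce V
```

For this choice $Y = E$ the result has the block structure with $A$ in the upper left, $B$ in the lower left and $-B^T$ in the upper right, here with $A = Y^TV$ and $B = [0.2, -0.1]$.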

  • 1The orthogonal complement is taken with respect to a metric defined on $\mathfrak{g}$. For the case of $G=SO(N)$ and $\mathfrak{g}=\mathfrak{so}(N) = \{A:A+A^T =0\}$ this metric can be chosen as $(A_1,A_2)\mapsto{}\frac{1}{2}\mathrm{Tr}(A_1^TA_2)$.
diff --git a/latest/optimizers/manifold_related/retractions/index.html b/latest/optimizers/manifold_related/retractions/index.html index bcae7ad9f..f12167914 100644 --- a/latest/optimizers/manifold_related/retractions/index.html +++ b/latest/optimizers/manifold_related/retractions/index.html @@ -1,2 +1,2 @@ -Retractions · GeometricMachineLearning.jl

Retractions

Classical Definition

Classically, retractions are defined as smooth maps

\[R: T\mathcal{M}\to\mathcal{M}:(x,v)\mapsto{}R_x(v)\]

such that each curve $c(t) := R_x(tv)$ satisfies $c(0) = x$ and $c'(0) = v$.

In GeometricMachineLearning

Retractions are a map from the horizontal component of the Lie algebra $\mathfrak{g}^\mathrm{hor}$ to the respective manifold.

For optimization in neural networks (almost always first order) we solve a gradient flow equation

\[\dot{W} = -\mathrm{grad}_WL, \]

where $\mathrm{grad}_WL$ is the Riemannian gradient of the loss function $L$ evaluated at position $W$.

If we deal with Euclidean spaces (vector spaces), then the Riemannian gradient is just the result of an AD routine and the solution of the equation above can be approximated with $W^{t+1} \gets W^t - \eta\nabla_{W^t}L$, where $\eta$ is the learning rate.

For manifolds, after we obtained the Riemannian gradient (see e.g. the section on Stiefel manifold), we have to solve a geodesic equation. This is a canonical ODE associated with any Riemannian manifold.

The general theory of Riemannian manifolds is rather complicated, but for the neural networks treated in GeometricMachineLearning, we only rely on optimization of matrix Lie groups and homogeneous spaces, which is much simpler.

For Lie groups each tangent space is isomorphic to its Lie algebra $\mathfrak{g}\equiv{}T_\mathbb{I}G$. The geodesic map from $\mathfrak{g}$ to $G$, for matrix Lie groups with bi-invariant Riemannian metric like $SO(N)$, is simply the application of the matrix exponential $\exp$. Alternatively this can be replaced by the Cayley transform (see (Absil et al, 2008)).

Starting from this basic map $\exp:\mathfrak{g}\to{}G$ we can build mappings for more complicated cases:

  1. General tangent space to a Lie group $T_AG$: The geodesic map for an element $V\in{}T_AG$ is simply $A\exp(A^{-1}V)$.

  2. Special tangent space to a homogeneous space $T_E\mathcal{M}$: For $V=BE\in{}T_E\mathcal{M}$ the exponential map is simply $\exp(B)E$.

  3. General tangent space to a homogeneous space $T_Y\mathcal{M}$ with $Y = AE$: For $\Delta=ABE\in{}T_Y\mathcal{M}$ the exponential map is simply $A\exp(B)E$. This is the general case which we deal with.

The general theory behind points 2. and 3. is discussed in chapter 11 of (O'Neill, 1983). The function retraction in GeometricMachineLearning performs $\mathfrak{g}^\mathrm{hor}\to\mathcal{M}$, which is the second of the above points. To get the third from the second point, we simply have to multiply with a matrix from the left. This step is done with apply_section and represented through the red vertical line in the diagram on the general optimizer framework.

Word of caution

The Lie group $SO(N)$ corresponding to the Stiefel manifold has a bi-invariant Riemannian metric associated with it: $(B_1,B_2)\mapsto \mathrm{Tr}(B_1^TB_2)$. For other Lie groups (e.g. the symplectic group) the situation is slightly more difficult (see (Bendokat et al, 2021)).

References

  • Absil P A, Mahony R, Sepulchre R. Optimization algorithms on matrix manifolds[M]. Princeton University Press, 2008.

  • Bendokat T, Zimmermann R. The real symplectic Stiefel and Grassmann manifolds: metrics, geodesics and applications[J]. arXiv preprint arXiv:2108.12447, 2021.

  • O'Neill, Barrett. Semi-Riemannian geometry with applications to relativity. Academic press, 1983.

+Retractions · GeometricMachineLearning.jl

Retractions

Classical Definition

Classically, retractions are defined as smooth maps

\[R: T\mathcal{M}\to\mathcal{M}:(x,v)\mapsto{}R_x(v)\]

such that each curve $c(t) := R_x(tv)$ satisfies $c(0) = x$ and $c'(0) = v$.

In GeometricMachineLearning

Retractions are a map from the horizontal component of the Lie algebra $\mathfrak{g}^\mathrm{hor}$ to the respective manifold.

For optimization in neural networks (almost always first order) we solve a gradient flow equation

\[\dot{W} = -\mathrm{grad}_WL, \]

where $\mathrm{grad}_WL$ is the Riemannian gradient of the loss function $L$ evaluated at position $W$.

If we deal with Euclidean spaces (vector spaces), then the Riemannian gradient is just the result of an AD routine and the solution of the equation above can be approximated with $W^{t+1} \gets W^t - \eta\nabla_{W^t}L$, where $\eta$ is the learning rate.

For manifolds, after we obtained the Riemannian gradient (see e.g. the section on Stiefel manifold), we have to solve a geodesic equation. This is a canonical ODE associated with any Riemannian manifold.

The general theory of Riemannian manifolds is rather complicated, but for the neural networks treated in GeometricMachineLearning, we only rely on optimization of matrix Lie groups and homogeneous spaces, which is much simpler.

For Lie groups each tangent space is isomorphic to its Lie algebra $\mathfrak{g}\equiv{}T_\mathbb{I}G$. The geodesic map from $\mathfrak{g}$ to $G$, for matrix Lie groups with bi-invariant Riemannian metric like $SO(N)$, is simply the application of the matrix exponential $\exp$. Alternatively this can be replaced by the Cayley transform (see (Absil et al, 2008)).

Starting from this basic map $\exp:\mathfrak{g}\to{}G$ we can build mappings for more complicated cases:

  1. General tangent space to a Lie group $T_AG$: The geodesic map for an element $V\in{}T_AG$ is simply $A\exp(A^{-1}V)$.

  2. Special tangent space to a homogeneous space $T_E\mathcal{M}$: For $V=BE\in{}T_E\mathcal{M}$ the exponential map is simply $\exp(B)E$.

  3. General tangent space to a homogeneous space $T_Y\mathcal{M}$ with $Y = AE$: For $\Delta=ABE\in{}T_Y\mathcal{M}$ the exponential map is simply $A\exp(B)E$. This is the general case which we deal with.

The general theory behind points 2. and 3. is discussed in chapter 11 of (O'Neill, 1983). The function retraction in GeometricMachineLearning performs $\mathfrak{g}^\mathrm{hor}\to\mathcal{M}$, which is the second of the above points. To get the third from the second point, we simply have to multiply with a matrix from the left. This step is done with apply_section and represented through the red vertical line in the diagram on the general optimizer framework.
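The step from the second to the third point can be illustrated on the simplest Stiefel manifold, the unit circle $St(1,2)$ with $G=SO(2)$. The sketch below is plain Python and does not use the package's retraction or apply_section functions; `exp_so2`, `matvec` and the numerical values are assumptions made for this example. It relies on the fact that the matrix exponential of the skew-symmetric matrix with off-diagonal entries $-\theta$ and $\theta$ is the rotation by the angle $\theta$.

```python
import math

def exp_so2(theta):
    """Matrix exponential of [[0, -theta], [theta, 0]], i.e. the rotation by theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def matvec(X, v):
    return [sum(X[i][k] * v[k] for k in range(len(v))) for i in range(len(X))]

E = [1.0, 0.0]                   # distinguished element of St(1, 2)
alpha, theta = 0.4, 0.25         # illustrative values
A = exp_so2(alpha)               # group element with Y = A E
Y = matvec(A, E)

# Point 2: exponential map at E, exp(B) E with B = [[0, -theta], [theta, 0]].
at_E = matvec(exp_so2(theta), E)
# Point 3: multiply with A from the left to obtain A exp(B) E.
Y_new = matvec(A, at_E)
```

Here `Y_new` is the point on the circle at angle `alpha + theta`, so the update stays on the manifold.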

Word of caution

The Lie group $SO(N)$ corresponding to the Stiefel manifold has a bi-invariant Riemannian metric associated with it: $(B_1,B_2)\mapsto \mathrm{Tr}(B_1^TB_2)$. For other Lie groups (e.g. the symplectic group) the situation is slightly more difficult (see (Bendokat et al, 2021)).

References

  • Absil P A, Mahony R, Sepulchre R. Optimization algorithms on matrix manifolds[M]. Princeton University Press, 2008.

  • Bendokat T, Zimmermann R. The real symplectic Stiefel and Grassmann manifolds: metrics, geodesics and applications[J]. arXiv preprint arXiv:2108.12447, 2021.

  • O'Neill, Barrett. Semi-Riemannian geometry with applications to relativity. Academic press, 1983.

diff --git a/latest/pullbacks/computation_of_pullbacks/index.html b/latest/pullbacks/computation_of_pullbacks/index.html new file mode 100644 index 000000000..c8a7bb209 --- /dev/null +++ b/latest/pullbacks/computation_of_pullbacks/index.html @@ -0,0 +1,9 @@ + +Pullbacks · GeometricMachineLearning.jl

How to compute pullbacks

GeometricMachineLearning has many pullbacks for custom array types and other operations implemented. The need for this essentially comes from the fact that we cannot trivially differentiate custom GPU kernels at the moment[1].

What is a pullback?

Here we first explain the principle of a pullback with the example of a vector-valued function. The generalization to matrices and higher-order tensors is straightforward.

The pullback of a vector-valued function $f:\mathbb{R}^{n}\to\mathbb{R}^m$ can be interpreted as the sensitivities in the input space $\mathbb{R}^n$ with respect to variations in the output space $\mathbb{R}^m$ via the function $f$:

\[\left[\mathrm{pullback}(f)[a\in\mathbb{R}^n, db\in\mathbb{R}^m]\right]_i = \sum_{j=1}^m\frac{\partial{}f_j}{\partial{}a_i}db_j.\]
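A minimal sketch of this formula in plain Python, for an illustrative map $f:\mathbb{R}^2\to\mathbb{R}^3$ chosen only for this example and with a hand-written Jacobian (no AD library is involved; all names are assumptions):

```python
def f(a):
    """Illustrative map R^2 -> R^3, shown for reference."""
    a1, a2 = a
    return [a1 * a2, a1 + a2, a1 ** 2]

def jacobian(a):
    """Hand-written Jacobian of f, J[j][i] = d f_j / d a_i."""
    a1, a2 = a
    return [[a2, a1], [1.0, 1.0], [2.0 * a1, 0.0]]

def pullback(a, db):
    """[pullback(f)[a, db]]_i = sum_j (d f_j / d a_i) * db_j."""
    J = jacobian(a)
    return [sum(J[j][i] * db[j] for j in range(len(db))) for i in range(len(a))]

da = pullback([2.0, 3.0], [1.0, 0.0, -1.0])
```

For these inputs the first component is $3\cdot1 + 1\cdot0 + 4\cdot(-1) = -1$ and the second is $2\cdot1 + 1\cdot0 + 0\cdot(-1) = 2$.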

This principle can easily be generalized to matrices. For this consider the function $g:\mathbb{R}^{n_1\times{}n_2}\to\mathbb{R}^{m_1\times{}m_2}$. For this case we have:

\[\left[\mathrm{pullback}(g)[A\in\mathbb{R}^{n_1\times{}n_2}, dB\in\mathbb{R}^{m_1\times{}m_2}]\right]_{(i_1, i_2)} = \sum_{j_1=1}^{m_1}\sum_{j_2=1}^{m_2}\frac{\partial{}g_{(j_1, j_2)}}{\partial{}a_{(i_1, i_2)}}db_{(j_1, j_2)}.\]

The generalization to higher-order tensors is again straightforward.

Illustrative example

Consider the matrix inverse $\mathrm{inv}: \mathbb{R}^{n\times{}n}\to\mathbb{R}^{n\times{}n}$ as an example. This fits into the above framework where $\mathrm{inv}$ is a matrix-valued function from $\mathbb{R}^{n\times{}n}$ to $\mathbb{R}^{n\times{}n}$. We here write $B := A^{-1} = \mathrm{inv}(A)$. We thus have to compute:

\[\left[\mathrm{pullback}(\mathrm{inv})[A\in\mathbb{R}^{n\times{}n}, dB\in\mathbb{R}^{n\times{}n}]\right]_{(i, j)} = \sum_{k=1}^{n}\sum_{\ell=1}^{n}\frac{\partial{}b_{k, \ell}}{\partial{}a_{i, j}}db_{k, \ell}.\]

For a matrix $A$ that depends on a parameter $\varepsilon$ we have that:

\[\frac{\partial}{\partial\varepsilon}B = -B\left( \frac{\partial}{\partial\varepsilon}A \right) B.\]

This can easily be checked:

\[\mathbb{O} = \frac{\partial}{\partial\varepsilon}\mathbb{I} = \frac{\partial}{\partial\varepsilon}(AB) = A\frac{\partial}{\partial\varepsilon}B + \left(\frac{\partial}{\partial\varepsilon}A\right)B.\]

We can then write:

\[\begin{aligned} +\sum_{k,\ell}\left( \frac{\partial}{\partial{}a_{ij}} b_{k\ell} \right) db_{k\ell} & = \sum_{k\ell}\left[ \frac{\partial}{\partial{}a_{ij}} B \right]_{k\ell} db_{k,\ell} \\ +& = - \sum_{k,\ell}\left[B \left(\frac{\partial}{\partial{}a_{ij}} A\right) B \right]_{k\ell} db_{k\ell} \\ +& = - \sum_{k,\ell,m,n}b_{km} \left(\frac{\partial{}a_{mn}}{\partial{}a_{ij}}\right) b_{n\ell} db_{k\ell} \\ +& = - \sum_{k,\ell,m,n}b_{km} \delta_{im}\delta_{jn} b_{n\ell} db_{k\ell} \\ +& = - \sum_{k,\ell}b_{ki} b_{j\ell} db_{k\ell} \\ +& \equiv - B^T\cdot{}dB\cdot{}B^T. +\end{aligned}\]
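The closed form $-B^T\cdot{}dB\cdot{}B^T$ can be compared against finite differences. The sketch below is plain Python for the $2\times2$ case and is not the package's pullback implementation; the function names and test values are assumptions. It differentiates the scalar $\sum_{k,\ell}b_{k\ell}db_{k\ell}$ with respect to each entry $a_{ij}$, which is exactly the sum computed in the derivation above.

```python
def inv2(A):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]

def pullback_inv(A, dB):
    """Closed form -B^T dB B^T with B = inv(A), written entrywise."""
    B = inv2(A)
    return [[-sum(B[k][i] * dB[k][l] * B[j][l] for k in range(2) for l in range(2))
             for j in range(2)] for i in range(2)]

def pullback_inv_fd(A, dB, h=1e-6):
    """Central finite differences of sum_{k,l} inv(A)[k][l] * dB[k][l]."""
    def contracted(M):
        B = inv2(M)
        return sum(B[k][l] * dB[k][l] for k in range(2) for l in range(2))
    out = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(2):
        for j in range(2):
            Ap = [row[:] for row in A]
            Am = [row[:] for row in A]
            Ap[i][j] += h
            Am[i][j] -= h
            out[i][j] = (contracted(Ap) - contracted(Am)) / (2 * h)
    return out

A = [[2.0, 1.0], [1.0, 3.0]]     # illustrative invertible matrix
dB = [[1.0, 0.5], [-1.0, 2.0]]   # illustrative output differential
exact = pullback_inv(A, dB)
approx = pullback_inv_fd(A, dB)
```

The two results `exact` and `approx` agree up to the finite-difference error.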

Motivation from a differential-geometric perspective

The notions of a pullback in automatic differentiation and differential geometry are closely related (see e.g. [10] and [11]). In both cases we want to compute, based on a mapping $f:\mathcal{V}\to\mathcal{W}, a \mapsto f(a) =: b$, a map of differentials $db \mapsto da$. In the differential geometry case $db$ and $da$ are part of the associated cotangent spaces, i.e. $db\in{}T^*_b\mathcal{W}$ and $da\in{}T^*_a\mathcal{V}$; in AD we (mostly) deal with spaces of arrays, i.e. vector spaces, which means that $db\in\mathcal{W}$ and $da\in\mathcal{V}$.

[10]
M. Betancourt. A geometric theory of higher-order automatic differentiation, arXiv preprint arXiv:1812.11592 (2018).
[11]
J. Bolte and E. Pauwels. A mathematical model for automatic differentiation in machine learning. Advances in Neural Information Processing Systems 33, 10809–10819 (2020).
  • 1This will change once we switch to Enzyme (see [9]), but the package is still in its infancy.
diff --git a/latest/reduced_order_modeling/autoencoder/index.html b/latest/reduced_order_modeling/autoencoder/index.html index f0b4bb88f..0e4873b41 100644 --- a/latest/reduced_order_modeling/autoencoder/index.html +++ b/latest/reduced_order_modeling/autoencoder/index.html @@ -1,2 +1,2 @@ -POD and Autoencoders · GeometricMachineLearning.jl

Reduced Order modeling and Autoencoders

Reduced order modeling is a data-driven technique that exploits the structure of parametric PDEs to make solving those PDEs easier.

Consider a parametric PDE written in the form: $F(z(\mu);\mu)=0$ where $z(\mu)$ evolves on an infinite-dimensional Hilbert space $V$.

In modeling any PDE we have to choose a discretization (particle discretization, finite element method, ...) of $V$, which will be denoted by $V_h$.

Solution manifold

To any parametric PDE we associate a solution manifold:

\[\mathcal{M} = \{z(\mu):F(z(\mu);\mu)=0, \mu\in\mathbb{P}\}.\]

In the image above a 2-dimensional solution manifold is visualized as a sub-manifold in 3-dimensional space. In general the embedding space is an infinite-dimensional function space.

As an example of this consider the 1-dimensional wave equation:

\[\partial_{tt}^2q(t,\xi;\mu) = \mu^2\partial_{\xi\xi}^2q(t,\xi;\mu)\text{ on }I\times\Omega,\]

where $I = (0,1)$ and $\Omega=(-1/2,1/2)$. As initial condition for the first derivative we have $\partial_tq(0,\xi;\mu) = -\mu\partial_\xi{}q_0(\xi;\mu)$ and furthermore $q(t,\xi;\mu)=0$ on the boundary (i.e. $\xi\in\{-1/2,1/2\}$).

The solution manifold is a 1-dimensional submanifold:

\[\mathcal{M} = \{(t, \xi)\mapsto{}q(t,\xi;\mu)=q_0(\xi-\mu{}t;\mu):\mu\in\mathbb{P}\subset\mathbb{R}\}.\]

If we provide an initial condition $q_0$, a parameter instance $\mu$ and a time $t$, then $\xi\mapsto{}q(t,\xi;\mu)$ will be the momentary solution. If we consider the time evolution of $q(t,\xi;\mu)$, then it evolves on a two-dimensional submanifold $\bar{\mathcal{M}} := \{\xi\mapsto{}q(t,\xi;\mu):t\in{}I,\mu\in\mathbb{P}\}$.

General workflow

In reduced order modeling we aim to construct a mapping to a space that is close to this solution manifold. This is done through the following steps:

  1. Discretize the PDE.

  2. Solve the discretized PDE for a certain set of parameter instances $\mu\in\mathbb{P}$.

  3. Build a reduced basis with the data obtained from having solved the discretized PDE. This step consists of finding two mappings: the reduction $\mathcal{P}$ and the reconstruction $\mathcal{R}$.

The third step can be done with various machine learning (ML) techniques. Traditionally the most popular of these has been Proper orthogonal decomposition (POD), but in recent years autoencoders have also become a popular alternative (see (Fresca et al, 2021)).

References

[18]
S. Fresca, L. Dede’ and A. Manzoni. A comprehensive deep learning-based approach to reduced order modeling of nonlinear time-dependent parametrized PDEs. Journal of Scientific Computing 87, 1–36 (2021).
+POD and Autoencoders · GeometricMachineLearning.jl

Reduced Order modeling and Autoencoders

Reduced order modeling is a data-driven technique that exploits the structure of parametric PDEs to make solving those PDEs easier.

Consider a parametric PDE written in the form: $F(z(\mu);\mu)=0$ where $z(\mu)$ evolves on an infinite-dimensional Hilbert space $V$.

In modeling any PDE we have to choose a discretization (particle discretization, finite element method, ...) of $V$, which will be denoted by $V_h$.

Solution manifold

To any parametric PDE we associate a solution manifold:

\[\mathcal{M} = \{z(\mu):F(z(\mu);\mu)=0, \mu\in\mathbb{P}\}.\]

In the image above a 2-dimensional solution manifold is visualized as a sub-manifold in 3-dimensional space. In general the embedding space is an infinite-dimensional function space.

As an example of this consider the 1-dimensional wave equation:

\[\partial_{tt}^2q(t,\xi;\mu) = \mu^2\partial_{\xi\xi}^2q(t,\xi;\mu)\text{ on }I\times\Omega,\]

where $I = (0,1)$ and $\Omega=(-1/2,1/2)$. As initial condition for the first derivative we have $\partial_tq(0,\xi;\mu) = -\mu\partial_\xi{}q_0(\xi;\mu)$ and furthermore $q(t,\xi;\mu)=0$ on the boundary (i.e. $\xi\in\{-1/2,1/2\}$).

The solution manifold is a 1-dimensional submanifold:

\[\mathcal{M} = \{(t, \xi)\mapsto{}q(t,\xi;\mu)=q_0(\xi-\mu{}t;\mu):\mu\in\mathbb{P}\subset\mathbb{R}\}.\]
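That $q(t,\xi;\mu)=q_0(\xi-\mu{}t;\mu)$ satisfies the wave equation and the initial condition for the first derivative follows from the chain rule. In the short check below $q_0'$ and $q_0''$ denote derivatives of $q_0$ with respect to its first argument (a notation introduced only for this check); the boundary condition is not examined here:

\[\begin{aligned}
\partial_tq(t,\xi;\mu) & = -\mu{}q_0'(\xi-\mu{}t;\mu), \\
\partial_{tt}^2q(t,\xi;\mu) & = \mu^2q_0''(\xi-\mu{}t;\mu) = \mu^2\partial_{\xi\xi}^2q(t,\xi;\mu), \\
\partial_tq(0,\xi;\mu) & = -\mu{}q_0'(\xi;\mu) = -\mu\partial_\xi{}q_0(\xi;\mu).
\end{aligned}\]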

If we provide an initial condition $q_0$, a parameter instance $\mu$ and a time $t$, then $\xi\mapsto{}q(t,\xi;\mu)$ will be the momentary solution. If we consider the time evolution of $q(t,\xi;\mu)$, then it evolves on a two-dimensional submanifold $\bar{\mathcal{M}} := \{\xi\mapsto{}q(t,\xi;\mu):t\in{}I,\mu\in\mathbb{P}\}$.

General workflow

In reduced order modeling we aim to construct a mapping to a space that is close to this solution manifold. This is done through the following steps:

  1. Discretize the PDE.

  2. Solve the discretized PDE for a certain set of parameter instances $\mu\in\mathbb{P}$.

  3. Build a reduced basis with the data obtained from having solved the discretized PDE. This step consists of finding two mappings: the reduction $\mathcal{P}$ and the reconstruction $\mathcal{R}$.

The third step can be done with various machine learning (ML) techniques. Traditionally the most popular of these has been Proper orthogonal decomposition (POD), but in recent years autoencoders have also become a popular alternative (see (Fresca et al, 2021)).

References

[27]
S. Fresca, L. Dede’ and A. Manzoni. A comprehensive deep learning-based approach to reduced order modeling of nonlinear time-dependent parametrized PDEs. Journal of Scientific Computing 87, 1–36 (2021).
diff --git a/latest/reduced_order_modeling/kolmogorov_n_width/index.html b/latest/reduced_order_modeling/kolmogorov_n_width/index.html index 7624fbfe4..a0dff0009 100644 --- a/latest/reduced_order_modeling/kolmogorov_n_width/index.html +++ b/latest/reduced_order_modeling/kolmogorov_n_width/index.html @@ -1,2 +1,2 @@ -Kolmogorov n-width · GeometricMachineLearning.jl

Kolmogorov $n$-width

The Kolmogorov $n$-width measures how well some set $\mathcal{M}$ (typically the solution manifold) can be approximated with a linear subspace:

\[d_n(\mathcal{M}) := \mathrm{inf}_{V_n\subset{}V;\mathrm{dim}V_n=n}\mathrm{sup}_{u\in\mathcal{M}}\mathrm{inf}_{v_n\in{}V_n}|| u - v_n ||_V,\]

where $\mathcal{M}\subset{}V$ and $V$ is a (typically infinite-dimensional) Banach space. For advection-dominated problems (among others) the decay of the Kolmogorov $n$-width is very slow, i.e. one has to pick $n$ very large in order to obtain useful approximations (see [12] and [13]).

In order to overcome this, techniques based on neural networks (see e.g. [14]) and optimal transport (see e.g. [13]) have been used.

References

[13]
T. Blickhan. A registration method for reduced basis problems using linear optimal transport, arXiv preprint arXiv:2304.14884 (2023).
[12]
C. Greif and K. Urban. Decay of the Kolmogorov N-width for wave problems. Applied Mathematics Letters 96, 216–222 (2019).
[14]
K. Lee and K. T. Carlberg. Model reduction of dynamical systems on nonlinear manifolds using deep convolutional autoencoders. Journal of Computational Physics 404, 108973 (2020).
+Kolmogorov n-width · GeometricMachineLearning.jl

Kolmogorov $n$-width

The Kolmogorov $n$-width measures how well some set $\mathcal{M}$ (typically the solution manifold) can be approximated with a linear subspace:

\[d_n(\mathcal{M}) := \mathrm{inf}_{V_n\subset{}V;\mathrm{dim}V_n=n}\mathrm{sup}_{u\in\mathcal{M}}\mathrm{inf}_{v_n\in{}V_n}|| u - v_n ||_V,\]

where $\mathcal{M}\subset{}V$ and $V$ is a (typically infinite-dimensional) Banach space. For advection-dominated problems (among others) the decay of the Kolmogorov $n$-width is very slow, i.e. one has to pick $n$ very large in order to obtain useful approximations (see [21] and [22]).

In order to overcome this, techniques based on neural networks (see e.g. [23]) and optimal transport (see e.g. [22]) have been used.

References

[22]
T. Blickhan. A registration method for reduced basis problems using linear optimal transport, arXiv preprint arXiv:2304.14884 (2023).
[21]
C. Greif and K. Urban. Decay of the Kolmogorov N-width for wave problems. Applied Mathematics Letters 96, 216–222 (2019).
[23]
K. Lee and K. T. Carlberg. Model reduction of dynamical systems on nonlinear manifolds using deep convolutional autoencoders. Journal of Computational Physics 404, 108973 (2020).
diff --git a/latest/reduced_order_modeling/projection_reduction_errors/index.html b/latest/reduced_order_modeling/projection_reduction_errors/index.html index 1ef304722..8ae16def2 100644 --- a/latest/reduced_order_modeling/projection_reduction_errors/index.html +++ b/latest/reduced_order_modeling/projection_reduction_errors/index.html @@ -1,5 +1,5 @@ -Projection and Reduction Error · GeometricMachineLearning.jl

Projection and Reduction Errors of Reduced Models

Two errors of great importance in reduced order modeling are the projection error and the reduction error. During training one typically aims to minimize the projection error, but for the actual application of the model the reduction error is often more important.

Projection Error

The projection error measures how well a reduced basis, represented by the reduction $\mathcal{P}$ and the reconstruction $\mathcal{R}$, can represent the data with which it is built. In mathematical terms:

\[e_\mathrm{proj}(\mu) := +Projection and Reduction Error · GeometricMachineLearning.jl

Projection and Reduction Errors of Reduced Models

Two errors that are of great importance in reduced order modeling are the projection error and the reduction error. During training one typically aims to minimize the projection error, but for the actual application of the model the reduction error is often more important.

Projection Error

The projection error measures how well a reduced basis, represented by the reduction $\mathcal{P}$ and the reconstruction $\mathcal{R}$, can represent the data with which it is built. In mathematical terms:

\[e_\mathrm{proj}(\mu) := \frac{|| \mathcal{R}\circ\mathcal{P}(M) - M ||}{|| M ||},\]

where $M$ is the snapshot matrix and $||\cdot||$ is the Frobenius norm (one could also optimize for different norms).
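
As a hedged illustration of this formula, the sketch below evaluates the projection error for a linear reduced basis obtained from a singular value decomposition. The function name `projection_error`, the snapshot data and the choice of a POD-like basis are assumptions made for this example; they are not taken from the package.

```julia
using LinearAlgebra

# Relative projection error ||R(P(M)) - M|| / ||M|| in the Frobenius norm.
# `reduction` and `reconstruction` are arbitrary callables acting on a snapshot matrix.
function projection_error(reduction, reconstruction, M::AbstractMatrix)
    norm(reconstruction(reduction(M)) - M) / norm(M)   # `norm` of a matrix is the Frobenius norm
end

# Made-up snapshot matrix (40 degrees of freedom, 8 snapshots).
M = [sin(k * x) / k for x in range(0, pi; length = 40), k in 1:8]

n = 3
Phi = svd(M).U[:, 1:n]        # orthonormal basis: leading n left singular vectors

e_proj = projection_error(X -> Phi' * X, Z -> Phi * Z, M)
sigma = svdvals(M)
```

For this particular linear basis the result coincides with the relative weight of the discarded singular values, `norm(sigma[n+1:end]) / norm(sigma)`, which gives a quick sanity check.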

Reduction Error

The reduction error measures how far the reduced system diverges from the full-order system during integration (online stage). In mathematical terms (and for a single initial condition):

\[e_\mathrm{red}(\mu) := \sqrt{ \frac{\sum_{t=0}^K|| \mathbf{x}^{(t)}(\mu) - \mathcal{R}(\mathbf{x}^{(t)}_r(\mu)) ||^2}{\sum_{t=0}^K|| \mathbf{x}^{(t)}(\mu) ||^2} -},\]

where $\mathbf{x}^{(t)}$ is the solution of the FOM at point $t$ and $\mathbf{x}^{(t)}_r$ is the solution of the ROM (in the reduced basis) at point $t$. The reduction error, as opposed to the projection error, not only measures how well the solution manifold is represented by the reduced basis, but also measures how well the FOM dynamics are approximated by the ROM dynamics (via the induced vector field on the reduced basis).

+},\]

where $\mathbf{x}^{(t)}$ is the solution of the FOM at point $t$ and $\mathbf{x}^{(t)}_r$ is the solution of the ROM (in the reduced basis) at point $t$. The reduction error, as opposed to the projection error, not only measures how well the solution manifold is represented by the reduced basis, but also measures how well the FOM dynamics are approximated by the ROM dynamics (via the induced vector field on the reduced basis).
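
The difference between the two errors can be seen in a small sketch. Here the FOM trajectory lies exactly in the reduced basis (so the projection error vanishes), while the ROM trajectory carries an artificial drift that stands in for accumulated integration error. The function `reduction_error`, the basis `Phi` and all data are assumptions made for this illustration.

```julia
using LinearAlgebra

# Relative reduction error over a trajectory.
# `fom[:, t]` is the full-order state at step t, `rom[:, t]` the reduced state,
# and `reconstruction` maps a reduced state back to the full space.
function reduction_error(reconstruction, fom::AbstractMatrix, rom::AbstractMatrix)
    num = sum(norm(fom[:, t] - reconstruction(rom[:, t]))^2 for t in axes(fom, 2))
    den = sum(norm(fom[:, t])^2 for t in axes(fom, 2))
    sqrt(num / den)
end

Phi = Matrix{Float64}(I, 6, 2)                      # 6 x 2 matrix with orthonormal columns
ts = 0:0.1:1
z_exact = [f(t) for f in (cos, sin), t in ts]       # exact reduced trajectory, 2 x 11
fom = Phi * z_exact                                 # FOM trajectory inside the reduced basis

drift = [0.01 * t for _ in 1:2, t in ts]            # artificial error growing in time
rom = z_exact + drift

e_exact = reduction_error(z -> Phi * z, fom, z_exact)   # ROM reproduces the dynamics
e_drift = reduction_error(z -> Phi * z, fom, rom)       # ROM drifts away from the FOM
```

Even though the basis represents the data perfectly, `e_drift` is nonzero because the reduced dynamics deviate from the full-order dynamics.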

diff --git a/latest/reduced_order_modeling/symplectic_autoencoder/index.html b/latest/reduced_order_modeling/symplectic_autoencoder/index.html index 1d2c723cd..425d47912 100644 --- a/latest/reduced_order_modeling/symplectic_autoencoder/index.html +++ b/latest/reduced_order_modeling/symplectic_autoencoder/index.html @@ -1,5 +1,5 @@ -PSD and Symplectic Autoencoders · GeometricMachineLearning.jl

Symplectic Autoencoder

Symplectic autoencoders are a type of neural network suitable for treating Hamiltonian parametrized PDEs with slowly decaying Kolmogorov $n$-width. They are based on proper symplectic decomposition (PSD) and symplectic neural networks (SympNets).

Hamiltonian Model Order Reduction

Hamiltonian PDEs are partial differential equations that, like their ODE counterparts, have an associated Hamiltonian. An example of this is the linear wave equation (see [10]) with Hamiltonian

\[\mathcal{H}(q, p; \mu) := \frac{1}{2}\int_\Omega\mu^2(\partial_\xi{}q(t,\xi;\mu))^2 + p(t,\xi;\mu)^2d\xi.\]

The PDE corresponding to this Hamiltonian can be obtained similarly to the ODE case:

\[\partial_t{}q(t,\xi;\mu) = \frac{\delta{}\mathcal{H}}{\delta{}p} = p(t,\xi;\mu), \quad \partial_t{}p(t,\xi;\mu) = -\frac{\delta{}\mathcal{H}}{\delta{}q} = \mu^2\partial_{\xi{}\xi}q(t,\xi;\mu)\]

Symplectic Solution Manifold

As with regular parametric PDEs, we also associate a solution manifold with Hamiltonian PDEs. This is a finite-dimensional manifold on which the dynamics can be described through a Hamiltonian ODE. (A proof or further explanation of this statement is still missing.)

Workflow for Symplectic ROM

As with any other reduced order modeling technique we first discretize the PDE. This should be done with a structure-preserving scheme, thus yielding a (high-dimensional) Hamiltonian ODE as a result. Discretizing the wave equation above with finite differences yields a Hamiltonian system:

\[\mathcal{H}_\mathrm{discr}(z(t;\mu);\mu) := \frac{1}{2}z(t;\mu)^T\begin{bmatrix} -\mu^2D_{\xi{}\xi} & \mathbb{O} \\ \mathbb{O} & \mathbb{I} \end{bmatrix} z(t;\mu).\]

In Hamiltonian reduced order modelling we try to find a symplectic submanifold of the solution space[1] that captures the dynamics of the full system as well as possible.

Similar to the regular PDE case we again build an encoder $\Psi^\mathrm{enc}$ and a decoder $\Psi^\mathrm{dec}$; but now both these mappings are required to be symplectic!

Concretely this means:

  1. The encoder is a mapping from a high-dimensional symplectic space to a low-dimensional symplectic space, i.e. $\Psi^\mathrm{enc}:\mathbb{R}^{2N}\to\mathbb{R}^{2n}$ such that $\nabla\Psi^\mathrm{enc}\mathbb{J}_{2N}(\nabla\Psi^\mathrm{enc})^T = \mathbb{J}_{2n}$.
  2. The decoder is a mapping from a low-dimensional symplectic space to a high-dimensional symplectic space, i.e. $\Psi^\mathrm{dec}:\mathbb{R}^{2n}\to\mathbb{R}^{2N}$ such that $(\nabla\Psi^\mathrm{dec})^T\mathbb{J}_{2N}\nabla\Psi^\mathrm{dec} = \mathbb{J}_{2n}$.

If these two maps are constrained to linear maps, then one can easily find good solutions with proper symplectic decomposition (PSD).

Proper Symplectic Decomposition

For PSD the two mappings $\Psi^\mathrm{enc}$ and $\Psi^\mathrm{dec}$ are constrained to be linear, orthonormal (i.e. $\Psi^T\Psi = \mathbb{I}$) and symplectic. The easiest way to enforce this is through the so-called cotangent lift:

\[\Psi_\mathrm{CL} = +PSD and Symplectic Autoencoders · GeometricMachineLearning.jl

Symplectic Autoencoder

Symplectic autoencoders are a type of neural network suitable for treating Hamiltonian parametrized PDEs with slowly decaying Kolmogorov $n$-width. They are based on proper symplectic decomposition (PSD) and symplectic neural networks (SympNets).

Hamiltonian Model Order Reduction

Hamiltonian PDEs are partial differential equations that, like their ODE counterparts, have an associated Hamiltonian. An example of this is the linear wave equation (see [19]) with Hamiltonian

\[\mathcal{H}(q, p; \mu) := \frac{1}{2}\int_\Omega\mu^2(\partial_\xi{}q(t,\xi;\mu))^2 + p(t,\xi;\mu)^2d\xi.\]

The PDE corresponding to this Hamiltonian can be obtained similarly to the ODE case:

\[\partial_t{}q(t,\xi;\mu) = \frac{\delta{}\mathcal{H}}{\delta{}p} = p(t,\xi;\mu), \quad \partial_t{}p(t,\xi;\mu) = -\frac{\delta{}\mathcal{H}}{\delta{}q} = \mu^2\partial_{\xi{}\xi}q(t,\xi;\mu)\]

Symplectic Solution Manifold

As with regular parametric PDEs, we also associate a solution manifold with Hamiltonian PDEs. This is a finite-dimensional manifold on which the dynamics can be described through a Hamiltonian ODE. (A proof or further explanation of this statement is still missing.)

Workflow for Symplectic ROM

As with any other reduced order modeling technique we first discretize the PDE. This should be done with a structure-preserving scheme, thus yielding a (high-dimensional) Hamiltonian ODE as a result. Discretizing the wave equation above with finite differences yields a Hamiltonian system:

\[\mathcal{H}_\mathrm{discr}(z(t;\mu);\mu) := \frac{1}{2}z(t;\mu)^T\begin{bmatrix} -\mu^2D_{\xi{}\xi} & \mathbb{O} \\ \mathbb{O} & \mathbb{I} \end{bmatrix} z(t;\mu).\]
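
A minimal sketch of this discretization is given below. It assumes a central second-order finite-difference stencil with periodic boundary conditions and a uniform grid on $[0, 1)$; the helper `second_difference`, the grid size, the value of $\mu$ and the initial condition are illustrative choices, not the package's implementation.

```julia
using LinearAlgebra

# Central second-order finite-difference approximation of the second derivative
# with periodic boundary conditions (an assumption made for this sketch).
function second_difference(N::Integer, h::Real)
    D = zeros(N, N)
    for i in 1:N
        D[i, i] = -2
        D[i, mod1(i - 1, N)] = 1
        D[i, mod1(i + 1, N)] = 1
    end
    D ./ h^2
end

N, mu = 64, 0.5
h = 1 / N
D = second_difference(N, h)

Z = zeros(N, N)
Id = Matrix{Float64}(I, N, N)

K = -mu^2 .* D
A = [K Z; Z Id]            # block matrix of the quadratic Hamiltonian
J = [Z Id; -Id Z]          # canonical Poisson matrix

hamiltonian(z) = z' * A * z / 2
vector_field(z) = J * (A * z)      # dz/dt = J * grad H(z), with grad H(z) = A * z since A is symmetric

grid = range(0, 1 - h; length = N)
z = vcat(sin.(2 * pi .* grid), cos.(2 * pi .* grid))   # made-up state z = (q, p)
```

The first half of `vector_field(z)` equals $p$ and the second half equals $\mu^2D_{\xi\xi}q$, mirroring the two PDE components above. Because `J` is skew-symmetric, the Hamiltonian is constant along the flow, i.e. `dot(A * z, vector_field(z))` vanishes up to rounding.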

In Hamiltonian reduced order modelling we try to find a symplectic submanifold of the solution space[1] that captures the dynamics of the full system as well as possible.

Similar to the regular PDE case we again build an encoder $\Psi^\mathrm{enc}$ and a decoder $\Psi^\mathrm{dec}$; but now both these mappings are required to be symplectic!

Concretely this means:

  1. The encoder is a mapping from a high-dimensional symplectic space to a low-dimensional symplectic space, i.e. $\Psi^\mathrm{enc}:\mathbb{R}^{2N}\to\mathbb{R}^{2n}$ such that $\nabla\Psi^\mathrm{enc}\mathbb{J}_{2N}(\nabla\Psi^\mathrm{enc})^T = \mathbb{J}_{2n}$.
  2. The decoder is a mapping from a low-dimensional symplectic space to a high-dimensional symplectic space, i.e. $\Psi^\mathrm{dec}:\mathbb{R}^{2n}\to\mathbb{R}^{2N}$ such that $(\nabla\Psi^\mathrm{dec})^T\mathbb{J}_{2N}\nabla\Psi^\mathrm{dec} = \mathbb{J}_{2n}$.

If these two maps are constrained to linear maps, then one can easily find good solutions with proper symplectic decomposition (PSD).

Proper Symplectic Decomposition

For PSD the two mappings $\Psi^\mathrm{enc}$ and $\Psi^\mathrm{dec}$ are constrained to be linear, orthonormal (i.e. $\Psi^T\Psi = \mathbb{I}$) and symplectic. The easiest way to enforce this is through the so-called cotangent lift:

\[\Psi_\mathrm{CL} = \begin{bmatrix} \Phi & \mathbb{O} \\ \mathbb{O} & \Phi \end{bmatrix},\]

where $\Phi\in{}St(n,N)\subset\mathbb{R}^{N\times{}n}$, i.e. $\Phi$ is an element of the Stiefel manifold. If the snapshot matrix is of the form:

\[M = \left[\begin{array}{c:c:c:c} \hat{q}_1(t_0) & \hat{q}_1(t_1) & \quad\ldots\quad & \hat{q}_1(t_f) \\ \hat{q}_2(t_0) & \hat{q}_2(t_1) & \ldots & \hat{q}_2(t_f) \\ @@ -9,4 +9,4 @@ \hat{p}_2(t_0) & \hat{p}_2(t_1) & \ldots & \hat{p}_2(t_f) \\ \ldots & \ldots & \ldots & \ldots \\ \hat{p}_{N}(t_0) & \hat{p}_{N}(t_1) & \ldots & \hat{p}_{N}(t_f) \\ -\end{array}\right],\]

then $\Phi$ can be computed in a very straightforward manner:

  1. Rearrange the rows of the matrix $M$ such that we end up with an $N\times2(f+1)$ matrix: $\hat{M} := [M_q, M_p]$.
  2. Perform SVD: $\hat{M} = U\Sigma{}V^T$; set $\Phi\gets{}U\mathtt{[:,1:n]}$.

For details on the cotangent lift (and other methods for linear symplectic model reduction) consult [11].

Symplectic Autoencoders

PSD suffers from shortcomings similar to those of regular POD: it is a linear map and the approximation space $\tilde{\mathcal{M}}= \{\Psi^\mathrm{dec}(z_r)\in\mathbb{R}^{2N}:z_r\in\mathbb{R}^{2n}\}$ is strictly linear. For problems with slowly decaying Kolmogorov $n$-width this leads to very poor approximations.

In order to overcome this difficulty we use neural networks, more specifically SympNets, together with cotangent lift-like matrices. The resulting architecture, the symplectic autoencoder, is shown in the following image:

So we alternate between SympNet and PSD layers. Because all the PSD layers are based on matrices $\Phi\in{}St(n,N)$ we have to optimize on the Stiefel manifold.

References

[10]
P. Buchfink, S. Glas and B. Haasdonk. Symplectic model reduction of Hamiltonian systems on nonlinear manifolds and approximation with weakly symplectic autoencoder. SIAM Journal on Scientific Computing 45, A289–A311 (2023).
[11]
L. Peng and K. Mohseni. Symplectic model reduction of Hamiltonian systems. SIAM Journal on Scientific Computing 38, A1–A27 (2016).
  • 1 The submanifold is: $\tilde{\mathcal{M}} = \{\Psi^\mathrm{dec}(z_r)\in\mathbb{R}^{2N}:z_r\in\mathbb{R}^{2n}\}$, where $z_r$ is the reduced state of the system.
+\end{array}\right],\]

then $\Phi$ can be computed in a very straightforward manner:

  1. Rearrange the rows of the matrix $M$ such that we end up with an $N\times2(f+1)$ matrix: $\hat{M} := [M_q, M_p]$.
  2. Perform SVD: $\hat{M} = U\Sigma{}V^T$; set $\Phi\gets{}U\mathtt{[:,1:n]}$.
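
The two steps above can be sketched in a few lines. This is an illustrative implementation under the assumption that the first $N$ rows of the snapshot matrix hold the $q$ components and the last $N$ rows the $p$ components; the names `psd_cotangent_lift` and `poisson` and the snapshot data are hypothetical and not the package's API.

```julia
using LinearAlgebra

# Canonical Poisson matrix of size 2k x 2k.
function poisson(k::Integer)
    Z = zeros(k, k)
    Id = Matrix{Float64}(I, k, k)
    [Z Id; -Id Z]
end

# PSD via cotangent lift for a 2N x (f+1) snapshot matrix M.
function psd_cotangent_lift(M::AbstractMatrix, n::Integer)
    N = size(M, 1) ÷ 2
    M_hat = hcat(M[1:N, :], M[N+1:end, :])     # step 1: N x 2(f+1) matrix [M_q, M_p]
    Phi = svd(M_hat).U[:, 1:n]                 # step 2: leading n left singular vectors
    [Phi zeros(N, n); zeros(N, n) Phi]         # cotangent lift, of size 2N x 2n
end

# Made-up snapshots for N = 6 degrees of freedom and 20 time points.
N, n = 6, 2
ts = range(0, 2 * pi; length = 20)
M = vcat([sin(i * t) / i for i in 1:N, t in ts],
         [cos(i * t) for i in 1:N, t in ts])

Psi = psd_cotangent_lift(M, n)
```

By construction `Psi' * Psi` is the identity and `Psi' * poisson(N) * Psi` equals `poisson(n)`, so `Psi` serves as a linear symplectic decoder and its transpose as the corresponding encoder.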

For details on the cotangent lift (and other methods for linear symplectic model reduction) consult [20].

Symplectic Autoencoders

PSD suffers from shortcomings similar to those of regular POD: it is a linear map and the approximation space $\tilde{\mathcal{M}}= \{\Psi^\mathrm{dec}(z_r)\in\mathbb{R}^{2N}:z_r\in\mathbb{R}^{2n}\}$ is strictly linear. For problems with slowly decaying Kolmogorov $n$-width this leads to very poor approximations.

In order to overcome this difficulty we use neural networks, more specifically SympNets, together with cotangent lift-like matrices. The resulting architecture, the symplectic autoencoder, is shown in the following image:

So we alternate between SympNet and PSD layers. Because all the PSD layers are based on matrices $\Phi\in{}St(n,N)$ we have to optimize on the Stiefel manifold.

References

[19]
P. Buchfink, S. Glas and B. Haasdonk. Symplectic model reduction of Hamiltonian systems on nonlinear manifolds and approximation with weakly symplectic autoencoder. SIAM Journal on Scientific Computing 45, A289–A311 (2023).
[20]
L. Peng and K. Mohseni. Symplectic model reduction of Hamiltonian systems. SIAM Journal on Scientific Computing 38, A1–A27 (2016).
  • 1 The submanifold is: $\tilde{\mathcal{M}} = \{\Psi^\mathrm{dec}(z_r)\in\mathbb{R}^{2N}:z_r\in\mathbb{R}^{2n}\}$, where $z_r$ is the reduced state of the system.
diff --git a/latest/references/index.html b/latest/references/index.html index d236b66c4..ed721b83f 100644 --- a/latest/references/index.html +++ b/latest/references/index.html @@ -1,2 +1,2 @@ -References · GeometricMachineLearning.jl

References

[1]
P. Jin, Z. Zhang, A. Zhu, Y. Tang and G. E. Karniadakis. SympNets: Intrinsic structure-preserving symplectic networks for identifying Hamiltonian systems. Neural Networks 132, 166–179 (2020).
[2]
E. Hairer, C. Lubich and G. Wanner. Geometric Numerical integration: structure-preserving algorithms for ordinary differential equations (Springer, 2006).
[3]
B. Leimkuhler and S. Reich. Simulating hamiltonian dynamics. No. 14 (Cambridge university press, 2004).
[4]
P. Jin, Z. Lin and B. Xiao. Optimal unit triangular factorization of symplectic matrices. Linear Algebra and its Applications (2022).
[5]
S. Lipschutz. General Topology (McGraw-Hill Book Company, 1965).
[6]
S. Lang. Fundamentals of differential geometry. Vol. 191 (Springer Science & Business Media, 2012).
[7]
R. L. Bishop and S. I. Goldberg. Tensor Analysis on Manifolds (Dover Publications, 1980).
[8]
P.-A. Absil, R. Mahony and R. Sepulchre. Riemannian geometry of Grassmann manifolds with a view on algorithmic computation. Acta Applicandae Mathematica 80, 199–220 (2004).
[9]
J. Nocedal and S. J. Wright. Numerical optimization (Springer Science+Business Media, 2006).
[10]
P. Buchfink, S. Glas and B. Haasdonk. Symplectic model reduction of Hamiltonian systems on nonlinear manifolds and approximation with weakly symplectic autoencoder. SIAM Journal on Scientific Computing 45, A289–A311 (2023).
[11]
L. Peng and K. Mohseni. Symplectic model reduction of Hamiltonian systems. SIAM Journal on Scientific Computing 38, A1–A27 (2016).
[12]
C. Greif and K. Urban. Decay of the Kolmogorov N-width for wave problems. Applied Mathematics Letters 96, 216–222 (2019).
[13]
T. Blickhan. A registration method for reduced basis problems using linear optimal transport, arXiv preprint arXiv:2304.14884 (2023).
[14]
K. Lee and K. T. Carlberg. Model reduction of dynamical systems on nonlinear manifolds using deep convolutional autoencoders. Journal of Computational Physics 404, 108973 (2020).
[15]
B. Brantner. Generalizing Adam To Manifolds For Efficiently Training Transformers, arXiv preprint arXiv:2305.16901 (2023).
[16]
T. Lin and H. Zha. Riemannian manifold learning. IEEE transactions on pattern analysis and machine intelligence 30, 796–809 (2008).
[17]
T. Blickhan. BrenierTwoFluids.jl, https://github.com/ToBlick/BrenierTwoFluids (2023).
[18]
S. Fresca, L. Dede’ and A. Manzoni. A comprehensive deep learning-based approach to reduced order modeling of nonlinear time-dependent parametrized PDEs. Journal of Scientific Computing 87, 1–36 (2021).
[19]
M.-T. Luong, H. Pham and C. D. Manning. Effective approaches to attention-based neural machine translation, arXiv preprint arXiv:1508.04025 (2015).
[20]
D. Bahdanau, K. Cho and Y. Bengio. Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473 (2014).
[21]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser and I. Polosukhin. Attention is all you need. Advances in neural information processing systems 30 (2017).
[22]
B. Brantner and M. Kraus. Symplectic autoencoders for Model Reduction of Hamiltonian Systems, arXiv preprint arXiv:2312.10004 (2023).
[23]
B. Brantner, G. de Romemont, M. Kraus and Z. Li. Structure-Preserving Transformers for Learning Parametrized Hamiltonian Systems, arXiv preprint arXiv:2312.11166 (2023).
[24]
T. Frankel. The geometry of physics: an introduction (Cambridge university press, Cambridge, UK, 2011).
[25]
I. Goodfellow, Y. Bengio and A. Courville. Deep learning (MIT press, Cambridge, MA, 2016).
[26]
P.-A. Absil, R. Mahony and R. Sepulchre. Optimization algorithms on matrix manifolds (Princeton University Press, Princeton, New Jersey, 2008).
[27]
T. Bendokat, R. Zimmermann and P.-A. Absil. A Grassmann manifold handbook: Basic geometry and computational aspects, arXiv preprint arXiv:2011.13699 (2020).
[28]
W. S. Moses, V. Churavy, L. Paehler, J. Hückelheim, S. H. Narayanan, M. Schanen and J. Doerfert. Reverse-Mode Automatic Differentiation and Optimization of GPU Kernels via Enzyme. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '21 (Association for Computing Machinery, New York, NY, USA, 2021).
[29]
M. Betancourt. A geometric theory of higher-order automatic differentiation, arXiv preprint arXiv:1812.11592 (2018).
[30]
J. Bolte and E. Pauwels. A mathematical model for automatic differentiation in machine learning. Advances in Neural Information Processing Systems 33, 10809–10819 (2020).
[31]
K. Jacobs. Discrete Stochastics (Birkhäuser Verlag, Basel, Switzerland, 1992).
+References · GeometricMachineLearning.jl

References

[1]
P. Jin, Z. Zhang, A. Zhu, Y. Tang and G. E. Karniadakis. SympNets: Intrinsic structure-preserving symplectic networks for identifying Hamiltonian systems. Neural Networks 132, 166–179 (2020).
[2]
E. Hairer, C. Lubich and G. Wanner. Geometric Numerical integration: structure-preserving algorithms for ordinary differential equations (Springer, 2006).
[3]
B. Leimkuhler and S. Reich. Simulating hamiltonian dynamics. No. 14 (Cambridge university press, 2004).
[4]
P. Jin, Z. Lin and B. Xiao. Optimal unit triangular factorization of symplectic matrices. Linear Algebra and its Applications (2022).
[5]
S. Lipschutz. General Topology (McGraw-Hill Book Company, 1965).
[6]
S. Lang. Fundamentals of differential geometry. Vol. 191 (Springer Science & Business Media, 2012).
[7]
R. L. Bishop and S. I. Goldberg. Tensor Analysis on Manifolds (Dover Publications, 1980).
[8]
P.-A. Absil, R. Mahony and R. Sepulchre. Riemannian geometry of Grassmann manifolds with a view on algorithmic computation. Acta Applicandae Mathematica 80, 199–220 (2004).
[9]
W. S. Moses, V. Churavy, L. Paehler, J. Hückelheim, S. H. Narayanan, M. Schanen and J. Doerfert. Reverse-Mode Automatic Differentiation and Optimization of GPU Kernels via Enzyme. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '21 (Association for Computing Machinery, New York, NY, USA, 2021).
[10]
M. Betancourt. A geometric theory of higher-order automatic differentiation, arXiv preprint arXiv:1812.11592 (2018).
[11]
J. Bolte and E. Pauwels. A mathematical model for automatic differentiation in machine learning. Advances in Neural Information Processing Systems 33, 10809–10819 (2020).
[12]
J. Nocedal and S. J. Wright. Numerical optimization (Springer Science+Business Media, 2006).
[13]
D. Bahdanau, K. Cho and Y. Bengio. Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473 (2014).
[14]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser and I. Polosukhin. Attention is all you need. Advances in neural information processing systems 30 (2017).
[15]
K. Jacobs. Discrete Stochastics (Birkhäuser Verlag, Basel, Switzerland, 1992).
[16]
P.-A. Absil, R. Mahony and R. Sepulchre. Optimization algorithms on matrix manifolds (Princeton University Press, Princeton, New Jersey, 2008).
[17]
K. Feng. The step-transition operators for multi-step methods of ODE's. Journal of Computational Mathematics, 193–202 (1998).
[18]
M.-T. Luong, H. Pham and C. D. Manning. Effective approaches to attention-based neural machine translation, arXiv preprint arXiv:1508.04025 (2015).
[19]
P. Buchfink, S. Glas and B. Haasdonk. Symplectic model reduction of Hamiltonian systems on nonlinear manifolds and approximation with weakly symplectic autoencoder. SIAM Journal on Scientific Computing 45, A289–A311 (2023).
[20]
L. Peng and K. Mohseni. Symplectic model reduction of Hamiltonian systems. SIAM Journal on Scientific Computing 38, A1–A27 (2016).
[21]
C. Greif and K. Urban. Decay of the Kolmogorov N-width for wave problems. Applied Mathematics Letters 96, 216–222 (2019).
[22]
T. Blickhan. A registration method for reduced basis problems using linear optimal transport, arXiv preprint arXiv:2304.14884 (2023).
[23]
K. Lee and K. T. Carlberg. Model reduction of dynamical systems on nonlinear manifolds using deep convolutional autoencoders. Journal of Computational Physics 404, 108973 (2020).
[24]
B. Brantner. Generalizing Adam To Manifolds For Efficiently Training Transformers, arXiv preprint arXiv:2305.16901 (2023).
[25]
T. Lin and H. Zha. Riemannian manifold learning. IEEE transactions on pattern analysis and machine intelligence 30, 796–809 (2008).
[26]
T. Blickhan. BrenierTwoFluids.jl, https://github.com/ToBlick/BrenierTwoFluids (2023).
[27]
S. Fresca, L. Dede’ and A. Manzoni. A comprehensive deep learning-based approach to reduced order modeling of nonlinear time-dependent parametrized PDEs. Journal of Scientific Computing 87, 1–36 (2021).
[28]
B. Brantner and M. Kraus. Symplectic autoencoders for Model Reduction of Hamiltonian Systems, arXiv preprint arXiv:2312.10004 (2023).
[29]
B. Brantner, G. de Romemont, M. Kraus and Z. Li. Structure-Preserving Transformers for Learning Parametrized Hamiltonian Systems, arXiv preprint arXiv:2312.11166 (2023).
[30]
T. Frankel. The geometry of physics: an introduction (Cambridge university press, Cambridge, UK, 2011).
[31]
I. Goodfellow, Y. Bengio and A. Courville. Deep learning (MIT press, Cambridge, MA, 2016).
[32]
T. Bendokat, R. Zimmermann and P.-A. Absil. A Grassmann manifold handbook: Basic geometry and computational aspects, arXiv preprint arXiv:2011.13699 (2020).
diff --git a/latest/search_index.js b/latest/search_index.js index 12b719ed0..7dbca2e1e 100644 --- a/latest/search_index.js +++ b/latest/search_index.js @@ -1,3 +1,3 @@ var documenterSearchIndex = {"docs": -[{"location":"manifolds/grassmann_manifold/#Grassmann-Manifold","page":"Grassmann","title":"Grassmann Manifold","text":"","category":"section"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"(The description of the Grassmann manifold is based on that of the Stiefel manifold, so this should be read first.)","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"An element of the Grassmann manifold G(nN) is a vector subspace subsetmathbbR^N of dimension n. Each such subspace (i.e. element of the Grassmann manifold) can be represented by a full-rank matrix AinmathbbR^Ntimesn and we identify two elements with the following equivalence relation: ","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"A_1 sim A_2 iff existsCinmathbbR^ntimesntext st A_1C = A_2","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"The resulting manifold is of dimension n(N-n). One can find a parametrization of the manifold the following way: Because the matrix Y has full rank, there have to be n independent columns in it: i_1 ldots i_n. For simplicity assume that i_1 = 1 i_2=2 ldots i_n=n and call the matrix made up by these columns C. Then the mapping to the coordinate chart is: YC^-1 and the last N-n columns are the coordinates.","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"We can also define the Grassmann manifold based on the Stiefel manifold since elements of the Stiefel manifold are already full-rank matrices. 
In this case we have the following equivalence relation (for Y_1 Y_2inSt(nN)): ","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"Y_1 sim Y_2 iff existsCinO(n)text st Y_1C = Y_2","category":"page"},{"location":"manifolds/grassmann_manifold/#The-Riemannian-Gradient","page":"Grassmann","title":"The Riemannian Gradient","text":"","category":"section"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"Obtaining the Riemannian Gradient for the Grassmann manifold is slightly more difficult than it is in the case of the Stiefel manifold. Since the Grassmann manifold can be obtained from the Stiefel manifold through an equivalence relation however, we can use this as a starting point. In a first step we identify charts on the Grassmann manifold to make dealing with it easier. For this consider the following open cover of the Grassmann manifold (also see [8]): ","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"mathcalU_W_WinSt(n N) quadtextwherequad mathcalU_W = mathrmspan(Y)mathrmdet(W^TY)neq0","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"We can find a canonical bijective mapping from the set mathcalU_W to the set mathcalS_W = YinmathbbR^NtimesnW^TY=mathbbI_n:","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"sigma_W mathcalU_W to mathcalS_W mathcalY=mathrmspan(Y)mapstoY(W^TY)^-1 = hatY","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"That sigma_W is well-defined is easy to see: Consider YC with CinmathbbR^ntimesn non-singular. Then YC(W^TYC)^-1=Y(W^TY)^-1 = hatY. 
With this isomorphism we can also find a representation of elements of the tangent space:","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"T_mathcalYsigma_W T_mathcalYGr(nN)toT_hatYmathcalS_W xi mapsto (xi_diamondY -hatY(W^Txi_diamondY))(W^TY)^-1","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"xi_diamondY is the representation of xiinT_mathcalYGr(nN) for the point YinSt(nN), i.e. T_Ypi(xi_diamondY) = xi; because the map sigma_W does not care about the representation of mathrmspan(Y) we can perform the variations in St(nN)[1]:","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"[1]: I.e. Y(t)inSt(nN) for tin(-varepsilonvarepsilon). We also set Y(0) = Y.","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"fracddtY(t)(W^TY(t))^-1 = (dotY(0) - Y(W^TY)^-1W^TdotY(0))(W^TY)^-1","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"where dotY(0)inT_YSt(nN). Also note that the representation of xi in T_YSt(nN) is not unique in general, but T_mathcalYsigma_W is still well-defined. To see this consider two curves Y(t) and barY(t) for which we have Y(0) = barY(0) = Y and further Tpi(dotY(0)) = Tpi(dotbarY(0)). This is equivalent to being able to find a C(cdot)(-varepsilonvarepsilon)toO(n) for which C(0)=mathbbI(0) s.t. barY(t) = Y(t)C(t). We thus have dotbarY(0) = dotY(0) + YdotC(0) and if we replace xi_diamondY above with the second term in the expression we get: YdotC(0) - hatYW^T(YdotC(0)) = 0. 
The parametrization of T_mathcalYGr(nN) with T_mathcalYsigma_W is thus independent of the choice of dotC(0) and hence of xi_diamondY and is therefore well-defined.","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"Further note that we have T_mathcalYmathcalU_W = T_mathcalYGr(nN) because mathcalU_W is an open subset of Gr(nN). We thus can identify the tangent space T_mathcalYGr(nN) with the following set (where we again have hatY=Y(W^TY)^-1):","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"T_hatYmathcalS_W = (Delta - Y(W^TY)^-1W^TDelta)(W^TDelta)^-1 YinSt(nN)text st mathrmspan(Y)=mathcalYtext and DeltainT_YSt(nN)","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"If we now further take W=Y[2] then we get the identification: ","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"[2]: We can pick any element W to construct the charts for a neighborhood around the point mathcalYinGr(nN) as long as we have mathrmdet(W^TY)neq0 for mathrmspan(Y)=mathcalY. ","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"T_mathcalYGr(nN) equiv Delta - YY^TDelta YinSt(nN)text st mathrmspan(Y)=mathcalYtext and DeltainT_YSt(nN)","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"which is very easy to handle computationally (we simply store and change the matrix Y that represents an element of the Grassmann manifold). 
The Riemannian gradient is then ","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"mathrmgrad_mathcalY^GrL = mathrmgrad_Y^StL - YY^Tmathrmgrad_Y^StL = nabla_YL - YY^Tnabla_YL","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"where nabla_YL again is the Euclidean gradient as in the Stiefel manifold case.","category":"page"},{"location":"references/#References","page":"References","title":"References","text":"","category":"section"},{"location":"references/","page":"References","title":"References","text":"P. Jin, Z. Zhang, A. Zhu, Y. Tang and G. E. Karniadakis. SympNets: Intrinsic structure-preserving symplectic networks for identifying Hamiltonian systems. Neural Networks 132, 166–179 (2020).\n\n\n\nE. Hairer, C. Lubich and G. Wanner. Geometric Numerical integration: structure-preserving algorithms for ordinary differential equations (Springer, 2006).\n\n\n\nB. Leimkuhler and S. Reich. Simulating hamiltonian dynamics. No. 14 (Cambridge university press, 2004).\n\n\n\nP. Jin, Z. Lin and B. Xiao. Optimal unit triangular factorization of symplectic matrices. Linear Algebra and its Applications (2022).\n\n\n\nS. Lipschutz. General Topology (McGraw-Hill Book Company, 1965).\n\n\n\nS. Lang. Fundamentals of differential geometry. Vol. 191 (Springer Science & Business Media, 2012).\n\n\n\nS. I. Richard L. Bishop. Tensor Analysis on Manifolds (Dover Publications, 1980).\n\n\n\nP.-A. Absil, R. Mahony and R. Sepulchre. Riemannian geometry of Grassmann manifolds with a view on algorithmic computation. Acta Applicandae Mathematica 80, 199–220 (2004).\n\n\n\nJ. N. Stephen J. Wright. Numerical optimization (Springer Science+Business Media, 2006).\n\n\n\nP. Buchfink, S. Glas and B. Haasdonk. Symplectic model reduction of Hamiltonian systems on nonlinear manifolds and approximation with weakly symplectic autoencoder. 
SIAM Journal on Scientific Computing 45, A289–A311 (2023).\n\n\n\nL. Peng and K. Mohseni. Symplectic model reduction of Hamiltonian systems. SIAM Journal on Scientific Computing 38, A1–A27 (2016).\n\n\n\nC. Greif and K. Urban. Decay of the Kolmogorov N-width for wave problems. Applied Mathematics Letters 96, 216–222 (2019).\n\n\n\nT. Blickhan. A registration method for reduced basis problems using linear optimal transport, arXiv preprint arXiv:2304.14884 (2023).\n\n\n\nK. Lee and K. T. Carlberg. Model reduction of dynamical systems on nonlinear manifolds using deep convolutional autoencoders. Journal of Computational Physics 404, 108973 (2020).\n\n\n\nB. Brantner. Generalizing Adam To Manifolds For Efficiently Training Transformers, arXiv preprint arXiv:2305.16901 (2023).\n\n\n\nT. Lin and H. Zha. Riemannian manifold learning. IEEE transactions on pattern analysis and machine intelligence 30, 796–809 (2008).\n\n\n\nT. Blickhan. BrenierTwoFluids.jl, https://github.com/ToBlick/BrenierTwoFluids (2023).\n\n\n\nS. Fresca, L. Dede’ and A. Manzoni. A comprehensive deep learning-based approach to reduced order modeling of nonlinear time-dependent parametrized PDEs. Journal of Scientific Computing 87, 1–36 (2021).\n\n\n\nM.-T. Luong, H. Pham and C. D. Manning. Effective approaches to attention-based neural machine translation, arXiv preprint arXiv:1508.04025 (2015).\n\n\n\nD. Bahdanau, K. Cho and Y. Bengio. Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473 (2014).\n\n\n\nA. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser and I. Polosukhin. Attention is all you need. Advances in neural information processing systems 30 (2017).\n\n\n\nB. Brantner and M. Kraus. Symplectic autoencoders for Model Reduction of Hamiltonian Systems, arXiv preprint arXiv:2312.10004 (2023).\n\n\n\nB. Brantner, G. de Romemont, M. Kraus and Z. Li. 
Structure-Preserving Transformers for Learning Parametrized Hamiltonian Systems, arXiv preprint arXiv:2312.11166 (2023).\n\n\n\nT. Frankel. The geometry of physics: an introduction (Cambridge university press, Cambridge, UK, 2011).\n\n\n\nI. Goodfellow, Y. Bengio and A. Courville. Deep learning (MIT press, Cambridge, MA, 2016).\n\n\n\nP.-A. Absil, R. Mahony and R. Sepulchre. Optimization algorithms on matrix manifolds (Princeton University Press, Princeton, New Jersey, 2008).\n\n\n\nT. Bendokat, R. Zimmermann and P.-A. Absil. A Grassmann manifold handbook: Basic geometry and computational aspects, arXiv preprint arXiv:2011.13699 (2020).\n\n\n\nW. S. Moses, V. Churavy, L. Paehler, J. Hückelheim, S. H. Narayanan, M. Schanen and J. Doerfert. Reverse-Mode Automatic Differentiation and Optimization of GPU Kernels via Enzyme. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '21 (Association for Computing Machinery, New York, NY, USA, 2021).\n\n\n\nM. Betancourt. A geometric theory of higher-order automatic differentiation, arXiv preprint arXiv:1812.11592 (2018).\n\n\n\nJ. Bolte and E. Pauwels. A mathematical model for automatic differentiation in machine learning. Advances in Neural Information Processing Systems 33, 10809–10819 (2020).\n\n\n\nK. Jacobs. Discrete Stochastics (Birkhäuser Verlag, Basel, Switzerland, 1992).\n\n\n\n","category":"page"},{"location":"manifolds/stiefel_manifold/#Stiefel-manifold","page":"Stiefel","title":"Stiefel manifold","text":"","category":"section"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"The Stiefel manifold St(n N) is the space (a homogeneous space) of all orthonormal frames in mathbbR^Ntimesn, i.e. matrices YinmathbbR^Ntimesn s.t. Y^TY = mathbbI_n. 
It can also be seen as the special orthogonal group SO(N) modulo an equivalence relation: AsimBiffAE = BE for ","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"E = beginbmatrix\nmathbbI_n \nmathbbO\nendbmatrixinmathcalM","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"which is the canonical element of the Stiefel manifold. In words: the first n columns of A and B are the same.","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"The tangent space to the element YinSt(nN) can easily be determined: ","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"T_YSt(nN)=DeltaDelta^TY + Y^TDelta = 0","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"The Lie algebra of SO(N) is mathfrakso(N)=VinmathbbR^NtimesNV^T + V = 0 and the canonical metric associated with it is simply (V_1V_2)mapstofrac12mathrmTr(V_1^TV_2).","category":"page"},{"location":"manifolds/stiefel_manifold/#The-Riemannian-Gradient","page":"Stiefel","title":"The Riemannian Gradient","text":"","category":"section"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"For matrix manifolds (like the Stiefel manifold), the Riemannian gradient of a function can be easily determined computationally:","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"The Euclidean gradient of a function L is equivalent to an element of the cotangent space T^*_YmathcalM via: ","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"langlenablaLcdotrangleT_YmathcalM to mathbbR Delta mapsto sum_ijnablaL_ijDelta_ij = 
mathrmTr(nablaL^TDelta)","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"We can then utilize the Riemannian metric on mathcalM to map the element from the cotangent space (i.e. nablaL) to the tangent space. This element is called mathrmgrad_(cdot)L here. Explicitly, it is given by: ","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":" mathrmgrad_YL = nabla_YL - Y(nabla_YL)^TY","category":"page"},{"location":"manifolds/stiefel_manifold/#rgrad","page":"Stiefel","title":"rgrad","text":"","category":"section"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"What was referred to as nablaL before can in practice be obtained with an AD routine. We then use the function rgrad to map this Euclidean gradient to inT_YSt(nN). This mapping has the property: ","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"mathrmTr((nablaL)^TDelta) = g_Y(mathttrgrad(Y nablaL) Delta) forallDeltainT_YSt(nN)","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"and g is the Riemannian metric.","category":"page"},{"location":"arrays/skew_symmetric_matrix/#SymmetricMatrix-and-SkewSymMatrix","page":"Symmetric and Skew-Symmetric Matrices","title":"SymmetricMatrix and SkewSymMatrix","text":"","category":"section"},{"location":"arrays/skew_symmetric_matrix/","page":"Symmetric and Skew-Symmetric Matrices","title":"Symmetric and Skew-Symmetric Matrices","text":"There are special implementations of symmetric and skew-symmetric matrices in GeometricMachineLearning.jl. They are implemented to work on GPU and for multiplication with tensors. 
The following image demonstrates how the data necessary for an instance of SkewSymMatrix are stored[1]:","category":"page"},{"location":"arrays/skew_symmetric_matrix/","page":"Symmetric and Skew-Symmetric Matrices","title":"Symmetric and Skew-Symmetric Matrices","text":"[1]: It works similarly for SymmetricMatrix. ","category":"page"},{"location":"arrays/skew_symmetric_matrix/","page":"Symmetric and Skew-Symmetric Matrices","title":"Symmetric and Skew-Symmetric Matrices","text":"import Images, Plots # hide\nif Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nelse # hide\n Plots.plot(Images.load(\"../tikz/skew_sym_visualization.png\"), axis=([], false)) # hide\nend # hide","category":"page"},{"location":"arrays/skew_symmetric_matrix/","page":"Symmetric and Skew-Symmetric Matrices","title":"Symmetric and Skew-Symmetric Matrices","text":"if Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nend # ","category":"page"},{"location":"arrays/skew_symmetric_matrix/","page":"Symmetric and Skew-Symmetric Matrices","title":"Symmetric and Skew-Symmetric Matrices","text":"So what is stored internally is a vector of size n(n-1)2 for the skew-symmetric matrix and a vector of size n(n+1)2 for the symmetric matrix. 
We can sample a random skew-symmetric matrix: ","category":"page"},{"location":"arrays/skew_symmetric_matrix/","page":"Symmetric and Skew-Symmetric Matrices","title":"Symmetric and Skew-Symmetric Matrices","text":"using GeometricMachineLearning # hide \n\nA = rand(SkewSymMatrix, 5)","category":"page"},{"location":"arrays/skew_symmetric_matrix/","page":"Symmetric and Skew-Symmetric Matrices","title":"Symmetric and Skew-Symmetric Matrices","text":"and then access the vector:","category":"page"},{"location":"arrays/skew_symmetric_matrix/","page":"Symmetric and Skew-Symmetric Matrices","title":"Symmetric and Skew-Symmetric Matrices","text":"A.S ","category":"page"},{"location":"manifolds/submersion_theorem/#The-Submersion-Theorem","page":"The Submersion Theorem","title":"The Submersion Theorem","text":"","category":"section"},{"location":"manifolds/submersion_theorem/","page":"The Submersion Theorem","title":"The Submersion Theorem","text":"The submersion theorem is an application of the inverse function theorem that we need in order to show that the spaces we deal with here are indeed manifolds. ","category":"page"},{"location":"optimizers/general_optimization/#Optimization-for-Neural-Networks","page":"General Optimization","title":"Optimization for Neural Networks","text":"","category":"section"},{"location":"optimizers/general_optimization/","page":"General Optimization","title":"General Optimization","text":"Optimization for neural networks is (almost always) some variation on gradient descent. 
The most basic form of gradient descent is a discretization of the gradient flow equation:","category":"page"},{"location":"optimizers/general_optimization/","page":"General Optimization","title":"General Optimization","text":"dottheta = -nabla_thetaL","category":"page"},{"location":"optimizers/general_optimization/","page":"General Optimization","title":"General Optimization","text":"by means of an Euler time-stepping scheme: ","category":"page"},{"location":"optimizers/general_optimization/","page":"General Optimization","title":"General Optimization","text":"theta^t+1 = theta^t - hnabla_theta^tL","category":"page"},{"location":"optimizers/general_optimization/","page":"General Optimization","title":"General Optimization","text":"where h (the time step of the Euler scheme) is referred to as the learning rate.","category":"page"},{"location":"optimizers/general_optimization/","page":"General Optimization","title":"General Optimization","text":"This equation can easily be generalized to manifolds by replacing the Euclidean gradient nabla_theta^tL by a Riemannian gradient mathrmgrad_theta^tL and addition by -hnabla_theta^tL with a retraction by -hmathrmgrad_theta^tL.","category":"page"},{"location":"optimizers/manifold_related/horizontal_lift/#The-Horizontal-Lift","page":"Horizontal Lift","title":"The Horizontal Lift","text":"","category":"section"},{"location":"optimizers/manifold_related/horizontal_lift/","page":"Horizontal Lift","title":"Horizontal Lift","text":"For each element YinmathcalM we can perform a splitting mathfrakg = mathfrakg^mathrmhor Yoplusmathfrakg^mathrmver Y, where the two subspaces are the horizontal and the vertical component of mathfrakg at Y respectively. For homogeneous spaces: T_YmathcalM = mathfrakgcdotY, i.e. every tangent space to mathcalM can be expressed through the application of the Lie algebra to the relevant element. 
The vertical component consists of those elements of mathfrakg which are mapped to the zero element of T_YmathcalM, i.e. ","category":"page"},{"location":"optimizers/manifold_related/horizontal_lift/","page":"Horizontal Lift","title":"Horizontal Lift","text":"mathfrakg^mathrmver Y = mathrmker(mathfrakgtoT_YmathcalM)","category":"page"},{"location":"optimizers/manifold_related/horizontal_lift/","page":"Horizontal Lift","title":"Horizontal Lift","text":"The orthogonal complement[1] of mathfrakg^mathrmver Y is the horizontal component and is denoted by mathfrakg^mathrmhor Y. This is naturally isomorphic to T_YmathcalM. For the Stiefel manifold the horizontal lift has the simple form: ","category":"page"},{"location":"optimizers/manifold_related/horizontal_lift/","page":"Horizontal Lift","title":"Horizontal Lift","text":"Omega(Y V) = left(mathbbI - frac12YY^Tright)VY^T - YV^T(mathbbI - frac12YY^T)","category":"page"},{"location":"optimizers/manifold_related/horizontal_lift/","page":"Horizontal Lift","title":"Horizontal Lift","text":"If the element Y is the distinguished element E, then the elements of mathfrakg^mathrmhorE take a particularly simple form, see Global Tangent Space for a description of this. ","category":"page"},{"location":"optimizers/manifold_related/horizontal_lift/","page":"Horizontal Lift","title":"Horizontal Lift","text":"[1]: The orthogonal complement is taken with respect to a metric defined on mathfrakg. 
For the case of G=SO(N) and mathfrakg=mathfrakso(N) = AA+A^T =0 this metric can be chosen as (A_1A_2)mapstofrac12mathrmTr(A_1^TA_2).","category":"page"},{"location":"optimizers/manifold_related/retractions/#Retractions","page":"Retractions","title":"Retractions","text":"","category":"section"},{"location":"optimizers/manifold_related/retractions/#Classical-Definition","page":"Retractions","title":"Classical Definition","text":"","category":"section"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"Classically, retractions are defined as smooth maps ","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"R TmathcalMtomathcalM(xv)mapstoR_x(v)","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"such that each curve c(t) = R_x(tv) satisfies c(0) = x and c'(0) = v.","category":"page"},{"location":"optimizers/manifold_related/retractions/#In-GeometricMachineLearning","page":"Retractions","title":"In GeometricMachineLearning","text":"","category":"section"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"Retractions are maps from the horizontal component of the Lie algebra mathfrakg^mathrmhor to the respective manifold.","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"For optimization in neural networks (almost always first order) we solve a gradient flow equation ","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"dotW = -mathrmgrad_WL ","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"where mathrmgrad_WL is the Riemannian gradient of the loss function L evaluated at position 
W.","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"If we deal with Euclidean spaces (vector spaces), then the Riemannian gradient is just the result of an AD routine and the solution of the equation above can be approximated with W^t+1 gets W^t - etanabla_W^tL, where eta is the learning rate. ","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"For manifolds, after we have obtained the Riemannian gradient (see e.g. the section on the Stiefel manifold), we have to solve a geodesic equation. This is a canonical ODE associated with any Riemannian manifold. ","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"The general theory of Riemannian manifolds is rather complicated, but for the neural networks treated in GeometricMachineLearning, we only rely on optimization on matrix Lie groups and homogeneous spaces, which is much simpler. ","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"For Lie groups each tangent space is isomorphic to the Lie algebra mathfrakgequivT_mathbbIG. The geodesic map from mathfrakg to G, for matrix Lie groups with bi-invariant Riemannian metric like SO(N), is simply the application of the matrix exponential exp. 
Alternatively this can be replaced by the Cayley transform (see (Absil et al, 2008).)","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"Starting from this basic map expmathfrakgtoG we can build mappings for more complicated cases: ","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"General tangent space to a Lie group T_AG: The geodesic map for an element VinT_AG is simply Aexp(A^-1V).\nSpecial tangent space to a homogeneous space T_EmathcalM: For V=BEinT_EmathcalM the exponential map is simply exp(B)E. \nGeneral tangent space to a homogeneous space T_YmathcalM with Y = AE: For Delta=ABEinT_YmathcalM the exponential map is simply Aexp(B)E. This is the general case which we deal with. ","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"The general theory behind points 2. and 3. is discussed in chapter 11 of (O'Neill, 1983). The function retraction in GeometricMachineLearning performs mathfrakg^mathrmhortomathcalM, which is the second of the above points. To get the third from the second point, we simply have to multiply with a matrix from the left. This step is done with apply_section and represented through the red vertical line in the diagram on the general optimizer framework.","category":"page"},{"location":"optimizers/manifold_related/retractions/#Word-of-caution","page":"Retractions","title":"Word of caution","text":"","category":"section"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"The Lie group corresponding to the Stiefel manifold SO(N) has a bi-invariant Riemannian metric associated with it: (B_1B_2)mapsto mathrmTr(B_1^TB_2). For other Lie groups (e.g. 
the symplectic group) the situation is slightly more difficult (see (Bendokat et al, 2021).)","category":"page"},{"location":"optimizers/manifold_related/retractions/#References","page":"Retractions","title":"References","text":"","category":"section"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"Absil P A, Mahony R, Sepulchre R. Optimization algorithms on matrix manifolds[M]. Princeton University Press, 2008.\nBendokat T, Zimmermann R. The real symplectic Stiefel and Grassmann manifolds: metrics, geodesics and applications[J]. arXiv preprint arXiv:2108.12447, 2021.\nO'Neill, Barrett. Semi-Riemannian geometry with applications to relativity. Academic press, 1983.","category":"page"},{"location":"reduced_order_modeling/autoencoder/#Reduced-Order-modeling-and-Autoencoders","page":"POD and Autoencoders","title":"Reduced Order modeling and Autoencoders","text":"","category":"section"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"Reduced order modeling is a data-driven technique that exploits the structure of parametric PDEs to make solving those PDEs easier.","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"Consider a parametric PDE written in the form: F(z(mu)mu)=0 where z(mu) evolves on an infinite-dimensional Hilbert space V. ","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"In modeling any PDE we have to choose a discretization (particle discretization, finite element method, ...) of V, which will be denoted by V_h. 
","category":"page"},{"location":"reduced_order_modeling/autoencoder/#Solution-manifold","page":"POD and Autoencoders","title":"Solution manifold","text":"","category":"section"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"To any parametric PDE we associate a solution manifold: ","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"mathcalM = z(mu)F(z(mu)mu)=0 muinmathbbP","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"(Image: )","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"In the image above a 2-dimensional solution manifold is visualized as a sub-manifold in 3-dimensional space. In general the embedding space is an infinite-dimensional function space.","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"As an example of this consider the 1-dimensional wave equation: ","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"partial_tt^2q(tximu) = mu^2partial_xixi^2q(tximu)text on ItimesOmega","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"where I = (01) and Omega=(-1212). As initial condition for the first derivative we have partial_tq(0ximu) = -mupartial_xiq_0(ximu) and furthermore q(tximu)=0 on the boundary (i.e. 
xiin-1212).","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"The solution manifold is a 1-dimensional submanifold: ","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"mathcalM = (t xi)mapstoq(tximu)=q_0(xi-mutmu)muinmathbbPsubsetmathbbR","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"If we provide an initial condition q_0, a parameter instance mu and a time t, then ximapstoq(tximu) will be the momentary solution. If we consider the time evolution of q(tximu), then it evolves on a two-dimensional submanifold barmathcalM = ximapstoq(tximu)tinImuinmathbbP.","category":"page"},{"location":"reduced_order_modeling/autoencoder/#General-workflow","page":"POD and Autoencoders","title":"General workflow","text":"","category":"section"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"In reduced order modeling we aim to construct a mapping to a space that is close to this solution manifold. This is done through the following steps: ","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"Discretize the PDE.\nSolve the discretized PDE for a certain set of parameter instances muinmathbbP.\nBuild a reduced basis with the data obtained from having solved the discretized PDE. This step consists of finding two mappings: the reduction mathcalP and the reconstruction mathcalR.","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"The third step can be done with various machine learning (ML) techniques. 
Traditionally the most popular of these has been proper orthogonal decomposition (POD), but in recent years autoencoders have also become a popular alternative (see (Fresca et al, 2021)). ","category":"page"},{"location":"reduced_order_modeling/autoencoder/#References","page":"POD and Autoencoders","title":"References","text":"","category":"section"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"S. Fresca, L. Dede’ and A. Manzoni. A comprehensive deep learning-based approach to reduced order modeling of nonlinear time-dependent parametrized PDEs. Journal of Scientific Computing 87, 1–36 (2021).\n\n\n\n","category":"page"},{"location":"manifolds/manifolds/#(Matrix)-Manifolds","page":"General Theory on Manifolds","title":"(Matrix) Manifolds","text":"","category":"section"},{"location":"manifolds/manifolds/","page":"General Theory on Manifolds","title":"General Theory on Manifolds","text":"Manifolds are topological spaces that locally look like vector spaces. In the following we restrict ourselves to finite-dimensional manifolds. Definition: A finite-dimensional smooth manifold of dimension n is a second-countable Hausdorff space mathcalM for which forallxinmathcalM we can find a neighborhood U that contains x and a corresponding homeomorphism varphi_UUcongWsubsetmathbbR^n where W is an open subset. The homeomorphisms varphi_U are referred to as coordinate charts. If two such coordinate charts overlap, i.e. if U_1capU_2neq, then the map varphi_U_2circvarphi_U_1^-1 is C^infty.","category":"page"},{"location":"manifolds/manifolds/","page":"General Theory on Manifolds","title":"General Theory on Manifolds","text":"One example of a manifold that is also important for GeometricMachineLearning.jl is the Lie group[1] of special orthogonal matrices SO(N). 
Before we can prove that SO(N) is a manifold we first need another definition and a theorem:","category":"page"},{"location":"manifolds/manifolds/","page":"General Theory on Manifolds","title":"General Theory on Manifolds","text":"[1]: Lie groups are manifolds that also have a group structure, i.e. there is an operation mathcalMtimesmathcalMtomathcalM(ab)mapstoab s.t. (ab)c = a(bc) and existseinmathcalM s.t. ae = a forallainmathcalM.","category":"page"},{"location":"manifolds/manifolds/","page":"General Theory on Manifolds","title":"General Theory on Manifolds","text":"Definition: Consider a smooth mapping g mathcalMtomathcalN from one manifold to another. A point BinmathcalN is called a regular value of g if forallAing^-1B the map T_AgT_AmathcalMtoT_g(A)mathcalN is surjective. ","category":"page"},{"location":"manifolds/manifolds/","page":"General Theory on Manifolds","title":"General Theory on Manifolds","text":"Theorem: Consider a smooth map gmathcalMtomathcalN from one manifold to another. Then the preimage of a regular value BinmathcalN is a submanifold of mathcalM. Furthermore the codimension of g^-1B is equal to the dimension of mathcalN and the tangent space T_A(g^-1B) is equal to the kernel of T_Ag. This is known as the preimage theorem.","category":"page"},{"location":"manifolds/manifolds/","page":"General Theory on Manifolds","title":"General Theory on Manifolds","text":"Proof: ","category":"page"},{"location":"manifolds/manifolds/","page":"General Theory on Manifolds","title":"General Theory on Manifolds","text":"Theorem: The group SO(N) is a Lie group (i.e. has manifold structure). Proof: The vector space mathbbR^NtimesN clearly has manifold structure. The group SO(N) is equivalent to one of the level sets of the mapping: fmathbbR^NtimesNtomathcalS(N) AmapstoA^TA, i.e. it is the component of f^-1mathbbI that contains mathbbI. We still need to prove that mathbbI is a regular value of f, i.e. that for AinSO(N) the mapping T_Af is surjective. 
This means that forallBinmathcalS(N) AinSO(N) existsCinmathbbR^NtimesN s.t. C^TA + A^TC = B. The element C=frac12ABinmathbbR^NtimesN satisfies this property.","category":"page"},{"location":"manifolds/manifolds/","page":"General Theory on Manifolds","title":"General Theory on Manifolds","text":"With the definition above we can generalize the notion of an ordinary differential equation (ODE) on a vector space to an ordinary differential equation on a manifold:","category":"page"},{"location":"manifolds/manifolds/","page":"General Theory on Manifolds","title":"General Theory on Manifolds","text":"Definition: An ODE on a manifold is a mapping that assigns to each element of the manifold AinmathcalM an element of the corresponding tangent space T_AmathcalM.","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/#Horizontal-component-of-the-Lie-algebra-\\mathfrak{g}","page":"Stiefel Global Tangent Space","title":"Horizontal component of the Lie algebra mathfrakg","text":"","category":"section"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"What we use to generalize Adam (and other algorithms) to manifolds is a global tangent space representation of the homogeneous spaces. ","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"For the Stiefel manifold, this global tangent space representation takes a simple form: ","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"mathcalB = beginbmatrix\n A -B^T \n B mathbbO\nendbmatrix","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"where AinmathbbR^ntimesn is skew-symmetric and BinmathbbR^(N-n)timesn is arbitrary. 
In GeometricMachineLearning the struct StiefelLieAlgHorMatrix implements elements of this form.","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/#Theoretical-background","page":"Stiefel Global Tangent Space","title":"Theoretical background","text":"","category":"section"},{"location":"arrays/stiefel_lie_alg_horizontal/#Vertical-and-horizontal-components","page":"Stiefel Global Tangent Space","title":"Vertical and horizontal components","text":"","category":"section"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"The Stiefel manifold St(n N) is a homogeneous space obtained from SO(N) by declaring two matrices whose first n columns coincide to be equivalent. Another way of expressing this is: ","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"A_1 sim A_2 iff A_1E = A_2E","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"for ","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"E = beginbmatrix mathbbI mathbbOendbmatrix","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"Because St(nN) is a homogeneous space, we can take any element YinSt(nN) and SO(N) acts transitively on it, i.e. can produce any other element in St(nN). 
A similar statement is also true regarding the tangent spaces of St(nN), namely: ","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"T_YSt(nN) = mathfrakgcdotY","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"i.e. every tangent space can be expressed through an action of the associated Lie algebra. ","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"The kernel of the mapping mathfrakgtoT_YSt(nN) BmapstoBY is referred to as mathfrakg^mathrmverY, the vertical component of the Lie algebra at Y. In the case Y=E it is easy to see that elements belonging to mathfrakg^mathrmverE are of the following form: ","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"beginbmatrix\nhatmathbbO tildemathbbO^T \ntildemathbbO C\nendbmatrix","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"where hatmathbbOinmathbbR^ntimesn is a \"small\" zero matrix and tildemathbbOinmathbbR^(N-n)timesn is a bigger zero matrix. CinmathbbR^(N-n)times(N-n) is a skew-symmetric matrix. ","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"The orthogonal complement of the vertical component is referred to as the horizontal component and denoted by mathfrakg^mathrmhor Y. It is isomorphic to T_YSt(nN) and this isomorphism can be found explicitly. 
In the case of the Stiefel manifold: ","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"Omega(Y cdot)T_YSt(nN)tomathfrakg^mathrmhorY Delta mapsto (mathbbI - frac12YY^T)DeltaY^T - YDelta^T(mathbbI - frac12YY^T)","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"For the special case Y=E we write mathfrakg^mathrmhorE=mathfrakg^mathrmhor. Its elements are of the form described at the top of this page.","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/#Special-functions","page":"Stiefel Global Tangent Space","title":"Special functions","text":"","category":"section"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"You can also draw random elements from mathfrakg^mathrmhor through e.g. ","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"rand(CUDADevice(), StiefelLieAlgHorMatrix{Float32}, 10, 5)","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"In this example: N=10 and n=5.","category":"page"},{"location":"reduced_order_modeling/projection_reduction_errors/#Projection-and-Reduction-Errors-of-Reduced-Models","page":"Projection and Reduction Error","title":"Projection and Reduction Errors of Reduced Models","text":"","category":"section"},{"location":"reduced_order_modeling/projection_reduction_errors/","page":"Projection and Reduction Error","title":"Projection and Reduction Error","text":"Two errors that are of great importance in reduced order modeling are the projection and the reduction error. 
During training one typically aims to minimize the projection error, but for the actual application of the model the reduction error is often more important. ","category":"page"},{"location":"reduced_order_modeling/projection_reduction_errors/#Projection-Error","page":"Projection and Reduction Error","title":"Projection Error","text":"","category":"section"},{"location":"reduced_order_modeling/projection_reduction_errors/","page":"Projection and Reduction Error","title":"Projection and Reduction Error","text":"The projection error measures how well a reduced basis, represented by the reduction mathcalP and the reconstruction mathcalR, can represent the data with which it is built. In mathematical terms: ","category":"page"},{"location":"reduced_order_modeling/projection_reduction_errors/","page":"Projection and Reduction Error","title":"Projection and Reduction Error","text":"e_mathrmproj(mu) = \n frac mathcalRcircmathcalP(M) - M M ","category":"page"},{"location":"reduced_order_modeling/projection_reduction_errors/","page":"Projection and Reduction Error","title":"Projection and Reduction Error","text":"where cdot is the Frobenius norm (one could also optimize for different norms).","category":"page"},{"location":"reduced_order_modeling/projection_reduction_errors/#Reduction-Error","page":"Projection and Reduction Error","title":"Reduction Error","text":"","category":"section"},{"location":"reduced_order_modeling/projection_reduction_errors/","page":"Projection and Reduction Error","title":"Projection and Reduction Error","text":"The reduction error measures how far the reduced system diverges from the full-order system during integration (online stage). 
In mathematical terms (and for a single initial condition): ","category":"page"},{"location":"reduced_order_modeling/projection_reduction_errors/","page":"Projection and Reduction Error","title":"Projection and Reduction Error","text":"e_mathrmred(mu) = sqrt\n fracsum_t=0^K mathbfx^(t)(mu) - mathcalR(mathbfx^(t)_r(mu)) ^2sum_t=0^K mathbfx^(t)(mu) ^2\n","category":"page"},{"location":"reduced_order_modeling/projection_reduction_errors/","page":"Projection and Reduction Error","title":"Projection and Reduction Error","text":"where mathbfx^(t) is the solution of the FOM at point t and mathbfx^(t)_r is the solution of the ROM (in the reduced basis) at point t. The reduction error, as opposed to the projection error, not only measures how well the solution manifold is represented by the reduced basis, but also measures how well the FOM dynamics are approximated by the ROM dynamics (via the induced vector field on the reduced basis).","category":"page"},{"location":"library/","page":"Library","title":"Library","text":"CurrentModule = GeometricMachineLearning","category":"page"},{"location":"library/#GeometricMachineLearning-Library-Functions","page":"Library","title":"GeometricMachineLearning Library Functions","text":"","category":"section"},{"location":"library/","page":"Library","title":"Library","text":"Modules = [GeometricMachineLearning]","category":"page"},{"location":"library/#AbstractNeuralNetworks.Chain-Union{Tuple{GSympNet{AT, true}}, Tuple{AT}} where AT","page":"Library","title":"AbstractNeuralNetworks.Chain","text":"Chain can also be called with a neural network as input.\n\n\n\n\n\n","category":"method"},{"location":"library/#AbstractNeuralNetworks.Chain-Union{Tuple{LASympNet{AT, false, false}}, Tuple{AT}} where AT","page":"Library","title":"AbstractNeuralNetworks.Chain","text":"Build a chain for an LASympnet for which init_upper_linear is false and init_upper_act is 
false.\n\n\n\n\n\n","category":"method"},{"location":"library/#AbstractNeuralNetworks.Chain-Union{Tuple{LASympNet{AT, false, true}}, Tuple{AT}} where AT","page":"Library","title":"AbstractNeuralNetworks.Chain","text":"Build a chain for an LASympNet for which init_upper_linear is false and init_upper_act is true.\n\n\n\n\n\n","category":"method"},{"location":"library/#AbstractNeuralNetworks.Chain-Union{Tuple{LASympNet{AT, true, false}}, Tuple{AT}} where AT","page":"Library","title":"AbstractNeuralNetworks.Chain","text":"Build a chain for an LASympNet for which init_upper_linear is true and init_upper_act is false.\n\n\n\n\n\n","category":"method"},{"location":"library/#AbstractNeuralNetworks.Chain-Union{Tuple{LASympNet{AT, true, true}}, Tuple{AT}} where AT","page":"Library","title":"AbstractNeuralNetworks.Chain","text":"Build a chain for an LASympNet for which init_upper_linear is true and init_upper_act is true.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.AbstractCache","page":"Library","title":"GeometricMachineLearning.AbstractCache","text":"AbstractCache has subtypes: \n\nAdamCache\nMomentumCache\nGradientCache\nBFGSCache\n\nAll of them can be initialized by providing an array (also supporting manifold types).\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.AbstractLieAlgHorMatrix","page":"Library","title":"GeometricMachineLearning.AbstractLieAlgHorMatrix","text":"AbstractLieAlgHorMatrix is a supertype for various horizontal components of Lie algebras. We usually call this mathfrakg^mathrmhor.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.AbstractRetraction","page":"Library","title":"GeometricMachineLearning.AbstractRetraction","text":"AbstractRetraction is a type that comprises all retraction methods for manifolds. 
For every manifold layer one has to specify a retraction method that takes the layer and elements of the (global) tangent space.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.ActivationLayer","page":"Library","title":"GeometricMachineLearning.ActivationLayer","text":"ActivationLayer is the struct corresponding to the constructors ActivationLayerQ and ActivationLayerP. See those for more information.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.ActivationLayerP-Tuple{Any, Any}","page":"Library","title":"GeometricMachineLearning.ActivationLayerP","text":"Performs:\n\nbeginpmatrix\n q p\nendpmatrix mapsto \nbeginpmatrix\n q p + mathrmdiag(a)sigma(q)\nendpmatrix\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.ActivationLayerQ-Tuple{Any, Any}","page":"Library","title":"GeometricMachineLearning.ActivationLayerQ","text":"Performs:\n\nbeginpmatrix\n q p\nendpmatrix mapsto \nbeginpmatrix\n q + mathrmdiag(a)sigma(p) p\nendpmatrix\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.AdamOptimizer","page":"Library","title":"GeometricMachineLearning.AdamOptimizer","text":"Defines the Adam Optimizer. Algorithm and suggested defaults are taken from (Goodfellow et al., 2016, page 301), except for δ, because single precision is used!\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.AdamOptimizerWithDecay","page":"Library","title":"GeometricMachineLearning.AdamOptimizerWithDecay","text":"Defines the Adam Optimizer with weight decay.\n\nConstructors\n\nThe default constructor takes as input: \n\nn_epochs::Int\nη₁: the learning rate at the start \nη₂: the learning rate at the end \nρ₁: the decay parameter for the first moment \nρ₂: the decay parameter for the second moment\nδ: the safety parameter \nT (keyword argument): the type. \n\nThe second constructor is called with: \n\nn_epochs::Int\nT\n\n... 
the rest are keyword arguments\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.BFGSCache","page":"Library","title":"GeometricMachineLearning.BFGSCache","text":"The cache for the BFGS optimizer.\n\nIt stores an array for the previous time step B and the inverse of the Hessian matrix H.\n\nIt is important to note that setting up this cache already requires a derivative! This is not the case for the other optimizers.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.BFGSDummyCache","page":"Library","title":"GeometricMachineLearning.BFGSDummyCache","text":"In order to initialize BFGSCache we first need gradient information. This is why we initially have this BFGSDummyCache until gradient information is available.\n\nNOTE: we may not need this. \n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.BFGSOptimizer","page":"Library","title":"GeometricMachineLearning.BFGSOptimizer","text":"This is an implementation of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) optimizer. \n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.Batch","page":"Library","title":"GeometricMachineLearning.Batch","text":"Batch is a struct whose functor acts on an instance of DataLoader to produce a sequence of training samples for training for one epoch. \n\nThe Constructor\n\nThe constructor for Batch is called with: \n\nbatch_size::Int\nseq_length::Int (optional)\nprediction_window::Int (optional)\n\nThe first one of these arguments is required; it indicates the number of training samples in a batch. If we deal with time series data then we can additionally supply a sequence length and a prediction window as input arguments to Batch. These indicate the number of input vectors and the number of output vectors.\n\nThe functor\n\nAn instance of Batch can be called on an instance of DataLoader to produce a sequence of samples that contain all the input data, i.e. for training for one epoch. 
The output of applying batch::Batch to dl::DataLoader is a tuple of vectors of integers. Each of these vectors contains two integers: the first is the time index and the second one is the parameter index.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.BiasLayer","page":"Library","title":"GeometricMachineLearning.BiasLayer","text":"A bias layer that does nothing more than add a vector to the input. This is needed for LA-SympNets.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.Classification","page":"Library","title":"GeometricMachineLearning.Classification","text":"Classification Layer that takes a matrix as an input and returns a vector that is used for MNIST classification. \n\nIt has the following arguments: \n\nM: input dimension \nN: output dimension \nactivation: the activation function \n\nAnd the following optional argument: \n\naverage: If this is set to true, then the output is computed as frac1Nsum_i=1^Ninput_bulleti. If set to false (the default) it picks the last column of the input. \n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.ClassificationTransformer","page":"Library","title":"GeometricMachineLearning.ClassificationTransformer","text":"This is a transformer neural network for classification purposes. At the moment this is only used for training on MNIST, but can in theory be used for any classification problem.\n\nIt has to be called with a DataLoader that stores an input and an output tensor. The optional arguments are: \n\nn_heads: The number of heads in the MultiHeadAttention (mha) layers. Default: 7.\nn_layers: The number of transformer layers. Default: 16.\nactivation: The activation function. Default: softmax.\nStiefel: Whether the matrices in the mha layers are on the Stiefel manifold. \nadd_connection: Whether the input is appended to the output of the mha layer. 
(skip connection)\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.DataLoader","page":"Library","title":"GeometricMachineLearning.DataLoader","text":"Data Loader is a struct that creates an instance based on a tensor (or different input format) and is designed to make training convenient. \n\nConstructor\n\nThe data loader can be called with various inputs:\n\nA single vector: If the data loader is called with a single vector (and no other arguments are given), then this is interpreted as an autoencoder problem, i.e. the second axis indicates parameter values and/or time steps and the system has a single degree of freedom (i.e. the system dimension is one).\nA single matrix: If the data loader is called with a single matrix (and no other arguments are given), then this is interpreted as an autoencoder problem, i.e. the first axis is assumed to indicate the degrees of freedom of the system and the second axis indicates parameter values and/or time steps. \nA single tensor: If the data loader is called with a single tensor, then this is interpreted as an integration problem with the second axis indicating the time step and the third one indicating the parameters.\nA tensor and a vector: This is a special case (MNIST classification problem). For the MNIST problem for example the input are n_p matrices (first input argument) and n_p integers (second input argument).\nA NamedTuple with fields q and p: The NamedTuple contains (i) two matrices or (ii) two tensors. 
\nAn EnsembleSolution: The EnsembleSolution typically comes from GeometricProblems.\n\nWhen we supply a single vector or a single matrix as input to DataLoader and further set autoencoder = false (keyword argument), then the data are stored as an integration problem and the second axis is assumed to indicate time steps.\n\nFields of DataLoader\n\nThe fields of the DataLoader struct are the following: - input: The input data with axes (i) system dimension, (ii) number of time steps and (iii) number of parameters. - output: The tensor that contains the output (supervised learning) - this may be of type Nothing if the constructor is only called with one tensor (unsupervised learning). - input_dim: The dimension of the system, i.e. what is taken as input by a regular neural network. - input_time_steps: The length of the entire time series (length of the second axis). - n_params: The number of parameters that are present in the data set (length of third axis) - output_dim: The dimension of the output tensor (first axis). If output is of type Nothing, then this is also of type Nothing. - output_time_steps: The size of the second axis of the output tensor. If output is of type Nothing, then this is also of type Nothing.\n\nThe input and output fields of DataLoader\n\nEven though the arguments to the Constructor may be vectors or matrices, internally DataLoader always stores tensors.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.DataLoader-Union{Tuple{@NamedTuple{q::AT, p::AT}}, Tuple{AT}, Tuple{T}} where {T, AT<:AbstractMatrix{T}}","page":"Library","title":"GeometricMachineLearning.DataLoader","text":"Data Loader is a struct that creates an instance based on a tensor (or different input format) and is designed to make training convenient. 
\n\nConstructor\n\nThe data loader can be called with various inputs:\n\nA single vector: If the data loader is called with a single vector (and no other arguments are given), then this is interpreted as an autoencoder problem, i.e. the second axis indicates parameter values and/or time steps and the system has a single degree of freedom (i.e. the system dimension is one).\nA single matrix: If the data loader is called with a single matrix (and no other arguments are given), then this is interpreted as an autoencoder problem, i.e. the first axis is assumed to indicate the degrees of freedom of the system and the second axis indicates parameter values and/or time steps. \nA single tensor: If the data loader is called with a single tensor, then this is interpreted as an integration problem with the second axis indicating the time step and the third one indicating the parameters.\nA tensor and a vector: This is a special case (MNIST classification problem). For the MNIST problem for example the input are n_p matrices (first input argument) and n_p integers (second input argument).\nA NamedTuple with fields q and p: The NamedTuple contains (i) two matrices or (ii) two tensors. 
\nAn EnsembleSolution: The EnsembleSolution typically comes from GeometricProblems.\n\nWhen we supply a single vector or a single matrix as input to DataLoader and further set autoencoder = false (keyword argument), then the data are stored as an integration problem and the second axis is assumed to indicate time steps.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.DataLoader-Union{Tuple{GeometricSolutions.EnsembleSolution{T, T1, Vector{ST}}}, Tuple{ST}, Tuple{DT}, Tuple{T1}, Tuple{T}} where {T, T1, DT, ST<:(GeometricSolutions.GeometricSolution{T, T1, @NamedTuple{q::DT, p::DT}})}","page":"Library","title":"GeometricMachineLearning.DataLoader","text":"Constructor for EnsembleSolution from package GeometricSolutions with fields q and p.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.DataLoader-Union{Tuple{GeometricSolutions.EnsembleSolution{T, T1, Vector{ST}}}, Tuple{ST}, Tuple{DT}, Tuple{T1}, Tuple{T}} where {T, T1, DT, ST<:(GeometricSolutions.GeometricSolution{T, T1, @NamedTuple{q::DT}})}","page":"Library","title":"GeometricMachineLearning.DataLoader","text":"Constructor for EnsembleSolution from package GeometricSolutions with field q.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.GSympNet","page":"Library","title":"GeometricMachineLearning.GSympNet","text":"GSympNet is called with a single input argument, the system dimension, or with an instance of DataLoader. Optional input arguments are: \n\nupscaling_dimension::Int: The upscaling dimension of the gradient layer. See the documentation for GradientLayerQ and GradientLayerP for further explanation. The default is 2*dim.\nnhidden::Int: The number of hidden layers (i.e. layers that are not input or output layers). The default is 2.\nactivation: The activation function that is applied. By default this is tanh.\ninit_upper::Bool: Initialize the gradient layer so that it first modifies the q-component. 
The default is true.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.GlobalSection","page":"Library","title":"GeometricMachineLearning.GlobalSection","text":"This implements global sections for the Stiefel manifold and the Symplectic Stiefel manifold. \n\nIn practice this is implemented using Householder reflections, with the auxiliary column vectors given by the unit vectors e_i (a one in the i-th spot and zeros elsewhere) for i in (n+1) to N (or by random columns).\n\nMaybe consider dividing the output in the check functions by n!\n\nImplement a general global section here!!!! Tₓ𝔐 → G×𝔤 !!!!!! (think about random initialization!)\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.GradientLayer","page":"Library","title":"GeometricMachineLearning.GradientLayer","text":"GradientLayer is the struct corresponding to the constructors GradientLayerQ and GradientLayerP. See those for more information.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.GradientLayerP-Tuple{Any, Any, Any}","page":"Library","title":"GeometricMachineLearning.GradientLayerP","text":"The gradient layer that changes the p component. It is of the form: \n\nbeginbmatrix\n mathbbI mathbbO nablaV mathbbI \nendbmatrix\n\nwith V(q) = sum_i=1^Ma_iSigma(sum_jk_ijq_j+b_i), where Sigma is the antiderivative of the activation function sigma (one-layer neural network). We refer to M as the upscaling dimension. Such layers are by construction symplectic.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.GradientLayerQ-Tuple{Any, Any, Any}","page":"Library","title":"GeometricMachineLearning.GradientLayerQ","text":"The gradient layer that changes the q component. It is of the form: \n\nbeginbmatrix\n mathbbI nablaV mathbbO mathbbI \nendbmatrix\n\nwith V(p) = sum_i=1^Ma_iSigma(sum_jk_ijp_j+b_i), where Sigma is the antiderivative of the activation function sigma (one-layer neural network). 
We refer to M as the upscaling dimension. Such layers are by construction symplectic.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.GradientOptimizer","page":"Library","title":"GeometricMachineLearning.GradientOptimizer","text":"Define the Gradient optimizer, i.e. W ← W - η*∇f(W), or the Riemannian manifold equivalent, if applicable.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.GrassmannLayer","page":"Library","title":"GeometricMachineLearning.GrassmannLayer","text":"Defines a layer that performs simple multiplication with an element of the Grassmann manifold.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.GrassmannLieAlgHorMatrix","page":"Library","title":"GeometricMachineLearning.GrassmannLieAlgHorMatrix","text":"This implements the horizontal component of a Lie algebra that is isomorphic to the Grassmann manifold. \n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.GrassmannManifold","page":"Library","title":"GeometricMachineLearning.GrassmannManifold","text":"The GrassmannManifold is based on the StiefelManifold\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.LASympNet","page":"Library","title":"GeometricMachineLearning.LASympNet","text":"LASympNet is called with a single input argument, the system dimension, or with an instance of DataLoader. Optional input arguments are: \n\ndepth::Int: The number of linear layers that are applied. The default is 5.\nnhidden::Int: The number of hidden layers (i.e. layers that are not input or output layers). The default is 2.\nactivation: The activation function that is applied. By default this is tanh.\ninit_upper_linear::Bool: Initialize the linear layer so that it first modifies the q-component. The default is true.\ninit_upper_act::Bool: Initialize the activation layer so that it first modifies the q-component. 
The default is true.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.LayerWithManifold","page":"Library","title":"GeometricMachineLearning.LayerWithManifold","text":"Additional types to make handling manifolds more readable.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.LinearLayer","page":"Library","title":"GeometricMachineLearning.LinearLayer","text":"LinearLayer is the struct corresponding to the constructors LinearLayerQ and LinearLayerP. See those for more information.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.LinearLayerP-Tuple{Any}","page":"Library","title":"GeometricMachineLearning.LinearLayerP","text":"Equivalent to a left multiplication by the matrix:\n\nbeginpmatrix\nmathbbI mathbbO \nB mathbbI\nendpmatrix \n\nwhere B is a symmetric matrix.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.LinearLayerQ-Tuple{Any}","page":"Library","title":"GeometricMachineLearning.LinearLayerQ","text":"Equivalent to a left multiplication by the matrix:\n\nbeginpmatrix\nmathbbI B \nmathbbO mathbbI\nendpmatrix \n\nwhere B is a symmetric matrix.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.LowerTriangular","page":"Library","title":"GeometricMachineLearning.LowerTriangular","text":"A lower-triangular matrix is an ntimesn matrix that has ones on the diagonal and zeros on the upper triangular.\n\nThe data are stored in a vector S similarly to SkewSymMatrix.\n\nThe struct has two fields: S and n. The first stores all the entries of the matrix in a sparse fashion (in a vector) and the second is the dimension n for AinmathbbR^ntimesn.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.Manifold","page":"Library","title":"GeometricMachineLearning.Manifold","text":"rand is implemented for manifolds that use the initialization of the StiefelManifold and the GrassmannManifold by default. 
\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.ManifoldLayer","page":"Library","title":"GeometricMachineLearning.ManifoldLayer","text":"This defines a manifold layer that only has one matrix-valued manifold A associated with it and performs xmapstoAx. \n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.MomentumOptimizer","page":"Library","title":"GeometricMachineLearning.MomentumOptimizer","text":"Define the Momentum optimizer, i.e. V ← αV - ∇f(W), W ← W + ηV, or the Riemannian manifold equivalent, if applicable.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.MultiHeadAttention","page":"Library","title":"GeometricMachineLearning.MultiHeadAttention","text":"MultiHeadAttention (MHA) serves as a preprocessing step in the transformer. It reweights the input vectors based on correlations within those data. \n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.Optimizer","page":"Library","title":"GeometricMachineLearning.Optimizer","text":"Optimizer struct that stores the 'method' (i.e. Adam with corresponding hyperparameters), the cache and the optimization step.\n\nIt takes as input an optimization method and the parameters of a network. \n\nFor technical reasons we first specify an OptimizerMethod that stores all the hyperparameters of the optimizer. \n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.Optimizer-Tuple{NeuralNetwork, DataLoader, Batch, Int64, GeometricMachineLearning.NetworkLoss}","page":"Library","title":"GeometricMachineLearning.Optimizer","text":"A functor for Optimizer. It is called with: - nn::NeuralNetwork - dl::DataLoader - batch::Batch - n_epochs::Int - loss\n\nThe last argument is a function through which Zygote differentiates. 
This argument is optional; if it is not supplied GeometricMachineLearning defaults to an appropriate loss for the DataLoader.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.Optimizer-Tuple{OptimizerMethod, NeuralNetwork}","page":"Library","title":"GeometricMachineLearning.Optimizer","text":"Typically the Optimizer is not initialized with the network parameters, but instead with a NeuralNetwork struct.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.PSDLayer","page":"Library","title":"GeometricMachineLearning.PSDLayer","text":"This is a PSD-like layer used for symplectic autoencoders. One layer has the following shape:\n\nA = beginbmatrix Phi mathbbO mathbbO Phi endbmatrix\n\nwhere Phi is an element of the Stiefel manifold St(n N).\n\nThe constructor of PSDLayer is called by PSDLayer(M, N; retraction=retraction): \n\nM is the input dimension.\nN is the output dimension. \nretraction is an instance of a struct with supertype AbstractRetraction. The only options at the moment are Geodesic() and Cayley().\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.ReducedSystem","page":"Library","title":"GeometricMachineLearning.ReducedSystem","text":"ReducedSystem computes the reconstructed dynamics in the full system based on the reduced one. 
Optionally it can be compared to the FOM solution.\n\nIt can be called using the following constructor: ReducedSystem(N, n, encoder, decoder, fullvectorfield, reducedvectorfield, params, tspan, tstep, ics, projection_error) where \n\nencoder: a function mathbbR^2NmapstomathbbR^2n\ndecoder: a (differentiable) function mathbbR^2nmapstomathbbR^2N\nfullvectorfield: a (differentiable) mapping defined the same way as in GeometricIntegrators \nreducedvectorfield: a (differentiable) mapping defined the same way as in GeometricIntegrators \nparams: a NamedTuple that parametrizes the vector fields (the same for fullvectorfield and reducedvectorfield)\ntspan: a tuple (t₀, tₗ) that specifies start and end point of the time interval over which integration is performed. \ntstep: the time step \nics: the initial condition for the big system.\nprojection_error: the error M - mathcalRcircmathcalP(M) where M is the snapshot matrix; mathcalP and mathcalR are the reduction and reconstruction respectively.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.RegularTransformerIntegrator","page":"Library","title":"GeometricMachineLearning.RegularTransformerIntegrator","text":"The regular transformer used as an integrator (multi-step method). \n\nThe constructor is called with the following arguments: \n\nsys_dim::Int\ntransformer_dim::Int: the default is transformer_dim = sys_dim.\nn_blocks::Int: The default is 1.\nn_heads::Int: the number of heads in the multihead attention layer (default is n_heads = sys_dim)\nL::Int the number of transformer blocks (default is L = 2).\nupscaling_activation: by default identity\nresnet_activation: by default tanh\nadd_connection::Bool=true (keyword argument): if the input should be added to the output.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.SkewSymMatrix","page":"Library","title":"GeometricMachineLearning.SkewSymMatrix","text":"A SkewSymMatrix is a matrix A s.t. 
A^T = -A.\n\nIf the constructor is called with a matrix as input it returns a skew-symmetric matrix via the projection A mapsto frac12(A - A^T). This is a projection defined via the canonical metric mathbbR^ntimesntimesmathbbR^ntimesntomathbbR (AB) mapsto mathrmTr(A^TB).\n\nThe first index is the row index, the second one the column index.\n\nThe struct has two fields: S and n. The first stores all the entries of the matrix in a sparse fashion (in a vector) and the second is the dimension n for AinmathbbR^ntimesn.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.StiefelLayer","page":"Library","title":"GeometricMachineLearning.StiefelLayer","text":"Defines a layer that performs simple multiplication with an element of the Stiefel manifold.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.StiefelLieAlgHorMatrix","page":"Library","title":"GeometricMachineLearning.StiefelLieAlgHorMatrix","text":"StiefelLieAlgHorMatrix is the horizontal component of the Lie algebra of skew-symmetric matrices (with respect to the canonical metric). The projection here is: (\pi:S \to SE ) where \n\nE = beginpmatrix mathbbI_n mathbbO_(N-n)timesn endpmatrix\n\nThe matrix (E) is implemented under StiefelProjection in GeometricMachineLearning.\n\nAn element of StiefelLieAlgMatrix takes the form: \n\nbeginpmatrix\nA B^T B mathbbO\nendpmatrix\n\nwhere (A) is skew-symmetric (this is SkewSymMatrix in GeometricMachineLearning).\n\nIf the constructor is called with a big (N\times{}N) matrix, then the projection is performed the following way: \n\nbeginpmatrix\nA B_1 \nB_2 D\nendpmatrix mapsto \nbeginpmatrix\nmathrmskew(A) -B_2^T \nB_2 mathbbO\nendpmatrix\n\nThe operation mathrmskewmathbbR^ntimesntomathcalS_mathrmskew(n) is the skew-symmetrization operation. 
This is equivalent to calling the constructor of SkewSymMatrix with an (n\\times{}n) matrix.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.StiefelManifold","page":"Library","title":"GeometricMachineLearning.StiefelManifold","text":"An implementation of the Stiefel manifold. It has various convenience functions associated with it:\n\ncheck \nrand \nrgrad\nmetric\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.StiefelProjection","page":"Library","title":"GeometricMachineLearning.StiefelProjection","text":"Outer constructor for StiefelProjection. This works with two integers as input and optionally the type.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.StiefelProjection-2","page":"Library","title":"GeometricMachineLearning.StiefelProjection","text":"An array that essentially does vcat(I(n), zeros(N-n, n)) with GPU support. It has three inner constructors. The first one is called with the following arguments: \n\nbackend: backends as supported by KernelAbstractions.\nT::Type\nN::Integer\nn::Integer\n\nThe second constructor is called by supplying a matrix as input. The constructor will then extract the backend, the type and the dimensions of that matrix. \n\nThe third constructor is called by supplying an instance of StiefelLieAlgHorMatrix. \n\nTechnically this should be a subtype of StiefelManifold. \n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.SymmetricMatrix","page":"Library","title":"GeometricMachineLearning.SymmetricMatrix","text":"A SymmetricMatrix A is a matrix A^T = A.\n\nIf the constructor is called with a matrix as input it returns a symmetric matrix via the projection:\n\nA mapsto frac12(A + A^T)\n\nThis is a projection defined via the canonical metric (AB) mapsto mathrmtr(A^TB).\n\nInternally the struct saves a vector S of size n(n+1)div2. 
The conversion is done the following way: \n\nA_ij = begincases S( (i-1) i ) div 2 + j textif igeqj \n S( (j-1) j ) div 2 + i textelse endcases\n\nSo S stores a string of vectors taken from A: S = tildea_1 tildea_2 ldots tildea_n with tildea_i = A_i1A_i2ldotsA_ii.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.SympNet","page":"Library","title":"GeometricMachineLearning.SympNet","text":"SympNet type encompasses GSympNets and LASympnets.\n\nTODO: -[ ] add bias to LASympNet!\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.SympNetLayer","page":"Library","title":"GeometricMachineLearning.SympNetLayer","text":"Implements the various layers from the SympNet paper: (https://www.sciencedirect.com/science/article/abs/pii/S0893608020303063). This is a super type of Gradient, Activation and Linear.\n\nFor the linear layer, the activation and the bias are left out, and for the activation layer K and b are left out!\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.SympNetLayer-Tuple{AbstractArray, Any}","page":"Library","title":"GeometricMachineLearning.SympNetLayer","text":"This is called when a SympnetLayer is applied to a NamedTuple. It calls apply_layer_to_nt_and_return_array.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.SymplecticPotential","page":"Library","title":"GeometricMachineLearning.SymplecticPotential","text":"SymplecticPotential(n)\n\nReturns a symplectic matrix of size 2n x 2n\n\nbeginpmatrix\nmathbbO mathbbI \nmathbbO -mathbbI \nendpmatrix\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.SystemType","page":"Library","title":"GeometricMachineLearning.SystemType","text":"Can specify a special type of the system, to be used with ReducedSystem. 
For now the only options are Symplectic and NoStructure.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.TrainingData","page":"Library","title":"GeometricMachineLearning.TrainingData","text":"TrainingData stores: \n\n - problem \n\n - shape \n\n - get \n\n - symbols \n\n - dim \n\n - noisemaker\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.TransformerIntegrator","page":"Library","title":"GeometricMachineLearning.TransformerIntegrator","text":"Encompasses various transformer architectures, such as the structure-preserving transformer and the linear symplectic transformer. \n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.TransformerLoss","page":"Library","title":"GeometricMachineLearning.TransformerLoss","text":"The loss for a transformer network (especially a transformer integrator). The constructor is called with:\n\nseq_length::Int\nprediction_window::Int (default is 1).\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.UpperTriangular","page":"Library","title":"GeometricMachineLearning.UpperTriangular","text":"An upper-triangular matrix is an $n\\times{}n$ matrix that has ones on the diagonal and zeros on the lower triangular.\n\nThe data are stored in a vector S similarly to SkewSymMatrix.\n\nThe struct has two fields: S and n. 
The first stores all the entries of the matrix in a sparse fashion (in a vector) and the second is the dimension n for $A\\in\\mathbb{R}^{n\\times{}n}$.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.VolumePreservingAttention","page":"Library","title":"GeometricMachineLearning.VolumePreservingAttention","text":"Volume-preserving attention (single head attention)\n\nDrawbacks: \n\nthe super fast activation is only implemented for sequence lengths of 2, 3, 4 and 5.\nother sequence lengths only work on CPU for now (LU decomposition has to be implemented to work for tensors in parallel).\n\nConstructor\n\nThe constructor is called with: \n\ndim::Int: The system dimension \nseq_length::Int: The sequence length to be considered. The default is zero, i.e. arbitrary sequence lengths; this works for all sequence lengths but doesn't apply the super-fast activation. \nskew_sym::Bool (keyword argument): specifies if the weight matrix is skew symmetric or arbitrary (default is false).\n\nFunctor\n\nApplying a layer of type VolumePreservingAttention does the following: \n\nFirst we perform the operation $X \\mapsto X^T A X = C$, where $X\\in\\mathbb{R}^{N\\times\\mathtt{seq\\_length}}$ is a matrix containing time series data and $A$ is the skew symmetric matrix associated with the layer. \nIn a second step we compute the Cayley transform of $C$; $\\Lambda = \\mathrm{Cayley}(C)$.\nThe output of the layer is then $X\\Lambda$.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.VolumePreservingFeedForward","page":"Library","title":"GeometricMachineLearning.VolumePreservingFeedForward","text":"Realizes a volume-preserving neural network as a combination of VolumePreservingLowerLayer and VolumePreservingUpperLayer. \n\nConstructor\n\nThe constructor is called with the following arguments: \n\nsys_dim::Int: The system dimension. \nn_blocks::Int: The number of blocks in the neural network (containing linear layers and nonlinear layers). 
Default is 1.\nn_linear::Int: The number of linear VolumePreservingLowerLayers and VolumePreservingUpperLayers in one block. Default is 1.\nactivation: The activation function for the nonlinear layers in a block. \ninit_upper::Bool=false (keyword argument): Specifies if the first layer is lower or upper. \n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.VolumePreservingFeedForwardLayer","page":"Library","title":"GeometricMachineLearning.VolumePreservingFeedForwardLayer","text":"Super-type of VolumePreservingLowerLayer and VolumePreservingUpperLayer. The layers do the following: \n\n$$x \\mapsto \\begin{cases} \\sigma(Lx + b) & \\text{where }L\\text{ is }\\mathtt{LowerTriangular}, \\\\\\\\ \\sigma(Ux + b) & \\text{where }U\\text{ is }\\mathtt{UpperTriangular}. \\end{cases}$$\n\nThe functor can be applied to a vector, a matrix or a tensor. \n\nConstructor\n\nThe constructors are called with:\n\nsys_dim::Int: the system dimension. \nactivation=tanh: the activation function. \ninclude_bias::Bool=true (keyword argument): specifies whether a bias should be used. \n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.VolumePreservingLowerLayer","page":"Library","title":"GeometricMachineLearning.VolumePreservingLowerLayer","text":"See the documentation for VolumePreservingFeedForwardLayer.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.VolumePreservingUpperLayer","page":"Library","title":"GeometricMachineLearning.VolumePreservingUpperLayer","text":"See the documentation for VolumePreservingFeedForwardLayer.\n\n\n\n\n\n","category":"type"},{"location":"library/#AbstractNeuralNetworks.update!-Union{Tuple{CT}, Tuple{T}, Tuple{Optimizer{<:BFGSOptimizer}, CT, AbstractArray{T}}} where {T, CT<:(BFGSCache{T, AT} where AT<:(AbstractArray{T}))}","page":"Library","title":"AbstractNeuralNetworks.update!","text":"Optimization for an entire neural network with BFGS. 
What is different in this case is that we still have to initialize the cache.\n\nIf o.step == 1, then we initialize the cache\n\n\n\n\n\n","category":"method"},{"location":"library/#Base.iterate-Union{Tuple{AT}, Tuple{T}, Tuple{NeuralNetwork{<:GeometricMachineLearning.TransformerIntegrator}, @NamedTuple{q::AT, p::AT}}} where {T, AT<:AbstractMatrix{T}}","page":"Library","title":"Base.iterate","text":"This function computes a trajectory for a Transformer that has already been trained for valuation purposes.\n\nIt takes as input: \n\nnn: a NeuralNetwork (that has been trained).\nics: initial conditions (a matrix in mathbbR^2ntimesmathttseq_length or NamedTuple of two matrices in mathbbR^ntimesmathttseq_length)\nn_points::Int=100 (keyword argument): The number of steps for which we run the prediction. \nprediction_window::Int=size(ics.q, 2): The prediction window (i.e. the number of steps we predict into the future) is equal to the sequence length (i.e. the number of input time steps) by default. \n\n\n\n\n\n","category":"method"},{"location":"library/#Base.iterate-Union{Tuple{BT}, Tuple{AT}, Tuple{T}, Tuple{NeuralNetwork{<:GeometricMachineLearning.NeuralNetworkIntegrator}, BT}} where {T, AT<:AbstractVector{T}, BT<:@NamedTuple{q::AT, p::AT}}","page":"Library","title":"Base.iterate","text":"This function computes a trajectory for a SympNet that has already been trained for valuation purposes.\n\nIt takes as input: \n\nnn: a NeuralNetwork (that has been trained).\nics: initial conditions (a NamedTuple of two vectors)\n\n\n\n\n\n","category":"method"},{"location":"library/#Base.vec-Tuple{GeometricMachineLearning.AbstractTriangular}","page":"Library","title":"Base.vec","text":"If vec is applied onto Triangular, then the output is the associated vector. \n\n\n\n\n\n","category":"method"},{"location":"library/#Base.vec-Tuple{SkewSymMatrix}","page":"Library","title":"Base.vec","text":"If vec is applied onto SkewSymMatrix, then the output is the associated vector. 
\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.Gradient","page":"Library","title":"GeometricMachineLearning.Gradient","text":"This is an old constructor and will be deprecated. For change_q=true it is equivalent to GradientLayerQ; for change_q=false it is equivalent to GradientLayerP.\n\nIf full_grad=false then ActivationLayer is called.\n\n\n\n\n\n","category":"function"},{"location":"library/#GeometricMachineLearning.Transformer-Tuple{Integer, Integer, Integer}","page":"Library","title":"GeometricMachineLearning.Transformer","text":"The architecture for a \"transformer encoder\" is essentially taken from arXiv:2010.11929, but with the difference that no layer normalization is employed. This is because we still need to find a generalization of layer normalization to manifolds. \n\nThe transformer is called with the following inputs: \n\ndim: the dimension of the transformer \nn_heads: the number of heads \nL: the number of transformer blocks\n\nIn addition we have the following optional arguments: \n\nactivation: the activation function used for the ResNet (tanh by default)\nStiefel::Bool: if the matrices P^V, P^Q and P^K should live on a manifold (false by default)\nretraction: which retraction should be used (Geodesic() by default)\nadd_connection::Bool: if the input should be added to the output after the MultiHeadAttention layer is used (true by default)\nuse_bias::Bool: if the ResNet should use a bias (true by default)\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.accuracy-Union{Tuple{BT}, Tuple{AT}, Tuple{T1}, Tuple{T}, Tuple{Chain, Tuple, DataLoader{T, AT, BT}}} where {T, T1<:Integer, AT<:(AbstractArray{T}), BT<:(AbstractArray{T1})}","page":"Library","title":"GeometricMachineLearning.accuracy","text":"Computes the accuracy (as opposed to the loss) of a neural network classifier. 
\n\nIt takes as input:\n\nmodel::Chain\nps: parameters of the network\ndl::DataLoader\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.apply_layer_to_nt_and_return_array-Union{Tuple{M}, Tuple{AbstractArray, GeometricMachineLearning.SympNetLayer{M, M}, Any}} where M","page":"Library","title":"GeometricMachineLearning.apply_layer_to_nt_and_return_array","text":"This function is used in the wrappers where the input to the SympNet layers is not a NamedTuple (as it should be) but an AbstractArray.\n\nIt converts the Array to a NamedTuple (via assign_q_and_p), then calls the SympNet routine(s) and converts back to an AbstractArray (with vcat).\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.assign_batch_kernel!-Tuple{Any}","page":"Library","title":"GeometricMachineLearning.assign_batch_kernel!","text":"Takes as input a batch tensor (to which the data are assigned), the whole data tensor and two vectors params and time_steps that include the specific parameters and time steps we want to assign. \n\nNote that this assigns sequential data! For e.g. being processed by a transformer.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.assign_output_estimate-Union{Tuple{T}, Tuple{AbstractArray{T, 3}, Int64}} where T","page":"Library","title":"GeometricMachineLearning.assign_output_estimate","text":"The function assign_output_estimate is closely related to the transformer. It takes the last prediction_window columns of the output and uses them for the final prediction. 
i.e.\n\n$$\\mathbb{R}^{N\\times{}T}\\to\\mathbb{R}^{N\\times\\mathtt{pw}}, \\begin{bmatrix} z^{(1)}_1 & \\cdots & z^{(T)}_1 \\\\\\\\ \\cdots & \\cdots & \\cdots \\\\\\\\ z^{(1)}_n & \\cdots & z^{(T)}_n \\end{bmatrix} \\mapsto \\begin{bmatrix} z^{(T - \\mathtt{pw})}_1 & \\cdots & z^{(T)}_1 \\\\\\\\ \\cdots & \\cdots & \\cdots \\\\\\\\ z^{(T - \\mathtt{pw})}_n & \\cdots & z^{(T)}_n \\end{bmatrix}$$\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.assign_output_kernel!-Tuple{Any}","page":"Library","title":"GeometricMachineLearning.assign_output_kernel!","text":"This should be used together with assign_batch_kernel!. It assigns the corresponding output (i.e. target).\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.assign_q_and_p-Tuple{AbstractVector, Int64}","page":"Library","title":"GeometricMachineLearning.assign_q_and_p","text":"Allocates two new arrays q and p whose first dimension is half of that of the input x. This should also be supplied through the second argument N.\n\nThe output is a Tuple containing q and p.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.augment_zeros_kernel!-Tuple{Any}","page":"Library","title":"GeometricMachineLearning.augment_zeros_kernel!","text":"Used for differentiating assign_output_estimate (this appears in the loss). \n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.compute_output_of_mha-Union{Tuple{T}, Tuple{M}, Tuple{MultiHeadAttention{M, M}, AbstractMatrix{T}, NamedTuple}} where {M, T}","page":"Library","title":"GeometricMachineLearning.compute_output_of_mha","text":"Applies MHA to an abstract matrix. This is the same independent of whether the input is added to the output or not. 
\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.convert_input_and_batch_indices_to_array-Union{Tuple{BT}, Tuple{AT}, Tuple{T}, Tuple{DataLoader{T, BT}, Batch, Vector{Tuple{Int64, Int64}}}} where {T, AT<:AbstractArray{T, 3}, BT<:@NamedTuple{q::AT, p::AT}}","page":"Library","title":"GeometricMachineLearning.convert_input_and_batch_indices_to_array","text":"Takes the output of the batch functor and uses it to create the corresponding array (NamedTuples). \n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.convert_input_and_batch_indices_to_array-Union{Tuple{BT}, Tuple{T}, Tuple{DataLoader{T, BT}, Batch, Vector{Tuple{Int64, Int64}}}} where {T, BT<:AbstractArray{T, 3}}","page":"Library","title":"GeometricMachineLearning.convert_input_and_batch_indices_to_array","text":"Takes the output of the batch functor and uses it to create the corresponding array. \n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.crop_array_for_transformer_loss-Union{Tuple{BT}, Tuple{AT}, Tuple{T2}, Tuple{T}, Tuple{AT, BT}} where {T, T2, AT<:AbstractArray{T, 3}, BT<:AbstractArray{T2, 3}}","page":"Library","title":"GeometricMachineLearning.crop_array_for_transformer_loss","text":"This crops the output array of the neural network so that it conforms with the output it should be compared to. This is needed for the transformer loss. 
\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.custom_mat_mul-Tuple{AbstractMatrix, AbstractVecOrMat}","page":"Library","title":"GeometricMachineLearning.custom_mat_mul","text":"Multiplies a matrix with a vector, a matrix or a tensor.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.draw_batch!-Union{Tuple{T}, Tuple{AbstractMatrix{T}, AbstractMatrix{T}}} where T","page":"Library","title":"GeometricMachineLearning.draw_batch!","text":"This assigns the batch if the data are in form of a matrix.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.init_optimizer_cache-Tuple{GradientOptimizer, Any}","page":"Library","title":"GeometricMachineLearning.init_optimizer_cache","text":"Wrapper for the functions setup_adam_cache, setup_momentum_cache, setup_gradient_cache, setup_bfgs_cache. These appear outside of optimizer_caches.jl because the OptimizerMethods first have to be defined.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.initialize_hessian_inverse-Union{Tuple{AbstractArray{T}}, Tuple{T}} where T","page":"Library","title":"GeometricMachineLearning.initialize_hessian_inverse","text":"This initializes the inverse of the Hessian for various arrays. This requires an implementation of a vectorization operation vec. 
This is important for custom arrays.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.loss-Tuple{NeuralNetwork, Vararg{Any}}","page":"Library","title":"GeometricMachineLearning.loss","text":"Wrapper if we deal with a neural network.\n\nYou can supply an instance of NeuralNetwork instead of the two arguments model (of type Union{Chain, AbstractExplicitLayer}) and parameters (of type Union{Tuple, NamedTuple}).\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.loss-Union{Tuple{BT}, Tuple{AT}, Tuple{T1}, Tuple{T}, Tuple{Union{AbstractNeuralNetworks.AbstractExplicitLayer, Chain}, Union{Tuple, NamedTuple}, AT, BT}} where {T, T1, AT<:(AbstractArray{T}), BT<:(AbstractArray{T1})}","page":"Library","title":"GeometricMachineLearning.loss","text":"Computes the loss for a neural network and a data set. The computed loss is \n\n$$||output - \\mathcal{NN}(input)||_F/||output||_F,$$\n\nwhere $||A||_F = \\sqrt{\\sum_{i_1,\\ldots,i_k}|a_{i_1,\\ldots,i_k}|^2}$ is the Frobenius norm.\n\nIt takes as input: \n\nmodel::Union{Chain, AbstractExplicitLayer}\nps::Union{Tuple, NamedTuple}\ninput::Union{Array, NamedTuple}\noutput::Union{Array, NamedTuple}\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.loss-Union{Tuple{BT}, Tuple{AT}, Tuple{T1}, Tuple{T}, Tuple{Union{AbstractNeuralNetworks.AbstractExplicitLayer, Chain}, Union{Tuple, NamedTuple}, DataLoader{T, AT, BT}}} where {T, T1, AT<:AbstractArray{T, 3}, BT<:AbstractArray{T1, 3}}","page":"Library","title":"GeometricMachineLearning.loss","text":"Alternative call of the loss function. 
This takes as input: \n\nmodel\nps\ndl::DataLoader\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.loss-Union{Tuple{BT}, Tuple{T}, Tuple{Union{AbstractNeuralNetworks.AbstractExplicitLayer, Chain}, Union{Tuple, NamedTuple}, BT}} where {T, BT<:(AbstractArray{T})}","page":"Library","title":"GeometricMachineLearning.loss","text":"The autoencoder loss:\n\noutput - mathcalNN(input)_Foutput_F\n\nIt takes as input: \n\nmodel::Union{Chain, AbstractExplicitLayer}\nps::Union{Tuple, NamedTuple}\ninput::Union{Array, NamedTuple}\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.map_index_for_symplectic_potential-Tuple{Int64, Int64}","page":"Library","title":"GeometricMachineLearning.map_index_for_symplectic_potential","text":"This assigns the right index for the symplectic potential. To be used with assign_ones_for_symplectic_potential_kernel!.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.mat_tensor_mul-Union{Tuple{AT}, Tuple{ST}, Tuple{BT}, Tuple{T}, Tuple{AT, AbstractArray{T, 3}}} where {T, BT<:(AbstractArray{T}), ST<:StiefelManifold{T, BT}, AT<:LinearAlgebra.Adjoint{T, ST}}","page":"Library","title":"GeometricMachineLearning.mat_tensor_mul","text":"Extend mat_tensor_mul to a multiplication by the adjoint of an element of StiefelManifold. \n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.mat_tensor_mul-Union{Tuple{T}, Tuple{StiefelManifold, AbstractArray{T, 3}}} where T","page":"Library","title":"GeometricMachineLearning.mat_tensor_mul","text":"Extend mat_tensor_mul to a multiplication by an element of StiefelManifold. 
\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.metric-Tuple{StiefelManifold, AbstractMatrix, AbstractMatrix}","page":"Library","title":"GeometricMachineLearning.metric","text":"Implements the canonical Riemannian metric for the Stiefel manifold:\n\ng_Y (Delta_1 Delta_2) mapsto mathrmtr(Delta_1^T(mathbbI - frac12YY^T)Delta_2)\n\nIt is called with: \n\nY::StiefelManifold\nΔ₁::AbstractMatrix\nΔ₂::AbstractMatrix`\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.number_of_batches-Union{Tuple{OT}, Tuple{AT}, Tuple{BT}, Tuple{T}, Tuple{DataLoader{T, AT, OT, :TimeSeries}, Batch}} where {T, BT<:AbstractArray{T, 3}, AT<:Union{@NamedTuple{q::BT, p::BT}, BT}, OT}","page":"Library","title":"GeometricMachineLearning.number_of_batches","text":"Gives the number of batches. Inputs are of type DataLoader and Batch.\n\nHere the big distinction is between data that are time-series like and data that are autoencoder like.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.onehotbatch-Union{Tuple{AbstractVector{T}}, Tuple{T}} where T<:Integer","page":"Library","title":"GeometricMachineLearning.onehotbatch","text":"One-hot-batch encoding of a vector of integers: inputin01ldots9^ell. The output is a tensor of shape 10times1timesell. \n\n0 mapsto beginbmatrix 1 0 ldots 0 endbmatrix\n\nIn more abstract terms: i mapsto e_i.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.optimization_step!-Tuple{Optimizer, Chain, Tuple, Tuple}","page":"Library","title":"GeometricMachineLearning.optimization_step!","text":"Optimization for an entire neural network, the way this function should be called. 
\n\ninputs: \n\no::Optimizer\nmodel::Chain\nps::Tuple\ndx::Tuple\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.optimization_step!-Tuple{Optimizer, Union{AbstractNeuralNetworks.AbstractExplicitCell, AbstractNeuralNetworks.AbstractExplicitLayer}, NamedTuple, NamedTuple, NamedTuple}","page":"Library","title":"GeometricMachineLearning.optimization_step!","text":"Optimization for a single layer. \n\ninputs: \n\no::Optimizer\nd::Union{AbstractExplicitLayer, AbstractExplicitCell}\nps::NamedTuple: the parameters \nC::NamedTuple: NamedTuple of the caches \ndx::NamedTuple: NamedTuple of the derivatives (output of AD routine)\n\nps, C and dx must have the same keys. \n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.optimize_for_one_epoch!-Union{Tuple{T}, Tuple{Optimizer, Any, Union{Tuple, NamedTuple}, DataLoader{T, AT} where AT<:Union{AbstractArray{T}, NamedTuple}, Batch, Union{typeof(GeometricMachineLearning.loss), GeometricMachineLearning.NetworkLoss}}} where T","page":"Library","title":"GeometricMachineLearning.optimize_for_one_epoch!","text":"Optimize for an entire epoch. For this you have to supply: \n\nan instance of the optimizer.\nthe neural network model \nthe parameters of the model \nthe data (in form of DataLoader)\nin instance of Batch that contains batch_size (and optionally seq_length)\n\nWith the optional argument:\n\nthe loss, which takes the model, the parameters ps and an instance of DataLoader as input.\n\nThe output of optimize_for_one_epoch! is the average loss over all batches of the epoch:\n\noutput = frac1mathttsteps_per_epochsum_t=1^mathttsteps_per_epochloss(theta^(t-1))\n\nThis is done because any reverse differentiation routine always has two outputs: a pullback and the value of the function it is differentiating. 
In the case of Zygote: loss_value, pullback = Zygote.pullback(ps -> loss(ps), ps) (if the loss only depends on the parameters).\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.patch_index-Union{Tuple{T}, Tuple{T, T, T}, NTuple{4, T}} where T<:Integer","page":"Library","title":"GeometricMachineLearning.patch_index","text":"Based on coordinates i,j this returns the patch index (for the MNIST data set for now).\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.reduced_vector_field_from_full_explicit_vector_field-Tuple{Any, Any, Integer, Integer}","page":"Library","title":"GeometricMachineLearning.reduced_vector_field_from_full_explicit_vector_field","text":"This function is needed if we obtain a GeometricIntegrators-like vector field from an explicit vector field $V:\\mathbb{R}^{2N}\\to\\mathbb{R}^{2N}$. We need this function because build_reduced_vector_field is not working in conjunction with implicit integrators.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.rgrad-Tuple{StiefelManifold, AbstractMatrix}","page":"Library","title":"GeometricMachineLearning.rgrad","text":"Computes the Riemannian gradient for the Stiefel manifold given an element $Y\\in{}St(N,n)$ and a matrix $\\nabla{}L\\in\\mathbb{R}^{N\\times{}n}$ (the Euclidean gradient). It computes the Riemannian gradient with respect to the canonical metric (see the documentation for the function metric for an explanation of this). The precise form of the mapping is: \n\n$$\\mathtt{rgrad}(Y, \\nabla{}L) \\mapsto \\nabla{}L - Y(\\nabla{}L)^TY$$\n\nIt is called with inputs:\n\nY::StiefelManifold\ne_grad::AbstractMatrix: i.e. 
the Euclidean gradient (what was called nablaL) above.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.split_and_flatten-Union{Tuple{AbstractArray{T, 3}}, Tuple{T}} where T","page":"Library","title":"GeometricMachineLearning.split_and_flatten","text":"split_and_flatten takes a tensor as input and produces another one as output (essentially rearranges the input data in an intricate way) so that it can easily be processed with a transformer.\n\nThe optional arguments are: \n\npatch_length: by default this is 7. \nnumber_of_patches: by default this is 16.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.tensor_mat_skew_sym_assign-Union{Tuple{AT}, Tuple{T}, Tuple{AT, AbstractMatrix{T}}} where {T, AT<:AbstractArray{T, 3}}","page":"Library","title":"GeometricMachineLearning.tensor_mat_skew_sym_assign","text":"Takes as input: \n\nZ::AbstractArray{T, 3}: A tensor that stores a bunch of time series. \nA::AbstractMatrix: A matrix that is used to perform various scalar products. \n\nFor one of these time series the function performs the following computation: \n\n (z^(i) z^(j)) mapsto (z^(i))^TAz^(j) text for i j\n\nThe result of this are n(n-2)div2 scalar products. These scalar products are written into a lower-triangular matrix and the final output of the function is a tensor of these lower-triangular matrices. \n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.tensor_mat_skew_sym_assign_kernel!-Tuple{Any}","page":"Library","title":"GeometricMachineLearning.tensor_mat_skew_sym_assign_kernel!","text":"A kernel that computes the weighted scalar products of all combinations of vectors in the matrix Z except where the two vectors are the same and writes the result into a tensor of skew symmetric matrices C. 
\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.train!","page":"Library","title":"GeometricMachineLearning.train!","text":"train!(...)\n\nTrains a neural network on data using a given training method.\n\nDifferent ways of use:\n\ntrain!(neuralnetwork, data, optimizer = GradientOptimizer(1e-2), training_method; nruns = 1000, batch_size = default(data, type), showprogress = false )\n\nArguments\n\nneuralnetwork::LuxNeuralNetwork : the neural network using LuxBackend\ndata : the data (see TrainingData)\noptimizer = GradientOptimizer: the optimization method (see Optimizer)\ntraining_method : specifies the loss function used \nnruns : number of iterations through the process (1000 by default) \nbatch_size : size of the batch of data used for each step\n\n\n\n\n\n","category":"function"},{"location":"library/#GeometricMachineLearning.train!-Tuple{AbstractNeuralNetworks.AbstractNeuralNetwork{<:AbstractNeuralNetworks.Architecture}, AbstractTrainingData, TrainingParameters}","page":"Library","title":"GeometricMachineLearning.train!","text":"train!(neuralnetwork, data, optimizer, training_method; nruns = 1000, batch_size, showprogress = false )\n\nArguments\n\nneuralnetwork::LuxNeuralNetwork : the neural network using LuxBackend\ndata::AbstractTrainingData : the data\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.transformer_loss-Union{Tuple{BT}, Tuple{T}, Tuple{Union{AbstractNeuralNetworks.AbstractExplicitLayer, Chain}, Union{Tuple, NamedTuple}, BT, BT}} where {T, BT<:(AbstractArray{T})}","page":"Library","title":"GeometricMachineLearning.transformer_loss","text":"The transformer loss works similarly to the regular loss, but with the difference that $\\mathcal{NN}(input)$ and output may have different sizes. 
\n\nIt takes as input: \n\nmodel::Union{Chain, AbstractExplicitLayer}\nps::Union{Tuple, NamedTuple}\ninput::Union{Array, NamedTuple}\noutput::Union{Array, NamedTuple}\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.within_patch_index-Union{Tuple{T}, Tuple{T, T, T}} where T<:Integer","page":"Library","title":"GeometricMachineLearning.within_patch_index","text":"Based on coordinates i,j this returns the index within the patch.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.write_ones_kernel!-Tuple{Any}","page":"Library","title":"GeometricMachineLearning.write_ones_kernel!","text":"Kernel that is needed for functions relating to SymmetricMatrix and SkewSymMatrix \n\n\n\n\n\n","category":"method"},{"location":"optimizers/adam_optimizer/#The-Adam-Optimizer","page":"Adam Optimizer","title":"The Adam Optimizer","text":"","category":"section"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"The Adam Optimizer is one of the most widely used (if not the most widely used) neural network optimizers. Like most modern neural network optimizers it contains a cache that is updated based on first-order gradient information and then, in a second step, the cache is used to compute a velocity estimate for updating the neural network weights. ","category":"page"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"Here we first describe the Adam algorithm for the case where all the weights are on a vector space and then show how to generalize this to the case where the weights are on a manifold. 
","category":"page"},{"location":"optimizers/adam_optimizer/#All-weights-on-a-vector-space","page":"Adam Optimizer","title":"All weights on a vector space","text":"","category":"section"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"The cache of the Adam optimizer consists of first and second moments. The first moments $B_1$ store linear information about the current and previous gradients, and the second moments $B_2$ store quadratic information about current and previous gradients (all computed from a first-order gradient). ","category":"page"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"If all the weights are on a vector space, then we directly compute updates for $B_1$ and $B_2$:","category":"page"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"1. $B_1 \\gets ((\\rho_1 - \\rho_1^t)/(1 - \\rho_1^t))\\cdot{}B_1 + (1 - \\rho_1)/(1 - \\rho_1^t)\\cdot\\nabla{}L,$\n2. $B_2 \\gets ((\\rho_2 - \\rho_2^t)/(1 - \\rho_2^t))\\cdot{}B_2 + (1 - \\rho_2)/(1 - \\rho_2^t)\\cdot\\nabla{}L\\odot\\nabla{}L,$\nwhere $\\odot:\\mathbb{R}^n\\times\\mathbb{R}^n\\to\\mathbb{R}^n$ is the Hadamard product: $[a\\odot{}b]_i = a_ib_i$. $\\rho_1$ and $\\rho_2$ are hyperparameters. Their defaults, $\\rho_1=0.9$ and $\\rho_2=0.99$, are taken from (Goodfellow et al., 2016, page 301). After having updated the cache (i.e. $B_1$ and $B_2$) we compute a velocity (step 3) with which the parameters $Y_t$ are then updated (step 4).\n3. $W_t\\gets -\\eta{}B_1/\\sqrt{B_2 + \\delta},$\n4. $Y_{t+1} \\gets Y_t + W_t.$","category":"page"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"Here $\\eta$ (with default 0.01) is the learning rate and $\\delta$ (with default $3\\cdot10^{-7}$) is a small constant that is added for stability. The division, square root and addition in step 3 are performed element-wise. 
","category":"page"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"import Images, Plots # hide\nif Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nelse # hide\n Plots.plot(Images.load(\"../tikz/adam_optimizer.png\"), axis=([], false)) # hide\nend # hide","category":"page"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"if Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nend # hide","category":"page"},{"location":"optimizers/adam_optimizer/#Weights-on-manifolds","page":"Adam Optimizer","title":"Weights on manifolds","text":"","category":"section"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"The problem with generalizing Adam to manifolds is that the Hadamard product $\\odot$ as well as the other element-wise operations (division, square root and addition in step 3 above) lack a clear geometric interpretation. In GeometricMachineLearning we get around this issue by utilizing a so-called global tangent space representation. ","category":"page"},{"location":"optimizers/adam_optimizer/#References","page":"Adam Optimizer","title":"References","text":"","category":"section"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"Goodfellow I, Bengio Y, Courville A. Deep learning[M]. MIT press, 2016.","category":"page"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"I. Goodfellow, Y. Bengio and A. Courville. 
Deep learning (MIT press, Cambridge, MA, 2016).\n\n\n\n","category":"page"},{"location":"architectures/autoencoders/#Variational-Autoencoders","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"","category":"section"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"Variational autoencoders (Lee and Carlberg, 2020) train on the following set: ","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"mathcalX(mathbbP_mathrmtrain) = mathbfx^k(mu) - mathbfx^0(mu)0leqkleqKmuinmathbbP_mathrmtrain","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"where mathbfx^k(mu)approxmathbfx(t^kmu). Note that mathbf0inmathcalX(mathbbP_mathrmtrain) as k can also be zero. ","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"The encoder Psi^mathrmenc and decoder Psi^mathrmdec are then trained on this set mathcalX(mathbbP_mathrmtrain) by minimizing the reconstruction error: ","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":" mathbfx - Psi^mathrmdeccircPsi^mathrmenc(mathbfx) text for mathbfxinmathcalX(mathbbP_mathrmtrain)","category":"page"},{"location":"architectures/autoencoders/#Initial-condition","page":"Variational Autoencoders","title":"Initial condition","text":"","category":"section"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"No matter the parameter mu the initial condition in the reduced system is always mathbfx_r0(mu) = mathbfx_r0 = Psi^mathrmenc(mathbf0). 
","category":"page"},{"location":"architectures/autoencoders/#Reconstructed-solution","page":"Variational Autoencoders","title":"Reconstructed solution","text":"","category":"section"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"In order to arrive at the reconstructed solution one first has to decode the reduced state and then add the reference state:","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"mathbfx^mathrmreconstr(tmu) = mathbfx^mathrmref(mu) + Psi^mathrmdec(mathbfx_r(tmu))","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"where mathbfx^mathrmref(mu) = mathbfx(t_0mu) - Psi^mathrmdeccircPsi^mathrmenc(mathbf0).","category":"page"},{"location":"architectures/autoencoders/#Symplectic-reduced-vector-field","page":"Variational Autoencoders","title":"Symplectic reduced vector field","text":"","category":"section"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"A symplectic vector field is one whose flow conserves the symplectic structure mathbbJ. This is equivalent[1] to there existing a Hamiltonian H s.t. the vector field X can be written as X = mathbbJnablaH.","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"[1]: Technically speaking the definitions are equivalent only for simply-connected manifolds, so also for vector spaces. 
","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"If the full-order Hamiltonian is H^mathrmfullequivH we can obtain another Hamiltonian on the reduced space by simply setting: ","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"H^mathrmred(mathbfx_r(tmu)) = H(mathbfx^mathrmreconstr(tmu)) = H(mathbfx^mathrmref(mu) + Psi^mathrmdec(mathbfx_r(tmu)))","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"The ODE associated to this Hamiltonian is also the one corresponding to Manifold Galerkin ROM (see (Lee and Carlberg, 2020)).","category":"page"},{"location":"architectures/autoencoders/#Manifold-Galerkin-ROM","page":"Variational Autoencoders","title":"Manifold Galerkin ROM","text":"","category":"section"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"Define the FOM ODE residual as: ","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"r (mathbfv xi tau mu) mapsto mathbfv - f(xi tau mu)","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"The reduced ODE is then defined to be: ","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"dothatmathbfx(tmu) = mathrmargmin_hatmathbfvinmathbbR^p r(mathcalJ(hatmathbfx(tmu))hatmathbfvhatmathbfx^mathrmref(mu) + Psi^mathrmdec(hatmathbfx(tmu))tmu) _2^2","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"where mathcalJ is the Jacobian of the decoder Psi^mathrmdec. 
This leads to: ","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"mathcalJ(hatmathbfx(tmu))hatmathbfv - f(hatmathbfx^mathrmref(mu) + Psi^mathrmdec(hatmathbfx(tmu)) t mu) overset= 0 implies \nhatmathbfv = mathcalJ(hatmathbfx(tmu))^+f(hatmathbfx^mathrmref(mu) + Psi^mathrmdec(hatmathbfx(tmu)) t mu)","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"where mathcalJ(hatmathbfx(tmu))^+ is the pseudoinverse of mathcalJ(hatmathbfx(tmu)). Because mathcalJ(hatmathbfx(tmu)) is a symplectic matrix the pseudoinverse is the symplectic inverse (see (Peng and Mohseni, 2016)).","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"Furthermore, because f is Hamiltonian, the vector field describing dothatmathbfx(tmu) will also be Hamiltonian. ","category":"page"},{"location":"architectures/autoencoders/#References","page":"Variational Autoencoders","title":"References","text":"","category":"section"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"K. Lee and K. Carlberg. “Model reduction of dynamical systems on nonlinear manifolds using","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"deep convolutional autoencoders”. In: Journal of Computational Physics 404 (2020), p. 108973.","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"Peng L, Mohseni K. Symplectic model reduction of Hamiltonian systems[J]. 
SIAM Journal on Scientific Computing, 2016, 38(1): A1-A27.","category":"page"},{"location":"data_loader/TODO/#DATA-Loader-TODO","page":"DATA Loader TODO","title":"DATA Loader TODO","text":"","category":"section"},{"location":"data_loader/TODO/","page":"DATA Loader TODO","title":"DATA Loader TODO","text":"[x] Implement @views instead of allocating a new array in every step. \n[x] Implement sampling without replacement.\n[x] Store information on the epoch and the current loss. \n[x] Usually the training loss is computed over the entire data set, we are probably going to do this for one epoch via ","category":"page"},{"location":"data_loader/TODO/","page":"DATA Loader TODO","title":"DATA Loader TODO","text":"loss_e = frac1batchessum_batchinbatchesloss(batch)","category":"page"},{"location":"data_loader/TODO/","page":"DATA Loader TODO","title":"DATA Loader TODO","text":"Point 4 makes sense because the output of an AD routine is the value of the loss function as well as the pullback. ","category":"page"},{"location":"data_loader/data_loader/#Data-Loader","page":"Routines","title":"Data Loader","text":"","category":"section"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"using GeometricMachineLearning, Markdown\nMarkdown.parse(description(Val(:DataLoader)))","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"The data loader can be called with various types of arrays as input, for example a snapshot matrix:","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"using GeometricMachineLearning # hide\n\nSnapshotMatrix = rand(Float32, 10, 100)\n\ndl = DataLoader(SnapshotMatrix)","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"or a snapshot tensor: ","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"using GeometricMachineLearning # 
hide\n\nSnapshotTensor = rand(Float32, 10, 100, 5)\n\ndl = DataLoader(SnapshotTensor)","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"Here the DataLoader has different properties :RegularData and :TimeSeries. This indicates that in the first case we treat all columns in the input tensor independently (this is mostly used for autoencoder problems), whereas in the second case we have time series-like data, which are mostly used for integration problems. We can also treat a problem with a matrix as input as a time series-like problem by providing an additional keyword argument: autoencoder=false:","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"using GeometricMachineLearning # hide\n\nSnapshotMatrix = rand(Float32, 10, 100)\n\ndl = DataLoader(SnapshotMatrix; autoencoder=false)\ndl.input_time_steps","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"using GeometricMachineLearning, Markdown\nMarkdown.parse(description(Val(:data_loader_for_named_tuple)))","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"using GeometricMachineLearning # hide\n\nSymplecticSnapshotTensor = (q = rand(Float32, 10, 100, 5), p = rand(Float32, 10, 100, 5))\n\ndl = DataLoader(SymplecticSnapshotTensor)","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"dl.input_dim","category":"page"},{"location":"data_loader/data_loader/#The-Batch-struct","page":"Routines","title":"The Batch struct","text":"","category":"section"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"using GeometricMachineLearning, Markdown\nMarkdown.parse(description(Val(:Batch)))","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"using GeometricMachineLearning # hide\n\nmatrix_data 
= rand(Float32, 2, 10)\ndl = DataLoader(matrix_data; autoencoder = true)\n\nbatch = Batch(3)\nbatch(dl)","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"This also works if the data are in qp form: ","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"using GeometricMachineLearning # hide\n\nqp_data = (q = rand(Float32, 2, 10), p = rand(Float32, 2, 10))\ndl = DataLoader(qp_data; autoencoder = true)\n\nbatch = Batch(3)\nbatch(dl)","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"In those two examples the autoencoder keyword was set to true (the default). This is why the first index was always 1. This changes if we set autoencoder = false: ","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"using GeometricMachineLearning # hide\n\nqp_data = (q = rand(Float32, 2, 10), p = rand(Float32, 2, 10))\ndl = DataLoader(qp_data; autoencoder = false) # false is default \n\nbatch = Batch(3)\nbatch(dl)","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"Specifically the routines do the following: ","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"mathttn_indicesleftarrow mathttn_paramslormathttinput_time_steps \nmathttindices leftarrow mathttshuffle(mathtt1mathttn_indices)\nmathcalI_i leftarrow mathttindices(i - 1) cdot mathttbatch_size + 1 mathtt i cdot mathttbatch_sizetext for i=1 ldots (mathrmlast -1)\nmathcalI_mathttlast leftarrow mathttindices(mathttn_batches - 1) cdot mathttbatch_size + 1mathttend","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"Note that the routines are implemented in such a way that no two indices appear double. 
","category":"page"},{"location":"data_loader/data_loader/#Sampling-from-a-tensor","page":"Routines","title":"Sampling from a tensor","text":"","category":"section"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"We can also sample tensor data.","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"using GeometricMachineLearning # hide\n\nqp_data = (q = rand(Float32, 2, 20, 3), p = rand(Float32, 2, 20, 3))\ndl = DataLoader(qp_data)\n\n# also specify sequence length here\nbatch = Batch(4, 5)\nbatch(dl)","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"Sampling from a tensor is done the following way (mathcalI_i again denotes the batch indices for the i-th batch): ","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"mathtttime_indices leftarrow mathttshuffle(mathtt1(mathttinput_time_steps - mathttseq_length - mathttprediction_window)\nmathttparameter_indices leftarrow mathttshuffle(mathtt1n_params)\nmathttcomplete_indices leftarrow mathttproduct(mathtttime_indices mathttparameter_indices)\nmathcalI_i leftarrow mathttcomplete_indices(i - 1) cdot mathttbatch_size + 1 i cdot mathttbatch_sizetext for i=1 ldots (mathrmlast -1)\nmathcalI_mathrmlast leftarrow mathttcomplete_indices(mathrmlast - 1) cdot mathttbatch_size + 1mathttend","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"This algorithm can be visualized the following way (here batch_size = 4):","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"import Images, Plots # hide\nif Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nelse # hide\n Plots.plot(Images.load(\"../tikz/tensor_sampling.png\"), axis=([], false)) # hide\nend # 
hide","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"if Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nend # hide","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"Here the sampling is performed over the second axis (the time step dimension) and the third axis (the parameter dimension). Whereas each block has thickness 1 in the x direction (i.e. pertains to a single parameter), its length in the y direction is seq_length. In total we sample as many such blocks as the batch size. By construction those blocks are never the same throughout a training epoch but may intersect each other!","category":"page"},{"location":"manifolds/basic_topology/#Basic-Concepts-of-General-Topology","page":"Concepts from General Topology","title":"Basic Concepts of General Topology","text":"","category":"section"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"On this page we discuss basic notions of topology that are necessary to define and work with manifolds. Here we largely omit concrete examples and only define concepts that are necessary for defining a manifold[1], namely the properties of being Hausdorff and second countable. For a wide range of examples and a detailed discussion of the theory see e.g. [5]. The theory presented here is also (rudimentarily) covered in most differential geometry books such as [6] and [7]. ","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"[1]: Some authors (see e.g. [6]) do not require these properties. But since they constitute very weak restrictions and are always satisfied by the manifolds relevant for our purposes we require them here. 
","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Definition: A topological space is a set mathcalM for which we define a collection of subsets of mathcalM, which we denote by mathcalT and call the open subsets. mathcalT further has to satisfy the following three conditions:","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"The empty set and mathcalM belong to mathcalT.\nAny union of an arbitrary number of elements of mathcalT again belongs to mathcalT.\nAny intersection of a finite number of elements of mathcalT again belongs to mathcalT.","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Based on this definition of a topological space we can now define what it means to be Hausdorff: Definition: A topological space mathcalM is said to be Hausdorff if for any two points xyinmathcalM we can find two open sets U_xU_yinmathcalT s.t. xinU_x yinU_y and U_xcapU_y=.","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"We now give the second definition that we need for defining manifolds, that of second countability: Definition: A topological space mathcalM is said to be second-countable if we can find a countable subcollection of mathcalT called mathcalU s.t. 
forallUinmathcalT and xinU we can find an element VinmathcalU for which xinVsubsetU.","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"We now give a few definitions and results that are needed for the inverse function theorem which is essential for practical applications of manifold theory.","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Definition: A mapping f between topological spaces mathcalM and mathcalN is called continuous if the preimage of every open set is again an open set, i.e. if f^-1UinmathcalT for U open in mathcalN and mathcalT the topology on mathcalM.","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Definition: A closed set of a topological space mathcalM is one whose complement is an open set, i.e. F is closed if F^cinmathcalT, where the superscript ^c indicates the complement. 
For closed sets we thus have the following three properties: ","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"The empty set and mathcalM are closed sets.\nAny union of a finite number of closed sets is again closed.\nAny intersection of an arbitrary number of closed sets is again closed.","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Theorem: The definition of continuity is equivalent to the following, second definition: fmathcalMtomathcalN is continuous if f^-1FsubsetmathcalM is a closed set for each closed set FsubsetmathcalN.","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Proof: First assume that f is continuous according to the first definition and not to the second. Then f^-1F is not closed but f^-1F^c is open. But f^-1F^c = xinmathcalMf(x)notinF = (f^-1F)^c cannot be open, else f^-1F would be closed. The implication of the first definition under assumption of the second can be shown analogously. ","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Theorem: The property of a set F being closed is equivalent to the following statement: If a point y is such that for every open set U containing it we have UcapFneq then this point is contained in F.","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Proof: We first prove that if a set is closed then the statement holds. Consider a closed set F and a point ynotinF s.t. every open set containing y has nonempty intersection with F. But the complement F^c also is such a set, which is a clear contradiction. 
Now assume the above statement for a set F and further assume F is not closed. Its complement F^c is thus not open. Now consider the interior of this set: mathrmint(F^c)=cupUUsubsetF^c, i.e. the biggest open set contained within F^c. Hence there must be a point y which is in F^c but is not in its interior, else F^c would be equal to its interior, i.e. would be open. We further must be able to find an open set U that contains y but is also contained in F^c, else y would be an element of F. A contradiction. ","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Definition: An open cover of a topological space mathcalM is a (not necessarily countable) collection of open sets U_i_imathcalI s.t. their union contains mathcalM. A finite open cover is a collection of a finite number of open sets that cover mathcalM. We say that an open cover is reducible to a finite cover if we can find a finite number of elements in the open cover whose union still contains mathcalM.","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Definition: A topological space mathcalM is called compact if every open cover is reducible to a finite cover.","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Theorem: Consider a continuous function fmathcalMtomathcalN and a compact set KinmathcalM. Then f(K) is also compact. ","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Proof: Consider an open cover of f(K): U_i_iinmathcalI. Then f^-1U_i_iinmathcalI is an open cover of K and hence reducible to a finite cover f^-1U_i_iini_1ldotsi_n. 
But then U_i_iini_1ldotsi_n also covers f(K).","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Theorem: A closed subset of a compact space is compact.","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Proof: Call the closed set F and consider an open cover of this set: U_iinmathcalI. Then this open cover combined with F^c is an open cover for the entire compact space, hence reducible to a finite cover.","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Theorem: A compact subset of a Hausdorff space is closed.","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Proof: Consider a compact subset K. If K is not closed, then there has to be a point ynotinK s.t. every open set containing y intersects K. Because the surrounding space is Hausdorff we can now find the following two collections of open sets: (U_z U_zy U_zcapU_zy=)_zinK. The open cover U_z_zinK is then reducible to a finite cover U_z_zinz_1 ldots z_n. The intersection cap_zinz_1 ldots z_nU_zy is then an open set that contains y but has no intersection with K. A contradiction. ","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Theorem: If mathcalM is compact and mathcalN is Hausdorff, then the inverse of a continuous function fmathcalMtomathcalN is again continuous, i.e. 
f(V) is an open set in mathcalN for VinmathcalT.","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Proof: We can equivalently show that every closed set is mapped to a closed set. First consider the set KinmathcalM. Its image is again compact and hence closed because mathcalN is Hausdorff. ","category":"page"},{"location":"manifolds/basic_topology/#References","page":"Concepts from General Topology","title":"References","text":"","category":"section"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"S. I. Richard L. Bishop. Tensor Analysis on Manifolds (Dover Publications, 1980).\n\n\n\nS. Lang. Fundamentals of differential geometry. Vol. 191 (Springer Science & Business Media, 2012).\n\n\n\nS. Lipschutz. General Topology (McGraw-Hill Book Company, 1965).\n\n\n\n","category":"page"},{"location":"tutorials/mnist_tutorial/#MNIST-tutorial","page":"MNIST","title":"MNIST tutorial","text":"","category":"section"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"This is a short tutorial that shows how we can use GeometricMachineLearning to build a vision transformer and apply it for MNIST, while also putting some of the weights on a manifold. This is also the result presented in [15].","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"First, we need to import the relevant packages: ","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"using GeometricMachineLearning, CUDA, Plots\nimport Zygote, MLDatasets, KernelAbstractions","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"For the AD routine we here use the GeometricMachineLearning default and we get the dataset from MLDatasets. 
First we need to load the data set and put it on the GPU (if you have one):","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"train_x, train_y = MLDatasets.MNIST(split=:train)[:]\ntest_x, test_y = MLDatasets.MNIST(split=:test)[:]\ntrain_x = train_x |> cu \ntest_x = test_x |> cu \ntrain_y = train_y |> cu \ntest_y = test_y |> cu","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"GeometricMachineLearning has built-in data loaders that make it particularly easy to handle data: ","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"patch_length = 7\ndl = DataLoader(train_x, train_y, patch_length=patch_length)\ndl_test = DataLoader(test_x, test_y, patch_length=patch_length)","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"Here patch_length indicates the size one patch has. One image in MNIST is of dimension 28times28; this means that we decompose it into 16 (7times7) images (also see [15]).","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"We next define the model with which we want to train:","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"model = ClassificationTransformer(dl, n_heads=n_heads, n_layers=n_layers, Stiefel=true)","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"Here we have chosen a ClassificationTransformer, i.e. a composition of a specific number of transformer layers composed with a classification layer. We also set the Stiefel option to true, i.e. we are optimizing on the Stiefel manifold.","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"We now have to initialize the neural network weights. 
This is done with the constructor for NeuralNetwork:","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"backend = KernelAbstractions.get_backend(dl)\nT = eltype(dl)\nnn = NeuralNetwork(model, backend, T)","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"And with this we can finally perform the training:","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"# an instance of batch is needed for the optimizer\nbatch = Batch(batch_size)\n\noptimizer_instance = Optimizer(AdamOptimizer(), nn)\n\n# this prints the accuracy and is optional\nprintln(\"initial test accuracy: \", accuracy(nn.model, nn.params, dl_test), \"\\n\")\n\nloss_array = optimizer_instance(nn, dl, batch, n_epochs)\n\nprintln(\"final test accuracy: \", accuracy(nn.model, nn.params, dl_test), \"\\n\")","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"It is instructive to play with n_layers, n_epochs and the Stiefel property.","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"B. Brantner. Generalizing Adam To Manifolds For Efficiently Training Transformers, arXiv preprint arXiv:2305.16901 (2023).\n\n\n\n","category":"page"},{"location":"manifolds/existence_and_uniqueness_theorem/#The-Existence-And-Uniqueness-Theorem","page":"Differential Equations and the EAU theorem","title":"The Existence-And-Uniqueness Theorem","text":"","category":"section"},{"location":"manifolds/existence_and_uniqueness_theorem/","page":"Differential Equations and the EAU theorem","title":"Differential Equations and the EAU theorem","text":"In order to prove the existence-and-uniqueness theorem we first need another theorem, the Banach fixed-point theorem, for which we also need another definition. 
","category":"page"},{"location":"manifolds/existence_and_uniqueness_theorem/","page":"Differential Equations and the EAU theorem","title":"Differential Equations and the EAU theorem","text":"Definition: A contraction mapping is a map TmathbbR^NtomathbbR^N for which there exists qin01) s.t. forallxyinmathbbR^NT(x)-T(y)leqqx-y.","category":"page"},{"location":"manifolds/existence_and_uniqueness_theorem/","page":"Differential Equations and the EAU theorem","title":"Differential Equations and the EAU theorem","text":"Theorem (Banach fixed-point theorem): Every contraction mapping T admits a unique fixed point x^* (i.e. a point x^* s.t. F(x^*)=x^*) and this point can be found by taking an arbitrary point x_0inmathbbR^N and taking the limit lim_ntoinftyT^n(x_0).","category":"page"},{"location":"manifolds/existence_and_uniqueness_theorem/","page":"Differential Equations and the EAU theorem","title":"Differential Equations and the EAU theorem","text":"Proof (Banach fixed-point theorem): Take an arbitrary point x_0inmathbbR^N and consider the sequence (x_n)_ninmathbbN with x_n=T^n(x_0). Then it holds that (for mn): ","category":"page"},{"location":"manifolds/existence_and_uniqueness_theorem/","page":"Differential Equations and the EAU theorem","title":"Differential Equations and the EAU theorem","text":"beginaligned\nx_m - x_n leq x_m - x_m-1 + x_m-1 - x_m-2 + cdots + x_m-(m-n+1)-x_n \n = x_n+(m-n) - x_n+(m-n-1) + cdots + x_n+1 - x_n \n leq sum_i=0^m-n-1q^ix_n+1 - x_n \n leq sum_i=0^m-n-1q^iq^nx_1 - x_0 \n = q^nx_1 -x_0sum_i=1^m-n-1q^i\nendaligned","category":"page"},{"location":"manifolds/existence_and_uniqueness_theorem/","page":"Differential Equations and the EAU theorem","title":"Differential Equations and the EAU theorem","text":"where we have used the triangle inequality in the first line. 
If we now let m go to infinity on the right-hand side we get ","category":"page"},{"location":"manifolds/existence_and_uniqueness_theorem/","page":"Differential Equations and the EAU theorem","title":"Differential Equations and the EAU theorem","text":"beginaligned\nx_m-x_n leq q^nx_1 -x_0sum_i=0^inftyq^i\n =q^nx_1 -x_0 frac11-q\nendaligned","category":"page"},{"location":"manifolds/existence_and_uniqueness_theorem/","page":"Differential Equations and the EAU theorem","title":"Differential Equations and the EAU theorem","text":"proving that the sequence is Cauchy. Because mathbbR^N is a complete metric space we get that (x_n)_ninmathbbN is a convergent sequence. We call the limit of this sequence x^*. This completes the proof of the Banach fixed-point theorem. ","category":"page"},{"location":"layers/multihead_attention_layer/#Multihead-Attention-Layer","page":"Multihead Attention","title":"Multihead Attention Layer","text":"","category":"section"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"In order to arrive from the attention layer at the multihead attention layer we have to make a few modifications: ","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"Note that these neural networks were originally developed for natural language processing (NLP) tasks and the terminology used here bears some resemblance to that field. 
The input to a multihead attention layer typically comprises three components:","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"Values VinmathbbR^ntimesT: a matrix whose columns are value vectors, \nQueries QinmathbbR^ntimesT: a matrix whose columns are query vectors, \nKeys KinmathbbR^ntimesT: a matrix whose columns are key vectors.","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"Regular attention performs the following operation: ","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"mathrmAttention(QKV) = Vmathrmsoftmax(fracK^TQsqrtn)","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"where n is the dimension of the vectors in V, Q and K. The softmax activation function here acts column-wise, so it can be seen as a transformation mathrmsoftmaxmathbbR^TtomathbbR^T with mathrmsoftmax(v)_i = e^v_ileft(sum_j=1^Te^v_jright). The K^TQ term is a similarity matrix between the queries and the keys. ","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"The transformer contains a self-attention mechanism, i.e. it takes an input X and transforms it linearly to V, Q and K: V = P^VX, Q = P^QX and K = P^KX. What distinguishes the multihead attention layer from the singlehead attention layer is that there is not just one P^V, P^Q and P^K, but several: one for each head of the multihead attention layer. 
After computing the individual values, queries and keys, and after applying the softmax, the outputs are concatenated in order to again obtain an array of the same size as the input array:","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"import Images, Plots # hide\nif Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nelse # hide\n Plots.plot(Images.load(\"../tikz/mha.png\"), axis=([], false)) # hide\nend # hide","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"if Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nend # hide","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"Here the various P matrices can be interpreted as projections onto lower-dimensional subspaces, hence the designation by the letter P. Because these projection matrices map onto smaller spaces that should capture features in the input data, it makes sense to constrain them to be elements of the Stiefel manifold. ","category":"page"},{"location":"layers/multihead_attention_layer/#Computing-Correlations-in-the-Multihead-Attention-Layer","page":"Multihead Attention","title":"Computing Correlations in the Multihead-Attention Layer","text":"","category":"section"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"The attention mechanism describes a reweighting of the \"values\" V_i based on correlations between the \"keys\" K_i and the \"queries\" Q_i. First note the structure of these matrices: each of them is a collection of T (Ndivmathttn_heads)-dimensional vectors, i.e. V_i=v_i^(1) ldots v_i^(T) K_i=k_i^(1) ldots k_i^(T) Q_i=q_i^(1) ldots q_i^(T) . 
Those vectors have been obtained by applying the respective projection matrices to the original input I_iinmathbbR^NtimesT.","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"When performing the reweighting of the columns of V_i we first compute the correlations between the vectors in K_i and in Q_i and store the results in a correlation matrix C_i: ","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":" C_i_mn = left(k_i^(m)right)^Tq_i^(n)","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"The columns of this correlation matrix are then rescaled with a softmax function, obtaining a matrix of probability vectors mathcalP_i:","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":" mathcalP_i_bulletn = mathrmsoftmax(C_i_bulletn)","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"Finally the matrix mathcalP_i is multiplied onto V_i from the right, resulting in T convex combinations of the T vectors v_i^(m) with m=1ldotsT:","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":" V_imathcalP_i = leftsum_m=1^TmathcalP_i_m1v_i^(m) ldots sum_m=1^TmathcalP_i_mTv_i^(m)right","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"With this we can now give a better interpretation of what the projection matrices W_i^V, W_i^K and W_i^Q should do: they map the original data to lower-dimensional subspaces. 
We then compute correlations between the representation in the K and in the Q basis and use this correlation to perform a convex reweighting of the vectors in the V basis. These reweighted values are then fed into a standard feedforward neural network.","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"Because the main task of the W_i^V, W_i^K and W_i^Q matrices here is for them to find bases, it makes sense to constrain them onto the Stiefel manifold; they do not and should not have the maximum possible generality.","category":"page"},{"location":"layers/multihead_attention_layer/#References","page":"Multihead Attention","title":"References","text":"","category":"section"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser and I. Polosukhin. Attention is all you need. Advances in neural information processing systems 30 (2017).\n\n\n\n","category":"page"},{"location":"tutorials/grassmann_layer/#Example-of-a-Neural-Network-with-a-Grassmann-Layer","page":"Grassmann manifold","title":"Example of a Neural Network with a Grassmann Layer","text":"","category":"section"},{"location":"tutorials/grassmann_layer/","page":"Grassmann manifold","title":"Grassmann manifold","text":"Here we show how to implement a neural network that contains a layer whose weight is an element of the Grassmann manifold and where this might be useful. 
","category":"page"},{"location":"tutorials/grassmann_layer/","page":"Grassmann manifold","title":"Grassmann manifold","text":"To answer where we would need this consider the following scenario","category":"page"},{"location":"tutorials/grassmann_layer/#Problem-statement","page":"Grassmann manifold","title":"Problem statement","text":"","category":"section"},{"location":"tutorials/grassmann_layer/","page":"Grassmann manifold","title":"Grassmann manifold","text":"We are given data in a big space mathcalD=d_i_iinmathcalIsubsetmathbbR^N and know these data live on an n-dimensional[1] submanifold[2] in mathbbR^N. Based on these data we would now like to generate new samples from the distributions that produced our original data. This is where the Grassmann manifold is useful: each element V of the Grassmann manifold is an n-dimensional subspace of mathbbR^N from which we can easily sample. We can then construct a (bijective) mapping from this space V onto a space that contains our data points mathcalD. ","category":"page"},{"location":"tutorials/grassmann_layer/","page":"Grassmann manifold","title":"Grassmann manifold","text":"[1]: We may know n exactly or approximately. ","category":"page"},{"location":"tutorials/grassmann_layer/","page":"Grassmann manifold","title":"Grassmann manifold","text":"[2]: Problems and solutions related to this scenario are commonly summarized under the term manifold learning (see [16]).","category":"page"},{"location":"tutorials/grassmann_layer/#Example","page":"Grassmann manifold","title":"Example","text":"","category":"section"},{"location":"tutorials/grassmann_layer/","page":"Grassmann manifold","title":"Grassmann manifold","text":"Consider the following toy example: We want to sample from the graph of the (scaled) Rosenbrock function f(xy) = ((1 - x)^2 + 100(y - x^2)^2)1000 while pretending we do not know the function. 
","category":"page"},{"location":"tutorials/grassmann_layer/","page":"Grassmann manifold","title":"Grassmann manifold","text":"using Plots # hide\n# hide\nrosenbrock(x::Vector) = ((1.0 - x[1]) ^ 2 + 100.0 * (x[2] - x[1] ^ 2) ^ 2) / 1000\nx, y = -1.5:0.1:1.5, -1.5:0.1:1.5\nz = Surface((x,y)->rosenbrock([x,y]), x, y)\np = surface(x,y,z; camera=(30,20), alpha=.6, colorbar=false, xlims=(-1.5, 1.5), ylims=(-1.5, 1.5), zlims=(0.0, rosenbrock([-1.5, -1.5])))","category":"page"},{"location":"tutorials/grassmann_layer/","page":"Grassmann manifold","title":"Grassmann manifold","text":"We now build a neural network whose task it is to map a product of two Gaussians mathcalN(01)timesmathcalN(01) onto the graph of the Rosenbrock function where the range for x and for y is -1515.","category":"page"},{"location":"tutorials/grassmann_layer/","page":"Grassmann manifold","title":"Grassmann manifold","text":"For computing the loss between the two distributions, i.e. Psi(mathcalN(01)timesmathcalN(01)) and f(-1515 -1515) we use the Wasserstein distance[3].","category":"page"},{"location":"tutorials/grassmann_layer/","page":"Grassmann manifold","title":"Grassmann manifold","text":"[3]: The implementation of the Wasserstein distance is taken from [17].","category":"page"},{"location":"tutorials/grassmann_layer/","page":"Grassmann manifold","title":"Grassmann manifold","text":"using GeometricMachineLearning, Zygote, BrenierTwoFluid\nusing LinearAlgebra: norm # hide\nimport Random # hide \nRandom.seed!(123)\n\nmodel = Chain(GrassmannLayer(2,3), Dense(3, 8, tanh), Dense(8, 3, identity))\n\nnn = NeuralNetwork(model, CPU(), Float64)\n\n# this computes the cost that is associated to the Wasserstein distance\nc = (x,y) -> .5 * norm(x - y)^2\n∇c = (x,y) -> x - y\n\nconst ε = 0.1 # entropic regularization. √ε is a length. 
# hide\nconst q = 1.0 # annealing parameter # hide\nconst Δ = 1.0 # characteristic domain size # hide\nconst s = ε # current scale: no annealing -> equals ε # hide\nconst tol = 1e-6 # marginal condition tolerance # hide \nconst crit_it = 20 # acceleration inference # hide\nconst p_η = 2\n\nfunction compute_wasserstein_gradient(ensemble1::AT, ensemble2::AT) where AT<:AbstractArray\n number_of_particles1 = size(ensemble1, 2)\n number_of_particles2 = size(ensemble2, 2)\n V = SinkhornVariable(copy(ensemble1'), ones(number_of_particles1) / number_of_particles1)\n W = SinkhornVariable(copy(ensemble2'), ones(number_of_particles2) / number_of_particles2)\n params = SinkhornParameters(; ε=ε,q=1.0,Δ=1.0,s=s,tol=tol,crit_it=crit_it,p_η=p_η,sym=false,acc=true) # hide\n S = SinkhornDivergence(V, W, c, params; islog = true)\n initialize_potentials!(S)\n compute!(S)\n value(S), x_gradient!(S, ∇c)'\nend\n\nxyz_points = hcat([[x,y,rosenbrock([x,y])] for x in x for y in y]...)\n\nfunction compute_gradient(ps::Tuple)\n samples = randn(2, size(xyz_points, 2))\n\n estimate, nn_pullback = Zygote.pullback(ps -> model(samples, ps), ps)\n\n valS, wasserstein_gradient = compute_wasserstein_gradient(estimate, xyz_points)\n valS, nn_pullback(wasserstein_gradient)[1]\nend\n\n# note the very high value for the learning rate\noptimizer = Optimizer(nn, AdamOptimizer(1e-1))\n\n# note the small number of training steps\nconst training_steps = 40\nloss_array = zeros(training_steps)\nfor i in 1:training_steps\n val, dp = compute_gradient(nn.params)\n loss_array[i] = val\n optimization_step!(optimizer, model, nn.params, dp)\nend\nplot(loss_array, xlabel=\"training step\", label=\"loss\")","category":"page"},{"location":"tutorials/grassmann_layer/","page":"Grassmann manifold","title":"Grassmann manifold","text":"Now we plot a few points to check how well they match the graph:","category":"page"},{"location":"tutorials/grassmann_layer/","page":"Grassmann manifold","title":"Grassmann 
manifold","text":"const number_of_points = 35\n\ncoordinates = nn(randn(2, number_of_points))\nscatter3d!(p, [coordinates[1, :]], [coordinates[2, :]], [coordinates[3, :]], alpha=.5, color=4, label=\"mapped points\")","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/#Symplectic-Autoencoder","page":"PSD and Symplectic Autoencoders","title":"Symplectic Autoencoder","text":"","category":"section"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"Symplectic Autoencoders are a type of neural network suitable for treating Hamiltonian parametrized PDEs with slowly decaying Kolmogorov n-width. They are based on proper symplectic decomposition (PSD) and symplectic neural networks (SympNets).","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/#Hamiltonian-Model-Order-Reduction","page":"PSD and Symplectic Autoencoders","title":"Hamiltonian Model Order Reduction","text":"","category":"section"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"Hamiltonian PDEs are partial differential equations that, like their ODE counterparts, have an associated Hamiltonian. 
An example of this is the linear wave equation (see [10]) with Hamiltonian ","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"mathcalH(q p mu) = frac12int_Omegamu^2(partial_xiq(tximu))^2 + p(tximu)^2dxi","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"The PDE for this Hamiltonian can be obtained in the same way as in the ODE case:","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"partial_tq(tximu) = fracdeltamathcalHdeltap = p(tximu) quad partial_tp(tximu) = -fracdeltamathcalHdeltaq = mu^2partial_xixiq(tximu)","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/#Symplectic-Solution-Manifold","page":"PSD and Symplectic Autoencoders","title":"Symplectic Solution Manifold","text":"","category":"section"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"As with regular parametric PDEs, we also associate a solution manifold with Hamiltonian PDEs. This is a finite-dimensional manifold, on which the dynamics can be described through a Hamiltonian ODE. (A proof or further explanation of this statement is still missing.)","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/#Workflow-for-Symplectic-ROM","page":"PSD and Symplectic Autoencoders","title":"Workflow for Symplectic ROM","text":"","category":"section"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"As with any other reduced order modeling technique we first discretize the PDE. 
This should be done with a structure-preserving scheme, thus yielding a (high-dimensional) Hamiltonian ODE as a result. Discretizing the wave equation above with finite differences yields a Hamiltonian system: ","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"mathcalH_mathrmdiscr(z(tmu)mu) = frac12x(tmu)^Tbeginbmatrix -mu^2D_xixi mathbbO mathbbO mathbbI endbmatrix x(tmu)","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"In Hamiltonian reduced order modelling we try to find a symplectic submanifold of the solution space[1] that captures the dynamics of the full system as well as possible.","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"[1]: The submanifold is: tildemathcalM = Psi^mathrmdec(z_r)inmathbbR^2Nu_rinmathrmR^2n where z_r is the reduced state of the system. ","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"Similar to the regular PDE case we again build an encoder Psi^mathrmenc and a decoder Psi^mathrmdec; but now both these mappings are required to be symplectic!","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"Concretely this means: ","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"The encoder is a mapping from a high-dimensional symplectic space to a low-dimensional symplectic space, i.e. 
Psi^mathrmencmathbbR^2NtomathbbR^2n such that nablaPsi^mathrmencmathbbJ_2N(nablaPsi^mathrmenc)^T = mathbbJ_2n.\nThe decoder is a mapping from a low-dimensional symplectic space to a high-dimensional symplectic space, i.e. Psi^mathrmdecmathbbR^2ntomathbbR^2N such that (nablaPsi^mathrmdec)^TmathbbJ_2NnablaPsi^mathrmdec = mathbbJ_2n.","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"If these two maps are constrained to linear maps, then one can easily find good solutions with proper symplectic decomposition (PSD).","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/#Proper-Symplectic-Decomposition","page":"PSD and Symplectic Autoencoders","title":"Proper Symplectic Decomposition","text":"","category":"section"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"For PSD the two mappings Psi^mathrmenc and Psi^mathrmdec are constrained to be linear, orthonormal (i.e. Psi^TPsi = mathbbI) and symplectic. The easiest way to enforce this is through the so-called cotangent lift: ","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"Psi_mathrmCL = \nbeginbmatrix Phi mathbbO mathbbO Phi endbmatrix","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"and PhiinSt(nN)subsetmathbbR^Ntimesn, i.e. is an element of the Stiefel manifold. 
If the snapshot matrix is of the form: ","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"M = leftbeginarraycccc\nhatq_1(t_0) hatq_1(t_1) quadldotsquad hatq_1(t_f) \nhatq_2(t_0) hatq_2(t_1) ldots hatq_2(t_f) \nldots ldots ldots ldots \nhatq_N(t_0) hatq_N(t_1) ldots hatq_N(t_f) \nhatp_1(t_0) hatp_1(t_1) ldots hatp_1(t_f) \nhatp_2(t_0) hatp_2(t_1) ldots hatp_2(t_f) \nldots ldots ldots ldots \nhatp_N(t_0) hatp_N(t_1) ldots hatp_N(t_f) \nendarrayright","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"then Phi can be computed in a straightforward manner: ","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"Rearrange the rows of the matrix M such that we end up with an Ntimes2(f+1) matrix: hatM = M_q M_p.\nPerform SVD: hatM = USigmaV^T; set PhigetsUmathtt1n.","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"For details on the cotangent lift (and other methods for linear symplectic model reduction) consult [11].","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/#Symplectic-Autoencoders","page":"PSD and Symplectic Autoencoders","title":"Symplectic Autoencoders","text":"","category":"section"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"PSD suffers from the same shortcomings as regular POD: it is a linear map and the approximation space tildemathcalM= Psi^mathrmdec(z_r)inmathbbR^2Nu_rinmathrmR^2n is strictly linear. 
For problems with slowly-decaying Kolmogorov n-width this leads to very poor approximations. ","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"In order to overcome this difficulty we use neural networks, more specifically SympNets, together with cotangent lift-like matrices. The resulting architecture, the symplectic autoencoder, is shown in the following image: ","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"import Images, Plots # hide\nif Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nelse # hide\n Plots.plot(Images.load(\"../tikz/symplectic_autoencoder.png\"), axis=([], false)) # hide\nend # hide","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"if Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nend # hide","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"So we alternate between SympNet and PSD layers. Because all the PSD layers are based on matrices PhiinSt(nN) we have to optimize on the Stiefel manifold.","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/#References","page":"PSD and Symplectic Autoencoders","title":"References","text":"","category":"section"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"P. Buchfink, S. Glas and B. Haasdonk. Symplectic model reduction of Hamiltonian systems on nonlinear manifolds and approximation with weakly symplectic autoencoder. 
SIAM Journal on Scientific Computing 45, A289–A311 (2023).\n\n\n\nL. Peng and K. Mohseni. Symplectic model reduction of Hamiltonian systems. SIAM Journal on Scientific Computing 38, A1–A27 (2016).\n\n\n\n","category":"page"},{"location":"tutorials/linear_wave_equation/#The-Linear-Wave-Equation","page":"Linear Wave Equation","title":"The Linear Wave Equation","text":"","category":"section"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"The linear wave equation is the prototypical example of a Hamiltonian PDE. Its Hamiltonian is given by (see [10] and [11]): ","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"mathcalH(q p mu) = frac12int_Omegamu^2(partial_xiq(tximu))^2 + p(tximu)^2dxi","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"with xiinOmega=(-1212) and muinmathbbP=51256 as a possible choice for domain and parameters. 
","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"The PDE for this Hamiltonian can be obtained in the same way as in the ODE case:","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"partial_tq(tximu) = fracdeltamathcalHdeltap = p(tximu) quad partial_tp(tximu) = -fracdeltamathcalHdeltaq = mu^2partial_xixiq(tximu)","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"As with any other PDE, the wave equation can also be discretized to obtain an ODE which can be solved numerically.","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"If we discretize mathcalH directly, to obtain a Hamiltonian on a finite-dimensional vector space mathbbR^2N, we get a Hamiltonian ODE[1]:","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"[1]: This conserves the Hamiltonian structure of the system.","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"mathcalH_h(z) = sum_i=1^tildeNfracDeltax2biggp_i^2 + mu^2frac(q_i - q_i-1)^2 + (q_i+1 - q_i)^22Deltax^2bigg = fracDeltax2p^Tp + q^TKq","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"where the matrix K contains elements of the form: ","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"k_ij = begincases fracmu^24Deltax textif (ij)in(00)(tildeN+1tildeN+1) \n -fracmu^22Deltax textif (ij)=(10) or (ij)=(tildeNtildeN+1) \n frac3mu^24Deltax textif (ij)in(11)(tildeNtildeN) \n fracmu^2Deltax textif 
i=j and iin2ldots(tildeN-2) \n -fracmu^22Deltax textif i-j=1 and ijnotin0tildeN+1 \n 0 textelse\n endcases","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"The vector field of the full-order model (FOM) is described by (see for example (Peng and Mohseni, 2016)):","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":" fracdzdt = mathbbJ_dnabla_zmathcalH_h = mathbbJ_dbeginbmatrixDeltaxmathbbI mathbbO mathbbO K + K^Tendbmatrixz quad mathbbJ_d = fracmathbbJ_2NDeltax","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"The wave equation has a slowly-decaying Kolmogorov n-width (see e.g. Greif and Urban, 2019), which means linear methods like PSD will perform poorly.","category":"page"},{"location":"tutorials/linear_wave_equation/#Using-the-Linear-Wave-Equation-in-Numerical-Experiments","page":"Linear Wave Equation","title":"Using the Linear Wave Equation in Numerical Experiments","text":"","category":"section"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"In order to use the linear wave equation in numerical experiments we have to pick suitable initial conditions. 
For this, consider the third-order spline: ","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"h(s) = begincases\n 1 - frac32s^2 + frac34s^3 textif 0 leq s leq 1 \n frac14(2 - s)^3 textif 1 < s leq 2 \n 0 textelse \nendcases","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"Plotted on the relevant domain it looks like this: ","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"import Images, Plots # hide\nif Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nelse # hide\n Plots.plot(Images.load(\"../tikz/third_degree_spline.png\"), axis=([], false)) # hide\nend # hide","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"if Main.output_type == :html # hide \n HTML(\"\"\"\"\"\") # hide\nend # hide","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"Taking the above function h(s) as a starting point, the initial conditions for the linear wave equation will now be constructed under the following considerations: ","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"the initial condition (i.e. the shape of the wave) should depend on the parameter of the vector field, i.e. u_0(mu)(omega) = h(s(omega mu)).\nthe solutions of the linear wave equation will travel with speed mu, and we should make sure that the wave does not touch the right boundary of the domain, i.e. 0.5. So the peak should be sharper for higher values of mu as the wave will travel faster.\nthe wave should start at the left boundary of the domain, i.e. 
at point -0.5, so as to cover as much of the domain as possible. 
","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"Based on this we end up with the following choice of parametrized initial conditions: ","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"u_0(mu)(omega) = h(s(omega mu)) quad s(omega mu) = 20 mu omega + fracmu2","category":"page"},{"location":"tutorials/linear_wave_equation/#References","page":"Linear Wave Equation","title":"References","text":"","category":"section"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"P. Buchfink, S. Glas and B. Haasdonk. Symplectic model reduction of Hamiltonian systems on nonlinear manifolds and approximation with weakly symplectic autoencoder. SIAM Journal on Scientific Computing 45, A289–A311 (2023).\n\n\n\nL. Peng and K. Mohseni. Symplectic model reduction of Hamiltonian systems. SIAM Journal on Scientific Computing 38, A1–A27 (2016).\n\n\n\nC. Greif and K. Urban. Decay of the Kolmogorov N-width for wave problems. Applied Mathematics Letters 96, 216–222 (2019).\n\n\n\n","category":"page"},{"location":"layers/attention_layer/#The-Attention-Layer","page":"Attention","title":"The Attention Layer","text":"","category":"section"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"The attention mechanism was originally applied for image and natural language processing (NLP) tasks. 
In (Bahdanau et al, 2014) ``additive'' attention is used: ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"(z_q, z_k) \\mapsto v^T\\sigma(Wz_q + Uz_k).","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"However ``multiplicative'' attention is more straightforward to interpret and cheaper to handle computationally: ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"(z_q, z_k) \\mapsto z_q^TWz_k.","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"Regardless of the type used, all attention mechanisms compute correlations among input sequences, on the basis of which further neural-network-based computation is performed. So given two input sequences (z_q^{(1)}, \\ldots, z_q^{(T)}) and (z_k^{(1)}, \\ldots, z_k^{(T)}), an attention mechanism always returns an output C\\in\\mathbb{R}^{T\\times{}T} with entries C_{ij} = \\mathtt{attention}(z_q^{(i)}, z_k^{(j)}).","category":"page"},{"location":"layers/attention_layer/#Self-Attention","page":"Attention","title":"Self Attention","text":"","category":"section"},{"location":"layers/attention_layer/#Attention-in-GeometricMachineLearning","page":"Attention","title":"Attention in GeometricMachineLearning","text":"","category":"section"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"The attention layer (and the orthonormal activation function defined for it) in GeometricMachineLearning was specifically designed to generalize transformers to symplectic data. 
Usually a self-attention layer takes the following form: ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"Z = [z^{(1)}, \\ldots, z^{(T)}] \\mapsto Z\\mathrm{softmax}((P^QZ)^T(P^KZ)),","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"where we left out the linear mapping onto the values P^V. ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"The idea behind this is that we can perform a non-linear re-weighting of the columns of Z by multiplying with a Z-dependent matrix from the right and therefore take the sequential nature of the data into account (which is not possible with normal neural networks). After the attention step the transformer applies a simple ResNet from the left.","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"The softmax is a vector-wise operation, i.e. it operates on each column of an input matrix A = [a_1, \\ldots, a_T]. The result is a sequence of probability vectors [p^{(1)}, \\ldots, p^{(T)}] for which ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"\\sum_{i=1}^Tp^{(j)}_i=1\\quad\\forall{}j\\in\\{1,\\dots,T\\}.","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"What we want to construct is a symplectic transformation that is transformer-like. 
For this we modify the attention layer in the following way: ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"Z = [z^{(1)}, \\ldots, z^{(T)}] \\mapsto Z\\sigma((P^QZ)^T(P^KZ)),","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"where \\sigma(A)=\\exp(\\mathtt{upper\\_triangular\\_asymmetrize}(A)) and ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"[\\mathtt{upper\\_triangular\\_asymmetrize}(A)]_{ij} = \\begin{cases} a_{ij} & \\text{if } i<j \\\\ -a_{ji} & \\text{if } i>j \\\\ 0 & \\text{else.}\\end{cases}","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"As a consequence the matrix \\Lambda(Z) = \\sigma((P^QZ)^T(P^KZ)) is orthonormal and hence preserves an extended symplectic structure. To make this clearer, consider that the transformer maps sequences of vectors to sequences of vectors, i.e. V\\times\\cdots\\times{}V \\ni [z^1, \\ldots, z^T] \\mapsto [\\hat{z}^1, \\ldots, \\hat{z}^T]. We can define a symplectic structure on V\\times\\cdots\\times{}V by rearranging [z^1, \\ldots, z^T] into a vector. 
We do this in the following way: ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"\\tilde{Z} = \\begin{pmatrix} q^{(1)}_1 \\\\ q^{(2)}_1 \\\\ \\cdots \\\\ q^{(T)}_1 \\\\ q^{(1)}_2 \\\\ \\cdots \\\\ q^{(T)}_d \\\\ p^{(1)}_1 \\\\ p^{(2)}_1 \\\\ \\cdots \\\\ p^{(T)}_1 \\\\ p^{(1)}_2 \\\\ \\cdots \\\\ p^{(T)}_d \\end{pmatrix}.","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"The symplectic structure on this big space is then: ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"\\mathbb{J}=\\begin{pmatrix}\n \\mathbb{O}_{dT} & \\mathbb{I}_{dT} \\\\ \n -\\mathbb{I}_{dT} & \\mathbb{O}_{dT}\n\\end{pmatrix}.","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"Multiplying with the matrix \\Lambda(Z) from the right onto [z^1, \\ldots, z^T] corresponds to applying the sparse matrix ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"\\tilde{\\Lambda}(Z)=\\left[\n\\begin{array}{ccc}\n \\Lambda(Z) & \\cdots & \\mathbb{O}_T \\\\ \n \\vdots & \\ddots & \\vdots \\\\ \n \\mathbb{O}_T & \\cdots & \\Lambda(Z) \n \\end{array}\n\\right]","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"from the left onto the big vector. ","category":"page"},{"location":"layers/attention_layer/#Historical-Note","page":"Attention","title":"Historical Note","text":"","category":"section"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"Attention was used before, but always in connection with recurrent neural networks (see (Luong et al, 2015) and (Bahdanau et al, 2014)). ","category":"page"},{"location":"layers/attention_layer/#References","page":"Attention","title":"References","text":"","category":"section"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"D. Bahdanau, K. Cho and Y. Bengio. Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473 (2014).\n\n\n\nM.-T. Luong, H. Pham and C. 
D. Manning. Effective approaches to attention-based neural machine translation, arXiv preprint arXiv:1508.04025 (2015).\n\n\n\n","category":"page"},{"location":"manifolds/homogeneous_spaces/#Homogeneous-Spaces","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"","category":"section"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"Homogeneous spaces are manifolds \\mathcal{M} on which a Lie group G acts transitively, i.e. ","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"\\forall X,Y\\in\\mathcal{M} \\quad \\exists{}A\\in{}G\\text{ s.t. }AX = Y.","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"Now fix a distinguished element E\\in\\mathcal{M}. We can also establish an isomorphism between \\mathcal{M} and the quotient space G/\\sim with the equivalence relation: ","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"A_1 \\sim A_2 \\iff A_1E = A_2E.","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"Note that this is independent of the chosen E.","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"The tangent spaces of \\mathcal{M} are of the form T_Y\\mathcal{M} = \\mathfrak{g}\\cdot{}Y, i.e. they can be fully described through the Lie algebra \\mathfrak{g} of G. Based on this we can perform a splitting of \\mathfrak{g} into two parts:","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"The vertical component \\mathfrak{g}^{\\mathrm{ver},Y} is the kernel of the map \\mathfrak{g}\\to{}T_Y\\mathcal{M}, V \\mapsto VY, i.e. 
\\mathfrak{g}^{\\mathrm{ver},Y} = \\{V\\in\\mathfrak{g}: VY = 0\\}.\nThe horizontal component \\mathfrak{g}^{\\mathrm{hor},Y} is the orthogonal complement of \\mathfrak{g}^{\\mathrm{ver},Y} in \\mathfrak{g}. It is isomorphic to T_Y\\mathcal{M}.","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"We will refer to the mapping from T_Y\\mathcal{M} to \\mathfrak{g}^{\\mathrm{hor},Y} by \\Omega. If we define a metric \\langle\\cdot,\\cdot\\rangle on \\mathfrak{g}, then this induces a Riemannian metric on \\mathcal{M}:","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"g_Y(\\Delta_1, \\Delta_2) = \\langle\\Omega(Y,\\Delta_1),\\Omega(Y,\\Delta_2)\\rangle\\text{ for }\\Delta_1,\\Delta_2\\in{}T_Y\\mathcal{M}.","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"Two examples of homogeneous spaces implemented in GeometricMachineLearning are the Stiefel and the Grassmann manifolds.","category":"page"},{"location":"manifolds/homogeneous_spaces/#References","page":"Homogeneous Spaces","title":"References","text":"","category":"section"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"Frankel, Theodore. The geometry of physics: an introduction. 
Cambridge university press, 2011.","category":"page"},{"location":"reduced_order_modeling/kolmogorov_n_width/#Kolmogorov-n-width","page":"Kolmogorov n-width","title":"Kolmogorov n-width","text":"","category":"section"},{"location":"reduced_order_modeling/kolmogorov_n_width/","page":"Kolmogorov n-width","title":"Kolmogorov n-width","text":"The Kolmogorov n-width measures how well some set \\mathcal{M} (typically the solution manifold) can be approximated with a linear subspace:","category":"page"},{"location":"reduced_order_modeling/kolmogorov_n_width/","page":"Kolmogorov n-width","title":"Kolmogorov n-width","text":"d_n(\\mathcal{M}) = \\mathrm{inf}_{V_n\\subset{}V;\\mathrm{dim}V_n=n}\\mathrm{sup}_{u\\in\\mathcal{M}}\\mathrm{inf}_{v_n\\in{}V_n}\\| u - v_n \\|_V,","category":"page"},{"location":"reduced_order_modeling/kolmogorov_n_width/","page":"Kolmogorov n-width","title":"Kolmogorov n-width","text":"where \\mathcal{M}\\subset{}V and V is a (typically infinite-dimensional) Banach space. For advection-dominated problems (among others) the decay of the Kolmogorov n-width is very slow, i.e. one has to pick n very high in order to obtain useful approximations (see [12] and [13]).","category":"page"},{"location":"reduced_order_modeling/kolmogorov_n_width/","page":"Kolmogorov n-width","title":"Kolmogorov n-width","text":"In order to overcome this, techniques based on neural networks (see e.g. [14]) and optimal transport (see e.g. [13]) have been used. ","category":"page"},{"location":"reduced_order_modeling/kolmogorov_n_width/#References","page":"Kolmogorov n-width","title":"References","text":"","category":"section"},{"location":"reduced_order_modeling/kolmogorov_n_width/","page":"Kolmogorov n-width","title":"Kolmogorov n-width","text":"T. Blickhan. A registration method for reduced basis problems using linear optimal transport, arXiv preprint arXiv:2304.14884 (2023).\n\n\n\nC. Greif and K. Urban. Decay of the Kolmogorov N-width for wave problems. Applied Mathematics Letters 96, 216–222 (2019).\n\n\n\nK. Lee and K. T. Carlberg. 
Model reduction of dynamical systems on nonlinear manifolds using deep convolutional autoencoders. Journal of Computational Physics 404, 108973 (2020).\n\n\n\n","category":"page"},{"location":"layers/volume_preserving_feedforward/#Volume-Preserving-Feedforward-Layer","page":"Volume-Preserving Layers","title":"Volume-Preserving Feedforward Layer","text":"","category":"section"},{"location":"layers/volume_preserving_feedforward/","page":"Volume-Preserving Layers","title":"Volume-Preserving Layers","text":"Volume-preserving feedforward layers are a special type of ResNet layer for which we restrict the weight matrices to be of a particular form. That is, each layer computes: ","category":"page"},{"location":"layers/volume_preserving_feedforward/","page":"Volume-Preserving Layers","title":"Volume-Preserving Layers","text":"x \\mapsto x + \\sigma(Ax + b),","category":"page"},{"location":"layers/volume_preserving_feedforward/","page":"Volume-Preserving Layers","title":"Volume-Preserving Layers","text":"where \\sigma is a nonlinearity, A is the weight and b is the bias. The matrix A is either a lower-triangular matrix L or an upper-triangular matrix U[1]. 
The lower-triangular matrix is of the form (the upper-triangular matrix is simply the transpose of the lower-triangular one): ","category":"page"},{"location":"layers/volume_preserving_feedforward/","page":"Volume-Preserving Layers","title":"Volume-Preserving Layers","text":"[1]: Implemented as LowerTriangular and UpperTriangular in GeometricMachineLearning.","category":"page"},{"location":"layers/volume_preserving_feedforward/","page":"Volume-Preserving Layers","title":"Volume-Preserving Layers","text":"L = \\begin{pmatrix}\n 0 & 0 & \\cdots & 0 \\\\ \n a_{21} & \\ddots & & \\vdots \\\\ \n \\vdots & \\ddots & \\ddots & \\vdots \\\\ \n a_{n1} & \\cdots & a_{n(n-1)} & 0 \n\\end{pmatrix}.","category":"page"},{"location":"layers/volume_preserving_feedforward/","page":"Volume-Preserving Layers","title":"Volume-Preserving Layers","text":"The Jacobian of such a layer is then of the form","category":"page"},{"location":"layers/volume_preserving_feedforward/","page":"Volume-Preserving Layers","title":"Volume-Preserving Layers","text":"J = \\begin{pmatrix}\n 1 & 0 & \\cdots & 0 \\\\ \n b_{21} & \\ddots & & \\vdots \\\\ \n \\vdots & \\ddots & \\ddots & \\vdots \\\\ \n b_{n1} & \\cdots & b_{n(n-1)} & 1 \n\\end{pmatrix},","category":"page"},{"location":"layers/volume_preserving_feedforward/","page":"Volume-Preserving Layers","title":"Volume-Preserving Layers","text":"and the determinant of J is 1, i.e. the map is volume-preserving. ","category":"page"},{"location":"layers/volume_preserving_feedforward/#Neural-network-architecture","page":"Volume-Preserving Layers","title":"Neural network architecture","text":"","category":"section"},{"location":"layers/volume_preserving_feedforward/","page":"Volume-Preserving Layers","title":"Volume-Preserving Layers","text":"Volume-preserving feedforward neural networks should be used as Architectures in GeometricMachineLearning. 
The constructor for them is: ","category":"page"},{"location":"layers/volume_preserving_feedforward/","page":"Volume-Preserving Layers","title":"Volume-Preserving Layers","text":"using GeometricMachineLearning, Markdown\nMarkdown.parse(description(Val(:VPFconstructor)))","category":"page"},{"location":"layers/volume_preserving_feedforward/","page":"Volume-Preserving Layers","title":"Volume-Preserving Layers","text":"The constructor produces the following architecture[2]:","category":"page"},{"location":"layers/volume_preserving_feedforward/","page":"Volume-Preserving Layers","title":"Volume-Preserving Layers","text":"[2]: Based on the input arguments n_linear and n_blocks. In this example init_upper is set to false, which means that the first layer is of type lower followed by a layer of type upper. ","category":"page"},{"location":"layers/volume_preserving_feedforward/","page":"Volume-Preserving Layers","title":"Volume-Preserving Layers","text":"import Images, Plots # hide\nif Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nelse # hide\n Plots.plot(Images.load(\"../tikz/vp_feedforward.png\"), axis=([], false)) # hide\nend # hide","category":"page"},{"location":"layers/volume_preserving_feedforward/","page":"Volume-Preserving Layers","title":"Volume-Preserving Layers","text":"if Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nend # hide","category":"page"},{"location":"layers/volume_preserving_feedforward/","page":"Volume-Preserving Layers","title":"Volume-Preserving Layers","text":"Here LinearLowerLayer performs x \\mapsto x + Lx and NonLinearLowerLayer performs x \\mapsto x + \\sigma(Lx + b). The activation function \\sigma is the fourth input argument to the constructor and is tanh by default. 
","category":"page"},{"location":"layers/volume_preserving_feedforward/#Note-on-Sympnets","page":"Volume-Preserving Layers","title":"Note on Sympnets","text":"","category":"section"},{"location":"layers/volume_preserving_feedforward/","page":"Volume-Preserving Layers","title":"Volume-Preserving Layers","text":"As SympNets are symplectic maps, they also conserve phase space volume and therefore form a subcategory of volume-preserving feedforward layers. ","category":"page"},{"location":"optimizers/bfgs_optimizer/#The-BFGS-Algorithm","page":"BFGS Optimizer","title":"The BFGS Algorithm","text":"","category":"section"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"The presentation shown here is largely taken from chapters 3 and 6 of reference [9] with a derivation based on an online comment. The Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm is a second-order optimizer that can also be used to train a neural network.","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"It is a version of a quasi-Newton method and is therefore especially suited for convex problems. As is the case with any other (quasi-)Newton method, the BFGS algorithm approximates the objective with a quadratic function in each optimization step:","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"m_k(x) = f(x_k) + (\\nabla_{x_k}f)^T(x - x_k) + \\frac{1}{2}(x - x_k)^TB_k(x - x_k),","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"where B_k is referred to as the approximate Hessian. We further require B_k to be symmetric and positive definite. 
Differentiating the above expression and setting the derivative to zero gives us: ","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"\\nabla_xm_k = \\nabla_{x_k}f + B_k(x - x_k) = 0,","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"or written differently: ","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"x - x_k = -B_k^{-1}\\nabla_{x_k}f.","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"From now on we call this value p_k = x - x_k and refer to it as the search direction. The new iterate then is: ","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"x_{k+1} = x_k + \\alpha_kp_k,","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"where \\alpha_k is the step length. Techniques that describe how to pick an appropriate \\alpha_k are called line-search methods and are discussed below. First we discuss what requirements we impose on B_k. A first reasonable condition would be to require the gradient of m_k to be equal to that of f at the points x_{k-1} and x_k: ","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"\\begin{aligned}\n\\nabla_{x_k}m_k & = \\nabla_{x_k}f + B_k(x_k - x_k) \\overset{!}{=} \\nabla_{x_k}f \\text{ and } \\\\ \n\\nabla_{x_{k-1}}m_k & = \\nabla_{x_k}f + B_k(x_{k-1} - x_k) \\overset{!}{=} \\nabla_{x_{k-1}}f.\n\\end{aligned}","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"The first one of these conditions is of course automatically satisfied. 
The second one can be rewritten as: ","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"B_k(x_k - x_{k-1}) \\overset{!}{=} \\nabla_{x_k}f - \\nabla_{x_{k-1}}f. ","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"The following notations are often used: ","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"s_{k-1} = \\alpha_{k-1}p_{k-1} = x_k - x_{k-1} \\text{ and } y_{k-1} = \\nabla_{x_k}f - \\nabla_{x_{k-1}}f. ","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"The condition mentioned above then becomes: ","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"B_ks_{k-1} \\overset{!}{=} y_{k-1},","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"and we call it the secant equation. A second condition we impose on B_k is that it has to be positive-definite in the direction s_{k-1}:","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"s_{k-1}^Ty_{k-1} > 0.","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"This is referred to as the curvature condition. If we impose the Wolfe conditions, the curvature condition holds automatically. 
The Wolfe conditions are stated with respect to the parameter \\alpha_k.","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"The Wolfe conditions are:","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"f(x_k+\\alpha_kp_k)\\leq{}f(x_k) + c_1\\alpha_k(\\nabla_{x_k}f)^Tp_k for c_1\\in(0,1).\n(\\nabla_{(x_k + \\alpha_kp_k)}f)^Tp_k \\geq c_2(\\nabla_{x_k}f)^Tp_k for c_2\\in(c_1,1).","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"A possible choice for c_1 and c_2 is 10^{-4} and 0.9 (see [9]). The two Wolfe conditions above are called the sufficient decrease condition and the curvature condition respectively. Note that the second Wolfe condition (also called curvature condition) is stronger than the one mentioned before under the assumption that the first Wolfe condition is true:","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"(\\nabla_{x_k}f)^Tp_{k-1} - c_2(\\nabla_{x_{k-1}}f)^Tp_{k-1} = y_{k-1}^Tp_{k-1} + (1 - c_2)(\\nabla_{x_{k-1}}f)^Tp_{k-1} \\geq 0,","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"and the second term in this expression is (1 - c_2)(\\nabla_{x_{k-1}}f)^Tp_{k-1}\\geq\\frac{1-c_2}{c_1\\alpha_{k-1}}(f(x_k) - f(x_{k-1})), which is negative. 
","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"In order to pick the ideal B_k we solve the following problem: ","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"\\begin{aligned}\n\\min_B & \\quad \\|B - B_{k-1}\\|_W \\\\ \n\\text{s.t.} & \\quad B = B^T\\text{ and }Bs_{k-1}=y_{k-1},\n\\end{aligned}","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"where the first condition is symmetry and the second one is the secant equation. For the norm \\|\\cdot\\|_W we pick the weighted Frobenius norm:","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"\\|A\\|_W := \\|W^{1/2}AW^{1/2}\\|_F,","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"where \\|\\cdot\\|_F is the usual Frobenius norm[1] and the matrix W=\\tilde{B}_{k-1}^{-1} is the inverse of the average Hessian:","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"\\tilde{B}_{k-1} = \\int_0^1 \\nabla^2f(x_{k-1} + \\tau\\alpha_{k-1}p_{k-1})d\\tau.","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"[1]: The Frobenius norm is \\|A\\|_F^2 = \\sum_{i,j}a_{ij}^2.","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"In order to find the ideal B_k under the conditions described above, we introduce some notation: ","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"\\tilde{B}_{k-1} := W^{1/2}B_{k-1}W^{1/2},\n\\tilde{B} := W^{1/2}BW^{1/2}, \n\\tilde{y}_{k-1} := W^{1/2}y_{k-1}, \n\\tilde{s}_{k-1} := W^{-1/2}s_{k-1}.","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"With this notation we can rewrite the problem of finding B_k as: 
","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"\\begin{aligned}\n\\min_{\\tilde{B}} & \\quad \\|\\tilde{B} - \\tilde{B}_{k-1}\\|_F \\\\ \n\\text{s.t.} & \\quad \\tilde{B} = \\tilde{B}^T\\text{ and }\\tilde{B}\\tilde{s}_{k-1}=\\tilde{y}_{k-1}.\n\\end{aligned}","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"We further have Wy_{k-1} = s_{k-1} (by Taylor's theorem in integral form, y_{k-1} equals the average Hessian applied to s_{k-1}) and therefore \\tilde{y}_{k-1} = \\tilde{s}_{k-1}.","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"Now we rewrite \\tilde{B} and \\tilde{B}_{k-1} in a new basis U = [u, u_\\perp], where u = \\tilde{s}_{k-1}/\\|\\tilde{s}_{k-1}\\| and u_\\perp is an orthogonal complement[2] of u:","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"[2]: So we must have u^Tu_\\perp=0 and further u_\\perp^Tu_\\perp=\\mathbb{I}.","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"\\begin{aligned}\nU^T\\tilde{B}_{k-1}U - U^T\\tilde{B}U = \\begin{bmatrix} u^T \\\\ u_\\perp^T \\end{bmatrix}(\\tilde{B}_{k-1} - \\tilde{B})\\begin{bmatrix} u & u_\\perp \\end{bmatrix} = \n\\begin{bmatrix}\n u^T\\tilde{B}_{k-1}u - 1 & u^T\\tilde{B}_{k-1}u_\\perp \\\\ \n u_\\perp^T\\tilde{B}_{k-1}u & u_\\perp^T(\\tilde{B}_{k-1}-\\tilde{B})u_\\perp\n\\end{bmatrix}.\n\\end{aligned}","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"By a property of the Frobenius norm: ","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"\\|\\tilde{B}_{k-1} - \\tilde{B}\\|^2_F = (u^T\\tilde{B}_{k-1}u -1)^2 + \\|u^T\\tilde{B}_{k-1}u_\\perp\\|_F^2 + \\|u_\\perp^T\\tilde{B}_{k-1}u\\|_F^2 + \\|u_\\perp^T(\\tilde{B}_{k-1} - \\tilde{B})u_\\perp\\|_F^2.","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"We see that \\tilde{B} only appears in the last term, which should therefore be made zero. 
This then gives: ","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"\\tilde{B} = U\\begin{bmatrix} 1 & 0 \\\\ 0 & u^T_\\perp\\tilde{B}_{k-1}u_\\perp \\end{bmatrix}U^T = uu^T + (\\mathbb{I}-uu^T)\\tilde{B}_{k-1}(\\mathbb{I}-uu^T).","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"If we now map back to the original coordinate system, the ideal solution for B_k is: ","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"B_k = (\\mathbb{I} - \\frac{1}{y_{k-1}^Ts_{k-1}}y_{k-1}s_{k-1}^T)B_{k-1}(\\mathbb{I} - \\frac{1}{y_{k-1}^Ts_{k-1}}s_{k-1}y_{k-1}^T) + \\frac{1}{y_{k-1}^Ts_{k-1}}y_{k-1}y_{k-1}^T.","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"What we need in practice however is not B_k, but its inverse H_k. This is because we need to find s_{k-1} based on y_{k-1}. To get H_k based on the expression for B_k above we can use the Sherman-Morrison-Woodbury formula[3] to obtain:","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"[3]: The Sherman-Morrison-Woodbury formula states (A + UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1}.","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"H_k = H_{k-1} - \\frac{H_{k-1}y_{k-1}y_{k-1}^TH_{k-1}}{y_{k-1}^TH_{k-1}y_{k-1}} + \\frac{s_{k-1}s_{k-1}^T}{y_{k-1}^Ts_{k-1}}.","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"TODO: Example where this works well!","category":"page"},{"location":"optimizers/bfgs_optimizer/#References","page":"BFGS Optimizer","title":"References","text":"","category":"section"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"J. Nocedal and S. J. Wright. 
Numerical optimization (Springer Science+Business Media, 2006).\n\n\n\n","category":"page"},{"location":"optimizers/manifold_related/global_sections/#Global-Sections","page":"Global Sections","title":"Global Sections","text":"","category":"section"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"Global sections are needed for the generalization of Adam and other optimizers to homogeneous spaces. They are necessary to perform the two mappings represented by horizontal and vertical red lines in the section on the general optimizer framework.","category":"page"},{"location":"optimizers/manifold_related/global_sections/#Computing-the-global-section","page":"Global Sections","title":"Computing the global section","text":"","category":"section"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"In differential geometry a section is always associated to some bundle; in our case this bundle is \\pi:G\\to\\mathcal{M}, A\\mapsto{}AE. A section is a mapping \\lambda:\\mathcal{M}\\to{}G for which \\pi is a left inverse, i.e. \\pi\\circ\\lambda = \\mathrm{id}. 
","category":"page"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"For the Stiefel manifold St(n, N)\\subset\\mathbb{R}^{N\\times{}n} we compute the global section in the following way: ","category":"page"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"Start with an element Y\\in{}St(n,N),\nDraw a random matrix A\\in\\mathbb{R}^{N\\times(N-n)},\nRemove the subspace spanned by Y from the range of A: A\\gets{}A-YY^TA\nCompute a QR decomposition of A and take as section \\lambda(Y) = [Y, Q_{[1:N, 1:(N-n)]}] =: [Y, \\bar{\\lambda}].","category":"page"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"It is easy to check that \\lambda(Y)\\in{}G=SO(N).","category":"page"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"In GeometricMachineLearning, GlobalSection takes an element Y\\in{}St(n,N)\\equiv StiefelManifold{T} and returns an instance of GlobalSection{T, StiefelManifold{T}}. The application O(N)\\times{}St(n,N)\\to{}St(n,N) is done with the functions apply_section! and apply_section.","category":"page"},{"location":"optimizers/manifold_related/global_sections/#Computing-the-global-tangent-space-representation-based-on-a-global-section","page":"Global Sections","title":"Computing the global tangent space representation based on a global section","text":"","category":"section"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"The output of the horizontal lift \\Omega is an element of \\mathfrak{g}^{\\mathrm{hor},Y}. For this mapping \\Omega(Y, BY) = B if B\\in\\mathfrak{g}^{\\mathrm{hor},Y}, i.e. there is no information loss and no projection is performed. 
We can map the Binmathfrakg^mathrmhorY to mathfrakg^mathrmhor with Bmapstolambda(Y)^-1Blambda(Y).","category":"page"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"The function global_rep performs both mappings at once[1], i.e. it takes an instance of GlobalSection and an element of T_YSt(nN), and then returns an element of frakg^mathrmhorequivStiefelLieAlgHorMatrix.","category":"page"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"[1]: For computational reasons.","category":"page"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"In practice we use the following: ","category":"page"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"beginaligned\nlambda(Y)^TOmega(YDelta)lambda(Y) = lambda(Y)^T(mathbbI - frac12YY^T)DeltaY^T - YDelta^T(mathbbI - frac12YY^T)lambda(Y) \n = lambda(Y)^T(mathbbI - frac12YY^T)DeltaE^T - YDelta^T(lambda(Y) - frac12YE^T) \n = lambda(Y)^TDeltaE^T - frac12EY^TDeltaE^T - EDelta^Tlambda(Y) + frac12EDelta^TYE^T \n = beginbmatrix Y^TDeltaE^T barlambdaDeltaE^T endbmatrix - frac12EY^TDeltaE - beginbmatrix EDelta^TY EDelta^Tbarlambda endbmatrix + frac12EDelta^TYE^T \n = beginbmatrix Y^TDeltaE^T barlambdaDeltaE^T endbmatrix + EDelta^TYE^T - beginbmatrixEDelta^TY EDelta^Tbarlambda endbmatrix \n = EY^TDeltaE^T + EDelta^TYE^T - EDelta^TYE^T + beginbmatrix mathbbO barlambdaDeltaE^T endbmatrix - beginbmatrix mathbbO EDelta^Tbarlambda endbmatrix \n = EY^TDeltaE^T + beginbmatrix mathbbO barlambdaDeltaE^T endbmatrix - beginbmatrix mathbbO EDelta^Tbarlambda endbmatrix\nendaligned","category":"page"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"meaning that for an element of the horizontal component of the Lie 
algebra mathfrakg^mathrmhor we store A=Y^TDelta and B=barlambda^TDelta.","category":"page"},{"location":"optimizers/manifold_related/global_sections/#Optimization","page":"Global Sections","title":"Optimization","text":"","category":"section"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"The output of global_rep is then used for all the optimization steps.","category":"page"},{"location":"optimizers/manifold_related/global_sections/#References","page":"Global Sections","title":"References","text":"","category":"section"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"T. Frankel. The geometry of physics: an introduction (Cambridge university press, Cambridge, UK, 2011).\n\n\n\n","category":"page"},{"location":"manifolds/inverse_function_theorem/#The-Inverse-Function-Theorem","page":"The Inverse Function Theorem","title":"The Inverse Function Theorem","text":"","category":"section"},{"location":"manifolds/inverse_function_theorem/","page":"The Inverse Function Theorem","title":"The Inverse Function Theorem","text":"The inverse function theorem gives a sufficient condition on a vector-valued function to be invertible in a neighborhood of a specific point. This theorem is critical in developing a theory of manifolds and serves as a basis for the submersion theorem. Here we first state the theorem and then give a proof.","category":"page"},{"location":"manifolds/inverse_function_theorem/","page":"The Inverse Function Theorem","title":"The Inverse Function Theorem","text":"Theorem (Inverse function theorem): Consider a vector-valued differentiable function FmathbbR^NtomathbbR^N and assume its Jacobian is non-degenerate at a point xinmathbbR^N. Then there exists a neighborhood U that contains F(x) and on which F is invertible, i.e. existsHUtomathbbR^N s.t. 
forallyinUFcircH(y) = y and the inverse is differentiable.","category":"page"},{"location":"manifolds/inverse_function_theorem/","page":"The Inverse Function Theorem","title":"The Inverse Function Theorem","text":"Proof: Consider a mapping FmathbbR^NtomathbbR^N and assume its Jacobian has full rank at the point x, i.e. detF(x)neq0. Now consider a ball around x whose radius r we do not yet fix and two points y and z in that ball: yzinB(xr). We further introduce the function G(y)=F(x)-F(x)y. By the mean value theorem we have G(z) - G(y)leqz-ysup_0t1G(x + t(y-x)) where cdot is the operator norm. Because tmapstoG(x+t(y-x)) is continuous and G(x)=0 there must exist an r s.t. foralltin01G(x +t(y-x)) - G(x)frac12F(x). F must then be injective on B(xr) (and hence invertible on F(B(xr))). Assume for the moment it is not. We can then find two distinct elements y zinB(xr) s.t. F(z) - F(y) = 0. This implies G(z) - G(y) = F(x)y - x which is a contradiction. The inverse (which we call HF(B(xr))toB(xr)) is also continuous by the last theorem presented in the section on basic topological concepts[1]. We still have to prove differentiability of the inverse. We now prove that the derivative of H at F(x) exists and that it is equal to F(x)^-1F(x). 
For this we denote F(x) by xi and let etainF(B(xr)) go to zero.","category":"page"},{"location":"manifolds/inverse_function_theorem/","page":"The Inverse Function Theorem","title":"The Inverse Function Theorem","text":"beginaligned\n eta^-1H(xi+eta) - H(xi) - F(x)^-1eta leq eta^-1F(x)^-1F(x)H(xi+eta)-F(x)H(xi) -eta \n leq eta^-1F(x)^-1F(H(xi+eta)) - G(H(xi+eta)) - F(H(xi)) + G(x) - eta \n = eta^-1F(x)^-1xi + eta - G(H(xi+eta)) - xi + G(x) - eta \n = eta^-1F(x)^-1G(H(xi+eta)) - G(H(xi))\nendaligned","category":"page"},{"location":"manifolds/inverse_function_theorem/","page":"The Inverse Function Theorem","title":"The Inverse Function Theorem","text":"and this goes to zero as eta goes to zero, because H is continuous and therefore H(xi+eta) goes to H(xi)=x and the expression on the right goes to zero as well.","category":"page"},{"location":"manifolds/inverse_function_theorem/","page":"The Inverse Function Theorem","title":"The Inverse Function Theorem","text":"[1]: In order to apply said theorem we must have a mapping from a compact space to a Hausdorff space. The image is clearly Hausdorff. For compactness, we could further restrict our ball to B(xr2), then G and its inverse are at least continuous on the closure of B(xr2) (or its image respectively) and hence also on B(xr2).","category":"page"},{"location":"manifolds/inverse_function_theorem/#References","page":"The Inverse Function Theorem","title":"References","text":"","category":"section"},{"location":"manifolds/inverse_function_theorem/","page":"The Inverse Function Theorem","title":"The Inverse Function Theorem","text":"S. Lang. Fundamentals of differential geometry. Vol. 
191 (Springer Science & Business Media, 2012).\n\n\n\n","category":"page"},{"location":"optimizers/manifold_related/geodesic/#Geodesic-Retraction","page":"Geodesic Retraction","title":"Geodesic Retraction","text":"","category":"section"},{"location":"optimizers/manifold_related/geodesic/","page":"Geodesic Retraction","title":"Geodesic Retraction","text":"General retractions are approximations of the exponential map. In GeometricMachineLearning we can, instead of using an approximation, solve the geodesic equation exactly (up to numerical error) by specifying Geodesic() as the argument of layers that have manifold weights. ","category":"page"},{"location":"optimizers/manifold_related/cayley/#The-Cayley-Retraction","page":"Cayley Retraction","title":"The Cayley Retraction","text":"","category":"section"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"The Cayley transformation is one of the most popular retractions. For several matrix Lie groups it is a mapping from the Lie algebra mathfrakg onto the Lie group G. The Cayley retraction reads: ","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":" mathrmCayley(C) = left(mathbbI -frac12Cright)^-1left(mathbbI +frac12Cright)","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"This is easily checked to be a retraction, i.e. mathrmCayley(mathbbO) = mathbbI and fracpartialpartialtmathrmCayley(tC) = C.","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"What we need in practice is not the computation of the Cayley transform of an arbitrary matrix, but the Cayley transform of an element of mathfrakg^mathrmhor, the global tangent space representation. 
","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"The elements of mathfrakg^mathrmhor can be written as: ","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"C = beginbmatrix\n A -B^T \n B mathbbO\nendbmatrix = beginbmatrix frac12A mathbbI B mathbbO endbmatrix beginbmatrix mathbbI mathbbO frac12A -B^T endbmatrix","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"where the second expression exploits the sparse structure of the array, i.e. it is a multiplication of a Ntimes2n with a 2ntimesN matrix. We can hence use the Sherman-Morrison-Woodbury formula to obtain:","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"(mathbbI - frac12UV)^-1 = mathbbI + frac12U(mathbbI - frac12VU)^-1V","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"So what we have to invert is the term ","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"mathbbI - frac12beginbmatrix mathbbI mathbbO frac12A -B^T endbmatrixbeginbmatrix frac12A mathbbI B mathbbO endbmatrix = \nbeginbmatrix mathbbI - frac14A - frac12mathbbI frac12B^TB - frac18A^2 mathbbI - frac14A endbmatrix","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"The whole Cayley transform is then: ","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"left(mathbbI + frac12beginbmatrix frac12A mathbbI B mathbbO endbmatrix beginbmatrix mathbbI - frac14A - frac12mathbbI frac12B^TB - frac18A^2 mathbbI - 
frac14A endbmatrix^-1 beginbmatrix mathbbI mathbbO frac12A -B^T endbmatrix right)left( E + frac12beginbmatrix frac12A mathbbI B mathbbO endbmatrix beginbmatrix mathbbI frac12A endbmatrix right) = \nE + frac12beginbmatrix frac12A mathbbI B mathbbO endbmatrixleft(\n beginbmatrix mathbbI frac12A endbmatrix + \n beginbmatrix mathbbI - frac14A - frac12mathbbI frac12B^TB - frac18A^2 mathbbI - frac14A endbmatrix^-1left(\n beginbmatrix mathbbI frac12A endbmatrix + \n beginbmatrix frac12A frac14A^2 - frac12B^TB endbmatrix\n right)\n right)","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"Note that for computational reason we compute mathrmCayley(C)E instead of just the Cayley transform (see the section on retractions).","category":"page"},{"location":"tutorials/sympnet_tutorial/#SympNets-with-GeometricMachineLearning.jl","page":"Sympnets","title":"SympNets with GeometricMachineLearning.jl","text":"","category":"section"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"This page serves as a short introduction into using SympNets with GeometricMachineLearning.jl. For the general theory see the theory section.","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"With GeometricMachineLearning.jl one can easily implement SympNets. 
The steps are as follows:","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"Specify the architecture with the functions GSympNet and LASympNet,\nSpecify the type and the backend with NeuralNetwork,\nPick an optimizer for training the network,\nTrain the neural networks!","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"We discuss these points in some detail:","category":"page"},{"location":"tutorials/sympnet_tutorial/#Specifying-the-architecture","page":"Sympnets","title":"Specifying the architecture","text":"","category":"section"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"To call an LA-SympNet, one needs to write","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"lasympnet = LASympNet(dim; depth=5, nhidden=1, activation=tanh, init_upper_linear=true, init_upper_act=true) ","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"LASympNet takes one obligatory argument:","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"dim : the dimension of the phase space (i.e. an integer) or optionally an instance of DataLoader. This latter option will be used below.","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"and several keyword arguments:","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"depth : the depth for all the linear layers. The default value is 5 (if depth>5, depth is set to 5). 
See the theory section for more details; there depth was called n.\nnhidden : the number of pairs of linear and activation layers with default value set to 1 (i.e. the LA-SympNet is a composition of a linear layer, an activation layer and then again a linear layer). \nactivation : the activation function for all the activation layers with default set to tanh.\ninit_upper_linear : a boolean that indicates whether the first linear layer changes q first. By default this is true.\ninit_upper_act : a boolean that indicates whether the first activation layer changes q first. By default this is true.","category":"page"},{"location":"tutorials/sympnet_tutorial/#G-SympNet","page":"Sympnets","title":"G-SympNet","text":"","category":"section"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"To call a G-SympNet, one needs to write","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"gsympnet = GSympNet(dim; upscaling_dimension=2*dim, nhidden=2, activation=tanh, init_upper=true) ","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"GSympNet takes one obligatory argument:","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"dim : the dimension of the phase space (i.e. an integer) or optionally an instance of DataLoader. This latter option will be used below.","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"and several keyword arguments:","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"upscaling_dimension: The first dimension of the matrix with which the input is multiplied. 
In the theory section this matrix is called K and the upscaling dimension is called m.\nnhidden: the number of gradient layers with default value set to 2.\nactivation : the activation function for all the activation layers with default set to tanh.\ninit_upper : a boolean that indicates whether the first gradient layer changes q first. By default this is true.","category":"page"},{"location":"tutorials/sympnet_tutorial/#Loss-function","page":"Sympnets","title":"Loss function","text":"","category":"section"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"The loss function described in the theory section is the default choice used in GeometricMachineLearning.jl for training SympNets.","category":"page"},{"location":"tutorials/sympnet_tutorial/#Data-Structures-in-GeometricMachineLearning.jl","page":"Sympnets","title":"Data Structures in GeometricMachineLearning.jl","text":"","category":"section"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"import Images, Plots # hide\nif Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nelse # hide\n Plots.plot(Images.load(\"../tikz/structs_visualization.png\"), axis=([], false)) # hide\nend # hide","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"if Main.output_type == :html # hide \n HTML(\"\"\"\"\"\") # hide\nend # hide","category":"page"},{"location":"tutorials/sympnet_tutorial/#Examples","page":"Sympnets","title":"Examples","text":"","category":"section"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"Let us see how to use it on several examples.","category":"page"},{"location":"tutorials/sympnet_tutorial/#Example-of-a-pendulum-with-G-SympNet","page":"Sympnets","title":"Example of a pendulum with G-SympNet","text":"","category":"section"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"Let us begin 
with a simple example, the pendulum system, the Hamiltonian of which is ","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"H(qp)inmathbbR^2 mapsto frac12p^2-cos(q) in mathbbR","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"Here we generate pendulum data with the script GeometricMachineLearning/scripts/pendulum.jl:","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"using GeometricMachineLearning\n\n# load script\ninclude(\"../../../scripts/pendulum.jl\")\n# specify the data type\ntype = Float16 \n# get data \nqp_data = GeometricMachineLearning.apply_toNT(a -> type.(a), pendulum_data((q=[0.], p=[1.]); tspan=(0.,100.)))\n# call the DataLoader\ndl = DataLoader(qp_data)\n# this last line is a hack so as to not display the output # hide\nnothing # hide","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"Next we specify the architectures. 
GeometricMachineLearning.jl provides useful defaults for all parameters although they can be specified manually (which is done in the following):","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"# layer dimension for gradient module \nconst upscaling_dimension = 2\n# hidden layers\nconst nhidden = 1\n# activation function\nconst activation = tanh\n\n# calling G-SympNet architecture \ngsympnet = GSympNet(dl, upscaling_dimension=upscaling_dimension, nhidden=nhidden, activation=activation)\n\n# calling LA-SympNet architecture \nlasympnet = LASympNet(dl, nhidden=nhidden, activation=activation)\n\n# specify the backend\nbackend = CPU()\n\n# initialize the networks\nla_nn = NeuralNetwork(lasympnet, backend, type) \ng_nn = NeuralNetwork(gsympnet, backend, type)\nnothing # hide","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"If we want to obtain information on the number of parameters in a neural network, we can do that very simply with the function parameterlength. For the LASympNet:","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"parameterlength(la_nn.model)","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"And for the GSympNet:","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"parameterlength(g_nn.model)","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"Remark: We can also specify whether we would like to start with a layer that changes the q-component or one that changes the p-component. 
This can be done via the keywords init_upper for GSympNet, and init_upper_linear and init_upper_act for LASympNet.","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"We have to define an optimizer which will be used in the training of the SympNet. For more details on optimizers, please see the corresponding documentation. In this example we use Adam:","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"# set up optimizer; for this we first need to specify the optimization method\nopt_method = AdamOptimizer(; T=type)\nla_opt = Optimizer(opt_method, la_nn)\ng_opt = Optimizer(opt_method, g_nn)\nnothing # hide","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"We can now perform the training of the neural networks. The syntax is as follows:","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"# number of training epochs\nconst nepochs = 300\n# batch size used to compute the gradient of the loss function with respect to the parameters of the neural networks\nconst batch_size = 100\n\nbatch = Batch(batch_size)\n\n# perform training (returns array that contains the total loss for each training step)\ng_loss_array = g_opt(g_nn, dl, batch, nepochs)\nla_loss_array = la_opt(la_nn, dl, batch, nepochs)\nnothing # hide","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"We can also plot the training errors against the epoch (here the y-axis is in log-scale):","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"using Plots\np1 = plot(g_loss_array, xlabel=\"Epoch\", ylabel=\"Training error\", label=\"G-SympNet\", color=3, yaxis=:log)\nplot!(p1, la_loss_array, label=\"LA-SympNet\", 
color=2)","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"The train function will change the parameters of the neural networks and returns a vector containing the evolution of the value of the loss function during the training. Default values for the arguments ntraining and batch_size are respectively 1000 and 10.","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"The training data data_q and data_p must be matrices of mathbbR^ntimes d where n is the length of the data and d is half the dimension of the system, i.e. data_q[i,j] is q_j(t_i) where (t_1t_n) are the corresponding times of the training data.","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"Then we can make predictions. Let's compare the initial data with a prediction starting from the same phase space point using the provided function iterate:","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"ics = (q=qp_data.q[:,1], p=qp_data.p[:,1])\n\nsteps_to_plot = 200\n\n#predictions\nla_trajectory = iterate(la_nn, ics; n_points = steps_to_plot)\ng_trajectory = iterate(g_nn, ics; n_points = steps_to_plot)\n\nusing Plots\np2 = plot(qp_data.q'[1:steps_to_plot], qp_data.p'[1:steps_to_plot], label=\"training data\")\nplot!(p2, la_trajectory.q', la_trajectory.p', label=\"LA Sympnet\")\nplot!(p2, g_trajectory.q', g_trajectory.p', label=\"G Sympnet\")","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"We see that GSympNet gives an almost perfect match on the training data whereas LASympNet cannot even properly replicate the training data. 
It also takes longer to train LASympNet.","category":"page"},{"location":"architectures/sympnet/#SympNet","page":"SympNet","title":"SympNet","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"This document discusses the SympNet architecture and its implementation in GeometricMachineLearning.jl.","category":"page"},{"location":"architectures/sympnet/#Quick-overview-of-the-theory-of-SympNets","page":"SympNet","title":"Quick overview of the theory of SympNets","text":"","category":"section"},{"location":"architectures/sympnet/#Principle","page":"SympNet","title":"Principle","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"SympNets (see [1] for the eponymous paper) are a type of neural network that can model the trajectory of a Hamiltonian system in phase space. Take (q^Tp^T)^T=(q_1ldotsq_dp_1ldotsp_d)^Tin mathbbR^2d as the coordinates in phase space, where q=(q_1 ldots q_d)^Tin mathbbR^d is referred to as the position and p=(p_1 ldots p_d)^Tin mathbbR^d the momentum. Given a point (q^Tp^T)^T in mathbbR^2d the SympNet aims to compute the next position ((q)^T(p)^T)^T and thus predicts the trajectory while preserving the symplectic structure of the system. SympNets enforce symplecticity strongly, meaning that this property is hard-coded into the network architecture. The layers are reminiscent of traditional neural network feedforward layers, but have a strong restriction imposed on them in order to be symplectic.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"SympNets can be viewed as a \"symplectic integrator\" (see [2] and [3]). 
Their goal is to predict, based on an initial condition ((q^(0))^T(p^(0))^T)^T, a sequence of points in phase space that fit the training data as well as possible:","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"beginpmatrix q^(0) p^(0) endpmatrix cdots beginpmatrix tildeq^(1) tildep^(1) endpmatrix cdots beginpmatrix tildeq^(n) tildep^(n) endpmatrix","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"The tilde in the above equation indicates predicted data. The time step between predictions is not a parameter we can choose but is related to the temporal frequency of the training data. This means that if data is recorded in an interval of e.g. 0.1 seconds, then this will be the time step of our integrator.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"import Images, Plots # hide\nif Main.output_type == :html # hide\n Docs.HTML(\"\"\"\"\"\") # hide\nelse # hide\n Plots.plot(Images.load(\"../tikz/sympnet_architecture.png\"), axis=([], false)) # hide\nend # hide","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"if Main.output_type == :html # hide\n Docs.HTML(\"\"\"\"\"\") # hide\nend # hide","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"There are two types of SympNet architectures: LA-SympNets and G-SympNets. ","category":"page"},{"location":"architectures/sympnet/#LA-SympNet","page":"SympNet","title":"LA-SympNet","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"The first type of SympNets, LA-SympNets, are obtained from composing two types of layers: symplectic linear layers and symplectic activation layers. 
For a given integer n, a symplectic linear layer is defined by","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"mathcalL^nq\nbeginpmatrix\n q \n p \nendpmatrix\n = \nbeginpmatrix \n I S^n0 \n 0S^n I \nendpmatrix\n cdots \nbeginpmatrix \n I 0 \n S^2 I \nendpmatrix\nbeginpmatrix \n I S^1 \n 0 I \nendpmatrix\nbeginpmatrix\n q \n p \nendpmatrix\n+ b ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"or ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"mathcalL^np\nbeginpmatrix q \n p endpmatrix = \n beginpmatrix \n I 0S^n \n S^n0 I\n endpmatrix cdots \n beginpmatrix \n I S^2 \n 0 I\n endpmatrix\n beginpmatrix \n I 0 \n S^1 I\n endpmatrix\n beginpmatrix q \n p endpmatrix\n + b ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"The superscripts q and p indicate whether the q or the p part is changed. The learnable parameters are the symmetric matrices S^iinmathbbR^dtimes d and the bias binmathbbR^2d. The integer n is the width of the symplectic linear layer. It can be shown that five of these layers, i.e. ngeq5, can represent any linear symplectic map (see [4]), so n need not be larger than five. 
We denote the set of symplectic linear layers by mathcalM^L.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"The second type of layer needed for LA-SympNets are so-called activation layers:","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":" mathcalA^q beginpmatrix q \n p endpmatrix = \n beginbmatrix \n Ihatsigma^a \n 0I\n endbmatrix beginpmatrix q \n p endpmatrix =\n beginpmatrix \n mathrmdiag(a)sigma(p)+q \n p\n endpmatrix","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"and","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":" mathcalA^p beginpmatrix q \n p endpmatrix = \n beginbmatrix \n I0 \n hatsigma^aI\n endbmatrix beginpmatrix q \n p endpmatrix\n =\n beginpmatrix \n q \n mathrmdiag(a)sigma(q)+p\n endpmatrix","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"The activation function sigma can be any nonlinearity (on which minor restrictions are imposed below). Here the scaling vector ainmathbbR^d constitutes the learnable weights. We denote the set of symplectic activation layers by mathcalM^A. ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"An LA-SympNet is a function of the form Psi=l_k circ a_k circ l_k-1 circ cdots circ a_1 circ l_0 where (l_i)_0leq ileq k subset (mathcalM^L)^k+1 and (a_i)_1leq ileq k subset (mathcalM^A)^k. 
We will refer to k as the number of hidden layers of the SympNet[1] and the number n above as the depth of the linear layer.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"[1]: Note that if k=0 then the LA-SympNet consists of only one linear layer.","category":"page"},{"location":"architectures/sympnet/#G-SympNets","page":"SympNet","title":"G-SympNets","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"G-SympNets are an alternative to LA-SympNets. They are built with only one kind of layer, called a gradient layer. For a given activation function sigma and an integer mgeq d, a gradient layer is a symplectic map from mathbbR^2d to mathbbR^2d defined by","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":" mathcalG^up beginpmatrix q \n p endpmatrix = \n beginbmatrix \n Ihatsigma^Kab \n 0I\n endbmatrix beginpmatrix q \n p endpmatrix =\n beginpmatrix \n K^T mathrmdiag(a)sigma(Kp+b)+q \n p\n endpmatrix","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"or","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":" mathcalG^low beginpmatrix q \n p endpmatrix = \n beginbmatrix \n I0 \n hatsigma^KabI\n endbmatrix beginpmatrix q \n p endpmatrix\n =\n beginpmatrix \n q \n K^T mathrmdiag(a)sigma(Kq+b)+p\n endpmatrix","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"The parameters of this layer are the scaling matrix KinmathbbR^mtimes d, the bias binmathbbR^m and the scaling vector ainmathbbR^m. The name \"gradient layer\" has its origin in the fact that the expression K^Tmathrmdiag(a)sigma(Kq+b)_i = sum_jk_jia_jsigma(sum_ellk_jellq_ell+b_j) is the gradient of a function sum_ja_jtildesigma(sum_ellk_jellq_ell+b_j), where tildesigma is the antiderivative of sigma. 
The first dimension of K we refer to as the upscaling dimension.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"If we denote by mathcalM^G the set of gradient layers, a G-SympNet is a function of the form Psi=g_k circ g_k-1 circ cdots circ g_0 where (g_i)_0leq ileq k subset (mathcalM^G)^k. The index k is again the number of hidden layers.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Further note here the different roles played by round and square brackets: the latter indicates a nonlinear operation as opposed to a regular vector or matrix. ","category":"page"},{"location":"architectures/sympnet/#Universal-approximation-theorems","page":"SympNet","title":"Universal approximation theorems","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"In order to state the universal approximation theorem for both architectures we first need a few definitions:","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Let U be an open set of mathbbR^2d, and let us denote by mathcalSP^r(U) the set of C^r smooth symplectic maps on U. 
We now define a topology on C^r(K mathbbR^n), the set of C^r-smooth maps from a compact set KsubsetmathbbR^n to mathbbR^n through the norm","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"f_C^r(KmathbbR^n) = undersetalphaleq rsum underset1leq i leq nmaxundersetxin Ksup D^alpha f_i(x)","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"where the differential operator D^alpha is defined by ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"D^alpha f = fracpartial^alpha fpartial x_1^alpha_1x_n^alpha_n","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"with alpha = alpha_1 ++ alpha_n. ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Definition sigma is r-finite if sigmain C^r(mathbbRmathbbR) and int D^rsigma(x)dx +infty.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Definition Let mnrin mathbbN with mn0 be given, U an open set of mathbbR^m, and IJsubset C^r(UmathbbR^n). 
We say J is r-uniformly dense on compacta in I if J subset I and for any fin I, epsilon0, and any compact Ksubset U, there exists gin J such that f-g_C^r(KmathbbR^n) epsilon.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"We can now state the universal approximation theorems:","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Theorem (Approximation theorem for LA-SympNet) For any positive integer r0 and open set Uin mathbbR^2d, the set of LA-SympNets is r-uniformly dense on compacta in SP^r(U) if the activation function sigma is r-finite.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Theorem (Approximation theorem for G-SympNet) For any positive integer r0 and open set Uin mathbbR^2d, the set of G-SympNets is r-uniformly dense on compacta in SP^r(U) if the activation function sigma is r-finite.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"There are many r-finite activation functions commonly used in neural networks, for example:","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"sigmoid sigma(x)=frac11+e^-x for any positive integer r, \ntanh tanh(x)=frace^x-e^-xe^x+e^-x for any positive integer r. ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"The universal approximation theorems state that we can, in principle, get arbitrarily close to any symplectomorphism defined on mathbbR^2d. But this does not tell us anything about how to optimize the network. This can be done with any common neural network optimizer, and these optimizers always rely on a corresponding loss function. 
","category":"page"},{"location":"architectures/sympnet/#Loss-function","page":"SympNet","title":"Loss function","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"To train the SympNet, one needs data along a trajectory such that the model is trained to perform an integration. These data are (QP) where Qij (respectively Pij) is the real number q_j(t_i) (respectively p_j(t_i)) which is the j-th coordinate of the generalized position (respectively momentum) at the i-th time step. One also needs a loss function defined as:","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Loss(QP) = undersetisum d(Phi(Qi-Pi-) Qi- Pi-^T)","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"where d is a distance on mathbbR^d.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"See the tutorial section for an introduction to using SympNets with GeometricMachineLearning.jl.","category":"page"},{"location":"architectures/sympnet/#References","page":"SympNet","title":"References","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"P. Jin, Z. Zhang, A. Zhu, Y. Tang and G. E. Karniadakis. SympNets: Intrinsic structure-preserving symplectic networks for identifying Hamiltonian systems. Neural Networks 132, 166–179 (2020).\n\n\n\n","category":"page"},{"location":"Optimizer/#Optimizer","page":"Optimizers","title":"Optimizer","text":"","category":"section"},{"location":"Optimizer/","page":"Optimizers","title":"Optimizers","text":"In order to generalize neural network optimizers to homogeneous spaces, a class of manifolds we often encounter in machine learning, we have to find a global tangent space representation which we call mathfrakg^mathrmhor here. 
","category":"page"},{"location":"Optimizer/","page":"Optimizers","title":"Optimizers","text":"Starting from an element of the tangent space T_YmathcalM[1], we need to perform two mappings to arrive at mathfrakg^mathrmhor, which we refer to by Omega and a red horizontal arrow:","category":"page"},{"location":"Optimizer/","page":"Optimizers","title":"Optimizers","text":"[1]: In practice this is obtained by first using an AD routine on a loss function L, and then computing the Riemannian gradient based on this. See the section on the Stiefel manifold for an example of this.","category":"page"},{"location":"Optimizer/","page":"Optimizers","title":"Optimizers","text":"import Images, Plots # hide\nif Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nelse # hide\n Plots.plot(Images.load(\"tikz/general_optimization_with_boundary.png\"), axis=([], false)) # hide\nend # hide","category":"page"},{"location":"Optimizer/","page":"Optimizers","title":"Optimizers","text":"if Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nend # hide","category":"page"},{"location":"Optimizer/","page":"Optimizers","title":"Optimizers","text":"Here the mapping Omega is a horizontal lift from the tangent space onto the horizontal component of the Lie algebra at Y. ","category":"page"},{"location":"Optimizer/","page":"Optimizers","title":"Optimizers","text":"The red line maps the horizontal component at Y, i.e. mathfrakg^mathrmhorY, to the horizontal component at mathfrakg^mathrmhor.","category":"page"},{"location":"Optimizer/","page":"Optimizers","title":"Optimizers","text":"The mathrmcache stores information about previous optimization steps and is dependent on the optimizer. The elements of the mathrmcache are also in mathfrakg^mathrmhor. Based on this the optimizer (Adam in this case) computes a final velocity, which is the input of a retraction. 
Because this update is done for mathfrakg^mathrmhorequivT_YmathcalM, we still need to perform a mapping, called apply_section here, that then finally updates the network parameters. The two red lines are described in global sections.","category":"page"},{"location":"Optimizer/#References","page":"Optimizers","title":"References","text":"","category":"section"},{"location":"Optimizer/","page":"Optimizers","title":"Optimizers","text":"B. Brantner. Generalizing Adam To Manifolds For Efficiently Training Transformers, arXiv preprint arXiv:2305.16901 (2023).\n\n\n\n","category":"page"},{"location":"","page":"Home","title":"Home","text":"CurrentModule = GeometricMachineLearning","category":"page"},{"location":"#Geometric-Machine-Learning","page":"Home","title":"Geometric Machine Learning","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"GeometricMachineLearning.jl implements various scientific machine learning models that aim at learning dynamical systems with geometric structure, such as Hamiltonian (symplectic) or Lagrangian (variational) systems.","category":"page"},{"location":"#Installation","page":"Home","title":"Installation","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"GeometricMachineLearning.jl and all of its dependencies can be installed via the Julia REPL by typing ","category":"page"},{"location":"","page":"Home","title":"Home","text":"]add GeometricMachineLearning","category":"page"},{"location":"#Architectures","page":"Home","title":"Architectures","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"There are several architectures tailored towards problems in scientific machine learning implemented in GeometricMachineLearning.","category":"page"},{"location":"","page":"Home","title":"Home","text":"Pages = [\n 
\"architectures/sympnet.md\",\n]","category":"page"},{"location":"#Manifolds","page":"Home","title":"Manifolds","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"GeometricMachineLearning supports putting neural network weights on manifolds. These include:","category":"page"},{"location":"","page":"Home","title":"Home","text":"Pages = [\n \"manifolds/grassmann_manifold.md\",\n \"manifolds/stiefel_manifold.md\",\n]","category":"page"},{"location":"#Special-Neural-Network-Layer","page":"Home","title":"Special Neural Network Layer","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"Many layers have been adapted in order to be used for problems in scientific machine learning. Including:","category":"page"},{"location":"","page":"Home","title":"Home","text":"Pages = [\n \"layers/attention_layer.md\",\n]","category":"page"},{"location":"#Tutorials","page":"Home","title":"Tutorials","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"Tutorials for using GeometricMachineLearning are: ","category":"page"},{"location":"","page":"Home","title":"Home","text":"Pages = [\n \"tutorials/sympnet_tutorial.md\",\n \"tutorials/mnist_tutorial.md\",\n]","category":"page"},{"location":"#Reduced-Order-Modeling","page":"Home","title":"Reduced Order Modeling","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"A short description of the key concepts in reduced order modeling (where GeometricMachineLearning can be used) are in:","category":"page"},{"location":"","page":"Home","title":"Home","text":"Pages = [\n \"reduced_order_modeling/autoencoder.md\",\n \"reduced_order_modeling/symplectic_autoencoder.md\",\n \"reduced_order_modeling/kolmogorov_n_width.md\",\n]","category":"page"},{"location":"data_loader/snapshot_matrix/#Snapshot-matrix","page":"Snapshot matrix & tensor","title":"Snapshot 
matrix","text":"","category":"section"},{"location":"data_loader/snapshot_matrix/","page":"Snapshot matrix & tensor","title":"Snapshot matrix & tensor","text":"The snapshot matrix stores solutions of the high-dimensional ODE (obtained from discretizing a PDE). This is then used to construct reduced bases in a data-driven way. So (for a single parameter[1]) the snapshot matrix takes the following form: ","category":"page"},{"location":"data_loader/snapshot_matrix/","page":"Snapshot matrix & tensor","title":"Snapshot matrix & tensor","text":"[1]: If we deal with a parametrized PDE then there are two stages at which the snapshot matrix has to be processed: the offline stage and the online stage. ","category":"page"},{"location":"data_loader/snapshot_matrix/","page":"Snapshot matrix & tensor","title":"Snapshot matrix & tensor","text":"M = leftbeginarraycccc\nhatu_1(t_0) hatu_1(t_1) quadldotsquad hatu_1(t_f) \nhatu_2(t_0) hatu_2(t_1) ldots hatu_2(t_f) \nhatu_3(t_0) hatu_3(t_1) ldots hatu_3(t_f) \nldots ldots ldots ldots \nhatu_2N(t_0) hatu_2N(t_1) ldots hatu_2N(t_f) \nendarrayright","category":"page"},{"location":"data_loader/snapshot_matrix/","page":"Snapshot matrix & tensor","title":"Snapshot matrix & tensor","text":"In the above example we store a matrix whose first axis is the system dimension (i.e. a state is an element of mathbbR^2n) and the second dimension gives the time step. ","category":"page"},{"location":"data_loader/snapshot_matrix/","page":"Snapshot matrix & tensor","title":"Snapshot matrix & tensor","text":"The starting point for using the snapshot matrix as data for a machine learning model is that all the columns of M live on a lower-dimensional solution manifold and we can use techniques such as POD and autoencoders to find this solution manifold. We also note that the second axis of M does not necessarily indicate time but can also represent various parameters (including initial conditions). 
The second axis in the DataLoader struct is therefore saved in the field n_params.","category":"page"},{"location":"data_loader/snapshot_matrix/#Snapshot-tensor","page":"Snapshot matrix & tensor","title":"Snapshot tensor","text":"","category":"section"},{"location":"data_loader/snapshot_matrix/","page":"Snapshot matrix & tensor","title":"Snapshot matrix & tensor","text":"The snapshot tensor fulfills the same role as the snapshot matrix but has a third axis that describes different initial parameters (such as different initial conditions). ","category":"page"},{"location":"data_loader/snapshot_matrix/","page":"Snapshot matrix & tensor","title":"Snapshot matrix & tensor","text":"import Images, Plots # hide\nif Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nelse # hide\n Plots.plot(Images.load(\"../tikz/tensor.png\"), axis=([], false)) # hide\nend # hide","category":"page"},{"location":"data_loader/snapshot_matrix/","page":"Snapshot matrix & tensor","title":"Snapshot matrix & tensor","text":"if Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nend # hide","category":"page"},{"location":"data_loader/snapshot_matrix/","page":"Snapshot matrix & tensor","title":"Snapshot matrix & tensor","text":"When drawing training samples from the snapshot tensor we also need to specify a sequence length (as an argument to the Batch struct). When sampling a batch from the snapshot tensor we sample over the starting point of the time interval (which is of length seq_length) and the third axis of the tensor (the parameters). The total number of batches in this case is lceilmathtt(dlinput_time_steps - batchseq_length) * dln_params batchbatch_sizerceil. 
","category":"page"},{"location":"arrays/grassmann_lie_alg_hor_matrix/#The-horizontal-component-of-the-Lie-algebra-\\mathfrak{g}-for-the-Grassmann-manifold","page":"Grassmann Global Tangent Space","title":"The horizontal component of the Lie algebra mathfrakg for the Grassmann manifold","text":"","category":"section"},{"location":"arrays/grassmann_lie_alg_hor_matrix/#Tangent-space-to-the-element-\\mathcal{E}","page":"Grassmann Global Tangent Space","title":"Tangent space to the element mathcalE","text":"","category":"section"},{"location":"arrays/grassmann_lie_alg_hor_matrix/","page":"Grassmann Global Tangent Space","title":"Grassmann Global Tangent Space","text":"Consider the tangent space to the distinct element mathcalE=mathrmspan(E)inGr(nN), where E is again:","category":"page"},{"location":"arrays/grassmann_lie_alg_hor_matrix/","page":"Grassmann Global Tangent Space","title":"Grassmann Global Tangent Space","text":"E = beginbmatrix\nmathbbI_n \nmathbbO\nendbmatrix","category":"page"},{"location":"arrays/grassmann_lie_alg_hor_matrix/","page":"Grassmann Global Tangent Space","title":"Grassmann Global Tangent Space","text":"The tangent space T_mathcalEGr(nN) can be represented through matrices: ","category":"page"},{"location":"arrays/grassmann_lie_alg_hor_matrix/","page":"Grassmann Global Tangent Space","title":"Grassmann Global Tangent Space","text":"beginpmatrix\n 0 cdots 0 \n cdots cdots cdots \n 0 cdots 0 \n a_11 cdots a_1n \n cdots cdots cdots \n a_(N-n)1 cdots a_(N-n)n\nendpmatrix","category":"page"},{"location":"arrays/grassmann_lie_alg_hor_matrix/","page":"Grassmann Global Tangent Space","title":"Grassmann Global Tangent Space","text":"where we have used the identification T_mathcalEGr(nN)toT_EmathcalS_E that was discussed in the section on the Grassmann manifold. The Grassmann manifold can also be seen as the Stiefel manifold modulo an equivalence relation. 
This leads to the following (which is used for optimization):","category":"page"},{"location":"arrays/grassmann_lie_alg_hor_matrix/","page":"Grassmann Global Tangent Space","title":"Grassmann Global Tangent Space","text":"mathfrakg^mathrmhor = mathfrakg^mathrmhormathcalE = leftbeginpmatrix 0 -B^T B 0 endpmatrix textB arbitraryright","category":"page"},{"location":"arrays/grassmann_lie_alg_hor_matrix/","page":"Grassmann Global Tangent Space","title":"Grassmann Global Tangent Space","text":"This is equivalent to the horizontal component of mathfrakg for the Stiefel manifold for the case when A is zero. This is a reflection of the rotational invariance of the Grassmann manifold: the skew-symmetric matrices A are connected to the group of rotations O(n) which is factored out in the Grassmann manifold Gr(nN)simeqSt(nN)O(n).","category":"page"}] +[{"location":"manifolds/grassmann_manifold/#Grassmann-Manifold","page":"Grassmann","title":"Grassmann Manifold","text":"","category":"section"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"(The description of the Grassmann manifold is based on that of the Stiefel manifold, so this should be read first.)","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"An element of the Grassmann manifold G(nN) is a vector subspace subsetmathbbR^N of dimension n. Each such subspace (i.e. element of the Grassmann manifold) can be represented by a full-rank matrix AinmathbbR^Ntimesn and we identify two elements with the following equivalence relation: ","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"A_1 sim A_2 iff existsCinmathbbR^ntimesntext st A_1C = A_2","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"The resulting manifold is of dimension n(N-n). 
One can find a parametrization of the manifold the following way: Because the matrix Y has full rank, there have to be n independent columns in it: i_1 ldots i_n. For simplicity assume that i_1 = 1 i_2=2 ldots i_n=n and call the matrix made up by these columns C. Then the mapping to the coordinate chart is: YC^-1 and the last N-n columns are the coordinates.","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"We can also define the Grassmann manifold based on the Stiefel manifold since elements of the Stiefel manifold are already full-rank matrices. In this case we have the following equivalence relation (for Y_1 Y_2inSt(nN)): ","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"Y_1 sim Y_2 iff existsCinO(n)text st Y_1C = Y_2","category":"page"},{"location":"manifolds/grassmann_manifold/#The-Riemannian-Gradient","page":"Grassmann","title":"The Riemannian Gradient","text":"","category":"section"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"Obtaining the Riemannian Gradient for the Grassmann manifold is slightly more difficult than it is in the case of the Stiefel manifold. Since the Grassmann manifold can be obtained from the Stiefel manifold through an equivalence relation however, we can use this as a starting point. In a first step we identify charts on the Grassmann manifold to make dealing with it easier. 
For this consider the following open cover of the Grassmann manifold (also see [8]): ","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"mathcalU_W_WinSt(n N) quadtextwherequad mathcalU_W = mathrmspan(Y)mathrmdet(W^TY)neq0","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"We can find a canonical bijective mapping from the set mathcalU_W to the set mathcalS_W = YinmathbbR^NtimesnW^TY=mathbbI_n:","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"sigma_W mathcalU_W to mathcalS_W mathcalY=mathrmspan(Y)mapstoY(W^TY)^-1 = hatY","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"That sigma_W is well-defined is easy to see: Consider YC with CinmathbbR^ntimesn non-singular. Then YC(W^TYC)^-1=Y(W^TY)^-1 = hatY. With this isomorphism we can also find a representation of elements of the tangent space:","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"T_mathcalYsigma_W T_mathcalYGr(nN)toT_hatYmathcalS_W xi mapsto (xi_diamondY -hatY(W^Txi_diamondY))(W^TY)^-1","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"xi_diamondY is the representation of xiinT_mathcalYGr(nN) for the point YinSt(nN), i.e. T_Ypi(xi_diamondY) = xi; because the map sigma_W does not care about the representation of mathrmspan(Y) we can perform the variations in St(nN)[1]:","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"[1]: I.e. Y(t)inSt(nN) for tin(-varepsilonvarepsilon). 
We also set Y(0) = Y.","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"fracddtY(t)(W^TY(t))^-1 = (dotY(0) - Y(W^TY)^-1W^TdotY(0))(W^TY)^-1","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"where dotY(0)inT_YSt(nN). Also note that the representation of xi in T_YSt(nN) is not unique in general, but T_mathcalYsigma_W is still well-defined. To see this consider two curves Y(t) and barY(t) for which we have Y(0) = barY(0) = Y and further Tpi(dotY(0)) = Tpi(dotbarY(0)). This is equivalent to being able to find a C(cdot)(-varepsilonvarepsilon)toO(n) for which C(0)=mathbbI(0) s.t. barY(t) = Y(t)C(t). We thus have dotbarY(0) = dotY(0) + YdotC(0) and if we replace xi_diamondY above with the second term in the expression we get: YdotC(0) - hatYW^T(YdotC(0)) = 0. The parametrization of T_mathcalYGr(nN) with T_mathcalYsigma_W is thus independent of the choice of dotC(0) and hence of xi_diamondY and is therefore well-defined.","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"Further note that we have T_mathcalYmathcalU_W = T_mathcalYGr(nN) because mathcalU_W is an open subset of Gr(nN). 
We can thus identify the tangent space T_mathcalYGr(nN) with the following set (where we again have hatY=Y(W^TY)^-1):","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"T_hatYmathcalS_W = (Delta - Y(W^TY)^-1W^TDelta)(W^TY)^-1 YinSt(nN)text st mathrmspan(Y)=mathcalYtext and DeltainT_YSt(nN)","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"If we now further take W=Y[2] then we get the identification: ","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"[2]: We can pick any element W to construct the charts for a neighborhood around the point mathcalYinGr(nN) as long as we have mathrmdet(W^TY)neq0 for mathrmspan(Y)=mathcalY. ","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"T_mathcalYGr(nN) equiv Delta - YY^TDelta YinSt(nN)text st mathrmspan(Y)=mathcalYtext and DeltainT_YSt(nN)","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"which is very easy to handle computationally (we simply store and change the matrix Y that represents an element of the Grassmann manifold). The Riemannian gradient is then ","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"mathrmgrad_mathcalY^GrL = mathrmgrad_Y^StL - YY^Tmathrmgrad_Y^StL = nabla_YL - YY^Tnabla_YL","category":"page"},{"location":"manifolds/grassmann_manifold/","page":"Grassmann","title":"Grassmann","text":"where nabla_YL again is the Euclidean gradient as in the Stiefel manifold case.","category":"page"},{"location":"references/#References","page":"References","title":"References","text":"","category":"section"},{"location":"references/","page":"References","title":"References","text":"P. Jin, Z. Zhang, A. Zhu, Y. Tang and G. E. Karniadakis. 
SympNets: Intrinsic structure-preserving symplectic networks for identifying Hamiltonian systems. Neural Networks 132, 166–179 (2020).\n\n\n\nE. Hairer, C. Lubich and G. Wanner. Geometric Numerical integration: structure-preserving algorithms for ordinary differential equations (Springer, 2006).\n\n\n\nB. Leimkuhler and S. Reich. Simulating Hamiltonian dynamics. No. 14 (Cambridge university press, 2004).\n\n\n\nP. Jin, Z. Lin and B. Xiao. Optimal unit triangular factorization of symplectic matrices. Linear Algebra and its Applications (2022).\n\n\n\nS. Lipschutz. General Topology (McGraw-Hill Book Company, 1965).\n\n\n\nS. Lang. Fundamentals of differential geometry. Vol. 191 (Springer Science & Business Media, 2012).\n\n\n\nR. L. Bishop and S. I. Goldberg. Tensor Analysis on Manifolds (Dover Publications, 1980).\n\n\n\nP.-A. Absil, R. Mahony and R. Sepulchre. Riemannian geometry of Grassmann manifolds with a view on algorithmic computation. Acta Applicandae Mathematica 80, 199–220 (2004).\n\n\n\nW. S. Moses, V. Churavy, L. Paehler, J. Hückelheim, S. H. Narayanan, M. Schanen and J. Doerfert. Reverse-Mode Automatic Differentiation and Optimization of GPU Kernels via Enzyme. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '21 (Association for Computing Machinery, New York, NY, USA, 2021).\n\n\n\nM. Betancourt. A geometric theory of higher-order automatic differentiation, arXiv preprint arXiv:1812.11592 (2018).\n\n\n\nJ. Bolte and E. Pauwels. A mathematical model for automatic differentiation in machine learning. Advances in Neural Information Processing Systems 33, 10809–10819 (2020).\n\n\n\nJ. Nocedal and S. J. Wright. Numerical optimization (Springer Science+Business Media, 2006).\n\n\n\nD. Bahdanau, K. Cho and Y. Bengio. Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473 (2014).\n\n\n\nA. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. 
Gomez, L. Kaiser and I. Polosukhin. Attention is all you need. Advances in neural information processing systems 30 (2017).\n\n\n\nK. Jacobs. Discrete Stochastics (Birkhäuser Verlag, Basel, Switzerland, 1992).\n\n\n\nP.-A. Absil, R. Mahony and R. Sepulchre. Optimization algorithms on matrix manifolds (Princeton University Press, Princeton, New Jersey, 2008).\n\n\n\nK. Feng. The step-transition operators for multi-step methods of ODE's. Journal of Computational Mathematics, 193–202 (1998).\n\n\n\nM.-T. Luong, H. Pham and C. D. Manning. Effective approaches to attention-based neural machine translation, arXiv preprint arXiv:1508.04025 (2015).\n\n\n\nP. Buchfink, S. Glas and B. Haasdonk. Symplectic model reduction of Hamiltonian systems on nonlinear manifolds and approximation with weakly symplectic autoencoder. SIAM Journal on Scientific Computing 45, A289–A311 (2023).\n\n\n\nL. Peng and K. Mohseni. Symplectic model reduction of Hamiltonian systems. SIAM Journal on Scientific Computing 38, A1–A27 (2016).\n\n\n\nC. Greif and K. Urban. Decay of the Kolmogorov N-width for wave problems. Applied Mathematics Letters 96, 216–222 (2019).\n\n\n\nT. Blickhan. A registration method for reduced basis problems using linear optimal transport, arXiv preprint arXiv:2304.14884 (2023).\n\n\n\nK. Lee and K. T. Carlberg. Model reduction of dynamical systems on nonlinear manifolds using deep convolutional autoencoders. Journal of Computational Physics 404, 108973 (2020).\n\n\n\nB. Brantner. Generalizing Adam To Manifolds For Efficiently Training Transformers, arXiv preprint arXiv:2305.16901 (2023).\n\n\n\nT. Lin and H. Zha. Riemannian manifold learning. IEEE transactions on pattern analysis and machine intelligence 30, 796–809 (2008).\n\n\n\nT. Blickhan. BrenierTwoFluids.jl, https://github.com/ToBlick/BrenierTwoFluids (2023).\n\n\n\nS. Fresca, L. Dede’ and A. Manzoni. A comprehensive deep learning-based approach to reduced order modeling of nonlinear time-dependent parametrized PDEs. 
Journal of Scientific Computing 87, 1–36 (2021).\n\n\n\nB. Brantner and M. Kraus. Symplectic autoencoders for Model Reduction of Hamiltonian Systems, arXiv preprint arXiv:2312.10004 (2023).\n\n\n\nB. Brantner, G. de Romemont, M. Kraus and Z. Li. Structure-Preserving Transformers for Learning Parametrized Hamiltonian Systems, arXiv preprint arXiv:2312.11166 (2023).\n\n\n\nT. Frankel. The geometry of physics: an introduction (Cambridge university press, Cambridge, UK, 2011).\n\n\n\nI. Goodfellow, Y. Bengio and A. Courville. Deep learning (MIT press, Cambridge, MA, 2016).\n\n\n\nT. Bendokat, R. Zimmermann and P.-A. Absil. A Grassmann manifold handbook: Basic geometry and computational aspects, arXiv preprint arXiv:2011.13699 (2020).\n\n\n\n","category":"page"},{"location":"manifolds/stiefel_manifold/#Stiefel-manifold","page":"Stiefel","title":"Stiefel manifold","text":"","category":"section"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"The Stiefel manifold St(n N) is the space (a homogeneous space) of all orthonormal frames in mathbbR^Ntimesn, i.e. matrices YinmathbbR^Ntimesn s.t. Y^TY = mathbbI_n. It can also be seen as the special orthogonal group SO(N) modulo an equivalence relation: AsimBiffAE = BE for ","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"E = beginbmatrix\nmathbbI_n \nmathbbO\nendbmatrixinmathcalM","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"which is the canonical element of the Stiefel manifold. 
In words: the first n columns of A and B are the same.","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"The tangent space to the element YinSt(nN) can easily be determined: ","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"T_YSt(nN)=DeltaDelta^TY + Y^TDelta = 0","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"The Lie algebra of SO(N) is mathfrakso(N)=VinmathbbR^NtimesNV^T + V = 0 and the canonical metric associated with it is simply (V_1V_2)mapstofrac12mathrmTr(V_1^TV_2).","category":"page"},{"location":"manifolds/stiefel_manifold/#The-Riemannian-Gradient","page":"Stiefel","title":"The Riemannian Gradient","text":"","category":"section"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"For matrix manifolds (like the Stiefel manifold), the Riemannian gradient of a function can be easily determined computationally:","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"The Euclidean gradient of a function L is equivalent to an element of the cotangent space T^*_YmathcalM via: ","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"langlenablaLcdotrangleT_YmathcalM to mathbbR Delta mapsto sum_ijnablaL_ijDelta_ij = mathrmTr(nablaL^TDelta)","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"We can then utilize the Riemannian metric on mathcalM to map the element from the cotangent space (i.e. nablaL) to the tangent space. This element is called mathrmgrad_(cdot)L here. 
Explicitly, it is given by: ","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":" mathrmgrad_YL = nabla_YL - Y(nabla_YL)^TY","category":"page"},{"location":"manifolds/stiefel_manifold/#rgrad","page":"Stiefel","title":"rgrad","text":"","category":"section"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"What was referred to as nablaL before can in practice be obtained with an AD routine. We then use the function rgrad to map this Euclidean gradient to inT_YSt(nN). This mapping has the property: ","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"mathrmTr((nablaL)^TDelta) = g_Y(mathttrgrad(Y nablaL) Delta) forallDeltainT_YSt(nN)","category":"page"},{"location":"manifolds/stiefel_manifold/","page":"Stiefel","title":"Stiefel","text":"and g is the Riemannian metric.","category":"page"},{"location":"arrays/skew_symmetric_matrix/#SymmetricMatrix-and-SkewSymMatrix","page":"Symmetric and Skew-Symmetric Matrices","title":"SymmetricMatrix and SkewSymMatrix","text":"","category":"section"},{"location":"arrays/skew_symmetric_matrix/","page":"Symmetric and Skew-Symmetric Matrices","title":"Symmetric and Skew-Symmetric Matrices","text":"There are special implementations of symmetric and skew-symmetric matrices in GeometricMachineLearning.jl. They are implemented to work on GPU and for multiplication with tensors. The following image demonstrates how the data necessary for an instance of SkewSymMatrix are stored[1]:","category":"page"},{"location":"arrays/skew_symmetric_matrix/","page":"Symmetric and Skew-Symmetric Matrices","title":"Symmetric and Skew-Symmetric Matrices","text":"[1]: It works similarly for SymmetricMatrix. 
","category":"page"},{"location":"arrays/skew_symmetric_matrix/","page":"Symmetric and Skew-Symmetric Matrices","title":"Symmetric and Skew-Symmetric Matrices","text":"import Images, Plots # hide\nif Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nelse # hide\n Plots.plot(Images.load(\"../tikz/skew_sym_visualization.png\"), axis=([], false)) # hide\nend # hide","category":"page"},{"location":"arrays/skew_symmetric_matrix/","page":"Symmetric and Skew-Symmetric Matrices","title":"Symmetric and Skew-Symmetric Matrices","text":"if Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nend # ","category":"page"},{"location":"arrays/skew_symmetric_matrix/","page":"Symmetric and Skew-Symmetric Matrices","title":"Symmetric and Skew-Symmetric Matrices","text":"So what is stored internally is a vector of size n(n-1)2 for the skew-symmetric matrix and a vector of size n(n+1)2 for the symmetric matrix. We can sample a random skew-symmetric matrix: ","category":"page"},{"location":"arrays/skew_symmetric_matrix/","page":"Symmetric and Skew-Symmetric Matrices","title":"Symmetric and Skew-Symmetric Matrices","text":"using GeometricMachineLearning # hide \n\nA = rand(SkewSymMatrix, 5)","category":"page"},{"location":"arrays/skew_symmetric_matrix/","page":"Symmetric and Skew-Symmetric Matrices","title":"Symmetric and Skew-Symmetric Matrices","text":"and then access the vector:","category":"page"},{"location":"arrays/skew_symmetric_matrix/","page":"Symmetric and Skew-Symmetric Matrices","title":"Symmetric and Skew-Symmetric Matrices","text":"A.S ","category":"page"},{"location":"manifolds/submersion_theorem/#The-Submersion-Theorem","page":"The Submersion Theorem","title":"The Submersion Theorem","text":"","category":"section"},{"location":"manifolds/submersion_theorem/","page":"The Submersion Theorem","title":"The Submersion Theorem","text":"The submersion theorem is an application of the inverse function theorem that we need in order to show that the spaces we 
deal with here are indeed manifolds. ","category":"page"},{"location":"optimizers/general_optimization/#Optimization-for-Neural-Networks","page":"General Optimization","title":"Optimization for Neural Networks","text":"","category":"section"},{"location":"optimizers/general_optimization/","page":"General Optimization","title":"General Optimization","text":"Optimization for neural networks is (almost always) some variation on gradient descent. The most basic form of gradient descent is a discretization of the gradient flow equation:","category":"page"},{"location":"optimizers/general_optimization/","page":"General Optimization","title":"General Optimization","text":"dottheta = -nabla_thetaL","category":"page"},{"location":"optimizers/general_optimization/","page":"General Optimization","title":"General Optimization","text":"by means of an Euler time-stepping scheme: ","category":"page"},{"location":"optimizers/general_optimization/","page":"General Optimization","title":"General Optimization","text":"theta^t+1 = theta^t - hnabla_theta^tL","category":"page"},{"location":"optimizers/general_optimization/","page":"General Optimization","title":"General Optimization","text":"where h (the time step of the Euler scheme) is referred to as the learning rate.","category":"page"},{"location":"optimizers/general_optimization/","page":"General Optimization","title":"General Optimization","text":"This equation can easily be generalized to manifolds by replacing the Euclidean gradient nabla_theta^tL with the Riemannian gradient mathrmgrad_theta^tL, and the addition of -hnabla_theta^tL with a retraction along -hmathrmgrad_theta^tL.","category":"page"},{"location":"optimizers/manifold_related/horizontal_lift/#The-Horizontal-Lift","page":"Horizontal Lift","title":"The Horizontal Lift","text":"","category":"section"},{"location":"optimizers/manifold_related/horizontal_lift/","page":"Horizontal Lift","title":"Horizontal Lift","text":"For each element YinmathcalM we can perform a splitting 
mathfrakg = mathfrakg^mathrmhor Yoplusmathfrakg^mathrmver Y, where the two subspaces are the horizontal and the vertical component of mathfrakg at Y respectively. For homogeneous spaces: T_YmathcalM = mathfrakgcdotY, i.e. every tangent space to mathcalM can be expressed through the application of the Lie algebra to the relevant element. The vertical component consists of those elements of mathfrakg which are mapped to the zero element of T_YmathcalM, i.e. ","category":"page"},{"location":"optimizers/manifold_related/horizontal_lift/","page":"Horizontal Lift","title":"Horizontal Lift","text":"mathfrakg^mathrmver Y = mathrmker(mathfrakgtoT_YmathcalM)","category":"page"},{"location":"optimizers/manifold_related/horizontal_lift/","page":"Horizontal Lift","title":"Horizontal Lift","text":"The orthogonal complement[1] of mathfrakg^mathrmver Y is the horizontal component and is referred to by mathfrakg^mathrmhor Y. This is naturally isomorphic to T_YmathcalM. For the Stiefel manifold the horizontal lift has the simple form: ","category":"page"},{"location":"optimizers/manifold_related/horizontal_lift/","page":"Horizontal Lift","title":"Horizontal Lift","text":"Omega(Y V) = left(mathbbI - frac12right)VY^T - YV^T(mathbbI - frac12YY^T)","category":"page"},{"location":"optimizers/manifold_related/horizontal_lift/","page":"Horizontal Lift","title":"Horizontal Lift","text":"If the element Y is the distinct element E, then the elements of mathfrakg^mathrmhorE take a particularly simple form, see Global Tangent Space for a description of this. ","category":"page"},{"location":"optimizers/manifold_related/horizontal_lift/","page":"Horizontal Lift","title":"Horizontal Lift","text":"[1]: The orthogonal complement is taken with respect to a metric defined on mathfrakg. 
For the case of G=SO(N) and mathfrakg=mathfrakso(N) = AA+A^T =0 this metric can be chosen as (A_1A_2)mapstofrac12mathrmTr(A_1^TA_2).","category":"page"},{"location":"optimizers/manifold_related/retractions/#Retractions","page":"Retractions","title":"Retractions","text":"","category":"section"},{"location":"optimizers/manifold_related/retractions/#Classical-Definition","page":"Retractions","title":"Classical Definition","text":"","category":"section"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"Classically, retractions are defined as smooth maps ","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"R TmathcalMtomathcalM(xv)mapstoR_x(v)","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"such that each curve c(t) = R_x(tv) satisfies c(0) = x and c'(0) = v.","category":"page"},{"location":"optimizers/manifold_related/retractions/#In-GeometricMachineLearning","page":"Retractions","title":"In GeometricMachineLearning","text":"","category":"section"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"Retractions are a map from the horizontal component of the Lie algebra mathfrakg^mathrmhor to the respective manifold.","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"For optimization in neural networks (almost always first order) we solve a gradient flow equation ","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"dotW = -mathrmgrad_WL ","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"where mathrmgrad_WL is the Riemannian gradient of the loss function L evaluated at position 
W.","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"If we deal with Euclidean spaces (vector spaces), then the Riemannian gradient is just the result of an AD routine and the solution of the equation above can be approximated with W^t+1 gets W^t - etanabla_W^tL, where eta is the learning rate. ","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"For manifolds, after we obtained the Riemannian gradient (see e.g. the section on Stiefel manifold), we have to solve a geodesic equation. This is a canonical ODE associated with any Riemannian manifold. ","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"The general theory of Riemannian manifolds is rather complicated, but for the neural networks treated in GeometricMachineLearning, we only rely on optimization of matrix Lie groups and homogeneous spaces, which is much simpler. ","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"For Lie groups each tangent space is isomorphic to its Lie algebra mathfrakgequivT_mathbbIG. The geodesic map from mathfrakg to G, for matrix Lie groups with bi-invariant Riemannian metric like SO(N), is simply the application of the matrix exponential exp. 
Alternatively this can be replaced by the Cayley transform (see (Absil et al, 2008).)","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"Starting from this basic map expmathfrakgtoG we can build mappings for more complicated cases: ","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"General tangent space to a Lie group T_AG: The geodesic map for an element VinT_AG is simply Aexp(A^-1V).\nSpecial tangent space to a homogeneous space T_EmathcalM: For V=BEinT_EmathcalM the exponential map is simply exp(B)E. \nGeneral tangent space to a homogeneous space T_YmathcalM with Y = AE: For Delta=ABEinT_YmathcalM the exponential map is simply Aexp(B)E. This is the general case which we deal with. ","category":"page"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"The general theory behind points 2. and 3. is discussed in chapter 11 of (O'Neill, 1983). The function retraction in GeometricMachineLearning performs mathfrakg^mathrmhortomathcalM, which is the second of the above points. To get the third from the second point, we simply have to multiply with a matrix from the left. This step is done with apply_section and represented through the red vertical line in the diagram on the general optimizer framework.","category":"page"},{"location":"optimizers/manifold_related/retractions/#Word-of-caution","page":"Retractions","title":"Word of caution","text":"","category":"section"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"The Lie group corresponding to the Stiefel manifold SO(N) has a bi-invariant Riemannian metric associated with it: (B_1B_2)mapsto mathrmTr(B_1^TB_2). For other Lie groups (e.g. 
the symplectic group) the situation is slightly more difficult (see (Bendokat et al, 2021).)","category":"page"},{"location":"optimizers/manifold_related/retractions/#References","page":"Retractions","title":"References","text":"","category":"section"},{"location":"optimizers/manifold_related/retractions/","page":"Retractions","title":"Retractions","text":"Absil P A, Mahony R, Sepulchre R. Optimization algorithms on matrix manifolds[M]. Princeton University Press, 2008.\nBendokat T, Zimmermann R. The real symplectic Stiefel and Grassmann manifolds: metrics, geodesics and applications[J]. arXiv preprint arXiv:2108.12447, 2021.\nO'Neill, Barrett. Semi-Riemannian geometry with applications to relativity. Academic press, 1983.","category":"page"},{"location":"pullbacks/computation_of_pullbacks/#How-to-compute-pullbacks","page":"Pullbacks","title":"How to compute pullbacks","text":"","category":"section"},{"location":"pullbacks/computation_of_pullbacks/","page":"Pullbacks","title":"Pullbacks","text":"GeometricMachineLearning has many pullbacks for custom array types and other operations implemented. The need for this essentially comes from the fact that we cannot trivially differentiate custom GPU kernels at the moment[1].","category":"page"},{"location":"pullbacks/computation_of_pullbacks/","page":"Pullbacks","title":"Pullbacks","text":"[1]: This will change once we switch to Enzyme (see [9]), but the package is still in its infancy. ","category":"page"},{"location":"pullbacks/computation_of_pullbacks/#What-is-a-pullback?","page":"Pullbacks","title":"What is a pullback?","text":"","category":"section"},{"location":"pullbacks/computation_of_pullbacks/","page":"Pullbacks","title":"Pullbacks","text":"Here we first explain the principle of a pullback with the example of a vector-valued function. The generalization to matrices and higher-order tensors is straight-forward. 
","category":"page"},{"location":"pullbacks/computation_of_pullbacks/","page":"Pullbacks","title":"Pullbacks","text":"The pullback of a vector-valued function fmathbbR^ntomathbbR^m can be interpreted as the sensitivities in the input space mathbbR^n with respect to variations in the output space mathbbR^m via the function f: ","category":"page"},{"location":"pullbacks/computation_of_pullbacks/","page":"Pullbacks","title":"Pullbacks","text":"leftmathrmpullback(f)ainmathbbR^n dbinmathbbR^mright_i = sum_j=1^mfracpartialf_jpartiala_idb_j","category":"page"},{"location":"pullbacks/computation_of_pullbacks/","page":"Pullbacks","title":"Pullbacks","text":"This principle can easily be generalized to matrices. For this consider the function gmathbbR^n_1timesn_2tomathbbR^m_1timesm_2. For this case we have: ","category":"page"},{"location":"pullbacks/computation_of_pullbacks/","page":"Pullbacks","title":"Pullbacks","text":"leftmathrmpullback(g)AinmathbbR^n_1timesn_2 dBinmathbbR^m_1timesm_2right_(i_1 i_2) = sum_j_1=1^m_1sum_j_2=1^m_2fracpartialf_(j_1 j_2)partiala_(i_1 i_2)db_(j_1 j_2)","category":"page"},{"location":"pullbacks/computation_of_pullbacks/","page":"Pullbacks","title":"Pullbacks","text":"The generalization to higher-order tensors is again straight-forward.","category":"page"},{"location":"pullbacks/computation_of_pullbacks/#Illustrative-example","page":"Pullbacks","title":"Illustrative example","text":"","category":"section"},{"location":"pullbacks/computation_of_pullbacks/","page":"Pullbacks","title":"Pullbacks","text":"Consider the matrix inverse mathrminv mathbbR^ntimesntomathbbR^ntimesn as an example. This fits into the above framework where inv is a matrix-valued function from mathbbR^ntimesn to mathbbR^ntimesn. We here write B = A^-1 = mathrminv(A). 
We thus have to compute: ","category":"page"},{"location":"pullbacks/computation_of_pullbacks/","page":"Pullbacks","title":"Pullbacks","text":"leftmathrmpullback(mathrminv)AinmathbbR^ntimesn dBinmathbbR^ntimesnright_(i j) = sum_k=1^nsum_ell=1^nfracpartialb_k ellpartiala_i jdb_k ell","category":"page"},{"location":"pullbacks/computation_of_pullbacks/","page":"Pullbacks","title":"Pullbacks","text":"For a matrix A that depends on a parameter varepsilon we have that: ","category":"page"},{"location":"pullbacks/computation_of_pullbacks/","page":"Pullbacks","title":"Pullbacks","text":"fracpartialpartialvarepsilonB = -Bleft( fracpartialpartialvarepsilon A right) B","category":"page"},{"location":"pullbacks/computation_of_pullbacks/","page":"Pullbacks","title":"Pullbacks","text":"This can easily be checked: ","category":"page"},{"location":"pullbacks/computation_of_pullbacks/","page":"Pullbacks","title":"Pullbacks","text":"mathbbO = fracpartialpartialvarepsilonmathbbI = fracpartialpartialvarepsilon(AB) = AfracpartialpartialvarepsilonB + left(fracpartialpartialvarepsilonAright)B","category":"page"},{"location":"pullbacks/computation_of_pullbacks/","page":"Pullbacks","title":"Pullbacks","text":"We can then write: ","category":"page"},{"location":"pullbacks/computation_of_pullbacks/","page":"Pullbacks","title":"Pullbacks","text":"beginaligned\nsum_kellleft( fracpartialpartiala_ij b_kell right) db_kell = sum_kellleft fracpartialpartiala_ij B right_kell db_kell \n = - sum_kellleftB left(fracpartialpartiala_ij Aright) B right_kell db_kell \n = - sum_kellmnb_km left(fracpartiala_mnpartiala_ijright) b_nell db_kell \n = - sum_kellmnb_km delta_imdelta_jn b_nell db_kell \n = - sum_kellb_ki b_jell db_kell \n equiv - B^TcdotdBcdotB^T \nendaligned","category":"page"},{"location":"pullbacks/computation_of_pullbacks/#Motivation-from-a-differential-geometric-perspective","page":"Pullbacks","title":"Motivation from a differential-geometric 
perspective","text":"","category":"section"},{"location":"pullbacks/computation_of_pullbacks/","page":"Pullbacks","title":"Pullbacks","text":"The notions of a pullback in automatic differentiation and differential geometry are closely related (see e.g. [10] and [11]). In both cases we want to compute, based on a mapping fmathcalVtomathcalW a mapsto f(a) = b, a map of differentials db mapsto da. In the differential geometry case db and da are part of the associated cotangent spaces, i.e. dbinT^*_bmathcalW and dainT^*_amathcalV; in AD we (mostly) deal with spaces of arrays, i.e. vector spaces, which means that dbinmathcalW and dainmathcalV.","category":"page"},{"location":"pullbacks/computation_of_pullbacks/","page":"Pullbacks","title":"Pullbacks","text":"M. Betancourt. A geometric theory of higher-order automatic differentiation, arXiv preprint arXiv:1812.11592 (2018).\n\n\n\nJ. Bolte and E. Pauwels. A mathematical model for automatic differentiation in machine learning. Advances in Neural Information Processing Systems 33, 10809–10819 (2020).\n\n\n\n","category":"page"},{"location":"reduced_order_modeling/autoencoder/#Reduced-Order-modeling-and-Autoencoders","page":"POD and Autoencoders","title":"Reduced Order modeling and Autoencoders","text":"","category":"section"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"Reduced order modeling is a data-driven technique that exploits the structure of parametric PDEs to make solving those PDEs easier.","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"Consider a parametric PDE written in the form: F(z(mu)mu)=0 where z(mu) evolves on an infinite-dimensional Hilbert space V. 
","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"In modeling any PDE we have to choose a discretization (particle discretization, finite element method, ...) of V, which will be denoted by V_h. ","category":"page"},{"location":"reduced_order_modeling/autoencoder/#Solution-manifold","page":"POD and Autoencoders","title":"Solution manifold","text":"","category":"section"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"To any parametric PDE we associate a solution manifold: ","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"mathcalM = z(mu)F(z(mu)mu)=0 muinmathbbP","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"(Image: )","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"In the image above a 2-dimensional solution manifold is visualized as a sub-manifold in 3-dimensional space. In general the embedding space is an infinite-dimensional function space.","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"As an example of this consider the 1-dimensional wave equation: ","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"partial_tt^2q(tximu) = mu^2partial_xixi^2q(tximu)text on ItimesOmega","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"where I = (01) and Omega=(-1212). 
As initial condition for the first derivative we have partial_tq(0ximu) = -mupartial_xiq_0(ximu) and furthermore q(tximu)=0 on the boundary (i.e. xiin-1212).","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"The solution manifold is a 1-dimensional submanifold: ","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"mathcalM = (t xi)mapstoq(tximu)=q_0(xi-mutmu)muinmathbbPsubsetmathbbR","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"If we provide an initial condition u_0, a parameter instance mu and a time t, then ximapstoq(tximu) will be the momentary solution. If we consider the time evolution of q(tximu), then it evolves on a two-dimensional submanifold barmathcalM = ximapstoq(tximu)tinImuinmathbbP.","category":"page"},{"location":"reduced_order_modeling/autoencoder/#General-workflow","page":"POD and Autoencoders","title":"General workflow","text":"","category":"section"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"In reduced order modeling we aim to construct a mapping to a space that is close to this solution manifold. This is done through the following steps: ","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"Discretize the PDE.\nSolve the discretized PDE for a certain set of parameter instances muinmathbbP.\nBuild a reduced basis with the data obtained from having solved the discretized PDE. 
This step consists of finding two mappings: the reduction mathcalP and the reconstruction mathcalR.","category":"page"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"The third step can be done with various machine learning (ML) techniques. Traditionally the most popular of these has been Proper orthogonal decomposition (POD), but in recent years autoencoders have also become a popular alternative (see (Fresca et al, 2021)). ","category":"page"},{"location":"reduced_order_modeling/autoencoder/#References","page":"POD and Autoencoders","title":"References","text":"","category":"section"},{"location":"reduced_order_modeling/autoencoder/","page":"POD and Autoencoders","title":"POD and Autoencoders","text":"S. Fresca, L. Dede’ and A. Manzoni. A comprehensive deep learning-based approach to reduced order modeling of nonlinear time-dependent parametrized PDEs. Journal of Scientific Computing 87, 1–36 (2021).\n\n\n\n","category":"page"},{"location":"manifolds/manifolds/#(Matrix)-Manifolds","page":"General Theory on Manifolds","title":"(Matrix) Manifolds","text":"","category":"section"},{"location":"manifolds/manifolds/","page":"General Theory on Manifolds","title":"General Theory on Manifolds","text":"Manifolds are topological spaces that locally look like vector spaces. In the following we restrict ourselves to finite-dimensional manifolds. Definition: A finite-dimensional smooth manifold of dimension n is a second-countable Hausdorff space mathcalM for which forallxinmathcalM we can find a neighborhood U that contains x and a corresponding homeomorphism varphi_UUcongWsubsetmathbbR^n where W is an open subset. The homeomorphisms varphi_U are referred to as coordinate charts. If two such coordinate charts overlap, i.e. 
if U_1capU_2neqemptyset, then the map varphi_U_2circvarphi_U_1^-1 is C^infty.","category":"page"},{"location":"manifolds/manifolds/","page":"General Theory on Manifolds","title":"General Theory on Manifolds","text":"One example of a manifold that is also important for GeometricMachineLearning.jl is the Lie group[1] of special orthogonal matrices SO(N). Before we can prove that SO(N) is a manifold we first need another definition and a theorem:","category":"page"},{"location":"manifolds/manifolds/","page":"General Theory on Manifolds","title":"General Theory on Manifolds","text":"[1]: Lie groups are manifolds that also have a group structure, i.e. there is an operation mathcalMtimesmathcalMtomathcalM(ab)mapstoab s.t. (ab)c = a(bc) and existseinmathcalM s.t. ae = a forallainmathcalM.","category":"page"},{"location":"manifolds/manifolds/","page":"General Theory on Manifolds","title":"General Theory on Manifolds","text":"Definition: Consider a smooth mapping g mathcalMtomathcalN from one manifold to another. A point BinmathcalN is called a regular value of g if forallAing^-1B the map T_AgT_AmathcalMtoT_g(A)mathcalN is surjective. ","category":"page"},{"location":"manifolds/manifolds/","page":"General Theory on Manifolds","title":"General Theory on Manifolds","text":"Theorem: Consider a smooth map gmathcalMtomathcalN from one manifold to another. Then the preimage of a regular value BinmathcalN is a submanifold of mathcalM. Furthermore the codimension of g^-1B is equal to the dimension of mathcalN and the tangent space T_A(g^-1B) is equal to the kernel of T_Ag. This is known as the preimage theorem.","category":"page"},{"location":"manifolds/manifolds/","page":"General Theory on Manifolds","title":"General Theory on Manifolds","text":"Proof: ","category":"page"},{"location":"manifolds/manifolds/","page":"General Theory on Manifolds","title":"General Theory on Manifolds","text":"Theorem: The group SO(N) is a Lie group (i.e. has manifold structure). 
Proof: The vector space mathbbR^NtimesN clearly has manifold structure. The group SO(N) is equivalent to one of the level sets of the mapping: fmathbbR^NtimesNtomathcalS(N) AmapstoA^TA, i.e. it is the component of f^-1mathbbI that contains mathbbI. We still need to prove that mathbbI is a regular value of f, i.e. that for AinSO(N) the mapping T_Af is surjective. This means that forallBinmathcalS(N) AinSO(N) existsCinmathbbR^NtimesN s.t. C^TA + A^TC = B. The element C=frac12ABinmathbbR^NtimesN satisfies this property.","category":"page"},{"location":"manifolds/manifolds/","page":"General Theory on Manifolds","title":"General Theory on Manifolds","text":"With the definition above we can generalize the notion of an ordinary differential equation (ODE) on a vector space to an ordinary differential equation on a manifold:","category":"page"},{"location":"manifolds/manifolds/","page":"General Theory on Manifolds","title":"General Theory on Manifolds","text":"Definition: An ODE on a manifold is a mapping that assigns to each element of the manifold AinmathcalM an element of the corresponding tangent space T_AmathcalM.","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/#Horizontal-component-of-the-Lie-algebra-\\mathfrak{g}","page":"Stiefel Global Tangent Space","title":"Horizontal component of the Lie algebra mathfrakg","text":"","category":"section"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"What we use to generalize Adam (and other algorithms) to manifolds is a global tangent space representation of the homogeneous spaces. 
","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"For the Stiefel manifold, this global tangent space representation takes a simple form: ","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"mathcalB = beginbmatrix\n A -B^T \n B mathbbO\nendbmatrix","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"where AinmathbbR^ntimesn is skew-symmetric and BinmathbbR^(N-n)timesn is arbitrary. In GeometricMachineLearning the struct StiefelLieAlgHorMatrix implements elements of this form.","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/#Theoretical-background","page":"Stiefel Global Tangent Space","title":"Theoretical background","text":"","category":"section"},{"location":"arrays/stiefel_lie_alg_horizontal/#Vertical-and-horizontal-components","page":"Stiefel Global Tangent Space","title":"Vertical and horizontal components","text":"","category":"section"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"The Stiefel manifold St(n N) is a homogeneous space obtained from SO(N) by identifying two matrices whose first n columns coincide. 
Another way of expressing this is: ","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"A_1 sim A_2 iff A_1E = A_2E","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"for ","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"E = beginbmatrix mathbbI mathbbOendbmatrix","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"Because St(nN) is a homogeneous space, SO(N) acts transitively on it, i.e. starting from any element YinSt(nN) the action of SO(N) can produce any other element of St(nN). A similar statement is also true regarding the tangent spaces of St(nN), namely: ","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"T_YSt(nN) = mathfrakgcdotY","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"i.e. every tangent space can be expressed through an action of the associated Lie algebra. ","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"The kernel of the mapping mathfrakgtoT_YSt(nN) BmapstoBY is referred to as mathfrakg^mathrmverY, the vertical component of the Lie algebra at Y. 
In the case Y=E it is easy to see that elements belonging to mathfrakg^mathrmverE are of the following form: ","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"beginbmatrix\nhatmathbbO tildemathbbO^T \ntildemathbbO C\nendbmatrix","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"where hatmathbbOinmathbbR^ntimesn is a \"small\" zero matrix and tildemathbbOinmathbbR^(N-n)timesn is a bigger one. CinmathbbR^(N-n)times(N-n) is a skew-symmetric matrix. ","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"The orthogonal complement of the vertical component is referred to as the horizontal component and denoted by mathfrakg^mathrmhor Y. It is isomorphic to T_YSt(nN) and this isomorphism can be found explicitly. In the case of the Stiefel manifold: ","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"Omega(Y cdot)T_YSt(nN)tomathfrakg^mathrmhorY Delta mapsto (mathbbI - frac12YY^T)DeltaY^T - YDelta^T(mathbbI - frac12YY^T)","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"We write mathfrakg^mathrmhorE=mathfrakg^mathrmhor for the special case Y=E. 
Its elements are of the form described at the top of this page.","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/#Special-functions","page":"Stiefel Global Tangent Space","title":"Special functions","text":"","category":"section"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"You can also draw random elements from mathfrakg^mathrmhor through e.g. ","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"rand(CUDADevice(), StiefelLieAlgHorMatrix{Float32}, 10, 5)","category":"page"},{"location":"arrays/stiefel_lie_alg_horizontal/","page":"Stiefel Global Tangent Space","title":"Stiefel Global Tangent Space","text":"In this example: N=10 and n=5.","category":"page"},{"location":"reduced_order_modeling/projection_reduction_errors/#Projection-and-Reduction-Errors-of-Reduced-Models","page":"Projection and Reduction Error","title":"Projection and Reduction Errors of Reduced Models","text":"","category":"section"},{"location":"reduced_order_modeling/projection_reduction_errors/","page":"Projection and Reduction Error","title":"Projection and Reduction Error","text":"Two errors of central importance in reduced order modeling are the projection error and the reduction error. During training one typically aims to minimize the projection error, but for the actual application of the model the reduction error is often more important. 
","category":"page"},{"location":"reduced_order_modeling/projection_reduction_errors/#Projection-Error","page":"Projection and Reduction Error","title":"Projection Error","text":"","category":"section"},{"location":"reduced_order_modeling/projection_reduction_errors/","page":"Projection and Reduction Error","title":"Projection and Reduction Error","text":"The projection error computes how well a reduced basis, represented by the reduction mathcalP and the reconstruction mathcalR, can represent the data with which it is build. In mathematical terms: ","category":"page"},{"location":"reduced_order_modeling/projection_reduction_errors/","page":"Projection and Reduction Error","title":"Projection and Reduction Error","text":"e_mathrmproj(mu) = \n frac mathcalRcircmathcalP(M) - M M ","category":"page"},{"location":"reduced_order_modeling/projection_reduction_errors/","page":"Projection and Reduction Error","title":"Projection and Reduction Error","text":"where cdot is the Frobenius norm (one could also optimize for different norms).","category":"page"},{"location":"reduced_order_modeling/projection_reduction_errors/#Reduction-Error","page":"Projection and Reduction Error","title":"Reduction Error","text":"","category":"section"},{"location":"reduced_order_modeling/projection_reduction_errors/","page":"Projection and Reduction Error","title":"Projection and Reduction Error","text":"The reduction error measures how far the reduced system diverges from the full-order system during integration (online stage). 
In mathematical terms (and for a single initial condition): ","category":"page"},{"location":"reduced_order_modeling/projection_reduction_errors/","page":"Projection and Reduction Error","title":"Projection and Reduction Error","text":"e_mathrmred(mu) = sqrt\n fracsum_t=0^K mathbfx^(t)(mu) - mathcalR(mathbfx^(t)_r(mu)) ^2sum_t=0^K mathbfx^(t)(mu) ^2\n","category":"page"},{"location":"reduced_order_modeling/projection_reduction_errors/","page":"Projection and Reduction Error","title":"Projection and Reduction Error","text":"where mathbfx^(t) is the solution of the FOM at point t and mathbfx^(t)_r is the solution of the ROM (in the reduced basis) at point t. The reduction error, as opposed to the projection error, not only measures how well the solution manifold is represented by the reduced basis, but also measures how well the FOM dynamics are approximated by the ROM dynamics (via the induced vector field on the reduced basis).","category":"page"},{"location":"library/","page":"Library","title":"Library","text":"CurrentModule = GeometricMachineLearning","category":"page"},{"location":"library/#GeometricMachineLearning-Library-Functions","page":"Library","title":"GeometricMachineLearning Library Functions","text":"","category":"section"},{"location":"library/","page":"Library","title":"Library","text":"Modules = [GeometricMachineLearning]","category":"page"},{"location":"library/#AbstractNeuralNetworks.Chain-Union{Tuple{GSympNet{AT, true}}, Tuple{AT}} where AT","page":"Library","title":"AbstractNeuralNetworks.Chain","text":"Chain can also be called with a neural network as input.\n\n\n\n\n\n","category":"method"},{"location":"library/#AbstractNeuralNetworks.Chain-Union{Tuple{LASympNet{AT, false, false}}, Tuple{AT}} where AT","page":"Library","title":"AbstractNeuralNetworks.Chain","text":"Build a chain for an LASympnet for which init_upper_linear is false and init_upper_act is 
false.\n\n\n\n\n\n","category":"method"},{"location":"library/#AbstractNeuralNetworks.Chain-Union{Tuple{LASympNet{AT, false, true}}, Tuple{AT}} where AT","page":"Library","title":"AbstractNeuralNetworks.Chain","text":"Build a chain for an LASympnet for which init_upper_linear is false and init_upper_act is true.\n\n\n\n\n\n","category":"method"},{"location":"library/#AbstractNeuralNetworks.Chain-Union{Tuple{LASympNet{AT, true, false}}, Tuple{AT}} where AT","page":"Library","title":"AbstractNeuralNetworks.Chain","text":"Build a chain for an LASympnet for which init_upper_linear is true and init_upper_act is false.\n\n\n\n\n\n","category":"method"},{"location":"library/#AbstractNeuralNetworks.Chain-Union{Tuple{LASympNet{AT, true, true}}, Tuple{AT}} where AT","page":"Library","title":"AbstractNeuralNetworks.Chain","text":"Build a chain for an LASympnet for which init_upper_linear is true and init_upper_act is true.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.AbstractCache","page":"Library","title":"GeometricMachineLearning.AbstractCache","text":"AbstractCache has subtypes: \n\nAdamCache\nMomentumCache\nGradientCache\nBFGSCache\n\nAll of them can be initialized with providing an array (also supporting manifold types).\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.AbstractLieAlgHorMatrix","page":"Library","title":"GeometricMachineLearning.AbstractLieAlgHorMatrix","text":"AbstractLieAlgHorMatrix is a supertype for various horizontal components of Lie algebras. We usually call this mathfrakg^mathrmhor.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.AbstractRetraction","page":"Library","title":"GeometricMachineLearning.AbstractRetraction","text":"AbstractRetraction is a type that comprises all retraction methods for manifolds. 
For every manifold layer one has to specify a retraction method that takes the layer and elements of the (global) tangent space.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.ActivationLayer","page":"Library","title":"GeometricMachineLearning.ActivationLayer","text":"ActivationLayer is the struct corresponding to the constructors ActivationLayerQ and ActivationLayerP. See those for more information.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.ActivationLayerP-Tuple{Any, Any}","page":"Library","title":"GeometricMachineLearning.ActivationLayerP","text":"Performs:\n\nbeginpmatrix\n q p\nendpmatrix mapsto \nbeginpmatrix\n q p + mathrmdiag(a)sigma(q)\nendpmatrix\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.ActivationLayerQ-Tuple{Any, Any}","page":"Library","title":"GeometricMachineLearning.ActivationLayerQ","text":"Performs:\n\nbeginpmatrix\n q p\nendpmatrix mapsto \nbeginpmatrix\n q + mathrmdiag(a)sigma(p) p\nendpmatrix\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.AdamOptimizer","page":"Library","title":"GeometricMachineLearning.AdamOptimizer","text":"Defines the Adam Optimizer. Algorithm and suggested defaults are taken from (Goodfellow et al., 2016, page 301), except for δ, because single precision is used!\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.AdamOptimizerWithDecay","page":"Library","title":"GeometricMachineLearning.AdamOptimizerWithDecay","text":"Defines the Adam Optimizer with weight decay.\n\nConstructors\n\nThe default constructor takes as input: \n\nn_epochs::Int\nη₁: the learning rate at the start \nη₂: the learning rate at the end \nρ₁: the decay parameter for the first moment \nρ₂: the decay parameter for the second moment\nδ: the safety parameter \nT (keyword argument): the type. \n\nThe second constructor is called with: \n\nn_epochs::Int\nT\n\n... 
the rest are keyword arguments\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.BFGSCache","page":"Library","title":"GeometricMachineLearning.BFGSCache","text":"The cache for the BFGS optimizer.\n\nIt stores an array for the previous time step B and the inverse of the Hessian matrix H.\n\nIt is important to note that setting up this cache already requires a derivative! This is not the case for the other optimizers.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.BFGSDummyCache","page":"Library","title":"GeometricMachineLearning.BFGSDummyCache","text":"In order to initialize BFGSCache we first need gradient information. This is why we initially have this BFGSDummyCache until gradient information is available.\n\nNOTE: we may not need this. \n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.BFGSOptimizer","page":"Library","title":"GeometricMachineLearning.BFGSOptimizer","text":"This is an implementation of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) optimizer. \n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.Batch","page":"Library","title":"GeometricMachineLearning.Batch","text":"Batch is a struct whose functor acts on an instance of DataLoader to produce a sequence of training samples for one epoch. \n\nThe Constructor\n\nThe constructor for Batch is called with: \n\nbatch_size::Int\nseq_length::Int (optional)\nprediction_window::Int (optional)\n\nThe first one of these arguments is required; it indicates the number of training samples in a batch. If we deal with time series data then we can additionally supply a sequence length and a prediction window as input arguments to Batch. These indicate the number of input vectors and the number of output vectors.\n\nThe functor\n\nAn instance of Batch can be called on an instance of DataLoader to produce a sequence of samples that contain all the input data, i.e. for training for one epoch. 
The output of applying batch::Batch to dl::DataLoader is a tuple of vectors of integers. Each of these vectors contains two integers: the first is the time index and the second one is the parameter index.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.BiasLayer","page":"Library","title":"GeometricMachineLearning.BiasLayer","text":"A bias layer that does nothing more than add a vector to the input. This is needed for LA-SympNets.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.Classification","page":"Library","title":"GeometricMachineLearning.Classification","text":"Classification Layer that takes a matrix as an input and returns a vector that is used for MNIST classification. \n\nIt has the following arguments: \n\nM: input dimension \nN: output dimension \nactivation: the activation function \n\nAnd the following optional argument: \n\naverage: If this is set to true, then the output is computed as frac1Nsum_i=1^Ninput_bulleti. If set to false (the default) it picks the last column of the input. \n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.ClassificationTransformer","page":"Library","title":"GeometricMachineLearning.ClassificationTransformer","text":"This is a transformer neural network for classification purposes. At the moment this is only used for training on MNIST, but can in theory be used for any classification problem.\n\nIt has to be called with a DataLoader that stores an input and an output tensor. The optional arguments are: \n\nn_heads: The number of heads in the MultiHeadAttention (mha) layers. Default: 7.\nn_layers: The number of transformer layers. Default: 16.\nactivation: The activation function. Default: softmax.\nStiefel: Whether the matrices in the mha layers are on the Stiefel manifold. \nadd_connection: Whether the input is appended to the output of the mha layer. 
(skip connection)\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.DataLoader","page":"Library","title":"GeometricMachineLearning.DataLoader","text":"Data Loader is a struct that creates an instance based on a tensor (or different input format) and is designed to make training convenient. \n\nConstructor\n\nThe data loader can be called with various inputs:\n\nA single vector: If the data loader is called with a single vector (and no other arguments are given), then this is interpreted as an autoencoder problem, i.e. the second axis indicates parameter values and/or time steps and the system has a single degree of freedom (i.e. the system dimension is one).\nA single matrix: If the data loader is called with a single matrix (and no other arguments are given), then this is interpreted as an autoencoder problem, i.e. the first axis is assumed to indicate the degrees of freedom of the system and the second axis indicates parameter values and/or time steps. \nA single tensor: If the data loader is called with a single tensor, then this is interpreted as an integration problem with the second axis indicating the time step and the third one indicating the parameters.\nA tensor and a vector: This is a special case (MNIST classification problem). For the MNIST problem for example the input are n_p matrices (first input argument) and n_p integers (second input argument).\nA NamedTuple with fields q and p: The NamedTuple contains (i) two matrices or (ii) two tensors. 
\nAn EnsembleSolution: The EnsembleSolution typically comes from GeometricProblems.\n\nWhen we supply a single vector or a single matrix as input to DataLoader and further set autoencoder = false (keyword argument), then the data are stored as an integration problem and the second axis is assumed to indicate time steps.\n\nFields of DataLoader\n\nThe fields of the DataLoader struct are the following: - input: The input data with axes (i) system dimension, (ii) number of time steps and (iii) number of parameters. - output: The tensor that contains the output (supervised learning) - this may be of type Nothing if the constructor is only called with one tensor (unsupervised learning). - input_dim: The dimension of the system, i.e. what is taken as input by a regular neural network. - input_time_steps: The length of the entire time series (length of the second axis). - n_params: The number of parameters that are present in the data set (length of third axis) - output_dim: The dimension of the output tensor (first axis). If output is of type Nothing, then this is also of type Nothing. - output_time_steps: The size of the second axis of the output tensor. If output is of type Nothing, then this is also of type Nothing.\n\nThe input and output fields of DataLoader\n\nEven though the arguments to the Constructor may be vectors or matrices, internally DataLoader always stores tensors.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.DataLoader-Union{Tuple{@NamedTuple{q::AT, p::AT}}, Tuple{AT}, Tuple{T}} where {T, AT<:AbstractMatrix{T}}","page":"Library","title":"GeometricMachineLearning.DataLoader","text":"Data Loader is a struct that creates an instance based on a tensor (or different input format) and is designed to make training convenient. 
\n\nConstructor\n\nThe data loader can be called with various inputs:\n\nA single vector: If the data loader is called with a single vector (and no other arguments are given), then this is interpreted as an autoencoder problem, i.e. the second axis indicates parameter values and/or time steps and the system has a single degree of freedom (i.e. the system dimension is one).\nA single matrix: If the data loader is called with a single matrix (and no other arguments are given), then this is interpreted as an autoencoder problem, i.e. the first axis is assumed to indicate the degrees of freedom of the system and the second axis indicates parameter values and/or time steps. \nA single tensor: If the data loader is called with a single tensor, then this is interpreted as an integration problem with the second axis indicating the time step and the third one indicating the parameters.\nA tensor and a vector: This is a special case (MNIST classification problem). For the MNIST problem for example the input are n_p matrices (first input argument) and n_p integers (second input argument).\nA NamedTuple with fields q and p: The NamedTuple contains (i) two matrices or (ii) two tensors. 
\nAn EnsembleSolution: The EnsembleSolution typically comes from GeometricProblems.\n\nWhen we supply a single vector or a single matrix as input to DataLoader and further set autoencoder = false (keyword argument), then the data are stored as an integration problem and the second axis is assumed to indicate time steps.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.DataLoader-Union{Tuple{GeometricSolutions.EnsembleSolution{T, T1, Vector{ST}}}, Tuple{ST}, Tuple{DT}, Tuple{T1}, Tuple{T}} where {T, T1, DT, ST<:(GeometricSolutions.GeometricSolution{T, T1, @NamedTuple{q::DT, p::DT}})}","page":"Library","title":"GeometricMachineLearning.DataLoader","text":"Constructor for EnsembleSolution from package GeometricSolutions with fields q and p.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.DataLoader-Union{Tuple{GeometricSolutions.EnsembleSolution{T, T1, Vector{ST}}}, Tuple{ST}, Tuple{DT}, Tuple{T1}, Tuple{T}} where {T, T1, DT, ST<:(GeometricSolutions.GeometricSolution{T, T1, @NamedTuple{q::DT}})}","page":"Library","title":"GeometricMachineLearning.DataLoader","text":"Constructor for EnsembleSolution from package GeometricSolutions with field q.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.GSympNet","page":"Library","title":"GeometricMachineLearning.GSympNet","text":"GSympNet is called with a single input argument, the system dimension, or with an instance of DataLoader. Optional input arguments are: \n\nupscaling_dimension::Int: The upscaling dimension of the gradient layer. See the documentation for GradientLayerQ and GradientLayerP for further explanation. The default is 2*dim.\nnhidden::Int: The number of hidden layers (i.e. layers that are not input or output layers). The default is 2.\nactivation: The activation function that is applied. By default this is tanh.\ninit_upper::Bool: Initialize the gradient layer so that it first modifies the q-component. 
The default is true.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.GlobalSection","page":"Library","title":"GeometricMachineLearning.GlobalSection","text":"This implements global sections for the Stiefel manifold and the Symplectic Stiefel manifold. \n\nIn practice this is implemented using Householder reflections, with the auxiliary column vectors given by: |0| |0| |.| |1| ith spot for i in (n+1) to N (or with random columns) |0| |.| |0|\n\nMaybe consider dividing the output in the check functions by n!\n\nImplement a general global section here!!!! Tₓ𝔐 → G×𝔤 !!!!!! (think about random initialization!)\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.GradientLayer","page":"Library","title":"GeometricMachineLearning.GradientLayer","text":"GradientLayer is the struct corresponding to the constructors GradientLayerQ and GradientLayerP. See those for more information.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.GradientLayerP-Tuple{Any, Any, Any}","page":"Library","title":"GeometricMachineLearning.GradientLayerP","text":"The gradient layer that changes the q component. It is of the form: \n\nbeginbmatrix\n mathbbI mathbbO nablaV mathbbI \nendbmatrix\n\nwith V(p) = sum_i=1^Ma_iSigma(sum_jk_ijp_j+b_i), where Sigma is the antiderivative of the activation function sigma (one-layer neural network). We refer to M as the upscaling dimension. Such layers are by construction symplectic.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.GradientLayerQ-Tuple{Any, Any, Any}","page":"Library","title":"GeometricMachineLearning.GradientLayerQ","text":"The gradient layer that changes the q component. It is of the form: \n\nbeginbmatrix\n mathbbI nablaV mathbbO mathbbI \nendbmatrix\n\nwith V(p) = sum_i=1^Ma_iSigma(sum_jk_ijp_j+b_i), where Sigma is the antiderivative of the activation function sigma (one-layer neural network). 
We refer to M as the upscaling dimension. Such layers are by construction symplectic.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.GradientOptimizer","page":"Library","title":"GeometricMachineLearning.GradientOptimizer","text":"Define the Gradient optimizer, i.e. W ← W - η*∇f(W) Or the riemannian manifold equivalent, if applicable.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.GrassmannLayer","page":"Library","title":"GeometricMachineLearning.GrassmannLayer","text":"Defines a layer that performs simple multiplication with an element of the Grassmann manifold.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.GrassmannLieAlgHorMatrix","page":"Library","title":"GeometricMachineLearning.GrassmannLieAlgHorMatrix","text":"This implements the horizontal component of a Lie algebra that is isomorphic to the Grassmann manifold. \n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.GrassmannManifold","page":"Library","title":"GeometricMachineLearning.GrassmannManifold","text":"The GrassmannManifold is based on the StiefelManifold\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.LASympNet","page":"Library","title":"GeometricMachineLearning.LASympNet","text":"LASympNet is called with a single input argument, the system dimension, or with an instance of DataLoader. Optional input arguments are: \n\ndepth::Int: The number of linear layers that are applied. The default is 5.\nnhidden::Int: The number of hidden layers (i.e. layers that are not input or output layers). The default is 2.\nactivation: The activation function that is applied. By default this is tanh.\ninit_upper_linear::Bool: Initialize the linear layer so that it first modifies the q-component. The default is true.\ninit_upper_act::Bool: Initialize the activation layer so that it first modifies the q-component. 
The default is true.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.LayerWithManifold","page":"Library","title":"GeometricMachineLearning.LayerWithManifold","text":"Additional types to make handling manifolds more readable.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.LinearLayer","page":"Library","title":"GeometricMachineLearning.LinearLayer","text":"LinearLayer is the struct corresponding to the constructors LinearLayerQ and LinearLayerP. See those for more information.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.LinearLayerP-Tuple{Any}","page":"Library","title":"GeometricMachineLearning.LinearLayerP","text":"Equivalent to a left multiplication by the matrix:\n\nbeginpmatrix\nmathbbI mathbbO \nB mathbbI\nendpmatrix \n\nwhere B is a symmetric matrix.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.LinearLayerQ-Tuple{Any}","page":"Library","title":"GeometricMachineLearning.LinearLayerQ","text":"Equivalent to a left multiplication by the matrix:\n\nbeginpmatrix\nmathbbI B \nmathbbO mathbbI\nendpmatrix \n\nwhere B is a symmetric matrix.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.LowerTriangular","page":"Library","title":"GeometricMachineLearning.LowerTriangular","text":"A lower-triangular matrix is an ntimesn matrix that has ones on the diagonal and zeros in the upper triangle.\n\nThe data are stored in a vector S similarly to SkewSymMatrix.\n\nThe struct has two fields: S and n. The first stores all the entries of the matrix in a sparse fashion (in a vector) and the second is the dimension n for AinmathbbR^ntimesn.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.Manifold","page":"Library","title":"GeometricMachineLearning.Manifold","text":"rand is implemented for manifolds that use the initialization of the StiefelManifold and the GrassmannManifold by default. 
\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.ManifoldLayer","page":"Library","title":"GeometricMachineLearning.ManifoldLayer","text":"This defines a manifold layer that has only one matrix-valued manifold A associated with it and performs xmapstoAx. \n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.MomentumOptimizer","page":"Library","title":"GeometricMachineLearning.MomentumOptimizer","text":"Define the Momentum optimizer, i.e. V ← αV - ∇f(W) W ← W + ηV Or the Riemannian manifold equivalent, if applicable.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.MultiHeadAttention","page":"Library","title":"GeometricMachineLearning.MultiHeadAttention","text":"MultiHeadAttention (MHA) serves as a preprocessing step in the transformer. It reweights the input vectors based on correlations within those data. \n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.Optimizer","page":"Library","title":"GeometricMachineLearning.Optimizer","text":"Optimizer struct that stores the 'method' (e.g. Adam with corresponding hyperparameters), the cache and the optimization step.\n\nIt takes as input an optimization method and the parameters of a network. \n\nFor technical reasons we first specify an OptimizerMethod that stores all the hyperparameters of the optimizer. \n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.Optimizer-Tuple{NeuralNetwork, DataLoader, Batch, Int64, GeometricMachineLearning.NetworkLoss}","page":"Library","title":"GeometricMachineLearning.Optimizer","text":"A functor for Optimizer. It is called with: - nn::NeuralNetwork - dl::DataLoader - batch::Batch - n_epochs::Int - loss\n\nThe last argument is a function through which Zygote differentiates. 
This argument is optional; if it is not supplied GeometricMachineLearning defaults to an appropriate loss for the DataLoader.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.Optimizer-Tuple{OptimizerMethod, NeuralNetwork}","page":"Library","title":"GeometricMachineLearning.Optimizer","text":"Typically the Optimizer is not initialized with the network parameters, but instead with a NeuralNetwork struct.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.PSDLayer","page":"Library","title":"GeometricMachineLearning.PSDLayer","text":"This is a PSD-like layer used for symplectic autoencoders. One layer has the following shape:\n\nA = beginbmatrix Phi mathbbO mathbbO Phi endbmatrix\n\nwhere Phi is an element of the Stiefel manifold St(n N).\n\nThe constructor of PSDLayer is called by PSDLayer(M, N; retraction=retraction): \n\nM is the input dimension.\nN is the output dimension. \nretraction is an instance of a struct with supertype AbstractRetraction. The only options at the moment are Geodesic() and Cayley().\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.ReducedSystem","page":"Library","title":"GeometricMachineLearning.ReducedSystem","text":"ReducedSystem computes the reconstructed dynamics in the full system based on the reduced one. 
Optionally it can be compared to the FOM solution.\n\nIt can be called using the following constructor: ReducedSystem(N, n, encoder, decoder, fullvectorfield, reducedvectorfield, params, tspan, tstep, ics, projection_error) where \n\nencoder: a function mathbbR^2NmapstomathbbR^2n\ndecoder: a (differentiable) function mathbbR^2nmapstomathbbR^2N\nfullvectorfield: a (differentiable) mapping defined the same way as in GeometricIntegrators \nreducedvectorfield: a (differentiable) mapping defined the same way as in GeometricIntegrators \nparams: a NamedTuple that parametrizes the vector fields (the same for fullvectorfield and reducedvectorfield)\ntspan: a tuple (t₀, tₗ) that specifies start and end point of the time interval over which integration is performed. \ntstep: the time step \nics: the initial condition for the big system.\nprojection_error: the error M - mathcalRcircmathcalP(M) where M is the snapshot matrix; mathcalP and mathcalR are the reduction and reconstruction respectively.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.RegularTransformerIntegrator","page":"Library","title":"GeometricMachineLearning.RegularTransformerIntegrator","text":"The regular transformer used as an integrator (multi-step method). \n\nThe constructor is called with the following arguments: \n\nsys_dim::Int\ntransformer_dim::Int: the default is transformer_dim = sys_dim.\nn_blocks::Int: The default is 1.\nn_heads::Int: the number of heads in the multihead attention layer (default is n_heads = sys_dim)\nL::Int the number of transformer blocks (default is L = 2).\nupscaling_activation: by default identity\nresnet_activation: by default tanh\nadd_connection::Bool=true (keyword argument): if the input should be added to the output.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.SkewSymMatrix","page":"Library","title":"GeometricMachineLearning.SkewSymMatrix","text":"A SkewSymMatrix is a matrix A s.t. 
A^T = -A.\n\nIf the constructor is called with a matrix as input it returns a skew-symmetric matrix via the projection A mapsto frac12(A - A^T). This is a projection defined via the canonical metric mathbbR^ntimesntimesmathbbR^ntimesntomathbbR (AB) mapsto mathrmTr(A^TB).\n\nThe first index is the row index, the second one the column index.\n\nThe struct has two fields: S and n. The first stores all the entries of the matrix in a sparse fashion (in a vector) and the second is the dimension n for AinmathbbR^ntimesn.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.StiefelLayer","page":"Library","title":"GeometricMachineLearning.StiefelLayer","text":"Defines a layer that performs simple multiplication with an element of the Stiefel manifold.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.StiefelLieAlgHorMatrix","page":"Library","title":"GeometricMachineLearning.StiefelLieAlgHorMatrix","text":"StiefelLieAlgHorMatrix is the horizontal component of the Lie algebra of skew-symmetric matrices (with respect to the canonical metric). The projection here is: (\\pi:S \\to SE ) where \n\nE = beginpmatrix mathbbI_n mathbbO_(N-n)timesn endpmatrix\n\nThe matrix (E) is implemented under StiefelProjection in GeometricMachineLearning.\n\nAn element of StiefelLieAlgHorMatrix takes the form: \n\nbeginpmatrix\nA -B^T B mathbbO\nendpmatrix\n\nwhere (A) is skew-symmetric (this is SkewSymMatrix in GeometricMachineLearning).\n\nIf the constructor is called with a big (N\\times{}N) matrix, then the projection is performed the following way: \n\nbeginpmatrix\nA B_1 \nB_2 D\nendpmatrix mapsto \nbeginpmatrix\nmathrmskew(A) -B_2^T \nB_2 mathbbO\nendpmatrix\n\nThe operation mathrmskewmathbbR^ntimesntomathcalS_mathrmskew(n) is the skew-symmetrization operation. 
This is equivalent to calling the constructor of SkewSymMatrix with an (n\\times{}n) matrix.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.StiefelManifold","page":"Library","title":"GeometricMachineLearning.StiefelManifold","text":"An implementation of the Stiefel manifold. It has various convenience functions associated with it:\n\ncheck \nrand \nrgrad\nmetric\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.StiefelProjection","page":"Library","title":"GeometricMachineLearning.StiefelProjection","text":"Outer constructor for StiefelProjection. This works with two integers as input and optionally the type.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.StiefelProjection-2","page":"Library","title":"GeometricMachineLearning.StiefelProjection","text":"An array that essentially does vcat(I(n), zeros(N-n, n)) with GPU support. It has three inner constructors. The first one is called with the following arguments: \n\nbackend: backends as supported by KernelAbstractions.\nT::Type\nN::Integer\nn::Integer\n\nThe second constructor is called by supplying a matrix as input. The constructor will then extract the backend, the type and the dimensions of that matrix. \n\nThe third constructor is called by supplying an instance of StiefelLieAlgHorMatrix. \n\nTechnically this should be a subtype of StiefelManifold. \n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.SymmetricMatrix","page":"Library","title":"GeometricMachineLearning.SymmetricMatrix","text":"A SymmetricMatrix A is a matrix A^T = A.\n\nIf the constructor is called with a matrix as input it returns a symmetric matrix via the projection:\n\nA mapsto frac12(A + A^T)\n\nThis is a projection defined via the canonical metric (AB) mapsto mathrmtr(A^TB).\n\nInternally the struct saves a vector S of size n(n+1)div2. 
The conversion is done the following way: \n\nA_ij = begincases S( (i-1) i ) div 2 + j textif igeqj \n S( (j-1) j ) div 2 + i textelse endcases\n\nSo S stores a string of vectors taken from A: S = tildea_1 tildea_2 ldots tildea_n with tildea_i = A_i1A_i2ldotsA_ii.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.SympNet","page":"Library","title":"GeometricMachineLearning.SympNet","text":"SympNet type encompasses GSympNets and LASympnets.\n\nTODO: -[ ] add bias to LASympNet!\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.SympNetLayer","page":"Library","title":"GeometricMachineLearning.SympNetLayer","text":"Implements the various layers from the SympNet paper: (https://www.sciencedirect.com/science/article/abs/pii/S0893608020303063). This is a super type of Gradient, Activation and Linear.\n\nFor the linear layer, the activation and the bias are left out, and for the activation layer K and b are left out!\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.SympNetLayer-Tuple{AbstractArray, Any}","page":"Library","title":"GeometricMachineLearning.SympNetLayer","text":"This is called when a SympnetLayer is applied to a NamedTuple. It calls apply_layer_to_nt_and_return_array.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.SymplecticPotential","page":"Library","title":"GeometricMachineLearning.SymplecticPotential","text":"SymplecticPotential(n)\n\nReturns a symplectic matrix of size 2n x 2n\n\nbeginpmatrix\nmathbbO mathbbI \nmathbbO -mathbbI \nendpmatrix\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.SystemType","page":"Library","title":"GeometricMachineLearning.SystemType","text":"Can specify a special type of the system, to be used with ReducedSystem. 
For now the only option is Symplectic (and NoStructure).\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.TrainingData","page":"Library","title":"GeometricMachineLearning.TrainingData","text":"TrainingData stores: \n\n - problem \n\n - shape \n\n - get \n\n - symbols \n\n - dim \n\n - noisemaker\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.TransformerIntegrator","page":"Library","title":"GeometricMachineLearning.TransformerIntegrator","text":"Encompasses various transformer architectures, such as the structure-preserving transformer and the linear symplectic transformer. \n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.TransformerLoss","page":"Library","title":"GeometricMachineLearning.TransformerLoss","text":"The loss for a transformer network (especially a transformer integrator). The constructor is called with:\n\nseq_length::Int\nprediction_window::Int (default is 1).\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.UpperTriangular","page":"Library","title":"GeometricMachineLearning.UpperTriangular","text":"An upper-triangular matrix is an ntimesn matrix that has ones on the diagonal and zeros on the upper triangular.\n\nThe data are stored in a vector S similarly to SkewSymMatrix.\n\nThe struct has two fields: S and n. 
The first stores all the entries of the matrix in a sparse fashion (in a vector) and the second is the dimension n for AinmathbbR^ntimesn.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.VolumePreservingAttention","page":"Library","title":"GeometricMachineLearning.VolumePreservingAttention","text":"Volume-preserving attention (single head attention)\n\nDrawbacks: \n\nthe super fast activation is only implemented for sequence lengths of 2, 3, 4 and 5.\nother sequence lengths only work on CPU for now (LU decomposition has to be implemented to work for tensors in parallel).\n\nConstructor\n\nThe constructor is called with: \n\ndim::Int: The system dimension \nseq_length::Int: The sequence length to be considered. The default is zero, i.e. arbitrary sequence lengths; this works for all sequence lengths but doesn't apply the super-fast activation. \nskew_sym::Bool (keyword argument): specifies if the weight matrix is skew symmetric or arbitrary (default is false).\n\nFunctor\n\nApplying a layer of type VolumePreservingAttention does the following: \n\nFirst we perform the operation X mapsto X^T A X = C, where XinmathbbR^Ntimesmathttseq_length is a matrix containing time series data and A is the skew symmetric matrix associated with the layer. \nIn a second step we compute the Cayley transform of C; Lambda = mathrmCayley(C).\nThe output of the layer is then XLambda.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.VolumePreservingFeedForward","page":"Library","title":"GeometricMachineLearning.VolumePreservingFeedForward","text":"Realizes a volume-preserving neural network as a combination of VolumePreservingLowerLayer and VolumePreservingUpperLayer. \n\nConstructor\n\nThe constructor is called with the following arguments: \n\nsys_dim::Int: The system dimension. \nn_blocks::Int: The number of blocks in the neural network (containing linear layers and nonlinear layers). 
Default is 1.\nn_linear::Int: The number of linear VolumePreservingLowerLayers and VolumePreservingUpperLayers in one block. Default is 1.\nactivation: The activation function for the nonlinear layers in a block. \ninit_upper::Bool=false (keyword argument): Specifies if the first layer is lower or upper. \n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.VolumePreservingFeedForwardLayer","page":"Library","title":"GeometricMachineLearning.VolumePreservingFeedForwardLayer","text":"Super-type of VolumePreservingLowerLayer and VolumePreservingUpperLayer. The layers do the following: \n\nx mapsto begincases sigma(Lx + b) textwhere L is mathttLowerTriangular sigma(Ux + b) textwhere U is mathttUpperTriangular endcases\n\nThe functor can be applied to a vector, a matrix or a tensor. \n\nConstructor\n\nThe constructors are called with:\n\nsys_dim::Int: the system dimension. \nactivation=tanh: the activation function. \ninclude_bias::Bool=true (keyword argument): specifies whether a bias should be used. \n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.VolumePreservingLowerLayer","page":"Library","title":"GeometricMachineLearning.VolumePreservingLowerLayer","text":"See the documentation for VolumePreservingFeedForwardLayer.\n\n\n\n\n\n","category":"type"},{"location":"library/#GeometricMachineLearning.VolumePreservingUpperLayer","page":"Library","title":"GeometricMachineLearning.VolumePreservingUpperLayer","text":"See the documentation for VolumePreservingFeedForwardLayer.\n\n\n\n\n\n","category":"type"},{"location":"library/#AbstractNeuralNetworks.update!-Union{Tuple{CT}, Tuple{T}, Tuple{Optimizer{<:BFGSOptimizer}, CT, AbstractArray{T}}} where {T, CT<:(BFGSCache{T, AT} where AT<:(AbstractArray{T}))}","page":"Library","title":"AbstractNeuralNetworks.update!","text":"Optimization for an entire neural network with BFGS. 
What is different in this case is that we still have to initialize the cache.\n\nIf o.step == 1, then we initialize the cache\n\n\n\n\n\n","category":"method"},{"location":"library/#Base.iterate-Union{Tuple{AT}, Tuple{T}, Tuple{NeuralNetwork{<:GeometricMachineLearning.TransformerIntegrator}, @NamedTuple{q::AT, p::AT}}} where {T, AT<:AbstractMatrix{T}}","page":"Library","title":"Base.iterate","text":"This function computes a trajectory for a Transformer that has already been trained for valuation purposes.\n\nIt takes as input: \n\nnn: a NeuralNetwork (that has been trained).\nics: initial conditions (a matrix in mathbbR^2ntimesmathttseq_length or NamedTuple of two matrices in mathbbR^ntimesmathttseq_length)\nn_points::Int=100 (keyword argument): The number of steps for which we run the prediction. \nprediction_window::Int=size(ics.q, 2): The prediction window (i.e. the number of steps we predict into the future) is equal to the sequence length (i.e. the number of input time steps) by default. \n\n\n\n\n\n","category":"method"},{"location":"library/#Base.iterate-Union{Tuple{BT}, Tuple{AT}, Tuple{T}, Tuple{NeuralNetwork{<:GeometricMachineLearning.NeuralNetworkIntegrator}, BT}} where {T, AT<:AbstractVector{T}, BT<:@NamedTuple{q::AT, p::AT}}","page":"Library","title":"Base.iterate","text":"This function computes a trajectory for a SympNet that has already been trained for valuation purposes.\n\nIt takes as input: \n\nnn: a NeuralNetwork (that has been trained).\nics: initial conditions (a NamedTuple of two vectors)\n\n\n\n\n\n","category":"method"},{"location":"library/#Base.vec-Tuple{GeometricMachineLearning.AbstractTriangular}","page":"Library","title":"Base.vec","text":"If vec is applied onto Triangular, then the output is the associated vector. \n\n\n\n\n\n","category":"method"},{"location":"library/#Base.vec-Tuple{SkewSymMatrix}","page":"Library","title":"Base.vec","text":"If vec is applied onto SkewSymMatrix, then the output is the associated vector. 
\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.Gradient","page":"Library","title":"GeometricMachineLearning.Gradient","text":"This is an old constructor and will be deprecated. For change_q=true it is equivalent to GradientLayerQ; for change_q=false it is equivalent to GradientLayerP.\n\nIf full_grad=false then ActivationLayer is called\n\n\n\n\n\n","category":"function"},{"location":"library/#GeometricMachineLearning.Transformer-Tuple{Integer, Integer, Integer}","page":"Library","title":"GeometricMachineLearning.Transformer","text":"The architecture for a \"transformer encoder\" is essentially taken from arXiv:2010.11929, but with the difference that no layer normalization is employed. This is because we still need to find a generalization of layer normalization to manifolds. \n\nThe transformer is called with the following inputs: \n\ndim: the dimension of the transformer \nn_heads: the number of heads \nL: the number of transformer blocks\n\nIn addition we have the following optional arguments: \n\nactivation: the activation function used for the ResNet (tanh by default)\nStiefel::Bool: if the matrices P^V, P^Q and P^K should live on a manifold (false by default)\nretraction: which retraction should be used (Geodesic() by default)\nadd_connection::Bool: if the input should be added to the output after the MultiHeadAttention layer is used (true by default)\nuse_bias::Bool: If the ResNet should use a bias (true by default)\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.accuracy-Union{Tuple{BT}, Tuple{AT}, Tuple{T1}, Tuple{T}, Tuple{Chain, Tuple, DataLoader{T, AT, BT}}} where {T, T1<:Integer, AT<:(AbstractArray{T}), BT<:(AbstractArray{T1})}","page":"Library","title":"GeometricMachineLearning.accuracy","text":"Computes the accuracy (as opposed to the loss) of a neural network classifier. 
\n\nIt takes as input:\n\nmodel::Chain\nps: parameters of the network\ndl::DataLoader\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.apply_layer_to_nt_and_return_array-Union{Tuple{M}, Tuple{AbstractArray, GeometricMachineLearning.SympNetLayer{M, M}, Any}} where M","page":"Library","title":"GeometricMachineLearning.apply_layer_to_nt_and_return_array","text":"This function is used in the wrappers where the input to the SympNet layers is not a NamedTuple (as it should be) but an AbstractArray.\n\nIt converts the Array to a NamedTuple (via assign_q_and_p), then calls the SympNet routine(s) and converts back to an AbstractArray (with vcat).\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.assign_batch_kernel!-Tuple{Any}","page":"Library","title":"GeometricMachineLearning.assign_batch_kernel!","text":"Takes as input a batch tensor (to which the data are assigned), the whole data tensor and two vectors params and time_steps that include the specific parameters and time steps we want to assign. \n\nNote that this assigns sequential data! For e.g. being processed by a transformer.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.assign_output_estimate-Union{Tuple{T}, Tuple{AbstractArray{T, 3}, Int64}} where T","page":"Library","title":"GeometricMachineLearning.assign_output_estimate","text":"The function assign_output_estimate is closely related to the transformer. It takes the last prediction_window columns of the output and uses them for the final prediction. 
i.e.\n\nmathbbR^NtimesmathttpwtomathbbR^Ntimesmathttpw \nbeginbmatrix \n z^(1)_1 cdots z^(T)_1 \n cdots cdots cdots \n z^(1)_n cdots z^(T)_n\n endbmatrix mapsto \n beginbmatrix \n z^(T - mathttpw)_1 cdots z^(T)_1 \n cdots cdots cdots \n z^(T - mathttpw)_n cdots z^(T)_nendbmatrix \n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.assign_output_kernel!-Tuple{Any}","page":"Library","title":"GeometricMachineLearning.assign_output_kernel!","text":"This should be used together with assign_batch_kernel!. It assigns the corresponding output (i.e. target).\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.assign_q_and_p-Tuple{AbstractVector, Int64}","page":"Library","title":"GeometricMachineLearning.assign_q_and_p","text":"Allocates two new arrays q and p whose first dimension is half of that of the input x. This should also be supplied through the second argument N.\n\nThe output is a Tuple containing q and p.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.augment_zeros_kernel!-Tuple{Any}","page":"Library","title":"GeometricMachineLearning.augment_zeros_kernel!","text":"Used for differentiating assignoutputestimate (this appears in the loss). \n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.compute_output_of_mha-Union{Tuple{T}, Tuple{M}, Tuple{MultiHeadAttention{M, M}, AbstractMatrix{T}, NamedTuple}} where {M, T}","page":"Library","title":"GeometricMachineLearning.compute_output_of_mha","text":"Applies MHA to an abstract matrix. This is the same independent of whether the input is added to the output or not. 
\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.convert_input_and_batch_indices_to_array-Union{Tuple{BT}, Tuple{AT}, Tuple{T}, Tuple{DataLoader{T, BT}, Batch, Vector{Tuple{Int64, Int64}}}} where {T, AT<:AbstractArray{T, 3}, BT<:@NamedTuple{q::AT, p::AT}}","page":"Library","title":"GeometricMachineLearning.convert_input_and_batch_indices_to_array","text":"Takes the output of the batch functor and uses it to create the corresponding array (NamedTuples). \n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.convert_input_and_batch_indices_to_array-Union{Tuple{BT}, Tuple{T}, Tuple{DataLoader{T, BT}, Batch, Vector{Tuple{Int64, Int64}}}} where {T, BT<:AbstractArray{T, 3}}","page":"Library","title":"GeometricMachineLearning.convert_input_and_batch_indices_to_array","text":"Takes the output of the batch functor and uses it to create the corresponding array. \n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.crop_array_for_transformer_loss-Union{Tuple{BT}, Tuple{AT}, Tuple{T2}, Tuple{T}, Tuple{AT, BT}} where {T, T2, AT<:AbstractArray{T, 3}, BT<:AbstractArray{T2, 3}}","page":"Library","title":"GeometricMachineLearning.crop_array_for_transformer_loss","text":"This crops the output array of the neural network so that it conforms with the output it should be compared to. This is needed for the transformer loss. 
\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.custom_mat_mul-Tuple{AbstractMatrix, AbstractVecOrMat}","page":"Library","title":"GeometricMachineLearning.custom_mat_mul","text":"Multiplies a matrix with a vector, a matrix or a tensor.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.draw_batch!-Union{Tuple{T}, Tuple{AbstractMatrix{T}, AbstractMatrix{T}}} where T","page":"Library","title":"GeometricMachineLearning.draw_batch!","text":"This assigns the batch if the data are in form of a matrix.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.init_optimizer_cache-Tuple{GradientOptimizer, Any}","page":"Library","title":"GeometricMachineLearning.init_optimizer_cache","text":"Wrapper for the functions setup_adam_cache, setup_momentum_cache, setup_gradient_cache, setup_bfgs_cache. These appear outside of optimizer_caches.jl because the OptimizerMethods first have to be defined.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.initialize_hessian_inverse-Union{Tuple{AbstractArray{T}}, Tuple{T}} where T","page":"Library","title":"GeometricMachineLearning.initialize_hessian_inverse","text":"This initializes the inverse of the Hessian for various arrays. This requires an implementation of a vectorization operation vec. 
This is important for custom arrays.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.loss-Tuple{NeuralNetwork, Vararg{Any}}","page":"Library","title":"GeometricMachineLearning.loss","text":"Wrapper if we deal with a neural network.\n\nYou can supply an instance of NeuralNetwork instead of the two arguments model (of type Union{Chain, AbstractExplicitLayer}) and parameters (of type Union{Tuple, NamedTuple}).\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.loss-Union{Tuple{BT}, Tuple{AT}, Tuple{T1}, Tuple{T}, Tuple{Union{AbstractNeuralNetworks.AbstractExplicitLayer, Chain}, Union{Tuple, NamedTuple}, AT, BT}} where {T, T1, AT<:(AbstractArray{T}), BT<:(AbstractArray{T1})}","page":"Library","title":"GeometricMachineLearning.loss","text":"Computes the loss for a neural network and a data set. The computed loss is \n\noutput - mathcalNN(input)_Foutput_F\n\nwhere A_F = sqrtsum_i_1ldotsi_ka_i_1ldotsi_k^2^2 is the Frobenius norm.\n\nIt takes as input: \n\nmodel::Union{Chain, AbstractExplicitLayer}\nps::Union{Tuple, NamedTuple}\ninput::Union{Array, NamedTuple}\noutput::Union{Array, NamedTuple}\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.loss-Union{Tuple{BT}, Tuple{AT}, Tuple{T1}, Tuple{T}, Tuple{Union{AbstractNeuralNetworks.AbstractExplicitLayer, Chain}, Union{Tuple, NamedTuple}, DataLoader{T, AT, BT}}} where {T, T1, AT<:AbstractArray{T, 3}, BT<:AbstractArray{T1, 3}}","page":"Library","title":"GeometricMachineLearning.loss","text":"Alternative call of the loss function. 
This takes as input: \n\nmodel\nps\ndl::DataLoader\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.loss-Union{Tuple{BT}, Tuple{T}, Tuple{Union{AbstractNeuralNetworks.AbstractExplicitLayer, Chain}, Union{Tuple, NamedTuple}, BT}} where {T, BT<:(AbstractArray{T})}","page":"Library","title":"GeometricMachineLearning.loss","text":"The autoencoder loss:\n\noutput - mathcalNN(input)_Foutput_F\n\nIt takes as input: \n\nmodel::Union{Chain, AbstractExplicitLayer}\nps::Union{Tuple, NamedTuple}\ninput::Union{Array, NamedTuple}\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.map_index_for_symplectic_potential-Tuple{Int64, Int64}","page":"Library","title":"GeometricMachineLearning.map_index_for_symplectic_potential","text":"This assigns the right index for the symplectic potential. To be used with assign_ones_for_symplectic_potential_kernel!.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.mat_tensor_mul-Union{Tuple{AT}, Tuple{ST}, Tuple{BT}, Tuple{T}, Tuple{AT, AbstractArray{T, 3}}} where {T, BT<:(AbstractArray{T}), ST<:StiefelManifold{T, BT}, AT<:LinearAlgebra.Adjoint{T, ST}}","page":"Library","title":"GeometricMachineLearning.mat_tensor_mul","text":"Extend mat_tensor_mul to a multiplication by the adjoint of an element of StiefelManifold. \n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.mat_tensor_mul-Union{Tuple{T}, Tuple{StiefelManifold, AbstractArray{T, 3}}} where T","page":"Library","title":"GeometricMachineLearning.mat_tensor_mul","text":"Extend mat_tensor_mul to a multiplication by an element of StiefelManifold. 
\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.metric-Tuple{StiefelManifold, AbstractMatrix, AbstractMatrix}","page":"Library","title":"GeometricMachineLearning.metric","text":"Implements the canonical Riemannian metric for the Stiefel manifold:\n\ng_Y (Delta_1 Delta_2) mapsto mathrmtr(Delta_1^T(mathbbI - frac12YY^T)Delta_2)\n\nIt is called with: \n\nY::StiefelManifold\nΔ₁::AbstractMatrix\nΔ₂::AbstractMatrix`\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.number_of_batches-Union{Tuple{OT}, Tuple{AT}, Tuple{BT}, Tuple{T}, Tuple{DataLoader{T, AT, OT, :TimeSeries}, Batch}} where {T, BT<:AbstractArray{T, 3}, AT<:Union{@NamedTuple{q::BT, p::BT}, BT}, OT}","page":"Library","title":"GeometricMachineLearning.number_of_batches","text":"Gives the number of batches. Inputs are of type DataLoader and Batch.\n\nHere the big distinction is between data that are time-series like and data that are autoencoder like.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.onehotbatch-Union{Tuple{AbstractVector{T}}, Tuple{T}} where T<:Integer","page":"Library","title":"GeometricMachineLearning.onehotbatch","text":"One-hot-batch encoding of a vector of integers: inputin01ldots9^ell. The output is a tensor of shape 10times1timesell. \n\n0 mapsto beginbmatrix 1 0 ldots 0 endbmatrix\n\nIn more abstract terms: i mapsto e_i.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.optimization_step!-Tuple{Optimizer, Chain, Tuple, Tuple}","page":"Library","title":"GeometricMachineLearning.optimization_step!","text":"Optimization for an entire neural network, the way this function should be called. 
\n\ninputs: \n\no::Optimizer\nmodel::Chain\nps::Tuple\ndx::Tuple\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.optimization_step!-Tuple{Optimizer, Union{AbstractNeuralNetworks.AbstractExplicitCell, AbstractNeuralNetworks.AbstractExplicitLayer}, NamedTuple, NamedTuple, NamedTuple}","page":"Library","title":"GeometricMachineLearning.optimization_step!","text":"Optimization for a single layer. \n\ninputs: \n\no::Optimizer\nd::Union{AbstractExplicitLayer, AbstractExplicitCell}\nps::NamedTuple: the parameters \nC::NamedTuple: NamedTuple of the caches \ndx::NamedTuple: NamedTuple of the derivatives (output of AD routine)\n\nps, C and dx must have the same keys. \n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.optimize_for_one_epoch!-Union{Tuple{T}, Tuple{Optimizer, Any, Union{Tuple, NamedTuple}, DataLoader{T, AT} where AT<:Union{AbstractArray{T}, NamedTuple}, Batch, Union{typeof(GeometricMachineLearning.loss), GeometricMachineLearning.NetworkLoss}}} where T","page":"Library","title":"GeometricMachineLearning.optimize_for_one_epoch!","text":"Optimize for an entire epoch. For this you have to supply: \n\nan instance of the optimizer.\nthe neural network model \nthe parameters of the model \nthe data (in form of DataLoader)\nan instance of Batch that contains batch_size (and optionally seq_length)\n\nWith the optional argument:\n\nthe loss, which takes the model, the parameters ps and an instance of DataLoader as input.\n\nThe output of optimize_for_one_epoch! is the average loss over all batches of the epoch:\n\noutput = frac1mathttsteps_per_epochsum_t=1^mathttsteps_per_epochloss(theta^(t-1))\n\nThis is done because any reverse differentiation routine always has two outputs: a pullback and the value of the function it is differentiating. 
In the case of zygote: loss_value, pullback = Zygote.pullback(ps -> loss(ps), ps) (if the loss only depends on the parameters).\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.patch_index-Union{Tuple{T}, Tuple{T, T, T}, NTuple{4, T}} where T<:Integer","page":"Library","title":"GeometricMachineLearning.patch_index","text":"Based on coordinates i,j this returns the batch index (for MNIST data set for now).\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.reduced_vector_field_from_full_explicit_vector_field-Tuple{Any, Any, Integer, Integer}","page":"Library","title":"GeometricMachineLearning.reduced_vector_field_from_full_explicit_vector_field","text":"This function is needed if we obtain a GeometricIntegrators-like vector field from an explicit vector field VmathbbR^2NtomathbbR^2N. We need this function because buildreducedvector_field is not working in conjunction with implicit integrators.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.rgrad-Tuple{StiefelManifold, AbstractMatrix}","page":"Library","title":"GeometricMachineLearning.rgrad","text":"Computes the Riemannian gradient for the Stiefel manifold given an element YinSt(Nn) and a matrix nablaLinmathbbR^Ntimesn (the Euclidean gradient). It computes the Riemannian gradient with respect to the canonical metric (see the documentation for the function metric for an explanation of this). The precise form of the mapping is: \n\nmathttrgrad(Y nablaL) mapsto nablaL - Y(nablaL)^TY\n\nIt is called with inputs:\n\nY::StiefelManifold\ne_grad::AbstractMatrix: i.e. 
the Euclidean gradient (what was called nablaL above).\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.split_and_flatten-Union{Tuple{AbstractArray{T, 3}}, Tuple{T}} where T","page":"Library","title":"GeometricMachineLearning.split_and_flatten","text":"split_and_flatten takes a tensor as input and produces another one as output (essentially rearranges the input data in an intricate way) so that it can easily be processed with a transformer.\n\nThe optional arguments are: \n\npatch_length: by default this is 7. \nnumber_of_patches: by default this is 16.\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.tensor_mat_skew_sym_assign-Union{Tuple{AT}, Tuple{T}, Tuple{AT, AbstractMatrix{T}}} where {T, AT<:AbstractArray{T, 3}}","page":"Library","title":"GeometricMachineLearning.tensor_mat_skew_sym_assign","text":"Takes as input: \n\nZ::AbstractArray{T, 3}: A tensor that stores a bunch of time series. \nA::AbstractMatrix: A matrix that is used to perform various scalar products. \n\nFor one of these time series the function performs the following computation: \n\n (z^(i) z^(j)) mapsto (z^(i))^TAz^(j) text for i j\n\nThe result of this are n(n-1)div2 scalar products. These scalar products are written into a lower-triangular matrix and the final output of the function is a tensor of these lower-triangular matrices. \n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.tensor_mat_skew_sym_assign_kernel!-Tuple{Any}","page":"Library","title":"GeometricMachineLearning.tensor_mat_skew_sym_assign_kernel!","text":"A kernel that computes the weighted scalar products of all combinations of vectors in the matrix Z except where the two vectors are the same and writes the result into a tensor of skew symmetric matrices C. 
\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.train!","page":"Library","title":"GeometricMachineLearning.train!","text":"train!(...)\n\nTrain a neural network on data using a given training method.\n\nDifferent ways of use:\n\ntrain!(neuralnetwork, data, optimizer = GradientOptimizer(1e-2), training_method; nruns = 1000, batch_size = default(data, type), showprogress = false )\n\nArguments\n\nneuralnetwork::LuxNeuralNetwork : the neural network using LuxBackend\ndata : the data (see TrainingData)\noptimizer = GradientOptimizer: the optimization method (see Optimizer)\ntraining_method : specify the loss function used \nnruns : number of iterations through the process with default value \nbatch_size : size of batch of data used for each step\n\n\n\n\n\n","category":"function"},{"location":"library/#GeometricMachineLearning.train!-Tuple{AbstractNeuralNetworks.AbstractNeuralNetwork{<:AbstractNeuralNetworks.Architecture}, AbstractTrainingData, TrainingParameters}","page":"Library","title":"GeometricMachineLearning.train!","text":"train!(neuralnetwork, data, optimizer, training_method; nruns = 1000, batch_size, showprogress = false )\n\nArguments\n\nneuralnetwork::LuxNeuralNetwork : the neural network using LuxBackend\ndata::AbstractTrainingData : the data\n``\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.transformer_loss-Union{Tuple{BT}, Tuple{T}, Tuple{Union{AbstractNeuralNetworks.AbstractExplicitLayer, Chain}, Union{Tuple, NamedTuple}, BT, BT}} where {T, BT<:(AbstractArray{T})}","page":"Library","title":"GeometricMachineLearning.transformer_loss","text":"The transformer loss works similarly to the regular loss, but with the difference that mathcalNN(input) and output may have different sizes. 
\n\nIt takes as input: \n\nmodel::Union{Chain, AbstractExplicitLayer}\nps::Union{Tuple, NamedTuple}\ninput::Union{Array, NamedTuple}\noutput::Union{Array, NamedTuple}\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.within_patch_index-Union{Tuple{T}, Tuple{T, T, T}} where T<:Integer","page":"Library","title":"GeometricMachineLearning.within_patch_index","text":"Based on coordinates i,j this returns the index within the batch\n\n\n\n\n\n","category":"method"},{"location":"library/#GeometricMachineLearning.write_ones_kernel!-Tuple{Any}","page":"Library","title":"GeometricMachineLearning.write_ones_kernel!","text":"Kernel that is needed for functions relating to SymmetricMatrix and SkewSymMatrix \n\n\n\n\n\n","category":"method"},{"location":"optimizers/adam_optimizer/#The-Adam-Optimizer","page":"Adam Optimizer","title":"The Adam Optimizer","text":"","category":"section"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"The Adam Optimizer is one of the most widely used neural network optimizers (if not the most widely used). Like most modern neural network optimizers it contains a cache that is updated based on first-order gradient information and then, in a second step, the cache is used to compute a velocity estimate for updating the neural network weights. ","category":"page"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"Here we first describe the Adam algorithm for the case where all the weights are on a vector space and then show how to generalize this to the case where the weights are on a manifold. 
","category":"page"},{"location":"optimizers/adam_optimizer/#All-weights-on-a-vector-space","page":"Adam Optimizer","title":"All weights on a vector space","text":"","category":"section"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"The cache of the Adam optimizer consists of first and second moments. The first moments B_1 store linear information about the current and previous gradients, and the second moments B_2 store quadratic information about current and previous gradients (all computed from a first-order gradient). ","category":"page"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"If all the weights are on a vector space, then we directly compute updates for B_1 and B_2:","category":"page"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"B_1 gets ((rho_1 - rho_1^t)(1 - rho_1^t))cdotB_1 + (1 - rho_1)(1 - rho_1^t)cdotnablaL\nB_2 gets ((rho_2 - rho_1^t)(1 - rho_2^t))cdotB_2 + (1 - rho_2)(1 - rho_2^t)cdotnablaLodotnablaL\nwhere odotmathbbR^ntimesmathbbR^ntomathbbR^n is the Hadamard product: aodotb_i = a_ib_i. rho_1 and rho_2 are hyperparameters. Their defaults, rho_1=09 and rho_2=099, are taken from (Goodfellow et al., 2016, page 301). After having updated the cache (i.e. B_1 and B_2) we compute a velocity (step 3) with which the parameters Y_t are then updated (step 4).\nW_tgets -etaB_1sqrtB_2 + delta\nY_t+1 gets Y_t + W_t","category":"page"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"Here eta (with default 0.01) is the learning rate and delta (with default 3cdot10^-7) is a small constant that is added for stability. The division, square root and addition in step 3 are performed element-wise. 
","category":"page"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"import Images, Plots # hide\nif Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nelse # hide\n Plots.plot(Images.load(\"../tikz/adam_optimizer.png\"), axis=([], false)) # hide\nend # hide","category":"page"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"if Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nend # hide","category":"page"},{"location":"optimizers/adam_optimizer/#Weights-on-manifolds","page":"Adam Optimizer","title":"Weights on manifolds","text":"","category":"section"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"The problem with generalizing Adam to manifolds is that the Hadamard product odot as well as the other element-wise operations (, sqrt and + in step 3 above) lack a clear geometric interpretation. In GeometricMachineLearning we get around this issue by utilizing a so-called global tangent space representation. ","category":"page"},{"location":"optimizers/adam_optimizer/#References","page":"Adam Optimizer","title":"References","text":"","category":"section"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"Goodfellow I, Bengio Y, Courville A. Deep learning[M]. MIT press, 2016.","category":"page"},{"location":"optimizers/adam_optimizer/","page":"Adam Optimizer","title":"Adam Optimizer","text":"I. Goodfellow, Y. Bengio and A. Courville. 
Deep learning (MIT press, Cambridge, MA, 2016).\n\n\n\n","category":"page"},{"location":"architectures/autoencoders/#Variational-Autoencoders","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"","category":"section"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"Variational autoencoders (Lee and Carlberg, 2020) train on the following set: ","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"mathcalX(mathbbP_mathrmtrain) = mathbfx^k(mu) - mathbfx^0(mu)0leqkleqKmuinmathbbP_mathrmtrain","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"where mathbfx^k(mu)approxmathbfx(t^kmu). Note that mathbf0inmathcalX(mathbbP_mathrmtrain) as k can also be zero. ","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"The encoder Psi^mathrmenc and decoder Psi^mathrmdec are then trained on this set mathcalX(mathbbP_mathrmtrain) by minimizing the reconstruction error: ","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":" mathbfx - Psi^mathrmdeccircPsi^mathrmenc(mathbfx) text for mathbfxinmathcalX(mathbbP_mathrmtrain)","category":"page"},{"location":"architectures/autoencoders/#Initial-condition","page":"Variational Autoencoders","title":"Initial condition","text":"","category":"section"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"No matter the parameter mu the initial condition in the reduced system is always mathbfx_r0(mu) = mathbfx_r0 = Psi^mathrmenc(mathbf0). 
","category":"page"},{"location":"architectures/autoencoders/#Reconstructed-solution","page":"Variational Autoencoders","title":"Reconstructed solution","text":"","category":"section"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"In order to arrive at the reconstructed solution one first has to decode the reduced state and then add the reference state:","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"mathbfx^mathrmreconstr(tmu) = mathbfx^mathrmref(mu) + Psi^mathrmdec(mathbfx_r(tmu))","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"where mathbfx^mathrmref(mu) = mathbfx(t_0mu) - Psi^mathrmdeccircPsi^mathrmdec(mathbf0).","category":"page"},{"location":"architectures/autoencoders/#Symplectic-reduced-vector-field","page":"Variational Autoencoders","title":"Symplectic reduced vector field","text":"","category":"section"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"A symplectic vector field is one whose flow conserves the symplectic structure mathbbJ. This is equivalent[1] to there existing a Hamiltonian H s.t. the vector field X can be written as X = mathbbJnablaH.","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"[1]: Technically speaking the definitions are equivalent only for simply-connected manifolds, so also for vector spaces. 
","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"If the full-order Hamiltonian is H^mathrmfullequivH we can obtain another Hamiltonian on the reduces space by simply setting: ","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"H^mathrmred(mathbfx_r(tmu)) = H(mathbfx^mathrmreconstr(tmu)) = H(mathbfx^mathrmref(mu) + Psi^mathrmdec(mathbfx_r(tmu)))","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"The ODE associated to this Hamiltonian is also the one corresponding to Manifold Galerkin ROM (see (Lee and Carlberg, 2020)).","category":"page"},{"location":"architectures/autoencoders/#Manifold-Galerkin-ROM","page":"Variational Autoencoders","title":"Manifold Galerkin ROM","text":"","category":"section"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"Define the FOM ODE residual as: ","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"r (mathbfv xi tau mu) mapsto mathbfv - f(xi tau mu)","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"The reduced ODE is then defined to be: ","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"dothatmathbfx(tmu) = mathrmargmin_hatmathbfvinmathbbR^p r(mathcalJ(hatmathbfx(tmu))hatmathbfvhatmathbfx^mathrmref(mu) + Psi^mathrmdec(hatmathbfx(tmu))tmu) _2^2","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"where mathcalJ is the Jacobian of the decoder Psi^mathrmdec. 
This leads to: ","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"mathcalJ(hatmathbfx(tmu))hatmathbfv - f(hatmathbfx^mathrmref(mu) + Psi^mathrmdec(hatmathbfx(tmu)) t mu) overset= 0 implies \nhatmathbfv = mathcalJ(hatmathbfx(tmu))^+f(hatmathbfx^mathrmref(mu) + Psi^mathrmdec(hatmathbfx(tmu)) t mu)","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"where mathcalJ(hatmathbfx(tmu))^+ is the pseudoinverse of mathcalJ(hatmathbfx(tmu)). Because mathcalJ(hatmathbfx(tmu)) is a symplectic matrix the pseudoinverse is the symplectic inverse (see (Peng and Mohseni, 2016)).","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"Furthermore, because f is Hamiltonian, the vector field describing dothatmathbfx(tmu) will also be Hamiltonian. ","category":"page"},{"location":"architectures/autoencoders/#References","page":"Variational Autoencoders","title":"References","text":"","category":"section"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"K. Lee and K. Carlberg. “Model reduction of dynamical systems on nonlinear manifolds using","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"deep convolutional autoencoders”. In: Journal of Computational Physics 404 (2020), p. 108973.","category":"page"},{"location":"architectures/autoencoders/","page":"Variational Autoencoders","title":"Variational Autoencoders","text":"Peng L, Mohseni K. Symplectic model reduction of Hamiltonian systems[J]. 
SIAM Journal on Scientific Computing, 2016, 38(1): A1-A27.","category":"page"},{"location":"data_loader/TODO/#DATA-Loader-TODO","page":"DATA Loader TODO","title":"DATA Loader TODO","text":"","category":"section"},{"location":"data_loader/TODO/","page":"DATA Loader TODO","title":"DATA Loader TODO","text":"[x] Implement @views instead of allocating a new array in every step. \n[x] Implement sampling without replacement.\n[x] Store information on the epoch and the current loss. \n[x] Usually the training loss is computed over the entire data set, we are probably going to do this for one epoch via ","category":"page"},{"location":"data_loader/TODO/","page":"DATA Loader TODO","title":"DATA Loader TODO","text":"loss_e = frac1batchessum_batchinbatchesloss(batch)","category":"page"},{"location":"data_loader/TODO/","page":"DATA Loader TODO","title":"DATA Loader TODO","text":"Point 4 makes sense because the output of an AD routine is the value of the loss function as well as the pullback. ","category":"page"},{"location":"data_loader/data_loader/#Data-Loader","page":"Routines","title":"Data Loader","text":"","category":"section"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"using GeometricMachineLearning, Markdown\nMarkdown.parse(description(Val(:DataLoader)))","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"The data loader can be called with various types of arrays as input, for example a snapshot matrix:","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"using GeometricMachineLearning # hide\n\nSnapshotMatrix = rand(Float32, 10, 100)\n\ndl = DataLoader(SnapshotMatrix)","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"or a snapshot tensor: ","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"using GeometricMachineLearning # 
hide\n\nSnapshotTensor = rand(Float32, 10, 100, 5)\n\ndl = DataLoader(SnapshotTensor)","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"Here the DataLoader has different properties :RegularData and :TimeSeries. This indicates that in the first case we treat all columns in the input tensor independently (this is mostly used for autoencoder problems), whereas in the second case we have time series-like data, which are mostly used for integration problems. We can also treat a problem with a matrix as input as a time series-like problem by providing an additional keyword argument: autoencoder=false:","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"using GeometricMachineLearning # hide\n\nSnapshotMatrix = rand(Float32, 10, 100)\n\ndl = DataLoader(SnapshotMatrix; autoencoder=false)\ndl.input_time_steps","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"using GeometricMachineLearning, Markdown\nMarkdown.parse(description(Val(:data_loader_for_named_tuple)))","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"using GeometricMachineLearning # hide\n\nSymplecticSnapshotTensor = (q = rand(Float32, 10, 100, 5), p = rand(Float32, 10, 100, 5))\n\ndl = DataLoader(SymplecticSnapshotTensor)","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"dl.input_dim","category":"page"},{"location":"data_loader/data_loader/#The-Batch-struct","page":"Routines","title":"The Batch struct","text":"","category":"section"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"using GeometricMachineLearning, Markdown\nMarkdown.parse(description(Val(:Batch)))","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"using GeometricMachineLearning # hide\n\nmatrix_data 
= rand(Float32, 2, 10)\ndl = DataLoader(matrix_data; autoencoder = true)\n\nbatch = Batch(3)\nbatch(dl)","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"This also works if the data are in qp form: ","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"using GeometricMachineLearning # hide\n\nqp_data = (q = rand(Float32, 2, 10), p = rand(Float32, 2, 10))\ndl = DataLoader(qp_data; autoencoder = true)\n\nbatch = Batch(3)\nbatch(dl)","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"In those two examples the autoencoder keyword was set to true (the default). This is why the first index was always 1. This changes if we set autoencoder = false: ","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"using GeometricMachineLearning # hide\n\nqp_data = (q = rand(Float32, 2, 10), p = rand(Float32, 2, 10))\ndl = DataLoader(qp_data; autoencoder = false) # false is default \n\nbatch = Batch(3)\nbatch(dl)","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"Specifically the routines do the following: ","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"mathttn_indicesleftarrow mathttn_paramslormathttinput_time_steps \nmathttindices leftarrow mathttshuffle(mathtt1mathttn_indices)\nmathcalI_i leftarrow mathttindices(i - 1) cdot mathttbatch_size + 1 mathtt i cdot mathttbatch_sizetext for i=1 ldots (mathrmlast -1)\nmathcalI_mathttlast leftarrow mathttindices(mathttn_batches - 1) cdot mathttbatch_size + 1mathttend","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"Note that the routines are implemented in such a way that no two indices appear double. 
","category":"page"},{"location":"data_loader/data_loader/#Sampling-from-a-tensor","page":"Routines","title":"Sampling from a tensor","text":"","category":"section"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"We can also sample tensor data.","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"using GeometricMachineLearning # hide\n\nqp_data = (q = rand(Float32, 2, 20, 3), p = rand(Float32, 2, 20, 3))\ndl = DataLoader(qp_data)\n\n# also specify sequence length here\nbatch = Batch(4, 5)\nbatch(dl)","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"Sampling from a tensor is done the following way (mathcalI_i again denotes the batch indices for the i-th batch): ","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"mathtttime_indices leftarrow mathttshuffle(mathtt1(mathttinput_time_steps - mathttseq_length - mathttprediction_window)\nmathttparameter_indices leftarrow mathttshuffle(mathtt1n_params)\nmathttcomplete_indices leftarrow mathttproduct(mathtttime_indices mathttparameter_indices)\nmathcalI_i leftarrow mathttcomplete_indices(i - 1) cdot mathttbatch_size + 1 i cdot mathttbatch_sizetext for i=1 ldots (mathrmlast -1)\nmathcalI_mathrmlast leftarrow mathttcomplete_indices(mathrmlast - 1) cdot mathttbatch_size + 1mathttend","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"This algorithm can be visualized the following way (here batch_size = 4):","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"import Images, Plots # hide\nif Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nelse # hide\n Plots.plot(Images.load(\"../tikz/tensor_sampling.png\"), axis=([], false)) # hide\nend # 
hide","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"if Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nend # hide","category":"page"},{"location":"data_loader/data_loader/","page":"Routines","title":"Routines","text":"Here the sampling is performed over the second axis (the time step dimension) and the third axis (the parameter dimension). Whereas each block has thickness 1 in the x direction (i.e. pertains to a single parameter), its length in the y direction is seq_length. In total we sample batch_size such blocks. By construction those blocks are never the same throughout a training epoch but may intersect each other!","category":"page"},{"location":"manifolds/basic_topology/#Basic-Concepts-of-General-Topology","page":"Concepts from General Topology","title":"Basic Concepts of General Topology","text":"","category":"section"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"On this page we discuss basic notions of topology that are necessary to define and work with manifolds. Here we largely omit concrete examples and only define concepts that are necessary for defining a manifold[1], namely the properties of being Hausdorff and second countable. For a wide range of examples and a detailed discussion of the theory see e.g. [5]. The theory presented here is also covered (in rudimentary form) in most differential geometry books such as [6] and [7]. ","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"[1]: Some authors (see e.g. [6]) do not require these properties. But since they constitute very weak restrictions and are always satisfied by the manifolds relevant for our purposes we require them here. 
","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Definition: A topological space is a set mathcalM for which we define a collection of subsets of mathcalM, which we denote by mathcalT and call the open subsets. mathcalT further has to satisfy the following three conditions:","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"The empty set and mathcalM belong to mathcalT.\nAny union of an arbitrary number of elements of mathcalT again belongs to mathcalT.\nAny intersection of a finite number of elements of mathcalT again belongs to mathcalT.","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Based on this definition of a topological space we can now define what it means to be Hausdorff: Definition: A topological space mathcalM is said to be Hausdorff if for any two points xyinmathcalM we can find two open sets U_xU_yinmathcalT s.t. xinU_x yinU_y and U_xcapU_y=.","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"We now give the second definition that we need for defining manifolds, that of second countability: Definition: A topological space mathcalM is said to be second-countable if we can find a countable subcollection of mathcalT called mathcalU s.t. 
forallUinmathcalT and xinU we can find an element VinmathcalU for which xinVsubsetU.","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"We now give a few definitions and results that are needed for the inverse function theorem which is essential for practical applications of manifold theory.","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Definition: A mapping f between topological spaces mathcalM and mathcalN is called continuous if the preimage of every open set is again an open set, i.e. if f^-1UinmathcalT for U open in mathcalN and mathcalT the topology on mathcalM.","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Definition: A closed set of a topological space mathcalM is one whose complement is an open set, i.e. F is closed if F^cinmathcalT, where the superscript ^c indicates the complement. 
For closed sets we thus have the following three properties: ","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"The empty set and mathcalM are closed sets.\nAny union of a finite number of closed sets is again closed.\nAny intersection of an arbitrary number of closed sets is again closed.","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Theorem: The definition of continuity is equivalent to the following, second definition: fmathcalMtomathcalN is continuous if f^-1FsubsetmathcalM is a closed set for each closed set FsubsetmathcalN.","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Proof: First assume that f is continuous according to the first definition and not to the second. Then there is a closed set F for which f^-1F is not closed but f^-1F^c is open. But f^-1F^c = xinmathcalMf(x)notinF = (f^-1F)^c cannot be open, else f^-1F would be closed. The implication of the first definition under assumption of the second can be shown analogously. ","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Theorem: The property of a set F being closed is equivalent to the following statement: If a point y is such that for every open set U containing it we have UcapFneqemptyset then this point is contained in F.","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Proof: We first prove that if a set is closed then the statement holds. Consider a closed set F and a point ynotinF s.t. every open set containing y has nonempty intersection with F. But the complement F^c is itself an open set that contains y and does not intersect F, which is a clear contradiction. 
Now assume the above statement for a set F and further assume F is not closed. Its complement F^c is thus not open. Now consider the interior of this set: mathrmint(F^c)=cupUUsubsetF^c, i.e. the biggest open set contained within F^c. Hence there must be a point y which is in F^c but is not in its interior, else F^c would be equal to its interior, i.e. would be open. We further must be able to find an open set U that contains y but is also contained in F^c, else y would be an element of F. A contradiction. ","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Definition: An open cover of a topological space mathcalM is a (not necessarily countable) collection of open sets U_i_imathcalI s.t. their union contains mathcalM. A finite open cover is a collection of a finite number of open sets that cover mathcalM. We say that an open cover is reducible to a finite cover if we can find a finite number of elements in the open cover whose union still contains mathcalM.","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Definition: A topological space mathcalM is called compact if every open cover is reducible to a finite cover.","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Theorem: Consider a continuous function fmathcalMtomathcalN and a compact set KinmathcalM. Then f(K) is also compact. ","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Proof: Consider an open cover of f(K): U_i_iinmathcalI. Then f^-1U_i_iinmathcalI is an open cover of K and hence reducible to a finite cover f^-1U_i_iini_1ldotsi_n. 
But then U_i_iini_1ldotsi_n also covers f(K).","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Theorem: A closed subset of a compact space is compact.","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Proof: Call the closed set F and consider an open cover of this set: U_i_iinmathcalI. Then this open cover combined with F^c is an open cover for the entire compact space, hence reducible to a finite cover. Removing F^c from this finite cover leaves finitely many of the U_i, which still cover F.","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Theorem: A compact subset of a Hausdorff space is closed. ","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Proof: Consider a compact subset K. If K is not closed, then there has to be a point ynotinK s.t. every open set containing y intersects K. Because the surrounding space is Hausdorff we can now find the following two collections of open sets: (U_z U_zy U_zcapU_zy=emptyset)_zinK. The open cover U_z_zinK is then reducible to a finite cover U_z_zinz_1 ldots z_n. The intersection cap_zinz_1 ldots z_nU_zy is then an open set that contains y but has no intersection with K. A contradiction. ","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Theorem: If mathcalM is compact and mathcalN is Hausdorff, then the inverse of a continuous and bijective function fmathcalMtomathcalN is again continuous, i.e. 
f(V) is an open set in mathcalN for VinmathcalT.","category":"page"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"Proof: We can equivalently show that every closed set is mapped to a closed set. First consider the set KinmathcalM. Its image is again compact and hence closed because mathcalN is Hausdorff. ","category":"page"},{"location":"manifolds/basic_topology/#References","page":"Concepts from General Topology","title":"References","text":"","category":"section"},{"location":"manifolds/basic_topology/","page":"Concepts from General Topology","title":"Concepts from General Topology","text":"S. I. Richard L. Bishop. Tensor Analysis on Manifolds (Dover Publications, 1980).\n\n\n\nS. Lang. Fundamentals of differential geometry. Vol. 191 (Springer Science & Business Media, 2012).\n\n\n\nS. Lipschutz. General Topology (McGraw-Hill Book Company, 1965).\n\n\n\n","category":"page"},{"location":"tutorials/mnist_tutorial/#MNIST-tutorial","page":"MNIST","title":"MNIST tutorial","text":"","category":"section"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"This is a short tutorial that shows how we can use GeometricMachineLearning to build a vision transformer and apply it for MNIST, while also putting some of the weights on a manifold. This is also the result presented in [24].","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"First, we need to import the relevant packages: ","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"using GeometricMachineLearning, CUDA, Plots\nimport Zygote, MLDatasets, KernelAbstractions","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"For the AD routine we here use the GeometricMachineLearning default and we get the dataset from MLDatasets. 
First we need to load the data set and put it on the GPU (if you have one):","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"train_x, train_y = MLDatasets.MNIST(split=:train)[:]\ntest_x, test_y = MLDatasets.MNIST(split=:test)[:]\ntrain_x = train_x |> cu \ntest_x = test_x |> cu \ntrain_y = train_y |> cu \ntest_y = test_y |> cu","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"GeometricMachineLearning has built-in data loaders that make it particularly easy to handle data: ","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"patch_length = 7\ndl = DataLoader(train_x, train_y, patch_length=patch_length)\ndl_test = DataLoader(test_x, test_y, patch_length=patch_length)","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"Here patch_length indicates the side length of one patch. One image in MNIST is of dimension 28times28; this means that we decompose each image into 16 patches of size 7times7 (also see [24]).","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"We next define the model that we want to train:","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"model = ClassificationTransformer(dl, n_heads=n_heads, n_layers=n_layers, Stiefel=true)","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"Here we have chosen a ClassificationTransformer, i.e. a composition of a specific number of transformer layers composed with a classification layer. We also set the Stiefel option to true, i.e. we are optimizing on the Stiefel manifold.","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"We now have to initialize the neural network weights. 
This is done with the constructor for NeuralNetwork:","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"backend = KernelAbstractions.get_backend(dl)\nT = eltype(dl)\nnn = NeuralNetwork(model, backend, T)","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"And with this we can finally perform the training:","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"# an instance of batch is needed for the optimizer\nbatch = Batch(batch_size)\n\noptimizer_instance = Optimizer(AdamOptimizer(), nn)\n\n# this prints the accuracy and is optional\nprintln(\"initial test accuracy: \", accuracy(Ψᵉ, ps, dl_test), \"\\n\")\n\nloss_array = optimizer_instance(nn, dl, batch, n_epochs)\n\nprintln(\"final test accuracy: \", accuracy(Ψᵉ, ps, dl_test), \"\\n\")","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"It is instructive to play with n_layers, n_epochs and the Stiefel property.","category":"page"},{"location":"tutorials/mnist_tutorial/","page":"MNIST","title":"MNIST","text":"B. Brantner. Generalizing Adam To Manifolds For Efficiently Training Transformers, arXiv preprint arXiv:2305.16901 (2023).\n\n\n\n","category":"page"},{"location":"manifolds/existence_and_uniqueness_theorem/#The-Existence-And-Uniqueness-Theorem","page":"Differential Equations and the EAU theorem","title":"The Existence-And-Uniqueness Theorem","text":"","category":"section"},{"location":"manifolds/existence_and_uniqueness_theorem/","page":"Differential Equations and the EAU theorem","title":"Differential Equations and the EAU theorem","text":"In order to prove the existence-and-uniqueness theorem we first need another theorem, the Banach fixed-point theorem, for which we also need another definition. 
","category":"page"},{"location":"manifolds/existence_and_uniqueness_theorem/","page":"Differential Equations and the EAU theorem","title":"Differential Equations and the EAU theorem","text":"Definition: A contraction mapping is a map TmathbbR^NtomathbbR^N for which there exists qin01) s.t. forallxyinmathbbR^NT(x)-T(y)leqqx-y.","category":"page"},{"location":"manifolds/existence_and_uniqueness_theorem/","page":"Differential Equations and the EAU theorem","title":"Differential Equations and the EAU theorem","text":"Theorem (Banach fixed-point theorem): Every contraction mapping T admits a unique fixed point x^* (i.e. a point x^* s.t. T(x^*)=x^*) and this point can be found by taking an arbitrary point x_0inmathbbR^N and taking the limit lim_ntoinftyT^n(x_0).","category":"page"},{"location":"manifolds/existence_and_uniqueness_theorem/","page":"Differential Equations and the EAU theorem","title":"Differential Equations and the EAU theorem","text":"Proof (Banach fixed-point theorem): Take an arbitrary point x_0inmathbbR^N and consider the sequence (x_n)_ninmathbbN with x_n=T^n(x_0). Then it holds that (for mn): ","category":"page"},{"location":"manifolds/existence_and_uniqueness_theorem/","page":"Differential Equations and the EAU theorem","title":"Differential Equations and the EAU theorem","text":"beginaligned\nx_m - x_n leq x_m - x_m-1 + x_m-1 - x_m-2 + cdots + x_m-(m-n+1)-x_n \n = x_n+(m-n) - x_n+(m-n-1) + cdots + x_n+1 - x_n \n leq sum_i=0^m-n-1q^ix_n+1 - x_n \n leq sum_i=0^m-n-1q^iq^nx_1 - x_0 \n = q^nx_1 -x_0sum_i=0^m-n-1q^i\nendaligned","category":"page"},{"location":"manifolds/existence_and_uniqueness_theorem/","page":"Differential Equations and the EAU theorem","title":"Differential Equations and the EAU theorem","text":"where we have used the triangle inequality in the first line. 
If we now let m on the right-hand side first go to infinity then we get ","category":"page"},{"location":"manifolds/existence_and_uniqueness_theorem/","page":"Differential Equations and the EAU theorem","title":"Differential Equations and the EAU theorem","text":"beginaligned\nx_m-x_n leq q^nx_1 -x_0sum_i=0^inftyq^i\n =q^nx_1 -x_0 frac11-q\nendaligned","category":"page"},{"location":"manifolds/existence_and_uniqueness_theorem/","page":"Differential Equations and the EAU theorem","title":"Differential Equations and the EAU theorem","text":"proving that the sequence is Cauchy. Because mathbbR^N is a complete metric space we get that (x_n)_ninmathbbN is a convergent sequence. We call the limit of this sequence x^*. This completes the proof of the Banach fixed-point theorem. ","category":"page"},{"location":"layers/multihead_attention_layer/#Multihead-Attention-Layer","page":"Multihead Attention","title":"Multihead Attention Layer","text":"","category":"section"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"In order to arrive from the attention layer at the multihead attention layer we have to make a few modifications: ","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"Note that these neural networks were originally developed for natural language processing (NLP) tasks and the terminology used here bears some resemblance to that field. 
The input to a multihead attention layer typically comprises three components:","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"Values VinmathbbR^ntimesT: a matrix whose columns are value vectors, \nQueries QinmathbbR^ntimesT: a matrix whose columns are query vectors, \nKeys KinmathbbR^ntimesT: a matrix whose columns are key vectors.","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"Regular attention performs the following operation: ","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"mathrmAttention(QKV) = Vmathrmsoftmax(fracK^TQsqrtn)","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"where n is the dimension of the vectors in V, Q and K. The softmax activation function here acts column-wise, so it can be seen as a transformation mathrmsoftmaxmathbbR^TtomathbbR^T with mathrmsoftmax(v)_i = e^v_ileft(sum_j=1^Te^v_jright). The K^TQ term is a similarity matrix between the queries and the keys. ","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"The transformer contains a self-attention mechanism, i.e. it takes an input X and then transforms it linearly to V, Q and K, i.e. V = P^VX, Q = P^QX and K = P^KX. What distinguishes the multihead attention layer from the singlehead attention layer is that there is not just one P^V, P^Q and P^K, but there are several: one for each head of the multihead attention layer. 
After computing the individual values, queries and keys, and after applying the softmax, the outputs are then concatenated together in order to obtain again an array that is of the same size as the input array:","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"import Images, Plots # hide\nif Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nelse # hide\n Plots.plot(Images.load(\"../tikz/mha.png\"), axis=([], false)) # hide\nend # hide","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"if Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nend # hide","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"Here the various P matrices can be interpreted as being projections onto lower-dimensional subspaces, hence the designation by the letter P. Because of this interpretation as projection matrices onto smaller spaces that should capture features in the input data it makes sense to constrain these elements to be part of the Stiefel manifold. ","category":"page"},{"location":"layers/multihead_attention_layer/#Computing-Correlations-in-the-Multihead-Attention-Layer","page":"Multihead Attention","title":"Computing Correlations in the Multihead-Attention Layer","text":"","category":"section"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"The attention mechanism describes a reweighting of the \"values\" V_i based on correlations between the \"keys\" K_i and the \"queries\" Q_i. First note the structure of these matrices: they are all collections of T (Ndivmathttn_heads)-dimensional vectors, i.e. V_i=v_i^(1) ldots v_i^(T) K_i=k_i^(1) ldots k_i^(T) Q_i=q_i^(1) ldots q_i^(T) . 
Those vectors have been obtained by applying the respective projection matrices to the original input I_iinmathbbR^NtimesT.","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"When performing the reweighting of the columns of V_i we first compute the correlations between the vectors in K_i and in Q_i and store the results in a correlation matrix C_i: ","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":" C_i_mn = left(k_i^(m)right)^Tq_i^(n)","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"The columns of this correlation matrix are then rescaled with a softmax function, obtaining a matrix of probability vectors mathcalP_i:","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":" mathcalP_i_bulletn = mathrmsoftmax(C_i_bulletn)","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"Finally the matrix mathcalP_i is multiplied onto V_i from the right, resulting in T convex combinations of the T vectors v_i^(m) with m=1ldotsT:","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":" V_imathcalP_i = leftsum_m=1^TmathcalP_i_m1v_i^(m) ldots sum_m=1^TmathcalP_i_mTv_i^(m)right","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"With this we can now give a better interpretation of what the projection matrices W_i^V, W_i^K and W_i^Q should do: they map the original data to lower-dimensional subspaces. 
We then compute correlations between the representation in the K and in the Q basis and use this correlation to perform a convex reweighting of the vectors in the V basis. These reweighted values are then fed into a standard feedforward neural network.","category":"page"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"Because the main task of the W_i^V, W_i^K and W_i^Q matrices here is for them to find bases, it makes sense to constrain them onto the Stiefel manifold; they do not and should not have the maximum possible generality.","category":"page"},{"location":"layers/multihead_attention_layer/#References","page":"Multihead Attention","title":"References","text":"","category":"section"},{"location":"layers/multihead_attention_layer/","page":"Multihead Attention","title":"Multihead Attention","text":"A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser and I. Polosukhin. Attention is all you need. Advances in neural information processing systems 30 (2017).\n\n\n\n","category":"page"},{"location":"tutorials/grassmann_layer/#Example-of-a-Neural-Network-with-a-Grassmann-Layer","page":"Grassmann manifold","title":"Example of a Neural Network with a Grassmann Layer","text":"","category":"section"},{"location":"tutorials/grassmann_layer/","page":"Grassmann manifold","title":"Grassmann manifold","text":"Here we show how to implement a neural network that contains a layer whose weight is an element of the Grassmann manifold and where this might be useful. 
","category":"page"},{"location":"tutorials/grassmann_layer/","page":"Grassmann manifold","title":"Grassmann manifold","text":"To answer where we would need this consider the following scenario","category":"page"},{"location":"tutorials/grassmann_layer/#Problem-statement","page":"Grassmann manifold","title":"Problem statement","text":"","category":"section"},{"location":"tutorials/grassmann_layer/","page":"Grassmann manifold","title":"Grassmann manifold","text":"We are given data in a big space mathcalD=d_i_iinmathcalIsubsetmathbbR^N and know these data live on an n-dimensional[1] submanifold[2] in mathbbR^N. Based on these data we would now like to generate new samples from the distributions that produced our original data. This is where the Grassmann manifold is useful: each element V of the Grassmann manifold is an n-dimensional subspace of mathbbR^N from which we can easily sample. We can then construct a (bijective) mapping from this space V onto a space that contains our data points mathcalD. ","category":"page"},{"location":"tutorials/grassmann_layer/","page":"Grassmann manifold","title":"Grassmann manifold","text":"[1]: We may know n exactly or approximately. ","category":"page"},{"location":"tutorials/grassmann_layer/","page":"Grassmann manifold","title":"Grassmann manifold","text":"[2]: Problems and solutions related to this scenario are commonly summarized under the term manifold learning (see [25]).","category":"page"},{"location":"tutorials/grassmann_layer/#Example","page":"Grassmann manifold","title":"Example","text":"","category":"section"},{"location":"tutorials/grassmann_layer/","page":"Grassmann manifold","title":"Grassmann manifold","text":"Consider the following toy example: We want to sample from the graph of the (scaled) Rosenbrock function f(xy) = ((1 - x)^2 + 100(y - x^2)^2)1000 while pretending we do not know the function. 
","category":"page"},{"location":"tutorials/grassmann_layer/","page":"Grassmann manifold","title":"Grassmann manifold","text":"using Plots # hide\n# hide\nrosenbrock(x::Vector) = ((1.0 - x[1]) ^ 2 + 100.0 * (x[2] - x[1] ^ 2) ^ 2) / 1000\nx, y = -1.5:0.1:1.5, -1.5:0.1:1.5\nz = Surface((x,y)->rosenbrock([x,y]), x, y)\np = surface(x,y,z; camera=(30,20), alpha=.6, colorbar=false, xlims=(-1.5, 1.5), ylims=(-1.5, 1.5), zlims=(0.0, rosenbrock([-1.5, -1.5])))","category":"page"},{"location":"tutorials/grassmann_layer/","page":"Grassmann manifold","title":"Grassmann manifold","text":"We now build a neural network whose task it is to map a product of two Gaussians mathcalN(01)timesmathcalN(01) onto the graph of the Rosenbrock function where the range for x and for y is -1515.","category":"page"},{"location":"tutorials/grassmann_layer/","page":"Grassmann manifold","title":"Grassmann manifold","text":"For computing the loss between the two distributions, i.e. Psi(mathcalN(01)timesmathcalN(01)) and f(-1515 -1515) we use the Wasserstein distance[3].","category":"page"},{"location":"tutorials/grassmann_layer/","page":"Grassmann manifold","title":"Grassmann manifold","text":"[3]: The implementation of the Wasserstein distance is taken from [26].","category":"page"},{"location":"tutorials/grassmann_layer/","page":"Grassmann manifold","title":"Grassmann manifold","text":"using GeometricMachineLearning, Zygote, BrenierTwoFluid\nusing LinearAlgebra: norm # hide\nimport Random # hide \nRandom.seed!(123)\n\nmodel = Chain(GrassmannLayer(2,3), Dense(3, 8, tanh), Dense(8, 3, identity))\n\nnn = NeuralNetwork(model, CPU(), Float64)\n\n# this computes the cost that is associated to the Wasserstein distance\nc = (x,y) -> .5 * norm(x - y)^2\n∇c = (x,y) -> x - y\n\nconst ε = 0.1 # entropic regularization. √ε is a length. 
# hide\nconst q = 1.0 # annealing parameter # hide\nconst Δ = 1.0 # characteristic domain size # hide\nconst s = ε # current scale: no annealing -> equals ε # hide\nconst tol = 1e-6 # marginal condition tolerance # hide \nconst crit_it = 20 # acceleration inference # hide\nconst p_η = 2\n\nfunction compute_wasserstein_gradient(ensemble1::AT, ensemble2::AT) where AT<:AbstractArray\n number_of_particles1 = size(ensemble1, 2)\n number_of_particles2 = size(ensemble2, 2)\n V = SinkhornVariable(copy(ensemble1'), ones(number_of_particles1) / number_of_particles1)\n W = SinkhornVariable(copy(ensemble2'), ones(number_of_particles2) / number_of_particles2)\n params = SinkhornParameters(; ε=ε,q=1.0,Δ=1.0,s=s,tol=tol,crit_it=crit_it,p_η=p_η,sym=false,acc=true) # hide\n S = SinkhornDivergence(V, W, c, params; islog = true)\n initialize_potentials!(S)\n compute!(S)\n value(S), x_gradient!(S, ∇c)'\nend\n\nxyz_points = hcat([[x,y,rosenbrock([x,y])] for x in x for y in y]...)\n\nfunction compute_gradient(ps::Tuple)\n samples = randn(2, size(xyz_points, 2))\n\n estimate, nn_pullback = Zygote.pullback(ps -> model(samples, ps), ps)\n\n valS, wasserstein_gradient = compute_wasserstein_gradient(estimate, xyz_points)\n valS, nn_pullback(wasserstein_gradient)[1]\nend\n\n# note the very high value for the learning rate\noptimizer = Optimizer(nn, AdamOptimizer(1e-1))\n\n# note the small number of training steps\nconst training_steps = 40\nloss_array = zeros(training_steps)\nfor i in 1:training_steps\n val, dp = compute_gradient(nn.params)\n loss_array[i] = val\n optimization_step!(optimizer, model, nn.params, dp)\nend\nplot(loss_array, xlabel=\"training step\", label=\"loss\")","category":"page"},{"location":"tutorials/grassmann_layer/","page":"Grassmann manifold","title":"Grassmann manifold","text":"Now we plot a few points to check how well they match the graph:","category":"page"},{"location":"tutorials/grassmann_layer/","page":"Grassmann manifold","title":"Grassmann 
manifold","text":"const number_of_points = 35\n\ncoordinates = nn(randn(2, number_of_points))\nscatter3d!(p, [coordinates[1, :]], [coordinates[2, :]], [coordinates[3, :]], alpha=.5, color=4, label=\"mapped points\")","category":"page"},{"location":"tutorials/volume_preserving_attention/#Comparison-of-different-VolumePreservingAttention","page":"Volume-Preserving Attention","title":"Comparison of different VolumePreservingAttention","text":"","category":"section"},{"location":"tutorials/volume_preserving_attention/","page":"Volume-Preserving Attention","title":"Volume-Preserving Attention","text":"In the section on volume-preserving attention we mentioned two ways of computing volume-preserving attention: one where we compute the correlations with a skew-symmetric matrix and one where we compute the correlations with an arbitrary matrix. Here we compare the two approaches. When calling the VolumePreservingAttention layer we can specify whether we want to use the skew-symmetric or the arbitrary weighting by setting the keyword skew_sym = true and skew_sym = false respectively. ","category":"page"},{"location":"tutorials/volume_preserving_attention/","page":"Volume-Preserving Attention","title":"Volume-Preserving Attention","text":"Here we demonstrate the differences between the two approaches for computing correlations. For this we first generate a training set consisting of two collections of curves: (i) sine curves and (ii) cosine curves. 
","category":"page"},{"location":"tutorials/volume_preserving_attention/","page":"Volume-Preserving Attention","title":"Volume-Preserving Attention","text":"using GeometricMachineLearning # hide\nusing GeometricMachineLearning: FeedForwardLoss, TransformerLoss # hide\nusing Plots # hide\nimport Random # hide \nRandom.seed!(123) # hide\n\nsine_cosine = zeros(1, 1000, 2)\nsine_cosine[1, :, 1] .= sin.(0.:.1:99.9)\nsine_cosine[1, :, 2] .= cos.(0.:.1:99.9)\n\n\nconst dl = DataLoader(Float16.(sine_cosine))","category":"page"},{"location":"tutorials/volume_preserving_attention/","page":"Volume-Preserving Attention","title":"Volume-Preserving Attention","text":"The third axis (i.e. the parameter axis) has length two, meaning we have two different kinds of curves: ","category":"page"},{"location":"tutorials/volume_preserving_attention/","page":"Volume-Preserving Attention","title":"Volume-Preserving Attention","text":"plot(dl.input[1, :, 1], label = \"sine\")\nplot!(dl.input[1, :, 2], label = \"cosine\")","category":"page"},{"location":"tutorials/volume_preserving_attention/","page":"Volume-Preserving Attention","title":"Volume-Preserving Attention","text":"We want to train a single neural network on both these curves. We compare three networks which are of the following form: ","category":"page"},{"location":"tutorials/volume_preserving_attention/","page":"Volume-Preserving Attention","title":"Volume-Preserving Attention","text":"mathttnetwork = mathcalNN_dcircPsicircmathcalNN_u","category":"page"},{"location":"tutorials/volume_preserving_attention/","page":"Volume-Preserving Attention","title":"Volume-Preserving Attention","text":"where mathcalNN_u refers to a neural network that scales up and mathcalNN_d refers to a neural network that scales down. 
The up and down scaling is done with simple dense layers: ","category":"page"},{"location":"tutorials/volume_preserving_attention/","page":"Volume-Preserving Attention","title":"Volume-Preserving Attention","text":"mathcalNN_u(x) = mathrmtanh(a_ux + b_u) text and mathcalNN_d(x) = a_d^Tx + b_d","category":"page"},{"location":"tutorials/volume_preserving_attention/","page":"Volume-Preserving Attention","title":"Volume-Preserving Attention","text":"where a_u b_u a_dinmathbbR^mathrmud and b_d is a scalar. ud refers to upscaling dimension. For Psi we consider three different choices:","category":"page"},{"location":"tutorials/volume_preserving_attention/","page":"Volume-Preserving Attention","title":"Volume-Preserving Attention","text":"a volume-preserving attention with skew-symmetric weighting,\na volume-preserving attention with arbitrary weighting,\nan identity layer.","category":"page"},{"location":"tutorials/volume_preserving_attention/","page":"Volume-Preserving Attention","title":"Volume-Preserving Attention","text":"We further choose a sequence length of 3 (i.e. the network always sees the last 3 time steps) and always predict one step into the future (i.e. 
the prediction window is set to 1):","category":"page"},{"location":"tutorials/volume_preserving_attention/","page":"Volume-Preserving Attention","title":"Volume-Preserving Attention","text":"const seq_length = 3\nconst prediction_window = 1\n\nconst upscale_dimension_1 = 2\n\nconst T = Float16\n\nfunction set_up_networks(upscale_dimension::Int = upscale_dimension_1)\n model_skew = Chain(Dense(1, upscale_dimension, tanh), VolumePreservingAttention(upscale_dimension, seq_length; skew_sym = true), Dense(upscale_dimension, 1, identity; use_bias = true))\n model_arb = Chain(Dense(1, upscale_dimension, tanh), VolumePreservingAttention(upscale_dimension, seq_length; skew_sym = false), Dense(upscale_dimension, 1, identity; use_bias = true))\n model_comp = Chain(Dense(1, upscale_dimension, tanh), Dense(upscale_dimension, 1, identity; use_bias = true))\n\n nn_skew = NeuralNetwork(model_skew, CPU(), T)\n nn_arb = NeuralNetwork(model_arb, CPU(), T)\n nn_comp = NeuralNetwork(model_comp, CPU(), T)\n\n nn_skew, nn_arb, nn_comp\nend\n\nnn_skew, nn_arb, nn_comp = set_up_networks()","category":"page"},{"location":"tutorials/volume_preserving_attention/","page":"Volume-Preserving Attention","title":"Volume-Preserving Attention","text":"We expect the third network to not be able to learn anything useful since it cannot resolve time series data: a regular feedforward network only ever sees one datum at a time. 
","category":"page"},{"location":"tutorials/volume_preserving_attention/","page":"Volume-Preserving Attention","title":"Volume-Preserving Attention","text":"Next we train the networks (here we pick a batch size of 30):","category":"page"},{"location":"tutorials/volume_preserving_attention/","page":"Volume-Preserving Attention","title":"Volume-Preserving Attention","text":"function set_up_optimizers(nn_skew, nn_arb, nn_comp)\n o_skew = Optimizer(AdamOptimizer(T), nn_skew)\n o_arb = Optimizer(AdamOptimizer(T), nn_arb)\n o_comp = Optimizer(AdamOptimizer(T), nn_comp)\n\n o_skew, o_arb, o_comp\nend\n\no_skew, o_arb, o_comp = set_up_optimizers(nn_skew, nn_arb, nn_comp)\n\nconst n_epochs = 1000\n\nconst batch_size = 30\n\nconst batch = Batch(batch_size, seq_length, prediction_window)\nconst batch2 = Batch(batch_size)\n\nfunction train_networks!(nn_skew, nn_arb, nn_comp)\n loss_array_skew = o_skew(nn_skew, dl, batch, n_epochs, TransformerLoss(batch))\n loss_array_arb = o_arb( nn_arb, dl, batch, n_epochs, TransformerLoss(batch))\n loss_array_comp = o_comp(nn_comp, dl, batch2, n_epochs, FeedForwardLoss())\n\n loss_array_skew, loss_array_arb, loss_array_comp\nend\n\nloss_array_skew, loss_array_arb, loss_array_comp = train_networks!(nn_skew, nn_arb, nn_comp)\n\nfunction plot_training_losses(loss_array_skew, loss_array_arb, loss_array_comp)\n p = plot(loss_array_skew, color = 2, label = \"skew\", yaxis = :log)\n plot!(p, loss_array_arb, color = 3, label = \"arb\")\n plot!(p, loss_array_comp, color = 4, label = \"comp\")\n\n p\nend\n\nplot_training_losses(loss_array_skew, loss_array_arb, loss_array_comp)","category":"page"},{"location":"tutorials/volume_preserving_attention/","page":"Volume-Preserving Attention","title":"Volume-Preserving Attention","text":"Looking at the training errors, we can see that the network with the skew-symmetric weighting is stuck at a relatively high error rate, whereas the loss for the network with the arbitrary weighting is decreasing to a 
significantly lower level. The feedforward network without the attention mechanism is not able to learn anything useful (as was expected). ","category":"page"},{"location":"tutorials/volume_preserving_attention/","page":"Volume-Preserving Attention","title":"Volume-Preserving Attention","text":"The following demonstrates the predictions of our approaches[1]: ","category":"page"},{"location":"tutorials/volume_preserving_attention/","page":"Volume-Preserving Attention","title":"Volume-Preserving Attention","text":"[1]: Here we have to use the architectures DummyTransformer and DummyNNIntegrator to reformulate the three neural networks defined here as NeuralNetworkIntegrators. Normally the user should try to use predefined architectures in GeometricMachineLearning, that way they never use DummyTransformer and DummyNNIntegrator. ","category":"page"},{"location":"tutorials/volume_preserving_attention/","page":"Volume-Preserving Attention","title":"Volume-Preserving Attention","text":"initial_condition = dl.input[:, 1:seq_length, 2]\n\nfunction make_networks_neural_network_integrators(nn_skew, nn_arb, nn_comp)\n nn_skew = NeuralNetwork(GeometricMachineLearning.DummyTransformer(seq_length), nn_skew.model, nn_skew.params)\n nn_arb = NeuralNetwork(GeometricMachineLearning.DummyTransformer(seq_length), nn_arb.model, nn_arb.params)\n nn_comp = NeuralNetwork(GeometricMachineLearning.DummyNNIntegrator(), nn_comp.model, nn_comp.params)\n\n nn_skew, nn_arb, nn_comp\nend\n\nnn_skew, nn_arb, nn_comp = make_networks_neural_network_integrators(nn_skew, nn_arb, nn_comp)\n\nfunction produce_validation_plot(n_points::Int, nn_skew = nn_skew, nn_arb = nn_arb, nn_comp = nn_comp; initial_condition::Matrix=initial_condition, type = :cos)\n validation_skew = iterate(nn_skew, initial_condition; n_points = n_points, prediction_window = 1)\n validation_arb = iterate(nn_arb, initial_condition; n_points = n_points, prediction_window = 1)\n validation_comp = iterate(nn_comp, initial_condition[:, 
1]; n_points = n_points)\n\n p2 = type == :cos ? plot(dl.input[1, 1:n_points, 2], color = 1, label = \"reference\") : plot(dl.input[1, 1:n_points, 1], color = 1, label = \"reference\")\n\n plot!(p2, validation_skew[1, :], color = 2, label = \"skew\")\n plot!(p2, validation_arb[1, :], color = 3, label = \"arb\")\n plot!(p2, validation_comp[1, :], color = 4, label = \"comp\")\n vline!(p2, [seq_length], color = :red, label = \"start of prediction\")\n\n p2 \nend\n\np2 = produce_validation_plot(40)","category":"page"},{"location":"tutorials/volume_preserving_attention/","page":"Volume-Preserving Attention","title":"Volume-Preserving Attention","text":"In the above plot we can see that the network with the arbitrary weighting performs much better; even though the green line does not fit the blue line very well either, it manages to at least qualitatively reflect the training data. We can also plot the predictions for longer time intervals: ","category":"page"},{"location":"tutorials/volume_preserving_attention/","page":"Volume-Preserving Attention","title":"Volume-Preserving Attention","text":"p3 = produce_validation_plot(400)","category":"page"},{"location":"tutorials/volume_preserving_attention/","page":"Volume-Preserving Attention","title":"Volume-Preserving Attention","text":"We can also plot the comparison with the sine function: ","category":"page"},{"location":"tutorials/volume_preserving_attention/","page":"Volume-Preserving Attention","title":"Volume-Preserving Attention","text":"initial_condition = dl.input[:, 1:seq_length, 1]\n\np2 = produce_validation_plot(40, initial_condition = initial_condition, type = :sin)","category":"page"},{"location":"tutorials/volume_preserving_attention/","page":"Volume-Preserving Attention","title":"Volume-Preserving Attention","text":"This advantage of the volume-preserving attention with arbitrary weighting may however be due to the fact that the skew-symmetric attention only has 3 learnable parameters, as opposed to 9 for the arbitrary 
weighting. If we increase the upscaling dimension the result changes: ","category":"page"},{"location":"tutorials/volume_preserving_attention/","page":"Volume-Preserving Attention","title":"Volume-Preserving Attention","text":"const upscale_dimension_2 = 10\n\nnn_skew, nn_arb, nn_comp = set_up_networks(upscale_dimension_2)\n\no_skew, o_arb, o_comp = set_up_optimizers(nn_skew, nn_arb, nn_comp)\n\nloss_array_skew, loss_array_arb, loss_array_comp = train_networks!(nn_skew, nn_arb, nn_comp)\n\nplot_training_losses(loss_array_skew, loss_array_arb, loss_array_comp)","category":"page"},{"location":"tutorials/volume_preserving_attention/","page":"Volume-Preserving Attention","title":"Volume-Preserving Attention","text":"initial_condition = dl.input[:, 1:seq_length, 2]\n\nnn_skew, nn_arb, nn_comp = make_networks_neural_network_integrators(nn_skew, nn_arb, nn_comp)\n\np2 = produce_validation_plot(40, nn_skew, nn_arb, nn_comp)","category":"page"},{"location":"tutorials/volume_preserving_attention/","page":"Volume-Preserving Attention","title":"Volume-Preserving Attention","text":"And for a longer time interval: ","category":"page"},{"location":"tutorials/volume_preserving_attention/","page":"Volume-Preserving Attention","title":"Volume-Preserving Attention","text":"p3 = produce_validation_plot(200, nn_skew, nn_arb, nn_comp)","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/#Symplectic-Autoencoder","page":"PSD and Symplectic Autoencoders","title":"Symplectic Autoencoder","text":"","category":"section"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"Symplectic Autoencoders are a type of neural network suitable for treating Hamiltonian parametrized PDEs with slowly decaying Kolmogorov n-width. 
They are based on proper symplectic decomposition (PSD) and symplectic neural networks (SympNets).","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/#Hamiltonian-Model-Order-Reduction","page":"PSD and Symplectic Autoencoders","title":"Hamiltonian Model Order Reduction","text":"","category":"section"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"Hamiltonian PDEs are partial differential equations that, like their ODE counterparts, have a Hamiltonian associated with them. An example of this is the linear wave equation (see [19]) with Hamiltonian ","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"mathcalH(q p mu) = frac12int_Omegamu^2(partial_xiq(tximu))^2 + p(tximu)^2dxi","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"The PDE for this Hamiltonian can be obtained similarly as in the ODE case:","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"partial_tq(tximu) = fracdeltamathcalHdeltap = p(tximu) quad partial_tp(tximu) = -fracdeltamathcalHdeltaq = mu^2partial_xixiq(tximu)","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/#Symplectic-Solution-Manifold","page":"PSD and Symplectic Autoencoders","title":"Symplectic Solution Manifold","text":"","category":"section"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"As with regular parametric PDEs, we also associate a solution manifold with Hamiltonian PDEs. 
This is a finite-dimensional manifold, on which the dynamics can be described through a Hamiltonian ODE. I NEED A PROOF OR SOME EXPLANATION FOR THIS!","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/#Workflow-for-Symplectic-ROM","page":"PSD and Symplectic Autoencoders","title":"Workflow for Symplectic ROM","text":"","category":"section"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"As with any other reduced order modeling technique we first discretize the PDE. This should be done with a structure-preserving scheme, thus yielding a (high-dimensional) Hamiltonian ODE as a result. Discretizing the wave equation above with finite differences yields a Hamiltonian system: ","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"mathcalH_mathrmdiscr(z(tmu)mu) = frac12x(tmu)^Tbeginbmatrix -mu^2D_xixi mathbbO mathbbO mathbbI endbmatrix x(tmu)","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"In Hamiltonian reduced order modelling we try to find a symplectic submanifold of the solution space[1] that captures the dynamics of the full system as well as possible.","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"[1]: The submanifold is: tildemathcalM = Psi^mathrmdec(z_r)inmathbbR^2Nu_rinmathrmR^2n where z_r is the reduced state of the system. 
","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"Similar to the regular PDE case we again build an encoder Psi^mathrmenc and a decoder Psi^mathrmdec; but now both these mappings are required to be symplectic!","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"Concretely this means: ","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"The encoder is a mapping from a high-dimensional symplectic space to a low-dimensional symplectic space, i.e. Psi^mathrmencmathbbR^2NtomathbbR^2n such that nablaPsi^mathrmencmathbbJ_2N(nablaPsi^mathrmenc)^T = mathbbJ_2n.\nThe decoder is a mapping from a low-dimensional symplectic space to a high-dimensional symplectic space, i.e. Psi^mathrmdecmathbbR^2ntomathbbR^2N such that (nablaPsi^mathrmdec)^TmathbbJ_2NnablaPsi^mathrmdec = mathbbJ_2n.","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"If these two maps are constrained to linear maps, then one can easily find good solutions with proper symplectic decomposition (PSD).","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/#Proper-Symplectic-Decomposition","page":"PSD and Symplectic Autoencoders","title":"Proper Symplectic Decomposition","text":"","category":"section"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"For PSD the two mappings Psi^mathrmenc and Psi^mathrmdec are constrained to be linear, orthonormal (i.e. Psi^TPsi = mathbbI) and symplectic. 
The easiest way to enforce this is through the so-called cotangent lift: ","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"Psi_mathrmCL = \nbeginbmatrix Phi mathbbO mathbbO Phi endbmatrix","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"and PhiinSt(nN)subsetmathbbR^Ntimesn, i.e. is an element of the Stiefel manifold. If the snapshot matrix is of the form: ","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"M = leftbeginarraycccc\nhatq_1(t_0) hatq_1(t_1) quadldotsquad hatq_1(t_f) \nhatq_2(t_0) hatq_2(t_1) ldots hatq_2(t_f) \nldots ldots ldots ldots \nhatq_N(t_0) hatq_N(t_1) ldots hatq_N(t_f) \nhatp_1(t_0) hatp_1(t_1) ldots hatp_1(t_f) \nhatp_2(t_0) hatp_2(t_1) ldots hatp_2(t_f) \nldots ldots ldots ldots \nhatp_N(t_0) hatp_N(t_1) ldots hatp_N(t_f) \nendarrayright","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"then Phi can be computed in a very straight-forward manner: ","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"Rearrange the rows of the matrix M such that we end up with a Ntimes2(f+1) matrix: hatM = M_q M_p.\nPerform SVD: hatM = USigmaV^T; set PhigetsUmathtt1n.","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"For details on the cotangent lift (and other methods for linear symplectic model reduction) consult 
[20].","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/#Symplectic-Autoencoders","page":"PSD and Symplectic Autoencoders","title":"Symplectic Autoencoders","text":"","category":"section"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"PSD suffers from shortcomings similar to those of regular POD: it is a linear map and the approximation space tildemathcalM= Psi^mathrmdec(z_r)inmathbbR^2Nu_rinmathrmR^2n is strictly linear. For problems with slowly-decaying Kolmogorov n-width this leads to very poor approximations. ","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"In order to overcome this difficulty we use neural networks, more specifically SympNets, together with cotangent lift-like matrices. The resulting architecture, the symplectic autoencoder, is shown in the following image: ","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"import Images, Plots # hide\nif Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nelse # hide\n Plots.plot(Images.load(\"../tikz/symplectic_autoencoder.png\"), axis=([], false)) # hide\nend # hide","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"if Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nend # hide","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"So we alternate between SympNet and PSD layers. 
Because all the PSD layers are based on matrices PhiinSt(nN) we have to optimize on the Stiefel manifold.","category":"page"},{"location":"reduced_order_modeling/symplectic_autoencoder/#References","page":"PSD and Symplectic Autoencoders","title":"References","text":"","category":"section"},{"location":"reduced_order_modeling/symplectic_autoencoder/","page":"PSD and Symplectic Autoencoders","title":"PSD and Symplectic Autoencoders","text":"P. Buchfink, S. Glas and B. Haasdonk. Symplectic model reduction of Hamiltonian systems on nonlinear manifolds and approximation with weakly symplectic autoencoder. SIAM Journal on Scientific Computing 45, A289–A311 (2023).\n\n\n\nL. Peng and K. Mohseni. Symplectic model reduction of Hamiltonian systems. SIAM Journal on Scientific Computing 38, A1–A27 (2016).\n\n\n\n","category":"page"},{"location":"tutorials/linear_wave_equation/#The-Linear-Wave-Equation","page":"Linear Wave Equation","title":"The Linear Wave Equation","text":"","category":"section"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"The linear wave equation is the prototypical example for a Hamiltonian PDE. It is given by (see [19] and [20]): ","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"mathcalH(q p mu) = frac12int_Omegamu^2(partial_xiq(tximu))^2 + p(tximu)^2dxi","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"with xiinOmega=(-1212) and muinmathbbP=51256 as a possible choice for domain and parameters. 
","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"The PDE for to this Hamiltonian can be obtained similarly as in the ODE case:","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"partial_tq(tximu) = fracdeltamathcalHdeltap = p(tximu) quad partial_tp(tximu) = -fracdeltamathcalHdeltaq = mu^2partial_xixiq(tximu)","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"As with any other PDE, the wave equation can also be discretized to obtain a ODE which can be solved numerically.","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"If we discretize mathcalH directly, to obtain a Hamiltonian on a finite-dimensional vector space mathbbR^2N, we get a Hamiltonian ODE[1]:","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"[1]: This conserves the Hamiltonian structure of the system.","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"mathcalH_h(z) = sum_i=1^tildeNfracDeltax2biggp_i^2 + mu^2frac(q_i - q_i-1)^2 + (q_i+1 - q_i)^22Deltax^2bigg = fracDeltax2p^Tp + q^TKq","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"where the matrix K contains elements of the form: ","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"k_ij = begincases fracmu^24Deltax textif (ij)in(00)(tildeN+1tildeN+1) \n -fracmu^22Deltax textif (ij)=(10) or (ij)=(tildeNtildeN+1) \n frac3mu^24Deltax textif (ij)in(11)(tildeNtildeN) \n fracmu^2Deltax textif 
i=j and iin2ldots(tildeN-2) \n -fracmu^22Deltax textif i-j=1 and ijnotin0tildeN+1 \n 0 textelse\n endcases","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"The vector field of the full-order model (FOM) is described by (see for example Peng and Mohseni, 2016):","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":" fracdzdt = mathbbJ_dnabla_zmathcalH_h = mathbbJ_dbeginbmatrixDeltaxmathbbI mathbbO mathbbO K + K^Tendbmatrixz quad mathbbJ_d = fracmathbbJ_2NDeltax","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"The wave equation has a slowly-decaying Kolmogorov n-width (see e.g. Greif and Urban, 2019), which means linear methods like PSD will perform poorly.","category":"page"},{"location":"tutorials/linear_wave_equation/#Using-the-Linear-Wave-Equation-in-Numerical-Experiments","page":"Linear Wave Equation","title":"Using the Linear Wave Equation in Numerical Experiments","text":"","category":"section"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"In order to use the linear wave equation in numerical experiments we have to pick suitable initial conditions. 
For this, consider the third-order spline: ","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"h(s) = begincases\n 1 - frac32s^2 + frac34s^3 textif 0 leq s leq 1 \n frac14(2 - s)^3 textif 1 < s leq 2 \n 0 textelse \nendcases","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"Plotted on the relevant domain it looks like this: ","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"import Images, Plots # hide\nif Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nelse # hide\n Plots.plot(Images.load(\"../tikz/third_degree_spline.png\"), axis=([], false)) # hide\nend # hide","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"if Main.output_type == :html # hide \n HTML(\"\"\"\"\"\") # hide\nend # hide","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"Taking the above function h(s) as a starting point, the initial conditions for the linear wave equation will now be constructed under the following considerations: ","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"the initial condition (i.e. the shape of the wave) should depend on the parameter of the vector field, i.e. u_0(mu)(omega) = h(s(omega mu)).\nthe solutions of the linear wave equation will travel with speed mu, and we should make sure that the wave does not touch the right boundary of the domain, i.e. 0.5. So the peak should be sharper for higher values of mu as the wave will travel faster.\nthe wave should start at the left boundary of the domain, i.e. at point -0.5, so as to cover as much of the domain as possible. 
","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"Based on this we end up with the following choice of parametrized initial conditions: ","category":"page"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"u_0(mu)(omega) = h(s(omega mu)) quad s(omega mu) = 20 mu omega + fracmu2","category":"page"},{"location":"tutorials/linear_wave_equation/#References","page":"Linear Wave Equation","title":"References","text":"","category":"section"},{"location":"tutorials/linear_wave_equation/","page":"Linear Wave Equation","title":"Linear Wave Equation","text":"P. Buchfink, S. Glas and B. Haasdonk. Symplectic model reduction of Hamiltonian systems on nonlinear manifolds and approximation with weakly symplectic autoencoder. SIAM Journal on Scientific Computing 45, A289–A311 (2023).\n\n\n\nL. Peng and K. Mohseni. Symplectic model reduction of Hamiltonian systems. SIAM Journal on Scientific Computing 38, A1–A27 (2016).\n\n\n\nC. Greif and K. Urban. Decay of the Kolmogorov N-width for wave problems. Applied Mathematics Letters 96, 216–222 (2019).\n\n\n\n","category":"page"},{"location":"layers/attention_layer/#The-Attention-Layer","page":"Attention","title":"The Attention Layer","text":"","category":"section"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"The attention mechanism was originally developed for image and natural language processing (NLP) tasks. It is motivated by the need to handle time series data in an efficient way[1]. Its essential idea is to compute correlations between vectors in input sequences. I.e. 
given sequences ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"(z_q^(1) z_q^(2) ldots z_q^(T)) text and (z_p^(1) z_p^(2) ldots z_p^(T))","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"an attention mechanism computes pair-wise correlations between all combinations of two input vectors from these sequences. In [13] \"additive\" attention is used to compute such correlations: ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"[1]: Recurrent neural networks have the same motivation. ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"(z_q z_k) mapsto v^Tsigma(Wz_q + Uz_k) ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"where z_q z_k in mathbbR^d are elements of the input sequences. The learnable parameters are W U in mathbbR^ntimesd and v in mathbbR^n.","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"However multiplicative attention (see e.g. [14])is more straightforward to interpret and cheaper to handle computationally: ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"(z_q z_k) mapsto z_q^TWz_k","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"where W in mathbbR^dtimesd is a learnable weight matrix with respect to which correlations are computed as scalar products. Regardless of the type of attention used, they all try to compute correlations among input sequences on whose basis further computation is performed. 
Given two input sequences Z_q = (z_q^(1) ldots z_q^(T)) and Z_k = (z_k^(1) ldots z_k^(T)), we can arrange the various correlations into a correlation matrix CinmathbbR^TtimesT with entries C_ij = mathttattention(z_q^(i) z_k^(j)). In the case of multiplicative attention this matrix is just C = Z^TWZ.","category":"page"},{"location":"layers/attention_layer/#Reweighting-of-the-input-sequence","page":"Attention","title":"Reweighting of the input sequence","text":"","category":"section"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"In GeometricMachineLearning we always compute self-attention, meaning that the two input sequences Z_q and Z_k are the same, i.e. Z = Z_q = Z_k.[2]","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"[2]: Multihead attention also falls into this category. Here the input Z is multiplied from the left with several projection matrices P^Q_i and P^K_i, where i indicates the head. For each head we then compute a correlation matrix (P^Q_i Z)^T(P^K Z). ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"This is then used to reweight the columns in the input sequence Z. For this we first apply a nonlinearity sigma onto C and then multiply sigma(C) onto Z from the right, i.e. the output of the attention layer is Zsigma(C). 
So we perform the following mappings:","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"Z xrightarrowmathrmcorrelations C(Z) = C xrightarrowsigma sigma(C) xrightarrowtextright multiplication Z sigma(C)","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"After the right multiplication the output is of the following form: ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":" sum_i=1^Tp^(1)_iz^(i) ldots sum_i=1^Tp^(T)_iz^(i)","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"for p^(i) = sigma(C)_bulleti. What is learned during training are T different linear combinations of the input vectors, where the coefficients p^(i)_j in these linear combinations depend on the input Z nonlinearly. ","category":"page"},{"location":"layers/attention_layer/#VolumePreservingAttention-in-GeometricMachineLearning","page":"Attention","title":"VolumePreservingAttention in GeometricMachineLearning","text":"","category":"section"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"The attention layer (and the activation function sigma defined for it) in GeometricMachineLearning was specifically designed for data coming from physical systems that can be described through a divergence-free or a symplectic vector field. Traditionally the nonlinearity in the attention mechanism is a softmax[3] (see [14]) and the self-attention layer performs the following mapping: ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"[3]: The softmax acts on the matrix C in a vector-wise manner, i.e. it operates on each column of the input matrix C = c^(1) ldots c^(T). 
The result is a sequence of probability vectors p^(1) ldots p^(T) for which sum_i=1^Tp^(j)_i=1quadforalljin1dotsT","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"Z = z^(1) ldots z^(T) mapsto Zmathrmsoftmax(Z^TWZ)","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"The softmax activation acts vector-wise, i.e. if we supply it with a matrix C as input it returns: ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"mathrmsoftmax(C) = mathrmsoftmax(c_bullet1) ldots mathrmsoftmax(c_bulletT)","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"The output of a softmax is a probability vector (also called stochastic vector) and the matrix P = p^(1) ldots p^(T), where each column is a probability vector, is sometimes referred to as a stochastic matrix (see [15]). This attention mechanism finds application in transformer neural networks [14]. The problem with this matrix from a geometric point of view is that all the columns are independent of each other and the nonlinear transformation could in theory produce a stochastic matrix for which all columns are identical and thus lead to a loss of information. So the softmax activation function is inherently non-geometric. ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"Besides the traditional attention mechanism GeometricMachineLearning therefore also has a volume-preserving transformation that fulfills a similar role. There are two approaches implemented to realize similar transformations. Both of them however utilize the Cayley transform to produce orthogonal matrices sigma(C) instead of stochastic matrices. 
For an orthogonal matrix Sigma we have Sigma^TSigma = mathbbI, so all the columns are linearly independent which is not necessarily true for a stochastic matrix P. The following explains how this new activation function is implemented.","category":"page"},{"location":"layers/attention_layer/#The-Cayley-transform","page":"Attention","title":"The Cayley transform","text":"","category":"section"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"The Cayley transform maps from skew-symmetric matrices to orthonormal matrices[4]. It takes the form:","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"[4]: A matrix A is skew-symmetric if A = -A^T and a matrix B is orthonormal if B^TB = mathbbI. The orthonormal matrices form a Lie group, i.e. the set of orthonormal matrices can be endowed with the structure of a differential manifold and this set also satisfies the group axioms. The corresponding Lie algebra are the skew-symmetric matrices and the Cayley transform is a so-called retraction in this case. For more details consult e.g. [2] and [16].","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"mathrmCayley A mapsto (mathbbI - A)(mathbbI + A)^-1","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"We can easily check that mathrmCayley(A) is orthogonal if A is skew-symmetric. For this consider varepsilon mapsto A(varepsilon)inmathcalS_mathrmskew with A(0) = mathbbI and A(0) = B. 
Then we have: ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"fracdeltamathrmCayleydeltaA = fracddvarepsilon_varepsilon=0 mathrmCayley(A(varepsilon))^T mathrmCayley(A(varepsilon)) = mathbbO","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"In order to use the Cayley transform as an activation function we further need a mapping from the input Z to a skew-symmetric matrix. This is realized in two ways in GeometricMachineLearning: via a scalar-product with a skew-symmetric weighting and via a scalar-product with an arbitrary weighting.","category":"page"},{"location":"layers/attention_layer/#First-approach:-scalar-products-with-a-skew-symmetric-weighting","page":"Attention","title":"First approach: scalar products with a skew-symmetric weighting","text":"","category":"section"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"For this the attention layer is modified in the following way: ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"Z = z^(1) ldots z^(T) mapsto Zsigma(Z^TAZ)","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"where sigma(C)=mathrmCayley(C) and A is a skew-symmetric matrix that is learnable, i.e. the parameters of the attention layer are stored in A.","category":"page"},{"location":"layers/attention_layer/#Second-approach:-scalar-products-with-an-arbitrary-weighting","page":"Attention","title":"Second approach: scalar products with an arbitrary weighting","text":"","category":"section"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"For this approach we compute correlations between the input vectors with a skew-symmetric weighting. 
The correlations we consider here are based on: ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"(z^(2))^TAz^(1) (z^(3))^TAz^(1) ldots (z^(d))^TAz^(1) (z^(3))^TAz^(2) ldots (z^(d))^TAz^(2) ldots (z^(d))^TAz^(d-1)","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"So in total we consider correlations (z^(i))^TAz^(j) for which i > j. We now arrange these correlations into a skew-symmetric matrix: ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"C = beginbmatrix\n 0 -(z^(2))^TAz^(1) -(z^(3))^TAz^(1) ldots -(z^(d))^TAz^(1) \n (z^(2))^TAz^(1) 0 -(z^(3))^TAz^(2) ldots -(z^(d))^TAz^(2) \n ldots ldots ldots ldots ldots \n (z^(d))^TAz^(1) (z^(d))^TAz^(2) (z^(d))^TAz^(3) ldots 0 \nendbmatrix","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"This correlation matrix can now again be used as an input for the Cayley transform to produce an orthogonal matrix.","category":"page"},{"location":"layers/attention_layer/#How-is-structure-preserved?","page":"Attention","title":"How is structure preserved?","text":"","category":"section"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"In order to discuss how structure is preserved we first have to define what structure we mean precisely. This structure is strongly inspired by traditional multi-step methods (see [17]). We now define what volume preservation means for the product space mathbbR^dtimescdotstimesmathbbR^dequivtimes_textT timesmathbbR^d.","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"Consider an isomorphism hat times_text(T times)mathbbR^dstackrelapproxlongrightarrowmathbbR^dT. 
Specifically, this isomorphism takes the form:","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"Z = leftbeginarraycccc\n z_1^(1) z_1^(2) quadcdotsquad z_1^(T) \n z_2^(1) z_2^(2) cdots z_2^(T) \n cdots cdots cdots cdots \n z_d^(1) z_d^(2) cdots z_d^(T)\n endarrayright mapsto \n leftbeginarrayc z_1^(1) z_1^(2) cdots z_1^(T) z_2^(1) cdots z_d^(T) endarrayright = Z_mathrmvec","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"The inverse of Z mapsto hatZ we refer to as Y mapsto tildeY. In the following we also write hatvarphi for the mapping hatcircvarphicirctilde.","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"DEFINITION: We say that a mapping varphi times_textT timesmathbbR^d to times_textT timesmathbbR^d is volume-preserving if the associated hatvarphi is volume-preserving.","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"In the transformed coordinate system (in terms of the vector Z_mathrmvec defined above) this is equivalent to multiplication by a sparse matrix tildeLambda(Z) from the left:","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":" tildeLambda(Z) Z_mathrmvec =\n beginpmatrix\n Lambda(Z) mathbbO cdots mathbbO \n mathbbO Lambda(Z) cdots mathbbO \n cdots cdots ddots cdots \n mathbbO mathbbO cdots Lambda(Z) \n endpmatrix\n leftbeginarrayc z_1^(1) z_1^(2) ldots z_1^(T) z_2^(1) ldots z_d^(T) endarrayright ","category":"page"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"The matrix tildeLambda(Z) in the equation above is easily shown to be an orthogonal matrix. 
","category":"page"},{"location":"layers/attention_layer/#Historical-Note","page":"Attention","title":"Historical Note","text":"","category":"section"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"Attention was used before, but always in connection with recurrent neural networks (see [18] and [13]). ","category":"page"},{"location":"layers/attention_layer/#References","page":"Attention","title":"References","text":"","category":"section"},{"location":"layers/attention_layer/","page":"Attention","title":"Attention","text":"D. Bahdanau, K. Cho and Y. Bengio. Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473 (2014).\n\n\n\nM.-T. Luong, H. Pham and C. D. Manning. Effective approaches to attention-based neural machine translation, arXiv preprint arXiv:1508.04025 (2015).\n\n\n\n","category":"page"},{"location":"manifolds/homogeneous_spaces/#Homogeneous-Spaces","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"","category":"section"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"Homogeneous spaces are manifolds mathcalM on which a Lie group G acts transitively, i.e. ","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"forall XYinmathcalM existsAinGtext st AX = Y","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"Now fix a distinct element EinmathcalM. 
We can also establish an isomorphism between mathcalM and the quotient space Gsim with the equivalence relation: ","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"A_1 sim A_2 iff A_1E = A_2E","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"Note that this is independent of the chosen E.","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"The tangent spaces of mathcalM are of the form T_YmathcalM = mathfrakgcdotY, i.e. can be fully described through its Lie algebra. Based on this we can perform a splitting of mathfrakg into two parts:","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"The vertical component mathfrakg^mathrmverY is the kernel of the map mathfrakgtoT_YmathcalM V mapsto VY, i.e. mathfrakg^mathrmverY = VinmathfrakgVY = 0\nThe horizontal component mathfrakg^mathrmhorY is the orthogonal complement of mathfrakg^mathrmverY in mathfrakg. It is isomorphic to T_YmathcalM.","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"We will refer to the mapping from T_YmathcalM to mathfrakg^mathrmhor Y by Omega. 
If we have now defined a metric langlecdotcdotrangle on mathfrakg, then this induces a Riemannian metric on mathcalM:","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"g_Y(Delta_1 Delta_2) = langleOmega(YDelta_1)Omega(YDelta_2)rangletext for Delta_1Delta_2inT_YmathcalM","category":"page"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"Two examples of homogeneous spaces implemented in GeometricMachineLearning are the Stiefel and the Grassmann manifold.","category":"page"},{"location":"manifolds/homogeneous_spaces/#References","page":"Homogeneous Spaces","title":"References","text":"","category":"section"},{"location":"manifolds/homogeneous_spaces/","page":"Homogeneous Spaces","title":"Homogeneous Spaces","text":"Frankel, Theodore. The geometry of physics: an introduction. Cambridge university press, 2011.","category":"page"},{"location":"reduced_order_modeling/kolmogorov_n_width/#Kolmogorov-n-width","page":"Kolmogorov n-width","title":"Kolmogorov n-width","text":"","category":"section"},{"location":"reduced_order_modeling/kolmogorov_n_width/","page":"Kolmogorov n-width","title":"Kolmogorov n-width","text":"The Kolmogorov n-width measures how well some set mathcalM (typically the solution manifold) can be approximated with a linear subspace:","category":"page"},{"location":"reduced_order_modeling/kolmogorov_n_width/","page":"Kolmogorov n-width","title":"Kolmogorov n-width","text":"d_n(mathcalM) = mathrminf_V_nsubsetVmathrmdimV_n=nmathrmsup(uinmathcalM)mathrminf_v_ninV_n u - v_n _V","category":"page"},{"location":"reduced_order_modeling/kolmogorov_n_width/","page":"Kolmogorov n-width","title":"Kolmogorov n-width","text":"with mathcalMsubsetV and V is a (typically infinite-dimensional) Banach space. For advection-dominated problems (among others) the decay of the Kolmogorov n-width is very slow, i.e. 
one has to pick n very high in order to obtain useful approximations (see [21] and [22]).","category":"page"},{"location":"reduced_order_modeling/kolmogorov_n_width/","page":"Kolmogorov n-width","title":"Kolmogorov n-width","text":"In order to overcome this, techniques based on neural networks (see e.g. [23]) and optimal transport (see e.g. [22]) have been used. ","category":"page"},{"location":"reduced_order_modeling/kolmogorov_n_width/#References","page":"Kolmogorov n-width","title":"References","text":"","category":"section"},{"location":"reduced_order_modeling/kolmogorov_n_width/","page":"Kolmogorov n-width","title":"Kolmogorov n-width","text":"T. Blickhan. A registration method for reduced basis problems using linear optimal transport, arXiv preprint arXiv:2304.14884 (2023).\n\n\n\nC. Greif and K. Urban. Decay of the Kolmogorov N-width for wave problems. Applied Mathematics Letters 96, 216–222 (2019).\n\n\n\nK. Lee and K. T. Carlberg. Model reduction of dynamical systems on nonlinear manifolds using deep convolutional autoencoders. Journal of Computational Physics 404, 108973 (2020).\n\n\n\n","category":"page"},{"location":"layers/volume_preserving_feedforward/#Volume-Preserving-Feedforward-Layer","page":"Volume-Preserving Layers","title":"Volume-Preserving Feedforward Layer","text":"","category":"section"},{"location":"layers/volume_preserving_feedforward/","page":"Volume-Preserving Layers","title":"Volume-Preserving Layers","text":"Volume preserving feedforward layers are a special type of ResNet layer for which we restrict the weight matrices to be of a particular form. I.e. 
each layer computes: ","category":"page"},{"location":"layers/volume_preserving_feedforward/","page":"Volume-Preserving Layers","title":"Volume-Preserving Layers","text":"x mapsto x + sigma(Ax + b)","category":"page"},{"location":"layers/volume_preserving_feedforward/","page":"Volume-Preserving Layers","title":"Volume-Preserving Layers","text":"where sigma is a nonlinearity, A is the weight and b is the bias. The matrix A is either a lower-triangular matrix L or an upper-triangular matrix U[1]. The lower triangular matrix is of the form (the upper-triangular layer is simply the transpose of the lower triangular): ","category":"page"},{"location":"layers/volume_preserving_feedforward/","page":"Volume-Preserving Layers","title":"Volume-Preserving Layers","text":"[1]: Implemented as LowerTriangular and UpperTriangular in GeometricMachineLearning.","category":"page"},{"location":"layers/volume_preserving_feedforward/","page":"Volume-Preserving Layers","title":"Volume-Preserving Layers","text":"L = beginpmatrix\n 0 0 cdots 0 \n a_21 ddots vdots \n vdots ddots ddots vdots \n a_n1 cdots a_n(n-1) 0 \nendpmatrix","category":"page"},{"location":"layers/volume_preserving_feedforward/","page":"Volume-Preserving Layers","title":"Volume-Preserving Layers","text":"The Jacobian of a layer of the above form then is of the form","category":"page"},{"location":"layers/volume_preserving_feedforward/","page":"Volume-Preserving Layers","title":"Volume-Preserving Layers","text":"J = beginpmatrix\n 1 0 cdots 0 \n b_21 ddots vdots \n vdots ddots ddots vdots \n b_n1 cdots b_n(n-1) 1 \nendpmatrix","category":"page"},{"location":"layers/volume_preserving_feedforward/","page":"Volume-Preserving Layers","title":"Volume-Preserving Layers","text":"and the determinant of J is 1, i.e. the map is volume-preserving. 
","category":"page"},{"location":"layers/volume_preserving_feedforward/#Neural-network-architecture","page":"Volume-Preserving Layers","title":"Neural network architecture","text":"","category":"section"},{"location":"layers/volume_preserving_feedforward/","page":"Volume-Preserving Layers","title":"Volume-Preserving Layers","text":"Volume-preserving feedforward neural networks should be used as Architectures in GeometricMachineLearning. The constructor for them is: ","category":"page"},{"location":"layers/volume_preserving_feedforward/","page":"Volume-Preserving Layers","title":"Volume-Preserving Layers","text":"using GeometricMachineLearning, Markdown\nMarkdown.parse(description(Val(:VPFconstructor)))","category":"page"},{"location":"layers/volume_preserving_feedforward/","page":"Volume-Preserving Layers","title":"Volume-Preserving Layers","text":"The constructor produces the following architecture[2]:","category":"page"},{"location":"layers/volume_preserving_feedforward/","page":"Volume-Preserving Layers","title":"Volume-Preserving Layers","text":"[2]: Based on the input arguments n_linear and n_blocks. In this example init_upper is set to false, which means that the first layer is of type lower followed by a layer of type upper. 
","category":"page"},{"location":"layers/volume_preserving_feedforward/","page":"Volume-Preserving Layers","title":"Volume-Preserving Layers","text":"import Images, Plots # hide\nif Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nelse # hide\n Plots.plot(Images.load(\"../tikz/vp_feedforward.png\"), axis=([], false)) # hide\nend # hide","category":"page"},{"location":"layers/volume_preserving_feedforward/","page":"Volume-Preserving Layers","title":"Volume-Preserving Layers","text":"if Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nend # hide","category":"page"},{"location":"layers/volume_preserving_feedforward/","page":"Volume-Preserving Layers","title":"Volume-Preserving Layers","text":"Here LinearLowerLayer performs x mapsto x + Lx and NonLinearLowerLayer performs x mapsto x + sigma(Lx + b). The activation function sigma is the forth input argument to the constructor and tanh by default. ","category":"page"},{"location":"layers/volume_preserving_feedforward/#Note-on-Sympnets","page":"Volume-Preserving Layers","title":"Note on Sympnets","text":"","category":"section"},{"location":"layers/volume_preserving_feedforward/","page":"Volume-Preserving Layers","title":"Volume-Preserving Layers","text":"As SympNets are symplectic maps, they also conserve phase space volume and therefore form a subcategory of volume-preserving feedforward layers. ","category":"page"},{"location":"optimizers/bfgs_optimizer/#The-BFGS-Algorithm","page":"BFGS Optimizer","title":"The BFGS Algorithm","text":"","category":"section"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"The presentation shown here is largely taken from chapters 3 and 6 of reference [12] with a derivation based on an online comment. 
The Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm is a second-order optimizer that can also be used to train a neural network.","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"It is a version of a quasi-Newton method and is therefore especially suited for convex problems. As is the case with any other (quasi-)Newton method the BFGS algorithm approximates the objective with a quadratic function in each optimization step:","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"m_k(x) = f(x_k) + (nabla_x_kf)^T(x - x_k) + frac12(x - x_k)^TB_k(x - x_k)","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"where B_k is referred to as the approximate Hessian. We further require B_k to be symmetric and positive definite. Differentiating the above expression and setting the derivative to zero gives us: ","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"nabla_xm_k = nabla_x_kf + B_k(x - x_k) = 0","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"or written differently: ","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"x - x_k = -B_k^-1nabla_x_kf","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"From now on we call this value p_k = x - x_k and refer to it as the search direction. The new iterate then is: ","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"x_k+1 = x_k + alpha_kp_k","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"where alpha_k is the step length. 
Techniques that describe how to pick an appropriate alpha_k are called line-search methods and are discussed below. First we discuss what requirements we impose on B_k. A first reasonable condition would be to require the gradient of m_k to be equal to that of f at the points x_k-1 and x_k: ","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"beginaligned\nnabla_x_km_k = nabla_x_kf + B_k(x_k - x_k) overset= nabla_x_kf text and \nnabla_x_k-1m_k = nabla_x_kf + B_k(x_k-1 - x_k) overset= nabla_x_k-1f\nendaligned","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"The first one of these conditions is of course automatically satisfied. The second one can be rewritten as: ","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"B_k(x_k - x_k-1) overset= nabla_x_kf - nabla_x_k-1f ","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"The following notation is often used: ","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"s_k-1 = alpha_k-1p_k-1 = x_k - x_k-1 text and y_k-1 = nabla_x_kf - nabla_x_k-1f ","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"The condition mentioned above then becomes: ","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"B_ks_k-1 overset= y_k-1","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"and we call it the secant equation. 
A second condition we impose on B_k is that it has to be positive-definite at point s_k-1:","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"s_k-1^Ty_k-1 > 0","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"This is referred to as the curvature condition. If we impose the Wolfe conditions, the curvature condition holds automatically. The Wolfe conditions are stated with respect to the parameter alpha_k.","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"The Wolfe conditions are:","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"f(x_k+alphap_k)leqf(x_k) + c_1alpha(nabla_x_kf)^Tp_k for c_1in(01).\n(nabla_(x_k + alpha_kp_k)f)^Tp_k geq c_2(nabla_x_kf)^Tp_k for c_2in(c_11).","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"Possible choices for c_1 and c_2 are 10^-4 and 09 (see [12]). The two Wolfe conditions above are called the sufficient decrease condition and the curvature condition, respectively. Note that the second Wolfe condition (also called curvature condition) is stronger than the one mentioned before under the assumption that the first Wolfe condition is true:","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"(nabla_x_kf)^Tp_k-1 - c_2(nabla_x_k-1f)^Tp_k-1 = y_k-1^Tp_k-1 + (1 - c_2)(nabla_x_k-1f)^Tp_k-1 geq 0","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"and the second term in this expression is (1 - c_2)(nabla_x_k-1f)^Tp_k-1geqfrac1-c_2c_1alpha_k-1(f(x_k) - f(x_k-1)), which is negative. 
","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"In order to pick the ideal B_k we solve the following problem: ","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"beginaligned\nmin_B B - B_k-1_W \ntextst B = B^Ttext and Bs_k-1=y_k-1\nendaligned","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"where the first condition is symmetry and the second one is the secant equation. For the norm cdot_W we pick the weighted Frobenius norm:","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"A_W = W^12AW^12_F","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"where cdot_F is the usual Frobenius norm[1] and the matrix W=tildeB_k-1 is the inverse of the average Hessian:","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"tildeB_k-1 = int_0^1 nabla^2f(x_k-1 + taualpha_k-1p_k-1)dtau","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"[1]: The Frobenius norm is A_F^2 = sum_ija_ij^2.","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"In order to find the ideal B_k under the conditions described above, we introduce some notation: ","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"tildeB_k-1 = W^12B_k-1W^12,\ntildeB = W^12BW^12, \ntildey_k-1 = W^12y_k-1, \ntildes_k-1 = W^-12s_k-1.","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"With this notation we can rewrite the problem of finding B_k as: 
","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"beginaligned\nmin_tildeB tildeB - tildeB_k-1_F \ntextst tildeB = tildeB^Ttext and tildeBtildes_k-1=tildey_k-1\nendaligned","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"We further have Wy_k-1 = s_k-1 (by the mean value theorem ?) and therefore tildey_k-1 = tildes_k-1.","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"Now we rewrite B and B_k-1 in a new basis U = uu_perp, where u = tildes_k-1tildes_k-1 and u_perp is an orthogonal complement[2] of u:","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"[2]: So we must have u^Tu_perp=0 and further u_perp^Tu_perp=mathbbI.","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"beginaligned\nU^TtildeB_k-1U - U^TtildeBU = beginbmatrix u^T u_perp^T endbmatrix(tildeB_k-1 - tildeB)beginbmatrix u u_perp endbmatrix = \nbeginbmatrix\n u^TtildeB_k-1u - 1 u^TtildeB_k-1u \n u_perp^TtildeB_k-1u u_perp^T(tildeB_k-1-tildeB_k)u_perp\nendbmatrix\nendaligned","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"By a property of the Frobenius norm: ","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"tildeB_k-1 - tildeB^2_F = (u^TtildeB_k-1 -1)^2 + u^TtildeB_k-1u_perp_F^2 + u_perp^TtildeB_k-1u_F^2 + u_perp^T(tildeB_k-1 - tildeB)u_perp_F^2","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"We see that tildeB only appears in the last term, which should therefore be made zero. 
This then gives: ","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"tildeB = Ubeginbmatrix 1 0 0 u^T_perptildeB_k-1u_perp endbmatrix = uu^T + (mathbbI-uu^T)tildeB_k-1(mathbbI-uu^T)","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"If we now map back to the original coordinate system, the ideal solution for B_k is: ","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"B_k = (mathbbI - frac1y_k-1^Ts_k-1y_k-1s_k-1^T)B_k-1(mathbbI - frac1y_k-1^Ts_k-1s_k-1y_k-1^T) + frac1y_k-1^Ts_k-1y_k-1y_k-1^T","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"What we need in practice, however, is not B_k but its inverse H_k. This is because we need to find s_k-1 based on y_k-1. To get H_k based on the expression for B_k above we can use the Sherman-Morrison-Woodbury formula[3] to obtain:","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"[3]: The Sherman-Morrison-Woodbury formula states (A + UCV)^-1 = A^-1 - A^-1U(C^-1 + VA^-1U)^-1VA^-1.","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"H_k = H_k-1 - fracH_k-1y_k-1y_k-1^TH_k-1y_k-1^TH_k-1y_k-1 + fracs_k-1s_k-1^Ty_k-1^Ts_k-1","category":"page"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"TODO: Example where this works well!","category":"page"},{"location":"optimizers/bfgs_optimizer/#References","page":"BFGS Optimizer","title":"References","text":"","category":"section"},{"location":"optimizers/bfgs_optimizer/","page":"BFGS Optimizer","title":"BFGS Optimizer","text":"J. Nocedal and S. J. Wright. 
Numerical optimization (Springer Science+Business Media, 2006).\n\n\n\n","category":"page"},{"location":"optimizers/manifold_related/global_sections/#Global-Sections","page":"Global Sections","title":"Global Sections","text":"","category":"section"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"Global sections are needed for the generalization of Adam and other optimizers to homogeneous spaces. They are necessary to perform the two mappings represented by horizontal and vertical red lines in the section on the general optimizer framework.","category":"page"},{"location":"optimizers/manifold_related/global_sections/#Computing-the-global-section","page":"Global Sections","title":"Computing the global section","text":"","category":"section"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"In differential geometry a section is always associated with some bundle; in our case this bundle is piGtomathcalMAmapstoAE. A section is a mapping mathcalMtoG for which pi is a left inverse, i.e. picirclambda = mathrmid. 
","category":"page"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"For the Stiefel manifold St(n N)subsetmathbbR^Ntimesn we compute the global section the following way: ","category":"page"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"Start with an element YinSt(nN),\nDraw a random matrix AinmathbbR^Ntimes(N-n),\nRemove the subspace spanned by Y from the range of A: AgetsA-YY^TA\nCompute a QR decomposition of A and take as section lambda(Y) = Y Q_1N 1(N-n) = Y barlambda.","category":"page"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"It is easy to check that lambda(Y)inG=SO(N).","category":"page"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"In GeometricMachineLearning, GlobalSection takes an element of YinSt(nN)equivStiefelManifold{T} and returns an instance of GlobalSection{T, StiefelManifold{T}}. The application O(N)timesSt(nN)toSt(nN) is done with the functions apply_section! and apply_section.","category":"page"},{"location":"optimizers/manifold_related/global_sections/#Computing-the-global-tangent-space-representation-based-on-a-global-section","page":"Global Sections","title":"Computing the global tangent space representation based on a global section","text":"","category":"section"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"The output of the horizontal lift Omega is an element of mathfrakg^mathrmhorY. For this mapping Omega(Y BY) = B if Binmathfrakg^mathrmhorY, i.e. there is no information loss and no projection is performed. 
We can map the Binmathfrakg^mathrmhorY to mathfrakg^mathrmhor with Bmapstolambda(Y)^-1Blambda(Y).","category":"page"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"The function global_rep performs both mappings at once[1], i.e. it takes an instance of GlobalSection and an element of T_YSt(nN), and then returns an element of frakg^mathrmhorequivStiefelLieAlgHorMatrix.","category":"page"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"[1]: For computational reasons.","category":"page"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"In practice we use the following: ","category":"page"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"beginaligned\nlambda(Y)^TOmega(YDelta)lambda(Y) = lambda(Y)^T(mathbbI - frac12YY^T)DeltaY^T - YDelta^T(mathbbI - frac12YY^T)lambda(Y) \n = lambda(Y)^T(mathbbI - frac12YY^T)DeltaE^T - YDelta^T(lambda(Y) - frac12YE^T) \n = lambda(Y)^TDeltaE^T - frac12EY^TDeltaE^T - EDelta^Tlambda(Y) + frac12EDelta^TYE^T \n = beginbmatrix Y^TDeltaE^T barlambdaDeltaE^T endbmatrix - frac12EY^TDeltaE - beginbmatrix EDelta^TY EDelta^Tbarlambda endbmatrix + frac12EDelta^TYE^T \n = beginbmatrix Y^TDeltaE^T barlambdaDeltaE^T endbmatrix + EDelta^TYE^T - beginbmatrixEDelta^TY EDelta^Tbarlambda endbmatrix \n = EY^TDeltaE^T + EDelta^TYE^T - EDelta^TYE^T + beginbmatrix mathbbO barlambdaDeltaE^T endbmatrix - beginbmatrix mathbbO EDelta^Tbarlambda endbmatrix \n = EY^TDeltaE^T + beginbmatrix mathbbO barlambdaDeltaE^T endbmatrix - beginbmatrix mathbbO EDelta^Tbarlambda endbmatrix\nendaligned","category":"page"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"meaning that for an element of the horizontal component of the Lie 
algebra mathfrakg^mathrmhor we store A=Y^TDelta and B=barlambda^TDelta.","category":"page"},{"location":"optimizers/manifold_related/global_sections/#Optimization","page":"Global Sections","title":"Optimization","text":"","category":"section"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"The output of global_rep is then used for all the optimization steps.","category":"page"},{"location":"optimizers/manifold_related/global_sections/#References","page":"Global Sections","title":"References","text":"","category":"section"},{"location":"optimizers/manifold_related/global_sections/","page":"Global Sections","title":"Global Sections","text":"T. Frankel. The geometry of physics: an introduction (Cambridge university press, Cambridge, UK, 2011).\n\n\n\n","category":"page"},{"location":"manifolds/inverse_function_theorem/#The-Inverse-Function-Theorem","page":"The Inverse Function Theorem","title":"The Inverse Function Theorem","text":"","category":"section"},{"location":"manifolds/inverse_function_theorem/","page":"The Inverse Function Theorem","title":"The Inverse Function Theorem","text":"The inverse function theorem gives a sufficient condition on a vector-valued function to be invertible in a neighborhood of a specific point. This theorem is critical in developing a theory of manifolds and serves as a basis for the submersion theorem. Here we first state the theorem and then give a proof.","category":"page"},{"location":"manifolds/inverse_function_theorem/","page":"The Inverse Function Theorem","title":"The Inverse Function Theorem","text":"Theorem (Inverse function theorem): Consider a vector-valued differentiable function FmathbbR^NtomathbbR^N and assume its Jacobian is non-degenerate at a point xinmathbbR^N. Then there exists a neighborhood U that contains F(x) and on which F is invertible, i.e. existsHUtomathbbR^N s.t. 
forallyinUFcircH(y) = y and the inverse is differentiable.","category":"page"},{"location":"manifolds/inverse_function_theorem/","page":"The Inverse Function Theorem","title":"The Inverse Function Theorem","text":"Proof: Consider a mapping FmathbbR^NtomathbbR^N and assume its Jacobian has full rank at point x, i.e. detF(x)neq0. Now consider a ball around x whose radius r we do not yet fix and two points y and z in that ball: yzinB(xr). We further introduce the function G(y)=F(x)-F(x)y. By the mean value theorem we have G(z) - G(y)leqz-ysup_0t1G(x + t(y-x)) where cdot is the operator norm. Because tmapstoG(x+t(y-x)) is continuous and G(x)=0 there must exist an r s.t. foralltin01G(x +t(y-x)) - G(x)frac12F(x). F must then be injective on B(xr) (and hence invertible on F(B(xr))). Assume for the moment it is not. We can then find two distinct elements y zinB(xr) s.t. F(z) - F(y) = 0. This implies G(z) - G(y) = F(x)y - x which is a contradiction. The inverse (which we call HF(B(xr))toB(xr)) is also continuous by the last theorem presented in the section on basic topological concepts[1]. We still have to prove differentiability of the inverse. We now prove that the derivative of H at F(x) exists and that it is equal to F(x)^-1F(x). 
For this we denote F(x) by xi and let etainF(B(xr)) go to zero.","category":"page"},{"location":"manifolds/inverse_function_theorem/","page":"The Inverse Function Theorem","title":"The Inverse Function Theorem","text":"beginaligned\n eta^-1H(xi+eta) - H(xi) - F(x)^-1eta leq eta^-1F(x)^-1F(x)H(xi+eta)-F(x)H(xi) -eta \n leq eta^-1F(x)^-1F(H(xi+eta)) - G(H(xi+eta)) - F(H(xi)) + G(x) - eta \n = eta^-1F(x)^-1xi + eta - G(H(xi+eta)) - xi + G(x) - eta \n = eta^-1F(x)^-1G(H(xi+eta)) - G(H(xi))\nendaligned","category":"page"},{"location":"manifolds/inverse_function_theorem/","page":"The Inverse Function Theorem","title":"The Inverse Function Theorem","text":"and this goes to zero as eta goes to zero, because H is continuous and therefore H(xi+eta) goes to H(xi)=x and the expression on the right goes to zero as well.","category":"page"},{"location":"manifolds/inverse_function_theorem/","page":"The Inverse Function Theorem","title":"The Inverse Function Theorem","text":"[1]: In order to apply said theorem we must have a mapping from a compact space to a Hausdorff space. The image is clearly Hausdorff. For compactness, we could further restrict our ball to B(xr2), then G and its inverse are at least continuous on the closure of B(xr2) (or its image respectively) and hence also on B(xr2).","category":"page"},{"location":"manifolds/inverse_function_theorem/#References","page":"The Inverse Function Theorem","title":"References","text":"","category":"section"},{"location":"manifolds/inverse_function_theorem/","page":"The Inverse Function Theorem","title":"The Inverse Function Theorem","text":"S. Lang. Fundamentals of differential geometry. Vol. 
191 (Springer Science & Business Media, 2012).\n\n\n\n","category":"page"},{"location":"optimizers/manifold_related/geodesic/#Geodesic-Retraction","page":"Geodesic Retraction","title":"Geodesic Retraction","text":"","category":"section"},{"location":"optimizers/manifold_related/geodesic/","page":"Geodesic Retraction","title":"Geodesic Retraction","text":"General retractions are approximations of the exponential map. In GeometricMachineLearning we can, instead of using an approximation, solve the geodesic equation exactly (up to numerical error) by specifying Geodesic() as the argument of layers that have manifold weights. ","category":"page"},{"location":"optimizers/manifold_related/cayley/#The-Cayley-Retraction","page":"Cayley Retraction","title":"The Cayley Retraction","text":"","category":"section"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"The Cayley transformation is one of the most popular retractions. For several matrix Lie groups it is a mapping from the Lie algebra mathfrakg onto the Lie group G. The Cayley retraction reads: ","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":" mathrmCayley(C) = left(mathbbI -frac12Cright)^-1left(mathbbI +frac12Cright)","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"This is easily checked to be a retraction, i.e. mathrmCayley(mathbbO) = mathbbI and fracpartialpartialtmathrmCayley(tC) = C.","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"What we need in practice is not the computation of the Cayley transform of an arbitrary matrix, but the Cayley transform of an element of mathfrakg^mathrmhor, the global tangent space representation. 
","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"The elements of mathfrakg^mathrmhor can be written as: ","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"C = beginbmatrix\n A -B^T \n B mathbbO\nendbmatrix = beginbmatrix frac12A mathbbI B mathbbO endbmatrix beginbmatrix mathbbI mathbbO frac12A -B^T endbmatrix","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"where the second expression exploits the sparse structure of the array, i.e. it is a multiplication of a Ntimes2n with a 2ntimesN matrix. We can hence use the Sherman-Morrison-Woodbury formula to obtain:","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"(mathbbI - frac12UV)^-1 = mathbbI + frac12U(mathbbI - frac12VU)^-1V","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"So what we have to invert is the term ","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"mathbbI - frac12beginbmatrix mathbbI mathbbO frac12A -B^T endbmatrixbeginbmatrix frac12A mathbbI B mathbbO endbmatrix = \nbeginbmatrix mathbbI - frac14A - frac12mathbbI frac12B^TB - frac18A^2 mathbbI - frac14A endbmatrix","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"The whole Cayley transform is then: ","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"left(mathbbI + frac12beginbmatrix frac12A mathbbI B mathbbO endbmatrix beginbmatrix mathbbI - frac14A - frac12mathbbI frac12B^TB - frac18A^2 mathbbI - 
frac14A endbmatrix^-1 beginbmatrix mathbbI mathbbO frac12A -B^T endbmatrix right)left( E + frac12beginbmatrix frac12A mathbbI B mathbbO endbmatrix beginbmatrix mathbbI frac12A endbmatrix right) = \nE + frac12beginbmatrix frac12A mathbbI B mathbbO endbmatrixleft(\n beginbmatrix mathbbI frac12A endbmatrix + \n beginbmatrix mathbbI - frac14A - frac12mathbbI frac12B^TB - frac18A^2 mathbbI - frac14A endbmatrix^-1left(\n beginbmatrix mathbbI frac12A endbmatrix + \n beginbmatrix frac12A frac14A^2 - frac12B^TB endbmatrix\n right)\n right)","category":"page"},{"location":"optimizers/manifold_related/cayley/","page":"Cayley Retraction","title":"Cayley Retraction","text":"Note that for computational reasons we compute mathrmCayley(C)E instead of just the Cayley transform (see the section on retractions).","category":"page"},{"location":"tutorials/sympnet_tutorial/#SympNets-with-GeometricMachineLearning.jl","page":"Sympnets","title":"SympNets with GeometricMachineLearning.jl","text":"","category":"section"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"This page serves as a short introduction to using SympNets with GeometricMachineLearning.jl. For the general theory see the theory section.","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"With GeometricMachineLearning.jl one can easily implement SympNets. 
The steps are as follows:","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"Specify the architecture with the functions GSympNet and LASympNet,\nSpecify the type and the backend with NeuralNetwork,\nPick an optimizer for training the network,\nTrain the neural networks!","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"We discuss these points in some detail:","category":"page"},{"location":"tutorials/sympnet_tutorial/#Specifying-the-architecture","page":"Sympnets","title":"Specifying the architecture","text":"","category":"section"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"To call an LA-SympNet, one needs to write","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"lasympnet = LASympNet(dim; depth=5, nhidden=1, activation=tanh, init_upper_linear=true, init_upper_act=true) ","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"LASympNet takes one obligatory argument:","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"dim : the dimension of the phase space (i.e. an integer) or optionally an instance of DataLoader. This latter option will be used below.","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"and several keyword arguments:","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"depth : the depth of all the linear layers. The default value is 5 (if depth>5, depth is set to 5). 
See the theory section for more details; there depth was called n.\nnhidden : the number of pairs of linear and activation layers, with default value 1 (i.e. the LA-SympNet is a composition of a linear layer, an activation layer and then again a linear layer). \nactivation : the activation function for all the activation layers, with default tanh.\ninit_upper_linear : a boolean that indicates whether the first linear layer changes q first. By default this is true.\ninit_upper_act : a boolean that indicates whether the first activation layer changes q first. By default this is true.","category":"page"},{"location":"tutorials/sympnet_tutorial/#G-SympNet","page":"Sympnets","title":"G-SympNet","text":"","category":"section"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"To call a G-SympNet, one needs to write","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"gsympnet = GSympNet(dim; upscaling_dimension=2*dim, nhidden=2, activation=tanh, init_upper=true) ","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"GSympNet takes one obligatory argument:","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"dim : the dimension of the phase space (i.e. an integer) or optionally an instance of DataLoader. This latter option will be used below.","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"and several keyword arguments:","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"upscaling_dimension: The first dimension of the matrix with which the input is multiplied. 
In the theory section this matrix is called K and the upscaling dimension is called m.\nnhidden: the number of gradient layers with default value set to 2.\nactivation : the activation function for all the activation layers with default set to tanh.\ninit_upper : a boolean that indicates whether the first gradient layer changes q first. By default this is true.","category":"page"},{"location":"tutorials/sympnet_tutorial/#Loss-function","page":"Sympnets","title":"Loss function","text":"","category":"section"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"The loss function described in the theory section is the default choice used in GeometricMachineLearning.jl for training SympNets.","category":"page"},{"location":"tutorials/sympnet_tutorial/#Data-Structures-in-GeometricMachineLearning.jl","page":"Sympnets","title":"Data Structures in GeometricMachineLearning.jl","text":"","category":"section"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"import Images, Plots # hide\nif Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nelse # hide\n Plots.plot(Images.load(\"../tikz/structs_visualization.png\"), axis=([], false)) # hide\nend # hide","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"if Main.output_type == :html # hide \n HTML(\"\"\"\"\"\") # hide\nend # hide","category":"page"},{"location":"tutorials/sympnet_tutorial/#Examples","page":"Sympnets","title":"Examples","text":"","category":"section"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"Let us see how to use it on several examples.","category":"page"},{"location":"tutorials/sympnet_tutorial/#Example-of-a-pendulum-with-G-SympNet","page":"Sympnets","title":"Example of a pendulum with G-SympNet","text":"","category":"section"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"Let us begin 
with a simple example, the pendulum system, the Hamiltonian of which is ","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"H(qp)inmathbbR^2 mapsto frac12p^2-cos(q) in mathbbR","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"Here we generate pendulum data with the script GeometricMachineLearning/scripts/pendulum.jl:","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"using GeometricMachineLearning\n\n# load script\ninclude(\"../../../scripts/pendulum.jl\")\n# specify the data type\ntype = Float16 \n# get data \nqp_data = GeometricMachineLearning.apply_toNT(a -> type.(a), pendulum_data((q=[0.], p=[1.]); tspan=(0.,100.)))\n# call the DataLoader\ndl = DataLoader(qp_data)\n# this last line is a hack so as to not display the output # hide\nnothing # hide","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"Next we specify the architectures. 
GeometricMachineLearning.jl provides useful defaults for all parameters although they can be specified manually (which is done in the following):","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"# layer dimension for gradient module \nconst upscaling_dimension = 2\n# hidden layers\nconst nhidden = 1\n# activation function\nconst activation = tanh\n\n# calling G-SympNet architecture \ngsympnet = GSympNet(dl, upscaling_dimension=upscaling_dimension, nhidden=nhidden, activation=activation)\n\n# calling LA-SympNet architecture \nlasympnet = LASympNet(dl, nhidden=nhidden, activation=activation)\n\n# specify the backend\nbackend = CPU()\n\n# initialize the networks\nla_nn = NeuralNetwork(lasympnet, backend, type) \ng_nn = NeuralNetwork(gsympnet, backend, type)\nnothing # hide","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"If we want to obtain information on the number of parameters in a neural network, we can do that very simply with the function parameterlength. For the LASympNet:","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"parameterlength(la_nn.model)","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"And for the GSympNet:","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"parameterlength(g_nn.model)","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"Remark: We can also specify whether we would like to start with a layer that changes the q-component or one that changes the p-component. 
This can be done via the keywords init_upper for GSympNet, and init_upper_linear and init_upper_act for LASympNet.","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"We have to define an optimizer which will be used in the training of the SympNet. For more details on optimizers, please see the corresponding documentation. In this example we use Adam:","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"# set up optimizer; for this we first need to specify the optimization method\nopt_method = AdamOptimizer(; T=type)\nla_opt = Optimizer(opt_method, la_nn)\ng_opt = Optimizer(opt_method, g_nn)\nnothing # hide","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"We can now perform the training of the neural networks. The syntax is as follows:","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"# number of training epochs\nconst nepochs = 300\n# batch size used to compute the gradient of the loss function with respect to the parameters of the neural networks\nconst batch_size = 100\n\nbatch = Batch(batch_size)\n\n# perform training (returns array that contains the total loss for each training step)\ng_loss_array = g_opt(g_nn, dl, batch, nepochs)\nla_loss_array = la_opt(la_nn, dl, batch, nepochs)\nnothing # hide","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"We can also plot the training errors against the epoch (here the y-axis is in log-scale):","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"using Plots\np1 = plot(g_loss_array, xlabel=\"Epoch\", ylabel=\"Training error\", label=\"G-SympNet\", color=3, yaxis=:log)\nplot!(p1, la_loss_array, label=\"LA-SympNet\", 
color=2)","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"The train function will change the parameters of the neural networks and returns a vector containing the evolution of the value of the loss function during the training. Default values for the arguments ntraining and batch_size are 1000 and 10, respectively.","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"The training data data_q and data_p must be matrices in mathbbR^ntimes d, where n is the number of data points and d is half the dimension of the system, i.e. data_q[i,j] is q_j(t_i), where (t_1 ldots t_n) are the corresponding times of the training data.","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"Then we can make predictions. Let's compare the initial data with a prediction starting from the same phase space point using the provided function iterate:","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"ics = (q=qp_data.q[:,1], p=qp_data.p[:,1])\n\nsteps_to_plot = 200\n\n#predictions\nla_trajectory = iterate(la_nn, ics; n_points = steps_to_plot)\ng_trajectory = iterate(g_nn, ics; n_points = steps_to_plot)\n\nusing Plots\np2 = plot(qp_data.q'[1:steps_to_plot], qp_data.p'[1:steps_to_plot], label=\"training data\")\nplot!(p2, la_trajectory.q', la_trajectory.p', label=\"LA Sympnet\")\nplot!(p2, g_trajectory.q', g_trajectory.p', label=\"G Sympnet\")","category":"page"},{"location":"tutorials/sympnet_tutorial/","page":"Sympnets","title":"Sympnets","text":"We see that GSympNet gives an almost perfect match on the training data whereas LASympNet cannot even properly replicate the training data. 
It also takes longer to train LASympNet.","category":"page"},{"location":"architectures/sympnet/#SympNet","page":"SympNet","title":"SympNet","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"This document discusses the SympNet architecture and its implementation in GeometricMachineLearning.jl.","category":"page"},{"location":"architectures/sympnet/#Quick-overview-of-the-theory-of-SympNets","page":"SympNet","title":"Quick overview of the theory of SympNets","text":"","category":"section"},{"location":"architectures/sympnet/#Principle","page":"SympNet","title":"Principle","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"SympNets (see [1] for the eponymous paper) are a type of neural network that can model the trajectory of a Hamiltonian system in phase space. Take (q^Tp^T)^T=(q_1ldotsq_dp_1ldotsp_d)^Tin mathbbR^2d as the coordinates in phase space, where q=(q_1 ldots q_d)^Tin mathbbR^d is referred to as the position and p=(p_1 ldots p_d)^Tin mathbbR^d the momentum. Given a point (q^Tp^T)^T in mathbbR^2d the SympNet aims to compute the next position ((q)^T(p)^T)^T and thus predicts the trajectory while preserving the symplectic structure of the system. SympNets enforce symplecticity strongly, meaning that this property is hard-coded into the network architecture. The layers are reminiscent of traditional neural network feedforward layers, but have a strong restriction imposed on them in order to be symplectic.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"SympNets can be viewed as a \"symplectic integrator\" (see [2] and [3]). 
Their goal is to predict, based on an initial condition ((q^(0))^T(p^(0))^T)^T, a sequence of points in phase space that fit the training data as well as possible:","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"beginpmatrix q^(0) p^(0) endpmatrix cdots beginpmatrix tildeq^(1) tildep^(1) endpmatrix cdots beginpmatrix tildeq^(n) tildep^(n) endpmatrix","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"The tilde in the above equation indicates predicted data. The time step between predictions is not a parameter we can choose but is related to the temporal frequency of the training data. This means that if data is recorded in an interval of e.g. 0.1 seconds, then this will be the time step of our integrator.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"import Images, Plots # hide\nif Main.output_type == :html # hide\n Docs.HTML(\"\"\"\"\"\") # hide\nelse # hide\n Plots.plot(Images.load(\"../tikz/sympnet_architecture.png\"), axis=([], false)) # hide\nend # hide","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"if Main.output_type == :html # hide\n Docs.HTML(\"\"\"\"\"\") # hide\nend # hide","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"There are two types of SympNet architectures: LA-SympNets and G-SympNets. ","category":"page"},{"location":"architectures/sympnet/#LA-SympNet","page":"SympNet","title":"LA-SympNet","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"The first type of SympNets, LA-SympNets, are obtained from composing two types of layers: symplectic linear layers and symplectic activation layers. 
For a given integer n, a symplectic linear layer is defined by","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"mathcalL^nq\nbeginpmatrix\n q \n p \nendpmatrix\n = \nbeginpmatrix \n I S^n0 \n 0S^n I \nendpmatrix\n cdots \nbeginpmatrix \n I 0 \n S^2 I \nendpmatrix\nbeginpmatrix \n I S^1 \n 0 I \nendpmatrix\nbeginpmatrix\n q \n p \nendpmatrix\n+ b ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"or ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"mathcalL^np\nbeginpmatrix q \n p endpmatrix = \n beginpmatrix \n I 0S^n \n S^n0 I\n endpmatrix cdots \n beginpmatrix \n I S^2 \n 0 I\n endpmatrix\n beginpmatrix \n I 0 \n S^1 I\n endpmatrix\n beginpmatrix q \n p endpmatrix\n + b ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"The superscripts q and p indicate whether the q or the p part is changed. The learnable parameters are the symmetric matrices S^iinmathbbR^dtimes d and the bias binmathbbR^2d. The integer n is the width of the symplectic linear layer. It can be shown that five of these layers, i.e. ngeq5, can represent any linear symplectic map (see [4]), so n need not be larger than five. 
We denote the set of symplectic linear layers by mathcalM^L.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"The second type of layer needed for LA-SympNets are so-called activation layers:","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":" mathcalA^q beginpmatrix q \n p endpmatrix = \n beginbmatrix \n Ihatsigma^a \n 0I\n endbmatrix beginpmatrix q \n p endpmatrix =\n beginpmatrix \n mathrmdiag(a)sigma(p)+q \n p\n endpmatrix","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"and","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":" mathcalA^p beginpmatrix q \n p endpmatrix = \n beginbmatrix \n I0 \n hatsigma^aI\n endbmatrix beginpmatrix q \n p endpmatrix\n =\n beginpmatrix \n q \n mathrmdiag(a)sigma(q)+p\n endpmatrix","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"The activation function sigma can be any nonlinearity (on which minor restrictions are imposed below). Here the scaling vector ainmathbbR^d constitutes the learnable weights. We denote the set of symplectic activation layers by mathcalM^A. ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"An LA-SympNet is a function of the form Psi=l_k circ a_k circ l_k-1 circ cdots circ a_1 circ l_0 where (l_i)_0leq ileq k subset (mathcalM^L)^k+1 and (a_i)_1leq ileq k subset (mathcalM^A)^k. 
We will refer to k as the number of hidden layers of the SympNet[1] and the number n above as the depth of the linear layer.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"[1]: Note that if k=0 then the LA-SympNet consists of only one linear layer.","category":"page"},{"location":"architectures/sympnet/#G-SympNets","page":"SympNet","title":"G-SympNets","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"G-SympNets are an alternative to LA-SympNets. They are built with only one kind of layer, called a gradient layer. For a given activation function sigma and an integer mgeq d, a gradient layer is a symplectic map from mathbbR^2d to mathbbR^2d defined by","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":" mathcalG^up beginpmatrix q \n p endpmatrix = \n beginbmatrix \n Ihatsigma^Kab \n 0I\n endbmatrix beginpmatrix q \n p endpmatrix =\n beginpmatrix \n K^T mathrmdiag(a)sigma(Kp+b)+q \n p\n endpmatrix","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"or","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":" mathcalG^low beginpmatrix q \n p endpmatrix = \n beginbmatrix \n I0 \n hatsigma^KabI\n endbmatrix beginpmatrix q \n p endpmatrix\n =\n beginpmatrix \n q \n K^T mathrmdiag(a)sigma(Kq+b)+p\n endpmatrix","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"The parameters of this layer are the scaling matrix KinmathbbR^mtimes d, the bias binmathbbR^m and the scaling vector ainmathbbR^m. The name \"gradient layer\" has its origin in the fact that the expression K^Tmathrmdiag(a)sigma(Kq+b)_i = sum_jk_jia_jsigma(sum_ellk_jellq_ell+b_j) is the gradient of a function sum_ja_jtildesigma(sum_ellk_jellq_ell+b_j), where tildesigma is the antiderivative of sigma. 
The first dimension of K we refer to as the upscaling dimension.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"If we denote by mathcalM^G the set of gradient layers, a G-SympNet is a function of the form Psi=g_k circ g_k-1 circ cdots circ g_0 where (g_i)_0leq ileq k subset (mathcalM^G)^k. The index k is again the number of hidden layers.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Further note here the different roles played by round and square brackets: the latter indicates a nonlinear operation as opposed to a regular vector or matrix. ","category":"page"},{"location":"architectures/sympnet/#Universal-approximation-theorems","page":"SympNet","title":"Universal approximation theorems","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"In order to state the universal approximation theorem for both architectures we first need a few definitions:","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Let U be an open set of mathbbR^2d, and let us denote by mathcalSP^r(U) the set of C^r smooth symplectic maps on U. 
We now define a topology on C^r(K mathbbR^n), the set of C^r-smooth maps from a compact set KsubsetmathbbR^n to mathbbR^n through the norm","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"f_C^r(KmathbbR^n) = undersetalphaleq rsum underset1leq i leq nmaxundersetxin Ksup D^alpha f_i(x)","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"where the differential operator D^alpha is defined by ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"D^alpha f = fracpartial^alpha fpartial x_1^alpha_1x_n^alpha_n","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"with alpha = alpha_1 ++ alpha_n. ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Definition sigma is r-finite if sigmain C^r(mathbbRmathbbR) and int D^rsigma(x)dx +infty.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Definition Let mnrin mathbbN with mn0 be given, U an open set of mathbbR^m, and IJsubset C^r(UmathbbR^n). 
We say J is r-uniformly dense on compacta in I if J subset I and for any fin I, epsilon0, and any compact Ksubset U, there exists gin J such that f-g_C^r(KmathbbR^n) epsilon.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"We can now state the universal approximation theorems:","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Theorem (Approximation theorem for LA-SympNet) For any positive integer r0 and open set Usubset mathbbR^2d, the set of LA-SympNets is r-uniformly dense on compacta in SP^r(U) if the activation function sigma is r-finite.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Theorem (Approximation theorem for G-SympNet) For any positive integer r0 and open set Usubset mathbbR^2d, the set of G-SympNets is r-uniformly dense on compacta in SP^r(U) if the activation function sigma is r-finite.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"There are many r-finite activation functions commonly used in neural networks, for example:","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"sigmoid sigma(x)=frac11+e^-x for any positive integer r, \ntanh tanh(x)=frace^x-e^-xe^x+e^-x for any positive integer r. ","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"The universal approximation theorems state that we can, in principle, get arbitrarily close to any symplectomorphism defined on mathbbR^2d. But this does not tell us anything about how to optimize the network. This can be done with any common neural network optimizer, and these optimizers always rely on a corresponding loss function. 
","category":"page"},{"location":"architectures/sympnet/#Loss-function","page":"SympNet","title":"Loss function","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"To train the SympNet, one need data along a trajectory such that the model is trained to perform an integration. These data are (QP) where Qij (respectively Pij) is the real number q_j(t_i) (respectively pij) which is the j-th coordinates of the generalized position (respectively momentum) at the i-th time step. One also need a loss function defined as :","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"Loss(QP) = undersetisum d(Phi(Qi-Pi-) Qi- Pi-^T)","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"where d is a distance on mathbbR^d.","category":"page"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"See the tutorial section for an introduction into using SympNets with GeometricMachineLearning.jl.","category":"page"},{"location":"architectures/sympnet/#References","page":"SympNet","title":"References","text":"","category":"section"},{"location":"architectures/sympnet/","page":"SympNet","title":"SympNet","text":"P. Jin, Z. Zhang, A. Zhu, Y. Tang and G. E. Karniadakis. SympNets: Intrinsic structure-preserving symplectic networks for identifying Hamiltonian systems. Neural Networks 132, 166–179 (2020).\n\n\n\n","category":"page"},{"location":"Optimizer/#Optimizer","page":"Optimizers","title":"Optimizer","text":"","category":"section"},{"location":"Optimizer/","page":"Optimizers","title":"Optimizers","text":"In order to generalize neural network optimizers to homogeneous spaces, a class of manifolds we often encounter in machine learning, we have to find a global tangent space representation which we call mathfrakg^mathrmhor here. 
","category":"page"},{"location":"Optimizer/","page":"Optimizers","title":"Optimizers","text":"Starting from an element of the tangent space T_YmathcalM[1], we need to perform two mappings to arrive at mathfrakg^mathrmhor, which we refer to by Omega and a red horizontal arrow:","category":"page"},{"location":"Optimizer/","page":"Optimizers","title":"Optimizers","text":"[1]: In practice this is obtained by first using an AD routine on a loss function L, and then computing the Riemannian gradient based on this. See the section of the Stiefel manifold for an example of this.","category":"page"},{"location":"Optimizer/","page":"Optimizers","title":"Optimizers","text":"import Images, Plots # hide\nif Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nelse # hide\n Plots.plot(Images.load(\"tikz/general_optimization_with_boundary.png\"), axis=([], false)) # hide\nend # hide","category":"page"},{"location":"Optimizer/","page":"Optimizers","title":"Optimizers","text":"if Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nend # hide","category":"page"},{"location":"Optimizer/","page":"Optimizers","title":"Optimizers","text":"Here the mapping Omega is a horizontal lift from the tangent space onto the horizontal component of the Lie algebra at Y. ","category":"page"},{"location":"Optimizer/","page":"Optimizers","title":"Optimizers","text":"The red line maps the horizontal component at Y, i.e. mathfrakg^mathrmhorY, to the horizontal component at mathfrakg^mathrmhor.","category":"page"},{"location":"Optimizer/","page":"Optimizers","title":"Optimizers","text":"The mathrmcache stores information about previous optimization steps and is dependent on the optimizer. The elements of the mathrmcache are also in mathfrakg^mathrmhor. Based on this the optimer (Adam in this case) computes a final velocity, which is the input of a retraction. 
Because this update is done for mathfrakg^mathrmhorequivT_YmathcalM, we still need to perform a mapping, called apply_section here, that then finally updates the network parameters. The two red lines are described in global sections.","category":"page"},{"location":"Optimizer/#References","page":"Optimizers","title":"References","text":"","category":"section"},{"location":"Optimizer/","page":"Optimizers","title":"Optimizers","text":"B. Brantner. Generalizing Adam To Manifolds For Efficiently Training Transformers, arXiv preprint arXiv:2305.16901 (2023).\n\n\n\n","category":"page"},{"location":"","page":"Home","title":"Home","text":"CurrentModule = GeometricMachineLearning","category":"page"},{"location":"#Geometric-Machine-Learning","page":"Home","title":"Geometric Machine Learning","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"GeometricMachineLearning.jl implements various scientific machine learning models that aim at learning dynamical systems with geometric structure, such as Hamiltonian (symplectic) or Lagrangian (variational) systems.","category":"page"},{"location":"#Installation","page":"Home","title":"Installation","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"GeometricMachineLearning.jl and all of its dependencies can be installed via the Julia REPL by typing ","category":"page"},{"location":"","page":"Home","title":"Home","text":"]add GeometricMachineLearning","category":"page"},{"location":"#Architectures","page":"Home","title":"Architectures","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"There are several architectures tailored towards problems in scientific machine learning implemented in GeometricMachineLearning.","category":"page"},{"location":"","page":"Home","title":"Home","text":"Pages = [\n 
\"architectures/sympnet.md\",\n]","category":"page"},{"location":"#Manifolds","page":"Home","title":"Manifolds","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"GeometricMachineLearning supports putting neural network weights on manifolds. These include:","category":"page"},{"location":"","page":"Home","title":"Home","text":"Pages = [\n \"manifolds/grassmann_manifold.md\",\n \"manifolds/stiefel_manifold.md\",\n]","category":"page"},{"location":"#Special-Neural-Network-Layer","page":"Home","title":"Special Neural Network Layer","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"Many layers have been adapted in order to be used for problems in scientific machine learning. Including:","category":"page"},{"location":"","page":"Home","title":"Home","text":"Pages = [\n \"layers/attention_layer.md\",\n]","category":"page"},{"location":"#Tutorials","page":"Home","title":"Tutorials","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"Tutorials for using GeometricMachineLearning are: ","category":"page"},{"location":"","page":"Home","title":"Home","text":"Pages = [\n \"tutorials/sympnet_tutorial.md\",\n \"tutorials/mnist_tutorial.md\",\n]","category":"page"},{"location":"#Reduced-Order-Modeling","page":"Home","title":"Reduced Order Modeling","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"A short description of the key concepts in reduced order modeling (where GeometricMachineLearning can be used) are in:","category":"page"},{"location":"","page":"Home","title":"Home","text":"Pages = [\n \"reduced_order_modeling/autoencoder.md\",\n \"reduced_order_modeling/symplectic_autoencoder.md\",\n \"reduced_order_modeling/kolmogorov_n_width.md\",\n]","category":"page"},{"location":"data_loader/snapshot_matrix/#Snapshot-matrix","page":"Snapshot matrix & tensor","title":"Snapshot 
matrix","text":"","category":"section"},{"location":"data_loader/snapshot_matrix/","page":"Snapshot matrix & tensor","title":"Snapshot matrix & tensor","text":"The snapshot matrix stores solutions of the high-dimensional ODE (obtained from discretizing a PDE). This is then used to construct reduced bases in a data-driven way. So (for a single parameter[1]) the snapshot matrix takes the following form: ","category":"page"},{"location":"data_loader/snapshot_matrix/","page":"Snapshot matrix & tensor","title":"Snapshot matrix & tensor","text":"[1]: If we deal with a parametrized PDE then there are two stages at which the snapshot matrix has to be processed: the offline stage and the online stage. ","category":"page"},{"location":"data_loader/snapshot_matrix/","page":"Snapshot matrix & tensor","title":"Snapshot matrix & tensor","text":"M = leftbeginarraycccc\nhatu_1(t_0) hatu_1(t_1) quadldotsquad hatu_1(t_f) \nhatu_2(t_0) hatu_2(t_1) ldots hatu_2(t_f) \nhatu_3(t_0) hatu_3(t_1) ldots hatu_3(t_f) \nldots ldots ldots ldots \nhatu_2N(t_0) hatu_2N(t_1) ldots hatu_2N(t_f) \nendarrayright","category":"page"},{"location":"data_loader/snapshot_matrix/","page":"Snapshot matrix & tensor","title":"Snapshot matrix & tensor","text":"In the above example we store a matrix whose first axis is the system dimension (i.e. a state is an element of mathbbR^2n) and the second dimension gives the time step. ","category":"page"},{"location":"data_loader/snapshot_matrix/","page":"Snapshot matrix & tensor","title":"Snapshot matrix & tensor","text":"The starting point for using the snapshot matrix as data for a machine learning model is that all the columns of M live on a lower-dimensional solution manifold and we can use techniques such as POD and autoencoders to find this solution manifold. We also note that the second axis of M does not necessarily indicate time but can also represent various parameters (including initial conditions). 
The second axis in the DataLoader struct is therefore saved in the field n_params.","category":"page"},{"location":"data_loader/snapshot_matrix/#Snapshot-tensor","page":"Snapshot matrix & tensor","title":"Snapshot tensor","text":"","category":"section"},{"location":"data_loader/snapshot_matrix/","page":"Snapshot matrix & tensor","title":"Snapshot matrix & tensor","text":"The snapshot tensor fulfills the same role as the snapshot matrix but has a third axis that describes different initial parameters (such as different initial conditions). ","category":"page"},{"location":"data_loader/snapshot_matrix/","page":"Snapshot matrix & tensor","title":"Snapshot matrix & tensor","text":"import Images, Plots # hide\nif Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nelse # hide\n Plots.plot(Images.load(\"../tikz/tensor.png\"), axis=([], false)) # hide\nend # hide","category":"page"},{"location":"data_loader/snapshot_matrix/","page":"Snapshot matrix & tensor","title":"Snapshot matrix & tensor","text":"if Main.output_type == :html # hide\n HTML(\"\"\"\"\"\") # hide\nend # hide","category":"page"},{"location":"data_loader/snapshot_matrix/","page":"Snapshot matrix & tensor","title":"Snapshot matrix & tensor","text":"When drawing training samples from the snapshot tensor we also need to specify a sequence length (as an argument to the Batch struct). When sampling a batch from the snapshot tensor we sample over the starting point of the time interval (which is of length seq_length) and the third axis of the tensor (the parameters). The total number of batches in this case is lceilmathtt(dlinput_time_steps - batchseq_length) * dln_params batchbatch_sizerceil. 
","category":"page"},{"location":"arrays/grassmann_lie_alg_hor_matrix/#The-horizontal-component-of-the-Lie-algebra-\\mathfrak{g}-for-the-Grassmann-manifold","page":"Grassmann Global Tangent Space","title":"The horizontal component of the Lie algebra mathfrakg for the Grassmann manifold","text":"","category":"section"},{"location":"arrays/grassmann_lie_alg_hor_matrix/#Tangent-space-to-the-element-\\mathcal{E}","page":"Grassmann Global Tangent Space","title":"Tangent space to the element mathcalE","text":"","category":"section"},{"location":"arrays/grassmann_lie_alg_hor_matrix/","page":"Grassmann Global Tangent Space","title":"Grassmann Global Tangent Space","text":"Consider the tangent space to the distinct element mathcalE=mathrmspan(E)inGr(nN), where E is again:","category":"page"},{"location":"arrays/grassmann_lie_alg_hor_matrix/","page":"Grassmann Global Tangent Space","title":"Grassmann Global Tangent Space","text":"E = beginbmatrix\nmathbbI_n \nmathbbO\nendbmatrix","category":"page"},{"location":"arrays/grassmann_lie_alg_hor_matrix/","page":"Grassmann Global Tangent Space","title":"Grassmann Global Tangent Space","text":"The tangent tangent space T_mathcalEGr(nN) can be represented through matrices: ","category":"page"},{"location":"arrays/grassmann_lie_alg_hor_matrix/","page":"Grassmann Global Tangent Space","title":"Grassmann Global Tangent Space","text":"beginpmatrix\n 0 cdots 0 \n cdots cdots cdots \n 0 cdots 0 \n a_11 cdots a_1n \n cdots cdots cdots \n a_(N-n)1 cdots a_(N-n)n\nendpmatrix","category":"page"},{"location":"arrays/grassmann_lie_alg_hor_matrix/","page":"Grassmann Global Tangent Space","title":"Grassmann Global Tangent Space","text":"where we have used the identification T_mathcalEGr(nN)toT_EmathcalS_E that was discussed in the section on the Grassmann manifold. The Grassmann manifold can also be seen as the Stiefel manifold modulo an equivalence class. 
This leads to the following (which is used for optimization):","category":"page"},{"location":"arrays/grassmann_lie_alg_hor_matrix/","page":"Grassmann Global Tangent Space","title":"Grassmann Global Tangent Space","text":"mathfrakg^mathrmhor = mathfrakg^mathrmhormathcalE = leftbeginpmatrix 0 -B^T B 0 endpmatrix textB arbitraryright","category":"page"},{"location":"arrays/grassmann_lie_alg_hor_matrix/","page":"Grassmann Global Tangent Space","title":"Grassmann Global Tangent Space","text":"This is equivalent to the horizontal component of mathfrakg for the Stiefel manifold for the case when A is zero. This is a reflection of the rotational invariance of the Grassmann manifold: the skew-symmetric matrices A are connected to the group of rotations O(n) which is factored out in the Grassmann manifold Gr(nN)simeqSt(nN)O(n).","category":"page"}] } diff --git a/latest/tutorials/grassmann_layer/bb166470.svg b/latest/tutorials/grassmann_layer/8f33d68b.svg similarity index 66% rename from latest/tutorials/grassmann_layer/bb166470.svg rename to latest/tutorials/grassmann_layer/8f33d68b.svg index 9f8b8e7a5..83863934c 100644 --- a/latest/tutorials/grassmann_layer/bb166470.svg +++ b/latest/tutorials/grassmann_layer/8f33d68b.svg @@ -1,992 +1,992 @@ - + - + - + - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
diff --git a/latest/tutorials/grassmann_layer/8d219c66.svg b/latest/tutorials/grassmann_layer/bb8e5585.svg similarity index 85% rename from latest/tutorials/grassmann_layer/8d219c66.svg rename to latest/tutorials/grassmann_layer/bb8e5585.svg index fbeb22422..6a0d52723 100644 --- a/latest/tutorials/grassmann_layer/8d219c66.svg +++ b/latest/tutorials/grassmann_layer/bb8e5585.svg @@ -1,44 +1,44 @@ diff --git a/latest/tutorials/grassmann_layer/7125e5d0.svg b/latest/tutorials/grassmann_layer/e401c365.svg similarity index 65% rename from latest/tutorials/grassmann_layer/7125e5d0.svg rename to latest/tutorials/grassmann_layer/e401c365.svg index 45bb45598..8181871d0 100644 --- 
a/latest/tutorials/grassmann_layer/7125e5d0.svg +++ b/latest/tutorials/grassmann_layer/e401c365.svgdiff --git a/latest/tutorials/grassmann_layer/index.html b/latest/tutorials/grassmann_layer/index.html index 964155927..b48ae1da1 100644 --- a/latest/tutorials/grassmann_layer/index.html +++ b/latest/tutorials/grassmann_layer/index.html @@ -1,8 +1,8 @@ -Grassmann manifold · GeometricMachineLearning.jl

Example of a Neural Network with a Grassmann Layer

Here we show how to implement a neural network that contains a layer whose weight is an element of the Grassmann manifold and where this might be useful.

To see where this is useful, consider the following scenario:

Problem statement

We are given data in a big space $\mathcal{D}=[d_i]_{i\in\mathcal{I}}\subset\mathbb{R}^N$ and know these data live on an $n$-dimensional[1] submanifold[2] in $\mathbb{R}^N$. Based on these data we would now like to generate new samples from the distributions that produced our original data. This is where the Grassmann manifold is useful: each element $V$ of the Grassmann manifold is an $n$-dimensional subspace of $\mathbb{R}^N$ from which we can easily sample. We can then construct a (bijective) mapping from this space $V$ onto a space that contains our data points $\mathcal{D}$.

Example

Consider the following toy example: We want to sample from the graph of the (scaled) Rosenbrock function $f(x,y) = ((1 - x)^2 + 100(y - x^2)^2)/1000$ while pretending we do not know the function.

rosenbrock(x::Vector) = ((1.0 - x[1]) ^ 2 + 100.0 * (x[2] - x[1] ^ 2) ^ 2) / 1000
+Grassmann manifold · GeometricMachineLearning.jl

Example of a Neural Network with a Grassmann Layer

Here we show how to implement a neural network that contains a layer whose weight is an element of the Grassmann manifold and where this might be useful.

To see where this is useful, consider the following scenario:

Problem statement

We are given data in a big space $\mathcal{D}=[d_i]_{i\in\mathcal{I}}\subset\mathbb{R}^N$ and know these data live on an $n$-dimensional[1] submanifold[2] in $\mathbb{R}^N$. Based on these data we would now like to generate new samples from the distributions that produced our original data. This is where the Grassmann manifold is useful: each element $V$ of the Grassmann manifold is an $n$-dimensional subspace of $\mathbb{R}^N$ from which we can easily sample. We can then construct a (bijective) mapping from this space $V$ onto a space that contains our data points $\mathcal{D}$.
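
The sampling step can be illustrated with a minimal sketch. It assumes (this is not taken from the library) that an element of the Grassmann manifold is represented by a matrix `Y` with orthonormal columns spanning the subspace; the names `N`, `n`, `Y`, `samples` and `P` are purely illustrative:

```julia
using LinearAlgebra, Random

Random.seed!(0)
N, n = 3, 2

# Orthonormal basis of a random n-dimensional subspace of R^N
# (one representative of an element of the Grassmann manifold).
Y = Matrix(qr(randn(N, n)).Q)

# Gaussian coefficients with respect to this basis give samples in the subspace.
samples = Y * randn(n, 5)

# The orthogonal projector depends only on span(Y), not on the chosen basis.
P = Y * Y'
```

Every column of `samples` lies in the subspace, so applying `P` leaves it unchanged.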

Example

Consider the following toy example: We want to sample from the graph of the (scaled) Rosenbrock function $f(x,y) = ((1 - x)^2 + 100(y - x^2)^2)/1000$ while pretending we do not know the function.

rosenbrock(x::Vector) = ((1.0 - x[1]) ^ 2 + 100.0 * (x[2] - x[1] ^ 2) ^ 2) / 1000
 x, y = -1.5:0.1:1.5, -1.5:0.1:1.5
 z = Surface((x,y)->rosenbrock([x,y]), x, y)
-p = surface(x,y,z; camera=(30,20), alpha=.6, colorbar=false, xlims=(-1.5, 1.5), ylims=(-1.5, 1.5), zlims=(0.0, rosenbrock([-1.5, -1.5])))
Example block output

We now build a neural network whose task it is to map a product of two Gaussians $\mathcal{N}(0,1)\times\mathcal{N}(0,1)$ onto the graph of the Rosenbrock function where the range for $x$ and for $y$ is $[-1.5,1.5]$.

To compute the loss between the two distributions, i.e. $\Psi(\mathcal{N}(0,1)\times\mathcal{N}(0,1))$ and $f([-1.5,1.5], [-1.5,1.5])$, we use the Wasserstein distance[3].

using GeometricMachineLearning, Zygote, BrenierTwoFluid
+p = surface(x,y,z; camera=(30,20), alpha=.6, colorbar=false, xlims=(-1.5, 1.5), ylims=(-1.5, 1.5), zlims=(0.0, rosenbrock([-1.5, -1.5])))
Example block output

We now build a neural network whose task it is to map a product of two Gaussians $\mathcal{N}(0,1)\times\mathcal{N}(0,1)$ onto the graph of the Rosenbrock function where the range for $x$ and for $y$ is $[-1.5,1.5]$.

To compute the loss between the two distributions, i.e. $\Psi(\mathcal{N}(0,1)\times\mathcal{N}(0,1))$ and $f([-1.5,1.5], [-1.5,1.5])$, we use the Wasserstein distance[3].

using GeometricMachineLearning, Zygote, BrenierTwoFluid
 import Random # hide
 Random.seed!(123)
 
@@ -50,7 +50,7 @@
     loss_array[i] = val
     optimization_step!(optimizer, model, nn.params, dp)
 end
-plot(loss_array, xlabel="training step", label="loss")
Example block output

Now we plot a few points to check how well they match the graph:

const number_of_points = 35
+plot(loss_array, xlabel="training step", label="loss")
Example block output

Now we plot a few points to check how well they match the graph:

const number_of_points = 35
 
 coordinates = nn(randn(2, number_of_points))
-scatter3d!(p, [coordinates[1, :]], [coordinates[2, :]], [coordinates[3, :]], alpha=.5, color=4, label="mapped points")
Example block output
  • [1] We may know $n$ exactly or approximately.
  • [2] Problems and solutions related to this scenario are commonly summarized under the term manifold learning (see [16]).
  • [3] The implementation of the Wasserstein distance is taken from [17].
+scatter3d!(p, [coordinates[1, :]], [coordinates[2, :]], [coordinates[3, :]], alpha=.5, color=4, label="mapped points")
Example block output
  • [1] We may know $n$ exactly or approximately.
  • [2] Problems and solutions related to this scenario are commonly summarized under the term manifold learning (see [25]).
  • [3] The implementation of the Wasserstein distance is taken from [26].
diff --git a/latest/tutorials/linear_wave_equation/index.html b/latest/tutorials/linear_wave_equation/index.html index 1753120e1..e68e82944 100644 --- a/latest/tutorials/linear_wave_equation/index.html +++ b/latest/tutorials/linear_wave_equation/index.html @@ -1,5 +1,5 @@ -Linear Wave Equation · GeometricMachineLearning.jl

The Linear Wave Equation

The linear wave equation is the prototypical example for a Hamiltonian PDE. It is given by (see [10] and [11]):

\[\mathcal{H}(q, p; \mu) := \frac{1}{2}\int_\Omega\mu^2(\partial_\xi{}q(t,\xi;\mu))^2 + p(t,\xi;\mu)^2d\xi,\]

with $\xi\in\Omega:=(-1/2,1/2)$ and $\mu\in\mathbb{P}:=[5/12,5/6]$ as a possible choice for domain and parameters.

The PDE corresponding to this Hamiltonian can be obtained similarly to the ODE case:

\[\partial_t{}q(t,\xi;\mu) = \frac{\delta{}\mathcal{H}}{\delta{}p} = p(t,\xi;\mu), \quad \partial_t{}p(t,\xi;\mu) = -\frac{\delta{}\mathcal{H}}{\delta{}q} = \mu^2\partial_{\xi{}\xi}q(t,\xi;\mu).\]

As with any other PDE, the wave equation can also be discretized to obtain an ODE which can be solved numerically.

If we discretize $\mathcal{H}$ directly, to obtain a Hamiltonian on a finite-dimensional vector space $\mathbb{R}^{2N}$, we get a Hamiltonian ODE[1]:

\[\mathcal{H}_h(z) = \sum_{i=1}^{\tilde{N}}\frac{\Delta{}x}{2}\bigg[p_i^2 + \mu^2\frac{(q_i - q_{i-1})^2 + (q_{i+1} - q_i)^2}{2\Delta{}x^2}\bigg] = \frac{\Delta{}x}{2}p^Tp + q^TKq,\]

where the matrix $K$ contains elements of the form:

\[k_{ij} = \begin{cases} \frac{\mu^2}{4\Delta{}x} &\text{if $(i,j)\in\{(0,0),(\tilde{N}+1,\tilde{N}+1)\}$ }, \\ +Linear Wave Equation · GeometricMachineLearning.jl

The Linear Wave Equation

The linear wave equation is the prototypical example for a Hamiltonian PDE. It is given by (see [19] and [20]):

\[\mathcal{H}(q, p; \mu) := \frac{1}{2}\int_\Omega\mu^2(\partial_\xi{}q(t,\xi;\mu))^2 + p(t,\xi;\mu)^2d\xi,\]

with $\xi\in\Omega:=(-1/2,1/2)$ and $\mu\in\mathbb{P}:=[5/12,5/6]$ as a possible choice for domain and parameters.

The PDE corresponding to this Hamiltonian can be obtained similarly to the ODE case:

\[\partial_t{}q(t,\xi;\mu) = \frac{\delta{}\mathcal{H}}{\delta{}p} = p(t,\xi;\mu), \quad \partial_t{}p(t,\xi;\mu) = -\frac{\delta{}\mathcal{H}}{\delta{}q} = \mu^2\partial_{\xi{}\xi}q(t,\xi;\mu).\]
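
As a sketch of why the second equation holds, one can vary $\mathcal{H}$ with respect to $q$ and integrate by parts. This assumes that the boundary terms vanish (e.g. for fixed or periodic boundary values), which is an assumption made here for illustration:

\[\delta\mathcal{H} = \int_\Omega\mu^2(\partial_\xi{}q)(\partial_\xi{}\delta{}q)d\xi = -\int_\Omega\mu^2(\partial_{\xi{}\xi}q)\delta{}q\,d\xi \quad\implies\quad \frac{\delta{}\mathcal{H}}{\delta{}q} = -\mu^2\partial_{\xi{}\xi}q, \qquad \frac{\delta{}\mathcal{H}}{\delta{}p} = p.\]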

As with any other PDE, the wave equation can also be discretized to obtain an ODE which can be solved numerically.

If we discretize $\mathcal{H}$ directly, to obtain a Hamiltonian on a finite-dimensional vector space $\mathbb{R}^{2N}$, we get a Hamiltonian ODE[1]:

\[\mathcal{H}_h(z) = \sum_{i=1}^{\tilde{N}}\frac{\Delta{}x}{2}\bigg[p_i^2 + \mu^2\frac{(q_i - q_{i-1})^2 + (q_{i+1} - q_i)^2}{2\Delta{}x^2}\bigg] = \frac{\Delta{}x}{2}p^Tp + q^TKq,\]

where the matrix $K$ contains elements of the form:

\[k_{ij} = \begin{cases} \frac{\mu^2}{4\Delta{}x} &\text{if $(i,j)\in\{(0,0),(\tilde{N}+1,\tilde{N}+1)\}$ }, \\ -\frac{\mu^2}{2\Delta{}x} & \text{if $(i,j)=(1,0)$ or $(i,j)=(\tilde{N},\tilde{N}+1)$} \\ \frac{3\mu^2}{4\Delta{}x} & \text{if $(i,j)\in\{(1,1),(\tilde{N},\tilde{N})\}$} \\ \frac{\mu^2}{\Delta{}x} & \text{if $i=j$ and $i\in\{2,\ldots,(\tilde{N}-2)\}$} \\ @@ -9,4 +9,4 @@ 1 - \frac{3}{2}s^2 + \frac{3}{4}s^3 & \text{if } 0 \leq s \leq 1 \\ \frac{1}{4}(2 - s)^3 & \text{if } 1 < s \leq 2 \\ 0 & \text{else.} -\end{cases}\]

Plotted on the relevant domain it looks like this:

[Figure: plot of $h(s)$.]

Taking the above function $h(s)$ as a starting point, the initial conditions for the linear wave equations will now be constructed under the following considerations:

  • the initial condition (i.e. the shape of the wave) should depend on the parameter of the vector field, i.e. $u_0(\mu)(\omega) = h(s(\omega, \mu))$.
  • the solutions of the linear wave equation will travel with speed $\mu$, and we should make sure that the wave does not touch the right boundary of the domain, i.e. 0.5. So the peak should be sharper for higher values of $\mu$ as the wave will travel faster.
  • the wave should start at the left boundary of the domain, i.e. at point -0.5, so as to cover as much of the domain as possible.

Based on this we end up with the following choice of parametrized initial conditions:

\[u_0(\mu)(\omega) = h(s(\omega, \mu)), \quad s(\omega, \mu) = 20 \mu |\omega + \frac{\mu}{2}|.\]

References

[10]
P. Buchfink, S. Glas and B. Haasdonk. Symplectic model reduction of Hamiltonian systems on nonlinear manifolds and approximation with weakly symplectic autoencoder. SIAM Journal on Scientific Computing 45, A289–A311 (2023).
[11]
L. Peng and K. Mohseni. Symplectic model reduction of Hamiltonian systems. SIAM Journal on Scientific Computing 38, A1–A27 (2016).
[12]
C. Greif and K. Urban. Decay of the Kolmogorov N-width for wave problems. Applied Mathematics Letters 96, 216–222 (2019).
  • 1This conserves the Hamiltonian structure of the system.
+\end{cases}\]

Plotted on the relevant domain it looks like this:

[Figure: plot of $h(s)$.]

Taking the above function $h(s)$ as a starting point, the initial conditions for the linear wave equations will now be constructed under the following considerations:

  • the initial condition (i.e. the shape of the wave) should depend on the parameter of the vector field, i.e. $u_0(\mu)(\omega) = h(s(\omega, \mu))$.
  • the solutions of the linear wave equation will travel with speed $\mu$, and we should make sure that the wave does not touch the right boundary of the domain, i.e. 0.5. So the peak should be sharper for higher values of $\mu$ as the wave will travel faster.
  • the wave should start at the left boundary of the domain, i.e. at point -0.5, so as to cover as much of the domain as possible.

Based on this we end up with the following choice of parametrized initial conditions:

\[u_0(\mu)(\omega) = h(s(\omega, \mu)), \quad s(\omega, \mu) = 20 \mu |\omega + \frac{\mu}{2}|.\]
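
The parametrized initial condition can be evaluated with a few lines of Julia. This is a minimal sketch that only transcribes the formulas for $h$ and $s$ given above; the function names `h`, `s` and `u₀` are chosen here for illustration and are not part of the library:

```julia
# Cubic spline h(s) as defined above (zero outside 0 ≤ s ≤ 2).
function h(s)
    if 0 ≤ s ≤ 1
        return 1 - 3s^2 / 2 + 3s^3 / 4
    elseif 1 < s ≤ 2
        return (2 - s)^3 / 4
    else
        return zero(s)
    end
end

# Parameter-dependent rescaling: the peak sits at ω = -μ/2 and gets sharper for larger μ.
s(ω, μ) = 20μ * abs(ω + μ / 2)

# Initial condition u₀(μ)(ω) = h(s(ω, μ)).
u₀(ω, μ) = h(s(ω, μ))

# Evaluate on a grid over the domain Ω = (-1/2, 1/2) for one parameter value.
ω_grid = range(-0.5, 0.5; length = 101)
profile = u₀.(ω_grid, 0.5)
```

For $\mu = 0.5$ the maximum of `profile` is attained at $\omega = -0.25$, where $s = 0$ and $h(0) = 1$.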

References

[19]
P. Buchfink, S. Glas and B. Haasdonk. Symplectic model reduction of Hamiltonian systems on nonlinear manifolds and approximation with weakly symplectic autoencoder. SIAM Journal on Scientific Computing 45, A289–A311 (2023).
[20]
L. Peng and K. Mohseni. Symplectic model reduction of Hamiltonian systems. SIAM Journal on Scientific Computing 38, A1–A27 (2016).
[21]
C. Greif and K. Urban. Decay of the Kolmogorov N-width for wave problems. Applied Mathematics Letters 96, 216–222 (2019).
  • 1This conserves the Hamiltonian structure of the system.
diff --git a/latest/tutorials/mnist_tutorial/index.html b/latest/tutorials/mnist_tutorial/index.html index 8fd2992fe..36c1c2aac 100644 --- a/latest/tutorials/mnist_tutorial/index.html +++ b/latest/tutorials/mnist_tutorial/index.html @@ -1,5 +1,5 @@ -MNIST · GeometricMachineLearning.jl

MNIST tutorial

This is a short tutorial that shows how we can use GeometricMachineLearning to build a vision transformer and apply it to MNIST, while also putting some of the weights on a manifold. This is also the result presented in [15].

First, we need to import the relevant packages:

using GeometricMachineLearning, CUDA, Plots
+MNIST · GeometricMachineLearning.jl

MNIST tutorial

This is a short tutorial that shows how we can use GeometricMachineLearning to build a vision transformer and apply it to MNIST, while also putting some of the weights on a manifold. This is also the result presented in [24].

First, we need to import the relevant packages:

using GeometricMachineLearning, CUDA, Plots
 import Zygote, MLDatasets, KernelAbstractions

For the AD routine we here use the GeometricMachineLearning default and we get the dataset from MLDatasets. First we need to load the data set, and put it on GPU (if you have one):

train_x, train_y = MLDatasets.MNIST(split=:train)[:]
 test_x, test_y = MLDatasets.MNIST(split=:test)[:]
 train_x = train_x |> cu 
@@ -7,7 +7,7 @@
 train_y = train_y |> cu 
 test_y = test_y |> cu

GeometricMachineLearning has built-in data loaders that make it particularly easy to handle data:

patch_length = 7
 dl = DataLoader(train_x, train_y, patch_length=patch_length)
-dl_test = DataLoader(test_x, test_y, patch_length=patch_length)

Here patch_length indicates the side length of one patch. One image in MNIST has dimension $28\times28$, so we decompose it into 16 patches of size $7\times7$ (also see [15]).

We next define the model that we want to train:

model = ClassificationTransformer(dl, n_heads=n_heads, n_layers=n_layers, Stiefel=true)

Here we have chosen a ClassificationTransformer, i.e. a composition of a specific number of transformer layers composed with a classification layer. We also set the Stiefel option to true, i.e. we are optimizing on the Stiefel manifold.

We now have to initialize the neural network weights. This is done with the constructor for NeuralNetwork:

backend = KernelAbstractions.get_backend(dl)
+dl_test = DataLoader(test_x, test_y, patch_length=patch_length)

Here patch_length indicates the side length of one patch. One image in MNIST has dimension $28\times28$, so we decompose it into 16 patches of size $7\times7$ (also see [24]).
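
To make the patch count concrete, here is a small sketch of such a decomposition in plain Julia. It is only meant to illustrate the arithmetic; how DataLoader orders and stores the patches internally is not shown here, and the names `image`, `patches` and `tokens` are hypothetical:

```julia
patch_length = 7

# Stand-in for a single 28×28 MNIST image.
image = reshape(collect(1f0:784f0), 28, 28)

# Number of patches along each axis: 28 ÷ 7 = 4.
n_patches = size(image, 1) ÷ patch_length

# Cut the image into a 4×4 grid of 7×7 patches.
patches = [image[(i - 1) * patch_length + 1:i * patch_length,
                 (j - 1) * patch_length + 1:j * patch_length]
           for i in 1:n_patches, j in 1:n_patches]

# Flatten every patch into a column; the result has one column per patch.
tokens = reduce(hcat, vec.(vec(patches)))
```

With `patch_length = 7` this yields a $49\times16$ matrix, i.e. 16 patches with 49 pixels each.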

We next define the model that we want to train:

model = ClassificationTransformer(dl, n_heads=n_heads, n_layers=n_layers, Stiefel=true)

Here we have chosen a ClassificationTransformer, i.e. a composition of a specific number of transformer layers composed with a classification layer. We also set the Stiefel option to true, i.e. we are optimizing on the Stiefel manifold.
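
As a rough illustration of what the Stiefel constraint means for a weight matrix, consider the following sketch. It uses only LinearAlgebra and does not reproduce the optimizer of the library; the QR-based re-orthonormalization is just one simple way of returning to the manifold and is an assumption made for this example:

```julia
using LinearAlgebra, Random

Random.seed!(1)
N, n = 8, 3

# A point on the Stiefel manifold: an N×n matrix with orthonormal columns.
Y = Matrix(qr(randn(N, n)).Q)

# An unconstrained update generally violates the constraint YᵀY = I ...
Y_step = Y - 0.1 * randn(N, n)

# ... and re-orthonormalizing the columns (here via QR) restores it.
Y_back = Matrix(qr(Y_step).Q)

constraint_error(A) = norm(A' * A - I)
```

Here `constraint_error(Y)` and `constraint_error(Y_back)` are at the level of rounding errors, whereas `constraint_error(Y_step)` is not.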

We now have to initialize the neural network weights. This is done with the constructor for NeuralNetwork:

backend = KernelAbstractions.get_backend(dl)
 T = eltype(dl)
 nn = NeuralNetwork(model, backend, T)

And with this we can finally perform the training:

# an instance of batch is needed for the optimizer
 batch = Batch(batch_size)
@@ -19,4 +19,4 @@
 
 loss_array = optimizer_instance(nn, dl, batch, n_epochs)
 
-println("final test accuracy: ", accuracy(Ψᵉ, ps, dl_test), "\n")

It is instructive to play with n_layers, n_epochs and the Stiefel property.

[15]
B. Brantner. Generalizing Adam To Manifolds For Efficiently Training Transformers, arXiv preprint arXiv:2305.16901 (2023).
+println("final test accuracy: ", accuracy(Ψᵉ, ps, dl_test), "\n")

It is instructive to play with n_layers, n_epochs and the Stiefel property.

[24]
B. Brantner. Generalizing Adam To Manifolds For Efficiently Training Transformers, arXiv preprint arXiv:2305.16901 (2023).
diff --git a/latest/tutorials/sympnet_tutorial/b42e7145.svg b/latest/tutorials/sympnet_tutorial/1230253a.svg similarity index 87% rename from latest/tutorials/sympnet_tutorial/b42e7145.svg rename to latest/tutorials/sympnet_tutorial/1230253a.svg index 7f4284218..991a3ffe9 100644 --- a/latest/tutorials/sympnet_tutorial/b42e7145.svg +++ b/latest/tutorials/sympnet_tutorial/1230253a.svg @@ -1,50 +1,50 @@ - + - + - + - + - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/latest/tutorials/sympnet_tutorial/c2597183.svg b/latest/tutorials/sympnet_tutorial/ac4f8ec5.svg similarity index 88% rename from latest/tutorials/sympnet_tutorial/c2597183.svg rename to latest/tutorials/sympnet_tutorial/ac4f8ec5.svg index 57a3e9e8d..538a83c6c 100644 --- a/latest/tutorials/sympnet_tutorial/c2597183.svg +++ b/latest/tutorials/sympnet_tutorial/ac4f8ec5.svg @@ -1,40 +1,40 @@ - + - + - + - + - + - - - - - - - - - - - - - - - - - - - - - + + + + + + + + + + + + + + + + + + + + + diff --git a/latest/tutorials/sympnet_tutorial/index.html b/latest/tutorials/sympnet_tutorial/index.html index 4685c89b6..7f55860ea 100644 --- a/latest/tutorials/sympnet_tutorial/index.html +++ b/latest/tutorials/sympnet_tutorial/index.html @@ -1,5 +1,5 @@ -Sympnets · GeometricMachineLearning.jl

SympNets with GeometricMachineLearning.jl

This page serves as a short introduction to using SympNets with GeometricMachineLearning.jl. For the general theory see the theory section.

With GeometricMachineLearning.jl one can easily implement SympNets. The steps are the following:

  • Specify the architecture with the functions GSympNet and LASympNet,
  • Specify the type and the backend with NeuralNetwork,
  • Pick an optimizer for training the network,
  • Train the neural networks!

We discuss these points in some detail:

Specifying the architecture

To call an $LA$-SympNet, one needs to write

lasympnet = LASympNet(dim; depth=5, nhidden=1, activation=tanh, init_upper_linear=true, init_upper_act=true) 

LASympNet takes one obligatory argument:

  • dim : the dimension of the phase space (i.e. an integer) or optionally an instance of DataLoader. This latter option will be used below.

and several keyword arguments:

  • depth : the depth of all the linear layers. The default value is 5 (if width>5, width is set to 5). See the theory section for more details; there depth was called $n$.
  • nhidden : the number of pairs of linear and activation layers, with default value 1 (i.e. the $LA$-SympNet is a composition of a linear layer, an activation layer and then again a linear layer).
  • activation : the activation function for all the activation layers, with default tanh.
  • init_upper_linear : a boolean that indicates whether the first linear layer changes $q$ first. By default this is true.
  • init_upper_act : a boolean that indicates whether the first activation layer changes $q$ first. By default this is true.

G-SympNet

To call a G-SympNet, one needs to write

gsympnet = GSympNet(dim; upscaling_dimension=2*dim, nhidden=2, activation=tanh, init_upper=true) 

GSympNet takes one obligatory argument:

  • dim : the dimension of the phase space (i.e. an integer) or optionally an instance of DataLoader. This latter option will be used below.

and several keyword arguments:

  • upscaling_dimension: The first dimension of the matrix with which the input is multiplied. In the theory section this matrix is called $K$ and the upscaling dimension is called $m$.
  • nhidden: the number of gradient layers with default value set to 2.
  • activation : the activation function for all the activation layers, with default tanh.
  • init_upper : a boolean that indicates whether the first gradient layer changes $q$ first. By default this is true.

Loss function

The loss function described in the theory section is the default choice used in GeometricMachineLearning.jl for training SympNets.

Data Structures in GeometricMachineLearning.jl


Examples

Let us see how to use it on several examples.

Example of a pendulum with G-SympNet

Let us begin with a simple example, the pendulum system, the Hamiltonian of which is

\[H:(q,p)\in\mathbb{R}^2 \mapsto \frac{1}{2}p^2-\cos(q) \in \mathbb{R}.\]
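
For orientation, the dynamics generated by this Hamiltonian can be integrated with a few lines of plain Julia. The following is a minimal sketch using the symplectic Euler method; it is not the script mentioned below, whose integrator may differ, and the names `symplectic_euler` and `trajectory` are hypothetical:

```julia
# Pendulum Hamiltonian from above.
H(q, p) = p^2 / 2 - cos(q)

# Hamilton's equations: q̇ = ∂H/∂p = p and ṗ = -∂H/∂q = -sin(q).
# One step of symplectic Euler: update p first, then q with the new p.
function symplectic_euler(q, p, h)
    p_new = p - h * sin(q)
    q_new = q + h * p_new
    return q_new, p_new
end

# Integrate n steps of size h starting from (q0, p0).
function trajectory(q0, p0, h, n)
    qs, ps = [q0], [p0]
    for _ in 1:n
        q, p = symplectic_euler(qs[end], ps[end], h)
        push!(qs, q)
        push!(ps, p)
    end
    return qs, ps
end

qs, ps = trajectory(1.0, 0.0, 0.01, 1000)
energies = H.(qs, ps)
```

Because the integrator is symplectic, the entries of `energies` stay close to the initial value instead of drifting.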

Here we generate pendulum data with the script GeometricMachineLearning/scripts/pendulum.jl:

using GeometricMachineLearning
+Sympnets · GeometricMachineLearning.jl

SympNets with GeometricMachineLearning.jl

This page serves as a short introduction to using SympNets with GeometricMachineLearning.jl. For the general theory see the theory section.

With GeometricMachineLearning.jl one can easily implement SympNets. The steps are the following:

  • Specify the architecture with the functions GSympNet and LASympNet,
  • Specify the type and the backend with NeuralNetwork,
  • Pick an optimizer for training the network,
  • Train the neural networks!

We discuss these points in some detail:

Specifying the architecture

To call an $LA$-SympNet, one needs to write

lasympnet = LASympNet(dim; depth=5, nhidden=1, activation=tanh, init_upper_linear=true, init_upper_act=true) 

LASympNet takes one obligatory argument:

  • dim : the dimension of the phase space (i.e. an integer) or optionally an instance of DataLoader. This latter option will be used below.

and several keyword arguments:

  • depth : the depth for all the linear layers. The default value is 5 (if width>5, width is set to 5). See the theory section for more details; there depth was called $n$.
  • nhidden : the number of pairs of linear and activation layers, with default value 1 (i.e. the $LA$-SympNet is a composition of a linear layer, an activation layer and then again a single linear layer).
  • activation : the activation function for all the activation layers, with default tanh.
  • init_upper_linear : a boolean that indicates whether the first linear layer changes $q$ first. By default this is true.
  • init_upper_act : a boolean that indicates whether the first activation layer changes $q$ first. By default this is true.

G-SympNet

To call a G-SympNet, one needs to write

gsympnet = GSympNet(dim; upscaling_dimension=2*dim, nhidden=2, activation=tanh, init_upper=true) 

GSympNet takes one obligatory argument:

  • dim : the dimension of the phase space (i.e. an integer) or optionally an instance of DataLoader. This latter option will be used below.

and several keyword arguments:

  • upscaling_dimension : the first dimension of the matrix with which the input is multiplied. In the theory section this matrix is called $K$ and the upscaling dimension is called $m$.
  • nhidden : the number of gradient layers, with default value 2.
  • activation : the activation function for all the activation layers, with default tanh.
  • init_upper : a boolean that indicates whether the first gradient layer changes $q$ first. By default this is true.
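
As a minimal sketch, assuming the constructor signatures quoted above, the two architectures could be set up for a two-dimensional phase space as follows. The value dim = 2 and the keyword values are purely illustrative, and only a subset of the listed keywords is passed; everything else is left at its default.

```julia
using GeometricMachineLearning

# Dimension of the phase space (q, p); illustrative value for a system
# with one degree of freedom.
dim = 2

# LA-SympNet: linear layers of depth 5, tanh activation layers
# (keyword names taken from the signature quoted above).
lasympnet = LASympNet(dim; depth=5, activation=tanh)

# G-SympNet: upscaling dimension m = 2*dim = 4, tanh activation
# (keyword names taken from the signature quoted above).
gsympnet = GSympNet(dim; upscaling_dimension=2*dim, activation=tanh)
```

Both objects only describe the architecture; the type and the backend are fixed in the next step with NeuralNetwork.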

Loss function

The loss function described in the theory section is the default choice used in GeometricMachineLearning.jl for training SympNets.

Data Structures in GeometricMachineLearning.jl


Examples

Let us see how to use SympNets in several examples.

Example of a pendulum with G-SympNet

Let us begin with a simple example, the pendulum system, the Hamiltonian of which is

\[H:(q,p)\in\mathbb{R}^2 \mapsto \frac{1}{2}p^2-\cos(q) \in \mathbb{R}.\]
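
For reference, differentiating this Hamiltonian gives Hamilton's equations for the pendulum:

\[\dot{q} = \frac{\partial H}{\partial p} = p, \qquad \dot{p} = -\frac{\partial H}{\partial q} = -\sin(q).\]

The following plain-Julia sketch writes these out and checks that $H$ does not change along the vector field. It is not the data-generation script used on this page; the name pendulum_vector_field and the sample point are hypothetical and chosen only for illustration.

```julia
# Pendulum Hamiltonian H(q, p) = p^2/2 - cos(q), as given above.
H(q, p) = p^2 / 2 - cos(q)

# Right-hand side of Hamilton's equations: (dq/dt, dp/dt) = (∂H/∂p, -∂H/∂q).
# The function name is hypothetical, not part of GeometricMachineLearning.jl.
pendulum_vector_field(q, p) = (p, -sin(q))

# H is conserved along exact solutions, so the derivative of H in the
# direction of the vector field has to vanish at every point.
q, p = 0.3, -1.2            # arbitrary illustrative phase-space point
dq, dp = pendulum_vector_field(q, p)
dH = sin(q) * dq + p * dp   # ∂H/∂q * dq/dt + ∂H/∂p * dp/dt
println(abs(dH) < 1e-12)
```

A SympNet trained on pendulum data is meant to approximate the flow map of exactly this vector field while remaining symplectic by construction.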

Here we generate pendulum data with the script GeometricMachineLearning/scripts/pendulum.jl:

using GeometricMachineLearning
 
 # load script
 include("../../../scripts/pendulum.jl")
@@ -38,448 +38,384 @@
 
 # perform training (returns array that contains the total loss for each training step)
 g_loss_array = g_opt(g_nn, dl, batch, nepochs)
-la_loss_array = la_opt(la_nn, dl, batch, nepochs)

Progress:   1%|▎                                        |  ETA: 0:16:51
+la_loss_array = la_opt(la_nn, dl, batch, nepochs)

Progress:   1%|▎                                        |  ETA: 0:17:06
   TrainingLoss:  0.909222836011647
-

Progress:  12%|████▊                                    |  ETA: 0:01:07
-  TrainingLoss:  0.2707095773187127
-

Progress:  14%|█████▋                                   |  ETA: 0:00:57
-  TrainingLoss:  0.21999370001441165
-

Progress:  15%|██████▏                                  |  ETA: 0:00:52
-  TrainingLoss:  0.18784218013538828
-

Progress:  16%|██████▊                                  |  ETA: 0:00:48
-  TrainingLoss:  0.15866004989642474
-

Progress:  18%|███████▎                                 |  ETA: 0:00:44
-  TrainingLoss:  0.1341243646944143
-

Progress:  19%|███████▊                                 |  ETA: 0:00:41
-  TrainingLoss:  0.11131824628452096
-

Progress:  20%|████████▍                                |  ETA: 0:00:38
+

Progress:  11%|████▍                                    |  ETA: 0:01:15
+  TrainingLoss:  0.29725316338963303
+

Progress:  17%|███████▏                                 |  ETA: 0:00:46
+  TrainingLoss:  0.13982333708831512
+

Progress:  18%|███████▌                                 |  ETA: 0:00:44
+  TrainingLoss:  0.12242977503763892
+

Progress:  19%|███████▉                                 |  ETA: 0:00:41
+  TrainingLoss:  0.10606854606545517
+

Progress:  20%|████████▍                                |  ETA: 0:00:39
   TrainingLoss:  0.09224355567759161
-

Progress:  22%|████████▉                                |  ETA: 0:00:36
-  TrainingLoss:  0.07711666735650555
-

Progress:  23%|█████████▍                               |  ETA: 0:00:33
-  TrainingLoss:  0.06292411472539625
-

Progress:  24%|██████████                               |  ETA: 0:00:31
+

Progress:  21%|████████▊                                |  ETA: 0:00:37
+  TrainingLoss:  0.08080088746678245
+

Progress:  22%|█████████▏                               |  ETA: 0:00:36
+  TrainingLoss:  0.06986076690420591
+

Progress:  23%|█████████▋                               |  ETA: 0:00:34
+  TrainingLoss:  0.05963487293338036
+

Progress:  24%|██████████                               |  ETA: 0:00:33
   TrainingLoss:  0.050648454856902815
-

Progress:  26%|██████████▌                              |  ETA: 0:00:30
-  TrainingLoss:  0.04170501315745365
-

Progress:  27%|██████████▉                              |  ETA: 0:00:28
-  TrainingLoss:  0.03715385976771302
-

Progress:  28%|███████████▌                             |  ETA: 0:00:27
-  TrainingLoss:  0.03332873755735711
-

Progress:  29%|████████████                             |  ETA: 0:00:26
+

Progress:  25%|██████████▍                              |  ETA: 0:00:31
+  TrainingLoss:  0.04366821160208312
+

Progress:  26%|██████████▊                              |  ETA: 0:00:30
+  TrainingLoss:  0.038484837461107875
+

Progress:  27%|███████████▎                             |  ETA: 0:00:29
+  TrainingLoss:  0.035155358402035305
+

Progress:  28%|███████████▋                             |  ETA: 0:00:28
+  TrainingLoss:  0.03254796132031385
+

Progress:  29%|████████████                             |  ETA: 0:00:27
   TrainingLoss:  0.03002121025704575
-

Progress:  30%|████████████▍                            |  ETA: 0:00:25
+

Progress:  30%|████████████▍                            |  ETA: 0:00:26
   TrainingLoss:  0.028098005436551547
-

Progress:  32%|█████████████                            |  ETA: 0:00:23
-  TrainingLoss:  0.02577253251908417
-

Progress:  33%|█████████████▌                           |  ETA: 0:00:22
-  TrainingLoss:  0.023281006734764808
-

Progress:  34%|██████████████▏                          |  ETA: 0:00:21
+

Progress:  31%|████████████▉                            |  ETA: 0:00:25
+  TrainingLoss:  0.026246661761262592
+

Progress:  32%|█████████████▎                           |  ETA: 0:00:24
+  TrainingLoss:  0.024349815238206823
+

Progress:  33%|█████████████▋                           |  ETA: 0:00:23
+  TrainingLoss:  0.022715596892490447
+

Progress:  34%|██████████████▏                          |  ETA: 0:00:22
   TrainingLoss:  0.021281667325134722
-

Progress:  36%|██████████████▋                          |  ETA: 0:00:20
-  TrainingLoss:  0.020002841333850434
-

Progress:  37%|███████████████▏                         |  ETA: 0:00:19
-  TrainingLoss:  0.018694377190663976
-

Progress:  38%|███████████████▊                         |  ETA: 0:00:18
+

Progress:  35%|██████████████▌                          |  ETA: 0:00:21
+  TrainingLoss:  0.02036181031063595
+

Progress:  36%|██████████████▉                          |  ETA: 0:00:21
+  TrainingLoss:  0.01923765140273361
+

Progress:  37%|███████████████▎                         |  ETA: 0:00:20
+  TrainingLoss:  0.018417784651478128
+

Progress:  38%|███████████████▊                         |  ETA: 0:00:19
   TrainingLoss:  0.01771676858136383
-

Progress:  40%|████████████████▎                        |  ETA: 0:00:18
-  TrainingLoss:  0.01671047586587387
-

Progress:  41%|████████████████▊                        |  ETA: 0:00:17
-  TrainingLoss:  0.01583640206689827
-

Progress:  42%|█████████████████▍                       |  ETA: 0:00:16
+

Progress:  39%|████████████████▏                        |  ETA: 0:00:19
+  TrainingLoss:  0.01691771673258003
+

Progress:  40%|████████████████▌                        |  ETA: 0:00:18
+  TrainingLoss:  0.01624378238608939
+

Progress:  41%|█████████████████                        |  ETA: 0:00:18
+  TrainingLoss:  0.015535916959958696
+

Progress:  42%|█████████████████▍                       |  ETA: 0:00:17
   TrainingLoss:  0.014929130853420344
-

Progress:  44%|█████████████████▉                       |  ETA: 0:00:16
-  TrainingLoss:  0.01413571251544236
-

Progress:  45%|██████████████████▌                      |  ETA: 0:00:15
-  TrainingLoss:  0.013430767977697822
-

Progress:  46%|███████████████████                      |  ETA: 0:00:14
+

Progress:  43%|█████████████████▊                       |  ETA: 0:00:16
+  TrainingLoss:  0.01436009220151282
+

Progress:  44%|██████████████████▏                      |  ETA: 0:00:16
+  TrainingLoss:  0.013819415260382597
+

Progress:  45%|██████████████████▋                      |  ETA: 0:00:15
+  TrainingLoss:  0.013208620663237353
+

Progress:  46%|███████████████████                      |  ETA: 0:00:15
   TrainingLoss:  0.012802956037120991
-

Progress:  48%|███████████████████▌                     |  ETA: 0:00:14
-  TrainingLoss:  0.012213005637506723
-

Progress:  49%|████████████████████▏                    |  ETA: 0:00:13
-  TrainingLoss:  0.011609319493380392
+

Progress:  47%|███████████████████▍                     |  ETA: 0:00:15
+  TrainingLoss:  0.012380512078110134
+

Progress:  48%|███████████████████▉                     |  ETA: 0:00:14
+  TrainingLoss:  0.011865888831338196
+

Progress:  49%|████████████████████▎                    |  ETA: 0:00:14
+  TrainingLoss:  0.011513163766239015
 

Progress:  50%|████████████████████▋                    |  ETA: 0:00:13
   TrainingLoss:  0.011071099032179621
-

Progress:  52%|█████████████████████▏                   |  ETA: 0:00:12
-  TrainingLoss:  0.010663843576407608
+

Progress:  51%|█████████████████████                    |  ETA: 0:00:13
+  TrainingLoss:  0.010798583868263614
 

Progress:  52%|█████████████████████▌                   |  ETA: 0:00:12
   TrainingLoss:  0.010373200887543397
-

Progress:  54%|██████████████████████                   |  ETA: 0:00:11
-  TrainingLoss:  0.009966718243090401
-

Progress:  55%|██████████████████████▌                  |  ETA: 0:00:11
-  TrainingLoss:  0.009529085234212562
-

Progress:  56%|███████████████████████▏                 |  ETA: 0:00:10
+

Progress:  53%|█████████████████████▉                   |  ETA: 0:00:12
+  TrainingLoss:  0.010113246291306851
+

Progress:  54%|██████████████████████▎                  |  ETA: 0:00:12
+  TrainingLoss:  0.009750831637723466
+

Progress:  55%|██████████████████████▋                  |  ETA: 0:00:11
+  TrainingLoss:  0.009459923398791776
+

Progress:  56%|███████████████████████▏                 |  ETA: 0:00:11
   TrainingLoss:  0.00923169773886583
-

Progress:  58%|███████████████████████▋                 |  ETA: 0:00:10
-  TrainingLoss:  0.008887308303050952
-

Progress:  59%|████████████████████████▎                |  ETA: 0:00:09
-  TrainingLoss:  0.008655760949679367
-

Progress:  60%|████████████████████████▊                |  ETA: 0:00:09
+

Progress:  57%|███████████████████████▌                 |  ETA: 0:00:11
+  TrainingLoss:  0.008918903550656197
+

Progress:  58%|███████████████████████▉                 |  ETA: 0:00:10
+  TrainingLoss:  0.008832151219499502
+

Progress:  59%|████████████████████████▍                |  ETA: 0:00:10
+  TrainingLoss:  0.008560885156218034
+

Progress:  60%|████████████████████████▊                |  ETA: 0:00:10
   TrainingLoss:  0.008320687618735174
-

Progress:  62%|█████████████████████████▎               |  ETA: 0:00:09
-  TrainingLoss:  0.008057216002955423
-

Progress:  63%|█████████████████████████▉               |  ETA: 0:00:08
-  TrainingLoss:  0.007817013381626867
+

Progress:  61%|█████████████████████████▏               |  ETA: 0:00:09
+  TrainingLoss:  0.008130829562582617
+

Progress:  62%|█████████████████████████▌               |  ETA: 0:00:09
+  TrainingLoss:  0.007945766406464255
+

Progress:  63%|██████████████████████████               |  ETA: 0:00:09
+  TrainingLoss:  0.007743307671420374
 

Progress:  64%|██████████████████████████▍              |  ETA: 0:00:08
   TrainingLoss:  0.007606258020944676
-

Progress:  66%|██████████████████████████▉              |  ETA: 0:00:07
-  TrainingLoss:  0.00741897936575434
-

Progress:  67%|███████████████████████████▌             |  ETA: 0:00:07
-  TrainingLoss:  0.0072318641745363664
+

Progress:  65%|██████████████████████████▊              |  ETA: 0:00:08
+  TrainingLoss:  0.0074229887045026005
+

Progress:  66%|███████████████████████████▎             |  ETA: 0:00:08
+  TrainingLoss:  0.007314861747524862
+

Progress:  67%|███████████████████████████▋             |  ETA: 0:00:07
+  TrainingLoss:  0.007160189981425366
 

Progress:  68%|████████████████████████████             |  ETA: 0:00:07
   TrainingLoss:  0.007030576968018319
-

Progress:  70%|████████████████████████████▋            |  ETA: 0:00:06
-  TrainingLoss:  0.006842160831814867
-

Progress:  71%|█████████████████████████████▏           |  ETA: 0:00:06
-  TrainingLoss:  0.006663295693090479
+

Progress:  69%|████████████████████████████▍            |  ETA: 0:00:07
+  TrainingLoss:  0.006870848875129526
+

Progress:  70%|████████████████████████████▉            |  ETA: 0:00:07
+  TrainingLoss:  0.006747481287128898
+

Progress:  71%|█████████████████████████████▎           |  ETA: 0:00:06
+  TrainingLoss:  0.00664129804237869
 

Progress:  72%|█████████████████████████████▋           |  ETA: 0:00:06
   TrainingLoss:  0.006541829406911069
-

Progress:  74%|██████████████████████████████▎          |  ETA: 0:00:05
-  TrainingLoss:  0.006439266215098352
-

Progress:  75%|██████████████████████████████▊          |  ETA: 0:00:05
-  TrainingLoss:  0.006373630354463582
+

Progress:  73%|██████████████████████████████▏          |  ETA: 0:00:06
+  TrainingLoss:  0.0064552203631803836
+

Progress:  74%|██████████████████████████████▌          |  ETA: 0:00:06
+  TrainingLoss:  0.0063660010423700775
+

Progress:  75%|██████████████████████████████▉          |  ETA: 0:00:05
+  TrainingLoss:  0.0064022197079272895
 

Progress:  76%|███████████████████████████████▎         |  ETA: 0:00:05
   TrainingLoss:  0.006231744483980814
-

Progress:  78%|███████████████████████████████▉         |  ETA: 0:00:04
-  TrainingLoss:  0.006105925322761813
-

Progress:  78%|████████████████████████████████▏        |  ETA: 0:00:04
+

Progress:  77%|███████████████████████████████▊         |  ETA: 0:00:05
+  TrainingLoss:  0.006119335012969507
+

Progress:  78%|████████████████████████████████▏        |  ETA: 0:00:05
   TrainingLoss:  0.006051537518187265
-

Progress:  80%|████████████████████████████████▋        |  ETA: 0:00:04
-  TrainingLoss:  0.005958265393168101
-

Progress:  81%|█████████████████████████████████▎       |  ETA: 0:00:04
-  TrainingLoss:  0.00589933221157149
-

Progress:  82%|█████████████████████████████████▊       |  ETA: 0:00:03
+

Progress:  79%|████████████████████████████████▌        |  ETA: 0:00:04
+  TrainingLoss:  0.005995399385285204
+

Progress:  80%|████████████████████████████████▉        |  ETA: 0:00:04
+  TrainingLoss:  0.005940739644776499
+

Progress:  81%|█████████████████████████████████▍       |  ETA: 0:00:04
+  TrainingLoss:  0.005864659481308273
+

Progress:  82%|█████████████████████████████████▊       |  ETA: 0:00:04
   TrainingLoss:  0.005839728541554363
-

Progress:  84%|██████████████████████████████████▎      |  ETA: 0:00:03
-  TrainingLoss:  0.005732300286263278
-

Progress:  85%|██████████████████████████████████▉      |  ETA: 0:00:03
-  TrainingLoss:  0.00564177681931222
+

Progress:  83%|██████████████████████████████████▏      |  ETA: 0:00:03
+  TrainingLoss:  0.005756940686556331
+

Progress:  84%|██████████████████████████████████▋      |  ETA: 0:00:03
+  TrainingLoss:  0.00566662379100261
+

Progress:  85%|███████████████████████████████████      |  ETA: 0:00:03
+  TrainingLoss:  0.0056088887753173566
 

Progress:  86%|███████████████████████████████████▍     |  ETA: 0:00:03
   TrainingLoss:  0.005588522487225395
-

Progress:  88%|████████████████████████████████████     |  ETA: 0:00:02
-  TrainingLoss:  0.005518383756603809
-

Progress:  89%|████████████████████████████████████▌    |  ETA: 0:00:02
-  TrainingLoss:  0.005447130651414077
+

Progress:  87%|███████████████████████████████████▊     |  ETA: 0:00:03
+  TrainingLoss:  0.005510213746739661
+

Progress:  88%|████████████████████████████████████▎    |  ETA: 0:00:02
+  TrainingLoss:  0.0054979164963569
+

Progress:  89%|████████████████████████████████████▋    |  ETA: 0:00:02
+  TrainingLoss:  0.005419126072428864
 

Progress:  90%|█████████████████████████████████████    |  ETA: 0:00:02
   TrainingLoss:  0.005383498125335612
-

Progress:  92%|█████████████████████████████████████▋   |  ETA: 0:00:02
-  TrainingLoss:  0.005338191077391543
-

Progress:  93%|██████████████████████████████████████▏  |  ETA: 0:00:01
-  TrainingLoss:  0.005325000191606857
+

Progress:  91%|█████████████████████████████████████▌   |  ETA: 0:00:02
+  TrainingLoss:  0.0053433202349327386
+

Progress:  92%|█████████████████████████████████████▉   |  ETA: 0:00:01
+  TrainingLoss:  0.005289632405488795
+

Progress:  93%|██████████████████████████████████████▎  |  ETA: 0:00:01
+  TrainingLoss:  0.005252486571131763
 

Progress:  94%|██████████████████████████████████████▋  |  ETA: 0:00:01
   TrainingLoss:  0.005241139972852765
-

Progress:  96%|███████████████████████████████████████▎ |  ETA: 0:00:01
-  TrainingLoss:  0.005199936369162577
-

Progress:  97%|███████████████████████████████████████▊ |  ETA: 0:00:01
-  TrainingLoss:  0.0051580930550792214
+

Progress:  95%|███████████████████████████████████████▏ |  ETA: 0:00:01
+  TrainingLoss:  0.005224773711242664
+

Progress:  96%|███████████████████████████████████████▌ |  ETA: 0:00:01
+  TrainingLoss:  0.005189084656079959
+

Progress:  97%|███████████████████████████████████████▉ |  ETA: 0:00:01
+  TrainingLoss:  0.005160276706676276
 

Progress:  98%|████████████████████████████████████████▍|  ETA: 0:00:00
   TrainingLoss:  0.005100797449693223
-

Progress:  99%|████████████████████████████████████████▉|  ETA: 0:00:00
-  TrainingLoss:  0.005108432700258919
-

Progress: 100%|█████████████████████████████████████████| Time: 0:00:17
+

Progress:  99%|████████████████████████████████████████▊|  ETA: 0:00:00
+  TrainingLoss:  0.0051866932602439055
+

Progress: 100%|█████████████████████████████████████████| Time: 0:00:18
   TrainingLoss:  0.005070411268189842
-
Progress:   1%|▎                                        |  ETA: 0:07:39
+
Progress:   1%|▎                                        |  ETA: 0:09:03
   TrainingLoss:  14.632655986264927
-

Progress:   1%|▌                                        |  ETA: 0:03:56
-  TrainingLoss:  13.329559242102803
-

Progress:   2%|▉                                        |  ETA: 0:02:42
-  TrainingLoss:  12.143594203779958
-

Progress:   3%|█▏                                       |  ETA: 0:02:04
+

Progress:   2%|▋                                        |  ETA: 0:03:43
+  TrainingLoss:  12.722386237539288
+

Progress:   3%|█▏                                       |  ETA: 0:02:23
   TrainingLoss:  11.039649287252608
-

Progress:   3%|█▍                                       |  ETA: 0:01:42
+

Progress:   3%|█▍                                       |  ETA: 0:01:57
   TrainingLoss:  10.02114636245331
-

Progress:   4%|█▋                                       |  ETA: 0:01:27
-  TrainingLoss:  9.097046791654707
-

Progress:   5%|█▉                                       |  ETA: 0:01:16
-  TrainingLoss:  8.237909753448795
-

Progress:   5%|██▏                                      |  ETA: 0:01:08
+

Progress:   4%|█▊                                       |  ETA: 0:01:32
+  TrainingLoss:  8.664983302582026
+

Progress:   5%|██▏                                      |  ETA: 0:01:17
   TrainingLoss:  7.4387152627155375
-

Progress:   6%|██▌                                      |  ETA: 0:01:02
-  TrainingLoss:  6.6946504960811195
-

Progress:   7%|██▊                                      |  ETA: 0:00:57
-  TrainingLoss:  5.99335742380527
-

Progress:   7%|███                                      |  ETA: 0:00:53
+

Progress:   6%|██▋                                      |  ETA: 0:01:06
+  TrainingLoss:  6.3372364378906365
+

Progress:   7%|███                                      |  ETA: 0:00:58
   TrainingLoss:  5.337824496581051
-

Progress:   8%|███▎                                     |  ETA: 0:00:49
-  TrainingLoss:  4.722331395613759
-

Progress:   9%|███▌                                     |  ETA: 0:00:46
-  TrainingLoss:  4.157010558636954
-

Progress:   9%|███▉                                     |  ETA: 0:00:44
+

Progress:   8%|███▍                                     |  ETA: 0:00:52
+  TrainingLoss:  4.427968285628076
+

Progress:   9%|███▉                                     |  ETA: 0:00:48
   TrainingLoss:  3.6389835326953497
-

Progress:  10%|████▏                                    |  ETA: 0:00:42
-  TrainingLoss:  3.196596575110065
-

Progress:  11%|████▍                                    |  ETA: 0:00:40
-  TrainingLoss:  2.8546007176758708
-

Progress:  12%|████▊                                    |  ETA: 0:00:37
-  TrainingLoss:  2.563477605016243
-

Progress:  12%|█████                                    |  ETA: 0:00:36
+

Progress:  10%|████▎                                    |  ETA: 0:00:44
+  TrainingLoss:  3.018038703349647
+

Progress:  11%|████▋                                    |  ETA: 0:00:41
+  TrainingLoss:  2.6301816496199746
+

Progress:  12%|█████                                    |  ETA: 0:00:38
   TrainingLoss:  2.4589363154824095
-

Progress:  13%|█████▍                                   |  ETA: 0:00:34
-  TrainingLoss:  2.367851135226425
-

Progress:  14%|█████▋                                   |  ETA: 0:00:33
-  TrainingLoss:  2.2839316905065385
-

Progress:  14%|█████▉                                   |  ETA: 0:00:32
-  TrainingLoss:  2.2027034713961204
-

Progress:  15%|██████▏                                  |  ETA: 0:00:31
+

Progress:  13%|█████▌                                   |  ETA: 0:00:36
+  TrainingLoss:  2.324857048096907
+

Progress:  14%|█████▊                                   |  ETA: 0:00:34
+  TrainingLoss:  2.2426168024857316
+

Progress:  15%|██████▏                                  |  ETA: 0:00:32
   TrainingLoss:  2.119989438641279
-

Progress:  16%|██████▍                                  |  ETA: 0:00:30
-  TrainingLoss:  2.039395047170094
-

Progress:  16%|██████▊                                  |  ETA: 0:00:29
-  TrainingLoss:  1.9560871301681106
-

Progress:  17%|███████                                  |  ETA: 0:00:28
+

Progress:  16%|██████▌                                  |  ETA: 0:00:31
+  TrainingLoss:  1.997256026673932
+

Progress:  17%|███████                                  |  ETA: 0:00:29
   TrainingLoss:  1.8737919561973513
-

Progress:  18%|███████▎                                 |  ETA: 0:00:27
-  TrainingLoss:  1.7956070934878448
-

Progress:  18%|███████▌                                 |  ETA: 0:00:27
-  TrainingLoss:  1.7173402100382613
-

Progress:  19%|███████▊                                 |  ETA: 0:00:26
+

Progress:  18%|███████▍                                 |  ETA: 0:00:28
+  TrainingLoss:  1.7567483719104793
+

Progress:  19%|███████▊                                 |  ETA: 0:00:27
   TrainingLoss:  1.6415047688305637
-

Progress:  20%|████████▏                                |  ETA: 0:00:25
-  TrainingLoss:  1.5599512629481986
-

Progress:  20%|████████▍                                |  ETA: 0:00:25
-  TrainingLoss:  1.474598400713758
-

Progress:  21%|████████▋                                |  ETA: 0:00:24
+

Progress:  20%|████████▎                                |  ETA: 0:00:26
+  TrainingLoss:  1.5173744553496762
+

Progress:  21%|████████▋                                |  ETA: 0:00:25
   TrainingLoss:  1.3934137812896394
 

Progress:  22%|████████▉                                |  ETA: 0:00:24
   TrainingLoss:  1.3150502016603007
-

Progress:  22%|█████████▏                               |  ETA: 0:00:23
-  TrainingLoss:  1.2414577504431759
-

Progress:  23%|█████████▍                               |  ETA: 0:00:23
-  TrainingLoss:  1.164763172928613
+

Progress:  23%|█████████▎                               |  ETA: 0:00:23
+  TrainingLoss:  1.2024391157169103
 

Progress:  24%|█████████▊                               |  ETA: 0:00:22
   TrainingLoss:  1.0992520972478235
-

Progress:  24%|██████████                               |  ETA: 0:00:22
-  TrainingLoss:  1.043503871028655
-

Progress:  25%|██████████▎                              |  ETA: 0:00:21
-  TrainingLoss:  0.9907049996147231
+

Progress:  25%|██████████▏                              |  ETA: 0:00:22
+  TrainingLoss:  1.0160185077110067
 

Progress:  26%|██████████▌                              |  ETA: 0:00:21
   TrainingLoss:  0.937864831394004
-

Progress:  26%|██████████▊                              |  ETA: 0:00:21
-  TrainingLoss:  0.8866346274340288
-

Progress:  27%|███████████▏                             |  ETA: 0:00:20
-  TrainingLoss:  0.837410160030462
+

Progress:  27%|██████████▉                              |  ETA: 0:00:20
+  TrainingLoss:  0.8628668852189971
 

Progress:  28%|███████████▍                             |  ETA: 0:00:20
   TrainingLoss:  0.7890516517640938
-

Progress:  28%|███████████▋                             |  ETA: 0:00:19
-  TrainingLoss:  0.7423383837164257
-

Progress:  29%|███████████▉                             |  ETA: 0:00:19
-  TrainingLoss:  0.697538174671303
+

Progress:  29%|███████████▊                             |  ETA: 0:00:19
+  TrainingLoss:  0.7197111629630445
 

Progress:  30%|████████████▏                            |  ETA: 0:00:19
   TrainingLoss:  0.6582292233771829
-

Progress:  30%|████████████▍                            |  ETA: 0:00:18
-  TrainingLoss:  0.6224919123900028
-

Progress:  31%|████████████▊                            |  ETA: 0:00:18
-  TrainingLoss:  0.58730267432665
-

Progress:  32%|█████████████                            |  ETA: 0:00:18
+

Progress:  31%|████████████▋                            |  ETA: 0:00:18
+  TrainingLoss:  0.6046570188658607
+

Progress:  32%|█████████████                            |  ETA: 0:00:17
   TrainingLoss:  0.5524066912925043
-

Progress:  32%|█████████████▎                           |  ETA: 0:00:17
-  TrainingLoss:  0.5182720630782547
-

Progress:  33%|█████████████▌                           |  ETA: 0:00:17
-  TrainingLoss:  0.48443695420371735
+

Progress:  33%|█████████████▍                           |  ETA: 0:00:17
+  TrainingLoss:  0.5012478464386658
 

Progress:  34%|█████████████▊                           |  ETA: 0:00:17
   TrainingLoss:  0.4512055050806613
-

Progress:  34%|██████████████▏                          |  ETA: 0:00:17
-  TrainingLoss:  0.416535305058476
-

Progress:  35%|██████████████▍                          |  ETA: 0:00:16
-  TrainingLoss:  0.3823346069524062
+

Progress:  35%|██████████████▎                          |  ETA: 0:00:16
+  TrainingLoss:  0.39920842494415393
 

Progress:  36%|██████████████▋                          |  ETA: 0:00:16
   TrainingLoss:  0.3488733373242542
-

Progress:  36%|██████████████▉                          |  ETA: 0:00:16
-  TrainingLoss:  0.31524269728197235
-

Progress:  37%|███████████████▏                         |  ETA: 0:00:16
-  TrainingLoss:  0.28228433330338454
+

Progress:  37%|███████████████                          |  ETA: 0:00:15
+  TrainingLoss:  0.29843649062868993
 

Progress:  38%|███████████████▌                         |  ETA: 0:00:15
   TrainingLoss:  0.2492514554951062
-

Progress:  38%|███████████████▊                         |  ETA: 0:00:15
-  TrainingLoss:  0.21779730660923474
-

Progress:  39%|████████████████                         |  ETA: 0:00:15
-  TrainingLoss:  0.18764192633612334
-

Progress:  40%|████████████████▎                        |  ETA: 0:00:15
+

Progress:  39%|███████████████▉                         |  ETA: 0:00:14
+  TrainingLoss:  0.20263297146958736
+

Progress:  40%|████████████████▎                        |  ETA: 0:00:14
   TrainingLoss:  0.15869341883306012
-

Progress:  40%|████████████████▌                        |  ETA: 0:00:14
-  TrainingLoss:  0.1318885979667291
-

Progress:  41%|████████████████▊                        |  ETA: 0:00:14
-  TrainingLoss:  0.1099539138032661
-

Progress:  42%|█████████████████▏                       |  ETA: 0:00:14
+

Progress:  41%|████████████████▋                        |  ETA: 0:00:14
+  TrainingLoss:  0.12024188731388612
+

Progress:  42%|█████████████████▏                       |  ETA: 0:00:13
   TrainingLoss:  0.09701118126007126
-

Progress:  42%|█████████████████▍                       |  ETA: 0:00:14
-  TrainingLoss:  0.08900259651992284
-

Progress:  43%|█████████████████▋                       |  ETA: 0:00:13
-  TrainingLoss:  0.0808094563549285
+

Progress:  43%|█████████████████▌                       |  ETA: 0:00:13
+  TrainingLoss:  0.08468925248251521
 

Progress:  44%|█████████████████▉                       |  ETA: 0:00:13
   TrainingLoss:  0.07339300802370965
-

Progress:  44%|██████████████████▏                      |  ETA: 0:00:13
-  TrainingLoss:  0.06675713528309779
-

Progress:  45%|██████████████████▌                      |  ETA: 0:00:13
-  TrainingLoss:  0.0606663803236915
-

Progress:  46%|██████████████████▊                      |  ETA: 0:00:13
+

Progress:  45%|██████████████████▍                      |  ETA: 0:00:12
+  TrainingLoss:  0.06369773407477967
+

Progress:  46%|██████████████████▊                      |  ETA: 0:00:12
   TrainingLoss:  0.05545599246206947
-

Progress:  46%|███████████████████                      |  ETA: 0:00:12
-  TrainingLoss:  0.050715554791113025
-

Progress:  47%|███████████████████▎                     |  ETA: 0:00:12
-  TrainingLoss:  0.0464026436710707
-

Progress:  48%|███████████████████▌                     |  ETA: 0:00:12
+

Progress:  47%|███████████████████▏                     |  ETA: 0:00:12
+  TrainingLoss:  0.04848350141710935
+

Progress:  48%|███████████████████▌                     |  ETA: 0:00:11
   TrainingLoss:  0.042549202200844925
-

Progress:  48%|███████████████████▉                     |  ETA: 0:00:12
-  TrainingLoss:  0.03921800743473182
-

Progress:  49%|████████████████████▏                    |  ETA: 0:00:12
-  TrainingLoss:  0.03614571237958915
+

Progress:  49%|████████████████████                     |  ETA: 0:00:11
+  TrainingLoss:  0.0376094401206204
 

Progress:  50%|████████████████████▍                    |  ETA: 0:00:11
   TrainingLoss:  0.033762651633194245
-

Progress:  50%|████████████████████▋                    |  ETA: 0:00:11
-  TrainingLoss:  0.03160352098444022
-

Progress:  51%|████████████████████▉                    |  ETA: 0:00:11
-  TrainingLoss:  0.029496129613007144
-

Progress:  52%|█████████████████████▏                   |  ETA: 0:00:11
+

Progress:  51%|████████████████████▊                    |  ETA: 0:00:11
+  TrainingLoss:  0.030484846335811215
+

Progress:  52%|█████████████████████▏                   |  ETA: 0:00:10
   TrainingLoss:  0.027812358892054377
-

Progress:  52%|█████████████████████▌                   |  ETA: 0:00:11
-  TrainingLoss:  0.026299936002555913
-

Progress:  53%|█████████████████████▊                   |  ETA: 0:00:10
-  TrainingLoss:  0.024968781910377685
+

Progress:  53%|█████████████████████▋                   |  ETA: 0:00:10
+  TrainingLoss:  0.025572399489283498
 

Progress:  54%|██████████████████████                   |  ETA: 0:00:10
   TrainingLoss:  0.02374785247234018
-

Progress:  54%|██████████████████████▎                  |  ETA: 0:00:10
-  TrainingLoss:  0.022809070756924087
-

Progress:  55%|██████████████████████▌                  |  ETA: 0:00:10
-  TrainingLoss:  0.02188893586840457
-

Progress:  56%|██████████████████████▉                  |  ETA: 0:00:10
+

Progress:  55%|██████████████████████▍                  |  ETA: 0:00:09
+  TrainingLoss:  0.02228156977561408
+

Progress:  56%|██████████████████████▉                  |  ETA: 0:00:09
   TrainingLoss:  0.02118680572754904
-

Progress:  56%|███████████████████████▏                 |  ETA: 0:00:10
-  TrainingLoss:  0.020551368049606222
-

Progress:  57%|███████████████████████▍                 |  ETA: 0:00:09
-  TrainingLoss:  0.020100220024996764
-

Progress:  58%|███████████████████████▋                 |  ETA: 0:00:09
-  TrainingLoss:  0.019509878917005626
+

Progress:  57%|███████████████████████▎                 |  ETA: 0:00:09
+  TrainingLoss:  0.02015232128016263
+

Progress:  57%|███████████████████████▌                 |  ETA: 0:00:09
+  TrainingLoss:  0.019683732878998238
 

Progress:  58%|███████████████████████▉                 |  ETA: 0:00:09
   TrainingLoss:  0.019141533776909803
-

Progress:  59%|████████████████████████▎                |  ETA: 0:00:09
-  TrainingLoss:  0.018975374225275038
-

Progress:  60%|████████████████████████▌                |  ETA: 0:00:09
-  TrainingLoss:  0.018515774727828112
-

Progress:  60%|████████████████████████▊                |  ETA: 0:00:09
+

Progress:  59%|████████████████████████▍                |  ETA: 0:00:08
+  TrainingLoss:  0.01861647699201924
+

Progress:  60%|████████████████████████▊                |  ETA: 0:00:08
   TrainingLoss:  0.018091452314537394
-

Progress:  61%|█████████████████████████                |  ETA: 0:00:08
-  TrainingLoss:  0.017772086369565808
-

Progress:  62%|█████████████████████████▎               |  ETA: 0:00:08
-  TrainingLoss:  0.017536378851434803
+

Progress:  61%|█████████████████████████▏               |  ETA: 0:00:08
+  TrainingLoss:  0.017666769868618086
 

Progress:  62%|█████████████████████████▌               |  ETA: 0:00:08
   TrainingLoss:  0.01736003468706266
-

Progress:  63%|█████████████████████████▉               |  ETA: 0:00:08
-  TrainingLoss:  0.017025823849866683
-

Progress:  64%|██████████████████████████▏              |  ETA: 0:00:08
-  TrainingLoss:  0.016688046574225616
-

Progress:  64%|██████████████████████████▍              |  ETA: 0:00:08
+

Progress:  63%|██████████████████████████               |  ETA: 0:00:07
+  TrainingLoss:  0.017229078634239885
+

Progress:  64%|██████████████████████████▍              |  ETA: 0:00:07
   TrainingLoss:  0.016490627785595374
-

Progress:  65%|██████████████████████████▋              |  ETA: 0:00:07
-  TrainingLoss:  0.01629352014149605
-

Progress:  66%|██████████████████████████▉              |  ETA: 0:00:07
-  TrainingLoss:  0.016141001976350034
+

Progress:  65%|██████████████████████████▊              |  ETA: 0:00:07
+  TrainingLoss:  0.016200267040036086
 

Progress:  66%|███████████████████████████▎             |  ETA: 0:00:07
   TrainingLoss:  0.015970433275107075
-

Progress:  67%|███████████████████████████▌             |  ETA: 0:00:07
-  TrainingLoss:  0.016120769218408826
-

Progress:  68%|███████████████████████████▊             |  ETA: 0:00:07
-  TrainingLoss:  0.015608063327593916
-

Progress:  68%|████████████████████████████             |  ETA: 0:00:07
+

Progress:  67%|███████████████████████████▋             |  ETA: 0:00:06
+  TrainingLoss:  0.015731965981298234
+

Progress:  68%|████████████████████████████             |  ETA: 0:00:06
   TrainingLoss:  0.015571504111262176
-

Progress:  69%|████████████████████████████▎            |  ETA: 0:00:06
-  TrainingLoss:  0.01535747458202367
-

Progress:  70%|████████████████████████████▋            |  ETA: 0:00:06
-  TrainingLoss:  0.015155005946412165
+

Progress:  69%|████████████████████████████▍            |  ETA: 0:00:06
+  TrainingLoss:  0.015216197049415917
 

Progress:  70%|████████████████████████████▉            |  ETA: 0:00:06
   TrainingLoss:  0.015021847326137547
-

Progress:  71%|█████████████████████████████▏           |  ETA: 0:00:06
-  TrainingLoss:  0.01498047418815078
-

Progress:  72%|█████████████████████████████▍           |  ETA: 0:00:06
-  TrainingLoss:  0.014843390995077163
-

Progress:  72%|█████████████████████████████▋           |  ETA: 0:00:06
-  TrainingLoss:  0.014603739926525047
-

Progress:  73%|█████████████████████████████▉           |  ETA: 0:00:06
+

Progress:  71%|█████████████████████████████▎           |  ETA: 0:00:06
+  TrainingLoss:  0.014882070595647692
+

Progress:  72%|█████████████████████████████▌           |  ETA: 0:00:05
+  TrainingLoss:  0.014765721426894621
+

Progress:  73%|█████████████████████████████▉           |  ETA: 0:00:05
   TrainingLoss:  0.014603550598535745
-

Progress:  74%|██████████████████████████████▎          |  ETA: 0:00:05
-  TrainingLoss:  0.01439023563791847
-

Progress:  74%|██████████████████████████████▌          |  ETA: 0:00:05
-  TrainingLoss:  0.01423336734781801
+

Progress:  74%|██████████████████████████████▍          |  ETA: 0:00:05
+  TrainingLoss:  0.014267721531651619
 

Progress:  75%|██████████████████████████████▊          |  ETA: 0:00:05
   TrainingLoss:  0.014331711552141727
-

Progress:  76%|███████████████████████████████          |  ETA: 0:00:05
-  TrainingLoss:  0.014013409330169263
-

Progress:  76%|███████████████████████████████▎         |  ETA: 0:00:05
-  TrainingLoss:  0.01408242157975964
-

Progress:  77%|███████████████████████████████▋         |  ETA: 0:00:05
+

Progress:  76%|███████████████████████████████▏         |  ETA: 0:00:05
+  TrainingLoss:  0.013950384301689652
+

Progress:  77%|███████████████████████████████▋         |  ETA: 0:00:04
   TrainingLoss:  0.01408093688187484
-

Progress:  78%|███████████████████████████████▉         |  ETA: 0:00:05
-  TrainingLoss:  0.013818558168160439
-

Progress:  78%|████████████████████████████████▏        |  ETA: 0:00:04
-  TrainingLoss:  0.01367697398957804
+

Progress:  78%|████████████████████████████████         |  ETA: 0:00:04
+  TrainingLoss:  0.01378699822584451
 

Progress:  79%|████████████████████████████████▍        |  ETA: 0:00:04
   TrainingLoss:  0.013549440438040137
-

Progress:  80%|████████████████████████████████▋        |  ETA: 0:00:04
-  TrainingLoss:  0.013426807806641008
-

Progress:  80%|████████████████████████████████▉        |  ETA: 0:00:04
-  TrainingLoss:  0.013372290643131928
+

Progress:  80%|████████████████████████████████▊        |  ETA: 0:00:04
+  TrainingLoss:  0.013412273071652405
 

Progress:  81%|█████████████████████████████████▎       |  ETA: 0:00:04
   TrainingLoss:  0.013328176140989832
-

Progress:  82%|█████████████████████████████████▌       |  ETA: 0:00:04
-  TrainingLoss:  0.013238893272391144
-

Progress:  82%|█████████████████████████████████▊       |  ETA: 0:00:04
-  TrainingLoss:  0.013278033298022615
+

Progress:  82%|█████████████████████████████████▋       |  ETA: 0:00:03
+  TrainingLoss:  0.01340025936965985
 

Progress:  83%|██████████████████████████████████       |  ETA: 0:00:03
   TrainingLoss:  0.013245051519894888
-

Progress:  84%|██████████████████████████████████▎      |  ETA: 0:00:03
-  TrainingLoss:  0.013154241194858857
-

Progress:  84%|██████████████████████████████████▋      |  ETA: 0:00:03
-  TrainingLoss:  0.013149573858188027
+

Progress:  84%|██████████████████████████████████▌      |  ETA: 0:00:03
+  TrainingLoss:  0.013246567117499886
 

Progress:  85%|██████████████████████████████████▉      |  ETA: 0:00:03
   TrainingLoss:  0.012926247174580352
-

Progress:  86%|███████████████████████████████████▏     |  ETA: 0:00:03
-  TrainingLoss:  0.01280929631423787
-

Progress:  86%|███████████████████████████████████▍     |  ETA: 0:00:03
-  TrainingLoss:  0.01272471105609572
-

Progress:  87%|███████████████████████████████████▋     |  ETA: 0:00:03
+

Progress:  86%|███████████████████████████████████▎     |  ETA: 0:00:03
+  TrainingLoss:  0.01276347268643548
+

Progress:  87%|███████████████████████████████████▋     |  ETA: 0:00:02
   TrainingLoss:  0.012702724975240678
-

Progress:  88%|████████████████████████████████████     |  ETA: 0:00:02
-  TrainingLoss:  0.01263513290271608
-

Progress:  88%|████████████████████████████████████▎    |  ETA: 0:00:02
-  TrainingLoss:  0.012752449197258367
+

Progress:  88%|████████████████████████████████████▏    |  ETA: 0:00:02
+  TrainingLoss:  0.012627784321068208
 

Progress:  89%|████████████████████████████████████▌    |  ETA: 0:00:02
   TrainingLoss:  0.012606558868147044
-

Progress:  90%|████████████████████████████████████▊    |  ETA: 0:00:02
-  TrainingLoss:  0.012426964471411332
-

Progress:  90%|█████████████████████████████████████    |  ETA: 0:00:02
-  TrainingLoss:  0.012383712026531914
+

Progress:  90%|████████████████████████████████████▉    |  ETA: 0:00:02
+  TrainingLoss:  0.012386777048435858
 

Progress:  91%|█████████████████████████████████████▎   |  ETA: 0:00:02
   TrainingLoss:  0.012410364144697136
-

Progress:  92%|█████████████████████████████████████▋   |  ETA: 0:00:02
-  TrainingLoss:  0.012527577342140467
-

Progress:  92%|█████████████████████████████████████▉   |  ETA: 0:00:02
-  TrainingLoss:  0.012315353593095175
+

Progress:  92%|█████████████████████████████████████▊   |  ETA: 0:00:01
+  TrainingLoss:  0.012404490403885467
 

Progress:  93%|██████████████████████████████████████▏  |  ETA: 0:00:01
   TrainingLoss:  0.012138139310187578
-

Progress:  94%|██████████████████████████████████████▍  |  ETA: 0:00:01
-  TrainingLoss:  0.012166164573733653
-

Progress:  94%|██████████████████████████████████████▋  |  ETA: 0:00:01
-  TrainingLoss:  0.012176447522759625
+

Progress:  94%|██████████████████████████████████████▌  |  ETA: 0:00:01
+  TrainingLoss:  0.012123159121121198
 

Progress:  95%|███████████████████████████████████████  |  ETA: 0:00:01
   TrainingLoss:  0.012009738657569196
-

Progress:  96%|███████████████████████████████████████▎ |  ETA: 0:00:01
-  TrainingLoss:  0.012070973284929563
-

Progress:  96%|███████████████████████████████████████▌ |  ETA: 0:00:01
-  TrainingLoss:  0.011965496381345492
+

Progress:  96%|███████████████████████████████████████▍ |  ETA: 0:00:01
+  TrainingLoss:  0.01205138131541313
 

Progress:  97%|███████████████████████████████████████▊ |  ETA: 0:00:01
   TrainingLoss:  0.01206445045092103
-

Progress:  98%|████████████████████████████████████████ |  ETA: 0:00:00
-  TrainingLoss:  0.011794511225930305
-

Progress:  98%|████████████████████████████████████████▍|  ETA: 0:00:00
-  TrainingLoss:  0.011790251492423666
+

Progress:  98%|████████████████████████████████████████▏|  ETA: 0:00:00
+  TrainingLoss:  0.011777294044999347
 

Progress:  99%|████████████████████████████████████████▋|  ETA: 0:00:00
   TrainingLoss:  0.011915804680435628
-

Progress:  99%|████████████████████████████████████████▉|  ETA: 0:00:00
-  TrainingLoss:  0.011769304365704187
-

Progress: 100%|█████████████████████████████████████████| Time: 0:00:19
+

Progress: 100%|█████████████████████████████████████████| Time: 0:00:17
   TrainingLoss:  0.011774418870212931

We can also plot the training errors against the epoch (here the $y$-axis is in log-scale):

using Plots
 p1 = plot(g_loss_array, xlabel="Epoch", ylabel="Training error", label="G-SympNet", color=3, yaxis=:log)
-plot!(p1, la_loss_array, label="LA-SympNet", color=2)
Example block output

The train function changes the parameters of the neural networks and returns a vector containing the evolution of the value of the loss function during training. The default values for the arguments ntraining and batch_size are $1000$ and $10$, respectively.

The training data data_q and data_p must be matrices in $\mathbb{R}^{n\times d}$, where $n$ is the length of the data and $d$ is half the dimension of the system, i.e. data_q[i,j] is $q_j(t_i)$, where $(t_1,...,t_n)$ are the corresponding times of the training data.

Then we can make predictions. Let's compare the initial data with a prediction starting from the same phase space point using the provided function Iterate_Sympnet:

ics = (q=qp_data.q[:,1], p=qp_data.p[:,1])
+plot!(p1, la_loss_array, label="LA-SympNet", color=2)
Example block output

The train function changes the parameters of the neural networks and returns a vector containing the evolution of the value of the loss function during training. The default values for the arguments ntraining and batch_size are $1000$ and $10$, respectively.

The training data data_q and data_p must be matrices in $\mathbb{R}^{n\times d}$, where $n$ is the length of the data and $d$ is half the dimension of the system, i.e. data_q[i,j] is $q_j(t_i)$, where $(t_1,...,t_n)$ are the corresponding times of the training data.

Then we can make predictions. Let's compare the initial data with a prediction starting from the same phase space point using the provided function Iterate_Sympnet:

ics = (q=qp_data.q[:,1], p=qp_data.p[:,1])
 
 steps_to_plot = 200
 
@@ -490,4 +426,4 @@
 using Plots
 p2 = plot(qp_data.q'[1:steps_to_plot], qp_data.p'[1:steps_to_plot], label="training data")
 plot!(p2, la_trajectory.q', la_trajectory.p', label="LA Sympnet")
-plot!(p2, g_trajectory.q', g_trajectory.p', label="G Sympnet")
Example block output

We see that GSympNet gives an almost perfect match on the training data, whereas LASympNet cannot even properly replicate the training data. It also takes longer to train LASympNet.

+plot!(p2, g_trajectory.q', g_trajectory.p', label="G Sympnet")
Example block output

We see that GSympNet gives an almost perfect match on the training data, whereas LASympNet cannot even properly replicate the training data. It also takes longer to train LASympNet.

diff --git a/latest/tutorials/volume_preserving_attention/3a82e666.svg b/latest/tutorials/volume_preserving_attention/3a82e666.svg
new file mode 100644
index 000000000..794fe07bb
--- /dev/null
+++ b/latest/tutorials/volume_preserving_attention/3a82e666.svg
@@ -0,0 +1,52 @@
diff --git a/latest/tutorials/volume_preserving_attention/45fa085f.svg b/latest/tutorials/volume_preserving_attention/45fa085f.svg
new file mode 100644
index 000000000..b8ac547e4
--- /dev/null
+++ b/latest/tutorials/volume_preserving_attention/45fa085f.svg
@@ -0,0 +1,54 @@
diff --git a/latest/tutorials/volume_preserving_attention/47124b52.svg b/latest/tutorials/volume_preserving_attention/47124b52.svg
new file mode 100644
index 000000000..7d7d5b206
--- /dev/null
+++ b/latest/tutorials/volume_preserving_attention/47124b52.svg
@@ -0,0 +1,58 @@
diff --git a/latest/tutorials/volume_preserving_attention/5a71b95c.svg b/latest/tutorials/volume_preserving_attention/5a71b95c.svg
new file mode 100644
index 000000000..251adcc20
--- /dev/null
+++ b/latest/tutorials/volume_preserving_attention/5a71b95c.svg
@@ -0,0 +1,54 @@
diff --git a/latest/tutorials/volume_preserving_attention/76b43f5f.svg b/latest/tutorials/volume_preserving_attention/76b43f5f.svg
new file mode 100644
index 000000000..9f0dd881b
--- /dev/null
+++ b/latest/tutorials/volume_preserving_attention/76b43f5f.svg
@@ -0,0 +1,44 @@
diff --git a/latest/tutorials/volume_preserving_attention/9f1e543e.svg b/latest/tutorials/volume_preserving_attention/9f1e543e.svg
new file mode 100644
index 000000000..2c1105576
--- /dev/null
+++ b/latest/tutorials/volume_preserving_attention/9f1e543e.svg
@@ -0,0 +1,56 @@
diff --git a/latest/tutorials/volume_preserving_attention/ae3188fa.svg b/latest/tutorials/volume_preserving_attention/ae3188fa.svg
new file mode 100644
index 000000000..f57adebff
--- /dev/null
+++ b/latest/tutorials/volume_preserving_attention/ae3188fa.svg
@@ -0,0 +1,48 @@
diff --git a/latest/tutorials/volume_preserving_attention/dcef1700.svg b/latest/tutorials/volume_preserving_attention/dcef1700.svg
new file mode 100644
index 000000000..fd7a95f46
--- /dev/null
+++ b/latest/tutorials/volume_preserving_attention/dcef1700.svg
@@ -0,0 +1,46 @@
diff --git a/latest/tutorials/volume_preserving_attention/index.html b/latest/tutorials/volume_preserving_attention/index.html
new file mode 100644
index 000000000..df1de868c
--- /dev/null
+++ b/latest/tutorials/volume_preserving_attention/index.html
@@ -0,0 +1,105 @@
+
+Volume-Preserving Attention · GeometricMachineLearning.jl

Comparison of different VolumePreservingAttention

In the section on volume-preserving attention we mentioned two ways of computing volume-preserving attention: one where we compute the correlations with a skew-symmetric matrix and one where we compute the correlations with an arbitrary matrix. Here we compare the two approaches. When calling the VolumePreservingAttention layer we can specify whether we want to use the skew-symmetric or the arbitrary weighting by setting the keyword skew_sym = true or skew_sym = false, respectively.
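The difference between the two weightings can be illustrated in plain Julia. The following is only a minimal sketch: the helper `skew_part` and the bilinear form `x' * A * y` used as a stand-in for a "correlation" are assumptions made for illustration and are not taken from GeometricMachineLearning. It shows that a skew-symmetric $n\times n$ matrix has $n(n-1)/2$ independent entries (an arbitrary one has $n^2$) and that the associated bilinear form is antisymmetric.

```julia
using LinearAlgebra

# Hypothetical helper (not part of GeometricMachineLearning):
# project an arbitrary square matrix onto its skew-symmetric part.
skew_part(B::AbstractMatrix) = (B - B') / 2

n = 3
B = [1.0 2.0 3.0; 4.0 5.0 6.0; 7.0 8.0 9.0]  # arbitrary weighting
A = skew_part(B)                              # skew-symmetric weighting

# number of independent entries for each choice
n_params_arbitrary = n^2
n_params_skew = n * (n - 1) ÷ 2

x = [1.0, -2.0, 0.5]
y = [0.3, 0.1, 2.0]

# the bilinear form of a skew-symmetric matrix is antisymmetric,
# so every vector has zero "correlation" with itself
corr_xy = x' * A * y
corr_yx = y' * A * x
corr_xx = x' * A * x

println("free entries: skew = $n_params_skew, arbitrary = $n_params_arbitrary")
println("x'Ay = $corr_xy, y'Ax = $corr_yx, x'Ax = $corr_xx")
```

With an arbitrary matrix `B` neither of the two identities has to hold, which is one way of seeing that the arbitrary weighting is the less constrained choice.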

Here we demonstrate the differences between the two approaches for computing correlations. For this we first generate a training set consisting of two collections of curves: (i) sine curves and (ii) cosine curves.

import Random # hide
+
+sine_cosine = zeros(1, 1000, 2)
+sine_cosine[1, :, 1] .= sin.(0.:.1:99.9)
+sine_cosine[1, :, 2] .= cos.(0.:.1:99.9)
+
+
+const dl = DataLoader(Float16.(sine_cosine))
DataLoader{Float16, Array{Float16, 3}, Nothing, :TimeSeries}(Float16[0.0 0.09985 … -0.6675 -0.59;;; 1.0 0.995 … 0.7446 0.8076], nothing, 1, 1000, 2, nothing, nothing)

The third axis (i.e. the parameter axis) has length two, meaning we have two different kinds of curves:

plot(dl.input[1, :, 1], label = "sine")
+plot!(dl.input[1, :, 2], label = "cosine")
Example block output

We want to train a single neural network on both these curves. We compare three networks which are of the following form:

\[\mathtt{network} = \mathcal{NN}_d\circ\Psi\circ\mathcal{NN}_u,\]

where $\mathcal{NN}_u$ refers to a neural network that scales up and $\mathcal{NN}_d$ refers to a neural network that scales down. The up and down scaling is done with simple dense layers:

\[\mathcal{NN}_u(x) = \mathrm{tanh}(a_ux + b_u) \text{ and } \mathcal{NN}_d(x) = a_d^Tx + b_d,\]

where $a_u, b_u, a_d\in\mathbb{R}^\mathrm{ud}$ and $b_d$ is a scalar. Here ud refers to the upscaling dimension. For $\Psi$ we consider three different choices:

  1. a volume-preserving attention with skew-symmetric weighting,
  2. a volume-preserving attention with arbitrary weighting,
  3. an identity layer.
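The up- and down-scaling maps defined above can be written out in a few lines of plain Julia. This is a sketch with made-up parameter values; the names `NN_u`, `NN_d`, `Ψ_identity` and `network` are illustrative and do not refer to the GeometricMachineLearning API. It uses the third choice for $\Psi$ (the identity), for which the network acts on one scalar input at a time.

```julia
# Plain-Julia sketch of the up- and down-scaling maps.
# All names and numbers are illustrative assumptions.
ud = 2                                   # upscaling dimension

a_u = [0.5, -1.0]; b_u = [0.1, 0.2]      # a_u, b_u ∈ ℝ^ud
a_d = [1.0, 2.0];  b_d = 0.3             # a_d ∈ ℝ^ud, b_d is a scalar

NN_u(x::Real) = tanh.(a_u .* x .+ b_u)       # scales up:   ℝ → ℝ^ud
NN_d(z::AbstractVector) = a_d' * z + b_d     # scales down: ℝ^ud → ℝ

Ψ_identity(z) = z                        # third choice for Ψ

# network = NN_d ∘ Ψ ∘ NN_u
network(x) = NN_d(Ψ_identity(NN_u(x)))

y = network(0.7)
println(y)
```

For the two attention-based choices $\Psi$ would instead act on a whole sequence of upscaled vectors, which is what allows those networks to use information from several time steps.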

We further choose a sequence length of 3 (i.e. the network always sees the last 3 time steps) and always predict one step into the future (i.e. the prediction window is set to 1):

const seq_length = 3
+const prediction_window = 1
+
+const upscale_dimension_1 = 2
+
+const T = Float16
+
+function set_up_networks(upscale_dimension::Int = upscale_dimension_1)
+    model_skew = Chain(Dense(1, upscale_dimension, tanh), VolumePreservingAttention(upscale_dimension, seq_length; skew_sym = true),  Dense(upscale_dimension, 1, identity; use_bias = true))
+    model_arb  = Chain(Dense(1, upscale_dimension, tanh), VolumePreservingAttention(upscale_dimension, seq_length; skew_sym = false), Dense(upscale_dimension, 1, identity; use_bias = true))
+    model_comp = Chain(Dense(1, upscale_dimension, tanh), Dense(upscale_dimension, 1, identity; use_bias = true))
+
+    nn_skew = NeuralNetwork(model_skew, CPU(), T)
+    nn_arb  = NeuralNetwork(model_arb,  CPU(), T)
+    nn_comp = NeuralNetwork(model_comp, CPU(), T)
+
+    nn_skew, nn_arb, nn_comp
+end
+
+nn_skew, nn_arb, nn_comp = set_up_networks()
(NeuralNetwork{AbstractNeuralNetworks.UnknownArchitecture, Chain{Tuple{Dense{1, 2, true, typeof(tanh)}, VolumePreservingAttention{2, 2, :skew_sym, 3}, Dense{2, 1, true, typeof(identity)}}}, Tuple{@NamedTuple{W::Matrix{Float16}, b::Vector{Float16}}, @NamedTuple{A::SkewSymMatrix{Float16, Vector{Float16}}}, @NamedTuple{W::Matrix{Float16}, b::Vector{Float16}}}}(AbstractNeuralNetworks.UnknownArchitecture(), Chain{Tuple{Dense{1, 2, true, typeof(tanh)}, VolumePreservingAttention{2, 2, :skew_sym, 3}, Dense{2, 1, true, typeof(identity)}}}((Dense{1, 2, true, typeof(tanh)}(tanh), VolumePreservingAttention{2, 2, :skew_sym, 3}(), Dense{2, 1, true, typeof(identity)}(identity))), ((W = Float16[-0.6455; -1.463;;], b = Float16[-1.624, -0.2177]), (A = Float16[0.0 -0.1244; 0.1244 0.0],), (W = Float16[0.981 0.07996], b = Float16[1.549]))), NeuralNetwork{AbstractNeuralNetworks.UnknownArchitecture, Chain{Tuple{Dense{1, 2, true, typeof(tanh)}, VolumePreservingAttention{2, 2, :arbitrary, 3}, Dense{2, 1, true, typeof(identity)}}}, Tuple{@NamedTuple{W::Matrix{Float16}, b::Vector{Float16}}, @NamedTuple{A::Matrix{Float16}}, @NamedTuple{W::Matrix{Float16}, b::Vector{Float16}}}}(AbstractNeuralNetworks.UnknownArchitecture(), Chain{Tuple{Dense{1, 2, true, typeof(tanh)}, VolumePreservingAttention{2, 2, :arbitrary, 3}, Dense{2, 1, true, typeof(identity)}}}((Dense{1, 2, true, typeof(tanh)}(tanh), VolumePreservingAttention{2, 2, :arbitrary, 3}(), Dense{2, 1, true, typeof(identity)}(identity))), ((W = Float16[-1.342; 0.412;;], b = Float16[0.5933, -0.7686]), (A = Float16[-1.125 -0.3372; -0.0861 -0.9746],), (W = Float16[0.206 -1.019], b = Float16[0.5054]))), NeuralNetwork{AbstractNeuralNetworks.UnknownArchitecture, Chain{Tuple{Dense{1, 2, true, typeof(tanh)}, Dense{2, 1, true, typeof(identity)}}}, Tuple{@NamedTuple{W::Matrix{Float16}, b::Vector{Float16}}, @NamedTuple{W::Matrix{Float16}, b::Vector{Float16}}}}(AbstractNeuralNetworks.UnknownArchitecture(), Chain{Tuple{Dense{1, 2, true, typeof(tanh)}, 
Dense{2, 1, true, typeof(identity)}}}((Dense{1, 2, true, typeof(tanh)}(tanh), Dense{2, 1, true, typeof(identity)}(identity))), ((W = Float16[-0.538; 0.2615;;], b = Float16[-1.312, -1.275]), (W = Float16[0.3884 0.109], b = Float16[0.2737]))))

We expect the third network to not be able to learn anything useful since it cannot resolve time series data: a regular feedforward network only ever sees one datum at a time.
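To make the difference in what the networks see concrete, the following sketch cuts a time series into (input, target) pairs for a given sequence length and prediction window. The helper `make_windows` is hypothetical and only meant as an illustration; it is not how GeometricMachineLearning draws its batches internally.

```julia
# Hypothetical helper: split a series into input windows of length
# `seq_length` and targets of length `prediction_window`.
function make_windows(series::AbstractVector, seq_length::Int, prediction_window::Int)
    n_windows = length(series) - seq_length - prediction_window + 1
    inputs  = [series[i:(i + seq_length - 1)] for i in 1:n_windows]
    targets = [series[(i + seq_length):(i + seq_length + prediction_window - 1)] for i in 1:n_windows]
    inputs, targets
end

series = sin.(0.0:0.1:0.9)                # 10 samples of a sine curve

# attention-based networks: three time steps in, one time step out
inputs, targets = make_windows(series, 3, 1)

# feedforward network: a single datum in, a single datum out
ff_inputs, ff_targets = make_windows(series, 1, 1)

println(length(inputs), " windows of length ", length(inputs[1]))
println(length(ff_inputs), " single-datum pairs")
```

A single value of a sine or cosine curve does not determine whether the curve is currently rising or falling, which is why the single-datum pairs carry too little information for this task.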

Next we train the networks (here we pick a batch size of 30):

function set_up_optimizers(nn_skew, nn_arb, nn_comp)
+    o_skew = Optimizer(AdamOptimizer(T), nn_skew)
+    o_arb  = Optimizer(AdamOptimizer(T), nn_arb)
+    o_comp = Optimizer(AdamOptimizer(T), nn_comp)
+
+    o_skew, o_arb, o_comp
+end
+
+o_skew, o_arb, o_comp = set_up_optimizers(nn_skew, nn_arb, nn_comp)
+
+const n_epochs = 1000
+
+const batch_size = 30
+
+const batch = Batch(batch_size, seq_length, prediction_window)
+const batch2 = Batch(batch_size)
+
+function train_networks!(nn_skew, nn_arb, nn_comp)
+    loss_array_skew = o_skew(nn_skew, dl, batch, n_epochs, TransformerLoss(batch))
+    loss_array_arb  = o_arb( nn_arb,  dl, batch, n_epochs, TransformerLoss(batch))
+    loss_array_comp = o_comp(nn_comp, dl, batch2, n_epochs, FeedForwardLoss())
+
+    loss_array_skew, loss_array_arb, loss_array_comp
+end
+
+loss_array_skew, loss_array_arb, loss_array_comp = train_networks!(nn_skew, nn_arb, nn_comp)
+
+function plot_training_losses(loss_array_skew, loss_array_arb, loss_array_comp)
+    p = plot(loss_array_skew, color = 2, label = "skew", yaxis = :log)
+    plot!(p, loss_array_arb,  color = 3, label = "arb")
+    plot!(p, loss_array_comp, color = 4, label = "comp")
+
+    p
+end
+
+plot_training_losses(loss_array_skew, loss_array_arb, loss_array_comp)
Example block output

Looking at the training errors, we can see that the network with the skew-symmetric weighting is stuck at a relatively high error rate, whereas the loss for the network with the arbitrary weighting is decreasing to a significantly lower level. The feedforward network without the attention mechanism is not able to learn anything useful (as was expected).

The following demonstrates the predictions of our approaches[1]:

initial_condition = dl.input[:, 1:seq_length, 2]
+
+function make_networks_neural_network_integrators(nn_skew, nn_arb, nn_comp)
+    nn_skew = NeuralNetwork(GeometricMachineLearning.DummyTransformer(seq_length), nn_skew.model, nn_skew.params)
+    nn_arb  = NeuralNetwork(GeometricMachineLearning.DummyTransformer(seq_length), nn_arb.model,  nn_arb.params)
+    nn_comp = NeuralNetwork(GeometricMachineLearning.DummyNNIntegrator(), nn_comp.model, nn_comp.params)
+
+    nn_skew, nn_arb, nn_comp
+end
+
+nn_skew, nn_arb, nn_comp = make_networks_neural_network_integrators(nn_skew, nn_arb, nn_comp)
+
+function produce_validation_plot(n_points::Int, nn_skew = nn_skew, nn_arb = nn_arb, nn_comp = nn_comp; initial_condition::Matrix=initial_condition, type = :cos)
+    validation_skew = iterate(nn_skew, initial_condition; n_points = n_points, prediction_window = 1)
+    validation_arb  = iterate(nn_arb,  initial_condition; n_points = n_points, prediction_window = 1)
+    validation_comp = iterate(nn_comp, initial_condition[:, 1]; n_points = n_points)
+
+    p2 = type == :cos ? plot(dl.input[1, 1:n_points, 2], color = 1, label = "reference") : plot(dl.input[1, 1:n_points, 1], color = 1, label = "reference")
+
+    plot!(p2, validation_skew[1, :], color = 2, label = "skew")
+    plot!(p2, validation_arb[1, :], color = 3, label = "arb")
+    plot!(p2, validation_comp[1, :], color = 4, label = "comp")
+    vline!(p2, [seq_length], color = :red, label = "start of prediction")
+
+    p2
+end
+
+p2 = produce_validation_plot(40)
Example block output

In the above plot we can see that the network with the arbitrary weighting performs much better; even though the green line does not fit the blue line very well either, it manages to at least qualitatively reflect the training data. We can also plot the predictions for longer time intervals:

p3 = produce_validation_plot(400)
Example block output
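The curves to the right of the red line are predictions that build on earlier predictions. A minimal sketch of such a rollout is given below, under the assumption that it proceeds by repeatedly feeding the last seq_length values to the network and appending the predicted next value. The function `rollout` and the stand-in predictor `extrapolate` are hypothetical and do not reproduce the library's `iterate` method.

```julia
# Hypothetical autoregressive rollout: `predict_next` maps the last
# `seq_length` values to the next value and stands in for a trained network.
function rollout(predict_next, initial_window::AbstractVector, n_points::Int)
    seq_length = length(initial_window)
    trajectory = float.(initial_window)        # copy of the initial condition
    while length(trajectory) < n_points
        window = trajectory[(end - seq_length + 1):end]
        push!(trajectory, predict_next(window)) # prediction becomes new input
    end
    trajectory
end

# stand-in predictor: linear extrapolation from the last two values
extrapolate(window) = 2 * window[end] - window[end - 1]

trajectory = rollout(extrapolate, [0.0, 0.1, 0.2], 6)
println(trajectory)
```

Because every predicted value is fed back as input, errors can accumulate over long horizons, which is one reason to look at both short and long prediction intervals.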

We can also plot the comparison with the sine function:

initial_condition = dl.input[:, 1:seq_length, 1]
+
+p2 = produce_validation_plot(40, initial_condition = initial_condition, type = :sin)
Example block output

This advantage of the volume-preserving attention with arbitrary weighting may however be due to the fact that the skew-symmetric weighting has fewer learnable parameters: for the upscaling dimension 2 used above, the $2\times2$ skew-symmetric matrix $A$ has only 1 free entry, as opposed to 4 for the arbitrary matrix. If we increase the upscaling dimension the result changes:

const upscale_dimension_2 = 10
+
+nn_skew, nn_arb, nn_comp = set_up_networks(upscale_dimension_2)
+
+o_skew, o_arb, o_comp = set_up_optimizers(nn_skew, nn_arb, nn_comp)
+
+loss_array_skew, loss_array_arb, loss_array_comp = train_networks!(nn_skew, nn_arb, nn_comp)
+
+plot_training_losses(loss_array_skew, loss_array_arb, loss_array_comp)
Example block output
initial_condition = dl.input[:, 1:seq_length, 2]
+
+nn_skew, nn_arb, nn_comp = make_networks_neural_network_integrators(nn_skew, nn_arb, nn_comp)
+
+p2 = produce_validation_plot(40, nn_skew, nn_arb, nn_comp)
Example block output

And for a longer time interval:

p3 = produce_validation_plot(200, nn_skew, nn_arb, nn_comp)
Example block output
  • [1] Here we have to use the architectures DummyTransformer and DummyNNIntegrator to reformulate the three neural networks defined here as NeuralNetworkIntegrators. Normally the user should use the predefined architectures in GeometricMachineLearning, in which case DummyTransformer and DummyNNIntegrator are never needed.