Merge pull request #164 from JuliaGNI/docs_on_hamiltonian_systems
Docs on Hamiltonian systems (among others)
michakraus authored Jul 25, 2024
2 parents e4bc624 + f3af3ee commit 95b5c47
Showing 25 changed files with 660 additions and 132 deletions.
6 changes: 5 additions & 1 deletion docs/make.jl
@@ -162,6 +162,10 @@ makedocs(;
"Global Tangent Spaces" => "arrays/global_tangent_spaces.md",
"Pullbacks" => "pullbacks/computation_of_pullbacks.md",
],
"Structure-Preservation" => [
"Symplecticity" => "structure_preservation/symplecticity.md",
"Volume-Preservation" => "structure_preservation/volume_preservation.md",
],
"Optimizers" => [
"Optimizers" => "optimizers/optimizer_framework.md",
"Global Sections" => "optimizers/manifold_related/global_sections.md",
@@ -173,7 +177,7 @@ makedocs(;
"Special Neural Network Layers" => [
"Sympnet Layers" => "layers/sympnet_gradient.md",
"Volume-Preserving Layers" => "layers/volume_preserving_feedforward.md",
"Attention" => "layers/attention_layer.md",
"(Volume-Preserving) Attention" => "layers/attention_layer.md",
"Multihead Attention" => "layers/multihead_attention_layer.md",
"Linear Symplectic Attention" => "layers/linear_symplectic_attention.md",
],
21 changes: 21 additions & 0 deletions docs/src/GeometricMachineLearning.bib
@@ -72,6 +72,16 @@ @book{bishop1980tensor
address={Mineola, New York}
}

@book{arnold1978mathematical,
title={Mathematical methods of classical mechanics},
author={Arnold, Vladimir Igorevich},
volume={60},
year={1978},
series={Graduate Texts in Mathematics},
publisher={Springer Verlag},
address={Berlin}
}

@book{o1983semi,
title={Semi-Riemannian geometry with applications to relativity},
author={O'neill, Barrett},
@@ -458,4 +468,15 @@ @article{raissi2019physics
pages={686--707},
year={2019},
publisher={Elsevier}
}

@article{kraus2017gempic,
title={GEMPIC: geometric electromagnetic particle-in-cell methods},
author={Kraus, Michael and Kormann, Katharina and Morrison, Philip J and Sonnendr{\"u}cker, Eric},
journal={Journal of Plasma Physics},
volume={83},
number={4},
pages={905830401},
year={2017},
publisher={Cambridge University Press}
}
71 changes: 50 additions & 21 deletions docs/src/layers/attention_layer.md
@@ -1,6 +1,6 @@
# The Attention Layer

The *attention* mechanism was originally developed for image and natural language processing (NLP) tasks. It is motivated by the need to handle time series data in an efficient way[^1]. Its essential idea is to compute correlations between vectors in input sequences. I.e. given sequences
The *attention* mechanism was originally developed for image and natural language processing (NLP) tasks. It is motivated by the need to handle time series data in an efficient way[^1]. Its essential idea is to compute correlations between vectors in input sequences. So given two sequences

```math
(z_q^{(1)}, z_q^{(2)}, \ldots, z_q^{(T)}) \text{ and } (z_p^{(1)}, z_p^{(2)}, \ldots, z_p^{(T)}),
@@ -15,19 +15,19 @@ an attention mechanism computes pair-wise correlations between all combinations

where ``z_q, z_k \in \mathbb{R}^d`` are elements of the input sequences. The learnable parameters are ``W, U \in \mathbb{R}^{n\times{}d}`` and ``v \in \mathbb{R}^n``.

However *multiplicative attention* (see e.g. [vaswani2017attention](@cite))is more straightforward to interpret and cheaper to handle computationally:
However *multiplicative attention* (see e.g. [vaswani2017attention](@cite)) is more straightforward to interpret and cheaper to handle computationally:

```math
(z_q, z_k) \mapsto z_q^TWz_k,
```

where ``W \in \mathbb{R}^{d\times{}d}`` is a learnable weight matrix with respect to which correlations are computed as scalar products. Regardless of the type of attention used, they all try to compute correlations among input sequences on whose basis further computation is performed. Given two input sequences ``Z_q = (z_q^{(1)}, \ldots, z_q^{(T)})`` and ``Z_k = (z_k^{(1)}, \ldots, z_k^{(T)})``, we can arrange the various correlations into a *correlation matrix* ``C\in\mathbb{R}^{T\times{}T}`` with entries ``[C]_{ij} = \mathtt{attention}(z_q^{(i)}, z_k^{(j)})``. In the case of multiplicative attention this matrix is just ``C = Z^TWZ``.
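
As a small illustration (a plain-Julia sketch with random data, not part of the package API), the correlation matrix of multiplicative self-attention is just a matrix product:

```julia
# feature dimension d, sequence length T; Z holds the input sequence column-wise
d, T = 4, 6
Z = randn(d, T)
W = randn(d, d)      # learnable weight matrix (random here, purely for illustration)

C = Z' * W * Z       # correlation matrix: C[i, j] = (z⁽ⁱ⁾)ᵀ W z⁽ʲ⁾
size(C)              # (6, 6), i.e. T × T
```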

## Reweighting of the input sequence
## Reweighting of the Input Sequence

In `GeometricMachineLearning` we always compute *self-attention*, meaning that the two input sequences ``Z_q`` and ``Z_k`` are the same, i.e. ``Z = Z_q = Z_k``.[^2]

[^2]: [Multihead attention](multihead_attention_layer.md) also falls into this category. Here the input ``Z`` is multiplied from the left with several *projection matrices* ``P^Q_i`` and ``P^K_i``, where ``i`` indicates the *head*. For each head we then compute a correlation matrix ``(P^Q_i Z)^T(P^K Z)``.
[^2]: [Multihead attention](@ref "Multihead Attention") also falls into this category. Here the input ``Z`` is multiplied from the left with several *projection matrices* ``P^Q_i`` and ``P^K_i``, where ``i`` indicates the *head*. For each head we then compute a correlation matrix ``(P^Q_i Z)^T(P^K Z)``.

This is then used to reweight the columns in the input sequence ``Z``. For this we first apply a nonlinearity ``\sigma`` onto ``C`` and then multiply ``\sigma(C)`` onto ``Z`` from the right, i.e. the output of the attention layer is ``Z\sigma(C)``. So we perform the following mappings:

@@ -45,10 +45,10 @@ for ``p^{(i)} = [\sigma(C)]_{\bullet{}i}``. What is *learned* during training ar

## Volume-Preserving Attention

The attention layer (and the activation function ``\sigma`` defined for it) in `GeometricMachineLearning` was specifically designed to apply it to data coming from physical systems that can be described through a divergence-free or a symplectic vector field.
Traditionally the nonlinearity in the attention mechanism is a softmax[^3] (see [vaswani2017attention](@cite)) and the self-attention layer performs the following mapping:
The [`VolumePreservingAttention`](@ref) layer (and the activation function ``\sigma`` defined for it) in `GeometricMachineLearning` was specifically designed to apply it to data coming from physical systems that can be described through a divergence-free or a symplectic vector field.
Traditionally the nonlinearity in the attention mechanism is a softmax[^3] [vaswani2017attention](@cite) and the self-attention layer performs the following mapping:

[^3]: The softmax acts on the matrix ``C`` in a vector-wise manner, i.e. it operates on each column of the input matrix ``C = [c^{(1)}, \ldots, c^{(T)}]``. The result is a sequence of probability vectors ``[p^{(1)}, \ldots, p^{(T)}]`` for which ``\sum_{i=1}^Tp^{(j)}_i=1\quad\forall{}j\in\{1,\dots,T\}.``
[^3]: The softmax acts on the matrix ``C`` in a vector-wise manner, i.e. it operates on each column of the input matrix ``C = [c^{(1)}, \ldots, c^{(T)}]``. The result is a sequence of probability vectors ``[y^{(1)}, \ldots, y^{(T)}]`` for which ``\sum_{i=1}^Ty^{(j)}_i=1\quad\forall{}j\in\{1,\dots,T\}.``

```math
Z := [z^{(1)}, \ldots, z^{(T)}] \mapsto Z\mathrm{softmax}(Z^TWZ).
@@ -60,9 +60,15 @@ The softmax activation acts vector-wise, i.e. if we supply it with a matrix ``C`
\mathrm{softmax}(C) = [\mathrm{softmax}(c_{\bullet{}1}), \ldots, \mathrm{softmax}(c_{\bullet{}T})].
```

The output of a softmax is a *probability vector* (also called *stochastic vector*) and the matrix ``P = [p^{(1)}, \ldots, p^{(T)}]``, where each column is a probability vector, is sometimes referred to as a *stochastic matrix* (see [jacobs1992discrete](@cite)). This attention mechanism finds application in *transformer neural networks* [vaswani2017attention](@cite). The problem with this matrix from a geometric point of view is that all the columns are independent of each other and the nonlinear transformation could in theory produce a stochastic matrix for which all columns are identical and thus lead to a loss of information. So the softmax activation function is inherently non-geometric.
The output of a softmax is a *probability vector* (also called *stochastic vector*) and the matrix ``Y = [y^{(1)}, \ldots, y^{(T)}]``, where each column is a probability vector, is sometimes referred to as a *stochastic matrix* (see [jacobs1992discrete](@cite)). This attention mechanism finds application in *transformer neural networks* [vaswani2017attention](@cite). The problem with this matrix from a geometric point of view is that all the columns are independent of each other and the nonlinear transformation could in theory produce a stochastic matrix for which all columns are identical and thus lead to a loss of information. So the softmax activation function is inherently non-geometric. We visualize this with the figure below:

Besides the traditional attention mechanism `GeometricMachineLearning` therefore also has a volume-preserving transformation that fulfills a similar role. There are two approaches implemented to realize similar transformations. Both of them however utilize the *Cayley transform* to produce orthogonal matrices ``\sigma(C)`` instead of stochastic matrices. For an orthogonal matrix ``\Sigma`` we have ``\Sigma^T\Sigma = \mathbb{I}``, so all the columns are linearly independent which is not necessarily true for a stochastic matrix ``P``. The following explains how this new activation function is implemented.
```@example
Main.include_graphics("../tikz/convex_recombination") # hide
```

So the ``y`` coefficients responsible for producing the first output vector are independent from those producing the second output vector etc.: each output column ``j`` is a convex combination ``\sum_{i=1}^Ty^{(j)}_iz_\mu^{(i)}`` of the input vectors (with ``\sum_{i=1}^Ty^{(j)}_i = 1``), but the coefficients for two different columns are completely independent of each other.
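
The following plain-Julia sketch (random data, hand-written column-wise softmax, independent of the package) makes this convex recombination explicit:

```julia
# column-wise softmax, written out by hand for illustration
function softmax_columns(C::AbstractMatrix)
    E = exp.(C .- maximum(C; dims = 1))   # shift by the column maximum for numerical stability
    return E ./ sum(E; dims = 1)
end

d, T = 4, 6
Z = randn(d, T)
W = randn(d, d)

Y = softmax_columns(Z' * W * Z)   # stochastic matrix: every column is a probability vector
output = Z * Y                    # every output column is a convex combination of the columns of Z
sum(Y; dims = 1)                  # ≈ [1 1 1 1 1 1]
```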

Besides the traditional attention mechanism `GeometricMachineLearning` therefore also has a volume-preserving transformation that fulfills a similar role. There are two approaches implemented to realize similar transformations. Both of them however utilize the *Cayley transform* to produce orthogonal matrices ``\sigma(C)`` instead of stochastic matrices. For an orthogonal matrix ``\Sigma`` we have ``\Sigma^T\Sigma = \mathbb{I}``, so all the columns are linearly independent, which is not necessarily true for a stochastic matrix ``Y``. In the following we explain how this new activation function is implemented. First we need to briefly discuss the *Cayley transform*.

### The Cayley transform

@@ -77,10 +83,10 @@ The Cayley transform maps from skew-symmetric matrices to orthonormal matrices[^
We can easily check that ``\mathrm{Cayley}(A)`` is orthogonal if ``A`` is skew-symmetric. For this consider ``\varepsilon \mapsto A(\varepsilon)\in\mathcal{S}_\mathrm{skew}`` with ``A(0) = \mathbb{O}`` and ``A'(0) = B``. Then we have:

```math
\frac{\delta\mathrm{Cayley}}{\delta{}A} = \frac{d}{d\varepsilon}|_{\varepsilon=0} \mathrm{Cayley}(A(\varepsilon))^T \mathrm{Cayley}(A(\varepsilon)) = \mathbb{O}.
\frac{\delta(\mathrm{Cayley}(A)^T\mathrm{Cayley}(A))}{\delta{}A} = \frac{d}{d\varepsilon}|_{\varepsilon=0} \mathrm{Cayley}(A(\varepsilon))^T \mathrm{Cayley}(A(\varepsilon)) = A'(0)^T + A'(0) = \mathbb{O},
```

In order to use the Cayley transform as an activation function we further need a mapping from the input ``Z`` to a skew-symmetric matrix. This is realized in two ways in `GeometricMachineLearning`: via a scalar-product with a skew-symmetric weighting and via a scalar-product with an arbitrary weighting.
So ``\mathrm{Cayley}(A)^T\mathrm{Cayley}(A)`` does not change as ``\varepsilon`` varies. In order to use the Cayley transform as an activation function we further need a mapping from the input ``Z`` to a skew-symmetric matrix. This is realized in two ways in `GeometricMachineLearning`: via a scalar-product with a skew-symmetric weighting and via a scalar-product with an arbitrary weighting.
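
This property is also easy to verify numerically. The sketch below uses one common convention for the Cayley transform, written out explicitly in plain Julia rather than taken from the package:

```julia
using LinearAlgebra

# one common convention for the Cayley transform; the factor 1/2 is a choice
cayley(A) = (I - A / 2) \ (I + A / 2)

B = randn(5, 5)
A = B - B'              # a random skew-symmetric matrix

Σ = cayley(A)
norm(Σ' * Σ - I)        # ≈ 0: Σ is orthogonal
```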

### First approach: scalar products with a skew-symmetric weighting

@@ -89,17 +95,17 @@ For this the attention layer is modified in the following way:
```math
Z := [z^{(1)}, \ldots, z^{(T)}] \mapsto Z\sigma(Z^TAZ),
```
where ``\sigma(C)=\mathrm{Cayley}(C)`` and ``A`` is a skew-symmetric matrix that is learnable, i.e. the parameters of the attention layer are stored in ``A``.
where ``\sigma(C)=\mathrm{Cayley}(C)`` and ``A`` is a matrix of type [`SkewSymMatrix`](@ref) that is learnable, i.e. the parameters of the attention layer are stored in ``A``.
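
A rough sketch of this first approach with random data (plain Julia; the actual layer handles the skew-symmetric weight and batched inputs differently, this only illustrates the mathematics):

```julia
using LinearAlgebra

cayley(C) = (I - C / 2) \ (I + C / 2)

d, T = 4, 6
Z = randn(d, T)
B = randn(d, d)
A = (B - B') / 2        # skew-symmetric weight; in the package this is a SkewSymMatrix

C = Z' * A * Z          # skew-symmetric, since A is
Σ = cayley(C)           # orthogonal reweighting matrix
output = Z * Σ

norm(C + C')            # ≈ 0
norm(Σ' * Σ - I)        # ≈ 0
```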

### Second approach: scalar products with an arbitrary weighting

For this approach we compute correlations between the input vectors with a skew-symmetric weighting. The correlations we consider here are based on:
For this approach we compute correlations between the input vectors based on a scalar product with an arbitrary weighting. This arbitrary ``d\times{}d`` matrix ``A`` constitutes the learnable parameters of the attention layer. The correlations we consider here are based on:

```math
(z^{(2)})^TAz^{(1)}, (z^{(3)})^TAz^{(1)}, \ldots, (z^{(T)})^TAz^{(1)}, (z^{(3)})^TAz^{(2)}, \ldots, (z^{(T)})^TAz^{(2)}, \ldots, (z^{(T)})^TAz^{(T-1)}.
```

So in total we consider correlations ``(z^{(i)})^Tz^{(j)}`` for which ``i > j``. We now arrange these correlations into a skew-symmetric matrix:
So we consider correlations ``(z^{(i)})^TAz^{(j)}`` for which ``i > j``. We now arrange these correlations into a skew-symmetric matrix:

```math
C = \begin{bmatrix}
@@ -110,7 +116,13 @@ C = \begin{bmatrix}
\end{bmatrix}.
```

This correlation matrix can now again be used as an input for the Cayley transform to produce an orthogonal matrix.
This correlation matrix can now again be used as an input for the Cayley transform to produce an orthogonal matrix. Mathematically this is equivalent to first computing all correlations ``Z^TAZ`` and then copying the strictly lower-triangular entries to the upper triangle with flipped sign, so that the result is skew-symmetric. This is visualized below:

```@example
Main.include_graphics("../tikz/skew_sym_mapping") # hide
```

Internally `GeometricMachineLearning` computes this more efficiently with the function [`GeometricMachineLearning.tensor_mat_skew_sym_assign`](@ref).
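
As a rough plain-Julia sketch of this construction (the library function additionally handles tensors with a batch dimension and, as stated above, is implemented more efficiently; the ``d\times{}d`` weight size follows from the formulas above):

```julia
using LinearAlgebra

d, T = 4, 6
Z = randn(d, T)
A = randn(d, d)         # arbitrary weight matrix

M = Z' * A * Z          # all pairwise correlations
L = tril(M, -1)         # keep the strictly lower-triangular part
C = L - L'              # skew-symmetric correlation matrix

norm(C + C')            # ≈ 0
```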

## How is structure preserved?

@@ -124,13 +136,14 @@ Z = \left[\begin{array}{cccc}
\cdots & \cdots & \cdots & \cdots \\
z_d^{(1)} & z_d^{(2)} & \cdots & z_d^{(T)}
\end{array}\right] \mapsto
\left[\begin{array}{c} z_1^{(1)} \\ z_1^{(2)} \\ \cdots \\ z_1^{(T)} \\ z_2^{(1)} \\ \cdots \\ z_d^{(T)} \end{array}\right] =: Z_\mathrm{vec}.
\left[\begin{array}{c} z_1^{(1)} \\ z_1^{(2)} \\ \cdots \\ z_1^{(T)} \\ z_2^{(1)} \\ \cdots \\ z_d^{(T)} \end{array}\right] =: Z_\mathrm{vec},
```

The inverse of ``Z \mapsto \hat{Z} `` we refer to as ``Y \mapsto \tilde{Y}``. In the following we also write ``\hat{\varphi}`` for the mapping ``\,\hat{}\circ\varphi\circ\tilde{}\,``.
so we arrange the rows consecutively into a vector ``\hat{Z} := Z_\mathrm{vec}``. The inverse of ``Z \mapsto \hat{Z}`` we refer to as ``Y \mapsto \tilde{Y}``. In the following we also write ``\hat{\varphi}`` for the mapping ``\,\hat{}\circ\varphi\circ\tilde{}\,``.

__DEFINITION__:
We say that a mapping ``\varphi: \times_\text{$T$ times}\mathbb{R}^{d} \to \times_\text{$T$ times}\mathbb{R}^{d}`` is **volume-preserving** if the associated ``\hat{\varphi}`` is volume-preserving.
```@eval
Main.definition(raw"We say that a mapping ``\varphi: \times_\text{$T$ times}\mathbb{R}^{d} \to \times_\text{$T$ times}\mathbb{R}^{d}`` is **volume-preserving** if the associated ``\hat{\varphi}`` is volume-preserving.")
```

In the transformed coordinate system (in terms of the vector ``Z_\mathrm{vec}`` defined above) this is equivalent to multiplication by a sparse matrix ``\tilde\Lambda(Z)`` from the left:

@@ -145,13 +158,29 @@ In the transformed coordinate system (in terms of the vector ``Z_\mathrm{vec}``
\left[\begin{array}{c} z_1^{(1)} \\ z_1^{(2)} \\ \ldots \\ z_1^{(T)} \\ z_2^{(1)} \\ \ldots \\ z_d^{(T)} \end{array}\right] .
```

``\tilde{\Lambda}(Z)`` in m[eq:LambdaApplication]m(@latex) is easily shown to be an orthogonal matrix.
``\tilde{\Lambda}(Z)`` is easily shown to be an orthogonal matrix and a symplectic matrix, i.e. it satisfies

```math
\tilde{\Lambda}(Z)^T\tilde{\Lambda}(Z) = \mathbb{I}
```

and

```math
\tilde{\Lambda}(Z)^T\mathbb{J}\tilde{\Lambda}(Z) = \mathbb{J}.
```
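
Orthogonality and the determinant condition behind volume preservation can be checked numerically. The following is an illustrative plain-Julia sketch, not a proof (and it does not check the symplecticity claim):

```julia
using LinearAlgebra

cayley(C) = (I - C / 2) \ (I + C / 2)

d, T = 4, 6
Z = randn(d, T)
B = randn(d, d)
A = (B - B') / 2

Σ = cayley(Z' * A * Z)

norm(Σ' * Σ - I)        # ≈ 0: the reweighting matrix is orthogonal
det(Σ)                  # ≈ 1: the map Z ↦ ZΣ is volume-preserving in the sense above
```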


## Historical Note

Attention was used before, but always in connection with **recurrent neural networks** (see [luong2015effective](@cite) and [bahdanau2014neural](@cite)).
Attention was used before the transformer was introduced, but mostly in connection with *recurrent neural networks* (see [luong2015effective](@cite) and [bahdanau2014neural](@cite)).

## Library Functions

```@docs; canonical = false
GeometricMachineLearning.tensor_mat_skew_sym_assign
VolumePreservingAttention
```

## References

21 changes: 13 additions & 8 deletions docs/src/layers/linear_symplectic_attention.md
@@ -2,18 +2,21 @@

The attention layer introduced here is an extension of the [Sympnet gradient layer](@ref "SympNet Gradient Layer") to the setting where we deal with time series data. We first have to define a notion of symplecticity for [multi-step methods](@ref "Multi-step methods").

This definition is essentially taken from [feng1987symplectic, ge1988approximation](@cite) and similar to the definition of volume-preservation in [brantner2024volume](@cite).
This definition is different from [feng1987symplectic, ge1988approximation](@cite), but similar to the definition of volume-preservation in [brantner2024volume](@cite)[^1].

[^1]: This definition is also recalled in the section on [volume-preserving attention](@ref "How is structure preserved?").

```@eval
Main.definition(raw"""
A multi-step method ``\times_T\mathbb{R}^{2n}\to\times_T\mathbb{R}^{2n}`` is called **symplectic** if it preserves the the symplectic product structure.
""")
A multi-step method ``\varphi: \times_T\mathbb{R}^{2n}\to\times_T\mathbb{R}^{2n}`` is called **symplectic** if it preserves the symplectic product structure, i.e. if ``\hat{\varphi}`` is symplectic.""")
```

The *symplectic product structure* is the following skew-symmetric non-degenerate bilinear form:

```math
\mathbb{J}([z^{(1)}, \ldots, z^{(T)}], [\tilde{z}^{(1)}, \ldots, \tilde{z}^{(T)}]) := \sum_{i=1}^T (z^{(i)})^T\tilde{z}^{(i)}.
```@eval
Main.remark(raw"The **symplectic product structure** is the following skew-symmetric non-degenerate bilinear form:
" * Main.indentation * raw"```math
" * Main.indentation * raw"\hat{\mathbb{J}}([z^{(1)}, \ldots, z^{(T)}], [\tilde{z}^{(1)}, \ldots, \tilde{z}^{(T)}]) := \sum_{i=1}^T (z^{(i)})^T\tilde{z}^{(i)}.
" * Main.indentation * raw"```
" * Main.indentation * raw"``\hat{\mathbb{J}}`` is defined through the isomorphism between the product space and the space of big vectors ``\hat{}: \times_\text{($T$ times)}\mathbb{R}^{d}\stackrel{\approx}{\longrightarrow}\mathbb{R}^{dT}``.")
```
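
A minimal plain-Julia sketch of this form and its skew-symmetry (`J2n` below stands for the canonical symplectic matrix ``\mathbb{J}_{2n}``; the column-wise arrangement of the sequences is an assumption made purely for illustration):

```julia
using LinearAlgebra

n, T = 3, 5
J2n = [zeros(n, n) Matrix(I, n, n); -Matrix(I, n, n) zeros(n, n)]   # canonical symplectic matrix on ℝ²ⁿ

# symplectic product of two sequences, arranged column-wise as 2n × T matrices
symplectic_product(Z, Zt) = sum(Z[:, i]' * J2n * Zt[:, i] for i in axes(Z, 2))

Z1, Z2 = randn(2n, T), randn(2n, T)
symplectic_product(Z1, Z2) + symplectic_product(Z2, Z1)   # ≈ 0: the form is skew-symmetric
```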

In order to construct a symplectic attention mechanism we extend the principle [SympNet gradient layer](@ref "SympNet Gradient Layer"), i.e. we construct scalar functions that only depend on ``[q^{(1)}, \ldots, q^{(T)}]`` or ``[p^{(1)}, \ldots, p^{(T)}]``. The specific choice we make here is the following:
@@ -28,12 +31,14 @@ where ``Q := [q^{(1)}, \ldots, q^{(T)}]``. We therefore have for the gradient:
\nabla_Qf = \frac{1}{2}Q(A + A^T) =: Q\bar{A},
```

where ``A\in\mathcal{S}_\mathrm{skew}(T). So the map performs:
where ``\bar{A} := \frac{1}{2}(A + A^T)`` is symmetric. So the map performs:

```math
[q^{(1)}, \ldots, q^{(T)}] \mapsto \left[ \sum_{i=1}^Ta_{1i}q^{(i)}, \ldots, \sum_{i=1}^Ta_{Ti}q^{(i)} \right].
```
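
To see that the gradient above is consistent, here is a small finite-difference check in plain Julia. It assumes ``f(Q) = \frac{1}{2}\mathrm{tr}(QAQ^T)``, which reproduces the stated gradient; this is an illustration, not the package implementation:

```julia
using LinearAlgebra

n, T = 3, 5
Q = randn(n, T)
A = randn(T, T)

f(Q) = tr(Q * A * Q') / 2
grad_analytic = Q * (A + A') / 2            # = Q * Ā from the formula above

# finite-difference check of one entry of the gradient
ε = 1e-6
E = zeros(n, T); E[2, 3] = 1.0
grad_fd = (f(Q + ε * E) - f(Q - ε * E)) / (2ε)

isapprox(grad_analytic[2, 3], grad_fd; atol = 1e-6)   # true
```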

Note that there is still a reweighting of the input vectors performed with this linear symplectic attention, like in [standard attention](@ref "Reweighting of the Input Sequence") and [volume-preserving attention](@ref "Volume-Preserving Attention"). The crucial difference is that the coefficients ``a`` here do not depend on the input, so the output is a linear function of the input vectors, whereas the coefficients ``y`` of the [standard and volume-preserving attention layers](@ref "The Attention Layer") depend on the input nonlinearly. We hence call this attention mechanism *linear symplectic attention* to distinguish it from the standard attention mechanism.

## Library Functions

```@docs; canonical=false