Docs on Hamiltonian systems (among others) #164

Merged
merged 23 commits into from
Jul 25, 2024
5979f7b
Added section on Hamiltonian systems.
benedict-96 Jul 4, 2024
8361641
Added gempic and Arnold book.
benedict-96 Jul 4, 2024
cb0878d
which -> with.
benedict-96 Jul 4, 2024
e92c513
Put Riemannian gradient into a definition environment.
benedict-96 Jul 4, 2024
0351bc3
Increased epochs for SympNet tutorial.
benedict-96 Jul 4, 2024
87552a9
Added description of QPT.
benedict-96 Jul 4, 2024
3635014
Fixed citation.
benedict-96 Jul 4, 2024
416e87c
Added description of volume-preserving systems.
benedict-96 Jul 4, 2024
fb8eabf
Added transitional sentence to volume-preservation and added headers …
benedict-96 Jul 4, 2024
8dd1afd
Added picture that shows convex reweighting.
benedict-96 Jul 14, 2024
f608163
Fixed typo: q -> p. and improved docstring.
benedict-96 Jul 14, 2024
3cc31be
Added visualization of skew-symmetrization operation.
benedict-96 Jul 14, 2024
c5eade8
Changed the wording of some phrases and added new figure.
benedict-96 Jul 14, 2024
a9f36fe
Fixed number of type parameters in function definition.
benedict-96 Jul 14, 2024
6af1587
Improved colors for dark mode.
benedict-96 Jul 15, 2024
0a77459
Improve documentation.
benedict-96 Jul 15, 2024
49bc71d
Improved docstrings.
benedict-96 Jul 15, 2024
6d6b471
Updated docstring and removed geodesic option.:
benedict-96 Jul 15, 2024
e35da50
Fixed typo (A+ reduced).
benedict-96 Jul 17, 2024
0f0e327
This part is mainly on the volume-preserving attention module.
benedict-96 Jul 17, 2024
2641cfe
Expanded docs and fixed some issues.
benedict-96 Jul 17, 2024
8fde7e4
Added reference.
benedict-96 Jul 18, 2024
f3af3ee
Fixed \sub -> \subset.
benedict-96 Jul 24, 2024
6 changes: 5 additions & 1 deletion docs/make.jl
@@ -162,6 +162,10 @@ makedocs(;
"Global Tangent Spaces" => "arrays/global_tangent_spaces.md",
"Pullbacks" => "pullbacks/computation_of_pullbacks.md",
],
"Structure-Preservation" => [
"Symplecticity" => "structure_preservation/symplecticity.md",
"Volume-Preservation" => "structure_preservation/volume_preservation.md",
],
"Optimizers" => [
"Optimizers" => "optimizers/optimizer_framework.md",
"Global Sections" => "optimizers/manifold_related/global_sections.md",
@@ -173,7 +177,7 @@ makedocs(;
"Special Neural Network Layers" => [
"Sympnet Layers" => "layers/sympnet_gradient.md",
"Volume-Preserving Layers" => "layers/volume_preserving_feedforward.md",
"Attention" => "layers/attention_layer.md",
"(Volume-Preserving) Attention" => "layers/attention_layer.md",
"Multihead Attention" => "layers/multihead_attention_layer.md",
"Linear Symplectic Attention" => "layers/linear_symplectic_attention.md",
],
21 changes: 21 additions & 0 deletions docs/src/GeometricMachineLearning.bib
@@ -72,6 +72,16 @@ @book{bishop1980tensor
address={Mineola, New York}
}

@book{arnold1978mathematical,
title={Mathematical methods of classical mechanics},
author={Arnold, Vladimir Igorevich},
volume={60},
year={1978},
series={Graduate Texts in Mathematics},
publisher={Springer Verlag},
address={Berlin}
}

@book{o1983semi,
title={Semi-Riemannian geometry with applications to relativity},
author={O'neill, Barrett},
@@ -458,4 +468,15 @@ @article{raissi2019physics
pages={686--707},
year={2019},
publisher={Elsevier}
}

@article{kraus2017gempic,
title={GEMPIC: geometric electromagnetic particle-in-cell methods},
author={Kraus, Michael and Kormann, Katharina and Morrison, Philip J and Sonnendr{\"u}cker, Eric},
journal={Journal of Plasma Physics},
volume={83},
number={4},
pages={905830401},
year={2017},
publisher={Cambridge University Press}
}
71 changes: 50 additions & 21 deletions docs/src/layers/attention_layer.md
@@ -1,6 +1,6 @@
# The Attention Layer

The *attention* mechanism was originally developed for image and natural language processing (NLP) tasks. It is motivated by the need to handle time series data in an efficient way[^1]. Its essential idea is to compute correlations between vectors in input sequences. I.e. given sequences
The *attention* mechanism was originally developed for image and natural language processing (NLP) tasks. It is motivated by the need to handle time series data in an efficient way[^1]. Its essential idea is to compute correlations between vectors in input sequences. So given two sequences

```math
(z_q^{(1)}, z_q^{(2)}, \ldots, z_q^{(T)}) \text{ and } (z_k^{(1)}, z_k^{(2)}, \ldots, z_k^{(T)}),
@@ -15,19 +15,19 @@ an attention mechanism computes pair-wise correlations between all combinations

where ``z_q, z_k \in \mathbb{R}^d`` are elements of the input sequences. The learnable parameters are ``W, U \in \mathbb{R}^{n\times{}d}`` and ``v \in \mathbb{R}^n``.

However *multiplicative attention* (see e.g. [vaswani2017attention](@cite))is more straightforward to interpret and cheaper to handle computationally:
However *multiplicative attention* (see e.g. [vaswani2017attention](@cite)) is more straightforward to interpret and cheaper to handle computationally:

```math
(z_q, z_k) \mapsto z_q^TWz_k,
```

where ``W \in \mathbb{R}^{d\times{}d}`` is a learnable weight matrix with respect to which correlations are computed as scalar products. Regardless of the type of attention used, they all try to compute correlations among input sequences on whose basis further computation is performed. Given two input sequences ``Z_q = (z_q^{(1)}, \ldots, z_q^{(T)})`` and ``Z_k = (z_k^{(1)}, \ldots, z_k^{(T)})``, we can arrange the various correlations into a *correlation matrix* ``C\in\mathbb{R}^{T\times{}T}`` with entries ``[C]_{ij} = \mathtt{attention}(z_q^{(i)}, z_k^{(j)})``. In the case of multiplicative attention this matrix is just ``C = Z^TWZ``.
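The correlation matrix for multiplicative self-attention can be illustrated with a few lines of plain Julia. This is only a minimal sketch: the sizes `d` and `T` and the contents of `Z` and `W` are made up for illustration, and nothing here uses the `GeometricMachineLearning` API.

```julia
using LinearAlgebra

# Hypothetical sizes: d is the dimension of each vector, T the sequence length.
d, T = 3, 4

# The columns of Z are the sequence elements z^(1), ..., z^(T).
Z = reshape(collect(1.0:d*T), d, T) ./ 10

# W stands in for the learnable d×d weight matrix.
W = Matrix(1.0I, d, d) + 0.1 * ones(d, d)

# Correlation matrix for multiplicative self-attention: C = ZᵀWZ.
C = Z' * W * Z

# Entry (i, j) is the W-weighted scalar product of columns i and j.
i, j = 2, 3
cij = dot(Z[:, i], W * Z[:, j])
```

Each entry of `C` is a single weighted scalar product, so `C` has size `T × T` regardless of `d`.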

## Reweighting of the input sequence
## Reweighting of the Input Sequence

In `GeometricMachineLearning` we always compute *self-attention*, meaning that the two input sequences ``Z_q`` and ``Z_k`` are the same, i.e. ``Z = Z_q = Z_k``.[^2]

[^2]: [Multihead attention](multihead_attention_layer.md) also falls into this category. Here the input ``Z`` is multiplied from the left with several *projection matrices* ``P^Q_i`` and ``P^K_i``, where ``i`` indicates the *head*. For each head we then compute a correlation matrix ``(P^Q_i Z)^T(P^K Z)``.
[^2]: [Multihead attention](@ref "Multihead Attention") also falls into this category. Here the input ``Z`` is multiplied from the left with several *projection matrices* ``P^Q_i`` and ``P^K_i``, where ``i`` indicates the *head*. For each head we then compute a correlation matrix ``(P^Q_i Z)^T(P^K_i Z)``.

This is then used to reweight the columns in the input sequence ``Z``. For this we first apply a nonlinearity ``\sigma`` onto ``C`` and then multiply ``\sigma(C)`` onto ``Z`` from the right, i.e. the output of the attention layer is ``Z\sigma(C)``. So we perform the following mappings:

@@ -45,10 +45,10 @@ for ``p^{(i)} = [\sigma(C)]_{\bullet{}i}``. What is *learned* during training ar

## Volume-Preserving Attention

The attention layer (and the activation function ``\sigma`` defined for it) in `GeometricMachineLearning` was specifically designed to apply it to data coming from physical systems that can be described through a divergence-free or a symplectic vector field.
Traditionally the nonlinearity in the attention mechanism is a softmax[^3] (see [vaswani2017attention](@cite)) and the self-attention layer performs the following mapping:
The [`VolumePreservingAttention`](@ref) layer (and the activation function ``\sigma`` defined for it) in `GeometricMachineLearning` was specifically designed to be applied to data coming from physical systems that can be described through a divergence-free or a symplectic vector field.
Traditionally the nonlinearity in the attention mechanism is a softmax[^3] [vaswani2017attention](@cite) and the self-attention layer performs the following mapping:

[^3]: The softmax acts on the matrix ``C`` in a vector-wise manner, i.e. it operates on each column of the input matrix ``C = [c^{(1)}, \ldots, c^{(T)}]``. The result is a sequence of probability vectors ``[p^{(1)}, \ldots, p^{(T)}]`` for which ``\sum_{i=1}^Tp^{(j)}_i=1\quad\forall{}j\in\{1,\dots,T\}.``
[^3]: The softmax acts on the matrix ``C`` in a vector-wise manner, i.e. it operates on each column of the input matrix ``C = [c^{(1)}, \ldots, c^{(T)}]``. The result is a sequence of probability vectors ``[y^{(1)}, \ldots, y^{(T)}]`` for which ``\sum_{i=1}^Ty^{(j)}_i=1\quad\forall{}j\in\{1,\dots,T\}.``

```math
Z := [z^{(1)}, \ldots, z^{(T)}] \mapsto Z\mathrm{softmax}(Z^TWZ).
@@ -60,9 +60,15 @@ The softmax activation acts vector-wise, i.e. if we supply it with a matrix ``C`
\mathrm{softmax}(C) = [\mathrm{softmax}(c_{\bullet{}1}), \ldots, \mathrm{softmax}(c_{\bullet{}T})].
```

The output of a softmax is a *probability vector* (also called *stochastic vector*) and the matrix ``P = [p^{(1)}, \ldots, p^{(T)}]``, where each column is a probability vector, is sometimes referred to as a *stochastic matrix* (see [jacobs1992discrete](@cite)). This attention mechanism finds application in *transformer neural networks* [vaswani2017attention](@cite). The problem with this matrix from a geometric point of view is that all the columns are independent of each other and the nonlinear transformation could in theory produce a stochastic matrix for which all columns are identical and thus lead to a loss of information. So the softmax activation function is inherently non-geometric.
The output of a softmax is a *probability vector* (also called *stochastic vector*) and the matrix ``Y = [y^{(1)}, \ldots, y^{(T)}]``, where each column is a probability vector, is sometimes referred to as a *stochastic matrix* (see [jacobs1992discrete](@cite)). This attention mechanism finds application in *transformer neural networks* [vaswani2017attention](@cite). The problem with this matrix from a geometric point of view is that all the columns are independent of each other and the nonlinear transformation could in theory produce a stochastic matrix for which all columns are identical and thus lead to a loss of information. So the softmax activation function is inherently non-geometric. We visualize this with the figure below:

Besides the traditional attention mechanism `GeometricMachineLearning` therefore also has a volume-preserving transformation that fulfills a similar role. There are two approaches implemented to realize similar transformations. Both of them however utilize the *Cayley transform* to produce orthogonal matrices ``\sigma(C)`` instead of stochastic matrices. For an orthogonal matrix ``\Sigma`` we have ``\Sigma^T\Sigma = \mathbb{I}``, so all the columns are linearly independent which is not necessarily true for a stochastic matrix ``P``. The following explains how this new activation function is implemented.
```@example
Main.include_graphics("../tikz/convex_recombination") # hide
```

So the ``y`` coefficients responsible for producing the first output vector are independent of those producing the second output vector, and so on. The condition ``\sum_{i=1}^Ty^{(j)}_i = 1`` is imposed on them for each column ``j``, but the coefficients for two different columns are independent of each other.
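The column-wise action of the softmax, and the degenerate case in which all columns coincide, can be sketched as follows. The helper names `softmax` and `columnwise_softmax` are defined here only for illustration and are not taken from `GeometricMachineLearning`.

```julia
using LinearAlgebra

# Numerically stable softmax of a single vector.
function softmax(c::AbstractVector)
    e = exp.(c .- maximum(c))
    return e ./ sum(e)
end

# Apply the softmax to every column of C and reassemble the matrix.
columnwise_softmax(C::AbstractMatrix) = reduce(hcat, [softmax(c) for c in eachcol(C)])

C = [ 1.0 0.0  2.0;
      0.5 0.0 -1.0;
     -2.0 0.0  0.3]
Y = columnwise_softmax(C)

# Each column of Y is a probability vector, so every column sums to one.
column_sums = vec(sum(Y; dims = 1))

# If all columns of C are equal, all columns of Y are equal as well. Z * Y then
# repeats the same convex combination T times, and Y has rank one.
Y_degenerate = columnwise_softmax(ones(3, 3))
degenerate_rank = rank(Y_degenerate)
```

The rank-one case is the loss of information mentioned above: the reweighted sequence no longer allows the input to be recovered.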

Besides the traditional attention mechanism `GeometricMachineLearning` therefore also has a volume-preserving transformation that fulfills a similar role. There are two approaches implemented to realize similar transformations. Both of them however utilize the *Cayley transform* to produce orthogonal matrices ``\sigma(C)`` instead of stochastic matrices. For an orthogonal matrix ``\Sigma`` we have ``\Sigma^T\Sigma = \mathbb{I}``, so all the columns are linearly independent, which is not necessarily true for a stochastic matrix ``Y``. In the following we explain how this new activation function is implemented. First we need to briefly discuss the *Cayley transform*.

### The Cayley transform

@@ -77,10 +83,10 @@ The Cayley transform maps from skew-symmetric matrices to orthonormal matrices[^
We can easily check that ``\mathrm{Cayley}(A)`` is orthogonal if ``A`` is skew-symmetric. For this consider ``\varepsilon \mapsto A(\varepsilon)\in\mathcal{S}_\mathrm{skew}`` with ``A(0) = \mathbb{O}`` and ``A'(0) = B``. Then we have:

```math
\frac{\delta\mathrm{Cayley}}{\delta{}A} = \frac{d}{d\varepsilon}|_{\varepsilon=0} \mathrm{Cayley}(A(\varepsilon))^T \mathrm{Cayley}(A(\varepsilon)) = \mathbb{O}.
\frac{\delta(\mathrm{Cayley}(A)^T\mathrm{Cayley}(A))}{\delta{}A} = \frac{d}{d\varepsilon}|_{\varepsilon=0} \mathrm{Cayley}(A(\varepsilon))^T \mathrm{Cayley}(A(\varepsilon)) = A'(0)^T + A'(0) = \mathbb{O},
```

In order to use the Cayley transform as an activation function we further need a mapping from the input ``Z`` to a skew-symmetric matrix. This is realized in two ways in `GeometricMachineLearning`: via a scalar-product with a skew-symmetric weighting and via a scalar-product with an arbitrary weighting.
So ``\mathrm{Cayley}(A)^T\mathrm{Cayley}(A)`` remains unchanged as ``\varepsilon`` varies. In order to use the Cayley transform as an activation function we further need a mapping from the input ``Z`` to a skew-symmetric matrix. This is realized in two ways in `GeometricMachineLearning`: via a scalar product with a skew-symmetric weighting and via a scalar product with an arbitrary weighting.
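The orthogonality claim can also be checked numerically. The definition of the Cayley transform is not visible in this excerpt, so the sketch below assumes the common form ``(\mathbb{I} - \frac{1}{2}A)^{-1}(\mathbb{I} + \frac{1}{2}A)``; the function name `cayley` is hypothetical.

```julia
using LinearAlgebra

# Assumed form of the Cayley transform: (I - A/2)⁻¹ (I + A/2).
cayley(A::AbstractMatrix) = (I - A / 2) \ (I + A / 2)

# Build a skew-symmetric matrix from an arbitrary square matrix.
M = [0.0  1.0 -2.0;
     0.5  0.0  3.0;
     1.5 -1.0  0.0]
A = M - M'

Σ = cayley(A)

# Frobenius norm of ΣᵀΣ - I, which should vanish up to round-off.
orthogonality_defect = norm(Σ' * Σ - I)
```

Because the eigenvalues of a skew-symmetric matrix are purely imaginary, `I - A / 2` is always invertible and the linear solve is well defined.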

### First approach: scalar products with a skew-symmetric weighting

@@ -89,17 +95,17 @@ For this the attention layer is modified in the following way:
```math
Z := [z^{(1)}, \ldots, z^{(T)}] \mapsto Z\sigma(Z^TAZ),
```
where ``\sigma(C)=\mathrm{Cayley}(C)`` and ``A`` is a skew-symmetric matrix that is learnable, i.e. the parameters of the attention layer are stored in ``A``.
where ``\sigma(C)=\mathrm{Cayley}(C)`` and ``A`` is a matrix of type [`SkewSymMatrix`](@ref) that is learnable, i.e. the parameters of the attention layer are stored in ``A``.

### Second approach: scalar products with an arbitrary weighting

For this approach we compute correlations between the input vectors with a skew-symmetric weighting. The correlations we consider here are based on:
For this approach we compute correlations between the input vectors based on scalar products with an arbitrary weighting. This arbitrary ``d\times{}d`` matrix ``A`` constitutes the learnable parameters of the attention layer. The correlations we consider here are based on:

```math
(z^{(2)})^TAz^{(1)}, (z^{(3)})^TAz^{(1)}, \ldots, (z^{(T)})^TAz^{(1)}, (z^{(3)})^TAz^{(2)}, \ldots, (z^{(T)})^TAz^{(2)}, \ldots, (z^{(T)})^TAz^{(T-1)}.
```

So in total we consider correlations ``(z^{(i)})^Tz^{(j)}`` for which ``i > j``. We now arrange these correlations into a skew-symmetric matrix:
So we consider correlations ``(z^{(i)})^TAz^{(j)}`` for which ``i > j``. We now arrange these correlations into a skew-symmetric matrix:

```math
C = \begin{bmatrix}
@@ -110,7 +116,13 @@ C = \begin{bmatrix}
\end{bmatrix}.
```

This correlation matrix can now again be used as an input for the Cayley transform to produce an orthogonal matrix.
This correlation matrix can now again be used as an input for the Cayley transform to produce an orthogonal matrix. Mathematically this is also equivalent to first computing all correlations ``Z^TAZ``, then copying the strictly lower-triangular part to the upper-triangular part, negating the copied elements, and setting the diagonal to zero. This is visualized below:

```@example
Main.include_graphics("../tikz/skew_sym_mapping") # hide
```

Internally `GeometricMachineLearning` computes this more efficiently with the function [`GeometricMachineLearning.tensor_mat_skew_sym_assign`](@ref).
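A dense, unoptimized version of this skew-symmetrization can be sketched in plain Julia. It is not the implementation behind `tensor_mat_skew_sym_assign`; the sign convention (lower triangle kept, upper triangle negated) follows the description above, and the Cayley form used at the end is an assumption.

```julia
using LinearAlgebra

d, T = 3, 4
Z = [sin(i + 2j) for i in 1:d, j in 1:T]   # d×T input sequence
A = [cos(i * j) for i in 1:d, j in 1:d]    # arbitrary d×d weighting

# All correlations (z^(i))ᵀ A z^(j).
C_full = Z' * A * Z

# Keep the strictly lower-triangular part and mirror it, negated, to the upper part.
L = tril(C_full, -1)
C = L - L'

# Feeding C to the (assumed) Cayley transform yields an orthogonal matrix.
Σ = (I - C / 2) \ (I + C / 2)
```

Only the entries with ``i > j`` of ``Z^TAZ`` enter `C`, so roughly half of the products computed in `C_full` are discarded. This is the overhead a dedicated routine can avoid.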

## How is structure preserved?

@@ -124,13 +136,14 @@ Z = \left[\begin{array}{cccc}
\cdots & \cdots & \cdots & \cdots \\
z_d^{(1)} & z_d^{(2)} & \cdots & z_d^{(T)}
\end{array}\right] \mapsto
\left[\begin{array}{c} z_1^{(1)} \\ z_1^{(2)} \\ \cdots \\ z_1^{(T)} \\ z_2^{(1)} \\ \cdots \\ z_d^{(T)} \end{array}\right] =: Z_\mathrm{vec}.
\left[\begin{array}{c} z_1^{(1)} \\ z_1^{(2)} \\ \cdots \\ z_1^{(T)} \\ z_2^{(1)} \\ \cdots \\ z_d^{(T)} \end{array}\right] =: Z_\mathrm{vec},
```

The inverse of ``Z \mapsto \hat{Z} `` we refer to as ``Y \mapsto \tilde{Y}``. In the following we also write ``\hat{\varphi}`` for the mapping ``\,\hat{}\circ\varphi\circ\tilde{}\,``.
so we arrange the rows consecutively into a vector. The inverse of ``Z \mapsto \hat{Z} `` we refer to as ``Y \mapsto \tilde{Y}``. In the following we also write ``\hat{\varphi}`` for the mapping ``\,\hat{}\circ\varphi\circ\tilde{}\,``.

__DEFINITION__:
We say that a mapping ``\varphi: \times_\text{$T$ times}\mathbb{R}^{d} \to \times_\text{$T$ times}\mathbb{R}^{d}`` is **volume-preserving** if the associated ``\hat{\varphi}`` is volume-preserving.
```@eval
Main.definition(raw"We say that a mapping ``\varphi: \times_\text{$T$ times}\mathbb{R}^{d} \to \times_\text{$T$ times}\mathbb{R}^{d}`` is **volume-preserving** if the associated ``\hat{\varphi}`` is volume-preserving.")
```

In the transformed coordinate system (in terms of the vector ``Z_\mathrm{vec}`` defined above) this is equivalent to multiplication by a sparse matrix ``\tilde\Lambda(Z)`` from the left:

@@ -145,13 +158,29 @@ In the transformed coordinate system (in terms of the vector ``Z_\mathrm{vec}``
\left[\begin{array}{c} z_1^{(1)} \\ z_1^{(2)} \\ \ldots \\ z_1^{(T)} \\ z_2^{(1)} \\ \ldots \\ z_d^{(T)} \end{array}\right] .
```

``\tilde{\Lambda}(Z)`` in m[eq:LambdaApplication]m(@latex) is easily shown to be an orthogonal matrix.
``\tilde{\Lambda}(Z)`` is easily shown to be an orthogonal matrix and a symplectic matrix, i.e. it satisfies

```math
\tilde{\Lambda}(Z)^T\tilde{\Lambda}(Z) = \mathbb{I}
```

and

```math
\tilde{\Lambda}(Z)^T\mathbb{J}\tilde{\Lambda}(Z) = \mathbb{J}.
```
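The orthogonality of the vectorized map can be checked on a small example. The explicit form of ``\tilde{\Lambda}(Z)`` is not visible in this excerpt, so the sketch assumes that it is the block-diagonal matrix ``\mathbb{I}_d \otimes \Lambda^T`` implied by the row-wise vectorization for a map ``Z \mapsto Z\Lambda``. The names `rowvec` and `Λvec` are hypothetical, and symplecticity is not checked here.

```julia
using LinearAlgebra

d, T = 2, 3
Z = [1.0 2.0 3.0;
     4.0 5.0 6.0]

# An orthogonal T×T matrix standing in for σ(C), built with an assumed Cayley form.
S = [ 0.0  1.0 -2.0;
     -1.0  0.0  0.5;
      2.0 -0.5  0.0]
Λ = (I - S / 2) \ (I + S / 2)

# Arrange the rows of a matrix consecutively into one vector.
rowvec(X) = vec(permutedims(X))

# Z ↦ ZΛ in vectorized coordinates: d copies of Λᵀ on the diagonal.
Λvec = kron(Matrix(1.0I, d, d), Matrix(Λ'))

lhs = Λvec * rowvec(Z)
rhs = rowvec(Z * Λ)
```

Since `Λvec` is orthogonal, its determinant has absolute value one, which is the volume-preservation property for a fixed `Λ`.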


## Historical Note

Attention was used before, but always in connection with **recurrent neural networks** (see [luong2015effective](@cite) and [bahdanau2014neural](@cite)).
Attention was used before the transformer was introduced, but mostly in connection with *recurrent neural networks* (see [luong2015effective](@cite) and [bahdanau2014neural](@cite)).

## Library Functions

```@docs; canonical = false
GeometricMachineLearning.tensor_mat_skew_sym_assign
VolumePreservingAttention
```

## References

21 changes: 13 additions & 8 deletions docs/src/layers/linear_symplectic_attention.md
@@ -2,18 +2,21 @@

The attention layer introduced here is an extension of the [Sympnet gradient layer](@ref "SympNet Gradient Layer") to the setting where we deal with time series data. We first have to define a notion of symplecticity for [multi-step methods](@ref "Multi-step methods").

This definition is essentially taken from [feng1987symplectic, ge1988approximation](@cite) and similar to the definition of volume-preservation in [brantner2024volume](@cite).
This definition is different from [feng1987symplectic, ge1988approximation](@cite), but similar to the definition of volume-preservation in [brantner2024volume](@cite)[^1].

[^1]: This definition is also recalled in the section on [volume-preserving attention](@ref "How is structure preserved?").

```@eval
Main.definition(raw"""
A multi-step method ``\times_T\mathbb{R}^{2n}\to\times_T\mathbb{R}^{2n}`` is called **symplectic** if it preserves the the symplectic product structure.
""")
A multi-step method ``\varphi:\times_T\mathbb{R}^{2n}\to\times_T\mathbb{R}^{2n}`` is called **symplectic** if it preserves the symplectic product structure, i.e. if ``\hat{\varphi}`` is symplectic.""")
```

The *symplectic product structure* is the following skew-symmetric non-degenerate bilinear form:

```math
\mathbb{J}([z^{(1)}, \ldots, z^{(T)}], [\tilde{z}^{(1)}, \ldots, \tilde{z}^{(T)}]) := \sum_{i=1}^T (z^{(i)})^T\tilde{z}^{(i)}.
```@eval
Main.remark(raw"The **symplectic product structure** is the following skew-symmetric non-degenerate bilinear form:
" * Main.indentation * raw"```math
" * Main.indentation * raw"\hat{\mathbb{J}}([z^{(1)}, \ldots, z^{(T)}], [\tilde{z}^{(1)}, \ldots, \tilde{z}^{(T)}]) := \sum_{i=1}^T (z^{(i)})^T\mathbb{J}\tilde{z}^{(i)}.
" * Main.indentation * raw"```
" * Main.indentation * raw"``\hat{\mathbb{J}}`` is defined through the isomorphism between the product space and the space of big vectors ``\hat{}: \times_\text{($T$ times)}\mathbb{R}^{d}\stackrel{\approx}{\longrightarrow}\mathbb{R}^{dT}``.")
```

In order to construct a symplectic attention mechanism we extend the principle of the [SympNet gradient layer](@ref "SympNet Gradient Layer"), i.e. we construct scalar functions that only depend on ``[q^{(1)}, \ldots, q^{(T)}]`` or ``[p^{(1)}, \ldots, p^{(T)}]``. The specific choice we make here is the following:
@@ -28,12 +31,14 @@ where ``Q := [q^{(1)}, \ldots, q^{(T)}]``. We therefore have for the gradient:
\nabla_Qf = \frac{1}{2}Q(A + A^T) =: Q\bar{A},
```

where ``A\in\mathcal{S}_\mathrm{skew}(T). So the map performs:
where ``\bar{A}\in\mathcal{S}_\mathrm{sym}(T)`` is the symmetric part of ``A``. So the map performs:

```math
[q^{(1)}, \ldots, q^{(T)}] \mapsto \left[ \sum_{i=1}^Ta_{1i}q^{(i)}, \ldots, \sum_{i=1}^Ta_{Ti}q^{(i)} \right].
```

Note that this linear symplectic attention still reweights the input vectors, like [standard attention](@ref "Reweighting of the Input Sequence") and [volume-preserving attention](@ref "Volume-Preserving Attention"). The crucial difference is that the coefficients ``a`` here are in linear relation to the input vectors, as opposed to the coefficients ``y`` for the [standard and volume-preserving attention layers](@ref "The Attention Layer"), which depend on the input vectors nonlinearly. We hence call this attention mechanism *linear symplectic attention*.
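The gradient formula above can be verified with finite differences. The definition of ``f`` is not visible in this excerpt, so the sketch assumes ``f(Q) = \frac{1}{2}\mathrm{tr}(QAQ^T)``, which reproduces ``\nabla_Qf = \frac{1}{2}Q(A + A^T)``. All names (`f`, `Abar`, `fd_gradient`) are hypothetical.

```julia
using LinearAlgebra

n, T = 2, 3
Q = [0.3 -1.2  0.7;
     1.1  0.4 -0.5]
A = [ 1.0 2.0  0.0;
     -1.0 0.5  3.0;
      0.2 0.0 -0.7]

# Assumed scalar function whose gradient is Q * (A + Aᵀ) / 2.
f(X) = tr(X * A * X') / 2

# Symmetric part of A and the analytic gradient.
Abar = (A + A') / 2
grad = Q * Abar

# Central finite differences, one entry of Q at a time.
function fd_gradient(f, X; h = 1e-6)
    G = similar(X)
    for k in eachindex(X)
        E = zeros(size(X))
        E[k] = h
        G[k] = (f(X + E) - f(X - E)) / (2h)
    end
    return G
end

grad_fd = fd_gradient(f, Q)
```

Only the symmetric part of `A` contributes to the gradient, so the reweighting coefficients do not depend on `Q`.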

## Library Functions

```@docs; canonical=false