Merge pull request #164 from JuliaGNI/docs_on_hamiltonian_systems
Docs on Hamiltonian systems (among others)
michakraus authored Jul 25, 2024
2 parents e4bc624 + f3af3ee commit 95b5c47
Showing 25 changed files with 660 additions and 132 deletions.
6 changes: 5 additions & 1 deletion docs/make.jl
@@ -162,6 +162,10 @@ makedocs(;
"Global Tangent Spaces" => "arrays/global_tangent_spaces.md",
"Pullbacks" => "pullbacks/computation_of_pullbacks.md",
],
"Structure-Preservation" => [
"Symplecticity" => "structure_preservation/symplecticity.md",
"Volume-Preservation" => "structure_preservation/volume_preservation.md",
],
"Optimizers" => [
"Optimizers" => "optimizers/optimizer_framework.md",
"Global Sections" => "optimizers/manifold_related/global_sections.md",
@@ -173,7 +177,7 @@ makedocs(;
"Special Neural Network Layers" => [
"Sympnet Layers" => "layers/sympnet_gradient.md",
"Volume-Preserving Layers" => "layers/volume_preserving_feedforward.md",
"Attention" => "layers/attention_layer.md",
"(Volume-Preserving) Attention" => "layers/attention_layer.md",
"Multihead Attention" => "layers/multihead_attention_layer.md",
"Linear Symplectic Attention" => "layers/linear_symplectic_attention.md",
],
21 changes: 21 additions & 0 deletions docs/src/GeometricMachineLearning.bib
@@ -72,6 +72,16 @@ @book{bishop1980tensor
address={Mineola, New York}
}

@book{arnold1978mathematical,
title={Mathematical methods of classical mechanics},
author={Arnold, Vladimir Igorevich},
volume={60},
year={1978},
series={Graduate Texts in Mathematics},
publisher={Springer Verlag},
address={Berlin}
}

@book{o1983semi,
title={Semi-Riemannian geometry with applications to relativity},
author={O'neill, Barrett},
@@ -458,4 +468,15 @@ @article{raissi2019physics
pages={686--707},
year={2019},
publisher={Elsevier}
}

@article{kraus2017gempic,
title={GEMPIC: geometric electromagnetic particle-in-cell methods},
author={Kraus, Michael and Kormann, Katharina and Morrison, Philip J and Sonnendr{\"u}cker, Eric},
journal={Journal of Plasma Physics},
volume={83},
number={4},
pages={905830401},
year={2017},
publisher={Cambridge University Press}
}
71 changes: 50 additions & 21 deletions docs/src/layers/attention_layer.md
@@ -1,6 +1,6 @@
# The Attention Layer

The *attention* mechanism was originally developed for image and natural language processing (NLP) tasks. It is motivated by the need to handle time series data in an efficient way[^1]. Its essential idea is to compute correlations between vectors in input sequences. I.e. given sequences
The *attention* mechanism was originally developed for image and natural language processing (NLP) tasks. It is motivated by the need to handle time series data in an efficient way[^1]. Its essential idea is to compute correlations between vectors in input sequences. So given two sequences

```math
(z_q^{(1)}, z_q^{(2)}, \ldots, z_q^{(T)}) \text{ and } (z_p^{(1)}, z_p^{(2)}, \ldots, z_p^{(T)}),
@@ -15,19 +15,19 @@ an attention mechanism computes pair-wise correlations between all combinations

where ``z_q, z_k \in \mathbb{R}^d`` are elements of the input sequences. The learnable parameters are ``W, U \in \mathbb{R}^{n\times{}d}`` and ``v \in \mathbb{R}^n``.

However *multiplicative attention* (see e.g. [vaswani2017attention](@cite))is more straightforward to interpret and cheaper to handle computationally:
However *multiplicative attention* (see e.g. [vaswani2017attention](@cite)) is more straightforward to interpret and cheaper to handle computationally:

```math
(z_q, z_k) \mapsto z_q^TWz_k,
```

where ``W \in \mathbb{R}^{d\times{}d}`` is a learnable weight matrix with respect to which correlations are computed as scalar products. Regardless of the type of attention used, they all try to compute correlations among input sequences on whose basis further computation is performed. Given two input sequences ``Z_q = (z_q^{(1)}, \ldots, z_q^{(T)})`` and ``Z_k = (z_k^{(1)}, \ldots, z_k^{(T)})``, we can arrange the various correlations into a *correlation matrix* ``C\in\mathbb{R}^{T\times{}T}`` with entries ``[C]_{ij} = \mathtt{attention}(z_q^{(i)}, z_k^{(j)})``. In the case of multiplicative attention this matrix is just ``C = Z^TWZ``.
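
As a small illustration (a plain-Julia sketch with random data, not part of the package API), the correlation matrix of multiplicative self-attention is just a matrix product:

```julia
# feature dimension d, sequence length T; Z holds the input sequence column-wise
d, T = 4, 6
Z = randn(d, T)
W = randn(d, d)      # learnable weight matrix (random here, purely for illustration)

C = Z' * W * Z       # correlation matrix: C[i, j] = (z⁽ⁱ⁾)ᵀ W z⁽ʲ⁾
size(C)              # (6, 6), i.e. T × T
```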

## Reweighting of the input sequence
## Reweighting of the Input Sequence

In `GeometricMachineLearning` we always compute *self-attention*, meaning that the two input sequences ``Z_q`` and ``Z_k`` are the same, i.e. ``Z = Z_q = Z_k``.[^2]

[^2]: [Multihead attention](multihead_attention_layer.md) also falls into this category. Here the input ``Z`` is multiplied from the left with several *projection matrices* ``P^Q_i`` and ``P^K_i``, where ``i`` indicates the *head*. For each head we then compute a correlation matrix ``(P^Q_i Z)^T(P^K Z)``.
[^2]: [Multihead attention](@ref "Multihead Attention") also falls into this category. Here the input ``Z`` is multiplied from the left with several *projection matrices* ``P^Q_i`` and ``P^K_i``, where ``i`` indicates the *head*. For each head we then compute a correlation matrix ``(P^Q_i Z)^T(P^K Z)``.

This is then used to reweight the columns in the input sequence ``Z``. For this we first apply a nonlinearity ``\sigma`` onto ``C`` and then multiply ``\sigma(C)`` onto ``Z`` from the right, i.e. the output of the attention layer is ``Z\sigma(C)``. So we perform the following mappings:

@@ -45,10 +45,10 @@ for ``p^{(i)} = [\sigma(C)]_{\bullet{}i}``. What is *learned* during training ar

## Volume-Preserving Attention

The attention layer (and the activation function ``\sigma`` defined for it) in `GeometricMachineLearning` was specifically designed to apply it to data coming from physical systems that can be described through a divergence-free or a symplectic vector field.
Traditionally the nonlinearity in the attention mechanism is a softmax[^3] (see [vaswani2017attention](@cite)) and the self-attention layer performs the following mapping:
The [`VolumePreservingAttention`](@ref) layer (and the activation function ``\sigma`` defined for it) in `GeometricMachineLearning` was specifically designed to apply it to data coming from physical systems that can be described through a divergence-free or a symplectic vector field.
Traditionally the nonlinearity in the attention mechanism is a softmax[^3] [vaswani2017attention](@cite) and the self-attention layer performs the following mapping:

[^3]: The softmax acts on the matrix ``C`` in a vector-wise manner, i.e. it operates on each column of the input matrix ``C = [c^{(1)}, \ldots, c^{(T)}]``. The result is a sequence of probability vectors ``[p^{(1)}, \ldots, p^{(T)}]`` for which ``\sum_{i=1}^Tp^{(j)}_i=1\quad\forall{}j\in\{1,\dots,T\}.``
[^3]: The softmax acts on the matrix ``C`` in a vector-wise manner, i.e. it operates on each column of the input matrix ``C = [c^{(1)}, \ldots, c^{(T)}]``. The result is a sequence of probability vectors ``[y^{(1)}, \ldots, y^{(T)}]`` for which ``\sum_{i=1}^Ty^{(j)}_i=1\quad\forall{}j\in\{1,\dots,T\}.``

```math
Z := [z^{(1)}, \ldots, z^{(T)}] \mapsto Z\mathrm{softmax}(Z^TWZ).
@@ -60,9 +60,15 @@ The softmax activation acts vector-wise, i.e. if we supply it with a matrix ``C`
\mathrm{softmax}(C) = [\mathrm{softmax}(c_{\bullet{}1}), \ldots, \mathrm{softmax}(c_{\bullet{}T})].
```

The output of a softmax is a *probability vector* (also called *stochastic vector*) and the matrix ``P = [p^{(1)}, \ldots, p^{(T)}]``, where each column is a probability vector, is sometimes referred to as a *stochastic matrix* (see [jacobs1992discrete](@cite)). This attention mechanism finds application in *transformer neural networks* [vaswani2017attention](@cite). The problem with this matrix from a geometric point of view is that all the columns are independent of each other and the nonlinear transformation could in theory produce a stochastic matrix for which all columns are identical and thus lead to a loss of information. So the softmax activation function is inherently non-geometric.
The output of a softmax is a *probability vector* (also called *stochastic vector*) and the matrix ``Y = [y^{(1)}, \ldots, y^{(T)}]``, where each column is a probability vector, is sometimes referred to as a *stochastic matrix* (see [jacobs1992discrete](@cite)). This attention mechanism finds application in *transformer neural networks* [vaswani2017attention](@cite). The problem with this matrix from a geometric point of view is that all the columns are independent of each other and the nonlinear transformation could in theory produce a stochastic matrix for which all columns are identical and thus lead to a loss of information. So the softmax activation function is inherently non-geometric. We visualize this with the figure below:

Besides the traditional attention mechanism `GeometricMachineLearning` therefore also has a volume-preserving transformation that fulfills a similar role. There are two approaches implemented to realize similar transformations. Both of them however utilize the *Cayley transform* to produce orthogonal matrices ``\sigma(C)`` instead of stochastic matrices. For an orthogonal matrix ``\Sigma`` we have ``\Sigma^T\Sigma = \mathbb{I}``, so all the columns are linearly independent which is not necessarily true for a stochastic matrix ``P``. The following explains how this new activation function is implemented.
```@example
Main.include_graphics("../tikz/convex_recombination") # hide
```

So the ``y`` coefficients responsible for producing the first output vector are independent from those producing the second output vector etc.: each output column ``j`` is a convex combination ``\sum_{i=1}^Ty^{(j)}_iz_\mu^{(i)}`` of the input vectors (with ``\sum_{i=1}^Ty^{(j)}_i = 1``), but the coefficients for two different columns are completely independent of each other.
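
The following plain-Julia sketch (random data, hand-written column-wise softmax, independent of the package) makes this convex recombination explicit:

```julia
# column-wise softmax, written out by hand for illustration
function softmax_columns(C::AbstractMatrix)
    E = exp.(C .- maximum(C; dims = 1))   # shift by the column maximum for numerical stability
    return E ./ sum(E; dims = 1)
end

d, T = 4, 6
Z = randn(d, T)
W = randn(d, d)

Y = softmax_columns(Z' * W * Z)   # stochastic matrix: every column is a probability vector
output = Z * Y                    # every output column is a convex combination of the columns of Z
sum(Y; dims = 1)                  # ≈ [1 1 1 1 1 1]
```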

Besides the traditional attention mechanism `GeometricMachineLearning` therefore also has a volume-preserving transformation that fulfills a similar role. There are two approaches implemented to realize similar transformations. Both of them however utilize the *Cayley transform* to produce orthogonal matrices ``\sigma(C)`` instead of stochastic matrices. For an orthogonal matrix ``\Sigma`` we have ``\Sigma^T\Sigma = \mathbb{I}``, so all the columns are linearly independent, which is not necessarily true for a stochastic matrix ``Y``. In the following we explain how this new activation function is implemented. First we need to briefly discuss the *Cayley transform*.

### The Cayley transform

@@ -77,10 +83,10 @@ The Cayley transform maps from skew-symmetric matrices to orthonormal matrices[^
We can easily check that ``\mathrm{Cayley}(A)`` is orthogonal if ``A`` is skew-symmetric. For this consider ``\varepsilon \mapsto A(\varepsilon)\in\mathcal{S}_\mathrm{skew}`` with ``A(0) = \mathbb{O}`` and ``A'(0) = B``. Then we have:

```math
\frac{\delta\mathrm{Cayley}}{\delta{}A} = \frac{d}{d\varepsilon}|_{\varepsilon=0} \mathrm{Cayley}(A(\varepsilon))^T \mathrm{Cayley}(A(\varepsilon)) = \mathbb{O}.
\frac{\delta(\mathrm{Cayley}(A)^T\mathrm{Cayley}(A))}{\delta{}A} = \frac{d}{d\varepsilon}|_{\varepsilon=0} \mathrm{Cayley}(A(\varepsilon))^T \mathrm{Cayley}(A(\varepsilon)) = A'(0)^T + A'(0) = \mathbb{O},
```

In order to use the Cayley transform as an activation function we further need a mapping from the input ``Z`` to a skew-symmetric matrix. This is realized in two ways in `GeometricMachineLearning`: via a scalar-product with a skew-symmetric weighting and via a scalar-product with an arbitrary weighting.
So ``\mathrm{Cayley}(A)^T\mathrm{Cayley}(A)`` does not change as ``\varepsilon`` varies. In order to use the Cayley transform as an activation function we further need a mapping from the input ``Z`` to a skew-symmetric matrix. This is realized in two ways in `GeometricMachineLearning`: via a scalar-product with a skew-symmetric weighting and via a scalar-product with an arbitrary weighting.
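
This property is also easy to verify numerically. The sketch below uses one common convention for the Cayley transform, written out explicitly in plain Julia rather than taken from the package:

```julia
using LinearAlgebra

# one common convention for the Cayley transform; the factor 1/2 is a choice
cayley(A) = (I - A / 2) \ (I + A / 2)

B = randn(5, 5)
A = B - B'              # a random skew-symmetric matrix

Σ = cayley(A)
norm(Σ' * Σ - I)        # ≈ 0: Σ is orthogonal
```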

### First approach: scalar products with a skew-symmetric weighting

@@ -89,17 +95,17 @@ For this the attention layer is modified in the following way:
```math
Z := [z^{(1)}, \ldots, z^{(T)}] \mapsto Z\sigma(Z^TAZ),
```
where ``\sigma(C)=\mathrm{Cayley}(C)`` and ``A`` is a skew-symmetric matrix that is learnable, i.e. the parameters of the attention layer are stored in ``A``.
where ``\sigma(C)=\mathrm{Cayley}(C)`` and ``A`` is a matrix of type [`SkewSymMatrix`](@ref) that is learnable, i.e. the parameters of the attention layer are stored in ``A``.
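
A rough sketch of this first approach with random data (plain Julia; the actual layer handles the skew-symmetric weight and batched inputs differently, this only illustrates the mathematics):

```julia
using LinearAlgebra

cayley(C) = (I - C / 2) \ (I + C / 2)

d, T = 4, 6
Z = randn(d, T)
B = randn(d, d)
A = (B - B') / 2        # skew-symmetric weight; in the package this is a SkewSymMatrix

C = Z' * A * Z          # skew-symmetric, since A is
Σ = cayley(C)           # orthogonal reweighting matrix
output = Z * Σ

norm(C + C')            # ≈ 0
norm(Σ' * Σ - I)        # ≈ 0
```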

### Second approach: scalar products with an arbitrary weighting

For this approach we compute correlations between the input vectors with a skew-symmetric weighting. The correlations we consider here are based on:
For this approach we compute correlations between the input vectors based on a scalar product with an arbitrary weighting. This arbitrary ``d\times{}d`` matrix ``A`` constitutes the learnable parameters of the attention layer. The correlations we consider here are based on:

```math
(z^{(2)})^TAz^{(1)}, (z^{(3)})^TAz^{(1)}, \ldots, (z^{(T)})^TAz^{(1)}, (z^{(3)})^TAz^{(2)}, \ldots, (z^{(T)})^TAz^{(2)}, \ldots, (z^{(T)})^TAz^{(T-1)}.
```

So in total we consider correlations ``(z^{(i)})^Tz^{(j)}`` for which ``i > j``. We now arrange these correlations into a skew-symmetric matrix:
So we consider correlations ``(z^{(i)})^TAz^{(j)}`` for which ``i > j``. We now arrange these correlations into a skew-symmetric matrix:

```math
C = \begin{bmatrix}
@@ -110,7 +116,13 @@ C = \begin{bmatrix}
\end{bmatrix}.
```

This correlation matrix can now again be used as an input for the Cayley transform to produce an orthogonal matrix.
This correlation matrix can now again be used as an input for the Cayley transform to produce an orthogonal matrix. Mathematically this is equivalent to first computing all correlations ``Z^TAZ`` and then copying the strictly lower-triangular entries to the upper triangle with flipped sign, so that the result is skew-symmetric. This is visualized below:

```@example
Main.include_graphics("../tikz/skew_sym_mapping") # hide
```

Internally `GeometricMachineLearning` computes this more efficiently with the function [`GeometricMachineLearning.tensor_mat_skew_sym_assign`](@ref).
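
As a rough plain-Julia sketch of this construction (the library function additionally handles tensors with a batch dimension and, as stated above, is implemented more efficiently; the ``d\times{}d`` weight size follows from the formulas above):

```julia
using LinearAlgebra

d, T = 4, 6
Z = randn(d, T)
A = randn(d, d)         # arbitrary weight matrix

M = Z' * A * Z          # all pairwise correlations
L = tril(M, -1)         # keep the strictly lower-triangular part
C = L - L'              # skew-symmetric correlation matrix

norm(C + C')            # ≈ 0
```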

## How is structure preserved?

@@ -124,13 +136,14 @@ Z = \left[\begin{array}{cccc}
\cdots & \cdots & \cdots & \cdots \\
z_d^{(1)} & z_d^{(2)} & \cdots & z_d^{(T)}
\end{array}\right] \mapsto
\left[\begin{array}{c} z_1^{(1)} \\ z_1^{(2)} \\ \cdots \\ z_1^{(T)} \\ z_2^{(1)} \\ \cdots \\ z_d^{(T)} \end{array}\right] =: Z_\mathrm{vec}.
\left[\begin{array}{c} z_1^{(1)} \\ z_1^{(2)} \\ \cdots \\ z_1^{(T)} \\ z_2^{(1)} \\ \cdots \\ z_d^{(T)} \end{array}\right] =: Z_\mathrm{vec},
```

The inverse of ``Z \mapsto \hat{Z} `` we refer to as ``Y \mapsto \tilde{Y}``. In the following we also write ``\hat{\varphi}`` for the mapping ``\,\hat{}\circ\varphi\circ\tilde{}\,``.
so we arrange the rows consecutively into a vector ``\hat{Z} := Z_\mathrm{vec}``. The inverse of ``Z \mapsto \hat{Z}`` we refer to as ``Y \mapsto \tilde{Y}``. In the following we also write ``\hat{\varphi}`` for the mapping ``\,\hat{}\circ\varphi\circ\tilde{}\,``.

__DEFINITION__:
We say that a mapping ``\varphi: \times_\text{$T$ times}\mathbb{R}^{d} \to \times_\text{$T$ times}\mathbb{R}^{d}`` is **volume-preserving** if the associated ``\hat{\varphi}`` is volume-preserving.
```@eval
Main.definition(raw"We say that a mapping ``\varphi: \times_\text{$T$ times}\mathbb{R}^{d} \to \times_\text{$T$ times}\mathbb{R}^{d}`` is **volume-preserving** if the associated ``\hat{\varphi}`` is volume-preserving.")
```

In the transformed coordinate system (in terms of the vector ``Z_\mathrm{vec}`` defined above) this is equivalent to multiplication by a sparse matrix ``\tilde\Lambda(Z)`` from the left:

@@ -145,13 +158,29 @@ In the transformed coordinate system (in terms of the vector ``Z_\mathrm{vec}``
\left[\begin{array}{c} z_1^{(1)} \\ z_1^{(2)} \\ \ldots \\ z_1^{(T)} \\ z_2^{(1)} \\ \ldots \\ z_d^{(T)} \end{array}\right] .
```

``\tilde{\Lambda}(Z)`` in m[eq:LambdaApplication]m(@latex) is easily shown to be an orthogonal matrix.
``\tilde{\Lambda}(Z)`` is easily shown to be an orthogonal matrix and a symplectic matrix, i.e. it satisfies

```math
\tilde{\Lambda}(Z)^T\tilde{\Lambda}(Z) = \mathbb{I}
```

and

```math
\tilde{\Lambda}(Z)^T\mathbb{J}\tilde{\Lambda}(Z) = \mathbb{J}.
```
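
Orthogonality and the determinant condition behind volume preservation can be checked numerically. The following is an illustrative plain-Julia sketch, not a proof (and it does not check the symplecticity claim):

```julia
using LinearAlgebra

cayley(C) = (I - C / 2) \ (I + C / 2)

d, T = 4, 6
Z = randn(d, T)
B = randn(d, d)
A = (B - B') / 2

Σ = cayley(Z' * A * Z)

norm(Σ' * Σ - I)        # ≈ 0: the reweighting matrix is orthogonal
det(Σ)                  # ≈ 1: the map Z ↦ ZΣ is volume-preserving in the sense above
```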


## Historical Note

Attention was used before, but always in connection with **recurrent neural networks** (see [luong2015effective](@cite) and [bahdanau2014neural](@cite)).
Attention was used before the transformer was introduced, but mostly in connection with *recurrent neural networks* (see [luong2015effective](@cite) and [bahdanau2014neural](@cite)).

## Library Functions

```@docs; canonical = false
GeometricMachineLearning.tensor_mat_skew_sym_assign
VolumePreservingAttention
```

## References

21 changes: 13 additions & 8 deletions docs/src/layers/linear_symplectic_attention.md
@@ -2,18 +2,21 @@

The attention layer introduced here is an extension of the [Sympnet gradient layer](@ref "SympNet Gradient Layer") to the setting where we deal with time series data. We first have to define a notion of symplecticity for [multi-step methods](@ref "Multi-step methods").

This definition is essentially taken from [feng1987symplectic, ge1988approximation](@cite) and similar to the definition of volume-preservation in [brantner2024volume](@cite).
This definition is different from [feng1987symplectic, ge1988approximation](@cite), but similar to the definition of volume-preservation in [brantner2024volume](@cite)[^1].

[^1]: This definition is also recalled in the section on [volume-preserving attention](@ref "How is structure preserved?").

```@eval
Main.definition(raw"""
A multi-step method ``\times_T\mathbb{R}^{2n}\to\times_T\mathbb{R}^{2n}`` is called **symplectic** if it preserves the the symplectic product structure.
""")
A multi-step method ``\varphi: \times_T\mathbb{R}^{2n}\to\times_T\mathbb{R}^{2n}`` is called **symplectic** if it preserves the symplectic product structure, i.e. if ``\hat{\varphi}`` is symplectic.""")
```

The *symplectic product structure* is the following skew-symmetric non-degenerate bilinear form:

```math
\mathbb{J}([z^{(1)}, \ldots, z^{(T)}], [\tilde{z}^{(1)}, \ldots, \tilde{z}^{(T)}]) := \sum_{i=1}^T (z^{(i)})^T\tilde{z}^{(i)}.
```@eval
Main.remark(raw"The **symplectic product structure** is the following skew-symmetric non-degenerate bilinear form:
" * Main.indentation * raw"```math
" * Main.indentation * raw"\hat{\mathbb{J}}([z^{(1)}, \ldots, z^{(T)}], [\tilde{z}^{(1)}, \ldots, \tilde{z}^{(T)}]) := \sum_{i=1}^T (z^{(i)})^T\tilde{z}^{(i)}.
" * Main.indentation * raw"```
" * Main.indentation * raw"``\hat{\mathbb{J}}`` is defined through the isomorphism between the product space and the space of big vectors ``\hat{}: \times_\text{($T$ times)}\mathbb{R}^{d}\stackrel{\approx}{\longrightarrow}\mathbb{R}^{dT}``.")
```
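
A minimal plain-Julia sketch of this form and its skew-symmetry (`J2n` below stands for the canonical symplectic matrix ``\mathbb{J}_{2n}``; the column-wise arrangement of the sequences is an assumption made purely for illustration):

```julia
using LinearAlgebra

n, T = 3, 5
J2n = [zeros(n, n) Matrix(I, n, n); -Matrix(I, n, n) zeros(n, n)]   # canonical symplectic matrix on ℝ²ⁿ

# symplectic product of two sequences, arranged column-wise as 2n × T matrices
symplectic_product(Z, Zt) = sum(Z[:, i]' * J2n * Zt[:, i] for i in axes(Z, 2))

Z1, Z2 = randn(2n, T), randn(2n, T)
symplectic_product(Z1, Z2) + symplectic_product(Z2, Z1)   # ≈ 0: the form is skew-symmetric
```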

In order to construct a symplectic attention mechanism we extend the principle [SympNet gradient layer](@ref "SympNet Gradient Layer"), i.e. we construct scalar functions that only depend on ``[q^{(1)}, \ldots, q^{(T)}]`` or ``[p^{(1)}, \ldots, p^{(T)}]``. The specific choice we make here is the following:
@@ -28,12 +31,14 @@ where ``Q := [q^{(1)}, \ldots, q^{(T)}]``. We therefore have for the gradient:
\nabla_Qf = \frac{1}{2}Q(A + A^T) =: Q\bar{A},
```

where ``A\in\mathcal{S}_\mathrm{skew}(T). So the map performs:
where ``\bar{A} := \frac{1}{2}(A + A^T)`` is symmetric. So the map performs:

```math
[q^{(1)}, \ldots, q^{(T)}] \mapsto \left[ \sum_{i=1}^Ta_{1i}q^{(i)}, \ldots, \sum_{i=1}^Ta_{Ti}q^{(i)} \right].
```
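
To see that the gradient above is consistent, here is a small finite-difference check in plain Julia. It assumes ``f(Q) = \frac{1}{2}\mathrm{tr}(QAQ^T)``, which reproduces the stated gradient; this is an illustration, not the package implementation:

```julia
using LinearAlgebra

n, T = 3, 5
Q = randn(n, T)
A = randn(T, T)

f(Q) = tr(Q * A * Q') / 2
grad_analytic = Q * (A + A') / 2            # = Q * Ā from the formula above

# finite-difference check of one entry of the gradient
ε = 1e-6
E = zeros(n, T); E[2, 3] = 1.0
grad_fd = (f(Q + ε * E) - f(Q - ε * E)) / (2ε)

isapprox(grad_analytic[2, 3], grad_fd; atol = 1e-6)   # true
```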

Note that there is still a reweighting of the input vectors performed with this linear symplectic attention, like in [standard attention](@ref "Reweighting of the Input Sequence") and [volume-preserving attention](@ref "Volume-Preserving Attention"). The crucial difference is that the coefficients ``a`` here do not depend on the input, so the output is a linear function of the input vectors, whereas the coefficients ``y`` of the [standard and volume-preserving attention layers](@ref "The Attention Layer") depend on the input nonlinearly. We hence call this attention mechanism *linear symplectic attention* to distinguish it from the standard attention mechanism.

## Library Functions

```@docs; canonical=false