Fix some minor typos #140

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
2 changes: 1 addition & 1 deletion tex/appendix/AD-norm.tex
@@ -16,7 +16,7 @@
% Chris on https://discourse.julialang.org/t/default-norm-used-in-ode-error-control/70995/4 mention to this norm to be the defaul since 70s

The goal of a stepsize controller is to pick $\Delta t_{n+1}$ as large as possible (so the solver requires fewer total steps) while keeping $\text{Err}_\text{scaled} \leq 1$.
One of the most used methods to archive this is the proportional-integral controller (PIC) that updates the stepsize according to \cite{hairer-solving-2, Ranocha_Dalcin_Parsani_Ketcheson_2022}
One of the most used methods to achieve this is the proportional-integral controller (PIC) that updates the stepsize according to \cite{hairer-solving-2, Ranocha_Dalcin_Parsani_Ketcheson_2022}
\begin{equation}
\Delta t_{n+1} = \eta \, \Delta t_n
\qquad
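The proportional-integral update $\Delta t_{n+1} = \eta \, \Delta t_n$ (truncated in this diff view) can be sketched outside the paper as a few lines of Python. The gain values `beta1`, `beta2` and the clamping bounds below are illustrative assumptions, not values taken from the text:

```python
def pi_stepsize_factor(err_scaled, err_scaled_prev, order,
                       beta1=0.7, beta2=-0.4):
    """Proportional-integral controller factor eta, with dt_new = eta * dt_old.

    err_scaled / err_scaled_prev are the current and previous scaled error
    estimates (the controller aims to keep them <= 1); `order` is the order
    of the method.  Gains and clamp bounds here are illustrative only.
    """
    k = order
    # Proportional term reacts to the current error, integral term to the
    # previous one; exponents are scaled by the method order.
    eta = err_scaled ** (-beta1 / k) * err_scaled_prev ** (-beta2 / k)
    # Clamp to avoid drastic jumps in the stepsize between steps.
    return min(max(eta, 0.2), 5.0)

# A step with error exactly at target leaves the stepsize unchanged:
eta = pi_stepsize_factor(1.0, 1.0, order=5)  # -> 1.0
```

An error above the target ($\text{Err}_\text{scaled} > 1$) yields $\eta < 1$ (shrink the step); an error below it yields $\eta > 1$ (grow the step).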
2 changes: 1 addition & 1 deletion tex/sections/complex-step.tex
@@ -28,7 +28,7 @@
\end{equation}
The method of \textit{complex step differentiation} then consists in estimating the gradient as $\text{Im}(L(\theta + i \varepsilon)) / \varepsilon$ for a small value of $\varepsilon$.
Besides the advantage of being a method with precision $\mathcal{O}(\varepsilon^2)$, the complex step method avoids subtractive cancellation errors, so the value of $\varepsilon$ can be reduced almost to machine precision without affecting the calculation of the derivative.
However, a major limitation of this method is that it only applicable to locally complex analytical functions \cite{Martins_Sturdza_Alonso_2003_complex_differentiation} and does not outperform AD (see Sections \ref{section:direct-methods} and \ref{section:recomendations}).
However, a major limitation of this method is that it is only applicable to locally complex analytical functions \cite{Martins_Sturdza_Alonso_2003_complex_differentiation} and does not outperform AD (see Sections \ref{section:direct-methods} and \ref{section:recomendations}).
One additional limitation is that it requires the evaluation of mathematical functions with small complex values, e.g., operations such as $\sin(1 + 10^{-16} i)$, which are not necessarily always computable to high accuracy with modern math libraries.
Extension to higher order derivatives can be obtained by introducing multicomplex variables \cite{Lantoine_Russell_Dargent_2012}.
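As a sketch of the estimate just described (not code from the paper), the complex step derivative $\text{Im}(f(x + i\varepsilon))/\varepsilon$ fits in a few lines of Python; `cmath.sin` below stands in for any locally complex-analytic function:

```python
import cmath
import math

def complex_step_derivative(f, x, eps=1e-200):
    """Estimate f'(x) as Im(f(x + i*eps)) / eps.

    No subtraction occurs, so eps can be pushed near machine precision
    without the cancellation error that plagues finite differences.
    """
    return f(x + 1j * eps).imag / eps

# d/dx sin(x) = cos(x); the estimate matches to machine precision
# even with eps = 1e-200, far below any usable finite-difference step.
d = complex_step_derivative(cmath.sin, 1.0)
```

Note that this relies on `cmath.sin` evaluating accurately at tiny imaginary parts, which is exactly the library-support caveat raised above.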

2 changes: 1 addition & 1 deletion tex/sections/methods-intro.tex
@@ -29,7 +29,7 @@
% However, one has to keep in mind that AD computes the exact derivative of an approximation of the objective and may not yield an approximation to the exact derivatives of the objective (Section \ref{section:forwardAD-sensitivity}).

The distinction between \textit{forward} and \textit{reverse} concerns whether sensitivities are computed for intermediate variables with respect to the input variable or parameter being differentiated (forward) or, on the contrary, the sensitivity of the output variable is computed with respect to each intermediate variable by defining a new adjoint variable (reverse).
Mathematically speaking, this distinction translated to the fact that forward methods compute directional derivatives by mapping sequential mappings between tangent spaces, while reverse methods apply sequential mappings between co-tangent spaces from the direction of the output variable to the input variable (Section \ref{sec:vjp-jvp}).
Mathematically speaking, this distinction translates to the fact that forward methods compute directional derivatives by applying sequential mappings between tangent spaces, while reverse methods apply sequential mappings between co-tangent spaces from the direction of the output variable to the input variable (Section \ref{sec:vjp-jvp}).
In all forward methods the DE is solved sequentially and simultaneously with the directional derivative during the forward pass of the numerical solver.
On the contrary, reverse methods compute the gradient by solving a new problem that moves in the opposite direction to the original numerical solver.
In DE-based models, intermediate variables correspond to intermediate solutions of the DE.
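To make the forward half of this distinction concrete, here is a toy illustration (not from the paper): a minimal dual-number class in Python that propagates a tangent through each operation alongside the primal value, which is exactly the "sequential mappings between tangent spaces" picture:

```python
class Dual:
    """Minimal forward-mode AD value: a primal plus a tangent
    (the directional derivative carried along with it)."""
    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.tan + other.tan)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: each operation maps tangents forward.
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)
    __rmul__ = __mul__

# Directional derivative of f(x) = x*x + 3*x at x = 2, direction 1:
x = Dual(2.0, 1.0)
y = x * x + 3 * x
# y.val == 10.0 and y.tan == 2*2 + 3 == 7.0
```

Reverse mode would instead record the operations and sweep adjoints backwards from `y`; the dual-number sketch above only covers the forward case.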
4 changes: 2 additions & 2 deletions tex/sections/preliminaries.tex
@@ -83,7 +83,7 @@ \subsubsection{Numerical solvers for ODEs}
% The choice of the norm $\frac{1}{\sqrt n} \| \cdot \|_2$ for computing the total error $\text{Err}_\text{scaled}$, sometimes known as Hairer norm, has been the standard for a long time \cite{Ranocha_Dalcin_Parsani_Ketcheson_2022} and it is based on the assumption that a small increase in the size of the systems of ODEs (e.g., by simply duplicating the ODE system) should not affect the stepsize choice, but other options can be considered \cite{hairer-solving-1}.

Modern solvers include stepsize controllers that pick $\Delta t_m$ as large as possible to minimize the total number of steps while preventing large errors by keeping $\text{Err}^m_\text{scaled} \leq 1$.
One of the most used methods to archive this is the proportional-integral controller (PIC) that updates the stepsize according to
One of the most used methods to achieve this is the proportional-integral controller (PIC) that updates the stepsize according to
\begin{equation}
\Delta t_{m} = \eta \, \Delta t_{m-1}
\qquad
@@ -189,7 +189,7 @@ \subsubsection{Gradient-based optimization}
A direct implementation of gradient descent following Equation \eqref{eq:gradient-descent} is prone to converge to a local minimum and slows down in a neighborhood of saddle points.
To address these issues, variants of this scheme employing more advanced updating strategies have been proposed, including Newton-type methods \cite{second-order-optimization}, quasi-Newton methods, acceleration techniques \cite{JMLR:v22:20-207}, and natural gradient descent methods \cite{doi:10.1137/22M1477805}.
For instance, ADAM is an adaptive, momentum-based algorithm that stores the parameter update at each iteration, and determines the next update as a linear combination of the gradient and the previous update \cite{Kingma2014}.
ADAM been widely adopted to train highly parametrized neural networks (up to the order of $10^8$ parameters \cite{NIPS2017_3f5ee243}).
ADAM has been widely adopted to train highly parametrized neural networks (up to the order of $10^8$ parameters \cite{NIPS2017_3f5ee243}).
Other widely employed algorithms are the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm and its limited-memory variant (L-BFGS), which determine the descent direction by preconditioning the gradient with curvature information.
% ADAM is less prone to converge to a local minimum, while (L-)BFGS has a faster convergence rate.
% Using ADAM for the first iterations followed by (L-)BFGS proves to be a successful strategy to minimize a loss function with the best accuracy.
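The ADAM update summarized above can be sketched as follows; this is an illustrative scalar implementation of the standard algorithm of Kingma & Ba with the usual default hyperparameters, not code from the paper:

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One ADAM update for a single scalar parameter (sketch).

    m, v: running first/second moment estimates; t: 1-based step count.
    Returns (new_theta, new_m, new_v).
    """
    m = b1 * m + (1 - b1) * grad          # momentum: mix of gradient and past update
    v = b2 * v + (1 - b2) * grad * grad   # second moment, for adaptive scaling
    m_hat = m / (1 - b1 ** t)             # bias correction of the estimates
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Minimizing f(theta) = theta**2 (gradient 2*theta); theta approaches 0.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 501):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.01)
```

The "linear combination of the gradient and the previous update" mentioned in the text is the first-moment line `m = b1*m + (1-b1)*grad`.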