diff --git a/tex/appendix/AD-norm.tex b/tex/appendix/AD-norm.tex
index 1cbf015..c16ec74 100644
--- a/tex/appendix/AD-norm.tex
+++ b/tex/appendix/AD-norm.tex
@@ -16,7 +16,7 @@
 % Chris on https://discourse.julialang.org/t/default-norm-used-in-ode-error-control/70995/4 mention to this norm to be the defaul since 70s
-The goal of a stepize controller is to pick $\Delta t_{n+1}$ as large as possible (so the solver requires less total steps) at the same time that $\text{Err}_\text{scaled} \leq 1$.
-One of the most used methods to archive this is the proportional-integral controller (PIC) that updates the stepsize according to \cite{hairer-solving-2, Ranocha_Dalcin_Parsani_Ketcheson_2022}
+The goal of a stepsize controller is to pick $\Delta t_{n+1}$ as large as possible (so the solver requires fewer total steps) at the same time that $\text{Err}_\text{scaled} \leq 1$.
+One of the most used methods to achieve this is the proportional-integral controller (PIC) that updates the stepsize according to \cite{hairer-solving-2, Ranocha_Dalcin_Parsani_Ketcheson_2022}
 \begin{equation}
 \Delta t_{n+1} = \eta \, \Delta t_n
 \qquad
diff --git a/tex/sections/complex-step.tex b/tex/sections/complex-step.tex
index 1a2ef2c..abe1e3c 100644
--- a/tex/sections/complex-step.tex
+++ b/tex/sections/complex-step.tex
@@ -28,7 +28,7 @@
 \end{equation}
 The method of \textit{complex step differentiation} consists then in estimating the gradient as $\text{Im}(L(\theta + i \varepsilon)) / \varepsilon$ for a small value of $\varepsilon$.
-Besides the advantage of being a method with precision $\mathcal{O}(\varepsilon^2)$, the complex step method avoids subtracting cancellation error and then the value of $\varepsilon$ can be reduced to almost machine precision error without affecting the calculation of the derivative.
-However, a major limitation of this method is that it only applicable to locally complex analytical functions \cite{Martins_Sturdza_Alonso_2003_complex_differentiation} and does not outperform AD (see Sections \ref{section:direct-methods} and \ref{section:recomendations}).
+Besides the advantage of being a method with precision $\mathcal{O}(\varepsilon^2)$, the complex step method avoids subtractive cancellation errors and thus the value of $\varepsilon$ can be reduced to almost machine precision without affecting the calculation of the derivative.
+However, a major limitation of this method is that it is only applicable to locally complex analytic functions \cite{Martins_Sturdza_Alonso_2003_complex_differentiation} and does not outperform AD (see Sections \ref{section:direct-methods} and \ref{section:recomendations}).
 One additional limitation is that it requires the evaluation of mathematical functions with small complex values, e.g., operations such as $\sin(1 + 10^{-16} i)$, which are not necessarily always computable to high accuracy with modern math libraries.
 Extension to higher order derivatives can be obtained by introducing multicomplex variables \cite{Lantoine_Russell_Dargent_2012}.
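As an illustration of the complex-step formula $\text{Im}(L(\theta + i \varepsilon)) / \varepsilon$ quoted in the hunk above, the short Python sketch below (the toy loss, evaluation point, and step sizes are invented for the example and are not taken from the paper) shows why the step can be pushed far smaller than a finite-difference quotient allows.

import cmath
import math

def L(theta):
    # Invented scalar loss built only from complex-analytic operations.
    return theta ** 3 + cmath.sin(theta)

def dL_exact(theta):
    return 3 * theta ** 2 + math.cos(theta)

theta = 1.5

# Complex-step estimate Im(L(theta + i*eps)) / eps: no subtraction occurs,
# so eps can be taken extremely small without cancellation.
eps = 1e-200
grad_cs = L(theta + 1j * eps).imag / eps

# Forward finite difference for comparison: the subtraction limits how small
# the step h can be before round-off dominates.
h = 1e-8
grad_fd = (L(theta + h) - L(theta)).real / h

print(abs(grad_cs - dL_exact(theta)))  # tiny, near machine precision
print(abs(grad_fd - dL_exact(theta)))  # several orders of magnitude larger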
diff --git a/tex/sections/methods-intro.tex b/tex/sections/methods-intro.tex
index b3e38b7..82b2cfa 100644
--- a/tex/sections/methods-intro.tex
+++ b/tex/sections/methods-intro.tex
@@ -29,7 +29,7 @@
 % However, one has to keep in mind that AD computes the exact derivative of an approximation of the objective and may not yield an approximation to the exact derivatives of the objective (Section \ref{section:forwardAD-sensitivity}).
 The distinction between \textit{forward} and \textit{reverse} regards whether sensitivities are computed for intermediate variables with respect to the input variable or parameter to differentiate (forward) or, on the contrary, we compute the sensitivity of the output variable with respect to each intermediate variable by defining a new adjoint variable (reverse).
-Mathematically speaking, this distinction translated to the fact that forward methods compute directional derivatives by mapping sequential mappings between tangent spaces, while reverse methods apply sequential mappings between co-tangent spaces from the direction of the output variable to the input variable (Section \ref{sec:vjp-jvp}).
+Mathematically speaking, this distinction translates to the fact that forward methods compute directional derivatives by applying sequential mappings between tangent spaces, while reverse methods apply sequential mappings between co-tangent spaces from the direction of the output variable to the input variable (Section \ref{sec:vjp-jvp}).
 In all forward methods the DE is solved sequentially and simultaneously with the directional derivative during the forward pass of the numerical solver.
 On the contrary, reverse methods compute the gradient by solving a new problem that moves in the opposite direction as the original numerical solver.
 In DE-based models, intermediate variables correspond to intermediate solutions of the DE.
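To make the tangent/co-tangent distinction in the hunk above concrete, the following Python sketch hand-codes the Jacobians of an invented two-function composition (none of the functions or matrices come from the paper): forward mode pushes a tangent vector through the chain in evaluation order, whereas reverse mode pulls the output co-tangent back through the same Jacobians in reverse order and recovers the full gradient.

import numpy as np

# Invented composition y = g(f(x)) with f: R^3 -> R^2 linear and g: R^2 -> R.
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0]])

def f(x):
    return A @ x                              # Jacobian of f is simply A

def g(z):
    return np.sin(z[0]) + z[1] ** 2           # gradient of g is [cos(z[0]), 2*z[1]]

x = np.array([0.1, 0.2, 0.3])
z = f(x)
Jf = A                                        # Jacobian of f at x
Jg = np.array([np.cos(z[0]), 2.0 * z[1]])     # Jacobian (row) of g at z

# Forward mode: propagate a tangent vector v alongside the evaluation,
# first through f, then through g (a Jacobian-vector product).
v = np.array([1.0, 0.0, 0.0])
jvp = Jg @ (Jf @ v)

# Reverse mode: seed the scalar output with the co-tangent 1 and pull it
# back, first through g, then through f (a vector-Jacobian product).
w = 1.0
vjp = (w * Jg) @ Jf                           # full gradient of y w.r.t. x

print(jvp)                    # directional derivative along v
print(vjp)                    # gradient; its first component equals jvp
print(np.isclose(vjp[0], jvp))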
diff --git a/tex/sections/preliminaries.tex b/tex/sections/preliminaries.tex
index 6f346b0..fafc006 100644
--- a/tex/sections/preliminaries.tex
+++ b/tex/sections/preliminaries.tex
@@ -83,7 +83,7 @@ \subsubsection{Numerical solvers for ODEs}
 % The choice of the norm $\frac{1}{\sqrt n} \| \cdot \|_2$ for computing the total error $\text{Err}_\text{scaled}$, sometimes known as Hairer norm, has been the standard for a long time \cite{Ranocha_Dalcin_Parsani_Ketcheson_2022} and it is based on the assumption that a small increase in the size of the systems of ODEs (e.g., by simply duplicating the ODE system) should not affect the stepsize choice, but other options can be considered \cite{hairer-solving-1}.
 Modern solvers include stepsize controllers that pick $\Delta t_m$ as large as possible to minimize the total number of steps while preventing large errors by keeping $\text{Err}^m_\text{scaled} \leq 1$.
-One of the most used methods to archive this is the proportional-integral controller (PIC) that updates the stepsize according to
+One of the most used methods to achieve this is the proportional-integral controller (PIC) that updates the stepsize according to
 \begin{equation}
 \Delta t_{m} = \eta \, \Delta t_{m-1}
 \qquad
@@ -189,7 +189,7 @@ \subsubsection{Gradient-based optimization}
 A direct implementation of gradient descent following Equation \eqref{eq:gradient-descent} is prone to converge to a local minimum and slows down in a neighborhood of saddle points.
 To address these issues, variants of this scheme employing more advanced updating strategies have been proposed, including Newton-type methods \cite{second-order-optimization}, quasi-Newton methods, acceleration techniques \cite{JMLR:v22:20-207}, and natural gradient descent methods \cite{doi:10.1137/22M1477805}.
 For instance, ADAM is an adaptive, momentum-based algorithm that stores the parameter update at each iteration, and determines the next update as a linear combination of the gradient and the previous update \cite{Kingma2014}.
-ADAM been widely adopted to train highly parametrized neural networks (up to the order of $10^8$ parameters \cite{NIPS2017_3f5ee243}).
-Other widely employed algorithms are the Broyden–Fletcher–Goldfarb–Shanno (BFGS) and its limited-memory version algorithm (L-BFGS), which determine the descent direction by preconditioning the gradient with curvature information.
+ADAM has been widely adopted to train highly parametrized neural networks (up to the order of $10^8$ parameters \cite{NIPS2017_3f5ee243}).
+Other widely employed algorithms are the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm and its limited-memory version (L-BFGS), which determine the descent direction by preconditioning the gradient with curvature information.
 % ADAM is less prone to converge to a local minimum, while (L-)BFGS has a faster converge rate.
 % Using ADAM for the first iterations followed by (L-)BFGS proves to be a successful strategy to minimize a loss function with best accuracy.
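Since the last hunk describes ADAM's update only in words, the following Python sketch spells out the standard update rule from \cite{Kingma2014} on an invented quadratic toy loss; the hyperparameter defaults shown are the commonly used ones and are not taken from the text.

import numpy as np

def adam(grad, theta0, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=1000):
    """Minimal ADAM loop; hyperparameter defaults are the commonly used ones."""
    theta = np.array(theta0, dtype=float)
    m = np.zeros_like(theta)          # running average of gradients
    v = np.zeros_like(theta)          # running average of squared gradients
    for t in range(1, n_steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # bias corrections for the zero-initialized moments
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Invented quadratic toy loss L(theta) = ||theta - 1||^2 with analytic gradient.
grad = lambda theta: 2.0 * (theta - 1.0)
print(adam(grad, np.zeros(3), lr=0.1))   # approaches [1, 1, 1]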