
Commit

[latex] Patrick and Frank comments, bib format, minor changes to software section (#79)

* Reorganize direct vs solver-based methods section

* Introduce changes from Giles


Co-authored-by: gileshooker <[email protected]>
Co-authored-by: facusapienza21 <[email protected]>

* Updates from Overleaf

* [latex] Incorporated comments from Patrick and Frank


Co-authored-by: heimbach <[email protected]>
Co-authored-by: frankschae <[email protected]>
Co-authored-by: facusapienza21 <[email protected]>

* Create bib format file

---------

Co-authored-by: gileshooker <[email protected]>
Co-authored-by: heimbach <[email protected]>
Co-authored-by: frankschae <[email protected]>
4 people authored Nov 14, 2023
1 parent d4b69f3 commit 73a6277
Showing 15 changed files with 122 additions and 61 deletions.
1 change: 1 addition & 0 deletions tex/appendix/code.tex
@@ -0,0 +1 @@
% Appendix mentioning the code provided in the GitHub repository.
20 changes: 20 additions & 0 deletions tex/bibliography.bib
@@ -692,4 +692,24 @@ @article{griewank2012invented
journal={Documenta Mathematica, Extra Volume ISMP},
volume={389400},
year={2012}
}

@article{Dandekar_2020,
title={A Machine Learning-Aided Global Diagnostic and Comparative Tool to Assess Effect of Quarantine Control in COVID-19 Spread},
volume={1},
ISSN={2666-3899},
DOI={10.1016/j.patter.2020.100145},
number={9},
journal={Patterns},
author={Dandekar, Raj and Rackauckas, Chris and Barbastathis, George},
year={2020},
pages={100145}
}

@article{Bezanson_Karpinski_Shah_Edelman_2012,
title={Julia: A Fast Dynamic Language for Technical Computing},
DOI={10.48550/arxiv.1209.5145},
journal={arXiv},
author={Bezanson, Jeff and Karpinski, Stefan and Shah, Viral B and Edelman, Alan},
year={2012}
}
4 changes: 4 additions & 0 deletions tex/contributors.sty
Expand Up @@ -4,13 +4,17 @@
\author[1]{Facundo Sapienza\thanks{Corresponding author: \texttt{[email protected]}}}
\author[2]{Jordi Bolibar}
\author[3]{Frank Sch\"afer}
\author[6]{Patrick Heimbach}
\author[4]{Giles Hooker}
\author[1]{Fernando Pérez}
\author[5]{Per-Olof Persson}

\affil[1]{Department of Statistics, University of California, Berkeley (USA)}
\affil[2]{TU Delft, Department of Geosciences and Civil Engineering, Delft (Netherlands)}
\affil[3]{CSAIL, Massachusetts Institute of Technology, Cambridge (USA)}
\affil[4]{Department of Statistics and Data Science, University of Pennsylvania (USA)}
\affil[5]{Department of Mathematics, University of California, Berkeley (USA)}
\affil[6]{Department of Earth and Planetary Sciences, Jackson School of Geosciences, University of Texas, Austin (USA)}

% Author font
\renewcommand\Authfont{\normalshape}
32 changes: 12 additions & 20 deletions tex/main.tex
Expand Up @@ -19,7 +19,7 @@
\usepackage{enumitem} % Necessary for enumerating with romans (i), (ii), ...

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%% Customized packages %%%%%
%%%%%% Customized packages %%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Authors information
@@ -28,17 +28,8 @@
\usepackage{jlcode}
% Math
\usepackage{mymath}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%% Bibliography %%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\usepackage[
backend=biber,
style=numeric,
sorting=none
]{biblatex}

% Bibliography %
\usepackage{mybib}
\addbibresource{bibliography.bib}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -115,17 +106,15 @@ \section{Computational implementation}
\label{sec:computational-implementation}
\input{sections/software}

\section{Recommendations}
\input{sections/recommendations}
% \section{Extensions}

\section{Conclusions}
\input{sections/conclusions}
% \section{Recommendations}
% \input{sections/recommendations}

% \section{Do we need full gradients?}

% \section{Generalization to PDEs}
% \section{Conclusions}
% \input{sections/conclusions}

% \section{Open science from scratch}
% \section{Do we need full gradients?}

% \section{Notation}
% \input{sections/notation}
@@ -139,6 +128,9 @@ \section*{Appendices}
\subsection{Lagrangian derivation of adjoints}
\input{appendix/lagrangian}

\subsection{Supplementary code}
\input{appendix/code}

% \section{Glossary}

\newpage
19 changes: 19 additions & 0 deletions tex/mybib.sty
@@ -0,0 +1,19 @@
\ProvidesPackage{mybib}

\usepackage[
backend=biber,
style=numeric-comp,
sorting=nyt,
autocite=superscript
]{biblatex}

% prints author names as small caps
\renewcommand{\mkbibnamefirst}[1]{\textsc{#1}}
\renewcommand{\mkbibnamelast}[1]{\textsc{#1}}
\renewcommand{\mkbibnameprefix}[1]{\textsc{#1}}
\renewcommand{\mkbibnameaffix}[1]{\textsc{#1}}

% Redefine the cite command to be based on autocite with superscript
\renewcommand{\cite}[1]{\autocite{#1}}

\DeclareFieldFormat{labelnumberwidth}{{#1\adddot}}
6 changes: 3 additions & 3 deletions tex/sections/abstract.tex
@@ -1,4 +1,4 @@
Differentiable programming has become a central component in modern machine learning techniques.
In the context of models describe by differential equations, calculation of sensitivities and gradients require careful algebraic and numeric manipulations of the underlying dynamical system.
We aim to summarize some of the most used techniques that exists to compute gradients on numerical models that include numerical solutions of differential equations.
Differentiable programming has become a central component of modern machine learning techniques.
In the context of models described by differential equations, the calculation of sensitivities and gradients requires careful algebraic and numeric manipulations of the underlying dynamical system.
We aim to summarize some of the most used techniques that exist to compute gradients for numerical solutions of differential equations.
We cover this problem by first introducing motivations from current areas of research, such as geophysics; then the mathematical foundations of the different approaches; and finally the computational considerations and solutions that exist in modern scientific software.
5 changes: 3 additions & 2 deletions tex/sections/automatic-differentiation.tex
@@ -1,10 +1,10 @@
Automatic differentiation (AD) is a technology that allows computing gradients through a computer program \cite{griewank2008evaluatingderivatives}.
The main idea is that every computer program manipulating numbers can be reduced to a sequence of simple algebraic operations that can be easily differentiable.
The main idea is that every computer program manipulating numbers can be reduced to a sequence of simple algebraic operations that have straightforward derivative expressions, based upon elementary rules of differentiation.
The derivatives of the outputs of the computer program with respect to their inputs are then combined using the chain rule.
One advantage of AD systems is that we can automatically differentiate programs that include control flow, such as branching, loops or recursions.
This is because at the end of the day, any program can be reduced to a trace of input, intermediate and output variables \cite{Baydin_Pearlmutter_Radul_Siskind_2015}.

Depending if the concatenation of these gradients is done as we execute the program (from input to output) or in a later instance were we trace-back the calculation from the end (from output to input), we are going to talk about \textit{forward} or \textit{backward} AD, respectively.
Depending on whether the concatenation of these gradients is done as we execute the program (from input to output) or in a later instance where we trace back the calculation from the end (from output to input), we are going to talk about \textit{forward} or \textit{reverse} AD, respectively.

\subsubsection{Forward mode}

@@ -140,6 +140,7 @@ \subsubsection{AD connection with JVPs and VJPs}
% On the other side, when the output dimension is larger than the input space dimension, forwards AD is more efficient.
% This is the reason why in most machine learning application people use backwards AD.
However, notice that backwards mode AD requires us to save the solution through the forward run in order to run backwards afterwards \cite{Bennett_1973}, while in forward mode we can just evaluate the gradient as we iterate our sequence of functions.
This problem can be overcome with a good checkpointing scheme, something we will discuss later.
This means that for problems with a small number of parameters, forward mode can be faster and more memory-efficient than backwards AD.

\begin{figure}
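The forward-mode idea sketched in this section can be illustrated with a few lines of Julia using dual numbers; the `Dual` type and the overloads below are a minimal, illustrative sketch rather than the implementation discussed in the paper.

```julia
# Forward-mode AD via dual numbers: each value carries a tangent, and the
# overloaded elementary operations propagate derivatives by the chain rule.
struct Dual
    val::Float64   # primal value
    der::Float64   # derivative (tangent) carried alongside the value
end

Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)
Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.der * b.val + a.val * b.der)
Base.sin(a::Dual) = Dual(sin(a.val), cos(a.val) * a.der)

# Differentiate f(x) = x * sin(x) at x = 2 by seeding the tangent with 1.
f(x) = x * sin(x)
y = f(Dual(2.0, 1.0))
println(y.der)                      # forward-mode derivative
println(sin(2.0) + 2.0 * cos(2.0))  # analytic value for comparison
```

Seeding the tangent of one input with 1 yields a directional derivative in a single forward pass and requires no storage of intermediate values, which is why forward mode is attractive when the number of parameters is small.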
2 changes: 1 addition & 1 deletion tex/sections/community-statement.tex
@@ -6,6 +6,6 @@
We hope this encourages new people to be an active part of the ecosystem, by using and developing open-source tools.
This work was done under the premise of \textbf{open-science from scratch}, meaning that all the contents of this work, both code and text, have been in the open from the beginning and that any interested person can contribute to the project.
You can contribute directly to the GitHub repository
\url{github.com/ODINN-SciML/DiffEqSensitivity-Review}
\url{github.com/ODINN-SciML/DiffEqSensitivity-Review}.
}
\end{quote}
8 changes: 4 additions & 4 deletions tex/sections/finite-differences.tex
@@ -13,14 +13,14 @@
leads also to a more precise estimation of the derivative.
While Equation \eqref{eq:finite_diff} gives an error of magnitude $\mathcal O (\varepsilon)$, the centered differences scheme improves this to $\mathcal O (\varepsilon^2)$ \cite{ascher2008-numerical-methods}.

However, there are a series of problems associated to this approach.
However, there are a series of problems associated with this approach.
The first one is due to how this scales with the number of parameters $p$.
Each directional derivative requires the evaluation of the loss function $L$ twice.
For the centered differences approach in Equation \eqref{eq:finite_diff2}, this requires a total of $2p$ function evaluations, which at the same time demands to solve the differential equation in forward mode each time for a new set of parameters.
For the centered differences approach in Equation \eqref{eq:finite_diff2}, this requires a total of $2p$ function evaluations, which at the same time demands solving the differential equation in forward mode each time for a new set of parameters.

A second problem is due to rounding errors.
Every computer ultimately stores and manipulate numbers using floating points arithmetic \cite{Goldberg_1991_floatingpoint}.
Equations \eqref{eq:finite_diff} and \eqref{eq:finite_diff2} involve the subtraction of two numbers that are very close to each other, which leads to large cancellation errors for small values of $\varepsilon$ than are amplified by the division by $\varepsilon$.
Every computer ultimately stores and manipulates numbers using floating points arithmetic \cite{Goldberg_1991_floatingpoint}.
Equations \eqref{eq:finite_diff} and \eqref{eq:finite_diff2} involve the subtraction of two numbers that are very close to each other, which leads to large cancellation errors for small values of $\varepsilon$ that are amplified by the division by $\varepsilon$.
On the other hand, large values of the stepsize give inaccurate estimations of the gradient.
Finding the optimal value of $\varepsilon$ that trades off these two effects is sometimes called the \textit{stepsize dilemma} \cite{mathur2012stepsize-finitediff}.
Due to this, some heuristics and algorithms have been introduced in order to pick the value of $\varepsilon$ \cite{mathur2012stepsize-finitediff, BARTON_1992_finite_diff, SUNDIALS-hindmarsh2005sundials},
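The stepsize dilemma discussed in this section can be reproduced with a short Julia experiment; the function and stepsizes below are illustrative choices, not taken from the paper.

```julia
# Centered differences: the error first decreases as O(ε²) and then grows
# again once floating-point cancellation in f(x + ε) - f(x - ε) dominates.
f(x)  = sin(x)
df(x) = cos(x)          # analytic derivative used as a reference
x0 = 1.0

centered_diff(f, x, ε) = (f(x + ε) - f(x - ε)) / (2ε)

for ε in (1e-1, 1e-4, 1e-8, 1e-12, 1e-15)
    err = abs(centered_diff(f, x0, ε) - df(x0))
    println("ε = $ε    error ≈ $err")
end
```

Running this shows the error shrinking to a minimum around an intermediate stepsize and then growing again for very small ε, which is exactly the trade-off behind the heuristics cited above.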
19 changes: 10 additions & 9 deletions tex/sections/introduction.tex
@@ -1,16 +1,16 @@
% General statement: why gradients are important?
Evaluating how the value of a function changes with respect to its arguments and parameters plays a central role in optimization, sensitivity analysis, Bayesian inference, and uncertainty quantification, among many.
Modern machine learning applications require the use of gradients to explore and exploit more efficiently the space of parameters.
Evaluating how the value of a function changes with respect to its arguments and parameters plays a central role in optimization, sensitivity analysis, Bayesian inference, inverse methods, and uncertainty quantification, among many others.
Modern machine learning applications require the use of gradients to explore and exploit more efficiently the space of parameters (e.g., the weights of a neural network).
When optimizing a loss function, gradient-based methods (for example, gradient descent and its many variants \cite{ruder2016overview-gradient-descent}) are more efficient at finding a minimum and converge to it faster than gradient-free methods.
When numerically computing the posterior of a probabilistic model, gradient-based sampling strategies converge faster to the posterior distribution than gradient-free methods.
Second derivatives further help to improve the convergence rates of these algorithms and enable uncertainty quantification around parameter values.
Hessians further help to improve the convergence rates of these algorithms and enable uncertainty quantification around parameter values.
\textit{A gradient serves as a compass in modern data science: it tells us in which direction in the wide open ocean of parameters we should move in order to increase our chances of success}.

% Differential Programming
Dynamical systems governed by differential equations are not an exception to the rule.
Differential equations play a central role in describing the behaviour of systems in natural and social sciences.
Some authors have recently suggested differentiable programming as the bridge between modern machine learning methods and traditional scientific models \cite{Ramsundar_Krishnamurthy_Viswanathan_2021, Shen_diff_modelling, Gelbrecht-differential-programming-Earth}.
Being able to compute gradients and sensitivities of dynamical systems opens the door to more complex models.
Being able to compute gradients or sensitivities of dynamical systems opens the door to more complex models.
This is very appealing in geophysical models, where there is a broad literature on physical models and a long tradition in numerical methods.
The first goal of this work is to introduce some of the applications of this emerging technology and to motivate its incorporation for the modelling of complex systems in the natural and social sciences.
\begin{quote}
@@ -25,8 +25,9 @@
In numerical analysis, sensitivities quantify how the solution of a differential equation fluctuates with respect to certain parameters.
This is particularly useful in optimal control theory \cite{Giles_Pierce_2000}, where the goal is to find the optimal value of some control (e.g. the shape of a wing) that minimizes a given loss function.
In recent years, there has been an increasing interest in designing machine learning workflows that include constraints in the form of differential equations.
Examples of this include Physics-Informed Neural Networks (PINNs) \cite{PINNs_2019} and Universal Differential Equations (UDEs) \cite{rackauckas2020universal}.
Furthermore, numerical solvers are used as forward models in the case of Neural ordinary differential equations \cite{chen_neural_2019}.
Examples of this include methods that numerically solve differential equations \cite{PINNs_2019}, as well as methods that augment and learn parts of the differential equation \cite{rackauckas2020universal, Dandekar_2020}.
% Examples of this include Physics-Informed Neural Networks (PINNs) \cite{PINNs_2019} and Universal Differential Equations (UDEs) \cite{rackauckas2020universal}.
Furthermore, numerical solvers are used as forward models in the case of neural ordinary differential equations \cite{chen_neural_2019}.

% soft / hard constrains

@@ -63,11 +64,11 @@
\textit{What are the advantages and disadvantages of different differentiation methods and how can I incorporate them in my research?}
\end{quote}
Despite the fact that these methods can be (in principle) implemented in different programming languages, here we decided to use the Julia programming language for the different examples.
Julia is a recently new but mature programming language that has already a large tradition in implementing packages aiming to advance differentiable programming \cite{Julialang_2017}.
Julia is a recent but mature programming language that already has a long tradition of implementing packages aimed at advancing differentiable programming \cite{Bezanson_Karpinski_Shah_Edelman_2012, Julialang_2017}.

% The need to introduce all this methods in a common framework
% The need to introduce all these methods in a common framework
Without aiming to provide an extensive and specialized review of the field, we believe this study will be useful to other researchers working on problems that combine optimization and sensitivity analysis with differential equations.
Differentiable programming is opening new ways of doing research across sciences.
Differentiable programming is opening new ways of doing research across the sciences, and we need close collaboration between domain scientists, methodological scientists, and computer scientists in order to develop successful, scalable, practical, and efficient frameworks for real-world applications.
As we make progress in the use of these tools, new methodological questions start to emerge.
How do these methods compare? How can they be improved?
We also hope this paper serves as a gateway to new questions regarding advances in these methods.
4 changes: 2 additions & 2 deletions tex/sections/methods-intro.tex
@@ -8,10 +8,10 @@
These methods can be roughly classified as:
\begin{itemize}
\item \textit{Discrete} vs \textit{continuous} methods
\item \textit{Forward} vs \textit{backwards} methods
\item \textit{Forward} vs \textit{backward} methods
\end{itemize}
The first difference concerns whether the method for computing the gradient is based on the manipulation of atomic operations that are easy to differentiate by repeated application of the chain rule (discrete), as opposed to approximating the gradient as the numerical solution of a new set of differential equations (continuous).
Another way of conceptualizing this difference is by comparing them with the discretize-differentiate and differentiate-discretize approaches \cite{bradley2013pde, Onken_Ruthotto_2020, FATODE2014}.
Another way of conceptualizing this difference is by comparing them with the discretize-differentiate and differentiate-discretize approaches \cite{bradley2013pde, Onken_Ruthotto_2020, FATODE2014, Sirkes_Tziperman_1997}.
We can either discretize the original system of ODEs in order to numerically solve it and then define the set of adjoint equations on top of the numerical scheme; or instead define the adjoint equation directly using the differential equation and then discretize both in order to solve \cite{Giles_Pierce_2000}.

The second distinction is related to the fact that some methods compute gradients by solving a new sequential problem that may move in the same direction as the original numerical solver - i.e. moving forward in time - or, instead, they solve a new system that goes backwards in time.
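As a concrete illustration of the continuous (differentiate-then-discretize) forward approach described above, the Julia sketch below augments a scalar ODE with its sensitivity equation and integrates both forward in time with an explicit Euler step; the example ODE, the helper functions, and the hand-rolled solver are illustrative assumptions, not the methods reviewed in the paper.

```julia
# Continuous forward sensitivity analysis for du/dt = f(u, θ):
# the sensitivity s = ∂u/∂θ satisfies ds/dt = (∂f/∂u) s + ∂f/∂θ,
# which is solved forward in time alongside the state u.
# Example: du/dt = -θ u with u(0) = 1, so u(t) = exp(-θ t) and ∂u/∂θ = -t exp(-θ t).

f(u, θ)     = -θ * u
df_du(u, θ) = -θ
df_dθ(u, θ) = -u

function forward_sensitivity(θ; u0 = 1.0, T = 1.0, N = 10_000)
    dt = T / N
    u, s = u0, 0.0                 # s(0) = ∂u(0)/∂θ = 0
    for _ in 1:N                   # explicit Euler on the augmented system
        du = f(u, θ)
        ds = df_du(u, θ) * s + df_dθ(u, θ)
        u += dt * du
        s += dt * ds
    end
    return u, s
end

u, s = forward_sensitivity(0.5)
println(s)                          # ≈ -exp(-0.5) ≈ -0.6065
```

Because the sensitivity system is integrated in the same direction as the original solver, this is a forward method in the classification above; a backward (adjoint) method would instead solve a new system backwards in time after the forward solve.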
