Commit

Small changes and expansion of bibliography (#87)
* [build-bib] Included paragraph on static-dynamic graph for AD


Co-authored-by: frankschae <[email protected]>
Co-authored-by: facusapienza21 <facusapienza21>

* 📁 Automated biblatex merge [build-latex]

Signed-off-by: My GitHub Actions Bot <41898282+github-actions[bot]@users.noreply.github.com>

* Updates from Overleaf

* [build-bib] try to solve backslash multiplication

* 📁 Automated biblatex merge [build-latex]

Signed-off-by: My GitHub Actions Bot <41898282+github-actions[bot]@users.noreply.github.com>

* Motivation and some references

* [build-bib] Changes and recompiling biblatex files

* 📁 Automated biblatex merge [build-latex]

Signed-off-by: My GitHub Actions Bot <41898282+github-actions[bot]@users.noreply.github.com>

* [build-bib] trigger bib creation

* 📁 Automated biblatex merge [build-latex]

Signed-off-by: My GitHub Actions Bot <41898282+github-actions[bot]@users.noreply.github.com>

---------

Signed-off-by: My GitHub Actions Bot <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: frankschae <[email protected]>
Co-authored-by: Author <[email protected]>
3 people authored Jan 15, 2024
1 parent 9493998 commit 34b2d28
Showing 12 changed files with 863 additions and 489 deletions.
5 changes: 5 additions & 0 deletions .utils/biblatex_merger.py
@@ -48,6 +48,11 @@ def merge_bib(path):
tex_files = find_extension(".tex", ".")
bib_files = find_extension(".bib", ".")

# Filter .ipynb files
bib_files = list(filter(lambda x : '.ipynb' not in x, bib_files))
# Put output file at the end to avoid backslash plague
bib_files = sorted(bib_files, key = lambda x : 'bibliography.bib' in x)

# Attach all the text in files to a single string
latex = ""
authors = []
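A minimal, hypothetical illustration (not part of this commit) of how the boolean sort key above pushes the merged output file to the end of the list; the file names are made up for the example:

# Hypothetical list of .bib files found in the repository
bib_files = ["tex/bibs/bibliography-facu.bib", "tex/bibliography.bib", "tex/bibs/extra.bib"]

# Drop notebook artifacts, then sort with a boolean key:
# sorted() is stable and False < True, so only paths containing
# 'bibliography.bib' (the merged output) move to the end.
bib_files = [f for f in bib_files if ".ipynb" not in f]
bib_files = sorted(bib_files, key=lambda f: "bibliography.bib" in f)

print(bib_files)
# ['tex/bibs/bibliography-facu.bib', 'tex/bibs/extra.bib', 'tex/bibliography.bib']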
1,083 changes: 670 additions & 413 deletions tex/bibliography.bib

Large diffs are not rendered by default.

68 changes: 68 additions & 0 deletions tex/bibs/bibliography-facu.bib
@@ -880,4 +880,72 @@ @article{Betancourt_2017
journal={arXiv},
author={Betancourt, Michael},
year={2017}
}

@inproceedings{abadi-tensorflow,
author = {Abadi, Mart\'{\i}n and Barham, Paul and Chen, Jianmin and Chen, Zhifeng and Davis, Andy and Dean, Jeffrey and Devin, Matthieu and Ghemawat, Sanjay and Irving, Geoffrey and Isard, Michael and Kudlur, Manjunath and Levenberg, Josh and Monga, Rajat and Moore, Sherry and Murray, Derek G. and Steiner, Benoit and Tucker, Paul and Vasudevan, Vijay and Warden, Pete and Wicke, Martin and Yu, Yuan and Zheng, Xiaoqiang},
title = {TensorFlow: A System for Large-Scale Machine Learning},
year = {2016},
isbn = {9781931971331},
publisher = {USENIX Association},
address = {USA},
booktitle = {Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation},
pages = {265--283},
numpages = {19},
location = {Savannah, GA, USA},
series = {OSDI'16}
}

@article{McGreivy_stellarator_2021,
title={Optimized finite-build stellarator coils using automatic differentiation},
volume={61},
ISSN={0029-5515},
DOI={10.1088/1741-4326/abcd76},
journal={Nuclear Fusion},
author={McGreivy, N. and Hudson, S.R. and Zhu, C.},
year={2021},
pages={026020}
}

@article{allaire2015review,
title={A review of adjoint methods for sensitivity analysis, uncertainty quantification and optimization in numerical codes},
author={Allaire, Gr{\'e}goire},
journal={Ing{\'e}nieurs de l'Automobile},
volume={836},
pages={33--36},
year={2015}
}

@book{lions1971optimal,
title={Optimal control of systems governed by partial differential equations},
author={Lions, Jacques Louis},
volume={170},
year={1971},
publisher={Springer}
}

@inproceedings{pironneau2005optimal,
title={Optimal shape design for elliptic systems},
author={Pironneau, Olivier},
booktitle={System Modeling and Optimization: Proceedings of the 10th IFIP Conference New York City, USA, August 31--September 4, 1981},
pages={42--66},
year={2005},
organization={Springer}
}

@article{allaire2014shape,
title={Shape optimization with a level set based mesh evolution method},
author={Allaire, Gr{\'e}goire and Dapogny, Charles and Frey, Pascal},
journal={Computer Methods in Applied Mechanics and Engineering},
volume={282},
pages={22--53},
year={2014},
publisher={Elsevier}
}

@book{mohammadi2009applied,
title={Applied shape optimization for fluids},
author={Mohammadi, Bijan and Pironneau, Olivier},
year={2009},
publisher={OUP Oxford}
}
3 changes: 3 additions & 0 deletions tex/contributors.sty
@@ -8,13 +8,16 @@
\author[4]{Giles Hooker}
\author[1]{Fernando Pérez}
\author[5]{Per-Olof Persson}
\author[6,7]{Christopher Rackauckas}

\affil[1]{Department of Statistics, University of California, Berkeley (USA)}
\affil[2]{TU Delft, Department of Geosciences and Civil Engineering, Delft (Netherlands)}
\affil[3]{CSAIL, Massachusetts Institute of Technology, Cambridge (USA)}
\affil[4]{Department of Statistics and Data Science, University of Pennsylvania (USA)}
\affil[5]{Department of Mathematics, University of California, Berkeley (USA)}
\affil[6]{Department of Earth and Planetary Sciences, Jackson School of Geosciences, University of Texas, Austin (USA)}
\affil[7]{Massachusetts Institute of Technology, Cambridge (USA)}
\affil[8]{JuliaHub, Cambridge (USA)}

% Author font
\renewcommand\Authfont{\normalshape}
3 changes: 2 additions & 1 deletion tex/mybib.sty
@@ -6,7 +6,8 @@
sorting=nyt,
autocite=superscript,
% firstinits=true
giveninits=true
giveninits=true,
maxnames=6
]{biblatex}

% prints author names as small caps
5 changes: 3 additions & 2 deletions tex/sections/adjoint-state.tex
@@ -1,6 +1,6 @@
Also known as the adjoint state method, it is another example of a discrete method that aims to find the gradient by solving an alternative system of linear equations, known as the \textit{adjoint equations}, at the same time that we solve the original system of linear equations defined by the numerical solver.
These methods are extremely popular in optimal control theory in fluid dynamics, for example for the design of geometries for vehicles and airplanes that optimize performance \cite{Elliott_Peraire_1996, Giles_Pierce_2000}.
This approach follows the discretize-optimize approach, meaning that we first discretize the system of continuous ODEs and then solve on top of these linear equations \cite{Giles_Pierce_2000}.
This follows the discretize-optimize approach, meaning that we first discretize the system of continuous ODEs and then solve the adjoint problem on top of the resulting discrete equations \cite{Giles_Pierce_2000, allaire2015review}.

% Just as in the case of automatic differentiation, the adjoint state method evaluates the gradient by moving forward in time and applying the chain rule sequentially over a discrete set of operations that dictate the updates by the numerical scheme for solving the differential equation. However, it does so by directly computing the gradient by solving a new system of equations.

@@ -44,7 +44,7 @@
This system can be solved sequentially by solving for $u_i$ in increasing order of index using Newton's method.
If we call the super-vector $U = (u_1, u_2, \ldots, u_N) \in \R^{nN}$, we can combine all these equations into one single system of equations $G(U) = (g_1(u_1; \theta), \ldots, g_N(u_N; \theta)) = 0$.

In the simplest case where the algebraic set of equations is linear and we can write $g_i (u_{i+1}; \theta) = u_{i+1} - A_i (\theta) \, u_i - b_i$ with $A_i \in \R^{n \times n}$ and $b_i \in \R^n$ defined by the numerical solver, the condition $G(U)=0$ simplifies to the linear system of equations
In the simplest case of linear one-step methods, where $g_i (u_{i+1}; \theta) = u_{i+1} - A_i (\theta) \, u_i - b_i$ with $A_i \in \R^{n \times n}$ and $b_i \in \R^n$ defined by the numerical solver, the condition $G(U)=0$ simplifies to the linear system of equations
\begin{equation}
A(\theta) U
=
@@ -154,6 +154,7 @@
The important trick to notice here is the rearrangement of the multiplicative terms involved in equation \eqref{eq:dhdtheta}. Computing the full Jacobian/sensitivity $\partial u / \partial \theta$ will be computationally expensive and involves the product of two matrices. However, we are not interested in the calculation of the Jacobian, but instead in the VJP given by $\frac{\partial L}{\partial U} \frac{\partial U}{\partial \theta}$. By rearranging these terms, we can make the same computation more efficient.

Notice that the algebraic equation of the adjoint $\lambda$ in Equation \eqref{eq:adjoint-state-equation} is a linear system of equations even when the original system $G(U)=0$ was not necessarily linear in $U$.
This means that while the forward solve may require multiple iterative calls to solve the non-linear system $G(U) = 0$ using Krylov methods, the backward step to compute the adjoint amounts to solving a single linear system of equations.
For the linear system of discrete equations $G(U; \theta)=A(\theta) U - b(\theta)=0$, we have \cite{Johnson}
\begin{equation}
\frac{\partial G}{\partial \theta}
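A minimal sketch (not part of this commit) of the discrete adjoint trick described in tex/sections/adjoint-state.tex above, assuming NumPy and a made-up linear system $G(U; \theta) = A(\theta) U - b = 0$ with a single scalar parameter; the gradient of the loss comes from one backward linear solve instead of forming the full sensitivity $\partial U / \partial \theta$:

import numpy as np

# Made-up linear system G(U; theta) = A(theta) U - b = 0 with scalar theta.
n = 5
rng = np.random.default_rng(0)
A0 = rng.normal(size=(n, n)) + n * np.eye(n)   # base matrix, kept well conditioned
A1 = rng.normal(size=(n, n))                   # dA/dtheta
b = rng.normal(size=n)

def A(theta):
    return A0 + theta * A1

def loss(U):
    return 0.5 * np.sum(U**2)                  # so that dL/dU = U

theta = 0.3
U = np.linalg.solve(A(theta), b)               # forward solve: G(U; theta) = 0

# Adjoint solve: a single linear system with A(theta)^T, whose cost does not
# grow with the number of parameters.
lam = np.linalg.solve(A(theta).T, U)

# Gradient via the adjoint: dL/dtheta = -lambda^T (dG/dtheta) = -lambda^T (dA/dtheta @ U)
grad_adjoint = -lam @ (A1 @ U)

# Cross-check against a crude finite-difference approximation
eps = 1e-6
grad_fd = (loss(np.linalg.solve(A(theta + eps), b)) - loss(U)) / eps
print(grad_adjoint, grad_fd)                   # the two values should closely agree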
17 changes: 2 additions & 15 deletions tex/sections/automatic-differentiation.tex
@@ -50,7 +50,7 @@ \subsubsection{Forward mode}
\end{align}
When computing first order derivatives, we can ignore everything of order $\epsilon^2$ or larger, which is represented in the condition $\epsilon^2 = 0$.
This implies that we can use dual numbers to implement forward AD through a numerical algorithm.
We will explore how this is carried in Section \ref{sec:computational-implementation}.
In Section \ref{sec:computational-implementation} we will explore how this is implemented and compare this approach with complex-step differentiation.

\vspace*{10px}
\noindent \textbf{\textit{Computational graph}}
@@ -94,19 +94,6 @@ \subsubsection{Backward mode}
Here we have introduced the notation $\bar{\omega} = \frac{\partial \ell}{\partial \omega}$ to indicate that the partial derivative is always taken of the final loss function with respect to the different program variables, a quantity sometimes referred to as the adjoint (not to be confused with the adjoint method of later sections) or cotangent.
Since in backwards AD the values of $\bar \omega$ are updated in reverse order, evaluating the terms $\frac{\partial \omega}{\partial v}$ requires knowing the state value of all the argument variables $v$ of $\omega$, which therefore need to be stored in memory during the forward evaluation of the function so that backward AD can be applied.

Another way of implementing backwards AD is by defining a \textit{pullback} function \cite{Innes_2018}, a method also known as \textit{continuation-passing style} \cite{Wang_Zheng_Decker_Wu_Essertel_Rompf_2019}.
In the backward step, this executes a series of function calls, one for each elementary operation.
If one of the nodes in the graph $w$ is the output of an operation involving the nodes $v_1, \ldots, v_m$, where $v_i \rightarrow w$ are all nodes in the graph, then the pullback $\bar v_1, \ldots, \bar v_m = \mathcal B_w(\bar w)$ is a function that accepts gradients with respect to $w$ (defined as $\bar w$) and returns gradients with respect to each $v_i$ ($\bar v_i$) by applying the chain rule.
Consider the example of the multiplicative operation $w = v_1 \times v_2$. Then
\begin{equation}
\bar v_1, \bar v_2 = v_2 \times \bar w , \quad
v_1 \times \bar w = \mathcal{B}_w (\bar w),
\end{equation}
which is equivalent to using the chain rule as
\begin{equation}
\frac{\partial \ell}{\partial v_1} = \frac{\partial}{\partial v_1}(v_1 \times v_2) \frac{\partial \ell}{\partial w}.
\end{equation}


\subsubsection{AD connection with JVPs and VJPs}

@@ -140,7 +127,7 @@ \subsubsection{AD connection with JVPs and VJPs}
% On the other side, when the output dimension is larger than the input space dimension, forwards AD is more efficient.
% This is the reason why in most machine learning application people use backwards AD.
However, notice that backwards mode AD requires us to save the solution through the forward run in order to run backwards afterwards \cite{Bennett_1973}, while in forward mode we can just evaluate the gradient as we iterate our sequence of functions.
This problem can be overcome with a good checkpointing scheme, somewhat we will discuss later.
This problem can be mitigated, to some extent, with a good checkpointing scheme, which we will discuss later.
This means that for problems with a small number of parameters, forward mode can be faster and more memory-efficient than backwards AD.

\begin{figure}[t]
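A minimal sketch (not part of this commit) of the two modes discussed in tex/sections/automatic-differentiation.tex: forward mode via a toy dual-number type, and reverse mode via a hand-written pullback for the product $w = v_1 \times v_2$; this is illustrative only and does not correspond to any particular AD package:

from dataclasses import dataclass

@dataclass
class Dual:
    """Toy dual number a + b*eps with eps**2 = 0 (forward-mode AD)."""
    val: float  # primal value
    der: float  # tangent (derivative) carried alongside the value

    def __add__(self, other):
        return Dual(self.val + other.val, self.der + other.der)

    def __mul__(self, other):
        # (a + b eps)(c + d eps) = ac + (ad + bc) eps, since eps**2 = 0
        return Dual(self.val * other.val,
                    self.val * other.der + self.der * other.val)

def f(x, y):
    return x * y + x * x

# Forward mode: seed the tangent of the variable we differentiate with respect to.
x, y = Dual(3.0, 1.0), Dual(2.0, 0.0)
print(f(x, y).der)   # df/dx at (3, 2) = y + 2x = 8.0

# Reverse mode: a pullback for w = v1 * v2 maps the adjoint w_bar back to
# the adjoints (v1_bar, v2_bar) of the arguments via the chain rule.
def pullback_mul(v1, v2):
    w = v1 * v2
    def B(w_bar):
        return v2 * w_bar, v1 * w_bar
    return w, B

w, B = pullback_mul(3.0, 2.0)
print(B(1.0))        # (2.0, 3.0), i.e. dw/dv1 = v2 and dw/dv2 = v1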
2 changes: 1 addition & 1 deletion tex/sections/finite-differences.tex
@@ -33,7 +33,7 @@
\begin{equation}
Jv \approx \frac{f(u + \varepsilon v, \theta, t) - f(u, \theta, t)}{\varepsilon}
\end{equation}
This approach is used in numerical solvers based on Krylow methods, where linear systems are solved by iterative solving matrix-vectors products \cite{Ipsen_Meyer_1998}.
This approach is used in numerical solvers based on Krylov methods, where linear systems are solved by iteratively computing matrix-vector products \cite{Ipsen_Meyer_1998}.

% Replacing derivatives by finite differences is also a common practice when solving partial differential equations (PDEs), a technique known as the \textit{method of lines} \cite{ascher2008numerical}.
% To illustrate this point, let us consider the case of the one-dimensional heat equation
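A minimal sketch (not part of this commit, assuming NumPy) of the matrix-free Jacobian-vector product approximation $Jv \approx (f(u + \varepsilon v) - f(u)) / \varepsilon$ mentioned in tex/sections/finite-differences.tex, using a made-up right-hand side $f$:

import numpy as np

def f(u):
    # Made-up non-linear right-hand side; parameters and time are omitted here.
    return np.array([u[0]**2 + u[1], np.sin(u[0]) * u[1]])

def jvp_fd(f, u, v, eps=1e-7):
    """Matrix-free finite-difference approximation of J(u) @ v."""
    return (f(u + eps * v) - f(u)) / eps

u = np.array([1.0, 2.0])
v = np.array([0.5, -1.0])

# Exact Jacobian at u, written out for comparison only:
# J = [[2*u0, 1], [cos(u0)*u1, sin(u0)]]
J = np.array([[2 * u[0], 1.0],
              [np.cos(u[0]) * u[1], np.sin(u[0])]])

print(jvp_fd(f, u, v))   # approximately equal to J @ v
print(J @ v)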
16 changes: 7 additions & 9 deletions tex/sections/introduction.tex
@@ -2,10 +2,10 @@
Evaluating how the value of a function changes with respect to its arguments and parameters plays a central role in optimization, sensitivity analysis, Bayesian inference, inverse methods, and uncertainty quantification, among many others \cite{Razavi.2021}.
Modern machine learning applications require the use of gradients to efficiently exploit the high-dimensional space of parameters to be inferred or learned (e.g., weights of a neural network).
The same is true for partial differential equation (PDE) constrained optimization, which infers uncertain (or unknown) parameters from (generally sparse) data \cite{Ghattas.2021}.
When optimizing an objective function (loss function in machine learning or cost function in inverse modeling), gradient-based methods (for example, gradient descent and its many variants \cite{ruder2016overview-gradient-descent}) are more efficient at finding a minimum and converge faster to them than gradient-free methods
Furthermore, the \textit{curse of dimensionality} renders gradient-free optimization and sampling methods computationally intractable for most large-scale problems \cite{Oden:2010tv})
When optimizing an objective function (loss function in machine learning or cost function in inverse modeling), gradient-based methods (for example, gradient descent and its many variants \cite{ruder2016overview-gradient-descent}) are more efficient and converge faster than gradient-free methods.
Furthermore, the \textit{curse of dimensionality} renders gradient-free optimization and sampling methods computationally intractable for most large-scale problems \cite{Oden:2010tv}.
When numerically computing the posterior of a probabilistic model, gradient-based sampling strategies converge faster to the posterior distribution than gradient-free methods.
When numerically computing the posterior of a probabilistic model, gradient-based sampling strategies tend to converge faster to the posterior distribution than gradient-free methods.
Hessians further help to improve the convergence rates of these algorithms and enable uncertainty quantification around parameter values \cite{BuiThanh:2012ul}.
\textit{A gradient serves as a compass in modern data science: it tells us in which direction in the vast, open ocean of parameters we should move towards in order to increase our chances of success}.

@@ -14,7 +14,7 @@
Differential equations play a central role in describing the behaviour of systems in natural and social sciences.
Some authors have recently suggested differentiable programming as the bridge between modern machine learning methods and traditional scientific models \cite{Ramsundar_Krishnamurthy_Viswanathan_2021, Shen_diff_modelling, Gelbrecht-differential-programming-Earth}.
Being able to compute gradients or sensitivities of dynamical systems opens the door to more complex models.
This is very appealing in geophysical models, where there is a broad literature on physical models and a long tradition in numerical methods.
This is very appealing in fields like computational physics, geophysics, and biology (to mention a few) where there is a broad literature on physical models and a long tradition in numerical methods.
The first goal of this work is to introduce some of the applications of this emerging technology and to motivate its incorporation for the modelling of complex systems in the natural and social sciences.
\begin{quote}
\textbf{Question 1. }
@@ -56,21 +56,19 @@
The premise is simple: every computer program, including a numerical solver, is ultimately an algorithm described by a chain of elementary algebraic operations (addition, multiplication) that are individually easy to differentiate and whose composition can be differentiated using the chain rule \cite{Giering:1998in}.
Although many modern differentiation tools use AD to some extent, there is also a family of methods that compute the gradient by relying on an auxiliary set of differential equations.
We are going to refer to this family of methods as \textit{continuous}, and we will give them special treatment in later sections to distinguish them from the discrete algorithms that more closely resemble pure AD.
We emphasize that AD is merely a \emph{tool} for generating derivatives of computer code. Alternatively, such code can be (and often has been) generated ``by hand'', e.g., to avoid restrictions incurred by the available AD tool.
Down-sides of hand-written derivative code are, (i) it is error-prone, (ii) it is difficult to keep pace with the development of the parent code, and (iii) [I need to remember ...].
%We emphasize that AD is merely a \emph{tool} for generating derivatives of computer code. Alternatively, such code can be (and often has been) generated ``by hand'', e.g., to avoid restrictions incurred by the available AD tool.
%Down-sides of hand-written derivative code are, (i) it is error-prone, (ii) it is difficult to keep pace with the development of the parent code, and (iii) [I need to remember ...].

The differences between methods arise both from their mathematical formulation and their computational implementation.
The first determines what guarantees we have that the method returns the actual gradient or a good approximation thereof.
The second concerns how the theory is translated to software, and which data structures and algorithms are used to implement it.
Different methods have different computational complexities depending on the number of parameters and differential equations, and these complexities are also balanced between total execution time and required memory.
As coined by Uwe Naumann, \textit{the automatic generation of optimal (in terms of robustness and efficiency) adjoint versions of large-scale simulation code is one of the great open challenges in the field of High-Performance Scientific Computing}\cite{Naumann.2011}.
The third goal of this work, then, is to illustrate the different strengths and weaknesses of these methods, and how to use them in modern scientific software.
\begin{quote}
\textbf{Question 3. }
\textit{What are the advantages and disadvantages of different differentiation methods and how can I incorporate them in my research?}
\end{quote}
Despite the fact that these methods can, in principle, be implemented in different programming languages, here we use the Julia programming language for the examples.
Julia is a recent yet mature programming language with an established ecosystem of packages aiming to advance differentiable programming \cite{Bezanson_Karpinski_Shah_Edelman_2012, Julialang_2017}.
Nevertheless, in reviewing existing work, we will also point to applications developed in other programming languages.

% The need to introduce all these methods in a common framework
Without aiming to provide an extensive and specialized review of the field, we believe this study will be useful to other researchers working on problems that combine optimization and sensitivity analysis with differential equations.