% !TeX spellcheck = en-US
% !TeX encoding = utf8
% !TeX program = pdflatex
% !BIB program = biber
% -*- coding:utf-8 mod:LaTeX -*-
% vv scroll down to line 200 for content vv
\let\ifdeutsch\iffalse
\let\ifenglisch\iftrue
\input{pre-documentclass}
\documentclass[
% fontsize=11pt is the standard
a4paper, % Standard format - only KOMAScript uses paper=a4 - https://tex.stackexchange.com/a/61044/9075
twoside, % we are optimizing for both screen and two-side printing. So the page numbers will jump, but the content is configured to stay in the middle (by using the geometry package)
bibliography=totoc,
% idxtotoc, % add the index to the table of contents
% liststotoc, % add the lists of X to the table of contents; with liststotocnumbered the lists of figures are numbered
headsepline,
cleardoublepage=empty,
parskip=half,
% draft % to see where improvements are still needed - important, since the binding correction is included
draft=false
]{scrbook}
\input{config}
\DeclareFieldFormat{labelalpha}{\mkbibbold{#1}}
\DeclareFieldFormat{extraalpha}{\mkbibbold{\mknumalph{#1}}}
\usepackage[
title={Transformed Sparse Grids for High-Dimensional Models},
author={Christopher Schnick},
type=master,
institute={ipvs},
course={Informatik},
examiner={Prof.\ Dr.\ Dirk Pflüger},
supervisor={Dr.\ Michael Rehme},
startdate={April 26, 2021},
enddate={October 26, 2021},
language=english
]{scientific-thesis-cover}
\makeindex
\begin{document}
% beautify the tex4ht conversion
\iftex4ht
% tell tex4ht to create picures also for formulas starting with '$'
% WARNING: a tex4ht run now takes forever!
\Configure{$}{\PicMath}{\EndPicMath}{}
%$ % <- syntax highlighting fix for emacs
\Css{body {text-align:justify;}}
%conversion of .pdf to .png
\Configure{graphics*}
{pdf}
{\Needs{"convert \csname Gin@base\endcsname.pdf
\csname Gin@base\endcsname.png"}%
\Picture[pict]{\csname Gin@base\endcsname.png}%
}
\fi
%\VerbatimFootnotes % allow verbatim text in footnotes. Normally not possible.
\input{commands}
\setcounter{page}{1}
\pagenumbering{roman}
\Coverpage
% Custom page style for the abstract and the table of contents
\deftripstyle{preamble}{}{}{}{}{}{\pagemark}
% Documentation for deftripstyle: scrguide.pdf
\pagestyle{preamble}
\renewcommand*{\chapterpagestyle}{preamble}
% Abstract
% also in the style of the table of contents
\ifdeutsch
\section*{Kurzfassung}
\else
\section*{Abstract}
\fi
\input{abstract_en.txt}
\cleardoublepage
% BEGIN: Tables of contents and lists
\iftex4ht
\else
\microtypesetup{protrusion=false}
\fi
%%%
% Include the bibliography in the TOC, but only if nothing else helps!
% \addcontentsline{toc}{chapter}{Literaturverzeichnis}
%
% or e.g.
%\addcontentsline{toc}{section}{Abkürzungsverzeichnis}
%
%%%
%Produce table of contents
%
%In case you have trouble with headings reaching into the page numbers, enable the following three lines.
%Hint by http://golatex.de/inhaltsverzeichnis-schreibt-ueber-rand-t3106.html
%
%\makeatletter
%\renewcommand{\@pnumwidth}{2em}
%\makeatother
%
\tableofcontents
\iftex4ht
\else
% Re-enable optical margin alignment and gray value correction
\microtypesetup{protrusion=true}
\fi
% END: Tables of contents and lists
% Headline and footline
\renewcommand*{\chapterpagestyle}{scrplain}
\pagestyle{scrheadings}
\ihead[]{}
\chead[]{}
\ohead[]{\headmark}
\cfoot[]{}
\ofoot[\usekomafont{pagenumber}\thepage]{\usekomafont{pagenumber}\thepage}
\ifoot[]{}
%% vv scroll down for content vv %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% Main content starts here
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Introduction}
\label{chap:c1}
\pagenumbering{arabic}
\setcounter{page}{1}
Computer simulations of complex models are in many cases very computationally expensive, which limits the feasible number of simulation runs given the available computing power.
Because of this limitation, a common approach is to construct a surrogate model and adapt the simulation to work on the surrogate instead.
Done properly, this can massively reduce the required resources, such as runtime and memory, to perform one simulation run.
Discretization methods \cite{Stetter1973} are one such way of constructing a surrogate by discretizing the continuous model domain and approximating the output while trying to keep the introduced discretization error to a minimum.
Full grids, a simple discretization scheme for functions on hypercubes, suffer from the curse of dimensionality \cite{Bellman1961}, as the amount of grid points required to generate a uniform grid grows exponentially with the model dimensionality.
As a result, the number of simulation runs needed to construct a full grid also grows exponentially, severely limiting the viable model dimensionality for full grids.
While all grid-based methods share the common problem of being affected by the curse of dimensionality to some degree, the sparse grid technique \cite{Zenger1991} manages to reduce the impact of the curse of dimensionality and therefore delay the point at which their application becomes infeasible.
To decouple the model dimensionality from the actual sparse grid dimensionality, at least to some extent, various methods for performing a dimensionality reduction on sparse grids are available.
For example, there exist several kinds of adaptive sparse grids, which can eliminate or neglect certain dimensions of the model and tailor the constructed sparse grid to the model.
These grid-based techniques are, however, limited in their effectiveness in general, as the grid structure itself, the associated basis functions, and the adaptivity rules are all axis-aligned.
As we will see later on, axis-aligned basis functions, and therefore all techniques based on them, become rather ineffective when dealing with model functions that primarily consist of structures that are not axis-aligned.
This limitation is used as a motivation to look into ways of transforming the input parameters of the model, such that sparse grid-based methods can be applied more effectively on all kinds of models.
By conducting a parameter study to identify potential starting points and exploring the influence of different input parameters on the model output, we try to construct and apply a transformation function on the model inputs prior to feeding them into sparse grids.
The goal is to obtain transformed input parameters that have a better alignment and optionally a reduced dimensionality, such that we can construct a sparse grid surrogate of a lower dimensionality while managing to maintain a similar approximation error compared to standard sparse grids.
The concrete method presented in this thesis combines the process of determining so-called active subspaces in the inputs \cite{Constantine2015} with dimensionality reduction techniques.
It takes inspiration from already established ideas in data science, such as the concept of intrinsic dimensionality \cite{Bennett1969} and principal component analysis (PCA) \cite{Abdi2010}.
This allows us to transform the original model approximation problem into a lower-dimensional one while still preserving input structures that are suitable for sparse grids.
Over the course of this thesis, we will first introduce and explore the name-giving transformed sparse grid technique and then put it to the test in multiple experiments with high-dimensional models for real-world phenomena afterwards.
\section{Outline}
We start off in \cref{chap:c2} by covering all the fundamentals needed to understand and use sparse grids in conjunction with different kinds of basis functions and spatial adaptivity to create model surrogates.
Furthermore, we will also take a look at other components commonly needed for model approximation tasks, such as sampling techniques and error metrics.
Next, we extensively introduce input transformations, which are the central focus of this thesis, in \cref{chap:c3} through a simple motivational example function that causes various established grid-based methods to underperform.
We show how an input transformation can be used to fix the underlying problem and improve the approximation.
Following that, we define various concrete types of input transformations with the main focus lying on transformations generated with the active subspace method.
Over the course of \cref{chap:c4}, we look at the process of generating sparse grid surrogates for already transformed inputs to obtain the name-giving transformed sparse grids.
We focus especially on the unique characteristics of this construction process and its resulting surrogates compared to the normal sparse grid workflow.
\Cref{chap:c5} then combines input transformations with an iterative approach to obtain the more powerful compounded transformed surrogates.
We also provide a formal algorithm description that is used as the basis for the practical implementation.
Moreover, some basic properties of the algorithm are also investigated.
Then, in \cref{chap:c6}, we model the concrete structure of the practical implementation for the transformation process and explore various configuration options that can be used to fine tune different aspects of the algorithm.
The individual components of the implementation are evaluated in \cref{chap:c7}, in which we apply the new transformation-based approach on two models using a variety of different configurations, with the goal of evaluating the implementation components in an isolated environment to prepare for the practical applications in the succeeding chapter.
The implementation is then put to the test in \cref{chap:c8}, where we perform multiple experiments on practical, high-dimensional models to draw conclusions about the technique in general and about its comparative quality.
To achieve the best possible results, we leverage the insights gained in the preceding chapters.
\Cref{chap:c9} concludes the thesis by providing a summary of the newly introduced technique and its results.
We also reflect on the limitations and possible improvements of the method.
\chapter{Surrogate construction}
\label{chap:c2}
The general process of performing computer simulations on mathematical models to study their properties can be complicated by different aspects of the underlying model itself.
For example, the complexity of simulations is often heavily dependent on the model dimensionality, which is referred to as the curse of dimensionality \cite{Bellman1961}.
Moreover, a simulation run can also require a lot of computational resources when for example a differential equation has to be solved.
Discretization methods are one way to mitigate this problem by constructing a continuous surrogate that can fill in the gaps between a discrete set of simulation results.
Other times when dealing with black box models, \ie when the underlying model function is not known and only a fixed set of model outputs is given, we can employ regression algorithms to construct a surrogate instead.
More formally, we assume that a model is represented by a mathematical model function $f \colon \Omega \to \mathds{R}$, where $\Omega \coloneqq [0,1]^d$ is the model domain.
While the original domain of many models may not be the $d$-dimensional unit hypercube, it can be transformed into one as long as the original domain is a bounded Cartesian product.
Furthermore, many models are also subject to uncertainties in their inputs.
To incorporate these uncertainties, we model the inputs stochastically with a probability distribution defined by a probability density function $\rho \colon \Omega \to \mathds{R}_+$.
In case the model outputs are also subject to uncertainties, such as random noise or measurement errors, a well fitted surrogate should be able to average out the uncertainties and represent a smoothed version of the original model.
The usual course of action is to construct a surrogate $\hat{f}$ that approximates $f$ and preserves a required level of accuracy, such that the simulation results are still representative of the original model.
In other words, we are trying to construct an approximating function $\hat{f}$ such that
\begin{equation}
\forall x \in \Omega \colon f(x) \approx \hat{f}(x).
\end{equation}
Moreover, by constructing surrogates of a certain type, other additional useful properties can be provided at no additional cost, even if they were not available for the original model function.
For example, the gradients and moments of spline-based surrogates are very easy to compute and are essentially obtained for free.
Plenty of different surrogate types and construction methods already exist and each one comes with its own specific properties and limitations.
These properties can range from their sensitivity to the model dimensionality, over their approximation quality in general, to unique effects when being applied to certain types of model functions.
The primary focus of this thesis lies on the already established sparse grid surrogates, their various subtypes, and construction methods, with the goal of enhancing the sparse grid technique by finding and exploiting important lower-dimensional subsets of the model input parameters.
\newpage
\section{Notation}
Prior to starting out with formalizing the fundamentals of the surrogate construction process, we first have to define some basic notation for working with vectors that will be frequently used.
In rare cases where a scalar and vector with the same identifier are used, for example for the level and index notation of sparse grids later on, we use an underlined identifier to differentiate a vector from a scalar.
For example, the identifier $\ell$ would represent a scalar, whereas $\underline{\ell}$ would represent a vector.
Other times, when no differentiation between scalars and vectors is necessary, no explicit underlined identifier is used for vectors.
An underlined constant value $\underline{k}$ with $k \in \mathds{R}$ is used to define a $d$-dimensional vector of the form $\underline{k} \coloneqq (k, \dots, k)^T \in \mathds{R}^d$.
Furthermore, other commonly used constructs, namely basic vector norms and vector comparison operators, are defined as follows:
\begin{definition}[Vector norms]
Let $v \in \mathds{R}^d$ be a $d$-dimensional vector.
We make use of the $\ell_1$ norm,
\begin{equation}
|v|_1 \coloneqq \sum_{i=1}^d |v_i| ~,
\end{equation}
the Euclidean norm,
\begin{equation}
|v|_2 \coloneqq \sqrt{\sum_{i=1}^d v_i^2} ~,
\end{equation}
and the maximum norm,
\begin{equation}
|v|_\infty \coloneqq \max_{1 \leq i \leq d} |v_i| ~.
\end{equation}
\end{definition}
%
\begin{definition}[Vector comparison]
Let $u, v \in \mathds{R}^d$ be $d$-dimensional vectors.
We define the vector comparison operator,
\begin{equation}
u \leq v \Leftrightarrow \forall i \in \{1,\dots,d\} \colon u_i \leq v_i,
\end{equation}
as a component-wise comparison.
\end{definition}
\section{Full grids}
Model functions can easily be discretized using a naive full grid approach, which involves creating a uniform isotropic grid that spans $\Omega$ and interpolates function values using nearby grid points.
More formally, we can first construct a one-dimensional mesh of level $\ell$ with equidistant points $x_{\ell,i}=i2^{-\ell}$, which results in the mesh size $h=2^{-\ell}$.
Then, by associating appropriate basis functions $\phi_{\ell,i} \colon [0,1] \mapsto \mathds{R}$ with every grid point $x_{\ell,i}$, we can interpolate functions very accurately for reasonably small mesh sizes.
By using a basic tensor product approach, we can obtain a uniform isotropic $d$-dimensional grid mesh where the grid point indices become vectors and the basis functions $\phi_{\underline{\ell},\underline{i}} \colon [0,1]^d \mapsto \mathds{R}$ become multivariate functions.
The asymptotic $L^2$ interpolation error for full grids constructed this way with hat basis functions is $\mathcal{O}(h^2)$; however, this comes at the cost of requiring $\mathcal{O}(2^{\ell d})$ grid points.
Full grids therefore suffer from the curse of dimensionality, as the required number of function evaluations for the grid points grows exponentially with the level $\ell$ and the model dimensionality $d$.
\begin{definition}[Levels and indices]
Let $\underline{\ell} \coloneqq (\ell_1, \dots, \ell_d) \in \mathds{N}_0^d$ be a $d$-dimensional multi-index.
For every multi-index $\underline{\ell}$, we define the hierarchical index set,
\begin{equation}
H_{\underline{\ell}} \coloneqq \left\{ \underline{i} \in \mathds{N}^d_0 \mid \forall k \in \{1,\dots,d\} \colon
\begin{cases}
i_k=1,3,\dots,2^{\ell_k} - 3, 2^{\ell_k} - 1, & \ell_k \geq 1 \\
i_k=0,1, & \ell_k = 0
\end{cases} \right\}.
\end{equation}
\end{definition}
%
We use these hierarchical index sets in order to define a full grid as the sum of grid increments that are constructed for a given level multi-index $\underline{\ell}$ and its associated index set $H_{\underline{\ell}}$, such that every point in the grid mesh is assigned to exactly one level multi-index and hierarchical multi-index.
An incremental construction like this will make it easier to transition from full grids to sparse grids later on.
\begin{definition}[Grid points]
Let $x_{\ell,i}=i2^{-\ell}$ be a one-dimensional grid point coordinate with $x_{\ell,i} \in [0,1]$, and let
\begin{equation}
x_{\underline{\ell},\underline{i}} \coloneqq (x_{\ell_1,i_1}, \dots, x_{\ell_d,i_d})
\end{equation}
be a $d$-dimensional grid point. The set of grid points of a regular full grid with boundary is then given by
\begin{equation}
X^{\mathrm{fb}}_{\ell} \coloneqq \{x_{\underline{\ell'},\underline{i}} \mid |\underline{\ell'}|_{\infty} \leq \ell, \underline{i} \in H_{\underline{\ell'}}\},
\end{equation}
and the set of grid points of a regular full grid without boundary points by
\begin{equation}
X^{\mathrm{f}}_{\ell} \coloneqq \{x_{\underline{\ell'},\underline{i}} \mid |\underline{\ell'}|_{\infty} \leq \ell, \underline{1} \leq \underline{\ell'}, \underline{i} \in H_{\underline{\ell'}}\}.
\end{equation}
\end{definition}
%
Next, we assign each grid point a supporting multivariate basis function that stems from one family of basis functions, for example a family of piecewise linear functions or polynomials.
\begin{definition}[Basis functions]
Let $\phi_{\ell,i}(x)$ be a univariate basis function for the one-dimensional grid point $x_{\ell,i}$.\\
By applying the tensor product approach on $\phi_{\ell,i}(x)$, we obtain the multivariate $d$-dimensional basis function,
\begin{equation}
\phi_{\underline{\ell},\underline{i}}(x) \coloneqq \prod_{k=1}^{d} \phi_{\ell_k,i_k}(x_k),
\end{equation}
for the $d$-dimensional grid point $x_{\underline{\ell},\underline{i}}$.
\label{def:basis_functions}
\end{definition}
%
\begin{definition}[Full grid]
The span of the multivariate basis functions $\phi_{\underline{\ell},\underline{i}}$ over all regular full boundary grid points,
\begin{equation}
V^{\mathrm{fb}}_{\ell} \coloneqq \mathrm{span}~ \{\phi_{\underline{\ell},\underline{i}} \mid x_{\underline{\ell},\underline{i}} \in X^{\mathrm{fb}}_{\ell}\},
\end{equation}
is defined as the regular full boundary grid space of level $\ell$ and the span over all regular non-boundary full grid points,
\begin{equation}
V^{\mathrm{f}}_{\ell} \coloneqq \mathrm{span}~ \{\phi_{\underline{\ell},\underline{i}} \mid x_{\underline{\ell},\underline{i}} \in X^{\mathrm{f}}_{\ell}\},
\end{equation}
is defined as the regular full grid space of level $\ell$, respectively.
\label{def:full_grid}
\end{definition}
\section{Sparse grids}
Full grids can be decomposed into a direct sum of subspaces in a hierarchical way, which allows for a granular selection of subspaces to keep.
By keeping only a specific subset of subspaces, we obtain the so-called sparse grids.
Compared to full grids, a sparse grid construction with hat basis functions leads to a slightly worse approximation quality, as the asymptotic $L^2$ interpolation error contains an additional logarithmic factor and becomes $\mathcal{O}(h^2 (\log h^{-1})^{d-1})$.
As shown in \cite{Bungartz2004}, the number of grid points however only grows with $\mathcal{O}(h^{-1} (\log h^{-1})^{d-1})$, which is a significant improvement over full grids.
As a result, the reduced amount of required grid points makes it possible to apply the sparse grid technique to higher-dimensional problems than the full grid technique.
\begin{definition}[Sparse grid points]
Let $\ell$ be a level. Using the hierarchical index sets $H_{\underline{\ell}}$, we define the grid points for a regular sparse boundary grid,
\begin{equation}
X^{\mathrm{sb}}_{\ell} \coloneqq \{x_{\underline{\ell'},\underline{i}} \mid |\underline{\ell'}|_1 \leq \ell, \underline{i} \in H_{\underline{\ell'}}\},
\end{equation}
and the grid points for a regular sparse grid,
\begin{equation}
X^{\mathrm{s}}_{\ell} \coloneqq \{x_{\underline{\ell'},\underline{i}} \mid |\underline{\ell'}|_1 \leq \ell, \underline{1} \leq \underline{\ell'}, \underline{i} \in H_{\underline{\ell'}}\}.
\end{equation}
\end{definition}
%
As we can see, the sparse grid point set definition is very similar to the full grid definition (\cref{def:full_grid}), with the only difference being the level multi-index selection using the $\ell_1$ norm ${|\cdot|}_1$ instead of the maximum norm $|\cdot|_\infty$, effectively performing a diagonal cut on the available grid increments, as seen in \cref{fig:grid_splitting}.
The sparse grid space can then also be defined similar to full grids:
\begin{definition}[Sparse grid]
The span of the basis functions $\phi_{\underline{\ell},\underline{i}}$ over all regular sparse boundary grid points,
\begin{equation}
V^{\mathrm{sb}}_{\ell} \coloneqq \mathrm{span}~ \{\phi_{\underline{\ell},\underline{i}} \mid x_{\underline{\ell},\underline{i}} \in X^{\mathrm{sb}}_{\ell}\},
\end{equation}
is the regular sparse boundary grid space, and the span over all regular sparse non-boundary grid points,
\begin{equation}
V^{\mathrm{s}}_{\ell} \coloneqq \mathrm{span}~ \{\phi_{\underline{\ell},\underline{i}} \mid x_{\underline{\ell},\underline{i}} \in X^{\mathrm{s}}_{\ell}\},
\label{eq:sparse_grid}
\end{equation}
is the regular sparse grid space.
\label{def:sparse_grid}
\end{definition}
%
The grid construction results for full grids as well as sparse grids are visualized in \cref{fig:full_grid} and \cref{fig:sparse_grid}.
In the related sparse grid literature, the hierarchical decomposition through which sparse grids are realized, also called hierarchical splitting, is usually achieved by decomposing subspaces instead of decomposing the grid point set.
This allows for easier proofs of various approximation qualities but is not required for the purposes of this thesis.
It is sufficient to see that the sparse grid construction used in this thesis is equivalent to the construction through hierarchical subspaces.
\begin{definition}[Hierarchical subspaces]
The span of the basis functions $\phi_{\underline{\ell},\underline{i}}$ of a grid increment identified by a level multi-index $\underline{\ell}$,
\begin{equation}
W_{\underline{\ell}} \coloneqq \mathrm{span}~ \{\phi_{\underline{\ell},\underline{i}} \mid \underline{i} \in H_{\underline{\ell}}\},
\end{equation}
is defined as the hierarchical subspace of $\underline{\ell}$.
The sparse grid space can then also be represented as a direct sum of the hierarchical subspaces with
\begin{equation}
V^{\mathrm{s}}_{\ell} = \mathrm{span}~ \left\{\phi_{\underline{\ell},\underline{i}} \mid x_{\underline{\ell},\underline{i}} \in X^{\mathrm{s}}_{\ell}\right\}=\bigoplus_{|\underline{\ell'}|_1 \leq \ell, ~ \underline{1} \leq \underline{\ell'}} W_{\underline{\ell'}}.
\label{eq:direct_sum}
\end{equation}
\end{definition}
%
The sparse grid space construction of \cref{eq:direct_sum} is equal to \cref{eq:sparse_grid} because the only difference between both is that the hierarchical decomposition is realized in different stages.
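To make these point set definitions more tangible, the following Python sketch enumerates the points of the regular full and sparse boundary grids directly from the definitions above; it is purely illustrative, and all names in it are hypothetical rather than part of the implementation used later in this thesis.
\begin{verbatim}
from itertools import product

def hierarchical_indices(level):
    # Hierarchical index set H_l for a single dimension of level l.
    return [0, 1] if level == 0 else list(range(1, 2 ** level, 2))

def boundary_grid_points(ell, d, sparse=True):
    # Points of the regular boundary grid of level ell in d dimensions:
    # the sparse grid keeps increments with |l'|_1 <= ell, the full
    # grid those with |l'|_inf <= ell.
    points = set()
    for lvl in product(range(ell + 1), repeat=d):
        if sparse and sum(lvl) > ell:
            continue
        for idx in product(*(hierarchical_indices(l) for l in lvl)):
            points.add(tuple(i / 2 ** l for l, i in zip(lvl, idx)))
    return points
\end{verbatim}
For example, \texttt{boundary\_grid\_points(3, 2, sparse=True)} yields the point set $X^{\text{sb}}_{3}$ shown in \cref{fig:sparse_grid}, while \texttt{sparse=False} yields $X^{\text{fb}}_{3}$.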
\begin{mdframed}[style=style]
\vspace{2.5mm}
\begin{figure}[H]
\centering
\begin{minipage}{.71\textwidth}
\centering
\includegraphics[width=.8\linewidth]{graphics/grid_subspaces}
\subcaption{Hierarchical splitting of the grid point set $X^{\text{fb}}_{3}$}
\label{fig:grid_splitting}
\vspace{2.5mm}
\end{minipage}%
\begin{minipage}{0.28\textwidth}
\centering
\includegraphics[width=.8\linewidth]{graphics/full_grid}
\subcaption{Grid points of $X^{\text{fb}}_{3}$}
\label{fig:full_grid}
\vspace{4.5mm}
\includegraphics[width=.8\linewidth]{graphics/sparse_grid}
\subcaption{Grid points of $X^{\text{sb}}_{3}$}
\label{fig:sparse_grid}
\vspace{2.5mm}
\end{minipage}
\delimit
\caption{Construction of different grids from hierarchical grid point set increments. The full boundary grid point set $X^{\text{fb}}_{3}$ is a sum of all grid increments outlined in dotted \greencomma\,\,whereas the sparse boundary grid point set $X^{\text{sb}}_{3}$ is a sum of all grid increments outlined in dotted \reddot\,\,Boundary grid point increments are shaded in \greydot}
\label{fig:grid_construction}
\end{figure}
\end{mdframed}
\section{Basis functions}
Multivariate basis functions $\phi_{\underline{\ell},\underline{i}}$ (\cref{def:basis_functions}) are a central aspect of sparse grids.
Many different families of basis functions exist, each with various levels of complexity, capabilities, and properties.
In this section we will take a short look at the basis functions used in this thesis.
\paragraph{Hat functions}
An elementary basis for sparse grids can be constructed with the help of hat functions, which are just piecewise linear functions.
A one-dimensional hat function, $h(x) \coloneqq \max(1 - |x|,0)$, can be transformed into a basis function for a level $\ell$ and index $i$ to obtain
\begin{equation}
\phi^\mathrm{h}_{\ell,i}(x) \coloneqq h(2^\ell x-i).
\end{equation}
These basis functions are centered at their associated grid point $x_{\ell,i}$.
Furthermore, their supports are disjoint for the same level.
\Cref{fig:basis_hat_normal} shows the hat basis functions for the first few levels.
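As a small illustration, the hat basis and its tensor product extension can be written down in a few lines of Python; this sketch is for illustration only and its function names are hypothetical.
\begin{verbatim}
import numpy as np

def hat(x):
    return np.maximum(1.0 - np.abs(x), 0.0)

def phi_hat(level, index, x):
    # Univariate hat basis function phi^h_{l,i}(x) = h(2**l * x - i).
    return hat(2.0 ** level * x - index)

def phi_hat_nd(levels, indices, x):
    # Tensor-product hat basis function evaluated at the
    # d-dimensional point x.
    return np.prod([phi_hat(l, i, xk)
                    for l, i, xk in zip(levels, indices, x)])
\end{verbatim}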
\paragraph{Modified basis functions}
One method of removing the need for boundary grid points to accurately approximate some model functions in the boundary regions is the use of modified basis functions \cite{Pflueger2010}.
These basis functions extrapolate the values at the boundaries from the grid points closest to the boundaries by using a modified version of normal basis functions.
Extrapolation of this sort can, at least to some degree, be a viable alternative to using boundary grids, since high-dimensional sparse grid surrogates would otherwise allocate a large share of their grid points just for the boundaries.
Removing all of them while still providing acceptable boundary approximations usually results in a good tradeoff between grid points and accuracy, which mirrors the intuition behind sparse grids.
More formally, it is possible to modify any non-boundary basis function by adding a constant function to it and defining special boundary extrapolation functions $\phi^{\text{left}}_{\ell}$ and $\phi^{\text{right}}_{\ell}$ that are usually constructed similarly to the normal basis functions $\phi_{\ell,i}$.
For example, modified basis functions for the hat basis can be obtained by adding boundary functions and a constant function to the hat basis with
\begin{equation}
\phi^{\mathrm{mod,h}}_{\ell,i}(x) \coloneqq
\begin{cases}
1,& \ell=1\\
\max(2 - 2^\ell x, 0),& \ell>1, i=1\\
\max(2 - 2^\ell (1 - x), 0),& \ell>1, i=2^\ell - 1\\
\phi^{\mathrm{h}}_{\ell,i}(x),& \text{else}
\end{cases}.
\end{equation}
\Cref{fig:basis_hat_mod} shows the modified hat basis functions for the first few levels.
\begin{mdframed}[style=style]
\begin{figure}[H]
\begin{subfigure}{.5\textwidth}
\includegraphics[width=\linewidth]{graphics/basis_hat}
\subcaption{Hat basis for $\ell=0,\dots,2$.}
\label{fig:basis_hat_normal}
\vspace{1.5mm}
\end{subfigure}%
\begin{subfigure}{.5\textwidth}
\includegraphics[width=\linewidth]{graphics/basis_mod}
\subcaption{Modified hat basis for $\ell=1,\dots,3$.}
\label{fig:basis_hat_mod}
\vspace{1.5mm}
\end{subfigure}
\delimit
\caption{Hat basis functions and their modified counterparts.}
\label{fig:basis_hat}
\end{figure}
\end{mdframed}
\paragraph{Other basis functions}
Over time, many kinds of basis functions have been developed.
Each one was created with a specific purpose in mind and has the potential to outshine other basis functions for certain applications.
There exist polynomial basis functions, spline basis functions that are just piecewise polynomials, wavelet basis functions, and numerous subtypes of them.
However, for the sake of simplicity, they are not covered in detail as they will not be primarily used in this thesis.
Instead, we will focus on hat basis functions and modified hat basis functions.
\section{Sparse grid surrogates}
\label{sec:gsc}
Sparse grid surrogates can be constructed using one of multiple different approaches.
A sparse grid surrogate is a linear combination of all basis functions: in order to represent the instance $\hat{f}$, each basis function $\phi_{\underline{\ell},\underline{i}}$ is assigned a weight $\alpha_{\underline{\ell},\underline{i}}$, also called its hierarchical surplus.
\begin{definition}[Sparse grid surrogates]
Let $\phi_{\underline{\ell},\underline{i}}$ be multivariate basis functions and $X$ a set of grid points.
Furthermore, let $\alpha_{\underline{\ell},\underline{i}} \in \mathds{R}$ be the hierarchical surpluses.
A grid surrogate is then a linear combination with
\begin{equation}
\hat{f}(x) \coloneqq \sum_{x_{\underline{\ell},\underline{i}} \in X} \alpha_{\underline{\ell},\underline{i}} ~ \phi_{
\underline{\ell},\underline{i}}(x).
\end{equation}
In the case of a regular sparse grid with $X=X^{\mathrm{s}}_{\ell}$, the surrogate instance can also be expressed with
\begin{equation}
\hat{f}(x) = \sum_{|\underline{\ell}|_1 \leq \ell, ~ \underline{1} \leq \underline{\ell}} \sum_{\underline{i} \in {H_{\underline{\ell}}}} \alpha_{\underline{\ell},\underline{i}} ~ \phi_{
\underline{\ell},\underline{i}}(x).
\end{equation}
\end{definition}
%
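To make the linear combination concrete, the following illustrative Python sketch evaluates such a surrogate for hat basis functions; the chosen data layout, a dictionary keyed by level and index tuples, is an assumption made purely for this example.
\begin{verbatim}
def evaluate_surrogate(surpluses, x):
    # surpluses maps ((l_1,...,l_d), (i_1,...,i_d)) -> alpha_{l,i};
    # hat basis functions are assumed for illustration.
    def phi(levels, indices):
        p = 1.0
        for l, i, xk in zip(levels, indices, x):
            p *= max(1.0 - abs(2.0 ** l * xk - i), 0.0)
        return p
    return sum(alpha * phi(levels, indices)
               for (levels, indices), alpha in surpluses.items())
\end{verbatim}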
The hierarchical surpluses can be computed in a variety of different ways, which results in multiple possible sparse grid construction schemes.
Most of the time, the interpolation approach is used.
In cases where only model function observations are given, a sparse grid surrogate can also be constructed through regression.
\begin{definition}[Fundamental property]
A family of univariate basis functions $\phi_{\ell,i}$ fulfills the fundamental property iff
\begin{equation}
\begin{split}
&\phi_{\ell,i}(x_{\ell', j}) = 0, ~~ \ell' < \ell, ~ j \in H_{\ell'},\\
&\phi_{\ell,i}(x_{\ell, j}) = \delta_{ij}, ~~ j \in H_{\ell}.
\end{split}
\end{equation}
As the used multivariate basis functions are constructed using the tensor-product approach, the fundamental property holds for a multivariate basis iff it holds for the univariate one.
\label{def:fund}
\end{definition}
\paragraph{Interpolation}
The interpolation method is the simplest and most common sparse grid surrogate construction scheme.
It evaluates the model function at every grid point and performs a hierarchization algorithm to obtain the hierarchical surpluses $\alpha_{\underline{\ell},\underline{i}}$.
A multivariate basis that fulfills the fundamental property can be hierarchized very efficiently through the unidirectional principle \cite{Balder1994} by applying a hierarchization operator iteratively on every dimension separately, effectively breaking down the hierarchization process into a series of one-dimensional hierarchizations.
The resulting surrogate constructed with the hierarchization method equals the model function at all grid points, \ie $\forall x_{\underline{\ell},\underline{i}} \in X \colon f(x_{\underline{\ell},\underline{i}})=\hat{f}(x_{\underline{\ell},\underline{i}})$.
As the model function has to be evaluated at every grid point, a grid interpolation surrogate requires exactly $|X|$ function evaluations and cannot be used to approximate a function from only a given set of model function observations.
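For the one-dimensional hat basis, hierarchization reduces to subtracting from each nodal value the linear interpolation of its two hierarchical ancestors, which is exactly the mean of their nodal values. The following Python sketch shows this for a boundary grid of level $L$ and is, again, only an illustration with hypothetical names.
\begin{verbatim}
import numpy as np

def hierarchize_1d(values):
    # values: nodal values at the 2**L + 1 equidistant points of a
    # one-dimensional boundary grid of level L; returns the
    # hierarchical surpluses for the hat basis.
    values = np.asarray(values, dtype=float)
    surpluses = values.copy()
    n = len(values) - 1            # n = 2**L subintervals
    stride = n // 2
    while stride >= 1:
        for i in range(stride, n, 2 * stride):
            surpluses[i] -= 0.5 * (values[i - stride] + values[i + stride])
        stride //= 2
    return surpluses
\end{verbatim}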
\paragraph{Regression}
An alternative surrogate construction scheme is sparse grid regression \cite{Pflueger2010}.
In contrast to interpolation, a regression approach no longer requires the function values at the grid points to interpolate between them.
Instead, it uses established regression approaches to iteratively improve the weights $\alpha_{\underline{\ell},\underline{i}}$ with regards to the approximation error by using training and validation observations from the model function.
As a result, the number of required observations no longer has to be equal to the number of grid points.
This results in a greater level of flexibility, as the number of available observations and the number of grid points can vary independently of each other.
The basic approach is to solve the least squares problem for the observation set, $S \coloneqq \left\{(x_i, f(x_i))\right\}_{i=1}^n \subseteq \Omega \times \mathds{R}$, with the standard squared loss function,
\begin{equation}
\varepsilon_{\text{MSE}}(\hat{f}) \coloneqq \frac{1}{n} \sum_{i=1}^n \left(f(x_i) - \hat{f}(x_i)\right)^2 ,
\end{equation}
which is augmented by a regularization functional $R$ and a regularization control parameter $\lambda > 0$ that allow for smoothing and can also prevent overfitting.
Sparse grid regression therefore tries to solve the regularized least squares problem to find the best possible surrogate for a set of possible regularization factors $\lambda \in \Lambda$ and sparse grid level $\ell$ to obtain the surrogate with
\begin{equation}
\hat{f} = \argmin_{\hat{f} \in V^{\text{s}}_{\ell}, ~ \lambda \in \Lambda} \varepsilon_{\text{MSE}}(\hat{f}) + \lambda R(\hat{f}).
\end{equation}
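As a minimal illustration of this regularized least squares problem, the following Python sketch computes the weights for a given design matrix of basis function evaluations via the normal equations; the identity regularizer used here is only one simple choice for $R$, and all names are hypothetical.
\begin{verbatim}
import numpy as np

def fit_regression(A, y, lam):
    # A[j, k] = phi_k(x_j): basis function values at the observations,
    # y[j] = f(x_j).  Minimizes (1/n) * ||A alpha - y||^2
    # + lam * ||alpha||^2 over the weights alpha.
    n, m = A.shape
    return np.linalg.solve(A.T @ A / n + lam * np.eye(m), A.T @ y / n)
\end{verbatim}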
\section{Spatially adaptive sparse grids}
The process of generating a regular sparse grid can be improved upon in many cases by introducing adaptivity.
This allows us to spend more grid points, and therefore be more accurate, in regions that are more important and conversely reduce the amount of grid points spent in less important regions.
When done correctly, the surrogate can adapt itself to the model function characteristics and deliver a better approximation.
One method of implementing adaptivity during the sparse grid construction process is the use of spatially adaptive sparse grids \cite{Pflueger2012}.
The central idea is to define a grid point hierarchy where each non-boundary grid point has two child grid points on the next level increment along every dimension.
A stepwise algorithm then adds these child grid points to the surrogate if they are needed and do not exist yet.
This creation of new grid points that stem from a parent grid point is called the refinement of the parent grid point.
\begin{definition}[Hierarchical grid point children]
Let $X$ be a set of grid points, and let $x_{\underline{\ell},\underline{i}} \in X$ be a grid point.
We define
\begin{equation}
c_{k}^{\mathrm{left}}(x_{\underline{\ell},\underline{i}}) \coloneqq (x_{\ell_1,i_1}, \dots, x_{\ell_k + 1,2 i_k - 1} \dots, x_{\ell_d,i_d})
\end{equation}
as its left child along the $k$-th dimension if $i_k > 0$ and
\begin{equation}
c_{k}^{\mathrm{right}}(x_{\underline{\ell},\underline{i}}) \coloneqq (x_{\ell_1,i_1}, \dots, x_{\ell_k + 1,2 i_k + 1} \dots, x_{\ell_d,i_d})
\end{equation}
as its right child along the $k$-th dimension if $i_k < 2^{\ell_k}$.
In addition, we define
\begin{equation}
c_{k}(x_{\underline{\ell},\underline{i}}) \coloneqq
\begin{cases}
\left\{c_{k}^{\mathrm{right}}(x_{\underline{\ell},\underline{i}})\right\}&, i_k=0\\
\left\{c_{k}^{\mathrm{left}}(x_{\underline{\ell},\underline{i}})\right\}&,i_k= 2^{\ell_k}\\
\left\{c_{k}^{\mathrm{left}}(x_{\underline{\ell},\underline{i}}), ~ c_{k}^{\mathrm{right}}(x_{\underline{\ell},\underline{i}}) \right\}&, \text{else}
\end{cases}
\end{equation}
as the set of children along the $k$-th dimension of the grid point $x_{\underline{\ell},\underline{i}}$, and
\begin{equation}
c(x_{\underline{\ell},\underline{i}}) \coloneqq \bigcup_{k=1}^d \left\{c_{k}(x_{\underline{\ell},\underline{i}})\right\}
\end{equation}
as the set of all children of a grid point.
It always holds that $\abs{c(x_{\underline{\ell},\underline{i}})} \leq 2d$, and in case there is no boundary involved, $\abs{c(x_{\underline{\ell},\underline{i}})}=2d$ holds as well.
Furthermore, we denote a grid point $x_{\underline{\ell},\underline{i}} \in X$ as a leaf if it holds that
\begin{equation}
c(x_{\underline{\ell},\underline{i}}) \cap X = \emptyset,
\end{equation}
\ie no children of it are already contained in $X$.
\end{definition}
%
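A direct translation of this child relation into Python could look as follows; representing a grid point as a pair of level and index tuples is an assumption made only for this sketch.
\begin{verbatim}
def children(levels, indices):
    # Hierarchical children of the grid point given by per-dimension
    # levels l_k and indices i_k (coordinate x_k = i_k * 2**-l_k).
    kids = []
    for k in range(len(levels)):
        l, i = levels[k], indices[k]
        lvl = levels[:k] + (l + 1,) + levels[k + 1:]
        if i > 0:              # left child along dimension k
            kids.append((lvl, indices[:k] + (2 * i - 1,) + indices[k + 1:]))
        if i < 2 ** l:         # right child along dimension k
            kids.append((lvl, indices[:k] + (2 * i + 1,) + indices[k + 1:]))
    return kids
\end{verbatim}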
To choose the best grid points to refine, a refinement criterion is introduced.
This abstract refinement criterion assigns a value to each grid point such that the grid points with the highest values can be refined.
\begin{definition}[Refinement criterion]
Let $X$ be a set of grid points.
We define $\xi \colon X \mapsto \mathds{R}$ as the refinement criterion, which assigns each grid point a value that determines the suitability of a refinement.
The commonly used surplus criterion,
\begin{equation}
\xi_s(x_{\underline{\ell},\underline{i}}) \coloneqq \abs{\alpha_{\underline{\ell},\underline{i}}},
\end{equation}
uses the absolute hierarchical surplus value, as a high value indicates steep rates of change in the area of a grid point.
\end{definition}
%
To prepare for an iterative adaptivity algorithm, we define the grid points of adaptive sparse grids after $m$ steps as $X_a^{(m)}$.
A spatially adaptive sparse grid usually starts out as a regular sparse grid with a coarse level, \ie $X_a^{(0)}=X^{\text{s}}_{\ell}$.
Then, during the $m$-th iteration, we first order all leaf grid points with
\begin{equation}
\xi(x_1) \geq \dots \geq \xi(x_p), ~~ x_1, \dots, x_p \in X_a^{(m-1)},
\end{equation}
and then perform $k$ refinements to obtain $X_a^{(m)}=X_a^{(m-1)} \cup c(x_1) \cup \dots \cup c(x_k)$.
Usually, when a leaf grid point is refined, all of its children are created, because creating children selectively would cause additional overhead.
After $q$ steps, the spatially adaptive sparse grid is finished and consists of the grid points $X_a^{(q)}$.
Important parameters that influence the results are the number of iterations $q$, the number of refinements per step $k$, the refinement criterion $\xi$, as well as the initial grid type and size of $X_a^{(0)}$.
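A single step of this refinement loop can be sketched as follows; the grid point representation and the \texttt{children} helper follow the sketch above, and \texttt{xi} stands for an arbitrary refinement criterion, so this is an illustrative outline rather than the actual implementation.
\begin{verbatim}
def refine_step(points, xi, k, children):
    # points: set of grid points; xi: refinement criterion;
    # k: number of leaf grid points to refine in this step.
    leaves = [p for p in points
              if not any(c in points for c in children(*p))]
    leaves.sort(key=xi, reverse=True)
    for p in leaves[:k]:
        points |= set(children(*p))
    return points
\end{verbatim}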
There also exist various other established adaptivity methods that adapt the underlying grid in other ways.
For example, dimensionally adaptive sparse grids use a more rigid refinement rule that adds the grid points of whole hierarchical subspaces to the surrogate.
One small downside of spatially adaptive sparse grids is that usually all child grid points are added in a refinement step.
In a dimensionality reduction focused setting this behaviour is not optimal, since we want to focus on more important dimensions when refining a grid point.
Spatially dimension-adaptive sparse grids \cite{Khakhutskyy2016}, a more recently introduced method, provide a solution for this by only selectively adding child grid points along certain dimensions.
\section{Radial basis functions}
\label{sec:rbf}
Radial basis function (RBF) surrogates \cite{Broomhead1988} are an alternative way of approximating a model function.
A radial basis function $\varphi' \colon \Omega \mapsto \mathds{R}$ is a function whose value solely depends on the distance between the input and its associated center point $c \in \Omega$.
More formally, there must exist a function $\varphi$ such that $\varphi'(x)=\varphi(\|x - c\|)$ holds.
Functions that fulfill this property are called radial functions and the associated univariate function $\varphi$ is called a radial kernel centered at $c$.
Similar to sparse grid surrogates, a radial basis function surrogate is a linear combination of $m$ radial basis functions with center points $c_i$ and weights $\omega_i$ with
\begin{equation}
f(x) \approx \sum_{i=1}^m \omega_i \varphi(\|x - c_i\|).
\end{equation}
In the context of this thesis, only Gaussian kernels based on the Euclidean distance ${|\cdot |}_2$,
\begin{equation}
\varphi(r) \coloneqq e^{-(\epsilon r)^2}, ~~~~ r = \|x - c_i\|_2,
\end{equation}
are used as basis functions.
The only parameter that can be tuned is $\epsilon$, which is referred to as the shape parameter and is used to scale the effects of the input radius.
Given a set of center points, type of basis functions, and shape parameter $\epsilon$, we can calculate the weight coefficients $\omega_i$ by solving a linear least squares problem.
The need to solve this linear system, combined with the fact that exponentially more center points are needed to maintain the same level of coverage when the model dimensionality increases, causes this method to suffer from the curse of dimensionality as well.
Radial basis function approximation is covered in this thesis as it is almost the polar opposite of grid-based approximation.
Sparse grid surrogates require a hierarchical and axis-aligned grid, while radial basis function surrogates allow us to specify arbitrary center points.
Furthermore, radial basis functions are fundamentally different from tensor-product-based basis functions with regard to their orientation.
While tensor product basis functions are strictly axis-aligned, radial basis functions are unaware of the concept of axis-alignment or orientation as only the radius matters to them.
This property will allow us to use radial basis functions as a reference when investigating the effects of basis function alignment later on.
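A minimal sketch of such a Gaussian radial basis function fit via a plain linear least squares solve could look as follows; all names are hypothetical, and this is only one possible way to set up the system.
\begin{verbatim}
import numpy as np

def fit_rbf(centers, x_train, y_train, eps):
    # Phi[j, i] = exp(-(eps * ||x_j - c_i||_2)^2); solve Phi w ~ y
    # in the least squares sense for the weights w.
    d2 = ((x_train[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    Phi = np.exp(-(eps ** 2) * d2)
    weights, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)
    return weights

def eval_rbf(centers, weights, x, eps):
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-(eps ** 2) * d2) @ weights
\end{verbatim}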
\section{Sampling}
\label{sec:lds}
In the coming chapters we will make use of regression and some Monte-Carlo-based methods, both of which require model function observations as inputs.
For this purpose, even the relatively simple process of drawing samples from the input parameter probability distribution of the model can be improved to achieve better results.
We assume that the inputs of model functions are distributed according to a $d$-dimensional multivariate distribution defined by the probability density function,
\begin{equation}
\rho(x) \coloneqq \prod_{i=1}^d \rho_i(x_i),
\end{equation}
which is made up of $d$ independent one-dimensional distributions $\rho_i$.
A problem that arises when using a pseudorandom number generator to draw samples from the $d$-dimensional uniform input distribution is that the sequence does not cover the function domain as evenly and quickly as it theoretically can.
One solution are so-called low-discrepancy sequences, which are also called quasirandom sequences to distinguish them from normal pseudorandom sequences.
These sequences are more evenly distributed than pseudorandom samples and therefore have a lower discrepancy.
Thus, they cause the results of various Monte-Carlo methods to converge faster than they would if we were just using the same number of pseudorandom samples.
A commonly used quasirandom sequence is the Sobol sequence \cite{Niederreiter1988}, which is also used in this thesis.
It is generated using the Antonov and Saleev variant that uses the Gray code of $i$ to construct the $i$-th sample.
This approach can also be extended to allow for the generation of quasirandom samples from non-uniform distributions through the commonly used inverse transform sampling method \cite{Devroye1986}.
However, this approach is not used in this thesis as the gains are minimal.
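For illustration, a Sobol sequence can be drawn, for example, with SciPy's quasi-Monte-Carlo module; the snippet below is only a usage sketch and is not tied to the implementation used in this thesis.
\begin{verbatim}
from scipy.stats import qmc

# 2**10 quasirandom points from the 5-dimensional unit hypercube.
sampler = qmc.Sobol(d=5, scramble=False)
samples = sampler.random_base2(m=10)   # shape (1024, 5), values in [0, 1)
\end{verbatim}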
\section{Error metrics}
\label{sec:errors}
To evaluate the quality of a constructed surrogate $\hat{f}$, we define a couple of error metrics to measure the approximation error and also make it comparable to other surrogates.
An established and commonly used error metric in the context of sparse grids is the $L^2$ error:
\begin{definition}[$L^2$ error]
Let
\begin{equation}
{\|g\|}_{L^p}^p \coloneqq \int_{\Omega} \abs{g(x)}^p \mathrm{d}x
\end{equation}
be the $L^p$ norm for a real function $g \colon \Omega \mapsto \mathds{R}$.
We then define
\begin{equation}
\varepsilon_{\text{$L^2$}}(f, \hat{f}) \coloneqq \|f - \hat{f}\|_{L^2}=\sqrt{\int_{\Omega} {\abs{f(x) - \hat{f}(x)}}^2 ~ \mathrm{d}x}
\end{equation}
as the $L^2$ surrogate approximation error.
\label{def:l2}
\end{definition}
%
In numerics, to evaluate the accuracy of an estimator, the mean squared error and the root mean squared error are commonly used.
These metrics can be described as the expected value of the squared errors and its square root, respectively.
\begin{definition}[Mean squared errors]
Let $\hat{f}$ be a model function surrogate for $f$ and $S=\left\{(x_i, f(x_i))\right\}_{i=1}^n \subseteq \Omega \times \mathds{R}$ a set of observations.
We define
\begin{equation}
\varepsilon_{\mathrm{MSE}}(f, \hat{f}) \coloneqq \frac{1}{n} \sum_{i=1}^n \left(f(x_i) - \hat{f}(x_i)\right)^2
\end{equation}
as the mean squared approximation error of the surrogate.
In the same fashion, we also define
\begin{equation}
\varepsilon_{\mathrm{RMSE}}(f, \hat{f}) \coloneqq \sqrt{\varepsilon_{\mathrm{MSE}}(f, \hat{f})} = \sqrt{\frac{1}{n} \sum_{i=1}^n \left(f(x_i) - \hat{f}(x_i)\right)^2}
\end{equation}
as the root mean squared error of the surrogate.
\end{definition}
%
The RMSE is closely related to the continuous $L^2$ error as it is just a discrete version of it in case the observations $S$ are uniformly distributed.
Thus, it holds that
\begin{equation}
\lim_{n \to \infty} \varepsilon_{\mathrm{RMSE}}(f, \hat{f}) = \varepsilon_{\text{$L^2$}}(f, \hat{f})
\end{equation}
when the inputs are uniformly distributed.
One problem of these error metrics is that they are all sensitive to the magnitude of the function values.
To improve upon these absolute error metrics and allow for comparisons across model functions with different ranges of function values, we define normalized error metrics.
\begin{definition}[Normalized root mean squared error]
Let $\hat{f}$ be a model function surrogate for $f$ and $S=\{(x_i, f(x_i))\}_{i=1}^n \subseteq \Omega \times \mathds{R}$ a set of observations.
We define
\begin{equation}
\varepsilon_{\mathrm{NRMSE}}(f, \hat{f}) \coloneqq \frac{\varepsilon_{\mathrm{RMSE}}(f, \hat{f})}{\max\limits_{(x, f(x)) \in S} f(x) - \min\limits_{(x, f(x)) \in S} f(x)}
\end{equation}
as the normalized root mean squared error.
\end{definition}
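The discrete error metrics defined above translate directly into a few lines of Python; this sketch is illustrative only.
\begin{verbatim}
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def nrmse(y_true, y_pred):
    # RMSE normalized by the observed range of the model outputs.
    return rmse(y_true, y_pred) / (np.max(y_true) - np.min(y_true))
\end{verbatim}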
\chapter{Input transformations}
\label{chap:c3}
A lot of work has already gone into exploring all kinds of variants and optimizations for sparse grids that resulted in many different families of basis functions and grid types, showcased for example in \cite{Feuersaenger2010, Valentin2019}.
Moreover, many methods and algorithms that work on sparse grids have also been developed, for example in \cite{Gerstner1998, Garcke2001, Pflueger2010, Valentin2019, Rehme2021}, and allow the general sparse grids technique to be applied in a better and more flexible way to many different kinds of problems.
In the related literature, optimizing the sparse grid construction process itself was the primary focus most of the time.
In contrast to this approach, we instead inspect and adapt the inputs themselves prior to constructing any sparse grid.
The basic idea put forward in this thesis is that, instead of directly using sparse grids to approximate a function with $f(x) \approx \hat{f}(x)$, we insert another layer of flexibility into the process by introducing an input transformation function $t$ and creating a surrogate $\hat{f}_t$ in tandem with $t$, such that a transformed sparse grid approximation $f(x) \approx \hat{f}_t(t(x))$ performs better.
This general idea and also the necessary details will be expanded upon in this chapter.
\section{Intrinsic dimensionality}
\label{sec:intrinsic}
There already exist several techniques used in signal processing and data science that are related to the idea of input transformations, for example principal component analysis (PCA) \cite{Abdi2010}, which tries to find and exploit certain characteristics of the input data.
Related to PCA is the concept of intrinsic dimensionality \cite{Bennett1969}.
The intrinsic dimensionality of a function or dataset is the minimal number of features required to completely represent its output.
It can be used as a guidance to determine the actual dimensionality of the model function and also optionally exploit the knowledge to construct a lower-dimensional surrogate, \ie performing a dimensionality reduction.
\begin{definition}[Intrinsic dimensionality]
A $d$-dimensional model function $f \colon \Omega \mapsto \mathds{R}$ has an intrinsic dimensionality of $r \leq d$ if there exists an $r$-dimensional function $f_t \colon \Omega_r \mapsto \mathds{R}$ and a transformation function $t \colon \Omega \mapsto \Omega_r$ such that
\begin{equation}
\forall x \in \Omega \colon f(x)=f_t(t(x)).
\end{equation}
\label{def:intrinsic}
\end{definition}
Even though we do not explicitly define $\Omega_r$, it should also be a bounded Cartesian product such that it can be transformed into the unit hypercube $\Omega_r=[0,1]^r$.
We denote the minimal intrinsic dimensionality of a function as $r^\ast$.
Even if we are able to find a non-optimal $r$-dimensional transformation with $r > r^\ast$ that satisfies the requirement of \cref{def:intrinsic}, we still have gained an advantage and can define that function as having an intrinsic dimensionality of at most $r$.
Furthermore, even in the case $r=r^\ast$, we still only refer to the function as having an intrinsic dimensionality of at most $r$, because the true minimal intrinsic dimensionality $r^\ast$ is often unknown.
Since we are dealing with function approximation problems in this thesis, we extend \cref{def:intrinsic} to also allow for approximation errors:
\begin{definition}[Approximate intrinsic dimensionality]
Let $f$ be a $d$-dimensional function, and $\varepsilon$ an error metric.
We then define $f$ as having an approximate intrinsic dimensionality of $r \leq d$ with an error of $e$, if there exists an $r$-dimensional function $f_t \colon \Omega_r \mapsto \mathds{R}$ and a function $t \colon \Omega \mapsto \Omega_r$ such that
\begin{equation}
e=\varepsilon\left(f, f_t \circ t\right).
\end{equation}
\end{definition}
%
Similar to the minimal intrinsic dimensionality $r^\ast$, we are not concerned with finding, and also usually do not know, the minimal error $e^\ast=\min_{t, f_t} \varepsilon\left(f, f_t \circ t\right)$ for a given $r$.
The introduction of approximation errors also enables us to determine a good tradeoff between intrinsic dimensionality and approximation error.
To achieve a good convergence rate and general representativeness when performing an operation on the model function that makes use of observations, \eg a Monte-Carlo quadrature operation, it is important to achieve a low discrepancy and thus cover the input space $\Omega$ more evenly.
Under normal circumstances, to achieve the same discrepancy in a high-dimensional space as in a low-dimensional space,
exponentially more observations are required.
However, combining this requirement with the concept of intrinsic dimensionality, we observe that we effectively only need to cover $\Omega_r$ and adapt any operation that normally works on $f$ to instead work on the lower dimensional function $f_t$.
Therefore, the amount of observations required to achieve a good coverage of the domain of $f_t$ would depend on $r$ and not on $d$.
\begin{definition}[Transformed distribution]
Let $t^{-1}(x_{t})=\{x \in \Omega \mid t(x)=x_{t}\}$ denote the preimage of a transformed point $x_t$ under the transformation function $t$.
We then define the transformed distribution,
\begin{equation}
\rho_t \colon \Omega_r \to \mathds{R}_+, ~~ \rho_t(x_t)=\int_{t^{-1}(x_t)} \rho(x) \; \mathrm{d}x.
\end{equation}
\end{definition}
This is also a valid probability distribution since
\begin{equation}
\int_{\Omega_r} \rho_t(x_t) \; \mathrm{d}x_t=\int_{\Omega_r} \int_{t^{-1}(x_t)} \rho(x) \; \mathrm{d}x \; \mathrm{d}x_t = \int_{\Omega} \rho(x) \; \mathrm{d}x = 1
\end{equation}
and $\rho_t(x_t) \geq 0$ holds.
One thing to take note of is that applying a transformation on the original input distribution $\rho$ has the potential to change the distribution type.
For example, in case of a uniform distribution $\rho=\mathcal{U}[0,1]^d$, the associated transformed distribution $\rho_t$ could be a non-uniform distribution.
To effectively sample from $\rho_t$, it is still easier to just draw the samples $(x_1, \dots, x_n)$ from $\rho$ directly and feed them into the transformation function to obtain the transformed sequence $(t(x_1), \dots, t(x_n))$, which is distributed according to $\rho_t$.
The general quality and properties of transformed low discrepancy sequences (\cref{sec:lds}) also change compared to their original, untransformed counterparts \cite{Wang2008}.
\section{Limitations of sparse grids}
All grid-based interpolation methods share the common problem of being affected by the curse of dimensionality \cite{Bellman1961}, which effectively puts an upper limit on the feasible dimensionality of sparse grid surrogates.
Sparse grids themselves are already an effective strategy to reduce the consequences of the curse of dimensionality compared to full grids because they provide a good tradeoff between the number of grid points used and the resulting approximation quality.
However, sparse grids can only mitigate the curse of dimensionality and delay its effects.
Therefore, employing some form of dimensionality reduction is necessary for high-dimensional models.
Of course, the feasibility and effectiveness of dimensionality reduction techniques are limited by the minimum accuracy that is required for the specific use case.
In this section, we review a few established approaches to dimensionality reduction in the context of sparse grids and examine their properties with the help of simple example functions.
The first example function,
\begin{equation}
f_1(x_1, x_2)=\sin(2 \pi x_1) + 1,
\end{equation}
is designed to be a good fit for grid-based methods.
It is a two-dimensional function, where only the first component and a constant term contribute to the total function value, while the second component is unused.
We can easily see that it has an intrinsic dimensionality of $1$ according to \cref{def:intrinsic} since a transformation function could just cut off the second input component.
By simply rotating the function by a few degrees, we obtain the second example function,
\begin{equation}
f_2(x_1,x_2)=\sin(2 \pi x_1') + 1, ~~~~~ x_1'=\cos(0.15 \pi) x_1 -\sin(0.15 \pi) x_2,
\end{equation}
which is also a two-dimensional function but, more importantly, still has an intrinsic dimensionality of $1$.
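For reference, both example functions can be written down directly, \eg as a short Python/NumPy sketch:
\begin{verbatim}
import numpy as np

def f1(x1, x2):
    # only the first input component contributes
    return np.sin(2.0 * np.pi * x1) + 1.0

def f2(x1, x2):
    # f1 rotated by 0.15*pi; still intrinsically one-dimensional
    x1_rot = np.cos(0.15 * np.pi) * x1 - np.sin(0.15 * np.pi) * x2
    return np.sin(2.0 * np.pi * x1_rot) + 1.0
\end{verbatim}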
\subsection{Spatially adaptive sparse grids}
We start off by interpolating both functions using spatially adaptive sparse grids to investigate how their generated grid points and interpolation errors compare.
\begin{mdframed}[style=style]
\begin{figure}[H]
\begin{subfigure}{.5\textwidth}
\centering
\includegraphics[width=.65\linewidth]{graphics/grid_f1}
\subcaption{Surrogate grid points for $f_1$}
\label{fig:grid_f1}
\vspace{2.5mm}
\end{subfigure}%
\begin{subfigure}{.5\textwidth}
\centering
\includegraphics[width=.65\linewidth]{graphics/grid_f2}
\subcaption{Surrogate grid points for $f_2$}
\label{fig:grid_f2}
\vspace{2.5mm}
\end{subfigure}
\delimit
\caption{Grid points of generated spatially adaptive sparse grids for $f_1$ and $f_2$.
The sparse grids use modified hat basis functions and started out as regular sparse grids with level $\ell=1$. Moreover, the surplus refinement criterion and 15 refinement steps with one grid point refinement per step were used to adapt the sparse grids.
\label{fig:grids}
\end{figure}
\end{mdframed}
%
\Cref{fig:grids} illustrates how spatially adaptive sparse grids are affected by the different alignment of the input functions.
The spatially adaptive strategy is able to exploit the one-dimensional nature of $f_1$ in \cref{fig:grid_f1} by only refining along the first dimension and therefore spending the grid points only where they matter.
Note that due to the default refinement rule, which adds all child grid points, the upper and lower grid point rows are created even though they are not necessary.
A tweaked refinement rule could easily fix this.
However, as we can see in \cref{fig:grid_f2}, the intrinsically one-dimensional function $f_2$ cannot be exploited in the same way, as it is no longer axis-aligned.
This can be deduced from the fact that grid points are refined all over the domain and along both dimensions.
Moreover, this observation is reinforced by the actual approximation errors of the surrogates for both functions shown in \cref{fig:grid_approx_f2}.
\subsection{Radial basis functions}
Radial basis functions behave differently from sparse grids when applied to the same two functions, as their basis functions are not constructed using the tensor-product approach.
Instead, they are orientation agnostic because only the radius relative to a support point determines the basis function value.
Therefore, as seen in \cref{fig:grid_approx_f1}, the approximation errors stay almost identical in both cases.
However, the general approximation quality for $f_1$ is worse than that of the sparse grid approximation, since radial basis functions are not well suited for approximating functions that are constant along a dimension.
Modified hat basis functions, in contrast, include a constant basis function that is well suited for exactly this purpose and therefore have an advantage here.
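For reference, a Gaussian RBF interpolant of the kind used for \cref{fig:grid_approx_f1} can be sketched in a few lines of NumPy; this is only an illustrative implementation with placeholder names, not the exact code used to generate the figures:
\begin{verbatim}
import numpy as np

def rbf_fit(X, y, eps=0.5):
    # Gaussian kernel matrix from pairwise Euclidean distances
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    A = np.exp(-(eps * d) ** 2)
    # solve for the interpolation weights
    return np.linalg.solve(A, y)

def rbf_eval(X, weights, x_new, eps=0.5):
    d = np.linalg.norm(x_new[:, None, :] - X[None, :, :], axis=-1)
    return np.exp(-(eps * d) ** 2) @ weights
\end{verbatim}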
\begin{mdframed}[style=style]
\begin{figure}[H]
\begin{subfigure}{.5\textwidth}
\includegraphics[width=\linewidth]{graphics/grid_approx_f1}
\subcaption{RBF surrogates created with Gaussian basis functions, Euclidean distance measure, and shape parameter $\epsilon=0.5$.}
\label{fig:grid_approx_f1}
\vspace{2.5mm}
\end{subfigure}%
\begin{subfigure}{.5\textwidth}
\includegraphics[width=\linewidth]{graphics/grid_approx_f2}
\subcaption{Spatially adaptive sparse grids with $\ell=1$, surplus refinement criterion, 280 single refinement steps, and mod. hat basis functions.}
\label{fig:grid_approx_f2}
\vspace{2.5mm}
\end{subfigure}
\delimit
\caption{
Approximation error comparison between radial basis functions and spatially adaptive sparse grids for both example functions $f_1$ and $f_2$.}
\label{fig:dim_reduction_errors}
\end{figure}
\end{mdframed}
\subsection{Variance-based sensitivity analysis}
The method of variance-based sensitivity analysis, a form of global sensitivity analysis introduced by Sobol \cite{S01}, is a well-established method in statistics to determine the importance of model input parameters.
Given a set of dimension indices $D=\{1,\dots,d\}$, we define the set of ANOVA components as $C \coloneqq \mathcal{P}(D)$, where an ANOVA component $c \in C$ that contains a specific dimension index $i \in D$ indicates that the corresponding function term varies along the $i$-th dimension.
We can then decompose a model function into a componentwise sum with
\begin{equation}
f(x)=\sum_{c \in \mathcal{P}(D)} f_{c}(x),
\label{anovaComp}
\end{equation}
which is called the ANOVA decomposition of $f$.
A decomposition of a high-dimensional function into $|\mathcal{P}(D)|=2^d$ lower-dimensional function terms allows for an examination of individual terms with the goal of identifying terms that contribute the least amount of variance to the total function variance.
This decomposition can also be employed on sparse grids \cite{Feuersaenger2010} by using special semi-orthogonal basis functions, for example pre-wavelet basis functions, to represent the terms $f_c$.
The terms $f_c$ are required to be semi-orthogonal such that we can define the variance function,
\begin{equation}
\sigma^2(f) \coloneqq \int_{\Omega} f(x)^2 \rho(x) ~ \text{d} x=\sum_{c \in \mathcal{P}(D)} \sigma^2(f_c),
\end{equation}
in an additive manner.
For every ANOVA component $c \in C$, we are then able to calculate the variance share, also called Sobol index,
\begin{equation}
S_{c} \coloneqq \frac{\sigma^2(f_{c})}{\sigma^2(f)}, ~~ \sum_{c \in \mathcal{P}(D)} S_{c} = 1.
\end{equation}
Computing the Sobol indices of all terms for both example functions gives insight into the structure of the function and tells us what share of the function we are able to represent with axis-aligned terms of varying dimension.
It also helps us to understand why spatial adaptivity only works well for one of them.
The Sobol method decomposes the two-dimensional functions into four axis-aligned terms: a constant term $f_\emptyset$, two one-dimensional terms $f_{\{1\}}$ and $f_{\{2\}}$, and a two-dimensional term $f_{\{1,2\}}$.
\begin{mdframed}[style=style]
\begin{figure}[H]
\centering
\begin{minipage}[H]{.45\textwidth}
\centering
\begin{tabular}{ l | c c }
& $2 \notin c~$ & $2 \in c$ \\
\hline
$1 \notin c$ & 0.667 & 0.0\\
$1 \in c$ & 0.333 & 0.0\\
\end{tabular}
\delimit
\captionof{table}{Sobol indices for all ANOVA components of $f_1$.}
\label{tab:anova_f1}
\end{minipage}%
\hspace{.05\textwidth}
\begin{minipage}[H]{0.45\textwidth}
\centering
\begin{tabular}{ l | c c }
& $2 \notin c~$ & $2 \in c$ \\
\hline
$1 \notin c$ & 0.636 & 0.001\\
$1 \in c$ & 0.163 & 0.199\\
\end{tabular}
\delimit
\captionof{table}{Sobol indices for all ANOVA components of $f_2$.}
\label{tab:anova_f2}
\end{minipage}
\end{figure}
\end{mdframed}
%
As we can see in \cref{tab:anova_f1}, the method correctly identifies that only the terms $f_\emptyset$ and $f_{\{1\}}$ contribute to the output, with Sobol indices $S_\emptyset=0.667 > 0$ and $S_{\{1\}}=0.333 > 0$.
Therefore, the other terms can be dropped, and the function can be completely represented using only constant and one-dimensional function terms.
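Indeed, for $f_1$ the decomposition can be verified by hand: with $f_\emptyset=1$, $f_{\{1\}}(x_1)=\sin(2 \pi x_1)$, $f_{\{2\}}=f_{\{1,2\}}=0$, and assuming a uniform input distribution on $[0,1]^2$, we obtain
\begin{equation}
\sigma^2(f_\emptyset)=1, ~~ \sigma^2(f_{\{1\}})=\int_0^1 \sin^2(2 \pi x_1) \; \mathrm{d}x_1=\frac{1}{2},
\end{equation}
and therefore $S_\emptyset=2/3 \approx 0.667$ and $S_{\{1\}}=1/3 \approx 0.333$, in agreement with \cref{tab:anova_f1}.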
However, we can see in \cref{tab:anova_f2} that the Sobol method attributes a share of the output variance to every term, since $S_{c} > 0$ holds for all ANOVA components.
In this case, the Sobol method shows us that no dimensionality reduction can be performed, as the two-dimensional term $f_{\{1,2\}}$ contributes a significant share of the variance.
Note that if the constant term $f_\emptyset$ were excluded from the analysis, as is often done, the effect of the changed alignment of $f_2$ on the Sobol indices would be even more pronounced.
\subsection{Summary}
We can conclude that a fundamental property of any function approximation method that uses tensor-product basis functions is that the approximation quality significantly depends on the axis-alignment of the model function itself.
In contrast to this behaviour, other basis functions, \eg radial basis functions, do not depend on the alignment of the function terms they are trying to approximate, \ie they can be seen as orientation agnostic.
In turn, this means that an effective dimensionality reduction with conventional sparse-grid-based methods is highly dependent on the alignment of the model function.
Without adding another layer of flexibility, sparse grids cannot use their full potential in many cases, as we have seen with the function $f_2$.
If we were able to realign a model function so that it better matches the axes of the underlying sparse grid surrogate, we could improve the approximation in general and also enable a more effective dimensionality reduction.
\section{Active direction transformations}
The concept of intrinsic dimensionality allows us, at least in theory, to use arbitrary transformation functions to map into the lower-dimensional input space $\Omega_r$.
For this purpose, there are numerous possible types of transformation functions to choose from.
As an example, the previous section showed how a simple change of basis of the input parameters can have a huge impact on the constructed sparse grid surrogates.
We will therefore start by constructing an orthonormal transformation matrix $W \in \mathds{R}^{d \times d}$ to transform the inputs into a new coordinate system before passing them to a sparse grid surrogate.
A good first choice for the new basis vectors is given by directions along which the model function output exhibits high rates of change.
Various data analysis methods, such as PCA \cite{Abdi2010} or active subspaces \cite{Constantine2015}, employ some form of a singular value decomposition to obtain these directions of high change.
Since the active subspace method will later be used to obtain $W$, we refer to the new basis vectors, \ie the columns of $W$, as active directions, consistent with the terminology of that method.
We then refer to the matrix $W$ as the active direction matrix and assume that its basis vectors are ordered by their level of change from highest to lowest, \ie the column vector $w_1$ is the most active direction while $w_d$ is the most inactive one.
After performing a change of basis with $W$ on the inputs, we can effectively eliminate the most inactive dimensions, as the active directions are now aligned with the coordinate axes of the new coordinate system.
This makes it possible to drastically reduce the surrogate dimensionality while only suffering a small increase in the approximation error, since only the most inactive directions, which should contribute little to the model function, are eliminated this way.
\begin{definition}[Reduced active direction matrix]
Let $W \in \mathds{R}^{d \times d}$ be an active direction matrix with $W^T W=I$, and let $r \leq d$ be the reduced dimensionality.
We then define
\begin{equation}
W_r \coloneqq \begin{bmatrix}
\vert & \vert & & \vert & \vert \\
w_1 & w_2 & \dots & w_{r-1} & w_r \\
\vert & \vert & & \vert & \vert
\end{bmatrix}
\end{equation}
as the reduced active direction matrix.
\end{definition}
%
As we can see, the reduced active direction matrix is simply obtained from a given active direction matrix $W$ and a reduced dimensionality $r$ by cutting off the last $d - r$ directions of $W$.
For now, we assume that $W$ and $r$ are given, as the process of determining these essential inputs will be covered in detail later on.
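To make the intended use of $W_r$ concrete, the following minimal Python/NumPy sketch shows how inputs could be transformed before being passed to a lower-dimensional surrogate; the surrogate $f_t$ and the matrix are placeholders, and the unit hypercube projection discussed in the following subsection is ignored here:
\begin{verbatim}
import numpy as np

def transform(x, W_r):
    # change of basis followed by truncation: t(x) = W_r^T x
    return W_r.T @ x

def reduced_surrogate_eval(f_t, W_r, x):
    # evaluate the r-dimensional surrogate on the transformed input
    return f_t(transform(x, W_r))
\end{verbatim}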
\subsection{Unit hypercube projections}