From 66b4b0eb7b517f9e134e4cf5abee36122af4bbb2 Mon Sep 17 00:00:00 2001
From: odunbar
Date: Tue, 10 Oct 2023 16:49:02 -0700
Subject: [PATCH] typos and notes

---
 docs/src/random_feature_emulator.md | 36 ++++++++++++++++++-----------
 1 file changed, 23 insertions(+), 13 deletions(-)

diff --git a/docs/src/random_feature_emulator.md b/docs/src/random_feature_emulator.md
index d4c57be39..ffb6bfa4a 100644
--- a/docs/src/random_feature_emulator.md
+++ b/docs/src/random_feature_emulator.md
@@ -3,20 +3,25 @@
 !!! note "Have a go with Gaussian processes first"
     We recommend that users first try `GaussianProcess` for their problems. As random features are a more recent tool, the training procedures and interfaces are still experimental and in development.
 
-Random features provide a more scalable (numbers of training points, input-output dimensions) and flexible framework that approximates a Gaussian process through a random sampling of these features. Theoretical work guarantees convergence of random features to Gaussian processes with the number of sampled features, and several feature distributions are known for common kernel families.
+Random features provide a flexible framework for approximating a Gaussian process. By randomly sampling features to approximate a low-rank factorization of the Gaussian process kernel, the method scales well in the number of training points and in the input and output dimensions. In the infinite-sample limit, random features with certain feature distributions are known to converge to particular kernel families.
 
 We provide two types of `MachineLearningTool` for RandomFeatures, the `ScalarRandomFeatureInterface` and the `VectorRandomFeatureInterface`.
 
-The `ScalarRandomFeatureInterface` closely mimics the role of the `GaussianProcessEmulator` package, by training a scalar-output function distribution. It can be applied to multidimensional output problems as with `GaussianProcessEmulators` by relying on a decorrelation of the output space, followed by training a series of independent scalar functions.
+The `ScalarRandomFeatureInterface` closely mimics the role of a `GaussianProcess` package by training a scalar-output function distribution. It can be applied to multidimensional output problems, as with `GaussianProcess`, by relying on a decorrelation of the output space followed by training a series of independent scalar functions (all computed internally by the `Emulator` object).
 
-The `VectorRandomFeatureInterface` directly trains the mapping between multi-dimensional spaces. Therefore it does not rely on decorrelation of the output space (though this can still be helpful), and can be cheap to evaluate; on the other hand the training can be more challenging/computationally expensive.
+The `VectorRandomFeatureInterface`, when applied to multidimensional problems, directly trains a function distribution between multi-dimensional spaces. This approach does not require the output-space data processing used by the scalar method (though such processing can still be helpful). It can be cheaper to evaluate, but on the other hand the training can be more challenging and computationally expensive.
 
-To build a random feature emulator, as with Gaussian process one defines a kernel to encode similarities between outputs ``(y_i,y_j)`` based on inputs ``(x_i,x_j)``. Additionally one must specify the number of random feature samples to be taken to build the emulator.
+Building a random feature interface is similar to building a Gaussian process: one defines a kernel to encode similarities between outputs ``(y_i,y_j)`` based on inputs ``(x_i,x_j)``. Additionally, one must specify the number of random feature samples to be taken to build the emulator.
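+
+For orientation, the sketch below shows how an interface might be wrapped into an `Emulator`. This is a minimal sketch: the data container and `Emulator` keywords follow the general emulator interface documented elsewhere in these docs and may need adjusting, and `inputs`, `outputs` and `Γ` are placeholders for user-provided training data and noise covariance.
+
+```julia
+using CalibrateEmulateSample.Emulators
+using EnsembleKalmanProcesses.DataContainers # provides PairedDataContainer
+
+# inputs: input_dim x n_sample matrix, outputs: output_dim x n_sample matrix (data stored as columns)
+iopairs = PairedDataContainer(inputs, outputs)
+
+n_features = 200 # number of random feature samples (illustrative choice)
+srfi = ScalarRandomFeatureInterface(n_features, size(inputs, 1))
+
+emulator = Emulator(srfi, iopairs, obs_noise_cov = Γ) # Γ: output noise covariance
+optimize_hyperparameters!(emulator)
+```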
 
 # User Interface
 
-`CalibrateEmulateSample.jl` allows the random feature emulator to be built using the external package `RandomFeatures.jl`. In the notation of this package, our interface allows for families of `RandomFourierFeature` objects to be constructed with different kernels defining different structures of the "`xi`" a.k.a weight distribution, and with a learnable "`sigma`", a.k.a scaling parameter.
+`CalibrateEmulateSample.jl` allows the random feature emulator to be built using the external package [`RandomFeatures.jl`](https://github.com/CliMA/RandomFeatures.jl). In the notation of that package's documentation, our interface allows for families of `RandomFourierFeature` objects to be constructed with different Gaussian distributions of the "`xi`" (a.k.a. weight) distribution, and with a learnable "`sigma`" (a.k.a. scaling) parameter.
+
+!!! note "Relating features and kernels"
+    The parallels between random features and Gaussian processes can be quite strong. For example:
+    - The restriction to `RandomFourierFeature` objects is a restriction to the approximation of shift-invariant kernels (i.e. ``K(x,y) = K(x-y)``).
+    - The restriction of the weight ("`xi`") distribution to Gaussians is a restriction to approximating squared-exponential kernels. Other distributions (e.g. Student-t) lead to other kernels (e.g. Matérn).
+
 The interfaces are defined minimally with
 
 ```julia
@@ -24,7 +29,7 @@ srfi = ScalarRandomFeatureInterface(n_features, input_dim; ...)
 vrfi = VectorRandomFeatureInterface(n_features, input_dim, output_dim; ...)
 ```
 
-This will build an interface around a random feature family based on `n_features` features and mapping between spaces of dimenstion `input_dim` to 1 (scalar), or `output_dim` (vector).
+This will build an interface around a random feature family based on `n_features` features, mapping between spaces of dimension `input_dim` to `1` (scalar) or `output_dim` (vector).
 
 ## The `kernel_structure` keyword - for flexibility
 
@@ -42,31 +47,36 @@ diagonal_structure = DiagonalFactor() # impose diagonal structure (e.g. ARD kern
 cholesky_structure = CholeskyFactor() # general positive definite matrix
 lr_perturbation = LowRankFactor(r) # assume structure is a rank-r perturbation from identity
 ```
-All covariance structures (except `OneDimFactor`) have their final positional argument being a "nugget" term adding ``+\epsilon I` to the covariance structure. Set to 1 by default.
+All covariance structures (except `OneDimFactor`) take a "nugget" term as their final positional argument, adding ``+\epsilon I`` to the covariance structure. It is set to 1 by default.
 
 The current default kernels are as follows:
 ```julia
 scalar_default_kernel = SeparableKernel(LowRankFactor(Int(ceil(sqrt(input_dim)))), OneDimFactor())
 vector_default_kernel = SeparableKernel(LowRankFactor(Int(ceil(sqrt(output_dim)))), LowRankFactor(Int(ceil(sqrt(output_dim)))))
 ```
-
+!!! note "Relating covariance structure and training"
+    The parallels between random features and Gaussian processes also extend to the hyperparameter learning. For example,
+    - A `ScalarRandomFeatureInterface` with a `DiagonalFactor` input covariance structure approximates a Gaussian process with an automatic relevance determination (ARD) kernel, where one learns a lengthscale in each dimension of the input space.
+
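+
+For example, a non-default kernel might be supplied through the `kernel_structure` keyword as in the sketch below (the ranks, nugget values, and dimensions are illustrative placeholders):
+
+```julia
+# scalar case: diagonal (ARD-like) structure over the input space, trivial output structure
+scalar_kernel = SeparableKernel(DiagonalFactor(), OneDimFactor())
+srfi = ScalarRandomFeatureInterface(n_features, input_dim; kernel_structure = scalar_kernel)
+
+# vector case: rank-2 input structure and rank-3 output structure, with a smaller nugget
+# (the nugget is the final positional argument of each factor)
+vector_kernel = SeparableKernel(LowRankFactor(2, 1e-4), LowRankFactor(3, 1e-4))
+vrfi = VectorRandomFeatureInterface(n_features, input_dim, output_dim; kernel_structure = vector_kernel)
+```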
 ## The `optimizer_options` keyword - for performance
 
-Passed as a dictionary this allows the user to configure many options from their defaults in the hyperparameter optimization. The optimizer itself relies on the `EnsembleKalmanProcesses` package.
+Passed as a dictionary, this keyword allows the user to change many options from their defaults in the hyperparameter optimization. The optimizer itself relies on the [`EnsembleKalmanProcesses`](https://github.com/CliMA/EnsembleKalmanProcesses.jl) package.
 We recommend users experiment with a subset of these flags. At first enable
 ```julia
 Dict("verbose" => true)
 ```
-Then if the covariance sampling takes too long, run with multithreading (e.g. `julia --project -t n_threads script.jl`) if it still takes too long, try
+If the covariance sampling takes too long, run with multithreading (e.g. `julia --project -t n_threads script.jl`). Sampling is embarrassingly parallel, so this achieves near-linear scaling.
+
+If sampling still takes too long, try setting
 ```julia
 Dict(
  "cov_sample_multiplier" => csm,
 "train_fraction" => tf,
 )
 ```
-- Reducing `csm` below `1.0` (towards `0.0`) directly reduces the number of samples to estimate a covariance matrix in the optimizer, by using a shrinkage estimator - the more shrinkage the more approximation (suggestion, keep shrinkage below 0.2).
-- Increasing `tf` towards 1 changes the train-validate split, reducing samples but increasing cost-per-sample and reducing the available validation data (suggested range `(0.5,0.95)`).
+- Decreasing `csm` (default `10.0`) towards `0.0` directly reduces the number of samples used to estimate a covariance matrix in the optimizer by using a shrinkage estimator; the more shrinkage, the greater the approximation error (we suggest keeping the shrinkage amount below `0.2`).
+- Increasing `tf` towards `1` changes the train-validate split, reducing samples but increasing cost-per-sample and reducing the available validation data (default `0.8`, suggested range `(0.5,0.95)`).
 
 If optimizer convergence stagnates or is too slow, or if it terminates before producing good results, try:
 ```julia
@@ -77,7 +87,7 @@ Dict(
 "scheduler" => sch,
 )
 ```
-We suggest looking at the `EnsembleKalmanProcesses` documentation for more details; but to summarize
+We suggest looking at the [`EnsembleKalmanProcesses`](https://github.com/CliMA/EnsembleKalmanProcesses.jl) documentation for more details, but to summarize:
 - Reducing optimizer samples `n_e` and iterations `n_i` reduces computation time.
 - If `n_e` becomes less than the number of hyperparameters, the updates will fail and a localizer must be specified in `loc`.
-- If the algorithm terminates at `T=1` and resulting emulators looks unacceptable one can change or add arguments in `sch` e.g. `DataMisfitController("on_terminate"=continue)`
+- If the algorithm terminates at `T=1` and the resulting emulators look unacceptable, one can change or add arguments in `sch`, e.g. `DataMisfitController(on_terminate = "continue")`.
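+
+Putting these flags together, an illustrative configuration is sketched below; it is passed through the `optimizer_options` keyword of the interface constructors. The values shown are placeholders, and the exact set of recognized keys may differ between versions:
+
+```julia
+using EnsembleKalmanProcesses # provides DataMisfitController
+
+optimizer_options = Dict(
+    "verbose" => true,              # print optimization progress
+    "cov_sample_multiplier" => 0.5, # fewer covariance samples (more shrinkage approximation)
+    "train_fraction" => 0.9,        # larger train split, fewer but costlier samples
+    "scheduler" => DataMisfitController(on_terminate = "continue"), # do not stop at T=1
+)
+
+srfi = ScalarRandomFeatureInterface(n_features, input_dim; optimizer_options = optimizer_options)
+```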