Commit

Update documentation

ltimmerman3 committed Aug 28, 2023
1 parent 57b88ad commit 7bee290
Showing 52 changed files with 1,241 additions and 403 deletions.
Binary file added _images/DBSCAN.gif
Binary file added _images/GMM.gif
Binary file added _images/OvA.png
Binary file added _images/ROC_curve.jpg
Binary file added _images/agglomerative.gif
Binary file added _images/autoencoder.png
Binary file added _images/bio_dendrogram.png
Binary file added _images/class_imbalance.png
Binary file added _images/confusion_matrix.png
Binary file added _images/discriminationLine.png
Binary file added _images/discriminative_vs_generative.png
Binary file added _images/kernel_schematic.png
Binary file added _images/lasso_vs_ridge_regression.png
Binary file added _images/linearly_separable.png
Binary file added _images/margin_cost.png
Binary file added _images/margin_size.png
Binary file added _images/multiclass_cost.png
Binary file added _images/oversampling.png
Binary file added _images/perceptron_NN.png
Binary file added _images/precision_recall.png
Binary file added _images/smote.png
Binary file added _images/underfitting_overfitting.png
8 changes: 8 additions & 0 deletions _sources/docs/Goals.md
@@ -0,0 +1,8 @@
# Development Goals

- Theoretical EST section drawing inspiration from Szabo and Ostlund's Intro to Modern Quantum Chemistry
- Fall 2024
- Reformulate ChBE 6745 materials into jupyter book
- Fall 2024
- Include information on job arrays in the HPC section
- Fall 2024
12 changes: 12 additions & 0 deletions _sources/docs/Intro_to_EST.md
@@ -0,0 +1,12 @@
# Introduction to Electronic Structure Theory

This section will serve to introduce key concepts in electronic structure theory, including:
- Basic Linear Algebra
- Operators
- Basic Dirac notation
- Continuous Generalization of Linear Algebra
- Eigenvalues and Eigenvectors
- Hamiltonians as Operators
- Schrödinger Equation and Eigenvalue Problems

This is currently just a placeholder. The goal will be to add this to the training materials by Fall 2024.
24 changes: 0 additions & 24 deletions _sources/docs/ML_1_1_Non-parametric_Models.ipynb
@@ -422,17 +422,6 @@
"We see that the model successful at interpolating between the points. This is an example of a **non-parametric** model. The number of parameters, $\\vec{w}$ is equal to the number of data points."
]
},
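For reference, a minimal sketch of the linear-interpolation idea described above; the `x_peak`/`y_peak` values here are synthetic stand-ins for the spectra arrays, not the notebook's actual data:

```python
import numpy as np

# Synthetic stand-ins for the (x_peak, y_peak) spectra arrays used in the notebook
x_peak = np.linspace(0, 10, 50)
y_peak = np.exp(-(x_peak - 5) ** 2)

# "Train" on every third point; piecewise-linear interpolation effectively stores
# one parameter per training point, which is what makes the model non-parametric.
x_train, y_train = x_peak[::3], y_peak[::3]
y_pred = np.interp(x_peak, x_train, y_train)

print(np.c_[x_peak, y_pred][:5])
```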
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise: Use every third data point of the spectra dataset to train a linear interpolation model\n",
"\n",
"First, select every third datapoint from the `(x_peak, y_peak)` dataset, and use this to train a linear interpolation model. Then, predict the full dataset using the model.\n",
"\n",
"![image_info](./images/interpolation.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -570,19 +559,6 @@
"x_peak.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise: Evaluate the performance of the rbf kernel as a function of kernel width\n",
"\n",
"Use the same strategy as the previous exercise to select every third point in the spectra to use as the training set. Then, vary the width of the radial basis function with $\\sigma = [1, 10, 100]$, and compute the $r^2$ score for each *using the entire dataset*.\n",
"\n",
"Plot $r^2$ as a function of $\\sigma$.\n",
"\n",
"![image_info](./images/r2_vs_kernelwidth.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
43 changes: 8 additions & 35 deletions _sources/docs/ML_1_2_Complexity_Optimization.ipynb
@@ -19,9 +19,10 @@
"\n",
"The key to machine learning is creating models that generalize to new examples. This means we are looking for models with enough complexity to describe the behavior, but not so much complexity that it just reproduces the data points.\n",
"\n",
"<center>\n",
"<img src=\"images/underfitting_overfitting.png\" width=\"800\">\n",
"</center>\n",
"```{image} ./underfitting_overfitting.png\n",
":width: 800px\n",
":align: center\n",
"```\n",
"\n",
"* Underfitting: The model is just \"guessing\" at the data, and will be equally bad at the data it has been trained on and the data that it is tested on.\n",
"\n",
@@ -249,15 +250,6 @@
"We can see that the BIC correctly predicts that the Gaussian model is preferred."
]
},
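As a reference, one common way to compute the BIC for a least-squares fit (assuming Gaussian residuals) is sketched below; the notebook's own implementation may differ:

```python
import numpy as np

def bic(y_true, y_pred, n_params):
    """BIC for a least-squares model, assuming Gaussian residuals."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    rss = np.sum((y_true - y_pred) ** 2)  # residual sum of squares
    return n * np.log(rss / n) + n_params * np.log(n)

# Lower BIC is preferred; compare e.g. a Gaussian model against a higher-order polynomial:
# bic(y, y_gaussian_fit, n_params=3) vs. bic(y, y_poly_fit, n_params=8)
```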
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise: Use the BIC to determine the optimum number of evenly-spaced Gaussians for the spectra\n",
"\n",
"![image](./images/BIC.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -518,15 +510,6 @@
"print('The largest coefficient is {:.3f}.'.format(max(abs(coeffs))[0]));"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise: Use cross validation to determine the optimal value of $\\alpha$ when $\\sigma=20$.\n",
"\n",
"![image](./images/optimalAlpha.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -574,9 +557,10 @@
"\n",
"We will not go through the derivation of *why* the L1 norm causes parameters to go to zero, but the following schematic, borrowed from [this website](https://niallmartin.wordpress.com/2016/05/12/shrinkage-methods-ridge-and-lasso-regression/) may be useful (note that $\\vec{\\beta}$ is equivalent to $\\vec{w}$:\n",
"\n",
"<center>\n",
"<img src=\"images/lasso_vs_ridge_regression.png\" width=\"600\">\n",
"</center>\n",
"```{image} ./lasso_vs_ridge_regression.png\n",
":width: 600px\n",
":align: center\n",
"```\n",
"\n",
"We can also test it using `scikit-learn`. Unfortunately, we need to create our own feature (basis) matrix, $X_{ij}$, similar to linear regression, so we will need a function to evaluate the `rbf`. Instead of using our own, we can use the one from `scikit-learn`:"
]
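A minimal sketch of building the RBF feature (basis) matrix with `sklearn.metrics.pairwise.rbf_kernel` and fitting a LASSO model; the data and the `alpha`/`sigma` values here are placeholders rather than the notebook's:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics.pairwise import rbf_kernel

# Placeholder data standing in for the spectrum (x_peak, y_peak)
x = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.exp(-((x.ravel() - 5) ** 2) / 2)

sigma = 1.0  # kernel width (placeholder value)
X_ij = rbf_kernel(x, x, gamma=1.0 / (2 * sigma ** 2))  # feature (basis) matrix

lasso = Lasso(alpha=1e-3).fit(X_ij, y)
print('non-zero coefficients:', np.sum(lasso.coef_ != 0))
```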
@@ -793,17 +777,6 @@
"\n",
"One note is that the best model will depend on the parameters you search over, as well as the cross-validation strategy. In this case, `cv=3` means that the model performs 3-fold cross-validation at each gridpoint."
]
},
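For reference, a sketch of the kind of grid search described above, using kernel ridge regression with 3-fold cross-validation; the synthetic data and parameter grid are placeholders, not the notebook's values:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

# Placeholder data; substitute the spectrum arrays from the notebook
x = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.exp(-((x.ravel() - 5) ** 2) / 2)

# Hypothetical grid; the notebook searches its own alpha and sigma values
param_grid = {'alpha': np.logspace(-6, 0, 7), 'gamma': np.logspace(-3, 1, 5)}

search = GridSearchCV(KernelRidge(kernel='rbf'), param_grid, cv=3)
search.fit(x, y)
print(search.best_params_, search.best_score_)
```

Note that `best_score_` (and therefore the "best" model) depends on both the grid searched and the cross-validation strategy, as mentioned above.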
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise: Optimize the hyperparameters of a LASSO model for the spectrum data\n",
"\n",
"Search over the same values of $\\alpha$ and $\\sigma$ as for KRR above, and use 3-fold cross validation.\n",
"\n",
"Note: You will need to use a for loop over the $\\sigma$ values. Use `GridSearchCV.best_score_` as accuracy metric."
]
}
],
"metadata": {
2 changes: 1 addition & 1 deletion _sources/docs/ML_1_3_High_Dimensional_Data.ipynb
@@ -419,7 +419,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise: Compute the mean and standard deviation of each feature in the Dow dataset."
"### Example: Compute the mean and standard deviation of each feature in the Dow dataset."
]
},
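A minimal sketch, assuming the Dow data is available as a CSV file (the filename below is a placeholder for the course's actual data path):

```python
import pandas as pd

# 'dow_data.csv' is a placeholder path for the Dow dataset used in the course
df = pd.read_csv('dow_data.csv')

print(df.mean(numeric_only=True))  # per-feature mean
print(df.std(numeric_only=True))   # per-feature standard deviation
```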
{
11 changes: 8 additions & 3 deletions _sources/docs/ML_1_4_Dimensionality_Reduction.ipynb
@@ -267,7 +267,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise: Find the rank of the covariance matrix\n",
"### Example: Find the rank of the covariance matrix\n",
"\n",
"Hint: Remember the definition of rank in term of eigenvalues."
]
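A sketch of the eigenvalue-based rank check on a random stand-in for the notebook's data:

```python
import numpy as np

# Random stand-in for the notebook's data matrix (n_samples x n_features)
X = np.random.rand(100, 8)
cov = np.cov(X, rowvar=False)

# Rank = number of eigenvalues that are nonzero (to within a numerical tolerance)
eigvals = np.linalg.eigvalsh(cov)
rank = int(np.sum(np.abs(eigvals) > 1e-10))
print(rank, np.linalg.matrix_rank(cov))  # the two estimates should agree
```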
@@ -665,7 +665,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise: Find the \"average\" 10-dimensional vector for an \"8\"\n",
"### Example: Find the \"average\" 10-dimensional vector for an \"8\"\n",
"\n",
"Use PCA to project the data onto 10 dimensions, then select the points labeled as 8 and take the average. "
]
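A sketch of this example, assuming the handwritten-digits data from `sklearn.datasets`; the notebook may load its dataset differently:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Assumes the handwritten-digits data; substitute the notebook's dataset if different
X, y = load_digits(return_X_y=True)

X_10d = PCA(n_components=10).fit_transform(X)  # project onto 10 dimensions
avg_eight = X_10d[y == 8].mean(axis=0)         # average vector of the "8" examples
print(avg_eight)
```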
@@ -1065,7 +1065,7 @@
"source": [
"### Autoencoding\n",
"\n",
"<img src=\"images/autoencoder.png\">\n",
"![autoencoder](./autoencoder.png)\n",
"\n",
"The final approach we will discuss is \"autoencoding\", which is the use of neural networks for dimensional reduction. This is a relatively new approach without any implementation in scikit-learn, but it is conceptually different from others so it is included here.\n",
"\n",
@@ -1085,6 +1085,11 @@
"\n",
"This is a field of research on its own, but worth being aware of nonetheless."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
84 changes: 42 additions & 42 deletions _sources/docs/ML_2_1_Classification_Basics.ipynb
@@ -140,25 +140,28 @@
"\n",
"* **Linearly separable**: A problem where it is possible to exactly separate the classes with a straight line (or plane) in the feature space.\n",
"\n",
"<center>\n",
"<img src=\"images/linearly_separable.png\" width=\"500\">\n",
"</center>\n",
"```{image} ./linearly_separable.png\n",
":align: center\n",
":width: 400px\n",
"```\n",
"\n",
"* **Binary vs. Multi-class**: A binary classification problem has only 2 classes, while a multi-class problem has more than 2 classes. \n",
"\n",
"There are two approaches to dealing with multi-class problems:\n",
"\n",
"1) Convert multi-class problems to binary problems using a series of \"one vs. the rest\" binary classifiers\n",
"\n",
"<center>\n",
"<img src=\"images/OvA.png\" width=\"400\">\n",
"</center>\n",
"```{image} ./OvA.png\n",
":align: center\n",
":width: 400px\n",
"```\n",
"\n",
"2) Consider the multi-class nature of the problem when deriving the method (e.g. kNN) or determining the cost function (e.g. logistic regression)\n",
"\n",
"<center>\n",
"<img src=\"images/multiclass_cost.png\" width=\"400\">\n",
"</center>\n",
"```{image} ./multiclass_cost.png\n",
":align: center\n",
":width: 400px\n",
"```\n",
"\n",
"In the end, the difference between these approaches tend to be relatively minor, although the training procedures can be quite different. One vs. the rest is more efficient for parallel training, while multi-class objective functions are more efficient in serial.\n",
"\n",
@@ -260,10 +263,11 @@
"\n",
"Generative models are more difficult to understand, but they have a key advantage: new synthetic data can be generated by using the function $P(\\vec{x}|y_i)$. This opens the possibility of iterative training schemes that systematically improve the estimate of $P(\\vec{x}|y_i)$ (e.g. Generative Artificial Neural Networks) and can also aid in diagnosing problems in models.\n",
"\n",
"<br>\n",
"<center>\n",
"<img src=\"images/discriminative_vs_generative.png\" width=\"800\">\n",
"</center>"
"\n",
"```{image} ./discriminative_vs_generative.png\n",
":align: center\n",
":width: 800px\n",
"```"
]
},
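As an illustration of the generative idea, Gaussian naive Bayes estimates $P(\vec{x}|y_i)$ as per-class Gaussians, which can then be sampled to produce synthetic points. This is a sketch, not the notebook's code; `theta_` and `var_` are the fitted attributes of scikit-learn's `GaussianNB` (older releases name the variance attribute `sigma_`):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Generative model: estimates P(x | y_i) as an independent Gaussian per class and feature
gnb = GaussianNB().fit(X, y)

# Draw synthetic feature vectors for class 0 from the fitted class-conditional distribution
rng = np.random.default_rng(0)
synthetic = rng.normal(loc=gnb.theta_[0], scale=np.sqrt(gnb.var_[0]), size=(5, X.shape[1]))
print(synthetic)
```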
{
@@ -303,9 +307,10 @@
"\n",
"This is the \"harmonic mean\" of precision and recall and ranges from 0 for a model with 0 precision or recall to 1 for a model with perfec precision and recall.\n",
"\n",
"<center>\n",
"<img src=\"images/precision_recall.png\" width=\"500\">\n",
"</center>\n",
"```{image} ./precision_recall.png\n",
":align: center\n",
":width: 500px\n",
"```\n",
"\n",
"We can implement this with a simple Python function:"
]
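A sketch of the kind of helper function referred to above; the notebook's actual `acc_prec_recall` signature and return values may differ:

```python
import numpy as np

def acc_prec_recall(y_pred, y_true):
    """Sketch of an accuracy/precision/recall/F1 helper for binary 0/1 labels."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy = np.mean(y_pred == y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1
```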
@@ -406,15 +411,6 @@
"acc_prec_recall(y_guess, y_imbalanced)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise: Plot the accuracy of the \"guessing zero\" model as a function of number of 1's included in the actual data\n",
"\n",
"![image](./images/accuracy.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -423,9 +419,10 @@
"\n",
"The \"receiver operating characteristic\", or ROC curve, is useful for models where a threshold is used to tune the rate of false positives and false negatives. The area under the curve can be used as a metric for how well the model performs.\n",
"\n",
"<center>\n",
"<img src=\"images/ROC_curve.jpg\" width=\"300\">\n",
"</center>\n",
"```{image} ./ROC_curve.jpg\n",
":align: center\n",
":width: 300px\n",
"```\n",
"\n",
"We will discuss this metric more once the meaning of a \"threshold\" has been described."
]
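For reference, a minimal sketch of computing an ROC curve and its area with scikit-learn; the labels and scores below are made-up placeholders:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up true labels and classifier scores (e.g. predicted probabilities)
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) point per threshold
print('AUC =', roc_auc_score(y_true, y_score))
```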
@@ -482,9 +479,10 @@
"\n",
"False positives and false negatives only apply to binary problems. The \"confusion matrix\" is a multi-class generalization of the concept, and can help identify which classes are \"confusing\" the algorithm.\n",
"\n",
"<center>\n",
"<img src=\"images/confusion_matrix.png\" width=\"500\">\n",
"</center>"
"```{image} ./confusion_matrix.png\n",
":width: 500px\n",
":align: center\n",
"```"
]
},
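A minimal sketch using scikit-learn's `confusion_matrix` on made-up multi-class labels:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: rows of the matrix are true classes, columns are predicted classes
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 2, 2, 2, 1, 0, 1, 1]

print(confusion_matrix(y_true, y_pred))
```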
{
@@ -508,22 +506,24 @@
"\n",
"2) Undersampling: discarding information from over-represented class. This is inefficient since not all information is used.\n",
"\n",
"<center>\n",
"<img src=\"images/class_imbalance.png\" width=\"500\">\n",
"</center>\n",
"\n",
"```{image} ./class_imbalance.png\n",
":width: 500px\n",
":align: center\n",
"```\n",
"\n",
"3) Oversampling: add repeates of the under-represented class (very similar to re-balancing the cost function). This can lead to over-fitting of the decision boundary to the few examples of the under-represented class.\n",
"\n",
"<center>\n",
"<img src=\"images/oversampling.png\" width=\"500\">\n",
"</center>\n",
"```{image} ./oversampling.png\n",
":width: 500px\n",
":align: center\n",
"```\n",
"\n",
"4) Resampling: Re-sample from the under-represented class, but add some noise. This is a robust solution, but requires some knowledge of the distribution of the under-represented data (e.g. generative models) or special techniques (e.g. SMOTE).\n",
"\n",
"<center>\n",
"<img src=\"images/smote.png\" width=\"500\">\n",
"</center>"
"```{image} ./smote.png\n",
":width: 500px\n",
":align: center\n",
"```"
]
},
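A sketch of two of these strategies (re-weighting the cost function and naive oversampling) on synthetic imbalanced data; SMOTE-style resampling with added noise is provided by the third-party `imbalanced-learn` package:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced binary data (~5% positives)
rng = np.random.default_rng(0)
X = rng.random((1000, 2))
y = (rng.random(1000) < 0.05).astype(int)

# Option A: re-balance the cost function instead of the data
clf = LogisticRegression(class_weight='balanced').fit(X, y)

# Option B: naive oversampling -- repeat minority-class rows until the classes are balanced
minority = np.where(y == 1)[0]
extra = rng.choice(minority, size=(y == 0).sum() - minority.size, replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])
```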
{
@@ -728,7 +728,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise: Derive the slope and intercept of the line that discriminates between the two classes."
"### Example: Derive the slope and intercept of the line that discriminates between the two classes."
]
},
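A hedged sketch of one way to approach this numerically with logistic regression on synthetic 2-D data: the decision boundary is $w_0 + w_1 x_1 + w_2 x_2 = 0$, so the slope and intercept follow from the fitted coefficients. The notebook's own derivation may be analytical rather than numerical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, linearly separable 2-D data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
y = np.array([1] * 50 + [0] * 50)

clf = LogisticRegression().fit(X, y)
(w1, w2), w0 = clf.coef_[0], clf.intercept_[0]

# Decision boundary: w0 + w1*x1 + w2*x2 = 0  =>  x2 = -(w1/w2)*x1 - w0/w2
slope, intercept = -w1 / w2, -w0 / w2
print(slope, intercept)
```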
{