Commit

Update documentation

ltimmerman3 committed Aug 28, 2023
1 parent 57b88ad commit 7bee290
Showing 52 changed files with 1,241 additions and 403 deletions.
Binary file added _images/DBSCAN.gif
Binary file added _images/GMM.gif
Binary file added _images/OvA.png
Binary file added _images/ROC_curve.jpg
Binary file added _images/agglomerative.gif
Binary file added _images/autoencoder.png
Binary file added _images/bio_dendrogram.png
Binary file added _images/class_imbalance.png
Binary file added _images/confusion_matrix.png
Binary file added _images/discriminationLine.png
Binary file added _images/discriminative_vs_generative.png
Binary file added _images/kernel_schematic.png
Binary file added _images/lasso_vs_ridge_regression.png
Binary file added _images/linearly_separable.png
Binary file added _images/margin_cost.png
Binary file added _images/margin_size.png
Binary file added _images/multiclass_cost.png
Binary file added _images/oversampling.png
Binary file added _images/perceptron_NN.png
Binary file added _images/precision_recall.png
Binary file added _images/smote.png
Binary file added _images/underfitting_overfitting.png
8 changes: 8 additions & 0 deletions _sources/docs/Goals.md
@@ -0,0 +1,8 @@
# Development Goals

- Theoretical EST section drawing inspiration from Szabo and Ostlund's Intro to Modern Quantum Chemistry
- Fall 2024
- Reformulate ChBE 6745 materials into jupyter book
- Fall 2024
- Include information on job arrays in the HPC section
- Fall 2024
12 changes: 12 additions & 0 deletions _sources/docs/Intro_to_EST.md
@@ -0,0 +1,12 @@
# Introduction to Electronic Structure Theory

This section will serve to introduce key concepts in electronic structure theory, including:
- Basic Linear Algebra
- Operators
- Basic Dirac notation
- Continuous Generalization of Linear Algebra
- Eigenvalues and Eigenvectors
- Hamiltonians as Operators
- Schrödinger Equation and Eigenvalue Problems

This is currently just a placeholder. The goal will be to add this to the training materials by Fall 2024.
24 changes: 0 additions & 24 deletions _sources/docs/ML_1_1_Non-parametric_Models.ipynb
@@ -422,17 +422,6 @@
"We see that the model successful at interpolating between the points. This is an example of a **non-parametric** model. The number of parameters, $\\vec{w}$ is equal to the number of data points."
]
},
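For reference, a minimal sketch of the linear-interpolation idea described above; the `x_peak`/`y_peak` values here are synthetic stand-ins for the spectra arrays, not the notebook's actual data:

```python
import numpy as np

# Synthetic stand-ins for the (x_peak, y_peak) spectra arrays used in the notebook
x_peak = np.linspace(0, 10, 50)
y_peak = np.exp(-(x_peak - 5) ** 2)

# "Train" on every third point; piecewise-linear interpolation effectively stores
# one parameter per training point, which is what makes the model non-parametric.
x_train, y_train = x_peak[::3], y_peak[::3]
y_pred = np.interp(x_peak, x_train, y_train)

print(np.c_[x_peak, y_pred][:5])
```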
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise: Use every third data point of the spectra dataset to train a linear interpolation model\n",
"\n",
"First, select every third datapoint from the `(x_peak, y_peak)` dataset, and use this to train a linear interpolation model. Then, predict the full dataset using the model.\n",
"\n",
"![image_info](./images/interpolation.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -570,19 +559,6 @@
"x_peak.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise: Evaluate the performance of the rbf kernel as a function of kernel width\n",
"\n",
"Use the same strategy as the previous exercise to select every third point in the spectra to use as the training set. Then, vary the width of the radial basis function with $\\sigma = [1, 10, 100]$, and compute the $r^2$ score for each *using the entire dataset*.\n",
"\n",
"Plot $r^2$ as a function of $\\sigma$.\n",
"\n",
"![image_info](./images/r2_vs_kernelwidth.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
43 changes: 8 additions & 35 deletions _sources/docs/ML_1_2_Complexity_Optimization.ipynb
@@ -19,9 +19,10 @@
"\n",
"The key to machine learning is creating models that generalize to new examples. This means we are looking for models with enough complexity to describe the behavior, but not so much complexity that it just reproduces the data points.\n",
"\n",
"<center>\n",
"<img src=\"images/underfitting_overfitting.png\" width=\"800\">\n",
"</center>\n",
"```{image} ./underfitting_overfitting.png\n",
":width: 800px\n",
":align: center\n",
"```\n",
"\n",
"* Underfitting: The model is just \"guessing\" at the data, and will be equally bad at the data it has been trained on and the data that it is tested on.\n",
"\n",
@@ -249,15 +250,6 @@
"We can see that the BIC correctly predicts that the Gaussian model is preferred."
]
},
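As a reference, one common way to compute the BIC for a least-squares fit (assuming Gaussian residuals) is sketched below; the notebook's own implementation may differ:

```python
import numpy as np

def bic(y_true, y_pred, n_params):
    """BIC for a least-squares model, assuming Gaussian residuals."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    rss = np.sum((y_true - y_pred) ** 2)  # residual sum of squares
    return n * np.log(rss / n) + n_params * np.log(n)

# Lower BIC is preferred; compare e.g. a Gaussian model against a higher-order polynomial:
# bic(y, y_gaussian_fit, n_params=3) vs. bic(y, y_poly_fit, n_params=8)
```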
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise: Use the BIC to determine the optimum number of evenly-spaced Gaussians for the spectra\n",
"\n",
"![image](./images/BIC.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -518,15 +510,6 @@
"print('The largest coefficient is {:.3f}.'.format(max(abs(coeffs))[0]));"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise: Use cross validation to determine the optimal value of $\\alpha$ when $\\sigma=20$.\n",
"\n",
"![image](./images/optimalAlpha.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -574,9 +557,10 @@
"\n",
"We will not go through the derivation of *why* the L1 norm causes parameters to go to zero, but the following schematic, borrowed from [this website](https://niallmartin.wordpress.com/2016/05/12/shrinkage-methods-ridge-and-lasso-regression/) may be useful (note that $\\vec{\\beta}$ is equivalent to $\\vec{w}$:\n",
"\n",
"<center>\n",
"<img src=\"images/lasso_vs_ridge_regression.png\" width=\"600\">\n",
"</center>\n",
"```{image} ./lasso_vs_ridge_regression.png\n",
":width: 600px\n",
":align: center\n",
"```\n",
"\n",
"We can also test it using `scikit-learn`. Unfortunately, we need to create our own feature (basis) matrix, $X_{ij}$, similar to linear regression, so we will need a function to evaluate the `rbf`. Instead of using our own, we can use the one from `scikit-learn`:"
]
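A minimal sketch of building the RBF feature (basis) matrix with `sklearn.metrics.pairwise.rbf_kernel` and fitting a LASSO model; the data and the `alpha`/`sigma` values here are placeholders rather than the notebook's:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics.pairwise import rbf_kernel

# Placeholder data standing in for the spectrum (x_peak, y_peak)
x = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.exp(-((x.ravel() - 5) ** 2) / 2)

sigma = 1.0  # kernel width (placeholder value)
X_ij = rbf_kernel(x, x, gamma=1.0 / (2 * sigma ** 2))  # feature (basis) matrix

lasso = Lasso(alpha=1e-3).fit(X_ij, y)
print('non-zero coefficients:', np.sum(lasso.coef_ != 0))
```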
@@ -793,17 +777,6 @@
"\n",
"One note is that the best model will depend on the parameters you search over, as well as the cross-validation strategy. In this case, `cv=3` means that the model performs 3-fold cross-validation at each gridpoint."
]
},
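For reference, a sketch of the kind of grid search described above, using kernel ridge regression with 3-fold cross-validation; the synthetic data and parameter grid are placeholders, not the notebook's values:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

# Placeholder data; substitute the spectrum arrays from the notebook
x = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.exp(-((x.ravel() - 5) ** 2) / 2)

# Hypothetical grid; the notebook searches its own alpha and sigma values
param_grid = {'alpha': np.logspace(-6, 0, 7), 'gamma': np.logspace(-3, 1, 5)}

search = GridSearchCV(KernelRidge(kernel='rbf'), param_grid, cv=3)
search.fit(x, y)
print(search.best_params_, search.best_score_)
```

Note that `best_score_` (and therefore the "best" model) depends on both the grid searched and the cross-validation strategy, as mentioned above.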
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise: Optimize the hyperparameters of a LASSO model for the spectrum data\n",
"\n",
"Search over the same values of $\\alpha$ and $\\sigma$ as for KRR above, and use 3-fold cross validation.\n",
"\n",
"Note: You will need to use a for loop over the $\\sigma$ values. Use `GridSearchCV.best_score_` as accuracy metric."
]
}
],
"metadata": {
2 changes: 1 addition & 1 deletion _sources/docs/ML_1_3_High_Dimensional_Data.ipynb
@@ -419,7 +419,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise: Compute the mean and standard deviation of each feature in the Dow dataset."
"### Example: Compute the mean and standard deviation of each feature in the Dow dataset."
]
},
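A minimal sketch, assuming the Dow data is available as a CSV file (the filename below is a placeholder for the course's actual data path):

```python
import pandas as pd

# 'dow_data.csv' is a placeholder path for the Dow dataset used in the course
df = pd.read_csv('dow_data.csv')

print(df.mean(numeric_only=True))  # per-feature mean
print(df.std(numeric_only=True))   # per-feature standard deviation
```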
{
11 changes: 8 additions & 3 deletions _sources/docs/ML_1_4_Dimensionality_Reduction.ipynb
@@ -267,7 +267,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise: Find the rank of the covariance matrix\n",
"### Example: Find the rank of the covariance matrix\n",
"\n",
"Hint: Remember the definition of rank in term of eigenvalues."
]
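A sketch of the eigenvalue-based rank check on a random stand-in for the notebook's data:

```python
import numpy as np

# Random stand-in for the notebook's data matrix (n_samples x n_features)
X = np.random.rand(100, 8)
cov = np.cov(X, rowvar=False)

# Rank = number of eigenvalues that are nonzero (to within a numerical tolerance)
eigvals = np.linalg.eigvalsh(cov)
rank = int(np.sum(np.abs(eigvals) > 1e-10))
print(rank, np.linalg.matrix_rank(cov))  # the two estimates should agree
```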
@@ -665,7 +665,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise: Find the \"average\" 10-dimensional vector for an \"8\"\n",
"### Example: Find the \"average\" 10-dimensional vector for an \"8\"\n",
"\n",
"Use PCA to project the data onto 10 dimensions, then select the points labeled as 8 and take the average. "
]
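A sketch of this example, assuming the handwritten-digits data from `sklearn.datasets`; the notebook may load its dataset differently:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Assumes the handwritten-digits data; substitute the notebook's dataset if different
X, y = load_digits(return_X_y=True)

X_10d = PCA(n_components=10).fit_transform(X)  # project onto 10 dimensions
avg_eight = X_10d[y == 8].mean(axis=0)         # average vector of the "8" examples
print(avg_eight)
```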
@@ -1065,7 +1065,7 @@
"source": [
"### Autoencoding\n",
"\n",
"<img src=\"images/autoencoder.png\">\n",
"![autoencoder](./autoencoder.png)\n",
"\n",
"The final approach we will discuss is \"autoencoding\", which is the use of neural networks for dimensional reduction. This is a relatively new approach without any implementation in scikit-learn, but it is conceptually different from others so it is included here.\n",
"\n",
@@ -1085,6 +1085,11 @@
"\n",
"This is a field of research on its own, but worth being aware of nonetheless."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
84 changes: 42 additions & 42 deletions _sources/docs/ML_2_1_Classification_Basics.ipynb
@@ -140,25 +140,28 @@
"\n",
"* **Linearly separable**: A problem where it is possible to exactly separate the classes with a straight line (or plane) in the feature space.\n",
"\n",
"<center>\n",
"<img src=\"images/linearly_separable.png\" width=\"500\">\n",
"</center>\n",
"```{image} ./linearly_separable.png\n",
":align: center\n",
":width: 400px\n",
"```\n",
"\n",
"* **Binary vs. Multi-class**: A binary classification problem has only 2 classes, while a multi-class problem has more than 2 classes. \n",
"\n",
"There are two approaches to dealing with multi-class problems:\n",
"\n",
"1) Convert multi-class problems to binary problems using a series of \"one vs. the rest\" binary classifiers\n",
"\n",
"<center>\n",
"<img src=\"images/OvA.png\" width=\"400\">\n",
"</center>\n",
"```{image} ./OvA.png\n",
":align: center\n",
":width: 400px\n",
"```\n",
"\n",
"2) Consider the multi-class nature of the problem when deriving the method (e.g. kNN) or determining the cost function (e.g. logistic regression)\n",
"\n",
"<center>\n",
"<img src=\"images/multiclass_cost.png\" width=\"400\">\n",
"</center>\n",
"```{image} ./multiclass_cost.png\n",
":align: center\n",
":width: 400px\n",
"```\n",
"\n",
"In the end, the difference between these approaches tend to be relatively minor, although the training procedures can be quite different. One vs. the rest is more efficient for parallel training, while multi-class objective functions are more efficient in serial.\n",
"\n",
@@ -260,10 +263,11 @@
"\n",
"Generative models are more difficult to understand, but they have a key advantage: new synthetic data can be generated by using the function $P(\\vec{x}|y_i)$. This opens the possibility of iterative training schemes that systematically improve the estimate of $P(\\vec{x}|y_i)$ (e.g. Generative Artificial Neural Networks) and can also aid in diagnosing problems in models.\n",
"\n",
"<br>\n",
"<center>\n",
"<img src=\"images/discriminative_vs_generative.png\" width=\"800\">\n",
"</center>"
"\n",
"```{image} ./discriminative_vs_generative.png\n",
":align: center\n",
":width: 800px\n",
"```"
]
},
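As an illustration of the generative idea, Gaussian naive Bayes estimates $P(\vec{x}|y_i)$ as per-class Gaussians, which can then be sampled to produce synthetic points. This is a sketch, not the notebook's code; `theta_` and `var_` are the fitted attributes of scikit-learn's `GaussianNB` (older releases name the variance attribute `sigma_`):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Generative model: estimates P(x | y_i) as an independent Gaussian per class and feature
gnb = GaussianNB().fit(X, y)

# Draw synthetic feature vectors for class 0 from the fitted class-conditional distribution
rng = np.random.default_rng(0)
synthetic = rng.normal(loc=gnb.theta_[0], scale=np.sqrt(gnb.var_[0]), size=(5, X.shape[1]))
print(synthetic)
```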
{
@@ -303,9 +307,10 @@
"\n",
"This is the \"harmonic mean\" of precision and recall and ranges from 0 for a model with 0 precision or recall to 1 for a model with perfec precision and recall.\n",
"\n",
"<center>\n",
"<img src=\"images/precision_recall.png\" width=\"500\">\n",
"</center>\n",
"```{image} ./precision_recall.png\n",
":align: center\n",
":width: 500px\n",
"```\n",
"\n",
"We can implement this with a simple Python function:"
]
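A sketch of the kind of helper function referred to above; the notebook's actual `acc_prec_recall` signature and return values may differ:

```python
import numpy as np

def acc_prec_recall(y_pred, y_true):
    """Sketch of an accuracy/precision/recall/F1 helper for binary 0/1 labels."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy = np.mean(y_pred == y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1
```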
@@ -406,15 +411,6 @@
"acc_prec_recall(y_guess, y_imbalanced)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise: Plot the accuracy of the \"guessing zero\" model as a function of number of 1's included in the actual data\n",
"\n",
"![image](./images/accuracy.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -423,9 +419,10 @@
"\n",
"The \"receiver operating characteristic\", or ROC curve, is useful for models where a threshold is used to tune the rate of false positives and false negatives. The area under the curve can be used as a metric for how well the model performs.\n",
"\n",
"<center>\n",
"<img src=\"images/ROC_curve.jpg\" width=\"300\">\n",
"</center>\n",
"```{image} ./ROC_curve.jpg\n",
":align: center\n",
":width: 300px\n",
"```\n",
"\n",
"We will discuss this metric more once the meaning of a \"threshold\" has been described."
]
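For reference, a minimal sketch of computing an ROC curve and its area with scikit-learn; the labels and scores below are made-up placeholders:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up true labels and classifier scores (e.g. predicted probabilities)
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) point per threshold
print('AUC =', roc_auc_score(y_true, y_score))
```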
@@ -482,9 +479,10 @@
"\n",
"False positives and false negatives only apply to binary problems. The \"confusion matrix\" is a multi-class generalization of the concept, and can help identify which classes are \"confusing\" the algorithm.\n",
"\n",
"<center>\n",
"<img src=\"images/confusion_matrix.png\" width=\"500\">\n",
"</center>"
"```{image} ./confusion_matrix.png\n",
":width: 500px\n",
":align: center\n",
"```"
]
},
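A minimal sketch using scikit-learn's `confusion_matrix` on made-up multi-class labels:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: rows of the matrix are true classes, columns are predicted classes
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 2, 2, 2, 1, 0, 1, 1]

print(confusion_matrix(y_true, y_pred))
```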
{
@@ -508,22 +506,24 @@
"\n",
"2) Undersampling: discarding information from over-represented class. This is inefficient since not all information is used.\n",
"\n",
"<center>\n",
"<img src=\"images/class_imbalance.png\" width=\"500\">\n",
"</center>\n",
"\n",
"```{image} ./class_imbalance.png\n",
":width: 500px\n",
":align: center\n",
"```\n",
"\n",
"3) Oversampling: add repeates of the under-represented class (very similar to re-balancing the cost function). This can lead to over-fitting of the decision boundary to the few examples of the under-represented class.\n",
"\n",
"<center>\n",
"<img src=\"images/oversampling.png\" width=\"500\">\n",
"</center>\n",
"```{image} ./oversampling.png\n",
":width: 500px\n",
":align: center\n",
"```\n",
"\n",
"4) Resampling: Re-sample from the under-represented class, but add some noise. This is a robust solution, but requires some knowledge of the distribution of the under-represented data (e.g. generative models) or special techniques (e.g. SMOTE).\n",
"\n",
"<center>\n",
"<img src=\"images/smote.png\" width=\"500\">\n",
"</center>"
"```{image} ./smote.png\n",
":width: 500px\n",
":align: center\n",
"```"
]
},
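A sketch of two of these strategies (re-weighting the cost function and naive oversampling) on synthetic imbalanced data; SMOTE-style resampling with added noise is provided by the third-party `imbalanced-learn` package:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced binary data (~5% positives)
rng = np.random.default_rng(0)
X = rng.random((1000, 2))
y = (rng.random(1000) < 0.05).astype(int)

# Option A: re-balance the cost function instead of the data
clf = LogisticRegression(class_weight='balanced').fit(X, y)

# Option B: naive oversampling -- repeat minority-class rows until the classes are balanced
minority = np.where(y == 1)[0]
extra = rng.choice(minority, size=(y == 0).sum() - minority.size, replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])
```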
{
@@ -728,7 +728,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise: Derive the slope and intercept of the line that discriminates between the two classes."
"### Example: Derive the slope and intercept of the line that discriminates between the two classes."
]
},
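A hedged sketch of one way to approach this numerically with logistic regression on synthetic 2-D data: the decision boundary is $w_0 + w_1 x_1 + w_2 x_2 = 0$, so the slope and intercept follow from the fitted coefficients. The notebook's own derivation may be analytical rather than numerical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, linearly separable 2-D data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
y = np.array([1] * 50 + [0] * 50)

clf = LogisticRegression().fit(X, y)
(w1, w2), w0 = clf.coef_[0], clf.intercept_[0]

# Decision boundary: w0 + w1*x1 + w2*x2 = 0  =>  x2 = -(w1/w2)*x1 - w0/w2
slope, intercept = -w1 / w2, -w0 / w2
print(slope, intercept)
```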
{