UC Davis DataLab
Spring 2024
Instructor: Nick Ulle
Maintainer: Nick Ulle <[email protected]>
This workshop provides an overview of contemporary machine learning methods. We'll cover important terminology and popular methods so that you can determine whether machine learning is relevant to your research and what to learn more about if it is. This is a concept-focused, non-technical workshop. No laptops needed.
After this workshop, learners should be able to:
- Define the following terms: observation, feature, machine learning, supervised learning, unsupervised learning, regression, classification, clustering, training set, validation set, test set, cross-validation, overfitting, underfitting, model bias, model variance, bias-variance tradeoff, ensemble model;
- Explain the difference between supervised and unsupervised learning;
- Explain the difference between regression and classification;
- List and briefly describe popular machine learning methods;
- Give an example of an ensemble model;
- Explain what cross-validation is used for and give an overview of the procedure;
- Assess whether and which machine learning methods might be helpful for a given research problem.
This two-part workshop series provides an introduction to using R for two popular machine learning techniques: clustering and classification.
Clustering involves identifying groups of similar observations (called clusters) within data. Clustering can be an effective tool for finding patterns and an important part of exploratory data analysis. Classification refers to modeling a categorical response variable. Classification models can provide insight into the relationship between the predictors and the response, as well as a way to make predictions for new observations.
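As a concrete illustration of clustering, here is a minimal sketch (not taken from the workshop materials) that runs k-means, one popular clustering algorithm, on R's built-in iris measurements and compares the resulting clusters to the known species labels:

```r
# A minimal sketch (not from the workshop materials): k-means clustering,
# one popular algorithm, applied to R's built-in iris measurements.
data(iris)
features <- scale(iris[, 1:4])   # standardize the four numeric columns
set.seed(42)                     # k-means starts from random centers
fit <- kmeans(features, centers = 3, nstart = 25)

# Compare the discovered clusters to the known species labels
table(cluster = fit$cluster, species = iris$Species)
```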
In the first session, we'll discuss the advantages and disadvantages of several popular clustering algorithms and work through examples of how to run them in R. In the second session, we'll provide an overview of popular classification models and then delve into the details of actually using them. We'll cover how to choose a model, how to partition data into training and test sets, how to use cross-validation to tune model hyperparameters, and how to evaluate model performance in R. We'll also explain some strategies you can use to improve model performance. The series concludes with a brief discussion of the machine learning landscape and how you can continue learning about machine learning and applying it to your research.
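To make that workflow concrete, here is a minimal sketch using only base R and the built-in mtcars data (both assumptions for illustration, not the workshop's actual example) of holding out a test set, fitting a simple classifier on the training set, and measuring its accuracy on the held-out data:

```r
# A minimal sketch: split data into training and test sets, fit a
# classifier (here, logistic regression) on the training set, and
# evaluate accuracy on the test set.
set.seed(1)
n <- nrow(mtcars)
train_rows <- sample(n, size = round(0.7 * n))   # 70% training, 30% test
train <- mtcars[train_rows, ]
test  <- mtcars[-train_rows, ]

# Predict transmission type (am: 0 = automatic, 1 = manual)
model <- glm(am ~ wt + hp, data = train, family = binomial)

# Predicted probabilities for the test set, converted to class labels
probs <- predict(model, newdata = test, type = "response")
preds <- as.integer(probs > 0.5)
mean(preds == test$am)                            # test-set accuracy
```

Cross-validation, sketched after the learning goals below, extends this idea by rotating which portion of the data is held out.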
After this workshop series, learners should be able to:
- Assess whether classification or clustering is relevant to their research problems and data sets;
- Explain the tradeoffs between popular clustering algorithms;
- Run a clustering algorithm on their data;
- Build and train a classification model on their data;
- Use cross-validation to estimate accuracy and tune hyperparameters for classification models (see the sketch after this list);
- Identify strategies to improve results from classification models.
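As a sketch of the cross-validation goal above, assuming the class package's knn() function and the iris data as a stand-in for your own data (neither of which is prescribed by the workshop), the following 5-fold cross-validation loop estimates accuracy for several values of the number-of-neighbors hyperparameter k:

```r
# A minimal sketch: 5-fold cross-validation to tune k for a
# k-nearest-neighbors classifier.
library(class)   # provides knn()

set.seed(1)
x <- scale(iris[, 1:4])
y <- iris$Species
folds <- sample(rep(1:5, length.out = nrow(iris)))   # assign each row to a fold

candidate_k <- c(1, 3, 5, 7, 9)
cv_accuracy <- sapply(candidate_k, function(k) {
  # Train on four folds, predict the held-out fold, and average the accuracy
  fold_acc <- sapply(1:5, function(f) {
    preds <- knn(train = x[folds != f, ], test = x[folds == f, ],
                 cl = y[folds != f], k = k)
    mean(preds == y[folds == f])
  })
  mean(fold_acc)
})

names(cv_accuracy) <- candidate_k
cv_accuracy   # choose the k with the highest cross-validated accuracy
```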
The course reader is a live webpage, hosted through GitHub, that contains the workshop curriculum.