class: middle, center, title-slide
Lecture 1: Fundamentals of machine learning
Prof. Gilles Louppe
???
R: over-parametrization https://arxiv.org/pdf/2109.02355.pdf
A recap on statistical learning:
- Supervised learning
- Empirical risk minimization
- Under-fitting and over-fitting
- Bias-variance dilemma
class: middle
Consider an unknown joint probability distribution $p_{X,Y}$.

Assume training data $(\mathbf{x}_i, y_i) \sim p_{X,Y}$, with $\mathbf{x}_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$, for $i=1, ..., N$.

- In most cases,
    - $\mathbf{x}_i$ is a $p$-dimensional vector of features or descriptors,
    - $y_i$ is a scalar (e.g., a category or a real value).
- The training data is generated i.i.d.
- The training data can be of any finite size $N$.
- In general, we do not have any prior information about $p_{X,Y}$.
???
In most cases, x is a vector, but it could be an image, a piece of text or a sample of sound.
class: middle
Supervised learning is usually concerned with the two following inference problems:
- Classification:
  Given $(\mathbf{x}_i, y_i) \in \mathcal{X}\times\mathcal{Y} = \mathbb{R}^p \times \bigtriangleup^C$, for $i=1, ..., N$, we want to estimate for any new $\mathbf{x}$, $$\arg \max_y p(Y=y|X=\mathbf{x}).$$
- Regression:
  Given $(\mathbf{x}_i, y_i) \in \mathcal{X}\times\mathcal{Y} = \mathbb{R}^p \times \mathbb{R}$, for $i=1, ..., N$, we want to estimate for any new $\mathbf{x}$, $$\mathbb{E}\left[ Y|X=\mathbf{x} \right].$$
???
class: middle, center
Classification consists in identifying
a decision boundary between objects of distinct classes.
class: middle, center
Regression aims at estimating relationships among (usually continuous) variables.
class: middle
Supervised learning can be framed as probabilistic inference, where the goal is to estimate the conditional distribution $p(y|\mathbf{x})$.
This is the framing we will adopt in this course (starting from Lecture 2).
The traditional perspective on supervised learning is empirical risk minimization.
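Consider a function $f : \mathcal{X} \to \mathcal{Y}$ produced by some learning algorithm. Its predictions are evaluated with a loss function $\ell(y, f(\mathbf{x})) \geq 0$ that measures how far the prediction $f(\mathbf{x})$ is from the true output $y$.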
## Examples of loss functions
.grid[
.kol-1-3[Classification:]
.kol-2-3[$\ell(y,f(\mathbf{x})) = \mathbf{1}_{y \neq f(\mathbf{x})}$]
]
.grid[
.kol-1-3[Regression:]
.kol-2-3[$\ell(y,f(\mathbf{x})) = (y - f(\mathbf{x}))^2$]
]
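
As a minimal sketch, the two losses above can be written with NumPy as follows. The function names and the toy arrays are illustrative assumptions, not part of the lecture material.

```python
import numpy as np

def zero_one_loss(y, y_pred):
    """Classification loss: 1 where the prediction differs from the label, else 0."""
    return (np.asarray(y) != np.asarray(y_pred)).astype(float)

def squared_error_loss(y, y_pred):
    """Regression loss: squared difference between target and prediction."""
    return (np.asarray(y, dtype=float) - np.asarray(y_pred, dtype=float)) ** 2

# Hypothetical labels and predictions for a 3-class problem: one mistake out of four.
labels = np.array([0, 1, 2, 1])
predicted_labels = np.array([0, 2, 2, 1])
zero_one = zero_one_loss(labels, predicted_labels)

# Hypothetical real-valued targets and predictions.
targets = np.array([1.0, 2.0, 3.0])
predictions = np.array([1.5, 2.0, 1.0])
squared = squared_error_loss(targets, predictions)
```

Both functions return one loss value per sample, which keeps them usable as building blocks for averaged quantities such as a risk estimate.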
class: middle
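Let $\mathcal{F}$ denote the hypothesis space, i.e. the set of all functions $f$ that can be produced by the chosen learning algorithm.

We are looking for a function $f \in \mathcal{F}$ with a small expected risk (or generalization error)
$$R(f) = \mathbb{E}_{(\mathbf{x},y)\sim p_{X,Y}}\left[ \ell(y, f(\mathbf{x})) \right].$$

This means that for a given data generating distribution $p_{X,Y}$ and for a given hypothesis space $\mathcal{F}$, the optimal model is
$$f_* = \arg \min_{f \in \mathcal{F}} R(f).$$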
class: middle
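Since $p_{X,Y}$ is unknown, the expected risk cannot be evaluated and the optimal model cannot be determined.

However, if we have i.i.d. training data $\mathbf{d} = \\{(\mathbf{x}_i, y_i) | i=1,\ldots,N\\}$, we can compute an estimate, the empirical risk (or training error)
$$\hat{R}(f, \mathbf{d}) = \frac{1}{N} \sum_{(\mathbf{x}_i, y_i) \in \mathbf{d}} \ell(y_i, f(\mathbf{x}_i)).$$

This estimator is unbiased and can be used for finding a good enough approximation of $f_*$. This results in the empirical risk minimization principle:
$$f_*^{\mathbf{d}} = \arg \min_{f \in \mathcal{F}} \hat{R}(f, \mathbf{d})$$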
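
To make the principle concrete, here is a minimal sketch of empirical risk minimization over a tiny, finite hypothesis space. The data generating process, the candidate slopes and the helper names are all assumptions chosen for illustration.

```python
import numpy as np

def empirical_risk(f, loss, X, y):
    """Average loss of model f over the training data d = (X, y)."""
    return float(np.mean(loss(y, f(X))))

def squared_error(y, y_pred):
    return (y - y_pred) ** 2

# Hypothetical toy data: y = 2x + Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * X + rng.normal(0.0, 0.1, size=200)

# A tiny hypothesis space: linear functions x -> w * x for a few candidate slopes w.
slopes = [0.0, 1.0, 2.0, 3.0]
risks = {w: empirical_risk(lambda x, w=w: w * x, squared_error, X, y) for w in slopes}

# Empirical risk minimization: keep the candidate with the smallest empirical risk.
best_slope = min(risks, key=risks.get)
```

In practice the hypothesis space is continuous and the minimization is carried out analytically or by gradient-based optimization rather than by enumeration.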
???
What does unbiased mean?
=> The expected empirical risk estimate (over d) is the expected risk.
class: middle
Most machine learning algorithms, including neural networks, implement empirical risk minimization.
Under regularity assumptions, empirical risk minimizers converge:
$$\lim_{N \to \infty} f_*^{\mathbf{d}} = f_*$$
???
This is why tuning the parameters of the model to make it work on the training data is a reasonable thing to do.
Consider the joint probability distribution
class: middle
Our goal is to find a function
Consider the hypothesis space
class: middle
For this regression problem, we use the squared error loss
$$\ell(y, f(x)) = (y - f(x))^2$$
to measure how wrong the predictions are.
Therefore, our goal is to find the best value
class: middle
Given a large enough training set
class: middle
This is ordinary least squares regression, for which the solution is derived as
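
A minimal sketch of ordinary least squares for a polynomial model is given below. The data generating process (a cubic with Gaussian noise), the coefficient values and the variable names `X`, `w_hat` and `w_normal` are assumptions made for illustration and may differ from the toy problem of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data generating process (an assumption, not taken from the slides).
true_w = np.array([1.0, -2.0, 0.5, 0.3])  # coefficients of 1, x, x^2, x^3
N = 500
x = rng.uniform(-3.0, 3.0, size=N)
y = np.polynomial.polynomial.polyval(x, true_w) + rng.normal(0.0, 1.0, size=N)

# Design matrix with columns [1, x, x^2, x^3].
X = np.vander(x, N=4, increasing=True)

# Least-squares solution; lstsq avoids forming the inverse of X^T X explicitly.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# The same solution obtained from the normal equations X^T X w = X^T y.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)
```

With enough training samples, `w_hat` should land close to `true_w`, which is the behavior the empirical risk minimization principle leads us to expect.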
class: middle
The expected risk minimizer
Therefore, on this toy problem, we can verify that
class: middle
class: middle count: false
class: middle count: false
class: middle count: false
class: middle count: false
What if we consider a hypothesis space
class: middle
???
In this course, we will argue for
Large parameter spaces are not a problem, as long as the capacity of the hypothesis space is controlled. For example, by using stochastic gradient descent, we can optimize
class: middle
.center[$\mathcal{F}$ = polynomials of degree 1]
class: middle count: false
.center[$\mathcal{F}$ = polynomials of degree 2]
class: middle count: false
.center[$\mathcal{F}$ = polynomials of degree 3]
class: middle count: false
.center[$\mathcal{F}$ = polynomials of degree 4]
class: middle count: false
.center[$\mathcal{F}$ = polynomials of degree 5]
class: middle count: false
.center[$\mathcal{F}$ = polynomials of degree 10]
class: middle, center
Degree
???
Why shouldn't we pick the largest
class: middle
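Let $\mathcal{Y}^{\mathcal{X}}$ be the set of all functions $f : \mathcal{X} \to \mathcal{Y}$.

We define the Bayes risk as the minimal expected risk over all possible functions,
$$R_B = \min_{f \in \mathcal{Y}^{\mathcal{X}}} R(f),$$
and call the Bayes model the model $f_B$ that achieves this minimum.

No model $f$ can perform better than $f_B$.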
class: middle
The capacity of a hypothesis space induced by a learning algorithm intuitively represents the ability to find a good model $f \in \mathcal{F}$.
In practice, capacity can be controlled through hyper-parameters of the learning algorithm. For example:
- The degree of the family of polynomials;
- The number of layers in a neural network;
- The number of training iterations;
- Regularization terms.
???
We talk about the capacity of the hypothesis space induced by the learning algorithm (parametric model + optimization algorithm). This is different from the capacity of the model itself.
class: middle
- If the capacity of $\mathcal{F}$ is too low, then $f_B \notin \mathcal{F}$ and $R(f) - R_B$ is large for any $f \in \mathcal{F}$, including $f_*$ and $f_*^{\mathbf{d}}$. Such models $f$ are said to underfit the data.
- If the capacity of $\mathcal{F}$ is too high, then $f_B \in \mathcal{F}$ or $R(f_*) - R_B$ is small.
  However, because of the high capacity of the hypothesis space, the empirical risk minimizer $f_*^{\mathbf{d}}$ could fit the training data arbitrarily well such that $$R(f_*^{\mathbf{d}}) \geq R_B \geq \hat{R}(f_*^{\mathbf{d}}, \mathbf{d}) \geq 0.$$
  The model $f_*^{\mathbf{d}}$ then becomes too specialized with respect to the true data generating process, and a large reduction of the empirical risk (often) comes at the price of an increase of the expected risk of the empirical risk minimizer $R(f_*^{\mathbf{d}})$. In this situation, $f_*^{\mathbf{d}}$ is said to overfit the data.
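
The following sketch illustrates both regimes on a hypothetical toy problem, using polynomial degree as the capacity knob. The data generating process, the sample sizes and the helper names are assumptions; the error on a large independent sample is used as a stand-in for the expected risk.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Hypothetical data generating process: a cubic polynomial plus Gaussian noise."""
    x = rng.uniform(-3.0, 3.0, size=n)
    y = 1.0 - 2.0 * x + 0.5 * x**2 + 0.3 * x**3 + rng.normal(0.0, 1.0, size=n)
    return x, y

def design(x, degree):
    """Polynomial features [1, x, ..., x^degree]."""
    return np.vander(x, N=degree + 1, increasing=True)

x_train, y_train = sample(20)      # small training set d
x_large, y_large = sample(1000)    # large independent sample approximating the expected risk

train_error, approx_risk = {}, {}
for degree in (1, 3, 10):
    # Empirical risk minimizer within polynomials of the given degree.
    w, *_ = np.linalg.lstsq(design(x_train, degree), y_train, rcond=None)
    train_error[degree] = float(np.mean((design(x_train, degree) @ w - y_train) ** 2))
    approx_risk[degree] = float(np.mean((design(x_large, degree) @ w - y_large) ** 2))
```

Because the hypothesis spaces are nested, `train_error` can only decrease as the degree grows, whereas `approx_risk` is free to go back up once the model starts fitting the noise.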
class: middle
Therefore, our goal is to adjust the capacity of the hypothesis space such that the expected risk of the empirical risk minimizer gets as low as possible.
???
Comment that for deep networks, the training error may go to 0 while the generalization error does not necessarily go up!
class: middle
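When overfitting,
$$R(f_*^{\mathbf{d}}) \geq R_B \geq \hat{R}(f_*^{\mathbf{d}}, \mathbf{d}) \geq 0.$$

This indicates that the empirical risk $\hat{R}(f_*^{\mathbf{d}}, \mathbf{d})$ is a poor estimator of the expected risk $R(f_*^{\mathbf{d}})$.

Nevertheless, an unbiased estimate of the expected risk can be obtained by evaluating $f_*^{\mathbf{d}}$ on data $\mathbf{d}_\text{test}$ independent from the training samples $\mathbf{d}$:
$$\hat{R}(f_*^{\mathbf{d}}, \mathbf{d}_\text{test}) = \frac{1}{|\mathbf{d}_\text{test}|} \sum_{(\mathbf{x}_i, y_i) \in \mathbf{d}_\text{test}} \ell(y_i, f_*^{\mathbf{d}}(\mathbf{x}_i))$$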
This test error estimate can be used to evaluate the actual performance of the model. However, it should not be used, at the same time, for model selection.
class: middle, center
Degree
???
What value of
But then how good is this selected model?
class: middle
There may be over-fitting, but it does not bias the final performance evaluation.
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
.center[This should be avoided at all costs!]
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
.center[Instead, keep a separate validation set for tuning the hyper-parameters.]
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
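
A minimal sketch of this protocol, assuming a hypothetical cubic data generating process and polynomial degree as the only hyper-parameter: the degree is selected on the validation set, and the test set is touched only once, to report the final error.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Hypothetical data generating process: a cubic polynomial plus Gaussian noise."""
    x = rng.uniform(-3.0, 3.0, size=n)
    y = 1.0 - 2.0 * x + 0.5 * x**2 + 0.3 * x**3 + rng.normal(0.0, 1.0, size=n)
    return x, y

def fit(x, y, degree):
    """Least-squares polynomial fit; returns the coefficients of 1, x, ..., x^degree."""
    X = np.vander(x, N=degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def mse(w, x, y):
    """Mean squared error of the polynomial with coefficients w on (x, y)."""
    X = np.vander(x, N=len(w), increasing=True)
    return float(np.mean((X @ w - y) ** 2))

# Three independent sets: train to fit, validation to select, test to report.
x_train, y_train = sample(30)
x_valid, y_valid = sample(200)
x_test, y_test = sample(200)

degrees = [1, 2, 3, 4, 5, 10]
models = {d: fit(x_train, y_train, d) for d in degrees}
valid_errors = {d: mse(models[d], x_valid, y_valid) for d in degrees}

best_degree = min(valid_errors, key=valid_errors.get)   # model selection
test_error = mse(models[best_degree], x_test, y_test)   # final, single evaluation
```

The validation error of the selected model is optimistically biased because it was used for the selection; `test_error` is not, since the test data played no role in any decision.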
???
Comment on the comparison of algorithms from one paper to the other.
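Consider a fixed point $x$ and the prediction $f_*^{\mathbf{d}}(x)$ of the empirical risk minimizer at $x$.

Then the local expected risk of $f_*^{\mathbf{d}}$ is
$$R(f_*^{\mathbf{d}}|x) = R(f_B|x) + (f_B(x) - f_*^{\mathbf{d}}(x))^2$$
where
- $R(f_B|x)$ is the local expected risk of the Bayes model. This term cannot be reduced.
- $(f_B(x) - f_*^{\mathbf{d}}(x))^2$ represents the discrepancy between $f_B$ and $f_*^{\mathbf{d}}$.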
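
One way to see where these two terms come from, assuming the squared error loss and writing $f_B(x) = \mathbb{E}_{y \sim p_{Y|x}}\left[ y \right]$ for the Bayes model (the notation $p_{Y|x}$ for the conditional distribution of $y$ at $x$ is an assumption), is to add and subtract $f_B(x)$ inside the square:

$$\begin{aligned}
R(f_*^{\mathbf{d}}|x) &= \mathbb{E}_{y \sim p_{Y|x}} \left[ (y - f_*^{\mathbf{d}}(x))^2 \right] \\
&= \mathbb{E}_{y \sim p_{Y|x}} \left[ (y - f_B(x) + f_B(x) - f_*^{\mathbf{d}}(x))^2 \right] \\
&= \mathbb{E}_{y \sim p_{Y|x}} \left[ (y - f_B(x))^2 \right] + 2 (f_B(x) - f_*^{\mathbf{d}}(x)) \, \mathbb{E}_{y \sim p_{Y|x}} \left[ y - f_B(x) \right] + (f_B(x) - f_*^{\mathbf{d}}(x))^2 \\
&= R(f_B|x) + (f_B(x) - f_*^{\mathbf{d}}(x))^2
\end{aligned}$$

The cross term vanishes because $\mathbb{E}_{y \sim p_{Y|x}} \left[ y - f_B(x) \right] = 0$ by definition of $f_B$ under the squared error loss.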
class: middle
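If $\mathbf{d}$ is itself considered as a random variable, then $f_*^{\mathbf{d}}$ is also a random variable, along with its predictions $f_*^{\mathbf{d}}(x)$.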
class: middle
???
What do you observe?
class: middle count: false
class: middle count: false
class: middle count: false
class: middle count: false
class: middle
Formally, taking the expectation of the local expected risk over training sets $\mathbf{d}$ yields:
$$\begin{aligned}
&\mathbb{E}_\mathbf{d} \left[ R(f_*^{\mathbf{d}}|x) \right] \\
&= \mathbb{E}_\mathbf{d} \left[ R(f_B|x) + (f_B(x) - f_*^{\mathbf{d}}(x))^2 \right] \\
&= R(f_B|x) + \mathbb{E}_\mathbf{d} \left[ (f_B(x) - f_*^{\mathbf{d}}(x))^2 \right] \\
&= \underbrace{R(f_B|x)}_{\text{noise}(x)} + \underbrace{(f_B(x) - \mathbb{E}_\mathbf{d}\left[ f_*^\mathbf{d}(x) \right] )^2}_{\text{bias}^2(x)} + \underbrace{\mathbb{E}_\mathbf{d}\left[ ( \mathbb{E}_\mathbf{d}\left[ f_*^\mathbf{d}(x) \right] - f_*^\mathbf{d}(x))^2 \right]}_{\text{var}(x)}
\end{aligned}$$
This decomposition is known as the bias-variance decomposition.
- The noise term quantifies the irreducible part of the expected risk.
- The bias term measures the discrepancy between the average model and the Bayes model.
- The variance term quantifies the variability of the predictions.
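
The bias and variance terms can be estimated by simulation when the data generating process is known. The sketch below assumes a hypothetical cubic Bayes model `g`, Gaussian noise, and polynomial least squares as the learning algorithm; it repeatedly draws training sets and looks at the predictions at a fixed point `x0`.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0  # noise standard deviation (assumed)

def g(x):
    """Hypothetical Bayes model: the noise-free cubic used to generate the data."""
    return 1.0 - 2.0 * x + 0.5 * x**2 + 0.3 * x**3

def prediction_at(x0, degree, n_train):
    """Fit a polynomial by least squares on a fresh training set, then predict at x0."""
    x = rng.uniform(-3.0, 3.0, size=n_train)
    y = g(x) + rng.normal(0.0, sigma, size=n_train)
    X = np.vander(x, N=degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.vander([x0], N=degree + 1, increasing=True)[0] @ w)

x0 = 0.0
results = {}
for degree in (1, 10):
    # Predictions at x0 across 500 independently drawn training sets.
    preds = np.array([prediction_at(x0, degree, n_train=30) for _ in range(500)])
    bias2 = (g(x0) - preds.mean()) ** 2            # squared bias at x0
    var = preds.var()                              # variance of the predictions at x0
    expected_gap = np.mean((g(x0) - preds) ** 2)   # Monte Carlo estimate of E_d[(f_B(x0) - f_d(x0))^2]
    results[degree] = (bias2, var, expected_gap)
```

For each degree, `expected_gap` equals `bias2 + var` up to floating-point error, which mirrors the last step of the decomposition; adding `sigma**2` gives the estimated expected local risk.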
class: middle
- Reducing the capacity makes $f_*^\mathbf{d}$ fit the data less on average, which increases the bias term.
- Increasing the capacity makes $f_*^\mathbf{d}$ vary a lot with the training data, which increases the variance term.
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle, center, red-slide
What about a neural network with .bold[millions] of parameters?
class: middle
class: middle
.footnote[Credits: Belkin et al, 2018.]
???
This plot is known as the "double descent" curve. It shows that the test error can decrease as the number of parameters increases, even after the model has enough capacity to fit the training data.
The x-axis is misleading, as the number of parameters is not the same as the capacity.
class: middle
.footnote[Credits: Belkin et al, 2018.]
class: end-slide, center count: false
The end.