Nutshell Eval done (#1151)
pkopper authored Nov 3, 2023
1 parent 1e756c5 commit a27ea76
Showing 2 changed files with 106 additions and 47 deletions.
Binary file added slides/evaluation/figure/nutshell-overfit.pdf
153 changes: 106 additions & 47 deletions slides/evaluation/slides-evaluation-nutshell.tex
@@ -60,26 +60,41 @@
\begin{vbframe}{Estimating the generalization error}

\begin{itemize}
%\item For a fixed model, we are interested in the Generalization Error (GE): $\GEfL := \E \left[ \Lyfhx) \right]$, i.e. the expected error the model makes for data $(\xv, y) \sim \Pxy$.
%\item We need an estimator for the GE with $m$ test observations: $$\GEh(\fh, L) := \frac{1}{m}\sum_{(\xv, y)} \left[ \Lyfhx \right]$$
%\item However, if $(\xv, y) \in \Dtrain$, $\GEh(\fh, L)$ will be biased via overfitting the training data.
%\item Thus, we estimate the GE using unseen data $(\xv, y) \in \Dtest$:
% $$\GEh(\fh, L) := \frac{1}{m}\sum_{(\xv, y) \in \Dtest} \left[ \Lyfhx \right]$$
\item Previously, we learned how to find good models.
\item However, how can we assess how good they actually are?
\item Idea: Use the value of the loss function $\Lyfhx$ on the training data.
\item Problem: This value can be very optimistic.
\item Example: Overfitting in polynomial regression.
\end{itemize}

\begin{figure}
\includegraphics[width=0.53\textwidth]{figure/nutshell-overfit}
\end{figure}

\begin{itemize}
\item Degree 50 results in the lowest training loss; however, degree 5 seems to be the "best" model (a small sketch follows this list).
\item "Best" means that, on new data, this model will probably produce the most meaningful predictions.
\end{itemize}
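To make this concrete, here is a minimal Python sketch (an illustration added for clarity, not the repository's code; the sine-shaped data-generating process and all names are assumptions):

\begin{verbatim}
# Overfitting sketch: fit polynomials of degree 5 and 50 to noisy data.
# NOTE: illustrative only; the degree-50 fit may trigger a RankWarning.
import numpy as np

rng = np.random.default_rng(1)
x_tr = rng.uniform(0, 1, 60)
y_tr = np.sin(2 * np.pi * x_tr) + rng.normal(0, 0.3, 60)
x_te = rng.uniform(0, 1, 1000)
y_te = np.sin(2 * np.pi * x_te) + rng.normal(0, 0.3, 1000)

for degree in (5, 50):
    coefs = np.polyfit(x_tr, y_tr, degree)   # least-squares polynomial fit
    mse_tr = np.mean((np.polyval(coefs, x_tr) - y_tr) ** 2)
    mse_te = np.mean((np.polyval(coefs, x_te) - y_te) ** 2)
    print(degree, round(mse_tr, 3), round(mse_te, 3))
\end{verbatim}

On such data, the degree-50 fit typically achieves the lower training loss while its test loss is far worse, which is exactly the optimism described above.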

\framebreak

\begin{itemize}
\item In other words, the "best" model will generalize well and have a low \textbf{G}eneralization \textbf{E}rror (GE).
\item Formally, for a fixed model, the GE can be expressed as $\GEfL := \E \left[ \Lyfhx \right]$,
\item i.e., "what is the expected loss for a new observation from the data?"
\item Ideally, the GE should be estimated with new, unseen data.
\item Usually, we have no access to new \textbf{unseen} data, though.
\item Thus, we divide our data set manually into $\Dtrain$ and $\Dtest$ and use the latter to estimate the GE.
\end{itemize}

\begin{center}
% FIGURE SOURCE: https://docs.google.com/drawings/d/13AH298rMnDL5p0SrBd6VCukC9vg1qyRXGqgMcvuPRc0/edit?usp=sharing
\includegraphics[trim = 0 0 0 30, clip, width=0.575\textwidth]
{figure_man/evaluation-intro-ge.pdf}
\end{center}
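The split depicted above can be sketched in a few lines of Python (an added illustration with assumed toy data, not the repository's code):

\begin{verbatim}
# Holdout estimate of the GE: fit on D_train, average the loss on D_test.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 500)

idx = rng.permutation(len(X))
train, test = idx[:350], idx[350:]                # manual D_train / D_test split

beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)  # fit on D_train only
sq_loss = (X[test] @ beta - y[test]) ** 2                   # L(y, f(x)) per test point
print(sq_loss.mean())                             # GE estimate: mean over m test points
\end{verbatim}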

@@ -88,61 +103,78 @@
\begin{vbframe}{Metrics}

But what is $\Lyfhx$?
\vspace{0.2cm}


\begin{itemize}
\item $\Lyfhx$ will always indicate how well the prediction matches the target.
While we can always use the (inner) loss function that we trained the model on as the outer loss, this may not always be ideal.
\item For both classification and regression, there is a large variety of evaluation metrics, of which we will cover just a fraction.
\end{itemize}


%\begin{itemize}
%\item Explicit values of loss functions may not have a \textbf{meaningful interpretation} beyond ordinality.
%\item The loss function may not be applicable to all models that we are interested in comparing (\textbf{model agnostic}ism), e.g. when comparing generative and discriminative approaches.
%\end{itemize}

%Thus, there also exist evaluation metrics that are not based on inner losses.
%Yet, they can (still) be faced with these problems:

%\begin{itemize}
%\item They might be not \textbf{useful} (for a specific use case, e.g. when we have imbalanced data).
%\item They might be im\textbf{proper}, i.e. they might draw false conclusions.
%\end{itemize}

\end{vbframe}

%\begin{vbframe}{Deep Dive: Properness}



%\begin{itemize}


%\item A scoring rule $\mathbf{S}$ is proper relative to $\mathcal {F}$ if (where a low value of the scoring rule is better):
%\end{itemize}

%$$\mathbf {S} (Q,Q) \leq \mathbf {S} (F,Q) \forall F,Q \in \mathcal {F}$$

%with $\mathcal{F}$ being a convex class of probability measures.

%\begin{itemize}
%\item This means that a scoring rule should be optimal for the actual data target distribution, i.e. we are rewarded for properly modeling the target.
%\end{itemize}

%\end{vbframe}

\begin{vbframe}{Metrics for Classification}

Commonly used evaluation metrics include:
\begin{itemize}
\item Accuracy: \\
\begin{itemize}
\item $ \rho_{ACC} = \frac{1}{m} \sum_{i = 1}^m [\yi = \yih] \in [0, 1]. $
\item "Proportion of correctly classified observations."
\end{itemize}
\item Misclassification error (MCE): \\
\begin{itemize}
\item $ \rho_{MCE} = \frac{1}{m} \sum_{i = 1}^m [\yi \neq \yih] \in [0, 1]. $
\item "Proportion of incorrectly classified observations."
\end{itemize}
\item Brier Score: \\
\begin{itemize}
\item $\rho_{BS} = \frac{1}{m} \sum_{i = 1}^m
\left( \hat \pi^{(i)} - \yi \right)^2$
\item "Squared error btw. predicted probability and actual label."
\end{itemize}
\item Log-loss: \\
\begin{itemize}
\item $\rho_{LL} = \frac{1}{m} \sum_{i = 1}^m \left( - \yi \log \left(
\hat \pi^{(i)} \right) - \left( 1-\yi \right) \log \left( 1 - \hat \pi^{(i)}
\right) \right).$
\item "Distance of predicted and actual label distribution."
\end{itemize}
\end{itemize}

The probabilistic metrics, Brier Score and Log-Loss, heavily penalize false confidence, i.e., predicting the wrong label with high probability.
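As an added illustration (not part of the slides; labels and probabilities are made up), the four metrics can be computed directly from predicted probabilities:

\begin{verbatim}
# Accuracy, MCE, Brier score and log-loss for binary labels.
import numpy as np

y = np.array([1, 0, 1, 1, 0])                  # true labels y^(i)
pi_hat = np.array([0.9, 0.2, 0.6, 0.4, 0.1])   # predicted P(y = 1)
y_hat = (pi_hat >= 0.5).astype(int)            # thresholded label predictions

acc = np.mean(y == y_hat)                      # accuracy
mce = np.mean(y != y_hat)                      # misclassification error
brier = np.mean((pi_hat - y) ** 2)             # Brier score
ll = np.mean(-y * np.log(pi_hat)
             - (1 - y) * np.log(1 - pi_hat))   # log-loss
print(acc, mce, brier, ll)
\end{verbatim}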
@@ -167,13 +199,29 @@
$$ \text{Precision} = \frac{TP}{TP + FP}$$
$$ \text{Recall} = \frac{TP}{TP + FN} $$
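A minimal sketch of these two metrics, using assumed toy labels (added for illustration only):

\begin{verbatim}
# Confusion-matrix counts and precision / recall for binary predictions.
import numpy as np

y = np.array([1, 1, 0, 1, 0, 0, 1, 0])       # true labels
y_hat = np.array([1, 0, 0, 1, 1, 0, 1, 0])   # predicted labels

tp = np.sum((y == 1) & (y_hat == 1))         # true positives  (3)
fp = np.sum((y == 0) & (y_hat == 1))         # false positives (1)
fn = np.sum((y == 1) & (y_hat == 0))         # false negatives (1)

print(tp / (tp + fp))                        # precision = 0.75
print(tp / (tp + fn))                        # recall    = 0.75
\end{verbatim}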

\framebreak

\begin{itemize}
\item Other frequently used metrics like the False / True Positive Rate (FPR / TPR) can also be derived from the confusion matrix.
\item The confusion-matrix diagram below covers many of these.
\item Many of these metrics also go by different names.
\end{itemize}

\begin{center}
% FIGURE SOURCE: https://en.wikipedia.org/wiki/F1_score#Diagnostic_testing
\includegraphics[width=0.85\textwidth]{figure_man/roc-confmatrix-allterms.png}
\end{center}

\href{https://en.wikipedia.org/wiki/F1_score#Diagnostic_testing}{\beamergotobutton{Clickable version/picture source}} $\phantom{blablabla}$
\href{https://upload.wikimedia.org/wikipedia/commons/0/0e/DiagnosticTesting_Diagram.svg}{\beamergotobutton{Interactive diagram}}


\end{vbframe}

\begin{vbframe}{Receiver operating characteristics}

\begin{itemize}
\item Receiver operating characteristic (ROC) analysis evaluates binary classifiers beyond single metrics.
\item We can assess classifiers by their TPR (y-axis) and FPR (x-axis).
\item We aim to identify good classifiers that (weakly) dominate others.
\item For example, the "Best" classifier in the image strictly dominates "Pos-25\%" and "Pos-75\%" and weakly dominates the others.
@@ -203,44 +251,55 @@

\begin{vbframe}{Estimating the generalization error (better)}

We can estimate the GE with one test data set via: $$\GEh(\fh, L) := \frac{1}{m}\sum_{(\xv, y) \in \Dtest} \left[ \Lyfhx \right],$$
i.e. we compute the selected metric $\Lyfhx$ for each observation in the test set and take the mean.

\vspace{0.2cm}
This gives an unbiased estimate of the GE.
However, with only a few test observations (small $m$), this estimate will be unstable or, in other words, have high variance (illustrated by the small simulation after the list below).
We have two options to decrease the variance:

\begin{itemize}
\item Increase $m$.
\item Compute $\GEh(\fh, L)$ for multiple test sets and aggregate them.
\end{itemize}
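As an added illustration of the variance issue (the exponential "loss distribution" is purely an assumption), a small simulation shows how the spread of the holdout estimate shrinks with $m$:

\begin{verbatim}
# Spread of the mean test loss over many hypothetical test sets of size m.
import numpy as np

rng = np.random.default_rng(0)

for m in (20, 200, 2000):
    # 500 hypothetical test sets; per-observation losses drawn i.i.d.
    estimates = [rng.exponential(scale=1.0, size=m).mean() for _ in range(500)]
    print(m, round(float(np.std(estimates)), 4))
# The standard deviation of the estimate drops roughly like 1 / sqrt(m).
\end{verbatim}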

\end{vbframe}

\begin{vbframe}{Resampling}

As we do not have access to infinite data, and increasing $m$ would reduce the number of training observations, aggregating over $B$ test sets is the preferred option:

$$\JJ = \JJset,$$

where we compute $\GEh(\fh, L)$ for each set and aggregate the estimates.

These $B$ distinct sets are generated through \textbf{resampling}.

\vspace{0.2cm}

There exist a few well-established resampling strategies:

\begin{itemize}
\item (Repeated) Hold-out / Subsampling
\item Cross validation
\item Bootstrap
\end{itemize}

\end{vbframe}

\begin{vbframe}{Resampling}



All methods aim to generate the train-test splits $\JJ$ by splitting the full data set repeatedly.
The model is trained on the respective train set and evaluated on the test set.

\textbf{Example:} 3-fold cross validation

\begin{center}
% FIGURE SOURCE: practical tuning paper
\includegraphics[width=0.6\textwidth]{figure_man/crossvalidation.png}
\end{center}
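A minimal 3-fold cross-validation sketch in plain NumPy (added for illustration; the linear toy data and all names are assumptions, not the repository's code):

\begin{verbatim}
# Build B = 3 train/test splits, fit on each train part, evaluate on each test part.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(0, 0.2, 90)

folds = np.array_split(rng.permutation(len(X)), 3)   # three disjoint test sets

ge_estimates = []
for k in range(3):
    test = folds[k]
    train = np.concatenate([folds[j] for j in range(3) if j != k])
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)     # fit on train split
    ge_estimates.append(np.mean((X[test] @ beta - y[test]) ** 2))  # loss on test split

print(np.mean(ge_estimates))   # aggregated GE estimate over the B = 3 splits
\end{verbatim}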



