Nutshell Eval done (#1151)
pkopper authored Nov 3, 2023
1 parent 1e756c5 commit a27ea76
Showing 2 changed files with 106 additions and 47 deletions.
Binary file added slides/evaluation/figure/nutshell-overfit.pdf
153 changes: 106 additions & 47 deletions slides/evaluation/slides-evaluation-nutshell.tex
@@ -60,26 +60,41 @@
\begin{vbframe}{Estimating the generalization error}

\begin{itemize}
%\item For a fixed model, we are interested in the Generalization Error (GE): $\GEfL := \E \left[ \Lyfhx) \right]$, i.e. the expected error the model makes for data $(\xv, y) \sim \Pxy$.
%\item We need an estimator for the GE with $m$ test observations: $$\GEh(\fh, L) := \frac{1}{m}\sum_{(\xv, y)} \left[ \Lyfhx \right]$$
%\item However, if $(\xv, y) \in \Dtrain$, $\GEh(\fh, L)$ will be biased via overfitting the training data.
%\item Thus, we estimate the GE using unseen data $(\xv, y) \in \Dtest$:
% $$\GEh(\fh, L) := \frac{1}{m}\sum_{(\xv, y) \in \Dtest} \left[ \Lyfhx \right]$$
\item Previously, we learned how to find good models.
\item However, how can we assess how good they actually are?
\item Idea: Use the value of the loss function $\Lyfhx$ on the training data.
\item Problem: This value can be very optimistic.
\item Example: Overfitting in polynomial regression.
\end{itemize}

\begin{figure}
\includegraphics[width=0.53\textwidth]{figure/nutshell-overfit}
\end{figure}

\begin{itemize}
\item Degree 50 results in the lowest training loss; however, degree 5 seems to be the "best" model (a small sketch follows this list).
\item "Best" means that, on new data, this model will probably produce the most meaningful predictions.
\end{itemize}
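To make this concrete, here is a minimal Python sketch (an illustration added for clarity, not the repository's code; the sine-shaped data-generating process and all names are assumptions):

\begin{verbatim}
# Overfitting sketch: fit polynomials of degree 5 and 50 to noisy data.
# NOTE: illustrative only; the degree-50 fit may trigger a RankWarning.
import numpy as np

rng = np.random.default_rng(1)
x_tr = rng.uniform(0, 1, 60)
y_tr = np.sin(2 * np.pi * x_tr) + rng.normal(0, 0.3, 60)
x_te = rng.uniform(0, 1, 1000)
y_te = np.sin(2 * np.pi * x_te) + rng.normal(0, 0.3, 1000)

for degree in (5, 50):
    coefs = np.polyfit(x_tr, y_tr, degree)   # least-squares polynomial fit
    mse_tr = np.mean((np.polyval(coefs, x_tr) - y_tr) ** 2)
    mse_te = np.mean((np.polyval(coefs, x_te) - y_te) ** 2)
    print(degree, round(mse_tr, 3), round(mse_te, 3))
\end{verbatim}

On such data, the degree-50 fit typically achieves the lower training loss while its test loss is far worse, which is exactly the optimism described above.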

\framebreak

\begin{itemize}
\item In other words, the "best" model will generalize well and have a low \textbf{G}eneralization \textbf{E}rror (GE).
\item Formally, for a fixed model, the GE can be expressed as $\GEfL := \E \left[ \Lyfhx \right]$,
\item i.e., "what is the expected loss for a new observation from the data?"
\item Ideally, the GE should be estimated with new, unseen data.
\item Usually, we have no access to new \textbf{unseen} data, though.
\item Thus, we divide our data set manually into $\Dtrain$ and $\Dtest$ and use the latter to estimate the GE.
\end{itemize}

\begin{center}
% FIGURE SOURCE: https://docs.google.com/drawings/d/13AH298rMnDL5p0SrBd6VCukC9vg1qyRXGqgMcvuPRc0/edit?usp=sharing
\includegraphics[trim = 0 0 0 30, clip, width=0.575\textwidth]
{figure_man/evaluation-intro-ge.pdf}
\end{center}
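The split depicted above can be sketched in a few lines of Python (an added illustration with assumed toy data, not the repository's code):

\begin{verbatim}
# Holdout estimate of the GE: fit on D_train, average the loss on D_test.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 500)

idx = rng.permutation(len(X))
train, test = idx[:350], idx[350:]                # manual D_train / D_test split

beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)  # fit on D_train only
sq_loss = (X[test] @ beta - y[test]) ** 2                   # L(y, f(x)) per test point
print(sq_loss.mean())                             # GE estimate: mean over m test points
\end{verbatim}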

@@ -88,61 +103,78 @@
\begin{vbframe}{Metrics}

But what is $\Lyfhx$?
\vspace{0.2cm}


\begin{itemize}
\item $\Lyfhx$ will always indicate how well the prediction matches the target.
While we can always use the (inner) loss function that we trained the model on as the outer loss, this may not always be ideal.
\item For both classification and regression, there is a large variety of evaluation metrics, of which we will cover just a fraction.
\end{itemize}


%\begin{itemize}
%\item Explicit values of loss functions may not have a \textbf{meaningful interpretation} beyond ordinality.
%\item The loss function may not be applicable to all models that we are interested in comparing (\textbf{model agnostic}ism), e.g. when comparing generative and discriminative approaches.
%\end{itemize}

%Thus, there also exist evaluation metrics that are not based on inner losses.
%Yet, they can (still) be faced with these problems:

%\begin{itemize}
%\item They might be not \textbf{useful} (for a specific use case, e.g. when we have imbalanced data).
%\item They might be im\textbf{proper}, i.e. they might draw false conclusions.
%\end{itemize}

\end{vbframe}

%\begin{vbframe}{Deep Dive: Properness}



%\begin{itemize}


%\item A scoring rule $\mathbf{S}$ is proper relative to $\mathcal {F}$ if (where a low value of the scoring rule is better):
%\end{itemize}

%$$\mathbf {S} (Q,Q) \leq \mathbf {S} (F,Q) \forall F,Q \in \mathcal {F}$$

%with $\mathcal{F}$ being a convex class of probability measures.

%\begin{itemize}
%\item This means that a scoring rule should be optimal for the actual data target distribution, i.e. we are rewarded for properly modeling the target.
%\end{itemize}

%\end{vbframe}

\begin{vbframe}{Metrics for Classification}

Commonly used evaluation metrics include:
\begin{itemize}
\item Accuracy: \\
\begin{itemize}
\item $ \rho_{ACC} = \frac{1}{m} \sum_{i = 1}^m [\yi = \yih] \in [0, 1]. $
\item "Proportion of correctly classified observations."
\end{itemize}
\item Misclassification error (MCE): \\
\begin{itemize}
\item $ \rho_{MCE} = \frac{1}{m} \sum_{i = 1}^m [\yi \neq \yih] \in [0, 1]. $
\item "Proportion of incorrectly classified observations."
\end{itemize}
\item Brier Score: \\
\begin{itemize}
\item $\rho_{BS} = \frac{1}{m} \sum_{i = 1}^m
\left( \hat \pi^{(i)} - \yi \right)^2$
\item "Squared error btw. predicted probability and actual label."
\end{itemize}
\item Log-loss: \\
\begin{itemize}
\item $\rho_{LL} = \frac{1}{m} \sum_{i = 1}^m \left( - \yi \log \left(
\hat \pi^{(i)} \right) - \left( 1-\yi \right) \log \left( 1 - \hat \pi^{(i)}
\right) \right).$
\item "Distance of predicted and actual label distribution."
\end{itemize}
\end{itemize}

The probabilistic metrics, Brier Score and Log-Loss, heavily penalize false confidence, i.e., predicting the wrong label with high probability.
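As an added illustration (not part of the slides; labels and probabilities are made up), the four metrics can be computed directly from predicted probabilities:

\begin{verbatim}
# Accuracy, MCE, Brier score and log-loss for binary labels.
import numpy as np

y = np.array([1, 0, 1, 1, 0])                  # true labels y^(i)
pi_hat = np.array([0.9, 0.2, 0.6, 0.4, 0.1])   # predicted P(y = 1)
y_hat = (pi_hat >= 0.5).astype(int)            # thresholded label predictions

acc = np.mean(y == y_hat)                      # accuracy
mce = np.mean(y != y_hat)                      # misclassification error
brier = np.mean((pi_hat - y) ** 2)             # Brier score
ll = np.mean(-y * np.log(pi_hat)
             - (1 - y) * np.log(1 - pi_hat))   # log-loss
print(acc, mce, brier, ll)
\end{verbatim}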
@@ -167,13 +199,29 @@
$$ \text{Precision} = \frac{TP}{TP + FP}$$
$$ \text{Recall} = \frac{TP}{TP + FN} $$
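A minimal sketch of these two metrics, using assumed toy labels (added for illustration only):

\begin{verbatim}
# Confusion-matrix counts and precision / recall for binary predictions.
import numpy as np

y = np.array([1, 1, 0, 1, 0, 0, 1, 0])       # true labels
y_hat = np.array([1, 0, 0, 1, 1, 0, 1, 0])   # predicted labels

tp = np.sum((y == 1) & (y_hat == 1))         # true positives  (3)
fp = np.sum((y == 0) & (y_hat == 1))         # false positives (1)
fn = np.sum((y == 1) & (y_hat == 0))         # false negatives (1)

print(tp / (tp + fp))                        # precision = 0.75
print(tp / (tp + fn))                        # recall    = 0.75
\end{verbatim}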

\framebreak

\begin{itemize}
\item Other frequently used metrics like the False / True Positive Rate (FPR / TPR) can also be derived from the confusion matrix.
\item The confusion-matrix diagram below covers many of these.
\item Many of these metrics also go by different names.
\end{itemize}

\begin{center}
% FIGURE SOURCE: https://en.wikipedia.org/wiki/F1_score#Diagnostic_testing
\includegraphics[width=0.85\textwidth]{figure_man/roc-confmatrix-allterms.png}
\end{center}

\href{https://en.wikipedia.org/wiki/F1_score#Diagnostic_testing}{\beamergotobutton{Clickable version/picture source}} $\phantom{blablabla}$
\href{https://upload.wikimedia.org/wikipedia/commons/0/0e/DiagnosticTesting_Diagram.svg}{\beamergotobutton{Interactive diagram}}


\end{vbframe}

\begin{vbframe}{Receiver operating characteristics}

\begin{itemize}
\item Receiver operating characteristic (ROC) analysis evaluates binary classifiers beyond single metrics.
\item We can assess classifiers by their TPR (y-axis) and FPR (x-axis).
\item We aim to identify good classifiers that (weakly) dominate others.
\item For example, the "Best" classifier in the image strictly dominates "Pos-25\%" and "Pos-75\%" and weakly dominates the others.
@@ -203,44 +251,55 @@

\begin{vbframe}{Estimating the generalization error (better)}

We can estimate the GE with one test data set via: $$\GEh(\fh, L) := \frac{1}{m}\sum_{(\xv, y) \in \Dtest} \left[ \Lyfhx \right],$$
i.e. we compute the selected metric $\Lyfhx$ for each observation in the test set and take the mean.

\vspace{0.2cm}
This gives an unbiased estimate of the GE.
However, with only a few test observations (small $m$), this estimate will be unstable or, in other words, have high variance (illustrated by the small simulation after the list below).
We have two options to decrease the variance:

\begin{itemize}
\item Increase $m$.
\item Compute $\GEh(\fh, L)$ for multiple test sets and aggregate them.
\end{itemize}
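As an added illustration of the variance issue (the exponential "loss distribution" is purely an assumption), a small simulation shows how the spread of the holdout estimate shrinks with $m$:

\begin{verbatim}
# Spread of the mean test loss over many hypothetical test sets of size m.
import numpy as np

rng = np.random.default_rng(0)

for m in (20, 200, 2000):
    # 500 hypothetical test sets; per-observation losses drawn i.i.d.
    estimates = [rng.exponential(scale=1.0, size=m).mean() for _ in range(500)]
    print(m, round(float(np.std(estimates)), 4))
# The standard deviation of the estimate drops roughly like 1 / sqrt(m).
\end{verbatim}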

\end{vbframe}

\begin{vbframe}{Resampling}

As we do not have access to infinite data, and increasing $m$ would reduce the number of training observations, aggregating over $B$ test sets is the preferred option:

$$\JJ = \JJset,$$

where we compute $\GEh(\fh, L)$ for each set and aggregate the estimates.

These $B$ distinct sets are generated through \textbf{resampling}.

\vspace{0.2cm}

There exist a few well-established resampling strategies:

\begin{itemize}
\item (Repeated) Hold-out / Subsampling
\item Cross validation
\item Bootstrap
\end{itemize}

\end{vbframe}

\begin{vbframe}{Resampling}



All methods aim to generate the train-test splits $\JJ$ by splitting the full data set repeatedly.
The model is trained on the respective train set and evaluated on the test set.

\textbf{Example:} 3-fold cross validation

\begin{center}
% FIGURE SOURCE: practical tuning paper
\includegraphics[width=0.6\textwidth]{figure_man/crossvalidation.png}
\end{center}
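A minimal 3-fold cross-validation sketch in plain NumPy (added for illustration; the linear toy data and all names are assumptions, not the repository's code):

\begin{verbatim}
# Build B = 3 train/test splits, fit on each train part, evaluate on each test part.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(0, 0.2, 90)

folds = np.array_split(rng.permutation(len(X)), 3)   # three disjoint test sets

ge_estimates = []
for k in range(3):
    test = folds[k]
    train = np.concatenate([folds[j] for j in range(3) if j != k])
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)     # fit on train split
    ge_estimates.append(np.mean((X[test] @ beta - y[test]) ** 2))  # loss on test split

print(np.mean(ge_estimates))   # aggregated GE estimate over the B = 3 splits
\end{verbatim}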



