---
title: "1. Introduction to High Dimensional Data Analysis"
author: "Lieven Clement"
date: "statOmics, Ghent University (https://statomics.github.io)"
output:
  bookdown::pdf_document2:
    toc: true
    number_sections: true
    latex_engine: xelatex
always_allow_html: true
---
```{r, child="_setup.Rmd"}
```
# Introduction
```{r setup2, include=FALSE}
knitr::opts_chunk$set(message = FALSE)
```
- We live in a big data era
- Massive Datasets on our location, surfing behavior, consumer data, social media, ...
- Life Sciences: advent of high throughput technologies has enabled us to measure brain activity (brain images) as well as the expression of thousands of genes, proteins, ... for each subject or even single cell.
- Industry: process control with many sensors that follow process in real time...
- Data-driven journalism
- ...
Challenge: We have to learn from high dimensional data!
---
## What are high dimensional data?
- We typically observe multiple variables/features ($p$ in total) for each subject/experimental unit $i=1,\ldots,n$, i.e.
\[ \mathbf{x}_i^T=[x_{i1},\ldots,x_{ip}]\]
- Multivariate statistics have a long history, but these methods were designed for the $n>>>p$ case.
- Nowadays many high throughput technologies generate multivariate data with many variables (large $p$) as compared to the number of independent replicates or samples (sample size $n$), resulting in **high-dimensional data**, which are characterised by
\[
p >>> n.
\]
- New statistical methods for dealing with the $p >>> n$ case have been developed in the last 20 years. Some of them are adaptations of multivariate methods.
---
Issues with high-dimensional data:
- *Computational problems*: large matrices, numerical accuracy of computers become an issue
- *Classical asymptotic theory does not hold* for $p\rightarrow\infty$ as $n \rightarrow \infty$
- *Model (or feature) selection* requires specialised methods to deal with the enormous number of possible models (e.g. in linear regression with $p$ potential predictors there are $2^p$ possible models; see the small illustration after this list).
- Models that can potentially depend on a large number $p$ of predictors are vulnerable to *huge overfitting*.
- In searching for associations between an outcome and a large number $p$ of potential explanatory variables, we are at risk of making *many false discoveries*.
- The *Curse of Dimensionality* may cause a prediction model to become useless. ($n$ data points become sparse in large-$p$ dimensional space)
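A tiny numerical illustration of the model selection bullet above (added as a sketch; the numbers follow directly from the $2^p$ formula):
```{r}
## Number of candidate linear models, 2^p, for a growing number of predictors p
p <- c(10, 20, 30, 100)
data.frame(p = p, nModels = 2^p)
```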
---
# Important tasks in high dimensional data analysis?
## Example: Kang et al. (2018)'s droplet-based scRNA-seq data of PBMCs from 8 lupus patients, measured before and after 6 h of treatment with IFN-$\beta$ (16 samples in total).
```{r libraries, message=FALSE, warning=FALSE}
library(tidyverse)
## Packages to load and visualize the single-cell data
library(ExperimentHub)
library(scater)
```
```{r}
## Load Kang single-cell data from ExperimentHub
eh <- ExperimentHub()
sce <- eh[["EH2259"]]
sce
```
- Data on the gene expression of `r nrow(sce)` genes in `r ncol(sce)` cells.
## Data exploration and dimensionality reduction
- Visualisation is a first essential step to learn from data
```{r}
counts(sce)[18990:19000,9:20]
```
- It is impossible to learn about the structure in the data by staring at the data matrix.
- We should be able to explore the data in a low dimensional projection
```{r}
plotReducedDim(sce, dimred="TSNE", colour_by="cell")
plotReducedDim(sce, dimred="TSNE", colour_by="stim")
```
- Note that we see a huge effect of the treatment. If I see this I am always on my guard!
- We contacted the authors and learned that all control cells were sequenced in a first run and all stimulated cells in a second sequencing run. So the large effect might be a batch effect!
## Prediction
```{r}
plotReducedDim(sce, dimred="TSNE", colour_by="cell")
```
- In single cell analysis it is key to identify cell types in scRNA-seq data sets before in-depth investigations of their functional and pathological roles.
- Use models that were built on reference data sets to predict cell types in new data sets based on their gene expression pattern
- Problem: we have `r nrow(sce)` genes at our disposal to build this prediction model!
- Other examples
- Prediction of mortality risk is used on a daily basis in intensive care units to prioritise patient care.
- Facebook: predict the identity of the faces of people that are on a new image that is uploaded.
- Netflix: Suggest movies that you would like
## Large scale hypothesis testing
```{r}
plotReducedDim(sce, dimred="TSNE", colour_by="cell", shape_by="stim")
```
- Which genes are differentially expressed between the control and stimulated condition (assessed within each cell type)?
- For which genes does the differential expression upon treatment change according to the cell type (interaction)?
- We have to model the expression of each gene:
$$
\left \{
\begin{array}{lcl}
y_{ig} &\sim& NB(\mu_{ig}, \phi_{g})\\
E[y_{ig}] &=& \mu_{ig}\\
\log(\mu_{ig}) &=& \eta_{ig}\\
\eta_{ig} &=& \beta_{0,g} + \sum\limits_{c=2}^C \beta_{c,g} X_{ic} + \beta_{s,g} X_{is} + \sum\limits_{c=2}^C \beta_{c:s,g} X_{ic} X_{is} + \sum\limits_{e=2}^E \beta_{e,g} X_{ie} + \alpha_{ig}
\end{array} \right.
$$
With
- Cell type $c=2\ldots C$ and $X_{ic}$ a dummy variable with $X_{ic}=1$ if cell $i$ is of cell type $c$ and $X_{ic}=0$ otherwise. Note that cell type $c=1$ is the reference cell type.
- Indicator $X_{is}$ indicates if cell $i$ was stimulated ($X_{is}=1$) or not ($X_{is}=0$). So the control treatment is the reference treatment.
- Experimental unit $e=2\ldots E$ and $X_{ie}$ a dummy variable with $X_{ie}=1$ if cell $i$ originates from experimental unit (subject) $e$ and $X_{ie}=0$ otherwise. Note that experimental unit $e=1$ is the reference experimental unit.
- $\alpha_{ig}$ a cell-specific and gene-specific normalisation factor.
Suppose we want to test whether the effect of the stimulus (the average difference in expression between stimulated and non-stimulated cells) differs between cell type $c$ and the reference cell type 1:
- $H_0: \beta_{c:s,g} = 0$
- $H_1: \beta_{c:s,g} \neq 0$
- We have to assess this for `r nrow(sce)` genes!
- If we assess each test at the 5% level we can expect 0.05 * `r nrow(sce)` = `r round(0.05*nrow(sce),0)` false positives.
$\rightarrow$ massive multiple testing problem!
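A small simulation sketch (not part of the original analysis): if none of the genes were truly differentially expressed and we still tested each one at the 5% level, roughly 5% of all genes would be flagged by chance alone, while an FDR adjustment (introduced later in the course) removes nearly all of them.
```{r}
## Simulate p-values under the global null: no gene is truly differential
set.seed(123)
pNull <- runif(nrow(sce))
sum(pNull < 0.05)                           # false positives at the 5% level
sum(p.adjust(pNull, method = "BH") < 0.05)  # after FDR adjustment
```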
Note that we cannot differentiate between batch and treatment because of the flaw in the experimental design!
Other examples
- Find regions (voxels) in the brain (on brain image) that are associated with a certain condition/treatment
- Evaluation of trading rules
---
# Linear regression
- Linear regression is a very important statistical tool to study the association between variables and to build prediction models.
## Toy-Data
Consider the following toy dataset with 3 observations $(x, y)$:
```{r}
library(tidyverse)
data <- data.frame(x=1:3,y=c(1,2,2))
data
```
## Model
### Scalar form
- Consider a vector of predictors $\mathbf{x}=(x_1,\ldots,x_p)$ and
- a real-valued response $Y$
- then the linear regression model can be written as
\[
Y=f(\mathbf{x}) +\epsilon=\beta_0+\sum\limits_{j=1}^p x_j\beta_j + \epsilon
\]
with i.i.d. $\epsilon\sim N(0,\sigma^2)$
- We will often work on mean centered variables: $Y_c = Y-\bar{Y}$ and $X_c = X - \bar{X}$, then the intercept is dropped from the model:
\[
Y_c=f(\mathbf{x}_c) +\epsilon=\sum\limits_{j=1}^p x_{cj}\beta_j + \epsilon
\]
### Scalar model for the toy dataset
$$
y_i=\beta_0+\beta_1 x_i + \epsilon_i
$$
If we write the model for each observation:
$$
\begin{array} {lcl}
1 &=& \beta_0+\beta_1 1 + \epsilon_1 \\
2 &=& \beta_0 + \beta_1 2 + \epsilon_2 \\
2 &=& \beta_0+\beta_1 3+ \epsilon_3 \\
\end{array}
$$
### Vector/Matrix form
- $n$ observations $(\mathbf{x}_1,y_1) \ldots (\mathbf{x}_n,y_n)$ with $\mathbf{x}_i^T=[\begin{array}{cccc} 1& x_{i1}& \ldots& x_{ip}\end{array}]$
- Regression in matrix notation
\[\mathbf{Y}=\mathbf{X\beta} + \mathbf{\epsilon}\]
with $\mathbf{Y}=\left[\begin{array}{c}y_1\\ \vdots\\y_n\end{array}\right]$,
$\mathbf{X}=\left[\begin{array}{cccc} 1&x_{11}&\ldots&x_{1p}\\
\vdots&\vdots&&\vdots\\
1&x_{n1}&\ldots&x_{np}
\end{array}\right]$ or $\mathbf{X}=\left[\begin{array}{c} \mathbf{x}_1^T\\\vdots\\\mathbf{x}_n^T\end{array}\right]$,
$\boldsymbol{\beta}=\left[\begin{array}{c}\beta_0\\ \vdots\\ \beta_p\end{array}\right]$ and
$\mathbf{\epsilon}=\left[\begin{array}{c} \epsilon_1 \\ \vdots \\ \epsilon_n\end{array}\right]$
Note that upon centering of $\mathbf{X}$ and $\mathbf{Y}$ the 1 is dropped from each $\mathbf{x}_i$ and thus the column of ones is dropped from $\mathbf{X}_c$.
### Matrix form for toy dataset
We can also write this in matrix form
$$
\mathbf{Y} = \mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}
$$
with
$$
\mathbf{Y}=\left[
\begin{array}{c}
1\\
2\\
2\\
\end{array}\right],
\quad
\mathbf{X}= \left[
\begin{array}{cc}
1&1\\
1&2\\
1&3\\
\end{array}
\right],
\quad \boldsymbol{\beta} = \left[
\begin{array}{c}
\beta_0\\
\beta_1\\
\end{array}
\right]
\quad
\text{and}
\quad
\boldsymbol{\epsilon}=
\left[
\begin{array}{c}
\epsilon_1\\
\epsilon_2\\
\epsilon_3
\end{array}
\right]
$$
---
## Interpretation
From the linear regression we get
\[
E[Y \mid \mathbf{x}] = \mathbf{x}^T\boldsymbol{\beta} .
\]
- Hence, the $\beta$ parameters relate the predictor $\mathbf{x}$ to the mean outcome.
- If we know the covariate pattern $\mathbf{x}$ we can use the model to
predict $Y$.
For a model with a single predictor we obtain
\[
E[Y \mid x] = \beta_0 + \beta_1 x
\]
and
\[
\beta_1 = E[Y\mid x+1] - E[Y\mid x] = \beta_0 + \beta_1 (x+1) - \beta_0 - \beta_1 x.
\]
In this course we will typically center $Y$ and $X$, and then we get the following:
\[
E[Y_c \mid x_c] = \beta_1 x_c
\]
and
\[
\beta_1 = E[Y_c\mid x_c+1] - E[Y_c\mid x_c] .
\]
Note that the estimator for $\beta_1$ is exactly the same whether it is estimated from the model with or without centering.
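A quick check of this claim on the toy data (a small sketch added for illustration): the slope is identical with and without centering; the centered fit simply has no intercept.
```{r}
coef(lm(y ~ x, data))
dataCentered <- data.frame(yc = data$y - mean(data$y),
                           xc = data$x - mean(data$x))
coef(lm(yc ~ -1 + xc, dataCentered))
```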
- The parameter $\beta_1$ can be interpreted as the average difference in the outcome between subjects that differ by one unit in the predictor.
- The parameter $\beta_1$ does not say much about individual outcomes. The residual variance $\sigma^2$ determines how much individual outcomes vary about the mean outcome.
The $\beta$ parameters are used to measure association, but a $\beta \neq 0$ does not necessarily mean that the model will give good predictions.
$$
\hat Y = \mathbf{x}^T \hat \beta
$$
- In Chapter 3 we will discuss the prediction problem for high dimensional data.
Model fit and predictions based on the toy dataset
```{r}
lm1 <- lm(y~x,data)
data$yhat <- lm1$fitted
data %>%
ggplot(aes(x,y)) +
geom_point() +
ylim(0,4) +
xlim(0,4) +
stat_smooth(method = "lm", color = "red", fullrange = TRUE) +
geom_point(aes(x=x, y =yhat), pch = 2, size = 3, color = "red") +
geom_segment(data = data, aes(x = x, xend = x, y = y, yend = yhat), lty = 2 )
```
---
- In Chapter 6 we will discuss the problem of large scale hypothesis testing: testing many hypotheses in a single study (tens of thousands to hundreds of thousands of hypotheses).
- A statistical test is constructed to control the type I error rate at the significance level $\alpha$ to assess the null hypothesis that there is no association between a predictor and the outcome vs the alternative hypothesis that there is an association between a predictor and the outcome.
$$
H_0: \beta_1 = 0 \text{ vs }H_1:\beta_1 \neq 0.
$$
- However, when many hypotheses are to be tested in a single study, the probability to find false associations is no longer controlled if p-values are compared to the significance level $\alpha$.
- We will later introduce the concept of false discovery rates to overcome the problem.
## Least Squares (LS)
- Minimize the residual sum of squares
\begin{eqnarray*}
RSS(\boldsymbol{\beta})&=&\sum\limits_{i=1}^n e^2_i\\\\
&=&\sum\limits_{i=1}^n \left(y_i-\beta_0-\sum\limits_{j=1}^p x_{ij}\beta_j\right)^2
\end{eqnarray*}
- or in matrix notation
\begin{eqnarray*}
RSS(\boldsymbol{\beta})&=&\mathbf{e}^T\mathbf{e}\\\\
&=& \left[\begin{array}{ccc} e_1 &\ldots& e_n \end{array}\right]\left[\begin{array}{c}e_1\\\vdots\\e_n\end{array}\right]\\
&=& e_1^2 + e_2^2 + \ldots + e_n^2\\\\
&=&(\mathbf{Y}-\mathbf{X\beta})^T(\mathbf{Y}-\mathbf{X\beta})\\\\
&=&\Vert \mathbf{Y}-\mathbf{X\beta}\Vert^2_2\\
\end{eqnarray*}
with the $L_2$-norm of a $p$-dimensional vector $\mathbf{v}$ defined as $\Vert \mathbf{v} \Vert_2=\sqrt{v_1^2+\ldots+v_p^2}$
$\rightarrow$ $\hat{\boldsymbol{\beta}}=\text{argmin}_\beta \Vert \mathbf{Y}-\mathbf{X\beta}\Vert^2_2$
---
### Minimize RSS
\[
\begin{array}{ccc}
\frac{\partial RSS}{\partial \boldsymbol{\beta}}&=&\mathbf{0}\\\\
\frac{\partial (\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})^T(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}&=&\mathbf{0}\\\\
-2\mathbf{X}^T(\mathbf{Y}-\mathbf{X}\boldsymbol{\beta})&=&\mathbf{0}\\\\
\mathbf{X}^T\mathbf{X\beta}&=&\mathbf{X}^T\mathbf{Y}\\\\
\hat{\boldsymbol{\beta}}&=&(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}
\end{array}
\]
It can be shown that the estimator is unbiased:
$$
E[\hat{\boldsymbol{\beta}}]=\boldsymbol{\beta}
$$
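As a quick numerical check (a small sketch added here), the closed-form estimator reproduces the coefficients of `lm()` on the toy data:
```{r}
## Least squares estimate from the normal equations for the toy data
Xtoy <- model.matrix(~ x, data)
betaHat <- solve(t(Xtoy) %*% Xtoy) %*% t(Xtoy) %*% data$y
betaHat
coef(lm(y ~ x, data))
```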
---
### Projection
There is also another picture of regression:
- Instead of plotting each observation $i= 1 \ldots n$ as a data-point in $\mathbb{R}^p$ with dimensions $1 \ldots p$ for every variable/feature that is recorded for each observation
- We can also plot $\mathbf{Y}$, $\hat{\mathbf{Y}}$ and each column of $\mathbf{X}$: $\mathbf{X}_j$ with $j=1 \ldots p$ as a vector in $\mathbb{R}^n$ with dimensions $1 \ldots n$ for every observation.
- In this representation linear regression can be interpreted as a projection of the vector $\mathbf{Y}$ onto the subspace of $\mathbb{R}^n$ that is spanned by the vectors for the predictors $\mathbf{X}_1 \ldots \mathbf{X}_p$.
- The space spanned by $\mathbf{X}_1 \ldots \mathbf{X}_p$ is also referred to as the column space of $\mathbf{X}$: the space that consists of all linear combinations of the predictor vectors, i.e. the columns $\mathbf{X}_1 \ldots \mathbf{X}_p$.
#### Intermezzo: Projection of vector on X and Y axis
$$
\mathbf{e}=\left[\begin{array}{c} e_1\\e_2\end{array}\right], \mathbf{u}_1 = \left[\begin{array}{c} 1\\0\end{array}\right], \mathbf{u}_2 = \left[\begin{array}{c} 0\\1\end{array}\right]
$$
```{r echo=FALSE}
plotdata <- data.frame(e1=3,e2=2)
plotdata %>% ggplot(aes(x=e1,y=e2)) +
geom_point() +
geom_segment(aes(x = 3, y = 2, xend = 3, yend = 0),color="orange",linetype=2,size=2) +
geom_segment(aes(x = 3, y = 2, xend = 0, yend = 2),color="orange",linetype=2,size=2) +
geom_segment(aes(x = 0, y = 0, xend = 3, yend = 2),
arrow = arrow(length = unit(0.5, "cm")),color="red",size=2) +
geom_segment(aes(x = 0, y = 0, xend = 3, yend = 0),
arrow = arrow(length = unit(0.5, "cm")),color="orange",size=2) +
geom_segment(aes(x = 0, y = 0, xend = 0, yend = 2),
arrow = arrow(length = unit(0.5, "cm")),color="orange",size=2) +
geom_segment(aes(x = 0, y = 0, xend = 0, yend = 1),
arrow = arrow(length = unit(0.5, "cm")),size=2) +
geom_segment(aes(x = 0, y = 0, xend = 1, yend = 0),
arrow = arrow(length = unit(0.5, "cm")),size=2) +
geom_text(aes(x=3.05,y=2.2,label="e=(e1,e2)"),color="red",size = 7,hjust="left") +
geom_text(aes(x=3.05,y=0,label="(e1,0)"),color="orange",size = 7,hjust="left",vjust="top") +
geom_text(aes(x=0,y=2.2,label="(0,e2)"),color="orange",size = 7,hjust="left") +
geom_text(aes(x=0.05,y=1.2,label="u2=(0,1)"),size = 7,hjust="left") +
geom_text(aes(x=1.05,y=0.2,label="u1=(1,0)"),size = 7,hjust="left") +
coord_fixed() +
xlim(-.5,4) +
ylim(-.5,2.5)
```
1. Projection of error on x-axis
\begin{eqnarray*}
\mathbf{u}_1^T \mathbf{e} &=& \Vert \mathbf{u}_1\Vert_2 \Vert \mathbf{e}\Vert_2 \cos <\mathbf{u}_1,\mathbf{e}>\\
&=&\left[\begin{array}{cc} 1&0\end{array}\right] \left[\begin{array}{c} e_1\\e_2\end{array}\right]\\ &=& 1\times e_1 + 0 \times e_2 \\
&=& e_1\\
\end{eqnarray*}
2. Projection of error on y-axis
\begin{eqnarray*}
\mathbf{u}_2^T \mathbf{e} &=& \left[\begin{array}{cc} 0&1\end{array}\right] \left[\begin{array}{c} e_1\\e_2\end{array}\right]\\ &=& 0\times e_1 + 1 \times e_2 \\
&=& e_2
\end{eqnarray*}
3. Projection of error on itself
\begin{eqnarray*}
\mathbf{e}^T \mathbf{e} &=&\left[\begin{array}{cc} e_1&e_2\end{array}\right] \left[\begin{array}{c} e_1\\e_2\end{array}\right]\\
&=&e_1^2+e_2^2\\
&=&\Vert \mathbf{e} \Vert^2_2 \rightarrow \text{ Pythagorean theorem}
\end{eqnarray*}
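The same projections computed numerically for the vector $\mathbf{e}=(3,2)$ from the figure above (a small illustration added here):
```{r}
e  <- c(3, 2)
u1 <- c(1, 0)
u2 <- c(0, 1)
c(u1 %*% e, u2 %*% e, e %*% e)  # e1, e2 and the squared norm of e
```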
---
#### Interpretation of least squares as a projection
Fitted values:
$$
\begin{array}{lcl}
\hat{\mathbf{Y}} &=& \mathbf{X}\hat{\boldsymbol{\beta}}\\
&=& \mathbf{X} (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}\\
&=& \mathbf{HY}
\end{array}
$$
with $\mathbf{H}$ the projection matrix also referred to as the hat matrix.
```{r}
X <- model.matrix(~x,data)
X
```
```{r}
XtX <- t(X)%*%X
XtX
```
```{r}
XtXinv <- solve(t(X)%*%X)
XtXinv
```
```{r}
H <- X %*% XtXinv %*% t(X)
H
```
```{r}
Y <- data$y
Yhat <- H%*%Y
Yhat
```
- We can also interpret the fit as the projection of the $n\times 1$ vector $\mathbf{Y}$ on the column space of the matrix $\mathbf{X}$.
- So each column in $\mathbf{X}$ is also an $n\times 1$ vector.
- For the toy example $n=3$ and $p=2$.
The other picture of linear regression is to consider $X_0$, $X_1$ and $Y$ as vectors in the space of the data $\mathbb{R}^n$, here $\mathbb{R}^3$ because we have three data points.
So the column space of $\mathbf{X}$ is a plane in the three dimensional space.
\[
\hat{\mathbf{Y}} = \mathbf{X} (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}
\]
1. Plane spanned by column space:
```{r}
originRn <- data.frame(X1=0,X2=0,X3=0)
data$x0 <- 1
dataRn <- data.frame(t(data))
library(plotly)
p1 <- plot_ly(
originRn,
x = ~ X1,
y = ~ X2,
z= ~ X3, name="origin") %>%
add_markers(type="scatter3d") %>%
layout(
scene = list(
aspectmode="cube",
xaxis = list(range=c(-4,4)), yaxis = list(range=c(-4,4)), zaxis = list(range=c(-4,4))
)
)
p1 <- p1 %>%
add_trace(
x = c(0,1),
y = c(0,0),
z = c(0,0),
mode = "lines",
line = list(width = 5, color = "grey"),
type="scatter3d",
name = "obs1") %>%
add_trace(
x = c(0,0),
y = c(0,1),
z = c(0,0),
mode = "lines",
line = list(width = 5, color = "grey"),
type="scatter3d",
name = "obs2") %>%
add_trace(
x = c(0,0),
y = c(0,0),
z = c(0,1),
mode = "lines",
line = list(width = 5, color = "grey"),
type="scatter3d",
name = "obs3") %>%
add_trace(
x = c(0,1),
y = c(0,1),
z = c(0,1),
mode = "lines",
line = list(width = 5, color = "black"),
type="scatter3d",
name = "X1") %>%
add_trace(
x = c(0,1),
y = c(0,2),
z = c(0,3),
mode = "lines",
line = list(width = 5, color = "black"),
type="scatter3d",
name = "X2")
p1
```
2. Vector of Y:
Actual values of $\mathbf{Y}$:
```{r}
data$y
```
\[
\mathbf{Y}=\left[\begin{array}{c}
`r data$y[1]` \\
`r data$y[2]` \\
`r data$y[3]`
\end{array}\right]
\]
```{r}
p2 <- p1 %>%
add_trace(
x = c(0,Y[1]),
y = c(0,Y[2]),
z = c(0,Y[3]),
mode = "lines",
line = list(width = 5, color = "red"),
type="scatter3d",
name = "Y")
p2
```
3. Projection of Y onto column space
Fitted values $\hat{\mathbf{Y}}$:
```{r}
data$yhat
```
\[
\hat{\mathbf{Y}}=\left[\begin{array}{c}
`r data$yhat[1]` \\
`r data$yhat[2]` \\
`r data$yhat[3]`
\end{array}\right]
\]
```{r}
p2 <- p2 %>%
add_trace(
x = c(0,Yhat[1]),
y = c(0,Yhat[2]),
z = c(0,Yhat[3]),
mode = "lines",
line = list(width = 5, color = "orange"),
type="scatter3d",
name="Yhat") %>%
add_trace(
x = c(Y[1],Yhat[1]),
y = c(Y[2],Yhat[2]),
z = c(Y[3],Yhat[3]),
mode = "lines",
line = list(width = 5, color = "red", dash="dash"),
type="scatter3d",
name="Y -> Yhat"
)
p2
```
$\mathbf{Y}$ is projected onto the column space of $\mathbf{X}$, the space spanned by its columns!
#### How does this projection work?
$$
\begin{array}{lcl}
\hat{\mathbf{Y}} &=& \mathbf{X} (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}\\
&=& \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1/2}(\mathbf{X}^T\mathbf{X})^{-1/2}\mathbf{X}^T\mathbf{Y}\\
&=& \mathbf{U}\mathbf{U}^T\mathbf{Y}
\end{array}
$$
- $\mathbf{U}$ is a new orthonormal basis of a two-dimensional subspace of $\mathbb{R}^3$
- The space spanned by $\mathbf{U}$ is the same as the column space of $\mathbf{X}$, i.e. it contains all possible linear combinations of the columns of $\mathbf{X}$.
$\mathbf{U}^T\mathbf{Y}$ is the projection of $\mathbf{Y}$ onto this new orthonormal basis.
```{r}
## Compute (X^T X)^(-1/2) via the eigendecomposition of X^T X
eigenXtX <- eigen(XtX)
XtXinvSqrt <- eigenXtX$vectors %*% diag(1/eigenXtX$values^.5) %*% t(eigenXtX$vectors)
## U = X (X^T X)^(-1/2) has orthonormal columns that span the column space of X
U <- X %*% XtXinvSqrt
```
- $\mathbf{U}$ orthonormal basis
```{r}
U
t(U)%*%U
```
- $\mathbf{UU}^T$ equals projection matrix
```{r}
U%*%t(U)
H
```
```{r}
p3 <- p1 %>%
add_trace(
x = c(0,U[1,1]),
y = c(0,U[2,1]),
z = c(0,U[3,1]),
mode = "lines",
line = list(width = 5, color = "blue"),
type="scatter3d",
name = "U1") %>%
add_trace(
x = c(0,U[1,2]),
y = c(0,U[2,2]),
z = c(0,U[3,2]),
mode = "lines",
line = list(width = 5, color = "blue"),
type="scatter3d",
name = "U2")
p3
```
- $\mathbf{U}^T\mathbf{Y}$ is the projection of $\mathbf{Y}$ in the space spanned by $\mathbf{U}$.
- Indeed $\mathbf{U}_1^T\mathbf{Y}$
```{r}
p4 <- p3 %>%
add_trace(
x = c(0,Y[1]),
y = c(0,Y[2]),
z = c(0,Y[3]),
mode = "lines",
line = list(width = 5, color = "red"),
type="scatter3d",
name = "Y") %>%
add_trace(
x = c(0,U[1,1]*(U[,1]%*%Y)),
y = c(0,U[2,1]*(U[,1]%*%Y)),
z = c(0,U[3,1]*(U[,1]%*%Y)),
mode = "lines",
line = list(width = 5, color = "red",dash="dash"),
type="scatter3d",
name="Y -> U1") %>% add_trace(
x = c(Y[1],U[1,1]*(U[,1]%*%Y)),
y = c(Y[2],U[2,1]*(U[,1]%*%Y)),
z = c(Y[3],U[3,1]*(U[,1]%*%Y)),
mode = "lines",
line = list(width = 5, color = "red", dash="dash"),
type="scatter3d",
name="Y -> U1")
p4
```
- and $\mathbf{U}_2^T\mathbf{Y}$
```{r}
p5 <- p4 %>%
add_trace(
x = c(0,U[1,2]*(U[,2]%*%Y)),
y = c(0,U[2,2]*(U[,2]%*%Y)),
z = c(0,U[3,2]*(U[,2]%*%Y)),
mode = "lines",
line = list(width = 5, color = "red",dash="dash"),
type="scatter3d",
name="Y -> U2") %>% add_trace(
x = c(Y[1],U[1,2]*(U[,2]%*%Y)),
y = c(Y[2],U[2,2]*(U[,2]%*%Y)),
z = c(Y[3],U[3,2]*(U[,2]%*%Y)),
mode = "lines",
line = list(width = 5, color = "red", dash="dash"),
type="scatter3d",
name="Y -> U2")
p5
```
- $\hat{\mathbf{Y}}$ (`Yhat`) is the resulting vector that lies in the plane spanned by $\mathbf{U}_1$ and $\mathbf{U}_2$ and thus also in the column space of $\mathbf{X}$.
```{r}
p6 <- p5 %>%
add_trace(
x = c(0,Yhat[1]),
y = c(0,Yhat[2]),
z = c(0,Yhat[3]),
mode = "lines",
line = list(width = 5, color = "orange"),
type="scatter3d",
name = "Yhat") %>%
add_trace(
x = c(Y[1],Yhat[1]),
y = c(Y[2],Yhat[2]),
z = c(Y[3],Yhat[3]),
mode = "lines",
line = list(width = 5, color = "maroon2"),
type="scatter3d",
name = "e") %>%
add_trace(
x = c(U[1,1]*(U[,1]%*%Y),Yhat[1]),
y = c(U[2,1]*(U[,1]%*%Y),Yhat[2]),
z = c(U[3,1]*(U[,1]%*%Y),Yhat[3]),
mode = "lines",
line = list(width = 5, color = "orange", dash="dash"),
type="scatter3d",
name = "Y -> U") %>%
add_trace(
x = c(U[1,2]*(U[,2]%*%Y),Yhat[1]),
y = c(U[2,2]*(U[,2]%*%Y),Yhat[2]),
z = c(U[3,2]*(U[,2]%*%Y),Yhat[3]),
mode = "lines",
line = list(width = 5, color = "orange", dash="dash"),
type="scatter3d",
name = "Y -> U")
p6
```
### Error
Note that it is also clear from the normal equations in the derivation of the least squares solution that the residual vector $\mathbf{e}=\mathbf{Y}-\mathbf{X}\hat{\boldsymbol{\beta}}$ is orthogonal to the column space of $\mathbf{X}$:
\[
-2 \mathbf{X}^T(\mathbf{Y}-\mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0} \Leftrightarrow \mathbf{X}^T\mathbf{e} = \mathbf{0}
\]
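We can verify this numerically for the toy fit (a small check added here):
```{r}
## The residual of the toy fit is orthogonal to the columns of X,
## so X^T e is zero up to floating point error
res <- Y - Yhat
t(X) %*% res
```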
### Curse of dimensionality?
- Imagine what happens when $p$ approaches $n$ ($p=n$) or becomes much larger than $n$ ($p >> n$)!
- Suppose that we add a predictor $\mathbf{X}_2 = [2,0,1]^T$:
$$
\mathbf{Y}=\left[
\begin{array}{c}
1\\
2\\
2\\
\end{array}\right],
\quad
\mathbf{X}= \left[
\begin{array}{ccc}
1&1&2\\
1&2&0\\
1&3&1\\
\end{array}
\right],
\quad \boldsymbol{\beta} = \left[
\begin{array}{c}
\beta_0\\
\beta_1\\
\beta_2
\end{array}
\right]
\quad
\text{and}
\quad
\boldsymbol{\epsilon}=
\left[
\begin{array}{c}
\epsilon_1\\
\epsilon_2\\
\epsilon_3
\end{array}
\right]
$$
```{r}
data$x2 <- c(2,0,1)
fit <- lm(y~x+x2,data)
# predict values on regular xy grid
x1pred <- seq(-1, 4, length.out = 10)
x2pred <- seq(-1, 4, length.out = 10)
xy <- expand.grid(x = x1pred,
x2 = x2pred)
ypred <- matrix(nrow = 10, ncol = 10,
                data = predict(fit, newdata = data.frame(xy)))
library(plot3D)
# fitted points for droplines to surface
th=20
ph=5
scatter3D(data$x,
data$x2,
Y,
pch = 16,
col="darkblue",
cex = 1,
theta = th,
ticktype = "detailed",
xlab = "x1",
ylab = "x2",
zlab = "y",
colvar=FALSE,
bty = "g",
xlim=c(-1,3),
ylim=c(-1,3),
zlim=c(-2,4))
z.pred3D <- outer(
x1pred,
x2pred,
function(x1,x2)
{
fit$coef[1] + fit$coef[2]*x1 + fit$coef[3]*x2
})
x.pred3D <- outer(
x1pred,
x2pred,
function(x,y) x)
y.pred3D <- outer(
x1pred,
x2pred,
function(x,y) y)
scatter3D(data$x,
data$x2,
data$y,
pch = 16,
col="darkblue",
cex = 1,
theta = th,
ticktype = "detailed",
xlab = "x1",
ylab = "x2",
zlab = "y",
colvar=FALSE,
bty = "g",
xlim=c(-1,4),
ylim=c(-1,4),
zlim=c(-2,4))
surf3D(
x.pred3D,
y.pred3D,
z.pred3D,
col="blue",
facets=NA,
add=TRUE)
```
Note that the linear regression fit is now a plane.
However, we obtain a perfect fit: all the data points fall exactly in the plane! `r set.seed(4);emo::ji("fear")`
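A quick check (added here as an illustration): with $p = n$ the model interpolates the data, so the fitted values reproduce $\mathbf{Y}$ exactly and all residuals are (numerically) zero.
```{r}
cbind(y = data$y, yhat = fitted(fit))
round(residuals(fit), 12)
```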
This is obvious if we look at the column space of X!
```{r}
X <- cbind(X,c(2,0,1))
XtX <- t(X)%*%X
eigenXtX <- eigen(XtX)
XtXinvSqrt <- eigenXtX$vectors %*%diag(1/eigenXtX$values^.5)%*%t(eigenXtX$vectors)
U <- X %*% XtXinvSqrt
p7 <- p1 %>%
add_trace(
x = c(0,2),
y = c(0,0),
z = c(0,1),
mode = "lines",
line = list(width = 5, color = "darkgreen"),
type="scatter3d",
name = "X3")
p7
```
```{r}
p8 <- p7 %>%
add_trace(
x = c(0,U[1,1]),
y = c(0,U[2,1]),
z = c(0,U[3,1]),
mode = "lines",
line = list(width = 5, color = "blue"),
type="scatter3d",
name = "U1") %>%
add_trace(
x = c(0,U[1,2]),
y = c(0,U[2,2]),
z = c(0,U[3,2]),
mode = "lines",
line = list(width = 5, color = "blue"),
type="scatter3d",
name = "U2") %>%