## Development of a linear regression model for weather data
In order to model T2 and D2 we use linear regression. Linear regression is a modelling technique in which we find a linear relationship between a scalar dependent variable, called the response or predictand, and one or more explanatory variables called predictors.
Let us take a statistical dataset which consists of $(y, x_1, x_2, \ldots, x_n)$, where $y$ is the response or predictand variable and $x_1, x_2, \ldots, x_n$ are the predictors. We can express a linear relationship between the predictand and the predictors as
$\hat{y} \,=\, \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n$, where $\hat{y}$ is the approximation to the predictand $y$.
In the linear regression model our aim is to find the $\beta$ coefficients so that the squared error $(y - \hat{y})^2$, summed over all observations, is minimized.
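Writing the predictors of the $i$-th observation as $x_{i1}, \ldots, x_{in}$, the least-squares estimate can be stated as the minimizer of the sum of squared errors over all $N$ observations (a standard result, quoted here for reference):
$$ \hat{\beta} \,=\, \underset{\beta}{\arg\min} \sum_{i=1}^{N} \left( y_i - \beta_1 x_{i1} - \ldots - \beta_n x_{in} \right)^2 $$
In matrix form with design matrix $X$ this has the closed-form solution $\hat{\beta} = (X^\top X)^{-1} X^\top y$, provided $X^\top X$ is invertible.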
In our assignment, we find the relationship between T2Obs and the variables given as the output of the ECMWF numerical prediction model. A similar regression model relating D2Obs to the ECMWF outputs is also found.
The weather data vary widely between the seasons, and the ECMWF forecast quality depends strongly on the forecast period. For the analysis, however, we take only one dataset at a time: here the winter data for the one-day forecast period (the summer data for the one-day forecast period is treated in the same way). We predict T2 and D2 on this data using our model. In the later part of the assignment we use the model to predict T2 and D2 on the whole Kumpula weather station data, covering 4 seasons and 64 forecast periods.
### Load the data for one-day forecast for the winter season
We load the data from the file "data/station2998_T2_D2_Obs_model_data_with_timestamps.csv" and select the data for the one-day forecast for the winter season, using the analysis time 00:00 UTC. We then randomly split the data into training and test sets. The dataset is named df_winter_00_08.
The dataset df_winter_00_08 has 573 observations of 32 variables. The first five columns are station_id, season, analysis_time, forecast_period and timestamp, which we call collectively the station info. The 6th column is T2Obs and the 32nd column is D2Obs. Columns 7 to 31 contain the model data from the ECMWF numerical model.
```{r, warning=FALSE, message=F, echo=TRUE}
# loading the data from the file
T2_D2_Obs_model_data_with_timestamps <- read.table(
  file = "data/station2998_T2_D2_Obs_model_data_with_timestamps.csv",
  sep = ",", header = TRUE)
# select the one-day forecast (forecast_period == 8), analysis time 00 UTC (== 1), winter season (== 1)
df_winter_00_08 <- T2_D2_Obs_model_data_with_timestamps[
  which(T2_D2_Obs_model_data_with_timestamps$forecast_period == 8 &
        T2_D2_Obs_model_data_with_timestamps$analysis_time == 1 &
        T2_D2_Obs_model_data_with_timestamps$season == 1), ]
dim(df_winter_00_08)
str(df_winter_00_08)
```
### Splitting the training and testing data.
From df_winter_00_08 we create two datasets, one with T2Obs and the model data and the other with D2Obs and the model data. The model data is the same in both, and the response variables T2Obs and D2Obs are in the first column of the T2Obs dataset and the D2Obs dataset, respectively. We then split the data into train and test sets. The train data has 430 observations of 26 variables and the test data has 143 observations of 26 variables. Notice that the station info columns are removed, as those variables do not influence the linear regression model.
```{r, warning=FALSE, message=F, echo=TRUE}
# Splitting into train and test data
set.seed(123)
n <- nrow(df_winter_00_08)
train = sample(1:n, size = round(0.75*n), replace=FALSE)
T2_winter_00_08_train <- df_winter_00_08[train,6:(ncol(df_winter_00_08) - 1)]
T2_winter_00_08_test <- df_winter_00_08[-train,6:(ncol(df_winter_00_08) - 1)]
D2_winter_00_08_train <- cbind(df_winter_00_08[train,ncol(df_winter_00_08)],
df_winter_00_08[train,7:(ncol(df_winter_00_08) - 1)])
colnames(D2_winter_00_08_train)[1] <- "D2Obs"
D2_winter_00_08_test <- cbind(df_winter_00_08[-train,ncol(df_winter_00_08)],
df_winter_00_08[-train,7:(ncol(df_winter_00_08) - 1)])
colnames(D2_winter_00_08_test)[1] <- "D2Obs"
dim(T2_winter_00_08_train)
dim(T2_winter_00_08_test)
dim(D2_winter_00_08_train)
dim(D2_winter_00_08_test)
str(T2_winter_00_08_train)
str(D2_winter_00_08_train)
```
### Correlation matrix
Here we compute the correlation matrix for the variables of the dataset df_winter_00_08.
```{r, warning=FALSE, message=F, echo=TRUE}
library(corrplot); library(dplyr)
cor(as.matrix(df_winter_00_08[,6:ncol(df_winter_00_08)]))
```
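Since corrplot is already loaded, the full matrix can also be visualised at a glance; a minimal sketch (optional, not used in the analysis below):
```{r, warning=FALSE, message=F, echo=TRUE, eval=FALSE}
# visualize the correlation matrix as a colour map (sketch)
cor_mat <- cor(as.matrix(df_winter_00_08[, 6:ncol(df_winter_00_08)]))
corrplot(cor_mat, method = "color", type = "upper", tl.cex = 0.6)
```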
## Linear regression model for T2
We use the correlation matrix and the scatter plots to find out the predictor variables for our model which predicts T2.
### Correlation coefficients and scatter plots of the variables of the dataset df_winter_00_08 with T2Obs
We study and plot the relationship between T2Obs and the ECMWF model variables.
The correlation coefficients of the variables of the dataset df_winter_00_08 with T2Obs are given below.
<pre><code>
T2Obs MSL T2 D2 U10 V10 TP LCC
T2Obs 1.00000000 -0.413811966 0.94819952 0.9283360 0.39888861 0.40988014 0.12437463 -0.03288617
MCC HCC SKT RH_950 T_950 RH_925 T_925 RH_850
T2Obs 0.145251318 0.23019018 0.8863533 0.18680122 0.88894005 0.216119199 0.853054151 0.167963469
T_850 RH_700 T_700 RH_500 T_500 T2_M1 T_950_M1 T_925_M1
T2Obs 0.769399085 0.04118724 0.67602135 0.09403511 0.587334852 0.94419536 0.88398440 0.85172114
DECLINATION D2_M1 D2Obs
T2Obs 0.45237252 0.92448408 0.9419332
</code></pre>
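The row of correlations with T2Obs shown above can be reproduced and ranked directly from the correlation matrix; a minimal sketch:
```{r, warning=FALSE, message=F, echo=TRUE, eval=FALSE}
# correlations of all model variables with T2Obs, sorted by absolute value (sketch)
cor_mat <- cor(as.matrix(df_winter_00_08[, 6:ncol(df_winter_00_08)]))
sort(abs(cor_mat["T2Obs", ]), decreasing = TRUE)
```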
We can observe from the scatter plots that all "temperature variables" are linearly correlated with T2Obs. Please zoom into the figure to see the details.
```{r, warning=FALSE, message=F, echo=TRUE}
library(ggplot2)
par(mfrow = c(4, 7))
for (i in 7:31) {
  plot(df_winter_00_08[, i], df_winter_00_08[, 6],
       ylab = "T2Obs", xlab = colnames(df_winter_00_08)[i])
}
```
#### First linear regression model for T2
For our first model, we take all the variables which have a positive correlation higher than 0.8 with T2Obs and all the variables which are negatively correlated with T2Obs.
All the "temperature variables" are linearly correlated with T2Obs but some have correlation coefficients less than 0.8. We take all the temperature variables that have higher correlation coefficients than 0.8 with T2Obs. So we use the temperature variables T2, T_950, T_925, T2_M1, T2_950_M1, T2_925_M1 and SKT in our model. D2 and D2_M1 are also have high correlation coefficient with T2Obs and are linearly distributed with T2Obs, we include D2 in our model.
The RH variables appear randomly scattered against T2Obs and have correlations below 0.8, so we exclude them from our model. Even though DECLINATION is linearly related to T2Obs, its correlation coefficient is below 0.5, so we exclude it as well. TP is not linearly related to T2Obs and has a very small correlation coefficient, so we exclude it from the model.
From the scatter plots we can see that the cloud cover variables LCC, MCC and HCC are randomly scattered against T2Obs and have low correlation coefficients. As LCC has a negative correlation with T2Obs, we add LCC to our list of predictors. As MSL also has a negative correlation coefficient, we include it as well. The wind velocity variables U10 and V10 are randomly scattered and have low correlation coefficients, so we exclude them from our model.
Based on the above hypothesis, the formula for our initial model is
**T2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + T_950_M1 + T_925_M1 + D2_M1 + SKT + LCC + MSL**
We use the R function "lm" to carry out the linear regression model.
The summary of our first model, my.model1.T2, shows that it has a residual standard error of 1.428, an F-statistic of 506.9 and a p-value < 2.2e-16.
The p-value gives the probability of obtaining an F-statistic at least as large as the one observed if the null hypothesis were true. Under the null hypothesis, the model is an intercept-only model and there is no linear relationship between the predictand and the predictors. In our case the p-value is very small, so we reject the null hypothesis and conclude that there is a linear relationship between the predictand and the predictors.
The Fisher value or F-statistic is the ratio of explained to unexplained variance and has a value near 1 if the null hypothesis is true.
The F-statistic is defined as F = MSM (Mean of Squares for the Model) / MSE (Mean of Squares for Error). In our case the F-statistic is 506.9, so our model rejects the null hypothesis.
The t value is the value of the t-statistic for testing whether the corresponding regression coefficient is different from 0. As a rule of thumb, a t value around or greater than 2 in absolute value is good.
The Pr value is the p-value of the hypothesis test for which the t value is the test statistic. It gives the probability of obtaining a test statistic at least as extreme as the observed t value if the null hypothesis were true. A low Pr value for a variable indicates a linear relationship between that variable and the predictand.
The significance codes order the variables by p-value: the lower the p-value, the higher the significance.
Based on the F-statistic and p value we see that our model is a good one.
```{r, warning=FALSE, message=F, echo=TRUE}
my.model1.T2 <- lm(T2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + T_950_M1 + T_925_M1 + D2_M1 + SKT + LCC + MSL , data = T2_winter_00_08_train)
summary(my.model1.T2)
```
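The quantities discussed above (residual standard error, F-statistic, t values and Pr values) can also be read programmatically from the summary object; a minimal sketch assuming my.model1.T2 from the chunk above:
```{r, warning=FALSE, message=F, echo=TRUE, eval=FALSE}
s1 <- summary(my.model1.T2)
s1$sigma                                      # residual standard error
s1$fstatistic                                 # F-statistic and its degrees of freedom
s1$coefficients[, c("t value", "Pr(>|t|)")]   # per-variable t and Pr values
```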
#### Second linear regression model for T2
In this step we try to see whether we can improve our first model. Based on the t and Pr values of the variables in the first model, we keep only the variables with a Pr value less than 0.5. This gives a new formula for the linear regression:
**T2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + D2_M1 + SKT + LCC**
The summary of the model shows that the F-statistic has increased from 506.9 to 698. The p-value remains at the same low value and the residual standard error is almost the same (1.428 for the first model, 1.426 for the second).
```{r, warning=FALSE, message=F, echo=TRUE}
my.model2.T2 <- lm(T2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + D2_M1 + SKT + LCC , data = T2_winter_00_08_train)
summary(my.model2.T2)
```
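The screening by Pr value described above can also be expressed in code; a minimal sketch that rebuilds a formula from the first model's coefficient table (an illustration only, the threshold 0.5 follows the text):
```{r, warning=FALSE, message=F, echo=TRUE, eval=FALSE}
# keep predictors of the first model whose Pr(>|t|) is below 0.5 (sketch)
pvals <- summary(my.model1.T2)$coefficients[-1, "Pr(>|t|)"]  # drop the intercept row
keep  <- names(pvals)[pvals < 0.5]
reformulate(keep, response = "T2Obs")
```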
#### Third linear regression model for T2
We try to see if we can improve the model still further. Based on the t and Pr values of the variables in the second model, we keep only the variables with a Pr value less than 0.1. This gives a new formula for the linear regression:
**T2Obs ~ T2 + D2 + T_950 + T2_M1 + LCC**
The summary of the model shows that the F-statistic has increased further, to 1114. The p-value remains at the same low value and the residual standard error is about the same as in the first model.
```{r, warning=FALSE, message=F, echo=TRUE}
my.model3.T2 <- lm(T2Obs ~ T2 + D2 + T_950 + T2_M1 + LCC , data = T2_winter_00_08_train)
summary(my.model3.T2)
```
### The selected linear model for T2
From the above analysis we see that we do not gain much in terms of residual error and p-value when we move from model 1 to model 3.
Since our analysis is based only on the winter data, we cannot strongly state that the five variables of the third model are the only ones that significantly affect T2. As a compromise we select the second model with eight variables, which has a higher F value than the first model.
The linear regression formula for our model is taken as
**T2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + D2_M1 + SKT + LCC**
```{r, warning=FALSE, message=F, echo=TRUE}
my.model.T2 <- lm(T2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + D2_M1 + SKT + LCC , data = T2_winter_00_08_train)
summary(my.model.T2)
```
### Predicting T2 with my model
We now see the predicting ability of our model by using it to predict T2 for the test data T2_winter_00_08_test.
The results show that the root mean square error (rmse) is 1.426.
Our linear regression model for T2 is
<pre><code>
T2 = 9.07883695 + 0.26398396 * T2 (of the ECMWF model) + 0.45513159 * D2 (of the ECMWF model) + 0.18812205 * T_950 - 0.07796135 * T_925 + 0.30452258 * T2_M1 - 0.11617459 * D2_M1 - 0.04322546 * SKT - 1.35912603 * LCC
</code></pre>
```{r, warning=FALSE, message=F, echo=TRUE}
# calculating the fit
fitted.values <- predict(object = my.model.T2, newdata = T2_winter_00_08_test)
T2pred.mymodel <- fitted.values
fitted.values <- as.numeric(fitted.values)
residuals <- as.numeric(T2_winter_00_08_test[,1] - fitted.values)
coefficients <- my.model.T2$coefficients
rmse.mymodel.T2 <- sqrt(mean(residuals^2))
coefficients
rmse.mymodel.T2
```
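A quick visual check of the fit is to plot the predictions against the observations on the test set; a minimal sketch using T2pred.mymodel from the chunk above:
```{r, warning=FALSE, message=F, echo=TRUE, eval=FALSE}
# predicted vs observed T2 on the test set (sketch)
plot(T2_winter_00_08_test$T2Obs, T2pred.mymodel,
     xlab = "T2Obs (test)", ylab = "T2 predicted by my.model.T2")
abline(0, 1, col = "red")  # perfect-prediction line
```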
## Linear regression model for D2
We use the correlation matrix and the scatter plots to find out the predictor variables for our model for D2.
### Correlation and Scatter plots of the variables of the dataset df_winter_00_08 with D2Obs
We study and plot the relationship between D2Obs and the ECMWF model variables. The correlation coefficients of the variables of the dataset df_winter_00_08 with D2Obs are given below.
<pre><code>
T2Obs MSL T2 D2 U10 V10 TP LCC
D2Obs 0.94193317 -0.455461577 0.93054344 0.9664760 0.38240272 0.47678192 0.20345976 0.14652865
MCC HCC SKT RH_950 T_950 RH_925 T_925 RH_850
D2Obs 0.195814995 0.25890049 0.8969396 0.36156717 0.85307293 0.361084178 0.826701929 0.268052174
T_850 RH_700 T_700 RH_500 T_500 T2_M1 T_950_M1 T_925_M1
D2Obs 0.765794084 0.10873350 0.67134133 0.13140822 0.564374564 0.90946289 0.83419344 0.81248000
DECLINATION D2_M1 D2Obs
D2Obs 0.33219331 0.95292046 1.0000000
</code></pre>
We can observe from the scatter plots that all "temperature variables" are linearly correlated with D2Obs. Please zoom into the figure to see the details.
```{r, warning=FALSE, message=F, echo=TRUE}
library(ggplot2)
par(mfrow = c(4, 7))
for (i in 7:31) {
  plot(df_winter_00_08[, i], df_winter_00_08[, 32],
       ylab = "D2Obs", xlab = colnames(df_winter_00_08)[i])
}
```
#### First linear regression model for D2
Similar to the modelling of T2Obs, for our first model, we take all the variables which have a positive correlation higher than 0.8 with D2Obs and all the variables which are negatively correlated with D2Obs.
All the "temperature variables" are linearly correlated with D2Obs but some have correlation coefficients less than 0.8. We take all the temperature variables that have higher correlation coefficients than 0.8 with D2Obs. So we use the temperature variables T2, T_950, T_925, T2_M1, T2_950_M1, T2_925_M1 and SKT in our model. D2 and D2_M1 are also have high correlation coefficient with D2Obs and are linearly distributed with T2Obs, we include D2 in our model.
The RH variables appear randomly scattered against D2Obs and have correlations below 0.8, so we exclude them from our model. Even though DECLINATION is linearly related to D2Obs, its correlation coefficient is below 0.5, so we exclude it as well.
The cloud cover variables LCC, MCC and HCC are randomly scattered against D2Obs and have low correlation coefficients, so we exclude them from our model. As MSL has a negative correlation coefficient, we include it in our list of predictors. The wind velocity variables U10 and V10 are randomly scattered and have low correlation coefficients.
From a layman's point of view, however, we have a feeling that D2 may be related to LCC, TP, RH_950 and V10, so we add them even though they are not linearly related to D2Obs and have small correlation coefficients.
Based on the above hypothesis, the formula for our initial model is
**D2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + T_950_M1 + T_925_M1 + D2_M1 + SKT + MSL + TP + RH_950 + V10 + LCC**
The summary of our first model, my_model1_D2, shows that it has a residual standard error of 1.447, an F-statistic of 486.7 and a p-value < 2.2e-16.
As the p-value is very small, we reject the null hypothesis: there is a linear relationship between the predictand and the predictors.
```{r, warning=FALSE, message=F, echo=TRUE}
my_model1_D2 <- lm(D2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + T_950_M1 + T_925_M1 + D2_M1 + SKT + MSL + TP + RH_950 + V10 + LCC, data = D2_winter_00_08_train)
summary(my_model1_D2)
```
#### Second linear regression model for D2
In this step we try to see whether we can improve our first model. Based on the t and Pr values of the variables in the first model, we keep the variables with a Pr value less than 0.5. Even though T2 has a Pr value greater than 0.5, we keep T2 in the second model as well because we believe that T2 and D2 are related. As we saw in the quantile plot of the winter data, the ECMWF model T2 deviates widely from the observed T2Obs, which may explain the high Pr value for T2. This gives a new formula for the linear regression:
**D2Obs ~ T2 + D2 + T_950 + T2_M1 + T_950_M1 + D2_M1 + SKT + MSL + TP + RH_950 + V10 + LCC**
The summary of the model shows that the F-statistic has increased from 486.7 to 569.9. The p-value remains at the same low value and the residual standard error is almost the same (1.447 for the first model, 1.444 for the second).
```{r, warning=FALSE, message=F, echo=TRUE}
my_model2_D2 <- lm(D2Obs ~ T2 + D2 + T_950 + T2_M1 + T_950_M1 + D2_M1 + SKT + MSL + TP + RH_950 + V10 + LCC, data = D2_winter_00_08_train)
summary(my_model2_D2)
```
#### Third linear regression model for D2
In this step we try to see whether we can improve our second model. Based on the t and Pr values of the variables in the second model, we keep only the variables with a Pr value around 0.1 or less. Even though T2 has a Pr value greater than 0.5, we keep T2 in the third model as well because we believe that T2 and D2 are related. This gives a new formula for the linear regression:
**D2Obs ~ T2 + D2 + T_950 + T_950_M1 + D2_M1 + TP + RH_950 + V10 + LCC**
The summary of the model shows that the F-statistic has increased from 569.9 to 756.7. The p-value remains at the same low value and the residual standard error is almost the same (1.444 for the second model, 1.446 for the third).
```{r, warning=FALSE, message=F, echo=TRUE}
my_model3_D2 <- lm(D2Obs ~ T2 + D2 + T_950 + T_950_M1 + D2_M1 + TP + RH_950 + V10 + LCC, data = D2_winter_00_08_train)
summary(my_model3_D2)
```
### The selected linear model for D2
From the above analysis we see that we do not gain much in terms of residual error and p-value when we move from model 1 to model 3.
As model 3 has nine variables with quite good Pr values, we select the third model, which also has a higher F value than the other two models.
The linear regression formula for our model is taken as
**D2Obs ~ T2 + D2 + T_950 + T_950_M1 + D2_M1 + TP + RH_950 + V10 + LCC**
```{r, warning=FALSE, message=F, echo=TRUE}
my.model.D2 <- lm(D2Obs ~ T2 + D2 + T_950 + T_950_M1 + D2_M1 + TP + RH_950 + V10 + LCC, data = D2_winter_00_08_train)
summary(my.model.D2)
```
### Predicting D2 with my.model.D2
We now see the predicting ability of our model by using it to predict D2 for the test data D2_winter_00_08_test.
The results show that the root mean square error (rmse) is 1.351.
Our linear regression model for D2 is
<pre><code>
D2 = 2.472731 - 0.063511 * T2 (of the ECMWF model) + 0.564682 * D2 (of the ECMWF model) + 0.364282 * T_950 - 0.148566 * T_950_M1 + 0.265225 * D2_M1 - 0.603928 * TP + 0.035618 * RH_950 + 0.082966 * V10 - 0.629401 * LCC
</code></pre>
```{r, warning=FALSE, message=F, echo=TRUE}
# calculating the fit
fitted.values <- predict(object = my.model.D2, newdata = D2_winter_00_08_test)
D2pred.mymodel <- fitted.values
fitted.values <- as.numeric(fitted.values)
residuals <- as.numeric(D2_winter_00_08_test[,1] - fitted.values)
coefficients <- my.model.D2$coefficients
rmse.mymodel.D2 <- sqrt(mean(residuals^2))
coefficients
rmse.mymodel.D2
```
## Comparing my.model.T2 and my.model.D2 with standard linear regression models
In order to analyze my.model, we compare it with models produced by three other linear regression models.
The first one is known as the "lm model" which uses R function lm with all the variables of the dataset.
The second is the "lmstep" model, which uses the R "step" algorithm. This algorithm develops a sequence of linear models, removing or adding a variable at each step. The model is updated when the new candidate has a lower Akaike Information Criterion (AIC) than the current model. AIC is defined as
$AIC \,=\, N \, \log\left(\frac{RSS}{N}\right) + 2k$, where $N$ is the number of observations, $RSS$ is the residual sum of squares and $k$ is the number of fitted parameters (degrees of freedom) of the model.
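This is the AIC scale used by R's step function for linear models (up to an additive constant); a minimal sketch of how it could be computed by hand for a fitted lm object, as a cross-check:
```{r, warning=FALSE, message=F, echo=TRUE, eval=FALSE}
# AIC as used by step(), computed by hand for a fitted lm (sketch)
aic_by_hand <- function(fit) {
  n   <- length(residuals(fit))
  rss <- sum(residuals(fit)^2)
  k   <- length(coefficients(fit))   # fitted parameters, including the intercept
  n * log(rss / n) + 2 * k
}
# should agree with extractAIC(fit)[2] for a model fitted with lm()
```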
The third method is glmcv, a Gaussian linear regression model with cross validation. Here the training dataset is divided into K equal parts, and one part at a time is set aside as validation data. The idea behind K-fold cross validation is that the linear regression model is trained on the remaining K-1 parts, and the prediction error is computed on the part that was set aside. In the next iteration the previously set-aside part rejoins the training data and a new part is set aside. The iterations continue in this fashion, each time using K-1 parts for training and one part for validation. cv.glm averages the prediction errors over the K folds, giving an estimate of the out-of-sample error of the fitted model.
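The mechanics of K-fold cross validation can be spelled out explicitly; a minimal sketch for the T2 training data (an illustration only, the analysis below relies on cv.glm from the boot package):
```{r, warning=FALSE, message=F, echo=TRUE, eval=FALSE}
# explicit K-fold cross validation for a linear model (sketch)
set.seed(123)
K     <- 4
folds <- sample(rep(1:K, length.out = nrow(T2_winter_00_08_train)))
cv_err <- numeric(K)
for (k in 1:K) {
  fit       <- lm(T2Obs ~ ., data = T2_winter_00_08_train[folds != k, ])
  pred      <- predict(fit, newdata = T2_winter_00_08_train[folds == k, ])
  cv_err[k] <- mean((T2_winter_00_08_train$T2Obs[folds == k] - pred)^2)
}
mean(cv_err)  # average squared prediction error over the K folds
```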
We then use the models of lm, lmstep and glmcv to predict T2 and D2 using the test data. We use the root mean square error (RMSE) to compare the different models.
### Modelling with lm for T2
For the lm model we use all the model variables. With this model we get Residual standard error: 1.413, F-statistic: 228.9 and p-value: < 2.2e-16
```{r, warning=FALSE, message=F, echo=TRUE}
lm.model.T2 <- lm(T2Obs ~ ., data = T2_winter_00_08_train)
summary(lm.model.T2)
```
### Predicting T2 with the lm.model.T2
We now see the predicting ability of lm.model.T2 by using it to predict T2 for the test data T2_winter_00_08_test.
The results show that the root mean square error (rmse) is 1.411888.
```{r, warning=FALSE, message=F, echo=TRUE}
# calculating the fit
fitted.values <- predict(object = lm.model.T2, newdata = T2_winter_00_08_test)
T2pred.lm <- fitted.values
fitted.values <- as.numeric(fitted.values)
residuals <- as.numeric(T2_winter_00_08_test[,1] - fitted.values)
coefficients <- lm.model.T2$coefficients
rmse.lm.T2 <- sqrt(mean(residuals^2))
coefficients
rmse.lm.T2
```
### Modelling with lm for D2
For the lm model we use all the model variables. With this model we get Residual standard error: 1.438, F-statistic: 276.2 and p-value: < 2.2e-16
```{r, warning=FALSE, message=F, echo=TRUE}
lm.model.D2 <- lm(D2Obs ~ ., data = D2_winter_00_08_train)
summary(lm.model.D2)
```
### Predicting D2 with the lm.model.D2
We now see the predicting ability of lm.model.D2 by using it to predict D2 for the test data D2_winter_00_08_test.
The results show that the root mean square error (rmse) is 1.305783.
```{r, warning=FALSE, message=F, echo=TRUE}
# calculating the fit
fitted.values <- predict(object = lm.model.D2, newdata = D2_winter_00_08_test)
D2pred.lm <- fitted.values
fitted.values <- as.numeric(fitted.values)
residuals <- as.numeric(D2_winter_00_08_test[,1] - fitted.values)
coefficients <- lm.model.D2$coefficients
rmse.lm.D2 <- sqrt(mean(residuals^2))
coefficients
rmse.lm.D2
```
### Modelling T2 with lmstep
As explained earlier, this algorithm develops a sequence of linear models, removing or adding a variable at each step. The step model starts from an initial formula, the object.formula. We can also define lower and upper formulas, and the algorithm can move backward or forward from the object formula towards the lower or upper formula, in at most the number of steps given by the steps argument. Here we define the object.formula as the intercept-only formula (no variables present), i.e. T2Obs ~ 1. The lower.formula is the same as the object.formula, and the upper.formula is the formula with all the variables, i.e. T2Obs ~ . We use only the forward direction.
With this model we get Residual standard error: 1.397, F-statistic: 531.2 and p-value: < 2.2e-16
```{r, warning=FALSE, message=F, echo=TRUE}
# generating the linear models
response <- colnames(T2_winter_00_08_train)[1]
object.formula <- as.formula(paste(response, " ~ 1", sep = ""))
upper.formula <- as.formula(paste(response, " ~ .", sep = ""))
lower.formula <- as.formula(paste(response, " ~ 1", sep = ""))
object.lm <- lm(object.formula, data = T2_winter_00_08_train)
upper.lm <- lm(upper.formula, data = T2_winter_00_08_train)
lower.lm <- lm(lower.formula, data = T2_winter_00_08_train)
direction <- "forward"
steps <- 1000
# choosing the best model
step.model.T2 <- step(object = object.lm,
scope = list(upper = upper.lm, lower = lower.lm),
direction = direction, trace = 0 , steps = steps)
step.model.T2
summary(step.model.T2)
```
### Predicting T2 with the lmstep model
We now see the predicting ability of lm.step.T2 by using it to predict T2 for the test data T2_winter_00_08_test.
The results show that the root mean square error (rmse) is 1.422109.
```{r, warning=FALSE, message=F, echo=TRUE}
# calculating the fit
fitted.values <- predict(object = step.model.T2, newdata = T2_winter_00_08_test)
T2pred.lmstep <- fitted.values
fitted.values <- as.numeric(fitted.values)
residuals <- as.numeric(T2_winter_00_08_test[,1] - fitted.values)
coefficients <- step.model.T2$coefficients
rmse.lmstep.T2 <- sqrt(mean(residuals^2))
coefficients
rmse.lmstep.T2
```
### Modelling D2 with the lmstep model
Similar to the T2 model, we define the object.formula as the intercept-only formula (no variables present), i.e. D2Obs ~ 1. The lower.formula is the same as the object.formula, and the upper.formula is the formula with all the variables, i.e. D2Obs ~ . We use only the forward direction.
With this model we get Residual standard error: 1.428, F-statistic: 777.9 and p-value: < 2.2e-16.
Notice that T2 is not present in the final formula which gave the best AIC.
```{r, warning=FALSE, message=F, echo=TRUE}
# generating the linear models
response <- colnames(D2_winter_00_08_train)[1]
object.formula <- as.formula(paste(response, " ~ 1", sep = ""))
upper.formula <- as.formula(paste(response, " ~ .", sep = ""))
lower.formula <- as.formula(paste(response, " ~ 1", sep = ""))
object.lm <- lm(object.formula, data = D2_winter_00_08_train)
upper.lm <- lm(upper.formula, data = D2_winter_00_08_train)
lower.lm <- lm(lower.formula, data = D2_winter_00_08_train)
direction <- "forward"
steps <- 1000
# choosing the best model
step.model.D2 <- step(object = object.lm,
scope = list(upper = upper.lm, lower = lower.lm),
direction = direction, trace = 0 , steps = steps)
summary(step.model.D2)
```
### Predicting D2 with the lmstep model
We now see the predicting ability of lm.step.D2 by using it to predict D2 for the test data D2_winter_00_08_test.
The results show that the root mean square error (rmse) is 1.334799.
```{r, warning=FALSE, message=F, echo=TRUE}
# calculating the fit
fitted.values <- predict(object = step.model.D2, newdata = D2_winter_00_08_test)
D2pred.lmstep <- fitted.values
fitted.values <- as.numeric(fitted.values)
residuals <- as.numeric(D2_winter_00_08_test[,1] - fitted.values)
coefficients <- step.model.D2$coefficients
rmse.lmstep.D2 <- sqrt(mean(residuals^2))
coefficients
rmse.lmstep.D2
```
### Modelling T2 with the glm function and cross validation (cv.glm)
We use the glm function together with the cv.glm function from the boot package. With K = 4 folds, roughly three quarters of the training observations (around 320) are used for fitting and the remaining quarter (around 110) for cross validation in each fold. The cross validation error is smallest for the first iteration.
```{r, warning=FALSE, message=F, echo=TRUE}
library(boot)
set.seed(123)
cv.error.10 <- rep(0, 10)
for (i in 1:10) {
  # refit the full model and estimate its 4-fold cross-validation error;
  # the folds are re-drawn at random on every iteration
  glm.fit.T2 <- glm(T2Obs ~ ., data = T2_winter_00_08_train)
  cv.error.10[i] <- cv.glm(T2_winter_00_08_train, glm.fit.T2, K = 4)$delta[1]
}
cv.error.10
summary(glm.fit.T2)
```
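As an aside, cv.glm returns two error estimates in its delta component, the raw K-fold estimate and a bias-adjusted one; a short sketch of inspecting both (assuming glm.fit.T2 from the chunk above):
```{r, warning=FALSE, message=F, echo=TRUE, eval=FALSE}
# inspect both components of the cv.glm error estimate (sketch)
cv_res <- cv.glm(T2_winter_00_08_train, glm.fit.T2, K = 4)
cv_res$delta  # [1] raw K-fold estimate, [2] bias-adjusted estimate
```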
### Predicting T2 with the glm model
We now see the predicting ability of glm.fit.T2 by using it to predict T2 for the test data T2_winter_00_08_test.
The results show that the root mean square error (rmse) is 1.411888.
```{r, warning=FALSE, message=F, echo=TRUE}
# calculating the fit
fitted.values <- predict(object = glm.fit.T2, newdata = T2_winter_00_08_test)
T2pred.glmcv <- fitted.values
fitted.values <- as.numeric(fitted.values)
residuals <- as.numeric(T2_winter_00_08_test[,1] - fitted.values)
coefficients <- glm.fit.T2$coefficients
rmse.glmcv.T2 <- sqrt(mean(residuals^2))
coefficients
rmse.glmcv.T2
```
### Modelling D2 with the glm function and cross validation (cv.glm)
Similar to the T2 modelling with glm, we carry out the modelling for D2.
```{r, warning=FALSE, message=F, echo=TRUE}
library(boot)
set.seed(123)
cv.error.10 <- rep(0, 10)
for (i in 1:10) {
  # refit the full model and estimate its 5-fold cross-validation error
  glm.fit.D2 <- glm(D2Obs ~ ., data = D2_winter_00_08_train)
  cv.error.10[i] <- cv.glm(D2_winter_00_08_train, glm.fit.D2, K = 5)$delta[1]
}
cv.error.10
summary(glm.fit.D2)
```
### Predicting D2 with the glm model
We now see the predicting ability of glm.fit.D2 by using it to predict D2 for the test data D2_winter_00_08_test.
The results show that the root mean square error (rmse) is 1.305783.
```{r, warning=FALSE, message=F, echo=TRUE}
# calculating the fit
fitted.values <- predict(object = glm.fit.D2, newdata = D2_winter_00_08_test)
D2pred.glmcv <- fitted.values
fitted.values <- as.numeric(fitted.values)
residuals <- as.numeric(D2_winter_00_08_test[,1] - fitted.values)
coefficients <- glm.fit.D2$coefficients
rmse.glmcv.D2 <- sqrt(mean(residuals^2))
coefficients
rmse.glmcv.D2
```
## rmse values for my.model, lm, lmstep and glmcv predictions of T2 and D2
We can see that my.model for T2 and D2 has rmse values comparable to those obtained using lm, lmstep and glmcv.
```{r, warning=FALSE, message=F, echo=TRUE}
cat("rmse mymodel T2=", rmse.mymodel.T2, "rmse lm T2=", rmse.lm.T2, "rmse lmstep T2=", rmse.lmstep.T2, "rmse glmcv T2=", rmse.glmcv.T2,
"rmse mymodel D2=", rmse.mymodel.D2, "rmse lm D2=", rmse.lm.D2,"rmse lmstep D2=", rmse.lmstep.D2, "rmse glmcv D2=", rmse.glmcv.D2
)
```
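For easier comparison, the same values can be collected into a small table; a minimal sketch:
```{r, warning=FALSE, message=F, echo=TRUE, eval=FALSE}
# collect the test-set rmse values of all models into one data frame (sketch)
rmse_table <- data.frame(
  model = c("mymodel", "lm", "lmstep", "glmcv"),
  T2    = c(rmse.mymodel.T2, rmse.lm.T2, rmse.lmstep.T2, rmse.glmcv.T2),
  D2    = c(rmse.mymodel.D2, rmse.lm.D2, rmse.lmstep.D2, rmse.glmcv.D2)
)
rmse_table
```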
## The diagnostic plots
We plot the following graphs Residuals vs Fitted values, Normal QQ-plot, Standardized residuals vs Fitted values and Residuals vs Leverage for all the models.
The Residuals vs Fitted values plot shows if the residuals have non-linear patterns. There could be a non-linear relationship between predictor variables and the response variable and the pattern could show up in this plot if the model is not able to capture the non-linear relationship. Ideally there should not be any pattern and the residuals should be randomly spread around a horizontal line drawn at 0 on the y-axis. A red line in the graph helps to show the trend in the pattern, its deviation from the horizontal line at 0.
The second graph, the normal QQ-plot shows if the residuals are normally distributed. If the residuals follow the straight dashed line, the residuals are normally distributed.
The third graph, Standardized residuals vs Fitted values (the scale-location plot), shows whether the residuals are spread equally along the range of fitted values, i.e. whether the assumption of equal variance holds.
The fourth graph, Residuals vs Leverage, helps to find influential extreme values or outliers. The data may contain extreme values, but we can consider them non-influential if the fitted linear regression model does not change much whether we include or exclude them from the analysis. The red dotted line in the graph shows the Cook's distance. If all the points in the leverage graph are well within the Cook's distance, then there are no outliers that influence the linear regression.
We plot the diagnostic plots for our linear regression models: mymodel, lm, lmstep and glmcv. For mymodel, the Residuals vs Fitted values plot shows that most of the residuals lie close to the zero line. The normal QQ plot is close to linear and the leverage plot shows no influential outliers. These graphs indicate that mymodel is a good fit for the weather data. Similar graphs for lm, lmstep and glmcv show that they also produce good linear regression models for the weather data.
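The influence measures behind the Residuals vs Leverage plot can also be inspected numerically; a minimal sketch for my.model.T2 (the 4/n cut-off is just one common rule of thumb, not part of the original analysis):
```{r, warning=FALSE, message=F, echo=TRUE, eval=FALSE}
# numerically inspect influential points for my.model.T2 (sketch)
cd  <- cooks.distance(my.model.T2)
lev <- hatvalues(my.model.T2)
which(cd > 4 / length(cd))   # observations with unusually large Cook's distance
summary(lev)                 # distribution of leverage values
```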
```{r, warning=FALSE, message=F, echo=TRUE}
# We plot the graphs Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage.
# plot.lm indexes the caption list by plot number (1 = Residuals vs Fitted, 2 = Normal Q-Q,
# 3 = Scale-Location, 5 = Residuals vs Leverage), so unused positions are left empty.
diag_captions <- function(label) {
  list(paste(label, "Residuals vs Fitted values"), paste(label, "Normal QQ-plot"),
       paste(label, "Standardized residuals vs Fitted values"), "",
       paste(label, "Residuals vs Leverage"))
}
par(mfrow = c(2, 2))
plot(my.model.T2, which = c(1, 2, 3, 5), caption = diag_captions("T2-mymodel"))
par(mfrow = c(2, 2))
plot(my.model.D2, which = c(1, 2, 5), caption = diag_captions("D2-mymodel"))
par(mfrow = c(2, 2))
plot(lm.model.T2, which = c(1, 2, 5), caption = diag_captions("T2-lm"))
par(mfrow = c(2, 2))
plot(lm.model.D2, which = c(1, 2, 5), caption = diag_captions("D2-lm"))
par(mfrow = c(2, 2))
plot(step.model.T2, which = c(1, 2, 5), caption = diag_captions("T2-lmstep"))
par(mfrow = c(2, 2))
plot(step.model.D2, which = c(1, 2, 5), caption = diag_captions("D2-lmstep"))
par(mfrow = c(2, 2))
plot(glm.fit.T2, which = c(1, 2, 5), caption = diag_captions("T2-glm"))
par(mfrow = c(2, 2))
plot(glm.fit.D2, which = c(1, 2, 5), caption = diag_captions("D2-glm"))
```
## RH predictions
The aim of my assignment was to study the interdependence of the variables T2 and D2. We do this by calculating the relative humidity (RH), which is a function of T2 and D2.
In this section we calculate the RH values using (i) T2Obs and D2Obs, (ii) T2 and D2 from the ECMWF model, and (iii) T2 and D2 obtained from the linear regression models mymodel, lm, lmstep and glmcv.
```{r, warning=FALSE, message=F, echo=TRUE}
T2Obs.test <- T2_winter_00_08_test[,1]
D2Obs.test <- D2_winter_00_08_test[,1]
T2model.test <- T2_winter_00_08_test[,3]
D2model.test <- D2_winter_00_08_test[,4]
# RH calculation with the Magnus formula (the exponential terms are vapour pressures in hPa)
# RH from T2Obs and D2Obs
saturation_humidity_obs <- 6.112*exp((17.62*T2Obs.test)/(T2Obs.test + 243.12))
specific_humidity_obs <- 6.112*exp((17.62*D2Obs.test)/(D2Obs.test + 243.12))
RHObs <- 100*(specific_humidity_obs/saturation_humidity_obs)
# RH from T2 and D2 from the ECMWF model
saturation_humidity_model <- 6.112*exp((17.62*T2model.test)/(T2model.test + 243.12))
specific_humidity_model <- 6.112*exp((17.62*D2model.test)/(D2model.test + 243.12))
RHmodel <- 100*(specific_humidity_model/saturation_humidity_model)
```
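Since the same Magnus-formula expression is reused for every set of predictions below, it could be wrapped in a small helper function; a minimal sketch (the function name rh_from_t2_d2 is ours, not part of the original code):
```{r, warning=FALSE, message=F, echo=TRUE, eval=FALSE}
# relative humidity (%) from 2 m temperature and dew point (deg C), Magnus formula (sketch)
rh_from_t2_d2 <- function(t2, d2) {
  es <- 6.112 * exp((17.62 * t2) / (t2 + 243.12))  # saturation vapour pressure
  e  <- 6.112 * exp((17.62 * d2) / (d2 + 243.12))  # actual vapour pressure
  100 * e / es
}
# e.g. RHObs <- rh_from_t2_d2(T2Obs.test, D2Obs.test)
```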
### T2 and D2 predicted using mymodel, lm, lmstep and glmcv
```{r, warning=FALSE, message=F, echo=TRUE}
# T2 and D2 predicted using mymodel
saturation_humidity_pred_mymodel <- 6.112*exp((17.62*T2pred.mymodel)/(T2pred.mymodel + 243.12))
specific_humidity_pred_mymodel <- 6.112*exp((17.62*D2pred.mymodel)/(D2pred.mymodel + 243.12))
RHpred.mymodel <- 100*(specific_humidity_pred_mymodel/saturation_humidity_pred_mymodel)
# T2 and D2 predicted using lm
saturation_humidity_pred_lm <- 6.112*exp((17.62*T2pred.lm)/(T2pred.lm + 243.12))
specific_humidity_pred_lm <- 6.112*exp((17.62*D2pred.lm)/(D2pred.lm + 243.12))
RHpred.lm <- 100*(specific_humidity_pred_lm/saturation_humidity_pred_lm)
# T2 and D2 predicted using lmstep
saturation_humidity_pred_lmstep <- 6.112*exp((17.62*T2pred.lmstep)/(T2pred.lmstep + 243.12))
specific_humidity_pred_lmstep <- 6.112*exp((17.62*D2pred.lmstep)/(D2pred.lmstep + 243.12))
RHpred.lmstep <- 100*(specific_humidity_pred_lmstep/saturation_humidity_pred_lmstep)
# T2 and D2 predicted using glm cross validation
saturation_humidity_pred_glmcv <- 6.112*exp((17.62*T2pred.glmcv)/(T2pred.glmcv + 243.12))
specific_humidity_pred_glmcv <- 6.112*exp((17.62*D2pred.glmcv)/(D2pred.glmcv + 243.12))
RHpred.glmcv <- 100*(specific_humidity_pred_glmcv/saturation_humidity_pred_glmcv)
```
### The rmse in the RH value calculated using the T2 and D2 observations and model values
We get the following rmse values for the different models
rmse RH ECMWF model = 4.823471, rmse RH mymodel = 2.000381, rmse RH lm = 2.029639, rmse RH lmstep = 1.980861, rmse RH glmcv = 1.983662.
We can see that the ECMWF model has a high rmse in the RH value compared to all the other models. The model we developed, "mymodel", has rmse values for RH comparable to those of the lm, lmstep and glmcv models. This experiment shows that statistical post-processing improves the forecast, as it reduces the rmse compared to the raw model forecast.
Recall that the rmse for T2 and D2 from the linear regressions was around 1.4 for all the models. The rmse for RH is much larger (even double) than that of T2 and D2.
```{r, warning=FALSE, message=F, echo=TRUE}
rmse.RH.model <- sqrt(mean((RHmodel - RHObs)^2))
rmse.RH.mymodel <- sqrt(mean((RHpred.mymodel - RHObs)^2))
rmse.RH.lm <- sqrt(mean((RHpred.lm - RHObs)^2))
rmse.RH.lmstep <- sqrt(mean((RHpred.lmstep - RHObs)^2))
rmse.RH.glmcv <- sqrt(mean((RHpred.glmcv - RHObs)^2))
cat("rmse RH ECMWF model = ", rmse.RH.model, "rmse RM mymodel = ", rmse.RH.mymodel, "rmse RM lm = ", rmse.RH.lm, "rmse RH lmstep = ", rmse.RH.lmstep, "rmse RH glmcv = ", rmse.RH.glmcv)
```