## Development of a linear regression model for weather data
In order to model T2 and D2 we use linear regression. Linear regression is a modelling technique in which we find a linear relationship between a scalar dependent variable, called the response or predictand, and one or more explanatory variables called predictors.
Let us take a statistical dataset which consists of $(y, x_1, x_2, \ldots, x_n)$, where $y$ is the response or predictand variable and $x_1, x_2, \ldots, x_n$ are the predictors. We can express a linear relationship between the predictand and the predictors as
$\hat{y} \,=\, \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n$, where $\hat{y}$ is the approximation to the predictand $y$.
In the linear regression model our aim is to find the $\beta$ coefficients so that the squared error $(y - \hat{y})^2$, summed over all observations, is minimized.
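Writing the predictors of the $i$-th observation as $x_{i1}, \ldots, x_{in}$, the least-squares estimate can be stated as the minimizer of the sum of squared errors over all $N$ observations (a standard result, quoted here for reference):
$$ \hat{\beta} \,=\, \underset{\beta}{\arg\min} \sum_{i=1}^{N} \left( y_i - \beta_1 x_{i1} - \ldots - \beta_n x_{in} \right)^2 $$
In matrix form with design matrix $X$ this has the closed-form solution $\hat{\beta} = (X^\top X)^{-1} X^\top y$, provided $X^\top X$ is invertible.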
In our assignment, we find the relationship between T2Obs and the variables given as the output of the ECMWF numerical prediction model. A similar regression model relating D2Obs to the ECMWF outputs is also found.
The weather data vary widely between the seasons, and the ECMWF forecast quality depends strongly on the forecast period. For the analysis, however, we take only one dataset at a time: here the winter data for the one-day forecast period (the summer data for the one-day forecast period is treated in the same way). We predict T2 and D2 on this data using our model. In the later part of the assignment we use the model to predict T2 and D2 on the whole Kumpula weather station data, covering 4 seasons and 64 forecast periods.
### Load the data for one-day forecast for the winter season
We load the data from the file "data/station2998_T2_D2_Obs_model_data_with_timestamps.csv" and select the data for the one-day forecast for the winter season, using the analysis time 00:00 UTC. We then randomly split the data into training and test sets. The dataset is named df_winter_00_08.
The dataset df_winter_00_08 has 573 observations of 32 variables. The first five columns are station_id, season, analysis_time, forecast_period and timestamp, which we call collectively the station info. The 6th column is T2Obs and the 32nd column is D2Obs. Columns 7 to 31 contain the model data from the ECMWF numerical model.
```{r, warning=FALSE, message=F, echo=TRUE}
# loading the data from the file
T2_D2_Obs_model_data_with_timestamps <- read.table(
  file = "data/station2998_T2_D2_Obs_model_data_with_timestamps.csv",
  sep = ",", header = TRUE)
# select the one-day forecast (forecast_period == 8), analysis time 00 UTC (== 1), winter season (== 1)
df_winter_00_08 <- T2_D2_Obs_model_data_with_timestamps[
  which(T2_D2_Obs_model_data_with_timestamps$forecast_period == 8 &
        T2_D2_Obs_model_data_with_timestamps$analysis_time == 1 &
        T2_D2_Obs_model_data_with_timestamps$season == 1), ]
dim(df_winter_00_08)
str(df_winter_00_08)
```
### Splitting the training and testing data.
From df_winter_00_08 we create two datasets, one with T2Obs and the model data and the other with D2Obs and the model data. The model data is the same in both, and the response variables T2Obs and D2Obs are in the first column of the T2Obs dataset and the D2Obs dataset, respectively. We then split the data into train and test sets. The train data has 430 observations of 26 variables and the test data has 143 observations of 26 variables. Notice that the station info columns are removed, as those variables do not influence the linear regression model.
```{r, warning=FALSE, message=F, echo=TRUE}
# Splitting into train and test data
set.seed(123)
n <- nrow(df_winter_00_08)
train = sample(1:n, size = round(0.75*n), replace=FALSE)
T2_winter_00_08_train <- df_winter_00_08[train,6:(ncol(df_winter_00_08) - 1)]
T2_winter_00_08_test <- df_winter_00_08[-train,6:(ncol(df_winter_00_08) - 1)]
D2_winter_00_08_train <- cbind(df_winter_00_08[train,ncol(df_winter_00_08)],
df_winter_00_08[train,7:(ncol(df_winter_00_08) - 1)])
colnames(D2_winter_00_08_train)[1] <- "D2Obs"
D2_winter_00_08_test <- cbind(df_winter_00_08[-train,ncol(df_winter_00_08)],
df_winter_00_08[-train,7:(ncol(df_winter_00_08) - 1)])
colnames(D2_winter_00_08_test)[1] <- "D2Obs"
dim(T2_winter_00_08_train)
dim(T2_winter_00_08_test)
dim(D2_winter_00_08_train)
dim(D2_winter_00_08_test)
str(T2_winter_00_08_train)
str(D2_winter_00_08_train)
```
### Correlation matrix
Here we compute the correlation matrix for the variables of the dataset df_winter_00_08.
```{r, warning=FALSE, message=F, echo=TRUE}
library(corrplot); library(dplyr)
cor(as.matrix(df_winter_00_08[,6:ncol(df_winter_00_08)]))
```
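Since corrplot is already loaded, the full matrix can also be visualised at a glance; a minimal sketch (optional, not used in the analysis below):
```{r, warning=FALSE, message=F, echo=TRUE, eval=FALSE}
# visualize the correlation matrix as a colour map (sketch)
cor_mat <- cor(as.matrix(df_winter_00_08[, 6:ncol(df_winter_00_08)]))
corrplot(cor_mat, method = "color", type = "upper", tl.cex = 0.6)
```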
## Linear regression model for T2
We use the correlation matrix and the scatter plots to find out the predictor variables for our model which predicts T2.
### Correlation coefficients and scatter plots of the variables of the dataset df_winter_00_08 with T2Obs
We study and plot the relationship between T2Obs and the ECMWF model variables.
The correlation coefficients of the variables of the dataset df_winter_00_08 with T2Obs are given below.
<pre><code>
T2Obs MSL T2 D2 U10 V10 TP LCC
T2Obs 1.00000000 -0.413811966 0.94819952 0.9283360 0.39888861 0.40988014 0.12437463 -0.03288617
MCC HCC SKT RH_950 T_950 RH_925 T_925 RH_850
T2Obs 0.145251318 0.23019018 0.8863533 0.18680122 0.88894005 0.216119199 0.853054151 0.167963469
T_850 RH_700 T_700 RH_500 T_500 T2_M1 T_950_M1 T_925_M1
T2Obs 0.769399085 0.04118724 0.67602135 0.09403511 0.587334852 0.94419536 0.88398440 0.85172114
DECLINATION D2_M1 D2Obs
T2Obs 0.45237252 0.92448408 0.9419332
</code></pre>
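The row of correlations with T2Obs shown above can be reproduced and ranked directly from the correlation matrix; a minimal sketch:
```{r, warning=FALSE, message=F, echo=TRUE, eval=FALSE}
# correlations of all model variables with T2Obs, sorted by absolute value (sketch)
cor_mat <- cor(as.matrix(df_winter_00_08[, 6:ncol(df_winter_00_08)]))
sort(abs(cor_mat["T2Obs", ]), decreasing = TRUE)
```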
We can observe from the scatter plots that all "temperature variables" are linearly correlated with T2Obs. Please zoom into the figure to see the details.
```{r, warning=FALSE, message=F, echo=TRUE}
library(ggplot2)
par(mfrow = c(4, 7))
for (i in 7:31) {
  plot(df_winter_00_08[, i], df_winter_00_08[, 6],
       ylab = "T2Obs", xlab = colnames(df_winter_00_08)[i])
}
```
#### First linear regression model for T2
For our first model, we take all the variables which have a positive correlation higher than 0.8 with T2Obs and all the variables which are negatively correlated with T2Obs.
All the "temperature variables" are linearly correlated with T2Obs but some have correlation coefficients less than 0.8. We take all the temperature variables that have higher correlation coefficients than 0.8 with T2Obs. So we use the temperature variables T2, T_950, T_925, T2_M1, T2_950_M1, T2_925_M1 and SKT in our model. D2 and D2_M1 are also have high correlation coefficient with T2Obs and are linearly distributed with T2Obs, we include D2 in our model.
The RH variables appear randomly scattered against T2Obs and have correlations below 0.8, so we exclude them from our model. Even though DECLINATION is linearly related to T2Obs, its correlation coefficient is below 0.5, so we exclude it as well. TP is not linearly related to T2Obs and has a very small correlation coefficient, so we exclude it from the model.
From the scatter plots we can see that the cloud cover variables LCC, MCC and HCC are randomly scattered against T2Obs and have low correlation coefficients. As LCC has a negative correlation with T2Obs, we add LCC to our list of predictors. As MSL also has a negative correlation coefficient, we include it as well. The wind velocity variables U10 and V10 are randomly scattered and have low correlation coefficients, so we exclude them from our model.
Based on the above hypothesis, the formula for our initial model is
**T2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + T_950_M1 + T_925_M1 + D2_M1 + SKT + LCC + MSL**
We use the R function "lm" to carry out the linear regression model.
The summary of our first model, my.model1.T2, shows that it has a residual standard error of 1.428, an F-statistic of 506.9 and a p-value < 2.2e-16.
The p-value gives the probability of obtaining an F-statistic at least as large as the one observed if the null hypothesis were true. Under the null hypothesis, the model is an intercept-only model and there is no linear relationship between the predictand and the predictors. In our case the p-value is very small, so we reject the null hypothesis and conclude that there is a linear relationship between the predictand and the predictors.
The Fisher value or F-statistic is the ratio of explained to unexplained variance and has a value near 1 if the null hypothesis is true.
The F-statistic is defined as F = MSM (Mean of Squares for the Model) / MSE (Mean of Squares for Error). In our case the F-statistic is 506.9, so our model rejects the null hypothesis.
The t value is the value of the t-statistic for testing whether the corresponding regression coefficient is different from 0. As a rule of thumb, a t value around or greater than 2 in absolute value is good.
The Pr value is the p-value of the hypothesis test for which the t value is the test statistic. It gives the probability of obtaining a test statistic at least as extreme as the observed t value if the null hypothesis were true. A low Pr value for a variable indicates a linear relationship between that variable and the predictand.
The significance codes order the variables by p-value: the lower the p-value, the higher the significance.
Based on the F-statistic and p value we see that our model is a good one.
```{r, warning=FALSE, message=F, echo=TRUE}
my.model1.T2 <- lm(T2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + T_950_M1 + T_925_M1 + D2_M1 + SKT + LCC + MSL , data = T2_winter_00_08_train)
summary(my.model1.T2)
```
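The quantities discussed above (residual standard error, F-statistic, t values and Pr values) can also be read programmatically from the summary object; a minimal sketch assuming my.model1.T2 from the chunk above:
```{r, warning=FALSE, message=F, echo=TRUE, eval=FALSE}
s1 <- summary(my.model1.T2)
s1$sigma                                      # residual standard error
s1$fstatistic                                 # F-statistic and its degrees of freedom
s1$coefficients[, c("t value", "Pr(>|t|)")]   # per-variable t and Pr values
```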
#### Second linear regression model for T2
In this step we try to see whether we can improve our first model. Based on the t and Pr values of the variables in the first model, we keep only the variables with a Pr value less than 0.5. This gives a new formula for the linear regression:
**T2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + D2_M1 + SKT + LCC**
The summary of the model shows that the F-statistic has increased from 506.9 to 698. The p-value remains at the same low value and the residual standard error is almost the same (1.428 for the first model, 1.426 for the second).
```{r, warning=FALSE, message=F, echo=TRUE}
my.model2.T2 <- lm(T2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + D2_M1 + SKT + LCC , data = T2_winter_00_08_train)
summary(my.model2.T2)
```
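The screening by Pr value described above can also be expressed in code; a minimal sketch that rebuilds a formula from the first model's coefficient table (an illustration only, the threshold 0.5 follows the text):
```{r, warning=FALSE, message=F, echo=TRUE, eval=FALSE}
# keep predictors of the first model whose Pr(>|t|) is below 0.5 (sketch)
pvals <- summary(my.model1.T2)$coefficients[-1, "Pr(>|t|)"]  # drop the intercept row
keep  <- names(pvals)[pvals < 0.5]
reformulate(keep, response = "T2Obs")
```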
#### Third linear regression model for T2
We try to see if we can improve the model still further. Based on the t and Pr values of the variables in the second model, we keep only the variables with a Pr value less than 0.1. This gives a new formula for the linear regression:
**T2Obs ~ T2 + D2 + T_950 + T2_M1 + LCC**
The summary of the model shows that the F-statistic has increased further, to 1114. The p-value remains at the same low value and the residual standard error is about the same as in the first model.
```{r, warning=FALSE, message=F, echo=TRUE}
my.model3.T2 <- lm(T2Obs ~ T2 + D2 + T_950 + T2_M1 + LCC , data = T2_winter_00_08_train)
summary(my.model3.T2)
```
### The selected linear model for T2
From the above analysis we see that we do not gain much in terms of residual error and p-value when we move from model 1 to model 3.
Since our analysis is based only on the winter data, we cannot strongly state that the five variables of the third model are the only ones that significantly affect T2. As a compromise we select the second model with eight variables, which has a higher F value than the first model.
The linear regression formula for our model is taken as
**T2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + D2_M1 + SKT + LCC**
```{r, warning=FALSE, message=F, echo=TRUE}
my.model.T2 <- lm(T2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + D2_M1 + SKT + LCC , data = T2_winter_00_08_train)
summary(my.model.T2)
```
### Predicting T2 with my model
We now see the predicting ability of our model by using it to predict T2 for the test data T2_winter_00_08_test.
The results show that the root mean square error (rmse) is 1.426.
Our linear regression model for T2 is
<pre><code>
T2 = 9.07883695 + 0.26398396 * T2 (of the ECMWF model) + 0.45513159 * D2 (of the ECMWF model) + 0.18812205 * T_950 - 0.07796135 * T_925 + 0.30452258 * T2_M1 - 0.11617459 * D2_M1 - 0.04322546 * SKT - 1.35912603 * LCC
</code></pre>
```{r, warning=FALSE, message=F, echo=TRUE}
# calculating the fit
fitted.values <- predict(object = my.model.T2, newdata = T2_winter_00_08_test)
T2pred.mymodel <- fitted.values
fitted.values <- as.numeric(fitted.values)
residuals <- as.numeric(T2_winter_00_08_test[,1] - fitted.values)
coefficients <- my.model.T2$coefficients
rmse.mymodel.T2 <- sqrt(mean(residuals^2))
coefficients
rmse.mymodel.T2
```
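A quick visual check of the fit is to plot the predictions against the observations on the test set; a minimal sketch using T2pred.mymodel from the chunk above:
```{r, warning=FALSE, message=F, echo=TRUE, eval=FALSE}
# predicted vs observed T2 on the test set (sketch)
plot(T2_winter_00_08_test$T2Obs, T2pred.mymodel,
     xlab = "T2Obs (test)", ylab = "T2 predicted by my.model.T2")
abline(0, 1, col = "red")  # perfect-prediction line
```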
## Linear regression model for D2
We use the correlation matrix and the scatter plots to find out the predictor variables for our model for D2.
### Correlation and Scatter plots of the variables of the dataset df_winter_00_08 with D2Obs
We study and plot the relationship between D2Obs and the ECMWF model variables. The correlation coefficients of the variables of the dataset df_winter_00_08 with D2Obs are given below.
<pre><code>
T2Obs MSL T2 D2 U10 V10 TP LCC
D2Obs 0.94193317 -0.455461577 0.93054344 0.9664760 0.38240272 0.47678192 0.20345976 0.14652865
MCC HCC SKT RH_950 T_950 RH_925 T_925 RH_850
D2Obs 0.195814995 0.25890049 0.8969396 0.36156717 0.85307293 0.361084178 0.826701929 0.268052174
T_850 RH_700 T_700 RH_500 T_500 T2_M1 T_950_M1 T_925_M1
D2Obs 0.765794084 0.10873350 0.67134133 0.13140822 0.564374564 0.90946289 0.83419344 0.81248000
DECLINATION D2_M1 D2Obs
D2Obs 0.33219331 0.95292046 1.0000000
</code></pre>
We can observe from the scatter plots that all "temperature variables" are linearly correlated with D2Obs. Please zoom into the figure to see the details.
```{r, warning=FALSE, message=F, echo=TRUE}
library(ggplot2)
par(mfrow = c(4, 7))
for (i in 7:31) {
  plot(df_winter_00_08[, i], df_winter_00_08[, 32],
       ylab = "D2Obs", xlab = colnames(df_winter_00_08)[i])
}
```
#### First linear regression model for D2
Similar to the modelling of T2Obs, for our first model, we take all the variables which have a positive correlation higher than 0.8 with D2Obs and all the variables which are negatively correlated with D2Obs.
All the "temperature variables" are linearly correlated with D2Obs but some have correlation coefficients less than 0.8. We take all the temperature variables that have higher correlation coefficients than 0.8 with D2Obs. So we use the temperature variables T2, T_950, T_925, T2_M1, T2_950_M1, T2_925_M1 and SKT in our model. D2 and D2_M1 are also have high correlation coefficient with D2Obs and are linearly distributed with T2Obs, we include D2 in our model.
The RH variables appear randomly scattered against D2Obs and have correlations below 0.8, so we exclude them from our model. Even though DECLINATION is linearly related to D2Obs, its correlation coefficient is below 0.5, so we exclude it as well.
The cloud cover variables LCC, MCC and HCC are randomly scattered against D2Obs and have low correlation coefficients, so we exclude them from our model. As MSL has a negative correlation coefficient, we include it in our list of predictors. The wind velocity variables U10 and V10 are randomly scattered and have low correlation coefficients.
From a layman's point of view, however, we have a feeling that D2 may be related to LCC, TP, RH_950 and V10, so we add them even though they are not linearly related to D2Obs and have small correlation coefficients.
Based on the above hypothesis, the formula for our initial model is
**D2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + T_950_M1 + T_925_M1 + D2_M1 + SKT + MSL + TP + RH_950 + V10 + LCC**
The summary of our first model, my_model1_D2, shows that it has a residual standard error of 1.447, an F-statistic of 486.7 and a p-value < 2.2e-16.
As the p-value is very small, we reject the null hypothesis: there is a linear relationship between the predictand and the predictors.
```{r, warning=FALSE, message=F, echo=TRUE}
my_model1_D2 <- lm(D2Obs ~ T2 + D2 + T_950 + T_925 + T2_M1 + T_950_M1 + T_925_M1 + D2_M1 + SKT + MSL + TP + RH_950 + V10 + LCC, data = D2_winter_00_08_train)
summary(my_model1_D2)
```
#### Second linear regression model for D2
In this step we try to see whether we can improve our first model. Based on the t and Pr values of the variables in the first model, we keep the variables with a Pr value less than 0.5. Even though T2 has a Pr value greater than 0.5, we keep T2 in the second model as well because we believe that T2 and D2 are related. As we saw in the quantile plot of the winter data, the ECMWF model T2 deviates widely from the observed T2Obs, which may explain the high Pr value for T2. This gives a new formula for the linear regression:
**D2Obs ~ T2 + D2 + T_950 + T2_M1 + T_950_M1 + D2_M1 + SKT + MSL + TP + RH_950 + V10 + LCC**
The summary of the model shows that the F-statistic has increased from 486.7 to 569.9. The p-value remains at the same low value and the residual standard error is almost the same (1.447 for the first model, 1.444 for the second).
```{r, warning=FALSE, message=F, echo=TRUE}
my_model2_D2 <- lm(D2Obs ~ T2 + D2 + T_950 + T2_M1 + T_950_M1 + D2_M1 + SKT + MSL + TP + RH_950 + V10 + LCC, data = D2_winter_00_08_train)
summary(my_model2_D2)
```
#### Third linear regression model for D2
In this step we try to see whether we can improve our second model. Based on the t and Pr values of the variables in the second model, we keep only the variables with a Pr value around 0.1 or less. Even though T2 has a Pr value greater than 0.5, we keep T2 in the third model as well because we believe that T2 and D2 are related. This gives a new formula for the linear regression:
**D2Obs ~ T2 + D2 + T_950 + T_950_M1 + D2_M1 + TP + RH_950 + V10 + LCC**
The summary of the model shows that the F-statistic has increased from 569.9 to 756.7. The p-value remains at the same low value and the residual standard error is almost the same (1.444 for the second model, 1.446 for the third).
```{r, warning=FALSE, message=F, echo=TRUE}
my_model3_D2 <- lm(D2Obs ~ T2 + D2 + T_950 + T_950_M1 + D2_M1 + TP + RH_950 + V10 + LCC, data = D2_winter_00_08_train)
summary(my_model3_D2)
```
### The selected linear model for D2
From the above analysis we see that we do not gain much in terms of residual error and p-value when we move from model 1 to model 3.
As model 3 has nine variables with quite good Pr values, we select the third model, which also has a higher F value than the other two models.
The linear regression formula for our model is taken as
**D2Obs ~ T2 + D2 + T_950 + T_950_M1 + D2_M1 + TP + RH_950 + V10 + LCC**
```{r, warning=FALSE, message=F, echo=TRUE}
my.model.D2 <- lm(D2Obs ~ T2 + D2 + T_950 + T_950_M1 + D2_M1 + TP + RH_950 + V10 + LCC, data = D2_winter_00_08_train)
summary(my.model.D2)
```
### Predicting D2 with my.model.D2
We now see the predicting ability of our model by using it to predict D2 for the test data D2_winter_00_08_test.
The results show that the root mean square error (rmse) is 1.351.
Our linear regression model for D2 is
<pre><code>
D2 = 2.472731 - 0.063511 * T2 (of the ECMWF model) + 0.564682 * D2 (of the ECMWF model) + 0.364282 * T_950 - 0.148566 * T_950_M1 + 0.265225 * D2_M1 - 0.603928 * TP + 0.035618 * RH_950 + 0.082966 * V10 - 0.629401 * LCC
</code></pre>
```{r, warning=FALSE, message=F, echo=TRUE}
# calculating the fit
fitted.values <- predict(object = my.model.D2, newdata = D2_winter_00_08_test)
D2pred.mymodel <- fitted.values
fitted.values <- as.numeric(fitted.values)
residuals <- as.numeric(D2_winter_00_08_test[,1] - fitted.values)
coefficients <- my.model.D2$coefficients
rmse.mymodel.D2 <- sqrt(mean(residuals^2))
coefficients
rmse.mymodel.D2
```
## Comparing my.model.T2 and my.model.D2 with standard linear regression models
In order to analyze my.model, we compare it with models produced by three other linear regression models.
The first one is known as the "lm model" which uses R function lm with all the variables of the dataset.
The second is the "lmstep" model, which uses the R "step" algorithm. This algorithm develops a sequence of linear models, removing or adding a variable at each step. The model is updated when the new candidate has a lower Akaike Information Criterion (AIC) than the current model. AIC is defined as
$AIC \,=\, N \, \log\left(\frac{RSS}{N}\right) + 2k$, where $N$ is the number of observations, $RSS$ is the residual sum of squares and $k$ is the number of fitted parameters (degrees of freedom) of the model.
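This is the AIC scale used by R's step function for linear models (up to an additive constant); a minimal sketch of how it could be computed by hand for a fitted lm object, as a cross-check:
```{r, warning=FALSE, message=F, echo=TRUE, eval=FALSE}
# AIC as used by step(), computed by hand for a fitted lm (sketch)
aic_by_hand <- function(fit) {
  n   <- length(residuals(fit))
  rss <- sum(residuals(fit)^2)
  k   <- length(coefficients(fit))   # fitted parameters, including the intercept
  n * log(rss / n) + 2 * k
}
# should agree with extractAIC(fit)[2] for a model fitted with lm()
```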
The third method is glmcv, a Gaussian linear regression model with cross validation. Here the training dataset is divided into K equal parts, and one part at a time is set aside as validation data. The idea behind K-fold cross validation is that the linear regression model is trained on the remaining K-1 parts, and the prediction error is computed on the part that was set aside. In the next iteration the previously set-aside part rejoins the training data and a new part is set aside. The iterations continue in this fashion, each time using K-1 parts for training and one part for validation. cv.glm averages the prediction errors over the K folds, giving an estimate of the out-of-sample error of the fitted model.
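The mechanics of K-fold cross validation can be spelled out explicitly; a minimal sketch for the T2 training data (an illustration only, the analysis below relies on cv.glm from the boot package):
```{r, warning=FALSE, message=F, echo=TRUE, eval=FALSE}
# explicit K-fold cross validation for a linear model (sketch)
set.seed(123)
K     <- 4
folds <- sample(rep(1:K, length.out = nrow(T2_winter_00_08_train)))
cv_err <- numeric(K)
for (k in 1:K) {
  fit       <- lm(T2Obs ~ ., data = T2_winter_00_08_train[folds != k, ])
  pred      <- predict(fit, newdata = T2_winter_00_08_train[folds == k, ])
  cv_err[k] <- mean((T2_winter_00_08_train$T2Obs[folds == k] - pred)^2)
}
mean(cv_err)  # average squared prediction error over the K folds
```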
We then use the models of lm, lmstep and glmcv to predict T2 and D2 using the test data. We use the root mean square error (RMSE) to compare the different models.
### Modelling with lm for T2
For the lm model we use all the model variables. With this model we get Residual standard error: 1.413, F-statistic: 228.9 and p-value: < 2.2e-16
```{r, warning=FALSE, message=F, echo=TRUE}
lm.model.T2 <- lm(T2Obs ~ ., data = T2_winter_00_08_train)
summary(lm.model.T2)
```
### Predicting T2 with the lm.model.T2
We now see the predicting ability of lm.model.T2 by using it to predict T2 for the test data T2_winter_00_08_test.
The results show that the root mean square error (rmse) is 1.411888.
```{r, warning=FALSE, message=F, echo=TRUE}
# calculating the fit
fitted.values <- predict(object = lm.model.T2, newdata = T2_winter_00_08_test)
T2pred.lm <- fitted.values
fitted.values <- as.numeric(fitted.values)
residuals <- as.numeric(T2_winter_00_08_test[,1] - fitted.values)
coefficients <- lm.model.T2$coefficients
rmse.lm.T2 <- sqrt(mean(residuals^2))
coefficients
rmse.lm.T2
```
### Modelling with lm for D2
For the lm model we use all the model variables. With this model we get Residual standard error: 1.438, F-statistic: 276.2 and p-value: < 2.2e-16
```{r, warning=FALSE, message=F, echo=TRUE}
lm.model.D2 <- lm(D2Obs ~ ., data = D2_winter_00_08_train)
summary(lm.model.D2)
```
### Predicting D2 with the lm.model.D2
We now see the predicting ability of lm.model.D2 by using it to predict D2 for the test data D2_winter_00_08_test.
The results show that the root mean square error (rmse) is 1.305783.
```{r, warning=FALSE, message=F, echo=TRUE}
# calculating the fit
fitted.values <- predict(object = lm.model.D2, newdata = D2_winter_00_08_test)
D2pred.lm <- fitted.values
fitted.values <- as.numeric(fitted.values)
residuals <- as.numeric(D2_winter_00_08_test[,1] - fitted.values)
coefficients <- lm.model.D2$coefficients
rmse.lm.D2 <- sqrt(mean(residuals^2))
coefficients
rmse.lm.D2
```
### Modelling T2 with lmstep
As explained earlier, this algorithm develops a sequence of linear models, removing or adding a variable at each step. The step model starts from an initial formula, the object.formula. We can also define lower and upper formulas, and the algorithm can move backward or forward from the object formula towards the lower or upper formula, in at most the number of steps given by the steps argument. Here we define the object.formula as the intercept-only formula (no variables present), i.e. T2Obs ~ 1. The lower.formula is the same as the object.formula, and the upper.formula is the formula with all the variables, i.e. T2Obs ~ . We use only the forward direction.
With this model we get Residual standard error: 1.397, F-statistic: 531.2 and p-value: < 2.2e-16
```{r, warning=FALSE, message=F, echo=TRUE}
# generating the linear models
response <- colnames(T2_winter_00_08_train)[1]
object.formula <- as.formula(paste(response, " ~ 1", sep = ""))
upper.formula <- as.formula(paste(response, " ~ .", sep = ""))
lower.formula <- as.formula(paste(response, " ~ 1", sep = ""))
object.lm <- lm(object.formula, data = T2_winter_00_08_train)
upper.lm <- lm(upper.formula, data = T2_winter_00_08_train)
lower.lm <- lm(lower.formula, data = T2_winter_00_08_train)
direction <- "forward"
steps <- 1000
# choosing the best model
step.model.T2 <- step(object = object.lm,
scope = list(upper = upper.lm, lower = lower.lm),
direction = direction, trace = 0 , steps = steps)
step.model.T2
summary(step.model.T2)
```
### Predicting T2 with the lmstep model
We now see the predicting ability of lm.step.T2 by using it to predict T2 for the test data T2_winter_00_08_test.
The results show that the root mean square error (rmse) is 1.422109.
```{r, warning=FALSE, message=F, echo=TRUE}
# calculating the fit
fitted.values <- predict(object = step.model.T2, newdata = T2_winter_00_08_test)
T2pred.lmstep <- fitted.values
fitted.values <- as.numeric(fitted.values)
residuals <- as.numeric(T2_winter_00_08_test[,1] - fitted.values)
coefficients <- step.model.T2$coefficients
rmse.lmstep.T2 <- sqrt(mean(residuals^2))
coefficients
rmse.lmstep.T2
```
### Modelling D2 with the lmstep model
Similar to the T2 model, we define the object.formula as the intercept-only formula (no variables present), i.e. D2Obs ~ 1. The lower.formula is the same as the object.formula, and the upper.formula is the formula with all the variables, i.e. D2Obs ~ . We use only the forward direction.
With this model we get Residual standard error: 1.428, F-statistic: 777.9 and p-value: < 2.2e-16.
Notice that T2 is not present in the final formula which gave the best AIC.
```{r, warning=FALSE, message=F, echo=TRUE}
# generating the linear models
response <- colnames(D2_winter_00_08_train)[1]
object.formula <- as.formula(paste(response, " ~ 1", sep = ""))
upper.formula <- as.formula(paste(response, " ~ .", sep = ""))
lower.formula <- as.formula(paste(response, " ~ 1", sep = ""))
object.lm <- lm(object.formula, data = D2_winter_00_08_train)
upper.lm <- lm(upper.formula, data = D2_winter_00_08_train)
lower.lm <- lm(lower.formula, data = D2_winter_00_08_train)
direction <- "forward"
steps <- 1000
# choosing the best model
step.model.D2 <- step(object = object.lm,
scope = list(upper = upper.lm, lower = lower.lm),
direction = direction, trace = 0 , steps = steps)
summary(step.model.D2)
```
### Predicting D2 with the lmstep model
We now see the predicting ability of lm.step.D2 by using it to predict D2 for the test data D2_winter_00_08_test.
The results show that the root mean square error (rmse) is 1.334799.
```{r, warning=FALSE, message=F, echo=TRUE}
# calculating the fit
fitted.values <- predict(object = step.model.D2, newdata = D2_winter_00_08_test)
D2pred.lmstep <- fitted.values
fitted.values <- as.numeric(fitted.values)
residuals <- as.numeric(D2_winter_00_08_test[,1] - fitted.values)
coefficients <- step.model.D2$coefficients
rmse.lmstep.D2 <- sqrt(mean(residuals^2))
coefficients
rmse.lmstep.D2
```
### Modelling T2 with the glm function and cross validation (cv.glm)
We use the glm function together with the cv.glm function from the boot package. With K = 4 folds, roughly three quarters of the training observations (around 320) are used for fitting and the remaining quarter (around 110) for cross validation in each fold. The cross validation error is smallest for the first iteration.
```{r, warning=FALSE, message=F, echo=TRUE}
library(boot)
set.seed(123)
cv.error.10 <- rep(0, 10)
for (i in 1:10) {
  # refit the full model and estimate its 4-fold cross-validation error;
  # the folds are re-drawn at random on every iteration
  glm.fit.T2 <- glm(T2Obs ~ ., data = T2_winter_00_08_train)
  cv.error.10[i] <- cv.glm(T2_winter_00_08_train, glm.fit.T2, K = 4)$delta[1]
}
cv.error.10
summary(glm.fit.T2)
```
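As an aside, cv.glm returns two error estimates in its delta component, the raw K-fold estimate and a bias-adjusted one; a short sketch of inspecting both (assuming glm.fit.T2 from the chunk above):
```{r, warning=FALSE, message=F, echo=TRUE, eval=FALSE}
# inspect both components of the cv.glm error estimate (sketch)
cv_res <- cv.glm(T2_winter_00_08_train, glm.fit.T2, K = 4)
cv_res$delta  # [1] raw K-fold estimate, [2] bias-adjusted estimate
```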
### Predicting T2 with the glm model
We now see the predicting ability of glm.fit.T2 by using it to predict T2 for the test data T2_winter_00_08_test.
The results show that the root mean square error (rmse) is 1.411888.
```{r, warning=FALSE, message=F, echo=TRUE}
# calculating the fit
fitted.values <- predict(object = glm.fit.T2, newdata = T2_winter_00_08_test)
T2pred.glmcv <- fitted.values
fitted.values <- as.numeric(fitted.values)
residuals <- as.numeric(T2_winter_00_08_test[,1] - fitted.values)
coefficients <- glm.fit.T2$coefficients
rmse.glmcv.T2 <- sqrt(mean(residuals^2))
coefficients
rmse.glmcv.T2
```
### Modelling D2 with the glm function and cross validation (cv.glm)
Similar to the T2 modelling with glm, we carry out the modelling for D2.
```{r, warning=FALSE, message=F, echo=TRUE}
library(boot)
set.seed(123)
cv.error.10 <- rep(0, 10)
for (i in 1:10) {
  # refit the full model and estimate its 5-fold cross-validation error
  glm.fit.D2 <- glm(D2Obs ~ ., data = D2_winter_00_08_train)
  cv.error.10[i] <- cv.glm(D2_winter_00_08_train, glm.fit.D2, K = 5)$delta[1]
}
cv.error.10
summary(glm.fit.D2)
```
### Predicting D2 with the glm model
We now see the predicting ability of glm.fit.D2 by using it to predict D2 for the test data D2_winter_00_08_test.
The results show that the root mean square error (rmse) is 1.305783.
```{r, warning=FALSE, message=F, echo=TRUE}
# calculating the fit
fitted.values <- predict(object = glm.fit.D2, newdata = D2_winter_00_08_test)
D2pred.glmcv <- fitted.values
fitted.values <- as.numeric(fitted.values)
residuals <- as.numeric(D2_winter_00_08_test[,1] - fitted.values)
coefficients <- glm.fit.D2$coefficients
rmse.glmcv.D2 <- sqrt(mean(residuals^2))
coefficients
rmse.glmcv.D2
```
## rmse values for my.model, lm, lmstep and glmcv predictions of T2 and D2
We can see that my.model for T2 and D2 has rmse values comparable to those obtained using lm, lmstep and glmcv.
```{r, warning=FALSE, message=F, echo=TRUE}
cat("rmse mymodel T2=", rmse.mymodel.T2, "rmse lm T2=", rmse.lm.T2, "rmse lmstep T2=", rmse.lmstep.T2, "rmse glmcv T2=", rmse.glmcv.T2,
"rmse mymodel D2=", rmse.mymodel.D2, "rmse lm D2=", rmse.lm.D2,"rmse lmstep D2=", rmse.lmstep.D2, "rmse glmcv D2=", rmse.glmcv.D2
)
```
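For easier comparison, the same values can be collected into a small table; a minimal sketch:
```{r, warning=FALSE, message=F, echo=TRUE, eval=FALSE}
# collect the test-set rmse values of all models into one data frame (sketch)
rmse_table <- data.frame(
  model = c("mymodel", "lm", "lmstep", "glmcv"),
  T2    = c(rmse.mymodel.T2, rmse.lm.T2, rmse.lmstep.T2, rmse.glmcv.T2),
  D2    = c(rmse.mymodel.D2, rmse.lm.D2, rmse.lmstep.D2, rmse.glmcv.D2)
)
rmse_table
```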
## The diagnostic plots
We plot the following graphs Residuals vs Fitted values, Normal QQ-plot, Standardized residuals vs Fitted values and Residuals vs Leverage for all the models.
The Residuals vs Fitted values plot shows if the residuals have non-linear patterns. There could be a non-linear relationship between predictor variables and the response variable and the pattern could show up in this plot if the model is not able to capture the non-linear relationship. Ideally there should not be any pattern and the residuals should be randomly spread around a horizontal line drawn at 0 on the y-axis. A red line in the graph helps to show the trend in the pattern, its deviation from the horizontal line at 0.
The second graph, the normal QQ-plot shows if the residuals are normally distributed. If the residuals follow the straight dashed line, the residuals are normally distributed.
The third graph, Standardized residuals vs Fitted values (the scale-location plot), shows whether the residuals are spread equally along the range of fitted values, i.e. whether the assumption of equal variance holds.
The fourth graph, Residuals vs Leverage, helps to find influential extreme values or outliers. The data may contain extreme values, but we can consider them non-influential if the fitted linear regression model does not change much whether we include or exclude them from the analysis. The red dotted line in the graph shows the Cook's distance. If all the points in the leverage graph are well within the Cook's distance, then there are no outliers that influence the linear regression.
We plot the diagnostic plots for our linear regression models: mymodel, lm, lmstep and glmcv. For mymodel, the Residuals vs Fitted values plot shows that most of the residuals lie close to the zero line. The normal QQ plot is close to linear and the leverage plot shows no influential outliers. These graphs indicate that mymodel is a good fit for the weather data. Similar graphs for lm, lmstep and glmcv show that they also produce good linear regression models for the weather data.
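The influence measures behind the Residuals vs Leverage plot can also be inspected numerically; a minimal sketch for my.model.T2 (the 4/n cut-off is just one common rule of thumb, not part of the original analysis):
```{r, warning=FALSE, message=F, echo=TRUE, eval=FALSE}
# numerically inspect influential points for my.model.T2 (sketch)
cd  <- cooks.distance(my.model.T2)
lev <- hatvalues(my.model.T2)
which(cd > 4 / length(cd))   # observations with unusually large Cook's distance
summary(lev)                 # distribution of leverage values
```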
```{r, warning=FALSE, message=F, echo=TRUE}
# We plot the graphs Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage.
# plot.lm indexes the caption list by plot number (1 = Residuals vs Fitted, 2 = Normal Q-Q,
# 3 = Scale-Location, 5 = Residuals vs Leverage), so unused positions are left empty.
diag_captions <- function(label) {
  list(paste(label, "Residuals vs Fitted values"), paste(label, "Normal QQ-plot"),
       paste(label, "Standardized residuals vs Fitted values"), "",
       paste(label, "Residuals vs Leverage"))
}
par(mfrow = c(2, 2))
plot(my.model.T2, which = c(1, 2, 3, 5), caption = diag_captions("T2-mymodel"))
par(mfrow = c(2, 2))
plot(my.model.D2, which = c(1, 2, 5), caption = diag_captions("D2-mymodel"))
par(mfrow = c(2, 2))
plot(lm.model.T2, which = c(1, 2, 5), caption = diag_captions("T2-lm"))
par(mfrow = c(2, 2))
plot(lm.model.D2, which = c(1, 2, 5), caption = diag_captions("D2-lm"))
par(mfrow = c(2, 2))
plot(step.model.T2, which = c(1, 2, 5), caption = diag_captions("T2-lmstep"))
par(mfrow = c(2, 2))
plot(step.model.D2, which = c(1, 2, 5), caption = diag_captions("D2-lmstep"))
par(mfrow = c(2, 2))
plot(glm.fit.T2, which = c(1, 2, 5), caption = diag_captions("T2-glm"))
par(mfrow = c(2, 2))
plot(glm.fit.D2, which = c(1, 2, 5), caption = diag_captions("D2-glm"))
```
## RH predictions
The aim of my assignment was to study the interdependence of the variables T2 and D2. We do this by calculating the relative humidity (RH), which is a function of T2 and D2.
In this section we calculate the RH values using (i) T2Obs and D2Obs, (ii) T2 and D2 from the ECMWF model, and (iii) T2 and D2 obtained from the linear regression models mymodel, lm, lmstep and glmcv.
```{r, warning=FALSE, message=F, echo=TRUE}
T2Obs.test <- T2_winter_00_08_test[,1]
D2Obs.test <- D2_winter_00_08_test[,1]
T2model.test <- T2_winter_00_08_test[,3]
D2model.test <- D2_winter_00_08_test[,4]
# RH calculation with the Magnus formula (the exponential terms are vapour pressures in hPa)
# RH from T2Obs and D2Obs
saturation_humidity_obs <- 6.112*exp((17.62*T2Obs.test)/(T2Obs.test + 243.12))
specific_humidity_obs <- 6.112*exp((17.62*D2Obs.test)/(D2Obs.test + 243.12))
RHObs <- 100*(specific_humidity_obs/saturation_humidity_obs)
# RH from T2 and D2 from the ECMWF model
saturation_humidity_model <- 6.112*exp((17.62*T2model.test)/(T2model.test + 243.12))
specific_humidity_model <- 6.112*exp((17.62*D2model.test)/(D2model.test + 243.12))
RHmodel <- 100*(specific_humidity_model/saturation_humidity_model)
```
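Since the same Magnus-formula expression is reused for every set of predictions below, it could be wrapped in a small helper function; a minimal sketch (the function name rh_from_t2_d2 is ours, not part of the original code):
```{r, warning=FALSE, message=F, echo=TRUE, eval=FALSE}
# relative humidity (%) from 2 m temperature and dew point (deg C), Magnus formula (sketch)
rh_from_t2_d2 <- function(t2, d2) {
  es <- 6.112 * exp((17.62 * t2) / (t2 + 243.12))  # saturation vapour pressure
  e  <- 6.112 * exp((17.62 * d2) / (d2 + 243.12))  # actual vapour pressure
  100 * e / es
}
# e.g. RHObs <- rh_from_t2_d2(T2Obs.test, D2Obs.test)
```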
### T2 and D2 predicted using mymodel, lm, lmstep and glmcv
```{r, warning=FALSE, message=F, echo=TRUE}
# T2 and D2 predicted using mymodel
saturation_humidity_pred_mymodel <- 6.112*exp((17.62*T2pred.mymodel)/(T2pred.mymodel + 243.12))
specific_humidity_pred_mymodel <- 6.112*exp((17.62*D2pred.mymodel)/(D2pred.mymodel + 243.12))
RHpred.mymodel <- 100*(specific_humidity_pred_mymodel/saturation_humidity_pred_mymodel)
# T2 and D2 predicted using lm
saturation_humidity_pred_lm <- 6.112*exp((17.62*T2pred.lm)/(T2pred.lm + 243.12))
specific_humidity_pred_lm <- 6.112*exp((17.62*D2pred.lm)/(D2pred.lm + 243.12))
RHpred.lm <- 100*(specific_humidity_pred_lm/saturation_humidity_pred_lm)
# T2 and D2 predicted using lmstep
saturation_humidity_pred_lmstep <- 6.112*exp((17.62*T2pred.lmstep)/(T2pred.lmstep + 243.12))
specific_humidity_pred_lmstep <- 6.112*exp((17.62*D2pred.lmstep)/(D2pred.lmstep + 243.12))
RHpred.lmstep <- 100*(specific_humidity_pred_lmstep/saturation_humidity_pred_lmstep)
# T2 and D2 predicted using glm cross validation
saturation_humidity_pred_glmcv <- 6.112*exp((17.62*T2pred.glmcv)/(T2pred.glmcv + 243.12))
specific_humidity_pred_glmcv <- 6.112*exp((17.62*D2pred.glmcv)/(D2pred.glmcv + 243.12))
RHpred.glmcv <- 100*(specific_humidity_pred_glmcv/saturation_humidity_pred_glmcv)
```
### The rmse in the RH value calculated using the T2 and D2 observations and model values
We get the following rmse values for the different models
rmse RH ECMWF model = 4.823471, rmse RH mymodel = 2.000381, rmse RH lm = 2.029639, rmse RH lmstep = 1.980861, rmse RH glmcv = 1.983662.
We can see that the ECMWF model has a high rmse in the RH value compared to all the other models. The model we developed, "mymodel", has rmse values for RH comparable to those of the lm, lmstep and glmcv models. This experiment shows that statistical post-processing improves the forecast, as it reduces the rmse compared to the raw model forecast.
Recall that the rmse for T2 and D2 from the linear regressions was around 1.4 for all the models. The rmse for RH is much larger (even double) than that of T2 and D2.
```{r, warning=FALSE, message=F, echo=TRUE}
rmse.RH.model <- sqrt(mean((RHmodel - RHObs)^2))
rmse.RH.mymodel <- sqrt(mean((RHpred.mymodel - RHObs)^2))
rmse.RH.lm <- sqrt(mean((RHpred.lm - RHObs)^2))
rmse.RH.lmstep <- sqrt(mean((RHpred.lmstep - RHObs)^2))
rmse.RH.glmcv <- sqrt(mean((RHpred.glmcv - RHObs)^2))
cat("rmse RH ECMWF model = ", rmse.RH.model, "rmse RM mymodel = ", rmse.RH.mymodel, "rmse RM lm = ", rmse.RH.lm, "rmse RH lmstep = ", rmse.RH.lmstep, "rmse RH glmcv = ", rmse.RH.glmcv)
```