<h2>Introduction</h2>
In the last chapter we introduced simple linear regression and a test of model fitness. In this chapter we expand the number of independent variables from one to many, which is known as <strong>Multiple Linear Regression</strong>.
We already know that a simple linear regression model is usually written in the following form:
\[Y = \alpha + \beta X + \epsilon\]
Since a multiple linear regression has multiple independent variables, the model is given by:
\[Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \cdots + \beta_n X_n + \epsilon\]
<h2>Python Implementation</h2>
In the last chapter we used the S&amp;P 500 index as a variable to predict Amazon's stock returns. Now we can add more variables to see whether the fit of the model improves.
<pre class="prettyprint linenums">import numpy as np
import pandas as pd
import quandl
import matplotlib.pyplot as plt
import statsmodels.formula.api as sm
quandl.ApiConfig.api_key = 'tAyfv1zpWnyhmDsp91yv'
spy_table = quandl.get('BCIW/_SPXT')
amzn_table = quandl.get('WIKI/AMZN')
ebay_table = quandl.get('WIKI/EBAY')
wal_table = quandl.get('WIKI/WMT')
aapl_table = quandl.get('WIKI/AAPL')
spy = spy_table.loc['2016',['Close']]
amzn = amzn_table.loc['2016',['Close']]
ebay = ebay_table.loc['2016',['Close']]
wal = wal_table.loc['2016',['Close']]
aapl = aapl_table.loc['2016',['Close']]
spy_log = np.log(spy.Close).diff().dropna()
amzn_log = np.log(amzn.Close).diff().dropna()
ebay_log = np.log(ebay.Close).diff().dropna()
wal_log = np.log(wal.Close).diff().dropna()
aapl_log = np.log(aapl.Close).diff().dropna()
df = pd.concat([spy_log,amzn_log,ebay_log,wal_log,aapl_log],axis = 1).dropna()
df.columns = ['spy','amzn','ebay','wal','aapl']
df.tail()
</pre>
Now let's build the model. As with a simple linear regression model, we use the 'statsmodels' package and specify the formula:
<pre class="prettyprint linenums">model = sm.ols(formula = 'amzn ~ spy+ebay+wal',data = df).fit()
print model.summary()
</pre>
Here we built a multiple linear regression model. As we can read from the summary table, the p-values for eBay, Walmart and Apple are 0.174, 0.330 and 0.068 respectively, which means none of them is significant at the 95% confidence level. Let's compare the result with a simple regression model:
<pre class="prettyprint linenums">simple = sm.ols(formula = 'amzn ~ spy',data = df).fit()
print simple.summary()
</pre>
We can see that the R-squared increased a little. In fact, as we add more variables, the R-squared will never decrease. However, this doesn't mean it's better to add hundreds of variables; we will discuss this issue in a later chapter.
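A common way to account for this is the adjusted R-squared, which penalizes additional regressors. As a quick check using the fitted results above, we can read both statistics directly from the statsmodels results objects:
<pre class="prettyprint linenums"># R-squared never decreases as regressors are added;
# adjusted R-squared rises only if the new variables add real explanatory power
print('simple R-squared: ', simple.rsquared)
print('simple adjusted R-squared: ', simple.rsquared_adj)
print('multiple R-squared: ', model.rsquared)
print('multiple adjusted R-squared: ', model.rsquared_adj)
</pre>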
The reason the R-squared increased only a little is that we selected the extra stocks based on intuition: we thought Amazon's competitors and substitutes might explain its daily returns, but they actually don't. Here we instead try the <strong>Fama-French 5-factor model</strong>, an important model in asset pricing theory that we will cover in later tutorials.
<pre class="prettyprint linenums">fama_table = quandl.get('KFRENCH/FACTORS5_D')
fama = fama_table['2016']
fama = fama.rename(columns = {'Mkt-RF':'MKT'})
fama = fama.apply(lambda x: x/100)
fama_df = pd.concat([fama,amzn_log],axis = 1)
fama_model = sm.ols(formula = 'Close~MKT+SMB+HML+RMW+CMA',data = fama_df).fit()
print fama_model.summary()
</pre>
As we can see, the R-squared increases significantly when we use the Fama-French 5 factors in the multiple linear regression. We can compare the predictions from the simple linear regression and the Fama-French multiple regression by plotting them together on one chart:
<pre class="prettyprint linenums">result = pd.DataFrame({'simple regression':simple.predict(),'fama_french':fama_model.predict(),'sample':df.amzn},index = df.index)
plt.figure(figsize = (15,7.5))
plt.plot(result['2016-7':'2016-9'].index,result.loc['2016-7':'2016-9','simple regression'])
plt.plot(result['2016-7':'2016-9'].index,result.loc['2016-7':'2016-9','fama_french'])
plt.plot(result['2016-7':'2016-9'].index,result.loc['2016-7':'2016-9','sample'])
plt.legend()
plt.show()
</pre>
Although it's hard to see on the chart, the predictions from the multiple regression are closer to the actual values. In practice, we usually don't judge models by plotting their predictions; instead, we read the statistical results in the summary table.
<h2>Model Significance Test</h2>
<p style="text-align: left;">From the last tutorial we know that r-square is an important statistical indicator for parameter significance. Now we are going to introduce another indicator for only multiple linear regression: F-score.
The null hypothesis and alternative hypothesis for F-test are:
\[H_0: \beta_1 = \beta_2 = \beta_3 =... = \beta_n = 0\]</p>
<p style="text-align: center;">\(H_1:\) At least one coefficient is not 0</p>
We won't explain the F-test procedure in detail here; you just need to understand the null and alternative hypotheses. The 'F-statistic' in the summary table is the value of the F-test, and 'Prob (F-statistic)' is its p-value. As you can see, the p-value of our F-test is extremely small, which means we can be almost certain that at least one of the coefficients is not 0.

After fitting a multiple linear regression model, the first step is to check the F-test for model significance. If the p-value is not small enough to reject the null hypothesis, you should reconsider the feasibility of the model and find other variables to build a new one.

For simple linear regression, the null hypothesis of the F-test is exactly the same as that of the t-test on the slope, so the p-values of the two tests are also the same. You can verify this in the simple linear regression summary table above.
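Both statistics can also be read programmatically from the fitted results. The following quick check uses the statsmodels result attributes fvalue, f_pvalue and pvalues, and verifies the equivalence for the simple regression:
<pre class="prettyprint linenums"># overall significance of the multiple regression
print('F-statistic: ', fama_model.fvalue)
print('F-test p-value: ', fama_model.f_pvalue)
# in simple regression the F-test p-value equals the slope's t-test p-value
print('simple F-test p-value: ', simple.f_pvalue)
print('simple slope t-test p-value: ', simple.pvalues['spy'])
</pre>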
<h2>Residual Analysis</h2>
<h3>Normality</h3>
According to the linear regression assumptions, the residuals should be independent and normally distributed. We can plot the density of the residuals to check this:
<pre class="prettyprint linenums">plt.figure()
fama_model.resid.plot.density()
plt.show()
</pre>
As we can see from the plot, the residuals are indeed approximately normally distributed. Now let's check whether the mean is zero:
<pre class="prettyprint linenums">print 'residual mean: ', np.mean(fama_model.resid)
print 'residual variance: ', np.var(fama_model.resid)
</pre>
The mean is small enough to be treated as zero, so the model passes the normality part of the residual analysis.
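The density plot and the mean are informal checks. For a more formal check, we can also run a Jarque-Bera test on the residuals; this is a sketch using statsmodels, where a large p-value means we cannot reject normality:
<pre class="prettyprint linenums">from statsmodels.stats.stattools import jarque_bera
# returns the test statistic, its p-value, and the sample skewness and kurtosis
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(fama_model.resid)
print('Jarque-Bera p-value: ', jb_pvalue)
</pre>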
<h3>Homoskedasticity</h3>
This word is difficult to pronounce but not difficult to understand. The model assumes that the variance of the residuals does not change with the value of X. If it does, we say heteroskedasticity has been detected.
<pre class="prettyprint linenums">plt.figure(figsize = (20,10))
plt.scatter(df.spy,simple.resid)
plt.axhline(0.05)
plt.axhline(-0.05)
plt.xlabel('x value')
plt.ylabel('residual')
plt.show()
</pre>
As we can see from the chart, the variance of the residuals does not obviously increase with x, apart from three outliers, which do not change our conclusion.
For simple linear regression we can test for heteroskedasticity with a plot like this. For multiple regression, a plot against a single variable is not enough, so we use the Breusch-Pagan test from the statsmodels package instead:
<pre class="prettyprint linenums">from statsmodels.stats import diagnostic as dia
het = dia.het_breuschpagan(fama_model.resid,fama_df[['MKT','SMB','HML','RMW','CMA']][1:])
print 'p-value of Heteroskedasticity: ', het[-1]
</pre>
The p-value of the test is about 0.15, so for the multiple regression we cannot reject the null hypothesis of homoskedasticity at the 5% (or even 10%) significance level: no heteroskedasticity is detected.
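If the test had detected heteroskedasticity instead, a common remedy is to keep the OLS coefficients but use heteroskedasticity-robust (White) standard errors. A minimal sketch, refitting the same formula with statsmodels' robust covariance option:
<pre class="prettyprint linenums"># the coefficients are identical to before; only the standard errors,
# t-statistics and p-values are adjusted for heteroskedasticity
robust_model = sm.ols(formula = 'Close~MKT+SMB+HML+RMW+CMA',data = fama_df).fit(cov_type = 'HC3')
print(robust_model.summary())
</pre>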
<h2>Summary</h2>
In this chapter we introduced multiple linear regression, the F-test and basic residual analysis, which are the foundations of quantitative linear financial modeling. In the next chapter we will introduce some linear algebra, which is used in modern portfolio theory and the CAPM.