#' #### Example exercise: "Residential location satisfaction in the Lisbon metropolitan area"
#'
#' The aim of this study was to examine households' perceptions of their residential location, considering several land use and accessibility factors as well as household socioeconomic and attitudinal characteristics.
#'
#' _Reference:_ Martinez, L. G., de Abreu e Silva, J., & Viegas, J. M. (2010). Assessment of residential location satisfaction in the Lisbon metropolitan area, TRB (No. 10-1161).
#'
#' **Your task:** Analyse the data and create meaningful latent factors.
#'
#' ## Data
#' #### Variables:
#'
#' * `DWELCLAS`: Classification of the dwelling;
#' * `INCOME`: Income of the household;
#' * `CHILD13`: Number of children under 13 years old;
#' * `H18`: Number of household members above 18 years old;
#' * `HEMPLOY`: Number of household members employed;
#' * `HSIZE`: Household size;
#' * `IAGE`: Age of the respondent;
#' * `ISEX`: Sex of the respondent;
#' * `NCARS`: Number of cars in the household;
#' * `AREA`: Area of the dwelling;
#' * `BEDROOM`: Number of bedrooms in the dwelling;
#' * `PARK`: Number of parking spaces in the dwelling;
#' * `BEDSIZE`: BEDROOM/HSIZE;
#' * `PARKSIZE`: PARK/NCARS;
#' * `RAGE10`: 1 if Dwelling age <= 10;
#' * `TCBD`: Private car distance in time to CBD;
#' * `DISTTC`: Euclidean distance to heavy public transport system stops;
#' * `TWCBD`: Private car distance in time of workplace to CBD;
#' * `TDWWK`: Private car distance in time of dwelling to work place;
#' * `HEADH`: 1 if Head of the Household;
#' * `POPDENS`: Population density per hectare;
#' * `EQUINDEX`: Number of undergraduate students/Population over 20 years old (500m)
#'
#'
#' #### Rules of thumb:
#' * At least 10 variables;
#' * Sample size: n < 50 is unacceptable; n > 200 is recommended;
#' * Continuous variables are recommended. If your data contains categorical variables, transform them into dummy variables first (see the sketch after this list).
#'
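#' A minimal, self-contained sketch of the dummy-variable step using a toy factor column (hypothetical data, not from this dataset):
toy <- data.frame(sex = factor(c("male", "female", "female")))
model.matrix(~ sex, data = toy)[, -1] # 0/1 dummy column(s); [, -1] drops the intercept
#'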
#' #### Assumptions:
#' * Normality;
#' * Linearity;
#' * Homogeneity;
#' * Homoscedasticity (some multicollinearity is desirable);
#' * Correlations between variables: if most correlations are below 0.3, factor analysis is not appropriate.
#'
#' ## Let's start!
#' #### Import Libraries
library(foreign) # Library used to read SPSS files
library(nFactors) # Library used for factor analysis
library(tidyverse) # Library used in data science to perform exploratory data analysis
library(summarytools) # Library used for checking the summary of the dataset
library(psych) # Library used for factor analysis
library(GPArotation) # Library used for factor analysis
#'
#' ### Get to know your dataset
#' ##### Import dataset
#'
df <- read.spss("Data/example_fact.sav", to.data.frame = TRUE) # to.data.frame = TRUE returns a data.frame instead of a list
#'
#' ##### Take a look at the main characteristics of the dataset
#'
class(df) #type of data
str(df)
#'
#' ##### Check summary statistics of variables
descriptive_stats <- dfSummary(df)
view(descriptive_stats)
#' > **Note:** Here I used a different library from the one in the MLR chapter for performing the summary statistics. R allows you to do the same or similar tasks with different packages.
#'
#' ##### Take a look at the first values of the dataset
#'
head(df,5)
#'
#' ##### Make ID as row names or case number
#'
df <- data.frame(df, row.names = 1) # move the first column (ID) into the row names
#'
#' ### Evaluating the assumptions for factor analysis
#'
#' Let's regress a random outcome on all variables in order to evaluate some of the assumptions
#'
random = rchisq(nrow(df), 32) # random chi-squared outcome, one value per observation
fake = lm(random ~ ., data = df) # regress the random outcome on all variables
standardized = rstudent(fake) # studentized residuals
fitted = scale(fake$fitted.values) # standardized fitted values
#'
#' * **Normality**
#'
hist(standardized)
#'
#' * **Linearity**
#'
qqnorm(standardized)
abline(0,1)
#'
#' * **Homogeneity**
#'
plot(fitted, standardized)
abline(h=0,v=0)
#'
#' * **Correlations between variables**
#' Correlation matrix
corr_matrix <- cor(df, method = "pearson")
#'
#' The **Bartlett** test of sphericity examines whether the correlation matrix is an identity matrix, i.e., whether the variables are uncorrelated. Check for correlation adequacy - Bartlett's Test.
cortest.bartlett(corr_matrix, n = nrow(df))
#' > **Note:** The null hypothesis is that there is no correlation between variables. Therefore, in factor analysis you want to reject the null hypothesis.
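#'
#' A small sketch making that decision explicit (the result object stores the p-value):
bartlett <- cortest.bartlett(corr_matrix, n = nrow(df))
bartlett$p.value < 0.05 # TRUE: reject the null hypothesis of no correlation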
#'
#' ##### Check for sampling adequacy - KMO test
#'
KMO(corr_matrix)
#'
#' > **Note:** We want an overall Measure of Sampling Adequacy (MSA) of at least 0.7. If 0.6 < MSA < 0.7, it is not a good value, but acceptable in some cases.
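#'
#' `KMO()` also reports one MSA value per variable (component `MSAi`); a sketch for spotting weak variables:
kmo <- KMO(corr_matrix)
sort(kmo$MSAi) # variables with individual MSA below 0.5 are candidates to drop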
#'
#' ##### Determine the number of factors to extract
#'
#' **1. Parallel Analysis**
#'
num_factors = fa.parallel(df, fm = "ml", fa = "fa")
#'
#' > **Note:** `fm` = factoring method; `"ml"` = maximum likelihood; `fa = "fa"` shows the eigenvalues for a factor analysis
#'
#' The selection of the number of factors in the Parallel analysis can be threefold (see also the sketch after this list):
#'
#' * Detect where there is an "elbow" in the graph;
#' * Detect the intersection between the "FA Actual Data" and the "FA Simulated Data";
#' * Consider the number of factors with eigenvalue > 1.
#'
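#' The object returned by `fa.parallel()` also stores its own suggestion, which you can compare against the criteria above:
num_factors$nfact # number of factors suggested by parallel analysis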
#'
#' **2. Kaiser Criterion**
#'
sum(num_factors$fa.values > 1) #Determines the number of factors with eigenvalue > 1
#'
#' You can also consider factors with eigenvalue > 0.7, since some of the literature indicates that this cutoff does not overestimate the number of factors as much as a cutoff of 1 does.
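#' The same check under that more liberal cutoff:
sum(num_factors$fa.values > 0.7) # number of factors with eigenvalue > 0.7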
#'
#' **3. Principal Component Analysis (PCA)**
#'
#' * Print variance that explains the components
df_pca <- princomp(df, cor=TRUE) #cor = TRUE, standardizes your dataset before running a PCA
summary(df_pca)
#'
#' * Scree Plot
plot(df_pca, type = "lines", npcs = length(df_pca$sdev)) # scree plot of all components
#'
#' > **Note:** Check the cumulative variance of the first components and the scree plot, and see if the PCA is a good approach to detect the number of factors in this case.
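#'
#' A one-line sketch of that cumulative variance check, computed from the component standard deviations:
cumsum(df_pca$sdev^2) / sum(df_pca$sdev^2) # cumulative proportion of variance explained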
#'
#' **PCA is not the same thing as Factor Analysis!** PCA accounts for the total variance of the variables, while factor analysis models only the variance they share in common and treats each variable's unique variance separately. The two approaches are often mixed up. In this example we use PCA only as a first criterion for choosing the number of factors. PCA is widely used in image recognition and in data reduction for big data.
#'
#' ## Exploratory Factor Analysis
#'
#' * **Model 1**: No rotation
#' * **Model 2**: Rotation Varimax
#' * **Model 3**: Rotation Oblimin
# No rotation. Note: factanal() always fits by maximum likelihood, so no fm argument is needed.
df_factor <- factanal(df, factors = 4, rotation = "none", scores = "regression")
# Varimax rotation
df_factor_var <- factanal(df, factors = 4, rotation = "varimax", scores = "regression")
# Oblimin rotation (requires GPArotation)
df_factor_obl <- factanal(df, factors = 4, rotation = "oblimin", scores = "regression")
#'
#' Let's print out the results of `df_factor` (the unrotated solution), and have a look.
print(df_factor, digits = 2, cutoff = 0.3, sort = TRUE) # cutoff of 0.3 because the sample has more than 350 observations
#'
#' > **Note:**
#' The total variance of each variable = communality + uniqueness (a quick verification follows below).
#' Varimax is an orthogonal rotation, while oblimin is an oblique rotation.
#'
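#' A sketch verifying that identity on the unrotated model, plus a peek at the regression scores we requested:
communality <- rowSums(df_factor$loadings[, ]^2) # sum of squared loadings per variable
round(communality + df_factor$uniquenesses, 2) # should be ~1 for every variable
head(df_factor$scores) # factor scores, one row per observation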
#'
#' Plot factor 1 against factor 2, and compare the results of different rotations
#'
#' * **No Rotation**
plot(
df_factor$loadings[, 1],
df_factor$loadings[, 2],
xlab = "Factor 1",
ylab = "Factor 2",
ylim = c(-1, 1),
xlim = c(-1, 1),
main = "No rotation"
)
abline(h = 0, v = 0)
load <- df_factor$loadings[, 1:2]
text(
load,
labels = names(df),
cex = .7,
col = "blue"
)
#'
#' * **Varimax rotation**
plot(
df_factor_var$loadings[, 1],
df_factor_var$loadings[, 2],
xlab = "Factor 1",
ylab = "Factor 2",
ylim = c(-1, 1),
xlim = c(-1, 1),
main = "Varimax rotation"
)
abline(h = 0, v = 0)
load <- df_factor_var$loadings[, 1:2]
text(
load,
labels = names(df),
cex = .7,
col = "red"
)
#'
#'
#' * **Oblimin Rotation**
plot(
df_factor_obl$loadings[, 1],
df_factor_obl$loadings[, 2],
xlab = "Factor 1",
ylab = "Factor 2",
ylim = c(-1, 1),
xlim = c(-1, 1),
main = "Oblimin rotation"
)
abline(h = 0, v = 0)
load <- df_factor_obl$loadings[, 1:2]
text(
load,
labels = names(df),
cex = .7,
col = "darkgreen"
)
#'
#' When you have more than two factors, it is difficult to analyse them through pairwise plots: a variable with low loadings on the two factors shown may load highly on factors not present in the graph. However, try comparing the plots with the factor loadings and plotting the other factor pairs to get more familiar with exploratory factor analysis. A more compact overview of all factors at once is sketched below.
#'
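#' One compact option is `psych::fa.diagram()`, which accepts a loadings matrix directly (a sketch on the varimax solution):
fa.diagram(df_factor_var$loadings, cut = 0.3, simple = TRUE) # show only each variable's largest loading, hiding loadings below 0.3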