Coursera Practical Machine Learning - Course Project
==========================================================
Using activity-tracker data, this project builds a machine learning model to predict which activity an individual performed. The model was built with the caret and randomForest packages and correctly predicted all 20 quiz test cases. For reproducibility, the seed is set to 2017:
```{r}
library(Hmisc)
library(caret)
library(randomForest)
library(foreach)
library(doParallel)
set.seed(2017)
options(warn = -1)
```
First, both the training and test data are loaded; the string "#DIV/0!" is read as NA for consistency:
```{r}
trainData <- read.csv("pml-training.csv", na.strings = c("#DIV/0!"))
testData <- read.csv("pml-testing.csv", na.strings = c("#DIV/0!"))
```
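One caveat worth flagging: `read.csv` treats only the literal string `"NA"` as missing by default, and supplying `na.strings` replaces that default rather than extending it. A common variant (an assumption here, not the author's code) lists `"NA"` and the empty string as well:

```{r}
## Variant (assumption): treat "NA", empty cells, and "#DIV/0!" all as missing,
## so the later colSums(is.na(.)) filter also catches blank columns directly.
df <- read.csv(text = "a,b\n1,#DIV/0!\n2,",
               na.strings = c("NA", "", "#DIV/0!"))
is.na(df$b)   # both rows are NA
```

In the original code this matters less than it might, because the numeric cast below also turns non-numeric strings into NA.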
Columns 8 through the second-to-last are cast to numeric for easier handling; the final column (`classe` in the training set, `problem_id` in the test set) is left untouched:
```{r}
for (i in 8:(ncol(trainData) - 1)) {
  trainData[, i] <- as.numeric(as.character(trainData[, i]))
}
for (i in 8:(ncol(testData) - 1)) {
  testData[, i] <- as.numeric(as.character(testData[, i]))
}
```
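A subtlety in loops like the one above: in R, `:` binds more tightly than `-`, so `8:n - 1` evaluates as `(8:n) - 1` (the whole sequence shifted down by one), not `8:(n - 1)`. A minimal base-R illustration:

```{r}
n <- 10
8:n - 1     # (8:n) - 1  ->  7 8 9
8:(n - 1)   # the intended range  ->  8 9
```

Parenthesizing the upper bound keeps the cast off the final column.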
Next, only columns with no NA values are kept, and the first seven bookkeeping columns (row index, user name, timestamps, and window indicators) are dropped:
```{r}
mainSet <- colnames(trainData[colSums(is.na(trainData)) == 0])[-(1:7)]
modelData <- trainData[mainSet]
mainSet
```
The cleaned data is then split into a training set (75%) and a validation set (25%):
```{r}
ids <- createDataPartition(y = modelData$classe, p = 0.75, list = FALSE)
training <- modelData[ids,]
testing <- modelData[-ids,]
```
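`createDataPartition` performs a stratified split, keeping each `classe` level at roughly the requested proportion. What that means can be sketched in base R alone (the toy labels below are an assumption, not the project data):

```{r}
set.seed(2017)
classe <- factor(rep(c("A", "B", "C"), times = c(40, 40, 20)))
## Sample 75% of the row indices within each class separately.
idx <- unlist(tapply(seq_along(classe), classe,
                     function(i) sample(i, size = floor(0.75 * length(i)))))
table(classe[idx])   # A: 30, B: 30, C: 15 -- each class keeps 75%
```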
Six random forests of 150 trees each (900 trees in total) are built with parallel processing and merged with `randomForest::combine`, which speeds up training considerably:
```{r}
registerDoParallel()
x <- training[-ncol(training)]
y <- training$classe
rf <- foreach(ntree = rep(150, 6), .combine = randomForest::combine, .packages = 'randomForest') %dopar% {
randomForest(x, y, ntree=ntree)
}
```
Error reports for both the training set (in-sample) and the held-out validation set, which estimates out-of-sample performance:
```{r}
predictions1 <- predict(rf, newdata = training)
confusionMatrix(predictions1, training$classe)
predictions2 <- predict(rf, newdata = testing)
confusionMatrix(predictions2, testing$classe)
```
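The estimated out-of-sample error is simply one minus the accuracy reported for the validation set above. A toy base-R sketch of the calculation (the vectors are assumptions, not model output):

```{r}
actual    <- c("A", "A", "B", "B", "C", "C")
predicted <- c("A", "A", "B", "C", "C", "C")
oos_error <- 1 - mean(predicted == actual)   # 1 mismatch out of 6
oos_error                                    # 1/6
```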
Conclusions
--------------------------------
The model scored 20/20 on the quiz, confirming that it is quite accurate. Finally, the quiz predictions are written to individual files for submission:
```{r}
pml_write_files <- function(x) {
  n <- length(x)
  for (i in 1:n) {
    filename <- paste0("problem_id_", i, ".txt")
    write.table(x[i], file = filename, quote = FALSE,
                row.names = FALSE, col.names = FALSE)
  }
}
x <- testData
x <- x[mainSet[mainSet!='classe']]
answers <- predict(rf, newdata=x)
answers
```