-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathFinalProject.Rmd
152 lines (112 loc) · 5.89 KB
/
FinalProject.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
---
title: "FinalProject"
author: "Stephen Diehl"
date: "June 10, 2017"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Overview
This is the final project for the Data Science Specialization Course: Practical Machine Learning.
## Background
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
## Data
The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose please cite them as they have been very generous in allowing their data to be used for this kind of assignment.
## Reading in the Data
In this data set, there are 3 representations for NA. Account for this in read_csv.
```{r warning=FALSE, message=FALSE}
library(tidyverse)
library(caret)
pml_train <- read_csv("./data/pml-training.csv", na=c("NA", "", "#DIV/0!"))
pml_test <- read_csv("./data/pml-testing.csv", na=c("NA", "", "#DIV/0!"))
print(paste("Train", paste(dim(pml_train), collapse=" ")))
print(paste("Test", paste(dim(pml_test), collapse=" ")))
```
## Cleaning the Data
Many columns are almost entirely NAs. Examine the distribution of NAs.
```{r}
# get the counts of NA in each column
na_count <-sapply(pml_train, function(y) sum(length(which(is.na(y)))))
# look at the distribution of counts
table(na_count)
```
We see 57 columns with no NAs, and 3 columns with 1 NA. The rest of the columns are over 97% NA.
Also, a visual examination of the data shows that the first 6 columns are not useful as predictors. Remove these as well.
```{r}
# remove the columns with 19216 or more NAs
cols_to_remove <- which(na_count >= 19216)
# also remove the first 6 columns
cols_to_remove <- union(cols_to_remove, 1:6)
# whatever is done to one data set must be done to the other
train_data <- subset(pml_train, select = -cols_to_remove)
test_data <- subset(pml_test, select = -cols_to_remove)
print(paste("train_data", paste(dim(train_data), collapse=" ")))
print(paste("test_data", paste(dim(test_data), collapse=" ")))
```
So we are down to 19622 rows and 54 columns on train_data.
Let's just look at complete cases (rows with no NAs).
```{r}
# only keep records with no NAs
train_data <- train_data[complete.cases(train_data),]
print(paste("train_data", paste(dim(train_data), collapse=" ")))
print(paste("test_data", paste(dim(test_data), collapse=" ")))
```
We see one row was removed from train_data.
Normally data would be split something like 70% train and 30% test, or perhaps 60% train, 20% validation and 20% test.
Here we see that train_data (pml_train with fewer columns) has 19621 records but test_data (pml_test with fewer columns) only has 20 records. 20 records is much too small to be useful
as a test set for estimating model accuarcy. pml_test will only be used for making predictions against it as required by the last Quiz in this course.
We need to create a train and test set from train_data.
```{r}
set.seed(314159) # so createDataPartition is reproducible
indexes <- createDataPartition(train_data$classe, p=0.70, list=F)
train <- train_data[indexes, ]
test <- train_data[-indexes, ]
print(paste("train", paste(dim(train), collapse=" ")))
print(paste("test", paste(dim(test), collapse=" ")))
```
## Building the Model
### For Performance Only
My Linux computer has 4 cores. Use all 4 cores to create the Random Forest.
```{r message=FALSE}
library(doParallel)
# be sure to stop an existing cluster before starting a new one
if (exists("cl")) {
stopCluster(cl)
}
# FORK only works on Linux
cl <- makeCluster(4, type="FORK")
registerDoParallel(cl)
```
### Create the Random Forest
Use 5 fold cross validation. Allow parallel processing.
```{r message=FALSE}
controlRF <- trainControl(method = "cv",
number = 5,
allowParallel = TRUE)
modelRF <- train(classe ~ ., data=train, method="rf", trControl=controlRF, ntree=250)
modelRF
```
## Estimate the Accuarcy
The caret package automatically tuned the mtry hyperparameter to be 27. As the train data was used to find the best model, we cannot also use it to estimate the accuarcy of the best model it found (without being overly optimistic).
```{r}
# use the best RF model (mtry=27) on the test data and compute the accuarcy
predictRf <- predict(modelRF, test)
confusionMatrix(test$classe, predictRf)
```
We see from above that the estimated accuarcy is 99.71% with a 95% confidence interval of the accuarcy being between 99.54% and 99.83%.
As this model was run on out-of-sample data, 99.71% is the out-of-sample accuarcy. The out-of-sample error rate is the compliment of this, which is 0.29%
## Predict on test_data for Quiz
```{r}
# test_data is pml_test with appropriate columns removed
predict(modelRF, test_data)
```
## House Keeping for Parallel Processing Only
It's a good idea to shut down the parallel cluster explicitly to ensure no memory leaks.
```{r}
stopCluster(cl)
```