-
Notifications
You must be signed in to change notification settings - Fork 45
/
hackathon3 - solution.Rmd
116 lines (74 loc) · 2.31 KB
/
hackathon3 - solution.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
---
title: "Hackathon 3 - clustering"
author: "Charles Lang"
date: "10/29/2020"
output: html_document
---
# Solution
```{r}
library(dplyr)
library(tidyr)
#Difference between anaylsis for advice and analysis for automation/production
#Predicting the future vs. predicting as characterization
D1 <- read.csv("engagement-data.csv", header = TRUE)
D2 <- read.csv("student-level.csv", header = TRUE)
#Look at variable types
#Get a fell for the data - structure, meaning of variables, ranges
# Restrucre and combine
D1b <- unite(D1, modality, c("modality", "week"), sep = "")
D1b <- spread(D1b, modality, measure)
D3 <- full_join(D1b,D2, by = "id")
#Visualize Engagement
summary(D1b$forum1)
hist(D1b$forum4, breaks = 100)
summary(D1b$game4)
hist(D1b$game2, breaks = 10)
summary(D1b$video1)
# Negative video watching recoded to 0
D1b$video1 <- ifelse(D1b$video1 < 0, 0, D1b$video1)
D1b$video2 <- ifelse(D1b$video2 < 0, 0, D1b$video2)
D1b$video6 <- ifelse(D1b$video6 < 0, 0, D1b$video6)
hist(D1b$video1, breaks = 100)
summary(D1b$logins1)
hist(D1b$logins1)
D4 <- select(D3, c(2,8,12,18))
pairs(D4)
# Visualize demographics
plot(D3$gender)
plot(D3$gender, D3$forum2)
table(D3$gender, D3$forum2)
plot(D3$gender, D3$video1)
#chisq.test(D3$gender, D3$video1)
plot(D3$parent.ed)
# Visualize Performance
hist(D3$exam)
plot(D3$exam, D3$forum1)
plot(D3$exam, D3$video1)
D4 <- select(D3, c(2,8,12,18,34))
pairs(D4)
D5 <- select(D3, c(24:34))
pairs(D5)
#I can be pretty confident in a theory that there are four groups, low/low, high/high, low/high, high/low
```
# k-means
Benefits
* Works with widely any number of variables
* Very fast
* Will find clusters
Gotchas
* Must be on the same scale
* Need to provide cluster number
* Assumes there are clusters to find - it will find clusters regardless of whether there are any or not
* Does not work on some distributional shapes
* Need uniform scale (uniform variance) - larger scale will swamp smaller scale
* Doesn’t work on categorical data of more than two categories (and the scale may be difficult to interpret)
* Can get stuck on local minima (need to run iterations)
* Too easy
```{r}
D5 <- select(D3, 2:34) #remove id variable!
D5 <- scale(D5)
D5 <- data.frame(D5)
fit <- kmeans(D5, 4)
D6 <- data.frame(D3, fit$cluster)
plot(D6$forum1, D6$assignment1, col = D6$fit.cluster)
```