---
title: "Assignment 7 - Answers"
author: "Fernando Carnaúba"
date: "11/26/2019"
output: html_document
---

In the following assignment you will be looking at data from one level of an online course.

#Upload data
```{r}
library(ggplot2)
library(gridExtra)
library(dplyr)
library(ggcorrplot)
library(rpart)
library(party)
library(rattle)
library(RColorBrewer)

rm(list=ls())
df <- read.csv("~/R/HUDK/assignment7/online.data.csv")

```

#Visualization
```{r}

#Start by creating histograms of the distributions for all variables (#HINT: look up "facet" in the ggplot documentation)
df <- mutate(df, quartile = ntile(messages,4))

# Let's create histograms that show students who "levelled up" separately from those who didn't
h1 <- ggplot(df) +
geom_histogram(aes(x=pre.test.score, fill=level.up),
colour="grey50", alpha=0.5, position="identity")
h2 <- ggplot(df) +
geom_histogram(aes(x=post.test.score, fill=level.up),
colour="grey50", alpha=0.5, position="identity")
h3 <- ggplot(df) +
geom_histogram(aes(x=messages, fill=level.up),
colour="grey50", alpha=0.5, position="identity")
h4 <- ggplot(df) +
geom_histogram(aes(x=forum.posts, fill=level.up),
colour="grey50", alpha=0.5, position="identity")
h5 <- ggplot(df) +
geom_histogram(aes(x=av.assignment.score, fill=level.up),
colour="grey50", alpha=0.5, position="identity")
grid.arrange(h1,h2,h3,h4,h5, nrow = 3)

# Let's now create histograms for the different quartiles of "messages", to see if "messaging more" sweeps away the predictive power of other variables (i.e. students are learning when they post!)
h6 <- h1 + facet_wrap(~ quartile, nrow = 1)
h7 <- h2 + facet_wrap(~ quartile, nrow = 1)
h8 <- h4 + facet_wrap(~ quartile, nrow = 1)
h9 <- h5 + facet_wrap(~ quartile, nrow = 1)

grid.arrange(h6,h7,h8,h9, nrow = 4, top = "Histograms by 'messages' quartile")

#Then visualize the relationships between variables
df2 <- mutate(df, level.up = ifelse(level.up == "yes",1,0))
df2 <- select(df2,-quartile, -id)

corr <- round(cor(df2), 1)
ggcorrplot(corr, hc.order = TRUE, type = "lower", lab = TRUE)

#Try to capture an intuition about the data and the relationships

```

ANSWER:
- First image. Looking at the first plot, it becomes clear that pre-test scores, post-test scores and messages are strong predictors of "levelling up", since the distribution of each variable is shifted to the left for students who do not level up. Another interesting point is that "av.assignment.score" almost perfectly separates students who level up from those who don't, at the .25 threshold (see the quick check after this list). This might be a purely mechanical relationship (i.e. the game algorithm passes students with av.assignment.score > .25).

- Second image. We now want to investigate the variable "messages", because it is one that instructors could try to "foster" among students - an "actionable" variable. We reproduce the same histograms, but separately for each quartile of "messages". We can see that students with higher levels of messaging have much lower counts for "not levelling up" (lower bars in the graphs on the right, which represent the fourth quartile). Also, the distribution of pre- and post-test scores of those who fail becomes much more similar to that of students who pass. So "messaging" seems to have a positive effect. In the case of "av.assignment.score", we keep seeing a strong threshold, which is probably the formal rule the game uses to level students up.

- Third image. Looking at the correlation graph, levelling up is most strongly associated with "av.assignment.score", confirming what we discussed above. Also, of the two "actionable" variables ("messages" and "forum.posts"), "messages" is the one more strongly associated with levelling up.

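A quick sanity check of the apparent cut-off mentioned in the first bullet: the sketch below reuses "df" and the already-loaded dplyr; the .25 value is read off the histogram, not a documented rule. If the two ranges barely overlap around .25, the mechanical-threshold reading holds.

```{r}
# Range of av.assignment.score within each level.up group (sketch; reuses df and dplyr from above)
df %>%
  group_by(level.up) %>%
  summarise(min.av.score = min(av.assignment.score),
            max.av.score = max(av.assignment.score))
```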

#Classification tree
```{r}
#Create a classification tree that predicts whether a student "levels up" in the online course using three variables of your choice (As we did last time, set all controls to their minimums)
# [We will NOT use av.assignment.score, since we have sufficient evidence that this variable sets the actual threshold for levelling up - it is essentially the SAME variable, so there is no point in using it as a predictor. I have actually tested using it, and coupled with post.test.score it gives a perfect prediction, although one that doesn't make sense.]
# I chose "messages", "forum.posts" and "pre.test.score", which are variables that instructors can consult either at the beginning (pre-test) or during student use (messages and forum posts).
tree1 <- rpart(level.up ~ pre.test.score + messages + forum.posts, method="class", data=df, minsplit = 0, minbucket = 1, cp = 0.01)


#Plot and generate a CP table for your tree
fancyRpartPlot(tree1, caption = NULL)
printcp(tree1)

#Generate a probability value that represents the probability that a student levels up based on your classification tree

df$pred <- predict(tree1, type = "prob")[,2]
#Last class we used type = "class", which predicted the classification for us; this time we are using type = "prob" to see the probability that our classification is based on.

```


## Part II
#Now you can generate the ROC curve for your model. You will need to install the package ROCR to do this.
```{r}
library(ROCR)

#Plot the curve
pred.detail <- prediction(df$pred, df$level.up)
plot(performance(pred.detail, "tpr", "fpr"))
abline(0, 1, lty = 2)

#Calculate the Area Under the Curve
unlist(slot(performance(pred.detail,"auc"), "y.values"))
#Unlist liberates the AUC value from the "performance" object created by ROCR

#Now repeat this process, but using the variables you did not use for the previous model and compare the plots & results of your two models. Which one do you think was the better model? Why?

tree2 <- rpart(level.up ~ post.test.score + av.assignment.score, method="class", data=df, minsplit = 0, minbucket = 1, cp = 0.01)
fancyRpartPlot(tree2, caption = NULL)

df$pred2 <- predict(tree2, type = "prob")[,2]
#As before, type = "prob" gives the probability that the classification is based on.

#Plot the curve (model 2)
pred.detail2 <- prediction(df$pred2, df$level.up)
plot(performance(pred.detail2, "tpr", "fpr"))
abline(0, 1, lty = 2)

#Calculate the Area Under the Curve
unlist(slot(performance(pred.detail2,"auc"), "y.values"))
#Unlist liberates the AUC value from the "performance" object created by ROCR

```
ANSWER:
As mentioned before, using av.assignment.score doesn't seem to make sense, because this variable is probably what the system uses to level students up. Accordingly, the second model predicts "levelling up" perfectly (AUC = 1). It is also not a variable that is available to instructors before the game decides whether students level up. The same is true for post.test.score, since it is only available after students use the game, so it cannot serve as a predictor in practice. For me, the first model is therefore the only one that makes sense from a conceptual point of view.
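For a quick side-by-side view of the two AUC values, a small sketch reusing the pred.detail and pred.detail2 objects created in the chunk above:

```{r}
# AUC of both models in one named vector (reuses pred.detail / pred.detail2 from above)
c(model1 = unlist(slot(performance(pred.detail, "auc"), "y.values")),
  model2 = unlist(slot(performance(pred.detail2, "auc"), "y.values")))
```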

## Part III
#Thresholds
```{r}
#Look at the ROC plot for your first model. Based on this plot choose a probability threshold that balances capturing the most correct predictions against false positives. Then generate a new variable in your data set that classifies each student according to your chosen threshold.

# Let's try 50%

threshold.pred1 <- ifelse(df$pred > 0.5, 1, 0)

#Some useful parameters
df$true.pos.model1 <- threshold.pred1*df2$level.up
df$false.pos.model1 <- threshold.pred1*(1-df2$level.up)
df$true.neg.model1 <- (1-threshold.pred1)*(1-df2$level.up)
df$false.neg.model1 <- (1-threshold.pred1)*df2$level.up

#Now generate three diagnostics:

# accuracy = %predicted correctly
df$accuracy.model1 <- mean(ifelse(threshold.pred1 == df2$level.up,1,0))

# precision = true_pos/(true_pos + false_pos)
df$precision.model1 <-sum(df$true.pos.model1)/(sum(df$true.pos.model1)+sum(df$false.pos.model1))

# recall = true_pos/(true_pos + false_neg)
df$recall.model1 <-sum(df$true.pos.model1)/(sum(df$true.pos.model1)+sum(df$false.neg.model1))

#Calculate kappa (I'm calculating it directly from the definition)
po1 = mean(ifelse(threshold.pred1 == df2$level.up,1,0))
pe1 = mean(threshold.pred1)*mean(df2$level.up)
kappa1 = (po1-pe1)/(1-pe1)
kappa1
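# Cross-check: the standard two-class Cohen's kappa expected agreement sums over both
# classes, pe = P(pred = 1)*P(actual = 1) + P(pred = 0)*P(actual = 0).
# kappa1 above uses only the positive-class term, so the two values can differ.
pe1.full <- mean(threshold.pred1)*mean(df2$level.up) +
  (1 - mean(threshold.pred1))*(1 - mean(df2$level.up))
kappa1.full <- (po1 - pe1.full)/(1 - pe1.full)
kappa1.full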

#Now choose a different threshold value and repeat these diagnostics. What conclusions can you draw about your two thresholds?

# Let's try a higher threshold: 75%

threshold.pred2 <- ifelse(df$pred > 0.75, 1, 0)

#Some useful parameters
df$true.pos.model2 <- threshold.pred2*df2$level.up
df$false.pos.model2 <- threshold.pred2*(1-df2$level.up)
df$true.neg.model2 <- (1-threshold.pred2)*(1-df2$level.up)
df$false.neg.model2 <- (1-threshold.pred2)*df2$level.up

#Now generate three diagnostics:

# accuracy = %predicted correctly
df$accuracy.model2 <- mean(ifelse(threshold.pred2 == df2$level.up,1,0))

# precision = true_pos/(true_pos + false_pos)
df$precision.model2 <-sum(df$true.pos.model2)/(sum(df$true.pos.model2)+sum(df$false.pos.model2))

# recall = true_pos/(true_pos + false_neg)
df$recall.model2 <-sum(df$true.pos.model2)/(sum(df$true.pos.model2)+sum(df$false.neg.model2))

#Calculate kappa (I'm calculating it directly from the definition)
po2 = mean(ifelse(threshold.pred2 == df2$level.up,1,0))
pe2 = mean(threshold.pred2)*mean(df2$level.up)
kappa2 = (po2-pe2)/(1-pe2)
kappa2


```
MODEL 1 - Threshold 50%
Accuracy - .826
Precision - .731
Recall - .895
Kappa - .783


MODEL 2 - Threshold 75%
Accuracy - .772
Precision - .844
Recall - .527
Kappa - .743

Choosing a higher threshold has the following implications:
- fewer true positives
- fewer false positives
- more true negatives
- more false negatives

Using the 50% threshold in our first model as a baseline, as we increase the threshold we reduce accuracy, because we are departing from the 50% cut-off that maximizes our "correct guesses". We also increase precision but reduce recall and kappa.
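To make this trade-off concrete, here is a minimal sketch (reusing df$pred and df2$level.up from the chunks above; the grid of cut-offs is arbitrary) that recomputes the three diagnostics for several thresholds:

```{r}
# Sweep a few candidate thresholds and recompute accuracy, precision and recall
# (sketch; reuses df$pred and df2$level.up defined above)
for (th in c(0.25, 0.5, 0.75)) {
  pred.bin <- ifelse(df$pred > th, 1, 0)
  tp <- sum(pred.bin*df2$level.up)
  fp <- sum(pred.bin*(1 - df2$level.up))
  fn <- sum((1 - pred.bin)*df2$level.up)
  cat("threshold:", th,
      "accuracy:", round(mean(pred.bin == df2$level.up), 3),
      "precision:", round(tp/(tp + fp), 3),
      "recall:", round(tp/(tp + fn), 3), "\n")
}
```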

### To Submit Your Assignment
