---
title: "HUDK4051: Prediction - Comparing Trees"
author: "Xingyi Xie"
date: "4/14/2021"
output: html_document
---
In this assignment you will be modelling student data using three flavors of tree algorithm: CART, C4.5 and C5.0. We will be using these algorithms to attempt to predict which students drop out of courses. Many universities have a problem with students over-enrolling in courses at the beginning of the semester and then dropping most of them as they make decisions about which classes to attend. This makes it difficult to plan for the semester and allocate resources. However, schools don't want to restrict the choice of their students. One solution is to create predictions of which students are likely to drop out of which courses and use these predictions to inform semester planning.
In this assignment we will be using the tree algorithms to build models of which students are likely to drop out of which classes.
## Software
In order to generate our models we will need several packages. The first package you should install is [caret](https://cran.r-project.org/web/packages/caret/index.html).
There are many prediction packages available and they all have slightly different syntax. caret is a package that brings all the different algorithms under one roof using the same syntax.
We will also be accessing an algorithm from the [Weka suite](https://www.cs.waikato.ac.nz/~ml/weka/). Weka is a collection of machine learning algorithms that have been implemented in Java and made freely available by the University of Waikato in New Zealand. To access these algorithms you will first need to install both the [Java Runtime Environment (JRE) and Java Development Kit](http://www.oracle.com/technetwork/java/javase/downloads/jre9-downloads-3848532.html) on your machine. You can then install the [RWeka](https://cran.r-project.org/web/packages/RWeka/index.html) package within R.
**Weka requires Java and Java causes problems. If you cannot install Java and make Weka work, please follow the alternative instructions at line 121.**
(Issue 1: if RWeka/RWekajars fails to install, run `sudo R CMD javareconf` in a terminal and try the installation again.)
The last package you will need is [C50](https://cran.r-project.org/web/packages/C50/index.html).
## Data
The data comes from a university registrar's office. The code book for the variables is available in the file code-book.txt. Examine the variables and their definitions.
Upload the drop-out.csv data into R as a data frame.
```{r}
data = read.table("drop-out.csv",header=T, sep=",")
```
The next step is to separate your data set into a training set and a test set. Randomly select 25% of the students to be the test data set and leave the remaining 75% for your training data set. (Hint: each row represents an answer, not a single student.)
```{r}
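library(caret)
library(lattice)
library(ggplot2)
set.seed(4051) #Arbitrary seed so the random split is reproducible
inTrain <- createDataPartition(y = data$complete, p = 0.75, list = FALSE) #Row indices for the 75% training sample
TRAIN1 <- data[inTrain, ]
TEST1 <- data[-inTrain, ]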
```
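The hint above notes that rows are not students, so a row-level partition can place the same student in both sets. A minimal base R sketch of a student-level split follows. It uses a small made-up data frame rather than drop-out.csv: the values are hypothetical, and only the column names `student_id` and `complete` come from the assignment.

```r
# Toy stand-in for drop-out.csv (values are made up for illustration)
set.seed(4051)
toy <- data.frame(
  student_id = rep(1:20, each = 3),
  complete   = sample(c("yes", "no"), 60, replace = TRUE)
)

# Sample 25% of the unique students, not 25% of the rows
ids      <- unique(toy$student_id)
test_ids <- sample(ids, size = round(0.25 * length(ids)))

toy_test  <- toy[toy$student_id %in% test_ids, ]
toy_train <- toy[!toy$student_id %in% test_ids, ]

# Number of students that appear in both sets (should be zero)
length(intersect(toy_train$student_id, toy_test$student_id))
```

Splitting on the student identifier keeps all of a student's rows together, which avoids testing on students the model has already seen during training.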
For this assignment you will be predicting the student level variable "complete".
(Hint: make sure you understand the increments of each of your chosen variables, as this will impact your tree construction.)
Visualize the relationships between your chosen variables as a scatterplot matrix. Save your image as a .pdf named scatterplot_matrix.pdf. Based on this visualization do you see any patterns of interest? Why or why not?
```{r}
car::scatterplotMatrix(TRAIN1[c("years", "entrance_test_score", "courses_taken")],
                       smooth = list(spread = TRUE, lty.smooth = 2, lwd.smooth = 3, lty.spread = 3, lwd.spread = 2))
```
## CART Trees
You will use the [rpart package](https://cran.r-project.org/web/packages/rpart/rpart.pdf) to generate CART tree models.
Construct a classification tree that predicts complete using the caret package.
```{r}
library(caret)
TRAIN2 <- TRAIN1[,c(2:10)] #Remove the student_id variable that we do not want to use in the model
#caret does not summarize the metrics we want by default so we have to modify the output
MySummary <- function(data, lev = NULL, model = NULL){
df <- defaultSummary(data, lev, model)
tc <- twoClassSummary(data, lev, model)
pr <- prSummary(data, lev, model)
out <- c(df,tc,pr)
out}
#Define the control elements we would like to use
ctrl <- trainControl(method = "repeatedcv", #Tell caret to perform repeated k-fold cross validation
                     repeats = 3, #Repeat the whole k-fold procedure three times
                     classProbs = TRUE, #Calculate class probabilities
                     summaryFunction = MySummary)
#Define the model
cartFit <- train(complete ~ ., #Define which variable to predict
                 data = TRAIN2, #Define the data set to train the model on
                 trControl = ctrl, #Tell caret the control elements
                 method = "rpart", #Define the model type
                 metric = "Accuracy", #Final model choice is made according to accuracy
                 preProc = c("center", "scale")) #Center and scale the predictors
#Check the results
cartFit
```
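To make the resampling scheme in `trainControl` concrete, here is a minimal base R sketch of how repeated k-fold cross validation assigns rows to folds. The row count is hypothetical, and `k = 10` is assumed because that is caret's default `number` for cross validation. caret also tries to balance the outcome across folds, which this plain random assignment does not do.

```r
set.seed(4051)
n       <- 50   # hypothetical number of training rows
k       <- 10   # folds per repeat (assumed: caret's default `number`)
repeats <- 3    # matches repeats = 3 in the control object

# For each repeat, shuffle a balanced assignment of rows to folds 1..k
fold_ids <- replicate(repeats, sample(rep(seq_len(k), length.out = n)))

dim(fold_ids)                # n rows by `repeats` columns
n_resamples <- k * repeats   # models fitted and scored per tuning value
n_resamples
```

Each column of `fold_ids` is one complete k-fold partition, so every row is held out exactly once per repeat.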
Describe important model attributes of your tree. Do you believe it is a successful model of student performance, why/why not?
```{r}
importance = varImp(cartFit,scale = FALSE)
importance
```
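`course_id` is the most important feature.
The model appears valid: the ROC area reaches 0.88, recall reaches 0.66, and precision reaches 0.98. Predictions of the positive class are therefore almost always correct, although a recall of 0.66 means about a third of the positive cases are missed.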
Can you use the sensitivity and specificity metrics to calculate the F1 metric?
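F1 = 2 * precision * recall / (precision + recall). Recall equals sensitivity, but precision also depends on the class prevalence, so sensitivity and specificity alone are not enough.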
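A worked sketch of the F1 calculation with hypothetical confusion-matrix counts (not taken from the assignment data): precision is recovered from sensitivity, specificity and prevalence, and the resulting F1 is checked against the value computed directly from the counts.

```r
# Hypothetical confusion-matrix counts (not from the assignment data)
tp <- 60; fn <- 40; fp <- 5; tn <- 395

sensitivity <- tp / (tp + fn)   # identical to recall
specificity <- tn / (tn + fp)
prevalence  <- (tp + fn) / (tp + fn + fp + tn)

# Precision from sensitivity, specificity and prevalence (Bayes' rule)
precision <- (sensitivity * prevalence) /
  (sensitivity * prevalence + (1 - specificity) * (1 - prevalence))

f1 <- 2 * precision * sensitivity / (precision + sensitivity)

# Same quantity computed directly from the counts
f1_counts <- 2 * tp / (2 * tp + fp + fn)
all.equal(f1, f1_counts)
```

The two routes agree only because prevalence is supplied. With a different class balance, the same sensitivity and specificity would give a different F1.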
Now predict results from the test data and describe important attributes of this test. Do you believe it is a successful model of student performance, why/why not?
```{r}
TEST2 <- TEST1[,c(2:10)] #Remove the student_id variable that we do not want to use in the model
#Generate prediction using previously trained model
cartClasses <- predict(cartFit, newdata = TEST2)
#Generate model statistics
confusionMatrix(data = cartClasses, as.factor(TEST2$complete))
```
The model remains valid on the test data: accuracy reaches 0.90.
## Conditional Inference Trees
Train a Conditional Inference Tree using the `party` package on the same training data and examine your results.
```{r}
#condFit <- ctree(complete ~ ., data = Train1)
library(party)
library(grid)
library(mvtnorm)
library(caret)
library(modeltools)
library(stats4)
library(strucchange)
library(zoo)
#Define the model
TRAIN2$complete<-as.factor(TRAIN2$complete)
condFit <- ctree(complete ~ years+entrance_test_score+courses_taken+gender, data = TRAIN2)
#Check the results
condFit
```
```{r}
plot(condFit)
```
Describe important model attributes of your tree. Do you believe it is a successful model of student performance, why/why not?
`years` is the most important variable because it is used for the split at the first level of the tree, followed by `courses_taken`.
The model appears valid: accuracy also reaches 0.88 and specificity reaches 1, which is very high.
What does the plot represent? What information does this plot tell us?
The plot shows which features were used for each split and the split point chosen for each feature. It also indicates feature importance: the earlier a feature is used for a split, the more information it carries.
Now test your new Conditional Inference model by predicting the test data and generating model fit statistics.
```{r}
condFit.pred <- predict(condFit, newdata = TEST2, type = 'response')
confusionMatrix(condFit.pred, as.factor(TEST2$complete))
```
There is an updated version of the C4.5 model called C5.0, which is implemented in the C50 package. What improvements have been made to the newer version?

- Speed: C5.0 is faster than C4.5.
- Memory: C5.0 uses memory more efficiently.
- Smaller decision trees: C5.0 obtains very similar results to C4.5 but builds considerably smaller trees.
- Similar accuracy: C5.0 obtains accuracy similar to C4.5.
- Boosting support: boosting can make the decision tree more accurate.
- Weighting: C5.0 can weight different cases and misclassification types, so it can build classifiers that minimize the expected misclassification cost instead of the error rate.

Install the C50 package, train and then test the C5.0 model on the same data.
```{r}
library(C50)
c50Fit <- C5.0(complete ~ years+entrance_test_score+courses_taken+enroll_date_time+international+online+gender, data = TRAIN2)
summary(c50Fit)
```
```{r}
c50Fit.pred <- predict(c50Fit, newdata = TEST2)
confusionMatrix(c50Fit.pred, as.factor(TEST2$complete))
```
## Compare the models
caret allows us to compare all three models at once.
```{r}
library(caret)
list(cart = cartFit, condinf = condFit, cfiveo = c50Fit)
```
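The list above only bundles the three fitted objects. As far as I know, caret's `resamples()` expects models fitted with `train()`, and here that is only `cartFit`. One alternative is to put the test-set metrics side by side. The sketch below does this in base R with made-up predictions. `test_metrics` is a hypothetical helper name. In the assignment the inputs would be `TEST2$complete`, `cartClasses`, `condFit.pred` and `c50Fit.pred`.

```r
# Hypothetical truth and predictions standing in for the real test results
lv    <- c("yes", "no")
truth <- factor(c("yes", "yes", "yes", "no", "no", "no", "no", "no"), levels = lv)
preds <- list(
  cart    = factor(c("yes", "yes", "no", "no", "no", "no", "no", "yes"), levels = lv),
  condinf = factor(c("yes", "no", "no", "no", "no", "no", "no", "no"), levels = lv)
)

# Hypothetical helper: accuracy, sensitivity and specificity, treating the
# first factor level as the positive class (caret's confusionMatrix default)
test_metrics <- function(pred, truth) {
  pos <- levels(truth)[1]
  c(Accuracy    = mean(pred == truth),
    Sensitivity = mean(pred[truth == pos] == pos),
    Specificity = mean(pred[truth != pos] != pos))
}

# One row per model, one column per metric
comparison <- t(sapply(preds, test_metrics, truth = truth))
round(comparison, 3)
```

A single table like this makes trade-offs visible, for example one model reaching perfect specificity while finding fewer of the positive cases.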
What does the model summary tell us? Which model do you believe is the best?

- Accuracy: cartFit 0.8867, condFit 0.8737, c50Fit 0.8737. Best: cartFit.
- Specificity: cartFit 0.9942, condFit 1, c50Fit 1. Best: condFit and c50Fit.
- Sensitivity: cartFit 0.6288, condFit 0.5708, c50Fit 0.5708. Best: cartFit.

Overall, cartFit is the best model.
Which variables (features) within your chosen model are important? Do these features provide insights that may be useful in solving the problem of students dropping out of courses?
`years`, `course_id` and `courses_taken` are important.
Students with `years` greater than 0 are less likely to drop out of courses.