prediction.Rmd

---
title: "HUDK4051: Prediction - Comparing Trees"
author: "Xingyi Xie"
date: "4/14/2021"
output: html_document
---

In this assignment you will modelling student data using three flavors of tree algorithm: CART, C4.5 and C5.0. We will be using these algorithms to attempt to predict which students drop out of courses. Many universities have a problem with students over-enrolling in courses at the beginning of semester and then dropping most of them as the make decisions about which classes to attend. This makes it difficult to plan for the semester and allocate resources. However, schools don't want to restrict the choice of their students. One solution is to create predictions of which students are likley to drop out of which courses and use these predictions to inform semester planning. 

In this assignment we will be using the tree algorithms to build models of which students are likely to drop out of which classes. 

## Software

In order to generate our models we will need several packages. The first package you should install is [caret](https://cran.r-project.org/web/packages/caret/index.html).

There are many prediction packages available and they all have slightly different syntax. caret is a package that brings all the different algorithms under one hood using the same syntax. 

We will also be accessing an algorithm from the [Weka suite](https://www.cs.waikato.ac.nz/~ml/weka/). Weka is a collection of machine learning algorithms that have been implemented in Java and made freely available by the University of Waikato in New Zealand. To access these algorithms you will need to first install both the [Java Runtime Environment (JRE) and Java Development Kit](http://www.oracle.com/technetwork/java/javase/downloads/jre9-downloads-3848532.html) on your machine. You can then then install the [RWeka](https://cran.r-project.org/web/packages/RWeka/index.html) package within R.

**Weka requires Java and Java causes problems. If you cannot install Java and make Weka work, please follow the alternative instructions at line 121**
(Issue 1: failure to install RWeka/RWekajars, paste "sudo R CMD javareconf" into terminal and try to install again)

The last package you will need is [C50](https://cran.r-project.org/web/packages/C50/index.html).

## Data

The data comes from a university registrar's office. The code book for the variables are available in the file code-book.txt. Examine the variables and their definitions.

Upload the drop-out.csv data into R as a data frame. 

```{r}
data = read.table("drop-out.csv",header=T, sep=",")
```

The next step is to separate your data set into a training set and a test set. Randomly select 25% of the students to be the test data set and leave the remaining 75% for your training data set. (Hint: each row represents an answer, not a single student.)

```{r}
library(caret)
library(lattice)
library(ggplot2)
inTrain <- createDataPartition(y=data$complete, p=0.75, list=F)
TRAIN1<-data[inTrain,]
TEST1<-data[-inTrain,]
```

For this assignment you will be predicting the student level variable "complete". 
(Hint: make sure you understand the increments of each of your chosen variables, this will impact your tree construction)

Visualize the relationships between your chosen variables as a scatterplot matrix.  Save your image as a .pdf named scatterplot_matrix.pdf. Based on this visualization do you see any patterns of interest? Why or why not?

```{r}
car::scatterplotMatrix(TRAIN1[c("years", "entrance_test_score", "courses_taken")], 
    smooth = list(spread = T,  lty.smooth=2, lwd.smooth=3, lty.spread=3, lwd.spread=2))
```

## CART Trees

You will use the [rpart package](https://cran.r-project.org/web/packages/rpart/rpart.pdf) to generate CART tree models.

Construct a classification tree that predicts complete using the caret package.

```{r}
library(caret)

TRAIN2 <- TRAIN1[,c(2:10)] #Remove the student_id variable that we do not want to use in the model

#caret does not summarize the metrics we want by default so we have to modify the output
MySummary  <- function(data, lev = NULL, model = NULL){
  df <- defaultSummary(data, lev, model)
  tc <- twoClassSummary(data, lev, model)
  pr <- prSummary(data, lev, model)
  out <- c(df,tc,pr)
  out}

#Define the control elements we would like to use
ctrl <- trainControl(method = "repeatedcv", #Tell caret to perform k-fold cross validation
                repeats = 3, #Tell caret to repeat each fold three times
                classProbs = TRUE, #Calculate class probabilities
                summaryFunction = MySummary)

#Define the model
cartFit <- train(complete ~ ., #Define which variable to predict 
                data = TRAIN2, #Define the data set to train the model on
                trControl = ctrl, #Tell caret the control elements
                method = "rpart", #Define the model type
                metric = "Accuracy", #Final model choice is made according to sensitivity
                preProc = c("center", "scale")) #Center and scale the data to minimize the 
#Check the results
cartFit
```


Describe important model attribues of your tree. Do you believe it is a successful model of student performance, why/why not?
```{r}
importance = varImp(cartFit,scale = FALSE)
importance
```
##course_id is the most important feature
##is a valid model roc area reaches 0.88, recall reaches 0.66, and precision reaches 0.98. It means the accuracy is very high

Can you use the sensitivity and specificity metrics to calculate the F1 metric?
##F1=2*precision*recall/(precision+recall)
Now predict results from the test data and describe important attributes of this test. Do you believe it is a successful model of student performance, why/why not?

```{r}
TEST2 <- TEST1[,c(2:10)] #Remove the student_id variable that we do not want to use in the model

#Generate prediction using previously trained model
cartClasses <- predict(cartFit, newdata = TEST2)

#Generate model statistics
confusionMatrix(data = cartClasses, as.factor(TEST2$complete))

```
## is a valid model. acc reaches 0.90
## Conditional Inference Trees

Train a Conditional Inference Tree using the `party` package on the same training data and examine your results.
```{r}
#condFit <- ctree(complete ~ ., data = Train1)
library(party)
library(grid)
library(mvtnorm)
library(caret)
library(modeltools)
library(stats4) 
library(strucchange)
library(zoo)
#Define the model
TRAIN2$complete<-as.factor(TRAIN2$complete)
condFit <- ctree(complete ~ years+entrance_test_score+courses_taken+gender, data = TRAIN2)
#Check the results
condFit
```

```{r}
plot(condFit)
```
```
Describe important model attribues of your tree. Do you believe it is a successful model of student performance, why/why not?
##years is the most important because it splits at the first level of nodes, followed by courses_taken 
##This model is a valid model because acc also reaches 0.88 and Specificity reaches 1 which is very high

What does the plot represent? What information does this plot tell us?
The ## graph represents which features were used in the split, and the splitting point of the features. It also shows the importance of the features, the earlier they split the more information they contain

Now test your new Conditional Inference model by predicting the test data and generating model fit statistics.

```{r}
condFit.pred <- predict(condFit, newdata = TEST2, type = 'response')
confusionMatrix(condFit.pred, as.factor(TEST2$complete))
```

There is an updated version of the C4.5 model called C5.0, it is implemented in the C50 package. What improvements have been made to the newer version? 
##Faster
##More efficient memory usage
##Smaller decision trees built: C5.0 obtains very similar results to C4.5, but builds quite small decision trees.
##Similar accuracy: C5.0 obtains similar accuracy to C4.5.
##Boosting support: Boosting can make the decision tree more accurate.
##Weighting: With C5.0, you can weight different attributes and misclassification types. C5.0 can build classifiers to minimize the expected misclassification cost instead of the error rate.

Install the C50 package, train and then test the C5.0 model on the same data.

```{r}
library(C50)
c50Fit <- C5.0(complete ~ years+entrance_test_score+courses_taken+enroll_date_time+international+online+gender, data = TRAIN2)
summary(c50Fit)
```
```{r}
c50Fit.pred <- predict(c50Fit, newdata = TEST2)
confusionMatrix(c50Fit.pred, as.factor(TEST2$complete))
```


## Compare the models

caret allows us to compare all three models at once.

```{r}
library(caret)
list(cart = cartFit, condinf = condFit, cfiveo = c50Fit)
```
What does the model summary tell us? Which model do you believe is the best?
##acc： cartFit:0.8867  condFit:0.8737 c50Fit:0.8737: cartFit
##Specificity： cartFit:0.9942  condFit:1 c50Fit:1 :c50Fit，cartFit
##Sensitivity   cartFit:0.6288  condFit:0.5708 c50Fit:0.5708 :cartFit
##Overall cartFit is the best 


Which variables (features) within your chosen model are important, do these features provide insights that may be useful in solving the problem of students dropping out of courses?
#years course_id courses_taken are important
#Not easy for students with years greater than 0 dropping out of courses