Prediction

Prediction of student behavior has been a prominent area of research in learning analytics and a major concern for higher education institutions and ed tech companies alike. It is the bedrock of methodology within the world of cognitive tutors, and these methods have been exported to other areas of the education technology landscape. The ability to predict what a student is likely to do in the future, so that interventions can be tailored to them, has seen major growth and investment, though implementation is non-trivial and expensive. Although some institutions, such as Purdue University, have seen success, these approaches have yet to see widespread adoption because they tend to be highly institution-specific and require very concrete outcomes to be useful.

Project Objective:

The purpose of this project was to build models using the CART, C4.5, and C5.0 classification algorithms to predict student course dropout, and then to compare these models based on their validation metrics.

Dataset:

drop-out.csv

A codebook (drop-out-codebook) can be found in this repository.

R Packages:

dplyr
tidyr
ggplot2
caret
RWeka
C50

Procedure:

The dataset was split at random into a training set (75% of the rows) and a test set (the remaining 25%). All of the variables were used as predictors of whether a student would complete the course. The following models and their validation metrics were then generated:
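A 75/25 split like the one described above can be sketched with caret as follows. This is only an illustration: the `students` data frame and its column names are hypothetical stand-ins (the real variables are listed in the codebook), the seed is arbitrary, and `createDataPartition()` stratifies on the outcome, which may differ from the exact sampling used in the project.

```r
library(caret)  # createDataPartition()

set.seed(123)  # arbitrary seed so the split is reproducible

# Hypothetical stand-in for drop-out.csv; column names are assumptions
students <- data.frame(
  years    = sample(0:5, 200, replace = TRUE),
  courses  = sample(1:6, 200, replace = TRUE),
  complete = factor(sample(c("no", "yes"), 200, replace = TRUE))
)

# Draw about 75% of the rows for training, keeping the class balance
train_idx <- createDataPartition(students$complete, p = 0.75, list = FALSE)
training  <- students[train_idx, ]
testing   <- students[-train_idx, ]

# Check the resulting set sizes
cat("training rows:", nrow(training), " test rows:", nrow(testing), "\n")
```

With the real data, `students` would instead be read from drop-out.csv, for example with `read.csv()`.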

CART Tree:

Figure (CartFit.png): ROC against the complexity parameter

C4.5-Type (J48) Tree:

Figure (J48Fit.png): ROC against the complexity parameter

C5.0 Tree:

Figure (C50Fit.png): ROC against the number of boosting iterations
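For orientation, fits of this kind might be produced with caret's `train()` roughly as sketched here. The resampling scheme (10-fold cross-validation), the seeds, and the synthetic `training` data are all assumptions made for illustration; only the object names `cartFit`, `j48Fit`, and `c50Fit` come from this project. The `"J48"` method requires RWeka and a working Java installation, `"C5.0"` requires C50, and `twoClassSummary()` relies on pROC.

```r
library(caret)  # train(), trainControl(), twoClassSummary()

set.seed(123)

# Hypothetical stand-in for the training set; column names are assumptions
n <- 300
training <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
training$complete <- factor(
  ifelse(training$x1 + rnorm(n) > 0, "yes", "no"),
  levels = c("no", "yes")  # twoClassSummary() treats the first level as the event
)

# Class probabilities are needed so that ROC, sensitivity and specificity
# can be computed on each resample
ctrl <- trainControl(
  method = "cv", number = 10,  # assumed scheme, not stated in the README
  classProbs = TRUE,
  summaryFunction = twoClassSummary
)

# Resetting the seed before each call gives all models the same folds,
# which resamples() needs for a paired comparison
set.seed(1)
cartFit <- train(complete ~ ., data = training, method = "rpart",
                 metric = "ROC", trControl = ctrl)
set.seed(1)
j48Fit  <- train(complete ~ ., data = training, method = "J48",
                 metric = "ROC", trControl = ctrl)
set.seed(1)
c50Fit  <- train(complete ~ ., data = training, method = "C5.0",
                 metric = "ROC", trControl = ctrl)

# Tuning curves: resampled ROC against each model's tuning parameter(s)
plot(cartFit)
plot(c50Fit)
```

Selecting on `metric = "ROC"` means each model's tuning parameters are chosen by the area under the ROC curve rather than by accuracy.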

Then the models were compared using the following code, which generated a model summary:

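library(caret)  # provides resamples()
# Pool the resampling results of the three fitted models for a side-by-side comparison
resamps <- resamples(list(cart = cartFit, jfoureight = j48Fit, cfiveo = c50Fit))
summary(resamps)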

Results:

Based on the model summary, the C5.0 model has the highest average ROC, the C4.5 model has the highest average sensitivity, and the CART model has the highest average specificity. The CART model also has the lowest average ROC and sensitivity. However, the C5.0 model shows greater variation in sensitivity and specificity than the other models, while the C4.5 model shows the least variation in these metrics. The C4.5 model was therefore judged the best at predicting whether students will complete the course.
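The variation referred to above can be quantified from the per-fold values stored in a `resamples` object. The sketch below is a hedged illustration rather than the project's own analysis: it refits only two models (CART and C5.0) on synthetic data to stay short, uses the standard deviation as the measure of spread, and assumes 10-fold cross-validation.

```r
library(caret)  # train(), resamples(); also attaches lattice for bwplot()

set.seed(123)

# Hypothetical stand-in data; column names are assumptions
n <- 300
training <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
training$complete <- factor(
  ifelse(training$x1 + rnorm(n) > 0, "yes", "no"),
  levels = c("no", "yes")
)

ctrl <- trainControl(method = "cv", number = 10, classProbs = TRUE,
                     summaryFunction = twoClassSummary)

set.seed(1)
cartFit <- train(complete ~ ., data = training, method = "rpart",
                 metric = "ROC", trControl = ctrl)
set.seed(1)
c50Fit  <- train(complete ~ ., data = training, method = "C5.0",
                 metric = "ROC", trControl = ctrl)

resamps <- resamples(list(cart = cartFit, cfiveo = c50Fit))

# resamps$values holds one row per fold and one column per model/metric pair
fold_metrics <- resamps$values[, names(resamps$values) != "Resample"]

# Standard deviation across folds as a simple measure of variation
spread <- sapply(fold_metrics, sd)
print(round(spread, 3))

# Box-and-whisker plots show the same distributions graphically
bwplot(resamps)
```

A model with a smaller spread across folds is more consistent from resample to resample, which is the sense in which one model can be preferred despite a slightly lower average ROC.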
