Skip to content

benoitchamot/credit-risk-classification

Repository files navigation

Supervised Learning for Credit Risk Classification

Train and evaluate models to classify loan risks.

File structure

  • The img directory contains PNG images of the charts generated in the analysis
  • The Resources directory contains the lending data provided by Monash University
  • The credit_risk_classification.ipynb Jupyter notebook contains the main analysis, including the code and results
  • The ml_classification.py module contains functions that are commonly used in classification projects (supervised learning)
  • The models_optimisation.ipynb Jupyter notebook contains contains a side analysis with the aim to optimise some of the models' parameters
  • The style.py module contains variables used to define the style of the charts, such as colors

All code is the author's, unless otherwise specifically specified.

This README files contains the main points from the analysis and our conclusions. The most complete and up-to-date report can be found in the credit_risk_classification.ipynb Jupyter notebook itself, from which most of the text below is copied for convenience.

Overview of the Analysis

Purpose

In this analysis, we look at the ability of different machine learning (ML) models to classify healthy and high-risk loans.

About the data

The dataset includes the following information about the loans and the borrowers:

  • loan size
  • interest rate
  • borrower income
  • debt to income ratio
  • number of accounts
  • derogatory marks
  • total debt of the borrower

Because the values range can differ a lot between the different columns, it is expected that the classifier will benefit from using scaled data. An initial model is used with the original data first to establish a baseline but scaled data will be used as a first optimisation step (see below.)

The loan status is the value we try to predict. It can take a value of 0 (healthy loans) or 1 (high-risk). The data provided include 75036 loans classified as healthy and 2500 classified as high-risk.

Methods used

The following models form the sklearn library are evaluated:

  • LogisticRegression with original data (Model 1) and scaled data (Model 2)
  • SVC with scaled data (Model 3)
  • tree with scaled data (Model 4)
  • RandomForest with scaled data (Model 5)
  • KNeighborsClassifier with scaled data (Model 6)

Note that PCA is not used as the number of features (dimensions) is reasonable and we do not expect any significant improvement by using Principal Components.

Stages

We prepare the data for all the models in the next sections by performing the following steps:

  • Import all necessary modules (there are no imports within the other code blocks)
  • Load the data from the CSV file into a pandas DataFrame
  • Split the data between the training and test sets using train_test_split from sklearn

For each of the models, we then perform the following steps

  • Scaling (optional)
  • Fitting (i.e. train the model with the training set)
  • Predictions (i.e. use the test set )
  • Describe the stages of the machine learning process you went through as part of this analysis.

Results

A comparison between all the models is given in the table below.

Model Description Data accuracy precision_0 precision_1 recall_0 recall_1
Model 4 DecisionTreeClassifier (max_depth = 3) Scaled 0.9952 0.9998 0.8735 0.9952 0.9952
Model 6 KNeighborsClassifier (max_depth = 11) Scaled 0.9950 0.9997 0.8732 0.9952 0.9920
Model 2 LogisticRegression Scaled 0.9947 0.9993 0.8719 0.9952 0.9808
Model 3 SVC Scaled 0.9947 0.9993 0.8719 0.9952 0.9808
Model 1 LogisticRegression Original 0.9924 0.9964 0.8746 0.9957 0.8928
Model 5 RandomForestClassifier (n_estimators = 220) Scaled 0.9923 0.9962 0.8765 0.9958 0.8864

Main observations

  • The accuracy is excellent for all models
  • All models perform similarly against all metrics
  • There is however a bigger spread in Class-1 recall (see next chart)
  • Class-1 recall is probably the single most important performance metric for this application: we want to make sure that the lenders trust the models and provide and agree to loans that are correctly classified as healthy
  • All models show Class-1 recall metrics of 0.88 or higher but Model 4 (decision tree) shows a near-perfect recall of 0.99, which greatly reduces the risk to the lenders

The precision and recall for the different models are shown in the charts below. The models are sorted by their Class-1 recall value.

Recommendations

While all the models perform well with all metrics above 0.8 and usually close to 0.9, the Decision Tree Classifier offers oustanding performance when it comes to recall (ability to correctly classify risky loans as such) and therefore can be trusted by lenders who want to avoid unexpected risks.

Class-1 recall is by far the most variable metrics among the different metrics (accuracy, precision, recall.) Optimising for Class-1 recall will therefore have very limited impact on the other metrics and can be treated as a one-dimension optimisation problem.

About

Train and evaluate a model based on loan risk.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published