Fraud detection in financial transactions is one of the most important problems financial companies face. This project aims to detect potential fraud cases in credit card transactions by distinguishing fraudulent transactions from legitimate ones. My approach is to build classification models for this task. I trained models using both AutoML and HyperDrive, after which I deployed the best model, which in this case was the AutoML model.
The original dataset is hosted on Kaggle Datasets and is licensed under the Open Database License (ODbL) 1.0.
This dataset concerns fraud detection in credit card transactions. It contains transactions made with credit cards in September 2013 by European cardholders. The dataset is highly unbalanced: the positive class, which depicts fraudulent transactions (frauds), accounts for 0.17% of all transactions.
It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we do not have the original features or more background information about the data. Features V1, V2, ... V28 are the principal components obtained with Principal Component Analysis (PCA); the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable, and it takes value 1 in case of fraud and 0 otherwise.
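To illustrate how the imbalance described above can be checked, the snippet below inspects the class distribution with pandas. It uses a small synthetic stand-in frame with the same 'Class' column, since the real CSV is not bundled here:

```python
import pandas as pd

# Synthetic stand-in with the same 'Class' response variable:
# 1 = fraud, 0 = otherwise (the real dataset has ~0.17% frauds).
df = pd.DataFrame({"Class": [0] * 997 + [1] * 3})

fraud_rate = df["Class"].mean()          # fraction of fraudulent rows
counts = df["Class"].value_counts()     # absolute counts per class
print(f"Fraud rate: {fraud_rate:.2%}")  # -> Fraud rate: 0.30%
```

Running the same two lines on the real data reveals the 0.17% fraud rate and motivates care with accuracy as a metric on such skewed data.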
This project aims to detect potential fraud cases in credit card transactions, and the task here is to differentiate between fraudulent and legitimate ones. My ultimate intent is to tackle this situation by building classification models that distinguish fraud transactions. Detecting potential frauds so that customers are not charged for items they did not purchase is the key objective of this project. So the goal is to build a classifier that tells whether a transaction is a fraud or not.
The data was first loaded into this project repository; the raw file is then accessed, via the link pointing to the raw data in my repository, from the different notebooks and scripts in which it is used.
- `n_cross_validations`: how many cross-validations to perform when user validation data is not specified.
- `enable_early_stopping`: whether to enable early termination if the score is not improving in the short term. The default is False, but it is set to True here.
- `experiment_timeout_minutes`: maximum amount of time in minutes that all iterations combined can take before the experiment terminates. It is set to 5 minutes here.
- `verbosity`: the verbosity level for writing to the log file; it is set to `logging.INFO`.
- `training_data`: the training data to be used within the experiment; this can be a DataFrame, Dataset, DatasetDefinition, or TabularDataset. It should contain both the training features and a label column (and optionally a sample-weights column). If `training_data` is specified, then the `label_column_name` parameter must also be specified.
- `label_column_name`: the name of the label column. If the input data is from a pandas DataFrame that doesn't have column names, column indices can be used instead, expressed as integers. Here the data has column headers, and our target column is `Class`, which we aim to predict in this project.
- `max_cores_per_iteration`: the maximum number of threads to use for a given training iteration. It is set to -1, which means to use all the possible cores per iteration per child run.
- `max_concurrent_iterations`: the maximum number of iterations that will be executed in parallel. The value used here is 4.
- `compute_target`: the Azure Machine Learning compute target on which to run the Automated Machine Learning experiment.
- `primary_metric`: the metric that Automated Machine Learning will optimize for model selection. Accuracy is the primary metric here.
- `task`: the type of task to run. The value here is 'classification'.
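Taken together, the settings above might be assembled into an `AutoMLConfig` roughly as sketched below. This is an illustration, not the exact configuration used: `train_data` and `compute_cluster` are placeholders for the registered TabularDataset and compute target, and `n_cross_validations` is set to an assumed value since the README does not state one.

```python
import logging
from azureml.train.automl import AutoMLConfig

# Sketch only: assumes `train_data` (a TabularDataset) and `compute_cluster`
# (a ComputeTarget) already exist in the workspace.
automl_settings = {
    "n_cross_validations": 3,          # assumed value; used since no validation data is supplied
    "enable_early_stopping": True,     # stop iterations whose score stops improving
    "experiment_timeout_minutes": 5,   # cap on the total experiment run time
    "verbosity": logging.INFO,
    "max_cores_per_iteration": -1,     # use all available cores per child run
    "max_concurrent_iterations": 4,
    "primary_metric": "accuracy",
}

automl_config = AutoMLConfig(
    task="classification",
    training_data=train_data,
    label_column_name="Class",
    compute_target=compute_cluster,
    **automl_settings,
)
```

The configuration is then passed to `Experiment.submit` to launch the AutoML run.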
The VotingEnsemble gave the best model, with an accuracy of 0.9996. The data used in this experiment is highly skewed, and I suggest that in the future better ways of handling this kind of skewed dataset be applied to further improve the model. Additionally, employing deep learning algorithms could yield significant improvement, so I highly recommend it.
LogisticRegression is the algorithm used in this classification task. It is a two-class classifier that predicts between two categories (fraudulent or not fraudulent). To improve the model, we optimized its hyperparameters using Azure Machine Learning's HyperDrive.
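The core of the training script behind this step might look roughly like the sketch below: a scikit-learn LogisticRegression exposing the two tuned hyperparameters, C and max_iter. It uses a synthetic imbalanced dataset as a stand-in, since the real script would load the credit card CSV instead:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def train(C=1.0, max_iter=100):
    # Stand-in for the real credit card data: a synthetic, imbalanced
    # binary classification problem (the real script loads the CSV instead).
    X, y = make_classification(
        n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    # C is the inverse of regularization strength; max_iter caps solver iterations.
    model = LogisticRegression(C=C, max_iter=max_iter).fit(X_train, y_train)
    return model.score(X_test, y_test)

accuracy = train(C=100.0, max_iter=250)
```

In the real script, HyperDrive passes `--C` and `--max_iter` as command-line arguments and the script logs the resulting accuracy so HyperDrive can compare runs.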
The hyperparameter space defined tunes the C and max_iter parameters. Random sampling, which supports both discrete and continuous hyperparameters, was used; the primary metric to optimize was accuracy, and the goal was to maximize it.
The early termination policy was BanditPolicy, whose parameters are slack_factor and evaluation_interval. A slack factor of 0.1 was used as the evaluation criterion, conserving resources by terminating runs whose primary metric is not within the specified slack factor/slack amount of the best-performing run. Once that was defined, we created the SKLearn estimator.
I then defined the HyperDrive configuration and submitted the experiment.
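The HyperDrive setup described above could be sketched roughly as follows. The search ranges are illustrative assumptions, since the README does not list them, and `estimator` stands in for the SKLearn estimator created earlier:

```python
from azureml.train.hyperdrive import (
    BanditPolicy,
    HyperDriveConfig,
    PrimaryMetricGoal,
    RandomParameterSampling,
    choice,
)

# Illustrative search space over the two tuned hyperparameters.
param_sampling = RandomParameterSampling({
    "--C": choice(0.01, 0.1, 1.0, 10.0, 100.0),
    "--max_iter": choice(50, 100, 250),
})

# Terminate runs whose metric falls outside a 0.1 slack factor of the best run.
early_termination = BanditPolicy(slack_factor=0.1, evaluation_interval=2)

hyperdrive_config = HyperDriveConfig(
    estimator=estimator,                 # assumes the SKLearn estimator from above
    hyperparameter_sampling=param_sampling,
    policy=early_termination,
    primary_metric_name="Accuracy",
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=20,                   # assumed run budget
)
```

Submitting `hyperdrive_config` to the experiment launches the child runs, and `get_best_run_by_primary_metric()` retrieves the winner afterwards.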
The best model gave an accuracy of 0.998.
The best model was generated using a regularization strength (C) of 100.0 and max_iter = 250, which gave an accuracy of 0.9988, as shown in the screenshot below.
This experiment could be improved by using a different algorithm, using different features, or adding more iterations to the HyperDrive configuration, any of which could deliver a better result.
Below are screenshots that demonstrate the overview of the deployed model, along with instructions on how to query the endpoint with a sample input.
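A minimal sketch of querying the endpoint is shown below. The scoring URI comes from the deployed service, and the sample row is a hypothetical input matching the dataset's schema (Time, V1...V28, Amount); the values are placeholders, not real data:

```python
import json

# Hypothetical sample input: one transaction row with the dataset's 30 features
# (Time, V1..V28, Amount). Values here are placeholders, not real data.
sample = {"Time": 0.0, **{f"V{i}": 0.0 for i in range(1, 29)}, "Amount": 149.62}
payload = json.dumps({"data": [sample]})

# Sending the request requires the deployed service's scoring URI, e.g.:
#   import requests
#   headers = {"Content-Type": "application/json"}
#   response = requests.post(scoring_uri, data=payload, headers=headers)
#   print(response.json())  # a result of 0 means "not fraud", 1 means "fraud"
```

The exact payload shape depends on the scoring script's `run()` function, so the real sample input shown in the screenshots is the authoritative reference.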
- A working model
- Demo of the deployed model
- Demo of a sample request sent to the endpoint and its response