Tekhunt/Creditcard-fraud-detection

Credit Card Fraud Detection using Azure Machine Learning

Fraud detection is one of the most important problems facing financial companies. This project aims to detect potentially fraudulent credit card transactions by building classification models that distinguish fraudulent transactions from legitimate ones. I trained models using AutoML and HyperDrive, then deployed the best model, which in this case was the AutoML model.

Steps from data acquisition to model deployment.

flow1

Dataset

The original dataset is available on Kaggle Datasets and is licensed under the Open Database License (ODbL) 1.0.

This data is about fraud detection in credit card transactions. It contains transactions made with credit cards in September 2013 by European cardholders. The dataset is highly unbalanced: the positive class (fraudulent transactions) accounts for 0.17% of all transactions.

It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, the original features and further background information about the data are not available. Features V1, V2, ... V28 are the principal components obtained with principal component analysis (PCA); the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; it can be used, for example, for example-dependent cost-sensitive learning. Feature 'Class' is the response variable, and it takes the value 1 in case of fraud and 0 otherwise.
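As a quick check of the imbalance described above, the share of each 'Class' value can be computed directly from the CSV. This is a minimal sketch using only the standard library; the function name `class_share` and the four sample rows are made up for illustration and are not taken from the real file.

```python
import csv
import io
from collections import Counter


def class_share(csv_file, label="Class"):
    """Return {label value: percentage of rows} for an open CSV file object."""
    counts = Counter(row[label] for row in csv.DictReader(csv_file))
    total = sum(counts.values())
    return {value: 100.0 * n / total for value, n in counts.items()}


# Made-up rows standing in for the real file (only one V column shown).
sample = io.StringIO(
    "Time,V1,Amount,Class\n"
    "0,-1.36,149.62,0\n"
    "1,1.19,2.69,0\n"
    "2,-1.35,378.66,0\n"
    "3,-0.97,123.50,1\n"
)
shares = class_share(sample)
print(shares)
```

On the real data, the same function would be called with the opened dataset file instead of the in-memory sample.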


Task

This project aims to detect potentially fraudulent credit card transactions and to separate them from legitimate ones. The key objective is to detect fraud so that customers are not charged for items they did not purchase. The goal, then, is to build a classifier that tells whether a transaction is fraudulent or not.

Access

The data is first loaded into this project repository. The notebooks and scripts then access it through this link, which points to the raw data in my repository.

Automated ML

The image below shows the AutoML configuration and settings used in this project.

aml settings

n_cross_validations

The number of cross-validation folds to perform when user validation data is not specified.

enable_early_stopping

Whether to enable early termination if the score is not improving in the short term. It is set to True here (the default differs between SDK versions, so it is set explicitly).

experiment_timeout_minutes

Maximum amount of time in minutes that all iterations combined can take before the experiment terminates. It is set to 5 minutes here.

verbosity

This is the verbosity level for writing to the log file. It is set to logging.INFO.

training_data

The training data to be used within the experiment. It can be a DataFrame, Dataset, DatasetDefinition, or TabularDataset. It should contain both training features and a label column (optionally a sample weights column). If training_data is specified, then the label_column_name parameter must also be specified.

label_column_name

This is the name of the label column. If the input data is from a pandas.DataFrame that doesn't have column names, column indices can be used instead, expressed as integers. Here we have column headers, and our target column is the Class column, which we aim to predict in this project.

max_cores_per_iteration

The maximum number of threads to use for a given training iteration. A value of -1 means using all the possible cores per iteration per child run.

max_concurrent_iterations

The maximum number of iterations that are executed in parallel. The value used here is 4.

compute_target

The Azure Machine Learning compute target to run the Automated Machine Learning experiment on.

primary_metric

The metric that Automated Machine Learning will optimize for model selection. Accuracy is the primary_metric here.

task

The type of task to run. The value here is 'classification'.
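The settings described above can be collected into one dictionary and passed to `AutoMLConfig`. The sketch below assumes the Azure ML SDK v1 (`azureml-train-automl`); values commented "from the text" are stated in this README, while `n_cross_validations=5` and `max_cores_per_iteration=-1` are assumptions, since the exact values used are only visible in the screenshot. The helper name `build_automl_config` is hypothetical.

```python
import logging

# AutoML settings; "from the text" marks values stated in this README.
automl_settings = {
    "task": "classification",           # from the text
    "primary_metric": "accuracy",       # from the text
    "label_column_name": "Class",       # from the text
    "enable_early_stopping": True,      # from the text
    "experiment_timeout_minutes": 5,    # from the text
    "max_concurrent_iterations": 4,     # from the text
    "verbosity": logging.INFO,          # from the text
    "max_cores_per_iteration": -1,      # assumption: use all available cores
    "n_cross_validations": 5,           # assumption: value not stated
}


def build_automl_config(training_data, compute_target):
    """Build an AutoMLConfig from the settings above.

    The import is done lazily so this module can be read and tested
    without the Azure ML SDK installed.
    """
    from azureml.train.automl import AutoMLConfig

    return AutoMLConfig(
        training_data=training_data,
        compute_target=compute_target,
        **automl_settings,
    )
```

In a notebook, the returned object would typically be handed to `Experiment.submit`; `training_data` and `compute_target` come from the workspace and are not shown here.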

Results

The VotingEnsemble gave the best model, with an accuracy of 0.9996. The data used in this experiment is highly skewed, and I suggest that future work apply better ways of handling this kind of skewed dataset to further improve the model. Additionally, a deep learning model may yield a further improvement, so it is worth trying.
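To see why the skew matters when reading an accuracy figure, consider a model that always predicts "not fraud". The arithmetic below is a small illustration, not part of the original experiment; it uses only the 0.17% fraud rate stated in the Dataset section, and the function names and the 10,000-transaction example are hypothetical.

```python
def majority_baseline_accuracy(positive_rate):
    """Accuracy of a model that always predicts the majority (non-fraud) class."""
    return 1.0 - positive_rate


def recall(true_positives, false_negatives):
    """Fraction of actual fraud cases that the model catches."""
    actual_positives = true_positives + false_negatives
    return true_positives / actual_positives if actual_positives else 0.0


fraud_rate = 0.0017  # 0.17% of transactions, as stated in the Dataset section
baseline = majority_baseline_accuracy(fraud_rate)

# In 10,000 transactions about 17 are fraud; the baseline catches none of them.
baseline_recall = recall(true_positives=0, false_negatives=17)

print(f"baseline accuracy: {baseline:.4f}, baseline recall: {baseline_recall:.1f}")
```

Because the trivial baseline is already very close to the reported accuracies, metrics such as recall, precision, or a weighted AUC are likely to be more informative for comparing models on this dataset.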

run_id_LI

RunDetails output

run details

aml run id_LI aml run id

View in Azure ML Studio

view in azure

Hyperparameter Tuning

hyperdrive settings

LogisticRegression is the algorithm used in this classification task. It performs two-class classification, predicting between two categories (fraudulent or not fraudulent). To improve the model, I optimized the hyperparameters using Azure Machine Learning's HyperDrive.

The hyperparameter space covers the C and max_iter parameters. Random sampling, which supports both discrete and continuous hyperparameters, was used. The primary metric to optimize was accuracy, and the goal was to maximize it.

The early termination policy was a Bandit policy, configured with the slack_factor and evaluation_interval parameters. A slack factor of 0.1 was used: to conserve resources, runs are terminated when their primary metric is not within the specified slack factor of the best performing run. Once this was defined, I created the SKLearn estimator.

I then defined the HyperDrive configuration and submitted the experiment.
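Random sampling over a discrete space can be pictured as drawing each hyperparameter independently for every run. The sketch below is a plain-Python illustration of that idea, not the HyperDrive API; the candidate values in `SEARCH_SPACE` are hypothetical, since the README does not list the ranges that were actually searched.

```python
import random

# Hypothetical candidate values; the real search space is not stated in the text.
SEARCH_SPACE = {
    "C": [0.01, 0.1, 1.0, 10.0, 100.0],
    "max_iter": [50, 100, 250, 500],
}


def sample_configs(space, n_runs, seed=0):
    """Draw n_runs configurations, picking each parameter uniformly at random."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_runs)
    ]


configs = sample_configs(SEARCH_SPACE, n_runs=4)
for config in configs:
    print(config)
```

Each sampled configuration would correspond to one child run that trains LogisticRegression with those values and reports its accuracy.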
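The slack-factor rule can be written out as a one-line comparison. This is a simplified sketch of the Bandit policy's check for a metric that is being maximized, assuming the documented rule that a run is stopped when its metric falls below best / (1 + slack_factor); the function name and the example metric values are hypothetical, and the real policy additionally applies the check only every evaluation_interval reports.

```python
def bandit_should_terminate(run_metric, best_metric, slack_factor=0.1):
    """Return True if a run should be stopped under a maximizing Bandit policy.

    A run is stopped when its metric is below best_metric / (1 + slack_factor).
    """
    threshold = best_metric / (1.0 + slack_factor)
    return run_metric < threshold


# With a best accuracy of 0.998, the cutoff is 0.998 / 1.1, roughly 0.907.
stop_weak_run = bandit_should_terminate(0.85, 0.998)
stop_close_run = bandit_should_terminate(0.95, 0.998)
print(stop_weak_run, stop_close_run)
```

With accuracies this close to 1.0 across runs, a slack factor of 0.1 is quite permissive, so few runs would be expected to be cut short.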

Results

The best model gave an accuracy of 0.998.

hyperdrive run details metrics

hyperdriverun id

The best model was generated using an inverse regularization strength (C) of 100.0 and max_iter = 250, which gave an accuracy of 0.9988, as shown in the screenshot below.

run_info run_info_LI (2)

View in Azure ML Studio

azure details

This experiment could be improved by using a different algorithm, using different features, or adding more iterations to the HyperDrive configuration, any of which may deliver a better result.

Model Deployment

Below are screenshots that show an overview of the deployed model and how to query the endpoint with a sample input.

Save and register the best model for deployment

Download the conda

Set the environment

Set the inference config and the ACI web service config

save download model register model

deploy model

Download the scoring URI and Swagger URI

deployed model

Extract sample data

sample data

Experiment Output

request output
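A request like the one in the screenshots can be sketched with the standard library. The example below assumes the endpoint accepts a JSON body of the form {"data": [...]} with one dictionary per transaction, which is a common convention for AutoML scoring scripts but is not confirmed by this README; the function names and the all-zero sample row are made up for illustration.

```python
import json
import urllib.request


def build_payload(rows):
    """Wrap feature rows in a {"data": [...]} JSON body (assumed input shape)."""
    return json.dumps({"data": rows})


def score(scoring_uri, rows, timeout=30):
    """POST the rows to the scoring endpoint and return the decoded JSON reply."""
    request = urllib.request.Request(
        scoring_uri,
        data=build_payload(rows).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# One made-up transaction: Time, Amount, and the 28 PCA components V1..V28.
sample_row = {"Time": 0.0, "Amount": 149.62}
sample_row.update({f"V{i}": 0.0 for i in range(1, 29)})

payload = build_payload([sample_row])
print(payload[:60])
```

Calling `score(scoring_uri, [sample_row])` with the scoring URI of the deployed service would send the request; the expected columns should be checked against the Swagger URI before relying on this shape.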

Screen Recording

https://youtu.be/wJ5l3GzWEVM

  • A working model
  • Demo of the deployed model
  • Demo of a sample request sent to the endpoint and its response
