Fraud detection in financial transactions is one of the most important problems financial companies face. This project aims to detect potential fraud cases in credit card transactions by distinguishing fraudulent transactions from legitimate ones. My approach is to build classification models for this task. I trained models using both AutoML and HyperDrive, after which I deployed the best model, which in this case was the AutoML model.
The original dataset is hosted on Kaggle Datasets and is licensed under the Open Database License (ODbL) 1.0.
This dataset concerns fraud detection in credit card transactions. It contains transactions made with credit cards in September 2013 by European cardholders. The dataset is highly unbalanced: the positive class, which depicts fraudulent transactions (frauds), accounts for 0.17% of all transactions.
It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we do not have the original features or more background information about the data. Features V1, V2, ... V28 are the principal components obtained with Principal Component Analysis (PCA); the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable, and it takes value 1 in case of fraud and 0 otherwise.
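To illustrate how the imbalance described above can be checked, the snippet below inspects the class distribution with pandas. It uses a small synthetic stand-in frame with the same 'Class' column, since the real CSV is not bundled here:

```python
import pandas as pd

# Synthetic stand-in with the same 'Class' response variable:
# 1 = fraud, 0 = otherwise (the real dataset has ~0.17% frauds).
df = pd.DataFrame({"Class": [0] * 997 + [1] * 3})

fraud_rate = df["Class"].mean()          # fraction of fraudulent rows
counts = df["Class"].value_counts()     # absolute counts per class
print(f"Fraud rate: {fraud_rate:.2%}")  # -> Fraud rate: 0.30%
```

Running the same two lines on the real data reveals the 0.17% fraud rate and motivates care with accuracy as a metric on such skewed data.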
This project aims to detect potential fraud cases in credit card transactions, and the task here is to differentiate between fraudulent and legitimate ones. My ultimate intent is to tackle this situation by building classification models that distinguish fraud transactions. Detecting potential frauds so that customers are not charged for items they did not purchase is the key objective of this project. So the goal is to build a classifier that tells whether a transaction is a fraud or not.
The data was first loaded into this project repository; the raw file is then accessed, via the link pointing to the raw data in my repository, from the different notebooks and scripts in which it is used.
- `n_cross_validations`: how many cross-validations to perform when user validation data is not specified.
- `enable_early_stopping`: whether to enable early termination if the score is not improving in the short term. The default is False, but it is set to True here.
- `experiment_timeout_minutes`: maximum amount of time in minutes that all iterations combined can take before the experiment terminates. It is set to 5 minutes here.
- `verbosity`: the verbosity level for writing to the log file; it is set to `logging.INFO`.
- `training_data`: the training data to be used within the experiment; this can be a DataFrame, Dataset, DatasetDefinition, or TabularDataset. It should contain both the training features and a label column (and optionally a sample-weights column). If `training_data` is specified, then the `label_column_name` parameter must also be specified.
- `label_column_name`: the name of the label column. If the input data is from a pandas DataFrame that doesn't have column names, column indices can be used instead, expressed as integers. Here the data has column headers, and our target column is `Class`, which we aim to predict in this project.
- `max_cores_per_iteration`: the maximum number of threads to use for a given training iteration. It is set to -1, which means to use all the possible cores per iteration per child run.
- `max_concurrent_iterations`: the maximum number of iterations that will be executed in parallel. The value used here is 4.
- `compute_target`: the Azure Machine Learning compute target on which to run the Automated Machine Learning experiment.
- `primary_metric`: the metric that Automated Machine Learning will optimize for model selection. Accuracy is the primary metric here.
- `task`: the type of task to run. The value here is 'classification'.
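Taken together, the settings above might be assembled into an `AutoMLConfig` roughly as sketched below. This is an illustration, not the exact configuration used: `train_data` and `compute_cluster` are placeholders for the registered TabularDataset and compute target, and `n_cross_validations` is set to an assumed value since the README does not state one.

```python
import logging
from azureml.train.automl import AutoMLConfig

# Sketch only: assumes `train_data` (a TabularDataset) and `compute_cluster`
# (a ComputeTarget) already exist in the workspace.
automl_settings = {
    "n_cross_validations": 3,          # assumed value; used since no validation data is supplied
    "enable_early_stopping": True,     # stop iterations whose score stops improving
    "experiment_timeout_minutes": 5,   # cap on the total experiment run time
    "verbosity": logging.INFO,
    "max_cores_per_iteration": -1,     # use all available cores per child run
    "max_concurrent_iterations": 4,
    "primary_metric": "accuracy",
}

automl_config = AutoMLConfig(
    task="classification",
    training_data=train_data,
    label_column_name="Class",
    compute_target=compute_cluster,
    **automl_settings,
)
```

The configuration is then passed to `Experiment.submit` to launch the AutoML run.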
The VotingEnsemble gave the best model, with an accuracy of 0.9996. The data used in this experiment is highly skewed, and I suggest that in the future better ways of handling this kind of skewed dataset be applied to further improve the model. Additionally, employing deep learning algorithms could yield significant improvement, so I highly recommend it.
LogisticRegression is the algorithm used in this classification task. It is a two-class classifier that predicts between two categories (fraudulent or not fraudulent). To improve the model, we optimized its hyperparameters using Azure Machine Learning's HyperDrive.
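The core of the training script behind this step might look roughly like the sketch below: a scikit-learn LogisticRegression exposing the two tuned hyperparameters, C and max_iter. It uses a synthetic imbalanced dataset as a stand-in, since the real script would load the credit card CSV instead:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def train(C=1.0, max_iter=100):
    # Stand-in for the real credit card data: a synthetic, imbalanced
    # binary classification problem (the real script loads the CSV instead).
    X, y = make_classification(
        n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    # C is the inverse of regularization strength; max_iter caps solver iterations.
    model = LogisticRegression(C=C, max_iter=max_iter).fit(X_train, y_train)
    return model.score(X_test, y_test)

accuracy = train(C=100.0, max_iter=250)
```

In the real script, HyperDrive passes `--C` and `--max_iter` as command-line arguments and the script logs the resulting accuracy so HyperDrive can compare runs.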
The hyperparameter space defined tunes the C and max_iter parameters. Random sampling, which supports both discrete and continuous hyperparameters, was used; the primary metric to optimize was accuracy, and the goal was to maximize it.
The early termination policy was BanditPolicy, whose parameters are slack_factor and evaluation_interval. A slack factor of 0.1 was used as the evaluation criterion, conserving resources by terminating runs whose primary metric is not within the specified slack factor/slack amount of the best-performing run. Once that was defined, we created the SKLearn estimator.
I then defined the HyperDrive configuration and submitted the experiment.
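The HyperDrive setup described above could be sketched roughly as follows. The search ranges are illustrative assumptions, since the README does not list them, and `estimator` stands in for the SKLearn estimator created earlier:

```python
from azureml.train.hyperdrive import (
    BanditPolicy,
    HyperDriveConfig,
    PrimaryMetricGoal,
    RandomParameterSampling,
    choice,
)

# Illustrative search space over the two tuned hyperparameters.
param_sampling = RandomParameterSampling({
    "--C": choice(0.01, 0.1, 1.0, 10.0, 100.0),
    "--max_iter": choice(50, 100, 250),
})

# Terminate runs whose metric falls outside a 0.1 slack factor of the best run.
early_termination = BanditPolicy(slack_factor=0.1, evaluation_interval=2)

hyperdrive_config = HyperDriveConfig(
    estimator=estimator,                 # assumes the SKLearn estimator from above
    hyperparameter_sampling=param_sampling,
    policy=early_termination,
    primary_metric_name="Accuracy",
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=20,                   # assumed run budget
)
```

Submitting `hyperdrive_config` to the experiment launches the child runs, and `get_best_run_by_primary_metric()` retrieves the winner afterwards.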
The best model gave an accuracy of 0.998.
The best model was generated using a regularization strength (C) of 100.0 and max_iter = 250, which gave an accuracy of 0.9988, as shown in the screenshot below.
This experiment could be improved by using a different algorithm, using different features, or adding more iterations to the HyperDrive configuration, any of which could deliver a better result.
Below are screenshots that demonstrate the overview of the deployed model, along with instructions on how to query the endpoint with a sample input.
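A minimal sketch of querying the endpoint is shown below. The scoring URI comes from the deployed service, and the sample row is a hypothetical input matching the dataset's schema (Time, V1...V28, Amount); the values are placeholders, not real data:

```python
import json

# Hypothetical sample input: one transaction row with the dataset's 30 features
# (Time, V1..V28, Amount). Values here are placeholders, not real data.
sample = {"Time": 0.0, **{f"V{i}": 0.0 for i in range(1, 29)}, "Amount": 149.62}
payload = json.dumps({"data": [sample]})

# Sending the request requires the deployed service's scoring URI, e.g.:
#   import requests
#   headers = {"Content-Type": "application/json"}
#   response = requests.post(scoring_uri, data=payload, headers=headers)
#   print(response.json())  # a result of 0 means "not fraud", 1 means "fraud"
```

The exact payload shape depends on the scoring script's `run()` function, so the real sample input shown in the screenshots is the authoritative reference.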
- A working model
- Demo of the deployed model
- Demo of a sample request sent to the endpoint and its response