diff --git a/ai-credit-fraud-workflow/Dockerfile b/ai-credit-fraud-workflow/Dockerfile new file mode 100644 index 0000000..43d2c7c --- /dev/null +++ b/ai-credit-fraud-workflow/Dockerfile @@ -0,0 +1,4 @@ +FROM nvcr.io/nvidia/pyg:24.09-py3 +WORKDIR /ai-credit-fraud-workflow +COPY requirements.txt /ai-credit-fraud-workflow +RUN pip install --no-cache-dir -r requirements.txt diff --git a/ai-credit-fraud-workflow/LICENSE b/ai-credit-fraud-workflow/LICENSE new file mode 100644 index 0000000..261eeb9 --- /dev/null +++ b/ai-credit-fraud-workflow/LICENSE @@ -0,0 +1,201 @@ + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. 
+ + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. + + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. 
Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. 
You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. 
+ + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. 
In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. + + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. 
+
+   Copyright [yyyy] [name of copyright owner]
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
diff --git a/ai-credit-fraud-workflow/README.md b/ai-credit-fraud-workflow/README.md
index 32692c2..7653a14 100644
--- a/ai-credit-fraud-workflow/README.md
+++ b/ai-credit-fraud-workflow/README.md
@@ -1,20 +1,76 @@
-
+__Note__: The sample datasets must be downloaded manually (see Setup)
-# AI Credit Card Fraud Workflow
-Future home of the AI Credit Card Fraud Workflow.
+Table of Contents
+* [Background](./docs/background.md)
+* [This Workflow](./docs/workflow.md)
+* [Datasets and Data Prep](./docs/datasets.md)
+* [Setup](./docs/setup.md)
+
+Executing these examples:
+1. Set up your environment or container (see [Setup](./docs/setup.md))
+1. Download the datasets (see [Datasets](./docs/datasets.md))
+1. Start Jupyter
+1. Run the [Notebooks](./docs/run_notebooks.md)
+    * Determine which dataset you want (Notebook names are related to a dataset)
+    * Run the data pre-processing Notebook
+    * Run the GNN training Notebook
+    * Run the inference Notebook
+
+
+### Notebooks need to be executed in the correct order
+The notebooks need to be executed in the correct order. For a particular dataset, the preprocessing notebook must be executed before the training notebook. Once the training notebook produces models, the inference notebook can be executed to run inference on unseen data.
+
+
+For example, for the TabFormer dataset, the notebooks need to be executed in the following order:
+
+ - preprocess_Tabformer.ipynb
+ - train_gnn_based_xgboost.ipynb
+ - inference_gnn_based_xgboost_TabFormer.ipynb
+
+To train a standalone XGBoost model that doesn't use node embeddings, run the following two notebooks in order:
+
+ - train_xgboost.ipynb
+ - inference_xgboost_TabFormer.ipynb
+
+__Note__: Before executing the `train_xgboost.ipynb` and `train_gnn_based_xgboost.ipynb` notebooks, make sure that the right dataset is selected in the second code cell of each notebook.
+
+```code
+ DATASET = TABFORMER
+```
+
+

+ + +## Copyright and License +Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. + +
+ + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. diff --git a/ai-credit-fraud-workflow/conda/fraud_conda_env.yaml b/ai-credit-fraud-workflow/conda/fraud_conda_env.yaml new file mode 100644 index 0000000..0af8bd8 --- /dev/null +++ b/ai-credit-fraud-workflow/conda/fraud_conda_env.yaml @@ -0,0 +1,60 @@ +# This file is generated by `rapids-dependency-file-generator`. +# To make changes, edit ../../dependencies.yaml and run `rapids-dependency-file-generator`. +channels: +- rapidsai +- rapidsai-nightly +- pyg +- conda-forge +- nvidia +dependencies: +- breathe +- conda-forge::category_encoders +- cmake>=3.26.4,!=3.30.0 +- cuda-cudart-dev +- cuda-nvtx-dev +- cuda-profiler-api +- cuda-version=12.1 +- cudf==24.8.* +- cugraph==24.8.* +- cugraph-pyg==24.8.* +- cuml==24.8.* +- cupy>=12.0.0 +- cython>=3.0.0 +- doxygen +- graphviz +- ipython +- libcublas-dev +- libcurand-dev +- libcusolver-dev +- libcusparse-dev +- nbsphinx +- ninja +- notebook>=0.5.0 +- numba>=0.57 +- numpy>=1.23,<2.0a0 +- conda-forge::matplotlib +- pandas +- pre-commit +- pydantic +- pydata-sphinx-theme +- pyg::pyg +- pylibcugraphops==24.8.* +- pylibraft==24.8.* +- pylibwholegraph==24.8.* +- pytest +- pytorch-cuda=12.1 +- pytorch::pytorch>=2.0,<2.2.0a0 +- py-xgboost-gpu +- rmm==24.8.* +- scikit-build-core>=0.7.0 +- scipy +- setuptools>=61.0.0 +- sphinx-copybutton +- sphinx-markdown-tables +- sphinx<6 +- sphinxcontrib-websupport +- torchdata +- tensordict +- wget +- wheel +name: fraud_conda_env diff --git 
a/ai-credit-fraud-workflow/data/Sparkov/README.md b/ai-credit-fraud-workflow/data/Sparkov/README.md
new file mode 100644
index 0000000..36d258b
--- /dev/null
+++ b/ai-credit-fraud-workflow/data/Sparkov/README.md
@@ -0,0 +1,7 @@
+# Sparkov data folder
+
+Please download the data from:
+
+https://www.kaggle.com/datasets/kartik2112/fraud-detection
+
+
diff --git a/ai-credit-fraud-workflow/data/Sparkov/raw/READEME.md b/ai-credit-fraud-workflow/data/Sparkov/raw/READEME.md
new file mode 100644
index 0000000..4e20a2a
--- /dev/null
+++ b/ai-credit-fraud-workflow/data/Sparkov/raw/READEME.md
@@ -0,0 +1,5 @@
+# Sparkov raw data folder
+
+Place the extracted files here:
+* fraudTest.csv
+* fraudTrain.csv
\ No newline at end of file
diff --git a/ai-credit-fraud-workflow/data/Sparkov/xgb/README.md b/ai-credit-fraud-workflow/data/Sparkov/xgb/README.md
new file mode 100644
index 0000000..b41ac3c
--- /dev/null
+++ b/ai-credit-fraud-workflow/data/Sparkov/xgb/README.md
@@ -0,0 +1,2 @@
+# Sparkov XGB data folder
+
\ No newline at end of file
diff --git a/ai-credit-fraud-workflow/data/TabFormer/README.md b/ai-credit-fraud-workflow/data/TabFormer/README.md
new file mode 100644
index 0000000..6b17487
--- /dev/null
+++ b/ai-credit-fraud-workflow/data/TabFormer/README.md
@@ -0,0 +1,3 @@
+# IBM TabFormer Dataset
+
+The data needs to be downloaded manually.
Please go to https://ibm.ent.box.com/v/tabformer-data/folder/130747715605 and download the "transactions.tgz" file
\ No newline at end of file
diff --git a/ai-credit-fraud-workflow/data/TabFormer/gnn/README.md b/ai-credit-fraud-workflow/data/TabFormer/gnn/README.md
new file mode 100644
index 0000000..e69de29
diff --git a/ai-credit-fraud-workflow/data/TabFormer/raw/README.md b/ai-credit-fraud-workflow/data/TabFormer/raw/README.md
new file mode 100644
index 0000000..e69de29
diff --git a/ai-credit-fraud-workflow/data/TabFormer/xgb/README.md b/ai-credit-fraud-workflow/data/TabFormer/xgb/README.md
new file mode 100644
index 0000000..e69de29
diff --git a/ai-credit-fraud-workflow/docs/background.md b/ai-credit-fraud-workflow/docs/background.md
new file mode 100644
index 0000000..dd4b0ed
--- /dev/null
+++ b/ai-credit-fraud-workflow/docs/background.md
@@ -0,0 +1,46 @@
+# Background
+Transaction fraud is
+[expected to exceed $43B by 2026](https://nilsonreport.com/articles/card-fraud-losses-worldwide/)
+and poses a significant challenge for financial institutions, which must detect and prevent
+sophisticated fraudulent activities. Traditionally, financial institutions
+have relied upon rule-based techniques, which are reactive in nature and
+result in higher false positives and lower fraud detection accuracy. As data
+volumes have grown and attacks have become more sophisticated, accelerated machine and
+graph learning techniques have become necessary and offer a more proactive approach.
+AI for fraud detection uses multiple machine learning models to detect anomalies
+in customer behaviors and connections as well as patterns of accounts and
+behaviors that fit fraudulent characteristics.
+
+Fraud detection has been a challenge across banking, finance, retail and
+e-commerce. Fraud doesn’t only hurt organizations financially; it can also
+do reputational harm.
It’s a headache for consumers, as well, when fraud models
+from financial services firms overreact and register false positives that shut
+down legitimate transactions. Financial services firms are developing more
+advanced models using more data to fortify themselves against financial
+and reputational losses. They’re also aiming to reduce false positives
+in fraud detection for transactions to improve customer satisfaction and win
+greater share among merchants.
+
+As data needs grow and AI models expand in size, intricacy, and diversity,
+energy-efficient processing power is becoming more critical to operations in
+financial services. Traditional data science pipelines lack the necessary
+acceleration to handle the volumes of data involved in fraud detection,
+resulting in slower processing times, which limits real-time data analysis
+and detection of fraud. To efficiently manage large-scale datasets and deliver
+real-time performance for AI in production, financial institutions must shift
+from legacy infrastructure to accelerated computing.
+
+The Fraud Detection AI workflow offers enterprises an end-to-end solution using
+the NVIDIA accelerated computing platform for GPU-accelerated data processing
+and AI deployment, enabling real-time analysis and detection of fraudulent
+activities. It is important to note that there are several types of fraud.
+The initial focus is on supervised credit card transaction fraud. Other areas
+beyond credit card fraud that could be converted to products include: New Account Fraud,
+Account Takeover, Fraud Ring Detection, Abnormal Behavior, and Anti-Money
+Laundering.
+
+
+
+ +[<-- Back](../README.md)
+[--> Next: This Workflow](./workflow.md)
diff --git a/ai-credit-fraud-workflow/docs/datasets.md b/ai-credit-fraud-workflow/docs/datasets.md
new file mode 100644
index 0000000..82f6c13
--- /dev/null
+++ b/ai-credit-fraud-workflow/docs/datasets.md
@@ -0,0 +1,96 @@
+# Datasets
+The examples here are based on two different datasets with a different set of notebooks for each dataset.
+
+__Both datasets need to be downloaded manually.__
+
+## Dataset 1: IBM TabFormer
+* https://github.com/IBM/TabFormer
+    * just the Credit Card Transaction Dataset and not the others
+* License: Apache License Version 2.0, January 2004
+* 24 million transaction records
+
+
+## Dataset 2: Sparkov
+The data generator:
+ * https://github.com/namebrandon/Sparkov_Data_Generation
+
+
+The generator was used to produce a dataset for Kaggle:
+ * https://www.kaggle.com/datasets/kartik2112/fraud-detection
+ * Released under CC0: Public Domain
+ * Contains 1,296,675 records with 23 fields
+    * one field being the "is_fraud" label which we use for training.
+
+

+
+
+# Data Prep
+
+Preprocessing and feature engineering are very important steps in machine learning that significantly impact model performance. Here is a summary of the preprocessing we performed for the two datasets.
+
+## TabFormer
+
+### Data fields
+* Ordinal categorical fields - 'Year', 'Month', 'Day'
+* Nominal categorical fields - 'User', 'Card', 'Merchant Name', 'Merchant City', 'Merchant State', 'Zip', 'MCC', 'Errors?'
+* Target label - 'Is Fraud?'
+
+### Preprocessing
+* Missing values for the 'Merchant State', 'Zip' and 'Errors?' fields are replaced with markers, as these columns have nominal categorical values.
+* The dollar symbol ($) in 'Amount' and the extra character (,) in the 'Errors?' field are removed.
+* 'Time' is converted to the number of minutes over the span of a day.
+* 'Card' is converted to 'User' * MAX_NUMBER_OF_CARD_PER_USERS + 'Card' and then treated as a nominal categorical value, to make sure that Card 0 of User 1 is different from Card 0 of User 2.
+* Filtered out categorical and numerical columns that don't have significant correlation with the target column.
+* One-hot encoded nominal categorical columns with fewer than nine categories and binary encoded nominal categorical columns with nine or more categories.
+* Scaled the numerical column. As the 'Amount' field has a few extreme values, we scaled the field with a Robust Scaler.
+* The fitted transformer is saved, and the transformed train and test data are written to CSV files.
+
+NOTE: Binary encoding and scaling are performed using a column transformer, which is composed of encoders and a scaler.
+
+### To create the Graph for the GNN
+* Assigned unique and consecutive ids to the transactions, which become the node ids of the transactions in the Graph.
+* Card (or user) ids are used to create consecutive ids for user nodes.
+* Merchant strings are mapped to consecutive ids for merchant nodes.
+* If a user U makes a transaction T to a merchant M, user node U will have an edge (directional or bidirectional depending on a flag) to transaction node T, and the transaction node T will be connected with an edge (directional or bidirectional depending on a flag) to the merchant node M.
+* Transformed transaction node features are saved in a CSV file using the node id as index.
+* Merchant and User nodes are initialized with zero vectors of the same length as the transaction node features.
+* Target values of all the nodes are saved in a separate CSV file, which is loaded during GNN training.
+
+
+## Sparkov
+
+### Data fields
+* Nominal categorical fields - 'cc_num', 'merchant', 'category', 'first', 'last', 'street', 'city', 'state', 'zip', 'job', 'trans_num'
+* Numerical fields - 'amt', 'lat', 'long', 'city_pop', 'merch_lat', 'merch_long'
+* Timestamp fields - 'dob', 'trans_date_trans_time', 'unix_time'
+* Target label - 'is_fraud'
+
+### Preprocessing
+* From 'unix_time', ('lat', 'long') and ('merch_lat', 'merch_long'), we calculated the transaction 'speed'.
+* Converted 'dob' to age.
+* Converted 'trans_date_trans_time' into the number of minutes over the span of a day.
+
+* Filtered out categorical and numerical columns that don't have significant correlation with the target column.
+* Binary encoded nominal categorical columns.
+* Scaled numerical columns. As the 'amt' field has a few extreme values, we scaled the field with a Robust Scaler. The 'speed' and 'age' fields are scaled with a standard scaler.
+* The fitted transformer is saved, and the transformed train and test data are written to CSV files.
+
+NOTE: Binary encoding and scaling are performed using a column transformer, which is composed of encoders and scalers.
+
+### To create the Graph for the GNN
+* Assigned unique and consecutive ids to the transactions, which become the node ids of the transactions in the Graph.
+* 'cc_num' values are used to create consecutive ids for user nodes.
+* Merchant strings are mapped to consecutive ids for merchant nodes.
+* If a user U makes a transaction T to a merchant M, user node U will have an edge (directional or bidirectional depending on a flag) to transaction node T, and the transaction node T will be connected with an edge (directional or bidirectional depending on a flag) to the merchant node M.
+* Transformed transaction node features are saved in a CSV file using the node id as index.
+* Merchant and User nodes are initialized with zero vectors of the same length as the transaction node features.
+* Target values of all the nodes are saved in a separate CSV file, which is loaded during GNN training.
+
+
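The graph-construction steps listed above can be sketched in plain Python. This is only an illustrative sketch and not the code used in the preprocessing notebooks: the function name `build_graph`, the `bidirectional` parameter name, and the choice of a single shared id space (transactions first, then users, then merchants) are all assumptions made for this example.

```python
def build_graph(transactions, tx_features, bidirectional=False):
    """Build an edge list and a node-feature table for a transaction graph.

    transactions: list of (user_key, merchant_key) pairs, one per transaction.
    tx_features:  list of numeric feature rows, aligned with `transactions`.
    Node id layout (an assumption of this sketch):
      [0, n_tx) transactions, then users, then merchants.
    """
    n_tx = len(transactions)

    # Map raw user / merchant keys to consecutive ids in first-seen order.
    user_index, merchant_index = {}, {}
    for user, merchant in transactions:
        user_index.setdefault(user, len(user_index))
        merchant_index.setdefault(merchant, len(merchant_index))

    user_offset = n_tx
    merchant_offset = n_tx + len(user_index)

    edges = []
    for tx_id, (user, merchant) in enumerate(transactions):
        u = user_offset + user_index[user]
        m = merchant_offset + merchant_index[merchant]
        edges.append((u, tx_id))   # user -> transaction
        edges.append((tx_id, m))   # transaction -> merchant
        if bidirectional:
            edges.append((tx_id, u))
            edges.append((m, tx_id))

    # User and merchant nodes get zero vectors of the transaction feature length.
    dim = len(tx_features[0]) if tx_features else 0
    n_other = len(user_index) + len(merchant_index)
    features = [list(row) for row in tx_features]
    features += [[0.0] * dim for _ in range(n_other)]
    return edges, features


# Hypothetical toy data: three transactions, two users, two merchants.
example_transactions = [("u1", "mA"), ("u2", "mA"), ("u1", "mB")]
example_features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
example_edges, example_node_features = build_graph(example_transactions, example_features)
```

With this layout, the row index of the feature table is the node id, which matches the idea of saving transaction features "using the node id as index". How the edge list is converted into the format a particular GNN library expects is not shown here.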
+
+ +[<-- Top](../README.md)
+[<-- Back: Workflow](./workflow.md)
+[--> Next: Setup](./setup.md)
diff --git a/ai-credit-fraud-workflow/docs/run_notebooks.md b/ai-credit-fraud-workflow/docs/run_notebooks.md
new file mode 100644
index 0000000..d62a0a2
--- /dev/null
+++ b/ai-credit-fraud-workflow/docs/run_notebooks.md
@@ -0,0 +1,79 @@
+# Running the Notebooks
+This page goes over the sequence in which to run the various notebooks.
+Please note that once the data is prepared, both datasets leverage the same notebooks for training.
+
+__Note:__ It is assumed that the data has been downloaded and placed in the raw folder for each respective dataset.
+If not, please see: [setup](./setup.md)
+
+__Note__: It is also assumed that Jupyter has been started and the conda environment has been added. See [setup](./setup.md)
+
+__Note__: Before executing the `train_xgboost.ipynb` and `train_gnn_based_xgboost.ipynb` notebooks, make sure that the right dataset is selected in the second code cell of each notebook.
+
+For the TabFormer dataset, set
+```code
+ DATASET = TABFORMER
+```
+and for the Sparkov dataset, set
+```code
+ DATASET = SPARKOV
+```
+
+## TabFormer
+
+### Step 1: Prepare the data
+Run `notebooks/preprocess_Tabformer.ipynb`
+
+This will produce a number of files under `./data/TabFormer/gnn` and `./data/TabFormer/xgb`. It will also save the data preprocessor pipeline `preprocessor.pkl` and a few variables in a JSON file `variables.json` under the `./data/TabFormer` directory.
+
+### Step 2: Build the model
+Run `notebooks/train_gnn_based_xgboost.ipynb`
+
+This will produce two files for the GNN-based XGBoost model under the `./data/TabFormer/models` directory.
+
+### Step 3: Run Inference
+Run `notebooks/inference_gnn_based_xgboost_TabFormer.ipynb`
+
+### Optional: Pure XGBoost
+Two additional notebooks are provided to build a pure XGBoost model (without GNN) and perform inference using that model.
+
+__Train__
+Run `notebooks/train_xgboost.ipynb`
+
+This will produce an XGBoost model under the `./data/TabFormer/models` directory.
+
+__Inference__
+Run `notebooks/inference_xgboost_TabFormer.ipynb`
+
+
+
+## Sparkov
+
+__Note__: Make sure to restart the Jupyter kernel before running `train_gnn_based_xgboost.ipynb` for the second dataset.
+
+### Step 1: Prepare the data
+Run `notebooks/preprocess_Sparkov.ipynb`
+
+This will produce a number of files under `./data/Sparkov/gnn` and `./data/Sparkov/xgb`. It will also save the data preprocessor pipeline `preprocessor.pkl` and a few variables in a JSON file `variables.json` under the `./data/Sparkov` directory.
+
+### Step 2: Build the model
+Run `notebooks/train_gnn_based_xgboost.ipynb`
+
+This will produce two files for the GNN-based XGBoost model under the `./data/Sparkov/models` directory.
+
+
+### Optional: Pure XGBoost
+Two additional notebooks are provided to build a pure XGBoost model (without GNN) and perform inference using that model.
+
+__Train__
+Run `notebooks/train_xgboost.ipynb`
+
+This will produce an XGBoost model under the `./data/Sparkov/models` directory.
+
+__Inference__
+Run `notebooks/inference_xgboost_Sparkov.ipynb`
+
+
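The ordering described on this page can also be scripted. The sketch below is a hypothetical helper and not part of the repository: it only builds (and optionally runs) `jupyter nbconvert --to notebook --execute` commands in the order listed above. The names `PIPELINES`, `nbconvert_commands`, `run_pipeline` and the `_out` output suffix are assumptions, and the script assumes it is started from the `notebooks` folder with `DATASET` already set by hand in the training notebook.

```python
import subprocess

# Notebook order per dataset, taken from the steps listed on this page.
PIPELINES = {
    "TabFormer": [
        "preprocess_Tabformer.ipynb",
        "train_gnn_based_xgboost.ipynb",
        "inference_gnn_based_xgboost_TabFormer.ipynb",
    ],
    "Sparkov": [
        "preprocess_Sparkov.ipynb",
        "train_gnn_based_xgboost.ipynb",
    ],
}


def nbconvert_commands(dataset, out_suffix="_out"):
    """Return one nbconvert command (as an argument list) per notebook, in order."""
    commands = []
    for name in PIPELINES[dataset]:
        output = name.replace(".ipynb", f"{out_suffix}.ipynb")
        commands.append(
            ["jupyter", "nbconvert", "--to", "notebook", "--execute", name, "--output", output]
        )
    return commands


def run_pipeline(dataset, dry_run=True):
    """Print the commands; execute them one after another unless dry_run is set."""
    commands = nbconvert_commands(dataset)
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sequence as soon as one notebook fails.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `run_pipeline("TabFormer")` only prints the three commands; passing `dry_run=False` would execute them, which requires the environment from [setup](./setup.md) and the downloaded data to be in place.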
+
+ +[<-- Top](../README.md)
+
diff --git a/ai-credit-fraud-workflow/docs/setup.md b/ai-credit-fraud-workflow/docs/setup.md
new file mode 100644
index 0000000..98dd1ea
--- /dev/null
+++ b/ai-credit-fraud-workflow/docs/setup.md
@@ -0,0 +1,191 @@
+# Setup
+There are a number of ways that the notebooks can be executed.
+
+
+
+## Step 1: Clone the repo
+
+`cd` into the base directory where you plan to house the code.
+
+```bash
+git clone https://github.com/nv-morpheus/morpheus-experimental
+cd ./morpheus-experimental/ai-credit-fraud-workflow
+```
+
+## Step 2: Download the datasets
+
+__TabFormer__
+1. Download the dataset: https://ibm.ent.box.com/v/tabformer-data/folder/130747715605
+2. Untar and uncompress the file: `tar -xvzf ./transactions.tgz`
+3. Place the extracted file in the ___"./data/TabFormer/raw"___ folder
+
+
+__Sparkov__
+1. Download the dataset from: https://www.kaggle.com/datasets/kartik2112/fraud-detection
+2. Unzip the "archive.zip" file
+    * that will produce a folder with two files
+3. Place the two files under the __"./data/Sparkov/raw"__ folder
+
+## Step 3: Create a new conda environment
+
+You can get a minimal installation of Conda and Mamba using [Miniforge](https://github.com/conda-forge/miniforge).
+
+Then create an environment using the following command.
+
+Make sure that your shell or command prompt is pointing to `morpheus-experimental/ai-credit-fraud-workflow` before running `mamba env create`.
+
+```bash
+~/morpheus-experimental/ai-credit-fraud-workflow$ mamba env create -f conda/fraud_conda_env.yaml
+```
+
+
+Alternatively, you can install [Miniconda](https://docs.anaconda.com/miniconda/miniconda-install) and run the following commands to create an environment to run the notebooks.
+
+Install `mamba` first with
+
+```bash
+conda install conda-forge::mamba
+```
+Then run `mamba env create` from the right directory as shown below.
+
+```bash
+~/morpheus-experimental/ai-credit-fraud-workflow$ mamba env create -f conda/fraud_conda_env.yaml
+```
+
+Finally, activate the environment.
+
+```bash
+conda activate fraud_conda_env
+```
+
+All the notebooks are located under `morpheus-experimental/ai-credit-fraud-workflow/notebooks`.
+
+```bash
+~/morpheus-experimental/ai-credit-fraud-workflow$ cd notebooks
+~/morpheus-experimental/ai-credit-fraud-workflow/notebooks$ ls -1
+inference_gnn_based_xgboost_TabFormer.ipynb
+inference_xgboost_Sparkov.ipynb
+inference_xgboost_TabFormer.ipynb
+preprocess_Sparkov.ipynb
+preprocess_Tabformer.ipynb
+train_gnn_based_xgboost.ipynb
+train_xgboost.ipynb
+```
+
+Now you can run the notebooks from VS Code. Note that you need to select `fraud_conda_env` as the kernel in VS Code to run the notebooks. Alternatively, you can run the notebooks using Jupyter Notebook or JupyterLab.
You will need to add the conda environment as a Jupyter kernel: `ipython kernel install --user --name=fraud_conda_env`
+
+
+#### NOTE: Notebooks need to be executed in the correct order
+The notebooks need to be executed in the correct order. For a particular dataset, the preprocessing notebook must be executed before the training notebook. Once the training notebook produces models, the inference notebook can be executed to run inference on unseen data.
+
+For example, for the TabFormer dataset, the notebooks need to be executed in the following order:
+
+ - preprocess_Tabformer.ipynb
+ - train_gnn_based_xgboost.ipynb
+ - inference_gnn_based_xgboost_TabFormer.ipynb
+
+To train a standalone XGBoost model that doesn't use node embeddings, run the following two notebooks in order:
+
+ - train_xgboost.ipynb
+ - inference_xgboost_TabFormer.ipynb
+
+## Docker container (alternative to creating a conda environment)
+
+If you don't want to create a conda environment locally, you can spin up a Docker container either on your local machine or a remote one and execute the notebooks from a browser or the terminal.
+
+### Running locally
+
+Clone the [repo](https://github.com/nv-morpheus/morpheus-experimental) and `cd` into the project folder
+```bash
+git clone https://github.com/nv-morpheus/morpheus-experimental
+cd morpheus-experimental/ai-credit-fraud-workflow
+```
+
+
+### Build docker image and run a container with port forwarding
+
+Build the docker image from `morpheus-experimental/ai-credit-fraud-workflow`
+```bash
+~/morpheus-experimental/ai-credit-fraud-workflow$ docker build --no-cache -t fraud-detection-app .
+```
+
+Then run a container from `morpheus-experimental/ai-credit-fraud-workflow`
+
+```bash
+~/morpheus-experimental/ai-credit-fraud-workflow$ docker run --gpus all -it --rm -v $(pwd):/ai-credit-fraud-workflow -p 8888:8888 fraud-detection-app
+```
+
+This will give you an interactive shell inside the docker container.
All the notebooks should be accessible under `/ai-credit-fraud-workflow/notebooks` inside the container. + +__Note__: `-v $(pwd):/ai-credit-fraud-workflow` in the `docker run` command mounts the `morpheus-experimental/ai-credit-fraud-workflow` directory from the host machine into the Docker container as `/ai-credit-fraud-workflow`. + +You can list the notebooks from the interactive shell of the docker container. Note that you will have a different container id than the one shown (7c593d76f681) in the example output below. + +```bash +root@7c593d76f681:/ai-credit-fraud-workflow# ls +Dockerfile LICENSE README.md conda data docs img notebooks python requirements.txt + +root@7c593d76f681:/ai-credit-fraud-workflow# cd notebooks/ + +root@7c593d76f681:/ai-credit-fraud-workflow/notebooks# ls -1 +inference_gnn_based_xgboost_TabFormer.ipynb +inference_xgboost_Sparkov.ipynb +inference_xgboost_TabFormer.ipynb +preprocess_Sparkov.ipynb +preprocess_Tabformer.ipynb +train_gnn_based_xgboost.ipynb +train_xgboost.ipynb +``` + +### Launch Jupyter Notebook inside the container + +Run the following command from the interactive shell inside the docker container. +```bash +root@7c593d76f681:/ai-credit-fraud-workflow# jupyter notebook . +``` +It will display a URL with a token +http://127.0.0.1:8888/tree?token= + +Now you can browse to the `notebooks` folder, and run or edit the notebooks from a browser at that URL. + + +If you are not interested in running/editing the notebooks from a browser, you can omit the port forwarding option. + +```bash +~/morpheus-experimental/ai-credit-fraud-workflow$ docker build --no-cache -t fraud-detection-app . +``` + +```bash +~/morpheus-experimental/ai-credit-fraud-workflow$ docker run --gpus all -it --rm -v $(pwd):/ai-credit-fraud-workflow fraud-detection-app +``` + +This will give you an interactive shell inside the docker container. + +Then you can run any notebook using the following command inside the container. 
+ +```bash +root@7c593d76f681:/ai-credit-fraud-workflow# cd notebooks +root@7c593d76f681:/ai-credit-fraud-workflow/notebooks# jupyter nbconvert --to notebook --execute [NAME_OF_THE_NOTEBOOK].ipynb --output [NAME_OF_THE_OUTPUT_NOTEBOOK].ipynb +``` + +### Running on a remote machine + +### Copy the dataset to the right folder + +```bash +scp path/to/downloaded-file-in-your-local-machine user@remote_host_name:path/to/ai-credit-fraud-workflow/data/[DATASET_NAME]/raw +``` + +Make sure to place the unzipped csv file in the `ai-credit-fraud-workflow/data/[DATASET_NAME]/raw` folder. + + +Log in to your remote machine from your host machine, with ssh tunneling/port forwarding + +```bash +ssh -L 127.0.0.1:8888:127.0.0.1:8888 USER@REMOTE_HOST_NAME_OR_IP +``` + +Then follow the steps described under the section `Launch Jupyter Notebook inside the container`. Finally, open the URL in a browser on your host machine to run/edit the notebooks. + +[<-- Top](../README.md)
+[<-- Back: Datasets](./datasets.md)
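The notebook ordering and the `jupyter nbconvert` command described in this guide can be combined into a small helper. The following is only a sketch: the `nbconvert_commands` function and the `_executed` output suffix are hypothetical names chosen for illustration, and the notebook order is the TabFormer example given above.

```python
import shlex

# Execution order taken from the TabFormer example in this guide.
TABFORMER_GNN_PIPELINE = [
    "preprocess_Tabformer.ipynb",
    "train_gnn_based_xgboost.ipynb",
    "inference_gnn_based_xgboost_TabFormer.ipynb",
]


def nbconvert_commands(notebooks, suffix="_executed"):
    """Return one `jupyter nbconvert` command per notebook, in the given order.

    `suffix` is a hypothetical naming choice for the executed copies.
    """
    commands = []
    for name in notebooks:
        stem = name[: -len(".ipynb")] if name.endswith(".ipynb") else name
        output = f"{stem}{suffix}.ipynb"
        commands.append(
            "jupyter nbconvert --to notebook --execute "
            f"{shlex.quote(name)} --output {shlex.quote(output)}"
        )
    return commands


commands = nbconvert_commands(TABFORMER_GNN_PIPELINE)
for command in commands:
    print(command)
```

The commands are printed rather than executed so they can be reviewed first and then run from the `notebooks` folder inside the container.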
diff --git a/ai-credit-fraud-workflow/docs/workflow.md b/ai-credit-fraud-workflow/docs/workflow.md new file mode 100644 index 0000000..dfa3809 --- /dev/null +++ b/ai-credit-fraud-workflow/docs/workflow.md @@ -0,0 +1,50 @@ +# High-Level Architecture +The general fraud architecture, as depicted below at a very high level, uses +Morpheus to continually inspect and classify all incoming data. What is not +shown in the diagram is what a customer should do if fraud is detected; the +architecture simply shows tagged data being sent to downstream processing. +Those processes should be well defined in the customer's organization. +Additionally, the post-detection data, which we call the Tagged Data +Stream, should be stored in a database or data lake. Cleaning and preparing +the data could be done using Spark RAPIDS. + +Fraud attacks are continually evolving, so it is important to always +be using the latest model. That requires the model(s) to be updated as often as +possible. Therefore, the diagram depicts a loop where the GNN Model Building +process is run periodically, the frequency of which depends on model +training latency. Given how dynamic this industry is with evolving fraud +trends, institutions that train models adaptively on a frequent basis tend to +have better fraud prevention KPIs than their competitors. + +
+

+ +

+
+ +# This Workflow +The above architecture would be typical within a larger financial system, where incoming data runs through the inference engine first and a new model is built periodically. However, for this example, the workflow starts with model building and ends with inference. The workflow is depicted below: + +
+

+ +

+
+ + 1. __Data Prep__: the sample dataset is cleaned and prepared, using tools like NVIDIA RAPIDS for efficiency. Data preparation and feature engineering have a significant impact on the performance of the model produced. See the data preparation section for the steps we took to get the best results. + - Input: The sample dataset + - Output: Two files: (1) a dataset for training the model and (2) a dataset to be used for inference. + +2. __Model Building__: this process takes the training data and feeds it into cugraph-pyg for GNN training. However, rather than using the GNN itself as the final classifier, the output of the last hidden layer of the GNN is extracted as embeddings and fed into XGBoost to produce a model. + - Input: Training data file + - Output: an XGBoost model and a GNN model that encodes the data + +3. __Inference__: The test dataset, extracted from the sample dataset, is fed into the inference engine. The output is a confusion matrix showing the number of detected fraud cases, the number of missed fraud cases, and the number of transactions misidentified as fraud (false positives). + + +
+
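To make the output of the Inference step concrete, the counts reported in the confusion matrix can be sketched as follows. This is a minimal illustration in plain Python, not the code used in the notebooks (which rely on `cuml.metrics.confusion_matrix`); the function name `fraud_confusion_counts` and the sample labels are hypothetical.

```python
from collections import Counter


def fraud_confusion_counts(y_true, y_pred):
    """Count the four outcomes of a binary fraud classifier (1 = fraud, 0 = non-fraud)."""
    counts = Counter(zip(y_true, y_pred))
    return {
        "detected_fraud": counts[(1, 1)],   # true positives
        "missed_fraud": counts[(1, 0)],     # false negatives
        "false_positives": counts[(0, 1)],  # legitimate transactions flagged as fraud
        "true_negatives": counts[(0, 0)],   # legitimate transactions left alone
    }


# Hypothetical labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 0, 1]

summary = fraud_confusion_counts(y_true, y_pred)
print(summary)
```

Because fraud is rare in practice, the missed-fraud and false-positive counts are usually more informative than overall accuracy.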
+ +[<-- Top](../README.md)
+[<-- Back: Background](./background.md)
+[--> Next: Datasets](./datasets.md) diff --git a/ai-credit-fraud-workflow/img/3-partite.jpg b/ai-credit-fraud-workflow/img/3-partite.jpg new file mode 100644 index 0000000..0d6dbf9 Binary files /dev/null and b/ai-credit-fraud-workflow/img/3-partite.jpg differ diff --git a/ai-credit-fraud-workflow/img/High-Level.jpg b/ai-credit-fraud-workflow/img/High-Level.jpg new file mode 100644 index 0000000..6061131 Binary files /dev/null and b/ai-credit-fraud-workflow/img/High-Level.jpg differ diff --git a/ai-credit-fraud-workflow/img/Model-Building.png b/ai-credit-fraud-workflow/img/Model-Building.png new file mode 100644 index 0000000..e8bc788 Binary files /dev/null and b/ai-credit-fraud-workflow/img/Model-Building.png differ diff --git a/ai-credit-fraud-workflow/img/Splash.jpg b/ai-credit-fraud-workflow/img/Splash.jpg new file mode 100644 index 0000000..1d6783d Binary files /dev/null and b/ai-credit-fraud-workflow/img/Splash.jpg differ diff --git a/ai-credit-fraud-workflow/img/this-workflow.jpg b/ai-credit-fraud-workflow/img/this-workflow.jpg new file mode 100644 index 0000000..15fa2f6 Binary files /dev/null and b/ai-credit-fraud-workflow/img/this-workflow.jpg differ diff --git a/ai-credit-fraud-workflow/notebooks/inference_gnn_based_xgboost_TabFormer.ipynb b/ai-credit-fraud-workflow/notebooks/inference_gnn_based_xgboost_TabFormer.ipynb new file mode 100644 index 0000000..e6edd21 --- /dev/null +++ b/ai-credit-fraud-workflow/notebooks/inference_gnn_based_xgboost_TabFormer.ipynb @@ -0,0 +1,671 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Inference on TabFormer Data\n", + "This notebook loads a pre-trained GNN (GraphSAGE) model and an XGBoost model and runs inference on raw data." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Goals\n", + "* Outline the steps to transform new raw data before feeding it into the models.\n", + "* Simulate the use of trained models on new data during inference." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Import packages" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import cudf\n", + "import pickle\n", + "import torch\n", + "import torch.nn as nn\n", + "import torch.nn.functional as F\n", + "from torch_geometric.nn import SAGEConv\n", + "import os\n", + "import xgboost as xgb" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Path to the pre-trained GraphSAGE and the XGBoost models" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "dataset_base_path = '../data/TabFormer'\n", + "model_root_dir = os.path.join(dataset_base_path, 'models')\n", + "gnn_model_path = os.path.join(model_root_dir, 'node_embedder.pth')\n", + "xgb_model_path = os.path.join(model_root_dir, 'embedding_based_xgb_model.json')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Definition of the trained GraphSAGE model" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "class GraphSAGE(torch.nn.Module):\n", + " def __init__(self, in_channels, hidden_channels, out_channels, n_hops, dropout_prob=0.25):\n", + " super(GraphSAGE, self).__init__()\n", + "\n", + " # list of conv layers\n", + " self.convs = nn.ModuleList()\n", + " # add first conv layer to the list\n", + " self.convs.append(SAGEConv(in_channels, hidden_channels))\n", + " # add the remaining conv layers to the list\n", + " for _ in range(n_hops - 1):\n", + " self.convs.append(SAGEConv(hidden_channels, hidden_channels))\n", + " \n", + " # output layer\n", + " self.fc = nn.Linear(hidden_channels, out_channels) \n", + "\n", + " def forward(self, x, edge_index, return_hidden=False):\n", + "\n", + " for conv in self.convs:\n", + " x = conv(x, edge_index)\n", + " x = F.relu(x)\n", + " x = 
F.dropout(x, p=0.5, training=self.training)\n", + " \n", + " if return_hidden:\n", + " return x\n", + " else:\n", + " return self.fc(x)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "parameters" + ] + }, + "source": [ + "### Load the models" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Load the pre-trained GraphSAGE model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Load GNN model for generating node embeddings\n", + "gnn_model = torch.load(gnn_model_path)\n", + "gnn_model.eval() # Set the model to evaluation mode" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Load the pre-trained XGBoost model" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "# Load xgboost model for node classification\n", + "loaded_bst = xgb.Booster()\n", + "loaded_bst.load_model(xgb_model_path)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Define a function to evaluate the XGBoost model" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "from cuml.metrics import confusion_matrix\n", + "from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score\n", + "import cupy as cp\n", + "from torch.utils.dlpack import to_dlpack\n", + "\n", + "def evaluate_xgboost(bst, embeddings, labels):\n", + " \"\"\"\n", + " Evaluates the performance of the XGBoost model by calculating different metrics.\n", + "\n", + " Parameters:\n", + " ----------\n", + " bst : xgboost.Booster\n", + " The trained XGBoost model to be evaluated.\n", + " embeddings : torch.Tensor\n", + " The input feature embeddings for transaction nodes.\n", + " labels : torch.Tensor\n", + " The target labels (Fraud or Non-fraud) transaction, with the same length as the number of \n", + " 
rows in `embeddings`.\n", + " Returns:\n", + " -------\n", + " Confusion matrix\n", + " \"\"\"\n", + "\n", + " # Convert embeddings to cuDF DataFrame\n", + " embeddings_cudf = cudf.DataFrame(cp.from_dlpack(to_dlpack(embeddings)))\n", + " \n", + " # Create DMatrix for the test embeddings\n", + " dtest = xgb.DMatrix(embeddings_cudf)\n", + " \n", + " # Predict using XGBoost on GPU\n", + " preds = bst.predict(dtest)\n", + " pred_labels = (preds > 0.5).astype(int)\n", + "\n", + " # Move labels to CPU for evaluation\n", + " labels_cpu = labels.cpu().numpy()\n", + "\n", + " # Compute evaluation metrics\n", + " accuracy = accuracy_score(labels_cpu, pred_labels)\n", + " precision = precision_score(labels_cpu, pred_labels, zero_division=0)\n", + " recall = recall_score(labels_cpu, pred_labels, zero_division=0)\n", + " f1 = f1_score(labels_cpu, pred_labels, zero_division=0)\n", + " roc_auc = roc_auc_score(labels_cpu, preds)\n", + "\n", + " print(f\"Performance of XGBoost model trained on node embeddings\")\n", + " print(f\"Accuracy: {accuracy:.4f}\")\n", + " print(f\"Precision: {precision:.4f}\")\n", + " print(f\"Recall: {recall:.4f}\")\n", + " print(f\"F1 Score: {f1:.4f}\")\n", + " print(f\"ROC AUC: {roc_auc:.4f}\")\n", + "\n", + " conf_mat = confusion_matrix(labels.cpu().numpy(), pred_labels)\n", + " print('Confusion Matrix:', conf_mat)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "___\n", + "### Evaluate the XGBoost model on untransformed test data (saved in the preprocessing notebook)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Read untransformed data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pd.set_option('future.no_silent_downcasting', True) \n", + "path_to_untransformed_data = os.path.join(dataset_base_path, 'xgb', 'untransformed_test.csv')\n", + "untransformed_df = pd.read_csv(path_to_untransformed_data)\n", + 
"untransformed_df.head(5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Load the data transformer and transform the data using the loaded transformer" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "with open(os.path.join(dataset_base_path, 'preprocessor.pkl'),'rb') as f:\n", + " loaded_transformer = pickle.load(f)\n", + " transformed_data = loaded_transformer.transform(untransformed_df.loc[:, untransformed_df.columns[:-1]])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Evaluate the model on the transformed data" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n", + "# Convert data to torch tensors\n", + "X = torch.tensor(transformed_data).to(torch.float32).to(device)\n", + "y = torch.tensor(untransformed_df[untransformed_df.columns[-1]].values ).to(torch.long).to(device)" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "# Generate node embedding using the GNN model\n", + "test_embeddings = gnn_model(\n", + " X.to(device), torch.tensor([[], []], dtype=torch.int).to(device), return_hidden=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Evaluate the XGBoost model\n", + "evaluate_xgboost(loaded_bst, test_embeddings, y)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "___\n", + "## Predictions on raw input\n", + "The purpose is to demonstrate the use of the models during inference" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Read raw data" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "# Read example raw inputs\n", + "raw_file_path = 
os.path.join(dataset_base_path, 'xgb', 'example_transactions.csv')\n", + "data = pd.read_csv(raw_file_path)\n", + "data = data[data.columns[:-1]]\n", + "original_data = data.copy()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Transform raw data\n", + "* Perform the same set of transformations on the raw data as was done on the training data." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Rename columns before the data is fed into the pre-fitted data transformer" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "# Rename columns before the data is fed into the data transformer\n", + "COL_USER = 'User'\n", + "COL_CARD = 'Card'\n", + "COL_AMOUNT = 'Amount'\n", + "COL_MCC = 'MCC'\n", + "COL_TIME = 'Time'\n", + "COL_DAY = 'Day'\n", + "COL_MONTH = 'Month'\n", + "COL_YEAR = 'Year'\n", + "\n", + "COL_MERCHANT = 'Merchant'\n", + "COL_STATE ='State'\n", + "COL_CITY ='City'\n", + "COL_ZIP = 'Zip'\n", + "COL_ERROR = 'Errors'\n", + "COL_CHIP = 'Chip'\n", + "\n", + "\n", + "_ = data.rename(columns={\n", + " \"Merchant Name\": COL_MERCHANT,\n", + " \"Merchant State\": COL_STATE,\n", + " \"Merchant City\": COL_CITY,\n", + " \"Errors?\": COL_ERROR,\n", + " \"Use Chip\": COL_CHIP\n", + " },\n", + " inplace=True\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Handle unknown values as was done for the training data" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "UNKNOWN_STRING_MARKER = 'XX'\n", + "UNKNOWN_ZIP_CODE = 0\n", + "MAX_NR_CARDS_PER_YEAR = 9\n", + "\n", + "data[COL_STATE] = data[COL_STATE].fillna(UNKNOWN_STRING_MARKER)\n", + "data[COL_ERROR] = data[COL_ERROR].fillna(UNKNOWN_STRING_MARKER)\n", + "data[COL_ZIP] = data[COL_ZIP].fillna(UNKNOWN_ZIP_CODE)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Convert 
column type and remove \"$\" and \",\" as was done for the training data" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "data[COL_AMOUNT] = data[COL_AMOUNT].str.replace(\"$\",\"\").astype(\"float\")\n", + "data[COL_STATE] = data[COL_STATE].astype('str')\n", + "data[COL_MERCHANT] = data[COL_MERCHANT].astype('str')\n", + "data[COL_ERROR] = data[COL_ERROR].str.replace(\",\",\"\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Combine User and Card to generate unique numbers as was done for the training data" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "data[COL_CARD] = data[COL_USER] * MAX_NR_CARDS_PER_YEAR + data[COL_CARD]\n", + "data[COL_CARD] = data[COL_CARD].astype('int')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Check if the transactions have unknown users or merchants" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "# Find the known merchants and (users, cards), i.e. 
the merchants and (users, cards) that are in training data\n", + "known_merchants = set()\n", + "known_cards = set()\n", + "\n", + "for enc in loaded_transformer.named_transformers_['binary'].named_steps['binary'].ordinal_encoder.mapping:\n", + " if enc['col'] == COL_MERCHANT:\n", + " known_merchants = set(enc['mapping'].keys())\n", + " if enc['col'] == COL_CARD:\n", + " known_cards = set(enc['mapping'].keys())" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "# Is user, card already known\n", + "data['Is_card_known'] = data[COL_CARD].map(lambda c: c in known_cards)" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "# Is merchant already known\n", + "data['Is_merchant_known'] = data[COL_MERCHANT].map(lambda m: m in known_merchants )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Use the same set of predictor columns as used for training" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [], + "source": [ + "numerical_predictors = [COL_AMOUNT]\n", + "nominal_predictors = [COL_ERROR, COL_CARD, COL_CHIP, COL_CITY, COL_ZIP, COL_MCC, COL_MERCHANT]\n", + "\n", + "predictor_columns = numerical_predictors + nominal_predictors" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Load the data transformer and transform the raw data" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [], + "source": [ + "with open(os.path.join(dataset_base_path, 'preprocessor.pkl'),'rb') as f:\n", + " loaded_transformer = pickle.load(f)\n", + " transformed_data = loaded_transformer.transform(data[predictor_columns])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Run prediction" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [], + "source": [ + "# 
Set the device to GPU if available, otherwise default to CPU\n", + "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n", + "\n", + "# Convert data to torch tensors\n", + "X = torch.tensor(transformed_data).to(torch.float32).to(device)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# Generate node embedding using the GraphSAGE model\n", + "transaction_embeddings = gnn_model(\n", + " X.to(device), torch.tensor([[], []], dtype=torch.int).to(device), return_hidden=True)\n", + "\n", + "embeddings_cudf = cudf.DataFrame(cp.from_dlpack(to_dlpack(transaction_embeddings)))" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [], + "source": [ + "# predict if the transaction(s) are fraud\n", + "preds = loaded_bst.predict(xgb.DMatrix(embeddings_cudf))\n", + "pred_labels = (preds > 0.5).astype(int)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### If the transactions have unknown (user, card) or merchant, mark it as fraud" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [], + "source": [ + "# Name of the target column\n", + "target_col_name = 'Is Fraud?'\n", + "\n", + "data[target_col_name] = pred_labels\n", + "data[target_col_name] = data.apply(\n", + " lambda row: \n", + " (row[target_col_name] == 1) or (row['Is_card_known'] == False) or (row['Is_merchant_known'] == False), axis=1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Label the raw data as Fraud or Non-Fraud, based on prediction" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# Change 0 to No (non-Fraud) and 1 to Yes (Fraud)\n", + "binary_to_fraud = { False: 'No', True : 'Yes'}\n", + "data[target_col_name] = data[target_col_name].map(binary_to_fraud).astype('str')\n", + "original_data[target_col_name] 
= data[target_col_name]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Raw data with predicted labels (Fraud or Non-Fraud)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "original_data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Copyright and License\n", + "
\n", + "Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.\n", + "\n", + "
\n", + "\n", + " Licensed under the Apache License, Version 2.0 (the \"License\");\n", + " you may not use this file except in compliance with the License.\n", + " You may obtain a copy of the License at\n", + " \n", + " http://www.apache.org/licenses/LICENSE-2.0\n", + " \n", + " Unless required by applicable law or agreed to in writing, software\n", + " distributed under the License is distributed on an \"AS IS\" BASIS,\n", + " WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + " See the License for the specific language governing permissions and\n", + " limitations under the License." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "mamba_env", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.10" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/ai-credit-fraud-workflow/notebooks/inference_xgboost_Sparkov.ipynb b/ai-credit-fraud-workflow/notebooks/inference_xgboost_Sparkov.ipynb new file mode 100644 index 0000000..490199c --- /dev/null +++ b/ai-credit-fraud-workflow/notebooks/inference_xgboost_Sparkov.ipynb @@ -0,0 +1,560 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## This notebook loads a pre-trained XGBoost model and runs inference on raw data\n", + "__NOTE__: This XGBoost model does not leverage embeddings from the GNN (GraphSAGE) model." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Goals\n", + "* Outline the steps to transform new raw data before feeding it into the model.\n", + "* Simulate the use of the trained model on new data during inference." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Import packages" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import pickle\n", + "import json\n", + "import os\n", + "import xgboost as xgb\n", + "from cuml.metrics import confusion_matrix\n", + "from sklearn.metrics import (\n", + " accuracy_score,\n", + " precision_score,\n", + " recall_score,\n", + " f1_score,\n", + " roc_auc_score)\n", + "import numpy as np\n", + "import pandas as pd" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Path to the pre-trained XGBoost model and data" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "dataset_base_path = '../data/Sparkov'\n", + "model_root_dir = os.path.join(dataset_base_path, 'models')\n", + "model_file_name = 'xgboost_model.json'\n", + "xgb_model_path = os.path.join(model_root_dir, model_file_name)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "parameters" + ] + }, + "source": [ + "#### Load the model" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "# Load xgboost model for node classification\n", + "loaded_bst = xgb.Booster()\n", + "loaded_bst.load_model(xgb_model_path)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Load column names and other global variable saved during the training" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "# Read the JSON file\n", + "with open(os.path.join(dataset_base_path, 'variables.json'), 'r') as json_file:\n", + " column_names = json.load(json_file)\n", + "\n", + "# Repopulate the variables in the global namespace\n", + "globals().update(column_names)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "___\n", + "#### Evaluate the 
XGBoost model on untransformed test data (saved in the preprocessing notebook)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Read untransformed data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pd.set_option('future.no_silent_downcasting', True) \n", + "path_to_untransformed_data = os.path.join(dataset_base_path, 'xgb', 'untransformed_test.csv')\n", + "untransformed_df = pd.read_csv(path_to_untransformed_data)\n", + "untransformed_df.head(5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Load the data transformer and transform the data using the loaded transformer" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "with open(os.path.join(dataset_base_path, 'preprocessor.pkl'),'rb') as f:\n", + " loaded_transformer = pickle.load(f)\n", + " transformed_data = loaded_transformer.transform(\n", + " untransformed_df.loc[:, untransformed_df.columns[:-1]])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Evaluate the model on the transformed data" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "# Predictor columns used for training\n", + "numerical_predictors = [COL_AMOUNT, COL_SPEED, COL_AGE]\n", + "nominal_predictors = [COL_CARD, COL_ZIP, COL_MCC, COL_MERCHANT, COL_JOB]\n", + "\n", + "predictor_columns = numerical_predictors + nominal_predictors\n", + "\n", + "target_column = [COL_FRAUD]\n", + "\n", + "# transformed column names\n", + "columns_of_transformed_data = list(\n", + " map(lambda name: name.split('__')[1],\n", + " list(loaded_transformer.get_feature_names_out(predictor_columns))))" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "# Prepare features (X) and target (y)\n", + "\n", + "X = pd.DataFrame(\n", + 
" transformed_data, columns=columns_of_transformed_data)\n", + "\n", + "y = untransformed_df[untransformed_df.columns[-1]].values" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "# Make predictions\n", + "y_pred_prob = loaded_bst.predict(xgb.DMatrix(data=X, label=y))\n", + "\n", + "y_pred = (y_pred_prob >= 0.5).astype(int)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Compute metrics to evaluate model performance" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# Accuracy\n", + "accuracy = accuracy_score(y, y_pred)\n", + "print(f'Accuracy: {accuracy:.4f}')\n", + "\n", + "# Confusion Matrix\n", + "conf_mat = confusion_matrix(y, y_pred)\n", + "print('Confusion Matrix:')\n", + "print(conf_mat)\n", + "\n", + "# ROC AUC Score\n", + "r_auc = roc_auc_score(y, y_pred_prob)\n", + "print(f'ROC AUC Score: {r_auc:.4f}')\n", + "\n", + "# y = cupy.asnumpy(y)\n", + "# Precision\n", + "precision = precision_score(y, y_pred)\n", + "print(f'Precision: {precision:.4f}')\n", + "\n", + "# Recall\n", + "recall = recall_score(y, y_pred)\n", + "print(f'Recall: {recall:.4f}')\n", + "\n", + "# F1 Score\n", + "f1 = f1_score(y, y_pred)\n", + "print(f'F1 Score: {f1:.4f}')\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "___\n", + "### Prediction on raw inputs\n", + "* The purpose is to demonstrate the use of the model during inference" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Read raw data" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "raw_file_path = os.path.join(dataset_base_path, 'xgb', 'example_transactions.csv')\n", + "data = pd.read_csv(raw_file_path)\n", + "data = data[data.columns[:-1]]\n", + "original_data = data.copy()" + ] + }, + { + "cell_type": "markdown", + 
"metadata": {}, + "source": [ + "##### Check if the transactions have unknown users or merchants" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "# Find the known merchants and (users, cards), i.e. the merchants and (users, cards) that are in training data\n", + "known_merchants = set()\n", + "known_cards = set()\n", + "\n", + "for enc in loaded_transformer.named_transformers_['binary'].named_steps['binary'].ordinal_encoder.mapping:\n", + " if enc['col'] == COL_MERCHANT:\n", + " known_merchants = set(enc['mapping'].keys())\n", + " if enc['col'] == COL_CARD:\n", + " known_cards = set(enc['mapping'].keys())" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "# Is user, card already known\n", + "data['Is_card_known'] = data[COL_CARD].map(lambda c: c in known_cards)" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "# Is merchant already known\n", + "data['Is_merchant_known'] = data[COL_MERCHANT].map(lambda m: m in known_merchants )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### From ('lat', 'long'), ('merchant_lat', 'merchant_long') and unix_time to compute transaction speed" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "temp_df = pd.DataFrame()\n", + "import math\n", + "# Haversine formula function\n", + "def haversine(lat1, lon1, lat2, lon2):\n", + " # Radius of Earth in km\n", + " R = 6371.0\n", + "\n", + " # Convert degrees to radians\n", + " lat1 = math.radians(lat1)\n", + " lon1 = math.radians(lon1)\n", + " lat2 = math.radians(lat2)\n", + " lon2 = math.radians(lon2)\n", + "\n", + " # Differences in coordinates\n", + " dlat = lat2 - lat1\n", + " dlon = lon2 - lon1\n", + "\n", + " # Haversine formula\n", + " a = math.sin(dlat / 2)**2 + math.cos(lat1) * math.cos(lat2) * 
math.sin(dlon / 2)**2\n", + " c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))\n", + "\n", + " # Distance in kilometers\n", + " distance = R * c\n", + " return distance\n", + "\n", + "\n", + "temp_df= data[['unix_time', 'lat', 'long', 'merch_lat', 'merch_long']].copy()\n", + "temp_df['tx_duration'] = temp_df['unix_time'].apply(lambda x: x/1e9)\n", + "temp_df['distance_km'] = temp_df.apply(\n", + " lambda row: haversine(row['lat'], row['long'], row['merch_lat'], row['merch_long']), axis=1)\n", + "\n", + "data['speed'] = (temp_df['distance_km']/temp_df['tx_duration'])\n", + "del temp_df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Convert 'dob' to 'age' w.r.t. a reference date" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "data['dob'] = pd.to_datetime(data['dob'])\n", + "one_nanosecond = np.timedelta64(1, 'ns')\n", + "nanoseconds_in_year = 365.25 * 24 * 60 * 60 * 1e9\n", + "reference_date = pd.to_datetime('2024-10-30') \n", + "data['age'] = data['dob'].apply(lambda dob: (reference_date - dob)/ one_nanosecond / nanoseconds_in_year )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Set of predictor columns used for training the model" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "numerical_predictors = [COL_AMOUNT, COL_SPEED, COL_AGE]\n", + "nominal_predictors = [COL_CARD, COL_ZIP, COL_MCC, COL_MERCHANT, COL_JOB]\n", + "\n", + "predictor_columns = numerical_predictors + nominal_predictors\n", + "\n", + "target_column = [COL_FRAUD]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Transform input data using the pre-fitted data transformer" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "with open(os.path.join(dataset_base_path, 'preprocessor.pkl'),'rb') as 
f:\n", + " loaded_transformer = pickle.load(f)\n", + " transformed_data = loaded_transformer.transform(data[predictor_columns])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Prepare data and predict if the transactions are fraud" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "X = pd.DataFrame(\n", + " transformed_data, columns=columns_of_transformed_data)\n", + "\n", + "# Predict transactions\n", + "pred_probs = loaded_bst.predict(xgb.DMatrix(X))\n", + "pred_labels = (pred_probs >= 0.5).astype(int)\n", + "\n", + "# Name of the target column\n", + "target_col_name = 'Is Fraud?'\n", + "\n", + "data[target_col_name] = pred_labels\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### If the transactions have unknown (user, card) or merchant, mark it as fraud" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "data[target_col_name] = data.apply(\n", + " lambda row: (row[target_col_name] == 1) or (row['Is_card_known'] == False) or (row['Is_merchant_known'] == False), axis=1)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Label the raw data as Fraud or Non-Fraud, based on prediction" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# Change 0 to No (non-Fraud) and 1 to Yes (Fraud)\n", + "binary_to_text = { False: 'No', True : 'Yes'}\n", + "data[target_col_name] = data[target_col_name].map(binary_to_text).astype('str')\n", + "original_data[target_col_name] = data[target_col_name]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Transactions with predicted labels" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "original_data" + ] + }, + { + "cell_type": "markdown", + 
"metadata": {}, + "source": [ + "## Copyright and License\n", + "
\n", + "Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.\n", + "\n", + "
\n", + "\n", + " Licensed under the Apache License, Version 2.0 (the \"License\");\n", + " you may not use this file except in compliance with the License.\n", + " You may obtain a copy of the License at\n", + " \n", + " http://www.apache.org/licenses/LICENSE-2.0\n", + " \n", + " Unless required by applicable law or agreed to in writing, software\n", + " distributed under the License is distributed on an \"AS IS\" BASIS,\n", + " WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + " See the License for the specific language governing permissions and\n", + " limitations under the License." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "mamba_env", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.10" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/ai-credit-fraud-workflow/notebooks/inference_xgboost_TabFormer.ipynb b/ai-credit-fraud-workflow/notebooks/inference_xgboost_TabFormer.ipynb new file mode 100644 index 0000000..b0b6e8d --- /dev/null +++ b/ai-credit-fraud-workflow/notebooks/inference_xgboost_TabFormer.ipynb @@ -0,0 +1,578 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## This notebook loads a pre-trained XGBoost model and runs inference on raw data\n", + "__NOTE__: This XGBoost model does not leverage embeddings from the GNN (GraphSAGE) model." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Goals\n", + "* Outline the steps to transform new raw data before feeding it into the model.\n", + "* Simulate the use of the trained model on new data during inference." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Import packages" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import pickle\n", + "import json\n", + "import os\n", + "import xgboost as xgb\n", + "from cuml.metrics import confusion_matrix\n", + "from sklearn.metrics import (\n", + " accuracy_score,\n", + " precision_score,\n", + " recall_score,\n", + " f1_score,\n", + " roc_auc_score)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Path to the pre-trained XGBoost model and data" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "dataset_base_path = '../data/TabFormer'\n", + "model_root_dir = os.path.join(dataset_base_path, 'models')\n", + "model_file_name = 'xgboost_model.json'\n", + "xgb_model_path = os.path.join(model_root_dir, model_file_name)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "parameters" + ] + }, + "source": [ + "#### Load the model" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "# Load xgboost model for node classification\n", + "loaded_bst = xgb.Booster()\n", + "loaded_bst.load_model(xgb_model_path)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Load column names and other global variables saved during the training" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "# Read the JSON file\n", + "with open(os.path.join(dataset_base_path, 'variables.json'), 'r') as json_file:\n", + " column_names = json.load(json_file)\n", + "\n", + "# Repopulate the variables in the global namespace\n", + "globals().update(column_names)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "___\n", + "#### Evaluate the XGBoost model on 
untransformed test data (saved in the preprocessing notebook)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Read untransformed data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pd.set_option('future.no_silent_downcasting', True) \n", + "path_to_untransformed_data = os.path.join(dataset_base_path, 'xgb', 'untransformed_test.csv')\n", + "untransformed_df = pd.read_csv(path_to_untransformed_data)\n", + "untransformed_df.head(5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Load the data transformer and transform the data using the loaded transformer" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "with open(os.path.join(dataset_base_path, 'preprocessor.pkl'),'rb') as f:\n", + " loaded_transformer = pickle.load(f)\n", + " transformed_data = loaded_transformer.transform(\n", + " untransformed_df.loc[:, untransformed_df.columns[:-1]])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Evaluate the model on the transformed data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Predictor columns used for training\n", + "numerical_predictors = [COL_AMOUNT]\n", + "nominal_predictors = [COL_ERROR, COL_CARD, COL_CHIP, COL_CITY, COL_ZIP, COL_MCC, COL_MERCHANT]\n", + "\n", + "predictor_columns = numerical_predictors + nominal_predictors\n", + "predictor_columns" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "# transformed column names\n", + "columns_of_transformed_data = list(\n", + " map(lambda name: name.split('__')[1],\n", + " list(loaded_transformer.get_feature_names_out(predictor_columns))))" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "# 
Prepare features (X) and target (y)\n", + "X = pd.DataFrame(\n", + " transformed_data, columns=columns_of_transformed_data)\n", + "\n", + "y = untransformed_df[untransformed_df.columns[-1]].values" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "# Make predictions\n", + "\n", + "y_pred_prob = loaded_bst.predict(xgb.DMatrix(data=X, label=y))\n", + "y_pred = (y_pred_prob >= 0.5).astype(int)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Compute metrics to evaluate the model performance" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "\n", + "# Accuracy\n", + "accuracy = accuracy_score(y, y_pred)\n", + "print(f'Accuracy: {accuracy:.4f}')\n", + "\n", + "# Confusion Matrix\n", + "conf_mat = confusion_matrix(y, y_pred)\n", + "print('Confusion Matrix:')\n", + "print(conf_mat)\n", + "\n", + "# ROC AUC Score\n", + "r_auc = roc_auc_score(y, y_pred_prob)\n", + "print(f'ROC AUC Score: {r_auc:.4f}')\n", + "\n", + "# y = cupy.asnumpy(y)\n", + "# Precision\n", + "precision = precision_score(y, y_pred)\n", + "print(f'Precision: {precision:.4f}')\n", + "\n", + "# Recall\n", + "recall = recall_score(y, y_pred)\n", + "print(f'Recall: {recall:.4f}')\n", + "\n", + "# F1 Score\n", + "f1 = f1_score(y, y_pred)\n", + "print(f'F1 Score: {f1:.4f}')\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "___\n", + "## Prediction on raw inputs\n", + "* The purpose is to demonstrate the use of the model during inference" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Read raw data" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "# Read example raw inputs\n", + "\n", + "raw_file_path = os.path.join(dataset_base_path, 'xgb', 'example_transactions.csv')\n", + "data = 
pd.read_csv(raw_file_path)\n", + "data = data[data.columns[:-1]]\n", + "original_data = data.copy()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Rename columns before the data is fed into the pre-fitted data transformer" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "_ = data.rename(columns={\n", + " \"Merchant Name\": COL_MERCHANT,\n", + " \"Merchant State\": COL_STATE,\n", + " \"Merchant City\": COL_CITY,\n", + " \"Errors?\": COL_ERROR,\n", + " \"Use Chip\": COL_CHIP\n", + " },\n", + " inplace=True\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Handle unknown values as was done for the training data" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "UNKNOWN_STRING_MARKER = 'XX'\n", + "UNKNOWN_ZIP_CODE = 0\n", + "\n", + "data[COL_STATE] = data[COL_STATE].fillna(UNKNOWN_STRING_MARKER)\n", + "data[COL_ERROR] = data[COL_ERROR].fillna(UNKNOWN_STRING_MARKER)\n", + "data[COL_ZIP] = data[COL_ZIP].fillna(UNKNOWN_ZIP_CODE)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Convert column type and remove \"$\" and \",\" as was done for the training data" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "data[COL_AMOUNT] = data[COL_AMOUNT].str.replace(\"$\",\"\").astype(\"float\")\n", + "data[COL_STATE] = data[COL_STATE].astype('str')\n", + "data[COL_MERCHANT] = data[COL_MERCHANT].astype('str')\n", + "data[COL_ERROR] = data[COL_ERROR].str.replace(\",\",\"\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Combine User and Card to generate unique numbers as was done for the training data" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "data[COL_CARD] = data[COL_USER] 
* MAX_NR_CARDS_PER_USER + data[COL_CARD]\n", + "data[COL_CARD] = data[COL_CARD].astype('int')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Check if the transactions have unknown users or merchants" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "# Find the known merchants and (users, cards), i.e. the merchants and (users, cards) that are in training data\n", + "known_merchants = set()\n", + "known_cards = set()\n", + "\n", + "for enc in loaded_transformer.named_transformers_['binary'].named_steps['binary'].ordinal_encoder.mapping:\n", + " if enc['col'] == COL_MERCHANT:\n", + " known_merchants = set(enc['mapping'].keys())\n", + " if enc['col'] == COL_CARD:\n", + " known_cards = set(enc['mapping'].keys())" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "# Is user, card already known\n", + "data['Is_card_known'] = data[COL_CARD].map(lambda c: c in known_cards)" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "# Is merchant already known\n", + "data['Is_merchant_known'] = data[COL_MERCHANT].map(lambda m: m in known_merchants )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Use the same set of predictor columns as used for training" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [], + "source": [ + "numerical_predictors = [COL_AMOUNT]\n", + "nominal_predictors = [COL_ERROR, COL_CARD, COL_CHIP, COL_CITY, COL_ZIP, COL_MCC, COL_MERCHANT]\n", + "\n", + "predictor_columns = numerical_predictors + nominal_predictors" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Load the data transformer and transform the raw data" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [], + "source": [ + "with 
open(os.path.join(dataset_base_path, 'preprocessor.pkl'),'rb') as f:\n", + " loaded_transformer = pickle.load(f)\n", + " transformed_data = loaded_transformer.transform(data[predictor_columns])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Run prediction" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "X = pd.DataFrame(\n", + " transformed_data, columns=columns_of_transformed_data)\n", + "\n", + "# Predict transactions\n", + "pred_probs = loaded_bst.predict(xgb.DMatrix(X))\n", + "pred_labels = (pred_probs >= 0.5).astype(int)\n", + "\n", + "# Name of the target column\n", + "target_col_name = 'Is Fraud?'\n", + "\n", + "data[target_col_name] = pred_labels\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### If the transactions have unknown (user, card) or merchant, mark it as fraud" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "data[target_col_name] = data.apply(\n", + " lambda row: \n", + " (row[target_col_name] == 1) or (row['Is_card_known'] == False) or (row['Is_merchant_known'] == False), axis=1)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Label the raw data as Fraud or Non-Fraud, based on prediction" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# Change 0 to No (non-Fraud) and 1 to Yes (Fraud)\n", + "binary_to_text = { False: 'No', True : 'Yes'}\n", + "data[target_col_name] = data[target_col_name].map(binary_to_text).astype('str')\n", + "original_data[target_col_name] = data[target_col_name]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Transactions with predicted labels" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "original_data" + ] + 
}, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Copyright and License\n", + "
\n", + "Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.\n", + "\n", + "
\n", + "\n", + " Licensed under the Apache License, Version 2.0 (the \"License\");\n", + " you may not use this file except in compliance with the License.\n", + " You may obtain a copy of the License at\n", + " \n", + " http://www.apache.org/licenses/LICENSE-2.0\n", + " \n", + " Unless required by applicable law or agreed to in writing, software\n", + " distributed under the License is distributed on an \"AS IS\" BASIS,\n", + " WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + " See the License for the specific language governing permissions and\n", + " limitations under the License." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "mamba_env", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.10" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/ai-credit-fraud-workflow/notebooks/preprocess_Sparkov.ipynb b/ai-credit-fraud-workflow/notebooks/preprocess_Sparkov.ipynb new file mode 100644 index 0000000..f54c677 --- /dev/null +++ b/ai-credit-fraud-workflow/notebooks/preprocess_Sparkov.ipynb @@ -0,0 +1,1968 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "9c6a5b09-a601-47c6-989f-5efb42d7f4f8", + "metadata": {}, + "source": [ + "# Credit Card Transaction Data Cleanup and Prep \n", + "\n", + "This notebook shows the steps for cleanup and preparing the credit card transaction data for follow on GNN training with GraphSAGE.\n", + "\n", + "### The dataset:\n", + " * 'Generate Fake Credit Card Transaction [Data](https://www.kaggle.com/datasets/kartik2112/fraud-detection), Including Fraudulent Transactions' using https://github.com/namebrandon/Sparkov_Data_Generation\n", + " * Released under CC0: Public Domain\n", + "\n", + "Contains 1,296,675 records with 15 
fields, one field being the \"is fraud\" label which we use for training.\n", + "\n", + "### Goals\n", + "The goal is to:\n", + " * Understand and transform the data\n", + " * Perform correlation analysis to select important predictors\n", + " * Encode categorical fields\n", + " * Scale numerical columns\n", + " * Create a contiguous node index across users, merchants, and transactions\n", + " * having node IDs start at zero and be contiguous is critical for creating Compressed Sparse Row (CSR) formatted data without wasting memory.\n", + " * Produce:\n", + " * For XGBoost:\n", + " * Training - all transactions in 2019\n", + " * Validation - all transactions between January and May in 2020\n", + " * Test - all transactions after May 2020\n", + " * For GNN\n", + " * Training Data \n", + " * Edge List \n", + " * Feature data\n", + " * Test set - all transactions after May 2020\n", + "\n", + "\n", + "\n", + "### Graph formation\n", + "Given that we are limited to just the data in the transaction file, the ideal model would be a bipartite graph of Users to Merchants where the edges represent the credit card transactions, followed by Link Classification on the edges to identify fraud. Unfortunately, the current version of cuGraph does not support GNN Link Prediction. That limitation will be lifted over the next few releases, at which time this code will be updated. Luckily, there is precedent for viewing transactions as nodes and then doing node classification using the popular GraphSAGE GNN. That is the approach this code takes. The produced graph will be a tri-partite graph where each transaction is represented as a node.\n", + "\n", + "\n", + "\n", + "\n", + "### Features\n", + "For the XGBoost approach, there is no need to generate empty features for the Merchants. However, for GNN processing, every node needs to have the same set of feature data. Therefore, we need to generate empty features for the User and Merchant nodes. 
\n", + "\n", + "-----" + ] + }, + { + "cell_type": "markdown", + "id": "795bdece", + "metadata": {}, + "source": [ + "#### Import the necessary libraries. In this case, we will use cuDF and perform most of the data prep on the GPU\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "4b6b2bc6-a206-42c5-aae9-590672b3a202", + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "import math\n", + "import os\n", + "import pickle\n", + "\n", + "import cudf\n", + "import numpy as np\n", + "import pandas as pd\n", + "import scipy.stats as ss\n", + "from category_encoders import BinaryEncoder\n", + "from scipy.stats import gaussian_kde, pointbiserialr\n", + "from sklearn.compose import ColumnTransformer\n", + "from sklearn.impute import SimpleImputer\n", + "from sklearn.pipeline import Pipeline\n", + "from sklearn.preprocessing import OneHotEncoder, RobustScaler, StandardScaler\n" + ] + }, + { + "cell_type": "markdown", + "id": "81db641b", + "metadata": {}, + "source": [ + "-------\n", + "#### Define some arguments" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "016964ce", + "metadata": {}, + "outputs": [], + "source": [ + "# Whether the graph is undirected\n", + "make_undirected = True\n", + "\n", + "# Whether to spread features across User and Merchant nodes\n", + "spread_features = False\n", + "\n", + "# Whether we should under-sample the majority class (i.e. 
non-fraud transactions)\n", + "under_sample = True\n", + "\n", + "# Ratio of fraud and non-fraud transactions in case we under-sample the majority class\n", + "fraud_ratio = 0.1\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "656e6aee-038a-4b58-9296-993e06defb35", + "metadata": {}, + "outputs": [], + "source": [ + "sparkov_base_path = '../data/Sparkov'\n", + "sparkov_raw_file_path = os.path.join(sparkov_base_path, 'raw', 'fraudTrain.csv')\n", + "sparkov_xgb = os.path.join(sparkov_base_path, 'xgb')\n", + "sparkov_gnn = os.path.join(sparkov_base_path, 'gnn')\n", + "if not os.path.exists(sparkov_xgb):\n", + " os.makedirs(sparkov_xgb)\n", + "if not os.path.exists(sparkov_gnn):\n", + " os.makedirs(sparkov_gnn)" + ] + }, + { + "cell_type": "markdown", + "id": "96fe43fe", + "metadata": {}, + "source": [ + "--------\n", + "## Load and understand the data" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "fb41e6ea-1e9f-4f14-99a4-d6d3df092a37", + "metadata": {}, + "outputs": [], + "source": [ + "# Read the dataset\n", + "data = cudf.read_csv(sparkov_raw_file_path, index_col=0)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3d9a9ab4-4240-4824-997b-8bfdb640381c", + "metadata": {}, + "outputs": [], + "source": [ + "# optional - take a look at the data \n", + "data.head(5)" + ] + }, + { + "cell_type": "markdown", + "id": "85cb6dff", + "metadata": {}, + "source": [ + "### Findings\n", + "* Nominal categorical fields - 'cc_num', 'merchant', 'category', 'first', 'last', 'street', 'city', 'state', 'zip', 'job', 'trans_num'\n", + "* Numerical fields - 'amt', 'lat', 'long', 'city_pop', 'merch_lat', 'merch_long'\n", + "* Timestamp fields - 'dob', 'trans_date_trans_time', 'unix_time'\n", + "* Target label - 'is_fraud'\n" + ] + }, + { + "cell_type": "markdown", + "id": "18c3ed53", + "metadata": {}, + "source": [ + "#### How many transactions are fraud?" 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1e3a9ff7", + "metadata": {}, + "outputs": [], + "source": [ + "data['is_fraud'].value_counts()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8d74fa10", + "metadata": {}, + "outputs": [], + "source": [ + "# Percentage of fraud transactions\n", + "100.0*(data['is_fraud'] == 1).sum()/len(data)" + ] + }, + { + "cell_type": "markdown", + "id": "1bd28023", + "metadata": {}, + "source": [ + "##### Findings - The dataset is extremely imbalanced: only 0.58% of the transactions are fraud" + ] + }, + { + "cell_type": "markdown", + "id": "c39b1a60", + "metadata": {}, + "source": [ + "#### Check whether there are null values in the data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bc2154e5", + "metadata": {}, + "outputs": [], + "source": [ + "# Check if any column has missing values\n", + "data.isnull().sum()\n" + ] + }, + { + "cell_type": "markdown", + "id": "55cbbd84", + "metadata": {}, + "source": [ + "###### Great, none of the columns have null values" + ] + }, + { + "cell_type": "markdown", + "id": "7bf29e24", + "metadata": {}, + "source": [ + "##### Save a few transactions before any operations on the data" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "92df5856", + "metadata": {}, + "outputs": [], + "source": [ + "# Write a few raw transactions for the model's inference notebook\n", + "out_path = os.path.join(sparkov_xgb, 'example_transactions.csv')\n", + "data.tail(10).to_pandas().to_csv(out_path, header=True, index=False)" + ] + }, + { + "cell_type": "markdown", + "id": "3aa7fc88", + "metadata": {}, + "source": [ + "#### Convert 'dob' to 'age' w.r.t. 
a reference date" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "c4df4100", + "metadata": {}, + "outputs": [], + "source": [ + "data['dob'] = cudf.to_datetime(data['dob'])" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "4b55a1db", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "one_nanosecond = np.timedelta64(1, 'ns')\n", + "nanoseconds_in_year = 365.25 * 24 * 60 * 60 * 1e9\n", + "reference_date = cudf.to_datetime('2024-10-30') " + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "034210a4", + "metadata": {}, + "outputs": [], + "source": [ + "data['age'] = data['dob'].apply(lambda dob: (reference_date - dob)/ one_nanosecond / nanoseconds_in_year )" + ] + }, + { + "cell_type": "markdown", + "id": "6a72a46a", + "metadata": {}, + "source": [ + "#### Split the transaction time into year, month, day, and time, where time is the number of minutes since midnight" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "3bc3ec47", + "metadata": {}, + "outputs": [], + "source": [ + "tx_date_time = cudf.to_datetime(data.trans_date_trans_time)" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "a0f61401", + "metadata": {}, + "outputs": [], + "source": [ + "data['year'] = tx_date_time.dt.year\n", + "data['month'] = tx_date_time.dt.month\n", + "data['day'] = tx_date_time.dt.day\n", + "data['time'] = tx_date_time.dt.hour*60 + tx_date_time.dt.minute\n" + ] + }, + { + "cell_type": "markdown", + "id": "21f372c0", + "metadata": {}, + "source": [ + "##### Observations\n", + "\n", + "* We can treat 'year', 'month', and 'day' as ordinal fields and 'time' as a numerical field" + ] + }, + { + "cell_type": "markdown", + "id": "6bb202e2", + "metadata": {}, + "source": [ + "### Compute transaction speed from ('lat', 'long'), ('merch_lat', 'merch_long') and 'unix_time'" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "90714a56", + "metadata": {}, + "outputs": [], 
+ "source": [ + "\n", + 
"temp_df = pd.DataFrame()\n", + "\n", + "# Haversine formula function\n", + "def haversine(lat1, lon1, lat2, lon2):\n", + " # Radius of Earth in km\n", + " R = 6371.0\n", + "\n", + " # Convert degrees to radians\n", + " lat1 = math.radians(lat1)\n", + " lon1 = math.radians(lon1)\n", + " lat2 = math.radians(lat2)\n", + " lon2 = math.radians(lon2)\n", + "\n", + " # Differences in coordinates\n", + " dlat = lat2 - lat1\n", + " dlon = lon2 - lon1\n", + "\n", + " # Haversine formula\n", + " a = math.sin(dlat / 2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2)**2\n", + " c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))\n", + "\n", + " # Distance in kilometers\n", + " distance = R * c\n", + " return distance\n", + "\n", + "temp_df= data[['unix_time', 'lat', 'long', 'merch_lat', 'merch_long']].to_pandas()\n", + "temp_df['tx_duration'] = temp_df['unix_time'].apply(lambda x: x/1e9)\n", + "temp_df['distance_km'] = temp_df.apply(\n", + " lambda row: haversine(row['lat'], row['long'], row['merch_lat'], row['merch_long']), axis=1)\n", + "data['speed'] = (temp_df['distance_km']/temp_df['tx_duration'])\n", + "del temp_df" + ] + }, + { + "cell_type": "markdown", + "id": "2a5805e8", + "metadata": {}, + "source": [ + "#### Using variables for makes code cleaner" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "c933e5e8", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "COL_CARD = 'cc_num'\n", + "COL_MCC = 'category'\n", + "COL_MERCHANT = 'merchant'\n", + "COL_STATE ='state'\n", + "COL_CITY ='city'\n", + "COL_ZIP = 'zip'\n", + "\n", + "COL_AMOUNT = 'amt'\n", + "COL_CITY_POP = 'city_pop'\n", + "\n", + "COL_FRAUD = 'is_fraud'\n", + "\n", + "COL_TIME = 'time'\n", + "COL_DAY = 'day'\n", + "COL_MONTH = 'month'\n", + "COL_YEAR = 'year'\n", + "COL_AGE = 'age'\n", + "COL_JOB = 'job'\n", + "COL_SPEED = 'speed'\n", + "\n", + "NUMERICAL_COLUMNS = [\n", + " COL_AMOUNT, COL_CITY_POP, COL_TIME, COL_AGE, COL_SPEED,\n", + " 'lat', 'long', 'merch_lat', 
'merch_long']\n" + ] + }, + { + "cell_type": "markdown", + "id": "128c578e", + "metadata": {}, + "source": [ + "##### Number of cards per user" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a8435e75", + "metadata": {}, + "outputs": [], + "source": [ + "len(data.cc_num.unique()) / len((data['first'] + data['last']).unique())" + ] + }, + { + "cell_type": "markdown", + "id": "1a9c8002", + "metadata": {}, + "source": [ + "#### Look into numerical columns" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "924c30fd", + "metadata": {}, + "outputs": [], + "source": [ + "data[NUMERICAL_COLUMNS].describe()" + ] + }, + { + "cell_type": "markdown", + "id": "73172495", + "metadata": {}, + "source": [ + "#### Findings\n", + "* 'amt' and 'city_pop' have extreme values or outliers compared to their mean and median." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "95ae0373", + "metadata": {}, + "outputs": [], + "source": [ + "data[COL_AMOUNT].describe()" + ] + }, + { + "cell_type": "markdown", + "id": "c92cd79e", + "metadata": {}, + "source": [ + "##### Plot the estimated density (KDE) of the 'amt' field" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "06f1809a", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "import matplotlib.pyplot as plt\n", + "kde = gaussian_kde(data[COL_AMOUNT].to_pandas())\n", + "x_vals = np.linspace(data[COL_AMOUNT].min(), 2000, 100)\n", + "plt.plot(x_vals, kde(x_vals), color='blue')" + ] + }, + { + "cell_type": "markdown", + "id": "867ea0d4", + "metadata": {}, + "source": [ + "##### Findings\n", + "* Very few transactions have high 'amt' values" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f2322497", + "metadata": {}, + "outputs": [], + "source": [ + "data[COL_CITY_POP].describe()" + ] + }, + { + "cell_type": "markdown", + "id": "01865293", + "metadata": {}, + "source": [ + "##### Plot the estimated density (KDE) of the 'city_pop' field" + ] + }, + { + 
"cell_type": "code", + "execution_count": null, + "id": "c8f5c519", + "metadata": {}, + "outputs": [], + "source": [ + "kde = gaussian_kde(data[COL_CITY_POP].to_pandas())\n", + "x_vals = np.linspace(data[COL_CITY_POP].min(), 100000, 100)\n", + "plt.plot(x_vals, kde(x_vals), color='blue')" + ] + }, + { + "cell_type": "markdown", + "id": "b3d3941c", + "metadata": {}, + "source": [ + "##### Findings\n", + "* Only a few cities have a population over 40,000" + ] + }, + { + "cell_type": "markdown", + "id": "ab82a9f9", + "metadata": {}, + "source": [ + "#### Let's look into how the amount differs between fraud and non-fraud transactions" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "31679a1d", + "metadata": {}, + "outputs": [], + "source": [ + "data[COL_AMOUNT].describe()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "66de8459", + "metadata": {}, + "outputs": [], + "source": [ + "# Fraud transactions\n", + "data[COL_AMOUNT][data[COL_FRAUD] == 1].describe()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0f49ff49", + "metadata": {}, + "outputs": [], + "source": [ + "# Non-fraud transactions\n", + "data[COL_AMOUNT][data[COL_FRAUD] == 0].describe()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1151a313", + "metadata": {}, + "outputs": [], + "source": [ + "# Non-fraud transactions with a high amount\n", + "data[COL_AMOUNT][(data[COL_FRAUD] == 0) & (data[COL_AMOUNT] > 1376)].describe()" + ] + }, + { + "cell_type": "markdown", + "id": "a190bb9e", + "metadata": {}, + "source": [ + "#### Findings\n", + "\n", + "* The average amount in fraud transactions is more than 8x the average amount in non-fraud transactions\n", + "* Interestingly, many non-fraud transactions have high amounts as well.\n", + "\n", + "We need to scale the data, and RobustScaler, which is less sensitive to outliers, could be a good choice for it." 
+ ] + }, + { + "cell_type": "markdown", + "id": "54f56d5a-f135-4af2-ba13-b926f66a045f", + "metadata": {}, + "source": [ + "#### Number of unique values per nominal column" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2dff40cb", + "metadata": {}, + "outputs": [], + "source": [ + "# Check how many unique values each nominal column has\n", + "for col in [COL_STATE, COL_CITY, COL_ZIP, COL_MERCHANT, COL_MCC, COL_CARD]:\n", + "    print(f'#unique values ({col}) = {len(data[col].unique())}')\n" + ] + }, + { + "cell_type": "markdown", + "id": "86ca593a", + "metadata": {}, + "source": [ + "#### Findings\n", + "We can binary encode 'state', 'city', 'zip', 'merchant', 'category', and 'cc_num' if the columns correlate well with the target" + ] + }, + { + "cell_type": "markdown", + "id": "50933790-780c-43cc-833d-c7ad16acbde3", + "metadata": {}, + "source": [ + "#### Take a look at the distributions of the 'time', 'speed' and 'age' columns\n" + ] + }, + { + "cell_type": "markdown", + "id": "691a355b", + "metadata": {}, + "source": [ + "##### Plot the estimated density (KDE) of transaction 'speed'" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d490cd45", + "metadata": {}, + "outputs": [], + "source": [ + "kde = gaussian_kde(data[COL_SPEED].to_pandas())\n", + "x_vals = np.linspace(data[COL_SPEED].min(), data[COL_SPEED].max(), 100)\n", + "plt.plot(x_vals, kde(x_vals), color='blue')" + ] + }, + { + "cell_type": "markdown", + "id": "00e1d50f", + "metadata": {}, + "source": [ + "##### Plot the estimated density (KDE) of 'time'\n", + "__NOTE__: Time is captured as the number of minutes since the start of the day" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8cd994bf", + "metadata": {}, + "outputs": [], + "source": [ + "kde = gaussian_kde(data[COL_TIME].to_pandas())\n", + "x_vals = np.linspace(data[COL_TIME].min(), data[COL_TIME].max(), 100)\n", + "plt.plot(x_vals, kde(x_vals), color='blue')" + ] + }, + { + "cell_type": "markdown", + "id": "97873360", + "metadata": {}, + 
"source": [ + "##### Plot the estimated density (KDE) of 'age'" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "576fb419", + "metadata": {}, + "outputs": [], + "source": [ + "kde = gaussian_kde(data[COL_AGE].to_pandas())\n", + "x_vals = np.linspace(data[COL_AGE].min(), data[COL_AGE].max(), 100)\n", + "plt.plot(x_vals, kde(x_vals), color='blue')" + ] + }, + { + "cell_type": "markdown", + "id": "f75c7c59", + "metadata": {}, + "source": [ + "##### Findings\n", + "* It's not obvious from the density plots of 'time', 'speed', and 'age' whether they are clear indicators for labeling a transaction as fraud." + ] + }, + { + "cell_type": "markdown", + "id": "5a815bc9", + "metadata": {}, + "source": [ + "#### Define a function to compute the correlation of categorical fields with the target" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": "cfaa31fd", + "metadata": {}, + "outputs": [], + "source": [ + "# https://en.wikipedia.org/wiki/Cram%C3%A9r's_V\n", + "\n", + "def cramers_v(x, y):\n", + "    confusion_matrix = cudf.crosstab(x, y).to_numpy()\n", + "    chi2 = ss.chi2_contingency(confusion_matrix)[0]\n", + "    n = confusion_matrix.sum().sum()\n", + "    r, k = confusion_matrix.shape\n", + "    return np.sqrt(chi2 / (n * (min(k-1, r-1))))" + ] + }, + { + "cell_type": "markdown", + "id": "1fa39773", + "metadata": {}, + "source": [ + "##### Compute correlation of different fields with target" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e5497f70", + "metadata": {}, + "outputs": [], + "source": [ + "sparse_factor = 1\n", + "columns_to_compute_corr = [\n", + "    COL_CARD, COL_STATE, COL_CITY, COL_ZIP, COL_MCC, COL_MERCHANT,\n", + "    COL_DAY, COL_MONTH, COL_YEAR, COL_JOB, 'gender']\n", + "for c1 in columns_to_compute_corr:\n", + "    for c2 in [COL_FRAUD]:\n", + "        coff = 100 * cramers_v(data[c1][::sparse_factor], data[c2][::sparse_factor])\n", + "        print('Correlation ({}, {}) = {:6.2f}%'.format(c1, c2, coff))" + ] + }, + { + "cell_type": "markdown", 
+ "id": "738cd723", + "metadata": {}, + "source": [ + "#### Findings\n", + "* 'day', 'month', 'year', and 'gender' are not important for predicting whether a transaction is fraud" + ] + }, + { + "cell_type": "markdown", + "id": "00296660", + "metadata": {}, + "source": [ + "#### Check how City, State and Zip are correlated" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6ac25d40", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "sparse_factor = 1\n", + "columns_to_compute_corr = [COL_STATE, COL_CITY, COL_ZIP]\n", + "for c1 in columns_to_compute_corr:\n", + "    for c2 in columns_to_compute_corr:\n", + "        # Compare column names for equality ('not in' would be a substring test)\n", + "        if c1 != c2:\n", + "            coff = 100 * cramers_v(data[c1][::sparse_factor], data[c2][::sparse_factor])\n", + "            print('{} {} {:6.2f}%'.format(c1, c2, coff))" + ] + }, + { + "cell_type": "markdown", + "id": "e2b1edd3", + "metadata": {}, + "source": [ + "#### Findings\n", + "* If we use 'zip' to predict whether a transaction is fraud, we don't need to use 'city' and 'state'" + ] + }, + { + "cell_type": "markdown", + "id": "a50d67c2", + "metadata": {}, + "source": [ + "### Correlation of target with numerical columns" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "38353361", + "metadata": {}, + "outputs": [], + "source": [ + "# https://en.wikipedia.org/wiki/Point-biserial_correlation_coefficient\n", + "# Use the point-biserial correlation coefficient (r_pb) to check whether the numerical columns help predict if a transaction is fraud\n", + "\n", + "for col in NUMERICAL_COLUMNS:\n", + "    r_pb, p_value = pointbiserialr(data[COL_FRAUD].to_pandas(), data[col].to_pandas())\n", + "    print('r_pb ({}) = {:3.2f} with p_value {:3.2f}'.format(col, r_pb, p_value))" + ] + }, + { + "cell_type": "markdown", + "id": "400c8f83", + "metadata": {}, + "source": [ + "#### Findings\n", + "* The 'amt' column has a positive correlation with the target\n", + "* Other columns, such as 'city_pop', 'time', 'age', 'lat', 'long', 'merch_lat', and 
'merch_long' have negligible correlation with target\n", + "* Speed can't be ignored as the p_value > 0.05" + ] + }, + { + "cell_type": "markdown", + "id": "f92d58ef", + "metadata": {}, + "source": [ + "#### Based on correlation values, select a set of columns (aka fields) to predict whether a transaction is fraud" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "id": "a00a7813", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "numerical_predictors = [COL_AMOUNT, COL_SPEED, COL_AGE]\n", + "nominal_predictors = [COL_CARD, COL_ZIP, COL_MCC, COL_MERCHANT, COL_JOB]\n", + "\n", + "predictor_columns = numerical_predictors + nominal_predictors\n", + "\n", + "target_column = [COL_FRAUD]" + ] + }, + { + "cell_type": "markdown", + "id": "9341f5e0", + "metadata": {}, + "source": [ + "#### Remove duplicate non-fraud data points" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "428c4ac8", + "metadata": {}, + "outputs": [], + "source": [ + "# Remove duplicate non-fraud data points (fraud rows are kept as they are)\n", + "fraud_data = data[data[COL_FRAUD] == 1]\n", + "data = data[data[COL_FRAUD] == 0]\n", + "data = data.drop_duplicates(subset=nominal_predictors)\n", + "data = cudf.concat([data, fraud_data])\n", + "\n", + "100*data[COL_FRAUD].value_counts()/len(data)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4fe02370", + "metadata": {}, + "outputs": [], + "source": [ + "# Portion of transactions per year\n", + "data[COL_YEAR].value_counts()/len(data)" + ] + }, + { + "cell_type": "markdown", + "id": "8bc2ebac", + "metadata": {}, + "source": [ + "### Split data\n", + "The transactions were made in 2019 and 2020. Let's split the data into three groups based on the year and month of the event\n", + "* Training - all transactions in 2019\n", + "* Validation - all transactions from January through March 2020\n", +
"* Test - all transactions from April 2020 onward" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "97dc748b", + "metadata": {}, + "outputs": [], + "source": [ + "if under_sample:\n", + "    fraud_df = data[data[COL_FRAUD]==1]\n", + "    non_fraud_df = data[data[COL_FRAUD]==0]\n", + "    nr_non_fraud_samples = min((len(data) - len(fraud_df)), int(len(fraud_df)/fraud_ratio))\n", + "    data = cudf.concat([fraud_df, non_fraud_df.sample(nr_non_fraud_samples)])\n", + "\n", + "training_idx = data[COL_YEAR] == 2019\n", + "validation_idx = (data[COL_YEAR] == 2020) & (data[COL_MONTH] < 4)\n", + "test_idx = (data[COL_YEAR] == 2020) & (data[COL_MONTH] >= 4)\n", + "\n", + "data[COL_FRAUD].value_counts()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9a7d2f0c", + "metadata": {}, + "outputs": [], + "source": [ + "# Portion of data for training, validation and test\n", + "training_idx.sum()/len(data), validation_idx.sum()/len(data), test_idx.sum()/len(data)" + ] + }, + { + "cell_type": "markdown", + "id": "cd036929", + "metadata": {}, + "source": [ + "### Scale numerical columns and encode categorical columns of training data" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "id": "2d570cdc", + "metadata": {}, + "outputs": [], + "source": [ + "# Some of the encoders we want to use are not available in cuML yet, so we use pandas for now.\n", + "# Move training data to pandas for preprocessing\n", + "pdf_training = data[training_idx].to_pandas()[predictor_columns + target_column]" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "id": "d135cc8a", + "metadata": {}, + "outputs": [], + "source": [ + "# Use binary encoding for categorical columns\n", + "columns_for_binary_encoding = nominal_predictors" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "id": "ee6b77fc", + "metadata": {}, + "outputs": [], + "source": [ + "# Mark categorical columns as \"category\"\n", + "pdf_training[nominal_predictors] = 
pdf_training[nominal_predictors].astype(\"category\")" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "id": "6d20de67", + "metadata": {}, + "outputs": [], + "source": [ + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "id": "49eb57f5", + "metadata": {}, + "outputs": [], + "source": [ + "# Encoders to encode categorical columns and scalers to scale numerical columns\n", + "\n", + "bin_encoder = Pipeline(\n", + "    steps=[\n", + "        (\"binary\", BinaryEncoder(handle_missing='value', handle_unknown='value'))\n", + "    ]\n", + ")\n", + "onehot_encoder = Pipeline(\n", + "    steps=[\n", + "        (\"onehot\", OneHotEncoder())\n", + "    ]\n", + ")\n", + "\n", + "std_scaler = Pipeline(\n", + "    steps=[(\"imputer\", SimpleImputer(strategy=\"median\")), (\"standard\", StandardScaler())],\n", + ")\n", + "\n", + "robust_scaler = Pipeline(\n", + "    steps=[(\"imputer\", SimpleImputer(strategy=\"median\")), (\"robust\", RobustScaler())],\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "id": "613db861", + "metadata": {}, + "outputs": [], + "source": [ + "# Compose encoders and scalers in a column transformer\n", + "transformer = ColumnTransformer(\n", + "    transformers=[\n", + "        (\"binary\", bin_encoder, columns_for_binary_encoding),\n", + "        (\"robust\", robust_scaler, [COL_AMOUNT]),\n", + "        (\"stdscaler\", std_scaler, [COL_SPEED, COL_AGE]),\n", + "    ], remainder=\"passthrough\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "79be6b2b", + "metadata": {}, + "source": [ + "##### Fit column transformer with training data" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "id": "ba373e41", + "metadata": {}, + "outputs": [], + "source": [ + "# Fit column transformer with training data\n", + "\n", + "pd.set_option('future.no_silent_downcasting', True)\n", + "transformer = transformer.fit(pdf_training[predictor_columns])" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "id": "9f10f84a", + "metadata": {}, + 
"outputs": [], + "source": [ + "# Transformed column names\n", + "columns_of_transformed_data = list(\n", + "    map(lambda name: name.split('__')[1],\n", + "        list(transformer.get_feature_names_out(predictor_columns))))" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "id": "739f62a4", + "metadata": {}, + "outputs": [], + "source": [ + "# Data type of transformed columns\n", + "type_mapping = {}\n", + "for col in columns_of_transformed_data:\n", + "    if col.split('_')[0] in nominal_predictors:\n", + "        type_mapping[col] = 'int8'\n", + "    elif col in numerical_predictors:\n", + "        type_mapping[col] = 'float'\n", + "    elif col in target_column:\n", + "        type_mapping[col] = data.dtypes.to_dict()[col]" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "id": "b5471b25", + "metadata": {}, + "outputs": [], + "source": [ + "# Transform training data\n", + "preprocessed_training_data = transformer.transform(pdf_training[predictor_columns])\n", + "\n", + "# Convert transformed data to a pandas DataFrame\n", + "preprocessed_training_data = pd.DataFrame(\n", + "    preprocessed_training_data, columns=columns_of_transformed_data)\n", + "# Copy target column\n", + "preprocessed_training_data[COL_FRAUD] = pdf_training[COL_FRAUD].values\n", + "preprocessed_training_data = preprocessed_training_data.astype(type_mapping)" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "id": "f02184b2", + "metadata": {}, + "outputs": [], + "source": [ + "# Save the transformer\n", + "\n", + "with open(os.path.join(sparkov_base_path, 'preprocessor.pkl'), 'wb') as f:\n", + "    pickle.dump(transformer, f)" + ] + }, + { + "cell_type": "markdown", + "id": "8f3de882", + "metadata": {}, + "source": [ + "#### Load the saved transformer and transform the test and validation data" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "id": "e3362673", + "metadata": {}, + "outputs": [], + "source": [ + "with open(os.path.join(sparkov_base_path, 'preprocessor.pkl'), 'rb') as f:\n", + "    
loaded_transformer = pickle.load(f)" + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "id": "47f6663e", + "metadata": {}, + "outputs": [], + "source": [ + "# Transform test data using the transformer fitted on training data\n", + "pdf_test = data[test_idx].to_pandas()[predictor_columns + target_column]\n", + "pdf_test[nominal_predictors] = pdf_test[nominal_predictors].astype(\"category\")\n", + "\n", + "preprocessed_test_data = loaded_transformer.transform(pdf_test[predictor_columns])\n", + "preprocessed_test_data = pd.DataFrame(preprocessed_test_data, columns=columns_of_transformed_data)\n", + "\n", + "# Copy target column\n", + "preprocessed_test_data[COL_FRAUD] = pdf_test[COL_FRAUD].values\n", + "preprocessed_test_data = preprocessed_test_data.astype(type_mapping)" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "id": "0ce80a1b", + "metadata": {}, + "outputs": [], + "source": [ + "# Transform validation data using the transformer fitted on training data\n", + "pdf_validation = data[validation_idx].to_pandas()[predictor_columns + target_column]\n", + "pdf_validation[nominal_predictors] = pdf_validation[nominal_predictors].astype(\"category\")\n", + "\n", + "preprocessed_validation_data = loaded_transformer.transform(pdf_validation[predictor_columns])\n", + "preprocessed_validation_data = pd.DataFrame(preprocessed_validation_data, columns=columns_of_transformed_data)\n", + "\n", + "# Copy target column\n", + "preprocessed_validation_data[COL_FRAUD] = pdf_validation[COL_FRAUD].values\n", + "preprocessed_validation_data = preprocessed_validation_data.astype(type_mapping)" + ] + }, + { + "cell_type": "markdown", + "id": "cb2ca66b-d3dc-4f67-9754-b90bbea6e286", + "metadata": {}, + "source": [ + "## Write out the data for XGB" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "id": "89c16cfb-0bd4-4efb-a610-f3ae7445d96e", + "metadata": {}, + "outputs": [], + "source": [ + "## Training data\n", + "out_path = 
os.path.join(sparkov_xgb, 'training.csv')\n", + "if not os.path.exists(os.path.dirname(out_path)):\n", + "    os.makedirs(os.path.dirname(out_path))\n", + "preprocessed_training_data.to_csv(\n", + "    out_path, header=True, index=False, columns=columns_of_transformed_data + target_column)\n", + "# preprocessed_training_data.to_parquet(out_path, index=False, compression='gzip')" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "id": "f3ef6b19-062b-42d5-8caa-6f6011648b4a", + "metadata": {}, + "outputs": [], + "source": [ + "## validation data\n", + "out_path = os.path.join(sparkov_xgb, 'validation.csv')\n", + "if not os.path.exists(os.path.dirname(out_path)):\n", + "    os.makedirs(os.path.dirname(out_path))\n", + "preprocessed_validation_data.to_csv(\n", + "    out_path, header=True, index=False, columns=columns_of_transformed_data + target_column)\n", + "# preprocessed_validation_data.to_parquet(out_path, index=False, compression='gzip')" + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "id": "cdc8e3b9-841d-49f3-b3ae-3017a04605e3", + "metadata": {}, + "outputs": [], + "source": [ + "## test data\n", + "out_path = os.path.join(sparkov_xgb, 'test.csv')\n", + "preprocessed_test_data.to_csv(\n", + "    out_path, header=True, index=False, columns=columns_of_transformed_data + target_column)\n", + "# preprocessed_test_data.to_parquet(out_path, index=False, compression='gzip')" + ] + }, + { + "cell_type": "code", + "execution_count": 56, + "id": "67fb32b0", + "metadata": {}, + "outputs": [], + "source": [ + "# Write untransformed test data that has only (renamed) predictor columns and target\n", + "out_path = os.path.join(sparkov_xgb, 'untransformed_test.csv')\n", + "pdf_test.to_csv(out_path, header=True, index=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 57, + "id": "2d6cb604", + "metadata": {}, + "outputs": [], + "source": [ + "# Delete dataFrames that are not needed anymore\n", + "del(pdf_training)\n", + "del(pdf_validation)\n", + 
"del(pdf_test)\n", + "del(preprocessed_training_data)\n", + "del(preprocessed_validation_data)\n", + "del(preprocessed_test_data)" + ] + }, + { + "cell_type": "markdown", + "id": "3bfbfd83", + "metadata": {}, + "source": [ + "### GNN Data" + ] + }, + { + "cell_type": "markdown", + "id": "98e518c8", + "metadata": {}, + "source": [ + "#### Setting Vertex IDs\n", + "In order to create a graph, the different vertices need to be assigned unique vertex IDs. Additionally, the IDs need to be consecutive and non-negative.\n", + "\n", + "There are three node groups here: Transactions, Users, and Merchants. \n", + "\n", + "These IDs are not used in training; they are used only for graph processing." + ] + }, + { + "cell_type": "code", + "execution_count": 58, + "id": "194a47d8", + "metadata": {}, + "outputs": [], + "source": [ + "# Use the same training data as used for XGBoost\n", + "data = data[training_idx]" + ] + }, + { + "cell_type": "code", + "execution_count": 59, + "id": "0ba0cb6b", + "metadata": {}, + "outputs": [], + "source": [ + "# A lot of processing has occurred, so sort the data and reset the index\n", + "data = data.sort_values(by=[COL_YEAR, COL_MONTH, COL_DAY, COL_TIME], ascending=False)\n", + "data.reset_index(inplace=True, drop=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "id": "2a75c92e", + "metadata": {}, + "outputs": [], + "source": [ + "# Each transaction gets a unique ID\n", + "COL_TRANSACTION_ID = 'Tx_ID'\n", + "COL_MERCHANT_ID = 'Merchant_ID'\n", + "COL_USER_ID = 'User_ID'\n", + "\n", + "# The number of transactions equals the number of rows, so the row index serves as the transaction ID\n", + "data[COL_TRANSACTION_ID] = data.index" + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "id": "472ea57c", + "metadata": {}, + "outputs": [], + "source": [ + "# Get the max transaction ID to compute first merchant ID\n", + "max_tx_id = data[COL_TRANSACTION_ID].max()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7e8ef04b", + 
"metadata": {}, + "outputs": [], + "source": [ + "# Convert merchant names to consecutive integers\n", + "merchant_name_to_id = dict((v, k) for k, v in data[COL_MERCHANT].unique().to_dict().items())\n", + "data[COL_MERCHANT_ID] = data[COL_MERCHANT].map(merchant_name_to_id) + (max_tx_id + 1)\n", + "data[COL_MERCHANT_ID].min(), data[COL_MERCHANT_ID].max()" + ] + }, + { + "cell_type": "code", + "execution_count": 63, + "id": "6937df18", + "metadata": {}, + "outputs": [], + "source": [ + "# Again, get the max merchant ID to compute first user ID\n", + "max_merchant_id = data[COL_MERCHANT_ID].max()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c63c7fc3", + "metadata": {}, + "outputs": [], + "source": [ + "# Convert card numbers to consecutive user IDs\n", + "user_id_to_consecutive_ids = dict((v, k) for k, v in data[COL_CARD].unique().to_dict().items())\n", + "data[COL_USER_ID] = data[COL_CARD].map(user_id_to_consecutive_ids) + max_merchant_id + 1\n", + "data[COL_USER_ID].min(), data[COL_USER_ID].max()\n" + ] + }, + { + "cell_type": "code", + "execution_count": 65, + "id": "28858422", + "metadata": {}, + "outputs": [], + "source": [ + "# Save the max user ID\n", + "max_user_id = data[COL_USER_ID].max()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "903e5115", + "metadata": {}, + "outputs": [], + "source": [ + "# Check that the transaction, merchant and user IDs are consecutive\n", + "id_range = data[COL_TRANSACTION_ID].min(), data[COL_TRANSACTION_ID].max()\n", + "print(f'Transaction ID range {id_range}')\n", + "id_range = data[COL_MERCHANT_ID].min(), data[COL_MERCHANT_ID].max()\n", + "print(f'Merchant ID range {id_range}')\n", + "id_range = data[COL_USER_ID].min(), data[COL_USER_ID].max()\n", + "print(f'User ID range {id_range}')" + ] + }, + { + "cell_type": "code", + "execution_count": 67, + "id": "f2d0dfde", + "metadata": {}, + "outputs": [], + "source": [ + "# Sanity checks\n", + "assert( data[COL_TRANSACTION_ID].max() == 
data[COL_MERCHANT_ID].min() - 1)\n", + "assert( data[COL_MERCHANT_ID].max() == data[COL_USER_ID].min() - 1)\n", + "assert(len(data[COL_USER_ID].unique()) == (data[COL_USER_ID].max() - data[COL_USER_ID].min() + 1))\n", + "assert(len(data[COL_MERCHANT_ID].unique()) == (data[COL_MERCHANT_ID].max() - data[COL_MERCHANT_ID].min() + 1))\n", + "assert(len(data[COL_TRANSACTION_ID].unique()) == (data[COL_TRANSACTION_ID].max() - data[COL_TRANSACTION_ID].min() + 1))" + ] + }, + { + "cell_type": "markdown", + "id": "0d9c3df3-a5be-4899-8bf9-6152aca114c7", + "metadata": {}, + "source": [ + "### Write out the data for GNN" + ] + }, + { + "cell_type": "markdown", + "id": "c2b86862-d129-4ece-a60d-dc798f3a68b5", + "metadata": {}, + "source": [ + "#### Create the Graph Edge Data file \n", + "The file is in COO format" + ] + }, + { + "cell_type": "code", + "execution_count": 68, + "id": "b288c5a7-20dd-40ff-b0eb-7a5895bcc464", + "metadata": {}, + "outputs": [], + "source": [ + "COL_GRAPH_SRC = 'src'\n", + "COL_GRAPH_DST = 'dst'\n", + "COL_GRAPH_WEIGHT = 'wgt'\n", + "\n", + "# User to Transactions\n", + "U_2_T = cudf.DataFrame()\n", + "U_2_T[COL_GRAPH_SRC] = data[COL_USER_ID]\n", + "U_2_T[COL_GRAPH_DST] = data[COL_TRANSACTION_ID]\n", + "if make_undirected:\n", + " T_2_U = cudf.DataFrame()\n", + " T_2_U[COL_GRAPH_SRC] = data[COL_TRANSACTION_ID]\n", + " T_2_U[COL_GRAPH_DST] = data[COL_USER_ID]\n", + " U_2_T = cudf.concat([U_2_T, T_2_U])\n", + " del T_2_U\n" + ] + }, + { + "cell_type": "code", + "execution_count": 69, + "id": "a970747d-07a2-43b3-b39c-19a0196fa5b1", + "metadata": {}, + "outputs": [], + "source": [ + "# Transactions to Merchants\n", + "T_2_M = cudf.DataFrame()\n", + "T_2_M[COL_GRAPH_SRC] = data[COL_MERCHANT_ID]\n", + "T_2_M[COL_GRAPH_DST] = data[COL_TRANSACTION_ID]\n", + "\n", + "if make_undirected:\n", + " M_2_T = cudf.DataFrame()\n", + " M_2_T[COL_GRAPH_SRC] = data[COL_TRANSACTION_ID]\n", + " M_2_T[COL_GRAPH_DST] = data[COL_MERCHANT_ID]\n", + " T_2_M = cudf.concat([T_2_M, 
M_2_T])\n", + " del M_2_T" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "80e704fd-ae9f-45b1-ad56-bdc0b743d09f", + "metadata": {}, + "outputs": [], + "source": [ + "Edge = cudf.concat([U_2_T, T_2_M])\n", + "Edge[COL_GRAPH_WEIGHT] = 0.0\n", + "len(Edge)" + ] + }, + { + "cell_type": "code", + "execution_count": 71, + "id": "c74572f6-ff6e-4c8f-803e-0ae2c0587c58", + "metadata": {}, + "outputs": [], + "source": [ + "# Now write out the data\n", + "out_path = os.path.join(sparkov_gnn, 'edges.csv')\n", + "\n", + "if not os.path.exists(os.path.dirname(out_path)):\n", + "    os.makedirs(os.path.dirname(out_path))\n", + "\n", + "Edge.to_csv(out_path, header=False, index=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 72, + "id": "3dd3ff45-3796-4069-9e3a-587743c4e1e0", + "metadata": {}, + "outputs": [], + "source": [ + "del(Edge)\n", + "del(U_2_T)\n", + "del(T_2_M)" + ] + }, + { + "cell_type": "markdown", + "id": "ed00c481-1737-4152-9d23-f3cb24f2adcd", + "metadata": {}, + "source": [ + "### Now the feature data\n", + "Feature data needs to be sorted in order, so that the row index corresponds to the node ID\n", + "\n", + "The data comprises three sets of features\n", + "* Transactions\n", + "* Users\n", + "* Merchants" + ] + }, + { + "cell_type": "markdown", + "id": "805c9d23", + "metadata": {}, + "source": [ + "#### To get feature vectors of Transaction nodes, transform the training data using the pre-fitted transformer" + ] + }, + { + "cell_type": "code", + "execution_count": 73, + "id": "584fe9bf", + "metadata": {}, + "outputs": [], + "source": [ + "node_feature_df = pd.DataFrame(\n", + "    loaded_transformer.transform(\n", + "        data[predictor_columns].to_pandas()\n", + "    ),\n", + "    columns=columns_of_transformed_data).astype(type_mapping)\n", + "\n", + "node_feature_df[COL_FRAUD] = data[COL_FRAUD].to_pandas()" + ] + }, + { + "cell_type": "markdown", + "id": "55aa8f86", + "metadata": {}, + "source": [ + "#### For graph nodes 
associated with merchant and user, add feature vectors of zeros" + ] + }, + { + "cell_type": "code", + "execution_count": 74, + "id": "b35f9f5b", + "metadata": {}, + "outputs": [], + "source": [ + "# Number of graph nodes for users and merchants \n", + "nr_users_and_merchant_nodes = max_user_id - max_tx_id" + ] + }, + { + "cell_type": "code", + "execution_count": 75, + "id": "b5d312bd", + "metadata": {}, + "outputs": [], + "source": [ + "if not spread_features:\n", + "    # Create feature vector of all zeros for each user and merchant node\n", + "    empty_feature_df = cudf.DataFrame(\n", + "        columns=columns_of_transformed_data + target_column,\n", + "        dtype='int8',\n", + "        index=range(nr_users_and_merchant_nodes)\n", + "    )\n", + "    empty_feature_df = empty_feature_df.fillna(0)\n", + "    empty_feature_df = empty_feature_df.astype(type_mapping)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 76, + "id": "a72d3ea5-e04f-4af1-a0e0-09964555c1ed", + "metadata": {}, + "outputs": [], + "source": [ + "if not spread_features:\n", + "    # Concatenate transaction features followed by features for merchants and user nodes\n", + "    node_feature_df = pd.concat([node_feature_df, empty_feature_df.to_pandas()]).astype(type_mapping)" + ] + }, + { + "cell_type": "code", + "execution_count": 77, + "id": "a364d173", + "metadata": {}, + "outputs": [], + "source": [ + "# User specific columns\n", + "if spread_features:\n", + "    user_specific_columns = [COL_CARD]\n", + "    user_specific_columns_of_transformed_data = []\n", + "\n", + "    for col in node_feature_df.columns:\n", + "        if '_'.join(col.split('_')[:-1]) in user_specific_columns:\n", + "            user_specific_columns_of_transformed_data.append(col)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 78, + "id": "92d88c2f", + "metadata": {}, + "outputs": [], + "source": [ + "# Merchant specific columns\n", + "if spread_features:\n", + "    merchant_specific_columns = [COL_MERCHANT, COL_CITY, COL_ZIP, COL_MCC]\n", +
"    merchant_specific_columns_of_transformed_data = []\n", + "\n", + "    for col in node_feature_df.columns:\n", + "        if col.split('_')[0] in merchant_specific_columns:\n", + "            merchant_specific_columns_of_transformed_data.append(col)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 79, + "id": "f62755ae", + "metadata": {}, + "outputs": [], + "source": [ + "# Transaction specific columns\n", + "if spread_features:\n", + "    transaction_specific_columns = list(\n", + "        set(numerical_predictors).union(nominal_predictors)\n", + "        - set(user_specific_columns).union(merchant_specific_columns))\n", + "    transaction_specific_columns_of_transformed_data = []\n", + "\n", + "    for col in node_feature_df.columns:\n", + "        if col.split('_')[0] in transaction_specific_columns:\n", + "            transaction_specific_columns_of_transformed_data.append(col)" + ] + }, + { + "cell_type": "markdown", + "id": "d12061da", + "metadata": {}, + "source": [ + "#### Construct feature vector for merchants" + ] + }, + { + "cell_type": "code", + "execution_count": 80, + "id": "de484a27", + "metadata": {}, + "outputs": [], + "source": [ + "if spread_features:\n", + "    # Find indices of unique merchants\n", + "    idx_df = cudf.DataFrame()\n", + "    idx_df[COL_MERCHANT_ID] = data[COL_MERCHANT_ID]\n", + "    idx_df = idx_df.sort_values(by=COL_MERCHANT_ID)\n", + "    idx_df = idx_df.drop_duplicates(subset=COL_MERCHANT_ID)\n", + "    assert((data.iloc[idx_df.index][COL_MERCHANT_ID] == idx_df[COL_MERCHANT_ID]).all())" + ] + }, + { + "cell_type": "code", + "execution_count": 81, + "id": "5be790eb", + "metadata": {}, + "outputs": [], + "source": [ + "if spread_features:\n", + "    # Copy merchant specific columns, and set the rest to zero\n", + "    merchant_specific_feature_df = node_feature_df.iloc[idx_df.index.to_numpy()]\n", + "    merchant_specific_feature_df.\\\n", + "        loc[:,\n", + "            transaction_specific_columns_of_transformed_data +\n", + "            user_specific_columns_of_transformed_data] = 0.0\n" + ] + }, + { + "cell_type": 
"code", + "execution_count": 82, + "id": "576091c6", + "metadata": {}, + "outputs": [], + "source": [ + "if spread_features:\n", + " # Find indices of unique users\n", + " idx_df = cudf.DataFrame()\n", + " idx_df[COL_USER_ID] = data[COL_USER_ID]\n", + " idx_df = idx_df.sort_values(by=COL_USER_ID)\n", + " idx_df = idx_df.drop_duplicates(subset=COL_USER_ID)\n", + " assert((data.iloc[idx_df.index][COL_USER_ID] == idx_df[COL_USER_ID]).all())" + ] + }, + { + "cell_type": "code", + "execution_count": 83, + "id": "aec23ee5", + "metadata": {}, + "outputs": [], + "source": [ + "if spread_features:\n", + " # Copy user specific columns, and set the rest to zero\n", + " user_specific_feature_df = node_feature_df.iloc[idx_df.index.to_numpy()]\n", + " user_specific_feature_df.\\\n", + " loc[:,\n", + " transaction_specific_columns_of_transformed_data +\n", + " merchant_specific_columns_of_transformed_data] = 0.0 " + ] + }, + { + "cell_type": "code", + "execution_count": 84, + "id": "8296a341", + "metadata": {}, + "outputs": [], + "source": [ + "# Concatenate features of node, user and merchant\n", + "if spread_features:\n", + " \n", + " node_feature_df[merchant_specific_columns_of_transformed_data] = 0.0\n", + " node_feature_df[user_specific_columns_of_transformed_data] = 0.0\n", + " node_feature_df = pd.concat(\n", + " [node_feature_df, merchant_specific_feature_df, user_specific_feature_df]\n", + " ).astype(type_mapping)\n", + " \n", + " # features to save\n", + " node_feature_df = node_feature_df[\n", + " transaction_specific_columns_of_transformed_data +\n", + " merchant_specific_columns_of_transformed_data +\n", + " user_specific_columns_of_transformed_data + [COL_FRAUD]]\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a54aa686", + "metadata": {}, + "outputs": [], + "source": [ + "node_feature_df.columns" + ] + }, + { + "cell_type": "code", + "execution_count": 86, + "id": "527f6ea8", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# 
target labels to save\n", + "label_df = node_feature_df[[COL_FRAUD]]" + ] + }, + { + "cell_type": "code", + "execution_count": 87, + "id": "15e1cba8", + "metadata": {}, + "outputs": [], + "source": [ + "# Remove target label from feature vectors\n", + "_ = node_feature_df.drop(columns=[COL_FRAUD], inplace=True)" + ] + }, + { + "cell_type": "markdown", + "id": "310d9500", + "metadata": {}, + "source": [ + "#### Write out node features and target labels" + ] + }, + { + "cell_type": "code", + "execution_count": 88, + "id": "139bfd9f", + "metadata": {}, + "outputs": [], + "source": [ + "# Write node target label to csv file\n", + "out_path = os.path.join(sparkov_gnn, 'labels.csv')\n", + "\n", + "if not os.path.exists(os.path.dirname(out_path)):\n", + " os.makedirs(os.path.dirname(out_path))\n", + "\n", + "label_df.to_csv(out_path, header=False, index=False)\n", + "# label_df.to_parquet(out_path, index=False, compression='gzip')" + ] + }, + { + "cell_type": "code", + "execution_count": 89, + "id": "b8fe801e", + "metadata": {}, + "outputs": [], + "source": [ + "# Write node features to csv file\n", + "out_path = os.path.join(sparkov_gnn, 'features.csv')\n", + "\n", + "if not os.path.exists(os.path.dirname(out_path)):\n", + " os.makedirs(os.path.dirname(out_path))\n", + "node_feature_df[columns_of_transformed_data].to_csv(out_path, header=True, index=False)\n", + "# node_feature_df.to_parquet(out_path, index=False, compression='gzip')" + ] + }, + { + "cell_type": "code", + "execution_count": 90, + "id": "fbe75d91", + "metadata": {}, + "outputs": [], + "source": [ + "# Delete dataFrames\n", + "del data\n", + "del node_feature_df\n", + "del label_df\n", + "\n", + "if spread_features:\n", + " del merchant_specific_feature_df\n", + " del user_specific_feature_df\n", + "else:\n", + " del empty_feature_df" + ] + }, + { + "cell_type": "markdown", + "id": "657362a9", + "metadata": {}, + "source": [ + "#### Number of transaction nodes in training data" + ] + }, + { + "cell_type": 
"code", + "execution_count": null, + "id": "47b9ccd9", + "metadata": {}, + "outputs": [], + "source": [ + "# Number of transaction nodes, needed for GNN training\n", + "nr_transaction_nodes = max_tx_id + 1\n", + "nr_transaction_nodes" + ] + }, + { + "cell_type": "markdown", + "id": "1fce29ee", + "metadata": {}, + "source": [ + "#### Save variable for training and inference" + ] + }, + { + "cell_type": "code", + "execution_count": 92, + "id": "c3bf9b46", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "variables_to_save = {\n", + " k: v for k, v in globals().items() if isinstance(v, (str, int)) and k.startswith('COL_')}" + ] + }, + { + "cell_type": "code", + "execution_count": 93, + "id": "54cc3c06", + "metadata": {}, + "outputs": [], + "source": [ + "variables_to_save['NUM_TRANSACTION_NODES'] = int(nr_transaction_nodes)" + ] + }, + { + "cell_type": "code", + "execution_count": 94, + "id": "9eb8bdd1", + "metadata": {}, + "outputs": [], + "source": [ + "# Save the dictionary to a JSON file\n", + "with open(os.path.join(sparkov_base_path, 'variables.json'), 'w') as json_file:\n", + " json.dump(variables_to_save, json_file, indent=4)" + ] + }, + { + "cell_type": "markdown", + "id": "fa2f6f28", + "metadata": {}, + "source": [ + "## That's it!\n", + "The data is now ready for processing" + ] + }, + { + "cell_type": "markdown", + "id": "49c13b3b", + "metadata": {}, + "source": [ + "## Copyright and License\n", + "
\n", + "Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.\n", + "\n", + "
\n", + "\n", + " Licensed under the Apache License, Version 2.0 (the \"License\");\n", + " you may not use this file except in compliance with the License.\n", + " You may obtain a copy of the License at\n", + " \n", + " http://www.apache.org/licenses/LICENSE-2.0\n", + " \n", + " Unless required by applicable law or agreed to in writing, software\n", + " distributed under the License is distributed on an \"AS IS\" BASIS,\n", + " WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + " See the License for the specific language governing permissions and\n", + " limitations under the License." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "mamba_env", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/ai-credit-fraud-workflow/notebooks/preprocess_Tabformer.ipynb b/ai-credit-fraud-workflow/notebooks/preprocess_Tabformer.ipynb new file mode 100644 index 0000000..8c4aa3a --- /dev/null +++ b/ai-credit-fraud-workflow/notebooks/preprocess_Tabformer.ipynb @@ -0,0 +1,1944 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "9c6a5b09-a601-47c6-989f-5efb42d7f4f8", + "metadata": {}, + "source": [ + "# Credit Card Transaction Data Cleanup and Prep \n", + "\n", + "This notebook shows the steps for cleanup and preparing the credit card transaction data for follow on GNN training with GraphSAGE.\n", + "\n", + "### The dataset:\n", + " * IBM TabFormer: https://github.com/IBM/TabFormer\n", + " * Released under an Apache 2.0 license\n", + "\n", + "Contains 24M records with 15 fields, one field being the \"is fraud\" label which we use for training.\n", + "\n", + "### Goals\n", + "The goal is to:\n", + " * Cleanup the data\n", + " 
* Make field names single words\n", + " * while field names are not used within the GNN, it makes accessing fields easier during cleanup \n", + " * Encode categorical fields\n", + " * use one-hot encoding for fields with 8 or fewer categories\n", + " * use binary encoding for fields with more than 8 categories\n", + " * Create a contiguous node index across users, merchants, and transactions\n", + " * having node IDs start at zero and then be contiguous is critical for creation of Compressed Sparse Row (CSR) formatted data without wasting memory.\n", + " * Produce:\n", + " * For XGBoost:\n", + " * Training - all data before 2018\n", + " * Validation - all data during 2018\n", + " * Test - all data after 2018\n", + " * For GNN\n", + " * Training Data \n", + " * Edge List \n", + " * Feature data\n", + " * Test set - all data after 2018\n", + "\n", + "\n", + "\n", + "### Graph formation\n", + "Given that we are limited to just the data in the transaction file, the ideal model would be to have a bipartite graph of Users to Merchants where the edges represent the credit card transactions and then perform Link Classification on the Edges to identify fraud. Unfortunately, the current version of cuGraph does not support GNN Link Prediction. That limitation will be lifted over the next few releases, at which time this code will be updated. Luckily, there is precedent for viewing transactions as nodes and then doing node classification using the popular GraphSAGE GNN. That is the approach this code takes. The produced graph will be a tri-partite graph where each transaction is represented as a node.\n", + "\n", + "\n", + "\n", + "\n", + "### Features\n", + "For the XGBoost approach, there is no need to generate empty features for the Merchants. However, for GNN processing, every node needs to have the same set of feature data. Therefore, we need to generate empty features for the User and Merchant nodes. 
\n", + "\n", + "-----" + ] + }, + { + "cell_type": "markdown", + "id": "795bdece", + "metadata": {}, + "source": [ + "#### Import the necessary libraries. In this case we will use cuDF and perform most of the data prep on the GPU\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "4b6b2bc6-a206-42c5-aae9-590672b3a202", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "import json\n", + "import os\n", + "import pickle\n", + "\n", + "import cudf\n", + "import numpy as np\n", + "import pandas as pd\n", + "import scipy.stats as ss\n", + "from category_encoders import BinaryEncoder\n", + "from scipy.stats import pointbiserialr\n", + "from sklearn.compose import ColumnTransformer\n", + "from sklearn.impute import SimpleImputer\n", + "from sklearn.pipeline import Pipeline\n", + "from sklearn.preprocessing import OneHotEncoder, RobustScaler, StandardScaler" + ] + }, + { + "cell_type": "markdown", + "id": "81db641b", + "metadata": {}, + "source": [ + "-------\n", + "#### Define some arguments" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "016964ce", + "metadata": {}, + "outputs": [], + "source": [ + "# Whether the graph is undirected\n", + "make_undirected = True\n", + "\n", + "# Whether to spread features across User and Merchant nodes\n", + "spread_features = False\n", + "\n", + "# Whether we should under-sample the majority class (i.e. 
non-fraud transactions)\n", + "under_sample = True\n", + "\n", + "# Ratio of fraud to non-fraud transactions, used when we under-sample the majority class\n", + "fraud_ratio = 0.1\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "656e6aee-038a-4b58-9296-993e06defb35", + "metadata": {}, + "outputs": [], + "source": [ + "tabformer_base_path = '../data/TabFormer'\n", + "tabformer_raw_file_path = os.path.join(tabformer_base_path, 'raw', 'card_transaction.v1.csv')\n", + "tabformer_xgb = os.path.join(tabformer_base_path, 'xgb')\n", + "tabformer_gnn = os.path.join(tabformer_base_path, 'gnn')\n", + "\n", + "if not os.path.exists(tabformer_xgb):\n", + " os.makedirs(tabformer_xgb)\n", + "if not os.path.exists(tabformer_gnn):\n", + " os.makedirs(tabformer_gnn)" + ] + }, + { + "cell_type": "markdown", + "id": "96fe43fe", + "metadata": {}, + "source": [ + "--------\n", + "#### Load and understand the data" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "fb41e6ea-1e9f-4f14-99a4-d6d3df092a37", + "metadata": {}, + "outputs": [], + "source": [ + "# Read the dataset\n", + "data = cudf.read_csv(tabformer_raw_file_path)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3d9a9ab4-4240-4824-997b-8bfdb640381c", + "metadata": {}, + "outputs": [], + "source": [ + "# optional - take a look at the data \n", + "data.head(5)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d66495f5", + "metadata": {}, + "outputs": [], + "source": [ + "data.columns" + ] + }, + { + "cell_type": "markdown", + "id": "73172495", + "metadata": {}, + "source": [ + "#### Findings\n", + "* Ordinal categorical fields - 'Year', 'Month', 'Day'\n", + "* Nominal categorical fields - 'User', 'Card', 'Merchant Name', 'Merchant City', 'Merchant State', 'Zip', 'MCC', 'Errors?'\n", + "* Target label - 'Is Fraud?'" + ] + }, + { + "cell_type": "markdown", + "id": "f285adae", + "metadata": {}, + "source": [ + "#### Check whether there are Null values 
in the data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c1f58262", + "metadata": {}, + "outputs": [], + "source": [ + "# Check which fields are missing values\n", + "data.isnull().sum()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c6f20eb9", + "metadata": {}, + "outputs": [], + "source": [ + "# Check percentage of missing values\n", + "100*data.isnull().sum()/len(data)" + ] + }, + { + "cell_type": "markdown", + "id": "805d62ba", + "metadata": {}, + "source": [ + "#### Findings\n", + "* For many transactions 'Merchant State' and 'Zip' are missing, but it's good that all of the transactions have 'Merchant City' specified. \n", + "* Over 98% of the transactions are missing data for 'Errors?' fields." + ] + }, + { + "cell_type": "markdown", + "id": "33487e74", + "metadata": {}, + "source": [ + "##### Save a few transactions before any operations on data" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "e8c188c1", + "metadata": {}, + "outputs": [], + "source": [ + "# Write a few raw transactions for model's inference notebook\n", + "out_path = os.path.join(tabformer_xgb, 'example_transactions.csv')\n", + "data.tail(10).to_pandas().to_csv(out_path, header=True, index=False)" + ] + }, + { + "cell_type": "markdown", + "id": "57513227", + "metadata": {}, + "source": [ + "#### Let's rename the column names to single words and use variables for column names to make access easier" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "d35f7230", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "COL_USER = 'User'\n", + "COL_CARD = 'Card'\n", + "COL_AMOUNT = 'Amount'\n", + "COL_MCC = 'MCC'\n", + "COL_TIME = 'Time'\n", + "COL_DAY = 'Day'\n", + "COL_MONTH = 'Month'\n", + "COL_YEAR = 'Year'\n", + "\n", + "COL_MERCHANT = 'Merchant'\n", + "COL_STATE ='State'\n", + "COL_CITY ='City'\n", + "COL_ZIP = 'Zip'\n", + "COL_ERROR = 'Errors'\n", + "COL_CHIP = 'Chip'\n", + "COL_FRAUD = 'Fraud'" 
+ ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "90aa3fb5", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "_ = data.rename(columns={\n", + " \"Merchant Name\": COL_MERCHANT,\n", + " \"Merchant State\": COL_STATE,\n", + " \"Merchant City\": COL_CITY,\n", + " \"Errors?\": COL_ERROR,\n", + " \"Use Chip\": COL_CHIP,\n", + " \"Is Fraud?\": COL_FRAUD\n", + " },\n", + " inplace=True\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "ee33e39b", + "metadata": {}, + "source": [ + "#### Handle missing values\n", + "* Zip codes are numeric; replace missing zip codes with 0\n", + "* State and Error are strings; replace missing values with the marker 'XX'" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "79e24ab7", + "metadata": {}, + "outputs": [], + "source": [ + "UNKNOWN_STRING_MARKER = 'XX'\n", + "UNKNOWN_ZIP_CODE = 0" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "7b774e17", + "metadata": {}, + "outputs": [], + "source": [ + "# Make sure that 'XX' doesn't exist in the State and Error fields before we replace missing values with 'XX'\n", + "assert(UNKNOWN_STRING_MARKER not in set(data[COL_STATE].unique().to_pandas()))\n", + "assert(UNKNOWN_STRING_MARKER not in set(data[COL_ERROR].unique().to_pandas()))" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "a7964564", + "metadata": {}, + "outputs": [], + "source": [ + "# Make sure that 0 or 0.0 doesn't exist in the Zip field before we replace missing values with 0\n", + "assert(float(0) not in set(data[COL_ZIP].unique().to_pandas()))\n", + "assert(0 not in set(data[COL_ZIP].unique().to_pandas()))" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "a1baca88", + "metadata": {}, + "outputs": [], + "source": [ + "# Replace missing values with markers\n", + "data[COL_STATE] = data[COL_STATE].fillna(UNKNOWN_STRING_MARKER)\n", + "data[COL_ERROR] = data[COL_ERROR].fillna(UNKNOWN_STRING_MARKER)\n", + "data[COL_ZIP] = 
data[COL_ZIP].fillna(UNKNOWN_ZIP_CODE)" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "07cf40c4", + "metadata": {}, + "outputs": [], + "source": [ + "# There shouldn't be any missing values in the data now.\n", + "assert(data.isnull().sum().sum() == 0)" + ] + }, + { + "cell_type": "markdown", + "id": "5f027291-5d0b-4917-ada0-a0dbe6b80f9b", + "metadata": {}, + "source": [ + "### Clean up the Amount field\n", + "* Drop the \"$\" from the Amount field and then convert from string to float\n", + "* Look into the spread of Amount and choose the right scaler for it" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "3ffe11c2-5e6d-4fac-8b42-27efb02afa61", + "metadata": {}, + "outputs": [], + "source": [ + "# Drop the literal \"$\" from the Amount field (regex=False so it is not read as an end-of-string anchor) and then convert from string to float \n", + "data[COL_AMOUNT] = data[COL_AMOUNT].str.replace(\"$\", \"\", regex=False).astype(\"float\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "09bd4966", + "metadata": {}, + "outputs": [], + "source": [ + "data[COL_AMOUNT].describe()" + ] + }, + { + "cell_type": "markdown", + "id": "ab82a9f9", + "metadata": {}, + "source": [ + "#### Let's look into how the Amount differs between fraud and non-fraud transactions" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "31679a1d", + "metadata": {}, + "outputs": [], + "source": [ + "# Fraud transactions\n", + "data[COL_AMOUNT][data[COL_FRAUD]=='Yes'].describe()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0f49ff49", + "metadata": {}, + "outputs": [], + "source": [ + "# Non-fraud transactions\n", + "data[COL_AMOUNT][data[COL_FRAUD]=='No'].describe()" + ] + }, + { + "cell_type": "markdown", + "id": "a190bb9e", + "metadata": {}, + "source": [ + "#### Findings\n", + "* 25th percentile = 9.2\n", + "* 75th percentile = 65\n", + "* Median is around 30 and the mean is around 43 whereas the max value is over 1200 and min value is -500\n", + "* Average amount in 
Fraud transactions > 2x the average amount in Non-Fraud transactions\n", + "\n", + "We need to scale the data, and RobustScaler could be a good choice for it." + ] + }, + { + "cell_type": "markdown", + "id": "b96a9ae1-1dcf-4480-a808-3afa913cb292", + "metadata": {}, + "source": [ + "#### Now the \"Fraud\" field" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d7b6c719", + "metadata": {}, + "outputs": [], + "source": [ + "# How many different categories are there in the COL_FRAUD column?\n", + "# The hope is that there are only two categories, 'Yes' and 'No'\n", + "data[COL_FRAUD].unique()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b5004040", + "metadata": {}, + "outputs": [], + "source": [ + "data[COL_FRAUD].value_counts()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "62d498e1", + "metadata": {}, + "outputs": [], + "source": [ + "100 * data[COL_FRAUD].value_counts()/len(data)" + ] + }, + { + "cell_type": "markdown", + "id": "a4f13282", + "metadata": {}, + "source": [ + "#### Change the 'Fraud' values to be integer where\n", + " * 1 == Fraud\n", + " * 0 == Non-fraud" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "aa31c844", + "metadata": {}, + "outputs": [], + "source": [ + "fraud_to_binary = {'No': 0, 'Yes': 1}\n", + "data[COL_FRAUD] = data[COL_FRAUD].map(fraud_to_binary).astype('int8')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7527510d", + "metadata": {}, + "outputs": [], + "source": [ + "data[COL_FRAUD].value_counts()" + ] + }, + { + "cell_type": "markdown", + "id": "54f56d5a-f135-4af2-ba13-b926f66a045f", + "metadata": {}, + "source": [ + "#### The 'City', 'State', and 'Zip' columns" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2dff40cb", + "metadata": {}, + "outputs": [], + "source": [ + "# City\n", + "data[COL_CITY].unique()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": 
"9cdf46ef", + "metadata": {}, + "outputs": [], + "source": [ + "# State\n", + "data[COL_STATE].unique()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "36297321-fb9b-48c6-afce-f083834eea4e", + "metadata": {}, + "outputs": [], + "source": [ + "# Zip\n", + "data[COL_ZIP].unique()" + ] + }, + { + "cell_type": "markdown", + "id": "ab51419d-c051-489b-af63-935248c133d0", + "metadata": {}, + "source": [ + "#### The 'Chip' column\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5ae85372-0f22-4850-bfa4-b8513b742663", + "metadata": {}, + "outputs": [], + "source": [ + "data[COL_CHIP].unique()" + ] + }, + { + "cell_type": "markdown", + "id": "22939e0f-bae0-4af3-aa3b-1b79974c0697", + "metadata": {}, + "source": [ + "#### The 'Error' column" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b877a558-4306-49f1-aa75-ba535be4470b", + "metadata": {}, + "outputs": [], + "source": [ + "data[COL_ERROR].unique()" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": "aa6a67c0", + "metadata": {}, + "outputs": [], + "source": [ + "# Remove ',' in error descriptions\n", + "data[COL_ERROR] = data[COL_ERROR].str.replace(\",\",\"\")" + ] + }, + { + "cell_type": "markdown", + "id": "86ca593a", + "metadata": {}, + "source": [ + "#### Findings\n", + "We can one hot or binary encode columns with fewer categories and binary/hash encode columns with more than 8 categories" + ] + }, + { + "cell_type": "markdown", + "id": "50933790-780c-43cc-833d-c7ad16acbde3", + "metadata": {}, + "source": [ + "#### Time\n", + "Time is captured as hour:minute.\n", + "\n", + "We are converting the time to just be the number of minutes.\n", + "\n", + "time = (hour * 60) + minutes" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a97c5a95", + "metadata": {}, + "outputs": [], + "source": [ + "data[COL_TIME].describe()" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "id": 
"1df15290-f60f-416d-81a4-437ff45b6d92", + "metadata": {}, + "outputs": [], + "source": [ + "# Split the time column into hours and minutes and then cast to int32\n", + "T = data[COL_TIME].str.split(':', expand=True)\n", + "T[0] = T[0].astype('int32')\n", + "T[1] = T[1].astype('int32')" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "id": "15d77736-53dd-4af6-a475-9ad812f84731", + "metadata": {}, + "outputs": [], + "source": [ + "# replace the 'Time' column with the new columns\n", + "data[COL_TIME] = (T[0] * 60 ) + T[1]\n", + "data[COL_TIME] = data[COL_TIME].astype(\"int32\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "id": "d51b6840-2912-4ecb-9998-7adc680f9d87", + "metadata": {}, + "outputs": [], + "source": [ + "# Delete temporary DataFrame\n", + "del(T)" + ] + }, + { + "cell_type": "markdown", + "id": "a8d41134", + "metadata": {}, + "source": [ + "#### Merchant column" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "aac83d7d", + "metadata": {}, + "outputs": [], + "source": [ + "data[COL_MERCHANT] " + ] + }, + { + "cell_type": "markdown", + "id": "2f79e111", + "metadata": {}, + "source": [ + "#### Convert the column to str type" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bd0348c4", + "metadata": {}, + "outputs": [], + "source": [ + "data[COL_MERCHANT] = data[COL_MERCHANT].astype('str')\n", + "\n", + "# TOver 100,000 merchants\n", + "data[COL_MERCHANT].unique()" + ] + }, + { + "cell_type": "markdown", + "id": "d8b4daee", + "metadata": {}, + "source": [ + "#### The Card column\n", + "* \"Card 0\" for User 1 is different from \"Card 0\" for User 2.\n", + "* Combine User and Card in a way such that (User, Card) combination is unique" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1a2abade", + "metadata": {}, + "outputs": [], + "source": [ + "data[COL_CARD].unique()" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "id": "068a05b0", + 
"metadata": {}, + "outputs": [], + "source": [ + "max_nr_cards_per_user = len(data[COL_CARD].unique())" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "id": "5a64bb4f", + "metadata": {}, + "outputs": [], + "source": [ + "# Combine User and Card to generate unique numbers\n", + "data[COL_CARD] = data[COL_USER] * len(data[COL_CARD].unique()) + data[COL_CARD]\n", + "data[COL_CARD] = data[COL_CARD].astype('int')" + ] + }, + { + "cell_type": "markdown", + "id": "5a815bc9", + "metadata": {}, + "source": [ + "#### Define function to compute correlation of different categorical fields with target" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "id": "cfaa31fd", + "metadata": {}, + "outputs": [], + "source": [ + "# https://en.wikipedia.org/wiki/Cram%C3%A9r's_V\n", + "\n", + "def cramers_v(x, y):\n", + " confusion_matrix = cudf.crosstab(x, y).to_numpy()\n", + " chi2 = ss.chi2_contingency(confusion_matrix)[0]\n", + " n = confusion_matrix.sum().sum()\n", + " r, k = confusion_matrix.shape\n", + " return np.sqrt(chi2 / (n * (min(k-1, r-1))))" + ] + }, + { + "cell_type": "markdown", + "id": "1fa39773", + "metadata": {}, + "source": [ + "##### Compute correlation of different fields with target" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e5497f70", + "metadata": {}, + "outputs": [], + "source": [ + "sparse_factor = 1\n", + "columns_to_compute_corr = [COL_CARD, COL_CHIP, COL_ERROR, COL_STATE, COL_CITY, COL_ZIP, COL_MCC, COL_MERCHANT, COL_USER, COL_DAY, COL_MONTH, COL_YEAR]\n", + "for c1 in columns_to_compute_corr:\n", + " for c2 in [COL_FRAUD]:\n", + " coff = 100 * cramers_v(data[c1][::sparse_factor], data[c2][::sparse_factor])\n", + " print('Correlation ({}, {}) = {:6.2f}%'.format(c1, c2, coff))" + ] + }, + { + "cell_type": "markdown", + "id": "6dbc4636", + "metadata": {}, + "source": [ + "### Correlation of target with numerical columns" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7a624f77", 
+ "metadata": {}, + "outputs": [], + "source": [ + "# https://en.wikipedia.org/wiki/Point-biserial_correlation_coefficient\n", + "# Use the point-biserial correlation coefficient (r_pb) to check whether the numerical columns are useful for predicting whether a transaction is fraud\n", + "\n", + "\n", + "for col in [COL_TIME, COL_AMOUNT]:\n", + " r_pb, p_value = pointbiserialr(data[COL_FRAUD].to_pandas(), data[col].to_pandas())\n", + " print('r_pb ({}) = {:3.2f} with p_value {:3.2f}'.format(col, r_pb, p_value))" + ] + }, + { + "cell_type": "markdown", + "id": "041e3c50", + "metadata": {}, + "source": [ + "### Findings\n", + "* Clearly, Time is not an important predictor\n", + "* Amount has 3% correlation with the target" + ] + }, + { + "cell_type": "markdown", + "id": "f92d58ef", + "metadata": {}, + "source": [ + "#### Based on correlation, select a set of columns (aka fields) to predict whether a transaction is fraud" + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "id": "a00a7813", + "metadata": {}, + "outputs": [], + "source": [ + "# As the cross correlation of Fraud with Day, Month, Year is significantly lower,\n", + "# we can skip them for now and add these features later.\n", + "\n", + "numerical_predictors = [COL_AMOUNT]\n", + "nominal_predictors = [COL_ERROR, COL_CARD, COL_CHIP, COL_CITY, COL_ZIP, COL_MCC, COL_MERCHANT]\n", + "\n", + "predictor_columns = numerical_predictors + nominal_predictors\n", + "\n", + "target_column = [COL_FRAUD]" + ] + }, + { + "cell_type": "markdown", + "id": "9341f5e0", + "metadata": {}, + "source": [ + "#### Remove duplicate non-fraud data points" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "id": "428c4ac8", + "metadata": {}, + "outputs": [], + "source": [ + "# Remove duplicate non-fraud data points; keep every fraud data point\n", + "fraud_data = data[data[COL_FRAUD] == 1]\n", + "data = data[data[COL_FRAUD] == 0]\n", + "data = data.drop_duplicates(subset=nominal_predictors)\n", + "data = cudf.concat([data, fraud_data])" + ] + }, + { + "cell_type": 
"code", + "execution_count": null, + "id": "14a8bbce", + "metadata": {}, + "outputs": [], + "source": [ + "# Percentage of fraud and non-fraud cases\n", + "100*data[COL_FRAUD].value_counts()/len(data)" + ] + }, + { + "cell_type": "markdown", + "id": "8bc2ebac", + "metadata": {}, + "source": [ + "### Split the data\n", + "The data will be split into three groups based on event date:\n", + " * Training - all data before 2018\n", + " * Validation - all data during 2018\n", + " * Test - all data after 2018" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "97dc748b", + "metadata": {}, + "outputs": [], + "source": [ + "if under_sample: \n", + " fraud_df = data[data[COL_FRAUD]==1]\n", + " non_fraud_df = data[data[COL_FRAUD]==0]\n", + " nr_non_fraud_samples = min((len(data) - len(fraud_df)), int(len(fraud_df)/fraud_ratio))\n", + " data = cudf.concat([fraud_df, non_fraud_df.sample(nr_non_fraud_samples)])\n", + "\n", + "training_idx = data[COL_YEAR] < 2018\n", + "validation_idx = data[COL_YEAR] == 2018\n", + "test_idx = data[COL_YEAR] > 2018\n", + "\n", + "data[COL_FRAUD].value_counts()" + ] + }, + { + "cell_type": "markdown", + "id": "cd036929", + "metadata": {}, + "source": [ + "### Scale numerical columns and encode categorical columns of training data" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "id": "2d570cdc", + "metadata": {}, + "outputs": [], + "source": [ + "# As some of the encoders we want to use are not available in cuML, we use pandas for now.\n", + "# Move training data to pandas for preprocessing\n", + "pdf_training = data[training_idx].to_pandas()[predictor_columns + target_column]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d135cc8a", + "metadata": {}, + "outputs": [], + "source": [ + "# Use one-hot encoding for columns with <= 8 categories, and binary encoding for columns with more categories \n", + "columns_for_binary_encoding = []\n", + "columns_for_onehot_encoding = []\n", + 
"for col in nominal_predictors:\n", + " print(col, len(data[col].unique()))\n", + " if len(data[col].unique()) <= 8:\n", + " columns_for_onehot_encoding.append(col)\n", + " else:\n", + " columns_for_binary_encoding.append(col)" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "id": "ee6b77fc", + "metadata": {}, + "outputs": [], + "source": [ + "# Mark categorical column as \"category\"\n", + "pdf_training[nominal_predictors] = pdf_training[nominal_predictors].astype(\"category\")" + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "id": "49eb57f5", + "metadata": {}, + "outputs": [], + "source": [ + "# encoders to encode categorical columns and scalers to scale numerical columns\n", + "\n", + "bin_encoder = Pipeline(\n", + " steps=[\n", + " (\"binary\", BinaryEncoder(handle_missing='value', handle_unknown='value'))\n", + " ]\n", + ")\n", + "onehot_encoder = Pipeline(\n", + " steps=[\n", + " (\"onehot\", OneHotEncoder())\n", + " ]\n", + ")\n", + "std_scaler = Pipeline(\n", + " steps=[(\"imputer\", SimpleImputer(strategy=\"median\")), (\"standard\", StandardScaler())],\n", + ")\n", + "robust_scaler = Pipeline(\n", + " steps=[(\"imputer\", SimpleImputer(strategy=\"median\")), (\"robust\", RobustScaler())],\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "id": "613db861", + "metadata": {}, + "outputs": [], + "source": [ + "# compose encoders and scalers in a column transformer\n", + "transformer = ColumnTransformer(\n", + " transformers=[\n", + " (\"binary\", bin_encoder, columns_for_binary_encoding),\n", + " (\"onehot\", onehot_encoder, columns_for_onehot_encoding),\n", + " (\"robust\", robust_scaler, [COL_AMOUNT]),\n", + " ], remainder=\"passthrough\"\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "id": "de594998", + "metadata": {}, + "outputs": [], + "source": [ + "# Fit column transformer with training data\n", + "\n", + "pd.set_option('future.no_silent_downcasting', True)\n", + 
"transformer = transformer.fit(pdf_training[predictor_columns])" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "id": "e3f88ece", + "metadata": {}, + "outputs": [], + "source": [ + "# transformed column names\n", + "columns_of_transformed_data = list(\n", + " map(lambda name: name.split('__')[1],\n", + " list(transformer.get_feature_names_out(predictor_columns))))" + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "id": "2bdc0acc", + "metadata": {}, + "outputs": [], + "source": [ + "# data type of transformed columns \n", + "type_mapping = {}\n", + "for col in columns_of_transformed_data:\n", + " if col.split('_')[0] in nominal_predictors:\n", + " type_mapping[col] = 'int8'\n", + " elif col in numerical_predictors:\n", + " type_mapping[col] = 'float'\n", + " elif col in target_column:\n", + " type_mapping[col] = data.dtypes.to_dict()[col]" + ] + }, + { + "cell_type": "code", + "execution_count": 56, + "id": "76332e33", + "metadata": {}, + "outputs": [], + "source": [ + "# transform training data\n", + "preprocessed_training_data = transformer.transform(pdf_training[predictor_columns])\n", + "\n", + "# Convert transformed data to panda DataFrame\n", + "preprocessed_training_data = pd.DataFrame(\n", + " preprocessed_training_data, columns=columns_of_transformed_data)\n", + "# Copy target column\n", + "preprocessed_training_data[COL_FRAUD] = pdf_training[COL_FRAUD].values\n", + "preprocessed_training_data = preprocessed_training_data.astype(type_mapping)" + ] + }, + { + "cell_type": "code", + "execution_count": 57, + "id": "078b4f3f", + "metadata": {}, + "outputs": [], + "source": [ + "# Save the transformer \n", + "\n", + "with open(os.path.join(tabformer_base_path, 'preprocessor.pkl'),'wb') as f:\n", + " pickle.dump(transformer, f)" + ] + }, + { + "cell_type": "markdown", + "id": "48e46229", + "metadata": {}, + "source": [ + "#### Save transformed training data for XGBoost training" + ] + }, + { + "cell_type": "code", + 
"execution_count": 58, + "id": "e3362673", + "metadata": {}, + "outputs": [], + "source": [ + "with open(os.path.join(tabformer_base_path, 'preprocessor.pkl'),'rb') as f:\n", + " loaded_transformer = pickle.load(f)" + ] + }, + { + "cell_type": "code", + "execution_count": 59, + "id": "47f6663e", + "metadata": {}, + "outputs": [], + "source": [ + "# Transform test data using the transformer fitted on training data\n", + "pdf_test = data[test_idx].to_pandas()[predictor_columns + target_column]\n", + "pdf_test[nominal_predictors] = pdf_test[nominal_predictors].astype(\"category\")\n", + "\n", + "preprocessed_test_data = loaded_transformer.transform(pdf_test[predictor_columns])\n", + "preprocessed_test_data = pd.DataFrame(preprocessed_test_data, columns=columns_of_transformed_data)\n", + "\n", + "# Copy target column\n", + "preprocessed_test_data[COL_FRAUD] = pdf_test[COL_FRAUD].values\n", + "preprocessed_test_data = preprocessed_test_data.astype(type_mapping)" + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "id": "0ce80a1b", + "metadata": {}, + "outputs": [], + "source": [ + "# Transform validation data using the transformer fitted on training data\n", + "pdf_validation = data[validation_idx].to_pandas()[predictor_columns + target_column]\n", + "pdf_validation[nominal_predictors] = pdf_validation[nominal_predictors].astype(\"category\")\n", + "\n", + "preprocessed_validation_data = loaded_transformer.transform(pdf_validation[predictor_columns])\n", + "preprocessed_validation_data = pd.DataFrame(preprocessed_validation_data, columns=columns_of_transformed_data)\n", + "\n", + "# Copy target column\n", + "preprocessed_validation_data[COL_FRAUD] = pdf_validation[COL_FRAUD].values\n", + "preprocessed_validation_data = preprocessed_validation_data.astype(type_mapping)" + ] + }, + { + "cell_type": "markdown", + "id": "cb2ca66b-d3dc-4f67-9754-b90bbea6e286", + "metadata": {}, + "source": [ + "## Write out the data for XGB" + ] + }, + { + "cell_type": "code", + 
"execution_count": 61, + "id": "89c16cfb-0bd4-4efb-a610-f3ae7445d96e", + "metadata": {}, + "outputs": [], + "source": [ + "## Training data\n", + "out_path = os.path.join(tabformer_xgb, 'training.csv')\n", + "if not os.path.exists(os.path.dirname(out_path)):\n", + " os.makedirs(os.path.dirname(out_path))\n", + "preprocessed_training_data.to_csv(out_path, header=True, index=False, columns=columns_of_transformed_data + target_column)\n", + "# preprocessed_training_data.to_parquet(out_path, index=False, compression='gzip')" + ] + }, + { + "cell_type": "code", + "execution_count": 62, + "id": "f3ef6b19-062b-42d5-8caa-6f6011648b4a", + "metadata": {}, + "outputs": [], + "source": [ + "## validation data\n", + "out_path = os.path.join(tabformer_xgb, 'validation.csv')\n", + "if not os.path.exists(os.path.dirname(out_path)):\n", + " os.makedirs(os.path.dirname(out_path))\n", + "preprocessed_validation_data.to_csv(out_path, header=True, index=False, columns=columns_of_transformed_data + target_column)\n", + "# preprocessed_validation_data.to_parquet(out_path, index=False, compression='gzip')" + ] + }, + { + "cell_type": "code", + "execution_count": 63, + "id": "cdc8e3b9-841d-49f3-b3ae-3017a04605e3", + "metadata": {}, + "outputs": [], + "source": [ + "## test data\n", + "out_path = os.path.join(tabformer_xgb, 'test.csv')\n", + "preprocessed_test_data.to_csv(out_path, header=True, index=False, columns=columns_of_transformed_data + target_column)\n", + "# preprocessed_test_data.to_parquet(out_path, index=False, compression='gzip')" + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "id": "67fb32b0", + "metadata": {}, + "outputs": [], + "source": [ + "# Write untransformed test data that has only (renamed) predictor columns and target\n", + "out_path = os.path.join(tabformer_xgb, 'untransformed_test.csv')\n", + "pdf_test.to_csv(out_path, header=True, index=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 65, + "id": "2d6cb604", + "metadata": {}, + 
"outputs": [], + "source": [ + "# Delete dataFrames that are not needed anymore\n", + "del(pdf_training)\n", + "del(pdf_validation)\n", + "del(pdf_test)\n", + "del(preprocessed_training_data)\n", + "del(preprocessed_validation_data)\n", + "del(preprocessed_test_data)" + ] + }, + { + "cell_type": "markdown", + "id": "3bfbfd83", + "metadata": {}, + "source": [ + "### GNN Data" + ] + }, + { + "cell_type": "markdown", + "id": "98e518c8", + "metadata": {}, + "source": [ + "#### Setting Vertex IDs\n", + "In order to create a graph, the different vertices need to be assigned unique vertex IDs. Additionally, the IDs needs to be consecutive and positive.\n", + "\n", + "There are three nodes groups here: Transactions, Users, and Merchants. \n", + "\n", + "This IDs are not used in training, just used for graph processing." + ] + }, + { + "cell_type": "code", + "execution_count": 66, + "id": "194a47d8", + "metadata": {}, + "outputs": [], + "source": [ + "# Use the same training data as used for XGBoost\n", + "data = data[training_idx]" + ] + }, + { + "cell_type": "code", + "execution_count": 67, + "id": "0ba0cb6b", + "metadata": {}, + "outputs": [], + "source": [ + "# a lot of process has occurred, sort the data and reset the index\n", + "data = data.sort_values(by=[COL_YEAR, COL_MONTH, COL_DAY, COL_TIME], ascending=False)\n", + "data.reset_index(inplace=True, drop=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 68, + "id": "2a75c92e", + "metadata": {}, + "outputs": [], + "source": [ + "# Each transaction gets a unique ID\n", + "COL_TRANSACTION_ID = 'Tx_ID'\n", + "COL_MERCHANT_ID = 'Merchant_ID'\n", + "COL_USER_ID = 'User_ID'\n", + "\n", + "# The number of transaction is the same as the size of the list, and hence the index value\n", + "data[COL_TRANSACTION_ID] = data.index" + ] + }, + { + "cell_type": "code", + "execution_count": 69, + "id": "472ea57c", + "metadata": {}, + "outputs": [], + "source": [ + "# Get the max transaction ID to compute first merchant 
ID\n", + "max_tx_id = data[COL_TRANSACTION_ID].max()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7e8ef04b", + "metadata": {}, + "outputs": [], + "source": [ + "# Convert Merchant strings to consecutive integers\n", + "merchant_name_to_id = dict((v, k) for k, v in data[COL_MERCHANT].unique().to_dict().items())\n", + "data[COL_MERCHANT_ID] = data[COL_MERCHANT].map(merchant_name_to_id) + (max_tx_id + 1)\n", + "data[COL_MERCHANT_ID].min(), data[COL_MERCHANT_ID].max()" + ] + }, + { + "cell_type": "code", + "execution_count": 71, + "id": "6937df18", + "metadata": {}, + "outputs": [], + "source": [ + "# Again, get the max merchant ID to compute first user ID\n", + "max_merchant_id = data[COL_MERCHANT_ID].max()" + ] + }, + { + "cell_type": "markdown", + "id": "b153352c", + "metadata": {}, + "source": [ + "##### NOTE: the 'User' and 'Card' columns of the original data were used to create the updated 'Card' column\n", + "* You can use user or card as nodes" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "030a2335", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# Convert Card to consecutive IDs\n", + "id_to_consecutive_id = dict((v, k) for k, v in data[COL_CARD].unique().to_dict().items())\n", + "data[COL_USER_ID] = data[COL_CARD].map(id_to_consecutive_id) + max_merchant_id + 1\n", + "data[COL_USER_ID].min(), data[COL_USER_ID].max()\n", + "\n", + "# id_to_consecutive_id = dict((v, k) for k, v in data[COL_USER].unique().to_dict().items())\n", + "# data[COL_USER_ID] = data[COL_USER].map(id_to_consecutive_id) + max_merchant_id + 1\n", + "# data[COL_USER_ID].min(), data[COL_USER_ID].max()" + ] + }, + { + "cell_type": "code", + "execution_count": 73, + "id": "28858422", + "metadata": {}, + "outputs": [], + "source": [ + "# Save the max user ID\n", + "max_user_id = data[COL_USER_ID].max()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "903e5115", + "metadata": {}, + "outputs": [], + "source": [ + "# 
Check that the transaction, merchant and user IDs are consecutive\n", + "id_range = data[COL_TRANSACTION_ID].min(), data[COL_TRANSACTION_ID].max()\n", + "print(f'Transaction ID range {id_range}')\n", + "id_range = data[COL_MERCHANT_ID].min(), data[COL_MERCHANT_ID].max()\n", + "print(f'Merchant ID range {id_range}')\n", + "id_range = data[COL_USER_ID].min(), data[COL_USER_ID].max()\n", + "print(f'User ID range {id_range}')" + ] + }, + { + "cell_type": "code", + "execution_count": 75, + "id": "f2d0dfde", + "metadata": {}, + "outputs": [], + "source": [ + "# Sanity checks\n", + "assert( data[COL_TRANSACTION_ID].max() == data[COL_MERCHANT_ID].min() - 1)\n", + "assert( data[COL_MERCHANT_ID].max() == data[COL_USER_ID].min() - 1)\n", + "assert(len(data[COL_USER_ID].unique()) == (data[COL_USER_ID].max() - data[COL_USER_ID].min() + 1))\n", + "assert(len(data[COL_MERCHANT_ID].unique()) == (data[COL_MERCHANT_ID].max() - data[COL_MERCHANT_ID].min() + 1))\n", + "assert(len(data[COL_TRANSACTION_ID].unique()) == (data[COL_TRANSACTION_ID].max() - data[COL_TRANSACTION_ID].min() + 1))" + ] + }, + { + "cell_type": "markdown", + "id": "0d9c3df3-a5be-4899-8bf9-6152aca114c7", + "metadata": {}, + "source": [ + "### Write out the data for GNN" + ] + }, + { + "cell_type": "markdown", + "id": "c2b86862-d129-4ece-a60d-dc798f3a68b5", + "metadata": {}, + "source": [ + "#### Create the Graph Edge Data file \n", + "The file is in COO (coordinate) format: one row per edge, with source, destination, and weight columns." + ] + }, + { + "cell_type": "code", + "execution_count": 76, + "id": "b288c5a7-20dd-40ff-b0eb-7a5895bcc464", + "metadata": {}, + "outputs": [], + "source": [ + "COL_GRAPH_SRC = 'src'\n", + "COL_GRAPH_DST = 'dst'\n", + "COL_GRAPH_WEIGHT = 'wgt'\n", + "\n", + "# Users to Transactions\n", + "U_2_T = cudf.DataFrame()\n", + "U_2_T[COL_GRAPH_SRC] = data[COL_USER_ID]\n", + "U_2_T[COL_GRAPH_DST] = data[COL_TRANSACTION_ID]\n", + "if make_undirected:\n", + " T_2_U = cudf.DataFrame()\n", + " T_2_U[COL_GRAPH_SRC] = data[COL_TRANSACTION_ID]\n", + " T_2_U[COL_GRAPH_DST] = 
data[COL_USER_ID]\n", + " U_2_T = cudf.concat([U_2_T, T_2_U])\n", + " del T_2_U\n" + ] + }, + { + "cell_type": "code", + "execution_count": 77, + "id": "a970747d-07a2-43b3-b39c-19a0196fa5b1", + "metadata": {}, + "outputs": [], + "source": [ + "# Transactions to Merchants (stored with the merchant as source and the transaction as destination)\n", + "T_2_M = cudf.DataFrame()\n", + "T_2_M[COL_GRAPH_SRC] = data[COL_MERCHANT_ID]\n", + "T_2_M[COL_GRAPH_DST] = data[COL_TRANSACTION_ID]\n", + "\n", + "if make_undirected:\n", + " M_2_T = cudf.DataFrame()\n", + " M_2_T[COL_GRAPH_SRC] = data[COL_TRANSACTION_ID]\n", + " M_2_T[COL_GRAPH_DST] = data[COL_MERCHANT_ID]\n", + " T_2_M = cudf.concat([T_2_M, M_2_T])\n", + " del M_2_T" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "80e704fd-ae9f-45b1-ad56-bdc0b743d09f", + "metadata": {}, + "outputs": [], + "source": [ + "Edge = cudf.concat([U_2_T, T_2_M])\n", + "Edge[COL_GRAPH_WEIGHT] = 0.0\n", + "len(Edge)" + ] + }, + { + "cell_type": "code", + "execution_count": 79, + "id": "c74572f6-ff6e-4c8f-803e-0ae2c0587c58", + "metadata": {}, + "outputs": [], + "source": [ + "# Now write out the data\n", + "out_path = os.path.join(tabformer_gnn, 'edges.csv')\n", + "\n", + "if not os.path.exists(os.path.dirname(out_path)):\n", + " os.makedirs(os.path.dirname(out_path))\n", + " \n", + "Edge.to_csv(out_path, header=False, index=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 80, + "id": "3dd3ff45-3796-4069-9e3a-587743c4e1e0", + "metadata": {}, + "outputs": [], + "source": [ + "del(Edge)\n", + "del(U_2_T)\n", + "del(T_2_M)" + ] + }, + { + "cell_type": "markdown", + "id": "ed00c481-1737-4152-9d23-f3cb24f2adcd", + "metadata": {}, + "source": [ + "### Now the feature data\n", + "Feature data needs to be sorted so that the row index corresponds to the node ID.\n", + "\n", + "The data comprises three sets of features:\n", + "* Transactions\n", + "* Users\n", + "* Merchants" + ] + }, + { + "cell_type": "markdown", + "id": "805c9d23", + "metadata": {}, + "source": [ + 
"#### To get feature vectors of Transaction nodes, transform the training data using pre-fitted transformer" + ] + }, + { + "cell_type": "code", + "execution_count": 81, + "id": "584fe9bf", + "metadata": {}, + "outputs": [], + "source": [ + "node_feature_df = pd.DataFrame(\n", + " loaded_transformer.transform(\n", + " data[predictor_columns].to_pandas()\n", + " ),\n", + " columns=columns_of_transformed_data).astype(type_mapping)\n", + "\n", + "node_feature_df[COL_FRAUD] = data[COL_FRAUD].to_pandas()" + ] + }, + { + "cell_type": "markdown", + "id": "55aa8f86", + "metadata": {}, + "source": [ + "#### For graph nodes associated with merchant and user, add feature vectors of zeros" + ] + }, + { + "cell_type": "code", + "execution_count": 82, + "id": "b35f9f5b", + "metadata": {}, + "outputs": [], + "source": [ + "# Number of graph nodes for users and merchants \n", + "nr_users_and_merchant_nodes = max_user_id - max_tx_id" + ] + }, + { + "cell_type": "code", + "execution_count": 83, + "id": "b5d312bd", + "metadata": {}, + "outputs": [], + "source": [ + "if not spread_features:\n", + " # Create feature vector of all zeros for each user and merchant node\n", + " empty_feature_df = cudf.DataFrame(\n", + " columns=columns_of_transformed_data + target_column,\n", + " dtype='int8', \n", + " index=range(nr_users_and_merchant_nodes)\n", + " )\n", + " empty_feature_df = empty_feature_df.fillna(0)\n", + " empty_feature_df=empty_feature_df.astype(type_mapping)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 84, + "id": "a72d3ea5-e04f-4af1-a0e0-09964555c1ed", + "metadata": {}, + "outputs": [], + "source": [ + "if not spread_features:\n", + " # Concatenate transaction features followed by features for merchants and user nodes\n", + " node_feature_df = pd.concat([node_feature_df, empty_feature_df.to_pandas()]).astype(type_mapping)" + ] + }, + { + "cell_type": "code", + "execution_count": 85, + "id": "a364d173", + "metadata": {}, + "outputs": [], + "source": [ + "# User 
specific columns\n", + "if spread_features:\n", + " user_specific_columns = [COL_CARD, COL_CHIP]\n", + " user_specific_columns_of_transformed_data = []\n", + "\n", + " for col in node_feature_df.columns:\n", + " if col.split('_')[0] in user_specific_columns:\n", + " user_specific_columns_of_transformed_data.append(col)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 86, + "id": "92d88c2f", + "metadata": {}, + "outputs": [], + "source": [ + "# Merchant specific columns\n", + "if spread_features:\n", + " merchant_specific_columns = [COL_MERCHANT, COL_CITY, COL_ZIP, COL_MCC]\n", + " merchant_specific_columns_of_transformed_data = []\n", + " \n", + " for col in node_feature_df.columns:\n", + " if col.split('_')[0] in merchant_specific_columns:\n", + " merchant_specific_columns_of_transformed_data.append(col)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 87, + "id": "f62755ae", + "metadata": {}, + "outputs": [], + "source": [ + "# Transaction specific columns\n", + "if spread_features:\n", + " transaction_specific_columns = list(\n", + " set(numerical_predictors).union(nominal_predictors)\n", + " - set(user_specific_columns).union(merchant_specific_columns))\n", + " transaction_specific_columns_of_transformed_data = []\n", + " \n", + " for col in node_feature_df.columns:\n", + " if col.split('_')[0] in transaction_specific_columns:\n", + " transaction_specific_columns_of_transformed_data.append(col) " + ] + }, + { + "cell_type": "markdown", + "id": "d12061da", + "metadata": {}, + "source": [ + "#### Construct feature vector for merchants" + ] + }, + { + "cell_type": "code", + "execution_count": 88, + "id": "de484a27", + "metadata": {}, + "outputs": [], + "source": [ + "if spread_features:\n", + " # Find indices of unique merchants\n", + " idx_df = cudf.DataFrame()\n", + " idx_df[COL_MERCHANT_ID] = data[COL_MERCHANT_ID]\n", + " idx_df = idx_df.sort_values(by=COL_MERCHANT_ID)\n", + " idx_df = idx_df.drop_duplicates(subset=COL_MERCHANT_ID)\n", + " 
assert((data.iloc[idx_df.index][COL_MERCHANT_ID] == idx_df[COL_MERCHANT_ID]).all())" + ] + }, + { + "cell_type": "code", + "execution_count": 89, + "id": "5be790eb", + "metadata": {}, + "outputs": [], + "source": [ + "if spread_features:\n", + " # Copy merchant specific columns, and set the rest to zero\n", + " merchant_specific_feature_df = node_feature_df.iloc[idx_df.index.to_numpy()]\n", + " merchant_specific_feature_df.\\\n", + " loc[:, \n", + " transaction_specific_columns_of_transformed_data +\n", + " user_specific_columns_of_transformed_data] = 0.0\n" + ] + }, + { + "cell_type": "code", + "execution_count": 90, + "id": "576091c6", + "metadata": {}, + "outputs": [], + "source": [ + "if spread_features:\n", + " # Find indices of unique users\n", + " idx_df = cudf.DataFrame()\n", + " idx_df[COL_USER_ID] = data[COL_USER_ID]\n", + " idx_df = idx_df.sort_values(by=COL_USER_ID)\n", + " idx_df = idx_df.drop_duplicates(subset=COL_USER_ID)\n", + " assert((data.iloc[idx_df.index][COL_USER_ID] == idx_df[COL_USER_ID]).all())" + ] + }, + { + "cell_type": "code", + "execution_count": 91, + "id": "aec23ee5", + "metadata": {}, + "outputs": [], + "source": [ + "if spread_features:\n", + " # Copy user specific columns, and set the rest to zero\n", + " user_specific_feature_df = node_feature_df.iloc[idx_df.index.to_numpy()]\n", + " user_specific_feature_df.\\\n", + " loc[:,\n", + " transaction_specific_columns_of_transformed_data +\n", + " merchant_specific_columns_of_transformed_data] = 0.0 " + ] + }, + { + "cell_type": "code", + "execution_count": 92, + "id": "8296a341", + "metadata": {}, + "outputs": [], + "source": [ + "# Concatenate features of node, user and merchant\n", + "if spread_features:\n", + " \n", + " node_feature_df[merchant_specific_columns_of_transformed_data] = 0.0\n", + " node_feature_df[user_specific_columns_of_transformed_data] = 0.0\n", + " node_feature_df = pd.concat(\n", + " [node_feature_df, merchant_specific_feature_df, user_specific_feature_df]\n", + 
" ).astype(type_mapping)\n", + " \n", + " # features to save\n", + " node_feature_df = node_feature_df[\n", + " transaction_specific_columns_of_transformed_data +\n", + " merchant_specific_columns_of_transformed_data +\n", + " user_specific_columns_of_transformed_data + [COL_FRAUD]]\n" + ] + }, + { + "cell_type": "code", + "execution_count": 93, + "id": "527f6ea8", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# target labels to save\n", + "label_df = node_feature_df[[COL_FRAUD]]" + ] + }, + { + "cell_type": "code", + "execution_count": 94, + "id": "15e1cba8", + "metadata": {}, + "outputs": [], + "source": [ + "# Remove target label from feature vectors\n", + "_ = node_feature_df.drop(columns=[COL_FRAUD], inplace=True)" + ] + }, + { + "cell_type": "markdown", + "id": "310d9500", + "metadata": {}, + "source": [ + "#### Write out node features and target labels" + ] + }, + { + "cell_type": "code", + "execution_count": 95, + "id": "139bfd9f", + "metadata": {}, + "outputs": [], + "source": [ + "# Write node target label to csv file\n", + "out_path = os.path.join(tabformer_gnn, 'labels.csv')\n", + "\n", + "if not os.path.exists(os.path.dirname(out_path)):\n", + " os.makedirs(os.path.dirname(out_path))\n", + "\n", + "label_df.to_csv(out_path, header=False, index=False)\n", + "# label_df.to_parquet(out_path, index=False, compression='gzip')" + ] + }, + { + "cell_type": "code", + "execution_count": 96, + "id": "b8fe801e", + "metadata": {}, + "outputs": [], + "source": [ + "# Write node features to csv file\n", + "out_path = os.path.join(tabformer_gnn, 'features.csv')\n", + "\n", + "if not os.path.exists(os.path.dirname(out_path)):\n", + " os.makedirs(os.path.dirname(out_path))\n", + "node_feature_df[columns_of_transformed_data].to_csv(out_path, header=True, index=False)\n", + "# node_feature_df.to_parquet(out_path, index=False, compression='gzip')" + ] + }, + { + "cell_type": "code", + "execution_count": 97, + "id": "fbe75d91", + "metadata": {}, + "outputs": 
[], + "source": [ + "# Delete dataFrames\n", + "del data\n", + "del node_feature_df\n", + "del label_df\n", + "\n", + "if spread_features:\n", + " del merchant_specific_feature_df\n", + " del user_specific_feature_df\n", + "else:\n", + " del empty_feature_df" + ] + }, + { + "cell_type": "markdown", + "id": "2c6afd9b", + "metadata": {}, + "source": [ + "#### Number of transaction nodes in training data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2a5f5bd1", + "metadata": {}, + "outputs": [], + "source": [ + "# Number of transaction nodes, needed for GNN training\n", + "nr_transaction_nodes = max_tx_id + 1\n", + "nr_transaction_nodes" + ] + }, + { + "cell_type": "markdown", + "id": "275bfc8b", + "metadata": {}, + "source": [ + "#### Maximum number of cards per user" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "867661d9", + "metadata": {}, + "outputs": [], + "source": [ + "# Max number of cards per user, needed for inference\n", + "max_nr_cards_per_user" + ] + }, + { + "cell_type": "markdown", + "id": "cf5434a7", + "metadata": {}, + "source": [ + "#### Save variable for training and inference" + ] + }, + { + "cell_type": "code", + "execution_count": 100, + "id": "9d741c6c", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "variables_to_save = {\n", + " k: v for k, v in globals().items() if isinstance(v, (str, int)) and k.startswith('COL_')}" + ] + }, + { + "cell_type": "code", + "execution_count": 101, + "id": "86727cef", + "metadata": {}, + "outputs": [], + "source": [ + "variables_to_save['NUM_TRANSACTION_NODES'] = int(nr_transaction_nodes)\n", + "variables_to_save['MAX_NR_CARDS_PER_USER'] = int(max_nr_cards_per_user)" + ] + }, + { + "cell_type": "code", + "execution_count": 102, + "id": "6a59a5a7", + "metadata": {}, + "outputs": [], + "source": [ + "# Save the dictionary to a JSON file\n", + "\n", + "with open(os.path.join(tabformer_base_path, 'variables.json'), 'w') as json_file:\n", + " 
json.dump(variables_to_save, json_file, indent=4)" + ] + }, + { + "cell_type": "markdown", + "id": "fa2f6f28", + "metadata": {}, + "source": [ + "## That's it!\n", + "The data is now ready for processing\n", + "\n", + "## Copyright and License\n", + "
\n", + "Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.\n", + "\n", + "
\n", + "\n", + " Licensed under the Apache License, Version 2.0 (the \"License\");\n", + " you may not use this file except in compliance with the License.\n", + " You may obtain a copy of the License at\n", + " \n", + " http://www.apache.org/licenses/LICENSE-2.0\n", + " \n", + " Unless required by applicable law or agreed to in writing, software\n", + " distributed under the License is distributed on an \"AS IS\" BASIS,\n", + " WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + " See the License for the specific language governing permissions and\n", + " limitations under the License." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/ai-credit-fraud-workflow/notebooks/train_gnn_based_xgboost.ipynb b/ai-credit-fraud-workflow/notebooks/train_gnn_based_xgboost.ipynb new file mode 100644 index 0000000..268dec0 --- /dev/null +++ b/ai-credit-fraud-workflow/notebooks/train_gnn_based_xgboost.ipynb @@ -0,0 +1,1161 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Train a GNN-based XGBoost Model\n", + "#### Goals\n", + "* Train a GNN (GraphSAGE) model that produces node (transaction) embeddings.\n", + "* Use these node embeddings to train an XGBoost model.\n", + "* Save the trained GNN and XGBoost models for inference.\n", + "\n", + "__Prerequisite__: The preprocessing notebook must be executed before running this notebook." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Dataset names" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "# Names of the datasets to choose from\n", + "TABFORMER = \"TabFormer\"\n", + "SPARKOV = \"Sparkov\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Select the dataset to train the models on" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "__Note__: This notebook works for both the __TabFormer__ and __Sparkov__ datasets. \n", + "Make sure that the right dataset is selected.\n", + "For the TabFormer dataset, set\n", + "\n", + "```code\n", + " DATASET = TABFORMER\n", + "```\n", + "and for the Sparkov dataset, set\n", + "\n", + "```code\n", + " DATASET = SPARKOV\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "# Change this to either TABFORMER or SPARKOV\n", + "DATASET = TABFORMER" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "#### Import necessary libraries, packages, and functions" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "# General-purpose libraries and OS handling\n", + "import os\n", + "from typing import Tuple, Dict\n", + "import json\n", + "from collections import defaultdict\n", + "\n", + "# GPU-accelerated libraries (torch, cupy, cudf, rmm)\n", + "import torch\n", + "import cupy\n", + "import cudf\n", + "import rmm\n", + "from rmm.allocators.cupy import rmm_cupy_allocator\n", + "from rmm.allocators.torch import rmm_torch_allocator\n", + "\n", + "# Reinitialize RMM and set allocators to manage memory efficiently on GPU\n", + "rmm.reinitialize(devices=[0], pool_allocator=True, managed_memory=True)\n", + "cupy.cuda.set_allocator(rmm_cupy_allocator)\n", + "torch.cuda.memory.change_current_allocator(rmm_torch_allocator)\n", + 
"\n", + "# PyTorch and related libraries\n", + "import torch.nn.functional as F\n", + "import torch.nn as nn\n", + "\n", + "# PyTorch Geometric and cuGraph libraries for GNNs and graph handling\n", + "import cugraph_pyg\n", + "from cugraph_pyg.loader import NeighborLoader\n", + "import torch_geometric\n", + "from torch_geometric.nn import SAGEConv\n", + "\n", + "# Enable GPU memory spilling to CPU with cuDF to handle larger datasets\n", + "from cugraph.testing.mg_utils import enable_spilling # noqa: E402\n", + "enable_spilling()\n", + "\n", + "# XGBoost for machine learning model building\n", + "import xgboost as xgb\n", + "\n", + "# Numerical operations with cupy and numpy\n", + "import cupy as cp\n", + "import numpy as np\n", + "\n", + "# Machine learning metrics from sklearn\n", + "from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Some config parameters for neighborhood sampler and training" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "args = type('', (), {})()\n", + "\n", + "args.out_channels = 2\n", + "args.batch_size = 1024\n", + "args.fan_out = 10\n", + "args.use_cross_weights = True\n", + "args.cross_weights = None" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Path to pre-processed data and directory to save models" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "dateset_name_to_path= defaultdict(lambda: \"../data/TabFormer\")\n", + "\n", + "dateset_name_to_path['TabFormer'] = '../data/TabFormer'\n", + "dateset_name_to_path['Sparkov'] = '../data/Sparkov'\n", + "args.dataset_base_path = dateset_name_to_path[DATASET]\n", + "\n", + "args.dataset_root = os.path.join(args.dataset_base_path, 'gnn')\n", + "args.model_root_dir = os.path.join(args.dataset_base_path, 
'models')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Read the number of transaction nodes that was saved during preprocessing" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# The number of transaction nodes was saved in variables.json during preprocessing\n", + "with open(os.path.join(args.dataset_base_path, 'variables.json'), 'r') as json_file:\n", + " num_transaction_nodes = json.load(json_file)['NUM_TRANSACTION_NODES']\n", + "\n", + "num_transaction_nodes" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Define a GraphSAGE model" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "class GraphSAGE(torch.nn.Module):\n", + " \"\"\"\n", + " GraphSAGE model for graph-based learning.\n", + "\n", + " This model learns node embeddings by aggregating information from a node's \n", + " neighborhood using multiple graph convolutional layers.\n", + "\n", + " Parameters:\n", + " ----------\n", + " in_channels : int\n", + " The number of input features for each node.\n", + " hidden_channels : int\n", + " The number of hidden units in each layer, controlling the embedding dimension.\n", + " out_channels : int\n", + " The number of output features (or classes) for the final layer.\n", + " n_hops : int\n", + " The number of GraphSAGE layers (or hops) used to aggregate information \n", + " from neighboring nodes.\n", + " dropout_prob : float, optional (default=0.25)\n", + " The dropout probability applied to hidden activations during training for regularization.\n", + " \"\"\"\n", + " def __init__(self, in_channels, hidden_channels, out_channels, n_hops, dropout_prob=0.25):\n", + " super(GraphSAGE, self).__init__()\n", + "\n", + " # list of conv layers\n", + " self.convs = nn.ModuleList()\n", + " # add first conv layer to the list\n", + " self.convs.append(SAGEConv(in_channels, hidden_channels))\n", + " 
# add the remaining conv layers to the list\n", + " for _ in range(n_hops - 1):\n", + " self.convs.append(SAGEConv(hidden_channels, hidden_channels))\n", + " \n", + " # output layer\n", + " self.fc = nn.Linear(hidden_channels, out_channels) \n", + " # dropout probability used in forward(); previously the argument was ignored and 0.5 was hard-coded\n", + " self.dropout_prob = dropout_prob\n", + "\n", + " def forward(self, x, edge_index, return_hidden=False):\n", + "\n", + " for conv in self.convs:\n", + " x = conv(x, edge_index)\n", + " x = F.relu(x)\n", + " x = F.dropout(x, p=self.dropout_prob, training=self.training)\n", + " \n", + " if return_hidden:\n", + " return x\n", + " else:\n", + " return self.fc(x)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "#### Define a function to train the GraphSAGE model\n", + "__Note__: This function is called a few times if grid search is used to find better hyper-parameters." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "def train_gnn(model, loader, optimizer, criterion)->float:\n", + " \"\"\"\n", + " Trains the GraphSAGE model for one epoch.\n", + "\n", + " Parameters:\n", + " ----------\n", + " model : torch.nn.Module\n", + " The GNN model to be trained.\n", + " loader : cugraph_pyg.loader.NeighborLoader\n", + " NeighborLoader that provides batches of graph data for training.\n", + " optimizer : torch.optim.Optimizer\n", + " Optimizer used to update the model's parameters.\n", + " criterion : torch.nn.Module\n", + " Loss function used to calculate the difference between predictions and targets.\n", + "\n", + " Returns:\n", + " -------\n", + " float\n", + " The average training loss over all batches for this epoch.\n", + " \"\"\"\n", + " model.train()\n", + " total_loss = 0\n", + " batch_count = 0\n", + " for batch in loader:\n", + " batch_count += 1\n", + " optimizer.zero_grad()\n", + "\n", + " batch_size = batch.batch_size\n", + " out = model(batch.x[:,:].to(torch.float32), batch.edge_index)[:batch_size]\n", + " y = batch.y[:batch_size].view(-1).to(torch.long)\n", + " loss = 
criterion(out, y)\n", + " loss.backward()\n", + "\n", + " optimizer.step()\n", + " total_loss += loss.item()\n", + " return total_loss / batch_count\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "#### Define a function to extract node (transaction) embeddings from the second-to-last layer of the GraphSAGE model\n" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "def extract_embeddings(model, loader)->Tuple[torch.Tensor, torch.Tensor]:\n", + " \"\"\"\n", + " Extracts node embeddings produced by the GraphSAGE model.\n", + "\n", + " Parameters:\n", + " ----------\n", + " model : torch.nn.Module\n", + " The model used to generate embeddings, typically a pre-trained neural network.\n", + " loader : cugraph_pyg.loader.NeighborLoader\n", + " NeighborLoader that provides batches of data for embedding extraction.\n", + "\n", + " Returns:\n", + " -------\n", + " Tuple[torch.Tensor, torch.Tensor]\n", + " A tuple containing two tensors:\n", + " - embeddings: A tensor containing embeddings for each input sample in the dataset.\n", + " - labels: A tensor containing the corresponding labels for each sample.\n", + " \"\"\"\n", + " model.eval()\n", + " embeddings = []\n", + " labels = []\n", + " with torch.no_grad():\n", + " for batch in loader:\n", + " batch_size = batch.batch_size\n", + " hidden = model(batch.x[:,:].to(torch.float32), batch.edge_index, return_hidden=True)[:batch_size]\n", + " embeddings.append(hidden) # Keep embeddings on GPU\n", + " labels.append(batch.y[:batch_size].view(-1).to(torch.long))\n", + " embeddings = torch.cat(embeddings, dim=0) # Concatenate embeddings on GPU\n", + " labels = torch.cat(labels, dim=0) # Concatenate labels on GPU\n", + " return embeddings, labels\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "#### Define a function to evaluate the GraphSAGE model\n" + ] + }, + { + "cell_type": 
"code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "def evaluate_gnn(model, loader) -> float:\n", + " \"\"\"\n", + " Evaluates the performance of the GraphSAGE model.\n", + "\n", + " Parameters:\n", + " ----------\n", + " model : torch.nn.Module\n", + " The GNN model to be evaluated.\n", + " loader : cugraph_pyg.loader.NeighborLoader\n", + " NeighborLoader that provides batches of data for evaluation.\n", + "\n", + " Returns:\n", + " -------\n", + " float\n", + " The F1 score computed over the predictions from all batches.\n", + " \"\"\"\n", + "\n", + " model.eval()\n", + " all_preds = []\n", + " all_labels = []\n", + " total_pos_seen = 0\n", + " with torch.no_grad():\n", + " for batch in loader:\n", + "\n", + " batch_size = batch.batch_size\n", + " out = model(batch.x[:,:].to(torch.float32), batch.edge_index)[:batch_size]\n", + " preds = out.argmax(dim=1)\n", + " y = batch.y[:batch_size].view(-1).to(torch.long)\n", + " \n", + " all_preds.append(preds.cpu().numpy())\n", + " all_labels.append(y.cpu().numpy())\n", + " total_pos_seen += (y.cpu().numpy()==1).sum()\n", + "\n", + " all_preds = np.concatenate(all_preds)\n", + " all_labels = np.concatenate(all_labels)\n", + "\n", + " accuracy = accuracy_score(all_labels, all_preds)\n", + " precision = precision_score(all_labels, all_preds, zero_division=0)\n", + " recall = recall_score(all_labels, all_preds, zero_division=0)\n", + " f1 = f1_score(all_labels, all_preds, zero_division=0)\n", + " # roc_auc = roc_auc_score(all_labels, all_preds)\n", + "\n", + " print(f\"\\nGNN Model Evaluation:\")\n", + " print(f\"Accuracy: {accuracy:.4f}\")\n", + " print(f\"Precision: {precision:.4f}\")\n", + " print(f\"Recall: {recall:.4f}\")\n", + " print(f\"F1 Score: {f1:.4f}\")\n", + " # print(f\"ROC AUC: {roc_auc:.4f}\")\n", + " return f1\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Define a function to compute the validation loss of the GraphSAGE model" + ] + }, + { + "cell_type": 
"code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "def validation_loss(model, loader, criterion)->float:\n", + " \"\"\"\n", + " Computes the average validation loss for the GraphSAGE model.\n", + "\n", + " Parameters:\n", + " ----------\n", + " model : torch.nn.Module\n", + " The model for which the validation loss is calculated.\n", + " loader : cugraph_pyg.loader.NeighborLoader\n", + " NeighborLoader that provides batches of validation data.\n", + " criterion : torch.nn.Module\n", + " Loss function used to compute the loss between predictions and targets.\n", + "\n", + " Returns:\n", + " -------\n", + " float\n", + " The average validation loss over all batches.\n", + " \"\"\"\n", + " model.eval()\n", + " with torch.no_grad():\n", + " total_loss = 0\n", + " batch_count = 0\n", + " for batch in loader:\n", + " batch_count += 1\n", + " batch_size = batch.batch_size\n", + " out = model(batch.x[:,:].to(torch.float32), batch.edge_index)[:batch_size]\n", + " y = batch.y[:batch_size].view(-1).to(torch.long)\n", + " loss = criterion(out, y)\n", + " total_loss += loss.item()\n", + " return total_loss / batch_count\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "#### Define a function to train an XGBoost model" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "from torch.utils.dlpack import to_dlpack\n", + "\n", + "def train_xgboost(embeddings, labels)->xgb.Booster:\n", + " \"\"\"\n", + " Trains an XGBoost classifier on the provided embeddings and labels.\n", + "\n", + " Parameters:\n", + " ----------\n", + " embeddings : torch.Tensor\n", + " The input feature embeddings for transaction nodes.\n", + " labels : torch.Tensor\n", + " The target label (fraud or non-fraud) of each transaction, with the same length as the number of \n", + " rows in `embeddings`.\n", + "\n", + " Returns:\n", + " -------\n", + " xgboost.Booster\n", + " A 
trained XGBoost model fitted on the provided data.\n", + " \"\"\"\n", + "\n", + " labels_cudf = cudf.Series(cp.from_dlpack(to_dlpack(labels)))\n", + " embeddings_cudf = cudf.DataFrame(cp.from_dlpack(to_dlpack(embeddings)))\n", + "\n", + " # Convert data to DMatrix format for XGBoost on GPU\n", + " dtrain = xgb.DMatrix(embeddings_cudf, label=labels_cudf)\n", + "\n", + " # Set XGBoost parameters for GPU usage\n", + " param = {\n", + " 'max_depth': 6,\n", + " 'learning_rate': 0.2,\n", + " 'objective': 'binary:logistic', # Binary classification\n", + " 'eval_metric': 'logloss',\n", + " 'tree_method': 'hist', # Histogram-based tree construction\n", + " 'device': 'cuda' # Run training on the GPU\n", + " }\n", + "\n", + " # Train the XGBoost model\n", + " bst = xgb.train(param, dtrain, num_boost_round=100)\n", + " \n", + " return bst\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "parameters" + ] + }, + "source": [ + "\n", + "#### Define a function to evaluate the XGBoost model\n" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "from cuml.metrics import confusion_matrix\n", + "\n", + "def evaluate_xgboost(bst, embeddings, labels):\n", + " \"\"\"\n", + " Evaluates the performance of an XGBoost model by calculating different metrics.\n", + "\n", + " Parameters:\n", + " ----------\n", + " bst : xgboost.Booster\n", + " The trained XGBoost model to be evaluated.\n", + " embeddings : torch.Tensor\n", + " The input feature embeddings for transaction nodes.\n", + " labels : torch.Tensor\n", + " The target label (fraud or non-fraud) of each transaction, with the same length as the number of \n", + " rows in `embeddings`.\n", + " Returns:\n", + " -------\n", + " A tuple containing the F1 score, recall, precision, accuracy, and the confusion matrix\n", + " \"\"\"\n", + "\n", + " # Convert embeddings to cuDF DataFrame\n", + " embeddings_cudf = cudf.DataFrame(cp.from_dlpack(to_dlpack(embeddings)))\n", + " \n", + " # Create DMatrix for the test 
embeddings\n", + " dtest = xgb.DMatrix(embeddings_cudf)\n", + " \n", + " # Predict using XGBoost on GPU\n", + " preds = bst.predict(dtest)\n", + " pred_labels = (preds > 0.5).astype(int)\n", + "\n", + " # Move labels to CPU for evaluation\n", + " labels_cpu = labels.cpu().numpy()\n", + "\n", + " # Compute evaluation metrics\n", + " accuracy = accuracy_score(labels_cpu, pred_labels)\n", + " precision = precision_score(labels_cpu, pred_labels, zero_division=0)\n", + " recall = recall_score(labels_cpu, pred_labels, zero_division=0)\n", + " f1 = f1_score(labels_cpu, pred_labels, zero_division=0)\n", + " roc_auc = roc_auc_score(labels_cpu, preds)\n", + " conf_mat = confusion_matrix(labels.cpu().numpy(), pred_labels)\n", + " \n", + " return f1, recall, precision, accuracy, conf_mat" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Define a class to stop training once the model stops improving" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "class EarlyStopping:\n", + " \"\"\"\n", + " EarlyStopping class to halt training when a monitored metric stops improving.\n", + " \n", + " Parameters:\n", + " ----------\n", + " patience : int, optional (default=10)\n", + " The number of epochs with no improvement after which training will be stopped.\n", + " min_delta : float, optional (default=0)\n", + " The minimum change in the monitored metric to qualify as an improvement. 
\n", + " If the change is smaller than `min_delta`, it is considered as no improvement.\n", + " \"\"\"\n", + " def __init__(self, patience=10, min_delta=0):\n", + " \n", + " self.patience = patience\n", + " self.min_delta = min_delta\n", + " self.best_loss = float('inf')\n", + " self.counter = 0\n", + "\n", + " def check_early_stopping(self, val_loss):\n", + "\n", + " if self.best_loss - val_loss > self.min_delta:\n", + " self.best_loss = val_loss\n", + " self.counter = 0 # Reset counter if there's an improvement\n", + " else:\n", + " self.counter += 1 # Increment counter if no improvement\n", + " \n", + " if self.counter >= self.patience:\n", + " return True\n", + " return False\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Define a function to load data and create graph\n", + "* loads edges and create graph using cugraph-pyg\n", + "* loads preprocessed features associated with the graph nodes" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "def load_data(\n", + " dataset_root : str,\n", + " edge_filename: str = 'edges.csv',\n", + " label_filename: str = 'labels.csv',\n", + " node_feature_filename: str = 'features.csv',\n", + " has_edge_feature: bool = False,\n", + " edge_src_col: str = 'src',\n", + " edge_dst_col: str = 'dst',\n", + " edge_att_col: str = 'type'\n", + ") -> Tuple[\n", + " Tuple[torch_geometric.data.FeatureStore, torch_geometric.data.GraphStore],\n", + " Dict[str, torch.Tensor],\n", + " int,\n", + " int,\n", + "]:\n", + " # Load the Graph data\n", + " edge_path = os.path.join(dataset_root, edge_filename)\n", + " edge_data = cudf.read_csv(edge_path, header=None, names=[edge_src_col, edge_dst_col, edge_att_col], dtype=['int32','int32','float'])\n", + " \n", + " num_nodes = max(edge_data[edge_src_col].max(), edge_data[ edge_dst_col].max()) + 1 \n", + " src_tensor = torch.as_tensor(edge_data[edge_src_col], device='cuda')\n", + " 
dst_tensor = torch.as_tensor(edge_data[edge_dst_col], device='cuda')\n", + "\n", + " \n", + "\n", + " graph_store = cugraph_pyg.data.GraphStore()\n", + " graph_store[(\"n\", \"e\", \"n\"), \"coo\", False, (num_nodes, num_nodes)] = [src_tensor, dst_tensor] \n", + "\n", + " \n", + " edge_feature_store = None\n", + " if has_edge_feature:\n", + " from cugraph_pyg.data import TensorDictFeatureStore\n", + " edge_feature_store = TensorDictFeatureStore()\n", + " edge_attr = torch.as_tensor(edge_data[edge_att_col], device='cuda')\n", + " edge_feature_store[(\"n\", \"e\", \"n\"), \"rel\"] = edge_attr.unsqueeze(1)\n", + " \n", + " \n", + " del(edge_data)\n", + " \n", + " # load the label\n", + " label_path = os.path.join (dataset_root, label_filename)\n", + " label_data = cudf.read_csv(label_path, header=None, dtype=['int32'])\n", + " y_label_tensor = torch.as_tensor(label_data['0'], device='cuda')\n", + " num_classes = label_data['0'].unique().count()\n", + "\n", + " wt_data = None\n", + " if (args.use_cross_weights):\n", + " if (args.cross_weights is None):\n", + " num_labels_rows = label_data.size\n", + " counts = label_data.value_counts()\n", + " wt_data = torch.as_tensor(counts.sum()/counts, device='cuda', dtype=torch.float32)\n", + " wt_data = wt_data/wt_data.sum()\n", + "\n", + " if (num_classes > 2):\n", + " wt_data = wt_data.T\n", + " else:\n", + " wt_data = torch.as_tensor(args.cross_weights, device='cuda')\n", + "\n", + " del(label_data)\n", + " \n", + " # load the features\n", + " feature_path = os.path.join(dataset_root, node_feature_filename)\n", + " feature_data = cudf.read_csv(feature_path)\n", + " \n", + " feature_columns = feature_data.columns\n", + " \n", + " col_tensors = []\n", + " for c in feature_columns:\n", + " t = torch.as_tensor(feature_data[c].values, device='cuda')\n", + " col_tensors.append(t)\n", + "\n", + " x_feature_tensor = torch.stack(col_tensors).T\n", + "\n", + " \n", + " feature_store = cugraph_pyg.data.TensorDictFeatureStore()\n", + " 
feature_store[\"node\", \"x\"] = x_feature_tensor\n", + " feature_store[\"node\", \"y\"] = y_label_tensor\n", + "\n", + " num_features = len(feature_columns)\n", + " \n", + " return (\n", + " (feature_store, graph_store),\n", + " edge_feature_store,\n", + " num_nodes,\n", + " num_features,\n", + " num_classes,\n", + " wt_data,\n", + " )\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### Define a function to train the GraphSAGE model for particular values of hyper-parameters." + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "def train_model_with_config(params, verbose=False):\n", + "\n", + " data, ef_store, num_nodes, num_features, num_classes, cross_wt_data = load_data(args.dataset_root)\n", + " \n", + " num_folds = params['n_splits'] # Number of folds\n", + " fold_size = num_transaction_nodes // num_folds\n", + "\n", + " # Perform cross-validation\n", + " validation_losses = []\n", + " for k in range(num_folds):\n", + " training_nodes = torch.cat(\n", + " (\n", + " torch.arange(0, k * fold_size).unsqueeze(dim=0),\n", + " torch.arange((k+1) * fold_size, num_transaction_nodes).unsqueeze(dim=0)\n", + " ),\n", + " dim=1\n", + " ).squeeze(0)\n", + "\n", + " validation_nodes = torch.arange(k * fold_size, (k+1) * fold_size)\n", + " \n", + " # Create NeighborLoader for both training and testing (using cuGraph NeighborLoader)\n", + " train_loader = NeighborLoader(\n", + " data,\n", + " num_neighbors=[args.fan_out, args.fan_out],\n", + " batch_size=args.batch_size,\n", + " input_nodes= training_nodes,\n", + " shuffle=True\n", + " )\n", + "\n", + " # Use same graph but different seed nodes\n", + " validation_loader = NeighborLoader(\n", + " data,\n", + " num_neighbors=[args.fan_out, args.fan_out],\n", + " batch_size=args.batch_size,\n", + " input_nodes= validation_nodes,\n", + " shuffle=False\n", + " )\n", + " \n", + " device = torch.device('cuda' if 
torch.cuda.is_available() else 'cpu')\n", + " \n", + " # Define the model\n", + " model = GraphSAGE(\n", + " in_channels=num_features,\n", + " hidden_channels=params['hidden_channels'],\n", + " out_channels=args.out_channels,\n", + " n_hops=params['n_hops'],\n", + " dropout_prob=0.25).to(device)\n", + "\n", + "\n", + " # Define optimizer and loss function for GNN\n", + " optimizer = torch.optim.Adam(model.parameters(),\n", + " lr=params['learning_rate'],\n", + " weight_decay=params['weight_decay'])\n", + "\n", + " # criterion = torch.nn.CrossEntropyLoss(\n", + " # weight=cross_wt_data).to(device) # Weighted loss function\n", + " \n", + " criterion = torch.nn.CrossEntropyLoss(\n", + " weight=torch.tensor([0.1, 0.9], dtype=torch.float32)).to(device) # Weighted loss function\n", + "\n", + " # Set up the early stopping object\n", + " early_stopping = EarlyStopping(patience=3, min_delta=0.01)\n", + " \n", + " best_val_loss = float('inf')\n", + " num_epoch_for_best_loss = 0\n", + "\n", + " # Train the GNN model\n", + " for epoch in range(params['num_epochs']):\n", + " train_loss = train_gnn(model, train_loader, optimizer, criterion)\n", + " val_loss = validation_loss(model, validation_loader, criterion)\n", + " if verbose:\n", + " print(f\"Epoch {epoch+1}, Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}\")\n", + "\n", + " # Check early stopping criteria\n", + " if early_stopping.check_early_stopping(val_loss):\n", + " if verbose:\n", + " print(f\"Early stopping triggered at epoch {epoch+1}.\")\n", + " break\n", + "\n", + " # Save the best model based on validation loss\n", + " if val_loss < best_val_loss:\n", + " best_val_loss = val_loss\n", + " num_epoch_for_best_loss = epoch\n", + " # Save validation loss for the current fold\n", + " validation_losses.append(best_val_loss)\n", + " return np.mean(validation_losses), model, num_epoch_for_best_loss" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "#### Parameter grid to search for 
better hyper-parameters\n", + "\n", + "__Note__: To make the notebook run faster, the grid search is commented out." + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "## Uncomment this cell to find the best hyperparameters in the parameter grid\n", + "# from sklearn.model_selection import ParameterGrid\n", + "# # Define the hyperparameter grid\n", + "# param_grid = {\n", + "# 'n_splits': [5],\n", + "# 'n_hops': [1, 2],\n", + "# 'learning_rate': [0.005, 0.01],\n", + "# 'hidden_channels': [32, 64],\n", + "# 'num_epochs': [8, 16],\n", + "# 'weight_decay': [1e-5],\n", + " \n", + "# }\n", + "# grid = list(ParameterGrid(param_grid))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Search for better hyper-parameters" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "## Uncomment this cell to find the best hyperparameters in the parameter grid\n", + "# best_val_loss = float('inf')\n", + "# epoch = 0\n", + "# best_params = None\n", + "# for params in grid:\n", + "# val_loss, _, epoch = train_model_with_config(params, verbose=False)\n", + "# if val_loss < best_val_loss:\n", + "# best_params = params\n", + "# best_val_loss = val_loss" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "# best_params" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [], + "source": [ + "# Comment out this cell to train on a new dataset \n", + "best_params = {\n", + " 'n_hops': 1,\n", + " 'learning_rate': 0.005,\n", + " 'hidden_channels': 32,\n", + " 'num_epochs': 16,\n", + " 'weight_decay': 1e-5, \n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Train and save the GraphSAGE model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + 
"data, ef_store, num_nodes, num_features, num_classes, cross_wt_data = load_data(args.dataset_root)\n", + "\n", + "# Train on entire dataset\n", + "train_loader = NeighborLoader(\n", + " data,\n", + " num_neighbors=[args.fan_out, args.fan_out],\n", + " batch_size=args.batch_size,\n", + " input_nodes= torch.arange(num_transaction_nodes),\n", + " shuffle=True\n", + ")\n", + "\n", + "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n", + " \n", + "# Define the model\n", + "model = GraphSAGE(\n", + " in_channels=num_features,\n", + " hidden_channels=best_params['hidden_channels'],\n", + " out_channels=args.out_channels,\n", + " n_hops=best_params['n_hops'],\n", + " dropout_prob=0.25).to(device)\n", + "\n", + "\n", + "# Define optimizer and loss function for GNN\n", + "optimizer = torch.optim.Adam(model.parameters(),\n", + " lr=best_params['learning_rate'],\n", + " weight_decay=best_params['weight_decay'])\n", + "\n", + "\n", + "criterion = torch.nn.CrossEntropyLoss(\n", + " weight=torch.tensor([0.1, 0.9], dtype=torch.float32)).to(device) # Weighted loss function\n", + "\n", + "# Set up the early stopping object\n", + "early_stopping = EarlyStopping(patience=3, min_delta=0.01)\n", + "\n", + "best_train_loss = float('inf')\n", + "\n", + "# Train the GNN model\n", + "\n", + "for epoch in range(best_params['num_epochs']):\n", + " train_loss = train_gnn(model, train_loader, optimizer, criterion)\n", + " \n", + " # Check early stopping criteria (on the training loss, since no validation split is held out here)\n", + " if early_stopping.check_early_stopping(train_loss):\n", + " print(f\"Early stopping triggered at epoch {epoch+1}.\")\n", + " break\n", + "\n", + " # Save the best model based on training loss\n", + " if train_loss < best_train_loss:\n", + " best_train_loss = train_loss\n", + " if not os.path.exists(args.model_root_dir):\n", + " os.makedirs(args.model_root_dir)\n", + " torch.save(model, os.path.join(args.model_root_dir, 'node_embedder.pth'))\n", + "\n", + " print(f\"Model saved at epoch {epoch+1} with 
training loss {best_train_loss:.4f}.\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Train the XGBoost model based on embeddings produced by the GraphSAGE model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# NeighborLoader for training data\n", + "\n", + "data, ef_store, num_nodes, num_features, num_classes, cross_wt_data = load_data(args.dataset_root)\n", + "\n", + "train_loader = NeighborLoader(\n", + " data,\n", + " num_neighbors=[args.fan_out, args.fan_out],\n", + " batch_size=args.batch_size,\n", + " input_nodes= torch.arange(num_transaction_nodes),\n", + " shuffle=True\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [], + "source": [ + "# Set the device to GPU if available; otherwise, default to CPU\n", + "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n", + "\n", + "# Extract embeddings from the second-to-last layer and keep them on GPU\n", + "embeddings, labels = extract_embeddings(model, train_loader)\n", + "\n", + "# Train an XGBoost model on the extracted embeddings (on GPU)\n", + "bst = train_xgboost(embeddings.to(device), labels.to(device))\n", + " \n", + "xgb_model_path = os.path.join(args.model_root_dir, 'embedding_based_xgb_model.json')\n", + "\n", + "if not os.path.exists(os.path.dirname(xgb_model_path)):\n", + " os.makedirs(os.path.dirname(xgb_model_path))\n", + "\n", + "bst.save_model(xgb_model_path)\n", + " " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Evaluate the model on unseen data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Load and prepare test data\n" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n", + "test_path = 
os.path.join(args.dataset_base_path, 'xgb/test.csv')\n", + "test_data = cudf.read_csv(test_path)\n", + "\n", + "X = torch.tensor(test_data.iloc[:, :-1].values).to(torch.float32)\n", + "y = torch.tensor(test_data.iloc[:, -1].values).to(torch.long)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Extract embeddings of the transactions using the GraphSAGE model" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "model.eval()\n", + "f1_value = 0.0\n", + "with torch.no_grad():\n", + " test_embeddings = model(\n", + " X.to(device), torch.tensor([[], []], dtype=torch.int).to(device), return_hidden=True)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Evaluate the XGBoost model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "f1, recall, precision, accuracy, conf_mat = evaluate_xgboost(bst, test_embeddings, y)\n", + "\n", + "print(f\"\\nXGBoost Evaluation:\")\n", + "print(f\"Accuracy: {accuracy:.4f}\")\n", + "print(f\"Precision: {precision:.4f}\")\n", + "print(f\"Recall: {recall:.4f}\")\n", + "print(f\"F1 Score: {f1:.4f}\")\n", + "print('Confusion Matrix:', conf_mat)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Copyright and License\n", + "
\n", + "Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.\n", + "\n", + "
\n", + "\n", + " Licensed under the Apache License, Version 2.0 (the \"License\");\n", + " you may not use this file except in compliance with the License.\n", + " You may obtain a copy of the License at\n", + " \n", + " http://www.apache.org/licenses/LICENSE-2.0\n", + " \n", + " Unless required by applicable law or agreed to in writing, software\n", + " distributed under the License is distributed on an \"AS IS\" BASIS,\n", + " WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + " See the License for the specific language governing permissions and\n", + " limitations under the License." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "simple_env", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.10" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/ai-credit-fraud-workflow/notebooks/train_xgboost.ipynb b/ai-credit-fraud-workflow/notebooks/train_xgboost.ipynb new file mode 100644 index 0000000..cee3319 --- /dev/null +++ b/ai-credit-fraud-workflow/notebooks/train_xgboost.ipynb @@ -0,0 +1,501 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Train an XGBoost model\n", + "#### Goals\n", + "\n", + "* Build only an XGBoost model without leveraging a GNN.\n", + "* Establish a baseline performance using the XGBoost model.\n", + "\n", + "__NOTE__: This XGBoost model does not leverage embeddings from the GNN (GraphSAGE) model." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Dataset names" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "# Names of the datasets to choose from\n", + "TABFORMER = \"TabFormer\"\n", + "SPARKOV = \"Sparkov\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Select the dataset to train the model on\n", + "__Note__: This notebook works for both the __TabFormer__ and __Sparkov__ datasets. \n", + "Make sure that the right dataset is selected.\n", + "For the TabFormer dataset, set\n", + "\n", + "```code\n", + " DATASET = TABFORMER\n", + "```\n", + "and for the Sparkov dataset, set\n", + "\n", + "```code\n", + " DATASET = SPARKOV\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "# Change this to either TABFORMER or SPARKOV\n", + "DATASET = TABFORMER" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Import necessary libraries, packages, and functions" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "import os\n", + "from collections import defaultdict\n", + "\n", + "import cudf\n", + "import cupy\n", + "import xgboost as xgb\n", + "import matplotlib.pyplot as plt\n", + "from sklearn.metrics import auc, f1_score, precision_score, recall_score\n", + "\n", + "from cuml.metrics import confusion_matrix, precision_recall_curve, roc_auc_score\n", + "from cuml.metrics.accuracy import accuracy_score" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Path to pre-processed data and directory to save models" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "dateset_name_to_path= defaultdict(lambda: \"../data/TabFormer\")\n", + "\n", + "dateset_name_to_path['TabFormer'] = '../data/TabFormer'\n", + 
"dateset_name_to_path['Sparkov'] = '../data/Sparkov'\n", + "dataset_dir = dateset_name_to_path[DATASET]\n", + "xgb_data_dir = os.path.join(dataset_dir, 'xgb')\n", + "models_dir = os.path.join(dataset_dir, 'models')\n", + "model_file_name = 'xgboost_model.json'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Load and prepare training and validation data" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "train_data_path = os.path.join(xgb_data_dir, \"training.csv\")\n", + "df = cudf.read_csv(train_data_path)\n", + "\n", + "# Target column\n", + "target_col_name = df.columns[-1]\n", + "\n", + "# Split the dataframe into features (X) and labels (y)\n", + "y = df[target_col_name]\n", + "X = df.drop(target_col_name, axis=1)\n", + "\n", + "# Split data into training and validation sets\n", + "from cuml.model_selection import train_test_split\n", + "X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)\n", + "\n", + "# Convert the training and validation data to DMatrix\n", + "dtrain = xgb.DMatrix(data=X_train, label=y_train)\n", + "deval = xgb.DMatrix(data=X_val, label=y_val)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Parameter grid to search for the best hyper-parameters for the input data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import itertools\n", + "\n", + "# Define the parameter grid for manual search\n", + "param_grid = {\n", + " 'max_depth': [5, 6],\n", + " 'learning_rate': [0.3, 0.4, 0.45],\n", + " 'n_estimators': [100, 150],\n", + " 'gamma': [0, 0.1],\n", + "}\n", + "\n", + "# Generate all combinations of hyperparameters\n", + "param_combinations = list(itertools.product(*param_grid.values()))\n", + "\n", + "# Print the number of hyperparameter combinations (optional)\n", + "print(\"Total number of parameter combinations:\", 
len(param_combinations))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Grid search for the best hyperparameters" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "best_score = float(\"inf\") # Initialize best score\n", + "best_params = None # To store best hyperparameters\n", + "\n", + "for params_comb in param_combinations:\n", + " \n", + " # Create a dictionary of parameters\n", + " params = {\n", + " 'max_depth': params_comb[0],\n", + " 'learning_rate': params_comb[1],\n", + " 'gamma': params_comb[3],\n", + " 'eval_metric': 'logloss',\n", + " 'objective': 'binary:logistic', # For binary classification\n", + " 'tree_method': 'hist', # GPU support\n", + " 'device': 'cuda'\n", + " }\n", + "\n", + " # Train the model using xgb.train and the Booster\n", + " evals = [(dtrain, 'train'), (deval, 'eval')]\n", + " bst = xgb.train(params, dtrain, num_boost_round=params_comb[2], evals=evals, \n", + " early_stopping_rounds=10, verbose_eval=False)\n", + " \n", + " # Get the evaluation score (logloss) on the validation set\n", + " score = bst.best_score # The logloss score (or use other eval_metric)\n", + "\n", + " # Update the best parameters if the current model is better\n", + " if score < best_score:\n", + " best_score = score\n", + " best_params = params\n", + " best_num_boost_round = bst.best_iteration + 1 # best_iteration is zero-based" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "best_params, best_score, best_num_boost_round" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Train the model with the best hyperparameters" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "# Train the final model using the best parameters and best number of boosting rounds\n", + "dtrain = xgb.DMatrix(data=X, label=y)\n", + "final_model = xgb.train(best_params, dtrain, 
num_boost_round=best_num_boost_round)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# Save the best model, creating the output directory if it does not exist\n", + "os.makedirs(models_dir, exist_ok=True)\n", + "final_model.save_model(os.path.join(models_dir, model_file_name))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "___\n", + "### Evaluate the model on the same unseen data used for testing the GNN-based XGBoost model" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Load the saved model" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# Load the model from the file\n", + "best_model_loaded = xgb.Booster()\n", + "best_model_loaded.load_model(os.path.join(models_dir, model_file_name))\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Load and prepare unseen test data" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "test_data_path = os.path.join(xgb_data_dir, \"test.csv\")\n", + "\n", + "test_df = cudf.read_csv(test_data_path)\n", + "\n", + "# Build the DMatrix from the feature columns only (drop the target column)\n", + "dnew = xgb.DMatrix(test_df.drop(target_col_name, axis=1))\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Predict targets" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# Predict fraud probabilities and convert them to class labels at a 0.5 threshold\n", + "y_pred_prob = best_model_loaded.predict(dnew)\n", + "y_pred = (y_pred_prob >= 0.5).astype(int)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Compute metrics to evaluate model performance" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "y_test = test_df[target_col_name].values\n", + "\n", + "# Accuracy\n", + "accuracy = 
accuracy_score(y_test, y_pred)\n", + "print(f'Accuracy: {accuracy:.4f}')\n", + "\n", + "# Confusion Matrix\n", + "conf_mat = confusion_matrix(y_test, y_pred)\n", + "print('Confusion Matrix:')\n", + "print(conf_mat)\n", + "\n", + "# ROC AUC Score\n", + "r_auc = roc_auc_score(y_test, y_pred_prob)\n", + "print(f'ROC AUC Score: {r_auc:.4f}')\n", + "\n", + "# Convert the labels from a CuPy array to a NumPy array for the remaining metrics\n", + "y_test = cupy.asnumpy(y_test)\n", + "# Precision\n", + "precision = precision_score(y_test, y_pred)\n", + "print(f'Precision: {precision:.4f}')\n", + "\n", + "# Recall\n", + "recall = recall_score(y_test, y_pred)\n", + "print(f'Recall: {recall:.4f}')\n", + "\n", + "# F1 Score\n", + "f1 = f1_score(y_test, y_pred)\n", + "print(f'F1 Score: {f1:.4f}')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Precision-Recall curve\n", + "* A Precision-Recall curve shows the trade-off between precision and recall at various classification thresholds, which helps assess performance, especially on imbalanced data." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# Compute precision, recall, and thresholds (these arrays replace the scalar precision and recall above)\n", + "precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob)\n", + "\n", + "# Compute the Area Under the Curve (AUC) for Precision-Recall\n", + "pr_auc = auc(recall, precision)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Plot precision-recall curve" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "plt.figure()\n", + "plt.plot(recall, precision, label=f'PR AUC = {pr_auc:.2f}')\n", + "plt.xlabel('Recall')\n", + "plt.ylabel('Precision')\n", + "plt.title('Precision-Recall Curve')\n", + "plt.legend(loc='best')\n", + "plt.grid(True)\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Plot precision and recall against thresholds" + ] + }, + { + "cell_type": 
"code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plt.figure()\n", + "plt.plot(thresholds, precision[:-1], label=\"Precision\")\n", + "plt.plot(thresholds, recall[:-1], label=\"Recall\")\n", + "plt.xlabel(\"Threshold\")\n", + "plt.ylabel(\"Score\")\n", + "plt.title(\"Precision-Recall Curve with Thresholds\")\n", + "plt.legend()\n", + "plt.grid()\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "# One can choose optimal threshold based on the F1 score" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Copyright and License\n", + "
\n", + "Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.\n", + "\n", + "
\n", + "\n", + " Licensed under the Apache License, Version 2.0 (the \"License\");\n", + " you may not use this file except in compliance with the License.\n", + " You may obtain a copy of the License at\n", + " \n", + " http://www.apache.org/licenses/LICENSE-2.0\n", + " \n", + " Unless required by applicable law or agreed to in writing, software\n", + " distributed under the License is distributed on an \"AS IS\" BASIS,\n", + " WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + " See the License for the specific language governing permissions and\n", + " limitations under the License." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "mamba_env", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.10" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/ai-credit-fraud-workflow/requirements.txt b/ai-credit-fraud-workflow/requirements.txt new file mode 100644 index 0000000..ab749d9 --- /dev/null +++ b/ai-credit-fraud-workflow/requirements.txt @@ -0,0 +1,2 @@ +matplotlib==3.9.2 +category-encoders==2.6.4